Sherif Sakr • Albert Y. Zomaya
Editors

Encyclopedia of Big Data Technologies

With 429 Figures and 54 Tables

Editors
Sherif Sakr
Institute of Computer Science
University of Tartu
Tartu, Estonia

Albert Y. Zomaya
School of Information Technologies
The University of Sydney
Sydney, Australia
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
In the field of computer science, data is considered the main raw material,
produced by abstracting the world into categories, measures, and other
representational forms (e.g., characters, numbers, relations, sounds, images,
electronic waves) that constitute the building blocks from which information
and knowledge are created. In practice, data generation and consumption
have become part of people's daily lives, especially with the pervasive
availability of Internet technology. We are progressively moving
toward being a data-driven society where data has become one of the most
valuable assets. Big data is commonly characterized by its defining
3V properties: huge Volume, consisting of terabytes or petabytes of data;
high Velocity, being created in or near real time; and wide Variety of form,
being structured or unstructured in nature.
Recently, research communities, enterprises, and government sectors have
all realized the enormous potential of big data analytics, and continuous
advancements have been emerging in this domain. This Encyclopedia answers
the need for a solid and comprehensive research source in the domain of Big
Data Technologies. The aim of this Encyclopedia is to provide a full picture
of various related aspects, topics, and technologies of big data including
big data enabling technologies, big data integration, big data storage and
indexing, data compression, big data programming models, big SQL systems,
big streaming systems, big semantic data processing, graph analytics, big
spatial data management, big data analysis, business process analytics, big
data processing on modern hardware systems, big data applications, big data
security and privacy, and benchmarking of big data systems.
With contributions from many leaders in the field, this Encyclopedia provides
the reader with comprehensive reading material for a broad range of audiences.
A main aim of the Encyclopedia is to encourage readers to think further and
investigate areas that are new to them. The Encyclopedia has been designed to
serve as a solid and comprehensive reference not only for expert researchers
and software engineers in the field but also for students and junior
researchers. The first edition of the Encyclopedia contains more than 250
entries covering a wide range of topics. The Encyclopedia's entries
will be updated regularly to follow the continuous advancement in the
domain and keep up-to-date coverage available for our readers. It will be
available in both print and online versions. We hope that our readers will find
this Encyclopedia a rich and valuable resource and a highly informative
reference for Big Data Technologies.
About the Editors
Professor Sherif Sakr is the Head of Data Systems Group at the Institute
of Computer Science, University of Tartu. He received his PhD degree in
Computer and Information Science from Konstanz University, Germany,
in 2007. He received his BSc and MSc degrees in Computer Science
from the Information Systems Department at the Faculty of Computers and
Information in Cairo University, Egypt, in 2000 and 2003, respectively.
During his career, Prof. Sakr has held appointments at several reputable
international organizations, including the University of New South Wales,
Macquarie University, Data61/CSIRO, Microsoft Research, Nokia Bell Labs,
and King Saud bin Abdulaziz University for Health Sciences.
Prof. Sakr's research interests lie in data and information management in
general, particularly in big data processing systems, big data analytics,
data science, and big data management in cloud computing platforms.
Professor Sakr has published more than 100 refereed research publications
in international journals and conferences such as the Proceedings of the
VLDB Endowment (PVLDB), IEEE Transactions on Parallel and Distributed
Systems (IEEE TPDS), IEEE Transactions on Service Computing (IEEE
TSC), IEEE Transactions on Big Data (IEEE TBD), ACM Computing Surveys
(ACM CSUR), Journal of Computer and System Sciences (JCSS), Informa-
tion Systems Journal, Cluster Computing, Journal of Grid Computing, IEEE
Communications Surveys and Tutorials (IEEE COMST), IEEE Software,
Scientometrics, VLDB, SIGMOD, ICDE, EDBT, WWW, CIKM, ISWC,
BPM, ER, ICWS, ICSOC, IEEE SCC, IEEE Cloud, TPCTC, DASFAA,
ICPE, and JCDL. Professor Sakr has co-authored five books and co-edited three
other books in the areas of data and information management and processing.
Sherif is an associate editor of the Cluster Computing journal and of
Transactions on Large-Scale Data and Knowledge-Centered Systems (TLDKS).
He is also an editorial board member of many reputable international journals.
Prof. Sakr is an ACM Senior Member and an IEEE Senior Member. In 2017, he has been
Dr. Zomaya has delivered more than 180 keynote addresses, invited
seminars, and media briefings and has been actively involved, in a variety
of capacities, in the organization of more than 700 conferences.
Dr. Zomaya is a Fellow of the IEEE, the American Association for the
Advancement of Science, and the Institution of Engineering and Technology
(UK). He is a Chartered Engineer and an IEEE Computer Society’s Golden
Core Member. He received the 1997 Edgeworth David Medal from the Royal
Society of New South Wales for outstanding contributions to Australian
science. Dr. Zomaya is the recipient of the IEEE Technical Committee on
Parallel Processing Outstanding Service Award (2011), the IEEE Technical
Committee on Scalable Computing Medal for Excellence in Scalable Com-
puting (2011), the IEEE Computer Society Technical Achievement Award
(2014), and the ACM MSWIM Reginald A. Fessenden Award (2017). His
research interests span several areas in parallel and distributed computing and
complex systems. More information can be found at http://www.it.usyd.edu.au/~azom0780/.
About the Section Editors
Marcos Dias de Assuncao Inria Avalon, LIP Laboratory, ENS Lyon, University
of Lyon, Lyon, France
List of Contributors
Nuno Preguiça NOVA LINCS and DI, FCT, Universidade NOVA de Lisboa,
Caparica, Portugal
Deepak Puthal Faculty of Engineering and Information Technologies,
School of Electrical and Data Engineering, University of Technology Sydney,
Ultimo, NSW, Australia
Jianzhong Qi The University of Melbourne, Melbourne, VIC, Australia
Yongrui Qin School of Computing and Engineering, University of Huddersfield,
Huddersfield, UK
Christoph Quix Fraunhofer-Institute for Applied Information Technology
FIT, Sankt Augustin, Germany
Do Le Quoc TU Dresden, Dresden, Germany
Francois Raab InfoSizing, Manitou Springs, CO, USA
Tilmann Rabl Database Systems and Information Management Group,
Technische Universität Berlin, Berlin, Germany
Muhammad Rafi School of Computer Science, National University of
Computer and Emerging Sciences, Karachi, Pakistan
Mohammad Saiedur Rahaman School of Science (Computer Science and
IT), RMIT University, Melbourne, VIC, Australia
Erhard Rahm University of Leipzig, Leipzig, Germany
Vineeth Rakesh Data Mining and Machine Learning Lab, School of
Computing, Informatics, and Decision Systems Engineering, Arizona State
University, Tempe, AZ, USA
Vijayshankar Raman IBM Research – Almaden, San Jose, CA, USA
Rajiv Ranjan School of Computing Science, Newcastle University,
Newcastle upon Tyne, UK
Niranjan K. Ray Kalinga Institute of Industrial Technology (KIIT),
Deemed to be University, Bhubaneswar, India
Manuel Resinas University of Seville, Sevilla, Spain
Juan Reutter Pontificia Universidad Catolica de Chile, Santiago, Chile
Christian Riegger Data Management Lab, Reutlingen University,
Reutlingen, Germany
Stefanie Rinderle-Ma Faculty of Computer Science, University of Vienna,
Vienna, Austria
Norbert Ritter Department of Informatics, University of Hamburg,
Hamburg, Germany
Nicolò Rivetti Faculty of Industrial Engineering and Management,
Technion – Israel Institute of Technology, Haifa, Israel
Achieving Low Latency Transactions for Geo-replicated Storage with Blotter

2008), which provides better performance but weaker semantics. This trend also motivated the development of Spanner, which provides general serializable transactions (Corbett et al. 2012) and sparked other recent efforts in the area of strongly consistent geo-replication (Sovran et al. 2011; Saeida Ardekani et al. 2013a; Lloyd et al. 2013; Zhang et al. 2013; Kraska et al. 2013; Mahmoud et al. 2013).

In this chapter, we present Blotter, a transactional geo-replicated storage system, whose goal is to cut the latency penalty for ACID transactions in geo-replicated systems, by leveraging a recent isolation proposal called Non-Monotonic Snapshot Isolation (NMSI) (Saeida Ardekani et al. 2013a). We focus on the Blotter algorithms and discuss how they achieve: (1) at most one round-trip across data centers (assuming a fault-free run and that clients are proxies in the same data center as one of the replicas) and (2) read operations that are always served by the local data center.

To achieve these goals, Blotter combines a novel concurrency control algorithm that executes at the data center level with a carefully configured Paxos-based (Lamport 1998) replicated state machine that replicates the execution of the concurrency control algorithm across data centers. Both of these components exploit several characteristics of NMSI to reduce the amount of coordination between replicas. In particular, the concurrency control algorithm leverages the fact that NMSI does not require a total order on the start and commit times of transactions. Such an ordering would require either synchronized clocks, which are difficult to implement, even using expensive hardware (Corbett et al. 2012), or synchronization between replicas that do not hold the objects accessed by a transaction (Saeida Ardekani et al. 2013b), which hinders scalability. In addition, NMSI allows us to use separate (concurrent) Paxos-based state machines for different objects, on which we geo-replicate the commit operation of the concurrency control protocol.

Compared to a previously proposed NMSI system called Jessy (Saeida Ardekani et al. 2013a), instead of assuming partial replication, we target full replication, which is a common deployment scenario (Baker et al. 2011; Shute et al. 2013; Bronson et al. 2013). Our layering of Paxos on top of a concurrency control algorithm is akin to the Replicated Commit system, which layers Paxos on top of two-phase locking (Mahmoud et al. 2013). However, by leveraging NMSI, we execute reads exclusively locally and run parallel instances of Paxos for different objects, instead of having a single instance per shard. Furthermore, when a client is either colocated with the Paxos leader or when that leader is in the closest data center to the client, Blotter can commit transactions within a single round-trip to the closest data center.

We have implemented Blotter as an extension to the well-known geo-replicated storage system Cassandra (Lakshman and Malik 2010). Our evaluation shows that, despite adding a small overhead in a single data center, Blotter performs much better than both Jessy and the protocols used by Spanner and also outperforms in many metrics a replication protocol that ensures snapshot isolation (Elnikety et al. 2005). This shows that Blotter can be a valid choice when several replicas are separated by high-latency links, performance is critical, and the semantic differences between NMSI and snapshot isolation are acceptable to the application.

Key Research Findings

The research conducted to design and implement Blotter led to the following key contributions and findings:

1. A simple yet precise definition of non-monotonic snapshot isolation, a recent isolation model that aims at maximizing parallel execution of transactions in geo-replicated systems.
2. The design of Blotter, which combines a novel concurrency control protocol for a single data center and its extension toward an arbitrary number of data centers by leveraging a carefully designed state machine replication
solution leveraging a variant of the Paxos algorithm.
3. The demonstration that it is possible to achieve strong consistency in geo-replicated storage systems within a modest latency envelope in fault-free runs.

Non-monotonic Snapshot Isolation

We start by providing a specification of our target isolation level, NMSI, and discussing the advantages and drawbacks of this choice.

Snapshot Isolation Revisited
NMSI is an evolution of snapshot isolation (SI). Under SI, a transaction (logically) executes in a database snapshot taken at the transaction begin time, reflecting the writes of all transactions that committed before that instant. Reads and writes execute against this snapshot and, at commit time, a transaction can commit if there are no write-write conflicts with concurrent transactions. (In this context, two transactions are concurrent if the intervals between their begin and commit times overlap.)

Snapshot isolation exhibits the write-skew anomaly, where two concurrent transactions T1 and T2 start by reading x0 and y0, respectively, and later write y1 and x1, as exemplified in the following figure. This execution is admissible under snapshot isolation, as there is no write-write conflict, and each transaction has read from a database snapshot. However, this execution is not serializable.

NMSI relaxes SI in two ways. First, it allows the database state to fork: after two concurrent transactions ta and tb commit, two subsequent transactions may observe divergent states, where one sees the effects of ta but not tb, and the other sees the effects of tb but not ta. The next figure exemplifies an execution that is admissible under NMSI but not under SI, since under SI both T3 and T4 would see the effects of both T1 and T2, because they started after the commit of T1 and T2.

Second, instead of forcing the snapshot to reflect a subset of the transactions that committed at the transaction begin time, NMSI gives the implementation the flexibility to reflect a more convenient set of transactions in the snapshot, possibly including transactions that committed after the transaction began. This property, also enabled by serializability, is called forward freshness (Saeida Ardekani et al. 2013a).

Definition 1 (Non-monotonic Snapshot Isolation (NMSI)) An implementation of a transactional system obeys NMSI if, for any trace of the system execution, there exists a partial order ≺ among transactions that obeys the following rules, for any pair of transactions ti and tj in the trace:

1. if tj reads a value for object x written by ti, then ti ≺ tj ∧ ∄tk writing to x : ti ≺ tk ≺ tj
2. if ti and tj write to the same object x, then either ti ≺ tj or tj ≺ ti.

The example in Fig. 1 obeys NMSI but not SI, as the depicted partial order meets Definition 1.
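Definition 1 can be made concrete with a small checker. The sketch below is illustrative only (the trace encoding and all function names are ours, not part of the entry): given a candidate partial order, it verifies both rules, confirming that a long-fork execution in the style of Fig. 1 is admissible under NMSI.

```python
from itertools import product

def obeys_nmsi(txns, reads, writes, order):
    """Check Definition 1 for a given candidate partial order.

    txns   : iterable of transaction ids
    reads  : set of (tj, x, ti) -- tj read object x from writer ti
    writes : dict mapping each object to its set of writers
    order  : set of (a, b) pairs meaning a ≺ b (assumed transitively closed)
    """
    prec = lambda a, b: (a, b) in order
    # Rule 1: a read must come from the latest preceding writer of the object.
    for tj, x, ti in reads:
        if not prec(ti, tj):
            return False
        if any(prec(ti, tk) and prec(tk, tj) for tk in writes.get(x, ())):
            return False
    # Rule 2: any two writers of the same object must be ordered.
    for x, ws in writes.items():
        for a, b in product(ws, ws):
            if a != b and not prec(a, b) and not prec(b, a):
                return False
    return True

# Long-fork execution (cf. Fig. 1): T1 writes x, T2 writes y,
# T3 reads x (sees T1 but not T2), T4 reads y (sees T2 but not T1).
writes = {"x": {"T1"}, "y": {"T2"}}
reads = {("T3", "x", "T1"), ("T4", "y", "T2")}
order = {("T1", "T3"), ("T2", "T4")}  # T1 and T2 deliberately left unordered
print(obeys_nmsi(["T1", "T2", "T3", "T4"], reads, writes, order))  # True
```

Because T1 and T2 need not be ordered, the fork is legal under NMSI; under SI, both T3 and T4 would have to observe both commits.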
The Benefits of NMSI in Replicated Settings
NMSI weakens the specification of SI in a way that can be leveraged for improving performance in replicated settings.

The possibility of having "long forks" allows, in a replicated setting, for a single (local) replica to make a decision concerning what data the snapshot should read. This is because it is not necessary to enforce a serialization between all transaction begin and commit operations, although it is still necessary to check for write-write conflicts.

In the case of "forward freshness," this allows a transaction to read (in most cases) the most recent version of a data object at a given replica, independently of the instant when the transaction began. This not only avoids the bookkeeping associated with tracking transaction start times but also avoids a conflict with transactions that might have committed after the transaction began.

Impact of NMSI for Applications
We analyze in turn the impact of "forward freshness" and "long forks" for application developers. Forward freshness allows a transaction ta to observe the effects of another transaction tb that committed after ta began (in real time). In this case, the programmer must decide whether this is a violation of the intended application semantics, which is analogous to deciding whether serializability or strict serializability is the most adequate isolation level for a given application.

Long forks allow two transactions to execute against different branches of a forked database state, provided there are no write-write conflicts. In practice, the main implication of this fact is that the updates made by users may not become instantly visible across all replicas. For example, this could cause two users of a social network to each think that they were the first to post a new promotion on their own wall, since they do not see each other's posts immediately (Sovran et al. 2011). Again, the programmer must reason about whether this is admissible. In this case, a mitigating factor is that this anomaly does not cause the consistency of the database to break. (This is in contrast with the "write-skew" anomaly, which is present in both SI and NMSI.) Furthermore, in the particular case of our protocol, the occurrence of anomalies is very rare: for a "long fork" to occur, two transactions must commit in two different data centers, form a quorum with a third data center, and both complete before hearing from the other.

Finally, NMSI allows consecutive transactions from the same client to observe a state that reflects a set of transactions that does not grow monotonically (when consecutive transactions switch between two different branches of a long fork). However, in our algorithms, this is an unlikely occurrence, since it requires that a client connect to different data centers in a very short time span.

Blotter Protocols

We now describe the Blotter protocols, introducing first the concurrency control algorithm in a single data center scenario and then how this protocol can be replicated to multiple data centers. A more detailed explanation of the protocols, including their evaluation, can be found elsewhere (Moniz et al. 2017).

System Model
Blotter is designed to run on top of any distributed storage system with nodes spread across one or multiple data centers. We assume that each data object is replicated at all data centers. Within each data center, data objects are replicated and partitioned across several nodes. We make no restrictions on how this intra-data center replication and partitioning takes place. We assume that nodes may fail by crashing and recover from such faults. When a node crashes, it loses its volatile state, but all data that was written to stable storage is accessible after recovery.

We use an asynchronous system model, i.e., we do not assume any known bounds on computation and communication delays. We do not prescribe a fixed bound on the number of faulty nodes within each data center – this depends on the intra-data center replication protocol used.

Blotter Architecture
The client library of Blotter exposes an API with the expected operations: begin a new
transaction, read an object given its identifier, write an object given its identifier and new value, and commit a transaction, which either returns commit or abort.

The set of protocols that comprise Blotter is organized into three different components:

Blotter intra-data center replication. At the lowest level, we run an intra-data center replication protocol to mask the unreliability of individual machines within each data center. This level must provide the protocols above it with the vision of a single logical copy (per data center) of each data object and associated metadata, which remains available despite individual node crashes. We do not prescribe a specific protocol for this layer, since any of the existing protocols that meet this specification can be used.

Blotter concurrency control. These are the protocols that ensure transaction atomicity and NMSI isolation in a single data center and, at the same time, are extensible to multiple data centers by serializing a single protocol step.

Inter-data center replication. This completes the protocol stack by replicating a subset of the steps of the concurrency control protocol across data centers. It implements state machine replication (Schneider 1990; Lamport 1978) by judiciously applying Paxos (Lamport 1998) to the concurrency control protocol to avoid unnecessary coordination across data centers.

Single Data Center Protocol
The single data center concurrency control module consists of the following three components: the client library and the transaction managers (TM), which are non-replicated components that act as a front end providing the system interface and implementing the client side of the transaction processing protocol, respectively; and the data managers (DM), which are the replicated components that manage the information associated with data objects.

Client Library. This provides the interface of Blotter, namely, begin, read, write, and commit. The begin and write operations are local to the client. Read operations are relayed to the TM, which returns the values and metadata for the objects that were read. The written values are buffered by the client library and only sent to the TM at commit time, together with the accumulated metadata for the objects that were read. This metadata is used to set the versions that running transactions must access, as explained next.

Transaction Manager (TM). The TM handles the two operations received from the clients: read and commit. For reads, it merely relays the request and reply to or from the data manager (DM) responsible for the object being read. Upon receiving a commit request, the TM acts as the coordinator of a two-phase commit (2PC) protocol to enforce the all-or-nothing atomicity property. The first phase sends a dm-prewrite-request, with the newly written values, to all DMs storing written objects. Each DM verifies whether the write complies with NMSI. If none of the DMs identifies a violation, the TM sends the DMs a dm-write message containing the metadata with snapshot information aggregated from all replies of the first phase; otherwise, it sends a dm-abort.

Data Manager (DM). The core of the concurrency control logic is implemented by the DM, which maintains the data and metadata necessary to provide NMSI. Algorithm 1 presents the handlers for the three types of requests. Next, we explain how these functions enforce the NMSI rules presented in our definition.

Partial order ≺. We use a multi-version protocol, i.e., the system maintains a list of versions for each object. This list is indexed by an integer version number, which is incremented every time a new version of the object is created (e.g., for a given object x, overwriting x0 creates version x1, and so on). In a multi-versioned storage, the ≺ relation can be defined by the version numbers that transactions access, namely, if ti writes xm and tj writes xn, then ti ≺ tj ⇔ m < n; and if ti writes xm and tj reads xn, then ti ≺ tj ⇔ m ≤ n.
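The version-number characterization of ≺ can be sketched directly. The fragment below is our own illustration (class and function names are not from the entry): each object keeps an indexed version list, and the order between two transactions follows from the version numbers they wrote or read.

```python
class VersionedObject:
    """Multi-version record: version i holds the value written by the
    i-th writer of this object (x0, x1, ... in the text)."""
    def __init__(self, initial):
        self.versions = [initial]   # index = version number
        self.writer = [None]        # transaction that wrote each version

    def write(self, txn, value):
        self.versions.append(value)
        self.writer.append(txn)
        return len(self.versions) - 1   # new version number

def precedes_by_write(m, n):
    # ti writes xm, tj writes xn:  ti ≺ tj  ⇔  m < n
    return m < n

def precedes_by_read(m, n):
    # ti writes xm, tj reads xn:  ti ≺ tj  ⇔  m ≤ n
    return m <= n

obj = VersionedObject("a")
v1 = obj.write("ti", "b")          # ti creates version 1
v2 = obj.write("tj", "c")          # tj creates version 2
print(precedes_by_write(v1, v2))   # True: ti ≺ tj
print(precedes_by_read(v2, v1))    # False: tj read an older version than ti wrote
```

Note that no clock is consulted: the order is derived purely from per-object version counters, which is what lets Blotter avoid synchronized clocks.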
Algorithm 1

// write operation
8  upon ⟨dm-write, T, x, agg-started-before⟩ from TM do
9      for each T′ in agg-started-before do
10         if T′ not in x.snapshot then
11             x.snapshot[T′] ← x.last;
12     x.last ← x.last + 1;
13     x.value[x.last] ← x.nextvalue;
14     finishWrite⟨T, x, TM⟩;

// abort operation
15 upon ⟨dm-abort, T, x⟩ from TM do
16     finishWrite⟨T, x, TM⟩;

// process read operation
17 processRead⟨T, x, TM⟩
18     if T ∉ x.snapshot then
19         if x.prewrite ≠ ⊥ then
20             x.buffered ← x.buffered ∪ {(T, TM)};
21             return
22         else
23             x.snapshot[T] ← x.last;
24     version ← x.snapshot[T];
25     value ← x.value[version];
26     send ⟨read-response, T, value, {T′ | x.snapshot[T′] < version}⟩ to TM;

// process dm-prewrite request
27 processPrewrite⟨T, x, value, TM⟩
28     if x.snapshot[T] ≠ ⊥ ∧ x.snapshot[T] < x.last then
           // there is a write-write conflict
29         send ⟨prewrite-response, reject, ⊥⟩ to TM;
30     else
31         x.prewrite ← T;
32         x.nextvalue ← value;
33         send ⟨prewrite-response, accept, {T′ | T′ ∈ x.snapshot}⟩ to TM;

// clean prewrite information and serve buffered reads and pending prewrites
34 finishWrite⟨T, x, TM⟩
35     if x.prewrite = T then
36         x.nextvalue ← ⊥; x.prewrite ← ⊥;
37         for each (T, TM) in x.buffered do
38             processRead⟨T, x, TM⟩;
39         if x.pending ≠ ⊥ then
40             (T, x, value, TM) ← removeFirst(x.pending);
41             processPrewrite⟨T, x, value, TM⟩;
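For readers who prefer an executable form, the following is our own illustrative Python transcription of Algorithm 1 for a single object (not code from the entry): messaging is replaced by return values, and the TM/2PC machinery and the pending-prewrite queue of lines 39-41 are omitted.

```python
class DataManager:
    """Per-object state of Algorithm 1 (a single object x)."""
    def __init__(self, initial):
        self.value = {0: initial}   # version number -> value
        self.last = 0               # most recent version number
        self.snapshot = {}          # txn -> version it must read / has read
        self.prewrite = None        # txn with an outstanding prewrite, or None
        self.nextvalue = None
        self.buffered = []          # reads blocked by an outstanding prewrite

    # process read operation (lines 17-26)
    def process_read(self, T):
        if T not in self.snapshot:
            if self.prewrite is not None:
                self.buffered.append(T)   # served after the prewrite finishes
                return None
            self.snapshot[T] = self.last
        version = self.snapshot[T]
        started_before = {t for t, v in self.snapshot.items() if v < version}
        return self.value[version], started_before

    # process dm-prewrite request (lines 27-33)
    def process_prewrite(self, T, value):
        if self.snapshot.get(T) is not None and self.snapshot[T] < self.last:
            return ("reject", None)       # write-write conflict
        self.prewrite, self.nextvalue = T, value
        return ("accept", set(self.snapshot))

    # dm-write (lines 8-14): install the buffered value as a new version
    def dm_write(self, T, agg_started_before):
        for t in agg_started_before:
            if t not in self.snapshot:
                self.snapshot[t] = self.last   # pin t to the old version
        self.last += 1
        self.value[self.last] = self.nextvalue
        self.finish_write(T)

    # lines 34-38; the abort path (dm-abort) reuses this handler
    def finish_write(self, T):
        if self.prewrite == T:
            self.nextvalue = self.prewrite = None
            for t in self.buffered:
                self.process_read(t)   # responses would be sent to the TM here
            self.buffered = []

dm = DataManager("x0")
print(dm.process_read("T1"))           # ('x0', set()) -- T1 reads version 0
status, snap = dm.process_prewrite("T2", "x1")
dm.dm_write("T2", snap)                # T2 commits version 1
print(dm.process_read("T1"))           # still ('x0', set()): T1 stays pinned to x0
print(dm.process_read("T3"))           # ('x1', {'T1'}): T3 reads the new version
```

The final read shows the snapshot dictionary at work: T1 keeps reading the version it first observed, while T3, which started after T2's commit, reads the new one and learns that T1 started before it.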
NMSI rule number 1. Rule number 1 of the definition of NMSI says that, for object x, transaction t must read the value written by the "latest" transaction that updated x (according to ≺). To illustrate this, consider the example run in Fig. 2. When a transaction T1 issues its first read operation, it can read the most recently committed version of the object, say xi written by T0 (leading to T0 ≺ T1). If, subsequently, some other transaction T2 writes xi+1 (T0 ≺ T2), then the protocol must prevent T1 from either reading or overwriting the values written by T2. Otherwise, we would have T0 ≺ T2 ≺ T1, and T1 should have read the value for object x written by T2 (i.e., xi+1) instead of that written by T0 (i.e., xi). Next, we detail how this is achieved first for reads, then for writes, and then how the rule is enforced transitively.
Reading the latest preceding version. The key to enforcing this requirement is to maintain state associated with each object, stating the version a running transaction must read, in case such a restriction exists. In the previous example, if T2 writes xi+1, this state records that T1 must read xi.

To achieve this, our algorithm maintains a per-object dictionary data structure (x.snapshot), mapping the identifier of a transaction t to a particular version of x that t either must read or has read from. Figure 2 depicts the changes to this data structure in the shaded boxes at the bottom of the figure. When the DM processes a read of x for t (function processRead), if the dictionary has no information for t, the most recent version is read, and this information is stored in the dictionary. Otherwise, the specified version is returned.

In the previous example, T1 must record the version it reads in the x.snapshot variable. Subsequently, when the commit of T2 overwrites that version of x, we are establishing that T2 ⊀ T1. As such, if T2 writes to another object y, creating yj+1, then it must also force T1 to read the preceding version yj. To do this, when transaction T2 commits, for every transaction t that read (or must read) an older version of object x (i.e., the transactions with entries in the dictionary of x), the protocol will store in the dictionary of every other object y written by T2 that t must read the previous version of y, unless an even older version is already prescribed (dm-write handler). In this particular example, y.snapshot would record that T1 and T4 must read version yj, since, at commit time, x.snapshot indicates that these transactions read xi.

Preventing illegal overwrites. In the previous example, we must also guarantee that T1 does not overwrite any value written by T2. To enforce this, it suffices to verify, at the time of the commit of transaction t, for every object written by t, whether t should read its most recent version. If this is the case, then the transaction can commit, since no version will be incorrectly overwritten; otherwise, it must abort (function processPrewrite). In the example, T4 aborts, since y.snapshot records that T4 must read yj and a more recent version exists (yj+1). Allowing T4 to commit and overwrite yj+1 would lead to T2 ≺ T4. This breaks rule number 1 of NMSI, since it would have required T1 to read xi+1 written by T2, which did not occur.

Applying the rules transitively. Finally, for enforcing rule number 1 of the definition of NMSI in a transitive manner, it is also necessary to guarantee the following: if T2 writes xi+1 and yj+1, and subsequently another transaction T3 reads yj+1 and writes wk, then the protocol must also prevent T1 from reading or overwriting the values written by T3; otherwise, we would have T0 ≺ T2 ≺ T3 ≺ T1, and thus T1 should also have read xi+1.

To achieve this, when transaction T3 (which read yj+1) commits, for every transaction t that must read version yl with l < j+1 (i.e., the transactions that had entries in the dictionary of y when T3 read y), the protocol will store in the dictionary of every other object w written by T3 that t must read the previous version of w (if an older version is not already specified). In the example, since the state of y.snapshot after T2 commits specifies that T1 must read version yj, then, when T3 commits, w.snapshot is updated
to state that T1 and T4 must read version wk−1. This is implemented by collecting dependencies in the read and prewrite functions and using this information to update the snapshot information for all objects in the write function.

NMSI rule number 2. Rule number 2 of the NMSI definition says that any pair of transactions that write the same object x must have a relative order, i.e., either ti ≺ tj or tj ≺ ti. This order is defined by the version number of x created by each transaction.

Therefore, it remains to ensure that this is a partial order (i.e., no cycles). A cycle could appear if two or more transactions concurrently committed a chain of objects in a different order, e.g., if tm wrote both xi and yj+1 and tn wrote both xi+1 and yj. To prevent this, it suffices to use a two-phase commit protocol where, for each object, a single accepted prepare can be outstanding at any time.

Geo-Replication
Blotter implements geo-replication, with each object replicated in all data centers, using Paxos-based state machine replication (Lamport 1998; Schneider 1990). In this model, all replicas execute a set of client-issued commands according to a total order, thus following the same sequence of states and producing the same sequence of responses. We view each data center as a state machine replica. The state is composed of the database (i.e., all data objects and associated metadata), and the state machine commands are the tm-read and the tm-commit of the TM-DM interface.

Despite being correct, this approach has three drawbacks, which we address in detail in a separate paper (Moniz et al. 2017).

First, read operations in our concurrency control protocol are state machine commands that mutate the state, thus requiring an expensive consensus round. To address this, we leverage the fact that the NMSI properties allow for removing this information from the state machine, since the modified state only needs to be used locally.

Second, the total order of the state machine precludes the concurrent execution of two commits, even for transactions that do not conflict. In this case, we leverage the fact that the partial order required by the NMSI definition can be built by serializing the dm-prewrite operation on a per-object basis, instead of serializing tm-commits across all objects. As such, instead of having one large state machine whose state is defined by the entire database, we can have one state machine per object, with the state being the object (including its metadata), and supporting only the dm-prewrite operation.

Finally, each Paxos-based state machine command requires several cross-data center message delays (depending on the variant of Paxos used). We adjusted the configuration of Paxos to reduce the cross-data center steps to a single round-trip (from the client of the protocol, i.e., the TM) for update transactions, by using Multi-Paxos (Lamport 1998) and configuring Paxos to tolerate only one unplanned outage of a data center. In fact, this assumption is common in existing deployed systems (Ananthanarayanan et al. 2013; Corbett et al. 2012). (Planned outages are handled by reconfiguring the Paxos membership; Lamport et al. 2010.)

Examples of Applications

Blotter can be leveraged to design multiple large-scale geo-distributed applications, in particular web applications whose semantics are compatible with the guarantees (and anomalies) associated with NMSI. In particular, we have explored how to implement a set of the main operations of a microblogging platform, such as Twitter.

In our implementation of a simplistic version of Twitter, we model the timeline (sometimes called the wall in the context of social network applications) as a data object in the (geo-replicated) data storage service. Furthermore, we associate with each user a set of followers and a set of followees. Each of these sets is modeled as an independent data object in the data storage.

We have supported three different user interactions. We selected these operations by relying on the model presented in Zhang et al. (2013):
• Post-tweet appends a tweet to the timeline of a user and its followers, which results in a transaction with many reads and writes, since one has to read the followers of the user posting the tweet and then write to the timeline of both the user and each corresponding follower.
• Follow-user appends new information in the set of followers to the profiles of the follower and the followee, which results in a transaction with two reads and two writes.
• Read-timeline reads the wall of the user issuing the operation, resulting in a single read operation.

In this case, the long fork anomaly can only result in a user observing (slightly) stale data, and perhaps even updating its own state based on that data, e.g., a user that decides to follow another can miss a few entries of that new user in his own timeline. Note that reading stale data can be addressed by waiting for the information about transactions to be propagated across all data centers in the deployment.

We also implemented the RUBiS benchmark, which models an auction site similar to eBay, on top of Blotter. We ported the benchmark from using a relational database as the storage back end to using a key-value store. Each row of the relational database is stored with a key formed by the name of the table and the value of the primary key. We additionally store data for supporting efficient queries (namely, indexes and foreign keys).

In this benchmark, users issue operations such as selling, browsing, bidding, or buying items and consulting a personal profile that lists outstanding auctions and bids. All of these operations were implemented as Blotter transactions. In this application, the long fork anomaly that is allowed by NMSI has a few implications. In particular, when browsing the items that are available to be sold (or in auction), users can observe stale data, in the form of missing a few objects that are relevant for their queries.

Surprisingly, since bidding and buying an item involves writing to a single object in the data store in our adaptation, the long fork anomaly does not affect these semantics, since all transactions that write on this object are totally ordered by the Paxos leader that is responsible for mediating the access to it.

The experimental results that evaluate the performance of these implementations can be found in a separate paper (Moniz et al. 2017).

Future Directions of Research

In this chapter, we have presented the design of Blotter, a novel system architecture and set of protocols that exploits the benefits of NMSI to improve the performance of geo-replicated transactional systems.

Blotter demonstrates the practical benefits that can be achieved by relaxing the well-known and widely used SI model. To better frame the applications that can benefit from the Blotter design, we have presented a simple yet precise specification of NMSI and discussed how this specification differs in practice from the SI specification.

An interesting direction for future research in this context is to explore automatic mechanisms to check if an existing application designed for the SI model could safely be adapted to operate under NMSI. This could be achieved by exploiting application state invariants, and modeling transactions through their preconditions and postconditions.

Acknowledgements Computing resources for this work were provided by an AWS in Education Research Grant. The research of R. Rodrigues is funded by the European Research Council (ERC-2012-StG-307732) and by FCT (UID/CEC/50021/2013). This work was partially supported by NOVA LINCS (UID/CEC/04516/2013) and the EU H2020 LightKone project (732505). This chapter is derived from Moniz et al. (2017).

References

Ananthanarayanan R, Basker V, Das S, Gupta A, Jiang H, Qiu T, Reznichenko A, Ryabkov D, Singh M, Venkataraman S (2013) Photon: fault-tolerant and scalable joining of continuous data streams. In: SIGMOD'13: proceedings of the 2013 international conference on management of data, pp 577–588
Baker J, Bond C, Corbett JC, Furman J, Khorlin A, Larson J, Leon JM, Li Y, Lloyd A, Yushprakh V (2011) Megastore: providing scalable, highly available storage for interactive services. In: Proceedings of the conference on innovative data system research (CIDR), pp 223–234. http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf
Bronson N et al (2013) TAO: Facebook's distributed data store for the social graph. In: Proceedings of the 2013 USENIX annual technical conference, pp 49–60
Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst 26(2):4:1–4:26. http://doi.acm.org/10.1145/1365815.1365816
Corbett JC et al (2012) Spanner: Google's globally-distributed database. In: Proceedings of the 10th USENIX conference on operating systems design and implementation, OSDI'12, pp 251–264. http://dl.acm.org/citation.cfm?id=2387880.2387905
DeCandia G et al (2007) Dynamo: Amazon's highly available key-value store. In: Proceedings of the 21st ACM symposium on operating systems principles, pp 205–220. http://doi.acm.org/10.1145/1294261.1294281
Elnikety S, Zwaenepoel W, Pedone F (2005) Database replication using generalized snapshot isolation. In: Proceedings of the 24th IEEE symposium on reliable distributed systems, SRDS'05. IEEE Computer Society, Washington, DC, pp 73–84. https://doi.org/10.1109/RELDIS.2005.14
Hoff T (2009) Latency is everywhere and it costs you sales – how to crush it. Post at the High Scalability blog. http://tinyurl.com/5g8mp2
Kraska T, Pang G, Franklin MJ, Madden S, Fekete A (2013) MDCC: multi-data center consistency. In: Proceedings of the 8th ACM European conference on computer systems, EuroSys'13, pp 113–126. http://doi.acm.org/10.1145/2465351.2465363
Lakshman A, Malik P (2010) Cassandra: a decentralized structured storage system. SIGOPS Oper Syst Rev 44(2):35–40. http://doi.acm.org/10.1145/1773912.1773922
Lamport L (1978) Time, clocks, and the ordering of events in a distributed system. Commun ACM 21(7):558–565. http://doi.acm.org/10.1145/359545.359563
Lamport L (1998) The part-time parliament. ACM Trans Comput Syst 16(2):133–169. http://doi.acm.org/10.1145/279227.279229
Lamport L, Malkhi D, Zhou L (2010) Reconfiguring a state machine. ACM SIGACT News 41(1):63–73
Lloyd W, Freedman MJ, Kaminsky M, Andersen DG (2013) Stronger semantics for low-latency geo-replicated storage. In: Proceedings of the 10th USENIX conference on networked systems design and implementation, NSDI'13, pp 313–328. http://dl.acm.org/citation.cfm?id=2482626.2482657
Mahmoud H, Nawab F, Pucher A, Agrawal D, El Abbadi A (2013) Low-latency multi-datacenter databases using replicated commit. Proc VLDB Endow 6(9):661–672. http://dl.acm.org/citation.cfm?id=2536360.2536366
Moniz H, Leitão J, Dias RJ, Gehrke J, Preguiça N, Rodrigues R (2017) Blotter: low latency transactions for geo-replicated storage. In: Proceedings of the 26th international conference on World Wide Web, International World Wide Web Conferences Steering Committee, WWW '17, Perth, pp 263–272. https://doi.org/10.1145/3038912.3052603
Saeida Ardekani M, Sutra P, Shapiro M (2013a) Non-monotonic snapshot isolation: scalable and strong consistency for geo-replicated transactional systems. In: Proceedings of the 32nd IEEE symposium on reliable distributed systems (SRDS 2013), pp 163–172. https://doi.org/10.1109/SRDS.2013.25
Saeida Ardekani M, Sutra P, Shapiro M, Preguiça N (2013b) On the scalability of snapshot isolation. In: Euro-Par 2013 parallel processing. LNCS, vol 8097. Springer, pp 369–381. https://doi.org/10.1007/978-3-642-40047-6_39
Schneider FB (1990) Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput Surv 22(4):299–319. http://doi.acm.org/10.1145/98163.98167
Shute J, Vingralek R, Samwel B, Handy B, Whipkey C, Rollins E, Oancea M, Littlefield K, Menestrina D, Ellner S, Cieslewicz J, Rae I, Stancescu T, Apte H (2013) F1: a distributed SQL database that scales. Proc VLDB Endow 6(11):1068–1079. https://doi.org/10.14778/2536222.2536232
Sovran Y, Power R, Aguilera MK, Li J (2011) Transactional storage for geo-replicated systems. In: Proceedings of the 23rd ACM symposium on operating systems principles, SOSP'11, pp 385–400. http://doi.acm.org/10.1145/2043556.2043592
Zhang Y, Power R, Zhou S, Sovran Y, Aguilera M, Li J (2013) Transaction chains: achieving serializability with low latency in geo-distributed storage systems. In: Proceedings of the 24th ACM symposium on operating systems principles, SOSP, pp 276–291. http://doi.acm.org/10.1145/2517349.2522729

ACID Properties in MMOGs

▶Transactions in Massively Multiplayer Online Games

Active Disk

▶Active Storage
processing elements, which limits performance and scalability and worsens resource and energy efficiency.

Nowadays, three important technological developments open an opportunity to counter these drawbacks. Firstly, hardware manufacturers are able to fabricate combinations of storage and compute elements at reasonable costs and package them within the same device. Secondly, this trend covers virtually all levels of the memory hierarchy: (a) CPU and caches, (b) memory and compute, (c) storage and compute, (d) accelerators – specialized CPUs and storage, and eventually (e) network and compute. Thirdly, as magnetic/mechanical storage is being replaced with semiconductor nonvolatile technologies (Flash, Non-Volatile Memories – NVM), another key trend emerges: the device-internal bandwidth, parallelism, and access latencies are significantly better than the external ones (device-to-host). This is due to various reasons: interfaces, interconnect, physical design, and architectures.

Active Storage is a concept that targets the execution of data processing operations (fully or partially) in situ: within the compute elements on the respective level of the storage hierarchy, close to where data is physically stored, transferring the results back without moving the raw data. The underlying assumptions are: (a) the result size is much smaller than the raw data, hence less frequent and smaller-sized data transfers; or (b) the in situ computation is faster and more efficient than on the host, thus higher performance and scalability or better efficiency. Related concepts are in situ processing, In-Storage Processing, smart storage, and Near-Data Processing (Balasubramonian et al. 2014). The Active Storage paradigms have profound impact on multiple aspects:

1. Interfaces: hardware and software interfaces need to be extended, and new abstractions need to be introduced. This includes device and storage interfaces and operating systems and I/O abstractions: operations and conditions, records/objects vs. blocks, atomic primitives, and transactional support.
2. Heterogeneity in terms of storage and computer hardware interfaces is to be addressed.
3. Toolchain: extensive tool support is necessary to utilize Active Storage: compilers, hardware generators, debugging and monitoring tools, and advisors.
4. Placement of data and computation across the hierarchy is crucial to efficiency.
5. Workload adaptivity is a major goal, as static assignments lower the placement and collocation effects.

These challenges already attract research focus, as today's accelerators exhibit Active Storage-like characteristics in a simplistic manner, i.e., GPUs and FPGAs are defined by a considerably higher level of internal parallelism and bandwidth in contrast to their connection to the host system. Specialized computation already uses hardware programmable with high-level synthesis toolchains like TaPaSCo (Korinth et al. 2015) and co-location with storage elements and often necessitates shared virtual memory. Classical research questions, e.g., about dynamic workload distribution, data dependence, and flexible data placement, are approached by Fan et al. (2016), Hsieh et al. (2016), and Chen and Chen (2012), respectively. The significant potential arising with Near-Data Processing is investigated under perfect conditions by Kotra et al. (2017), stating performance boosts of about 75%.

Key Research Findings

Storage
The concept of Active Storage is not new. Historically it is deeply rooted in the concept of database machines (DeWitt and Gray 1992; Boral and DeWitt 1983) developed in the 1970s and 1980s. Boral and DeWitt (1983) discuss approaches such as processor-per-track or processor-per-head as an early attempt to combine storage and simple computing elements to accelerate data processing. Existing I/O bandwidth and parallelism are claimed to be the limiting factor to justify parallel DBMS. While this conclusion is not surprising given the characteristics of magnetic/mechanical storage
combined with Amdahl's balanced systems law, it is revised with modern technologies. Modern semiconductor storage technologies (NVM, Flash) are offering high raw bandwidth and levels of parallelism. Boral and DeWitt (1983) also raise the issue of temporal locality in database applications, which has been questioned back then and is considered to be low in modern workloads, causing unnecessary data transfers. Near-Data Processing and Active Storage present an opportunity to address it.

The concept of Active Disk emerged toward the end of the 1990s and early 2000s. It is most prominently represented by systems such as Active Disk (Acharya et al. 1998), IDISK (Keeton et al. 1998), and Active Storage/Disk (Riedel et al. 1998). While database machines attempted to execute fixed primitive access operations, Active Disk targets executing application-specific code on the drive. Active Storage/Disk (Riedel et al. 1998) relies on a processor-per-disk architecture. It yields significant performance benefits for I/O-bound scans in terms of bandwidth, parallelism, and reduction of data transfers. IDISK (Keeton et al. 1998) assumed a higher complexity of data processing operations compared to Riedel et al. (1998) and targeted mainly analytical workloads and business intelligence and DSS systems. Active Disk (Acharya et al. 1998) targets an architecture based on on-device processors and pushdown of custom data processing operations. Acharya et al. (1998) focus on programming models and explore a streaming programming model, expressing data-intensive operations as so-called disklets, which are pushed down and executed on the disk processor.

An extension of the above ideas (Sivathanu et al. 2005) investigates executing operations on the RAID controller. Yet, classical RAID technologies rely on general-purpose CPUs that operate well with slow mechanical HDDs, are easily overloaded, and turn into a bottleneck with modern storage technologies (Petrov et al. 2010).

Although the Active Disk concept increases the scope and applicability, it is equally impacted by bandwidth limitations and high manufacturing costs. Nowadays two trends have important impact. On the one hand, semiconductor storage technologies (NVM, Flash) offer significantly higher bandwidths, lower latencies, and levels of parallelism. On the other hand, hardware vendors are able to economically fabricate combinations of storage and compute units and package them on storage devices. Both combined result in a new generation of Active Storage devices.

Smart SSDs (Do et al. 2013) or multi-stream SSDs aim to achieve better data processing performance by utilizing on-device resources and pushing down data processing operations close to the data. Programming models such as SSDlets are being proposed. One trend is In-Storage Processing (Jo et al. 2016; Kim et al. 2016), which presents significant performance increases on embedded CPUs for standard DBMS operators. Combinations of storage and GPGPUs demonstrate an increase of up to 40x (Cho et al. 2013a). IBEX (Woods et al. 2013, 2014) is a system demonstrating operator pushdown on FPGA-based storage.

Do et al. (2013) is one of the first works to explore offloading parts of data processing on Smart SSDs, indicating potential of significant performance improvements (up to 2.7x) and energy savings (up to 3x). Do et al. (2013) define a new session-based communication protocol (DBMS-SmartSSD) comprising three operations: OPEN, CLOSE, and GET. In addition they define a set of APIs for on-device functionality: Command API, Thread API, Data API, and Memory API. It does not only enable pushdown but also workload-dependent, cooperative processing. In addition, Do et al. (2013) identify two research questions: (i) How can Active Storage handle the problem of on-device processing in the presence of a more recent version of the data in the buffer? (ii) What is the efficiency of operation pushdown in the presence of large main memories? The latter becomes obvious in the context of large data sets (Big Data) and computationally intensive operations.

Similarly, Seshadri et al. (2014) propose and describe a user-programmable SSD called Willow, which allows the users to augment the storage device with application-specific logic.
In order to provide the new functionality of the SSD for a certain application, three subsystems must be appropriately modified to support a new set of RPC commands: the application, the kernel driver, and the operating system running on each storage processing unit inside the Flash SSD.

The initial ideas of Do et al. (2013) have been recently extended in Kim et al. (2016), Jo et al. (2016), and Samsung (2015). Kim et al. (2016) demonstrate between 5x and 47x performance improvement for scans and joins. Jo et al. (2016) describe a similar approach for In-Storage Computing based on the Samsung PM1725 SSD with ISC option (Samsung 2015) integrated in MariaDB. Other approaches (Cho et al. 2013b; Tiwari et al. 2013) stress the importance of in situ processing.

Woods et al. (2013, 2014) demonstrate with Ibex an intelligent storage engine for commodity relational databases. By off-loading complex query operators, they tackle the bandwidth bottlenecks arising when moving large amounts of data from storage to processing nodes. In addition, the energy consumption is reduced due to the usage of FPGAs rather than general-purpose processors. Ibex supports aggregation (GROUP BY), projection, and selection. Najafi et al. (2013) and Sadoghi et al. (2012) explore approaches for flexible query processing on FPGAs.

JAFAR is an Active Storage approach for column stores by Xi et al. (2015) and Babarinsa and Idreos (2015). JAFAR is based on MonetDB and aims at reducing data transfers, hence pushing down size-reducing DB operators such as selections; joins are not considered. Xi et al. (2015) stress the importance of on-chip accelerators but do not consider Active Storage and accelerators for complex computations in situ.

Memory: Processing-In-Memory (PIM)
Manufacturing costs of DRAM decrease, and memory volume increases steadily, while access latencies improve at a significantly lower rate, yielding the so-called Memory Wall (Wulf and McKee 1995). On the one hand, technologies such as the Hybrid Memory Cube (HMC) attempt to address this issue by locating processing units close to memory and by utilizing novel interfaces alike. On the other hand, new types of memory are introduced, characterized by an even higher density and therefore larger volumes, shorter latencies, and higher bandwidth and internal parallelism, as well as non-volatile persistence behavior.

Balasubramonian (2016) discusses in his article the features that can be meaningfully added to memory devices. Not only do these features execute parts of an application, but they may also take care of auxiliary operations that maintain high efficiency, reliability, and security. Research combining memory technologies with the Active Storage concept in general, often referred to as Processing-In-Memory (PIM), is very versatile. In the late 1990s, Patterson et al. (1997) proposed IRAM as a first attempt to address the Memory Wall by unifying processing logic and DRAM, starting with general research questions of computer science like communication, interfaces, cache coherence, or addressing schemes. Hall et al. (1999) propose combining their Data-IntensiVe Architecture (DIVA) PIM memories with external host processors and defining a PIM-to-PIM interconnect; Vermij et al. (2017) present an extension to the CPU architecture to enable NDP capabilities close to the main memory by introducing a new component attached to the system bus responsible for the communication; Boroumand et al. (2017) propose a new hardware cache coherence mechanism designed specifically for PIM; Picorel et al. (2017) show that the historically important flexibility to map any virtual page to any page frame is unnecessary regarding NDP and introduce the Distributed Inverted Page Table (DIPTA) as an alternative near-memory structure.

Studying the upcoming new interface HMC, Azarkhish et al. (2017) analyzed its support for NDP in a modular and flexible fashion. The authors propose a fully backward-compatible extension to the standard HMC called smart memory cube and design a high-bandwidth, low-latency, and AXI(4)-compatible logic base interconnect, featuring a novel address scrambling mechanism for the reduction in
Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems. In: Proceedings of the 2016 43rd international symposium on computer architecture (ISCA 2016), pp 204–216
István Z, Sidler D, Alonso G (2017) Caribou: intelligent distributed storage. Proc VLDB Endow 10(11):1202–1213
Jo I, Bae DH, Yoon AS, Kang JU, Cho S, Lee DDG, Jeong J (2016) YourSQL: a high-performance database system leveraging in-storage computing. Proc VLDB Endow 9:924–935
Keeton K, Patterson DA, Hellerstein JM (1998) A case for intelligent disks (IDISKs). SIGMOD Rec 27(3):42–52
Kim G, Chatterjee N, O'Connor M, Hsieh K (2017a) Toward standardized near-data processing with unrestricted data placement for GPUs. In: Proceedings of the international conference on high performance computing, networking, storage and analysis (SC'17), pp 1–12
Kim NS, Chen D, Xiong J, Hwu WMW (2017b) Heterogeneous computing meets near-memory acceleration and high-level synthesis in the post-Moore era. IEEE Micro 37(4):10–18
Kim S, Oh H, Park C, Cho S, Lee SW, Moon B (2016) In-storage processing of database scans and joins. Inf Sci 327(C):183–200
Korinth J, Chevallerie Ddl, Koch A (2015) An open-source tool flow for the composition of reconfigurable hardware thread pool architectures. In: Proceedings of the 2015 IEEE 23rd annual international symposium on field-programmable custom computing machines (FCCM'15). IEEE Computer Society, Washington, DC, pp 195–198
Kotra JB, Guttman D, Chidambaram Nachiappan N, Kandemir MT, Das CR (2017) Quantifying the potential benefits of on-chip near-data computing in manycore processors. In: 2017 IEEE 25th international symposium on modeling, analysis, and simulation of computer and telecommunication systems, pp 198–209
Lim H, Park G (2017) Triple engine processor (TEP): a heterogeneous near-memory processor for diverse kernel operations. ACM Trans Archit Code Optim 14(4):1–25
Muramatsu B, Gierschi S, McMartin F, Weimar S, Klotz G (2004) If you build it, will they come? In: Proceedings of the 2004 joint ACM/IEEE conference on digital libraries (JCDL'04), p 396
Najafi M, Sadoghi M, Jacobsen HA (2013) Flexible query processor on FPGAs. Proc VLDB Endow 6(12):1310–1313
Patterson D, Anderson T, Cardwell N, Fromm R, Keeton K, Kozyrakis C, Thomas R, Yelick K (1997) A case for intelligent RAM. IEEE Micro 17(2):34–44
Petrov I, Almeida G, Buchmann A, Ulrich G (2010) Building large storage based on flash disks. In: Proceedings of ADMS'10
Picorel J, Jevdjic D, Falsafi B (2017) Near-memory address translation. In: 2017 26th international conference on parallel architectures and compilation techniques, pp 303–317, arXiv:1612.00445
Ren Y, Wu X, Zhang L, Wang Y, Zhang W, Wang Z, Hack M, Jiang S (2017) iRDMA: efficient use of RDMA in distributed deep learning systems. In: IEEE 19th international conference on high performance computing and communications, pp 231–238
Riedel E, Gibson GA, Faloutsos C (1998) Active storage for large-scale data mining and multimedia. In: Proceedings of the 24th international conference on very large data bases (VLDB'98), pp 62–73
Sadoghi M, Javed R, Tarafdar N, Singh H, Palaniappan R, Jacobsen HA (2012) Multi-query stream processing on FPGAs. In: 2012 IEEE 28th international conference on data engineering, pp 1229–1232
Samsung (2015) In-storage computing. http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150813_S301D_Ki.pdf
Seshadri S, Gahagan M, Bhaskaran S, Bunker T, De A, Jin Y, Liu Y, Swanson S (2014) Willow: a user-programmable SSD. In: Proceedings of OSDI'14
Sivathanu M, Bairavasundaram LN, Arpaci-Dusseau AC, Arpaci-Dusseau RH (2005) Database-aware semantically-smart storage. In: Proceedings of the 4th USENIX conference on file and storage technologies (FAST'05), vol 4, pp 18–18
Sykora J, Koutny T (2010) Enhancing performance of networking applications by IP tunneling through active networks. In: 9th international conference on networks (ICN 2010), pp 361–364
Szalay A, Gray J (2006) 2020 computing: science in an exponential world. Nature 440:413–414
Tennenhouse DL, Wetherall DJ (1996) Towards an active network architecture. ACM SIGCOMM Comput Commun Rev 26(2):5–17
Tiwari D, Boboila S, Vazhkudai SS, Kim Y, Ma X, Desnoyers PJ, Solihin Y (2013) Active flash: towards energy-efficient, in-situ data analytics on extreme-scale machines. In: Proceedings of FAST, pp 119–132
Vermij E, Fiorin L, Jongerius R, Hagleitner C, Lunteren JV, Bertels K (2017) An architecture for integrated near-data processors. ACM Trans Archit Code Optim 14(3):30:1–30:25
Wang Y, Zhang M, Yang J (2017) Towards memory-efficient processing-in-memory architecture for convolutional neural networks. In: Proceedings of the 18th ACM SIGPLAN/SIGBED conference on languages, compilers, and tools for embedded systems (LCTES 2017), pp 81–90
Woods L, Teubner J, Alonso G (2013) Less watts, more performance: an intelligent storage engine for data appliances. In: Proceedings of SIGMOD, pp 1073–1076
Woods L, István Z, Alonso G (2014) Ibex: an intelligent storage engine with support for advanced SQL offloading. Proc VLDB Endow 7(11):963–974
Wulf WA, McKee SA (1995) Hitting the memory wall: implications of the obvious. SIGARCH Comput Archit News 23(1):20–24
Xi SL, Babarinsa O, Athanassoulis M, Idreos S (2015) Beyond the wall: near-data processing for databases. In: Proceedings of DaMoN, pp 2:1–2:10
flag when it finds evidence that the distribution of the stream items has changed. When this happens, the model builder revises the current model or discards the model and builds a new one with fresh data. Usually, change detection algorithms monitor one statistic or a small number of statistics of the stream, so they will not detect every possible kind of change, which is in general computationally unfeasible. Two easy-to-implement change detection algorithms for streams of real values are the CUSUM (Cumulative Sum) and Page-Hinkley methods (Basseville and Nikiforov 1993; Gama et al. 2014). Roughly speaking, both methods monitor the average of the items in the stream; when the recent average differs from the historical average by some threshold related to the standard deviation, they declare change.

Methods such as EWMA, CUSUM, and Page-Hinkley store a constant amount of real values, while window-based methods require, if implemented naively, memory linear in W. On the other hand, keeping a window provides more information usable by the model builder, namely, the instances themselves.

A disadvantage of all three methods as described is that they require the user to provide parameters containing assumptions about the magnitude or frequency of changes. Fixing a window size to have size W means that the user expects the last W items to be relevant, so that there is little change within them, but that items older than W are suspect of being irrelevant. Fixing a parameter λ in the EWMA estimator to a value close to 1 indicates that change is expected to be rare or slow, while a value closer to 0 suggests that change may be frequent or abrupt. A similar assumption can be found in the choice of parameters for the CUSUM and Page-Hinkley tests.

In general, these methods face the trade-off between reaction time and variance in the data: The user would like them to react quickly to changes (which happens with, e.g., smaller values of W and λ) but also have a low number of false positives when no change occurs (which is achieved with larger values of W and λ). In the case of sliding windows, in general one wants to have larger values of W when no change occurs, because models built from more data tend to be more accurate, but smaller values of W when the data is changing, so that the model ignores obsolete data. These trade-offs are investigated by Kuncheva and Žliobaitė (2009).

Methods

Adaptive windowing schemes use sliding windows whose size increases or decreases in response to the change observed in the data. They intend to free the user from having to guess expected rates of change in the data, which may lead to poor performance if the guess is incorrect or if the rate of change is different at different moments. Three methods are reviewed in this section, particularly the ADWIN method.

The Drift Detection Method (DDM) proposed by Gama et al. (2004) applies to the construction of two-class predictive models. It is based on the theoretical and practical observation that the empirical error of a predictive model should decrease or remain stable as the model is built with more and more data from a stationary distribution, assuming one controls overfitting. Therefore, when the empirical error instead increases, this is evidence that the distribution in the data has changed.

More precisely, let pt denote the error rate of the predictor at time t, and st = √(pt(1 − pt)/t) its standard deviation. DDM stores the smallest value pmin of the error rates observed up to time t, and the standard deviation smin at that point. Then at time t:

• If pt + st ≥ pmin + 2·smin, DDM declares a warning. It starts storing examples in anticipation of a possible declaration of change.
• If pt + st ≥ pmin + 3·smin, DDM declares a change. The current predictor is discarded and a new one is built using the stored examples. The values for pmin and smin are reset as well.

This approach is generic and fast enough for the use in the streaming setting, but it has the drawback that it may be too slow in responding to changes. Indeed, since pt is computed on the
20 Adaptive Windowing
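The DDM rules above can be sketched in a few lines. This is a minimal illustration with our own class names, including the warm-up of 30 items before testing that is customary in DDM implementations; it is not the reference implementation:

```python
import math

class DDM:
    """Minimal sketch of the Drift Detection Method (Gama et al. 2004).

    Consumes a stream of 0/1 prediction outcomes (1 = error) and tracks
    the running error rate p_t and its deviation s_t = sqrt(p_t(1-p_t)/t).
    """

    def __init__(self):
        self.t = 0                      # outcomes seen since last reset
        self.errors = 0                 # errors seen since last reset
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the predictor was wrong on this item, else 0.
        Returns 'change', 'warning', or 'in-control'."""
        self.t += 1
        self.errors += error
        p = self.errors / self.t
        s = math.sqrt(p * (1 - p) / self.t)
        if self.t < 30:                 # customary warm-up period
            return "in-control"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s   # record the minimum point
        if p + s >= self.p_min + 3 * self.s_min:
            # change: the caller rebuilds the model on the stored examples
            self.t = self.errors = 0
            self.p_min = self.s_min = float("inf")
            return "change"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "in-control"
```

Feeding it a stream whose error rate jumps from roughly 10% to 100% produces no change signal before the jump and a change signal shortly after it, illustrating the detection delay discussed next.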
Indeed, since p_t is computed on the basis of all examples since the last change, it may take many observations after the change to make p_t significantly larger than p_min. Also, for slow change, the number of examples retained in memory may become large.

An evolution of this method that uses EWMA to estimate the errors is presented and thoroughly analyzed by Ross et al. (2012).

The OLIN method due to Last (2002) and Cohen et al. (2008) also adjusts dynamically the size of the sliding window used to update a predictive model, in order to adapt it to the rate of change in nonstationary data streams. OLIN uses the statistical significance of the difference between the training and the validation accuracy of the current model as an indicator of data stability. Higher stability means that the window can be enlarged to use more data to build a predictor, and lower stability implies shrinking the window to discard stale data. Although described for one specific type of predictor ("Information Networks") in Last (2002) and Cohen et al. (2008), the technique should apply to many other types.

The ADWIN (ADaptive WINdowing) algorithm is due to Bifet and Gavaldà (2007) and Bifet (2010). Its purpose is to be a self-contained module that can be used in the design of data stream algorithms (for prediction or classification, but also for other tasks) to detect and manage change in a well-specified way. In particular, it wants to resolve the trade-off between fast reaction to change and reduced false alarms without relying on the user guessing an ad hoc parameter. Intuitively, the ADWIN algorithm resolves this trade-off by checking change at many scales simultaneously or, equivalently, by trying many sliding window sizes simultaneously. It should be used when the scale of the change rate is unknown, and this is problematic enough to compensate for a moderate increase in computational effort.

More precisely, ADWIN maintains a sliding window of real numbers that are derived from the data stream. For example, elements in the window could be W bits indicating whether the current predictive model was correct on the last W stream items; the window then can be used to estimate the current error rate of the predictor. In a clustering task, it could instead keep track of the fraction of outliers or cluster quality measures, and in a frequent pattern mining task, the number of frequent patterns that appear in the window. Significant variation inside the window of any of these measures indicates distribution change in the stream. ADWIN is parameterized by a confidence parameter δ ∈ (0, 1) and a statistical test T(W0, W1, δ); here W0 and W1 are two windows, and T decides whether they are likely to come from the same distribution. A good test should satisfy the following criteria:

• If W0 and W1 were generated from the same distribution (no change), then with probability at least 1 − δ the test says "no change."
• If W0 and W1 were generated from two different distributions whose average differs by more than some quantity ε(W0, W1, δ), then with probability at least 1 − δ the test says "change."

When there has been change in the average but its magnitude is less than ε(W0, W1, δ), no claims can be made on the validity of the test's answer. Observe that in reasonable tests, ε decreases as the sizes of W0 and W1 increase, that is, as the test sees larger samples. ADWIN applies the test to a number of partitions of its sliding window into two parts, W0 containing the oldest elements and W1 containing the newer ones. Whenever T(W0, W1, δ) returns "change", ADWIN drops W0, so the sliding window becomes W1. In this way, at all times, the window is kept of the maximum length such that there is no proof of change within it. In times without change, the window can keep growing indefinitely (up to a maximum size, if desired).

In order to be efficient in time and memory, ADWIN represents its sliding window in a compact way, using the Exponential Histogram data structure due to Datar et al. (2002). This structure maintains a summary of a window by means of a chain of buckets. Older bits are summarized and compressed in coarser buckets with less resolution.
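The cut-detection loop can be sketched naively as follows: test every split of the current window and drop the older part whenever the two sample means differ by more than the bound. This O(W)-per-item version is for illustration only; the real algorithm tests just the cuts at bucket boundaries of the Exponential Histogram. The bound below is the Hoeffding-style ε of Bifet and Gavaldà (2007) for values in [0, 1]:

```python
import math

def adwin_step(window, delta=0.01):
    """One naive ADWIN check over `window` (a list of values in [0,1],
    oldest first). Returns the window after dropping any prefix W0 whose
    mean differs significantly from the mean of the remainder W1."""
    changed = True
    while changed and len(window) > 1:
        changed = False
        n = len(window)
        total = sum(window)
        head = 0.0
        for i in range(1, n):               # cut between i-1 and i
            head += window[i - 1]
            n0, n1 = i, n - i
            mean0, mean1 = head / n0, (total - head) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # "harmonic mean" term
            # Hoeffding-style bound with delta' = delta / n
            eps = math.sqrt((1.0 / (2 * m)) * math.log(4.0 * n / delta))
            if abs(mean0 - mean1) > eps:
                window = window[i:]          # drop the older part W0
                changed = True
                break
    return window
```

On a window of 60 zeros followed by 60 ones, repeated cuts shave off the stale prefix until no split is statistically significant, while a window drawn from a single distribution is left untouched.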
A window of length W is stored in only O(k log W) buckets, each using a constant amount of memory words, and yet the histogram returns an approximation of the average of the window values that is correct up to a factor of 1/k. ADWIN does not check all partitions of its window into pairs (W0, W1), but only those at bucket boundaries. Therefore, it performs O(k log W) tests on the sliding window for each stream item. The standard implementation of ADWIN uses k = 5 and may add the rule that checks are performed only every t items for efficiency, at the price of a delay of up to t time steps in detecting a change.

In Bifet and Gavaldà (2007) and Bifet (2010), a test based on the so-called Hoeffding bound is proposed, which can be rigorously proved to satisfy the conditions above for a "good test." Based on this, rigorous guarantees on the false positive rate and false negative rate of ADWIN are proved in Bifet and Gavaldà (2007) and Bifet (2010). However, this test is quite conservative and will be slow to detect change. In practice, tests based on the normal approximation of a binomial distribution should be used, obtaining faster reaction time for a desired false positive rate.

Algorithms for mining data streams will probably store their own sliding window of examples to revise/rebuild their models. One or several instances of ADWIN can be used to inform the algorithm of the occurrence of change and the optimal window size it should use. The time and memory overhead is moderate (logarithmic in the size of the window) and often negligible compared with the cost of the main algorithm itself.

Several change detection methods for streams were evaluated by Gama et al. (2009). The conclusions were that Page-Hinkley and ADWIN were the most appropriate. Page-Hinkley exhibited a high rate of false alarms, and ADWIN used more resources, as expected.

ADWIN has been applied in the design of streaming algorithms and in applications that need to deal with nonstationary streams. At the level of algorithm design, it was used by Bifet and Gavaldà (2009) to give a more adaptive version of the well-known CVFDT algorithm for building decision trees from streams due to Hulten et al. (2001); ADWIN improves it by replacing hard-coded constants in CVFDT for the sizes of sliding windows and the duration of testing and training phases with data-adaptive conditions. A similar approach was used by Bifet et al. (2010b) for regression trees using perceptrons at the leaves. In Bifet et al. (2009a, b, 2010a, 2012), ADWIN was used in the context of ensemble classifiers to detect when a member of the ensemble is underperforming and needs to be replaced. In the context of pattern mining, ADWIN was used to detect change and maintain the appropriate sliding window size in algorithms that extract frequent graphs and frequent trees from streams of graphs and XML trees (Bifet et al. 2011b; Bifet and Gavaldà 2011). In the context of process mining, ADWIN is used by Carmona and Gavaldà (2012) to propose a mechanism that helps in detecting changes in the process, localize and characterize the change once it has occurred, and unravel process evolution.

ADWIN is a very generic, domain-independent mechanism that can be plugged into a large variety of applications. Some examples include the following:

• Bakker et al. (2011) in an application to detect stress situations in the data from wearable sensors
• Bifet et al. (2011a) in an application to detect sentiment change in Twitter streaming data
• Pechenizkiy et al. (2009) as the basic detection mechanism in a system to control the stability and efficiency of industrial fuel boilers
• Talavera et al. (2015) in an application to segmentation of video streams

A main research problem continues to be the efficient detection of change in multidimensional data.
Algorithms as described above (OLIN, DDM, and ADWIN) deal, strictly speaking, with the detection of change in a unidimensional stream of real values; it is assumed that this stream of real values, derived from the real stream, will change significantly when there is significant change in the multidimensional stream.

Several instances of these detectors can be created to monitor different parts of the data space or to monitor different summaries or projections thereof, as in, e.g., Carmona and Gavaldà (2012). However, there is no guarantee that all real changes will be detected in this way. Papapetrou et al. (2015) and Muthukrishnan et al. (2007) among others have proposed efficient schemes to directly monitor change in multidimensional data. However, the problem in its full generality is difficult to scale to high dimensions and arbitrary change, and research in more efficient mechanisms usable in streaming scenarios is highly desirable.
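One simple instantiation of the several-detectors idea is to attach one unidimensional detector to each coordinate (or projection) of the stream and flag change when any of them fires. The sketch below uses a deliberately simplified mean-shift test over two halves of a fixed window as a stand-in for ADWIN or DDM; all names are ours:

```python
class MeanShiftDetector:
    """Toy unidimensional detector: compares the means of the older and
    newer halves of a fixed-size window (stand-in for ADWIN/DDM)."""

    def __init__(self, size=100, threshold=0.5):
        self.buf, self.size, self.threshold = [], size, threshold

    def update(self, x):
        self.buf.append(x)
        if len(self.buf) > self.size:
            self.buf.pop(0)                 # keep only the last `size` items
        if len(self.buf) < self.size:
            return False                    # not enough data yet
        h = self.size // 2
        m0 = sum(self.buf[:h]) / h          # older half
        m1 = sum(self.buf[h:]) / (self.size - h)  # newer half
        return abs(m0 - m1) > self.threshold

def monitor(stream_of_vectors, dims, **kw):
    """Run one detector per coordinate; return (step, dimension) of the
    first detected change, or None if none fires."""
    detectors = [MeanShiftDetector(**kw) for _ in range(dims)]
    for t, vec in enumerate(stream_of_vectors):
        for d in range(dims):
            if detectors[d].update(vec[d]):
                return (t, d)
    return None
```

As the entry notes, this per-projection scheme gives no guarantee of catching every multidimensional change, e.g., a change in the correlation between coordinates that leaves each marginal mean untouched.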
Cross-References

Definition of Data Streams
Introduction to Stream Processing Algorithms
Management of Time

References

Bakker J, Pechenizkiy M, Sidorova N (2011) What's your current stress level? Detection of stress patterns from GSR sensor data. In: 2011 IEEE 11th international conference on data mining workshops (ICDMW), Vancouver, 11 Dec 2011, pp 573–580. https://doi.org/10.1109/ICDMW.2011.178
Basseville M, Nikiforov IV (1993) Detection of abrupt changes: theory and application. Prentice-Hall, Upper Saddle River. http://people.irisa.fr/Michele.Basseville/kniga/. Accessed 21 May 2017
Bifet A (2010) Adaptive stream mining: pattern learning and mining from evolving data streams, frontiers in artificial intelligence and applications, vol 207. IOS Press. http://www.booksonline.iospress.nl/Content/View.aspx?piid=14470
Bifet A, Gavaldà R (2007) Learning from time-changing data with adaptive windowing. In: Proceedings of the seventh SIAM international conference on data mining, 26–28 Apr 2007, Minneapolis, pp 443–448. https://doi.org/10.1137/1.9781611972771.42
Bifet A, Gavaldà R (2009) Adaptive learning from evolving data streams. In: Advances in intelligent data analysis VIII, proceedings of the 8th international symposium on intelligent data analysis, IDA 2009, Lyon, 31 Aug–2 Sept 2009, pp 249–260. https://doi.org/10.1007/978-3-642-03915-7_22
Bifet A, Gavaldà R (2011) Mining frequent closed trees in evolving data streams. Intell Data Anal 15(1):29–48. https://doi.org/10.3233/IDA-2010-0454
Bifet A, Holmes G, Pfahringer B, Gavaldà R (2009a) Improving adaptive bagging methods for evolving data streams. In: Advances in machine learning, proceedings of the first Asian conference on machine learning, ACML 2009, Nanjing, 2–4 Nov 2009, pp 23–37. https://doi.org/10.1007/978-3-642-05224-8_4
Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R (2009b) New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '09. ACM, New York, pp 139–148. http://doi.acm.org/10.1145/1557019.1557041
Bifet A, Holmes G, Pfahringer B (2010a) Leveraging bagging for evolving data streams. In: European conference on machine learning and knowledge discovery in databases, proceedings, part I of the ECML PKDD 2010, Barcelona, 20–24 Sept 2010, pp 135–150. https://doi.org/10.1007/978-3-642-15880-3_15
Bifet A, Holmes G, Pfahringer B, Frank E (2010b) Fast perceptron decision tree learning from evolving data streams. In: Advances in knowledge discovery and data mining, proceedings, part II of the 14th Pacific-Asia conference, PAKDD 2010, Hyderabad, 21–24 June 2010, pp 299–310. https://doi.org/10.1007/978-3-642-13672-6_30
Bifet A, Holmes G, Pfahringer B, Gavaldà R (2011a) Detecting sentiment change in twitter streaming data. In: Proceedings of the second workshop on applications of pattern analysis, WAPA 2011, Castro Urdiales, 19–21 Oct 2011, pp 5–11. http://jmlr.csail.mit.edu/proceedings/papers/v17/bifet11a/bifet11a.pdf
Bifet A, Holmes G, Pfahringer B, Gavaldà R (2011b) Mining frequent closed graphs on evolving data streams. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '11. ACM, New York, pp 591–599. http://doi.acm.org/10.1145/2020408.2020501
Bifet A, Frank E, Holmes G, Pfahringer B (2012) Ensembles of restricted Hoeffding trees. ACM TIST 3(2):30:1–30:20. http://doi.acm.org/10.1145/2089094.2089106
Bifet A, Gavaldà R, Holmes G, Pfahringer B (2018) Machine learning for data streams, with practical examples in MOA. MIT Press, Cambridge. https://mitpress.mit.edu/books/machine-learning-data-streams
Carmona J, Gavaldà R (2012) Online techniques for dealing with concept drift in process mining. In: Advances in intelligent data analysis XI – proceedings of the 11th international symposium, IDA 2012, Helsinki, 25–27 Oct 2012, pp 90–102. https://doi.org/10.1007/978-3-642-34156-4_10
Advancements in YARN Resource Manager

Advancements in YARN Resource Manager, Fig. 1 (chart omitted: counts over time; y-axis ticks at 50 and 100, x-axis labeled Time)
or iterative computations. This motivated decoupling the cluster resource management infrastructure from specific programming models and led to the birth of YARN (Vavilapalli et al. 2013). YARN manages cluster resources and exposes a generic interface for applications to request resources. This allows several applications, including MapReduce, to be deployed on a single cluster and share the same resource management layer.

YARN is a community-driven effort that was first introduced in Apache Hadoop in November 2011, as part of the 0.23 release. Since then, the interest of the community has continued unabated. Figure 1 shows that more than 100 tickets, i.e., JIRAs (YARN JIRA 2017), related to YARN are raised every month. A steady portion of these JIRAs is resolved, which shows the continuous community engagement. In the past year alone, 160 individuals have contributed code to YARN. Moreover, YARN has been widely deployed across hundreds of companies for production purposes, including Yahoo! (Oath), Microsoft, Twitter, LinkedIn, Hortonworks, Cloudera, eBay, and Alibaba.

Since YARN's inception, we observe the following trends in modern clusters:

Application variety Users' interest has expanded from batch analytics applications (e.g., MapReduce) to include streaming, iterative (e.g., machine learning), and interactive computations.

Large shared clusters Instead of using dedicated clusters for each application, diverse workloads are consolidated on clusters of thousands or even tens of thousands of machines. This consolidation avoids unnecessary data movement, allows for better resource utilization, and enables pipelines with different application classes.

High resource utilization Operating large clusters involves a significant cost of ownership. Hence, cluster operators rely on resource managers to achieve high cluster utilization and improve their return on investment.

Predictable execution Production jobs typically come with Service Level Objectives (SLOs), such as completion deadlines, which have to be met in order for the output of the jobs to be consumed by downstream services. Execution predictability is often more important than pure application performance when it comes to business-critical jobs.

This diverse set of requirements has introduced new challenges to the resource management layer. To address these new demands, YARN has evolved from a platform for batch analytics workloads to a production-ready, general-purpose resource manager that can support a wide range of applications and user requirements over large shared clusters.
In the remainder of this entry, we first give a brief overview of YARN's architecture and dedicate the rest of the entry to the new functionality that was added to YARN these last years.

YARN Architecture

YARN follows a centralized architecture in which a single logical component, the resource manager (RM), allocates resources to jobs submitted to the cluster. The resource requests handled by the RM are intentionally generic, while specific scheduling logic required by each application is encapsulated in the application master (AM) that any framework can implement. This allows YARN to support a wide range of applications using the same RM component. YARN's architecture is depicted in Fig. 2. Below we describe its main components. The new features, which appear in orange, are discussed in the following sections.

Node Manager (NM) The NM is a daemon running at each of the cluster's worker nodes. NMs are responsible for monitoring resource availability at the host node, reporting faults, and managing containers' life cycle (e.g., start, monitor, pause, and kill containers).

Resource Manager (RM) The RM runs on a dedicated machine, arbitrating resources among various competing applications. Multiple RMs can be used for high availability, with one of them being the master. The NMs periodically inform the RM of their status, which is stored at the cluster state. The RM-NM communication is heartbeat-based for scalability. The RM also maintains the resource requests of all applications (application state). Given its global view of the cluster and based on application demand, resource availability, scheduling priorities, and sharing policies (e.g., fairness), the scheduler performs the matchmaking between application requests and machines and hands leases, called containers, to applications. A container is a logical resource bundle (e.g., 2 GB RAM, 1 CPU) bound to a specific node.

YARN includes two scheduler implementations, namely, the Fair and Capacity Schedulers. The former imposes fairness between applications, while the latter dedicates a share of the cluster resources to groups of users.

Advancements in YARN Resource Manager, Fig. 2 YARN architecture and overview of new features (in orange)
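The matchmaking between application requests and machines can be illustrated with a toy allocator. The data structures and the greedy first-fit policy below are ours for illustration and are much simpler than YARN's actual scheduler:

```python
def match_requests(requests, nodes):
    """Toy RM matchmaking cycle: greedily hand out container leases.

    requests: list of (app_id, mem_gb, vcores) resource requests
    nodes:    dict node_id -> {"mem": free GB, "vcores": free cores}
    Returns a list of leases (app_id, node_id, mem_gb, vcores); requests
    that fit on no node are left pending, as in a real scheduling cycle.
    """
    leases = []
    for app_id, mem, cores in requests:
        for node_id, free in nodes.items():
            if free["mem"] >= mem and free["vcores"] >= cores:
                free["mem"] -= mem          # the lease is bound to this node
                free["vcores"] -= cores
                leases.append((app_id, node_id, mem, cores))
                break
    return leases
```

A real RM additionally weighs locality preferences, priorities, and fairness or capacity policies when choosing among candidate nodes, rather than taking the first fit.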
Jobs are submitted to the RM via the YARN Client protocol and go through an admission control phase, during which security credentials are validated and various operational and administrative checks are performed.

Application Master (AM) The AM is the job orchestrator (one AM is instantiated per submitted job), managing all its life cycle aspects, including dynamically increasing and decreasing resource consumption, managing the execution flow (e.g., running reducers against the output of mappers), and handling faults. The AM can run arbitrary user code, written in any programming language. By delegating all these functions to AMs, YARN's architecture achieves significant scalability, programming model flexibility, and improved upgrading/testing.

An AM will typically need to harness resources from multiple nodes to complete a job. To obtain containers, the AM issues resource requests to the RM via heartbeats, using the AM Service interface. When the scheduler assigns a resource to the AM, the RM generates a lease for that resource. The AM is then notified and presents the container lease to the NM for launching the container at that node. The NM checks the authenticity of the lease and then initiates the container execution.

In the following sections, we present the main advancements made in YARN, in particular with respect to resource utilization, scalability, support for services, and execution predictability.

Resource Utilization

In the initial versions of YARN, the RM would assign containers to a node only if there were unallocated resources on that node. This guaranteed type of allocation ensures that once an AM dispatches a container to a node, there will be sufficient resources for its execution to start immediately.

Despite the predictable access to resources that this design offers, it has the following shortcomings that can lead to suboptimal resource utilization:

Feedback delays The heartbeat-based AM-RM and NM-RM communications can cause idle node resources from the moment a container finishes its execution on a node to the moment an AM gets notified through the RM to launch a new container on that node.

Underutilized resources The RM assigns containers based on the allocated resources at each node, which might be significantly higher than the actually utilized ones (e.g., a 4 GB container using only 2 GB of its memory).

In a typical YARN cluster, NM-RM heartbeat intervals are set to 3 s, while AM-RM intervals vary but are typically up to a few seconds. Therefore, feedback delays are more pronounced for workloads with short tasks.

Below we describe the new mechanisms that were introduced in YARN to improve cluster resource utilization. These ideas first appeared in the Mercury and Yaq systems (Karanasos et al. 2015; Rasley et al. 2016) and are part of Apache Hadoop as of version 2.9 (Opportunistic scheduling 2017; Distributed scheduling 2017).

Opportunistic containers Unlike guaranteed containers, opportunistic ones are dispatched to an NM, even if there are no available resources on that node. In such a case, the opportunistic containers will be placed in a newly introduced NM queue (see Fig. 2). When resources become available, an opportunistic container will be picked from the queue, and its execution will start immediately, avoiding any feedback delays. These containers run with lower priority in YARN and will be preempted in case of resource contention for guaranteed containers to start their execution. Hence, opportunistic containers improve cluster resource utilization without impacting the execution of guaranteed containers. Moreover, whereas the original NM passively executes conflict-free commands from the RM, a modern NM uses these two-level priorities as inputs to local scheduling decisions. For instance, low-priority jobs with non-strict execution guarantees or tasks off the critical path of a DAG are good candidates for opportunistic containers.
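The two-level priority behavior at the NM can be modeled with a small simulation; the class below uses our own names and drastic simplifications (single resource dimension, no pause/resume) and is not Hadoop code:

```python
from collections import deque

class NodeManagerSim:
    """Toy NM with guaranteed vs. opportunistic (queued) containers."""

    def __init__(self, capacity):
        self.capacity = capacity          # free resource units
        self.running = []                 # (container_id, units, kind)
        self.queue = deque()              # waiting opportunistic containers

    def start(self, cid, units, kind):
        if kind == "guaranteed":
            # preempt opportunistic containers until the request fits
            while self.capacity < units:
                victim = next((c for c in self.running
                               if c[2] == "opportunistic"), None)
                if victim is None:
                    return False          # genuinely out of resources
                self.running.remove(victim)
                self.capacity += victim[1]
            self._run(cid, units, kind)
            return True
        # opportunistic: run if there is room, otherwise wait in the queue
        if self.capacity >= units:
            self._run(cid, units, kind)
        else:
            self.queue.append((cid, units))
        return True

    def _run(self, cid, units, kind):
        self.capacity -= units
        self.running.append((cid, units, kind))

    def finish(self, cid):
        for c in list(self.running):
            if c[0] == cid:
                self.running.remove(c)
                self.capacity += c[1]
        # start queued opportunistic containers that now fit, with no
        # round-trip to the RM (this is what removes the feedback delay)
        while self.queue and self.capacity >= self.queue[0][1]:
            qcid, qunits = self.queue.popleft()
            self._run(qcid, qunits, "opportunistic")
```

In the sketch, a finishing container immediately unblocks queued opportunistic work, and an arriving guaranteed container evicts opportunistic ones, mirroring the priority rules described above.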
The AMs currently determine the execution type for each container, but the system could use automated policies instead. The AM can also request promotion of opportunistic containers to guaranteed ones, to protect them from preemption.

Hybrid scheduling Opportunistic containers can be allocated centrally by the RM or in a distributed fashion through a local scheduler that runs at each NM and leases containers on other NMs without contacting the RM. Centralized allocation allows for higher-quality placement decisions and sharing policies. Distributed allocation offers lower allocation latencies, which can be beneficial for short-lived containers. To prevent conflicts, guaranteed containers are always assigned by the RM.

To determine the least-loaded nodes for placing opportunistic containers, the RM periodically gathers information about the running and queued containers at each node and propagates this information to the local schedulers too. To account for occasional load imbalance across nodes, YARN performs dynamic rebalancing of queued containers.

Resource overcommitment Currently, opportunistic containers can be employed to avoid feedback delays. Ongoing development also focuses on overcommitting resources using opportunistic containers (Utilization-based scheduling 2017). In this scenario, opportunistic containers facilitate reclaiming overcommitted resources on demand, without affecting the performance and predictability of jobs that opt out of overcommitted resources.

Cluster Scalability

A single YARN RM can manage a few thousand nodes. However, production analytics clusters at big cloud companies often consist of tens of thousands of machines, crossing YARN's limits (Burd et al. 2017).

YARN's scalability is constrained by the resource manager, as load increases proportionally to the number of cluster nodes and the application demands (e.g., active containers, resource requests per second). Increasing the heartbeat intervals could improve scalability in terms of number of nodes, but would be detrimental to utilization (Vavilapalli et al. 2013) and would still pose problems as the number of applications increases.

Instead, as of Apache Hadoop 2.9 (YARN Federation 2017), a federation-based approach scales a single YARN cluster to tens of thousands of nodes. This approach divides the cluster into smaller units, called subclusters, each with its own YARN RM and NMs. The federation system negotiates with subcluster RMs to give applications the experience of a single large cluster, allowing applications to schedule their tasks to any node of the federated cluster.

The state of the federated cluster is coordinated through the State Store, a central component that holds information about (1) subcluster liveliness and resource availability via heartbeats sent by each subcluster RM, (2) the YARN subcluster at which each AM is being deployed, and (3) policies used to impose global cluster invariants and perform load rebalancing.

To allow jobs to seamlessly span subclusters, the federated cluster relies on the following components:

Router A federated YARN cluster is equipped with a set of routers, which hide the presence of multiple RMs from applications. Each application gets submitted to a router, which, based on a policy, determines the subcluster for the AM to be executed, gets the subcluster URL from the State Store, and redirects the application submission request to the appropriate subcluster RM.

AMRM Proxy This component runs as a service at each NM of the cluster and acts as a proxy for every AM-RM communication. Instead of directly contacting the RM, applications are forced by the system to access their local AMRM Proxy. By dynamically routing the AM-RM messages, the AMRM Proxy provides the applications with transparent access to multiple YARN RMs. Note that the AMRM Proxy is also used to implement the local scheduler for opportunistic containers and could be used to protect the system against misbehaving AMs.
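The router's role can be sketched as follows; the structures and the example policy are ours for illustration, not the actual YARN Federation interfaces:

```python
def route_submission(app, subclusters, policy):
    """Toy federation router: choose a subcluster for the AM via a
    pluggable policy, then return the RM address from the state store.

    subclusters: dict name -> {"rm_url": str, "free": int, "alive": bool}
    policy: function taking the dict of live subclusters, returning a name
    """
    live = {k: v for k, v in subclusters.items() if v["alive"]}
    if not live:
        raise RuntimeError("no live subcluster to route %s" % app)
    chosen = policy(live)
    return chosen, live[chosen]["rm_url"]

def most_free_policy(live):
    """Example policy: send the AM where the most resources are free."""
    return max(live, key=lambda k: live[k]["free"])
```

Because the policy is a parameter, operators can swap in, e.g., a locality-aware or load-rebalancing rule without applications noticing, which is exactly the transparency the router layer is meant to provide.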
This federated design is scalable, as the number of nodes each RM is responsible for is bounded. Moreover, through appropriate policies, the majority of applications will be executed within a single subcluster; thus the number of applications that are present at each RM is also bounded. As the coordination between subclusters is minimal, the cluster's size can be scaled almost linearly by adding more subclusters. This architecture can provide tight enforcement of scheduling invariants within a subcluster, while continuous rebalancing across subclusters enforces invariants in the whole cluster.

A similar federated design has been followed to scale the underlying store (HDFS Federation 2017).

Long-Running Services

As already discussed, YARN's target applications were originally batch analytics jobs, such as MapReduce. However, a significant share of today's clusters is dedicated to workloads that include stream processing, iterative computations, data-intensive interactive jobs, and latency-sensitive online applications. Unlike batch jobs, these applications benefit from long-lived containers (from hours to months) to amortize container initialization costs, reduce scheduling load, or maintain state across computations. Here we use the term services for all such applications.

Given their long-running nature, these applications have additional demands, such as support for restart, in-place upgrade, monitoring, and discovery of their components. To avoid using YARN's low-level API for enabling such operations, users have so far resorted to AM libraries such as Slider (Apache Slider 2017). Unfortunately, these external libraries only partially solve the problem, e.g., due to lack of common standards for YARN to optimize resource demands across libraries or version incompatibilities between the libraries and YARN.

To this end, the upcoming Apache Hadoop 3.1 release adds first-class support for long-running services in YARN, allowing for both traditional process-based and Docker-based containers. This service framework allows users to deploy existing services on YARN, simply by providing a JSON file with their service specifications, without having to translate those requirements into low-level resource requests at runtime.

The main component of YARN's service framework is the container orchestrator, which facilitates service deployment. It is an AM that, based on the service specification, configures the required requests for the RM and launches the corresponding containers. It deals with various service operations, such as starting components given specified dependencies, monitoring their health and restarting failed ones, scaling up and down component resources, upgrading components, and aggregating logs.

A RESTful API server is developed to allow users to manage the life cycle of services on YARN via simple commands, using framework-independent APIs. Moreover, a DNS server enables service discovery via standard DNS lookups and greatly simplifies service failovers.

Scheduling services Apart from the aforementioned support for service deployment and management, service owners also demand precise control of container placement to optimize the performance and resilience of their applications. For instance, containers of services are often required to be collocated (affinity) to reduce network costs or separated (anti-affinity) to minimize resource interference and correlated failures. For optimal service performance, even more powerful constraints are useful, such as complex intra- and inter-application constraints that collocate services with one another or put limits on the number of specific containers per node or rack.
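A toy version of such constraint checking is shown below, with hypothetical structures of our own; real YARN placement constraints are expressed through allocation tags and a richer constraint language:

```python
def satisfies(node_tags, candidate_tag, constraints):
    """Toy placement-constraint check for a single node.

    node_tags: list of tags of containers already placed on the node
    constraints: dict with optional keys
      "affinity":      tag that must already be present on the node
      "anti_affinity": tag that must NOT be present on the node
      "max_per_node":  cap on containers of candidate_tag on the node
    """
    aff = constraints.get("affinity")
    if aff is not None and aff not in node_tags:
        return False
    anti = constraints.get("anti_affinity")
    if anti is not None and anti in node_tags:
        return False
    cap = constraints.get("max_per_node")
    if cap is not None and node_tags.count(candidate_tag) >= cap:
        return False
    return True

def place(nodes, tag, constraints):
    """Return the first node where the container's constraints hold."""
    for name, tags in nodes.items():
        if satisfies(tags, tag, constraints):
            return name
    return None   # no feasible node: request stays pending
```

Even this toy version shows why placement becomes an optimization problem: per-container feasibility checks say nothing about cluster-wide objectives such as fragmentation or load balance, which the operator must trade off globally.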
When placing containers of services, cluster operators have their own, potentially conflicting, global optimization objectives. Examples include minimizing the violation of placement constraints, the resource fragmentation, the load imbalance, or the number of machines used. Due to their long lifetimes, services can tolerate longer scheduling latencies than batch jobs, but their placement should not impact the scheduling latencies of the latter.

To enable high-quality placement of services in YARN, Apache Hadoop 3.1 introduces support for rich placement constraints (Placement constraints 2017).

Jobs with SLOs

In production analytics clusters, the majority of cluster resources is usually consumed by production jobs. These jobs must meet strict Service Level Objectives (SLOs), such as completion deadlines, for their results to be consumed by downstream services. At the same time, a large number of smaller best-effort jobs are submitted to the same clusters in an ad hoc manner for exploratory purposes. These jobs lack SLOs, but they are sensitive to completion latencies.

Resource managers typically allocate resources to jobs based on instantaneous enforcement of job priorities and sharing invariants. Although simpler to implement and impose, this instantaneous resource provisioning makes it challenging to meet job SLOs without sacrificing low latency for best-effort jobs.

To ensure that important production jobs will have predictable access to resources, YARN was extended with the notion of reservations, which provide users with the ability to reserve resources over (and ahead of) time. The ideas around reservations are expressed through the Reservation Definition Language (RDL), which can capture jobs' needs such as deadlines, malleable and gang parallelism, and inter-job dependencies.

Reservation planning and scheduling RDL provides a uniform and abstract representation of jobs' needs. Reservation requests are received ahead of a job's submission by the reservation planner, which performs online admission control. It accepts all jobs that can fit in the cluster agenda over time and rejects those that cannot be satisfied. Once a reservation is accepted by the planner, the scheduler is used to dynamically assign cluster resources to the corresponding job.

Periodic reservations Given that a high percentage of production jobs are recurring (e.g., hourly, daily, or monthly), YARN allows users to define periodic reservations, starting with Apache Hadoop 2.9. A key property of recurring reservations is that once a periodic job is admitted, each of its instantiations will have a predictable resource allocation. This isolates periodic production jobs from the noisiness of sharing.

Toward predictable execution The idea of recurring reservations was first exposed as part of the Morpheus system (Jyothi et al. 2016). Morpheus analyzes inter-job data dependencies and ingress/egress operations to automatically derive SLOs. It uses a resource estimator tool, which is also part of Apache Hadoop as of version 2.9, to estimate jobs' resource requirements based on historic runs. Based on the derived SLOs and resource demands, the system generates recurring reservations and submits them for planning.
reservations first appeared in Rayon (Curino This guarantees that periodic production jobs will
et al. 2014) and are part of YARN as of Apache have guaranteed access to resources and thus
Hadoop 2.6. predictable execution.
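The planner's online admission control described above can be pictured with a toy model: a cluster agenda tracks committed capacity per discrete time step, and a reservation is admitted only if it fits in every step it spans. The sketch below is illustrative only; the class and method names are invented for the example and are not YARN's actual ReservationSystem API:

```python
# Illustrative sketch of reservation admission control (not YARN's real API).
# The cluster agenda tracks committed capacity per discrete time step; a new
# reservation is admitted only if it fits in every step it spans.

class ReservationPlanner:
    def __init__(self, cluster_capacity, horizon):
        self.capacity = cluster_capacity
        self.agenda = [0] * horizon  # committed resources per time step

    def try_admit(self, start, end, demand):
        """Admit a reservation of `demand` resources over steps [start, end)."""
        if any(self.agenda[t] + demand > self.capacity for t in range(start, end)):
            return False  # would overcommit some time step: reject
        for t in range(start, end):
            self.agenda[t] += demand
        return True

planner = ReservationPlanner(cluster_capacity=100, horizon=24)
assert planner.try_admit(0, 6, 80)      # fits in every step: accepted
assert not planner.try_admit(4, 8, 30)  # would overcommit steps 4-5: rejected
assert planner.try_admit(6, 8, 30)      # fits after the first reservation ends
```

Once admitted, each reservation's share would then be handed to the scheduler, which dynamically maps it onto concrete containers at run time.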
need for finer control of resource types other than memory and CPU. Examples include disk bandwidth, network I/O, GPUs, and FPGAs. Adding new resource types in YARN used to be cumbersome, as it required extensive code changes. The upcoming Apache Hadoop 3.1 release (Resource Profiles 2017) follows a more flexible resource model, allowing users to add new resources with minimal effort. In fact, users can define their resources in a configuration file, eliminating the need for code changes or recompilation. The Dominant Resource Fairness (Ghodsi et al. 2011) scheduling algorithm at the RM has also been adapted to account for generic resource types, while resource profiles can be used for AMs to request containers specifying predefined resource sets. Ongoing work focuses on the isolation of resources such as disk, network, and GPUs.

Node labels: Cluster operators can group nodes with similar characteristics, e.g., nodes with public IPs or nodes used for development or testing. Applications can then request containers on nodes with specific labels. This feature is supported by YARN's Capacity Scheduler from Apache Hadoop 2.6 on (Node Labels 2017) and allows at most one label to be specified per node, thus creating nonoverlapping node partitions in the cluster. The cluster administrator can specify the portion of a partition that a queue of the scheduler can access, as well as the portion of a queue's capacity that is dedicated to a specific node partition. For instance, queue A might be restricted to access no more than 30% of the nodes with public IPs, and 40% of queue A has to be on dev machines.

Changing queue configuration: Several companies use YARN's Capacity Scheduler to share clusters across non-coordinating user groups. A hierarchy of queues isolates jobs from each department of the organization. Due to changes in the resource demands of each department, the queue hierarchy or the cluster condition, operators modify the amount of resources assigned to each organization's queue and to the sub-queues used within that department. However, queue reconfiguration has two main drawbacks: (1) setting and changing configurations is a tedious process that can only be performed by modifying XML files; (2) queue owners cannot perform any modifications to their sub-queues; the cluster admin must do it on their behalf.

To address these shortcomings, Apache Hadoop 2.9 (OrgQueue 2017) allows configurations to be stored in an in-memory database instead of XML files. It adds a RESTful API to programmatically modify the queues. This has the additional benefit that queues can be dynamically reconfigured by automated services, based on the cluster conditions or on organization-specific criteria. Queue ACLs allow queue owners to perform modifications on their part of the queue structure.

Timeline server: Information about current and previous jobs submitted in the cluster is key for debugging, capacity planning, and performance tuning. Most importantly, observing historic data enables us to better understand the cluster and jobs' behavior in aggregate to holistically improve the system's operation.

The first incarnation of this effort was the application history server (AHS), which supported only MapReduce jobs. The AHS was superseded by the timeline server (TS), which can deal with generic YARN applications. In its first version, the TS was limited to a single writer and reader that resided at the RM. Its applicability was therefore limited to small clusters.

Apache Hadoop 2.9 includes a major redesign of TS (YARN TS v2 2017), which separates the collection (writes) from the serving (reads) of data, and performs both operations in a distributed manner. This brings several scalability and flexibility improvements.

The new TS collects metrics at various granularities, ranging from flows (i.e., sets of YARN applications logically grouped together) to jobs, job attempts, and containers. It also collects cluster-wide data, such as user and queue information, as well as configuration data. The data collection is performed by collectors that run as services at the RM and at every NM.
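The Dominant Resource Fairness policy referenced earlier generalizes naturally to arbitrary resource types: a job's dominant share is the largest fraction it holds of any single resource, and the scheduler favors the job with the smallest dominant share. A minimal sketch of that selection rule follows; it is illustrative only and not YARN's implementation:

```python
# Minimal Dominant Resource Fairness (DRF) sketch over generic resource
# types (Ghodsi et al. 2011). Illustrative only; YARN's scheduler is richer.

def dominant_share(allocated, cluster_total):
    """Largest fraction of any single resource type that a job holds."""
    return max(allocated.get(r, 0) / cluster_total[r] for r in cluster_total)

def pick_next_job(allocations, cluster_total):
    """DRF allocates next to the job with the smallest dominant share."""
    return min(allocations,
               key=lambda j: dominant_share(allocations[j], cluster_total))

cluster = {"memory": 100, "vcores": 40, "gpus": 8}
allocs = {
    "jobA": {"memory": 30, "vcores": 4, "gpus": 0},   # dominant share 0.30
    "jobB": {"memory": 10, "vcores": 16, "gpus": 0},  # dominant share 0.40
    "jobC": {"memory": 5, "vcores": 2, "gpus": 4},    # dominant share 0.50
}
assert pick_next_job(allocs, cluster) == "jobA"
```

Because the dominant share is computed as a maximum over all configured resource types, adding a new type (e.g., FPGAs) changes only the resource dictionaries, not the policy itself.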
The AM of each job publishes data to the collector of the host NM. Similarly, each container pushes data to its local NM collector, while the RM publishes data to its dedicated collector. The readers are separate instances that are dedicated to serving queries via a REST API. By default, Apache HBase (Apache HBase 2017) is used as the backing storage, which is known to scale to large amounts of data and read/write operations.

Conclusion

YARN was introduced in Apache Hadoop at the end of 2011 as an effort to break the strong ties between Hadoop and MapReduce and to allow generic applications to be deployed over a common resource management fabric. Since then, YARN has evolved to a fully fledged production-ready resource manager, which has been deployed on shared clusters comprising tens of thousands of machines. It handles applications ranging from batch analytics to streaming and machine learning workloads to low-latency services while achieving high resource utilization and supporting SLOs and predictability for production workloads. YARN enjoys a vital community with hundreds of monthly contributions.

Cross-References

Hadoop

Acknowledgements The authors would like to thank Subru Krishnan and Carlo Curino for their feedback while preparing this entry. We would also like to thank the diverse community of developers, operators, and users that have contributed to Apache Hadoop YARN since its inception.

References

Apache Hadoop (2017) Apache Hadoop. http://hadoop.apache.org
Apache HBase (2017) Apache HBase. http://hbase.apache.org
Apache Slider (2017) Apache Slider (incubating). http://slider.incubator.apache.org
Burd R, Sharma H, Sakalanaga S (2017) Lessons learned from scaling YARN to 40 K machines in a multi-tenancy environment. In: DataWorks Summit, San Jose
Curino C, Difallah DE, Douglas C, Krishnan S, Ramakrishnan R, Rao S (2014) Reservation-based scheduling: if you're late don't blame us! In: ACM symposium on cloud computing (SoCC)
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: USENIX symposium on operating systems design and implementation (OSDI)
Distributed Scheduling (2017) Extend YARN to support distributed scheduling. https://issues.apache.org/jira/browse/YARN-2877
Ghodsi A, Zaharia M, Hindman B, Konwinski A, Shenker S, Stoica I (2011) Dominant resource fairness: fair allocation of multiple resource types. In: USENIX symposium on networked systems design and implementation (NSDI)
HDFS Federation (2017) Router-based HDFS federation. https://issues.apache.org/jira/browse/HDFS-10467
Jyothi SA, Curino C, Menache I, Narayanamurthy SM, Tumanov A, Yaniv J, Mavlyutov R, Goiri I, Krishnan S, Kulkarni J, Rao S (2016) Morpheus: towards automated SLOs for enterprise clusters. In: USENIX symposium on operating systems design and implementation (OSDI)
Karanasos K, Rao S, Curino C, Douglas C, Chaliparambil K, Fumarola GM, Heddaya S, Ramakrishnan R, Sakalanaga S (2015) Mercury: hybrid centralized and distributed scheduling in large shared clusters. In: USENIX annual technical conference (USENIX ATC)
Node Labels (2017) Allow for (admin) labels on nodes and resource-requests. https://issues.apache.org/jira/browse/YARN-796
Opportunistic Scheduling (2017) Scheduling of opportunistic containers through YARN RM. https://issues.apache.org/jira/browse/YARN-5220
OrgQueue (2017) OrgQueue for easy CapacityScheduler queue configuration management. https://issues.apache.org/jira/browse/YARN-5734
Placement Constraints (2017) Rich placement constraints in YARN. https://issues.apache.org/jira/browse/YARN-6592
Rasley J, Karanasos K, Kandula S, Fonseca R, Vojnovic M, Rao S (2016) Efficient queue management for cluster scheduling. In: European conference on computer systems (EuroSys)
Resource Profiles (2017) Extend the YARN resource model for easier resource-type management and profiles. https://issues.apache.org/jira/browse/YARN-3926
Utilization-Based Scheduling (2017) Schedule containers based on utilization of currently allocated containers. https://issues.apache.org/jira/browse/YARN-1011
Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O'Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: ACM symposium on cloud computing (SoCC)
YARN Federation (2017) Enable YARN RM scale out via federation using multiple RMs. https://issues.apache.org/jira/browse/YARN-2915
YARN JIRA (2017) Apache JIRA issue tracker for YARN. https://issues.apache.org/jira/browse/YARN
YARN TS v2 (2017) YARN timeline service v.2. https://issues.apache.org/jira/browse/YARN-5355

ADWIN Algorithm

Adaptive Windowing

Alignment Creation

Business Process Model Matching

Analytics Benchmarks

Todor Ivanov and Roberto V. Zicari
Frankfurt Big Data Lab, Goethe University Frankfurt, Frankfurt, Germany

Synonyms

Big data benchmarks; Decision support benchmarks; Machine learning benchmarks; OLAP benchmarks; OLTP benchmarks; Real-time streaming benchmarks

Definitions

The word benchmark means "a predefined position, used as a reference point for taking measures against" (Andersen and Pettersen 1995). There is no clear formal definition of analytics benchmarks.

Jim Gray (1992) describes benchmarking as follows: "This quantitative comparison starts with the definition of a benchmark or workload. The benchmark is run on several different systems, and the performance and price of each system is measured and recorded. Performance is typically a throughput metric (work/second) and price is typically a five-year cost-of-ownership metric. Together, they give a price/performance ratio." In short, we define a software benchmark as a program used for the comparison of software products/tools executing on a pre-configured hardware environment.

Analytics benchmarks are a type of domain-specific benchmark targeting analytics for databases, transaction processing, and big data systems. Originally, the TPC (Transaction Processing Performance Council) (TPC 2018) defined online transaction processing (OLTP) benchmarks (TPC-A and TPC-B) and decision support (DS) benchmarks (TPC-D and TPC-H). The DS systems can be seen as a special kind of online analytical processing (OLAP) system, exemplified by the TPC-DS benchmark, which is a successor of TPC-H (Nambiar and Poess 2006) and specifies many OLAP and data mining queries; these are the predecessors of the current analytics benchmarks. However, due to the many new emerging data platforms, such as hybrid transaction/analytical processing (HTAP) (Kemper and Neumann 2011; Özcan et al. 2017), distributed parallel processing engines (Sakr et al. 2013; Hadoop 2018; Spark 2018; Flink 2018; Carbone et al. 2015, etc.), big data management systems (AsterixDB 2018; Alsubaiee et al. 2014), SQL-on-Hadoop systems (Abadi et al. 2015; Hive 2018; Thusoo et al. 2009; SparkSQL 2018; Armbrust et al. 2015; Impala 2018; Kornacker et al. 2015, etc.), and analytics systems (Hu et al. 2014) integrating machine learning (MLlib 2018; Meng et al. 2016; MADlib 2018; Hellerstein et al. 2012), deep learning (Tensorflow 2018), and more, the emerging benchmarks try to follow the trend and stress these new system features. This makes the currently standardized benchmarks (such as TPC-C, TPC-H, etc.) only partially relevant for the emerging big data management systems, as the latter offer new features that require new analytics benchmarks.
Analytics Benchmarks, Table 3 Active STAC benchmarks (STAC 2018)

Benchmark domain | Specification name
Feed handlers | STAC-M1
Data distribution | STAC-M2
Tick analytics | STAC-M3
Event processing | STAC-A1
Risk computation | STAC-A2
Backtesting | STAC-A3
Trade execution | STAC-E
Tick-to-trade | STAC-T1
Time sync | STAC-TS
Big data | In-development
Network I/O | STAC-N1, STAC-T0

In an overview of the trends in data management technologies, Nambiar et al. (2012) highlight the role of big data technologies and how they are currently changing the industry. One such technology is the NoSQL storage engines (Cattell 2011), which relax the ACID (atomicity, consistency, isolation, durability) guarantees but offer faster data access via distributed and fault-tolerant architectures. There are different types of NoSQL engines (key-value, column, graph, and document stores) covering different data representations.

In the meantime, many new big data technologies such as (1) Apache Hadoop (2018) with HDFS and MapReduce; (2) general parallel processing engines like Spark (2018) and Flink (2018); (3) SQL-on-Hadoop systems like Hive (2018) and Spark SQL (2018); (4) real-time stream processing engines like Storm (2018), Spark Streaming (2018), and Flink; and (5) graph engines on top of Hadoop like GraphX (2018) and Flink Gelly (2015) have emerged. All these tools enabled advanced analytical techniques from data science, machine learning, data mining, and deep learning to become common practice in many big data domains. Because of all these analytical techniques, which are currently integrated in many different ways in both traditional database and new big data management systems, it is hard to define the exact features that a successor of the DS/OLAP systems should have.

Big Data Analytics Benchmarks

Following the big data technology trends, many new benchmarks for big data analytics have emerged. Good examples of OLTP benchmarks targeting NoSQL engines are the Yahoo! Cloud Serving Benchmark (YCSB for short), developed by Yahoo!, and LinkBench, developed by Facebook, both described in Table 4. However, most big data benchmarks stress the capabilities of Hadoop as the major big data platform, as listed in Table 5. Others like BigFUN (Pirzadeh et al. 2015) and BigBench (Ghazal et al. 2013) are technology-independent. For example, BigBench (standardized as TPCx-BB) addresses the Big Data 3V's characteristics and relies on workloads which can be implemented by different SQL-on-Hadoop systems and parallel processing engines supporting advanced analytics and machine learning libraries. Since there are no clear boundaries for the analytical capabilities of the new big data systems, it is also hard to formally specify what an analytics benchmark is.

A different type of benchmark, called benchmark suites, has become very popular. Their goal is to package a number of micro-benchmarks or representative domain workloads together and in this way enable users to easily test systems for different functionalities. Some of these suites target one technology, like SparkBench (Li et al. 2015; Agrawal et al. 2015), which stresses only Spark, while others like HiBench offer implementations for multiple processing engines. Table 6 lists some popular big data benchmarking suites.

A more detailed overview of the current big data benchmarks is provided in a SPEC Big Data Research Group survey by Ivanov et al. (2015) and a journal publication by Han et al. (2018).

Foundations

In the last 40 years, the OLTP and DS/OLAP systems have been the industry standard systems for data storage and management. Therefore, all popular TPC benchmarks were specified in these areas. The majority of TPC benchmark specifications (Poess 2012) have the following main components:
• Preamble – Defines the benchmark domain and the high-level requirements.
• Database Design – Defines the requirements and restrictions for implementing the database schema.
• Workload – Characterizes the simulated workload.
• ACID – Atomicity, consistency, isolation, and durability requirements.
• Workload scaling – Defines tools and methodology on how to scale the workloads.
• Metric/Execution rules – Defines how to execute the benchmark and how to calculate and derive the metrics.
• Benchmark driver – Defines the requirements for implementing the benchmark driver/program.
• Full disclosure report – Defines what needs to be reported and how to organize the disclosure report.
• Audit requirements – Defines the requirements for performing a successful auditing process.

The above structure was typical for the OLTP and DS/OLAP benchmarks defined by TPC, but due to the emerging hybrid OLTP/OLAP systems and big data technologies, these trends have changed (Bog 2013), adapting to the new system features. For example, the database schema and ACID properties are no longer a key requirement in the NoSQL and big data management systems and are replaced by more general categories, such as the system under test (SUT). For example, the new categories in TPCx-BB are:

• System under test – Describes the system architecture with its hardware and software components and their configuration requirements.
• Pricing – Defines the pricing of the components in the system under test including the system maintenance.
• Energy – Defines the methodology, rules, and metrics to measure the energy consumption of the system under test in the TPC benchmarks.

The above components are part of the standard TPC benchmark specifications and are not representative of the entire analytics benchmark
spectrum. Many of the newly defined big data benchmarks are open-source programs. However, the main characteristics of a good domain-specific benchmark are still the same. Jim Gray (1992) defined four important criteria that domain-specific benchmarks must meet:

• Relevant: It must measure the peak performance and price/performance of systems when performing typical operations within that problem domain.
• Portable: It should be easy to implement the benchmark on many different systems and architectures.
• Scalable: The benchmark should apply to small and large computer systems. It should be possible to scale the benchmark up to larger systems and to parallel computer systems as computer performance and architecture evolve.
• Simple: The benchmark must be understandable/interpretable; otherwise it will lack credibility.

Such results are often not comparable and strictly depend on the environment in which they were obtained.

In terms of component specification, the situation looks similar. All TPC benchmarks use synthetic data generators, which allow for scalable and deterministic workload generation. However, many new benchmarks use open data sets or real workload traces, like BigDataBench (Wang et al. 2014), or a mix between real data and synthetically generated data. This also influences the metrics reported by these benchmarks. They are often not clearly specified or are very simplistic (like execution time) and cannot be used for an accurate comparison between different environments.

The ongoing evolution in big data systems and in data science, machine learning, and deep learning tools and techniques will open many new challenges and questions in the design and specification of standardized analytics benchmarks. There is a growing need for new standardized big data analytics benchmarks and metrics.
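The scalability and determinism requirements on synthetic data generators can be illustrated with a toy generator: output is fully determined by a seed and a scale factor, so runs are repeatable across systems while data volume scales linearly. This sketch is illustrative only; real TPC-style generators are far more elaborate:

```python
import random

# Toy synthetic data generator in the spirit of TPC-style tools: output is
# fully determined by (seed, scale_factor), so runs are repeatable across
# systems, and volume scales linearly with the scale factor.

def generate_orders(scale_factor, seed=42, rows_per_sf=1000):
    rng = random.Random(seed)  # fixed seed -> deterministic output
    for order_id in range(scale_factor * rows_per_sf):
        yield (order_id, rng.randint(1, 100), round(rng.uniform(1.0, 500.0), 2))

run1 = list(generate_orders(scale_factor=2))
run2 = list(generate_orders(scale_factor=2))
assert run1 == run2       # deterministic: identical across runs
assert len(run1) == 2000  # volume scales with the scale factor
```

Determinism is what makes results reproducible and auditable; generators based on real traces or open data sets trade some of that repeatability for realism.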
Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R (2010) Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM symposium on cloud computing, SoCC 2010, Indianapolis, 10–11 June 2010, pp 143–154
Ferdman M, Adileh A, Koçberber YO, Volos S, Alisafaee M, Jevdjic D, Kaynak C, Popescu AD, Ailamaki A, Falsafi B (2012) Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In: Proceedings of the 17th international conference on architectural support for programming languages and operating systems, ASPLOS, pp 37–48
Ferrarons J, Adhana M, Colmenares C, Pietrowska S, Bentayeb F, Darmont J (2013) PRIMEBALL: a parallel processing framework benchmark for big data applications in the cloud. In: TPCTC, pp 109–124
Flink (2018) https://flink.apache.org/
Gelly (2015) https://flink.apache.org/news/2015/08/24/introducing-flink-gelly.html
Ghazal A, Rabl T, Hu M, Raab F, Poess M, Crolotte A, Jacobsen H (2013) BigBench: towards an industry standard benchmark for big data analytics. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2013, New York, 22–27 June 2013, pp 1197–1208
Ghazal A, Ivanov T, Kostamaa P, Crolotte A, Voong R, Al-Kateb M, Ghazal W, Zicari RV (2017) BigBench V2: the new and improved BigBench. In: 33rd IEEE international conference on data engineering, ICDE 2017, San Diego, 19–22 Apr 2017, pp 1225–1236
GraphX (2018) https://spark.apache.org/graphx/
Gray J (1992) Benchmark handbook: for database and transaction processing systems. Morgan Kaufmann Publishers Inc., San Francisco
Hadoop (2018) https://hadoop.apache.org/
Han R, John LK, Zhan J (2018) Benchmarking big data systems: a review. IEEE Trans Serv Comput 11(3):580–597
Hellerstein JM, Ré C, Schoppmann F, Wang DZ, Fratkin E, Gorajek A, Ng KS, Welton C, Feng X, Li K, Kumar A (2012) The MADlib analytics library or MAD skills, the SQL. PVLDB 5(12):1700–1711
Hive (2018) https://hive.apache.org/
Hockney RW (1996) The science of computer benchmarking. SIAM, Philadelphia
Hogan T (2009) Overview of TPC benchmark E: the next generation of OLTP benchmarks. In: Performance evaluation and benchmarking, first TPC technology conference, TPCTC 2009, Lyon, 24–28 Aug 2009, Revised Selected Papers, pp 84–98
Hu H, Wen Y, Chua T, Li X (2014) Toward scalable systems for big data analytics: a technology tutorial. IEEE Access 2:652–687
Huang S, Huang J, Dai J, Xie T, Huang B (2010) The HiBench benchmark suite: characterization of the MapReduce-based data analysis. In: Workshops proceedings of the 26th IEEE ICDE international conference on data engineering, pp 41–51
Huppler K (2009) The art of building a good benchmark. In: Nambiar RO, Poess M (eds) Performance evaluation and benchmarking. Springer, Berlin/Heidelberg, pp 18–30
Impala (2018) https://impala.apache.org/
Ivanov T, Rabl T, Poess M, Queralt A, Poelman J, Poggi N, Buell J (2015) Big data benchmark compendium. In: TPCTC, pp 135–155
Kemper A, Neumann T (2011) HyPer: a hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In: Proceedings of the 27th international conference on data engineering, ICDE 2011, Hannover, 11–16 Apr 2011, pp 195–206
Kim K, Jeon K, Han H, Kim SG, Jung H, Yeom HY (2008) MRBench: a benchmark for MapReduce framework. In: 14th international conference on parallel and distributed systems, ICPADS 2008, Melbourne, 8–10 Dec 2008, pp 11–18
Kornacker M, Behm A, Bittorf V, Bobrovytsky T, Ching C, Choi A, Erickson J, Grund M, Hecht D, Jacobs M, Joshi I, Kuff L, Kumar D, Leblang A, Li N, Pandis I, Robinson H, Rorke D, Rus S, Russell J, Tsirogiannis D, Wanderman-Milne S, Yoder M (2015) Impala: a modern, open-source SQL engine for Hadoop. In: CIDR 2015, seventh biennial conference on innovative data systems research, Asilomar, 4–7 Jan 2015, Online proceedings
Li M, Tan J, Wang Y, Zhang L, Salapura V (2015) SparkBench: a comprehensive benchmarking suite for in-memory data analytic platform Spark. In: Proceedings of the 12th ACM international conference on computing frontiers, pp 53:1–53:8
Luo C, Zhan J, Jia Z, Wang L, Lu G, Zhang L, Xu C, Sun N (2012) CloudRank-D: benchmarking and ranking cloud computing systems for data processing applications. Front Comp Sci 6(4):347–362
MADlib (2018) https://madlib.apache.org/
Meng X, Bradley JK, Yavuz B, Sparks ER, Venkataraman S, Liu D, Freeman J, Tsai DB, Amde M, Owen S, Xin D, Xin R, Franklin MJ, Zadeh R, Zaharia M, Talwalkar A (2016) MLlib: machine learning in Apache Spark. J Mach Learn Res 17:34:1–34:7
MLlib (2018) https://spark.apache.org/mllib/
Nambiar R (2014) Benchmarking big data systems: introducing TPC express benchmark HS. In: Big data benchmarking – 5th international workshop, WBDB 2014, Potsdam, 5–6 Aug 2014, Revised Selected Papers, pp 24–28
Nambiar RO, Poess M (2006) The making of TPC-DS. In: Proceedings of the 32nd international conference on very large data bases, Seoul, 12–15 Sept 2006, pp 1049–1058
Nambiar R, Chitor R, Joshi A (2012) Data management – a look back and a look ahead. In: Specifying big data benchmarks – first workshop, WBDB 2012, San Jose, 8–9 May 2012, and second workshop, WBDB 2012, Pune, 17–18 Dec 2012, Revised Selected Papers, pp 11–19
Özcan F, Tian Y, Tözün P (2017) Hybrid transactional/analytical processing: a survey. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD conference 2017, Chicago, 14–19 May 2017, pp 1771–1775
Patil S, Polte M, Ren K, Tantisiriroj W, Xiao L, López J, Gibson G, Fuchs A, Rinaldi B (2011) YCSB++: benchmarking and performance debugging advanced features in scalable table stores. In: ACM symposium on cloud computing in conjunction with SOSP 2011, SOCC'11, Cascais, 26–28 Oct 2011, p 9
Pavlo A, Paulson E, Rasin A, Abadi DJ, DeWitt DJ, Madden S, Stonebraker M (2009) A comparison of approaches to large-scale data analysis. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2009, Providence, 29 June–2 July 2009, pp 165–178
Pirzadeh P, Carey MJ, Westmann T (2015) BigFUN: a performance study of big data management system functionality. In: 2015 IEEE international conference on big data, pp 507–514
Poess M (2012) TPC's benchmark development model: making the first industry standard benchmark on big data a success. In: Specifying big data benchmarks – first workshop, WBDB 2012, San Jose, 8–9 May 2012, and second workshop, WBDB 2012, Pune, 17–18 Dec 2012, Revised Selected Papers, pp 1–10
Poess M, Rabl T, Jacobsen H, Caufield B (2014) TPC-DI: the first industry benchmark for data integration. PVLDB 7(13):1367–1378
Poess M, Rabl T, Jacobsen H (2017) Analysis of TPC-DS: the first standard benchmark for SQL-based big data systems. In: Proceedings of the 2017 symposium on cloud computing, SoCC 2017, Santa Clara, 24–27 Sept 2017, pp 573–585
Pöss M, Floyd C (2000) New TPC benchmarks for decision support and web commerce. SIGMOD Rec 29(4):64–71
Pöss M, Nambiar RO, Walrath D (2007) Why you should run TPC-DS: a workload analysis. In: Proceedings of the 33rd international conference on very large data bases, University of Vienna, 23–27 Sept 2007, pp 1138–1149
Raab F (1993) TPC-C – the standard benchmark for online transaction processing (OLTP). In: Gray J (ed) The benchmark handbook for database and transaction systems, 2nd edn. Morgan Kaufmann, San Mateo
Rockart JF, Ball L, Bullen CV (1982) Future role of the information systems executive. MIS Q 6(4):1–14
Sakr S, Liu A, Fayoumi AG (2013) The family of MapReduce and large-scale data processing systems. ACM Comput Surv 46(1):11:1–11:44
Sangroya A, Serrano D, Bouchenak S (2012) MRBS: towards dependability benchmarking for Hadoop MapReduce. In: Euro-Par: parallel processing workshops, pp 3–12
Sethuraman P, Taheri HR (2010) TPC-V: a benchmark for evaluating the performance of database applications in virtual environments. In: Performance evaluation, measurement and characterization of complex systems – second TPC technology conference, TPCTC 2010, Singapore, 13–17 Sept 2010, Revised Selected Papers, pp 121–135
Shim JP, Warkentin M, Courtney JF, Power DJ, Sharda R, Carlsson C (2002) Past, present, and future of decision support technology. Decis Support Syst 33(2):111–126
Spark (2018) https://spark.apache.org
SparkSQL (2018) https://spark.apache.org/sql/
SparkStreaming (2018) https://spark.apache.org/streaming/
SPEC (2018) www.spec.org/
STAC (2018) www.stacresearch.com/
Storm (2018) https://storm.apache.org/
Tensorflow (2018) https://tensorflow.org
Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive – a warehousing solution over a map-reduce framework. PVLDB 2(2):1626–1629
TPC (2018) www.tpc.org/
Wang L, Zhan J, Luo C, Zhu Y, Yang Q, He Y, Gao W, Jia Z, Shi Y, Zhang S, Zheng C, Lu G, Zhan K, Li X, Qiu B (2014) BigDataBench: a big data benchmark suite from internet services. In: 20th IEEE international symposium on high performance computer architecture, HPCA 2014, pp 488–499

Apache Apex

Ananth Gundabattula¹ and Thomas Weise²
¹Commonwealth Bank of Australia, Sydney, NSW, Australia
²Atrato Inc., San Francisco, CA, USA

Introduction

Apache Apex (2018; Weise et al. 2017) is a large-scale stream-first big data processing framework that can be used for low-latency, high-throughput, and fault-tolerant processing of unbounded (or bounded) datasets on clusters. Apex development started in 2012, and it became a project at the Apache Software Foundation in 2015. Apex can be used for real-time and batch processing, based
on a unified stateful streaming architecture, with influence these translations, such as the affinity to
support for event-time windowing and exactly- control which operators should be deployed into
once processing semantics (Fig. 1). the same thread, process, or host.
Low-level, compositional API: The Apex en-
gine defines the low-level API, which can be used
Application Model and APIs to assemble a pipeline by composing operators
and streams into a DAG. The API offers a high
DAG: The processing logic of an Apex applica- degree of flexibility and control: Individual op-
tion is represented by a directed acyclic graph erators are directly specified by the developer,
(DAG) of operators and streams. Streams are un- and attributes can be used to control details such
bounded sequences of events (or tuples), and op- as resource constraints, affinity, stream encoding,
erators are the atomic functional building blocks and more. On the other hand, the API tends to
for sources, sinks, and transformations. With the be more verbose for those use cases that don’t
DAG, arbitrary complex processing logic can require such flexibility. That’s why Apex, as
be arranged in sequence or in parallel. With part of the library, offers higher-level constructs,
an acyclic graph, the output of an operator can which are based on the DAG API.
only be transmitted to downstream operators. High-level, declarative API: The Apex
For pipelines that require a loop (or iteration), a library provides the high-level stream API,
special delay operator is supported that allows the which allows the application developer to specify
output of an operator to be passed back as input the application through a declarative, fluent
to upstream operators (often required in machine style API (similar to API found in Apache
learning). The engine also defines a module in- Spark and Apache Flink). Instead of identifying
terface, which can be used to define composite individual operators, the developer specifies
operators that represent a reusable DAG fragment sources, transformations, and sinks by chaining
(Fig. 2). method calls on the stream interface. The API
The user-defined DAG is referred to as the internally keeps track of operator(s) and streams
logical plan. The Apex engine will expand it into for eventual expansion into the lower-level DAG.
the physical plan, where operators are partitioned The stream API does not require knowledge of
(for parallelism) and grouped into containers for individual operator classes and more concise for
deployment. Attributes in the logical plan can use cases that don’t require advanced constructs.
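The DAG model above can be illustrated with a small toy sketch. This is plain Python for illustration only, not the actual Apex Java API: operators are nodes, streams are directed edges, and the acyclicity constraint guarantees a topological execution order.

```python
from collections import defaultdict, deque

def topological_order(operators, streams):
    """Return an execution order for a logical plan, or raise if the
    stream graph has a cycle (Apex requires a DAG; loops need the
    special delay operator described above)."""
    indegree = {op: 0 for op in operators}
    downstream = defaultdict(list)
    for src, dst in streams:          # a stream connects src -> dst
        downstream[src].append(dst)
        indegree[dst] += 1
    ready = deque(op for op in operators if indegree[op] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in downstream[op]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(operators):
        raise ValueError("not a DAG: cycle detected")
    return order

# A word-count-style logical plan: source -> parser -> counter -> sink.
# Operator and stream names here are hypothetical.
plan = topological_order(
    ["kafkaInput", "parser", "counter", "console"],
    [("kafkaInput", "parser"), ("parser", "counter"),
     ("counter", "console")])
```

In the actual low-level Java API, the same plan would be assembled with calls along the lines of `dag.addOperator(...)` and `dag.addStream(...)` on the `DAG` interface.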
Apache Apex, Fig. 2 Apex converts a logical plan into a physical execution model by means of configuration
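The logical-to-physical conversion of Fig. 2 can be sketched as a toy model (illustrative Python only, not the Apex engine): each logical operator is expanded into as many physical instances as its configured parallelism, and the instances are packed into deployment containers.

```python
def physical_plan(logical_ops, parallelism, container_size=2):
    """Toy expansion of a logical plan: create one physical instance
    per configured partition and pack instances into containers.
    The real engine additionally honors attributes such as
    affinity/anti-affinity and resource constraints."""
    instances = []
    for op in logical_ops:
        n = parallelism.get(op, 1)           # default: one partition
        instances += [f"{op}#{i}" for i in range(n)]
    containers = [instances[i:i + container_size]
                  for i in range(0, len(instances), container_size)]
    return instances, containers

# Hypothetical plan: the "filter" operator is partitioned three ways.
instances, containers = physical_plan(
    ["input", "filter", "sink"], {"filter": 3})
```

Here `physical_plan`, the operator names, and the greedy container packing are all assumptions for illustration; only the general expand-and-group idea comes from the text.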
It is still possible to add customizations as the needs of the application evolve.

SQL: SQL is important for analytics and widely used in the big data ecosystem, with BI tools that can connect to engines such as Apache Hive, Apache Impala, Apache Drill, and others, besides the traditional database systems. SQL isn't limited to querying data; it can also be used to specify transformations. Recent efforts to bring SQL to stream processing, like Apache Calcite (2018), Flink (Carbone et al. 2015a), Beam (Akidau et al. 2015), and KSQL (Confluent blog 2018), promise to target a wider audience without lower-level programming expertise, for use cases including ad hoc data exploration, ETL, and more. Apex provides a SQL API that is implemented based on Calcite. It currently covers select, insert, the where clause, inner join, and scalar functions. Supported endpoints (read and write) can be file, Kafka, or any other stream defined with the DAG API.

Operators

Operators are Java classes that implement the Operator interface and other optional interfaces that are defined in the Apex API and recognized by the engine.

Operators have ports (type-safe connection points for the streams) to receive input and emit the output. Each operator can have multiple ports. Operators that don't have input ports are sources, and operators that don't emit output are sinks. These operators interface with external systems. Operators that have both types of ports typically transform data.

The operator interfaces inform how the engine will interact with the operator at execution time, once it is deployed into a container (Fig. 3).

Apex has a continuous operator model; the operator, once deployed, will process data until the pipeline is terminated. The setup and activate calls can be used for initialization. From then on, data (individual events) will be processed. Periodically the engine will call begin/endWindow to demarcate streaming intervals (processing time). This presents an opportunity to perform work that is less suitable for every event. The operator can also implement the optional CheckpointListener interface to perform work at checkpoint boundaries.

Checkpointing

Checkpointing in Apex is the foundation for failure handling as well as on-demand parallelism, a.k.a. dynamic partitioning. Checkpointing enables application recovery in case of failure by storing the state of the DAG at configurable units.
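The continuous operator model described above, with begin/endWindow demarcating streaming intervals, can be sketched as a toy event loop. This is illustrative Python only; real Apex operators are Java classes, and the method names below merely mirror the callbacks mentioned in the text.

```python
class RunningCountOperator:
    """Toy continuous operator: counts tuples and records the running
    count once per streaming window, i.e. work that would be too
    costly to do for every single event."""
    def setup(self):
        self.count = 0
        self.window_counts = []
    def begin_window(self, window_id):
        self.current_window = window_id
    def process(self, tuple_):
        self.count += 1
    def end_window(self):
        # per-window work at the streaming-interval boundary
        self.window_counts.append((self.current_window, self.count))

def run(operator, windows):
    """Toy engine driver: feeds a list of windows, each a list of
    tuples, through the operator lifecycle."""
    operator.setup()
    for wid, tuples in enumerate(windows):
        operator.begin_window(wid)
        for t in tuples:
            operator.process(t)
        operator.end_window()
    return operator.window_counts

counts = run(RunningCountOperator(), [["a", "b"], ["c"], []])
```

A checkpoint callback (the CheckpointListener contract mentioned above) would slot into the same loop at checkpoint boundaries.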
Apache Apex, Fig. 4 Checkpoint markers injected into the tuple stream at streaming window boundaries
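The marker scheme of Fig. 4, together with the buffer-server replay discussed below, can be imitated with a toy sketch (illustrative Python only; class and function names are assumptions, not Apex APIs):

```python
def emit_with_markers(tuples, window_size):
    """Source-side stream: inject a checkpoint marker after every
    window_size tuples (toy version of Fig. 4)."""
    out = []
    for i, t in enumerate(tuples, 1):
        out.append(("tuple", t))
        if i % window_size == 0:
            out.append(("checkpoint", i // window_size))
    return out

class BufferServer:
    """Toy buffer server: retains the published stream so a restarted
    downstream operator can re-subscribe from a known position and
    see exactly the same sequence again (idempotent replay)."""
    def __init__(self):
        self.log = []
    def publish(self, events):
        self.log.extend(events)
    def subscribe_from(self, position):
        return self.log[position:]

server = BufferServer()
server.publish(emit_with_markers([1, 2, 3, 4, 5, 6], window_size=2))

# Suppose the downstream operator had consumed 6 events (up to and
# including checkpoint marker 2) before failing; it resumes there.
replayed = server.subscribe_from(6)
```

Because every re-subscription from the same position yields the identical sequence, replay is reproducible, which is the property the idempotency discussion below relies on.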
Checkpointing in Apex is thus an asynchronous, lightweight, and decentralized approach.

Idempotency for resumption from checkpoint: The introduction of checkpoint markers into the stream in Apex is a slight variation of Carbone et al. (2015b). The checkpoint markers are introduced at the input operator (source) and traverse the entire DAG. Upon resumption from failure, the stream needs to be replayed in the exact same sequence to enable the entire application as a reproducible state machine. Apex provides the necessary building blocks to achieve idempotency. Individual operators in the DAG ensure idempotency is maintained in their processing logic. For each stream that crosses a container/process boundary, Apex provides a buffer server, like (Lin et al. 2016), that can be used by a downstream operator to start consumption from a specific position. This enables fine-grained recovery and dynamic partitioning without a full DAG reset. In the case of an input operator, idempotency of replay is either backed by capabilities of the source (in cases like Kafka and files, by offset tracking and sequential scan), or the input operator can record the source data for replay by using a write-ahead log (WAL) implementation that is provided by the Apex library.

Opt-in independence: Operators in Apex can choose to opt out of checkpointing cycles to avoid serialization overhead. As an example, an operator representing a matrix transpose operation can choose not to be part of the checkpointing process, as it does not need state to be accumulated across tuples or across streaming windows. On the other hand, an operator which is computing a cumulative sum, or is part of the sort or join logic in a SQL DAG, needs the state to be handed from one streaming window to another and hence needs to be stateful. Stateless operators can be enabled either by an annotation marker on the operator implementation or by declaring it in the configuration provided as input to the application launch. For stateful operators, Apex ensures that the state is serialized at checkpoint intervals. The default serialization mechanism allows the operator implementation to decide (and optimize) which fields are serialized (or skipped, by marking them transient) and also how fields are serialized, through optional annotations.

Exactly-once state and exactly-once processing: Exactly-once processing is the holy grail of distributed stream processing engines. In reality though, reprocessing cannot be avoided in a distributed system, and therefore exactly-once refers to the effect on the state, which is also why it is alternatively referred to as "effectively once." While some engines define exactly-once semantics purely from an internal state persistence point of view, others aim to provide constructs for exactly-once processing in addition to state, with varying degrees of implementation complexity, as given in Kulkarni et al. (2015), Noghabi et al. (2017), and Jacques-Silva et al. (2016). Exactly-once state management in Apex is a distributed process by virtue of each operator triggering its own state passivation when the checkpoint tuple is processed. While some streaming systems only focus on exactly-once semantics for internal state, the Apex library provides support for end-to-end exactly-once for many of its connectors, thereby also covering the effect of processing on the state in the respective external systems. This is possible through interface contracts in the low-level API and idempotency and by utilizing capabilities that are specific to the integrated system, such as transactions, atomic renames, etc.

At-most-once and at-least-once processing semantics: At-least-once processing semantics are not atypical for use cases like finding a max value in a given stream, while at-most-once processing is required for use cases where recency of the data processing state matters the most. At-least-once semantics is achieved by the upstream operator replaying the tuples from the checkpoint and the downstream operator not implementing any lightweight checkpoint state-based checks. For at-most-once semantics, the upstream operator needs to just stream tuples to downstream without starting from a checkpoint.

Low-level API for processing semantics: As described in the Operators section, the operator is given a chance to perform business logic-specific processing at streaming window boundaries as well as checkpoint boundaries. Several operators in Apex implement incremental lightweight state
saving at window boundaries and a complete operator state at checkpointing boundaries. Utilities exist in the framework to get/set data structures that represent state using a window ID marker as input, thus allowing for a lightweight state passivation. Since a checkpoint resumption can result in reprocessing multiple windows, this incremental state can come in handy to completely skip windows that are processed from a certain checkpoint. Let us consider how this approach provides for exactly-once processing semantics in systems that do not provide two-phase commit transaction support. Until the last streaming window before a crash (referred to as the orphaned window hereafter), state can be passivated incrementally at streaming window boundaries. For the orphaned window, for which neither a lightweight state has been persisted nor a full checkpoint cycle is complete, custom business logic can be invoked using a "check/read and then process" pattern. This "check and process along with a business helper function" is only executed for the orphaned window reprocessing. Processing resumes the normal pattern beyond the orphaned window. Thus a combination of lightweight checkpointing at window boundaries, full checkpointing at checkpoint boundaries, and the check-and-process pattern can achieve exactly-once processing semantics in Apex for this class of systems.

Managed state and spillable data structures: State saving at the checkpointing boundary may be sub-optimal when the state is really large (Del Monte 2017). Incremental state persistence at smaller intervals and/or spillable state can be effective patterns to mitigate this (To et al. 2017; Fernandez et al. 2013; Akidau et al. 2013; Sebepou and Magoutis 2011). Apex supports both. To handle very large state, the Apex library provides spillable data structures similar to Carbone et al. (2017). Instead of a full-copy local store like RocksDB, spillable data structures in Apex are block structured and by default backed by the distributed file system (DFS), allowing for delta writes and on-demand load (from DFS) when operators are restored.

Flexibility, ease of use, and efficiency have been core design principles for Apex, and checkpointing is no exception. When desired, the user has complete control to define the data structures that represent the operator state. For recovery, Apex will only reset to checkpoint the part of the DAG that is actually affected by a failure. These are examples of features that set Apex apart from other streaming frameworks like Zaharia et al. (2012) and Carbone et al. (2015b). With flexibility at the lower level, higher-level abstractions are provided as part of the library that optimize for specific use cases.

High Availability

High availability in Apex is achieved by extending two primary constructs: leveraging YARN as the cluster manager to handle process and node failure scenarios and using the distributed file system to recover state. An Apex application is associated with an application ID and is either given a new ID or resumes from a previous application identity, as identified by an ID given as a startup parameter. Apex, being a YARN native application, utilizes the YARN resource manager to allocate the application master (AM) at the time of launch. The AM either instantiates a new physical deployment plan using the logical plan or reuses an existing physical plan if an existing application ID is passed at startup. As the DAG is physicalized and starts executing, each of the JVM operators sends heartbeats to the AM using a separate thread. Subsequent to this, there are many scenarios that can result in a fault, with recovery options possible as given in Nasir (2016).

Worker container failure: In case of a container JVM crashing or stalling for a long time due to a GC pause, the AM detects the absence of the heartbeat and initiates a kill (if applicable) and redeploy sequence. The AM would negotiate with YARN for a replacement container and request to kill the old container (if it was stalled). The new instance would then resume from a checkpointed state. It may be noted that the entire sub-DAG downstream of the failed operator will be redeployed as well to ensure the semantics of processing are upheld, especially for cases when the
downstream operators are implementing exactly-once semantics.

AM failure: During an AM crash, the individual worker containers continue processing the tuples, albeit without any process taking care of the recovery and monitoring. In the meantime, YARN would detect that the AM container has gone down, redeploy a new instance of it, and pass the previous application ID as part of the restart process. This new instance would then resume from its checkpointed state. The checkpointed state of the AM is more related to the execution state of the DAG. This new AM container would update its metadata in the distributed file system representing the application ID, which in turn would be picked up by the worker containers to reestablish their heartbeat cycles.

Machine failures: In the event of a physical node failure, the YARN resource manager (RM) would be notified of the node manager (NM) death. This would result in the RM redeploying all containers currently running on that NM. It is possible that there are multiple containers of the application on the failed host, and all of these would be migrated to new host(s) using the process described before.

Resource Management

Apex is a YARN native application that follows the YARN principles of the resource request and grant model. The application master upon launch generates a physical execution plan from the DAG logical representation. The application master itself is a YARN container. The application master then spawns the operator instances by negotiating operator containers from the YARN resource manager, taking into account the memory and compute resource settings in the DAG (Fig. 5).

Deployment models: The Apex operator deployment model allows for four different patterns:

1. Thread – Operator logic is invoked by a single thread in the same JVM.
2. Container – Operators coexist in the same JVM.
3. Host – Operators are available on the same node.
4. Default – Highest cost in terms of serialization and IPC.

Affinity or anti-affinity patterns further allow for customized location preferences while deploying the operators by the application master. Affinity allows for multiple operators to be located on the same host or share the same container or thread.

Partitioning/Scaling

Operator parallelism, a.k.a. partitioning, helps in dealing with the latency and throughput trade-off aspects of a streaming application design. Apex enables developers to concentrate on the business logic and enables a configuration-based approach to scale a logical implementation to a scalable physical deployment model. This approach is referred to as partitioning and is enabled by configuring a partitioner for an operator.

Partitioners
A partitioner implementation specifies how to partition the operator instances. While use cases like specifying a fixed number of partitions
at startup can be met by using one of the partitioners available as part of the framework, Apex allows for custom implementations as well. Such custom implementations can cater to many use cases, like deciding partitions based on the input source system characteristics. As an example, the Kafka partitioners in the Apex library have the capability to scale to as many Kafka partitions as each of the configured Kafka topics has, or alternatively support mapping multiple Kafka partitions to one Apex operator partition. A partitioner can be stateful; it can re-shard the state of the old partitions (Fig. 6).

Unifiers
While partitioning allows for scalability of the operator business logic, it invariably results in an impedance mismatch if the downstream operator is not at the right scaling factor. Apex allows parallelism to vary between operators. In such cases, unifiers are automatically injected into the physical plan and define how the upstream results are shuffled to the new parallelism level. The user can customize the unifier implementation.

Apache Apex, Fig. 6 Unifiers allow for varying parallelism between operators. U1 and U2 unifiers are automatically injected into the DAG at deployment time

Parallel Partitioning
While partitioners and unifiers allow for independent scaling of each operator of an application, there will be use cases where downstream operators can align with the upstream operators' level of parallelism and avoid a unifier, thus further decreasing the overall latencies due to decreased network hops (in some systems, this is also referred to as chaining). This is enabled via a configuration parameter set on the downstream operator, and this parallel partitioning can be configured for as many downstream operators as the application design mandates (Fig. 8).

Dynamic Partitioning
Load variation is a common pattern across many streaming systems, and some of these engines provide for dynamic scaling as given in Floratou et al. (2017) and Bertolucci et al. (2015). Also, the advent of cloud computing, where cost is based on resource consumption models as given in Hummer et al. (2013) and Sattler and Beier (2013), makes a compelling case for the need to dynamically adjust resources based on context. Static partitioning may not be sufficient to achieve latency SLAs or optimize resource consumption. Apex allows partitions to be scaled or contracted dynamically at checkpointing boundaries. A partitioner implementation can use the operational metrics of the physical operator instances to decide on the optimal scaling configuration. Unlike other streaming engines like Spark, which need a stand-alone shuffle service, Apex dynamic partitioning can be aligned with the application patterns.
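The partitioner/unifier interplay described above can be sketched with a toy count computation (illustrative Python only; the function names are assumptions, not Apex APIs): each partition computes a partial result, and a unifier merges the partials for a downstream operator running at different parallelism.

```python
def partitioned_count(tuples, num_partitions, key=hash):
    """Toy partitioner: distribute tuples across operator partitions
    (here by key hash); each partition counts its own share."""
    partials = [0] * num_partitions
    for t in tuples:
        partials[key(t) % num_partitions] += 1
    return partials

def unifier(partials):
    """Toy unifier: merges the partial results so a downstream
    operator with parallelism 1 sees one consolidated value."""
    return sum(partials)

partials = partitioned_count(["a", "b", "c", "d", "e"], num_partitions=3)
total = unifier(partials)
```

With parallel partitioning, the downstream operator would instead run one instance per upstream partition and consume the matching partial directly, so no unifier (and no extra network hop) is needed.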
Apache Apex, Fig. 7 Parallel partitioning can avoid shuffling overheads when it is not necessary

Apache Apex, Fig. 8 Dynamic partitioning allows for scale-up and scale-down strategies at runtime. The example above has different topologies at daytime and nighttime, allowing for efficient use of hardware

Integration Using Apex Library

Apex enables a faster time-to-market model by shipping many connectors to external systems that can act as a source or a sink or both. The Apex library is also referred to as Malhar. Some of the common integrations include JDBC driver-enabled databases like Oracle, MySQL, and Postgres; replayable sources like Kafka and files; NoSQL databases like Cassandra and HBase; and JMS-enabled systems like MQ, besides many others. The library implementations provide value add by not only implementing read/write patterns to these systems but also aiming to provide exactly-once processing.

File systems: Apex operators bring a lot more maturity to the implementation by enabling enterprise patterns like polling directories for new
file arrivals, an offset-tracking-backed sequential scanning approach, resumption from checkpoints in case of failures, handling of varied formats of the file contents and compression formats while writing contents, and even splitting very large files into blocks while reading to enable efficient partitioning strategies.

Kafka: The Kafka integration is one of the battle-tested integrations of Apex. The Kafka operators allow for flexible mapping wherein the mapping configuration can map "m" Kafka partitions across "n" Apex operators. It may be noted that m can represent partitions that may span multiple Kafka clusters and multiple Kafka topics spread across these clusters. Apex Kafka operators also support exactly-once semantics for Kafka versions prior to 0.11, thus taking the burden of a transaction into the Apex processing layer as opposed to relying on Kafka. The integration in Apex thus can be considered far richer as compared to other systems.

Non-replayable sources like JMS: Exactly-once semantics at input sources is a bit more involved in the case of systems like JMS, wherein the source does not have a contract to replay a data point once a handover handshake is complete. Apex provides a WAL implementation which is used by the JMS operator to replay a series of previously acknowledged messages when restarting from a checkpoint.

JDBC-enabled databases: JDBC-enabled databases like Oracle, Postgres, and MySQL allow for an exactly-once write pattern through transactions. The Apex connector can use this to commit transactions at streaming windows or checkpoint boundaries. The JDBC source supports idempotent reads by tracking the offset (for a query with an order by clause).

Conclusion

Apex is an enterprise-grade distributed streaming engine that comes with many foundational constructs as well as connectors that help in faster time to market as compared to some similar systems. Dockerized containers and support for next-generation container orchestration are some of the features that would benefit Apex in the near future.

References

Akidau T et al (2013) MillWheel: fault-tolerant stream processing at internet scale. PVLDB 6:1033–1044
Akidau T et al (2015) The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. PVLDB 8:1792–1803
Apache Apex (2018) https://apex.apache.org/
Apache Calcite (2018) https://calcite.apache.org/
Bertolucci M et al (2015) Static and dynamic big data partitioning on Apache Spark. PARCO
Carbone P et al (2015a) Apache Flink™: stream and batch processing in a single engine. IEEE Data Eng Bull 38:28–38
Carbone P et al (2015b) Lightweight asynchronous snapshots for distributed dataflows. CoRR abs/1506.08603
Carbone P et al (2017) State management in Apache Flink®: consistent stateful distributed stream processing. PVLDB 10:1718–1729
Confluent blog (2018) https://www.confluent.io/blog/ksql-open-source-streaming-sql-for-apache-kafka/
Del Monte B (2017) Efficient migration of very large distributed state for scalable stream processing. PhD@VLDB
Fernandez RC et al (2013) Integrating scale out and fault tolerance in stream processing using operator state management. SIGMOD conference
Floratou A et al (2017) Dhalion: self-regulating stream processing in Heron. PVLDB 10:1825–1836
Hummer W et al (2013) Elastic stream processing in the cloud. Wiley Interdiscip Rev: Data Min Knowl Discov 3:333–345
Jacques-Silva G et al (2016) Consistent regions: guaranteed tuple processing in IBM Streams. PVLDB 9:1341–1352
Kulkarni S et al (2015) Twitter Heron: stream processing at scale. SIGMOD conference
Lin W et al (2016) StreamScope: continuous reliable distributed processing of big data streams. NSDI
Nasir MAU (2016) Fault tolerance for stream processing engines. CoRR abs/1605.00928
Noghabi SA et al (2017) Stateful scalable stream processing at LinkedIn. PVLDB 10:1634–1645
Sattler K-U, Beier F (2013) Towards elastic stream processing: patterns and infrastructure. BD3@VLDB
Sebepou Z, Magoutis K (2011) CEC: continuous eventual checkpointing for data stream processing operators. In: 2011 IEEE/IFIP 41st international conference on dependable systems & networks (DSN), pp 145–156
To Q-C et al (2017) A survey of state management in big data processing systems. CoRR abs/1702.01596
Weise T et al (2017) Learning Apache Apex. Packt Publishing
Zaharia M et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. NSDI
Zaharia M et al (2013) Discretized streams: fault-tolerant streaming computation at scale. SOSP

Apache Flink

Fabian Hueske and Timo Walther
Data Artisans GmbH, Berlin, Germany

Synonyms

Stratosphere platform

Definitions

Apache Flink is a system for distributed batch and stream processing. It addresses many challenges related to the processing of bounded and unbounded data. Among other things, Flink provides flexible windowing support, exactly-once state consistency, event-time semantics, and stateful stream processing. It offers abstractions for complex event processing and continuous queries.

Overview

Today, virtually all data is continuously generated as streams of events. This includes business transactions, interactions with web or mobile applications, sensor or device logs, and database modifications. There are two ways to process continuously produced data, namely batch and stream processing. For stream processing, the data is immediately ingested and processed by a continuously running application as it arrives. For batch processing, the data is first recorded and persisted in a storage system, such as a file system or database system, before it is (periodically) processed by an application that processes a bounded data set. While stream processing typically achieves lower latencies to produce results, it induces operational challenges because streaming applications which run 24/7 make high demands on failure recovery and consistency guarantees.

The most fundamental difference between batch and stream processing applications is that stream processing applications process continuously arriving data. The two core building blocks when processing streaming data are state and time. Applications require state for every non-trivial computation that involves more than a single event. For example, state is required to collect multiple events before performing a computation or to hold the result of partial computations. Time, on the other hand, is important to determine when all relevant data was received and a computation can be performed. Having good control over time enables applications to trade result completeness for latency. Time is not relevant in batch processing because all input data is known and present when the processing starts.

Stateful stream processing is a very versatile architectural pattern that can be applied to a wide spectrum of data-related use cases. Besides providing better latency than batch processing solutions, stateful stream processing applications can also address use cases that are not suitable for batch processing approaches at all. Event-driven applications are stateful stream processing applications that ingest continuous streams of events, apply business logic to them, and emit new events or trigger external actions. These applications share more characteristics with transactional workloads than analytical applications. Another common use case for stateful stream processing is complex event processing (CEP), which is applied to evaluate patterns over event streams.

Historical Background

Apache Flink originates from an academic research project. In 2010, the Stratosphere project was started by five research groups from Technische Universität Berlin, Humboldt
Universität zu Berlin, and Hasso Plattner Institute Potsdam (http://gepris.dfg.de/gepris/projekt/132320961?language=en. Visited on 22 Dec 2017). The goal of the project was to develop novel approaches for large-scale distributed data processing. In the course of the project, the researchers developed the prototype of a data processing system to evaluate the new approaches and released the software as open source under the Apache software license (Alexandrov et al. 2014).

When the project started in 2010, the Apache Hadoop project (https://hadoop.apache.org. Visited on 22 Dec 2017), an open-source implementation of Google's MapReduce (Dean and Ghemawat 2008) and GFS (Ghemawat et al. 2003) publications, had gained a lot of interest in research and industry. MapReduce's strong points were its ability to scale data processing tasks to a large number of commodity machines and its excellent tolerance for hardware and software failures. However, the database research community had also realized that MapReduce was neither the most user-friendly nor the most efficient approach to define complex data analysis applications. Therefore, the Stratosphere project aimed to build a system that combined the advantages of MapReduce and relational database systems. The first result was a system consisting of the PACT programming model, the distributed dataflow processing engine Nephele, and an optimizer that translated PACT programs into Nephele dataflows (Battré et al. 2010). The PACT programming model generalized the MapReduce programming model by providing more parallelizable operator primitives and defining programs as directed acyclic dataflows. Hence, specifying complex analytical applications became much easier (Alexandrov et al. 2011). The runtime operators to execute PACT programs were implemented based on well-known algorithms from database system literature, such as external merge-sort, block-nested loop join, hybrid-hash join, and sort-merge join. Having a choice in runtime operators and data distribution strategies due to the flexibility of the distributed dataflow execution engine Nephele resulted in different alternatives of how a PACT program could be executed. Similar to physical optimization in relational database systems, a cost-based optimizer enumerated execution plans under consideration of interesting and existing physical properties (such as partitioning, sorting, and grouping) and chose the least expensive plan for execution. Compared to the MapReduce programming and execution model, the Stratosphere research prototype provided a programming API that was as versatile as MapReduce but eased the definition of advanced data analysis applications, an efficient runtime based on concepts and algorithms of relational database systems, and a database-style optimizer to automatically choose efficient execution plans. At the same time, the prototype had similar hardware requirements and offered similar scalability as MapReduce. Based on the initial Stratosphere prototype, further research was conducted on leveraging static code analysis of user-defined functions for logical program optimization (Hueske et al. 2012) and the definition and execution of iterative programs to support machine learning and graph analysis applications (Ewen et al. 2012).

In May 2014, the developers of the Stratosphere prototype decided to donate the source code of the prototype to the Apache Software Foundation (ASF) (https://wiki.apache.org/incubator/StratosphereProposal. Visited on 22 Dec 2017). Due to trademark issues, the project was renamed to Apache Flink. Flink is the German word for "nimble" or "swift." The initial group of committers consisted of the eight Stratosphere contributors and six members of the Apache Incubator to teach the Flink community the Apache Way. In August 2014, version 0.6 of Apache Flink was released as the first release under the new name Apache Flink (http://flink.apache.org/news/2014/08/26/release-0.6.html. Visited on 22 Dec 2017). Three months later, in November 2014, version 0.7 was released. This release included a first version of Flink's DataStream API. With this addition, Flink was able to address stream as well as batch processing use cases with a single processing engine (Carbone et al. 2015a). Because the distributed dataflow processor (formerly known as Nephele) had always supported pipelined data transfers, the new stream processing capabilities did not require major changes
[Apache Flink, Fig. 2: a Job Manager (handling metrics collection, monitoring, and checkpointing) coordinates Task Managers, each executing a Source, Map, KeyBy, and Window operator pipeline]
responsible for executing operator pipelines and exchanging data streams.

The example from Fig. 2 is executed with a degree of parallelism equal to 3 and thus occupies three task slots. Task slots split the resources of a TaskManager into a finer granularity and define how many slices of a program can be executed concurrently by a TaskManager; the dotted lines in Fig. 2 highlight the task slots. Each pipeline is independently evaluated. Once an operator's result is ready, it will be forwarded immediately to the next operator without additional synchronization steps. A pipeline consists of one or more subtasks. Depending on the operations, a subtask executes only one operator, such as the window operator, or multiple operators that are
concatenated by so-called operator chaining. In the example, the reading of a record from the source, its transformation, and the extraction of a key are performed in a chain without having to serialize intermediate data. Both the keyBy and the sink operation break the chain, because keyBy requires a shuffling step and the sink has a lower parallelism than its predecessor.

Operators in Flink DataStream programs can be stateful. In our example, the sources need to store the current read offset in the log of the publish-subscribe system to remember which records have been consumed so far. Window operators need to buffer records until the final aggregation at the end of a window can be triggered. The JobManager is responsible for coordinating the creation of consistent checkpoints of a program's state, i.e., the state of all operators, and for restarting the program in case of failures by loading the latest complete checkpoint and replaying parts of a stream. Flink's checkpoint and recovery mechanism will be covered later in more detail.

Time Handling
The handling of time receives special attention in Flink's DataStream API. Most streaming applications need a notion of time to define operations on conceptually never-ending inputs. The previous example included a tumbling time window to discretize the stream into segments containing the records received within an interval of 5 seconds. Sliding windows similarly have a fixed length but allow for a possible overlap, e.g., a window of 1 minute length that is evaluated every 10 seconds. Session windows have a variable length and are evaluated depending on a gap of inactivity. These window types as well as several other operations on data streams depend on a notion of time to specify the interval of the stream that they interact with. Time-based operators need to be able to look up the "current time" while processing a record of a stream. Flink supports two time modes.

Processing time is defined by the local clock of the machine that processes an operator. Performing operations based on the wall-clock time is the easiest notion of time and incurs little overhead. However, it assumes a perfect environment where all records arrive in strict order, can be processed on time without any side effects, and do not need to be reprocessed. In the real world, events might arrive in the wrong order, or large amounts of events might arrive at the same time and thus logically belong to the same window but do not reach the window operator punctually. Moreover, the amount of hardware resources available for processing can affect which records are grouped together. Hence, processing time is not applicable if quality-of-service specifications require that results, which are computed from a log of events, must be reproducible.

Event time addresses the issues above by relying on a timestamp that is associated with every record and by defining a strategy for making progress that guarantees consistent results. Flink adopts the dataflow model (Akidau et al. 2015). In addition to timestamped records, data streams need to be enriched with watermarks in order to leverage event time. Watermarks are metadata records with a timestamp indicating that no record with a timestamp lower than the watermark's timestamp will be received from a connection in the future. Every operator in Flink contains logic to compute a new watermark from the watermarks it receives via its incoming connections. The watermark of an operator acts as its internal clock and triggers computations, such as the computation of a window when the watermark passes its end time. An operator tracks for each incoming connection the highest observed watermark and computes its own watermark as the smallest watermark across the current watermarks of its incoming connections. Whenever an operator advances its own watermark, it performs all computations that are triggered by the new watermark, emits the resulting records, and subsequently broadcasts its new watermark across all its outgoing connections. This mechanism ensures that watermarks are strictly increasing and that emitted records remain aligned with the emitted watermarks.
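The watermark rule just described (track the highest watermark per incoming connection, take the minimum across connections, and only ever move forward) can be sketched in a few lines. The WatermarkTracker class below is an illustration of the mechanism, not Flink's API:

```python
class WatermarkTracker:
    """Watermark logic of an operator with several input connections.

    Per the rule in the text: remember the highest watermark observed on
    each incoming connection; the operator's own watermark is the minimum
    of those per-connection maxima, and it only ever increases.
    """

    def __init__(self, num_inputs):
        # No watermark has been received on any connection yet.
        self.per_input = [float("-inf")] * num_inputs
        self.current = float("-inf")

    def on_watermark(self, input_id, timestamp):
        # Track the highest observed watermark per incoming connection.
        self.per_input[input_id] = max(self.per_input[input_id], timestamp)
        # Own watermark = smallest across all incoming connections.
        candidate = min(self.per_input)
        advanced = candidate > self.current
        if advanced:
            self.current = candidate  # watermarks are strictly increasing
        return advanced  # True -> trigger computations, emit, broadcast


tracker = WatermarkTracker(num_inputs=2)
tracker.on_watermark(0, 15)   # second input still unknown: no progress
tracker.on_watermark(1, 10)   # min(15, 10) = 10: advance to 10
tracker.on_watermark(1, 8)    # late watermark is ignored
print(tracker.current)        # 10
```

A window whose end time lies at or below the operator's current watermark would be evaluated at the moment the watermark advances past it.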
many RocksDB features for checkpointing to a remote file system, and maintaining state that becomes very large (on the order of multiple terabytes). When using the RocksDB state backend, Flink can create checkpoints not only asynchronously but also incrementally, which speeds up the checkpointing operation.

Key Applications

In the past, analytical data was commonly dumped into database systems or distributed file systems and, if at all, processed in nightly or monthly batch jobs. Nowadays, in a globalized, faster-moving world, reacting to events quickly is crucial for economic success. For Flink, processing of unbounded data means handling events when they occur in a non-approximated but accurate fashion. This requires support for event-time processing, exactly-once state consistency, and fault tolerance capabilities. Applications must be scalable to handle both current and increasing data volumes in the future, leading to a parallelized and distributed execution. Moreover, Flink applications can consume data from a variety of sources such as publish-subscribe message queues, file systems, databases, and sockets.

However, the use cases for Flink are not limited to faster data analytics. Its flexibility to model arbitrary dataflows and its precise control over state and time allow a variety of new applications. So-called event-driven applications are becoming more and more popular as part of an overall event-driven architecture. Such applications trigger computations based on incoming events and might store an accumulated history of events to relate it to newly arrived events. Applications can output their results to a database or key-value store, trigger an immediate reaction via an RPC call, or emit a new event which might trigger subsequent actions. In contrast to traditional two-tier architectures, where application and database system separate business logic and data from each other, event-driven applications own computation and state and keep them close together. The benefits of this approach are that (1) state access is always local instead of accessing a remote database, (2) state and computation are scaled together, and (3) applications define the schema of their own state, similar to microservices.

Entire social networks have been built with Flink (Koliopoulos 2017) following the event-driven approach. Incoming user requests are written to a distributed log that acts as the single source of truth. The log is consumed by multiple distributed stateful applications which build up their state independently from each other for aggregated post views, like counts, and statistics. Finally, the updated results are stored in a materialized view that web servers can return to the user.

In addition, Flink offers a domain-specific API for complex event processing (CEP) to detect and react to user-defined patterns in event streams. CEP is useful for fraud or intrusion detection scenarios or to monitor and validate business processes. Furthermore, Flink provides relational APIs to unify queries on bounded and unbounded data streams. With its Table and SQL APIs, Flink can continuously update a result table which is defined by a query on one or more input streams, similar to a materialized view as known from relational database systems.

Cross-References

Apache Spark
Apache Apex
Apache Samza
Continuous Queries
Definition of Data Streams
Introduction to Stream Processing Algorithms
Stream Window Aggregation Semantics and Optimization
Streaming Microservices

References

Akidau T et al (2015) The dataflow model: a practical approach to balancing correctness, latency, and cost in
massive-scale, unbounded, out-of-order data processing. Proc VLDB Endowment 8(12):1792–1803
Alexandrov A, Ewen S, Heimel M, Hueske F, Kao O, Markl V, …, Warneke D (2011) MapReduce and PACT – comparing data parallel programming models. In: BTW, pp 25–44
Alexandrov A, Bergmann R, Ewen S, Freytag JC, Hueske F, Heise A, …, Naumann F (2014) The Stratosphere platform for big data analytics. VLDB J 23(6):939–964
Battré D, Ewen S, Hueske F, Kao O, Markl V, Warneke D (2010) Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In: Proceedings of the 1st ACM symposium on cloud computing. ACM, pp 119–130
Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K (2015a) Apache Flink: stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, vol 36, no 4
Carbone P et al (2015b) Apache Flink: stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, vol 36, no 4
Carbone P et al (2015c) Lightweight asynchronous snapshots for distributed dataflows. CoRR abs/1506.08603. http://arxiv.org/abs/1506.08603
Carbone P, Ewen S, Fóra G, Haridi S, Richter S, Tzoumas K (2017) State management in Apache Flink: consistent stateful distributed stream processing. Proc VLDB Endowment 10(12):1718–1729
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Ewen S, Tzoumas K, Kaufmann M, Markl V (2012) Spinning fast iterative data flows. Proc VLDB Endowment 5(11):1268–1279
Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. ACM SIGOPS Oper Syst Rev 37(5):29–43
Hueske F, Peters M, Sax MJ, Rheinländer A, Bergmann R, Krettek A, Tzoumas K (2012) Opening the black boxes in data flow optimization. Proc VLDB Endowment 5(11):1256–1267
Koliopoulos A (2017) Drivetribe's modern take on CQRS with Apache Flink. Drivetribe. https://data-artisans.com/blog/drivetribe-cqrs-apache-flink. Visited on 7 Sept 2017
Mani Chandy K, Lamport L (1985) Distributed snapshots: determining global states of distributed systems. ACM Trans Comp Syst (TOCS) 3(1):63–75
The Apache Software Foundation. RocksDB | A persistent key-value store. http://rocksdb.org/. Visited on 30 Sept 2017

Recommended Reading

Friedman E, Tzoumas K (2016) Introduction to Apache Flink: stream processing for real time and beyond. O'Reilly Media, Sebastopol. ISBN 1491976586
Hueske F, Kalavri V (2018) Stream processing with Apache Flink: fundamentals, implementation, and operation of streaming applications. O'Reilly Media, Sebastopol. ISBN 149197429X

Apache Hadoop

Architectures

Apache Kafka

Matthias J. Sax
Confluent Inc., Palo Alto, CA, USA

Definitions

Apache Kafka (Apache Software Foundation 2017b; Kreps et al. 2011; Goodhope et al. 2012; Wang et al. 2015; Kleppmann and Kreps 2015) is a scalable, fault-tolerant, and highly available distributed streaming platform that can be used to store and process data streams.

Kafka consists of three main components:

• the Kafka cluster,
• the Connect framework (Connect API),
• and the Streams programming library (Streams API).

The Kafka cluster stores data streams, which are sequences of messages/events continuously produced by applications and sequentially and incrementally consumed by other applications. The Connect API is used to ingest data into Kafka and export data streams to external systems like distributed file systems, databases, and others. For data stream processing, the Streams API allows developers to specify sophisticated stream processing pipelines that read input streams from the Kafka cluster and write results back to Kafka.

Kafka supports many different use case categories such as traditional publish-subscribe
messaging, streaming ETL, and data stream processing.

Overview

A Kafka cluster provides a publish-subscribe messaging service (Fig. 1). Producer clients (publishers) write messages into Kafka, and consumer clients (subscribers) read those messages from Kafka. Messages are stored in Kafka servers called brokers and organized in named topics. A topic is an append-only sequence of messages, also called a log. Thus, each time a message is written to a topic, it is appended to the end of the log. A Kafka message is a key-value pair where key and value are variable-length byte arrays. Additionally, each message has a time stamp that is stored as a 64-bit integer.

Topics are divided into partitions. When a message is written to a topic, the producer must specify the partition for the message. Producers can use any partitioning strategy for the messages they write. By default, messages are hash-partitioned by key, and thus all messages with the same key are written to the same partition. Each message has an associated offset that is the message's position within the partition, i.e., a monotonically increasing sequence number. The message's offset is implicitly determined by the order in which messages are appended to a partition. Hence, each message within a topic is uniquely identified by its partition and offset. Kafka guarantees strict message ordering within a single partition, i.e., it guarantees that all consumers reading a partition receive all messages in the exact same order as they were appended to the partition. There is no ordering guarantee between messages in different partitions or different topics.

Topics can be written by multiple producers at the same time. If multiple producers write to the same partition, their messages are interleaved. If consumers read from the same topic, they can form a so-called consumer group. Within a consumer group, each individual consumer reads data from a subset of partitions. Kafka ensures that each partition is assigned to exactly one consumer within a group. Different consumer groups or consumers that don't belong to any consumer group are independent from each other. Thus, if two consumer groups read the same topic, all messages are delivered to both groups. Because consumers are independent from each other, each consumer can read messages at its own pace. This results in decoupling, a desirable property for a distributed system, and makes the system robust against stragglers. In summary, Kafka supports
Apache Kafka, Fig. 1 Kafka topics are divided into partitions that are ordered sequences of messages. Multiple producers can write simultaneously into the same topic. Consumers track their read progress and can form a consumer group to share the read workload over all consumers within the group (Source: Kleppmann and Kreps 2015)
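The default key-hash partitioning and per-partition offsets described in the Overview can be illustrated with a small in-memory model. This is a sketch of the concepts, not Kafka's client API, and the byte-sum hash is only a deterministic stand-in for the client's key-hashing partitioner:

```python
# Toy model of a Kafka topic: messages with the same key land in the same
# partition, and each partition assigns monotonically increasing offsets
# in append order.

NUM_PARTITIONS = 2
partitions = [[] for _ in range(NUM_PARTITIONS)]  # each partition is a log

def partition_for(key):
    # Deterministic stand-in for the client's key-hashing partitioner.
    return sum(key.encode()) % NUM_PARTITIONS

def produce(key, value):
    p = partition_for(key)
    offset = len(partitions[p])  # offset = position in the append-only log
    partitions[p].append((offset, key, value))
    return p, offset

produce("user-1", "click")
produce("user-2", "click")
produce("user-1", "purchase")  # same key -> same partition, so the
                               # relative order of "user-1" events is kept
print(partitions[partition_for("user-1")])
# [(0, 'user-1', 'click'), (1, 'user-1', 'purchase')]
```

The message (partition, offset) pair returned by produce corresponds to the unique identifier of a message within a topic described above.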
multiple producers and can deliver the same data to multiple consumers.

A tombstone is a message with a null value, and it indicates that all previous updates for the corresponding key can be deleted (including the tombstone itself).
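The tombstone rule above belongs to Kafka's log compaction contract: for each key, at least the most recent value is retained. The following is a minimal sketch of that contract, illustrative rather than Kafka's actual implementation:

```python
# Sketch of the log compaction contract: for each key, keep the most
# recent value; a message with a null (None) value is a tombstone that
# allows all earlier values for that key, and eventually the tombstone
# itself, to be dropped.

def compact(log):
    latest = {}
    for key, value in log:  # later messages overwrite earlier ones
        latest[key] = value
    # Tombstones (None values) mark keys whose history can be deleted.
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("user-1", "addr-A"),
    ("user-2", "addr-B"),
    ("user-1", "addr-C"),  # supersedes addr-A
    ("user-2", None),      # tombstone: user-2 can be removed entirely
]
print(compact(log))        # {'user-1': 'addr-C'}
```

Reading a compacted log therefore still yields the latest value per key, which is the guarantee the offset topic and changelog topics rely on.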
replicate all writes to the leader in the background. If the broker hosting the leader fails, Kafka initiates a leader election via ZooKeeper, and one of the followers becomes the new leader. All clients will be updated with the new leader information and send all their read/write requests to the new leader. If the failed broker recovers, it will rejoin the cluster, and all hosted partitions become followers.

Message Delivery
Reading data from Kafka works somewhat differently compared to traditional messaging/publish-subscribe systems. As mentioned in section "Kafka Brokers," brokers do not delete messages after delivery. This design decision has multiple advantages:

• Brokers do not need to track the reading progress of consumers. This allows for an increased read throughput, as there is no progress tracking overhead for the brokers.
• It allows for in-order message delivery. If brokers track read progress, consumers need to acknowledge which messages they have processed successfully. This is usually done on a per-message basis. Thus, if an earlier message is not processed successfully, but a later message is processed successfully, redelivery of the first message happens out of order.
• Because brokers don't track progress and don't delete data after delivery, consumers can go back in time and reprocess old data again. Also, newly created consumers can retrieve older data.

The disadvantage of this approach is that consumer clients need to track their progress themselves. This happens by storing the offset of the next message a consumer wants to read. This offset is included in the read request to the broker, and the broker will deliver consecutive messages starting at the requested offset to the consumer. After processing all received messages, the consumer updates its offset accordingly and sends the next read request to the cluster.

If a consumer is stopped and restarted later on, it usually should continue reading where it left off. To this end, the consumer is responsible for storing its offset reliably so it can retrieve it on restart. In Kafka, consumers can commit their offsets to the brokers. On offset commit, brokers store consumer offsets reliably in a special topic called the offset topic. The offset topic is also partitioned and replicated and is thus scalable, fault-tolerant, and highly available. This allows Kafka to manage a large number of consumers at the same time. The offset topic is configured with log compaction enabled (cf. section "Log Compaction") to guarantee that offsets are never lost.

Delivery Semantics
Kafka supports multiple delivery semantics, namely, at-most-once, at-least-once, and exactly-once. What semantics a user gets depends on multiple factors like cluster/topic configuration as well as client configuration and user code. It is important to distinguish between the write and read path when discussing delivery semantics. Furthermore, there is the concept of end-to-end processing semantics that applies to Kafka's Streams API. In this section, we will only cover the read and write path and refer to section "Kafka Streams" for end-to-end processing semantics.

Writing to Kafka
When a producer writes data into a topic, brokers acknowledge a successful write to the producer. If a producer doesn't receive an acknowledgement, it can ignore this and follow at-most-once semantics, as the message might not have been written to the topic and thus could be lost. Alternatively, a producer can retry the write, resulting in at-least-once semantics: the first write could have been successful, but the acknowledgement might be lost. Kafka also supports exactly-once writes by exploiting idempotence. For this, each message is internally assigned a unique identifier. The producer attaches this identifier to each message it writes, and the broker stores the identifier as metadata of each message in the topic. In case of a producer write retry, the broker can detect the duplicate write by comparing the message identifiers. This deduplication mechanism is internal to producer and broker and does not guard against application-level duplicates.

Atomic Multi-partition Writes Kafka also supports atomic multi-partition writes that are called transactions. A Kafka transaction is different from a database transaction, and there is no notion of ACID guarantees. A transaction in Kafka is similar to an atomic write of multiple messages that can span different topics and/or partitions. Due to space limitations, we cannot cover the details of transactional writes and can only give a brief overview. A transactional producer is first initialized for transactions by performing a corresponding API call. The broker is then ready to accept transactional writes for this producer. All messages sent by the producer belong to the current transaction and won't be delivered to any consumer as long as the transaction is not completed. Messages within a transaction can be sent to any topic/partition within the cluster. When all messages of a transaction have been sent, the producer commits the transaction. The broker implements a two-phase commit protocol to commit a transaction, and it either successfully "writes" all or none of the messages belonging to a transaction. A producer can also abort a transaction; in this case, none of the messages will be deleted from the log, but all messages will be marked as aborted.

Reading from Kafka
As discussed in section "Message Delivery," consumers need to track their read progress themselves. For fault-tolerance reasons, consumers commit their offsets regularly to Kafka. Consumers can apply two strategies for this: after receiving a message, they can either first commit the offset and process the message afterwards, or they do it in reverse order and first process the message and commit the offset at the end. The commit-first strategy provides at-most-once semantics. If a message is received and the offset is committed before the message is processed, this message would not be redelivered in case of failure: after a failure, the consumer would recover its offsets as the last committed offset and thus would resume reading after the failed message.

In contrast, the process-first strategy provides at-least-once semantics. Because the consumer does not update its offset until after it has successfully processed the received message, it would always fall back to the old offset after recovering from an error. Thus, it would reread the processed message, resulting in potential duplicates in the output.

Transactional Consumers Consumers can be configured to read all messages (including aborted ones) or only committed messages (cf. paragraph "Atomic Multi-partition Writes" in section "Writing to Kafka"). This corresponds to a read uncommitted and read committed mode similar to other transactional systems. Note that aborted messages are not deleted from the topics and will be delivered to all consumers. Thus, consumers in read committed mode will filter/drop aborted messages and not deliver them to the application.

Kafka Connect Framework

Kafka Connect—or the Connect API—is a framework for integrating Kafka with external systems like distributed file systems, databases, key-value stores, and others. Internally, Kafka Connect uses producer/consumer clients as described in section "Kafka Brokers," but the framework implements much of the functionality and best practices that would otherwise have to be implemented for each system. It also allows running in a fully managed, fault-tolerant, and highly available manner.

The Connect API uses a connector to communicate with each type of external system. A source connector continuously reads data from an external source system and writes the records into Kafka, while a sink connector continuously consumes data from Kafka and sends the records to the external system. Many connectors are available for a wide variety of systems, including HDFS, S3, relational databases, document database systems, other messaging systems, file
systems, metric systems, analytic systems, and so on. Developers can create connectors for other systems.

Kafka Connect can be used either standalone or deployed to a cluster of machines. To connect with an external system, a user creates a configuration for a connector that defines the specifics of the external system and the desired behavior, and the user uploads this configuration to one of the Connect workers. The worker, which is running in a JVM, deploys the connector and distributes the connector's tasks across the cluster. Source connector tasks load data from the external system and generate records, which Connect then writes to the corresponding Kafka topics. For sink connector tasks, Connect consumes the specified topics and passes these records to the task, which then is responsible for sending the records to the external system.

Single-Message Transforms
The Connect API allows specifying simple transformations—so-called single-message transforms (SMTs)—for individual messages that are imported/exported into/from Kafka. Those transformations are independent of the connector and allow for stateless operations. Standard transformation functions are already provided by Kafka, but it is also possible to implement custom transformations to perform initial data cleaning. If SMTs are not sufficient because a more complex transformation is required, the Streams API (described in the next section) can be used instead.

Kafka Streams

Kafka Streams—or the Streams API—is the stream processing library of Apache Kafka. It provides a high-level DSL that supports stateless as well as stateful stream processing operators like joins, aggregations, and windowing. The Streams API also supports exactly-once processing guarantees and event-time semantics and handles out-of-order and late-arriving data. Additionally, the Streams API introduces tables as a first-class abstraction next to data streams. It shares a few ideas with Apache Samza (cf. article on "Apache Samza") (Apache Software Foundation 2017c; Noghabi et al. 2017; Kleppmann and Kreps 2015) such as building on top of Kafka's primitives for fault tolerance and scaling. However, there are many notable differences to Samza: for example, Kafka Streams is implemented as a library and can run in any environment, unlike Samza, which is coupled to Apache Hadoop's resource manager YARN (Apache Software Foundation 2017a; Vavilapalli et al. 2013); it also provides stronger processing guarantees and supports both streams and tables as core data abstractions.

Streams and Tables
Most stream processing frameworks provide the abstraction of a record stream that is an append-only sequence of immutable facts. Kafka Streams also introduces the notion of a changelog stream, a mutable collection of data items. A changelog stream can also be described as a continuously updating table. The analogy between a changelog and a table is called the stream-table duality (Kleppmann 2016, 2017). Supporting tables as first-class citizens allows enriching data streams via stream-table joins or populating tables as self-updating caches for an application.

The concept of changelogs aligns with the concept of compacted topics (cf. section "Log Compaction"). A changelog can be stored in a compacted topic, and the corresponding table can be recreated by reading the compacted topic without data loss, as guaranteed by the compaction contract.

Kafka Streams also uses the idea of a changelog stream for (windowed) aggregations. An aggregation of a record stream yields both a table and a changelog stream as a result—not a record stream. Thus, the table always contains the current aggregation result, which is updated for each incoming record. Furthermore, each update to the result table is propagated downstream as a changelog record that is appended to the changelog stream.
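The aggregation semantics just described (each incoming record updates a table and appends an update record to a changelog stream) can be sketched as follows. The function and variable names are invented for illustration and are not the Kafka Streams DSL:

```python
# Sketch of the stream-table duality for an aggregation: counting records
# per key yields a continuously updated table plus a changelog stream
# that carries every update downstream.

def count_by_key(record_stream):
    table = {}       # current aggregation result per key
    changelog = []   # one update record per incoming record
    for key in record_stream:
        table[key] = table.get(key, 0) + 1
        changelog.append((key, table[key]))  # propagate the new value
    return table, changelog

table, changelog = count_by_key(["a", "b", "a"])
print(table)      # {'a': 2, 'b': 1}
print(changelog)  # [('a', 1), ('b', 1), ('a', 2)]

# Replaying the changelog rebuilds the table (later updates win), which
# is how a table can be recovered from a compacted changelog topic.
rebuilt = dict(changelog)
print(rebuilt == table)  # True
```

Note that compacting the changelog (keeping only the last update per key) would not change the rebuilt table, which is the duality the text describes.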
Furthermore, processing time semantics is inherently non-deterministic.

Event time semantics are provided when the record contains a time stamp in its payload or metadata. This time stamp is assigned when a record is created and thus allows for deterministic data reprocessing. Applications may include a time stamp in the payload of the message, but the processing of such time stamps is then application-dependent. For this reason, Kafka supports record metadata time stamps that are stored in topics and automatically set by the producer when a new record is created.

Additionally, Kafka topics can be configured to support ingestion time for time-based operations: ingestion time is the time when data is appended to the topic, i.e., the current broker wall-clock time on write. Ingestion time is an approximation of event time, assuming that data is written to a topic shortly after it was created. It can be used if producer applications don't provide a record metadata time stamp. As ingestion time is an approximation of event time, the Streams API is agnostic to ingestion time (it is treated as a special case of event time).

… an atomic multi-partition write, it ensures that either all writes are committed or all changes are aborted together.

Summary

Apache Kafka is a scalable, fault-tolerant, and highly available distributed streaming platform. It allows fact and changelog streams to be stored and processed, and it exploits the stream-table duality in stream processing. Kafka's transactions allow for exactly-once stream processing semantics and simplify exactly-once end-to-end data pipelines. Furthermore, Kafka can be connected to other systems via its Connect API and can thus be used as the central data hub in an organization.

Cross-References

Apache Flink
Apache Samza
Continuous Queries
References

Kleppmann M (2017) Designing data-intensive applications. O'Reilly Media Inc., Sebastopol
Kleppmann M, Kreps J (2015) Kafka, Samza and the Unix philosophy of distributed data. IEEE Data Eng Bull 38(4):4–14. http://sites.computer.org/debull/A15dec/p4.pdf
Kreps J, Narkhede N, Rao J (2011) Kafka: a distributed messaging system for log processing. In: Proceedings of the NetDB, pp 1–7
Noghabi SA, Paramasivam K, Pan Y, Ramesh N, Bringhurst J, Gupta I, Campbell RH (2017) Samza: stateful scalable stream processing at LinkedIn. Proc VLDB Endow 10(12):1634–1645. https://doi.org/10.14778/3137765.3137770
Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O'Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: 4th ACM symposium on cloud computing (SoCC). https://doi.org/10.1145/2523616.2523633
Wang G, Koshy J, Subramanian S, Paramasivam K, Zadeh M, Narkhede N, Rao J, Kreps J, Stein J (2015) Building a replicated logging system with Apache Kafka. PVLDB 8(12):1654–1655. http://www.vldb.org/pvldb/vol8/p1654-wang.pdf

Apache Mahout

Andrew Musselman
Apache Software Foundation, Seattle, WA, USA

Definitions

Apache Mahout (http://mahout.apache.org) is a distributed linear algebra framework that includes a mathematically expressive domain-specific language (DSL). It is designed to aid mathematicians, statisticians, and data scientists to quickly implement numerical algorithms while focusing on the mathematical concepts in their work, rather than on code syntax. Mahout uses an extensible plug-in interface to systems such as Apache Spark and Apache Flink.

Historical Background

Mahout was founded as a sub-project of Apache Lucene in late 2007 and was promoted to a top-level Apache Software Foundation (ASF) (ASF 2017) project in 2010 (Khudairi 2010). The goal of the project from the outset has been to provide a machine learning framework that was both accessible to practitioners and able to perform sophisticated numerical computation on large data sets.

Mahout has undergone two major stages of architecture design. The first versions relied on the Apache Hadoop MapReduce framework, a popular tool for orchestrating large-scale data flow and computation. Hadoop, while flexible enough to handle many typical workflows, has severe limitations when applied to highly iterative processes, like those used in most machine learning methods.

The Hadoop implementation of MapReduce does not allow for caching of intermediate results across steps of a long computation. This means that algorithms that iteratively use the same data many times are at a severe disadvantage: the framework must read data from disk every iteration rather than hold it in memory. This type of algorithm is common in data science and machine learning; examples include k-means clustering (which has to compute distances for all points in each iteration) and stochastic gradient descent (which has to compute gradients for the same points over many iterations).

Since enabling iterative work on large data sets is a core requirement of a machine learning library geared toward big data, Mahout moved away from Hadoop in its second design phase. Starting with release 0.10.0 (PMC 2015), Mahout switched its computation engine to Apache Spark, a framework designed specifically to facilitate distributed in-memory computation. At the same time, Mahout jobs built using Hadoop MapReduce were deprecated to discourage users from relying on them long-term, and a new interface on the front end was released, named "Samsara," after the Hindu and Buddhist concept of rebirth. This new front end has a simple syntax for matrix math modeled after other systems such as MATLAB and R, to allow maximal readability and quick prototyping. Samsara can be run in an interactive shell where results are displayed inline, enabling a live back-and-forth with code and
Lucene in late 2007 and was promoted to a top- line, enabling a live back-and-forth with code and
Apache Mahout 67
results, much like an interactive SQL or other regard to hardware and compute engines, and
interpreted environment familiar to many users. allows data science and machine learning prac- A
As an example of the Samsara DSL syntax for titioners to focus on the math in their work
calculating, for instance, the transpose of a matrix symbolically.
A multiplied with A itself, a common operation For example, in the distributed stochastic prin-
in machine learning jobs, can be written in Scala cipal component analysis (dSPCA) job in Ma-
code as hout, there is a computation that is expressed
val C = A.t %*% A:
symbolically as
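As a concrete reading of the DSL expression above, the Gramian computation C = Aᵀ A can be sketched for a single machine in plain Java. This is illustrative only: Mahout distributes this work across a cluster, and the class below is not Mahout API.

```java
// Plain-Java, single-machine sketch of the Gramian computation C = A' * A,
// the operation written as `val C = A.t %*% A` in the Samsara DSL.
public class Gramian {
    // Returns C where C[i][j] = sum over rows k of A[k][i] * A[k][j].
    public static double[][] gramian(double[][] a) {
        int rows = a.length, cols = a[0].length;
        double[][] c = new double[cols][cols];
        for (int k = 0; k < rows; k++) {
            for (int i = 0; i < cols; i++) {
                for (int j = 0; j < cols; j++) {
                    c[i][j] += a[k][i] * a[k][j];
                }
            }
        }
        return c;
    }
}
```

The DSL hides exactly this kind of loop nest, along with the decisions about where (JVM, GPU, or a distributed engine) the loops actually run.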
Some of these computations are performed by the compute engines themselves, and others come from solvers purpose-built for specific types of hardware and environments.

Overview of Architecture

The most recent design of Mahout allows for rapid, iterative prototyping in the Samsara DSL, which can be used directly in production with minimal changes. Computation on the back end is adaptable to varied configurations of compute engine and hardware. For example, an algorithm can be handed off to a Spark or Flink engine for computation, and any required matrix arithmetic can be computed in the JVM, on the CPU (using all available cores), or on compatible GPUs, depending on the shape and characteristics of the vectors and matrices being operated on. All these combinations work with the same front-end code with minimal changes.

All the moving parts involved are shown in Fig. 1, moving from the top layer where application code is written, to the front end which hands off computation to the appropriate engine or to native solvers (which perform the computations), the pluggable compute engine layer, and the back end which handles computation according to instructions from compute engines.

Noteworthy for the Mahout user is the Scala DSL layer, which allows for interactive, iterative development that can be captured directly into Scala packages and run as application code in production environments with minimal changes. Also notable is the pluggable compute engine layer, allowing flexibility in deployment as well as future-proofing for when new compute engines become viable and beneficial to users' needs. In fact, code written for the Spark engine, for example, can be directly reused on Flink with minor changes to import and initialization statements in the code.

Another recent introduction to Mahout is a template that simplifies algorithm contributions from the project's maintainers as well as from its users. The interface is modeled after the machine learning training and evaluation patterns found in many R packages and the Python-based scikit-learn package.

As seen in Fig. 2, a new algorithm need only define a fit method in a Fitter class, which populates a Model class, which contains parameter estimates, statistics about the fit, and an overall summary of performance, and which finally uses its predict method to make predictions on new data. The bulk of the work any new algorithm performs is written inside the fit method.
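The fit/predict pattern described above can be sketched in plain Java. The names below (MeanFitter, MeanModel) are invented for illustration; they are not Mahout's actual classes, which are written in Scala against distributed matrices.

```java
// Hypothetical sketch of the Fitter/Model contribution template: fit() does
// the bulk of the work and populates a Model, whose predict() is then applied
// to new data. Not Mahout API; a toy "model" that learns the training mean.
import java.util.Arrays;

public class MeanFitter {
    public static class MeanModel {
        final double mean;  // the parameter estimate produced by fit()
        MeanModel(double mean) { this.mean = mean; }
        // predict() applies the fitted parameter: here, centering new data.
        public double[] predict(double[] xs) {
            return Arrays.stream(xs).map(x -> x - mean).toArray();
        }
    }
    // The bulk of any algorithm's work is written inside fit().
    public MeanModel fit(double[] xs) {
        double mean = Arrays.stream(xs).average().orElse(0.0);
        return new MeanModel(mean);
    }
}
```

The same shape (estimator object with fit, model object with predict) is what makes the template feel familiar to scikit-learn users.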
Key Applications

Practical applications of machine learning usually fall into one of three categories, the so-called Three Cs: collaborative filtering (known commonly as "recommender systems," or "recommenders"), classification, and clustering.
Apache Mahout, Fig. 1 Current architecture of Mahout, showing four main components: the user interface where application code is written, the front end which handles which subsystem will perform computation (which could be native and in-core), the pluggable compute engine layer, and the back end which takes instructions from a compute engine and performs mathematical operations

Recommender systems typically come into play in situations where the goal is to present new items to users which they are likely to need or want. These recommendations are based on historical user behavior across all users and across all items they interact with. Examples include online retail, where increasing customer spending and site interactions is a key business goal. A common use of a recommender is to present examples of products that other customers have bought, calculated from the similarity between other customers' purchases or between the products themselves.

Classifiers cover a broad range of methods which can be summarized as predicting either a categorical value for a user or an item, such as "likely or unlikely to default on a bank loan," or a numerical value such as age or salary. These methods are used across most scientific fields and industrial sectors, for anything from screening for illnesses in medicine, to determining risk in the financial sector, to predictive plant maintenance in manufacturing operations.

Clustering methods are generally used to make sense of groups of users, documents, or other items. Presented as a whole, a data set can be daunting or even impossible to understand as a pile of records full of numbers and values, and it is often desirable to segment the whole set into smaller pieces, each of which collects the most similar items together for closer analysis and inspection. A common use for clustering is with a text corpus that contains documents on a variety of subjects. In order to categorize all the documents into one or more relevant topics which could be used for quicker filing and retrieval, the entire corpus can be analyzed by a clustering method which separates documents into groups, the members of each being relevant to similar topics or concepts.

In real-world applications, the distinction between these methods can be less clear-cut. Often more than one type of machine learning method will be used in combination to achieve further nuance and improve the relevance and effectiveness of predictions. For example, after a corpus of documents is organized into topics, if a new document is added to the corpus, it can effectively be "classified" based on which cluster it is closest to. Similarly, the input to a recommender system in production can be a customer's recent browsing history on the product website. The customer's history can be fed into a classifier to determine which cohort of customers they belong to. This information can then be the input to a recommender built specifically per cohort, so that any products recommended are more relevant to that customer.
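The cohort-based pipeline in the last paragraph can be sketched as a two-stage lookup. Everything here is an invented placeholder (the cohort rule, the product lists); a real deployment would use trained models rather than hard-coded tables.

```java
// Toy sketch of chaining a classifier into per-cohort recommenders:
// 1) classify the customer from their browsing history,
// 2) route to the recommender built for that cohort.
import java.util.List;
import java.util.Map;

public class CohortRecommender {
    // Stand-in "classifier": assign a cohort from history length alone.
    public static String classifyCohort(List<String> history) {
        return history.size() >= 3 ? "frequent" : "occasional";
    }
    // One recommendation list per cohort, keyed by the classifier's output.
    static final Map<String, List<String>> BY_COHORT = Map.of(
            "frequent", List.of("loyalty-bundle", "accessory"),
            "occasional", List.of("starter-kit"));
    public static List<String> recommend(List<String> history) {
        return BY_COHORT.get(classifyCohort(history));
    }
}
```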
Apache Samza

Apache Samza, an open source stream processing framework, can be used for any of the above applications (Kleppmann and Kreps 2015; Noghabi et al. 2017). It was originally developed at LinkedIn, then donated to the Apache Software Foundation in 2013, and became a top-level Apache project in 2015. Samza is now used in production at many Internet companies, including LinkedIn (Paramasivam 2016), Netflix (Netflix Technology Blog 2016), Uber (Chen 2016; Hermann and Del Balso 2017), and TripAdvisor (Calisi 2016).

Samza is designed for usage scenarios that require very high throughput: in some production settings, it processes millions of messages per second or trillions of events per day (Feng 2015; Paramasivam 2016; Noghabi et al. 2017). Consequently, the design of Samza prioritizes scalability and operational robustness above most other concerns.

The core of Samza consists of several fairly low-level abstractions, on top of which high-level operators have been built (Pathirage et al. 2016). However, the core abstractions have been carefully designed for operational robustness, and the scalability of Samza is directly attributable to the choice of these foundational abstractions. The remainder of this article provides further detail on those design decisions and their practical consequences.

Partitioned Log Processing

A Samza job consists of a set of Java Virtual Machine (JVM) instances, called tasks, that each processes a subset of the input data. The code running in each JVM comprises the Samza framework and user code that implements the required application-specific functionality. The primary API for user code is the Java interface StreamTask, which defines a method process(). Figure 1 shows two examples of user classes implementing the StreamTask interface.

Once a Samza job is deployed and initialized, the framework calls the process() method once for every message in any of the input streams. The execution of this method may have various effects, including querying or updating local state and sending messages to output streams. This model of computation is closely analogous to a map task in the well-known MapReduce programming model (Dean and Ghemawat 2004), with the difference that
Apache Samza, Fig. 1 The two operators of a streaming word-frequency counter using Samza’s StreamTask API
(Image source: Kleppmann and Kreps 2015, © 2015 IEEE, reused with permission)
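The two word-count operators that Fig. 1 describes can be sketched in a single process as follows. The simplified StreamTask interface here is a stand-in for illustration only; Samza's real process() method takes an incoming-message envelope, a message collector, and a task coordinator rather than plain strings.

```java
// Single-process sketch of the two word-count operators from Fig. 1.
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

public class WordCount {
    // Simplified stand-in for Samza's StreamTask: one callback per message,
    // with an output callback in place of Samza's MessageCollector.
    interface StreamTask { void process(String message, BiConsumer<String, String> out); }

    // First operator: split each sentence and emit one message per word.
    static class SplitWords implements StreamTask {
        public void process(String message, BiConsumer<String, String> out) {
            for (String word : message.split("\\s+")) out.accept(word, "1");
        }
    }
    // Second operator: keep a count per word (task-local state) and emit updates.
    static class CountWords implements StreamTask {
        final Map<String, Integer> counts = new HashMap<>();
        public void process(String word, BiConsumer<String, String> out) {
            int n = counts.merge(word, 1, Integer::sum);
            out.accept(word, Integer.toString(n));
        }
    }
}
```

In a real deployment the two operators run in separate JVMs, with the emitted (word, count) messages travelling between them through Kafka topics rather than a direct callback.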
Apache Samza, Fig. 2 A Samza task consumes input from one partition, but can send output to any partition (Image
source: Kleppmann and Kreps 2015, © 2015 IEEE, reused with permission)
Apache Samza, Fig. 3 A task’s local state is made durable by emitting a changelog of key-value pairs to Kafka (Image
source: Kleppmann and Kreps 2015, © 2015 IEEE, reused with permission)
becomes the input to another task. Unlike many other stream processing frameworks, Samza does not implement its own message transport layer to deliver messages between stream operators. Instead, Kafka is used for this purpose; since Kafka writes all messages to disk, it provides a large buffer between stages of the processing pipeline, limited only by the available disk space on the Kafka brokers.

Typically, Kafka is configured to retain several days' or weeks' worth of messages in each topic. Thus, if one stage of a processing pipeline fails or begins to run slowly, Kafka can simply buffer the input to that stage while leaving ample time for the problem to be resolved. Unlike system designs based on backpressure, which require a producer to slow down if the consumer cannot keep up, the failure of one Samza job does not affect any upstream jobs that produce its inputs. This fact is crucial for the robust operation of large-scale systems, since it provides fault containment: as far as possible, a fault in one part of the system does not negatively impact other parts of the system.

Messages are dropped only if the failed or slow processing stage is not repaired within the retention period of the Kafka topic. In this case, dropping messages is desirable because it isolates the fault: the alternative – retaining messages indefinitely until the job is repaired – would lead to resource exhaustion (running out of memory or disk space), which would cause a cascading failure affecting unrelated parts of the system.

Thus, Samza's design of using Kafka's on-disk logs for message transport is a crucial factor in its scalability: in a large organization, it is often the case that an event stream produced by one team's job is consumed by one or more jobs that are administered by other teams. The jobs may be operating at different levels of maturity: for example, a stream produced by an important production job may be consumed by several unreliable experimental jobs. Using Kafka as a buffer between jobs ensures that adding an unreliable consumer does not negatively impact the more important jobs in the system.

Finally, an additional benefit of using Kafka for message transport is that every message stream in the system is accessible for debugging and monitoring: at any point, an additional consumer can be attached to inspect the message flow.

Fault-Tolerant Local State

Stateless stream processing, in which any message can be processed independently from any other message, is easy to implement and scale. However, many important applications require that stream processing tasks maintain state. For example:

• when performing a join between two streams, a task must maintain an index of messages seen on each input within some time window, in order to find messages matching the join condition when they arrive;
• when computing a rate (number of events per time interval) or an aggregation (e.g., the sum of a particular field), a task must maintain the current aggregate value and update it based on incoming events;
• when processing an event requires a database query to look up some related data (e.g., looking up a user record for the user who performed the action in the event), the database can also be regarded as stream processor state.

Many stream processing frameworks use transient state that is kept in memory in the processing task, for example, in a hash table. However, such state is lost when a task crashes or when a processing job is restarted (e.g., to deploy a new version). To make the state fault-tolerant, some frameworks such as Apache Flink periodically write checkpoints of the in-memory state to durable storage (Carbone et al. 2015); this approach is reasonable when the state is small, but it becomes expensive as the state grows (Noghabi et al. 2017).

Another approach, used, for example, by Apache Storm, is to use an external database or key-value store for any processor state that needs to be fault-tolerant. This approach carries a severe performance penalty: due to network latency, accessing a database on another node is orders of magnitude slower than accessing local in-process state (Noghabi et al. 2017). Moreover, a high-throughput stream processor can easily overwhelm the external database with queries; if the database is shared with other applications, such overload risks harming the performance of other applications to the point that they become unavailable (Kreps 2014).

In response to these problems, Samza pioneered an approach to managing state in a stream task that avoids the problems of both checkpointing and remote databases. Samza's approach to providing fault-tolerant local state has subsequently been adopted in the Kafka Streams framework (see the article on Apache Kafka).

Samza allows each task to maintain state on the local disk of the processing node, with an in-memory cache for frequently accessed items. By default, Samza uses RocksDB, an embedded key-value store that is loaded into the JVM process of the stream task, but other storage engines can also be plugged in its place. In Fig. 1, the CountWords task accesses this managed state through the KeyValueStore interface. For workloads with good locality, Samza's RocksDB with cache provides performance close to in-memory stores; for random-access workloads on large state, it remains significantly faster than accessing a remote database (Noghabi et al. 2017).

If a job is cleanly shut down and restarted, for example, to deploy a new version, Samza's host affinity feature tries to launch each StreamTask instance on the machine that has the appropriate RocksDB store on its local disk (subject to available resources). Thus, in most cases the state survives a task restart without any further action. However, in some cases – for example, if a processing node suffers a full system failure – the state on the local disk may be lost or rendered inaccessible.

In order to survive the loss of local disk storage, Samza again relies on Kafka. For each store containing state of a stream task, Samza creates a Kafka topic called a changelog that serves as a replication log for the store. Every write to the local RocksDB store is also encoded as a message and published to this topic, as illustrated in Fig. 3. These writes can be performed asynchronously in batches, enabling much greater throughput than synchronous random-access requests to a remote data store. The write queue only needs to be flushed when the offsets of the input streams are written to durable storage, as described in the last section.

When a Samza task needs to recover its state after the loss of local storage, it reads all messages in the appropriate partition of the changelog topic and applies them to a new RocksDB store. When this process completes, the result is a new copy of the store that contains the same data as the store that was lost. Since Kafka replicates all data across multiple nodes, it is suitable for fault-tolerant durable storage of this changelog.

If a stream task repeatedly writes new values for the same key in its local storage, the changelog contains many redundant messages, since only the most recent value for a given key is required in order to restore local storage. To remove this redundancy, Samza uses a Kafka feature called log compaction on the changelog topic.
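The changelog mechanism described above can be reduced to a toy in-process sketch. The class below stands in for RocksDB plus a Kafka changelog topic; it is not Samza's actual storage code, only an illustration of "every local write is also appended to a log, and replaying the log rebuilds the store."

```java
// Toy sketch of changelog-based fault-tolerant local state.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChangelogStore {
    final Map<String, String> local = new HashMap<>();                // stands in for RocksDB
    final List<Map.Entry<String, String>> changelog = new ArrayList<>(); // stands in for Kafka

    public void put(String key, String value) {
        local.put(key, value);
        changelog.add(Map.entry(key, value));  // replicate the write to the log
    }
    // Rebuild an empty store by replaying the changelog in order. Later entries
    // for a key overwrite earlier ones, which is why only the most recent value
    // per key needs to be retained (the guarantee log compaction provides).
    public static Map<String, String> restore(List<Map.Entry<String, String>> log) {
        Map<String, String> rebuilt = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : log) rebuilt.put(e.getKey(), e.getValue());
        return rebuilt;
    }
}
```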
With log compaction enabled, Kafka runs a background process that searches for messages with the same key and discards all but the most recent of those messages. Thus, whenever a key in the store is overwritten with a new value, the old value is eventually removed from the changelog. However, any key that is not overwritten is retained indefinitely by Kafka. This compaction process, which is very similar to internal processes in log-structured storage engines, ensures that the storage cost and recovery time from a changelog correspond to the size of the state, independently of the total number of messages ever sent to the changelog (Kleppmann 2017).

Cluster-Based Task Scheduling

When a new stream processing job is started, it must be allocated computing resources: CPU cores, RAM, disk space, and network bandwidth. Those resources may need to be adjusted from time to time as load varies and reclaimed when a job is shut down.

At large organizations, hundreds or thousands of jobs need to run concurrently. At such scale, it is not practical to manually assign resources: task scheduling and resource allocation must be automated. To maximize hardware utilization, many jobs and applications are deployed to a shared pool of machines, with each multi-core machine typically running a mixture of tasks from several different jobs.

This architecture requires infrastructure for managing resources and for deploying the code of processing jobs to the machines on which it is to be run. Some frameworks, such as Storm and Flink, have built-in mechanisms for resource management and deployment. However, frameworks that perform their own task scheduling and cluster management generally require a static assignment of computing resources – potentially even dedicated machines – before any jobs can be deployed to the cluster. This static resource allocation leads to inefficiencies in machine utilization and limits the ability to scale on demand (Kulkarni et al. 2015).

By contrast, Samza relies on existing cluster management software, which allows Samza jobs to share a pool of machines with non-Samza applications. Samza supports two modes of distributed operation:

• A job can be deployed to a cluster managed by Apache Hadoop YARN (Vavilapalli et al. 2013). YARN is a general-purpose resource scheduler and cluster manager that can run stream processors, MapReduce batch jobs, data analytics engines, and various other applications on a shared cluster. Samza jobs can be deployed to existing YARN clusters without requiring any special cluster-level configuration or resource allocation.
• Samza also supports a stand-alone mode in which a job's JVM instances are deployed and executed through some external process that is not under Samza's control. In this case, the instances use Apache ZooKeeper (Junqueira et al. 2011) to coordinate their work, such as assigning partitions of the input streams.

The stand-alone mode allows Samza to be integrated with an organization's existing deployment and cluster management tools or with cloud computing platforms: for example, Netflix runs Samza jobs directly as EC2 instances on Amazon Web Services (AWS), relying on the existing cloud facilities for resource allocation (Paramasivam 2016). Moreover, Samza's cluster management interface is pluggable, enabling further integrations with other technologies such as Mesos (Hindman et al. 2011).

With large deployments, an important concern is resource isolation, that is, ensuring that each process receives the resources it requested and that a misbehaving process cannot starve colocated processes of resources. When running in YARN, Samza supports the Linux cgroups feature to enforce limits on the CPU and memory use of stream processing tasks. In virtual machine environments such as EC2, resource isolation is enforced by the hypervisor.
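As one illustration of the coordination work mentioned for the stand-alone mode, spreading input-stream partitions evenly across a job's JVM instances can be sketched as a simple round-robin rule. This rule is hypothetical, chosen only to show the shape of the problem; it is not Samza's actual assignment algorithm, and the real coordination happens through ZooKeeper.

```java
// Hypothetical round-robin assignment of input partitions to JVM instances.
import java.util.ArrayList;
import java.util.List;

public class PartitionAssigner {
    // Returns, for each of `instances` JVMs, the list of partition ids it owns.
    public static List<List<Integer>> assign(int partitions, int instances) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int i = 0; i < instances; i++) owned.add(new ArrayList<>());
        for (int p = 0; p < partitions; p++) owned.get(p % instances).add(p);
        return owned;
    }
}
```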
Summary

Apache Samza is a stream processing framework that is designed to provide high throughput and operational robustness at very large scale. Efficient resource utilization requires a mixture of different jobs to share a multi-tenant computing infrastructure. In such an environment, the primary challenge in providing robust operation is fault isolation, that is, ensuring that a faulty process cannot disrupt correctly running processes and that a resource-intensive process cannot starve others.

Samza isolates stream processing jobs from each other in several different ways. By using Kafka's on-disk logs as a large buffer between producers and consumers of a stream, instead of backpressure, Samza ensures that a slow or failed consumer does not affect upstream jobs. By providing fault-tolerant local state as a common abstraction, Samza improves performance and avoids reliance on external databases that might be overloaded by high query volume. Finally, by integrating with YARN and other cluster managers, Samza builds upon existing resource scheduling and isolation technology that allows a cluster to be shared between many different applications without risking resource starvation.

Cross-References

Apache Flink
Apache Kafka
Continuous Queries

References

Calisi L (2016) How to convert legacy Hadoop Map/Reduce ETL systems to Samza streaming. https://www.youtube.com/watch?v=KQ5OnL2hMBY
Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K (2015) Apache Flink: stream and batch processing in a single engine. IEEE Data Eng Bull 38(4):28–38. http://sites.computer.org/debull/A15dec/p28.pdf
Chen S (2016) Scalable complex event processing on Samza @Uber. https://www.slideshare.net/ShuyiChen2/scalable-complex-event-processing-on-samza-uber
Das S, Botev C, Surlaker K, Ghosh B, Varadarajan B, Nagaraj S, Zhang D, Gao L, Westerman J, Ganti P, Shkolnik B, Topiwala S, Pachev A, Somasundaram N, Subramaniam S (2012) All aboard the Databus! LinkedIn's scalable consistent change data capture platform. In: 3rd ACM symposium on cloud computing (SoCC). https://doi.org/10.1145/2391229.2391247
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: 6th USENIX symposium on operating system design and implementation (OSDI)
Feng T (2015) Benchmarking Apache Samza: 1.2 million messages per second on a single node. http://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
Goodhope K, Koshy J, Kreps J, Narkhede N, Park R, Rao J, Ye VY (2012) Building LinkedIn's real-time activity data pipeline. IEEE Data Eng Bull 35(2):33–45. http://sites.computer.org/debull/A12june/A12JUN-CD.pdf
Hermann J, Del Balso M (2017) Meet Michelangelo: Uber's machine learning platform. https://eng.uber.com/michelangelo/
Hindman B, Konwinski A, Zaharia M, Ghodsi A, Joseph AD, Katz R, Shenker S, Stoica I (2011) Mesos: a platform for fine-grained resource sharing in the data center. In: 8th USENIX symposium on networked systems design and implementation (NSDI)
Junqueira FP, Reed BC, Serafini M (2011) Zab: high-performance broadcast for primary-backup systems. In: 41st IEEE/IFIP international conference on dependable systems and networks (DSN), pp 245–256. https://doi.org/10.1109/DSN.2011.5958223
Kleppmann M (2017) Designing data-intensive applications. O'Reilly Media. ISBN:978-1-4493-7332-0
Kleppmann M, Kreps J (2015) Kafka, Samza and the Unix philosophy of distributed data. IEEE Data Eng Bull 38(4):4–14. http://sites.computer.org/debull/A15dec/p4.pdf
Kreps J (2014) Why local state is a fundamental primitive in stream processing. https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing
Kreps J, Narkhede N, Rao J (2011) Kafka: a distributed messaging system for log processing. In: 6th international workshop on networking meets databases (NetDB)
Kulkarni S, Bhagat N, Fu M, Kedigehalli V, Kellogg C, Mittal S, Patel JM, Ramasamy K, Taneja S (2015) Twitter Heron: stream processing at scale. In: ACM international conference on management of data (SIGMOD), pp 239–250. https://doi.org/10.1145/2723372.2723374
Netflix Technology Blog (2016) Kafka inside Keystone pipeline. http://techblog.netflix.com/2016/04/kafka-inside-keystone-pipeline.html
Noghabi SA, Paramasivam K, Pan Y, Ramesh N, Bringhurst J, Gupta I, Campbell RH (2017) Samza: stateful scalable stream processing at LinkedIn. Proc VLDB Endow 10(12):1634–1645. https://doi.org/10.14778/3137765.3137770
Paramasivam K (2016) Stream processing with Apache Samza – current and future. https://engineering.linkedin.com/blog/2016/01/whats-new-samza
Pathirage M, Hyde J, Pan Y, Plale B (2016) SamzaSQL: scalable fast data management with streaming SQL. In: IEEE international workshop on high-performance big data computing (HPBDC), pp 1627–1636. https://doi.org/10.1109/IPDPSW.2016.141
Qiao L, Auradar A, Beaver C, Brandt G, Gandhi M, Gopalakrishna K, Ip W, Jgadish S, Lu S, Pachev A, Ramesh A, Surlaker K, Sebastian A, Shanbhag R, Subramaniam S, Sun Y, Topiwala S, Tran C, Westerman J, Zhang D, Das S, Quiggle T, Schulman B, Ghosh B, Curtis A, Seeliger O, Zhang Z (2013) On brewing fresh Espresso: LinkedIn's distributed data serving platform. In: ACM international conference on management of data (SIGMOD), pp 1135–1146. https://doi.org/10.1145/2463676.2465298
Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O'Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: 4th ACM symposium on cloud computing (SoCC). https://doi.org/10.1145/2523616.2523633
Wang G, Koshy J, Subramanian S, Paramasivam K, Zadeh M, Narkhede N, Rao J, Kreps J, Stein J (2015) Building a replicated logging system with Apache Kafka. Proc VLDB Endow 8(12):1654–1655. https://doi.org/10.14778/2824032.2824063

Apache Spark

Alexandre da Silva Veith and Marcos Dias de Assuncao
Inria Avalon, LIP Laboratory, ENS Lyon, University of Lyon, Lyon, France

Definitions

Apache Spark is a cluster computing solution and in-memory processing framework that extends the MapReduce model to support other types of computations such as interactive queries and stream processing (Zaharia et al. 2012). Designed to cover a variety of workloads, Spark introduces an abstraction called Resilient Distributed Datasets (RDDs) that enables running computations in memory in a fault-tolerant manner. RDDs, which are immutable and partitioned collections of records, provide a programming interface for performing operations, such as map, filter, and join, over multiple data items. For fault-tolerance purposes, Spark records all transformations carried out to build a dataset, thus forming a lineage graph.

Overview

Spark (Zaharia et al. 2016) is an open-source big data framework originally developed at the University of California at Berkeley and later adopted by the Apache Foundation, which has maintained it ever since. Spark was designed to address some of the limitations of the MapReduce model, especially the need for fast processing of large datasets. By using RDDs, purposely designed to store restricted amounts of data in memory, Spark enables performing computations more efficiently than MapReduce, which runs computations on disk.

Although the project contains multiple components, at its core (Fig. 1) Spark is a computing engine that schedules, distributes, and monitors applications comprising multiple tasks across nodes of a computing cluster (Karau et al. 2015). For cluster management, Spark supports its native Spark cluster (standalone), Apache YARN (Vavilapalli et al. 2013), or Apache Mesos (Hindman et al. 2011). At the core also lies the RDD abstraction: RDDs are sets of data items distributed across the cluster nodes that can be manipulated in parallel. At a higher level, Spark provides support for multiple tightly integrated components for handling various types of workloads such as SQL, streaming, and machine learning.

Figure 2 depicts an example of a word count application using Spark's Scala API for manipulating datasets. Spark computes RDDs in a lazy fashion, the first time they are used. Hence, the code in the example is evaluated when the counts are saved to disk, at which point the results of the computation are required. Spark can also read data from
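The lazy evaluation just described can be illustrated without Spark at all, since java.util.stream pipelines are lazy in a closely analogous way. The sketch below is plain Java, not Spark's API: the map and filter stages merely describe the computation, and nothing runs until the terminal collect() forces it, much as RDD transformations are only computed when an action needs their result.

```java
// Plain-Java analogy for lazy RDD evaluation using java.util.stream.
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipeline {
    // Upper-cases every word and keeps those longer than three characters.
    // `evaluated` counts how many times the map stage has actually run.
    public static List<String> longWordsUpper(List<String> words, AtomicInteger evaluated) {
        Stream<String> pipeline = words.stream()
                .map(w -> { evaluated.incrementAndGet(); return w.toUpperCase(); })
                .filter(w -> w.length() > 3);
        // At this point `evaluated` is still 0: the pipeline is only a description.
        return pipeline.collect(Collectors.toList()); // terminal operation runs it
    }
}
```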
Spark APIs shine both during exploratory work and when engineering a solution deployed in production. Over the years, much effort has been paid toward making APIs easier to use and to optimize, for instance, the introduction of the Dataset API to avoid certain performance degradations that could occur if a user did not properly design the chain of operations executed by the Spark engine. As mentioned earlier, considerable work has also focused on creating new libraries and packages, including for processing live streams. The discretized model employed by Spark's stream processing API, however, introduces some delays depending on the time length of micro-batches. More recent and emerging application scenarios, such as analysis of vehicular traffic and networks, monitoring of operational infrastructure, wearable assistance (Ha et al. 2014), and 5G services, require data processing and service response under very short delays. Spark can be used as part of a larger service workflow, but alone it does not provide means to address some of the challenging scenarios that require very short response times. This requires the use of other frameworks such as Apache Storm.

To reduce the latency of applications delivered to users, many service components are also increasingly being deployed at the edges of the Internet (Hu et al. 2015) under a model commonly called edge computing. Some frameworks are available for processing streams of data using resources at the edge (e.g., Apache Edgent (https://edgent.apache.org/)), whereas others are emerging. There are also frameworks that aim to provide high-level programming abstractions for creating dataflows (e.g., Apache Beam (https://beam.apache.org/)) with underlying execution engines for several processing solutions. There is, however, a lack of unifying solutions on programming models and abstractions for exploring resources from both cloud and edge computing deployments, as well as scheduling and resource management tools for deciding what processing tasks need to be offloaded to the edge.

Cross-References

Cloud Computing for Big Data Analysis
Definition of Data Streams
Graph Processing Frameworks
Hadoop
Large-Scale Graph Processing System
Programmable Memory/Processing in Memory (PIM)
Streaming Microservices

References

Alsheikh MA, Niyato D, Lin S, Tan H-P, Han Z (2016) Mobile big data analytics using deep learning and Apache Spark. IEEE Netw 30(3):22–29
Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A, Zaharia M (2015) Spark SQL: relational data processing in Spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data (SIGMOD'15). ACM, New York, pp 1383–1394
Freeman J, Vladimirov N, Kawashima T, Mu Y, Sofroniew NJ, Bennett DV, Rosen J, Yang C-T, Looger LL, Ahrens MB (2014) Mapping brain activity at scale with cluster computing. Nat Methods 11(9):941–950
Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I (2014) GraphX: graph processing in a distributed dataflow framework. In: OSDI, vol 14, pp 599–613
Ha K, Chen Z, Hu W, Richter W, Pillai P, Satyanarayanan M (2014) Towards wearable cognitive assistance. In: 12th annual international conference on mobile systems, applications, and services, MobiSys'14. ACM, New York, pp 68–81. https://doi.org/10.1145/2594368.2594383
Hindman B, Konwinski A, Zaharia M, Ghodsi A, Joseph AD, Katz R, Shenker S, Stoica I (2011) Mesos: a platform for fine-grained resource sharing in the data center. In: NSDI, vol 11, pp 22–22
Hu YC, Patel M, Sabella D, Sprecher N, Young V (2015) Mobile edge computing – a key technology towards 5G. ETSI White Paper 11(11):1–16
Karau H, Konwinski A, Wendell P, Zaharia M (2015) Learning Spark: lightning-fast big data analysis. O'Reilly Media, Inc., Beijing
Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S et al (2016) MLlib: machine learning in Apache Spark. J Mach Learn Res 17(1):1235–1241
Ryza S, Laserson U, Owen S, Wills J (2017) Advanced analytics with Spark: patterns for learning from data at scale. O'Reilly Media, Inc., Sebastopol
Shah MA, Hellerstein JM, Chandrasekaran S, Franklin MJ (2003) Flux: an adaptive partitioning operator for continuous query systems. In: 19th international conference on data engineering (ICDE 2003). IEEE Computer Society, pp 25–36
Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O'Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: 4th annual symposium on cloud computing (SOCC'13). ACM, New York, pp 5:1–5:16. https://doi.org/10.1145/2523616.2523633
Wu Y, Tan KL (2015) ChronoStream: elastic stateful stream computation in the cloud. In: 2015 IEEE 31st international conference on data engineering, pp 723–734
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: 9th USENIX conference on networked systems design and implementation (NSDI'12). USENIX Association, Berkeley, pp 2–2
Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: 24th ACM symposium on operating systems principles (SOSP'13). ACM, New York, pp 423–438
Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: a unified engine for big data processing. Commun ACM 59(11):56–65

Apache Spark Benchmarking

SparkBench

Apache SystemML

Declarative Large-Scale Machine Learning

Matthias Boehm
IBM Research – Almaden, San Jose, CA, USA

Definitions

Apache SystemML (Ghoting et al. 2011; Boehm et al. 2016) is a system for declarative, large-scale machine learning (ML) that aims to increase the productivity of data scientists. ML algorithms are expressed in a high-level language with R- or Python-like syntax, and the system automatically generates efficient, hybrid execution plans of single-node CPU or GPU operations, as well as distributed operations using data-parallel frameworks such as MapReduce (Dean and Ghemawat 2004) or Spark (Zaharia et al. 2012). SystemML's high-level abstraction provides the necessary flexibility to specify custom ML algorithms while ensuring physical data independence, independence of the underlying runtime operations and technology stack, and scalability for large data. Separating the concerns of algorithm semantics and execution plan generation is essential for the automatic optimization of execution plans regarding different data and cluster characteristics, without the need for algorithm modifications in different deployments.

Overview

In SystemML (Ghoting et al. 2011; Boehm et al. 2016), data scientists specify their ML algorithms using a language with R- or Python-like syntax. This language supports abstract data types for scalars, matrices, and frames, and operations such as linear algebra, element-wise operations, aggregations, indexing, and statistical operations, but also control structures such as loops, branches, and functions. These scripts are parsed into a hierarchy of statement blocks and statements, where control flow delineates the individual blocks. For each block of statements, the system then compiles DAGs (directed acyclic graphs) of high-level operators (HOPs), which is the core internal representation of SystemML's compiler. Size information such as matrix dimensions and sparsity is propagated via intra- and inter-procedural analysis from the inputs through the entire program. This size information is then used to compute memory estimates per operator and accordingly select physical operators, resulting in a DAG of low-level operators (LOPs). These LOP DAGs are finally compiled into executable instructions.
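How such size information drives compilation can be illustrated with a small sketch (hypothetical names; SystemML's real estimates also account for sparsity, block headers, and operator-specific intermediates) that computes worst-case dense memory estimates and picks a local or distributed execution type:

```python
def mem_estimate_dense(rows, cols, bytes_per_cell=8):
    # worst-case estimate for one dense double-precision matrix, in bytes
    return rows * cols * bytes_per_cell

def choose_execution_type(shapes, driver_budget_bytes):
    # shapes: (rows, cols) of all inputs and outputs of one operator;
    # fall back to a distributed backend if the estimate exceeds the budget
    need = sum(mem_estimate_dense(r, c) for r, c in shapes)
    return "local" if need <= driver_budget_bytes else "distributed"

# t(X) %*% y with a dense 10^7 x 10^3 matrix X needs roughly 80.08 GB,
# which exceeds a 10 GB driver budget
exec_type = choose_execution_type(
    [(10**7, 10**3), (10**7, 1), (10**3, 1)], 10 * 2**30)
```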
Example: As an example, consider the following script for linear regression via a closed-form method that computes and solves the normal equations:
1: X = read($X);
2: y = read($Y);
3: lambda = 0.001;
4: if( $icpt == 1 )
5: X = cbind(X, matrix(1, nrow(X), 1));
6: I = matrix(1, ncol(X), 1);
7: A = t(X) %*% X + diag(I) * lambda;
8: b = t(X) %*% y;
9: beta = solve(A, b);
10: write(beta, $B);
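The same closed-form computation can be mirrored in plain Python (an illustrative sketch only; SystemML compiles and executes the DML script above, and `linreg_ds` is a hypothetical name echoing the linregDS function used later in this entry):

```python
def linreg_ds(X, y, lam=0.001, icpt=True):
    # optionally append a column of 1s for the intercept (cbind in DML)
    if icpt:
        X = [row + [1.0] for row in X]
    n, d = len(X), len(X[0])
    # normal equations: A = X'X + lam*I, b = X'y
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(d)]
    # solve A beta = b by Gaussian elimination and back substitution
    for i in range(d):
        for j in range(i + 1, d):
            f = A[j][i] / A[i][i]
            for k in range(i, d):
                A[j][k] -= f * A[i][k]
            b[j] -= f * b[i]
    beta = [0.0] * d
    for i in reversed(range(d)):
        beta[i] = (b[i] - sum(A[i][k] * beta[k]
                              for k in range(i + 1, d))) / A[i][i]
    return beta
```

For example, fitting y = 2x + 1 from four points recovers slope and intercept up to the small ridge term lambda.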
This script reads the feature matrix X and labels y, optionally appends a column of 1s to X for computing the intercept, computes the normal equations, and finally solves the resulting linear system of equations. SystemML then compiles, for example, for lines 6–10, a HOP DAG that contains logical operators such as matrix multiplications for XᵀX and Xᵀy. Given input metadata (e.g., let X be a dense 10^7 × 10^3 matrix), the compiler also computes memory estimates for each operation (e.g., 80.08 GB for Xᵀy). If the memory estimate of an operation exceeds the driver memory budget, this operation is scheduled for distributed execution, and appropriate physical operators are selected (e.g., mapmm as a broadcast-based operator for Xᵀy).

Static and Dynamic Rewrites: SystemML's optimizer applies a broad range of optimizations throughout its compilation chain. An important class of optimizations with high performance impact is rewrites. SystemML applies static and dynamic rewrites (Boehm et al. 2014a). Static rewrites are size-independent and include traditional programming language techniques – such as common subexpression elimination, constant folding, and branch removal – algebraic simplifications for linear algebra, as well as backend-specific transformations such as caching and partitioning directives for Spark. For instance, in the above example, after constant propagation, constant folding, and branch removal, the block of lines 4–5 is removed if $icpt==0, which further allows unconditional size propagation and the merge of the entire program into a single HOP DAG. Furthermore, the expression diag(matrix(1, ncol(X), 1)) * lambda is simplified to diag(matrix(lambda, ncol(X), 1)), which avoids unnecessary operations and intermediates. SystemML's rewrite system contains hundreds of such rewrites, some of which even change the asymptotic behavior (e.g., trace(X %*% Y) → sum(X * t(Y))). Additional dynamic rewrites are size-dependent because they require sizes for cost estimation or validity constraints. Examples are matrix multiplication chain optimization as well as dynamic simplification rewrites such as sum(X^2) → t(X) %*% X if ncol(X) = 1. The former exploits the associativity of matrix multiplications and aims to find an optimal parenthesization, for which SystemML applies a textbook dynamic programming algorithm (Boehm et al. 2014a).

Operator Selection and Fused Operators: Another important class of optimizations is the selection of execution types and physical operators (Boehm et al. 2016). SystemML's optimizer analyzes the memory budgets of the driver and executors and selects – based on worst-case memory estimates (Boehm et al. 2014b) – local or distributed execution types. Besides data and cluster characteristics (e.g., data size/shape, memory, and parallelism), the compiler also considers matrix and operation properties (e.g., diagonal/symmetric matrices, sparse-safe operations) as well as data flow properties (e.g., co-partitioning and data locality). Depending on the chosen execution types, different physical operators are considered with a selection preference from local or special-purpose to shuffle-based operators. For example, the multiplication XᵀX from line 7 allows for a special transpose-self operator (tsmm), which is an easily parallelizable unary operator that exploits the output symmetry for less computation. For distributed operations, this operator has a block size constraint because it requires access to entire rows. If special-purpose or broadcast-based operators do not apply, the compiler falls back to the shuffle-based cpmm and rmm operators (Ghoting et al. 2011). Additionally, SystemML replaces special patterns with hand-coded fused operators to avoid unnecessary intermediates (Huang et al. 2015; Elgamal et al. 2017) and unnecessary scans (Boehm et al. 2014a; Ashari et al. 2015; Elgamal et al. 2017), as well as to exploit sparsity across chains of operations (Boehm et al. 2016; Elgamal et al. 2017). For example, computing the weighted squared loss via sum(W * (X − U %*% t(V))^2) would create huge dense intermediates. In contrast, sparsity-exploiting operators leverage the sparse driver (i.e., the sparse matrix W and the sparse-safe multiply *) for selective computation that only considers nonzeros in W.

Dynamic Recompilation: Dynamic rewrites and operator selection rely on size information for memory estimates, cost estimation, and validity constraints. Hence, unknown dimensions or sparsity lead to conservative fallback plans. Example scenarios are complex conditional control flow or function call graphs, user-defined functions, data-dependent operations, computed size expressions, and changing sizes or sparsity, as shown in the following example:

1: while( continue ) {
2:   parfor( i in 1:n ) {
3:     if( fixed[1,i] == 0 ) {
4:       X = cbind(Xg, Xorig[,i]);
5:       AIC[1,i] = linregDS(X,y); }}
6:   #select & append best to Xg
7: }

This example originates from a stepwise linear regression algorithm for feature selection that iteratively selects additional features and calls the previously introduced regression algorithm. SystemML addresses this challenge of unknown sizes via dynamic recompilation (Boehm et al. 2014a) that recompiles subplans – at the granularity of HOP DAGs – with exact size information of intermediates during runtime. During initial compilation, operations and DAGs with unknowns are marked for dynamic recompilation, which also includes splitting DAGs after data-dependent operators. During runtime, the recompiler then deep copies the HOP DAG, updates sizes, applies dynamic rewrites, recomputes memory estimates, generates new runtime instructions, and resumes execution with the intermediate results computed so far. This approach yields good plans and performance even in the presence of initial unknowns.

Runtime Integration: At runtime level, SystemML interprets the generated instructions. Single-node and Spark operations directly map to instructions, whereas for the MapReduce backend, instructions are packed into a minimal number of MR jobs. For distributed operations, matrices (and similarly frames) are stored in a blocked representation of pairs of block indexes and blocks with fixed block size, where individual blocks can be dense, sparse, or ultra-sparse. In contrast, for single-node operations, the entire matrix is represented as a single block, which allows reusing the block runtime across backends. Data transfers between the local and distributed backends and driver memory management are handled by a multilevel buffer pool (Boehm et al. 2016) that controls local evictions, parallelizes and collects RDDs, creates broadcasts, and reads/writes data from/to the distributed file system. For example, a single-node instruction first pins its inputs into memory – which triggers reads from HDFS or RDDs if necessary – performs the block operation, registers the output in the buffer pool, and finally unpins its inputs. Many block operations and the I/O system for different formats are multi-threaded to exploit parallelism in scale-up environments. For compute-intensive deep
learning workloads, SystemML further calls native CPU and GPU libraries for BLAS and DNN operations. In contrast to other deep learning frameworks, SystemML also supports sparse neural network operations. Memory management for the GPU device is integrated with the buffer pool, allowing for lazy data transfer on demand. Finally, SystemML uses numerically stable operations based on Kahan addition for descriptive statistics and certain aggregation functions (Tian et al. 2012).

Key Research Findings

In addition to the compiler and runtime techniques described so far, there are several advanced techniques with high performance impact.

Task-Parallel Parfor Loops: SystemML's primary focus is data parallelism. However, there are many use cases such as ensemble learning, cross validation, hyper-parameter tuning, and complex models with disjoint or overlapping data that are naturally expressed in a task-parallel manner. These scenarios are addressed by SystemML's parfor construct for parallel for loops (Boehm et al. 2014b). In contrast to similar constructs in high-performance computing (HPC), parfor only asserts the independence of iterations, and a dedicated parfor optimizer reasons about hybrid parallelization strategies that combine data and task parallelism. Reconsider the stepwise linear regression example. Alternative plan choices include (1) a local, i.e., multi-threaded, parfor with local operations, (2) a remote parfor that runs the entire loop as a distributed Spark job, or (3) a local parfor with concurrent data-parallel Spark operations if the data does not fit into the driver.

Resource Optimization: The selection of execution types and operators is strongly influenced by the memory budgets of the driver and executor processes. Finding a good static cluster configuration that works well for a broad range of ML algorithms and data sizes is a hard problem. SystemML addresses this challenge with a dedicated resource optimizer for automatic resource provisioning (Huang et al. 2015) on resource negotiation frameworks such as YARN or Mesos. The key idea is to optimize resource configurations via an online what-if analysis with regard to the given ML program as well as data and cluster characteristics. This framework optimizes performance without unnecessary overprovisioning, which can increase throughput in shared on-premise clusters and save money in cloud environments.

Compressed Linear Algebra: Furthermore, there is a broad class of iterative ML algorithms that use repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. For these algorithms, it is crucial for performance to fit the data into available single-node or distributed memory. However, general-purpose, lightweight, and heavyweight compression techniques struggle to achieve both good compression ratios and fast decompression to enable block-wise uncompressed operations. Compressed linear algebra (CLA) (Elgohary et al. 2016) tackles this challenge by applying lightweight database compression techniques – for column-wise compression with heterogeneous encoding formats and co-coding – to matrices and executing linear algebra operations such as matrix-vector multiplications directly on the compressed representation. CLA yields compression ratios similar to heavyweight compression and thus allows fitting large datasets into memory while achieving operation performance close to the uncompressed case.

Automatic Operator Fusion: Similar to query compilation and loop fusion in databases and HPC, the opportunities for fused operators – in terms of fused chains of operations – are ubiquitous. Example benefits are a reduced number of intermediates, a reduced number of scans, and sparsity exploitation across operations. Despite their high performance impact, hand-coded fused operators are usually limited to few operators and incur a large development effort. Automatic operator fusion via code generation (Elgamal et al. 2017) overcomes this challenge
by automatically determining valid fusion plans and generating access-pattern-aware operators in the form of hand-coded skeletons with custom body code. In contrast to existing work on operator fusion, SystemML introduced a cost-based optimizer framework to find optimal fusion plans in DAGs of linear algebra programs for dense, sparse, and compressed data as well as local and distributed operations.

Examples of Application

SystemML has been applied in a variety of ML applications including statistics, classification, regression, clustering, matrix factorization, survival analysis, and deep learning. In contrast to specialized systems for graphs like GraphLab (Low et al. 2012) or deep learning like TensorFlow (Abadi et al. 2016), SystemML provides a unified system for small- to large-scale problems with support for dense, sparse, and ultra-sparse data, as well as local and distributed operations. Accordingly, SystemML's primary application area is an environment with diverse algorithms, varying data characteristics, or different deployments.

Example deployments include large-scale computation on top of MapReduce or Spark and programmatic APIs for notebook environments or embedded scoring. Thanks to deployment-specific compilation, ML algorithms can be reused without script changes. This flexibility enabled the integration of SystemML into systems with different architectures. For example, SystemML has been shipped as part of the open-source project R4ML and the IBM products BigInsights, Data Science Experience (DSX), and multiple Watson services.

Future Directions for Research

throughout the stack of SystemML and similar systems (Kumar et al. 2017):

Specification Languages: SystemML focuses on optimizing fixed algorithm specifications. However, end-to-end applications would further benefit from even higher levels of abstractions for feature engineering, model selection, and life cycle management in general. A promising direction is a stack of declarative languages that allows for reuse and cross-level optimization.

Optimization Techniques: Regarding the automatic optimization of ML programs, further work is required regarding size and sparsity estimates, adaptive query processing and storage (as an extension of dynamic recompilation), and principled approaches to automatic rewrites and automatic operator fusion.

Runtime Techniques: Better support for deep learning and scientific applications requires the extension from matrices to dense/sparse tensors of different data types and their operations. Additionally, further research is required regarding the automatic exploitation of accelerators and heterogeneous hardware.

Benchmarks: Finally, SystemML – but also the community at large – would benefit from dedicated benchmarks for the different classes of ML workloads and different levels of specification languages.

Cross-References

Apache Mahout
Cloud Computing for Big Data Analysis
Hadoop
Python
Scalable Architectures for Big Data Analysis
Tools and Libraries for Big Data Analysis
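The task-parallel parfor construct discussed under "Key Research Findings" can be approximated in plain Python with a worker pool; this is only an analogy (SystemML's parfor optimizer additionally chooses among multi-threaded local and distributed Spark plans):

```python
from concurrent.futures import ThreadPoolExecutor

def parfor(n, body, degree=4):
    # parfor-style loop: iterations must be independent, so they can
    # be dispatched to a pool of workers in any order; map() returns
    # the results in loop order regardless of completion order
    with ThreadPoolExecutor(max_workers=degree) as pool:
        return list(pool.map(body, range(1, n + 1)))

# hypothetical loop body: score one candidate model per iteration
scores = parfor(8, lambda i: i * i)
```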
infectious diseases" (https://www.cdcfoundation.org/what-public-health). More recently, there has been increasing interest in the concept of population health, i.e., "the health outcomes of a group of individuals, including the distribution of such outcomes within the group" (Kindig and Stoddart 2003). Both public and population health focus on measures of health in an overall community, with public health typically focusing on health promotion and disease prevention and population health on defining current and target distributions of health outcomes across a population, typically to evaluate the impact of policy (e.g., government funding or insurance coverage).

The multiple measures provide insight into aspects of and interactions between the elements of individual, public, and population health, defined above.

Geography provides a key context for all three types of health; specifically, the locations where one lives and works contribute greatly to one's individual exposome and to the range of environmental and social determinants of disease (and health) associated with the health of the local population. As a result, big spatial data play a key (and expanding) role in the definition, description, and understanding of health.
Key Research Findings
information" (PHI) and its use within the healthcare system. HIPAA lists individual location data at a finer resolution than broadscale (3-digit) US ZIP codes among its list of PHI data elements, requiring care and documentation in the collection, storage, visualization, and reporting of geographic information at this and finer resolutions.

The competitive environment of US healthcare systems and the proprietary environment of providers of electronic health record (EHR) systems result in considerable informatics challenges when individual patients interact with multiple providers across different units. Some interesting progress has been made toward informatics solutions (e.g., the Fast Healthcare Interoperability Resources (FHIR)) to address these administrative barriers to data sharing and allow data sharing through app-based platforms (Mandel et al. 2016), but this remains an active area of research in the field of medical informatics, particularly for FHIR extensions involving geolocation data.

Future Directions for Research

The examples above illustrate that health applications of big spatial data technologies draw from traditional spatial analyses within individual, population, and public health but also illustrate that, to date, such work typically occurs in application-specific settings. The development of cohesive concepts and associated accurate and reliable toolboxes for spatial analytics and their health applications remains an area of active research. These developments include many challenging issues relating to accurate healthcare monitoring, assessment, and provision for each individual patient. Health applications extend well beyond that of individual health and provide detailed insight into population and public health as well. There is great need for creativity in informatics to allow connection between, linkage across, and searches of elements of spatially referenced health and health-related measurements. As noted above, "health-related" can range from within-individual measures of immune response to satellite-based assessments of local environmental conditions or global models of climate impact on pollution levels or food sources. Robust and reproducible spatial health analysis requires coordination of expert knowledge across multiple disciplines in order to fully reach its vast potential.

Cross-References

Big Data for Health
Privacy-Preserving Record Linkage
Spatio-Social Data
Spatiotemporal Data: Trajectories
Using Big Spatial Data for Planning User Mobility

References

Brownstein JS, Freifeld CC, Madoff LC (2009) Digital disease detection: harnessing the web for public health surveillance. N Engl J Med 360:2153–2157
Estrin D, Sim I (2010) Open mHealth architecture: an engine for health care innovation. Science 330:759–760
Goodchild MF (1992) Geographic information science. Int J Geogr Inf Syst 6:31–45
Kindig D, Stoddart G (2003) What is population health? Am J Public Health 93:380–383
Kitron U (1998) Landscape ecology and epidemiology of vector-borne diseases: tools for spatial analysis. J Med Entomol 35:435–445
Krieger N (2001) Theories for social epidemiology in the 21st century: an ecosocial perspective. Int J Epidemiol 30:668–677
Lazar D, Kennedy R, King G, Vespignani A (2014) The parable of Google Flu: traps in big data analysis. Science 343:1203–1205
Liu Y, Sarnat JA, Kilaru V, Jacob DJ, Koutrakis P (2005) Estimating ground-level PM2.5 in the Eastern United States using satellite remote sensing. Environ Sci Technol 39:3269–3278
Mandel JC, Kreda DA, Mandl KD, Kohane IS, Romoni RB (2016) SMART on FHIR: a standards-based, interoperable apps platform for electronic health records. J Am Med Inform Assoc 23:899–908
Miller GM, Jones DP (2014) The nature of nurture: refining the definition of the exposome. Toxicol Sci 137:1–2
Murdoch TB, Detsky AS (2013) The inevitable application of big data to health care. J Am Med Assoc 309:1351–1352
Murray CJL, Lopez AD (1997) Alternative projections of mortality and disability by cause 1990–2020: Global Burden of Disease Study. Lancet 349:1498–1504
Nilsen W, Kumar S, Shar A, Varoquiers C, Wiley T, Riley WT, Pavel M, Atienza AA (2012) Advancing the science of mHealth. J Health Commun 17(supplement 1):5–10
Shaddick G, Thomas ML, Green A, Brauer M, van Donkelaar A, Burnett R, Chang HH, Cohen A, van Dingenen R, Dora C, Gumy S, Liu Y, Martin R, Waller LA, West J, Zidek JV, Pruss-Ustun A (2017) Data integration model for air quality: a hierarchical approach to the global estimation of exposures to air pollution. J R Stat Soc Ser C 67:231–253
Sui D, Elwood S, Goodchild M (eds) (2013) Crowdsourcing geographic knowledge: volunteered geographic information in theory and practice. Springer, Dordrecht
Vazquez-Prokopec GM, Stoddard ST, Paz-Soldan V, Morrison AC, Elder JP, Kochel TJ, Scott TW, Kitron U (2009) Usefulness of commercially available GPS data-loggers for tracking human movement and exposure to dengue virus. Int J Health Geogr 8:68. https://doi.org/10.1186/1476-072X-8-68
Waller LA, Gotway CA (2004) Applied spatial statistics for public health data. Wiley, Hoboken
Wild CP (2005) Complementing the genome with the "exposome": the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomark Prev 14:1847–1850

Approximate Computation

Tabular Computation

Approximate Computing for Stream Analytics

Do Le Quoc¹, Ruichuan Chen²,⁴, Pramod Bhatotia³, Christof Fetzer¹, Volker Hilt², and Thorsten Strufe¹
¹TU Dresden, Dresden, Germany
²Nokia Bell Labs, Stuttgart, Germany
³The University of Edinburgh and Alan Turing Institute, Edinburgh, UK
⁴Nokia Bell Labs, NJ, USA

Introduction

Stream analytics systems are extensively used in the context of modern online services to transform continuously arriving raw data streams into useful insights (Foundation 2017a; Murray et al. 2013; Zaharia et al. 2013). These systems target low-latency execution environments with strict service-level agreements (SLAs) for processing the input data streams.

In current deployments, the low-latency requirement is usually achieved by employing more computing resources. Since most stream processing systems adopt a data-parallel programming model (Dean and Ghemawat 2004), almost linear scalability can be achieved with increased computing resources (Quoc et al. 2013, 2014, 2015a, b). However, this scalability comes at the cost of ineffective utilization of computing resources and reduced throughput of the system. Moreover, in some cases, processing the entire input data stream would require more than the available computing resources to meet the desired latency/throughput guarantees.

To strike a balance between these two desirable but contradictory design requirements – low latency and efficient utilization of computing resources – the approximate computing paradigm has emerged as a novel design point that resolves this tension. In particular, approximate computing is based on the observation that many data analytics jobs are amenable to an approximate rather than the exact output (Doucet et al. 2000; Natarajan 1995). For such workflows, it is possible to trade the output accuracy by computing over a subset instead of the entire data stream. Since computing over a subset of input requires less time and computing resources, approximate computing can achieve desirable latency and computing resource utilization.

Unfortunately, the advancements in approximate computing are primarily geared towards batch analytics (Agarwal et al. 2013; Srikanth et al. 2016), where the input data remains unchanged during the course of computation. In particular, these systems rely on pre-computing a set of samples on the static database and take an appropriate sample for the query execution based on the user's requirements (i.e., query execution budget). Therefore, the state-of-the-art systems cannot be deployed in the context of stream processing, where new data continuously arrives as an unbounded stream.
Modern stream processing systems follow two models: (i) the batched stream processing model and (ii) the pipelined stream processing model. These systems offer three main advantages: (a) efficient fault tolerance, (b) "exactly-once" semantics, and (c) a unified programming model for both batch and stream analytics. The proposed algorithm for approximate computing is generalizable to both stream processing models and preserves their advantages.

Design Assumptions

The design of STREAMAPPROX relies on two assumptions:

1. There exists a virtual cost function that translates a user-specified query budget (e.g., available computing resources, desired accuracy, or latency) into the sample size for each sub-stream.
2. The input stream is stratified based on the source of data items, i.e., the data items from each sub-stream follow the same distribution and are mutually independent. Here, a stratum refers to one sub-stream. If multiple sub-streams have the same distribution, they are combined to form a stratum.

Online Adaptive Stratified Reservoir Sampling

To realize real-time stream analytics, a novel sampling technique called Online Adaptive Stratified Reservoir Sampling (OASRS) is proposed. It achieves both stratified and reservoir sampling without their drawbacks. Specifically, OASRS does not overlook any sub-streams regardless of their popularity, does not need to know the statistics of sub-streams before the sampling process, and runs efficiently in real time in a distributed manner.

The high-level idea of OASRS is simple. The algorithm first stratifies the input stream into sub-streams according to their sources. The data items from each sub-stream are assumed to follow the same distribution and are mutually independent. The algorithm then samples each sub-stream independently and performs reservoir sampling for each sub-stream individually. To do so, every time a new sub-stream S_i is encountered, its sample size N_i is determined according to an adaptive cost function considering the specified query budget. For each sub-stream S_i, the algorithm performs the traditional reservoir sampling to select items at random from this sub-stream and ensures that the total number of selected items from S_i does not exceed its sample size N_i. In addition, the algorithm maintains a counter C_i to measure the number of items received from S_i within the concerned time interval.

Applying reservoir sampling to each sub-stream S_i ensures that the algorithm can randomly select at most N_i items from each sub-stream. The selected items from different sub-streams, however, should not be treated equally. In particular, for a sub-stream S_i, if C_i > N_i (i.e., the sub-stream S_i has more than N_i items in total during the concerned time interval), the algorithm randomly selects N_i items from this sub-stream, and each selected item represents C_i/N_i original items on average; otherwise, if C_i <= N_i, the algorithm selects all the received C_i items, so that each selected item represents only itself. As a result, in order to statistically recreate the original items from the selected items, the algorithm assigns a specific weight W_i to the items selected from each sub-stream S_i:

    W_i = C_i / N_i   if C_i > N_i
    W_i = 1           if C_i <= N_i     (1)

STREAMAPPROX supports approximate linear queries, which return an approximate weighted sum of all items received from all sub-streams. Though linear queries are simple, they can be extended to support a large range of statistical learning algorithms (Blum et al. 2005, 2008). It is also worth mentioning that OASRS not only works for a concerned time interval (e.g., a sliding time window) but also works with unbounded data streams.

Distributed execution. OASRS can run in a distributed fashion naturally, as it does not require synchronization. One straightforward approach is to have each sub-stream S_i handled by a set of w worker nodes. Each worker node samples an equal portion of items from this sub-stream and generates a local reservoir of size no larger than N_i/w. In addition, each worker node maintains a local counter to measure the number of its received items within a concerned time interval for weight calculation. The rest of the design remains the same.

Error Estimation

This section describes how to apply OASRS to randomly sample the input data stream to generate approximate results for linear queries. Next, a method to estimate the accuracy of approximate results via rigorous error bounds is presented.

Similar to section "Online Adaptive Stratified Reservoir Sampling", suppose the input data stream contains X sub-streams {S_i}_{i=1}^{X}. STREAMAPPROX computes the approximate sum of all items received from all sub-streams by randomly sampling only Y_i items from each sub-stream S_i. As each sub-stream is sampled independently, the variance of the approximate sum is: Var(SUM) = Σ_{i=1}^{X} Var(SUM_i).
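The procedure described above (one reservoir and one counter per sub-stream, plus the weights of Eq. (1)) can be sketched in plain Python. This is an illustrative single-process sketch, not the authors' implementation; the class and method names are hypothetical, and the per-stratum sample size N_i is fixed here rather than driven by a cost function:

```python
import random

class OASRSSketch:
    """Illustrative sketch of OASRS: one reservoir and one counter per sub-stream."""

    def __init__(self, sample_size, seed=None):
        self.sample_size = sample_size  # N_i: here a fixed budget per sub-stream
        self.reservoirs = {}            # stratum id -> list of sampled items
        self.counters = {}              # stratum id -> C_i, items seen this interval
        self.rng = random.Random(seed)

    def observe(self, stratum, item):
        """Feed one item of a sub-stream through per-stratum reservoir sampling."""
        c = self.counters.get(stratum, 0) + 1
        self.counters[stratum] = c
        reservoir = self.reservoirs.setdefault(stratum, [])
        if len(reservoir) < self.sample_size:
            reservoir.append(item)      # reservoir not yet full: keep the item
        else:
            j = self.rng.randrange(c)   # classic reservoir replacement step
            if j < self.sample_size:
                reservoir[j] = item

    def weight(self, stratum):
        """Eq. (1): W_i = C_i / N_i if C_i > N_i, else 1."""
        c = self.counters.get(stratum, 0)
        n = len(self.reservoirs.get(stratum, ()))
        return c / n if c > n else 1.0

    def estimated_sum(self):
        """Weighted sum over all reservoirs, approximating the true stream sum."""
        return sum(self.weight(s) * sum(r) for s, r in self.reservoirs.items())
```

Because every sub-stream keeps its own reservoir, no sub-stream is overlooked regardless of its popularity, and no prior statistics are needed; the weights statistically recreate the unsampled items.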
Further, as items are randomly selected for a sample within each sub-stream, according to the random sampling theory (Thompson 2012), the variance of the approximate sum can be estimated as:

    Var(SUM) = Σ_{i=1}^{X} C_i (C_i − Y_i) · s_i² / Y_i     (2)

Here, C_i denotes the total number of items from the sub-stream S_i, and s_i denotes the standard deviation of the sub-stream S_i's sampled items:

    s_i² = (1 / (Y_i − 1)) · Σ_{j=1}^{Y_i} (I_{i,j} − Ī_i)²,  where  Ī_i = (1 / Y_i) · Σ_{j=1}^{Y_i} I_{i,j}     (3)

Next, the estimation of the variance of the approximate mean value of all items received from all the X sub-streams is described. This approximate mean value can be computed as:

    MEAN = SUM / Σ_{i=1}^{X} C_i = Σ_{i=1}^{X} (C_i · MEAN_i) / Σ_{i=1}^{X} C_i = Σ_{i=1}^{X} (ω_i · MEAN_i)     (4)

Here, ω_i = C_i / Σ_{i=1}^{X} C_i. Then, as each sub-stream is sampled independently, according to the random sampling theory (Thompson 2012), the variance of the approximate mean value can be estimated as:

    Var(MEAN) = Σ_{i=1}^{X} Var(ω_i · MEAN_i) = Σ_{i=1}^{X} ω_i² · Var(MEAN_i) = Σ_{i=1}^{X} ω_i² · (s_i² / Y_i) · ((C_i − Y_i) / C_i)     (5)

Above, the estimation of the variances of the approximate sum and the approximate mean of the input data stream has been shown. Similarly, the variance of the approximate results of any linear query can also be estimated by applying the random sampling theory.

Error bound. According to the "68-95-99.7" rule (Wikipedia 2017), the approximate result falls within one, two, and three standard deviations away from the true result with probabilities of 68%, 95%, and 99.7%, respectively, where the standard deviation is the square root of the variance as computed above. This error estimation is critical because it gives a quantitative understanding of the accuracy of the proposed sampling technique.

Discussion

The design of STREAMAPPROX is based on the assumptions mentioned in section "Design Assumptions". This section discusses some approaches that could be used to meet those assumptions.

I: Virtual cost function. This work currently assumes that there exists a virtual cost function to translate a user-specified query budget into the sample size. The query budget could be specified as either available computing resources, desired accuracy, or latency.

For instance, with an accuracy budget, the sample size for each sub-stream can be determined based on a desired width of the confidence interval using Eq. (5) and the "68-95-99.7" rule. With a desired latency budget, users can specify it by defining the window time interval or the slide interval for the computations over the input data stream. It becomes a bit more challenging to specify a budget for resource utilization. Nevertheless, there are two existing techniques that could be used to implement such a cost function to achieve the desired resource target: (a) the virtual data center (Angel et al. 2014) and (b) a resource prediction model (Wieder et al. 2012) for latency requirements.
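Equations (2) and (3), combined with the "68-95-99.7" rule, yield a short error-estimation routine. The following is an illustrative sketch; the function name and input format are assumptions, not the authors' API:

```python
import math
from statistics import mean, variance

def sum_with_error_bound(strata, num_devs=2):
    """Estimate the stream sum and an error bound from per-stratum samples.

    strata: list of (c_i, sample_i) pairs, where c_i is the number of items
    received from sub-stream S_i and sample_i holds its y_i sampled items
    (non-empty). num_devs=2 gives a ~95% bound under the 68-95-99.7 rule.
    """
    est_sum = 0.0
    est_var = 0.0
    for c_i, sample in strata:
        y_i = len(sample)
        est_sum += c_i * mean(sample)                # stratum total: C_i times sample mean
        s2_i = variance(sample) if y_i > 1 else 0.0  # Eq. (3), sample variance
        est_var += c_i * (c_i - y_i) * s2_i / y_i    # Eq. (2)
    return est_sum, num_devs * math.sqrt(est_var)
```

When a stratum is fully sampled (c_i == y_i), its variance term vanishes, matching the intuition that nothing was approximated for that sub-stream.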
Pulsar (Angel et al. 2014) proposes an abstraction of a virtual data center (VDC) to provide performance guarantees to tenants in the cloud. In particular, Pulsar makes use of a virtual cost function to translate the cost of request processing into the required computational resources using a multi-resource token algorithm. The cost function could be adapted for STREAMAPPROX as follows: a data item in the input stream is considered as a request, and the "amount of resources" required to process it as the cost in tokens. Also, the given resource budget is converted into tokens using the pre-advertised cost model per resource. This allows computing the sample size that can be processed within the given resource budget.

For any given latency requirement, a resource prediction model (Wieder et al. 2010a,b, 2012) could be employed. In particular, the prediction model could be built by analyzing the diurnal patterns in resource usage (Charles et al. 2012) to predict the future resource requirement for the given latency budget. This resource requirement can then be mapped to the desired sample size based on the same approach as described above.

II: Stratified sampling. This work currently assumes that the input stream is already stratified based on the source of data items, i.e., the data items within each stratum follow the same distribution – it does not have to be a normal distribution. This assumption ensures that the error estimation mechanism still holds correct, since STREAMAPPROX applies the Central Limit Theorem. For example, consider an IoT use case which analyzes data streams from sensors to measure the temperature of a city. The data stream from each individual sensor follows the same distribution since it measures the temperature at the same location in the city. Therefore, a straightforward way to stratify the input data streams is to consider each sensor's data stream as a stratum (sub-stream). In more complex cases where STREAMAPPROX cannot classify strata based on the sources, the system needs a pre-processing step to stratify the input data stream. This stratification problem is orthogonal to this work; nevertheless, for completeness, two proposals for the stratification of evolving streams, bootstrap (Dziuda 2010) and semi-supervised learning (Masud et al. 2012), are discussed in this section.

Bootstrap (Dziuda 2010) is a well-studied non-parametric sampling technique in statistics for the estimation of the distribution of a given population. In particular, the bootstrap technique randomly selects "bootstrap samples" with replacement to estimate the unknown parameters of a population, for instance, by averaging the bootstrap samples. A bootstrap-based estimator can be employed for the stratification of incoming sub-streams. Alternatively, a semi-supervised algorithm (Masud et al. 2012) could be used to stratify a data stream. The advantage of this algorithm is that it can work with both labeled and unlabeled streams to train a classification model.

Related Work

Over the last two decades, the databases community has proposed various approximation techniques based on sampling (Al-Kateb and Lee 2010; Garofalakis and Gibbon 2001), online aggregation (Hellerstein et al. 1997), and sketches (Cormode et al. 2012). These techniques make different trade-offs w.r.t. the output quality, supported queries, and workload. However, the early work in approximate computing was mainly geared towards the centralized database architecture.

Recently, sampling-based approaches have been successfully adopted for distributed data analytics (Agarwal et al. 2013; Srikanth et al. 2016; Krishnan et al. 2016; Quoc et al. 2017b,a). In particular, BlinkDB (Agarwal et al. 2013) proposes an approximate distributed query processing engine that uses stratified sampling (Al-Kateb and Lee 2010) to support ad hoc queries with error and response time constraints. Like BlinkDB, Quickr (Srikanth et al. 2016) also supports complex ad hoc queries in big-data clusters. Quickr deploys distributed sampling operators to reduce execution costs of
Cormode G, Garofalakis M, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Foundations and Trends in Databases. Now, Boston
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)
Doucet A, Godsill S, Andrieu C (2000) On sequential Monte Carlo sampling methods for Bayesian filtering. Stat Comput 10:197–208
Dziuda DM (2010) Data mining for genomics and proteomics: analysis of gene and protein expression data. Wiley, Hoboken
Foundation AS (2017a) Apache Flink. https://flink.apache.org
Foundation AS (2017b) Apache Spark Streaming. https://spark.apache.org/streaming
Foundation AS (2017c) Kafka – a high-throughput distributed messaging system. https://kafka.apache.org
Garofalakis MN, Gibbon PB (2001) Approximate query processing: taming the terabytes. In: Proceedings of the international conference on very large data bases (VLDB)
Hellerstein JM, Haas PJ, Wang HJ (1997) Online aggregation. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD)
Krishnan DR, Quoc DL, Bhatotia P, Fetzer C, Rodrigues R (2016) IncApprox: a data analytics system for incremental approximate computing. In: Proceedings of the 25th international conference on World Wide Web (WWW)
Masud MM, Woolam C, Gao J, Khan L, Han J, Hamlen KW, Oza NC (2012) Facing the reality of data stream classification: coping with scarcity of labeled data. Knowl Inf Syst 33:213–244
Murray DG, McSherry F, Isaacs R, Isard M, Barham P, Abadi M (2013) Naiad: a timely dataflow system. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles (SOSP)
Natarajan S (1995) Imprecise and approximate computation. Kluwer Academic Publishers, Boston
Quoc DL, Martin A, Fetzer C (2013) Scalable and real-time deep packet inspection. In: Proceedings of the 2013 IEEE/ACM 6th international conference on utility and cloud computing (UCC)
Quoc DL, Yazdanov L, Fetzer C (2014) Dolen: user-side multi-cloud application monitoring. In: International conference on future internet of things and cloud (FiCloud)
Quoc DL, D'Alessandro V, Park B, Romano L, Fetzer C (2015a) Scalable network traffic classification using distributed support vector machines. In: Proceedings of the 2015 IEEE 8th international conference on cloud computing (CLOUD)
Quoc DL, Fetzer C, Felber P, Rivière É, Schiavoni V, Sutra P (2015b) Unicrawl: a practical geographically distributed web crawler. In: Proceedings of the 2015 IEEE 8th international conference on cloud computing (CLOUD)
Quoc DL, Beck M, Bhatotia P, Chen R, Fetzer C, Strufe T (2017a) Privacy preserving stream analytics: the marriage of randomized response and approximate computing. https://arxiv.org/abs/1701.05403
Quoc DL, Beck M, Bhatotia P, Chen R, Fetzer C, Strufe T (2017b) PrivApprox: privacy-preserving stream analytics. In: Proceedings of the 2017 USENIX annual technical conference (USENIX ATC)
Quoc DL, Chen R, Bhatotia P, Fetzer C, Hilt V, Strufe T (2017c) Approximate stream analytics in Apache Flink and Apache Spark Streaming. CoRR, abs/1709.02946
Quoc DL, Chen R, Bhatotia P, Fetzer C, Hilt V, Strufe T (2017d) StreamApprox: approximate computing for stream analytics. In: Proceedings of the international Middleware conference (Middleware)
Srikanth K, Anil S, Aleksandar V, Matthaios O, Robert G, Surajit C, Ding B (2016) Quickr: lazily approximating complex ad-hoc queries in big data clusters. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD)
Thompson SK (2012) Sampling. Wiley series in probability and statistics. The Australasian Institute of Mining and Metallurgy, Carlton
Wieder A, Bhatotia P, Post A, Rodrigues R (2010a) Brief announcement: modelling MapReduce for optimal execution in the cloud. In: Proceedings of the 29th ACM SIGACT-SIGOPS symposium on principles of distributed computing (PODC)
Wieder A, Bhatotia P, Post A, Rodrigues R (2010b) Conductor: orchestrating the clouds. In: Proceedings of the 4th international workshop on large scale distributed systems and middleware (LADIS)
Wieder A, Bhatotia P, Post A, Rodrigues R (2012) Orchestrating the deployment of computations in the cloud with Conductor. In: Proceedings of the 9th USENIX symposium on networked systems design and implementation (NSDI)
Wikipedia (2017) 68-95-99.7 rule
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation (NSDI)
Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles (SOSP)

Approximate Reasoning

Automated Reasoning
Synonyms

Apache Hadoop; Lambda architecture; Location analytic architecture

Definitions

Spatial big data is spatio-temporal data that is too large or requires data-intensive computation that is too demanding for traditional computing architectures. Stream processing in this context is the processing of spatio-temporal data in motion. The data is observational; it is produced by sensors – moving or otherwise. Computations on the data are made as the data is produced or received. A distributed processing cluster is a networked collection of computers that communicate and process data in a coordinated manner. Computers in the cluster are coordinated to solve a common problem. A lambda architecture is a scalable, fault-tolerant data-processing architecture that is designed to handle large quantities of data by exploiting both stream and batch processing methods. Data partitioning involves physically dividing a dataset into separate data stores on a distributed processing cluster; this is done to achieve improved scalability, performance, availability, and fault tolerance. Distributed file systems, in the context of big data architectures, are similar to traditional distributed file systems but are intended to persist large datasets on com-

Overview

Spatial big data architectures are intended to address requirements for spatio-temporal data that is too large or computationally demanding for traditional computing architectures (Shekhar et al. 2012). This includes operations such as real-time data ingest, stream processing, batch processing, storage, and spatio-temporal analytical processing.

Spatial big data offers additional challenges beyond what is commonly faced in the big data space. This includes advanced indexing, querying, analytical processing, visualization, and machine learning. Examples of spatial big data include moving vehicles (peer-to-peer ridesharing, delivery vehicles, ships, airplanes, etc.), stationary sensor data (e.g., SCADA, AMI), cell phones (call detail records), IoT devices, as well as spatially enabled content from social media (Fig. 1).

Spatial big data architectures are used in a variety of workflows:

• Real-time processing of observational data (data in motion). This commonly involves monitoring and tracking dynamic assets in real time; this can include vehicles, aircraft, and vessels, as well as stationary assets such as weather and environmental monitoring sensors.
• Batch processing of persisted spatio-temporal data (data at rest). Workflows with data at rest incorporate tabular and spatial processing
Architectures, Fig. 1 A spatial big data architecture: data sources feed real-time data ingest; stream processing and batch processing write to data storage and an analytic data store, which serve analytics
In commercial systems, spatial analysis also includes summarizing data, incident and similar location detection, proximity analysis, pattern analysis, and data management (import, export, cleansing, etc.).

Spatial Joins

Spatial joins have been widely studied in both the standard sequential environment (Jacox and Samet 2007) as well as in the parallel (Brinkhoff et al. 1996) and distributed environments (Abel et al. 1995). For over 20 years, algorithms have been developed to take advantage of parallel and distributed processing architectures and software frameworks. The recent resurgence of interest in spatial join processing is the result of new-found interest in distributed, fault-tolerant computing frameworks such as Apache Hadoop, as well as the explosion in observational and IoT data.

With distributed processing architectures, there are two principal approaches that are employed when performing spatial joins. The first, termed a broadcast (or map-side) spatial join, is designed for joining a large dataset with another small dataset (e.g., political boundaries). The large dataset is partitioned across the processing nodes, and the complete small dataset is broadcast to each of the nodes. This allows significant optimization opportunities. The second approach, termed a partitioned (or reduce-side) spatial join, is a more general technique that is used when joining two large datasets. Partitioned joins use a divide-and-conquer approach (Aji et al. 2013): the two large datasets are divided into small pieces via a spatial decomposition, and each small piece is processed independently.

SJMR (Spatial Join with MapReduce) introduced the first distributed spatial join on Hadoop using the MapReduce programming model (Dean and Ghemawat 2008; Zhang et al. 2009). SpatialHadoop (Eldawy and Mokbel 2015) optimized SJMR with a persistent spatial index (it supports grid files, R-trees, and R+-trees) that is precomputed. Hadoop-GIS (Aji et al. 2013), which is utilized in medical pathology imaging, features both 2D and 3D spatial joins. GIS Tools for Hadoop (Whitman et al. 2014) is an open source library that implements range and distance queries and k-NN. It also supports a distributed PMR quadtree-based spatial index. GeoSpark (Yu et al. 2015) is a framework for performing spatial joins and range and k-NN queries. The framework supports quadtree and R-tree indexing of the source data. Magellan (Sriharsha 2015) is an open source library for geospatial analytics that uses Spark (Zaharia et al. 2010). It supports a broadcast join and a reduce-side optimized join and is integrated with Spark SQL for a traditional SQL user experience. SpatialSpark (You et al. 2015) supports both a broadcast spatial join and a partitioned spatial join on Spark. The partitioning is supported using either a fixed grid, binary space partition, or a sort-tile approach. STARK (Hagedorn et al. 2017) is a Spark-based framework that supports spatial joins, k-NN, and range queries on both spatial and spatio-temporal data. STARK supports three temporal operators (contains, containedBy, and intersects) and also supports the DBSCAN density-based spatial clusterer (Ester et al. 1996). MSJS (a multi-way spatial join algorithm with Spark (Du et al. 2017)) addresses the problem of performing multi-way spatial joins using the common technique of cascading sequences of pairwise spatial joins. Simba (Xie et al. 2016) offers range, distance (circle range), and k-NN queries as well as distance and k-NN joins. Two-level indexing, global and local, is employed, similar to the various indexing work on Hadoop MapReduce. LocationSpark (Tang et al. 2016) supports range query, k-NN, spatial join, and k-NN join. LocationSpark uses global and local indices – grid, R-tree, quadtree, and IR-tree. GeoMesa is an open-source, distributed, spatio-temporal index built on top of Bigtable-style databases (Chang et al. 2008) using an implementation of the Geohash algorithm implemented in Scala (Hughes et al. 2015). The Esri GeoAnalytics Server (Whitman et al. 2017) supports many types of spatial analysis in a distributed environment (leveraging the Spark framework). It provides functionality for summarizing data (e.g., aggregation, spatio-temporal join, polygon overlay), incident and similar location detection, proximity
analysis, and pattern analysis (hot spot analysis, NetCDF generation).

MapReduce Programming Model

MapReduce is a programming model and an associated implementation for processing big data sets with a parallel, distributed algorithm (Sakr et al. 2013). A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). A MapReduce framework manages the processing by marshalling the distributed cluster nodes, running the various tasks and algorithms in parallel, and managing communications and data transfers between cluster nodes, while supporting fault tolerance and redundancy.

The MapReduce model is inspired by the map and reduce functions commonly used in functional programming (note that their purpose in the MapReduce framework is not the same as in their original forms). The key contributions of the MapReduce model are not the actual map and reduce functions, which resemble the Message Passing Interface (MPI) standard's reduce and scatter operations. The major contributions of the MapReduce model are the scalability and fault tolerance that are supported through optimization of the execution engine. A single-threaded implementation of MapReduce is commonly slower than a traditional (non-MapReduce) implementation; gains are typically realized only with multi-node or multi-threaded implementations.

MapReduce libraries have been written in many programming languages, with different levels of optimization. The most popular open-source implementation is found in Apache Hadoop.

Stream Processing

Stream processing is a computer programming paradigm, equivalent to dataflow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing (Gedik et al. 2008). Such applications can use multiple computational units, such as the floating point unit on a graphics processing unit or field-programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units.

The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a sequence of data (a stream), a series of operations (kernel functions) is applied to each element in the stream. Kernel functions are usually pipelined, and optimal local on-chip memory reuse is attempted in order to minimize the loss in bandwidth attributed to external memory interaction. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks.

Lambda Architecture

A lambda architecture is an architecture that is intended to process large volumes of data by incorporating both batch and real-time processing techniques (Marz and Warren 2015). This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. In addition, historic analysis is used to tune the real-time analytical processing as well as to build models for prediction (machine learning). The rise of the lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce (Fig. 2).

The lambda architecture depends on a data model with an append-only, immutable data source that serves as a system of record. It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
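The batch/speed split described above can be illustrated with a toy serving-layer merge (a hypothetical sketch, not code from Marz and Warren): the batch layer periodically recomputes an accurate view over the master dataset, the speed layer incrementally counts events that arrived since the last batch run, and queries merge the two.

```python
from collections import defaultdict

def batch_view(master_events):
    """Batch layer: recompute per-key counts over the full, immutable master dataset."""
    view = defaultdict(int)
    for key, n in master_events:
        view[key] += n
    return dict(view)

class SpeedLayer:
    """Speed layer: incrementally maintain counts for events not yet batch-processed."""
    def __init__(self):
        self.view = defaultdict(int)

    def ingest(self, key, n):
        self.view[key] += n

def query(batch, speed, key):
    """Serving layer: merge the accurate-but-stale batch view with the fresh real-time view."""
    return batch.get(key, 0) + speed.view.get(key, 0)

master = [("sensor-1", 3), ("sensor-2", 5), ("sensor-1", 2)]
batch = batch_view(master)     # e.g., recomputed nightly
speed = SpeedLayer()
speed.ingest("sensor-1", 1)    # an event that arrived after the batch run
```

Here query(batch, speed, "sensor-1") returns 6: five events from the batch view plus one fresh event from the speed layer, which is discarded once the next batch run absorbs it.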
Architectures, Fig. 2 A lambda architecture: a batch layer (batch ingestion, batch storage, batch processing) and a stream layer (stream ingestion, stream processing) feed a serving layer that supports analytic clients

Architectures, Fig. 3 GPU-accelerated Spark framework: client applications run on Apache Spark (Spark Streaming, Spark SQL, Spark ML, GraphX) over a GPU-equipped Spark infrastructure

Architectures, Fig. 4 Spark Streaming: an input data stream is divided into batches of input data, which the Spark engine processes into batches of processed data
This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

Hadoop has been deployed in traditional datacenters as well as in the cloud. The cloud allows organizations to deploy Hadoop without the need to acquire hardware or specific setup expertise. Vendors who currently have a cloud offering that incorporates Hadoop include Microsoft, Amazon, IBM, Google, and Oracle. Most of the Fortune 50 companies currently deploy Hadoop clusters.

Spark Streaming

Spark Streaming is an extension of the Spark API that supports scalable, high-throughput, fault-tolerant stream processing of real-time data streams (Garillot and Maas 2018). Data can be ingested from many sources (e.g., Kafka, Flume, or TCP sockets) and can be processed using temporally aware algorithms expressed with high-level functions like map, reduce, join, and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. Spark's machine learning (Spark ML) and graph processing (GraphX) algorithms can be applied to these data streams.

Internally, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches (Fig. 4).

Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

Big Data as a Service

Platform as a Service (PaaS) is a category of cloud computing services that provides a platform allowing customers to run and
manage applications without the complexity of building and maintaining the infrastructure usually associated with developing and launching an application (Chang et al. 2010). PaaS is commonly delivered in one of three ways:

• As a public cloud service from a provider
• As a private service (software or appliance) inside the firewall
• As software deployed on a public infrastructure as a service

S3 (Simple Storage Service). Data storage is provided through DynamoDB (NoSQL), Redshift (columnar), and RDS (relational data store). Machine learning and real-time data processing infrastructures are also supported.

Other significant examples of BDaaS providers include the IBM Cloud and the Oracle Data Cloud. Big data Infrastructure as a Service (IaaS) offerings (that work with other clouds such as AWS, Azure, and Oracle) are available from Hortonworks, Cloudera, Esri, and Databricks.
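The discretized-stream model described in the Spark Streaming section (a continuous stream divided into small batches, each handed to a batch engine) can be simulated in plain Python. This is a toy single-process sketch, not the Spark API, and it batches by count rather than by time interval as Spark Streaming does:

```python
def discretize(stream, batch_size):
    """Divide an (unbounded) event stream into micro-batches, DStream-style."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch        # one micro-batch, analogous to one RDD
            batch = []
    if batch:
        yield batch            # flush the final partial batch

def process(batch):
    """Stand-in for the batch engine: count words in one micro-batch."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

results = [process(b) for b in discretize(["a", "b", "a", "c", "a"], batch_size=2)]
# results == [{"a": 1, "b": 1}, {"a": 1, "c": 1}, {"a": 1}]
```

Because each micro-batch is an ordinary finite collection, the same batch code (here, a word count) serves both batch and stream workloads, which is the appeal of the discretized-stream design.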
has primarily focused on density-based clustering algorithms such as DBSCAN, HDBSCAN (McInnes and Healy 2017), and OPTICS.
• Recently, much attention has been paid to incorporating GPU processing capabilities into distributed processing frameworks such as Spark. While some basic spatial capabilities can currently be supported (e.g., aggregation and visualization of point data), much work needs to be done to further streamline and optimize the integration of GPU processors and extend the native spatio-temporal capabilities.

Cross-References

Big Data Architectures
Real-Time Big Spatial Data Processing
Streaming Big Spatial Data

References

Abel DJ, Ooi BC, Tan K-L, Power R, Yu JX (1995) Spatial join strategies in distributed spatial DBMS. In: Advances in spatial databases – 4th international symposium, SSD’95. Lecture notes in computer science, vol 1619. Springer, Portland, pp 348–367
Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz J (2013) Hadoop-GIS: a high performance spatial data warehousing system over MapReduce. Proc VLDB Endow 6(11):1009–1020
Alexander W, Copeland G (1988) Process and dataflow control in distributed data-intensive systems. In: Proceedings of the 1988 ACM SIGMOD international conference on management of data (SIGMOD ’88), pp 90–98. https://doi.org/10.1145/50202.50212
Apache (2006) Welcome to Apache Hadoop! http://hadoop.apache.org. Accessed 26 Mar 2018
Brinkhoff T, Kriegel HP, Seeger B (1996) Parallel processing of spatial joins using R-trees. In: Proceedings of the 12th international conference on data engineering, New Orleans, Louisiana, pp 258–265
Chang F, Dean J, Ghemawat S, Hsieh W, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst 26(2). https://doi.org/10.1145/1365815.1365816
Chang WY, Abu-Amara H, Sanford JF (2010) Transforming enterprise cloud services. Springer, London, pp 55–56
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113. https://doi.org/10.1145/1327452.1327492
DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6). https://doi.org/10.1145/129888.129894
DeWitt DJ, Gerber RH, Graefe G, Heytens ML, Kumar KB, Muralikrishna M (1986) GAMMA – a high performance dataflow database machine. In: Proceedings of the 12th international conference on very large data bases (VLDB ’86), Kyoto, Japan, pp 228–237
Du Z, Zhao X, Ye X, Zhou J, Zhang F, Liu R (2017) An effective high-performance multiway spatial join algorithm with Spark. ISPRS Int J Geo-Information 6(4):96
Eldawy A, Mokbel MF (2015) SpatialHadoop: a MapReduce framework for spatial data. In: IEEE 31st international conference on data engineering (ICDE), Seoul, South Korea, pp 1352–1363
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining (KDD-96), Portland, Oregon, pp 226–231
Garillot F, Maas G (2018) Stream processing with Apache Spark: best practices for scaling and optimizing Apache Spark. O’Reilly Media, Sebastopol. http://shop.oreilly.com/product/0636920047568.do
Gedik B, Andrade H, Wu K-L, Yu PS, Doo M (2008) SPADE: the System S declarative stream processing engine. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data (SIGMOD ’08), pp 1123–1134. https://doi.org/10.1145/1376616.1376729
Ghemawat S, Gobioff H, Leung S (2003) The Google file system. In: Proceedings of the 19th ACM symposium on operating systems principles, Oct 2003, pp 29–43. https://doi.org/10.1145/945445.945450
Grossman M, Sarkar V (2016) SWAT: a programmable, in-memory, distributed, high-performance computing platform. In: Proceedings of the 25th ACM international symposium on high-performance parallel and distributed computing (HPDC ’16). ACM, New York, pp 81–92. https://doi.org/10.1145/2907294.2907307
Hagedorn S, Götze P, Sattler KU (2017) The STARK framework for spatio-temporal data analytics on Spark. In: Proceedings of the 17th conference on database systems for business, technology, and the web (BTW 2017), Stuttgart
Hassaan M, Elghandour I (2016) A real-time big data analysis framework on a CPU/GPU heterogeneous cluster: a meteorological application case study. In: Proceedings of the 3rd IEEE/ACM international conference on big data computing, applications and technologies (BDCAT ’16). ACM, New York, pp 168–177. https://doi.org/10.1145/3006299.3006304
Hong S, Choi W, Jeong W-K (2017) GPU in-memory processing using Spark for iterative computation. In: Proceedings of the 17th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGrid ’17), pp 31–41. https://doi.org/10.1109/CCGRID.2017.41
Hughes JN, Annex A, Eichelberger CN, Fox A, Hulbert A, Ronquest M (2015) GeoMesa: a distributed architecture for spatio-temporal fusion. In: Proceedings of SPIE defense and security. https://doi.org/10.1117/12.2177233
Jacox EH, Samet H (2007) Spatial join techniques. ACM Trans Database Syst 32(1):7
Klein J, Buglak R, Blockow D, Wuttke T, Cooper B (2016) A reference architecture for big data systems in the national security domain. In: Proceedings of the 2nd international workshop on BIG data software engineering (BIGDSE ’16). https://doi.org/10.1145/2896825.2896834
Marz N, Warren J (2015) Big data: principles and best practices of scalable realtime data systems, 1st edn. Manning Publications, Greenwich
McInnes L, Healy J (2017) Accelerated hierarchical density based clustering. In: IEEE international conference on data mining workshops (ICDMW), New Orleans, Louisiana, pp 33–42
Mysore D, Khupat S, Jain S (2013) Big data architecture and patterns. IBM, White Paper. http://www.ibm.com/developerworks/library/bdarchpatterns1. Accessed 26 Mar 2018
NoSQL (2009) NoSQL definition. http://nosql-database.org. Accessed 26 Mar 2018
Pavlo A, Aslett M (2016) What’s really new with NewSQL? SIGMOD Rec 45(2):45–55. https://doi.org/10.1145/3003665.3003674
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Perrot M, Duchesnay É (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Prasad S, McDermott M, Puri S, Shah D, Aghajarian D, Shekhar S, Zhou X (2015) A vision for GPU-accelerated parallel computation on geo-spatial datasets. SIGSPATIAL Spec 6(3):19–26. https://doi.org/10.1145/2766196.2766200
Sakr S, Liu A, Fayoumi AG (2013) The family of MapReduce and large-scale data processing systems. ACM Comput Surv 46(1):1. https://doi.org/10.1145/2522968.2522979
Sena B, Allian AP, Nakagawa EY (2017) Characterizing big data software architectures: a systematic mapping study. In: Proceedings of the 11th Brazilian symposium on software components, architectures, and reuse (SBCARS ’17). https://doi.org/10.1145/3132498.3132510
Shekhar S, Gunturi V, Evans MR, Yang KS (2012) Spatial big-data challenges intersecting mobility and cloud computing. In: Proceedings of the eleventh ACM international workshop on data engineering for wireless and mobile access (MobiDE ’12), pp 1–6. https://doi.org/10.1145/2258056.2258058
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). https://doi.org/10.1109/MSST.2010.5496972
Sriharsha R (2015) Magellan: geospatial analytics on Spark. https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/. Accessed June 2017
Tang M, Yu Y, Malluhi QM, Ouzzani M, Aref WG (2016) LocationSpark: a distributed in-memory data management system for big spatial data. Proc VLDB Endow 9(13):1565–1568. https://doi.org/10.14778/3007263.3007310
Whitman RT, Park MB, Ambrose SM, Hoel EG (2014) Spatial indexing and analytics on Hadoop. In: Proceedings of the 22nd ACM SIGSPATIAL international conference on advances in geographic information systems (SIGSPATIAL ’14), pp 73–82. https://doi.org/10.1145/2666310.2666387
Whitman RT, Park MB, Marsh BG, Hoel EG (2017) Spatio-temporal join on Apache Spark. In: Hoel E, Newsam S, Ravada S, Tamassia R, Trajcevski G (eds) Proceedings of the 25th ACM SIGSPATIAL international conference on advances in geographic information systems (SIGSPATIAL ’17). https://doi.org/10.1145/3139958.3139963
Xie D, Li F, Yao B, Li G, Zhou L, Guo M (2016) Simba: efficient in-memory spatial analytics. In: Proceedings of the 2016 international conference on management of data (SIGMOD ’16), pp 1071–1085. https://doi.org/10.1145/2882903.2915237
You S, Zhang J, Gruenwald L (2015) Large-scale spatial join query processing in Cloud. In: 2015 31st IEEE international conference on data engineering workshops, Seoul, 13–17 April 2015, pp 34–41
Yu J, Wu J, Sarwat M (2015) GeoSpark: a cluster computing framework for processing large-scale spatial data. In: Proceedings of the 23rd SIGSPATIAL international conference on advances in geographic information systems, Seattle, WA
Yuan Y, Salmi MF, Huai Y, Wang K, Lee R, Zhang X (2016) Spark-GPU: an accelerated in-memory data processing engine on clusters. In: Proceedings of the 2016 IEEE international conference on big data (Big Data 2016), Washington, DC, pp 273–283
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing (HotCloud ’10), Boston, MA
Zhang S, Han J, Liu Z, Wang K, Xu Z (2009) SJMR: parallelizing spatial join with MapReduce on clusters. In: IEEE international conference on cluster computing (CLUSTER ’09), New Orleans, Louisiana, pp 1–8
Artifact-Centric Process Mining
[Artifact-Centric Process Mining, Fig. 1 Overview: from an event data source (a database), an event log whose events carry multiple case identifiers is imported; a data model and life-cycle models are discovered from it, yielding an artifact-centric model that describes all data objects and their relations as well as the behavior and behavioral dependencies of the data objects]
paths through a life-cycle model or interactions, or to identify infrequent behavior as outliers.

Overview

Historically, artifact-centric process mining addressed the unsolved problem of process mining on event data with multiple case identifiers and one-to-many and many-to-many relationships by adopting the concept of a (business) artifact as an alternative approach to describing business processes.

Event Data with Multiple Case Identifiers
Processes in organizations are typically supported by information systems that structure the information handled in these processes into well-defined data objects, which are often stored in relational databases. During process execution, various data objects are created, read, updated, and deleted, and various data objects are related to each other in one-to-one, one-to-many, and many-to-many relations. Figure 2 illustrates in a simplified form the data structures typically found in ERP systems. Sales, Delivery, and Billing documents are recorded in tables; relation F1 links Sales to Delivery documents in a one-to-many relation: S1 relates to D1 and D2; correspondingly, F2 links Billing to Delivery documents in a one-to-many relation. Events on process steps and data access are recorded in time stamp attributes such as Date created or in a separate Document Changes table linked to all other tables.

Convergence and Divergence
Process mining requires associating events with a case identifier in order to analyze behavioral relations between events in the same case (van der Aalst 2016). The data in Fig. 2 provides three case identifiers: SD id, DD id, and BD id. Classical process mining forces all events to be associated with a single case identifier. However, this is equivalent to flattening and de-normalizing the relational structure along its one-to-many relationships. For example, associating all Create events of Fig. 2 with SD id flattens the tables into the event log of Fig. 3 (left), which has two cases, for S1 and S2. Due to flattening, the event Create of B2 has been duplicated, as it was extracted once for S1 and once for S2; this is called divergence. Further,
F3 – Document Changes

Change id | Date      | Ref. id | Table | Change type             | Old Value | New Value
1         | 17-5-2020 | S1      | SD    | Price updated           | 100       | 80
2         | 19-5-2020 | S1      | SD    | Delivery block released | X         | -
3         | 19-5-2020 | S1      | SD    | Billing block released  | X         | -
4         | 10-6-2020 | B1      | BD    | Invoice date updated    | 20-6-2020 | 21-6-2020
Artifact-Centric Process Mining, Fig. 3 An event log (left) serializing the “create” events of the database of Fig. 2
based on the case identifier “SD id.” The resulting directly-follows relation (right) suffers convergence and divergence
Create for B1 is followed by Create for D2 although B1 and D2 are unrelated in Fig. 2; this is called convergence (Lu et al. 2015). The behavioral relations which underlie automated process discovery become erroneous through convergence and divergence. For example, the directly-follows relation of the log (Fig. 3 right) states erroneously that three Invoice documents have been created – whereas the original data source contains only two – and that in two cases Invoice creation was followed by Delivery creation (between related data objects), whereas in the original data source this only happened once, for B2 and D3.

Convergence and divergence may cause up to 50% of the behavioral relations in an event log to be erroneous (Lu et al. 2015). They can be partially avoided by manually scoping the extraction of event data into event logs with a single case identifier (Jans 2017).

Artifact-Centric Process Models
Artifact-centric process mining adopts the modeling concept of a (business) artifact to analyze event data with multiple case identifiers in their entirety (Lu et al. 2015; Nooijen et al. 2012; van Eck et al. 2017). The notion of a (business) artifact
was proposed by Nigam and Caswell (2003) as an alternative approach to describing business processes. This approach assumes that any process materializes itself in the (data) objects that are involved in the process, for instance, sales documents and delivery documents; these objects have properties such as the values of the fields of a paper form, the processing state of an order, or the location of a package. Typically, a data model describes (1) the classes of objects that are relevant in the process, (2) the relevant properties of these objects in terms of class attributes, and (3) the relations between the classes. A process execution instantiates new objects and changes their properties according to the process logic. Thereby, the relations between classes describe how many objects of one class are related to how many objects of another class.

An artifact-centric process model enriches the classes of the data model themselves with process logic restricting how objects may evolve during execution. More precisely, one artifact (1) encapsulates several classes of the data model (e.g., Sales Documents and Sales Document Lines), (2) provides actions that can update the classes’ attributes and move the artifact to a particular state, and (3) defines a life cycle. The artifact life cycle describes when an instance of the artifact (i.e., a concrete object) is created, which actions may occur in which state of the instance to advance the instance to another state (e.g., from created to cleared), and which goal state the instance has to reach to complete a case. A complete artifact-centric process model provides a life-cycle model for each artifact in the process and describes which behavioral dependencies exist between actions and states of different artifacts (e.g., Pick Delivery may occur for a Delivery object only if all its Billing objects are in state cleared). Where business process models created in languages such as BPMN, EPC, or Petri nets describe a process in terms of activities and their ordering in a single case, an artifact-centric model describes process behavior in terms of the creation and evolution of instances of multiple related data objects. In an artifact-centric process model, the unit of modularization is the artifact, consisting of data and behavior, whereas in an activity-centric process modeling notation, the unit of modularization is the activity, which can be an elementary task or a sub-process. A separate entry in this encyclopedia discusses the problem of automated discovery of activity-centric process models with sub-processes.

Figure 4 shows an artifact-centric process model in the notation of Proclets (van der Aalst et al. 2001) for the database of Fig. 2. The life cycle of each document (Sales, Delivery, Billing) is described as a Petri net. Behavioral dependencies between actions in different objects are described through interface ports and asynchronous channels that also express cardinalities in the interaction. For example, the port annotation “C” specifies that Clear Invoice in Billing enables Pick Delivery in multiple related Delivery objects. Port annotation “1” specifies that Pick Delivery can occur after Clear Invoice occurred in the one Billing object related to the Delivery. The gray part of Fig. 4 shows a more involved behavioral dependency. Whenever Update Price
Artifact-Centric Process Mining, Fig. 4 Example of an artifact-centric process model in the proclet notation
Artifact-Centric Process Mining, Fig. 5 Example of an artifact-centric process model in the Guard-Stage-Milestone
notation
occurs in a Sales document, all related Delivery documents get blocked. Only after Release delivery block occurred in Sales may the Delivery document be updated again and Pick Delivery occur. For the sake of simplicity, the model does not show further behavioral dependencies such as “Update price also blocks related Billing documents.”

Figure 5 shows the same model in the Guard-Stage-Milestone notation (Hull et al. 2011) (omitting some details). Each round rectangle denotes a stage that can be entered when its guard condition (diamond) holds and is left when its milestone condition (circle) holds. The guard and milestone conditions specify declarative constraints over data attributes, stages, and milestones of all artifacts in the model. For example, Picking can start when the pickDelivery event is triggered in the process, the delivery document has reached its created milestone, the billing document related to the delivery document has reached its cleared milestone (d.BD.cleared), and the stage Blocking Delivery is not active in the related sales document.

Artifact-Centric Process Mining
The behavior recorded in the database of Fig. 2 does not conform to the models in Figs. 4 and 5:

1. Structural conformance states how well the data model describes the data records observed in reality. The proclet model of Fig. 4 structurally conforms to the data in Fig. 2 regarding objects and relations but not regarding actions: the recorded event data shows two additional event types for the life cycle of the Sales document – Release billing block and Last Change.
2. Life-cycle conformance states how well the life-cycle model of each artifact describes the order of events observed in reality. This corresponds to conformance in classical process mining. For example, in Fig. 2, Update invoice date occurs in Billing after Clear Invoice, which does not conform to the life-cycle model in Fig. 4.
3. Interaction conformance states how well the entire artifact-centric model describes the behavioral dependencies between artifacts. In Fig. 2, instance D3 of Delivery is created after its related instance B2 of Billing. This does not conform to the channels and ports specified in Fig. 4.

The objective of artifact-centric process mining is to relate recorded behavior to modeled behavior, through (1) discovering an artifact-centric process model that conforms to the recorded behavior, (2) checking how well recorded behavior and an artifact-centric model conform to each other and detecting deviations, and (3) extending a given artifact-centric model with further information based on recorded event data.

Artifact-centric process discovery is a technique to automatically or semiautomatically learn artifact-centric process models from event data.
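To make the flattening problem and the artifact-centric alternative concrete, the following Python fragment contrasts the two extraction styles on a toy data set loosely modeled on Fig. 2. All instance ids, activities, timestamps, and helper names here are invented for illustration; this is a minimal sketch, not the algorithm of any of the cited works.

```python
# Toy events from a Fig. 2-like schema: (artifact type, instance id, activity, timestamp).
# Assumed relations: S1 relates to D1 and D2; B1 bills D1; B2 bills D2 and D3 (of S2).
events = [
    ("Sales",    "S1", "Create", 1), ("Sales",    "S2", "Create", 2),
    ("Delivery", "D1", "Create", 3), ("Delivery", "D2", "Create", 4),
    ("Billing",  "B1", "Create", 5), ("Billing",  "B2", "Create", 6),
    ("Delivery", "D3", "Create", 7), ("Delivery", "D1", "Pick Delivery", 8),
]
# Which sales document(s) each instance is (transitively) related to.
related_sales = {"S1": {"S1"}, "S2": {"S2"}, "D1": {"S1"}, "D2": {"S1"},
                 "D3": {"S2"}, "B1": {"S1"}, "B2": {"S1", "S2"}}

def flatten_on_sales(events):
    """Classical extraction: force every event under the single case id 'SD id'."""
    log = {}
    for _, inst, act, ts in events:
        for case in related_sales[inst]:          # one copy per related sales doc
            log.setdefault(case, []).append((ts, act, inst))
    return {case: sorted(trace) for case, trace in log.items()}

def artifact_lifecycle_logs(events):
    """Artifact-centric extraction: one trace per instance, per artifact type."""
    logs = {}
    for atype, inst, act, ts in events:
        logs.setdefault(atype, {}).setdefault(inst, []).append((ts, act))
    return {a: {i: [act for _, act in sorted(tr)] for i, tr in traces.items()}
            for a, traces in logs.items()}

flat = flatten_on_sales(events)
# Divergence: B2's single Create event is duplicated, once under S1 and once under S2.
dup = sum(1 for trace in flat.values() for e in trace if e[2] == "B2")
print(dup)  # 2

life = artifact_lifecycle_logs(events)
print(life["Delivery"]["D1"])  # ['Create', 'Pick Delivery']
```

In the flattened log, B2's Create event appears twice (divergence), while the per-artifact logs keep exactly one trace per document instance, which is the input that classical single-identifier discovery algorithms can handle.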
The problem is typically solved by a decomposition into the following four steps:

1. Discovering the data model of entities or tables, their attributes, and relations from the data records in the source data. This step corresponds to data schema recovery. It can be omitted if the data schema is available and correct; however, in practice foreign key relations may not be documented at the data level and need to be discovered.
2. Discovering artifact types and relations from the data model and the event data. This step corresponds to transforming the data schema discovered in step 1 into a domain model, often involving undoing horizontal and vertical (anti-) partitioning in the technical data schema and grouping entities into domain-level data objects. User input or detailed information about the domain model is usually required.
3. Discovering artifact life-cycle models for each artifact type discovered in step 2. This step corresponds to automated process discovery for event data with a single case identifier and can be done fully automatically up to parameters of the discovery algorithm.
4. Discovering behavioral dependencies between the artifact life cycles discovered in step 3 based on the relations between artifact types discovered in step 2. This step is specific to artifact-centric process mining; several alternative, automated techniques have been proposed. User input may be required to select domain-relevant behavioral dependencies among the discovered ones.

In case the original data source is a relational database, steps 3 and 4 require automatically extracting event logs from the data source for discovering life-cycle models and behavioral dependencies. As in classical process discovery, it depends on the use case to which degree the discovered data model, life-cycle model, and behavioral dependencies shall conform to the original data.

Artifact-centric conformance checking and artifact-centric model enhancement follow the same problem decomposition into data schema, artifact types, life cycles, and interactions as artifact-centric discovery. Depending on which models are available, the techniques may also be combined by first discovering the data schema and artifact types, then extracting event logs, and then checking life-cycle and behavioral conformance for an existing model or enhancing an existing artifact model with performance information.

Figure 6 shows a possible result of artifact-centric process discovery on the event data in
Artifact-Centric Process Mining, Fig. 6 Possible result of artifact-centric process discovery from the event data in
Fig. 2
Fig. 2 using the technique of Lu et al. (2015), where the model has been enhanced with information about the frequencies of occurrences of events and behavioral dependencies.

Key Research Findings

Artifact-type discovery. Nooijen et al. (2012) provide a technique for automatically discovering artifact types from a relational database, leveraging schema summarization techniques to cluster tables into artifact types based on the information entropy in a table and the strength of foreign key relations. The semiautomatic approach of Lu et al. (2015) can then be used to refine artifact types, undo horizontal and vertical (anti-) partitioning, and discover relations between artifacts. Popova et al. (2015) show how to discover artifact types from a rich event stream by grouping events based on common identifiers into entities and then deriving structural relations between them.

Event log extraction. In addition to discovering artifact types, Nooijen et al. (2012) also automatically create a mapping from the relational database to the artifact-type specification. The technique of Verbeek et al. (2010) can use this mapping to generate queries for event log extraction for life-cycle discovery automatically. Jans (2017) provides guidelines for extracting specific event logs from databases through user-defined queries. The event log may also be extracted from database redo logs using the technique of de Murillas et al. (2015) and from databases through a meta-model-based approach as proposed by de Murillas et al. (2016).

Life-cycle discovery. Given the event log of an artifact, artifact life-cycle discovery is a classical automated process discovery problem for which various process discovery algorithms are available, most returning models based on or similar to Petri nets. Weerdt et al. (2012) compared various discovery algorithms using real-life event logs. Lu et al. (2015) advocate the use of the Heuristics Miner of Weijters and Ribeiro (2011) and vanden Broucke and Weerdt (2017) with the aim of visual analytics. Popova et al. (2015) advocate discovering models with precise semantics and free of behavioral anomalies that (largely) fit the event log (Leemans et al. 2013; Buijs et al. 2012), allowing for translating the result into the Guard-Stage-Milestone notation.

Behavioral dependencies. Lu et al. (2015) discover behavioral dependencies between two artifacts by extracting an interaction event log that combines the events of any two related artifact instances into one trace. Applying process discovery to this interaction event log then allows extracting “flow edges” between activities of the different artifacts, also across one-to-many relations, leading to a model as shown in Fig. 6. This approach has been validated to return only those dependencies actually recorded in the event data but suffers when interactions can occur in many different variants, leading to many different “flow edges.”

van Eck et al. (2017) generalize the interaction event log further and create an integrated event log of all artifact types to be considered (two or more) where, for each combination of related artifact instances, all events are merged into a single trace. From this log, a composite state machine model is discovered which describes the synchronization of all artifact types. By projecting the composite state machine onto the steps of each artifact type, the life-cycle model for each artifact is obtained, and the interaction between multiple artifacts can be explored interactively in a graphical user interface through their relation in the composite state machine. This approach assumes one-to-one relations between artifacts. Popova and Dumas (2013) discover behavioral dependencies in the form of data conditions over data attributes and states of other artifacts, similar to the notation in Fig. 5, but are limited to one-to-one relations between artifacts.

Conformance checking. Artifact life-cycle conformance can be checked through extracting artifact life-cycle event logs and then applying classical conformance checking techniques (Fahland
et al. 2011a). The technique in Fahland et al. (2011b) checks interaction conformance in an artifact life-cycle model if detailed information about which artifact instances interact is recorded in the event data.

Models for artifacts. Artifact-centric process mining techniques originated from and are to a large extent determined by the modeling concepts available to describe process behavior and data flow with multiple case identifiers. Several proposals have been made in this area. The Proclet notation (van der Aalst et al. 2001) extended Petri nets with ports that specify one-to-many and many-to-many cardinality constraints on messages exchanged over channels in an asynchronous fashion. Fahland et al. (2011c) discuss a normal form for proclet-based models akin to the second normal form in relational schemas. The Guard-Stage-Milestone (GSM) notation (Hull et al. 2011) allows specifying artifacts and interactions using event-condition-action rules over the data models of the different artifacts. Several modeling concepts of GSM were adopted by the CMMN 1.1 standard of OMG (2016). Hariri et al. (2013) propose data-centric dynamic systems (DCDS) to specify artifact-centric behavior in terms of updates of database records using logical constraints. Existing industrial standards can also be extended to describe artifacts, as shown by Lohmann and Nyolt (2011) for BPMN and by Estañol et al. (2012) for UML. Freedom of behavioral anomalies can be verified for UML-based models (Calvanese et al. 2014) and for DCDS (Montali and Calvanese 2016). Meyer and Weske (2013) show how to translate between artifact-centric and activity-centric process models, and Lohmann (2011) shows how to derive an activity-centric process model describing the interactions between different artifacts based on behavioral constraints.

Examples of Application

Artifact-centric process mining is designed for analyzing event data where events can be related to more than one case identifier or object and where more than one case identifier has to be considered in the analysis.

The primary use case is in analyzing processes in information systems storing multiple, related data objects, such as Enterprise Resource Planning (ERP) systems. These systems store documents about business transactions that are related to each other in one-to-many and many-to-many relations. Lu et al. (2015) correctly distinguish normal and outlier flows between 18 different business objects over 2 months of data of the Order-to-Cash process in an SAP ERP system using artifact-centric process mining. The same technique was also used for identifying outlier behavior in processes of a project management system together with end users. van Eck et al. (2017) analyzed the personal loan and overdraft process of a Dutch financial institution. Artifact-centric process mining has also been applied successfully to software project management systems such as Jira and customer relationship management systems such as Salesforce (Calvo 2017).

Artifact-centric process mining can also be applied to event data outside information systems. One general application area is analyzing the behavior of physical objects as sensed by multiple related sensors. For instance, van Eck et al. (2016) analyzed the usage of physical objects equipped with multiple sensors. Another general application area is analyzing the behavior of software components from software execution event logs. For instance, Liu et al. (2016) follow the artifact-centric paradigm to structure events of software execution logs into different components and discover behavioral models for each software component individually.

Future Directions for Research

At the current stage, artifact-centric process mining is still under development, allowing for several directions for future research.

Automatically discovering artifact types from data sources is currently limited to summarizing the structures in the available data. Mapping these
structures to domain concepts still requires user input. Also, the automated extraction of event logs from the data source relies on the mapping from the data source to the artifact-type definition. How to aid the user in discovering and mapping the data to domain-relevant structures and reducing the time and effort to extract event logs, possibly through the use of ontologies, is an open problem. Also, little research has been done on improving the queries generated for automated event log extraction to handle large amounts of event data.

For discovering behavioral dependencies between artifacts, only few and limited techniques are available. The flow-based discovery of Lu et al. (2015), which can handle one-to-many relations, is limited to interactions between two artifacts and suffers in the presence of many different behavioral variants of the artifacts or the interactions. The alternative approaches (Popova and Dumas 2013; van Eck et al. 2017) are currently limited to one-to-one relations between artifacts. Solving the discovery of behavioral dependencies between artifacts thereby faces two fundamental challenges:

1. Although many different modeling languages and concepts for describing artifact-centric processes have been proposed, the proposed concepts do not adequately capture these complex dynamics in an easy-to-understand form (Reijers et al. 2015). Further research is needed to identify appropriate modeling concepts for artifact interactions.
2. Systems with multiple case identifiers are in their nature complex systems, where complex behaviors and multiple variants in the different artifacts multiply when considering artifact interactions. Further research is needed on how to handle this complexity, for example, through generating specific, interactive views as proposed by van Eck et al. (2017).

Although several comprehensive conformance criteria in artifact-centric process mining exist, further research on checking structural conformance and interaction conformance is required, not only for detecting deviations but also to objectively evaluate the quality of artifact-centric process discovery algorithms.

Cross-References

Automated Process Discovery
Business Process Analytics
Conformance Checking
Hierarchical Process Discovery
Multidimensional Process Analytics
Schema Mapping

References

Buijs JCAM, van Dongen BF, van der Aalst WMP (2012) A genetic algorithm for discovering process trees. In: IEEE congress on evolutionary computation
Calvanese D, Montali M, Estañol M, Teniente E (2014) Verifiable UML artifact-centric business process models. In: CIKM
Calvo HAS (2017) Artifact-centric log extraction for cloud systems. Master’s thesis, Eindhoven University of Technology
de Murillas EGL, van der Aalst WMP, Reijers HA (2015) Process mining on databases: unearthing historical data from redo logs. In: BPM
de Murillas EGL, Reijers HA, van der Aalst WMP (2016) Connecting databases with process mining: a meta model and toolset. In: BMMDS/EMMSAD
Estañol M, Queralt A, Sancho MR, Teniente E (2012) Artifact-centric business process models in UML. In: Business process management workshops
Fahland D, de Leoni M, van Dongen BF, van der Aalst WMP (2011a) Behavioral conformance of artifact-centric process models. In: BIS
Fahland D, de Leoni M, van Dongen BF, van der Aalst WMP (2011b) Conformance checking of interacting processes with overlapping instances. In: BPM
Fahland D, de Leoni M, van Dongen BF, van der Aalst WMP (2011c) Many-to-many: some observations on interactions in artifact choreographies. In: ZEUS
Hariri BB, Calvanese D, Giacomo GD, Deutsch A, Montali M (2013) Verification of relational data-centric dynamic systems with external services. In: PODS
Hull R, Damaggio E, Masellis RD, Fournier F, Gupta M, Heath FT, Hobson S, Linehan MH, Maradugu S,
Nigam A, Sukaviriya N, Vaculín R (2011) Business
mining have been identified, only behavioral
artifacts with guard-stage-milestone lifecycles: manag-
conformance of artifact life cycles can currently ing artifact interactions with conditions and events. In:
be measured. Further research for measuring DEBS
Jans M (2017) From relational database to event log: decisions with quality impact. In: First international workshop on quality data for process analytics
Leemans SJJ, Fahland D, van der Aalst WMP (2013) Discovering block-structured process models from event logs – a constructive approach. In: Petri nets
Liu CS, van Dongen BF, Assy N, van der Aalst WMP (2016) Component behavior discovery from software execution data. In: 2016 IEEE symposium series on computational intelligence (SSCI), pp 1–8
Lohmann N (2011) Compliance by design for artifact-centric business processes. Inf Syst 38(4):606–618
Lohmann N, Nyolt M (2011) Artifact-centric modeling using BPMN. In: ICSOC workshops
Lu X, Nagelkerke M, van de Wiel D, Fahland D (2015) Discovering interacting artifacts from ERP systems. IEEE Trans Serv Comput 8:861–873
Meyer A, Weske M (2013) Activity-centric and artifact-centric process model roundtrip. In: Business process management workshops
Montali M, Calvanese D (2016) Soundness of data-aware, case-centric processes. Int J Softw Tools Technol Trans 18:535–558
Nigam A, Caswell NS (2003) Business artifacts: an approach to operational specification. IBM Syst J 42:428–445
Nooijen EHJ, van Dongen BF, Fahland D (2012) Automatic discovery of data-centric and artifact-centric processes. In: Business process management workshops. Lecture notes in business information processing, vol 132. Springer, pp 316–327. https://doi.org/10.1007/978-3-642-36285-9_36
OMG (2016) Case management model and notation, version 1.1. http://www.omg.org/spec/CMMN/1.1
Popova V, Dumas M (2013) Discovering unbounded synchronization conditions in artifact-centric process models. In: Business process management workshops
Popova V, Fahland D, Dumas M (2015) Artifact lifecycle discovery. Int J Coop Inf Syst 24:44
Reijers HA, Vanderfeesten ITP, Plomp MGA, Gorp PV, Fahland D, van der Crommert WLM, Garcia HDD (2015) Evaluating data-centric process approaches: does the human factor factor in? Softw Syst Model 16:649–662
vanden Broucke SKLM, Weerdt JD (2017) Fodina: a robust and flexible heuristic process discovery technique. Decis Support Syst 100:109–118
van der Aalst WMP (2016) Process mining – data science in action, 2nd edn. Springer. https://doi.org/10.1007/978-3-662-49851-4
van der Aalst WMP, Barthelmess P, Ellis CA, Wainer J (2001) Proclets: a framework for lightweight interacting workflow processes. Int J Coop Inf Syst 10:443–481
van Eck ML, Sidorova N, van der Aalst WMP (2016) Composite state machine miner: discovering and exploring multi-perspective processes. In: BPM
van Eck ML, Sidorova N, van der Aalst WMP (2017) Guided interaction exploration in artifact-centric process models. In: 2017 IEEE 19th conference on business informatics (CBI), vol 1, pp 109–118
Verbeek HMW, Buijs JCAM, van Dongen BF, van der Aalst WMP (2010) XES, XESame, and ProM 6. In: CAiSE forum
Weerdt JD, Backer MD, Vanthienen J, Baesens B (2012) A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs. Inf Syst 37(7):654–676. https://doi.org/10.1016/j.is.2012.02.004
Weijters AJMM, Ribeiro JTS (2011) Flexible heuristics miner (FHM). In: 2011 IEEE symposium on computational intelligence and data mining (CIDM), pp 310–317

Assessment

Auditing

Attestation

Auditing

Attribute-Based Access Control (ABAC)

Security and Privacy in Big Data Environment

Auditing

Francois Raab
InfoSizing, Manitou Springs, CO, USA

Synonyms

Assessment; Attestation; Certification; Review; Validation

Definitions

An examination of the implementation, execution, and results of a test or benchmark, generally
performed by an independent third party, and resulting in a report of findings.

Historical Background

The use of a third-party audit to attest to the veracity of a claim has historically been commonplace in the financial sector. The goal of such audit activities is to bolster the credibility of an organization's claims regarding its financial standing. Similar auditing activities are found in other fields where there is value in validating the level at which a set of requirements has been followed.

As formal definitions of computer systems performance benchmarks started to emerge, so did the call for independent certification of published results. The Transaction Processing Performance Council (TPC – www.tpc.org) was the first industry standard benchmark consortium to formalize the requirement for independent auditing of benchmark results.

Foundations

The purpose of auditing in the context of a performance test or benchmark is to validate that the test results were produced in compliance with the set of requirements defining the test. These requirements are used to define multiple aspects of the test, including what is being tested, how it is being tested, how the test results are measured, and how accurately they are documented. Auditing a test or benchmark result consists in validating some or all of these requirements.

Benchmark requirements can be viewed as belonging to one of the following categories: implementation rules, execution rules, results collection, pricing rules, and documentation. The motivation behind auditing a set of requirements, and the process involved in such validation, is largely based on which of these categories the requirements belong to.

Auditing the Implementation
Some benchmarks are provided in the form of a complete software kit that can simply be installed and executed to measure the underlying system. Other benchmarks are provided in the form of a set of functional requirements to be implemented using any fitting technology. In this latter case, the level at which the implementation meets the stated requirements can directly affect the results of the test. Consider a benchmark requiring the execution of two tasks acting on the same data set. An implementation that correctly implements the tasks but fails to have them act on the same data set would avoid potential data access conflicts.

The process of auditing a benchmark implementation includes all aspects of that implementation. This may include reviewing custom code, examining data generation tools and their output, executing functional testing to verify the proper behavior of required features, and validating the integration of the various benchmark components.

Auditing the Execution
Most benchmarks involve the execution of multiple steps, for a prescribed duration and in some specified sequence. The level at which these execution rules are followed can greatly impact the outcome of the test. Consider a benchmark requiring that two tasks be executed concurrently for a specified duration. An execution that runs the tasks for the specified duration but schedules them in a serial manner would avoid potential contention for system resources.

The process of auditing a benchmark execution involves verifying that controls are in place to drive the execution based on the stated rules and that sufficient traces of the execution steps are captured. This may be accomplished by reviewing execution scripts, witnessing the execution in real time, and examining logs that were produced during the execution.

Auditing the Results
The end goal of implementing and executing a benchmark is to produce a result for use in performance engineering, marketing, or another venue. While a benchmark may be implemented correctly and executed according to all stated rules, it may produce a flawed outcome if the results
are not collected properly. Consider a benchmark that involves a workload that gradually ramps up until the system under test reaches saturation, with the goal of measuring the system's behavior at that saturation point. A test that measures the system's behavior too early during ramp-up would avoid the potential disruptions caused by saturation.

The process of auditing the collection of results during a benchmark execution involves verifying that all necessary conditions are met for the measurement to be taken. In cases where metrics are computed from combining multiple measurements, it also involves verifying that all the components of the metric are properly sourced and carry sufficient precision.

Auditing the Documentation
A poorly documented benchmark result can be misunderstood and lead to incorrect conclusions. Consider a benchmark with a body of existing results that have been produced on single-socket systems. A new, record-breaking result may be received more favorably if the documentation omits to disclose that it was produced on a dual-socket system, leaving the reader to assume that a single socket was used.

The process of auditing the documentation of a benchmark result consists in verifying that all relevant information has been provided with accuracy and candor. A criterion to determine if the documentation of a test result is sufficient is whether the information provided allows another party to independently reproduce the result.
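Parts of the execution and results checks described above can be automated over execution traces. The sketch below is a hypothetical illustration, not any auditor's actual tooling: the log format, task names, and thresholds are all invented. It verifies, from timestamped task intervals, that two tasks really overlapped for the required duration and that the measurement window started only after ramp-up:

```python
# Illustrative audit checks over a benchmark execution trace.
# The log format (task name -> (start, end) in seconds) is hypothetical.

def overlap(a, b):
    """Return the length of the overlap of two (start, end) intervals."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0.0, end - start)

def audit_execution(intervals, required_duration):
    """Check that every pair of tasks ran concurrently for the required
    duration; a serialised schedule fails this check."""
    tasks = list(intervals.values())
    return all(
        overlap(tasks[i], tasks[j]) >= required_duration
        for i in range(len(tasks)) for j in range(i + 1, len(tasks))
    )

def audit_measurement_window(window, ramp_up_end):
    """Check that the measurement interval starts only after ramp-up,
    so the system is measured at saturation rather than before it."""
    return window[0] >= ramp_up_end

# A compliant run: both tasks overlap for 320 s, measured after ramp-up.
log = {"task_a": (0.0, 330.0), "task_b": (10.0, 340.0)}
assert audit_execution(log, required_duration=300.0)
assert audit_measurement_window(window=(60.0, 320.0), ramp_up_end=50.0)

# A serialised run overlaps nowhere and must be flagged.
serial = {"task_a": (0.0, 300.0), "task_b": (300.0, 600.0)}
assert not audit_execution(serial, required_duration=1.0)
```

In practice such checks would be run against the logs captured during the audited execution, complementing (not replacing) script review and real-time witnessing.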
Automated Discovery of Hierarchical Process Models

Hierarchical Process Discovery

Simplicity
Second, given two models, all other things being equal, the simplest model is usually the better of the two (a principle known as Occam's razor). That is, a model should be as understandable as possible, for instance small.

Automated Process Discovery, Fig. 2 A process model with low fitness, high precision, low generalisation and high simplicity w.r.t. L
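The quality criteria named in the caption of Fig. 2 can be made tangible on toy examples by comparing the multiset of traces in a log with the set of traces a candidate model accepts. The two measures below are deliberately naive set-based proxies, not the actual fitness and precision definitions used by the algorithms in this entry, and the log and models are invented:

```python
from collections import Counter

def trace_fitness(log, accepted):
    """Fraction of log traces (counting duplicates) the model can replay."""
    total = sum(log.values())
    fitting = sum(n for trace, n in log.items() if trace in accepted)
    return fitting / total

def trace_precision(log, accepted):
    """Fraction of the model's traces that were actually observed:
    a crude stand-in for precision (finite model languages only)."""
    observed = set(log)
    return len(observed & accepted) / len(accepted)

# Hypothetical event log: 50 cases of <a,b,d>, 50 cases of <a,c,d>.
log = Counter({("a", "b", "d"): 50, ("a", "c", "d"): 50})

# An over-general model also accepting traces that were never observed:
general = {("a", "b", "d"), ("a", "c", "d"), ("a", "d"), ("a", "b", "c", "d")}
print(trace_fitness(log, general))    # 1.0: every log trace fits
print(trace_precision(log, general))  # 0.5: half its traces were never seen

# A "trace model" that enumerates exactly the observed traces:
trace_model = {("a", "b", "d"), ("a", "c", "d")}
print(trace_fitness(log, trace_model))    # 1.0
print(trace_precision(log, trace_model))  # 1.0, but it cannot generalise
```

The trace model scores perfectly on these two proxies yet has low generalisation and low simplicity, which is exactly the trade-off the figures illustrate.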
Automated Process Discovery, Fig. 4 A process model (“trace model”) with high fitness, high precision, low generalisation and low simplicity w.r.t. L

Process Discovery Algorithms
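The directly follows graph on which several of the techniques in this section build can be computed from an event log in a few lines. The sketch below also derives, purely as an illustration of the idea behind the α-algorithm's binary activity relations, a footprint of causality ("->"), concurrency ("||"), and unrelatedness ("#"); the event log is hypothetical:

```python
from itertools import product

def directly_follows(log):
    """Count how often activity a is directly followed by activity b."""
    dfg = {}
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] = dfg.get((a, b), 0) + 1
    return dfg

def footprint(log):
    """Alpha-style relations derived from the directly follows graph:
    '->' causality, '<-' reversed causality, '||' concurrency,
    '#' never directly adjacent."""
    dfg = directly_follows(log)
    activities = sorted({a for trace in log for a in trace})
    rel = {}
    for a, b in product(activities, repeat=2):
        ab, ba = (a, b) in dfg, (b, a) in dfg
        rel[(a, b)] = "||" if ab and ba else "->" if ab else "<-" if ba else "#"
    return rel

# Hypothetical log: b and c occur in either order between a and d.
log = [("a", "b", "c", "d"), ("a", "c", "b", "d")]
rel = footprint(log)
assert rel[("a", "b")] == "->"   # a causes b
assert rel[("b", "c")] == "||"   # b and c look concurrent
assert rel[("a", "d")] == "#"    # never directly adjacent
```

Real implementations additionally deal with frequencies, noise, and incompleteness, which is precisely where the algorithms discussed in this entry differ.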
Nevertheless, directly follows-based techniques are often used to get a first idea of the process behind an event log.

Soundness-Guaranteeing Algorithms
Soundness is a prerequisite for further automated or machine-assisted analysis of business process models. In this section, soundness- or relaxed-soundness-guaranteeing algorithms are discussed.

Evolutionary Tree Miner
To address the issue of soundness, the Evolutionary Tree Miner (ETM) (Buijs et al. 2012a) discovers process trees. A process tree is an abstract hierarchical view of a workflow net and is inherently sound.

ETM first constructs an initial population of models, randomly or from other sources. Second, some models are selected based on fitness, precision, generalisation and simplicity with respect to the event log. Third, the selected models are smart-randomly mutated. This process of selection and mutation is repeated until a satisfactory model is found, or until time runs out.

ETM is flexible, as both the selection and the stopping criteria can be adjusted to the use case at hand; one can prioritise (combinations of) quality criteria. However, due to the repeated evaluation of models on large event logs of complex processes, stopping criteria might force a user to make a decision between speed and quality.

Inductive Miner Family
The Inductive Miner (IM) family of process discovery algorithms, like the Evolutionary Tree Miner, discovers process trees to guarantee that all models that are discovered are sound. The IM algorithms apply a recursive strategy: first, the "most important" behaviour of the event log is identified (such as sequence, exclusive choice, concurrency, loop, etc.). Second, the event log is split in several parts, and these steps are repeated until a base case is encountered (such as a log consisting of a single activity). If no "most important" behaviour can be identified, then the algorithms try to continue the recursion by generalising the behaviour in the log, in the worst case ultimately ending in a flower model.

Besides a basic IM (Leemans et al. 2013a), algorithms exist that focus on filtering noise (Leemans et al. 2013b), handling incomplete behaviour (when the event log misses crucial information of the process) (Leemans et al. 2014a), handling lifecycle information of events (if the log contains information of, e.g., when activities started and ended) (Leemans et al. 2015), discovering challenging constructs such as inclusive choice and silent steps (Leemans 2017), and handling very large logs and complex processes (Leemans et al. 2016), all available in the ProM framework.

Several IM algorithms guarantee to return a model that perfectly fits the event log, and all algorithms are capable of rediscovering the process, assuming that the process can be described as a process tree (with some other restrictions, such as no duplicated activities) and assuming that the event log contains "enough" information. However, due to the focus on fitness, precision tends to be lower on event logs of highly unstructured processes.

All Inductive Miner algorithms are available as plug-ins of the ProM framework, and some as plug-ins of the Apromore framework. Furthermore, the plug-in Inductive visual Miner (Leemans et al. 2014b) provides an interactive way to apply these algorithms and perform conformance checking.

An algorithm that uses a similar recursive strategy, but lets constructs compete with one another, is the Constructs Competition Miner (Redlich et al. 2014); however, its implementation has not been published.

Structured Miner
The Structured Miner (STM) (Augusto et al. 2016) applies a different strategy to obtain highly block-structured models and to trade off the log quality criteria. Instead of discovering block-structured models directly, STM first discovers BPMN models and, second, structures these models. The models can be obtained from any other discovery technique, for instance Heuristics Miner or Fodina, as these models need not be
sound. These models are translated to BPMN, after which they are made block-structured by shifting BPMN gateways in or out, thereby duplicating activities.

STM benefits from the flexibility of the underlying discovery technique to strike a flexible balance between log-quality criteria and can guarantee to return sound models. However, this guarantee comes at the price of equivalence (the model is changed, not just restructured), simplicity (activities are duplicated) and speed (the restructuring is O(n^n)).

STM is available as both a ProM and an Apromore plugin.

(Hybrid) Integer Linear Programming Miner
The Integer Linear Programming Miner (ILP) (van der Werf et al. 2009) constructs a Petri net, starting with all activities as transitions and no places, such that every activity can be arbitrarily executed. Second, it adds places using an optimisation technique: a place is only added if it does not remove any trace of the event log from the behaviour of the model. Under this condition, the behaviour is restricted as much as possible.

ILP focusses on fitness and precision: it guarantees to return a model that fits the event log, and the most precise model within its representational bias (Petri nets, no duplicated activities). However, the ILP miner does not guarantee soundness, does not handle noise and tends to return complex models (Leemans 2017).

The first two of these issues have been addressed in the HybridILPMiner (van Zelst et al. 2017), which performs internal noise filtering. Furthermore, it adjusts the optimisation step to guarantee that the final marking is always reachable and, in some cases, returns workflow nets, thereby achieving relaxed soundness.

Declarative Techniques
Petri nets and BPMN models express what can happen when executing the model. In contrast, declarative models, such as Declare models, express what cannot happen when executing the model, thereby providing greater flexibility in modelling. Declare miners such as Maggi et al. (2011), Di Ciccio et al. (2016), and Ferilli et al. (2016) discover the constraints of which Declare models consist using several acceptance criteria, in order to be able to balance precision and fitness. However, using such models in practice tends to be challenging (Augusto et al. 2017b).

Other Algorithms
Unsound models are unsuitable for further automated processing; however, they might be useful for manual analysis. In the remainder of this section, several algorithms are discussed that do not guarantee soundness.

α-Algorithms
The first process discovery algorithm described was the α-algorithm (van der Aalst et al. 2004). The α-algorithm considers the directly follows graph and identifies three types of relations between sets of activities from the graph: sequence, concurrency and mutual exclusivity. From these relations, a Petri net is constructed by searching for certain maximal patterns.

The α-algorithm is provably (Badouel 2012) able to rediscover some processes, assuming that the log contains enough information and with restrictions on the process. In later versions, several restrictions have been addressed, such as: (a) no short loops (activities can follow one another directly; addressed in α+ (de Medeiros et al. 2004)), (b) no long-distance dependencies (choices later in the process depend on choices made earlier; addressed in Wen et al. (2006)), (c) no non-free-choice constructs (transitions that share input places have the same input places; addressed in α++ (Wen et al. 2007a)), and (d) no silent transitions (addressed in α# (Wen et al. 2007b, 2010) and in α$ (Guo et al. 2015)). Furthermore, a variant has been proposed, called the Tsinghua-α (Wen et al. 2009), that deals with non-atomic event logs, that is, event logs in which executions of activities take time.

However, these algorithms guarantee neither soundness nor perfect fitness nor perfect precision; moreover, they cannot handle noise or incompleteness. Furthermore, the α-algorithms might be less fast on complex event logs, as typically they are exponential. Therefore,
the α-algorithms are not very suitable to be applied to real-life logs.

Little Thumb (Weijters and van der Aalst 2003) extends the α-algorithms with noise-handling capabilities: instead of considering binary activity relations, these relations are derived probabilistically and then filtered according to a user-set threshold.

Causal-Net Miners
The Flexible Heuristics Miner (FHM) (Weijters and Ribeiro 2011) uses the probabilistic activity relations of Little Thumb and focuses on soundness. To solve the issue of soundness, FHM returns causal nets, a model formalism in which it is defined that non-sound parts of the model are not part of the behaviour of the net.

The Fodina algorithm (vanden Broucke and Weerdt 2017) extends FHM with long-distance dependency support and, in some cases, duplicate activities. The Proximity miner (Yahya et al. 2016) extends FHM by incorporating domain knowledge. For more algorithms using causal nets, please refer to Weerdt et al. (2012) and Augusto et al. (2017b).

Even though causal nets are sound by definition, they place the burden of soundness checking on the interpreter/user of the net, and this still does not guarantee, for instance, that every activity in the model can be executed. Therefore, translating a causal net to a Petri net or BPMN model for further processing does not guarantee soundness of the translated model.

FHM, Fodina (http://www.processmining.be/fodina) and Proximity Miner (https://sourceforge.net/projects/proxi-miner/) are all available as ProM plug-ins and/or Apromore plug-ins.

Split Miner
To strike a different balance in log-quality criteria compared to IM, which favours fitness, while improving in speed over ETM, Split Miner (SPM) (Augusto et al. 2017a) preprocesses the directly follows graph before constructing a BPMN model. In the preprocessing of directly follows graphs, first, loops and concurrency are identified and filtered out. Second, the graph is filtered in an optimisation step: each node must be on a path from start to end, the total number of edges is minimised, while the sum of edge frequencies is maximised. Then, splits and joins (BPMN gateways) are inserted to construct a BPMN model.

SPM aims to improve over the precision of IM and the speed of ETM for real-life event logs. The balance between precision and fitness can be adjusted in the directly follows-optimisation step, which allows users to adjust the amount of noise filtering. However, the returned models are not guaranteed to be sound (proper completion is not guaranteed), and several OR-joins might be inserted, which increases complexity.

SPM is available as a plug-in of Apromore and as a stand-alone tool via https://doi.org/10.6084/m9.figshare.5379190.v1.

Conclusion

Many process mining techniques require a process model as a prerequisite. From an event log, process discovery algorithms aim to discover a process model, this model preferably having clear semantics, being sound, striking a user-adjustable balance between fitness, precision, generalisation and simplicity, and giving confidence that the model represents the business process from which the event log was recorded. Three types of process discovery algorithms were discussed: directly follows-based techniques, soundness-guaranteeing algorithms and other algorithms, all targeting a subset of these quality criteria.

In explorative process mining projects, choosing a discovery algorithm and its parameters is a matter of repeatedly trying soundness-guaranteeing algorithms, evaluating their results using conformance checking, and adjusting algorithm, parameters and event log as new questions pop up (van Eck et al. 2015).

Cross-References

Conformance Checking
Decision Discovery in Business Processes
Application and theory of Petri Nets and concurrency – Proceedings of the 35th international conference, PETRI NETS, Tunis, 23–27 June 2014. Lecture notes in computer science, vol 8489. Springer, pp 91–110. http://doi.org/10.1007/978-3-319-07734-5_6
Leemans SJJ, Fahland D, van der Aalst WMP (2014b) Process and deviation exploration with inductive visual miner. In: Limonad L, Weber B (eds) Proceedings of the BPM demo sessions 2014, co-located with the 12th international conference on business process management (BPM), Eindhoven, 10 Sept 2014. CEUR-WS.org, CEUR workshop proceedings, vol 1295, p 46. http://ceur-ws.org/Vol-1295/paper19.pdf
Leemans SJJ, Fahland D, van der Aalst WMP (2015) Using life cycle information in process discovery. In: Reichert M, Reijers HA (eds) Business process management workshops – BPM, 13th international workshops, Innsbruck, 31 Aug–3 Sept 2015, Revised papers. Lecture notes in business information processing, vol 256. Springer, pp 204–217. http://doi.org/10.1007/978-3-319-42887-1_17
Leemans SJJ, Fahland D, van der Aalst WMP (2016) Scalable process discovery and conformance checking. Softw Syst Model Special issue:1–33. http://doi.org/10.1007/s10270-016-0545-x
Maggi FM, Mooij AJ, van der Aalst WMP (2011) User-guided discovery of declarative process models. In: DBL (2011), pp 192–199. http://doi.org/10.1109/CIDM.2011.5949297
OMG (2011) Business process model and notation (BPMN) version 2.0. Technical report, Object management group (OMG)
ProcessGold (2017) Enterprise platform. http://processgold.com/en/. [Online; accessed 11 Nov 2017]
Redlich D, Molka T, Gilani W, Blair GS, Rashid A (2014) Constructs competition miner: process control-flow discovery of bp-domain constructs. In: Sadiq SW, Soffer P, Völzer H (eds) Business process management – Proceedings of the 12th international conference, BPM, Haifa, 7–11 Sept 2014. Lecture notes in computer science, vol 8659. Springer, pp 134–150. http://doi.org/10.1007/978-3-319-10172-9_9
Reisig W (1992) A primer in Petri net design. Springer compass international. Springer, Berlin/New York
Rosa ML, Reijers HA, van der Aalst WMP, Dijkman RM, Mendling J, Dumas M, García-Bañuelos L (2011) APROMORE: an advanced process model repository. Expert Syst Appl 38(6):7029–7040. http://doi.org/10.1016/j.eswa.2010.12.012
vanden Broucke SKLM, Weerdt JD (2017) Fodina: a robust and flexible heuristic process discovery technique. Decis Support Syst 100:109–118. http://doi.org/10.1016/j.dss.2017.04.005
van der Aalst WMP (2016) Process mining – data science in action, 2nd edn. Springer. http://www.springer.com/gp/book/9783662498507
van der Aalst W, Weijters A, Maruster L (2004) Workflow mining: discovering process models from event logs. IEEE Trans Knowl Data Eng 16(9):1128–1142
van der Werf JMEM, van Dongen BF, Hurkens CAJ, Serebrenik A (2009) Process discovery using integer linear programming. Fundam Inform 94(3–4):387–412. http://doi.org/10.3233/FI-2009-136
van Dongen BF, de Medeiros AKA, Verbeek HMW, Weijters AJMM, van der Aalst WMP (2005) The ProM framework: a new era in process mining tool support. In: Ciardo G, Darondeau P (eds) Applications and theory of Petri Nets 2005, Proceedings of the 26th international conference, ICATPN, Miami, 20–25 June 2005. Lecture notes in computer science, vol 3536. Springer, pp 444–454. http://doi.org/10.1007/11494744_25
van Eck ML, Lu X, Leemans SJJ, van der Aalst WMP (2015) PM²: a process mining project methodology. In: Zdravkovic J, Kirikova M, Johannesson P (eds) Advanced information systems engineering – Proceedings of the 27th international conference, CAiSE, Stockholm, 8–12 June 2015. Lecture notes in computer science, vol 9097. Springer, pp 297–313. http://doi.org/10.1007/978-3-319-19069-3_19
van Zelst SJ, van Dongen BF, van der Aalst WMP, Verbeek HMW (2017) Discovering relaxed sound workflow nets using integer linear programming. CoRR abs/1703.06733. http://arxiv.org/abs/1703.06733
Weerdt JD, Backer MD, Vanthienen J, Baesens B (2012) A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs. Inf Syst 37(7):654–676. http://doi.org/10.1016/j.is.2012.02.004
Weijters AJMM, Ribeiro JTS (2011) Flexible heuristics miner (FHM). In: DBL (2011), pp 310–317. http://doi.org/10.1109/CIDM.2011.5949453
Weijters AJMM, van der Aalst WMP (2003) Rediscovering workflow models from event-based data using little thumb. Integr Comput Aided Eng 10(2):151–162. http://content.iospress.com/articles/integrated-computer-aided-engineering/ica00143
Wen L, Wang J, Sun J (2006) Detecting implicit dependencies between tasks from event logs. In: Zhou X, Li J, Shen HT, Kitsuregawa M, Zhang Y (eds) Frontiers of WWW research and development – APWeb 2006, Proceedings of the 8th Asia-Pacific web conference, Harbin, 16–18 Jan 2006. Lecture notes in computer science, vol 3841. Springer, pp 591–603. http://doi.org/10.1007/11610113_52
Wen L, van der Aalst WMP, Wang J, Sun J (2007a) Mining process models with non-free-choice constructs. Data Min Knowl Discov 15(2):145–180. http://doi.org/10.1007/s10618-007-0065-y
Wen L, Wang J, Sun J (2007b) Mining invisible tasks from event logs. In: Dong G, Lin X, Wang W, Yang Y, Yu JX (eds) Advances in data and web management, Joint 9th Asia-Pacific web conference, APWeb 2007, and Proceedings of the 8th international conference on web-age information management, WAIM, Huang Shan, 16–18 June 2007. Lecture notes in computer science, vol 4505. Springer, pp 358–365. http://doi.org/10.1007/978-3-540-72524-4_38
Wen L, Wang J, van der Aalst WMP, Huang B, Sun J (2009) A novel approach for process mining based on
130 Automated Reasoning
event types. J Intell Inf Syst 32(2):163–190. http://doi. of “A Bit of History”). In the context of big
Automated Reasoning

Jeff Z. Pan1 and Jianfeng Du2
1University of Aberdeen, Scotland, UK
2Guangdong University of Foreign Studies, Guangdong, China

Synonyms

Approximate reasoning; Inference; Logical reasoning; Semantic computing

Definitions

Reasoning is the process of deriving conclusions in a logical way. Automated reasoning is concerned with the construction of computing systems that automate this process over some knowledge bases. Automated reasoning is often considered a subfield of artificial intelligence. It is also studied in the fields of theoretical computer science and even philosophy.

Overview

The development of formal logic (Frege 1884) played a big role in the field of automated reasoning, which itself led to the development of artificial intelligence.

Historically, automated reasoning is largely related to theorem proving, general problem solvers, and expert systems (cf. the section "A Bit of History"). In the era of big data processing, automated reasoning is more relevant to modern knowledge representation languages, such as the W3C standard Web Ontology Language (OWL) (https://www.w3.org/TR/owl2-overview/), in which a knowledge base consists of a schema component (TBox) and a data component (ABox).

From the application perspective, perhaps the most well-known modern knowledge representation mechanism is the Knowledge Graph (Pan et al. 2016b, 2017). In 2012, Google popularized the term "Knowledge Graph" by using it to improve its search engine. Knowledge Graphs were then adopted by most leading search engines (such as Bing and Baidu) and many leading IT companies (such as IBM and Facebook). The basic idea of the Knowledge Graph is based on the knowledge representation formalism of semantic networks. There is a modern W3C standard for semantic networks called RDF (Resource Description Framework, https://www.w3.org/TR/rdf11-concepts/). Thus RDF/OWL graphs can be seen as exchangeable knowledge graphs in the big data era.

While this entry is mainly about automated reasoning techniques in the big data era, their classifications, key contributions, typical systems, as well as their applications, it starts with a brief introduction of the history.

A Bit of History

Many consider the Cornell Summer Meeting of 1957, which brought together many logicians and computer scientists, as the origin of automated reasoning.

The first automated reasoning systems were theorem provers, systems that represent axioms and statements in first-order logic and then use rules of logic, such as modus ponens, to infer new statements. The first system of this kind was the implementation of Presburger's decision procedure (which proved that the sum of two even numbers is even) by Davis (1957).

Another early type of automated reasoning system was the general problem solver, which
attempts to provide a generic planning engine that could represent and solve structured problems, by decomposing problems into smaller, more manageable subproblems, solving each subproblem, and assembling the partial answers into one final answer. The first system of this kind was the Logic Theorist from Newell et al. (1957).

The first practical applications of automated reasoning were expert systems, which focused on much more well-defined domains than general problem solving, such as medical diagnosis or analyzing faults in an aircraft, and on more limited implementations of first-order logic, such as modus ponens implemented via IF-THEN rules. One of the forerunners of these systems is MYCIN by Shortliffe (1974).

Since the 1980s, there have been prosperous studies of practical subsets of first-order logic as ontology languages, such as description logics (Baader et al. 2003) and answer set programming (Lifschitz 2002), as well as the standardization of the ontology language OWL (version 1 in 2004 and version 2 in 2009). The wide adoption of ontologies and Knowledge Graphs (Pan et al. 2016b, 2017), including by Google and many other leading IT companies, confirms the status of ontology languages in the big data era.

In the rest of the entry, we will focus on automated reasoning with ontology languages.

Classification

There can be different ways of classifying research problems related to automated ontology reasoning.

From the purpose point of view, automatic ontology reasoning can be classified into (1) deductive ontology reasoning (Levesque and Brachman 1987), which draws conclusions from given premises; (2) abductive ontology reasoning (Colucci et al. 2003), which finds explanations for observations that are not consequences of given premises; as well as (3) inductive ontology reasoning (Lisi and Malerba 2003), which concludes that all instances of a class have a certain property if some instances of the class have the property.

From the direction point of view, automatic ontology reasoning can be classified into (1) forward reasoning (Baader et al. 2005), in which the inference starts with the premises, moves forward, and ends with the conclusions; (2) backward reasoning (Grosof et al. 2003), in which the inference starts with the conclusions, moves backward, and ends with the premises; as well as (3) bidirectional reasoning (MacGregor 1991), in which the inference starts with both the premises and the conclusions and moves forward and backward simultaneously or interactively, until the intermediate conclusions obtained by forward steps include all intermediate premises required by backward steps.

From the monotonicity point of view, automatic ontology reasoning can be classified into (1) monotonic ontology reasoning, in which no existing conclusions will be dropped when new premises are added, as well as (2) nonmonotonic ontology reasoning (Quantz and Suska 1994), in which some existing conclusions can be dropped when new premises are added.

From the scalability point of view, automatic ontology reasoning can be classified into (1) parallel ontology reasoning (Bergmann and Quantz 1995), in which reasoning algorithms can exploit multiple computation cores in a computation node, and (2) distributed ontology reasoning (Borgida and Serafini 2003; Serafini and Tamilin 2005), in which reasoning algorithms can exploit a cluster of computation nodes. Scalable ontology reasoning is also often related to strategies of modularization (Suntisrivaraporn et al. 2008) and approximation (Pan and Thomas 2007).

From the mobility point of view, automated ontology reasoning can be classified into (1) reasoning with temporal ontologies (Artale and Franconi 1994), in which the target ontologies contain temporal constructors for class and property descriptions, and (2) stream ontology reasoning (Stuckenschmidt et al. 2010; Ren and Pan 2011), which, given some continuous updates of the ontology, requires updating reasoning results without naively recomputing all results.

From the certainty point of view, automatic reasoning can be classified into (1) ontology reasoning with certainty, in which both premises and
conclusions are certain and either true or false, as well as (2) uncertainty ontology reasoning (Koller et al. 1997), in which either premises or conclusions are uncertain and often have truth values between 0 and 1. There are different kinds of uncertainties within ontologies, such as probabilistic ontologies (Koller et al. 1997), fuzzy ontologies (Straccia 2001), and possibilistic ontologies (Qi et al. 2011).

Key Contributions

The highlight among the contributions of automated ontology reasoning is the standardization of the Web Ontology Language (OWL).

The first version of OWL (or OWL 1) was standardized in 2004. It is based on the SHOIQ DL (Horrocks and Sattler 2005). However, there are some limitations of OWL 1:

1. The datatype support is limited (Pan and Horrocks 2006);
2. The only sub-language of OWL 1, OWL-Lite, is not tractable;
3. The semantics of OWL 1 and RDF are not fully compatible (Pan and Horrocks 2003).

The second version of OWL (or OWL 2) was standardized in 2009. It is based on the SROIQ DL (Horrocks et al. 2006). On the one hand, OWL 2 has more expressive power, such as stronger support of datatypes (Pan and Horrocks 2006; Motik and Horrocks 2008) and rules (Krötzsch et al. 2008). On the other hand, OWL 2 has three tractable sub-languages: OWL 2 EL (Baader et al. 2005), OWL 2 QL (Calvanese et al. 2007), and OWL 2 RL (Grosof et al. 2003).

This two-layer architecture of OWL 2 allows approximating OWL 2 ontologies by ontologies in its tractable sub-languages, such as approximations toward OWL 2 QL (Pan and Thomas 2007), toward OWL 2 EL (Ren et al. 2010), and toward OWL 2 RL (Zhou et al. 2013), so as to exploit efficient and scalable reasoners of the sub-languages. The motivation is based on the fact that real-world knowledge and data are hardly perfect or completely digitalized.

Typical Reasoning Systems

Below are descriptions of some well-known OWL reasoners (in alphabetical order).

CEL
CEL (Baader et al. 2006) is a LISP-based reasoner for EL+ (Baader et al. 2008), which covers the core part of OWL 2 EL. CEL is the first reasoner for the description logic EL+, supporting as its main reasoning task the computation of the subsumption hierarchy induced by EL+ ontologies.

ELK
ELK (Kazakov et al. 2012) is an OWL 2 EL reasoner. At its core, ELK uses a highly optimized parallel algorithm (Kazakov et al. 2011). It supports stream reasoning in OWL 2 EL (Kazakov and Klinov 2013).

FaCT
FaCT (Horrocks 1998) is a reasoner for the description logic SHIF (OWL-Lite). It is the first modern reasoner that demonstrated the feasibility of using optimized algorithms for subsumption checking in realistic applications.

FaCT++
FaCT++ (Tsarkov and Horrocks 2006) is a reasoner for (partially) OWL 2. It is the new generation of the well-known FaCT reasoner, implemented in C++ with a different internal architecture and some new optimizations.

HermiT
HermiT (Glimm et al. 2014) is a reasoner for OWL 2. It is the first publicly available OWL 2 reasoner based on a hypertableau calculus (Motik et al. 2009), with a highly optimized algorithm for ontology classification (Glimm et al. 2010). HermiT can handle DL-Safe rules (Motik et al. 2005) on top of OWL 2.
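The classification task performed by reasoners such as CEL, ELK, and HermiT, namely computing all entailed subsumptions between classes, can be illustrated with a toy forward-saturation sketch in Python. This is only a minimal illustration of the idea (the class names are hypothetical, and the two inference rules cover just a tiny fragment of OWL 2 EL), not the calculus that any of these systems actually implements:

```python
def classify(sub_axioms, conj_axioms, classes):
    """Saturate told axioms into the full subsumption hierarchy.

    sub_axioms:  set of (A, B) pairs meaning "A SubClassOf B"
    conj_axioms: set of ((B1, B2), C) meaning "(B1 and B2) SubClassOf C"
    Returns the set of all derivable (A, B) subsumptions.
    """
    # Start from reflexivity (every class subsumes itself) plus told axioms.
    entailed = {(c, c) for c in classes} | set(sub_axioms)
    changed = True
    while changed:  # forward chaining until a fixpoint is reached
        changed = False
        new = set()
        # Rule 1 (transitivity): A <= B and B <= C imply A <= C.
        for (a, b1) in entailed:
            for (b2, c) in entailed:
                if b1 == b2 and (a, c) not in entailed:
                    new.add((a, c))
        # Rule 2 (conjunction): A <= B1, A <= B2, (B1 and B2) <= C imply A <= C.
        for ((b1, b2), c) in conj_axioms:
            for a in classes:
                if (a, b1) in entailed and (a, b2) in entailed \
                        and (a, c) not in entailed:
                    new.add((a, c))
        if new:
            entailed |= new
            changed = True
    return entailed

# Hypothetical toy ontology, not from any cited system:
classes = {"Father", "Man", "Person", "Parent"}
subs = {("Father", "Man"), ("Man", "Person"), ("Father", "Parent")}
conj = {(("Person", "Parent"), "Father")}
hierarchy = classify(subs, conj, classes)
assert ("Father", "Person") in hierarchy  # derived by transitivity
```

Production EL reasoners apply such rules with indexed data structures, and in ELK's case concurrently, which is what makes classifying very large ontologies feasible in practice.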
Ontop
Ontop (Calvanese et al. 2016) is an ontology-based data access (OBDA) management system for RDF and OWL 2 QL, as well as SWRL with limited forms of recursion. It also supports efficient SPARQL-to-SQL mappings via R2RML (Rodriguez-Muro and Rezk 2015). Ontop has some optimizations for query rewriting based on database dependencies (Rodriguez-Muro et al. 2013).

Pellet
Pellet (Sirin et al. 2007) is a reasoner for OWL 2. It also has dedicated support for OWL 2 EL. It incorporates optimizations for nominals, conjunctive query answering, and incremental reasoning.

Racer
Racer (Haarslev and Möller 2001) is a reasoner for OWL 1. It has a highly optimized version of the tableau calculus for the description logic SHIQ(D) (Horrocks and Patel-Schneider 2003).

RDFox
RDFox (Motik et al. 2014) is a highly scalable in-memory RDF triple store that supports shared-memory parallel datalog (Ceri et al. 1989) reasoning. It supports stream reasoning (Motik et al. 2015b).

Applications

Automated ontology reasoning has been widely used in web applications, such as for content management (BBC), travel planning and booking (Skyscanner), and web search (Google, Bing, Baidu).

It is also being applied in a growing number of vertical domains. One typical example is life science. For instance, the OBO Foundry includes more than 100 biological and biomedical ontologies. The SNOMED CT (Clinical Terminology) ontology is widely used in the healthcare systems of over 15 countries, including the USA, the UK, Australia, Canada, Denmark, and Spain. It is also used by major US providers, such as Kaiser Permanente. Other vertical domains include, but are not limited to, agriculture, astronomy, oceanography, defense, education, energy management, geography, and geoscience.

While ontologies are widely used as structured vocabularies, providing an integrated and user-centric view of heterogeneous data sources in the big data era, the benefits of using automated ontology reasoning include:

1. Reasoning support is critical for the development and maintenance of ontologies, in particular for the derivation of a taxonomy from class definitions and descriptions.
2. Easy location of relevant terms within a large structured vocabulary;
3. Query answers enhanced by exploiting the schema and class hierarchy.

An example in the big data context is the use of ontologies and automated ontology reasoning for data access in Statoil, where about 900 geologists and geophysicists use data from previous operations in nearby locations to develop stratigraphic models of unexplored areas, involving diverse schemata and terabytes of relational data spread over thousands of tables and multiple databases. Data analysis is the most important factor for drilling success, yet 30-70% of these geologists' and geophysicists' time is spent on data gathering. The use of ontologies and automated ontology reasoning enables better use of the experts' time, reducing the turnaround for new queries significantly.

Outlook

Despite the current success of automated ontology reasoning, there are still some pressing challenges in the big data era, such as the following:

1. Declarative data analytics (Kaminski et al. 2017) based on automated ontology reasoning;
2. Effective approaches for producing high-quality ontologies and Knowledge Graphs (Ren et al. 2014; Konev et al. 2014);
3. Integration of automated ontology reasoning with data mining (Lecue and Pan 2015) and machine learning (Chen et al. 2017) approaches.

References

Artale A, Franconi E (1994) A computational account for a description logic of time and action. In: KR1994, pp 3–14
Baader F, Calvanese D, McGuinness DL, Nardi D, Patel-Schneider PF (eds) (2003) The description logic handbook: theory, implementation, and applications. Cambridge University Press, New York
Baader F, Brandt S, Lutz C (2005) Pushing the EL envelope. In: IJCAI2005
Baader F, Lutz C, Suntisrivaraporn B (2006) CEL—a polynomial-time reasoner for life science ontologies. In: IJCAR2006, pp 287–291
Baader F, Brandt S, Lutz C (2008) Pushing the EL envelope further. In: OWLED2008
Bergmann F, Quantz J (1995) Parallelizing description logics. In: KI1995, pp 137–148
Borgida A, Serafini L (2003) Distributed description logics. J Data Semant 1:153–184
Calvanese D, Giacomo GD, Lembo D, Lenzerini M, Rosati R (2007) Tractable reasoning and efficient query answering in description logics: the DL-Lite family. J Autom Reason 39:385–429
Calvanese D, De Giacomo G, Lembo D, Lenzerini M, Poggi A, Rodriguez-Muro M, Rosati R, Ruzzi M, Savo DF (2011) The MASTRO system for ontology-based data access. J Web Semant 2:43–53
Calvanese D, Cogrel B, Komla-Ebri S, Kontchakov R, Lanti D, Rezk M, Rodriguez-Muro M, Xiao G (2016) Ontop: answering SPARQL queries over relational databases. Semant Web J 8:471–487
Ceri S, Gottlob G, Tanca L (1989) What you always wanted to know about Datalog (and never dared to ask). IEEE Trans Knowl Data Eng 1:146–166
Chen J, Lecue F, Pan JZ, Chen H (2017) Learning from ontology streams with semantic concept drift. In: IJCAI2017, pp 957–963
Colucci S, Noia TD, Sciascio ED, Donini F (2003) Concept abduction and contraction in description logics. In: DL2003
Davis M (1957) A computer program for Presburger's algorithm. In: Summaries of talks presented at the summer institute for symbolic logic, Cornell University, pp 215–233
Frege G (1884) Die Grundlagen der Arithmetik. Wilhelm Koebner, Breslau
Glimm B, Horrocks I, Motik B, Stoilos G (2010) Optimising ontology classification. In: ISWC2010, pp 225–240
Glimm B, Horrocks I, Motik B, Stoilos G, Wang Z (2014) HermiT: an OWL 2 reasoner. J Autom Reason 53:245–269
Grosof BN, Horrocks I, Volz R, Decker S (2003) Description logic programs: combining logic programs with description logic. In: WWW2003, pp 48–57
Haarslev V, Möller R (2001) RACER system description. In: IJCAR2001, pp 701–705
Horrocks I (1998) Using an expressive description logic: FaCT or fiction? In: KR1998
Horrocks I, Patel-Schneider P (2003) From SHIQ and RDF to OWL: the making of a web ontology language. J Web Semant 1:7–26
Horrocks I, Sattler U (2005) A tableaux decision procedure for SHOIQ. In: IJCAI2005, pp 448–453
Horrocks I, Kutz O, Sattler U (2006) The even more irresistible SROIQ. In: KR2006, pp 57–67
Kaminski M, Grau BC, Kostylev EV, Motik B, Horrocks I (2017) Foundations of declarative data analysis using limit datalog programs. In: Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI-17, pp 1123–1130. https://doi.org/10.24963/ijcai.2017/156
Kazakov Y, Klinov P (2013) Incremental reasoning in OWL EL without bookkeeping. In: ISWC2013, pp 232–247
Kazakov Y, Krötzsch M, Simancik F (2011) Concurrent classification of EL ontologies. In: ISWC2011, pp 305–320
Kazakov Y, Krötzsch M, Simancik F (2012) ELK reasoner: architecture and evaluation. In: ORE2012
Koller D, Levy A, Pfeffer A (1997) P-CLASSIC: a tractable probabilistic description logic. In: AAAI1997
Konev B, Lutz C, Ozaki A, Wolter F (2014) Exact learning of lightweight description logic ontologies. In: KR2014, pp 298–307
Krötzsch M, Rudolph S, Hitzler P (2008) Description logic rules. In: ECAI2008, pp 80–84
Lecue F, Pan JZ (2015) Consistent knowledge discovery from evolving ontologies. In: AAAI2015
Levesque HJ, Brachman RJ (1987) Expressiveness and tractability in knowledge representation and reasoning. Comput Intell 3:78–93
Lifschitz V (2002) Answer set programming and plan generation. Artif Intell 138:39–54
Lisi FA, Malerba D (2003) Ideal refinement of descriptions in AL-Log. In: ICILP2003
Lutz C, Seylan I, Wolter F (2013) Ontology-based data access with closed predicates is inherently intractable (sometimes). In: IJCAI2013, pp 1024–1030
MacGregor RM (1991) Inside the LOOM description classifier. ACM SIGART Bull – special issue on implemented knowledge representation and reasoning systems 2:88–92
Motik B, Horrocks I (2008) OWL datatypes: design and implementation. In: ISWC2008, pp 307–322
Motik B, Sattler U, Studer R (2005) Query answering for OWL-DL with rules. J Web Semant 3:41–60
Motik B, Shearer R, Horrocks I (2009) Hypertableau reasoning for description logics. J Artif Intell Res 36:165–228
Motik B, Nenov Y, Piro R, Horrocks I, Olteanu D (2014) Parallel materialisation of datalog programs in centralised, main-memory RDF systems. In: AAAI2014
Motik B, Nenov Y, Piro R, Horrocks I (2015a) Handling of owl:sameAs via rewriting. In: AAAI2015
Motik B, Nenov Y, Piro R, Horrocks I (2015b) Incremental update of datalog materialisation: the backward/forward algorithm. In: AAAI2015
Newell A, Shaw C, Simon H (1957) Empirical explorations of the logic theory machine. In: Proceedings of the 1957 western joint computer conference
Pan JZ, Horrocks I (2003) RDFS(FA) and RDF MT: two semantics for RDFS. In: Fensel D, Sycara K, Mylopoulos J (eds) ISWC2003
Pan JZ, Horrocks I (2006) OWL-Eu: adding customised datatypes into OWL. J Web Semant 4:29–39
Pan JZ, Thomas E (2007) Approximating OWL-DL ontologies. In: Proceedings of the 22nd national conference on artificial intelligence (AAAI-07), pp 1434–1439
Pan JZ, Ren Y, Zhao Y (2016a) Tractable approximate deduction for OWL. Artif Intell 235:95–155
Pan JZ, Vetere G, Gomez-Perez J, Wu H (2016b) Exploiting linked data and knowledge graphs for large organisations. Springer, Switzerland
Pan JZ, Calvanese D, Eiter T, Horrocks I, Kifer M, Lin F, Zhao Y (2017) Reasoning web: logical foundation of knowledge graph construction and querying answering. Springer, Cham
Qi G, Ji Q, Pan JZ, Du J (2011) Extending description logics with uncertainty reasoning in possibilistic logic. Int J Intell Syst 26:353–381
Quantz JJ, Suska S (1994) Weighted defaults in description logics: formal properties and proof theory. In: Annual conference on artificial intelligence, pp 178–189
Ren Y, Pan JZ (2011) Optimising ontology stream reasoning with truth maintenance system. In: CIKM2011
Ren Y, Pan JZ, Zhao Y (2010) Soundness preserving approximation for TBox reasoning. In: AAAI2010
Ren Y, Parvizi A, Mellish C, Pan JZ, van Deemter K, Stevens R (2014) Towards competency question-driven ontology authoring. In: Proceedings of the 11th extended semantic web conference (ESWC 2014)
Ren Y, Pan JZ, Guclu I, Kollingbaum M (2016) A combined approach to incremental reasoning for EL ontologies. In: RR2016
Rodriguez-Muro M, Rezk M (2015) Efficient SPARQL-to-SQL with R2RML mappings. J Web Semant 33:141–169
Rodriguez-Muro M, Kontchakov R, Zakharyaschev M (2013) Query rewriting and optimisation with database dependencies in Ontop. In: DL2013
Rosati R, Almatelli A (2010) Improving query answering over DL-Lite ontologies. In: KR2010, pp 290–300
Serafini L, Tamilin A (2005) Drago: distributed reasoning architecture for the semantic web. In: ESWC2005, pp 361–376
Shortliffe EH (1974) MYCIN: a rule-based computer program for advising physicians regarding antimicrobial therapy selection. PhD thesis, Stanford University
Sirin E, Parsia B, Grau B, Kalyanpur A, Katz Y (2007) Pellet: a practical OWL-DL reasoner. J Web Semant 5:51–53
Steigmiller A, Glimm B (2015) Pay-as-you-go description logic reasoning by coupling tableau and saturation procedures. J Artif Intell Res 54:535–592
Steigmiller A, Glimm B, Liebig T (2014a) Reasoning with nominal schemas through absorption. J Autom Reason 53:351–405
Steigmiller A, Liebig T, Glimm B (2014b) Konclude: system description. J Web Semant 27:78–85
Straccia U (2001) Reasoning within fuzzy description logics. J Artif Intell Res 14:147–176
Stuckenschmidt H, Ceri S, Valle ED, van Harmelen F (2010) Towards expressive stream reasoning. In: Semantic challenges in sensor networks, no. 10042 in Dagstuhl seminar proceedings
Suntisrivaraporn B, Qi G, Ji Q, Haase P (2008) A modularization-based approach to finding all justifications for OWL DL entailments. In: ASWC2008, pp 1–15
Thomas E, Pan JZ, Ren Y (2010) TrOWL: tractable OWL 2 reasoning infrastructure. In: ESWC2010
Tsarkov D, Horrocks I (2006) FaCT++ description logic reasoner: system description. In: IJCAR2006, pp 292–297
Zhou Y, Grau BC, Horrocks I, Wu Z, Banerjee J (2013) Making the most of your triple store: query answering in OWL 2 using an RL reasoner. In: WWW2013, pp 1569–1580


Availability

Security and Privacy in Big Data Environment
often refers to a functional validation of a system's specific behavior.

A benchmark harness often provides reusable functionality that can be shared across a set of benchmarks or workloads and provides well-defined interfaces to implement or integrate multiple benchmarks or workloads into the same harness and to execute them. Such common functionality may include:

• Preparing and configuring the test environment
• Generating and loading of data
• Starting, supervising, and stopping of a benchmark workload
• Scheduling and accounting of benchmark operations
• Monitoring of the system under test
• Collecting runtime statistics from the benchmark workload and system under test
• Generating benchmark reports (including benchmark metrics and result validation)

Various benchmark harness designs exist in literature. This article summarizes some of them and provides an overview of features and functionality a benchmark harness should provide.

Han (Han et al. 2014) formulates requirements on how to design big data benchmarks (or big data benchmark harnesses). He proposes a design consisting of a user interface layer, a function layer, and an execution layer. The user interface layer shall provide users the ability to select benchmark workloads, data sets, and scale and to initiate benchmark runs. The function layer shall consist of data generators, test generators, and metrics, while the execution layer shall provide system configuration tools, data format conversion tools, a result analyzer, and a result reporter. All these components shall be shared across benchmarks and data sets and provide mechanisms to implement various benchmark workloads on a common harness, which can be adapted to a variety of environments.
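The common functionality listed above can be factored into a reusable harness core with a thin per-workload interface. The following Python sketch is only an illustration of that split; the names Workload, Harness, and SortWorkload are hypothetical and are not taken from Han's design or from any of the systems cited here:

```python
from abc import ABC, abstractmethod

class Workload(ABC):
    """Interface each benchmark workload implements."""
    @abstractmethod
    def generate_data(self, scale: int) -> list: ...
    @abstractmethod
    def run(self, data: list) -> dict: ...  # returns raw measurements

class Harness:
    """Shared functionality: prepare, execute, collect, report."""
    def __init__(self, workload: Workload, scale: int):
        self.workload, self.scale = workload, scale
        self.stats: dict = {}

    def execute(self) -> dict:
        data = self.workload.generate_data(self.scale)  # generate and load
        self.stats = self.workload.run(data)            # start, supervise, stop
        return self.report()

    def report(self) -> dict:
        # Benchmark report: metrics plus a simple result validation.
        ops, secs = self.stats["ops"], self.stats["seconds"]
        return {"throughput_ops_per_s": ops / secs,
                "passed": ops == self.scale}

class SortWorkload(Workload):  # toy workload for demonstration
    def generate_data(self, scale):
        return list(range(scale, 0, -1))
    def run(self, data):
        sorted(data)
        return {"ops": len(data), "seconds": 0.5}  # stubbed timing

result = Harness(SortWorkload(), scale=1000).execute()
# result["passed"] is True; throughput is 2000.0 ops/s
```

A real harness would add the remaining responsibilities from the list above (environment preparation, scheduling, system monitoring), but the split shown here, shared execution and reporting versus workload-specific data generation and run logic, is the core idea.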
System Monitoring and Statistics Collection

The previous section discussed the need to monitor the benchmark workload itself. The objective of running a benchmark can be technical or competitive (Baru et al. 2012). In the technical use of benchmarks, the objective is to optimize a system, while in the competitive use of a benchmark, the objective is to compare the relative performance of different systems. Both cases require the collection of additional system statistics to fully evaluate a benchmark run, analyze its resource usage, and eventually optimize the system.

A benchmark harness should therefore also be capable of collecting system statistics across all relevant layers of the stack during a benchmark run. In the Graphalytics benchmark harness, a "system monitor" component is "responsible for gathering resource utilization statistics from the SUT" (Capotă et al. 2015). CloudPerf's statistics framework enables application- and environment-specific statistics providers to capture and push statistics into the harness for live monitoring and post-run analysis (Michael et al. 2017).

Benchmark Report Generation

The outcome of a benchmark run is typically documented in a report. The benchmark report should include information about the benchmark itself, its configuration and scale, and the system environment on which it was conducted, and it should contain benchmark metrics as well as relevant system statistics. Its intention is to summarize the outcome of a benchmark and potentially validate its result to classify it as "passed" or "failed". The report should make it possible to compare two different systems against each other and to judge them with respect to their relative performance.

Additional desirable metrics include each system's cost (in terms of necessary system components or power consumption), quality of service, and maximum capacity with respect to data volume, supported users, or concurrency. For technical benchmarking (Ghazal et al. 2013), the report should further include relevant system statistics that help engineers to analyze and troubleshoot performance bottlenecks and tune or optimize a system.

Han describes components of their layered harness architecture that allow the definition of metrics and the analysis and reporting of results (Han et al. 2014). Metrics can be classified as user-perceived metrics, such as request rates and response times, and architectural metrics, such as resource utilization and other system statistics from the SUT (Wang et al. 2014).

Graphalytics (Capotă et al. 2015) contains an output validator module, which receives data from the benchmark workload itself as well as from the big data platform. Benchmark configuration data from the benchmark, the evaluation of the benchmark's correctness from the output validator, as well as system statistics from the system monitor component are then processed by the report generator to compile a benchmark report, which is then stored in a result database.

CloudPerf (Michael et al. 2017) supports the definition of metrics, result checks, and key performance indicators (KPIs), which are evaluated by a reporter component at the end of a run based on both workload and system statistics. The report includes system configuration information, benchmark and system metrics, KPIs, a validation of whether the benchmark run was successful or failed based on the configured result checks, as well as detailed workload and system runtime statistics.

Examples of Application

BigDataBench (Wang et al. 2014) is a benchmark suite of a rather loosely coupled collection of benchmarks, data sets, and data generators (Ming et al. 2013), with little common harness functionality across the various benchmarks, but it contains a large number of benchmark workloads (originally 19, which have since been extended to 37 benchmark workloads (Zhan et al. 2016)). Graphalytics, on the other hand, implements a modular and extensible benchmark harness "which facilitates the addition of new
datasets, algorithms, and platforms, as well as the actual performance evaluation process, including result collection and reporting" (Capotă et al. 2015). CloudPerf (Michael et al. 2017), a benchmark harness primarily designed for distributed multi-tenant systems, consists of an extensible framework with well-defined APIs to implement workloads, data loading, statistics collection, fault injection, and cloud deployment, while providing common functionality in the harness itself.

Cross-References

System Under Test

References

Baru C, Bhandarkar M, Nambiar R, Poess M, Rabl T (2012) Setting the direction for big data benchmark standards. In: Technology conference on performance evaluation and benchmarking. Springer, Berlin, pp 197–208
Capotă M, et al (2015) Graphalytics: a big data benchmark for graph-processing platforms. In: Proceedings of GRADES'15. ACM, New York
Ghazal A, Rabl T, Hu M, Raab F, Poess M, Crolotte A, Jacobsen HA (2013) BigBench: towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data. ACM, New York, pp 1197–1208
Han R, Lu X, Jiangtao X (2014) On big data benchmarking. In: Workshop on big data benchmarks, performance optimization, and emerging hardware. Springer, Cham
Michael N, et al (2017) CloudPerf: a performance test framework for distributed and dynamic multi-tenant environments. In: Proceedings of the 8th ACM/SPEC international conference on performance engineering. ACM, New York
Ming Z, et al (2013) BDGS: a scalable big data generator suite in big data benchmarking. In: Workshop on big data benchmarks. Springer, Cham
Rabl T, Poess M (2011) Parallel data generation for performance analysis of large, complex RDBMS. In: Proceedings of the fourth international workshop on testing database systems. ACM, New York, p 5
Wang L, et al (2014) BigDataBench: a big data benchmark suite from internet services. In: 2014 IEEE 20th international symposium on high performance computer architecture (HPCA). IEEE, New York
Zhan J, et al (2016) BigDataBench tutorial. ASPLOS. http://www.bafst.com/events/asplos16/BigDataBench_tutorial_16/wp-content/uploads/BigDataBench_First_Part-asplos16.pdf


Big Data Analysis and IoT

Yongrui Qin1 and Quan Z. Sheng2
1School of Computing and Engineering, University of Huddersfield, Huddersfield, UK
2Department of Computing, Macquarie University, Sydney, NSW, Australia

Definitions

Intrinsic characteristics of IoT big data are identified and classified. Potential IoT applications bringing significant changes in many domains are described. Many new challenges and open issues to be addressed in IoT are discussed.

Overview

This entry firstly discusses the intrinsic characteristics of IoT big data and classifies them into three categories: data generation, data quality, and data interoperability. Specific characteristics of each category are identified. Then it introduces ongoing and/or potential IoT applications, which show that IoT can bring significant changes in many domains, e.g., cities and homes, environment monitoring, health, energy, business, etc. Since many new challenges and issues around IoT big data have not been addressed and require substantial efforts from both academia and industry, this entry also identifies several key directions for future research and development from a data-centric perspective.
142 Big Data Analysis and IoT
In the context of the Internet, addressable and interconnected things, instead of humans, act as the main data producers, as well as the main data consumers. Computers will be able to learn and gain information and knowledge to solve real-world problems directly with the data fed from things. As an ultimate goal, computers enabled by the Internet of Things technologies will be able to sense and react to the real world for humans.

As of 2012, 2.5 quintillion (2.5 × 10^18) bytes of data are created daily (http://www-01.ibm.com/software/data/bigdata/). In IoT, connecting all of the things that people care about in the world becomes possible. All these things would be able to produce much more data than today. The volumes of data are vast, the generation speed of data is fast, and the data/information space is global (James et al. 2009). Indeed, IoT is one of the major driving forces for big data analytics. Given the scale of IoT, topics such as storage, distributed processing, real-time data stream analytics, and event processing are all critical, and we may need to revisit these areas to improve upon existing technologies for applications at this scale.

In this entry, we systematically investigate the key technologies related to the development of IoT applications, particularly from a data-centric perspective. Our aim is to provide a better understanding of IoT data characteristics, applications, and issues. Although some reviews of IoT have been conducted recently (e.g., Atzori et al. 2010; Zeng et al. 2011; An et al. 2013; Perera et al. 2013; Li et al. 2014; Yan et al. 2014), they focus on high-level general issues and are mostly fragmented. Sheng et al. (2017) covers a number of areas in the Web of Things. These efforts do not specifically cover IoT data characteristics and potential applications from a data-centric perspective, which is fundamentally critical to fully embracing IoT.

The remainder of this entry is organized as follows. Section "IoT Data Taxonomy" identifies an IoT data taxonomy. In section "Potential IoT Applications", some typical ongoing and/or potential IoT applications where data techniques for IoT can bring significant changes are described. Finally, section "Open Issues" highlights some research open issues on IoT from the data perspective, and section "Summary" offers some concluding remarks.

IoT Data Taxonomy

In this section, we identify the intrinsic characteristics of IoT data and classify them into three categories: data generation, data quality, and data interoperability. We also identify specific characteristics of each category, and the overall IoT data taxonomy is shown in Fig. 2.

Data Generation
• Velocity. In IoT, data can be generated at different rates. For example, for GPS-enabled moving vehicles in road networks, the GPS signal sampling frequency could be every few seconds, every few minutes, or even every half an hour. But some sensors can scan at rates of up to 1,000,000 sensing elements per second (https://www.tekscan.com/support/faqs/what-are-sensors-sampling-rates). On the one hand, it is challenging to handle very high sampling rates, which require efficient processing of the rapidly generated data. On the other hand, it is also challenging to deal with low sampling rates, because some important information may be lost for data processing and decision making.
• Scalability. Since things are able to continuously generate data, and the number of things is foreseeably excessively large, IoT data is expected to be at an extremely large scale. It is easy to imagine that, in IoT data processing systems, scalability will be a long-standing issue, aligning with the current big data trend.
• Dynamics. There are many dynamic elements within IoT data. Firstly, many things are mobile, which leads to different locations at different times. Since they will move to different environments, the sensing results of things will also change to reflect the real world. Secondly, many things are fragile. This means the generated data will change over time due to the failure of things. Thirdly, the
connections between things could be intermittent. This also creates dynamics in any IoT data processing system.
• Heterogeneity. There will be many kinds of things potentially connecting to the Internet in the future, ranging from cars, robots, fridges, and mobile phones to shoes, plants, watches, and so on. These kinds of things will generate data in different formats using different vocabularies. In addition, there will be assorted IoT data processing systems, which will also provide data in customized formats to tailor different data needs.

Data Quality
• Uncertainty. In IoT, uncertainty may come from different sources. In RFID data, uncertainty can refer to missing readings, readings of nonexisting IDs, etc. In wireless sensor networks, uncertainty can refer to sensing precision (the degree of reproducibility of a measurement), accuracy (the maximum difference that will exist between the actual value and the indicated value), etc.
• Redundancy. Redundancy is also easily observable in IoT. For example, in RFID data, the same tags can be read multiple times at the same location (because multiple RFID readers exist at the spot, or tags are read multiple times at the same spot) or at different locations. In wireless sensor networks, a group of sensors of the same type may also be deployed in a nearby area, which can produce similar sensing results for that area. For the same sensor, due to possibly high sampling rates, redundant sensing data can be produced.
• Ambiguity. Dealing with a large amount of ambiguity in IoT data is inevitable. The data produced by assorted things can be interpreted in different ways due to the different data needs of different things or other data consumers. Such data can also be useful and important to any other things, which brings about the challenge of properly interpreting the produced data for different data consumers.
• Inconsistency. Inconsistency is also prevalent in IoT data. For example, in RFID data, inconsistency can occur due to missing readings of tags at some locations along the supply chain. It is also easy to observe inconsistency in sensing data, as when multiple sensors are monitoring the same environment and reporting sensing results. Due to the precision and accuracy of the sensing process and other problems, including packet loss during transmission, data inconsistency is also an intrinsic characteristic of sensing data.

Data Interoperability
• Incompleteness. In order to process IoT data and to detect and react to events in real
time, it is important to combine data from different types of data sources to build a big, complete picture of the relevant background of the real world. However, as this process relies on the cooperation of mobile and distributed things that are generating relevant background data, incompleteness is easily observable in IoT data. Suppose there are a large number of available data sources. It is of great importance to determine which data sources can best address the incompleteness of data for a given data processing task.
• Semantics. To address the challenges posed by the deluge of IoT data, having things or machines act as the major data consumers should be a promising trend for data processing in the IoT era. Inspired by Semantic Web technologies, injecting semantics into data could be an initial step toward enabling machines to understand data on behalf of human beings. Therefore, semantics within IoT data will play an important role in the process of enabling things/machines to understand and process IoT data by themselves.

Potential IoT Applications

As Ashton (2009) pointed out, IoT "has the potential to change the world, just as the Internet did." The ongoing and/or potential IoT applications show that IoT can bring significant changes in many domains, e.g., cities and homes, environment monitoring, health, energy, business, etc. IoT can bring the ability to react to events in the physical world in an automatic, rapid, and informed manner. This also opens up new opportunities for dealing with complex or critical situations and enables a wide variety of business processes to be optimized. In this section, we overview several representative domains where IoT can make some profound changes.

Smart Cities and Homes
IoT can connect billions of smart things and can help capture information in cities. Based on IoT, cities would become smarter and more efficient. Below are some examples of promising IoT applications in future smart cities. In a modern city, lots of digital data traces are generated every second via cameras and sensors of all kinds (Guinard 2010). All this data represents a gold mine for everyone, if people in the city are able to take advantage of it in an efficient and effective way. For example, IoT can facilitate resource management for modern cities. Specifically, static resources (e.g., fire stations, parking spots) and mobile resources (e.g., police cars, fire trucks) in a city can be managed effectively using IoT technologies. Whenever events (fires, crime reports, cars looking for parking) arise, IoT technologies would be able to quickly match resources with events in an optimal way based on the information captured by smart things, thereby reducing cost and saving time. Taxi drivers in the city would also be able to better serve prospective passengers by learning passengers' mobility patterns and other taxi drivers' serving behaviors with the help of IoT technologies (Yuan et al. 2011). One study estimated a loss of $78 billion in 2007 in the form of 4.2 billion lost hours and 2.9 billion gallons of wasted gasoline in the United States alone (Mathur et al. 2010). IoT could bring fundamental changes to urban street parking management, which would greatly benefit the whole society by reducing traffic congestion and fuel consumption.

Security in a city is of great concern, and it can benefit greatly from the development of IoT technologies. Losses resulting from property crimes were estimated at $17.2 billion in the United States in 2008 (Guha et al. 2010). Current security cameras, motion detectors, and alarm systems are not able to help track or recover stolen property. IoT technologies can help to deter, detect, and track personal property theft, since things are interconnected and can interact with each other. IoT technologies can also help improve stolen property recovery rates and disrupt stolen property distribution networks. Similarly, a network of static and mobile sensors can be used to detect threats on city streets and in open areas such as parks.

With IoT technologies, people can browse and manage their homes via the Web. For example,
they would be able to check whether the light in their bedrooms is on and could turn it off by simply clicking a button on a Web page. Similar operations and management could be done in office environments. Plumbing is ranked as one of the ten most frequently found problems in homes (Lai et al. 2010). It is important to determine the spatial topology of hidden water pipelines behind walls and underground. In IoT, smart things in homes would be able to detect plumbing problems automatically and report them to owners and/or plumbers for efficient maintenance and repair.

Environment Monitoring
IoT technologies can also help to monitor and protect environments, thereby improving human knowledge about environments. Take water as an example. Understanding the dynamics of bodies of water and their impact on the global environment requires sensing information over the full volume of water. In such a context, IoT technologies would be able to provide effective approaches to study water. IoT could also improve water management in a city. Drinking water is becoming a scarce resource around the world. In big cities, efficiently distributing water is one of the major issues (Guinard 2010). Various reports show that on average 30% of drinkable water is lost during transmission due to aging infrastructure and pipe failures. Further, water can be contaminated biologically or chemically due to inefficient operation and management. In order to effectively manage and efficiently transport water, IoT technologies would be of great importance.

Soil contains vast ecosystems that play a key role in the Earth's water and nutrient cycles, but scientists cannot currently collect the high-resolution data required to fully understand them. Many soil sensors are inherently fragile and often produce invalid or uncalibrated data (Ramanathan et al. 2009). IoT technologies would help to validate, calibrate, repair, or replace sensors, allowing available sensors to be used without sacrificing data integrity while minimizing the human resources required.

Sound is another example where IoT technologies can help. Sound is multidimensional, varying in intensity and spectra, so it is difficult to quantify; for example, it is difficult to determine what kind of sound counts as noise. Further, the definitions and feelings of noise are quite subjective. For example, some noises could be pleasant, like flowing water, while others can be annoying, such as car alarms, screeching brakes, and people arguing. A device has been designed and built to monitor residential noise pollution to address the above problems (Zimmerman and Robson 2011). Firstly, noise samples from three representative houses are used, which span the spectrum from quiet to noisy neighborhoods. Secondly, a noise model is developed to characterize residential noise. Thirdly, noise events of an entire day (24 h) are compressed into a 1-minute auditory summary. Data collection, transmission, and storage requirements can be minimized in order to utilize low-cost and low-power components, while sufficient measurement accuracy is still maintained.

Intel has developed smart sensors that can warn people about running outside when the air is polluted (http://www.fastcoexist.com/1680111/intels-sensors-will-warn-you-about-running-outside-when-the-air-is-polluted). For example, if someone is preparing to take a jog along his/her regular route, an application on his/her smartphone pushes out a message: air pollution levels are high in the park where he/she usually runs. Then he/she could try a recommended route that is cleaner. Currently, many cities already have pollution and weather sensors. They are usually located on top of buildings, far from daily human activities.

Health
Covering a number of markets including manufacturing, transportation, systems, software, services, and so on, IoT in healthcare is expected to grow to a market size of $300 billion by 2022 (Firouzi et al. 2018). In future IoT environments, an RFID-enabled information infrastructure would be likely to revolutionize areas such as healthcare and pharmaceuticals. For example, a healthcare environment such as a large hospital or an aged care facility could tag all pieces of medical equipment (e.g., scalpels, thermometers) and drug products for inventory management. Each
storage area or patient room would be equipped with RFID readers that could scan medical devices, drug products, and their associated cases. Such an RFID-based infrastructure could offer a hospital an unprecedented, near real-time ability to track and monitor objects and detect anomalies (e.g., misplaced objects) as they occur.

As personal health sensors become ubiquitous, they are expected to become interoperable. This means standardized sensors can wirelessly communicate their data to a device many people already carry today (e.g., mobile phones). It is argued by Lester et al. (2009) that one challenge in weight control is the difficulty of tracking food calories consumed and calories expended by activity. They then present a system for automatic monitoring of calories consumed using a single body-worn accelerometer. For a large body of people to fully benefit from such data, applying IoT technologies in this area would be a promising direction.

Mobile technology and sensors are creating ways to inexpensively and continuously monitor people's health. Doctors may call their clients to schedule an appointment, rather than vice versa, because the doctors could know their clients' health conditions in real time. Some projects for this purpose have been initiated. For example, EveryHeartBeat (http://join.everyheartbeat.org/) is a project for Body Computing to "connect the more than 5 billion mobile phones in the world to the health ecosystem." In the initial stage, heart rate monitoring is investigated. Consumers would be able to self-track their pulse, and studies show heart rate monitoring could be useful in detecting heart conditions and enabling early diagnosis. The future goal is to include data on blood sugar levels and other biometrics collected via mobile devices.

Energy
Home heating is a major factor in worldwide energy use. In IoT, home energy management applications could be built upon embedded Web servers (Priyantha et al. 2008). Through such online Web services, people can track and manage their home energy consumption. A system is designed by Gupta et al. (2009) for augmenting these thermostats using just-in-time heating and cooling based on travel-to-home distance obtained from location-aware mobile phones. The system makes use of a GPS-enabled thermostat, which could lead to savings of as much as 7%. In IoT, as things in homes become smart and connected to the Internet, similar energy savings could be more effective. For example, by automatically sensing occupancy and sleep patterns in a home, it would be possible to save energy by automatically turning off the home's HVAC (heating, ventilation, and air conditioning) system.

Besides home heating, fuel consumption is also an important issue. GreenGPS, a navigation service that uses participatory sensing data to map fuel consumption on city streets, has been designed by Ganti et al. (2010). GreenGPS would allow drivers to find the most fuel-efficient routes for their vehicles between arbitrary end points. In IoT, fuel consumption would be further reduced by enabling cars and passengers to communicate with each other for ride sharing (Yuan et al. 2011).

Business
IoT technologies would be able to help improve efficiency in business and bring other impacts on business (Mattern and Floerkemeier 2010):

• From a commercial point of view, IoT can help increase the efficiency of business processes and reduce costs in warehouse logistics and in service industries. This is because more complete and necessary information can be collected by interconnected things. Owing to its huge and profound impact on society, IoT research and applications can also trigger new business models involving smart things and associated services.
• From a social and political point of view, IoT technologies can provide a general increase in the quality of life for the following reasons. Firstly, consumers and citizens will be able to obtain more comprehensive information. Secondly, care for aged and/or disabled people can be improved with smarter assistance
systems. Thirdly, safety can be increased. For example, road safety can be improved by receiving more complete and real-time traffic and road condition information.
• From a personal point of view, new services enabled by IoT technologies can make life more pleasant, entertaining, independent, and also safer. For example, businesses taking advantage of technologies for searching for things in IoT can help locate lost things quickly, such as personal belongings, pets, or even other people (Tran et al. 2017).

Take improving information handover efficiency in a global supply chain as an example. The concept of digital object memories (DOM) is proposed in Stephan et al. (2010), which stores order-related data via smart labels on the item. Based on DOM, relevant life cycle information could be attached to the product. Considering the different potential stakeholders along the supply/value chain, including manufacturer, distributor, retailer, and end customer, this approach facilitates information handover.

Further, there are many important bits of information in an IoT-based supply chain, such as the 5W (what, when, where, who, which). It is also necessary to integrate them efficiently and in real time with other operations. The EPCIS (Electronic Product Code Information System) network is a set of tools and standards for tracking and sharing RFID-tagged products in IoT. However, much of this data remains in closed networks and is hard to integrate (Wu et al. 2013). IoT technologies could be used to make it easier to use all this data, to integrate it into various applications, and to build more flexible, scalable, global applications for better (even real-time) logistics.

Open Issues

The development of IoT technologies and applications is merely beginning. Many new challenges and issues have not been addressed, which require substantial efforts from both academia and industry. In this section, we identify some key directions for future research and development from a data-centric perspective.

• Data Quality and Uncertainty: In IoT, as data volume increases, inconsistency and redundancy within data would become paramount issues. One of the central problems for data quality is inconsistency detection, and when data is distributed, detection would be far more challenging (Fan et al. 2010). This is because inconsistency detection often requires shipping data from one site to another. Meanwhile, inherited from RFID data and sensor data, IoT data would carry great uncertainty, which also presents significant challenges.
• Co-space Data: In an IoT environment, the physical space and the virtual (data) space co-exist and interact simultaneously. Novel technologies must be developed to allow data to be processed and manipulated seamlessly between the real and digital spaces (Ooi et al. 2009). To synchronize data in both real and virtual worlds, large amounts of data and information will flow between co-spaces, which poses new challenges. For example, it would be challenging to process heterogeneous data streams in order to model and simulate real-world events in the virtual world. Besides, more intelligent processing is needed to identify and send interesting events in the co-space to objects in the physical world.
• Transaction Handling: When the data being updated is spread across hundreds or thousands of networked computers/smart things with differing update policies, it would be difficult to define what the transaction is. In addition, most things are resource-constrained and are typically connected to the Internet using lightweight, stateless protocols such as CoAP (Constrained Application Protocol) (http://tools.ietf.org/html/draft-ietf-core-coap-18) and 6LoWPAN (IPv6 over Low-Power Wireless Personal Area Networks) (http://tools.ietf.org/wg/6lowpan), and accessed using RESTful Web services. This makes transaction handling in IoT a great challenge. As pointed out by James et al. (2009), the problem is that the world is
changing fast, the data representing the world is on multiple networked computers/smart things, and existing database technologies cannot manage it. Techniques developed for streamed and real-time data may provide some hints.
• Frequently Updated Time-Stamped Structured (FUTS) Data: The Internet, and hence IoT, contains potentially billions of Frequently Updated Time-Stamped Structured (FUTS) data sources, such as real-time traffic reports, air pollution detection, temperature monitoring, crop monitoring, etc. FUTS data sources contain states and updates of physical-world things. Current technologies are not capable of dealing with FUTS data sources (James et al. 2009) because (i) no data management system can easily display FUTS past data, (ii) no efficient crawler or storage engine is able to collect and store FUTS data, and (iii) querying and delivering FUTS data is hardly supported. All these pose great challenges for the design of novel data management systems for FUTS data.
• Distributed and Mobile Data: In IoT, data will be increasingly distributed and mobile. Different from traditional mobile data, distributed and mobile data in IoT would be much more highly distributed and data intensive. In the context of interconnecting huge numbers of mobile and smart objects, centralized data stores would not be a suitable tool to manage all the dynamics of the mobile data produced in IoT. Thus there is a need for novel ways to manage distributed and mobile data efficiently and effectively in IoT.
• Semantic Enrichment and Semantic Event Processing: The full potential of IoT would heavily rely on the progress of the Semantic Web. This is because things and machines should play a much more important role than humans in IoT to process and understand data. This calls for new research in semantic technologies. For example, there are increasing efforts in building public knowledge bases (such as DBpedia, FreeBase, the Linked Open Data Cloud, etc.). But how can these knowledge bases be effectively used to add to the understanding of raw data coming from sensor data streams and other types of data streams? To resolve this challenge, semantic enrichment of sensing data is a promising research direction. Further, consider the potentially excessively large number of subscriptions to IoT data: producing proper semantic enrichment to meet the different enrichment needs of different subscribers poses great challenges. Finally, work on effectively incorporating semantic enrichment techniques with semantic event processing, to provide much better expressiveness in event processing, is still at its initial stage. This will also demand a large amount of research effort.
• Mining: Data mining aims to facilitate the exploration and analysis of large amounts of data, which can help to extract useful information from the huge volume of IoT data. Data mining challenges may include extraction of temporal characteristics from sensor data streams, event detection from multiple data streams, data stream classification, and activity discovery and recognition from sensor data streams. Besides, clustering and table summarization in large data sets, mining large (data, information, or social) networks, sampling, and information extraction from the Web are also great challenges in IoT.
• Knowledge Discovery: Knowledge discovery is the process of extracting useful knowledge from data. This is essential especially when connected things populate their data to the Web. The following issues related to knowledge discovery in IoT have been identified (Weikum 2011; Tran et al. 2017): (i) automatic extraction of relational facts from natural language text and multimodal contexts; (ii) large-scale gathering of factual knowledge candidates and their reconciliation into comprehensive knowledge bases; (iii) reasoning on uncertain hypotheses, for knowledge discovery and semantic search; and (iv) deep and real-time question answering, e.g., to enable computers to win quiz game shows.
• Security: Due to the proliferation of embedded devices in IoT, effective device security
mechanisms are essential to the development of IoT technologies and applications (Huang et al. 2017). The National Intelligence Council (Anonymous 2008) argues that, to the extent that everyday objects become information security risks, the IoT could distribute those risks far more widely than the Internet has to date. For example, RFID security presents many challenges. Potential solutions should consider aspects from hardware and wireless protocol security to the management, regulation, and sharing of collected RFID data (Welbourne et al. 2009). Besides, it is argued by Lagesse et al. (2009) that there is still no generic framework for deploying and extending traditional security mechanisms over a variety of pervasive systems. Regarding security concerns at the network layer, it is suggested by Kounavis et al. (2010) that the Internet can be gradually encrypted and authenticated, based on the observation that recent advances in the implementation of cryptographic algorithms have made general-purpose processors capable of encrypting packets at high rates. But generalizing such algorithms to IoT would be challenging, as things in IoT normally maintain only low transmission rates and connections are usually intermittent.
• Privacy: Privacy protection is a serious challenge in IoT. One of the fundamental problems is the lack of a mechanism to help people expose appropriate amounts of their identity information. Embedded sensing is becoming more and more prevalent on personal devices such as mobile phones and multimedia players. Since people typically wear and carry devices capable of sensing, details such as activity, location, and environment could become available to other people. Hence, personal sensing can be used to detect people's physical activities and bring about privacy concerns (Klasnja et al. 2009). Patients' electronic health records (EHR) are becoming high-volume medical big data (Yang et al. 2018). The e-health IoT network is growing very fast, leading to big medical private data on the Internet. There are many challenges, such as information privacy, search, updating, and sharing, to be addressed.
• Social Concerns: Since IoT connects everyday objects to the Internet, social concerns would become a hot topic in the development of IoT. Further, online social networks carrying information about personal things may incur social concerns as well, such as disclosures of personal activities and hobbies, etc. Appropriate economic and legal conditions and a social consensus on how the new technical opportunities in IoT should be used also represent a substantial task for the future (Mattern and Floerkemeier 2010). Trust management could play an important role in addressing social concerns (Lin and Dong 2018). With the help of trust models for social IoT, malicious behaviors can be detected effectively.

Summary

It is predicted that the next generation of the Internet will be composed of trillions of connected computing nodes at a global scale. Through these nodes, everyday objects in the world can be identified, connected to the Internet, and make decisions independently. In this context, the Internet of Things (IoT) is considered a new revolution of the Internet. In IoT, the possibility of seamlessly merging the real and the virtual worlds, through the massive deployment of embedded devices, opens up many new and exciting directions for both research and development.

In this entry, we have provided an overview of some key research areas of IoT, specifically from a data-centric perspective. It also presents a number of fundamental issues to be resolved before we can fully realize the promise of IoT applications. Over the last few years, the Internet of Things has gained momentum and is becoming a rapidly expanding area of research and business. Many efforts from researchers, vendors, and governments have been devoted to creating and developing novel IoT applications. Along with the current research efforts, we encourage more insights into the problems of this promising
technology and more efforts in addressing the open research issues identified in this entry.

Cross-References

Big Data Analysis for Smart City Applications
Big Data and Fog Computing
Big Data in Smart Cities
Big Data Stream Security Classification for IoT Applications

References

An J, Gui X, Zhang W, Jiang J, Yang J (2013) Research on social relations cognitive model of mobile nodes in internet of things. J Netw Comput Appl 36(2):799–810
Anonymous (2008) National Intelligence Council (NIC), Disruptive civil technologies: six technologies with potential impacts on US interests out to 2025, Conference report CR 2008-07, Apr 2008. http://www.dni.gov/nic/NIC_home.html
Ashton K (2009) That 'Internet of Things' thing. http://www.rfidjournal.com/article/view/4986
Atzori L, Iera A, Morabito G (2010) The internet of things: a survey. Comput Netw 54(15):2787–2805
CASAGRAS (2000) CASAGRAS (Coordination And Support Action for Global RFID-related Activities and Standardisation)
European Commission (2009) European commission: internet of things, an action plan for Europe. http://europa.eu/legislation_summaries/information_society/internet/si0009_en.htm
Fan W, Geerts F, Ma S, Müller H (2010) Detecting inconsistencies in distributed data. In: Proceedings of the 26th international conference on data engineering (ICDE), Long Beach. IEEE, pp 64–75
Firouzi F, Rahmani AM, Mankodiya K, Badaroglu M, Merrett GV, Wong P, Farahani B (2018) Internet-of-things and big data for smarter healthcare: from device to architecture, applications and analytics. Futur Gener Comput Syst 78:583–586
Ganti RK, Pham N, Ahmadi H, Nangia S, Abdelzaher TF (2010) GreenGPS: a participatory sensing fuel-efficient maps application. In: Proceedings of the 8th interna-
Guinard D (2010) A web of things for smarter cities. In: Technical talk, pp 1–8
Guo B, Zhang D, Wang Z, Yu Z, Zhou X (2013) Opportunistic IoT: exploring the harmonious interaction between human and the internet of things. J Netw Comput Appl 36(6):1531–1539
Gupta M, Intille SS, Larson K (2009) Adding GPS-control to traditional thermostats: an exploration of potential energy savings and design challenges. In: Proceedings of the 7th international conference on pervasive computing (pervasive), Nara. Springer, pp 95–114
Huang Q, Yang Y, Wang L (2017) Secure data access control with ciphertext update and computation outsourcing in fog computing for internet of things. IEEE Access 5:12941–12950
INFSO (2008) INFSO D.4 Networked Enterprise & RFID INFSO G.2 Micro & Nanosystems. In: Co-operation with the working group RFID of the ETP EPOSS, internet of things in 2020, roadmap for the future, version 1.1, 27 May 2008
ITU (2005) International Telecommunication Union (ITU) internet reports, The internet of things, Nov 2005
James AE, Cooper J, Jeffery KG, Saake G (2009) Research directions in database architectures for the internet of things: a communication of the first international workshop on database architectures for the internet of things (DAIT 2009). In: Proceedings of the 26th British national conference on databases (BNCOD), Birmingham. Springer, pp 225–233
Klasnja PV, Consolvo S, Choudhury T, Beckwith R, Hightower J (2009) Exploring privacy concerns about personal sensing. In: Proceedings of the 7th international conference on pervasive computing (pervasive), Nara. Springer, pp 176–183
Kounavis ME, Kang X, Grewal K, Eszenyi M, Gueron S, Durham D (2010) Encrypting the internet. In: Proceedings of the ACM SIGCOMM conference on applications, technologies, architectures, and protocols for computer communications (SIGCOMM), New Delhi. ACM, pp 135–146
Lagesse B, Kumar M, Paluska JM, Wright M (2009) DTT: a distributed trust toolkit for pervasive systems. In: Proceedings of the 7th annual IEEE international conference on pervasive computing and communications (PerCom), Galveston. IEEE, pp 1–8
Lai T-T, Chen Y-H, Huang P, Chu H-H (2010) PipeProbe: a mobile sensor droplet for mapping hidden pipeline. In: Proceedings of the 8th international conference on embedded networked sensor systems (SenSys), Zurich. ACM, pp 113–126
Lester J, Hartung C, Pina L, Libby R, Borriello G, Duncan G (2009) Validated caloric expenditure estimation
tional conference on mobile systems, applications, and using a single body-worn sensor. In: Proceedings of the
services (MobiSys), San Francisco. ACM, pp 151–164 11th international conference on ubiquitous computing
Guha S, Plarre K, Lissner D, Mitra S, Krishna B, Dutta (Ubicomp), Orlando. ACM, pp 225–234
P, Kumar S (2010) AutoWitness: locating and tracking Li S, Xu LD, Zhao S (2014) The internet of things:
stolen property while tolerating GPS and radio outages. a survey. Information Systems Frontiers, page (to
In: Proceedings of the 8th international conference on appear)
embedded networked sensor systems (SenSys), Zurich. Lin Z, Dong L (2018) Clarifying trust in social internet of
ACM, pp 29–42 things. IEEE Trans Knowl Data Eng 30(2):234–248
152 Big Data Analysis for Smart City Applications
Mathur S, Jin T, Kasturirangan N, Chandrasekaran J, Xue Yang Y, Zheng X, Guo W, Liu X, Chang V (2018, in press)
W, Gruteser M, Trappe W (2010) ParkNet: drive-by Privacy-preserving fusion of IoT and big data for e-
sensing of road-side parking statistics. In: Proceedings health. Futur Gener Comput Syst
of the 8th international conference on mobile systems, Yuan J, Zheng Y, Zhang L, Xie X, Sun G (2011) Where
applications, and services (MobiSys), San Francisco. to find my next passenger? In: Proceedings of the
ACM, pp 123–136 13th international conference on ubiquitous computing
Mattern F, Floerkemeier C (2010) From the internet of (Ubicomp), Beijing. ACM, pp 109–118
computers to the internet of things. In: From active Zeng D, Guo S, Cheng Z (2011) The web of things: a
data management to event-based systems and more. survey (invited paper). J Commun 6(6):424–438
Springer, Berlin, pp 242–259 Zimmerman T, Robson C (2011) Monitoring residential
Ooi BC, Tan K-L, Tung AKH (2009) Sense the phys- noise for prospective home owners and renters. In:
ical, walkthrough the virtual, manage the co (exist- Proceedings of the 9th international conference on per-
ing) spaces: a database perspective. SIGMOD Record vasive computing (pervasive), San Francisco. Springer,
38(3):5–10 pp 34–49
Perera C, Zaslavsky AB, Christen P, Georgakopoulos D
(2013) Context aware computing for the internet of
things: a survey. CoRR, abs/1305.0982
Priyantha NB, Kansal A, Goraczko M, Zhao F (2008) Tiny
web services: design and implementation of interop- Big Data Analysis for Smart
erable and evolvable sensor networks. In: Proceedings City Applications
of the 6th international conference on embedded net-
worked sensor systems (SenSys), Raleigh. ACM, pp Eugenio Cesario
253–266
Ramanathan N, Schoellhammer T, Kohler E, Whitehouse Institute of High Performance Computing and
K, Harmon T, Estrin D (2009) Suelo: human-assisted Networks of the National Research Council of
sensing for exploratory soil monitoring studies. In: Italy (ICAR-CNR), Rende, Italy
Proceedings of the 7th international conference on em-
bedded networked sensor systems (SenSys), Berkeley.
ACM, pp 197–210
Sheng QZ, Qin Y, Yao L, Benatallah B (2017) Managing Introduction
the web of things: linking the real world to the web.
Morgan Kaufmann, Amsterdam The twenty-first Century is frequently referenced
Stephan P, Meixner G, Koessling H, Floerchinger F,
as the “Century of the City”, reflecting the un-
Ollinger L (2010) Product-mediated communication
through digital object memories in heterogeneous precedented global migration into urban areas
value chains. In: Proceedings of the 8th annual that is happening nowadays (Butler 2010; Zheng
IEEE international conference on pervasive comput- et al. 2014). As a matter of fact, the world is
ing and communications (PerCom), Mannheim. IEEE,
rapidly urbanizing and undergoing the largest
pp 199–207
Tran NK, Sheng QZ, Babar MA, Yao L (2017) Search- wave of urban growth in history. In fact, accord-
ing the web of things: state of the art, challenges ing to a United Nations report, urban population
and solutions. ACM Comput Surv (CSUR), 50(4):55: is expected to grow from 2.86 billion in 2000 to
1–55:34
4.98 billion in 2030 (UNR 2014): this translates
Weikum G (2011) Database researchers: plumbers or
thinkers? In: Proceedings of the 14th international to roughly 60% of the global population living in
conference on extending database technology (EDBT), cities by 2030.
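The 60% figure quoted above follows from the cited UN projection only once a projected 2030 world population is assumed as denominator; a quick arithmetic check is below (the ~8.5 billion world-population figure is an assumption of this sketch, not a number stated in the entry):

```python
# Quick check of the urbanization share quoted above.
# Assumption (not stated in the entry): projected 2030 world
# population of about 8.5 billion people.
urban_2030 = 4.98e9   # projected urban population in 2030 (UNR 2014)
world_2030 = 8.5e9    # assumed projected world population in 2030

urban_share = urban_2030 / world_2030
print(f"Projected urban share in 2030: {urban_share:.1%}")
```

With this assumed denominator the share comes out near 59%, consistent with the "roughly 60%" stated in the text.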
Such steadily increasing urbanization has been leading to many big cities and bringing significant social, economic, and environmental transformations. On the one hand, it is modernizing people's lives and increasing the opportunities offered in urban areas; on the other hand, it is engendering big challenges in city management, such as large-scale resource planning, increased energy consumption, traffic congestion, air pollution, water quality, and rising crime (Zheng et al. 2014).
Tackling these challenges seemed nearly impossible years ago, given the complex and dynamic settings of cities. However, the widespread diffusion of sensing technologies and large-scale computing infrastructures has been enabling the collection of the big and heterogeneous data (i.e., mobility, air quality, water/electricity consumption, etc.) that are daily produced in urban spaces (Al Nuaimi et al. 2015). Such big data embed rich knowledge about a city and offer a valuable opportunity to achieve improvements in urban policies and management.

Motivated by the challenging idea of building more intelligent cities, several research projects are devoted to the development of Smart City applications. A Smart City is a city seeking to address public issues via ICT-based solutions on the basis of a multi-stakeholder, municipality-based partnership (EUP 2017). The goal of building a smart city is to improve the quality of citizens' lives by using computer systems and technology to provide smart and efficient services and facilities. In this scenario, data analytics offers useful algorithms and tools for searching, summarizing, classifying, and associating data, turning it into useful knowledge for citizens and decision makers. Moreover, managing data coming from large and distributed sensing nodes poses significant technical challenges that can be solved through innovative solutions in parallel and distributed architectures, Cloud platforms, scalable databases, and advanced data mining algorithms and tools. Such issues pose several challenges and opportunities for innovation in urban environments, as described in the following sections of this article.

Research, Technological and Social Challenges in Smart Cities

A Smart City is a complex ecosystem that poses several research and technological challenges, which are complementary and connected among them. In fact, successful smart city deployments require a combination of methodological and technological solutions, ranging from low-level sensors/actuators to effective data communications, data gathering, and analysis, finalized to effectively deliver smart applicative innovations to cities and citizens (Lea 2017). The most important data-related issues are briefly discussed in the following.

Networking and communications. The communication infrastructure of a smart city must support the connection among devices, buildings, facilities, and people, and must gather data and deliver services to myriad endpoints. Moreover, the heterogeneity of services in a smart city ecosystem requires a holistic approach to networking and communications, ranging from low-bandwidth wireless technologies to dedicated fiber optics as connection backbones. Such issues represent a technological challenge to deal with.

Cyber-physical systems and the IoT. Cyber-physical systems and the Internet of Things (IoT) are critical to the growth of smart cities, providing an urban connectivity architecture in which the objects of everyday life are equipped with micro-controllers, transceivers for digital communication, and suitable protocol stacks that allow effective connectivity among devices and with users (Zanella et al. 2014). This enables the interaction among a wide variety of devices (i.e., home appliances, surveillance cameras, monitoring sensors) that collect enormous volumes of data, making the smart city a digital data-rich ecosystem (Cicirelli et al. 2017).

Cloud computing. Cloud computing has a significant influence on the development of smart cities for two main reasons. First, it affects the way cities manage and deliver services, and enables a broader set of players to enter the smart city market. Second, the analysis of high-volume data (often to be processed in near real time) takes crucial advantage of the scalable and elastic execution environments offered by Cloud architectures.

Open Data. Another significant challenge in smart cities is the adoption and exploitation of open data. Open data in the context of smart cities refer to public policy that requires or encourages public agencies to release data sets (e.g., citywide crime statistics, city service levels, infrastructure data) and make them freely accessible. The most important challenges associated with open data
include data security and in particular privacy issues. The evolution of open data represents a broadening of the available information related to city operations, representing de facto another source of city data.

Big data analytics. As described above, cities are collecting massive and heterogeneous amounts of data (text, video, audio), some of it static but with increasingly large parts generated in real time. These data exhibit the classic characteristics of big data: volume, velocity (real-time generation), variety (extremely heterogeneous), veracity, and value (very useful for business and research applications). For this reason, a big effort is required to develop suitable data analytics methodologies to deal with such massive volumes of data, extremely heterogeneous in their sources, formats, and characteristics.

Citizen engagement. Citizen engagement represents a complementary aspect of smart cities and, although not strictly a technical consideration, it concerns making citizens feel part of the "collective intelligence" of cities. As a matter of fact, their willingness to exploit web applications (or other dialogue channels) for meaningful dialogue between cities and their citizens (i.e., reporting city issues, accidents, quality of services, etc.) is a crucial issue for the success of smart city initiatives.

Urban Big Data Collected in Smart Cities

Due to the complex and dynamic settings of cities, urban data are generated from many different sources in many different formats. A list of the most representative types of data, with brief descriptions, is reported in the following (Yuan et al. 2013; Zheng et al. 2014).

Traffic Data. This kind of data concerns the mobility of people and vehicles in urban and extra-urban areas. Traffic data can be collected in different ways, i.e., using loop sensors, surveillance cameras, and floating cars. Loop sensors are usually used to compute several mobility measures (i.e., speed of vehicles, traffic volume on a road, etc.). Surveillance cameras are widely installed in urban areas, generating a huge volume of images and videos reflecting traffic patterns, thus providing people with a visual ground truth of traffic conditions. Floating car data are generated by vehicles traveling around a city with a GPS sensor, whose trajectories can be matched to a road network for deriving speeds on road segments. As many cities have already installed GPS sensors in taxicabs, buses, and logistics trucks for different purposes, floating car data are already widely available and offer higher flexibility and a lower deployment cost with respect to loop-sensor and surveillance-camera-based approaches.

Mobile Phone Data. Mobile telephone calls produce a huge amount of data every day. In fact, a call detail record contains attributes that are specific to a single instance of a phone call, such as the user's location, start time, duration of the call, etc. The analysis of such data allows users' community detection, citywide human mobility discovery, etc.

Geographical Data. Geographical data describe the road network of the urban area under analysis. It is usually represented by a graph composed of a set of edges (denoting road segments) and a collection of geo-referenced nodes (standing for road intersections). Each edge can be enriched with several properties, such as length, speed constraint, type of road, and number of lanes. Each node can also include information about points of interest (POIs), i.e., restaurants, shopping malls, etc., usually described by a name, address, category, and a set of geo-spatial coordinates; however, the information about POIs can vary in time (e.g., a restaurant may change its name, be moved to a new location, or even be shut down), so it needs to be updated frequently.

Commuting Data. Commuting data include data generated by people who leave an electronic footprint of their activities in urban environments. For example, people traveling in cities continuously generate digital traces of their passage, i.e., card-swiping data in a subway/bus system, ticketing data in parking lots, records left by users when they borrow/return a bike in a bike-sharing system, etc., thus producing huge
volumes of commuting data that support the decisions of urban planners and policy makers.

Environmental Data. Environmental data include meteorological information (humidity, temperature, barometric pressure, wind speed, and weather conditions), air quality data (concentrations of CO, CO2, NO2, PM2.5, SO2), environmental noise, etc., which can be collected by sensors placed in the city, portable sensors mounted on public vehicles, or air quality monitoring stations. There are also data collected by satellites that scan the surface of the earth with rays of different wavelengths to generate images representing the ecology and meteorology of a wide region.

Social Network Data. The huge volume of user-generated data on social media platforms, such as Facebook, Twitter, and Instagram, is a valuable information source concerning human dynamics and behaviors in urban areas. In particular, social network data can be considered in two different forms. One is the social structure, represented by a graph, denoting the relationships, interdependencies, or interactions among users. The other is the user-generated social media data, such as texts, photos, and videos, which contain rich information about a user's behaviors/interests. Moreover, posts are often tagged with geographical coordinates or other information (e.g., tags, photos) that allow identifying users' positions and building trajectories from geo-tagged social data. Therefore, social media users moving through a set of locations produce a huge amount of geo-referenced data that embeds extensive knowledge about human mobility, crowded zones, interests, and social dynamics in urban areas.

Economy Data. There is a variety of data representing a city's economic dynamics. For example, transaction records of credit cards, stock prices, housing prices, and people's incomes are valuable information for capturing the economic rhythm of a city, and therefore for predicting which urban assets and features can positively (or negatively) influence the future economy.

Energy Data. Energy data include the electricity consumption of buildings, energy costs, gas consumption of vehicles, etc., and provide fundamental information about a city's resource usage. Such data can be obtained directly from sensors mounted on vehicles or in gas stations, from electricity and gas smart meters, or inferred implicitly from other data sources. Such data can be used to evaluate a city's energy infrastructures (e.g., the distribution of gas stations), find the most gas-efficient driving route, calculate the pollution emission from vehicles on road surfaces, optimize electricity usage in residential areas, etc.

Health Care Data. A large volume of health care and disease data is generated by hospitals and clinics. In addition, the advances of wearable computing devices enable people to monitor their own health conditions, such as heart rate, pulse, and sleep time. Such data can be exploited for diagnosing a disease and performing a remote medical examination. In urban environments, these data can be aggregated to study the impact of environmental change on people's health, i.e., how urban noise impacts people's mental health, how air pollution influences asthma-related diseases, etc.

Smart City Applications

A large number of smart city applications, exploiting urban data, have been implemented worldwide. A brief survey of the most important applications, grouped into seven categories, is reported below.

Smart Transportation. The joint analysis of traffic and commuting data provides useful knowledge to improve public and private mobility systems in a city. For example, intensive studies have been done to learn historical traffic patterns aimed at suggesting fast driving routes (Bejan et al. 2010; Herrera et al. 2010), at predicting real-time traffic flows (Herrera et al. 2010), and at forecasting future traffic conditions on individual road segments (Castro-Neto et al. 2009). Mobility can also benefit in scenarios where traffic information collected through sensors, smart traffic lights, and on-vehicle devices can be used to take action for specific critical events or to alleviate/resolve a traffic problem. Moreover, several applications are devoted to improving taxi services: some systems
provide taxi drivers with locations (and routes) where they are more likely to pick up passengers quickly, while others schedule taxis to pick up passengers via ride-sharing (subject to time, capacity, and monetary constraints), maximizing the profit of the trip (Yuan et al. 2013; Ma et al. 2013). Finally, big data are also profitably used to build more reliable and efficient public transportation systems: for example, (i) real-time arrival-time predictions of buses are made more accurate by predicting traffic traces; (ii) bike-sharing system operators may benefit from accurate models of bicycle flows in order to appropriately load-balance the stations throughout the day (Al Nuaimi et al. 2015; Zheng et al. 2014).

Smart Healthcare. Big data analysis is profitably exploited to provide smart healthcare solutions, which can improve the quality of patient care through enhanced patient handling. New systems can support real-time monitoring and analysis of clinical data, or enable checking clinical parameters on demand or at a given frequency (i.e., hourly, daily, or weekly). In particular, for certain patients' health issues, several benefits can be obtained from real-time monitoring of data (blood pressure, cholesterol, sleeping patterns) through smart devices, which are connected to the home or hospital for accurate and timely responses to health issues and for comprehensive patient history records (Al Nuaimi et al. 2015; Zheng et al. 2014).

Smart Energy. The rapid progress of urbanization is consuming more and more energy, calling for technologies that can sense city-scale energy cost, improve energy infrastructures, and finally reduce energy consumption (Zheng et al. 2014). To this purpose, innovative systems are developed (i) to allow near-real-time forecasting of energy demand through efficient analysis of the big data collected, or (ii) to facilitate decision-making related to the supply levels of electricity in line with the actual demand of citizens and the overall affecting conditions. Moreover, intelligent demand-response mechanisms can also be profitably used to shift energy usage to periods of low demand or to periods of high availability of renewable energy. For example, intelligent algorithms (running at either the device level or at the community/transformer level) may enable a more efficient energy management to stay within community-assigned energy usage limits.

Smart Environment. The analysis of environmental data (i.e., urban air quality, noise, pollution) provides new knowledge about how environmental phenomena are influenced by multiple complex factors, such as meteorology, traffic volume, and land use, or about the impact that noise pollution has on people's mental and physical health (Zheng et al. 2014). Moreover, accurate weather forecasts can support a country's agricultural effectiveness and can inform people of possibly hazardous conditions, supporting a more efficient management of energy utilization (Al Nuaimi et al. 2015). For example, a recent research study at MIT developed a real-time control system based on transport data and weather data to predict taxi demand under different weather conditions (Liv 2017).

Smart Safety and Security. Environmental disasters, pandemics, severe accidents, crime, and terrorism attacks pose additional threats to public security and order. The wide availability of different kinds of urban data provides the opportunity to build predictive models with the ability, on the one hand, to learn from history how to handle the aforementioned threats correctly and, on the other hand, to detect them in a timely manner or even predict them in advance (Zheng et al. 2014). For example, new technologies are enabling police departments to access growing volumes of crime-related data that can be analyzed to understand patterns and trends (Cesario et al. 2016), to anticipate criminal activity, and thus to optimize public safety resource allocation (officers, patrol routes, etc.). Other studies are devoted to predicting future environmental changes or natural disasters: some projects (Cesario and Talia 2010) are specifically aimed at developing new frameworks for earthquake forecasting and for predicting how the ground will shake as a result, which will give an opportunity to save lives and resources.

Smart Education. From the assessment of graduates to online attitudes, each student generates a unique data track. By analyzing these data, education institutes (from elementary schools to universities) can realize whether they are using
their resources correctly and producing profitable didactic results. For example, data concerning student attendance, misconduct resulting in suspensions, college eligibility, dropout rate, on-track rate, etc., can be collected and analyzed to understand how different teaching strategies can achieve specific learning results. In addition, behavioral patterns can be extracted and exploited for student profiling, which will have a positive effect on knowledge levels and will suggest teaching/learning tools aimed at improving a student's performance on the basis of his/her attitudes.

Smart Governance. Smart governance, a term coined a few years ago, is about using technological innovation to improve the performance of public administration, to enhance accountability and transparency, and to support politics in better planning and decision making. Specifically, political decisions can benefit from big data analytics models, which can support governments' focus on citizens' concerns related to health and social care, housing, education, policing, and other issues. In this way, more appropriate and effective decisions related to employment, production, and location strategies can be made. On the other hand, citizens can use governance data to make suggestions and enhance the quality of government services. Finally, information technology enables different government agencies to integrate and collaborate and to combine or streamline their processes. This will result in more efficient operations, better handling of shared data, and stronger regulation management and enforcement (Al Nuaimi et al. 2015).

Future Directions for Research

Urban environments will continue generating larger and larger volumes of data, which will steadily represent a valuable opportunity for the development of Smart City applications and for achieving improvements in urban policies and management. This will lead towards several research directions aimed at making the storage, management, and analysis of smart city data more and more effective and efficient. From a methodological side, it is expected that more advanced algorithms for urban Big Data analysis will be studied through a higher integration with the theories of conventional city-related disciplines; this will converge towards a multidisciplinary research field, where computer science meets traditional urban sciences (i.e., civil engineering, transportation, economics, energy engineering, environmental sciences, ecology, and sociology) to inspire data-driven smart city services. Furthermore, video and audio analytics methodologies, real-time data mining algorithms, advanced data-driven mobility applications, etc., will provide smart solutions to deal with the different conditions that dynamically change during the day in a city. From a technological side, advanced solutions for gathering, storing, and analyzing data will be investigated: for example, further research in Cloud computing, the IoT, and communication infrastructures will undoubtedly provide essential support for innovative and scalable solutions able to capture, transmit, store, search, and analyze data in ever more data-rich cities.

Cross-References

Big Data Analysis for Social Good
Big Data Analysis and IoT

References

Al Nuaimi E, Al Neyadi H, Mohamed N, Al-Jaroodi J (2015) Applications of big data to smart cities. J Internet Serv Appl 6(1):1–15
Bejan AI, Gibbens RJ, Evans D, Beresford AR, Bacon J, Friday A (2010) Statistical modelling and analysis of sparse bus probe data in urban areas. In: 13th international IEEE conference on intelligent transportation systems, pp 1256–1263
Butler D (2010) Cities: the century of the city. Nature 467:900–901
Castro-Neto M, Jeong YS, Jeong MK, Han LD (2009) Online-SVR for short-term traffic flow prediction under typical and atypical traffic conditions. Expert Syst Appl 36(3, Part 2):6164–6173
Cesario E, Talia D (2010) Using grids for exploiting the abundance of data in science. Scalable Comput: Pract Exp 11(3):251–262
Cesario E, Catlett C, Talia D (2016) Forecasting crimes ward repackaging of it to make it user-friendly
using autoregressive models. In: 2016 IEEE 2nd in- to using it to drive artificial intelligence appli-
ternational conference on big data intelligence and
computing, pp 795–802
cations. Much of what has been done with big
Cicirelli F, Guerrieri A, Spezzano G, Vinci A (2017) An data for social good revolves around making
edge-based platform for dynamic smart city applica- daily activities more information driven, such as
tions. Futur Gener Comput Syst 76(C):106–118 mapping applications for daily commutes and
EUP (2017) World’s population increasingly urban with
more than half living in urban areas. Technical report,
providing better weather forecasts. These types
Urban Intergroup, European Parliament of uses can benefit everyone in some way. This
Big Data Analysis for Social Good

Robert M. Goerge
Chapin Hall at the University of Chicago, Chicago, IL, USA

Overview

Big data is not just the 3, 4, or 5 Vs (Volume, Velocity, Variety, Veracity, and Value) that characterize it but has become a catchall term for innovative uses of large datasets to gain additional benefit and build knowledge (Japec et al. 2015). How it is used can range from straightforward […]. This chapter focuses on the analysis of big data to build knowledge that can benefit individuals and families at risk of harm, illness, disability, educational failure, or other special needs. This type of social good is focused on solutions to major ongoing social problems, as opposed to the incremental improvement of daily living for the majority of the population (Coulton et al. 2015; Liebman 2018).

The Data

The primary source of big data for addressing social problems is the data held in the administrative information systems of government and nongovernmental agencies. Criminal justice, health care, income assistance, child and adult protective services, disability programs, employment, economic development, early childhood programs, K-12 schools, and postsecondary education are all examples of programs that produce large datasets that can be used to produce evidence about who participates in these programs and how effective they might be. The data can range from records of who is enrolled in these programs, to expenditures, to detailed notes from the service provider on the characteristics of the individuals and the needs they have. Many of the information systems for these programs have been in place since the 1980s, and they have become ubiquitous since the turn of the century. Therefore, data on hundreds of millions of individuals, covering billions of transactions, is now available.

However, unlike the growing number of examples of data that are open and accessible on government websites, the type of data that is needed to conduct rigorous research and analysis is not widely accessible outside of the organizations.
One such initiative is New York University and the University of Chicago's Administrative Research Data Facility, also known as the Coleridge Initiative (Commission on Evidence-Based Policymaking 2017) (https://coleridgeinitiative.org/).

Non-technological Barriers

Given the sensitivity of these data, not only to the individuals in the data but to the organizational leaders who could be negatively affected in multiple ways by their disclosure, there are rules and regulations specific to each dataset that govern its uses (Petrila 2018). For the most part, the use of the data should promote the administration of the program. For example, the Family Educational Rights and Privacy Act (FERPA) allows disclosure of personally identifiable information if the purpose is to "improve instruction." Determining whether or not a given use has the potential to improve instruction is not clearly established and is open to interpretation by each educational entity. Therefore, access to these data still relies predominantly on developing relationships with the data providers (Goerge 2018).

Given the ambiguity of many of the rules and regulations around the use of sensitive administrative data, beyond keeping the data secure, it is necessary to partner with data-providing agencies in a way that both builds trust and ensures that the data-providing agency can benefit from any analysis done. Trust involves multiple factors that must be addressed over time, but a process whereby the data providers can review the work of analysts is critical, to ensure not only the proper use of the data but also that data providers are simply aware of where analysis of their information is disseminated.

Analysis of These Data

The types of analyses that are done with these data to support the social good are expanding at a significant rate. Predictive analytics are beginning to be used to affect frontline decision-making. For example, Allegheny County in Pennsylvania is using predictive risk models to determine whether a child is going to be re-referred to child protective services. This information can be used to decide whether services should be provided at the time of a call to child protective services (Hurley 2018). It is safe to say that there are high stakes around these decisions, and the receptivity of frontline workers to these data, when they are shouldering the risks of the decisions, may be less than enthusiastic. Careful evaluation, and the need to ensure that no harm comes as a result of decisions driven by big data, is of the utmost importance.

Coulton et al. (2015) provide a number of examples of data analysis that has been conducted in welfare programs, child welfare services, homelessness programs, Medicaid, and participation in multiple programs. Multiple traditional designs, such as randomized controlled trials and quasi-experimental studies, are also being used. While it is clear that these efforts will provide significant additional knowledge, translating these findings into action, to promote social good, is not a trivial task. "Translational data science" is the term that speaks to how data science can improve services and achieve social good and improvements in social welfare (https://cdis.uchicago.edu/news/2017/4/21/translational-data-science-workshop).

With the increased use of machine learning techniques with administrative data, such as decision trees, dimension reduction methods, nearest neighbor algorithms, support vector machines, and penalized regression, new possibilities for impact on social good are on the horizon (Hindman 2015). Research organizations and universities are preparing to apply these methods to improve the social good. One caution, which many have expressed, is that it is important to understand the interventions and processes that professionals, from medical professionals to social workers to teachers, use to address specific conditions. Simply processing big data can lead to spurious correlations on the one hand and recommendations that are completely unfeasible on the other. Many professionals are very constrained in what they can do with their data, even if they know that
there are confirmed relationships between particular characteristics of individuals and future outcomes. We live in a country where confidentiality and privacy are predominant social values and where research ethics must be practiced (Commission on Evidence-Based Policymaking 2017).

A lack of understanding of how a particular analysis or algorithm may be applied could lead to unintended consequences (Sandvig et al. 2016). The issue of race is one of the most sensitive. Often, because of how much variance a race variable can explain in the absence of richer data, too much is attributed to this characteristic. It can lead to profiling and inappropriate targeting and, at a minimum, ineffective practices.

Conclusion

Whether the innovative analyses of big data have caught up to the ethical use of those analyses is an open question. As data scientists attempt to implement innovations based on analytics in high-stakes domains, such as health and human services, great caution must be taken to ensure that no harm is done. While concern about potential harms is not new, the algorithms used may not be sufficiently transparent to make clear what potential actions might cause harm or adverse action based on race, gender, sexuality, ethnicity, disability, or some other characteristic.

References

Commission on Evidence-Based Policymaking (2017) The promise of evidence-based policymaking. Commission on Evidence-Based Policymaking, Washington, DC. https://www.cep.gov/content/dam/cep/report/cep-final-report.pdf (accessed 14 October 2017)
Hindman M (2015) Building better models: prediction, replication, and machine learning in the social sciences. Ann Am Acad Pol Soc Sci 659(May):48–62. https://doi.org/10.1177/0002716215570279
Hurley D (2018) Can an algorithm tell when kids are in danger? New York Times
Japec L, Kreuter F, Berg M, Biemer P, Decker P, Lampe C (2015) AAPOR report on big data
Liebman JB (2018) Using data to more rapidly address difficult U.S. social problems. Ann Am Acad Pol Soc Sci 675(1):166–181. https://doi.org/10.1177/0002716217745812
Petrila J (2018) Turning the law into a tool rather than a barrier to the use of administrative data for evidence-based policy. Ann Am Acad Pol Soc Sci 675(1):67–82. https://doi.org/10.1177/0002716217741088
Sandvig C, Hamilton K, Karahalios K, Langbort C (2016) Automation, algorithms, and politics | When the algorithm itself is a racist: diagnosing ethical harm in the basic components of software. Int J Commun 10(0):19
Wang C, Cen M-H, Schifano E, Jing W, Yan J (2016) Statistical methods and computing for Big Data. Stat Interface 9(4):399–414. https://doi.org/10.4310/SII.2016.v9.n4.a1

Big Data Analysis in Bioinformatics

Mario Cannataro
Data Analytics Research Center, University "Magna Græcia" of Catanzaro, Catanzaro, Italy
and with the person's environment" (Source: National Human Genome Research Institute).
Proteomics "Proteomics describes the study of all the proteins (the proteome) in an organism, tissue type, or cell" (Source: National Human Genome Research Institute).
Interactomics "Interactomics describes the study of the whole set of molecular interactions (the interactome) in an organism, tissue type, or cell."
Pharmacogenetics "Pharmacogenetics is the field of study dealing with the variability of responses to medications due to variation in single genes" (Source: National Human Genome Research Institute).
Pharmacogenomics "Pharmacogenomics is similar to pharmacogenetics, except that it typically involves the search for variations in multiple genes that are associated with variability in drug response" (Source: National Human Genome Research Institute).

Overview

The chapter discusses big data aspects of bioinformatics, with a special focus on omics data and the related data analysis pipelines. The main challenges related to omics data storage, preprocessing, and analysis are presented. After describing the main omics data, such as genomics, proteomics, and interactomics data, the chapter describes the main biological databases and presents several data analysis pipelines for analyzing big omics data that attempt to face the big data challenges.

Introduction

Omics disciplines such as genomics, proteomics, and interactomics, to cite a few, are at the core of modern biomedical research. Omics studies are enabled by three main factors: (i) the availability of high-throughput experimental platforms, such as microarray, mass spectrometry, and next generation sequencing, that produce an overwhelming amount of experimental data; (ii) the integration of omics data with clinical data (e.g., bioimages, pharmacology data, etc.); and (iii) the development of bioinformatics algorithms and comprehensive data analysis pipelines that preprocess, analyze, and extract knowledge from such data. The peculiar characteristics of big data in the context of bioinformatics can be summarized as follows.

• Volume: the volume of big data in bioinformatics is exploding, especially in healthcare and medicine, for many reasons: the high resolution of experimental platforms, which may produce terabytes of omics data; the increasing number of subjects enrolled in omics studies; and the increasing digitalization of healthcare data (laboratory tests, visits to the doctor, administrative data about payments) that in the past was stored on paper.
• Velocity: in bioinformatics, new data is created at increasing speed due to technological advances in experimental platforms and the increased sampling rate of sensors for health monitoring. This requires that data be ingested and analyzed in near real time. As an example, considering the medical devices used to monitor patients or the personal sensors used by subjects to collect data, big data analytics platforms need to analyze that data and to transmit the results to medical doctors or to the users themselves in near real time. Moreover, the use of IoT (Internet of Things) devices in medicine will further increase the velocity of big data in bioinformatics.
• Variety: there is a great variety and heterogeneity of data in bioinformatics, coming from medicine, biology, and healthcare. Other than raw experimental omics data, there are clinical data, administration data, sensor data, and social data that contribute to variety.
• Variability: how data is captured in bioinformatics may vary due to several factors, such as experiment setup, the use of standard or proprietary formats, and the use of local conventions in the collection of data. Thus variability may lead to wrong interpretations of the data.
• Value: the value of the big data produced in bioinformatics is related not only to the cost of the infrastructures needed to produce it (experimental platforms) and to analyze it (software tools, computing infrastructures) but also to the biomedical knowledge that can be extracted from the data.

The emergence of this big data trend in bioinformatics poses new challenges for computer science, regarding the efficient storage, preprocessing, integration, and analysis of omics and clinical data, which is today the main bottleneck of the analysis pipeline. To face those challenges, the main trends are (i) the use of high-performance computing in all steps of the analysis pipeline, including parallel processing of raw experimental data, parallel analysis of data, and efficient data visualization; (ii) the deployment of data analysis pipelines and main biological databases on the Cloud; (iii) the use of novel data models that combine structured (e.g., relational data) and unstructured (e.g., text, multimedia, biosignals, bioimages) data, with a special focus on graph databases; (iv) the development of novel data analytics methods, such as sentiment analysis, affective computing, and graph analytics, that overcome the limits of traditional statistical and data mining analysis; and (v) particular attention to issues regarding the privacy of data and users, as well as the permitted ways to use and analyze biomedical data (e.g., with extensive development of anonymization techniques and privacy-preserving rules and models). The rest of the chapter first introduces big omics data in bioinformatics, then describes the main biological databases, and finally presents several data analysis pipelines for analyzing omics data.

Big Data in Bioinformatics

This section introduces the most relevant big data generated in bioinformatics. Such data are mainly generated through high-throughput experimental platforms (e.g., microarray, mass spectrometry, next generation sequencing) that produce the so-called omics data (e.g., genomics, proteomics, and interactomics data). The volume of these data is increasingly huge because the resolution of such platforms is increasingly high and because biomedical studies usually involve thousands of biological samples.

Microarray Data
Microarrays allow the detection of, among others, gene expression data, single nucleotide polymorphism (SNP) data, and drug metabolism enzymes and transporters (DMET) SNP data. A main difference among such data is that gene expressions are represented as numerical data, while SNPs are encoded as strings. The analysis of microarray data presents several challenges (Guzzi and Cannataro 2011) that are outlined in the following.

Gene Expression Data
Main biological processes are carried out by genes and proteins. According to the central dogma of molecular biology, genes encode the formation of proteins; in particular, genes regulate the formation of proteins through the transcription process, which produces RNA (ribonucleic acid). The expression level of RNA (or mRNA, messenger RNA) is an indirect measure of the production of the corresponding proteins. Microarrays allow the study of the expression level of the RNA associated with several genes (and thus their potential protein transcriptions) in a single experiment. Moreover, microarrays also allow the study of microRNAs (miRNAs), short noncoding molecules of RNA that contain about 22 nucleotides and have a role in various functions, especially in the regulation of gene expression (Di Martino et al. 2015). In a typical microarray experiment, first the RNA is extracted from a biological sample, and then it is put onto the microarray chip, which contains a set of probes to bind the RNA. Probes are formed by oligonucleotides or complementary DNA (c-DNA). A light source is used to bleach the fluorescent markers, and then the results are stored in a high-definition image. At this
point, a preprocessing phase is needed to remove the noise, recognize the positions of the different probes, and identify the corresponding genes. The correspondence between probes and genes is usually encoded in a location file that needs to be used to correctly locate the expression level of each gene. Starting from raw microarray data (a .CEL file, in the Affymetrix terminology), the resulting preprocessed file contains the expression level of each gene (among the analyzed genes) in a sample. Joining all the data related to several samples, it is possible to obtain a matrix where each column indicates a sample, each row indicates a gene, and the element (i, j) contains the expression level of gene i in sample j.

SNP Data
DNA is made up of four bases called adenine (A), cytosine (C), guanine (G), and thymine (T). These elements are organized as double helices that form long chains called chromosomes. Each chromosome contains a DNA molecule, and each DNA molecule is made up of many genes, which are segments of DNA. Genes are then translated into proteins by the mediation of ribonucleic acid (RNA), which has a simpler structure with respect to DNA and is made of a single filament made up of four subunits, called adenine (A), cytosine (C), guanine (G), and uracil (U). Each individual has a unique sequence of DNA that determines his/her characteristics, and differences among sequences are measured in terms of substitutions of bases at the same position. The substitution of a single base that occurs in a small subset of the population is named a mutation or single nucleotide polymorphism (SNP); e.g., if we consider the sequence (ATGT) and the modified one (ACGT), then the SNP is denoted with the symbol T/C. A SNP dataset is thus constituted by a table where each row indicates a gene or a specific portion of a gene, each column indicates a sample (patient), and a cell (i, j) contains the SNP, if any, on the gene i, detected in sample j. SNP data are represented by strings over the A, C, G, T alphabet. A SNP marker is just a single base change in a DNA sequence, and several works have demonstrated the correlation between SNPs and diseases or between the presence of SNPs and the response to drugs (Wollstein et al. 2007).

DMET Data
Pharmacogenomics is a subfield of genomics that studies how variations in genes (e.g., SNPs) are responsible for variations in the metabolism of drugs, and it also investigates the so-called adverse drug reactions (ADR) that happen when a drug has a narrow therapeutic index (a measure of the amount of drug that may cause a lethal effect), i.e., when the difference between the lethal and the therapeutic dose of the drug is small. The Affymetrix DMET (drug metabolism enzymes and transporters) platform (Arbitrio et al. 2016b) simultaneously analyzes the polymorphisms of 225 ADME-related genes, i.e., genes known to be related to drug absorption, distribution, metabolism, and excretion (ADME) (Li et al. 2010). In particular, the current Affymetrix DMET microarray chip allows the investigation of 1936 probes, corresponding to 1936 different nucleotides, each one representing a portion of the genes having a role in drug metabolism. A DMET SNP dataset is a table where each row indicates a gene or a specific portion of a gene, each column indicates a sample (patient), and a cell (i, j) contains the SNP, if any, on the ADME gene i, detected in sample j.

The importance of the analysis of DMET data is demonstrated by different works that evidenced the correlation between genetic variations in ADME genes and different effects of drug treatments (Di Martino et al. 2011; Arbitrio et al. 2016a).

Mass Spectrometry Data
Proteomics is the study of the proteins expressed in an organism or a cell. Computational proteomics concerns the algorithms, databases, and methods used to manage, store, and analyze proteomics data (Cannataro 2008). Mass spectrometry (MS), the main experimental platform for proteomics, uses the mass spectrometer to investigate the proteins expressed in a sample (Aebersold and Mann 2003). In a very intuitive way, MS measures the mass (in Dalton) of all
molecules contained in a sample. More precisely, the output of an MS experiment, called a spectrum, is a long sequence of pairs of numbers (i, m/z), where m/z is the mass-to-charge ratio of a molecule and i is the intensity, or abundance, of that molecule. The manual inspection of spectra is unfeasible, so several algorithms to manage and analyze them are available. An important role is played by preprocessing algorithms, which concern the way spectra are cleaned, preprocessed, organized, and stored in an efficient way. Computational proteomics includes (i) qualitative proteomics, i.e., the identification of the sequences of the detected proteins; (ii) quantitative proteomics, i.e., the determination of the relative abundance of proteins in a sample; and (iii) biomarker discovery, i.e., finding the most discriminant features among spectra coming from two different populations (e.g., healthy versus diseased).

Interaction Data
Interactomics is a recent omics discipline that refers to the study of the interactome, i.e., the list of all the interactions in an organism (Cannataro et al. 2010b). Interactomics focuses on the modeling, storage, and retrieval of protein-to-protein interactions (PPI) and protein interaction networks (PINs). Interactomics is important to explain and interpret protein functions, because many of them are performed when proteins interact with each other. If interactions are stable along time, they form the so-called protein complexes, and computational methods of interactomics may be used to discover and predict protein complexes. Wet lab experiments allow finding binary interactions, which involve only two proteins, or multiple interactions, such as protein complexes. Computational (in silico) methods of interactomics may predict novel interactions without the execution of wet lab experiments. PPIs are often stored in databases as a list of binary interactions in the form (Pi, Pj). Such databases offer efficient storing and retrieval of single PPI data or of the entire PIN, using expressive querying mechanisms. A way to represent all the PPIs of an organism is through a graph named a protein interaction network (PIN) (Cannataro and Guzzi 2012). The nodes of a PIN are the proteins, while the edges represent the interactions among them. PINs are represented through undirected graphs, but sophisticated models use directed and labeled edges that carry additional information on each interaction. The PIN of an organism is built by extracting all the binary interactions (Pi, Pj) from a PPI database and by building the related graph. A key activity of interactomics is the study of the PIN of an organism or the comparison of the PINs of different organisms. In summary, the computational methods of interactomics concern the prediction, storage, and querying of PPI data and the analysis and comparison of PIN graphs.

Biological Databases

This section presents the main databases containing information about biological data, with a special focus on protein sequences, structures, and interactions.

Sequences Databases
Sequence databases contain the primary structure of proteins (i.e., the amino acid sequence) and related annotations, such as the experiment and scientist that discovered them. They offer a querying interface that can be interrogated using a protein identifier or a sequence fragment, e.g., to retrieve similar proteins. Proteins similar to a target protein are retrieved by using a matching algorithm named sequence alignment.

The EMBL Nucleotide Sequence Database
The EMBL (European Molecular Biology Laboratory) Nucleotide Sequence Database (Boeckmann et al. 2003) stores primary nucleotide sequences, and, through an international collaboration, it is synchronized with the DDBJ (DNA Data Bank of Japan) and GenBank databases. The database contains information about the CDS (coding DNA sequence), i.e., the coding region of genes, and other information. Currently it contains more than three billion nucleotides and more than 1.3 million sequences. Data are made available as txt, XML, or FASTA. FASTA is a standard format
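FASTA, just mentioned, is a plain-text format in which each record is a ">" header line followed by one or more sequence lines. A minimal parser sketch (the header shown is illustrative; the sequence reuses the 1L2Y fragment quoted elsewhere in this entry):

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:          # close the previous record
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)             # sequence may span several lines
    if header is not None:
        records[header] = "".join(chunks)
    return records

example = """>sp|EXAMPLE|toy entry
NLYIQWLKDG
GPSSGRPPPS"""
sequences = parse_fasta(example)
```

Real FASTA files from the databases above can be gigabytes in size, so production parsers stream records instead of reading the whole file into memory; the logic, however, is the same.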
macromolecules generated by using crystallography and nuclear magnetic resonance (NMR) experiments. Each PDB entry is stored in a single flat file. PDB is publicly accessible at www.rcsb.org. As an example, the sequence and secondary structure for protein 1L2Y (chain A) is reported below (empty means no structure assigned; H: α helix; G: 3/10 helix).

NLYIQWLKDG GPSSGRPPPS
HHHHHHHTT GGGGT

Proteomics Databases
The proteomics databases considered here are those containing mass spectrometry data.

Global Proteome Machine Database
The Global Proteome Machine Database (Craig et al. 2004) is a system for storing, analyzing, and validating proteomics tandem mass spectrometry (MS/MS) data. The system implements a relational database and uses a model based on the Hupo-PSI Minimum Information About a Proteomics Experiment (MIAPE) (Taylor et al. 2006) scheme. The Global Proteome Machine Database is publicly accessible at www.thegpm.org.

PeptideAtlas
PeptideAtlas (Desiere et al. 2006) is a database that stores mass spectrometry experiments and data about peptides and proteins identified by liquid chromatography tandem mass spectrometry (LC-MS/MS). PeptideAtlas uses spectra as primary information to annotate the genome; thus, the operation of this database involves these main phases: (i) identification of peptides through LC-MS/MS, (ii) genome annotation using peptide data, and (iii) storage of such genomics and proteomics data in the PeptideAtlas database. Users can query the system using a web interface or can download the database. PeptideAtlas is publicly accessible at www.peptideatlas.org.

PRIDE
The PRoteomics IDEntifications database (PRIDE) stores protein and peptide identifications and all the information about the performed mass spectrometry experiment (Jones et al. 2006). Protein identifications are identified by unique accession numbers and accompanied by a list of one or more peptide identifications and the related sequence and position of each peptide within the protein. Peptide identifications are encoded using the PSI (Proteomics Standards Initiative) mzData format and its evolutions. PRIDE is publicly accessible at www.ebi.ac.uk/pride.

Protein-Protein Interactions Databases
Databases of experimentally determined interactions, obtained through high-throughput experiments, and databases of predicted interactions, obtained through in silico prediction, are presented in the following.

DIP
The Database of Interacting Proteins (DIP) contains interactions experimentally determined in different organisms, including Escherichia coli, Rattus norvegicus, Homo sapiens, Saccharomyces cerevisiae, Mus musculus, Drosophila melanogaster, and Helicobacter pylori. The DIP database is implemented as a relational database, and it stores data about proteins, experiments, and interactions. The entry of new interactions is manually curated, and interactions need to be described in peer-reviewed journals. Users have different ways to query DIP through a web-based interface: (i) Node, by specifying a protein identifier; (ii) BLAST, by inserting a protein sequence that is used by DIP to retrieve all the matching proteins; (iii) Motif, by specifying a sequence motif described as a regular expression; (iv) Article, by inserting an article identifier that is used by DIP to search for interactions described in that article; and (v) pathBLAST, by inserting a list of proteins composing a pathway, which is used by DIP to extract all the interaction pathways that align with the query pathway. DIP also allows exporting the whole PIN of an organism in different formats, such as PSI-MI XML and PSI-MI TAB. DIP is publicly accessible at dip.mbi.ucla.edu/dip/.
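The binary-interaction lists and PIN graphs described in this entry can be sketched with a plain adjacency-set dictionary; the protein identifiers below are placeholders, not real DIP records:

```python
from collections import defaultdict

def build_pin(binary_interactions):
    """Build an undirected protein interaction network (PIN)
    from (Pi, Pj) pairs, as an adjacency-set dictionary."""
    pin = defaultdict(set)
    for a, b in binary_interactions:
        pin[a].add(b)   # undirected: store the edge in both directions
        pin[b].add(a)
    return pin

# Placeholder interaction list in the binary (Pi, Pj) form.
ppis = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
pin = build_pin(ppis)

# Number of interaction partners per protein, a basic PIN statistic.
degree = {p: len(neighbors) for p, neighbors in pin.items()}
```

Directed or labeled edges, as used by the more sophisticated PIN models mentioned above, would replace the plain neighbor sets with, e.g., sets of (neighbor, label) tuples.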
[…] the level of integration between software, experimental data, and biological databases, which improves the analysis and the storage of bioinformatics big data. The provision of data using the data as a service (DaaS) paradigm is thus an emerging approach for the deployment of biological databases. Examples of biological databases deployed on the cloud are contained in the DaaS offering of Amazon Web Services (AWS), which provides a centralized repository of public data sets, such as GenBank, Ensembl, the 1000 Genomes Project, Unigene, and Influenza Virus (aws.amazon.com/publicdatasets) (Fusaro et al. 2011). The work (Calabrese and Cannataro 2015) presents an overview of the main cloud-based bioinformatics systems and discusses challenges and problems related to the use of the cloud in bioinformatics.

Data Analysis Pipelines in Bioinformatics

This section presents the main software pipelines for the preprocessing and analysis of microarray, mass spectrometry, and interactomics big data.

[…] and (iv) biological interpretation. Summarization recognizes the positions of the different genes in the raw images, associating sets of pixels with the genes that generated them. Normalization corrects the variation of gene expression within the same microarray due to experimental bias. Filtering reduces the number of investigated genes considering biological aspects, e.g., genes of known function, or statistical criteria (e.g., an associated p-value). Annotation associates with each gene a set of functional information, such as the biological processes related to the gene and cross references to biological databases. Statistical and data mining analyses aim to identify biologically meaningful genes, e.g., by finding differentially expressed genes between two groups of patients on the basis of their expression values, as in case-control studies. In biological interpretation, the extracted genes are correlated with the biological processes in which they are involved; e.g., a set of under-expressed genes may be related to the insurgence of a disease. The main challenges of microarray data analysis are discussed in Guzzi and Cannataro (2011). In the following, relevant software tools for gene expression preprocessing and analysis are briefly described.
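The filtering and statistical-analysis steps just described can be sketched crudely as follows, using a simple mean-difference cutoff in place of a real test statistic; the gene names, expression values, and threshold are all illustrative:

```python
# Toy expression matrix: rows = genes, columns = samples (log-scale values).
expr = {
    "GENE_A": [5.1, 5.0, 4.9, 8.2, 8.0, 8.1],  # looks differentially expressed
    "GENE_B": [6.0, 6.1, 5.9, 6.0, 6.2, 5.8],  # looks flat across groups
}
cases = [0, 1, 2]      # column indices of the first group of patients
controls = [3, 4, 5]   # column indices of the second group

def mean(xs):
    return sum(xs) / len(xs)

def fold_change(values, group1, group2):
    """Difference of group means on the expression values."""
    return mean([values[i] for i in group2]) - mean([values[i] for i in group1])

def filter_differential(matrix, group1, group2, threshold=1.0):
    """Keep genes whose absolute mean difference exceeds the threshold,
    a crude stand-in for the statistical filtering step described above."""
    return sorted(g for g, v in matrix.items()
                  if abs(fold_change(v, group1, group2)) > threshold)

selected = filter_differential(expr, cases, controls)
```

Real pipelines replace the cutoff with a proper test (e.g., a moderated t-test) and correct the resulting p-values for multiple testing, since thousands of genes are screened at once.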
[…] specifies the input folder, where list.txt specifies the list of input CEL files. Considering SNP genotyping data, the APT command line tool to preprocess DMET data is apt-dmet-genotype, which allows the extraction of the genotype call (e.g., the individuation of the corresponding nucleotide) for each probeset. Using the apt-dmet-genotype command line, it is possible to control the main preprocessing parameters, such as the algorithm used to determine the genotype call.

Affymetrix Expression Console
Expression Console is a software tool provided by Affymetrix that implements probe set summarization of Affymetrix binary CEL files and provides both summarization and quality control algorithms. A main advantage of Expression Console is the quality control, but it is not extensible for the preprocessing of multivendor datasets.

Micro-CS
Micro-CS, or μ-CS (Microarray CEL file Summarizer), is a software tool for the automatic normalization, summarization, and annotation of Affymetrix gene expression data that adopts a client-server architecture (Guzzi and Cannataro 2010). The μ-CS client is deployed as a plug-in of the TIGR M4 platform or as a Java stand-alone tool. It encloses the Affymetrix Power Tools code and allows users to read, preprocess, and analyze binary microarray gene expression data, providing the automatic loading of preprocessing libraries and the management of intermediate files. The μ-CS server automatically updates a repository of preprocessing and annotation libraries by periodically connecting to the Affymetrix libraries repository. It then provides to the μ-CS client the correct and most recent libraries needed for preprocessing the microarray data. The μ-CS server is implemented as a web service. Users can preprocess microarray data without the need to locate, download, and invoke the proper preprocessing tools and chip-specific libraries. Thus, the main advantages of using μ-CS are avoiding wasting time finding parameter settings or using old libraries, and a guarantee of analysis quality through the provision of the most updated annotation libraries.

dChip
dChip (DNA-Chip Analyzer) is a stand-alone program for summarizing Affymetrix gene expression data. dChip offers both preprocessing and analysis functions, performs the model-based intensity normalization only, and is available for the Windows platform only. Compared to μ-CS, dChip does not automatically download the needed libraries or the annotation of preprocessed data. To preprocess gene expression data with dChip, users need to manually download the libraries, to perform the summarization method, and finally to manually download the annotation libraries to annotate the results. dChip is publicly accessible at www.dchip.org.

RMAExpress
RMAExpress is a stand-alone program for summarizing Affymetrix gene expression data (rmaexpress.bmbolstad.com). RMAExpress performs only the RMA normalization, is available for the Windows platform, and must be compiled for running on the Linux platform. Compared to μ-CS, RMAExpress does not automatically download the needed libraries or perform the annotation of files. To preprocess gene expression data with RMAExpress, users need to manually download the libraries, to perform the summarization method (RMA), and finally to manually download the annotation libraries to annotate the results.

easyExon
easyExon (Chang et al. 2008) is a software pipeline for the preprocessing and analysis of exon array data. easyExon can manage either raw binary CEL files, by transparently calling the Affymetrix APT tools, or already summarized files. easyExon integrates external tools such as the Affymetrix Power Tools (APT) for data preprocessing and the Integrated Genome Browser (IGB) for the biological interpretation of results. A drawback is related to the fact that it focuses
correct libraries, reducing errors due to incorrect only on exon array data.
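Several of the tools above (RMAExpress, μ-CS) wrap the RMA pipeline, whose normalization step is quantile normalization. The following is a minimal, illustrative sketch of that step only; the function name and toy data are assumptions for the example, not code taken from any of these tools.

```python
import numpy as np

def quantile_normalize(intensities):
    """Quantile normalization: force every array (column) to share one distribution.

    intensities: matrix of probe intensities, shape (n_probes, n_arrays).
    The k-th smallest value of each column is replaced by the mean of the
    k-th smallest values across all columns.
    """
    order = np.argsort(intensities, axis=0)       # per-column ranks
    sorted_vals = np.sort(intensities, axis=0)
    reference = sorted_vals.mean(axis=1)          # mean (reference) distribution
    normalized = np.empty_like(intensities, dtype=float)
    for j in range(intensities.shape[1]):
        normalized[order[:, j], j] = reference    # map reference back by rank
    return normalized

# Two toy "arrays" on different scales end up with identical distributions.
raw = np.array([[5.0, 4.0],
                [2.0, 1.0],
                [3.0, 4.0],
                [4.0, 2.0]])
norm = quantile_normalize(raw)
```

After normalization both columns contain the same set of values (here 1.5, 2.5, 4.0, 4.5), differing only in row order; real pipelines apply this between background correction and probe-set summarization, and also average tied ranks.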
Big Data Analysis in Bioinformatics 171
…discriminant features in spectra able to discriminate samples coming from healthy or diseased people. Mass spectrometry protein identification methods (Perkins et al. 1999; Nesvizhskii 2007) include peptide mass fingerprinting (PMF) from simple mass spectrometry (MS) spectra and protein identification from peptide peaks obtained through tandem mass spectrometry (MS/MS), also called peptide fragmentation fingerprinting (PFF). In PMF, after a specific enzymatic cleavage of a protein and a mass spectrometry measurement of the resulting peptides, the masses of the generated peptides form a map in the spectrum called the peptide mass fingerprint. Protein identification is obtained by comparing such spectra against a database of masses of theoretical peptides obtained by in silico digestion. In PFF, a peptide/protein (precursor ion) is selected, isolated, and fragmented using tandem mass spectrometry (MS/MS). The fragmentation permits de novo sequencing, or the masses of the fragments are compared against a database of masses of theoretical peptides obtained by in silico digestion.

MASCOT
MASCOT (Perkins et al. 1999) performs protein identification through PMF and scores results using the MOWSE (MOlecular Weight SEarch) algorithm. Spectra generated by the user are compared against theoretical spectra obtained by an enzymatic digestion of the protein sequences in a reference database. For each observed peak in a spectrum, the system calculates the probability that a match between the peak and a peptide is purely random. MASCOT is available at www.matrixscience.com.

swissPIT
swissPIT (Swiss Protein Identification Toolbox) (Quandt et al. 2009) runs workflows for the analysis of mass spectrometry data on the Swiss BioGRID infrastructure. swissPIT uses different peptide identification tools, so the user can submit proteomics data to multiple analysis tools to improve the protein identification process. A single input dataset may be submitted to the same tool and analyzed using different parameter settings, or it can be submitted to different tools. Such computations are executed in parallel on a high-performance computing infrastructure, i.e., the Swiss BioGRID, by using a web-based interface.

EIPeptiDi
The ICAT-labeling method allows the identification and quantification of proteins present in two samples. Nevertheless, when ICAT is coupled to LC-MS/MS, the analysis pipeline presents a drawback: insufficient sample-to-sample reproducibility of peptide identification. This results in a low number of peptides that are simultaneously identified in different experiments. The EIPeptiDi tool (Cannataro et al. 2007) improves peptide identification in a set of spectra by cross-validating peaks commonly detected in different tandem mass spectra. As a result, EIPeptiDi increases the overall number of peptides identified in a set of samples.

Analysis of Mass Spectrometry Data: Protein Quantification
Protein quantification aims to infer the relative abundance of proteins. Algorithms include label-based methods, which mine data coming from labeled samples, and label-free methods, which analyze the spectra of non-labeled samples. Label-based methods include ASAPRatio, XPRESS, and ProICAT, described in the following, while label-free methods include MSQuant, XCMS, OpenMS, and PeptideProphet.

ASAPRatio
The automated statistical analysis on protein ratio (ASAPRatio) tool (Li et al. 2003) computes the relative abundances of proteins from ICAT data. It preprocesses spectra with a smoothing filter, subtracts background noise from each spectrum, computes the abundance of each couple of corresponding peptides, groups all the peptides that belong to the same protein, and finally computes the abundance of the whole protein.

XPRESS
The XPRESS tool (Han et al. 2001) computes the relative abundance of proteins from an ICAT-labeled pair of samples by reconstructing the light and heavy profiles of the precursor ions and determining the ratio between them.

Analysis of Mass Spectrometry Data: Biomarker Discovery
Biomarker discovery is a key technique for the early detection of diseases and thus for increasing patient survival rates. Biomarker discovery in spectra data aims to find a subset of features of the spectra, i.e., peaks or identified peptides/proteins, able to discriminate the spectra in classes (usually two classes), e.g., spectra belonging to healthy or diseased subjects. The main biomarker discovery algorithms first use spectra preprocessing algorithms to clean and reduce the dimension of the data and then apply classification to the preprocessed spectra, where classification may be obtained with statistical or data mining methods. A main challenge in biomarker discovery from spectra is the high number of features, i.e., peaks or peptides, compared to the low number of samples.

BioDCV
BioDCV (Cannataro et al. 2007) is a grid-based distributed software for the analysis of both proteomics and genomics data for biomarker discovery. BioDCV solves the selection bias problem, that is, the discovery of the correct discriminating features, with a learning method based on support vector machines. BioDCV allows the setup of complex predictive biomarker discovery experiments on proteomics data.

Trans-Proteomic Pipeline
The Trans-Proteomic Pipeline (TPP) (Deutsch et al. 2010) is a complete suite of software tools, managed by the Institute for Systems Biology in Seattle, that supports the complete pipeline of proteomic data analysis. The TPP pipeline contains tools for spectra conversion into standard formats, probabilistic peptide identification, protein quantification, validation, and biological interpretation of results.

Swiss Grid Proteomics Portal
The Swiss Grid Proteomics Portal, now called iPortal, offers several tools for data preprocessing and analysis and allows users to define data analysis workflows combining them (Kunszt et al. 2015). It offers three main portlets (i.e., predefined workflows): quality control, proteomics identification, and proteomics quantification workflows. The portal uses the P-GRADE method to submit workflows to different grid systems or to remote and local resources.

MS-Analyzer
MS-Analyzer is a distributed biomarker discovery platform to classify mass spectrometry proteomics data on the grid (Cannataro and Veltri 2007). It uses ontologies and workflows to support the composition of spectra preprocessing algorithms and data mining algorithms implemented as services. MS-Analyzer implements various spectra preprocessing algorithms and encapsulates several data mining and visualization tools that are classified through two ontologies. The data mining ontology models data analysis and visualization tools, while the proteomics ontology models spectra data and preprocessing algorithms. An ontology-based workflow editor and scheduler allows the easy composition, execution, and monitoring of services on the grid or on a local cluster of workstations. Users can browse the ontologies to select the preprocessing and analysis tools suitable for the input spectra data, then compose a workflow using drag-and-drop functions, and finally execute the workflow on the grid or on a cluster. The user can design different analysis experiments using a parameter sweep approach (the same workflow is executed in many instances in parallel, where each instance has different parameter settings) or using different preprocessing algorithms. This allows the evaluation of preprocessing effectiveness in terms of performance (e.g., binned vs. not-binned spectra) and in terms of data analysis quality (e.g., classification accuracy when changing the preprocessing type).

Analysis of Protein Interaction Networks: Protein Complexes Prediction
The analysis of the biological properties of protein interaction networks (PINs) can be done
considering the topological properties of the corresponding graph, by finding relevant substructures in the graph, or by comparing the topology of different graphs. The main algorithms analyze the topological properties of a network and extract its global and local properties, which are used to infer biological knowledge. For instance, finding small subgraphs that are statistically overrepresented may indicate proteins with specific functions; finding highly connected subregions may indicate protein complexes. In the following, some tools for the analysis of PINs are described.

The Molecular Complex Detection Algorithm
The molecular complex detection algorithm (MCODE) (Bader and Hogue 2003) was one of the first approaches to extract protein complexes from a protein interaction network. The main idea of MCODE is that protein complexes, i.e., dense sub-networks, are clusters in the graph; thus, it finds complexes by building clusters. MCODE presents three main steps: (i) it weights all vertices considering local network density, where the local area is a subgraph named a k-core; (ii) the algorithm then uses this weighted graph to extend each subgraph, starting from the highest-weighted vertex; when no more vertices can be added to the complex, this task is repeated considering the next highest-weighted subgraph; and (iii) finally, protein complexes are filtered considering the number of nodes of each subgraph.

The Markov Cluster Algorithm
The Markov Cluster algorithm (MCL) (Enright et al. 2002) performs graph clustering through the simulation of a stochastic flow in the network and by analyzing the flow distribution. The main idea of MCL is to consider a biological network as a transportation network carrying a random flow. MCL simulates the evolution of the flow as Markov processes and, after a number of iterations, separates network regions with a high flow from regions with a low flow.

IMPRECO
IMPRECO (IMproving PREdiction of COmplexes) (Cannataro et al. 2010a) is a parallel tool for the prediction of protein complexes. It is a meta-predictor: it executes three prediction algorithms (MCL, MCODE, RNSC) in parallel on the input network and then integrates their results, trying to predict complexes missed by the individual predictors and to improve the overall quality measures of the prediction.

Analysis of Protein Interaction Networks: Local Alignment
As sequence alignment is used to evaluate whether similarity among biological sequences (DNA or proteins) is due to chance or to evolution, protein interaction network alignment aims to study the conservation of the interactome of organisms or the conservation of protein complexes among species. Network alignment algorithms comprise two main classes: global alignment, which finds large common sub-networks covering all the input networks and optimizing a global function based on topology considerations, and local alignment, which finds small similar subregions optimizing a similarity function based on functional similarity, such as homology. Local alignment of interaction networks searches for similar or equal subgraphs in two or more networks (Berg and Lassig 2004). Graph alignment, conceptually similar to sequence alignment, is based on a scoring function that measures the similarity of two subgraphs, which in turn is used for the global comparison of interaction networks by measuring the similarity of mutually similar subgraphs. Local alignment algorithms comparing two or more networks analyze the conservation or divergence of the interaction patterns among different species and give as output a set of conserved subgraphs.

Mawish
Mawish (Koyutürk et al. 2005) is a pairwise local alignment algorithm that finds highly similar subgraphs. The algorithm builds an alignment between two networks starting from orthologous nodes. In Mawish, the network alignment problem is formulated as a graph optimization problem with the aim of building an alignment graph with the maximum score. Mawish includes the following
steps: (i) matching of the orthologous nodes belonging to the two networks (a table of similarity scores among nodes is built); (ii) growing of the alignment (starting from a pair of similar nodes, the neighborhood of both nodes in the input networks is expanded to small local regions of high similarity using a greedy heuristic); and (iii) evaluation (the aligned subgraphs are evaluated using a statistical approach).

AlignNemo
AlignNemo is a local alignment method that uses and integrates both homology and topology information (Ciriello et al. 2012). AlignNemo finds subgraphs of the networks that are related both in biological function and in the topology of interactions. The integration algorithm of AlignNemo, which combines homology and topology information, allows better precision and recall performance compared to other local alignment methods.

AlignMCL
AlignMCL is a local alignment algorithm that finds conserved sub-networks in protein interaction networks (Mina and Guzzi 2012). AlignMCL first builds an alignment graph by merging the input protein interaction networks and then mines it to find conserved sub-networks using the Markov Cluster (MCL) algorithm. AlignMCL is an evolution of AlignNemo, in which the mining strategy of AlignNemo, not suitable for big networks, is replaced by the Markov Cluster algorithm, which allows better performance. To improve its usability, the authors developed a version of AlignMCL as a Cytoscape plug-in that can be used transparently to align protein interaction networks loaded and visualized in Cytoscape.

GLAlign
GLAlign (Global Local Aligner) is a local alignment algorithm that integrates global and local alignment (Milano et al. 2016). The idea of GLAlign is to use the topological information obtained by performing a global alignment of the input networks to improve the local alignment. In particular, the information about the global alignment results is used to guide the building of the local alignment. GLAlign uses the MAGNA++ global alignment and the AlignMCL local alignment.

Analysis of Protein Interaction Networks: Global Alignment
Global alignment algorithms find a mapping that involves all the nodes of the input interaction networks. Such algorithms find a relation among all the nodes of the two input networks and may introduce new nodes (dummy nodes) if it is not possible to match some nodes. As opposed to local alignment, which finds small regions of similarity or conserved motifs, global alignment finds a mapping that involves all nodes and optimizes a cost function.

MAGNA
At the basis of the MAGNA (Maximizing Accuracy in Global Network Alignment) algorithm (Saraph and Milenković 2014) is the following consideration about the limits of several global alignment algorithms: many global alignment methods first compute node similarities to identify the high-scoring alignments with respect to the overall node similarity, and only then evaluate the accuracy of the alignments using criteria that consider the number of conserved edges. Thus, they first build the alignment and then verify the number of conserved edges, which may not be optimal. Starting from this consideration, MAGNA takes care of edge conservation while the alignment is built, not at the end of the alignment, trying to maintain a good quality of node mapping. The core of MAGNA is a genetic algorithm that simulates a population of alignments evolving over time and allows the survival of those alignments that improve accuracy. Besides optimizing the number of conserved edges, MAGNA also optimizes other alignment accuracy measures, such as a combination of node and edge conservation.
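The edge-conservation objective discussed above can be stated concretely: given a node mapping between two networks, count how many edges of the first network land on edges of the second. The following is a minimal illustrative sketch; the toy graphs and the function name are assumptions for the example, not MAGNA code.

```python
def conserved_edges(edges1, edges2, mapping):
    """Count edges of network 1 that a node mapping sends onto edges of network 2.

    edges1, edges2: undirected edges given as pairs of node labels.
    mapping: dict taking each node of network 1 to a node of network 2.
    """
    target = {frozenset(edge) for edge in edges2}
    return sum(1 for u, v in edges1
               if frozenset((mapping[u], mapping[v])) in target)

# A triangle aligned onto a 3-node path conserves 2 of its 3 edges.
g1 = [("a", "b"), ("b", "c"), ("a", "c")]
g2 = [("x", "y"), ("y", "z")]
score = conserved_edges(g1, g2, {"a": "x", "b": "y", "c": "z"})
```

A genetic search in the spirit of MAGNA would evaluate many candidate mappings with a score like this one and keep the fittest in each generation.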
Brown KR et al (2009) NAViGaTOR: network analysis, visualization and graphing Toronto. Bioinformatics 25(24):3327–3329. https://doi.org/10.1093/bioinformatics/btp595
Calabrese B, Cannataro M (2015) Cloud computing in healthcare and biomedicine. Scalable Comput Pract Exp 16(1):1–18
Calabrese B, Cannataro M (2016) Bioinformatics and microarray data analysis on the cloud. In: Guzzi PH (ed) Microarray data analysis. Methods in molecular biology, vol 1375. Springer, New York, pp 25–39. https://doi.org/10.1007/7651_2015_236
Cannataro M (2008) Computational proteomics: management and analysis of proteomics data. Brief Bioinform. https://doi.org/10.1093/bib/bbn011
Cannataro M, Guzzi P (2012) Data management of protein interaction networks. Wiley, New York
Cannataro M, Talia D (2003) Towards the next-generation grid: a pervasive environment for knowledge-based computing. In: 2003 international symposium on information technology (ITCC 2003), Las Vegas, 28–30 Apr 2003, pp 437–441. https://doi.org/10.1109/ITCC.2003.1197569
Cannataro M, Veltri P (2007) MS-Analyzer: preprocessing and data mining services for proteomics applications on the Grid. Concurr Comput Pract Exp 19(15):2047–2066. https://doi.org/10.1002/cpe.1144
Cannataro M, Carelli G, Pugliese A, Saccà D (2001) Semantic lossy compression of XML data. In: Proceedings of the 8th international workshop on knowledge representation meets databases (KRDB 2001), Rome, 15 Sept 2001. http://ceur-ws.org/Vol-45/05-cannataro.pdf
Cannataro M, Cuda G, Gaspari M, Greco S, Tradigo G, Veltri P (2007) The EIPeptiDi tool: enhancing peptide discovery in ICAT-based LC MS/MS experiments. BMC Bioinf 8:255. https://doi.org/10.1186/1471-2105-8-255
Cannataro M, Talia D, Tradigo G, Trunfio P, Veltri P (2008) SIGMCC: a system for sharing meta patient records in a peer-to-peer environment. FGCS 24(3):222–234. https://doi.org/10.1016/j.future.2007.06.006
Cannataro M, Guzzi PH, Veltri P (2010a) IMPRECO: distributed prediction of protein complexes. Futur Gener Comput Syst 26(3):434–440
Cannataro M, Guzzi PH, Veltri P (2010b) Protein-to-protein interactions: technologies, databases, and algorithms. ACM Comput Surv 43(1). https://doi.org/10.1145/1824795.1824796
Cannataro M, Barla A, Flor R, Jurman G, Merler S, Paoli S, Tradigo G, Veltri P, Furlanello C (2007) A grid environment for high-throughput proteomics. IEEE Trans Nanobiosci 6(2):117–123. https://doi.org/10.1109/TNB.2007.897495
Cannataro M, Guzzi PH, Sarica A (2013) Data mining and life sciences applications on the grid. Wiley Interdiscip Rev Data Min Knowl Disc 3(3):216–238. https://doi.org/10.1002/widm.1090
Chang TY, Li YY, Jen CH, Yang TP, Lin CH, Hsu MT, Wang HW (2008) easyExon – a Java-based GUI tool for processing and visualization of Affymetrix exon array data. BMC Bioinf 9(1):432. https://doi.org/10.1186/1471-2105-9-432
Ciriello G, Mina M, Guzzi PH, Cannataro M, Guerra C (2012) AlignNemo: a local network alignment method to integrate homology and topology. PLoS One 7(6):e38107
Consortium TU (2010) The universal protein resource (UniProt) in 2010. Nucleic Acids Res 38(Suppl 1):D142–D148. https://doi.org/10.1093/nar/gkp846
Craig R, Cortens JP, Beavis RC (2004) Open source system for analyzing, validating, and storing protein identification data. J Proteome Res 3(6):1234–1242. https://doi.org/10.1021/pr049882h
Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R (2006) The PeptideAtlas project. Nucleic Acids Res 34(Suppl 1):D655–D658. https://doi.org/10.1093/nar/gkj040
Deutsch EW, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N, Sun Z, Nilsson E, Pratt B, Prazen B, Eng JK, Martin DB, Nesvizhskii AI, Aebersold R (2010) A guided tour of the trans-proteomic pipeline. Proteomics 10(6):1150–1159. https://doi.org/10.1002/pmic.200900375
Di Martino MT, Arbitrio M, Guzzi PH, Leone E, Baudi F, Piro E, Prantera T, Cucinotto I, Calimeri T, Rossi M, Veltri P, Cannataro M, Tagliaferri P, Tassone P (2011) A peroxisome proliferator-activated receptor gamma (PPARG) polymorphism is associated with zoledronic acid-related osteonecrosis of the jaw in multiple myeloma patients: analysis by DMET microarray profiling. Br J Haematol 154(4):529–533. https://doi.org/10.1111/j.1365-2141.2011.08622.x
Di Martino MT, Guzzi PH, Caracciolo D, Agnelli L, Neri A, Walker BA, Morgan GJ, Cannataro M, Tassone P, Tagliaferri P (2015) Integrated analysis of microRNAs, transcription factors and target genes expression discloses a specific molecular architecture of hyperdiploid multiple myeloma. Oncotarget 6(22):19132
Enright AJ, Van Dongen S, Ouzounis C (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30(7):1575–1584
Fusaro V, Patil P, Gafni E, Dennis P, Wall D, Tonellato PJ (2011) Biomedical cloud computing with Amazon web services. PLoS Comput Biol 7. https://doi.org/10.1371/journal.pcbi.1002147
Guzzi PH, Cannataro M (2010) μ-CS: an extension of the TM4 platform to manage Affymetrix binary data. BMC Bioinf 11(1):315
Guzzi PH, Cannataro M (2011) Challenges in microarray data management and analysis. In: 2011 24th international symposium on computer-based medical systems (CBMS). IEEE, pp 1–6
Guzzi PH, Agapito G, Di Martino MT, Arbitrio M, Tassone P, Tagliaferri P, Cannataro M (2012) DMET-Analyzer: automatic analysis of Affymetrix DMET data. BMC Bioinf 13(1):258. https://doi.org/10.1186/1471-2105-13-258
Guzzi PH, Agapito G, Cannataro M (2014) coreSNP: parallel processing of microarray data. IEEE Trans Comput 63(12):2961–2974. https://doi.org/10.1109/TC.2013.176
Han DK, Eng J, Zhou H, Aebersold R (2001) Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry. Nat Biotechnol 19(10):946–951. https://doi.org/10.1038/nbt1001-946
Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE (2002) PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res 30(1):163–165
Jones P, Côté RG, Martens L, Quinn AF, Taylor CF, Derache W, Hermjakob H, Apweiler R (2006) PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res 34(Database issue). https://doi.org/10.1093/nar/gkj138
Kai X, Dong D, Jing-Dong JH (2006) IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model. BMC Bioinf 7:508
Koyutürk M, Grama A, Szpankowski W (2005) Pairwise local alignment of protein interaction networks guided by models of evolution. In: Miyano S, Mesirov JP, Kasif S, Istrail S, Pevzner PA, Waterman MS (eds) RECOMB. Lecture notes in computer science, vol 3500. Springer, pp 48–65
Kunszt P, Blum L, Hullár B et al (2015) iPortal: the Swiss grid proteomics portal. Concurr Comput Pract Exp 27(2):433–445. https://doi.org/10.1002/cpe.3294
Li XJ, Zhang H, Ranish JA, Aebersold R (2003) Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry. Anal Chem 75(23):6648–6657. https://doi.org/10.1021/ac034633i
Li J, Zhang L, Zhou H, Stoneking M, Tang K (2010) Global patterns of genetic diversity and signals of natural selection for human ADME genes. Hum Mol Genet. https://doi.org/10.1093/hmg/ddq498
Mamano N, Hayes W (2017) SANA: simulated annealing far outperforms many other search algorithms for biological network alignment. Bioinformatics 33(14):2156–2164. https://doi.org/10.1093/bioinformatics/btx090
Milano M, Cannataro M, Guzzi PH (2016) GLAlign: using global graph alignment to improve local graph alignment. In: Tian T, Jiang Q, Liu Y, Burrage K, Song J, Wang Y, Hu X, Morishita S, Zhu Q, Wang G (eds) IEEE international conference on bioinformatics and biomedicine, BIBM 2016, Shenzhen, 15–18 Dec 2016. IEEE Computer Society, pp 1695–1702. https://doi.org/10.1109/BIBM.2016.7822773
Mina M, Guzzi PH (2012) AlignMCL: comparative analysis of protein interaction networks through Markov clustering. In: BIBM workshops, pp 174–181
Nesvizhskii AI (2007) Protein identification by tandem mass spectrometry and sequence database searching. In: Matthiesen R (ed) Mass spectrometry data analysis in proteomics. Methods in molecular biology, vol 367. Humana Press, pp 87–119. https://doi.org/10.1385/1-59745-275-0:87
Pagel P, Mewes H, Frishman D (2004) Conservation of protein-protein interactions – lessons from ascomycota. Trends Genet 20(2):72–76. https://doi.org/10.1016/j.tig.2003.12.007
Perkins DN, Pappin DJ, Creasy DM, Cottrell JS (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20(18):3551–3567
Quandt A, Masselot A, Hernandez P, Hernandez C, Maffioletti S, Appel RDD, Lisacek F (2009) SwissPIT: a workflow-based platform for analyzing tandem-MS spectra using the Grid. Proteomics. https://doi.org/10.1002/pmic.200800207
Saraph V, Milenković T (2014) MAGNA: maximizing accuracy in global network alignment. Bioinformatics. https://doi.org/10.1093/bioinformatics/btu409
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308–311
Sissung TM, English BC, Venzon D, Figg WD, Deeken JF (2010) Clinical pharmacology and pharmacogenetics in a genomics era: the DMET platform. Pharmacogenomics 11(1):89–103. https://doi.org/10.2217/pgs.09.154
Taylor CF, Hermjakob H, Julian RK, Garavelli JS, Aebersold R, Apweiler R (2006) The work of the human proteome organisation's proteomics standards initiative (HUPO PSI). OMICS 10(2):145–151. https://doi.org/10.1089/omi.2006.10.145
Vijayan V, Saraph V, Milenković T (2015) MAGNA++: maximizing accuracy in global network alignment via both node and edge conservation. Bioinformatics 31(14):2409–2411. https://doi.org/10.1093/bioinformatics/btv161
von Mering C, Jensen L, Snel B, Hooper S, Krupp M, Foglierini M, Jouffre N, Huynen M, Bork P (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 33(Database issue):433–437
180 Big Data Analysis Techniques
…observed value of the past data y_1, …, y_T, and ŷ_{T+h} is the forecast for time T+h, with forecast horizon h. This method is generally used to provide a baseline for the comparison and evaluation of more sophisticated forecasting models. Another simple technique is the average approach, which uses the mean of the past data as the prediction: ŷ_{T+h} = ȳ_T = (y_1 + ⋯ + y_T)/T. The drift approach uses the average change over time (called drift) in the observed data and adds it to the last observed value to create a forecast:

ŷ_{T+h} = y_T + (h/(T−1)) Σ_{t=2..T} (y_t − y_{t−1}) = y_T + h (y_T − y_1)/(T−1).

Out of the more sophisticated univariate forecasting methods, the ARIMA model (McCarthy et al. 2006; Box et al. 2008) is the most well known. ARIMA uses three concepts for the modeling of a time series: autoregression (AR), integration (I), and moving average (MA). These different concepts provide the ARIMA model with a high degree of adaptability to different time series characteristics, making it applicable in a wide range of use cases. The modeling begins with the integration step, in which the time series is differenced to eliminate trend and seasonal characteristics and make it stationary, i.e., its expected value and variance are constant over time. In the next step, the two predictive model parts, AR and MA, are applied. The autoregressive part models future time series values based on the most recent historical time series values. The moving average part models future values based on error terms, which are the result of the simulated prediction of the available time series history. Both model parts are further distinguished into nonseasonal and seasonal variants. A seasonal model has to be applied if a seasonal pattern is present in the time series. For every analyzed time series, the degree of differencing has to be determined, and the right number of AR and MA components has to be identified in order to create the optimal ARIMA model. Some guides on how to find the right ARIMA model for a specific time series have been published in Nau (2016) and Box et al. (2008).

Another commonly used univariate technique is exponential smoothing, also known as the Holt-Winters model (McCarthy et al. 2006; Holt 2004). This model uses the exponential window function to apply a smoothing mechanism to the entire time series. In doing so, exponentially decreasing weights are assigned to all observed values over time, i.e., the farther back in time a value was observed, the smaller its weight. The simplest form of exponential smoothing is s_t = α y_t + (1 − α) s_{t−1}, where α is the smoothing factor with 0 < α < 1. As such, the smoothed time series is a weighted average of the current observation y_t and the previous smoothed value s_{t−1}, where α controls the smoothing effect and thus the impact of recent changes. To create a forecast, the smoothed statistic is used with the last observed value: ŷ_{T+h|T} = α y_T + (1 − α) s_{T−1}. The model is also capable of handling time series with trend (double exponential smoothing) and seasonal (triple exponential smoothing) characteristics. This is done by introducing additional components that take potential trend and seasonal behavior into account. These components are smoothed themselves and added to the forecast calculation using individual smoothing factors.

Two more recent additions to this class are multivariate adaptive regression splines (MARS) (Friedman 1991) and gradient boosting machines (GBM) (Friedman 2001). MARS uses a set of external influences, of which it automatically extracts the most relevant ones to forecast the currently analyzed time series. Furthermore, it is capable of modeling nonlinear relationships between the external influences and the target series. The GBM model uses an ensemble of weak predictors, typically regression trees, and combines their predictions into the final forecast for the analyzed time series. The input for the decision trees may be historical data of the time series itself or external influences.

Multivariate Forecasting Methods
Multivariate forecasting techniques focus on the analysis and prediction of multiple time series within one model. The most commonly referenced approaches are VARIMA models (Riise and Tjostheim 1984; Tiao and Box 1981). Like ARIMA, they are based on the concepts of autoregression and moving averages but apply them to a set of time series. In doing so, they model
182 Big Data Analysis Techniques
how the time series in a data set influence each other. Each time series is modeled based on its own past data, the past data of all other time series in the set, and an error term. Assuming a simple two-dimensional VARIMA model with two time series y_t^(1) and y_t^(2), the model equations are:

y_t^(1) = c_1 + a_11 · y_{t−1}^(1) + a_12 · y_{t−1}^(2) + ε_1
y_t^(2) = c_2 + a_21 · y_{t−1}^(1) + a_22 · y_{t−1}^(2) + ε_2

where c is a constant value and ε is an error term. The model parameters c and a must be estimated. Econometric models build on this idea and add external influences which affect the time series that should be predicted but are not themselves affected by other time series (Chatfield 2000).

Techniques for Incomplete Time Series Data
The forecasting of time series requires complete historic data. Gaps and missing values generally prevent the creation of a forecasting model. For this reason, approaches for handling incomplete or intermittent time series have been discussed in the forecasting literature. Croston's method is the most widely used approach for this kind of data and was designed for intermittent time series of nonnegative integer values (Croston 1972; Shenstone and Hyndman 2005). The model involves two exponential smoothing processes: one models the time period between recorded (nonzero) observations of the time series, while the other models the actually observed values. As Croston's method models single time series only, it belongs to the class of univariate techniques. It is mentioned in this separate category because it was developed especially for the context of incomplete time series data. While Croston's method is able to calculate forecasts for incomplete time series, missing values have to occur regularly in order to be properly modeled. This can be problematic in application scenarios where regular patterns of missing observations are not given.

Another possibility to deal with incomplete time series is to exploit the often hierarchical structure of large data sets (Fliedner 2001). For this, time series are transferred to a more coarse-grained aggregation level by the application of an aggregation function, e.g., the sum. For example, time series representing the energy consumption of many individual consumers are aggregated to a higher aggregation level representing a whole city or a certain district. With this, missing observations can be compensated, as observation gaps in one time series can be covered by observations from other time series. In addition, the influence of noisy data is reduced, as fluctuations of different time series can equalize each other. However, this approach is not guaranteed to solve the problem of incomplete data entirely. Observation gaps can persist even at high aggregation levels, e.g., if only few time series are available for aggregation. In addition, aggregation leads to a loss of detail and hampers the calculation of forecasts on more fine-grained levels of the data.

Machine Learning Approaches
Techniques from the area of machine learning can also be applied to forecasting. Two well-known representatives are neural networks (Faraway and Chatfield 1998) and support vector regression (Basak et al. 2007). In the context of forecasting, both techniques are mostly used to solve regression problems, i.e., to model the dependence of a time series value on its predecessors, just as the AR component of the ARIMA model does. These approaches can be considered univariate techniques, as they generally focus on single time series. However, they also form a class of their own, as they utilize a different modeling approach. While typical univariate techniques model the time series as a whole, machine learning approaches often separate the sequence of observations into tuples containing an influenced historical observation and the observation(s) that influence it. These tuples are then used to train a model which solves the regression problem. This makes machine learning approaches more robust against missing values and also allows some techniques to model nonlinear dependencies in the observed data. On the downside, these methods have a reduced modeling versatility (e.g., no trend or season models) and are computationally expensive due to the higher number of parameters that have to be optimized.
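Several of the univariate ideas in this entry reduce to a few lines of code. Below is a minimal Python sketch (with made-up toy data, not from the entry) of the drift forecast, simple exponential smoothing, and the tuple-based windowing used by the machine learning approaches; the real methods add trend/seasonal components and proper parameter estimation:

```python
# Toy illustrations of three forecasting building blocks described above.

def drift_forecast(y, h):
    """Drift forecast: y^_(T+h) = y_T + h * (y_T - y_1) / (T - 1)."""
    return y[-1] + h * (y[-1] - y[0]) / (len(y) - 1)

def ses_forecast(y, alpha):
    """Simple exponential smoothing: s_t = alpha*y_t + (1-alpha)*s_{t-1};
    the forecast is alpha*y_T + (1-alpha)*s_{T-1}."""
    s = y[0]                       # initialize the smoothed statistic
    for obs in y[1:-1]:            # recurse up to s_{T-1}
        s = alpha * obs + (1 - alpha) * s
    return alpha * y[-1] + (1 - alpha) * s

def make_tuples(y, p):
    """Split a series into (p-lag window, next value) training pairs,
    as used by ML regression approaches."""
    return [(y[i:i + p], y[i + p]) for i in range(len(y) - p)]

series = [10.0, 12.0, 11.0, 13.0, 14.0]
print(drift_forecast(series, h=2))        # 14 + 2*(14-10)/4 = 16.0
print(ses_forecast(series, alpha=0.5))    # 13.0
print(make_tuples(series, p=2)[0])        # ([10.0, 12.0], 11.0)
```

A trained regression model (neural network, support vector regression, or plain least squares) would then map each lag window to its successor to produce forecasts.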
Applications of Time Series Forecasting

Time series forecasting is an important tool for planning and decision-making in many application domains. A typical domain for forecasting is commerce, where predictions of sales quantities are used to observe market development and steer companies accordingly. Over the last decades, renewable energy production, with its integration in the energy market, has established itself as a major application domain for time series forecasting. Due to the fluctuating nature of renewable energy sources, considerable planning is necessary in order to guarantee stable energy supply and grid operation. A lot of research effort is and has been spent on the development of accurate models for forecasting the energy demand and energy production of renewable energy sources (Arrowhead 2015; GOFLEX 2017). In recent years, the Internet of Things (IoT) (Vermesan and Friess 2013) has become another major application domain for time series forecasting. The IoT trend has led to a wide-ranging deployment of embedded sensors into physical devices ranging from home appliances to vehicles and even facilities. All these sensors create time series, and the prediction of these time series is of interest in many domains. For example, the German "Industry 4.0" initiative aims at the integration of IoT data in manufacturing. There, key performance indicators are forecasted in order to plan production and maintenance schedules.

Future Directions for Research

While time series forecasting has been studied extensively, a lot of research continues to be done in this field. One major direction for research is the development of forecasting models for specific application domains. This means that specific domain knowledge is directly integrated into the forecast model in order to improve forecast accuracy. Another major research direction considers the handling and utilization of large-scale time series data. The data gathering trend in general and IoT in particular have made large sets of related time series data (e.g., smart meters installed in the same city, pressure sensors on multiple tools of the same type) more and more available. Thus, the integration of such "peer" time series into multivariate forecasting models has become an important research question.

Cross-References

Apache Mahout
Apache SystemML
Big Data Analysis and IOT
Business Process Analytics
Business Process Monitoring

References

Arrowhead (2015) Arrowhead framework, 28 Apr 2015. http://www.arrowhead.eu/
Basak D, Pal S, Patranabis DC (2007) Support vector regression. Neural Inf Process Lett Rev 11(10):203–224. https://doi.org/10.4258/hir.2010.16.4.224
Box GEP, Jenkins GM, Reinsel GC (2008) Time series analysis: forecasting and control. Wiley series in probability and statistics. Wiley, Hoboken
Chatfield C (2000) Time-series forecasting. Chapman & Hall/CRC Press, Boca Raton
Croston JD (1972) Forecasting and stock control for intermittent demands. Oper Res Q 23:289–303
Faraway J, Chatfield C (1998) Time series forecasting with neural networks: a comparative study using the airline data. J R Stat Soc Ser C Appl Stat 47(2):231–250
Fliedner G (2001) Hierarchical forecasting: issues and use guidelines. Ind Manag Data Syst 101(1):5–12. https://doi.org/10.1108/02635570110365952
Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67. https://doi.org/10.1214/aos/1176347963
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232
GOFLEX (2017) GOFLEX project, 11 Feb 2017. http://goflex-project.eu/
Holt CC (2004) Forecasting seasonals and trends by exponentially weighted moving averages. Int J Forecast 20(1):5–10. https://doi.org/10.1016/j.ijforecast.2003.09.015
Makridakis SG, Wheelwright SC, Hyndman RJ (1998) Forecasting: methods and applications. Wiley, New York
McCarthy TM, Davis DF, Golicic SL, Mentzer JT (2006) The evolution of sales forecasting management:
184 Big Data Analytics
Big data refers to data sets which have large sizes and complex structures. The data size can range from dozens of terabytes to a few petabytes and is still growing. Big data is usually characterized by three Vs, namely the large Volume of the data, the high Variety of data formats, and the fast Velocity of data generation. The importance of big data comes from the value contained in the data. However, due to the three Vs, it is challenging to obtain this value quickly and efficiently. Thus, it is important to design big data systems which analyze big data according to the features of the data.

Exascale computing refers to computing systems capable of at least one exaFLOPS, i.e., 10^18 floating point calculations per second. The need for going to exascale becomes imperative with the emergence of big data applications. Especially for extreme-scale scientific applications such as climate modeling (Fu et al. 2017b) and earthquake simulation (Fu et al. 2017a), it can easily take years to complete one simulation if executed on a single core, not to mention that scientists usually need to run the same simulation multiple times to evaluate the impact of different parameters.
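A back-of-the-envelope calculation illustrates the scale of the definition above (the workload size and per-core rate below are assumed figures for illustration, not from the entry):

```python
# Rough comparison: a hypothetical simulation on one core vs. an exascale machine.
EXA = 1e18            # exaFLOPS: 10**18 floating point operations per second
single_core = 1e10    # assume ~10 GFLOPS for a single modern CPU core
workload = 1e21       # hypothetical simulation needing 10**21 operations

years_single = workload / single_core / (3600 * 24 * 365)
secs_exascale = workload / EXA
print(f"single core: {years_single:.0f} years, exascale: {secs_exascale:.0f} s")
```

Even this modest (by extreme-scale standards) workload takes millennia on one core but minutes at exascale, which is the motivation stated above.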
Big Data and Exascale Computing 185
However, going toward exascale computing is not easy, and the challenges come from both hardware and software aspects. From the hardware side, with the millions of cores in one computer system, the power consumption and resilience problems have become especially important. From the software side, several major challenges must be resolved, such as achieving extreme parallelism to utilize the millions of cores efficiently, energy efficiency, fault tolerance, and I/O efficiency for data-intensive computations.

Current Efforts Toward Exascale Computing

Researchers have made great efforts to address the hardware and software challenges of exascale computing.

Energy efficiency. The US Department of Energy (DOE) has set a power budget of no more than 20 MW for an exaflop machine, forcing co-design innovations in both hardware and software to improve the energy efficiency of exascale computing. On the hardware side, many-core processor designs have become more popular in exascale computing to increase the ratio of computational power to power consumption. New technologies such as Intel's Running Average Power Limit (RAPL) have been introduced to monitor and control various power and thermal features of processors. The energy efficiency of other hardware components, such as memory and interconnect, is also being enhanced to improve the overall efficiency of the systems. On the software side, simulation codes and parallel programming models need to be modified accordingly to better adapt to the new architectural features of future exascale computing systems.

Resilience and fault tolerance. With the large system scale, failures due to hardware and software causes have become the norm rather than the exception in exascale computing systems. Current efforts address the resilience problem on both the hardware and software sides. On the hardware side, system-on-chip designs are proposed to integrate hardware components onto a single chip to reduce the hard error rate. Techniques have also been introduced to reduce the error rate of each individual component on the chip, such as adding redundant cores in processors and using powerful error correction codes in unreliable message channels to reduce the likelihood of failures/errors. On the software side, debugging at exascale is extremely hard, as some scalability bugs occur only intermittently at full scale. The current solution mainly uses application-based checkpoint/restart techniques (Cappello et al. 2014). Another solution is again to introduce redundancy in the software and use replication to reduce the likelihood of software failures. Emerging nonvolatile memory can also be used to reduce the checkpointing overhead (Gao et al. 2015).

The current efforts on developing high-performance computers (HPC) have led to a number of petascale computers. Since 1993, the TOP500 project has listed the world's 500 most powerful nondistributed computer systems twice a year. The ranking is mainly based on measuring a portable implementation of the high-performance LINPACK benchmark. Currently, China's Sunway TaihuLight holds the No. 1 position in the TOP500 ranking (released in November 2017); it reaches 93.015 petaFLOPS on the LINPACK benchmark and has a theoretical peak performance of 125.436 petaFLOPS. This means that to reach exascale computing, we still need to increase the computational power of existing HPC systems by at least ten times. Table 1 shows the specifications of the first five supercomputers on the latest TOP500 list.

Many countries are planning to have their exascale computers in the next 5 years. For example, according to the national plan for the next generation of high-performance computers, China plans to have its first operational exascale computer by 2020. The US Exascale Computing Project plans to have capable exascale systems delivered in 2022 and deployed in 2023.

Convergence of Big Data and Exascale Computing

Although the powerful computation capability of supercomputers seems a perfect fit for big data
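Application-based checkpoint/restart, mentioned above as the main current solution, can be sketched in miniature. This is an illustrative toy (the file name, interval, and simulated work are assumptions; production HPC codes checkpoint to parallel file systems via dedicated libraries):

```python
# Minimal application-level checkpoint/restart: periodically persist state so
# a failed run can resume from the last checkpoint instead of from scratch.
import os
import pickle

CKPT = "sim_state.ckpt"   # illustrative checkpoint file name

def run(steps, ckpt_every=10):
    """Run a toy simulation, resuming from the last checkpoint if present."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            step, total = pickle.load(f)    # restart: reload saved state
    else:
        step, total = 0, 0.0                # cold start
    while step < steps:
        total += step * 0.5                 # stand-in for one timestep of work
        step += 1
        if step % ckpt_every == 0:          # periodically persist the state
            with open(CKPT, "wb") as f:
                pickle.dump((step, total), f)
    return total

result = run(100)        # a crash between checkpoints loses at most 10 steps
if os.path.exists(CKPT):
    os.remove(CKPT)      # clean up after a successful run
print(result)
```

The checkpoint interval trades I/O overhead against the amount of work lost on failure, which is why reducing checkpointing cost (e.g., with nonvolatile memory) matters at exascale.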
Big Data and Exascale Computing, Table 1 Specifications of the top five supercomputers in the world (released in November 2017)

Rank  Name               Model                       Processor                     Interconnect    Vendor     Country        Year
1     Sunway TaihuLight  Sunway MPP                  SW26010                       Sunway          NRCPC      China          2016
2     Tianhe-2           TH-IVB-FEP                  Xeon E5-2692, Xeon Phi 31S1P  TH Express-2    NUDT       China          2013
3     Piz Daint          Cray XC50                   Xeon E5-2690v3, Tesla P100    Aries           Cray       Switzerland    2016
4     Gyoukou            ZettaScaler-2.2 HPC system  Xeon D-1571, PEZY-SC2         Infiniband EDR  ExaScaler  Japan          2017
5     Titan              Cray XK7                    Opteron 6274, Tesla K20X      Gemini          Cray       United States  2012
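The "at least ten times" figure quoted in the text can be checked directly against the Table 1 leader:

```python
# Gap between TaihuLight's measured LINPACK performance and one exaFLOPS.
taihulight_pflops = 93.015      # measured LINPACK performance (Table 1 leader)
exa_in_pflops = 1000.0          # 1 exaFLOPS = 1000 petaFLOPS
factor = exa_in_pflops / taihulight_pflops
print(f"{factor:.1f}x")         # roughly 10.8x, i.e., "at least ten times"
```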
processing, there is a big gap between the two. This is mainly because the design of HPC aims for vertical scaling (i.e., scale-up, which means adding more resources to individual nodes in a system), while many big data processing systems require good horizontal scalability (i.e., scale-out, which means adding more nodes to a system). Research on how to fill the gap and lead to the convergence of the two fields is necessary.

Current Efforts Toward Convergence

Various projects and workshops have been proposed to foster the convergence. For example, the BigStorage (EU 2017) European project aims to use storage-based techniques to fill the gap between existing exascale infrastructures and applications dealing with large volumes of data. The Big Data and Extreme Computing (BDEC) workshop is jointly organized by over 50 universities, research institutes, and companies in the United States, the European Union, China, and Japan. It started in 2013 and has been looking for pathways toward the convergence of big data and extreme computing. Technically, there are two main pathways to prompt the convergence, namely, introducing new exascale architectures compatible with existing big data analytics software and introducing new data analytics software for exascale computing platforms.

On the first pathway, current cloud providers are offering high-performance instance types with high-performance processors, large memory, and fast networking for computation- and data-intensive applications. The underlying infrastructures of these instance types are also different from traditional ones, which are mostly based on commodity machines. For example, it was recently announced that Cray will bring its physical supercomputers into the Windows Azure cloud, for users to easily get access to HPC in the cloud.

On the second pathway, scientists are improving the parallelism of data-intensive simulations such as earthquake simulations to better utilize the power of supercomputers. As I/O is important to the overall performance of big data in HPC, new components such as the Burst Buffer are introduced into parallel file systems to improve the I/O efficiency of big data in HPC. The traditional workloads in HPC systems are mostly write-intensive, while many big data applications are read-intensive. The Burst Buffer has been adopted in storage systems as a prefetcher for read-intensive data analytics jobs to optimize their I/O performance.

Challenges and Opportunities

We see a clear trend toward, and need for, the convergence of big data and exascale computing. On the one hand, HPC cloud offerings are emerging from major cloud providers such as Windows Azure, Amazon EC2, and Tencent Cloud to provide high-performance computing resources or even exclusive physical resources for data-intensive applications. On the other hand, new software such as the Burst Buffer and I/O scheduling algorithms is proposed for optimizing big data applications on traditional parallel supercomputers. Although these efforts
Big Data and Fog Computing 187
have offered great opportunities for the convergence, challenges also exist.

Challenges on converging big data and exascale computing mainly lie in exascale software and hardware co-design. Currently, most efforts on improving the scalability of large-scale simulations in HPC systems are hardware-specific, namely, the optimizations for improving the parallelism of simulation codes are designed for a specific HPC hardware architecture (e.g., Fu et al. 2017a,b). Due to the architectural differences of HPC systems (refer to Table 1), a portable interface for building big data analytics jobs and automatically parallelizing them on HPC systems is needed.

References

Apache (2011) The Apache Hadoop project. http://www.hadoop.org. Accessed 1 Dec 2017
Apache (2017) Apache Beam: an advanced unified programming model. https://beam.apache.org/. Accessed 1 Dec 2017
Cappello F, Al G, Gropp W, Kale S, Kramer B, Snir M (2014) Toward exascale resilience: 2014 update. Supercomput Front Innov: Int J 1(1):5–28
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design & implementation, OSDI'04, vol 6. USENIX Association, Berkeley, pp 137–149
EU (2017) BigStorage European project. http://bigstorage-project.eu/. Accessed 1 Dec 2017
Friedman E, Tzoumas K (2016) Introduction to Apache Flink: stream processing for real time and beyond, 1st edn. O'Reilly Media Inc., Sebastopol
Fu H, He C, Chen B, Yin Z, Zhang Z, Zhang W, Zhang T, Xue W, Liu W, Yin W, Yang G, Chen X (2017a) 18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: enabling depiction of 18-Hz and 8-meter scenarios. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, SC '17. ACM, New York, pp 2:1–2:12
Fu H, Liao J, Ding N, Duan X, Gan L, Liang Y, Wang X, Yang J, Zheng Y, Liu W, Wang L, Yang G (2017b) Redesigning CAM-SE for peta-scale climate modeling performance and ultra-high resolution on Sunway TaihuLight. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, SC '17. ACM, New York, pp 1:1–1:12
Gao S, He B, Xu J (2015) Real-time in-memory checkpointing for future hybrid memory systems. In: Proceedings of the 29th ACM international conference on supercomputing, ICS '15. ACM, New York, pp 263–272. http://doi.acm.org/10.1145/2751205.2751212
Reed DA, Dongarra J (2015) Exascale computing and big data. Commun ACM 58(7):56–68. http://doi.acm.org/10.1145/2699414
Toshniwal A, Taneja S, Shukla A, Ramasamy K, Patel JM, Kulkarni S, Jackson J, Gade K, Fu M, Donham J, Bhagat N, Mittal S, Ryaboy D (2014) Storm@twitter. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, SIGMOD '14. ACM, New York, pp 147–156
Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O'Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th annual symposium on cloud computing, SOCC '13. ACM, New York, pp 5:1–5:16
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing (HotCloud'10). USENIX Association, Berkeley, pp 1–7
Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles, SOSP '13. ACM, New York, pp 423–438

Big Data and Fog Computing

Yogesh Simmhan
Department of Computational and Data Sciences, Indian Institute of Science (IISc), Bangalore, India

Synonyms

Cloudlets

Definitions

Fog Computing A model of distributed computing comprising virtualized, heterogeneous, commodity computing and storage resources for hosting applications, analytics, content, and services that are accessible with low latency from the edge of wide area networks where clients are present, while also having back-end connectivity to cloud computing resources.
computing has been evolving as a concept and covers a wide class of resources that sit between the edge devices and cloud data centers on the network topology, have capacities that fall between edge devices and commodity clusters on clouds, and may be managed ad hoc as a smart gateway or professionally as a computing infrastructure (Varshney and Simmhan 2017; Vaquero and Rodero-Merino 2014; Yi et al. 2015).

Motivations

Fog computing is relevant in the context of wide-area distributed systems, with many clients at the edge of the Internet (Donnet and Friedman 2007). Such clients may be mobile or consumer devices (e.g., smartphone, smart watch, virtual reality (VR) headset) used interactively by humans, or devices that are part of the IoT (e.g., smart meter, traffic cameras and signaling, driverless cars) for machine-to-machine (M2M) interactions. The clients may serve both as consumers of data or actuators that receive control signals (e.g., fitness notification on a smart watch, signal change operation on a traffic light) and as producers of data (e.g., heart rate observations from a smart watch, video streams from traffic cameras) (Dastjerdi and Buyya 2016).

There are two contemporary models of computing for applications and analytics that use data from, or generate content and controls for, such clients at the edge of the Internet (Simmhan 2017), as illustrated in Fig. 1. In a cloud-centric model, also called cloud computing, data from these clients is sent to a central data center where the processing is done, and the responses, if any, are sent back to the same or different client(s) (Simmhan et al. 2013; Cai et al. 2017). In an edge-centric model, or edge computing, part (or even all) of the processing is done at the data source, with the rest done on the cloud (Beck et al. 2014; Anand et al. 2017).

These have their relative advantages. Cloud computing outsources computing infrastructure management to providers, who offer elastic access to seemingly infinite compute resources on demand, which can be rented by the minute. They are also cheaper due to economies of scale at centralized locations (Cai et al. 2017). Edge computing leverages the compute capacity of existing captive devices and reduces network transfers, both of which lower costs. There may also be enhanced trust and context available closer to the edge (Garcia Lopez et al. 2015).

While fog computing is still maturing, there are many reasons why its rise is inevitable, given the gaps in these two common computing approaches. The network latency from the edge client to the cloud data center is high and variable, averaging between 20 and 250 ms depending on the location of the client and the data center (He et al. 2013). The network bandwidth between the edge and the cloud, similarly, averages at about 800–1200 KB/s. Both of these mean that latency-sensitive or bandwidth-intensive applications will offer poor performance using a cloud-centric model due to the round-trip time between edge and cloud (Satyanarayanan et al. 2009, 2015). Another factor is the connectivity of devices to the Internet. Mobile devices may be out of network coverage periodically and cause

Big Data and Fog Computing, Fig. 1 Connectivity between resources in cloud, edge, and P2P models
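The latency and bandwidth figures quoted above can be turned into a rough cost model. The payload size and the fog-side numbers below are illustrative assumptions, not measurements from the entry:

```python
# Rough edge-to-cloud vs. edge-to-fog transfer model.
def transfer_ms(payload_kb, rtt_ms, bandwidth_kbps):
    """Round-trip latency plus transmission time, in milliseconds."""
    return rtt_ms + payload_kb / bandwidth_kbps * 1000.0

payload = 500.0   # KB per sensor burst (assumed)
# Mid-range WAN figures from the text: ~100 ms RTT, ~1000 KB/s to the cloud.
cloud_ms = transfer_ms(payload, rtt_ms=100.0, bandwidth_kbps=1000.0)
# Assumed LAN-class link to a nearby fog resource.
fog_ms = transfer_ms(payload, rtt_ms=5.0, bandwidth_kbps=10000.0)
print(f"cloud: {cloud_ms:.0f} ms, fog: {fog_ms:.0f} ms")
```

Under these assumptions the fog path is an order of magnitude faster, which is the core argument for moving latency-sensitive processing off the cloud-centric path.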
Big Data and Fog Computing, Fig. 2 Different interaction models in fog computing (hierarchical, mesh, and flat arrangements of cloud, fog, and edge resources)
Fog resources tend to be more reliable and available than edge resources, though they lack the robust fail-over mechanisms of the cloud that are possible due to its large resource pool. This makes them well-suited to serve as a layer for persisting data and services for the short and medium term. Fog resources themselves may be deployed in a stationary environment (e.g., coffee shop, airport, cell tower) or on a mobile platform (e.g., train, cab) (Chiang and Zhang 2016). This can affect the network connectivity of the fog with the cloud, in case it uses cellular networks for Internet access, and even its energy footprint (Jalali et al. 2017).

Fog deployment models are still emerging. These resources may be deployed within a public or a private network, depending on their end use. Smart city deployments may make them available to utility services within the city network (Yannuzzi et al. 2017), while retail shops and transit services can make them available to their customers. A wider deployment for public use on demand, say, by cellular providers, cities, or even cloud providers, will make them comparable to cloud computing in terms of accessibility. These choices have implications on the operational costs as well (Vaquero and Rodero-Merino 2014).

Fog resources may be made available as-a-service, similar to cloud resources. These may be virtualized or non-virtualized infrastructure (Bittencourt et al. 2015), with containers offering a useful alternative to hypervisor-based VMs, which may be too heavyweight for lower-end fog resources (Anand et al. 2017). However, there is still a lack of a common platform, and programming models are just emerging (Hong et al. 2013; Ravindra et al. 2017). Most applications that use the fog tend to be custom designed, and there has only been some theoretical work on scheduling applications across edge, fog, and cloud (Brogi and Forti 2017; Ghosh and Simmhan In Press).

Role of Big Data

One of the key rationales for deploying and using fog computing is the Big Data generated at the edge of the network. This is accelerated by IoT deployments. Traditional web clients, which just consume services and content from the WWW, saw the growth of content distribution networks (CDN) to serve these with low latency. The data from IoT sensors is instead generated at the clients and pushed to the cloud (Cai et al. 2017). In this context, fog computing has been described as acting like an inverse CDN (Satyanarayanan et al. 2015).

A large swathe of IoT data comes as observational streams, or time-series data, from widely distributed sensors (Naas et al. 2017; Shukla et al. 2017). These data streams vary in their rates – once every 15 min for smart utility meters, every second for heart rate monitoring by a fitness watch, up to 50 Hz for phasor measurement units (PMU) in smart power grids – and the number of sensors can range in the millions for city-scale deployments. These are high-velocity data streams that are latency sensitive and need online analytics and decision-making to provide, say, health alerts or to manage the power grid behavior (Simmhan et al. 2013). Here, fog computing can help move the decision-making close to the edge to reduce latency.

Another class of emerging high-volume data is video streams from traffic and surveillance cameras, used for public safety, intelligent traffic management, and even driverless cars (Aazam et al. 2016). Here, the bandwidth consumed in moving the data from the edge to the cloud can be enormous, as high-definition cameras become cheap but network capacity growth does not keep pace (Satyanarayanan et al. 2015). The applications that need to be supported can span real-time video analytics to just recording footage for future use. Fog computing can reduce the bandwidth consumed in the core Internet and limit data movement to the local network. In addition, it can offer higher compute resources and accelerators to deploy complex analytics as well.

In addition, telemetry data from monitoring the health of the IoT fabric itself may form a large corpus (Yannuzzi et al. 2017). Related to this is provenance that describes the source and processing done on the distributed devices that may be
essential to determine the quality and veracity of the data. Fog can help with the collection and curation of such ancillary data streams as well.
Data archival is another key requirement within such applications (Cai et al. 2017; Vaquero and Rodero-Merino 2014). Besides online analytics, the raw or preprocessed observational data may need to be persisted for the medium or long term to train models for machine learning (ML) algorithms, for auditing to justify automated decisions, or to analyze on demand based on external factors. Depending on the duration of persistence, data may be buffered in the fog either transiently or for movement to the cloud during off-peak hours to shape bandwidth usage. Data can also be filtered or aggregated to send only the necessary subset to the cloud.
Lastly, metadata describing the entities in the ecosystem will be essential for information integration from diverse domains (Anand et al. 2017). These can be static or slow-changing data or even complex knowledge or semantic graphs that are constructed. They may need to be combined with real-time data to support decision-making (Zhou et al. 2017). The fog layer can play a role in replicating and maintaining this across distributed resources closer to the edge.
There has been rapid progress on Big Data platforms on clouds and clusters, with frameworks like Spark, Storm, Flink, HBase, Pregel, and TensorFlow helping store and process large data volumes, velocities, and semi-structured data. Clouds also offer these platforms as a service. However, there is a lack of programming models, platforms, and middleware to support the various processing patterns necessary over Big Data at the edge of the network that can effectively leverage edge, fog, and cloud resources (Pradhan et al. In Press).

Examples of Applications

The applications driving the deployment and need for fog computing are diverse, but some requirements are recurrent: low-latency processing, high-volume data, high computing or storage needs, privacy and security, and robustness. These span virtual and augmented reality (VR/AR) applications and gaming (Yi et al. 2015), Industrial IoT (Chiang and Zhang 2016), and field support for the military (Lewis et al. 2014). Some of the emerging and high-impact applications are highlighted below.

Smart Cities
Smart cities are a key driver for fog computing, and these are already being deployed. Barcelona's "street-side cabinets" offer fog resources as part of the city infrastructure (Yannuzzi et al. 2017). Here, audio and environment sensors, video cameras, and power utility monitors are packaged alongside compute resources and network backhaul capacity as part of fog cabinets placed along streets. These help aggregate data from sensors, perform basic analytics, and also offer WiFi hot spots for public use. As an example, audio analytics at the fog helps identify loud noises that then trigger a surveillance camera to capture a segment of video for further analysis. Similar efforts are underway in other cities as well (Amrutur et al. 2017). These go toward supporting diverse smart city applications for power and water utilities, intelligent transport, public safety, etc., both by the city and by app developers.
One of the major drivers for such city-scale fog infrastructure is likely to be video surveillance, which is starting to become pervasive in urban spaces (Satyanarayanan et al. 2015). Such city or even crowd-sourced video feeds form meta-sensors when combined with recent advances in deep neural networks (DNNs). For example, feature extraction and classification can count traffic and crowds, detect pollution levels, identify anomalies, etc. As such, they can replace myriad other environment sensors when supported by real-time analytics. Such DNN and ML algorithms are computationally costly to train and even to infer with and can make use of fog resources with accelerators (Khochare et al. 2017).

Healthcare
Wearables are playing a big role not just in personal fitness but also as assistive technologies in healthcare. Projects have investigated the use
of such on-person monitoring devices to detect when stroke patients have fallen and need external help (Cao et al. 2015). Others use eye-glass cameras and head-up displays (HUDs) to offer verbal and cognitive cues to Alzheimer's patients suffering from memory loss, based on visual analytics (Satyanarayanan et al. 2009). Predictive analytics over brain signals monitored from EEG headsets have been used for real-time mental state monitoring (Sadeghi et al. 2016). These are then used to mitigate external conditions and stimuli that can affect patients' mental state.
All these applications require low-latency and reliable analytics to be performed over observations that are being collected by wearables. Given that such devices need to be lightweight, the computation is often outsourced to a smartphone or a server that acts as a fog resource and to which the observational data is passed for computation. There are also decisions to be made with respect to what to compute on the wearable and what to communicate to the fog, so as to balance the energy usage of the device.

Mobility
Driverless cars and drones are emerging application domains where fog computing plays a key role (Chiang and Zhang 2016). Both these platforms have many onboard sensors and real-time analytics for autonomous mobility. Vehicular networks allow connected cars to communicate with each other (V2V) to cooperatively share information to make decisions on traffic and road conditions (Hou et al. 2016). These can be extended to share compute capacity to perform these analytics. This allows a collection of parked or slow-moving vehicles to form a fog of resources among proximate ones. These may complement occasional roadside units that offer connectivity with the cloud. Here, the entities forming the fog can change dynamically, which requires distributed resource coordination.
Drones are finding use in asset monitoring, such as inspecting gas pipelines and power transmission towers at remote locations (Loke 2015). Here, a mobile base station has a digital control tether with a collection of drones that follow a preset path to collect observational data. The drones have limited compute capacity to increase their endurance and need to prioritize its use for autonomous navigation. So they typically serve as "data mules," collecting data from sensors along the route but doing limited onboard processing (Misra et al. 2015). However, when situations of interest arise, they can make use of their local capacity, resources available with nearby drones, or fog resources in the base station, while the trip is ongoing. This can help decide if an additional flypast is necessary.

Research Directions

Fog computing is still exploratory in nature. While it has gained recent attention from researchers, many more topics need to be examined further.

Fog Architectures. Many of the proposed fog architectural designs have not seen large-scale deployments and are just plausible proposals. While city-scale deployments with 100s of fog devices are coming online, the experiences from their operations will inform future designs (Yannuzzi et al. 2017). Network management is likely to be a key technical challenge as traffic management within the metropolitan area network (MAN) gains importance (Vaquero and Rodero-Merino 2014). Unlike static fog resources, mobile or ad hoc resources such as vehicular fog will pose challenges of resource discovery, access, and coordination (Hou et al. 2016; He et al. In Press). Resource churn will need to be handled through intelligent scheduling (Garcia Lopez et al. 2015). Resources will also require robust adaptation mechanisms based on the situational context (Preden et al. 2015). Open standards will be required to ensure interoperability. To this end, there are initial efforts on defining reference models for fog computing (Byers and Swanson 2017).

Data Management. Tracking and managing content across edge, fog, and cloud will be a key challenge. Part of this is the result of devices in IoT acting as data sources and compute
platforms, which necessitates coordination across the cyber and physical worlds (Anand et al. 2017). The generation of event streams that are transient and need to be processed in a timely manner poses additional challenges to the velocity dimension of Big Data (Naas et al. 2017). Data discovery, replication, placement, and persistence will need careful examination in the context of wide area networks and transient computing resources. Sensing will need to be complemented with "sensemaking" so that data is interpreted correctly by integrating multiple sources (Preden et al. 2015).

Programming Models and Platforms. Despite the growth of edge and fog resources, there is a lack of a common programming abstraction or runtime environment for defining and executing distributed applications on these resources (Stoica et al. 2017). There has been some preliminary work in defining a hierarchical pattern for composing applications that generate data from the edge and need to incrementally aggregate and process them at the fog and cloud layers (Hong et al. 2013). They use spatial partitioning to assign edge devices to fogs. The ECHO platform offers a dataflow model to compose applications that are then scheduled on distributed runtime engines that are present on edge, fog, and cloud resources (Ravindra et al. 2017). It supports diverse execution engines such as NiFi, Storm, and TensorFlow, but the user couples the tasks to a specific engine. A declarative application specification and Big Data platform are necessary to ease the composition of applications in such complex environments.
Related to this are application deployment and resource scheduling. VMs are used in the cloud to configure the required environment, but they may prove to be too resource intensive for the fog. Some have examined the use of just a subset of the VM's footprint on the fog and migrating this image across resources to track the mobility of the user(s) who access its services (Bittencourt et al. 2015). Resource scheduling on edge, fog, and cloud has also been explored, though often validated just through simulations due to the lack of access to large-scale fog setups (Gupta et al. 2017; Zeng et al. 2017). Spatial awareness and energy awareness are distinctive features that have been included in such schedulers (Brogi and Forti 2017; Ghosh and Simmhan In Press). Formal modeling of the fog has been undertaken as well (Sarkar and Misra 2016). Quality of Experience (QoE) as a user-centric alternative metric to Quality of Service (QoS) is also being examined (Aazam et al. 2016).
Such research will need to be revisited as the architectural and application models for fog computing become clearer, with the mobility, availability, and energy usage of resources offering unique challenges (Shi et al. 2012).

Security, Privacy, Trust. Unlike cloud computing, where there is a degree of trust in the service provider, fog computing may contain resources from diverse and ad hoc providers. Further, fog devices may not be physically secured like a data center and may be accessible by third parties (Chiang and Zhang 2016). Containerization does not offer the same degree of sandboxing between multiple tenant applications that virtualization does. Hence, data and applications in the fog operate within a mix of trusted and untrusted zones (Garcia Lopez et al. 2015). This requires constant supervision of the device, fabric, applications, and data by multiple stakeholders to ensure that security and privacy are not compromised. Techniques like anomaly detection, intrusion detection, moving target defense, etc. will need to be employed. Credential and identity management will be required. Provenance and auditing mechanisms will prove essential as well. These will need to be considered as first-class features when designing the fog deployment or the application.

Cross-References

Big Data Analysis and IOT
Big Data in Mobile Networks
Big Data in Smart Cities
Cloud Computing for Big Data Analysis
Mobile Big Data: Foundations, State of the Art, and Future Directions
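The fog-side data reduction discussed in this entry, where observations are buffered, filtered, or aggregated so that only a necessary subset reaches the cloud, can be sketched as a minimal example. The window size, threshold, and field names below are illustrative assumptions, not part of any platform cited above.

```python
from statistics import mean

def fog_window_summary(readings, threshold, window=60):
    """Aggregate a window of edge sensor readings at the fog layer.

    Returns a compact summary for the cloud plus only the anomalous
    raw readings, instead of shipping every observation upstream.
    """
    window_readings = readings[-window:]  # most recent window only
    summary = {
        "count": len(window_readings),
        "mean": mean(window_readings),
        "max": max(window_readings),
    }
    # Forward raw values only when they exceed the anomaly threshold
    anomalies = [r for r in window_readings if r > threshold]
    return summary, anomalies

# Example: 60 noise-level samples; only the loud events leave the fog
samples = [40] * 58 + [95, 97]
summary, anomalies = fog_window_summary(samples, threshold=90)
```

In this sketch the cloud receives a three-field summary and two raw readings rather than all sixty samples, mirroring the bandwidth-shaping role the entry ascribes to the fog.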
Misra P, Simmhan Y, Warrior J (2015) Towards a practical architecture for internet of things: an India-centric view. IEEE Internet of Things (IoT) Newsletter
Naas MI, Parvedy PR, Boukhobza J, Lemarchand L (2017) iFogStor: an IoT data placement strategy for fog infrastructure. In: IEEE international conference on fog and edge computing (ICFEC)
Nelson DL, Bell CG (1986) The evolution of workstations. IEEE Circuits and Devices Mag 2(4):12–16
Pradhan S, Dubey A, Khare S, Nannapaneni S, Gokhale A, Mahadevan S, Schmidt DC, Lehofer M (In Press) Chariot: goal-driven orchestration middleware for resilient IoT systems. ACM Trans Cyber-Phys Syst
Preden JS, Tammemae K, Jantsch A, Leier M, Riid A, Calis E (2015) The benefits of self-awareness and attention in fog and mist computing. IEEE Comput 48(7):37–45
Ravindra P, Khochare A, Reddy SP, Sharma S, Varshney P, Simmhan Y (2017) Echo: an adaptive orchestration platform for hybrid dataflows across cloud and edge. In: International conference on service-oriented computing (ICSOC)
Sadeghi K, Banerjee A, Sohankar J, Gupta SK (2016) Optimization of brain mobile interface applications using IoT. In: IEEE international conference on high performance computing (HiPC)
Sarkar S, Misra S (2016) Theoretical modelling of fog computing: a green computing paradigm to support IoT applications. IET Netw 5(2):23–29
Satyanarayanan M, Bahl P, Caceres R, Davies N (2009) The case for VM-based cloudlets in mobile computing. IEEE Pervasive Comput 8(4):1536–1268
Satyanarayanan M, Simoens P, Xiao Y, Pillai P, Chen Z, Ha K, Hu W, Amos B (2015) Edge analytics in the Internet of things. IEEE Pervasive Comput 14(2):1536–1268
Shi C, Lakafosis V, Ammar MH, Zegura EW (2012) Serendipity: enabling remote computing among intermittently connected mobile devices. In: ACM international symposium on mobile ad hoc networking and computing (MobiHoc)
Shi W, Cao J, Zhang Q, Li Y, Xu L (2016) Edge computing: vision and challenges. IEEE Internet Things J 3(5):2327–4662
Shukla A, Chaturvedi S, Simmhan Y (2017) RIoTBench: an IoT benchmark for distributed stream processing systems. Concurr Comput: Pract Exp 29:e4257
Simmhan Y (2017) IoT analytics across edge and cloud platforms. IEEE IoT Newsletter. https://iot.ieee.org/newsletter/may-2017/iot-analytics-across-edge-and-cloud-platforms.html
Simmhan Y, Aman S, Kumbhare A, Liu R, Stevens S, Zhou Q, Prasanna V (2013) Cloud-based software platform for big data analytics in smart grids. Comput Sci Eng 15(4):38–47
Sinha A (1992) Client-server computing. Commun ACM 35(7):77–98
Stoica I, Song D, Popa RA, Patterson DA, Mahoney MW, Katz RH, Joseph AD, Jordan M, Hellerstein JM, Gonzalez J et al (2017) A Berkeley view of systems challenges for AI. Technical report, University of California, Berkeley
Vaquero LM, Rodero-Merino L (2014) Finding your way in the fog: towards a comprehensive definition of fog computing. ACM SIGCOMM Comput Commun Rev 44(5):27–32
Varshney P, Simmhan Y (2017) Demystifying fog computing: characterizing architectures, applications and abstractions. In: IEEE international conference on fog and edge computing (ICFEC)
Yannuzzi M, van Lingen F, Jain A, Parellada OL, Flores MM, Carrera D, Perez JL, Montero D, Chacin P, Corsaro A, Olive A (2017) A new era for cities with fog computing. IEEE Internet Comput 21(2):54–67
Yi S, Li C, Li Q (2015) A survey of fog computing: concepts, applications and issues. In: Workshop on mobile big data
Zeng X, Garg SK, Strazdins P, Jayaraman PP, Georgakopoulos D, Ranjan R (2017) IOTSim: a simulator for analysing IoT applications. J Syst Archit 72:93–107
Zhou Q, Simmhan Y, Prasanna VK (2017) Knowledge-infused and consistent complex event processing over real-time and persistent streams. Futur Gener Comput Syst 76:391–406

Big Data and Privacy Issues for Connected Vehicles in Intelligent Transportation Systems

Adnan Mahmood1,2,3, Hushairi Zen2, and Shadi M. S. Hilles3
1 Telecommunication Software and Systems Group, Waterford Institute of Technology, Waterford, Ireland
2 Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia
3 Al-Madinah International University, Shah Alam, Selangor, Malaysia

Definitions

Intelligent Transportation System is regarded as a smart application encompassing the promising features of sensing, analysis, control, and communication technologies in order to improve
the safety, reliability, mobility, and efficiency of ground transportation.
Big Data is an evolving and emerging terminology describing a voluminous amount (terabytes, petabytes, exabytes, or even zettabytes) of structured, unstructured, and semi-structured data which could be mined for information.

Overview

The evolution of Big Data in the large-scale Internet-of-Vehicles has brought forward unprecedented opportunities for a unified management of the transportation sector, and for devising smart Intelligent Transportation Systems. Nevertheless, such frequent heterogeneous data collection between the vehicles and numerous application platforms via diverse radio access technologies has led to a number of security and privacy attacks, and accordingly demands a 'secure data collection' in such architectures. In this respect, this chapter is primarily an effort to highlight the said challenge to the readers, and to subsequently propose some security requirements and a basic system model for secure Big Data collection in the Internet-of-Vehicles. Open research challenges and future directions have also been deliberated.

Introduction

Over the past few decades, a rapid increase in the number of vehicles (both passenger and commercial) across the globe has pushed the present-day transportation sector to its limits. This has thus caused the transportation systems to become highly ineffective and relatively expensive to maintain and upgrade over time (Contreras et al. 2017). According to one recent estimate (British Petroleum 2017), the number of vehicles on the road has already surpassed 0.9 billion and is likely to double by the end of the year 2035. This massive increase in the number of vehicles has not only led to extreme traffic congestion in dense urban environments, which hinders economic growth in several ways, but also results in a number of road fatalities. The World Health Organization (2017) estimates that around 1.25 million people across the world die each year and millions more get injured as a result of road accidents, with nearly half of them being vulnerable road users, i.e., pedestrians, roller skaters, and motorcyclists. Approximately 90% of these road accidents occur in low- and middle-income economies as a result of their poor infrastructures. From a network management perspective, this results in serious deterioration of the quality of service for numerous safety-critical and non-safety (i.e., infotainment) applications and in degradation of the quality of experience of vehicular users. Thus, there exists significant room for improvement in the existing transportation systems in terms of enhancing safety, non-safety (i.e., infotainment), and efficiency aspects; if properly addressed, this would ultimately set the tone for highly efficacious intelligent transportation systems indispensable for realizing the vision of connected vehicles.
The paradigm of vehicular communication has been studied and thoroughly analyzed over the past two decades (Ahmed et al. 2017). It has evolved from traditional vehicle-to-infrastructure (V2I) communication to more recent vehicle-to-vehicle (V2V) communication and vehicle-to-pedestrian (V2P) communication, thus laying the foundations for the futuristic notion of vehicle-to-everything (V2X) communication (Seo et al. 2016). While cellular networks have been mostly relied upon for V2V communication due to their extended communication range (i.e., theoretically up to 100 km) and extremely high data rates, their high cost and low reliability in terms of guaranteeing stringent delay requirements do not make them an ideal mode of communication in highly dynamic networks. On the contrary, the evolution of dedicated short-range communication (DSRC) as a two-way short-to-medium range wireless communication for safety-based V2V applications has been a subject of attention in recent years for engineers and scientists as a result of its efficacious real-time information exchange between vehicles. While a number of challenges still hinder the effective deployment
Big Data and Privacy Issues for Connected Vehicles in Intelligent Transportation Systems, Fig. 1
Heterogeneous vehicular networking architecture
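As a purely illustrative sketch of the heterogeneity shown in Fig. 1, the choice between a DSRC link and a cellular link can be framed as matching an application's latency and range needs against each technology's profile. The numeric profiles below are rough assumptions for demonstration only, not measured characteristics of either technology.

```python
# Illustrative radio-access-technology (RAT) profiles; the numbers are
# rough assumptions for demonstration, not measured characteristics.
RAT_PROFILES = {
    "DSRC":     {"range_m": 1_000,   "latency_ms": 50,  "cost": "low"},
    "cellular": {"range_m": 100_000, "latency_ms": 200, "cost": "high"},
}

def pick_rat(distance_m, latency_budget_ms):
    """Pick the cheapest RAT whose range and latency satisfy the request."""
    feasible = [
        name for name, p in RAT_PROFILES.items()
        if p["range_m"] >= distance_m and p["latency_ms"] <= latency_budget_ms
    ]
    # Prefer the low-cost link (DSRC) when both qualify
    feasible.sort(key=lambda name: RAT_PROFILES[name]["cost"] != "low")
    return feasible[0] if feasible else None

# A nearby safety message with a tight deadline maps to DSRC, while a
# distant infotainment request falls back to the cellular link.
```

A real heterogeneous vehicular network would of course weigh many more factors (reliability, load, handover cost), but the sketch captures the requirement-matching idea behind combining several RATs.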
beyond fifth-generation (beyond 5G) wireless networking technologies (Mumtaz et al. 2017).
However, none of these aforementioned technologies possesses the potential to fully realize the breadth of vehicular safety and non-safety applications simultaneously, especially when their requirements are in conflict with one another. The ideal approach thus is to ensure a synergy of several RATs so as to provide an appropriate heterogeneous vehicular networking platform in order to meet the stringent communication requirements. Heterogeneity is an important and timely topic but beyond the scope of this work. Since a vehicular network encompasses numerous heterogeneous data sources (i.e., not only from heterogeneous RATs but also from diverse sorts of sensing equipment), the amount of data generated is not only huge but also poses a significant challenge to the network's security and stability. Therefore, this chapter, in contrast to other commonly published surveys and chapters, primarily focuses on secure Big Data collection in vehicular networks (interchangeably referred to as the Internet of Vehicles in this chapter).

Toward Big Data Collection in Vehicular Networks

In the near future, several substantial changes are anticipated in the transportation industry, especially in light of the emerging paradigms of the Internet of Things (IoT), cloud computing, edge and/or fog computing, software-defined networking, and more recently Named Data Networking, among many others. Such promising notions have undoubtedly strengthened and contributed to the industrial efforts of devising intelligently connected vehicles for developing safe and convenient road environments. Furthermore, with the rapid evolution of high-speed mobile Internet access and the demand for seamless ubiquitous communication at a much more affordable price, new vehicular applications/services are emerging. The European Commission estimates that around 50–100 billion smart devices would be directly connected to the Internet by the end of the year 2019, and approximately 507.5 ZB/year of data would be produced due to the same (Sun and Ansari 2016). Moreover, today's smart vehicles have emerged as a multisensor platform, as the usual number of intelligent sensors installed on a vehicle is typically around 100, and this number is anticipated to increase to 200 by the year 2020 (Choi et al. 2016). The data streams generated via these sensors are not only high in volume but also possess a fast velocity due to the highly dynamic nature of vehicular networks. Moreover, information generated from numerous social networking platforms (i.e., Weibo, Facebook, Twitter, WhatsApp, WeChat) is regularly accessed by the vehicular users and generates a lot of traffic data in real time. This data is spatiotemporal in nature due to its dependence on the time and geographical location of vehicles. Also, since the trajectory of vehicles is usually reliant on the road distribution across a huge geographical region, the collected information is also heterogeneous, multimodal, and multidimensional and varies in its quality. This context is pushing the concept of traditional vehicular ad hoc networks to the large-scale Internet of Vehicles (IoVs), and all of this collected data converges as Big Data in vehicular networks, which ultimately is passed via the core networks to both regional and centralized clouds.
Nevertheless, today's Internet architecture is not yet scalable and is quite inefficient in handling such a massive amount of IoV Big Data. Also, transferring such data is relatively time-consuming and expensive and requires a massive amount of bandwidth and energy. Moreover, real-time processing of this collected information is indispensable for a number of safety-critical applications and hence demands a highly efficient data processing architecture with a lot of computational power. As vehicular networks are highly distributed in nature, distributed edge-based processing is recommended over traditional centralized architectures. However, such distributed architectures mostly have limitations of compute and storage and are therefore unable to cache and process a huge amount of information. Last but not least, it is quite important to design a secure mechanism which ensures that the collection of IoV Big Data is trusted and not tampered with. There is a huge risk of fraudulent
messages injected by a malicious vehicle that could easily endanger the whole traffic system or could potentially employ the entire network to pursue any dangerous activity for its own wicked benefits. Thus, how to effectively secure Big Data collection in IoVs deserves research. In this context, this chapter is of high importance so as to bring forth the significance of such challenges for the attention of the academic and industrial research community.

Security Requirements and System Model for Secure Big Data Collection in IoVs

Big Data primarily has three main characteristics, (a) volume, (b) variety, and (c) velocity, also referred to as the 3Vs (Guo et al. 2017). Vehicles usually collect a massive amount of data from diverse geographical areas and with various heterogeneous attributes, thus ultimately converging to IoV Big Data with variation in its size, volume, and dimensionality. The analytics of such Big Data can help network operators to optimize the overall resource planning of next-generation vehicular networks. It would also facilitate the national transportation agencies to critically analyze and subsequently mitigate dense traffic problems in an efficient manner, in turn making the lives of millions of people comfortable and convenient. These are some of the potential advantages that have lately encouraged vehicle manufacturers to build large-scale Big Data platforms for intelligent transportation systems. However, all of this data analytics and subsequent decision making becomes impractical if any malicious vehicle is able to inject data into the IoV Big Data stream, which would have severe implications for both safety and non-safety vehicular applications. In terms of safety applications, any maliciously fed information may lead to false trajectory predictions and wrong estimation of a vehicle's neighboring information in the near future, which could prove quite fatal for passengers traveling in both semi-autonomous and fully autonomous vehicles. In terms of non-safety applications, this could not only lead to a considerable delay in the requested services but also expose a user's private data to extreme vulnerabilities (Singh et al. 2016; Zheng et al. 2015). Thus, any secure IoV information collection scheme should adhere to the following requirements so as to guarantee secure Big Data collection:

Data Authentication – to verify the identities of the vehicles or vehicular clouds (i.e., vehicles with similar objectives and in vicinity of one another form vehicular clouds for resource-sharing purposes)
Data Integrity – to ensure that the transmitted data has been delivered correctly without any modification or destruction, as this would ultimately become part of the Big Data stream
Applications/Services Access Control – to warrant that each respective vehicle has access only to the applications/services it is entitled to
Data Confidentiality – to guarantee a secure communication among vehicles participating in a networking environment
Data Non-repudiation – to ensure that any particular vehicle could not deny the authenticity of another vehicle
Anti-jamming – to prevent intrusion attempts from malicious vehicles which may partially or completely choke the entire network

An ideal approach to tackle these requirements would be via a hierarchical edge-computing architecture as depicted in Fig. 2. Each vehicle (or a vehicular cluster), along with their respective vehicular users, is associated with their private proxy VMs situated in the edge node, which not only collects the (raw) data streams from their registered network entities via their respective base stations or access points but also classifies them into various groups depending on the type of data and subsequently generates metadata which is passed to regional/centralized IoV Big Data centers. The metadata encompasses certain valuable information, i.e., geographical locations of vehicles and corresponding timestamps, the type of vehicular applications accessed or infotainment services and contents requested, and a pattern recognition identifier that checks for any irregularities in the data stream against various
Data Plane (Vehicles, Vehicular Clouds, Base Stations and Access Points of diverse Radio Access Technologies – Heterogeneous Networks)
prescribed network operator policies and also by comparing it with previously known patterns. In this way, suspicious content could be filtered out at the edge without passing it to the regional/centralized storage. The metadata only contains valuable information without violating a user's privacy. Also, as several vehicles and vehicular users request similar types of applications/services and pass identical information about their surroundings (i.e., provided they are in vicinity of one another), intelligent content similarity schemes should be employed in order to avoid any duplication that may lead to excessive computational overheads both at the edge and at the regional/centralized cloud level.
Thus, in this way, not only can the computational overheads be considerably reduced, but the amount of IoV Big Data that needs to be cached can also be significantly minimized. It is therefore pertinent to mention that both instantaneous and historical data is of critical importance for intelligent decision making in IoV architectures. Nevertheless, it is also not practically possible to store all of the historical IoV Big Data for an infinite time due to storage issues globally. Thus, it is of extreme importance to not only secure the collection of IoV Big Data but to also ensure that the said data streams are not excessively cached but disposed of after a certain time. Our envisaged architecture hence proposes to structure the raw data streams into valuable metadata, which not only consumes fewer resources in making a requisite decision but also takes less storage space in contrast to raw data streams.

Open Research Challenges and Future Directions

Unlike traditional IoT entities, which are generally static in nature, IoV is a highly dynamic organism, and this feature makes it quite difficult to effectively deploy the next-generation intelligent transportation system platforms, at least in their current form. Since vehicles usually traverse at quite high speeds, their neighborhood also changes dynamically. Thus, it is challenging to accurately apprehend a vehicle's current and anticipated surrounding neighborhood on a run-time basis. A major problem is that by the time the IoV information has been captured, analyzed, and presented to network operators or interested vehicles/vehicular users, the real-time situation has already changed, and the IoV data recently presented has become obsolete. Thus, IoV standalone might not make much sense in the long run; rather, it has to be further optimized with an emerging yet promising paradigm of
Big Data and Recommendation

Giancarlo Sperlí
Department of Electrical Engineering and Information Technology, University of Naples "Federico II", Naples, Italy

Synonyms

Data management; Information filtering; Knowledge discovery

Definitions

Big data is a field concerning the storage, management, and analysis of large and heterogeneous amounts of data, which require the development of new methodologies and technologies for their management and processing. Thus, as defined by the McKinsey Global Institute (Manyika and Chui 2011), "Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."

In the last years, people have been able to produce and share digital data (such as text, audio, images, and video) at an unprecedented rate, due to the continuous progress in consumer electronics and to the widespread availability of Internet access. To facilitate browsing of large multimedia repositories and to provide useful recommendations, several algorithms and tools have been proposed. Such tools, usually referred to as recommender systems, collect and analyze usage data in order to identify user interests and preferences, which allows them to suggest items that meet users' needs.
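The usage-data loop just described (collect ratings, identify users with similar interests, suggest items they liked) can be sketched minimally in Python. The toy data, the cosine similarity measure, and all names below are illustrative choices, not taken from any particular system:

```python
from math import sqrt

# Toy usage data: user -> {item: rating}. Purely illustrative.
ratings = {
    "alice": {"movie_a": 5, "movie_b": 3, "movie_c": 4},
    "bob":   {"movie_a": 4, "movie_b": 3, "movie_d": 5},
    "carol": {"movie_b": 2, "movie_c": 5, "movie_d": 1},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(u[i] ** 2 for i in common)) * sqrt(sum(v[i] ** 2 for i in common))
    return num / den

def recommend(user, k=2):
    """Suggest unseen items, weighted by the ratings of similar users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # -> ['movie_d']
```

This is the memory-based, nearest-neighbor flavor of collaborative filtering discussed below; production systems replace the dictionaries with sparse rating matrices and precomputed similarity indexes.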
Moreover, they can use other attributes, such as viewed items, demographic data, subject interests, and favorite artists. Recommender systems automatically identify relevant information for a given user, learning from available data (user actions, user profiles). In particular, a recommender system helps users identify the most interesting items, supporting people faced with too many choices by analyzing the behavior of people who have similar interests.

In a traditional recommender system model, there are two types of entities: "user" and "item". Users (often customers) have associated metadata (or content) such as age, gender, race, and other demographic information. Items (often products) also have metadata such as text description, price, weight, etc. Different interactions can occur between these two types of entities: user A downloads/purchases movie B, user X gives a rating of 5 to product Y, and so on.

The latest-generation architecture of recommender systems is composed of one or more of the following stages (Ricci et al. 2011b; Colace et al. 2015): when a given user i is browsing a particular item collection, the system initially determines a set of useful candidate items O_i^c ⊆ O based on user preferences and needs (pre-filtering stage); subsequently, it assigns a score (or rank) r_{i,j} to each item o_j in O_i^c, computed by analyzing the characteristics of each item (feedbacks, past behaviors, and so on) (ranking stage); finally, it selects a best-suited subset of the items (post-filtering stage).

Recommender system approaches are usually classified into the following three categories:

• Collaborative filtering: these systems rely solely on users' preferences to make recommendations. Until a new item is rated by an appropriate number of users, the recommender system is not able to provide meaningful recommendations. This problem is known as the cold-start problem, due to an initial lack of ratings. This approach is divided into (Fig. 1):
  – Memory-based approaches identify interesting items from other similar users' opinions by performing a nearest-neighbor search on a rating matrix. These heuristics make rating predictions based on the entire collection of items previously rated by the users.
  – Model-based approaches use the collection of ratings to learn a model, which is then used to make rating predictions.
• Content-based filtering approaches provide their suggestions based on the content of items, described as a set of tags, and on the user profiles. The user may be recommended items similar to the ones the user preferred in the past. These approaches have some limitations; indeed, the tags are explicitly associated with the items that the systems recommend. Another limitation is that two different items are indistinguishable if they are represented by the same tags. Furthermore, for a new user with few ratings, the system would not be able to produce accurate recommendations (Fig. 2).
• Hybrid filtering approaches combine both content-based filtering and collaborative filtering to produce a recommendation. It is possible to classify hybrid recommender systems into the following four categories: (i) approaches that combine the predictions of collaborative and content-based methods; (ii) approaches that integrate some features of content-based filtering into a collaborative approach; (iii) approaches that include some collaborative characteristics in a content-based approach; and (iv) approaches that integrate the main characteristics of both models.

Recommender systems can also exploit past actions to propose additional products or services that enhance customer satisfaction. An approach that models recommendation as a social choice is proposed by Albanese et al. (2013). In particular, the authors combine intrinsic features of multimedia items, past behaviors of users, and the overall behavior of the entire community of users. Moreover, in Zhao et al. (2016), the authors propose a social recommendation framework based on a unified preference learning process that integrates item content features with collaborative user-item relationships. Finally, Bartolini et al. (2016) propose a recommendation framework that manages heterogeneous data sources to provide context-aware recommendation techniques supporting intelligent multimedia services for the users.

Recommender Systems and Online Social Networks
Big data in social networks (big social data) have been used to customize recommender systems in order to provide appropriate suggestions to users according to their needs and preferences. Many e-commerce sites also use social recommendation systems that infer similarity between users' purchase behaviors to recommend products, which can improve the recommendation process in social networks. Such recommendation is based on social network information and refers to trust and friendship among users (social relationships resulting from common interests among users). In particular, it is necessary to incorporate multiple attributes of products and users into the recommendation model. Furthermore, this makes it possible to consider new features for computing the similarity between two users. These features have three main properties:

1. Contents generated by users refer to all contents that are created by the users themselves and their meta-information.
2. Relationship information constitutes a social circle representing directly linked or connected relationships between users.
3. Interaction information refers to messages or contents exchanged between users.

Leveraging the information contained in OSNs, Qian et al. (2014) propose a unified personalized recommendation model based on probabilistic matrix factorization that combines three social factors: personal interest, interpersonal interest similarity, and interpersonal influence. The last two factors can enhance the intrinsic links among features in the latent space to deal with the cold-start problem. In Sun et al. (2015), the authors propose a recommender system based on social networks, called RSboSN, which integrates the social network graph and the user-item matrix to improve the prediction accuracy of traditional recommender systems. In particular, a clustering algorithm for the user-item-tag matrix is used to identify the most suitable friends for realistic recommendation tasks. Moreover, Colace et al. (2015) propose a collaborative user-centered recommendation approach that combines preferences (usually in the shape of items metadata), opinions (textual comments to which it is possible to associate a sentiment), behavior (in the majority of cases, logs of past item observations made by users), and feedbacks (usually expressed in the form of ratings) with item features and context information to support different applications through proper customizations (e.g., recommendation of news, photos, movies, travels, etc.). In Amato et al. (2017), a recommender system for big data applications is proposed, based on a collaborative and user-centered approach. The recommendations are made based on the interactions among users and the multimedia contents generated in one or more social networks.

In the last years, different recommendation techniques have been proposed in the literature for supporting several kinds of applications in multimedia social networks, in which nodes can be users or multimedia objects (photos, videos, posts, and so on), and links between nodes can assume different meanings (publishing, endorsement, and so on). These systems exploit information extracted from these networks to improve the accuracy of recommendation, and they provide new types of suggestions; for instance, they can recommend not only items but also groups, friends, events, tags, etc. to users, through algorithms that can be classified into the following three classes.

The first class is composed of algorithms leveraging a graph as the data model, in which nodes and edges represent, respectively, single social entities and the relationships among them. Jiang et al. (2015b) propose a modified version of random walk on a star-structured heterogeneous network, a particular network centered on the social domain which is connected with other domains (i.e., video, Web posts, etc.). The aim of this approach is to predict user-item links in a target domain by leveraging the items in auxiliary domains.

Algorithms that try to learn user and item preferences through factorization of the related rating matrices compose the second class. Jamali and Lakshmanan (2013) propose HeteroMF, a context-dependent matrix factorization model that jointly considers two types of latent factors: general latent factors for each entity type and context-dependent latent factors for each context in which entities are involved.

The third class is constituted by algorithms that model the interactions among users and items as "latent topics," improving the recommendation process with other types of information. To such purposes, Yuan et al. (2013) propose a probabilistic model to exploit the different types of information (such as spatial, temporal, and activity data) extracted from social networks to infer behavioral models for users. In turn, Pham et al. (2016) propose an algorithm able to measure the influence among different types of entities, with the final aim of achieving higher recommendation accuracy and understanding the role of each entity in the recommendation process.

Recommender Systems and Location-Based Social Networks

In addition to the described approaches, the rapid growth of cities has led to the development of recommendation systems based on points of interest (POIs). People, in everyday life, decide where to go according to their personal interests. Thus, POI recommendation helps users filter out uninteresting places and reduce their decision-making time. Furthermore, the continuous increase in the use of smartphones has led to the emergence of a new type of social network, called location-based social networks (LBSNs), whose main characteristic is to integrate the user's location into the well-known OSN capabilities. With the development of LBSNs (e.g., Foursquare, Gowalla, etc.), POI recommendation on LBSNs (Gao et al. 2015) has recently attracted much attention, as the latter provide an unprecedented opportunity to study human mobile behaviors for providing useful recommendations in the spatial, temporal, social, and content aspects. These systems allow users to check in at specific POIs with mobile devices and leave tips or share their experiences with other users. These facets result in a W4 (who, when, where, what) information layout, corresponding to four distinct information layers.

Content information on LBSNs can be related to a user's check-in action; in particular, there are three types of content information:

• POI properties related to a check-in action; for example, by observing a POI's description such as "vegetarian restaurant," it is possible to infer that the restaurant serves vegetarian food and that users who check in at this business object might be interested in a vegetarian diet.
• User interests related to user comments about a particular POI.
• Sentiment indications, also related to user comments on a given POI.

Technological development has made online social networks an important means of doing business, allowing firms to promote their products and services and users to provide feedback about their experiences. In Zhang et al. (2016), the authors propose an approach based on matrix factorization that exploits both the attributes of items and the social links of users to address the cold-start problem for new users. In particular, a kernel-based attribute-aware matrix factorization (KAMF) method is developed to discover the nonlinear interactions among attributes, users, and items. A recommender system that integrates community-contributed photos and social and location aspects is proposed by Jiang et al. (2016) for providing personalized travel sequence recommendation. In this system, similarities between users and route packages are used to rank famous routes, the top ones of which are optimized by analyzing the travel records of socially similar users. Moreover, Amato et al. (2014) propose a touristic context-aware recommendation system, based on the experience of previous users and on the personal preferences of tourists, which combines information gathered from several heterogeneous sources. Eventually, an author topic model-based collaborative filtering (ATCF) method, based on user preference topics (i.e., cultural, cityscape, or landmark) extracted from the geo-tag-constrained textual descriptions of photos, is proposed by Jiang et al. (2015a) to provide POI recommendation for social users.
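As a rough illustration of the second class of algorithms mentioned above, learning user and item preferences by factorizing a rating matrix, the following is a minimal stochastic gradient descent sketch in plain Python. The toy matrix, the number of latent factors, and the hyperparameters are arbitrary choices for illustration; this is not the formulation of HeteroMF or KAMF:

```python
import random

# Toy user-item rating matrix (0 = unknown rating). Purely illustrative.
R = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
]
K = 2                      # number of latent factors
steps, lr, reg = 3000, 0.01, 0.02

random.seed(0)
n_users, n_items = len(R), len(R[0])
# User factors P and item factors Q, initialized randomly.
P = [[random.random() for _ in range(K)] for _ in range(n_users)]
Q = [[random.random() for _ in range(K)] for _ in range(n_items)]

def predict(u, i):
    """Predicted rating is the dot product of user and item factors."""
    return sum(P[u][k] * Q[i][k] for k in range(K))

# Stochastic gradient descent on the observed entries only.
for _ in range(steps):
    for u in range(n_users):
        for i in range(n_items):
            if R[u][i] > 0:
                e = R[u][i] - predict(u, i)
                for k in range(K):
                    pu, qi = P[u][k], Q[i][k]
                    P[u][k] += lr * (e * qi - reg * pu)
                    Q[i][k] += lr * (e * pu - reg * qi)

# The learned factors now fill in a missing rating, e.g., user 0, item 2.
print(round(predict(0, 2), 2))
```

The social variants surveyed above extend exactly this objective, adding regularization terms that pull the latent factors of socially linked users toward each other.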
Sun Z, Han L, Huang W, Wang X, Zeng X, Wang M, Yan H (2015) Recommender systems based on social networks. J Syst Softw 99:109–119
Yuan Q, Cong G, Ma Z, Sun A, Thalmann NM (2013) Who, where, when and what: discover spatio-temporal topics for Twitter users. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 605–613
Zhang JD, Chow CY, Xu J (2016) Enabling kernel-based attribute-aware matrix factorization for rating prediction. IEEE Trans Knowl Data Eng 29(4):798–812
Zhao Z, Lu H, Cai D, He X, Zhuang Y (2016) User preference learning for online social recommendation. IEEE Trans Knowl Data Eng 28(9):2522–2534

Big Data Application in Manufacturing Industry

Amit Mitra¹ and Kamran Munir²
¹Bristol Business School, University of the West of England, Bristol, UK
²Computer Science and Creative Technologies, University of the West of England, Bristol, UK

Introduction

Today, data on the customers or clients of companies is one of the key resources that adds to the valuation of an organization. Often, data is generated not only by customers who are buying from organizations but increasingly through the feedback that they may share on various social media platforms about their buying experiences. So, traditional approaches in which data collection was mainly based on conducting surveys of customer experiences are increasingly becoming a thing of the past. On the one hand, people prefer to refrain from participating in surveys, as these may be time-consuming; on the other, the amount of time spent on social media platforms has been on the rise. Given such an environment, it is commonplace for organizations to collect data through the touch points that customers use when ordering or choosing products and services. Therefore, data generation is not limited to responses to survey questions but arises more from touch points and social media posts. This ever-growing and dynamic need to deal with a variety of formats far exceeds the capacities of traditional modelling approaches. Traditional modelling approaches tend to suit predictive analysis more than contexts where a great deal of data is being generated by every moment's activity.

Like every new technological innovation, the importance of big data has burgeoned with the use of different kinds of media for personal as well as business communication. In the wake of such growth of big data, there has also been skepticism as to what constitutes it (Sha and Carotti-Sha 2016). As mentioned above, the most common distinction by which big data can be characterized is that it cannot be stored using traditional data structures. Further, the quantity of big data keeps on increasing as action takes place in real time. So, when an aircraft engine runs, it generates real-time data over every moment that it is running. Such a dynamic dimension, as well as its profligacy, is something that distinguishes the nature of big data generation from standard data.

In tandem with this, over the last decade, there has been an exponential growth in the reliance on customer-generated data in every industry and every walk of life. In a world where instantaneous access and response to data directly influence the prospects of a company, sales of business intelligence software have far outstripped sales of traditional software. Just as human capacity is being replaced by software agents, there is a noticeable shift from reliance on stored data toward accessing and processing real-time data. Such real-time data is also generated as customers engage in consuming products, and sellers are eager to curate these experiences so that they can be mimicked to increase familiarity and thus the custom of the consumer. The latter in most situations is voluminous, and as it is generated through various sources, it requires nonstandard mechanisms of processing. So, traditional formats of storage and analysis using RDBMS are increasingly found to be wanting in their capacities (Gandomi and Haider 2015). According to Davenport (2014), big data is going to play a critical role in, among others, "every industry that uses machinery, every industry that has physical facilities, and every industry that
involves money." Obviously, the manufacturing industry is certainly going to be an environment that generates substantial amounts of data, as it is no exception to the characteristics mentioned by Davenport (ibidem). The manufacturing industry is likely to be critically impacted by the presence of a great deal of data that would need to be processed using NoSQL.

Data Characteristics
The diagram in Fig. 1 illustrates the voluminous nature of big data and the ensuing types of data structures that are compatible with it. The diagram illustrates the increasingly unstructured nature of the growth of big data.

Big Data Application in Manufacturing Industry, Fig. 1 Big data growth is increasingly unstructured. (Source: Adapted from EMC Education Services (2015))

From a management perspective, the advent of big data has made it impossible to think of businesses in the same way as in the past. Traditional approaches have used analytics to understand and fine-tune processes and to keep management informed while alerting them to anomalies (business intelligence is driven by the idea of "exception reporting"). In contrast, big data has flipped such an analytical orientation on its head. The central view of big data is that the world and the data that it describes are in a state of change and flux; those organizations that can recognize and react quickly have the upper hand in this space. The sought-after business and IT capabilities are discovery and agility, rather than stability (Davenport 2014). Table 1 below projects a few key differences between big data and traditional analytics.

The "primary purpose" dimension in Table 1 reveals how the entire industrial world is reorienting itself with regard to customer data. The traditional information management approach has been to use analytics to cater to better internal decision-making through the creation of reports and presentations that advise senior executives on internal decisions. In contrast, data scientists today are involved in data-driven products and services through the creation of customer-facing and customer-touching applications.

Manufacturing Industry Transaction Processing

Given that discovery and agility are the drivers that would enable manufacturing organizations to innovate and differentiate, data, or big data, would be at the heart of these endeavors. So, in the manufacturing industry, not only manufactured products but also manufacturing equipment would increasingly contain data-producing sensors. Devices that carry out, for instance, machining, welding, and robotic functions would report on their own performance and need for service. Being networked, most of these devices can be monitored and controlled from a central control hub.

Value addition in the manufacturing industry could also come about through connections with big data-fueled supply chains that would ensure supplies are available to support manufacturing while at the same time ensuring that only optimal amounts of goods are manufactured. Admittedly, discrete manufacturing companies have addressed this issue, whereas there is still a distance to go for process- or flow-based manufacturing, as in the oil and natural gas sectors.

Foresight in troubleshooting products going forward, along with the ability to outline expectations, is critical in the manufacturing industry. Evidently, such foresight is based on accessing and evaluating a great deal of data (Marr 2016). Some of this is user generated, and some of this
Big Data Application in Manufacturing Industry, Table 1 Big data and traditional analytics

Dimensions         Big data                            Traditional analytics
Type of data       Unstructured formats                Formatted in rows and columns
Volume of data     100 terabytes to petabytes          Tens of terabytes or less
Flow of data       Constant flow of data               Static pool of data
Analysis methods   Machine learning                    Hypothesis based
Primary purpose    Data-driven products and services   Internal decision support

Source: Adapted from Davenport (2014)
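The "flow of data" row in Table 1, a constant stream versus a static pool, can be illustrated with a toy aggregate: a batch computation rescans a stored data set, while a streaming computation updates a running value per event. The sensor readings below are made-up values:

```python
# Contrast between a "static pool" (batch) and a "constant flow"
# (streaming) computation of the same aggregate. Readings are made up.
readings = [21.0, 21.5, 22.1, 20.9, 23.4, 22.8]

# Batch style: the whole pool is stored and available; compute in one pass.
batch_mean = sum(readings) / len(readings)

# Streaming style: keep only a running count and mean, one event at a time,
# so nothing but two numbers ever needs to be stored.
count, mean = 0, 0.0
for x in readings:
    count += 1
    mean += (x - mean) / count   # incremental mean update

assert abs(mean - batch_mean) < 1e-9
print(round(mean, 3))  # -> 21.95
```

Both arrive at the same value; the difference is that the streaming version never needs the "static pool" at all, which is exactly why constant-flow workloads strain row-and-column storage.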
Big Data Application in Manufacturing Industry, Table 2 Types of NoSQL data stores

Key-value store – a simple data storage system that uses a key to access a value.
  Typical usage: image stores; key-based file systems; object cache; systems designed to scale.
  Examples: Berkeley DB, Memcached, Redis, Riak, DynamoDB.

Column family store – a sparse matrix system that uses a row and a column as keys.
  Typical usage: Web crawler results; big data problems that can relax consistency rules.
  Examples: Apache HBase, Apache Cassandra, Hypertable, Apache Accumulo.

Graph store – for relationship-intensive problems.
  Typical usage: social networks; fraud detection; relationship-heavy data.
  Examples: Neo4j, AllegroGraph, Bigdata (RDF data store), InfiniteGraph (Objectivity).

Document store – stores hierarchical data structures directly in the database.
  Typical usage: high-variability data; document search; integration hubs; Web content management; publishing.
  Examples: MongoDB (10gen), CouchDB, Couchbase, MarkLogic, eXist-db, Berkeley DB XML.

Source: Adapted from McCreary and Kelly (2013)
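As a sketch of the key-value pattern in the first row of Table 2 (an opaque value addressed only by its key, with no schema and no joins), the following minimal in-memory store is illustrative only; it is not the API of any of the products listed:

```python
# A minimal in-memory key-value store: values are opaque blobs or objects
# addressed only by key. Class name and keys are illustrative.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

# Typical cache-style usage: key-based lookup of heterogeneous values.
store = KeyValueStore()
store.put("user:42:avatar", b"\x89PNG...")       # an image blob
store.put("session:abc123", {"user_id": 42})     # a session object

print(store.get("session:abc123"))  # -> {'user_id': 42}
```

The store never inspects the value, which is what makes this model trivial to partition and scale, and equally what rules out the ad hoc cross-row queries that SQL provides.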
is generated by intelligent agents such as Internet of Things (IoT) nodes.

Analytics and Decision Support Requirements

As we all know, SQL is an acronym for Structured Query Language, which is used to query data stored in the rows and columns of databases. So, SQL is used to create, manipulate, and store data in traditional formats with fixed rows and columns. However, a San Francisco-based developer group interested in issues surrounding scalable open-source databases first coined the term NoSQL (McCreary and Kelly 2013). In essence, NoSQL may be defined as:

NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability, and agility. (McCreary and Kelly 2013, p. 4)

It is clear from Table 2 above that there is a range of areas where NoSQL may be deemed a more reliable tool for processing data in comparison to standard SQL. So, considering data sources, key-value stores accommodate a wide range of usage areas that are commonplace in the way data gets generated on different platforms.

Just as customer relationship management requires a completely different perspective on storing data about customers, NoSQL storage similarly envisages a change in the mindsets of data
handling. The diagram in Fig. 2 below projects this altered perspective.

Big Data Application in Manufacturing Industry, Fig. 2 Impact of NoSQL business drivers. (Source: Adapted from McCreary and Kelly (2013))

In Fig. 2, the different business drivers, viz., volume, velocity, variability, and agility, are likely to apply gradual pressure on standard single-CPU systems, which could eventually lead to cracks surfacing in the system's ability to handle increased processing expectations. In a real-world manufacturing scenario, large amounts of data on production, on materials handling, and on supply chains would all be arriving fairly rapidly. Volume and velocity refer to the quantity of such data and the speed at which it is generated, so there would be an increasing need to cope with these heightened requirements within restricted time horizons. Today's successful manufacturing companies need to swiftly access data from sources that are not necessarily reducible to rows and columns. So, there may be recorded conversations, Facebook posts, and video recordings that would need to be considered by the manufacturing systems to make sure that business intelligence is sufficiently factored into the manufacturing system, which can then cater to the continually changing expectations of the clientele. As a matter of fact, customer expectations change so quickly that a manufacturing company can very easily find itself forced out of an industry if a competitor's application of NoSQL is more effective and insightful and leads to a better

Organization Readiness and Affordability

Manufacturing organizations may not be able, technologically or with regard to existing human capacity, to modernize sufficiently quickly and integrate the large volumes of data that are produced within the industrial environment. So, there needs to be capacity in place to accept and process these data sets. The most appropriate approach would be to gradually phase in capacity to process big data. In the context of manufacturing data, organizations need to handle data using RDBMS, big data involving various formats, as well as hybrid data. Using NoSQL to process big data is undoubtedly the most appropriate option for manufacturing organizations.

As an example, the levels of cost reduction achieved through big data technologies are incomparable to other alternatives in the present context. As a matter of fact, MIPS (millions of instructions per second – that is, how fast a computer system crunches data) and terabyte storage for structured data are now most cheaply delivered through big data technologies like Hadoop clusters. For instance, for a small manufacturing company considering storage of 1 terabyte for a year, the cost was £28k for a traditional RDBMS, £3.9k for a data appliance, and only £1.6k for a Hadoop cluster – and the latter had the highest speed of processing data under most circumstances (Davenport 2014). Clearly, the cost-benefit trade-offs of not acquiring big data capabilities could potentially manifest in losing competitive advantage in the manufacturing industry.
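Using the figures quoted from Davenport (2014), the relative cost of storing 1 terabyte for a year can be worked out directly. This is a toy calculation; the prices are the article's example, not current market rates:

```python
# Cost of storing 1 TB for one year, in GBP, per Davenport (2014).
costs = {
    "traditional RDBMS": 28_000,
    "data appliance": 3_900,
    "Hadoop cluster": 1_600,
}

cheapest = min(costs, key=costs.get)
for name, c in costs.items():
    # Express each option as a multiple of the cheapest one.
    print(f"{name}: £{c:,} ({c / costs[cheapest]:.1f}x the cheapest)")
```

On these numbers the traditional RDBMS is 17.5 times the cost of the Hadoop cluster, which is the scale of difference driving the affordability argument above.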
Big Data Architectures 213
behavior, and how they interact with each other. Examples of the roles of components include batch processing (i.e., handling the volume aspect of big data), stream processing (handling velocity), data management (handling variety), and modules enabling security, processing across geographically distributed locations, and other specific functions. Software architectures are not concerned with how modules are realized (implemented); thus, it is possible that two different realizations of the same architecture are built with very different technology stacks. The specification of a software architecture helps in the definition of a blueprint containing the long-term vision for the future system and can help in directing development efforts and technical decisions during the system's implementation.

Overview

According to Len Bass, "The software architecture of a system is the set of structures needed to reason about the system, which comprise software elements, relations among them, and properties of both" (Bass 2013). Therefore, rather than being a detailed description of how a system is built, a software architecture describes the diverse building blocks of the system and the relationships between them, keeping aside implementation details that, even if changed, do not change the vision described in the architecture.

The software architecture of a system is an ideal means to communicate the key design principles of the system, system properties, and even the decision-making process that led the system architecture to attain a certain shape.

The more complex the system, the bigger the need for comprehensive documentation of its architecture. Thus, in the realm of big data, where there is a need to process large quantities of fast-arriving and diverse data on a system, it is important that a blueprint exists that documents which component of the system handles each of the aspects of big data, how these parts integrate, and how they work toward achieving their goals. This is the goal of big data architectures.

Over the years, several architectures were proposed that target the challenges of big data in different ways and that make different assumptions about how big data applications are structured and where the system is hosted. Although the term architecture is also used in the literature to describe implementation-specific approaches, we target here the most conceptual definition of architecture, independent of how it is realized. With this definition in mind, we review the most influential big data architectures in the next pages.

Lambda Architecture

The Lambda architecture (Marz and Warren 2015) is one of the first documented architectures for big data. It contains components that handle data volume and data velocity: it combines information from batch views and real-time views to generate the results of user queries.

Conceptually, the architecture is divided into three layers: the batch layer, the speed layer, and the serving layer (Marz and Warren 2015). The role of the batch layer is to generate batch views, which are a solution to the volume problem: the assumption in the architecture's design is that, because batch views are computed from large volumes of data, this computation is likely to run for hours (or even days). Meanwhile, there will still be data arriving in the system (velocity) and queries being made by users. Thus, the answer needs to be built with the last available batch view, which may be outdated by hours or days.

An interesting characteristic of batch views is that they are immutable once generated. The next batch view is generated from the combination of all the data available in the system at the time the previous batch view was generated, in addition to all the data that has arrived since that view was built (and that is available in the speed layer). Because batch views are continuously generated, problems with the data are corrected in the next batch view generated after the problem was detected and corrected, without requiring any other action by administrators.

To compensate for staleness in the batch layer, the speed layer processes data, via stream computing, that arrived since the latest batch view was
Big Data Architectures 215
In summary, BDAF is an architecture more as it is not design for the same level of distri-
concerned with data management aspects of big bution targeted by BDAF. However, BASIS has
data and thus complements well the Lambda a stronger emphasis in enabling analytics as a
architecture, which has a bigger emphasis on pro- service that can be composed by applications to
cessing capabilities with strong assertions on how provide specialized analytics to users, which is
data should be presented. However, its design the main differential of BASIS compared to the
is more complex to be able to address inherent other discussed architectures.
issues with its complex target infrastructure.
SOLID
BASIS
SOLID (Cuesta et al. 2013) is a big data archi-
BASIS (Costa and Santos 2016) is a big data tecture for clusters that has been developed with
architecture that has been designed with a specific the goal of prioritizing the variety dimension of
domain in mind, namely, smart cities. Neverthe- big data over volume and velocity. As a result,
less, the principles and design choices make it many architectural decisions are made that differ
suitable to general domains as well. significantly from the other approaches described
Given its target context of smart cities, it is so far.
assumed that there are several sources of data to The first key difference regards the expected
the system that needs to be transformed before data model. SOLID models data as graphs. The
being analyzed. This data enters the system via second key difference is that it mandates a spe-
processing modules that include not only batch cific data model, namely, RDF (Resource De-
and stream (handling volume and velocity, as scription Framework (Manola et al. 2004)) to be
in the previous architectures) but also specific used in the architecture.
modules for data integration (handling variety). Conceptually, SOLID is similar to the Lambda
Other necessary components are APIs supporting architecture with a few more components. The
ingestion of data from sensors and other particu- merge layer is responsible for batch processing
lar devices and connectors to databases (enabling (thus handling the volume aspect of big data).
data to be persistently stored). Similar to the case of Lambda architecture, it
In its storage layer, BASIS encompasses mod- generates immutable views from all the data
ules supporting different types of data storage: available until the moment the view is generated
file storage (called landing zone), distributed stor- and stores the result in a database.
age (for large volume), and three special-purpose The online layer is conceptually equivalent to
storage modules, handling operational data, ad- Lambda’s speed layer, handling big data’s veloc-
ministrative data, and data to be made publicly ity. It stores data received since the last dump to
available. the merge layer. The service layer receives user
On top of this processing and storage infras- queries and issues new queries to the online and
tructure, the architecture contains a layer sup- index layers to build the answer to the query, and
porting arbitrary big data analytics tasks, which thus this is equivalent to Lambda’s serving layer.
are accessed by applications that lay in the layer Notice that, different from the Lambda architec-
above, the topmost. ture’s serving layer, the online layer does not
An implementation of the architecture is also precompute views to be used to answer queries; it
presented by the authors to demonstrate the vi- builds the answers on demand based on received
ability of the architecture. In terms of capabil- requests.
ities, BASIS is very similar to BDAF, but also Besides the three layers that are equivalent
differentiates itself in some aspects. The first one to Lambda layers, two other layers are present
is that requirements for cross-site coordination in SOLID: the first extra layer is the index
and security are of smaller concern in BASIS, layer, which generates indices that speed up
Big Data Architectures 217
the access to the data items generated by the ceives access from brokers, which act on behalf
merge layer. The second one is the data layer, of users, to provide information about available
which stores data and metadata associated with sensors and corresponding data feeds.
the architecture, which is generated by the merge B
layer.
M3Data
differ in target domains, platforms, support for quirements of the system being designed, where
different features, and main features. It is evident it will be applied, what are the data sources, and
that there is no one-size-fits-all solution for big what transformations and queries will occur in
data architectures, and the answer to the question the data.
of which one is the best depends on individual Once the above is identified, an architect needs
requirements. to select the big data architecture whose design
and goals are closer to the system being con-
Examples of Applications ceived. During this stage, two situations can be
observed. In the first situation, it can be deter-
A key application of big data architectures re- mined that some functionalities are missing in
gards initial blueprints of big data solutions in a the candidate architecture. If this is the case,
company or specific domain. In early stages of new capabilities will need to be designed and
adoption, architects need to understand the re- incorporated to the final system.
Big Data Architectures 219
In the second case, the architect might identify minimum knowledge about the underlying
that some elements of the architecture antago- platform.
nize with needs/requirements of the system under
development. In this case, a careful assessment B
needs to be carried out to identify the risks and
costs involved in circumventing the antagonistic Cross-References
characteristics. Solutions to such situation can re-
Big Data Indexing
quire complex logic and brittle software systems
Cloud Computing for Big Data Analysis
that may fail in achieving their goals. Therefore,
Definition of Data Streams
architects should avoid selecting candidate archi-
ETL
tectures with antagonistic capabilities in relation
Graph Processing Frameworks
to the needs of the system.
Hadoop
Large-Scale Graph Processing System
Rendezvous Architectures
Streaming Microservices
Marz N, Warren J (2015) Big Data: principles and best based on learning data representations (Bengio
practices of scalable realtime data systems. Manning, et al. 2013). Giving that Hordri et al. (2017)
Stamford
Ramaswamy L, Lawson V, Gogineni SV (2013) Towards
have systematically reviewed that the features of
a quality-centric big data architecture for federated the Deep Learning are a hierarchical layer, high-
sensor services. In: Proceedings of the 2013 IEEE in- level abstraction, process a high volume of data,
ternational congress on Big Data (BigData Congress), universal model, and does not overfit training
IEEE, Santa Clara, 27 June–2 July 2013
Schnizler F, Liebig T, Marmor S et al (2014) Hetero-
data. Deep Learning can be applied to learn from
geneous stream processing for disaster detection and labeled data if it is available in sufficiently large
alarming. In: Proceedings of the 2014 IEEE inter- amounts; it is particularly interesting to learn
national conference on Big Data (Big Data), IEEE from large amounts of unlabeled/unsupervised
Computer Society, Washington, DC, 27–30 Oct 2014
data (Bengio et al. 2013; Bengio 2013), so it is
also interesting to extract meaningful represen-
tations and patterns from Big Data. Since data
keeps getting bigger, several Deep Learning tools
Big Data Availability have appeared to enable efficient development
and implementation of Deep Learning method in
Data Replication and Encoding Big Data (Bahrampour et al. 2015). Due to Deep
Learning having different approaches such as
convolutional neural network (CNN) and recur-
rent neural network (RNN), these Deep Learning
Big Data Benchmark tools are going to optimize different aspects of
the development of Deep Learning approaches.
TPC-DS The list of the Deep Learning tools includes but
not limited to Caffe, Torch7, Theano, DeepLearn-
ing4j, PyLearn, deepmat, and many more. How-
ever, in this paper we will only focus on three
Big Data Benchmarks tools such as Caffe, Torch7, and Theano since
most of the researchers are using these tools
Analytics Benchmarks according to the 2015 KDnuggets Software Poll.
Overview
Big Data Deep Learning Tools
Big Data generally refers to data that exceeds the
Nur Farhana Hordri1;2 , Siti Sophiayati typical storage, processing, and computing ca-
Yuhaniz1 , Siti Mariyam Shamsuddin2 , and pacity of conventional databases and data analy-
Nurulhuda Firdaus Mohd Azmi1 sis techniques. While Big Data is growing rapidly
1
Advanced Informatics School, Universiti in the digital world in different shapes and sizes,
Teknologi Malaysia, Kuala Lumpur, Malaysia there are constraints when managing and an-
2
UTM Big Data Centre, Universiti Teknologi alyzing the data using conventional tools and
Malaysia, Johor, Malaysia technologies. Most conventional learning meth-
ods use shallow-structured learning architectures.
Since the constraints are not an ordinary task,
Definition they require a collaboration of both developments
of advanced technologies and interdisciplinary
Deep Learning or also known as deep structured teams. Hence, Machine Learning techniques, to-
learning or hierarchical learning is a part of a gether with advances in available computational
broader family of Machine Learning methods power, have come to play an important role in Big
Big Data Deep Learning Tools 221
Data analytics and knowledge discovery (Chen obtained from the Internet for Google’s transla-
and Lin 2014). Today, Deep Learning becoming tor, Android’s voice recognition, Google’s street
a trend due to its techniques that use supervised view, and image search engine (Jones 2014).
and/or unsupervised strategies to automatically Microsoft also uses language translation in Bing B
learn hierarchical representations in deep archi- voice search. IBM joins forces to build a brain-
tectures for classification (Bengio and Bengio like computer by using Deep Learning to leverage
2000; Boureau and Cun 2008). Big Data for competitive advantage (Kirk 2013).
One of the main tasks associated with Big In summary, we could not find any related
Data Analytics is retrieving information (Na- works on reviewing Deep Learning tools in Big
tional Research Council 2013). Efficient storage Data analytics. Therefore, the purpose of the
and retrieval information are a growing problem paper is to review the current top-listed tools of
in Big Data, primarily because of the enormous Deep Learning in Big Data applications. Findings
quantities of data such as text, image, video, may assist other researchers in determining the
and audio being collected made available across potential tools that going to be used in their
multiple domains. Instead of using raw inputs study. The remainder of this paper is divided
for indexing data, Deep Learning can be used into following parts: in section “Tools of Deep
to generate high-level abstract data representa- Learning Big Data,” we summarized each of the
tions to be used for semantic indexing. These top three tools for Deep Learning. Moreover,
representations can expose complex associations we discussed the recent applications, languages,
and factors (especially when the raw input is environments, and libraries for each tool. This is
Big Data), which leads to semantic knowledge to ensure a good understanding on the concept of
and understanding. While Deep Learning in pro- the common Deep Learning tools for Big Data
viding an understanding of semantic and rela- that have been done over the years. Consequently,
tional of the data, a vector representation (corre- in section “Comparative Study of Deep Learn-
sponding to the extracted representations) from ing Tools,” we created a comparative table of
data instances would provide faster searching the properties of tools that have been discussed
and obtain information. With Deep Learning one in section “Tools of Deep Learning Big Data.”
can leverage unlabeled documents (unsupervised Finally, we have concluded this paper in the last
data) to have access to a much larger amount of section which is in section “Future Directions for
input data, using a smaller amount of supervised Research.”
data to improve the data representations and make
it more relevant in learning task and specific
conclusion. Tools of Deep Learning Big Data
Chen and Lin (2014) has discussed on how
Deep Learning has also been successfully applied This section describes the foundation of this
in industry products that take advantage of the review by discussing several of tools of Deep
large volume of digital data. Google, Apple, and Learning that have been applied with Big Data.
Facebook are one of the companies that collect Caffe. Convolutional Architecture for Fast
and analyze massive amounts of data everyday Feature Embedding or also known as Caffe is a
which aggressively make Deep Learning push Deep Learning framework made with expression,
forward to each related project. As in Efrati speed, and modularity in mind (Tokui et al.
(2013), by utilizing Deep Learning and data col- 2015). Caffe was originally developed at UC
lected by Apple services, the virtual personal Berkeley by Yangqing Jia which is open sourced
assistant in iPhone, Siri, has offered a wide va- under BSD license. The code is written in
riety of services including weather reports, sports clean, efficient CCC with CUDA used for GPU
news, answer to user’s questions, and reminder, computation and nearly complete well-supported
while Google uses Deep Learning algorithms bindings to Python/Numpy and MATLAB
for massive chunks of messy data which are (Jia et al. 2014; Tokui et al. 2015; Kokkinos
222 Big Data Deep Learning Tools
2015). Caffe becomes a preferable framework mately 12–14 h. The architecture is implemented
for research use due to the careful modularity in NVIDIA GTX780 and NVIDIA K40 GPUs.
of the code and the clean separation of network Also, in 2015, Kokkinos (2015) has evaluated
definition from actual implementation (Tokui the performance of Caffe, cuDNN, and CUDA-
et al. 2015). Several different types of Deep convnet2 using the layer configurations which are
Learning architecture that Caffe support as commonly used for benchmarking convolutional
mentioned in Caffe Tutorial (2018) are image performance and quoting average throughput for
classification and image segmentation. Caffe all these layers. This performance portability is
supports CNN, Regions with CNN (RCNN), long crossed GPU architecture with Tesla K40. Tokui
short-term memory (LSTM), and fully connected and co-workers (2015) have introduced Chainer,
neural network designs. a Python-based, stand-alone, open-source frame-
According to Jia et al. (2014), Caffe stored work for Deep Learning models (Caffe Tuto-
data and communicates data in four-dimensional rial 2018). Tokui claims that Chainer provides a
arrays called blobs. Blobs provide a unified mem- flexible, intuitive, and high-performance means
ory interface, holding batches of images, param- of implementing a full range of Deep Learning
eters, or parameter updates. The data is trans- models. He then compared Chainer with Caffe
ferred from one blob to another blob in the next using various CNNS. Based on the experiment,
layer. Based on Chetlur et al. (2014), in Caffe since Caffe is written in CCC and Chainer is
each type of model operation is encapsulated in written in Python, the total time required for
a layer. Layer development comprises declaring trial-and-error testing in implementing and tuning
and implementing a layer class, defining the layer new algorithms is likely to be similar for both
in the protocol buffer model scheme, extending frameworks. Still in 2015, Kokkinos combined a
the layer factory, and including tests. These layers careful design of the loss for boundary detection
take one or more blobs as input and yield one training, a multi-resolution architecture and train-
or more blobs as output. Each layer has two key ing with external data to improve an accuracy of
responsibilities for the operation of the network. the current state of the art (Chetlur et al. 2014).
There are (i) a forward pass that takes the inputs This detector is fully integrated into the Caffe
and produces the output and (ii) a backward pass framework and processes a 320 420 image in
that takes the gradients with respect to the param- less than a second on an NVIDIA Titan GPU.
eters and to the inputs, which are in turn back- Torch7. Collobert et al. (2011) defined Torch7
propagated to earlier layers. After that, Caffe does as a versatile numeric computing framework
all the bookkeeping for any directed acyclic graph and Machine Learning library that extends Lua
of layers, ensuring correctness of the forward and (Collobert et al. 2011). By using Lua you can
backward passes. A typical network begins with have a lightweight scripting language. Collobert
a data layer which is run on CPU or GPU by and his team chose Lua because Lua is the only
setting a single switch. The CPU/GPU switch is fastest interpreted language that they can find.
seamless and independent of the model defini- Lua is easy to be embedded in a C application.
tion. They also chose Lua instead of Python because
Caffe is able to clear access to deep archi- Lua is well-known in the gaming programmer
tectures which can train the models for fastest community. Most packages in Torch7 or third-
research progress and emerged the commercial party packages that depend on Torch7 rely on
applications. In 2015, Ahmed et al. (2015) has its Tensor class to represent signals, images,
implemented deep network architecture using the and videos. There are eight packages in Torch7
Caffe Deep Learning framework which the re- which are (i) torch, (ii) lab and plot, (iii) qt, (iv)
search adapted various of layers from the frame- nn, (v) image, (vi) optim, (vii) unsup, and (viii)
work and wrote their own layers that are specific third-party. Lua also provides a wide range of
to the deep network architecture. The conver- algorithms for Deep Machine Learning and uses
gences time for training the network is approxi- the scripting LuaJIT.
Big Data Deep Learning Tools 223
Torch7 is designed and developed to use simulations from Torch7 can be run faster than
Open Multi-Processing (OpenMP) directives real time and help in large-scale training. Other
for various operations in its Tensor library than games, the author also said that Torch7 is
and neural network package (Collobert et al. useful for experiments in vision, physics, and B
2011). This is due to the OpenMP which agent learning.
provides a shared memory CPU parallelization Theano. Theano is a Python library that
framework in C/CCC and Fortran languages allows to define, optimize, and evaluate
on the almost operating system and compiler mathematical expressions involving multidimen-
tool set. Hence, OpenMP normally requires sional arrays efficiently (Al-Rfou et al. 2016;
minimal modification for integrating into an Goodfellow et al. 2013). Theano aims to improve
existing project. If the compiler is supported both execution time and development time of
by OpenMP, Torch7 will automatically detect Machine Learning applications, especially Deep
it and compile a high-level package that adds Learning algorithm (Bergstra et al. 2011). By
multi-threaded Tensor operations, convolutions, describing the mathematical algorithms as in
and several neural network classes. The number functional language and profit from fast coding,
of threads to be used can be specified by either Theano makes it easier to leverage the rapid
setting or environment variable to the desired development pattern encouraged by the Python
number. language and depending on Numpy and Scipy.
In 2015, Lane et al. (2015) have done the Theano also provides a language that is an
experiment in designing the raw feasibility of actual implementation for describing expressions
executing Deep Learning models. The paper also (Sherkhane and Vora 2017). Likewise, in
considered end-to-end model performance met- Bastien’s paper (2012), Theano uses CUDA
rics of execution time and energy consumption, to define a class of n-dimensional arrays
especially in terms of how these map to applica- located in GPU memory with Python binding
tion needs. For the experiments, Torch7 has been (Bastien et al. 2012). Theano can automatically
used for training all the models in NVIDIA Tegra compute symbolic differentiation of complex
K1. The result of the experiments shows that expressions, disregard the variables that are not
Deep Learning is enabling to be ready to integrate required to compute the final output, recycle
with IOT, smartphones, and wearable systems. partial results to avoid redundant computations,
Shokri and Shmatikov (2015) have designed, im- apply mathematical simplifications, calculate
plemented, and evaluated a practical system that operations in place when possible to minimize
enables multiple parties to jointly learn an accu- the memory usage, and apply numerical stability
rate neural network model for a given objective optimization to overcome or minimize the error
without sharing their input datasets (Shokri and due to hardware approximations (Al-Rfou et al.
Shmatikov 2015). They used Torch7 packages as 2016).
they mentioned that Torch7 has been used and Theano defines a language to represent math-
extended by major Internet companies such as ematical expressions and manipulate them. For
Facebook. Later in 2016, small towers of wooden example, Theano represents symbolic mathemat-
blocks are created using a 3D game engine which ical expressions as directed, acyclic graphs (Al-
the stability is randomized and renders them Rfou et al. 2016). These graphs are also bipartite,
collapsing (Lerer et al. 2016), using Torch7 Deep containing two kinds of nodes: variable nodes
Learning environment to integrate with Machine and apply nodes. Practically, for graph inputs
Learning framework directly into the UE4 game and outputs, usually variables and intermediate
loop. Torch7 allowed fine-grained scripting and values are going to be used. During the execution
online control of UE4 simulations. Since Lua phase, values will be provided for input variables
is the dominant scripting languages for games and computed for intermediate and output ones.
and many games including UE4 support, Torch7 The apply node has inputs and outputs, while
is well-suited for game engine integration. The the variable node can be the input to several
224 Big Data Deep Learning Tools
Big Data Deep Learning Tools, Table 1 Properties of Deep Learning tools (Boureau and Cun 2008)
Property Caffe Torch7 Theano
Core language C Lua Python
CPU X X X
Multi-threaded CPU X BLAS X Widely used X BLAS, conv2D limited
OpenMP
GPU X X X
Multi-GPU X Only data parallel X ✗ Experimental version
available
NVIDIA cuDNN X X X
Quick deployment on X Easiest X ✗ Via secondary libraries
standard models
Automatic gradient X X X Most flexible (also over
computation loops)
apply nodes but can be the output of at most one Comparative Study of Deep Learning
apply node. Since variables are strongly typed Tools
in Theano, the type of these variables must be
specified at creation time. By calling Python This section provides a comparison of Deep
functions on variables, the user can then interact Learning tools as in Table 1 below. Table
with them in a direct and natural way. Due to 1 describes the general properties of Caffe,
the computation graph is acyclic and its structure Torch7, and Theano. According to Bahrampour
is fixed and independent from the actual data, et al. (2015), these tools are the top three
it can be a challenge to express loops symboli- well-developed and widely used frameworks
cally. Therefore, Theano implements a special Op by the Deep Learning community. However,
called Scan to abstracts the entire loop in a single the comparison only focused on the speed of
apply node in the graph. The Scan node handles performance of the convolutional frameworks.
the communication between the external or outer
computation graph it belongs to and the internal
or inner graph. The Scan is also responsible to Future Directions for Research
manage the bookkeeping between the different
iterations. Overall, Deep Learning tools that have been dis-
Glorot et al. (2011) proposed a Deep Learn- cussed in the previous sections may give the
ing approach which learns to extract a mean- guideline to the other researcher on which tools
ingful representation of the problem of domain that want to use in their study. We found that
adaptation of sentiment classifiers (Glorot et al. each of the tools is effective when the researchers
2011). All algorithms were implemented using use to match their study. Nevertheless, since the
the Theano library, while in 2016, Litjens and his data keeps getting bigger, the technology also
team (2016) introduced Deep Learning as a tech- becoming more advanced. There might be new
nique to improve the objectivity and efficiency Deep Learning tools that can outperform the
of histopathologic slide analysis. In this work, current tools.
the convolutional neural network has been trained
using the open source Deep Learning libraries
which are Theano. However, in this work, due to Conclusion
the memory limitations of the GPU, some of the
sized patches performed substantially worse on In contrast between conventional Machine Learn-
patch-based accuracy. ing tools and Deep Learning tools, Deep Learning
has the advantages of potentially providing a
Big Data Deep Learning Tools 225
solution to Big Data analysis due to its extracting Bengio Y (2013) Deep learning of representations: look-
complex data representations from large volumes ing forward. In: International conference on statistical
language and speech processing, Springer, Berlin/Hei-
of unsupervised data. Deep Learning tool be- delberg, pp 1–37
comes valuable tools when it comes to Big Data Bengio Y, Bengio S (2000) Modeling high-dimensional
B
analytics where the analysis data involving large discrete data with multi-layer neural networks. In: Solla
volumes of raw data are unsupervised and uncat- SA, Leen TK, Muller KR (eds) Advances in neural
information processing systems. MIT Press, Boston, pp
egorized. The tools can be applied to managing 400–406
and analyzing the data in Big Data analytics Bengio Y, Courville A, Vincent P (2013) Representation
problem due to its features that can learn from learning: a review and new perspectives. IEEE Trans
hierarchical representation in deep architectures Pattern Anal Mach Intell 35(8):1798–1828. https://doi.
org/10.1109/tpami.2013.50. arXiv: 1206.5538
for classification. Hence, in this study, we have Bergstra J, Bastien F, Breuleux O, Lamblin P, Pascanu
aimed and identified the top three tools of Deep R, Delalleau O, Desjardins G, Warde-Farley D, Good-
Learning that can be useful to Big Data. The fellow I, Bergeron A, Bengio Y (2011) Theano: Deep
tools are Caffe, Torch7, and Theano. This study learning on gpus with python. In NIPS 2011, BigLearn-
ing Workshop, Granada, Spain, vol. 3
also has given a comparative study on the perfor- Boureau YL, Cun YL (2008) Sparse feature learning for
mance of the convolutional frameworks for each deep belief networks. In: Advances in neural informa-
tool. tion processing systems, NIPS Press, Vancouver, pp
1185–1192
Caffe Tutorial (2018) https://web.archive.org/web/
20170405073658/https://vision.princeton.edu/courses/
Cross-References COS598/2015sp/slides/Caffe/caffe_tutorial.pdf
Chen XW, Lin X (2014) Big data deep learning: chal-
Big Data Analysis Techniques lenges and perspectives. IEEE Access 2:514–525
Chetlur S, Woolley C, Vandermersch P, Cohen J, Tran
Deep Learning on Big Data J, Catanzaro B, Shelhamer E (2014) cuDNN: ef-
Languages for Big Data Analysis ficient primitives for deep learning. arXiv preprint
Overview of Online Machine Learning in Big arXiv:1410.0759
Data Streams Collobert R, Kavukcuoglu K, Farabet C (2011) Torch7:
a matlab-like environment for machine learning. In:
Tools and Libraries for Big Data Analysis BigLearn, NIPS workshop, no. EPFL-CONF-192376
Efrati A (2013) How ‘deep learning’ works at
Acknowledgments This work is supported by The Min- Apple, beyond. Information (online). https://www.
istry of Higher Education, Malaysia under Fundamental theinformation.com / How - Deep - Learning - Works - at-
Research Grant Scheme: 4F877. Apple-Beyond
Glorot X, Bordes A, Bengio Y (2011) Domain adaptation
for large-scale sentiment classification: a deep learning
approach. In: Proceedings of the 28th international con-
References ference on machine learning (ICML-11), Washington,
pp 513–520
Ahmed E, Jones M, Marks TK (2015) An improved Goodfellow IJ, Warde-Farley D, Lamblin P, Dumoulin V,
deep learning architecture for person re-identification. Mirza M, Pascanu R, Bergstra J, Bastien F, Bengio Y
In: Proceedings of the IEEE conference on computer (2013) Pylearn2: a machine learning research library.
vision and pattern recognition, Boston, pp 3908–3916 stat, 1050, p. 20
Al-Rfou R, Alain G, Almahairi A, Angermueller C, Bah- Hordri NF, Samar A, Yuhaniz SS, Shamsuddin SM (2017)
danau D, Ballas N, Bastien F, Bayer J, Belikov A, A systematic literature review on features of deep
Belopolsky A, Bengio Y (2016) Theano: A Python learning in big data analytics. Int J Adv Soft Comput
framework for fast computation of mathematical ex- Appl 9(1):32–49
pressions. arXiv preprint arXiv:1605.02688, pp 472– Jia Y, Shelhamer E, Donahue J, Karayev S, Long J,
473 Girshick R, Guadarrama S, Darrell T (2014) Caffe:
Bahrampour S, Ramakrishnan N, Schott L, Shah M (2015) convolutional architecture for fast feature embedding.
Comparative study of caffe, neon, theano, and torch for In: Proceedings of the 22nd ACM international confer-
deep learning. arXiv preprint arXiv:1511.06435 ence on Multimedia, ACM, New York, pp 675–678
Bastien F, Lamblin P, Pascanu R, Bergstra J, Goodfellow Jones N (2014) Computer science: the learning machines.
I, Bergeron A, Warde-Farley D, Bengio Y (2012) Nature 505(7482):146–148
Theano: new features and speed improvements. arXiv Kirk J (2013) Universities, IBM join forces to build
preprint arXiv:1211.5590 a brain-like computer. PCWorld. http://www.
226 Big Data Enables Labor Market Intelligence
training, especially in those jobs that are highly specialized or heavily affected by digitalization. On the one hand, several jobs are disappearing while new jobs are emerging; some of these are simply variants of existing jobs, while others are genuinely new jobs that were nonexistent until a few years ago. On the other hand, the quantity and quality of the demand for skills and qualifications associated with the new labor market have changed dramatically. New skills are needed not only to perform new jobs; the skill requirements of existing jobs have also changed considerably.

The Need for LMI. In such a dynamic scenario, the problem of monitoring, analyzing, and understanding these labor market changes (i) timely and (ii) at a very fine-grained geographical level has become practically significant in our daily lives. Which occupations will grow in the future, and where? What skills will be demanded the most in the coming years? These are just a few of the questions that economists and policy makers have to face (see, e.g., the recent work by Frey and Osborne (2017) on occupations that might disappear due to digitalization). Here, a key role is played by Big Data concerning the labor market (e.g., job advertisements, CVs posted on the Web, etc.). This data needs to be collected, organized, and manipulated to enable real-time monitoring and analysis of labor market dynamics and trends. In this way, labor market needs can be analyzed without the time lags found in traditional administrative and survey data sources, which may require several months to become available. In addition, a real-time analysis of the Web labor market represents an anytime snapshot of market demand, providing a useful source of information about specific job requirements, the offered contracts, the requested skills (both hard and soft), the industry sector, etc., which are not covered by any survey.

A Motivating Example. Table 1 shows two job vacancies extracted from specialized Web sites. Even though both advertisements seek computer scientists, the differences in terms of job requirements, skills, and corresponding educational levels are quite evident. The first job vacancy (A) is looking for a software developer, while the second one (B) is searching for a less specialized candidate, i.e., an ICT technician. Indeed, in the latter case, the only requested abilities are the capability to use some software solutions and the knowledge of a programming language (nice to have) that is usually taught in professional high schools.

The ability to catch such differences and to classify these two distinct occupational profiles (i) timely and (ii) according to a standard taxonomy is mandatory for analyzing, sharing, and comparing Web labor market dynamics over different regions and countries. Here, job offer
Big Data Enables Labor Market Intelligence, Table 1 An example of real Web job vacancies as collected in the
CEDEFOP project (Taken from Boselli et al. 2017b)
(A) ICT Developer. "Looking to recruit a software developer to join our dynamic R&D team to support work on Windows-based software for current and upcoming products. A flexible, self-starting attitude is essential along with a strong motivation to learn new skills as required. The ideal candidate will have at least 2–3 years experience working in a fast and dynamic small team. Experience in Python is the key requirement for my client. A degree in computer science/computer engineering is preferable, but other engineering/science graduates will be considered if they have software development experience." [ISCO 2512: Software Developer]
Workplace: Milan. Contract type: Permanent.

(B) Application Consultant. "We are seeking for an application consultant that will support our internal SW engineers in the realization of ad hoc ERP software. The ideal candidate should have (at least) a high-level degree and 2+ years experience in using both Sharepoint-MS CRM and Magic. Coding skills on VB are appreciated." [ISCO 3512: ICT user support technicians]
Workplace: Rome. Contract type: Unlimited term.
Big Data Enables Labor Market Intelligence, Fig. 1 A branch of the ISCO classification tree for professionals
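The ISCO codes used here are hierarchical: each 4-digit unit group (e.g., 2512) sits under a 3-digit minor group, a 2-digit sub-major group, and a 1-digit major group. A minimal sketch of deriving that path from a code follows; the label table is a tiny excerpt of the public ISCO-08 structure covering only the two codes discussed in the text (labels abbreviated as in this entry), and the helper itself is illustrative, not part of any cited system:

```python
# Illustrative sketch: derive the hierarchical path of a 4-digit ISCO-08 code.
# The label table is a small excerpt covering only the codes 2512 and 3512.
ISCO_LABELS = {
    "2": "Professionals",
    "25": "Information and communications technology professionals",
    "251": "Software and applications developers and analysts",
    "2512": "Software developers",
    "3": "Technicians and associate professionals",
    "35": "Information and communications technicians",
    "351": "ICT operations and user support technicians",
    "3512": "ICT user support technicians",
}

def isco_path(code: str) -> list:
    """Return the chain of (code, label) pairs from major group to unit group."""
    return [(code[:i], ISCO_LABELS[code[:i]]) for i in range(1, len(code) + 1)]

for prefix, label in isco_path("2512"):
    print(prefix, label)
```

Walking this chain top-down is exactly what a hierarchical classifier over ISCO does: it narrows from major group to unit group.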
A was classified under code 2512: Software Developer of the hierarchical ISCO classifier by labor market experts of The European Network on Regional Labour Market Monitoring (2017), while job offer B refers to 3512: ICT user support technicians (an example is provided in Fig. 1). In addition, skills (both hard and soft) have to be identified from the raw text and linked against a skill taxonomy (e.g., the ESCO taxonomy), while new emerging skills have to be recognized and returned as well. Which are the most important skills in these advertisements? Which are the most relevant skills in these occupations? These are just a few of the questions that labor market analysts and specialists need to address to deeply understand the labor market at local, national, and international levels and to design labor policies accordingly.

Research Findings

Literature on LMI. Labor market intelligence is an emerging cross-disciplinary field of study that is gaining research interest in both industrial and academic communities. On one side, the task of extracting useful information from unstructured texts (e.g., vacancies, curricula, etc.) has mainly focused on the e-recruitment process, attempting to support or automate resume management by matching candidate profiles with job descriptions using language models and machine learning (Li et al. 2017; Zhu et al. 2016; Kokkodis et al. 2015; Verroios et al. 2015; Singh et al. 2010; Yi et al. 2007; Hong et al. 2013). On the other side, companies need to automate human resource (HR) department activities; as a consequence, a growing number of commercial skill-matching products have been developed in recent years, for instance, BurningGlass, Workday, Pluralsight, EmployInsight, and TextKernel (see, e.g., Neculoiu et al. 2016). To date, the only commercial solution that uses international standard taxonomies is Janzz: a Web-based platform to match labor demand and supply in both public and private sectors. It also provides API access to its knowledge base, but it is not aimed at classifying job vacancies. Worth mentioning is the Google Job Search API, a pay-as-you-go service announced in 2016 for classifying job vacancies through the Google Machine Learning service over O*NET, the US standard occupation taxonomy. Although this commercial service is still a closed alpha, it is quite promising and also sheds light on the need for reasoning over Web job vacancies using a common taxonomy as a baseline.

Calls for Projects on LMI. Focusing on public administrations and organizations, there is a growing interest in designing and implementing real LMI applications on Web labor market data to support policy design and evaluation activities through evidence-based decision-making. In 2010, the European Commission published the communication "A new impetus for European Cooperation in Vocational Education and Training (VET) to support the Europe 2020 strategy" (CEDEFOP 2010), aimed at promoting education systems in general and VET in particular. In 2016, the European Commission highlighted the importance of vocational and educational activities, as they are "valued for fostering job-specific and transversal skills, facilitating the transition into employment and maintaining and updating the skills of the workforce according to sectorial, regional, and local needs" (European Commission 2016a). In 2016, the EU and Eurostat launched the ESSnet Big Data project (EuroStat 2016), involving 22 EU member states with the aim of "integrating big data in the regular production of official statistics, through pilots exploring the potential of selected Big Data sources and building concrete applications." Furthermore, in 2014 the EU CEDEFOP agency – whose mission is to support the development of European Vocational Education and Training – launched a call-for-tender for realizing a system able to collect and classify Web job vacancies from five EU countries (CEDEFOP 2014). The rationale behind the project is to turn data extracted from Web job vacancies into knowledge (and thus value) for policy design and evaluation through fact-based decision-making. Given the success of the prototype, a further call-for-tender has been opened for realizing a Web labor market monitor for the whole EU, covering all 28 EU member countries and all 24 languages of the Union (CEDEFOP 2016).

All these initiatives are quite effective and shed light on the relevance of reasoning over labor market information to support business purposes in both the private sector (e.g., job matching, candidate profiling, etc.) and the public one (e.g., labor market analysis).

KDD Applied to LMI
The extraction of knowledge from (Big) labor market data has been addressed using the KDD approach (i.e., knowledge discovery in databases) as a baseline. KDD is mainly composed of five steps, as shown by Fayyad et al. (1996). Clearly, this approach needs to be adapted to the domain of interest, emphasizing one task or step over another. To this end, Fig. 2 gives a graphical overview of KDD framed within the LMI context, along with labels that specify which Vs of the Big Data context are involved. Each step can be adapted to the Web labor market context as follows.

Step 1: Selection. Data sources have to be selected first. Each Web source has to be evaluated and ranked to guarantee the reliability of the information it includes (see, e.g., the work concerning the believability of data by Wang and Strong 1996). Just to give an example, this phase should take into account the vacancy publication date, the website's update frequency, the presence of structured data, whether there are any restrictions on downloading, etc. At the end of this process, a ranking of reliable Web sources is produced. As one might note, this step requires dealing with all the 4 Vs of Big Data, which include also
Big Data Enables Labor Market Intelligence, Fig. 2 The KDD steps annotated to specify which Vs of Big Data are involved in each step. Figure built on top of the figure from Fayyad et al. (1996)
the veracity, intended as the biases, noise, and abnormality in the data.

Step 2: Preprocessing. This includes the data cleaning task of removing noise and inappropriate outliers (if any) from the data, deciding how to handle missing data, and identifying a function to detect and remove duplicated entries (e.g., duplicated vacancies, vacancies with missing values, etc.). Data quality and cleaning tasks are indeed mandatory steps of any data-driven decision-making approach for guaranteeing the believability of the overall process, intended as "the extent to which data is accepted or regarded as true, real, and credible" (see, e.g., Wang and Strong 1996; Redman 1998; Mezzanzanica et al. 2015; Batini et al. 2009).

Notice that the identification of duplicated job vacancies is far from straightforward. Indeed, job vacancies are usually posted on different websites, and the same text is often reused to advertise a similar position. The former case is a duplication, while the latter is not. The identification of appropriate features to correctly identify duplicates is therefore crucial in the Web labor market domain. Notice that this step reduces the complexity of the Big Data scenario, as the impact of the veracity dimension is mitigated through the data quality and cleaning activities performed within this task.

Step 3: Transformation. This step includes the data reduction and projection tasks, which aim at identifying a unified model to represent the data, depending on the goal of the task. Furthermore, it may include the use of dimensionality reduction or transformation methods to reduce the effective number of variables or to find invariant representations for the data. Again, this step reduces the complexity of the dataset, here by addressing the variety dimension. It is usually performed by means of ETL (extract, transform, and load) techniques, which support the data preprocessing and transformation tasks in the KDD process. Roughly speaking, in the ETL process, the data extracted from a source system undergoes a set of transformation routines that analyze, manipulate, and then cleanse the data before loading it into a knowledge base (see, e.g., Vassiliadis et al. 2002 for details on the ETL process). Notice that, at the end of this step, the variety issue related to Big Data is expected to be solved, as a cleansed and well-defined model of the data has been reached.

Step 4: Data Mining and Machine Learning. This step requires identifying proper AI algorithms (e.g., classification, prediction, regression, clustering) and searching for patterns of interest in a particular representational form, on the basis of the analysis purposes. More specifically, in the context of LMI, it usually requires the use of text classification algorithms (ontology based, machine learning based, etc.) that build a classification function to map a data item into one of several predefined classes. Here, the items are Web job offers, while the predefined classes are those of a taxonomy (e.g., a proprietary or a public one, such as ESCO or O*NET). This is the core step of the KDD approach, including for LMI activities. Therefore, the task of job vacancy classification deserves to be formally described in terms of text categorization.

Text categorization aims at assigning a Boolean value to each pair (d_j, c_i) ∈ D × C, where D is a set of documents and C a set of predefined categories. A true value assigned to (d_j, c_i) indicates that document d_j is to be filed under category c_i, while a false value indicates that d_j cannot be assigned to c_i. In the LMI scenario, a set of job vacancies J can be seen as a collection of documents, each of which has to be assigned to one (and only one) occupation code of the taxonomy. Hence, classifying a job vacancy over a classification system requires assigning one occupation code to the vacancy. This text categorization task can be solved through machine learning, as specified by Sebastiani (2002). Formally speaking, let J = {J_1, ..., J_n} be a set of job vacancies; the classification of J under a taxonomy O consists of |O| independent problems of classifying each job vacancy J ∈ J under a given taxonomy occupation code o_i, for i = 1, ..., |O|. Then, a classifier is a function φ : J × O → {0, 1} that approximates an unknown target function Φ : J × O → {0, 1}.
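This categorization step can be sketched with TF-IDF features and a linear classifier from scikit-learn. The four training texts and the restriction to the two ISCO codes of Table 1 are toy assumptions for illustration, not the CEDEFOP pipeline, which trains on a large labeled corpus:

```python
# Illustrative sketch of Step 4: single-label text categorization of Web job
# vacancies onto ISCO occupation codes. Toy training data; a real system
# trains on a large labeled corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "software developer python degree in computer science development",
    "programmer software engineering python agile development team",
    "ict user support technician helpdesk erp crm assisting users",
    "application consultant support technician sharepoint crm troubleshooting",
]
train_codes = ["2512", "2512", "3512", "3512"]  # ISCO unit-group codes

# The fitted pipeline plays the role of the classifier phi: it assigns exactly
# one occupation code per vacancy, so the single-label constraint holds.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_codes)

vacancy = "looking to recruit a software developer with python experience"
print(clf.predict([vacancy])[0])
```

Because `predict` returns exactly one code per document, the constraint that each vacancy receives one and only one occupation code is satisfied by construction.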
Clearly, as this case deals with a single-label classifier, for every j ∈ J the following constraint must hold: Σ_{o∈O} φ(j, o) = 1.

Step 5: Interpretation/Evaluation. Finally, this step employs visual paradigms to represent the resulting knowledge according to the user's purposes. In the LMI context, this step requires taking into account the user's ability to understand the data and the user's main goal in the LMI field. For example, public administrations might be interested in identifying the most requested occupations in their local area; companies might focus on monitoring skill trends and identifying new skills for some occupations, so that they can promptly design training paths for their employees; etc. In the last few years, much work has been done on producing off-the-shelf visual libraries that implement several narrative and visual paradigms. A powerful example is D3.js (Bostock et al. 2011), a data-driven responsive library for producing dynamic, interactive data visualizations even in a Big Data context (see, e.g., Bao and Chen 2014).

Examples of Application and Projects

As said above, LMI is an emerging field in which the private sector plays a key role through many commercial solutions, while public and academic applications/projects are fewer. Focusing on the latter, two EU projects are reported here, as they are mature examples of how to use Big Data for LMI purposes.

Application 1: LMI for Decision-Making
In 2014, the CEDEFOP EU Agency invited academics to participate in a call-for-tender aimed at collecting Web job vacancies from five EU countries and extracting the requested skills from the data. The rationale behind the project is to turn data extracted from Web job vacancies into knowledge (and thus value) for policy design and evaluation through fact-based decision-making. The system (Boselli et al. 2017a) has been running in the CEDEFOP data center since June 2016, gathering and classifying job vacancies from five EU countries, namely, the United Kingdom, Ireland, the Czech Republic, Italy, and Germany, through machine learning (Boselli et al. 2017b). To date, the system has collected 7+ million job vacancies over the five EU countries, and it counts among the research projects that a selection of Italian universities addressed in the context of Big Data (Bergamaschi et al. 2016). This prototype allowed the CEDEFOP agency to gather evidence of the competitive advantage that the analysis of Web job vacancies enables with respect to classical survey-based analyses, in terms of (i) near-real-time labor market information; (ii) reduced time-to-market; (iii) fine-grained territorial analysis, from countries down to municipalities; (iv) identification of the most requested skills; and (v) evaluation of the impact of skills on each ISCO profession.

Online Demo. To better clarify the final outcome of an LMI project, the reader can refer to the animated heat map showing the job vacancy collection progress over time, available at http://goo.gl/MQUiUn. An example of how the LMI knowledge might be made available to decision-makers is provided at http://goo.gl/RRb63S.

Application 2: LMI for Producing Official Statistics
Focusing on the statistical perspective of the analysis of Big Data for LMI, Eurostat – the official statistical office of the European Union – launched a project named ESSnet Big Data, aimed at integrating Big Data about LMI into the regular production of official statistics, through pilots exploring the potential of selected Big Data sources and building concrete applications (EuroStat 2016). The project – started in late 2016 – comprises 22 EU partners and strictly follows the KDD approach: it collects data from previously ranked Web portals, then cleanses and transforms the data in order to classify it over standard taxonomies. Worth noting here is the important role that the representativeness of Big Data plays in this project, as participants aim at evaluating the ability of Big Data to be representative of the whole population (or a stratified sample) for inclusion in official EU statistics.
Future Directions for Research

LMI is a cross-disciplinary research area that involves computer scientists, statisticians, and economists working together to make sense of labor market data for decision-making. In such a scenario, three distinct research directions can be identified, as discussed below.

K1: Making LMI Explainable. Explainable artificial intelligence (XAI) is an emerging branch of artificial intelligence aimed at investigating new machine learning models that are fully explainable in contexts where transparency is important, for instance, where these models are adopted to solve analysis tasks (e.g., classification) or synthesis tasks (e.g., planning, design), as well as in mission-critical applications. DARPA recently launched the Explainable AI (XAI) program, which aims to create a suite of AI systems able to explain their own behavior, focusing on ML and deep learning techniques. To date, indeed, most ML research models rarely provide explanations/justifications for the outcomes they produce in the tasks to which they are applied. The need for explainable AI is motivated mainly by the need for (i) trust, (ii) interaction, and (iii) transparency, as recently discussed by Fox et al. (2017). This is also the reason behind the interest in XAI applications in academic and industrial communities (see, e.g., Biran and Cotton 2017; Ribeiro et al. 2016). Focusing on business contexts, as LMI is, autonomous decision-making can be framed as a set of economic problems that need to surface the rationale behind the suggested action, so that the final decision is believable to human decision-makers. Figure 3 gives a graphical overview of the main goal of XAI, that is, the development of AI-based systems able to explain their rationale, to characterize their strengths and weaknesses, and to convey an understanding of how they will behave in the future.

XAI in LMI. In the LMI field, indeed, ML-based approaches do not provide any explanations for their outcomes/predictions. This opaque behavior may prevent the decision-maker from considering the computed analyses reliable enough, making the overall process ineffective. In essence, the use of XAI algorithms in LMI addresses the same challenging issue that DARPA identifies in working on machine learning problems to construct decision policies for an autonomous system performing a variety of simulated missions (DARPA 2017). Specifically, providing an explanation for a job vacancy classifier would require answering a question such as: why has job vacancy J been classified as financial manager rather than financial and insurance manager?

To better understand what an explanation should be, Fig. 4 reports an example where the LIME tool proposed by Ribeiro et al. (2016) has been applied to retrieve the explanation for the example above, explaining a classifier that was previously trained on millions of English job vacancies. As one might note, the system provides the probabilities that lead the ML algorithm to classify the job vacancy under a given ISCO code. Furthermore, it also provides the contribution that each feature (here, each word) has made to the classification process. We can observe that the terms finance and manager contributed positively to classifying the vacancy under financial manager (code 1211). Surprisingly, the system also reports that the term finance has a higher probability of classifying a job vacancy under financial and insurance manager (code 1346), but the presence of the term manager contributed negatively to classifying under that code. This kind of result is quite useful for improving both the transparency and the interaction of ML systems in LMI applications. In essence, this real example sheds light on what happens behind the scenes of a black-box classifier: although the system correctly classifies the vacancy under the right code, the negative contribution that the term "manager" gives to the class with label 1346 is wrong and might have unpredictable effects on future classifications. On the other hand, having this kind of information would allow an analyst to revise and refine the system, applying a sort of explain-and-validate approach to improve both the effectiveness and the reliability of the machine learning-based system.
[Figure 3 contrasts today's ML pipeline, which leaves the user asking "Why did you do that?", with the XAI pipeline, in which a new ML process yields an explainable model plus an explanation interface, so the user can say: "I understand why (and why not)"; "I know when you succeed and when you fail"; "I know when to trust you"; "I know why you erred."]
Big Data Enables Labor Market Intelligence, Fig. 3 Graphical overview of a XAI system (Taken from www.darpa.mil)
Big Data Enables Labor Market Intelligence, Fig. 4 An example of explanation for a ML-based system. ESCO codes are as follows: 1211, "financial manager"; 1346, "financial and insurance services branch managers"; 3313, "accounting associate professionals"; 2413, "financial analysts"
This, in turn, would allow final users to answer the following questions:

• Q1: Why did you classify J under that code?
• Q2: Which features led you to classify J under code A rather than B?
• Q3: What is the confidence score of your classification outcome?
• Q4: How robust is your classification to errors and changes in the labor market?

Answering these questions in a believable manner is a crucial step toward enabling decision-makers to take reliable decisions and design effective policies when the data is processed and managed by AI-based systems.

K2: Dealing with Cross-Language LMI. The problem of dealing with multilingual Web job vacancies is growing in importance, mainly in the context of LMI for decision-making, where one of the main goals is to be able to share and compare labor market dynamics among different countries:

• Q1: How to compare labor markets across different countries?
• Q2: How to deal with the jargon and lexicon of different countries?
• Q3: How to deal with the different labor market taxonomy of each country?

To this end, there is a call by the European Community for using the ESCO standard taxonomy, which would provide a lingua franca to express occupations, skills, and qualifications across all official EU languages. Unfortunately, the use of ESCO does
not completely solve the issue of multilanguage job vacancies, as a training/test set still has to be created for each language. To cope with this issue, the LMI community is working in two distinct directions:

• Build a classifier for a text classification task in some target language T, given labeled training sets for the same task in a different source language S. Clearly, a function that maps word correspondences between languages T and S is required. This procedure is called cross-language text classification; see, e.g., Shi et al. (2010), Bel et al. (2003), and Olsson et al. (2005).
• Use unlabeled data from both languages and a word translation map to induce cross-lingual word correspondences. In this way, a cross-lingual representation is created, enabling the transfer of classification knowledge from the source to the target language, aka cross-language structural correspondence learning; see, e.g., Prettenhofer and Stein (2011).

K3: Representativeness of Web Labor Market Information. A key issue related to the use of Big Data for decision-making is its ability to be representative of the whole population. Indeed, while on one side labor market surveys can be designed to be representative of a certain population (sectors/occupations, etc.), on the other side, the representativeness of Web data cannot be evaluated easily, and this affects the reliability of the statistics and analytics derived from it. This issue also affects the labor market intelligence field (e.g., some occupations and sectors are not present in Web advertisements). To cope with this issue, the question has been rephrased as assessing whether a set of online job vacancies is a representative sample of all job vacancies in a specified economic context. Indeed, even though it should be considered finite, the population of job vacancies continuously changes, and this makes it difficult to determine it exactly in a stable manner. Indeed, jobs and people are usually reallocated in the labor market and in firms, respectively; tasks are split, restructured, or partially switched; and recruitment strategies might have sectoral and occupational specificities (see, e.g., Gosling et al. 2004; Mang 2012). From a statistical perspective, samples of available Web vacancies are prone to problems of self-selection and/or non-ignorable missing mechanisms (Little and Rubin 2014).

In such a scenario, several approaches have been taken by different authors in attempting to deal with the representativeness issues related to the usage of online job vacancy data for analytical and policy-making purposes. Among others, a number of studies decided to focus on segments of the labor market characterized by widespread access to the Internet (e.g., ICT), in the belief that these would be relatively well represented among job applicants due to the characteristics of Internet users and would therefore offer a representative data sample (Anne Kennan et al. 2006; Wade and Parent 2002). Notice that this approach has been considered valid by the EU Commission, which in 2016 invited contributions to a call-for-tender aimed at obtaining data on job vacancies for ICT practitioners and other professional categories, gathered by crawling job advertisements published in relevant online job advertisement outlets (European Commission 2016b).

Recently, some researchers have considered online data generalizable due to the dominant market share and very high reputation of the chosen portal among employers and employees (Kureková et al. 2012).

Nonetheless, the problem of the representativeness of Big Data concerning LMI – which is important for including such data in official statistics and data analytics – remains an open issue.

Concluding Remarks

LMI is an emerging and cross-disciplinary field of study that draws competences from different disciplines, such as computer science, statistics, and economics, in order to make sense of Big Data in the labor market field. Though the rationale behind LMI is the use of labor market
data, the final goal that guides LMI activities may vary as the purpose of the users varies. Private companies aim at supporting HR in recruitment activities, while analysts and public organizations would use LMI for studying labor market dynamics and for designing more effective policies following a data-driven approach. In any case, all the approaches mentioned strictly rely on the KDD approach as a baseline and massively employ AI algorithms to classify occupations/resumes and to extract occupations and skills. As for AI in general, in the near future, research on LMI will have to overcome some limitations that might prevent a steady use of Big Data in the LMI field: the ability to explain and motivate to decision-makers how the AI works under the hood; the ability to deal with multiple languages, lexicons, and idioms with limited effort; and the ability to be representative – or at least statistically relevant – of the whole population observed. These are the research directions identified by the top players in LMI – mainly the EU and US governments – for allowing Big Data to enable labor market intelligence.

Cross-References

Big Data Visualization Tools
Data Cleaning
ETL

References

Anne Kennan M, Cole F, Willard P, Wilson C, Marion L (2006) Changing workplace demands: what job ads tell us. In: Aslib proceedings, vol 58. Emerald Group Publishing Limited, pp 179–196
Bao F, Chen J (2014) Visual framework for big data in d3.js. In: 2014 IEEE workshop on electronics, computer and applications. IEEE, pp 47–50
Batini C, Cappiello C, Francalanci C, Maurino A (2009) Methodologies for data quality assessment and improvement. ACM Comput Surv (CSUR) 41(3):16
Bel N, Koster CH, Villegas M (2003) Cross-lingual text categorization. In: ECDL, vol 2769. Springer, pp 126–139
Bergamaschi S, Carlini E, Ceci M, Furletti B, Giannotti F, Malerba D, Mezzanzanica M, Monreale A, Pasi G, Pedreschi D et al (2016) Big data research in Italy: a perspective. Engineering 2(2):163–170
Biran O, Cotton C (2017) Explanation and justification in machine learning: a survey. In: IJCAI-17 workshop on explainable AI (XAI), p 8
Boselli R, Cesarini M, Marrara S, Mercorio F, Mezzanzanica M, Pasi G, Viviani M (2017a) Wolmis: a labor market intelligence system for classifying web job vacancies. J Intell Inf Syst, 1–26. https://doi.org/10.1007/s10844-017-0488-x
Boselli R, Cesarini M, Mercorio F, Mezzanzanica M (2017b) Using machine learning for labour market intelligence. In: The European conference on machine learning and principles and practice of knowledge discovery in databases – ECML-PKDD
Bostock M, Ogievetsky V, Heer J (2011) D3 data-driven documents. IEEE Trans Vis Comput Graph 17(12):2301–2309
CEDEFOP (2010) A new impetus for European cooperation in vocational education and training to support the Europe 2020 strategy (publicly available at https://goo.gl/Goluxo)
CEDEFOP (2014) Real-time labour market information on skill requirements: feasibility study and working prototype. CEDEFOP reference number ao/rpa/vkvet-nsofro/real-time lmi/010/14. Contract notice 2014/s 141-252026 of 15 July 2014. https://goo.gl/qNjmrn
CEDEFOP (2016) Real-time labour market information on skill requirements: setting up the EU system for online vacancy analysis ao/dsl/vkvet-grusso/real-time lmi 2/009/16. Contract notice 2016/s 134-240996 of 14 July 2016. https://goo.gl/5FZS3E
DARPA (2017) Explainable artificial intelligence (XAI). Available at https://www.darpa.mil/program/explainable-artificial-intelligence
European Commission (2016a) A new skills agenda for Europe. COM(2016) 381/2
European Commission (2016b) Vacancies for ICT online repository 2 (victory 2): data collection. https://goo.gl/3D7qNQ
European Commission (2017) ESCO: European skills, competences, qualifications and occupations. Available at https://ec.europa.eu/esco/portal/browse
EuroStat (2016) The ESSnet big data project. Available at https://goo.gl/EF6GtU
Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) The KDD process for extracting useful knowledge from volumes of data. Commun ACM 39(11):27–34
Fox M, Long D, Magazzeni D (2017) Explainable planning. arXiv preprint arXiv:1709.10256
Frey CB, Osborne MA (2017) The future of employment: how susceptible are jobs to computerisation? Technol Forecast Soc Chang 114(Supplement C):254–280. https://doi.org/10.1016/j.techfore.2016.08.019
Gosling SD, Vazire S, Srivastava S, John OP (2004) Should we trust web-based studies? A comparative analysis of six preconceptions about internet questionnaires. Am Psychol 59(2):93
236 Big Data for Cybersecurity
Hong W, Zheng S, Wang H (2013) Dynamic user profile- UK Commission for Employment and Skills (2015) The
based job recommender system. In: Computer science importance of LMI. Available at https://goo.gl/TtRwvS
& education (ICCSE). IEEE UK Department for Education and Skills (2004) LMI
Kokkodis M, Papadimitriou P, Ipeirotis PG (2015) Hiring Matters!
behavior models for online labor markets. In: Proceed- U.S. Department of Labor/Employment & Training Ad-
ings of the eighth ACM international conference on ministration (2017) O*net: the occupational informa-
web search and data mining. ACM, pp 223–232 tion network. Available at https://www.onetcenter.org
Kureková L, Beblavỳ M, Haita C (2012) Qualifications or Vassiliadis P, Simitsis A, Skiadopoulos S (2002) Concep-
soft skills? Studying demand for low-skilled from job tual modeling for ETL processes. In: Proceedings of
advertisements. Technical report, NEUJOBS working the 5th ACM international workshop on data warehous-
paper ing and OLAP. ACM, pp 14–21
Li H, Ge Y, Zhu H, Xiong H, Zhao H (2017) Verroios V, Papadimitriou P, Johari R, Garcia-Molina
Prospecting the career development of talents: a H (2015) Client clustering for hiring modeling in
survival analysis perspective. In: Proceedings of work marketplaces. In: Proceedings of the 21th ACM
the 23rd ACM SIGKDD international conference SIGKDD international conference on knowledge dis-
on knowledge discovery and data mining. ACM, covery and data mining. ACM, pp 2187–2196
pp 917–925 Wade MR, Parent M (2002) Relationships between job
Little RJ, Rubin DB (2014) Statistical analysis with miss- skills and performance: a study of webmasters. J Man-
ing data. Wiley, New York, USA age Inf Syst 18(3):71–96
Mang C (2012) Online job search and matching quality. Wang RY, Strong DM (1996) Beyond accuracy: what data
Technical report, IFO working paper quality means to data consumers. J Manage Inf Syst
Mezzanzanica M, Boselli R, Cesarini M, Mercorio F 12(4):5–33
(2015) A model-based evaluation of data quality activ- Yi X, Allan J, Croft WB (2007) Matching resumes and
ities in KDD. Inf Process Manag 51(2):144–166 jobs based on relevance models. In: Proceedings of
Neculoiu P, Versteegh M, Rotaru M, Amsterdam TB the 30th annual international ACM SIGIR conference
(2016) Learning text similarity with siamese recurrent on research and development in information retrieval.
networks. ACL 2016, p 148 ACM, pp 809–810
Olsson JS, Oard DW, Hajič J (2005) Cross-language Zhu C, Zhu H, Xiong H, Ding P, Xie F (2016) Recruitment
text classification. In: Proceedings of the 28th annual market trend analysis with sequential latent variable
international ACM SIGIR conference on research and models. In: Proceedings of the 22nd ACM SIGKDD
development in information retrieval. ACM, pp 645– international conference on knowledge discovery and
646 data mining. ACM, pp 383–392
Organization IL (2017) ISCO: the international standard
classification of occupations. Available at http://www.
ilo.org/public/english/bureau/stat/isco/
Prettenhofer P, Stein B (2011) Cross-lingual adaptation
using structural correspondence learning. ACM Trans Big Data for Cybersecurity
Intell Syst Tech (TIST) 3(1):13
Redman TC (1998) The impact of poor data quality on the
typical enterprise. Commun ACM 41(2):79–82
Markus Wurzenberger, Florian Skopik, and
Ribeiro MT, Singh S, Guestrin C (2016) Why should I Giuseppe Settanni
trust you?: explaining the predictions of any classifier. Center for Digital Safety and Security, AIT
In: Proceedings of the 22nd ACM SIGKDD interna- Austrian Institute of Technology, Vienna,
tional conference on knowledge discovery and data
mining. ACM, pp 1135–1144
Austria
Sebastiani F (2002) Machine learning in automated text
categorization. ACM compu surv (CSUR) 34(1):1–47
Shi L, Mihalcea R, Tian M (2010) Cross language text
classification by model translation and semi-supervised
Synonyms
learning. In: Proceedings of the 2010 conference on
empirical methods in natural language processing. As- Computer security; IT security
sociation for Computational Linguistics, pp 1057–1067
Singh A, Rose C, Visweswariah K, Chenthamarakshan
V, Kambhatla N (2010) Prospect: a system for screen-
ing candidates for recruitment. In: Proceedings of the Definitions
19th ACM international conference on information and
knowledge management. ACM, pp 659–668
The European Network on Regional Labour Market
Cybersecurity deals with the protection of com-
Monitoring (2017) Web page (http://www. puter systems; information and communication
regionallabourmarketmonitoring.net) technology (ICT) networks, such as enterprise
networks, cyber-physical systems (CPS), and the Internet of Things (IoT); and their components, against threats that aim at harming, disabling, and destroying their software and hardware.

Overview

In the beginning of the Digital Revolution, the primary goal of computer hackers attacking ICT infrastructures and networks was simply to receive recognition from like-minded people. The consequences were mostly downtimes and the often costly need to recover and clean up compromised systems, and were therefore manageable with rather little effort. In the meantime, the relevance of ICT networks has massively increased, and they have become the vital backbone of economic as well as daily life. Since their importance and complexity continuously grow, cyberattacks against them also become more complex and diverse, and consequently more difficult to detect. Today's attacks usually target stealing intellectual property, sabotaging systems, and demanding ransom money. Sophisticated and tailored attacks are called advanced persistent threats (APT) (Tankard 2011) and usually result in high financial loss and tarnished reputation.

Since protection against sophisticated cyberattacks is usually difficult and highly challenging, (i) timely detection to limit the caused damage to a minimum, which can be achieved by applying smart intrusion detection systems (IDS), and (ii) preventive actions, such as raising situational awareness through analyzing and sharing cyber threat intelligence (CTI), are of primary importance. Because of the continuously growing interconnection of digital devices, the amount of generated data and the number of new malware samples per day grow at an exorbitant speed. Moreover, particular types of malware, such as worms, rapidly evolve over time and spread over the network in unpredictable ways (Dainotti et al. 2007; Kim et al. 2004). Thus, it becomes more and more difficult to analyze all generated data that might include hints of cyberattacks and malicious system behavior. This is the reason why cybersecurity became a big data problem. According to the four Vs of big data (Chen et al. 2014), big data is defined by (i) volume (large amounts of data), (ii) variety (heterogeneous and manifold data), (iii) velocity (rapidly generated data that has to be analyzed online, i.e., as it is generated), and (iv) value (extraction of huge value from large amounts of data).

This section deals with the two main research areas of cybersecurity that are related to big data: (i) intrusion detection, which deals with revealing cyberattacks from network traces and log data, and (ii) CTI analysis and sharing, which supports incident response and decision-making processes. The section therefore first splits into two main parts, which are structured the same way: both first answer the question why the considered topic is a big data problem and then summarize the state of the art. Finally, future directions for research on big data for cybersecurity are presented, including a discussion of how intrusion detection and CTI analysis and sharing can benefit from each other.

Intrusion Detection

Countermeasures and defense mechanisms against cyberattacks split into tools that aim at timely detection of attack traces and tools that try to prevent attacks or deploy countermeasures in time. Technical preventive security solutions are, for example, firewalls, antivirus scanners, digital rights and account management, and network segmentation. Tools that aim at detecting cyberattacks and therefore only monitor a network are called intrusion detection systems (IDS). Currently applied IDS mostly implement signature-based black-listing approaches, which prohibit system behavior and activities that are defined as malicious. Hence, those solutions can only detect already known attack patterns. Nevertheless, these established tools build the vital backbone of an effective security architecture and ensure prevention and detection of the majority of threats. Still, the wide range of application possibilities of modern devices and the fast-changing threat landscape
demand smart, flexible, and adaptive white-listing-based intrusion detection approaches, such as self-learning anomaly-based IDS. In contrast to black-listing approaches, white-listing approaches permit a baseline of normal system behavior and consider every deviation from this ground truth as malicious activity. Thus, it is possible to detect previously unknown attacks.

Intrusion Detection: A Big Data Problem

Because of the rapidly growing number of interconnected digital devices, the amount of generated data also increases. Thus, the volume of data that can potentially be analyzed for intrusion detection is enormous, and approaches are required that can analyze and also correlate a huge load of data points. Furthermore, there is a wide variety of data types available for intrusion detection; the spectrum ranges from binary files over network traffic to log data, just to name a few. Additionally, the representation of this data differs depending on the applied technologies, implementations, and used services, as well as operating systems, which is why the analysis of data created in an ICT network comprising various different types of devices is not straightforward. Furthermore, the data is generated quickly, and the velocity increases with the number of connected devices. Thus, to provide online intrusion detection, i.e., analyzing the data as it is generated, which ensures timely detection of attacks and intruders, big data approaches are required that can rapidly process large amounts of data. Since most of the generated data relates to normal system behavior and network communication, it is important to efficiently filter the noise, so that only relevant data points that include information of high value for intrusion detection remain.

Intrusion Detection: Methods and Techniques

Like other security tools, IDS aim to achieve a higher level of security in ICT networks. Their primary goal is to detect cyberattacks in a timely manner, so that it is possible to react quickly and therefore reduce the attacker's window of opportunity to cause damage. IDS are passive security mechanisms, which only detect attacks without setting any countermeasures automatically (Whitman and Mattord 2012).

In 1980, James Anderson was one of the first researchers to point out the need for IDS and to contradict the common assumption of software developers and administrators that their computer systems run in "friendly," i.e., secure, environments (Anderson 1980). Since the 1980s, the amount of data generated in ICT networks has tremendously increased, and intrusion detection has therefore become a big data problem. Furthermore, since Anderson published his report on IDS, a lot of research in this area has been done (Axelsson 2000; Sabahi and Movaghar 2008; Liao et al. 2013; Garcia-Teodoro et al. 2009).

IDS can be roughly categorized as host-based IDS (HIDS), network-based IDS (NIDS), and hybrid or cross-layer IDS that make use of both data collected on hosts and monitored network traffic (Vacca 2013; Sabahi and Movaghar 2008). Generally, three methods are used in IDS: (i) signature-based detection (SD), (ii) stateful protocol analysis (SPA), and (iii) anomaly-based detection (AD) (Liao et al. 2013).

• SD (knowledge based) uses predefined signatures and patterns to detect attackers. This method is simple and effective for detecting known attacks. Its drawback is that it is ineffective against unknown attacks or unknown variants of an attack, which allows attackers to evade SD-based IDS. Furthermore, since the attack landscape is rapidly changing, it is difficult to keep the signatures up to date, and thus maintenance is time-consuming (Whitman and Mattord 2012).
• SPA (specification based) uses predetermined profiles that define benign protocol activity. Occurring events are compared against these profiles to decide whether protocols are used in the correct way. IDS based on SPA track the state of network, transport, and application protocols. They use vendor-developed profiles and therefore rely on vendor support (Scarfone and Mell 2007).
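The signature-based (SD) method boils down to matching a black-list of known-bad patterns against observed events. The following minimal sketch (the rule names and regular expressions are invented for illustration, not taken from any real rule set) shows both its simplicity and its blindness to attacks that match no signature:

```python
import re

# Hypothetical, minimal signature set: each rule maps a name to a regex
# encoding a known-bad pattern (a "black-list" entry).
SIGNATURES = {
    "sql_injection":  re.compile(r"(?i)union\s+select"),
    "path_traversal": re.compile(r"\.\./\.\./"),
    "nmap_scan":      re.compile(r"User-Agent: Nmap"),
}

def match_signatures(log_line):
    """Return the names of all signatures that fire on a single log line."""
    return [name for name, rx in SIGNATURES.items() if rx.search(log_line)]

# Fires on a known attack pattern ...
match_signatures("GET /index.php?id=1 UNION SELECT password FROM users")
# ... but a previously unseen attack matching no rule raises no alert,
# which is exactly the SD limitation described above.
```

Keeping such a rule set current is the maintenance burden noted above: every new attack variant needs a new or updated entry.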
actions (Heckerman et al. 1998).
• Clustering: Clustering enables grouping of unlabeled data and is often applied to detect outliers (Xu and Wunsch 2005).
• Decision trees: Decision trees have a treelike structure, which comprises paths that lead to a classification based on the values of different features (Safavian and Landgrebe 1991).
• Hidden Markov models (HMM): A Markov chain connects states through transition probabilities. HMM aim at determining hidden (unobservable) parameters from observed parameters (Baum and Eagon 1967).
• Support vector machines (SVM): SVM construct hyperplanes in a high- or infinite-dimensional space, which can then be used for classification and regression. Thus, similar to clustering, SVM can, for example, be applied for outlier detection (Steinwart and Christmann 2008).

Cyber Threat Intelligence

The concepts of data, information, and intelligence differ from one another in terms of volume and usability. While data is typically available in large volumes and describes individual and unarguable facts, information is produced when a series of data points are combined to answer a simple question. Although information is a far more useful output than the raw data, it still does not directly inform a specific action. Intelligence takes this process a stage further by interrogating data and information to tell a story (e.g., a forecast) that can be used to inform decision-making. Crucially, intelligence never answers a simple question; rather, it paints a picture that can be used to help people answer much more complicated questions. Progressing along the path from data to information to intelligence, the quantity of outputs drops off dramatically, while the value of those outputs rises exponentially (Chairman of the Joint Chiefs of Staff 2013).

Modern cybersecurity incidents manifest themselves in different forms depending on the attack perpetrators' goal, on the sophistication of the employed attack vectors, and on the information system the attack targets. Large amounts of data have to be analyzed while handling such incidents in order to derive meaningful information on the relations existing among the collected data units and eventually obtain cyber threat intelligence (CTI), with the purpose of designing suitable reaction strategies and mitigating the incident's effects in a timely manner.

CTI: A Big Data Problem

Hundreds of thousands of new malware samples are created every day. Considering the large number of vulnerabilities and exploits revealed at a similar rate, there exists a large volume of data comprising information on cyber threats and attacks that can be analyzed to improve incident response, to facilitate the decision-making process, and to allow an effective and efficient configuration of cybersecurity mechanisms. Furthermore, the variety of the data is manifold: there exist, for example, many different standards describing vulnerabilities, threats, and incidents that follow different structures, as well as threat reports and fixes written in free-text form. Because of the large amount of data on cyberattacks generated daily, the data that can be analyzed to derive intelligence is created with a high velocity and should be processed as fast as possible. Since large parts of the data are available only as free text and computer networks differ strongly from each other, it is of prime importance to select and extract the information with the highest value.

From Raw Data to Cyber Threat Intelligence

Threat information may originate from a wide variety of internal and external data sources. Internal sources include security sensors (e.g., intrusion detection systems, antivirus scanners, malware scanners), logging data (from hosts, servers, and network equipment such as firewalls), tools (e.g., network diagnostics, forensics toolkits, vulnerability scanners), security management solutions (security information and event management systems (SIEMs), incident management ticketing systems), and additionally personnel, who report suspicious behavior,
social engineering attempts, and the like. Typical external sources (meaning external to an organization) may include sharing communities (open public or closed ones), governmental sources (such as national CERTs or national cybersecurity centers), sector peers and business partners (for instance, via sector-specific information sharing and analysis centers (ISACs)), vendor alerts and advisories, and commercial threat intelligence services.

Stemming from this variety of sources, it becomes evident that cyber threat intelligence can be (preferably automatically) extracted from numerous technical artifacts that are produced during regular IT operations in organizations:

• Operating system service and application logs provide insights into deviations from normal operations within the organizational boundaries.
• Router, Wi-Fi, and remote services logs provide insights into failed login attempts and potentially malicious scanning actions.
• System and application configuration settings and states, often at least partly reflected by configuration management databases (CMDBs), help to identify weak spots due to unneeded but running services, weak account credentials, or wrong patch levels.
• Firewall, IDS, and antivirus logs and alerts point to probable causes, however often with high false-positive rates that need to be verified.
• Web browser histories, cookies, and caches are viable resources for forensic actions after something happened, to discover the root cause of a problem (e.g., the initial drive-by download).
• Security information and event management systems (SIEMs) already provide correlated insights across machines and systems.
• E-mail histories are essential sources to learn about and eventually counter (spear) phishing attempts and followed links to malicious sites.
• Help desk ticketing systems, incident management/tracking systems, and people provide insights into any suspicious events and actions reported by humans rather than software sensors.
• Forensic toolkits and sandboxing are vital means to safely analyze the behavior of untrusted programs without exposing a real corporate environment to any threat.

There are multiple types of potentially useful information obtained for security defense purposes from the data sources mentioned before. Every type has its own characteristics regarding the objective (e.g., to facilitate detection, to support prosecution), applicability, criticality, and "shareability" (i.e., the effort required to make an artifact shareable because of the steps required to extract, validate, formulate, and anonymize some piece of information). Depending on their content, cyber threat information can be grouped into the categories defined by NIST (2016) and OASIS (2017) and described below:

• Indicators: technical artifacts or observables that suggest an attack is imminent or currently underway, or that a compromise may have already occurred. Examples are IP addresses, domain names, file names and sizes, process names, hashes of file contents and process memory dumps, service names, and altered configuration parameters.
• Tactics, techniques, and procedures (TTPs): characterize the behavior of an actor. A tactic is the highest-level description of this behavior, while techniques give a more detailed description of behavior in the context of a tactic, and procedures are an even lower-level, highly detailed description in the context of a technique. Typical examples include the usage of spear phishing emails, social engineering techniques, websites for drive-by attacks, exploitation of operating system and/or application vulnerabilities, the intentional distribution of manipulated USB sticks, and various obfuscation techniques. From these TTPs, organizations are able to learn how malicious attackers work and to derive higher-level and generally valid detection as well as remediation techniques, compared to
quite specific measures based on just, often only temporarily valid, indicators.
• Threat actors: contain information regarding the individual or group posing a threat. For example, information may include the affiliation (such as a hacker collective or a nation-state's secret service), identity, motivation, relationships to other threat actors, and even their capabilities (via links to TTPs). This information is used to better understand why a system might be attacked and to work out more targeted and effective countermeasures. Furthermore, this type of information can be applied to collect evidence of an attack to be used in court.
• Vulnerabilities: software flaws that can be used by a threat actor to gain access to a system or network. Vulnerability information may include the potential impact, technical details, exploitability and the availability of an exploit, affected systems, platforms, and versions, as well as mitigation strategies. A common schema to rate the seriousness of a vulnerability is the common vulnerability scoring system (CVSS) (Scarfone and Mell 2009), which combines the enumerated details into a comparable metric. There are numerous web platforms that maintain lists of vulnerabilities, such as the common vulnerabilities and exposures (CVE) database from MITRE (https://cve.mitre.org/) and the national vulnerability database (NVD) (https://nvd.nist.gov/). It is important to note that the impact of vulnerabilities usually needs to be interpreted for each organization (and even each system) individually, depending on the criticality of the affected systems for the main business processes.
• Cybersecurity best practices: include commonly used cybersecurity methods that have demonstrated effectiveness in addressing classes of cyber threats. Some examples are response actions (e.g., patch, configuration change), recovery operations, detection strategies, and protective measures. National authorities, CERTs, and large industries frequently publish best practices to help organizations build up an effective cyber defense and rely on proven plans and measures.
• Courses of action (CoAs): recommended actions that help to reduce the impact of a threat. In contrast to best practices, CoAs are very specific and shaped to a particular cyber issue. Usually CoAs span the whole incident response cycle, starting with detection (e.g., add or modify an IDS signature), containment (e.g., block network traffic to a command and control server), recovery (e.g., restore a base system image), and protection from similar events in the future (e.g., implement multifactor authentication).
• Tools and analysis techniques: this category is closely related to best practices but focuses more on tools than on procedures. Within a community, it is desirable to align the used tools with each other to increase compatibility, which makes it easier to import/export certain types of data (e.g., IDS rules). Usually there are sets of recommended tools (e.g., log extraction/parsing/analysis, editors), useful tool configurations (e.g., capture filters for network protocol analyzers), signatures (e.g., custom or tuned signatures), extensions (e.g., connectors or modules), code (e.g., algorithms, analysis libraries), and visualization techniques.

Interpreting, contextualizing, and correlating information from different categories allows the comprehension of the current security situation and therefore provides threat intelligence (Skopik 2017). Independent of the type, there are common desired characteristics that make cyber threat intelligence applicable. Specifically, CTI should be obtained in a timely manner, to allow security operation centers to act promptly; it should be relevant in order to be applicable to the pertaining operational environment; it should be accurate (i.e., correct, complete, and unambiguous) and specific, providing a sufficient level of detail and context; finally, CTI is required to be actionable, i.e., to provide or suggest an effective course of action.
Future Directions for Research

The following section discusses future directions for research in the fields of intrusion detection and CTI, as well as how both might profit from each other. In the field of intrusion detection, a lot of research and development has already been done on signature-based IDS, with a focus on black-listing. But today's growing interconnection and the associated dependence on digital devices invite sophisticated and tailored attacks, carried out by smart attackers who flexibly adjust their behavior to the circumstances they face and therefore cannot be revealed and stopped by signature-based IDS and black-listing approaches, which are only capable of revealing known attack vectors. Thus, future research in cybersecurity related to big data tends toward anomaly-based intrusion detection and white-listing. The main challenge here will be reducing the usually high number of false positives, which create a great effort for security analysts. Additionally, because networks grow fast and hardware devices and software versions change quickly, smart self-learning algorithms are required so that IDS can adapt themselves to new situations without learning malicious system/network behavior, and so that the effort that has to be invested in configuration is kept to a minimum.

One solution to the aforementioned problems can be CTI. There exists already a considerable amount of research on how findings and observables, also from IDS outputs, can be combined to provide insights into cybersecurity threats and attacks and finally correlated to gain CTI (Skopik 2017). However, since anomalies are just data without any context and therefore do not carry any intrinsic information, the detected anomalies have to be interpreted first. To fulfill this task properly and in a timely manner, intrusion detection should be enriched with existing CTI. Thus, sharing acquired knowledge is mandatory. The "pyramid of pain" (Bianco 2014), shown in Fig. 1, demonstrates how much effort is required to acquire certain information, starting from a contextless anomaly. The goal is that stakeholders who demand a high level of security for their network infrastructures, and are also able to afford the resources to analyze the large number of alerts generated by AD-based IDS solutions, use their acquired knowledge to define signatures. These signatures, i.e., shareable and timely intelligence, can then be shared with further organizations on a broad basis. Hence, industry players beyond specialized security solution providers should be included in threat intelligence creation to enable a faster reaction against new threats across whole industry sectors.

Finally, to reduce the effort of dealing with the large and quickly evolving amounts of data concerning cybersecurity, novel smart big data approaches will be required that make use of collected CTI to support and steer the configuration and deployment of cybersecurity tools and to automate as many tasks as possible, since the amount of generated data and the number of cyber threats and attacks keep growing continuously.

Big Data for Cybersecurity, Fig. 1 The pyramid of pain (Bianco 2014)
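The self-learning, white-listing direction outlined above can be illustrated with a toy detector that learns a baseline of normal event types and flags every deviation. The class, event model, and names are invented for illustration; a real anomaly-based IDS is of course far more involved:

```python
from collections import Counter

class WhitelistAnomalyDetector:
    """Toy white-list detector: everything outside the learned baseline
    of normal behavior is reported as an anomaly."""

    def __init__(self):
        self.baseline = Counter()  # event type -> frequency seen in training

    def train(self, events):
        # The training window must be attack-free; otherwise malicious
        # behavior is learned as "normal" -- the pitfall noted above.
        self.baseline.update(events)

    def is_anomalous(self, event):
        return event not in self.baseline

detector = WhitelistAnomalyDetector()
detector.train(["login_ok", "file_read", "logout", "login_ok"])
detector.is_anomalous("file_read")        # False: matches the baseline
detector.is_anomalous("registry_tamper")  # True: deviation -> raise an alert
```

The false-positive problem discussed above shows up immediately in such a sketch: any benign event type simply absent from the training window would also be flagged, which is why enriching alerts with CTI context is attractive.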
Cross-References

Big Data in Network Anomaly Detection
Exploring Scope of Computational Intelligence in IoT Security Paradigm
Network Big Data Security Issues

References

Anderson JP (1980) Computer security threat monitoring and surveillance. Technical report 17. James P. Anderson Company, Fort Washington
Axelsson S (2000) Intrusion detection systems: a survey and taxonomy. Technical report
Baum LE, Eagon JA (1967) An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull Am Math Soc 73(3):360–363
Bianco D (2014) The pyramid of pain. detect-respond.blogspot.com/2013/03/the-pyramid-of-pain.html
Buczak AL, Guven E (2016) A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun Surv Tutorials 18(2):1153–1176
Cannady J (1998) Artificial neural networks for misuse detection. In: National information systems security conference, pp 368–381
Chairman of the Joint Chiefs of Staff (2013) Joint publication 2-0. Joint intelligence. Technical report. http://www.dtic.mil/doctrine/new_pubs/jp2_0.pdf
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv (CSUR) 41(3):15
Chen M, Mao S, Liu Y (2014) Big data: a survey. Mob Netw Appl 19(2):171–209
Dainotti A, Pescapé A, Ventre G (2007) Worm traffic analysis and characterization. In: IEEE international conference on communications, ICC'07. IEEE, pp 1435–1442
Garcia-Teodoro P, Diaz-Verdejo J, Maciá-Fernández G, Vázquez E (2009) Anomaly-based network intrusion detection: techniques, systems and challenges. Comput Secur 28(1):18–28
Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS One 11(4):e0152173
Heckerman D et al (1998) A tutorial on learning with Bayesian networks. Nato Asi Ser D Behav Soc Sci 89:301–354
Kim J, Radhakrishnan S, Dhall SK (2004) Measurement and analysis of worm propagation on Internet network topology. In: Proceedings of 13th international conference on computer communications and networks, ICCCN 2004. IEEE, pp 495–500
Liao HJ, Lin CHR, Lin YC, Tung KY (2013) Intrusion detection system: a comprehensive review. J Netw Comput Appl 36(1):16–24
NIST (2016) Guide to cyber threat information sharing. Special publication 800-150. Technical report
OASIS (2017) Structured threat information expression v2.0. https://oasis-open.github.io/cti-documentation/
Sabahi F, Movaghar A (2008) Intrusion detection: a survey. In: 3rd international conference on systems and networks communications, ICSNC'08. IEEE, pp 23–26
Safavian SR, Landgrebe D (1991) A survey of decision tree classifier methodology. IEEE Trans Syst Man Cybern 21(3):660–674
Scarfone K, Mell P (2007) Guide to intrusion detection and prevention systems (IDPS), NIST special publication, vol 800. Department of Commerce, National Institute of Standards and Technology, Gaithersburg, p 94
Scarfone K, Mell P (2009) An analysis of CVSS version 2 vulnerability scoring. In: Proceedings of the 2009 3rd international symposium on empirical software engineering and measurement. IEEE Computer Society, pp 516–525
Skopik F (2017) Collaborative cyber threat intelligence: detecting and responding to advanced cyber attacks at the national level. CRC Press, Boca Raton
Steinwart I, Christmann A (2008) Support vector machines. Springer, New York
Tankard C (2011) Advanced persistent threats and how to monitor and deter them. Netw Secur 2011(8):16–19
Vacca JR (2013) Managing information security. Elsevier, Amsterdam/Boston/Heidelberg
Whitman ME, Mattord HJ (2012) Principles of information security, 4th edn. Course Technology, Cengage Learning, Stamford
Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Cambridge
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678


Big Data for Health

Valerio Persico
University of Napoli "Federico II", Naples, Italy

Synonyms

Healthcare; Well-being
Big Data for Health 245
constraints, has found fundamental pillars in smartphones and mobile apps, these tools being able to gather and deliver different kinds of information (e.g., streams of physiological signs monitoring results or public health information);
• Smart health integrates ideas from ubiquitous computing and ambient intelligence, where related processes generate large amounts of data collected over a network during daily life, data that is potentially valuable to monitor, predict, and improve patients' physical and mental conditions;
• Personalized health, sometimes referred to as adaptive health, is user-centric, i.e., it aims to make patient-specific decisions rather than stratifying patients into typical treatment groups.

The remainder of the entry is composed of three main parts discussing, respectively:

• the main sources and types of information that compose the health-related big data scenario;
• the most significant use cases and applications of big data in the health domain;
• the related opportunities and arising issues.

Characterization of Health-Related Big Data

Larger and larger amounts of data are becoming available to organizations (working at the global, national, or regional level), both gathered from novel applications (e.g., health and wellness monitoring) and obtained by applying analysis techniques to already existing – but not yet fully capitalized – information (e.g., clinical reports). Worldwide digital healthcare data was estimated at 500 petabytes in 2012 and is expected to reach 25,000 petabytes in 2020 (Cyganek et al. 2016; Sun and Reddy 2013). Being aware of the sources as well as the types and characteristics of this data is fundamental for understanding the applications that rely on it, as well as the still untapped opportunities.
Healthcare-related big data can come from both sources internal to the organization (such as health records and clinical decision support systems) and sources external to it (such as government sources, laboratories, pharmacies, insurance companies, and health maintenance organizations).
Two primary types of health-related big data can be identified: structured data, i.e., data with a defined data model (e.g., clinical databases), and unstructured data, with no defined schema or explicit semantics (e.g., natural-language text as in doctor prescriptions). The latter require specialized analytical methods to extract the information they contain and transform it into computable form. However, it is worth noting that this categorization should not be taken as absolute: for instance, structured databases may contain unstructured information in practice (e.g., free-text fields), and unstructured data may in fact have some structure (e.g., metadata associated with an image or markup embedded in a document).
Up to 80% of healthcare data has been estimated to consist of rich unstructured data that is either only partially exploited or not leveraged at all. Only the remaining 20% is estimated to reside in automated systems, such as hospital, radiology, and laboratory information systems (Fernandes et al. 2012). Accordingly, technological solutions that either automatically interpret such resources or assist the people tasked with interpreting them are critical, as they allow scaling up the analysis and considering a much broader set of data.
The main sources of the health-related information available today are:

• clinical data;
• medical and clinical references and publications;
• omics data;
• streamed data;
• social networking and web data;
• business, organizational, and external data.

Each of the categories above is briefly discussed in the following.
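The structured/unstructured distinction drawn above can be made concrete with a minimal sketch: a hypothetical free-text note is parsed into a structured record with regular expressions. The note format, field names, and patterns below are illustrative assumptions, not a clinical standard, and real notes would require far more robust natural-language processing.

```python
import re

# Hypothetical free-text note (illustrative only, not a real record format).
note = "Patient John Doe, 64y. BP 142/91 mmHg. Prescribed metformin 500 mg twice daily."

# Patterns for the structured fields we try to recover from the free text.
patterns = {
    "age":       r"(\d+)y\b",
    "systolic":  r"BP (\d+)/\d+",
    "diastolic": r"BP \d+/(\d+)",
    "drug":      r"Prescribed (\w+)",
    "dose_mg":   r"(\d+) mg",
}

def structure(text):
    """Turn an unstructured note into a structured record (a dict)."""
    record = {}
    for field, pat in patterns.items():
        m = re.search(pat, text)
        if m is None:
            record[field] = None          # field not found in the free text
        else:
            val = m.group(1)
            record[field] = int(val) if val.isdigit() else val
    return record

print(structure(note))
# → {'age': 64, 'systolic': 142, 'diastolic': 91, 'drug': 'metformin', 'dose_mg': 500}
```

The sketch illustrates why the categorization is not absolute: the same physiological facts live as prose in the note and as typed fields in the extracted record.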
by the different NGS applications makes it possible to gain more significant biological insights toward a complete understanding of the mechanisms driving gene expression changes in human genetic pathologies, rather than limiting the analysis to the interpretation of single data sets (Costa et al. 2013).

Streamed Data
Personal devices – mainly in the form of smartphones – have significantly penetrated society in a relatively short period of time. In addition, progress in wireless sensors, mobile communication technology, and stream processing makes it possible to design and implement body area networks (BANs) consisting of intelligent devices (e.g., low-power, miniaturized, non-invasive, and lightweight wireless sensors, attached on or implanted in the body) able to monitor either human body functions or the surrounding environment. These devices allow people's health status to be monitored in real time, generating large amounts of data (Chen et al. 2014; Ullah et al. 2012).
Thanks to the above enablers, large amounts of heterogeneous medical information have become available in healthcare organizations, since smart and connected healthcare devices are increasingly adopted and contribute to generating streams of structured and unstructured data.

Social Networking and Web Data
Social media technologies are significantly changing the practice of medicine: social media sources, such as online discussion forums, social networking sites and feeds, as well as more general web resources (e.g., blogs and even search query logs), are proving to be valuable information resources (Waxer et al. 2013).
In more detail, according to recent surveys, social media are reshaping the nature of health-related interactions, changing the way healthcare practitioners review medical records and share medical information.
In addition, the social network perspective provides a set of methods for analyzing the structure of social entities, as well as a variety of theories to explain these structures, whose study may lead to identifying local and global patterns, locating influential entities, and examining network dynamics (Wasserman and Faust 1994).

Business, Organizational, and External Data
Other data, not strictly related to health, such as administrative, billing, and scheduling information, further enrich the list of potential sources, making it possible to implement studies that encompass not exclusively medical aspects.
Herland et al., in their review, identified four distinct levels at which health data can be categorized (Herland et al. 2014):

1. molecular-level data (e.g., used in bioinformatics);
2. tissue-level data (e.g., used in neuroinformatics);
3. patient-level data (e.g., used in clinical informatics);
4. population data (e.g., used in public health informatics).

According to the review, each study may leverage data from one or more levels to answer questions that can be clustered according to their nature into different groups, such as human-scale biology, clinical, or epidemic scale.
In the light of the overall characterization provided in this section – more specifically, due to its volume, velocity, and variety – the data available in the healthcare field satisfies the commonly agreed requirements to qualify as big data (Raghupathi and Raghupathi 2014). Newer forms of data (e.g., imaging, omics, and biometric sensor readings) are fueling the growth of the volume of health-related data; the velocity of gathered data has increased with regular monitoring processes (e.g., blood pressure readings, electrocardiograms, and glucose measurements); the variety of structured and unstructured data is evident, emphasizing the challenging aspects.
To be in line with a fourth characteristic introduced by some big data practitioners (i.e., veracity), big data, analytics, and outcomes have to be error-free. Finally, enormous value is envisioned in the predictive power of big data, due
have the potential to change the way medicine is taught and practiced.
In this context, the adoption of data mining techniques facilitates an unprecedented understanding of large healthcare data sets, helping researchers gain both novel and deep insights. Thus, new biomedical and healthcare knowledge can be uncovered, for instance, by generating scientific hypotheses from large experimental data sets, clinical databases, and the scientific literature, to support clinical decisions but also those related to administrative settings (Yoo et al. 2012).
In addition, the use of modeling techniques (e.g., logistic regression models, Bayesian networks, support vector machines, neural networks, and classification trees) has become highly developed. The obtained models are often leveraged to support the computation of risk scores (Martinez-Millana et al. 2015). However, the performance of such models is subject to various statistical and clinical factors (e.g., deficiencies in the baseline data or modeling methods used in the study, over-fitting, differences in patient characteristics, measurement methods, healthcare system particularities, or data quality).

Personalized Medicine
The availability of valuable big data is of fundamental importance for the implementation of the personalized health paradigm, which puts into practice the idea of personalized care and personalized medicine.
More specifically, P4 medicine is radically changing the idea of medicine itself, as it is intended to be predictive, preventive, personalized, and participatory (Sobradillo et al. 2011; Hood and Flores 2012).
It is based on a comprehensive understanding of the biology of each individual and – in contrast to the current approach of clustering patients into treatment groups according to their phenotype – aims at tailoring treatments to the individual's characteristics, considering the genetic information of each individual as the main data source.
A comprehensive understanding of an individual's own biology (e.g., personalized omics) is expected to address both health and disease and to impact predisposition, screening, diagnosis, prognosis, pharmacogenomics, and surveillance (Chen and Snyder 2013).
This approach is expected to cut global health budgets, both by reducing the need for hospitalization and the related costs and by minimizing the unnecessary adoption of drugs. This also leads to reducing the related side effects and avoiding the development of drug resistance (Nice 2016).
In this scenario, leveraging proper artificial intelligence solutions to handle and analyze the huge amount of available health data is crucial (Raghupathi and Raghupathi 2014).

Health and Wellness Monitoring
In the recent past, as physiological signal monitoring devices have become ubiquitous (e.g., IoT devices), there has been an increase in the use of telemetry and continuous physiological time series monitoring to improve patient care and management (Belle et al. 2015). These solutions carry unprecedented opportunities for personalized health and wellness monitoring and management, allowing for the creation and delivery of new types of cloud services (Jovanov and Milenkovic 2011) that can either replace or complement existing hospital information systems (Kulkarni and Ozturk 2011).
Streaming data analytics in healthcare consists in systematically using continuous time-varying signals, together with related medical record information. This information is usually analyzed through statistical, quantitative, contextual, cognitive, and predictive analytical disciplines. Enriching the data consumed by analytics makes the system more robust, helping balance the sensitivity and specificity of the predictive analytics.
This kind of application requires a platform with enough bandwidth to handle multiple data sources, designed for streaming data acquisition and ingestion. According to Patel et al. (2012), systems for patients' remote monitoring consist of three main building blocks:
• the sensing and data collection hardware to gather physiological and movement data (e.g., tiny on-body or in-body biosensors and battery-free RFID tags);
• the communication hardware and software to relay data to a remote center;
• the data analysis techniques to extract clinically relevant information from physiological and movement data.

Several monitoring solutions gather clinical information (such as position, temperature, or breathing frequency) via body sensors or mobile devices and integrate it in the cloud, leveraging cloud characteristics such as scalable processing capability, seemingly infinite storage, and high availability (Santos et al. 2016; Thilakanathan et al. 2014; Hossain and Muhammad 2015). Often, the cloud represents a flexible and affordable solution to overcome the constraints imposed by either sensor technologies or mobile devices, and therefore a growing number of proposals take advantage of cloud capabilities to remotely off-load computation- or data-intensive tasks or to perform further processing activities and analyses (Zhang et al. 2014; Tong et al. 2014; Chen 2014; Wang et al. 2014; Wan et al. 2013).
A variety of signal processing mechanisms can be utilized to extract a multitude of target features. Their specifics largely depend on the type of disease under investigation. Integrating these dynamic streamed data with static data from the EHR is a key component in providing situational and contextual awareness for the analytics engine.

Medical Image Processing
Medical images are a frequent and important source of data used for diagnosis, therapy assessment, and planning. They provide important information on anatomy and organ function, in addition to detecting disease states. From a data dimension point of view, medical images may have 2, 3, or 4 dimensions, as medical imaging encompasses a wide spectrum of methodologies to acquire images (e.g., computed tomography, magnetic resonance imaging, X-ray, molecular imaging, ultrasound, photoacoustic imaging, fluoroscopy, positron emission tomography-computed tomography, and mammography). The volume of medical images is growing exponentially, and therefore such data requires large storage capacities.
As both the dimensionality and the size of data increase, understanding the dependencies among the data and designing efficient, accurate, and computationally effective methods demand new computer-aided techniques and platforms. Often, image processing techniques such as enhancement, segmentation, and denoising, in addition to machine learning methods, are needed. In addition, efforts have been made toward collecting, compressing, sharing, and anonymizing medical data.
The integration of medical images with other types of electronic health record (EHR) data and with genomic data can also improve the accuracy and reduce the time taken for a diagnosis.

Opportunities and Issues

An increasing number and variety of organizations are beginning to apply big data to address multiple healthcare challenges, transforming data into information, supporting research, and assisting providers in improving patient care. Notwithstanding the unprecedented benefits and opportunities, the adoption of big data analytics in the health scenario is strictly related to (and even amplifies) a number of typical ICT issues, extending them to this critical domain. Therefore, healthcare practitioners commonly face difficulties in managing and capitalizing on this huge amount of heterogeneous data to their advantage. According to the scientific literature, data privacy and security, system design and performance, data and systems heterogeneity, and data management are the most recurring issues to be addressed.
Considering the only recent emergence of big data analytics in healthcare, governance issues including security, privacy, and ownership have yet to be fully addressed (Jee and Kim 2013; Kruse
et al. 2016). Security, privacy, and confidentiality of the data from individuals are major concerns.
The availability of real-time analytics procedures is a key requirement for a number of applications, and it is often achieved by relying on cloud platforms, able to provide proper reliability with uninterrupted services and minimum downtimes (Ahuja et al. 2012).
Proper data management also represents an open issue. After data generation, costs are incurred for storing, securing, transferring, and analyzing it (Kruse et al. 2016). Indeed, data generation is inexpensive compared with the storage and transfer of the same data (Costa 2014). The lack of proper skills further exacerbates these issues, representing a significant barrier to the implementation of big data analytics in the health domain.
While, on the one hand, the adoption of cloud-based solutions mitigates performance-related issues (also providing a good trade-off with costs), on the other hand, it raises concerns about data privacy and communication confidentiality, because of the adoption of external third-party software and infrastructures and the related risk of losing control over data (Ermakova et al. 2013).
Data heterogeneity and lack of structure are commonly considered obstacles and still represent open issues, as healthcare data is rarely standardized, often fragmented, or generated in legacy IT systems with incompatible formats. EHRs are known not to share well across organizational lines (Kruse et al. 2016). More specifically, biological and medical data are extremely heterogeneous (more than information from any other research field) (Costa 2014). Additional challenges arise when filling the gap between the low-level sensor output and the high-level user requirements of the domain experts. As a result, continuous data acquisition and data cleansing must be performed.
Finally, health-related data analytics is also accompanied by significant ethical challenges, ranging from risks to individual rights (e.g., concerns about autonomy) to demands for transparency and trust (e.g., in public health practice) (Vayena et al. 2015).
The adoption of big data in the health field is very promising and is gaining more and more popularity, as it is able to provide novel insights from very large data sets as well as to improve outcomes while reducing costs. Its potential is great, but a number of challenges remain to be addressed.

Cross-References

Big Data Analysis in Bioinformatics
Big Data in the Cloud
Big Data Technologies for DNA Sequencing
Cloud Computing for Big Data Analysis
Data Fusion
Privacy-Preserving Data Analytics
Secure Big Data Computing in Cloud: An Overview

References

Ahuja SP, Mani S, Zambrano J (2012) A survey of the state of cloud computing in healthcare. Netw Commun Technol 1(2):12
Belle A, Thiagarajan R, Soroushmehr S, Navidi F, Beard DA, Najarian K (2015) Big data analytics in healthcare. BioMed Res Int 2015:370194
Berner ES, La Lande TJ (2007) Overview of clinical decision support systems. Springer, New York, pp 3–22. https://doi.org/10.1007/978-0-387-38319-4_1
Chang H, Choi M (2016) Big data and healthcare: building an augmented world. Healthcare Inf Res 22(3):153–155
Chen M (2014) NDNC-BAN: supporting rich media healthcare services via named data networking in cloud-assisted wireless body area networks. Inf Sci 284:142–156
Chen M, Mao S, Liu Y (2014) Big data: a survey. Mobile Netw Appl 19(2):171–209
Chen R, Snyder M (2013) Promise of personalized omics to precision medicine. Wiley Interdiscip Rev Syst Biol Med 5(1):73–82
Costa FF (2014) Big data in biomedicine. Drug Discov Today 19(4):433–440
Costa V, Aprile M, Esposito R, Ciccodicola A (2013) RNA-Seq and human complex diseases: recent accomplishments and future perspectives. Eur J Hum Genet 21(2):134–142
Cowie MR, Bax J, Bruining N, Cleland JGF, Koehler F, Malik M, Pinto F, van der Velde E, Vardas P (2015) e-Health: a position statement of the European Society of Cardiology. Eur Heart J 37(1):63–66. https://doi.org/10.1093/eurheartj/ehv416
Cyganek B, Graña M, Krawczyk B, Kasprzak A, Porwik P, Walkowiak K, Woźniak M (2016) A survey of big data issues in electronic health record analysis. Appl Artif Intell 30(6):497–520
Ermakova T, Huenges J, Erek K, Zarnekow R (2013) Cloud computing in healthcare – a literature review on current state of research. In: Proceedings of AMCIS 2013
Fernandes LM, O'Connor M, Weaver V (2012) Big data, bigger outcomes. J AHIMA 83(10):38–43
Garets D, Davis M (2006) Electronic medical records vs. electronic health records: yes, there is a difference. Policy white paper. HIMSS Analytics, Chicago, pp 1–14
Herland M, Khoshgoftaar TM, Wald R (2014) A review of data mining using big data in health informatics. J Big Data 1(1):2. https://doi.org/10.1186/2196-1115-1-2
Holzinger A, Röcker C, Ziefle M (2015) From smart health to smart hospitals. Springer, Cham, pp 1–20. https://doi.org/10.1007/978-3-319-16226-3_1
Hood L, Flores M (2012) A personal view on systems medicine and the emergence of proactive P4 medicine: predictive, preventive, personalized and participatory. New Biotech 29(6):613–624. https://doi.org/10.1016/j.nbt.2012.03.004
Hossain MS, Muhammad G (2015) Cloud-assisted speech and face recognition framework for health monitoring. Mobile Netw Appl 20(3):391–399
Jee K, Kim GH (2013) Potentiality of big data in the medical sector: focus on how to reshape the healthcare system. Healthcare Inf Res 19(2):79–85
Jovanov E, Milenkovic A (2011) Body area networks for ubiquitous healthcare applications: opportunities and challenges. J Med Syst 35(5):1245–1254
Kruse CS, Goswamy R, Raval Y, Marawi S (2016) Challenges and opportunities of big data in health care: a systematic review. JMIR Med Inf 4(4):e38
Kulkarni P, Ozturk Y (2011) mPHASiS: mobile patient healthcare and sensor information system. J Netw Comput Appl 34(1):402–417. https://doi.org/10.1016/j.jnca.2010.03.030
Lander ES (2011) Initial impact of the sequencing of the human genome. Nature 470(7333):187–197
Lu Z (2011) PubMed and beyond: a survey of web tools for searching biomedical literature. Database 2011:baq036. https://doi.org/10.1093/database/baq036
Mardis ER (2011) A decade's perspective on DNA sequencing technology. Nature 470(7333):198–203
Martin-Sanchez F, Verspoor K (2014) Big data in medicine is driving big changes. Yearb Med Inform 9(1):14–20. PMID: 25123716
Martinez-Millana A, Fernández-Llatas C, Sacchi L, Segagni D, Guillén S, Bellazzi R, Traver V (2015) From data to the decision: a software architecture to integrate predictive modelling in clinical settings. In: 2015 37th annual international conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, pp 8161–8164
Murdoch T, Detsky A (2013) The inevitable application of big data to health care. JAMA 309(13):1351–1352. https://doi.org/10.1001/jama.2013.393
Nice EC (2016) From proteomics to personalized medicine: the road ahead. Expert Rev Proteomics 13(4):341–343. https://doi.org/10.1586/14789450.2016.1158107. PMID: 26905403
Omanović-Mikličanin E, Maksimović M, Vujović V (2015) The future of healthcare: nanomedicine and internet of nano things. Folia Medica Facultatis Medicinae Universitatis Saraeviensis 50(1):23–28
Patel S, Park H, Bonato P, Chan L, Rodgers M (2012) A review of wearable sensors and systems with application in rehabilitation. J Neuroeng Rehabil 9(1):1
Raghupathi W, Raghupathi V (2014) Big data analytics in healthcare: promise and potential. Health Inf Sci Syst 2(1):1
Santos J, Rodrigues JJ, Silva BM, Casal J, Saleem K, Denisov V (2016) An IoT-based mobile gateway for intelligent personal assistants on mobile health environments. J Netw Comput Appl 71:194–204. https://doi.org/10.1016/j.jnca.2016.03.014
Schmidt B, Hildebrandt A (2017) Next-generation sequencing: big data meets high performance computing. Drug Discov Today 22(4):712–717
Silva BM, Rodrigues JJ, de la Torre Díez I, López-Coronado M, Saleem K (2015) Mobile-health: a review of current state in 2015. J Biomed Inf 56:265–272. https://doi.org/10.1016/j.jbi.2015.06.003
Smolij K, Dun K (2006) Patient health information management: searching for the right model. Perspect Health Inf Manag 3(10):1–11
Sobradillo P, Pozo F, Agustí Á (2011) P4 medicine: the future around the corner. Archivos de Bronconeumología (English Edition) 47(1):35–40. https://doi.org/10.1016/S1579-2129(11)70006-4
Sun J, Reddy CK (2013) Big data analytics for healthcare. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, KDD'13. ACM, New York, pp 1525–1525. http://doi.acm.org/10.1145/2487575.2506178
254 Big Data Hardware Acceleration
clusters that help lead to better strategies and products. Automotive OEMs often analyze vast amounts of data to understand their customers' priorities when purchasing a car. In recent years, some of the other key developing technologies in the automotive industry, namely, autonomous driving, shared mobility, connected vehicles (V2V and V2X), smart interfaces, etc., have been thriving with the help of Big Data Analytics, making it an important topic to discuss. The following sections discuss the significance of Big Data for automotive OEMs and its role in the development of emerging technologies in the automotive industry.

Automotive Industry and Big Data
The automotive industry has been experiencing massive change and is still coping with the effects of the fourth industrial revolution. Among others, one of the greatest challenges is the burst of information that a company has to master to stay ahead of the market in technology, products, and consumer requirements.
Technology has brought cars to life. Cars now communicate and interact with humans just like any other robot. Based on the connected devices and the sensory data generated, our modern cars are turning into data factories. When all this data is combined, it creates immense value for all the players in the ecosystem.
However, when information exceeds a certain limit, humans and machines tend to become careless and slow. That is one of the reasons why we often tend to forget traffic rules on the street. But what if the information could be fed to the driver in bits and pieces, when it is needed and at the right time, to enable them to make the best decisions? Still, we are talking about petabytes of data coming from numerous sources, and it will be even more complex to decide when and what information to process and deliver.
Some studies (Deloitte 2014) show, perhaps surprisingly, that when a customer is making the first purchase of a car, brand image and price are the least considered factors. Among the top factors that influence a customer are usability and technology. However, for a second purchase, usually technology becomes the most important factor. According to another study (IBM Institute of Business Value 2014), technology is going to be the most important external force impacting the industry in the next decade.
Considering the size and the extent of the automotive industry, it would take months of hard labor to analyze and visualize such a huge amount of data before coming to a conclusion (IBM Institute of Business Value 2014). However, first it is necessary to understand where this data is coming from and to understand its relationships.

Data Generation and the Need of Big Data in Automotive Industry
Although the term Big Data has attracted huge attention in the automotive world, behind this big word there is a simple story. The above example is one simple demonstration of Big Data Analytics in the automotive industry. Industry leaders, engineers, consulting agencies, and analysts have always relied on huge amounts of data and information to clinch the leading position in the market. In the past, however, this data has been scattered across different locations.

Development Phase Data
In general, OEMs (and dealers) make immense efforts to collect actual field data to understand their customers better. Once a new vehicle has been developed, a single test drive can incur several terabytes of measurement data, and the amount of data during the development phase often exceeds the petabyte range (Bloem et al. 2014).

Consumer Phase Data
With the advancement of the Internet of Things and the rise of the shared economy, more and more devices are getting connected to cars. Traditionally these devices would be biometric wearables, home appliances, and audio-visual equipment. Introducing connectivity to automotive has opened new horizons for in-car services like GPS navigation, e-mail, and live music streaming, to mention a few.
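The test-drive data volumes cited above can be put in perspective with a back-of-envelope sketch. All parameters below (channel counts, sampling rates, camera setup) are illustrative assumptions, not figures from the cited studies; they merely show how terabyte-per-drive volumes arise once video sensors are logged alongside scalar signals.

```python
# Back-of-envelope estimate of raw data logged by one instrumented test
# vehicle. All parameters are illustrative assumptions.
CAN_CHANNELS = 100           # logged scalar signals (speed, temperatures, ...)
CAN_RATE_HZ = 100            # samples per second per signal
BYTES_PER_SAMPLE = 8         # one double-precision value per sample

CAMERAS = 4                  # forward/side cameras
FRAME_RATE = 30              # frames per second per camera
BYTES_PER_FRAME = 2_000_000  # ~2 MB per lightly compressed frame

def bytes_per_hour():
    """Raw bytes logged during one hour of driving."""
    scalar = CAN_CHANNELS * CAN_RATE_HZ * BYTES_PER_SAMPLE * 3600
    video = CAMERAS * FRAME_RATE * BYTES_PER_FRAME * 3600
    return scalar + video

tb_per_hour = bytes_per_hour() / 1e12
print(f"~{tb_per_hour:.2f} TB per hour of driving")
```

Under these assumptions the video channels dominate (roughly 0.86 TB per hour versus about 0.3 GB per hour for the scalar signals), so a multi-hour test drive plausibly reaches the "several terabytes" scale mentioned above.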
256 Big Data in Automotive Industry
The services sector around the car has evolved into a much more complex structure with many new business models:

(i) Infotainment system: The car of today and the future will turn into a massive infotainment industry. Online streaming, news, weather, direct mobile payments, insurance, and reservation services are a few services to mention, and they are already a billion-dollar industry. According to some conservative estimates, an average car will produce a few terabytes of data by 2020 (McKinsey and Company 2016). This data has enabled service providers to understand trends, behavior, and priorities down to the level of an individual customer.
(ii) Mobility service providers: Over the past few years, e-hailing services, vehicle sharing, and automated remote parking are some of the businesses that have experienced rapid growth with the success of the shared economy. Various companies are now defining location-based services based on the huge amounts of incoming data from the car for customer safety, resource optimization, better vehicle allocation, and customer management.
(iii) Insurance service providers: For a long period of time, insurance providers have been relying on aggregated data such as driver age, gender, brand, model, etc., to set their insurance package for the car and/or driver. However, usage-based insurance is one of the new innovations in the insurance industry. Insurance companies can now monitor mileage and driving behavior through the onboard diagnostics (OBD) interface in the car. Usage-based insurance is one of the top five innovations related to the connected car (SAS 2015).
(iv) Aftersales service: In the past, OEMs traditionally relied on reactive reliability statistics to improve the reliability of the vehicle and reduce downtime. But with the advent of Big Data technologies, this data has become much more valuable and has enabled car companies to switch to predictive analysis. As cars are becoming more and more intelligent, failures can now be predicted through data analysis of incoming data from the car sensors and radars (Pickhard 2016).

How Big Data Is Transforming the Automotive Industry
Accidents become unavoidable when there are millions of overspeeding drivers, faulty parts in the car, kids running into the street, or an unintentional glimpse at a phone beep. There are a number of other such reasons why the driver's attention is taken away, and it may cause a hazard. In such situations, the first thing an accident-affected person would want is to alert medical aid. The next important thing could be to inform the approaching vehicles about the incident so that they may avoid any disruption. This would also be critical information for the navigation system of the ambulance to take the shortest route and reach in time to save lives at the location. At times this road information is also important for other departments, like the road and traffic management departments that design the roads and manage the traffic, to ensure smooth traffic and public safety. This data would also be of immense utility for the automaker, who could have identified or predicted the problem with the part before the incident happened.

Most of this information is of importance for city planners. It can enable them to create a whole ecosystem by networking all these building blocks of our transportation system into a standard traffic protocol and create solutions. The overall characteristics of these solutions can be classified under three headings: revenue generation, cost reduction, or safety and security enhancement. However, this incoming information may be too overwhelming for the few policymakers and thus would require countless man-hours to sort out the logic for making a common policy that might fit all in our transportation ecosystem.

Often these solutions are drivers for breakthrough technologies that may eventually become the trend for the future automotive industry. Some of these trends are discussed below in light of their implications for Big Data; however, this development is also reliant on the other associate
technologies, called enablers, that will be discussed along the way for each case.

(a) Development phase: Data can be a vital resource to refine or even redefine how our future vehicles are designed. By mining data from repair shops, data from the embedded sensors in the car, and many other sources, engineers now understand that they can reduce engineering cycles, optimize energy usage, and reduce development costs. This is enabling car companies to roll out new models that are reliable, sustainable, and innovative. Through data analysis, OEMs are learning trends about customers from various regions and preparing cars that are customized for a specific region; e.g., Chinese customers may prefer an SUV when choosing a normal passenger car. Information about the customer to such an extent would not have been possible without Big Data. The engineers, on the other hand, can now also process the data and optimize the production plan to ensure the availability of the different models within the deadlines. OEMs can also collect data from the car to reduce the downtime of the vehicle (McKinsey and Company 2016).
(b) Market and customer: In the past, OEMs traditionally relied on reactive reliability statistics to improve the aftersales of the car. However, these days OEMs can define an overall strategy in advance by performing proactive diagnostics. A customer who may have decided to change a faulty part at a later time due to the wait time can now be protected from further traffic hazards. On the other hand, OEMs can improve the life expectancy of a vehicle by providing superior parts based on the weather and terrain requirements of different regions. Based on this data, OEMs can redesign their marketing strategies according to the regions. A customer in the hot desert will have very different requirements compared to a person in the frozen mountains, and the vehicle would have very different requirements for the engine oil, fuel, or other parts. The customer data and feedback can also reflect the vehicle's performance in different terrains and what the future improvements in the R&D phase could be. Although these benefits have been proven in theory, there is still a long way to go before the infrastructure is in place.
(c) Insurance and security: A single car may produce huge amounts of data each day, and valuable clues regarding the performance and health of the vehicle are hidden in the data generated from the car. As for the scenario described above, it could be a major challenge for insurance companies to correctly identify responsibility and estimate damage. For now, usage-based insurance has proved to be a feasible solution. Real-time speed and location data generated from the car, for instance, can help determine the cause and location of an accident. Big Data analysis has significantly lowered the payments that customers had to make to insurance companies in the past.

The data collected from an accident scene can be carefully analyzed to find reasonable innovations that could be made to ensure the safety of future users. Enhanced car safety features can save many lives that would otherwise be lost. Through data analysis, companies now offer strong laminated windshields that do not shatter in the case of an accident, and modernized airbags inflate just before the impact to protect the driver from head injury.

(d) Autonomous drive: Autonomous driving makes the headline of every news article, and its value is well known. For now, driver-assistance systems, collision prevention, and lane keeping are some of the features that define our so-called autonomous vehicles. But in the future, Big Data analysis will play a vital role in making vehicles smarter. Huge amounts of traffic and roadside assistance data are fed to the car to enable it to successfully maneuver through densely populated cities. The arrival of fully autonomous vehicles will not only reshape mobility but also the way we humans see the car.
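How on-board sensing and connected-infrastructure feeds could combine into a driver-assistance alert can be illustrated with a toy rule. Every threshold, field name, and message format below is invented for illustration only and does not reflect any OEM's actual logic:

```python
# Toy fusion of an on-board time-to-collision check with warnings
# relayed from a (hypothetical) edge-cloud feed.

def hazard_alerts(speed_kmh, obstacle_distance_m, edge_cloud_warnings):
    """Combine a local time-to-collision rule with relayed roadside warnings."""
    alerts = []
    if obstacle_distance_m is not None and speed_kmh > 0:
        time_to_collision = obstacle_distance_m / (speed_kmh / 3.6)  # seconds
        if time_to_collision < 2.0:  # toy 2-second threshold
            alerts.append("collision warning")
    for warning in edge_cloud_warnings:  # hazards pushed by the edge cloud
        alerts.append(f"roadside: {warning}")
    return alerts

# At 72 km/h (20 m/s), an obstacle 30 m ahead leaves 1.5 s to collision,
# below the toy threshold, so a collision warning is raised.
print(hazard_alerts(72, 30, ["pedestrians on road"]))
```

The point of the sketch is only that the decision needs both data sources: the on-board sensors alone would miss the hazards reported by the surrounding infrastructure.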
The interior of the car may need to be fully redefined based on human activities. We have already seen some of the concepts presented at auto shows by different automakers (Al Najada and Mahgoub 2016). As the data generated in the car will no longer be just about engine speed or location, there may be a need to redefine the data clusters (Al Najada and Mahgoub 2016).

(e) Connected cars and Internet of things (IoT): IHS Automotive forecasts that there will be 152 million actively connected cars on global roads by 2020 and that it will be a 40-billion-dollar industry. This makes connected cars much more interesting for OEMs and for startups. 5G technologies, with 10 Gbps download speeds and near-zero latency, will bring in the true connected vehicle; however, 4G-edge is serving the purpose at the moment. This means that in the future, the car will be able to share data in real time. The incoming data can be analyzed to enable many use cases. In-car entertainment features have already been in the car for some time now. New streaming services will offer online games, movies, and more, based on the behavior of the user, to bring customized content to the user. Syncing this information with home devices will continue to provide the customer with the same experience from the car to the home. Through navigation analytics, a car will be able to actuate connected devices at home or in the office and route the way by analyzing the most updated weather and traffic conditions, keeping in view the previous preferences of the driver. Thus, Big Data Analytics will be an increasingly important part of the car once it can fully integrate IoT and realize V2X communication.

Traditional Ways to Store and Process Data in the Automotive Industry
If we think about the data vehicles produce, we realize it is a very complex ecosystem. Data from cars, service providers, weather reports, and the environment all need to be processed and stored simultaneously. Traditionally, automotive companies spend hundreds of man-hours and then collect data on large servers placed at various centralized locations. These data storage techniques may work fine for small-scale ecosystems. However, in the future, with the complex ecosystem of autonomous driving requiring global monetization of car data in real time, such techniques might fail to serve. Only then will a car manufactured in China be able to operate fully autonomously in the USA or other parts of the world. What needs to be understood is that technology is actually removing the boundaries among different territories for the automotive industry.

Although the relationship of data clusters is well defined to realize autonomous driving and connected cars in the future, the challenge still remains: how do you store the data in order to retrieve and process it in the shortest viable time? Since autonomous cars require data processing in almost real time, effective Big Data processing algorithms (as in Daniel et al. (2017)) can be one possible solution; however, the car industry will be looking to even more effective means, such as a combination of edge and cloud computing (Liu et al. 2017), as shown in Fig. 1.

At times the sensors on board are sufficient to detect and handle certain road situations, yet they are unable to make the car respond to the bigger ecosystem. Upcoming hazards, weather conditions, traffic jams, roadside assistance, etc., are yet to be realized through a connected network. This information is crucial and must be provided spontaneously to enable the driver to make better decisions on the road (Sherif et al. 2017). This concept proposes sending certain information, such as humans on the road or a traffic hazard, directly to the edge clouds, which can then provide it to the car with negligible lag. The automotive industry will require this and other similar concepts so as not to always rely on Internet speed for faster data processing and transfer.

Future Opportunities and Challenges for Big Data in the Automotive Industry
In the complex work environment of a car, it is essential for in-vehicle measurement to be complete and for time synchronization of measurement
[Big Data in Automotive Industry, Fig. 1: Edge-cloud architecture – a parking house, traffic lights, and roadside infrastructure connected through LTE, LTE V2P, and NB-IoT links to an edge cloud and a backend]
signals from different sources. The sensors and stereo cameras on a future car, for instance, will stream large amounts of raw data during the journey. However, this incoming data must be processed in real time (Amini et al. 2017). Although it sounds too ambitious to shorten the testing days and measurement campaigns this way, it is realistic. With modern Big Data Analytics techniques, the calibration of the automobile could be done in virtual environments, and this would save many hours of work. To enhance the electronic and performance capabilities of cars, Big Data Analytics will be the most logical solution.

Another study (SAS 2015) estimated that investments in Big Data would account for over $2.8 billion in 2017 and would grow at 12% over the next 3 years. It has surely been one of the biggest enablers of the technological revolution in the automotive industry and has successfully helped to create many disruptive business models and determine new fields of innovation and research. Applying advanced data analytics across the manufacturing value chain has generated exponential value. Well-defined sets of data have been utilized to perform analytics such as customer behavior analytics, market mix analytics, supply chain optimization analytics, and predictive quality analytics. Executives now do better process risk analytics and set new and better KPIs based on customer data accumulated globally. This enables them to make better strategies for their companies and bring stronger and more competitive products to the market, with better technological features, through a deep understanding of the global market. Big Data has enabled globalization while keeping the element of personalization in future cars.

Although Big Data has created several business opportunities, certain challenges come along. Technically speaking, Big Data still relies on two things: proper data storage and computational power. While using mobility services, customers entrust mobility platforms with a considerable amount of data, including their personal information and financial details, which are saved on the respective enterprise servers. For the customer it is always of utmost concern how this data is treated by these enterprises. This issue will become even more complex once autonomous vehicles arrive and people are riding robot taxis. Autonomous cars will share numerous data contents with servers, which, if not sufficiently secured or if hacked, can cause significant damage to customer and company integrity (Sherif et al. 2017). Thus, incoming data requires proper storage and protection. The location of data storage also has a direct relation to computational power. Computational power can be evaluated from two aspects: computational speed and accuracy. Connection speed, algorithms, and data processing are among the many factors that may affect computational speed; however, the location of data storage is an equally important factor. Taking the data from one
260 Big Data in Cellular Networks
(i) Packet inspection methods search for predefined fingerprints that characterize services (e.g., protocol messages) in packet contents. These algorithms are recognized to be generally fast and precise, but their importance has been drastically reduced by the increasing deployment of encrypted protocols;
(ii) Machine learning methods automatically search for typical traffic characteristics of services. Supervised learning algorithms are usually preferred, in which features of the traffic, e.g., packet sizes and inter-arrival times, are used as input to train classification models describing the traffic. Neural networks, decision trees, and support vector machines are some examples of suitable learning algorithms. These methods are appealing since they can find complex patterns describing the services with minimal human intervention. However, they require large training datasets to build good-quality models. Moreover, it is hard to achieve general classification models. In other words, the models that classify traffic in one network may not be able to classify traffic from other networks. Data from different networks is usually needed to build generic models.
(iii) Behavioral methods identify services based on the behavior of network nodes – e.g., the servers that are typically reached by clients and the order in which servers are contacted. Also in this case, machine learning algorithms can help to discover patterns, e.g., learning rules that describe the behavior of nodes and outputting the most likely services running on the nodes. Behavioral methods require little metadata, thus working well with encrypted services. However, as before, they require large datasets for learning the hosts' behaviors, as well as training data from different networks, in particular if coupled with machine learning algorithms.

Anomaly Detection
Bhuyan et al. (2013) and García-Teodoro et al. (2009) describe diverse techniques for anomaly detection that are often employed in cyber-security tasks. For example, anomaly detection is used in network security contexts to detect traffic related to security flaws, viruses, or malware, so as to trigger actions to protect users and the network.

Anomaly detection for cyber-security builds on the assumption that malicious activities change the behavior of the network traffic, which is otherwise determined by interactions with legitimate services. Anomaly detection methods rely on models that summarize the "normal" network behavior from measurements, i.e., building a baseline. Traffic classification algorithms can be leveraged for learning the behavior of legitimate services, e.g., by building behavioral models that are able to identify popular services. Live traffic is then monitored, and alerts are triggered when the network behavior differs from the baseline according to predefined metrics.

Anomaly detection methods are attractive since they allow the early detection of unknown threats, e.g., zero-day exploits. These methods however may not detect stealth attacks, which are not sufficiently large to disturb the traffic. Moreover, the large number of Internet services represents a challenge for anomaly detection, since unknown services (that are possibly legitimate) might be considered anomalies as well. In other words, anomaly detection may suffer from false positives, i.e., false alerts triggered because of unknown services or transient changes in legitimate services.

Big Data Challenges on Network Monitoring
Network monitoring applications face all the challenges of typical big data applications.

Taking velocity as a first example: traffic classification and anomaly detection are often performed in high-speed networks, e.g., backbone links at Internet service providers (ISPs). The systems need to cope with tens or even hundreds of gigabits of traffic per second, since thousands of users are aggregated in such links. Many scalable network monitoring approaches have been proposed to overcome this challenge. Flow monitoring is one prominent example.
Big Data in Computer Network Monitoring 263
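The summarization behind flow monitoring can be sketched as follows. This is a minimal toy version: real NetFlow/IPFIX exporters also track timestamps, TCP flags, and expiration timeouts, none of which are modeled here.

```python
# Toy flow exporter: collapse individual packets into per-flow summaries
# keyed by the classic 5-tuple (src IP, dst IP, src port, dst port, protocol).

from collections import defaultdict

def export_flows(packets):
    """packets: iterable of (src_ip, dst_ip, src_port, dst_port, proto, size)."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for *five_tuple, size in packets:
        record = flows[tuple(five_tuple)]
        record["packets"] += 1
        record["bytes"] += size
    return dict(flows)

pkts = [
    ("10.0.0.1", "8.8.8.8", 51000, 53, "udp", 64),
    ("10.0.0.1", "8.8.8.8", 51000, 53, "udp", 128),
    ("10.0.0.2", "1.1.1.1", 44321, 443, "tcp", 1500),
]
print(len(export_flows(pkts)))  # three packets collapse into two flow records
```

Analysis applications then see only the per-flow counters, which is exactly the volume reduction that the flow exporters described below exploit.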
Instead of letting complex algorithms handle raw traffic directly, specialized equipment that implements basic preprocessing steps in hardware is deployed. This equipment is usually called a flow exporter. The flow exporter builds summaries of the traffic according to the flows seen active in the network. Flow exporters have been shown to scale well, processing more than 40 Gbps of network traffic with commodity servers (Trevisan et al. 2017). Analysis applications then operate on such flow-level summaries. For instance, machine learning algorithms have been shown to deliver promising results with flow-level measurements, despite the coarse granularity of the measurements – see Sperotto et al. (2010) for details.

Hofstede et al. (2014), however, show that even flow-based approaches are starting to be challenged by the ever-increasing network speeds. Table 1 illustrates the rate at which flow exporters would send out data from a vantage point when observing 50 Gbps of raw traffic on average. The table includes volumes estimated for different flow export technologies – i.e., Cisco's NetFlow v5 and NetFlow v9 as well as IPFIX, which is the IETF standard protocol to export flow-level information from the network.

The data rate leaving the flow exporter is remarkably smaller than the raw traffic stream. Nevertheless, applications processing only the flow summaries (e.g., for anomaly detection) would still have to handle more than 70 Mbps if IPFIX is chosen to export data. More than that, these numbers refer to a single vantage point, and there could be hundreds of vantage points in a midsized network. Whereas sampling is often used to reduce exported volumes (see the numbers for the 1:10 sampling rate), it reduces the value of the data, making it harder to run some monitoring applications.

The volume of measurements is another clear example of why network monitoring faces big data challenges. The availability of historical data is a strong asset for some monitoring applications; e.g., security applications that perform forensics can rely on historical data to detect whether users have been impacted by previously unknown attacks. Table 1 suggests that the archival of network measurements requires large storage capacity even if highly efficient formats such as compressed flow summaries are selected. Figure 1 quantifies this requirement by depicting the storage space needed to archive compressed flow measurements from vantage points monitoring around 30,000 end users. Note that these vantage points are relatively small, observing only a few Gbps of traffic in peak periods. The figure shows that the space occupied by the measurements grows almost linearly with time, as expected. After 4 years, the archival consumes almost 30 TB.

Around three billion flows have been reported by flow exporters in each month depicted in the figure. This shows how challenging it is to extract value from network measurements. Consider an anomaly detection system that triggers an event every time a set of flows is considered suspicious. The system would likely bother network administrators with false alerts even if only a single false alert were generated every tens of millions of flows.
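The figures above combine into a quick back-of-the-envelope calculation; the one-in-ten-million false-alert rate is an assumption chosen only to match the "tens of millions" order of magnitude mentioned in the text:

```python
# Back-of-the-envelope arithmetic with the numbers quoted in this section.

raw_bps = 50e9      # raw traffic observed at the vantage point (50 Gbps)
ipfix_bps = 75e6    # IPFIX export stream for the same traffic (Table 1)
print(f"flow export shrinks the stream ~{raw_bps / ipfix_bps:.0f}x")

flows_per_month = 3e9             # flows reported per month in Fig. 1's data
false_alerts_per_flow = 1 / 10e6  # assumed: one false alert per ten million flows
print(f"~{flows_per_month * false_alerts_per_flow:.0f} false alerts per month")
```

Even with a roughly 667-fold reduction from flow export and that optimistic alert rate, an operator would still face on the order of hundreds of false alerts per month from a single vantage point.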
Big Data Technologies in Network Monitoring

Big Data in Computer Network Monitoring, Table 1 Estimated stream of packets and bytes leaving a flow exporter that observes 50 Gbps on average – source (Hofstede et al. 2014)

Sampling  Protocol    Packets/s  Bits/s
1:1       NetFlow v5  4.1 k      52 M
1:1       NetFlow v9  10.2 k     62 M
1:10      NetFlow v9  4.7 k      27 M
1:1       IPFIX       12.5 k     75 M

Since network monitoring shows the characteristics of a big data problem, a natural question is whether researchers and practitioners are leveraging big data approaches in this context.

There is still a small number of academic works that actually leverage big data for network monitoring. This is somewhat surprising considering that big data technologies such as
[Big Data in Computer Network Monitoring, Fig. 1: Storage space (TB) occupied by compressed flow measurements, growing almost linearly by month from 01/2013 to 01/2018]
MapReduce are rather mature – in fact, MapReduce was introduced by Dean and Ghemawat (2004). Even considering the well-known weaknesses of some technologies, e.g., MapReduce's inefficiencies when dealing with iterative algorithms over large datasets, alternatives do exist, and they are now well-established too. A prominent example is Spark, proposed by Zaharia et al. (2010) precisely to overcome the weaknesses of MapReduce.

This raises questions on whether these technologies lack characteristics to fulfill the requirements of network monitoring applications. Roughly speaking, early adopters have mostly tried to adapt or extend existing big data technologies to better fit network monitoring applications.

To fully understand current efforts, it is convenient to put the ongoing research on big data approaches for network monitoring in the perspective of a knowledge discovery in data (KDD) process. The classical KDD process, attributed to Fayyad et al. (1996), is summarized as a sequence of operations over the data: gathering, selection, preprocessing, transformation, mining, evaluation, interpretation, and visualization. This schema is depicted in Fig. 2. Next, works applying big data approaches for network monitoring are discussed according to the covered KDD phases, which are clustered in two groups: data management (from data gathering until transformation) and analytics (i.e., the remaining operations).

Data Management
Marchal et al. (2014) introduce an intrusion detection system that relies on heterogeneous sources, e.g., NetFlow, DNS logs, and honeypots. Intruders are identified by ad hoc metrics that spot anomalies (possibly malicious) in the data. Queries that are generally common in network monitoring are solved to calculate the metrics, such as finding records containing a given string and records that pass particular filters – e.g., to count contacted destination IP addresses given a source address.

The authors benchmark different big data technologies, concluding that Spark and Shark answer queries faster than alternatives (e.g., Hive). They however highlight challenges to process the measurements: there is a shortage of solutions for ingesting, reading, and consolidating measurements in the distributed file systems supported by the frameworks.

Lee and Lee (2013) have also identified these problems. Network measurements are often captured and transferred in optimized binary formats that cannot be directly and efficiently stored in distributed databases and file systems. The authors focus on how to process raw PCAP and
[Big Data in Computer Network Monitoring, Fig. 2 Classical knowledge discovery in data (KDD) process – source (Fayyad et al. 1996): Data → Target data (Selection) → Preprocessed data (Preprocessing) → Transformed data (Transformation) → Patterns/Models (Data mining) → Knowledge (Interpretation/Evaluation)]
NetFlow traces in Hadoop. Since HDFS does not know how to split the PCAP and NetFlow formats, the processing of each file in Hadoop-based solutions cannot profit from parallelism and data locality to scale horizontally. The authors implement new Hadoop APIs to explicitly process PCAP and NetFlow archives, showing that the APIs improve performance.

These APIs are however specific to HDFS, PCAP, and NetFlow and allow only sequential reading of the archives in HDFS. While they certainly alleviate the problem, challenges persist if one needs advanced analytics or functionalities that are not able to interact with the APIs. For instance, when faced with the similar problem of ingesting heterogeneous measurements into big data frameworks, Wullink et al. (2016) created an extra preprocessing step to convert measurements from the original format (i.e., PCAPs in HDFS) into a query-optimized format (i.e., Parquet). This conversion is shown to bring massive improvements in performance, but only preselected fields are loaded into the query-optimized format.

These works make clear that big data frameworks are attractive for processing network measurements, thanks to built-in capabilities to handle the usually large volumes of measurements. Nevertheless, they shed light on how basic steps, like ingesting measurements into the frameworks, are by no means trivial. Those are not the only challenges. Other challenges immediately become evident when one considers that network monitoring applications often operate not only with batches of historical data (i.e., volume) but also with live measurement streams that need to be evaluated on the fly (i.e., velocity).

Bär et al. (2014) show that solutions targeting batch processing (e.g., Spark) present poor performance for such online analysis, and even stream-oriented alternatives (e.g., Spark Streaming) miss functionalities that are essential for network monitoring applications. The latter is also highlighted by Čermák et al. (2016), who mention the difficulty of recalculating metrics from old data as one of several weaknesses of such general stream processing engines. Bär et al. (2014) propose DBStream to calculate continuous and rolling analytics from online data streams. DBStream is thus built upon some ideas of big data frameworks but specialized to solve network monitoring problems. DBStream is still a centralized system, relying on a single-node SQL database, thus lacking the horizontal scalability features of a modern big data solution.

Noting a similar gap in stream analytics tools for BGP measurements, Orsini et al. (2016) introduce BGPStream. They point out that although BGP is a crucial Internet service that manages routing, efficient ways to process large BGP measurements in real time are missing, e.g., to detect anomalies and attacks in the routing service. BG-
PStream provides tools and libraries that connect live BGP data sources to APIs and applications. BGPStream can be used, for instance, to efficiently ingest online data into stream processing solutions, such as Spark Streaming.

Analytics
The management of network measurements, i.e., gathering, ingestion, etc., is just the starting point for network monitoring applications. Often the data analytics methods used by these applications have trouble handling big data – e.g., some machine learning algorithms do not scale well with large datasets or are hard to parallelize. Selecting the best algorithms for a specific network monitoring problem is a complex task. For instance, it is commonly accepted that there is no single best learning model for addressing different problems simultaneously. Indeed, even if multiple models could be very well suited for a particular problem, it is very difficult to understand which one performs best in an operational network where different data distributions and statistical mixes may emerge. Recently, Vanerio and Casas (2017) proposed a solution for network security and anomaly detection that is suitable for big data. Their method is based on the use of super learners, a class of ensemble learning techniques known to perform asymptotically as well as the best combination of the base learners.

Liu et al. (2014), Fontugne et al. (2014), and Wang et al. (2016) are other examples of authors who have started to implement analytics methods based on MapReduce or Spark to process large collections of measurements (e.g., for feature selection). However, work on big data analytics for network monitoring is still incipient, lacking comprehensive approaches that are applicable to more sophisticated cases.

Research Directions
From this short survey, it can be concluded that researchers have identified the potential of big data approaches for network monitoring, but current technologies seem to miss features required by network monitoring applications. It is also clear that there is work in progress to fill in this gap.

If one considers the KDD process in a big data context, there are several general big data frameworks and technologies that cover the initial stages, e.g., data gathering and preprocessing. The network monitoring community has clearly focused on these initial stages when measurements become big data. The community is on the way to answering questions such as how to ingest measurements into big data frameworks and how to process large streams of live measurements from different sources. However, these efforts are still somewhat isolated. Lessons learned need to be incorporated into the frameworks, so as to allow easier processing of other sources and the reuse of APIs and libraries for different network monitoring analytics. One recent attempt in that direction is Apache Spot (2017), which proposes an open data model to facilitate the integration of different measurements, providing a common taxonomy for describing security telemetry data and for detecting threats.

Since both online stream and offline batch analyses are needed for network monitoring, another research direction is toward the unification of the two programming models. For example, building upon the dataflow abstraction (Akidau et al. 2015), Apache Beam (2017) offers one such unified programming model. It can represent and transform both (i) batch sources and (ii) streaming sources, thus relieving programmers from the burden of keeping two code bases and increasing portability across execution engines.

As the challenges in the initial phases of the KDD process are tackled, researchers are moving toward subsequent KDD phases, such as by proposing data mining algorithms and visualization techniques suitable for big data. The recent progress in training deep learning models with big datasets is just one example of this trend. However, network monitoring applications that profit from this progress are almost nonexistent.

Casas et al. (2016) are working on Big-DAMA, one of the few initiatives that are
Big Data in Computer Network Monitoring 267
trying to face these challenges comprehensively. Big-DAMA aims at conceiving a new big data analytics framework specifically tailored to network monitoring applications. Big-DAMA will include scalable online and offline machine learning and data mining algorithms to analyze extremely fast and/or large network measurements, particularly focusing on stream analysis algorithms and online processing tasks. Big-DAMA also aims at creating new benchmarks that will allow researchers and practitioners to understand which of the many tools and techniques that have emerged in recent years suit network monitoring applications best.

Projects such as Apache Spot and Big-DAMA are at their early stages, and clearly there is still a long way to go. It seems, however, a matter of time until the network monitoring community fills in the gaps, fully exploiting the potential of big data approaches to improve network monitoring applications.

Cross-References

Big Data for Cybersecurity
Big Data in Mobile Networks
Big Data in Network Anomaly Detection

References

Akidau T, Bradshaw R, Chambers C, Chernyak S, Fernández-Moctezuma RJ, Lax R, McVeety S, Mills D, Perry F, Schmidt E, Whittle S (2015) The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc VLDB Endow 8(12):1792–1803
Apache Beam (2017) Apache Beam: an advanced unified programming model. https://beam.apache.org/
Apache Spot (2017) A community approach to fighting cyber threats. http://spot.incubator.apache.org/
Bär A, Finamore A, Casas P, Golab L, Mellia M (2014) Large-scale network traffic monitoring with DBStream, a system for rolling big data analysis. In: Proceedings of the BigData, pp 165–170
Bhuyan MH, Bhattacharyya DK, Kalita JK (2013) Network anomaly detection: methods, systems and tools. Commun Surv Tutorials 16(1):303–336
Callado A, Kamienski C, Szabó G, Gero BP, Kelner J, Fernandes S, Sadok D (2009) A survey on internet traffic identification. Commun Surv Tutorials 11(3):37–52
Casas P, D'Alconzo A, Zseby T, Mellia M (2016) Big-DAMA: big data analytics for network traffic monitoring and analysis. In: Proceedings of the LANCOMM, pp 1–3
Čermák M, Jirsík T, Laštovička M (2016) Real-time analysis of NetFlow data for generating network traffic statistics using Apache Spark. In: Proceedings of the NOMS, pp 1019–1020
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the OSDI, pp 10–10
Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. AI Mag 17(3):37–54
Fontugne R, Mazel J, Fukuda K (2014) Hashdoop: a MapReduce framework for network anomaly detection. In: Proceedings of the INFOCOM WKSHPS, pp 494–499
García-Teodoro P, Díaz-Verdejo J, Maciá-Fernández G, Vázquez E (2009) Anomaly-based network intrusion detection: techniques, systems and challenges. Comput Secur 28:18–28
Hofstede R, Čeleda P, Trammell B, Drago I, Sadre R, Sperotto A, Pras A (2014) Flow monitoring explained: from packet capture to data analysis with NetFlow and IPFIX. Commun Surv Tutorials 16(4):2037–2064
Lee Y, Lee Y (2013) Toward scalable internet traffic measurement and analysis with Hadoop. SIGCOMM Comput Commun Rev 43(1):5–13
Liu J, Liu F, Ansari N (2014) Monitoring and analyzing big traffic data of a large-scale cellular network with Hadoop. IEEE Netw 28(4):32–39
Marchal S, Jiang X, State R, Engel T (2014) A big data architecture for large scale security monitoring. In: Proceedings of the BIGDATACONGRESS, pp 56–63
Nguyen TT, Armitage G (2008) A survey of techniques for internet traffic classification using machine learning. Commun Surv Tutorials 10(4):56–76
Orsini C, King A, Giordano D, Giotsas V, Dainotti A (2016) BGPStream: a software framework for live and historical BGP data analysis. In: Proceedings of the IMC, pp 429–444
Sperotto A, Schaffrath G, Sadre R, Morariu C, Pras A, Stiller B (2010) An overview of IP flow-based intrusion detection. Commun Surv Tutorials 12(3):343–356
Trevisan M, Finamore A, Mellia M, Munafo M, Rossi D (2017) Traffic analysis with off-the-shelf hardware: challenges and lessons learned. IEEE Commun Mag 55(3):163–169
Valenti S, Rossi D, Dainotti A, Pescapè A, Finamore A, Mellia M (2013) Reviewing traffic classification. In: Data traffic monitoring and analysis – from measurement, classification, and anomaly detection to quality of experience, 1st edn. Springer, Heidelberg
268 Big Data in Cultural Heritage
sensor networks, social media, digital libraries and archives, multimedia collections, etc.);
• advanced data management techniques and technologies;
• system resources management;
• real-time and complex event processing for monitoring and security aims;
• ability to provide useful and personalized data to users based on their preferences/needs and context, together with advanced information retrieval services, data analytics, and other utilities, according to the SOA (service-oriented architecture) paradigm.

By means of a set of ad hoc APIs, and exploiting the discussed services, a CH big data platform has to be able to support several applications: mobile guides for cultural environments; web portals to promote the cultural heritage of a given organization; multimedia recommender and storytelling systems; event recognition for cultural heritage preservation and security; and so on.

Concerning technical features, a big data platform for CH applications has to be built on a layered architecture (see Fig. 1). In the following, we describe the responsibilities of each layer.

The resource management layer has to coordinate and optimize the use of the different resources needed for implementing the whole system, by means of the Resource Manager component.

The data source layer has as its main goal to manage the "ingestion" of all the different data coming from the heterogeneous data sources. The data ingestion component can operate in a continuous or asynchronous way, in a real-time or batched manner, or both, depending on the characteristics of the source. In addition, ingested data then has to be represented according to a given format.

The stream processing layer has to be used to provide real-time processing and event recognition activities on the data streams, using a stream processing engine and a complex event processor (Cugola and Margara 2015; Panigati et al. 2015).

The data storage and management layer has to store and manage data of interest in a proper knowledge base (KB) in compliance with a given data model, eventually exploiting the LD/LOD (linked data/linked open data) paradigm (possible data integration problems for heterogeneous data sources can be addressed by means of schema mapping, record linkage, and data fusion techniques).

In addition, specific semantics to be attached to the data can be provided using annotation schemata, including ontologies, vocabularies, taxonomies, etc., related to the cultural heritage domain.

The KB can leverage different data repositories realized by advanced data management technologies (e.g., distributed file systems, NoSQL, in-memory, and relational databases) and provides a set of basic APIs to read/write data via an access method manager.

The data processing layer has to provide a query engine that can be invoked by user applications. Its main task is the search of data of interest using information retrieval facilities, by means of proper information filters, for example:

• query by keyword or tags and using specific metadata of the annotation schemata,
• query by example for cultural items with images,
• ontology-based and query-expansion semantic retrieval,
• etc.

Finally, the data analytics layer is based on different analytics services allowing the creation of personalized "dashboards" for a given cultural environment. In addition, it provides basic data mining, graph analysis, and machine learning algorithms useful to infer new knowledge.

CHIS: A Big Data Platform for CH Applications

CHIS (cultural heritage information systems) (Castiglione et al. 2017) is the name of a big data platform whose prototype was realized within
Big Data in Cultural Heritage, Fig. 1 Architecture of a big data platform for CH applications. The figure shows the data source layer (data ingestion), the stream processing layer (stream processor engine and complex event processor), the data storage and management layer (knowledge base and access method manager), the resource management layer (resource manager), and the data analytics layer (analytics services and information filters)
the DATABENC project to support several CH applications. It allows users to query, browse, and analyze cultural digital contents from a set of distributed and heterogeneous digital repositories, providing several functionalities to guarantee CH preservation and fruition. It shows all the functionalities and technical features required of a CH big data system. In the following, we describe the main characteristics, also providing some implementation details.

First of all, the CHIS data model is based on the concept of a cultural item: examples are specific ruins of an archeological site, sculptures and/or pictures exhibited within a museum, historical buildings and famous squares in an old town center, and so on. In the model, it is possible to describe a cultural item with respect to a variety of annotation schemata, by aggregating different harvesting sets of "metadata," for example, the archeological view, the architectural perspective,
the archivist vision, the historical background, etc., as in the Europeana Data Model (EDM), a proposal that aims at bridging the gaps between different harvesting standards by providing a unique solution to metadata specification (http://pro.europeana.eu/edm-documentation).

Stream processing of data from sensor networks has been realized according to a "publish-subscribe" paradigm; concerning its implementation details, CHIS uses Apache Kafka to realize message brokers and the publish-subscribe communication, Apache Storm as the stream processing engine, and Apache Zookeeper as coordinator. On the other hand, ingestion mechanisms for the other kinds of data sources (digital archives/libraries and social media networks) have been realized using the available APIs or by means of basic wrapping techniques.

All the extracted data are converted into a JSON format according to the described data model. In particular, for multimedia contents, the descriptions in terms of metadata are captured and stored within CHIS, while raw data can be opportunely linked and, in other cases, temporarily imported into the system for content-based analysis (Bartolini and Patella 2015).

Data are then stored in the CHIS KB. In particular, the KB leverages a stack of different technologies that are typical of a big data system:

• The data describing basic properties of cultural items (e.g., name, short description, etc.) and basic information on user profiles are stored in a key-value data store (i.e., Redis).
• The complete description in terms of all the metadata using the different annotation schemata is in turn saved using a wide-column data store (i.e., Cassandra). A table for each kind of cultural object is used, having a column for each "metadata family"; column values can be literals or URIs (according to the LD paradigm).
• The document store technology (i.e., MongoDB) is used to deal with JSON messages, complete user profiles, and descriptions of internal resources (multimedia data and textual documents, etc.) associated with a cultural item.
• All the relationships among cultural items within a cultural environment and interactions with users (behaviors) are managed by means of a graph database (i.e., Titan).
• The entire cartography related to a cultural environment, together with PoIs, is managed by a GIS (i.e., PostGIS).
• Multimedia data management has been realized using the Windsurf library (Bartolini and Patella 2015).
• CHIS exploits an RDF store (i.e., different Allegrograph instances) to memorize particular data views in terms of triples related to a given cultural environment and useful for specific aims, providing a SPARQL endpoint for the applications.
• All system configuration parameters, internal catalogues, and thesauri are stored in a relational database (i.e., PostgreSQL).
• All the basic analytics and query facilities are provided by Spark based on HDFS.
• Semantics of data can be specified by linking values of high-level metadata to some available internal (managed by Sesame) or external ontological schemata.

Eventually, several analytics can be performed on the stored data using the graph analysis and machine learning libraries provided by Spark. As an example, the analysis of the graph coding interactions among users, multimedia objects, and cultural items can be properly used to support recommendation or influence analysis tasks.

This heterogeneous knowledge base provides basic RESTful APIs to read/write data and further functionalities for importing/exporting data in the most commonly diffused web standards. CHIS adopts such heterogeneous technologies in order to meet the specific requirements of the applications dealing with the huge amount of data at stake. For example, social networking applications typically benefit from graph database technologies because of their focus on data relationships.
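The polyglot-persistence idea behind the CHIS KB (routing each facet of a cultural item to the store type that fits its access pattern) can be illustrated with a minimal sketch. The store categories mirror those listed above, but the record format, the routing table, and the item identifiers are hypothetical; a real deployment would call the Redis, Cassandra, MongoDB, and Titan client APIs rather than in-memory structures.

```python
# Sketch of polyglot persistence: each facet of a cultural item goes to
# the store category suited to it (store names follow the entry; the
# routing table and record format are hypothetical illustrations).

STORES = {
    "key-value": {},    # basic properties (Redis in CHIS)
    "wide-column": {},  # full annotation schemata (Cassandra)
    "document": {},     # JSON messages and user profiles (MongoDB)
    "graph": [],        # item/user relationships (Titan)
}

ROUTE = {
    "basic": "key-value",
    "metadata": "wide-column",
    "profile": "document",
}

def store_facet(item_id, facet, value):
    """Write one facet of a cultural item to the appropriate store."""
    store = STORES[ROUTE[facet]]
    store.setdefault(item_id, {})[facet] = value

def link(src, relation, dst):
    """Record a relationship (e.g., user-visited-item) in the graph store."""
    STORES["graph"].append((src, relation, dst))

# Hypothetical item inspired by the Paestum use case mentioned below.
store_facet("paestum:temple-of-hera", "basic", {"name": "Temple of Hera"})
store_facet("paestum:temple-of-hera", "metadata", {"archeological": "Doric temple"})
link("user:42", "visited", "paestum:temple-of-hera")
```

The dispatcher makes the design choice explicit: the application addresses one logical KB, while the access method manager decides which physical store serves each read/write.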
Big Data in Mobile Networks 273
In turn, more efficient technologies (key-value or wide-column stores) are required by tourism applications to quickly and easily access the data of interest.

The overall system architecture is built using multiple virtual machines rented from a cloud provider, each equipped with the Hadoop/YARN and Spark packages, accessing a horizontally scaling dynamic storage space that increases on demand, according to the evolving data memorization needs. It consists of a certain number of slave virtual machines (which can be increased over time in the presence of changing demands) and two additional ones that run as masters: one for controlling HDFS and another for resource management.

The CHIS platform has then been used to support several CH applications, for example, the multimedia guide for the Paestum archeological site described in Colace et al. (2015).

Concluding Remarks

This entry focused on the importance of big data technologies to support smart cultural environments. We outlined the main characteristics and functionalities required of a general big data platform for CH information management and described CHIS, i.e., a big data system to query, browse, and analyze cultural digital contents from a set of distributed and heterogeneous digital repositories, thus providing several functionalities to guarantee CH preservation and fruition.

References

Picariello A, Schreiber F, Tanca L (eds) Data management in pervasive systems. Data-centric systems and applications. Springer, Cham, pp 311–325. ISBN 978-3-319-20061-3
Castiglione A, Colace F, Moscato V, Palmieri F (2017) CHIS: a big data infrastructure to manage digital cultural items. Futur Gener Comput Syst. https://doi.org/10.1016/j.future.2017.04.006
Chen CP, Zhang CY (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347
Colace F, De Santo M, Moscato V, Picariello A, Schreiber F, Tanca L (2015) Patch: a portable context-aware atlas for browsing cultural heritage. In: Data management in pervasive systems. Springer, Cham, pp 345–361
Cugola G, Margara A (2015) The complex event processing paradigm. In: Colace F, De Santo M, Moscato V, Picariello A, Schreiber F, Tanca L (eds) Data management in pervasive systems. Data-centric systems and applications. Springer, Cham, pp 113–133. ISBN 978-3-319-20061-3
Kabassi K (2013) Personalisation systems for cultural tourism. In: Tsihrintzis G, Virvou M, Jain L (eds) Multimedia services in intelligent environments. Smart innovation, systems and technologies, vol 25. Springer, Heidelberg, pp 101–111. ISBN 978-3-319-00374-0
Panigati E, Schreiber FA, Zaniolo C (2015) Data streams and data stream management systems and languages. In: Colace F, De Santo M, Moscato V, Picariello A, Schreiber F, Tanca L (eds) Data management in pervasive systems. Data-centric systems and applications. Springer, Cham, pp 93–111. ISBN 978-3-319-20061-3
Schaffers H, Komninos N, Pallot M, Trousse B, Nilsson M, Oliveira A (2011) Smart cities and the future internet: towards cooperation frameworks for open innovation. In: Domingue J et al (eds) The future internet. FIA 2011. Lecture notes in computer science, vol 6656. Springer, Berlin, Heidelberg, pp 431–446. ISBN 978-3-642-20897-3
Big Data in Mobile Networks
Big Data in Mobile Networks, Fig. 1 Summary of data types produced (or carried) by cellular networks. Billing metadata such as call detail records (CDRs) can be exploited for generating users' location data, social network information (from users' interactions), etc. Passively monitored traffic flowing in selected links allows the extraction of signaling messages of the cellular mobility management (which, in turn, provide fine-grained location and mobility data) and integral or partial user frames (packet traces) that can be further parsed to produce aggregated flow traces or filtered to application-specific traces
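As the caption notes, CDRs can be exploited for generating users' location data: each billing event pins a subscriber to a serving cell at a point in time, so sorting a user's events chronologically yields a coarse mobility trace. A minimal sketch of that derivation, assuming a hypothetical CDR format of (user id, cell id, timestamp):

```python
from collections import defaultdict

# Hypothetical simplified CDR tuples: (user id, serving cell id, timestamp).
# Real CDRs carry many more billing fields; only these three matter here.
cdrs = [
    ("alice", "cell-7", 1000),
    ("bob",   "cell-2", 1010),
    ("alice", "cell-9", 1300),
]

def mobility_traces(records):
    """Group CDRs per user and sort by time, yielding per-user cell sequences."""
    traces = defaultdict(list)
    for user, cell, ts in records:
        traces[user].append((ts, cell))
    return {u: [c for _, c in sorted(evts)] for u, evts in traces.items()}

print(mobility_traces(cdrs))  # {'alice': ['cell-7', 'cell-9'], 'bob': ['cell-2']}
```

The sketch also makes the limitation discussed next visible: a user's position is recorded only when an activity generates a ticket, so the trace is sparse in time.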
CDRs might not be as rich as other cellular data (e.g., signaling data including handover information), but they have been a popular study subject since the 2000s, in particular as a means to observe human mobility and social networks at scale. The reason behind this interest should be sought in their rather high availability: as network operators collect these logs for billing, they are a ready-made data source and do not require specialized infrastructure for their acquisition. Indeed, there is a fairly rich literature on their exploitation for a plethora of applications, ranging from road traffic congestion detection and public transport optimization to demographic studies. Despite the initial interest, the research community gradually abandoned this data source as it became clear that the location details conveyed by CDRs were biased by the users' activity degree. Specifically, the position of a user is observable only in conjunction with an activity, which translates into the impossibility of locating a user with fine temporal granularity.

More recent studies (Fiadino et al. 2017), however, argue that the spread of flat rates for voice and data traffic encourages users to generate more actions. In particular, competitive data plans are nowadays characterized by very high data volume limits, letting customers keep their terminals always connected. As a side effect, the quality of mobility data provided by CDRs has improved as the average number of tickets per user has increased. A resurgence of research work tackling the study of CDRs is expected in the next years. Indeed, the research community is already targeting some open issues, such as the lack of CDR standardization across operators and more accurate positioning within the cells' coverage areas (Ricciato et al. 2017).

Passive Monitoring

Passive traces are network logs collected through built-in logging capabilities of network elements (e.g., routers) or by wiretapping links through specialized acquisition hardware, such as data acquisition and generation (DAG) cards. Nowadays, high-performance DAG cards are commercially available and are largely employed by cellular network operators in order to monitor their infrastructures at different vantage points (VPs). Passive measurements allow operators to gain network statistics at a large scale, potentially from millions of connected devices, hence with a high degree of statistical significance. This data type is intrinsically more powerful than CDRs: passive traces are an exact copy of the traffic passing through a VP and open the possibility to fully reconstruct the activity of a mobile terminal through the study of both user traffic and control information. Needless to say, this poses serious challenges, in particular for what concerns customers' privacy and the potential disclosure of business-sensitive information. These issues have heavily limited access to this kind of data for third parties and the research community.

The employment of passively collected traces for the purpose of network monitoring and management dates back to the 2000s: a first formalization of the techniques for passively monitoring mobile networks can be found in Ricciato (2006), where the author focuses on 3G networks and lists common operator-oriented uses of the passively sniffed data. Despite the evolution of cellular networks (e.g., the passage from 3G to 4G standards), the main concept of passive monitoring has not changed much. Figure 2 illustrates a simplified cellular network architecture and highlights the links that can be wiretapped in order to collect passive traces. The cellular infrastructure is composed of two main building blocks: a radio access network (RAN), which manages the radio last mile, and a core network (CN). Fourth-generation networks (such as LTE) are characterized by fewer elements at the RAN, where the base stations (evolved NodeBs) are connected in a fully meshed fashion. The CN of 2G (GSM/EDGE) and 3G (UMTS/HSPA) networks is formed by the circuit-switched (CS) and the packet-switched (PS) domains, responsible for traditional voice traffic and packet data traffic, respectively. In contrast, 4G networks are purely packet switched, as all types of traffic are carried over IP, including
Big Data in Mobile Networks, Fig. 2 A simplified representation of the 3G and 4G cellular network architecture. 3G networks are characterized by the presence of two cores (the packet switched, handling data connections, and the circuit switched, handling traditional voice traffic). 4G networks have fewer network elements on the access network side and have just a packet-switched core, optimized for data connections
traditional voice calls, which are managed by an IP multimedia subsystem (IMS).

In order to be reachable, mobile terminals notify the CN whenever they change location in the network topology (in other words, which base stations they are attached to). Therefore, monitoring the S1 (in LTE) or IuB/IuCS/IuPS interfaces (in 3G) allows the collection of mobility information through the observation of network signaling (control) messages (i.e., signaling traces). Similarly to CDR datasets, an application for this type of traces is the study of human mobility, as specific signaling messages (such as handovers, location area updates, periodic updates, etc.) allow the reconstruction of the trajectory of a terminal in the network topology, which, in turn, can be exploited to infer the actual mobility of the mobile customer with a certain degree of accuracy. These data could potentially provide a higher time granularity in the positioning of terminals w.r.t. CDRs, as the terminal's position is not necessarily observable in conjunction with an action (a call, SMS, or data connection). The actual time granularity and position accuracy depend on multiple factors: the activity state of the terminal (in general, active terminals provide their location changes with higher temporal granularity, even when monitoring at the CN), the monitored interfaces (the closer the passive sniffing occurs to the RAN, the higher the chance to get finer-grained positioning in the cellular topology), and the use of triangulation (to calculate the position within the cell's coverage area, in case the signal strength is available). The interested reader might refer to Ricciato et al. (2015) for a comprehensive discussion of the passively observed location messages, their time and space granularity, and how they compare with the more popular CDRs.

Monitoring interfaces at the core of the PS domain, such as the Gn/Gi in 3G networks and the S5/S8/Gi interfaces in LTE, allows the collection of large-scale packet traces, i.e., a copy of frames at the IP level augmented with metadata such as time stamps, the ID of the link where the wiretapping occurred, and a user (IMSI) and device (IMEI) identifier. The packet traces consist of the full frames (in case the payload is kept) or the headers only. Cellular packet traces are extremely informative data, as they provide a complete snapshot of what happened in the network, supporting network management and troubleshooting (e.g., detection of network impairments, data-driven and informed dimensioning of resources, etc.) but also large-scale user
behavior studies (e.g., web surfing profiles, marketing analysis, etc.). For most applications, the header brings enough information, and therefore the payload can be discarded, making privacy and storage management easier. It should be noted that the payload is more and more often encrypted.

Raw packet traces can be further processed to produce more refined datasets. For example, by applying stateful tracking of packets, it is possible to reconstruct flow traces, i.e., records summarizing connections (sequences of packets) between a source and a destination, aggregating by IPs and ports. Such logs could support, among others, the observation of throughput, latency, round-trip time (RTT), or other performance indicators that can be exploited for service-level agreement (SLA) verification and quality of service (QoS) studies. Another typical post-processing approach is filtering specific packet types (e.g., TCP SYN packets for the study of congestion over links and network security enforcement) or applications (i.e., application-layer traces, such as DNS or HTTP).

Uses of Cellular Data

In the last years, cellular network data has been broadly employed in different application areas, not necessarily with technical purposes. Indeed, the progress in the development of Big Data technologies, coupled with the decrease of storage prices, has unlocked endless possibilities for the exploitation of the massive amount of data carried and produced by mobile networks.

In general, one can distinguish between two classes of use cases, i.e., network-related (technical) and human-related (nontechnical). The use cases in the first class aim at achieving informed network management and optimization actions, including more specific tasks such as detection of anomalies, troubleshooting, resource dimensioning, and security monitoring. Nontechnical applications focus on the study of human behavior, relying on both user and control data exchanged among user equipment, cellular network components, and remote services. Some common human-related uses of cellular data are the study of human mobility (through CDRs or signaling messages), the analysis of social networks, and web surfing behavior (through passively monitoring DNS or HTTP traffic at the CN), with applications in the field of marketing and behavioral studies.

The remainder of this section provides a non-exhaustive list of common uses of cellular data found in the scientific literature (also summarized in Table 1).

Technical Areas

The analysis of cellular data, and passive traces in particular, could support technical tasks aimed at achieving data-driven network engineering and optimization. There is an interesting research branch in this direction, targeting self-optimization, self-configuration, and self-healing based on machine learning approaches. One example is Baldo et al. (2014), where the authors investigate the advantages of adopting Big Data techniques for Self Optimized Networks (SON) in 4G and future 5G networks. A comprehensive survey on the use of Big Data technologies and machine learning techniques in this field was edited by Klaine et al. (2017).
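The flow-trace reconstruction described above, which aggregates packet traces into per-connection records keyed by IPs and ports, can be sketched as follows. The simplified packet tuples and field names are hypothetical stand-ins for a real capture, where a flow exporter would also track protocol, TCP state, and flow expiry.

```python
from collections import defaultdict

# Hypothetical simplified packets: (src ip, dst ip, src port, dst port, bytes, timestamp).
packets = [
    ("10.0.0.1", "93.184.216.34", 40000, 443, 1500, 1.0),
    ("10.0.0.1", "93.184.216.34", 40000, 443, 900,  1.2),
    ("10.0.0.2", "8.8.8.8",       50000, 53,  80,   1.1),
]

def to_flows(pkts):
    """Aggregate packets into flow records keyed by the address/port tuple,
    summing bytes, counting packets, and tracking first/last timestamps."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "first": None, "last": None})
    for src, dst, sport, dport, size, ts in pkts:
        f = flows[(src, dst, sport, dport)]
        f["packets"] += 1
        f["bytes"] += size
        f["first"] = ts if f["first"] is None else min(f["first"], ts)
        f["last"] = ts if f["last"] is None else max(f["last"], ts)
    return dict(flows)

flows = to_flows(packets)
# The HTTPS flow now summarizes two packets totaling 2400 bytes.
```

From such records, the performance indicators mentioned above (throughput, flow duration, and, with TCP state tracking, RTT) can be derived, e.g., bytes divided by (last minus first) for average throughput.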
Big Data in Mobile Networks, Table 1 A non-exhaustive list of applications for mobile network data divided into two main categories (technical and nontechnical) plus a hybrid class of uses

Technical areas                  Nontechnical areas             Hybrid areas
Anomaly detection                Churn prediction               Quality of Experience (QoE)
Traffic classification           Customer loyalty
Network optimization             Web surfing
Study of remote services         Marketing and tariff design
Performance assessment and QoS   Human mobility
Cellular passive traces can also be employed for studying the functioning and behavior of remote Over-The-Top (OTT) Internet services (Fiadino et al. 2016). Indeed, the popularity of mobile applications (e.g., video streaming, instant messaging, online social networks, etc.) poses serious challenges to operators, as they heavily load the access part of cellular networks. Hence, understanding their usage patterns, content location, addressing dynamics, etc. is crucial for better adapting and managing the networks, but also for analyzing and tracking the evolution of these popular services.

Congestions and Dimensioning

Nowadays, most services have moved over the TCP/IP protocol stack. The bandwidth-hungry nature of many mobile applications is posing serious challenges for the management and optimization of cellular network elements. In this scenario, monitoring congestion is critical to reveal the presence of under-provisioned resources. The use of packet traces for the detection of bottlenecks is one of the earliest applications in the field of cellular passive monitoring. Ricciato et al. (2007) rely on the observation of both global traffic statistics, such as data rate, and TCP-specific indicators, such as the observed retransmission frequency. Jaiswal et al. (2004) target the same goal by adopting a broader set of flow-level metrics, including TCP congestion window and end-to-end round-trip time (RTT) measurements. A more recent work that targets the early detection of bottlenecks can be found in Schiavone et al. (2014).

Traffic Classification

Network traffic classification (TC) is probably one of the most important tasks in the field of network management. Its goal is to allow operators to know what kind of services are being used by customers and how they are impacting the network, by studying their passive packet traces. This is particularly relevant in cellular networks, where resources are in general scarcer and taking informed network management decisions is crucial.

The field of TC has been extensively studied: complete surveys can be found in Valenti et al. (2013) and Dainotti et al. (2012). Standard classification approaches rely on deep packet inspection (DPI) techniques, using pattern matching and statistical traffic analysis (Deri et al. 2014; Bujlow et al. 2015). In the last years, the most popular approach for TC has been the application of machine learning (ML) techniques (Nguyen and Armitage 2008), including supervised and unsupervised algorithms.

The increasing usage of web-based services has also shifted the attention to the application of TC techniques to passive traces at higher levels of the protocol stack (specifically, HTTP). Two examples are Maier et al. (2009) and Erman et al. (2011), where the authors use DPI techniques to analyze the usage of HTTP-based apps, showing that HTTP traffic highly dominates the total downstream traffic volume. In Casas et al. (2013a), the authors characterize traffic captured in the core of cellular networks and classify the most popular services running on top of HTTP. Finally, the large-scale adoption of end-to-end encryption over HTTP has motivated the development of novel techniques to classify applications distributed on top of HTTPS traffic (Bermudez et al. 2012), relying on the analysis of DNS requests, which can be derived by filtering and parsing specific passive packet traces.

Anomaly Detection

Anomaly detection (AD) is a discipline aiming at triggering notifications when unexpected events occur. The definition of what is unexpected – i.e., what differs from a normal or predictable behavior – strictly depends on the context. In the case of mobile networks, it might be related to a number of scenarios, for example, degradation of performance, outages, sudden changes in traffic patterns, and even evidence of malicious traffic. All these events can be observed by mining anomalous patterns in passive traces at different levels of granularity.

A comprehensive survey on AD for computer networks can be found in Chandola et al. (2009). An important breakthrough in the field of AD in large-scale networks was set by the work of
280 Big Data in Mobile Networks
Lakhina et al. (2005), where the authors introduced the application of the principal component analysis (PCA) technique to the detection of network anomalies in traffic matrices. Another powerful statistical approach, tailored for cellular networks, has been presented in D'Alconzo et al. (2010). The same statistical technique has been expanded and tested in a nationwide mobile network by Fiadino et al. (2014, 2015).

The study presented in Casas et al. (2016a) is particularly interesting: it applies clustering techniques on cellular traces to unveil device-specific anomalies, i.e., potentially harmful traffic patterns caused by well-defined classes of mobile terminals. The peculiarity of this kind of study lies in the use of certain metadata (device brand, operating system, etc.) that are observable in passive traces collected in mobile networks (in general, monitoring fixed-line networks does not provide such a degree of detail on the subscribers).

User-Oriented Areas
Mobile network data implicitly brings large-scale information on human behavior. Indeed, operators can mine the satisfaction of users by observing their usage patterns. This type of analysis has clear applications in the fields of marketing and tariff design and supports business decisions in a data-driven fashion. An example is the prediction of churning users (i.e., customers lost to competitors): the study presented in Modani et al. (2013) relies on CDRs – and on the social network inferred from them – to extract specific behaviors that lead to termination of contracts.

Another area of user-oriented research is the study of quality of experience (QoE). The term QoE refers to a hybrid network- and human-related discipline that has the goal of estimating the level of satisfaction of customers when using IP-based services. In other words, practitioners in this field try to map network performance indicators (e.g., throughput, RTT, latency, etc.) to subjective scores assigned by users in controlled environments. Some examples of QoE studies conducted over cellular network data are Casas et al. (2013b, 2016b). In these works, the authors study the QoE of popular applications accessed through mobile devices from the perspective of passive cellular measurements, as well as subjective tests in controlled labs. The performance of multimedia services (VoIP, video streaming, etc.) over cellular networks is affected by a number of factors, depending on the radio access network (signal strength), core network setup (internal routing), and remote provisioning systems. This class of studies helps operators in properly dimensioning network resources, not only to meet provisioning requirements from the network perspective but also taking into account the actual perception of quality by final users.

Human Mobility
The exploitation of mobile terminals and cellular networks as a source of mobility information is possibly the most popular human-oriented topic in the field of mobile network data analysis. Cellular-based mobility studies have a high number of applications (e.g., city planning, analysis of vehicular traffic, public transportation design, etc.). Recently, cellular location data has been exploited to study population density (Ricciato et al. 2015) and to infer socioeconomic indicators and study their relationship with mobility patterns (Pappalardo et al. 2016).

Mobility studies through cellular data are based on the assumption that the movements of a terminal in the context of the network topology reflect the actual trajectory of the mobile subscriber carrying the device. Location information can be extracted from CDRs or from signaling messages devoted to the mobility management of terminals. Most of the existing literature is based on CDRs (Bar-Gera 2007; González et al. 2008; Song et al. 2010). However, since CDR datasets only log the position of users when an action occurs, their exploitation for the characterization of human mobility has been criticized (Ranjan et al. 2012). In Song et al. (2010), the authors highlight some concerns in this direction and also show that users are inactive most of the time. The main limitation lies in the fact that the mobility perceived from the observation of CDRs is highly biased, as it strictly depends on the specific terminals' action patterns. In other words, users are visible
during a few punctual instants, which makes most movements untraceable. Fiadino et al. (2017) claim that the situation is changing and that nowadays CDRs are richer than in the past, due to the increasing usage of data connections and background applications, which has consequently increased the number of records. The now higher frequency of tickets allows a finer-grained tracking of users' positions. Some recent mobility studies, such as Callegari et al. (2017) and Jiang et al. (2017), are, in fact, successfully based on CDRs.

Few studies have explored more sophisticated approaches based on passively captured signaling messages – e.g., handover, location area updates, etc. (Caceres et al. 2007). In Fiadino et al. (2012), the authors studied the differences between CDRs and signaling data in terms of the number of actions per user. Depending on the monitored interfaces, these approaches greatly vary in terms of cost and data quality (Ricciato 2006). Although the analysis of signaling is promising, there is a general lack of studies based on actual operational signaling data (some exceptions are the works by Fiadino et al. (2012) and Janecek et al. (2015), where the authors focus on the study of vehicular traffic on highways and the detection of congestion), as complex dedicated monitoring infrastructures for the extraction and immense data storage systems are required.

Privacy Concerns

The collection and exploitation of cellular data might raise some concerns. Indeed, given the penetration of mobile devices and the modern usage patterns, cellular networks carry an extremely detailed ensemble of private information. As seen before, by observing cellular metadata or specific network traces, it is possible to reconstruct the mobility of a mobile subscriber and infer the list of visited places, learn the acquaintances and build social networks, and even study the web surfing habits.

Customers' activity is associated with unique identifiers, such as the IMSI (International Mobile Subscriber Identity) and the IMEI (International Mobile Equipment Identity). These IDs make it possible to link all the collected logs to single users and, in turn, to their identity. To protect customers' privacy, the anonymization of such unique identifiers is a common practice adopted by operators before handing privacy-sensitive datasets to internal or third-party practitioners. Indeed, sharing network traces with external entities is critical in some applications, e.g., the detection of attacks with federated approaches (Yurcik et al. 2008).

A typical anonymization technique consists in applying nonreversible hashing functions, such as MD5 or SHA-1, with secret salt values to the IMSI and IMEI. Since these hash functions are cryptographically safe, it is still possible to associate all logs and actions of a single user to an anonymous identifier, with a limited risk of collisions (differently from other, simpler approaches, such as truncation of the last digits of customer identifiers). In addition, their one-way nature prevents reconstructing the original IDs. There is a rather rich literature on advanced anonymization techniques: a recent example can be found in Mivule and Anderson (2015).

It should be noted that all the listed applications for cellular data target macroscopic behaviors, observing users' traces on the large scale. The cases in which a single user needs to be addressed are rare and restricted to specific uses (e.g., detection of malicious users) or following authorized legal warrants. Nevertheless, the industry and the research community are addressing such concerns. For example, Dittler et al. (2016) propose an approach for location management that protects users' location privacy without disrupting the basic functionalities of cellular networks. On the other hand, the increasing adoption of encryption protocols to transport user data (e.g., HTTPS, instant messaging and VoIP with end-to-end encryption, etc.) limits some potentially abusive uses of passive traces.

Cross-References

Big Data in Computer Network Monitoring
Big Data in Network Anomaly Detection
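The salted-hash anonymization of subscriber identifiers described under Privacy Concerns can be sketched as follows. This is a minimal illustration under stated assumptions, not any operator's actual tooling: it uses keyed HMAC-SHA-256 in place of the salted MD5/SHA-1 mentioned above, and the salt value, function name, and example IMSI are all made up.

```python
import hashlib
import hmac

# Hypothetical secret salt, kept by the operator and never shipped
# with the anonymized dataset.
SECRET_SALT = b"operator-secret-salt"


def anonymize_id(subscriber_id: str) -> str:
    """Map an IMSI/IMEI to a stable anonymous identifier.

    Keyed hashing is deterministic and one-way: the same input always
    yields the same pseudonym, so all records of one user remain
    linkable, but the original identifier cannot be recovered without
    knowing the salt.
    """
    return hmac.new(SECRET_SALT, subscriber_id.encode(), hashlib.sha256).hexdigest()


imsi = "310150123456789"  # made-up example IMSI
assert anonymize_id(imsi) == anonymize_id(imsi)   # records stay linkable
assert len(anonymize_id(imsi)) == 64              # fixed-size pseudonym, not the ID itself
```

Compared with truncating the last digits of an identifier, which collapses many subscribers into the same pseudonym, a keyed hash keeps the collision risk negligible while still preventing reconstruction of the original IDs.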
Janecek A, Valerio D, Hummel K, Ricciato F, Hlavacs H (2015) The cellular network as a sensor: from mobile phone data to real-time road traffic monitoring. IEEE Trans Intell Transp Syst. https://doi.org/10.1109/TITS.2015.2413215
Jiang S, Ferreira J, Gonzalez MC (2017) Activity-based human mobility patterns inferred from mobile phone data: a case study of Singapore. IEEE Trans Big Data. https://doi.org/10.1109/TBDATA.2016.2631141
Klaine PV, Imran MA, Onireti O, Souza RD (2017) A survey of machine learning techniques applied to self organizing cellular networks. IEEE Commun Surv Tutorials 99:1–1. https://doi.org/10.1109/COMST.2017.2727878
Lakhina A, Crovella M, Diot C (2005) Mining anomalies using traffic feature distributions. SIGCOMM Comput Commun Rev 35(4):217–228
Lee Y, Lee Y (2012) Toward scalable internet traffic measurement and analysis with Hadoop. SIGCOMM Comput Commun Rev 43(1):5–13
Liu J, Liu F, Ansari N (2014) Monitoring and analyzing big traffic data of a large-scale cellular network with Hadoop. IEEE Netw 28(4):32–39. https://doi.org/10.1109/MNET.2014.6863129
Maier G, Feldmann A, Paxson V, Allman M (2009) On dominant characteristics of residential broadband internet traffic. In: Proceedings of the 9th ACM SIGCOMM conference on internet measurement conference (IMC’09). ACM, New York, pp 90–102. https://doi.org/10.1145/1644893.1644904
Mivule K, Anderson B (2015) A study of usability-aware network trace anonymization. In: 2015 science and information conference (SAI), pp 1293–1304. https://doi.org/10.1109/SAI.2015.7237310
Modani N, Dey K, Gupta R, Godbole S (2013) CDR analysis based telco churn prediction and customer behavior insights: a case study. In: Lin X, Manolopoulos Y, Srivastava D, Huang G (eds) Web information systems engineering – WISE 2013. Springer, Berlin/Heidelberg, pp 256–269
Nguyen TTT, Armitage G (2008) A survey of techniques for internet traffic classification using machine learning. IEEE Commun Surv Tutorials 10(4):56–76. https://doi.org/10.1109/SURV.2008.080406
Pappalardo L, Vanhoof M, Gabrielli L, Smoreda Z, Pedreschi D, Giannotti F (2016) An analytical framework to nowcast well-being using mobile phone data. Int J Data Sci Anal 2:75–92
Ranjan G, Zang H, Zhang Z, Bolot J (2012) Are call detail records biased for sampling human mobility? SIGMOBILE Mob Comput Commun Rev 16(3):33
Ricciato F (2006) Traffic monitoring and analysis for the optimization of a 3G network. Wireless Commun 13(6):42–49
Ricciato F, Vacirca F, Svoboda P (2007) Diagnosis of capacity bottlenecks via passive monitoring in 3G networks: an empirical analysis. Comput Netw 51(4):1205–1231
Ricciato F, Widhalm P, Craglia M, Pantisano F (2015) Estimating population density distribution from network-based mobile phone data. Publications Office of the European Union. https://doi.org/10.2788/162414
Ricciato F, Widhalm P, Pantisano F, Craglia M (2017) Beyond the “single-operator, CDR-only” paradigm: an interoperable framework for mobile phone network data analyses and population density estimation. Pervasive Mob Comput 35:65–82. https://doi.org/10.1016/j.pmcj.2016.04.009
Schiavone M, Romirer-Maierhofer P, Ricciato F, Baiocchi A (2014) Towards bottleneck identification in cellular networks via passive TCP monitoring. In: Ad-hoc, mobile, and wireless networks – 13th international conference (ADHOC-NOW 2014). Proceedings, Benidorm, 22–27 June 2014, pp 72–85
Song C, Qu Z, Blumm N, Barabási AL (2010) Limits of predictability in human mobility. https://doi.org/10.1126/science.1177170
Valenti S, Rossi D, Dainotti A, Pescapè A, Finamore A, Mellia M (2013) Reviewing traffic classification. In: Biersack E, Callegari C, Matijasevic M (eds) Data traffic monitoring and analysis: from measurement, classification, and anomaly detection to quality of experience. Springer, Berlin/Heidelberg, pp 123–147
Yurcik W, Woolam C, Hellings G, Khan L, Thuraisingham B (2008) Measuring anonymization privacy/analysis tradeoffs inherent to sharing network data. In: NOMS 2008 – 2008 IEEE network operations and management symposium, pp 991–994. https://doi.org/10.1109/NOMS.2008.4575265

Big Data in Network Anomaly Detection

Duc C. Le and Nur Zincir-Heywood
Dalhousie University, Halifax, NS, Canada

Synonyms

Network anomaly detection; Network normal traffic/behavior modeling; Network outlier detection

Definitions

Network anomaly detection refers to the problem of finding anomalous patterns in network activities and behaviors, which deviate from normal network operational patterns. More specifically, in the network anomaly detection context, a set of network actions, behaviors, or observations is
pronounced anomalous when it does not conform, by some measures, to a model of profiled network behaviors, which is mostly based on modelling benign network traffic.

Overview

In today's world, networks are growing fast and becoming more and more diverse, connecting not only people but also things. They account for a large proportion of the processing power, due to the trend of moving more and more of the computing and the data to cloud systems. There might also come a time when the vast majority of things are controlled in a coordinated way over the network. This phenomenon not only opens many opportunities but also raises many challenges at a much larger scale.

Due to the ever-expanding nature of the network, more and more data and applications are being created, used, and stored. Furthermore, novel threats and operational/configuration problems start to arise. This in turn creates the environment for systems such as network anomaly detection to become one of the most prominent components in modern networks.

Examples of anomalies on the network
Network anomalies are mostly categorized into two types: performance-related anomalies and security-related anomalies (Thottan and Ji 2003).

• Anomaly in performance This type of anomaly represents abrupt changes in network operations. Most of the time, they are generated by configuration errors or by unexpected shifts in network demand or supply resources. Examples of such anomalies include server or router failures and transient congestion. A server failure, such as a web server failure, could occur when there is a swift increase in the demand for the site's content to the point that it exceeds the capabilities of the servers. Network congestion at short time scales occurs due to hot spots in the network that may be caused by the failures of network devices or connections or by excessive traffic load at specific regions of the network. In some instances, network anomalies could be the result of errors in the network configuration, such as routing protocol configuration errors causing network instability.
• Anomaly in network security This is the main focus of many research works in the literature, where the anomalous network behaviors are caused by security-related problems. Examples are distributed denial-of-service (DDoS) attacks and network anomalies caused by intrusions or insider threats. DDoS attacks are conceivably one of the most noticeable types of anomalies, where the services offered by a network are disrupted. This effectively prevents normal users from accessing the content, by overloading the network infrastructure or by disabling a vital service such as domain name server (DNS) lookups. In the case of network intrusions, once a malicious entity has obtained control of some parts of a network, those network resources could end up being used for a large-scheme DDoS attack or for other illegitimate resource-intensive tasks. On the other hand, similar to insider threats, the attackers could silently exploit the controlled resources to gather valuable information and credentials, such as credit cards, user IDs and passwords, or an organization's operational information and strategies. This could result in more resilient, hard-to-disclose types of anomalies, which in many cases may be overlooked by many anomaly detection systems.
• Another type of anomaly that may appear on the network is the shift or drift in normal network behaviors. This is labelled as anomalous in many cases, because a deviation is seen compared to the modelled behaviors. Depending on the specific anomaly detection system, these observations can be seen as informational alerts or simply as false positives.

Generally, in a specific work, only one type of anomaly is considered the main focus, but from the network anomaly detection perspective taken in this chapter, all types are important and critical
• Rule-based and pattern matching approaches mainly relied on expert systems encoded as a set of rules and patterns for matching against incoming network traffic and behaviors. Examples of this category are Snort and Bro, where anomaly detection can be done in rules or scripts that identify an excessive amount in specific properties, such as the utilization of bandwidth, the number of open transmission control protocol connections, or the number of Internet control message protocol packets, exceeding a predefined threshold.
• Ontology and logic-based approaches attempt to model network behaviors using expressive logic structures, such as finite state machines and ontologies, for encoding the knowledge of experts in the domain. In these models, the sequences of alarms and actions obtained from different locations on a given network are modelled by a reasoning model designed by the experts, not only for detecting anomalies and known intrusions but also for describing and diagnosing the related problems. Notable works in this category are by Ilgun et al. (1995) and Shabtai et al. (2010), which are based on the state machine model and on knowledge-based temporal abstraction, respectively.

The efficiency of the above approaches quintessentially depends on the accuracy of the expert knowledge embedded in the defined rules, patterns, and ontologies. Thus, it is possible that the constructed models do not adapt adequately to the evolution of the network and cybersecurity environments, allowing new faults and threats to evade detection. Moreover, given a new network, it may be necessary to spend a considerable amount of time modelling the specific patterns/rules needed for building traffic models that match the network conditions. As network topologies and traffic conditions are expanding and evolving, techniques based on traffic inspection and rule/pattern/model matching may not scale gracefully.

Envisioning the scale and variability of networks and their potential growth in the near future, one can easily realize the need for more efficient and easily adaptable tools for anomaly detection. Thus, machine learning techniques naturally found applications in this field for their ability to automatically learn from the data and extract patterns that can be used for identifying anomalies in a timely manner. Although most machine learning applications in cybersecurity generally, and intrusion detection specifically, are based on supervised learning techniques, in anomaly detection unsupervised learning is the most applied technique, due to its ability to learn without label information, i.e., a priori knowledge of anomalies, which essentially is the defining characteristic of anomaly detection.

• Unsupervised learning techniques can be used to construct a model generalizing normal network and user behaviors and producing measures which could be used to detect anomalies in network traffic. Examples of different works in this area are by Dainotti et al. (2007, 2009), where temporal correlation, wavelet analysis, and traditional change point detection approaches are applied to produce a network signal model and a worm traffic model, and by Sequeira and Zaki (2002) and Rashid et al. (2016), where user profiles are created from the sequence of user actions in a time window using clustering techniques and hidden Markov models.
• A more direct approach using unsupervised learning for anomaly detection is applying clustering analysis and outlier detection methods. In this approach, clustering algorithms are used to find significant clusters representing the majority of normal network/system behaviors. From the clustering results, anomalies can be detected automatically, using outlier detection methods, as the data instances that do not fit into the constructed clusters of normal data. Notable works are Otey et al. (2006), Jiang et al. (2006), Finamore et al. (2011), and Casas et al. (2012).
• Unsupervised learning results can be analyzed by human experts, with or without the use of visualization and other supervised learning methods, to give more insight. Notable works are Kayacik et al. (2007), Perdisci et al.
The SOM is trained in an unsupervised manner, with the assumption that, learning solely from data, it could learn well-separated clusters to differentiate distinct communication behaviors. Post training, if the ground truth of the training data is available, it can be used for labelling the SOM nodes in order to quantify the system performance. On the other hand, when label information is not available, cluster analysis based on visualization of the trained SOM topology can be used, along with expert knowledge and information from other sources, to derive meaningful insights from the data. The SOM-based framework can then be used to analyze incoming network data and support network administrators in operation by visualizing the traffic.

Examples of applications The framework can be applied in many network situations, with different sources of data and domain knowledge incorporated. Some applications are presented in detail in Le et al. (2016) and Le (2017). In this chapter, a brief description of one example is presented, applying the framework to network flow data analysis in order to detect the Virut botnet in a mixed dataset from García et al. (2014).

Network flows are summaries of connections between hosts, which include descriptive statistics calculated by aggregating Network and Transport layers' header information of the transmitted packets, and they can be independent from the application layer data. Given that network traffic encryption is very popular, both in benign applications for protecting users' privacy and sensitive information and in malicious applications for hiding from the detection systems that analyze network packet payloads, a detection approach using only netflows exported from packet headers may improve the state of the art in unsupervised learning-based data analysis. Furthermore, the use of network flows can significantly reduce the computational requirements of a network monitoring system by allowing the analysis of each flow as a summary of packets and avoiding deep packet inspection.

Figure 2 presents the results obtained using the three training schemes in the framework. It is clear that the first training scheme, where the SOM is formed using both known normal and malicious data, was able to form well-separated regions for the classes and achieve a 99.9% detection rate with only a 0.6% false-positive rate. Similarly, the second training scheme, which is the most suitable for network anomaly detection applications where only normal data is used to form the SOM, achieved promising results at an 80.9% detection rate and a 6% false-positive rate. Figure 2b shows that the SOM trained using scheme (ii) was able to

Big Data in Network Anomaly Detection, Fig. 2 SOM visualization of test data in the Virut dataset based on the three training schemes: (a) scheme (i), (b) scheme (ii), (c) scheme (iii)
identify the malicious data by mapping it to an unpopular region, which is distanced from the training data. On the other hand, the scheme (iii) SOM was able to differentiate only 3.4% of the normal data from the malicious data used for training (Fig. 2c). The visualization capabilities supported by the framework, as shown in Fig. 2, could give network administrators options to quickly analyze network data, even without ground truth, by accessing the regions formed by the SOM. In-depth analysis and details of applications of the framework in other use cases, such as web attack and novel botnet detection, can be found in Le (2017).

Challenges in Applying Big Data in Network Anomaly Detection

Although machine learning has been extensively applied to the task of anomaly detection, there are certain challenges that need to be overcome for a successful realization of research results in production network environments. First and foremost, there are challenges which come from the nature of the anomaly detection task itself and that all approaches, including machine learning-based approaches, must address. Some of them are already summarized in Sommer and Paxson (2010).

• The problems related to the data:
  – Network data is diverse and abundant, yet obtaining high-quality data for designing and evaluating anomaly detection systems typically involves considerable difficulties. Many companies and organizations are prevented from sharing data by agreements protecting users' identities and privacy. Even when the data is shared, most of it comes without any form of ground truth for training and evaluating anomaly detection systems with confidence.
  – Secondly, network anomaly detection data is highly skewed; most of the time, only a small, unidentified portion of the data is anomalous. One example is insider threats, where the attackers can perform malicious actions over a long duration, sometimes months (or longer), to evade monitoring systems. Hence detection systems and the network operations team could fail to recognize the small changes as signs of anomalies.
  – Network data can be encrypted and/or anonymized, which may inhibit many detection systems that monitor network traffic.
  – In many cases, the developers may face the dilemma of impurity in the training data. This can seriously affect the efficiency of anomaly detection systems, as a model inadvertently built with abnormal data would encode that data as normal behavior, making it incapable of detecting future anomalies of that type.
• The very nature of anomaly detection may itself pose a problem to developers in defining an anomaly and in distinguishing it from normal network behaviors. Similarly, machine learning techniques may have trouble identifying novel behaviors that are dissimilar from learned patterns in anomaly detection tasks. This is unlike other applications of machine learning, such as classification and recommendation systems, where the main focus is on finding similarity.
• There are significant challenges in generalizing anomaly detection methods to become independent of the environment. Different network environments usually have different requirements and conditions for defining normal and anomalous behaviors. Moreover, in each network the concepts of anomaly and normal profiles are continuously shifting or drifting. Therefore, it is necessary to study dynamic, adaptive, and self-evolving anomaly detection systems to overcome the limitations of traditional static models.
• Ideally, an anomaly detection method should pursue the highest anomaly detection rates while avoiding a high rate of false alarms. However, in designing a novel anomaly detection method, one always has to take into account a trade-off between the two measures.
Considering the dominance of normal data and the scale of networks, even a small false-positive rate may result in a catastrophic amount of false alarms that need attention from human experts – the network operations team. Lessons have been learned that detection systems with low precision would be impractical for production network environments. This requires the anomaly detection system not only to have the ability to detect abnormal network behaviors but also to correlate and learn from the alarms in order to provide meaningful feedback to/from human experts to improve.

To apply big data successfully to the anomaly detection problem, not only the aforementioned challenges but also a greater scope and scale of challenges need to be addressed and overcome. These include but are not limited to the following aspects of network-based systems:

• Runtime limitation presents an important challenge for a network anomaly detection system, especially when big data is applied to deal with larger networks. Reducing the computational complexity of preprocessing, training, and deployment of such systems is a priority.
• Data for network anomaly detection systems is presented at multiple levels of granularity and in multiple formats. The data can be acquired at the host or network level, from end devices, network devices, security devices, or servers, in different formats and structures. In small networks, the data can be processed manually by experts; however, when it comes to big data applications, the problems of data acquisition, data representation, and preprocessing must be addressed systematically in order to provide high-quality data for training big data algorithms efficiently. Similarly, developing appropriate and fast feature selection methods is crucial, not only to ensure the efficiency of anomaly detection systems but also to provide a compact representation of the data with sufficient information.
• The very nature of network data is streaming. In small networks, gradual drifting and shifting of behaviors and concepts in the data can be addressed by retraining the learned data models periodically. However, in big data applications, this prevalent phenomenon needs to be addressed adequately from a system design perspective. This requires dynamic and adaptive learning algorithms and self-evolving architectures that are capable of working on high-speed data streams. Furthermore, such systems need to have capabilities for revising/updating themselves in a timely manner according to the ever-changing dynamics of the network environment without sacrificing their detection performance.
• In big data environments, there is a significant challenge in converting anomaly warnings into meaningful alerts and producing patterns for future prediction, considering the tremendous amount of data that a warning system must process each day. This will require big data systems that are able to cooperate with human experts and continuously learn from human input in order to evolve and adapt to the new normal and malicious behaviors on the network.

Conclusion

Network anomaly detection has become more and more prevalent and crucial for monitoring and securing networks. Given the scale and dynamics of a network, developing practical big data applications for the task is an open field of research. Due to the very nature of network data and learning algorithms, many traditional approaches to network analytics and anomaly detection are not scalable. Moreover, novel and task-specific challenges need to be addressed earnestly in order to successfully introduce new big data applications to the field.

Future directions for research In light of the challenges discussed above, we aim to help strengthen future research on big data applications for anomaly detection systems with a concrete set of recommendations:
Big Data in Network Anomaly Detection 291
• Drift and shift in network data streams can be addressed using active online learning algorithms (Khanchi et al. 2017).
• Dynamic anomaly detection systems should be capable of learning from multiple sources such as offline data sources, streaming data sources, expert knowledge, and feedback. It has been shown in Le et al. (2016) that learning algorithms with visualization capabilities not only achieve the goals of anomaly detection via clustering and outlier detection techniques but also provide a handy tool for network managers to acquire insights into the different network behaviors. This also enables the involvement of network experts in designing and developing machine learning-based systems.
• The utility of network flows as metadata has been shown to increase the learning efficiency and overcome the opaqueness of the data under certain conditions (Haddadi and Zincir-Heywood 2016). This implies that novel data representation and summarization techniques could help to speed up the implementation and evolution of anomaly detection systems.
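The first recommendation can be illustrated with a small sketch: a detector that keeps exponentially weighted running statistics adapts to drift in a stream without periodic batch retraining. This is only a minimal single-feature illustration, not the GP framework of Khanchi et al. (2017), and the class name, parameters, and thresholds are invented for the example:

```python
import math

class StreamingAnomalyDetector:
    """Minimal adaptive detector for a single numeric feature stream.

    Keeps an exponentially weighted mean/variance so that old behavior is
    gradually forgotten, which lets the model track drift and shift in the
    stream without periodic batch retraining.
    """

    def __init__(self, alpha=0.01, z_threshold=4.0):
        self.alpha = alpha      # forgetting factor: higher adapts faster
        self.z = z_threshold    # deviations (in std-devs) counted as anomalous
        self.mean = 0.0
        self.var = 1.0
        self.n = 0

    def update(self, x):
        """Score one observation, then fold it into the model."""
        self.n += 1
        if self.n > 30:  # only score once a burn-in period has passed
            score = abs(x - self.mean) / math.sqrt(self.var + 1e-12)
            is_anomaly = score > self.z
        else:
            score, is_anomaly = 0.0, False
        # exponentially weighted update (drift adaptation)
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return is_anomaly, score

# usage: feed e.g. per-flow byte counts; a sudden surge is flagged
det = StreamingAnomalyDetector()
for flow_bytes in [100, 110, 95, 105, 102] * 10:   # normal traffic
    det.update(flow_bytes)
flagged, score = det.update(5000)                  # burst, e.g. DoS-like
```

Because the statistics decay, a gradual shift in "normal" traffic is absorbed into the model over time, while abrupt deviations remain flagged.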
Cross-References

Big Data for Cybersecurity
Big Data in Computer Network Monitoring
Network Big Data Security Issues

References

Bhuyan MH, Bhattacharyya DK, Kalita JK (2014) Network anomaly detection: methods, systems and tools. IEEE Commun Surv Tutorials 16(1):303–336. https://doi.org/10.1109/SURV.2013.052213.00046
Casas P, Mazel J, Owezarski P (2012) Unsupervised network intrusion detection systems: detecting the unknown without knowledge. Comput Commun 35(7):772–783. https://doi.org/10.1016/j.comcom.2012.01.016
Dainotti A, Pescapé A, Ventre G (2007) Worm traffic analysis and characterization. In: 2007 IEEE international conference on communications, pp 1435–1442. https://doi.org/10.1109/ICC.2007.241
Dainotti A, Pescapé A, Ventre G (2009) A cascade architecture for DoS attacks detection based on the wavelet transform. J Comput Secur 17(6):945–968
Finamore A, Mellia M, Meo M (2011) Mining unclassified traffic using automatic clustering techniques. Springer, Berlin/Heidelberg, pp 150–163. https://doi.org/10.1007/978-3-642-20305-3_13
García S, Grill M, Stiborek J, Zunino A (2014) An empirical comparison of botnet detection methods. Comput Secur 45:100–123. https://doi.org/10.1016/j.cose.2014.05.011
Haddadi F, Zincir-Heywood AN (2016) Benchmarking the effect of flow exporters and protocol filters on botnet traffic classification. IEEE Syst J 10:1390–1401. https://doi.org/10.1109/JSYST.2014.2364743
Ilgun K, Kemmerer RA, Porras PA (1995) State transition analysis: a rule-based intrusion detection approach. IEEE Trans Softw Eng 21(3):181–199. https://doi.org/10.1109/32.372146
Jiang S, Song X, Wang H, Han JJ, Li QH (2006) A clustering-based method for unsupervised intrusion detections. Pattern Recogn Lett 27(7):802–810. https://doi.org/10.1016/j.patrec.2005.11.007
Kayacik HG, Zincir-Heywood AN, Heywood MI (2007) A hierarchical SOM-based intrusion detection system. Eng Appl Artif Intell 20(4):439–451
Khanchi S, Heywood MI, Zincir-Heywood AN (2017) Properties of a GP active learning framework for streaming data with class imbalance. In: Proceedings of the genetic and evolutionary computation conference, pp 945–952. https://doi.org/10.1145/3071178.3071213
Kohonen T (2001) Self-organizing maps. Springer series in information sciences, vol 30, 3rd edn. Springer, Berlin/Heidelberg. https://doi.org/10.1007/978-3-642-56927-2
Laney D (2001) 3D data management: controlling data volume, velocity, and variety. Technical report, META Group
Le DC (2017) An unsupervised learning approach for network and system analysis. Master's thesis, Dalhousie University
Le DC, Zincir-Heywood AN, Heywood MI (2016) Data analytics on network traffic flows for botnet behaviour detection. In: 2016 IEEE symposium series on computational intelligence (SSCI), pp 1–7. https://doi.org/10.1109/SSCI.2016.7850078
Leung K, Leckie C (2005) Unsupervised anomaly detection in network intrusion detection using clusters. In: Proceedings of the twenty-eighth Australasian conference on computer science, vol 38, pp 333–342
Otey ME, Ghoting A, Parthasarathy S (2006) Fast distributed outlier detection in mixed-attribute data sets. Data Min Knowl Discov 12(2–3):203–228. https://doi.org/10.1007/s10618-005-0014-6
Perdisci R, Gu G, Lee W (2006) Using an ensemble of one-class SVM classifiers to harden payload-based anomaly detection systems. In: Sixth international conference on data mining (ICDM'06), pp 488–498. https://doi.org/10.1109/ICDM.2006.165
Rajashekar D, Zincir-Heywood AN, Heywood MI (2016) Smart phone user behaviour characterization based on autoencoders and self organizing maps. In: ICDM workshop on data mining for cyber security, pp 319–326. https://doi.org/10.1109/ICDMW.2016.0052
Rashid T, Agrafiotis I, Nurse JR (2016) A new take on detecting insider threats: exploring the use of hidden Markov models. In: Proceedings of the 8th ACM CCS international workshop on managing insider security threats, pp 47–56. https://doi.org/10.1145/2995959.2995964
Sequeira K, Zaki M (2002) ADMIT: anomaly-based data mining for intrusions. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, pp 386–395. https://doi.org/10.1145/775047.775103
Shabtai A, Kanonov U, Elovici Y (2010) Intrusion detection for mobile devices using the knowledge-based, temporal abstraction method. J Syst Softw 83(8):1524–1537. https://doi.org/10.1016/j.jss.2010.03.046
Sommer R, Paxson V (2010) Outside the closed world: on using machine learning for network intrusion detection. In: 2010 IEEE symposium on security and privacy, pp 305–316. https://doi.org/10.1109/SP.2010.25
Thottan M, Ji C (2003) Anomaly detection in IP networks. IEEE Trans Signal Process 51(8):2191–2204
Veeramachaneni K, Arnaldo I, Korrapati V, Bassias C, Li K (2016) AI^2: training a big data machine to defend. In: 2016 IEEE 2nd international conference on big data security on cloud (BigDataSecurity). IEEE, pp 49–54. https://doi.org/10.1109/BigDataSecurity-HPSC-IDS.2016.79

Big Data in Smart Cities

Zaheer Khan1 and Jan Peters-Anders2
1University of the West of England, Bristol, UK
2Austrian Institute of Technology, Vienna, Austria

Synonyms

Smart cities or future cities; Smart City Big Data or large-scale urban data and/or city data

Definitions

A smart city is a city which invests in Information and Communication Technology (ICT) enhanced governance and participatory processes to develop appropriate public service, transportation, and energy infrastructures that can ensure sustainable socioeconomic development, a healthy environment, enhanced quality of life, and intelligent management of environment and natural resources. From a technology perspective, smart cities are focused on capturing, processing, and making effective use of heterogeneous, temporal, and ever-increasing large-scale or big urban data.

Overview

City governance is a complex phenomenon and requires new combinations of top-down and bottom-up ICT-enabled governance methodologies to efficiently manage cities. Cities are a major source of urban, socioeconomic, and environmental data, which provides the basis for developing and testing Smart City solutions. The notion of Smart Cities has gained a lot of momentum in the past decade. City administrators have realized the potential for ICT to collect and process city data and generate new intelligence concerning the functioning of a city that aims to secure effectiveness and efficiency in city planning, decision-making, and governance. Without urban data and the use of ICT, this intelligence either cannot be generated or requires considerable resources, e.g., financial, human, etc.

Recent publications in the smart cities domain emphasize better use of urban data by developing innovative domain-specific applications which can provide new information or knowledge (Khan et al. 2012, 2014a, 2015; Garzon et al. 2014; Soomro et al. 2016). This new knowledge can be used by city administrations, citizens, and private sector organizations to effectively and efficiently manage their daily activities and/or business processes, as well as the decision-making processes that underpin the governance of the city. The city data is generated from various thematic domains (e.g., transport, land-use, energy, economy, climate, etc.); however, there is little debate on the holistic view of cross-thematic city data, which requires Big Data management approaches and high performance computing
paradigms like cloud computing (Khan et al. 2015).

In this entry, we provide a holistic view of Smart City Big Data by reviewing different Smart City projects and applications (UrbanAPI, DECUMANUS, Smarticipate), describing the cross-thematic domain Big Data and data management challenges cities face and for which they require appropriate solutions. We also list various datasets available in cities and share lessons learned from the above-mentioned projects.

Smart Cities Big Data: A Holistic View

Smart Cities are tightly coupled with data and information which can be used for knowledge generation to take decisions aiming to resolve societal challenges concerning the quality of life of citizens, economic development, and, more generally, sustainable city planning. In this section, we try to answer the question: Is Smart City data really Big Data? In this respect, we try to present different facets of Smart Cities data aligned with the main "V" characteristics of the Big Data paradigm, i.e., variety, volume, and velocity. To our knowledge there are several research publications on individual Smart City applications, but a holistic view of the variety of data available in cities is not evident in the current literature.

City data includes socioeconomic, built-environment, and environmental data that can be utilized for data analytics to generate useful information and intelligence. Often this data is collected using conventional methods such as daily local business transactions with city administrations, opinion surveys, traffic counters, LiDAR earth observation, sensors for environment monitoring (e.g., noise, air quality, etc.), and well-being and quality of life data from health agencies. More advanced methods like satellite-based earth observation imagery (e.g., land monitoring, energy loss, climate change) are also being used to develop the information and intelligence base for analysis across a range of scales from local to city and city-region. Most of the city data has geo-coordinates which provide location context. This location context helps to perform investigation and data analysis associated with specific areas in cities. This data is stored in different city repositories, often managed by different city departments. Some processed data is partly made accessible through open data initiatives (Open Government Data 2017). Often individual elements of open data cannot provide useful information on their own, but when integrated with data from various other data sources (e.g., basemaps, land-use, etc.), they become extremely useful to determine the possible effects and impacts of planning interventions that influence the future development of the sustainable Smart City. In addition to the above data sources, new technological inventions such as the Internet of Things (IoT), smart phones (call data records), social media, etc., are also adding to the huge amount of existing urban data.

Figure 1 shows where Smart Cities can make effective use of the city's Big Data. Cities want to make effective use of data in planning and policy-making processes to make informed decisions. A typical policy-making process generates multiple information requirements, which involves integration of data from different thematic or sectoral domains; addresses the temporal dimension of data; and ensures that data generated at various governance levels, i.e., from city to national and regional levels, is taken into consideration.

This suggests that though application-specific data provides necessary information, it is not sufficient to provide the holistic overview essential to the monitoring of a city environment by interconnected socioeconomic and environmental characteristics. For instance, land-use and transportation are highly inter-linked, e.g., urban sprawl results in increasing use of private cars, which generates unhealthy particulates affecting air quality in cities, causing health problems for citizens and negative impacts for social/health services. An integrated analysis of the data generated in the urban environment is essential to fully understand the complexities involved in Smart City Big Data governance and management.

In our previous work (Khan et al. 2012), we presented a generic process for integrated
Big Data in Smart Cities, Fig. 1 Integrated view – an example of drivers and impact assessment (Khan et al. 2012)
information system development and identified several technical requirements related to urban and environmental data management, including data acquisition, metadata, discovery and metadata catalog, data quality, data retrieval, data storage, schema mapping and transformations, service interoperability, data fusion and synthesis, etc. With the expected growth of IoT devices and sensors; extensive use of social media and smart phones; and use of satellite-based very high-resolution imagery for city planning and decision-making, these requirements are still valid for today's Smart City Big Data management.

We present a holistic view of the variety of Smart City data in Fig. 2, classified into two categories: (i) the "Base Data" and (ii) the "Other Thematic/Application Specific Data." Typically, cities possess huge amounts of historical data in their databases or electronic files. This data can be related to buildings, base maps, topography, land cover/use, road networks, underground utility/service distribution networks, population, green infrastructure, master plan, etc. We refer to this data as the base data, which provides the necessary basis for developing Smart City applications. Usually such data is collected by well-tested methods, e.g., LiDAR scanning, and hence is trustable (veracity). Such data is usually collected once and updated after many years, e.g., population, road networks, LiDAR-based topography. In some cases, a regular update is necessary, e.g., new building constructions. The volume of such data also varies from one city to another. For a medium to large scale city, this data can be from a few 100 MBs to 10s of GBs depending upon preserved historical temporal data (velocity).

The base data may comprise:

Buildings: 2D/3D building location or structure. This can also be in CityGML format or BIM (Building Information Modeling) compliant data.

Road Networks: include data about major and small road networks, pedestrian or cycle pathways.

Basemaps: provide background maps of a city in different resolutions for orientation purposes.
Big Data in Smart Cities, Fig. 2 Holistic view of smart city big data
Satellite-Based Remote Sensing Data: This can be a variety of data collected as high resolution images. This data can be related to light emission, thermal energy loss, and land monitoring (e.g., land cover changes including green/blue/brown areas).

Land Cover/Use: provides landscape data classified into forests, wetlands, impervious surfaces, agriculture, etc. Land cover can be determined by airborne or satellite imagery. In contrast, land-use provides additional information to see land cover changes over time.

Population Census: urban population including density of people (by age and gender) in specific areas.

Households/Workplaces: buildings or places classified as households or workplaces.

Urban Topography: Digital earth images mainly identifying various features. Mostly, it is provided as a digital surface model and/or digital elevation model.

Utility Distribution Network Data: Underground sewerage pipes, water distribution pipes, gas/electricity/fiber optic lines, domestic/commercial use of energy, etc.

Policies or Master Plan: Medium to long-term City Master Plan that specifies future urban development.

Green Infrastructure: Tree canopy, parks, front/back gardens, allotments, etc.

Public Services Data: contains information about public services rendered to citizens, e.g., council tax, social housing, social benefits, etc.

The application specific data is more frequently generated (velocity) and belongs to different thematic domains (variety). The temporal dimension of data allows cities to analyze historical changes in land-use and other applications. For instance, mobile phone data can be in the range of a few hundred GBs to TBs per week in unstructured to semistructured format. This data can be used to analyze population distribution and mobility patterns for daily analysis, offering insights concerning the socioeconomic dynamics of the city as a direct input to city planning (Peters-Anders et al. 2017).

As another example, a typical calibrated data reading from a basic sensor can be up to 100 bytes (ceiling value) and can be recorded in a CSV or XML file in a semistructured format. As an example, assuming there are 500 sensors deployed in a neighborhood of a city, and each sensor sends data every 10 min, then the total data collected in one hour is: 100 (approx. bytes) × 500 (sensors) × 6 (readings in 1 h) = 300,000 bytes (about 300 KBytes). This means the total data collected in one day would be: 24 h × 300,000 bytes = 7,200,000 bytes (approximately 7.2 MBytes), and the total data collected in one year 7.2 MB (data per day) × 365 (days/year) = 2.628 GBytes. It is clear, as there are a variety of other sensors (e.g., infrastructure monitoring sensors, air quality – PM10, NO2, etc.) in the city environment collecting different thematic data on a daily basis, that the daily data collection may reach 100s of GBytes to 10s of TBytes.

The above example considers only one source of data, and there are several other sources like CCTV, energy consumption by households, transport data, social media, etc., which results in the generation of a few 100s of GBs to 10s of TBs of data on a daily basis. The amount of this data will significantly increase with the deployment of 1000s of new sensors, use of smart devices, and billions of Internet of Things (IoT) devices in the next few years.
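The back-of-the-envelope arithmetic above can be reproduced in a few lines; the constants are the assumptions stated in the example (100-byte readings, 500 sensors, one reading every 10 min), using decimal units as in the text:

```python
# Reproduces the sensor data-volume estimate from the example above.
READING_BYTES = 100      # upper bound for one calibrated sensor reading
SENSORS = 500            # sensors deployed in one neighborhood
READINGS_PER_HOUR = 6    # one reading every 10 minutes

per_hour = READING_BYTES * SENSORS * READINGS_PER_HOUR   # 300,000 B ≈ 300 KB
per_day = per_hour * 24                                  # 7,200,000 B ≈ 7.2 MB
per_year = per_day * 365                                 # 2,628,000,000 B ≈ 2.628 GB

print(f"hour: {per_hour:,} B, day: {per_day:,} B, year: {per_year / 1e9:.3f} GB")
```

Note that the figures use decimal (SI) units, as the running text does, rather than binary (1 KiB = 1024 B) units.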
The thematic/application specific data may comprise:

Mobile Phone Data: call data records (CDR) are generated when a mobile phone user performs any activity on a smart phone, e.g., calling, texting, etc. This data is normally collected by service providers and can be unstructured or semistructured data.

Utility Services Data: energy consumption, fuel (domestic methane gas, vehicle fuel) usage, water consumption, etc. This can be semistructured data.

Sensor Data: variety of sensors including air quality, traffic counters, etc. This can be unstructured data.

Social Media Data: from social media, Twitter, Facebook, etc. This is unstructured data.

CCTV Data: from CCTV footage, transport live feed, etc. This is unstructured data.
Security Data: crime/incidents at different areas represented as semistructured or structured data.

Public Health Data: health records, health surveys, etc. This can be semistructured or structured data.

Economic Data: public financial records as well as overall financial data of the city. This can be semistructured to structured data.

Survey Data: surveys carried out in different districts about quality of life, public services, etc. This can be unstructured as well as semistructured data.

Business Data: data about local enterprises/businesses, etc. This can be unstructured (e.g., business website) or semistructured data.

Energy Data: data from smart meters, energy grid (demand/supply), etc. This can be semistructured or structured data.

Transport Data: data about public/private transport, traffic counters, traffic cameras, fuel consumption for mobility, etc. This can be unstructured or semistructured data.

Air Quality Data: particulate concentration in the air – PM10, CO, etc. This can be unstructured or semistructured data.

Open Government Data: Data published by the cities for open use by citizens and local businesses. A variety of formats and datasets may be provided and can be unstructured, semistructured, or structured.

In general, the city datasets can be available as unstructured, semistructured, and structured data, and in a variety of formats including PDF, XLS, XML, CSV, SHP, DRG, GeoTIFF, JPEG2000, X3D, GeoJSON, CityGML, etc., or be published online via, e.g., OGC (Open Geospatial Consortium) web services, like WMS, WMTS, WFS, etc.

Structured data is normally held in a relational data model and stored in databases of different departments or agencies of the city administration. Examples of such data can be land cadastre, population census, public health, energy consumption, and vector data, e.g., road networks, etc.

The semistructured data is often stored in file formats like, e.g., CSV, XML, or JSON, or in NoSQL databases. Often application-specific high-frequency data is stored in the above formats. Such data can be queried by writing specific scripts to generate the required information.

The unstructured data can be social media data (e.g., Twitter), citizens' comments on planning applications, audio/video, live traffic feeds (e.g., CCTV), satellite/raster imagery, city master plan or policy documents (e.g., PDF), radar or sonar data (e.g., vehicular, meteorological data), surveys, mobile data, etc. Such data can be in text, audio/video, or image format, and hence it is challenging to query such data.

The above discussion highlights the following important aspects:

(i) Cities generate a large amount of data on a daily basis which possesses all "V" Big Data characteristics
(ii) Location-based (i.e., geo-coordinate related) data is a key element in Smart City Big Data to generate contextual information
(iii) City data may come in unstructured, semistructured, or structured formats and requires innovative Big Data processing approaches
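A "specific script" of the kind mentioned above for querying semistructured data can be as small as the following sketch, which filters a CSV feed of air quality readings against the EU daily PM10 limit value of 50 µg/m³; the station names, column names, and inline data are hypothetical:

```python
import csv
import io

# Hypothetical semistructured feed: hourly PM10 readings as CSV.
raw = io.StringIO(
    "station,timestamp,pm10\n"
    "S1,2017-06-01T08:00,18\n"
    "S1,2017-06-01T09:00,55\n"
    "S2,2017-06-01T08:00,41\n"
)

# Query: readings exceeding the EU daily PM10 limit value (50 µg/m³).
exceedances = [
    row for row in csv.DictReader(raw)
    if float(row["pm10"]) > 50.0
]
for row in exceedances:
    print(row["station"], row["timestamp"], row["pm10"])
```

In practice such a feed would be read from a file or an OGC web service rather than an in-memory string, but the query pattern is the same.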
Applications and Lessons Learned

Here we present key Smart Cities Big Data applications and associated data challenges which need investigation for the development of Smart City information systems.

Applications
The Framework Programme (FP) Seven UrbanAPI project (Khan et al. 2014b) used both base data and thematic/application specific data, which provides a suitable example of the use of unstructured, semistructured, and structured datasets. For example, Peters-Anders et al. (2017) used mobile data (CDR) to gain new insights into the urban environment. It was observed that the mobile data collected by telecom operators may range into a few 100s of GBytes to TBytes on a weekly basis. Mostly data is received in customized binary format or in proprietary file formats, and millions of records are stored in
sequential (unstructured) order. Processing of such data requires a clear understanding of the file formats to extract the required information. It was also observed that the quality and accuracy of the time series data need to be checked before processing and drawing conclusions on the outcomes. Peters-Anders et al. (2017) explained various base datasets used with the mobile data to generate an interactive visual interface depicting daily population distribution and mobility patterns. The authors highlighted data quality issues (e.g., differences in temporal and spatial resolution; lack of fine granularity in mobile data; privacy issues) and processing issues (e.g., lack of a common data model), and suggested advanced techniques, e.g., a MapReduce model or cloud computing, that can be applied to generate the required outputs.

In contrast, the FP7 DECUMANUS project (Garzon et al. 2014) provides a suitable example of a Big Data application where satellite-based very high resolution images are collected, often in 100s of MBytes to GBytes, for the development of Smart City services. For some services, e.g., the climate service, data could reach 100s of GBytes to TBytes. One of the positive aspects of the FP7 DECUMANUS project was that it demonstrated practical use of satellite-based earth observation data in city planning (Ludlow et al. 2017). The higher resolution imagery; the capacity to capture more updated images on a frequent basis; lower cost; and the potential to complement any gaps by integrating it with in-situ data were the key success factors. However, the data comes in unstructured form, and complex algorithms have to be used for digital processing and integration with in-situ data (e.g., base or thematic data). Different DECUMANUS services demonstrated this complexity, e.g., climate change downscaling processing (San José et al. 2015a), health impact (San José et al. 2015b), urban green planning (Marconcini and Metz 2016), and energy efficiency. For instance, San José et al. (2015a) describe experiments where very high-resolution earth observation images, i.e., covering about 23 × 20 kilometers (km) at European latitude, were dynamically downscaled to about 200 and 50 m resolution scales, which is more suitable to visualize and analyze data at city scale. This downscaling was performed by using the Barcelona Supercomputing Center's high performance computing infrastructure (128 nodes/blades; each blade with 8 GB RAM and two 2.3 GHz processors; Global Parallel File System; Linux OS and Myrinet high bandwidth network to connect nodes) and consumed about 881,664 CPU hours.

As compared to the above applications, the Horizon 2020 smarticipate project makes use of base and thematic datasets to develop citizen participation applications and generate new datasets for public services (Khan et al. 2017a). The smarticipate platform enables cities to create topics of interest, i.e., a specific planning proposal in a neighborhood, and share it with citizens to seek their opinions, suggestions, and alternative proposals with the objective to co-create and/or co-design local neighborhood plans. The topic creation requires datasets from base data (e.g., base maps, layers of information, e.g., road networks, street furniture, etc.) and application/thematic data, e.g., air quality data, soil data, noise, utility infrastructure, etc. For example, Hamburg's tree planting use case requires base maps with street networks, soil data, utility infrastructure, and existing tree data. All this data can be in 2D as well as in 3D to provide a visual representation of the area. This means topic creation can make use of structured and semistructured data. Once citizens are invited to participate for a specific topic, they can provide Web 2.0 based comments by engaging in a dialogue with other citizens (unstructured data), showing likes/dislikes, or creating alternative proposals by adding more data on the platform.

It is expected that a typical topic creation may add 10s of MBytes to 100s of MBytes (3D scene) and may seek citizens' participation for a full week (7 days). Let us assume 50 MBytes is used for tree planting topic creation. Now imagine on average each user adds about 1000 bytes of comments to one proposal, and if we assume that about 1000 citizens provide their comments, the total amount of unstructured data generated will be 1000 × 1000 = 1,000,000 bytes, i.e., approximately 1 MByte. This
means one instance of a proposal can generate about 1 MByte of unstructured data, so the total data for one instance will become 50 MBytes (topic) + 1 MByte (comments) = 51 MBytes. If half of the sample group create alternative proposals (with the same base/thematic or new data) and attract new comments on the alternative proposals, then the data generated will be: 51 MBytes (data from main topic/proposal) × 500 (alternative proposals) = 25.5 GBytes. This means, typically, about 25 GBytes of data per week is expected. Imagine if there are 10 different topics initiated by the city council in different neighborhoods; then it is expected to generate about 250 GBytes of data per week. This amount may further increase if the topic creation is more complex, such as the Forte Trionfale refurbishment scenario (i.e., with buildings and other surrounding infrastructure) for the city of Rome, where over 100 MBytes may be used for topic creation. Over time the amount of data generated by the smarticipate platform will increase significantly, and to be effective it must utilize Big Data management and processing capabilities by using Edge, Hadoop, Spark, and Storm, etc., based approaches.
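The participation-data estimate above can be checked with a short calculation; every figure is an assumption taken from the text (50 MBytes per topic, about 1000 bytes per comment, 1000 citizens, 500 alternative proposals, 10 topics):

```python
MB = 10**6  # decimal megabyte, as used in the running example

topic = 50 * MB                    # 3D scene for one topic
comments = 1000 * 1000             # 1000 citizens × ~1000 bytes each ≈ 1 MB
one_instance = topic + comments    # 51 MB

alternatives = 500                 # half of the 1000 participants
per_topic = one_instance * alternatives   # 25.5 GB per topic per week
per_week = per_topic * 10                 # 10 topics city-wide

print(f"one instance: {one_instance / MB:.0f} MB, "
      f"per topic: {per_topic / 1e9:.1f} GB, "
      f"city-wide: {per_week / 1e9:.0f} GB/week")
```

The text rounds 25.5 GBytes per topic down to about 25 GBytes, hence its figure of roughly 250 GBytes per week for ten topics.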
partments or agencies. In addition, often there is
Challenges as Lessons Learned incomplete or no meta-data available and not all
Structuring of Data: The structuring of data the data comes with geo-coordinates. Data may
varies from one city to another, i.e., one city have different spatial resolution scales and often
may have a specific dataset in unstructured for- data is not available in the required spatial scale,
mat, whereas another city may have designed e.g., up to few centimeters. All these aspects
it in well-structured format. For example, most compromise the quality of the data and require
cities have their city master plan in a static for- rigorous data harmonization and compatibility
mat, e.g., image or document. However, cities efforts and preprocessing to prepare the data for
with more advanced data collection and manage- possible use in different thematic applications.
ment resources (e.g., Hamburg in the smarticipate Veracity of Data: The level of accuracy of the
project) have their city master plan defined in an base and thematic data may vary from one ap-
Object Oriented Data Model that provides more plication to another. Most base data are collected
flexibility to query the plan and perform monitor- by using reliable data collection methods, e.g.,
ing and validation of land-use development. LiDAR, official population census, etc. However,
The spatial data is often represented in a spe- not all the thematic applications relying on the
cific spatial reference system. Cities in different data from citizen science or public participa-
countries use different spatial reference systems tion, down-scaled satellite imagery, etc., provide
which may require Smart City applications to highly accurate data, and this suggests the need
accommodate different reference systems for har- to validate the data source to make it trustworthy,
monizing and processing the data and providing reliable, and usable in other applications or in
the information in a uniform representation, e.g., decision-making processes. For instance, W3C
to perform comparative analysis. PROV model (2013) is used in citizen participa-
300 Big Data in Smart Cities
tion applications to validate the source of data and establish trust in it (IES Cities 2013).

Data Access and Availability: A lot of such data is managed in different departmental silos. This data is often without standard accessible interfaces or APIs (e.g., OGC’s WFS interface), which makes it difficult to discover/search or access from the external world. As a result, cities might have duplicate and often nonmatching data in different departments or agencies.

Data Security and Privacy: Security- and privacy-related concerns are right up the agenda in response to security and privacy legislation, e.g., the Data Protection Act 1998 or the General Data Protection Regulation (GDPR). While suitable harmonizing, integration, and processing approaches are needed for city data, these must not accidentally violate individuals’ privacy. This becomes really challenging when multiple data sources are combined for Smart City Big Data processing. Also, Smart City services should provide end-to-end security and privacy by adopting novel security frameworks, e.g., the SSServProv Framework (Khan et al. 2017b), or blockchain-based authentication and self-verification techniques, e.g., VeidBlock (Abbasi and Khan 2017).

Data Processing and Data Transfer: Many applications are developed by using OGC-compliant services. Use of Apache Hadoop, Spark, and Storm based platforms permits the processing of historical as well as real-time data (Khan et al. 2014a, 2015; Soomro et al. 2016). Other libraries and tools for geo-processing (e.g., Java-based GeoServer on the back-end and JavaScript-based mapping libraries like OpenLayers, Leaflet, etc., on the front-end) can be utilized to process and present city data to different stakeholders (Khan et al. 2017a). In some cases, microservices or a typical service-oriented environment can be used to design service workflows; data can be accessed and processed by using OGC-compliant WFS and WPS interfaces. In most cases, data is fetched from remote repositories, transferred, and stored in central repositories (or on the cloud) to process the data (Khan et al. 2012). In such situations, a transfer of huge amounts of data takes place which is not sustainable, and hence requires alternative distributed computing models such as fog computing (Perera et al. 2017) and/or edge computing (Ai et al. 2017).

Summary and Future Directions for Research
This entry provides a thorough insight into Smart City Big Data, covering its “V” characteristics and its main categories: base and thematic data. From the Smart City Big Data perspective, we presented three applications from the FP7 UrbanAPI, FP7 DECUMANUS, and H2020 smarticipate projects. We also highlighted various data-related challenges cities are facing today, including city data harmonization, integration, and processing, which require further research to apply distributed or edge computing models.

The applications and challenges provide pointers to future research directions including: (i) designing common data models for data harmonization and application development; (ii) defining and using standard approaches for data representation or exchange, e.g., CityGML; (iii) inventing new edge-based models so that processing can be performed close to the data source and unnecessary data transfer can be limited; (iv) providing necessary security and privacy measures to avoid violation of privacy laws; (v) assessing and improving data quality and provenance through new techniques; (vi) introducing intuitive visualization of data for better awareness raising; and (vii) innovative data analytics approaches to generate new information and knowledge to support decision-making in city governance.

Cross-References
Big Data Analysis for Smart City Applications
Big Data Analysis and IOT
Linked Geospatial Data
Query Processing: Computational Geometry
Spatio-Textual Data
Spatio-Social Data
Big Data in Social Networks 301
Streaming Big Spatial Data
Spatiotemporal Data: Trajectories

Acknowledgments The authors acknowledge that the work presented in this chapter was funded by the University of the West of England, Bristol, UK and three European Commission projects: FP7 UrbanAPI (Grant# 288577), FP7 DECUMANUS (Grant# 607183) and H2020 Smarticipate (Grant# 693729). We also thank David Ludlow from the University of the West of England, Bristol, UK for reviewing and providing constructive feedback on this chapter.

References
Abbasi AG, Khan Z (2017) VeidBlock: verifiable identity using blockchain and ledger in a software defined network. In: SCCTSA2017 co-located with 10th utility and cloud computing conference, Austin, 5–8 Dec 2017
Ai Y, Peng M, Zhang K (2017) Edge cloud computing technologies for Internet of things: a primer. Digital Commun Netw. https://doi.org/10.1016/j.dcan.2017.07.001
Garzon A, Palacios M, Pecci J, Khan Z, Ludlow D (2014) Using space-based downstream services for urban management in smart cities. In: Proceedings of the 7th IEEE/ACM utility and cloud computing, 8–11 Dec 2014. IEEE, London, pp 818–823
IES Cities (2013) Internet-enabled services for the cities across Europe, European Commission, Grant Agreement 325097
Khan Z, Ludlow D, McClatchey R, Anjum A (2012) An architecture for integrated intelligence in urban management using cloud computing. J Cloud Comput Adv Syst Appl 1(1):1–14
Khan Z, Liaquat Kiani S, Soomro K (2014a) A framework for cloud-based context-aware information services for citizens in smart cities. J Cloud Comput Adv Syst Appl 3:14
Khan Z, Ludlow D, Loibl W, Soomro K (2014b) ICT enabled participatory urban planning and policy development: the UrbanAPI project. Transform Gov People Process Policy 8(2):205–229
Khan Z, Anjum A, Soomro K, Muhammad T (2015) Towards cloud based big data analytics for smart future cities. J Cloud Comput Adv Syst Appl 4:2
Khan Z, Dambruch J, Peters-Anders J, Sackl A, Strasser A, Frohlich P, Templer S, Soomro K (2017a) Developing knowledge-based citizen participation platform to support Smart City decision making: the smarticipate case study. Information 8(2):47
Khan Z, Pervaiz Z, Abbasi AG (2017b) Towards a secure service provisioning framework in a Smart City environment. Futur Gener Comput Syst 77:112–135
Ludlow D, Khan Z, Soomro K, Marconcini M, Metz A, Jose R, Perez J, Malcorps P, Lemper M (2017) From top-down land use planning intelligence to bottom-up stakeholder engagement for smart cities – a case study: DECUMANUS service products. Int J Serv Technol Manag 23(5/6):465–493
Marconcini M, Metz A (2016) A suite of novel EO-based products in support of urban green planning. In: Conference on urban planning and regional development in the information society, 22–24 June 2016
Open Government Data (2017) Open knowledge foundation [online]. https://okfn.org/about/our-impact/opengovernment/. Last accessed 16 Oct 2017
Perera C, Qin Y, Estrella J, Reiff-Marganiec S, Vasilakos A (2017) Fog computing for sustainable smart cities: a survey. ACM Comput Surv 50(3):1–43
Peters-Anders J, Khan Z, Loibl W, Augustin H, Breinbauer A (2017) Dynamic, interactive and visual analysis of population distribution and mobility dynamics in an urban environment using the mobility explorer framework. Information 8(2):56
San José R, Pérez J, Pecci J, Garzón A, Palacio M (2015a) Urban climate impact analysis using WRF-chem and diagnostic downscaling techniques: Antwerp, Helsinki, London, Madrid and Milan case studies. In: Proceedings of the VII atmospheric science symposium, Istanbul, vol 27, pp 223–247, Apr 2015
San José R, Pérez J, Pérez L, González R, Pecci J, Garzón A, Palacios M (2015b) Regional and urban downscaling of global climate scenarios for health impact assessments. Meteorol Clim Air Qual Appl 27:223–247
Soomro K, Khan Z, Hasham K (2016) Towards provisioning of real-time Smart City services using clouds. In: 9th international conference on utility and cloud computing, Shanghai, 06–09 Dec 2016
W3C PROV model (2013) W3C PROV model primer, W3C working group note 30 Apr 2013. https://www.w3.org/TR/prov-primer/. Last accessed 12 Oct 2017

Big Data in Social Networks

Antonio Picariello
Università di Napoli Federico II – DIETI, Napoli, Italy

Synonyms
Big data analytics; Social networking
generated contents. Hence, it is possible to exploit the relationships established among users, and between users and contents, for different purposes, such as (i) identifying the most appropriate content for specific users, (ii) ranking social network users with respect to a given item, and (iii) identifying a subset of users to maximize the influence of specific contents, and so on. The following – famous – social networks fall into these categories: Twitter, Facebook, Instagram, Google+, Last.fm, and Flickr.

– Location-based social networks: These networks introduce location as a new object. Some examples are Foursquare, Yelp, and TripAdvisor.
– Social movie rating businesses: In this category, we can identify the following social networks: Flixster, IMDb, and MovieLens.

Social networking has its beginnings in the work of social scientists in the context of human social networks, of mathematicians and physicists in the context of complex network theory, and, most recently, of computer scientists in the investigation of information social networks. New challenges and research topics are thus arising, especially concerning marketing applications like recommendation, influence maximization, influence diffusion, viral marketing, word of mouth, etc., and also problems related to security like lurker detection and malicious user detection (Amato et al. 2016a).

Social Networks Models
Graphs are the most suitable data structure that can reflect and represent social networks: a graph can exactly depict the relationships among different users. How to model an OSN with a graph depends on the different kinds of objects forming the network.

Generally speaking, a social network is a directed graph G = (V, E), V being the set of vertices of the network – e.g., users, multimedia objects, tags, and so on – and E the set of edges among vertices, e.g., the social ties.

The different entities typically considered in an OSN are: users, the set of users that are the subject of the analysis, every user corresponding to a node of the graph (call this set U); social relationships, the set of relationships among two users, which we can refer to as the set of edges E; multimedia objects, the media contents generated by users and through which users can interact among them; and tags, the keywords associated with a particular user-generated content.

In the last few years, the scientific community has focused particular interest on multimedia social networks, i.e., OSNs extended with multimedia user-generated content. In this case, we can detect different relationships: user to user, the most common relationship, where two users are connected to each other (e.g., friendships, memberships, following, and so on); similarity, the kind of relationship that exists between two multimedia objects and refers to the similarity of these contents, based on the analysis of high-level and low-level features; and user to multimedia, which represents the actions that a particular user can make on a particular content.

In order to handle the different kinds of data contained in an OSN in a compact way, a powerful data structure that is emerging is the hypergraph (Amato et al. 2016b). A hypergraph is the generalization of the concept of a graph, in which edges are not restricted to exactly two vertices but may join any number of vertices. Formally, a hypergraph is defined as H = (X, E), X being a set of vertices and E a set of non-empty subsets of X, called hyperedges or edges. Therefore, E is a subset of 2^X \ {∅}, where 2^X is, as usual, the power set of X.

Social Network Analysis
When applied on an OSN, big data analytics techniques may provide what is sometimes called “the wisdom of the crowds” (i.e., the collective opinion of a group of individuals rather than that of a single expert), thus revealing sentiments, detecting typical patterns, providing recommendations, and influencing people or groups of people. Social network analysis, or SNA, is one of the key points of modern sociology, especially for the study of society and social relationships between people. Nowadays, the growing popularity of OSNs and the data generated by them lay
the foundations for analyzing several sociological phenomena that are used in many fields. SNA involves several techniques like machine learning, text analysis, and media analysis. Analyzing such a huge amount of data allows us to understand the social ties between nodes. The chance to analyze this huge amount of data is really important for many kinds of applications in very different fields. Among these applications are: recommendation systems, which suggest products to users based on collaborative filtering, in which the products to suggest are chosen based on the interests of similar users (Amato et al. 2017b); influence analysis, the analysis of the most influential users in the network (Amato et al. 2017a); community detection, the detection of different communities of similar users in a network; and advertising systems, which decide which ads should be promoted to users.

Briefly, SNA may be broadly divided into two categories: (a) linkage-based and structural analysis, the analysis of links among the nodes of a network to understand social relationships, and (b) content-based analysis, in which the analysis is based on contents generated in the OSN.

Influence Diffusion
Influence diffusion analysis describes how information propagates across the users in a network. The main goal of this research field is to create a diffusion model that describes how information could be “diffused.” Different models have been proposed in the literature, which can be classified into two categories: progressive and nonprogressive models. The nonprogressive models make the assumption that nodes’ states can change indefinitely, while in the progressive ones, when a node is active, it cannot switch its status. For both the described categories, it is possible to classify the proposed models into stochastic-, epidemiologic-, and voting-based models.

The stochastic diffusion models are based on stochastic processes; in this case, considering an OSN modeled as a graph, a node can have only two possible states, active and inactive, depending on whether the node has been influenced by the new information or not. Stochastic diffusion models have two properties related to the influence spread function: (a) monotonicity, meaning that if we add more nodes to a seed set, we cannot reduce the size of the final set, and (b) sub-modularity, meaning that the function has the property of diminishing returns. Note that sub-modularity tells us that the gain obtained by adding a node v to the total set T is always lower than – or equal to – the gain of adding that node v to a subset S ⊆ T. In mathematical terms:

f(S ∪ {v}) − f(S) ≥ f(T ∪ {v}) − f(T)   (1)

The most used stochastic diffusion models are the independent cascade model (IC) and the linear threshold model (LT).

The IC model is described by Kempe et al. (2003a), based on the marketing research of Goldenberg et al. (2001). This model is also related to epidemic models (Keeling and Eames 2005). In this approach, every edge e(u,v) ∈ E has a probability p(u,v) ∈ [0, 1], and the model works as follows:

1. at the initial step t = 0, an initial seed set S_0 is activated, while the remaining nodes are inactive;
2. each active node u executes an activation attempt by performing a Bernoulli trial with probability p(u,v) to activate each of its neighbors v_i;
3. if a node v has not been activated, the node u has no other chances to achieve this. If the node v is activated, we add this node to the active set S_{t+1} at the following step t + 1. When a node is activated, it cannot turn inactive again;
4. the steps 1–3 are repeated until no new nodes become activated, at time t_n.

This kind of diffusion is very useful to explain the diffusion behavior of information or viruses, like in a contagion.

Figure 1 shows an example of this diffusion process. At time t = 0 only the node a is active (Fig. 1a), and it represents the seed set. At step t = 1, node a attempts to activate nodes b, c, and d (Fig. 1b), and only d is activated; in this case a fails to activate b and c. At this point, the edges already used, or whose destination is already active, are left out. At step t = 2 only d tries to activate its neighbors c, e, and f (Fig. 1c). In the
Big Data in Social Networks, Fig. 1 Independent cascade algorithm. (a) t = 0. (b) t = 1. (c) t = 2. (d) t = 3

last step, t = 3, node e tries to affect g but fails, and the diffusion stops because there are no outgoing edges to inactive nodes (Fig. 1d). The main problem of this model is that we assume that in each step, a node can be activated only by one of its neighbors, and so the influence of a node depends only on one source.

To overcome the constraint that the influence of a node depends only on one source, scientists considered a novel approach where a node is activated when it receives influence from more than one neighbor, also known as the LT model, proposed by Kempe et al. (2003b). In this model, every node v has a weight w(v) ∈ [0, 1] that indicates the threshold to be reached to influence the node v. This means that to activate the node v, the sum of the probabilities p(u,v) of its active neighbors has to be more than w(v). This corresponds to saying that a single node activation depends on multiple sources. Initially, only an initial seed set S_0 is active. The steps of the algorithm are the following:

1. at each step t_i, every node u ∈ S_{t_i} attempts to activate its inactive neighbors v_i with a probability p(u,v), where S_{t_i} indicates the active set at time t_i. The node v becomes active if the sum of the probabilities is greater than its threshold. If we indicate by N(v) the set of neighbors of v, it becomes active if:

Σ_{u ∈ N(v) ∩ S_{t_i}} p(u,v) ≥ w(v)   (2)

2. if the node v becomes active, it belongs to S_{t_{i+1}}, and it can attempt to influence its neighbors. If it does not become active, other newly activated nodes can contribute to activating it in further steps.

Figure 2 shows an example of the LT diffusion process. As in the IC model, when a node becomes active, we mark it with red color; in addition, near each node v its threshold w(v) is shown. At step t = 0, only the node a is active (Fig. 2a); it begins to influence its neighbors, but it succeeds only with c and b because their thresholds are lower than p(a,b) and p(a,c) (Fig. 2b). The probability p(a,d) is not sufficient to activate d, but a continues to exert influence at step t = 2 where, through the influence of c and b, the threshold is reached and node d is activated. The process goes on until no new nodes become active (Fig. 2c).
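The IC process in steps 1–4 above can be sketched in a few lines of Python. This is a minimal illustration, not code from the entry: the graph encoding (a dict of out-edges with probabilities), the function name, and the example probabilities are all assumptions.

```python
import random

def independent_cascade(edges, seed, rng=random.Random(0)):
    """Simulate the IC model: each node activated at step t gets exactly one
    Bernoulli trial per out-edge (u, v), with success probability p(u, v)."""
    active = set(seed)        # activated nodes never turn inactive again
    frontier = list(seed)     # nodes activated in the previous step
    while frontier:           # repeat until no new nodes become activated
        newly_active = []
        for u in frontier:
            for v, p in edges.get(u, []):
                # u has no further chances on v after this single trial
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active

# Toy usage with deterministic probabilities (1.0 always fires, 0.0 never):
edges = {"a": [("b", 1.0), ("c", 0.0)], "b": [("d", 1.0)]}
print(independent_cascade(edges, {"a"}))  # activates a, b, and d
```

With probabilities strictly between 0 and 1, repeated runs with different seeds of the random generator approximate the expected spread of the seed set.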
Big Data in Social Networks, Fig. 2 Linear threshold model. (a) t = 0. (b) t = 1. (c) t = 2. (d) t = 3. (e) t = 4. (f) t = 5
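The LT activation rule of Eq. (2) can likewise be sketched in Python. Again a hedged illustration: the dictionaries encoding the influence probabilities p(u, v) and the thresholds w(v), and the function name, are assumed for the example and are not taken from the entry.

```python
def linear_threshold(p, w, seed):
    """Simulate the LT model: an inactive node v becomes active once the sum
    of p(u, v) over its already-active in-neighbors reaches its threshold w(v).
    Activation is progressive, so active nodes never switch back."""
    active = set(seed)
    changed = True
    while changed:            # iterate until no node changes state in a pass
        changed = False
        for v, threshold in w.items():
            if v in active:
                continue
            # Eq. (2): sum p(u, v) over the active neighbors u of v
            influence = sum(prob for (u, x), prob in p.items()
                            if x == v and u in active)
            if influence >= threshold:
                active.add(v)
                changed = True
    return active

# Toy usage: a alone cannot activate c, but once b is active their
# combined influence pushes c over its threshold.
p = {("a", "b"): 0.6, ("a", "c"): 0.3, ("b", "c"): 0.3}
w = {"b": 0.5, "c": 0.5}
print(linear_threshold(p, w, {"a"}))  # activates a, then b, then c
```

The multi-source behavior described in the text shows up in the example: c's activation requires the accumulated influence of both a and b, not a single neighbor.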
vertices of the same cluster and comparatively few edges joining vertices of different clusters. Such clusters, or communities, can be considered as fairly independent compartments of a graph. Detecting communities is of great importance in sociology, biology, and computer science, where systems are often represented as graphs. This problem is very hard and not yet satisfactorily solved, despite the huge effort of a large interdisciplinary community of scientists working on it over the past few years. In graph theory, community detection is a fundamental task to analyze and understand the structure of complex networks. As reported in Wang et al. (2015), “revealing the latent community structure, which is crucial to understanding the features of networks, is an important problem in network and graph analysis.”

During the last decade, many approaches have been proposed to solve this challenging problem in diverse ways. The term “community” has been extensively used in the literature in different contexts and with different connotations. A first definition is provided by Gulbahce and Lehmann (2008), who define a community as “a densely connected subset of nodes that is only sparsely linked to the remaining network.” Porter et al. (2009) provide a definition based on the fields of sociology and anthropology, in which a community is a “cohesive group of nodes that are connected more densely to each other than to the nodes in other communities.” Moreover, Fortunato (2010) defines communities as “groups of vertices that probably share common properties and/or play similar roles within the graph.” The most commonly used definition is that of Yang et al. (2010), which describes “a community as a group of network nodes, within which the links connecting nodes are dense but between which they are sparse.” Papadopoulos et al. (2012) ultimately define communities as “groups of vertices that are more densely connected to each other than to the rest of the network.”

However, the definition of a community is not clear because it is complex to define the concept of a “good” community. Several possible communities can be computed in a social network G(V, E), whose enumeration is an NP-complete problem. The aim of several community detection algorithms is to optimize a goodness metric that represents the quality of the communities detected from the network.

It is possible to define the two main types of community. A strong community is a subgraph in which each vertex has a higher probability of being connected with the vertices of the same community than with a node in a different community. A weak community is a subgraph in which the average edge probability between each vertex and the vertices in the same community exceeds the average edge probability between that vertex and the vertices of any other group.

As shown in Chakraborty et al. (2017), community detection analysis is usually composed of two phases: first, detection of a meaningful community structure from a network and, second, evaluation of the appropriateness of the detected community structure. In particular, the second phase raises the importance of defining several metrics that are the fundamentals of different community detection definitions. Moreover, community detection approaches can be classified into two categories: global algorithms, which require knowledge of the entire network, and local algorithms, which expand an initial seed set of nodes into possibly overlapping communities by examining only a small part of the network.

Finally, the algorithms for community detection can be classified as follows: (a) cohesive subgraph discovery, (b) vertex clustering, (c) community quality optimization, (d) divisive, and (e) model based.

References
Amato F, Moscato V, Picariello A, Piccialli F, Sperlí G (2016a) Centrality in heterogeneous social networks for lurkers detection: an approach based on hypergraphs. Concurr Comput Pract Exp 30(3):e4188
Amato F, Moscato V, Picariello A, Sperlí G (2016b) Multimedia social network modeling: a proposal. In: 2016 IEEE tenth international conference on semantic
308 Big Data in the Cloud
Key Research Findings

Storage Levels for Big Data in the Cloud
A cloud storage consists of an enormous number (on the order of thousands) of storage server clusters connected by a high-bandwidth network. A storage middleware (e.g., SafeSky (Zhao et al. 2015)) is used to provide a distributed file system and to deal with storage allocation throughout the cloud storage.

Generally, cloud providers offer various storage levels with different pricing and latencies. These storage levels can be utilized to store big data in the cloud, depending on the price and latency constraints. Amazon cloud, for instance, provides object storage, file storage, and block storage to address the sheer size, velocity, and formats of big data.

Object storage is cheaper and slower to access, generally used to store large objects with a low access rate (e.g., backups and archive data). This type of storage service operates based on Hard Disk Drive (HDD) technology. Amazon Simple Storage Service (https://aws.amazon.com/s3/) (S3) is an example of an object storage service (Wu et al. 2010).

File storage service is also offered based on HDD storage technology. However, it provides a file system interface, file system access consistency, and file locking. Unlike object storage, file storage capacity is elastic, which means it grows and shrinks automatically in response to the addition and removal of data. Moreover, it works as a concurrently accessible storage for up to thousands of instances. Amazon Elastic File System (https://aws.amazon.com/efs/) (EFS) is an example of this type of storage.

Block storage is offered based on Solid-State Drives (SSD) and provides ultra-low access latency. Big data that is frequently accessed or delay sensitive can be stored in this storage level. Amazon Elastic Block Store (EBS) is an example of an Amazon cloud service offered based on block storage. High-performance big data applications (e.g., Sagiroglu and Sinanc 2013) commonly use this storage level to minimize their access latency.

Big Data: Persistence Versus Distribution
Although clouds provide an ideal environment for big data storage, the access delay to them is generally high due to network limitations (Terzo et al. 2013). The delay can be particularly problematic for frequently accessed data (e.g., the index of a big data search engine). As such, in addition to storage services, other cloud services need to be in place to manage the distribution of big data so that the access delay can be reduced. That is, separating the big data storage in the cloud from its distribution.

A Content Delivery Network (CDN) is a network of distributed servers aiming to reduce data access delay over the Internet (Pierre and Van Steen 2006). The CDN replicates and caches data within a network of servers dispersed at different geographical locations. Figure 1 depicts how CDN technology operates. As we can see in this figure, upon receiving a request to access data, the CDN control server caches the data in the nearest edge server to deliver it with minimum latency.

Cloud providers offer distribution services through built-in CDN services or by integrating with current CDN providers. For instance, Amazon cloud offers the CloudFront (Dignan 2008) CDN service. Akamai, a major CDN provider, has been integrated with the Microsoft Azure cloud to provide short delays in accessing big data (Sirivara 2016).

Real-Time Big Data Analytics in the Cloud
Cloud big data can be subject to real-time processing in applications that detect patterns or perceive insights from big data. For instance, in S3C (Woodworth et al. 2016) huge index structures need to be accessed to enable a real-time search service. Other instances are big data analytics for sensor data (Hu et al. 2014), financial data (Hu et al. 2014), and streaming data (Li et al. 2016). The faster organizations can fetch insights from big data, the greater their chance of generating revenue, reducing expenditures, and increasing productivity. The main advantage of separating
Big Data in the Cloud, Fig. 1 CDN technology helps to minimize access latency for frequently accessed big data in
the cloud
storage from distribution is to help organizations to process big data in real time. In most cases, data collected from various sources (e.g., sensor data, organizations’ data) are raw and noisy and, thus, are not ready for analysis. Jagannathan (2016) proposes a method to deal with this challenge.

Security of Big Data in the Cloud
One major challenge in utilizing the cloud for big data is security concerns (Dewangan and Verma 2015). The concern for big data owners is essentially not having full control over their data hosted by clouds. The data can be exposed to external or, more importantly, to internal attacks. A survey carried out by McKinsey in 2016 (Elumalai et al. 2016) showed that security and compliance are the two primary considerations in selecting a cloud provider, even beyond cost benefits. In October 2016, external attackers performed a Distributed Denial of Service (DDoS) attack that made several organizations dealing with big data (e.g., Twitter, Spotify, GitHub) inaccessible. Prism (Greenwald and MacAskill 2013) is an example of an internal attack launched in 2013. In this section, we discuss major security challenges for big data in the cloud and current efforts to address them.

Security of Big Data Against Internal Attackers
In addition to common security measures (e.g., logging or node maintenance against malware), user-side encryption (Salehi et al. 2014; Woodworth et al. 2016) is becoming a popular approach to preserve the privacy of big data hosted in the cloud. That is, the data encryption is done using the user’s keys and on the user’s premises; therefore, external attackers cannot see or tamper with the data. However, user-side encryption brings about higher computational complexity and hinders searching big data (Wang et al. 2010).

Several research works have been undertaken recently to enable different types of search over encrypted data in the cloud. Figure 2 provides a taxonomy of research works on the search over encrypted data in the cloud. In general, proposed search methods are categorized as search over structured and unstructured data. Unstructured searches can be further categorized as keyword search, regular expression search, and semantic search. Privacy-preserving query over encrypted graph-structured data (Cao et al. 2011) is an instance of search over encrypted structured data. REseED (Salehi et al. 2014), SSE (Curtmola et al. 2006), and S3C (Woodworth et al. 2016) are
Big Data in the Cloud, Fig. 3 Work-flow of IoT systems dealing with big data in the cloud
tools developed for regular expression (Regex), keyword, and semantic searching, respectively, over unstructured big data in the cloud. Big data integrity in the cloud is another security concern. The challenge is how to verify the data integrity in the cloud storage without downloading and then uploading the data back to the cloud. The rapid incoming of new data to the cloud makes verifying data integrity further complicated (Chen and Zhao 2012). Techniques such as continuous integrity monitoring are applied to deal with the big data integrity challenge in the cloud (Lebdaoui et al. 2016).

Security Against External Attackers
External attackers are generally considered as potential threats to clouds that handle big data. If cloud storage faces external attacks, users’ data will be leaked and controlled by attackers. So, implementing security on a cloud that handles big data is a challenge. To address this security challenge, research has been done in the area. For instance, a layered framework (Reddy et al. 2012) is proposed for assuring big data analytics security in the cloud. It consists of securing multiple layers, namely, the virtual machine layer, storage layer, cloud data layer, and secured virtual network monitor layer.

IoT Big Data in the Cloud
The Internet of Things (IoT) is a mesh of physical devices embedded with electronics, sensors, and network connectivity to collect and exchange data (Atzori et al. 2010). IoT devices are deployed in different systems, such as smart cities and smart homes, or to monitor environments (e.g., the smart grid). The basic work-flow of IoT systems that use cloud services (Dolgov 2017) is depicted in Fig. 3. According to the figure, the IoT devices produce streams of big data that are hosted and analyzed using cloud services. Many of these
systems (e.g., monitoring systems) require low-latency (real-time) analytics of the collected data. Big data generated in IoT systems generally suffer from redundant or noisy data, which can affect big data analytic tools (Chen et al. 2014).

Knowledge Discovery in Databases (KDD) and data mining technologies offer solutions for the analysis of big data generated by IoT systems (Tsai et al. 2014). The big data are converted into useful information using KDD. The output of KDD is further processed using data mining or visualization techniques to extract insights or patterns from the data. Amazon IoT (https://aws.amazon.com/iot/), Dell Statistica (http://iotworm.com/internet-things-business-data-analytics-tools/), Amazon Greengrass (https://aws.amazon.com/greengrass/), and Azure Stream Analytics are tools that function based on the aforementioned work-flow in clouds to extract insights from IoT big data with low latency.

Fog computing (Bonomi et al. 2012) extends the ability of clouds to edge servers with the goal of minimizing latency and maximizing the efficiency of core clouds. Fog computing can also be used to deal with noisy big data in IoT (Bonomi et al. 2012). For instance, an Amazon service enables the creation of fog computing pipelines that feed other Amazon cloud services (e.g., Amazon SimpleDB and EMR) for big data processing.

Cloud-Based Big Data Analytics Tools
Unlike conventional data, which are either structured (e.g., XML, XAML, and relational databases) or unstructured, big data can be a mixture of structured, semi-structured, and unstructured data (Wu et al. 2014). It is estimated that around 85% of the data produced by organizations are unstructured and, moreover, almost all data generated by individuals are unstructured as well (Pusala et al. 2016).

Such heterogeneous big data cannot be analyzed using traditional database management tools (e.g., a relational database management system (RDBMS)). Rather, a set of new cloud-based tools has been developed for processing big data (Chen et al. 2014).

NoSQL – NoSQL (generally read as "Not Only SQL") is a framework for databases that provides high-performance processing for big data. Conventional relational database systems have several features (e.g., transaction management and concurrency control) that make them unscalable for big data. In contrast, NoSQL databases are relieved from these extra features and lend themselves well to distributed processing in clouds. These database systems can generally handle big data processing in real time or near real time. In this part, an overview of the NoSQL databases offered by clouds is given.

• Hadoop MapReduce – The MapReduce framework was published by Google in 2004 (Dittrich and Quiané-Ruiz 2012). Hadoop is an open-source implementation of the MapReduce framework, developed by Apache (Vavilapalli et al. 2013). Hadoop is a batch data analytics technology designed for failure-prone computing environments. The specialty of Hadoop is that developers need to define only the map and reduce functions that operate on the big data to achieve data analytics. Hadoop utilizes a special distributed file system, the Hadoop Distributed File System (HDFS), to handle data (e.g., read and write operations), and it achieves fault tolerance through data replication. Amazon Elastic MapReduce (EMR) (https://aws.amazon.com/emr/details/hadoop/) is an example of the Hadoop framework (Jourdren et al. 2012) offered as a service by the Amazon cloud. Amazon EMR helps to create and handle elastic clusters of Amazon EC2 instances running Hadoop and other applications in the Hadoop ecosystem.

• Key-Value Store – A key-value database system stores, retrieves, and manages data as a mapping: each data item is stored as a value and identified by a key, and keys and values reside pairwise in the database system. Amazon DynamoDB (https://aws.amazon.com/dynamodb/), Google BigTable (https://cloud.google.com/bigtable/), and HBase (https://hbase.apache.org/) are examples of cloud-based key-value store databases for handling big data.
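The division of labor described above, where developers supply only the map and reduce functions, can be illustrated with a toy single-process word count. This is only a sketch of the programming model (no Hadoop involved; the shuffle/sort phase is simulated with an in-memory sort):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # User-defined map: emit an intermediate (word, 1) pair per token.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # User-defined reduce: aggregate all values emitted for one key.
    yield word, sum(counts)

def run_job(lines):
    # Single-process stand-in for the framework: run map over all input
    # records, simulate the shuffle/sort phase, then reduce per key.
    intermediate = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
    intermediate.sort(key=itemgetter(0))
    return {k: v
            for key, group in groupby(intermediate, key=itemgetter(0))
            for k, v in reduce_fn(key, (v for _, v in group))}

print(run_job(["big data in the cloud", "big data indexing"]))
```

In a real Hadoop job, only `map_fn` and `reduce_fn` would be written by the developer; partitioning, shuffling, and fault tolerance are handled by the framework.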
• Document Store – These database systems are similar to key-value store databases; however, keys are mapped to documents instead of plain values. Pairs of keys and documents are used to build the data model over big data in the cloud. This type of database system is used to store semi-structured data as documents, usually in JSON, XML, or XAML formats. The Amazon cloud offers Amazon SimpleDB (https://aws.amazon.com/simpledb/) as a document store database service that automatically creates and manages multiple distributed replicas of the data to enable high availability and data durability. The data owner can change the data model on the fly, and data are automatically indexed afterward.

• Graph Databases – These database systems are used for processing data that have a graph nature. For instance, users' connections in a social network can be modeled using a graph database (Shimpi and Chaudhari 2012). In this case, users are considered nodes, and connections between users are represented as edges in the graph. The graph can then be processed to predict links between nodes, that is, possible connections between users. Giraph (http://giraph.apache.org/), Neo4j (https://zeroturnaround.com/rebellabs/examples-where-graph-databases-shine-neo4j-edition/), and FlockDB (https://blog.twitter.com/engineering/en_us/a/2010/introducing-flockdb.html) are examples of cloud-based graph database services. Facebook (https://www.facebook.com), a large social networking site, currently uses the Giraph database system to store and analyze its users' connections.

• In-Memory Analytics – These are techniques that process big data in main memory (RAM) with reduced latency (Dittrich and Quiané-Ruiz 2012). For instance, Amazon provides the ElastiCache (https://aws.amazon.com/elasticache/) service to improve the performance of web applications that previously relied on slower disk-based databases. Besides that, there are also tools for performing in-memory analytics on big data in the cloud, such as Apache Spark, SAP, and MicroStrategy (Sultan 2015).

The different big data analytic tools introduced in this section are appropriate for different types of datasets and applications. There is no single perfect solution for all types of big data analytics in the cloud.

Future Directions for Research

Cloud-based big data analytics has progressed significantly over the past few years. The progression has led to the emergence of new research areas that need further investigation from industry and academia. In this section, an overview of these potential areas is provided.

Different businesses use various cloud providers to handle their big data. However, current big data analytics tools are limited to a single cloud provider. Solutions are required to bridge big data stored in different clouds. Such solutions should be able to provide real-time processing of big data across multiple clouds and should also consider the security constraints of big data residing in different clouds.

The distribution of access to data is not uniform in all big datasets. For instance, social network pages and video stream repositories (Li et al. 2016; Darwich et al. 2017) have long-tail access patterns; that is, only a few portions of the big data are accessed frequently, while the rest of the dataset is rarely accessed. Solutions are required to optimally place big data in different cloud storage types with respect to the access rate of the data. Such solutions can minimize both the incurred cost of using the cloud and the access latency to the data.

Security of big data in the cloud is still a major concern for many businesses (Woodworth et al. 2016). Although there are tools that enable processing (e.g., search) of user-side encrypted big data in the cloud, further research is required to expand the processing of such data without revealing it to the cloud. Making use of homomorphic encryption (Naehrig et al. 2011) is still in its infancy; it can be leveraged to enable more sophisticated (e.g., mathematical) data processing without decrypting the data in the cloud.
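The access-aware placement idea sketched above can be illustrated with a toy policy. The tier names and access thresholds below are invented for illustration and do not correspond to any provider's actual storage classes:

```python
# Toy sketch of access-aware placement for long-tail datasets: frequently
# accessed objects go to low-latency storage, rarely accessed ones to
# cheap archival storage. Thresholds are accesses per month (assumed).
TIERS = [
    ("hot", 100),   # premium low-latency storage
    ("warm", 10),   # standard storage
    ("cold", 0),    # archival storage
]

def place(objects):
    """Map each object name to a storage tier by its monthly access count."""
    placement = {}
    for name, accesses in objects.items():
        for tier, threshold in TIERS:
            if accesses >= threshold:
                placement[name] = tier
                break
    return placement

print(place({"viral_video": 50000, "old_log": 3, "quarterly_report": 40}))
```

A real placement solution would also weigh per-tier storage and retrieval prices, but the core decision remains matching access rate to storage type.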
…cost models for cloud computing add additional urgency for optimized processing.

A typical query in big data applications may touch files on the order of 100s of GBs or TBs in size. These queries are typically very expensive, as they consume significant resources and require long periods of time to execute. For example, in transaction log applications, e.g., the transaction history of customer purchases, one query might be interested in retrieving all transactions from the last 2 months that exceed a certain dollar amount. Such a query may need to scan billions of records and go over TBs of data.

Indexing techniques are well known in database systems, especially relational databases (Chamberlin et al. 1974; Maier 1983; Stonebraker et al. 1990), for optimizing query processing. Examples of the standard indexing techniques are the B+-tree (Bayer and McCreight 1972), the R-tree (Guttman 1984), and hash-based indexes (Moro et al. 2009), along with their variations. However, transferring these techniques and structures to big data infrastructures is not straightforward due to the unique characteristics of both the data itself and the underlying infrastructure processing the data. At the data level, the data is no longer assumed to be stored in relational tables. Instead, the data is received and stored in the form of big batches of flat files. In addition, the data size exceeds what relational database systems can typically handle. On the other hand, at the infrastructure level, the processing model no longer follows the relational model of query execution, which relies on connecting a set of query operators together to form a query tree. Instead, the MapReduce computing paradigm is entirely different, as it relies on two rigid phases of map and reduce (Dean and Ghemawat 2008). Moreover, the access pattern of the data from the file system is also different. In relational databases, the data records are read in the form of disk pages (a.k.a. disk blocks), which are very small in size (typically between 8 and 128 KBs) and usually hold few data records (tens or at most hundreds of records). Thus, it is assumed that the database system can support record-level access. In contrast, in the Hadoop file system (HDFS), a single data block ranges between 64 MBs and 1 GB and usually holds many records. Therefore, the record-level access assumption no longer holds. Even the feasible operations over the data differ from those supported in relational databases. For example, record updates and deletes are not allowed in the MapReduce infrastructure. All of these unique characteristics of big data fundamentally affect the design of the appropriate indexing and preprocessing techniques.

Plain Hadoop is found to be orders of magnitude slower than distributed database management systems when evaluating queries on structured data (Stonebraker et al. 2010; Abadi 2010). One of the main observed reasons for this slow performance is the lack of indexing in the Hadoop infrastructure. As a result, significant research efforts have been dedicated to designing indexing techniques suitable for the Hadoop infrastructure. These techniques have ranged from record-level indexing (Dittrich et al. 2010, 2012; Jiang et al. 2010; Richter et al. 2012) to split-level indexing (Eltabakh et al. 2013; Gankidi et al. 2014), from user-defined indexes (Dittrich et al. 2010; Jiang et al. 2010; Gankidi et al. 2014) to system-generated and adaptive indexes (Richter et al. 2012; Dittrich et al. 2012; Eltabakh et al. 2013), and from single-dimension indexes (Dittrich et al. 2010, 2012; Jiang et al. 2010; Richter et al. 2012; Eltabakh et al. 2013) to multidimensional indexes (Liu et al. 2014; Eldawy and Mokbel 2015; Lu et al. 2014).

Table 1 provides a comparison among several of the Hadoop-based indexing techniques with respect to different criteria. Record-level granularity techniques aim at skipping irrelevant records within each data split, but they may eventually touch all splits. In contrast, split-level granularity techniques aim at skipping entire irrelevant splits. The SpatialHadoop system provides both split-level global indexing and record-level local indexing, and thus it can skip irrelevant data at both granularities. Some techniques index only one attribute at a time (Dimensionality = 1), while others allow indexing multidimensional data (Dimensionality = m). Techniques like HadoopDB and Polybase Index inherit the multidimensional capabilities
Big Data Indexing 317
from the underlying DBMS. The E3 technique enables indexing pairs of values (from two attributes), but only for a limited subset of the possible values. Most techniques operate only on the HDFS data (DB-Hybrid = N), while HadoopDB and Polybase Index have a database system integrated with HDFS to form a hybrid system.

In most of the proposed techniques, the system's admin decides which attributes are to be indexed. The only exceptions are the LIAH index, an adaptive index that automatically detects changes in the workload and accordingly creates (or deletes) indexes, and the E3 index, which automatically indexes all attributes in possibly different ways depending on the data types and the workload. Finally, the index structure is either stored in HDFS along with its data, as in Hadoop++, HAIL, LIAH, SpatialHadoop, and ScalaGist; in a database system along with its data, as in HadoopDB; or in a database system while the data resides in HDFS, as in E3 and Polybase Index. The following sections cover a few of these techniques in more detail.

Target Queries: Indexing techniques target optimizing queries that involve selection predicates, which is the common theme for all techniques listed in Table 1. Yet, they may differ in how queries are expressed and in the mechanism by which the selection predicates are identified. For example, Hadoop++, HAIL, and LIAH allow expressing the query in Java while passing the selection predicates as arguments within the map-reduce job configuration. As a result, a customized input format will receive these predicates (if any) and perform the desired filtering during execution. In contrast, the E3 framework is built on top of the Jaql high-level query language (Beyer et al. 2011), and thus queries are expressed, compiled, and optimized using the Jaql engine (Beyer et al. 2011). An example query is as follows:

    read( hdfs("docs.json") )
      -> transform { author: $.meta.author,
                     products: $.meta.product,
                     Total: $.meta.Qty * $.meta.Price }
      -> filter $.products == "XYZ";

Jaql has the feature of applying selection push-down during query compilation whenever possible. As a result, in the given query the filter operator will be pushed before the transform operator, with the appropriate rewriting. The E3 framework can then detect this filtering operation directly after the read operation on the base file, and thus it can push the selection predicate into its customized input format to apply the filtering as early as possible.

HadoopDB provides a front-end, called SMS, for expressing SQL queries on top of its data. SMS is an extension to Hive. In HadoopDB, queries are expressed identically to standard SQL, as in the following example:

    SELECT pageURL, pageRank
    FROM Rankings
    WHERE pageRank > 10;

SpatialHadoop is designed for spatial queries, and thus it provides a high-level language and constructs for expressing these queries and operating on spatial objects, e.g., points and
[Big Data Indexing, Fig. 1: Hadoop++ layouts built by a map-reduce job at loading time — (a) data splits carrying per-split header info and a Trojan Index on the indexed key; (b) co-partition splits (co-groups) of datasets S and T on the join key, shown with and without Trojan Indexes]
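The per-split lookup discussed around this figure can be sketched as follows. This is a toy in-memory stand-in for a trojan-indexed split, not the actual Hadoop++ implementation; records and attribute names are invented:

```python
import bisect

class IndexedSplit:
    """Toy stand-in for a trojan-indexed data split: records are stored
    sorted on the indexed attribute, i.e., a clustered local index."""
    def __init__(self, records, key):
        self.key = key
        self.records = sorted(records, key=lambda r: r[key])
        self.keys = [r[key] for r in self.records]  # the per-split index

    def lookup(self, value):
        # Consult the local index first; if no record can match, the
        # whole split is skipped without scanning any of its records.
        lo = bisect.bisect_left(self.keys, value)
        if lo == len(self.keys) or self.keys[lo] != value:
            return []  # split skipped entirely
        hi = bisect.bisect_right(self.keys, value)
        return self.records[lo:hi]

splits = [
    IndexedSplit([{"x": 1}, {"x": 5}], key="x"),
    IndexedSplit([{"x": 7}, {"x": 9}], key="x"),
]
print([s.lookup(5) for s in splits])
```

In the real framework the same decision is made inside each mapper, so a non-matching split costs an index probe rather than a full scan.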
…(instead of the data splits) and consult the trojan index for that split w.r.t. the selection predicate. If there are multiple indexes, then the appropriate index is selected based on the selection predicate. If none of the records satisfies the query predicate, then the entire split is skipped, and the map function terminates without actually checking any record within this split. Otherwise, the trojan index will point to the data records within the split that satisfy the query. If the trojan index is clustered, then the data records within the given block are ordered according to the indexed attribute, and thus the retrieval of the records will be faster and require fewer I/Os.

It is worth highlighting that trojan indexes are categorized as local indexes, meaning that a local
index is created for each data split, in contrast to building a single global index for the entire dataset. Local indexes have their advantages and disadvantages. For example, one of the advantages is that the entire dataset does not need to be sorted, which is important because global sorting is prohibitively expensive in big data. However, one disadvantage is that each indexed split has to be touched at query time. This implies that a mapper function has to be scheduled and initiated by Hadoop for each split, even if many of these splits are irrelevant to the query.

The Hadoop++ framework also provides a mechanism, called Trojan Join, to speed up the join operation between two datasets, say S and T (refer to Fig. 1b). The basic idea is to partition both datasets (at the same time) on the join key. This partitioning can be performed at loading time as a preprocessing step. The actual join does not take place during this partitioning phase. Instead, only the corresponding data partitions from both datasets are grouped together in bigger splits, referred to as co-partition splits. At query time, when S and T need to be joined, the join operation can take place as a map-only job, where each mapper is assigned one complete co-partition split. As such, each mapper can join the corresponding partitions included in its split. Therefore, the join operation becomes significantly less expensive, since the shuffling/sorting and reduce phases have been eliminated (compared to the traditional map-reduce join operation in Hadoop). As highlighted in Fig. 1b, the individual splits from S or T within a single co-group may have a trojan index built on them.

The Hadoop++ framework is suitable for static indexing and joining. That is, at the time of loading the data into Hadoop, the system needs to know whether or not indexes need to be created (and on which attributes) and also whether or not co-partitioning between specific datasets needs to be performed. After loading the data, no additional indexes or co-partitionings can be created unless the entire dataset is reprocessed from scratch. Similarly, if new batches of files arrive and need to be appended to an existing indexed dataset, then the entire dataset needs to be reloaded (and the indexes re-created) in order to accommodate the new batches.

Record-Level Adaptive Indexing

The work proposed in Dittrich et al. (2012) and Richter et al. (2012) overcomes some of the limitations of previous indexing techniques, e.g., Dittrich et al. (2010) and Jiang et al. (2010). The key limitations include the following. First, indexes have a high creation overhead: building them usually requires a preprocessing step, and this step can be expensive since it has to go over the entire dataset. Previous evaluations have shown that this overhead is usually redeemed within a few queries, i.e., the execution of a few queries using the index will redeem the cost paid upfront to create it. Although that is true, reducing the creation overhead is always desirable. The second limitation is the question of which attributes to index. In general, if the query workload is changing, then different indexes may need to be created (or deleted) over time. The work in Dittrich et al. (2012) and Richter et al. (2012) addresses these two limitations.

HAIL (Hadoop Aggressive Indexing Library) (Dittrich et al. 2012) makes use of the fact that Hadoop, by default, creates three replicas of each data block (this default behavior can be altered by end users to either increase or decrease the number of replicas). In plain Hadoop, these replicas are exact mirrors of each other. HAIL, however, proposes to reorganize the data in each replica in a different way; e.g., each of the three replicas of the same data block can be sorted on a different attribute. As a result, a single file can have multiple clustered indexes at the same time. For example, as illustrated in Fig. 2, the first replica can have each of its splits sorted on attribute X, the second replica sorted on attribute Y, and the third replica sorted on attribute Z. These sorting orders are local within each split. Given this ordering, a clustered trojan index as proposed in Dittrich et al. (2010) can be built on each replica independently.
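HAIL's use of differently sorted replicas can be sketched as follows. The records and attribute names are invented, and plain Python lists stand in for HDFS block replicas; the point is only that a query is routed to whichever replica is sorted on the queried attribute:

```python
import bisect

# Toy sketch of HAIL's replica layout: each data block exists in several
# copies, and each copy is sorted on a different attribute, giving one
# clustered index per replica of the same logical block.
block = [{"x": 3, "y": 4}, {"x": 1, "y": 9}, {"x": 2, "y": 7}]
replicas = {attr: sorted(block, key=lambda r: r[attr]) for attr in ("x", "y")}

def query(attr, value):
    """Route a selection predicate to the replica sorted on `attr`,
    then use binary search as a stand-in for the clustered index."""
    rep = replicas[attr]
    keys = [r[attr] for r in rep]
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    return rep[lo:hi]

print(query("y", 7))
```

Because every replica holds the same records, answering from any one of them is correct; the sort order only determines which predicates can be served by an index probe instead of a scan.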
[Big Data Indexing, Fig. 2: HAIL replicas of the same data block, each sorted on a different attribute with its own clustered index, e.g., a trojan index on X for one replica and on Y for another]
…records from being processed. Although these techniques show some improvements in query execution, they still encounter unnecessarily high overhead during execution. For example, imagine the extreme case where a queried value x appears in only very few splits of a given file. In this case, indexing techniques like Hadoop++ and HAIL would encounter the overheads of starting a map task for each split, reading the split headers, searching the local index associated with the split, and then reading a few data records or directly terminating. These overheads are substantial in a map-reduce job, and eliminating them can improve performance.

The E3 framework proposed in Eltabakh et al. (2013) is based on the aforementioned insight. Its objective is not to build a fine-grained record-level index but instead to be more Hadoop-compliant and build a coarse-grained split-level index to eliminate entire splits whenever possible. E3 proposes a suite of indexing mechanisms that work together to eliminate splits irrelevant to a given query before execution, and thus map tasks start only for a potentially small subset of splits (see Fig. 3a). E3 integrates four indexing mechanisms, namely split-level statistics, inverted indexes, materialized views, and adaptive caching; each is beneficial under specific cases.

The split-level statistics are calculated for each number and date field in the dataset. Examples of the collected statistics include the min and max values of each field in each split; if this min-max range is very sparse, the authors have proposed a domain segmentation algorithm to divide the range into possibly many but tighter ranges to avoid false positives (a false positive is when the index indicates that a value may exist in the dataset while it is actually not present). The inverted index is built on top of the string fields in the dataset, i.e., each string value is added to the index and points to all splits including this value. By combining these two types of indexes for a given query involving a selection predicate, E3 can identify which splits are relevant to the query, and only for those splits will a set of mappers be triggered.

The other two mechanisms, i.e., materialized views and adaptive caching, are used in the cases where indexes are mostly useless. One example provided in Eltabakh et al. (2013) is highlighted in Fig. 3b. It illustrates what are called "nasty values," which are values that are infrequent over the entire dataset but scattered over most of the data splits, e.g., each split has one or a few records with this value. In this case, the inverted index will point to all splits and becomes almost useless. The E3 framework handles these nasty values by copying their records into an auxiliary materialized view. For example, the base file A in Fig. 3b will now have an additional materialized-view file stored in HDFS that contains a copy of all records having the nasty value v. Identifying the nasty values and deciding which ones have a higher priority to handle has been proven to be an NP-hard problem, and the authors have proposed an approximate greedy algorithm to solve it (Eltabakh et al. 2013).

The adaptive caching mechanism in E3 is used to optimize conjunctive predicates, e.g., (A = x AND B = y), in the cases where each of x and y individually is frequent, but their combination in one record is very infrequent. In other words, neither x nor y is a nasty value, but their combination is a nasty pair. In this case, neither the indexes nor the materialized views are useful. Since it is prohibitively expensive to enumerate all pairs and identify the nasty ones, the E3 framework handles nasty pairs by dynamically observing the query execution and identifying them on the fly. For example, for the conjunctive predicates (A = x AND B = y), E3 consults the indexes to select a subset of splits to read. It then observes the number of mappers that actually produced matching records. If that number is very small compared to the number of triggered mappers, then E3 identifies (x, y) as a nasty pair. Consequently, (x, y) will be cached along with pointers to its relevant splits.

Hadoop-RDBMS Hybrid Indexing

There has been a long debate on whether or not Hadoop and database systems can coexist
[Big Data Indexing, Fig. 3: The E3 indexing framework combines split-level statistics (min, max, and domain-segmentation ranges for number and date fields), a split-level inverted index, auxiliary materialized views, and adaptive caching; when a value v appears infrequently in many splits of file A (a nasty atom), a materialized view including all records containing v is built]
together in a single working environment and whether or not this strategy is beneficial. There are several successful projects that built such an integration (Abouzeid et al. 2009; Gankidi et al. 2014; Floratou et al. 2014a,b; Balmin et al. 2013; Katsipoulakis et al. 2015; Tian et al. 2016).

HadoopDB is one of the early projects that bring the optimizations of relational database systems to Hadoop (Abouzeid et al. 2009). HadoopDB proposes major changes to Hadoop's infrastructure by replacing the HDFS storage layer with a database management layer. That is, the Data Node and Task Tracker on each slave node in the Hadoop cluster run an instance of a database system. This database instance replaces the HDFS layer, and thus the data on each slave node are stored and managed by the database engine. HadoopDB pushes as much work as possible to the database engine, and as a result
all indexing capabilities and query optimizations ing relational-like data, spatial data, and semi-
of database systems automatically become ac- structured JSON data. The presented techniques
cessible. However, the drawback of HadoopDB are the backbones for numerous applications that
is that the management of dynamic scheduling run at scale over big data. And with the increas-
and fault tolerance becomes more complicated. ing complexity of these applications in terms of
In addition, the integration of structured and un- the size of the collected or generated data, the
structured data in the same workflow becomes heterogeneity in the data types, and the growing
tricky to perform. complexity of the applied analytics and querying,
Polybase (Gankidi et al. 2014) is another sys- the indexing techniques will play an even more
tem that enables the integration of Hadoop and important role in bringing the processing perfor-
database engines. In Polybase, the HDFS datasets mance into the realm of feasibility.
are defined within the database system as external
tables. And then, users’ queries can span both the
Cross-References
data stored in the DBMS and the data stored in
HDFS’s external tables. At execution time, part
Hadoop
of the query can be translated to map-reduce jobs
Indexing
while another part is SQL-based.
The data flow between the two systems in Polybase takes place through custom InputFormats and Database Connectors. However, without efficient access plans to the data in the external tables, these tables can easily become a bottleneck, and the entire execution plan slows down. The work in Gankidi et al. (2014) proposes an indexing technique, called Polybase Split-Indexing, that creates B+-tree indexes on the HDFS datasets. These indexes reside within the database system and can be leveraged in different ways. For selection queries, they can be used as early split-level filters to identify the relevant splits in HDFS. For join queries, they can be used for performing a semi-join within the database system before retrieving HDFS data. Moreover, they can be used as caches of hot HDFS data within the database system, and if a query touches only the attributes within the index, then the entire processing can be performed inside the database.

Conclusion

This entry covers various types of big data indexing techniques that speed up and enhance data retrieval and querying. These techniques range from record-level to split-level, nonadaptive to adaptive, and non-hybrid to hybrid. They also cover a wide range of data types includ-

Cross-References

Inverted Index Compression
Performance Evaluation of Big Data Analysis

References

Abadi DJ (2010) Tradeoffs between parallel database systems, Hadoop, and HadoopDB as platforms for petabyte-scale analysis. In: SSDBM, pp 1–3
Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A (2009) HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In: VLDB, pp 922–933
Abouzied A, Bajda-Pawlikowski K, Huang J, Abadi DJ, Silberschatz A (2010) HadoopDB in action: building real world applications. In: SIGMOD conference, pp 1111–1114
Balmin A, Beyer KS, Ercegovac V, McPherson J, Özcan F, Pirahesh H, Shekita EJ, Sismanis Y, Tata S, Tian Y (2013) A platform for extreme analytics. IBM J Res Dev 57(3/4):4
Bayer R, McCreight E (1972) Organization and maintenance of large ordered indexes. Acta Informatica 1(3):173–189
Beyer K, Ercegovac V, Gemulla R, Balmin A, Eltabakh MY, Kanne CC, Ozcan F, Shekita E (2011) Jaql: a scripting language for large scale semi-structured data analysis. In: PVLDB, vol 4
Chamberlin DD, Astrahan MM, Blasgen MW, Gray JN, King WF, Lindsay BG, Lorie R, Mehl JW et al (1974) A history and evaluation of System R. In: ACM computing practices, pp 632–646
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1)
Dittrich J, Quiané-Ruiz JA, Jindal A, Kargin Y, Setty V, Schad J (2010) Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). In: VLDB, vol 3, pp 518–529
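The split-level filtering idea behind such indexes can be illustrated with a small sketch. This is not the actual Polybase Split-Indexing implementation (which uses B+-trees inside the DBMS); it is a toy index, with invented names, that maps attribute values to the HDFS splits containing them so that a selection query reads only the relevant splits:

```python
from collections import defaultdict

# Toy sketch of split-level index filtering (hypothetical, not the
# actual Polybase Split-Indexing code): the index maps attribute
# values to the HDFS splits containing matching records, so a
# selection query scans only relevant splits instead of all of them.

class SplitLevelIndex:
    def __init__(self):
        # value -> set of split ids containing at least one matching record
        self.postings = defaultdict(set)

    def add_record(self, split_id, key):
        self.postings[key].add(split_id)

    def relevant_splits(self, key):
        # Early split-level filter: which splits must actually be read.
        return self.postings.get(key, set())

# Toy "HDFS" dataset partitioned into splits of (key, payload) records.
splits = {
    0: [("alice", 10), ("bob", 20)],
    1: [("carol", 30)],
    2: [("alice", 40), ("dave", 50)],
}

index = SplitLevelIndex()
for sid, records in splits.items():
    for key, _ in records:
        index.add_record(sid, key)

def select(key):
    # Scan only the splits the index identifies as relevant.
    hits = []
    for sid in sorted(index.relevant_splits(key)):
        hits.extend(v for k, v in splits[sid] if k == key)
    return hits

print(select("alice"))  # reads splits 0 and 2 only -> [10, 40]
```

The same postings structure also supports the semi-join use: the set of relevant splits (or keys) can be shipped to the database side before any HDFS data is retrieved.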
Big Data Stream Security Classification for IoT Applications 325
to have smart sensors collaborate directly without human involvement to deliver a new class of applications (Granjal et al. 2015; Puthal et al. 2016a, b). As security will be a fundamental enabling factor of most IoT applications, mechanisms must also be designed to protect communications enabled by such technologies (Sharma et al. 2017). Granjal et al. (2015) published the first survey article on IoT communication security. Other surveys do exist that, rather than analyzing the technologies currently being designed to enable Internet communications with sensing and actuating devices, focus on the identification of security requirements and on the discussion of approaches to the design of new security mechanisms (Medaglia and Serbanati 2010; Puthal et al. 2017a; El-Sayed et al. 2017) or, on the other end, discuss the legal aspects surrounding the impact of the IoT on the security and privacy of its users (Puthal et al. 2016a, 2017b; Weber 2010).

Applications dealing with large data sets obtained via simulation or actual real-time sensor networks/social networks are increasing in abundance (Puthal et al. 2015a, 2017b; Fox et al. 2004). The data obtained from real-time sources may contain certain discrepancies which arise from the dynamic nature of the source. Furthermore, certain computations may not require all the data, and hence this data must be filtered before it can be processed (Puthal et al. 2015a, 2016b, 2017c). By installing adaptive filters that can be controlled in real time, we can filter out only the relevant parts of the data, thereby improving the overall computation speed.

Nehme et al. (2009) proposed StreamShield, a system designed to address the problem of security and privacy in the data stream. They clearly highlighted the need for two types of security in data streams, i.e., (1) the "data security punctuations" (dsps), describing the data-side security policies, and (2) the "query security punctuations" (qsps). The advantages of such a stream-centric security model include flexibility, dynamicity, and speed of enforcement. The stream processor can adapt not only to data-related but also to security-related selectivity, which helps reduce the waste of resources when few subjects have access to streaming data (Puthal et al. 2015b, 2017d).

• Security verification of the data stream is essential in order to filter out unwanted and corrupted data.
• Security verification needs to be performed in near real time.
• Security verification should not degrade the performance of the stream processing engine, i.e., verification must run faster than the stream processing engine itself.

There are several applications where IoT devices act as sources of data streams (Puthal et al. 2017d, e), such as real-time health monitoring (health care), industrial monitoring, geo-social networking, home automation, war front monitoring, smart city monitoring, SCADA, event detection, disaster management, and emergency management.

In all of these applications, data needs to be protected from malicious attacks so that the originality of the data is maintained before it reaches the data processing center (Puthal et al. 2016b, 2017c, d; Chen et al. 2015). As the data sources are sensor nodes, it is important to propose lightweight security solutions for the data stream (Chen et al. 2015).

The rest of this entry is organized as follows: section "Security Properties" gives the security classification and properties, section "CIA Triad" highlights the CIA triad properties of big data streams, section "Classification" classifies the security issues based on the CIA properties, and section "Conclusion" concludes the entry.

Security Properties

The goal of security services in big data streams is to protect the information and resources from malicious attacks and misbehavior (Puthal et al. 2016a). The security requirements in big data streams include:
CIA Triad

Confidentiality, integrity, and availability, also known as the CIA triad, is a model designed to guide policies for information security within an organization.

Measures of the impact on confidentiality of a successful exploit of the vulnerability on the target system:

• Partial: There is considerable informational disclosure.
• Complete: A total compromise of critical system information.
Integrity – concerns modification (or misuse) of the data and the maintenance of internal and external consistency. Def: Integrity, in terms of big data streams, is the assurance that information can only be accessed or modified by authorized users or applications. Measures taken to ensure integrity include controlling the physical environment of networked terminals and servers, restricting access to data, and maintaining rigorous authentication practices. The authentication process is part of the integrity process.

Authentication: This measures whether or not an attacker needs to be authenticated in the big data stream in order to exploit the vulnerability, i.e., whether authentication is required to access and exploit it.

Measures of the impact on integrity of a successful exploit of the vulnerability on the target system:

• Partial: Considerable breach in integrity
• Complete: A total compromise of system integrity

Availability – means data must be available in the system by using controls: administrative controls, technical controls (security, firewalls, encryption), and physical controls (access control, physical safety of the data). Def: Data availability, in big data streams, ensures that data continues to be available at a required level of performance in situations ranging from normal through disastrous. In general, data availability is achieved by providing access to authenticated users and/or applications, in the same way access control ensures data availability at the data center.

Measures of the impact on availability of a successful exploit of the vulnerability on the target system:

• Partial: Considerable lag in or interruptions in resource availability
• Complete: Total shutdown of the affected resource

Confidentiality of Big Data Streams
Confidentiality is the ability to conceal messages from a passive attacker so that information in big data streams remains confidential. This is one of the most important issues for applications such as health care and the military. Confidentiality must be maintained throughout the entire lifetime of the data, from the source smart IoT device to long-term archiving in a cloud data center (Puthal et al. 2017b), and is typically achieved via data encryption. The corresponding threat is a passive attack, in which unauthorized attackers monitor and listen to the communication medium; attacks against data privacy are always passive in nature.

Integrity of Big Data Streams
Data integrity of big data streams can be defined as the overall, complete consistency of the data. A violation can be caused by any malicious activity while data is transmitted between IoT devices or between the IoT and the cloud data center; this is also known as a man-in-the-middle attack. Data integrity is generally enforced during system setup by applying standard security rules and can be maintained by applying validation or error checking. The message authentication code (MAC) concept is broadly applied for data integrity in IoT infrastructures.

Availability of Big Data Streams
Data availability in big data streams concerns controlling access for end users or specific applications. This subsection classifies the different access controls with data stream properties. There are two questions required for the access control process, i.e., who and what?

Classification

Table 1 presents the possible threats and attacks on IoT-generated big data streams; the classification is based on the CIA (confidentiality, integrity, and availability) triad from the previous discussion. As we have classified the complete set of security threats and solutions for big data streams along the CIA triad in the previous sections, Table 1 gives the pictorial classification. As seen from
Big Data Stream Security Classification for IoT Applications, Table 1 Possible threats of IoT-generated big data stream in CIA triad representation

Confidentiality:
• Access authorization
• Attack against privacy
• Monitoring and eavesdropping
• Traffic analysis
• Camouflage adversaries

Integrity:
• Spoofed, altered, or replayed data stream information
• Time synchronization
• Selective forwarding
• Eavesdropping (passive attacks)
• Dropping data packet attacks
• Desynchronization
• Sinkhole
• Selfish behavior on data forwarding
• Sybil
• Wormholes
• Acknowledgment spoofing
• Hello flood attacks

Availability:
• Mandatory access control system
• Role-based access control
• Attribute-based access control
• Risk-based access control
• View-based access control
• Activity-based access control
• Encryption-based access control
• Proximity-based access control
• Privilege state-based access control
• Risk-based access control
• Discretionary access control
• Information flow control model
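The message authentication code (MAC) mechanism mentioned for stream integrity can be sketched in a few lines. This is a minimal illustration using HMAC from Python's standard library; the shared key, record format, and key distribution are illustrative assumptions, not prescriptions from the entry:

```python
import hashlib
import hmac

# Minimal sketch of MAC-based integrity verification for stream
# records. Assumes the sensor and the data processing center share
# a secret key established out of band (hypothetical setup).
KEY = b"shared-secret-key"

def tag(record: bytes) -> bytes:
    # Sender side: attach an HMAC-SHA256 tag to each stream record.
    return hmac.new(KEY, record, hashlib.sha256).digest()

def verify(record: bytes, mac: bytes) -> bool:
    # Receiver side: recompute the tag and compare in constant time.
    return hmac.compare_digest(tag(record), mac)

record = b"sensor-42,temp=21.7"
mac = tag(record)
print(verify(record, mac))                   # True: record unchanged
print(verify(b"sensor-42,temp=99.9", mac))   # False: tampered in transit
```

A record modified in transit (e.g., by a man-in-the-middle) fails verification, since the attacker cannot recompute a valid tag without the key.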
Table 1, security threats in big data streams always fall under one of the properties of the CIA triad.

Conclusion

The IoT infrastructure may already be visible in current deployments, where networks of smart sensing devices are interconnected over the wireless medium, and IP-based standard technologies will be fundamental in providing a common and well-accepted ground for the development and deployment of new IoT applications. In line with the 4Vs' features of big data, current data streams whose sources are IoT smart sensing devices tend toward the new term big data stream. Considering that security may be an enabling factor of many IoT applications, mechanisms to secure the data stream in flow for the IoT will be fundamental. With such aspects in mind, in this survey we perform a basic classification of security protocols.

Cross-References

Definition of Data Streams

References

Bella G, Paulson L (1998) Kerberos version IV: inductive analysis of the secrecy goals. In: Computer security – ESORICS 98, Louvain-la-Neuve, pp 361–375
Chen P, Wang X, Wu Y, Su J, Zhou H (2015) POSTER: iPKI: identity-based private key infrastructure for securing BGP protocol. In: 22nd ACM SIGSAC conference on computer and communications security, Colorado, pp 1632–1634
El-Sayed H, Sankar S, Prasad M, Puthal D, Gupta A, Mohanty M, Lin C (2017) Edge of things: the big picture on the integration of edge, IoT and the cloud in a distributed computing environment. IEEE Access 6:1706–1717
Fox G, Gadgil H, Pallickara S, Pierce M, Grossman R, Gu Y, Hanley D, Hong X (2004) High performance data streaming in service architecture. Technical report, Indiana University and University of Illinois at Chicago
Granjal J, Monteiro E, Sá Silva J (2015) Security for the internet of things: a survey of existing protocols and open research issues. IEEE Commun Surv Tutorials 17(3):1294–1312
Medaglia C, Serbanati A (2010) An overview of privacy and security issues in the internet of things. In: The internet of things. Springer, New York, pp 389–395
Nehme R, Lim H, Bertino E, Rundensteiner E (2009) StreamShield: a stream-centric approach towards security and privacy in data stream environments. In: ACM SIGMOD international conference on management of data, Rhode Island, pp 1027–1030
Puthal D, Nepal S, Ranjan R, Chen J (2015a) DPBSV: an efficient and secure scheme for big sensing data stream. In: 14th IEEE international conference on trust, security and privacy in computing and communications, Helsinki, pp 246–253
Puthal D, Nepal S, Ranjan R, Chen J (2015b) A dynamic key length based approach for real-time security verification of big sensing data stream. In: 16th international conference on web information system engineering, Miami, pp 93–108
Puthal D, Nepal S, Ranjan R, Chen J (2016a) Threats to networking cloud and edge datacenters in the Internet of Things. IEEE Cloud Computing 3(3):64–71
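Among the availability-side mechanisms listed in Table 1, role-based access control answers the two access-control questions, "who" (the subject, via a role) and "what" (the permitted operation). A minimal sketch, with purely hypothetical users, roles, and permissions not taken from the entry:

```python
# Minimal sketch of role-based access control (RBAC) for a data
# stream. Users, roles, and permissions below are invented examples
# for illustration only.

ROLE_PERMISSIONS = {
    "admin":   {"read", "write", "manage"},
    "analyst": {"read"},
}

USER_ROLES = {
    "alice": "admin",
    "bob":   "analyst",
}

def is_allowed(user: str, operation: str) -> bool:
    # "Who": resolve the user's role; "what": check the operation.
    role = USER_ROLES.get(user)
    return operation in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("bob", "read"))   # True
print(is_allowed("bob", "write"))  # False
```

The other access-control variants in Table 1 (attribute-, risk-, proximity-based, etc.) follow the same check shape but replace the role lookup with richer context about the subject and the request.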
330 Big Data Supercomputing
generally also applies to the specific case of DNA sequencing data. In particular, DNA data are often processed in a data analysis pipeline (like META-pipe, Pedersen and Bongo 2016) where not only the raw data but also several additional contextual metadata and provenance data are generated and have to be maintained. The raw data produced during DNA sequencing are image data generated by the sequencing hardware. Further DNA processing steps comprise:

Primary analysis: producing short DNA sequences (called reads) out of the raw data and assigning quality scores to them
Secondary analysis: assembling several short reads guided by a reference DNA sequence – this process is called read mapping – as well as analyzing the reads with respect to the reference sequence by, for example, identifying single nucleotide variants or deletions
Tertiary analysis: processing the genome data to achieve advanced analyses by integrating several data sources (like multiple DNA samples, metadata, annotations, etc.)

Big Data Challenges

Current life sciences are more and more data driven. In practice, this means that data are recorded and generated in an essentially automatic way. A large portion of life science data is taken up by DNA sequences that are, for example, used to identify the genetic basis of a disease in the medical sciences or of (wanted or unwanted) traits of cultivated plants or animals which are supposed to be used in breeding programs. It is obvious that such large bodies of data can no longer be stored and analyzed in the traditional way but can only be harnessed by advanced data management and analysis systems.

For the years around the millennium, the growth rate of computing power kept pace with the growth rate of the data output of sequencing facilities, such that data analysis was then feasible even with ordinary PCs. This remarkable parallelism ended in 2008 with the emergence of a new sequencing technology – next-generation sequencing (NGS). While computing power is predicted to follow Moore's law, that is, roughly doubling every 24 months, the years after the advent of NGS witnessed a decline in sequencing cost that could only be measured in orders of magnitude, as can be seen in the so-called Carlson curves (Carlson 2003).

The challenges that arise from this gap between the accumulation of data on the one hand and the capacities to store, analyze, and interpret them in a meaningful way on the other call for new and creative, perhaps also radical, ways to deal with data. We can distinguish two related, but distinct, aspects of this problem: (1) data storage and accessibility and (2) data analysis in acceptable times.

Key Research Findings

Several ways to address the big data challenges when processing genome sequencing data have been proposed in the last decade. We briefly survey several streams of research in the following subsections.

From Data Storage to Data Disposal
Thus far, the generally acknowledged, albeit in most cases tacit, agreement in the life sciences was to keep all raw data for the purpose of reproducibility or also for secondary analyses. This attitude has begun to erode in DNA sequencing projects. First, it should be clearly defined what raw data are in sequencing projects. In their most primitive form, the sequence raw data are image files which are processed in several steps to yield short sequence reads which in turn are assembled to transcripts, genes, or genomes. It has become a generally adopted practice to consider the short sequence reads, together with quality scores, as raw data which can be archived in the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra). Probably unknowingly, data scientists and analysts have adopted the plant breeders' wisdom "Breeding is the art of throwing away" (Becker 2011), transforming it into "Managing data is the art of throwing away".
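The short reads plus per-base quality scores treated as raw data above are commonly exchanged in the FASTQ format, where each base carries a Phred quality score encoded as the ASCII character with code Q+33. A minimal parsing sketch (the record itself is invented for illustration):

```python
# Minimal sketch: parsing one FASTQ record (read + per-base quality
# scores). The record is invented for illustration. Phred scores are
# encoded as ASCII characters offset by 33 ("Phred+33").

fastq_record = """@read_1
ACGTACGT
+
IIIIHHHH"""

header, bases, _, quals = fastq_record.splitlines()

# Decode Phred+33 quality scores: 'I' (ASCII 73) -> 40, 'H' -> 39, ...
scores = [ord(c) - 33 for c in quals]
mean_quality = sum(scores) / len(scores)

print(bases)         # ACGTACGT
print(scores)        # [40, 40, 40, 40, 39, 39, 39, 39]
print(mean_quality)  # 39.5
```

Records of exactly this shape, in the billions, are what archives such as the SRA store and what the compression schemes discussed in this entry operate on.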
332 Big Data Technologies for DNA Sequencing
Big Data Technologies for DNA Sequencing, Fig. 1 Articles on ultrafast algorithms in PubMed (line chart of yearly article counts; y-axis: Count, 0–400)
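The compression schemes discussed here exploit the observation that an individual genome differs from a reference in relatively few positions, so it suffices to store only the differences. A toy sketch of this core idea (invented sequences; real formats such as CRAM are far more elaborate):

```python
# Toy sketch of reference-based compression: store only the positions
# where a sample sequence differs from the reference instead of the
# whole sequence. Real formats (e.g., CRAM) are far more elaborate.

reference = "ACGTACGTACGT"
sample    = "ACGTTCGTACGA"

def compress(ref: str, seq: str):
    # Assumes the sequences are aligned and of equal length.
    return [(i, b) for i, (r, b) in enumerate(zip(ref, seq)) if r != b]

def decompress(ref: str, diffs):
    out = list(ref)
    for i, b in diffs:
        out[i] = b
    return "".join(out)

diffs = compress(reference, sample)
print(diffs)  # [(4, 'T'), (11, 'A')] - 2 diffs instead of 12 bases
assert decompress(reference, diffs) == sample
```

Since the diff list grows with the number of variants rather than the genome length, the savings are large for genomes that are highly similar to the reference.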
genome, thus representing a significant reduction in disk space requirement. The lossy CRAM file format applies a heavyweight compression scheme and thus reduces the storage consumption even more (Bonfield and Mahoney 2013).

Departing from the idea of processing data stored in flat files, the integration of genome-specific compression into a database system which holds data in main memory was developed in Dorok et al. (2017). In this so-called base-centric encoding, each base is stored in a separate row of a database table such that database-specific operations can be used to analyze the genome sequence. This technique requires that the entire genome dataset fits into main memory.

Parallel Processing and Modern Hardware Support
The accurate and fast detection of genomic variants like DNA insertions/deletions or single nucleotide polymorphisms (SNPs) based on NGS data plays an essential role in clinical research. For this aim, several analysis pipelines combining short read aligners (e.g., BWA-MEM, Bowtie2, SOAP3, and Novoalign (http://novocraft.com/)) with variant callers (e.g., the Genome Analysis Toolkit HaplotypeCaller (GATK-HC), Samtools mpileup, and Torrent Variant Caller (v4.0 Life Technologies)) have been developed. However, the analysis of the raw sequence data is computationally intensive and requires significant memory. To deal with this problem, most aligners and variant callers have been implemented in multithreaded mode (e.g., BWA-MEM, Bowtie2, and Novoalign) or using GPU-based software (SOAP3) to ensure a feasible computation time. For details like memory usage or multithreading of the different approaches, see the review by Mielczarek and Szyda (2016); also see the references therein for details about the different tools mentioned above.

Thus, parallelization of analysis tasks is as important as the sequencing process itself and promises a huge potential for DNA processing. Several frameworks aim at parallelization for specific applications. The MapReduce paradigm has hence received a lot of attention. Martínez et al. (2015) describe a distributed framework for read mapping based on the Message Passing Interface (MPI) in a cluster of nodes. They discuss the issues of splitting and distributing the inputs as well as merging the results. Already in the year 2010, an overview by Taylor (2010) surveyed applications of the Hadoop framework; more recently, other approaches based on the MapReduce framework have been developed (Chung et al. 2014). Other distributed processing systems like Spark have also received attention for applications in genome sequencing and processing (Mushtaq et al. 2017).

To benefit from advances in modern hardware technology, some DNA analysis processes have been ported to run on graphical processing units (GPUs) (Salavert Torres et al. 2012). However, these approaches incur an overhead for preprocessing the data and require highly specialized algorithms in order to take advantage of the GPU platforms.

Integration of Heterogeneous Data
Considering the growth rate of DNA sequencing data, Stephens et al. (2015) have demonstrated that sequencing technologies are among the most important generators of big data. However, these massive datasets are often neither well structured nor organized in any consistent manner, which makes their direct usage difficult. Consequently, researchers have, first of all, to deal with the handling of big data to perform any analysis or comparison studies between different genomes. Thus, an effort is needed today, for these big data challenges in life sciences, to store the data in hierarchical systems in more efficient ways that make them available at different levels of analysis.

Management of biological data often requires the connection and integration of different data sources for a combined analysis. By default, biomedical text formats use identifiers (e.g., for genes or genomic variants), and different text files can only be combined by ID-based linking: data combination heavily relies on string equality matching over these IDs. To exacerbate the situation, these IDs are often hidden in larger strings such that these IDs have to be extracted first
before the data can be joined. A conjecture shared by many researchers in the field is that, on the high level of data management, explicitly linking data items in a graph structure by navigational access will enable more efficient data combination than join operations over string IDs. Hence, graph databases are deemed to be most suitable for such a data integration layer (Have and Jensen 2013) because different data sources can be combined by a link (edge) in the data graph.

One prototypical framework that integrates genome data sources and other biological datasets in a common graph-based framework is BioGraphDB, presented by Fiannaca et al. (2016), who use the OrientDB multi-model database to interconnect several text-based external data sources. The BioGraphDB system is able to process queries in the Gremlin graph query language. Textual input data is processed in this framework as follows (Fiannaca et al. 2016): "As general rule, each biological entity and its properties have been mapped respectively into a vertex and its attributes, and each relationship between two biological entities has been mapped into an edge. If a relationship has some properties, they are also saved as edge attributes. Vertices and edges are grouped into classes, according to the nature of the entities."

Bioinformatics for Sequencing Data
Bioinformatic skills play an essential role in the exploitation of the full potential of NGS data. Until now, different bioinformatic tools and algorithms have been published for the storage and computational analysis of DNA sequences. Such applications are important for the detection as well as the understanding of relevant biological processes. Currently, the OMICtools directory (https://omictools.com/) for NGS data analysis lists 47 categories with 9089 applications which are developed, for example, for data processing, quality control, genomic research, data visualization, or the identification of robust genomic associations/variants with complex diseases.

In addition to the computational analysis of sequencing data, the field of bioinformatics is crucial for the storage of large-scale sequencing data in databases as well as repositories. The Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra) is one of the most relevant public-domain repositories in which raw sequence data generated using next-generation sequencing technologies are stored, with free and continuous access to the data. There are further big data projects that are based on databases/repositories collecting sequencing data in life sciences:
Future Directions for Research

DNA sequencing is one of the major producers of big data. Novel sequencing technology will further increase the amount of DNA sequencing data produced. The introduction of mobile sequencing devices like nanopore-based technologies (Loman and Watson 2015) will turn DNA sequencing into an everyday diagnosis and monitoring tool. Though the available devices initially suffered from high error rates, ongoing technological advances will lead to improved data quality (Jain et al. 2015).

It might be worthwhile to closely inspect genome data processing pipelines in order to identify bottlenecks. Martínez et al. (2015) report on the use of high-performance profiling that revealed idling hardware resources. By improving the task scheduling, a better runtime could be achieved. This kind of low-level performance optimization is one field of future research to improve the performance of DNA processing.

Extrapolating the current drop in sequencing cost, the most radical option could be the so-called streaming algorithms for sequence analysis. With streaming algorithms, DNA sequences are analyzed in real time and are not stored at all, as shown in Cao et al. (2016).

Cross-References

Big Data Analysis in Bioinformatics
GPU-Based Hardware Platforms
Hadoop

References

Becker H (2011) Pflanzenzüchtung. UTB basics. UTB GmbH
Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM format sequencing data. PLoS One 8(3):e59190
Cao MD, Ganesamoorthy D, Elliott AG, Zhang H, Cooper MA, Coin LJ (2016) Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION sequencing. GigaScience 5(1):32
Carlson R (2003) The pace and proliferation of biological technologies. Biosecur Bioterror Biodefense Strategy Pract Sci 1(3):203–214
Christley S, Lu Y, Li C, Xie X (2008) Human genomes as email attachments. Bioinformatics 25(2):274–275
Chung WC, Chen CC, Ho JM, Lin CY, Hsu WL, Wang YC, Lee DT, Lai F, Huang CW, Chang YJ (2014) CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce. PLoS One 9(6):e98146
Dorok S, Breß S, Teubner J, Läpple H, Saake G, Markl V (2017) Efficiently storing and analyzing genome data in database systems. Datenbank-Spektrum 17(2):139–154
Fiannaca A, La Rosa M, La Paglia L, Messina A, Urso A (2016) BioGraphDB: a new GraphDB collecting heterogeneous data for bioinformatics analysis. In: Proceedings of BIOTECHNO
Have CT, Jensen LJ (2013) Are graph databases ready for bioinformatics? Bioinformatics 29(24):3107
Jain M, Fiddes IT, Miga KH, Olsen HE, Paten B, Akeson M (2015) Improved data analysis for the MinION nanopore sequencer. Nat Methods 12(4):351–356
Loman NJ, Watson M (2015) Successful test launch for nanopore sequencing. Nat Methods 12(4):303
Martínez H, Barrachina S, Castillo M, Tárraga J, Medina I, Dopazo J, Quintana-Ortí ES (2015) Scalable RNA sequencing on clusters of multicore processors. Trustcom/BigDataSE/ISPA 3:190–195
Mielczarek M, Szyda J (2016) Review of alignment and SNP calling algorithms for next-generation sequencing data. J Appl Genet 57(1):71–79. https://doi.org/10.1007/s13353-015-0292-7
Mushtaq H, Liu F, Costa C, Liu G, Hofstee P, Al-Ars Z (2017) SparkGA: a Spark framework for cost effective, fast and accurate DNA analysis at scale. In: Proceedings of the 8th ACM international conference on bioinformatics, computational biology, and health informatics. ACM, pp 148–157
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14(4):417–419
Pedersen E, Bongo LA (2016) Big biological data management. In: Pop F, Kolodziej J, Martino BD (eds) Resource management for big data platforms. Computer communications and networks. Springer, Heidelberg, pp 265–277
Popitsch N, von Haeseler A (2012) NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Res 41(1):e27
Salavert Torres J, Blanquer Espert I, Tomas Dominguez A, Hernendez V, Medina I, Terraga J, Dopazo J (2012) Using GPUs for the exact alignment of short-read genetic sequences by means of the Burrows-Wheeler transform. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 9(4):1245–1256
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE
336 Big Data Visualization Tools
(2015) Big data: astronomical or genomical? PLoS Biol 13(7):e1002195
Taylor RC (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform 11(12):S1

Big Data Visualization Tools

Nikos Bikakis
ATHENA Research Center, Athens, Greece

Synonyms

The level of difficulty in transforming a data-curious user into someone who can access and analyze that data is even more burdensome now for a great number of users with little or no support and expertise on the data processing part. Data visualization has become a major research challenge involving several issues related to data storage, querying, indexing, visual presentation, interaction, and personalization (Bikakis and Sellis 2016; Shneiderman 2008; Godfrey et al. 2016; Idreos et al. 2015).

Given the above, modern visualization and exploration systems should effectively and efficiently handle the following aspects.
2010; Bikakis et al. 2017; Jugel et al. 2016; Liu produced results, the user is able to interrupt
et al. 2013). the execution and define the next operation,
Hierarchical exploration. Approximation without waiting for the exact result to be
techniques are often defined in a hierarchical computed.
manner (Elmqvist and Fekete 2010; Bikakis et al. In this context, several systems adopt pro-
2017), allowing users to explore data in different gressive techniques. In these techniques the re-
levels of detail (e.g., hierarchical aggregation). Hierarchical approaches (sometimes also referred to as multilevel) allow the visual exploration of very large datasets at multiple levels, offering both an overview and an intuitive, effective way of finding specific parts within a dataset. Particularly, in hierarchical approaches, the user first obtains an overview of the dataset before proceeding to data exploration operations (e.g., roll-up, drill-down, zoom, filtering) and finally retrieving details about the data. Therefore, hierarchical approaches directly support the visual information seeking mantra "overview first, zoom and filter, then details on demand." Hierarchical approaches can also effectively address the problem of information overloading, as they adopt approximation techniques.

Hierarchical techniques have been extensively used in large graph visualization, where the graph is recursively decomposed into smaller subgraphs that form a hierarchy of abstraction layers. In most cases, the hierarchy is constructed by exploiting clustering and partitioning methods (Rodrigues et al. 2013; Auber 2004; Tominski et al. 2009).

Progressive results. Data exploration requires real-time system response. However, computing complete results over large (unprocessed) datasets may be extremely costly and in several cases unnecessary. Modern systems should progressively return partial and preferably representative results, as soon as possible. Progressiveness can significantly improve efficiency in exploration scenarios, where it is common that users attempt to find something interesting without knowing beforehand what exactly they are searching for. In this case, users perform a sequence of operations (e.g., queries), where the result of each operation determines the formulation of the next operation. In systems where progressiveness is supported, in each operation, after inspecting the already computed results, the user can refine or stop the current operation; results/visual elements are computed/constructed incrementally based on user interaction or as time progresses (Stolper et al. 2014; Bikakis et al. 2017). Further, numerous recent systems integrate incremental and approximate techniques. In these cases, approximate results are computed incrementally over progressively larger samples of the data (Fisher et al. 2012; Agarwal et al. 2013; Im et al. 2014).

Incremental and adaptive processing. The dynamic setting established nowadays hinders (efficient) data preprocessing in modern systems. Additionally, it is common in exploration scenarios that only a small fragment of the input data is accessed by the user. In situ data exploration (Alagiannis et al. 2012; Olma et al. 2017; Bikakis et al. 2017) is a recent trend which aims at enabling on-the-fly exploration over large and dynamic sets of data, without (pre)processing (e.g., loading, indexing) the whole dataset. In these systems, incremental and adaptive processing and indexing techniques are used, in which small parts of the data are processed incrementally, "following" users' interactions.

Caching and prefetching. Recall that, in exploration scenarios, a sequence of operations is performed and, in most cases, each operation is driven by the previous one. In this setting, caching and/or prefetching the sets of data that are likely to be accessed by the user in the near future can significantly reduce the response time (Battle et al. 2016; Bikakis et al. 2017). Most of these approaches use prediction techniques which exploit several factors (e.g., user behavior, user profile, use case) in order to determine the upcoming user interactions.

User assistance. The large amount of available information makes it difficult for users to manually explore and analyze data. Modern systems should provide mechanisms that assist the user and reduce the effort needed on their part,
Big Data Visualization Tools 339
considering the diversity of preferences and requirements posed by different users and tasks. Recently, several approaches have been developed in the context of visualization recommendation (Vartak et al. 2016). These approaches recommend the most suitable visualizations in order to assist users throughout the analysis process. Usually, the recommendations take into account several factors, such as data characteristics, the examined task, and user preferences and behavior.

Especially considering data characteristics, there are several systems that recommend the most suitable visualization technique (and parameters) based on the type, attributes, distribution, or cardinality of the input data (Key et al. 2012; Ehsan et al. 2016; Mackinlay et al. 2007; Bikakis et al. 2017). In a similar context, some systems assist users by recommending certain visualizations that reveal surprising and/or interesting data (Vartak et al. 2014; Wongsuphasawat et al. 2016). Other approaches provide visualization recommendations based on user behavior and preferences (Mutlu et al. 2016).

Examples of Applications

Visualization techniques are of great importance in a wide range of application areas in the Big Data era. The volume, velocity, heterogeneity, and complexity of available data make it extremely difficult for humans to explore and analyze data. Data visualization enables users to perform a series of analysis tasks that are not always possible with common data analysis techniques (Keim et al. 2010).

Major application domains for data visualization and analytics are physics and astronomy. Satellites and telescopes collect massive and dynamic streams of data on a daily basis. Using traditional analysis techniques, astronomers are able to identify noise, patterns, and similarities. On the other hand, visual analytics can enable astronomers to identify unexpected phenomena and perform several complex operations which are not feasible with traditional analysis approaches.

Another application domain is the atmospheric sciences, such as meteorology and climatology. In this domain, high volumes of data are collected from sensors and satellites on a daily basis. Storing these data over the years results in massive amounts of data that have to be analyzed. Visual analytics can assist scientists in performing core tasks, such as climate factor correlation analysis and event prediction. Further, in this domain, visualization systems are used in several scenarios in order to capture real-time phenomena, such as hurricanes, fires, floods, and tsunamis.

In the domain of bioinformatics, visualization techniques are exploited in numerous tasks. For example, analyzing the large amounts of biological data produced by DNA sequencers is extremely challenging. Visual techniques can help biologists gain insight and identify interesting "areas of genes" on which to perform their experiments.

In the Big Data era, visualization techniques are also extensively used in the business intelligence domain. Finance markets are one application area, where visual analytics allow users to monitor markets, identify trends, and perform predictions. Besides, market research is also an application area. Marketing agencies and in-house marketing departments analyze a plethora of diverse sources (e.g., finance data, customer behavior, social media). Visual techniques are exploited to realize tasks such as identifying trends, finding emerging market opportunities, finding influential users and communities, optimizing operations (e.g., troubleshooting of products and services), business analysis, and development (e.g., churn rate prediction, marketing optimization).

Cross-References

Graph Exploration and Search
Visualization
Visualization Techniques
Visualizing Semantic Data

References

Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In:
European conference on computer systems (EuroSys), Prague, 15–17 Apr 2013
Alagiannis I, Borovica R, Branco M, Idreos S, Ailamaki A (2012) NoDB: efficient query execution on raw data files. In: ACM conference on management of data (SIGMOD), Scottsdale, pp 241–252
Auber D (2004) Tulip – a huge graph visualization framework. In: Jünger M, Mutzel P (eds) Graph drawing software. Mathematics and visualization. Springer, Berlin/Heidelberg
Battle L, Stonebraker M, Chang R (2013) Dynamic reduction of query result sets for interactive visualization. In: IEEE international conference on Big Data, pp 1–8
Battle L, Chang R, Stonebraker M (2016) Dynamic prefetching of data tiles for interactive visualization. In: ACM conference on management of data (SIGMOD), San Francisco, 26 June–1 July 2016
Bikakis N, Sellis T (2016) Exploration and visualization in the web of big linked data: a survey of the state of the art. In: Proceedings of the workshops of the EDBT/ICDT 2016 joint conference, EDBT/ICDT workshops 2016, Bordeaux, 15 Mar 2016
Bikakis N, Liagouris J, Krommyda M, Papastefanatos G, Sellis T (2016) GraphVizdb: a scalable platform for interactive large graph visualization. In: 32nd IEEE international conference on data engineering, ICDE 2016, Helsinki, 16–20 May 2016, pp 1342–1345
Bikakis N, Papastefanatos G, Skourla M, Sellis T (2017) A hierarchical aggregation framework for efficient multilevel visual exploration and analysis. Semant Web J 8(1):139–179
Ehsan H, Sharaf MA, Chrysanthis PK (2016) MuVE: efficient multi-objective view recommendation for visual data exploration. In: IEEE international conference on data engineering (ICDE), Helsinki, 16–20 May 2016
Elmqvist N, Fekete J (2010) Hierarchical aggregation for information visualization: overview, techniques, and design guidelines. IEEE Trans Vis Comput Graph 16(3):439–454
Fisher D, Popov I, Drucker S, Schraefel MC (2012) Trust me, I'm partially right: incremental visualization lets analysts explore large datasets faster. In: Proceedings of the SIGCHI conference on human factors in computing systems (CHI '12). ACM, New York, pp 1673–1682
Godfrey P, Gryz J, Lasek P (2016) Interactive visualization of large data sets. IEEE Trans Knowl Data Eng 28(8):2142–2157
Idreos S, Papaemmanouil O, Chaudhuri S (2015) Overview of data exploration techniques. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data (SIGMOD '15). ACM, New York, pp 277–281
Im J, Villegas FG, McGuffin MJ (2014) VisReduce: fast and responsive incremental information visualization of large datasets. In: 2013 IEEE international conference on Big Data (Big Data), Silicon Valley, pp 25–32
Jugel U, Jerzak Z, Hackenbroich G, Markl V (2016) VDDA: automatic visualization-driven data aggregation in relational databases. VLDB J 53–77
Keim DA, Kohlhammer J, Ellis GP, Mansmann F (2010) Mastering the information age – solving problems with visual analytics. Eurographics Association, pp 1–168. ISBN 978-3-905673-77-7
Key A, Howe B, Perry D, Aragon C (2012) VizDeck: self-organizing dashboards for visual analytics. In: Proceedings of the 2012 ACM SIGMOD international conference on management of data (SIGMOD '12). ACM, New York, pp 681–684
Liu Z, Jiang B, Heer J (2013) imMens: realtime visual querying of big data. Comput Graph Forum 32(3):421–430
Mackinlay JD, Hanrahan P, Stolte C (2007) Show me: automatic presentation for visual analysis. IEEE Trans Vis Comput Graph 13(6):1137–1144
Mutlu B, Veas EE, Trattner C (2016) VizRec: recommending personalized visualizations. ACM Trans Interact Intell Syst 6(4):31:1–31:39
Olma M, Karpathiotakis M, Alagiannis I, Athanassoulis M, Ailamaki A (2017) Slalom: coasting through raw data via adaptive partitioning and indexing. Proc VLDB Endowment 10(10):1106–1117
Park Y, Cafarella M, Mozafari B (2016) Visualization-aware sampling for very large databases. In: 2016 IEEE 32nd international conference on data engineering (ICDE), Helsinki, pp 755–766
Rodrigues JF Jr, Tong H, Pan J, Traina AJM, Traina C Jr, Faloutsos C (2013) Large graph analysis in the GMine system. IEEE Trans Knowl Data Eng 25(1):106–118
Shneiderman B (2008) Extreme visualization: squeezing a billion records into a million pixels. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data (SIGMOD '08). ACM, New York, pp 3–12
Stolper CD, Perer A, Gotz D (2014) Progressive visual analytics: user-driven visual exploration of in-progress analytics. IEEE Trans Vis Comput Graph 20(12):1653–1662
Tominski C, Abello J, Schumann H (2009) CGV – an interactive graph visualization system. Comput Graph 33(6):660–678
Vartak M, Madden S, Parameswaran AG, Polyzotis N (2014) SEEDB: automatically generating query visualizations. Proc VLDB Endowment 7(13):1581–1584
Vartak M, Huang S, Siddiqui T, Madden S, Parameswaran AG (2016) Towards visualization recommendation systems. SIGMOD Rec 45(4):34–39
Ward MO, Grinstein G, Keim D (2015) Interactive data visualization: foundations, techniques, and applications, 2nd edn. A. K. Peters, Ltd, Natick
Wongsuphasawat K, Moritz D, Anand A, Mackinlay JD, Howe B, Heer J (2016) Voyager: exploratory analysis via faceted browsing of visualization recommendations. IEEE Trans Vis Comput Graph 22(1):649–658
Big Data Warehouses for Smart Industries 341
research and help practitioners to build BDWs, this entry also focuses on the relevance and value of BDWs for Smart Industries, such as the analysis of customer complaints or the identification of patterns between defective parts and production data, decreasing the time needed to resolve complaints.

Key Research Findings

In order to design and implement a BDW, practitioners should focus on three main aspects: the architecture of the system, the technologies to use, and the data modeling technique. This section presents the foundations, i.e., models and methods (Hevner et al. 2004), to approach these same issues.

The Big Data Warehousing approach presented in this entry takes into consideration the general, and still scarce, scientific contributions regarding Big Data systems, integrating and extending them with the appropriate contributions considering the specific characteristics of BDWs previously enumerated. First, the architectural components are as aligned as possible with the Big Data Reference Architecture from the National Institute of Standards and Technology (NIST), which presents the general components of a Big Data system (NBD-PWG 2015). Second, the approach presented in this entry follows several guidelines of the Lambda Architecture (Marz and Warren 2015), motivating practitioners to store the data with the highest level of detail, storing it as a set of immutable events (if possible), and clearly dividing the workflows into batch and streaming workflows. Finally, contributions such as the Big Data Processing Flow (Krishnan 2013) and the Data Highway concept (Kimball and Ross 2013) are also relevant in this approach, as can be seen in the following subsections.

Architecture
Logical components help practitioners in the design of a BDW. Figure 1 illustrates the BDW architecture, highlighting the several logical components and data flows.

Data Collection, Preparation, and Enrichment (CPE)
Data arrives from the data provider component (e.g., users, legacy systems, Web, transactional systems, sensors), and it is collected in two distinct ways, either in batches or via streaming mechanisms. Batch data is first stored in the sandbox storage (a distributed file system without schema restrictions) before being processed, while streaming data is immediately processed (prepared and enriched), since it typically holds more value when analyzed in near real time. Despite this, streaming data can also be stored in the sandbox storage through an optional alternative path, due to the value inherent in its raw and unprocessed state, or only as a backup strategy, for example. The Data Mining/Machine Learning models previously built by data scientists in the Data Science sandbox can be included in this stage, in both batch and streaming workloads. This works by applying the models to make predictions as data flows through the CPE workloads.

Platforms: Data Organization and Distribution
The BDW consists of three different storage areas classified into two categories: file system and indexed storage.

1. Sandbox storage – a highly flexible storage area without the need to rigorously define any kind of metadata. Data scientists (Costa and Santos 2017b) use the sandbox storage to "play with the data," identifying trends/patterns and elaborating predictions. Consequently, it is seen as a schema-less storage technology, mainly supported by highly scalable distributed file systems.
2. Batch storage – an indexed storage system used to store and process Big Data arriving in batches. Unlike the previous storage area, it is based on previously defined metadata to support structured analytics, although it uses a significantly more flexible data modeling technique than typical relational DWs (see section "Data Modeling"). Its data modeling technique facilitates the integration of patterns, simulations, and predictions extracted
Big Data Warehouses for Smart Industries, Fig. 1 Big Data Warehouse architecture
and developed by data scientists. It is capable of processing huge amounts of data due to its fast sequential access capabilities.
3. Streaming storage – an indexed storage system that, instead of storing and processing data arriving in batches, deals with data arriving in a streaming fashion. It does not need to process the same amount of data as the batch storage, since it stores data temporarily, which is then moved to the batch storage after a certain period (e.g., 1 day, 1 week, 1 month). It can also include patterns, simulations, and predictions made by data scientists in near real time (e.g., machine learning included in streaming processing). The streaming storage uses exactly the same data modeling technique as the batch storage, and the data stored in these two areas can be combined using an adequate distributed query engine (see Fig. 1). Ideally, for certain scenarios including update operations on data, the streaming storage technology must hold fast random access capabilities. Moreover, the streaming storage may also be supported by in-memory database technologies for fast operations on fast data, assuming that practitioners consider their level of interoperability with the chosen distributed query engine.
While the sandbox storage uses a distributed file system to act as a landing zone and as a place where data scientists can freely experiment, the batch and streaming storage areas are indexed storage systems (e.g., key-value stores, column-oriented stores, document-oriented stores) responsible for storing batch and streaming data, respectively. These two storage areas are indexed to adequately support Big Data Warehousing workloads in an effective and efficient manner. Such workloads may include ad hoc querying or self-service Business Intelligence (BI) over huge amounts of data, structured data visualization (e.g., dashboards, reports), highly intensive geospatial analytics, and visualizing simulations or predictions made by Data Mining/Machine Learning models.

Analytics, Visualization, and Access
The core component of this stage is the distributed query engine, which is able to query the data stored in the indexed storage systems, even combining batch and streaming data in a single query (see the distributed query engines in section "Technological Infrastructure"). The distributed query engine can be used to explore the data stored in the BDW but also serves the purpose of providing data to the data visualization component and to the Data Science sandbox when needed. The data consumer component represents the stakeholders interested in consuming data from the BDW (e.g., data scientists, data analysts, external users, external systems, top-level managers, operational managers).

Technological Infrastructure
After designing the logical solution (architecture) to solve the problem, practitioners need to consider suitable technologies to implement each logical component, in order to materialize the architecture into a physical system. Therefore, the technological infrastructure is presented in Fig. 2, which illustrates a vast set of possible technologies to implement the architecture presented above. One should be aware that there are several alternatives, as the Big Data ecosystem grows every day.

Regarding data CPE, the decision to adopt a certain technology mainly depends on the velocity of data flowing through the system. For batch data, using mechanisms such as Sqoop to collect data from relational databases, custom connectors, or Talend Big Data, for example, is appropriate. However, if data flows via streaming mechanisms, it should be collected using efficient streaming systems like Kafka and Flume, for example. Once the data is collected, it is ready for preparation and enrichment. Technologies like Spark, Pig, Talend Big Data, MapReduce, and Hive are suitable for batch processing, while technologies like Spark Streaming and Storm are suitable for streaming processing. If some of these workloads include the application of previously trained Data Mining/Machine Learning models, practitioners should be aware of the level of integration between their predictive Data Science tools and the chosen technology for data preparation and enrichment.

Depending on the level of infrastructural complexity and the storage access needs (random access vs. sequential access), practitioners may consider three strategies:

1. HDFS for the sandbox storage and HDFS/Hive for the batch and streaming storage
2. HDFS for the sandbox storage, HDFS/Hive for the batch storage, and a NoSQL database for the streaming storage
3. HDFS for the sandbox storage and Kudu for the batch and streaming storage

According to laboratory experiments (Vale Lima 2017), the second strategy may offer less performance than the first one, as NoSQL databases are oriented toward online transaction processing (OLTP) applications (Cattell 2011). However, they facilitate random access operations on streaming data (e.g., updating a previously inserted record or accessing a specific record), whereas HDFS/Hive are sequential access-oriented systems, which typically require recomputing entire tables/partitions to update specific records. Recently, efforts have been made to provide random access capabilities to Hive (Apache Hive 2017), but, currently, it
Big Data Warehouses for Smart Industries, Fig. 2 Technological infrastructure for Big Data Warehouses
is not a feature that can typically be seen in production environments, and it leads to several interoperability challenges (e.g., most SQL-on-Hadoop engines lack support for querying transactional/ACID Hive tables).

Despite being faster, practitioners need to be aware that using HDFS/Hive to store streaming data accentuates the problem of having small files in Hadoop (Mackey et al. 2009). Small streaming processing intervals (micro-batches) lead to very
small files, causing severe stress on the Hadoop NameNode. Consequently, if practitioners do not want to sacrifice data timeliness, option 2 may be more suitable. If increasing streaming intervals to 30 s, 1 min, or 2 min, for example, does not hurt data timeliness, the HDFS files being generated will be bigger and therefore may not affect the performance and stability of the system as much. Finally, option 3 focuses on the use of technologies like Kudu for both batch and streaming data (Lipcon et al. 2015), which promises to support both scenarios without discarding the advantages of HDFS/Hive and NoSQL databases presented above. Kudu is still an emerging technology, however, and there is not yet enough evidence of its performance in these scenarios.

Regarding analytical tasks, several technologies can contribute to supporting the Data Science sandbox, mainly in the training and testing of several Data Mining/Machine Learning models, such as Spark MLlib, Mahout, R/Python, and Weka, among many others. The distributed query engine responsible for processing data stored in the indexed storage systems of the BDW (batch and streaming storage areas) can be implemented using any SQL-on-Hadoop engine that adequately interoperates with the chosen storage technologies, including Presto, Impala, Hive, Spark SQL, Drill, and HAWQ, among others. Despite being frequently mentioned as SQL-on-Hadoop engines, these systems can process and combine data from multiple distributed systems, including NoSQL databases. Some of these engines provide interactive performance, i.e., near real-time query execution times over massive amounts of data, while others provide more robustness, scalability, and higher SQL compatibility levels. There are some scientific benchmarks available (Floratou et al. 2014; Santos et al. 2017), and practitioners are always encouraged to perform some benchmarking before adopting certain technologies for production environments.

Finally, there are several data visualization technologies with appropriate connectors to these distributed query engines, such as Tableau, Power BI, and Tibco. For specific contexts, practitioners may even develop their own custom-made data visualization platforms, using JavaScript, for example.

Data Modeling
The final and maybe central piece of a Big Data Warehousing system is the data modeling approach. Big Data systems are often mentioned as schema-less systems. However, for Big Data Warehousing in particular, at some point data should be structured to support decision-making processes through adequate analytical mechanisms (e.g., data visualization of Data Mining/Machine Learning predictions, simulation results, and historical data). Having said that, this does not mean that the data model should not be flexible; quite the contrary, that is the reason why traditional ways of modeling data are not always the best solution for Big Data contexts (Costa et al. 2017), due to the volume, variety, and velocity of data sources. BDWs should have more flexible and scalable data modeling approaches, such as the one depicted in Fig. 3.

Taking into consideration the description in section "Platforms: Data Organization and Distribution," the sandbox storage can be seen as an analogy to a data lake (O'Leary 2014), where data flows in its raw form through a schema-less distributed file system, being stored for later exploration and processing. Consequently, concerns with metadata and schema are not relevant at this stage. The content generated by source systems lands in this zone, regardless of its size or shape. As previously seen, that is mainly the case for batch data before being processed and moved to the batch storage system. In contrast, streaming data is typically processed immediately and then stored in the streaming storage, assuring latencies that satisfy the business requirements.

The batch and streaming storages use the same modeling approach, which is interpreted as follows:

1. An analytical object represents a subject of interest (e.g., sales transactions, inventory movements, social media posts, webpage clicks). It is an analogy to traditional fact tables (Kimball and Ross 2013).
Big Data Warehouses for Smart Industries, Fig. 3 Data model for Big Data Warehouses
2. Analytical objects contain descriptive and analytical families, which can be either logical or physical constructs, depending on the storage technology in use (see section "Technological Infrastructure"). These families contain a set of several attributes. Descriptive attributes are an analogy to traditional dimensions' attributes. Factual attributes are an analogy to traditional facts in fact tables, while predictive attributes are predictions retrieved from the Data Science sandbox (e.g., sales forecasting, customer segmentation, sentiment associated with a social media post). In contrast to traditional DWs, these attributes are typically denormalized into a single analytical object (e.g., Hive table, HBase/Cassandra table) to avoid the cost of join operations in Big Data environments (Santos et al. 2017) and achieve better performance (Costa et al. 2017).
3. Despite the preference for fully denormalized structures, in certain contexts joining analytical objects may be significantly useful – see the data model in Costa and Santos (2017a). This strategy is significantly useful when one of the analytical objects involved in the join is relatively small, it is not refreshed frequently, and/or it is reused by other analytical objects. For example, the information about the customers of an organization will definitely be reused many times in different analytical objects. As mentioned above, if applicable, practitioners can create an analytical object for "customers." This is not seen as a traditional dimension, as by itself it can contain factual and predictive attributes about customers. This strategy will save storage space (avoiding so much redundancy in Big Data environments) without significantly decreasing performance (sometimes even improving it) for relatively small complementary analytical objects, as strategies like map/broadcast joins are a major selling point of several distributed query engines for Big Data (e.g., Presto).
4. For each analytical object, practitioners should define a granularity key, which may, or may not, be physically implemented through a primary key (depending on the technology being implemented), and partitioning and bucketing/clustering keys, which may significantly improve performance in certain scenarios by adequately distributing and organizing the data (Costa et al. 2017).
5. Besides analytical objects, there are three more types of optional objects. The date and time objects are responsible for storing the date and time attributes to be used by all analytical objects, which can then be used in join operations, as these objects are significantly small. Similarly to complementary analytical objects, besides standardizing information regarding date and time in the BDW, this reduces the storage footprint and may even improve performance (Costa et al. 2017), as the small join operations compensate for the amount of denormalized data to process. Materialized objects simply store long-running and complex queries that are frequently requested from the BDW. They can act as a similar mechanism to traditional OLAP cubes with materialized results.

Examples of Application

This section discusses the implementation of a BDW in an organization that is working to be aligned with the Industry 4.0 initiative. In this context, a Big Data Warehousing project is being developed to help with operational and managerial decision-making. Several working areas of the factory are considered a priority in this project: customer complaints, internal defect costs, productivity associated with the several production lines, and lot release analysis. The BDW being developed follows the foundations presented in the previous sections:

• Data is collected, processed, and stored in batch and streaming storage environments.
• The technological infrastructure in use is aligned with the several technologies identified above.
• Analytical objects are adequately identified, as well as their descriptive and factual attributes. For example, Figs. 4 and 5 are data visualizations based on the "customer complaints (QCs)" analytical object, in which "customer" and "business unit" are descriptive families, and the factual information about the complaint is the analytical family.

The first dashboard (Fig. 4) aims to give a general overview of the QCs, considering several factors: the production date of the claimed parts, the business unit that produced these parts, the responsibility associated with the defective parts (cluster), and the customers that made the complaints.

After a general analysis of the QCs, a detailed perspective on the customers that claimed some parts is presented in Fig. 5. In this dashboard, one can draw conclusions regarding the relationship between the date on which the complaint was made and the production date of the claimed parts. For a simplified analysis, all complaints are colored according to the corresponding customer.
Big Data Warehouses for Smart Industries, Fig. 4 General dashboard for complaints analysis (fictional data)
Big Data Warehouses for Smart Industries, Fig. 5 Complaints analysis by customer dashboard (fictional data)
Future Directions for Research

As BDWs start to be implemented in more contexts, and as technology evolves, methodological aspects may keep evolving. There are significant research contributions to be made to this topic, namely, new/extended techniques and technologies for data storage and processing. Sharing scientific benchmarks of emergent systems will also keep the community aware of different directions regarding possible architectural approaches.

Cross-References

Apache Hadoop
Apache Spark
Columnar Storage Formats
Data Lake
Hive
Impala
NoSQL Database Systems
Spark SQL
Storage Technologies for Big Data

Acknowledgments This entry has been supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT (Fundação para a Ciência e Tecnologia) within the Project Scope: UID/CEC/00319/2013 and the Doctoral scholarship (PD/BDE/135101/2017) and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 002814; Funding Reference: POCI-01-0247-FEDER-002814].

References

Apache Hive (2017) Apache Hive documentation. Apache Software Foundation. https://cwiki.apache.org/confluence/display/Hive/Home. Accessed 12 May 2017
Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Rec 39:12–27. https://doi.org/10.1145/1978915.1978919
Chen M, Mao S, Liu Y (2014) Big data: a survey. Mob Netw Appl 19:171–209. https://doi.org/10.1007/s11036-013-0489-0
Costa C, Santos MY (2017a) The SusCity Big Data Warehousing approach for smart cities. In: Proceedings of the international database engineering & applications symposium, p 10
Costa C, Santos MY (2017b) The data scientist profile and its representativeness in the European e-competence framework and the skills framework for the information age. Int J Inf Manag 37:726–734. https://doi.org/10.1016/j.ijinfomgt.2017.07.010
Big Semantic Data Processing in the Life Sciences Domain 351
processes. Biomedical researchers therefore tend not necessary, some biomedical problems will be
to specialize on single processes as this facilitates discussed.
the collection and analysis of empirical data. As
a result, raw experimental data in biomedical set-
tings is very heterogeneous, creating difficulties
The KG Landscape in Biomedical
in deriving sufficiently large training sets that can
Domains
be used confidently for machine learning. In fact,
many biomedical experts have noticed that data in
The number of publications mentioning seman-
their domain is not “big” because of its volume –
tic web, SPARQL, knowledge graphs, or related
instead, it is “big” primarily because of its variety
concepts can be used as a proxy to understand
and velocity (Almeida et al. 2014; Ruttenberg et
where that technology sits in the “hype cycle”
al. 2007; Weber et al. 2014).
curve. According to google trends, the hype as-
Knowledge graphs and semantic query (KG)
sociated with semantic web and SPARQL peaked
offer a promising solution for the training set
around 2006. However, interest in the topic has
problem – if one could write a SPARQL query to
remained stable, and yearly dedicated confer-
recover treatment, survival time, and genetic in-
ences continue to attract hundreds of partici-
formation from cancer patients around the world,
pants. Searches in Google Scholar, Scopus, and
for example, it may be possible to create a train-
PubMed reveal an upward trend in the number of
ing set and use it to predict which patients would
publications mentioning both KG and biomedical
likely benefit from which drugs. This chapter
or related domains (Bio) when compared against
will present some of the KGs and research tools
KG across domains (Fig. 1), hinting perhaps that
that have enabled and empowered biomedical sci-
these technologies have achieved maturity.
entists with such predictive biomedical analysis
KGs are also trending in industry: on January
tools. The description in this chapter assumes a
2018 LinkedIn included the word “SPARQL”
basic knowledge of semantic technologies, and
in 120 job positions and the phrase “semantic
whereas a knowledge of biological concepts is
web” in 136 – many of which from very rec-
Big Semantic Data Processing in the Life Sciences for 2017 may be an artifact of incomplete indexing since
Domain, Fig. 1 Percentage of papers on the topic of KG the query was performed in early January 2018
which are also in Bio domains. The decrease in PubMed
Big Semantic Data Processing in the Life Sciences Domain 353
ognizable biomedical industries (Eli Lilly, Am- gene names are linked to associated proteins
gen, Wolters Kluwer, Elsevier, Albert Einstein including those generated by alternative splicing
College of Medicine, Duke University Health (different proteins produced by the same gene),
System, and Clarivate). A total of 1990 Google polymorphisms (variants occurring in at least B
patents results matched the word SPARQL – 1% of the population), and/or posttranslational
among which were patents from GlaxoSmithK- modifications (changes in the protein after its
line and Agfa HealthCare. There have also been sequence is assembled). The key to the continued
setbacks and loss of interest in KG in Bio do- growth and popularity of UniProt has been
mains – the Conference on Semantics in Health- its reliance on automated predictions that are
care and Life Sciences (CSHALS), for example, manually curated by experts.
was discontinued in 2015; the Cambridge Seman- Usage trends: the terms “UniProt SPARQL”
tic Web Group, which was heavily influenced by or variations appear 1430 times in Google
the surrounding biotechnology hub, was discon- Scholar, 40 times in Scopus, and 15 times in
tinued in the summer of 2017. The latest version PubMed. UniProt (by itself) appears 8760 times
of the linked open data (LOD) cloud includes in Scholar, 1128 times in Scopus, and 932 times
332 KGs in the life sciences domain, or 28% in PubMed. The UniProt RDF developers have
of the whole set, but it is unclear from the Bio observed approximately 2000 different non-bot
literature whether any of these are being used by users querying the SPARQL endpoint, who use
the community. Nevertheless, the topic of KGs it as an API. The true potential of UniProt as a
and SPARQL federation continues to grow and SPARQL endpoint has yet to be fully harnessed –
be used by a community of biomedical experts however, the latest development in CRISPR–
versed in these technologies. The next sections Cas9 and other gene editing technologies,
will introduce some of most successful KG in Bio together with a wish to custom design new
domains as well as the problems they sought to proteins based on desired function, may become
address. the use case that leads to an explosion in the
number of uses for this endpoint.
Gene Ontology
Founded in 1998, the gene ontology (GO) (Ash- Bio2RDF
burner et al. 2000) was a visionary initiative to Bio2RDF was probably the first project intention-
describe gene products across model organism ally devised to deliver a query federation solution
databases. GO is in fact not one but three sep- to the bioinformatics and systems biology
arate ontologies, all of which necessary to fully communities. In 2008, when Bio2RDF was first
describe a gene product: biological processes, released (Belleau et al. 2008), there were over
cellular components, and molecular functions. 1000 online bioinformatics databases published
GO also includes annotations of gene products annually in the NAR database issue. Frustrated
that link then to each of the three ontologies. by the need to create wrappers for multiple APIs,
Usage trends: the term “gene ontology” ap- many bioinformatics organizations raised the
pears 13 K times in PubMed, 156 K times issue of numerous obstacles to cross-database
in Google Scholar, and 18 K times in Scopus. integration efforts (Haas et al. 2001; Stark et al.
It owes its popularity to the adoption by the 2006; Fujita et al. 2010), the most significant
bioinformatics community, who use it through R of which was entity identifiers and attribute
packages to enhance bioinformatics pipelines. name alignment across sources. Bio2RDF’s
main contribution was the definition of a URI
UniProt RDF creation convention which preserved the origin
UniProt RDF is the KG exposing the UniProt of the identifier or attribute while still respecting
protein database as a SPARQL endpoint. the requirement that URI is dereferenceable
UniProt is the largest knowledgebase of (http://bio2rdf.org/<dataset>:<identifier>, http://
protein sequence and annotations. Canonical bio2rdf.org/<dataset>_vocabulary:<class> or
354 Big Semantic Data Processing in the Life Sciences Domain
for drug targeting. This exploration is known as genes produce multiple proteins via alternative
gene set enrichment analysis (GSEA). In one splicing (Black 2003). Another nomenclature
prominent example, GSEA was used to identify issue arises when scientists uses protein family
prognosis biomarkers and allow for better triag- and individual protein names interchangeably B
ing of leukemia patients (Radich et al. 2006). (e.g., Ras is a protein family that includes H-Ras,
K-Ras, and N-Ras, which are often referred to
Drug Discovery and Drug Repurposing simply as “Ras proteins”). Finally, genes/proteins
Solutions for drug discovery and repurpose are in different species often share a name if they
probably two of the most well-described use share a function. MAP2K1, for example, exists in
cases for KGs in the Bio domain (Jentzsch et both rat (Rattus norvegicus) and in humans. The
al. 2009; Chen et al. 2010; Wild et al. 2011). convention in biomedical publishing has been to
Before starting a clinical trial, most pharma- use the symbol in its uppercase form (MAP2K1)
ceutical companies first seek to understand its when referring to the human variant but the
effect on biological entities. In many cases, the capitalized form (Map2k1) when referring to
interaction between the drug and a known protein nonhuman alternatives. This creates very obvious
structure can be computationally predicted (i.e., challenges for automated entity extraction.
in silico experiments). UniProt RDF is a key The Bio2RDF KG has tackled the gene, pro-
resource for this purpose, but others, including tein, protein family, and transcript nomencla-
Chem2Bio2RDF and Open PHACTS, are also ture problem with resort to a universal identifier
very relevant since they provide the necessary schema that preserves the provenance of the iden-
bridge between biological and chemical entities. tifiers while providing a dereferenceable descrip-
Results from in silico models are often decisive in tion for each, including the entity type and links
moving forward with more expensive preclinical to related entities. With Bio2RDF, bioinformatics
experiments – an added incentive to aggregate ex- workflows can be simplified by resorting to a
tra information when building predictive models. single source for identifier mapping. KGs also
In a recent in silico experiment (Hu et al. 2016), facilitate inference of attributes when the entities
researchers used UniProt and the Protein Data are similar – for example, knowledge that the
Bank to identify five plant-based compounds and Map2k1 protein in rat is responsible for activating
five FDA approved drugs that could to target cell growth can be easily inferred into its human
two cancer-causing genes. KGs in the chem- equivalent by the use of semantic inference. At
ical domain can also help biomedical experts last, the problem of inconsistent phenotype and
achieve more accurate predictions by augmenting disease naming is a problem that has frustrated
or filtering their models with chemical/biological multiple attempts at information integration but
properties such as toxicity (Kotsampasakou et al. one which has greatly benefited from KG-based
2017), pharmacodynamics, and absorption. solutions and is well described in Harrow et al.
(2017).
Mapping Genes to Gene Products
to Phenotypes
Gene names are unique and nonoverlapping –
the Human Genome Nomenclature Consortia
provide scientists with a standard nomenclature. Machine Learning in Bio with KGs
However, most protein names share a name
with the gene from which they originate. In the previous section, some of the most signifi-
Biomedical scientists adopted the convention cant problems in processing of life science data
of using italics when referring to the gene but were presented. The focus so far has been on
not when referring to the protein (e.g., BRAF data integration. This section goes a step further
is the protein, and BRAF is the gene). This 1:1 and explores how KGs are being used to create
gene/protein correspondence breaks down when training sets for machine learning.
356 Big Semantic Data Processing in the Life Sciences Domain
Ontologies for Big Data ISWC 2014. Springer International Publishing, Cham,
Semantic Interlinking pp 114–130
Hu J-B, Dong M-J, Zhang J (2016) A holistic in silico
approach to develop novel inhibitors targeting ErbB1
and ErbB2 kinases. Trop J Pharm Res 15:231. https://
B
References doi.org/10.4314/tjpr.v15i2.3
Jentzsch A, Zhao J, Hassanzadeh O, et al (2009) Linking
Almeida JS, Dress A, Kühne T, Parida L (2014) ICT open drug data. In: Proc I-SEMANTICS 2009, Graz
for Bridging Biology and Medicine (Dagstuhl Perspec- Jiang J, Li X, Zhao C et al (2017) Learning and inference
tives Workshop 13342). Dagstuhl Manifestos 3:31–50. in knowledge-based probabilistic model for medical
https://doi.org/10.4230/DagMan.3.1.31 diagnosis. Knowl-Based Syst 138:58–68. https://doi.
Ashburner M, Ball CA, Blake JA et al (2000) Gene org/10.1016/J.KNOSYS.2017.09.030
ontology: tool for the unification of biology. The Gene Jupp S, Malone J, Bolleman J et al (2014) The EBI
Ontology Consortium. Nat Genet 25:25–29. https://doi. RDF platform: linked open data for the life sciences.
org/10.1038/75556 Bioinformatics 30:1338–1339. https://doi.org/10.1093/
Belleau F, Nolin M-A, Tourigny N et al (2008) Bio2RDF: bioinformatics/btt765
towards a mashup to build bioinformatics knowledge Kotsampasakou E, Montanari F, Ecker GF (2017) Predict-
systems. J Biomed Inform 41:706–716. https://doi.org/ ing drug-induced liver injury: the importance of data
10.1016/J.JBI.2008.03.004 curation. Toxicology 389:139–145. https://doi.org/10.
Black DL (2003) Mechanisms of alternative pre- 1016/J.TOX.2017.06.003
messenger RNA splicing. Annu Rev Biochem Koutkias VG, Lillo-Le Louët A, Jaulent M-C (2017) Ex-
72:291–336. https://doi.org/10.1146/annurev.biochem. ploiting heterogeneous publicly available data sources
72.121801.161720 for drug safety surveillance: computational framework
Chen B, Ding Y, Wang H, et al (2010) Chem2Bio2RDF: and case studies. Expert Opin Drug Saf 16:113–124.
A Linked Open Data Portal for Systems Chemical https://doi.org/10.1080/14740338.2017.1257604
Biology. In: Proceedings of the 2010 IEEE/WIC/ACM Lamurias A, Ferreira JD, Clarke LA, Couto FM (2017)
International Conference on Web Intelligence and In- Generating a tolerogenic cell therapy knowledge graph
telligent Agent Technology - Volume 01. pp 232–239 from literature. Front Immunol 8:1–23. https://doi.org/
Church K (2017) Word2Vec. Nat Lang Eng 23:155–162. 10.3389/fimmu.2017.01656
https://doi.org/10.1017/S1351324916000334 Noy NF, Shah NH, Whetzel PL et al (2009) BioPortal:
Deus HF, Veiga DF, Freire PR et al (2010) Exposing ontologies and integrated data resources at the click of
The Cancer Genome Atlas as a SPARQL endpoint. J a mouse. Nucleic Acids Res 37:W170–W173. https://
Biomed Inform 43:998–1008. https://doi.org/10.1016/ doi.org/10.1093/nar/gkp440
j.jbi.2010.09.004 Radich JP, Dai H, Mao M et al (2006) Gene expres-
Drolet BC, Lorenzi NM (2011) Translational research: sion changes associated with progression and response
understanding the continuum from bench to bed- in chronic myeloid leukemia. Proc Natl Acad Sci
side. Transl Res 157:1–5. https://doi.org/10.1016/j. U S A 103:2794–2799. https://doi.org/10.1073/pnas.
trsl.2010.10.002 0510423103
Fujita PA, Rhead B, Zweig AS et al (2010) The UCSC Robbins DE, Gruneberg A, Deus HF et al (2013) A
Genome Browser database: update 2011. Nucleic self-updating road map of The Cancer Genome Atlas.
Acids Res 39:1–7. https://doi.org/10.1093/nar/gkq963 Bioinformatics 29:1333–1340. https://doi.org/10.1093/
Garcia A, Lopez F, Garcia L, et al (2017) Biotea, seman- bioinformatics/btt141
tics for PubMed Central. https://doi.org/10.7287/peerj. Ruttenberg A, Clark T, Bug W et al (2007) Advanc-
preprints.3469v1 ing translational research with the Semantic Web.
Haas LM, Schwarz PM, Kodali P et al (2001) Discov- BMC Bioinf 8(Suppl 3):S2. https://doi.org/10.1186/
eryLink: a system for integrated access to life sciences 1471-2105-8-S3-S2
data sources. IBM Syst J 40:489–511. https://doi.org/ Saleem M, Padmanabhuni SS, A-CN N et al (2014)
10.1147/sj.402.0489 TopFed: TCGA tailored federated query processing
Harrow I, Jiménez-Ruiz E, Splendiani A et al (2017) and linking to LOD. J Biomed Semant 5:47. https://doi.
Matching disease and phenotype ontologies in the org/10.1186/2041-1480-5-47
ontology alignment evaluation initiative. J Biomed Se- Schriml LM, Arze C, Nadendla S et al (2012) Disease
mant 8:55. https://doi.org/10.1186/s13326-017-0162-9 ontology: a backbone for disease semantic integration.
Hasnain A, Fox R, Decker S, Deus H (2012) Cataloguing Nucleic Acids Res 40:D940–D946. https://doi.org/10.
and Linking Life Sciences LOD Cloud. In: 1st Interna- 1093/nar/gkr972
tional Workshop on Ontology Engineering in a Data- Sioutos N, de Coronado S, Haber MW et al (2007) NCI
driven World OEDW 2012. pp 1–11 thesaurus: a semantic model integrating cancer-related
Hasnain A, Kamdar MR, Hasapis P, et al (2014) Linked clinical and molecular information. J Biomed Inform
Biomedical Dataspace: Lessons Learned Integrating 40:30–43. https://doi.org/10.1016/j.jbi.2006.02.013
Data for Drug Discovery BT. In: Mika P, Tudorache Stark C, Breitkreutz B-J, Reguly T et al (2006) BioGRID:
T, Bernstein A, et al. (eds) The Semantic Web – a general repository for interaction datasets. Nucleic
358 Big Semantic Data Processing in the Materials Design Domain
Acids Res 34:D535–D539. https://doi.org/10.1093/nar/ need to be addressed. This entry discusses these
gkj109 challenges and shows the semantic technologies
Vieira A (2016) Knowledge Representation in Graphs
using Convolutional Neural Networks. Comput Res
that alleviate the problems related to variety,
Repos abs/1612.02255 variability, and veracity.
Wang M (2017) Predicting Rich Drug-Drug Interactions
via Biomedical Knowledge Graphs and Text Jointly
Embedding. Compuring Resour Repos abs/1712.08875
Washington NL, Haendel MA, Mungall CJ et al Overview
(2009) Linking human diseases to animal models
using ontology-based phenotype annotation. PLoS Materials design and materials informatics are
Biol 7:e1000247. https://doi.org/10.1371/journal.pbio. central for technological progress, not the least
1000247
Weber GM, Mandl KD, Kohane IS (2014) Finding the in the green engineering domain. Many tradi-
missing link for big biomedical data. JAMA 311:2479– tional materials contain toxic or critical raw ma-
2480. https://doi.org/10.1001/jama.2014.4228 terials, whose use should be avoided or elimi-
Wild DJ, Ding Y, Sheth AP, et al (2011) Systems chemical nated. Also, there is an urgent need to develop
biology and the Semantic Web: what they mean for the
future of drug discovery research. Drug Discov Today. new environmentally friendly energy technology.
https://doi.org/10.1016/j.drudis.2011.12.019 Presently, relevant examples of materials design
Xu N, Li Y, Zhou X et al (2015) CDKN2 gene deletion challenges include energy storage, solar cells,
as poor prognosis predictor involved in the progression thermoelectrics, and magnetic transport (Ceder
of adult B-lineage acute lymphoblastic leukemia pa-
tients. J Cancer 6:1114–1120. https://doi.org/10.7150/ and Persson 2013; Jain et al. 2013; Curtarolo
jca.11959 et al. 2013).
Yoshida Y, Makita Y, Heida N et al (2009) PosMed The space of potentially useful materials yet
(Positional Medline): prioritizing genes with an arti-
to be discovered – the so-called chemical white
ficial neural network comprising medical documents
to accelerate positional cloning. Nucleic Acids Res space – is immense. The possible combinations
37:W147–W152. https://doi.org/10.1093/nar/gkp384 of, say, up to six different elements constitute
Zeginis D, Hasnain A, Loutas N et al (2014) A collabo- many billions. The space is further extended by
rative methodology for developing a semantic model
possibilities of different phases, low-dimensional
for interlinking cancer chemoprevention linked-data
sources. Semant Web 5(2):127–142. https://doi.org/10. systems, nanostructuring, and so forth, which
3233/SW-130112 adds several orders of magnitude. This space
was traditionally explored by experimental
techniques, i.e., materials synthesis and subse-
quent experimental characterization. Parsing and
searching the full space of possibilities this way
Big Semantic Data Processing are however hardly practical. Recent advances
in the Materials Design in condensed matter theory and materials
Domain modeling make it possible to generate reliable
materials data by means of computer simulations
Patrick Lambrix1 , Rickard Armiento1 , Anna
based on quantum mechanics (Lejaeghere et al.
Delin2 , and Huanyu Li1
1 2016). High-throughput simulations combined
Linköping University, Swedish e-Science
with machine learning can speed up progress
Research Centre, Linköping, Sweden
2 significantly and also help to break out of local
Royal Institute of Technology, Swedish
optima in composition space to reveal unexpected
e-Science Research Centre, Stockholm, Sweden
solutions and new chemistries (Gaultois et al.
2016).
This development has led to a global effort
Definitions – known as the Materials Genome Initiative
(https://www.mgi.gov/) – to assemble and
To speed up the progress in the field of materials curate databases that combine experimentally
design, a number of challenges related to big data known and computationally predicted materials
Big Semantic Data Processing in the Materials Design Domain 359
challenging to have provenance information from description of materials knowledge (and thereby
which one can derive the data quality. Not all the create ontological knowledge).
computed data is confirmed by lab experiments. In the remainder of this section, a brief
Some data is generated by machine learning and overview of databases, ontologies, and standards
data mining algorithms. in the field is given.
Databases
Sources of Data and Semantic The Inorganic Crystal Structure Database (ICSD,
Technologies https://icsd.fiz-karlsruhe.de/) is a frequently
utilized database for completely identified
Although the majority of materials data that has inorganic crystal structures, with nearly 200 k
been produced by measurement or through pre- entries (Belsky et al. 2002; Bergerhoff et al.
dictive computation have not yet become orga- 1983). The data contained in ICSD serve as an
nized in general easy-to-use databases, several important starting point in many electronic struc-
sizable databases and repositories do exist. How- ture calculations. Several other crystallographic
ever, as they are heterogeneous in nature, seman- information resources are also available (Glasser
tic technologies are important for the selection 2016). A popular open-access resource is the
and integration of the data to be used in the Crystallography Open Database (COD, http://
materials design workflow. This is particularly www.crystallography.net/cod/) with nearly 400 k
important to deal with variety, variability, and entries (Grazulis et al. 2012).
veracity. At the International Centre for Diffraction
Within this field the use of semantic technolo- Data (ICDD, http://www.icdd.com/) a number
gies is in its infancy with the development of on- of databases for phase identification are hosted.
tologies and standards. Ontologies aim to define These databases have been in use by experimen-
the basic terms and relations of a domain of inter- talists for a long time.
est, as well as the rules for combining these terms Springer Materials (http://materials.springer.
and relations. They standardize terminology in a com/) contains among many other data sources,
domain and are a basis for semantically enriching the well-known Landolt-Börnstein database, an
data, integration of data from different databases extensive data collection from many areas of
(variety), and reasoning over the data (variability physical sciences and engineering. Similarly,
and veracity). According to Zhang et al. (2015a) The Japan National Institute of Material Science
in the materials domain ontologies have been (NIMS) Materials Database MatNavi (http://
used to organize materials knowledge in a formal mits.nims.go.jp/index_en.html) contains a wide
language, as a global conceptualization for ma- collection of mostly experimental but also some
terials information integration (e.g., Cheng et al. computational electronic structure data.
2014), for linked materials data publishing, for Thermodynamical data, necessary for comput-
inference support for discovering new materials, ing phase diagrams with the CALPHAD method,
and for semantic query support (e.g., Zhang et al. exist in many different databases (Campbell et al.
2015b, 2017). 2014). Open-access databases with relevant data
Further, standards for exporting data from can be found through OpenCalphad (http://www.
databases and between tools are being developed. opencalphad.com/databases.html).
These standards provide a way to exchange data Databases of results from electron structure
between databases and tools, even if the internal calculations have existed in some form for several
representations of the data in the databases and decades. In 1978, Moruzzi, Janak, and Williams
tools are different. They are a prerequisite for published a book with computed electronic prop-
efficient materials data infrastructures that allow erties such as, e.g., density of states, bulk mod-
for the discovery of new materials (Austin 2016). ulus, and cohesive energy of all metals (Moruzzi
In several cases the standards formalize the et al. 2013). Only recently, however, the use of
Big Semantic Data Processing in the Materials Design Domain 361
such databases has become widespread, and some The Materials Ontology in Ashino (2010) was
of these databases have grown to a substantial designed for data exchange among thermal
size. property databases. Other ontologies were built
Among the more recent efforts to collect to enable knowledge-guided materials design B
materials properties obtained from electronic or new materials discovery, such as PREMP
structure calculations publicly available, a few ontology (Bhat et al. 2013) for steel mill products,
prominent examples include the Electronic MatOnto ontology (Cheung et al. 2008) for
Structure Project (ESP) (http://materialsgenome. oxygen ion-conducting materials in the fuel cell
se) with ca 60 k electronic structure results, domain, and SLACKS ontology (Premkumar
Aflow (Curtarolo et al. 2012, http://aflowlib. et al. 2014) that integrates relevant product life-
org/) with data on over 1.7 million compounds, cycle domains which consist of engineering
the Materials Project with data on nearly analysis and design, materials selection, and
70 k inorganic compounds (Jain et al. 2013, manufacturing. The FreeClassOWL ontology
https://materialsproject.org/), the Open Quantum (Radinger et al. 2013) is designed for the
Materials Database (OQMD, http://oqmd. construction and building materials domain
org/), with over 470 k entries (Saal et al. and supports semantic search for construction
2013), and the NOMAD repository with 44 materials. MMOY ontology (Zhang et al. 2016)
million electronic structure calculations (https:// captures metal materials knowledge from Yago.
repository.nomad-coe.eu/). Also available is The ontology design pattern in Vardeman et al.
the Predicted Crystallography Open Database (2017) models and allows for reasoning about
(PCOD, http://www.crystallography.net/pcod/) material transformations in the carbon dioxide
with over 1 million predicted crystal structures, and sodium acetate productions by combining
which is a project closely related to COD. baking soda and vinegar. Some ontologies are
As the amount of computed data grows, generated (Data source in Table 1) by extracting
the need for informatics infrastructure also knowledge from other data resources such as the
increases. Many of the databases discussed Plinius ontology (van der Vet et al. 1994) which
above have made their frameworks available, is extracted from 300 publication abstracts in
and well-known examples include the ones by the domain of ceramic materials, and MatOWL
Materials Project and OQMD. Other publicly (Zhang et al. 2009) which is extracted from
available frameworks used in publications MatML schema data to enable ontology-based
for materials design and informatics include data access. The ontologies may also use other
the Automated Interactive Infrastructure and ontologies as a basis, for instance, MatOnto that
Database for Computational Science (AiiDA, uses DOLCE (Gangemi et al. 2002) and EXPO
http://www.aiida.net/) (Pizzi et al. 2016), the (Soldatova and King 2006).
Atomic Simulation Environment (ASE, https:// From the knowledge representation perspec-
wiki.fysik.dtu.dk/ase/) (Larsen et al. 2017), and tive (Table 2), the basic terms defined in materials
the high-throughput toolkit (httk, http://www. ontologies involve materials, properties, perfor-
httk.org) (Faber et al. 2016). mance, and processing in specific sub-domains.
The number of concepts ranges from a few to
Ontologies several thousands. There are relatively few rela-
We introduce the features of current materials tionships and most ontologies have instances. Al-
ontologies from materials (Table 1) and a knowl- most all ontologies use OWL as a representation
edge representation perspective (Table 2), respec- language. In terms of organization of materials
tively. ontologies, Ashino’s Materials Ontology, Ma-
Most ontologies focus on specific sub- tOnto, and PREMP ontology are developed as
domains of the materials field (Domain in several ontology components that are integrated
Table 1) and have been developed with a specific in one ontology. In Table 2 this is denoted in the
use in mind (Application Scenario in Table 1). modularity column.
362 Big Semantic Data Processing in the Materials Design Domain
Big Semantic Data Processing in the Materials Design Domain, Table 1 Comparison of materials ontologies from
a materials perspective
Materials ontology Data source Domain Application scenario
Ashino’s Materials Ontology Thermal property databases Thermal properties Data exchange, search
(Ashino 2010)
Plinius ontology (van der Vet Publication abstracts Ceramics Knowledge extraction
et al. 1994)
MatOnto (Cheung et al. 2008) DOLCE ontologya , EXPO Crystals New materials
ontologyb discovery
PREMP ontology (Bhat et al. PREMP platform Materials Knowledge-guided
2013) design
FreeClassOWL (Radinger et al. Eurobau datac , Construction and building Semantic query support
2013) GoodRelations ontologyd materials
MatOWL (Zhang et al. 2009) MatML schema data Materials Semantic query support
MMOY (Zhang et al. 2016) Yago data Metals Knowledge extraction
ELSSI-EMD ontology (CEN Materials testing data from Materials testing, ambient Data interoperability
2010) ISO standards temperature tensile
testing
SLACKS ontology (Premkumar Ashino’s Materials Laminated composites Knowledge-guided
et al. 2014) Ontology, MatOnto design
a
DOLCE stands for Descriptive Ontology for Linguistic and Cognitive Engineering
b
EXPO ontology is used to describe scientific experiments
c
Eurobau.com compiles construction materials data from ten European countries
d
GoodRelations ontology (Hepp 2008) is used for e-commerce with concepts such as business entities and prices
Big Semantic Data Processing in the Materials Design Domain, Table 2 Comparison of materials ontologies from a knowledge representation perspective

Materials ontology | Ontology metrics | Language | Modularity
Ashino's Materials Ontology (Ashino 2010) | 606 concepts, 31 relationships, 488 instances | OWL |
Plinius ontology (van der Vet et al. 1994) | 17 concepts, 4 relationships, 119 instances (a) | Ontolingua code |
MatOnto (Cheung et al. 2008) | 78 concepts, 10 relationships, 24 instances | OWL |
PREMP ontology (Bhat et al. 2013) | 62 concepts | UML |
FreeClassOWL (Radinger et al. 2013) | 5714 concepts, 225 relationships, 1469 instances | OWL |
MatOWL (Zhang et al. 2009) | (not available) | OWL |
MMOY (Zhang et al. 2016) | 544 metal concepts, 1781 related concepts, 9 relationships, 318 metal instances, 1420 related instances | OWL |
ELSSI-EMD ontology (CEN 2010) | 35 concepts, 37 relationships, 33 instances | OWL |
SLACKS ontology (Premkumar et al. 2014) | at least 34 concepts and 10 relationships (b) | OWL |

(a) 103 of the 119 instances are elements of the periodic system
(b) The numbers are based on the high-level class diagram and an illustration of instance integration in SLACKS shown in Premkumar et al. (2014)
Belsky A, Hellenbrandt M, Karen VL, Luksch P (2002) New developments in the inorganic crystal structure database (ICSD): accessibility in support of materials research and design. Acta Crystallogr Sect B Struct Sci 58(3):364–369. https://doi.org/10.1107/S0108768102006948
Bergerhoff G, Hundt R, Sievers R, Brown ID (1983) The inorganic crystal structure data base. J Chem Inf Comput Sci 23(2):66–69. https://doi.org/10.1021/ci00038a003
Bernstein HJ, Bollinger JC, Brown ID, Gražulis S, Hester JR, McMahon B, Spadaccini N, Westbrook JD, Westrip SP (2016) Specification of the crystallographic information file format, version 2.0. J Appl Cryst 49:277–284. https://doi.org/10.1107/S1600576715021871
Bhat M, Shah S, Das P, Reddy S (2013) PREMP: knowledge driven design of materials and engineering process. In: ICoRD'13. Springer, pp 1315–1329. https://doi.org/10.1007/978-81-322-1050-4_105
Campbell CE, Kattner UR, Liu ZK (2014) File and data repositories for next generation CALPHAD. Scr Mater 70(Suppl C):7–11. https://doi.org/10.1016/j.scriptamat.2013.06.013
Ceder G, Persson KA (2013) The stuff of dreams. Sci Am 309:36–40
CEN (2010) A guide to the development and use of standards compliant data formats for engineering materials test data. European Committee for Standardization
Cheng X, Hu C, Li Y (2014) A semantic-driven knowledge representation model for the materials engineering application. Data Sci J 13:26–44. https://doi.org/10.2481/dsj.13-061
Cheung K, Drennan J, Hunter J (2008) Towards an ontology for data-driven discovery of new materials. In: McGuinness D, Fox P, Brodaric B (eds) Semantic scientific knowledge integration AAAI/SSS workshop, pp 9–14
Curtarolo S, Setyawan W, Wang S, Xue J, Yang K, Taylor R, Nelson L, Hart G, Sanvito S, Buongiorno-Nardelli M, Mingo N, Levy O (2012) AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations. Comput Mater Sci 58(Supplement C):227–235. https://doi.org/10.1016/j.commatsci.2012.02.002
Curtarolo S, Hart G, Buongiorno-Nardelli M, Mingo N, Sanvito S, Levy O (2013) The high-throughput highway to computational materials design. Nat Mater 12(3):191. https://doi.org/10.1038/nmat3568
Euzenat J, Shvaiko P (2007) Ontology matching. Springer, Berlin/Heidelberg
Faber F, Lindmaa A, von Lilienfeld A, Armiento R (2016) Machine learning energies of 2 million elpasolite (ABC2D6) crystals. Phys Rev Lett 117(13):135502. https://doi.org/10.1103/PhysRevLett.117.135502
Frenkel M, Chirico RD, Diky V, Dong Q, Marsh KN, Dymond JH, Wakeham WA, Stein SE, Königsberger E, Goodwin ARH (2006) XML-based IUPAC standard for experimental, predicted, and critically evaluated thermodynamic property data storage and capture (ThermoML) (IUPAC Recommendations 2006). Pure Appl Chem 78:541–612. https://doi.org/10.1351/pac200678030541
Frenkel M, Chirico RD, Diky V, Brown PL, Dymond JH, Goldberg RN, Goodwin ARH, Heerklotz H, Königsberger E, Ladbury JE, Marsh KN, Remeta DP, Stein SE, Wakeham WA, Williams PA (2011) Extension of ThermoML: the IUPAC standard for thermodynamic data communications (IUPAC Recommendations 2011). Pure Appl Chem 83:1937–1969. https://doi.org/10.1351/PAC-REC-11-05-01
Gangemi A, Guarino N, Masolo C, Oltramari A, Schneider L (2002) Sweetening ontologies with DOLCE. In: Knowledge engineering and knowledge management: ontologies and the semantic web, pp 223–233. https://doi.org/10.1007/3-540-45810-7_18
Gaultois MW, Oliynyk AO, Mar A, Sparks TD, Mulholland GJ, Meredig B (2016) Perspective: web-based machine learning models for real-time screening of thermoelectric materials properties. APL Mater 4(5):053213. https://doi.org/10.1063/1.4952607
Ghiringhelli LM, Carbogno C, Levchenko S, Mohamed F, Huhs G, Lueders M, Oliveira M, Scheffler M (2016) Towards a common format for computational materials science data. Psi-k Scientific Highlights, July
Glasser L (2016) Crystallographic information resources. J Chem Educ 93(3):542–549. https://doi.org/10.1021/acs.jchemed.5b00253
Gražulis S, Daškevič A, Merkys A, Chateigner D, Lutterotti L, Quirós M, Serebryanaya NR, Moeck P, Downs RT, Le Bail A (2012) Crystallography Open Database (COD): an open-access collection of crystal structures and platform for world-wide collaboration. Nucleic Acids Res 40(Database issue):D420–D427. https://doi.org/10.1093/nar/gkr900
Hepp M (2008) GoodRelations: an ontology for describing products and services offers on the web. In: Knowledge engineering: practice and patterns, pp 329–346. https://doi.org/10.1007/978-3-540-87696-0_29
Ivanova V, Lambrix P (2013) A unified approach for debugging is-a structure and mappings in networked taxonomies. J Biomed Semant 4:10:1–10:19. https://doi.org/10.1186/2041-1480-4-10
Jain A, Ong SP, Hautier G, Chen W, Richards WD, Dacek S, Cholia S, Gunter D, Skinner D, Ceder G, Persson KA (2013) Commentary: The Materials Project: a materials genome approach to accelerating materials innovation. APL Mater 1(1):011002. https://doi.org/10.1063/1.4812323
Kaufman JG, Begley EF (2003) MatML: a data interchange markup language. Adv Mater Process 161:35–36
Lambrix P, Strömbäck L, Tan H (2009) Information integration in bioinformatics with ontologies and standards. In: Bry F, Maluszynski J (eds) Semantic techniques for the Web, pp 343–376. https://doi.org/10.1007/978-3-642-04581-3_8
Larsen AH, Mortensen JJ, Blomqvist J, Castelli IE, Christensen R, Dułak M, Friis J, Groves MN, Hammer B, Hargus C, Hermes ED, Jennings PC, Jensen PB, Kermode J, Kitchin JR, Kolsbjerg EL, Kubal J, Kaasbjerg K, Lysgaard S, Maronsson JB, Maxson T, Olsen T, Pastewka L, Peterson A, Rostgaard C, Schiøtz J, Schütt O, Strange M, Thygesen KS, Vegge T, Vilhelmsen L, Walter M, Zeng Z, Jacobsen KW (2017) The atomic simulation environment – a Python library for working with atoms. J Phys Condens Matter 29(27):273002. https://doi.org/10.1088/1361-648X/aa680e
Lejaeghere K, Bihlmayer G, Björkman T, Blaha P, Blügel S, Blum V, Caliste D, Castelli IE, Clark SJ, Corso AD, Gironcoli Sd, Deutsch T, Dewhurst JK, Marco ID, Draxl C, Dułak M, Eriksson O, Flores-Livas JA, Garrity KF, Genovese L, Giannozzi P, Giantomassi M, Goedecker S, Gonze X, Grånäs O, Gross EKU, Gulans A, Gygi F, Hamann DR, Hasnip PJ, Holzwarth NAW, Iuşan D, Jochym DB, Jollet F, Jones D, Kresse G, Koepernik K, Küçükbenli E, Kvashnin YO, Locht ILM, Lubeck S, Marsman M, Marzari N, Nitzsche U, Nordström L, Ozaki T, Paulatto L, Pickard CJ, Poelmans W, Probert MIJ, Refson K, Richter M, Rignanese GM, Saha S, Scheffler M, Schlipf M, Schwarz K, Sharma S, Tavazza F, Thunström P, Tkatchenko A, Torrent M, Vanderbilt D, van Setten MJ, Speybroeck VV, Wills JM, Yates JR, Zhang GX, Cottenier S (2016) Reproducibility in density functional theory calculations of solids. Science 351(6280):aad3000. https://doi.org/10.1126/science.aad3000
Moruzzi VL, Janak JF, Williams AR (2013) Calculated electronic properties of metals. Pergamon Press, New York
Mulholland GJ, Paradiso SP (2016) Perspective: materials informatics across the product lifecycle: selection, manufacturing, and certification. APL Mater 4(5):053207. https://doi.org/10.1063/1.4945422
Murray-Rust P, Rzepa HS (2011) CML: evolution and design. J Cheminform 3:44. https://doi.org/10.1186/1758-2946-3-44
Murray-Rust P, Townsend JA, Adams SE, Phadungsukanan W, Thomas J (2011) The semantics of chemical markup language (CML): dictionaries and conventions. J Cheminform 3:43. https://doi.org/10.1186/1758-2946-3-43
Pizzi G, Cepellotti A, Sabatini R, Marzari N, Kozinsky B (2016) AiiDA: automated interactive infrastructure and database for computational science. Comput Mater Sci 111(Supplement C):218–230. https://doi.org/10.1016/j.commatsci.2015.09.013
Premkumar V, Krishnamurty S, Wileden JC, Grosse IR (2014) A semantic knowledge management system for laminated composites. Adv Eng Inf 28(1):91–101. https://doi.org/10.1016/j.aei.2013.12.004
Radinger A, Rodriguez-Castro B, Stolz A, Hepp M (2013) BauDataWeb: the Austrian building and construction materials market as linked data. In: Proceedings of the 9th international conference on semantic systems. ACM, pp 25–32. https://doi.org/10.1145/2506182.2506186
Rajan K (2015) Materials informatics: the materials gene and big data. Annu Rev Mater Res 45:153–169. https://doi.org/10.1146/annurev-matsci-070214-021132
Saal JE, Kirklin S, Aykol M, Meredig B, Wolverton C (2013) Materials design and discovery with high-throughput density functional theory: the open quantum materials database (OQMD). JOM 65(11):1501–1509. https://doi.org/10.1007/s11837-013-0755-4
Soldatova LN, King RD (2006) An ontology of scientific experiments. J R Soc Interface 3(11):795–803. https://doi.org/10.1098/rsif.2006.0134
Swindells N (2009) The representation and exchange of material and other engineering properties. Data Sci J 8:190–200. https://doi.org/10.2481/dsj.008-007
van der Vet P, Speel PH, Mars N (1994) The Plinius ontology of ceramic materials. In: Mars N (ed) Workshop notes ECAI'94 workshop on comparison of implemented ontologies, pp 187–205
Vardeman C, Krisnadhi A, Cheatham M, Janowicz K, Ferguson H, Hitzler P, Buccellato A (2017) An ontology design pattern and its use case for modeling material transformation. Semant Web 8:719–731. https://doi.org/10.3233/SW-160231
Zhang X, Hu C, Li H (2009) Semantic query on materials data based on mapping MatML to an OWL ontology. Data Sci J 8:1–17. https://doi.org/10.2481/dsj.8.1
Zhang X, Zhao C, Wang X (2015a) A survey on knowledge representation in materials science and engineering: an ontological perspective. Comput Ind 73:8–22. https://doi.org/10.1016/j.compind.2015.07.005
Zhang Y, Luo X, Zhao Y, Zhang H-c (2015b) An ontology-based knowledge framework for engineering material selection. Adv Eng Inf 29:985–1000. https://doi.org/10.1016/j.aei.2015.09.002
Zhang X, Pan D, Zhao C, Li K (2016) MMOY: towards deriving a metallic materials ontology from Yago. Adv Eng Inf 30:687–702. https://doi.org/10.1016/j.aei.2016.09.002
Zhang X, Chen H, Ruan Y, Pan D, Zhao C (2017) MATVIZ: a semantic query and visualization approach for metallic materials data. Int J Web Inf Syst 13:260–280. https://doi.org/10.1108/IJWIS-11-2016-0065

Big Spatial Data Access Methods

▶ Indexing
Blockchain Consensus

▶ Blockchain Transaction Processing

Blockchain Data Management

▶ Blockchain Transaction Processing

Blockchain Transaction Processing

Overview

In 2008, Satoshi Nakamoto (Satoshi 2008) came up with the design of an unanticipated technology that revolutionized research in the distributed systems community. Nakamoto presented the design of a peer-to-peer money exchange process that was shared yet distributed. Nakamoto named his transactional token Bitcoin and brought forth the design of a new paradigm: the blockchain.

The key element of any blockchain system is the existence of an immutable, tamper-proof block. In its simplest form, a block is an encrypted aggregation of a set of transactions. The existence of a block acts as a guarantee that the transactions have been executed and verified.
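The tamper-evidence that this hash-chaining provides can be illustrated with a minimal sketch (illustrative Python only; `make_block`, `block_hash`, and the JSON encoding are hypothetical stand-ins, not any production block format):

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    # Hash a canonical JSON encoding of the block.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(transactions: list, prev_hash: str) -> dict:
    return {"transactions": transactions, "prev_hash": prev_hash}

def verify_chain(chain: list) -> bool:
    # Each block must reference the hash of its predecessor; altering any
    # earlier block invalidates every later link.
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

genesis = make_block(["coinbase"], prev_hash="0" * 64)
chain = [genesis, make_block(["alice->bob:5"], block_hash(genesis))]
```

Because every block commits to the hash of its predecessor, rewriting a past transaction silently is impossible without recomputing every subsequent block.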
Blockchain Transaction Processing, Fig. 2 Blockchain flow: the three main phases of any blockchain application. (a) A client sends a transaction to one of the servers, which disseminates it to all the other servers. (b) The servers run the underlying consensus protocol to determine the block creator. (c) The new block is created and transmitted to each node, which also implies its addition to the global ledger

Blockchain Transaction Processing, Fig. 3 Topologies for blockchain systems. (a) Public blockchain. (b) Hybrid blockchain. (c) Permissioned blockchain. (d) Private blockchain
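The three phases of Fig. 2 can be sketched as a single round (illustrative Python; the server/ledger dictionaries are hypothetical, and the random leader pick is only a placeholder for a real consensus protocol such as those discussed below):

```python
import random

def blockchain_round(servers: list, transaction: str) -> None:
    # (a) The client's transaction is disseminated to all servers.
    for server in servers:
        server["pool"].append(transaction)
    # (b) Consensus stand-in: pick a block creator (real systems use
    #     PoW, PoS, or a BFT protocol here).
    creator = random.choice(servers)
    block = {"txs": list(creator["pool"]), "creator": creator["id"]}
    # (c) The new block is transmitted to each node, which appends it
    #     to its copy of the global ledger and clears its pending pool.
    for server in servers:
        server["ledger"].append(block)
        server["pool"].clear()

servers = [{"id": i, "pool": [], "ledger": []} for i in range(4)]
blockchain_round(servers, "alice->bob:5")
```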
Hybrid blockchain systems attain a middle ground between the two extremes. These systems allow any node to be part of the consensus process, but only some designated nodes are allowed to form the next block. The cryptocurrency Ripple (Schwartz et al. 2014) supports a variant of the hybrid model, where some public institutions can act as transaction validators. Permissioned blockchain systems are the restricted variant of public blockchain systems. These systems enlist a few nodes as members with equal rights, that is, each node that is part of the system can form a block. Such systems can represent the internal workings of an organization, social media groups, or an early-stage startup.

Blockchain Consensus

The common denominator for all blockchain applications is the underlying distributed consensus algorithm. Nakamoto suggested Proof-of-Work (Satoshi 2008) (henceforth referred to as PoW) for achieving consensus among the participating nodes. PoW requires each node to demonstrate its ability to solve a nontrivial task. The participating nodes compete among themselves to solve a complex puzzle (such as the computation of a 256-bit SHA value). The node that computes the solution gets the opportunity to generate the next block. It also disseminates the block to all the nodes, which serves as the proof that it performed some nontrivial computation. Hence, all other nodes respect the winner's ability and reach consensus by continuing to chain ahead of this block.

In PoW, access to larger computational resources determines the winner node(s). However, it is possible that multiple nodes could lay claim to the next block to be added to the chain. Depending on the dissemination of the new block, this could lead to a branching out of the
single chain. Interestingly, these branches are often short-lived, as all the nodes tend to align themselves with the longest chain, which in turn leads to pruning of the branches. It is important to understand that a node receives an incentive only when it is able to append a block to the longest chain. Hence, it is to a node's advantage to always align with the longest chain; otherwise, it would end up spending its computational resources for zero gain. Note: the PoW algorithm can theoretically be compromised by an adversary controlling at least 51% of the computational resources in the network. This theoretical possibility has practical implications, as a group of miners can share resources to generate blocks faster, skewing the decentralized nature of the network.

Proof-of-Stake (King and Nadal 2012) (henceforth referred to as PoS) aims at preserving the decentralized nature of the blockchain network. In the PoS algorithm, a node holding n% of the resources gets the opportunity to create a block n% of the time. Hence, the underlying principle in PoS is that the node with the higher stake lays claim to the generation of the next block. To determine a node's stake, a combination of one or more factors, such as wealth, resources, and so on, could be utilized. The PoS algorithm requires a set of nodes to act as validators. Any node that wants to act as a validator needs to lock up its resources, as a proof of its stake.

PoS only permits a validator to create new blocks. To create a new block, a set of validators participates in the consensus algorithm. The PoS consensus algorithm has two variants: (i) chain-based and (ii) BFT-style. Chain-based PoS uses a pseudorandom algorithm to select a validator, which then creates a new block and adds it to the existing chain of blocks. The frequency of selecting the validator is set to some predefined time interval. BFT-style PoS runs a byzantine fault tolerance algorithm to select the next valid block; here, validators are given the right to propose the next block at random. Another key difference between these algorithms is the synchrony requirement: chain-based PoS algorithms are inherently synchronous, while BFT-style PoS is partially synchronous.

Early implementations of PoS such as Peercoin (King and Nadal 2012) suffer from non-convergence, that is, the lack of a single chain (also referred to as the "nothing-at-stake" attack). Such a situation happens when the longest chain branches into multiple forks. Intuitively, the nodes should aim at converging the branches into a single longest chain. However, if the participating nodes are incentive driven, then they could create blocks in both chains to maximize their gains. Such an attack is not possible in PoW, as it would require a node to have double the computational resources; but as PoS does not require solving a complex problem, nodes can easily create multiple blocks.

Proof-of-Authority (Parity Technologies 2018) (henceforth referred to as PoA) is designed to be used alongside nonpublic blockchain systems. The key idea is to designate a set of nodes as the authority. These nodes are entrusted with the task of creating new blocks and validating the transactions. PoA marks a block as part of the blockchain if it is signed by a majority of the authorized nodes. The incentive model in PoA highlights that it is in the interest of an authority node to maintain its reputation, in order to receive periodic incentives. Hence, PoA does not select nodes based on their claimed stakes.

Proof-of-Space (Ateniese et al. 2014; Dziembowski et al. 2015), also known as Proof-of-Capacity (henceforth referred to as PoC), is a consensus algorithm orthogonal to PoW. It expects nodes to provide a proof that they have sufficient "storage" to solve a computational problem. The PoC algorithm targets computational problems, such as hard-to-pebble graphs (Dziembowski et al. 2015), that need a large amount of memory storage to solve. In the PoC algorithm, the verifier first expects a prover to commit to a labeling of the graph and then queries the prover for random locations in the committed graph. The key intuition behind this approach is that unless the prover has sufficient storage, it would not pass the verification. Proponents of PoC-based cryptocurrencies such as SpaceMint (Park et al. 2015) claim that PoC-based approaches are more resource-efficient than PoW, as storage consumes less energy.
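The stake-proportional selection that underlies chain-based PoS — a node holding n% of the total stake is chosen with roughly n% probability — can be sketched as follows (an illustrative Python sketch; the hashed-seed scheme is a hypothetical simplification, as real systems derive their randomness far more carefully):

```python
import bisect
import hashlib
import itertools

def select_validator(stakes: dict, seed: bytes) -> str:
    # Deterministically map a shared seed to a stake-weighted choice:
    # a node with n% of the total stake owns n% of the draw range.
    names = sorted(stakes)
    cumulative = list(itertools.accumulate(stakes[n] for n in names))
    draw = int.from_bytes(hashlib.sha256(seed).digest(), "big") % cumulative[-1]
    return names[bisect.bisect_right(cumulative, draw)]
```

Because every validator hashes the same seed, all honest nodes agree on the same block creator for the round without exchanging extra messages.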
A decade prior to the first blockchain application, the distributed consensus problem had already excited researchers. Paxos (Lamport 1998) and Viewstamped Replication (Oki and Liskov 1988) presented key solutions to distributed consensus when node failures were restricted to fail-stop. Castro and Liskov (1999) presented their novel Practical Byzantine Fault Tolerance (henceforth referred to as PBFT) consensus algorithm to handle byzantine failures. Interestingly, all of the previously discussed consensus algorithms provide guarantees similar to PBFT. Garay et al. (2015) helped establish an equivalence between PoW and general consensus algorithms. This equivalence motivated the community to design efficient alternatives to PoW.

PBFT runs a three-phase protocol to reach consensus among all the non-byzantine nodes. PBFT requires a bound on the number of byzantine nodes: if n represents the total number of nodes in the system, then PBFT bounds the number of byzantine nodes f as f ≤ ⌊(n − 1)/3⌋. PBFT-based algorithms have the advantage of reduced resource consumption but suffer from large message complexity (of order O(n²)). Hence, these algorithms are preferred for restricted or small blockchain systems, in order to keep the message overhead low.

The algorithm starts with a client sending a request to the primary node. The primary node verifies the request, assigns a sequence number, and sends a pre-prepare message to all the replicas. Each pre-prepare message also contains the client's request and the current view number. When a replica receives a pre-prepare message, it first verifies the request and then transmits a prepare message to all the nodes. The prepare message contains the digest of the client request, the sequence number, and the view number. Once a node receives 2f prepare messages matching the pre-prepare message, it multicasts a commit message. The commit message also contains the digest of the client request, the sequence number, and the view number. Each replica waits to receive 2f + 1 identical commit messages before executing the request, and then transmits the result to the client.

The client needs f + 1 matching responses from different replicas to mark the request as complete. If the client times out while waiting for f + 1 responses, then it multicasts the request to all the replicas. Each replica, on receiving a request from the client that it has not yet executed, starts a timer and queries the primary. If the primary does not reply before the timeout, then the replicas proceed to change the primary. Note: PBFT relies heavily on strong cryptographic guarantees, and each message is digitally signed by its sender using the sender's private key.

Zyzzyva (Kotla et al. 2007) is another interesting byzantine fault tolerance protocol, aimed at reducing the costs associated with BFT-style protocols. Zyzzyva allows the replicas to reach early consensus by permitting speculative execution. In Zyzzyva, the replicas are relieved of the task of ensuring a global order for the client requests. This task is delegated to the client, which in turn informs the replicas if they are not in sync.

The protocol initiates with a client sending a request to the primary, which in turn sends a pre-prepare message to all the replicas. The structure of the pre-prepare message is similar to its PBFT counterpart except that it includes a digest summarizing the history. Each replica, on receiving a request from the primary, executes the request and transmits the response to the client. Each response, in addition to the fields in the pre-prepare message, contains the digest of the history and a signed copy of the message from the primary. If the client receives 3f + 1 matching responses, then it assumes the response is stable and considers the request completed. If the number of matching responses received by the client is in the range 2f + 1 to 3f, then it creates a commit certificate and transmits this certificate to all the nodes. The commit certificate includes the history, which proves to a receiving replica that it can safely commit the history and start processing the next request. All the replicas acknowledge the client for the commit certificate. If the client receives fewer than 2f + 1 responses, then it starts a timer and informs all the replicas again about the impending request. Now, either
the request would be safely completed, or a new primary would be elected.

Although Zyzzyva achieves higher throughput than PBFT, its client-based speculative execution is far from ideal. Zyzzyva places a lot of faith in the correctness of the client, which is quite discomforting, as a malicious client can prevent the replicas from maintaining linearizability. Aardvark (Clement et al. 2009) studies the ill effects of various fast byzantine fault-tolerant protocols and presents the design of a robust BFT protocol. In Aardvark, the messages are signed by the client using both digital signatures and message authentication codes (Katz and Lindell 2007). This prevents malicious clients from performing a denial-of-service attack, as it is costly for the client to sign each message twice. Aardvark uses point-to-point communication instead of multicast communication. The key intuition behind this choice is to prevent a faulty client or replica from blocking the complete network. Aardvark also periodically changes the primary replica. Each replica tracks the throughput of the current primary and suggests replacing the primary when there is a decrease in its throughput. The authors use a simple timer to track the rate of primary response.

RBFT (Aublin et al. 2013) is a simple extension to Aardvark. RBFT aims at smartly detecting a malicious primary replica that could otherwise avoid being detected as malicious by the other replicas. The key intuition behind RBFT is to prevent the primary from insinuating delays that stay within the stated threshold; such a primary could reduce the throughput without ever being replaced. To tackle this problem, RBFT insists on running f + 1 independent instances of the Aardvark protocol on each node. One of the instances is designated as the "master instance," and it executes the requests. The rest of the instances are labeled as "backup instances"; they order the requests and monitor the performance of the master instance. If any backup instance observes a degradation of the performance of the master instance, then it transmits a message to elect a new primary. Note: RBFT does not allow more than one primary on any node; hence, each node can have at most one primary instance. The RBFT protocol has an additional step in comparison to the three phases in PBFT and Aardvark. The client node initiates the protocol by transmitting a request to all the nodes. Next, each node propagates this request to all other nodes, and then the three-phase protocol begins. This extra round of redundancy ensures that the client request reaches all the instances.

Blockchain Systems

Bitcoin (Satoshi 2008) is regarded as the first blockchain application. It is a cryptographically secure digital currency with the aim of disrupting the traditional, institutionalized monetary exchange. Bitcoin acts as the token of transfer between two parties undergoing a monetary transaction. The underlying blockchain system is a network of nodes (also known as miners) that take a set of client transactions and validate them by demonstrating a Proof-of-Work, that is, generating a block. The process of generating the next block is nontrivial and requires large computational resources. Hence, the miners are given incentives (such as Bitcoins) for dedicating their resources and generating the block. Each miner locally maintains an updated copy of the complete blockchain and the associated ledgers for every Bitcoin user.

To ensure that the Bitcoin system remains fair toward all the machines, the difficulty of the Proof-of-Work challenge is periodically increased. We also know that Bitcoin is vulnerable to the 51% attack, which can lead to double spending (Rosenfeld 2014). The intensity of such attacks increases when multiple forks of the longest chain are created. To avoid these attacks, Bitcoin developers suggest that clients wait for their block to be confirmed before they mark the Bitcoins as transferred. This wait ensures that the specific block is sufficiently deep (nearly six blocks) in the longest chain (Rosenfeld 2014). Bitcoin critics also argue that its Proof-of-Work consumes huge amounts of energy (by some claims, one Bitcoin transaction consumes power equivalent to that required by 1.5 American homes per day) and may not be a viable solution for the future.
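The PoW search loop and the six-block confirmation rule just described can be sketched minimally (illustrative Python; the bit-count difficulty is a toy parameter, not Bitcoin's actual compact target encoding):

```python
import hashlib

def mine(block_data: bytes, difficulty_bits: int) -> int:
    # Search for a nonce whose SHA-256 digest falls below the target;
    # raising difficulty_bits shrinks the target and lengthens the search,
    # which is how the network periodically increases the difficulty.
    target = 2 ** (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def is_confirmed(chain_length: int, block_height: int, depth: int = 6) -> bool:
    # A block is treated as confirmed once it is buried `depth` blocks
    # deep in the longest chain (~six blocks for Bitcoin).
    return chain_length - block_height >= depth
```

Anyone can re-run the single hash to verify the winning nonce, while finding it requires on average 2^difficulty_bits attempts — the asymmetry that makes the proof convincing.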
Bitcoin-NG (Eyal et al. 2016) is a scalable variant of Bitcoin's protocol. Our preceding discussion highlights that Bitcoin suffers from throughput degradation. This can be mitigated by either increasing the block size or decreasing the block interval. However, increasing the former increases the propagation delays, and decreasing the latter can lead to incorrect consensus. Bitcoin-NG solves this problem by dividing time into a set of intervals and selecting one leader for each time interval. The selected leader autonomously orders the transactions. Bitcoin-NG also uses two different block structures: key blocks for facilitating leader election and microblocks for maintaining the ledger.

Ethereum (Wood 2015) is a blockchain framework that permits users to create their own applications on top of the Ethereum Virtual Machine (EVM). Ethereum utilizes the notion of smart contracts to facilitate the development of new operations. It also supports a digital cryptocurrency, ether, which is used to incentivize developers to create correct applications. One of the key advantages of Ethereum is that it supports a Turing-complete language for generating new applications on top of the EVM. Initially, Ethereum's consensus algorithm used key elements of both the PoW and PoC algorithms: Ethereum made nodes solve challenges that were not only compute intensive but also memory intensive. This design prevented the existence of miners who utilized specially designed hardware for compute-intensive applications.

Recently, Ethereum modified its consensus protocol to include some notions of the PoS algorithm. The modified protocol is referred to as Casper (Buterin and Griffith 2017). Casper introduces the notion of finality, that is, it ensures that one chain becomes permanent in time. It also introduces the notion of accountability, that is, penalizing a validator who performs the "nothing-at-stake" attack in PoS-based systems. The penalty levied on such a validator is equivalent to negating all of its stake.

Parity (Parity Technologies 2018) is an application designed on top of Ethereum. It provides an interface for its users to interact with the Ethereum blockchain. Parity can be regarded as an interesting application for the blockchain community, as it provides support for both Proof-of-Work and Proof-of-Authority consensus algorithms. Hence, it allows users of the application to set up "authority" nodes and resort to the non-compute-intensive PoA algorithm.

Ripple (Schwartz et al. 2014) is considered the third largest cryptocurrency after Bitcoin and Ethereum in terms of market cap. It uses a consensus algorithm which is a simple variant of BFT algorithms. Ripple requires the number of failures f to be bounded as follows: f < (n − 1)/5 + 1, where n represents the total number of nodes. Ripple's consensus algorithm introduces the notion of a Unique Node List (UNL), which is a subset of the network. Each server communicates with the nodes in its UNL for reaching a consensus. The servers exchange the set of transactions they received from the clients and propose those transactions to their respective UNLs for a vote. If a transaction receives 80% of the votes, it is marked permanent. It is important to understand that if the generated UNL groups form a clique, then forks of the longest chain could coexist. Hence, UNLs are created in a manner such that they share some set of nodes. Another noteworthy observation about the Ripple protocol is that each client needs to select a set of validators, or unique nodes, that it trusts. These validators next utilize the Ripple consensus algorithm to verify the transactions.

Hyperledger (Cachin 2016) is a suite of resources aimed at modeling industry-standard blockchain applications. It provides a series of application programming interfaces (APIs) for developers to create their own nonpublic blockchain applications. Hyperledger provides implementations of blockchain systems that use RBFT and other variants of the PBFT consensus algorithm. It also facilitates the use and development of smart contracts. It is important to understand that the design philosophy of Hyperledger leans toward blockchain applications that require the existence of nonpublic networks, and so they do not need a compute-intensive consensus.
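Ripple's 80% UNL voting threshold can be sketched as a single voting round (an illustrative Python simplification; the node names are hypothetical, and the real protocol iterates through several rounds with rising thresholds):

```python
def is_permanent(votes: dict, unl: set, threshold: float = 0.8) -> bool:
    # A transaction is marked permanent once at least `threshold` (80%)
    # of the server's Unique Node List (UNL) has voted for it.
    yes = sum(1 for node in unl if votes.get(node, False))
    return yes >= threshold * len(unl)

unl = {"n1", "n2", "n3", "n4", "n5"}
```

Overlapping UNLs matter here: if two servers' UNLs were disjoint cliques, each could independently reach its 80% threshold on conflicting transactions, allowing forks to coexist.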
ExpoDB (Sadoghi 2017; Gupta and Sadoghi 2018) is an experimental research platform that facilitates the design and testing of emerging database technologies. ExpoDB consists of a set of five layers providing distinct functionalities (refer to Fig. 4). The application layer acts as the test bed for evaluating the underlying database protocols (Gupta and Sadoghi 2018). It allows the system to access OLTP benchmarks such as YCSB (Cooper et al. 2010), TPC-C (TPP Council 2010), and PPS (Harding et al. 2017). The application layer also acts as an interface for the client-server interaction. The transport layer allows communication of messages between the client and server nodes.

The execution layer facilitates the seamless execution of a transaction with the help of a set of threads. These threads lie at the core of the execution layer, as they run the transaction, abide by the rules of the stated concurrency protocol, and achieve agreement among different transactional partitions. ExpoDB provides implementations of eight concurrency controls and three commit protocols. ExpoDB also provides a storage layer (Sadoghi et al. 2018) for storing the transactional data, messages, and relevant metadata.

ExpoDB extends blockchain functionality to traditional distributed systems through a secure layer. To facilitate secure transactions, ExpoDB provides a cryptographic variant of YCSB – the Secure YCSB benchmark. ExpoDB also contains implementations of a variety of consensus protocols such as PoW, PBFT, RBFT, and Bitcoin-NG.
Blockchain Transaction Processing 375
Future Directions for Research nite state automata, temporal logic, and model
checking (Grumberg and Long 1994; Baier and
Although blockchain technology is just a decade Katoen 2008). We believe similar analysis can be
old, it gained majority of its momentum in the performed in the context of blockchain applica- B
last 5 years. This allows us to render different tions. Theorem provers (such as Z3 De Moura
elements of the blockchain systems and achieve and Bjørner 2008) and proof assistants (such as
higher performance and throughput. Some of COQ Bertot 2006) could prove useful to define
the plausible directions to develop efficient a certified blockchain application. A certified
blockchain systems are as follows: (i) reducing blockchain application can help in stating theo-
the communication messages, (ii) defining retical bounds on the resources required to gen-
efficient block structure, (iii) improving the erate a block. Similarly, some of the blockchain
consensus algorithm, and (iv) designing secure consensus has been shown to suffer from denial-
lightweight cryptographic functions. of-service attacks (Bonneau et al. 2015), and
Statistical and machine learning approaches a formally verified blockchain application can
have presented interesting solutions to automate help realize such guarantees, if the underlying
key processes such as face recognition (Zhao application provides such a claim.
et al. 2003), image classification (Krizhevsky
et al. 2012), speech recognition (Graves et al.
References
2013), and so on. The tools can be leveraged
to facilitate easy and efficient consensus. The Ateniese G, Bonacina I, Faonio A, Galesi N (2014) Proofs
intuition behind this approach is to allow learning of space: when space is of the essence. In: Abdalla
algorithms to select nodes, which are fit to act as M, De Prisco R (eds) Security and cryptography for
networks. Springer International Publishing, pp 538–
a block creator and prune the rest from the list
557
of possible creators. The key observation behind Aublin PL, Mokhtar SB, Quéma V (2013) RBFT: redun-
such a design is that the nodes selected by the dant byzantine fault tolerance. In: Proceedings of the
algorithm are predicted to be non-malicious. Ma- 2013 IEEE 33rd international conference on distributed
computing systems, ICDCS ’13. IEEE Computer Soci-
chine learning techniques can play an important ety, pp 297–306
role in eliminating the human bias and inexpe- Baier C, Katoen JP (2008) Principles of model checking
rience. To learn which nodes can act as block (representation and mind series). The MIT Press, Cam-
creators, a feature set, representative of the nodes, bridge/London
Bertot Y (2006) Coq in a hurry. CoRR abs/cs/0603118,
needs to be defined. Some interesting features http://arxiv.org/abs/cs/0603118, cs/0603118
can be geographical distance, cost of communica- Bonneau J, Miller A, Clark J, Narayanan A, Kroll JA,
tion, available computational resources, available Felten EW (2015) SoK: research perspectives and
memory storage, and so on. These features would challenges for bitcoin and cryptocurrencies. In: Pro-
ceedings of the 2015 IEEE symposium on security and
help in generating the dataset that would help to privacy, SP ’15. IEEE Computer Society, Washington,
train and test the underlying machine learning DC, pp 104–121
model. This model would be ran against new Buterin V, Griffith V (2017) Casper the friendly final-
nodes that wish to join the associated blockchain ity gadget. CoRR abs/1710.09437, http://arxiv.org/abs/
1710.09437, 1710.09437
application. Cachin C (2016) Architecture of the hyperledger
The programming languages and software en- blockchain fabric. In: Workshop on distributed cryp-
gineering communities have developed several tocurrencies and consensus ledgers, DCCL 2016
works that provide semantic guarantees to a lan- Cachin C, Vukolic M (2017) Blockchain consensus proto-
cols in the wild. CoRR abs/1707.01873
guage or an application (Wilcox et al. 2015; Castro M, Liskov B (1999) Practical byzantine fault
Leroy 2009; Kumar et al. 2014). These works tolerance. In: Proceedings of the third symposium on
have tried to formally verify (Keller 1976; Leroy operating systems design and implementation, OSDI
2009) the system using the principles of pro- ’99. USENIX Association, Berkeley, pp 173–186
Clement A, Wong E, Alvisi L, Dahlin M, Marchetti M
gramming languages and techniques such as fi- (2009) Making byzantine fault tolerant systems tolerate
376 Blockchain Transaction Processing
byzantine faults. In: Proceedings of the 6th USENIX In: Proceedings of the 25th international conference on
symposium on networked systems design and imple- neural information processing systems, NIPS’12, vol 1.
mentation, NSDI’09. USENIX Association, Berkeley, Curran Associates Inc., pp 1097–1105
pp 153–168 Kumar R, Myreen MO, Norrish M, Owens S (2014)
Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears CakeML: a verified implementation of ML. In: Pro-
R (2010) Benchmarking cloud serving systems with ceedings of the 41st ACM SIGPLAN-SIGACT sympo-
YCSB. In: Proceedings of the 1st ACM symposium on sium on principles of programming languages, POPL
cloud computing. ACM, pp 143–154 ’14. ACM, New York, pp 179–191
Decker C, Wattenhofer R (2013) Information propagation Lamport L (1998) The part-time parliament. ACM Trans
in the bitcoin network. In: 13th IEEE international Comput Syst 16(2):133–169
conference on peer-to-peer computing (P2P), Trento Leroy X (2009) A formally verified compiler back-end.
De Moura L, Bjørner N (2008) Z3: an efficient SMT J Autom Reason 43(4):363–446
solver. Springer, Berlin/Heidelberg/Berlin, pp 337–340 Nawab F (2018) Geo-scale transaction processing.
Dziembowski S, Faust S, Kolmogorov V, Pietrzak K Springer International Publishing, pp 1–7. https://doi.
(2015) Proofs of space. In: Gennaro R, Robshaw org/10.1007/978-3-319-63962-8_180-1
M (eds) Advances in cryptology – CRYPTO 2015. Oki BM, Liskov BH (1988) Viewstamped replication:
Springer, Berlin/Heidelberg, pp 585–605 a new primary copy method to support highly-
Eyal I, Gencer AE, Sirer EG, Van Renesse R (2016) available distributed systems. In: Proceedings of the
Bitcoin-NG: a scalable blockchain protocol. In: Pro- seventh annual ACM symposium on principles of dis-
ceedings of the 13th USENIX conference on net- tributed computing, PODC ’88. ACM, New York, pp
worked systems design and implementation, NSDI’16. 8–17
USENIX Association, Berkeley, pp 45–59 Parity Technologies (2018) Parity ethereum blockchain.
Garay J, Kiayias A, Leonardos N (2015) The bitcoin https://www.parity.io/
backbone protocol: analysis and applications. Springer, Park S, Kwon A, Fuchsbauer G, Gaži P, Alwen J, Pietrzak
Berlin/Heidelberg/Berlin, pp 281–310 K (2015) SpaceMint: a cryptocurrency based on proofs
Graves A, Mohamed A, Hinton GE (2013) Speech of space. https://eprint.iacr.org/2015/528
recognition with deep recurrent neural networks. Pilkington M (2015) Blockchain technology: principles
CoRR abs/1303.5778, http://arxiv.org/abs/1303.5778, and applications. In: Research handbook on digital
1303.5778 transformations. SSRN
Gray J, Lamport L (2006) Consensus on transaction com- Rosenfeld M (2014) Analysis of hashrate-based double
mit. ACM TODS 31(1):133–160 spending. CoRR abs/1402.2009, http://arxiv.org/abs/
Grumberg O, Long DE (1994) Model checking and 1402.2009, 1402.2009
modular verification. ACM Trans Program Lang Syst Sadoghi M (2017) Expodb: an exploratory data sci-
16(3):843–871 ence platform. In: Proceedings of the eighth bien-
Gupta S, Sadoghi M (2018) EasyCommit: a non-blocking nial conference on innovative data systems research,
two-phase commit protocol. In: Proceedings of the CIDR
21st international conference on extending database Sadoghi M, Bhattacherjee S, Bhattacharjee B, Canim M
technology, Open Proceedings, EDBT (2018) L-store: a real-time OLTP and OLAP system.
Harding R, Van Aken D, Pavlo A, Stonebraker M (2017) OpenProceeding.org, EDBT
An evaluation of distributed concurrency control. Proc Satoshi N (2008) Bitcoin: a peer-to-peer electronic cash
VLDB Endow 10(5):553–564 system. http://bitcoin.org/bitcoin.pdf
Jakobsson M, Juels A (1999) Proofs of work and Schwartz D, Youngs N, Britto A (2014) The ripple proto-
bread pudding protocols. In: Proceedings of the IFIP col consensus algorithm. https://www.ripple.com/
TC6/TC11 joint working conference on secure in- Steen Mv, Tanenbaum AS (2017) Distributed systems, 3rd
formation networks: communications and multimedia edn, version 3.01. distributed-systems.net
security, CMS ’99. Kluwer, B.V., pp 258–272 TPP Council (2010) TPC benchmark C (Revision 5.11)
Katz J, Lindell Y (2007) Introduction to modern cryptog- Wilcox JR, Woos D, Panchekha P, Tatlock Z, Wang X,
raphy. Chapman & Hall/CRC, Boca Raton Ernst MD, Anderson T (2015) Verdi: a framework
Keller RM (1976) Formal verification of parallel pro- for implementing and formally verifying distributed
grams. Commun ACM 19(7):371–384 systems. In: Proceedings of the 36th ACM SIG-
King S, Nadal S (2012) PPCoin: peer-to-peer crypto PLAN conference on programming language design
currency with proof-of-stake. peercoin.net and implementation, PLDI ’15. ACM, New York, pp
Kotla R, Alvisi L, Dahlin M, Clement A, Wong E (2007) 357–368
Zyzzyva: speculative byzantine fault tolerance. In: Pro- Wood G (2015) Ethereum: a secure decentralised gen-
ceedings of twenty-first ACM SIGOPS symposium eralised transaction ledger. http://gavwood.com/paper.
on operating systems principles, SOSP ’07. ACM, pdf
New York, pp 45–58. https://doi.org/10.1145/1294261. Zhao W, Chellappa R, Phillips PJ, Rosenfeld A (2003)
1294267 Face recognition: a literature survey. ACM Comput
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet Surv 35(4):399–458. https://doi.org/10.1145/954339.
classification with deep convolutional neural networks. 954342
BSP Programming Model 377
direct memory access (RDMA) to reduce net- be used (Bonorden 2007a,b; Khayyat et al. 2013)
work costs. to ensure a balanced workload of the BSP during
runtime. Such explicit load balancing techniques
Synchronization employ the restrictive nature of the BSP model to
The synchronization phase ensures the consis- predict the behavior of the user’s code (González
tency of the BSP computations and helps to et al. 2001) and transparently reassign computing
avoid deadlocks. It concludes the end of an iter- units without affecting the application’s correct-
ation by strictly blocking the system from start- ness. Since computing units are independent, they
ing the next iteration until all computing units can be parallelized in shared memory multi-core
complete their current execution and the outgo- systems (Valiant 2011; Yzelman et al. 2014), dis-
ing messages reach their destination successfully. tributed shared-nothing environments (Hill et al.
It may use one or more synchronization barri- 1998), graphics processing units (GPU) (Hou
ers (Khayyat et al. 2013) depending on the nature et al. 2008), parallel system on chip (Bonor-
of the BSP system. For applications with lower den 2008), and heterogeneous clusters (Bonorden
consistency requirements (Gonzalez et al. 2012; 2008).
Xiao et al. 2017), Stale Synchronous Parallel
(SSP) (Cui et al. 2014) model may be used to
reduce the waiting time for each iteration. This
is done by boundingly relaxing the strictness of Limitations
the synchronization barriers to enable computing
units to start the next iteration without waiting for The main disadvantage of BSP is that not all
strugglers to finish their current iteration. applications can be programmed in its restrictive
model. The autonomy and independence of com-
puting units in BSP hinder programmers from
Advantages developing applications that require high data de-
pendency within its data partitions. Similarly, ex-
The BSP model simplifies parallel computations. plicitly synchronizing the state of different com-
It allows users to implement parallel applica- puting units may require sophisticated imple-
tions without worrying about the challenges of mentation of the user’s logic (Sakr et al. 2017;
parallel environments. The workload of a BSP Abdelaziz et al. 2017; Teixeira et al. 2015) or
application can be automatically balanced by significant modifications to the underlying work-
increasing the number of computing units. More flow of the BSP system (Salihoglu and Widom
sophisticated load balancing techniques can also 2014).
BSP Programming Model 379
BSP also suffers from strugglers, i.e., slow reducer. Although MapReduce is not originally
computing units as its strict synchronization bar- designed to run as a traditional BSP systems,
riers restrict the execution of the next iteration un- its execution model follows BSP. That is,
til all computing units finished and all messages mappers and reducers can be considered as B
are delivered. The cost of each iteration in BSP is computing units of two consecutive supersteps
the sum of the execution time of the most delayed for BSP, while intermediate data shuffling and
computing unit, the cost to transmit or receive HDFS I/O play the role of communication
the largest messages buffer for a computing unit, and synchronization phases in BSP. As shown
and the latency of the synchronization barrier. in Fig. 2, each MapReduce iteration contains
As a result, the performance of a BSP iteration two BSP supersteps. The mapper function is
becomes as fast as its slowest computing unit. used to generate the distributed computing
Asynchronous iterations or less strict barriers units of the first superstep (superst ep.i/ ),
may be used to improve the performance of BSP where the execution of the next superstep
at the expense of the algorithm’s accuracy and is held by the shuffling and sorting phase
convergence rate (Gonzalez et al. 2012). Another until all mappers finish their execution. The
way to avoid strugglers is either by excluding reducer’s supertep (superst ep.iC1/ ) starts once
expensive computing units from the computa- the shuffling phase completes and the next
tion (Abdelaziz et al. 2017), utilizing replication superstep(superst ep.iC2/ ) is blocked until all
to reduce their load (Salihoglu and Widom 2013; reducers write their data to HDFS. Alongside
Gonzalez et al. 2012), or shuffling the CPU as- mappers and reducers, programmers may also
signment of expensive computing units (Bonor- implement a combiner that is executed on the
den 2007a; Khayyat et al. 2013). output of mappers, i.e., in superst ep.i/ and
(superst ep.iC2/ ), to reduce the size of the
exchanged messages and thus reduce the network
Applications overhead.
BSP Programming Model, Fig. 2 The BSP compute model within MapReduce
380 BSP Programming Model
range of applications in which it explicitly mod- current trends in theory and practice of informatics
els its computations on top of the BSP program- on theory and practice of informatics, SOFSEM’99.
Springer, London, pp 349–359
ming model. Apache Hama provides three ab- Blelloch GE, Gibbons PB, Matias Y, Zagha M (1995)
stractions to its computing units: general, vertex- Accounting for memory bank contention and delay
based for graph processing, and neuron-based for in high-bandwidth multiprocessors. In: Proceedings of
machine learning. As a result, Apache Hama is the seventh annual ACM symposium on parallel algo-
rithms and architectures, SPAA’95. ACM, New York,
able to customize its optimizations to the commu- pp 84–94
nication phase and synchronization barriers based Bonorden O (2007a) Load balancing in the bulk-
on the underlying BSP application. synchronous-parallel setting using process migrations.
In: 2007 IEEE international parallel and distributed
processing symposium, pp 1–9
Pregel Bonorden O (2007b) Load balancing in the bulk-
Pregel (Malewicz et al. 2010) and its clones are synchronous-parallel setting using process migrations.
a typical example of a BSP application that is In: Proceedings of the 21st international parallel
customized to scale large graph computation dis- and distributed processing symposium (IPDPS), Long
Beach, pp 1–9, 26–30 Mar 2007. https://doi.org/10.
tributed systems. Each vertex in the input graph 1109/IPDPS.2007.370330
is assigned a distinct compute unit. In each graph Bonorden O (2008) Versatility of bulk synchronous
iteration, vertices exchange messages about their parallel computing: from the heterogeneous cluster
state and the weights of their graph edges to to the system on chip. PhD thesis, University of
Paderborn, Germany. http://ubdok.uni-paderborn.de/
achieve a newer meaningful state. The commu- servlets/DerivateServlet/Derivate-6787/diss.pdf
nication pattern of a typical Pregel application is Cui H, Cipar J, Ho Q, Kim JK, Lee S, Kumar A,
usually predictable; a vertex only sends messages Wei J, Dai W, Ganger GR, Gibbons PB, Gibson GA,
to its direct neighbors. As a result, some Pregel Xing EP (2014) Exploiting bounded staleness to speed
up big data analytics. In: Proceedings of the 2014
clones (Gonzalez et al. 2014) utilized this pre- USENIX conference on USENIX annual technical
dictability to improve the overall performance of conference, USENIX association, USENIX ATC’14,
the BSP’s communication phase. Pregel also uti- pp 37–48
lizes aggregators, which is a commutative and as- Dean J, Ghemawat S (2008) Mapreduce: simplified
data processing on large clusters. Commun ACM
sociative function, to reduce the size of messages 51(1):107–113
exchanged between the graph vertices. Graph Gonzalez JA, Leon C, Piccoli F, Printista M, Roda JL, Ro-
algorithms on Pregel proceed with several BSP driguez C, de Sande F (2000) Oblivious BSP. Springer,
iterations until the state of its vertices converge Berlin/Heidelberg, pp 682–685
González JA, León C, Piccoli F, Printista M, Roda JL,
to specific values or reach a maximum predefined Rodríguez C, de Sande F (2001) Performance predic-
limit for the allowed number of iterations. tion of oblivious BSP programs. Springer, Berlin/Hei-
delberg, pp 96–105. https://doi.org/10.1007/3-540-
44681-8_16
Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C
Cross-References (2012) PowerGraph: distributed graph-parallel compu-
tation on natural graphs. In: Presented as part of the
10th USENIX symposium on operating systems design
Graph Processing Frameworks
and implementation (OSDI), USENIX, Hollywood,
Parallel Graph Processing pp 17–30
Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin
MJ, Stoica I (2014) GraphX: graph processing in
a distributed dataflow framework. In: 11th USENIX
References symposium on operating systems design and im-
plementation (OSDI). USENIX Association, Broom-
Abdelaziz I, Harbi R, Salihoglu S, Kalnis P (2017) Com- field, pp 599–613. https://www.usenix.org/conference/
bining vertex-centric graph processing with SPARQL osdi14/technical-sessions/presentation/gonzalez
for large-scale RDF data analytics. IEEE Trans Parallel Goodrich MT (1999) Communication-efficient parallel
Distrib Syst 28(12):3374–3388 sorting. SIAM J Comput 29(2):416–432
Beran M (1999) Decomposable bulk synchronous parallel Hill JMD, McColl B, Stefanescu DC, Goudreau MW,
computers. In: Proceedings of the 26th conference on Lang K, Rao SB, Suel T, Tsantilas T, Bisseling RH
Business Process Analytics 381
(1998) Bsplib: the BSP programming library. Parallel graph computation. In: Proceedings of the 24th in-
Comput 24(14):1947–1980 ternational conference on world wide web, interna-
Hou Q, Zhou K, Guo B (2008) BSGP: bulk-synchronous tional world wide web conferences steering committee,
GPU programming. In: ACM SIGGRAPH 2008 Pa- WWW’15, pp 1307–1317
pers, SIGGRAPH’08. ACM, New York, pp 19:1–19:12 Yzelman AN, Bisseling RH, Roose D, Meerbergen K
B
Juurlink BHH, Wijshoff HAG (1996) The e-BSP model: (2014) Multicorebsp for c: a high-performance library
incorporating general locality and unbalanced commu- for shared-memory parallel programming. Int J Parallel
nication into the BSP model. In: Proceedings of the Prog 42(4):619–642
second international Euro-Par conference on parallel
processing-volume II, Euro-Par’96. Springer, London,
pp 339–347
Khayyat Z, Awara K, Alonazi A, Jamjoom H, Williams BSPlib
D, Kalnis P (2013) Mizan: a system for dynamic load
balancing in large-scale graph processing. In: Proceed-
ings of the 8th ACM European conference on computer BSP Programming Model
systems, EuroSys’13. ACM, pp 169–182
Li M, Lu X, Hamidouche K, Zhang J, Panda DK (2016)
Mizan-RMA: accelerating Mizan graph processing
framework with MPI RMA. In: 2016 IEEE 23rd in-
ternational conference on high performance computing
Bulk Synchronous Parallel
(HiPC), pp 42–51
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn BSP Programming Model
I, Leiser N, Czajkowski G (2010) Pregel: a system
for large-scale graph processing. In: Proceedings of
the 2010 ACM SIGMOD international conference on
management of data, SIGMOD’10. ACM, pp 135–146
Sakr S, Orakzai FM, Abdelaziz I, Khayyat Z (2017) Business Process Analytics
Large-scale graph processing using apache giraph.
Springer International Publishing AG, Cham Marlon Dumas1 and Matthias Weidlich2
Salihoglu S, Widom J (2013) GPS: a graph processing sys- 1
Institute of Computer Science, University of
tem. In: Proceedings of the 25th international confer-
ence on scientific and statistical database management, Tartu, Tartu, Estonia
2
SSDBM. ACM, pp 22:1–22:12 Humboldt-Universität zu Berlin, Department of
Salihoglu S, Widom J (2014) Optimizing graph algorithms Computer Science, Berlin, Germany
on Pregel-like systems. Proc VLDB Endow 7(7):577–
588
Siddique K, Akhtar Z, Yoon EJ, Jeong YS, Dasgupta
D, Kim Y (2016) Apache Hama: an emerging bulk Synonyms
synchronous parallel computing framework for big
data applications. IEEE Access 4:8879–8887 Business process mining; Business process
Teixeira CHC, Fonseca AJ, Serafini M, Siganos G, Zaki
monitoring; Process mining; Workflow mining
MJ, Aboulnaga A (2015) Arabesque: a system for
distributed graph mining. In: Proceedings of the 25th
symposium on operating systems principles, SOSP’15.
ACM, New York, pp 425–440 Definitions
Valiant LG (1990) A bridging model for parallel compu-
tation. Commun ACM 33(8):103–111
Valiant LG (2011) A bridging model for multi-core com-
Business process analytics is a body of concepts,
puting. J Comput Syst Sci 77(1):154–166 methods, and tools that allows organizations to
Williams TL, Parsons RJ (2000) The heterogeneous bulk understand, analyze, and predict the performance
synchronous parallel model. Springer, Berlin/Heidel- of their business processes and their conformance
berg, pp 102–108
Xiao W, Xue J, Miao Y, Li Z, Chen C, Wu M, Li W,
with respect to normative behaviors, on the basis
Zhou L (2017) Tux²: distributed graph computation of data collected during the execution of said pro-
for machine learning. In: 14th USENIX symposium on cesses. Business process analytics encompasses
networked systems design and implementation (NSDI), a wide range of methods for extracting, cleans-
Boston, pp 669–682
Yan D, Cheng J, Lu Y, Ng W (2015) Effective techniques
ing, and visualizing business process execution
for message reduction and load balancing in distributed data, for discovering business process models
382 Business Process Analytics
from such data, for analyzing the performance as a reduction of costs, execution times, or error
and conformance of business processes both in a rates but may also refer to gaining a compet-
postmortem and in an online manner, and finally, itive advantage through innovation efforts. The
for predicting the future state of the business essence of BPM, though, is that improvement is
processes based on their past executions. not limited to individual steps of an organiza-
tion’s operations. Rather, improvement shall take
a holistic view and consider the scope of pro-
Overview cesses, entire chains of business events, activities,
and decisions.
This entry provides an overview of the general Adopting this view and the definition brought
context of business process analytics and intro- forward by Dumas et al. (2018), a business pro-
duces a range of concepts and methods, all the cess is a collection of inter-related events, activi-
way from the extraction and cleansing of business ties, and decision points that involve a number of
process event data, to the discovery of process actors and objects, which collectively lead to an
models, the evaluation of the performance and outcome that is of value to at least one customer.
conformance of business processes both offline Typically, a business process has a well-defined
and online, as well as the management of collec- start and end, which gives rise to a notion of an
tions of business process models. These concepts instance of the process, also known as case. Such
and methods are discussed in detail in subsequent a case is the particular occurrence of business
entries of this encyclopedia. events, execution of activities, and realization
of decisions to serve a single request by some
customer.
The Context of Business Process The way in which a business process is de-
Analytics signed and performed affects both the quality of
service that customers perceive and the efficiency
The context of business process analytics is with which services are delivered. An organiza-
spanned by a few essential concepts. Below, we tion can outperform another organization offering
start by reviewing the notion of a business process similar kinds of service, if it has better processes
as a means to structure the operations of an and executes them better. However, the context of
organization. We then discuss how the execution business processes in terms of, for instance, orga-
of a business process on top of modern enterprise nizational structures, legal regulations, and tech-
software systems leads to the accumulation of nological infrastructures, and hence the processes
detailed data about the process, which provide themselves, is subject to continuous change. Con-
the basis for business process analytics. We then sequently, to ensure operational excellence, one-
discuss the concept of business process model, off efforts are not sufficient. Rather, there is a
which plays a central role in the field of business need to continuously assess the operation of a
process management in general and business process, to identify bottlenecks, frequent defects
analytics in particular. and their root causes, and other sources of inef-
ficiencies and improvement opportunities. Busi-
Business Processes ness process analytics is a body of techniques to
The business process management (BPM) disci- achieve this goal.
pline targets models and methods to oversee how
work is performed in an organization (Dumas Business Process Event Data
et al. 2018). It aims at ensuring consistent out- Modern companies use dedicated software tools,
comes and strives for materializing improvement called enterprise software systems, to record, co-
opportunities. The latter may refer to various ordinate, and partially conduct everyday work.
operational aspects. Depending on the objectives Specifically, an enterprise software system sup-
of an organization, improvement may be defined ports the conduct of a business process by au-
Business Process Analytics 383
tomating and coordinating business events, ac- ample, after triggering an automated plausibility
tivities, and decisions involved in each case of check of a loan request, an enterprise software
the process. The former means that the reaction system decides whether to forward the request to
to business events, the conduct of activities, or a a worker for detailed inspection. If so, a worker B
routine for decision-making may directly be pro- manually determines the conditions under which
vided by a software system. As an example, one a loan is offered. A definition of these conditions
may consider the scenario of a bank processing is then submitted to the system when completing
student loans. Here, a database transaction may the respective work item.
be executed to store a loan request after it has By coordinating and conducting the activities
been received. This transaction corresponds to an related to the cases of a business process, an
automated activity that is performed immediately enterprise software system produces an event
in response to the event of receiving the request. stream. Each event denotes a state change of a
Subsequently, a Web service call may be made to case at a particular point in time. Such an event
check the data provided as part of the request for stream may directly be used as input for business
plausibility, which corresponds to an automated process analytics. For instance, using stream pro-
activity that is triggered by reaching a particular cessing technology, compliance violations (loans
state in the respective case. are granted without a required proof of identity of
On the other hand, enterprise software sys- the requester) or performance bottlenecks (final
tems coordinate the (potentially manual) efforts approval of loan offers delays the processing by
of workers, people of an organization that are multiple days) can be detected.
in charge of the conduct of specific activities In many cases, however, business process ana-
of the process. In that case, workers see only activities for which they are directly responsible, while the enterprise software system hides the complexity of coordinating the overall process. A worker then faces a number of work items that need to be handled. In the above example, a loan officer might see a list of student loan applications that need to be assessed (i.e., accepted or rejected). The assessment of one loan application is a work item, which corresponds to a single activity. However, it also belongs to a specific case, i.e., a specific loan request. A single worker may therefore have multiple work items to handle, all related to the same activity yet belonging to different cases.
An enterprise software system keeps track of the state of each case at any point in time. To this end, it creates a data record, commonly referred to as an event, whenever an automated activity is executed, a business event occurs, a worker starts or finishes working on a work item, or a decision is made. Each of these events typically includes additional data, e.g., data provided as a result of executing an activity, whether automated (e.g., the response message of a Web service call) or conducted manually (e.g., data inserted by a worker in a Web form).
In many settings, business process analytics is done in a postmortem manner. In other words, events are recorded and stored in a so-called event log, which is analyzed later using a range of techniques. An event log is a collection of events recorded during the execution of (all) cases of a business process during a period of time.
In order to understand what types of analysis an event log enables, we first need to introduce the concept of a business process model. Indeed, a subset of the analysis techniques that can be applied to an event log also involve process models as input, as output, or both. These techniques are commonly known under the umbrella term of process mining (van der Aalst 2016).
Business Process Models
A business process model is a representation of a business process, i.e., the way an organization operates to achieve a goal, as outlined above. It represents the activities, business events, and decisions of a process along with their causal and temporal execution dependencies. To this end, a typical business process model is a graph consisting of at least two types of nodes: task nodes and control nodes. Task nodes describe activities, whether they are performed by workers, by software systems, or by a combination thereof. Control nodes, along with the structure of the graph, capture the execution dependencies, thereby determining which tasks should be performed after completion of a given task. Task nodes may also represent business events that guide the execution of a process, e.g., by delaying the execution until an external trigger has been observed. Decisions, in turn, are commonly represented by dedicated control nodes that define branching conditions for process execution.
Beyond these essential concepts, process models may capture additional perspectives of a process. They may model the data handled during processing, e.g., by including nodes called data objects that denote inputs and outputs of tasks. Additionally, business process models may include a resource perspective, which captures the roles of involved workers or the types of supporting information systems. In the above example of handling student loan requests, a process model may define the actual application as a data object that is used as input for the activity of checking its plausibility. In addition, the model could specify which specific service provided by which system implements this check.
There are many visual languages that can be used for business process modeling. As an example, Fig. 1 shows a model of the aforementioned student loan process captured in a language called BPMN (Business Process Model and Notation; see also http://www.bpmn.org/). An instance of this process, a specific case, is triggered by the receipt of a student loan application, which is visualized by the leftmost node in the model that represents the respective business event. Subsequently, the application is checked for plausibility, functionality that is provided by an external Web service. Based on the result of this check, i.e., whether or not the application is plausible, a decision is taken. The latter is represented by a control node (the diamond labeled with "x"). If the application is not plausible, missing details are requested from the applying student, and the plausibility check is conducted again once these details have been received. The default case of a plausible application (flow arc with a cross bar) leads to two activities being conducted by one or more workers in parallel, i.e., checking the income source(s) and the credit history. This concurrent execution and synchronization are indicated by a pair of control nodes (diamonds labeled with "+"). Subsequently, the application is manually assessed, and a decision is taken whether to prepare and send an offer or to reject the application.
Business Process Analytics, Fig. 1 A business process model in BPMN that captures the process of applying for a student loan
Business Process Analytics Techniques
Business process analytics comprises a collection of techniques to investigate business processes, based on their representations in the form of event data and/or business process models. Figure 2 provides a general overview of these techniques and the context in which they are defined.
This context is given by a business process as it is supported by an enterprise system, the event data that is recorded during process execution (available as logs or streams), and the models that capture various aspects of a business process.
Business Process Analytics, Fig. 2 Overview of business process analytics: an enterprise system supports a business process, whose execution is recorded in event logs and event streams and which is modelled by process models; the techniques comprise process discovery, conformance checking, performance and deviance mining, prediction, and business process model analytics
In the remainder of this section, we discuss the major categories of business process analytics techniques, indicated in bold in Fig. 2. We further point to other entries in this encyclopedia, which introduce each of these techniques in more detail.
Event Data Management
The left-hand side of Fig. 2 illustrates that the starting point for business process analytics is the availability of event streams and/or event logs generated by the execution of business processes. These event logs can be represented in various ways, as will be discussed in the entry Business Process Event Logs and Visualization. The latter entry will also discuss a number of methods for visualizing an event log in order to get a bird's-eye view of the underlying business process.
In their raw form, event streams and event logs suffer from various data quality issues. For example, an event log might be incomplete: some events might be missing altogether, while some events might be missing some attributes (e.g., the time stamp). Due to recording errors, the data carried by some of the events might be incorrect or inconsistent. Also, different events might be defined at different levels of abstraction. For example, some events might correspond to business-level events (e.g., the acceptance of a loan request), while others might be system-level events (e.g., a system-level fault generated by a Web service call to check the plausibility of the data provided as part of a loan request). These data quality issues will also be discussed in the entry on Business Process Event Logs and Visualization, and various techniques to detect and handle them will be presented in the entry on Event Log Cleaning for Business Process Analytics.
Process Discovery and Conformance Checking
Assuming that we have a suitably cleaned event log, a first family of business process analytics techniques allows us to either discover a process model from the event stream/log or to compare a given event stream/log to a given process model. These two types of techniques are discussed in the entries on Automated Process Discovery and Conformance Checking, respectively (cf. Fig. 2). Automated process discovery techniques take as input an event log and produce a business process model that closely matches the behavior observed in the event log or implied by the traces in the event log. Meanwhile, conformance checking techniques take as input an event log and a process model and produce as output a list of differences between the process model and the event log. For example, the process model might tell us that after checking the plausibility of a loan application, we must check the credit history. But maybe in the event log, sometimes after the plausibility check, we do not see that the credit history had been assessed. This might be because of an error or, more often than not, because of an exception that is not captured in the process model. Some conformance checking techniques take as input an event log and a set of business rules instead of a process model. These techniques check if the event log fulfills the business rules and, if not, they show the violations of these rules.
When comparing an event log and a process model (using conformance checking techniques), various discrepancies might be found between the model and the log. Sometimes these discrepancies are due to data quality issues in the event log, in which case they can be corrected by adjusting the event log, but sometimes it may turn out that the process model is incorrect or incomplete. In this latter case, it might be necessary to adjust (or repair) the process model in order to better fit the event log. Techniques for repairing a process model with respect to an event log will be discussed in the entry on Process Model Repair.
When applied to complex event logs, the output of automated process discovery techniques might be difficult to understand. Oftentimes, this is because the event log captures several variants of the business process rather than just one. For example, an event log of a loan request process might contain requests for various different types of loan and from different types of applicants. If we try to discover a single process model for all these variants, the resulting process model might contain too many execution paths, which makes it difficult to grasp at once. One approach to deal with this complexity is to split the traces in the event log into multiple sub-logs and to discover one process model for each of the sub-logs. Intuitively, each of these sub-logs contains a set of similar traces and hence represents one possible variant of the process. Techniques for splitting an event log into sub-logs are presented in the entry on Trace Clustering.
Trace clustering is only one of several alternative techniques to handle complex event logs. Another approach is to use process modeling notations that are well adapted for capturing processes with high levels of variability. Such approaches are discussed in the entry Declarative Process Mining. Yet another approach is to discover process models with a modular structure. Such approaches are discussed in the entries on Artifact-Centric Process Mining and on Hierarchical Process Discovery.
Sometimes, event logs are not only complex but also excessively large, to the extent that automated process discovery and conformance checking techniques do not scale up. Divide-and-conquer approaches to deal with such scalability issues are discussed in the entry on Decomposed Process Discovery and Conformance Checking.
Finally, closely related to the problem of discovering process models from event logs is that of discovering the business rules that are used in the decision points of the process model. A dedicated entry on Decision Discovery in Business Processes deals specifically with this problem.
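The conformance-checking idea sketched above (a trace that skips the credit history check after the plausibility check) can be illustrated with a deliberately crude sketch. Real conformance checking techniques (e.g., alignment-based ones) are far more sophisticated; here the "model" is approximated by its allowed directly-follows transitions, and all activity names are illustrative:

```python
# Toy "model": the directly-follows transitions the process model allows.
# This is a simplification for illustration, not a full process model.
MODEL_FOLLOWS = {
    ("Check Plausibility", "Check Credit History"),
    ("Check Credit History", "Assess Application"),
    ("Assess Application", "Prepare Offer"),
}

def deviations(trace):
    """Return the directly-follows transitions observed in the trace
    that the model does not permit."""
    pairs = zip(trace, trace[1:])
    return [p for p in pairs if p not in MODEL_FOLLOWS]

ok = ["Check Plausibility", "Check Credit History", "Assess Application"]
bad = ["Check Plausibility", "Assess Application"]  # skips the credit check
print(deviations(ok))   # []
print(deviations(bad))  # [('Check Plausibility', 'Assess Application')]
```

A deviation reported by such a check may indicate an error, an exception not captured in the model, or, as discussed above, a data quality issue in the log itself.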
Performance and Deviance Analysis
Another core set of business process analytics techniques focuses on business process performance and deviance analysis (cf. Fig. 2). A prerequisite for analyzing process performance is to have a well-defined set of process performance measures. A dedicated entry in the encyclopedia on Business Process Performance Measurement deals specifically with this question.
Once we have defined a set of performance measures, it becomes possible to analyze the event log from multiple dimensions (e.g., on a per-variant basis, on a per-state basis, etc.) and to aggregate the performance of the process with respect to the selected performance measure. A framework for analyzing the performance of business processes with respect to multiple dimensions using so-called process cubes is presented in the entry on Multidimensional Process Analytics.
Waiting time is a prominent performance measure in the context of business processes, as it plays a role in the perceived quality of a process and is related to the amount of resources available in a process and hence to the cost of the process. Low waiting times are associated with a leaner and more responsive process. However, low waiting times are not easily achievable when demand is high and when the amount of available resources is limited. When the demand for resources exceeds the available capacity, queues build up, and these queues give rise to waiting times. Understanding the buildup of queues is a recurrent question in business process analytics. Techniques for analyzing queues on the basis of event logs are discussed in the entry on Queue Mining.
Naturally, the performance of the process can vary significantly across different groups of cases. For example, it may be that the cycle time of 90% of the cases of a process is below a given acceptable threshold (e.g., 5 days). However, in some rare cases, the cycle time might considerably exceed this threshold (e.g., 30+ days). Deviance mining techniques help us to diagnose the reasons why a subset of the executions of a business process deviates with respect to the expected outcome or with respect to established performance objectives. Deviance mining techniques are discussed in a dedicated entry of this encyclopedia's section.
Online and Predictive Process Analytics
All the techniques for business process analytics mentioned above are primarily designed to analyze the current "as-is" state of a process in a postmortem setting.
Equally important is the detection of deviations with respect to conformance requirements in real time, as the process unfolds. The entry on Streaming Process Discovery and Conformance Checking deals with the question of how automated process discovery and conformance checking can be performed in real time based on a stream of events (as opposed to doing so over an event log collected during a period of time). Such techniques allow us, for example, to detect deviations with respect to the normative behavior (captured by a process model) in real time, thereby enabling immediate implementation of corrective actions.
Sometimes, though, it is desirable to preempt undesired situations, such as violations of performance objectives, before these situations happen. This can be achieved by building predictive models from historical event logs and then using these predictive models to make predictions (in real time, based on an event stream) about the future state of running cases of the process. Various techniques to build and validate such predictive models are discussed in the entry on Predictive Business Process Monitoring.
When the performance of a business process does not fulfill the objectives of the organization, it becomes necessary to adjust it, for example, by redeploying resources from some tasks to other tasks of the process, by eliminating or automating certain tasks, or by changing the sequence of execution of the tasks in the process. Business process simulation techniques allow us to evaluate the effects of such changes on the expected performance of the process. In a nutshell, business process simulation means that we take a business process model, enhance it with
simulation parameters (e.g., the rate of creation of process instances, the processing times of tasks in the process, the number of available resources, etc.) and execute it in a simulated environment a large number of times, in order to determine the expected cycle time, cost, or other performance measures. Traditionally, simulation parameters are determined manually by domain experts, based on a sample of cases of a process or by observing the execution of the process during a certain period of time. However, the availability of event logs makes it possible to build more accurate simulation models by extracting the simulation parameters directly from an event log. The question of how to use event logs for process simulation is discussed in the entry on Data-Driven Process Simulation.
Business Process Model Analytics
Business process analytics is not confined to the analysis of event logs. It also encompasses techniques for analyzing collections of process models, irrespective of the availability of data about the execution of a process (Beheshti et al. 2016). We call the latter family of techniques business process model analytics techniques (cf. right-hand side of Fig. 2).
The entries on Business Process Model Matching and Business Process Querying deal with two families of business process model analytics techniques. The first of these entries deals with the problem of comparing two or more process models in order to characterize their commonalities (or similarities) and their differences. This is a basic operation that allows us, for example, to merge multiple process models into a single one (e.g., to consolidate multiple variants of a process). It also allows us to find process models in a repository that are similar to a given one, which is also known as similarity search. However, business process querying provides a much broader view on how to manage process models. That includes languages to define queries related to particular aspects and properties of processes, along with efficient evaluation algorithms for such queries. As such, business process querying provides concepts and techniques to make effective use of large collections of process models.
Cross-References
Artifact-Centric Process Mining
Automated Process Discovery
Business Process Deviance Mining
Business Process Event Logs and Visualization
Business Process Model Matching
Business Process Performance Measurement
Business Process Querying
Conformance Checking
Data-Driven Process Simulation
Decision Discovery in Business Processes
Declarative Process Mining
Decomposed Process Discovery and Conformance Checking
Event Log Cleaning for Business Process Analytics
Hierarchical Process Discovery
Multidimensional Process Analytics
Predictive Business Process Monitoring
Process Model Repair
Queue Mining
Streaming Process Discovery and Conformance Checking
Trace Clustering
References
Beheshti SMR, Benatallah B, Sakr S, Grigori D, Nezhad H, Barukh M, Gater A, Ryu S (2016) Process analytics – concepts and techniques for querying and analyzing process data. Springer, Cham
Dumas M, La Rosa M, Mendling J, Reijers HA (2018) Fundamentals of business process management, 2nd edn. Springer, Berlin
van der Aalst WMP (2016) Process mining – data science in action, 2nd edn. Springer, Berlin, Heidelberg
Business Process Anomaly Detection
Business Process Deviance Mining
As to the second aspect (namely, available information), two deviance mining settings can be considered, which differ in the presence of auxiliary information in addition to the given log traces:
• Supervised: All the traces are annotated with some deviance label/score, which allows them to be regarded as examples for learning a deviance detector/classifier or for extracting deviance patterns.
• Unsupervised: No a priori deviance-oriented information is given for the traces, so that an unsupervised deviance mining method needs to be devised, based on the assumption that all/most deviances look anomalous with respect to the mainstream process behavior.
The rest of this chapter focuses on ex post deviance mining approaches. In fact, the exploitation of supervised learning methods to detect deviances at run-time has been considered in the related field of predictive business process monitoring.
More specifically, the two following sections discuss supervised and unsupervised deviance mining approaches, respectively. The last section provides some concluding remarks and draws a few directions for future research in this field. All the approaches discussed are summarized in Table 1.
Supervised Deviance Mining
The great majority of supervised deviance mining approaches assume that each log trace is associated with a binary class label telling whether the trace was deemed deviant or not and try to induce a deviance-oriented classification model for discriminating between these two classes (i.e., deviant vs. normal traces). These approaches are named classification-based and are described in section "Classification-Based Deviance Mining". A few non-classification-based approaches to the discovery of deviance-explanation patterns from deviance-annotated traces are then presented in section "Other Supervised Deviance Explanation Approaches".
Classification-Based Deviance Mining
Based on a given collection of example traces, each labeled as either "deviant" or "normal," these approaches try to induce a deviance-oriented classification model, capable of detecting whether any new (unlabeled) trace is deviant or not. Moreover, if the model has (or can be easily turned into) a human-readable form (e.g., classification rules), it can be used to explain the given deviances. In fact, most classification-based approaches have tried to pursue both goals (i.e., detection and explanation) by inducing rule-/tree-based classifiers.
Three main issues make the induction of a deviance-oriented classification model challenging: (i) process traces are complex sequential data that must be brought to some suitable abstraction level, in order to possibly capture significant discriminative patterns; (ii) obtaining a sufficiently informative training set is difficult, since manually labeling the traces is often difficult/costly and most log traces remain unlabeled; and (iii) since, in many scenarios, deviant behaviors occur more rarely than normal ones, the training set is often imbalanced (Japkowicz and Stephen 2002).
To cope with the first issue, and to reuse standard classifier-induction methods, most approaches adopt a two-phase mining scheme: (1) log encoding, where the example traces are turned into vectors/tuples in some suitable feature space; and (2) classifier induction, where the transformed traces (each equipped with a class label) are made to undergo a propositional classifier-induction algorithm.
In addition to possibly considering global case attributes, most of the encoding schemes adopted in the literature rely on extracting behavioral patterns from the traces and using these patterns as features. As a preliminary step, each event e of a trace is usually abstracted into the associated activity (i.e., the activity executed/referenced in e), so that the trace is turned into an activity sequence.
The main kinds of patterns used in the literature are as follows (Nguyen et al. 2016):
• Individual activities (IA) (Suriadi et al. 2013): each activity a is regarded as a "singleton" pattern and turned into a numeric feature counting how frequently a occurs in the trace (this corresponds to using a "bag-of-activities" encoding).
• Activity sets (SET) (Swinnen et al. 2012): the patterns incorporate subsets of activities that frequently co-occur in the traces (each regarded as an event set, thus abstracting from the execution order).
• Sequence patterns: the patterns represent frequent/relevant sequences of activities. The following families of sequence patterns have been considered: maximal repeats (MR), tandem repeats (TR), super maximal repeats (SMR), near super maximal repeats (NSMR), the set-oriented ("alphabet"-like) versions of the former (i.e., AMR, ATR, etc.) (Bose and van der Aalst 2013), and (closed unique) iterative patterns (IP) (Lo et al. 2009) (possibly covering non-contiguous activity subsequences).
In Suriadi et al. (2013), an IA-based log encoding is used to train a decision tree for discriminating and explaining deviances defined as traces with excessively long durations. The discovered rules are then exploited to analyze how the presence of single or multiple activities impacted on deviant behaviors.
A semiautomatic deviation analysis method was proposed in Swinnen et al. (2012), which combines process mining and association rule mining methods and addresses a scenario where the deviances are cases that deviate from the mainstream behavior, captured by a process model discovered with automated process discovery techniques. An association-based classifier is used to characterize deviant cases, based on a set-oriented representation of the traces (which keeps information on both activities and nonstructural case properties).
Bose and van der Aalst (2013) face the discovery of a binary classification model in a scenario where deviances correspond to fraudulent/faulty cases. The traces are encoded propositionally using one among the following pattern types: TR, MR, SMR, NSMR, the respective alphabet versions of the former types, and IA patterns. For the sake of explanation, only association-based classifiers and decision trees are employed. To deal with unlabeled traces, the authors envisaged the possibility of labeling them automatically in a preprocessing step (with one-class SVM and nearest-neighbor methods).
Lo et al. (2009) focus on the case where the deviances are those traces affected by a failure, among those generated by a software process, and introduce the notion of closed unique iterative patterns (IPs) as encoding features. These patterns are extracted through an a priori-like scheme and then selected based on their discriminative power (according to Fisher's score, Duda et al. 1973). To mitigate the effect of imbalanced classes, a simple oversampling mechanism is exploited.
Nguyen et al. (2014, 2016) define a general scheme unifying previous pattern-based deviance mining works, which consists of the following steps (to be precise, steps 1 and 3 are only considered in Nguyen et al. 2016): (1) oversampling (to deal with class imbalance), (2) pattern mining, (3) pattern selection (based on Fisher's score), (4) pattern-based log encoding, and (5) classifier induction. As to the former three steps, the authors used one among the pattern types described above (namely, IA, SET, TR, MR, ATR, AMR, and IP) in isolation. In addition, a few heterogeneous encodings are explored in Nguyen et al. (2014), mixing IA patterns with one of the other composite patterns (i.e., SET, TR, MR, ATR, AMR, and IP). Three standard induction methods are used to implement the last step: the decision-tree learning algorithm J48, a nearest-neighbor procedure, and a neural-network classifier. Different instantiations of this scheme (obtained by varying the kind of patterns and the induction method) are tested in Nguyen et al. (2016) on several real datasets.
To assess the explanatory power of the models discovered, the authors evaluate the rulesets obtained by applying J48 to each of the pattern-based encodings. Rulesets discovered with composite pattern types (especially IP,
AMR, and SET) looked more interesting than IA-based ones. As to the capability to correctly classify new test traces, composite patterns do not seem to lead to a substantial gain when the process analyzed is lowly structured and/or the deviance is defined in terms of case duration. Composite patterns conversely looked beneficial on the logs of structured processes, but there is no evidence for bestowing a pattern type as the winner.
In fact, the predictiveness of different pattern types may well depend on the application domain. In principle, even on the same dataset, a pattern type might be better at detecting a certain kind of deviance than another type and vice versa. On the other hand, even selecting the classifier-induction method is not that simple, if one relaxes the requirement that the classification model must be self-explaining. Indeed, higher-accuracy results can be achieved by using more powerful classifiers than those considered in the approaches described above. These observations explain the recent surge of multi-encoding hybrid methods, like those described next.
An ensemble-based deviance mining approach is proposed in Cuzzocrea et al. (2015, 2016b) that automatically discovers and combines different kinds of patterns (namely, the same as in Nguyen et al. 2014) and different classifier-induction algorithms. Specifically, different base classifiers are obtained by training different algorithms against different pattern-based encodings of the log. The discovered base classifiers are then combined through a (context-aware) stacking procedure. An oversampling mechanism is applied prior to pattern extraction, in order to deal with class imbalance.
The approach is parametric to the pattern types and learners employed and can be extended to incorporate other encoding/learning methods. In fact, a simplified version of this approach was combined in Cuzzocrea et al. (2016a) with a conceptual event clustering method (to automatically extract payload-aware event abstractions).
A simple semi-supervised learning scheme is used in Cuzzocrea et al. (2016b) to acquire some feedback from the analyst on the classifications made by the discovered model on new unlabeled traces, and eventually improve the model.
Other Supervised Deviance Explanation Approaches
Fani Sani et al. (2017) propose to extract explanations from labeled (deviant vs. normal) traces by applying classic subgroup discovery techniques (Atzmueller 2015), instead of rule-based classifier-induction methods, after encoding all the traces propositionally. As a result, a number of interesting logical patterns (namely, conjunctive propositional formulas) are obtained, each identifying a trace "subgroup" that looks unusual enough w.r.t. the entire log (in terms of class distribution).
Folino et al. (2017a) consider instead a setting where interesting self-explaining deviance scenarios are to be extracted from training traces associated with numerical deviance indicators (e.g., concerning performance/conformance/risk measures). The problem is stated as the discovery of a deviance-aware conceptual clustering, where each cluster needs to be defined via descriptive (not a priori known) classification rules – specifically, conjunctive formulas over (both structural and nonstructural) trace attributes and context factors.
Unsupervised Deviance Mining
Unsupervised deviance mining approaches can be divided into two main categories (Li and van der Aalst 2017), namely, model-based and clustering/similarity-based, which are discussed next in two separate subsections.
Model-Based Approaches
Since, in general, there may be no a priori process model that completely and precisely defines deviance/normality behaviors, model-based approaches rely on mining such a model from the given traces and then classifying as deviant all those traces that do not comply enough with the model. These approaches are described next in separate paragraphs, according to the kind of behavioral model adopted.
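This model-based idea can be sketched in a few lines. The sketch below is an illustrative simplification, not a published algorithm: the "model" is mined as the set of sufficiently frequent directly-follows transitions, and any trace using a transition outside that set is flagged as deviant (the threshold and the directly-follows encoding are both invented choices for illustration):

```python
from collections import Counter

def mine_and_flag(traces, min_support=0.2):
    """Mine a crude behavioral 'model' (frequent directly-follows
    transitions) from the traces themselves, then flag as deviant
    every trace that uses a transition outside the model."""
    counts = Counter(p for t in traces for p in zip(t, t[1:]))
    total = sum(counts.values())
    model = {p for p, c in counts.items() if c / total >= min_support}
    return [t for t in traces if any(p not in model for p in zip(t, t[1:]))]

traces = [
    ["a", "b", "c"], ["a", "b", "c"], ["a", "b", "c"],
    ["a", "c", "b"],  # rare ordering, flagged as deviant
]
print(mine_and_flag(traces))  # [['a', 'c', 'b']]
```

Note how the sketch already exhibits the weakness discussed below for earlier approaches: the deviant traces themselves contribute to the mined model, so outliers can distort it unless infrequent behavior is filtered out first.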
Approaches using a workflow-oriented model
Most model-based approaches employ a control-flow model, combined with conformance checking techniques (van der Aalst et al. 2005). Earlier approaches (e.g., Bezerra et al. 2009) simply mined the model by directly applying traditional control-flow-oriented techniques developed in the related field of automated process discovery. However, as the log is generally not guaranteed to contain only normal behavior, the effectiveness of such an approach suffers from the presence of outliers. To better deal with outliers, one could think of applying methods for automatically filtering out infrequent behavior, like the one proposed in Conforti et al. (2017). In particular, in Conforti et al. (2017), first an automaton is induced from the log, in order to approximately model the process behavior in terms of directly-follows dependencies; infrequent transitions are then removed from the automaton; in a final replay of the traces, any behavior that cannot be mapped on the automaton is deemed an outlying feature.
Bezerra and Wainer (2013) propose three ad hoc algorithms to detect deviant traces in a log L, named the "threshold," "iterative," and "sampling" algorithms, respectively. All the algorithms rely on comparing each trace t with a model Mt induced from a subset of L (namely, L \ {t} in the threshold and iterative algorithms and a random sample of L in the sampling algorithm). The anomalousness of t is estimated by either checking if t is an instance of Mt or considering the changes needed to allow Mt to represent t. The last algorithm achieved the best results.
All the works above focused on control-flow aspects only while disregarding other perspectives (e.g., timestamps and resources/executors). Some efforts have been spent recently to detect anomalies pertaining to such nonfunctional aspects.
In the two-phase multi-perspective approach of Böhmer and Rinderle-Ma (2016), a ("basic") likelihood graph is induced from the log and then extended with abstracted timestamps (extracted from log events), to model the execution of each activity in a more refined way. The resulting graph G is used to assess whether a new trace t is anomalous or not, possibly at run-time. Precisely, the likelihood of t w.r.t. G is compared to the lowest one of the (presumed normal) traces in the training log: if the former is lower than the latter, the trace is marked as anomalous/deviant.
The work in Rogge-Solti and Kasneci (2014) is meant to recognize temporal anomalies concerning activity durations – precisely, cases where a group of interdependent activities shows an anomalous run-time. To detect such anomalies, a stochastic Petri net (Rogge-Solti et al. 2013) is exploited, which can capture statistical information for both decision points (modeled as immediate transitions) and the duration of activities (modeled as timed transitions). Any new trace is confronted with this model (in terms of log-likelihood) to decide whether it is deviant or not.
Approaches using non-workflow-oriented models
Anomalies concerning the sequence of activities in a trace are detected in Linn and Werth (2016) by reusing variable-order Markov models: a sequence is anomalous if at least one activity of the sequence is predicted with a probability value lower than a certain threshold.
A one-class SVM strategy is exploited in Quan and Tian (2009), after converting the contents of the given log into vectorial data. This way, a hypersphere is determined for the training set, assuming that normal behaviors are much more frequent than deviant ones. Any (possibly unseen) data instance lying outside this hypersphere is eventually classified as deviant.
A multi-perspective approach has been proposed in Nolle et al. (2016), which hinges on the induction of an autoencoder-based neural network. The network is meant to play as a (black box) general-enough behavioral model for the
likelihood”) graph encoding the likelihood of a log, trained to reproduce each trace step by step.
transition between two activities is first induced If an input trace and the corresponding repro-
from the log, assuming that the latter only stores duced version are not similar enough (w.r.t. a
normal behaviors. This graph is then extended manually defined threshold), the former is classi-
by inserting nodes representing resources and ab- fied as deviant. The capacity of the auto-encoding
394 Business Process Deviance Mining
layers is kept limited, in order to make the net- partitioned into two sets: normal cases (CN) and
work robust to both outliers and noise. deviating cases (CD). Then, the norm function
is adjusted as to increase (resp., decrease) the
Similarity/Clustering-Based Approaches likelihood that normal (resp., deviating) cases
These approaches rely on the assumption that are sampled. These two steps (i.e., sampling and
normal traces occur in dense neighborhoods or profile-based classification) can be iterated, in
in big enough clusters, rather than on using be- order to improve the partitioning scheme, before
havioral models (for normal/deviant traces). Con- eventually returning the deviant cases.
sequently, a trace that is not similar enough to
a consistent number of other traces is deemed Clustering-based approaches As noticed
as deviant. This requires some suitable simi- in Li and van der Aalst (2017), model-based
larity/distance/density function and/or clustering approaches (especially those hinging on control-
algorithm to be defined for the traces – see the flow models) work well on regular and structured
Trace Clustering entry for details on this topic. processes but have problems in dealing with
less structured ones exhibiting a variety of
Similarity-based approaches In general, kNN- heterogenous behaviors.
based methods (where NN stands for nearest By contrast, clustering-based anomaly/outlier
neighbor) detect outliers by comparing each ob- detection approaches assume that normal
ject with its neighbors, based on a distance mea- instances tend to form large (and/or dense)
sure. In Hsu et al. (2017) the distance measure enough clusters, and anomalous instances are
is defined in terms of activity-level durations and those belonging to none of these clusters.
contextual information. More specifically, con- In Ghionna et al. (2008) and Folino et al.
textual information is represented in a fuzzy way, (2011), log traces are first clustered in a way
and associated membership functions are used to that the clusters reflect normal process execution
adjust activity-level durations. Distances are then variants, so that a case is labeled as deviant if
calculated using the differences between adjusted it belongs either to a very small cluster (w.r.t. a
durations, and a kNN scheme is eventually used given cardinality threshold) or to no cluster at all.
to identify anomalous process instances. To capture normal execution schemes, a special
A local outlier factor (LOF) method is adopted kind of frequent structural (i.e., control-flow ori-
instead in Kang et al. (2012) to spot process ented) patterns is discovered; the traces are then
instances with abnormal terminations. Notably, clustered based on their correlations with these
the method can be applied to future enactments patterns – specifically, the clustering procedure
of the process. follows a co-clustering scheme, focusing on the
A profile-based deviation detection approach associations between traces and patterns, rather
is proposed in Li and van der Aalst (2017), which than using a pattern-based encoding of the traces.
relies on deeming as deviant any case that is not To possibly unveil links between the discovered
similar enough to a set CS of mainstream cases structural clusters and nonstructural features of
in the log. Specifically, a preliminary version of the process, a decision tree is discovered (us-
CS is first sampled according to a (dynamically ing an ad hoc induction algorithm), which can
modified) “norm” function, biased toward “more be used to classify novel process instances at
normal” cases. Then, the similarity between each run-time. In principle, the clustering procedure
log case and CS is computed, by using some could be performed by using other methods (cf.
suitably defined “profile” function – in princi- Trace Clustering). However, as noticed in Folino
ple, this function can be defined as to capture et al. (2011), one should avoid algorithms, like
different perspectives, but the authors only show k-means, that are too sensitive to noise/outliers,
the application of two alternative control-flow- and may hence fail to recognize adequately the
oriented profile functions. As a result, the log is groups of normal instances.
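The generic kNN scheme underlying such similarity-based detectors can be sketched in a few lines of Python. This is an illustrative toy (bag-of-activities encoding and Euclidean distance are placeholder choices), not the distance measure of Hsu et al. (2017):

```python
from collections import Counter
from math import sqrt

def encode(trace):
    """Bag-of-activities encoding: activity -> occurrence count."""
    return Counter(trace)

def dist(u, v):
    """Euclidean distance between two activity-count vectors."""
    keys = set(u) | set(v)
    return sqrt(sum((u[k] - v[k]) ** 2 for k in keys))

def knn_scores(log, k=2):
    """Score each trace by the mean distance to its k nearest
    neighbors; higher scores suggest deviance."""
    vecs = [encode(t) for t in log]
    scores = []
    for i, u in enumerate(vecs):
        ds = sorted(dist(u, v) for j, v in enumerate(vecs) if j != i)
        scores.append(sum(ds[:k]) / k)
    return scores

log = [["a", "b", "c", "d"],
       ["a", "b", "c", "d"],
       ["a", "c", "b", "d"],
       ["a", "x", "x", "x", "x"]]  # last trace: unusual activities
scores = knn_scores(log, k=2)
print(max(range(len(log)), key=scores.__getitem__))  # → 3
```

Flagging then amounts to thresholding the scores, e.g., relative to the scores of the presumed-normal training traces.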
Hybrid solutions All the unsupervised approaches described so far focus on the detection of deviances, without paying attention to their explanation. Clearly, one could perform ex post explanation tasks by simply resorting to supervised classification techniques, after having annotated deviant and normal traces with different class labels, as done in Swinnen et al. (2012). For approaches returning a numerical anomaly score, one could also think of directly applying a rule-based regression method or a conceptual clustering method like that proposed in Folino et al. (2017a) (cf. section "Other Supervised Deviance Explanation Approaches").

A hybrid discovery approach (both model-based and clustering-based) was defined in Folino et al. (2017b), which synergistically pursues both detection and explanation tasks. Given a set of traces, each associated with stepwise performance measurements, a new kind of performance-aware process model is searched for, consisting of the following components: (i) a list of easily interpretable classification rules defining distinguished "deviance" scenarios for the process that generated the traces; (ii) a state-based performance model for each discovered scenario and one for the "normal" executions belonging to no deviance scenario. The discovery problem is stated as that of optimizing the model of normal process instances while ensuring that all the discovered performance models are both general and readable enough. The approach mainly exploits a conceptual clustering procedure (meant to greedily spot groups of traces that maximally deviate from the current normality model M0), embedded in a greedy optimization scheme (which iteratively tries to improve the normality model M0, initially learnt from the entire log).

Discussion and Directions of Future Research

For the sake of summarization and immediate comparison, some major features of the deviance mining techniques discussed so far are reported in Table 1.

Notably, these techniques can prove precious in many application scenarios, such as those concerning the detection/analysis of software systems' failures (Lo et al. 2009), frauds (Yang and Hwang 2006), network intrusions (Jalali and Baraani 2012), SLA violations, etc.

In general, in dynamic application scenarios, supervised techniques should be extended with incremental/continuous learning mechanisms, in order to deal with concept drift issues. In particular, in security-oriented scenarios where deviances correspond to attacks of ever-changing forms, the risk that the classifier induced with supervised techniques is obsolete/incomplete may be very high.

In such a situation, unsupervised deviance mining techniques could be a valuable solution, especially when the analyst is mainly interested in curbing the number of false negatives (i.e., undetected deviances). Among unsupervised deviance detection approaches, those based on clustering/similarity methods are likely to fit lowly structured business processes better (Li and van der Aalst 2017) than those relying on a (possibly extended) control-flow model.

The opportunity to use unlabeled data for classification-based deviance mining has been given little attention so far (Cuzzocrea et al. 2016b). Extending current approaches with semi-supervised learning methods (Wang and Zhou 2010), to exploit such information, is an interesting direction for future research. On the other hand, cost-sensitive learning methods (Elkan 2001) could help deal with imbalanced data more flexibly and allow the learner to pay more attention to one kind of misclassification (i.e., either false deviances or missed deviances).

Finally, it seems valuable to extend outlier-explanation methods (Angiulli et al. 2009) to extract readable explanations for the deviances found with unsupervised deviance mining techniques.
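The cost-sensitive idea can be made concrete with the standard expected-cost decision rule; a minimal sketch, assuming some classifier already supplies an estimated probability that a case is deviant (the cost values are hypothetical):

```python
def min_cost_label(p_deviant, cost_fp=1.0, cost_fn=10.0):
    """Pick the label minimizing expected misclassification cost.
    cost_fp: cost of flagging a normal case (false deviance);
    cost_fn: cost of missing a deviance.
    Expected cost of predicting 'deviant' = (1 - p) * cost_fp
    Expected cost of predicting 'normal'  = p * cost_fn
    """
    if (1 - p_deviant) * cost_fp < p_deviant * cost_fn:
        return "deviant"
    return "normal"

# A high miss cost lowers the effective decision threshold:
# with cost_fn = 10 * cost_fp, any p > 1/11 gets flagged.
print(min_cost_label(0.2))        # → deviant
print(min_cost_label(0.2, 1, 1))  # → normal (symmetric costs)
```

Raising cost_fn relative to cost_fp is one simple way to curb false negatives, as discussed above.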
Business Process Deviance Mining, Table 1 Summary of the approaches described here

Setting | Task | Strategy | Reference
Supervised (label-driven) | Detection, explanation | Classifier induction (IA patterns) | Suriadi et al. (2013)
Supervised (label-driven) | Detection, explanation | Classifier induction (SET patterns) | Swinnen et al. (2012)
Supervised (label-driven) | Detection, explanation | Classifier induction (non-IP sequence patterns) | Bose and van der Aalst (2013)
Supervised (label-driven) | Detection, explanation | Classifier induction (iterative patterns (IPs)) | Lo et al. (2009)
Supervised (label-driven) | Detection, explanation | Classifier induction (IA and sequence patterns) | Nguyen et al. (2014)
Supervised (label-driven) | Detection, explanation | Classifier induction (IA and sequence patterns) | Nguyen et al. (2016)
Supervised (label-driven) | Detection | Classifier ensemble induction (IA and sequence patterns) | Cuzzocrea et al. (2015)
Supervised (label-driven) | Detection | Classifier ensemble induction (IA and sequence patterns) | Cuzzocrea et al. (2016b)
Supervised (label-driven) | Detection | Classifier ensemble induction ("abstract" IA and sequence patterns) | Cuzzocrea et al. (2016a)
Supervised (label-driven) | Explanation | Subgroup discovery | Fani Sani et al. (2017)
Supervised (label-driven) | Explanation | Deviance-oriented conceptual clustering | Folino et al. (2017a)
Unsupervised (anomaly-driven) | Detection | Model-based (pure control-flow model) | Bezerra et al. (2009)
Unsupervised (anomaly-driven) | Detection | Model-based (pure control-flow model) | Bezerra and Wainer (2013)
Unsupervised (anomaly-driven) | Detection | Model-based (time-aware control-flow model) | Rogge-Solti and Kasneci (2014)
Unsupervised (anomaly-driven) | Detection | Model-based (multi-perspective process model) | Böhmer and Rinderle-Ma (2016)
Unsupervised (anomaly-driven) | Detection | Model-based (not process-oriented VOMM) | Linn and Werth (2016)
Unsupervised (anomaly-driven) | Detection | Model-based (not process-oriented 1-SVM) | Quan and Tian (2009)
Unsupervised (anomaly-driven) | Detection | Model-based (not process-oriented autoencoder) | Nolle et al. (2016)
Unsupervised (anomaly-driven) | Detection | Similarity-based (k-NN) | Hsu et al. (2017)
Unsupervised (anomaly-driven) | Detection | Similarity-based (LOF) | Kang et al. (2012)
Unsupervised (anomaly-driven) | Detection | Similarity-based ("profile-biased" sampling) | Li and van der Aalst (2017)
Unsupervised (anomaly-driven) | Detection | Clustering-based (pattern-aware co-clustering) | Ghionna et al. (2008)
Unsupervised (anomaly-driven) | Detection | Clustering-based (pattern-aware co-clustering) | Folino et al. (2011)
Unsupervised (anomaly-driven) | Detection, explanation | Hybrid solution (performance-aware process models + conceptual clustering) | Folino et al. (2017b)
is a minimum requirement that for each event in the log, we have three attributes: (i) in which case the event occurred (case identifier), (ii) which task the event refers to (herein called the event class), and (iii) when the event occurred (i.e., a timestamp). In practice, there may be other event attributes, such as the resource who performed a given task, the cost, or domain-specific data such as the loan amount in the case of a loan origination process.

Business process event logs such as the above one can be processed and analyzed using process mining tools (cf. entry on Business Process Analytics). These tools provide various approaches for visualizing event logs in order to understand their performance, in particular with respect to time-related performance measures.

In the rest of this entry, we discuss how event logs are usually represented (file formats), and we then discuss different approaches for visualizing event logs.

The XES Standard

The event log of Fig. 1 is captured as a list in a tabular format. Simple event logs such as this one are commonly represented as tables and stored in a comma-separated values (CSV) format. This representation is suitable when all the events in the log have the same set of attributes. However, in event logs where different events have different sets of attributes, a flat CSV file is a less suitable representation. For example, in an event log of a loan application process, some events may have customer-related attributes (e.g., age of the customer, income, etc.), while other events contain attributes about the loan application assessment (e.g., the credit risk score). If we represented this event log in a CSV file, we would have to use (null) values to denote the fact that some attributes are not relevant for a given event. Also, in CSV files, the types of the attributes are not explicitly represented, which can be error-prone, for example, when dealing with dates, spatial coordinates, etc.

A more versatile file format for storing and exchanging event logs is the extensible event stream (XES) format (http://www.xes-standard.org), standardized by the IEEE Task Force on Process Mining and published as IEEE 1849 (Acampora et al. 2017) (http://www.win.tue.nl/ieeetfpm). The majority of process mining tools can handle event logs in XES. The structure of an XES file is based on a data model, partially depicted in Fig. 2. An XES file represents an event log. It contains multiple traces, and each trace can contain multiple events. All of them can contain different attributes. An attribute has to be either a string, date, int, float, or Boolean element given as a key-value pair. Attributes have to refer to a global definition. There are two global elements in the XES file: one for defining trace attributes and the other for defining event attributes. Several classifiers can be defined in XES. A classifier maps one or more attributes of an event to a label that is used in the output of a process mining tool. In this way, for instance, events can be associated with tasks.

[Fig. 2 (excerpt of the XES data model): a log contains traces, and a trace contains events; log, trace, and event elements contain attributes, i.e., typed key-value pairs (string, date, int, float, or Boolean); the log defines trace-global and event-global attribute declarations.]

Figure 3 shows an extract of an XES log corresponding to the event log in Fig. 1. The first global element (scope = "trace") in this log tells us that every trace element should have a child element called "concept:name." This sub-element will be used to store the case identifier. In this example, there is one trace that has a value of 1 for this sub-element. At the level of individual events, there are three expected elements (i.e., a "global" element with scope = "event"): "concept:name," "time:timestamp," and "resource." The "concept:name" sub-element of an event will be used to store the name of the task to which the event refers. The "time:timestamp" and "resource" attributes give us the time of occurrence of the event and who performed it, respectively. The rest of the sample XES log represents one trace with two events. The first one refers to "check stock availability," which was completed by SYS1 on 30 July 2012 at 11:14. The second event captures "retrieve product from warehouse" conducted by Rick at 14:20.

The XES standard also supports an attribute "lifecycle:transition" that is meant to describe the transition of a task from one execution state to another. An extensive table of predefined transitions is included in the standard. The most prominent transitions are "assign" to indicate that a task has been assigned to a resource; "start" to record that work on the task has started; "complete" signalling that a task is done; "ate_abort" to indicate that a task was canceled, closed, or aborted; and "pi_abort" to indicate that a whole process instance is aborted.

400 Business Process Event Logs and Visualization
Business Process Event Logs and Visualization, Fig. 1 Example of an event log for the order-to-cash process

Challenges of Extracting Event Logs

Event data that is available in tabular format as in Fig. 1 can be readily converted to XES. In many cases though, the data which is relevant for event logs is not directly accessible in the required format but has to be extracted from different sources and integrated. This extraction and integration effort is generally not trivial. A list of 27 potential issues is described in Suriadi et al. (2017). We focus here on four major challenges for log data extraction:

Correlation Challenge
This refers to the problem of identifying the case an event belongs to. Many enterprise systems do not have an explicit notion of process defined. Therefore, we have to investigate which attribute of process-related entities might serve as a case identifier. Often, it is possible to utilize entity identifiers such as order number, invoice number,
or shipment number. Approaches to infer missing case identifiers are presented in Bayomie et al. (2015) and Pourmirza et al. (2017).

Timestamping Challenge
The challenge of working properly with timestamps stems from the fact that many systems do not consider logging as a primary task. This means that logging is often delayed until the system has idle time or little load. Therefore, we might find sequential events with the same timestamp in the log. This problem is worsened when logs from different systems, potentially operating in different time zones, have to be integrated. Partially, such problems can be resolved with domain knowledge, for example, when events are known to always occur in a specific order. An approach to repair logs with incomplete or inaccurate timestamps is presented in Rogge-Solti et al. (2013). For estimating the exact timestamp, we can use knowledge about the typical order of tasks and their typical temporal distance.

Longevity Challenge
For long-running processes (e.g., processes with cycle times in the order of weeks or months), we might not yet have observed all cases that started during a recent time period (e.g., cases that started in the past 6 months) up to their completion. In other words, some cases might still be incomplete. Mixing these incomplete cases with completed cases can lead to a distorted picture of the process. For example, the cycle times or other performance measures calculated over incomplete cases should not be mixed with those of the completed ones, as they have different meanings. In order to avoid such distortions, we should consider the option of excluding incomplete cases from an event log. Also, we should ensure that the time period covered by an event log is significantly longer than the mean cycle time of the process so that the event log contains samples of both short-lived and long-lived cases.

Granularity Challenge
Typically, we are interested in conducting event log analysis on a conceptual level for which we have process models defined. In general, the granularity of event log recording is much finer, such that each task of a process model might map to a set of events. For example, a task like "retrieve
product from warehouse” on the abstraction level erties can be rewritten in a simplified repre-
of a process model maps to a series of events sentation, called a workflow log. A workflow
like “work item #1,211 assigned," “work item log is a set of distinct traces, such that each
#1,211 started," “purchase order form opened,” trace consists of a sequence of symbols and each
“product retrieved,” and “work item #1,211 com- symbol represents an execution of a task. For
pleted.” Often, fine-granular events may show example, given a set of tasks represented by
up repeatedly in the logs, while on an abstract symbols fa; b; c; d g, a possible workflow log is
level, only a single task is executed. Therefore, it fha; b; c; d i; ha; c; b; d i; ha; c; d ig.
is difficult to define a precise mapping between
the two levels of abstraction. The problem of
abstracting from low-level event data to activi- Event Log Visualizations
ties in the business process model is proposed
in Baier et al. (2014) by annotating, matching, Visualization techniques for business process
and clustering. Other related techniques aim at event logs can be broadly classified into
discovering repetitive patterns of low-level events two categories: time-based and graph-based
such that every occurrence of a discovered pattern techniques. Time-based techniques rely on charts
can be abstracted as one occurrence of a higher- where 1 on the axis (typically the horizontal axis)
level event (Mannhardt and Tax 2017). is used to denote the time. The vertical axis is
Once we have overcome the above data quality then divided into stripes, each of which is used to
challenges, we are able to produce an event log represent the events observed in one trace of the
consisting of traces, such that each trace con- log.
tains the perfectly ordered sequence of events Graph-based techniques on the other hand
observed for a given case, each event refers to show relations between tasks, resources, or pos-
a task, and every task in the process is included sibly other objects involved in the process. Each
in the event log. An event log with these prop- node in the the graph represents a task, resource,
Business Process Event Logs and Visualization, Fig. 4 Dotted chart of an event log
[Fig. 5 sketch: a timeline chart over a time axis from 0 to 9, with bars for the tasks of case 1 (a, b, c, d) and case 2 (a, g, h, i); each bar distinguishes waiting time from processing time.]
Business Process Event Logs and Visualization, Fig. 5 Example of timeline chart
or other object, while the edges represent control-flow relations (e.g., precedence).

Below are two representative time-based techniques, namely, dotted and timeline charts, and two graph-based techniques: dependency graphs and handoff graphs.

Dotted Charts
In a dotted chart (Song and van der Aalst 2007), each event is plotted on a two-dimensional canvas, with the first axis representing its occurrence in time and the second axis its association with a classifier like a case ID. There are different options to organize the first axis. Time can be represented either as relative, such that the first event is counted as zero, or as absolute, such that cases with a later start event are placed further right in comparison to cases that started earlier. The second axis can be sorted according to different criteria. For instance, cases can be shown according to their historical order or their overall cycle time.

Figure 4 shows the dotted chart of the event log of a healthcare process. The events are plotted according to their relative time and sorted according to their overall cycle time. It can be seen that there is considerable variation in terms of cycle time. Furthermore, the chart suggests that there might be three distinct classes of cases: those that take no more than 60 days, those taking 60 to 210 days, and a small class of cases taking longer than 210 days.

Timeline Charts
If the event log contains timestamps capturing both the start and the end of each task, we can plot tasks not as dots but as bars, as shown in Fig. 5. This type of visualization is called a timeline chart. A timeline chart shows the waiting time (from activation until starting) and the processing time (from starting until completion) for each task.
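The quantities shown in a timeline chart follow directly from lifecycle timestamps; a minimal sketch, assuming each task of a case carries "assign", "start", and "complete" timestamps (the tuple encoding is an assumption, not a prescribed format):

```python
def timeline_bars(events):
    """Compute (waiting, processing) durations per task of one case.
    Each event is (task, transition, timestamp); timestamps are numbers."""
    stamps = {}  # task -> {transition: timestamp}
    for task, transition, ts in events:
        stamps.setdefault(task, {})[transition] = ts
    bars = {}
    for task, s in stamps.items():
        bars[task] = (s["start"] - s["assign"],      # waiting time
                      s["complete"] - s["start"])    # processing time
    return bars

case = [("a", "assign", 0), ("a", "start", 1), ("a", "complete", 3),
        ("b", "assign", 3), ("b", "start", 5), ("b", "complete", 6)]
print(timeline_bars(case))  # → {'a': (1, 2), 'b': (2, 1)}
```

Each resulting pair is one bar of the chart, split at the start timestamp.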
Business Process Event Logs and Visualization, Fig. 7 Example of a full dependency graph and an abstracted
version thereof. (a) Full dependency graph. (b) Filtered dependency graph
logs. Accordingly, process mining tools also offer a second type of simplification operation called event log filtering. Filtering an event log means removing a subset of the traces, events, or pairs of events in order to obtain a simpler log. Note that filtering and abstraction have different inputs and outputs. Whereas filtering transforms an event log into a smaller log, abstraction only affects the dependency graph, not the log.

Dependency graphs can also be used to visualize temporal delays (e.g., waiting times or processing times) via color-coding or thickness-coding. Concretely, arcs in the dependency graph may be color-coded by the corresponding temporal delay between the source event and the target event of the arc. This visualization allows us to identify the longest waiting times in the process – also known as bottlenecks. Meanwhile, the nodes in a dependency graph can be color-coded with the processing times of each task. The latter requires that each task captured in the log has both a start timestamp and an end timestamp so that the processing time can be computed.

Figure 8 shows the performance view produced by Disco when applied to the BPI Challenge 2017 log. The most infrequent arcs have been removed to avoid clutter. The task and the arc with the highest times are indicated in red. If we visually zoom in, in the tool, we are able to see that the task with the slowest processing time is "W_Assess Potential Fraud" (3.1 days) and the highest inter-event time is the one between "A_Complete" and "A_Canceled" (27.4 days).
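Annotating a dependency graph with frequencies and temporal delays amounts to one pass over the directly-follows pairs of each trace; a minimal sketch, with traces assumed to be given as (task, timestamp) sequences:

```python
from collections import defaultdict

def dependency_graph(traces):
    """For each directly-follows arc, record its frequency and mean delay.
    Each trace is a list of (task, timestamp) pairs in temporal order."""
    freq = defaultdict(int)
    total_delay = defaultdict(float)
    for trace in traces:
        for (a, ta), (b, tb) in zip(trace, trace[1:]):
            freq[(a, b)] += 1
            total_delay[(a, b)] += tb - ta
    return {arc: (n, total_delay[arc] / n) for arc, n in freq.items()}

traces = [[("a", 0), ("b", 4), ("c", 5)],
          [("a", 0), ("b", 2)]]
g = dependency_graph(traces)
print(g[("a", "b")])  # → (2, 3.0): seen twice, mean delay 3.0
```

A tool would then map the frequency to arc thickness and the mean delay to color, which is exactly how bottleneck arcs become visible.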
[Fig. 8 sketch: a dependency graph over the tasks A_Create Application, A_Submitted, W_Complete application, A_Complete, A_Accepted, A_Validating, O_Create Offer, and O_Created, with nodes and arcs annotated with processing times and inter-event delays ranging from a few milliseconds to about 24 hours.]
Business Process Event Logs and Visualization, Fig. 8 Performance view of the BPI Challenge 2017 event log in
Disco
which retain only the most frequent nodes and edges. Two alternative techniques for simplifying process maps are node clustering and graph reduction. In node clustering, a cluster of nodes is replaced by a single one. This approach is implemented in the Fuzzy Miner (Günther and van der Aalst 2007). In graph reduction techniques, the user designates a set of important nodes, and other nodes are then abstracted in such a way that the control-flow relations between the designated nodes remain visible (Daenen 2017). The groups of abstracted nodes are then replaced by special (thick) edges.

Handoff Graphs
Dependency graphs can also be used to analyze delays resulting from handoffs between process participants (van der Aalst and Song 2004; Pentland et al. 2017). When discussing the structure of event logs, we pointed out that an event record in an event log should have at least a case identifier, a timestamp, and an event class (i.e., a reference to a task in the process), but there can be other event attributes, too, such as the resource (process participant) who performed the task in question. The event class is the attribute that is displayed in each node of a dependency graph. When importing an event log into a tool that supports dependency graphs, the tool gives an option to specify which event attribute should be used as the event class. We can set the tool so as to use the resource attribute as the event class. If we do so, the resulting dependency graph will contain one node per resource (process participant), and there will be one arc between nodes A and B if resource B has performed a task immediately after resource A in at least one trace in the log. This map captures all the handoffs that have occurred in the process. We can then apply filters and abstraction operations to this map of handoffs. The handoff map can also be color-coded, with the colors corresponding either to the frequency of each handoff or the waiting time between handoffs.

Figure 9 shows a fragment of the average duration view produced by the myInvenio process mining tool when applied to an event log of a process for treating sepsis cases at a Dutch hospital (log available at: https://doi.org/10.4121/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460). The handoff map shows that when a case starts (rightmost node), it is first handled by resource A. After a few minutes or about an hour, resource A hands off to resource H or to resources D, F, . . . , R. Resource H hands off to E, while D, F, . . . , R hand over to B, who later hands off to E. Handoffs that take longer than one day are shown in red.

Pentland et al. (2017) propose an extension of dependency graphs (namely, narrative networks) for visualizing handoffs and identifying routines. The main idea is to show not only handoffs between process participants but also handoffs between systems (applications), for example, situations where a process participant needs to use multiple systems to perform a task.
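The handoff map described above is the same directly-follows construction keyed on resources instead of tasks; a minimal sketch (each trace is assumed to be reduced to the sequence of resources that performed its tasks):

```python
from collections import Counter

def handoff_map(traces):
    """Count handoffs: how often resource B directly follows resource A.
    Each trace is the sequence of resources that performed its tasks."""
    handoffs = Counter()
    for resources in traces:
        for a, b in zip(resources, resources[1:]):
            if a != b:  # a handoff only occurs when the resource changes
                handoffs[(a, b)] += 1
    return handoffs

traces = [["A", "H", "E"], ["A", "D", "B", "E"], ["A", "A", "H"]]
print(handoff_map(traces)[("A", "H")])  # → 2
```

Accumulating inter-event delays alongside the counts (as in the dependency-graph case) would yield the waiting-time coloring mentioned above.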
[Fig. 1 sketch: Company A's process includes the activities Check formal requirements, Check grades, Invite for eligibility assessment, Examine employment references, and Inform applicant about decision; Company B's includes Receive documents, Ask to attend aptitude test, Evaluate, Assess capabilities of applicant, Send letter of acceptance, and Send letter of rejection.]
Business Process Model Matching, Fig. 1 Two process models and possible correspondences
between two process models can become a complex and challenging task (Mendling et al. 2015). The main difference to other matching problems, such as schema and ontology matching, is the temporal dimension of process models. Two activities with the exact same label may refer to different streams of action in reality if they occur at different moments of time in the process.

Key Research Findings

The research in the area of business process model matching has focused on two main aspects: defining automatic techniques for process model matching and developing appropriate mechanisms to evaluate these techniques. This section discusses the key findings from both streams.

Process Model Matching Techniques
Current process model matching techniques exploit a variety of techniques to automatically identify activity correspondences. Most of them, however, leverage either the activity labels or the model structure of process models.

Leveraging Activity Labels
Activity labels capture the main semantics of process models. Hence, they are also the primary source for process model matching techniques to identify activity correspondences. However, the way in which the similarity between two activity labels is quantified differs considerably. There are two types of measures that can be employed: syntactic measures and semantic measures.

Syntactic measures relate to simple string comparisons and do not take the meaning or context of words into account. The most prominently employed syntactic measure is the Levenshtein distance. Given two labels l1 and l2, the Levenshtein distance counts the number of edit operations (i.e., insertions, deletions, and substitutions) that are required to transform l1 into l2. Although the Levenshtein distance is rather simplistic, it is used by a variety of matching techniques (cf. Dijkman et al. 2009; Liu et al. 2014; Belhoul et al. 2012). Many matching techniques, however, also rely on plain word comparisons (cf. Fengel 2014; Antunes et al. 2015; Cayoglu et al. 2013). Very common measures include the Jaccard and the Dice
coefficient, which both compute the similarity between two activity labels based on the number of shared words. An alternative approach based on word comparisons is the cosine similarity (cf. Weidlich et al. 2010; Makni et al. 2015). To compute the cosine similarity, activity labels are transformed into vectors, typically by weighing words with their frequency of occurrence. The cosine similarity is then derived from the cosine of the angle between the two activity vectors.

Semantic measures take the meaning of words into account. A very common strategy to do so is the identification of synonyms using the lexical database WordNet (Miller 1995). Typically, matching techniques check for synonyms as part of a preprocessing step and then apply other, often also syntactic, similarity measures (cf. Antunes et al. 2015). The most prominent semantic measure is the Lin similarity (Lin 1998). The Lin similarity is a method to compute the semantic relatedness of words based on their information content according to the WordNet taxonomy. To use Lin for measuring the similarity between two activities (which mostly contain more than a single word), matching techniques typically combine the Lin similarity with the bag-of-words model (see e.g., Klinkmüller et al. 2013; Leopold et al. 2012). The bag-of-words model transforms an activity into a multiset of words that ignores grammar and word order. The Lin similarity can then be obtained by computing the average of the word pairs from the two bags with the highest Lin score. Other measures based on the WordNet dictionary include the ones proposed by Wu and Palmer (1994) and Lesk (1986). The former computes the similarity between two words by considering the path length between the words in the WordNet taxonomy. The latter compares the WordNet dictionary definitions of the two words. Some approaches also directly check for hypernym relationships (i.e., more general words). For instance, Hake et al. (described in Antunes et al. 2015) consider car and vehicle as identical words because vehicle is a hypernym of car.

Leveraging Model Structure
Despite the predominant role of activity labels for identifying correspondences, many process model matching techniques also exploit the model structure. Typically, structural features of the model are used to guide the matching process based on the computed activities' similarities or to rule out unlikely constellations.

One of the most prominent strategies to take the model structure into account is the graph edit distance (cf. Dijkman et al. 2009; Gater et al. 2010). It applies the notion of string edit distance as used by the Levenshtein distance to graphs. In the context of process model matching, the graph edit distance is used to obtain a set of correspondences that is as similar as possible from a structural perspective. An alternative way of taking the structure into account is to leverage the notion of single-entry-single-exit fragments. As described by Vanhatalo et al. (2009), a process model can be decomposed into a hierarchy of fragments with a single entry and a single exit node. The resulting fragment hierarchy is useful both for identifying candidates for one-to-many relationships and for identifying correspondences that have similar depth in the model. Notable examples exploiting the fragment hierarchy include the ICoP framework from Weidlich et al. (2010) and the matching techniques from Gerth et al. (2010) and Branco et al. (2012).

Some approaches also use the structure of a process model to derive constraints. For instance, Leopold et al. (2012) derive behavioral constraints from the model structure that must not be violated in the final set of correspondences. That means if two activities a1 and a2 from model A are in a strict order relation and two activities b1 and b2 from model B are in a strict order relation as well, the final set of correspondences cannot include a correspondence between a1 and b2 and between a2 and b1. A similar notion is used in the matching approach from Meilicke et al. (2017).

Alternative Techniques
Most existing process model matching techniques combine the algorithms discussed above in a specific way. Some techniques, such as the ICoP framework (Weidlich et al. 2010), also allow the user to configure the matching technique by selecting the desired
algorithms. However, there are also techniques that consider alternative strategies to identify activity correspondences. More specifically, there are techniques building on user feedback, prediction, and matching technique ensembles.

Klinkmüller et al. (2014) proposed a semiautomated technique that incorporates user feedback to improve the matching performance. The main idea of their approach is to present the automatically computed correspondences to the user, let the user correct incorrect correspondences, and then, based on the feedback, adapt the matching algorithm. Weidlich et al. (2013) proposed a technique that aims at predicting the performance of a matching technique based on the similarity values it produces for a given matching problem. In this way, the approach can help to select the most suitable matching technique for the problem at hand. Meilicke et al. (2017) proposed an ensemble-based technique that runs a set of available matching systems and then selects the best correspondences based on a voting scheme and a number of behavioral constraints. This strategy has the advantage of combining the strengths of multiple matching techniques.

While all these techniques are able to improve the matching results in certain settings, they also come with their own limitations. Incorporating user feedback as proposed by Klinkmüller et al. (2014) means that the results can no longer be obtained in a fully automated fashion. The prediction technique from Weidlich et al. (2013) may simply conclude that available techniques are not suitable for computing correspondences. In the same way, the ensemble-matching technique from Meilicke et al. (2017) relies on the availability of matching techniques that can actually identify the complex relationships that may exist between two models.

Approaches for Evaluating Process Model Matching Techniques

The evaluation of process model matching techniques is a complex task. To illustrate this, reconsider the correspondences from Fig. 1. While one can argue that the activity Invite for eligibility assessment from Company A corresponds to the activity Ask to attend aptitude test because both activities have a similar goal, one can also bring up arguments against it. We could, for instance, argue that the eligibility assessment in the process of Company A is optional, while the aptitude test in the process of Company B is mandatory. What is more, an aptitude test appears to be way more specific than an eligibility assessment. This example illustrates that correspondences might actually be disputable and that clearly defining which correspondences are correct can represent a complex and challenging task. This is, for instance, illustrated by the gold standards of the Process Model Matching Contests 2013 and 2015. The organizers of the contests found that there was not a single model pair for which two independent annotators fully agreed on all correspondences (Cayoglu et al. 2013; Antunes et al. 2015).

The standard procedure for evaluating process model matching techniques builds on a binary gold standard. Such a gold standard is created by humans and clearly defines which correspondences are correct. By comparing the correspondences produced by the matching technique and the correspondences from the gold standard, it is then possible to compute the well-established metrics precision, recall, and F measure. At this point, the evaluation based on a binary gold standard is widely established and was also used by two Process Model Matching Contests. The advantage of using a binary gold standard is that the notion of precision, recall, and F measure is well-known and easy to understand. However, it also raises the question of why the performance of process model matching techniques is determined by referring to a single correct solution when not even human annotators can agree on what this correct solution is. A binary gold standard implies that any correspondence that is not part of the gold standard is incorrect and, thus, negatively affects the above-mentioned metrics.

Recognizing the limitations of a binary gold standard, Kuss et al. (2016) recently introduced
the notion of a non-binary gold standard. The overall rationale of a non-binary gold standard is to take the individual opinions of several people independently into account. Therefore, several people are asked to create a binary gold standard based on their own opinion. By considering each individual opinion as a vote for a given correspondence, it is then possible to compute a support value between 0.0 and 1.0 for each correspondence. Based on these support values, Kuss et al. (2016) proposed non-binary notions of the metrics precision, recall, and F measure that take the uncertainty of correspondences into account. While the resulting metrics are not as intuitive to interpret as their binary counterparts, they can be considered a fairer assessment of the quality of a matching technique. Following that argument, the results from the non-binary evaluation were recently presented next to the results of the binary evaluation in the Process Model Matching Track of the Ontology Alignment Evaluation Initiative 2016 (Achichi et al. 2016).

Examples of Application

The activity correspondences produced by process model matching techniques often serve as input for advanced process model analysis techniques. Examples include the analysis of model differences (Küster et al. 2006), the harmonization of process model variants (La Rosa et al. 2013), process model search (Jin et al. 2013), and the detection of process model clones (Uba et al. 2011). Many of these techniques require activity correspondences as input and, therefore, implicitly build on process model matching techniques. However, the application of process model matching techniques also represents an analysis by itself. The identified correspondences inform the user about similarities and dissimilarities between two models (or sets of models). From a practical perspective, process model matching techniques can, among others, provide valuable insights for identifying overlap or synergies in the context of company mergers or for analyzing the evolution of models.

Future Directions for Research

As indicated by the results of the two Process Model Matching Contests, the main challenge in the area of process model matching remains the rather moderate and unstable performance of the matching techniques. The best F measures in the Process Model Matching Contest 2015 ranged between 0.45 and 0.67. A recent meta-analysis by Jabeen et al. (2017) suggests that this can mainly be explained by the predominant choice of syntactic and simplistic semantic activity label similarity measures. These measures are simply not capable of identifying complex semantic relationships that exist between process models. These observations lead to three main directions for future research:

1. Identify better strategies to measure activity label similarity: There is a notable need for more intelligent strategies to measure the similarity between activity labels. Syntactic similarity measures are useful for recognizing correspondences with identical or very similar labels. The sole application of these measures, however, has been found to simply result in a high number of false positives (Jabeen et al. 2017). Also the current use of semantic technology is insufficient. Comparing activity labels by computing the semantic similarity between individual word pairs does not account for the cohesion and the complex relationships that exist between the words of activity labels. Here, a first step would be the proper consideration of compound nouns such as eligibility assessment. A more advanced step would be to also consider the relationship between nouns and verbs (Kim and Baldwin 2006). The fact that two activities use the same verb does not necessarily imply that they are related. Possible directions to account for these complex relationships are technologies such as distributional semantics (Bruni et al.
Euzenat J, Shvaiko P et al (2007) Ontology matching, vol 18. Springer, Berlin/Heidelberg
Fengel J (2014) Semantic technologies for aligning heterogeneous business process models. Bus Process Manag J 20(4):549–570
Gal A (2011) Uncertain schema matching. Synth Lect Data Manag 3(1):1–97
Gater A, Grigori D, Bouzeghoub M (2010) Complex mapping discovery for semantic process model alignment. In: Proceedings of the 12th international conference on information integration and web-based applications & services. ACM, pp 317–324
Gerth C, Luckey M, Küster J, Engels G (2010) Detection of semantically equivalent fragments for business process model change management. In: 2010 IEEE international conference on services computing (SCC). IEEE, pp 57–64
Jabeen F, Leopold H, Reijers HA (2017) How to make process model matching work better? An analysis of current similarity measures. In: Business information systems (BIS). Springer
Jin T, Wang J, La Rosa M, Ter Hofstede A, Wen L (2013) Efficient querying of large process model repositories. Comput Ind 64(1):41–49
Kim SN, Baldwin T (2006) Interpreting semantic relations in noun compounds via verb semantics. In: Proceedings of the COLING/ACL on main conference poster sessions. Association for Computational Linguistics, pp 491–498
Klinkmüller C, Weber I, Mendling J, Leopold H, Ludwig A (2013) Increasing recall of process model matching by improved activity label matching. In: Business process management. Springer, pp 211–218
Klinkmüller C, Leopold H, Weber I, Mendling J, Ludwig A (2014) Listen to me: improving process model matching through user feedback. In: Business process management. Springer, pp 84–100
Kuss E, Leopold H, Van der Aa H, Stuckenschmidt H, Reijers HA (2016) Probabilistic evaluation of process model matching techniques. In: Conceptual modeling: proceedings of the 35th international conference, ER 2016, Gifu, 14–17 Nov 2016. Springer, pp 279–292
Küster JM, Koehler J, Ryndina K (2006) Improving business process models with reference models in business-driven development. In: Business process management workshops. Springer, pp 35–44
La Rosa M, Dumas M, Uba R, Dijkman R (2013) Business process model merging: an approach to business process consolidation. ACM Trans Softw Eng Methodol (TOSEM) 22(2):11
Leopold H, Niepert M, Weidlich M, Mendling J, Dijkman R, Stuckenschmidt H (2012) Probabilistic optimization of semantic process model matching. In: Business process management. Springer, pp 319–334
Leopold H, Meilicke C, Fellmann M, Pittke F, Stuckenschmidt H, Mendling J (2015) Towards the automated annotation of process models. In: International conference on advanced information systems engineering. Springer, pp 401–416
Lesk M (1986) Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In: Proceedings of the 5th annual international conference on systems documentation. ACM, pp 24–26
Lin D (1998) An information-theoretic definition of similarity. In: ICML, vol 98, pp 296–304
Liu K, Yan Z, Wang Y, Wen L, Wang J (2014) Efficient syntactic process difference detection using flexible feature matching. In: Asia-Pacific conference on business process management. Springer, pp 103–116
Makni L, Haddar NZ, Ben-Abdallah H (2015) Business process model matching: an approach based on semantics and structure. In: 2015 12th international joint conference on e-business and telecommunications (ICETE), vol 2. IEEE, pp 64–71
Meilicke C, Leopold H, Kuss E, Stuckenschmidt H, Reijers HA (2017) Overcoming individual process model matcher weaknesses using ensemble matching. Decis Support Syst 100(Supplement C):15–26
Mendling J, Leopold H, Pittke F (2015) 25 challenges of semantic process modeling. Int J Inf Syst Softw Eng Big Co (IJISEBC) 1(1):78–94
Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41
Noy NF (2004) Semantic integration: a survey of ontology-based approaches. ACM Sigmod Rec 33(4):65–70
Rahm E, Bernstein PA (2001) A survey of approaches to automatic schema matching. VLDB J 10(4):334–350
Sagi T, Gal A (2013) Schema matching prediction with applications to data source discovery and dynamic ensembling. VLDB J 22(5):689–710
Shvaiko P, Euzenat J (2013) Ontology matching: state of the art and future challenges. IEEE Trans Knowl Data Eng 25(1):158–176
Spohr D, Hollink L, Cimiano P (2011) A machine learning approach to multilingual and cross-lingual ontology matching. In: The semantic web – ISWC 2011, pp 665–680
Uba R, Dumas M, García-Bañuelos L, La Rosa M (2011) Clone detection in repositories of business process models. In: Business process management. Springer, pp 248–264
Vanhatalo J, Völzer H, Koehler J (2009) The refined process structure tree. Data Knowl Eng 68(9):793–818
Weidlich M, Dijkman R, Mendling J (2010) The ICoP framework: identification of correspondences between process models. In: Advanced information systems engineering. Springer, pp 483–498
Weidlich M, Sagi T, Leopold H, Gal A, Mendling J (2013) Predicting the quality of process model matching. In: Business process management. Springer, pp 203–210
Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics. Association for Computational Linguistics, pp 133–138
416 Business Process Monitoring
human resources like responsible, accountable, or informed.

When defining PPIs with the just-introduced attributes, there is a set of requirements that need to be fulfilled. (1) A PPI must satisfy the SMART criteria, which stands for: specific, it has to be clear what the indicator exactly describes; measurable, it has to be possible to measure a current value and to compare it to the target one; achievable, its target must be challenging but achievable because it makes no sense to pursue a goal that will never be met; relevant, it must be aligned with a part of the organization's strategy, something that really affects its performance; and time-bounded, a PPI only has a meaning if it is known in the time period in which it is measured so that its evolution in time can be evaluated. (2) The PPIs themselves must be efficient, i.e., since the measurement itself requires human, financial, and physical resources, it must be worth the effort from a cost/benefit viewpoint. In addition, the number of PPIs to define must also be efficient, following the less-is-more rule, where a balance between information need and information overload has to be met. (3) PPIs must be defined in an understandable way, according to the vocabulary used in the specific domain and in a language comprehensible by the different stakeholders involved in the PPI management. (4) They must keep traceability to the business process they are defined for, to avoid consistency issues when, for instance, the business process evolves but PPI definitions are not consequently updated and become obsolete. (5) They must be amenable to automation in order to avoid the need for a redefinition in a computable language, which could lead to the introduction of errors in the redefinition process.

Types of PPIs
PPIs can be broadly classified into two categories, namely, lag and lead indicators, also known as outcomes and performance drivers, respectively. The former establishes a goal that the organization is trying to achieve and is usually linked to a strategic goal. For instance, the previously introduced PPI example for the order-to-cash process: its cycle time should be less than 20 working days in order to keep customer satisfaction. The problem of lag indicators is that they tell the organization whether the goal has been achieved, but they are not directly influenceable by the performers of the process. On the contrary, lead indicators have two main characteristics. First, they are predictive in the sense that if the lead indicators are achieved, then it is likely that the lag indicator is achieved as well. Second, they are influenceable by the performers of the process, meaning that they should refer to something that the performers of the process can actively do or not do (McChesney et al. 2012). For instance, if we think that one major issue that prevents fulfilling the aforementioned lag indicator is missing information from the customer when the order is created, double-checking the order with the customer within the first 24 h could be a lead indicator for that lag indicator. Furthermore, it is important to distinguish PPIs from other similar concepts like activity metrics or risk indicators. Activity metrics measure business activity related to PPIs but do not have targets and goals associated with them because it does not make sense by definition, for instance, "top ten customers by revenue." They provide additional context that helps business people to make informed decisions. Finally, risk indicators measure the possibility of future adverse impact. They provide an early sign to identify events that may harm the continuity of existing processes (Van Looy and Shafagatova 2016).

Evaluating PPIs
Once a system of PPIs has been defined taking into consideration all the aforementioned information, data sources, methods, and instruments need to be established in order to gather the required data to calculate their values. Gathering methods can be classified into two big groups: evidence-based methods and subjective methods. The former includes the analysis of existing documentation, the observation of the process itself, and the use of data recorded by existing information systems that support the processes. The latter includes interviews, questionnaires, or workshops, where domain experts are queried to obtain the required information.
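The notions above, a specific description, a measurable value compared against a target, a bounded time period, and the lag/lead classification, can be sketched as a small data structure. The field names and the fulfillment check are illustrative assumptions, not a notation from the PPM literature (such as PPINOT):

```python
from dataclasses import dataclass

@dataclass
class PPI:
    """A process performance indicator following the SMART criteria:
    specific description, measurable value, achievable target, time period."""
    description: str            # specific: what the indicator describes
    target: float               # achievable but challenging target value
    period: tuple               # time-bounded: (start, end) of measurement
    kind: str                   # "lag" (outcome) or "lead" (performance driver)
    lower_is_better: bool = True

    def fulfilled(self, measured: float) -> bool:
        # measurable: compare the current value to the target one
        if self.lower_is_better:
            return measured <= self.target
        return measured >= self.target

# The running example: the cycle time of the order-to-cash process
# should be less than 20 working days (a lag indicator).
cycle_time = PPI("average order-to-cash cycle time (working days)",
                 target=20, period=("2019-01", "2019-06"), kind="lag")
```

A lead indicator for the same goal, such as "orders double-checked with the customer within 24 h", would be modeled analogously with `lower_is_better=False`, since it is something the process performers can actively influence.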
418 Business Process Performance Measurement
Evidence-based methods are less influenced by the individuals involved in the measurement than subjective ones; however, they come with some weaknesses: people acting differently because they are being observed, the existence of inaccurate or misleading information, or the unavailability of important data. Therefore, sometimes it can be a good decision to use a combination of them.

Analysis with PPIs
The last part of PPM is the visualization and analysis of performance data. This visualization and analysis serve two purposes. First, monitoring the process performance to keep it under control and to be able to quickly learn whether the performance is moving in the right direction so that appropriate actions are taken. Second, diagnosing the process performance in order to be able to spot points of potential improvement. Consequently, the techniques used to visualize and analyze the process performance vary depending on the purpose.

Dashboards are mainly used for the first purpose. According to Few (2006), dashboards are visual displays of the most important information needed to achieve one or more objectives that fit on a single computer screen so it can be monitored at a glance. Therefore, in PPM, dashboards are used for presenting the PPIs and their degree of fulfilment. In addition, they can depict other activity metrics or historic information that give PPIs the context necessary to achieve the objectives for which the dashboard has been designed.

On the contrary, visual interfaces designed for data exploration and analysis are used for diagnosing the process performance. These interfaces offer all sorts of facilities to filter, group, or drill down data and to create different types of graphs that enable the comparison and analysis that are necessary to make sense of the performance data. Also, techniques such as statistical hypothesis testing or multivariate analysis and methods such as statistical process control can complement these visual interfaces to help identify causes of process performance variation.

Usually these two purposes go hand in hand. Ideally, visualization tools should have the ability to go from the dashboard to an environment for exploration and analysis to learn insights from data and take action.

Regardless of the purpose, visualization designers should exploit well-known knowledge on visual perception to create effective visualizations that serve the aforementioned purposes. This knowledge includes taking into account the limited capacity of people to process visual information: the use of color, form, position, and motion to encode data for rapid perception and the application of the Gestalt principles of visual perception that reveal the visual characteristics that incline us to group objects together (Few 2006). Furthermore, especially in the case of dashboards, simplicity is a key aspect to strive for, since dashboards need to squeeze a great amount of information into a small amount of space while preserving clarity.

Key Research Findings

Performance Measurement Models
One specific stream of work that is particularly relevant to PPM is that related to performance measurement frameworks or models. Without a doubt, one of the most widely recognized performance measurement models is the balanced scorecard proposed by Kaplan and Norton (1996). They introduced a four-dimensional approach: (1) financial perspective, (2) customer perspective, (3) internal business process perspective, and (4) learning and growth perspective. It is, however, only one of several proposals in this area. For example, Keagan et al. (1989) proposed their performance measurement matrix, which tries to integrate different business performance perspectives: financial and nonfinancial, and internal and external. Brignall et al. (1991) developed the results and determinants framework, based on the idea of the two types of PPIs: lag indicators, which relate to results (competitiveness, financial performance), and lead indicators, which relate to determinants (quality, flexibility, resource utilization, and innovation). The performance pyramid presented by Lynch and Cross (1991)
considers four levels, from top to bottom: (1) vision; (2) market and financial; (3) customer satisfaction, flexibility, and productivity; and (4) quality, delivery, cycle time, and waste. Adams and Neely (2002) used a performance prism with five facets to represent their proposal: stakeholder satisfaction, strategies, processes, capabilities, and stakeholder contribution. Another popular model is EFQM (2010), which distinguishes between enablers, such as leadership, people, strategy, or processes, among others, and results, including customer, people, society, and key results. So far, the presented models are focused on the organization in a hierarchical way, but there are some others that advocate following a more horizontal approach, i.e., focused on the business processes. Two well-known examples are the framework proposed by Kueng (2000) for conceptualizing and developing a process performance measurement system and the Devil's quadrangle by Brand and Kolk (1995), which distinguishes four performance dimensions: time, cost, quality, and flexibility. These four dimensions have been established as the typical perspectives of business process performance measurement (Dumas et al. 2013; Jansen-Vullers et al. 2008).

PPI Definition Approaches
There is another research stream that has gained great interest in PPM, mainly because the models mentioned above define different perspectives for which specific PPIs should be further defined but provide little guidance on how to define and operationalize them. Therefore, this research line is focused on the definition of PPIs, which has been addressed from different perspectives. From the formalization perspective, the work by Soffer and Wand (2005) proposes a set of formally defined concepts to enable the incorporation of measures and indicators (referred to as criterion functions for soft goals) into process modelling and design methods. Wetzstein et al. (2008), in their framework for business activity monitoring, describe a KPI ontology using the web service modelling language to specify PPIs over semantic business processes. Popova and Sharpanskykh (2010) formalize the definitions of PPIs as well as their relationships with goals and among themselves using different variants of order-sorted predicate logic, which also enables formal analysis and verification. Finally, del Río-Ortega et al. (2013) present a metamodel for the definition of PPIs and its translation to a formal language (description logic) that allows a set of design-time analysis operations.

Regarding the provision of more user-friendly approaches for the definition of PPIs, while maintaining a formal foundation that allows further reasoning on those PPI definitions, several works can be named. Castellanos et al. (2005) propose the use of templates provided by a graphical user interface to define business measures related to process instances, processes, resources, or the overall business operations, which can be subject to posterior intelligent analysis to understand causes of undesired values and predict future values. del Río-Ortega et al. (2016) introduce a template-based approach that uses linguistic patterns for the definition of PPIs and is based on the PPINOT metamodel, thus enabling their automated computation and analysis. Saldivar et al. (2016) present a spreadsheet-based approach for business process analysis tasks that include writing metrics and assertions, running performance analysis and verification tasks, and reporting on the outcomes. Finally, van der Aa et al. (2017) use hidden Markov models to transform unstructured natural language descriptions into measurable PPIs.

Lastly, several research efforts have focused on developing graphical approaches for the definition of PPIs. Korherr and List (2007) extend two graphical notations for business process modelling (BPMN and EPC) to define business process and performance measures related to cost, quality, and cycle time. Friedenstab et al. (2012) extend BPMN for business activity monitoring by proposing a metamodel and a set of graphical symbols for the definition of PPIs, actions to be taken after their evaluation, and dashboards. del Río-Ortega et al. (2017b) present a graphical notation and editor to model PPIs together with business processes, built on a set of principles for obtaining cognitively effective visual notations. The proposed notation is based on the PPINOT
420 Business Process Performance Measurement
metamodel and allows the automated computation of the PPI definitions.

PPI Analysis and Visualization Techniques
Concerning the visualization of process performance, researchers have focused on developing process-specific visual components that complement general-purpose graphs, such as bar graphs, bullet graphs, sparklines, or scatter plots, in order to compare different cohorts of business process instances and spot process performance variations. Some of these visualizations arrange and connect pieces of information following the process model structure. For instance, in Wynn et al. (2017), a three-dimensional visualization of the process model enriched with a variety of performance metrics is used to identify differences between cohorts. Other visualizations adapt general-purpose graphs to visualize elements relevant to processes. Low et al. (2017) adapt social network graphs and timeline visualization approaches such as Gantt charts to highlight resource and timing changes; Claes et al. (2015) adapt dotted charts to represent the occurrence of events in time; and Richter and Seidl (2017) use a diagram similar to Gantt charts to represent temporal drifts in an event stream.

This last paper is also an example of a technique that analyses the performance data in order to obtain insights about process performance, in this case, temporal drifts. Similarly, other analysis techniques have been proposed, such as in Hompes et al. (2017), in which the Granger causality test is used to discover causal factors that explain business process variation. Finally, process cubes can also be used during the analysis. Process cubes are based on the OLAP data cube concept. Each cell in the process cube corresponds to a set of events and can be used to compare and find process performance variation (Aalst 2013).

Examples of Application

Business process performance measurement is commonly applied for business process control and continuous optimization. Process-oriented organizations use PPM as a tool to quantify issues or weaknesses associated with their business processes, like bottlenecks, and to identify improvement points. Sometimes, these organizations are engaged in initiatives to adopt sets of good practices or frameworks, either general ones, like the aforementioned EFQM, or specific to their domains, like ITIL or SCOR. In these cases, organizations apply the proposed PPM approach according to the corresponding framework, where a series of performance measures to be defined are frequently proposed. The application of PPM can also be triggered by specific laws, related, for instance, to transparency or audits in the context, for example, of certifications. PPM can also help in deciding the viability of a business after a first short period of existence.

Future Directions for Research

While the definition of PPIs in the context of different performance measurement models has been widely analyzed and has produced a number of well-established approaches, it is recognized that these approaches are suitable for structured business processes whose expected behavior is known a priori. Their direct application is, however, limited in the context of processes with a high level of variability or knowledge-intensive processes (KIPs), which are frequently non-repeatable, collaboration-oriented, unpredictable, and, in many cases, driven by implicit knowledge from participants, derived from their capabilities and previous experiences. In that context, new proposals pursue the incorporation of the new concepts involved in this type of processes, like the explicit knowledge used in process activities, into the definition of performance measures and indicators. Other research approaches focus on measuring the productivity of knowledge workers or their interaction and communication in KIPs. Nevertheless, a general proposal for the definition of performance measures in KIPs that can be used independently of the context and that addresses their particularities described above is missing.
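To make the process-cube analysis mentioned earlier concrete, the sketch below groups events into cube cells keyed by attribute values and compares a per-cell statistic. This is a minimal illustration of the idea, not the implementation of Aalst (2013); the event attributes ("department", "year", "duration_h") are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy event log: each event is a mapping of attribute names to values.
events = [
    {"case": 1, "department": "sales", "year": 2017, "duration_h": 5.0},
    {"case": 2, "department": "sales", "year": 2018, "duration_h": 7.0},
    {"case": 3, "department": "support", "year": 2017, "duration_h": 2.0},
    {"case": 4, "department": "support", "year": 2018, "duration_h": 2.5},
]

def process_cube(events, dimensions):
    """Group events into cube cells keyed by the chosen dimensions."""
    cells = defaultdict(list)
    for e in events:
        key = tuple(e[d] for d in dimensions)
        cells[key].append(e)
    return dict(cells)

cube = process_cube(events, ("department", "year"))

# Compare a performance statistic (mean duration) across cells to spot variation.
cell_stats = {cell: mean(e["duration_h"] for e in es) for cell, es in cube.items()}
```

Slicing, dicing, rolling up, and drilling down then amount to choosing other dimension tuples or merging cells before recomputing the statistics.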
A research area that has recently been explored is the development of techniques to enable evidence-driven performance diagnosis and improvement. Traditionally, these have been based on the experience and intuition of process owners. However, the amount and nature of the data now produced by the information systems supporting business processes (e.g., event logs) make it possible to apply different statistical and machine learning techniques for that purpose. Some preliminary approaches include a method for the identification of proper thresholds for a lead PPI in order to support the achievement of an associated lag PPI (del Río-Ortega et al. 2017a) or the use of statistical hypothesis testing to find root causes or performance drifts (Hompes et al. 2017).

Cross-References

Big Data Analysis for Social Good
Business Process Analytics
Business Process Event Logs and Visualization
Business Process Model Matching
Business Process Querying
Predictive Business Process Monitoring
Queue Mining

References

Aalst WMPvd (2013) Process cubes: slicing, dicing, rolling up and drilling down event data for process mining. In: Asia Pacific business process management (AP-BPM), pp 1–22
Adams C, Neely A (2002) Prism reform. Financ Manag 5:28–31
Brand N, Kolk H (1995) Workflow analysis and design. Bedrijfswetenschappen: Kluwer, Deventer
Brignall T, Fitzgerald L, Johnston R, Silvestro R (1991) Performance measurement in service businesses. Manag Account 69(10):34–36
Castellanos M, Casati F, Shan MC, Dayal U (2005) iBOM: a platform for intelligent business operation management. In: Proceedings of the 21st international conference on data engineering, pp 1084–1095
Claes J, Vanderfeesten I, Pinggera J, Reijers HA, Weber B, Poels G (2015) A visual analysis of the process of process modeling. ISeB 13(1):147–190
del Río-Ortega A, García F, Resinas M, Weber E, Ruiz F, Ruiz-Cortés A (2017a) Enriching decision making with data-based thresholds of process-related KPIs. In: Proceedings of the 29th international conference on advanced information systems engineering (CAiSE), pp 193–209
del Río-Ortega A, Resinas M, Cabanillas C, Ruiz-Cortés A (2013) On the definition and design-time analysis of process performance indicators. Inf Syst 38(4):470–490
del Río-Ortega A, Resinas M, Durán A, Bernárdez B, Ruiz-Cortés A, Toro M (2017b) Visual PPINOT: a graphical notation for process performance indicators. Bus Inf Syst Eng. https://link.springer.com/article/10.1007/s12599-017-0483-3
del Río-Ortega A, Resinas M, Durán A, Ruiz-Cortés A (2016) Using templates and linguistic patterns to define process performance indicators. Enterp Inf Syst 10(2):159–192
Dumas M, La Rosa M, Mendling J, Reijers HA (2013) Fundamentals of business process management. Springer, Berlin/Heidelberg
EFQM (2010) EFQM excellence model. Available from: http://www.efqm.org/the-efqm-excellence-model
Few S (2006) Information dashboard design: the effective visual communication of data. O'Reilly Media, Inc., Beijing
Friedenstab JP, Janiesch C, Matzner M (2012) Extending BPMN for business activity monitoring. In: Proceedings of the 45th Hawaii international conference on system science (HICSS), pp 4158–4167
Hompes BFA, Maaradji A, Rosa ML, Dumas M, Buijs JCAM, Aalst WMPvd (2017) Discovering causal factors explaining business process performance variation. In: Advanced information systems engineering (CAiSE), pp 177–192
Jansen-Vullers MH, Kleingeld PAM, Netjes M (2008) Quantifying the performance of workflows. Inf Syst Manag 25(4):332–343
Kaplan RS, Norton DP (1996) The balanced scorecard: translating strategy into action. Harvard Business School Press, Boston
Keagan D, Eiler R, Jones C (1989) Are your performance measures obsolete? Manag Account 70(12):45–50
Korherr B, List B (2007) Extending the EPC and the BPMN with business process goals and performance measures. In: Proceedings of the 9th international conference on enterprise information systems (ICEIS), vol 3, pp 287–294
Kueng P (2000) Process performance measurement system: a tool to support process-based organizations. Total Qual Manag 11(1):67–85
Low WZ, van der Aalst WMP, ter Hofstede AHM, Wynn MT, De Weerdt J (2017) Change visualisation: analysing the resource and timing differences between two event logs. Inf Syst 65:106–123
Lynch R, Cross K (1991) Measure up!: the essential guide to measuring business performance. Mandarin, London
McChesney C, Covey S, Huling J (2012) The 4 disciplines of execution: achieving your wildly important goals. Simon and Schuster, London
Popova V, Sharpanskykh A (2010) Modeling organizational performance indicators. Inf Syst 35(4):505–527
Richter F, Seidl T (2017) TESSERACT: time-drifts in event streams using series of evolving rolling averages of completion times. In: Business process management (BPM), pp 289–305
Saldivar J, Vairetti C, Rodríguez C, Daniel F, Casati F, Alarcón R (2016) Analysis and improvement of business process models using spreadsheets. Inf Syst 57:1–19
Soffer P, Wand Y (2005) On the notion of soft-goals in business process modeling. Bus Process Manag J 11:663–679
van der Aa H, Leopold H, del Río-Ortega A, Resinas M, Reijers HA (2017) Transforming unstructured natural language descriptions into measurable process performance indicators using hidden Markov models. Inf Syst 71:27–39
Van Looy A, Shafagatova A (2016) Business process performance measurement: a structured literature review of indicators, measures and metrics. SpringerPlus 5(1797):1–24
Wetzstein B, Ma Z, Leymann F (2008) Towards measuring key performance indicators of semantic business processes. Bus Inf Syst 7:227–238
Wynn MT, Poppe E, Xu J, ter Hofstede AHM, Brown R, Pini A, van der Aalst WMP (2017) ProcessProfiler3d: a visualisation framework for log-based process performance comparison. Decis Support Syst 100:93–108

Business Process Querying

Artem Polyvyanyy
The University of Melbourne, Parkville, VIC, Australia

Synonyms

Process filtering; Process management; Process retrieval

Definitions

Business process querying studies methods for managing, e.g., filtering and/or manipulating, repositories of models that describe observed and/or envisioned business processes and relationships between the business processes. A process querying method is a technique that, given a process repository and a process query, systematically implements the query in the repository, where a process query is a (formal) instruction to manage a process repository.

Overview

This encyclopedia entry summarizes the state-of-the-art methods for (business) process querying proposed in the publications included in the literature reviews reported in Wang et al. (2014) and Polyvyanyy et al. (2017b) and the related works discussed in the reviewed publications. The scope of this entry is limited to the methods for process querying that are based on special-purpose (programming) languages for capturing process querying instructions and were developed in the areas of process mining (van der Aalst 2016) and business process management (Weske 2012; Dumas et al. 2013).

Next, this section lists basic concepts in process querying. Let Uan and Uav be the universes of attribute names and attribute values, respectively.

Event. An event e is a relation between some attribute names and attribute values with the property that each attribute name is related to exactly one attribute value, i.e., it is a partial function e : Uan ⇸ Uav.
Process. A process is a partially ordered set (E, ≺), where E is a set of events and ≺ is a partial order over E.
Trace. A trace is a process (E, <), where < is a total order over E.
Behavior. A behavior is a multiset of processes.
Behavior model. A behavior model is a simplified representation of real-world or envisioned behaviors and relationships between the behaviors that serves a particular purpose for a target audience.
Process repository. A process repository is an organized collection of behavior models.
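Rendered in code, the basic concepts above are compact. The sketch below is illustrative only (attribute names are hypothetical): an event is a partial function from attribute names to values, which a plain mapping captures, and a trace is a sequence of events under a total order.

```python
# An event is a partial function from attribute names to attribute values:
# a mapping captures this, since each name relates to exactly one value.
event1 = {"activity": "register claim", "timestamp": 1, "resource": "Ann"}
event2 = {"activity": "check claim", "timestamp": 2, "resource": "Bob"}
event3 = {"activity": "pay claim", "timestamp": 3, "resource": "Ann"}

# A trace is a process whose events are totally ordered, so a list suffices.
trace = [event1, event2, event3]

# A behavior is a multiset of processes; a list of traces models an event log.
event_log = [trace, trace[:2]]

def is_totally_ordered(trace, key="timestamp"):
    """Check that the listed order of events is a total order on `key`."""
    stamps = [e[key] for e in trace]
    return all(a < b for a, b in zip(stamps, stamps[1:]))
```

A general process would instead carry an explicit partial-order relation over its events; the list encoding above is the special case of a trace.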
Process query. A process query is an instruction to filter or manipulate a process repository.

Let Upr and Upq be the universe of process repositories and the universe of process queries, respectively.

Process querying method. A process querying method m is a relation between ordered pairs, in which the first entries are process repositories and the second entries are process queries, and process repositories, with the property that each pair is related to exactly one process repository, i.e., it is a function m : Upr × Upq → Upr.

Framework

In Polyvyanyy et al. (2017b), a process querying framework is proposed. A schematic visualization of the framework is shown in Fig. 1. The framework is an abstract system of components that provide generic functionality and can be selectively replaced to result in a new process querying method. In the figure, rectangles and ovals denote active components and passive components, respectively. Active components represent actions performed by the process querying methods. Passive components stand for (collections of) data objects. The framework is composed of four parts. These parts are responsible for (i) designing process repositories and process queries, (ii) preparing and (iii) executing process queries, and (iv) interpreting the results of the process querying methods; refer to Polyvyanyy et al. (2017b) for details.

The framework proposes four types of behavior models: event logs, process models, simulation models, and correlation models. An event log is a finite collection of traces, where each trace is a finite sequence of events observed and recorded in the real world. A process model is a conceptual model that describes behaviors. A simulation model is a process model with a part of its behavior. Finally, a correlation model is a model that describes relations between two behaviors. All the process querying methods reported below operate over event logs and process models.
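Since a process querying method is any function m : Upr × Upq → Upr, a minimal, hypothetical instantiation (not one of the surveyed languages) can be sketched as a filter over a repository of traces:

```python
def filter_method(repository, query):
    """A process querying method m : Upr x Upq -> Upr.

    Here the repository is a list of traces (each a list of activity names)
    and the query is a predicate over traces; the result is a repository.
    """
    return [trace for trace in repository if query(trace)]

# A toy repository of two behavior models (traces), names hypothetical.
repository = [
    ["register", "check", "pay"],
    ["register", "reject"],
]

# The query: keep the behavior models in which "pay" eventually occurs.
result = filter_method(repository, lambda trace: "pay" in trace)
```

Manipulating methods fit the same signature; they return a transformed repository rather than a subset of it.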
Business Process Querying, Fig. 1 A schematic view of the process querying framework from (Polyvyanyy et al. 2017b)
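Several of the methods discussed below reduce querying to model checking. As a toy illustration (the states and the property are hypothetical): enumerate the reachable states of a finite-state system and verify that a property holds in each of them.

```python
from collections import deque

# A finite-state system given as a transition relation (toy order process).
transitions = {
    "created":  {"approved", "rejected"},
    "approved": {"shipped"},
    "rejected": set(),
    "shipped":  set(),
}

def check_safety(initial, transitions, is_safe):
    """Breadth-first search of the state space; returns True iff the
    property `is_safe` holds in every reachable state."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if not is_safe(state):
            return False
        for nxt in transitions[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# A property that holds: every reachable state is a defined state.
ok = check_safety("created", transitions, lambda s: s in transitions)

# A property that fails: "rejected" is reachable from "created".
bad = check_safety("created", transitions, lambda s: s != "rejected")
```

Real model checkers verify temporal-logic formulas over such state spaces rather than per-state predicates, but the exhaustive exploration is the common core.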
Given a model of a finite-state system, for example, a behavior model, and a formal property, model checking techniques verify whether the property holds for the system (Baier and Katoen 2008).

Temporal Logic. A temporal logic is a formal system for encoding and reasoning about propositions qualified in terms of time (Øhrstrøm and Hasle 2007). Several methods for process querying are grounded in model checking temporal logic properties, e.g., BPMN-Q, BPSL, CRL, and PPSL. Given a process repository and a process query, these methods proceed by translating the process query into a temporal logic expression, e.g., linear temporal logic (LTL), computation tree logic (CTL), and metrical temporal logic (MTL).

This section discusses the state-of-the-art methods for process querying. The methods can be seen as instantiations of the process querying framework (refer to Fig. 1) with various concrete components. In what follows, the methods are grouped based on the types of behavior models that they can take as input in process repositories and are referred to by the names of the corresponding languages for specifying process queries.

Log Querying
The process querying methods presented in this section operate over event logs.

CRG&eCRG. Compliance Rule Graph (CRG) is a visual language for specifying compliance
rules (Ly et al. 2010). The semantics of CRG is defined based on first-order logic (FOL) over event logs. Extended Compliance Rule Graph (eCRG) extends CRG to allow modeling compliance rules over multiple process perspectives (Knuplesch and Reichert 2017). At this stage, eCRG addresses the interaction, time, data, and resource perspectives of process compliance. Similar to CRG, the execution semantics of eCRG is defined based on FOL over event logs.

DAPOQ-Lang. The Data-Aware Process Oriented Query Language (DAPOQ-Lang) allows querying over events in traces of event logs, as well as over related data objects, their versions and data schemas, and temporal properties of all the aforementioned elements (de Murillas et al. 2016).

depends on the information kept in the so far executed traces.

Model Querying
This section lists methods for querying collections of process models. These methods can be split into those that operate over process model specifications and those that explore behaviors encoded in process models. Methods that operate over model specifications can be further split into two subclasses. The first subclass includes methods for querying conceptual models that were also reported useful for querying process models. The second subclass consists of dedicated techniques for querying process model collections. DMQL, GMQL, and VMQL are the methods that belong to the first subclass of methods that operate over process model specifications.
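Compliance rules of the kind CRG expresses can be approximated over an event log with simple first-order checks. The sketch below (illustrative only; activity names hypothetical) evaluates a "response" pattern: whenever a request occurs, an approval must follow later in the same trace.

```python
def holds_response(trace, trigger, response):
    """'Response' pattern: globally, if `trigger` occurs at position i,
    then `response` occurs at some later position in the trace."""
    for i, activity in enumerate(trace):
        if activity == trigger and response not in trace[i + 1:]:
            return False
    return True

# A toy event log: each trace is a sequence of activity names.
log = [
    ["request", "review", "approve"],
    ["request", "review"],  # violates the pattern: no later approval
]

violations = [t for t in log if not holds_response(t, "request", "approve")]
```

Graphical languages such as CRG and eCRG express far richer rules (data, time, and resource conditions), but their log semantics reduces to quantified checks of this shape.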
for reading and writing VMQL queries. VMQL queries can also specify constraints that go beyond the syntax of the host language. For querying purposes, process models and VMQL queries are encoded in a Prolog database. Executing a VMQL query with respect to a given model requires finding parts of the model that resemble the structure and properties of the query and satisfy the constraints.

Next, we discuss dedicated process querying methods that operate over model specifications.

BPMN-Q. BPMN query language (BPMN-Q) was originally proposed to query business process definitions (Awad 2007), i.e., the flow between activities, branching and joining structures, and data flow. BPMN-Q extends the abstract and concrete syntax of BPMN. Hence, BPMN-Q is a visual language. Later, BPMN-Q was extended to serve as a visual interface to temporal logics, e.g., LTL (Awad et al. 2008) and CTL (Awad et al. 2011).

BPSL. The Business Property Specification Language (BPSL) is a visual language that aims to facilitate reasoning over business process models (Liu et al. 2007). The authors proposed direct translations of BPSL queries into both LTL and CTL to simplify the integration of BPSL with existing formal verification tools. The language proposes dedicated operators for recurring patterns in regulatory process compliance and specifies extension points for the definition and reuse of domain-specific temporal properties.

CRL. Compliance Request Language (CRL) is a visual language that is grounded in temporal logic and encompasses a range of compliance patterns (Elgammal et al. 2016). At this stage, CRL supports compliance patterns that address control flow, resource, and temporal aspects of business processes. During execution, control flow and resource patterns get mapped to LTL, whereas temporal patterns are translated into MTL formulas. MTL extends LTL and is interpreted over a discrete time domain using the digital clock model.

Descriptive PQL. Descriptive Process Query Language (Descriptive PQL) is a textual query language for retrieving process models, specifying abstractions on process models, and defining process model changes (Kammerer et al. 2015). The aforementioned operations are grounded in manipulations with single-entry-single-exit subgraphs (Polyvyanyy et al. 2010) of the process models.

PPSL. The Process Pattern Specification Language (PPSL) is a lightweight extension of the language for capturing UML activity diagrams that allows users to model process constraints, e.g., those related to quality management (Förster et al. 2005). The semantics of the language is given as translations of PPSL queries into temporal logic, viz., LTL formulas (Förster et al. 2007).

Finally, we summarize process querying methods that operate over behaviors encoded in models.
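To give a flavor of querying over model specifications, in the spirit of BPMN-Q's structural path predicate (BPMN-Q itself is visual and far richer), a reachability query over a model's control-flow graph can be sketched as follows; the model and activity names are hypothetical.

```python
def path_exists(flow, source, target):
    """Check whether `target` is reachable from `source` over the model's
    directed control-flow edges (a simplified 'path' predicate)."""
    seen, stack = set(), [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(flow.get(node, ()))
    return False

# A process model specification as a directed graph over activities.
model = {
    "receive order": ["check stock"],
    "check stock": ["ship order", "reject order"],
    "ship order": [],
    "reject order": [],
}

# Query: does the model contain a path from receiving to shipping an order?
match = path_exists(model, "receive order", "ship order")
```

Languages such as DMQL and GMQL generalize this to full subgraph-pattern matching with node and edge predicates.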
van der Aalst WMP (2016) Process mining – data science in action, 2nd edn. Springer, Berlin/Heidelberg. https://doi.org/10.1007/978-3-662-49851-4
Subieta K (2009) Stack-based query language. In: Liu L (ed) Encyclopedia of database systems. Springer, New York, pp 2771–2772. https://doi.org/10.1007/978-0-387-39940-9_1115
Wang J, Jin T, Wong RK, Wen L (2014) Querying business process model repositories – a survey of current approaches and issues. World Wide Web 17(3):427–454. https://doi.org/10.1007/s11280-013-0210-z
Weske M (2012) Business process management – concepts, languages, architectures, 2nd edn. Springer, Berlin/Heidelberg. https://doi.org/10.1007/978-3-642-28616-2

Business Process Variants Analysis

Business Process Deviance Mining
However, one potential weakness is that the performance of accessing data may decline. Since RDBMS typically have tight integration and control of the entire stack, from the computation to the storage, performance may be faster and more predictable. For comparable experiences with traditional SQL processing in RDBMS, distributed SQL engines on Hadoop turn to caching to achieve the desired performance.

There are several different ways SQL engines on Hadoop can take advantage of caching to improve performance. The primary methods are as follows:

• Internal caching within the SQL engine
• Utilizing external storage systems for caching

SQL Engine Internal Caching

A common way a SQL engine uses caching is to implement its own internal caching. This provides the most control for each SQL engine as to what data to cache and how to represent the cached data. This internal cache is also most likely the fastest to access, since it is located closest to the computation being performed (same memory address space or local storage), in the desired format. However, the SQL engine internal cache may not be shareable between different users and queries, something that would be beneficial in a multi-tenant environment. Also, not all SQL processing frameworks have implemented an internal cache, which prevents using cached data for queries.

Apache Spark SQL
Apache Spark SQL is a SQL query engine built on top of Apache Spark, a distributed data processing framework. Spark SQL exposes a concept called a DataFrame, which represents a distributed collection of structured data, similar to a table in an RDBMS. Caching in Spark SQL utilizes the caching in the Spark computation engine. A user can choose to cache a DataFrame in Spark SQL by explicitly invoking a cache command (Spark SQL 2018). When the command is invoked, the Spark SQL engine will cache the DataFrame internal to the Spark engine. There are several user-configurable ways the data can be cached, including MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and others (Spark RDD 2018). An explicit command must be invoked to remove the DataFrame from the cache.

Apache Hive and LLAP
Apache Hive is a SQL data warehouse for Hadoop, which utilizes an existing distributed computation engine for the data processing. The options of computation engines for Hive to use are MapReduce, Spark, or Tez. Hive uses Live Long And Process (LLAP) (Apache Hive LLAP 2018) for caching data for SQL processing. Currently, only Tez will take advantage of the caching available via LLAP.

LLAP is a long-lived daemon which runs alongside Hadoop components in order to provide caching and data pre-fetching benefits. Since Tez takes advantage of LLAP, when Hive uses Tez as the computation engine, data access will go to the LLAP daemons, instead of the HDFS DataNodes. With this access pattern and architecture, LLAP daemons can cache data for Hive queries.

External Storage Systems for Caching

Another major technique for caching for SQL engines on Hadoop is to utilize an external storage system for caching data. Using a separate system can have several benefits. An external system may be able to manage the cache more effectively and provide additional cache-related features. Also, an external system can be deployed independently from the other components of the ecosystem, which can provide operational flexibility. Additionally, sharing cached data can be enabled or made simpler with a separate system for caching. By using a separate system for handling the caching of data, SQL engines can take advantage by reading the data from the cache, instead of the originating data source. This can greatly improve performance, especially when frequently accessed files are cached. Below
are some of the main systems which can provide caching in Hadoop ecosystems.

Alluxio
Alluxio (2018) is a memory-speed virtual distributed storage system. Alluxio provides a unified namespace across multiple disparate storage systems. Alluxio enables users to "mount" any existing storage system, like HDFS, Amazon S3, or a storage appliance, and presents a single, unified namespace encompassing all the data. When a mount is established, applications simply interact with the Alluxio namespace, and the data will be transparently accessed from the mounted storage and cached in Alluxio. Since there is no limit to how many mounts are possible, Alluxio enables accessing all data from a single namespace and interface, and allows queries to span multiple different data sources seamlessly. Figure 1 shows how Alluxio can mount multiple storage systems for multiple computation frameworks.

Alluxio enables caching for SQL engines by providing memory-speed access to data via its memory-centric architecture. Alluxio stores data on different tiers, which include memory, solid-state drive (SSD), and disk. With the memory tier and the tiered storage architecture, Alluxio can store the important and most frequently accessed data in the fastest memory tier, and store less accessed data in slower tiers with more capacity. SQL engines can access any data via the Alluxio unified namespace, and the data will be either fetched from the configured mounted storages or served from Alluxio-managed storage (memory, SSD, and disk), which can greatly improve performance. When Alluxio is deployed colocated with the computation cluster, the Alluxio memory storage can behave similarly to the internal caches of SQL engines. Alluxio is able to transparently cache data from various other storage systems, like HDFS. Additionally, since the Alluxio storage is independent from the SQL engines, the cached data can easily be shared between different computation frameworks, thus allowing greater flexibility for deployments.

Apache Ignite HDFS Cache
Apache Ignite (2018) is a distributed platform for in-memory storage and computation. Apache Ignite is a full stack of components which includes a key/value store, a SQL processing engine, and a distributed computation framework, but it can be used to cache data for HDFS and SQL engines on Hadoop.

Apache Ignite provides a caching layer for HDFS via the Ignite File System (IGFS). All clients and applications interact with IGFS for data, and IGFS handles reading and writing the

Caching for SQL-on-Hadoop, Fig. 1 Alluxio mounting various storage systems
data to memory on the nodes. When using IGFS as a cache, users can optionally configure a secondary file system for IGFS, which allows transparent read-through and write-through behavior. Users can configure HDFS as the secondary file system, so when the data is synced to IGFS, the data in Ignite is cached in memory.

Using Apache Ignite, SQL engines can access their HDFS data via IGFS, and the data can be cached in the Ignite memory storage. Any SQL engine can take advantage of this cached data for greater query performance. When using IGFS, the SQL engine internal caching does not need to be used, which will also enable sharing of the cached data. However, IGFS only allows a single secondary HDFS-compatible file system; if SQL queries need to access varied sources, IGFS would not be able to cache the data.

HDFS Centralized Cache Management
Another external caching option is for the cache to be implemented in the Hadoop storage system, HDFS. Depending on the operating system, the OS buffer cache is one way to take advantage of caching in HDFS. However, there is no user control over the OS buffer cache. Instead, HDFS has a centralized cache management feature (Apache Hadoop HDFS 2018), which enables users to explicitly specify files or directories which should be cached in memory. Once a user specifies a path to cache in HDFS, the HDFS DataNodes will be notified and will cache the data in off-heap memory. In order to remove a file or directory from the cache, a separate command must be invoked. HDFS caching can help accelerate access to the data, since the data can be stored in memory, instead of on disk.

By enabling the HDFS centralized cache, the data of specified files will reside in memory on the HDFS DataNodes, so any SQL engines or applications accessing those files will be able to read the data from memory. Increasing the data access performance of HDFS will help any queries accessing the cached files. Apache Impala has additional support for utilizing the HDFS centralized caching feature for Impala tables which are read from HDFS. With Apache Impala, users can specify tables or partitions of tables to be cached via HDFS centralized caching. IBM Big SQL (Floratou et al. 2016) also uses the HDFS cache for accelerating access to data and utilizes new caching algorithms, Adaptive SLRU-K and Adaptive EXD, to improve the cache hit rates for better performance.

Since centralized caching is an HDFS feature, it can help improve performance for any HDFS data access. However, for disaggregated clusters, where the HDFS cluster is separate from the SQL engine cluster, the HDFS caching may not be able to help much. Separating computation and storage clusters is becoming popular for its scalability, flexibility, and cost effectiveness. In these environments, when the computation cluster requires data, the data must be accessed across the network, so even if the data resides in the memory cache on the HDFS DataNodes, the network transfer may be the performance bottleneck. Therefore, further improving performance via caching requires either SQL engine internal caching or a separate caching system. Also, since the caching system is tightly integrated with HDFS, the deployment is not as flexible as other external caching systems.

Conclusion

Caching can significantly improve query performance for SQL engines on Hadoop. There are several different techniques for caching data for SQL frameworks. SQL engine internal caching has the potential for the greatest performance, but each framework must implement a cache, and the cached data is not always shareable by other applications or queries. Another option is to use the HDFS centralized caching feature to store files in memory on the HDFS DataNodes. This server-side cache is useful to speed up local access to data, but if SQL engines are not colocated with HDFS, or the data is not stored in HDFS, the network may become the performance bottleneck. Finally, external systems can cache data for SQL frameworks on Hadoop. When colocated with the computation cluster, external systems can store
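The read path that the caching options above share can be illustrated with a toy sketch: serve from the fastest tier that holds the data, fall back to the origin store on a miss, and promote hot data upward. This is a simplified illustration; the tier names and class are hypothetical and not an API of Alluxio, Ignite, or HDFS.

```python
class TieredCache:
    """Toy multi-tier read path: memory, then disk, then the origin store."""

    def __init__(self, origin):
        self.memory, self.disk, self.origin = {}, {}, origin
        self.hits = {"memory": 0, "disk": 0, "origin": 0}

    def read(self, path):
        if path in self.memory:        # fastest tier
            self.hits["memory"] += 1
            return self.memory[path]
        if path in self.disk:          # slower cache tier
            self.hits["disk"] += 1
            data = self.disk[path]
        else:                          # cache miss: fetch from the origin
            self.hits["origin"] += 1
            data = self.origin[path]
            self.disk[path] = data     # populate the cache on the way back
        self.memory[path] = data       # promote hot data to the memory tier
        return data

store = TieredCache(origin={"/warehouse/t1": "rows"})
first = store.read("/warehouse/t1")    # goes to the origin
second = store.read("/warehouse/t1")   # served from the memory tier
```

Real systems add eviction policies, capacity limits per tier, and sharing across processes, which is precisely where the approaches surveyed in this entry differ.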
Cheap Data Analytics on Cold Storage 435
Cross-References

Apache Spark
Hadoop
Hive
Spark SQL
Virtual Distributed File System: Alluxio

References

Causal Consistency

TARDiS: A Branch-and-Merge Approach to Weak Consistency

Certification

Auditing
(Nandkarni 2014). Thus, the topic of managing cold data has received a lot of attention from both industry and academia over the past few years. Systems that are purpose-built to provide cost-efficient storage and analysis of cold data are referred to as cold storage systems.

Background and Overview

Enterprise databases have long used storage tiering for reducing capital and operational expenses. Traditionally, databases used a two-tier storage hierarchy. An online tier based on enterprise hard disk drives (HDD) provided low-latency (ms) access to data. The backup tier based on offline tape cartridges or optical drives, in contrast, provided low-cost, high-latency (hours or days) storage for storing backups to be restored only during rare failures.

As databases grew in popularity, the necessity to reduce recovery time after failure became important. Further, as regulatory compliance requirements forced enterprises to maintain long-term data archives, the offline nature of the backup tier proved too slow for both storing and retrieving infrequently accessed archival data. This led to the emergence of a new archival tier based on nearline storage devices, like virtual tape libraries (VTL), which could store and retrieve data automatically without human intervention in minutes.

Over the past decade, the emergence of flash-based solid-state storage, the declining price of DRAM, and the demand for low-latency, real-time data analytics have resulted in the traditional online tier being bifurcated into two subtiers, namely, a performance tier based on RAM/SSD and a capacity tier based on HDD. Thus, most modern enterprises today use a four-tier storage hierarchy (performance, capacity, archival, backup) implemented using three storage types (online, nearline, and offline) as shown in Fig. 1.

Given the proliferation of cold data over the past few years, the obvious question that arises is where such cold data should be stored. Given that databases already have a tiered storage infrastructure in place, an obvious low-cost solution to deal with this problem is to store cold data in either the capacity tier or the archival tier. Unfortunately, both these options are not ideal given recent trends in the high-density storage hardware landscape.

HDD-Based High-Density Storage
Traditionally, 7,200 RPM HDDs have been the primary storage media used for provisioning the capacity tier. For several years, areal density improvements enabled HDDs to increase capacity at Kryder's rate (40% per year), outstripping the 18-month doubling of transistor count predicted by Moore's law. However, over the past few years, HDD vendors have hit walls in scaling areal density with conventional perpendicular magnetic recording (PMR) techniques (Fontana and Decad 2015). As a result, HDD vendors have started working on several techniques to improve areal density, like using helium-bearing instead of air-bearing for the disk head to pack more platters tightly, or using shingled magnetic recording (SMR) techniques that pack more data by overlapping adjacent disk tracks. Despite such attempts from HDD vendors, HDD areal density is improving only at an annual rate of 16% instead of 40% (Fontana and Decad 2015; Moore 2016).

HDDs also present another problem when used as the storage medium of choice for storing cold data, namely, high idle power consumption. Unlike tape drives, which consume no power once unmounted, HDDs consume a substantial amount of power even while idle (Colarelli and Grunwald 2002). Such power consumption translates to a proportional increase in operational expenses. Thus, the capacity tier, with its always-on HDD-based storage, is not an ideal option for storing infrequently accessed cold data.

Tape-Based High-Density Storage
Unlike HDDs, the areal density of tapes has been increasing steadily at a rate of 33% per year, and the Linear Tape Organization (LTO) roadmap (LTO Ultrium 2015) projects continued increase in density for the foreseeable future. Today, a single LTO-7 cartridge is capable of matching or even outperforming a HDD with respect to sequential data access bandwidth. As modern tape libraries use multiple drives, the cumulative bandwidth achievable using even low-end tape libraries is 1–2 GB/s. However, the random access latency of tape libraries is still 10,000× higher than that of HDDs (minutes versus ms), due to the fact that tape libraries need to mechanically load tape cartridges before data can be accessed. Thus, the tape-based archival tier is used to store only rarely accessed compliance and backup data. As the expected workload in such cases is dominated by sequential writes with a few rare reads, the high access latency of tape drives is tolerable.

Using the archival tier to store cold data, however, changes the application workload, as analytical queries need to be issued over cold data to extract insightful results (EMC 2014). As a nearline storage device with access latency that is five orders of magnitude larger than HDD, tape will impose a significant performance penalty for even latency-insensitive batch analytic workloads (Kathpal and Yasa 2014). Thus, today, enterprises are faced with a trade-off, as they can store cold data in the tape-based archival tier at the expense of performance or in the HDD-based capacity tier and trade off cost.

Key Research Findings

Over the past few years, storage hardware vendors and researchers have become cognizant of the gap between the HDD-based capacity tier and the tape-based archival tier. This has led to the emergence of a new class of nearline storage devices explicitly targeted at cold data workloads. These devices, also referred to as cold storage devices (CSD), have two salient properties that distinguish them from the tape-based archival tier. First, they use archival-grade, high-density, shingled magnetic recording-based HDD as the storage media instead of tapes. Second, they explicitly trade off performance for power consumption by organizing hundreds of disks in a massive array of idle disks (MAID) configuration that keeps only a fraction of HDD powered up at any given time (Colarelli and Grunwald 2002).

CSD differ dramatically with respect to price, performance, and peak power consumption characteristics. Some CSD enforce a strict upper limit on the number of HDD that can be spun up at any given point to limit peak power draw. By doing so, they right-provision hardware resources, like in-rack cooling and power management, to cater to the subset of disks that are spun up, reducing operational expenses further. For instance, Pelican (Balakrishnan et al. 2014) packs 1,152 SMR disks in a 52U rack for a total capacity of 5 PB. However, only 8% of disks are spun up at any given time due to restrictions enforced by in-rack cooling and power budget. Similarly, each OpenVault Knox (Yan 2013) CSD stores 30 SMR HDDs in a 2U chassis, of which only one can be spun up to minimize the sensitivity
of disks to vibration. Other CSD, like Spectra ArcticBlue (Spectra Logic 2013), provide more performance flexibility at the expense of cost, similar to traditional MAID arrays, by only spinning down idle disks while permitting a much larger subset of disks to be active simultaneously.

The net effect of these limitations is that the latency to access data stored in the CSD depends on whether the data resides in a disk that is spun up or down. Access to data in any of the spun-up disks can be done with latency and bandwidth comparable to that of the traditional capacity tier. For instance, Pelican, OpenVault Knox, and ArcticBlue are all capable of saturating a 10 Gb Ethernet link, as they provide between 1 and 2 GB/s of throughput with an access latency of 5–10 milliseconds for reading data from spun-up disks. However, accessing data on a spun-down disk takes 5–10 s, as the target disk needs to be spun up before data can be accessed.

CSD form a perfect middle ground between HDD and tape. Due to the use of high-density disks and MAID techniques, CSD are touted to offer cost/GB comparable to tape. With worst-case access latencies in seconds, CSD are closer to HDD than tape with respect to performance. However, CSD are not a drop-in replacement for HDD (Borovica-Gajic et al. 2016). In order to effectively exploit CSD for reducing the cost of cold data storage and analytics, cold storage infrastructures need to design three components of both CSD and analytic engines to work in concert with the shared goal of minimizing the number of disk spin-ups, namely, the CSD storage manager, the CSD I/O scheduler, and the analytic engine's query executor.

CSD Storage Manager
The CSD storage manager is responsible for implementing the storage layout that maps data to disks. Designing an optimal data layout for CSD is a complex problem, as one has to balance performance, reliability, and availability simultaneously. From a performance point of view, data that is accessed together should be stored on a set of disks that can spin up together so that requests for such data can be serviced concurrently without any group switches. From the reliability aspect, the disks chosen for storing data should span multiple failure domains, like different trays in the rack, so that a failure in any single domain does not result in total data loss. With respect to availability, the data layout should ensure that recovery from a disk failure is seamless and quick, as disk failures will be a norm rather than an exception in CSD.

In CSD like Pelican (Balakrishnan et al. 2014), where the power and cooling infrastructure of the CSD enforces a strict limit on which set of disks can be spun up at any given time, the choice of data layout is a direct consequence of these restrictions. Pelican assembles disks into groups such that disks within each group belong to different failure domains and can be spun up simultaneously. Each data object stored by Pelican is striped across disks within a single group, using erasure coding to protect data in case of disk failures. By using disks that can be active at the same time, Pelican meets the performance requirement, as multiple object requests can be served concurrently due to striping. Erasure coding helps in improving availability, as recovery after disk failures can be done within a group without any spin-ups, and several disks can participate in data restoration.

A storage layout that uses historic data access patterns as a hint for future accesses could further improve performance by packing multiple objects that are always accessed together into a few disk groups. The task of identifying an optimal mapping of objects to disks given a history of accesses can be modeled as an optimization problem (Reddy et al. 2015). However, given the large number of objects stored in CSD, it can be quite complicated to handle each object individually in the optimization algorithm. Thus, the problem of choosing a data layout is decomposed into two subproblems: (i) identifying clusters of objects that are accessed together from the trace and (ii) identifying an optimal mapping of objects to disks. The goal of the first step is to transform individual object access patterns into group access patterns by clustering objects that are accessed together. The optimization algorithm can then map clusters to disks. It has been shown that such an access-driven data layout approach can save
Exploiting the cost benefits of CSD while minimizing the associated performance trade-off requires eliminating unnecessary disk spin-ups. The only way of achieving such an optimal data access pattern is to have the database issue requests for all necessary data upfront so that the CSD can return data back in an order that minimizes the number of disk spin-ups. Thus, the order in which the database receives data, and hence the order in which query execution happens, should be determined by the CSD to minimize the performance impact of group switches.

Unfortunately, current databases are not designed to work with such a CSD-driven query execution approach. Traditionally, databases have used a strict optimize-then-execute model to evaluate queries. The database query optimizer uses cost models and gathered statistics to determine the optimal query plan. Once the query plan has been generated, the execution engine then invokes various relational operators strictly based on the generated query plan with no run-time decision-making. This results in pull-based query execution, where the database explicitly requests data in an order determined by the query plan. This pull-based execution approach is incompatible with the CSD-driven approach, as the optimal order chosen by the CSD for minimizing group switches is different from the ordering specified by the query optimizer.

In order to enable CSD-driven query execution, analytical database engines should adopt push-based query execution instead of a pull-based one. There are two approaches that can be used to perform push-based query execution. The first approach is to tightly couple query execution with I/O scheduling in such a way that the scheduler knows and delivers only the objects that are required for the query executor to make progress. For instance, in a two-table hash join, the scheduler would deliver the objects corresponding to the build relation, over which the hash table is built, before delivering the probe relation. Such an approach was used by tertiary database engines (Sarawagi and Stonebraker 1996) and makes two assumptions: (i) the database has exclusive access to the storage device, and (ii) the I/O scheduler on the storage device is aware of database query execution. Unfortunately, both assumptions are not true today with CSD, as CSD storage is exposed behind an object interface and shared among multiple databases.

The second approach, exemplified by the Skipper framework (Borovica-Gajic et al. 2016), is to perform CSD-driven, push-based, out-of-order execution of queries by building on work done in adaptive query processing (AQP) (Deshpande et al. 2007). Multiway join (MJoin) (Viglas et al. 2003) is one such AQP technique that enables out-of-order execution by using an n-ary join and symmetric hashing to probe tuples as data arrives, instead of using blocking binary joins whose ordering is predetermined by the query optimizer. However, the incremental nature of symmetric hashing in traditional MJoin requires buffering all input data, as tuples that are yet to arrive could join with any of the already received tuples. Thus, in the worst-case scenario of a query that involves all relations in the dataset, the MJoin cache must be large enough to hold the entire dataset. This requirement makes traditional MJoin inappropriate for the CSD use case, as having a buffer cache as large as the entire dataset defeats the purpose of storing data on the CSD. Skipper solves this problem by redesigning both MJoin and the database buffer caching algorithms to work in concert such that MJoin can perform query execution even with limited cache capacity (Borovica-Gajic et al. 2016).

Examples of Application

With the right combination of layout, I/O scheduling, and query execution, cold data analytic engines like Skipper have demonstrated that it is possible to mask the long access latency of CSD for latency-insensitive batch analytic workloads. Thus, the obvious application for CSD is as the medium of choice for storing cold data, and a natural way of integrating CSD in the four-tier storage hierarchy shown in Fig. 1 is to use it for building a new cold storage tier that stores only cold data.

However, CSD could potentially have a much larger impact on the storage hierarchy based
on two observations. First, with latency-critical workloads running in the performance tier, the HDD-based capacity tier is already limited to batch analytic workloads today. Second, due to the use of high-density SMR disks and MAID techniques, CSD can offer cost and capacity comparable to the tape-based archival tier. Given these two observations, it might be feasible to get rid of the capacity and archival tiers by using a single consolidated cold storage tier. Such an approach would result in the four-tier hierarchy shown in Fig. 1 being reduced to a three-tier hierarchy with DRAM or SSD in the performance tier, CSD in the cold storage tier, and tape in the backup tier. Recent studies report that enterprises can lower their total cost of ownership (TCO) by between 40% and 60% by using such storage consolidation (Borovica-Gajic et al. 2016). In terms of absolute savings, these values translate to hundreds of thousands of dollars for a 100 TB database and tens of millions of dollars for larger PB-sized databases.

The implications and benefits of using CSD are also applicable to cloud service providers. For instance, cloud providers have already started deploying custom-built CSD for storing cold data (Bandaru and Patiejunas 2015; Bhat 2016; Google 2017). An interesting area of future work involves exploring implementation and pricing aspects of cloud-hosted "cheap-analytics-over-CSD-as-a-service" solutions. Such services would benefit both customers and providers alike, as providers can increase revenue by expanding their storage-as-a-service offering to include cheap analytic services, and customers can reduce their TCO by running latency-insensitive analytic workloads on data stored in CSD.

Future Directions for Research

Systems like Pelican (Balakrishnan et al. 2014) and Skipper (Borovica-Gajic et al. 2016) are a first essential step toward broader adoption of cold data analytics, in general, and CSD in particular. There are several avenues of future work with respect to both cold storage hardware and data analytic software.

Over the past few years, several other systems have been built using alternate storage media to reduce the cost of storing cold data. For instance, DTStore (Lee et al. 2016) uses an LTFS tape archive for reducing the TCO of online multimedia streaming services by classifying data based on access pattern and storing cold data in tape drives. ROS (Yan et al. 2017) is a rack-scale optical disk library that provides PB-sized storage with in-line accessibility for cold data using thousands of optical disks packed in a single 42U rack. Nakshatra (Kathpal and Yasa 2014) enables Hadoop-based batch analytic workloads to run directly over data stored in tape archives. Similar to Skipper, Nakshatra also modifies both the tape archive's I/O scheduler and the MapReduce runtime to perform push-based execution to hide the long access latency of tape drives.

Today, it is unclear how these alternative storage options fare with respect to SMR-HDD-based CSD as the storage media of choice for storing cold data. Some storage media, like optical disks, have the shortcoming that their sequential access bandwidth is substantially lower than HDD or tape. Tape and optical media also have very high access latency compared to CSD. However, an interesting artifact of using push-based engines like Skipper is that query execution becomes much less sensitive to the long access latency of storage devices. Similarly, optical disks are traditionally known to have a longer life span compared to HDD and tape. However, it is possible to deal with even frequent media failures by trading off some storage capacity for improved reliability, using erasure coding and data replication techniques in software. Thus, further research is required to understand the implications of storing cold data on these storage devices by running realistic cold data analytic workloads and performing a comparative study of performance and reliability.

On the software side, current systems like Skipper and Nakshatra only target a simplified subset of SQL or MapReduce-based analytic workloads. They do not support complex business intelligence workloads that require operators like data cube (Gray et al. 1997), data mining workloads like clustering and
classification, or iterative machine learning workloads. They also do not consider the availability of additional layers of persistent storage that could be used as caches or prefetch buffers for data stored in CSD, or as an extended memory that can be used to swap out intermediate data structures generated by operators like MJoin. Further research is also required to understand how push-based execution can be extended to these workloads.

References

Amazon (2015) Amazon simple storage service. https://aws.amazon.com/s3/. Accessed 1 Oct 2017
Appuswamy R, Borovica-Gajic R, Graefe G, Ailamaki A (2017) The five-minute rule thirty years later, and its impact on the storage hierarchy. In: Proceedings of the eighth international workshop on accelerating analytics and data management systems using modern processor and storage architectures, Munich
Balakrishnan S, Black R, Donnelly A, England P, Glass A, Harper D, Legtchenko S, Ogus A, Peterson E, Rowstron A (2014) Pelican: a building block for exascale cold data storage. In: Proceedings of the 11th USENIX conference on operating systems design and implementation, Berkeley, pp 351–365
Bandaru K, Patiejunas K (2015) Under the hood: Facebook's cold storage system. Facebook. https://code.facebook.com/posts/1433093613662262/-under-the-hood-facebook-s-cold-storage-system-/. Accessed 1 Oct 2017
Bhat S (2016) Introducing azure cool blob storage. Microsoft. https://azure.microsoft.com/en-us/blog/introducing-azure-cool-storage/. Accessed 1 Oct 2017
Borovica-Gajic R, Appuswamy R, Ailamaki A (2016) Cheap data analytics using cold storage devices. Proc VLDB Endow 9(12):1029–1040. https://doi.org/10.14778/2994509.2994521
Colarelli D, Grunwald D (2002) Massive arrays of idle disks for storage archives. In: Proceedings of the ACM/IEEE conference on supercomputing, Los Alamitos, pp 1–11
Deshpande A, Ives Z, Raman V (2007) Adaptive query processing. Found Trends Databases 1(1):1–140. https://doi.org/10.1561/1900000001
EMC (2014) The digital universe of opportunities: rich data and the increasing value of the internet of things. https://www.emc.com/collateral/analyst-reports/idc-digital-universe-2014.pdf. Accessed 1 Oct 2017
Fontana R, Decad G (2015) Roadmaps and technology reality. Presented at the Library of Congress storage meetings on designing storage architectures for digital collections, Washington, 9–10 Sept 2015
Google (2017) Nearline cloud storage. https://cloud.google.com/storage/archival/. Accessed 1 Oct 2017
Gray J, Graefe G (1997) The five-minute rule ten years later, and other computer storage rules of thumb. SIGMOD Rec 26(4):63–68. https://doi.org/10.1145/271074.271094
Gray J, Graefe G (2007) The five-minute rule twenty years later and how flash memory changes the rules. In: Proceedings of the 3rd international workshop on data management on new hardware, New York, pp 1–9
Gray J, Putzolu GR (1987) The 5-minute rule for trading memory for disk accesses and the 10-byte rule for trading memory for CPU time. SIGMOD Rec 16(3):395–398. https://doi.org/10.1145/38714.38755
Gray J, Chaudhuri S, Bosworth A, Layman A, Reichart D, Venkatrao M, Pellow F, Pirahesh H (1997) Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J Data Min Knowl Discov 1(1):29–53. https://doi.org/10.1023/A:1009726021843
Kathpal A, Yasa GAN (2014) Nakshatra: towards running batch analytics on an archive. In: Proceedings of the 22nd IEEE international symposium on the modelling, analysis and simulation of computer and telecommunication systems, Paris, pp 479–482
Lee J, Ahn J, Park C, Kim J (2016) DTStorage: dynamic tape-based storage for cost-effective and highly-available streaming service. In: Proceedings of the 16th IEEE/ACM international symposium on cluster, cloud and grid computing, Cartagena, pp 376–387
LTO Ultrium (2015) LTO roadmap. http://www.ltoultrium.com/lto-ultrium-roadmap/. Accessed 1 Oct 2017
Mendoza A (2013) Cold storage in the cloud: trends, challenges, and solutions. https://www.intel.com/content/www/us/en/storage/cold-storage-atom-xeon-paper.html. Accessed 1 Oct 2017
Moore F (2015) Tiered storage takes center stage. Horison Inc. http://horison.com/publications/tiered-storage-takes-center-stage. Accessed 1 Oct 2017
Moore F (2016) Storage outlook 2016. Horison Inc. https://horison.com/publications/storage-outlook-2016. Accessed 1 Oct 2017
Nandkarni A (2014) IDC worldwide cold storage taxonomy. http://www.idc.com/getdoc.jsp?containerId=246732. Accessed 1 Oct 2017
Oracle (2015) OpenStack swift interface for oracle hierarchical storage manager. http://www.oracle.com/us/products/servers-storage/storage/storage-software/solution-brief-sam-swift-2321869.pdf. Accessed 1 Oct 2017
Prabhakar S, Agrawal D, Abbadi A (2003) Optimal scheduling algorithms for tertiary storage. Distrib Parallel Databases 14(3):255–282. https://doi.org/10.1023/A:1025589332623
Reddy R, Kathpal A, Basak J, Katz R (2015) Data layout for power efficient archival storage systems. In: Proceedings of the workshop on power-aware computing and systems, Monterey, pp 16–20
Robert Y, Vivien F (2009) Introduction to scheduling, 1st edn. CRC Press, Boca Raton
Sarawagi S, Stonebraker M (1996) Reordering query execution in tertiary memory databases. In: Proceedings of the 22nd international conference on very large data bases, San Francisco, pp 156–167
Spectra Logic (2013) Spectra ArcticBlue overview. https://www.spectralogic.com/products/arcticblue/. Accessed 1 Oct 2017
Viglas SD, Naughton JF, Burger F (2003) Maximizing the output rate of multi-way join queries over streaming information sources. In: Proceedings of the 29th international conference on very large data bases, Berlin, pp 285–296
Yan M (2013) Cold storage hardware v0.5. http://www.opencompute.org/wp/wp-content/uploads/2013/01/Open_Compute_Project_Cold_Storage_Specification_v0.5.pdf. Accessed 1 Oct 2017
Yan W, Yao J, Cao Q, Xie C, Jiang H (2017) ROS: a rack-based optical storage system with inline accessibility for long-term data preservation. In: Proceedings of the 12th European conference on computer systems, Belgrade, pp 161–174

Clojure

Nicolas Modrzyk
Karabiner Software, Tokyo, Japan

It's easy to get sidetracked with technology, and that is the danger, but ultimately you have to see what works with the music and what doesn't. In a lot of cases, less is more. In most cases, less is more.
Herbie Hancock

tional praise. But why would you use Clojure nowadays? What can it do for you that the trillions of other languages available on the market cannot do, or cannot do as efficiently, with Java 1.9, Python, or NodeJS getting all the hype and the tooling nowadays, and with even big consulting firms entering the NodeJS market? Clojure, being a LISP, has a fantastic minimal core and a very small set of functions and data structures. Of course, you have asynchronous thread-safe parallel queues; of course, you have parallel processing of all the data structures; and it is also a breeze to go for test-driven development. But what's more is that Clojure changes the way you think about programming and write programs in general. It makes you think in small, versatile, reusable building blocks. Eventually the users of Clojure, the people writing code with it, do not really think about it anymore; it just becomes the way they naturally code, whatever the language. Clojure also makes your code smaller and easier to debug, maintain, and push to production, with less fear of being woken up at 3 AM for issues. Your program is consistent and reliable. Let's review some of the Clojure features and ecosystem that make Clojure a language of choice for handling big data.

Clojure for Instant Prototyping, with a REPL
You usually start a REPL with Leiningen (or boot), the two most famous build tools, which actually install Clojure and put you in command of a build environment at the same time.

Starting a REPL with:

    lein repl

When working with Clojure, programmers would usually always have a REPL running, to try ideas on the run or in a business meeting.

Clojure for Code as Data

    (+ 1 1)

The code is made of a single sequence, surrounded by parentheses, and the sequence itself is made of 3 characters: +, 1, and 1. What the REPL does is convert the string you have written into a sequence and then evaluate its content. To perform the same thing the REPL is doing, you can use the function read-string, which takes a string and converts it to a sequence of symbols, similar to an abstract syntax tree.

    (read-string "(+ 1 1)")
    ; (+ 1 1)

Why is this nice? It makes it really easy to tweak the compilation step found in coding. To achieve that, Clojure provides macros, which are evaluated during the pre-processing phase, transforming the lists of lists of sequences before they are executed. See the plus-one macro below, which adds one to the variable x. In the sense that the block code is written and evaluated as is, not rewritten then evaluated. Which takes us to . . .
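The original plus-one listing is not preserved in this extraction, so the following is a reconstruction based on the description above (a macro that adds one to x), together with the read-string evaluation the text walks through:

```clojure
;; What the REPL does, spelled out: read a string into a
;; sequence of symbols, then evaluate it.
(eval (read-string "(+ 1 1)"))
;; => 2

;; A plus-one macro as described in the text: it rewrites
;; (plus-one x) into (+ x 1) before evaluation.
(defmacro plus-one [x]
  `(+ ~x 1))

(plus-one 41)
;; => 42
```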
    (->> (range 100000)
         (r/map inc)
         (r/filter even?)
         (into []))

It looks pretty much the same on paper, apart from the r/ prefixes added to the map and filter functions. What does not show on paper is the

Clojure for Multi-threading

Reducers and the different versions of pmap are perfect to apply computations in parallel. To perform asynchronous message handling between different blocks of code, Clojure comes with a core library named core.async, available on GitHub.
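As a sketch of the pmap variant mentioned above (slow-inc is a made-up stand-in for an expensive per-item computation):

```clojure
;; pmap has the same shape as map, but spreads the work
;; across threads; it pays off when per-item cost dominates.
(defn slow-inc [n]
  (Thread/sleep 10)
  (inc n))

(doall (pmap slow-inc (range 5)))
;; => (1 2 3 4 5)
```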
To present this, a small example made of two asynchronous blocks of code, exchanging messages through a channel, will be used. One asynchronous block will send messages to the channel, while the other will read messages from that same channel and print them out.

This very first example creates a thread and prints messages asynchronously by itself.

    (require
      '[clojure.core.async
        :refer :all])

    (thread
      (dotimes [n 10]
        (Thread/sleep 100)
        (println "hello:" n)))

If executed at the REPL, as soon as the block of code is evaluated, evaluation does not block and returns a reference to the executing async thread. In the meantime, the thread goes on with its happy life and starts printing hello messages every 100 ms.

This second example creates two threads. A thread will update an atom, a variable ready for asynchronous updates, by safely increasing its value by one. A second thread safely reads the value of the atom while it's being updated by the first thread.

    (def my-value (atom 0))

    (thread
      (dotimes [n 10]
        (Thread/sleep 1000)
        (swap! my-value inc)))

    (thread
      (dotimes [_ 5]
        (Thread/sleep 1000)
        (println
          "Counter is now: "
          @my-value)))

The number of threads can be increased, and in action, no deadlock occurs. Updates to the atom are done using the STM transaction model, which prevents deadlocks by using transactions to perform updates on atoms.

The third asynchronous example removes the use of the intermediate atom by directly acting on messages sent via a channel. The first thread still creates messages, but this time sends them to the newly created hi-chan channel, while the second thread reads the values coming through the channel and prints them out.

    (def hi-chan (chan))

    (thread
      (dotimes [n 10]
        (Thread/sleep 1000)
        (>!! hi-chan
             (str "hello:" n))))

    (thread
      (dotimes [_ 10]
        (Thread/sleep 1000)
        (println
          "Channel value: "
          (<!! hi-chan))))

Here again, both thread code blocks return immediately without blocking the execution of the main thread, making all this feasible to execute at the REPL.

Core.async can also run in the browser via ClojureScript and is a very compelling strategy to create a back buffer for onscreen user events and other asynchronous data requests.

Clojure for the Backend

Many frameworks exist for server and API development in Clojure, but most of them seem to be inspired by, or paying close attention to, the efforts that were put into the ring and compojure mini-frameworks. Ring itself is an abstraction over HTTP, while Compojure is a routing library that can, and mostly does, run on top of ring.

If you have Leiningen installed, getting started with Compojure is as easy as using the compojure project template.

    lein new compojure intro

This creates a prod-ready but also development-ready environment to start writing code. From the newly created intro folder, a ring development server can be started using the following command.

    lein ring server
Clojure, Fig. 3 Ring HTTP get

This will set up everything needed, apart from your favorite text editor. The main core.clj file in the created project looks deceptively simple, with a single route defined for a GET request on the root context, anything else returning a simple Not Found string.

(defroutes app-routes
  (GET "/" [] "Hello World")
  (route/not-found "Not Found"))

(def app
  (wrap-defaults
    app-routes
    site-defaults))

The main route can be updated by adding the time via a Java date object, using basic Clojure/Java interop.

(defroutes app-routes
  (GET "/" []
    (str
      "Server Time is: "
      (java.util.Date.)))
  (route/not-found "Not Found"))

This is of course nowhere near a production-ready app, but the point here is that this creates a fully ready development environment in a few minutes, thus bringing the server-side prototyping setup down to pretty much zero (Fig. 3).

To get further into backend programming in Clojure, and to quickly grasp all the ring/compojure concepts, there is a nice introduction in the blog post sinatra-docs-in-clojure.

Clojure for the Front End

ClojureScript is Clojure ported to the JavaScript runtime, and why and how to use it is extensively covered in the Reactive with ClojureScript Apress book. With ClojureScript, Clojure can be used for

• the backend,
• the frontend, and
• the data exchanged between both the backend and the frontend, which can also be in the Extensible Data Notation, a subset of Clojure.

Clojure's Reagent is a library that acts as a thin wrapper around React, the frontend framework created by Facebook. It removes the extra JavaScript boilerplate to make the code very compact and easier to read, write, and maintain. Reagent can be combined with, among many others, core.async to get asynchronous-looking code (remember, there is only one JavaScript thread in the browser) without sacrificing readability.

As for Clojure on the backend, here again there is a Leiningen project template for the reagent framework; a project can be created with the command below.

lein new reagent reagent-intro

To get started with the reagent setup, two Leiningen commands need to be started, one for the backend and one for the frontend. Here is the command to start the server development environment:

lein ring server

Here is the command to start the client development environment:

lein figwheel

With the setup in place, it is now possible to start editing the core.cljs file in the created project.

(defn some-component []
  [:div
   [:h3 "I am a component!"]
   [:p.someclass
    "I have " [:strong "bold"]
    [:span {:style {:color "#cc77cc"}}
     " and red"] " text."]])

(defn mount-root []
  (reagent/render
    [some-component]
    "hdfs://hostname:port"
    "/home/data"
    "/*/*.bz2"))

Of course, using regular Java interop, it is also possible to target other RDD sources like Cassandra, HBase, and any other storage supported by Hadoop.

Flambo itself allows creating valid Spark jobs that can be deployed and run as any other jobs, but one main limitation is that it does not allow writing and running new jobs at the REPL without using AOT. AOT is ahead-of-time compilation, meaning all the code needs to be statically compiled before running, which prevents it from being created and run at the REPL. This limitation sparked, pun intended, a new project worth considering named powderkeg, whose goal is to overcome that very limitation.

Clojure for Machine Learning

To go beyond simple big data, and as a recent and strong contender to the TensorFlow framework, Cortex has recently joined the Clojure ecosystem for machine learning. Some starter and favorite examples to look at to fully understand how Cortex works are the cats-and-dogs sorting example and the fruits classification example.

In this short entry, we will briefly present training a network for the XOR operation, as it has recently been included in the Cortex examples section and is very simple to go over. As is well known, the XOR operation takes two values as input and produces one value as output. This is represented here by creating a dataset to train the network, made of a vector of elements, where each element is a map with an input, :x, and an output, :y, value.

(def xor-dataset
  [{:x [0.0 0.0] :y [0.0]}
   {:x [0.0 1.0] :y [1.0]}
   {:x [1.0 0.0] :y [1.0]}
   {:x [1.0 1.0] :y [0.0]}])

The neural network to be created will be a linear network of three layers, where the activation layer will be the hyperbolic tangent activation function (see: tanh function). In the definition of the network, the input and output layers each specify how many variables are available, and so below:

• 2 for the input, and
• 1 for the output

The network is then created almost by translating written text to a network definition using the Cortex API.

(def nn
  (network/linear-network
    [(layers/input 2 1 1 :id :x)
     (layers/linear->tanh 10)
     (layers/linear 1 :id :y)]))

Now, along with the dataset, let's create a trained version of this neural network. In machine learning, training a network is usually done using two data sets:

• a training set, to tell the network what is expected and correct, and
• a testing set, to test the network and check how it performs.

The training stage is then achieved by passing those two data sets; here, though, the same dataset is used for both. The number of rounds of training is also specified, here 3000, to get an efficient network.

(def trained
  (train/train-n
    nn
    xor-dataset
    xor-dataset
    :batch-size 4
    :epoch-count 3000
    :simple-loss-print? false))

Each training step prints some output about the current state of the trained network, notably whether it performed better than its own version in the previous step.

Training network:
...
Loss for epoch 1: (current) 0.7927149 (best)
Saving network to trained-network.nippy
Finally, to validate the results, yet another dataset is created, usually of new values and, of course, with only input values.

(def new-dataset
  [{:x [0.0 0.0]}
   {:x [0.0 1.0]}
   {:x [1.0 0.0]}
   {:x [1.0 1.0]}])

Then follows running the trained network on this new dataset.

(clojure.pprint/pprint
  (execute/run
    trained
    new-dataset))

The result is, not surprisingly, very accurate, as shown by a screenshot of the network executed at the REPL.

Simply said, Gorilla is a web-based REPL, which allows creating annotated notebooks, along with ready-to-be-executed code blocks, all in the browser. It is made of a server-side part that runs a background REPL and a simple compojure-based API ready to receive Clojure code as a string. The second part is a Clojure frontend, running a websocket that sends Clojure code to the REPL on the server, waits for the result of the execution, and displays it.

The Gorilla framework can be added to any of your projects using a Leiningen plugin. This plugin is added either to the project definition, project.clj, or to the global Leiningen profile, profiles.clj.

[lein-gorilla "0.4.0"]
This short entry presented diverse possibilities on how to put Clojure straight into action to:

• Help developers in writing code using a REPL, thus reducing the usual life-cycle overhead associated with many programming languages
• Reduce the maintenance associated with writing code in other languages, by considerably reducing the lines of code used to work on and transform data structures, by writing macros and/or using the already available threading macros
• Enable pinpoint custom usage of the cores available to each runtime node of your application. Parallel computation is available to most data structures
• core.async allows message passing between different components of your application. Points of interaction can be defined with atoms or channels and, by being reliably highlighted, reduce the coupling between different parts of the code, making the overall end product resistant to changes.
• Front-to-end Clojure brings commonness and portability of the application code, thus helping

Virtualized Big Data Benchmarks

Cloud Computing for Big Data Analysis

Fabrizio Marozzo and Loris Belcastro
DIMES, University of Calabria, Rende, Italy

Definitions

Cloud computing is a model that enables convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction (Mell and Grance 2011).

Overview

In the last decade, the ability to produce and gather data has increased exponentially. For example, huge amounts of digital data are generated by and collected from several sources, such as sensors, web applications, and services. Moreover, thanks to the growth of social networks (e.g., Facebook, Twitter, Pinterest, Instagram, Foursquare, etc.) and the widespread diffusion of mobile phones, every day millions of people share information about their interests and activities. The amount of data generated, the speed at which it is produced, and its heterogeneity in terms of format represent a challenge to current storage, processing, and analysis capabilities. Those data volumes, commonly referred to as Big Data, can be exploited to extract useful information and to produce helpful knowledge for science, industry, public services, and, in general, humankind.

To extract value from such data, novel technologies and architectures have been developed by data scientists for capturing and analyzing complex and/or high-velocity data. In general, the process of knowledge discovery from Big Data is not easy, mainly due to data characteristics, such as size, complexity, and variety, which require several issues to be addressed. To overcome these problems and get valuable information and knowledge in a shorter time, high-performance and scalable computing systems are used in combination with data and knowledge discovery techniques.

In this context, Cloud computing has emerged as an effective platform to face the challenge of extracting knowledge from Big Data repositories in limited time, as well as to provide an effective and efficient data analysis environment for researchers and companies. From a client perspective, the Cloud is an abstraction for remote, infinitely scalable provisioning of computation and storage resources (Talia et al. 2015). From an implementation point of view, Cloud systems are based on large sets of computing resources, located somewhere "in the Cloud," which are allocated to applications on demand (Barga et al. 2011). Thus, Cloud computing can be defined as a distributed computing paradigm in which all the resources, dynamically scalable and often virtualized, are provided as services over the Internet. As defined by NIST (the National Institute of Standards and Technology) (Mell and Grance 2011), Cloud computing can be described as "a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." From the NIST definition, we can identify five essential characteristics of Cloud computing systems, which are (i) on-demand self-service, (ii) broad network access, (iii) resource pooling, (iv) rapid elasticity, and (v) measured service.

Cloud computing vendors provide their services according to three main distribution models:

• Software as a Service (SaaS), in which software and data are provided to customers through the Internet as ready-to-use services. Specifically, software and associated data are hosted by providers, and customers access them without the need for any additional hardware or software.
• Platform as a Service (PaaS), in which customers are provided with an environment including databases, application servers, and development environments for building, testing, and running custom applications. Developers can just focus on deploying applications, since Cloud providers are in charge of the maintenance and optimization of the environment and underlying infrastructure.
• Infrastructure as a Service (IaaS), that is, an outsourcing model under which customers rent resources like CPUs, disks, or more complex resources like virtualized servers or operating systems to support their operations. Compared to the PaaS approach, the IaaS model has higher system administration costs for the user; on the other hand, IaaS allows a full customization of the execution environment.

Key Research Findings

Most available data analysis solutions today are based on open-source frameworks, such as
ecosystem is undoubtedly one of the most complete solutions for the data analysis problem, but at the same time, it is thought for high-skilled users. On the other hand, many other solutions are designed for low-skilled users or for small and medium organizations that do not want to spend resources on developing and maintaining enterprise data analysis solutions. Two representative examples of such data analysis solutions are Microsoft Azure Machine Learning and the Data Mining Cloud Framework.

Microsoft Azure Machine Learning (Azure ML) (https://azure.microsoft.com/services/machine-learning-studio/) is a SaaS for the creation of machine learning workflows. It provides a very high level of abstraction, because a programmer can easily design and execute data analytics applications by using a simple drag-and-drop web interface and exploiting many built-in tools for data manipulation and machine learning algorithms.

The Data Mining Cloud Framework (DMCF) (Marozzo et al. 2015) is a software system developed at the University of Calabria for allowing users to design and execute data analysis workflows on Clouds. DMCF supports a large variety of data analysis processes, including single-task applications, parameter-sweeping applications, and workflow-based applications. A workflow in DMCF can be developed using a visual or a script-based language. The visual language, called VL4Cloud (Marozzo et al. 2016), is based on a design approach for end users having a limited knowledge of programming paradigms. The script-based language, called JS4Cloud (Marozzo et al. 2015), provides a flexible programming paradigm for skilled users who prefer to code their workflows through scripts.

Other solutions have been created mainly for scientific research purposes, and, for this reason, they are rarely used for developing business applications (e.g., e-Science Central, COMPSs, and Sector/Sphere).

e-Science Central (e-SC) (Hiden et al. 2013) is a Cloud-based system that allows scientists to store, analyze, and share data in the Cloud. It provides a user interface that allows programming visual workflows in any Web browser. e-SC is commonly used to provide a data analysis back end to stand-alone desktop or Web applications. To this end, the e-SC API provides a set of workflow control methods and data structures. In the current implementation, all the workflow services within a single invocation of a workflow execute on the same Cloud node.

COMPSs (Lordan et al. 2014) is a programming model and an execution runtime whose main objective is to ease the development of workflows for distributed environments, including private and public Clouds. With COMPSs, users create a sequential application and specify which methods of the application code will be executed remotely. This selection is done by providing an annotated interface where these methods are declared with some metadata about them and their parameters. The runtime intercepts any call to a selected method, creating a representative task and finding the data dependencies with all the previous ones that must be considered along the application run.

Sector/Sphere (Gu and Grossman 2009) is an open-source Cloud framework designed to implement data analysis applications involving large, geographically distributed datasets. The framework includes its own storage and compute services, called Sector and Sphere, respectively, which allow managing large datasets with high performance and reliability.

Examples of Application

Cloud computing has been used in many scientific fields, such as astronomy, meteorology, social computing, and bioinformatics, which are greatly based on scientific analysis of large volumes of data. In many cases, developing and configuring Cloud-based applications requires a high level of expertise, which is a common bottleneck in the adoption of such applications by scientists.

Many solutions for Big Data analysis on Clouds have been proposed in bioinformatics, such as Myrna (Langmead et al. 2010), which
is a Cloud system that exploits MapReduce for calculating differential gene expression in large RNA-seq datasets.

Wang et al. (2015) propose a heterogeneous Cloud framework exploiting MapReduce and multiple hardware execution engines on FPGAs to accelerate genome sequencing applications.

Cloud computing has also been used for executing complex Big Data mining applications. Some examples are as follows: Agapito et al. (2013) perform an association rule analysis between genome variations and clinical conditions of a large group of patients; Altomare et al. (2017) propose a Cloud-based methodology to analyze data of vehicles in a wide urban scenario for discovering patterns and rules from trajectories; Kang et al. (2012) present a library for scalable graph mining in the Cloud that allows finding patterns and anomalies in massive, real-world graphs; and Belcastro et al. (2016) propose a model for predicting flight delays according to weather conditions.

Several other works have exploited Cloud computing for conducting data analysis on large amounts of data gathered from social networks. Some examples are as follows: You et al. (2014) propose a social sensing data analysis framework in Clouds for smarter cities, especially to support smart mobility applications (e.g., finding crowded areas where more transportation resources need to be allocated); Belcastro et al. (2017) present a Java library, called ParSoDA (Parallel Social Data Analytics), which can be used for developing social data analysis applications.

Future Directions for Research

Some of the most important research trends and issues to be addressed in Big Data analysis and Cloud systems for managing and mining large-scale data repositories are:

• Data-intensive computing. The design of data-intensive computing platforms is a very significant research challenge, with the goal of building computers composed of a large number of multi-core processors. From a software point of view, these new computing platforms open big issues and challenges for software tools and runtime systems, which must be able to manage a high degree of parallelism and data locality. In addition, to provide efficient methods for storing, accessing, and communicating data, intelligent techniques for data analysis and scalable software architectures enabling the scalable extraction of useful information and knowledge from data are needed.
• Massive social network analysis. The effective analysis of social network data on a large scale requires new software tools for real-time data extraction and mining, using Cloud services and high-performance computing approaches (Martin et al. 2016). Social data streaming analysis tools represent very useful technologies to understand collective behaviors from social media data. New approaches to data exploration and model visualization are necessary, taking into account the size of the data and the complexity of the knowledge extracted.
• Data quality and usability. Big Data sets are often arranged by gathering data from several heterogeneous and often not well-known sources. This leads to poor data quality, which is a big problem for data analysts. In fact, due to the lack of a common format, inconsistent and useless data can be produced as a result of joining data from heterogeneous sources. Defining common and widely adopted formats would lead to data that are consistent with data from other sources, that is, high-quality data.
• In-memory analysis. Most data analysis tools access data sources on disk, while, differently from those, in-memory analytics access data in main memory (RAM). This approach brings many benefits in terms of query speed-up and faster decisions. In-memory databases are, for example, very effective in real-time data analysis, but they require high-performance hardware support and fine-grain parallel algorithms (Tan et al. 2015). New 64-bit operating systems allow to
address memory up to one terabyte, making it realistic to cache very large amounts of data in RAM. This is why this research area has a strategic importance.
• Scalable software architectures for fine-grain in-memory data access and analysis. Exascale processors and storage devices must be exploited with fine-grain runtime models. Software solutions for handling many cores and scalable processor-to-processor communications have to be designed to exploit exascale hardware (Mavroidis et al. 2016).

References

Agapito G, Cannataro M, Guzzi PH, Marozzo F, Talia D, Trunfio P (2013) Cloud4SNP: distributed analysis of SNP microarray data on the cloud. In: Proceedings of the ACM conference on bioinformatics, computational biology and biomedical informatics 2013 (ACM BCB 2013). ACM, Washington, DC, p 468. ISBN 978-1-4503-2434-2
Altomare A, Cesario E, Comito C, Marozzo F, Talia D (2017) Trajectory pattern mining for urban computing in the cloud. Trans Parallel Distrib Syst 28(2):586–599. ISSN 1045-9219
Belcastro L, Marozzo F, Talia D, Trunfio P (2016) Using scalable data mining for predicting flight delays. ACM Trans Intell Syst Technol 8(1):5:1–5:20
Belcastro L, Marozzo F, Talia D, Trunfio P (2017) A parallel library for social media analytics. In: The 2017 international conference on high performance computing & simulation (HPCS 2017), Genoa, pp 683–690. ISBN 978-1-5386-3250-5
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design & implementation, OSDI'04, Berkeley, vol 6, pp 10–10
Gu Y, Grossman RL (2009) Sector and Sphere: the design and implementation of a high-performance data cloud. Philos Trans R Soc Lond A Math Phys Eng Sci 367(1897):2429–2445
Hiden H, Woodman S, Watson P, Cala J (2013) Developing cloud applications using the e-Science Central platform. Philos Trans R Soc A 371(1983):20120085
Kang U, Chau DH, Faloutsos C (2012) Pegasus: mining billion-scale graphs in the cloud. In: 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5341–5344. https://doi.org/10.1109/ICASSP.2012.6289127
Langmead B, Hansen KD, Leek JT (2010) Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol 11(8):R83
Li A, Yang X, Kandula S, Zhang M (2010) CloudCmp: comparing public cloud providers. In: Proceedings of the 10th ACM SIGCOMM conference on Internet measurement. ACM, pp 1–14
Lordan F, Tejedor E, Ejarque J, Rafanell R, Álvarez J, Marozzo F, Lezzi D, Sirvent R, Talia D, Badia R (2014) ServiceSs: an interoperable programming framework for the cloud. J Grid Comput 12(1):67–91
Marozzo F, Talia D, Trunfio P (2015) JS4Cloud: script-based workflow programming for scalable data analysis on cloud platforms. Concurr Comput Pract Exp 27(17):5214–5237
Marozzo F, Talia D, Trunfio P (2016) A workflow management system for scalable data mining on clouds. IEEE Trans Serv Comput PP(99):1
Martin A, Brito A, Fetzer C (2016) Real-time social network graph analysis using StreamMine3G. In: Proceedings of the 10th ACM international conference on distributed and event-based systems, DEBS'16. ACM, New York, pp 322–329
Mavroidis I, Papaefstathiou I, Lavagno L, Nikolopoulos DS, Koch D, Goodacre J, Sourdis I, Papaefstathiou V, Coppola M, Palomino M (2016) ECOSCALE: reconfigurable computing and runtime system for future exascale systems. In: 2016 design, automation & test in Europe conference & exhibition (DATE), pp 696–701
Mell PM, Grance T (2011) SP 800-145. The NIST definition of cloud computing. Technical report, National Institute of Standards & Technology, Gaithersburg
Richardson L, Ruby S (2008) RESTful web services. O'Reilly Media, Inc., Newton
Talia D, Trunfio P, Marozzo F (2015) Data analysis in the cloud. Elsevier. ISBN 978-0-12-802881-0
Tan KL, Cai Q, Ooi BC, Wong WF, Yao C, Zhang H (2015) In-memory databases: challenges and opportunities from software and hardware perspectives. SIGMOD Rec 44(2):35–40
Wang C, Li X, Chen P, Wang A, Zhou X, Yu H (2015) Heterogeneous cloud framework for big data genome sequencing. IEEE/ACM Trans Comput Biol Bioinform 12(1):166–178. https://doi.org/10.1109/TCBB.2014.2351800
You L, Motta G, Sacco D, Ma T (2014) Social data analysis framework in cloud and mobility analyzer for smarter cities. In: 2014 IEEE international conference on service operations and logistics, and informatics (SOLI), pp 96–101

Cloud Databases

Databases as a Service
Cloud-Based SQL Solutions for Big Data
or average utilization (leading to insufficient performance at peak times)
• Required continuous maintenance (high operating expense – OpEx)
• Required additional infrastructure (buildings, power), impacting both CapEx and OpEx

dimensions. The table below lists the most popular storage layers from leading providers and their vendor-specific services. These different services can lead to very different software design choices, depending on the requirements of a given system or application (Table 1).
Cloud-Based SQL Solutions for Big Data, Table 1 Example storage services present in various cloud platforms

Layer | Cost | Size | Latency | Bandwidth | Persistency | Access | AWS (Amazon EC2: Storage) | Azure (Introduction to Microsoft Azure Storage) | Google (Google Cloud Platform: Storage Options)
Local instance storage | Free (included) | Limited, fixed | Very low | Very high | Ephemeral | Single instance | Available (not all instance types) | Available | Available (not all types)
Distributed block storage | Medium | Limited, elastic | Low | High | Persistent | Single instance (at a time) | EBS | General purpose storage | Persistent disks
Distributed file system | High | Unlimited | Medium | Medium | Persistent | Shared | EFS | Azure Files | Not available
Object (blob) store | Very low | Unlimited | High | Low | Persistent | Shared | S3 | Azure Blob storage | Cloud Storage
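The qualitative trade-offs in Table 1 can also be captured as a small data structure. The sketch below is an illustrative Python encoding (the rank values are informal translations of the table's ratings, not vendor figures), with a helper that filters the layers matching a set of requirements:

```python
# Table 1 encoded with illustrative ranks: for cost and latency,
# higher means more expensive/slower; for bandwidth, higher means faster.
STORAGE_LAYERS = {
    "local instance storage":
        {"cost": 0, "latency": 1, "bandwidth": 4, "persistent": False, "shared": False},
    "distributed block storage":
        {"cost": 2, "latency": 2, "bandwidth": 3, "persistent": True, "shared": False},
    "distributed file system":
        {"cost": 3, "latency": 3, "bandwidth": 2, "persistent": True, "shared": True},
    "object (blob) store":
        {"cost": 1, "latency": 4, "bandwidth": 1, "persistent": True, "shared": True},
}

def candidates(persistent=None, shared=None, max_latency=None):
    """Names of the storage layers whose Table 1 properties satisfy the constraints."""
    result = []
    for name, props in STORAGE_LAYERS.items():
        if persistent is not None and props["persistent"] != persistent:
            continue
        if shared is not None and props["shared"] != shared:
            continue
        if max_latency is not None and props["latency"] > max_latency:
            continue
        result.append(name)
    return result

# A persistent layer shared between instances, avoiding the slowest option:
print(candidates(persistent=True, shared=True, max_latency=3))
```

Encoding such requirements explicitly is one way the design choices mentioned above can be made visible in application code.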
level of availability previously only available to the largest companies.

Security Cloud systems offer improvements to system security at various levels: physical security (state-of-the-art systems deployed at scale); network security (e.g., Virtual Private Cloud in AWS); encryption (automatic storage encryption; encryption key management); authentication and identity management (IAM in AWS); auditing; and more. As a result, building secure systems in a cloud is often easier than providing a similar (externally facing) system on-premise.

Additional services Cloud vendors offer dozens of additional managed services, in areas like (examples for AWS): messaging systems (e.g., SNS, SQS), databases (e.g., RDS, DynamoDB), serverless computing (e.g., Elastic Beanstalk, Lambda), management (e.g., CloudTrail, CloudWatch), and more. These powerful tools make creating applications much easier in the cloud environment.

SaaS While IaaS provides flexibility and cost benefits to organizations, a good software-as-a-service product can also lead to higher productivity, better user satisfaction, and personnel cost reduction. As a result, cloud systems often focus on providing a low-maintenance, mostly automated, and easy-to-use experience. For example, aspects like friendly user interfaces, automatic software update management, high availability, and built-in monitoring are all now expected by most users of cloud systems.

virtualization technologies prevent that to a large extent, but the problem is not completely gone.

Increased failure rates Similarly, it is more common for instances in the compute layer to die without warning, due to hardware or software corruption, often caused by activity unrelated to a given user.

Non-transparent system topology For the end user, the mapping of virtual compute instances onto physical entities is not visible. As a result, one cannot say whether two nodes might influence each other's behavior, or predict the cost of network communication.

Unpredictability of service stability and performance While cloud services typically offer very high availability, some of them behave inconsistently. For example, it is known that a simple read request to AWS S3, while usually taking a fraction of a second, can occasionally take multiple seconds.

Limited hardware choice Many on-premise systems use carefully tuned hardware units, with balanced CPU/RAM/disk and network. While the cloud offers multiple instance types, it is often not possible to get the exact hardware configuration that would be optimal for a given application.
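The failure-rate and stability issues above are one reason cloud clients are usually written defensively, with timeouts and retries around every remote call. Here is a minimal, generic sketch of jittered exponential backoff in plain Python (the flaky_read operation is hypothetical, standing in for, e.g., an occasionally slow or failing S3 request):

```python
import random
import time

def with_backoff(operation, max_attempts=5, base_delay=0.05, max_delay=1.0):
    """Call operation(), retrying on failure with capped, jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter avoids retry storms

# Hypothetical flaky operation: fails twice with a transient error, then succeeds.
attempts = {"count": 0}

def flaky_read():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("transient service error")
    return b"object contents"

print(with_backoff(flaky_read))
```

Managed SDKs typically ship such retry policies built in; the point is simply that cloud-facing code treats transient failure as the normal case rather than the exception.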
Let us look at a list of various challenges that database system designers face when designing for the cloud.

Elasticity The most popular design for on-premise large-scale database systems is "shared nothing" (Stonebraker 1985). This is a great architecture for fixed-topology systems but presents a lot of problems when highly elastic behavior is desired. As a result, other approaches have been proposed.

Scalability The combination of elastic compute resources and a pay-for-use billing model leads to customers demanding higher peak compute power (for a shorter time). As a result, distributed database systems need to be designed to efficiently scale to larger sizes.

Infrastructure reliability In traditional systems, infrastructure failures (failed disks, network partitioning, etc.) and performance degradations were often considered exceptional events. In the cloud, they are more likely, and systems need to be designed for them.

Performance consistency The performance differences between various nodes participating in query processing can be much higher in a shared environment. As a result, features like load balancing and skew handling are significantly more important in the cloud.

System topology With most of the cloud infrastructure virtualized, the placement of logical entities on physical hardware is often unknown. This causes problems for aspects like fault tolerance, where a single machine failure can potentially influence multiple logical instances, making it much harder to provide system failure guarantees.

Network efficiency Many high-performance distributed database systems are designed for high-performance networking interfaces like InfiniBand – when suddenly faced with, e.g., a slower network, performance can degrade quickly. This influences aspects like distributed algorithms, data exchange formats, etc.

Security While typically deployed in isolated, private environments, databases did not have to worry about many aspects of security present in cloud environments. For example, for many users, having all data encrypted in flight is legally required in a shared environment.

Extensibility Many database systems allow using custom software to enhance various features of the system (for example, C or C++ user-defined functions). Shared systems need to devise mechanisms protecting against malicious code provided through these mechanisms.

Monitoring On-premise databases could assume the users had system-level access to the instances, giving them insight into aspects like memory or disk utilization. With databases provided as a service, users often do not have this access, and additional levels of monitoring need to be exposed via separate interfaces.

Client connectivity Databases traditionally assumed the user maintained a network connection throughout their session. However, with web user interfaces and users being more mobile, the need arises for clients that do not depend on a continuous connection.

External data Traditional data warehouses could assume they managed all (or most of) the data they needed to access. With the recent data explosion, many organizations keep a lot of data in low-cost services like S3. The ability to efficiently access these becomes highly desired, especially with that data being easily accessible within a given cloud system.

Simplicity Another area where databases do not match the SaaS world perfectly is system usability. Databases are famous for the complexity
1 gigabit Ethernet connection, their performance of their management, with activities like tuning,
462 Cloud-Based SQL Solutions for Big Data
backups, and monitoring consuming a lot of precious personnel time.

Pricing Traditional database licensing was reasonably simple, with the user typically paying for the volume of data or the amount of used hardware resources. With the cloud's elasticity, new payment models had to be invented, typically focused on actual system usage.

Multi-tenancy Cloud database systems that choose to provide multi-tenant services face additional challenges. For example, the workload of one user might adversely influence other users. Additionally, any layers of the system that are shared require strong access control mechanisms. Finally, multi-tenancy can cause scalability challenges for the shared system components.

Example Systems

In this section we discuss a few representative examples of database systems provided as cloud services. While they obviously differ in many database-specific areas, like SQL support, performance, etc., we focus on their very different approaches to building a cloud data warehousing system.

Amazon Redshift
Amazon Redshift (Amazon Redshift) was the first widely available cloud data warehousing system and is still the market leader. It is derived from the on-premise ParAccel database, which followed the traditional shared-nothing model (Chen et al. 2009). It uses user-sized EC2 clusters for both processing and storage.
While originally keeping the major design decisions unchanged, Redshift did introduce a number of cloud-focused improvements and extensions (Gupta et al. 2015), mainly:

• Ability to scale up/down and pause/resume, providing much better elasticity than in on-premise systems
• Integration with S3, simplifying data backups
• Rich user interface for managing and monitoring

Additionally, in 2017 Amazon introduced a major architectural extension for Redshift called Spectrum (Amazon Redshift Spectrum). It is a sub-system focused on scanning data in S3, where a part of the Redshift query can be pushed to a separate system, which uses highly parallel compute infrastructure for operations like scans, filters, and aggregations.

Microsoft Azure SQL Data Warehouse
Azure SQL Data Warehouse (SQLDW) (Microsoft Azure SQL Data Warehouse) is a cloud-focused evolution of the SQL Server Parallel Data Warehouse (Parallel Data Warehouse overview), which in turn extended SQL Server with a shared-nothing architecture derived from DATAllegro (DATAllegro).
Azure SQLDW uses compute nodes accessing data stored in a distributed block store. All the data is up-front partitioned in the storage layer into a limited (60) set of partitions. Each compute node owns a subset of them, and inter-node data transfer is required to access data owned by other nodes. As such, the data assignment approach and processing layer of Azure SQLDW are similar to shared-nothing systems. Still, using a separate storage layer provides some benefits of the shared-filesystem approaches. For example, redistributing the data between nodes is a logical operation (no data copying is needed) and is relatively fast.

Google BigQuery
Google BigQuery (Google BigQuery) has a very different lineage than Redshift and Azure SQLDW. It is based on internal Google products (BigQuery under the hood), mainly Dremel (Melnik et al. 2010) for data processing and the Colossus distributed file system (Fikes) for data storage, following the shared-storage architecture.
Originally Dremel was a reasonably simple internal query processing system with limited functionality (e.g., no real distributed joins;
append-only), with a limited, SQL-like language. After releasing it publicly as BigQuery in 2011, Google has implemented a number of significant improvements bringing it much closer to traditional database systems, including more standard SQL support and update capability.
BigQuery is unique in the discussed set of products with its usage model – the user issues a query and only pays for the amount of data that query processes. This is the closest to the pure SaaS approach, with users not having to manage anything related to resource allocation. With this approach, it is also possible for a single query to use tens or even hundreds of machines, depending on what is currently available, with the user's cost staying the same.
BigQuery provides a demonstration of how a highly distributed compute layer combined with efficient distributed storage can provide exceptional query performance without any up-front investment.

Snowflake Elastic Data Warehouse
Snowflake (Introducing Snowflake) is a rare example of a SQL data warehousing system designed from scratch with cloud platforms in mind. Its architecture (Dageville et al. 2016) is driven directly by various attributes of AWS:

• Object store is the cheapest and most reliable storage, leading to using S3 as the persistent data layer
• S3 has no update capability, leading to using immutable micro-partitions as the data storage format
• S3 has low performance, leading to columnar storage, heavy data skipping, and local caching
• For the compute layer to easily scale up and down, compute nodes should be stateless
• With data in S3, multiple compute nodes can access it at the same time, as long as someone coordinates their activity
• S3 is not good for frequent data access and compute nodes are stateless, hence an extra management and metadata entity is needed
• In a multi-tenant system, all the barriers are purely logical, hence sharing data between users is possible

The resulting multi-cluster shared-data architecture uniquely translates a lot of cloud benefits to the end user, providing high elasticity, scalability, and novel features.

Other Systems
While the four systems discussed in this chapter are highly representative examples, there are many other products in this space, providing additional unique cloud-related approaches and features. That list includes, but is not limited to: Teradata IntelliCloud (Teradata IntelliCloud), Micro Focus Vertica (including the Eon mode) (Whats New in Vertica 9.0: Eon Mode Beta), Pivotal Greenplum (Pivotal Greenplum on Amazon Web Services), IBM dashDB (IBM dashDB), Oracle Autonomous Data Warehouse (Oracle Autonomous Data Warehouse Cloud), and Amazon Athena (Amazon Athena).

Conclusions

It seems clear today that most future data processing needs will be fulfilled by cloud systems. This chapter discussed the challenges and opportunities the cloud brings, as well as approaches to addressing them. A short overview of a few example systems demonstrates that databases can take very different architectural approaches to this new platform and still build very successful cloud systems. With cloud adoption growing at a rapid pace, we expect a lot of innovation in this space in the coming years.

References

Amazon EC2: Storage. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Storage.html
Amazon Redshift. https://aws.amazon.com/redshift
Amazon Redshift Spectrum. https://aws.amazon.com/redshift/spectrum/
Amazon Athena. https://aws.amazon.com/athena
464 Cloudlets
Overview
A crucial aspect in delivering high performance in such large-scale environments is the underlying data layout. Most Big Data frameworks are designed to operate on top of data stored in various formats, and they are extensible enough to incorporate new data formats. Over the years, a plethora of open-source data formats have been designed to support the needs of various applications. These formats can be row or column oriented and can support various forms of serialization and compression. The columnar data formats are a popular choice for fast analytics workloads. As opposed to row-oriented storage, columnar storage can significantly reduce the amount of data fetched from disk by allowing access to only the columns that are relevant for the particular query or workload. Moreover, columnar storage combined with efficient encoding and compression techniques can drastically reduce the storage requirements without sacrificing query performance.

Column-oriented storage has been successfully incorporated in both disk-based and memory-based relational databases that target OLAP workloads (Vertica 2017). In the context of Big Data frameworks, the first works on columnar storage for data stored in HDFS (Apache Hadoop HDFS 2017) appeared around 2011 (He et al. 2011; Floratou et al. 2011). Over the years, multiple proposals have been made to satisfy the needs of various applications and to address the increasing data volume and complexity. These discussions resulted in the creation of two popular columnar formats, namely, the Parquet (Apache Parquet 2017) and ORC (Apache ORC 2017) file formats. These formats are both open-source and are currently supported by multiple proprietary and open-source Big Data frameworks. Apart from columnar organization, the formats provide efficient encoding and compression techniques and incorporate various statistics that enable predicate pushdown, which can further improve the performance of analytics workloads.

In this article, we first present the major works in disk-based columnar storage in the context of Big Data systems and Hadoop data. We then provide a detailed description of the Parquet and ORC file formats, which are the most widely adopted columnar formats in current Big Data frameworks. We conclude the article by highlighting the similarities and differences of these two formats.

Related Work

The first works on columnar storage in the context of Hadoop, namely, the RCFile (Row Columnar File) (He et al. 2011) and the CIF (Column Input File format) (Floratou et al. 2011) layouts, were mostly targeting the MapReduce framework (Dean and Ghemawat 2008). These two columnar formats have adopted significantly different designs. The CIF file format stored each column in the data as a separate file, whereas the RCFile adopted a PAX-like (Ailamaki et al. 2001) data layout where the columns are stored next to each other in the same file.

These design choices led to different trade-offs. The CIF file format was able to provide better performance as it allowed accessing only the files that contain the desired columns but required extra logic in order to colocate the files that contain adjacent columns. Without providing colocation in HDFS, reconstructing a record from multiple files would result in high network I/O. The RCFile, on the other hand, did not require any such logic to be implemented as all the columns were stored in the same HDFS block. However, the RCFile layout made it difficult to enable efficient skipping of columns due to file system prefetching. As a result, queries that were accessing a small number of columns would pay a performance overhead (Floratou et al. 2011).

The successor of the RCFile format, namely, the ORC (Optimized Row Columnar) file format, was developed to overcome the limitations of the RCFile format. It was originally designed to speed up Hadoop and Hive, but it is currently incorporated in many other frameworks as well. The ORC file format still uses a PAX-like (Ailamaki et al. 2001) layout but can more efficiently skip columns by organizing the data in larger blocks called row groups. A detailed
466 Columnar Storage Formats
description of the ORC file format is provided in the following section.
The Parquet (Apache Parquet 2017) file format is another popular, open-source columnar format. The Parquet file format was designed to efficiently support queries over deeply nested data. The format is based on the Dremel distributed system (Melnik et al. 2010) that was developed at Google for interactively querying large datasets. Similar to the ORC format, the Parquet format has also adopted a PAX-like layout (Ailamaki et al. 2001). The following section presents Parquet's layout in detail.
Detailed comparisons of various columnar formats can be found in Huai et al. (2013) and in Floratou et al. (2014). The work by Floratou et al. (2014) compares the ORC and Parquet file formats in the context of Hive (Apache Hive 2017) and Impala (Kornacker et al. 2015) for SQL workloads.
Other related open-source technologies are Apache Arrow (2017) and Apache Kudu (2017). Apache Arrow is a columnar in-memory format that can be used on top of various disk-based storage systems and file formats to provide fast, in-memory analytical processing. Note that as opposed to the file formats discussed above,

Parquet 2017) and are actively used by multiple organizations when performing analytics over Hadoop data.

The Parquet File Format
The Parquet file format (Apache Parquet 2017) is a self-described columnar data format specifically designed for the Hadoop ecosystem. It is supported by many Big Data frameworks such as Apache Spark (Zaharia et al. 2012) and Apache Impala (Kornacker et al. 2015), among others. The format inherently supports complex nested data types as well as efficient compression and encoding schemes. In the next section, we describe the format layout in more detail.

File Organization
The Parquet format supports both primitive (e.g., integer, Boolean, double, etc.) and complex data types (maps, lists, etc.). The structure of a Parquet file is shown in Fig. 1. The file consists of a set of row groups which essentially represent horizontal partitions of the data. A row group consists of a set of column chunks. Each
column chunk contains data from a particular data column and it is guaranteed to be contiguous in the file. The column chunk consists of one or more pages, which are the unit of compression and encoding. Each page contains a header that describes the compression and encoding method used for the data in the page.
As shown in Fig. 1, the file consists of N columns spread across M row groups. The row groups are typically large (e.g., 1 GB) to allow for large sequential I/Os. Since an entire row group might need to be accessed, the HDFS block size should be large enough to fit the row group. A typical configuration is to have 1 GB HDFS blocks that contain one row group of 1 GB.
Metadata is stored at all the levels in the hierarchy, i.e., file, column chunk, and page. The bottom of the file contains the file metadata, which includes information about the data schema as well as information about the column chunks in the file. The metadata contains statistics about the data such as the minimum and maximum values in a column chunk or page. Using these statistics, the Parquet file format supports predicate pushdown, which allows skipping of pages and reduces the amount of data that needs to be decompressed and parsed. The Parquet readers first fetch the metadata to filter out the column

The ORC File Format
The ORC file format was created as part of an initiative to speed up Apache Hive (2017) queries and reduce the storage requirements of data stored in Apache Hadoop (2017). Currently the ORC file format is supported by many Big Data solutions including Apache Hive (2017), Apache Pig (2017), and Apache Spark (Zaharia et al. 2012), among others.

File Organization
The ORC format is an optimized version of the previously used Row Columnar (RC) file format (He et al. 2011). The format is self-describing as it includes the schema and encoding information for all the data in the file. Thus, no external metadata is required in order to interpret the data in the file. The format supports both primitive (e.g., integer, Boolean, etc.) and complex data types (maps, lists, structs, unions). Apart from the columnar organization, the ORC file supports various compression techniques and lightweight indexes.
The structure of an ORC file is shown in Fig. 2. As shown in the figure, the file consists of three major parts: a set of stripes, a file footer, and a postscript. The
postscript part provides all the necessary information to parse the rest of the file, such as the compression codec used and the length of the file footer. The file footer contains information related to the data stored in the file, such as the number of rows in the file, statistics about the columns, and the data schema. The process of reading the file starts by first reading the tail of the file that consists of the postscript and the file footer, which contain metadata about the main body of the file.
An ORC file stores multiple groups of row data as stripes. Each stripe has a size of about 250 MB and contains only entire rows, so a row cannot span multiple stripes. Internally, each stripe is divided into index data, row data, and stripe footer, in that order. The data columns are stored next to each other in a PAX-like (Ailamaki et al. 2001) organization. Depending on the query's access pattern, only the necessary columns are decompressed and deserialized. The stripe footer contains the location of the columns in the data as well as information about the encoding of each column. The indexes contain statistics about each column in a row group (a set of 10,000 rows). For example, in the case of integer columns, the column statistics include the minimum, maximum, and sum values of the column in the given row group. These statistics can be used to skip entire sets of rows that do not satisfy the query predicates. Recently, bloom filters can additionally be used to better prune row groups (ORC Index 2017).
Finally, it is worth noting that column statistics are also stored in the file footer at the granularity of stripes. These statistics enable skipping entire stripes based on a filtering predicate.

Compression
Depending on the user's configuration, an ORC file can be compressed using a generic compression codec (typically zlib (ZLIB Compression 2017) or snappy (Snappy Compression 2017)). The file is compressed in chunks so that entire chunks can be directly skipped without decompressing the data. The default compression chunk size is 256 KB but it is configurable. Having larger chunks typically results in better compression ratios but requires more memory to be allocated during decompression.
The ORC file makes use of various run-length encoding techniques to further improve compression. The interested reader can find more information about the supported encoding methods in ORC Encodings (2017).

File Format Comparison
The Parquet and ORC file formats have both been created with the goal of providing fast analytics in the Hadoop ecosystem. Both file formats are columnar and they have adopted a PAX-like layout. Instead of storing each column in a separate file, the columns are stored next to each other in the same HDFS block. By having large row groups and HDFS blocks, both formats exploit large sequential I/Os and can skip unnecessary columns.
Both formats support various compression codecs at a fine granularity so that the amount of data that is decompressed is minimized. They also support various encoding schemes depending on the data type.
Finally, the file formats incorporate statistics at various levels in order to avoid decompressing and deserializing data that do not satisfy the filtering predicates. Typically, the statistics include the maximum and the minimum values of a column. The ORC file additionally supports bloom filters at the row group level, while the Parquet file supports statistics at the page level as well.
Regarding the data types supported, the Parquet file format has been designed to inherently support deeply nested data using the record shredding and assembly algorithm presented in Melnik et al. (2010). The ORC file, on the other hand, flattens the nested data and creates separate columns for each underlying primitive data type.

Conclusions

The increased interest in fast analytics over Hadoop data fueled the development of multiple open-source data file formats. In this work, we
focused on the columnar storage formats that are used to store HDFS data. In particular, we first reviewed the work on disk-based columnar formats for Big Data systems and then presented a detailed description of the open-source ORC and Parquet file formats. These two columnar data formats are currently supported by a plethora of Big Data solutions. Finally, we highlighted the differences and similarities of the two file formats.

Cross-References

Caching for SQL-on-Hadoop
Wildfire: HTAP for Big Data

References

Ailamaki A, DeWitt DJ, Hill MD, Skounakis M (2001) Weaving relations for cache performance. In: Proceedings of the 27th international conference on very large data bases (VLDB'01), pp 169–180
Apache Arrow (2017) Apache Arrow. https://arrow.apache.org/
Apache Hadoop (2017) Apache Hadoop. http://hadoop.apache.org
Apache Hadoop HDFS (2017) Apache Hadoop HDFS. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Apache Hbase (2017) Apache HBase. https://hbase.apache.org/
Apache Hive (2017) Apache Hive. https://hive.apache.org/
Apache Kudu (2017) Apache Kudu. https://kudu.apache.org/
Apache ORC (2017) Apache ORC. https://orc.apache.org/
Apache Parquet (2017) Apache Parquet. https://parquet.apache.org/
Apache Pig (2017) Apache Pig. https://pig.apache.org/
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Floratou A, Patel JM, Shekita EJ, Tata S (2011) Column-oriented storage techniques for MapReduce. Proc VLDB Endow 4(7):419–429
Floratou A, Minhas UF, Özcan F (2014) SQL-on-Hadoop: full circle back to shared-nothing database architectures. Proc VLDB Endow 7(12):1295–1306
He Y, Lee R, Huai Y, Shao Z, Jain N, Zhang X, Xu Z (2011) RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: Proceedings of the 2011 IEEE 27th international conference on data engineering (ICDE'11). IEEE Computer Society, pp 1199–1208
Huai Y, Ma S, Lee R, O'Malley O, Zhang X (2013) Understanding insights into the basic structure and essential issues of table placement methods in clusters. Proc VLDB Endow 6(14):1750–1761
Kornacker M, Behm A, Bittorf V, Bobrovytsky T, Ching C, Choi A, Erickson J, Grund M, Hecht D, Jacobs M, Joshi I, Kuff L, Kumar D, Leblang A, Li N, Pandis I, Robinson H, Rorke D, Rus S, Russell J, Tsirogiannis D, Wanderman-Milne S, Yoder M (2015) Impala: a modern, open-source SQL engine for Hadoop. In: CIDR
Melnik S, Gubarev A, Long JJ, Romer G, Shivakumar S, Tolton M, Vassilakis T (2010) Dremel: interactive analysis of web-scale datasets. Proc VLDB Endow 3(1–2):330–339
ORC Encodings (2017) ORC Encodings. https://orc.apache.org/docs/run-length.html
ORC Index (2017) ORC Index. https://orc.apache.org/docs/spec-index.html
Parquet Encodings (2017) Parquet Encodings. https://github.com/apache/parquet-format/blob/master/Encodings.md
Snappy Compression (2017) Snappy Compression. https://en.wikipedia.org/wiki/Snappy_(compression)
Vertica (2017) Vertica. https://www.vertica.com/
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. NSDI, USENIX
ZLIB Compression (2017) ZLIB Compression. https://en.wikipedia.org/wiki/Zlib

Complex Event Processing

Pattern Recognition

Component Benchmark

Klaus-Dieter Lange and David L. Schmidt
Hewlett Packard Enterprise, Houston, TX, USA

Synonyms

Microbenchmark; Worklet

Overview

Component benchmarks are valuable tools for isolating and analyzing the performance characteristics of specific subsystems within a server
environment. This chapter will provide historical context for component benchmarks, brief descriptions of the most commonly used benchmarks, and a discussion of their uses in the field.

Definitions

Software that utilizes standard applications, or their core routines that focus on a particular access pattern, in order to measure the performance and/or efficiency of one of the primary server and storage components (CPU transactions, memory, and storage/network IO).

Historical Background

From the earliest days of digital computer systems, the goal of quantifying, improving, and optimizing computational performance has been a subject of great interest. The earliest measures of performance were comparisons of low-level instruction execution times. A mix of several of these execution times would be combined to produce an overall rating that could be compared between systems. The most well-known of these early benchmarks was the Gibson mix, devised by Jack Gibson of IBM (Gibson 1970). As high-level computer languages were developed, more complex applications were created to develop more rigorous methods of measuring a system's performance. Whetstone (Longbottom 2014) and Dhrystone (Weiss 2002) are both early examples of high-level language benchmarks.

As computer systems became more complex, benchmarks were developed to measure the performance of the separate components of the system, such as memory, floating-point arithmetic coprocessors, and data IO. Most of these early benchmarks utilized synthetic workloads and were usually provided to users as source code that needed to be compiled on the system of interest. This allowed such benchmarks to be used on multiple platforms, but there was generally no set standard on how to compile such benchmarks or how to compare results between different architectures, which limited the scope of their use. Industry-standard consortia such as the Transaction Processing Performance Council (TPC, http://www.tpc.org) and the Standard Performance Evaluation Corporation (SPEC, https://www.spec.org) were established in the mid-1980s in an effort to provide fair standards for measuring system-level and component-level performance of differing platforms. Several of the benchmarks developed by these consortia are intended to measure component-level performance.

Foundations

Component benchmarks are useful tools to analyze IT equipment and troubleshoot performance bottlenecks. They utilize standard applications, or their core routines that focus on a particular access pattern, in order to measure the performance of one of the primary server and storage components such as CPU transactions, memory access, storage IO, or network IO. By limiting their scope, component benchmarks achieve a deeper analysis of the targeted subsystem. At the same time, component benchmarks are often easier to develop than benchmarks that stress the complete server environment, and they are typically less expensive to run because they require less equipment and fewer engineering resources.

There can be an overlap in what is considered a component benchmark vs. a microbenchmark. A microbenchmark is typically more limited in scope than a component benchmark, focusing on the performance of a single feature of a component rather than the entire component. Microbenchmarks utilize synthetic workloads rather than actual user applications to measure the performance of their intended resource. Microbenchmarks are further described in a later chapter.

Following is a description of the most common component benchmarks, categorized by their target subsystem in a Big Data environment.
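To make the shape of such a measurement concrete, the following is a minimal, illustrative sketch of a memory component benchmark modeled loosely on the Copy kernel of bandwidth benchmarks such as STREAM. The function name and buffer sizes are our own choices, not part of any standard tool; a real measurement would use compiled code and buffers far larger than any processor cache.

```python
import time

def copy_bandwidth_mb_s(n_bytes=32 * 2**20, repeats=3):
    """Time a full pass over a large buffer and report sustained
    memory bandwidth in MB/s (illustrative sketch only)."""
    src = bytearray(n_bytes)              # buffer larger than typical caches
    best = float("inf")
    for _ in range(repeats):              # keep the best of several runs
        t0 = time.perf_counter()
        dst = bytes(src)                  # reads src once, writes dst once
        best = min(best, time.perf_counter() - t0)
    traffic = 2 * n_bytes                 # bytes read plus bytes written
    return traffic / best / 1e6

bw = copy_bandwidth_mb_s()
```

Reporting the best of several repetitions, as real bandwidth benchmarks do, reduces the influence of transient system activity on the result.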
puters. The LINPACK benchmark uses the application areas: database operations,
LINPACK software library to solve a set of software builds, video data acquisition
predetermined set of n by n linear algebra (VDA), virtual desktop infrastructure
equations in order to measure a system’s (VDI), and electronic design automation
floating-point computing performance, (EDA). Each workload is independent of
measured in floating point operations the others and reports the throughput (in
per second (FLOPS). The benchmark MB/s) and overall response time (in msec)
workload is almost exclusively floating- for a given number of workload instances.
point based, so is an excellent stressor of the processors' math and vector instruction sets. There have been several versions of LINPACK benchmarks since their creation in 1979, which include LINPACK 100 (n = 100), LINPACK 1000 (n = 1000), and HPLinpack (a highly parallelized version that can run across multiple systems). HPL is a portable implementation of the HPLinpack benchmark written in C. Precompiled LINPACK and HPL are available for certain system architectures.

Memory

STREAM (https://www.cs.virginia.edu/stream; McCalpin 1995) is a simple synthetic benchmark that measures sustainable memory bandwidth, reported in MB/s. The benchmark is specifically designed to work with datasets much larger than the available processor cache on any given system, so that the results are more indicative of the performance of a very large, vector-style application. The benchmark yields four metrics, representing different memory operations: Copy, Add, Scale, and Triad. The benchmark has C and FORTRAN versions available, which can run either in a single-threaded or distributed fashion. A precompiled version of this benchmark is often included in component benchmark suites.

Storage IO

SPEC SFS2014 (https://www.spec.org/sfs2014) is a benchmark suite which measures file server throughput and response time. The SPEC SFS2014 SP2 benchmark includes five workloads which measure storage performance in different application scenarios. Netmist is the load generator, allowing the benchmark suite to be run over any version of NFS, SMB/CIFS, object-oriented, local, or other POSIX-compatible file systems.

Iometer (http://www.iometer.org) is a tool which measures the performance and characterization of I/O components for single and clustered systems. It is based on a client-server model where a single client controls one or more manager drivers, which generate I/O using one or more worker threads. Iometer performs asynchronous I/O and can access files or block devices. The tool is very flexible and allows the user to adjust the workload by configuring target disk parameters (e.g., disk size, starting sector, number of outstanding I/Os) and the workload access specifications (e.g., transfer request size, percent random/sequential distribution, percent read/write distribution, aligned I/Os, reply size, TCP/IP status, burstiness). All of the output data is collected into a CSV file for easy manipulation. One of the more commonly utilized output parameters is I/O operations per second (IOPS).

IOzone (http://www.iozone.org/docs/IOzone_msword_98.pdf) is a file system benchmark utility which generates and measures a variety of file operations. These file operations include read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read/write, pread/pwrite variants, aio_read, aio_write, and mmap. The results can be exported into useful graphs which show performance characteristics and bottlenecks of the disk I/O subsystem. The benchmark is written in C and can be run under many operating systems.
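The four STREAM kernels are easy to mimic in software. The sketch below uses NumPy as a stand-in for the benchmark's compiled C/FORTRAN loops; the array size and the byte accounting are illustrative assumptions, not STREAM's official run rules:

```python
import time
import numpy as np

n = 2_000_000                       # in STREAM proper, arrays must dwarf the caches
a = np.random.rand(n)
b = np.random.rand(n)
c = np.random.rand(n)
a0 = a.copy()                       # kept only so the results can be checked
scalar = 3.0

def report(label, op, bytes_moved):
    start = time.perf_counter()
    op()
    elapsed = time.perf_counter() - start
    print(f"{label:5s} {bytes_moved / elapsed / 1e6:9.1f} MB/s")

# the four kernels; 8 bytes per double, counting both reads and writes
report("Copy",  lambda: np.copyto(c, a),               2 * 8 * n)  # c = a
report("Scale", lambda: np.multiply(a, scalar, out=b), 2 * 8 * n)  # b = s*a
report("Add",   lambda: np.add(a, b, out=c),           3 * 8 * n)  # c = a+b
report("Triad", lambda: np.add(b, scalar * c, out=a),  3 * 8 * n)  # a = b+s*c
```

The reported rate is bytes moved divided by wall-clock time, which is why the kernels count reads and writes separately; real STREAM additionally repeats each kernel and takes the best time.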
Component Benchmark 473
positions of all the suffixes of T in lexicographic order. Assume a long string S[1..k] appears s times in T. Then the s suffixes starting with S[1..k] appear in a contiguous range A[p1..p1+s−1]. It is likely that the suffixes starting with S[2..k] also appear together in another range A[p2..p2+s−1], in the same order as those in A[p1..p1+s−1]. The same is likely to occur for any S[k'..k] as long as k−k' is sufficiently large. Such a regularity shows up in the function Ψ of CSAs, Ψ[i] = A⁻¹[A[i]+1], since the values in Ψ[p1..p1+s−1] will point to [p2..p2+s−1], and so on, thereby inducing runs of 1s in Ψ'[i] = Ψ[i] − Ψ[i−1], which is the main component of CSAs. It also shows up in the Burrows-Wheeler Transform (BWT) of T, BWT[i] = T[A[i]−1], which is the main component of the FM-index: BWT[p2..p2+s−1] will be a run of copies of S[1], and so on.

Mäkinen et al. (2010) took the first step toward adapting CSAs and FM-indexes to handle repetitive texts better, by run-length compressing the runs of 1s in Ψ' or the runs of the same letters in the BWT. The number of runs in Ψ' is almost the same as the number r of runs in the BWT. Their run-length compressed CSA and FM-index then use O(r) space, often much less than a standard CSA or FM-index, while still counting patterns' occurrences quickly.

Until very recently it was not known how to compress similarly the suffix-array sample with which we can locate patterns' occurrences in T, without making those locating queries extremely slow. However, Gagie et al. (2018) have now introduced a new sampling scheme that takes O(r) space and lets us locate each occurrence with essentially only a predecessor query. Specifically, they can count the number of occurrences of a pattern of length m in O(m log log n) time and then locate each occurrence in O(log log n) time.

Lempel-Ziv and Grammar-Based Indexes

Kärkkäinen and Ukkonen (1996) showed how to store an O(z)-space data structure in addition to the text T, where z is the number of phrases in the LZ77 parse of T, such that locating all the occurrences of a pattern of length m takes O(m² + m log z + mz log m) time plus O(log z) time per occurrence. They divided the occurrences into two types, called primary and secondary: the former start in one phrase and end in another (or at the next phrase boundary), and the latter are completely contained within single phrases. Their idea was to store a pair of Patricia trees, one for the reversed phrases in the parse and the other for the suffixes starting at phrase boundaries. By dividing a pattern every possible way into a nonempty prefix and a (possibly empty) suffix and searching for the reversed prefix in the first tree and the suffix in the second, they obtain a pair of Patricia tree nodes or, equivalently, a pair of ranges in the lexicographically sorted reversed prefixes and suffixes. Using an O(z)-space data structure for four-sided two-dimensional range reporting, they obtain all the primary occurrences. They use access to the text to verify the occurrences. Using another O(z)-space data structure that represents the structure of the LZ77 parse, they can then obtain all the secondary occurrences from the primary ones.

Researchers are still following Kärkkäinen and Ukkonen's basic design. The major alterations have aimed at a compressed representation of the text, which is essential in the repetitive scenario. Compressed representations, which must offer direct access in order to support searches, have been built on the LZ77 parse or a variant of it (Kreft and Navarro 2013; Do et al. 2014; Navarro 2017), a context-free grammar (Claude and Navarro 2011, 2012; Gagie et al. 2012, 2014; Bille et al. 2017), a run-length compressed FM-index without samples (which also replaces the Patricia trees) (Belazzougui et al. 2015, 2017), and CDAWGs (Belazzougui et al. 2015, 2017). Another improvement has been to substitute z-fast tries for the Patricia trees, which allows removing the quadratic dependence on m in the query time (Gagie et al. 2014; Bille et al. 2017).

There are also indexes based on edit-sensitive parsing (Maruyama et al. 2013; Takabatake et al. 2014, 2016) and locally consistent parsing (Nishimoto et al. 2016), although the former has poor worst-case query time and seems practical only for long patterns, and the latter is complicated and competitive in theory only when the text is dynamic.
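The run-length effect that these indexes exploit is easy to observe with a toy construction. The sketch below builds the BWT naively, via quadratic-time suffix sorting with a sentinel assumed smaller than every text symbol; production indexes construct the BWT very differently, so this is for intuition only:

```python
import random

def bwt(t):
    """Naive Burrows-Wheeler transform via the suffix array A:
    BWT[i] = T[A[i]-1], with a terminator smaller than all symbols."""
    t += "\x00"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in sa)   # t[-1] handles the wraparound

def runs(s):
    """Number r of maximal runs of equal letters."""
    return 1 + sum(s[i] != s[i - 1] for i in range(1, len(s)))

repetitive = "to be, or not to be. " * 40
random.seed(7)
shuffled = "".join(random.sample(repetitive, len(repetitive)))

# same symbol frequencies, very different run counts
print(runs(bwt(repetitive)), runs(bwt(shuffled)))
```

Because the repetitive text has few distinct contexts, its BWT collapses into far fewer runs than the BWT of the shuffled text, which is exactly what makes O(r) space attractive for repetitive collections.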
Compressed Indexes for Repetitive Textual Datasets 477
No structure using O(z) space offers good worst-case query times, because it is not known how to support fast access to the text within this space. The most recent theoretical bounds are due to Bille et al. (2017), who showed how we can store an O(z log(n/z) log log z)-space data structure such that later we can locate all the occurrences of a pattern of length m in O(m) time plus O(log log z) time per occurrence. The implementation of Kreft and Navarro (2013) uses O(z) space and offers the best compression, with query times that are competitive in practice (despite their O(m²z) worst-case bound).

Several authors (Schneeberger et al. 2009; Wandelt et al. 2013; Ferrada et al. 2014; Procházka and Holub 2014; Rahn et al. 2014) have independently proposed ideas essentially equivalent to finding the primary occurrences using an FM-index built on the substrings of length 2ℓ centered around phrase boundaries, where ℓ is a parameter. Gagie and Puglisi (2015) surveyed early work in this direction, and others (Danek et al. 2014; Wandelt and Leser 2015; Valenzuela 2016; Valenzuela and Mäkinen 2017; Ferrada et al. 2018) have given new implementations. These indexes are practical, though they have good worst-case query times only for patterns of length m ≤ ℓ. Increasing ℓ at construction worsens compression, whereas searching for patterns longer than ℓ requires the use of heuristics.

O(m log log n) time plus O(1) time per occurrence (Belazzougui et al. 2015). Its disadvantage is that e is always larger than r and z, by a wide margin in practice, and hence CDAWGs tend to be much larger than the structures described above. The fastest variant (Belazzougui and Cunial 2017) uses O(e + ē) space (where ē is the measure e of the reversed text) and answers locating queries in optimal time: O(m) and then O(1) per occurrence.

An advantage of CDAWGs is that they can be augmented to support suffix tree functionality (Belazzougui and Cunial 2017; Takagi et al. 2017), which is more powerful than just counting and locating pattern occurrences and of interest in bioinformatics applications. The CDAWG implements a number of suffix tree navigation operations in O(log n) time. Other compressed suffix trees for repetitive collections (Abeliuk et al. 2013; Navarro and Ordóñez 2016) build on a run-length encoded CSA or FM-index and apply grammar compression to the extra suffix tree components: the suffix tree's topology and the longest common prefix array. Like the CDAWG, they profit from repetitiveness but are significantly larger than their underlying compressed suffix array. The only compressed suffix tree of this kind offering good space guarantees (Gagie et al. 2017b) uses O(r log(n/r)) space, and implements the suffix tree operations in time O(log(n/r)).
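For intuition about the measure z used throughout this section, a greedy LZ77 parse can be counted directly. This is an illustrative quadratic-time sketch, not how real parsers are implemented: each phrase is the longest prefix of the remaining suffix that also occurs starting before the current position (overlaps allowed), plus one literal character, with the final phrase possibly lacking the literal:

```python
def lz77_phrase_count(t):
    """Count the phrases z of a greedy LZ77 parse of t."""
    i, z = 0, 0
    while i < len(t):
        length = 0
        # extend while t[i:i+length+1] occurs starting at some j < i;
        # the search window t[0:i+length] permits overlapping sources
        while i + length < len(t) and \
                t.find(t[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += length + 1
        z += 1
    return z

# highly repetitive text collapses into very few phrases
print(lz77_phrase_count("ab" * 100))
```

On "ab" repeated 100 times, the parse is just "a", "b", and one long self-referential phrase covering the rest, which is why z grows with the amount of novel material rather than with n.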
Maciuca S, del Ojo Elias C, McVean G, Iqbal Z (2016) A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th workshop on algorithms in bioinformatics (WABI), pp 222–233
Mäkinen V, Navarro G, Sirén J, Välimäki N (2010) Storage and retrieval of highly repetitive sequence collections. J Comput Biol 17(3):281–308
Mäkinen V, Belazzougui D, Cunial F, Tomescu AI (2015) Genome-scale algorithm design: biological sequence analysis in the era of high-throughput sequencing. Cambridge University Press, Cambridge
Maruyama S, Nakahara M, Kishiue N, Sakamoto H (2013) ESP-index: a compressed index based on edit-sensitive parsing. J Discrete Algorithms 18:100–112
Na JC, Park H, Crochemore M, Holub J, Iliopoulos CS, Mouchard L, Park K (2013a) Suffix tree of alignment: an efficient index for similar data. In: Proceedings of the 24th international workshop on combinatorial algorithms (IWOCA), pp 337–348
Na JC, Park H, Lee S, Hong M, Lecroq T, Mouchard L, Park K (2013b) Suffix array of alignment: a practical index for similar data. In: Proceedings of the 20th symposium on string processing and information retrieval (SPIRE), pp 243–254
Na JC, Kim H, Park H, Lecroq T, Léonard M, Mouchard L, Park K (2016) FM-index of alignment: a compressed index for similar strings. Theor Comput Sci 638:159–170
Na JC, Kim H, Min S, Park H, Lecroq T, Léonard M, Mouchard L, Park K (2018) FM-index of alignment with gaps. Theor Comput Sci. https://doi.org/10.1016/j.tcs.2017.02.020
Navarro G (2017) A self-index on block trees. In: Proceedings of the 17th symposium on string processing and information retrieval (SPIRE), pp 278–289
Navarro G, Ordóñez A (2016) Faster compressed suffix trees for repetitive text collections. J Exp Algorithmics 21(1):article 1.8
Navarro G, Raffinot M (2002) Flexible pattern matching in strings – practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge, UK
Nishimoto T, Tomohiro I, Inenaga S, Bannai H, Takeda M (2016) Dynamic index and LZ factorization in compressed space. In: Proceedings of the Prague Stringology Conference (PSC), pp 158–170
Novak AM, Garrison E, Paten B (2017a) A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms Mol Biol 12(1):18:1–18:12
Novak AM et al (2017b) Genome graphs. Technical report 101378, bioRxiv
Ohlebusch E (2013) Bioinformatics algorithms: sequence analysis, genome rearrangements, and phylogenetic reconstruction. Oldenbusch Verlag, Bremen, Germany
Paten B, Novak AM, Eizenga JM, Garrison E (2017) Genome graphs and the evolution of genome inference. Genome Res 27(5):665–676
Procházka P, Holub J (2014) Compressing similar biological sequences using FM-index. In: Proceedings of the data compression conference (DCC), pp 312–321
Rahn R, Weese D, Reinert K (2014) Journaled string tree – a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics 30(24):3499–3505
Sadakane K (2000) Compressed text databases with efficient query algorithms based on the compressed suffix array. In: Proceedings of the 11th international symposium on algorithms and computations (ISAAC), pp 410–421
Sadakane K (2003) New text indexing functionalities of the compressed suffix arrays. J Algorithms 48(2):294–313
Schneeberger K, Hagmann J, Ossowski S, Warthmann N, Gesing S, Kohlbacher O, Weigel D (2009) Simultaneous alignment of short reads against multiple genomes. Genome Biol 10(9):R98
Sirén J (2017) Indexing variation graphs. In: Proceedings of the 19th workshop on algorithm engineering and experiments (ALENEX), pp 13–27
Sirén J, Välimäki N, Mäkinen V (2011) Indexing finite language representation of population genotypes. In: Proceedings of the 11th workshop on algorithms in bioinformatics (WABI), pp 270–281
Sirén J, Välimäki N, Mäkinen V (2014) Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans Comput Biol Bioinform 11(2):375–388
Takabatake Y, Tabei Y, Sakamoto H (2014) Improved ESP-index: a practical self-index for highly repetitive texts. In: Proceedings of the 13th symposium on experimental algorithms (SEA), pp 338–350
Takabatake Y, Nakashima K, Kuboyama T, Tabei Y, Sakamoto H (2016) siEDM: an efficient string index and search algorithm for edit distance with moves. Algorithms 9(2):26
Takagi T, Goto K, Fujishige Y, Inenaga S, Arimura H (2017) Linear-size CDAWG: new repetition-aware indexing and grammar compression. In: Proceedings of the 24th symposium on string processing and information retrieval (SPIRE), pp 304–316
Valenzuela D (2016) CHICO: a compressed hybrid index for repetitive collections. In: Proceedings of the 15th symposium on experimental algorithms (SEA), pp 326–338
Valenzuela D, Mäkinen V (2017) CHIC: a short read aligner for pan-genomic references. Technical report 178129, bioRxiv.org
Wandelt S, Leser U (2015) MRCSI: compressing and searching string collections with multiple references. Proc VLDB Endowment 8(5):461–472
Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. Proc VLDB Endowment 6(13):1534–1545
Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343
Computer Architecture for Big Data 481
The term "non-von" (Shaw 1982) was coined to characterize a large category of machines that relaxed one or more of the defining features of the von Neumann architecture, so as to alleviate some of the perceived problems. Use of cache memories (Smith 1982), often in multiple levels, eased the von Neumann bottleneck for a while, but the bottleneck reemerged, as the higher cache data transfer bandwidth became inadequate and applications that lacked or had relatively limited locality of reference emerged. Memory interleaving and memory-access pipelining, pioneered by IBM (Smotherman 2010) and later used extensively in Cray supercomputers, was the next logical step.

Extrapolating a bit from Fig. 1 (which covers the period 1985–2010), and using round numbers, the total effect of architectural innovations has been a 100-fold gain in performance, on top of another factor-of-100 improvement due to faster gates and circuits (Danowitz et al. 2012). Both growth rates in Fig. 1 show signs of slowing down, so that future gains to support the rising processing need of big data will have to come, at least in part, from other sources. In the technology arena, use of emerging technologies will provide some boost for specific applications. An intriguing option is resurrecting hybrid digital/analog computing, which was sidelined long ago in favor of all-digital systems. Architecturally, specialization is one possible avenue for maintaining the performance growth rate, as are massively parallel and in-memory or near-memory computing.

How Big Data Affects Computer Architecture

The current age of big data (Chen and Zhang 2014; Hu et al. 2014) has once again exposed the von Neumann bottleneck, forcing computer architects to seek new solutions to the age-old problem, which has become much more serious. Processing speed continues to rise exponentially, while memory bandwidth increases at a much slower pace.

It is by now understood that big data is different from "lots of data." It is sometimes defined in terms of the attributes of volume, variety, velocity, and veracity, known as the "4Vs" (or "5Vs," if we also include value). Dealing with big data requires big storage, big-data processing capability, and big communication bandwidth. The first two (storage and data processing) directly affect the architecture of the nodes holding
Computer Architecture for Big Data, Fig. 1 Technology advances and architectural innovations each contributed a
factor of 100 improvement in processor performance over three decades. (Danowitz et al. 2012)
and processing the data. The part of communication that is internode is separately considered in this encyclopedia. However, there is also the issue of intranode communication represented in buses and networks-on-chip that belong to our architectural discussion here.

In addition to data volume, the type of data to be handled is also changing from structured data, as reflected, for example, in relational databases, to semi-structured and unstructured data. While this change has some negative effects in terms of making traditional and well-understood database technologies obsolete, it also opens up the possibility of using scalable processing platforms made of commodity hardware as part of the cloud-computing infrastructure. Massive unstructured data sets can be stored in distributed file systems, such as the ones designed in connection with Hadoop (Shafer et al. 2010) or SQL/noSQL (Cattell 2011).

In addition to the challenges associated with rising storage requirements and data access bandwidth, the processing load grows with data volume because of various needs. These are:

• Encryption and decryption
• Compression and decompression
• Sampling and summarization
• Visualization and graphing
• Sorting and searching
• Indexing and query processing
• Classification and data mining
• Deduction and learning

The first four items above, which are different forms of data translation, are discussed in the next section. The other items, viz., data transformations, will be discussed subsequently.

Architectural Aids to Data Translations

Many important data translations are handled by endowing a general-purpose architecture with suitable accelerator units deployed as coprocessors. Such accelerators are ideally custom integrated circuits, whose designs are fully optimized for their intended functions. However, in view of rapid advances in capabilities, performance, and energy efficiency of field-programmable gate arrays (Kuon et al. 2008), a vast majority of modern accelerators reported in the literature are built on FPGA circuits.

Accelerators for encryption and decryption algorithms have a long history (Bossuet et al. 2013). The binary choice of custom-designed VLSI or general-purpose processing for cryptographic computations has expanded to include a variety of intermediate solutions which include the use of FPGAs and GPUs. The best solution for an application domain depends not only on the required data rates and the crypto scheme, but also on power, area, and reliability requirements.

Data compression (Storer 1988) allows us to trade processing time and resources for savings in storage requirements. While any type of data can be compressed (e.g., text compression), massive sizes of video files make them a prime target for compression. With the emergence of video compression standards (Le Gall 1991), much effort has been expended to implement the standards on special-purpose hardware (Pirsch et al. 1995), offering orders of magnitude speed improvement over general-purpose programmed implementations.

Both sampling and summarization aim to reduce data volume while still allowing the operations of interest to be performed with reasonable precision. An alternative to post-collection reduction of data volume is to apply compression during data collection, an approach that in the case of sensor data collection is known as compressive sensing (Baraniuk 2007). Compressive sensing, when applicable, not only saves on processing time but also reduces transmission bandwidth and storage requirements. There are some mathematical underpinnings common to compressive sensing techniques, but at the implementation level, the methods and algorithms are by and large application-dependent.

Data visualization (Ward et al. 2010) refers to the production of graphical representation of data for better understanding of hidden structures and relationships. It may provide the only reasonable hope for understanding massive amounts of data,
although machine learning is a complementary and competing method of late. Several visualization accelerators were implemented in the late 1990s (e.g., Scott et al. 1998), but the modern trend is to use FPGA and cluster-based methods.

Architectural Aids to Data Transformations

Sorting is an extremely important primitive that is time-consuming for large data sets. It is used in a wide array of contexts, which includes facilitating subsequent searching operations. It can be accelerated in a variety of ways, from building more efficient data paths and memory access schemes, in order to make conventional sorting algorithms run faster, to the extreme of using hardware sorting networks (Mueller et al. 2012; Parhami 1999).

Indexing is one of the most important operations for large data sets, such as those maintained and processed by Google. Indexing and query processing have been targeted for acceleration within large-scale database implementations (Casper and Olukotun 2014; Govindaraju et al. 2004). Given the dominance of relational databases in numerous application contexts, a variety of acceleration methods have been proposed for operations on such databases (e.g., Bandi et al. 2004). Hardware components used in realizing such accelerators include both FPGAs and GPUs.

Hardware aids for classification are as diverse as classification algorithms and their underlying applications. A prime example in Internet routing is packet classification (Taylor 2005), which is needed when various kinds of packets, arriving at extremely high rates, must be separated for appropriate handling. Modern hardware aids for packet classification use custom arrays for pipelined network processing, content-addressable memories (Liu et al. 2010), GPUs (Owens et al. 2008), or tensor processing units (Sato et al. 2017). Data mining, the process of generating new information by examining large data sets, has also been targeted for acceleration (Sklyarov et al. 2015), and it can benefit from similar highly parallel processing approaches. Also falling under such acceleration schemes are aids and accelerators for processing large graphs (Lee et al. 2017).

The earliest forms of deduction engines were theorem provers. An important application of automatic deduction and proof is in hardware verification (Cyrluk et al. 1995). In recent years, machine learning has emerged as an important tool for improving the performance of conventional systems and for developing novel methods of tackling conceptually difficult problems. Game-playing systems (Chen 2016) constitute important testbeds for evaluating various approaches to machine learning and their associated hardware acceleration mechanisms. This is a field that has just started its meteoric rise and bears watching for future applications.

Memory, Processing, and Interconnects

In both general-purpose and special-purpose systems interacting with big data, the three interconnected challenges of providing adequate memory capacity, supplying the requisite processing power, and enabling high-bandwidth data movements between the various data-handling nodes must be tackled (Hilbert and Lopez 2011).

The memory problem can be approached using a combination of established and novel technologies, including nonvolatile RAM, 3D stacking of memory cells, processing in memory, content-addressable memory, and a variety of novel (nanoelectronics or biologically inspired) technologies. We won't dwell on the memory architecture in this entry, because the memory challenge is addressed in other articles (see the Cross-References).

Many established methods exist for increasing the data-handling capability of a processing node. The architectural nomenclature includes superscalar and VLIW organizations, collectively known as instruction-level parallelism (Rau and Fisher 1993), hardware multithreading (Eggers et al. 1997), multicore parallelism (Gepner and Kowalik 2006), domain-specific hardware accelerators (examples cited earlier in this entry),
transactional memory (Herlihy and Moss 1993), and SIMD/vector architectural or instruction-set extensions (Lee 1995; Yoon et al. 2001). A complete discussion of all these methods is beyond the scope of this entry, but much pertinent information can be found elsewhere in this encyclopedia.

Intranode communication is achieved through high-bandwidth bus systems (Hall et al. 2000) and, increasingly, for multicore processors and systems-on-chip, by means of on-chip networks (Benini and De Micheli 2002). Interconnection bandwidth and latency rank high, along with memory bandwidth, among hardware capabilities needed for effective handling of big-data applications, which are increasingly implemented using parallel and distributed processing. Considerations in this domain are discussed in the entry "Parallel Processing for Big Data."

Future Directions

As big data increasingly resides in the cloud, directions of cloud computing and associated hardware acceleration mechanisms become relevant to our discussion here (Caulfield et al. 2016). Advanced graphics processors (e.g., Nvidia 2016) will continue to play a key role in providing the needed computational capabilities for data-intensive applications requiring heavy numerical calculations. Application-specific accelerators for machine learning (Sato et al. 2017), and, more generally, various forms of specialization, constitute another important avenue of architectural advances for the big-data universe.

Cross-References

Energy Implications of Big Data
Parallel Processing with Big Data
Storage Hierarchies for Big Data
Storage Technologies for Big Data
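The hardware sorting networks cited earlier in this entry (Mueller et al. 2012; Parhami 1999) rely on a fixed, data-independent schedule of compare-exchange operations. A bitonic network, one common choice, can be sketched in software (assuming a power-of-two input; this illustrates the pattern, not any particular hardware design):

```python
def bitonic_sort(items, ascending=True):
    """Bitonic sorting network (len(items) must be a power of two).
    The compare-exchange schedule depends only on n, never on the data,
    which is what makes such networks attractive for hardware."""
    n = len(items)
    if n == 1:
        return items
    lower = bitonic_sort(items[:n // 2], True)     # ascending half
    upper = bitonic_sort(items[n // 2:], False)    # descending half
    return _bitonic_merge(lower + upper, ascending)

def _bitonic_merge(items, ascending):
    n = len(items)
    if n == 1:
        return items
    for i in range(n // 2):                        # one compare-exchange stage
        if (items[i] > items[i + n // 2]) == ascending:
            items[i], items[i + n // 2] = items[i + n // 2], items[i]
    return (_bitonic_merge(items[:n // 2], ascending)
            + _bitonic_merge(items[n // 2:], ascending))

print(bitonic_sort([7, 3, 1, 8, 2, 6, 5, 4]))
```

Each merge stage performs n/2 comparisons whose operand positions are known in advance, so the whole network can be laid out as combinational logic or executed with all comparisons of a stage in parallel.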
http://csl.stanford.edu/christos/publications/2012.21stcenturyarchitecture.whitepaper.pdf. Accessed 18 Feb 2018
Storer J (1988) Data compression. Computer Science Press, Rockville
Taylor DE (2005) Survey and taxonomy of packet classification techniques. ACM Comput Surv 37(3):238–275
von Neumann J (1945) First draft of a report on the EDVAC, University of Pennsylvania. On-line document. https://web.archive.org/web/20130314123032/http:/qss.stanford.edu/godfrey/vonNeumann/vnedvac.pdf. Accessed 14 Feb 2018
von Neumann J, Burks AW, Goldstine HH (1947) Preliminary discussion of the logical design of an electronic computing instrument. Institute for Advanced Study, Princeton
Ward MO, Grinstein G, Keim D (2010) Interactive data visualization: foundations, techniques, and applications. CRC Press, Natick
Wulf W, McKee S (1995) Hitting the wall: implications of the obvious. ACM Comput Archit News 23(1):20–24
Yoon C-W, Woo R, Kook J, Lee S-J, Lee K, Yoo H-J (2001) An 80/20-MHz 160-mW multimedia processor integrated with embedded DRAM, MPEG-4 accelerator, and 3-D rendering engine for mobile applications. IEEE J Solid State Circuits 36(11):1758–1767

Computer Security

Big Data for Cybersecurity

Computing Average Distance

Degrees of Separation and Diameter in Large Graphs

Computing the Cost of Compressed Data

Alistair Moffat and Matthias Petri
School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia

Synonyms

Data archiving; Data compression; Data retention

Definitions

Compression mechanisms reduce the storage cost of retained data. In the extreme case of data that must be retained indefinitely, the initial cost of performing the compression transformation can be amortized down to zero, since the savings in storage space continue to accrue without limit, albeit at decreasing rates as time goes by and disk storage becomes cheaper. A more typical scenario arises when a fixed data retention period must be supported, after which the stored data is no longer required; and when a certain level of access operations to the stored data can be expected, as part of a regulatory or compliance environment. In this second scenario, the total cost of retention (TCR) is a function of multiple competing factors, and the compression regime that provides the most compact storage might not be the one that provides the smallest TCR. This entry summarizes recent work in the area of cost models for data retention.

Overview

Data compression techniques have received extensive study over more than six decades. Entropy coding mechanisms such as Huffman, arithmetic, and asymmetric numeral system (ANS) coding have been used to support a range of modeling approaches, including those based on implicit and explicit dictionaries, those based on statistical prediction, those based on grammar identification, and those based on the Burrows-Wheeler transform. Witten et al. (1999) and Moffat and Turpin (2002) provide details of many such combinations; the ANS mechanism is more recent (Duda 2009, 2013; Moffat and Petri 2017). Each possible arrangement of model and coder represents another option for use by practitioners, with a wide range of trade-offs possible. Compression systems are typically compared in two quite different ways, based on the size of the compressed data (the effectiveness of the approach) and based on the computational resources required for the encoding
and decoding transformation (the efficiency of the approach). Moreover, it is often the case that effectiveness and efficiency are in tension – that the most effective techniques are also the ones that are least efficient and vice versa. For example, software implementations of compression tools often include an optional parameter (in several cases, "–1" to "–9") to indicate the effort that should be used during encoding, with better compression resulting from greater effort; and principled trade-off mechanisms have also been proposed (Farruggia et al. 2014).

Hence, if a choice must be made between competing alternatives (compression programs or option settings) for use in some particular application or situation, a joint overall cost must be developed based on all relevant factors and used to measure the total cost of retention. This entry summarizes recent work by Liao et al. (2017) that describes such a cost model as a response to a service-level agreement and provides a synopsis of their key findings.

Data Storage and Retention

An archiving service stores different kinds of objects such as files, log entries, or variable-size data blobs. Adding objects to the storage system is generally referred to as a PUT request; and to retrieve objects, a GET request specifying one or more objects is issued to the storage system.

The objects stored by the archiving system can be viewed as a stream of bytes arriving at a constant rate when averaged over time. Liao et al. (2017) consider the case of data that arrives at an ingest rate of GiB/day, is subject to some obligatory minimum retention period of LD days, and while retained must be available to GET requests assumed to occur at the rate of q queries per stored GiB per day. Each GET request for an object that was ingested during some particular day less than LD days in the past is translated to a byte location within that day's data, and a request for the corresponding object (some number of bytes or kibibytes) is then issued, so that the object can be retrieved and delivered. For example, log data or transaction data might accumulate through some business process and be required to be retained for some minimum compliance period as part of a regulatory framework; and small fragments of that data might be required to be available on demand for audit or verification purposes. Such data retention regimes would typically be the subject of a service-level agreement, or SLA, including requirements in regard to the maximum delay LW between ingest of any data and when it becomes available for querying operations, in regard to the maximum latency LR permissible for any access operations, and in regard to other such performance and reliability requirements.

A cloud provider offering a data storage service naturally seeks to minimize the overall cost of providing the service within the constraints of the SLA. Compression is a key technology in this regard, since for long retention periods, the flagfall cost of compressing the data can almost certainly be recouped via decreased storage costs – provided, that is, the required GET operations can also be supported within the constraints imposed by the SLA.

Cost Model for Data Retention

Liao et al. (2017) propose that the total cost of a storage system be subdivided into four categories broadly aligned with the components measured by most cloud system providers: (a) permanent storage, (b) working memory, (c) compute cycles, and (d) I/O operations. Each of these categories is then measured and billed over a specific time period, with the units of each category converted into a common currency of dollars. For example, consuming 50 GiB of permanent storage for 12 months might be billed at $0.30 per GiB-month for a total storage cost of $180; and the compute cycles necessary to support GET operations on that data might cost a further $50 over the course of the year.

The service provider can be assumed to form data bales at daily or sub-daily intervals during the ingestion period, each bale assembled from multiple PUT operations. Once accumulated, the bale is then processed as a sequence of inde-
Computing the Cost of Compressed Data 489
LW LD
D Legend
C
resource cost, $ per day
C Disk storage
B Memory
CPU cycles
A
F
E
D
t0 t1 t2 t3
time, days
Computing the Cost of Compressed Data, Fig. 1 The construction, if applicable; (C) cost of memory space for
bale-based data retention regime described by Liao et al. index construction, if applicable; (D) cost of disk space
(2017) (Figure 2 of that paper, with copyright held by the for retained data; (E) cost of CPU for decoding and access
original four authors). The six marked zones are: (A) cost operations during retention period; (F) cost of memory
of disk space for incoming data; (B) cost of CPU for index space during retention period, if applicable
pendently retrievable blocks, with the bale as a cost (D) for the remaining lifetime of the bale. To
whole following the life cycle shown in Fig. 1. A service GET requests arriving at an average rate of
generic compression algorithm, such as gzip, or a q queries per day per GiB of stored data, the sys-
domain specific algorithm, such as MPEG, might tem consumes additional compute and memory
be applied to each block, to reduce the amount resources (E), (F) in order to decompress blocks.
of permanent storage required while that block Additional IO costs may arise when transferring
is held through the retention period. Ideally a data to/from permanent storage and to/from the
compression scheme which minimizes the overall storage system itself.
balance between storage cost and resource expen-
diture should be chosen, subject to the constraints Implications of the Cost Model
set by the SLA parameters. For example, the
maximum write delay LW and the maximum Liao et al. measure encoding and decoding
read latency LR set constraints on the encoding speeds for compression mechanisms in four
and decoding speed of the compression algorithm broad groups:
and/or the size of the blocks that can be al-
lowed. Similarly, the choice of permanent storage • No compression at all (denoted as method
has implications on the types of compression none);
algorithms which can comply with the SLA and • Fast encoding and decoding, based on
also on the cost involved. For example, high- an adaptive model and modest resource
latency permanent storage devices will require a consumption, and with reasonably good
compression algorithm with fast decompression compression effectiveness achieved at
speed, in order to handle GET request within LR . even relatively short block lengths (taking
That is, each possible algorithm requires a cer- zlib, https://github.com/madler/zlib, as an
tain amount of compute and memory resources exemplar);
(zones (B) and (C) in Fig. 1) to convert incoming • Excellent compression effectiveness when
bales. After the bale is converted, it resides on long blocks can be permitted, at the cost
secondary storage, incurring a constant storage of more expensive encoding and decoding
490 Computing the Cost of Compressed Data
(taking xz, http://tukaani.org/xz/, as an retention cost (TRC, in dollars per 64 GiB bale)
exemplar); and is plotted on the vertical axis as a function of
• RLZ, a semi-static mechanism designed for cumulative retention time, shown on the horizon-
fast decoding and good compression regard- tal axis, for durations from 1 day to a little over
less of block size but requiring a per-bale 1 year. In the first pane, a low query rate q D 1
memory-resident dictionary throughout each query per day per stored GiB is assumed. In this
bale’s retention period (Hoobin et al. 2011; arrangement, relatively large blocks of 1 MiB can
Petri et al. 2015) (taking https://github.com/ be employed as the encoding unit, and as a result
mpetri/rlz-store as an exemplar). xz achieves excellent compression rates, close to
those of the semi-static RLZ mechanism. More-
Liao et al. also provide compression effective- over, the 8 MiB allocation of memory required
ness rates achieved for typical web data using a by RLZ (to be precise, 8 MiB in the configuration
range of block sizes, highlighting the fact that plotted; dictionary size is another dimension that
for the adaptive zlib and xz approaches, compres- affects the comparison) is a cost drain that is
sion effectiveness diminishes as blocks are made not recouped even though RLZ decoding is faster.
smaller. Hence, for all but short retention durations, where
Figure 2 takes that data and applies it to the encoding speed of zlib is an advantage, xz
two different scenarios. In both graphs the total provides the smallest TRC.
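To make the cost model concrete, the following sketch combines the storage, memory, and compute categories into a single TRC figure for one bale. All prices and method parameters here are hypothetical, chosen only for illustration; they are not the mid-2016 figures measured by Liao et al. (2017).

```python
# Hypothetical prices, for illustration only.
STORAGE_PER_GIB_MONTH = 0.03   # $ per GiB-month of disk
MEMORY_PER_GIB_MONTH = 3.00    # $ per GiB-month of RAM

def total_retention_cost(days, compression_ratio, dict_mem_gib=0.0,
                         queries_per_gib_day=1.0, cpu_cost_per_query=1e-5,
                         bale_gib=64.0):
    """Total retention cost, in dollars, for one bale.

    compression_ratio: stored size / raw size.
    dict_mem_gib:      memory pinned per bale (e.g., an RLZ dictionary).
    """
    months = days / 30.0
    storage = bale_gib * compression_ratio * STORAGE_PER_GIB_MONTH * months
    memory = dict_mem_gib * MEMORY_PER_GIB_MONTH * months
    compute = bale_gib * queries_per_gib_day * cpu_cost_per_query * days
    return storage + memory + compute

# At a low query rate, an xz-like method (excellent compression, no
# resident dictionary) undercuts an RLZ-like method that compresses
# equally well but pins an 8 MiB dictionary in memory per bale.
xz_like = total_retention_cost(365, compression_ratio=0.15)
rlz_like = total_retention_cost(365, compression_ratio=0.15,
                                dict_mem_gib=8 / 1024,
                                cpu_cost_per_query=5e-6)
print(xz_like < rlz_like)   # True
```

With these illustrative numbers, flipping the comparison simply requires raising the query rate, which mirrors the trade-off discussed for the two panes of Figure 2.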
[Figure 2: total retention cost (TRC, $ per 64 GiB bale, axis ticks from 0.10 to 2.00) against retention time, comparing the methods none, zlib, xz, and RLZ]
In the lower pane, the query rate is considerably increased, to q = 256 GET requests per stored GiB per day, and the block sizes need to shorten (the value b = 64 KiB is used in the plot) so that decoding time doesn't swamp the other costs. Assuming that each GET request is to a different block, that query rate and block size mean that around 1.5% of the bale is decoded each day, and decoding throughput becomes a determining factor. Now RLZ's dictionary memory pays for itself, and it provides the best TRC, a combination of fast decoding and excellent compression effectiveness.

Based on the data they collected (including, as is also used in Figure 2, cost information from mid-2016 for a major cloud services provider), Liao et al. draw broad conclusions in regard to the efficacy of the various options:

• if the retention period is short, then zlib is the most economical choice, using a block size that decreases as the query rate increases;
• if the query rate is low, and if memory is expensive relative to secondary storage, then xz should be preferred; and
• if the query rate is high, and/or if the memory-to-disk cost ratio is more moderate, then RLZ-style approaches should be employed.

References

Duda J (2009) Asymmetric numeral systems. CoRR abs/0902.0271
Duda J (2013) Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding. CoRR abs/1311.2540
Farruggia A, Ferragina P, Venturini R (2014) Bicriteria data compression: efficient and usable. In: Proceedings of the European symposium on algorithms (ESA), pp 406–417
Hoobin C, Puglisi SJ, Zobel J (2011) Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. PVLDB 5(3):265–273
Liao K, Moffat A, Petri M, Wirth A (2017) A cost model for long-term compressed data retention. In: Proceedings of the ACM international conference on web search and data mining (WSDM), pp 241–249
Moffat A, Petri M (2017) ANS-based index compression. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 677–686
Moffat A, Turpin A (2002) Compression and coding algorithms. Kluwer, Boston
Petri M, Moffat A, Nagesh PC, Wirth A (2015) Access time tradeoffs in archive compression. In: Proceedings of the Asia information retrieval societies conference (AIRS), pp 15–28
Witten IH, Moffat A, Bell TC (1999) Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, San Francisco

Confidentiality

Security and Privacy in Big Data Environment

Conflict-Free Replicated Data Types CRDTs

Definitions

A conflict-free replicated data type (CRDT) is an abstract data type, with a well-defined interface, designed to be replicated at multiple pro-
cesses and exhibiting the following properties: (i) any replica can be modified without coordinating with other replicas and (ii) when any two replicas have received the same set of updates, they reach the same state, deterministically, by adopting mathematically sound rules to guarantee state convergence.

Overview

Internet-scale distributed systems often replicate data at multiple geographic locations to provide low latency and high availability, despite outages and network failures. To this end, these systems must accept updates at any replica and propagate these updates asynchronously to the other replicas. This approach allows replicas to temporarily diverge and requires a mechanism for merging concurrent updates into a common state. CRDTs provide a principled approach to address this issue.

As any abstract data type, a CRDT implements some given functionality and exposes a well-defined interface. Applications interact with the CRDT only through this interface. As CRDTs are designed to be replicated and to allow uncoordinated updates, a key aspect of a CRDT is its semantics in the presence of concurrency. The concurrency semantics defines the behavior of the object in the presence of concurrent updates, defining the state of the object for any given set of received updates.

An application developer uses the CRDT interface and concurrency semantics to reason about the behavior of her application in the presence of concurrent updates. A system developer creating a system that provides CRDTs needs to focus on another aspect of CRDTs: the synchronization model. The synchronization model defines the requirements that the system must meet so that CRDTs work correctly. We now detail each of these aspects independently.

Concurrency Semantics
The operations defined in a data type may intrinsically commute or not. Consider, for instance, a counter data type, a shared integer that supports increment and decrement operations. As these operations commute (i.e., executing them in any order yields the same result), the counter data type naturally converges toward the expected result and reflects all executed operations.

Unfortunately, for most data types, this is not the case, and several concurrency semantics are reasonable, with different semantics being suitable for different applications. For instance, consider a shared memory cell supporting the assignment operation. If the initial value is 0, the correct outcome for concurrently assigning 1 and 2 is not well defined.

When defining the concurrency semantics, an important concept that is often used is that of the happens-before relation (Lamport 1978). In a distributed system, an event e1 happened-before an event e2, written e1 → e2, iff (i) e1 occurred before e2 in the same process; or (ii) e1 is the event of sending message m, and e2 is the event of receiving that message; or (iii) there exists an event e such that e1 → e and e → e2. When applied to CRDTs, we can say that an update u1 happened-before an update u2 iff the effects of u1 had been applied in the replica where u2 was initially submitted.

As an example, if an event was Alice reserved the meeting room, it is relevant to know if that was known when Bob reserved the meeting room to determine if Alice should be given priority or if the two users concurrently tried to reserve the same room.

For instance, let us use happened-before to define the semantics of the add-wins set (also known as observed-remove set, OR-set (Shapiro et al. 2011)). Intuitively, in the add-wins semantics, in the presence of two operations that do not commute, a concurrent add and remove of the same element, the add wins, leading to a state where the element belongs to the set. More formally, the set interface has two update operations: (i) add(e), for adding element e to the set, and (ii) rmv(e), for removing element e from the set. Given a set of update operations O that are related by the happens-before partial order →, the state of the set is defined as {e | add(e) ∈ O ∧ ∄rmv(e) ∈ O · add(e) → rmv(e)}.
[Figure 1: timelines of two replicas of an add-wins set with initial state {a}; replica A executes rmv(a) and then add(a), replica B executes rmv(a), and after synchronization both replicas have state {a}, replica B passing through states {a}, {}, {a}]
Conflict-Free Replicated Data Types CRDTs, Fig. 1 Run with an add-wins set
Figure 1 shows a run where an add-wins set is replicated in two replicas, with initial state {a}. In this example, in replica A, a is first removed and later added again to the set. In replica B, a is removed from the set. After receiving the updates from the other replica, both replicas end up with the element a in the set. The reason for this is that there is no rmv(a) that happened after the add(a) executed in replica A.

An alternative semantics based on the happens-before relation is remove-wins. Intuitively, in remove-wins semantics, in the presence of a concurrent add and remove of the same element, the remove wins, leading to a state where the element is not in the set. More formally, given a set of update operations O, the state of the set is defined as: {e | add(e) ∈ O ∧ ∀rmv(e) ∈ O · rmv(e) → add(e)}. In the previous example, after receiving the updates from the other replica, the state of both replicas would be the empty set, because there is no add(a) that happened after the rmv(a) in replica B.

Another relation that can be useful for defining the concurrency semantics is that of a total order among updates and particularly a total order that approximates wall-clock time. In distributed systems, it is common to maintain nodes with their physical clocks loosely synchronized. When combining the clock time with a site identifier, we have unique timestamps that are totally ordered. Due to the clock skew among multiple nodes, although these timestamps approximate an ideal global physical time, they do not necessarily respect the happens-before relation. This can be achieved by combining physical and logical clocks, as shown by Hybrid Logical Clocks (Kulkarni et al. 2014), or by only arbitrating a wall-clock total order for the events that are concurrent under causality (Zawirski et al. 2016). This relation allows us to define the last-writer-wins semantics, where the value written by the last writer wins over the values written previously, according to the defined total order. More formally, with the set O of operations now totally ordered by <, the state of a last-writer-wins set would be defined as: {e | add(e) ∈ O ∧ ∀rmv(e) ∈ O · rmv(e) < add(e)}. Returning to our previous example, the state of the replicas after the synchronization would include a if, according to the total order defined among the operations, the rmv(a) of replica B is smaller than the add(a) of replica A. Otherwise, the state would be the empty set.

We now briefly introduce the concurrency semantics proposed for several CRDTs.

Set
For a set CRDT, we have shown the difference between three possible concurrency semantics: add-wins, remove-wins, and last-writer-wins.

Register
A register CRDT maintains an opaque value and provides a single update operation that writes an arbitrary value: wr(value). Two concurrency semantics have been proposed, leading to two different CRDTs: the multi-value register and the last-writer-wins register. In the multi-value register, all concurrently written values are kept. In this case, the read operation returns the set of concurrently written values. Formally, the state of a multi-value register is defined as the multi-set: {v | wr(v) ∈ O ∧ ∄wr(u) ∈ O · wr(v) → wr(u)}.
In the last-writer-wins register, only the value of the last write is kept, if any. Formally, the state of a last-writer-wins register can be defined as a set that is either empty or holds a single value: {v | wr(v) ∈ O ∧ ∄wr(u) ∈ O · wr(v) < wr(u)}, assuming some initial write.

Counter
A counter CRDT maintains an integer and can be modified by update operations inc and dec, to increase and decrease its value by one unit, respectively (this can easily generalize to arbitrary amounts). As mentioned previously, as operations intrinsically commute, the natural concurrency semantics is to have a final state that reflects the effects of all registered operations. Thus, the result state is obtained by counting the number of increments and subtracting the number of decrements: |{inc | inc ∈ O}| − |{dec | dec ∈ O}|.

Now consider that we want to add a write operation wr(n), to update the value of the counter to a given value. This opens two questions related with the concurrency semantics. First, what should be the final state when two concurrent write operations are executed? In this case, the last-writer-wins semantics would be simple (as maintaining multiple values, as in the multi-value register, is overly complex).

Second, what is the result when concurrent writes and inc/dec operations are executed? In this case, by building on the happens-before relation, we can define several concurrency semantics. One possibility is a write-wins semantics, where inc/dec operations have no effect when executed concurrently with the last write. Formally, for a given set O of updates that include at least a write operation, let v be the value in the last write, i.e., wr(v) ∈ O ∧ ∄wr(u) ∈ O · wr(v) < wr(u). The value of the counter would be v + o, with o = |{inc | inc ∈ O ∧ wr(v) → inc}| − |{dec | dec ∈ O ∧ wr(v) → dec}| representing the inc/dec operations that happened after the last write.

Other CRDTs
A number of other CRDTs have been proposed in the literature, including CRDTs for elementary data structures, such as lists (Preguiça et al. 2009; Weiss et al. 2009; Roh et al. 2011), maps (Brown et al. 2014; Almeida et al. 2018), graphs (Shapiro et al. 2011), and more complex structures, such as JSON documents (Kleppmann and Beresford 2017). For each of these CRDTs, the developers have defined and implemented a type-specific concurrency semantics.

Synchronization Model
A replicated system needs to synchronize its replicas, by propagating and applying updates in every replica. There are two main approaches to propagate updates: state-based and operation-based replication.

In state-based replication, replicas synchronize by establishing bi-directional (or unidirectional) synchronization sessions, where both (one, resp.) replicas send their state to a peer replica. When a replica receives the state of a peer, it merges the received state with its local state. As long as the synchronization graph is connected, every update will eventually propagate to all replicas.

CRDTs designed for state-based replication define a merge function to integrate the state of a remote replica. It has been shown (Shapiro et al. 2011) that all replicas of a CRDT converge if (i) the states of the CRDT are partially ordered according to ≤, forming a join semilattice; (ii) an operation modifies the state s of a replica by an inflation, producing a new state that is larger or equal to the original state according to ≤, i.e., for any operation m, s ≤ m(s); and (iii) the merge function computes the join (least upper bound) of two states, i.e., for states s, u, it computes s ⊔ u.

In operation-based replication, replicas converge by propagating operations to every other replica. When an operation is received in a replica, it is applied to the local replica state. Besides requiring that every operation is reliably delivered to all replicas, e.g., by using some reliable multicast communication subsystem, some systems may require operations to be delivered according to some specific order, with causal order being the most common.

CRDTs designed for operation-based replication must define, for each operation, a generator
and an effector function. The generator function executes in the replica where the operation is submitted, the source replica; it has no side effects and generates an effector that encodes the side effects of the operation. In other words, the effector is a closure created by the generator depending on the state of the origin replica. The effector operation must be reliably executed in all replicas, where it updates the replica state. Shapiro et al. (2011) have shown that if effector operations are delivered in causal order, replicas will converge to the same state if concurrent effector operations commute. If effector operations may be delivered without respecting causal order, then all effector operations must commute. Most operation-based CRDT designs require causal delivery.

Alternative models: When operations modify only part of the state, propagating the complete state for synchronization to a remote replica is inefficient, as the remote replica already knows most of the state. Delta-state CRDTs (Almeida et al. 2018) address this issue by propagating only delta mutators, which encode the changes that have been made to a replica since the last communication. The first time a replica communicates with some other replica, the full state needs to be propagated. This can be improved by using a state summary (e.g., version vector) for computing and sending only the deltas in the first communication also, as shown in big delta-state CRDTs (van der Linde et al. 2016), typically at the cost of storing more metadata in the CRDT state. Another improvement is to compute digests that help determine which parts of a remote state are needed, avoiding shipping full states (Enes 2017).

In the context of operation-based replication, effector operations should be applied immediately to the source replica. However, propagation to other replicas can be deferred for some period and effectors stored in an outbound log, presenting an opportunity to compress the log by rewriting some operations – e.g., two add(1) operations in a counter can be converted into an add(2) operation. This mechanism has been used by Cabrita and Preguiça (2017). Delta mutators can also be seen as a compressed representation of a log of operations.

Operation-based CRDTs require executing a generator function against the replica state to compute an effector operation. In some scenarios, this may introduce an unacceptable delay for propagating an operation. Pure operation-based CRDTs (Baquero et al. 2014) address this issue by allowing the original operations to be propagated to all replicas, typically at the cost of more complex operations and of having to store more metadata in the CRDT state.

Key Research Findings

Preservation of Sequential Semantics
When modeling an abstract data type that has an established semantics under sequential execution, CRDTs should preserve that semantics. For instance, CRDT sets should ensure that if the last operation in a sequence of operations to a set added a given element, then a query operation immediately after that one will show the element to be present on the set. Conversely, if the last operation removed an element, then a subsequent query should not show its presence.

Sequential execution can occur even in distributed settings if synchronization is frequent. An instance can be updated in replica A, merged into another replica B and updated there, and merged back into replica A before A tries to update it again. In this case, we have a sequential execution, even though updates have been executed in different replicas.

Historically, not all CRDT designs have met this property. The two-phase-set CRDT (2PSet) does not allow re-adding an element that was removed, and thus it breaks the common sequential semantics. Later CRDT set designs, such as add-wins and remove-wins sets, do preserve the original sequential semantics while providing different concurrency semantics.

Extended Behavior Under Concurrency
Some CRDT designs handle concurrent operations by arbitrating a given sequential ordering to accommodate concurrent execution. For example, the state of a last-writer-wins set replica can be explained by a sequential execution of the op-
[Figure 2: run over two replicas of an add-wins set starting from the empty set; replica B executes add(b) and then rmv(a), with its state passing through {}, {b}, {b}, and, after synchronization, {a,b}]

Conflict-Free Replicated Data Types CRDTs, Fig. 2 Add-wins set run showing that there might be no sequential execution of operations that explains CRDT behavior
erations according to the LWW total order used. When operations commute, such as in counters, there might even be several sequential executions that explain a given state.

Not all CRDTs need or can be explained by sequential executions. The add-wins set is an example of a CRDT where there might be no sequential execution of operations to explain the state observed, as Fig. 2 shows. In this example, the state of the set after all updates propagate to all replicas includes a and b, but in any sequential extension of the causal order, a remove operation would always be the last operation, and consequently the removed element could not belong to the set.

Some other CRDTs can exhibit states that are only reached when concurrency does occur. An example is the multi-value register. If used sequentially, sequential semantics is preserved, and a read will show the outcome of the most recent write in the sequence. However, if two or more values are written concurrently, the subsequent read will show all those values (as the multi-value name implies), and there is no sequential execution that can explain this result. We also note that a follow-up write can overwrite both a single value and multiple values.

Guarantees and Limitations
An important property of CRDTs is that an operation can always be accepted at any given replica and updates are propagated asynchronously to other replicas. In the CAP theorem framework (Brewer 2010; Gilbert and Lynch 2002), the CRDT conflict-free approach favors availability over consistency when facing communication disruptions. This leads to resilience to network failure and disconnection, since no prior coordination with other replicas is necessary before accepting an operation. Furthermore, operations can be accepted with minimal user-perceived latency since they only require local durability. By eschewing global coordination, replicas evolve independently, and reads will not reflect operations accepted in remote replicas that have not yet been propagated to the local replica.

In the absence of global coordination, session guarantees (Terry et al. 1994) specify what the user applications can expect from their interaction with the system's interface. Both state-based CRDTs and operation-based CRDTs, when supported by reliable causal delivery, provide per-object causal consistency. Thus, in the context of a given replicated object, the traditional session guarantees are met. CRDT-based systems that lack transactional support can enforce system-wide causal consistency, by integrating multiple objects in a single map/directory object (Almeida et al. 2018). Another alternative is to use mergeable transactions to read from a causally consistent database snapshot and to provide write atomicity (Preguiça et al. 2014).

Some operations cannot be expressed in a conflict-free framework and will require global agreement. As an example, in an auction system, bids can be collected under causal consistency, and a new bid will only have to increase the offer with respect to bids that are known to causally precede it. However, closing the auction and selecting a single winning bid will require global agreement. It is possible to design a system that integrates operations with different coordination requirements and only resorts to global agreement when necessary (Li et al. 2012; Sovran et al. 2011).
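The availability-first behavior described in this section, where every replica accepts updates locally and replicas converge once state is exchanged, can be sketched with a state-based grow-only counter. The code is an illustrative toy, not a production design; it follows the join-semilattice conditions from the Synchronization Model section: inc is an inflation, and merge computes a least upper bound as a pointwise maximum.

```python
class GCounter:
    """State-based grow-only counter: one entry per replica."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}              # per-replica increment counts

    def inc(self):                    # inflation: the state only grows
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other):           # join: pointwise maximum of the maps
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):                  # read: sum of all replicas' counts
        return sum(self.counts.values())

# Replicas accept increments without coordination, then exchange state.
a, b = GCounter("A"), GCounter("B")
a.inc(); a.inc(); b.inc()
a.merge(b); b.merge(a)                # after mutual merges, replicas agree
print(a.value(), b.value())           # 3 3
```

Because merge is idempotent, commutative, and associative, states can be exchanged in any order and any number of times without affecting the converged result.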
Some global invariants, which usually are enforced with global coordination, can be enforced in a conflict-free manner by using escrow techniques (O'Neil 1986) that split the available resources among the different replicas. For instance, the Bounded Counter CRDT (Balegas et al. 2015b) defines a counter that never goes negative, by assigning to each replica a reserve of allowed decrements under the condition that the sum of all allowed decrements does not exceed the value of the counter. As long as its reserve is not exhausted, a replica can accept decrements without coordinating with other replicas. After a replica exhausts its reserve, a new decrement will either fail or require synchronizing with some replica that still can decrement. This technique uses point-to-point coordination and can be generalized to enforce other system-wide invariants (Balegas et al. 2015a).

Examples of Applications

CRDTs have been used in a large number of distributed systems and applications that adopt weak consistency models. The adoption of CRDTs simplifies the development of these systems and applications, as CRDTs guarantee that replicas converge to the same state when all updates are propagated to all replicas. We can group the systems and applications that use CRDTs into two groups: storage systems that provide CRDTs as their data model and applications that embed CRDTs to maintain their internal data.

CRDTs have been integrated in several storage systems that make them available to applications. An application uses these CRDTs to store its data, it being the responsibility of the storage systems to synchronize the multiple replicas. The following commercial systems use CRDTs: Riak (Developing with Riak KV Data Types: http://docs.basho.com/riak/kv/2.2.3/developing/data-types/), Redis (Biyikoglu 2017), and Akka (Akka Distributed Data: https://doc.akka.io/docs/akka/2.5.4/scala/distributed-data.html). A number of research prototypes have also used CRDTs, including Walter (Sovran et al. 2011), SwiftCloud (Preguiça et al. 2014), and Antidote (Antidote: http://antidotedb.org) (Akkoorath et al. 2016).

CRDTs have also been embedded in multiple applications. In this case, developers either used one of the available CRDT libraries, implemented themselves some previously proposed design, or designed new CRDTs to meet their specific requirements. An example of this latter use is Roshi (Roshi is a large-scale CRDT set implementation for timestamped events: https://github.com/soundcloud/roshi), a LWW-element-set CRDT used for maintaining an index in SoundCloud stream.

Future Directions of Research

Scalability
In order to track concurrency and causal predecessors, CRDT implementations often store metadata that grows linearly with the number of replicas (Charron-Bost 1991). While global agreement suffers from greater scalability limitations since replicas must coordinate to accept each operation, the metadata cost from causality tracking can limit the scalability of CRDTs when aiming for more than a few hundred replicas. A large metadata footprint can also impact the computation time of local operations and will certainly impact the required storage and communication.

Possible solutions can be sought in more compact causality representations when multiple replicas are synchronized among the same nodes (Malkhi and Terry 2007; Preguiça et al. 2014; Gonçalves et al. 2017) or by hierarchical approaches that restrict all-to-all synchronization and enable more compact mechanisms (Almeida and Baquero 2013).

Reversible Computation
Nontrivial Internet services require the composition of multiple subsystems, to provide storage, data dissemination, event notification, monitoring, and other needed components. When composing subsystems, which can fail independently or simply reject some operations, it is useful
498 Conflict-Free Replicated Data Types CRDTs

to provide a CRDT interface that undoes previously accepted operations. Another scenario that would benefit from undo is collaborative editing of shared documents, where undo is typically a feature available to users.

Undoing an increment on a counter CRDT can be achieved by a decrement. Logoot-Undo (Weiss et al. 2010) proposes a solution for undoing (and redoing) operations for a sequence CRDT used for collaborative editing. However, providing a uniform approach to undoing (reversing) operations over the whole CRDT catalog is still an open research direction. The support of undo is also likely to limit the level of compression that can be applied to CRDT metadata.

Security

While access to a CRDT-based interface can be restricted by adding authentication, any accessing client has the potential to issue operations that can interfere with the other replicas. For instance, delete operations can remove all existing state. In state-based CRDTs, replicas have access to state that holds a compressed representation of past operations and metadata. By manipulating this state and synchronizing it to other replicas, it is possible to mount significant attacks on the system's operation and even its future evolution.

Applications that store state on third-party entities, such as cloud storage providers, might not trust the provider and choose end-to-end encryption of the exchanged state. This, however, would require all processing to be done at the edge, under application control. A research direction would be to allow some limited form of computation, such as merging state, over information whose content is subject to encryption. Potential techniques, such as homomorphic encryption, are likely to pose significant computational costs. An alternative is to execute operations on encrypted data without disclosing it, relying on specific hardware support, such as Intel SGX and ARM TrustZone.

Nonuniform Replicas

The replication of CRDTs typically assumes that eventually all replicas will reach the same state, storing exactly the same data. However, depending on the read operations available in the CRDT interface, it might not be necessary to maintain the same state in all replicas. For example, an object that has a single read operation returning the top-K elements added to the object only needs to maintain those top-K elements in every replica. The remaining elements are necessary if a remove operation is available, as one of the elements not in the top needs to be promoted when a top element is removed. Thus, each replica can maintain only the top-K elements and the elements added locally.

This replication model is named nonuniform replication (Cabrita and Preguiça 2017) and can be used to design CRDTs that exhibit important storage and bandwidth savings when compared with alternatives that keep all data in all replicas. Although it is clear that this model cannot be used for all data types, several useful CRDT designs have been proposed, including top-K, top-Sum, and histogram. To understand what data types can adopt this model and how to explore it in practice is an open research question.

Verification

An important aspect related to the development of distributed systems that use CRDTs is the verification of the correctness of the system. This involves not only verifying the correctness of CRDT designs but also the correctness of the systems that use CRDTs. A number of works have addressed these issues.

Regarding the verification of the correctness of CRDTs, several approaches have been taken. The most commonly used approach is to provide proofs when designs are proposed or to use verification tools for the specific data type, such as TLA (Lamport 1994) or Isabelle (http://isabelle.in.tum.de/). There have also been some works that proposed general techniques for the verification of CRDTs (Burckhardt et al. 2014; Zeller et al. 2014; Gomes et al. 2017), which can be used by CRDT developers to verify the correctness of their designs. Some of these works (Zeller et al. 2014; Gomes et al. 2017) include specific frameworks that help the developer in the verification process.
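The nonuniform top-K design described above can be sketched in a few lines of Python. This is a hedged illustration under simplifying assumptions: there is no remove operation (so no promotion of non-top elements is needed), and the class and method names are invented for this example, not taken from Cabrita and Preguiça (2017). Each replica keeps its locally added elements in full but propagates and merges only the small top-K view, and merging in any order makes the replicas agree on the read value.

```python
# Minimal sketch of a nonuniform top-K replicated object (illustrative only;
# removals and the promotion logic discussed above are omitted).
import heapq

class TopK:
    def __init__(self, k):
        self.k = k
        self.local = set()   # elements added at this replica (kept in full)
        self.top = set()     # current top-K view, the only state exchanged

    def add(self, value):
        self.local.add(value)
        self.top = set(heapq.nlargest(self.k, self.top | {value}))

    def merge(self, other):
        # Only the (small) top-K views travel between replicas, not all data.
        self.top = set(heapq.nlargest(self.k, self.top | other.top))

    def read(self):
        return sorted(self.top, reverse=True)

r1, r2 = TopK(3), TopK(3)
for v in (5, 1, 9):
    r1.add(v)
for v in (7, 2):
    r2.add(v)
r1.merge(r2)
r2.merge(r1)
assert r1.read() == r2.read() == [9, 7, 5]  # replicas converge on the top-K
```

Note the storage saving the text describes: after convergence, element 1 exists only in `r1.local`; it never has to be shipped to `r2` unless a remove operation would require promoting it.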
A number of other works have proposed techniques to verify the correctness of distributed systems that use CRDTs (Gotsman et al. 2016; Zeller 2017; Balegas et al. 2015a). These works typically require the developer to specify the properties that the distributed system must maintain and a specification of the operations in the system (that is independent of the actual code of the system). Despite these works, the verification of the correctness of CRDT designs and of systems that use CRDTs, how these verification techniques can be made available to programmers, and how to verify the correctness of implementations remain open research problems.

Acknowledgements This work was partially supported by NOVA LINCS (UID/CEC/04516/2013), the EU H2020 LightKone project (732505), and the SMILES line in project TEC4Growth (NORTE-01-0145-FEDER-000020).

References

Akkoorath DD, Tomsic AZ, Bravo M, Li Z, Crain T, Bieniusa A, Preguiça N, Shapiro M (2016) Cure: strong semantics meets high availability and low latency. In: Proceedings of the 2016 IEEE 36th international conference on distributed computing systems (ICDCS), pp 405–414. https://doi.org/10.1109/ICDCS.2016.98

Almeida PS, Baquero C (2013) Scalable eventually consistent counters over unreliable networks. CoRR abs/1307.3207. http://arxiv.org/abs/1307.3207

Almeida PS, Shoker A, Baquero C (2018) Delta state replicated data types. J Parallel Distrib Comput 111:162–173. https://doi.org/10.1016/j.jpdc.2017.08.003

Balegas V, Duarte S, Ferreira C, Rodrigues R, Preguiça NM, Najafzadeh M, Shapiro M (2015a) Putting consistency back into eventual consistency. In: Réveillère L, Harris T, Herlihy M (eds) Proceedings of the tenth European conference on computer systems, EuroSys 2015, Bordeaux. ACM, pp 6:1–6:16. https://doi.org/10.1145/2741948.2741972

Balegas V, Serra D, Duarte S, Ferreira C, Shapiro M, Rodrigues R, Preguiça NM (2015b) Extending eventually consistent cloud databases for enforcing numeric invariants. In: 34th IEEE symposium on reliable distributed systems, SRDS 2015, Montreal. IEEE Computer Society, pp 31–36. https://doi.org/10.1109/SRDS.2015.32

Baquero C, Almeida PS, Shoker A (2014) Making operation-based CRDTs operation-based. In: Proceedings of the first workshop on principles and practice of eventual consistency, PaPEC'14. ACM, New York, pp 7:1–7:2. https://doi.org/10.1145/2596631.2596632

Biyikoglu C (2017) Under the hood: Redis CRDTs (conflict-free replicated data types). Online https://goo.gl/tGqU7h. Accessed 24 Nov 2017

Brewer E (2010) On a certain freedom: exploring the CAP space. Invited talk at PODC 2010, Zurich

Brown R, Cribbs S, Meiklejohn C, Elliott S (2014) Riak DT map: a composable, convergent replicated dictionary. In: Proceedings of the first workshop on principles and practice of eventual consistency, PaPEC'14. ACM, New York, pp 1:1–1:1. https://doi.org/10.1145/2596631.2596633

Burckhardt S, Gotsman A, Yang H, Zawirski M (2014) Replicated data types: specification, verification, optimality. In: Proceedings of the 41st ACM SIGPLAN-SIGACT symposium on principles of programming languages, POPL'14. ACM, New York, pp 271–284. https://doi.org/10.1145/2535838.2535848

Cabrita G, Preguiça N (2017) Non-uniform replication. In: Proceedings of the 21st international conference on principles of distributed systems, OPODIS 2017. Schloss Dagstuhl – Leibniz-Zentrum fuer Informatik, LIPIcs

Charron-Bost B (1991) Concerning the size of logical clocks in distributed systems. Inf Process Lett 39(1):11–16. https://doi.org/10.1016/0020-0190(91)90055-M

Enes V (2017) Efficient synchronization of state-based CRDTs. Master's thesis, Universidade do Minho. http://vitorenesduarte.github.io/page/other/msc-thesis.pdf

Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33(2):51–59. https://doi.org/10.1145/564585.564601

Gomes VBF, Kleppmann M, Mulligan DP, Beresford AR (2017) Verifying strong eventual consistency in distributed systems. Proc ACM Program Lang 1(OOPSLA):109:1–109:28. https://doi.org/10.1145/3133933

Gonçalves RJT, Almeida PS, Baquero C, Fonte V (2017) DottedDB: anti-entropy without Merkle trees, deletes without tombstones. In: Proceedings of the 2017 IEEE 36th symposium on reliable distributed systems (SRDS), pp 194–203. https://doi.org/10.1109/SRDS.2017.28

Gotsman A, Yang H, Ferreira C, Najafzadeh M, Shapiro M (2016) 'Cause I'm strong enough: reasoning about consistency choices in distributed systems. In: Proceedings of the 43rd annual ACM SIGPLAN-SIGACT symposium on principles of programming languages, POPL'16. ACM, New York, pp 371–384. https://doi.org/10.1145/2837614.2837625

Kleppmann M, Beresford AR (2017) A conflict-free replicated JSON datatype. IEEE Trans Parallel Distrib Syst 28(10):2733–2746. https://doi.org/10.1109/TPDS.2017.2697382

Kulkarni SS, Demirbas M, Madappa D, Avva B, Leone M (2014) Logical physical clocks. In: Aguilera MK, Querzoni L, Shapiro M (eds) Principles of distributed systems – 18th international conference, OPODIS 2014, Cortina d'Ampezzo. Proceedings. Lecture notes in computer science, vol 8878. Springer, pp 17–32. https://doi.org/10.1007/978-3-319-14472-6_2

Lamport L (1978) Time, clocks, and the ordering of events in a distributed system. Commun ACM 21(7):558–565. https://doi.org/10.1145/359545.359563

Lamport L (1994) The temporal logic of actions. ACM Trans Program Lang Syst 16(3):872–923. https://doi.org/10.1145/177492.177726

Li C, Porto D, Clement A, Gehrke J, Preguiça N, Rodrigues R (2012) Making geo-replicated systems fast as possible, consistent when necessary. In: Proceedings of the 10th USENIX conference on operating systems design and implementation, OSDI'12. USENIX Association, Berkeley, pp 265–278. http://dl.acm.org/citation.cfm?id=2387880.2387906

Malkhi D, Terry DB (2007) Concise version vectors in WinFS. Distrib Comput 20(3):209–219. https://doi.org/10.1007/s00446-007-0044-y

O'Neil PE (1986) The escrow transactional method. ACM Trans Database Syst 11(4):405–430. https://doi.org/10.1145/7239.7265

Preguiça N, Marques JM, Shapiro M, Letia M (2009) A commutative replicated data type for cooperative editing. In: Proceedings of the 2009 29th IEEE international conference on distributed computing systems, ICDCS'09. IEEE Computer Society, Washington, DC, pp 395–403. https://doi.org/10.1109/ICDCS.2009.20

Preguiça NM, Zawirski M, Bieniusa A, Duarte S, Balegas V, Baquero C, Shapiro M (2014) SwiftCloud: fault-tolerant geo-replication integrated all the way to the client machine. In: 33rd IEEE international symposium on reliable distributed systems workshops, SRDS workshops 2014, Nara. IEEE Computer Society, pp 30–33. https://doi.org/10.1109/SRDSW.2014.33

Roh HG, Jeon M, Kim JS, Lee J (2011) Replicated abstract data types: building blocks for collaborative applications. J Parallel Distrib Comput 71(3):354–368. https://doi.org/10.1016/j.jpdc.2010.12.006

Shapiro M, Preguiça N, Baquero C, Zawirski M (2011) Conflict-free replicated data types. In: Proceedings of the 13th international conference on stabilization, safety, and security of distributed systems, SSS'11. Springer, Berlin/Heidelberg, pp 386–400. http://dl.acm.org/citation.cfm?id=2050613.2050642

Sovran Y, Power R, Aguilera MK, Li J (2011) Transactional storage for geo-replicated systems. In: Proceedings of the twenty-third ACM symposium on operating systems principles, SOSP'11. ACM, New York, pp 385–400. https://doi.org/10.1145/2043556.2043592

Terry DB, Demers AJ, Petersen K, Spreitzer M, Theimer M, Welch BB (1994) Session guarantees for weakly consistent replicated data. In: Proceedings of the third international conference on parallel and distributed information systems, PDIS'94, Austin. IEEE Computer Society, pp 140–149. https://doi.org/10.1109/PDIS.1994.331722

van der Linde A, Leitão JA, Preguiça N (2016) Δ-CRDTs: making δ-CRDTs delta-based. In: Proceedings of the 2nd workshop on the principles and practice of consistency for distributed data, PaPoC'16. ACM, New York, pp 12:1–12:4. https://doi.org/10.1145/2911151.2911163

Weiss S, Urso P, Molli P (2009) Logoot: a scalable optimistic replication algorithm for collaborative editing on p2p networks. In: Proceedings of the 2009 29th IEEE international conference on distributed computing systems, ICDCS'09. IEEE Computer Society, Washington, DC, pp 404–412. https://doi.org/10.1109/ICDCS.2009.75

Weiss S, Urso P, Molli P (2010) Logoot-Undo: distributed collaborative editing system on p2p networks. IEEE Trans Parallel Distrib Syst 21(8):1162–1174. https://doi.org/10.1109/TPDS.2009.173

Zawirski M, Baquero C, Bieniusa A, Preguiça N, Shapiro M (2016) Eventually consistent register revisited. In: Proceedings of the 2nd workshop on the principles and practice of consistency for distributed data, PaPoC'16. ACM, New York, pp 9:1–9:3. https://doi.org/10.1145/2911151.2911157

Zeller P (2017) Testing properties of weakly consistent programs with Repliss. In: Proceedings of the 3rd international workshop on principles and practice of consistency for distributed data, PaPoC'17. ACM, New York, pp 3:1–3:5. https://doi.org/10.1145/3064889.3064893

Zeller P, Bieniusa A, Poetzsch-Heffter A (2014) Formal specification and verification of CRDTs. In: Formal techniques for distributed objects, FORTE 2014. Lecture notes in computer science. Springer, pp 33–48


Conformance Checking

Jorge Munoz-Gama
Department of Computer Science, School of Engineering, Pontificia Universidad Católica de Chile, Santiago, Chile

Synonyms

Business process conformance checking

Definitions

Given an event log and a process model from the same process, conformance checking compares the recorded event data with the model to
Record, Evaluate Advisor CV, Final Evaluation, Accept, Notify Results⟩ perfectly fits the model in Fig. 1 since each of the observed steps can be sequentially replayed on the model, and the trace corresponds to a particular possible way to execute the process model. However, the trace ⟨Start Processing, Evaluate Project, Evaluate Academic Record, Final Evaluation, Reject, Notify Results⟩ does not fit the model because the advisor's CV (Evaluate Advisor CV) is never evaluated. This suggests that the corresponding application has been rejected without proper evaluation.

Precision relates to a model's ability to capture the observed behavior without allowing unseen behavior. It is not enough to have a model that is perfectly fitting with the log, since this can be easily achieved with a model that permits any behavior. Consider the "flower" model in Fig. 2; it consists of all the transitions attached to a state that corresponds to both the start and end state. This means any sequence involving the connected transitions is permissible by the model. Though perfectly fitting with the log, such an underfitting model does not convey much useful information to the user. On the contrary, the process model illustrated in Fig. 3 is much more precise than the flower model.

Generalization relates to a model's ability to account for yet-to-be-observed behavior. Typically, an event log only represents a small fraction of the possible behavior in the process. As such, a good model must be generalizing enough so that unobserved but possible behavior is described. For example, if the model in Fig. 4 was discovered from an event log that contains only the trace ⟨Start Processing, Evaluate Project, Evaluate Academic Record, Evaluate Advisor CV, Final Evaluation, Accept, Notify Results⟩, then the model would be both perfectly fitting and precise, since all observed behaviors are captured by the model and it does not allow any unseen behavior. Clearly, there is much unseen behavior that is very likely to occur in the future, e.g., the rejection of an application. This shows that, while it is important to have precise models, it is also important to avoid overfitting the observed behavior.

Some authors consider a fourth dimension called simplicity, relating to model complexity, i.e., simple models should be preferred over complex models if both describe the same behavior. However, this quality dimension relates only to the model and therefore is not normally measured by conformance checking techniques. This dimension is covered in the Automated Process Discovery entry of this encyclopedia.

Conformance Checking, Fig. 2 Imprecise flower model of the doctoral scholarship process

Conformance Checking, Fig. 3 Precise but unfitting model of the doctoral scholarship process

Conformance Checking, Fig. 4 Model that overfits one particular possible execution of the doctoral scholarship process

Overall, while the three quality dimensions are orthogonal to each other, in a real-life context one is unlikely to find a pair of log and model that are in perfect conformance (i.e., perfectly fitting, precise, and generalizing). Oftentimes, different scenarios may require different conformance levels and prioritization of the quality dimensions. For example, to analyze the well-established execution paths of a process, an analyst might prioritize fitness over the other dimensions. On the other hand, if an event log only contains a small number of cases, generalization would likely be prioritized over the other dimensions to account for possible future behavior.

Types of Conformance

Conformance checking techniques can be applied to understand and quantify these relationships between a log and model. There is a large collection of approaches and metrics that are based on different ways to compare a log and model.

Figure 5 shows that there are three main groups of conformance checking approaches: replay, comparison, and alignment. Replay-based approaches replay log traces onto the model and record information about the conformance during the replay. Process models can be denoted in different modeling notations, e.g., Business Process Modeling Notation (BPMN), Petri nets, and process trees, and each
Conformance Checking, Fig. 6 Model M1 of the university doctoral scholarship process denoted in Petri net notation

Conformance Checking, Fig. 9 Missing token to fire activity e in token replay of trace t2 = ⟨a, c, b, e, f, h⟩

redundant tokens. An activity might be marked to be fired in the log trace but is not enabled in the model due to missing tokens at its input places. Consider the replay of trace t2 in L1. Starting from the initial state of model M1 as shown in Fig. 6, the firing of activities a, b, and c would consume three tokens and produce five tokens in the process. Figure 9 shows the state of model M1 after the firing of the first three activities and records the number of consumed and produced tokens. According to trace t2, the next activity to be fired is activity e. However, this is not possible since one of the input places of activity e does not have any token, i.e., activity e is not enabled. To continue the replay, the missing token is artificially added into the empty input place, and the number of missing tokens is incremented. The rest of trace t2 (activities f and h) can be replayed successively. Figure 10 shows that, after firing activity h, there is a remaining token in the input place of activity d, since this activity was not fired in the replay of trace t2. As recalled, a process instance is only completed when there is only one token at each of the sink places and none at any other places. To complete the replay, the remaining token is removed artificially, and the number of remaining tokens is incremented.

Based on the count of each token type consumed, produced, missing, and remaining (p = 8, c = 8, m = 1, r = 1), the fitness between model M1 and trace t2 can be computed as:

    fitness(t2, M1) = 1/2 (1 − m/c) + 1/2 (1 − r/p)
                    = 1/2 (1 − 1/8) + 1/2 (1 − 1/8) = 0.875

This fitness metric can be extended to the log level by considering the number of produced, consumed, missing, and remaining tokens from the token replay of all log traces.

Cost-Based Alignment

The token replay approach can easily identify deviating traces in an event log. Moreover, the deviation severity can be compared using a fitness metric computed from the number of produced, consumed, missing, and remaining tokens. However, the token replay approach is prone to creating too many tokens for highly deviating traces, so that any behavior is allowed. This can lead to an overestimation of the fitness. The approach is also specific to the Petri net notation. More importantly, in the case of a deviating trace, the approach does not provide a model explanation of the log trace. For example, the deviations in trace t2 = ⟨a, c, b, e, f, h⟩ can be explained if it is considered with respect to the complete model trace ⟨a, c, b, d, e, f, h⟩. From this mapping, it is clear that the log trace is missing the execution of activity d (Evaluate Advisor CV). These mappings from log traces to model traces were introduced as alignments to address this limitation (van der Aalst et al. 2012).

Alignments are tables of two rows where the top row corresponds to the observed behavior (i.e., log projection) and the bottom row corresponds to the modeled behavior (i.e., model projection). Each column is therefore a move in the alignment where the observed behavior is aligned with the modeled behavior. Consider alignment γ1 in Fig. 11. This alignment aligns trace t3 = ⟨a, b, d, c, g, e, h⟩ in L1 and model M1 in Fig. 6. The top row (ignoring ≫) yields the trace t3 = ⟨a, b, d, c, g, e, h⟩, and the bottom row (ignoring ≫) yields a complete model trace ⟨a, b, d, c, e, g, h⟩. For each move in alignment γ1, the top row matches the bottom row if the step in the log trace matches the step in the model trace. This is called a synchronous move. In the case of deviations, a no-move symbol ≫ is placed in the bottom row if there is a step in the log trace that cannot be mimicked by the model trace. For example, activity g is executed before activity e in trace t3, but model M1 requires activity e to be fired before activity g. Hence, a log move is put where the top row has activity g and the bottom row has a no-move ≫. Similarly, a no-move symbol ≫ is placed in the top row if there is a step in the model trace that cannot be mimicked by the log trace. For example, activity g is executed after activity e in the model trace according to model M1. Therefore, a model move is added where the top row has a no-move ≫ and the bottom row has activity g. It is also possible that there are invisible transitions in the model which are not observable in the log. Similar to a model move, there would be a no-move in the top row and an invisible transition label in the bottom row. In total, there are four types of legal moves in an alignment: synchronous move, log move, model move, and invisible move.

Conformance Checking, Fig. 11 Possible alignments between trace t3 = ⟨a, b, d, c, g, e, h⟩ in L1 and model M1 in Fig. 6

Conformance Checking, Fig. 12 Default alignment between trace t3 in log L1 and model M1 in Fig. 6

For a particular log trace and model, there could be many possible alignments, where each represents a different explanation of the observed behavior in terms of modeled behavior. For example, Fig. 11 shows three possible alignments between trace t3 and model M1 in Fig. 6. Clearly, alignments γ1 and γ3 are better alignments of trace t3 and model M1 than alignment γ2, since they provide closer explanations with fewer log moves and model moves. The quality of an alignment can be quantified by assigning costs to moves. In general, model moves and log moves are assigned higher costs than synchronous moves because they represent deviations between modeled behavior and observed behavior. A standard cost assignment could be that all model moves and log moves are assigned a cost of 1 and synchronous moves and invisible moves are assigned a cost of 0. Invisible moves are normally assigned zero costs as they are related to invisible routing transitions in the model that are not observable in the log. Under the standard cost assignment, the costs of the alignments in Fig. 11 can be computed as follows:

    cost(γ1) = 0 + 0 + 0 + 0 + 1 + 0 + 1 + 0 = 2
    cost(γ2) = 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 = 4
    cost(γ3) = 0 + 0 + 0 + 0 + 1 + 0 + 1 + 0 = 2

This confirms the previous intuition that alignments γ1 and γ3 are better alignments than alignment γ2. Alignments with the minimal cost correspond to optimal alignments that give the closest explanations of log traces in terms of modeled behavior. Note that there could be multiple optimal alignments for a particular log trace. For example, alignments γ1 and γ3 are both optimal alignments of trace t3 under the standard cost assignment. Furthermore, optimal alignments are only optimal with respect to the given cost assignment. For example, alignment γ1 would cease to be the optimal alignment if model moves and log moves of activity g were assigned a cost of 2 (i.e., cost(γ1) = 4) to reflect that having deviations at the decision part of the process is quite severe. In practice, optimal alignments can be automatically found by finding the cheapest complete model trace in the synchronous product of the log trace and model using heuristic algorithms with proven optimality guarantees, e.g., the A* algorithm (van der Aalst et al. 2012).

Alignments can also be used to compute conformance metrics with respect to the different quality dimensions.

Cost-Based Fitness Metric

The fitness of a log trace and a model can be quantified by comparing the cost of an optimal alignment with the worst-case scenario cost (Adriansyah 2014). In the worst scenario, the log trace is completely unfitting with the model. A default alignment between the two can be computed by assigning all the steps in the log trace as log moves and all the steps in the complete model trace as model moves. Since the optimal alignment minimizes the total alignment cost, the least costly complete model trace is used. Figure 12 shows the default alignment between trace t3 and model M1 under the standard cost assignment. The top row (ignoring ≫) yields trace t3, and the bottom row (ignoring ≫) yields a complete model trace ⟨a, b, d, c, e, g, h⟩. The cost-based fitness of trace t3 can be computed as:

    fitness(t3, M1) = 1 − cost(align(t3, M1)) / cost(align_default(t3, M1))
                    = 1 − (0 + 0 + 0 + 0 + 1 + 0 + 1 + 0) / (1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1)
                    = 1 − 2/14 = 0.857

where a fitness value of 1.0 means that the model and log trace are perfectly fitting.

Escaping Arc Precision

Precision based on escaping arcs can also be computed using alignments (Adriansyah et al. 2012). As previously mentioned, an imprecise model allows unobserved behavior, i.e., underfitting. For example, consider the Petri net model M1 in Fig. 6 and the optimal alignments (under the standard cost assignment) between model M1 and log L1 in Fig. 13. Clearly, model M1 is not perfectly precise, as it allows for behavior that is not observed in log L1. According to model M1, activities b, c, and d can be executed in parallel following the execution of activity a. However, none of the log traces execute activity d after activity a. This imprecision in the model can be quantified by constructing a prefix automaton using the model projection of the alignments, i.e., the bottom row of the alignments. As previously presented, model projections of alignments explain potentially unfitting log traces in terms of modeled behavior so that they can be replayed on the process model. Figure 14 illustrates the constructed prefix automaton A1 for the alignments between log L1 and model M1 (ignoring the circles highlighted in red for now). Each prefix of the model projections of the alignments identifies a state (represented as circles), and the number in the states corresponds to the weight. For example, the state ⟨a⟩ has a weight of 3 because it appears three times in the model projections (all three alignments start with activity a). On the other hand, the state ⟨a, c, b⟩ is only present in alignment γ6 and therefore has a weight of 1. The states of automaton A1 represent states reached by the model during the execution of the log. For any particular state in automaton A1, there might be activities that are enabled by the model but are not observed in the log execution. These activities indicate imprecisions of the model and are called escaping arcs of the model. Escaping arc states (represented as circles highlighted in red) are added to automaton A1 by replaying the automaton onto the model and checking for enabled activities at each state. For example, at state ⟨a⟩ (i.e., after firing activity a), activities b, c, and d are enabled as shown in Fig. 8. However, the prefix ⟨a, d⟩ was not observed in the construction of automaton A1 using log L1. This means that there is an escaping arc from state ⟨a⟩ to state ⟨a, d⟩, and this is added to the automaton by the state highlighted in red. The rest of the escaping arcs can be added in a similar way.

With the constructed prefix automaton, escaping arc precision can be computed by comparing the number of escaping arcs with the number of allowed arcs for all states:

    precision(A1) = 1 − [Σ_{s∈S} ω(s) · |esc(s)|] / [Σ_{s∈S} ω(s) · |mod(s)|]
                  = 1 − (3·0 + 3·1 + 2·0 + … + 1·1 + 1·0 + 1·0) / (3·1 + 3·3 + 2·2 + … + 1·2 + 1·1 + 1·1)
                  = 1 − 6/36 = 0.833

Conformance Checking, Fig. 13 Optimal alignments between traces t1, t2, t3 in log L1 and model M1 in Fig. 6

Conformance Checking, Fig. 14 Prefix automaton A1 of alignments between log L1 and model M1 enhanced with model behavior
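The worked numbers above can be recomputed with a short script. This is an illustrative sketch only: the function names are invented here and are not the API of any process mining library, and the token counts, move costs, and automaton totals are taken directly from the worked examples in the text.

```python
# Recompute the worked conformance-checking numbers from the text (sketch).

def token_replay_fitness(p, c, m, r):
    # Token replay fitness: 1/2 (1 - m/c) + 1/2 (1 - r/p)
    return 0.5 * (1 - m / c) + 0.5 * (1 - r / p)

def alignment_cost(move_costs):
    # Standard cost assignment: log/model moves cost 1, sync/invisible moves 0.
    return sum(move_costs)

def cost_based_fitness(optimal_cost, default_cost):
    # 1 - cost(optimal alignment) / cost(default alignment)
    return 1 - optimal_cost / default_cost

# Token replay of trace t2 = <a, c, b, e, f, h> on model M1 (counts from text).
assert token_replay_fitness(p=8, c=8, m=1, r=1) == 0.875

# Per-move costs of the three alignments of t3 in Fig. 11.
g1 = alignment_cost([0, 0, 0, 0, 1, 0, 1, 0])     # gamma_1
g2 = alignment_cost([0, 0, 1, 0, 1, 1, 0, 1, 0])  # gamma_2
g3 = alignment_cost([0, 0, 0, 0, 1, 0, 1, 0])     # gamma_3
assert (g1, g2, g3) == (2, 4, 2)

# Cost-based fitness of t3: the default alignment has 7 log moves + 7 model
# moves, so its cost is 14.
assert round(cost_based_fitness(g1, 14), 3) == 0.857

# Escaping-arc precision of automaton A1: 1 - 6/36 (totals from the text).
assert round(1 - 6 / 36, 3) == 0.833
```

The assertions mirror the values 0.875, 2/4/2, 0.857, and 0.833 derived in the running example, so the script doubles as a sanity check on the arithmetic.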
occur in the process. Assuming that the event log gives a complete view of the process (i.e., a log completeness assumption), the precision of the process model can be computed using artificial negative events and the concepts of precision and recall in data mining.

Artificial negative events can be induced by grouping similar traces and then observing the events that did not occur for every event in the traces. Under the log completeness assumption, this means that these unobserved events are negative events that are not allowed to happen by the process (Goedertier et al. 2009).

The process model can then be compared with the log by treating the model as a predictive model. For a given incomplete event sequence (i.e., an unfinished process instance), activities that are permitted by the model and observed in the log are classified as true positives (TP). Activities that are permitted by the model but are induced as negative events from the log are classified as false positives (FP). Activities that are not permitted by the model but observed in the log are classified as false negatives (FN). Finally, activities that are not permitted by the model and are induced as negative events from the log are classified as true negatives (TN). As shown in Fig. 15, precision and recall can be computed using a confusion matrix. Specifically, the precision of the positive class can be computed as:

precision = TP / (TP + FP)

The computed precision value corresponds to the precision of the three quality dimensions in process mining since it refers to the proportion of modeled behavior that is observed in the log. The use of artificial negative events can also be extended to quantify generalization and to com-
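The confusion-matrix computation above can be sketched in a few lines of Python; the activity counts below are hypothetical, not taken from the running example:

```python
def precision(tp, fp):
    # Proportion of model-permitted behavior that is actually observed.
    return tp / (tp + fp)

def recall(tp, fn):
    # Proportion of observed behavior that the model permits.
    return tp / (tp + fn)

# Hypothetical counts from classifying activities against a model:
tp, fp, fn, tn = 40, 10, 5, 45
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))
```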
Future Directions for Research

While there have been significant advances in the research of conformance checking over the recent years, there are still many open challenges and research opportunities. Some of them include:

Conformance Dimensions
The proposed three quality dimensions (fitness, precision, and generalization) have been widely accepted, but there is still a need for further understanding on how to interpret and quantify them through metrics. Furthermore, conformance can be extended beyond the current three dimensions, e.g., log completeness to quantify whether the observed event data gives the full picture of the underlying process (Janssenswillen et al. 2017).

Big Data and Real Time
Process mining techniques and tools are getting applied to larger and more complex processes. This means that they have to be scalable to handle the increased size and complexity. In fact, much of the recent research effort in conformance checking has been focused on this issue. Related research lines include decomposed conformance checking (Munoz-Gama 2016) and online conformance checking for event streams (Burattin 2015).

Conformance Diagnosis and Process Model Repair
It is not enough to just identify conformance issues; good diagnostic and visualization tools are crucial in helping the analyst identify and understand the root causes of the conformance issues. While there has been work done in this aspect of conformance checking, e.g., Buijs and Reijers (2014) and Munoz-Gama et al. (2014), there is much more to be done to provide better conformance diagnosis technology, e.g., new techniques and user studies. Finally, once the differences between the model and the log have been diagnosed, the user may wish to repair the model in order to fix such differences and achieve a model that better describes the real process executed. This topic is extensively covered in the Process Model Repair entry of this encyclopedia.

Cross-References

Automated Process Discovery
Business Process Event Logs and Visualization
Process Model Repair

References

Adriansyah A (2014) Aligning observed and modeled behavior. Ph.D. thesis, Eindhoven University of Technology
Adriansyah A, Munoz-Gama J, Carmona J, van Dongen BF, van der Aalst WMP (2012) Alignment based precision checking. In: Rosa ML, Soffer P (eds) Business process management workshops – BPM 2012 international workshops, Tallinn, Estonia, 3 Sept 2012. Revised papers. Lecture notes in business information processing, vol 132. Springer, pp 137–149
Buijs JCAM, Reijers HA (2014) Comparing business process variants using models and event logs. In: Enterprise, business-process and information systems modeling – 15th international conference, BPMDS 2014, 19th international conference, EMMSAD 2014, held at CAiSE 2014, Thessaloniki, Greece, 16–17 June 2014. Proceedings, pp 154–168
Burattin A (2015) Process mining techniques in business environments – theoretical aspects, algorithms, techniques and open challenges in process mining. Lecture notes in business information processing, vol 207. Springer, Cham
Goedertier S, Martens D, Vanthienen J, Baesens B (2009) Robust process discovery with artificial negative events. J Mach Learn Res 10:1305–1340
Janssenswillen G, Donders N, Jouck T, Depaire B (2017) A comparative study of existing quality measures for process discovery. Inf Syst 71:1–15
Munoz-Gama J (2016) Conformance checking and diagnosis in process mining – comparing observed and modeled processes. Lecture notes in business information processing, vol 270. Springer, Cham
Munoz-Gama J, Carmona J, van der Aalst WMP (2014) Single-entry single-exit decomposed conformance checking. Inf Syst 46:102–122
Rozinat A, van der Aalst WMP (2008) Conformance checking of processes based on monitoring real behavior. Inf Syst 33(1):64–95
van der Aalst WMP (2013) Mediating between modeled and observed behavior: the quest for the "right" process: keynote. In: RCIS, pp 1–12. IEEE
van der Aalst WMP (2016) Process mining – data science in action. Springer, Berlin/Heidelberg
van der Aalst WMP, Adriansyah A, van Dongen BF (2012) Replaying history on process models for
Continuous Queries
Overview

Section "Findings" explores other SQL-like streaming languages beyond CQL.

SQL-Like Syntax
The surface syntax for streaming SQL dialects borrows familiar SQL clauses (such as select, from, and where) and augments them with streaming constructs (which turn streams into relations and vice versa). Consider the following CQL (Arasu et al. 2006) query with extensions adapted from Soulé et al. (2016):

Line 1 declares Quotes as a stream of {ticker, ask} tuples, and Line 2 declares History as a relation of {ticker, low} tuples. Neither Quotes nor History is defined with a query in the example. Lines 3–6 declare Bargains and define it with a query. Line 3 declares Bargains as a stream of {ticker, ask, low} tuples. Line 4 specifies the output using the istream operator, which creates a stream from the insertions to a relation, using * to pick up all available tuple attributes. Line 5 joins two relations, Quotes[now] and History, where Quotes[now] creates a relation from the current contents of stream Quotes. Finally, Line 6 selects only tuples satisfying the given predicate, whereas any tuples for which the predicate is false are dropped.

The above example illustrates the core features of CQL, viz.: using operators such as now to turn streams into relations, using SQL to query relations, and using operators such as istream to turn relations into streams. For more detail, the following paragraphs explain the CQL grammar, starting with the top-level syntax:

A program consists of one or more declarations. Each declaration has an identifier (ID), a tuple type (one or more attributes specified by their identifiers and types), a declaration kind (either relation or stream), and an optional query. The grammar meta-notation contains superscripts for optional items (X?), repetition (X+), and repetition separated by commas (X+,). Finally, a query consists of mandatory select and from clauses and optional where and group-by clauses. Next, we look at the grammar for the select clause, which specifies the query output:

A select clause either specifies an output list directly or wraps an output list in a relation-to-stream operator. In the first case, the query output is a relation, while in the second case, the query output is a stream. There are three relation-to-stream operators: istream captures insertions, dstream captures deletions, and rstream captures the entire relation at any given point in time. Here, a relation at a given point in time is a bag of tuples (i.e., an unordered collection of tuples that can contain duplicates). A stream is an unbounded bag of timestamped tuples (pairs of ⟨timestamp, tuple⟩). The grammar for outputList is borrowed from SQL. Next, we look at the grammar for the from clause, which specifies the query input:

An input item either identifies a relation directly or applies a stream-to-relation operator to a stream identifier. Stream-to-relation operators are written postfix in square brackets, reminiscent of indexing or slicing syntax in other programming languages. There are four such operators: now (tuples with the current timestamp), unbounded (tuples up to the current timestamp), range (time-based sliding window), and rows (count-based sliding window). Sliding windows can optionally be partitioned, in which case their size is determined separately and independently for each unique combination of the specified partitioning attributes. Sliding windows can optionally specify a slide granularity. Finally, we look at the grammar for where and groupBy as examples of other classic SQL clauses:

where ::= 'where' expr
groupBy ::= 'group' 'by' ID+,

The where clause selects tuples using a predicate expression, and the group-by clause specifies the scope for aggregation queries. These clauses in CQL are borrowed unchanged from SQL. For brevity, we omitted other classic SQL constructs, such as distinct, union, having, or other types of joins, which streaming SQL dialects such as CQL can borrow from SQL as is. Typically, before execution, queries in SQL or its derivatives are first translated into a logical query plan of operators, which is the subject of the next section.

Stream-Relational Algebra
Whereas the SQL-like syntax of the previous section is designed to be authored by humans, this section describes an algebra designed to be optimized and executed by stream-query engines. The algebra is stream-relational because it augments with stream operators the relational algebra from databases. Relational algebra has well-understood semantics (Garcia-Molina et al. 2008), and the CQL authors rigorously defined the formal semantics for the additional streaming operators (Arasu and Widom 2004). The following paragraphs provide only an informal overview of the operators; interested readers can consult the literature for formal treatments. The notation for operator signatures is:

operator⟨configuration⟩(input) → output

The configuration of an operator specializes its behavior. The input to an operator consists of one or multiple relations or a stream. And the output of an operator consists of either a relation or a stream. Relational algebra is compositional, since the output of one operator can be plugged into the input of another operator, modulo compatible kind and type. CQL can use the well-defined semantics of classic relational algebra operators by applying them on snapshots of relations at a point in time. Some operators, such as σ (selection) and π (projection), process each tuple in isolation, without carrying any state from one tuple to another (Xu et al. 2013). These operators could be easily lifted to work on streams, and indeed, some streaming SQL dialects do just that. But the same is not true for stateful operators such as ⋈ (join) and γ (aggregation). To use these on streams, CQL first converts streams to relations, using window operators:

• now(stream) → relation
  At each time instant t, all tuples from the input stream with timestamp exactly t.
• unbounded(stream) → relation
  At each time instant t, all tuples from the input stream with timestamp at most t.

Conversely, the relation-to-stream operators turn relations back into streams:

• istream(relation) → stream
  Watch input relation for insertions, and send those as tuples on the output stream.
• dstream(relation) → stream
  Watch input relation for deletions, and send those as tuples on the output stream.
• rstream(relation) → stream
  At each time instant, send all tuples currently in input relation on output stream.

For instance, the stream-relational algebra for stream Bargains in the CQL example from the start of section "SQL-Like Syntax" is:

istream(σ⟨ask<low⟩(⋈⟨Quotes.ticker=History.ticker⟩(now(Quotes), History)))

For brevity, we omitted other classic relational operators, but they could be added trivially, thanks to the compositionality of the algebra.

This article illustrated continuous queries in SQL-like languages using CQL as the concrete example, because it is clean and well studied. The original CQL work contains more details and a denotational semantics (Arasu et al. 2006; Arasu and Widom 2004). Soulé et al. furnish CQL with a static type system and formalize translations from CQL via stream-relational algebra to a calculus with an operational semantics (Soulé et al. 2016).
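To make these operators concrete, the following Python sketch mimics their semantics at a single time instant and evaluates the Bargains query. It is an illustration under simplifying assumptions (a stream as a list of timestamped dicts), not an engine implementation, and the data values are invented:

```python
# Model: a stream is a list of (timestamp, tuple) pairs, where a tuple
# is a dict; a relation (at one time instant) is a list of dicts.

def now(stream, t):
    """Stream-to-relation: tuples with timestamp exactly t."""
    return [tup for (ts, tup) in stream if ts == t]

def unbounded(stream, t):
    """Stream-to-relation: tuples with timestamp at most t."""
    return [tup for (ts, tup) in stream if ts <= t]

def join(r1, r2, key):
    """Relational join on a shared attribute."""
    return [{**a, **b} for a in r1 for b in r2 if a[key] == b[key]]

def select(rel, pred):
    """Relational selection: keep tuples satisfying the predicate."""
    return [tup for tup in rel if pred(tup)]

def istream(old_rel, new_rel, t):
    """Relation-to-stream: emit insertions since the previous instant."""
    return [(t, tup) for tup in new_rel if tup not in old_rel]

# The Bargains query at time instant 7, with toy data:
quotes = [(7, {"ticker": "IBM", "ask": 79}), (7, {"ticker": "HPQ", "ask": 33})]
history = [{"ticker": "IBM", "low": 80}, {"ticker": "HPQ", "low": 30}]

bargains = select(join(now(quotes, 7), history, "ticker"),
                  lambda tup: tup["ask"] < tup["low"])
print(istream([], bargains, 7))
# [(7, {'ticker': 'IBM', 'ask': 79, 'low': 80})]
```

Only the IBM quote survives the selection, because its ask price falls below the historical low; istream then timestamps the inserted tuple, turning the relation back into a stream.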
out-of-order data. For instance, CEDR suggests a solution based on stream-relational algebra using several timestamps per tuple (Barga et al. 2007). One challenge that is particular to SQL-like languages is how to extend them with user-defined operators without losing the well-definedness of the restricted algebra. For instance, StreamInsight addresses this issue with an extensibility framework (Ali et al. 2011). Finally, the semantics of SQL are defined by reevaluating relational operators on windows whenever the window contents change. Nobody suggests that this reevaluation is the most efficient approach, but developing better solutions is an interesting research challenge. For instance, the DABA algorithm performs associative sliding-window aggregation on FIFO windows in worst-case constant time (Tangwongsan et al. 2017). All three of the abovementioned research topics (out-of-order processing, extensibility, and incremental streaming algorithms) are still ripe with open issues.

Cross-References

Stream Processing Languages and Abstractions

References

Abadi DJ, Carney D, Cetintemel U, Cherniack M, Convey C, Lee S, Stonebraker M, Tatbul N, Zdonik S (2003) Aurora: a new model and architecture for data stream management. J Very Large Data Bases (VLDB J) 12(2):120–139
Ali M, Chandramouli B, Goldstein J, Schindlauer R (2011) The extensibility framework in Microsoft StreamInsight. In: International conference on data engineering (ICDE), pp 1242–1253
Arasu A, Widom J (2004) A denotational semantics for continuous queries over streams and relations. SIGMOD Rec 33(3):6
Arasu A, Cherniack M, Galvez E, Maier D, Maskey AS, Ryvkina E, Stonebraker M, Tibbetts R (2004) Linear road: a stream data management benchmark. In: Conference on very large data bases (VLDB), pp 480–491
Arasu A, Babu S, Widom J (2006) The CQL continuous query language: semantic foundations and query execution. J Very Large Data Bases (VLDB J) 15(2):121–142
Barga RS, Goldstein J, Ali M, Hong M (2007) Consistent streaming through time: a vision for event stream processing. In: Conference on innovative data systems research (CIDR), pp 363–373
Botan I, Derakhshan R, Dindar N, Haas L, Miller RJ, Tatbul N (2010) SECRET: a model for analysis of the execution semantics of stream processing systems. In: Conference on very large data bases (VLDB), pp 232–243
Chandrasekaran S, Cooper O, Deshpande A, Franklin MJ, Hellerstein JM, Hong W, Krishnamurthy S, Madden S, Raman V, Reiss F, Shah MA (2003) TelegraphCQ: continuous dataflow processing for an uncertain world. In: Conference on innovative data systems research (CIDR)
Cranor C, Johnson T, Spataschek O, Shkapenyuk V (2003) Gigascope: a stream database for network applications. In: International conference on management of data (SIGMOD) industrial track, pp 647–651
Garcia-Molina H, Ullman JD, Widom J (2008) Database systems: the complete book, 2nd edn. Pearson/Prentice Hall, London, UK
Gedik B (2013) Generic windowing support for extensible stream processing systems. Softw Pract Exp (SP&E) 44:1105–1128
Grover M, Rea R, Spicer M (2016) Walmart & IBM revisit the linear road benchmark. https://www.slideshare.net/RedisLabs/walmart-ibm-revisit-the-linear-road-benchmark (Retrieved Feb 2018)
Hirzel M, Rabbah R, Suter P, Tardieu O, Vaziri M (2016) Spreadsheets for stream processing with unbounded windows and partitions. In: Conference on distributed event-based systems (DEBS), pp 49–60
Hudak P (1998) Modular domain specific languages and tools. In: International conference on software reuse (ICSR), pp 134–142
Jain N, Amini L, Andrade H, King R, Park Y, Selo P, Venkatramani C (2006) Design, implementation, and evaluation of the linear road benchmark on the stream processing core. In: International conference on management of data (SIGMOD), pp 431–442
Jain N, Mishra S, Srinivasan A, Gehrke J, Widom J, Balakrishnan H, Cetintemel U, Cherniack M, Tibbets R, Zdonik S (2008) Towards a streaming SQL standard. In: Conference on very large data bases (VLDB), pp 1379–1390
Soulé R, Hirzel M, Gedik B, Grimm R (2016) River: an intermediate language for stream processing. Softw Pract Exp (SP&E) 46(7):891–929
Tangwongsan K, Hirzel M, Schneider S (2017) Low-latency sliding-window aggregation in worst-case constant time. In: Conference on distributed event-based systems (DEBS), pp 66–77
Xu Z, Hirzel M, Rothermel G, Wu KL (2013) Testing properties of dataflow program operators. In: Conference on automated software engineering (ASE), pp 103–113
Zou Q, Wang H, Soulé R, Hirzel M, Andrade H, Gedik B, Wu KL (2010) From a stream of relational queries to distributed stream processing. In: Conference on very large data bases (VLDB) industrial track, pp 1394–1405
Coordination Avoidance
models and abstractions that facilitate and enable avoiding coordination.

Key Research Findings

The goal of coordination avoidance techniques and protocols is to minimize the cost of coordination while retaining the application integrity. There have been a plethora of works that investigate such approaches. What these works have in common is attempting to leverage application-level semantics to deduce the set of guarantees that are sufficient for a consistent execution. They may take various forms that are presented in this section. Coordination avoidance techniques combine a subset of different approaches. Generally, a coordination avoidance protocol tackles two tasks: extracting or modeling a set of consistency semantics for a particular application and executing operations while minimizing coordination to the cases when only the set of extracted consistency semantics are threatened.

There are various approaches for extracting consistency semantics that vary in their ease of use, generality, and optimality, where coordination avoidance protocols typically face the trade-off between these three properties. One approach extracts consistency semantics from existing code via the use of static analysis or other similar methods. This approach requires the least intervention from application developers. However, it introduces the challenge of extracting features from general code, which can be limited in its scope. Some other approaches rely on the application developers to annotate or model their code to enable extracting the consistency semantics. This approach requires additional intervention from the application developer, although allowing the use of existing general code. Alternatively, some works explore an approach where specialized frameworks and abstractions are introduced to facilitate and enable the extraction of consistency semantics. With the extracted features, there are various approaches to execute distributed applications that also vary in their ease of use, generality, and optimality. One approach is to selectively coordinate between nodes according to the extracted consistency semantics. Another approach is to provide specialized protocols and execution mechanisms that guarantee the consistency semantics. The following are examples of coordination avoidance methods and work that combine and explore different extraction and execution approaches.

Commutativity and Convergence
A large number of coordination avoidance protocols leverage the commutativity of operations. Commutative operations are ones that can be reordered arbitrarily while reaching the same final state. Therefore, commutative operations can execute at different nodes in different orders without synchronously coordinating with other nodes. The operations are asynchronously propagated to other nodes that execute them without the need of any ordering. Commutative operations are especially useful for hot spots that are frequently being updated, such as counters.

Many attempts have been made to generalize the notion of commutativity in the context of data management (Badrinath and Ramamritham 1992; Korth 1983). Recoverability (Badrinath and Ramamritham 1992) generalizes commutativity to include more operations by explicitly ordering commit points. For example, two operations pushing to a stack simultaneously are not commutative. However, with recoverability, they can execute concurrently while maintaining serializability with the condition that they commit in the order they are invoked. Korth (1983) generalizes locking to account for the existence of commutative operations and thus allow more concurrency when commutative operations are being used. Commutative operations can also be used in conjunction with other consistency guarantees. Walter (Sovran et al. 2011), for example, is a system that employs a consistency notion based on snapshot isolation. Walter allows commutative operations to be performed without coordination.

Convergent and commutative replicated data types (CRDTs) (Shapiro et al. 2011) are an approach to make the use of commutativity more powerful by designing access abstractions and data structures that are more general than
traditional commutative operations while still maintaining commutativity. This has led to the development of nontrivial data structures that perform complex operations while being commutative, thus requiring no coordination (Preguica et al. 2009). CRDTs are also used in the context of eventual consistency by providing a means to converge operations to the same, eventually consistent state. Also, they are used in coordination-free protocols to converge operations that do not need to be synchronously coordinated (Bailis et al. 2014).

CALM (Alvaro et al. 2011) is a principle that shows that a distributed program can be executed without coordination if the program's logic is monotone. This enables avoiding coordination by building programs that ensure logical monotonicity, in addition to using analysis techniques to check whether a given program is logically monotonic. Bloom is a domain-specific declarative language that facilitates leveraging the CALM principle to build coordination-free programs. In this approach, the whole program does not have to be monotonic; rather, it is possible to detect potential anomalies that need to be addressed by the developer.

Application-Level Correctness Semantics
Many data management system works utilize application-level invariants and correctness semantics to avoid coordination (Lamport 1976; Agrawal et al. 1993, 1994; Garcia-Molina 1983; Li et al. 2012; Roy et al. 2015; Bailis et al. 2014). In this approach, operations are allowed to execute and commit without coordination if the application-level correctness semantics are not violated. Lamport (1976) (as described in Agrawal et al. 1993) is the first to show that application-level semantics can be used to avoid coordination between certain types of transactions. Garcia-Molina (1983) introduced a design of a concurrency control protocol that leverages application-level correctness semantics. The protocol receives application-level semantics from users and then uses them to allow nonserializable schedules that do not violate the application's correctness semantics. To achieve this, the protocol divides the transaction into steps and derives what steps can interleave with each other. Other works adopt a similar approach of dividing the transaction to find interleaving between the parts of the transaction. A set of works has shown that performing this division hierarchically may lead to better results (Lynch 1983; Weikum 1985; Farrag and Özsu 1989). A formalization of the correctness of application-level consistency definitions is provided in Korth and Speegle (1988).

Defining how the application-level consistency semantics are expressed is important for the practicality of the solution (i.e., how accessible is the interface to developers) and the performance gains of the solution (i.e., how the interface can maximize the amount of captured knowledge). Agrawal et al. (1993) propose the use of special operations, called consistency assertions, that capture a user's view of the database and the criterion for correct concurrency execution. Two correctness criteria are developed to leverage the special operations by making the database allow concurrent execution of two transactions if the consistency assertions are not violated. Relative serializability (Agrawal et al. 1994) extends the transaction model by allowing users to define the atomicity requirements of transactions relative to each other. Then, it is possible to interleave parts of transactions that do not violate the defined atomicity requirements.

The use of application-level correctness invariants is especially important for large-scale and geo-scale deployments where the coordination cost is high. This has motivated the development of more solutions that use application-level correctness invariants for these new architectures (Li et al. 2012; Roy et al. 2015; Bailis et al. 2014; Zhang et al. 2013). Gemini (Li et al. 2012) enables the coexistence of fast, eventually consistent operations with strongly consistent, slow operations. Therefore, operations are executed with high performance, and the cost of enforcing consistency is only paid when necessary. To facilitate this, a consistency model, called RedBlue consistency, is proposed. Using the consistency formulation and application-level consistency invariants, operations are assigned to
be either fast or consistent. The assignment is made with the guarantee that executing fast operations without synchronous coordination would not lead to violations of the application-level consistency semantics. Specifically, fast operations execute locally at one of the data centers and then lazily propagate to other data centers. Fast operations are only guaranteed to be causally ordered. Consistent operations, on the other hand, are synchronously synchronized, potentially with other data centers, leading to high latency.

The Homeostasis protocol (Roy et al. 2015) aims at automating the process of generating application-level correctness invariants. A program analysis technique is used to extract the application-level consistency invariants from code. In the Homeostasis protocol, the target consistency level is serializability, and the code is analyzed to extract the invariants that will lead to a serializable execution as observed by users. Application developers can then refine the extracted application-level correctness invariants. The Homeostasis protocol may enable and inspire future work on the use of automatic extraction of correctness invariants that improve their accuracy. However, it is also useful as a tool to bootstrap the process of specifying application-level correctness invariants.

Transaction chains (Zhang et al. 2013) leverage transaction chopping (Shasha et al. 1995) to commit transactions in a geo-scale deployment locally without waiting for other data centers. Transaction chopping (Shasha et al. 1995) is a theoretical framework that enables decomposing transactions into smaller pieces. The original goal of transaction chopping is to allow more concurrency. Transaction chains (Zhang et al. 2013) use the concepts of transaction chopping to decompose transactions. Then, they introduce constraints on the execution of transactions. Transaction parts are executed in the same order at the different nodes (potentially at different data centers). Also, instances of the same transaction are ordered at the first node that handles that type of transactions. It turns out that these two constraints, in addition to transaction chopping, allow guaranteeing the success of the execution of the transaction after the success in the first node. Thus, rather than waiting for the response from multiple nodes at multiple data centers, a transaction can execute and be guaranteed to succeed after execution at the first node only. Transaction chains incorporate many annotation and commutativity techniques to enable more flexibility in the schedules and constraints that are enforced on transactions.

Weak and Relaxed Consistency
Another method to avoid coordination is to weaken or relax the consistency guarantees. Eventual consistency (Cooper et al. 2008; DeCandia et al. 2007; Bailis and Ghodsi 2013) only guarantees that the data copies are eventually consistent in the absence of updates. Therefore, coordination can be performed asynchronously, and convergence can be achieved via commutative and converging operators. Causality (Lamport 1978) is a stronger guarantee that ensures that events at one node are totally ordered and that events at one node are ordered after everything that has been received by the point they are generated. Causality does not require synchronous coordination, which makes it a candidate for many environments with high coordination costs (Lloyd et al. 2011, 2013; Bailis et al. 2013; Nawab et al. 2015; Du et al. 2013). Extensions of causality that consider causal transactions rather than operations or events are also shown to require no synchronous coordination (Lloyd et al. 2011, 2013; Nawab et al. 2015). Parallel snapshot isolation (PSI) (Sovran et al. 2011) extends snapshot isolation (Berenson et al. 1995) for geo-scale environments. This extension allows many types of transactions to execute without the need for synchronous coordination between data centers. Transactions with write-write conflicts, however, are still required to synchronously coordinate.

Versioning and Snapshots
Serializability is a guarantee that the outcome of transactions mimics an outcome of a serial execution. This gives read-only transactions the opportunity to execute without coordination by reading from a consistent snapshot of data.
The outcome of such read-only transactions might be stale but consistent. However, the challenge in read-only transactions is in the development of methods that minimize the interference with read-write transactions. This is especially the case for analytics queries that translate to large, long-lived read-only transactions. Schlageter (1981) showed that read-only transactions can be executed without interfering with read-write transactions. Specifically, in an optimistic concurrency control protocol, read-only transactions do not need to be validated as long as they reflect the consistent state of the database. However, read-write transactions wait for read-only transactions to finish. This may lead to long delays with large read-only transactions. To overcome this limitation, multiversioning techniques were proposed to execute read-only transactions on a recent, possibly stale snapshot of the database (Agrawal et al. 1987). In a replicated system, it was shown that a read-only transaction can avoid coordination with other replicas by reading a local, consistent copy (Satyanarayanan and Agrawal 1993). This entails modifying the coordination of read-write transactions and additional overhead to enable creating a consistent snapshot locally at every replica. Recently, Spanner (Corbett et al. 2012) shows the use of accurate time synchronization to serve consistent read-only transactions with improved freshness.

Conclusion

Coordination is continuing to be a major challenge in data management systems due to large-scale and geo-scale deployments. Coordination avoidance techniques enable reducing the overhead of coordination, thus improving performance, while maintaining the ease of use and intuition of data management abstractions and strong consistency guarantees. There have been many ways that can be used individually or together to avoid coordination, such as leveraging commutativity and convergence, application-level consistency semantics, relaxed consistency guarantees, and specialized transaction types. These solutions provide different levels of ease of use, generality, and optimality, making them suitable for different classes of applications and use cases.

Examples of Application

Coordination avoidance is used in traditional and cloud data management systems.

Future Directions for Research

The emergence of many new technologies, system architectures, and applications invites the investigation of new coordination avoidance techniques and adaptations of existing coordination avoidance techniques.

Cross-References

Geo-Replication Models
Geo-Scale Transaction Processing
Weaker Consistency Models/Eventual Consistency

References

Agrawal D, Bernstein AJ, Gupta P, Sengupta S (1987) Distributed optimistic concurrency control with reduced rollback. Distrib Comput 2(1):45–59. https://doi.org/10.1007/BF01786254
Agrawal D, El Abbadi A, Singh AK (1993) Consistency and orderability: semantics-based correctness criteria for databases. ACM Trans Database Syst (TODS) 18(3):460–486
Agrawal D, Bruno JL, El Abbadi A, Krishnaswamy V (1994) Relative serializability (extended abstract): an approach for relaxing the atomicity of transactions. In: Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems. ACM, pp 139–149
Alvaro P, Conway N, Hellerstein JM, Marczak WR (2011) Consistency analysis in bloom: a calm and collected approach. In: CIDR, pp 249–260
Badrinath B, Ramamritham K (1992) Semantics-based concurrency control: beyond commutativity. ACM Trans Database Syst (TODS) 17(1):163–199
Bailis P, Ghodsi A (2013) Eventual consistency today: limitations, extensions, and beyond. Queue
Shapiro M, Preguiça N, Baquero C, Zawirski M (2011) Convergent and commutative replicated data types. Bull Eur Assoc Theor Comput Sci (104):67–88
Shasha D, Llirbat F, Simon E, Valduriez P (1995) Transaction chopping: algorithms and performance studies. ACM Trans Database Syst (TODS) 20(3):325–363
Sovran Y, Power R, Aguilera MK, Li J (2011) Transactional storage for geo-replicated systems. In: Proceedings of the twenty-third ACM symposium on operating systems principles, SOSP'11. ACM, New York, pp 385–400. https://doi.org/10.1145/2043556.2043592
Weikum G (1985) A theoretical foundation of multi-level concurrency control. In: Proceedings of the fifth ACM SIGACT-SIGMOD symposium on principles of database systems. ACM, pp 31–43
Zhang Y, Power R, Zhou S, Sovran Y, Aguilera MK, Li J (2013) Transaction chains: achieving serializability with low latency in geo-distributed storage systems. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles, SOSP'13. ACM, New York, pp 276–291. https://doi.org/10.1145/2517349.2522729

Co-resident Attack in Cloud Computing: An Overview

Sampa Sahoo, Sambit Kumar Mishra, Bibhudatta Sahoo, and Ashok Kumar Turuk
National Institute of Technology Rourkela, Rourkela, India

Introduction

Cloud computing is a novel computing model that leverages on-demand provisioning of computational and storage resources. It uses a pay-as-you-go economic model, which helps to reduce CAPEX and OPEX. The essence and scope of cloud computing can best be described by the "5-3-2 principle," where 5 stands for its essential characteristics (on-demand service, ubiquitous network access, resource pooling, rapid elasticity, and measured service) (Puthal et al. 2015), 3 stands for the service delivery models (SaaS, PaaS, and IaaS), and 2 points to the basic deployment models (private and public) of the cloud (Sahoo et al. 2015).

Figure 1 shows the layered architecture of a cloud. The layers of a cloud are as follows: application, virtualization, and physical. The application layer consists of user applications. The virtualization layer's main components are a hypervisor, or Virtual Machine Manager (VMM), and the VMs. A hypervisor acts as an administrator for the virtualized data center that permits the user to configure and manage cloud resources to create and deploy virtual machines and services in a cloud. A VM is an emulation of a physical machine. Virtualization technology allows creating several VMs on a single physical machine (Mishra et al. 2018a). The physical layer is a container of various physical resources like memory, RAM, etc. (Sahoo et al. 2016).

The rise of multimedia data and ever-growing requests for multimedia applications have made multimedia the biggest contributor to big data. Entertainment applications demand low security compared to personalized videos such as business meetings, telemedicine, etc. Secured video transmission ensures authorized access and prevents information leakage by any unapproved eavesdroppers. Various cryptographic algorithms (AES, DES, etc.) can be used to guarantee security for multimedia applications. A few examples of big data are credit card transactions, Walmart customer transactions, and data generated by Facebook users. A security breach in big data will cause more severe legal consequences and reputational damage than at present. The use of big data in business also gives rise to privacy issues (Inukollu et al. 2014). Big data applications demand scalable and cost-efficient technology for huge data storage, computation, and communication. One solution to address these issues is the use of cloud computing, which is both scalable and cost-efficient. Since cloud computing is an umbrella term covering many technologies like networks, databases, virtualization, operating systems, resource scheduling, transmission management, load balancing, etc., security issues related to these systems are also applicable to cloud computing.

Cloud computing facilitates business models and allows renting of IT resources on a utility-like basis through the Internet. Security is one of the most challenging issues to deal with in cloud computing due to numerous forms of
data, or computation is corruption-free, which somehow preserves privacy.

The following challenges are faced by cloud providers in building a secure and trustworthy cloud system.

• Outsourcing: A cloud user outsources its content to the cloud provider's data center and loses direct physical control over the data. The computing and data storage are done at the cloud provider's end, which is the main cause of cloud insecurity. Outsourcing may also incur privacy violations.
• Multi-tenancy: Since multiple customers share the cloud resources, it gives rise to co-residency issues. Cross-VM or co-resident VM attacks exploit the multi-tenancy nature.
• Massive data and intense computation: Traditional security methods may not suffice for massive data storage and computation.

Various Attacks on VM

VMs are the principal computing unit in the cloud computing environment; they facilitate efficient utilization of hardware resources and reduce the maintenance overhead of computing resources. VMs provide an additional layer of hardware abstraction and isolation but require quality methods to prevent attackers from accessing information from other VMs and the host. Vulnerabilities in VMs pose an immediate threat to the privacy and security of cloud services. The shared and distributed nature of cloud resources makes it difficult to develop a security standard that ensures data security and privacy. Vulnerabilities in the cloud are the security loopholes which allow unauthorized access to the network and infrastructure resources. Vulnerabilities in virtualization or the hypervisor enable an attacker to perform a cross-VM side-channel attack, and a DoS attack provides access to a legitimate user's VM. Solution directives for VM attacks include alert messages for any breach of VM isolation, integrity and security assurance of VM images, a vulnerability-free virtual machine manager (VMM) or hypervisor, etc. (Modi et al. 2013). Different classes of attacks (Hyde 2009) that affect VMs are:

• Co-resident attack: An attacker uses an adversary VM to communicate with other colocated VMs on the same host, thus violating the isolation feature of VMs.
• Attack on hypervisor: An attack on the hypervisor allows the attacker to access the VMs, the host operating system, and the hardware.
• DoS attack: DoS attacks are endeavors by an unapproved user to degrade or deny resources to a genuine user. A DoS attack consumes resources from all VMs on the host.

Co-resident Attack

Colocated VMs are logically separated from each other. Users migrating to the cloud are exposed to an additional threat caused by neighbors due to the sharing of resources. This exposure raises questions about neighbors' trustworthiness and integrity, as a malicious user can build a side channel to circumvent logical isolation and extract sensitive information from co-resident VMs (Han et al. 2017). Like traditional web services, VMs are exposed to various security threats, and the co-resident attack is one of them. Other names for the co-resident attack are the co-residence, co-residency, and colocation attack. Figure 2 shows a co-resident VM attack in the cloud. It shows that the attacker VM and target VM are placed on the same physical machine, where the attacker extracts target VM information through the side channel.

In a co-resident attack, malicious users extract information from colocated virtual machines on the same physical machine by building side channels. The various side-channel attacks include the cache attack, timing attack, power monitoring attack, software-initiated fault attack, etc. In a cache attack, attackers take action based on the cache access of the victim in a virtualized environment like a cloud, whereas a timing attack is based on the computation time of various activities (e.g., the comparison between the attacker's password with
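The timing attack described above exploits data-dependent running time. A minimal, purely illustrative sketch (not a method from this entry) is a string comparison that returns as soon as a character mismatches, so its duration leaks how long the matching prefix is; a constant-time comparison such as Python's hmac.compare_digest removes that signal:

```python
import hmac

def variable_time_equals(secret: str, guess: str) -> bool:
    """Early-exit comparison: running time grows with the length of the
    matching prefix, which a timing attacker can measure and exploit."""
    if len(secret) != len(guess):
        return False
    for a, b in zip(secret, guess):
        if a != b:  # leaks the position of the first mismatch
            return False
    return True

def constant_time_equals(secret: str, guess: str) -> bool:
    """Comparison whose duration does not depend on where the inputs differ."""
    return hmac.compare_digest(secret.encode(), guess.encode())

if __name__ == "__main__":
    print(variable_time_equals("s3cret", "s3cret"))  # True
    print(constant_time_equals("s3cret", "guess!"))  # False
```

In practice an attacker averages many measurements to filter out noise; the defense is to make the comparison time independent of the secret, as in the second function.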
workloads using only a simple CRUD (create, read, update, and delete) interface. Since each database system provides at least this minimal subset of operations, CRUD benchmarks allow a general performance comparison of a variety of non-relational NoSQL database systems.

Overview

For many decades, relational database management systems (RDBMSs) have been the first choice for business applications. A high degree of standardization, especially the standard query language SQL, made it possible to develop domain-specific benchmarks for RDBMSs. The DebitCredit benchmark by Jim Gray laid down the foundation for standardized database benchmarks (Anon et al. 1985). It introduced most of the key ideas that were later taken up by the Transaction Processing Performance Council (TPC) for the specification of the TPC-A benchmark and its famous successor, TPC-C (2010). The TPC was founded in 1988 as an industry-consensus body to fulfill the need for benchmark certification. Previously, many RDBMS vendors had conducted benchmarks based on their own interpretation of DebitCredit. Since its foundation, the TPC has developed numerous benchmarks for various kinds of online transactional processing (OLTP) and also online analytical processing (OLAP) workloads.

However, the amount of useful data in some applications has become so vast that it cannot be stored or processed by traditional relational database solutions (Hu et al. 2014). At the beginning of the twenty-first century, this led to the development of non-relational database systems able to cope with such Big Data. These systems are subsumed under the term NoSQL database systems. They offer horizontal scalability and high availability by sacrificing the querying capabilities, strict ACID guarantees, and transactional support known from RDBMSs. Since most TPC benchmark specifications assume ACID-compliant transactions, they cannot be applied to mostly non-transactional NoSQL databases. Defining similar domain-specific standards for those systems is even more complicated due to the sheer heterogeneity in their data models and APIs.

Therefore, many NoSQL vendors and open source communities implemented their own database-specific performance measurement tools, just as the RDBMS vendors did before TPC. These include, for example, the Cassandra-stress tool, Mongoperf for MongoDB, the Redis-benchmark utility, or Basho_bench for Riak. In the end, the results of these tools are often difficult to understand, cannot be extended to other scenarios, and do not allow comparisons between different databases.

Eventually, the Yahoo! Cloud Serving Benchmark (YCSB) was proposed (Cooper et al. 2010). YCSB is limited to a simple CRUD interface and has become the de facto standard for NoSQL database performance comparison, which is why the term NoSQL benchmark is often used interchangeably with CRUD benchmark. Since YCSB is open source and extensible, the developer community has added support for almost every NoSQL database system since its release in 2010.

Scientific Fundamentals

The Benchmark Handbook for Database and Transaction Systems edited by Jim Gray (1993) can be regarded as the first reference work on database benchmarking. It identifies four requirements that a domain-specific benchmark must meet:

1. Relevance means that the benchmark should be based on a realistic workload such that the results reflect something important to their readers.
2. Simplicity describes that the workload should be understandable.
3. Portability means that it should be easy to run the benchmark against a large number of systems.
4. Scalability says that it should be possible to scale the benchmark up to larger systems as performance and architectures evolve.

These requirements have been refined over time to reflect the current developments of database systems (Huppler 2009; Folkerts et al. 2013). The book Cloud Service Benchmarking by Bermbach et al. (2017) is an excellent modern reference work in which these design objectives are also discussed in detail with regard to cloud computing and Big Data requirements. Regarding the four requirements mentioned, CRUD benchmarks significantly weaken the relevance aspect in order to achieve the greatest possible portability, simplicity, and scalability. They do not model a real domain-specific application workload like the TPC standard specifications but a simpler one in which the choice of CRUD operations is selected according to a given probability distribution of real web application workloads (workload generation). The typical metrics measured during the benchmark run are throughput (operations per second) and latency (response time in milliseconds). CRUD benchmarks also define hardly any restrictions on the system under test in favor of portability. The actual data values consist of simple synthetically generated random character strings (data generation). Overall, this makes them less relevant for concrete applications but allows measuring the performance of highly scalable NoSQL systems in the first place.

Hence, CRUD benchmarks are also often used and extended in research to measure additional performance aspects of distributed NoSQL database systems. An overview of these research efforts can be found in Friedrich et al. (2014), and the book by Bermbach et al. (2017) can also be recommended as a reference for the current state of measurement methods for nonfunctional properties such as scalability, elasticity, and consistency of NoSQL data management solutions.

It should be noted that the basics of scalability and elasticity benchmarking were already determined by the Wisconsin benchmark, an early performance evaluation framework for RDBMSs (DeWitt 1993). It was by no means as popular as DebitCredit and the TPC benchmarks for comparative studies by RDBMS vendors, but it was widely used in research for measuring the performance of parallel database system prototypes designed for shared-nothing hardware architectures. As these database prototypes should already meet horizontal scalability and elasticity requirements, the benchmark defined the key metrics scale-up and speedup to measure them.

Current research work is further developing domain-specific benchmarks for certain classes of NoSQL database systems, such as document stores, that go beyond pure CRUD benchmarks. This allows comparing the performance of more sophisticated query capabilities in the context of a particular application. Reniers et al. (2017) give an up-to-date overview of these domain-specific NoSQL benchmarks and CRUD benchmarks.

Key Applications

In addition to the four requirements that a benchmark must meet, Gray (1993) also describes four motivations for database benchmarking, which can be extended to NoSQL database performance evaluation:

1. Different proprietary software and hardware systems: The classical competitive situation between two relational database vendors running their solution on their own dedicated specialized hardware can nowadays, for example, be transferred to proprietary NoSQL database-as-a-service (DBaaS) solutions as provided by cloud service providers.
2. Different software products on the same hardware: This is the most common form in which NoSQL database system vendors compare their systems with others. The different databases are set up in a cluster of the same hardware or cloud infrastructure, and their performance is usually measured with a selection of different workloads.
3. The same software product on different hardware systems: The database workload serves as a reference application for comparing different underlying hardware platforms.
Nowadays, this can mean that users benchmark their NoSQL database system on different infrastructure-as-a-service (IaaS) solutions to select the most performant or cost-effective cloud provider.
4. Different releases of a software product on the same hardware: The benchmarking of newer versions of a database system is an integral part of the modern agile software development process. This is one of the major reasons why NoSQL developers often implement database-specific performance measurement tools.

Comparative studies are primarily promoted by database vendors themselves for marketing and advertising purposes. Since there is no instance like the TPC for the verification of benchmark results in the NoSQL area, these results should always be looked at critically, because the workload parameters, the hardware, and the presentation of the results are selected in such a way that the vendor's own product performs particularly well. This kind of benchmarketing already existed for RDBMSs in the early 1980s.

In addition to comparative studies, one of the main motivations for benchmarking is systems research, i.e., understanding the performance behavior of certain system designs or understanding quality trade-offs like the consistency/latency trade-off of different data replication techniques.

Future Directions

The research in the field of benchmarking nonfunctional properties of distributed NoSQL database systems is still at an early stage. For example, a number of methods for measuring the consistency of replicated databases have already been proposed. Apart from further studies on scalability and elasticity, there is only little work on the topic of availability benchmarking of NoSQL database systems, especially not under different failure conditions. Therefore, further research in this area is still needed to develop standard methodologies for measuring these requirements.

Furthermore, there is a variety of possible benchmark designs between simple CRUD benchmarks and application-oriented benchmarks for one specific NoSQL data model that still need to be explored. One can imagine domain-specific workloads that are still general enough to allow specific implementations for all the different data models.

Cross-References

NOSQL Database Systems
System Under Test
YCSB

References

Anon, Bitton D, Brown M, Catell R, Ceri S, Chou T, DeWitt D, Gawlick D, Garcia-Molina H, Good B, Gray J, Homan P, Jolls B, Lukes T, Lazowska E, Nauman J, Pong M, Spector A, Trieber K, Sammer H, Serlin O, Stonebraker M, Reuter A, Weinberger P (1985) A measure of transaction processing power. Datamation 31(7):112–118
Bermbach D, Wittern E, Tai S (2017) Cloud service benchmarking: measuring quality of cloud services from a client perspective. Springer, Cham
Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R (2010) Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM symposium on cloud computing, SoCC'10. ACM, New York, pp 143–154
DeWitt DJ (1993) The wisconsin benchmark: past, present, and future. In: Gray J (ed) The benchmark handbook. Morgan Kaufmann, San Mateo
Folkerts E, Alexandrov A, Sachs K, Iosup A, Markl V, Tosun C (2013) Benchmarking in the cloud: what it should, can, and cannot be. Springer, Berlin/Heidelberg, pp 173–188
Friedrich S, Wingerath W, Gessert F, Ritter N (2014) NoSQL OLTP benchmarking: a survey. In: 44. Jahrestagung der Gesellschaft für Informatik, Informatik 2014, Big Data – Komplexität meistern, 22–26 Sept 2014 in Stuttgart, pp 693–704
Gray J (ed) (1993) The benchmark handbook for database and transaction systems, 2nd edn. Morgan Kaufmann, San Mateo
Hu H, Wen Y, Chua TS, Li X (2014) Toward scalable systems for big data analytics: a technology tutorial. IEEE Access 2:652–687
Huppler K (2009) The art of building a good benchmark. Springer, Berlin/Heidelberg, pp 18–30
Reniers V, Van Landuyt D, Rafique A, Joosen W (2017) On the state of NoSQL benchmarks. In: Proceedings of the 8th ACM/SPEC on international conference on performance engineering companion, ICPE'17 companion. ACM, New York, pp 107–112
TPC-C (2010) Benchmark specification. http://www.tpc.org/tpcc

Cryptocurrency

Blockchain Transaction Processing
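The workload model described in the preceding CRUD Benchmarks entry (operations drawn from a probability distribution over CRUD calls, synthetic random-string values, throughput and latency as metrics) can be sketched in a few lines. The in-memory DictStore below is a hypothetical stand-in for a real database driver, and the 95/5 read/update mix is only an illustrative assumption:

```python
import random
import string
import time

class DictStore:
    """Illustrative stand-in for a NoSQL client exposing the minimal CRUD interface."""
    def __init__(self):
        self.data = {}
    def create(self, key, value): self.data[key] = value
    def read(self, key): return self.data.get(key)
    def update(self, key, value): self.data[key] = value
    def delete(self, key): self.data.pop(key, None)

def random_string(n=16):
    # data generation: synthetic random character strings
    return "".join(random.choices(string.ascii_letters, k=n))

def run_benchmark(store, num_ops=10_000, read_fraction=0.95):
    """Workload generation: pick CRUD operations according to a probability
    distribution (here 95% reads, 5% updates), then report throughput/latency."""
    keys = [f"user{i}" for i in range(100)]
    for k in keys:
        store.create(k, random_string())  # load phase
    latencies = []
    start = time.perf_counter()
    for _ in range(num_ops):
        key = random.choice(keys)
        t0 = time.perf_counter()
        if random.random() < read_fraction:
            store.read(key)
        else:
            store.update(key, random_string())
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {"throughput_ops_per_s": num_ops / elapsed,
            "avg_latency_ms": 1000 * sum(latencies) / len(latencies)}

print(run_benchmark(DictStore()))
```

A YCSB-style harness would replace DictStore with a driver for the system under test and typically also report percentile latencies rather than only the average.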
Data Archives

Data Longevity and Compatibility

Data Archiving

Computing the Cost of Compressed Data

Data Cleaning

Xu Chu
School of Computer Science, Georgia Institute of Technology, Atlanta, GA, USA

Synonyms

Data cleansing

Definitions

Data cleaning is used to refer to all kinds of tasks and activities to detect and repair errors in the data.

Overview

Enterprises have been acquiring large amounts of data from a variety of sources to build their own "data lakes," with the goal of enriching their data assets and enabling richer and more informed analytics. Data collection and acquisition often introduce errors in data, for example, missing values, typos, mixed formats, replicated entries for the same real-world entity, outliers, and violations of business rules.

Kaggle's 2017 survey about the state of data science and machine learning reveals that dirty data is the most common barrier faced by workers dealing with data (Kaggle 2017). Not surprisingly, developing effective and efficient data cleaning solutions is challenging and rich with deep theoretical and engineering problems.

There are various surveys and books on different aspects of data quality and data cleaning. Rahm and Do (2000) give a classification of different types of errors that can happen in an extract-transform-load (ETL) process and survey the tools available for cleaning data in an ETL process; Bertossi (2011) provides complexity results for repairing inconsistent data and performing consistent query answering on inconsistent data; Hellerstein (2008) focuses on cleaning quantitative data, such as integers and floating points, using mainly statistical outlier detection techniques; Fan and Geerts (2012) discuss the use of data quality rules in data consistency, data currency, and data completeness and how different
© Springer Nature Switzerland AG 2019
S. Sakr, A. Y. Zomaya (eds.), Encyclopedia of Big Data Technologies,
https://doi.org/10.1007/978-3-319-77525-8
aspects of data quality issues might interact; Dasu and Johnson (2003) summarize how techniques in exploratory data mining can be integrated with data quality management; Ilyas and Chu (2015) provide taxonomies and example algorithms of qualitative error detection and repairing techniques; and Ganti and Das Sarma (2013) focus on an operator-centric approach for developing a data cleaning solution, which involves the development of customizable operators that could be used as building blocks for developing common solutions.

This chapter covers four commonly encountered data cleaning tasks, namely, outlier detection, rule-based data cleaning, data transformation, and data deduplication. Since there is a very large body of work on these tasks, this chapter only intends to provide an introduction to each data cleaning task and categorize various techniques proposed in the literature to tackle each task. There are also many complementary research efforts to data cleaning that are not covered in this chapter, such as data profiling, provenance and metadata management, data discovery, consistent query answering, schema mapping, and data exchange.

Key Research Findings

Outlier Detection
Outlier detection refers to the activity of detecting outlying values and is a quantitative error detection task. While an exact definition of an outlier depends on the application, there are some commonly used definitions, such as "an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins 1980). For example, for a company whose employees' salaries are mostly around 100 K, an employee with a salary of 10 K can be considered to be an outlying record.

Multiple surveys and tutorials have been published to summarize different definitions of outliers and algorithms for detecting them (Hodge and Austin 2004; Chawla and Sun 2006; Aggarwal 2013). In general, outlier detection techniques fall into three categories: statistics-based, distance-based, and model-based. Statistics-based outlier detection techniques assume that the normal data points would appear in high-probability regions of a stochastic model, while outliers would occur in the low-probability regions of a stochastic model (Grubbs 1969). They can often provide a statistical interpretation for discovered outliers, or a score/confidence interval for a data point being an outlier, rather than making a binary decision. Distance-based outlier detection techniques often define a distance between data points, which is used for defining normal behavior (for example, a normal data point should be close to a lot of other data points), and data points that deviate from such normal behavior are declared outliers (Knorr and Ng 1998). A major advantage of distance-based techniques is that they are unsupervised in nature and do not make any assumptions regarding the generative distribution for the data. Instead, they are purely data driven. Model-based outlier detection techniques first learn a classifier model from a set of labeled data points and then apply the trained classifier to a test data point to determine whether it is an outlier (De Stefano et al. 2000). Model-based approaches assume that a classifier can be trained to distinguish between the normal data points and the anomalous data points using the given feature space. They label data points as outliers if none of the learned models classify them as normal points.

Enforcing Data Quality Rules
One common way to detect and rectify errors in databases is through the enforcement of data quality rules, often expressed as integrity constraints (ICs) (Chu et al. 2013; Fan and Geerts 2012). In this context, (qualitative) error detection is the process of identifying violations of ICs, namely, subsets of records that cannot coexist together; and error repair is the exercise of modifying the database such that the violations are resolved and the new data conforms to the given ICs.

Figure 1 shows a typical workflow of qualitative data cleaning. In order to clean a dirty dataset using qualitative data cleaning techniques, there
need to be data quality rules that reflect the semantics of the data. One way to obtain data quality rules is by consulting domain experts, which requires a lot of domain expertise and is usually a time-consuming process. An alternative approach is to design an algorithm to automatically discover data quality rules. Given a dirty dataset and the associated data quality rules, the error detection step detects violations of the specified rules in the data. A violation is a minimal set of cells in a dataset that cannot coexist together. Finally, given the violations and the rules that generate those violations, the error repair step produces data updates, which are applied to the dirty dataset. The error detection and error repair loop goes on until the data conforms to the data quality rules, namely, there are no violations in the data.

Example 1 Consider Table 1, which contains employee records in a company. Every tuple specifies a person in a company with her id (TID), name (FN, LN), level (LVL), zip code (ZIP), state (ST), and salary (SAL). Suppose two data quality rules hold for this table. The first rule states that, if two employees have the same zip code, they must have the same state. The second rule states that among employees working in the same state, a higher-level employee cannot earn a lower salary than a lower-level employee.

Data Cleaning, Table 1 An example employee table

TID FN   LN    LVL ZIP   ST SAL
t1  Anne Nash  5   10001 NM 110,000
t2  Mark White 6   87101 NM 80,000
t3  Jane Lee   4   10001 NY 75,000

Given these two data quality rules, the error detection step detects two violations. The first violation consists of four cells {t1[ZIP], t1[ST], t3[ZIP], t3[ST]}, which together violate the first data quality rule. The second violation consists of six cells {t1[LVL], t1[ST], t1[SAL], t2[LVL], t2[ST], t2[SAL]}, which together violate the second data quality rule. The data repair step takes the violations and produces an update that changes t1[ST] from "NM" to "NY," and the new data now has no violation with respect to the two rules.

Data Transformation
Data transformation refers to the task of transforming data from one format to another. Data transformation tasks can generally be classified into two types: syntactic data transformations and semantic transformations. Syntactic transformations aim at transforming a table from one syntactic format to another, often without requiring external knowledge or reference data. On the other hand, semantic transformations usually involve understanding the meaning/semantics of the data, instead of treating it simply as a sequence of characters; hence, they usually require referencing external data sources.

Example 2 Figure 2a shows a syntactic transformation, where tabular data containing person names, telephone numbers, and fax numbers, originally arranged with phone and fax numbers stored in the same column as a complex data type, is flattened into a normal relational table representation to enable efficient querying of this data. In addition, the telephone numbers and the fax numbers are also transformed by adding two dashes between the nine digits. Both cases do not require external knowledge to perform these transformations.
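The detect-and-repair loop of Example 1 can be sketched for its two rules. The dictionary encoding of Table 1 below is an illustrative assumption, and the "repair" is simply the update named in the example applied directly, not a general repair algorithm:

```python
from itertools import combinations

# Table 1 from Example 1: TID -> attribute/value pairs
table = {
    "t1": {"FN": "Anne", "LN": "Nash",  "LVL": 5, "ZIP": "10001", "ST": "NM", "SAL": 110_000},
    "t2": {"FN": "Mark", "LN": "White", "LVL": 6, "ZIP": "87101", "ST": "NM", "SAL": 80_000},
    "t3": {"FN": "Jane", "LN": "Lee",   "LVL": 4, "ZIP": "10001", "ST": "NY", "SAL": 75_000},
}

def detect_violations(table):
    """Return violations of Example 1's two rules as minimal sets of cells."""
    violations = []
    for (i, r), (j, s) in combinations(table.items(), 2):
        # Rule 1 (functional dependency ZIP -> ST): same zip implies same state.
        if r["ZIP"] == s["ZIP"] and r["ST"] != s["ST"]:
            violations.append({(i, "ZIP"), (i, "ST"), (j, "ZIP"), (j, "ST")})
        # Rule 2: within one state, a higher-level employee cannot earn less.
        if r["ST"] == s["ST"]:
            hi, lo = (r, s) if r["LVL"] > s["LVL"] else (s, r)
            if hi["SAL"] < lo["SAL"]:
                violations.append({(i, "LVL"), (i, "ST"), (i, "SAL"),
                                   (j, "LVL"), (j, "ST"), (j, "SAL")})
    return violations

print(len(detect_violations(table)))  # 2, the two violations of Example 1
table["t1"]["ST"] = "NY"              # the repair from Example 1
print(len(detect_violations(table)))  # 0, the data now satisfies both rules
```

Real systems iterate this loop with a repair algorithm that chooses updates automatically (e.g., minimizing the number of changed cells) instead of applying a hand-picked update.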
Data Cleaning, Fig. 2 Example transformations. (a) Syntactic transformation. (b) Semantic transformation

Figure 2b shows a semantic transformation, where country codes are transformed into country names; obviously, it requires external knowledge, such as an external knowledge base that contains both full and abbreviated country names.

There are generally three major components/dimensions of a syntactic data transformation system: language, authoring, and execution. Since there are many ways to transform a dataset, syntactic data transformation solutions usually adopt a transformation language that limits the space of possible transformations. The language could consist of a finite set of operations allowed on a table (Raman and Hellerstein 2001; Kandel et al. 2011), such as splitting a column and merging two columns, or it could consist of a set of functions for string manipulations (Gulwani 2011), such as extracting substrings and concatenating two substrings. There is a balance that needs to be achieved when choosing a particular language: it needs to be expressive enough so that it captures many real-world transformation tasks and, at the same time, restricted enough to allow for effective and easy authoring of the transformations. For a specific transformation task, syntactic transformation tools have different interaction models with the users to author a transformation program using the adopted language. The interaction models can generally be classified as declarative, example-driven, or proactive. In the declarative model, users specify the transformations directly, usually through a visual interface. In the example-driven model, users are required to give a few input-output examples, based on which the tool automatically infers likely transformations that match the provided examples. In the proactive interaction model, the transformation tool provides hints or highlights for which data needs transformation and sometimes even suggests possible transformations, so that users simply need to accept or reject the suggestions. Finally, transformation execution applies the specified transformations to the dataset. Since there might be multiple specified transformations from the previous step, transformation tools usually offer assistance to the users in selecting the desired transformation, for example, by displaying the effect of the specified transformation immediately on a sample of the data or by providing interpretable statements of the specified transformations. Different transformation tools offer different capabilities along these three dimensions.

Semantic transformations usually involve understanding the meaning/semantics, or the typical use, of the data, instead of treating it simply as a sequence of characters. In contrast to syntactic transformations, semantic transformations cannot be computed by solely looking at input values; rather, they usually require external data sources to look up the values that need to be transformed. Example-driven techniques have been mostly used for semantic transformations (Singh and Gulwani 2012; Abedjan et al. 2016). Users
are usually asked to provide a few input-output examples, and semantic transformation tools then will search the available external data sources to find a "query" that matches the examples. The query can then be used to transform new, unseen data.

Data Deduplication

Data deduplication, also known as duplicate detection, record linkage, record matching, or entity resolution, refers to the process of identifying tuples in one or more relations that refer to the same real-world entity (Elmagarmid et al. 2007). A data deduplication process usually involves many steps and choices, including designing similarity metrics to evaluate the similarity for a pair of records, training classifiers to determine whether a pair of records are duplicates or not, clustering all records to obtain clusters of records that represent the same real-world entity, consolidating clusters of records to unique representations, designing blocking or distributed strategies to scale up the deduplication process, and involving humans to decide whether a record pair are duplicates when machines are uncertain.

Example 3 Figure 3 illustrates a typical example of data deduplication. The similarities between pairs of records are computed and are shown in the similarity graph (upper right graph in Fig. 3). The missing edges between any two records indicate that they are non-duplicates. Records are then clustered together based on the similarity graph. Suppose the user sets the threshold to be 0.5, namely, any record pairs having similarity greater than 0.5 are considered duplicates. Although Records P1 and P5 have similarity less than 0.5, they are clustered together due to transitivity; that is, they are both considered duplicates of Record P2. All records in the same cluster are consolidated into one record in the final clean relation.

Applications of Data Cleaning

Data cleaning is a necessary step in many data-driven analytics. Different data cleaning tasks target different types of errors.

Applications of outlier detection include network intrusion detection, financial fraud detection, and abnormal medical condition
detection. For example, in the context of computer networks, different kinds of data, such as operating system calls and network traffic, are collected in large volumes. Outlier detection can help with detecting possible intrusions and malicious activities in the collected data.

Rule-based data cleaning can help clean any relational database where data quality rules can be defined. The richer the semantics of the data is, the better rule-based data cleaning techniques are at detecting and repairing violations.

Data transformations are used in a variety of tasks and at different stages of the ETL life cycle. For example, before running a data integration project, transformations are often used to standardize data formats, to enforce standard patterns, or to trim long strings. Transformations are also used at the end of the ETL process, for example, to merge clusters of duplicate records, to find a unique representation for a cluster of records (aka the golden record), or to prepare data to be consumed by analytics tools. Data transformations can also be seen as a tool for data repair in rule-based data cleaning, since they can be used to "transform" erroneous data.

Duplicate records can occur for many reasons. For example, a customer might be recorded multiple times in a customer database if the customer used different names at the time of purchase; a single item might be represented multiple times in an online shopping website; and a record might appear multiple times after a data integration project because that record had different representations in the original data sources. Data deduplication targets specifically duplicate records and resolves them.

Future Directions for Research

Data cleaning is a complicated process, and an end-to-end data cleaning solution usually involves many different cleaning sub-tasks. We have discussed techniques for tackling four common data cleaning tasks. There are still many challenges and opportunities in building practical data cleaning systems: (1) The scale of data renders many data cleaning techniques insufficient. New cleaning solutions must adapt to the growing datasets of the Big Data era, for example, by leveraging sampling techniques or distributed computation. (2) Although there is existing research on involving humans to perform data deduplication, for example, through active learning (Tejada et al. 2001; Sarawagi and Bhamidipaty 2002; Arasu et al. 2010), involving humans in other data cleaning tasks, such as repairing IC violations and taking user feedback in the discovery of data quality rules, is yet to be explored. (3) A significant portion of data resides in semi-structured formats (e.g., JSON) and unstructured formats (e.g., text documents). Data quality problems for semi-structured and unstructured data remain largely unexplored. (4) There are significant concerns about data privacy as increasingly more individual data are collected by governments and enterprises. Data cleaning is by nature a task that requires examining and searching through raw data, which may be restricted in some domains including finance and medicine. How to perform most data cleaning tasks while preserving data privacy remains an open challenge.

Cross-References

Data Quality and Data Cleansing of Semantic Data

References

Abedjan Z, Morcos J, Ilyas IF, Ouzzani M, Papotti P, Stonebraker M (2016) Dataxformer: a robust transformation discovery system. In: Proceedings of the 32nd international conference on data engineering, pp 1134–1145
Aggarwal CC (2013) Outlier analysis. Springer, New York
Arasu A, Götz M, Kaushik R (2010) On active learning of record matching packages. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 783–794
Bertossi LE (2011) Database repairing and consistent query answering. Morgan & Claypool Publishers, San Rafael
Chawla S, Sun P (2006) Outlier detection: principles, techniques and applications. In: Advances in knowledge discovery and data mining, 10th Pacific-Asia conference
Data Differencing 541
[Figure: information fusion overview — data/information sources comprising observational means (sensors such as radar, IR, and UAVs; video; ID messages) and contextual items (human reports, activity/behavior, media, ongoing operations such as routes and arrival times, threat status, meteorological maps), together with domain models (applicable rules, configuration, kinematics, physical constraints), feed fusion, hypotheses, and inferences from Level 0 (signal refinement) through Level 4 (process/DBMS refinement) and on to the human-computer interface.]
dividual entities, a problem usually named multi-target tracking, while higher levels, namely, Levels 2 and 3, relate to what is happening around the entities, to understand how information, events, and own actions will impact goals and objectives. Situation assessment (L2) relates to the capacity to obtain a global view of the environment, to describe the relationships between the entities tracked in the environment, and to infer or describe a joint behavior. Impact assessment (L3) relates to the estimation of the impact of a special situation on the application of interest, processing and evaluating situations that are of special relevance to the scenario because they relate to some type of threatening or critical situation, or any other special world state.

Computational Resources for Big Data Fusion: Cloud Computing

Ubiquitous electronic sources such as sensors and video cast a steady stream of dynamic data, while text-based data from databases, the Internet, email, social media, chat, VOIP, etc. is growing exponentially. Therefore, the high data flows, both from devices and from text sources, may make systems unable to process them in real time and lead to time delays, extra costs, and even inappropriate decisions due to missing or incomplete information. Therefore, IF systems should maximize throughput while allowing complex fusion analytics to discover critical information in a huge amount of data. Hence, there is an immediate need to leverage efficient and dynamically scalable data processing infrastructure to analyze big data streams from multiple sources in a timely and scalable manner to establish accurate situation awareness (Szuster et al. 2017).

A complete ICT (Information and Communications Technologies) paradigm shift supports the development and delivery of IF applications depending on high-volume data streams, for instance, emergency management, to avoid getting overwhelmed by incoming data volumes, data sources, or data types. An appropriate trade-off between time and accuracy requires big data analytic architectures to support both batch data analytic processes (for latent but quality solutions) and streaming data analytics (for near real-time situation awareness). This exponential explosion has a tremendous effect on the information fusion community, which constantly deals with big data streams.

The deployment and provision of IF systems based on big data, like emergency management, will greatly benefit from a cloud-like middleware infrastructure, which could be formulated to create a hybrid environment consisting of multiple private (municipalities, enterprises, research organizations) and public (Amazon 2017; Microsoft 2017) infrastructure providers to deliver on-demand access to emergency management applications. Cloud computing assembles large networks of virtualized ICT services such as hardware resources (CPU, storage, and network), software resources (databases, application servers, and web servers), and applications. In industry these services are referred to as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Cloud computing services are hosted in large data centers, often referred to as data farms, operated by companies such as Amazon, Google, Microsoft, etc. The special features of the cloud, like elasticity, a thin-client user interface, as well as resource pools for both data and computing, indicate that the cloud technique is adequate for collecting, hosting, and processing emergency data.

However, the creation of a middleware infrastructure and an associated dynamic deployment model enabling hybrid cloud-based applications faces technical obstacles and research challenges. Cloud-based delivery of dynamic IF applications has fundamental differences from existing cloud application models (e.g., web applications, content delivery networks (CDNs), and large-scale scientific simulations), which usually operate on static data in response to queries, such as provisioning a video file (CDNs) or committing a credit card transaction (e.g., e-commerce). That is, these applications are designed to maintain an efficient and fault-tolerant collection of data that is accessed and aggregated only when a query or transaction request is issued by a user. Such a data manipulation application model incurs high overheads when applied to IF applications that demand high processing capacity over heterogeneous and streaming data from multiple sources with strict QoS guarantees on processing latency.

Hence, there is a need for a dynamically deployable, multi-cloud middleware infrastructure to establish these applications and enable on-demand, scalable interactions with cloud resources (such as CPU, storage disks, networks, and software frameworks) to optimize quality of service (QoS) targets in terms of data analysis delay, decision-making delay, storage cost, availability, reliability, and the like. The main challenge lies in the design of intelligent technologies with integrated control criteria and mechanisms for fast access to distributed, streaming data, services, and network equipment, and for dynamic adaptation of the device capacities and services to traffic loads, user requirements, and application scenarios.

Data Fusion 545

Application Examples

Three representative examples have been selected to illustrate the open problems and expected IF outputs for situation assessment based on electronic devices and information sources.

(A) Emergencies. Natural and man-made disasters, such as tsunamis, earthquakes, floods, or epidemics, pose a significant threat to human societies. The lessons learned from the growing number of recent emergencies, such as the Colorado flood (2013), Japan's earthquake (2011), the Queensland flooding in Australia (2010), the Haiti earthquake (2010), etc., have put impetus on the development of emergency evacuation systems. Well-coordinated emergency management activities that involve guiding citizens out of danger areas, placing medical teams at the most appropriate locations, and real-time planning of evacuation routes before and after a disaster play a significant role in saving lives, protecting critical infrastructures, and minimizing casualties. The management of emergency activities such as guiding people out of dangerous areas and coordinating rescue teams is dependent on the availability of historical data as well as on the effective real-time integration and utilization of data streaming from multiple sources including on-site sensors, social media feeds, and messaging on mobile devices. The ability to make sense of data by fusing it into new knowledge would provide clear advantages in emergency situations. Inadequate SA in disasters has been identified as one of the primary factors in human errors, with grave consequences such as deaths and loss of infrastructure. As an example, during the 2010 Haiti earthquake (Caragea et al. 2011), text messaging via mobile phones and Twitter made headlines as being crucial for disaster response, but only some 100,000 messages were actually processed by government agencies due to the lack of automated and scalable data processing infrastructure. As large numbers of people opted for social media (and, even worse, many people re-tweeted inaccurate information), it led to a tsunami of data for agencies and rescue staff, who were not able to analyze it fast enough to respond.

(B) Threats. Integrating low- and high-level information may lead to significant new capabilities in the counterterrorism field (Biermann et al. 2014). In this critical area, work is currently partitioned into two main subareas: analyzing terrorism and detecting preparations for it using high-level information (intelligence reports), and immediate response to threats using sensor information. By combining the two, it will be possible to achieve shorter response times and better direction of the immediate response, enabling better preventative measures. From sensor information it is possible to detect that something is about to happen. By combining sensor data with intelligence information, it will be possible to understand more about the terrorist event and thus more easily prevent it. There are a number of practical needs from civilian, law enforcement, and security applications
for fusing information stored in databases, ontologies, or other sources. Effective use of this information potential requires tracking and data fusion systems that provide near real-time situation pictures electronically representing a complex, dynamically evolving overall picture. For example, the ability to quickly fuse information from multiple sources on organized crime or terrorist activities from the various member states and make this available to the appropriate authorities rapidly would increase safety and protect citizens. So, contextual knowledge can aid both the formulation of an improved state estimate and also be used to aid in the interpretation of a computed estimate. However, its incorporation in the fusion process adds complexity and demands that new, hybrid techniques be developed.

(C) Healthcare. The problem of bridging the gaps between the relatively mature technology of device-based fusion and the increasingly important, but as yet still immature, field of high-level (network- or text-based) medical data remains an open-ended problem. In spite of this interest, there are, today, many ideas but no real solutions. The ability to automate the fusion of device-based and text-based information should have a huge impact on healthcare, resulting in faster, more accurate information for authorities and citizens to base their decisions upon, better logistics in the case of medical service provision and consumption, and transnational medical information sharing.

Conclusion and Future Research

The synergic fusion of sensory data, natural language information, and contextual information may bring important benefits in critical problems such as protection from criminal threats, emergency/disaster situation management, border security, or healthcare. It is clear that these types of emergency situations are chaotic in nature, and any incident management system typically encompasses multiple and diverse sources of information such as mobile devices from affected people, social network feeds, available on-site sensors, contextual information databases, etc. Processing these big volumes of data in real time is a grand challenge in computing, considering besides that sources are available in different formats and of varying quality, scattered across different geographical areas, reliable at different confidence levels, etc. Situation awareness in big areas such as emergency scenarios becomes a complex, distributed information fusion scenario in which innovative techniques are needed to efficiently curate information sources and build inferences and assessments in real time that are useful for decision and sense making. Future research is required to analyze appropriate storage and computing architectures, and algorithms to merge the diverse data streams in real time to meet the required performance in challenging scenarios.

References

Amazon (2017) https://aws.amazon.com/. Accessed 15 Oct 2017
Bettini C, Brdiczka O, Henricksen K, Indulska J, Nicklas D (2010) A survey of context modelling and reasoning techniques. Pervasive Mob Comput 6(2):161–180
Biermann J, Garcia J, Krenc K, Nimier V, Rein K, Snidaro L (2014) Multi-level fusion of hard and soft information. In: 17th international conference on information fusion, Salamanca, July 2014
Biermann J, Garcia J, Krenc K, Nimier V, Rein K, Snidaro L (2016) Standardized representation via BML to support multi-level fusion of hard and soft information. In: Symposium IST/SET-216 information fusion (hard and soft) for intelligence, surveillance & reconnaissance (ISR), Norfolk, Virginia, USA, 4–5 May 2015
Bishop C (2004) Pattern recognition and machine learning. Springer, New York
Caragea C, McNeese N, Jaiswal A, Traylor G, Kim H-W, Mitra P, Wu D, Tapia A-H, Giles L, Jansen B-L, Yen J (2011) Classifying text messages for the Haiti earthquake. In: Proceedings of the 8th international ISCRAM conference, Lisbon, Portugal, May 2011
Gómez J, García J, Patricio MA, Molina JM, Llinas J (2011) High-level information fusion in visual sensor networks. In: Ang K-L, Seng K-P (eds) Information processing in wireless sensor networks: technology, trends and applications. IGI Global, Hershey, Pennsylvania
Data Integration 547
[Figure: mediated-schema data integration — a user query is posed over the mediated schema, which is connected to the sources through source mapping descriptions.]
in terms of HTML pages. Finally, these pages are converted into a structured format, for example, relational tuples. To create wrappers, one can use a manual approach or rely on automatic and semi-automatic techniques to infer them (Crescenzi et al. 2001).

In order to connect the sources to the mediated schema, a set of source descriptions are defined. Essentially, these descriptions represent a semantic mapping between every source schema and the mediated schema (but not across source schemas). These mappings define correspondences between source and target attributes and take care of different structural representations across the schemas. For example, if two attributes appear in the same relation in a source and in two relations in the target schema, and the two target relations have a foreign key defined across them, then the mapping explicitly represents the semantic relationship in the source with a join between the two target relations.

A different approach for integrating data from heterogeneous sources is data warehousing. In these systems, the underlying data sources are translated and materialized into a single physical database, called the warehouse. While in the virtual approach the queries posed on the mediated schema are translated into queries on the base sources, in a warehouse system the queries are directly answered on the materialized database. The query execution phase is usually faster on data warehouses, but they require an additional materialization step that is expensive in terms of resources and needs to be executed periodically in order to handle new data (e.g., every night). This materialization step is usually performed by using an extract, transform, load (ETL) tool that extracts data from the sources and loads them into the warehouse. Unlike wrappers, ETL tools execute more complex transformations on the data but are procedural in nature and require IT experts to write and maintain them over time.

A data integration system I is formally defined as a triple ⟨G, S, M⟩ (Lenzerini 2002), where:

• G is the global (or mediated) schema;
• S is the source schema;
• M is the mapping between G and S.
GAV Mappings
m1: SportEvent(date, desc, sport, venue id) ⊆ SoccerEvent(date, desc, venue id)
m2: SportEvent(date, desc, sport, venue id) ⊆ Event(date, desc, sport id), Sport(id, sport)
m3: Venue(id, name, city) ⊆ Stadium(id, name, zip), ZipCodes(zip, city)

LAV Mappings
m4: FB Event(date, desc, type, location, city) ⊆ SportEvent(date, desc, type, venue id), Venue(venue id, location, city)
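Under GAV, a query posed over the mediated schema is answered by unfolding the mappings over the sources. A minimal sketch of this unfolding for m1 and m2 follows; the sample tuples and the constant sport value attached to SoccerEvent are illustrative assumptions, not part of the formalism:

```python
# GAV-style unfolding sketch: the mediated relation SportEvent is populated
# by evaluating each mapping (m1, m2) over in-memory "sources".
# All sample data below is assumed purely for illustration.

soccer_event = [("2018-05-12", "Cup final", "v1")]   # SoccerEvent(date, desc, venue id)
event = [("2018-06-01", "City marathon", "s2")]      # Event(date, desc, sport id)
sport = [("s2", "running")]                          # Sport(id, sport)

def sport_event():
    """Unfold the GAV mappings to enumerate tuples of the mediated relation."""
    # m1: every SoccerEvent contributes a SportEvent (sport assumed "soccer")
    for date, desc, venue_id in soccer_event:
        yield (date, desc, "soccer", venue_id)
    # m2: join Event with Sport on the sport identifier
    for date, desc, sport_id in event:
        for sid, name in sport:
            if sid == sport_id:
                # the last component reuses the identifier, as in the printed mapping
                yield (date, desc, name, sport_id)

# A user query over the mediated schema, e.g. "which sports have events?",
# is answered by unfolding SportEvent over the sources:
print(sorted({s for _, _, s, _ in sport_event()}))  # ['running', 'soccer']
```

The user never queries SoccerEvent, Event, or Sport directly; the unfolding is exactly what the source descriptions make possible.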
In some cases this approach will require an important effort in order to abstract the model of the underlying data sources, and a collaboration between the owners of the data sources. There are contexts in which organizations want to share data, but a mediated schema approach is not feasible or is too expensive. An important contribution to solving these problems is the peer-to-peer data management architecture (Halevy et al. 2003, 2006; Tatarinov et al. 2003). The main difference from a classical data integration solution is that the mediated schema is not used anymore; rather, each source (or peer) provides mappings to a subset of other peers. In the resulting P2P architecture, users pose queries with respect to a specific peer's schema. A query is then recursively reformulated and forwarded to connected peers.

data, the challenge here is to build a system that provides uniform search services over these data sources as automatically as possible. This is the approach proposed with dataspaces (Franklin et al. 2005), which provide base functionality over all data sources, regardless of how integrated they are. New methods and systems are needed to enable pay-as-you-go approaches to data integration, where even structured and unstructured data can be integrated and become available with little setup time (Golshan et al. 2017).

Cross-References

Schema Mapping

References
Definitions

A data lake is a data repository in which datasets from multiple sources are stored in their original structures. It should provide functions to extract data and metadata from heterogeneous sources and to ingest them into a hybrid storage system. In addition, a data lake should offer a data transformation engine, in which datasets can be transformed, cleaned, and integrated with other datasets. Finally, interfaces to explore and to query the data and metadata of a data lake should also be available in a data lake system.

Architecture

Some architecture models have been proposed; some are abstract sketches from industry (e.g., by PwC in Stein and Morrison (2014) or by Podium Data (http://www.podiumdata.com/solutions/)), and some are more concrete, from research prototypes (e.g., Terrizzano et al. 2015). These proposals usually describe conceptual architectures, functional requirements, and possible product components, but they do not provide the theoretical or implementation details that would enable repeatable implementations, as these scenario-specific solutions may not apply
Data Lake 553
to a wider usage of DLs (Douglas and Curino 2015).

As Hadoop is also able to handle any kind of data in its distributed file system, many people think that "Hadoop" is the complete answer to the question of how a DL should be implemented. Of course, Hadoop is good at managing a huge amount of data with its distributed and scalable file system, but it does not provide the detailed metadata functionalities which are required for a DL. For example, the DL architecture presented in Boci and Thistlethwaite (2015) shows that a DL system is a complex ecosystem of several components and that Hadoop provides only a part of the required functionality.

More recent articles about data lakes (e.g., Mathis 2017; LaPlante and Sharma 2016) mention common functional components of a data lake architecture for ingesting, storing, transforming, and using data. These components are sketched in the architecture of Fig. 1 (Jarke and Quix 2017).

The architecture is separated into four layers: the ingestion layer, the storage layer, the transformation layer, and the interaction layer.

Ingestion Layer
This layer is responsible for importing data from heterogeneous sources into the DL system. One of the key features of the DL concept is the minimal effort to ingest and load data into the DL. It should be possible to put any kind of data in the DL. The components for data ingestion and metadata extraction should be able to extract data and metadata as far as possible automatically from the data sources, using some initial configuration
Data Lake, Fig. 1 Data lake architecture — interaction layer (metadata manager with schema manager and data quality components; data exploration, query formulation, and data interaction); transformation layer (transformation engine with data cleaning, data transformation, and integration workflows; application-specific data marts; metadata model); storage layer (data access interface with query rewriting over schemas, mappings, and transformations; raw data stores; metadata and DQ store with lineage); ingestion layer (metadata extraction and data ingestion with DQ control and data source configuration, loading data as-is from heterogeneous data sources)
information about the data sources, illustrated in the figure as DS Config. The metadata extractor needs to be able to detect schemas in semi-structured data (e.g., JSON or XML sources). The extracted metadata is managed by a metadata repository in the storage layer.

In addition to the metadata, the raw data also needs to be ingested into the DL. According to the idea that the raw data is kept in its original format, this is more like a "copy" operation and thereby certainly less complex than an ETL process in data warehouses. Nevertheless, the data needs to be put into the storage layer of the DL, which might imply some syntactical transformation or loading of the data into the "raw data stores" in the storage layer. The loading process can be done in a lazy way, i.e., only if a user requests data from this particular data source (Karæz et al. 2013).

Data governance and data quality (DQ) management are important in DLs to avoid data swamps. The DQ control component should make sure that the ingested data has a minimal quality. For example, if a source with information about genes is considered, it should also provide an identifier of the genes in one of the common formats (e.g., from the Gene Ontology) instead of using proprietary IDs that cannot be mapped to other sources. Due to the huge volume and variety of the data, it is probably not possible to specify all DQ rules manually; here, automatic support is also required to detect such rules and then to evaluate them in a fuzzy way (Saha and Srivastava 2014). Also, data profiling techniques can help to identify patterns in the source data (Abedjan et al. 2015).

Storage Layer
The main components in the storage layer are the metadata repository and the repositories for raw data. The metadata repository stores all the metadata of the DL, which has been partially collected automatically in the ingestion layer or will be added manually later during the curation or usage of the DL. The metadata should also include feedback from the users of the DL, e.g., which attributes have been used to join datasets, what kind of transformations (for integration or data cleaning) have been applied to datasets, or for which analytical reports a dataset has been relevant. This information will be useful for subsequent usage of the datasets as it contains knowledge on how the dataset can be used. The main challenge for the metadata repository is the definition of a metadata model which, on the one hand, is generic enough to represent the wide variety of metadata in a DL and, on the other hand, has detailed and concrete semantics for the metadata items. Furthermore, it should have a manageable complexity such that the metadata can also be exploited by the end users.

The raw data repositories are the core of the DL in terms of data volume. As the ingestion layer provides the data in its original format, different storage systems for relational, graph, XML, or JSON data have to be provided. Moreover, the storage of files using proprietary formats should be supported. Hadoop seems to be a good candidate as a basic platform for the storage layer, but it needs to be complemented with components to support the data fidelity, such as Apache Spark (https://spark.apache.org/). In order to provide a uniform way to query and access the data, the hybrid data storage infrastructure should be hidden behind a uniform data access interface. This data access interface should provide a query language and a data model which have sufficient expressive power to enable complex queries and represent the complex data structures that are managed by the DL. Current systems such as Apache Spark and HBase (https://hbase.apache.org/) offer this kind of functionality using a variant of SQL as query language and data model. Another option is to use JSON as a unifying data representation, as this is supported by many NoSQL systems. Still, a problem is the lack of a standardized query language for JSON, but proposals such as JSONiq (Florescu and Fourny 2013), which are based syntactically and semantically on the well-known query languages SQL and XQuery, are promising directions.

A major challenge is the rewriting of the user queries into the specific query languages of the raw data repositories. The classical query rewriting problem has been addressed in data integration for many years, in which the main
Data Lake 555
focus was correctness and completeness of the rewritten query. In addition to this problem, many other aspects have to be considered for the rewriting in the DL system. For example, the costs for data transformation between the different data formats also have to be taken into account (when relational and JSON data are to be joined, is it more efficient to translate the JSON data into relational form or vice versa?), or the costs for moving the data between different nodes of the distributed storage system.

Transformation Layer
To transform the raw data into a desired target structure, the DL needs to offer a data transformation engine in which operations for data cleaning, data transformation, and data integration can be realized in a scalable way. In contrast to a data warehouse, which aims at providing one integrated schema for all data sources, a DL should support the ability to create application-specific data marts, which integrate a subset of the raw data in the storage layer for a concrete application. From the logical point of view, these data marts are rather a part of the interaction layer, as the data marts will be created by the users during the interaction with the DL. On the other hand, their data will be stored in one of the systems of the storage layer. The idea of data marts in this context is similar to the concept of personal information spaces, which were initially proposed in the context of dataspaces (Franklin et al. 2005), and support the idea of "pay-as-you-go" integration (Jeffery et al. 2008). The knowledge created during the definition of the data mart (e.g., information about how to transform, integrate, and analyze datasets) needs to be maintained at the transformation layer and should be captured in the metadata store.
In addition, data marts can be more application-independent if they contain a general-purpose dataset which has been defined by a data scientist. Such a dataset might be useful in many information requests of the users.

Interaction Layer
The top layer focuses on the interaction of the users with the DL. The users will have to access the metadata to see what kind of data is available and can then explore the data. Thus, there needs to be a close relationship between the data exploration and the metadata manager components. On the other hand, the metadata generated during data exploration (e.g., semantic annotations, discovered relationships) should be inserted into the metadata store using the schema manager and enrichment modules.
The query formulation component should support the user in creating formal queries that express her information requirement. As stated earlier, we cannot expect that the user is able to handle the functions offered by the data access system directly. The data interaction should cover all the functionalities which are required to work with the data, including visualization, annotation, selection and filtering of data, and basic analytical methods. More complex analytics involving machine learning and data mining are, in our vision, not part of the core of a DL system.

Lazy and Pay-as-You-Go Concepts
A basic idea of the DL concept is to consume as little upfront effort as possible and to spend additional work during the interaction with users; e.g., schemas, mappings, and indexes are created while the users are working with the DL. This has been referred to as lazy (e.g., in the context of loading databases (Karæz et al. 2013)) and pay-as-you-go techniques (e.g., in data integration (Sarma et al. 2008)). All workflows in a DL system have to be checked as to whether a deferred or incremental computation of the results is applicable in that context. For example, metadata extraction can first be done in a shallow manner by extracting only the basic metadata; only if more detailed information about the source is required will a more detailed extraction method be applied.
A challenge for the application of these "pay-as-you-go" methods is to make them really "incremental"; e.g., the system must be able to detect changes and avoid a complete recomputation of the derived elements. This challenge is also addressed in a Lambda architecture, in which the speed layer should provide the incremental updates to the results computed by the batch layer.
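The lazy, pay-as-you-go extraction described here can be sketched in a few lines of Python. The class and field names below are hypothetical and not part of any DL system: shallow metadata is gathered in one cheap pass at ingestion time, while detailed metadata is computed only on first request and then cached, so later interaction does not trigger recomputation.

```python
# Sketch of lazy, pay-as-you-go metadata extraction (illustrative only;
# class and attribute names are hypothetical, not from a real DL system).

class LazyMetadata:
    def __init__(self, rows):
        self._rows = rows                      # raw data: list of dicts
        # Shallow extraction at ingestion time: cheap, single pass.
        self.row_count = len(rows)
        self.attributes = sorted({k for r in rows for k in r})
        self._detailed = {}                    # cache for deferred results

    def distinct_count(self, attr):
        """Detailed metadata, computed only on first request and cached."""
        if attr not in self._detailed:
            self._detailed[attr] = len({r.get(attr) for r in self._rows})
        return self._detailed[attr]

rows = [{"gene": "GO:0008150", "organism": "human"},
        {"gene": "GO:0003674", "organism": "human"}]
md = LazyMetadata(rows)
print(md.row_count, md.attributes)   # shallow metadata, already available
print(md.distinct_count("organism")) # detailed metadata, computed lazily
```

Making such a component truly incremental would additionally require change detection on the source, so that only affected cache entries are invalidated rather than recomputing all derived metadata.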
Schema-on-Read and Evolution
DLs also provide access to un- or semi-structured data for which a schema was not explicitly given during the ingestion phase. Schema-on-read means that schemas are only created when the data is accessed, which is in line with the "lazy" concept described in the previous section. The shallow extraction of metadata might also lead to changes at the metadata level, as schemas are being refined when more details of the data sources are known. Also, if a new data source is added to the DL system, or an existing one is updated, some integrated schemas might have to be updated as well, which leads to the problem of schema evolution (Curino et al. 2013).
Another challenge to be addressed for schema evolution is the heterogeneity of the schemas and the frequency of the changes. While data warehouses have a relational schema which is usually not updated very often, DLs are more agile systems in which data and metadata can be updated very frequently. The existing methods for schema evolution have to be adapted to deal with the frequency and heterogeneity of schema changes in a big data environment (Hartung et al. 2011).

Mapping Management
Mapping management is closely related to the schema evolution challenge. Mappings state how data should be processed on the transformation layer, i.e., to transform the data from its raw format as provided by the storage layer to a target data structure for a specific information requirement. The definition of mapping languages and operations on mappings (e.g., composition) has been studied in the field of model management (Bernstein and Melnik 2007), an earlier research area at a time in which schema-less systems were not yet relevant and models were seen as an important asset in the design of interoperable information systems. Although heterogeneity of data and models has been considered and generic languages for models and mappings have been proposed (Kensche et al. 2009), the definition and creation of mappings in a schema-less world has not received much attention yet. There has been some work in the context of regular path queries for graph databases; for example, Calvanese et al. (2012) defined the term "relative containment" for queries w.r.t. the sources that are available. Such a query language could also be applicable for a heterogeneous DL system.
A prerequisite for the definition of formal mappings is the identification of correspondences or similarities. If schemas are given explicitly, then schema matching approaches can support this task (Hartung et al. 2011); however, in DLs the raw data is less structured, and schema information is not explicitly available. Thus, in this context, methods for data profiling or data wrangling have to be combined with schema extraction and relatable dataset discovery (Alserafi et al. 2017).

Query Rewriting and Optimization
There is a trade-off between the expressive power and the complexity of the rewriting procedure, as the complexity of query rewriting depends to a large degree on the choice of the mapping and query language. However, query rewriting should not only consider completeness and correctness; the costs for executing the rewritten query should also be taken into account. Thus, the methods for query rewriting and query optimization require a tighter integration (Gottlob et al. 2014). It is also an open question whether there is a need for an intermediate language into which data and queries are translated to do the integrated query evaluation over the heterogeneous storage system, or whether it is more efficient to use some of the existing data representations. Furthermore, given the growing adoption of declarative languages in big data systems, query processing and optimization techniques from classical database systems could be applied in DL systems as well.

Data Governance and Data Quality
Since the output of a DL should be useful knowledge for the users, it is important to prevent a DL from becoming a data swamp. As in data warehouses, incomplete, inaccurate, and ambiguous data would jeopardize the accuracy of query results or, even worse, lead to unusable results
(Jarke et al. 1999). There is a trade-off between data openness and quality control. It is often mentioned that data governance is required for a DL. First of all, data governance is an organizational challenge, i.e., roles have to be identified, stakeholders have to be assigned to roles and responsibilities, and business processes need to be established to organize various aspects around data governance (Otto 2011). Still, data governance also needs to be supported by appropriate techniques and tools. For data quality, as one aspect of data governance, a similar evolution has taken place; the initially abstract methodologies have in the meantime been complemented by specific techniques and tools. Preventive, process-oriented data quality management (in contrast to data cleaning, which is reactive data quality management) also addresses the responsibilities and processes in which data is created, in order to achieve a long-term improvement of data quality.

Applications of Data Lakes

Although data lakes are a young concept, many organizations are investigating or investing in data lake solutions. Organizations with a good data management architecture, established data governance, and several data integration solutions already in place (e.g., data warehouses) are considering data lakes as a complementary solution to their existing systems. However, the added value of a data lake in such situations might be low, as many of the benefits of data lakes might already be covered by some of the existing systems.
In cases where the data management of an organization is much less developed, a data lake might be an approach to start data governance initiatives and to think about the business processes in a data-oriented way (e.g., where is data created, where is it consumed). Many organizations have a scattered data management landscape with hundreds or even thousands of partially interconnected, partially isolated databases. Especially in the case of data silos which are difficult to access, data lakes might provide a significant benefit, as it removes the hurdles to locate and to get physical access to a data source.
This situation can often be seen in industrial companies in which data has not been considered a valuable resource. Now, with more Industry 4.0 applications and success stories becoming available, companies are examining data lakes as a potential solution to manage the data produced by their production processes or by their products. Also in research and development departments (industrial or academic), data lakes can be successful because of the heterogeneity of the sources and the required flexibility in data analysis (Quix et al. 2016). A slightly different concept is the idea of a virtual or logical data lake, i.e., only metadata will be extracted from the sources and made available in a unifying metadata repository. For example, Google implements this idea in their GOODS system (Halevy et al. 2016). Such metadata registries with an overview of the datasets that are managed by an organization are also important in the context of the data privacy laws established in Europe; e.g., organizations must be able to report about the datasets with personal information managed by them.

Future Directions for Research

The approach of a raw data store in which data is only loosely connected is appealing to solve problems in getting access to the relevant data that is available in an organization. Yet, the idea of data lakes is still relatively new and has not seen much attention in research. Commercial data integration systems have addressed the data lake idea and are marketing their systems as solutions to the challenges to be addressed in data lake systems, e.g., methods for metadata extraction and metadata management. Still, there are many opportunities for research, starting from defining more general reference architectures to identify the important components of a data lake, to developing new methods for specific problems in the data lake context (e.g., schema summarization, identifying relatable datasets, data governance, query processing on heterogeneous datasets).
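One of the research problems just listed, identifying relatable datasets when schemas are not explicitly available, can be illustrated with a small sketch. The helper below is hypothetical and is not one of the cited approaches: it scores attribute pairs of two raw datasets by combining attribute-name similarity with instance (value) overlap, a common heuristic in schema matching.

```python
# Minimal sketch of correspondence discovery between two raw datasets
# (hypothetical helper, not a cited schema-matching tool): each attribute
# pair is scored by name similarity plus Jaccard overlap of its values.
from difflib import SequenceMatcher

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def correspondences(ds1, ds2, threshold=0.5):
    """Return attribute pairs whose combined score exceeds the threshold."""
    matches = []
    for c1, v1 in ds1.items():
        for c2, v2 in ds2.items():
            name_sim = SequenceMatcher(None, c1.lower(), c2.lower()).ratio()
            score = 0.5 * name_sim + 0.5 * jaccard(v1, v2)
            if score >= threshold:
                matches.append((c1, c2, round(score, 2)))
    return sorted(matches, key=lambda m: -m[2])

src = {"gene_id": ["GO:1", "GO:2"], "organism": ["human", "mouse"]}
tgt = {"geneid": ["GO:1", "GO:3"], "species": ["human", "rat"]}
print(correspondences(src, tgt))
```

A realistic system would replace the name heuristic with profiling-based features (data types, value patterns, distributions) and would have to prune the quadratic pair space for wide datasets.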
Chicago, pp 1294–1297. https://doi.org/10.1109/ICDE.2014.6816764
Sarma AD, Dong X, Halevy AY (2008) Bootstrapping pay-as-you-go data integration systems. In: Wang JTL (ed) Proceedings of ACM SIGMOD international conference on management of data. ACM Press, Vancouver, pp 861–874
Stein B, Morrison A (2014) The enterprise data lake: better integration and deeper analytics. http://www.pwc.com/us/en/technology-forecast/2014/cloud-computing/assets/pdf/pwc-technology-forecast-data-lakes.pdf
Terrizzano I, Schwarz PM, Roth M, Colino JE (2015) Data wrangling: the challenging journey from the wild to the lake. In: 7th Biennial conference on innovative data systems (CIDR). http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper2.pdf

Data Longevity and Compatibility

growth rate that is even higher than those of circuit density and processor performance, modeled by Moore's law (Brock and Moore 2006). A few exabytes of data generation per day in the early 2010s is slated to rise to many yottabytes in the 2020s (Cisco Systems 2017; Jacobson 2013). As data gains ever-greater value in the operation of social and business enterprises, data management, integrity, and preservation become major concerns. Any physical storage medium is subject to decay over time and has a finite lifespan. Some of these lifespans are relatively short compared with desirable data retention periods in practical settings. The format in which data is stored tends to become obsolete as well. It follows that an active strategy for ensuring the integrity and longevity of stored data is an important part of any data management plan.
Data Longevity and Compatibility, Table 1 Developing an appreciation for data volumes

Abbr.  Name       Bytes   Example(s)
KB     Kilobyte   10^3    One page of text or a small contact list
MB     Megabyte   10^6    One photo or a short YouTube video
GB     Gigabyte   10^9    A movie
TB     Terabyte   10^12   Netflix's movies, 4 years' worth of watching
PB     Petabyte   10^15   Data held by an e-commerce site or bank
EB     Exabyte    10^18   Google's data centers
ZB     Zettabyte  10^21   WWW size or capacity of all hard drives
YB     Yottabyte  10^24   Worldwide daily data production by the 2020s
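The prefixes of Table 1 translate directly into a small formatting helper. A minimal sketch (the function name is ours, not from the text) using the decimal prefixes from kilo- through yotta-:

```python
# Tiny helper mirroring Table 1: express a byte count with decimal
# (SI) prefixes, 10^3 through 10^24.
PREFIXES = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize_bytes(n):
    """Format n bytes using the largest applicable prefix from Table 1."""
    exp = 0
    while n >= 1000 and exp < len(PREFIXES) - 1:
        n /= 1000.0
        exp += 1
    return f"{n:.1f} {PREFIXES[exp]}"

print(humanize_bytes(2.5e18))  # roughly the daily data production cited for 2013
```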
purview of large organizations, municipalities, or nation-states. At the rate data production is growing, it won't be long before we will have to decide whether to adopt proposed prefixes for 10^27 and 10^30 (Googology Wiki 2018).
Data lifetimes range from seconds (for temporary results kept in scratchpad or working memory, as database transactions run to completion) to centuries or more (for historical archives). Each combination of data lifetime duration and data access requirement dictates the use of certain storage device types. Setting volatile semiconductor memories aside because of their nonpermanence, the most common storage technologies used in the recent past are listed in Table 2.

Data Decay and Device Lifespans

All storage media decay over time, although the degradation mechanism, and thus the methods for dealing with it, is technology-dependent. Degradations that are small-scale and/or local can be compensated for through error-correcting codes. However, once a certain number of data errors have been corrected automatically, the storage medium must be discarded and a fresh copy of the data created in a different place, because continued deterioration will eventually exceed the codes' error-correcting capacity, leading to data loss.
Magnetic storage media, the most commonly used current technology for large-scale and long-term data repositories (hot or cold), degrade through weakening or loss of magnetization, a process that is accelerated by external magnetic fields, heat, vibration, and a number of other environmental factors. Generally, the result of such deterioration is partial data erasure, rather than arbitrary changes to the stored information. Thus, when codes are used, they can be of the erasure variety (Plank 2013; Rizzo 1997), which implies lower redundancy compared with codes for correcting arbitrary errors. Archival magnetic media, such as reel tapes, must be stored under carefully regulated environmental conditions to maximize their lifespans.
Optical storage media can decay due to the breakdown of the material onto which data is stored. Here too, storing the media under proper environmental conditions can reduce data decay and increase the media's lifespan. When longer lifespans are desired, storage media of higher quality must be procured. In particular, M-discs and other specially developed archival media can prolong lifespans to many centuries (Jacobi 2018; Lunt et al. 2013). Such long lifespans can be validated only via accelerated testing (Svrcek 2009), a process whose results are not as trustworthy as those obtained through direct testing in actual field use.
Solid-state media use electrical charges to store data. Imperfect insulation can lead to charges leaking out and thus data being lost. Given that the device itself degrades much more slowly, refreshing data before total decay is one possible strategy for avoiding data loss. This process is in effect very similar to refreshing in DRAM chips (Bhati et al. 2016), except that the refreshing occurs at much slower rates and, thus, its performance hit and power waste are not serious issues.
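The erasure codes mentioned above exploit the fact that the position of the lost data is known. The simplest member of the family, a single XOR parity block as used in RAID-4/5-style layouts, can recover any one erased block. A minimal sketch, illustrative only and not a production code:

```python
# Minimal sketch of an erasure code: one XOR parity block, as in
# RAID-4/5-style layouts, recovers any ONE erased block because the
# position of the loss is known (unlike arbitrary bit errors).

def xor_blocks(blocks):
    """XOR equally sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"big ", b"data", b"!!!!"]          # equally sized data blocks
parity = xor_blocks(data)                   # stored alongside the data

lost_index = 1                              # suppose block 1 decays
survivors = [b for i, b in enumerate(data) if i != lost_index]
recovered = xor_blocks(survivors + [parity])
print(recovered)                            # the erased block, rebuilt
```

Tolerating more than one simultaneous erasure requires more redundancy, e.g., Reed-Solomon codes over larger groups of blocks.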
Data Longevity and Compatibility, Table 2 Attributes of storage devices in use over the past few decades

Type          Capacity  Cost    Speed   Life (years)  Archival  Additional information
Floppy disk   MB        Low     Low     5–7           No        Obsolete technology/format
CD/DVD        GB        Low     Medium  3–10          Yes       Are becoming less prominent
Blu-ray       GB+       Low     Medium  50            Yes       Still used for small-scale archives
M-disc        GB+       Low     Medium  1000          Yes       Lifespan unverified
Memory card   GB        High    High    5–10          No        Temporary personal storage
Cassette tape GB        Low     Low     10–20         No        Obsolete technology/format
Cartridge     GB+       Low     Low     10–20         Yes       Used in mechanically accessed vaults
Hard disk     TB        Medium  Medium  3–5           No        Increasingly being replaced by SSDs
SSD (flash)   TB        High    High    5–10          No        Lifespan varies with write frequency
Reel tape     TB+       Low     Low     20–30         Yes       Lifespan with proper storage, low use
Disk array    PB        Medium  Medium  10–20         Yes       Lifespan improved by fault tolerance
Sources differ widely on the useful lifespans of data storage media. This is in part due to the fact that the manufacturing quality associated with various suppliers and the environmental conditions under which the media operate vary widely. Some typical figures for the most common media are included in Table 2. These figures should be used with caution, as there is no guarantee that even the lower ends of the cited ranges will hold in practice.

Preventing, and Recovering from, Data Decay

be the preferred methods for large collections of data.
Decades of experience with building and managing large database systems are being applied to the design of big data repositories that are nonvolatile and long-lived (Arulraj and Pavlo 2017). However, new challenges arising from enormous amounts of less structured and unstructured data remain to be overcome. Recovery methods (Arulraj et al. 2015) are both device- and application-dependent, so they need continual scrutiny and updating as storage devices evolve and data is used in previously unimagined volumes and ways.
assessment may change in light of technology advances (Qian et al. 2015).
On the data structuring, management, and maintenance side, data centers and their interconnection networks will play a key role in the development of robust permanent data archives (Chen et al. 2016). Improved understanding of storage device failure mechanisms and data decay (Petascale Data Storage Institute 2012; Schroeder and Gibson 2007) is another line that should be pursued.
Finally, the development of new technologies for data storage, discussed in this encyclopedia under "Storage Technologies for Big Data" and "Emerging Hardware Technologies," will no doubt be accelerated as storage and longevity needs expand. Examples abound, but two promising avenues currently being pursued are storing vast amounts of data on DNA (Bornholt et al. 2016; Goldman et al. 2013) and building memories based on emerging nanotechnologies involving memristors and other new device types (Menon and Gupta 1999; Strukov et al. 2008).

Cross-References

Data Replication and Encoding
Storage Hierarchies for Big Data
Storage Technologies for Big Data
Structures for Large Data Sets

References

Arulraj J, Pavlo A (2017) How to build a non-volatile memory database management system. In: Proceedings of the ACM international conference on management of data, Chicago, pp 1753–1758
Arulraj J, Pavlo A, Dulloor SR (2015) Let's talk about storage & recovery methods for non-volatile memory database systems. In: Proceedings of the ACM international conference on management of data, Melbourne, pp 707–722
Bhati I, Chang M-T, Chishti Z, Lu S-L, Jacob B (2016) DRAM refresh mechanisms, penalties, and trade-offs. IEEE Trans Comput 65(1):108–121
Bornholt J, Lopez R, Carmean DM, Ceze L, Seelig G, Strauss K (2016) A DNA-based archival storage system. ACM SIGOPS Oper Syst Rev 50(2):637–649
Brock DC, Moore GE (eds) (2006) Understanding Moore's law: four decades of innovation. Chemical Heritage Foundation, Philadelphia
Chen PM, Lee EK, Gibson GA, Katz RH, Patterson DA (1994) RAID: high-performance reliable secondary storage. ACM Comput Surv 26(2):145–185
Chen T, Gao X, Chen G (2016) The features, hardware, and architectures of data center networks: a survey. J Parallel Distrib Comput 96:45–74
Cisco Systems (2017) The zettabyte era: trends and analysis. White paper. http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/vni-hyperconnectivity-wp.html
Coughlin T (2014) Keeping data for a long time. Forbes. http://www.forbes.com/sites/tomcoughlin/2014/06/29/keeping-data-for-a-long-time/#aac168815e26
Denning PJ, Lewis TG (2016) Exponential laws of computing growth. Commun ACM 60(1):54–65
Dimakis AG, Ramachandran K, Wu Y, Suh C (2011) A survey on network codes for distributed storage. Proc IEEE 99(3):476–489
Goda K, Kitsuregawa M (2012) The history of storage systems. Proc IEEE 100:1433–1440
Goldman N, Bertone P, Chen S, Dessimoz C, LeProust EM, Sipos B, Birney E (2013) Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494(7435):77–80
Googology Wiki (2018) SI prefix. On-line document. http://googology.wikia.com/wiki/SI_prefix. Accessed 23 Feb 2018
Hilbert M, Gomez P (2011) The world's technological capacity to store, communicate, and compute information. Science 332:60–65
Iyengar A, Cahn R, Garay JA, Jutla C (1998) Design and implementation of a secure distributed data repository. IBM Thomas J. Watson Research Division, Yorktown Heights
Jacobi J (2018) M-Disc optical media reviewed: your data, good for a thousand years. PCWorld. On-line document. http://www.pcworld.com/article/2933478/storage/m-disc-optical-media-reviewed-your-data-good-for-a-thousand-years.html
Jacobson R (2013) 2.5 quintillion bytes of data created every day: how does CPG & retail manage it? IBM Industry Insights. http://www.ibm.com/blogs/insights-on-business/consumer-products/2-5-quintillion-bytes-of-data-created-every-day-how-does-cpg-retail-manage-it/
Lunt BM, Linford MR, Davis RC, Jamieson S, Pearson A, Wang H (2013) Toward permanence in digital data storage. Proc Arch Conf 1:132–136
Menon AK, Gupta BK (1999) Nanotechnology: a data storage perspective. Nanostruct Mater 11(8):965–986
Parhami B (2018) Dependable computing: a multi-level approach. Draft of book manuscript. http://www.ece.ucsb.edu/~parhami/text_dep_comp.htm
Petascale Data Storage Institute (2012) Analyzing failure data. Project web site: http://www.pdl.cmu.edu/PDSI/FailureData/index.html
Data Management
See Big Data and Recommendation

Data Preparation
See Data Wrangling

Data Profiling

Ziawasch Abedjan
Technische Universität Berlin, Berlin, Germany

Synonyms

Metadata extraction; Metadata generation

Overview

Data profiling encompasses a vast array of methods to examine datasets and produce metadata. Like many data management tasks, data profiling faces three challenges: (i) managing the input, (ii) efficiently performing the computation, and (iii) managing the output. Apart from typical data formatting issues, the first challenge addresses the problem of specifying the expected outcome, i.e., determining which profiling tasks to execute on which parts of the data. In fact, many tools require a precise specification of what to inspect. Other approaches are more open and perform a wider range of tasks, discovering all metadata automatically.
Research and industry have tackled the second challenge in different ways, relying on the capabilities of the underlying DBMS and formulating profiling tasks as SQL queries, using indexing schemes, parallel processing, and reusing intermediate results. Additionally, several methods have been proposed that deliver only approximate
Data Profiling, Fig. 1 Classification of data profiling tasks: single-column tasks; multicolumn tasks (clusters and outliers, summaries and sketches); and dependencies, covering uniqueness and (conditional, approximate) key discovery, inclusion dependencies and (conditional, approximate) foreign key discovery, and (conditional, approximate) functional dependencies
results for various profiling tasks in order to handle Big Data. Figure 1 shows the classification of profiling tasks according to Abedjan et al. (2015), which includes single-column tasks, multicolumn tasks, and dependency detection. While dependency detection falls under multicolumn profiling, it is assigned a separate profiling class to emphasize this large and complex set of tasks.
The third challenge has not yet been the focus of broader research. Typically, an aggregation of the huge amounts of metadata in the form of histograms or graphs is desirable (Kandel et al. 2012; Abedjan et al. 2014a; Papenbrock et al. 2015a).

Key Research Findings

There are different directions of research that address the three general categories of data profiling. Research on single-column profiling is mostly concerned with the reduction of complexity in the number of rows, trying to avoid full scans or superlinear operations on the entire column. Research on multicolumn profiling and dependency discovery is mostly concerned with the reduction of the exponential complexity with regard to the number of attribute and value combinations, trying to discover new practical pruning techniques.
Single-column profiling addresses the generation of counts, such as the number of values, the number of unique values, and the number of non-null values. These metadata are often part of the basic statistics gathered by the DBMS. In addition, the maximum and minimum values are discovered, and the data type is derived (usually restricted to string vs. numeric vs. date). More advanced techniques create histograms of
value distributions and identify typical patterns that reduce the runtime of validating a depen-
in the data values in the form of regular ex- dency by using inverse indexing structures (Heise
pressions (Raman and Hellerstein 2001). Among et al. 2013; Papenbrock et al. 2015b; Abedjan
specializations of these techniques are methods to et al. 2014b). Additionally, new algorithms apply
determine the frequency distribution of soundex advanced search strategies to efficiently prune the
code, n-grams, and others; the inverse frequency exponential solution space, using random walk
distribution, i.e., the distribution of the frequency techniques or user-defined heuristics (Heise et al. D
distribution; or the entropy of the frequency dis- 2013; Papenbrock and Naumann 2016). There is
tribution of the values in a column (Kang and also a huge body of work on the so-called relaxed
Naughton 2003). A particularly interesting distri- dependencies (Caruccio et al. 2016), which focus
bution is the first digit distribution for numeric on dependencies that only hold (1) partially, i.e.,
values. Benford’s law states that in naturally oc- for a subset of the data; (2) conditionally, i.e., for
curring numbers, the distribution of the first digit a subset of the data co-occurring with a condi-
d of a number approximately follows P .d / D tion; or (3) for similarity-based relationships, i.e.,
1
log10 1 C d (Benford 1938). dependencies are not invalid if depending values
Multicolumn profiling generalizes profiling are unequal as long as they are at least similar.
tasks on single columns to multiple columns
and also identifies inter-value dependencies
and column similarities. There has been work on computing multidimensional histograms for query optimization (Poosala and Ioannidis 1997; Deshpande et al. 2001). Multicolumn profiling also plays an important role in data cleansing, e.g., in assessing and explaining data glitches, which can be exposed by considering column combinations instead of single columns (Dasu et al. 2014). Another task is to identify correlations between values through frequent patterns or association rules (Hipp et al. 2000). Furthermore, clustering approaches that consume values of multiple columns as features allow for the discovery of coherent subsets of data records and outliers (Berti-Equille et al. 2011; Dasu and Loh 2012). Similarly, generating summaries and sketches of large datasets relates to profiling values across columns (Garofalakis et al. 2013; Ntarmos et al. 2009; Dasu et al. 2006).

Dependencies are metadata that describe relationships among columns. The difficulties of automatically detecting such dependencies in a given dataset are twofold: First, pairs of columns or larger column sets must be examined, and second, the existence of a dependency in the data at hand does not imply that this dependency is meaningful. While much research has been invested in addressing the first challenge, there is less work on semantically interpreting the profiling results. There has been a series of new algorithms that apply hybrid pruning techniques.

Examples of Application

Data profiling has many traditional use cases, including data exploration, data cleansing, and data integration scenarios. Statistics about data are also useful in query optimization.

Data exploration. Users are often confronted with new, unknown datasets. Examples include data files downloaded from the Web, old database dumps, or newly gained access to some DBMS. In many cases, such data have no known schema, no or outdated documentation, etc. Even if a formal schema is specified, it might be incomplete, for instance, specifying only the primary keys but no foreign keys. A natural first step is to identify the basic structure and high-level content of a dataset. Data profiling techniques can support such manual exploration. Simple, ad hoc SQL queries can reveal some insight, such as the number of distinct values, but more sophisticated methods are needed to efficiently and systematically discover metadata.

Database management. Typically, the generated metadata include various counts, such as the number of values, the number of unique values, and the number of non-null values. These metadata are often part of the basic statistics gathered by a DBMS.
An optimizer uses these statistics to estimate the selectivity of operators and to perform other optimization steps. Mannino et al. give a survey of statistics collection and its relationship to database optimization (Mannino et al. 1988). More advanced techniques use histograms of value distributions, functional dependencies, and unique column combinations to optimize range queries (Poosala et al. 1996) or for dynamic reoptimization (Kache et al. 2006).

Database reverse engineering. Given a "bare" database instance, the task of schema and database reverse engineering is to identify its relations and attributes, as well as domain semantics, such as foreign keys and cardinalities (Petit et al. 1994; Markowitz and Makowsky 1990). Hainaut et al. call these metadata "implicit constructs," i.e., those that are not explicitly specified by DDL statements (Hainaut et al. 2009). Possible sources for reverse engineering are DDL statements, data instances, data dictionaries, etc. The result of reverse engineering might be an entity-relationship model or a logical schema to assist experts in maintaining, integrating, and querying the database.

Data integration. Before different data sources can be integrated, one has to consider the dimensions of a dataset, its data types, and formats. A concrete use case for data profiling is that of schema matching, i.e., finding semantically correct correspondences between elements of two schemata (Euzenat and Shvaiko 2013). Many schema matching systems perform data profiling to create attribute features, such as data type, average value length, and patterns, to compare feature vectors and align those attributes with the best matching ones (Madhavan et al. 2001; Naumann et al. 2002).

Scientific data management and integration have created additional motivation for efficient and effective data profiling: When importing raw data, e.g., from scientific experiments or extracted from the Web, into a DBMS, it is often necessary and useful to profile the data and then devise an adequate schema. In many cases, scientific data is produced by non-database experts and without the intention to enable integration. Thus, such data often comes with no adequate schematic information, such as data types, keys, or foreign keys. Additionally, many life-science databases have been created and updated over a long period of time, creating disconnected and redundant tables that require exhaustive search and exploration before they can be put to purposeful use again.

Data quality/data cleansing. A typical use case of data profiling is the data cleansing process. Commercial data profiling tools are usually bundled with corresponding data quality/data cleansing software. Profiling as a data quality assessment tool reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset, for instance, by determining the number of records that do not conform to previously established constraints (Pipino et al. 2002; Kandel et al. 2012). Generated constraints and dependencies also allow for rule-based data imputation.

Big Data analytics. Many Big Data and related data science scenarios call for data mining and machine learning techniques to explore and mine data. Again, data profiling is an important preparatory task to determine which data to mine, how to import it into the various tools, and how to interpret the results (Pyle 1999). In this context, leading researchers have noted: "If we just have a bunch of datasets in a repository, it is unlikely anyone will ever be able to find, let alone reuse, any of this data. With adequate metadata, there is some hope, but even so, challenges will remain [...]" (Agrawal et al. 2012).
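The schema matching use case described above, comparing profiled attribute features to align columns, can be sketched roughly as follows. This is a simplified illustration with invented feature choices, not the Cupid matcher or any other published system:

```python
def features(values):
    """Profile one column into a crude feature vector:
    (avg value length, fraction of numeric values, distinct ratio)."""
    vals = [str(v) for v in values]
    avg_len = sum(len(v) for v in vals) / len(vals)
    numeric = sum(v.replace(".", "", 1).isdigit() for v in vals) / len(vals)
    distinct = len(set(vals)) / len(vals)
    return (avg_len, numeric, distinct)

def match_attributes(schema_a, schema_b):
    """Greedily align each column of schema_a with the closest column
    of schema_b by Euclidean distance between feature vectors."""
    fa = {c: features(v) for c, v in schema_a.items()}
    fb = {c: features(v) for c, v in schema_b.items()}
    def dist(x, y):
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5
    return {a: min(fb, key=lambda b: dist(fa[a], fb[b])) for a in fa}

a = {"zip": ["10115", "14467"], "name": ["Alice", "Bob"]}
b = {"postcode": ["99999", "10117"], "person": ["Carol", "Dan"]}
print(match_attributes(a, b))  # {'zip': 'postcode', 'name': 'person'}
```

Even this toy matcher aligns the zip columns despite their having no values in common, which is exactly why profiled features rather than raw value overlap are used.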
Future Directions for Research

Much of the data that currently is the focus of applications is of nontraditional type, i.e., non-relational, non-structured (textual), and heterogeneous. Many existing profiling methods cannot adequately handle that kind of data: Either they do not exist yet, or they do not scale to the existing dimensions of data. Furthermore, fast data poses a new challenge to data profiling. With fast data we mean data that changes over time through transactions or simply streaming data. For a more elaborate overview of upcoming challenges of data profiling, we refer to Naumann (2013).

As mentioned in the introduction, the interpretation of profiling results remains an open challenge. To facilitate interpretation of metadata, effectively visualizing any profiling results is of utmost importance. So far, there are only prototypes that rudimentarily tackle the visualization of metadata (Kandel et al. 2012; Abedjan et al. 2014a; Papenbrock et al. 2015a).

References

Abedjan Z, Grütze T, Jentzsch A, Naumann F (2014a) Profiling and mining RDF data with ProLOD++. In: Proceedings of the international conference on data engineering (ICDE), pp 1198–1201, Demo
Abedjan Z, Schulze P, Naumann F (2014b) DFD: efficient functional dependency discovery. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 949–958
Abedjan Z, Golab L, Naumann F (2015) Profiling relational data: a survey. VLDB J 24(4):557–581
Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, Jagadish HV, Labrinidis A, Madden S, Papakonstantinou Y, Patel JM, Ramakrishnan R, Ross K, Shahabi C, Suciu D, Vaithyanathan S, Widom J (2012) Challenges and opportunities with Big Data. Technical report, Computing Community Consortium. http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf
Benford F (1938) The law of anomalous numbers. Proc Am Philos Soc 78(4):551–572
Berti-Equille L, Dasu T, Srivastava D (2011) Discovery of complex glitch patterns: a novel approach to quantitative data cleaning. In: Proceedings of the international conference on data engineering (ICDE), pp 733–744
Caruccio L, Deufemia V, Polese G (2016) Relaxed functional dependencies – a survey of approaches. IEEE Trans Knowl Data Eng (TKDE) 28(1):147–165. https://doi.org/10.1109/TKDE.2015.2472010
Dasu T, Loh JM (2012) Statistical distortion: consequences of data cleaning. Proc VLDB Endow (PVLDB) 5(11):1674–1683
Dasu T, Johnson T, Marathe A (2006) Database exploration using database dynamics. IEEE Data Eng Bull 29(2):43–59
Dasu T, Loh JM, Srivastava D (2014) Empirical glitch explanations. In: Proceedings of the international conference on knowledge discovery and data mining (SIGKDD), pp 572–581. ISBN: 978-1-4503-2956-9
Deshpande A, Garofalakis M, Rastogi R (2001) Independence is good: dependency-based histogram synopses for high-dimensional data. In: Proceedings of the international conference on management of data (SIGMOD), pp 199–210. ISBN: 1-58113-332-4
Euzenat J, Shvaiko P (2013) Ontology matching, 2nd edn. Springer, Berlin/Heidelberg/New York
Garofalakis M, Keren D, Samoladas V (2013) Sketch-based geometric monitoring of distributed stream queries. Proc VLDB Endow (PVLDB) 6(10):937–948
Hainaut J-L, Henrard J, Englebert V, Roland D, Hick J-M (2009) Database reverse engineering. In: Encyclopedia of database systems. Springer, Heidelberg, pp 723–728
Heise A, Quiané-Ruiz J-A, Abedjan Z, Jentzsch A, Naumann F (2013) Scalable discovery of unique column combinations. Proc VLDB Endow (PVLDB) 7(4):301–312
Hipp J, Güntzer U, Nakhaeizadeh G (2000) Algorithms for association rule mining – a general survey and comparison. SIGKDD Explor 2(1):58–64. ISSN: 1931-0145
Johnson T (2009) Data profiling. In: Encyclopedia of database systems. Springer, Heidelberg, pp 604–608
Kache H, Han W-S, Markl V, Raman V, Ewen S (2006) POP/FED: progressive query optimization for federated queries in DB2. In: Proceedings of the international conference on very large databases (VLDB), pp 1175–1178
Kandel S, Parikh R, Paepcke A, Hellerstein J, Heer J (2012) Profiler: integrated statistical analysis and visualization for data quality assessment. In: Proceedings of advanced visual interfaces (AVI), pp 547–554
Kang J, Naughton JF (2003) On schema matching with opaque column names and data values. In: Proceedings of the international conference on management of data (SIGMOD), pp 205–216
Madhavan J, Bernstein PA, Rahm E (2001) Generic schema matching with Cupid. In: Proceedings of the international conference on very large databases (VLDB), pp 49–58
Mannino MV, Chu P, Sager T (1988) Statistical profile estimation in database systems. ACM Comput Surv 20(3):191–221
Markowitz VM, Makowsky JA (1990) Identifying extended entity-relationship object structures in relational schemas. IEEE Trans Softw Eng 16(8):777–790
Naumann F (2013) Data profiling revisited. SIGMOD Rec 42(4):40–49
Data Provenance for Big Data Security and Accountability
tion history of data (Ko et al. 2015; Ko and Will 2014), can improve the quality of the results by providing supporting metadata to enhance the Big Data workflow and also serve as a tool for security, attribution, and forensics. While data provenance does not itself provide the security requirements of confidentiality, integrity, and availability, it serves as a security tool for detection, prevention, and mitigation, increasing the accountability of Big Data.

Provenance Abstraction Layers

Using the Full Provenance Stack (Ko and Phua 2017) shown in Fig. 1 as a guideline, a complete and meaningful Big Data provenance can be achieved. The Full Provenance Stack is an abstract model that classifies provenance into five separate layers. The higher layers encapsulate the layers below to refine the understanding of provenance from those layers. Layer 1 is the physical layer. In this layer, provenance information found at the hardware and BIOS/firmware level can be collected and analyzed. This is especially relevant in verifying the hardware integrity of the data-generating source. Data originating from a source that fails to establish its integrity can be discarded. In layer 2, the data provenance of layer 1 is encapsulated to provide atomic, human-readable provenance data at the operating systems running on the layer 1 hardware. Activities such as data creation, reading, updating, and deleting are now captured in atomic actions such as operating system calls. Layer 3 of the provenance stack makes sense of the data flow by tracking the ingress and egress of Big Data information flow between and within nodes. Nodes may be virtual machines, routers, mobile devices, or physical servers. This layer ensures there is an accurate encapsulation and representation of data provenance between and across nodes. A compromised node can be identified by verifying the layer 3 provenance. Layer 4 provenance is what most Big Data systems record: the application provenance. This is provenance specific to a particular application or a specific phase in the Big Data workflow. Examples can be Web browsing information or database transactions. This layer of provenance provides phase-specific, semantically rich provenance data. Layer 5, the final layer of the provenance stack, is the attribution layer; it is the encapsulation of all layers below, providing a complete understanding of the who, what, when, where, how, and why of the analysis results. At layer 5, we can attribute data creators and provide accountability and transparency for Big Data.
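The layering can be made concrete with a small model. The layer names below follow the description above, while the record structure and the completeness check are purely illustrative:

```python
from enum import IntEnum

class ProvenanceLayer(IntEnum):
    """The five layers of the Full Provenance Stack (Ko and Phua 2017)."""
    PHYSICAL = 1      # hardware and BIOS/firmware integrity
    OS = 2            # system calls: create, read, update, delete
    DATA_FLOW = 3     # ingress/egress between and within nodes
    APPLICATION = 4   # app- or workflow-phase-specific provenance
    ATTRIBUTION = 5   # who, what, when, where, how, and why

def covers_full_stack(records):
    """Provenance is 'complete and meaningful' in the sense above only
    if every layer of the stack is represented in the records."""
    return {r["layer"] for r in records} == set(ProvenanceLayer)

records = [{"layer": layer, "detail": "..."} for layer in ProvenanceLayer]
print(covers_full_stack(records))  # True
```

The point of the check is that records from, say, only layers 1 through 3 cannot support layer 5 attribution.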
Data Provenance for Big Data Security and Accountability, Fig. 2 Big data workflow (Agrawal et al. 2012): data acquisition/recording; extraction/cleaning/annotation; integration/aggregation/representation; analysis/modeling; interpretation
stage. Data provenance can record and provide the metadata required to strengthen such justifications. For example, the provenance metadata can provide information relating to how the data set was recorded, in what unit it was measured, and further details of the data-generating source. In addition, data provenance can also help to eliminate inputs from untrusted data-generating sources. For example, after identifying a source that has been providing inconsistent, missing, or erroneous data, a filter may be put in place to analyze the provenance metadata and discard any data originating from this untrusted source.

Next, we discuss the extraction phase. Because data sets come in myriad forms, from sensor data to images or plain text, they may sometimes be annotated or prepared prior to analysis. The extraction phase attempts to extract information into structured form to ease analysis. An example of a data extraction method is natural-language processing (NLP). This process concerns data quality and data integrity, that is, how complete and accurate the extracted data is. Data provenance has two responsibilities here: it provides the necessary metadata to supply context and semantics that allow for a better and more accurate extraction process, and it records new provenance to support the other phases in the pipeline. Data provenance also empowers better data integrity, aligned to typical information security requirements.

The integration phase focuses on incorporating data from other internal or external sources to expand the possibilities of Big Data analysis, for example, relating a customer's shopping habits to their online social network activities. The relationship between these data sets is not explicit and can be challenging: relevant data may be excluded, resulting in missed opportunities, or irrelevant data may be included, resulting in less accurate analytic results. Data provenance may reveal some non-apparent relationships between two data sets to make the integration process more coherent. Apart from relevance, the quality of the integrated data needs to be verified to prevent contamination of data. The system can gauge the trustworthiness and quality of an integrated data set through its provenance.

After the analysis phase, the results are interpreted in the interpretation phase. The interpretation phase is not a straightforward process. One must not only possess the skill but must also take into account system errors and the assumptions made. Results produced need to be verified and justified. Data provenance aids in interpreting the results by providing information on (1) how the result was derived, (2) which data contributes to the result, and (3) how a particular data set contributes to the result. Data provenance information should reveal, or at least make clearer, errors and false assumptions, so that one can repeat the analysis process to obtain a more accurate result.

In addition, the distributed environment and its actors are also of concern in Big Data security. Big Data analysis is usually carried out in a cloud environment due to its capabilities and flexibility. However, using a third-party computing and storage resource like cloud computing poses a risk to the privacy of data, data integrity, and data quality. The security of the cloud provider needs to be taken into account, internally and externally. An external risk is that an attacker can either steal data or perform a malicious attack to degrade the quality of the results. An internal risk can be a malicious insider at the cloud provider. In this case, provenance information could enable the transparency required to prevent security risks in the future.
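A provenance-based filter of the kind described for the acquisition phase can be sketched as follows. The function names and record layout are illustrative, not taken from any particular system:

```python
def filter_by_provenance(records, untrusted_sources):
    """Discard any record whose provenance metadata traces back to an
    untrusted data-generating source; keep the rest for analysis."""
    kept, discarded = [], []
    for rec in records:
        source = rec.get("provenance", {}).get("source")
        (discarded if source in untrusted_sources else kept).append(rec)
    return kept, discarded

records = [
    {"value": 21.5, "provenance": {"source": "sensor-A", "unit": "celsius"}},
    {"value": -999, "provenance": {"source": "sensor-X", "unit": "celsius"}},
]
kept, discarded = filter_by_provenance(records, untrusted_sources={"sensor-X"})
print(len(kept), len(discarded))  # 1 1
```

Note that the decision is made on the provenance metadata (the recorded source), not on the data values themselves.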
Key Provenance Research for Big Data Security

One of the earliest research applications for data provenance in distributed, large-scale Big Data was in the area of scientific workflows. Data provenance information is used by the scientific community to facilitate the reuse and sharing of experimental data. The use of data provenance has since been extended to several areas such as operating systems, database systems, and cloud computing.

The provenance-aware storage system (PASS) (Muniswamy-Reddy et al. 2006) is a storage system that automatically collects and maintains provenance. Muniswamy-Reddy et al. proposed that provenance should be treated as metadata collected and maintained by the storage system. This allows the storage system to generate system-level provenance automatically while easing the need for additional provenance management tasks. The system captures provenance data by intercepting system calls. A stackable file system was used along with a database engine to store provenance metadata.

The File-Centric Logger (Flogger) (Ko et al. 2011) by Hewlett-Packard Labs was the first data-centric provenance logger for cloud data provenance tracking. This system uses the Linux Loadable Kernel Module and the Windows Device Driver to capture cloud data provenance by intercepting file and network operations. Provenance was captured at all virtual machines and physical machines. The provenance from virtual machines was securely transmitted to their physical machines via a communication channel, subsequently written to log files, and then stored in a data store. Provenance from virtual machines can be correlated with the related provenance captured in the physical machines hosting the virtual machines. Ko et al. have identified that this system, like most systems, runs under the assumption of kernel integrity and that provenance logs need to be tamper-proof and immutable.

The provenance logger (Progger) (Ko and Will 2014) is a tamper-evident, kernel-space cloud data provenance system. The system was designed to address four critical data security and integrity problems: tamper evidence of provenance logs, precise timestamp synchronization across machines, growth of provenance logs, and the efficient logging of root usage of the system. This work later inspired HProgger (Fu et al. 2017), a provenance monitoring tool built upon Progger and tailored specifically for Hadoop. HProgger reduces the number of system calls recorded, minimizes the log files by reducing the details captured, and limits the capturing scope to the Hadoop working directory. While this improves the performance of the provenance logging system, the granularity and level of detail are also significantly reduced.

Example Applications

Identification and exclusion of untrusted sources. In Big Data, it is common for data from different sources and origins to be integrated (data fusion). However, if one or more of the combined sources of data is of low quality, it is hard to filter them out after the data has been combined and aggregated. Data provenance is able to clearly identify each data unit and allows for the removal of data from untrusted sources.

Combating data poisoning attacks. A data poisoning attack is the deliberate introduction of false data to affect the Big Data analytic results. Such data can be filtered out and discarded with the help of data provenance. The origin of the data can be verified from the data provenance, and even if the origin has been spoofed, comparing other provenance metadata should reveal when and how the false data was introduced. For example, the false data may have missing provenance information for the extraction phase of the Big Data workflow; this can be an indicator that the data is false and was introduced after the extraction phase.

Audit trails identifying the source of problems. In a distributed environment, a processing node may be compromised or contain bugs, therefore
contaminating the analysis results. An error introduced in any phase of the Big Data workflow will render the entire process a failure. Data provenance keeps a complete audit trail, allowing for a clear explanation of the source and cause of the problem, that is, in which phase the error occurred, which nodes and data are affected, and how they are affected.

Provenance storage overheads. One of the main challenges to collecting provenance is its overheads, particularly storage overheads. Provenance can grow up to 22 times the size of the original data (Xie et al. 2013). There have been proposals for provenance compression, but compression and decompression require computing effort and may shift the problem of storage overhead to performance overheads. This is especially crucial in Big Data, as both storage and performance are critical.
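The storage-versus-compute tradeoff can be observed directly. The toy measurement below compresses a highly repetitive, fabricated provenance log with zlib; real provenance compression schemes are more elaborate, but the shape of the tradeoff is the same:

```python
import zlib

# Fabricated system-call-style provenance entries; their repetitive
# structure is exactly why compression proposals exist.
log = "\n".join(
    f"pid=1234 syscall=read file=/data/part-{i % 10}.csv bytes=4096"
    for i in range(10_000)
).encode()

compressed = zlib.compress(log, level=9)
ratio = len(log) / len(compressed)
print(f"raw={len(log)} bytes, compressed={len(compressed)} bytes, "
      f"ratio={ratio:.1f}x")

# The saved storage is paid for in CPU time: every read of the
# provenance trail requires a decompression pass.
assert zlib.decompress(compressed) == log
```

The compression ratio shrinks the storage overhead, but the decompression cost now sits on the query path of every audit or verification.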
Future Directions for Research
extremely important to assess the quality of the datasets used in Linked Data applications before using them, allowing users or applications to understand whether data is appropriate for their task at hand.

Data Quality Dimensions and Metrics

Data quality assessment involves the measurement of quality dimensions or criteria that are relevant to the consumer. The dimensions can be considered as the characteristics of a dataset. A data quality assessment metric, measure, or indicator is a procedure for measuring a data quality dimension (Bizer and Cyganiak 2009). These metrics are heuristics that are designed to fit a specific assessment situation (Leo Pipino and Rybold 2005). An assessment score is computed from these indicators using a scoring function. In the next section, definitions of 18 data quality dimensions are provided along with their respective 69 metrics, as studied in a survey of the quality of Linked Data (performed in Zaveri et al. 2016). These dimensions are classified into the (i) accessibility, (ii) intrinsic, (iii) contextual, and (iv) representational groups.

Accessibility Dimensions
The dimensions belonging to this category involve aspects related to the access, authenticity, and retrieval of data to obtain either the entire data or some portion of it (or from another linked dataset) for a particular use case. There are five dimensions that are part of this group: availability, licensing, interlinking, security, and performance.

Availability of a dataset is the extent to which data (or some portion of it) is present, obtainable, and ready for use.

Metrics. Availability of a dataset can be measured in terms of the accessibility of the server, SPARQL endpoints, or RDF dumps and also by the dereferenceability of the URIs.

Licensing is defined as the granting of permission for a consumer to reuse a dataset under defined conditions.

Metrics. Licensing can be checked by the indication of machine- and human-readable information associated with the dataset clearly indicating the permissions of data reuse.

Interlinking refers to the degree to which entities that represent the same concept are linked to each other, be it within or between two or more data sources.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality, and description richness through sameAs links.

Security is the extent to which data is protected against alteration and misuse.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g., SSL or SSH) for users to access, acquire, or reuse the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.

Performance refers to the efficiency of a system that binds to a large dataset; that is, the more performant a data source is, the more efficiently a system can process data.

Metrics. Performance is measured based on the scalability of the data source; that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or of slash URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.

Intrinsic Dimensions
Intrinsic dimensions are those that are independent of the user's context. There are five dimensions that are part of this group: syntactic validity, semantic accuracy, consistency, conciseness, and completeness. These dimensions focus on whether information correctly (syntactically and semantically), compactly, and completely represents the real world and whether information is logically consistent in itself.

Syntactic validity is defined as the degree to which an RDF document conforms to the specification of the serialization format.

Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which then might render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of the definition of consistency is that a dataset can be consistent wrt. the RDF inference rules but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, one can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, one can detect functional dependency violations such as domain/range violations. In practice, RDF Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset.

Completeness refers to the degree to which all required information is present in a particular dataset. In terms of LD, completeness comprises the following aspects:
Schema completeness: the degree to which the classes and properties of an ontology are represented; it can thus be called ontology completeness.
Property completeness: a measure of the missing values for a specific property.
Population completeness: the percentage of all real-world objects of a particular type that are represented in the dataset.
Interlinking completeness: which has to be considered especially in LD and refers to the degree to which instances in the dataset are interlinked.

Metrics. Completeness can be measured by detecting the number of classes, properties, values, and interlinks that are present in the dataset by comparing it to the original dataset (or a gold standard dataset). It should be noted that in this case, users should assume a closed-world assumption where a gold standard dataset is available and can be used to compare against.

Contextual Dimensions
Contextual dimensions are those that highly depend on the context of the task at hand. There are four dimensions that are part of this group, namely, relevancy, trustworthiness, understandability, and timeliness.

Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users' query.

Metrics. Relevancy is highly context dependent and can be measured by using meta-information attributes for assessing whether the content is relevant for a particular task. Additionally, retrieval of relevant documents can be performed using a combination of hyperlink analysis and information retrieval methods.

Trustworthiness is defined as the degree to which the information is accepted to be correct, true, real, and credible.

Metrics. The provenance information can in turn be used to assess the trustworthiness, reliability, and credibility of a data source, an entity, publishers, or even individual RDF statements. There exists an interdependency between the data provider and the data itself. On the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider. On the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure trustworthiness.

Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human information consumer.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties, and entities are provided. Provision of the metadata of a dataset can also contribute toward assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries along with the vocabularies used so that users can understand how it can be used.

Timeliness measures how up-to-date data is relative to a specific task.

Metrics. Timeliness is measured by combining two dimensions: currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
(DQV) (Albertoni et al. 2015). The idea of this vocabulary is to describe various quality aspects of the attached data in a structured, interoperable, and human-readable format. The Data Quality Vocabulary extends the Data Catalog Vocabulary (DCAT) (Maali et al. 2014) with a number of concepts (classes and properties) suitable to express quality metadata.

Open Research Challenges

Large-Scale Quality Assessment
Current tools that perform quality assessment of Web data suffer from severe deficiencies in terms of performance once the dataset size grows beyond the capabilities of a single machine. Recently, an approach for quality assessment of large RDF datasets has been implemented using the SANSA framework (Lehmann et al. 2017), which scales over clusters of machines. The approach uses distributed in-memory computation of different quality metrics for RDF datasets using Apache Spark. However, such systems still lack inference-aware knowledge graph embeddings, distributed reasoning, and intuitive data partitioning strategies.

Data Profiling for the Web of (Big) Data
Data profiling is the set of activities and processes to produce metadata about the content of a dataset. Data profiling provides a first understanding of the data through statistics and other useful metadata. The metadata, in turn, enables reuse of existing data in useful applications. The ABSTAT tool (Spahiu et al. 2016) uses a schema-based summarization approach to profile LOD, which provides summaries that reveal quality issues that can be further analyzed. However, current datasets lack high-quality metadata, which is extremely important to enable data reuse. To tackle this problem, the recently published FAIR principles (Wilkinson et al. 2016) have been established to guide data producers to ensure that their data is maximally findable, accessible, interoperable, and reusable. These principles apply to all digital research objects (e.g., repositories, ...), creating a need to analyze the currently available digital resources with these FAIR principles.

Quality Improvement of Published Data
In addition to the assessment of quality, there need to be methods and tools for the consequent improvement of the datasets. One important step to improve the quality of data is identifying the root cause of the problem and then designing corresponding data improvement solutions. These solutions would select the most effective and efficient strategies and related sets of techniques and tools to improve quality. Currently, only the RDFUnit tool (Kontokostas et al. 2014) reports the exact triples which cause a quality issue. However, there need to be systems in place which close the loop, so to speak, from data quality assessment to improvement.

Cross-References

Data Cleaning
Data Profiling
Linked Data Management

References

Acosta M, Zaveri A, Simperl E, Kontokostas D, Auer S, Lehmann J (2013) Crowdsourcing linked data quality assessment. In: Proceedings of the 12th international semantic web conference (ISWC). Lecture notes in computer science, vol 8219. Springer, Berlin/Heidelberg, pp 260–276
Albertoni R, Isaac A, Guéret C, Debattista J, Lee D, Mihindukulasooriya N, Zaveri A (2015) Data quality vocabulary (DQV). W3C interest group note, World Wide Web Consortium (W3C)
Bizer C, Cyganiak R (2009) Quality-driven information filtering using the WIQA policy framework. J Web Semant 7(1):1–10
Guéret C, Groth P, Stadler C, Lehmann J (2011) Linked data quality assessment through network analysis. In: The semantic web – ISWC 2011. Lecture notes in computer science, vol 7032. Springer, Berlin/Heidelberg
Heath T, Bizer C (2011) Linked data: evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology, 1st edn. Morgan & Claypool, Milton Keynes
Kontokostas D, Westphal P, Auer S, Hellmann S, Lehmann J, Cornelissen R (2014) Databugger: a test-
datasets, APIs, databases, publications). There is driven framework for debugging the web of data. In:
Data Replication and Encoding

Behrooz Parhami
Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA, USA

Synonyms

Big Data Availability; Incorruptible big data

Data Unavailability, Corruption, and Loss

As valuable resources, data assets are subject to theft (which can be worse than loss, if the entity stealing the data abuses the resource), unavailability (not as bad as loss or corruption but inconvenient nonetheless), corruption, or permanent loss. Theft for illicit use can be prevented by encryption, a topic that is outside the scope of this article (Hankerson et al. 2000).
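Corruption, the subtlest of these hazards, is commonly exposed by storing a checksum alongside the data. A minimal sketch using Python's standard zlib.crc32 (an illustration only, not a construction from this article):

```python
import zlib

def checksum(block: bytes) -> int:
    """CRC32 checksum stored alongside a data block."""
    return zlib.crc32(block)

block = b"big data payload"
stored = checksum(block)          # recorded when the block is written

# On a later read, a checksum mismatch signals silent corruption.
assert checksum(block) == stored
assert checksum(b"big dawa payload") != stored
```

A checksum detects corruption but cannot repair it; the codes discussed below add enough redundancy to also correct errors.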
[…] (Avizienis 1973; Garner 1966) allow direct addition and subtraction of coded numbers, yielding coded results. Multiplication with arithmetic-coded operands is a bit more involved but still feasible, whereas division is quite difficult, thus limiting the scope of applicability. Popular examples of error-detecting codes include checksum codes (McAuley 1994); weight-based and Berger codes (Berger 1961; Knuth 1986); cyclic codes (Peterson and Brown 1961); linear codes (Sklar and Harris 2004), which include Hamming codes (Hamming 1950) as special cases; balanced codes (Knuth 1986); arithmetic codes (Avizienis 1971); and many other types developed for specific applications.

The theory of error-detecting and error-correcting codes is quite advanced, and progress is still being made (Guruswami and Rudra 2009). What remains missing is a methodology for choosing an optimal or near-optimal code for a given set of application requirements.

Replication Codes

Replication codes are the conceptually simplest codes but imply high redundancy (100% in the case of duplication, 200% for triplication). Their advantages include the fact that any unrestricted error in one copy is detectable (duplication) or correctable (triplication). Such detection and correction capabilities require that data copies be stored on distributed nodes whose failures are independent of each other (Jalote 1994).
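The detect-versus-correct distinction can be sketched with a majority vote over the copies (a toy model assuming independent replicas, not a production replica manager):

```python
from collections import Counter

def read_with_vote(copies):
    """Reconstruct a value from replicated copies.

    Duplication can only detect a mismatch; triplication (or more)
    corrects any single bad copy by majority vote.
    """
    if len(copies) == 2 and copies[0] != copies[1]:
        raise ValueError("mismatch detected, but two copies cannot say which is bad")
    value, count = Counter(copies).most_common(1)[0]
    if count <= len(copies) // 2:
        raise ValueError("no majority: data lost")
    return value

# One corrupted replica out of three is outvoted.
assert read_with_vote([b"v1", b"v1", b"XX"]) == b"v1"
```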
In addition to allowing error detection or correction, replication facilitates access to data by improving availability and access latency. For example, with two copies of a particular piece of data stored on two different computers, extreme slowness or even crashing of one computer will not hinder access to the data. The prevailing nomenclature here is that we have a primary site for each piece of data and one or more backup sites (Budhiraja et al. 1993; Guerraoui and Schiper 1997; Jalote 1994). In addition to high redundancy in such primary-backup schemes, a key challenge is to keep the multiple copies up-to-date and consistent with each other (Dullmann et al. 2001).

Error-Correcting Codes

Error-correcting codes (Arazi 1987; Peterson and Weldon 1972) are used extensively in storing data on disk memory, whose raw reliability would be dismal without the detecting and correcting capabilities of such codes (Petascale Data Storage Institute 2012; Schroeder and Gibson 2007), as well as in many other application contexts. The use of such codes within individual data blocks is often augmented by inter-block redundancy schemes that provide an extra layer of protection in what is known as RAID architecture, an acronym for redundant array of independent disks (Chen et al. 1994; Feng et al. 2005), or redundant disk array, for short.
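The inter-block redundancy idea behind RAID can be illustrated with a single XOR parity block (a minimal sketch; real arrays add striping, rotation, and metadata):

```python
from functools import reduce

def parity(blocks):
    """XOR parity over equal-length data blocks (RAID-4/5 style)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def reconstruct(survivors, parity_block):
    """Rebuild the one missing block: XOR the survivors with the parity."""
    return parity(survivors + [parity_block])

data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# Lose data[1]; the remaining blocks plus parity recover it exactly.
assert reconstruct([data[0], data[2]], p) == data[1]
```

Because XOR is its own inverse, the same routine both computes the parity and rebuilds any single lost block.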
Popular examples of error-correcting codes in practical use include cyclic codes (Lin and Costello 2004), which include the widely used cyclic redundancy check or CRC codes as special cases; linear codes (Sklar and Harris 2004), which include Hamming codes (Hamming 1950) as special cases; Reed-Solomon codes (Feng et al. 2005; Reed and Solomon 1960); BCH codes (Pless 1998); and many other types developed for specific applications. Turbo codes (Benedetto and Montorsi 1996) and low-density parity-check or LDPC codes (Gallager 1962) are examples of widely studied and used modern codes that achieve very low redundancy ratios.

Data Dispersion

An alternative to replication is a scheme known as data dispersion (Iyengar et al. 1998; Rabin 1989), originally devised by Michael O. Rabin, which entails much less redundancy but requires some computation to reconstruct the data. Let us present this method through an example. Consider the three numbers a, b, and c as the data of interest to be protected. Instead of storing the three numbers, we might store the four values f(1), f(2), f(3), and f(4), where f is the polynomial function f(x) = ax² + bx + c. We
can lose any one of the four f values and still be able to derive the original data values a, b, and c from the three remaining f values by solving a system of three linear equations. This example, which entails 33% redundancy to protect against loss of any single piece of data (compared with 100% for duplication), is readily understood, but it is not a particularly efficient way of dispersing data.

Data dispersion is actually a kind of encoding (Feng et al. 2005), where k pieces of data (say, k numbers) are encoded into n pieces (with n > k), in such a way that any k pieces suffice to derive the original data. Besides being useful for data protection, the method has security and load-balancing applications. In security, imagine devising a scheme so that at least three officers of a bank must cooperate to gain access to a safe, whereas no two officers can gain access. Such an arrangement is sometimes referred to as "secret sharing" (Beimel 2011). In load balancing, consider that the k pieces retrieved for reconstructing the original data can come from any of the stakeholders, thus making it possible to avoid congested sites. In such a case, data dispersion can be viewed as a performance-enhancing strategy, as well as a reliability improvement method.

Future Directions

The field of dependable computing has advanced for several decades to provide more reliable processing, storage, and communication resources. It is well known that reliability is a weakest-link phenomenon, in the sense that any unprotected or weakly protected part of the system can doom the entire system. The exponential reliability law (Parhami 2018) suggests that reliability declines exponentially with an increase in the number of system parts. In the context of big data, all aspects of the system (processing, storage, and communication) will expand in complexity, challenging the system designer to seek more advanced methods of reliability assurance.

Design of computer systems and their attendant attributes in light of new application domains and technological developments in the twenty-first century is under review by many researchers (Chen and Zhang 2014; Hu et al. 2014; Stanford University 2012). Since resources and services for big data will be provided through the Cloud, directions of cloud computing (Armbrust et al. 2010) and associated hardware acceleration mechanisms become relevant to our discussion here (Caulfield et al. 2016).

A promising direction is provided by network coding (Dimakis et al. 2011). This new field essentially expands on the idea of data dispersion by considering more general data distribution schemes while also taking the impact of "repair traffic" on system performance into account.

Cross-References

Data Longevity and Compatibility
Hardware Reliability Requirements
Parallel Processing with Big Data

References

Arazi B (1987) A commonsense approach to the theory of error-correcting codes. MIT Press, Cambridge, MA
Armbrust M et al (2010) A view of cloud computing. Commun ACM 53(4):50–58
Avizienis A (1971) Arithmetic error codes: cost and effectiveness studies for application in digital system design. IEEE Trans Comput 20(11):1322–1331
Avizienis A (1973) Arithmetic algorithms for error-coded operands. IEEE Trans Comput 22(6):567–572
Beimel A (2011) Secret-sharing schemes: a survey. In: Proceedings of international conference on coding and cryptology, Springer LNCS no. 6639, Berlin, pp 11–46
Benedetto S, Montorsi G (1996) Unveiling turbo codes: some results on parallel concatenated coding schemes. IEEE Trans Inf Theory 42(2):409–428
Berger JM (1961) A note on error detection codes for asymmetric channels. Inf Control 4:68–73
Budhiraja N, Marzullo K, Schneider FB, Toueg S (1993) The primary-backup approach. Distrib Syst 2:199–216
Caulfield AM et al (2016) A cloud-scale acceleration architecture. In: Proceedings of 49th IEEE/ACM international symposium on microarchitecture, Taipei, Taiwan, pp 1–13
Chen CLP, Zhang C-Y (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347
Chen PM, Lee EK, Gibson GA, Katz RH, Patterson DA (1994) RAID: high-performance reliable secondary storage. ACM Comput Surv 26(2):145–185
Dimakis AG, Ramachandran K, Wu Y, Suh C (2011) A survey on network codes for distributed storage. Proc IEEE 99(3):476–489
Dullmann D et al (2001) Models for replica synchronization and consistency in a data grid. In: Proceedings of 10th IEEE international symposium on high performance distributed computing, San Francisco, CA, pp 67–75
Feng G-L, Deng RH, Bao F, Shen J-C (2005) New efficient MDS array codes for RAID – part I: Reed-Solomon-like codes for tolerating three disk failures. IEEE Trans Comput 54(9):1071–1080; part II: Rabin-like codes for tolerating multiple (≥4) disk failures. IEEE Trans Comput 54(12):1473–1483
Gallager R (1962) Low-density parity-check codes. IRE Trans Inf Theory 8(1):21–28
Garner HL (1966) Error codes for arithmetic operations. IEEE Trans Electron Comput 5:763–770
Garrett P (2004) The mathematics of coding theory. Prentice Hall, Upper Saddle River
Guerraoui R, Schiper A (1997) Software-based replication for fault tolerance. IEEE Comput 30(4):68–74
Guruswami V, Rudra A (2009) Error correction up to the information-theoretic limit. Commun ACM 52(3):87–95
Hamming RW (1950) Error detecting and error correcting codes. Bell Labs Tech J 29(2):147–160
Hankerson R et al (2000) Coding theory and cryptography: the essentials. Marcel Dekker, New York
Hilbert M, Gomez P (2011) The world's technological capacity to store, communicate, and compute information. Science 332:60–65
Hu H, Wen Y, Chua T-S, Li X (2014) Toward scalable systems for big data analytics: a technology tutorial. IEEE Access 2:652–687
Iyengar A, Cahn R, Garay JA, Jutla C (1998) Design and implementation of a secure distributed data repository. IBM Thomas J. Watson Research Division, Yorktown Heights
Jalote P (1994) Fault tolerance in distributed systems. Prentice Hall, Englewood Cliffs
Knuth DE (1986) Efficient balanced codes. IEEE Trans Inf Theory 32(1):51–53
Lin S, Costello DJ (2004) Error control coding, vol 2. Prentice Hall, Upper Saddle River
McAuley AJ (1994) Weighted sum codes for error detection and their comparison with existing codes. IEEE/ACM Trans Networking 2(1):16–22
Parhami B (2018) Dependable computing: a multi-level approach. Draft of book manuscript, available on-line at: http://www.ece.ucsb.edu/parhami/text_dep_comp.htm
Parhami B, Avizienis A (1973) Detection of storage errors in mass memories using arithmetic error codes. IEEE Trans Comput 27(4):302–308
Petascale Data Storage Institute (2012) Analyzing failure data. Project Web site: http://www.pdl.cmu.edu/PDSI/FailureData/index.html
Peterson WW, Brown DT (1961) Cyclic codes for error detection. Proc IRE 49(1):228–235
Peterson WW, Weldon EJ Jr (1972) Error-correcting codes, 2nd edn. MIT Press, Cambridge, MA
Pless V (1998) Bose-Chaudhuri-Hocquenghem (BCH) codes. In: Introduction to the theory of error-correcting codes, 3rd edn. Wiley, New York, pp 109–222
Rabin M (1989) Efficient dispersal of information for security, load balancing, and fault tolerance. J ACM 36(2):335–348
Rao TRN, Fujiwara E (1989) Error-control coding for computer systems. Prentice Hall, Upper Saddle River, NJ
Reed I, Solomon G (1960) Polynomial codes over certain finite fields. SIAM J Appl Math 8:300–304
Schroeder B, Gibson GA (2007) Understanding disk failure rates: what does an MTTF of 1,000,000 hours mean to you? ACM Trans Storage 3(3):Article 8, 31 pp
Sklar B, Harris FJ (2004) The ABCs of linear block codes. IEEE Signal Process 21(4):14–35
Stanford University (2012) 21st century computer architecture: a community white paper. On-line: http://csl.stanford.edu/christos/publications/2012.21stcenturyarchitecture.whitepaper.pdf
Wakerly JF (1978) Error detecting codes, self-checking circuits and applications. North Holland, New York

Data Retention

Computing the Cost of Compressed Data

Data Storage Replication

Geo-replication Models

Data Structures for Graphs

(Web/Social) Graph Compression

Data Validation

Data Quality and Data Cleansing of Semantic Data
as spreadsheets. This hurdle arguably discourages many people from working with data in the first place. The end result is that domain experts regularly spend more time manipulating data than they do exercising their speciality, while less technical audiences are needlessly excluded.

At heart, data wrangling is a domain-specific programming problem, with a strong focus on the structure, semantics, and statistical properties of data. This observation suggests that by making the specification of data transformation programs dramatically easier, we might remove drudgery for scarce technical workers and engage a far wider labor pool in time-consuming, data-centric tasks.

Key Research Findings

Data wrangling covers a range of tasks, including structuring (e.g., extracting fields from text, table reshaping), profiling (e.g., summary visualization, outlier detection), standardization (e.g., entity resolution, error correction, imputation), enrichment (e.g., lookups, joins), and distillation (e.g., sampling, filtering, aggregation, windowing). Ideally, the outcome of wrangling is not simply data; it is an editable and auditable transcript of transformations coupled with a nuanced understanding of data organization and data quality issues.

Data Parsing and Structuring

A common first task for data wrangling is to parse and structure a new data source. As illustrated in Fig. 1, small datasets such as those found in spreadsheets may be primarily intended for manual inspection; restructuring such data allows it to be more easily visualized, modeled, or […] arise with larger datasets spanning gigabytes, terabytes, or more. For example, log files (e.g., of user interactions or system operations) often use idiosyncratic text formats or semi-structured formats that require transformation prior to integrating with relational databases. In many cases, data resides in a text-based format that must be mapped to a corresponding tabular representation.

Data Wrangling, Fig. 1 An example of wrangling a US government personnel dataset. Given spreadsheet-formatted data as input, a set of transformations is applied to make the data more amenable to computational use
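The text-to-table mapping with which the Fig. 1 scenario begins can be sketched in a few lines (a toy version; the regular expression and field names are assumptions for illustration, and real tools infer such transforms interactively):

```python
import re

# Raw lines in the style of Fig. 1: a name followed by labeled numbers.
raw = ["Niles C. Tel: (800)645-8397 Fax: (907)586-7252",
       "Jean H. Tel: (918)781-4600 Fax: (918)781-4604"]

def parse(line):
    """Split one semi-structured line into named fields ready for typing."""
    m = re.match(r"(?P<name>.+?)\s+Tel:\s*(?P<tel>\S+)\s+Fax:\s*(?P<fax>\S+)", line)
    return m.groupdict()

rows = [parse(line) for line in raw]
assert rows[0] == {"name": "Niles C.",
                   "tel": "(800)645-8397",
                   "fax": "(907)586-7252"}
```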
Data parsing involves first identifying the data format. For text-based formats, one must segment the text into records and attributes. Other data may use a semi-structured format (XML, JSON) that may include nested sub-records. At this stage, one often performs type induction to identify the data types of individual attributes. This parsing stage can be automated for many common formats but may require interactive assistance for idiosyncratic or inconsistent encodings.

Once parsed, data must sometimes be restructured. A common goal is to ensure the data uses a tidy format (Wickham 2014): a tabular format in which each row corresponds to a single observation and each column represents a semantically consistent variable with a single data type. In terms of database theory, a "tidy" format is closely related to a relation in first normal form (1NF) (Codd 1971b). (Normal forms beyond first normal form (second normal form, etc.) are often less desirable for analysis purposes: one might wish to denormalize data (e.g., by joining relations with primary-foreign key relationships) in order to more conveniently perform analysis over a single table.)

Common operations for string-typed data include splitting columns (e.g., based on delimiters), pattern-based extraction of substrings into new columns, and merging columns together. In the case of semi-structured data, unnesting of sub-record attributes to flat columns is often necessary. In addition, reshaping operations may be needed to ensure a tidy organization. Consider an example table containing two columns – a string key and a numeric measure – where each unique key indicates a different type of measurement. This organization violates a tidy format, as semantically different measures are mixed within a column. This situation can be remedied via a pivot operation that reshapes the data into new columns, one for each unique key value (as in step 6 of Fig. 1). Conversely, spreadsheet data often contains cross-tabulations by category or date, in which semantically similar measures are spread over multiple columns. These cross-tabulations can be mapped to a tidy format via an unpivot operation (also referred to as a fold or melt operation).

Data Profiling

Once data is suitably formatted, one can begin assessing the shape and structure of the data to guide further wrangling. The goal of data profiling is to gain a basic understanding of data content and assess potential data quality issues. Profiling typically involves consulting both summary statistics and data visualizations.

Visualization techniques are central to data profiling (Kandel et al. 2011a). Histograms are well-suited for showing the distributions of numerical data, though care must be taken regarding the choice of binning scheme. For categorical data, bar charts of occurrence counts, sorted by frequency, can depict top-K values. Other data types can benefit from specialized displays, for example, maps overlaid with symbols for geographic data and multiple plots showing various transformations for time-series data (e.g., histograms for month, day of month, day of week, etc.).

To scalably visualize relationships between variables, binned 2D histograms (a discretized version of scatterplots) can be used (Carr et al. 1987). In addition, interaction techniques such as brushing and linking (in which selections in one view are used to filter the data shown or highlighted in other views) can be applied to assess higher-dimensional relationships. Visualizations can also be augmented to aid identification of missing values, using either auxiliary visualizations (e.g., quality bars showing the proportions of valid, missing, and type-violating values, as in Kandel et al. 2011b) or by including missing values directly within a plot (Eaton et al. 2003) (e.g., including a specific category for null values).

An important goal during profiling is to account for data type violations, missing values, and outliers. Automatic identification of null values or values that do not match a column's data
type is straightforward; once flagged, they can be visually presented to a user for consideration and correction. Identification of missing records (i.e., entire table rows) is much more difficult but may be aided by looking for corresponding patterns in summary visualizations (e.g., gaps within a histogram of record timestamps).

Outliers can be identified within visualizations and using statistical routines (Hodge and Austin 2004; Hellerstein 2008). Univariate outliers can be flagged using basic summary statistics (e.g., distance from the interquartile range of numeric values), sometimes in conjunction with transformations (e.g., identifying outliers in string-typed data by analyzing the distribution of string lengths). Multivariate outlier detection is a more difficult problem, for which statistical measures such as Mahalanobis distance can be applied. Beyond numerical and categorical data, other types, including geographic and time-series data, can further benefit from specialized outlier detection techniques.

Once identified, missing or outlying values must often be interpreted in the context of other data columns (e.g., date and time of observation) and/or external knowledge (e.g., was there a power spike at that time that may have caused erroneous extreme observations?) to gauge whether the values are valid or potentially misleading. A user can then decide if corrective transformation is needed.

A common standardization task is entity resolution (or de-duplication), in which similar values representing the same entity are mapped to a single standardized value. For example, if we encounter similar names in a dataset, do they represent the same person? Entity resolution is a rich topic, with years of research into both automatic and interactive approaches (Elmagarmid et al. 2007). Common techniques include clustering to identify similar values (e.g., using string edit distance to group similar names) and blocking to aid disambiguation (consulting values in other columns to decide if similar values represent the same entity, e.g., requiring similar names to also come from the same geographic region), often followed by manual inspection (including crowdsourcing techniques to perform manual correction at scale (Stonebraker et al. 2013)).

Another standardization issue is the treatment of missing or outlier values. In the example of Fig. 1, a fill operation is used to copy string values into missing cells. In other cases, users may wish to remove missing data from consideration. For numerical data, interpolation (e.g., to fill in missing values over a time domain) or imputation (e.g., replacing a value with the mean value for a corresponding group) of missing values may be performed, depending on the downstream analysis needs. Similarly, outliers may be removed or replaced with an imputed value depending on the analysis goals; for example, some statistical modeling methods are highly sensitive to outliers.
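The clustering step of entity resolution can be sketched with a string-similarity measure (here Python's difflib as a stand-in for edit distance; a greedy toy, not a scalable resolver):

```python
import difflib

def cluster_names(names, threshold=0.8):
    """Greedily group strings whose similarity to a cluster's first
    member exceeds a threshold (a stand-in for edit-distance blocking)."""
    clusters = []
    for name in names:
        for cluster in clusters:
            ratio = difflib.SequenceMatcher(
                None, name.lower(), cluster[0].lower()).ratio()
            if ratio >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Similar spellings of one name collapse into a single cluster.
groups = cluster_names(["Jean H.", "jean h", "Frank K.", "Niles C."])
assert ["Jean H.", "jean h"] in groups
assert len(groups) == 3
```

In practice, such clustering is combined with blocking on other columns and manual review, as described above.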
[…] operation is to perform geo-coding to retrieve the longitude-latitude coordinates for an address or zip code.

More complex operations include multi-table join and union operations. While common to database management systems, within data wrangling contexts joins and unions may involve datasets that were not originally designed to be used together. In the absence of explicit primary key and foreign key designations, key inference and schema mapping (Rahm and Bernstein 2001) may be needed to properly align tables. Moreover, additional transformation operations may be necessary to ensure compatible join keys or unionable columns.

Distillation is the task of summarizing or reducing a dataset prior to downstream analysis. For example, an analyst may wish to perform filtering or sampling to reduce the data or extract a subset appropriate for comparison (e.g., performing stratified sampling by group). Aggregation and windowing operations are means of summarizing data, often by subgroup. In some cases, aggregation is a necessary precursor to data integration, as datasets may use units of analysis with differing levels of granularity (e.g., US states versus US counties).

Organizational Deployment, Sharing, and Review

While data wrangling is an important consideration of an individual analyst's work, wrangling also plays a social role within an organization. A set of wrangling transformations may need to be applied repeatedly, scheduled to rerun at regular intervals as new data batches arrive. Once wrangled, data can be valuable to a number of analysts or decision makers within an organization. Methods for sharing and discovery (e.g., data cataloging) can improve an organization's access to data and amortize wrangling effort.

The provenance of wrangled data is another central concern. Before using a dataset in their analyses, a user may wish to review what transformations were applied and ensure they are appropriate. An executive may wish to know what calculations were performed to derive a key performance metric prior to making a dependent decision. Similarly, a compliance officer may wish to review how specific privacy-sensitive fields are subsequently used. Each of these tasks involves additionally tracking and visualizing the lineage of transformed data.

These needs imply that a critical output of the wrangling process is not simply transformed data, but rather a reusable, editable, and auditable transcript of wrangling operations. Not unlike the collaboration and management enabled by revision control tools for code, both wrangling programs and transformed data are organizational resources to be socially shared and monitored.

Examples of Application

To give a sense of the data wrangling process, Fig. 1 illustrates a scenario adapted from Guo et al. (2011), in which a user transforms spreadsheet data (step 1 within Fig. 1) into a dense relational format (step 6). This example begins with personnel data originally formatted for human reading and adapts it to a format more amenable to computational processing. The transformation process involves removing unneeded records (steps 2 and 4), splitting text fields into well-typed columns (step 3), interpolation of missing values (step 5), and finally reshaping the table via a pivot (or cross-tabulation) operation (step 6).

While involving only a small dataset, this example exhibits a number of common operations necessary for wrangling a dataset. In general, datasets of all sizes may require significant wrangling effort, whether to restructure the data for use by common tools, to identify and correct anomalies or discrepancies, or to adapt a dataset such that it can be integrated with other sources of data.

Future Directions for Research

Technologies for data wrangling aim to support one or more of the tasks described above. Wrangling tools often involve the careful integration of automatic algorithms (e.g., type inference,
Data Warehouse Benchmark 589
transform inference, program synthesis) within a number of graphical interfaces have been
interactive systems (e.g., an interactive console devised for supporting data wrangling. Some
or graphical user interface) that enable users to projects focus on new interaction techniques for
inspect data and orchestrate transformations. Fu- a specific wrangling task, such as tools for de-
ture work areas include (1) improved languages duplication (Sarawagi and Bhamidipaty 2002;
and runtimes for wrangling operations, (2) new Kang et al. 2008), schema matching (Robertson
user interface designs for data profiling and in- et al. 2005; Chiticariu et al. 2008), or data D
teractive transformation, and (3) novel inferential profiling (Dasu et al. 2002; Kandel et al. 2012b).
procedures for simplifying and scaling human Future research projects could develop methods
guidance of the wrangling process. for core wrangling tasks, including improved
entity resolution, standardization, and analysis of
Domain-Specific Languages for Data data lineage.
Wrangling Other projects focus on more complete
Data wrangling is, at its core, a programming environments for data wrangling. The Potter’s
problem: what transformations should be applied Wheel (Raman and Hellerstein 2001) system is
and in what order? A popular approach is thus to craft domain-specific languages (DSLs) that support wrangling tasks. Standard database query languages such as SQL support a number of relevant wrangling tasks (distillation, enrichment, etc.) but are less well-suited for messy data with type violations or requiring operations that transform between metadata and data (e.g., unpivot operations that map column names into data). The SchemaSQL (Lakshmanan et al. 2001) language generalizes SQL to include such second-order operations. The PADS (Fisher and Walker 2011) and AJAX (Galhardas et al. 2000) projects also introduce custom DSLs for data transformation. Within the R programming language for computational statistics, dplyr and tidyr (Wickham 2014) are embedded DSLs supporting transformation and restructuring tasks. DSLs also underlie a number of interactive tools, such as Potter's Wheel (Raman and Hellerstein 2001) and Wrangler (Kandel et al. 2011b; Guo et al. 2011). In addition to providing a succinct high-level syntax, a DSL can support compilation to alternative runtimes (e.g., computing clusters or database management systems) to enable scalable execution of transformation scripts.

Graphical Interfaces for Data Wrangling

While valuable, direct use of domain-specific languages primarily benefits programmers, as opposed to a larger swath of domain experts knowledgeable about their data but not necessarily skilled in programming. In response, interactive graphical tools have been developed; Potter's Wheel (Raman and Hellerstein 2001) is an early example, providing both a spreadsheet-like table representation and menu-driven commands to perform transformations. Subsequent tools include Refine (Huynh and Mazzocchi 2010) and Wrangler (Kandel et al. 2011b; Guo et al. 2011) (later commercialized by Trifacta), which add richer support for direct manipulation and visualization.
Both the Potter's Wheel (Raman and Hellerstein 2001) and Wrangler (Kandel et al. 2011b; Guo et al. 2011) systems map interactive operations to statements in an underlying DSL, enabling those tools to output transformation programs that can be cross-compiled to scalable runtime environments. For example, by interactively wrangling a sample of data, these tools can then generate programs that can be applied to transform terabytes (or more) of data. However, users of graphical interfaces are accustomed to directly manipulating the state of a document or dataset, not a more abstract set of (reusable) operations. Future work might explore and evaluate improved methods for representing and manipulating the wrangling programs created by such tools.

Inferential Methods to Accelerate Wrangling

A central feature of many successful wrangling tools is a judicious balance of user interaction and automated algorithms. Examples of useful inference methods include outlier detection, type inference (to determine column data types), join
key inference (to identify primary-foreign key relationships among multiple tables), and union inference (to determine appropriate column assignments – or lack thereof – between two or more tables). Each of these aids identification of important data characteristics and the formulation of appropriate wrangling operations (e.g., as statements in a wrangling DSL). An interactive wrangling tool must run the appropriate analysis (perhaps in response to user-initiated commands, such as selecting tables to join) and then surface the results in an actionable manner.
A number of projects have explored interactive methods, such as programming by example, to couple automatic inference with human guidance. The work of Gulwani and collaborators applies program synthesis methods to learn text extraction (Gulwani 2011) and table restructuring (Harris and Gulwani 2011) procedures from input-output examples; some of these techniques have been incorporated into Microsoft Excel. Kandel et al.'s Wrangler (2011b) takes a different approach, performing search over the space of possible transformation operations, informed by user selections of data table rows, columns, or cell contents. Wrangler returns and visualizes potential transformations from which the user can choose (or modify). As users transform their data, the commercial Trifacta Wrangler product additionally uses automatic visualization routines to generate charts to aid data profiling. The goal of this predictive interaction (Heer et al. 2015) loop is to recast the task of writing a DSL program into one of visually guiding and refining a transformation recommender system. Many challenges remain in perfecting mixed-initiative (Horvitz 1999) interaction techniques that use inferential methods to speed and scale wrangling, while preserving effective human assessment and guidance of the wrangling process.

Cross-References

Data Cleaning
Data Integration
Data Profiling
Data Quality and Data Cleansing of Semantic Data
ETL
Large-Scale Entity Resolution

References

Carr DB, Littlefield RJ, Nicholson W, Littlefield J (1987) Scatterplot matrix techniques for large N. J Am Stat Assoc 82(398):424–436
Chiticariu L, Kolaitis PG, Popa L (2008) Interactive generation of integrated schemas. In: ACM SIGMOD, pp 833–846
Codd EF (1971b) Further normalization of the data base relational model. In: Courant computer science symposia 6, Data base systems (New York, May 24–25), pp 33–64, Prentice-Hall
Dasu T, Johnson T (2003) Exploratory data mining and data cleaning. Wiley, New York
Dasu T, Johnson T, Muthukrishnan S, Shkapenyuk V (2002) Mining database structure; or, how to build a data quality browser. In: ACM SIGMOD, pp 240–251
Doan A, Halevy A, Ives Z (2012) Principles of data integration. Elsevier, Amsterdam
Eaton C, Plaisant C, Drizd T (2003) The challenge of missing and uncertain data. In: Proceedings of the IEEE visualization, p 100
Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE TKDE 19(1):1–16
Fisher K, Walker D (2011) The PADS project: an overview. In: International conference on database theory, Mar 2011
Galhardas H, Florescu D, Shasha D, Simon E (2000) AJAX: an extensible data cleaning tool. In: ACM SIGMOD, p 590
Gulwani S (2011) Automating string processing in spreadsheets using input-output examples. In: ACM POPL, pp 317–330
Guo PJ, Kandel S, Hellerstein J, Heer J (2011) Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In: ACM user interface software & technology (UIST)
Harris W, Gulwani S (2011) Spreadsheet table transformations from examples. In: ACM PLDI
Heer J, Hellerstein JM, Kandel S (2015) Predictive interaction for data transformation. In: CIDR
Hellerstein JM (2008) Quantitative data cleaning for large databases. White Paper, United Nations Economic Commission for Europe
Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126
Horvitz E (1999) Principles of mixed-initiative user interfaces. In: ACM CHI, pp 159–166
Huynh D, Mazzocchi S (2010) Google refine. http://code.google.com/p/google-refine/
Kang H, Getoor L, Shneiderman B, Bilgic M, Licamele L (2008) Interactive entity resolution in relational data: a visual analytic tool and its evaluation. IEEE TVCG 14(5):999–1014
Kandel S, Heer J, Plaisant C, Kennedy J, van Ham F, Riche NH, Weaver C, Lee B, Brodbeck D, Buono P (2011a) Research directions in data wrangling: visualizations and transformations for usable and credible data. Inf Vis J 10(4):271–288
Kandel S, Paepcke A, Hellerstein J, Heer J (2011b) Wrangler: interactive visual specification of data transformation scripts. In: ACM human factors in computing systems (CHI)
Kandel S, Paepcke A, Hellerstein J, Heer J (2012a) Enterprise data analysis and visualization: an interview study. In: IEEE visual analytics science & technology (VAST)
Kandel S, Parikh R, Paepcke A, Hellerstein J, Heer J (2012b) Profiler: integrated statistical analysis and visualization for data quality assessment. In: Advanced visual interfaces
Lakshmanan LVS, Sadri F, Subramanian SN (2001) SchemaSQL: an extension to SQL for multidatabase interoperability. ACM Trans Database Syst 26(4):476–519
Rahm E, Bernstein PA (2001) A survey of approaches to automatic schema matching. VLDB J 10:334–350
Raman V, Hellerstein JM (2001) Potter's wheel: an interactive data cleaning system. In: VLDB, pp 381–390
Robertson GG, Czerwinski MP, Churchill JE (2005) Visualization of mappings between schemas. In: ACM CHI, pp 431–439
Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: ACM SIGKDD
Stonebraker M, Bruckner D, Ilyas IF, Beskales G, Cherniack M, Zdonik SB, Pagan A, Xu S (2013) Data curation at scale: the data tamer system. In: CIDR
Wickham H (2014) Tidy data. J Stat Softw 59(10):1–23

Database Consistency Models

Marc Shapiro1 and Pierre Sutra2
1 Sorbonne Université, LIP6 and INRIA, Paris, France
2 Télécom SudParis, Évry, France

Synonyms

Consistency criterion; Consistency model; Data consistency; Isolation level

The distributed systems and database communities use the same word, consistency, with different meanings. Within this entry, and following the usage of the distributed algorithms community, "consistency" refers to the observable behavior of a data store.
In the database community, roughly the same concept is called "isolation," whereas the term "consistency" refers to the property that application code is sequentially safe (the C in ACID).

Definitions

A data store allows application processes to put and get data from a shared memory. In general, a data store cannot be modeled as a strictly sequential process. Applications observe non-sequential behaviors, called anomalies. The set of possible behaviors, and conversely of possible anomalies, constitutes the consistency model of the data store.

Overview

Background
A data store, or database system, is a persistent shared memory space, where different client application processes can store data items. To ensure scalability and dependability, a modern data store distributes and replicates its data across clusters of servers running in parallel. This approach supports high throughput by spreading the load, low latency by parallelizing requests, and fault tolerance by replicating data and processing; system capacity increases by simply adding more servers (scale out).
Ideally, from the application perspective, data replication and distribution should be transparent. Read and update operations on data items would appear to execute as in a single sequential thread; reading a data item would return exactly the last value written to it in real time; and each application transaction (a grouping of operations) would take effect atomically (all at once). This ideal behavior is called strict serializability, noted SSER (Papadimitriou 1979).
In practice, exposing some of the internal parallelism to clients enables better performance. More fundamentally, SSER requires assumptions that are unrealistic at large scale, such as absence of network partitions (see section "Fundamental Results" hereafter). Therefore, data store design faces a fundamental tension between providing a strictly serializable behavior on the one hand, versus availability and performance on the other. This explains why large-scale data stores hardly ever provide the SSER model, with the notable exception of Spanner (Corbett et al. 2012).

Consistency Models
Informally, a consistency model defines what an application can observe about the updates and reads of its data store. When the observed values of data differ from strict serializability, this is called an anomaly. Examples of anomalies include divergence, where concurrent reads of the same item persistently return different values; causality violation, where updates are observed out of order; dirty reads, where a read observes the effect of a transaction that has not terminated; or lost updates, where the effect of an update is lost. The more anomalies allowed by a store, the weaker its consistency model. The strongest model is (by definition) SSER, the baseline against which other models are compared.
More formally, a consistency model is defined by the history of updates and reads that clients can observe. A model is weaker than another if it allows more histories.
An absolute definition of "strong" and "weak" consistency is open to debate. For the purpose of this entry, we say that a consistency model is strong when it has consensus power, i.e., any number of failure-prone processes can reach agreement on some value by communicating through data items in the store. If this is not possible, then the consistency model is said to be weak. In a strong model, updates are totally ordered. Some well-known strong models include SSER, serializability (SER), or snapshot isolation (SI). Weak models admit concurrent updates to the same data item and include causal consistency (CC), strong eventual consistency (SEC), and eventual consistency (EC) (see Table 1).

Database Consistency Models, Table 1 Models and source references

Acronym     Full name                    Reference
EC          Eventual Consistency         Ladin et al. (1990)
SEC         Strong Eventual Consistency  Shapiro et al. (2011)
CM          Client monotonicity          Terry et al. (1994)
CS          Causal Snapshot              Chan and Gray (1985)
CC          Causal Consistency           Ahamad et al. (1995)
Causal HAT  Causal Highly-Av. Txn.       Bailis et al. (2013)
LIN         Linearisability              Herlihy and Wing (1990)
NMSI        Non-Monotonic SI             Saeida Ardekani et al. (2013a)
PSI         Parallel SI                  Sovran et al. (2011)
RC          Read Committed               Berenson et al. (1995)
SC          Sequential Consistency       Lamport (1979)
SER         Serialisability              Gray and Reuter (1993)
SI          Snapshot Isolation           Berenson et al. (1995)
SSER        Strict Serialisability       Papadimitriou (1979)
SSI         Strong Snapshot Isolation    Daudjee and Salem (2006)

Key Research Findings

Basic Concepts
A data store is a logically shared memory space where different application processes store, update, and retrieve data items. An item can be very basic, such as a register with read/write operations, or a more complex structure, e.g., a table or a file system directory. An application process executes operations on the data store through the help of an API. When a process
invokes an operation, it executes a remote call to an appropriate endpoint of the data store. In return, it receives a response value. A common example of this mechanism is a POST request to an HTTP endpoint.
An application consists of transactions. A transaction consists of any number of reads and updates to the data store. It is terminated either by an abort, whereby its writes have no effect, or by a commit, whereby writes modify the store. In what follows, we consider only committed transactions. The transaction groups together low-level storage operations into a higher-level abstraction, with properties that help developers reason about application behavior.
The properties of transactions are often summarized as ACID: All-or-Nothing, (individual) Correctness, Isolation, and Durability.
All-or-Nothing ensures that, at any point in time, either all of a transaction's writes are in the store or none of them is. This guarantee is essential in order to support common data invariants such as equality or complementarity between two data items. Individual Correctness is the requirement that each of the application's transactions individually transitions the database from a safe state (i.e., where some application-specific integrity invariants hold over the data) to another safe state. Durability means that all later transactions will observe the effect of this transaction after it commits. A, C, and D are all essential features of any transactional system and will be taken for granted in the rest of this entry.
The I property, Isolation, characterizes the absence of interference between transactions. Transactions are isolated if one cannot interfere with the other, regardless of whether they execute in parallel or not.
Taken together, the ACID properties provide the serializability model, a program semantics in which a transaction executes as if it was the only one accessing the database. This model restricts the allowable interactions among concurrent transactions, such that each one produces results indistinguishable from the same transactions running one after the other. As a consequence, the code for a transaction lives in the simple, familiar sequential programming world, and the developer can reason only about computations that start with the final results of other transactions. Serializability allows concurrent operations to access the data store and still produce predictable, reproducible results.

Definitions
A history is a sequence of invocations and responses of operations on the data items by the application processes. It is commonly represented with timelines. For instance, in history h1 below, processes p and q access a data store that contains two counters (x and y).

    p: [ x++  r(x):0 ]
    q: [ x-- ]  [ r(y):0 ]                  (h1)

Operations (z++) and (z--) respectively increment and decrement counter z by 1. To fetch the content of z, a process calls r(z). A counter is initialized to 0. The start and the end of a transaction are marked using brackets, e.g., transaction T1 = (x++).r(x) in history h1. When the transaction contains a single operation, the brackets are omitted for clarity.
As pointed above, we assume in this entry that all the transactions in a history are committed. Thus every invocation has a matching response. More involved models exist, e.g., when considering a transactional memory (Guerraoui and Kapalka 2008).
A history induces a real-time order between transactions (denoted <h). This order holds between two transactions T and T' when the response of the last operation in T precedes in h the invocation of the first operation in T'. A history also induces a per-process order that corresponds to the order in which processes invoke their transactions. For instance in h1, transaction T2 = (x--) precedes transaction T3 = r(y) at process q. This relation together with (T1 <h1 T3) fully defines the real-time order in history h1.
Histories have various properties according to the way invocations and responses interleave.
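The real-time order just defined can be checked mechanically. Below is a minimal sketch in Python, encoding history h1 with invented timestamps; the names Txn and precedes are illustrative, not notation from this entry:

```python
from dataclasses import dataclass

@dataclass
class Txn:
    name: str
    invoke: float   # invocation time of the transaction's first operation
    respond: float  # response time of the transaction's last operation

def precedes(t1: Txn, t2: Txn) -> bool:
    """Real-time order <h: T1 <h T2 iff the response of T1's last
    operation comes before the invocation of T2's first operation."""
    return t1.respond < t2.invoke

# History h1: p runs T1 = (x++).r(x); q runs T2 = (x--), then T3 = r(y).
# Timestamps are chosen so that T1 and T2 overlap (hence are concurrent),
# while T3 starts only after both have responded.
T1 = Txn("T1", invoke=0.0, respond=4.0)
T2 = Txn("T2", invoke=1.0, respond=2.0)
T3 = Txn("T3", invoke=5.0, respond=6.0)

assert precedes(T2, T3) and precedes(T1, T3)          # T2 <h T3 and T1 <h T3
assert not precedes(T1, T2) and not precedes(T2, T1)  # T1, T2 concurrent
```

Two transactions are related by <h exactly when one responds before the other is invoked; any other interleaving leaves them unordered.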
Two transactions are concurrent in a history h when they are not ordered by the relation <h. A history h is sequential when no two transactions in h are concurrent. A sequential history is legal when it respects the sequential specification of each object. Two histories h and h' are equivalent when they contain the same set of events (invocations and responses).
A consistency model defines the histories that are allowed by the data store. In particular, serializability (SER) requires that every history h is equivalent to some sequential and legal history l. For instance, history h1 is serializable, since it is equivalent to the history l1 below.

    q: [ x-- ]   p: [ x++  r(x):0 ]   q: [ r(y):0 ]                  (l1)

In addition, if the equivalent sequential history preserves the real-time order between transactions, history h is said to be strictly serializable (SSER) (Papadimitriou 1979). This is the case of h1 since in l1 the relations (T2 <h1 T3) and (T1 <h1 T3) also hold.
When each transaction contains a single operation, SSER boils down to linearizability (LIN) (Herlihy and Wing 1990). The data store ensures sequential consistency (SC) (Lamport 1979) when each transaction contains a single operation and only the per-process order is kept in the equivalent sequential history.
The above consistency models (SER, SSER, LIN, and SC) are strong, as they allow the client application processes to reach consensus. To see this, observe that processes may agree as follows: The processes share a FIFO queue L in the data store. To reach consensus, each process enqueues some value in L which corresponds to a proposal to agree upon. Then, each process chooses the first proposal that appears in L. The equivalence with a sequential history implies that all the application processes pick the same value.
Conversely, processes cannot reach consensus if the consistency model is weak. A widespread model in this category is Eventual Consistency (EC) (Vogels 2008), used, for instance, in the Simple Storage Service (Murty 2008). EC requires that, if clients cease submitting transactions, they eventually observe the same state of the data store. This eventually-stable state may include part (or all) of the transactions executed by the clients. Under EC, processes may repeatedly observe updates in different orders. For example, if the above list L is EC, each process may see its own update applied first on L until it decides, preventing agreement. In fact, EC is too weak to allow asynchronous failure-prone processes to reach an agreement (Attiya et al. 2017).

Fundamental Results
In the most general model of computation, replicas are asynchronous. In this model, and under the hypothesis that a majority of them are correct, it is possible to emulate a linearizable shared memory (Attiya et al. 1990). This number of correct replicas is tight. In particular, if any majority of the replicas may fail, the emulation does not work (Delporte-Gallet et al. 2004).
The above result implies that, even for a very basic distributed service, such as a register, it is not possible to be at the same time consistent, available, and tolerant to partition. This result is known as the CAP Theorem (Gilbert and Lynch 2002), which proves that it is not possible to provide all the following desirable features at the same time: (C) strong Consistency, even for a register; (A) Availability, responding to every client request; and (P) tolerance to network Partition or arbitrary message loss.
A second fundamental result, known as FLP, is the impossibility to reach consensus deterministically in presence of crash failures (Fischer et al. 1985). FLP holds even if all the processes but one are correct.
As pointed above, a majority of correct processes may emulate a shared memory. Thus, the FLP impossibility result indicates that a shared memory is not sufficient to reach consensus. In fact, solving consensus requires the additional ability to elect a leader among the correct processes (Chandra et al. 1996).
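The queue-based agreement argument above is easy to simulate. In this sketch, Python threads stand in for application processes, and a lock provides the total order on enqueues that a strong model guarantees; all names are illustrative:

```python
import threading

L = []                    # the shared FIFO queue L of proposals
lock = threading.Lock()   # totally orders enqueues, as a strong model would
decisions = {}            # pid -> decided value

def process(pid: int, proposal: str) -> None:
    # Enqueue the proposal; the lock makes appends atomic and totally ordered.
    with lock:
        L.append(proposal)
    # Decide the *first* proposal in L. Because every process observes the
    # same totally ordered queue, all processes pick the same value.
    decisions[pid] = L[0]

threads = [threading.Thread(target=process, args=(i, f"v{i}"))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(L) == 5
assert len(set(decisions.values())) == 1  # agreement: one value decided by all
```

If L were merely eventually consistent, each process might transiently see its own proposal at the head of the queue, and the final assertion could fail: this is exactly the sense in which weak models lack consensus power.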
Data stores that support transactions on more than one data item are subject to additional impossibility results. For instance, an appealing property is genuine partial replication (GPR) (Schiper et al. 2010), a form of disjoint-access parallelism (Israeli and Rappoport 1994). Under GPR, transactions that access disjoint items do not contend in the data store. GPR avoids convoy effects between transactions (Blasgen et al. 1979) and ensures scalability under parallel workloads. However, GPR data stores must sacrifice some form of consistency or provide only weak progress guarantees (Bushkov et al. 2014; Saeida Ardekani et al. 2013b; Attiya et al. 2009).
A data store API defines the shared data structures the client application processes manipulate as well as their consistency and progress guarantees. The above impossibility results inform the application developer that some APIs require synchronization among the data replicas. Process synchronization is costly; thus there is a trade-off between performance and data consistency.

Trade-Offs
In the common case, executing an operation under strong consistency requires solving consensus among the data replicas, which costs at least one round trip among replicas (Lamport 2006). Sequential consistency allows executing either read or write operations at a local replica (Attiya and Welch 1994; Wang et al. 2014). Weaker consistency models, e.g., eventual (Fekete et al. 1999) and strong eventual consistency (Shapiro et al. 2011), enable both read and write operations to be local.
A second category of trade-offs relates consistency models to metadata (Peluso et al. 2015; Burckhardt et al. 2014). These works establish lower bounds on the space complexity needed to meet a certain consistency model. For instance, tracking causality accurately requires O(m) bits of storage, where m is the number of replicas (Charron-Bost 1991).

Common Models
The previous sections introduce several consistency models (namely, SER, SC, LIN, SSER, and EC). This section offers a perspective on other prominent models. Table 1 recapitulates.

Read Committed (RC)
Almost all existing transactional data stores ensure that clients observe only committed data (Zemke 2012; Berenson et al. 1995). More precisely, the RC consistency model enforces that if some read r observes the state x̂ of an item x in history h, then the transaction Ti that wrote x̂ commits in h. One can distinguish a loose and a strict interpretation of RC. The strict interpretation requires that r(x) takes place after transaction Ti commits. Under the loose interpretation, the write operation might occur concurrently.
When RC, or a stricter consistency model, holds, it is convenient to introduce the notion of version. A version is the state of a data item as produced by an update transaction. For instance, when Ti writes to some register x, an operation denoted hereafter w(xi), it creates a new version xi of x. Versions allow one to uniquely identify the state of the item as observed by a read operation, e.g., r(xi).

Strong Eventual Consistency (SEC)
Eventual consistency (EC) states that, for every data item x in the store, if there is no new update on x, eventually clients observe x in the same state. Strong eventual consistency (SEC) further constrains the behavior of the data replicas. In detail, a data store is SEC when it is EC and, moreover, for every item x, any two replicas of x that applied the same set of updates on item x are in the same state.

Client Monotonic (CM)
Client monotonic (CM) ensures that a client always observes the results of its own past operations (Terry et al. 1994). CM enforces the following four so-called session guarantees:

(i) Monotonic reads (MR): if a client executes r(xi), then r(xj≠i) in history h, necessarily xj follows xi for some version order <h,x over the updates applied to x in h.
(ii) Monotonic writes (MW): if a client executes w(xi), then w(xj), the version order xi <h,x xj holds;
(iii) Read-my-writes (RMW): when a client executes w(xi) followed by r(xj≠i), then xi <h,x xj holds;
(iv) Writes-follow-reads (WFR): if a client executes r(xi) followed by w(xj), it is true that xi <h,x xj.

Most consistency models require CM, but this guarantee is so obvious that it might sometimes be omitted – this is, for instance, the case in Gray and Reuter (1992).

Read Atomic (RA)
Under RA, a transaction sees either all of the updates made by another transaction or none of them (the All-or-Nothing guarantee). For instance, if a transaction T sees the version xi written by Ti and transaction Ti also updates y, then T should observe at least version yi. If history h fails to satisfy RA, a transaction in h exhibits a fractured read (Bailis et al. 2014). For instance, this is the case of the transaction executed by process q in history h2 below.

    p: [ x++  y++ ]
    q: [ r(x):1  r(y):0 ]                  (h2)

Consistent Snapshot (CS)
A transaction Ti depends on a transaction Tj when it reads a version written by Tj, or such a relation holds transitively. In other words, denoting Ti →wr Tj when Ti reads from Tj, Tj is in the transitive closure of the relation (→wr) when starting from Ti.
When a transaction never misses the effects of some transaction it depends on, the transaction observes a consistent snapshot (Chan and Gray 1985). In more formal terms, a transaction Ti in a history h observes a consistent snapshot when, for every object x, if (i) Ti reads version xj, (ii) Tk writes version xk, and (iii) Ti depends on Tk, then version xk is followed by version xj in the version order <h,x. A history h belongs to CS when all its transactions observe a consistent snapshot. For instance, this is not the case of history h3 below. In this history, transaction T3 = r(y).r(x) depends on T2 = r(x).(y++), and T2 depends on T1 = (x++), yet T3 does not observe the effect of T1.

    p: [ x++ ]                [ r(y):1  r(x):0 ]
    q:         [ r(x):1  y++ ]                       (h3)

Causal Consistency (CC)
Causal consistency (CC) holds when transactions observe consistent snapshots of the system, and the client application processes are monotonic. CC is a weak consistency model, and it does not allow solving consensus. It is in fact the strongest model that is available under partition (Attiya et al. 2017). Historically, CC refers to the consistency of single operations on a shared memory (Ahamad et al. 1995). Causally consistent transactions (Causal HAT) is a consistency model that extends CC to transactional data stores (Bailis et al. 2013).

Snapshot Isolation (SI)
SI is a widely used consistency model (Berenson et al. 1995). This model is strong but allows more interleavings of concurrent read transactions than SER. Furthermore, SI is causal (i.e., SI ⊆ CC), whereas SER is not.
Under SI, a transaction observes a snapshot of the state of the data store at some point prior in time. Strong snapshot isolation (SSI) requires this snapshot to contain all the preceding transactions in real time (Daudjee and Salem 2006). Two transactions may commit under SI as long as they do not concurrently write the same item. SI avoids the anomalies listed in section "Consistency Models" but exhibits the write-skew anomaly, illustrated in history h4 below.
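The write-skew anomaly can be made concrete with a small simulation: two transactions read from the same snapshot and write disjoint items, so SI admits both commits, yet their combined effect violates an invariant that each transaction preserves on its own. A minimal sketch, in which the run_si helper is an illustrative stand-in for an SI engine rather than a real one:

```python
# Application invariant: x + y >= 1 ("at least one doctor stays on call").
store = {"x": 1, "y": 1}

def run_si(store, body):
    """Execute `body` against a snapshot taken at transaction start,
    as under SI, and return the transaction's write set."""
    snapshot = dict(store)
    return body(snapshot)

# Each transaction goes off call only if the snapshot shows the other on call.
w1 = run_si(store, lambda s: {"x": 0} if s["x"] + s["y"] >= 2 else {})
w2 = run_si(store, lambda s: {"y": 0} if s["x"] + s["y"] >= 2 else {})

# The write sets are disjoint ({"x"} vs {"y"}), so SI's concurrent-write
# check passes and both transactions commit.
assert set(w1).isdisjoint(w2)
store.update(w1)
store.update(w2)

# Each transaction preserves x + y >= 1 in isolation, yet the combined
# effect violates it: this is the write-skew anomaly.
assert store["x"] + store["y"] == 0
```

Under SER, one of the two transactions would be ordered after the other, observe its write, skip its own update, and thereby keep the invariant.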
Database Consistency Models, Table 2 Three-dimensional features of consistency models and systems

Acronym     Ordering    Visibility  Composition
EC          Capricious  Rollbacks   Single Operation
CM          Concurrent  Monotonic   Single Operation
CS          Concurrent  Transitive  All-or-Nothing + Snapshot
CC          Concurrent  Causal      Single Operation
Causal HAT  Concurrent  Causal      All-or-Nothing + Snapshot
LIN         Total       External    Single Operation
NMSI        Gapless     Transitive  All-or-Nothing + Snapshot
PSI         Gapless     Causal      All-or-Nothing + Snapshot
RC          Concurrent  Monotonic   All-or-Nothing
SC          Total       Causal      Single Operation
SER         Total       Transitive  All-or-Nothing + Snapshot
SI          Gapless     Transitive  All-or-Nothing + Snapshot
SSER        Total       External    All-or-Nothing + Snapshot
SSI         Gapless     External    All-or-Nothing + Snapshot

Future Directions for Research

Consistency models are formulated in various frameworks and using different underlying assumptions. For instance, some works (ANSI 1992; Berenson et al. 1995) define a model in terms of the anomalies it forbids. Others rely on specific graphs to characterize a model (Adya 1999) or on predicates over histories (Viotti and Vukolić 2016). The existence of a global time (Papadimitriou 1979) is sometimes taken for granted. This contrasts with approaches (Lamport 1986) that avoid making such an assumption. A similar observation holds for concurrent operations, which may (Guerraoui and Kapalka 2008) or may not (Ozsu and Valduriez 1991) overlap in time.
This rich literature makes an apples-to-apples comparison between consistency models difficult. Works exist (Chrysanthis and Ramamritham 1994) that attempt to bridge this gap by expressing them in a common framework. However, not all the literature is covered, and it is questionable whether their definitions are equivalent to the ones given in the original publications.
The general problem of the implementability of a given model is also an interesting avenue for research. One may address this question in terms of the minimum synchrony assumptions to support a particular model. In distributed systems, this approach has led to the rich literature on failure detectors (Freiling et al. 2011). A related question is to establish lower and upper bounds on the time and space complexity of an implementation (when it is achievable). As pointed out in section "Trade-Offs", some results already exist, yet the picture is incomplete.
From an application point of view, three questions are of particular interest. First, the robustness of an application against a particular consistency model (Fekete et al. 2005; Cerone and Gotsman 2016). Second, the relation between a model and a concurrency control protocol. These two questions are related to the grand challenge of synthesizing concurrency control from the application specification (Gotsman et al. 2016). A third challenge is to compare consistency models in practice (Kemme and Alonso 1998; Wiesmann and Schiper 2005; Saeida Ardekani et al. 2014), so as to understand their pros and cons.

Cross-References

Achieving Low Latency Transactions for Geo-Replicated Storage with Blotter
Conflict-Free Replicated Data Types CRDTs
Coordination Avoidance
Data Replication and Encoding
Databases as a Service
Geo-Replication Models
Geo-Scale Transaction Processing
In-Memory Transactions
Storage Hierarchies for Big Data
TARDiS: A Branch-and-Merge Approach to Weak Consistency
Weaker Consistency Models/Eventual Consistency

References

Adya A (1999) Weak consistency: a generalized theory and optimistic implementations for distributed transactions. Ph.D., MIT
Ahamad M, Neiger G, Burns JE, Kohli P, Hutto PW (1995) Causal memory: definitions, implementation, and programming. Distrib Comput 9(1):37–49
ANSI (1992) American national standard for information systems – database language – SQL. American National Standards Institute, New York
Attiya H, Welch JL (1994) Sequential consistency versus linearizability. ACM Trans Comput Syst 12(2):91–122
Attiya H, Bar-Noy A, Dolev D (1990) Sharing memory robustly in message-passing systems. Technical report MIT/LCS/TM-423, Massachusetts Institute of Technology, Laboratory for Computer Science, Cambridge
Attiya H, Hillel E, Milani A (2009) Inherent limitations on disjoint-access parallel implementations of transactional memory. In: Proceedings of the twenty-first annual symposium on parallelism in algorithms and architectures, SPAA'09. ACM, New York, pp 69–78
Attiya H, Ellen F, Morrison A (2017) Limitations of highly-available eventually-consistent data stores. IEEE Trans Parallel Distrib Syst (TPDS) 28(1):141–155
Bailis P, Davidson A, Fekete A, Ghodsi A, Hellerstein JM, Stoica I (2013) Highly available transactions: virtues and limitations. Proc VLDB Endow 7(3):181–192
Bailis P, Fekete A, Hellerstein JM, Ghodsi A, Stoica I (2014) Scalable atomic visibility with ramp transactions. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, SIGMOD'14. ACM, New York, pp 27–38
Berenson H, Bernstein P, Gray J, Melton J, O'Neil E, O'Neil P (1995) A critique of ANSI SQL isolation levels. SIGMOD Rec 24(2):1–10
Blasgen M, Gray J, Mitoma M, Price T (1979) The convoy phenomenon. ACM SIGOPS Oper Syst Rev 13(2):20–25
Burckhardt S, Gotsman A, Yang H, Zawirski M (2014) Replicated data types: specification, verification, optimality. In: The 41st annual ACM SIGPLAN-SIGACT symposium on principles of programming languages, POPL'14, San Diego, 20–21 Jan 2014, pp 271–284
Bushkov V, Dziuma D, Fatourou P, Guerraoui R (2014) The PCL theorem: transactions cannot be parallel, consistent and live. In: Proceedings of the 26th ACM symposium on parallelism in algorithms and architectures
Fekete A, Liarokapis D, O'Neil E, O'Neil P, Shasha D (2005) Making snapshot isolation serializable. Trans Database Syst 30(2):492–528
Fischer MJ, Lynch NA, Patterson MS (1985) Impossibility of distributed consensus with one faulty process. J ACM 32(2):374–382
Freiling FC, Guerraoui R, Kuznetsov P (2011) The failure detector abstraction. ACM Comput Surv 43(2):9:1–9:40
Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33(2):51–59
Gotsman A, Yang H, Ferreira C, Najafzadeh M, Shapiro M (2016) 'Cause I'm strong enough: reasoning about consistency choices in distributed systems. In: Proceedings of the 43rd annual ACM SIGPLAN-SIGACT symposium on principles of programming languages, POPL'16. ACM, New York,
tures, SPAA’14. ACM, New York, pp 178–187 pp 371–384
Cerone A, Gotsman A (2016) Analysing snapshot isola- Gray J, Reuter A (1992) Transaction processing: concepts
tion. In: Proceedings of the 2016 ACM symposium on and techniques, 1st edn. Morgan Kaufmann Publishers
principles of distributed computing, PODC, Chicago, Inc., San Francisco
25–28 July 2016, pp 55–64 Gray J, Reuter A (1993) Transaction processing: concepts
Chan A, Gray R (1985) Implementing distributed and techniques. Morgan Kaufmann, San Francisco,
read-only transactions. IEEE Trans Softw Eng SE- ISBN:1-55860-190-2
11(2):205–212 Guerraoui R, Kapalka M (2008) On the correctness of
Chandra TD, Hadzilacos V, Toueg S (1996) The weak- transactional memory. In: Proceedings of the 13th
est failure detector for solving consensus. J ACM ACM SIGPLAN symposium on principles and practice
43(4):685–722 of parallel programming, PPoPP’08. ACM, New York,
Charron-Bost B (1991) Concerning the size of logi- pp 175–184
cal clocks in distributed systems. Inf Process Lett Herlihy M, Wing J (1990) Linearizability: a correcteness
39(1):11–16 condition for concurrent objects. ACM Trans Program
Chrysanthis PK, Ramamritham K (1994) Synthesis of Lang Syst 12(3):463–492
extended transaction models using ACTA. ACM Trans Israeli A, Rappoport L (1994) Disjoint-access-parallel
Database Syst 19(3):450–491 implementations of strong shared memory primitives.
Corbett JC, Dean J, Epstein M, Fikes A, Frost C, Furman In: Proceedings of the thirteenth annual ACM sympo-
J, Ghemawat S, Gubarev A, Heiser C, Hochschild P, sium on principles of distributed computing, PODC’94.
Hsieh W, Kanthak S, Kogan E, Li H, Lloyd A, Mel- ACM, New York, pp 151–160
nik S, Mwaura D, Nagle D, Quinlan S, Rao R, Rolig Kemme B, Alonso G (1998) A suite of database
L, Saito Y, Szymaniak M, Taylor C, Wang R, Wood- replication protocols based on group communication
ford D (2012) Spanner: Google’s globally-distributed primitives. In: International conference on distributed
database. In: Symposium on operating system de- computing systems, IEEE. IEEE computer society,
sign and implementation (OSDI). Usenix, Hollywood, pp 156–163
pp 251–264 Ladin R, Liskov B, Shrira L (1990) Lazy replication:
Daudjee K, Salem K (2006) Lazy database replication exploiting the semantics of distributed services. In:
with snapshot isolation. In: Proceedings of the 32nd in- IEEE computer society technical committee on operat-
ternational conference on very large data bases, VLDB ing systems and application environments, IEEE, vol 4.
Endowment, VLDB’06, pp 715–726 IEEE computer society, pp 4–7
Delporte-Gallet C, Fauconnier H, Guerraoui R, Hadzi- Lamport L (1979) How to make a multiprocessor com-
lacos V, Kouznetsov P, Toueg S (2004) The weakest puter that correctly executes multiprocess programs.
failure detectors to solve certain fundamental prob- IEEE Trans Comput 28(9):690–691
lems in distributed computing. In: Proceedings of the Lamport L (1986) On interprocess communication.
twenty-third annual ACM symposium on principles of Part I: basic formalism. Distrib Comput 1(2):
distributed computing, PODC’04. ACM, New York, pp 77–85
338–346 Lamport L (2006) Lower bounds for asyn-
Fekete A, Gupta D, Luchangco V, Lynch N, Shvartsman chronous consensus. Distrib Comput 19(2):
A (1999) Eventually-serializable data services. Theor 104–125
Comput Sci 220:113–156. Special issue on Distributed Murty J (2008) Programming Amazon web services, 1st
Algorithms edn. O’Reilly, Sebastopol
Ozsu MT, Valduriez P (1991) Principles of distributed database systems. Prentice-Hall, Inc., Upper Saddle River
Papadimitriou CH (1979) The serializability of concurrent database updates. J ACM 26(4):631–653
Peluso S, Palmieri R, Romano P, Ravindran B, Quaglia F (2015) Disjoint-access parallelism: impossibility, possibility, and cost of transactional memory implementations. In: Proceedings of the 2015 ACM symposium on principles of distributed computing, PODC'15. ACM, New York, pp 217–226
Saeida Ardekani M, Sutra P, Shapiro M (2013a) Non-Monotonic Snapshot Isolation: scalable and strong consistency for geo-replicated transactional systems. In: Symposium on reliable distributed systems (SRDS). IEEE Computer Society, Braga, pp 163–172
Saeida Ardekani M, Sutra P, Shapiro M (2013b) On the scalability of snapshot isolation. In: Proceedings of the 19th international Euro-Par conference (EUROPAR)
Saeida Ardekani M, Sutra P, Shapiro M (2014) G-DUR: a middleware for assembling, analyzing, and improving transactional protocols. In: Proceedings of the 15th international middleware conference, Middleware'14. ACM, New York, pp 13–24
Schiper N, Sutra P, Pedone F (2010) P-store: genuine partial replication in wide area networks. In: Proceedings of the 29th IEEE international symposium on reliable distributed systems (SRDS)
Shapiro M, Preguiça N, Baquero C, Zawirski M (2011) Conflict-free replicated data types. In: Défago X, Petit F, Villain V (eds) International symposium on stabilization, safety, and security of distributed systems (SSS). Lecture notes in computer science, vol 6976. Springer, Grenoble, pp 386–400
Shapiro M, Saeida Ardekani M, Petri G (2016) Consistency in 3D. In: Desharnais J, Jagadeesan R (eds) International conference on concurrency theory (CONCUR), Schloss Dagstuhl – Leibniz-Zentrum für Informatik. Dagstuhl Publishing, Leibniz international proceedings in informatics (LIPIcs), Québec, vol 59, pp 3:1–3:14
Sovran Y, Power R, Aguilera MK, Li J (2011) Transactional storage for geo-replicated systems. In: Symposium on operating system principles (SOSP). Association for Computing Machinery, Cascais, pp 385–400
Terry DB, Demers AJ, Petersen K, Spreitzer MJ, Theimer MM, Welch BB (1994) Session guarantees for weakly consistent replicated data. In: International conference on parallel and distributed information systems (PDIS), Austin, pp 140–149
Viotti P, Vukolić M (2016) Consistency in non-transactional distributed storage systems. ACM Comput Surv 49(1):19:1–19:34
Vogels W (2008) Eventually consistent. ACM Queue 6(6):14–19
Wang J, Talmage E, Lee H, Welch JL (2014) Improved time bounds for linearizable implementations of abstract data types. In: 2014 IEEE 28th international parallel and distributed processing symposium, pp 691–701
Wiesmann M, Schiper A (2005) Comparison of database replication techniques based on total order broadcast. IEEE Trans Knowl Data Eng 17(4):551–566
Zemke F (2012) What's new in SQL:2011. SIGMOD Rec 41(1):67–73

Databases as a Service

Renato Luiz de Freitas Cunha
IBM Research, São Paulo, Brazil

Synonyms

Cloud Databases

Definitions

Cloud provider: A cloud infrastructure provider, such as Amazon Web Services, Microsoft Azure, or IBM Cloud.
End user: A user of cloud services, a person or company that uses services from the cloud provider.
Database server: A database management system that uses the client-server model.
Database as a service: A cloud computing service model that abstracts the setup of hardware, software, and tuning required for providing access to a database.

Acronyms

IAAS Infrastructure as a service
PAAS Platform as a service
SAAS Software as a service
DBAAS Database as a service
IT Information technology
DBA Database administrator
API Application programming interface

Overview

Configuring and maintaining database systems is a complicated endeavor, which includes planning the hardware that will be used, configuring disks and file systems, installing and tuning
database management systems, determining whether and how replication will be done, and managing backups. In traditional information technology (IT) environments, the setup of database systems requires the work of different professionals, such as database administrators (DBAs), systems administrators, and storage administrators. In fact, tuning databases requires broad knowledge and a principled approach for measuring performance (Shasha and Bonnet 2002), and a well-tuned database system can perform faster than an improperly tuned system by a factor of two or more (Stonebraker and Cattell 2011).

In a Database as a service (DBAAS) setting, a service provider hosts databases, providing mechanisms to create, store, and access databases at the provider. The means of interacting with the service are usually web interfaces and application programming interfaces (APIs). Although popularized in recent years by cloud computing and big data, this is not a new concept and, in fact, has been available for more than a decade (Hacigumus et al. 2002).

More recently, cloud providers have made DBAAS offerings available. In a DBAAS model, application developers neither have to maintain machines nor have to install and maintain database management systems themselves. Instead, the database service provider is responsible for installing and maintaining database systems, and users are charged according to their usage of the service.

To better place the DBAAS model among the cloud computing service models, a quick review of the cloud computing definition will be presented. Interested readers are directed to other sources (Mell and Grance 2011; Armbrust et al. 2010) for a more detailed definition. Readers familiar with cloud computing concepts can safely skip the next section.

Defining Characteristics of Cloud Computing

The essential characteristics of the cloud computing model are that it has broad network access, with resource pooling, rapid elasticity, and measured services, and that it provides on-demand self-service (Mell and Grance 2011). Broad network access is a necessity in clouds, for cloud providers can be geographically distant from end users, and even in this case, end users rely on cloud services. In clouds, resources are pooled, meaning that resources are already available in the provider and, upon request, are assigned to users. These resource pools enable the rapid elasticity in clouds, enabling the addition and removal of resources in a short amount of time.

Services in clouds can be provided in three main models:

Software as a service (SAAS) In this model, applications execute on a cloud infrastructure, and users normally access them from a web browser. Usually, consumers are end users, who will not do development on top of such applications.
Platform as a service (PAAS) In this model, application developers and service providers are able to deploy onto the cloud infrastructure applications created using programming languages, libraries, tools, and services supported by the cloud provider.
Infrastructure as a service (IAAS) In this model, customers have access to computing, storage, network, and similar services where the customer can execute arbitrary operating systems and software.

In this entry, there is no distinction between private and public clouds, for the discussion will be made considering the perspective of the end user and the cloud provider. In private clouds, the infrastructure is provisioned for exclusive use by a single organization. In this model, different business units are usually the end users of cloud services. In public clouds, the infrastructure is provisioned for open use by the general public.

Positioning DBAAS

IAAS allows one to abstract the actual hardware configuration away, but the issues of properly configuring databases remain. Hence, end users still have to devote time and energy to properly configuring and maintaining database systems, when such energy would be better applied in
developing the system they wanted to write in the first place.

From the cloud computing definition, DBAAS has aspects of both SAAS and PAAS: although databases are software packages, they expose a programming interface, the query language, with which developers interact. In public clouds, DBAAS is usually offered in the PAAS layer.

Advantages of the DBAAS Model

There are many advantages in adopting a DBAAS model, all of them resulting from using a cloud-based service model. In DBAAS, due to the self-service nature of the cloud, to create new database instances, a user only needs to make a request in the cloud administration portal. Also, there is no need to plan for machines, storage, or replication strategies, since cloud providers either provide options to enable replication automatically in different zones or manage storage transparently, depending on provider and service.

The agility of cloud deployments should be contrasted with traditional IT operations. Consider, for example, that a business unit is developing a new application that requires a database server connection. To set up the server, a system administrator would have to be involved to allocate the machine and install the database server for the business unit. Between requesting a database server and actually being able to use it, the business unit could have to wait one or two business days. In the cloud, due to its high levels of automation, a new database instance can be provisioned in a matter of minutes.

Assuming the business unit of the previous example used IAAS as a model for installing database servers, whoever provisioned the computing resource would still be required to implement corporate policies, such as creating different accounts for different users and configuring passwords and firewalls according to corporate policy. Essentially, the business unit would still need access to a system administrator to properly configure the computing resource. Some of the work involved in this can be reduced by using virtual machine templates, but, given the control users can have over virtual machines, some human verification may still be required.

What makes DBAAS stand out is the heavy use of abstraction and automation, another characteristic of the cloud. To provision a new database server instance, the user specifies the database engine and version required and the amount of RAM and storage required, and the cloud provider takes care of actually provisioning the database instance.

In principle, DBAAS works well for both small and large users. Teams that need small databases benefit from the fast provisioning times and from not requiring a DBA, while teams with large database requirements benefit from the ability to scale up instances whenever needed.

Disadvantages of the DBAAS Model

Although DBAAS is a good solution for many companies and teams, it is not without its issues. Users might need database extensions not available at the cloud provider. Consider, for example, PostGIS, an extension that adds support for geographic objects to the PostgreSQL database. To add this support, PostGIS is installed as a set of helper programs, shared library files, and SQL scripts. If an end user needs PostGIS enabled in a database, but the cloud provider does not support it, then the user has no means of adding geographic object support for that database instance. The same argument can be made about legacy versions of database servers or unpopular database engines: if the cloud provider does not support them, they cannot be used.

With regard to scaling, not all providers might support scaling down database server instances. If, for example, a user provisioned a database instance larger than what was actually needed, the only solution to scaling down might be provisioning a new database instance and migrating the data from the old instance to the new one. Scaling database services out might suffer from a similar issue. If the cloud provider lacks support
for multiple server instances serving the same database – this can occur, for example, when a database grows to be bigger than the biggest instance type supported by the cloud provider – then the user has to implement support manually, i.e., by partitioning the data among different database servers.

Depending on the needs of the end user, the premium for using DBAAS might be higher than deploying databases traditionally in IAAS clouds or on in-house machines. Therefore, a cost-benefit analysis must be made prior to choosing one option or another.

Types of Databases

Currently, there are two dominant database paradigms: relational (SQL) and the so-called NoSQL databases. Relational databases are based on relational calculus theory and make use of the structured query language (SQL). Different relational databases implement different dialects of SQL, but the language itself is standardized, and SQL databases make use of similar terminology. Relational databases provide ACID (atomicity, consistency, isolation, and durability) guarantees:

Atomicity requires that each transaction happen all at once or not at all – if one part of the transaction fails, then the whole transaction fails – to the outside world, committed transactions appear indivisible;
Consistency ensures that transactions change the database only between valid states, making sure that all defined rules, such as cascades, constraints, and triggers, are always valid;
Isolation ensures that one transaction cannot interfere with another and that the concurrent execution of transactions results in the same state that would be obtained if they were executed sequentially;
Durability ensures that once a transaction has been committed, it will remain committed, even in the event of crashes.

The term NoSQL is newer and relates to databases with different data models, such as tuples (where row fields are predefined in a schema), documents (which allow attributes to be nested documents or lists of values as well as scalars, without the definition of a schema), extensible records (hybrids between tuples and documents), and objects (analogous to objects in programming languages, but without methods) (Cattell 2011). As opposed to relational databases, NoSQL databases provide BASE (basically available, soft state, eventually consistent) properties:

Basically available means that the system guarantees availability, in which every request receives a response, without guarantee of containing the most recent write;
Soft state means that state can change, even without input, due to eventual consistency;
Eventually consistent indicates the system will become consistent over time, given it does not receive input during that time.

NoSQL databases became popular due to big Internet companies, such as Amazon, Facebook, and Google, using them for scaling their services. Traditionally, relational databases were considered hard to scale out, particularly due to the ACID guarantees. Hence, companies with millions of users needed better ways to scale their systems. In a way, these database systems were introduced to address the "velocity" and "volume" of the big data Vs. Examples of early successful systems are Dynamo (DeCandia et al. 2007) and BigTable (Chang et al. 2008).

The gains in performance of NoSQL databases over SQL databases come from the BASE properties. By relaxing the ACID properties, NoSQL databases can return from queries faster, allowing for faster responses. Also, eventual consistency allows for easier load balancing between back-end servers, since the global view does not have to be consistent the whole time. In fact, "customers tolerate airline over-booking, and orders that are rejected when items in an online shopping cart are sold out before the order is finalized" (Cattell 2011), which makes a case for using such databases.
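As a concrete illustration of the atomicity guarantee listed above, the following sketch uses Python's built-in sqlite3 module standing in for any relational engine; the schema, account values, and transfer function are invented for the example. A transaction that violates a constraint is rolled back as a whole.

```python
import sqlite3

# Toy schema standing in for a managed relational instance (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts ("
             "id INTEGER PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Credit dst and debit src in one transaction: both apply or neither."""
    try:
        with conn:  # begins a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            # If this debit violates the CHECK constraint, the credit above
            # is undone as well -- the transaction is atomic.
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 60)       # succeeds: balances become 40 and 60
failed = transfer(conn, 1, 2, 500)  # would drive account 1 negative: rolled back
print(ok, failed, conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
```

A BASE-style store would instead acknowledge the update immediately and reconcile replicas later, trading this all-or-nothing behavior for availability.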
Databases as a Service, Table 1 Non-exhaustive list of DBAAS offerings of three big cloud providers as of the fourth quarter of 2017. Table sorted by provider name

Provider  Type        Databases supported
Amazon    Relational  Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, Microsoft SQL Server
Amazon    NoSQL       Amazon SimpleDB, Amazon Redshift, Amazon DynamoDB, Amazon Neptune, JanusGraph
Google    Relational  PostgreSQL, MySQL, Cloud Spanner
Google    NoSQL       Cloud Bigtable, Cloud Datastore
IBM       Relational  IBM DB2, PostgreSQL, MySQL
IBM       NoSQL       IBM Cloudant, MongoDB, JanusGraph, Redis, RethinkDB, ScyllaDB
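The provisioning flow described under "Positioning DBAAS" – the user names an engine, a version, and resource sizes, and the provider does the rest – can be sketched as a small request validator. The catalog contents and field names below are illustrative assumptions, not any provider's actual API.

```python
# Hypothetical offering catalog; engines and versions are placeholders,
# not a real provider's list.
SUPPORTED_ENGINES = {
    "postgresql": {"9.6", "10"},
    "mysql": {"5.7", "8.0"},
}

def make_provision_request(engine, version, ram_gb, storage_gb):
    """Validate a DBAAS provisioning request before handing it to the provider."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError("engine %r is not offered by this provider" % engine)
    if version not in SUPPORTED_ENGINES[engine]:
        raise ValueError("version %s of %s is not offered" % (version, engine))
    if ram_gb <= 0 or storage_gb <= 0:
        raise ValueError("RAM and storage sizes must be positive")
    # The provider's control plane would now pick hardware, install the
    # engine, configure storage and replication, and return an endpoint.
    return {"engine": engine, "version": version,
            "ram_gb": ram_gb, "storage_gb": storage_gb}

print(make_provision_request("postgresql", "10", ram_gb=8, storage_gb=100))
```

Requests for engines or versions outside the catalog fail up front, which mirrors the limitation discussed above: if the provider does not offer an engine, it cannot be used.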
Tahara D, Diamond T, Abadi DJ (2014) Sinew: a SQL system for multi-structured data. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data. ACM, pp 815–826
Verbitski A, Gupta A, Saha D, Brahmadesam M, Gupta K, Mittal R, Krishnamurthy S, Maurice S, Kharatishvili T, Bao X (2017) Amazon aurora: design considerations for high throughput cloud-native relational databases. In: Proceedings of the 2017 ACM international conference on management of data. ACM, pp 1041–1052

Data-Driven Process Simulation

Benoît Depaire and Niels Martin
Research group Business Informatics, Hasselt University, Diepenbeek, Belgium

Definitions

Data-driven process simulation is a technique which constructs a computer model that imitates the internal details of a business process and extensively uses data – recorded by information systems supporting the actual process – to do so. The model is used to execute what-if scenarios in order to better understand the actual process behavior and predict the impact of potential changes to the process.

Overview

Data-Driven Process Simulation
Every organization executes multiple business processes – e.g., the production, transportation, and billing process – which have to be managed properly to generate customer value (Dumas et al. 2013). An essential part of business process management is the identification and design of process improvement opportunities – e.g., hire more staff to reduce waiting time at a specific step in the process. Since a business process typically has a complex and dynamic nature, it is often impossible to deduce analytically the full impact of a specific process change. In such a setting, process simulation is a valuable instrument.

Process simulation is the imitation of the actual process through the use of a (computer) model. It allows an organization to verify the consequences of proposed process modifications prior to implementation (Melão and Pidd 2003). The quality of the results of process simulation is directly dependent on the quality of the process simulation model.

Therefore, to construct a high-quality simulation model, the researcher must gather extensive process information as input. Traditionally, such input is gathered by means of business documents, interviews, and observations. While valuable, it has been shown that these sources also have limitations:

• Process documentation often describes how a process should be executed, which can deviate from the actual process behavior (Măruşter and van Beest 2009).
• Interviews with business experts can result in contradictory information (Vincent 1998), and the perception of the interviewees might be heavily influenced by recent experiences within the process (Dickey and Pearson 2005).
• Observational data could suffer from the Hawthorne effect (Robinson 2004) – i.e., the performance of staff members increases simply because they know they are being observed – and often requires a significant time investment.

Another type of input is the data and records kept by different information systems about the daily process execution. With the rise of process-aware information systems – supporting the entire process rather than a specific activity (Dumas et al. 2005), e.g., enterprise resource planning systems – the amount of these process data is growing rapidly. The benefit of these data is that they provide an objective view on the real process behavior. The challenge, however, is to extract insights from these data which can be used to construct more realistic simulation models.

The goal of data-driven process simulation is to extend the traditional inputs for simulation model construction with such process data and
608 Data-Driven Process Simulation
to extract insights from the data with respect to modeling aspects of process simulation. It should
different aspects of the simulation model. As a be noted that, in general, the insights that can
consequence, it encompasses several techniques be retrieved from process data depend upon the
which support specific modeling decisions. quality and granularity of the available data. To
structure the discussion in this section, the four
Building Blocks of a Simulation Model building blocks of a simulation model and their
When constructing a simulation model, four associated modeling decisions are used.
building blocks can be distinguished: entities,
activities, control-flow, and resources (Martin Entities
et al. 2016b). For each of these building blocks, With respect to the entity building block, two
a series of modeling tasks need to be executed modeling decisions will be discussed: the arrival
during the construction of a simulation model. rate and the entity typology.
An entity is the dynamic object that flows
through the process model and which instigates Entity Arrival Rate
action in the process as activities are executed on Modeling the entity arrival rate – i.e., the pace at
it by resources (Kelton et al. 2015; Tumay 1996) which new entities arrive at the process – typi-
– e.g., a specific order (entity) in an order picking cally consists of a deterministic and a stochastic
process. During simulation, multiple entities can part. The deterministic part can be a constant
be simultaneously present within the process and or dependent on other context-related factors.
can influence each other’s process flow by, e.g., The stochastic element is typically modeled as a
occupying particular resources. probability distribution (van der Aalst 2015).
An activity is the execution of a specific task Process data is used to learn the parameters
by a specific resource on a given entity (Tumay of a predefined probability distribution. The ex-
1996) – e.g., the registration (task) by employee ponential distribution is often used under the
X (resource) of order Y (entity). assumption that the start of the first activity in
The control-flow of a process expresses the a process execution defines the actual arrival
temporal relationship between the activities. of the entity and the entities arrive independent
Control-flow consists of sequence flows – from each other (Rozinat et al. 2009). When
modeling “directly follows” relationships – and these assumptions appear too restrictive – e.g.,
gateways – modeling complex process behavior an entity arrived before the first activity was
such as choice and parallelism. The control- executed on it – the ARPRA algorithm (Martin
flow defines the different routes an entity can et al. 2016c) accounts for queue formation before
follow through the process – e.g., an international the first activity and discovers the proper gamma
order might follow a different route in the order distribution – which is a generalization of the
picking process than a national order, resulting in exponential distribution.
different or additional activities that need to be
executed on the order. Entity Types
The actual execution of activities is the re- No two entities are exactly the same, and it
sponsibility of resources. Basic resource types is not uncommon that entity properties have a
are technical resources, such as software systems, direct impact on the process execution – e.g., the
and human resources (Dumas et al. 2013; Tumay physical dimensions of an order have a direct
1996). impact on the duration of the packaging activity.
To simplify complex processes, it is common to
define entity types which represent a group of
Key Research Findings entities similar in terms of specific characteristics
(i.e., attributes) (Kelton et al. 2015).
This section provides the key research results on When working with entity types, the qual-
how process data can be used to improve different ity of the defined typology has a direct influ-
Data-Driven Process Simulation 609
ence on the quality of the simulation model. To can be used to discover these dependencies and
identify relevant entity types, cluster analysis is a useful technique to segment the entities (Xu and Wunsch 2005). Recently, various algorithms have been developed that discover clusters which group entities sharing a common execution path and where clusters – entity types – differ in terms of process execution. Various approaches exist: a vector-based approach where the trajectory of an entity is translated into a set of features (Bose and van der Aalst 2010; de Medeiros et al. 2008; Delias et al. 2015; Greco et al. 2006; Song et al. 2009), an approach based on the generic edit distance (Bose and van der Aalst 2009), Markov model-based clustering (Veiga and Ferreira 2010), and an approach driven by process model accuracy (De Weerdt et al. 2013).

Activities

With respect to activities, five important modeling decisions will be discussed: determining which activities need to be included, activity duration, resource requirements, queue discipline, and activity interruptions.

Activity Definition

Process data is typically captured at the level of events – i.e., specific moments in time which change the state of the process. Even when data is logged at the level of activities, this data might be at too granular a level for process simulation purposes. In both cases, techniques are needed to identify appropriate activities and to map data to higher-level activities. For this purpose, activity mining (Günther et al. 2010) can be used, which only requires the event log. Other approaches which incorporate domain knowledge are also available, such as Szimanski et al. (2013), Baier et al. (2014), and Mannhardt et al. (2016).

Activity Duration

Modeling the activity duration typically consists of a deterministic and a stochastic element. Deterministic differences in activity duration can depend on a wide range of entity properties but also on resource workload (cf. Pospisil and Hruška (2012) and Nakatumba and van der Aalst (2010); Nakatumba et al. (2012)). Process data can be used to model them correctly. The stochastic element is modeled by means of a probability distribution. Activity durations are extracted from the process data – i.e., the difference between the start and completion time of an activity – and used to fit the probability distribution.

When either the start or completion time of activities is missing, this approach cannot be applied directly. Literature suggests several approaches to deal with such missing data. When only the completion time is recorded, various proxies are suggested for the corresponding start time, such as the completion time of the previous activity for that entity or the time at which the resource executing the activity finished its previous job (Nakatumba 2013; Wombacher and Iacob 2013). When only the start of an activity is recorded, similar proxies have been suggested in Martin et al. (2016b). Be aware that several issues, such as the interruptibility of activities or the presence of batch processing, can render the extraction of duration observations more complex.

Resource Requirements

The resource(s) required to execute a particular activity need(s) to be specified. Process data containing resource information can be used to support this modeling task, and different techniques exist to reveal the relationship between activities and resources (Huang et al. 2011; Liu et al. 2008; Ly et al. 2006; Senderovich et al. 2014).

Queue Discipline

In a simulation model, one must define how queues are dealt with at activities. The queue discipline establishes the processing order of queuing entities, such as first-in-first-out (FIFO), last-in-first-out (LIFO), or according to a priority rule. Process data can be used to discover the queuing discipline that is applied. Suriadi et al. (2017) propose a method to extract FIFO, LIFO, and priority-based processing patterns from process-related data and consider three perspectives: the resource, activity, and case perspective, each representing potential levels of aggregation.
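A minimal illustration of discipline discovery in this spirit (a toy order-agreement check, not Suriadi et al.'s actual method): for each pair of entities that queued at an activity, test whether their service order matches their arrival order (FIFO) or reverses it (LIFO). All field names and records below are illustrative assumptions.

```python
# Toy waiting-line records for one activity: (case, time the entity joined the
# queue, time its processing started). The record layout is hypothetical.
queue_log = [
    ("c1", 1, 10),
    ("c2", 3, 12),
    ("c3", 5, 11),   # started before c2 although it arrived later
]

def discipline_fit(queue_log):
    """Fraction of entity pairs whose service order agrees with FIFO resp. LIFO."""
    pairs = [(a, b) for i, a in enumerate(queue_log) for b in queue_log[i + 1:]]
    fifo = sum((a[1] < b[1]) == (a[2] < b[2]) for a, b in pairs)
    lifo = sum((a[1] < b[1]) == (a[2] > b[2]) for a, b in pairs)
    n = len(pairs)
    return fifo / n, lifo / n

fifo_score, lifo_score = discipline_fit(queue_log)
print(round(fifo_score, 2), round(lifo_score, 2))  # 0.67 0.33
```

A score close to 1 for one discipline and close to 0 for the other suggests that discipline governs the queue; the real techniques additionally separate the resource, activity, and case perspectives.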
610 Data-Driven Process Simulation
Activity Interruptibility and Unexpected Interruptions

With respect to interruptions, the activity interruptibility and unexpected interruptions need to be modeled. An activity's interruptibility expresses whether it can be interrupted while an entity is being processed. If activities are interruptible, one must still model the interruption frequency, duration, and nature. The nature of an interruption refers to whether the activity execution stops immediately (preemptive) or whether the running entity can still be finished (non-preemptive) (Hopp and Spearman 2011).

Process data containing specific information such as interruption events or resource break information (Senderovich et al. 2014) can be used directly to inform the aforementioned modeling tasks. However, such specific information is often missing, and the interruption time is simply part of the activity duration. Under these circumstances, duration outlier analysis can be used (Pika et al. 2013; Rogge-Solti and Kasneci 2014) to inform the researcher on activity interruptions.

Control-Flow

With respect to the control-flow, two aspects need to be modeled: the control-flow definition between the activities and the gateway routing logic.

Control-Flow Definition

The control-flow determines the path(s) an entity can follow throughout the process and defines sequentiality, choice, and concurrency between activities. A multitude of algorithms exists to retrieve control-flow models from process data (De Weerdt et al. 2012; van der Aalst 2016). Several of these algorithms have been applied in a simulation context, such as the alpha-algorithm (Aguirre et al. 2013; Rozinat et al. 2008, 2009), fuzzy mining (van Beest and Măruşter 2007), and heuristic mining (Leyer and Moormann 2015; Măruşter and van Beest 2009).

Gateway Routing Logic

For gateways representing choice, the routing logic needs to be defined, which can be stochastic (each outgoing path has a specific probability), deterministic (business rules govern the choice), or a combination of both.

In the literature, Rozinat and van der Aalst (2006a,b) use decision trees to discover the decision logic from process data, which is also applied in a simulation context (Rozinat et al. 2008, 2009). Schonenberg et al. (2010) use regression analysis to mine decision logic from process data. The work of de Leoni et al. (2013) allows one to mine rules from process data which are linear combinations of multiple entity attributes. All these approaches result in deterministic rules but can easily be extended with a stochastic element, as suggested by Pospisil and Hruška (2012). A purely stochastic approach is presented in Leyer and Moormann (2015), which simply learns the routing probabilities for each choice.

Resources

With respect to modeling the resources, three modeling tasks are discussed: the definition of resource roles, resource schedules, and batch processing.

Resource Roles

Resource roles group resources performing similar activities and allow one to simplify the simulation model – e.g., activities are assigned to roles instead of specific resources. Various works on how to retrieve insights on resource roles from process data exist. One approach is to perform a cluster analysis on a resource-activity frequency matrix retrieved from the process data (Song and van der Aalst 2008; Rozinat et al. 2009). Another approach is to focus on handover relationships in the data (Burattin et al. 2013) or to cluster resources on the number of entities they collaborated on (Ferreira and Alves 2012).

Resource Schedules

A resource schedule expresses a resource's availability for the simulated process, which can differ from the official working schedule as a resource is often involved in multiple processes. Process data can
help to gather insights into the resource's availability for the process. A first approach to use process data is to estimate the start and end time of a working day from the process data (Wombacher et al. 2011). Other approaches estimate resource availability as a percentage of the total time a resource is executing activities (Liu et al. 2012; van der Aalst et al. 2010), which however do not provide exact availability timeframes. If more fine-grained insights are needed, the work of Martin et al. (2016a) provides a method to mine daily availability records for resources.

Batch Processing

Batch processing refers to a resource's tendency to accumulate entities in order to process them simultaneously, concurrently, or sequentially. Again, process data can be used to study the actual batching behavior. Wen et al. (2013) developed a control-flow discovery algorithm for processes in which simultaneous batch processing occurs. Similar to Liu and Hu (2007), they assume that events are only recorded for one of the cases in a batch. While Wen et al. (2013) focus on the control-flow, Nakatumba (2013) states that batching behavior is present when two sets of activity instances on a particular working day are separated by a time period of more than 1 hour. A more systematic approach is proposed in Martin et al. (2017). Based on a distinction between simultaneous, concurrent, and sequential batch processing, an algorithm is proposed which detects batching behavior. Moreover, metrics such as the batch size are calculated, which can be used to specify batching parameters in a simulation model (Martin et al. 2017).

Examples of Application

Literature provides examples in which real-life process data is extensively used to support the construction of simulation models. For example, Rozinat et al. (2009) consider the complaint handling process and the invoice handling process in municipalities and use process data to specify the entity arrival rate, activity duration, control-flow, gateway routing logic, and resource roles. In another example, Leyer and Moormann (2015) simulate the loan application process at a financial institution and use real-life process data to model the activity duration, control-flow, and gateway routing logic.

While the aforementioned references use process data to support multiple modeling decisions, dedicated papers focusing on a single decision also apply their proposed techniques to real-life data. For instance, Măruşter and van Beest (2009) mainly focus on the use of process data to model the simulation model's control-flow and consider three real-life processes: a process of a gas company, the fine collection process of a governmental institution, and the navigation process in an agricultural decision support system. In another example, Martin et al. (2017) retrieve insights into resources' batching behavior from a call center's and a production department's process data.

Future Directions for Research

Despite the valuable research efforts outlined above, four key directions of additional work to exploit process data for the development of a simulation model can be distinguished.

Firstly, there is room for new, improved, and dedicated algorithms which support specific modeling decisions using process data. This article only discusses a subset of the modeling tasks required to build a simulation model. For other modeling tasks, e.g., the additional ones included in Martin et al. (2016b), no clear starting points for future research could be identified in the literature. For some modeling decisions included in this article, the state of the art refers to approaches which were not directly developed with simulation in mind. For instance, the use of clustering to identify entity types results in distinct process models for each entity type, requiring an additional aggregation step to produce a single model. Dedicated algorithms can exploit the specific characteristics of process data and the specific needs of process simulation. Additionally, existing (dedicated) algorithms can also be extended to provide even more insights.
For example, the ARPRA algorithm (Martin et al. 2016c) can be extended to discover nonstationary arrival patterns.

Secondly, a common assumption among process discovery techniques is that both the start and completion of an activity are logged. Often the data does not meet these assumptions and, e.g., only the completion times are recorded. In such cases, the researcher must make additional assumptions to apply existing algorithms. Therefore, future research on start time estimation and imputation is worth pursuing. Alternatively, existing algorithms can also be extended such that they no longer require, e.g., explicit start timestamps.

Thirdly, most existing algorithms focus on a single modeling aspect, whereas they are in fact related to other modeling decisions. Future algorithms for supporting modeling tasks could therefore benefit from an integrated approach. For example, the resource schedules can be taken into account when mining the queue discipline, as a resource might start prioritizing urgent tasks at the end of the working day.

Finally, a single analysis package should be created which takes process data as an input and provides comprehensive support for a multitude of simulation modeling decisions. This involves an integration of the current algorithms with yet-to-be-developed techniques for other modeling decisions. Even though the development of such an integrated package does not require the development of new scientific knowledge, it is required to facilitate data-driven simulation in practice.

Cross-References

Automated Process Discovery
Decision Discovery in Business Processes
Queue Mining

References

Aguirre S, Parra C, Alvarado J (2013) Combination of process mining and simulation techniques for business process redesign: a methodological approach. Lect Notes Bus Info Process 162:24–43
Baier T, Mendling J, Weske M (2014) Bridging abstraction layers in process mining. Info Syst 46:123–139
Bose RPJC, van der Aalst WMP (2009) Context aware trace clustering: towards improving process mining results. In: Proceedings of the ninth SIAM international conference on data mining, pp 401–412
Bose RPJC, van der Aalst WMP (2010) Trace clustering based on conserved patterns: towards achieving better process models. Lect Notes Bus Info Process 43:170–181
Burattin A, Sperduti A, Veluscek M (2013) Business models enhancement through discovery of roles. In: Proceedings of the 2013 IEEE symposium on computational intelligence and data mining, pp 103–110
de Leoni M, Dumas M, García-Bañuelos L (2013) Discovering branching conditions from business process execution logs. Lect Notes Comput Sci 7793:114–129
de Medeiros AKA, Guzzo A, Greco G, van der Aalst WMP, Saccà D (2008) Process mining based on clustering: a quest for precision. Lect Notes Comput Sci 4928:17–29
De Weerdt J, De Backer M, Vanthienen J, Baesens B (2012) A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs. Info Syst 37(7):654–676
De Weerdt J, Vanthienen J, Baesens B, vanden Broucke SKLM (2013) Active trace clustering for improved process discovery. IEEE Trans Knowl Data Eng 25(12):2708–2720
Delias P, Doumpos M, Grigoroudis E, Manolitzas P, Matsatsinis N (2015) Supporting healthcare management decisions via robust clustering of event logs. Knowl Based Syst 84:203–213
Dickey D, Pearson C (2005) Recency effect in college student course evaluations. Pract Assess Res Eval 10(6):1–10
Dumas M, van der Aalst WMP, Ter Hofstede AH (2005) Process-aware information systems: bridging people and software through process technology. Wiley, Hoboken
Dumas M, La Rosa M, Mendling J, Reijers HA (2013) Fundamentals of business process management. Springer, Heidelberg
Ferreira DR, Alves C (2012) Discovering user communities in large event logs. Lect Notes Bus Info Process 99:123–134
Greco G, Guzzo A, Ponieri L, Sacca D (2006) Discovering expressive process models by clustering log traces. IEEE Trans Knowl Data Eng 18(8):1010–1027
Günther CW, Rozinat A, van der Aalst WMP (2010) Activity mining by global trace segmentation. Lect Notes Bus Info Process 43:128–139
Hopp WJ, Spearman ML (2011) Factory physics. Waveland Press, Long Grove
Huang Z, Lu X, Duan H (2011) Mining association rules to support resource allocation in business process management. Expert Syst Appl 38(8):9483–9490
Kelton W, Sadowski R, Zupick N (2015) Simulation with Arena. McGraw-Hill, New York
Leyer M, Moormann J (2015) Comparing concepts for shop floor control of information-processing services in a job shop setting: a case from the financial services sector. Int J Prod Res 53(4):1168–1179
Liu J, Hu J (2007) Dynamic batch processing in workflows: model and implementation. Futur Gener Comput Syst 23(3):338–347
Liu Y, Wang J, Yang Y, Sun J (2008) A semi-automatic approach for workflow staff assignment. Comput Ind 59(5):463–476
Liu Y, Zhang H, Li C, Jiao RJ (2012) Workflow simulation for operational decision support using event graph through process mining. Decis Support Syst 52(3):685–697
Ly LT, Rinderle S, Dadam P, Reichert M (2006) Mining staff assignment rules from event-based data. Lect Notes Comput Sci 3812:177–190
Mannhardt F, de Leoni M, Reijers HA, van der Aalst WMP, Toussaint J (2016) From low-level events to activities – a pattern-based approach. Lect Notes Comput Sci 9850:125–141
Martin N, Bax F, Depaire B, Caris A (2016a) Retrieving resource availability insights from event logs. In: Proceedings of the 2016 IEEE international conference on enterprise distributed object computing, pp 69–78
Martin N, Depaire B, Caris A (2016b) The use of process mining in business process simulation model construction: structuring the field. Bus Info Syst Eng 58(1):73–87
Martin N, Depaire B, Caris A (2016c) Using event logs to model interarrival times in business process simulation. Lect Notes Bus Info Process 256:255–267
Martin N, Swennen M, Depaire B, Jans M, Caris A, Vanhoof K (2017) Retrieving batch organisation of work insights from event logs. Decis Support Syst 100:119–128
Melão N, Pidd M (2003) Use of business process simulation: a survey of practitioners. J Oper Res Soc 54(1):2–10
Măruşter L, van Beest NRTP (2009) Redesigning business processes: a methodology based on simulation and process mining techniques. Knowl Inf Syst 21(3):267–297
Nakatumba J (2013) Resource-aware business process management: analysis and support. Ph.D. thesis, Eindhoven University of Technology
Nakatumba J, van der Aalst WMP (2010) Analyzing resource behavior using process mining. Lect Notes Bus Info Process 43:69–80
Nakatumba J, Westergaard M, van der Aalst WMP (2012) Generating event logs with workload-dependent speeds from simulation models. Lect Notes Bus Info Process 112:383–397
Pika A, van der Aalst WMP, Fidge CJ, ter Hofstede AHM, Wynn MT (2013) Predicting deadline transgressions using event logs. Lect Notes Bus Info Process 132:211–216
Pospisil M, Hruška T (2012) Business process simulation for predictions. In: Proceedings of the second international conference on business intelligence and technology, pp 14–18
Robinson S (2004) Simulation: the practice of model development and use. Wiley, Chichester
Rogge-Solti A, Kasneci G (2014) Temporal anomaly detection in business processes. Lect Notes Comput Sci 8659:234–249
Rozinat A, van der Aalst WMP (2006a) Decision mining in business processes. Tech. Rep. BPM Center Report BPM-06-10
Rozinat A, van der Aalst WMP (2006b) Decision mining in ProM. Lect Notes Comput Sci 4102:420–425
Rozinat A, Mans RS, Song M, van der Aalst WMP (2008) Discovering colored Petri nets from event logs. Int J Softw Tools Technol Transfer 10(1):57–74
Rozinat A, Mans RS, Song M, van der Aalst WMP (2009) Discovering simulation models. Info Syst 34(3):305–327
Schonenberg H, Jian J, Sidorova N, van der Aalst WMP (2010) Business trend analysis by simulation. Lect Notes Comput Sci 6051:515–529
Senderovich A, Weidlich M, Gal A, Mandelbaum A (2014) Mining resource scheduling protocols. Lect Notes Comput Sci 8659:200–216
Song M, van der Aalst WMP (2008) Towards comprehensive support for organizational mining. Decis Support Syst 46(1):300–317
Song M, Günther CW, van der Aalst WMP (2009) Trace clustering in process mining. Lect Notes Bus Info Process 17:109–120
Suriadi S, Wynn MT, Xu J, van der Aalst WMP, ter Hofstede AH (2017) Discovering work prioritisation patterns from event logs. Decis Support Syst 100:77–92
Szimanski F, Ralha CG, Wagner G, Ferreira DR (2013) Improving business process models with agent-based simulation and process mining. Lect Notes Bus Info Process 147:124–138
Tumay K (1996) Business process simulation. In: Proceedings of the 1996 winter simulation conference, pp 55–60
van Beest NRTP, Măruşter L (2007) A process mining approach to redesign business processes – a case study in gas industry. In: Proceedings of the 2007 international symposium on symbolic and numeric algorithms for scientific computing, pp 541–548
van der Aalst WMP (2015) Business process simulation survival guide. In: vom Brocke J, Rosemann M (eds) Handbook on business process management, vol 1. Springer, Heidelberg, pp 337–370
van der Aalst WMP (2016) Process mining: data science in action. Springer, Heidelberg
van der Aalst WMP, Nakatumba J, Rozinat A, Russell N (2010) Business process simulation. In: vom Brocke J, Rosemann M (eds) Handbook on business process management. Springer, Heidelberg, pp 313–338
Veiga GM, Ferreira DR (2010) Understanding spaghetti models with sequence clustering in ProM. Lect Notes Bus Info Process 43:92–103
Vincent S (1998) Input data analysis. In: Banks J (ed) Handbook of simulation: principles, advances, applications, and practice. Wiley, Hoboken, pp 3–30
Wen Y, Chen Z, Liu J, Chen J (2013) Mining batch processing workflow models from event logs. Concurrency Comput Pract Experience 25(13):1928–1942
Wombacher A, Iacob ME (2013) Start time and duration distribution estimation in semi-structured processes. In: Proceedings of the 28th annual ACM symposium on applied computing, pp 1403–1409
Wombacher A, Iacob M, Haitsma M (2011) Towards a performance estimate in semi-structured processes. In: Proceedings of the 2011 IEEE international conference on service-oriented computing and applications, pp 1–5
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678

Decision Discovery in Business Processes

Massimiliano de Leoni¹ and Felix Mannhardt²
¹Eindhoven University of Technology, Eindhoven, The Netherlands
²Department of Economics and Technology Management, SINTEF Technology and Society, Trondheim, Norway

Synonyms

Decision mining; Rule mining

Definitions

In the context of business process analytics, decision discovery is the problem of discovering the set of rules that are de facto used to make decisions in a business process, by analyzing event logs recording past executions of the process.

Overview

The execution of business processes typically follows decisions that determine, among the multiple alternative executions, how to carry on executions of the process. These decisions are defined over the characteristics of the single instances (e.g., a loan's amount, an applicant's or patient's age). This chapter reports on different approaches to discover these decision rules when they are unknown to process stakeholders. Most of these techniques in fact leverage machine learning. A number of successful real-life case-study applications are discussed, along with the open-source tools that were used.

Introduction

Within the realm of process mining, the main focus of automatic process discovery has traditionally been on the control-flow perspective (cf. the chapter on Automatic Process Discovery). As an example, the recent advances have led to discovery algorithms that can discover models such as that in Fig. 1. This model only shows the control flow, namely, the ordering with which activities can be executed in the process.

Event logs are typically of such a form as in Fig. 2. Automatic process discovery techniques only focus on the first two columns of an event log, the case identifier (e.g., the customer id) and the activity name, thus overlooking a lot of insightful information pertaining to other perspectives, such as the organization perspective, the decision perspective (a.k.a. the data or case perspective), and the time perspective (van der Aalst 2016).

The control-flow perspective is certainly of high importance as it can be considered the process backbone; however, many other perspectives should also be considered to ensure that the model is sufficiently accurate. This chapter will focus on the decision perspective and will not discuss the organization and time perspectives. Readers are referred to van der Aalst (2016) as an initial pointer to explore the research results regarding the latter two perspectives.

The decision perspective focuses on how the routing of process instance executions is affected by the characteristics of the specific process instance, such as the amount requested for a loan, and by the outcomes of previous execution steps, e.g., the verification result. The representation
[Figure omitted: a process model with the activities Credit Request, Verify, Simple Assessment, Advanced Assessment, Skip Assessment, Final Decision, Renegotiate Request, Inform Customer NORMAL, Inform Customer VIP, and Open the Credit Loan.]
Decision Discovery in Business Processes, Fig. 1 A model that shows only the control-flow, i.e. the ordering of execution of activities
Customer id | Activity Name | Timestamp | Amount | Resource | Verification | Interest | Decision | Applicant | VIP
HT25 | Credit Request | 1-12-2012 9:00AM | 3000 | John | - | - | - | Max | 1
HT25 | Verify | 3-12-2012 1:00PM | 3000 | Pete | OK | - | - | Max | 1
HT25 | Simple Assessment | 7-12-2012 11:00AM | 3000 | Sue | OK | 599 | NO | Max | 1
HT25 | Inform Customer VIP | 7-12-2012 3:00PM | 3000 | Mike | OK | 599 | NO | Max | 1
HX65 | Credit Request | 3-12-2012 9:00AM | 5000 | John | - | - | - | Felix | 0
HX65 | Verify | 5-12-2012 1:00PM | 5000 | Mike | NO | - | - | Felix | 0
HX65 | Skip Assessment | 5-12-2012 2:00PM | 5000 | Mike | NO | - | - | Felix | 0
HX65 | Inform Customer NORMAL | 8-12-2012 1:00PM | 5000 | Pete | NO | - | - | Felix | 0
EA49 | Credit Request | 1-12-2012 9:00AM | 6000 | John | - | - | - | Sara | 1
EA49 | Verify | 3-12-2012 1:00PM | 6000 | Pete | OK | - | - | Sara | 1
EA49 | Advanced Assessment | 5-12-2012 1:00PM | 6000 | Sue | OK | 1020 | OK | Sara | 1
EA49 | Inform Customer VIP | 8-12-2012 5:00PM | 6000 | Ellen | OK | 1020 | OK | Sara | 1
EA49 | Open the Credit Loan | 8-12-2012 6:00PM | 6000 | Ellen | OK | 1020 | OK | Sara | 1
HX12 | Credit Request | 10-12-2012 9:00AM | 6000 | John | - | - | - | Michael | 1
HX12 | Verify | 13-12-2012 1:00PM | 6000 | Pete | OK | - | - | Michael | 1
HX12 | Advanced Assessment | 15-12-2012 1:00PM | 6000 | Sue | OK | 1020 | NOK | Michael | 1
HX12 | Renegotiate Request | 16-12-2012 3:00PM | 6000 | Sue | OK | 1020 | NOK | Michael | 1
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
Decision Discovery in Business Processes, Fig. 2 An excerpt of an event log. Each row is an event referring to an activity that happened at a given timestamp. Several process attributes are written and updated by one or more events. The alternation of two color tones is used to group events referring to the same process instance
of this decision perspective on a process, in an integrated model or as separate tables, is nowadays gaining momentum. This is also testified by the recent introduction and refinement of Decision Model and Notation (DMN), a standard published by the Object Management Group to describe and model the decision perspective (DMN 2016).

The simplest representation of the decision perspective is to attach decision rules to XOR and OR splits, the decision points of a process model. The rules explaining the choices are driven by additional data associated with the process, which is generated by the previous process steps. In the remainder, this additional data is abstracted as process attributes, each of which is a name-value pair. Note that not all decision points are driven by rules: the process attributes are not able to determine which branch is going to be followed (cf. the XOR split involving Renegotiate Request). Figure 3 illustrates a possible extension of the model in Fig. 1 to incorporate the data attributes that are manipulated by the process and how they affect the routing at the decision points. The rules at decision points can alternatively be easily represented as a separate DMN table.

Historically, the discovery of the decision perspective is called Decision Mining, a name that was introduced by the seminal work of Rozinat and van der Aalst (2006). In this work, the mined decisions at the decision points are mutually exclusive: when a process instance reaches a decision point, one and exactly one branch is enabled. Rozinat et al. leverage Petri nets, but
[Figure omitted: the model of Fig. 1 annotated with the data attributes Amount, Verification, Interest, Decision, Applicant, and VIP, and with branch rules at the decision points, e.g. Ver = OK and Amount > 5000 enables Advanced Assessment; Ver = OK and Amount <= 5000 enables Simple Assessment; Ver = NOK enables Skip Assessment; VIP = TRUE/FALSE selects Inform Customer VIP/NORMAL; Dec = OK enables Open the Credit Loan. The Interest attribute satisfies: if (Amount > 5000) then 0.1*Amount < Interest < 0.15*Amount, else 0.15*Amount < Interest < 0.2*Amount.]
Decision Discovery in Business Processes, Fig. 3 A model that extends what is depicted in Fig. 1. The decision points are annotated with rules enabling the branches
the same idea can trivially be moved to BPMN and other process modeling notations. In this work, decision-mining problems are transformed into classification problems, so that techniques such as decision-tree learning can be leveraged. Decision trees are learnt from an observation instance table. For each decision point in the process, the subset of the events that refer to the activities that follow the decision point is extracted. Each observation instance consists of a dependent variable, a.k.a. response variable, and a set of independent variables, a.k.a. predictor variables, that are used for prediction. The dependent variable is the activity of the event following, and (a subset of) the event (i.e., process) attributes are used as independent variables. This chapter denotes as variables the features used in observation instances to learn a (decision-tree) classifier. Attributes are conversely associated with processes and events and indicate the data points manipulated while executing a process instance.

Decision Mining as a Classification Problem

The introduction has illustrated that the problem of decision mining can be translated into a classification problem. Several classification techniques exist; however, one needs to focus on those techniques that return human-readable decision rules in the form of, e.g., formulas. This motivates why most decision-mining techniques, such as the seminal work (Rozinat and van der Aalst 2006) and others (Batoulis et al. 2015; de Leoni and van der Aalst 2013; Bazhenova et al. 2016), aim to learn decision trees and use them to build a set of decision rules. In addition to providing human-readable rules, decision trees have the advantage of creating rules that clearly highlight the attributes whose values affect the decisions. These works typically leverage the C4.5 algorithm for decision-tree learning, which is capable of dealing with noise and missing values. In the context of decision mining, noise indicates that in a fraction of situations, the wrong (or unusual) decision is made; missing values indicate that, due to logging errors, certain assignments of values to attributes are not recorded or, also, that process participants forget to update the value of given attributes. Association rule discovery techniques might be a possibility, but they tend to return a large number of rules that do not necessarily have a discriminatory function on the decision, either.

Let us consider the decision point in Fig. 4, part of the model in Fig. 3. Clearly, the rules are initially not present, and the aim is to discover them. From the event log, every event that refers to the activities at the decision point is retained. Considering the event log in Fig. 2, this results in an observation instance table such as in Fig. 5. The column Activity Name refers to the dependent variable, and the other columns are the independent variables, categorical or numerical, that are used to predict the dependent variable. The application of decision-tree learning techniques will produce a tree similar to that in Fig. 6.

[Figure omitted: the decision point after Verify, with the branch rules Ver = OK and Amount > 5000 for Advanced Assessment, Ver = OK and Amount <= 5000 for Simple Assessment, and Ver = NOK for Skip Assessment.]
Decision Discovery in Business Processes, Fig. 4 One of the decision points. The aim is to start from the control flow and discover the rules

Decision trees classify instances by sorting them down a tree from the root to some leaf node. Each non-leaf node specifies a test of some variable x1, …, xn (in decision mining, the process attributes), and each branch descending from that node corresponds to a range of possible values for this variable. In general, a decision tree represents a disjunction of conjunctions of expressions: each path from the tree root to a leaf corresponds to an expression that is, in fact, a conjunction of variable tests. Each leaf node is
Decision Discovery in Business Processes, Fig. 5 An excerpt of the observation instances that are extracted from the event log in Fig. 2 to discover the rules at the decision point in Fig. 4
[Figure omitted: a decision tree whose root node tests Verification, with branches = OK and = NOK.]
Decision Discovery in Business Processes, Fig. 6 A possible decision tree discovered based on the observation instances of which an excerpt is in Fig. 5
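The pipeline behind Figs. 5 and 6 – turn the events at a decision point into observation instances, then learn a tree whose leaves are activities – can be sketched with a toy ID3-style learner. This is only an illustration: production approaches use C4.5, and the rows below as well as the fixed threshold of 5000 used to binarize the numeric attribute are assumptions (C4.5 chooses numeric split points itself).

```python
from collections import Counter
from math import log2

# Observation instances in the spirit of Fig. 5: "target" is the dependent
# variable (the activity taken at the decision point); the rest are process
# attributes used as independent variables. Values are illustrative.
rows = [
    {"Verification": "OK",  "Amount": 3000, "target": "Simple Assessment"},
    {"Verification": "NOK", "Amount": 5000, "target": "Skip Assessment"},
    {"Verification": "OK",  "Amount": 6000, "target": "Advanced Assessment"},
    {"Verification": "OK",  "Amount": 6000, "target": "Advanced Assessment"},
    {"Verification": "OK",  "Amount": 4000, "target": "Simple Assessment"},
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def split(rows, attr):
    """Partition rows by attribute value; numeric attributes are binarized
    at a hardcoded 5000 threshold (an assumption for this sketch)."""
    parts = {}
    for r in rows:
        v = r[attr] > 5000 if isinstance(r[attr], (int, float)) else r[attr]
        parts.setdefault(v, []).append(r)
    return parts

def learn_tree(rows, attrs):
    """Tiny ID3-style learner: pick the attribute with the highest
    information gain and recurse; a leaf is the predicted activity."""
    labels = [r["target"] for r in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    def gain(a):
        parts = split(rows, a)
        remainder = sum(len(p) / len(rows) * entropy([r["target"] for r in p])
                        for p in parts.values())
        return entropy(labels) - remainder
    best = max(attrs, key=gain)
    return {best: {v: learn_tree(p, [a for a in attrs if a != best])
                   for v, p in split(rows, best).items()}}

tree = learn_tree(rows, ["Verification", "Amount"])
print(tree)
```

Each root-to-leaf path is a conjunction of tests (e.g., Amount > 5000 false and Verification = OK predicts Simple Assessment), mirroring the rules annotated at the decision point in Fig. 4.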
associated with one of the possible values of the dependent variable, namely, with one of the process activities that follow an XOR split: if an expression e is associated with a path to a leaf node a, every input tuple for which e evaluates to true is expected to return a as output. In decision mining, this means that every time a decision point is reached and the process attributes x1, ..., xn take on values that make e evaluate to true, the process execution is expected to continue with activity a. In the example tree in Fig. 6, each leaf refers to a different activity. However, multiple leaves can be associated with the same activity a. In that case, each path leading to a would correspond to a different expression. All these expressions combined with disjunction (i.e., a logical OR operator) constitute an expression that predicts when the process continues with activity a. As discussed in van der Aalst (2016, pp. 294–296), this procedure is extensible to OR splits, where multiple branches may be activated.

Extension of the Basic Technique

This basic technique has been extended to deal with additional cases. The first extension is related to situations in which the executions are not always compliant with the control-flow model and/or the model contains invisible steps. The second extension is applicable when the rules at decision points are not mutually exclusive, such as when some information that drives the decisions is missing or the process behavior is simply not fully deterministic. These extensions are discussed in detail below.

There are other valuable extensions, which cannot be elaborated on in detail for the sake of space. In Bazhenova et al. (2016) and De Smedt et al. (2017b), the authors extend the basic algorithm to discover dependencies between process attribute updates, leveraging invariant miners. For example, the values assignable to an attribute Risk are a function of the values taken on by attributes Amount and Premium. While the basic BPMN modeling notation is only able to represent path routing decisions, BPMN can be complemented with DMN, in which data-related decisions can be represented. As discussed in Batoulis et al. (2015), DMN is especially relevant when models are of significant size. By complementing BPMN with DMN, the decisions are modeled outside the control-flow model. The paper argues that this separation of concerns would
Decision Discovery in Business Processes, Fig. 7 Two non-compliant traces for the model in Fig. 1: they break the
assumption of the basic decision-tree algorithm
increase the readability and the maintainability of these models. Here, maintainability of a model is intended as the ease of extending, adjusting, and updating a model when the internal protocols or the external rules and regulations change.

Non-compliance and Invisible Steps
The first assumption of the basic algorithm is that, when discovering the rules of a decision point, every activity is always executed at the right moment, as dictated by the control-flow model, e.g., the model in Fig. 1. However, due to logging errors and non-compliance, executions sometimes do not comply with the control-flow model. As an example, consider the trace for the customer with id TH25 in Fig. 7. This trace is clearly not compliant with the model in Fig. 1: activity Credit Request is immediately followed by Simple Assessment, i.e., activity Verify did not occur (or its execution was not logged). This means that the value of attribute Verification is missing. Several decision-tree learning algorithms, such as C4.5, are very good at dealing with missing values. However, it is necessary to detect that a value is missing due to non-compliance. In the example, the execution for the customer with id TH25 and the decision of executing Simple Assessment were made without considering the value of attribute Verification; in fact, the value of that attribute was undefined because the execution of activity Verification never took place. If cases such as this are not detected, one could mistakenly use an old value. For instance, attribute Verification is updated multiple times via multiple executions of activity Verification, which is executed once for each loan renegotiation. If one renegotiation mistakenly skipped the verification, decision mining might use the previous value, updated during the previous renegotiation loop.

The basic algorithm also suffers from the problem that one should ignore updates of attributes that result from the non-compliant execution of activities. As an example, consider the trace for the customer with id TH26 in Fig. 7: activity Verify occurs after Simple Assessment, which violates the constraints dictated by the model in Fig. 1. As a result of the execution of this activity, attribute Verification takes on value NOK, but too late: a simple assessment has already been conducted. It is clear that, in this case, activity Simple Assessment should not have occurred.

To detect missing values and illegal attribute updates, the work reported in de Leoni and van der Aalst (2013) integrates the basic technique with techniques for log-model alignment. An alignment between a recorded process execution (log trace) and a process model is a pairwise matching between activities recorded in the log and activities allowed by the model.

Table 1 illustrates the alignments of the model in Fig. 1 with the traces in Fig. 7. Activity names are abbreviated with the first letter of each name word. Abstracting from the attribute updates and ignoring the ≫ symbols, for each log trace, the log row corresponds to the log trace, and the process row represents the execution allowed by the process model that is the closest to the log trace in terms of the number of necessary changes. The yellow and green columns refer to log and model moves, respectively. For further information, readers are
referred to Adriansyah (2014) and to the chapter on Conformance Checking. The events for log moves will be ignored for decision mining. Conversely, activities related to model moves are introduced and will be considered. Attributes that are supposedly updated as a result of these activity executions take on null, a special value to indicate that the value is missing and should be treated as such by the decision-tree learning algorithm. An alternative to using alignments could be to simply remove the traces that refer to executions not compliant with the control-flow model. However, situations in which most of the traces are almost compliant and few are fully compliant would cause the rules to be mined on the basis of insufficient observation instances. If the missing events refer to activities that do not update attributes relevant for a certain decision point, there would be no reason to discard the corresponding traces when discovering the rules for that decision point.

In addition to dealing with non-compliance, de Leoni and van der Aalst (2013) allows for discovering decision rules for models such as the one in Fig. 8. In this case, the explicit activity Skip Assessment, which indicates that no assessment is performed in case of negative verification and that, conversely, the execution directly moves to the negative final decision, is not present and is therefore not recorded in the respective traces. In such a case, no activity can be explicitly identified to indicate that the assessment does not take place. To overcome this, the approach in de Leoni and van der Aalst (2013) assumes that there is a silent activity τ, as illustrated by the cloud in the figure. Using alignments, the silent activity τ can be included in the analysis. When an execution of τ is necessary (e.g., when the assessment is skipped), the alignment automatically adds a model move for τ, and as mentioned above, model moves are always considered when mining the decision rules. Of course, the τ step is always involved in model moves, since it is never recorded in the event log, being indeed silent.
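The alignment idea can be sketched as follows. For simplicity, the model is represented here as an explicit set of allowed activity sequences, whereas the cited approaches search the model's state space; the abbreviations (CR, V, SA, AA, IC) and the traces are illustrative, not taken from the actual example log.

```python
# A minimal sketch of log-model alignment (cf. Adriansyah 2014):
# synchronous moves cost 0, log-only and model-only moves cost 1.
SKIP = ">>"

def align(log_trace, model_trace):
    n, m = len(log_trace), len(model_trace)
    # dp[i][j] = minimal cost of aligning log_trace[:i] with model_trace[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = float("inf")
            if i > 0 and j > 0 and log_trace[i-1] == model_trace[j-1]:
                best = min(best, dp[i-1][j-1])      # synchronous move
            if i > 0:
                best = min(best, dp[i-1][j] + 1)    # log move
            if j > 0:
                best = min(best, dp[i][j-1] + 1)    # model move
            dp[i][j] = best
    # backtrack to recover the two rows of the alignment
    log_row, model_row = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and log_trace[i-1] == model_trace[j-1]
                and dp[i][j] == dp[i-1][j-1]):
            log_row.append(log_trace[i-1]); model_row.append(model_trace[j-1])
            i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            log_row.append(log_trace[i-1]); model_row.append(SKIP); i -= 1
        else:
            log_row.append(SKIP); model_row.append(model_trace[j-1]); j -= 1
    return dp[n][m], list(reversed(log_row)), list(reversed(model_row))

def best_alignment(log_trace, model_traces):
    # pick the model execution closest to the log trace
    return min((align(log_trace, t) for t in model_traces), key=lambda r: r[0])

model_traces = [("CR", "V", "SA", "IC"), ("CR", "V", "AA", "IC")]
cost, log_row, model_row = best_alignment(("CR", "SA", "IC"), model_traces)
print("Log:    ", " ".join(log_row))
print("Process:", " ".join(model_row))
```

For the non-compliant trace CR, SA, IC the cheapest alignment inserts a model move for the skipped Verify step, marked ≫ (here rendered as `>>`) in the log row.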
Decision Discovery in Business Processes, Table 1 Alignments of the traces in Fig. 7 wrt. the model in Fig. 1

Customer with id TH25
Log:     CR SA IC V
Process: C rR V SA IC V

Customer with id TH26
Log:     CR SA V ICN
Process: CR V SA ICN

Overlapping Rules

Existing decision-mining techniques for exclusive choices rely on the strong assumption that the rules attached to the alternative activities of an exclusive choice need to be mutually exclusive. However, business rules are often nondeterministic, and this "cannot be solved until the business rule is instantiated in a particular situation" (Rosca and Wild 2002). This ambiguity can occur due to conflicting rules or missing contextual information. For example, decisions taken by process workers may depend on contextual
Decision Discovery in Business Processes, Fig. 8 A variation of the model in Fig. 1: activity Skip Assessment is removed. Ideally, it can be replaced by a silent activity τ, whose execution is not recorded in any trace of the event log
factors, which are not encoded in the system and, thus, not available in event logs (Bose et al. 2013). In Mannhardt et al. (2016), a technique is proposed that discovers overlapping rules in those cases where the underlying observations are better characterized by such rules. The technique is able to deliberately trade the precision of mutually exclusive rules, i.e., only one alternative is possible, against fitness, i.e., overlapping rules that are less often violated. In short, as in the basic algorithm, the technique builds an initial decision tree based on observations from the event log, and for each activity a, a decision rule rule1(a) is found (possibly rule1(a) = true). Then, for each decision-tree leaf, the wrongly classified instances are used to learn a new decision tree leading to new rules, such as rule2(a). These new rules are used in disjunction with the initial rules, yielding, e.g., overlapping rules of the form rule1(a) ∨ rule2(a).

It is worth highlighting here that overlapping rules are appropriate in contexts where information is missing and/or rules are intrinsically not fully deterministic. However, in other contexts, overlapping rules are just the result of wrong decisions made by process modelers and experts. In those cases, overlapping rules should be prevented and, when present, be repaired to ensure that they are mutually exclusive. Calvanese et al. (2016) propose a technique to detect overlapping rules and fix them to ensure their mutual exclusiveness.

Decision-Aware Control-Flow Discovery

De Smedt et al. (2017a) distinguish two categories of decision-mining techniques. The approaches that fall into the first category are named decision-annotated process mining: first, a suitable control-flow model is discovered, which is later annotated with decision rules. Every approach discussed so far is within this category. A second category aims at decision-aware control-flow discovery: the control flow and the decisions are discovered together in a holistic manner.

Any approach in the first category has the disadvantage that it "imposes the initial structure on top of which the decision perspective is placed upon" (De Smedt et al. 2017a). This might cause important decisions to remain unrevealed: the control-flow structure is discovered without considering the overall logic of the decisions, and important decision points are not made explicit. Whereas Maggi et al. (2013) and Schönig et al. (2016) report on works to simultaneously discover the data and control flow for declarative models (cf. the chapter on Declarative Modelling), Mannhardt et al. (2017) report the only research work about decision-aware control-flow discovery of procedural models (i.e., BPMN-like) where the decision perspective is used to reveal paths in the control flow that can be characterized by deterministic rules over the recorded attributes. If such behavior is infrequently observed, it is often disregarded by control-flow discovery techniques as noise (cf. the chapter on Automated Process Discovery). The idea is that some paths may be executed infrequently because the corresponding conditions are rarely fulfilled. These paths and conditions are likely to be of great interest to process analysts (e.g., in the context of risks and fraud). In Mannhardt et al. (2017), classification techniques are employed to distinguish between such behavior and random noise.

Example Cases and Tool Support

Decision mining has been applied in several domains, as testified by many applications reported in the literature (Rozinat 2010; de Leoni and van der Aalst 2013; Mannhardt et al. 2016, 2017; De Smedt et al. 2017b; Bazhenova and Weske 2016; Mannhardt 2018) and several implemented tools (van Dongen et al. 2005; Mannhardt et al. 2015; Kalenkova et al. 2014; Bazhenova et al. 2017). Here, two example cases are briefly presented. Both cases are taken from a project that was conducted in a regional hospital in the Netherlands (Mannhardt 2018).

In the first case, the trajectories of patients admitted to the emergency ward of the hospital with a suspicion of sepsis (Rhodes et al. 2017) were analyzed. A normative model of the expected patient trajectory control flow was designed and, then, the overlapping decision rules method (Mannhardt et al. 2016) was applied, yielding the model in Fig. 9.

Decision Discovery in Business Processes, Fig. 9 Overlapping and nonoverlapping decision rules discovered for the trajectory of patients in a Dutch hospital. This model is an excerpt from the model in Mannhardt et al. (2016) and Mannhardt (2018)

In the second case, the hospital's billing process for medical services was analyzed. Several medical services (e.g., diagnostics, treatments) are collected in a billing package and validated and invoiced. However, in some cases, the billing package is reopened, rejected, or canceled, and further changes are needed. Figure 10 shows a process model discovered by the decision-aware control-flow discovery method.

Decision Discovery in Business Processes, Fig. 10 Process model discovered by the decision-aware process discovery method (Mannhardt et al. 2017) for a hospital billing process (Mannhardt et al. 2017)

Six infrequent control-flow paths that are associated with specific data attribute values have been included in the process model. Some of these paths should not be disregarded as noise since they are related to cases with rework. For example, the path between Release and Code NOK (➁) occurs mostly for two specific caseType values. According to a domain expert, both case types correspond to exceptional cases: one is used for intensive care and the other for cases in which the code cannot be obtained (Code NOK). Another example is the path from Code NOK to Billed (➃), which is also related to the caseType attribute as well as to the medical specialty attribute. This path is interesting since it could not be explained by the domain expert.

Conclusion

This chapter has discussed the importance of the decision perspective and illustrated several techniques that aim to explain how choices are driven by additional data associated with the process. These techniques leverage the additional information recorded in the data attributes (also denoted as the event payload) of the event log to discover process models and enhance existing ones so that the models explain the decisions that drive the process executions. It was illustrated that these choices can be represented as annotations of process models or, alternatively, the decisions can be modeled through separate DMN models that highlight the decision-related aspects.

As discussed, the application to real-life cases has illustrated that the maturity of the decision-mining approach has reached a level that certainly allows its application to production
environments. While it is certainly necessary to extend the class of decisions that can be mined, the major drawback is that the decisions are mined locally without considering the other decisions discovered at different points. This means that decision rules at different decision points can be conflicting, leading to models that, if actually enacted, would cause a certain number of executions to be unable to properly terminate. As future work, it is crucial to check whether the models enriched with decision rules are correct and to ensure to some extent that decisions are not conflicting, thus leading to deadlocks.

Cross-References

Automated Process Discovery
Conformance Checking

References

Adriansyah A (2014) Aligning observed and modeled behavior. PhD thesis, Eindhoven University of Technology. https://doi.org/10.6100/IR770080
Batoulis K, Meyer A, Bazhenova E, Decker G, Weske M (2015) Extracting decision logic from process models. In: CAiSE 2015. LNCS, vol 9097. Springer, pp 349–366. https://doi.org/10.1007/978-3-319-19069-3_22
Bazhenova E, Weske M (2016) Deriving decision models from process models by enhanced decision mining. In: BPM 2015 workshops. Springer, pp 444–457. https://doi.org/10.1007/978-3-319-42887-1_36
Bazhenova E, Bülow S, Weske M (2016) Discovering decision models from event logs. In: BIS 2016. LNBIP, vol 255. Springer, pp 237–251. https://doi.org/10.1007/978-3-319-39426-8_19
Bazhenova E, Haarmann S, Ihde S, Solti A, Weske M (2017) Discovery of fuzzy DMN decision models from event logs. In: CAiSE 2017. LNCS, vol 10253. Springer, pp 629–647. https://doi.org/10.1007/978-3-319-59536-8_39
Bose JC, Mans RS, van der Aalst WMP (2013) Wanna improve process mining results? In: CIDM 2013. IEEE, pp 127–134. https://doi.org/10.1109/cidm.2013.6597227
Calvanese D, Dumas M, Laurson Ü, Maggi FM, Montali M, Teinemaa I (2016) Semantics and analysis of DMN decision tables. In: BPM 2016. LNCS, vol 9850. Springer, pp 217–233. https://doi.org/10.1007/978-3-319-45348-4_13
de Leoni M, van der Aalst WMP (2013) Data-aware process mining: discovering decisions in processes using alignments. In: SAC 2013. ACM, pp 1454–1461. https://doi.org/10.1145/2480362.2480633
De Smedt J, vanden Broucke SKLM, Obregon J, Kim A, Jung JY, Vanthienen J (2017a) Decision mining in a broader context: an overview of the current landscape and future directions. In: BPM 2016 workshops. LNBIP, vol 281. Springer, pp 197–207. https://doi.org/10.1007/978-3-319-58457-7_15
De Smedt J, Hasic F, vanden Broucke S, Vanthienen J (2017b) Towards a holistic discovery of decisions in process-aware information systems. In: BPM 2017. LNCS, vol 10445. Springer. https://doi.org/10.1007/978-3-319-65000-5_11
DMN (2016) Decision Model and Notation (DMN) v1.1. http://www.omg.org/spec/DMN/1.1/
Kalenkova AA, de Leoni M, van der Aalst WMP (2014) Discovering, analyzing and enhancing BPMN models using ProM. In: BPM 2014 demos. CEUR workshop proceedings, vol 1295. CEUR-WS.org, p 36
Maggi FM, Dumas M, García-Bañuelos L, Montali M (2013) Discovering data-aware declarative process models from event logs. In: BPM 2013. LNCS, vol 8094. Springer, pp 81–96. https://doi.org/10.1007/978-3-642-40176-3_8
Mannhardt F (2018) Multi-perspective process mining. PhD thesis, Eindhoven University of Technology
Mannhardt F, de Leoni M, Reijers HA (2015) The multi-perspective process explorer. In: BPM 2015 demos. CEUR workshop proceedings, vol 1418. CEUR-WS.org, pp 130–134
Mannhardt F, de Leoni M, Reijers HA, van der Aalst WMP (2016) Decision mining revisited – discovering overlapping rules. In: CAiSE 2016. LNCS, vol 9694. Springer, pp 377–392. https://doi.org/10.1007/978-3-319-39696-5_23
Mannhardt F, de Leoni M, Reijers HA, van der Aalst WMP (2017) Data-driven process discovery – revealing conditional infrequent behavior from event logs. In: CAiSE 2017. LNCS, vol 10253, pp 545–560. https://doi.org/10.1007/978-3-319-59536-8_34
Rhodes A et al (2017) Surviving sepsis campaign: international guidelines for management of sepsis and septic shock: 2016. Intensive Care Med 43(3):304–377. https://doi.org/10.1007/s00134-017-4683-6
Rosca D, Wild C (2002) Towards a flexible deployment of business rules. Expert Syst Appl 23(4):385–394. https://doi.org/10.1016/s0957-4174(02)00074-x
Rozinat A (2010) Process mining: conformance and extension. PhD thesis, Eindhoven University of Technology, Eindhoven
Rozinat A, van der Aalst WMP (2006) Decision mining in ProM. In: BPM 2006. LNCS, vol 4102. Springer, pp 420–425. https://doi.org/10.1007/11841760_33
Schönig S, Di Ciccio C, Maggi FM, Mendling J (2016) Discovery of multi-perspective declarative process models. In: ICSOC 2016. LNCS, vol 9936. Springer, pp 87–103. https://doi.org/10.1007/978-3-319-46295-0_6
van der Aalst WMP (2016) Process mining – data science in action, 2nd edn. Springer. https://doi.org/10.1007/978-3-662-49851-4
van Dongen BF, de Medeiros AKA, Verbeek HMW, Weijters AJMM, van der Aalst WMP (2005) The ProM framework: a new era in process mining tool support. In: ICATPN 2005. LNCS, vol 3536. Springer, pp 444–454

Decision Mining

Decision Discovery in Business Processes

Decision Support Benchmarks

Analytics Benchmarks

Declarative Process Mining

Fabrizio Maria Maggi
University of Tartu, Tartu, Estonia

Synonyms

Constraint-based process mining; Rule-oriented process mining

Definitions

In this chapter, we introduce the techniques existing in the literature in the context of declarative process mining, i.e., process mining techniques based on declarative process modeling languages. A declarative process mining technique is any technique that, in addition to taking as input an event log, takes as input or as output a (business) process model represented in a declarative process modeling notation. Declarative process mining techniques include techniques for process discovery, conformance checking, and compliance monitoring.

Introduction

Process discovery, conformance checking, and process model enhancement are the three basic forms of process mining (van der Aalst 2016). In particular, process discovery aims at producing a process model based on example executions in an event log and without using any a priori information. Conformance checking pertains to the analysis of the process behavior as recorded in an event log with respect to an existing reference model. Process model enhancement aims at enriching and adapting existing process models with information retrieved from logs.

In recent years, several works have focused on process mining techniques based on declarative process models (Maggi et al. 2011a, 2012a; de Leoni et al. 2012a; Di Ciccio and Mecella 2013; vanden Broucke et al. 2014; Schönig et al. 2016b; Burattin et al. 2016). The dichotomy procedural versus declarative can be seen as a guideline when choosing the most suitable language to represent models in process mining algorithms: process mining techniques based on procedural languages can be used for predictable processes working in stable environments, whereas techniques based on declarative languages can be used for unpredictable, variable processes working in turbulent environments (Zugal et al. 2011; Pichler et al. 2011; Haisjackl et al. 2013; Reijers et al. 2013; Silva et al. 2014).

Indeed, a declarative approach allows the business analyst to focus on the formalization of a few relevant and relatively stable-through-time constraints that the process should satisfy, avoiding the burden of modeling many process control-flow details that are fated to change over time. Using declarative formalisms, the processes are kept under-specified so that a few rules allow for multiple execution paths. In this way, the process representation is more robust to changeable behaviors and, at the same time, remains compact. For these reasons, declarative process mining
techniques have been largely applied in the context of healthcare processes (Grando et al. 2012, 2013; Rovani et al. 2015), since these processes are characterized by a high degree of variability.

Since declarative process modeling languages are focused on ruling out forbidden behavior, these languages are very suitable for defining compliance models that are used for checking that the behavior of a system complies with certain regulations. The compliance model defines the rules related to a single instance of a process, and the expectation is that all the instances follow the model. Processes can be diagnosed for compliance violations in an a posteriori or offline manner, i.e., after process instance execution has finished (conformance checking). However, the progress of a potentially large number of process instances can also be monitored at runtime to detect or even predict compliance violations. For this, typically, terms such as compliance monitoring or online auditing are used (Ly et al. 2013, 2015).

In van der Aalst et al. (2009), Westergaard and Maggi (2011a), and Pesic et al. (2007), the authors introduce a declarative process modeling language called DECLARE with formal semantics grounded in LTL (linear temporal logic) (Pnueli 1977). A DECLARE model is a set of DECLARE constraints defined as instantiations of DECLARE templates, which are parameterized classes of rules.

The bulk of the declarative process mining approaches proposed in recent years are based on DECLARE. In the following sections, we give an overview of these approaches, grouping them into three categories, i.e., approaches for process discovery, conformance checking, and compliance monitoring. Some of these approaches have been implemented in the process mining toolset ProM (Maggi 2013).

Process Discovery

In the area of process discovery, Lamma et al. (2007a, b) and Chesani et al. (2009a) describe the usage of inductive logic programming techniques to mine models expressed as a SCIFF (Alberti et al. 2008) first-order logic theory. Bellodi et al. (2010a, b) extend this technique by weighting, in a second phase, the constraints with a probabilistic estimation. In particular, the learned constraints are translated into Markov Logic formulae that allow for a probabilistic classification of the traces. Both the techniques in Lamma et al. (2007a) and Bellodi et al. (2010a) rely on the availability of positive and negative traces of execution.

Maggi et al. (2011b) first proposed an unsupervised algorithm for the discovery of declarative process models expressed using DECLARE. This approach was improved in Maggi et al. (2012a) using a two-phase approach. The first phase is based on an Apriori algorithm used to identify frequent sets of correlated activities. A list of candidate constraints is built on the basis of the correlated activity sets. In the second phase, the constraints are checked by replaying the log on specific automata, each accepting only those traces that are compliant with one constraint. Those constraints satisfied by a percentage of traces higher than a user-defined threshold are discovered.

Other variants of the same approach are presented in Maggi et al. (2013a), Bernardi et al. (2014b), Maggi (2014), Kala et al. (2016), and Maggi et al. (2018). The technique presented in Maggi et al. (2013a) leverages a priori knowledge to guide the discovery task. In Bernardi et al. (2014b), the approach is adapted to be used in cross-organizational environments in which different organizations execute the same process in different variants. In Maggi (2014), the author extends the approach to discover metric temporal constraints, i.e., constraints taking into account the time distance between events. Finally, in Kala et al. (2016) and Maggi et al. (2018), the authors propose mechanisms to reduce the execution times of the original approach presented in Maggi et al. (2012a).

In Di Ciccio and Mecella (2012, 2013, 2015), the authors introduce MINERful, a two-step algorithm for the discovery of DECLARE constraints. The first step of the approach is the building of a knowledge base, with information about temporal statistics about the (co-)occurrence of tasks within the log. Then, the validity and the support
of constraints are computed by querying that knowledge base. The value assigned to the support is calculated by counting the activations not leading to violations of constraints. In Di Ciccio et al. (2014, 2016), the authors propose an extension of MINERful to discover target-branched DECLARE constraints, i.e., constraints in which the target parameter is replaced by a disjunction of actual tasks.

Another approach for the discovery of DECLARE models is described in Schönig et al. (2016b). The presented technique is based on the translation of DECLARE templates into SQL queries on a relational database instance, where the event log has previously been stored. The query answer assigns the free variables to those tasks that lead to the satisfaction of the constraint in the event log. The methodology has later been extended toward multi-perspective DECLARE discovery (Schönig et al. 2016a), to include data in the formulation of constraints. Beforehand, an approach for discovering data-aware constraints was proposed in Maggi et al. (2013c). Recently, approaches for taking into account activity life cycles in the discovery task have been proposed in Bernardi et al. (2014a, 2016).

Another approach for the discovery of DECLARE models was presented in vanden Broucke et al. (2014). Here, the authors present the Evolutionary Declare Miner, which implements the discovery task using a genetic algorithm.

Approaches for the online discovery of DECLARE models have been presented in Maggi et al. (2013b) and Burattin et al. (2014, 2015). Here, the models are discovered from streams of events generated at runtime during the execution of a business process. The models are adapted on the fly as soon as the behavior of the underlying process changes.

An algorithm for the discovery of DCR graphs has been proposed in Debois et al. (2017). In Ferilli (2014), the authors introduce the WoMan framework, which includes a method for discovering first-order logic constraints from event logs. It guarantees incrementality in learning and adapting the models and the ability to express triggers and conditions on the process tasks.

Finally, we remark that studies have been conducted to remove inconsistencies and redundancies from discovered declarative models (Di Ciccio et al. 2015, 2017). The proposed solutions resort to the language inclusion and the cross-product of the automata underlying the constraints. Approaches for ensuring the relevance of a set of DECLARE constraints discovered from an event log, in terms of non-vacuity with respect to the log, have been presented in Maggi et al. (2016) and Di Ciccio et al. (2018).

Conformance Checking

In recent years, an increasing number of researchers have been focusing on conformance checking with respect to declarative models. For example, in Chesani et al. (2009b), an approach for compliance checking with respect to reactive business rules is proposed. Rules, expressed using DECLARE, are mapped to abductive logic programming, and Prolog is used to perform the validation. The approach has been extended in Montali et al. (2010) by mapping constraints to LTL and evaluating them using automata. The entire work has been contextualized in the service choreography scenario.

In Burattin et al. (2012), the authors report an approach that can be used to evaluate the conformance of a log with respect to a DECLARE model. In particular, their algorithms compute, for each trace, whether a DECLARE constraint is violated or fulfilled. Using these statistics, the approach allows the user to evaluate the healthiness of the log. The approach is based on the conversion of DECLARE constraints into automata and, using so-called activation trees, it is able to identify violations and fulfillments.

The work described in de Leoni et al. (2012b, 2014) consists in converting a DECLARE model into an automaton and performing conformance checking of a log with respect to the generated automaton. The conformance checking approach is based on the concept of alignment, and as a result of the analysis, each trace is converted into the most similar trace that the model accepts. Other approaches for trace alignment
628 Declarative Process Mining
with respect to LTL specifications have been presented in De Giacomo et al. (2016, 2017). Here, automated planning is used to improve the performance of the alignment task.

In Grando et al. (2012, 2013), the authors use DECLARE to model medical guidelines and to provide semantic (i.e., ontology-based) conformance checking measures. In a recent work, reported in Borrego and Barba (2014), the data perspective for conformance checking with DECLARE is expressed in terms of conditions on global variables disconnected from the specific DECLARE constraints expressing the control flow.

In Westergaard and Maggi (2012) and Maggi and Westergaard (2014), the authors propose an approach for checking DECLARE constraints augmented with temporal conditions. In Burattin et al. (2016), a conformance checking technique for multi-perspective declarative process models is presented. In particular, in this work, the authors first describe and formalize MP-DECLARE (i.e., a multi-perspective semantics of DECLARE augmented with data and time constraints). Then, they provide an algorithmic framework to efficiently check the conformance of MP-DECLARE with respect to event logs.

One of the most well-known tools for conformance checking with declarative specifications is the LTL Checker, presented for the first time in van der Aalst et al. (2005). This tool allows the user to check the validity of a declarative specification, expressed in terms of LTL rules, on the traces of a log.

Compliance Monitoring

Several approaches have been presented for compliance monitoring of business processes with respect to a set of compliance rules. For example, MobuconLTL (Maggi et al. 2011a, c) is a compliance monitoring tool implemented as a provider of the operational support in ProM (Westergaard and Maggi 2011b; Maggi and Westergaard 2017; Maggi et al. 2012b). It takes as input a reference model expressed in the form of DECLARE rules. More generally, every business constraint that can be expressed as an LTL formula can be monitored using MobuconLTL. At runtime, a stream of events can be sent by an external workflow management system to MobuconLTL through the operational support. The provider returns rich diagnostics about the status of each rule in the reference model with respect to the traces included in the stream. This approach has been extended in De Masellis et al. (2014) to monitor data-aware DECLARE constraints and in De Giacomo et al. (2014) to cover a larger set of rules expressed using LDL (De Giacomo and Vardi 2013).

MobuconEC is a compliance monitoring framework based on a reactive version (Bragaglia et al. 2012) of the event calculus (Kowalski and Sergot 1986). Being based on the event calculus, it can easily accommodate explicit (metric and qualitative) time constraints, as well as data-enriched events and corresponding data-aware conditions. In particular, it has been used to formalize and monitor DECLARE rules augmented with quantitative time constraints, nonatomic activities, and data-aware conditions (Montali et al. 2013a, b).

SeaFlows is a compliance checking framework that addresses design and runtime checking. It aims at encoding compliance states in an easily interpretable manner to provide advanced compliance diagnosis. The core concepts described in Ly (2013) and Ly et al. (2011) were implemented within the prototype named SeaFlows Toolset. With SeaFlows, compliance rules are modeled as compliance rule graphs (CRG).

The framework described in Santos et al. (2012) is based on the supervisory control theory. Through this approach, it is possible to supervise the execution of a process in a workflow management system by "blocking" those events that would lead to a violation of a given set of constraints. This can be considered as a very sophisticated form of proactive violation management, which is applicable only when the workflow system can be (at least partially) controlled by the monitor.

ECE rules (Bragaglia et al. 2011) are a domain-independent approach that was not specifically tailored for business process
monitoring. Two key features characterize ECE rules. First, they support an imperfect (i.e., fuzzy and probabilistic) matching between expected and occurred events and hence deal with several fine-grained degrees of compliance. Second, expected events can be decorated with countermeasures to be taken in case of a violation, hence providing first-class support for compensation mechanisms.

With BPath (Sebahi 2012), Sebahi proposes an approach for querying execution traces based on XPath. BPath implements a fragment of first-order hybrid logic and enables data-aware and resource-aware constraints.

An approach based on constraint programming is provided in Gomez-Lopez et al. (2013). The main goal of Gomez-Lopez et al. (2013) is to proactively detect compliance violations and to explain the root cause of the violation.

Giblin et al. (2006) presents an approach based on timed propositional temporal logic for transforming high-level regulatory constraints into REALM constraints that can be monitored during process runtime. The paper presents architectural considerations and an implementation of the transformation on the basis of a case study.

Hallé and Villemaire (2008) provides an approach based on LTL-FO+, a linear temporal logic augmented with full first-order quantification over the data inside a trace of XML messages; the approach focuses on monitoring data-aware constraints over business processes. The data is part of the messages that are exchanged between the process activities.

MONPOLY (Basin et al. 2008, 2011) is a runtime verification framework for security policies, specified in the logic MFOTL, a variant of LTL-FO with metric time. The high expressiveness of MFOTL allows one to express sophisticated compliance rules involving advanced temporal constraints, as well as data- and resource-related conditions.

In Narendra et al. (2008), the authors address the problem of continuous compliance monitoring, i.e., they evaluate the compliance of a process execution with respect to a set of policies at runtime. In particular, the policies are checked when some specific tasks are executed, which are called control points. The authors define an optimization problem to find the right balance between the accuracy of the compliance checking and the number of control points (each control point has a verification cost).

Thullner et al. (2011) defines a framework for compliance monitoring able to detect violations and to suggest possible recovery actions after the violation has occurred. The framework is focused on the detection of different types of time constraints.

The work by Baresi et al. on Dynamo (Baresi and Guinea 2005) constitutes one of the few techniques and tools dealing with monitoring (BPEL) web services against complex rules, going beyond the analysis of quantitative KPIs for service-level agreements. (Timed) Dynamo allows one to model ECA-like rules that mix information about the location (i.e., the target web services and operations), the involved data, and the temporal constraints. Dynamo provides continuous, reactive monitoring facilities based on the messages fetched so far, with sophisticated recovery and compensation mechanisms. Furthermore, the monitoring results can be aggregated and reported to the user, providing useful insights that go beyond a yes/no answer.

In Namiri and Stojanovic (2007), the authors describe an approach that is based on the patterns proposed by Dwyer et al. (1999). In particular, the authors introduce a set of control patterns. Control patterns are triggered by events (such as the execution of a controlled activity). When being triggered, conditions associated with the control are evaluated. In order to provide data for evaluating the associated conditions, the authors introduce a semantic mirror that is filled with runtime data of a process instance. For each control, actions to be carried out if a control fails may be specified.

Conclusion

In this chapter, we have presented the existing literature in the context of declarative process mining. From this analysis, we can derive that the works dealing with process discovery are mainly focused on DECLARE and, in general,
on a limited set of rule patterns. The works on conformance checking are based on larger languages covering the entire LTL or even richer languages like LDL. The approaches developed in the context of compliance monitoring are more variegated and are based on different declarative languages with different characteristics.

Cross-References

Automated Process Discovery
Conformance Checking

References

Alberti M, Chesani F, Gavanelli M, Lamma E, Mello P, Torroni P (2008) Verifiable agent interaction in abductive logic programming: the SCIFF framework. ACM Trans Comput Log 9(4):29:1–29:43
Baresi L, Guinea S (2005) Dynamo: dynamic monitoring of WS-BPEL processes. In: ICSOC, vol 3826, pp 478–483
Basin DA, Klaedtke F, Müller S, Pfitzmann B (2008) Runtime monitoring of metric first-order temporal properties. In: FSTTCS, vol 2, pp 49–60
Basin DA, Harvan M, Klaedtke F, Zalinescu E (2011) MONPOLY: monitoring usage-control policies. In: RV, vol 7186, pp 360–364
Bellodi E, Riguzzi F, Lamma E (2010a) Probabilistic declarative process mining. In: KSEM, pp 292–303
Bellodi E, Riguzzi F, Lamma E (2010b) Probabilistic logic-based process mining. In: CILC, pp 292–303
Bernardi ML, Cimitile M, Di Francescomarino C, Maggi FM (2014a) Using discriminative rule mining to discover declarative process models with non-atomic activities. In: RuleML, pp 281–295
Bernardi ML, Cimitile M, Maggi FM (2014b) Discovering cross-organizational business rules from the cloud. In: CIDM, pp 389–396
Bernardi ML, Cimitile M, Di Francescomarino C, Maggi FM (2016) Do activity lifecycles affect the validity of a business rule in a business process? Inf Syst 62:42–59
Borrego D, Barba I (2014) Conformance checking and diagnosis for declarative business process models in data-aware scenarios. Expert Syst Appl 41(11):5340–5352
Bragaglia S, Chesani F, Mello P, Montali M, Sottara D (2011) Fuzzy conformance checking of observed behaviour with expectations. In: AI*IA 2011, vol 6934, pp 80–91
Bragaglia S, Chesani F, Mello P, Montali M, Torroni P (2012) Reactive event calculus for monitoring global computing applications. In: Artikis A, Craven R, Kesim Çiçekli NK, Sadighi B, Stathis K (eds) Logic programs, norms and action, vol 7360. Springer, Heidelberg, pp 123–146
Burattin A, Maggi FM, van der Aalst WMP, Sperduti A (2012) Techniques for a posteriori analysis of declarative processes. In: EDOC, pp 41–50
Burattin A, Cimitile M, Maggi FM (2014) Lights, camera, action! Business process movies for online process discovery. In: BPM workshops, pp 408–419
Burattin A, Cimitile M, Maggi FM, Sperduti A (2015) Online discovery of declarative process models from event streams. IEEE Trans Serv Comput 8(6):833–846
Burattin A, Maggi FM, Sperduti A (2016) Conformance checking based on multi-perspective declarative process models. Expert Syst Appl 65:194–211
Chesani F, Lamma E, Mello P, Montali M, Riguzzi F, Storari S (2009a) Exploiting inductive logic programming techniques for declarative process mining. ToPNoC 2:278–295
Chesani F, Mello P, Montali M, Riguzzi F, Sebastianis M, Storari S (2009b) Checking compliance of execution traces to business rules. In: BPM workshops, pp 134–145
De Giacomo G, Vardi MY (2013) Linear temporal logic and linear dynamic logic on finite traces. In: IJCAI, pp 854–860
De Giacomo G, De Masellis R, Grasso M, Maggi FM, Montali M (2014) Monitoring business metaconstraints based on LTL and LDL for finite traces. In: BPM, pp 1–17
De Giacomo G, Maggi FM, Marrella A, Sardiña S (2016) Computing trace alignment against declarative process models through planning. In: ICAPS, pp 367–375
De Giacomo G, Maggi FM, Marrella A, Patrizi F (2017) On the disruptive effectiveness of automated planning for LTLf-based trace alignment. In: AAAI, pp 3555–3561
De Masellis R, Maggi FM, Montali M (2014) Monitoring data-aware business constraints with finite state automata. In: ICSSP, pp 134–143
Debois S, Hildebrandt TT, Laursen PH, Ulrik KR (2017) Declarative process mining for DCR graphs. In: SAC, pp 759–764
Di Ciccio C, Mecella M (2012) Mining constraints for artful processes. In: BIS, pp 11–23
Di Ciccio C, Mecella M (2013) A two-step fast algorithm for the automated discovery of declarative workflows. In: CIDM, pp 135–142
Di Ciccio C, Mecella M (2015) On the discovery of declarative control flows for artful processes. ACM Trans Manag Inf Syst 5(4):24:1–24:37
Di Ciccio C, Maggi FM, Mendling J (2014) Discovering target-branched declare constraints. In: BPM, pp 34–50
Di Ciccio C, Maggi FM, Montali M, Mendling J (2015) Ensuring model consistency in declarative process discovery. In: BPM, Springer, pp 144–159. https://doi.org/10.1007/978-3-319-23063-4_9
Di Ciccio C, Maggi FM, Mendling J (2016) Efficient discovery of target-branched declare constraints. Inf Syst 56:258–283
Di Ciccio C, Maggi FM, Montali M, Mendling J (2017) Resolving inconsistencies and redundancies in declarative process models. Inf Syst 64:425–446
Di Ciccio C, Maggi FM, Montali M, Mendling J (2018) On the relevance of a business constraint to an event log. Inf Syst
Dwyer M, Avrunin G, Corbett J (1999) Patterns in property specifications for finite-state verification. In: ICSE, pp 411–420
de Leoni M, Maggi FM, van der Aalst WMP (2012a) Aligning event logs and declarative process models for conformance checking. In: BPM, pp 82–97
de Leoni M, Maggi FM, van der Aalst WMP (2012b) Aligning event logs and declarative process models for conformance checking. In: BPM, pp 82–97
de Leoni M, Maggi FM, van der Aalst WM (2014) An alignment-based framework to check the conformance of declarative process models and to preprocess event-log data. Inf Syst 47:258–277
Ferilli S (2014) WoMan: logic-based workflow learning and management. IEEE Trans Syst Man Cybern Syst 44(6):744–756
Giblin C, Müller S, Pfitzmann B (2006) From regulatory policies to event monitoring rules: towards model-driven compliance automation. Technical report RZ 3662
Gomez-Lopez M, Gasca R, Rinderle-Ma S (2013) Explaining the incorrect temporal events during business process monitoring by means of compliance rules and model-based diagnosis. In: EDOC workshops, pp 163–172
Grando MA, van der Aalst WMP, Mans RS (2012) Reusing a declarative specification to check the conformance of different CIGs. In: BPM workshops, pp 188–199
Grando MA, Schonenberg MH, van der Aalst WMP (2013) Semantic-based conformance checking of computer interpretable medical guidelines. In: BIOSTEC, vol 273, pp 285–300
Haisjackl C, Zugal S, Soffer P, Hadar I, Reichert M, Pinggera J, Weber B (2013) Making sense of declarative process models: common strategies and typical pitfalls. In: BPMDS, pp 2–17
Hallé S, Villemaire R (2008) Runtime monitoring of message-based workflows with data. In: EDOC, pp 63–72
Kala T, Maggi FM, Di Ciccio C, Di Francescomarino C (2016) Apriori and sequence analysis for discovering declarative process models. In: EDOC, pp 1–9
Kowalski RA, Sergot MJ (1986) A logic-based calculus of events. N Gener Comput 4(1):67–95
Lamma E, Mello P, Montali M, Riguzzi F, Storari S (2007a) Inducing declarative logic-based models from labeled traces. In: BPM, pp 344–359
Lamma E, Mello P, Riguzzi F, Storari S (2007b) Applying inductive logic programming to process mining. In: ILP, pp 132–146
Ly LT (2013) SeaFlows – a compliance checking framework for supporting the process lifecycle. PhD thesis, University of Ulm
Ly LT, Rinderle-Ma S, Knuplesch D, Dadam P (2011) Monitoring business process compliance using compliance rule graphs. In: OTM conferences (1), pp 82–99
Ly LT, Maggi FM, Montali M, Rinderle-Ma S, van der Aalst WMP (2013) A framework for the systematic comparison and evaluation of compliance monitoring approaches. In: EDOC, pp 7–16
Ly LT, Maggi FM, Montali M, Rinderle-Ma S, van der Aalst WMP (2015) Compliance monitoring in business processes: functionalities, application, and tool-support. Inf Syst 54:209–234
Maggi FM (2013) Declarative process mining with the declare component of ProM. In: BPM (Demos), vol 1021
Maggi FM (2014) Discovering metric temporal business constraints from event logs. In: BIR, pp 261–275
Maggi FM, Westergaard M (2014) Using timed automata for a priori warnings and planning for timed declarative process models. Int J Cooperative Inf Syst 23(1)
Maggi FM, Westergaard M (2017) Designing software for operational decision support through coloured Petri nets. Enterprise IS 11(5):576–596
Maggi FM, Montali M, Westergaard M, van der Aalst WMP (2011a) Monitoring business constraints with linear temporal logic: an approach based on colored automata. In: BPM 2011, vol 6896, pp 132–147
Maggi FM, Mooij AJ, van der Aalst WMP (2011b) User-guided discovery of declarative process models. In: CIDM, pp 192–199
Maggi FM, Westergaard M, Montali M, van der Aalst WMP (2011c) Runtime verification of LTL-based declarative process models. In: RV 2011, vol 7186, pp 131–146
Maggi FM, Bose RPJC, van der Aalst WMP (2012a) Efficient discovery of understandable declarative process models from event logs. In: CAiSE, pp 270–285
Maggi FM, Montali M, van der Aalst WMP (2012b) An operational decision support framework for monitoring business constraints. In: FASE, pp 146–162
Maggi FM, Bose RPJC, van der Aalst WMP (2013a) A knowledge-based integrated approach for discovering and repairing Declare maps. In: CAiSE, pp 433–448
Maggi FM, Burattin A, Cimitile M, Sperduti A (2013b) Online process discovery to detect concept drifts in LTL-based declarative process models. In: OTM, pp 94–111
Maggi FM, Dumas M, García-Bañuelos L, Montali M (2013c) Discovering data-aware declarative process models from event logs. In: BPM, pp 81–96
Maggi FM, Montali M, Di Ciccio C, Mendling J (2016) Semantical vacuity detection in declarative process mining. In: BPM, pp 158–175
Maggi FM, Di Ciccio C, Di Francescomarino C, Kala T (2018) Parallel algorithms for the automated discovery of declarative process models. Inf Syst. https://doi.org/10.1016/j.is.2017.12.002
Montali M, Pesic M, van der Aalst WMP, Chesani F, Mello P, Storari S (2010) Declarative specification and verification of service choreographies. ACM Trans Web 4(1):1–62
Montali M, Chesani F, Mello P, Maggi FM (2013a) Towards data-aware constraints in Declare. In: SAC, pp 1391–1396
Montali M, Maggi FM, Chesani F, Mello P, van der Aalst WMP (2013b) Monitoring business constraints with the event calculus. ACM Trans Intell Syst Technol 5(1):17:1–17:30
Namiri K, Stojanovic N (2007) Pattern-based design and validation of business process compliance. In: CoopIS, On the move to meaningful internet systems, Springer, pp 59–76
Narendra N, Varshney V, Nagar S, Vasa M, Bhamidipaty A (2008) Optimal control point selection for continuous business process compliance monitoring. In: IEEE SOLI, vol 2, pp 2536–2541
Pesic M, Schonenberg H, van der Aalst WMP (2007) DECLARE: full support for loosely-structured processes. In: EDOC, pp 287–300
Pichler P, Weber B, Zugal S, Pinggera J, Mendling J, Reijers HA (2011) Imperative versus declarative process modeling languages: an empirical investigation. In: BPM workshops, pp 383–394
Pnueli A (1977) The temporal logic of programs. In: Annual IEEE symposium on foundations of computer science, pp 46–57
Reijers HA, Slaats T, Stahl C (2013) Declarative modeling – an academic dream or the future for BPM? In: BPM, pp 307–322
Rovani M, Maggi FM, de Leoni M, van der Aalst WMP (2015) Declarative process mining in healthcare. Expert Syst Appl 42(23):9236–9251
Santos EAP, Francisco R, Vieira AD, FR Loures E, Busetti MA (2012) Modeling business rules for supervisory control of process-aware information systems. In: BPM workshops, vol 100, pp 447–458
Schönig S, Di Ciccio C, Maggi FM, Mendling J (2016a) Discovery of multi-perspective declarative process models. In: ICSOC, pp 87–103
Schönig S, Rogge-Solti A, Cabanillas C, Jablonski S, Mendling J (2016b) Efficient and customisable declarative process mining with SQL. In: CAiSE, pp 290–305
Sebahi S (2012) Business process compliance monitoring: a view-based approach. PhD thesis, LIRIS
Silva NC, de Oliveira CAL, Albino FALA, Lima RMF (2014) Declarative versus imperative business process languages – a controlled experiment. In: ICEIS, pp 394–401
Thullner R, Rozsnyai S, Schiefer J, Obweger H, Suntinger M (2011) Proactive business process compliance monitoring with event-based systems. In: EDOC workshops, pp 429–437
van der Aalst WMP (2016) Process mining: data science in action. Springer, Berlin/Heidelberg
van der Aalst WMP, de Beer HT, van Dongen BF (2005) Process mining and verification of properties: an approach based on temporal logic. In: CoopIS, pp 130–147
van der Aalst WMP, Pesic M, Schonenberg H (2009) Declarative workflows: balancing between flexibility and support. Comput Sci Res Dev 23(2):99–113
vanden Broucke SKLM, Vanthienen J, Baesens B (2014) Declarative process discovery with evolutionary computing. In: CEC, pp 2412–2419
Westergaard M, Maggi FM (2011a) Declare: a tool suite for declarative workflow modeling and enactment. In: BPM (Demos)
Westergaard M, Maggi FM (2011b) Modeling and verification of a protocol for operational support using coloured Petri nets. In: PETRI NETS, pp 169–188
Westergaard M, Maggi FM (2012) Looking into the future: using timed automata to provide a priori advice about timed declarative process models. In: CoopIS, pp 250–267
Zugal S, Pinggera J, Weber B (2011) The impact of testcases on the maintainability of declarative process models. In: BMMDS/EMMSAD, pp 163–177

Decomposed Process Discovery and Conformance Checking

Josep Carmona
Universitat Politècnica de Catalunya, Barcelona, Spain

Definitions

Decomposed process discovery and decomposed conformance checking are the corresponding variants of the two monolithic fundamental problems in process mining (van der Aalst 2011): automated process discovery, which considers the problem of discovering a process model from an event log (Leemans 2009), and conformance checking, which addresses the problem of analyzing the adequacy of a process model with respect to observed behavior (Munoz-Gama 2009), respectively.

The term decomposed in the two definitions mainly describes the way the two problems are tackled operationally, to face their computational complexity by splitting the initial problem into smaller problems that can be solved individually and often more efficiently.
Decomposed Process Discovery and Conformance Checking, Fig. 1 General view of decomposed process discovery. (Figure adapted from van der Aalst 2013)
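The pipeline of Fig. 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not any of the cited algorithms: a directly-follows graph stands in for the per-fragment discovery technique, the composition step is a naive union of fragment arcs, and all function names are invented for the example.

```python
from collections import defaultdict

def discover_dfg(sublog):
    """Toy 'discovery' step: build a directly-follows graph (arc -> frequency)
    from one sublog, standing in for a real discovery algorithm."""
    dfg = defaultdict(int)
    for trace in sublog:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dict(dfg)

def decomposed_discovery(log, sub_alphabets):
    """Project the log onto each sub-alphabet and discover one fragment
    model M_i per projected sublog."""
    fragments = []
    for alphabet in sub_alphabets:
        sublog = [[e for e in trace if e in alphabet] for trace in log]
        fragments.append(discover_dfg(sublog))
    return fragments

def compose(fragments):
    """Naive composition: merge the fragment models into one model M;
    shared activity labels are what glues the fragments together."""
    model = defaultdict(int)
    for fragment in fragments:
        for arc, freq in fragment.items():
            model[arc] += freq
    return dict(model)
```

Each projected sublog is smaller than the original log, which is exactly where the complexity relief of the decomposed approach comes from.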
stage or subprocess (or more generally a "fragment") of the original process.

The available techniques to do the partition are enumerated below. Then, from each representation Ri, a process model Mi is obtained through the application of an automated process discovery technique. The output of decomposed process discovery can be the set of process models or fragments derived, or, if some composition mechanism is available, a unique final model M that encompasses the unified behavior of the n derived process models.

General View of Decomposed Conformance Checking. The general view of decomposed conformance checking is shown in Fig. 2. Although both process model and event log are decomposed, the process model decomposition is applied first, and then the event log decomposition is performed so that process model fragments and corresponding sublogs match on their alphabet of activities. Then, conformance checking techniques can be applied locally on each pair of sublog and fragment generated. As conformance checking strongly relies on relating modeled and observed behavior (Munoz-Gama 2009), a conformance artifact like an alignment (Adriansyah 2014) between a sublog Li and a fragment Mi can be computed. Likewise, conformance diagnosis can already be obtained on these local problems.

The decompositions mentioned above can be applied at different levels: log decomposition and process model decomposition (van der Aalst 2013). Below, a general perspective on the two types of decomposition techniques is provided.

Log Decomposition. When the event log is not transformed to a different representation, decomposition can be applied on top of the event log. Event logs can be either split horizontally, or projected vertically, or both. Figure 3 provides an example of such operations. The horizontal selection of traces can be done in many ways, but trace clustering (Greco et al. 2006; Ferreira
Decomposed Process Discovery and Conformance Checking, Fig. 2 General view of decomposed conformance checking. (Figure adapted from van der Aalst 2013)
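In the same spirit, the scheme of Fig. 2 can be sketched as follows. The sketch makes simplifying assumptions that the cited approaches do not: each fragment is represented here by its alphabet together with the finite set of projected traces it accepts, and a trace is deemed conforming iff every fragment accepts its projection. Real techniques instead compute alignments (Adriansyah 2014) on each local problem, which also yields diagnostics rather than a yes/no answer.

```python
def project(trace, alphabet):
    """Keep only the events of the trace that belong to the fragment's alphabet."""
    return tuple(e for e in trace if e in alphabet)

def fits_fragment(trace, fragment):
    """A fragment is modeled as (alphabet, set of admissible projected traces)."""
    alphabet, allowed = fragment
    return project(trace, alphabet) in allowed

def decomposed_conformance(log, fragments):
    """A trace conforms to the overall model iff it conforms to every
    fragment; each local check only sees the fragment's small alphabet."""
    return [all(fits_fragment(trace, f) for f in fragments) for trace in log]
```

For example, with fragments over {a, b} and {x, y}, a trace is accepted exactly when both of its projections are locally admissible.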
Decomposed Process Discovery and Conformance Checking, Fig. 3 Log decomposition by either horizontal or vertical selections or a combination thereof
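The two operations illustrated in Fig. 3 amount to selecting traces (horizontal) and projecting events (vertical). A minimal sketch, where the toy log and the trace-index clusters are invented for illustration; in practice the clusters would come from a trace-clustering technique and may overlap:

```python
def vertical(log, sub_alphabets):
    """Vertical decomposition: project every trace onto each sub-alphabet A_i,
    yielding the sublogs L|A_1, ..., L|A_n."""
    return [[[e for e in trace if e in A] for trace in log] for A in sub_alphabets]

def horizontal(log, clusters):
    """Horizontal decomposition: select subsets of traces by index; clusters
    may overlap, so the same trace can end up in several sublogs."""
    return [[log[i] for i in cluster] for cluster in clusters]
```

Both operations can be combined, first selecting a subset of traces and then projecting each of them onto a sub-alphabet.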
et al. 2007; Bose and van der Aalst 2009b, c; Weerdt et al. 2013; Hompes et al. 2015) is usually applied. Notice that in the figure, sublogs arising from clusters can overlap, e.g., trace 3 belongs both to L1 and L2 in the horizontal selection of traces from L. Clearly, by decomposing a log into smaller sublogs, a linear factor on the complexity reduction can be accomplished.

The vertical projection of log traces starts by selecting a proper subset A′ of the log's event alphabet A. It then projects each trace σ in the log onto the activities in A′, i.e., removing from σ those events in A \ A′, resulting in the trace σ|A′. The vertical projection can derive subsets A1, ..., An with Ai ⊆ A, deriving the sublogs L|A1, ..., L|An. How to select those sets is the crucial decision; there are techniques based on the directly-follows relation from the α-algorithm (Carmona 2012) or hierarchical methods that learn patterns which become new labels in the alphabet (Bose and van der Aalst 2009a), among others. The vertical projection has in general a bigger impact on the complexity alleviation than the horizontal split, since the complexity of most of the techniques for discovery or conformance checking is dominated by the size of the alphabet.

In case the log has been transformed into a different representation R, other forms of decomposition are applicable. For instance, if the log is transformed into a transition system (Arnold 1994) which encompasses the behavior underlying L, then decomposition techniques like the ones presented in de San Pedro and Cortadella (2016) can be applied.

Process Model Decomposition. Process model decomposition only makes sense for conformance checking, where, apart from the event log, the process model is also an input. The intuitive idea is to break the process model into fragments and project the log vertically accordingly (see Fig. 2). This way, the problem instances tackled in decomposed conformance checking are significantly smaller than in the monolithic version.

Not every decomposition can be used in the scope of conformance checking: a valid decomposition (van der Aalst 2013) of a Petri net into fragments requires that the places and arcs
Decomposed Process Discovery and Conformance Checking, Fig. 4 Process model decomposition based on single-entry single-exit (SESE) components (Munoz-Gama et al. 2014)
of the net are partitioned among the fragments, meaning that the decomposition uniquely assigns each place to a single fragment. Transitions in the frontiers of the cuts deriving the fragments are replicated, while transitions inside one fragment are not, i.e., transitions inside a fragment cannot occur in other fragments. Figure 4 shows an example of a valid decomposition of a process model. Examples of decompositions are the ones based on passages (van der Aalst 2012) or based on single-entry single-exit (SESE) components (Munoz-Gama et al. 2014).

Key Research Findings

In general, the use of decomposition tends to make automated process discovery and conformance checking problems more tractable. However, current decomposition techniques cannot guarantee optimality in general, or even guarantee to provide the same outputs as the ones provided in the monolithic versions of the problems.

Key Research Findings in Decomposed Process Discovery. Techniques for decomposed process discovery have appeared in different forms in the last years (Solé and Carmona 2012; Carmona 2012; van der Aalst 2013; van der Aalst and Verbeek 2014; van der Aalst et al. 2015; de San Pedro and Cortadella 2016). These approaches not only focus on alleviating the computation of the process discovery task but also consider other goals. For instance, a discovery approach based on van der Aalst (2013) was used in the framework that recently won the Process Discovery Contest (Carmona et al. 2016). Decomposition-based process discovery can also be applied to derive a different modeling notation, as presented in Solé and Carmona (2012) for C-nets (van der Aalst et al. 2011). It can also focus on deriving particular Petri net
Decomposed Process Discovery and Conformance Checking 637
subclasses, like free choice, state machines, or of decomposed process discovery and confor-
marked graphs (Carmona 2012; de San Pedro mance checking. It is suitable for large problem
and Cortadella 2016). instances, e.g., in healthcare (Mans et al. 2015),
software repositories, and finances, among many
Key Research Findings in Decomposed others.
Conformance Checking. Techniques for
decomposed conformance checking have been D
proposed in the last years (van der Aalst 2013;
van der Aalst and Verbeek 2014; Munoz- Future Directions for Research
Gama et al. 2014; vanden Broucke et al.
2014; de Leoni et al. 2014; Verbeek 2017). There exist several challenges to face, so that
Apart from the aforementioned techniques for decomposed techniques can be applied in more
decomposing a process model through a valid situations. Below we provide some key chal-
decomposition (van der Aalst 2013; van der Aalst lenges that may be faced in the next years:
and Verbeek 2014; Munoz-Gama et al. 2014),
other approaches that have a different focus • Decomposition in the large: the techniques
have been presented. In vanden Broucke et al. available so far do not always guarantee a
(2014), the SESE-based decomposition approach bound in the complexity of the operations
is adapted to boost the computation so that it performed. Fundamental research needs to
can be applied online. On a different perspective be developed so that lightweight approaches
but again using SESE-based decomposition, the are proposed that guarantee this bound,
approach in de Leoni et al. (2014) shows how opening the door to real-time or online
data can also be taken into account. approaches.
Fitness, i.e., the capability of a model in re- • Recomposing the results: both in Figs. 1 and 2,
producing a trace, strongly influences the idea of one can see that the result of the decomposi-
valid decomposition: on a valid decomposition, tion may not always be a unique object, but
the fitness of the whole model with respect to several. To deploy a unique final result from
the trace can be regarded as the fitness of the these may be a challenge. For instance, in the
individual fragments (see Fig. 2). Therefore, if case of decomposed conformance checking, to
none of the fragments has a fitness problem, then compute an alignment, it is well known that
the initial model is fitting (van der Aalst 2013). there are situations where only a best effort
The opposite is not true in general: a valid decom- is possible (Verbeek and van der Aalst 2016).
position where some fragment contains a fitness Therefore, research on how to recompose the
problem does not necessarily imply that there results will be needed.
exists a real fitness problem. Also, by separating • Decomposition beyond Petri nets: most of the
the initial problem into pieces, important con- techniques available for decomposition rely
formance artifact like optimal alignments, i.e., strongly on the Petri net semantics. However,
alignments between model and log which contain other modeling notations are also used, and in
the minimal number of deviations, cannot be particular the industry adoption of Petri nets
guaranteed. There has been recent work inves- is low. Therefore, other notations like BPMN
tigating the two aforementioned issues (Verbeek will be considered as input for decomposition
and van der Aalst 2016). techniques.
• Decomposition beyond control flow: very few
approaches have been presented that consider
Examples of Application not only the control flow of the process,
but also other data attributes (de Leoni
All areas where process mining techniques have et al. 2014). However, the aforementioned
been applied are amenable for the application techniques decompose only on the basis
638 Decomposed Process Discovery and Conformance Checking
of the control flow. Techniques that apply OTM 2014 conferences – confederated international
decomposition on different perspectives conferences: CoopIS, and ODBASE 2014, Amantea,
27–31 Oct 2014, pp 3–20. https://doi.org/10.1007/978-
will enable a different perspective for 3-662-45563-0_1
decomposition. de San Pedro J, Cortadella J (2016) Mining structured
• Alternative theories for decomposed confor- Petri nets for the visualization of process behavior. In:
mance checking: currently, only the notion of Proceedings of the 31st annual ACM symposium on
applied computing, Pisa, 4–8 Apr 2016, pp 839–846.
valid decomposition has been explored as a http://doi.acm.org/10.1145/2851613.2851645
viable strategy to decompose a process model. Ferreira DR, Zacarias M, Malheiros M, Ferreira P (2007)
In the future, totally different alternatives to a Approaching process mining with sequence cluster-
valid decomposition may be studied. ing: experiments and findings. In: Proceedings of the
business process management, 5th international confer-
ence, BPM 2007, Brisbane, 24–28 Sept 2007, pp 360–
374
Cross-References Greco G, Guzzo A, Pontieri L, Saccà D (2006) Discover-
ing expressive process models by clustering log traces.
IEEE Trans Knowl Data Eng 18(8):1010–1027. https://
Automated Process Discovery doi.org/10.1109/TKDE.2006.123
Business Process Event Logs and Visualization Hompes B, Buijs J, van der Aalst W, Dixit P, Buurman
H (2015) Discovering deviating cases and process vari-
Conformance Checking
ants using trace clustering. In: Proceedings of the 27th
Trace Clustering benelux conference on artificial intelligence (BNAIC)
Leemans S (2009) Automated process discovery. In: Sakr
and Zomaya (2017), pp 288–289
Mans R, van der Aalst WMP, Vanwersch RJB (2015)
References Process mining in healthcare – evaluating and exploit-
ing operational healthcare processes. Springer briefs
Adriansyah A (2014) Aligning observed and modeled be- in business process management. Springer. https://doi.
havior. PhD thesis, Technische Universiteit Eindhoven org/10.1007/978-3-319-16071-9
Arnold A (1994) Finite transition systems. Prentice Hall, Mendling J, Dumas M (2009) Business process event logs
Paris and visualization. In: Sakr and Zomaya (2017), pp 288–
Bose RPJC, van der Aalst WMP (2009a) Abstractions 289
in process mining: a taxonomy of patterns. In: Pro- Munoz-Gama J (2009) Conformance checking. In: Sakr
ceedings of the 7th international conference business and Zomaya (2017), pp 288–289
process management, BPM 2009, Ulm, 8–10 Sept Munoz-Gama J, Carmona J, van der Aalst WMP
2009, pp 159–175. https://doi.org/10.1007/978-3-642- (2014) Single-entry single-exit decomposed confor-
03848-8_12 mance checking. Inf Syst 46:102–122. https://doi.org/
Bose RPJC, van der Aalst WMP (2009b) Context aware 10.1016/j.is.2014.04.003
trace clustering: towards improving process mining Murata T (1989) Petri nets: properties, analysis and appli-
results. In: Proceedings of the SIAM international cations. Proc IEEE 77(4):541–580
conference on data mining, SDM 2009, 30 Apr–May Sakr S, Zomaya A (eds) (2017) Encyclopedia of big data
2 2009, Sparks, pp 401–412 technologies. Springer, Berlin-Heidelberg
Bose RPJC, van der Aalst WMP (2009c) Trace clustering Solé M, Carmona J (2012) A high-level strategy for
based on conserved patterns: towards achieving better c-net discovery. In: 12th international conference on
process models. In: BPM 2009 international workshops application of concurrency to system design, ACSD
business process management workshops, Ulm, 7 Sept 2012, Hamburg, 27–29 June 2012, pp 102–111
2009. Revised papers, pp 170–181 van der Aalst WMP (2011) Process mining – discovery,
Carmona J (2012) Projection approaches to process conformance and enhancement of business processes.
mining using region-based techniques. Data Min Springer, Berlin-Heidelberg
Knowl Discov 24(1):218–246. https://doi.org/10.1007/ van der Aalst WMP (2012) Decomposing process mining
s10618-011-0226-x problems using passages. In: Application and theory
Carmona J, de Leoni M, Depaire B, Jouck T (2016) of petri nets – 33rd international conference, PETRI
Summary of the process discovery contest 2016. In: NETS 2012, Hamburg, 25–29 June 2012. Proceedings,
Proceedings of the BPM 2016 workshops. LNBIP, pp 72–91. https://doi.org/10.1007/978-3-642-31131-
vol 281, pp 7–10 4_5
de Leoni M, Munoz-Gama J, Carmona J, van der Aalst van der Aalst WMP (2013) Decomposing petri nets for
WMP (2014) Decomposing alignment-based confor- process mining: a generic approach. Distrib Paral-
mance checking of data-aware process models. In: Pro- lel Databases 31(4):471–507. https://doi.org/10.1007/
ceedings on the move to meaningful internet systems: s10619-013-7127-5
Deep Learning on Big Data 639
the previous one. Therefore, it is possible to extract informative and discriminative features even from raw data (e.g., pixels of images, biological sequences, or user preference data). This capability is one of the most important and disruptive advances introduced by the deep learning paradigm; indeed, usually no feature engineering phase and no domain experts are required to select a good set of features.

Different from traditional algorithms based on shallow hierarchies, which fail to explore and learn highly complex patterns, deep learning-based techniques can intrinsically exploit the large amounts of data featuring big data (Volume). Moreover, as deep learning-based approaches can extract data abstractions and representations at different levels, they are also suitable to analyze raw data provided in different formats and by different types of source (Variety). Finally, deep architectures can be used within infrastructures for big data storage and analysis (e.g., Hadoop and Spark) and can exploit GPUs to parallelize the computation and to reduce learning times. Hadoop, in particular, is widely used in industry for the development of big data applications such as spam filtering, click-stream analysis, and social recommendation. This makes it possible to handle the speed at which new data are being produced in these big data scenarios (Velocity).

Deep learning can be used to tackle different types of learning tasks. The most common machine learning task for DNNs is supervised learning. In this case, DNNs allow learning discriminative patterns for classification purposes, when labeled data are available in direct or indirect forms. The best-known DNN architectures for supervised learning are Convolutional Neural Networks and Recurrent Neural Networks. Deep neural networks for unsupervised or generative learning are used when no information about class labels is available; they allow extracting high-order correlations from data for pattern analysis or summarization purposes. Several architectures have been developed to address these learning tasks; in particular, Autoencoders, (Deep/Restricted) Boltzmann Machines, and Deep Belief Networks are largely recognized as effective techniques. Finally, semi-supervised learning permits handling the case where only a small subset of the training data is labeled (e.g., using the unlabeled data to define the cluster boundaries and a small amount of labeled data to label the clusters). Reinforcement learning, in turn, can be used to train the weights so that, given the state of the current environment, DNNs can find which action the agent should take next in order to maximize expected rewards.

In this chapter, first we introduce some basic concepts required to understand how DNNs work. Then, we focus on the most widespread supervised and unsupervised architectures exploited in the literature for handling big data. Finally, we discuss some interesting open issues and future research directions.

Deep Neural Networks

ANNs are computational models that try to replicate the neural structure of the brain. In a nutshell, the brain learns and stores information as patterns. These patterns can be highly complex and allow us to perform many tasks, such as recognizing individuals from images of their faces, also considering different angles of view. This process of storing information as patterns, applying those patterns, and making inferences represents the approach behind ANNs.

The artificial neuron is the main building block of a neural network and encompasses some interesting capabilities. Its behavior can be thought of as composed of two steps. First, the linear combination of the inputs is computed (a weight, usually limited to the range 0–1, is associated with each value of the input vector). Note that the summation function often takes an extra input value b with a weight of 1, which represents the threshold or bias of the neuron.

The sum-of-products value is provided as input to the second step. Here, the activation function is executed to compute the final output of the neuron. The activation function limits the output to the range [0, 1] or alternately [−1, 1].

An artificial neural network is essentially a combination of neurons stacked in layers. The most simple structure for an ANN includes two levels: the neurons placed in the first
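The two-step behavior of the artificial neuron described above, a weighted sum with a bias term followed by an activation function, can be sketched as follows. The input values and weights are illustrative, and the sigmoid is chosen here as one example of an activation that limits the output to [0, 1]:

```python
import math

def neuron(inputs, weights, bias):
    """Step 1: linear combination of the inputs (the bias b enters with weight 1).
    Step 2: a sigmoid activation squashes the sum into the range [0, 1]."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))

out = neuron([0.5, 0.2], [0.4, 0.9], bias=-0.3)
print(0.0 <= out <= 1.0)  # True: the sigmoid keeps the output in [0, 1]
```

Replacing the sigmoid with tanh would instead limit the output to [−1, 1], matching the alternative range mentioned in the text.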
architecture can be designed as a sequence of layers, where each layer transforms one volume of activations into another by using a differentiable function. In the literature, three main types of layer are used to learn CNNs: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer. In Fig. 1, a simple CNN architecture that combines these layers is shown.

On the one hand, convolutional layers allow discovering local conjunctions of features from the previous layer; on the other hand, pooling layers combine semantically similar features into a single feature. It is common to insert a pooling layer in between successive convolutional layers. In fact, it permits reducing the dimension of the representation and the number of parameters in the network, and consequently also controls overfitting.

Recurrent Neural Networks and Long Short-Term Memory Networks

Many typical DNN architectures have no memory, and the output for a given input is always the same, independently of the sequence of inputs previously given to the network. On the contrary, RNNs, of which LSTMs are a well-known variant, keep an internal memory that lets long-term dependencies influence the output. In a nutshell, in these networks, some intermediate operations generate values that are stored internally and used as inputs for other operations. RNNs receive as input two values, representing the current state and the recent past of the input, which are combined to determine how they respond to new data.

These networks assume that the sequence of the input states is an important piece of information; therefore, the added memory allows carrying out tasks that feedforward networks cannot perform. In particular, this sequential information is kept in the hidden state of the RNNs. Basically, RNNs can be defined as feedforward neural networks featuring edges that join subsequent time steps, including the concept of time within the model. Specifically, like feedforward neural networks, these learning architectures do not exhibit cycles among the standard edges; however, those that connect adjacent time steps (named recurrent edges) can have cycles, including self-connections.

At time t, the nodes having recurrent edges receive the input value x(t) from the current data point and also the value h(t−1) from the hidden node, which keeps the previous state of the network. The output ŷ(t) at each time t is computed by exploiting as input the hidden node values h(t) at time t. Input x(t−1) at time t−1 can influence ŷ(t) at time t and later by means of the recurrent connections. Figure 2 shows the above-described scheme.

LSTM improves RNN by introducing the concept of a memory cell, a computation unit that
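The recurrence just described, where h(t) is computed from x(t) and h(t−1) and the output ŷ(t) is computed from h(t), can be sketched with NumPy. The matrix shapes, the tanh activation, and the linear output layer are illustrative assumptions, not details given in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2

# Illustrative parameters: W maps the current input, U the previous
# hidden state (the recurrent edge), and V the hidden state to the output.
W = rng.normal(size=(n_hid, n_in))
U = rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))

def rnn_step(x_t, h_prev):
    """One time step: h(t) = tanh(W x(t) + U h(t-1)), y(t) = V h(t)."""
    h_t = np.tanh(W @ x_t + U @ h_prev)
    y_t = V @ h_t
    return h_t, y_t

h = np.zeros(n_hid)
for x in [np.ones(n_in), np.zeros(n_in)]:  # a toy input sequence
    h, y = rnn_step(x, h)
print(y.shape)  # (2,)
```

Because h is carried from step to step, the second output depends on the first input as well, which is exactly the long-term influence that memoryless feedforward networks lack.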
Deep Learning on Big Data, Fig. 4 Autoencoder architecture. (a) Simple autoencoder. (b) Deep autoencoder
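The encoder–decoder structure of Fig. 4 can be illustrated with a minimal NumPy sketch, using the sigmoid activation described in the text; the layer sizes and random weights are illustrative assumptions rather than a trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 8, 3  # the hidden layer is smaller than the input

W_enc = rng.normal(size=(n_hidden, n_in))
W_dec = rng.normal(size=(n_in, n_hidden))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x):
    """Encoder: compress the input into a smaller set of high-level features.
    Decoder: map those features back to the original attribute space."""
    h = sigmoid(W_enc @ x)      # encoding
    x_rec = sigmoid(W_dec @ h)  # decoding (reconstruction)
    return h, x_rec

h, x_rec = autoencode(rng.normal(size=n_in))
print(h.shape, x_rec.shape)  # (3,) (8,)
```

Training would adjust W_enc and W_dec to minimize the reconstruction error between x and x_rec; stacking several such hidden layers yields the deep autoencoder of Fig. 4b.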
number of its hidden layers is greater than one (Fig. 4b).

Specifically, an autoencoder can be conceptually thought of as composed of two subnets, respectively named encoder and decoder. The former allows learning a mapping between the input layer and the hidden layers (encoding), and the second a mapping between the features extracted by the encoder and the output layer (decoding). The encoder receives as
input a set of features and extracts a smaller set of new (high-level) features (by using the sigmoid as the activation function for each neuron). Hence, the decoder converts these features back to the original input attributes.

An interesting variant of this architecture is the (Stacked) Denoising Autoencoder, which shares some properties with restricted Boltzmann machines and can be used for generative purposes (Vincent et al. 2008). In a nutshell, it tries to minimize the reconstruction error due to stochastic noise applied to the input. In this way, it can be used as a pre-training strategy for learning DNNs: the weights learned can be used as initial weights for the supervised training of the resulting ANN.

Deep Boltzmann Machines (DBMs) are widely used to analyze large volumes of data in different contexts, such as Handwritten Digits Recognition, Collaborative Filtering (Salakhutdinov et al. 2007), Phone Recognition (Dahl et al. 2010), Transportation Network Congestion Monitoring and Prediction (Ma et al. 2015), and Facial Landmark Tracking (Wu et al. 2013).

Examples of Application
Deep Learning on Big Data, Fig. 5 Boltzmann machine architectures. (a) Boltzmann machine. (b) Restricted
Boltzmann machine
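The bipartite structure of the restricted Boltzmann machine in Fig. 5b, with connections only between the visible layer x and the hidden layer h, can be sketched as follows; the weights, layer sizes, and the single upward inference pass are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_visible, n_hidden = 6, 4

# "Restricted": a single weight matrix between the two layers,
# with no visible-visible or hidden-hidden connections.
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b_h = np.zeros(n_hidden)  # hidden biases

def hidden_probabilities(v):
    """P(h_j = 1 | v) = sigmoid(W v + b_h): given the visible layer,
    the hidden units of an RBM are conditionally independent."""
    return 1.0 / (1.0 + np.exp(-(W @ v + b_h)))

v = rng.integers(0, 2, size=n_visible).astype(float)  # a binary visible vector
p_h = hidden_probabilities(v)
print(p_h.shape)  # (4,)
```

It is this conditional independence, absent in the fully connected machine of Fig. 5a, that makes inference and stacking (as in deep belief networks) tractable.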
to extract low-dimensional semantic vectors for search engine queries of web documents.

Discriminative Tasks: Deep learning can be used for discriminative tasks. It extracts nonlinear features from data; subsequently, linear models can be used to find a discriminative function. This property is really useful for applications such as object recognition and image comprehension in big data. For example, in Jia et al. (2016), the authors used different types of deep neural networks for image recognition problems such as gender recognition and face recognition.

Transfer Learning and Domain Adaptation: In machine learning, transfer learning or inductive transfer is defined as a process which utilizes the knowledge obtained while solving (learning) one problem and applies it to another, different but related, problem (Torrey and Shavlik 2009). The properties of high velocity and variety of big data provide a good opportunity to apply transfer learning and domain adaptation. Deep learning is a good tool to identify similar factors for different data-driven forms and for different problem domains (Bengio 2012). In Shin et al. (2016), the authors evaluated transfer learning by using convolutional neural networks trained over ImageNet on two different diagnosis applications: thoracoabdominal lymph node detection and interstitial lung disease classification.

Anomaly and Fraud Detection: Detecting anomalies and frauds is a challenging problem, especially in the case of big data. In applications such as electronic payments, network flows, and social networks, frauds and anomalies should be detected while coping with a large amount of data. Deep learning can be considered an appropriate machine learning tool to detect anomalies and frauds, especially in the case of high dimensionality. For example, the authors of Paula et al. (2016) proposed an unsupervised deep learning method to detect frauds in exports. Moreover, Erfani et al. (2016) suggested using deep belief networks to extract robust features from high-dimensional data and then using a support vector machine (SVM) to build a one-class model to detect anomalies.

Traffic Flow Prediction: Predicting the traffic flow in an accurate and timely fashion is very important for intelligent transportation systems, and deep learning is appropriate for this task. As an example, in Lv et al. (2015), the authors used a stacked autoencoder to learn traffic flow features and trained it with a greedy layer-wise method.

The applications of deep learning in big data are not limited to these topics; just to cite a few examples, remote sensing (Kussul et al. 2017) and fault detection (Ren et al. 2017) are two interesting fields in which these techniques were successfully applied to big data problems.

Future Directions for Research

Big data and deep learning are strictly connected as interesting research themes. Indeed, big data issues can benefit from the application of deep learning, and in addition, some hard problems
and applications can be solved by deep learning techniques by using big data developments and tools.

Some of the most relevant issues to be addressed by using deep learning techniques exploiting the potentialities of big data techniques include:

…and to deal with the entire automatic process of analyzing these data, from the analysis and the extraction of information from audio and video files, to the recognition of objects and faces, to the transcription of audio tracks, and up to recognizing fake news in real time.
Jia S, Lansdall-Welfare T, Cristianini N (2016) Gender classification by deep learning on millions of weakly labelled images. In: 2016 IEEE 16th international conference on data mining workshops (ICDMW). IEEE, pp 462–467
Kussul N, Lavreniuk M, Skakun S, Shelestov A (2017) Deep learning classification of land cover and crop types using remote sensing data. IEEE Geosci Remote Sens Lett 14(5):778–782
Le Cun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Lebret R, Collobert R (2015) "The sum of its parts": joint learning of word and phrase representations with autoencoders. CoRR abs/1506.05703
Lv Y, Duan Y, Kang W, Li Z, Wang FY (2015) Traffic flow prediction with big data: a deep learning approach. IEEE Trans Intell Transp Syst 16(2):865–873
Ma X, Yu H, Wang Y, Wang Y (2015) Large-scale transportation network congestion evolution prediction using deep learning theory. https://doi.org/10.1371/journal.pone.0119044
Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E (2015) Deep learning applications and challenges in big data analytics. J Big Data 2(1):1
Paula EL, Ladeira M, Carvalho RN, Marzagão T (2016) Deep learning anomaly detection as support fraud investigation in Brazilian exports and anti-money laundering. In: 2016 15th IEEE international conference on machine learning and applications (ICMLA). IEEE, pp 954–960
Ren H, Chai Y, Qu J, Ye X, Tang Q (2017) A novel adaptive fault detection methodology for complex system using deep belief networks and multiple models: a case study on cryogenic propellant loading system. Neurocomputing 275:2111–2125
Salakhutdinov R, Mnih A, Hinton G (2007) Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th international conference on machine learning, ICML'07. ACM, New York, pp 791–798. https://doi.org/10.1145/1273496.1273596
Shen Y, He X, Gao J, Deng L, Mesnil G (2014) Learning semantic representations using convolutional neural networks for web search. In: Proceedings of the 23rd international conference on world wide web. ACM, pp 373–374
Shin HC, Roth HR, Gao M, Lu L, Xu Z, Nogues I, Yao J, Mollura D, Summers RM (2016) Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging 35(5):1285–1298
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Proceedings of the 27th international conference on neural information processing systems – vol 2, NIPS'14. MIT Press, Cambridge, MA, pp 3104–3112. http://dl.acm.org/citation.cfm?id=2969033.2969173
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Computer vision and pattern recognition (CVPR)
Torrey L, Shavlik J (2009) Transfer learning. In: Soria E, Martin J, Magdalena R, Martinez M, Serrano A (eds) Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, vol 1. IGI Global, Hershey, p 242
Vincent P, Larochelle H, Bengio Y, Manzagol P (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning, ICML'08, pp 1096–1103
Wu Y, Wang Z, Ji Q (2013) Facial feature tracking under varying facial expressions and face poses based on restricted Boltzmann machines. In: 2013 IEEE conference on computer vision and pattern recognition, Portland, 23–28 June 2013, pp 3452–3459. https://doi.org/10.1109/CVPR.2013.443
Wu X, Zhu X, Wu GQ, Ding W (2014) Data mining with big data. IEEE Trans Knowl Data Eng 26(1):97–107
Zhang S, Yang M, Wang X, Lin Y, Tian Q (2013) Semantic-aware co-indexing for image retrieval. In: Proceedings of the IEEE international conference on computer vision, pp 1673–1680

Definition of Data Streams

Alessandro Margara¹ and Tilmann Rabl²
¹Politecnico di Milano, Milano, Italy
²Database Systems and Information Management Group, Technische Universität Berlin, Berlin, Germany

Synonyms

Event streams; Information flows

Definitions

A data stream is a countably infinite sequence of elements. Different models of data streams exist that take different approaches with respect to the mutability of the stream and to the structure of stream elements. Stream processing refers to analyzing data streams on-the-fly to produce new results as new input data becomes available. Time is a central concept in stream processing: in almost all models of streams, each stream element is associated with one or more timestamps from a given time domain that might indicate, for instance, when the element was generated, the validity of its content, or when it became available for processing.

Overview

A data stream is a countably infinite sequence of elements and is used to represent data elements that are made available over time. Examples are readings from sensors in an environmental monitoring application, stock quotes in financial applications, or network data in computer monitoring applications.

A stream-based application analyzes elements from streams as they become available, to produce new results in a timely manner and enable fast reactions if needed (Babcock et al. 2002; Cugola and Margara 2012).

In the last decade, several technologies have emerged to address the processing of high-volume, real-time data without requiring custom code (Stonebraker et al. 2005). They are commonly referred to as stream processing systems, and they are the topic of this section on "Big Stream Processing."

In the remainder of this entry, we overview the main models of data streams and stream processing systems, and we briefly introduce some central concepts in stream processing, namely, time and windows. Furthermore, we summarize the main requirements of stream processing.

Stream Models

Streams can be structured or unstructured. In a structured stream, elements follow a certain format or schema that allows for additional modeling of the stream. In contrast, unstructured streams can have arbitrary contents, often resulting from combining streams from many sources (these are also referred to as event showers or event clouds; Doblander et al. 2014).

There are three major models for structured streams, which differ in how the elements of the stream are related to and influence each other: the turnstile model, the cash register model, and the time series model. The most general model is the turnstile model. In this model, the stream is modeled as a vector of elements, and each element s_i in the stream is an update (increment or decrement) to an element of the underlying vector. The size of the vector in this model is the domain of the stream elements. This model is also the one typically used in traditional database systems, where there are inserts, deletes, and updates to the database. In the cash register model, elements in the stream are only additions to the underlying vector, and elements can never leave the vector again. This is similar to databases recording the history of relations. Finally, the time series model treats every element in the stream s_i as a new independent vector entry. As a result, the underlying model is a constantly increasing and generally unbounded vector. Because each element can be processed individually in this model, it is frequently used in current stream processing engines.

Time

Time is a central concept in many stream processing applications, either because they are concerned with updating their view of the world by taking into account recent data received from streams or because they aim to detect temporal trends in the input streams.

For this reason, in most models of streams and stream processing, data elements are associated with some timestamp from a given time domain. Common semantics of time include event time, which is the time when the element was produced, and processing time, which is the time when the stream processing system starts processing the element (Akidau et al. 2015).

Different time semantics introduce different problems in terms of order and synchronization. Intuitively, while event time and processing time should ideally be equal, in practice unsynchronized clocks at the producers as well as variable communication and processing delays produce a skew between event time and processing time which is not only non-zero but also highly variable (Akidau 2015). On the one hand, processing elements in event time order is necessary in
many application scenarios. On the other hand, this requires waiting for out-of-order elements and reordering them before processing. This is a nontrivial problem, which is analyzed in detail in the entry related to "Time management."

Windows

Windows are one of the core building blocks of virtually all stream processing systems. They define bounded portions of elements over an unbounded stream. They are used to perform computations that would be impossible (nonterminating) in the case of unbounded data, such as computing the average value of all the elements in a stream of numbers.

The most common types of windows are count-based and time-based windows. The former define their size in terms of the number of elements that they include, while the latter define their size in terms of a time frame and include all the elements with a timestamp included in that time frame.

In both cases, we distinguish between sliding windows, which continuously advance with the arrival of new elements, thus always capturing new elements, and tumbling windows, which can accumulate multiple elements before moving (Botan et al. 2010).

More recently, new types of windows have been defined to better capture the needs of applications. They include session and data-defined windows that are variable in size and define their boundaries based on the data elements: for instance, in a software monitoring application, a window can include all and only the elements that refer to a session opened by a specific user of that software (Akidau et al. 2015).

Stream Processing

Several types of stream processing systems exist that are specialized on different types of computations (Cugola and Margara 2012).

Data stream management systems are designed to update the answers of continuous queries as new data becomes available. They typically adopt declarative abstractions similar to traditional database query languages, which they extend to cope with the unbounded nature of the input data (Arasu et al. 2006).

Complex event processing systems, instead, aim to detect (temporal) patterns over the input stream of elements (Etzion and Niblett 2010; Luckham 2001). These systems are discussed in the entry on "Event recognition."

Modern Big Data stream processing systems, such as Flink (Carbone et al. 2015), provide general operators to transform input streams into output streams, thus offering the possibility to implement heterogeneous streaming applications by integrating these operators. These systems are designed to scale to multiple processing cores and computing nodes and persist intermediate results for fault tolerance. Flink and other systems are discussed in several entries within this section of the encyclopedia. The "Introduction to stream processing" entry overviews the main classes of algorithms for stream processing.

General Requirements of Stream Processing

Data streams could theoretically be stored and processed with traditional database systems or other analytic solutions. However, real-time processing of streams introduces specific requirements that stream processing systems need to satisfy (Stonebraker et al. 2005). Although the following requirements are not necessarily fulfilled by all current stream processing systems, they are frequently necessary in stream processing applications.

In order to do advanced analytics on streams, which involve more than a single operation, a stream processing system needs to integrate (pipeline) basic operations with each other. This can be done by (micro-)batching the stream, that is, splitting it into small chunks such that each operator processes a chunk at a time, or by true streaming, where each element is processed individually.

Because streams are often generated in a distributed fashion, for example, in sensor networks, a streaming system must handle small inconsistencies in the stream, such as out-of-order data, missing values, or delayed values. This is re-
enrich with constructs such as windows to deal lated to the topic of time management, discussed
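To make the window types described above concrete, here is a minimal Python sketch (illustrative only, not from the entry; the function names are invented) of a count-based tumbling window and a time-based sliding window, each computing averages over a stream of numbers:

```python
from collections import deque

def tumbling_count_avg(stream, size):
    """Count-based tumbling window: emit the average of every
    `size` consecutive elements."""
    buf, out = [], []
    for x in stream:
        buf.append(x)
        if len(buf) == size:          # window is full: fire and reset
            out.append(sum(buf) / size)
            buf = []
    return out

def sliding_time_avg(stream, width):
    """Time-based sliding window: for each (timestamp, value) element,
    emit the average of all values whose timestamp lies within the
    last `width` time units."""
    win, out = deque(), []
    for ts, v in stream:
        win.append((ts, v))
        while win and win[0][0] <= ts - width:   # evict expired elements
            win.popleft()
        out.append(sum(v for _, v in win) / len(win))
    return out

# A count-based tumbling window of size 3 over six numbers:
print(tumbling_count_avg([1, 2, 3, 4, 5, 6], 3))        # [2.0, 5.0]
# A 10-time-unit sliding window over timestamped values:
print(sliding_time_avg([(0, 1), (5, 3), (12, 5)], 10))  # [1.0, 2.0, 4.0]
```

Real systems additionally handle out-of-order timestamps, which this sketch assumes away; that is precisely the time-management problem discussed above.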
Cugola G, Margara A (2012) Processing flows of information: from data stream to complex event processing. ACM Comput Surv 44(3):15:1–15:62. https://doi.org/10.1145/2187671.2187677
Doblander C, Rabl T, Jacobsen HA (2014) Processing big events with showers and streams. In: Rabl T, Poess M, Baru C, Jacobsen HA (eds) Specifying big data benchmarks. Springer, Berlin/Heidelberg, pp 60–71
Etzion O, Niblett P (2010) Event processing in action. Manning Publications, Greenwich
Luckham DC (2001) The power of events: an introduction to complex event processing in distributed enterprise systems. Addison-Wesley, Boston
Marz N, Warren J (2015) Big data: principles and best practices of scalable realtime data systems. Manning Publications, Greenwich
Stonebraker M, Çetintemel U, Zdonik S (2005) The 8 requirements of real-time stream processing. SIGMOD Rec 34(4):42–47. https://doi.org/10.1145/1107499.1107504

Degrees of Separation and Diameter in Large Graphs

The eccentricity of a node u is ecc(u) = max_{v∈V} d(u, v), which measures in how many hops u can reach any other node in the graph. Hence, the diameter D (resp. the radius R) of G is defined as the maximum (resp. minimum) eccentricity among all the nodes, i.e., D = max_{u∈V} ecc(u) (resp. R = min_{u∈V} ecc(u)).

Given a node u ∈ V and an h ∈ [D] (where, for any positive integer x, [x] denotes the set {1, 2, …, x}), we call B_h(u) the set {v : v ∈ V, d(u, v) ≤ h} and, similarly, N_h(u) the set {v : v ∈ V, d(u, v) = h}, which are, respectively, the sets of nodes reachable from u with at most (resp. exactly) h edges. We denote N_1(u) also as N(u).

Moreover, we call B_h (resp. N_h) the fraction of pairs of nodes x, y ∈ V such that d(x, y) ≤ h (resp. d(x, y) = h). Clearly, for h > 0, we have N_h = B_h − B_{h−1}, and the average distance of G is defined as (Σ_{u,v∈V} d(u, v))/n² = Σ_h h · N_h.
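As an illustration of these definitions, the following Python sketch (illustrative code, not from the entry; the function name is invented) computes all eccentricities, and from them the diameter and radius, by running one BFS per node — the exact APSP-by-BFS baseline:

```python
from collections import deque

def eccentricities(adj):
    """Exact eccentricities via one BFS per node, i.e., the O(n*m)
    all-pairs baseline; adj maps each node to its neighbor list."""
    ecc = {}
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:       # first visit = shortest distance
                    dist[v] = dist[u] + 1
                    q.append(v)
        ecc[s] = max(dist.values())
    return ecc

# A path 0-1-2-3: the endpoints have eccentricity 3, the middle nodes 2,
# so the diameter is 3 and the radius is 2.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
ecc = eccentricities(path)
diameter = max(ecc.values())
radius = min(ecc.values())
print(diameter, radius)   # 3 2
```

This is exactly the computation that becomes prohibitive on huge graphs, motivating the sampling and sketching methods discussed in the entry.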
…of huge real-world complex networks, which is in practice extremely efficient.

Lower bounds and techniques. As the distance distribution, radius, and diameter can be computed using any algorithm computing all-pairs shortest paths (in short, APSP), their total time cost can be O(nm), for instance, computing n breadth-first searches (in short, BFSs), one for each node. This cost is clearly prohibitive in real-world huge networks, but it has been proved that, unfortunately, this worst-case time bound seems hard to improve unless the so-called strong exponential time hypothesis (SETH) is false (Impagliazzo et al. 2001). Indeed, under this hypothesis, even deciding whether a graph has diameter 2 or 3 requires Ω(n²) time (Roditty and Williams 2013).

For this reason, heuristics or approximation algorithms are often used. In the case of the distance distribution, approximation algorithms have been designed to approximate N_h with bounded absolute error (Crescenzi et al. 2011) or to approximate B_h with bounded relative error (Cohen 1994, 2015; Boldi et al. 2011; Flajolet and Martin 1985). These approximations turn out to be approximations also for the average distance (Boldi et al. 2011; Backstrom et al. 2012).

Approximation algorithms have also been proposed in the case of diameter computation (Roditty and Williams 2013; Abboud et al. 2016; Cairo et al. 2016). However, by the reduction provided by Roditty and Williams (2013), we have that unless SETH fails, Ω(n²) time is required to get a (3/2 − ε)-approximation algorithm for computing the diameter even in the case of sparse graphs.

In the case of the radius and diameter of real-world networks, some heuristics have been shown to work surprisingly well (Magnien et al. 2009; Borassi et al. 2015; Akiba et al. 2015). Some of them are designed to approximate the diameter with a constant number of BFSs, providing lower (or upper) bounds. Even if their approximation ratio can be asymptotically 2 (Crescenzi et al. 2013), they often turn out to be tight in practice (Magnien et al. 2009; Crescenzi et al. 2010). Other heuristics have been designed to compute exactly the diameter. Even though, consistently with the time lower bound provided by Roditty and Williams (2013), these algorithms may be O(nm), they turn out to be linear in practice in real-world graphs.

In the following we will discuss the approximation methods for the distance distribution, and then we will focus on the exact computation of the diameter.

Key Research Findings

On Approximating the Distance Distribution

Classical Sampling
The idea of classical sampling is to select a random set of vertices U and to approximate the distribution of distances of the whole graph with the distribution of distances starting from U (Crescenzi et al. 2011). In other words, given a random set U ⊆ V, for any h, with h ∈ [D], classical sampling approximates N_h with N_h(U), where N_h(U) = |{(u, v) ∈ U × V : d(u, v) = h}|/(|U| · n). Classical sampling is often classified as a Monte Carlo method and, in the field of network analysis, has been applied, for example, for closeness centrality (Eppstein and Wang 2001) and betweenness centrality (Brandes and Pich 2007).

The algorithm works as follows: (i) perform a random sample of k vertices from V, obtaining the multiset U = {u_1, u_2, …, u_k} ⊆ V; (ii) run iterations i = 1, 2, …, k, computing the distances d(u_i, v) for all v ∈ V by executing a BFS traversal of G starting from vertex u_i; and (iii) return the approximation N_h(U). The running time of the algorithm is O(k · m), with additional space of O(n). By applying the Azuma-Hoeffding bound (Hoeffding 1963), it is possible to prove that choosing k = Θ(ε⁻² log n), this algorithm computes an approximation of the distance distribution N_h of G whose absolute error is bounded by ε, with high probability.

As the number of connected pairs is not known a priori in directed weakly connected graphs, this …
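The classical sampling scheme just described can be sketched compactly in Python (illustrative code with invented names; it samples with replacement and ignores the directed/disconnected subtleties mentioned above):

```python
import random
from collections import deque

def bfs_distances(adj, s):
    """Single-source BFS distances on an unweighted graph given as an
    adjacency list {node: [neighbors]}; unreachable nodes get None."""
    dist = {u: None for u in adj}
    dist[s] = 0
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] is None:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def sample_distance_distribution(adj, k, seed=0):
    """Classical sampling: approximate N_h by the fraction of pairs in
    U x V at distance exactly h, using k BFS traversals (O(k*m) time)."""
    rng = random.Random(seed)
    n = len(adj)
    nodes = list(adj)
    U = [rng.choice(nodes) for _ in range(k)]   # sample with replacement
    counts = {}
    for u in U:
        for d in bfs_distances(adj, u).values():
            if d is not None and d > 0:
                counts[d] = counts.get(d, 0) + 1
    return {h: c / (k * n) for h, c in sorted(counts.items())}

# Path graph 0-1-2-3: the exact values are N_1 = 6/16, N_2 = 4/16, N_3 = 2/16.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
est = sample_distance_distribution(path, k=4, seed=1)
print(est)
```

On a connected graph the estimated fractions over all h > 0 always sum to (n − 1)/n, regardless of which vertices are sampled; the per-h accuracy is what the Azuma-Hoeffding argument above controls.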
Bottom-k. This paradigm relies on the technique introduced by Cohen and Kaplan (2008, 2007a, b). Given a mapping r : U → {1/n, 2/n, …, 1} and a subset A of U, we denote as H_k(A) the first k elements of A according to r and with max(A) the element of A having maximum rank. A bottom-k sketch for A is H_k(A).

INIT(S(A)): S(A) = ∅.
UPDATE(S(A), u): If |S(A)| < k, add u to S(A). Otherwise, if r(max(S(A))) > r(u), replace max(S(A)) with u.
UNION(S(A), S(B)): return H_k(S(A) ∪ S(B)).
SIZE(S(A)): return (k − 1)/r(max(S(A))).

It corresponds to the sampling-without-replacement version of the k-min approach.

LogLog counters. This approach exploits the LogLog counters introduced in Flajolet and Martin (1985). The implementation of this paradigm for computing the distance distribution is called ANF (Palmer et al. 2002). Here a sketch S(A) is a sequence of exactly k entries. A more recent implementation exploits the HyperLogLog counters of Flajolet et al. (2007); we refer to the description given by Boldi et al. (2011). In this case, let h be a hash function from U to 2^∞. As in the case of LogLog counters, a sketch S(A) is a sequence of exactly k entries a_1, …, a_k, where k = 2^b and b is given in input. Each entry a_i is a counter. Let h_b(x) be the sequence of the b leftmost bits of h(x), and let h^b(x) be the remaining bits. Moreover, given a binary sequence w, let C(w) be the number of leading zeroes in w plus one. Intuitively, as in the case of LogLog counters, the elements of U are partitioned into k groups according to their value of h_b(x), and entry a_i of the sketch S(A) holds the maximum C(h^b(y)) among the elements y of A mapped to the i-th group.

INIT(S(A)): a_i = −∞ for any i.
UPDATE(S(A), u): Let i = h_b(u). Set a_i equal to max{a_i, C(h^b(u))}.
UNION(S(A), S(B)): return {c_1, …, c_k}, where c_i = max{a_i, b_i}.
SIZE(S(A)): return α_k k² / Σ_{j=1}^{k} 2^{−a_j}.
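The bottom-k operations above can be sketched in a few lines of Python (illustrative code; the class name is invented, and the rank function r is shared by all sketches over the same universe):

```python
import random

class BottomK:
    """Bottom-k sketch: keep the k elements of smallest rank under a
    fixed random mapping r : U -> (0, 1]; the set size is estimated
    as (k - 1) / r(max(S(A))), following the SIZE operation above."""
    def __init__(self, k, rank):
        self.k, self.rank = k, rank
        self.items = {}                      # element -> rank

    def update(self, u):
        if u in self.items:
            return
        if len(self.items) < self.k:
            self.items[u] = self.rank[u]
        else:
            worst = max(self.items, key=self.items.get)
            if self.items[worst] > self.rank[u]:
                del self.items[worst]        # replace max(S(A)) with u
                self.items[u] = self.rank[u]

    def union(self, other):
        merged = BottomK(self.k, self.rank)
        for u in list(self.items) + list(other.items):
            merged.update(u)                 # H_k of the merged sketches
        return merged

    def size(self):
        if len(self.items) < self.k:
            return len(self.items)           # exact when sketch not full
        return (self.k - 1) / max(self.items.values())

rng = random.Random(42)
universe = list(range(10000))
rank = {u: rng.random() for u in universe}   # shared rank function r
s = BottomK(64, rank)
for u in universe[:5000]:
    s.update(u)
print(round(s.size()))   # close to 5000
```

Note how UNION needs only the two sketches, not the underlying sets — the property that makes these sketches usable for neighborhood-function estimation.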
Cohen E (2015) All-distances sketches, revisited: HIP estimators for massive graphs analysis. IEEE Trans Knowl Data Eng 27(9):2320–2334
Cohen E, Kaplan H (2007a) Bottom-k sketches: better and more efficient estimation of aggregates. In: ACM SIGMETRICS performance evaluation review, vol 35. ACM, pp 353–354
Cohen E, Kaplan H (2007b) Summarizing data using bottom-k sketches. In: PODC, pp 225–234
Cohen E, Kaplan H (2008) Tighter estimation using bottom k sketches. PVLDB 1(1):213–224
Crescenzi P, Grossi R, Imbrenda C, Lanzi L, Marino A (2010) Finding the diameter in real-world graphs: experimentally turning a lower bound into an upper bound. In: Proceedings ESA. LNCS, vol 6346, pp 302–313
Crescenzi P, Grossi R, Lanzi L, Marino A (2011) A comparison of three algorithms for approximating the distance distribution in real-world graphs. In: Theory and practice of algorithms in (computer) systems – proceedings of first international ICST conference, TAPAS 2011, Rome, 18–20 Apr 2011, pp 92–103
Crescenzi P, Grossi R, Lanzi L, Marino A (2012) On computing the diameter of real-world directed (weighted) graphs. In: Experimental algorithms – proceedings of 11th international symposium, SEA 2012, Bordeaux, 7–9 June 2012, pp 99–110
Crescenzi P, Grossi R, Habib M, Lanzi L, Marino A (2013) On computing the diameter of real-world undirected graphs. Theor Comput Sci 514:84–95
Eppstein D, Wang J (2001) Fast approximation of centrality. In: Proceedings of the twelfth annual ACM-SIAM symposium on discrete algorithms, SODA'01. Society for Industrial and Applied Mathematics, Philadelphia, pp 228–229
Flajolet P, Martin G (1985) Probabilistic counting algorithms for data base applications. J Comput Syst Sci 31(2):182–209
Flajolet P, Fusy É, Gandouet O et al (2007) HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Proceedings of the 2007 international conference on analysis of algorithms, AofA'07
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58(301):13–30
Impagliazzo R, Paturi R, Zane F (2001) Which problems have strongly exponential complexity? J Comput Syst Sci 63(4):512–530
Magnien C, Latapy M, Habib M (2009) Fast computation of empirically tight bounds for the diameter of massive graphs. J Exp Algorithmics 13:1–10
Palmer CR, Gibbons PB, Faloutsos C (2002) ANF: a fast and scalable tool for data mining in massive graphs. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 81–90
Roditty L, Williams VV (2013) Fast approximation algorithms for the diameter and radius of sparse graphs. In: Symposium on theory of computing conference, STOC'13, Palo Alto, 1–4 June 2013, pp 515–524

Delta Compression Techniques

Torsten Suel
Department of Computer Science and Engineering, Tandon School of Engineering, New York University, Brooklyn, NY, USA

Synonyms

Data differencing; Delta encoding; Differential compression

Definitions

Delta compression techniques encode a target file with respect to one or more reference files, such that a decoder who has access to the same reference files can recreate the target file from the compressed data. Delta compression is usually applied in cases where there is a high degree of redundancy between target and reference files, leading to a much smaller compressed size than could be achieved by just compressing the target file by itself. Typical application scenarios include revision control systems and versioned file systems that store many versions of a file, or software or content updates over networks where the recipient already has an older version of the data. Most work on delta compression techniques has focused on the case of textual and binary files, but the concept can also be applied to multimedia and structured data.

Delta compression should not be confused with Elias delta codes, a technique for encoding integer values, or with the idea of coding sorted sequences of integers by first taking the difference (or delta) between consecutive values. Also, delta compression requires the encoder to have complete knowledge of the reference files
and thus differs from more general techniques for redundancy elimination in networks and storage systems where the encoder has limited or even no knowledge of the reference files, though the boundaries with that line of work are not clearly defined.

Overview

Many applications of big data technologies involve very large data sets that need to be stored on disk or transmitted over networks. Consequently, data compression techniques are widely used to reduce data sizes. However, there are many scenarios where there are significant redundancies between different data files that cannot be exploited by compressing each file independently. For example, there may be many versions of a document, say a Wikipedia article, that differ only slightly from one to the next. Delta compression techniques attempt to exploit such redundancy between pairs or groups of files to achieve better compression.

We define delta compression for the most common case of a single reference file. Formally, we have two strings (files): f_tar ∈ Σ* (the target file) and f_ref ∈ Σ* (the reference file). Then the goal for an encoder E with access to both f_tar and f_ref is to construct a file f_δ ∈ Σ* of minimum size, such that a decoder D can reconstruct f_tar from f_ref and f_δ. We also refer to f_δ as the delta of f_tar and f_ref.

Given this definition, research on delta compression techniques has focused on three challenges: (1) How to build delta compression tools that result in an f_δ of small size while also achieving fast compression and decompression speeds and a small memory footprint. (2) How to use delta compression techniques to compress collections of files, i.e., how to select suitable reference files given a target file, or how to select a sequence of delta compression steps between files to achieve good compression for a collection, ideally while also allowing fast retrieval of individual files. (3) How to apply delta compression in various application scenarios.

Key Research Findings

String-to-String Correction and Differencing
Some early related work was done by Wagner and Fisher (1974), who studied the string-to-string correction problem. This is the problem of finding the best (shortest) sequence of insert, delete, and update operations that transform one string f_ref into another string f_tar. The solution in Wagner and Fisher (1974) is based on finding the longest common subsequence of the two strings using dynamic programming and adding all remaining characters to f_tar explicitly. However, the string-to-string correction problem does not capture the full generality of the delta compression problem. In particular, in the string-to-string correction problem, it is implicitly assumed that the data common to f_tar and f_ref appears in (roughly) the same order in the two files, and the approach cannot exploit the case of substrings of f_ref appearing repeatedly in f_tar.

This early work motivated Unix tools such as diff and bdiff, which create a patch, i.e., a set of edit commands, that can be used to update the reference file to the target file. Such tools are widely used to create patches for software updates, and one advantage is that these patches are easily readable by humans who need to understand the changes. However, the tools typically do not generate delta files of competitive size, due to (1) limitations in the type of edit operations that are supported, as discussed in the previous paragraph, and (2) the absence of native compression methods for reducing the size of the patch. The latter issue can be partially addressed by applying general-purpose file compressors such as gzip or bzip to the generated patches, though this is still far from optimal.

The first problem was partially addressed by Tichy (1984), who defined the string-to-string correction problem with block moves. For a file f, let f[i] denote the i-th symbol of f, 0 ≤ i < |f|, and f[i, j] denote the block of symbols from i until (and including) j. Then a block move is a triple (p, q, l) such that f_ref[p, …, p + l − 1] = f_tar[q, …, q + l − 1], representing a nonempty common substring of f_ref and f_tar of length l.
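A toy Python encoder/decoder in the spirit of this copy-based formulation (illustrative only; the function names and the ('C', pos, len)/('A', char) command format are invented simplifications, and the encoder greedily picks the longest block copy found through a hash table of min_len-grams):

```python
def make_delta(ref, tar, min_len=3):
    """Greedy delta encoder: at each position of `tar`, copy the longest
    match found in `ref` (located via a hash table of all substrings of
    length `min_len`), otherwise emit a literal character."""
    index = {}
    for p in range(len(ref) - min_len + 1):
        index.setdefault(ref[p:p + min_len], []).append(p)
    delta, q = [], 0
    while q < len(tar):
        best_p, best_l = -1, 0
        for p in index.get(tar[q:q + min_len], []):
            l = 0
            while p + l < len(ref) and q + l < len(tar) and ref[p + l] == tar[q + l]:
                l += 1
            if l > best_l:
                best_p, best_l = p, l
        if best_l >= min_len:
            delta.append(('C', best_p, best_l))   # copy block from ref
            q += best_l
        else:
            delta.append(('A', tar[q]))           # literal add
            q += 1
    return delta

def apply_delta(ref, delta):
    """Decoder: replay block copies from `ref` and literal adds."""
    out = []
    for op in delta:
        if op[0] == 'C':
            _, p, l = op
            out.append(ref[p:p + l])
        else:
            out.append(op[1])
    return ''.join(out)

ref = "the quick brown fox jumps over the lazy dog"
tar = "the quick red fox jumps over the lazy dogs"
d = make_delta(ref, tar)
assert apply_delta(ref, d) == tar
print(d[:3])
```

Production tools differ mainly in how they index the reference (hash chains, windows), how they encode the copy commands compactly, and in entropy-coding the result, as discussed in the following sections.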
The file f_δ can now be constructed as a minimal covering set of block moves, such that every f_tar[i] that also appears in f_ref is included in exactly one block move.

Tichy (1984) also showed that a greedy algorithm results in a minimal cover set, leading to an f_δ that can be constructed in linear space and time using suffix trees. However, the multiplicative constant in the space complexity made the approach impractical. A more practical approach uses hash tables, with linear space but quadratic worst-case time complexity (Tichy 1984). Subsequent work in Ajtai et al. (2002) showed how to further reduce the space used during compression while avoiding a significant increase in compressed size, while Agarwal et al. (2004) showed how to reduce size by picking better copies than a greedy approach. Finally, Xiao et al. (2005) showed bounds for delta compression under a very simple model of file generation where a file is edited according to a two-state Markov process that either keeps or replaces content.

Delta Compression Using Lempel-Ziv Coding
The block-move-based approach proposed in Tichy (1984) leads to a basic shift in the development of delta compression algorithms, as it suggests thinking about delta compression as a sequence of copies from the reference file into the target file, as opposed to a sequence of edit operations on the reference file. This led to a set of new algorithms based on the Lempel-Ziv family of compression algorithms (Ziv and Lempel 1977, 1978), which naturally lend themselves to such a copy-based approach while also allowing for compression of the resulting patches.

In particular, the LZ77 algorithm can be viewed as a sequence of copy operations that replace a prefix of the string being encoded by a reference to an identical previously encoded substring. Thus, delta compression could be viewed as simply performing LZ77 compression with the file f_ref representing previously encoded text. In fact, we can also include the part of f_tar that has already been encoded in the search for a longest matching prefix. A few extra changes are needed to get a practical implementation of LZ77-based delta compression. Implementations of this approach include vdelta (Hunt et al. 1998), several versions of xdelta (MacDonald 2000), zdelta (Trendafilov et al. 2002), the recent and very fast ddelta (Xia et al. 2014b) and edelta (Xia et al. 2015), and a number of implementations following the vcdiff differencing format.

A Sample Implementation
We now describe a possible implementation in more detail, using the example of zdelta. In fact, zdelta is based on a modification of the zlib library implementation of gzip by Gailly and Adler (Gailly 2017), with some additional ideas inspired by vdelta (Hunt et al. 1998). Note that zlib finds previous occurrences of substrings using a hash table. In zdelta, this is extended to several hash tables, one for each reference file and one for the already coded part of the target file. The table for f_tar is essentially handled as in zlib, where new entries are inserted as f_tar is traversed and encoded. The table for f_ref can be built beforehand by scanning f_ref, assuming f_ref is not too large. When looking for matches, all tables are searched to find the best match. Hashing of substrings is done based on the initial three characters, with chaining inside each hash bucket.

As observed in Hunt et al. (1998), in many cases, the location of the next match in f_ref is a short distance after the end of the previous match, especially when the files are very similar. Thus, the locations of matches in f_ref are usually best represented as offsets from the end of the previous match in f_ref. However, sometimes there are isolated matches in other parts of f_ref. This motivates zdelta to maintain two pointers into each reference file and to code locations by storing which pointer is used, the direction of the offset, and the offset itself. Pointers are initially set to the start of the file and afterward usually point to the ends of the previous two matches in the same reference file. If a match is far away from both pointers, a pointer is only moved to the end of it if it is the second such match in a row. Other, more complex pointer movement policies might give additional moderate improvements.

To encode the generated offsets, pointers, directions, match lengths, and literals, zdelta uses
the Huffman coding facilities provided by zlib, while vdelta uses a byte-based encoding that is faster but less compact.

Window Management in LZ-Based Delta Compressors
The above description assumes that reference files are small enough to completely fit into the hash table. However, this is usually not realistic. By default, zlib only indexes a history of up to 32 KB of already processed data in the hash table, in two blocks of 16 KB each. Whenever the second block is filled, the first one is freed up, temporarily reducing the indexed history to 16 KB. This is for efficiency reasons: indexing more than 32 KB would result in larger hash tables that lead to increased L1 cache misses (as L1 cache sizes have not increased much over the last two decades) and, more importantly, result in very long hash chains for common three-character substrings that would be traversed when searching for matches.

In the target file, we could just index the last up to 32 KB in the hash table by their first 3 bytes, as done in zlib. But which parts of the reference files should we index? There are several possible choices, as follows:

• Sliding Window Forward: In zdelta, a 32 KB window is initially placed at the start of the reference file and then moved forward by 16 KB whenever the median of the last few copies comes within a certain distance from the end of the current window. This works well if the content is reasonably aligned between the reference and target file, which is the common case in many applications, but can perform very badly when large blocks of content occur in a very different order in the two files.

• Block Fingerprints: Another possible approach chooses the parts in the reference files that should be indexed based on their similarity to the content currently being encoded. This is done by computing fingerprints for certain substrings and then indexing parts of the reference file that share many fingerprints with the content we are currently encoding. This approach was considered by Korn and Vo in the context of implementing vcdiff (Korn and Vo 2002).

• Hashing Longer Substrings: Another approach indexes the entire reference file and avoids long hash chains by hashing substrings much longer than three characters. This allows finding copies from anywhere in the reference and target files, as long as copies are longer than the hashed substrings. This approach was taken in earlier versions of xdelta, making it very robust against content reordering between the files, but compression may be slightly worse than other approaches on well-aligned files, since only large copies are supported. However, the approach can be combined with the others above, to allow both longer and shorter copies, by first using larger fixed-size chunks as proposed in Bentley and McIlroy (1999) and Tridgell (2000), or with content-defined chunking, and then finding smaller copies more locally; see, e.g., ddelta (Xia et al. 2014b).

Compressing Collections of Files

Larger collections of files can be compressed by repeatedly applying delta compression to pairs or groups of files. This raises the question of what reference files to choose for which target files in order to achieve a cycle-free compression of the entire collection with minimum compressed size. In some scenarios, it is also crucial to be able to quickly decompress individual files without having to decompress too many other files.

In the case of revision control systems, older files are usually compressed using a newer version as a reference file, since newer versions are more frequently accessed. However, retrieving an old version could be very slow if it requires decompressing a long chain of more recent versions before reaching the one that is needed. For this reason, such systems often create shortcuts, where versions are occasionally encoded using a much later version as reference, at the cost of a slight increase in size.

Delta compression can also be used to compress collections of non-versioned files that share
some degree of similarity. For the case of a single reference file, finding an optimal and cycle-free assignment of reference files to target files can be modeled as an optimum branching problem on a complete graph (Tate 1997; Ouyang et al. 2002), which can be solved in O(|V|²) time on dense graphs, where |V| is the number of vertices. For the case of more than one reference file, the problem is known to be NP-complete (Adler and Mitzenmacher 2001). However, the cost of computing the edge weights of the input graph is prohibitive, motivating Ouyang et al. (2002) to propose using text clustering to identify for each target file a small set of promising reference file candidates. Work in Bagchi et al. (2006) showed that an optimum branching on a reduced-degree graph of highest-weight edges in fact provides a provable approximation ratio to the solution on the full graph.

Work by Molfetas et al. (2014b,a) studied how to optimize and trade off access speed and compression ratio in delta-compressed file collections, e.g., by avoiding references to substrings in certain files when other copies could be used instead. It is shown that significant improvements are possible by judiciously picking copies during compression.

However, for bulk archival and retrieval of large collections of non-versioned web pages, work in Trendafilov et al. (2004), Chang et al. (2006), and Ferragina and Manzini (2010) suggests that better speed and compression can be achieved by applying optimized compression tools for archiving large data sets on top of a suitable linear ordering of the files, say one obtained via text clustering or URL ordering. The main reason is that such tools can identify repeated substrings over very long distances, while delta compressors are limited to one or a few reference files.

Examples of Applications

There are a number of scenarios where delta compression can be applied in order to reduce networking and storage cost. A few of them are now discussed briefly.

• Software Revision Control Systems: Delta compression techniques were originally proposed and developed in the context of systems used for maintaining the revision history of software projects and other documents (Berliner 1990; Rochkind 1975; Tichy 1985). In such systems, there are multiple, often almost identical, versions of each document, including different branches, that have to be stored to enable users to retrieve and roll back to past versions. See Hunt et al. (1998) for more discussion of delta compression in the context of such systems.

• Software Patch Distribution: Delta compression techniques are used to generate software patches that can be efficiently transmitted over a network in order to update installed software packages. A good delta compressor can significantly reduce the cost of rolling out updates, say, for security-critical updates to popular software that need to be quickly disseminated. The case of updating automobile software over wireless links was recently discussed in Nakanishi et al. (2013), while Samteladze and Christensen (2012) address efficient app updates on mobile devices. Tools such as bspatch and bsdiff are especially optimized for the case of updating executables, and even better algorithms for this case are proposed in Percival (2006) and Motta et al. (2007).

• Improving Data Downloading: There is a long history of proposals to employ delta compression to improve web access, by exploiting the similarity between the current and an older cached version of a page, or between different pages on the same site. A scheme called optimistic delta was proposed in Banga et al. (1997) and Mogul et al. (1997), where a caching proxy hides server latency by first returning a potentially outdated cached version, followed if needed by a small patch once the server replies. In another approach, a client with a cached old version sends a tag identifying the version as part of the HTTP request and then receives a patch (Housel and Lindquist 1996; Delco and Ionescu, xProxy: a transparent caching and delta transfer sys-
tem for web objects. Unpublished manuscript, 2000). A specialized tool called jdelta for XML data in push-based services is proposed in Wang et al. (2008).

It is well known that pages from the same web site are often very similar, mostly due to common layout and menu structure, and that this could be exploited with delta compression. Work in Chan and Woo (1999) and Savant and Suel (2003) studies how to identify cached pages that make good reference files for delta compression, while Douglis et al. (1997) proposes a similar idea for shared dynamic pages, e.g., different stock quotes from a financial site.

While delta compression has the potential to significantly reduce latency and bandwidth usage, particularly over slower mobile links, such schemes have only seen limited adoption. Examples of widely used implementations are the Shared Dictionary Compression over HTTP mechanism implemented by Google in the Chrome browser and in their Brotli compressor (Alakuijala and Szabadka 2016) (which is based on the vcdiff differencing format and employs a mechanism similar to delta compression) and the use of delta compression for reducing network traffic in Dropbox (Drago et al. 2013).

• Delta Compression in File and Storage Systems: Delta compression is also useful for versioning file systems that keep old versions of files. The Xdelta File System (MacDonald 2000) aimed to provide efficient support for delta compression at the file system level using xdelta. Delta compression is also used for redundancy elimination in backup systems. In particular, Kulkarni et al. (2014), Shilane et al. (2012), and Xia et al. (2014b,a) showed that adding delta compression between similar …

… compression has been considered as a way to improve compression of general collections of files that share some degree of similarity; see, e.g., Ouyang et al. (2002). However, subsequent work (Trendafilov et al. 2004; Chang et al. 2006; Ferragina and Manzini 2010) indicates that this approach is often outperformed by optimized compression tools for archiving large data sets, after suitable reordering of the files.

• Exploring File Differences: Delta compression tools can be used to visualize differences between documents. For example, the diff utility displays the differences between two files as a set of edit commands, while the HtmlDiff and topblend tools in Chen et al. (2000) visualize the differences between HTML documents. Here, the focus is less on compressed size and more on readability.

Future Directions for Research

Delta compression is a fairly mature technology, and future gains in compression may thus be modest for general scenarios. However, there are still a number of challenges that remain, including the following:

• Speed Improvements: Recent years have seen significant improvements in speed for many compression techniques, in some cases due to the use of available data-parallel (SIMD) instruction sets in modern processors. Examples of recent high-speed methods for delta compression are ddelta (Xia et al. 2014b) and edelta (Xia et al. 2015), and future work should further improve on these results.

• Alternative Approaches: Almost all current delta compressors use LZ-based approaches. While this is a natural approach, there might
ilar files or chunks of data on top of dupli- be alternative approaches (e.g., methods
cate elimination (removal of identical files or based on Burrows-Wheeler compression
chunks) can give significant additional size Burrows and Wheeler 1994) that could lead
reductions, though at some processing cost. to improvements. On a more speculative
Delta compression was implemented, e.g., in level, researchers have recently started
the EMC Data Domain line of products. using recurrent neural networks for better
• Efficient Storage of File Collections: As dis- compression, and such an approach might
cussed in the previous section, delta com- also lead to better delta compression.
• Formal Analysis of Methods: Current delta compression algorithms are very heuristic in nature, and there is limited work on formally analyzing their performance or optimality with respect to information theoretic measures. An exception is the work in Xiao et al. (2005), which shows some bounds under a very simple model of document generation, but there is a need for more formal analysis.

• Integration in Storage Systems: As discussed, delta compression techniques have been deployed inside storage systems in conjunction with other redundancy elimination techniques. In such systems, it can be challenging to identify suitable reference files for a target file that needs to be stored and to decide when it is worthwhile to apply delta compression techniques, which require fetching the reference files for coding and decoding, in conjunction with other redundancy elimination techniques. Future research could look for new methods that use fingerprints and associated indexes to quickly find good reference files or explore better tradeoffs between compression and the amount of knowledge that is needed about the reference data (where delta compression is the case of full knowledge).

References

Adler M, Mitzenmacher M (2001) Towards compressing web graphs. In: IEEE data compression conference
Agarwal R, Amalapuraru S, Jain S (2004) An approximation to the greedy algorithm for differential compression of very large files. In: IEEE data compression conference
Ajtai M, Burns R, Fagin R, Long D, Stockmeyer L (2002) Compactly encoding unstructured inputs with differential compression. J ACM 49(3):318–367
Alakuijala J, Szabadka Z (2016) RFC 7932: Brotli compressed data format. Available at https://tools.ietf.org/html/rfc7932
Bagchi A, Bhargava A, Suel T (2006) Approximate maximum weighted branchings. Inf Process Lett 99(2):54–58
Banga G, Douglis F, Rabinovich M (1997) Optimistic deltas for WWW latency reduction. In: USENIX annual technical conference
Bentley J, McIlroy D (1999) Data compression using long common strings. In: IEEE data compression conference
Berliner B (1990) CVS II: Parallelizing software development. In: Winter 1990 USENIX conference
Burrows M, Wheeler D (1994) A block-sorting lossless data compression algorithm. Technical report 124, SRC, Digital Systems Research Center, Palo Alto
Chan M, Woo T (1999) Cache-based compaction: a new technique for optimizing web transfer. In: INFOCOM conference
Chang F, Dean J, Ghemawat S, Hsieh W, Wallach D, Burrows M, Chandra T, Fikes A, Gruber R (2006) Bigtable: a distributed storage system for structured data. In: Seventh symposium on operating system design and implementation
Chen Y, Douglis F, Huang H, Vo K (2000) Topblend: an efficient implementation of HtmlDiff in Java. In: WebNet 2000 conference
Douglis F, Haro A, Rabinovich M (1997) HPP: HTML macro-preprocessing to support dynamic document caching. In: USENIX symposium on internet technologies and systems
Drago I, Bocchi E, Mellia M, Slatman H, Pras A (2013) Benchmarking personal cloud storage. In: Internet measurement conference
Ferragina P, Manzini G (2010) On compressing the textual web. In: ACM international conference on web search and data mining
Gailly J (2017) zlib compression library, version 1.2.11. Available at https://zlib.net
Housel B, Lindquist D (1996) WebExpress: a system for optimizing web browsing in a wireless environment. In: ACM conference on mobile computing and networking, pp 108–116
Hunt J, Vo KP, Tichy W (1998) Delta algorithms: an empirical analysis. ACM Trans Softw Eng Methodol 7:192–213
Korn D, Vo KP (2002) Engineering a differencing and compression data format. In: USENIX annual technical conference, pp 219–228
Kulkarni P, Douglis F, LaVoie J, Tracey JM (2014) Redundancy elimination within large collections of files. In: USENIX annual technical conference
MacDonald J (2000) File system support for delta compression. MS thesis, University of California, Berkeley
Mogul JC, Douglis F, Feldmann A, Krishnamurthy B (1997) Potential benefits of delta-encoding and data compression for HTTP. In: ACM SIGCOMM conference, pp 181–196
Molfetas A, Wirth A, Zobel J (2014a) Scalability in recursively stored delta compressed collections of files. In: Second Australasian web conference
Molfetas A, Wirth A, Zobel J (2014b) Using inter-file similarity to improve intra-file compression. In: IEEE international congress on big data
Motta G, Gustafson J, Chen S (2007) Differential compression of executable code. In: IEEE data compression conference
Nakanishi T, Shih H, Hisazumi K, Fukuda A (2013) A software update scheme by airwaves for automotive equipment. In: International conference on information, electronics, and vision
Ouyang Z, Memon N, Suel T, Trendafilov D (2002) Cluster-based delta compression of a collection of files. In: Third international conference on web information systems engineering
Percival C (2006) Matching with mismatches and assorted applications. PhD thesis, University of Oxford
Rochkind M (1975) The source code control system. IEEE Trans Softw Eng 1:364–370
Samteladze N, Christensen K (2012) Delta: delta encoding for less traffic for apps. In: IEEE conference on local computer networks
Savant A, Suel T (2003) Server-friendly delta compression for efficient web access. In: 8th international workshop on web content caching and distribution
Shilane P, Huang M, Wallace G, Hsu W (2012) WAN optimized replication of backup datasets using stream-informed delta compression. In: USENIX symposium on file and storage technologies
Tate S (1997) Band ordering in lossless compression of multispectral images. IEEE Trans Comput 46(45):211–320
Tichy W (1984) The string-to-string correction problem with block moves. ACM Trans Comput Syst 2(4):309–321
Tichy W (1985) RCS: a system for version control. Softw Pract Exp 15:637–654
Trendafilov D, Memon N, Suel T (2002) zdelta: a simple delta compression tool. Technical report, Polytechnic University, CIS Department
Trendafilov D, Memon N, Suel T (2004) Compressing file collections with a TSP-based approach. Technical report TR-CIS-2004-02, Polytechnic University
Tridgell A (2000) Efficient algorithms for sorting and synchronization. PhD thesis, Australian National University
Wagner RA, Fisher MJ (1974) The string-to-string correction problem. J ACM 21(1):168–173
Wang J, Guo Y, Huang B, Ma J, Mo Y (2008) Delta compression for information push services. In: International conference on advanced information networking and applications – workshops
Xia W, Jiang H, Feng D, Tian L (2014a) Combining deduplication and delta compression to achieve low-overhead data reduction on backup datasets. In: IEEE data compression conference
Xia W, Jiang H, Feng D, Tian L, Fu M, Zhou Y (2014b) Ddelta: a deduplication-inspired fast delta compression approach. Perform Eval 79:258–272
Xia W, Li C, Jiang H, Feng D, Hua Y, Qin L, Zhang Y (2015) Edelta: a word-enlarging based fast delta compression approach. In: USENIX workshop on hot topics in storage and file systems
Xiao C, Bing B, Chang GK (2005) Delta compression for fast wireless internet downloads. In: IEEE GlobeCom
Ziv J, Lempel A (1977) A universal algorithm for data compression. IEEE Trans Inf Theory 23(3):337–343
Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans Inf Theory 24(5):530–536

Delta Encoding

▶ Delta Compression Techniques

Dictionary-Based Compression

▶ Grammar-Based Compression

Differential Compression

▶ Delta Compression Techniques

Dimension Reduction

Kasper Green Larsen
Aarhus University, Aarhus, Denmark

Definitions

Dimensionality Reduction is a technique for taking a high-dimensional data set (data objects with many features/attributes) and replacing it with a much lower-dimensional data set while still preserving similarities between data objects. Dimensionality reduction is useful for reducing
the memory requirements for storing a data set, as well as for speeding up algorithms.

Overview

In modern algorithm design and data analysis, we often face very high-dimensional data. High-dimensional data comes in many forms. As examples, we can think of a 10-megapixel image as a point with ten million coordinates, one for each pixel. In this way, an image becomes a ten million-dimensional point. Another often encountered example arises when we wish to compare documents based on similarity. One way to do this is to simply count how many times each dictionary word occurs in a document and compare documents based on these counts. This yields a representation of a document as a point with one coordinate for each of the about 10^5 common English words. The distance between these high-dimensional points naturally captures the similarity of documents. The dimensionality grows even more dramatically when one compares documents based on so-called w-shingles (sets of w consecutive words in the text). With a typical choice of w = 5, the number of possible 5-shingles grows to (10^5)^5 = 10^25, and thus a document becomes an extremely high-dimensional point (albeit with few non-zero coordinates).

Dimensionality Reduction is an approach to "compressing" such high-dimensional points. Dimensionality reduction works by taking a high-dimensional point set and then mapping/embedding it to another point set in much lower-dimensional space. Said more formally, we take a set X = {p_1, …, p_n} of d-dimensional points and map them to a set f(X) = {f(p_1), …, f(p_n)} of m-dimensional points, hopefully with m much smaller than d. Since the points have far fewer coordinates after this embedding, they take up much less storage space. In addition to reducing the size of data sets, dimensionality reduction can also help speed up algorithms: The running time of many algorithms depends naturally on the dimensionality of the input data. If we reduce the dimensionality, we also reduce the running time of such algorithms.

The difficulty in dimensionality reduction comes from the conflicting goal of also preserving the geometry of the point set. Preserving the geometry can have many interpretations, but roughly speaking, it means that we want close points to remain close and far points to remain far apart. This chapter surveys the most popular notion of dimensionality reduction, namely, dimensionality reduction under Euclidean distances. Here the goal is to preserve Euclidean distances between points while reducing the dimension as much as possible. For two points x = (x_1, …, x_d) and y = (y_1, …, y_d) in d-dimensional space, their Euclidean distance is simply the generalization of distances in the plane to higher dimensions, that is, the distance from x to y is

√( Σ_{i=1}^{d} (y_i − x_i)² ).

If we think of points as vectors, then the distance between vectors x and y is the so-called ℓ2-norm of the vector y − x. For a vector z, we use the notation ‖z‖_2 = √( Σ_{i=1}^{d} z_i² ) to denote the ℓ2-norm of z. Thus the distance from x to y equals ‖y − x‖_2. When we say dimensionality reduction in this chapter, we thus think of the task of taking a set of d-dimensional points X = {p_1, …, p_n} and mapping them to m-dimensional points f(X) = {f(p_1), …, f(p_n)}, such that the distance ‖f(p_i) − f(p_j)‖_2 roughly equals the original distance ‖p_i − p_j‖_2 for all pairs of points p_i, p_j in X.
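As a quick sanity check of the distance formula, the following NumPy snippet computes the Euclidean distance on a pair of toy vectors (the values are illustrative only):

```python
import numpy as np

# Toy vectors in d = 3 dimensions.
x = np.array([1.0, 2.0, 2.0])
y = np.array([4.0, 6.0, 2.0])

# sqrt(sum_i (y_i - x_i)^2), i.e., the l2-norm of y - x.
dist = np.sqrt(((y - x) ** 2).sum())
assert dist == np.linalg.norm(y - x) == 5.0
```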
Key Research Findings

The cornerstone result in dimensionality reduction is the famous Johnson-Lindenstrauss lemma due to Johnson and Lindenstrauss (1984). For any set X = {p_1, …, p_n} of n points in d-dimensional space (denoted R^d), and for any approximation factor ε ∈ (0, 1/2), the result says that there exists an embedding f(X) = {f(p_1), …, f(p_n)}, with each f(p_i) in R^m and m = O(ε^{-2} lg n), such that for all points p_i, p_j ∈ X, it holds that

‖f(p_j) − f(p_i)‖_2 ≥ (1 − ε) ‖p_j − p_i‖_2

and

‖f(p_j) − f(p_i)‖_2 ≤ (1 + ε) ‖p_j − p_i‖_2.

Let us take a minute to parse the result. When given a point set X in R^d, it says that the "user" can specify an approximation factor ε ∈ (0, 1/2). For that chosen approximation factor, it is possible to take the points X and embed them into only m = O(ε^{-2} lg n) dimensions while guaranteeing that the distance between any two points is only distorted by a multiplicative factor (1 ± ε). Thus if two points were close to each other before embedding, they will remain close afterward. Similarly, if two points were far apart before embedding, they will remain far apart. It is quite remarkable that the target dimension m doesn't even depend on the original dimensionality of the point set. Furthermore, the dependency on the number of points in X is only logarithmic.

The original construction of Johnson and Lindenstrauss (1984) is not meant to be practical, and as you may have observed, the above only talks about the existence of an embedding f(X). In practice, we naturally care about computing f(X) efficiently. The following simple algorithm can be found in Dasgupta and Gupta (2003):

Embed(X):

• Let m = Θ(ε^{-2} lg n). Form an m × d matrix A with each entry chosen independently and uniformly at random among {−1, +1}.
• Embed each point x ∈ X to the point f(x) = (Ax)/√m (compute Ax and divide all entries by √m).

Intuitively, the above algorithm chooses m random directions and projects the data onto those directions. The algorithm is randomized, and there is a small probability that it fails in the sense that some pair of points has their distance distorted by more than (1 ± ε). The failure probability is polynomially small in n when m = Cε^{-2} lg n for a big enough constant C > 0. If one cannot live with this small probability of failure, one can always run through all n(n − 1)/2 pairs of points and explicitly check if any pair has their distance distorted by too much. If that is the case, one can start over with a new random matrix A. Of course this is very time-consuming as we have to check n(n − 1)/2 pairs. Thus one would typically just hope for the best and not try to verify whether all distances were preserved. Finding an efficient deterministic construction remains an intriguing open problem (see Future Directions for Research).

Computing f(x) for a data point x ∈ X takes O(md) time by a straightforward implementation of matrix-vector multiplication.

Remark 1 The paper by Dasgupta and Gupta (2003) actually has each entry of A a uniform random N(0, 1) distributed random variable. This makes their proof simpler but makes the construction less efficient when implemented in practice. For a correctness proof when using ±1 entries, see, e.g., Nelson (2015).

Remark 2 Much research has gone into determining whether the number of dimensions m = O(ε^{-2} lg n) is the best possible when given the desired approximation factor ε. This question was only answered very recently, where Larsen and Nelson (2017) showed that there exist point sets where one cannot embed into fewer than Ω(ε^{-2} lg n) dimensions while preserving all pairwise distances to within (1 ± ε). Thus the Johnson-Lindenstrauss lemma is optimal when it comes to the target dimension m.
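A minimal NumPy sketch of the Embed(X) procedure above; the constant C = 2, the seed, and the toy sizes are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(X, eps, C=2.0):
    """Embed the rows of X as in Embed(X): multiply by a random
    {-1,+1} matrix A and rescale by 1/sqrt(m)."""
    n, d = X.shape
    m = int(C * eps ** -2 * np.log2(n))        # m = C * eps^-2 * lg n
    A = rng.choice([-1.0, 1.0], size=(m, d))   # entries uniform in {-1,+1}
    return X @ A.T / np.sqrt(m)                # row i holds f(p_i) = A p_i / sqrt(m)

# Toy check that one pairwise distance is roughly preserved.
X = rng.standard_normal((50, 10_000))          # n = 50 points in d = 10000 dims
Y = embed(X, eps=0.2)
orig = np.linalg.norm(X[3] - X[17])
new = np.linalg.norm(Y[3] - Y[17])
ratio = new / orig   # lies in roughly [1 - eps, 1 + eps] with high probability
```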
Speeding Up the Embedding
In many applications, the embedding time of O(md) may not be sufficiently fast. For instance, if the original dimensionality is in the millions and one embeds into a couple of thousand dimensions, then the matrix-vector multiplication takes on the order of billions of instructions for each single data point. Several approaches have been suggested for speeding up the embedding time. We discuss the two main directions in the following:

Fast Johnson-Lindenstrauss Transforms. The first approach, due to Ailon and Chazelle (2009), replaces the matrix A from the previous algorithm with another carefully chosen matrix B, such that computing Bx can be done faster than computing Ax. To be precise, they actually replace Ax/√m by the product of three matrices PHD and x, i.e., f(x) = PHDx. The matrix D is a d × d diagonal matrix with each diagonal entry set independently and uniformly at random among {−1, +1} (and all off-diagonals are 0). The matrix H is the d × d normalized Walsh-Hadamard matrix, where the entry h_{i,j} is equal to (−1)^{⟨i−1, j−1⟩}, and ⟨i, j⟩ is the inner product modulo 2 of the vectors corresponding to the binary representations of i and j. Finally, P is an m × d matrix where the entries are chosen independently as follows: With probability 1 − q, set p_{i,j} to 0, and otherwise let p_{i,j} be N(0, 1/q) distributed. Here

q = min{ Θ((lg² n)/d), 1 }.

While the embedding algorithm seems more complicated, observe first that computing Dx takes O(d) time since D is a diagonal matrix. Applying H can be done in O(d lg d) time using the fast Fourier transform. Finally, P has an expected qmd = O(m lg² n) = O(ε^{-2} lg³ n) non-zeroes; thus applying P can be done in O(m lg² n) time, resulting in an embedding time of O(d lg d + m lg² n). This may or may not be faster than the original embedding time of O(md) depending on the relationship between m and d, but for most natural values of m and d, this would be preferred.

Sparse Matrices. The second approach replaces the matrix A with a much sparser matrix C. If C has few non-zeroes, then computing Cx can again be done faster than computing Ax. Before we present the results, let us first make another comment on the time it takes to compute Ax in the first algorithm we presented. Here A was an m × d matrix with all entries either −1 or +1. If our input vector x has few non-zeroes, then computing Ax can also be done faster. To see this, note that Ax is simply Σ_j x_j a_j, where a_j is the j'th column of A. All terms in this sum where x_j = 0 can be ignored. Thus if we let ‖x‖_0 denote the number of non-zero coordinates of x, the embedding time actually improves to O(m‖x‖_0). In many real-life applications (such as the example of documents and w-shingles earlier), the input vectors are very sparse, and thus exploiting the sparsity is crucial. Also observe that the fast Johnson-Lindenstrauss transform above (computing PHDx) cannot exploit the sparsity of x, as applying H using the fast Fourier transform takes time O(d lg d) regardless of the sparsity of Dx.

Kane and Nelson (2014) proposed the following way to choose the matrix C: Let m = Θ(ε^{-2} lg n) and let s = Θ(ε^{-1} lg n). Think of the m output dimensions as being partitioned into s chunks of m/s consecutive coordinates each. In each column of C, we choose one uniform random entry in each chunk and set it uniformly at random to either −1/√s or +1/√s. All other entries are set to 0. They proved that letting f(x) = Cx for each point x ∈ X also preserves all pairwise distances to within (1 ± ε) with high probability. The embedding time improves to O(s‖x‖_0) instead of O(m‖x‖_0), saving an ε^{-1} factor over the original approach.
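A sketch of the sparse construction just described (the chunked ±1/√s matrix); the concrete sizes are illustrative, and a real implementation would generate columns on the fly with a hash function rather than storing C densely:

```python
import numpy as np

rng = np.random.default_rng(1)

def sparse_matrix(d, m, s):
    """Build C: the m rows are split into s chunks; every column receives
    one entry of value +-1/sqrt(s) at a random row within each chunk."""
    assert m % s == 0
    chunk = m // s
    C = np.zeros((m, d))
    for j in range(d):
        rows = np.arange(s) * chunk + rng.integers(0, chunk, size=s)
        C[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return C

d, m, s = 1_000, 240, 24   # toy sizes; m ~ eps^-2 lg n and s ~ eps^-1 lg n
C = sparse_matrix(d, m, s)
x = rng.standard_normal(d)
fx = C @ x                 # with sparse storage this costs O(s * ||x||_0)
```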
Feature Hashing. Taking the sparse matrix approach to the extreme, Weinberger et al. (2009) proposed feature hashing. Their suggestion is to simply have one non-zero in each column of C, placed at a uniform random location and being uniform random among {−1, +1}. This of course gives the fastest possible embedding time of just O(‖x‖_0). The downside is that the algorithm only works with high probability if the difference between any two input vectors is spread out on many coordinates. More concretely, the current best analysis due to Dahlgaard et al. (2017) proves that feature hashing preserves all pairwise distances to within (1 ± ε) with high probability, as long as all pairwise differences x − y satisfy

‖x − y‖_∞ = O( √( ε lg(1/ε) / (lg n lg(mn)) ) · ‖x − y‖_2 ).

Here ‖x − y‖_∞ is the largest absolute value of a coordinate in x − y. Thus feature hashing works as long as, for any pairwise difference x − y, there is no single coordinate that is responsible for a large fraction of the distance between x and y. Feature hashing seems to work very well in practice and has been successfully applied in many applications.

Remarks In both feature hashing and the sparse matrix approach, we described the algorithms as first computing the matrix C and then computing Cx when embedding a point. In the introductory example with w-shingles, the input dimension d was 10^25, making it infeasible to store C explicitly. What one does instead is to use a hash function that allows you to compute the columns of C corresponding only to the non-zero coordinates of an input vector x. Thus C is never stored explicitly. We refer the reader to Kane and Nelson (2014) for how to implement a hash function for their sparse approach and to the paper by Dahlgaard et al. (2017) for suggestions and experiments on how to implement feature hashing in practice.

Examples of Application

Dimensionality reduction in the form presented in this chapter is of course only applicable in applications where Euclidean distance between data points is a natural similarity measure. For text documents, if we have a coordinate for each w-shingle (w consecutive words) counting how many times it occurs in the text, using Euclidean distance as a similarity measure makes good sense.

Dimensionality reduction may also serve as a preprocessing step in nearest neighbor search applications. In (Euclidean) nearest neighbor search, we must store a set of high-dimensional points X, such that when given a query point q, we can efficiently find the point x ∈ X that is closest (most similar) to q. An important property of all the above algorithms is that they are all based on sampling a random matrix independently of the data points. Thus if we are to store a set of high-dimensional points X for nearest neighbor search, what we can do instead is as follows: Sample a matrix A (using any of the distributions described earlier) and compute f(x) = Ax for each x ∈ X. Store the embedded data set in a data structure for nearest neighbor search (on m-dimensional points instead of d-dimensional points). When we wish to answer a query q (find the point closest to q), we first compute f(q) = Aq and then search for the nearest point f(x) to f(q). If this finds a point f(x) of distance a from f(q), we know that the original point x has distance at most (1 + ε)a from q and thus is also very close to q. The crucial point here is that A was chosen independently of q (and X), and thus one can simply apply the standard analysis on the point set X ∪ {q} to conclude that the distance from q to any point in X is preserved to within (1 ± ε) with high probability. For more information on nearest neighbor search in high dimensions, see, e.g., the survey Andoni and Indyk (2008).
Another exciting application of dimensionality reduction is in k-means clustering. In k-means clustering, we are given a set of n points X = {p_1, …, p_n} in d-dimensional space. The goal is to find k cluster centers μ_1, …, μ_k ∈ R^d such that

Σ_{p_i ∈ X} min_j ‖μ_j − p_i‖_2²

is minimized. That is, we have to find the set of k centers such that the sum of squared distances from each data point to its nearest center is minimized. Lloyd's algorithm and the k-means++ algorithm in Arthur and Vassilvitskii (2007) give efficient heuristics for finding good clusterings (finding the optimal clustering is known to be NP-hard). These algorithms have a running time of O(tknd), where t is the number of iterations of local improvements. Boutsidis et al. (2009) proposed that one first performs dimensionality reduction on the input data in order to speed up Lloyd's algorithm and the k-means++ algorithm. More concretely, they proved that if the embedded point set preserves all distances to within (1 ± ε), the costs of all clusterings are also preserved to within (1 ± ε). Thus one can run clustering algorithms on m = O(ε^{-2} lg n)-dimensional points and find almost as good clusterings as before. This reduces the running time from O(tknd) to O(tknm) plus the time for the embedding. The current state of the art in dimensionality reduction for k-means clustering can be found in Cohen et al. (2015), where bounds better than m = O(ε^{-2} lg n) can be obtained for small values of k.

Future Directions for Research

Feature Hashing. As discussed, the current analysis of feature hashing requires that no single coordinates are too large in any pairwise difference vector x − y. It remains an open problem whether the requirement

‖x − y‖_∞ = O( √( ε lg(1/ε) / (lg n lg(mn)) ) · ‖x − y‖_2 )

is necessary or if perhaps feature hashing works with even larger coordinates.

Fast Johnson-Lindenstrauss Transforms. We also studied the fast Johnson-Lindenstrauss transform and saw that it gave an embedding time of O(d lg d + m lg² n). Can this be improved to something closer to O(d + m)? Or maybe O((d + m) lg d)?
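The k-means speedup discussed earlier (cluster the embedded points, so that each Lloyd iteration costs O(knm) rather than O(knd)) can be sketched as follows; the data, sizes, and bare-bones Lloyd implementation are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)

def lloyd(P, k, iters=10):
    """A bare-bones Lloyd's algorithm with random initial centers."""
    centers = P[rng.choice(len(P), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        labels = np.argmin(((P[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = P[labels == c].mean(axis=0)
    return labels

# Toy data: k Gaussian clusters in d = 2000 dimensions.
d, k = 2_000, 3
X = np.concatenate([rng.standard_normal((100, d)) + 20.0 * rng.standard_normal(d)
                    for _ in range(k)])

# Cluster the embedded points instead of the original ones (m << d).
m = 50
A = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)
labels = lloyd(X @ A.T, k)
```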
References

Larsen KG, Nelson J (2017, to appear) Optimality of the Johnson-Lindenstrauss lemma. In: Proceedings of the 58th annual symposium on foundations of computer science (FOCS). See http://arxiv.org/abs/1609.02094
Nelson J (2015) Johnson-Lindenstrauss notes. http://people.seas.edu/~minilek/madalgo2015/index.html
Nelson J, Nguyen HL (2013) Sparsity lower bounds for dimensionality reducing maps. In: Proceedings of the 45th annual ACM symposium on theory of computing (STOC), pp 101–110
Weinberger K, Dasgupta A, Langford J, Smola A, Attenberg J (2009) Feature hashing for large scale multitask learning. In: Proceedings of the 26th annual international conference on machine learning (ICML), pp 1113–1120

Distant Supervision from Knowledge Graphs

Definitions

Overview

In this chapter, we discuss approaches leveraging distant supervision for relation extraction. We start by introducing the key ideas behind distant supervision as well as their main shortcomings. We then discuss approaches that improve over the basic method, including approaches based on the at-least-one principle along with their extensions for handling false negative labels, and approaches leveraging topic models. We also describe embeddings-based methods, including methods leveraging convolutional neural networks. Finally, we discuss how to take advantage of auxiliary information to improve relation extraction.

Basic Approach

The idea of using a knowledge graph as a source of labels for training data was first proposed
672 Distant Supervision from Knowledge Graphs
by Mintz et al. (2009) and relies on the following tools are used to identify potential entities from
assumption: the text.
The second step, entity matching, takes as
input the textual entities identified in the previous
Assumption 1 If two entities participate in a
step and tries to match them to one of the
relation, any sentence that contains those two
instances in the knowledge graph (e.g., it tries
entities might express that relation.
to match “William Tell” as found in the text
to its corresponding instance in Wikidata, e.g.,
For example, the following sentence:
instance number “Q30908” (see https://www.
South African entrepreneur Elon Musk is known
wikidata.org/wiki/Q30908)).
for founding Tesla Motors and SpaceX. In the third step, sentences where two entities
are correctly matched to the knowledge graph
mentions the tuple (Elon Musk, Tesla Motors). are processed to extract features. Two types of
Assuming that the triple (Elon Mask, created, features, lexical and syntactic, were originally
Tesla Motors) exists in the knowledge graph, proposed by Mintz et al. (2009). Similar fea-
the textual sentence is labeled with the relation and can be used as training data for subsequent relation extractions. In this context, the set of sentences sharing the same entity pair is typically called a bag of sentences.

The relation extraction task can be considered as a classification problem, where the goal is to predict, for every entity pair, the relation it participates in from multiple classes. The classifier needs training data where every entity pair is represented as a feature vector (a binary representation of the input) and labeled with a corresponding relation. After learning on the training data, the testing step is performed, where the classifier predicts the relation labels for previously unseen entity pairs.

The transformation of the text corpus C into a usable training set involves several steps illustrated in Fig. 1. The first step is a preprocessing step where classical natural language processing […] features were used in many subsequent approaches. Lexical features include the sequence of words between the two entities, flags indicating which entity name comes first in the sentence, as well as the words immediately to the left of the first entity and to the right of the second entity. In addition to these features, syntactic features can be taken from a dependency parser, which extracts syntactic relations between words, such as the dependency path between the two entities. Finally, the last step is called labeling, i.e., obtaining the relation labels corresponding to the entity pair from the knowledge graph.

The test set is obtained as above when generating training data, except for the labeling part, i.e., the first three steps (natural language processing, entity matching, and feature extraction) remain the same. After the training and test sets are constructed, a standard classifier can be applied for relation extraction.

Distant Supervision from Knowledge Graphs, Fig. 1 The pipeline of preparing training data with distant supervision
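The labeling step of this pipeline can be illustrated with a minimal sketch. The knowledge graph, sentences, and relation names below are invented for illustration, and entity matching is reduced to exact pre-extracted entity pairs:

```python
# A minimal sketch of distant-supervision labeling: sentences sharing an
# entity pair form a "bag", and each bag inherits the relation label that
# the knowledge graph stores for that pair. All data here is illustrative.

# Knowledge graph: (entity1, entity2) -> relation label
kg = {
    ("Elon Musk", "Tesla Motors"): "co-founder_Of",
    ("Elon Musk", "SpaceX"): "founder_Of",
}

# Sentences with their (already matched) entity mentions
sentences = [
    ("Elon Musk co-founded Tesla Motors in 2003.", "Elon Musk", "Tesla Motors"),
    ("Elon Musk spoke at a Tesla Motors event.", "Elon Musk", "Tesla Motors"),
    ("SpaceX was started by Elon Musk.", "Elon Musk", "SpaceX"),
]

def label_sentences(sentences, kg):
    """Group sentences by entity pair into bags and label each bag with
    the relation found in the knowledge graph."""
    bags = {}
    for text, e1, e2 in sentences:
        relation = kg.get((e1, e2))
        if relation is not None:            # only pairs present in the KG
            bags.setdefault((e1, e2, relation), []).append(text)
    return bags

bags = label_sentences(sentences, kg)
for (e1, e2, rel), texts in bags.items():
    print(e1, rel, e2, "->", len(texts), "sentence(s)")
```

Note that the second Tesla sentence is labeled co-founder_Of even though it does not express that relation, which is exactly the noise discussed in the following sections.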
In this context, Mintz et al. (2009) used a multi-class logistic classifier optimized using L-BFGS with Gaussian regularization.

Shortcomings of Distant Supervision

Automatically labeled training data are typically noisy, i.e., they contain false positives and false negatives, and this is also the case for distant supervision. On one hand, false negatives are mostly caused by the incompleteness of the knowledge graph. Two entities can be related in reality, but their relation might be missing in the knowledge graph; hence their mentions will be wrongly labeled as negative examples by distant supervision. On the other hand, two entities may appear in the same sentence because they are related to the same topic, but not necessarily because the sentence is expressing the relation. Such examples might yield false positives under distant supervision.

Distant Supervision Improvements

This section lists several extensions aimed at improving the basic distant supervision method.

At-Least-One Principle
Riedel et al. (2010) replace Assumption 1 with the following:

Assumption 2 If two entities participate in a relation, at least one sentence that mentions those two entities will express that relation.

This relaxed assumption, while more intuitive, comes at the cost of a more challenging classification problem. To tackle this problem, the authors propose an undirected graphical model that solves both the task of predicting a relation between two entities and the task of predicting which sentence expresses this relation. The model is developed on top of a factor graph, a popular probabilistic graphical network representation where an undirected bipartite graph connects relation variables with factors representing their joint probabilities.

The original model cannot capture the case when two entities participate in more than one relation. In our above example, the relation between Elon Musk and Tesla Motors is not only co-founder_Of but also CEO_Of. Hoffmann et al. (2011) and Surdeanu et al. (2012) propose undirected graphical models, called MultiR and MIML RE, respectively, to perform multi-instance multi-label classification. Both models infer a relation expressed by a particular mention; thus the models are able to predict more than one relation between the two entities. The MultiR model is a conditional probability model that learns a joint distribution of both mention-level and bag-level assignments. The MIML RE model contains two layers of classifiers. The first-level classifier assigns a label to a particular mention. The second level has k binary classifiers, where k is the number of known relation labels for the particular pair of entities. Each bag-level classifier y_i decides if the relation r_i holds for the given entity pair, using the mention-level classifications as input.

Negative Labels
In distant supervision, the knowledge graph only provides positive labels. Thus, negative training data is produced synthetically, e.g., by sampling the sentences containing two entities which are not related in the knowledge graph. The standard method to generate negative training data is to label mentions of unrelated entity pairs as negative samples. However, since the knowledge graph is incomplete, the absence of a relation label does not necessarily mean that the corresponding entities are not related. Hence, such a method potentially brings additional errors to the training data.

Several extensions of this model are proposed to overcome this problem. Min et al. (2013) propose an additional layer to MIML RE in order to model the bag-level label noise. Ritter et al. (2013b) extend the MultiR model to handle missing data. Xu et al. (2013) explore ideas to enhance the knowledge graph, i.e., to infer entity pairs that are likely to participate in a relation even if they are unlabeled in the knowledge graph, and add them as positive samples.
Fan et al. (2014) propose an alternative approach to tackle the problems of both erroneous labeling and incompleteness of the knowledge graph. They formulate the relation extraction task as a matrix completion problem.

Topic Models
Another widely used approach to improve text analysis and text classification is topic models. In this context, topics represent clusters of terms (words or patterns) which often co-occur in the documents together. The goal of topic models is to assign a topic to every term.

Topic models can be applied to distant supervision by mapping a document to a sentence mentioning an entity pair and a topic to a relation. Words are represented by lexical and syntactic features, such as POS tags or dependency paths between the entities. While the mention-level classifiers described above assign a relation to every sentence individually, the topic model classifiers are able to capture more general dependencies between textual patterns and relations, which in practice can lead to improved performance.

Yao et al. (2011) propose a series of generative probabilistic models in that context. Though the presented models are designed for unsupervised, open relation extraction (i.e., the set of the target relations is not prespecified), the authors also show how the detected clusters can improve distantly supervised relation extraction. All their proposed models are based on latent Dirichlet allocation (LDA). The models differ in the set of features used and thus in their ability to cluster patterns. The advantage of generative topic models is that they are able to "transfer" information from known patterns to unseen patterns, i.e., to associate a relation expressed by an already observed pattern with a new pattern.

Alfonseca et al. (2012) distinguish patterns that express the relation from the ones that are not relation-specific and can be used across relations. For instance, the pattern "was born in" is relation-specific, while the pattern "lived in" can be used across different relations, e.g., mayorOf or placeOfDeath. Their proposed models are also based on LDA. Different submodels have been suggested to capture three subsets of patterns:

• general patterns that appear for all relations;
• patterns that are specific for entity pairs and not generalizable across relations;
• patterns that are observed across most pairs with the same relation (i.e., relation-specific patterns).

Embedding-Based Methods

The methods discussed above are based on a large variety of lexical and syntactic features. We discuss below approaches that map the textual representation of relations and entity pairs onto a vector space. A mapping from discrete objects, such as words, to vectors of real numbers is called an embedding in this context. Embedding-based approaches for relation extraction do not require extensive feature engineering and natural language processing, but they still require NER tags to link entities in the knowledge graph with their textual mentions.

Riedel et al. (2013a) apply matrix factorization to the relation extraction task. The authors compute a low-rank factorization of the probability matrix M, where rows correspond to entity pairs and columns correspond to relations. The elements of the matrix correspond to the probability that a relation holds between two entities. This factorization provides an embedding of entity pairs and of relations, which is then used to predict the probability of new triplets using a logistic function and one of the following feature models:

1. The latent feature model (F) defines θ_{r,t} as a measure of compatibility between relation r and tuple t via the dot product of their embeddings a_r and v_t, respectively.
2. The neighborhood model (N) defines θ_{r,t} as a set of weights w_{r,r'}, corresponding to a directed association strength between relations r and r', where both relations are observed for tuple t.
3. The entity model (E) also measures a compatibility between tuple t and relation r, but it provides latent feature vectors for every entity and every argument of relation r and thus can be used for n-ary relations. Entity types are implicitly embedded into the entity representation.
4. The combined model (NFE) defines θ_{r,t} as a sum of the parameters defined by the three above models.

Neural Networks
Another direction in text classification is applying convolutional neural networks (CNNs). CNNs are able to find specific n-grams in the text and classify the relations expressed in the sentences based on these n-grams.

Zeng et al. (2015) perform relation extraction via piecewise convolutional neural networks (PCNNs). The learning procedure of PCNNs is similar to standard CNNs. It consists of four parts: vector representation, convolution, piecewise max pooling, and softmax output. In contrast to the embedding-based approaches above, the proposed model does not build word embeddings itself but uses pre-trained word vectors instead. Additionally, the model encodes the position of a word in the sentence with position features introduced by Zeng et al. (2014). Position features represent the relative distance between the current word in the sentence and both entities of interest e1 and e2. The vector representation of a word is then the concatenation of the word embedding and a position feature. In their work, the authors consider that the model predicts a relation for a bag (i.e., for an entity pair) if and only if a positive label is assigned to at least one entity mention.

Lin et al. (2016) extend the PCNN approach. The authors explore different methods to overcome the wrong labeling problem. More specifically, the model learns weights of the sentences in a bag in order to select sentences that are true relation mentions. Two ways of defining the weights are explored:

• Average, where every sentence has the same weight;
• Selective Attention, where the sentence weight depends on a query-based function which scores how well the sentence expresses a given relation.

Leveraging Auxiliary Information for Supervision

A number of distant supervision improvements use additional knowledge to enhance the model. Angeli et al. (2014) and Pershina et al. (2014) study the impact of extending training data with manually annotated data, which is, generally speaking, more reliable than the labels obtained from the knowledge graph. Pershina et al. (2014) propose to perform feature selection to generalize human-labeled data into training guidelines and include them into the MIML RE model. Angeli et al. (2014) propose to initialize the MIML RE model with manually annotated data, since the original model is sensitive to initialization. As manual annotation is very expensive, carefully selecting which sentences to annotate is of utmost importance. The authors study several criteria to pick the most valuable sentences for manual annotation.

Another explored direction is improving entity identification, i.e., extracting sentences mentioning entities not only by their canonical names but also by abbreviated mentions, acronyms, paraphrases, and even pronouns. Given the following sentences:

Elon Musk is trying to redefine transportation on earth and in space.
Through Tesla Motors – of which he is cofounder, CEO and Chairman – he is aiming to bring fully-electric vehicles to the mass market.

The second sentence contains a relation mention (Elon Musk, cofounderOf, Tesla Motors) that cannot be extracted without co-reference resolution, which refers to the task of clustering mentions of the same entity together, typically within a single sentence or document.

Augenstein et al. (2016) explore a variety of strategies for making entity identification tools more robust across domains. Moreover, they perform co-reference resolution for extracting relations across sentence boundaries. The approach is designed to perform relation extraction on very heterogeneous text corpora, such as those extracted from the Web.

Koch et al. (2014) use entity types and co-reference resolution for a more accurate relation extraction. Two sets of entity types are available: coarse types (PERSON, LOCATION, ORGANIZATION, and MISC) come from the named entity recognizer (NER), while fine-grained types come from the knowledge graph. To bring type-awareness into the system, the authors build separate relation extractors on top of the MultiR model for each pair of coarse types, e.g., (PERSON, PERSON), (PERSON, LOCATION), and combine the extractions from the extractors of every type signature. Candidate mentions with incompatible entity types are then discarded, which improves precision at the cost of a slight drop in recall.

Chang et al. (2014) propose a tensor decomposition approach that also uses entity-type information and can be stacked with other models. In particular, it helps to overcome the fact that the matrix factorization model does not share information between rows, i.e., between entity pairs (including pairs containing the same entity). Type-LDA is a topic model proposed by Yao et al. (2011) that considers fine-grained entity types based on latent Dirichlet allocation. In contrast to the previous approaches, it learns fine-grained entity types directly from the text.

Finally, a few relation extraction approaches integrate logical connections or constraints between the relations. As an example, the relation capitalOf between two entities directly implies another relation cityOf. Furthermore, many relation pairs are mutually exclusive, e.g., spouseOf and parentOf.

Rocktäschel et al. (2015) propose to inject logical formulae into the relation and entity pair embeddings. Their model is based on the matrix factorization approach from Riedel et al. (2013a). They explore two methods, namely, pre-factorization inference and joint optimization, and choose to focus on direct relation implication.

The approach proposed by Han and Sun (2016) is also based on the idea of using indirect supervision. The authors rely on a Markov logic network as a representation language and use:

• the consistency of relation labels, i.e., the interdependencies between relations that were mentioned previously (implications and mutual exclusion);
• the consistency between relation and arguments, i.e., entity-type information retrieved from the knowledge graph is used to filter inconsistent candidates;
• the consistency between neighboring instances: the authors define the similarity function between two relation candidates as the cosine similarity of their feature vectors.

Cross-References
Deep Learning on Big Data
Knowledge Graph Embeddings

References
Alfonseca E, Filippova K, Delort JY, Garrido G (2012) Pattern learning for relation extraction with a hierarchical topic model. In: Proceedings of the 50th annual meeting of the association for computational linguistics: short papers-volume 2. Association for Computational Linguistics, pp 54–59
Angeli G, Tibshirani J, Wu J, Manning CD (2014) Combining distant and partial supervision for relation extraction. In: EMNLP, pp 1556–1567
Augenstein I, Maynard D, Ciravegna F (2016) Distantly supervised web relation extraction for knowledge base population. Semantic Web 7(4):335–349
Chang KW, Yih SWt, Yang B, Meek C (2014) Typed tensor decomposition of knowledge bases for relation extraction. In: Proceedings of the 2014 conference on empirical methods in natural language processing, pp 1568–1579
Fan M, Zhao D, Zhou Q, Liu Z, Zheng TF, Chang EY (2014) Distant supervision for relation extraction with matrix completion. In: ACL (1), Citeseer, pp 839–849
Han X, Sun L (2016) Global distant supervision for relation extraction. In: Thirtieth AAAI conference on artificial intelligence
Hoffmann R, Zhang C, Ling X, Zettlemoyer L, Weld DS (2011) Knowledge-based weak supervision for information extraction of overlapping relations. In:
Distributed File Systems 677
Distributed File Systems, Table 1 Types of transparencies

Transparency – Description
Access – Clients can access the files in a DFS in the same way as in a local file system
Location – Clients are unaware of the physical location of files
Concurrency – Clients have the same view of the state of the file system
Replication – Clients are unaware of the replication of files across multiple servers
Migration – Clients are unaware of the movement of files across multiple servers

bring convenience to sharing files in both local area networks (LANs) and wide area networks (WANs).

The convenience is mainly brought by the technique of transparency. Transparency lets files be accessed, stored, and managed as if on the local client machines, while the processing is actually carried out on the servers. The DFS thus implements its access control and storage management for the client machines efficiently in a centralized way via the network file system, and one or more central servers store files that can be accessed by any client in the network. In other words, a DFS is a distributed implementation of the classical time-sharing model of a file system with multiple, independent clients dispersed across the whole network, and it makes it possible to restrict access to the file system depending on access lists or capabilities on both the servers and the clients.

The main goal of transparency in a DFS is to hide from the clients the fact that processes and resources are physically distributed across the network and to provide a common view of a centralized file system. That is, a DFS aims to be "invisible" to clients, who consider the DFS to be similar to a local file system. The different types of transparency are listed in Table 1.

Key Points of Reliability

Reliability is of special significance in DFSes because the usage site of files can be different from their storage site, so failure modes are substantially more complex in DFSes compared to local file systems. Replication and erasure coding are two typical techniques to achieve high reliability, and the following are key research findings based on these two reliability schemes.

Replication Policy
DFSes use replication by making several copies of a data file on different servers. When a client requests the file, it transparently accesses one of the copies.

Replica placement plays a critical role in both the performance and reliability of DFSes. Replicas are stored on different servers according to a placement scheme. Many DFSes (e.g., HDFS (Shvachko et al. 2010), GFS (Ghemawat et al. 2003)) use random replication by default, in which replicas are stored randomly on different nodes, on different racks, at different geographical locations, so that if a fault occurs anywhere in the system, data is still available. However, random replication cannot effectively separate the cluster into tiers while maintaining cluster durability and thus is quite susceptible to frequent data loss due to correlated failures. The Facebook HDFS implementation (Borthakur et al. 2011) limits the placement of block replicas to smaller groups of nodes, thereby reducing the block loss probability under multiple node failures. Nonrandom replication schemes, like Copyset Replication (Cidon et al. 2013), improve on the Facebook scheme by restricting the replicas to a minimal number of sets for a given scatter width. Scarlett (Ananthanarayanan et al. 2011) carefully stores replicas based on workload patterns to alleviate hotspots. Sinbad (Chowdhury et al. 2013) improves write performance by identifying the variance of link capacities in a DFS and avoiding storing replicas on nodes with congested links. Also, some studies address efficient transitions between replication and erasure coding schemes (called asynchronous encoding) in DFSes. AutoRAID (Wilkes 1996) leverages access patterns to switch between replication for hot data and RAID-5 for cold data.
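The rack-aware random placement used by default in systems like HDFS can be sketched as follows. The cluster layout and node names are invented, and the policy is a simplified stand-in, not the actual HDFS implementation:

```python
import random

# Toy sketch of rack-aware random replica placement, in the spirit of the
# HDFS/GFS defaults described above: replicas of a block go to distinct
# nodes, and never all on the same rack, so a single rack failure cannot
# lose every copy. The cluster layout is invented.

cluster = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
    "rack3": ["n7", "n8", "n9"],
}

def place_replicas(cluster, replicas=3, seed=0):
    """Pick one node on a first rack, then the remaining replicas on
    distinct nodes of a second rack."""
    rng = random.Random(seed)
    first_rack, second_rack = rng.sample(sorted(cluster), 2)
    placement = [rng.choice(cluster[first_rack])]
    placement += rng.sample(cluster[second_rack], replicas - 1)
    return placement

print(place_replicas(cluster))
```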
DiskReduce (Fan et al. 2009) addresses the transition from replication to erasure coding in HDFS. EAR (Li et al. 2015) focuses on how replica placement affects the performance and reliability of asynchronous encoding.

In addition, replication disperses multiple copies of a file, and changes have to be propagated to all the replicas in a consistent manner. This is especially an issue of cache consistency, that is, how to update all copies of data in a cache when one of them is modified. The first method to ensure cache consistency is write once read many (WORM) (Quinlan 1991), in which cached files are in read-only mode (i.e., once a file is created, it cannot be modified) such that only the latest version of the data can be read. The second method is transactional locking (Dice et al. 2006), in which a read lock is obtained on the requested data file such that other clients cannot write this file, or a write lock is obtained to prevent any access to this file, ensuring that each read reflects the latest write and each write is done in order. The third method is leasing (Gray and Cheriton 1989), in which the client is guaranteed that no other client can modify the data when the data file is requested during the lease. When the lease expires or the client releases its rights, the file is available again.

Erasure Coding
DFSes typically adopt replication as the fault tolerance scheme, while more and more studies propose and implement efficient erasure codes in DFSes for lower storage overhead. Zhang et al. (2010) apply erasure coding on the write path of HDFS to study the performance impact on various MapReduce (Dean and Ghemawat 2013) workloads. Silberstein et al. (2014) propose lazy recovery for erasure-coded storage to reduce repair bandwidth. Li et al. (2014) also propose degraded-first scheduling to improve the scheduling of MapReduce jobs running on erasure-coded storage. HACFS (Xia et al. 2015), built on HDFS, dynamically switches encoded blocks between two erasure codes to balance storage overhead and repair performance. Repair pipelining (Li et al. 2017) speeds up repair performance as a middleware system on HDFS. Many enterprises (e.g., Google (Ghemawat et al. 2003), Azure (Huang et al. 2012), and Facebook (Muralidhar et al. 2014)) have also deployed erasure coding in production DFSes to reduce storage overhead. Piggybacked-RS codes (Rashmi et al. 2013, 2014) embed parities of one stripe into those of the following stripe, thereby reducing repair bandwidth, and have also been evaluated in Facebook clusters. Facebook f4 (Muralidhar et al. 2014) protects against failures at different levels, including disks, nodes, and racks, by manipulating different redundancies of erasure coding.

Some studies propose new erasure code constructions and evaluate their applicability in DFSes. Locally repairable codes (LRC) are a new family of erasure codes that reduce I/O during recovery by limiting the number of surviving nodes to be accessed. Azure's LRC (Huang et al. 2012) and Facebook's LRC (Sathiamoorthy et al. 2013) present and design a novel family of LRCs that are efficiently repairable based on a trade-off between locality and minimum distance.

Regenerating codes (Dimakis et al. 2007) are a state-of-the-art family of erasure codes that achieve the optimal trade-off between storage redundancy and repair traffic in DFSes. Constructions of regenerating codes have been proposed and implemented in DFSes, such as FMSR codes (Hu et al. 2012, 2017a), PM-RBT codes (Rashmi et al. 2015), and Butterfly codes (Pamies-Juarez et al. 2016). CORE (Li et al. 2015) extends regenerating codes to support the optimal recovery of multi-node failures atop HDFS. DoubleR (Hu et al. 2017b) proposes and implements new regenerating code constructions for hierarchical data centers atop HDFS.

Examples of Application

DFSes are the foundation of storage mechanisms for big data applications. Google GFS (Ghemawat et al. 2003) is an expandable DFS supporting large-scale and data-intensive applications, using commodity servers to provide data reliability and high-performance services. In addition, there are other solutions that address different demands for big data storage. For example, HDFS (Shvachko et al. 2010) is an open-source derivative of GFS.
Microsoft uses Cosmos (Chaiken et al. 2008) to support its search and advertisement business. Facebook uses Haystack (Beaver et al. 2010) to store a large number of photos.

DFSes have become relatively mature after years of development, and it is hard to make a complete study given the number of existing DFSes. This section chooses to study two popular DFSes, HDFS (Shvachko et al. 2010) and Ceph (Weil et al. 2006a), from four aspects: architecture, client access, replication, and fault detection.

HDFS

Architecture: An HDFS cluster consists of a single namenode, which is the master node that manages the file system namespace and regulates the access to files by clients. The data stream is divided into blocks distributed across datanodes, which manage the storage attached to the nodes that they run on. HDFS provides a secondary namenode as a persistent copy of the namenode to handle namenode failures by restoring the namespace from the secondary namenode. The namenode executes file system operations like opening, closing, and renaming files and directories. The datanodes serve clients' read or write requests and also perform block creation, deletion, and replication as instructed by the namenode.

Client Access: HDFS uses a code library that allows clients to read, write, and delete files and to create and delete directories. Clients use a file's path in the namespace to access it, contact the namenode to get or put the file's blocks, and do the transfer by connecting directly with the datanodes.

Replication Policy: HDFS divides data into blocks which are replicated across datanodes according to a placement policy, by which each datanode holds at most one copy of a block and each rack holds at most two copies of the block. The namenode verifies that each block follows the above policy: (1) when a block is over-replicated, the namenode removes a replica on the datanode that has the least amount of available space and tries not to reduce the number of racks; (2) when a block is under-replicated, the namenode creates a new replica and places it in a priority queue which is checked periodically; and (3) when all replicas of a block are on the same rack, the namenode will add a new replica on a different rack to trigger the over-replication policy.

Fault Management: Each datanode compares its software version and namespace ID with those of the namenode. If they don't match, the datanode shuts down to preserve the integrity of the system. Datanodes register with the namenode whenever they restart with a different IP address or port. After this registration, and every hour, datanodes report a current view of the location of each block to the namenode. Every 3 s, datanodes send heartbeats to the namenode, and the namenode considers datanodes as failed if it does not receive any heartbeats for 10 min.

Ceph

Architecture: Ceph uses metadata servers (MDSs) to provide dynamic distributed metadata management by performing metadata queries and managing the namespace. Ceph stores data and metadata in object storage devices (OSDs), which mainly perform I/O operations. Ceph stripes data into blocks which are assigned to objects, which are identified by the same inode number plus an object number and placed on storage devices using a function called CRUSH (Weil et al. 2006b).

Client Access: Ceph uses MDSs to get the inode number and the file size for client requests. Then, the client can calculate how many objects comprise the file and contacts the OSDs to get the data via CRUSH.

Replication Policy: Ceph stores data as objects which are put in different placement groups (PGs) and then stored on OSDs, which are chosen by CRUSH.
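The key property of CRUSH-style placement, namely that any client can compute an object's location deterministically without asking a central server, can be sketched with a toy stand-in. This is not the real CRUSH algorithm; the hash-based PG mapping, OSD names, and parameters are invented:

```python
import hashlib

# A toy stand-in for CRUSH-style placement (not the actual CRUSH
# algorithm): an object id hashes to a placement group (PG), and the PG
# maps deterministically to a set of OSDs, so every client computes the
# same location independently. All parameters are invented.

NUM_PGS = 8
OSDS = ["osd0", "osd1", "osd2", "osd3", "osd4"]

def pg_of(object_id: str) -> int:
    """Hash an object id to a placement group."""
    digest = hashlib.sha256(object_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PGS

def osds_of(pg: int, replicas: int = 3):
    """Map a PG to `replicas` distinct OSDs by walking the OSD list."""
    start = pg % len(OSDS)
    return [OSDS[(start + i) % len(OSDS)] for i in range(replicas)]

obj = "inode42.object7"  # inode number plus object number, as above
pg = pg_of(obj)
print(obj, "-> pg", pg, "->", osds_of(pg))
```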
Ceph has three synchronous replication strategies, primary-copy, chain, and splay replication: (1) primary-copy replication updates all replicas in parallel; (2) chain replication updates replicas in series; and (3) splay replication combines the parallel updates of primary-copy replication with the serial updates of chain replication.

Fault Management: OSDs periodically exchange heartbeats to detect failures. Ceph has monitors that implement the management of the cluster map. Each OSD can send a failure report to any monitor. When OSDs fail, monitors detect the failures and maintain a valid cluster map.

Future Directions for Research

Although DFSes (e.g., HDFS) have become a mainstay file system platform for big data analytics, there are still improvements to be made. First, many DFSes (e.g., GFS and HDFS) were originally designed for large files and cannot meet the storage requirements of lots of small files, which leads to the production of a large amount of metadata, affects metadata server management and restorability, and degrades the entire system performance. Second, many DFSes keep multiple copies (six copies in many cases) due to the need for data locality in maintaining performance, which makes already "big" data even harder to store. Third, many DFSes offer very limited SQL support and lack basic SQL functions which are actually useful for big data analytics. Finally, some DFSes for big data storage organize thousands or even hundreds of thousands of servers, which incurs substantial energy consumption, and thus the deployment of a DFS should take energy efficiency into account.

Cross-References
Hadoop

References
Ananthanarayanan G, Agarwal S, Kandula S, Greenberg A, Stoica I, Harlan D, Harris E (2011) Scarlett: coping with skewed content popularity in MapReduce clusters. In: Proceedings of the sixth European conference on computer systems (EUROSYS 2011), Salzburg, pp 287–300
Beaver D, Kumar S, Li HC, Sobel J, Vajgel P (2010) Finding a needle in haystack: Facebook's photo storage. In: Usenix conference on operating systems design and implementation, pp 47–60
Borthakur D, Gray J, Sarma JS, Muthukkaruppan K, Spiegelberg N, Kuang H, Ranganathan K, Molkov D, Menon A, Rash S (2011) Apache Hadoop goes realtime at Facebook. In: ACM SIGMOD international conference on management of data (SIGMOD 2011), Athens, pp 1071–1080
Chaiken R, Jenkins B, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endow 1(2):1265–1276
Chowdhury M, Kandula S, Stoica I (2013) Leveraging endpoint flexibility in data-intensive clusters. ACM SIGCOMM Comput Commun Rev 43(4):231–242
Cidon A, Rumble SM, Stutsman R, Katti S, Ousterhout J, Rosenblum M (2013) Copysets: reducing the frequency of data loss in cloud storage. In: Usenix annual technical conference, pp 37–48
Dean J, Ghemawat S (2013) MapReduce: simplified data processing on large clusters. In: Proceedings of operating systems design and implementation (OSDI) 51(1):107–113
Dice D, Shalev O, Shavit N (2006) Transactional locking II. In: International conference on distributed computing, pp 194–208
Dimakis AG, Godfrey PB, Wainwright MJ, Ramchandran K (2007) Network coding for distributed storage systems. In: IEEE international conference on computer communications (INFOCOM 2007). IEEE, pp 2000–2008
Fan B, Tantisiriroj W, Xiao L, Gibson G (2009) Diskreduce: raid for data-intensive scalable computing. In: The workshop on Petascale data storage, pp 6–10
Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. ACM SIGOPS Oper Syst Rev 37(5):29–43
Gray C, Cheriton D (1989) Leases: an efficient fault-tolerant mechanism for distributed file cache consistency. ACM SIGOPS Oper Syst Rev 23(23):202–210
Hu Y, Chen HCH, Lee PPC, Tang Y (2012) Nccloud: applying network coding for the storage repair in a cloud-of-clouds. In: Usenix conference on file and storage technologies, pp 12–19
Hu Y, Lee PPC, Shum KW, Zhou P (2017a) Proxy-assisted regenerating codes with uncoded repair for distributed storage systems. IEEE Trans Inf Theory PP(99):1–17
Hu Y, Li X, Zhang M, Lee PPC, Zhang X, Zhou P, Feng D (2017b) Optimal repair layering for erasure-coded data centers: from theory to practice. IEEE Trans Storage PP(99):1–26
Huang C, Simitci H, Xu Y, Ogus A, Calder B, Gopalan P, Li J, Yekhanin S (2012) Erasure coding in windows azure storage. In: Usenix annual technical conference, pp 2–13
Li R, Lee PPC, Hu Y (2014) Degraded-first scheduling for MapReduce in erasure-coded storage clusters. In:
maintenance trigger handling a set of insertions ΔR into R is (the schemas are omitted for clarity):

ON UPDATE R BY ΔR:
  M_Q += Sum_[](ΔR ⋈ S ⋈ T)
  R += ΔR

This trigger first increments the count by computing the delta query for updates to R and then updates the contents of R. In practice, ΔR is often much smaller than R, making incremental computation cheaper (faster) than reevaluation.

Data Model. Storing materialized views as generalized multiset relations (Koch 2010; Green et al. 2007) greatly simplifies the view maintenance task. Under this data model, each relation is treated as a map (dictionary) containing a finite number of unique tuples (keys) mapped to a non-zero payload (value). Such relations can uniformly represent insertions and deletions as tuples mapping to positive and, respectively, negative payloads (multiplicities). This unified representation admits simpler, more efficient view maintenance where insertions and deletions are commutative (i.e., applicable to the database in any order) and follow the same set of delta rules, which contrasts with view maintenance under classical set and bag (multiset) semantics (Qian and Wiederhold 1991; Griffin and Libkin 1995).

This unified model also allows using tuple payloads for storing (potentially non-integer) aggregate values (e.g., SUM, AVG). Traditional SQL represents these aggregates as separate columns in the query result, while this model keeps them in the payload to facilitate incremental processing – updating aggregate values means changing only tuple payloads rather than deleting and inserting tuples from the result.

Query Language. The query language for generalized multiset relations consists of union +, natural join ⋈, selection σ, and grouping sum aggregation Sum_[A;f]. These operations naturally generalize those on multiset relations: for instance, Q1 ⋈ Q2 matches tuples of Q1 with tuples of Q2 on their common columns, multiplying their payloads (multiplicities); the SQL equivalent of Sum_[A;f] R is SELECT A, SUM(f) FROM R GROUP BY A, except that the aggregate value is stored inside the payload of the group-by tuples; for clarity, Sum_A R is used when f = 1. This fragment of the language suffices to express flat queries with sum aggregates; Koch et al. (2014) further extend this language to support queries with nested aggregates.

Delta Queries. Delta queries capture changes in query results for updates to base relations. For any query expression Q expressed as an operator tree, its delta query ΔQ is constructed by applying delta derivation rules to the tree in a bottom-up fashion. If Q represents a relation R, then ΔQ = ΔR if there are updates to R and ΔQ = 0 otherwise. The delta rules for the other operators are:

  Δ(Q1 + Q2) := ΔQ1 + ΔQ2
  Δ(Q1 ⋈ Q2) := (ΔQ1 ⋈ Q2) + (Q1 ⋈ ΔQ2) + (ΔQ1 ⋈ ΔQ2)
  Δ(σ(Q)) := σ(ΔQ)
  Δ(Sum_[A;f] Q) := Sum_[A;f](ΔQ)

Previous work (Koch 2010; Koch et al. 2014) studies these rules in detail.

Higher-Order IVM. Incremental view maintenance is often cheaper than naïve reevaluation but is not free. Computing deltas can be expensive, like in Example 1, where ΔQ is a nontrivial join of one (small) input update and two (large) base tables. To speed up the delta evaluation, higher-order IVM materializes and recursively maintains update-independent parts of delta queries. This maintenance strategy creates a hierarchy of increasingly simpler materialized views that support each other's incremental maintenance.

Distributed Incremental View Maintenance, Fig. 1 View hierarchy for the query in Example 1. Delta propagations for updates to R (left, blue) and to T (right, red)

Example 2 Consider the count query from Example 1 and the hierarchy of materialized views shown in Fig. 1. This IVM strategy materializes the top-level view M_Q along with four additional views M_R, M_S, M_T, and M_ST in order to speed up the maintenance task: for instance, updates to R and T exploit the precomputed partial counts in M_ST and, respectively, M_S and M_R for faster delta propagation. This approach is an instance of
higher-order IVM since an update to one relation may cause the maintenance of several views. The triggers for updates to R and T look as:

ON UPDATE R BY ΔR:
  ΔM_R := Sum_[A](ΔR)
  ΔM_Q := Sum_[](ΔM_R ⋈ M_ST)
  M_R += ΔM_R
  M_Q += ΔM_Q

ON UPDATE T BY ΔT:
  ΔM_T := Sum_[A,C](ΔT)
  ΔM_ST := Sum_[A](M_S ⋈ ΔM_T)
  ΔM_Q := Sum_[](M_R ⋈ ΔM_ST)
  M_T += ΔM_T
  M_ST += ΔM_ST
  M_Q += ΔM_Q

The trigger for updates to S is similar. For this example and this maintenance strategy, processing one update tuple takes a constant amount of work per statement: for instance, the join operation in the R trigger requires only one lookup in M_ST for each tuple of ΔM_R. Processing single-tuple updates in constant time is impossible to achieve in this example using classical incremental view maintenance, which evaluates ΔQ = ΔR ⋈ S ⋈ T, or recomputation, which evaluates Q = (R ∪ ΔR) ⋈ S ⋈ T. In both cases, the maintenance time is linear in the size of the database.

View maintenance code, like that of Examples 1 and 2, consists of trigger statements, which are executed upon changes in the underlying database. The discussion so far has assumed that the database and code are stored and, respectively, run on a single machine.

Distributed IVM

This section describes how to transform local maintenance code into maintenance programs optimized for running on large-scale processing platforms. The presented approach targets systems with synchronous execution, such as Spark (Zaharia et al. 2012) and Hadoop MapReduce (Dean and Ghemawat 2008), where one driver node orchestrates job execution among workers. Processing one batch of updates may require several computation stages, where each stage runs view maintenance code in parallel. All workers are stateful, preserve data between stages, and participate in every stage.

Materialized views are classified depending on the location of their contents. Local views are stored and maintained entirely on the driver node. They are suitable for materializing aggregates with small output domains. Distributed views have their contents spread over all workers to balance CPU and memory pressure. Each distributed view has an associated partitioning function that maps each tuple to a non-empty set of nodes storing its replicas.

Compilation Overview
Figure 2 shows the process of transforming a local incremental program into a functionally equivalent, distributed program. The input consists of maintenance statements expressed using the aforementioned query language. To distribute their execution, the compiler relies on the partitioning information about each materialized view, which is provided by the user.

The compilation process consists of three phases. The first phase annotates a given input program with partitioning information and, if necessary, introduces operators for exchanging data among distributed nodes in order to preserve the correctness of query evaluation. The second phase optimizes the annotated program using a set of simplification rules and heuristics that aim to minimize the number of jobs necessary for processing one update batch. The final phase uses code generation to specialize incremental programs for running on a specific processing platform.

Each query result carries a location tag: (1) the Local tag means the result is stored on the driver node, (2) the Dist(P) tag denotes the result is distributed among all workers according to partitioning function P, and (3) the Random tag means the result is randomly distributed among all workers. One relation or materialized view can take any of these tags. A query Q with a location tag T is written as Q^T.

To support distributed execution, the query language needs new operators for manipulating location tags and exchanging data over the network. These operators are called location transformers:

• The Repart transformer reshuffles the distributed result of query Q using a partitioning function P:

  Repart_P(Q^T) = Q^Dist(P)
Gather works with distributed expressions, while Scatter works with local expressions.

Location-Aware Query Operators. The semantics of the query operators is now extended to support location tags.

Example 3 Consider the following statement obtained from the trigger for updates to T in Example 2.

  M_ST(A) += Sum_[A](M_S(A,B) ⋈ ΔM_T(A,B))

The location tag of the expressions from both sides of one statement determines where and how that statement is executed. Statements targeting local materialized views are, as expected, executed at the driver in local mode. Statements involving distributed views run in distributed mode, where the driver initiates the computation. All location transformations run in communication mode, but materializing the contents to be reshuffled can happen in both local and distributed mode. For instance, preparing contents for Scatter takes place on the driver, while for Repart and Gather it happens on every worker.

In single transformer form, each location transformer, if present, always references one materialized view. This normalization procedure repeats the following three steps: (1) it materializes the contents being transformed (if not already materialized), (2) it extracts the location transformers targeting the materialized contents into separate statements, and (3) it updates the affected statements with new references. The optimizer normalizes each statement in a bottom-up traversal of its expression tree.

Example 4 Consider the maintenance trigger from Example 2 and assume that R is randomly distributed among workers, M_R and M_ST are partitioned on A, and M_Q is stored at the driver node. The distributed trigger for updates to R consists of statements annotated with location tags in single transformer form:

ON UPDATE R BY ΔR:
  1  ΔM_R^Random := Sum_[A](ΔR^Random)^Random
  2  ΔM_R^[A] := Repart_[A](ΔM_R^Random)^[A]
  3  ΔM_Q^Random := Sum_[](ΔM_R^[A] ⋈ M_ST^[A])^Random
  4  ΔM_Q^Local := Gather(ΔM_Q^Random)^Local
  5  M_R^[A] += ΔM_R^[A]
  6  M_Q^Local += ΔM_Q^Local

Statement 1 pre-aggregates in parallel the update batch to minimize the amount of data being reshuffled by the subsequent statement. Statement 3 performs a distributed join followed by an aggregation that computes a partial count at each worker node. Statement 4 sums these partial counts at the driver. The remaining statements maintain the distributed view M_R and the local view M_Q.

The single transformer form sets clear boundaries around the contents that need to be communicated, which eases the implementation of further optimizations. Common subexpression elimination and dead code elimination can detect shared parts among statements and eliminate redundant network transfers. In contrast to the classical compiler optimizations, these routines take into account the location where each expression is executed.

Statement Blocks. Distributed statements are more expensive to execute than local statements. To run a distributed statement, the driver needs to serialize a task closure, ship it to all workers, and wait for the completion of each one of them. For short-running tasks, non-processing overheads can easily dominate the execution time.

To amortize the cost of executing distributed statements, the optimizer can pack them together into processing units called statement blocks. A statement block consists of a sequence of distributed statements that can be executed at once on every node without compromising the program correctness. The optimizer analyzes the data-flow dependencies among statements to decide on valid statement reorderings.

Example 5 The trigger from Example 4 consists of three distributed statements (1, 3, and 5), two communication statements (2 and 4), and one local statement (6). Statements 3 and 5 can be grouped together into one block as the latter commutes with statement 4; thus, computing ΔM_Q^Random and updating M_R^[A] happens in one distributed task. The same is not true for statements 1 and 3 because of the data-flow dependencies among the first three trigger statements.

To minimize network overheads, the optimizer can coalesce consecutive communication statements of the same type (i.e., with the same type of location transformer) into one compound request carrying a container with their individual payloads. This technique amortizes the intrinsic
start-up overhead associated with each network operation over multiple communication statements.

Conclusion

This chapter studies incremental view maintenance in local and distributed environments. The presented approach compiles local maintenance programs into distributed code optimized for the execution on large-scale processing platforms. Nikolic et al. (2016) provide details of the compilation process and analyze the performance of distributed IVM using Apache Spark as the data processing framework. The technique discussed here is general and applicable to other distributed data management systems and stream processing engines.

References

Koch C (2010) Incremental query evaluation in a ring of databases. In: PODS, pp 87–98
Koch C, Ahmad Y, Kennedy O, Nikolic M, Nötzli A, Lupei D, Shaikhha A (2014) DBToaster: higher-order delta processing for dynamic, frequently fresh views. VLDB J 23(2):253–278
Murray DG, McSherry F, Isaacs R, Isard M, Barham P, Abadi M (2013) Naiad: a timely dataflow system. In: SOSP, pp 439–455
Nikolic M, Dashti M, Koch C (2016) How to win a hot dog eating contest: distributed incremental view maintenance with batch updates. In: SIGMOD, pp 511–526
Qian X, Wiederhold G (1991) Incremental recomputation of active relational expressions. IEEE Trans Knowl Data Eng 3(3):337–341
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI, pp 15–28
Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: SOSP, pp 423–438
Cross-References

Apache Spark
Apache Flink
Continuous Queries
Hadoop
Incremental Approximate Computing
Incremental Sliding Window Analytics
Introduction to Stream Processing Algorithms
Stream Processing Languages and Abstractions
Robust Data Partitioning
Spark SQL
Stream Query Optimization

Distributed SPARQL Query Processing

Native Distributed RDF Systems

Distributed Spatial Data Streaming

Streaming Big Spatial Data

Duplicate Detection

Record Linkage
Elasticity, Fig. 1 Evolution from centralized SPEs to parallel-distributed SPEs, also showing how a sample query composed of four operators can be deployed at one or multiple nodes by the different SPEs. The figure is presented by Gulisano (2012) for the StreamCloud SPE
Martin et al. (2014); Apache Spark (https://spark.apache.org/); Apache Flink (https://flink.apache.org/); or Heron (https://twitter.github.io/heron/) allow for individual operators to be run in parallel (i.e., providing intra-operator parallelism).

Before the introduction of elasticity protocols, the resources allocated to an SPE were statically assigned at the time the queries were deployed and their input streams were fed to them. In this case, the SPE running a certain set of queries could either be under-provisioned or over-provisioned.

When under-provisioned, the resources allocated to an SPE can cope with the average computational power needed to run the queries. In the occurrence of a temporary increase in the needed resources, nevertheless, the user will experience higher latency in the delivery of output tuples and might resort to adaptive techniques such as input admission control, as discussed by Balkesen et al. (2013), or load shedding, as discussed by Tatbul et al. (2003). On the other hand, the resources of an over-provisioned SPE are allocated to be able to cope with the highest peak of computational power that would be needed by the user-defined queries.

Under-provisioned SPEs are usually not an option when Service Level Agreements (SLAs) are in place, since they might incur monetary penalties. At the same time, over-provisioned SPEs result in wasted resources (and high maintenance costs).

It is important to notice that elasticity should (i) preserve the semantics of the analysis, i.e., not affect the correctness of a query's analysis, and (ii) require as little involvement of the user as possible in the decisions about resources to be provisioned or decommissioned. As we explain in the following, different solutions approach these aspects in different ways (and to different extents).

Foundations

We cover in this section elasticity's core aspects. As discussed by De Matteis and Mencagli (2017) with respect to the SASO properties (i.e., stability, accuracy, settling time, and overshoot), elasticity mechanics build on top of contrasting objectives. On one hand, frequent adjustments of resources
to match variations in computational needs are desirable to overcome the limitations of over- or under-provisioned SPEs. At the same time, sporadic reconfigurations result in less overhead and temporary degradation of queries' performance.

For simplicity, we use the term user to refer both to who maintains the resources to run the SPE and to who defines the queries running on the latter.

Provisioning and Decommissioning
Whenever the allocated resources should be adapted to match the computational needs, elasticity can result in the provisioning of new threads or nodes (when more resources than the allocated ones are needed) or the decommissioning of some existing thread or node (when fewer resources than the allocated ones are needed). A first distinction in the existing elasticity protocols can be made between techniques that can provision (or decommission) one thread or node at a time and techniques that can provision (or decommission) multiple ones.

The solutions presented by Schneider et al. (2009) and Hochreiner et al. (2016), for instance, provide one-at-a-time provisioning and decommissioning of resources. The shortcoming of these techniques is that, in some cases, one extra thread or node might not be enough to match the increase in needed computational power. This is overcome by load-aware elastic techniques such as the ones discussed by Gulisano et al. (2012) and Heinze et al. (2014b). The latter decide how many resources should be provisioned or decommissioned based on the current computational load or the computational load estimated at the end of the elastic reconfiguration. By doing this, they reduce the reconfiguration overhead by encapsulating the provisioning or decommissioning of multiple threads or nodes rather than triggering multiple consecutive reconfigurations.

a new node to be initialized). Because of this, resources should be provisioned only when the allocated ones (as a whole) cannot cope with the current workload. That is, provisioning (and decommissioning) of resources should not be triggered because of an uneven load distribution. This can be prevented by combining load balancing with elasticity.

Load balancing, nevertheless, usually boils down to NP-hard problems such as the bin-packing one, as discussed by Gulisano (2012), Balkesen et al. (2013), and Heinze et al. (2013, 2014b), among others. When deciding the best workload repartitioning among allocated resources, it is also of key importance to leverage efficient state transfer techniques, as mentioned by Ding et al. (2015).

Granularity
Since centralized SPEs bind each query to exactly one node, they can only leverage techniques in which nodes are provisioned when a query, as a whole, cannot be run by the allocated ones, as proposed by Loesing et al. (2012).

Differently, distributed SPEs can also benefit from techniques in which operators' responsibility can be shifted between nodes, as proposed by Xing et al. (2005) and Heinze et al. (2013).

Finally, parallel-distributed SPEs can provision (or decommission) threads and nodes to adjust the parallelism degree of individual operators, as discussed by Gulisano et al. (2012), Heinze et al. (2014b), Martin et al. (2014), and Castro Fernandez et al. (2013), among others.

Transferring the responsibility for a whole operator (or part of a parallel one) to a new thread (or node) requires the rerouting of its input tuples to the new thread (or node). For stateful operators, furthermore, it also requires the transfer of its whole state (or part of it for parallel stateful operators).
In the literature, the triggering conditions are usually based on (i) the threads' and nodes' CPU consumption, as discussed by Gulisano et al. (2012), Martin et al. (2014), and Hochreiner et al. (2016); (ii) the operators' latency, as discussed by Zacheilas et al. (2015) and De Matteis and Mencagli (2017); or (iii) the operators' queue lengths, as discussed by Kumbhare et al. (2015). An important distinction among the existing protocols can be made between reactive mechanisms and proactive ones. Reactive mechanisms, such as the one proposed by Gulisano et al. (2012), trigger a reconfiguration once a certain condition is met (e.g., when the CPU consumption exceeds a certain threshold). On the other hand, proactive mechanisms actively try to optimize the performance of the running queries (e.g., by constantly minimizing their latency), as for the solution discussed by De Matteis and Mencagli (2017), or predict (rather than observe) when a certain condition will be met, as discussed by Zacheilas et al. (2015).

Determinism
As we mentioned, an elastic reconfiguration can temporarily degrade the performance of an SPE while undergoing a reconfiguration (e.g., a temporary increase in the processing latency while waiting for a new node to be provisioned). Nevertheless, elasticity should not affect the correctness of the analysis. This property is known as determinism.

Determinism has been introduced in the context of parallel-distributed SPEs and in the context of algorithms for parallel data streaming operators, for instance, by Gulisano et al. (2016, 2017). The determinism guarantee states that the results produced by an operator run in parallel by multiple threads or at multiple nodes should be the exact same as the results of its centralized counterpart. When all the parallel operators of a query run deterministically, then the query itself enjoys the determinism guarantee.

When elastically reconfiguring the resources allocated to a query, nevertheless, the system undergoes a transition during which the responsibility to process a certain subset of data and the state currently maintained by stateful operators for such a subset is shifted from one thread (or node) to another one. Enforcing determinism in this case implies that the results produced by a system while undergoing elastic reconfigurations are the exact same as the ones produced by a system whose resources are statically allocated. In practice, this means each tuple is processed exactly once, and the state is correctly preserved during each elastic reconfiguration, as we further explain when discussing state transfer.

Determinism is also referred to as safety by Hirzel et al. (2014) and semantic transparency by Gulisano et al. (2012).

State Transfer
When an elastic reconfiguration is triggered, the responsibility for the input tuples of an operator (a portion of them or all of them) is transferred from thread (or node) A to thread (or node) B. The main idea is to find a point in time P so that the processing of all tuples before P is A's responsibility, while the processing of tuples coming after P is B's responsibility.

This is trivial for stateless operators and provided by protocols such as the one proposed by Kumbhare et al. (2015). It is nevertheless challenging for stateful operators when enforcing determinism. The responsibility for the state maintained by A cannot be transferred to B before A is done processing all tuples up to P. A possible way of doing this while guaranteeing determinism is by a blocking mechanism in which B cannot start processing tuples until it gets access to A's state once the latter is done processing all tuples (Gulisano et al. 2012; Martin et al. 2014). When transferring the state by serializing it, sending it between two nodes, and finally deserializing it, nevertheless, the overhead and latency penalty of the reconfiguration can degrade the SPE performance. This can be alleviated by techniques aiming at reducing latency spikes, such as the ones proposed by Heinze et al. (2014a), or at recreating small states at B by sending to the latter previously sent tuples (rather than transferring A's state).
Cross-References

Cloud Computing for Big Data Analysis
Continuous Queries
Sliding-Window Aggregation Algorithms
Stream Window Aggregation Semantics and Optimization
StreamMine3G: Elastic and Fault Tolerant Large Scale Stream Processing

References

Abadi DJ, Ahmad Y, Balazinska M, Cetintemel U, Cherniack M, Hwang JH, Lindner W, Maskey A, Rasin A, Ryvkina E et al (2005) The design of the Borealis stream processing engine. In: CIDR, vol 5, pp 277–289
Arasu A, Cherniack M, Galvez E, Maier D, Maskey AS, Ryvkina E, Stonebraker M, Tibbetts R (2004) Linear road: a stream data management benchmark. In: Proceedings of the thirtieth international conference on very large data bases, VLDB Endowment, vol 30, pp 480–491
Balkesen C, Tatbul N, Özsu MT (2013) Adaptive input admission and management for parallel stream processing. In: Proceedings of the 7th ACM international conference on distributed event-based systems. ACM, pp 15–26
Carney D, Çetintemel U, Cherniack M, Convey C, Lee S, Seidman G, Stonebraker M, Tatbul N, Zdonik S (2002) Monitoring streams: a new class of data management applications. In: Proceedings of the 28th international conference on very large data bases, VLDB Endowment, pp 215–226
Castro Fernandez R, Migliavacca M, Kalyvianaki E, Pietzuch P (2013) Integrating scale out and fault tolerance in stream processing using operator state management. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data. ACM, pp 725–736
De Matteis T, Mencagli G (2017) Proactive elasticity and energy awareness in data stream processing. J Syst Softw 127:302–319
Ding J, Fu TZJ, Ma RTB, Winslett M, Yang Y, Zhang Z, Chao H (2015) Optimal operator state migration for elastic data stream processing. CoRR abs/1501.03619
Gedik B, Schneider S, Hirzel M, Wu KL (2014) Elastic scaling for data stream processing. IEEE Trans Parallel Distrib Syst 25(6):1447–1463
Gibbons PB (2015) Big data: scale down, scale up, scale out. In: IPDPS, p 3
Gulisano V (2012) StreamCloud: an elastic parallel-distributed stream processing engine. PhD thesis, Universidad Politécnica de Madrid
Gulisano V, Jimenez-Peris R, Patino-Martinez M, Soriente C, Valduriez P (2012) StreamCloud: an elastic and scalable data streaming system. IEEE Trans Parallel Distrib Syst 23(12):2351–2365
Gulisano V, Nikolakopoulos Y, Papatriantafilou M, Tsigas P (2016) ScaleJoin: a deterministic, disjoint-parallel and skew-resilient stream join. IEEE Trans Big Data PP(99):1–1
Gulisano V, Nikolakopoulos Y, Cederman D, Papatriantafilou M, Tsigas P (2017) Efficient data streaming multiway aggregation through concurrent algorithmic designs and new abstract data types. ACM Trans Parallel Comput 4(2):11:1–11:28
Heinze T, Ji Y, Pan Y, Grueneberger FJ, Jerzak Z, Fetzer C (2013) Elastic complex event processing under varying query load. In: BD3@VLDB, pp 25–30
Heinze T, Jerzak Z, Hackenbroich G, Fetzer C (2014a) Latency-aware elastic scaling for distributed data stream processing systems. In: Proceedings of the 8th ACM international conference on distributed event-based systems. ACM, pp 13–22
Heinze T, Pappalardo V, Jerzak Z, Fetzer C (2014b) Auto-scaling techniques for elastic data stream processing. In: IEEE 30th international conference on data engineering workshops (ICDEW). IEEE, pp 296–302
Hirzel M, Soulé R, Schneider S, Gedik B, Grimm R (2014) A catalog of stream processing optimizations. ACM Comput Surv 46(4):46:1–46:34
Hochreiner C, Vögler M, Schulte S, Dustdar S (2016) Elastic stream processing for the internet of things. In: IEEE 9th international conference on cloud computing (CLOUD). IEEE, pp 100–107
Ilyushkin A, Ali-Eldin A, Herbst N, Papadopoulos AV, Ghit B, Epema D, Iosup A (2017) An experimental performance evaluation of autoscaling algorithms for complex workflows. In: Proceedings of the 8th ACM/SPEC international conference on performance engineering (ICPE), ICPE'17. ACM, New York, pp 75–86. http://doi.org/10.1145/3030207.3030214
Koliousis A, Weidlich M, Castro Fernandez R, Wolf AL, Costa P, Pietzuch P (2016) Saber: window-based hybrid stream processing for heterogeneous architectures. In: Proceedings of the 2016 international conference on management of data. ACM, pp 555–569
Kumbhare AG, Simmhan Y, Frincu M, Prasanna VK (2015) Reactive resource provisioning heuristics for dynamic dataflows on cloud infrastructure. IEEE Trans Cloud Comput 3(2):105–118
Loesing S, Hentschel M, Kraska T, Kossmann D (2012) Stormy: an elastic and highly available streaming service in the cloud. In: Proceedings of the 2012 joint EDBT/ICDT workshops. ACM, pp 55–60
Martin A, Brito A, Fetzer C (2014) Scalable and elastic realtime click stream analysis using StreamMine3G. In: Proceedings of the 8th ACM international conference on distributed event-based systems. ACM, pp 198–205
Schneider S, Andrade H, Gedik B, Biem A, Wu KL (2009) Elastic scaling of data parallel operators in stream processing. In: IEEE international symposium on parallel & distributed processing, IPDPS 2009. IEEE, pp 1–12
Tatbul N, Çetintemel U, Zdonik S, Cherniack M, Stonebraker M (2003) Load shedding in a data stream manager. In: Proceedings of the 29th international conference on very large data bases
ELT

ETL

Emerging Hardware Technologies

Xuntao Cheng¹, Cheng Liu², and Bingsheng He²
¹Nanyang Technological University, Jurong West, Singapore
²National University of Singapore, Singapore, Singapore

Definitions

This chapter introduces emerging hardware technologies such as GPUs, x86-based many-core processors, and die-stacked DRAMs, which have significant impacts on big data applications.

GPU

Graphics processing units (GPUs) are specialized processors initially designed to alter memory at a massive scale and fast speed to accelerate the creation of images. They were redesigned to support programmable shading in 2001 and have gradually become general-purpose accelerators. A GPU dedicates most of its transistors to simpler and smaller cores with more registers to support massive thread-level parallelism on a large scale of data. The bandwidth of GPU memory is usually higher than that of the main memory. These features benefit a wide range of big data applications such as business intelligence and data mining (Christoph Weyerhaeuser et al. 2008; Ren Wu et al. 2009).

NVIDIA has proposed the NVLink technology, a high-bandwidth, energy-efficient interconnect that enables ultrafast communication between the CPU and the GPU (NVIDIA 2017). The bandwidth of NVLink is 5 to 12 times that of PCIe Gen 3. NVLink can also connect multiple GPU cards in a mesh, improving GPUs' capabilities for big data analytics.
• Hybrid system architecture: closely coupled CPU-FPGA architectures such as Intel Xeon-FPGA (Herman and Randy 2016) have emerged recently. CPU and FPGA have a shared memory.

Memory

Die-Stacked DRAM
Die-stacked DRAM is an emerging type of memory that is integrated closely with processing units in the same package. By stacking multiple layers of DRAM together in 3D space, the bandwidth per stack can be increased substantially. For example, the third generation of High Bandwidth Memory (HBM), a JEDEC standard of die-stacked DRAM, is expected to reach a peak bandwidth of 512 GB/s per module in the near future (JEDEC Solid State Technology Association 2014, 2015). Die-stacked DRAMs can be either interposed by the side of the processor or placed on top of the processor. Although both approaches can achieve a very high bandwidth, the latter approach also reduces the memory access latency.

Storage Class Memory
Storage class memory (SCM) is a type of byte-addressable nonvolatile memory that has performance similar to DRAM in terms of latency, and capacity and economics similar to storage (Burr et al. 2008). It is an attractive solution to meet the needs for fast and scalable memory in big data analytics, databases, cloud services, and many other scenarios. By far, candidates for SCM include phase-change memory (PCM), spin-transfer torque memory (STT-RAM), Intel 3D XPoint, resistive RAM, HP memristors, Samsung V-NAND, and others. SCM has asymmetric read and write latency. For example, the read and write latencies of STT-RAM are 10 ns and 50 ns, respectively. This calls for careful software optimizations.

PIM
Processing in memory (PIM) is a technique that integrates processing logic with the memory so that tasks that don't require extensive processing can be done directly within the memory. It avoids the frequent data transfers between the processing units and memory in conventional computing architectures. Thus, it helps to speed up processing, increase the memory transfer rate and memory bandwidth, and decrease processing latency. Centering on the PIM concept, a number of PIM architectures have been explored over the years. Some researchers proposed to integrate the processing logic with DRAM by leveraging 3D technologies (Qiuling Zhu et al. 2013). Others opted to build processing logic together with the memory cells based on different memory techniques such as metal-oxide resistive random access memory (ReRAM) (Wong et al. 2012), spin-transfer torque magnetic RAM (STT-RAM) (Adrien et al. 2015), and phase-change memory (PCM) (Geoffrey et al. 2015).

Hard Disk
There are recent efforts on improving hard disk drives (HDD), whose applications have been threatened by their lower bandwidth compared with other emerging memory technologies. A new approach to improving HDDs tries to replace the conventional aluminum substrates with new prototype glass substrates. Glass substrates can help increase the storage capacity of a single HDD to about 20 TB (Jon Martindale 2017). This enlarged capacity may improve the storage capability of disks serving big data systems.

Network

Smart NIC
Smart network interface cards (NICs) are NICs equipped with processors to process data while transferring them in the network. These processors can be FPGAs, CPUs, or specifically designed ASICs. Although some computation tasks can be offloaded to conventional NICs, they are usually not complex, such as calculating the checksum. Smart NICs can exploit their processors to support much more complex tasks, such as network functions virtualization (NFV) and more complicated data processing tasks.

RDMA
Remote direct memory access (RDMA) was originally used by the InfiniBand interconnect technology and is now applied on Ethernet networks as RDMA over Converged Ethernet (RoCE). It allows offloading network transfers from the CPU to RDMA hardware engines, which frees the CPUs for other tasks and in turn increases the performance of those tasks. This feature has proven very beneficial for big data systems (Lu et al. 2014).
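The asymmetric read/write latencies noted for SCM (e.g., 10 ns reads vs. 50 ns writes for STT-RAM) reward software that avoids redundant writes. The following is a minimal sketch of that idea; the 10/50 ns figures come from the text above, while the cost model and data layout are illustrative assumptions.

```python
# Sketch of a write-avoiding update policy for SCM-resident data.
# Latencies (10 ns read, 50 ns write) are the STT-RAM figures from the
# text; the cost model and dict-based store are illustrative assumptions.

READ_NS, WRITE_NS = 10, 50

def update_cost(stored: dict, updates: dict, check_before_write: bool) -> int:
    """Total latency (ns) to apply `updates`, optionally skipping
    writes whose value is unchanged (at one extra read per key)."""
    cost = 0
    for key, value in updates.items():
        if check_before_write:
            cost += READ_NS                 # read the current value first
            if stored.get(key) == value:
                continue                    # unchanged: skip the costly write
        cost += WRITE_NS
        stored[key] = value
    return cost

data = {"a": 1, "b": 2, "c": 3}
updates = {"a": 1, "b": 2, "c": 4}          # only "c" actually changes
blind = update_cost(dict(data), updates, check_before_write=False)   # 150 ns
checked = update_cost(dict(data), updates, check_before_write=True)  # 80 ns
```

When most updates are redundant, the extra cheap reads pay for themselves by eliminating expensive writes; when most updates change the data, blind writing wins.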
Xilinx (2017c) Integrated HBM and RAM. https://www.xilinx.com/products/technology/memory.html. Online; Accessed 11 Nov 2017
Zhu Q, Akin B, Sumbul HE, Sadi F, Hoe JC, Pileggi L, Franchetti F (2013) A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In: 3D systems integration conference (3DIC), 2013 IEEE international. IEEE, pp 1–7

End-to-End Benchmark

Milind A. Bhandarkar
Ampool, Inc., Santa Clara, CA, USA

Synonyms

Deep analytics benchmark; Pipeline benchmark

Definitions

An end-to-end data pipeline benchmark is a standardized suite of data ingestion, data processing, and data queries, arranged in a series of stages, where the output of a previous stage in the pipeline feeds the next stage, exercising all the system characteristics needed by commonly constructed data pipeline workloads.

Historical Background

As we witness the rapid transformation in data architecture, where relational database management systems (RDBMS) are being supplemented by large-scale non-relational stores such as the Hadoop Distributed File System (HDFS), MongoDB, Apache Cassandra, and Apache HBase, a more fundamental shift is on its way, one that will require larger changes to modern data architectures. While the current shift was mandated by business requirements for the connected world, the next wave will be dictated by operational cost optimization, transformative changes in the underlying infrastructure technology, and newer use-cases such as the Internet of Things (IoT), deep learning, and conversational user interfaces (CUI).

These new use-cases, combined with ubiquitous data characterized by its large volume, variety, and velocity, have resulted in the emergence of the big data phenomenon. Most applications being built in this connected world are based on analyzing historical data of various genres to create a context within which the user's interaction with the application is interpreted. Such applications are popularly referred to as data-driven applications. While data-driven applications are coming into existence across almost all verticals, such as financial services, healthcare, manufacturing, and transportation, web-based and mobile advertising companies recognized the potential of big data technologies for building data-driven applications and were arguably their earliest adopters. Indeed, popular big data platforms such as Apache Hadoop were created and adopted at web technology companies such as Yahoo, Facebook, Twitter, and LinkedIn.

Most of the initial uses of these big data technologies were in workloads analyzing large unstructured and semi-structured datasets in batch-oriented Java MapReduce programs or using queries in SQL-like declarative languages. However, the availability of the majority of enterprise data on big data distributed storage infrastructure soon attracted other, non-batch workloads to these platforms. Massively parallel SQL engines such as Apache Impala and Apache HAWQ, which provide interactive analytical query capabilities, were developed to run natively on top of Hadoop, analyzing data stored in the Hadoop Distributed File System (HDFS). Since 2012, the cluster resource management, scheduling, and monitoring functionality previously embedded in the MapReduce framework has been modularly separated out as a standalone component, Apache YARN, in the Hadoop ecosystem. This allowed data processing frameworks other than MapReduce to run natively on Hadoop clusters, sharing computational resources among various frameworks. The last 4 years have seen an explosion of Hadoop-native data processing frameworks
tackling various processing paradigms, such as streaming analytics, OLAP cubing, iterative machine learning, and graph processing, among others.

The flexibility of HDFS (and other Hadoop-compatible storage systems), sometimes federated across a single computational cluster, for storing large amounts of structured, semi-structured, and unstructured data, together with the added flexibility and choice of data processing frameworks, has resulted in Hadoop-based big data platforms being increasingly utilized for end-to-end data pipelines.

The emergence of this ubiquitous new platform architecture has created the imperative for an industry-standard benchmark suite that evaluates the efficacy of such platforms on end-to-end data processing workloads, so that consumers can make informed choices about the various storage infrastructure components and computational engines.

In this article, we describe an end-to-end data pipeline benchmark based on real-life workloads. While the benchmark itself is based on a typical pipeline in online advertising, it shares many common patterns across a variety of industries that make it relevant to a majority of big data deployments. In the next section, we discuss various previous benchmarking efforts that have contributed to our current understanding of industry-standard big data benchmarks. Then we describe the benchmark scenario, at a fictitious Ad-Tech company. The datasets stored and manipulated in the end-to-end data pipeline of this benchmark are discussed, along with the various computations performed on the data in the pipeline. We propose metrics to measure the effectiveness of systems under test (SUT), report on various classes of benchmarks that correspond to data sizes, and conclude by outlining future work.

Related Work

Performance benchmarking of data processing systems is a rich field of study and has seen tremendous activity since commercial data platforms were first developed several decades ago. Several industry consortia have been established to standardize benchmarks for data platforms over the years. Of these, the Transaction Processing Performance Council (TPC) and the Standard Performance Evaluation Corporation (SPEC), both established in 1988, have gained prominence and have created several standard benchmark suites. While TPC has traditionally focused on data platforms, SPEC has produced several microbenchmarks aimed at evaluating low-level computer systems performance. Among the various TPC standard benchmarks, TPC-C (for OLTP workloads), TPC-H (for data warehousing workloads), and TPC-DS (for analytical decision support workloads) have become extremely popular, and several results have been published using these benchmark suites. The benchmark specifications consist of data models, tens of individual queries on these data, synthetic data generation utilities, and a test harness to run the benchmarks and measure performance.

Starting in late 2011, the Center for Large-Scale Data Systems Research (CLDS) at the San Diego Supercomputing Center, University of California at San Diego, along with several industry and academic partners, established the Big Data Benchmarking Community (BDBC) to establish standard benchmarks for evaluating the hardware and software systems that form modern big data platforms. A series of well-attended workshops, called the Workshop on Big Data Benchmarking, have been conducted across three continents as part of this effort, and proceedings of those workshops have been published.

Recognizing the need to speed up development of industry-standard benchmarks, especially for rapidly evolving big data platforms and workloads, a new approach to developing and adopting benchmarks within the TPC, called TPC-express, was adopted in 2014. Since then, two standard benchmarks, TPCx-HS and TPCx-BB, have been created under the TPC-express umbrella. TPCx-HS provides standard benchmarking rigor to the existing Hadoop Sort (based on Terasort) benchmark, whereas TPCx-BB provides big data extensions to the data models and workloads
from the earlier TPC-DS benchmark. TPCx-HS is a microbenchmark which evaluates the shuffle performance of MapReduce and other large data processing frameworks. TPCx-BB is based on a benchmark proposal named BigBench (Ghazal et al. 2013) that consists of a set of 30 modern data warehousing queries combining both structured and unstructured data. None of the existing benchmarks provides a way to evaluate the combined performance of an end-to-end data processing pipeline.

The need for an industry-standard big data benchmark for evaluating the performance of data pipelines, and for conducting an annual competition to rank the top 100 systems based on this benchmark, has been described in Baru et al. (2013). AdBench, the benchmark described in this article, builds upon this deep analytics pipeline benchmark proposal by extending the proposed user-behavior deep analytics pipeline to include streaming analytics and ad hoc analytical queries, in the presence of transactions on the underlying datasets.

AdBench represents a generational advance over the previous standardized benchmarks because it combines various common computational needs into a single end-to-end data pipeline. We note that while the performance of individual tasks in data pipelines remains important, the operational complexity of assembling an entire pipeline composed of many different systems adds the overhead of data exchange between the various computational frameworks. Data layouts optimized for different tasks require transformations. Scheduling different tasks into a single data pipeline requires a workflow engine with its own overheads. These additional data pipeline composition and orchestration costs are not taken into account by the previous benchmarks, which measure the effectiveness of systems on individual tasks. We believe that systems that exhibit good-enough performance to implement entire data pipelines are preferable, for their operational simplicity, to best-of-breed but incompatible systems that may exhibit the best performance on individual tasks yet have to be integrated with external glue code. Thus, our data pipeline benchmarks measure the performance of an end-to-end data pipeline, rather than focusing on hero numbers for individual tasks.

Foundations

Acme is a very popular content aggregation company that has a web-based portal and a mobile app, with tens of millions of users who frequently visit them from multiple devices several times a day to get hyper-personalized content, such as news, photos, audio, and video. Acme has several advertising customers (we distinguish users, who browse through Acme's website, from customers, who publish advertisements on that website) who pay them to display their advertisements on all devices. Acme is one of the many Web 3.0 companies because they have a deep understanding of their users' precise interests as well as exact demographic data, history of content consumption, and interactions with advertisements displayed on the Acme website and mobile application. Acme personalizes their users' experience based on this data. They have an ever-growing taxonomy of their users' interests, and their advertisers can target users not only by demographics but also by the precise interests of those users.

Here is Acme's business in numbers:

• 100 million registered users, with 50 million daily unique users
• 100,000 advertisements across 10,000 advertisement campaigns
• 10 million pieces of content (news, photos, audio, video)
• 50,000 keywords in 50 topics and 500 subtopics as user interests, content topics, and for ad targeting

Acme has several hundred machines serving advertisements, using a unique matching algorithm that fetches a user's interests, finds the best match within a few milliseconds, and serves the ad within appropriate content.
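The relevance matching described in this scenario (and detailed in the model-building pipeline below as cosine similarity over keyword lists) can be sketched as follows. The toy data, the binary keyword weights, and the function names are illustrative assumptions, not part of the AdBench specification.

```python
# Sketch of ad relevance scoring by cosine similarity over sparse
# keyword-weight vectors. Data and weighting scheme are illustrative.
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse keyword-weight vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_ads(user: dict, ads: dict, n: int = 3):
    """Rank ads for a user by cosine similarity; keep the top n."""
    return sorted(ads, key=lambda ad: cosine(user, ads[ad]), reverse=True)[:n]

user = {"sports": 1.0, "travel": 1.0}
ads = {
    "ad1": {"sports": 1.0},
    "ad2": {"finance": 1.0},
    "ad3": {"sports": 1.0, "travel": 1.0},
}
best = top_ads(user, ads)  # ad3 matches both interests, ad1 one, ad2 none
```

Precomputing such top-n lists per user and per content item is what allows the serving path to reduce to a couple of lookups at request time.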
These datasets represent the top 3 most relevant ads for every user and for every piece of content. The datasets are then used by the ad serving systems, such that when a user visits a particular piece of content, the best match among these ads is chosen based on one lookup each in the user dataset and the content dataset.

The relevance of an ad for a user or a piece of content is determined by the cosine similarity of their lists of keywords, topics, and subtopics. This model-building pipeline, built using batch-oriented computational frameworks, has the following steps:

1. Extract the relevant fields from the user dataset and the ad dataset, and join them based on topics, subtopics, and keywords.
2. Filter the top 3 matching keywords, and compute the weights of ads for those keywords using cosine similarity.
3. Repeat the steps above for the content dataset and the ad dataset.

Interactive and Ad Hoc SQL Queries
The interactive and ad hoc queries are performed on varying windows of the aggregates for campaigns and ads using an interactive query engine (preferably SQL-based). Some examples of the queries are:

1. What was the {per minute, hourly, daily} conversion rate for an ad? For a campaign?
2. How many ads were clicked on, as a percentage of those viewed, per hour for a campaign?
3. How much money does a campaign owe to Acme for the whole day?
4. What are the most clicked ads and campaigns per hour?
5. How many male users does Acme have aged 0–21 or 21–40?

The results of these queries can be displayed in a terminal and, for the queries resulting in time-windowed data, visualized using BI tools.

The various stages of computation in the AdBench data pipeline, and the dataflow among them, are shown in Fig. 1.

Scale Factors and Metrics
Compared to microbenchmarks, or to benchmarks that consist of a fixed set of queries representing similar workloads performed on a single system, an end-to-end data pipeline with mixed workloads poses significant challenges in defining scale factors and metrics and in reporting benchmark results. In this section, we propose a few alternatives, with the reasoning behind our choices.

Scale Factors
Across different verticals, the number of entities, and thus the amount of data and the data rates, vary widely. For example, in financial institutions such as banks providing online banking, the number of customers (users) is between a few tens of thousands and millions; the number of advertisements, which corresponds to the number of different services (such as loans, fixed deposits, and various types of accounts), is tens to hundreds; the number of content entities is equivalent to the number of dashboards (such as account transaction details and bill pay); and the data rates tend to be in the low tens of thousands of events per minute even during peak times. These scales are very different from those of a content aggregation website or mobile application. To address these varying needs across industries, we suggest the following scale factors:

Class    Users          Ads      Contents  Events/second
Tiny     100,000        10       10        1,000
Small    1,000,000      100      100       10,000
Medium   10,000,000     1,000    1,000     100,000
Large    100,000,000    10,000   10,000    1,000,000
Huge     1,000,000,000  100,000  100,000   10,000,000

Metrics
Since the benchmark covers an end-to-end data pipeline consisting of multiple stages, the measures of performance in each of the stages vary widely. Streaming data ingestion is measured in terms of the number of events consumed per second, but due to varying event sizes, the amount of data ingested
per second is also an important measure. Streaming data analytics operates on windows of event streams, and the performance of streaming analytics should be measured in terms of window size (the amount of time per window multiplied by the number of events per unit time, i.e., the data ingestion rate). Because of the latency involved in the ingestion of data and the output of streaming analytics, the most relevant measure of streaming analytics performance is the amount of time between event generation and that event's participation in producing results.

For batch computations, such as machine-learned model training (of which a recommendation engine is a representative example), the amount of time needed from the beginning of model training, including any preprocessing of raw data, to the upload of the resulting models to the serving system needs to be considered as the performance measure. Similarly, for ad hoc and interactive queries, the time from query submission to the production of results is the right performance measure.

While most benchmark suites attempt to combine the performance results into a single performance metric, we think that for an end-to-end data pipeline one should report the individual performance metrics of each stage, to enable a choice among different components of the total technology stack. For a rapidly evolving big data technology landscape, where a complete data platform is constructed from a collection of interoperable components, we encourage experimentation with combining multiple technologies to implement the end-to-end data pipeline, and stage-wise performance metrics will aid this decision.

In order to simplify comparisons of the environments used to execute these end-to-end pipelines, we initially propose a simple weighted aggregation of the individual stages' performance as a single performance metric for the entire system, with the weights explicitly specified. There is anecdotal evidence to suggest that the business value of data decreases as the time between production and consumption of the data increases. Thus, one might be tempted to give a higher weight to streaming computation performance than to batch computation performance. However, it might also be argued that approximate computations are acceptable in real time, while batch computations need exact results, and that faster batch computations should therefore be given a higher weight. We propose that determining appropriate weights for individual stages should be an active area of investigation and that the weights should be incorporated in the metrics based on the outcome of that investigation.

In addition to performance metrics, the complexity of assembling data pipelines by
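The weighted aggregation of per-stage metrics proposed above might be sketched as follows. The stage names, the normalized per-stage scores, and the particular weights are all illustrative assumptions; the text deliberately leaves the weight choice open.

```python
# Sketch of combining per-stage performance into a single score with
# explicitly specified weights, as proposed above. Stage names, scores,
# and weights are illustrative assumptions.

def weighted_score(stage_scores: dict, weights: dict) -> float:
    """Weighted average of per-stage scores; weights must cover all stages."""
    total_w = sum(weights[s] for s in stage_scores)
    return sum(stage_scores[s] * weights[s] for s in stage_scores) / total_w

scores = {"ingestion": 0.9, "streaming": 0.8, "batch": 0.6, "queries": 0.7}
# Weighting that favors streaming, per the "value of fresh data" argument
# in the text; the opposite choice (favoring exact batch results) is
# equally defensible.
weights = {"ingestion": 1.0, "streaming": 2.0, "batch": 1.0, "queries": 1.0}

overall = weighted_score(scores, weights)  # (0.9 + 1.6 + 0.6 + 0.7) / 5
```

Reporting both the per-stage scores and the weights alongside the single number keeps the aggregate auditable.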
Baru C, Bhandarkar M, Nambiar R, Poess M, Rabl T (2013) Benchmarking big data systems and the big data top100 list. Big Data 1(1):60–64
Ghazal A, Rabl T, Hu M, Raab F, Poess M, Crolotte A, Jacobsen HA (2013) BigBench: towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, SIGMOD '13. ACM, New York, pp 1197–1208

Energy Benchmarking

Klaus-Dieter Lange1 and Jóakim von Kistowski2
1 Hewlett Packard Enterprise, Houston, TX, USA
2 Institute of Computer Science, University of Würzburg, Würzburg, Germany

Synonyms

Energy-efficiency benchmarking; Power benchmarking

Definitions

Software that utilizes standard applications, or their core routines that focus on a particular access pattern, in order to measure the power consumption of a system.

Historical Background

In 1996, one of the earliest works on an energy-efficiency metric was published by Gonzalez and Horowitz (1996), in which the authors proposed the energy-delay product as the metric for energy-efficient microprocessor design. Almost a decade later, in January 2006, the SPECpower Committee (https://www.spec.org/power) was founded and started the development of the first industry-standard energy-efficiency benchmark for server-class compute equipment.

The SPECpower Committee developed the SPEC PTDaemon (http://www.spec.org/power/docs/SPEC-PTDaemon_Design.pdf) interface, which offloads the details of controlling a power analyzer or temperature sensor, presenting a common TCP/IP-based interface that can be readily integrated into different benchmark harnesses.

The Power and Performance Benchmark Methodology (https://www.spec.org/power/docs/SPEC-Power_and_Performance_Methodology.pdf) was created by the SPECpower Committee for performance benchmark designers and implementers who want to integrate a power component into their benchmarks. This public document also serves as an introduction for those who need to understand the relationship between power and performance metrics in computer systems benchmarks. Guidance is provided for including power metrics in existing benchmarks, as well as for altering existing benchmarks and
designing new ones to provide a more complete view of energy consumption.

In 2007, the research benchmark JouleSort (Rivoire et al. 2007) was released; later that year followed the industry-standard SPECpower_ssj2008 benchmark (http://spec.org/power_ssj2008), which exercises the CPUs, caches, memory hierarchy, and the scalability of shared-memory processors at multiple load levels (Lange 2009). The benchmark runs on a wide variety of operating systems and hardware architectures. 2008 marks the first release of the SPC Energy specifications (http://spcresults.org/specifications) and the TPC-Energy specification (http://www.tpc.org/tpc_energy/default.asp), which utilizes the SPEC PTDaemon.

The SPECpower Committee augmented the SPECpower_ssj2008 benchmark with multi-compute-node support in 2009 and started work on the SERT (https://www.spec.org/sert) suite by designing the Chauffeur (https://www.spec.org/chauffeur-wdk) framework, followed by the implementation of the workload, automated HW and SW discovery, and a GUI. Several enhancements and code changes to all SPECpower_ssj2008 benchmark components were included in subsequent releases in 2011 and 2012. SPEComp2012 (https://www.spec.org/omp2012) and the vendor-based benchmarks VMmark (https://www.vmware.com/products/vmmark.html) and SAP (https://www.sap.com/about/benchmark.html) were also released in 2012, utilizing the SPEC PTDaemon and the SPEC benchmark methodology.

After a series of successful alpha and beta test phases, the SERT suite was released in 2013, and 2 months later the US Environmental Protection Agency (EPA) adopted the SERT suite for their ENERGY STAR for Server Certification program. Over the years, four SERT upgrades were released in order to further enhance usability and to increase its platform support. SPECvirt_sc2013 (https://www.spec.org/virt_sc2013) was also released that year, replacing its predecessor, SPECvirt_sc2010.

The Chauffeur framework from SERT is made available as the Chauffeur Worklet Development Kit (WDK). This kit can be used to develop new workloads (or "worklets," in Chauffeur terminology). Researchers can also use the WDK to configure worklets to run in different ways, in order to mimic the behavior of different types of applications. These features can be used in the development and assessment of new technologies such as power management capabilities. The first version of the Chauffeur WDK was released in 2014 and was succeeded by releases in 2015 and 2017.

For the SERT suite 2.0, one of the main focuses was the reduction of its run time and improvements to the automated hardware and software discovery, in order to save users testing time and costs. Support for additional power analyzers and interfaces was adopted as well, in order to provide wider choices for users. Also, some major improvements to the memory subsystem worklets for more accurate characterization of memory were implemented. Simultaneously with these efforts, the SPECpower Committee, in collaboration with the more academically focused SPEC Power Research Group (https://research.spec.org/working-groups/rg-power.html), conducted large-scale experiments in order to determine the server efficiency metric (https://www.spec.org/sert2/SERT-metric.pdf) for the SERT suite 2.0, which enabled its release on March 28, 2017.

Foundations

Energy benchmarks are useful tools to analyze the power consumption and efficiency of IT equipment. They execute standard applications or specialized micro-workloads to stress a system, enabling measurement of the system's power consumption. Workloads used within an energy benchmark can be adopted from existing benchmarks. Specific to energy benchmarks are measurement methodologies regarding power measurement, and metrics that combine the energy measurement result with the primary performance metric of the underlying benchmark.

An energy benchmark measurement methodology defines, among other things, the point
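A metric of the kind just described, combining an energy measurement with the underlying performance metric across multiple load levels, can be sketched as follows. The form shown (summed throughput divided by summed average power, in the style of SPECpower_ssj2008's overall ssj_ops/watt) is one common choice; the measurement values are made up for illustration.

```python
# Sketch of a multi-load-level efficiency metric in the style of
# SPECpower_ssj2008's overall ssj_ops/watt: total throughput across all
# load levels divided by total average power. Numbers are invented.

def overall_efficiency(measurements):
    """measurements: list of (throughput_ops, avg_power_watts), one per
    load level (including active idle, which contributes 0 ops)."""
    total_ops = sum(ops for ops, _ in measurements)
    total_watts = sum(watts for _, watts in measurements)
    return total_ops / total_watts

# Hypothetical measurements at 100%, 50%, and 10% load, plus active idle:
levels = [(100_000, 200.0), (50_000, 140.0), (10_000, 90.0), (0, 70.0)]
score = overall_efficiency(levels)  # 160000 / 500 = 320.0 ops/watt
```

Including low-load and idle measurements in the denominator is what makes such metrics penalize systems that burn power while doing little work.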
research can utilize energy benchmark results for the information that they provide on the systems in question.

Other research focuses on energy management of the servers themselves when running big data applications (Chen et al. 2014). Energy benchmarks are used for evaluating the management mechanisms that result from such research.

Research on the benchmarks themselves has been conducted with a focus on workloads and load levels. Load levels were introduced with the SPECpower_ssj2008 benchmark and have since been implemented in the SPEC SERT suite as well. As load levels are intended to increase the relevance of the energy benchmarks using them, research on these levels has focused on the relevance aspect. In addition, research has investigated the reproducibility of load level implementations. An analysis in Kistowski et al. (2016) showed that the load level mechanism of SPECpower_ssj2008 and SERT allows for high reproducibility across measurement re-runs. Workloads, on the other hand, are usually application specific, but the investigation in Kistowski et al. (2015) demonstrated explicitly that benchmark suites that employ multiple workloads can provide more information for server characterization. This analysis also investigated the information gain that can be achieved by increasing the number of load levels. Similarly, results from multiple micro-benchmarks have been used by researchers for the creation of system models (Bellosa 2000), which, in turn, can be used for energy management research.

Future Directions for Research

Reproducibility is a significant challenge for energy benchmarks, as these benchmarks require both the performance results and the power measurements to be reproducible. However, it was shown that power measurements of nominally identical servers, using hardware with the same nameplate information and configuration, can differ by as much as 15% (Chen et al. 2014). This difference is due primarily to differences in semiconductor quality within the hardware itself. Future research on energy-efficiency benchmarks must find answers on how to deal with this spread of results.

Cross-References

Energy Efficiency in Big Data Analysis
Energy Implications of Big Data

References

Bellosa F (2000) The benefits of event-driven energy accounting in power-sensitive systems. In: Proceedings of the 9th workshop on ACM SIGOPS European workshop, pp 37–42. https://dl.acm.org/citation.cfm?id=566736
Chauffeur framework. https://www.spec.org/chauffeur-wdk
Chen M, Mao S, Liu Y (2014) Big data: a survey. Mob Netw Appl 19:171. https://doi.org/10.1007/s11036-013-0489-0
Gonzalez R, Horowitz M (1996) Energy dissipation in general purpose microprocessors. IEEE J Solid State Circuits 31(9):1277–1284
Lange K-D (2009) Identifying shades of green: the SPECpower benchmarks. Computer 42(3):95–97. IEEE Computer Society Press, Los Alamitos
Mashayekhy L, Nejad MM, Grosu D, Zhang Q, Shi W (2015) Energy-aware scheduling of MapReduce jobs for big data applications. IEEE Trans Parallel Distrib Syst 26(10):2720–2733
Rivoire S, Shah MA, Ranganathan P, Kozyrakis C (2007) JouleSort: a balanced energy-efficiency benchmark. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data, SIGMOD '07. ACM, New York, pp 365–376
SAP. https://www.sap.com/about/benchmark.html
Server Efficiency Rating Tool (SERT). https://www.spec.org/sert
SPC specifications. http://spcresults.org/specifications
SPEC power and performance methodology. https://www.spec.org/power/docs/SPEC-Power_and_Performance_Methodology.pdf
SPEC PTDaemon interface. http://www.spec.org/power/docs/SPEC-PTDaemon_Design.pdf
SPEC RG Power Working Group. https://research.spec.org/working-groups/rg-power.html
SPEComp2012. https://www.spec.org/omp2012
SPECpower Committee. https://www.spec.org/power
SPECpower_ssj2008. http://spec.org/power_ssj2008
SPECvirt_sc2013. https://www.spec.org/virt_sc2013
The SERT 2 metric. https://www.spec.org/sert2/SERT-metric.pdf
TPC Energy. http://www.tpc.org/tpc_energy/default.asp
Energy Efficiency in Big Data Analysis 715
DBMSs to evolve and to consider energy efficiency as a main goal (Lang et al. 2011). Processing the query as fast as possible is not the most energy-efficient way to operate. First, typical servers often run at low utilization. This opens the opportunity to execute the query more slowly, if the additional delay is acceptable. For example, this situation may occur if the service-level agreement (SLA) permits the additional response-time penalty. Since typical SLAs are written to meet peak or near-peak demand, SLAs often have some slack in the frequent low-utilization periods, which could be exploited by the DBMS. Second, components on modern server motherboards offer various power/performance states that can be manipulated from software. Of these components, the CPU is currently the most amenable to transitioning between such power/performance states. Based on these observations, Lang et al. (2011) propose a novel framework in which the query optimization problem is to find and execute the most energy-efficient plan that meets the SLA. To enable this framework, existing query optimizers have been extended with an energy consumption model, in addition to the traditional response time model. With these two models, the enhanced query optimizer can generate a structure that details the energy and response time cost of executing each query plan at every possible power/performance state under consideration.

Lang and Patel (2009) proposed the ecoDB project, whose goal is to investigate energy-efficient methods for data processing in distributed computing environments, proposing two mechanisms that can trade energy for performance. The first mechanism, called Processor Voltage/Frequency Control (PVC), lets the DBMS explicitly change the processor voltage and frequency parameters. Processors can take commands from software to adjust their voltage/frequencies to operate at points other than their peak performance. The second mechanism is at the workload management level. It is referred to as the Explicit Query Delays (QED) approach, in which queries are delayed and placed into a queue on arrival. When the queue reaches a certain threshold, all the queries in the queue are examined to determine if they can be aggregated into a small number of groups, such that the queries in each group can be evaluated together. Evaluating this workload can trade average query response time for reduced overall energy consumption. Single-group queries can be run in the DBMS at a lower energy cost than the individual queries.

In Roukh et al. (2016), an energy-aware query optimizer framework, called EnerQuery, is proposed, which leverages the PostgreSQL query optimizer by integrating the energy dimension. EnerQuery is built on top of a traditional DBMS to capitalize on the efforts invested in building energy-aware query optimizers. Energy consumption is estimated for all query plan steps and integrated into a mathematical linear cost model used to select the best query plans.

Energy-Efficient Data Analysis on Mobile Computing

The increasing power of wireless devices is opening the way to supporting analysis and mining of data in a mobile context (Wang et al. 2003). Mobile data analysis and mining may involve different scenarios in which a mobile device can play the role of data producer, data analyzer, client of remote data miners, or a combination of them. Accordingly, an increasing number of smartphone- and PDA-based data-intensive applications have recently been developed (Kargupta et al. 2002, 2003; Wang et al. 2003). Examples include smartphone-based systems for body-health monitoring, vehicle monitoring, and wireless security systems.

A key aspect that must be addressed to enable effective and reliable data mining over mobile devices is ensuring energy efficiency, as most commercially available mobile devices have battery power that lasts only a few hours (Comito et al. 2011, 2012).

In Bhargava et al. (2003), based on experimental energy consumption data collected for several popular data analysis techniques, energy efficiency has been introduced as a novel criterion
to evaluate algorithms intended to run on mobile and embedded platforms. The authors characterize and compare the energy consumed by data mining applications implemented onboard a device with the energy costs for sending the data to a remote site. The obtained results show that for low-bandwidth lossy networks, the high energy costs of communication often make local onboard data mining the more energy-efficient choice.

The work in Comito and Talia (2017) enables energy-efficient execution of data mining algorithms on mobile devices, making them lightweight in terms of resource utilization. To achieve this purpose, the approach is to act on data dimensionality, exploiting the fact that mining algorithms are data intensive. Therefore, in order to identify the key factors in the data complexity, Comito and Talia (2017) perform an experimental study of the energy consumption characteristics of representative data mining algorithms running on mobile devices. The performance analysis depends on many factors, encompassing the size of the data sets, attribute selection, number of instances, number of produced results, and accuracy. To carry out the performance evaluation, an Android application has been developed that allows users to execute data mining algorithms on Android smartphones while tuning a set of parameters. The energy characterization outcomes have then been exploited to predict the energy costs of data mining algorithms running on mobile devices. Specifically, Comito and Talia (2017) design and implement a machine learning-based framework to predict, from a set of mobile devices and a set of data mining tasks, the resulting energy consumption of such tasks on the mobile devices. A supervised approach has been used, in which the main idea is to exploit examples of past system behavior to construct an energy model that is able to predict the class of new examples.

Many emerging context-aware mobile applications involve the execution of continuous queries over sensor data streams generated by a variety of onboard sensors on multiple personal mobile devices. To reduce the energy overheads, the work in Yang et al. (2013) introduces CQP, a collaborative query processing framework that exploits the overlap across multiple smartphones. The framework automatically identifies the shareable parts of multiple executing queries and then reduces the overheads of repetitive execution and data transmission by having a set of "leader" mobile nodes execute and disseminate these shareable partial results. To further reduce energy, CQP utilizes lower-energy short-range wireless links to disseminate such results directly among proximate smartphones.

Efficient resource allocation and energy management can be achieved through clustering of mobile nodes into local groups. In Comito and Talia (2014), a clustering scheme is proposed in which mobile devices are organized into local groups (clusters or mobile groups). Each cluster has a node, referred to as the cluster head, that acts as the local coordinator of the cluster. The focus of the approach is building and maintaining a cluster structure in a network of cooperating mobile devices. The problem of assigning devices to clusters has been formulated as a multicriteria decision-making problem based on the weighted-sum approach. A multi-objective cost function is defined in which the aim of the optimization is to maximally extend the network lifetime and the mobility similarity of the devices within a group. Depending on the specific application scenario, the weights associated with the two costs can be properly tuned. This way, optimal decisions may be taken in the presence of trade-offs between the two objectives. A weighted metric that combines the effect of energy and mobility of nodes is also introduced to select suitable cluster-head nodes.

In Comito et al. (2017), an energy-aware (EA) scheduling strategy is proposed for allocating computational tasks over a wireless network of mobile devices. In such a network, the devices are clustered into local groups, named clusters. Generally, geographically adjacent devices are assigned to the same cluster. To make the most of all available resources, the EA strategy tries to find a suitable distribution of tasks among clusters and individual devices. Specifically, the
main design principle of the strategy is finding a task allocation that prolongs the total lifetime of the wireless network and maximizes the number of alive devices by balancing the energy load among them. To this end, the EA scheduler implements a two-phase heuristic-based algorithm. The algorithm first tries to assign a task locally to the cluster where the execution request has been generated, so as to maximize the cluster's residual life. If the task cannot be assigned locally, the second phase of the algorithm is performed, assigning the task to the most suitable node over the whole network of clusters, in this way maximizing the overall network lifetime. The energy model takes into account the energy costs of both computation and communication.

Energy-Efficient Query Processing in Sensor Networks

Query processing has been studied extensively in traditional database systems. However, few existing methods can be directly applied to wireless sensor database systems (WSDSs), due to characteristics such as their decentralized nature, limited computational power, imperfect recorded information, and energy scarcity in individual sensor nodes. Moreover, traditional database systems result in high energy consumption and low energy efficiency due to the lack of consideration of energy issues and environmental adaptation in the design process. In this section, representative recent efforts on this issue, with a focus on energy-aware query optimization and energy-efficient query processing, are reported.

In Guo et al. (2017), a method is proposed for modeling the energy cost of query plans during query processing based on their resource consumption patterns, which helps predict the energy cost of queries before execution. Using the cost model as a basis, the evaluation model can exploit the trade-offs between the power and performance of plans and helps the query optimizer select plans that meet performance requirements but result in lower energy cost. Finally, a green database framework integrating the two above models is proposed to enhance a commercial DBMS. Experimental results reveal that, with reliable and accurate statistical data, the framework proposed in this study can achieve significant energy savings and improve energy efficiency.

Queries for sensor networks should be wisely designed to extend the lifetime of the sensors. In Sun (2008), a query optimization method for wireless sensor networks based on a user-specified accuracy item is presented. When issuing a query, the user may specify a value/time accuracy constraint according to which an optimized query plan can be created to minimize the energy consumption. At each individual sensor node, instead of direct delivery of each reading, algorithms are proposed to reduce both data sensing and data transmission.

In Rosemark et al. (2007), a query optimizer for sensor networks is presented whose key component is a cost-based analysis that estimates the overall energy consumption associated with a given query plan over the lifetime of a query. By utilizing this estimation technique, the query optimizer compares the desirability of several candidate plans for a given query and chooses the plan that is likely to use the least energy for execution. Simulation results show that the query plan chosen by this query optimizer consumes significantly less energy than the state-of-the-art query processing engine for sensor networks. In Alwadi and Chetty (2015), an energy-efficient data mining approach is introduced that characterizes and classifies large-scale physical environments based on a combination of feature selection and single-mode/ensemble classifier techniques.

By approaching the complexity of a WSN/SN with a data mining formulation, in which each sensor node is equivalent to an attribute or feature of a data set, and all the sensor nodes together (forming the WSN/SN setup) are equivalent to the multiple features or attributes of the data set, it is possible to use powerful feature selection, dimensionality reduction, and learning classifier algorithms from the data mining field and arrive at an energy-efficient automatic classification system. Here, minimizing the number of sensors for energy-efficient control is very similar to minimizing the number of features with an optimal feature selection scheme for the data mining problem.
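A principle shared by several of the optimizers discussed in this section is to estimate both time and energy for each candidate plan and then pick the cheapest plan that still meets the performance requirement. A minimal sketch of that selection rule follows; the plan names and cost figures are purely illustrative and not taken from any of the cited systems:

```python
# Sketch of energy-aware plan selection: among candidate plans (with
# hypothetical cost estimates), pick the lowest-energy plan whose
# estimated response time still meets the SLA; if none qualifies,
# fall back to the fastest plan.

def pick_plan(plans, sla_seconds):
    """plans: list of (name, est_time_s, est_energy_j) tuples."""
    feasible = [p for p in plans if p[1] <= sla_seconds]
    if not feasible:                              # no plan meets the SLA:
        return min(plans, key=lambda p: p[1])     # fall back to the fastest
    return min(feasible, key=lambda p: p[2])      # cheapest feasible plan

# Hypothetical plan space for one query: a fast high-power plan,
# a slower low-power plan, and a very slow plan that misses the SLA.
plans = [("hash-join@max-freq", 2.0, 180.0),
         ("hash-join@low-freq", 4.5, 110.0),
         ("nested-loop@low-freq", 9.0, 95.0)]

print(pick_plan(plans, sla_seconds=5.0))
# -> ('hash-join@low-freq', 4.5, 110.0): slower than the fastest plan,
#    but cheapest in energy while still within the 5 s SLA
```

Real systems replace the constant estimates above with cost models over resource consumption and power/performance states, but the feasibility-then-minimization structure is the same.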
industry, progress in the domain will continue and new concepts will emerge.

Cross-References

Big Data Analysis for Smart City Applications
Big Data Analysis Techniques

References

Alsalih W, Akl SG, Hassanein HS (2005) Energy-aware task scheduling: towards enabling mobile computing over MANETs. In: IPDPS'05, pp 242a
Alwadi M, Chetty G (2015) Energy efficient data mining scheme for high dimensional data. Procedia Comput Sci 46:483–490
Aydin H, Melhem R, Moss D, Mejia-Alvarez P (2004) Power-aware scheduling for periodic real-time tasks. IEEE Trans Comput 53(5):584–600
Bhargava R, Kargupta H, Powers M (2003) Energy consumption in data analysis for on-board and distributed applications. In: ICML'03
Bianchini R, Rajaniony R (2004) Power and energy management for server systems. Computer 37(11):68–76
Catena M, Tonellotto N (2017) Energy-efficient query processing in web search engines. Trans Knowl Data Eng 29:1412–1425
Comito C, Talia D (2014) Energy-aware clustering of ubiquitous devices for collaborative mobile applications. In: Proceedings of MobiCASE, pp 133–142
Comito C, Talia D (2017) Energy consumption of data mining algorithms on mobile phones: evaluation and prediction. Pervasive Mob Comput 42:248–264
Comito C, Talia D, Trunfio P (2011) An energy-aware clustering scheme for mobile applications. In: IEEE Scalcom'11, pp 15–22
Comito C, Talia D, Trunfio P (2012) An energy aware framework for mobile data mining, chap 23. In: Energy efficient distributed computing systems. Wiley-IEEE Computer Society Press, New Jersey
Comito C, Falcone D, Talia D, Trunfio P (2017) Energy-aware task allocation for small devices in wireless networks. Concurrency Comput Pract Exp 29(1):1–24
Guo B, Yu J, Liao B, Yang D, Lu L (2017) A green framework for DBMS based on energy-aware query optimization and energy-efficient query processing. J Netw Comput Appl 84:118–130
Kargupta H, Park B, Pitties S, Liu L, Kushraj D, Sarkar K (2002) MobiMine: monitoring the stock market from a PDA. ACM SIGKDD Explor 3(2):37–46
Kargupta H, Bhargava R, Liu K, Powers M, Blair P, Bushra S, Dull J (2003) VEDAS: a mobile and distributed data stream mining system for real-time vehicle monitoring. In: SIAM data mining conference
Lang W, Patel JM (2009) Towards eco-friendly database management systems. CoRR, vol abs/0909.1767
Lang W, Kandhan R, Patel JM (2011) Rethinking query processing for energy efficiency: slowing down to win the race. Computer Sciences Department, University of Wisconsin, Madison
Lefurgy C, Rajamani K, Rawson F, Felter W, Kistler M, Keller T (2003) Energy management for commercial servers. Computer 36(12):39–48
Li K, Kumpf R, Horton P, Anderson T (1994) A quantitative analysis of disk drive power management in portable computers. In: USENIX conference, pp 279–292
Li Z, Wang C, Xu R (2001) Computation offloading to save energy on handheld devices: a partition scheme. In: ACM international conference on compilers, architecture, and synthesis for embedded systems, pp 238–246
Liu L, Wang H, Liu X, Jin X, He W, Wang Q, Chen Y (2009) GreenCloud: a new architecture for green data center. In: 6th international conference on autonomic computing and communications, pp 29–38
Luo J, Jha NK (2000) Power-conscious joint scheduling of periodic task graphs and aperiodic tasks in distributed real-time embedded systems. In: ICCAD
Mohapatra S, Venkatasubramanian N (2003) PARM: power aware reconfigurable middleware. In: 23rd international conference on distributed computing systems, pp 312–319
Petrucci V, Loques O, Niteroi B, Mossé D (2009) Dynamic configuration support for power-aware virtualized server clusters. In: 21st Euromicro conference on real-time systems
Rosemark R, Lee WC, Urgaonkar B (2007) Optimizing energy-efficient query processing in wireless sensor networks. In: International conference on mobile data management, pp 24–29
Roukh A, Bellatreche L, Ordonez C (2016) EnerQuery: energy-aware query processing. In: Proceedings of the 25th ACM CIKM conference, pp 2465–2468
Rudenko A, Reiher P, Popek GJ, Kuenning GH (1998) Saving portable computer battery power through remote process execution. SIGMOBILE Mob Comput Commun Rev 2(1):19–26
Seth K, Anantaraman A, Mueller F, Rotenberg E (2003) FAST: frequency-aware static timing analysis. In: IEEE RTSS, pp 40–51
Sun JZ (2008) An energy-efficient query processing algorithm for wireless sensor networks. In: Sandnes FE, Zhang Y, Rong C, Yang LT, Ma J (eds) Ubiquitous intelligence and computing
Verma A, Ahuja P, Neogi A (2008) Power-aware dynamic placement of HPC applications. In: International conference on supercomputing, pp 175–184
Wang F, Helian N, Guo Y, Jin H (2003) A distributed and mobile data mining system. In: Proceedings of the international conference on parallel and distributed computing, applications and technologies
Weiser M, Welch B, Demers A, Shenker S (1996) Scheduling for reduced CPU energy. In: Mobile computing. Springer, Boston, pp 449–471
Yang J, Mo T, Lim L, Sattler KU, Misra A (2013) Energy-efficient collaborative query processing framework for mobile sensing services. In: IEEE 14th international conference on mobile data management, pp 147–156
Zhang Y, Hu X, Chen D (2002) Task scheduling and voltage selection for energy minimization. In: DAC'02, pp 183–188
Zhuo J, Chakrabarti C (2005) An efficient dynamic task scheduling algorithm for battery powered DVS systems. In: ASP-DAC'05, pp 846–849

Energy Implications of Big Data

Until fairly recently, developments in the field of computer architecture (Parhami 2005) were focused on computation running time and hardware complexity, and on how the two can be traded off against each other in various designs. With the advent of mobile devices and terascale, petascale, and, soon, exascale computing, energy consumption emerged as a major design factor that overshadowed the older concerns to some extent. Key driving forces for research into energy efficiency were battery life in compact mobile devices with smallish batteries, energy costs for operators of supercomputer centers, and the difficulties of heat dissipation in both mobile devices and mainframe/data-center installations. We are thus motivated to seek extreme energy economy measures to make the processing of large data sets feasible within power budgets that are practical for both cloud data centers and mobile devices used at the cloud's edges.

Energy-Efficiency Trends

As the energy efficiency of the mainstream processors that control servers, desktops, laptops, and smartphones has improved, research into ultralow-power digital circuits has paved the way for even greater reductions in energy requirements per computation.

[Figure: computations per kWh, 1940–2010, for machines ranging from ENIAC, EDVAC, and Univac I/II through the IBM PC, IBM PC-AT, Compaq Deskpro 386/20e, Macintosh 128k, and 486/25 and 486/33 desktops. Regression over N = 80 systems (adjusted R-squared = 0.983): Comps/kWh = exp(0.4401939 × Year − 849.1617), for an average doubling time of 1.57 years from 1946 to 2009.]

Physical laws dictate that any nonlinear transformation of the kinds performed by standard AND/OR logic
gates must necessarily expend energy, and a lower bound for this energy is known to be on the order of kT, where k ≈ 1.38 × 10⁻²³ J/K is the Boltzmann constant and T is the operating temperature in kelvins (Landauer 1961). This lower bound has come to be known as Landauer's principle (Bennett 2003). Although, in principle, this minimum should be reachable, current computing devices are a great deal less efficient in the amount of energy they dissipate.

The energy used by a computing device is the product of the average power drawn and the computation time. This relationship suggests two ways of reducing the energy consumption: using lower-power technologies (e.g., Jaberipur et al. 2018) and using faster algorithms. Use of faster algorithms is, of course, highly desirable, as it has performance implications as well. So, there is a great deal of ongoing research to find ever-faster algorithms for computational problems of interest. Power consumption per device has been going down due to Dennard scaling, a law devised in 1974 that maintains the relative constancy of MOSFET power density as transistors get smaller, a trend predicted by Moore's law (Brock and Moore 2006; McMenamin 2013).
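Two of the quantitative claims above are easy to check numerically: the order-kT Landauer limit at room temperature (the minimum for erasing one bit is kT·ln 2) and the 1.57-year efficiency doubling time implied by the regression slope 0.4401939 from the figure. A quick sketch, using the standard value of the Boltzmann constant and assuming T = 300 K:

```python
import math

# Landauer's bound: erasing one bit costs at least kT*ln(2) joules.
k = 1.380649e-23          # Boltzmann constant, J/K
T = 300.0                 # assumed room temperature, K
landauer_j = k * T * math.log(2)
print(f"kT*ln2 at 300 K ~= {landauer_j:.2e} J")      # ~2.87e-21 J

# Doubling time implied by the figure's exponential fit
# Comps/kWh = exp(0.4401939 * Year - 849.1617):
doubling_years = math.log(2) / 0.4401939
print(f"doubling time ~= {doubling_years:.2f} years")  # ~1.57 years
```

Both numbers agree with the text: the bound is roughly twenty orders of magnitude below the per-operation energy of practical hardware, and the fitted slope reproduces the 1.57-year doubling time quoted in the figure.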
The power dissipated by computing devices consists of static and dynamic components. Static power is the power used by devices that are in the off state and ideally should not be drawing any power; however, leakage and other factors lead to some power waste, which is increasing in significance as we use smaller and denser circuits. Dynamic power (Benini et al. 2001) is proportional to the square of the operating voltage, the operating frequency, the circuit capacitance, and the activity (prevalence of signal-value changes). Low-voltage circuits use much less power, but the closer we get to the threshold voltage, the slower and less reliable circuit operation becomes (Kaul et al. 2012). Thus, there is a limit to the power savings attainable by reducing the operating voltage.

Reasons for energy waste include glitching (Devadas et al. 1992), unnecessary signal-value transitions that can be avoided by suitable encodings (e.g., Musoll et al. 1998), and not clocking circuitry that is not involved in performing useful computations (Wu et al. 2000). Practical notions in this regard include energy-proportional computing, that is, the desirable property that the energy used is commensurate with the amount of useful work performed (Barroso and Holzle 2007), and dynamic frequency and/or voltage scaling (Nakai et al. 2005) in an effort to save energy when application requirements do not demand the highest performance level.

A data center's energy efficiency is influenced by three factors, attributed to the facility, server conversion efficiency, and efficiency at the level of circuits and devices (Barroso et al. 2013). Only about a third of the energy is used for productive computation (Holzle 2017). Generally speaking, large-scale installations provide more opportunities for energy fine-tuning, thus underlining the importance of data centers and cloud computing in handling big-data loads. More on this later.

Parallel processing is nearly a necessity for the handling of large data volumes, on performance grounds. It is also helpful for reducing power consumption. Using m processors/cores, each with 1/m the speed, is more energy-efficient than using a single high-performance processor. Thus, if an application offers enough parallelism to use multiple processors/cores efficiently, it will lead to higher performance and/or lower energy consumption. Beginning with dual-core chips in the mid-2000s, multi-core chips (Gepner and Kowalik 2006) have allowed performance scaling to continue unabated, despite Moore's law, in its original formulation, becoming invalid (Mack 2011).

While the use of a large number of simple cores carries energy-efficiency benefits, one must also pay attention to the effect of the inherently sequential parts of a computation, which give rise to the speedup limits predicted by Amdahl's law (Parhami 1999). Perhaps a judicious mix of powerful or "brawny" and simple or "wimpy" cores (Holzle 2010) in a multi-core chip, possibly also including application-specific performance boosters (Hardavellas et al. 2011), can help mitigate this problem.

The ultimate in energy efficiency is achieved when the processors/cores operate very near the threshold voltage (Dreslinski et al. 2010; Gautschi 2017; Hubner and Silano 2016). As the operating voltage is scaled down, performance is reduced nearly linearly, whereas active energy dissipation goes down quadratically. Therefore, performance per unit of energy improves. When the static or leakage power is taken into account, the savings become less pronounced but still significant. Near-threshold computing has pitfalls in terms of computation stability and system reliability, so there is a sensitive trade-off to be performed. There is also the issue of giving up too much in sequential performance in the hopes of regaining some of it via parallel processing (Mudge and Holzle 2010).

Dynamic random-access memory (DRAM) chips account for a significant fraction of the power
2006) and to data-center climate control systems (Holzle 2017), but a great deal more can be done.

Cross-References

Big Data and Exascale Computing
Computer Architecture for Big Data
Emerging Hardware Technologies
Parallel Processing with Big Data

References

Abts D, Marty MR, Wells PM, Klausler P, Liu H (2010) Energy proportional datacenter networks. ACM SIGARCH Comput Archit News 38(3):338–347
Assuncao MD, Calheiros RN, Bianchi S, Netto MAS, Buyya R (2015) Big data computing and clouds: trends and future directions. J Parallel Distrib Comput 79:3–15
Barroso LA, Holzle U (2007) The case for energy-proportional computing. IEEE Comput 40(12):33–37
Barroso LA, Clidara J, Holzle U (2013) The data center as a computer: an introduction to the design of warehouse-scale machines. Synth Lect Comput Archit 8(3):1–154
Benini L, De Micheli G (2002) Networks on chips: a new SoC paradigm. IEEE Comput 35(1):70–78
Benini L, De Micheli G, Macii E (2001) Designing low-power circuits: practical recipes. IEEE Circuits Syst Mag 1(1):6–25
Bennett CH (2003) Notes on Landauer's principle, reversible computation and Maxwell's Demon. Stud Hist Philos Mod Phys 34(3):501–510
Bolla R, Bruschi R, Davoli F, Cucchietti F (2011) Energy efficiency in the future internet: a survey of existing approaches and trends in energy-aware fixed network infrastructures. IEEE Commun Surv Tutorials 13(2):223–244
Brock DC, Moore GE (eds) (2006) Understanding Moore's law: four decades of innovation. Chemical Heritage Foundation, London
Denning PJ, Lewis YG (2017) Exponential laws of computing growth. Commun ACM 60(1):54–65
Devadas S, Keutzer K, White J (1992) Estimation of power dissipation in CMOS combinational circuits using boolean function manipulation. IEEE Trans Comput Aided Des Integr Circuits Syst 11(3):373–383
Dhiman G, Rosing TS (2006) Dynamic power management using machine learning. In: Proceedings of IEEE/ACM international conference on computer-aided design, pp 747–754
Dreslinski RG, Wieckowski M, Blaauw D, Sylvester D, Mudge T (2010) Near-threshold computing: reclaiming Moore's law through energy efficient integrated circuits. Proc IEEE 98(2):253–266
Eisenberg A (2010) Bye-Bye batteries: radio waves as a low-power source. New York Times, July 18, p BU3. http://www.nytimes.com/2010/07/18/business/18novel.html
Feller E, Ramakrishnan L, Morin C (2015) Performance and energy efficiency of big data applications in cloud environments: a Hadoop case study. J Parallel Distrib Comput 79:80–89
Gautschi M (2017) Design of energy-efficient processing elements for near-threshold parallel computing. Doctoral thesis, ETH Zurich
Gepner P, Kowalik MF (2006) Multi-core processors: new way to achieve high system performance. In: Proceedings of IEEE international symposium on parallel computing in electrical engineering, pp 9–13
Hammadi A, Mhamdi L (2014) A survey on architectures and energy efficiency in data center networks. Comput Commun 40:1–21
Hardavellas N, Ferdman M, Falsafi B, Ailamaki A (2011) Toward dark silicon in servers. IEEE Micro 31(4):6–15
Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Ullah Khan S (2015) The rise of 'Big Data' on cloud computing: review and open research issues. Inf Syst 47:98–115
Heinzelman WR, Chandrakasan A, Balakrishnan H (2000) Energy-efficient communication protocol for wireless microsensor networks. In: Proceedings of 33rd Hawaii international conference on system sciences, IEEE, pp 10
Holzle U (2010) Brawny cores still beat wimpy cores, most of the time. IEEE Micro 30(4):23–24
Holzle U (2017) Advances in energy efficiency through cloud and ML. Energy leadership lecture, University of California, Santa Barbara
Hubner M, Silano C (eds) (2016) Near threshold computing: technology, methods, and applications. Springer, Cham
Jaberipur G, Parhami B, Abedi D (2018) Adapting computer arithmetic structures to sustainable supercomputing in low-power, majority-logic nanotechnologies. IEEE Trans Sustain Comput. https://doi.org/10.1109/TSUSC.2018.2811181
Kaul H, Anders M, Hsu S, Agarwal A, Krishnamurthy R, Borkar S (2012) Near-threshold voltage (NTV) design – opportunities and challenges. In: Proceedings of 49th ACM/EDAC/IEEE design automation conference, pp 1149–1154
Kaushik RT, Nahrstedt K (2012) T*: a data-centric cooling energy costs reduction approach for big data analytics cloud. In: Proceedings of international conference on high performance computing, networking, storage and analysis, pp 1–11
Koomey JG, Berard S, Sanchez M, Wong H (2011) Implications of historical trends in the electrical efficiency of computing. IEEE Ann Hist Comput 33(3):46–54
Krishnamachari L, Estrin D, Wicker S (2002) The impact of data aggregation in wireless sensor networks. In: Proceedings of 22nd international conference on distributed computing systems, pp 575–578
728 Energy Wall
Key Research Findings

Cloud computing gained momentum in the 2000s, and this opened possibilities for collecting and analyzing big datasets. In 2004, Google published a paper about their MapReduce framework (Dean and Ghemawat 2004) which soon became very popular. It offered an easy way to make a highly scalable and fault-tolerant program for a cluster of hundreds or thousands of computers. The developer only had to define two functions: map, which processes different parts of the input in parallel and produces intermediate key/value pairs, and reduce, which for a given intermediate key and all its values produces a result. Open source versions of MapReduce and its distributed file system were implemented in Apache's Hadoop and HDFS, respectively, as well as in other frameworks. MapReduce was used not only for analysis of data but also for preprocessing of data such as cleansing and transformations. IBM reported that many customers used MapReduce to preprocess data before loading it into a DW managed by the DB2 DBMS (Özcan et al. 2011). While MapReduce is conceptually simple, there are still many technical aspects to pay attention to when using it, and the developer uses a general-purpose programming language (such as Java for Apache Hadoop) to express what to do. Further high-level functionality (such as joins) is not built in. It can thus be tedious and error prone to write MapReduce programs. As a remedy, Facebook proposed Hive (Thusoo et al. 2009) where the user can use the high-level SQL-like HiveQL language. HiveQL queries are then compiled into MapReduce jobs by Hive's query compiler. Hive also supports so-called external tables with data written to files by other applications and can be extended to read other input formats. It can also use user-defined functions (UDFs) written in Java. It is thus possible to use Hive to do much ETL/ELT processing of data.

Another abstraction on top of MapReduce is Pig (Olston et al. 2008) where the developer specifies sequences of transformations in a high-level procedural language called Pig Latin. Pig is made for parallel data processing and can operate on relational, nested, and unstructured data. It has built-in functionality for joining, grouping, ordering, filtering, etc. and can also be extended by UDFs. Pig thus also became popular for transforming and cleaning data in ETL flows. Pig can now execute on MapReduce as well as other platforms. Sawzall (Pike et al. 2005) is a similar tool created by Google.

ETLMR (Liu et al. 2013) was proposed as a high-level dimensional ETL tool which loads data into a traditional relational database. It executes on MapReduce but hides the complexity of MapReduce. The developer writes Python code to specify sources, targets, and transformations. ETLMR has built-in DW-specific support for handling dimension tables and fact tables. The developer can thus, for example, find changes and possibly insert a new version of a member in a slowly changing dimension (Kimball 2008) with a single method call. With ETLMR, dimension processing is always done first, followed by fact processing. The developer can choose between different dimension processing schemes deciding how dimension data is partitioned between different reducers. CloudETL (Liu et al. 2014) has a similar approach but is aimed at loading data into Hive. The developer defines the ETL flow by high-level constructs in Java, and the system then handles the parallelization. It remedies the fact that it is hard to do dimensional ETL processing in Hive and Pig, which are made for batch processing rather than individual look-ups or inserts and which do not support UPDATEs, such that slowly changing dimensions are very difficult to implement although they are widely used in traditional DWs. For big dimensions (i.e., dimensions with many members), CloudETL also supports a map-only job where data locality in the distributed file system is exploited to avoid the need for shuffling of data from mappers to reducers.
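The slowly changing dimension handling that tools like ETLMR and CloudETL automate can be illustrated with a small type-2 update in plain Python. This is a sketch of the general technique only, not the API of either tool; the table layout and field names (`sk`, `cust_id`, `valid_from`, `valid_to`) are invented for illustration:

```python
# Type-2 slowly changing dimension: instead of overwriting a changed
# attribute, close the current version of the member and insert a new
# version. The "dimension table" is a plain list of dicts here.

def scd2_update(dim_table, business_key, new_attrs, load_date):
    """Close the current version of the member if its attributes changed,
    append a new version, and return the surrogate key of the valid row."""
    current = next((row for row in dim_table
                    if row["cust_id"] == business_key
                    and row["valid_to"] is None), None)
    if current is not None:
        if all(current[k] == v for k, v in new_attrs.items()):
            return current["sk"]           # no change: keep current version
        current["valid_to"] = load_date    # close the old version
    new_row = {"sk": len(dim_table) + 1, "cust_id": business_key,
               "valid_from": load_date, "valid_to": None, **new_attrs}
    dim_table.append(new_row)
    return new_row["sk"]

dim = []
scd2_update(dim, 17, {"city": "Aalborg"}, "2018-01-01")
scd2_update(dim, 17, {"city": "Aarhus"}, "2018-06-01")  # changed: new version
# dim now holds two versions; only the second one is still open
```

Fact rows loaded after the change reference the new surrogate key, so facts loaded earlier keep pointing to the historical attribute values; this is exactly what is hard to express in systems without UPDATE support.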
Many other cluster computing frameworks have been proposed after MapReduce. Recently, Spark (Zaharia et al. 2012) has become very popular, and it is now a very active Apache project. Spark can keep datasets in memory as so-called resilient distributed datasets (RDDs) between jobs and avoids the expensive I/O which MapReduce has to do. Spark does not use MapReduce but is closely integrated with Hadoop and works with HDFS. Spark has APIs for different languages including Scala, Java, and Python and gives the developer much flexibility in creating transformations. With the Spark SQL module (Armbrust et al. 2015), it is possible to manipulate data using both procedural and relational APIs. The relational API, however, enables richer optimizations and works both with Spark's RDDs and external data sources. To enable better optimizations, the use of the relational operations in Spark SQL builds a logical plan which is first executed when data must be output. The relational operations can be expressed in a domain-specific language. To perform the optimizations, Spark SQL provides a relational optimizer named Catalyst which is extensible such that support for new data sources and optimization rules as well as user-defined functions and data types can be added.

With the Dataflow model, Google introduced a high-level, simple programming model for scalable parallel data processing with "the ability to dial in precisely the amount of latency and correctness for any specific problem domain given the realities of the data and resources at hand" (Akidau et al. 2015). In the model, data is not viewed as anything that eventually will be complete, and the same operators can thus be used for bounded data and unbounded streaming data. The physical execution engine is abstracted away under the programming model, and different execution engines can thus be used. Apache Beam (https://beam.apache.org/) is an open source version of Dataflow and also supports a number of different execution engines.

BigDansing (Khayyat et al. 2015) is a data cleansing system for Big Data. The user expresses a logical plan by means of five logical operators which are UDFs allowing great expressibility. The operators are defined at the finest unit of the input data (e.g., rows for relational data and triplets for RDF) to allow highly parallel execution of the operators. BigDansing translates a logical plan into an optimized physical plan which in turn is translated into an execution plan for a specific execution engine such as MapReduce or Spark. The user thus only specifies the UDFs in Java code and does not need expertise on the specific execution engine.

It is generally known that it is very time-consuming to transform and clean data to prepare it for analysis. Traditionally, such preprocessing has been done by specialized IT staff when creating the ETL flow. In the Big Data context where new and complex data sources often must be explored and used, this is impractical. Wrangler (Kandel et al. 2011) is an interactive system where the user can be a business analyst rather than an IT professional. The user specifies transformation scripts in an interactive way by selecting data and choosing among suggested, configurable transforms and can immediately review the results. When the necessary transformations have been defined, a Wrangler script can be exported to MapReduce and Python programs. Wrangler has been commercialized in the product Trifacta which also works with cloud platforms. In early 2017, the service Google Dataprep was launched to allow users to prepare data visually. It is based on Trifacta's software.

Another interesting system which is not oriented toward specialized IT staff is Data Tamer (Stonebraker et al. 2013) (commercialized as Tamr). Data Tamer uses machine learning and statistics to do schema integration and entity consolidation (i.e., deduplication). Only when needed is a human expert asked. The human expert is a business user and not a programmer. By using automation, Data Tamer avoids the need for human specification of complex rules and can scale to the huge number of data sources present in the Big Data era.
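The deferred execution behind Spark SQL's logical plans can be mimicked in a few lines of plain Python. This is a toy sketch of the general idea only, not Spark's actual API; the class and method names are invented:

```python
# Toy "logical plan": relational operations only record what should be
# done; nothing is computed until the data must be output, which is
# what gives an optimizer the chance to rewrite the whole plan first.

class Plan:
    def __init__(self, rows, ops=()):
        self._rows, self._ops = rows, ops

    def filter(self, pred):        # lazily record a selection
        return Plan(self._rows, self._ops + (("filter", pred),))

    def select(self, *cols):       # lazily record a projection
        return Plan(self._rows, self._ops + (("select", cols),))

    def collect(self):             # only here is the plan executed
        rows = self._rows
        for kind, arg in self._ops:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

plan = (Plan([{"id": 1, "amount": 10}, {"id": 2, "amount": 99}])
        .filter(lambda r: r["amount"] > 50)
        .select("id"))
# No work has happened yet; collect() triggers execution:
result = plan.collect()   # [{'id': 2}]
```

In Spark SQL, Catalyst rewrites such a recorded plan (e.g., reordering or combining operators) before execution; this laziness is what makes the richer optimizations of the relational API possible.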
Examples of Application

Organizations often have to deal with large amounts of complex and dirty data to be prepared for analysis. With the large amounts of data, dirty data is the norm, and the data is often semi-structured or unstructured. The data can, e.g., be open data from publicly available web sources, data harvested from social media, licensed commercial data, log data about page views or ad clicks, and sensor data from IoT devices. ETL pipelines are used to cleanse such data and integrate it with existing data to enable analysis. The ETL pipelines can be created in an ETL tool but are often made of homegrown solutions exploiting a variety of other Big Data tools such as Pig, Spark, etc.

Future Directions for Research

At the time of this writing, a number of Big Data ETL tools are available. Optimization and automatic tuning of ETL pipelines for Big Data is an area for further research. There will most likely be an increasing demand for ETL tools for Big Data which are efficient and also easy to use for less technically inclined users to enable self-service. Supporting the user in discovering relevant datasets and structures of complex, dirty data is another area. As web data often changes structure, the development of self-repairing ETL flows is also an interesting direction.

Cross-References

Apache Spark
Data Cleaning
Data Deduplication
Data Integration
Data Wrangling
Hadoop
Hive
Spark SQL

References

Akidau T, Bradshaw R, Chambers C, Chernyak S, Fernández-Moctezuma RJ, Lax R, McVeety S, Mills D, Perry F, Schmidt E et al (2015) The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc VLDB Endow 8(12):1792–1803
Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A et al (2015) Spark SQL: relational data processing in Spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 1383–1394
Dageville B, Cruanes T, Zukowski M, Antonov V, Avanes A, Bock J, Claybaugh J, Engovatov D, Hentschel M, Huang J, Lee AW, Motivala A, Munir AQ, Pelley S, Povinec P, Rahn G, Triantafyllis S, Unterbrunner P (2016) The Snowflake elastic data warehouse. In: Proceedings of the 2016 international conference on management of data, SIGMOD'16, New York. ACM, pp 215–226. ISBN:978-1-4503-3531-7. http://doi.acm.org/10.1145/2882903.2903741
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: 6th symposium on operating system design and implementation (OSDI 2004), San Francisco, 6–8 Dec 2004, pp 137–150. http://www.usenix.org/events/osdi04/tech/dean.html
Gupta A, Agarwal D, Tan D, Kulesza J, Pathak R, Stefani S, Srinivasan V (2015) Amazon Redshift and the case for simpler data warehouses. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD'15, New York. ACM, pp 1917–1923. ISBN:978-1-4503-2758-9. http://doi.acm.org/10.1145/2723372.2742795
Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 3363–3372
Khayyat Z, Ilyas IF, Jindal A, Madden S, Ouzzani M, Papotti P, Quiané-Ruiz J-A, Tang N, Yin S (2015) BigDansing: a system for big data cleansing. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 1215–1230
Kimball R (2008) The data warehouse lifecycle toolkit. Wiley, Hoboken
Liu X, Thomsen C, Pedersen TB (2013) ETLMR: a highly scalable dimensional ETL framework based on MapReduce. In: Hameurlain A, Küng J, Wagner RR (eds) Transactions on large-scale data- and knowledge-centered systems VIII. Springer, Heidelberg/New York, pp 1–31
Liu X, Thomsen C, Pedersen TB (2014) CloudETL: scalable dimensional ETL for Hive. In: Proceedings of the 18th international database engineering & applications symposium. ACM, pp 195–206
Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig Latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, pp 1099–1110
Özcan F, Hoa D, Beyer KS, Balmin A, Liu CJ, Li Y (2011) Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1161–1164
Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298
Simitsis A, Vassiliadis P (2017) Extraction, transformation, and loading. Springer, New York, pp 1–9.
Event Log Cleaning for Business Process Analytics
Event log cleaning is a data preparation phase that turns event data into event logs to enable or improve the quality of business process analytics methods like process mining, model enrichment, and conformance checking. Event data might have to be collected from different sources and formats, filtered, transformed, and assigned to the corresponding processes and cases.

Overview

The goal of business process analytics projects is to gain insights into the execution of business processes. It can help to know which questions should be answered by the analysis. Some typical questions are what is done (activities) and when is it done or how long does it take (time).

Before starting any cleaning procedure, the analysts need to understand the nature of the given event logs. To that end, a model of data quality is helpful. Different models of data quality exist in the literature that look at the dimensions of accuracy, completeness, and consistency of data, cf. (Suriadi et al. 2017, Section 2.2). To get a better understanding in the process analytics context, it is helpful to distinguish inherent characteristics of the business process and system-dependent characteristics that appear in the recorded data (International Organization for Standardization 2011).

Process Characteristics
When looking at the business process itself, the following list of challenges and solution
approaches helps in cleaning event logs and preparing them for analytics.

Volume. The simplest reason for large sizes of event logs is the high number of cases that are processed in large organizations. Certain log cleaning approaches perform costly operations and can take a lot of time to run.
Strategies: Divide and conquer. Stream processing.
Approaches: Scalable approaches need to be chosen for handling large event logs. At times, the available implementations of an approach are not optimized for handling huge event logs, as they require loading the event log into computer memory. In these cases, approaches with implementations that rely on divide and conquer and stream processing technologies are preferable, e.g., Leemans (2017).
Variability. Many process execution environments require a certain degree of flexibility to perform the tasks at hand. For example, doctors need to react and decide which action to take next on a case-by-case basis. This flexibility can create a high variability in the event log, which needs to be taken into account for analytics and when cleaning the event logs (Reichert and Weber 2012). Variability can occur on the case level, that is, a lot of different paths are taken to reach one business objective. Variability can also occur in the number of different activities that can be performed.
Strategies: Divide and conquer. Clustering behavior. Focusing on frequent behavior (infrequent behavior in the event log, however, is not always irrelevant and can be separately analyzed).
Approaches: Trace clustering is proposed in Greco et al. (2006) and applied to complex and variable processes in Mans et al. (2008) to analyze the clusters separately with process mining. A more recent approach specifically addressing high variability takes the divide and conquer approach to the limit by focusing on individual cases (Diamantini et al. 2016).
Veracity. The process of recording events in a business process can itself be subject to errors. Examples of erroneous event log data appear due to human recording of events or due to noisy sensors tracking the progress in a process. The problem of stripping noisy and irrelevant data from event logs is one of the most important challenges in cleaning event logs for process analytics.
Strategies: Remove outliers. Repair data (with or without external knowledge). Store raw data for other analyses.
Approaches: Under the assumption that noisy data is less frequent than regular data, the events can be filtered by frequency. The concept of anomaly detection is applied to single cases in an event log in de Lima Bezerra and Wainer (2013). Here, mainly frequencies are used, as also in Conforti et al. (2017). If a process model exists, the event log can be aligned to it to identify and correct unforeseen behavior (de Leoni et al. 2015), effectively making the log fit the model with minimum corrections.
Change. Business processes can be subject to evolution or sudden changes to adapt to changing environments. The phenomenon is also called concept drift in data mining (Gama et al. 2014). Undetected and untreated, different variants of a single process over time can lead to overly complex process models and faulty analysis results.
Strategies: Divide and conquer. Partition the event log based on the underlying process.
Approaches: To partition the event log correctly, the concept drifts need to be localized first. One of the first attempts to detect concept drifts in business processes is Bose et al. (2014). By analyzing event logs and statistically comparing the relations between events in two adjacent sliding windows, significant differences can be detected, revealing concept drifts. More advanced methods to detect process changes over time are described in Ostovar et al. (2017).
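The frequency-based filtering strategy mentioned under Veracity can be sketched in a few lines of Python. This is a generic illustration of the assumption that noise is less frequent than regular behavior, not the algorithm of any of the cited approaches; the `min_support` threshold is an invented parameter:

```python
# Filter an event log by variant frequency: traces whose activity
# sequence (variant) occurs in fewer than `min_support` cases are
# treated as noise and removed.
from collections import Counter

def filter_infrequent_variants(log, min_support):
    """log: list of traces, each trace a list of activity names."""
    counts = Counter(tuple(trace) for trace in log)
    return [trace for trace in log if counts[tuple(trace)] >= min_support]

log = [["register", "check", "pay"]] * 5 + \
      [["register", "pay", "check"]]          # occurs once: likely noise
cleaned = filter_infrequent_variants(log, min_support=2)
# the five frequent traces remain; the single deviating trace is removed
```

Filtering at the variant level keeps whole cases intact; the cited approaches refine this by detecting anomalies within single cases or by aligning the log against a process model instead of relying on frequency alone.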
Event Log Characteristics
Besides the process characteristics, the circumstances of recording process execution data, as well as the extraction of data from systems, affect the event log. Thus, the nature of event logs and their generation must be well understood to select and apply the appropriate event log cleaning techniques.

Different Processes. When extracting process data from information systems, the extracted event logs can contain events from different processes or different process variants.
Strategies and Approaches: See the point Variability under Process Characteristics above.
Different Sources. Real business processes are often supported by multiple different IT systems. Hence, the number and heterogeneity of sources from which data is integrated in the event log can complicate the analysis (Song et al. 2016). Dealing with multiple sources can create temporal inconsistencies, semantic ambiguities, or also different levels of granularity of recording. Because each of these issues can also occur with only a single event source, they are listed as separate points.
Strategies: Awareness of potential issues. Apply a tailored data cleaning solution.
Approaches: The unification of different sources is largely a manual and time-consuming task. This is an ongoing research challenge, and mostly partial solutions to subproblems exist. The analyst needs to pick the appropriate techniques to transform the different sources into a common format. Approaches exist in the domain of schema matching (Rahm and Bernstein 2001).
Time stamps. When event logs are integrated from different sources, inconsistencies can be caused by the clocks of different systems being out of sync. Untreated, these temporal issues may lead to incorrect ordering of events in a unified event log. Currently, this problem requires the analyst to investigate potential time differences within systems and apply a correction to restore synchronicity. If the systems are spread over different time zones, additional complexity in harmonizing the time stamps stems from different treatment of daylight saving time.
Strategies: Detect anomalies. Repair data.
Approaches: Correct time stamps based on rules. Inconsistencies can be detected by anomaly detection methods (e.g., using probabilistic models (Rogge-Solti and Kasneci 2014) or temporal constraints (Song et al. 2016)).
Correlation. Typically, events of a single business case can be identified by a correlating attribute. For example, the "order id" of a customer order identifies the case, and all related events carry this information as an attribute. In some cases, however, this information is missing, and it can be unclear how to correlate events in an event log.
Strategies: Find correlation anchors (the correlation anchor can change within a case). Repair data.
Approaches: In the simpler case, correlation is possible through attributes. An approach to identify correlation attributes is Nezhad et al. (2011). It can be used for correlating events belonging to the same cases. In the more complicated case, such attributes cannot be found. Even then, there are attempts to solve this problem (Bayomie et al. 2016; Pourmirza et al. 2017). The latter techniques work with dependencies between activities and try to find a mapping between cases and events that is consistent (Bayomie et al. 2016) or even optimal with respect to the temporal variations in the process (Pourmirza et al. 2017).
Completeness. Completeness issues can appear on different aggregation levels, meaning that entire cases, single events, or also event attributes can be missing from an event log.
Strategies: Impute missing data. Remove incomplete cases.
Approaches for missing cases: It can happen that entire cases are missing from event logs. The detection of missing cases is nontrivial, however. A chance to detect missing cases is by carefully analyzing idle times of resources for patterns of successive idleness. These "blank spots" in the schedules could indicate a missing case. Further indications might be found by cross-checking the event log records with other sources (e.g., the accounting books).
Approaches for missing events: If single events are missing from an event log, causal relations between events can be used to deduce which events are missing. For example, a case with a "leave room" event but without a logically preceding "enter room" event points at a missing record. Logic-based missing event completion techniques are used in Bertoli et al. (2013) and Francescomarino et al. (2015). Also, process models can be used to fill in the blanks (de Leoni et al. 2015; Wang et al. 2016).
Approaches for missing attributes: The problem of missing attributes is especially common when the event documentation is done by humans. Two approaches are common: the minimum change principle (Song et al. 2016; Mannhardt et al. 2016a) or the maximum likelihood principle (Rogge-Solti et al. 2013; Yakout et al. 2013).
Granularity. There is not always a 1:1 relationship between events in an event log and business activities that analysts are interested in. Events can be recorded in a higher frequency than required. Also, the contrary can be the case. For example, temporal information can be cloaked or recorded on a daily granularity only.
Strategies: Group low-level events into higher-level activities. Infer relations.
Approaches:
(a) Dealing with the event granularity mismatch. A bridge between the low-level events and the high-level business events needs to be built by establishing a mapping. Rule-based approaches were the first to address the problem of n:m relations between events and activities (Baier et al. 2014). If low-level events follow a certain structure, a pattern-based mapping is possible between low-level events and high-level events (Mannhardt et al. 2016b). Also, works based on raw sensor data exist that convert location data into process instances by solving an optimization problem and taking potential process knowledge into account (e.g., temporal constraints, resource assignments, precedes relationships) (Senderovich et al. 2016).
(b) Dealing with the attribute granularity mismatch. When time stamps are so coarse grained that the order of events cannot be inferred by looking at the time stamps, events need to be treated as partially ordered. One idea is to guess the order of events by exploiting ordering information from other cases within the event log (Lu et al. 2014).
Ambiguity. Event labels can be ambiguous, and the relation between a business activity in a process model and an event can be unclear. For example, two events with the same label (or name) can indicate different activities in a process depending on their context.
Strategies: Relabel ambiguous terms.
Approaches: Semantic ambiguities are a highly complex problem, and only specific solutions exist, but techniques to relabel certain activities with the same label based on their context greatly facilitate the process mining and analytics part (e.g., Song et al. 2009; de San Pedro and Cortadella 2016).
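The causal-relation idea behind missing-event detection (the "enter room"/"leave room" example above) can be sketched as a simple rule check in Python. This is a generic illustration, not the logic-based techniques of the cited works; the rule table is invented:

```python
# Detect missing events with simple causal rules: each rule states that
# an event of one type must be preceded by an event of another type.
PRECEDES = {"leave room": "enter room",          # hypothetical rules
            "pay invoice": "receive invoice"}

def find_missing_events(trace):
    """trace: list of activity names of one case, in recorded order.
    Returns activities that should have occurred earlier but did not."""
    missing, seen = [], set()
    for activity in trace:
        required = PRECEDES.get(activity)
        if required is not None and required not in seen:
            missing.append(required)
        seen.add(activity)
    return missing

find_missing_events(["enter room", "leave room"])   # -> []
find_missing_events(["leave room", "pay invoice"])  # -> ['enter room', 'receive invoice']
```

The cited logic-based techniques generalize this idea with richer reasoning (e.g., action languages) and can also decide where in the trace the deduced event should be inserted.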
Further Reading

In this entry, a condensed view of event log cleaning methods is provided. Deeper insights can be retrieved from the textbook on process mining (van der Aalst 2016, Chapter 4). Additionally, the process mining manifesto emphasizes the need for handling event log quality issues to improve process analytics results. It defines a five-level maturity model for the assessment of event logs. The scale ranges from "★" lowest quality (e.g., paper-based medical records) to "★★★★★" highest quality (e.g., semantically annotated logs of BPM systems) (van der Aalst et al. 2011).

A collection of so-called event log imperfection patterns lists common problems that occur in real projects and also provides recipes to overcome these problems (Suriadi et al. 2017). Another related work specifically addresses the preprocessing of event logs required for process mining (Bose et al. 2013). Last, the approach proposed in de Leoni et al. (2016) can solve data cleaning problems on the basis of data mining tools by projecting event logs into data tables. This approach makes a large share of the data manipulation and analytics tool set of the data mining field (Han et al. 2011) available in the process mining context.

Conclusion

The quality of an event log is of utmost importance for accurate process analytics. Data cleaning needs to be done with care, and each performed step needs to be documented. Furthermore, access to the raw event data can help to evaluate the effect of the data cleansing steps and enables analysts to replace certain cleaning approaches. Automatically handling different quality issues is an active research field with more and more novel and improving approaches appearing recently.

Cross-References

Conformance Checking
Data Quality and Data Cleansing of Semantic Data
Data Cleaning

References

Baier T, Mendling J, Weske M (2014) Bridging abstraction layers in process mining. Inf Syst 46:123–139. https://doi.org/10.1016/j.is.2014.04.004
Bayomie D, Awad A, Ezat E (2016) Correlating unlabeled events from cyclic business processes execution. In: Advanced information systems engineering – 28th international conference, CAiSE 2016, Ljubljana, 13–17 June 2016. Proceedings, pp 274–289. https://doi.org/10.1007/978-3-319-39696-5_17
Bertoli P, Francescomarino CD, Dragoni M, Ghidini C (2013) Reasoning-based techniques for dealing with incomplete business process execution traces. In: AI*IA 2013: advances in artificial intelligence – XIIIth international conference of the Italian association for artificial intelligence, Turin, 4–6 Dec 2013. Proceedings, pp 469–480. https://doi.org/10.1007/978-3-319-03524-6_40
Bose JCJC, Mans RS, van der Aalst WMP (2013) Wanna improve process mining results? In: IEEE symposium on computational intelligence and data mining, CIDM 2013, Singapore, 16–19 Apr 2013, pp 127–134. https://doi.org/10.1109/CIDM.2013.6597227
Bose RPJC, van der Aalst WMP, Zliobaite I, Pechenizkiy M (2014) Dealing with concept drifts in process mining. IEEE Trans Neural Netw Learn Syst 25(1):154–171. https://doi.org/10.1109/TNNLS.2013.2278313
Conforti R, Rosa ML, ter Hofstede AHM (2017) Filtering out infrequent behavior from business process event logs. IEEE Trans Knowl Data Eng 29(2):300–314. https://doi.org/10.1109/TKDE.2016.2614680
de Leoni M, Maggi FM, van der Aalst WMP (2015) An alignment-based framework to check the conformance of declarative process models and to preprocess event-log data. Inf Syst 47:258–277. https://doi.org/10.1016/j.is.2013.12.005
de Leoni M, van der Aalst WMP, Dees M (2016) A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs. Inf Syst 56:235–257. https://doi.org/10.1016/j.is.2015.07.003
de Lima Bezerra F, Wainer J (2013) Algorithms for anomaly detection of traces in logs of process aware information systems. Inf Syst 38(1):33–44. https://doi.org/10.1016/j.is.2012.04.004
de San Pedro J, Cortadella J (2016) Discovering duplicate tasks in transition systems for the simplification of process models. In: Business process management – 14th international conference, BPM 2016, Rio de Janeiro, 18–22 Sept 2016. Proceedings, pp 108–124. https://doi.org/10.1007/978-3-319-45348-4_7
Diamantini C, Genga L, Potena D, van der Aalst WMP (2016) Building instance graphs for highly variable processes. Expert Syst Appl 59:101–118. https://doi.org/10.1016/j.eswa.2016.04.021
Dumas M, Rosa ML, Mendling J, Reijers HA (2013) Fundamentals of business process management. Springer. https://doi.org/10.1007/978-3-642-33143-5
Francescomarino CD, Ghidini C, Tessaris S, Sandoval IV (2015) Completing workflow traces using action languages. In: Advanced information systems engineering – 27th international conference, CAiSE 2015, Stockholm, 8–12 June 2015, Proceedings, pp 314–330. https://doi.org/10.1007/978-3-319-19069-3_20
Gama J, Zliobaite I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):44:1–44:37. http://doi.acm.org/10.1145/2523813
Greco G, Guzzo A, Pontieri L, Saccà D (2006) Discovering expressive process models by clustering log traces. IEEE Trans Knowl Data Eng 18(8):1010–1027. https://doi.org/10.1109/TKDE.2006.123
Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann. http://hanj.cs.illinois.edu/bk3/
International Organization for Standardization (2011) Software engineering – Software Product Quality Requirements and Evaluation (SQuaRE) – Guide to SQuaRE
Leemans SJ (2017) Robust process mining with guarantees. PhD thesis, Eindhoven University of Technology. https://pure.tue.nl/ws/files/63890938/20170509_Leemans.pdf
Lu X, Fahland D, van der Aalst WMP (2014) Conformance checking based on partially ordered event data. In: Business process management workshops – BPM 2014 international workshops, Eindhoven, 7–8 Sept 2014, Revised papers, pp 75–88. https://doi.org/10.1007/978-3-319-15895-2_7
Mannhardt F, de Leoni M, Reijers HA, van der Aalst WMP (2016a) Balanced multi-perspective checking of process conformance. Computing 98(4):407–437. https://doi.org/10.1007/s00607-015-0441-1
Mannhardt F, de Leoni M, Reijers HA, van der Aalst WMP, Toussaint PJ (2016b) From low-level events to activities – a pattern-based approach. In: Business process management – 14th international conference, BPM 2016, Rio de Janeiro, 18–22 Sept 2016. Proceedings, pp 125–141. https://doi.org/10.1007/978-3-319-45348-4_8
Mans RS, Schonenberg H, Song M, van der Aalst WMP, Bakker PJM (2008) Application of process mining in healthcare – a case study in a Dutch hospital. In: Biomedical engineering systems and technologies, international joint conference, BIOSTEC 2008, Funchal, Madeira, 28–31 Jan 2008, Revised selected papers, pp 425–438. https://doi.org/10.1007/978-3-540-92219-3_32
Nezhad HRM, Saint-Paul R, Casati F, Benatallah B (2011) Event correlation for process discovery from web service interaction logs. VLDB J 20(3):417–444. https://doi.org/10.1007/s00778-010-0203-9
Ostovar A, Maaradji A, Rosa ML, ter Hofstede AHM (2017) Characterizing drift from event streams of business processes. In: Advanced information systems engineering – 29th international conference, CAiSE 2017, Essen, 12–16 June 2017, Proceedings, pp 210–228. https://doi.org/10.1007/978-3-319-59536-8_14
Pourmirza S, Dijkman RM, Grefen P (2017) Correlation miner: mining business process models and event correlations without case identifiers. Int J Coop Inf Syst 26(2):1–32. https://doi.org/10.1142/S0218843017420023
Rogge-Solti A, Kasneci G (2014) Temporal anomaly detection in business processes. In: Business process management – 12th international conference, BPM 2014, Haifa, 7–11 Sept 2014. Proceedings, pp 234–249. https://doi.org/10.1007/978-3-319-10172-9_15
Rogge-Solti A, Mans R, van der Aalst WMP, Weske M (2013) Improving documentation by repairing event logs. In: The practice of enterprise modeling – 6th IFIP WG 8.1 working conference, PoEM 2013, Riga, 6–7 Nov 2013, Proceedings, pp 129–144. https://doi.org/10.1007/978-3-642-41641-5_10
Senderovich A, Rogge-Solti A, Gal A, Mendling J, Mandelbaum A (2016) The ROAD from sensor data to process instances via interaction mining. In: Advanced information systems engineering – 28th international conference, CAiSE 2016, Ljubljana, 13–17 June 2016. Proceedings, pp 257–273. https://doi.org/10.1007/978-3-319-39696-5_16
Song JL, Luo TJ, Chen S, Liu W (2009) A clustering based method to solve duplicate tasks problem. J Grad School Chin Acad Sci 26(1):107–113
Song S, Cao Y, Wang J (2016) Cleaning timestamps with temporal constraints. PVLDB 9(10):708–719. http://www.vldb.org/pvldb/vol9/p708-song.pdf
Suriadi S, Andrews R, ter Hofstede AHM, Wynn MT (2017) Event log imperfection patterns for process mining: towards a systematic approach to cleaning event logs. Inf Syst 64:132–150. https://doi.org/10.1016/j.is.2016.07.011
van der Aalst WMP (2016) Process mining – data science in action, 2nd edn. Springer. https://doi.org/10.1007/978-3-662-49851-4
van der Aalst WMP, Adriansyah A, de Medeiros AKA, Arcieri F, Baier T, Blickle T, Bose RPJC, van den Brand P, Brandtjen R, Buijs JCAM, Burattin A, Carmona J, Castellanos M, Claes J, Cook J, Costantini N, Curbera F, Damiani E, de Leoni M, Delias P, van Dongen BF, Dumas M, Dustdar S, Fahland D, Ferreira DR, Gaaloul W, van Geffen F, Goel S, Günther CW, Guzzo A, Harmon P, ter Hofstede AHM, Hoogland J, Ingvaldsen JE, Kato K, Kuhn R, Kumar A, Rosa ML, Maggi FM, Malerba D, Mans RS, Manuel A, McCreesh M, Mello P, Mendling J, Montali M, Nezhad HRM, zur Muehlen M, Munoz-Gama J, Pontieri L, Ribeiro J, Rozinat A, Pérez HS, Pérez RS, Sepúlveda M, Sinur J, Soffer P, Song M, Sperduti A, Stilo G, Stoel C, Swenson KD, Talamo M, Tan W, Turner C, Vanthienen J, Varvaressos G, Verbeek E, Verdonk M, Vigo R, Wang J, Weber B, Weidlich M, Weijters T, Wen L, Westergaard M, Wynn MT (2011) Process mining manifesto. In: Business process management workshops – BPM 2011 international workshops, Clermont-Ferrand, 29 Aug 2011, Revised selected papers, part I, pp 169–194,
Rahm E, Bernstein PA (2001) A survey of approaches to https://doi.org/10.1007/978-3-642-28108-2_19
automatic schema matching. VLDB J 10(4):334–350. Wang J, Song S, Zhu X, Lin X, Sun J (2016) Effi-
https://doi.org/10.1007/s007780100057 cient recovery of missing events. IEEE Trans Knowl
Reichert M, Weber B (2012) Enabling flexibility in Data Eng 28(11):2943–2957. https://doi.org/10.1109/
process-aware information systems – challenges, meth- TKDE.2016.2594785
ods, technologies. Springer, https://doi.org/10.1007/ Yakout M, Berti-Équille L, Elmagarmid AK (2013) Don’t
978-3-642-30409-5 be scared: use scalable automatic repairing with maxi-
Rogge-Solti A, Kasneci G (2014) Temporal anomaly mal likelihood and bounded changes. In: Proceedings
detection in business processes. In: Business process of the ACM SIGMOD international conference on
Exploring Scope of Computational Intelligence in IoT Security Paradigm 739
Visualization Techniques

Eventual Consistency

TARDiS: A Branch-and-Merge Approach to Weak Consistency

Exploratory Data Analysis

Big Data Visualization Tools

(a) Data perception and collection: In this aspect, typical attacks include data leakage, sovereignty, breach, and authentication attacks.
(b) Data storage: The following attacks may occur: denial-of-service attacks (attacks on availability), access control attacks, integrity attacks, impersonation, modification of sensitive data, and so on.
(c) Data processing: In this aspect, there may exist computational attacks that aim to generate wrong data processing results.
(d) Data transmission: Possible attacks include channel attacks, session hijacking, routing attacks, flooding, and so on. Apart from attenuation, theft, loss, breach, and disaster, data can also be fabricated and modified by compromised sensors.

Health-related and medical data constitute one of the most sensitive application paradigms, where smart healthcare IoT deals with voluminous and meaningful data. As health data are highly critical for personal privacy, and relevant regulations such as HIPAA (Lever et al. 2015) must be conformed to, the uploaded data need to be encrypted. The privacy of the communication link between smartphones and cloud servers can be protected by the underlying media access control (MAC) protocols. A straightforward method is to encrypt uploaded data with off-the-shelf methods such as block ciphers, for example, AES or KASUMI (Ould-Yahia et al. 2017; Li et al. 2015). However, this may not be suitable for smartphones, because smartphones usually have energy constraints. Moreover, smartphones may be misused, lost, stolen, or hacked by attackers; thus privacy protection itself should be robust. It is therefore a critical challenge to design a lightweight and robust method to protect privacy.

As all the phases of IoT data transmission involve a substantial amount of threat and vulnerability, trust and authentication in smart systems have become crucial. Specifically, we envisage the smart health sector, where medical and health data pass through various stages influenced by human factors (e.g., doctors, patients, several medical agencies, diagnostic centers). The categories of known and unknown attacks call for a more adaptive and self-organized system, which could be fostered in embedded smart applications. Conventional techniques like encryption and security protocols may not provide adequate security support. Typically, secure embedded systems are guided by factors such as small form factor, high performance, low energy consumption (Urien 2016), and robustness to attacks. Prediction, decision planning, and action can be synchronized in these sections of health-centric IoT applications.

This proposal explores the possibilities and realization of computational intelligence techniques for empowering trust, authentication, and resilience mechanisms. The desired objective is to configure trusted and secure elements for smart IoT by using fuzzy logic, Bayesian attack graphs, intelligent learning, hybrid intelligent methods (e.g., combining fuzzy logic and neural networks), and natural computing (specifically ant colony and bee colony optimization and swarm intelligence). The highlight of computational intelligence is to provide security measures and action plans against unknown and uncertain threats and attacks, considering the design constraints of secure elements (low energy consumption and high performance). The level of trust can be measured and mined from different profiles of users, contexts, and places of application deployment. Depending on the design and adaptability of applications, one or more intelligent techniques can be embedded in the design of a single secure element. Secure elements can also be ranked from lower to higher security levels depending on the intelligent techniques deployed. Until this article was formulated, there was no documented chronological survey of such prototypes available.

State-of-the-Art & Contemporary Applications

As mentioned, there are many existing research and industrial applications to counter different security attacks for connected objects. However, most of them depend either on conventional cryptography or on secured protocol design (Zhao 2013; Bao and Chen 2012; Lize et al. 2014; Yan et al. 2014). There are a few approaches where fuzzy logic has been used for trust management and access control (Mahalle et al. 2013). Still, commercialization of those intelligent security measures is not directly implemented with IoT and smart applications (WhitePaper 2017). Interestingly, there is a good
zero-day safety metric proposed to rank the degree of vulnerabilities (Wang et al. 2014). This mapping can also be made predictable. The present proposal emphasizes the aspects of prediction and of configuring a decision support system. In addition, it is worth scanning the detailed applications or products that exist from a relevant industrial perspective; the next section elaborates this issue.

Subsequently, there is substantial scope to configure and customize the existing suite of smart IoT protocols. Considering low energy consumption as the prime criterion, some of the following can be targeted for intelligent algorithms and customization:

• CoAP: Constrained Application Protocol
• MQTT: Message Queue Telemetry Transport
• XMPP: Extensible Messaging and Presence Protocol
• RESTful services: Representational State Transfer
• AMQP: Advanced Message Queuing Protocol
• WebSockets

Computational Intelligence: A Road Map for Security in IoT and Connected Systems

The road map of using computational intelligence in this proposal could be as follows:

Fuzzy Logic-Based Trust and Authentication System
A trust and authentication matrix can be developed using fuzzy logic, where uncertainty about the trust placed in a user can be measured and a discrete value of trust can be evaluated; the embedded software for secure elements can be developed in an embedded programming environment. The embedded fuzzy program with fuzzy operators can work on user behavior and profile data and can aggregate a discrete decision from a fuzzy referral index, known as a membership function. The secure elements are realized as secure microcontrollers with low power consumption (<200 mW) and a crypto processor that accelerates cryptographic processing. Fuzzy logic-driven secure elements can be positioned with Datagram Transport Layer Security (DTLS) or Transport Layer Security (TLS) stacks. The trust application can be scripted in Python Datagram Transport Layer Security (https://pypi.python.org/pypi/Dtls), which as a whole is well supported by SSL. However, for secure elements, Java Card syntax, a subset of the Java language, is popularly used (Javaid et al. 2016).

Concept of Attack Graph for Interconnected Systems
Attack graph generation has recently been used to assess the risk of interconnected networks (Lever et al. 2015), applying Bayesian attack graphs (BAGs) in real time on secure elements to calculate the probability that an attacker can reach each state (condition) in the graph. Instead of attempting to rank unknown vulnerabilities, the proposed BAG metric counts how many such vulnerabilities would be required to compromise network assets (Wang et al. 2014); a larger count implies more security, since the likelihood of having more unknown vulnerabilities available, applicable, and exploitable all at the same time will be significantly lower. Categorically, the IoT malware families, e.g., BASHLITE and Mirai, can be analyzed to determine common properties, which can be used for detection through probability values of the BAG. An attribute database can even be prepared offline for other IoT malware families such as Hajime, NyaDrop, and LuaBot; this will assist in classifying the types of malware attacks online. The malware mentioned is also known as IoT botnets. The BAG algorithm can be developed and emulated through observe-orient-decide-act (OODA) to match the requirements of smart healthcare applications. This will be a decision dashboard that assists IoT management. Finally, a security risk measure matrix can be evaluated, and based on the recommendation, OODA can be triggered. A high-level programming description will be sufficient.
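As an illustrative sketch (not part of the original proposal), the fuzzy aggregation described above, membership functions over user behavior and profile metrics combined with fuzzy operators into a discrete trust decision, might look as follows; the metric names and thresholds are hypothetical:

```python
# Illustrative sketch: triangular membership functions fuzzify two
# user-behavior metrics, fuzzy operators (min/max) aggregate them,
# and the dominant membership degree yields a discrete trust label.

def tri(x, a, b, c):
    """Triangular membership function peaking at b over (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trust_level(login_regularity, data_anomaly_rate):
    """Both inputs in [0, 1]; returns a discrete trust decision."""
    # Membership in "low trust": irregular logins OR many anomalies.
    low = max(tri(login_regularity, -0.5, 0.0, 0.5),
              tri(data_anomaly_rate, 0.5, 1.0, 1.5))
    # Membership in "high trust": regular logins AND few anomalies.
    high = min(tri(login_regularity, 0.5, 1.0, 1.5),
               tri(1.0 - data_anomaly_rate, 0.5, 1.0, 1.5))
    # Defuzzify: pick the dominant membership degree.
    return "trusted" if high >= low else "untrusted"

print(trust_level(0.9, 0.1))  # regular logins, few anomalies
```

A real secure-element implementation would of course use calibrated membership functions mined from the user, context, and deployment profiles mentioned above.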
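The BAG computation described under "Concept of Attack Graph for Interconnected Systems", propagating the probability that an attacker reaches each state, can be sketched on a toy graph; the node names and edge probabilities are invented, and the noisy-OR combination treats attack paths as independent, which is a simplification:

```python
# Hypothetical sketch of a tiny Bayesian attack graph: each edge carries
# the probability that its exploit succeeds, and a state is reached if
# any incoming path succeeds (noisy-OR over parents).

edges = {  # parent -> [(child, exploit success probability)]
    "internet":  [("gateway", 0.6)],
    "gateway":   [("sensor_db", 0.3), ("actuator", 0.2)],
    "sensor_db": [("actuator", 0.5)],
}

def reach_probabilities(edges, source):
    """Propagate reachability probabilities through the DAG."""
    prob = {source: 1.0}
    order = ["internet", "gateway", "sensor_db", "actuator"]  # topological
    for node in order:
        p_node = prob.get(node, 0.0)
        for child, p_exploit in edges.get(node, []):
            p_via = p_node * p_exploit
            # noisy-OR: child is compromised via any of its parents
            prob[child] = 1.0 - (1.0 - prob.get(child, 0.0)) * (1.0 - p_via)
    return prob

p = reach_probabilities(edges, "internet")
print(p)
```

Such per-state probabilities are what an OODA-style decision dashboard could monitor and use to trigger countermeasures.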
Hybrid Model and Intelligent Learning
In this paradigm of computational intelligence, more than one intelligent model can be combined to ensure the detection of malware, attack-prone components, and intrusions. For example, the artificial fish swarm algorithm (Lin et al. 2014) can be implemented to analyze various features of possible intrusions, such as distance, vision, neighbor, center, crowded degree, and maximum iterations. Here, with support vectors used for classifying new data, let Fi represent fish i and Ci represent the center of Fi; the embedded applications for configuring secure elements are then:

• Initialization
• Fitness evaluation
• Movement dynamics of fish swarms

At first, a feature subset is randomly assigned to N fishes. All parameters, including vision, maximal crowded degree, and maximal trial number, are defined. For example, if eight fishes were initiated, each fish has its own feature, and the circle represents the vision of fish i. The next step is to evaluate the classification value as a fitness value of the feature subset of each fish.

Starting with the first fish, the follow step is implemented. If follow is successful, the optimal fitness value of a fish serves as a highly secured indicator; for example, in Fig. 1, the fitness value of fish i is 55, whereas the best-fitness neighbor exhibits a value of 80. Thus, the best-fitness neighbor demonstrates a superior fitness value, indicating that a superior fish is located in the vision of fish i. Therefore, the follow step is successful, and fish i moves to the location of the best-fitness neighbor, which can be declared a secure zone in real time (refer to Fig. 1).

Exploring Scope of Computational Intelligence in IoT Security Paradigm, Fig. 1 Dynamic movement of fish for connected systems: each fish carries a binary feature string and a fitness value (e.g., fish i 01101101 fit(55); its best-fitness neighbor 00101100 fit(80)). (Scheme courtesy Lin et al. 2014)

Trusted and low-power objects would ideally be based on a Linux operating system, for example, a Raspberry Pi board with the Debian distribution. Here, through middleware, the proposed algorithm could be developed (Beal et al. 2015; Viroli et al. 2016) with a recently emerging technique known as "aggregate programming" (Beal et al. 2015). Potential attacks can be detected and warnings disseminated robustly with just a few lines of Protelis code dynamically deployed and executed on individual devices by a middleware application. Such a short program is resilient and adaptive, enabling it to effectively estimate how attack-prone a device is (which could be partitioned into several fuzzy values like low, moderately low, medium, very high, etc.) and distribute warnings while executing on numerous mobile devices. Mobile Python could be an ideal choice for embedding the application for secure elements.

Natural Computing for Resilience and Self-Organized Mechanism
The vertical applications of natural computing are driven by social insects like ants and bees or by swarms. Modeling of an ant colony system (ACS) under a random undirected graph can also define IoT configuration and empirical measurement of trust and security. Measured values are evaluated from the topological mapping of pheromone communication of ants across the graph of interconnected sensors (Ould-Yahia et al. 2017). In the dynamic contexts of sensors (in the case of body-wearable sensors), it will be more appropriate to quantify contextual or referential intelligence according to the locations of the sensors. The degree of security or trust can vary with respect to the sensors' locations.

Computational intelligence can be plugged into smart health security. Smart health of interconnected devices often needs intelligent control policies (Ayala et al. 2012) to manage the different contexts that IoT sensors sense. Examples of such capabilities are control policies for managing a self-controlled swarm of IoT devices (Viroli et al. 2016) (hierarchical vs. collaborative control strategies) or context-aware control of devices and adaptation of their behavior according to the environment. Two relevant settings are the mobile wireless body sensor network (WBSN) and participatory sensing. A mobile WBSN comprises multiple sensor nodes that are implanted in (or attached on) a human body to monitor physiological indicators, such as electrocardiogram (ECG), electroencephalography (EEG), glucose, toxins, blood pressure, and so on. Data usually need to be uploaded to a central database (e.g., a cloud computing server) instantly so that doctors or nurses can remotely access them for real-time diagnosis and emergency response. As most people possess a smartphone on which customized applications can be installed, it is convenient and economical to use a smartphone as a gateway between the WBSN and cloud servers. Participatory sensing also follows a suitable authentication scheme, and here the self-organized swarms can be programmed and deployed for access control of the application. Conventionally, the one-time mask (OTM) scheme and the one-time permutation (OTP) scheme are both reasonably efficient, from the energy consumption point of view of an IoT system, for protecting the data of a WBSN. As an additional security cover, swarm agents can be used to anticipate the sensing scheme; based on the design of swarm control, these encryption schemes can be made functional.

Currently, there are agent technologies that can be embedded in devices of the IoT, such as Jade-Leap and Agent Factory Micro Edition (Muldoon et al. 2006). The embedded development may consider multiple agent technologies, like Jade (Muldoon et al. 2006), Jadex, Jade-Leap, and Mobile Agent Platform (Bellifemine et al. 2001). In this case, the agents are the managers of the sensors and make their data available to the IoT. Thus they can be adapted to new devices (e.g., Symbian, Android, Sun SPOT, Libelium) and network technologies (e.g., WiFi, Bluetooth, ZigBee, XBee, NFC) and extended with additional features, such as goal orientation (Aiello et al. 2011) and self-management (Ayala et al. 2013).

Conclusion and Scope of Future Research

The paradigm of interconnected devices, sensors, and IoT gives rise to manifold applications that require a substantial security cover to facilitate big data-driven interactions. Static measures and conventional forms of security, e.g., cryptography, may not be effective. Hence, different dimensions of computationally intelligent systems can be deployed in the IoT environment, and importantly, these approaches are well supported by embedded system components for IoT. Approximation measures (fuzzy logic-based systems and attack graphs), automated learning, and social insect-driven approaches have already been implemented, and thus a new generation of IoT can be more dynamically secured and protected from different anticipated security breaches. At present, more machine learning-based methodologies like deep learning (Javaid et al. 2016; Li et al. 2015) are planned to be incorporated for malicious code detection in IoT applications.
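The follow step of the artificial fish swarm algorithm described under "Hybrid Model and Intelligent Learning" (Lin et al. 2014) can be sketched in a few lines; the vision threshold and fitness table mirror the Fig. 1 example but are otherwise illustrative:

```python
# Illustrative sketch (loosely after Lin et al. 2014): each fish holds a
# binary feature-subset string with a fitness value; in the follow step,
# a fish moves to its best-fitness neighbor within its vision if that
# neighbor is fitter, mirroring the Fig. 1 example (fit 55 -> fit 80).

def hamming(a, b):
    """Number of differing bits between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def follow_step(fish, swarm, fitness, vision=3):
    """Return the position fish occupies after one follow step."""
    neighbors = [f for f in swarm if 0 < hamming(fish, f) <= vision]
    if not neighbors:
        return fish
    best = max(neighbors, key=fitness)
    return best if fitness(best) > fitness(fish) else fish

swarm = ["00101100", "00111101", "10101101", "01101101"]
fit = {"00101100": 80, "00111101": 65, "10101101": 70, "01101101": 55}.get
fish_i = "01101101"                       # fitness 55, as in Fig. 1
print(follow_step(fish_i, swarm, fit))    # moves to the fit(80) fish
```

In the intrusion-detection setting, the fitness function would be the classification accuracy of the feature subset, evaluated with the support vector classifier mentioned above.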
References

Aiello F, Fortino G, Gravina R, Guerrieri A (2011) A Java-based agent platform for programming wireless sensor networks. Comput J 54(3):439–454
Ayala I, Pinilla MA, Fuentes L (2012) Exploiting dynamic weaving for self-managed agents in the IoT. In: German conference on multiagent system technologies. Springer, pp 5–14
Ayala I, Amor M, Fuentes L (2013) Self-configuring agents for ambient assisted living applications. Pers Ubiquit Comput 17(6):1159–1169
Bao F, Chen IR (2012) Dynamic trust management for internet of things applications. In: Proceedings of the 2012 international workshop on self-aware internet of things. ACM, pp 1–6
Beal J, Pianini D, Viroli M (2015) Aggregate programming for the internet of things. Computer 48(9):22–30
Bellifemine F, Poggi A, Rimassa G (2001) Developing multi-agent systems with a FIPA-compliant agent framework. Softw Pract Exp 31(2):103–128
Javaid A, Niyaz Q, Sun W, Alam M (2016) A deep learning approach for network intrusion detection system. In: Proceedings of the 9th EAI international conference on bio-inspired information and communications technologies (formerly BIONETICS). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), pp 21–26
Lever KE, Kifayat K, Merabti M (2015) Identifying interdependencies using attack graph generation methods. In: 2015 11th international conference on innovations in information technology (IIT). IEEE, pp 80–85
Li Y, Ma R, Jiao R (2015) A hybrid malicious code detection method based on deep learning. Methods 9(5):205–216
Lin KC, Chen SY, Hung JC (2014) Botnet detection using support vector machines with artificial fish swarm algorithm. J Appl Math 2014:1–9
Lize G, Jingpei W, Bin S (2014) Trust management mechanism for internet of things. China Commun 11(2):148–156
Mahalle PN, Thakre PA, Prasad NR, Prasad R (2013) A fuzzy approach to trust based access control in internet of things. In: 2013 3rd international conference on wireless communications, vehicular technology, information theory and aerospace & electronic systems (VITAE). IEEE, pp 1–5
Muldoon C, O'Hare GM, Collier R, O'Grady MJ (2006) Agent factory micro edition: a framework for ambient applications. In: International conference on computational science. Springer, pp 727–734
Ould-Yahia Y, Banerjee S, Bouzefrane S, Boucheneb H (2017) Exploring formal strategy framework for the security in IoT towards e-health context using computational intelligence. In: Internet of things and big data technologies for next generation healthcare. Springer, pp 63–90
Urien P (2016) Three innovative directions based on secure elements for trusted and secured IoT platforms. In: 2016 8th IFIP international conference on new technologies, mobility and security (NTMS). IEEE, pp 1–2
Viroli M, Audrito G, Damiani F, Pianini D, Beal J (2016) A higher-order calculus of computational fields. arXiv preprint arXiv:1610.08116
Wang L, Jajodia S, Singhal A, Cheng P, Noel S (2014) k-zero day safety: a network security metric for measuring the risk of unknown vulnerabilities. IEEE Trans Dependable Secure Comput 11(1):30–44
WhitePaper (2017) IoT 2020: smart and secure. International Electrotechnical Commission, Geneva
Yan Z, Zhang P, Vasilakos AV (2014) A survey on trust management for internet of things. J Netw Comput Appl 42:120–134
Zhao YL (2013) Research on data security technology in internet of things. Appl Mech Mater 433:1752–1755

Extensible Record Store

NoSQL Database Systems

Extract-Transform-Load

ETL
Feature Learning from Social Graphs, Fig. 1 A Twitter-based community formed by the backers of Kickstarter (Rakesh et al. 2015). The colors represent different topical categories of communities, and the names denote the top users of the community
essential to develop models that can embed different layers of information in a condensed feature space. The second challenge is the scalability of models. Although various methods of graph embeddings have been proposed in the field of machine learning (Belkin and Niyogi 2002; Tenenbaum et al. 2000), these methods are suitable only for small networks with a few thousand nodes and edges. Modern social networks, on the other hand, typically contain millions of nodes and billions of edges. For example, as of 2017, the popular social networking website Facebook has over 2 billion users, while the microblogging website Twitter has over 330 million users (www.statista.com). Therefore, it is extremely important to develop algorithms that scale to networks of this size.

To overcome the aforementioned challenges, researchers have proposed different algorithms that are highly effective in representing very large networks in a condensed feature space (Newman 2006b; Tang and Liu 2009; Perozzi et al. 2014; Grover and Leskovec 2016; Tang et al. 2015). This entry focuses on explaining three
Feature Learning from Social Graphs 747
most popular models, namely, SocDim (Tang and Liu 2009), DeepWalk (Perozzi et al. 2014), and node2vec (Grover and Leskovec 2016), that have become the state-of-the-art techniques in the field of network representation learning (NRL). These algorithms have proven to be highly scalable in capturing the latent representations from large graphs while simultaneously encoding the social similarity and community membership.

Definitions

The graph notation: A graph is denoted by $G = (V, E, F, Y)$, where $v \in V$ are the nodes (or actors), $E$ are the edges of the network, $F \in \mathbb{R}^{|V| \times f}$ are the attributes, where $f$ is the size of the feature space for each attribute, and $Y \in \mathbb{R}^{|V| \times y}$ are the labels, where $y$ is the size of each label vector.

Random walks: Originally known as Brownian motion in space, the random walk is named after the botanist Robert Brown (www.wikipedia.org/wiki/Brownian_motion), who in 1826 peered at a drop of water using a microscope and observed tiny particles (such as pollen grains and other impurities) in it performing strange random-looking movements. A random walk on an undirected graph $G = (V, E)$ can be perceived as a stochastic process that starts from a given vertex and then selects one of its neighbors uniformly at random to visit. The transition probability from vertex $v_i$ to $v_j$ is defined as follows:

$$p(v_i, v_j) = \begin{cases} \dfrac{1}{\deg(v_i)}, & v_j \text{ a neighbor of } v_i \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

Key Research Findings

The Social Dimension Model
One of the earlier works on NRL was proposed by Tang and Liu (2009) to extract the latent dimensions (or features) that are indicative of the homophily in the network. To achieve this, the authors propose a model called SocDim that performs two separate functions: (a) extract latent social dimensions based on network connectivity and (b) construct a discriminative classifier that uses the latent features for predicting the node labels. Since the main focus of this entry is on network representation learning, the classifier part is excluded from the description. When developing algorithms for learning graph representations, it is important to make sure that the latent features capture the following properties:

• Informative: The social dimensions should be indicative of affiliations between actors.
• Plural: The same actor can get involved in multiple affiliations, thus appearing in different social dimensions.
• Continuous: The actors might have different degrees of association to one affiliation. Hence, a continuous value rather than a discrete {0, 1} is more favorable.

Algorithm 1: SocDim(A, g, m)
1 Input: interaction matrix A, node degree g, number of edges m
2 Output: latent feature matrix $\Phi \in \mathbb{R}^{|V| \times d}$
3 B = GetModularityMat(A, g, m)
4 E = GetTopEigenVec(B)
5 for each v ∈ V do
6   Φ(v) = GetLatentSocDim(E, v)
7 end
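The uniform random walk of Eq. (1), which underlies DeepWalk-style neighborhood sampling, can be sketched in a few lines of Python; the toy graph below is purely illustrative:

```python
# Sketch of the uniform random walk in Eq. (1): from each vertex the
# walk moves to one of its neighbors with probability 1/deg(v).

import random

# Toy undirected graph as an adjacency list.
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}

def random_walk(graph, start, length, rng=random):
    """Sample a walk of `length` vertices rooted at `start`."""
    walk = [start]
    while len(walk) < length:
        # Uniform choice over neighbors realizes p(v_i, v_j) = 1/deg(v_i).
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

random.seed(0)
w = random_walk(graph, start=1, length=5)
print(w)
```

Repeating such walks from every node yields the corpus of node sequences that DeepWalk feeds to a word2vec-style model.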
Feature Learning from Social Graphs, Fig. 2 Shows the three main steps of the DeepWalk algorithm, where (a) denotes a random walk sample $W_{v_4}$ rooted at node $v_4$, (b) shows the neighborhood window of node $v_1$ and the latent feature mapping $\Phi(v_1)$, and (c) the hierarchical softmax approximation technique, which reduces the computational complexity of the posterior $Pr(v_j \mid \Phi(v_1))$ by restricting the co-occurrence probability of $v_1$ to just $v_3$ and $v_5$ (Image courtesy of Perozzi et al. 2014 and Perozzi 2016)
$v_5$ forms a community with $v_6$, $v_7$, and $v_8$. On the other hand, although $v_1$ and $v_5$ are from two distinct communities, they share the same structural equivalency by acting as hub nodes. Therefore, to capture such community-specific properties, the node2vec algorithm defines a biased form of random walk that aims to maximize the likelihood of preserving the network neighborhood of nodes.

Given a node $v \in V$ and its neighbors $N_S(v) \subset V$ generated using a sampling strategy $S$, the node2vec model aims to optimize the following function:

$$\max_{\Phi} \sum_{v \in V} \log Pr(N_S(v) \mid \Phi(v)) \quad (7)$$

The above equation is very similar to the objective of the DeepWalk model (Eq. 6). Nonetheless, contrary to DeepWalk, which models the neighborhood $N_S(v)$ using a sliding window over consecutive nodes, node2vec does not restrict the samples to the immediate neighbors of $v$. Instead, it allows many different structures to be extracted, depending on the sampling strategy $S$.

Flexible biased random walk: As stated earlier, the aim of the node2vec algorithm is to create sampling techniques that can capture both the structural equivalence and the homophily in graphs. This is achieved using a combination of the breadth-first search (BFS) and depth-first search (DFS) graph exploration strategies. BFS captures the structural equivalence by sampling the immediate neighbors of the given source node, while DFS captures the homophily between the nodes by sequentially sampling nodes at increasing distances from the source node. For example, in Fig. 3, for a source node $v_1$ and neighborhood size $w = 3$, BFS samples nodes $v_2, v_3, v_4$ and DFS samples $v_3, v_6, v_5$. Furthermore, since BFS samples a very small section of a neighborhood, the nodes tend to repeat themselves in multiple neighborhood sets; consequently, this reduces the variance in characterizing the distribution of 1-hop nodes with respect to the source nodes. Contrary to this technique, DFS samples the neighborhood on a macro-level by exploring large parts of the network, which results in capturing the homophily between the nodes.

Inspired by the above graph exploration strategies, node2vec proposes a second-order random walk with a transition probability that facilitates smooth transitions between BFS and DFS. The transition probability between nodes $v_1$ and $v_2$ is defined as follows:

$$\alpha(v_1, v_2) = \begin{cases} \dfrac{1}{p}, & \text{if } d_{v_1,v_2} = 0 \\ 1, & \text{if } d_{v_1,v_2} = 1 \\ \dfrac{1}{q}, & \text{if } d_{v_1,v_2} = 2 \end{cases} \quad (8)$$

where $d_{v_1,v_2}$ denotes the shortest path between $v_1$ and $v_2$. Parameters $p$ and $q$ control how fast the walk explores and leaves the neighborhood of the starting node; in other words, they allow the search procedure to interpolate between BFS and DFS. The working of node2vec is illustrated in Algorithm 3. Similar to DeepWalk, the algorithm begins exploring its neighbors using a random walk of length $\ell$, and this procedure is repeated $K$ times (i.e., lines 7–11). The novelty of node2vec is described by the function node2vecWalk that
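Under the stated definition, the second-order bias of Eq. (8) might be sketched as follows; the toy graph and the choice of $p$ and $q$ are assumptions for illustration:

```python
# Sketch of the second-order bias in Eq. (8): given the previous vertex
# t and current vertex v, each neighbor x of v is weighted by 1/p if it
# equals t (distance 0), by 1 if t and x are adjacent (distance 1), and
# by 1/q otherwise (distance 2).

# Toy undirected graph as an adjacency list.
graph = {1: [2, 3, 5], 2: [1, 3], 3: [1, 2, 6], 5: [1, 6], 6: [3, 5]}

def transition_weights(graph, t, v, p=1.0, q=2.0):
    """Unnormalized node2vec transition weights out of v, coming from t."""
    weights = {}
    for x in graph[v]:
        if x == t:
            weights[x] = 1.0 / p   # return to the previous vertex
        elif x in graph[t]:
            weights[x] = 1.0       # stay close to t (BFS-like)
        else:
            weights[x] = 1.0 / q   # move outward from t (DFS-like)
    return weights

# Walk arrived at vertex 3 from vertex 1; q > 1 discourages exploring
# outward, biasing the walk toward BFS-like sampling.
print(transition_weights(graph, t=1, v=3))
```

Normalizing these weights gives the actual transition distribution of the biased walk; large $q$ keeps the walk local (structural equivalence), while large $p$ and small $q$ push it outward (homophily).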
752 Feature Learning from Social Graphs

the sparsity of links between questions and answers. However, modern Q&A websites such as Quora and Stack Exchange provide additional information in the form of user communities where a reliable user may be followed by other users. Therefore, in Fang et al. (2016), the authors use DeepWalk to extract the social information of users to solve the sparsity problem and combine it with a deep recurrent neural network which models the textual contents of questions and answers. In Chen et al. (2016), the authors propose a heterogeneous preference embedding approach (HPE) for music playlist recommendation. The input to the HPE model is a graph that is composed of users, songs, and playlists, and the output is a low-dimensional embedding of the user preference vector. This preference vector is then used for recommending songs and playlists to users.

Future Directions for Research

This entry presents some recent trends in network representation learning by introducing three popular algorithms, namely, SocDim, DeepWalk, and node2vec. Future improvements and extensions to these algorithms can be categorized into the following research directions: (a) extending the model for multiple types of vertices, (b) embedding explicit domain-specific features of nodes and edges into a latent space, (c) modifying (or replacing) the feature learning process with recent techniques from deep neural networks, and (d) extending the models for signed networks.

References

Belkin M, Niyogi P (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in neural information processing systems, pp 585–591
Brandes U, Delling D, Gaertler M, Görke R, Hoefer M, Nikoloski Z, Wagner D (2006) Maximizing modularity is hard. arXiv preprint physics/0608255
Cao S, Lu W, Xu Q (2016) Deep neural networks for learning graph representations. In: AAAI, pp 1145–1152
Chakrabarti D, Faloutsos C (2006) Graph mining: laws, generators, and algorithms. ACM Comput Surv (CSUR) 38(1):2
Chang S, Han W, Tang J, Qi GJ, Aggarwal CC, Huang TS (2015) Heterogeneous network embedding via deep architectures. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 119–128
Chen CM, Tsai MF, Lin YC, Yang YH (2016) Query-based music recommendations via preference embedding. In: Proceedings of the 10th ACM conference on recommender systems. ACM, pp 79–82
Fang H, Wu F, Zhao Z, Duan X, Zhuang Y, Ester M (2016) Community-based question answering via heterogeneous social network learning. In: Thirtieth AAAI conference on artificial intelligence
Grover A, Leskovec J (2016) node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 855–864
Jones I, Tang L, Liu H (2015) Community discovery in multi-mode networks. Springer International Publishing, pp 55–74. https://doi.org/10.1007/978-3-319-23835-7_3
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Newman ME (2006a) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E 74(3):036104
Newman ME (2006b) Modularity and community structure in networks. Proc Natl Acad Sci 103(23):8577–8582
Nowicki K, Snijders TAB (2001) Estimation and prediction for stochastic blockstructures. J Am Stat Assoc 96(455):1077–1087
Perozzi B (2016) Local modeling of attributed graphs: algorithms and applications. PhD thesis, State University of New York at Stony Brook
Perozzi B, Al-Rfou R, Skiena S (2014) DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 701–710
Rakesh V, Choo J, Reddy CK (2015) Project recommendation using heterogeneous traits in crowdfunding. In: ICWSM, pp 337–346
Tang L, Liu H (2009) Relational learning via latent social dimensions. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 817–826
Tang L, Liu H, Zhang J, Nazeri Z (2008) Community evolution in dynamic multi-mode networks. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 677–685
Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q (2015) LINE: large-scale information network embedding. In: Proceedings of the 24th international conference on World Wide Web. International World Wide Web Conferences Steering Committee, pp 1067–1077
Tenenbaum JB, De Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323
Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416
Zafarani R, Abbasi MA, Liu H (2014) Social media mining: an introduction. Cambridge University Press, New York

Federated RDF Query Processing

Maribel Acosta¹, Olaf Hartig², and Juan Sequeda³
¹Institute AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany
²Linköping University, Linköping, Sweden
³Capsenta, Austin, TX, USA

Synonyms

Federation of SPARQL endpoints; SPARQL federation

Definitions

Federated RDF query processing is concerned with querying a federation of RDF data sources where the queries are expressed in a declarative query language (typically, the RDF query language SPARQL) and the data sources are autonomous and heterogeneous. The current literature in this context assumes that the data and the data sources are semantically homogeneous, while heterogeneity occurs at the level of data formats and access protocols.

Overview

In its initial version, the SPARQL query language did not have features to explicitly express queries over a federation of RDF data sources. To support querying such a federation without requiring the use of specific language features, the following assumption has been made throughout the literature: the result of executing any given SPARQL query over the federation should be the same as if the query were executed over the union of all the RDF data available in all the federation members.

Formalization of SPARQL Over Federated RDF Data

To capture this assumption formally (Stolpe 2015), it is necessary to introduce a few basic concepts first. The core component of each SPARQL query is its query pattern. In the simplest case, such a query pattern is a so-called triple pattern, that is, a tuple tp = (s, p, o) where s and o may each be a URI, an RDF literal, or a variable, and p may be either a URI or a variable. Multiple triple patterns can be combined into a set, which is called a basic graph pattern (BGP) and resembles the notion of a conjunctive query. In addition to such BGPs, several other features are possible in SPARQL query patterns (e.g., UNION, FILTER). For the complete syntax, refer to the SPARQL specification (Harris et al. 2013).

The expected result of any given SPARQL query is defined based on an evaluation function [[·]]_G that, given an RDF graph G, takes a SPARQL query pattern and returns a set (or multiset) of solution mappings, that is, partial functions that associate variables with RDF terms (i.e., URIs, literals, or blank nodes). For instance, for a triple pattern tp, the result [[tp]]_G contains every solution mapping whose domain is the set of all variables in tp and for which replacing the variables in tp according to the mapping produces an RDF triple that is in G. The evaluation function [[·]]_G is defined recursively over the structure of all query patterns (Pérez et al. 2009).

To capture the aforementioned assumption and, thus, define SPARQL over a federation of RDF data sources (rather than over a single RDF graph), it is possible to introduce another evaluation function [[·]]_F where F is a given federation. Formally, such a federation may be captured as a tuple F = (E, d) where E is the
set of all data sources in the federation and d is a function that associates every source e ∈ E with the RDF graph that can be accessed via e. Then, the new evaluation function [[·]]_F can be defined in terms of the standard evaluation function [[·]]_G as follows: for every SPARQL query pattern P, [[P]]_F is the set of mappings such that [[P]]_F = [[P]]_G, where G is the RDF graph that is the (virtual) union of the graphs of all federation members; i.e., G = ⋃_{e ∈ E} d(e).

Types of RDF Data Sources

Note that the aforementioned definition is deliberately generic in what it considers to be possible federation members. Most of the current solutions for federated query processing over RDF data have focused on federations of SPARQL endpoints. Nonetheless, federations of RDF data sources may be composed of other types of sources, including RDF documents or Triple Pattern Fragments.

with the result of the rest of the query pattern in which the SERVICE clause is contained.

Federation of RDF Documents. This type of federation composes a Web of (Linked) Data in which the federation members are RDF documents that can be retrieved by URI dereferencing. In URI dereferencing, clients use URIs (typically found in the data) to perform HTTP requests which result in retrieving documents that contain a set of RDF triples. The process of evaluating queries over a federation of RDF documents is referred to as Linked Data query execution (Hartig 2013), and URI dereferencing is a core task of this process. Hartig (2012) has shown that, by using this process, producing the complete query result as defined by the aforementioned evaluation function [[·]]_F cannot be guaranteed, and that there exist more restrictive approaches to define an evaluation function that can be computed completely over a federation of interlinked RDF documents.
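The two evaluation functions above can be illustrated with a small sketch. The following Python model is a minimal illustration under simplifying assumptions, not the code of any federation engine: an RDF graph is modeled as a set of (s, p, o) tuples, variables are marked by a leading "?", and a federation is a dict from member names to graphs; all identifiers in the example are hypothetical.

```python
# Minimal sketch of [[tp]]_G and [[tp]]_F (illustrative only).
# Assumptions: a graph is a set of (s, p, o) tuples; terms starting
# with '?' are variables; a federation maps member names to graphs.

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def eval_triple_pattern(tp, graph):
    """[[tp]]_G: all solution mappings that turn tp into a triple of G."""
    mappings = []
    for triple in graph:
        mapping, ok = {}, True
        for pattern_term, data_term in zip(tp, triple):
            if is_var(pattern_term):
                # A repeated variable must bind to the same term
                if mapping.get(pattern_term, data_term) != data_term:
                    ok = False
                    break
                mapping[pattern_term] = data_term
            elif pattern_term != data_term:
                ok = False
                break
        if ok:
            mappings.append(mapping)
    return mappings

def eval_federated(tp, federation):
    """[[tp]]_F: evaluate tp over the (virtual) union G = U d(e)."""
    union = set().union(*federation.values())
    return eval_triple_pattern(tp, union)

# Two hypothetical members that share one triple
federation = {
    "endpoint1": {("alice", "knows", "bob"), ("bob", "knows", "carol")},
    "endpoint2": {("bob", "knows", "carol"), ("carol", "age", "30")},
}
mappings = eval_federated(("?x", "knows", "?y"), federation)
```

Because the shared triple appears only once in the set union, the duplicate contributes no extra solution mapping, which matches the set-based definition of [[·]]_F above.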
(i) query decomposition and source selection, (ii) query optimization, and (iii) query execution. This section provides an overview of existing approaches to tackle these tasks. To this end, the tasks are considered separately, one after another, which is also the most common approach to perform them in many RDF federation engines. However, some engines intertwine the tasks (at least partially) to account for changing data sources and runtime conditions.

Query Decomposition and Source Selection

To execute a query over a federation, the query first has to be decomposed into subqueries that are associated with the data sources (federation members) selected for executing the subqueries. Essentially, the result of this step resembles a rewriting of the query into a SPARQL query with SERVICE clauses, and the objective is to achieve a complete query result with a minimum number of SERVICE clauses. An important notion in this context is the relevance of data sources for triple patterns; that is, a data source e ∈ E is relevant for a triple pattern tp if the evaluation of tp over the data of e is nonempty; i.e., [[tp]]_{d(e)} ≠ ∅.

An initial, not necessarily optimal approach to decompose a given SPARQL query is to treat each triple pattern in the query as a separate subquery and to associate each such subquery with all the data sources that are relevant for the respective triple pattern. A common improvement of such an initial decomposition is to merge subqueries that are associated with the same single data source; such merged subqueries are called exclusive groups (Schwarte et al. 2011).

Note that the results of the different subqueries have to be combined to produce the result of the given SPARQL query. It is possible, however, that for any of the (relevant) data sources associated with a subquery, the subquery result obtained from that source may give an empty result when combined with the results of some of the other subqueries. Hence, that subquery result does not contribute to the overall query result produced in the end. Based on this observation, it is possible to prune some of the data sources associated with subqueries if there is evidence that the subquery results from these data sources can be safely ignored without affecting the completeness of the overall query result. Source selection approaches that apply such a pruning are called join-aware (Saleem and Ngomo 2014).

Federated RDF Query Processing, Table 1 Properties of query decomposition/source selection approaches proposed in the literature, where DC approaches rely solely on a pre-populated data catalog, RT approaches rely only on runtime requests, and HY are hybrids

Approach                             | Join-aware | Class
Federation engine
DARQ (Quilitz and Leser 2008)        | ✓ | DC
ADERIS (Lynden et al. 2010, 2011)    | ✗ | DC
ANAPSID (Acosta et al. 2011)         | ✓ | HY
SPLENDID (Görlitz and Staab 2011b)   | ✗ | HY
FedX (Schwarte et al. 2011)          | ✗ | RT
Prasser et al. (2012)                | ✓ | DC
WoDQA (Akar et al. 2012)             | ✓ | HY
LHD (Wang et al. 2013)               | ✗ | HY
SemaGrow (Charalambidis et al. 2015) | ✗ | HY
Lusail (Mansour et al. 2017)         | ✓ | RT
Source selection extension
DAW (Saleem et al. 2013)             | ✗ | HY
HiBISCuS (Saleem and Ngomo 2014)     | ✓ | HY
Fed-DSATUR (Vidal et al. 2016)       | ✓ | RT

Table 1 lists the approaches for query decomposition/source selection proposed in the SPARQL federation literature. The main difference between these approaches (in addition to join awareness) is in the information they require about the federation members and the point at which they obtain this information. There exist approaches that rely solely on some type of pre-populated data catalog with (approach-specific) metadata about the federation members. Other approaches do not assume such a catalog and, instead, identify relevant federation members by querying all federation members during the source selection process. A third type of approaches are hybrids that employ a pre-populated data catalog in combination with runtime requests to the federation members. The approaches that employ a catalog (with or without
additional runtime requests) can be distinguished further based on what type of metadata and statistics they capture in their catalog. Both Görlitz and Staab (2011a, Sec. 5.3.2) and Oguz et al. (2015, Sec. 3.1–3.2) provide a more detailed overview of existing approaches.

Query Optimization

The optimizer devises a plan that encodes how the execution of the query is carried out. The objective of the optimizer is to devise an execution plan that minimizes the cost of processing the query decompositions. In federated scenarios, both the number of intermediate results and the communication costs impact the query performance.

Cost Estimation for Plans. To estimate the cost of plans in federated SPARQL query processing, Görlitz and Staab (2011b) propose a cost model that accounts for communication costs and estimates the cardinality of SPARQL expressions. Regarding the communication costs, this model assumes that all the responses from the sources require the same number of packages to be transmitted over the network and that the solution mappings have the same average size. Furthermore, this cost model relies on exact cardinalities for triple patterns with bound predicates and on estimations for other variations of triple patterns, graph patterns, and joins. Nonetheless, in federated query processing, where sources are autonomous, it is not always possible to collect sufficient statistics from the data sources to estimate cardinalities accurately. Therefore, some SPARQL federation engines (Acosta et al. 2011; Schwarte et al. 2011) implement heuristic-based optimization techniques that estimate the cost of subqueries based on the number of triple patterns and the number of variables in each triple pattern.

Plan Enumeration. To explore the space of query plans, optimizers of current SPARQL federation engines implement different search strategies that devise different tree plan shapes: (left-)deep plans and bushy plans. In contrast to deep plans, bushy trees are able to reduce the size of intermediate results in SPARQL queries (Vidal et al. 2010) and enable parallel execution of independent sub-plans. One common strategy used in traditional query processing is dynamic programming, which is able to enumerate all join plans while pruning inferior plans based on the estimated cost. This strategy is implemented by SPLENDID (Görlitz and Staab 2011b) and LHD (Wang et al. 2014) to devise bushy tree plans. The main advantage of dynamic programming is that it produces the best possible plans under the assumption that the cost model is accurate. However, the time and space complexity of dynamic programming is exponential, and thus this strategy does not scale for queries with a large number of subqueries. To overcome this limitation, the federated engines FedX (Schwarte et al. 2011) and ANAPSID (Acosta et al. 2011) implement heuristics to devise reasonably effective plans quickly. The heuristics implemented by FedX enumerate left-deep plans, while the ANAPSID heuristics are able to devise bushy tree plans.

Further Optimization Techniques. In traditional query processing, the optimizer considers pushing down filters in the query plan, which, in turn, allows for reducing the amount of intermediate results produced during query execution. Following this technique, ANAPSID (Acosta et al. 2011) implements heuristics to (i) push down filter expressions in the query plan and (ii) decide whether a filter is evaluated at the query engine or at the endpoint. In case the filter expression contains variables that are resolved by the same source, ANAPSID ships the evaluation of the filter to the endpoint, thus reducing the amount of data transmitted over the network. Other optimization techniques split composed filter expressions into sub-expressions such that each sub-expression can be pushed down in the query plan (Görlitz and Staab 2011b).

Query Execution Techniques

During query execution, the engines evaluate the plan devised by the optimizer. In this task, the engine contacts the members of the federation identified as relevant and combines the retrieved data to produce the final query result. In federated
SPARQL query processing, physical operators and adaptive query processing techniques have been proposed to efficiently combine the intermediate results.

Physical Operators. In terms of nested loop joins, Görlitz and Staab (2011a) and Schwarte et al. (2011) have proposed join operators that reduce the number of requests submitted to the sources by grouping several bindings into a single request. The distributed semi-join (Görlitz and Staab 2011a) buffers the solution mappings in FILTER expressions using the logical connectors ∧ (when mappings contain several variables) and ∨ (for grouping several mappings). In contrast, the bound join (Schwarte et al. 2011) groups a set of solution mappings in a single subquery using SPARQL UNION constructs. Furthermore, ANAPSID (Acosta et al. 2011) provides the non-blocking join operators agjoin and adjoin, which efficiently combine solution mappings and are able to produce results as soon as the data arrives from the endpoints.

Adaptive Query Processing. Most of the current federated SPARQL query engines follow the optimize-then-execute paradigm. In this paradigm the plan devised by the optimizer is fixed. Nonetheless, the optimize-then-execute paradigm is insufficient in federated scenarios in which the optimizer is not able to devise optimal plans due to misestimated statistics or unpredictable costs resulting from network delays or source availability. To overcome the limitations of the optimize-then-execute paradigm, some SPARQL federation engines implement adaptive query processing techniques (Deshpande et al. 2007) to adjust their execution schedulers according to the monitored runtime conditions. ANAPSID (Acosta et al. 2011) is an adaptive engine that flushes solution mappings that have matched a resource to secondary memory in cases when the main memory becomes full. Solution mappings in secondary memory are reconsidered to produce new solutions when the endpoints become blocked. The query execution techniques implemented by ANAPSID are tailored to cope with network delays and to handle a large number of intermediate results produced by arbitrary data distributions. Another federated engine that implements adaptivity during query execution is ADERIS (Lynden et al. 2010, 2011). ADERIS is able to re-invoke the query optimizer at execution time after all the data is retrieved from the endpoints. The optimizer then changes the join ordering to devise a new, more efficient query plan given the current runtime conditions.

Open Research Challenges

The open research challenges can be organized in two parts. The first set of challenges arises in the context of the Semantic Web. The second set of challenges extends known issues in federated query processing on relational databases.

Challenges of Using the Semantic Web as a Federation

Reasoning: RDF and SPARQL are part of a larger set of Semantic Web technologies standardized by the W3C. An important standard is the Web Ontology Language (OWL), which is a knowledge representation and reasoning language based on Description Logics. RDF systems may implement reasoners, which enable inferring new data based on the input data and ontologies. A federated query processing system on RDF data coupled with OWL ontologies must be extended to support the following: (i) understanding the different entailment regimes (Glimm and Ogbuji 2013) that a federation member is able to support (e.g., RDF entailment, RDFS entailment, OWL 2 direct semantics entailment) and (ii) reasoning with alignments between different OWL ontologies used across different federation members. Early approaches that have taken first steps to integrate ontology alignments into federated query processing are ALOQUS (Joshi et al. 2012) and LHD-d (Wang et al. 2014).

Unboundedness: Existing approaches to federated RDF query processing assume that the complete federation is known beforehand. However,
in the context of the Semantic Web, this may not be the case. An initial set of data sources (federation members) may be known, but it is possible that not all the sources are known initially. Hence, the total number of sources that can contribute to the answer of a given query may be unbounded. This unboundedness has implications for query processing, namely, in the query decomposition and source selection step.

Completeness: Recall that the result of executing any given SPARQL query over the federation should be the same as if the query were executed over the union of all the RDF data available in all the federation members. Therefore, completeness of a SPARQL query over the federation depends on knowing all the federation members. Per the unboundedness challenge (see above), if the sources are not known, then this notion of completeness should be relaxed. The implication of this challenge is to understand the tradeoff between soundness and completeness versus precision and recall when answering a query in a federation.

Dynamicity: The data of some federation members may change frequently (e.g., streams, social media). It has to be studied how such dynamicity affects the results of queries over a federation and how the dynamicity can be taken into account during federated query processing.

Challenges of Extending Federated Query Processing

Access Control: Within federated database systems, access control is a mechanism to check whether a query issued by a user is allowed or not. Given that each federation member is autonomous and has its own access control, conflicts may arise when data sources are combined in a federation. This challenge is further increased by the presence of ontologies and reasoning capabilities at the federation members. The inferred data needs to be checked with respect to the access control.

Heterogeneity: Although all the federation members are assumed to use a common data model (RDF), the members can be heterogeneous at different levels: (i) members may use different ontologies to model data, (ii) different versions of SPARQL and different query capabilities (e.g., SPARQL endpoints vs. TPF) may have an impact on the type of queries that members are able to answer, and (iii) depending on which entailment regimes are supported, a member can infer different types of answers.

Source Selection: The challenge of decomposing a query into subqueries that are associated with data sources is increased when the following additional constraints are considered. Given that data sources are autonomous, they can (i) have different timeouts, (ii) put a limit on the number of results a query can return, (iii) support different versions of SPARQL or even only subsets of SPARQL, (iv) be of different types, such as SPARQL endpoints, RDF documents, or TPF interfaces, and (v) support different entailment regimes. Therefore, a federated query processing system should take these constraints into account during source selection.

Query Results: An important challenge in federated query processing is to associate provenance with the query results in order to understand where an answer comes from. When sources support reasoning such that they are able to infer new data, this calls for further explanation in the provenance of the answers.

Cross-References

Graph Data Models
Graph Query Languages
Linked Data Management

References

Acosta M, Vidal M, Lampo T, Castillo J, Ruckhaus E (2011) ANAPSID: an adaptive query processing engine for SPARQL endpoints. In: The semantic web – ISWC
2011 – proceedings of the 10th international semantic web conference, part I, Bonn, 23–27 Oct 2011, pp 18–34. https://doi.org/10.1007/978-3-642-25073-6_2
Akar Z, Halaç TG, Ekinci EE, Dikenelli O (2012) Querying the web of interlinked datasets using VOID descriptions. In: WWW2012 workshop on linked data on the web, Lyon, 16 Apr 2012. http://ceur-ws.org/Vol-937/ldow2012-paper-06.pdf
Aranda CB, Arenas M, Corcho Ó, Polleres A (2013) Federating queries in SPARQL 1.1: syntax, semantics and evaluation. J Web Sem 18(1):1–17. https://doi.org/10.1016/j.websem.2012.10.001
Charalambidis A, Troumpoukis A, Konstantopoulos S (2015) SemaGrow: optimizing federated SPARQL queries. In: Proceedings of the 11th international conference on semantic systems, SEMANTICS 2015, Vienna, 15–17 Sept 2015, pp 121–128. https://doi.org/10.1145/2814864.2814886
Deshpande A, Ives ZG, Raman V (2007) Adaptive query processing. Found Trends Databases 1(1):1–140. https://doi.org/10.1561/1900000001
Feigenbaum L, Williams GT, Clark KG, Torres E (2013) SPARQL 1.1 protocol. W3C recommendation. Online at https://www.w3.org/TR/sparql11-protocol/
Glimm B, Ogbuji C (2013) SPARQL 1.1 entailment regimes. W3C recommendation. Online at https://www.w3.org/TR/sparql11-entailment/
Görlitz O, Staab S (2011a) Federated data management and query optimization for linked open data. In: New directions in web data management 1. Springer, pp 109–137. https://doi.org/10.1007/978-3-642-17551-0_5
Görlitz O, Staab S (2011b) SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In: Proceedings of the second international workshop on consuming linked data (COLD2011), Bonn, 23 Oct 2011. http://ceur-ws.org/Vol-782/GoerlitzAndStaab_COLD2011.pdf
Harris S, Seaborne A, Prud'hommeaux E (2013) SPARQL 1.1 query language. W3C recommendation. Online at http://www.w3.org/TR/sparql11-query/
Hartig O (2012) SPARQL for a web of linked data: semantics and computability. In: The semantic web: research and applications – proceedings of the 9th extended semantic web conference, ESWC 2012, Heraklion, Crete, 27–31 May 2012, pp 8–23. https://doi.org/10.1007/978-3-642-30284-8_8
Hartig O (2013) An overview on execution strategies for linked data queries. Datenbank-Spektrum 13(2):89–99. https://doi.org/10.1007/s13222-013-0122-1
Joshi AK, Jain P, Hitzler P, Yeh PZ, Verma K, Sheth AP, Damova M (2012) Alignment-based querying of linked open data. In: On the move to meaningful internet systems: OTM 2012, confederated international conferences: CoopIS, DOA-SVI, and ODBASE 2012, Rome, 10–14 Sept 2012, proceedings, part II, pp 807–824. https://doi.org/10.1007/978-3-642-33615-7_25
Lynden SJ, Kojima I, Matono A, Tanimura Y (2010) Adaptive integration of distributed semantic web data. In: Databases in networked information systems, proceedings of the 6th international workshop, DNIS 2010, Aizu-Wakamatsu, 29–31 Mar 2010, pp 174–193. https://doi.org/10.1007/978-3-642-12038-1_12
Lynden SJ, Kojima I, Matono A, Tanimura Y (2011) ADERIS: an adaptive query processor for joining federated SPARQL endpoints. In: On the move to meaningful internet systems: OTM 2011 – proceedings of the confederated international conferences: CoopIS, DOA-SVI, and ODBASE 2011, part II, Hersonissos, Crete, 17–21 Oct 2011, pp 808–817. https://doi.org/10.1007/978-3-642-25106-1_28
Mansour E, Abdelaziz I, Ouzzani M, Aboulnaga A, Kalnis P (2017) A demonstration of Lusail: querying linked data at scale. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD conference 2017, Chicago, 14–19 May 2017, pp 1603–1606. https://doi.org/10.1145/3035918.3058731
Oguz D, Ergenc B, Yin S, Dikenelli O, Hameurlain A (2015) Federated query processing on linked data: a qualitative survey and open challenges. Knowl Eng Rev 30(5):545–563. https://doi.org/10.1017/S0269888915000107
Pérez J, Arenas M, Gutierrez C (2009) Semantics and complexity of SPARQL. ACM Trans Database Syst 34(3):16:1–16:45. https://doi.org/10.1145/1567274.1567278
Prasser F, Kemper A, Kuhn KA (2012) Efficient distributed query processing for autonomous RDF databases. In: Proceedings of the 15th international conference on extending database technology, EDBT'12, Berlin, 27–30 Mar 2012, pp 372–383. https://doi.org/10.1145/2247596.2247640
Prud'hommeaux E, Buil-Aranda C (2013) SPARQL 1.1 federated query. W3C recommendation. Online at https://www.w3.org/TR/sparql11-federated-query/
Quilitz B, Leser U (2008) Querying distributed RDF data sources with SPARQL. In: The semantic web: research and applications, proceedings of the 5th European semantic web conference, ESWC 2008, Tenerife, Canary Islands, 1–5 June 2008, pp 524–538. https://doi.org/10.1007/978-3-540-68234-9_39
Saleem M, Ngomo AN (2014) HiBISCuS: hypergraph-based source selection for SPARQL endpoint federation. In: The semantic web: trends and challenges – proceedings of the 11th international conference, ESWC 2014, Anissaras, Crete, 25–29 May 2014, pp 176–191. https://doi.org/10.1007/978-3-319-07443-6_13
Saleem M, Ngomo AN, Parreira JX, Deus HF, Hauswirth M (2013) DAW: duplicate-aware federated query processing over the web of data. In: The semantic web – ISWC 2013 – proceedings of the 12th international semantic web conference, part I, Sydney, 21–25 Oct 2013, pp 574–590. https://doi.org/10.1007/978-3-642-41335-3_36
Schwarte A, Haase P, Hose K, Schenkel R, Schmidt M (2011) FedX: optimization techniques for federated query processing on linked data. In: The semantic web – ISWC 2011 – proceedings of the 10th international
Flood Detection

Flood Detection Using Social Media Big Data Streams

Overview

Introduction
create patterns by exploring huge amounts of data. This motivates the usage of big data techniques in disaster management systems, especially in flood management (Eilander et al. 2016; Chen and Han 2016). The main aim of this chapter is to discuss the existing challenges for flood detection using social media big data streams. This chapter will also provide a comprehensive review of the state-of-the-art natural language processing (NLP), machine learning (ML), and deep learning (DL) technologies for flood detection.

Literature Review

The worst situations are faced by countries with a scarcity of resources, which are unable to handle floods properly. Consequently, humanitarian organizations play their role in handling the situation. Moreover, countries with a larger area also face trouble in protecting large regions from flooding events. There is also a lack of a proper intercommunication mechanism between countries for sharing experience and knowledge regarding flood management. Flooding events in Pakistan and the Philippines have been analyzed to explore the timing of flood occurrence, the mapping of flood locations, and the understanding of flood impact (Jongman et al. 2015). Data was gathered from disaster response organizations, Global Flood Detection System (GFDS) satellite signals, and information from Twitter accounts. It was observed that GFDS satellite information produces better results for larger floods and is less appropriate for floods of smaller duration. It also has a drawback, as it produces errors when calculating data in coastal areas, so it is more suitable for use in delta regions or islands. Moreover, Twitter information analysis for flood exploration can produce fruitful results in handling floods of any size. However, it faces the challenge of finding the appropriate location of an event. Additionally, the Twitter population is limited in rural areas, which hinders accurate and timely information (Jongman et al. 2015).

In an experimental research study (Imran et al. 2016), a huge corpus of 52 million Twitter messages of 2013–2015 is used. The dataset was collected from different parts of the world during 19 different crisis incidents, including earthquakes, floods, landslides, typhoons, volcanoes, and airline accidents. Artificial Intelligence for Disaster Response (AIDR), an open-source platform, is used for the collection and classification of the messages. Additionally, the platform provides an easy method to collect tweets from specific regions with particular keywords. Messages are categorized according to their contents as injured and dead people, missing people, damaged infrastructure, volunteer services, donations, sympathy, and emotional support. Before classification, stop-words and uniform resource locators (URLs) are removed from the data; both have little or no role in classification. Moreover, the Lovins stemmer is used, which removes the derivational part of a word; for example, "flooding" will be replaced by "flood." Unigram and bigram words are defined as features, and information gain is used to select the top 1000 effective features. After the selection of features, multiclass categorization is performed using support vector machine (SVM), random forest (RF), and Naïve Bayes (NB) algorithms. Evaluation of the trained models is performed by tenfold cross-validation, and the largest word embedding on the topic of crisis management has been developed (Imran et al. 2016).

Another research work (Nguyen et al. 2016) has explored stochastic gradient descent and proposed an online algorithm using a deep neural network (DNN). This algorithm aims to perform identification and classification of significant tweets. For identification, the informativeness of each tweet is investigated, and as a result, two categories were formed, tagged as "informative" and "non-informative." Informative tweets contain indications regarding the category of destruction due to the disaster. Furthermore, the tweets are categorized according to their contents: tweets containing an indication of loss of infrastructure are placed in a different category, while tweets with information regarding deceased and injured persons are placed in a
messages has been gathered during the period separate category. Convolutional neural network
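The preprocessing pipeline described in this section (URL and stop-word removal, stemming, and unigram/bigram feature extraction) can be sketched in a few lines of Python. The stop-word list and the suffix-stripping rule below are simplified stand-ins, not the Lovins stemmer or AIDR's actual code:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "in", "of", "and"}  # tiny illustrative list

def preprocess(tweet):
    """Remove URLs and stop-words, then crudely strip derivational suffixes
    (a stand-in for the Lovins stemmer mentioned above)."""
    tweet = re.sub(r"https?://\S+", "", tweet.lower())
    tokens = [t for t in re.findall(r"[a-z]+", tweet) if t not in STOP_WORDS]
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

def ngram_features(tokens):
    """Unigram and bigram counts -- the raw feature set from which the top
    1000 features would then be selected by information gain."""
    unigrams = Counter(tokens)
    bigrams = Counter(" ".join(b) for b in zip(tokens, tokens[1:]))
    return unigrams + bigrams

tokens = preprocess("Flooding reported in Karachi http://t.co/x the river is rising")
print(tokens)  # ['flood', 'report', 'karachi', 'river', 'ris']
```

The selected features would then be fed to SVM, RF, or NB classifiers as described above.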
764 Flood Detection Using Social Media Big Data Streams
Flood Detection Using Social Media Big Data Streams, Table 1 Some public datasets for disaster events

Media                                      Samples       Algorithm     Duration    Disaster event
Twitter, satellite (Jongman et al. 2015)   7.6 million   –             2014–2015   Flood
Twitter (Imran et al. 2016)                52 million    NB, RF, SVM   2013–2015   Flood, earthquake
Twitter (Nguyen et al. 2016)               0.11 million  DNN           2015        Earthquake
(CNN)-based model is trained and tested for six classes, including "donation and volunteering," "sympathy and support," and so on (Nguyen et al. 2016). Some public datasets for disaster events are shown in Table 1.

The severity of floods differs across the world, and a few countries are significantly affected; in Malaysia, 40% of the total damage is caused by floods. The flood-prone region of Terengganu, Malaysia, where a flood event took place in November 2009, has been mapped using a support vector machine (SVM). Flood locations were detected by texture analysis, and 181 flood-affected locations were recorded, of which 70% were selected as the training set and the remaining 30% as the testing set. The presence or absence of flooding is verified by binary classification with the help of various mathematical functions called kernels. This approach used sigmoid (SIG), radial basis function (RBF), linear (LN), and polynomial (PL) kernel functions. Results were compared with the popular frequency ratio method; SVM with linear and polynomial kernels achieved success rates of 84.31% and 81.89%, respectively, against 72.48% for the frequency ratio method (Tehrany et al. 2015).

Efficient Natural Language Processing (NLP) Techniques for Flood Detection Using Social Media Big Data Text Streams

Natural language processing (NLP) plays a vital role in progressing artificial intelligence (AI) research in many practical areas of life. There are many challenges in the NLP realm still to be resolved, such as natural language generation, natural language understanding, dialog systems, and speech generation and recognition systems, to name a few. The major fields that support NLP research are computer science, artificial intelligence, and computational linguistics. On the Internet, 80% of user-generated content is in textual form, and the potential of this huge knowledge resource is still to be explored. The Internet, with its recent advancement in real-time interactive user- and social group-generated textual content and readership, offers new potential applications in the area of real-time event-based information processing and management. NLP-based applications generally have a preprocessing pipeline for turning textual features into more meaningful units; this can include parsing of lexemes from text, stemming/lemmatization, and statistical counting of features. During the US presidential election of 2016, Twitter proved to be the largest and most powerful source of breaking news, with 40 million election-related tweets posted by 10 PM on election day. Social media platforms, including Twitter, have been used and analyzed for use in disasters and emergency situations. Kaminska and Rutten (2014) identified the main uses of social media across the four pillars of the disaster management cycle: prevention and mitigation, preparedness, response, and recovery. The following are the three main areas where these platforms can be very helpful (Dufty et al. 2016):

• public information
• situational awareness
• community empowerment and engagement

Natural language processing is a significant research area which aims to achieve a computational language processing mechanism similar to that of human beings. Various desired tasks may be accomplished through NLP, including translation of text from one language to another,
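The four SVM kernel functions named above can be written directly from their textbook definitions. The hyperparameter values below (degree, gamma, coef0) are illustrative assumptions, not the settings used by Tehrany et al. (2015):

```python
import math

def linear(x, y):
    """LN kernel: plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def polynomial(x, y, degree=3, coef0=1.0):
    """PL kernel: (x . y + coef0) ** degree."""
    return (linear(x, y) + coef0) ** degree

def rbf(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def sigmoid(x, y, gamma=0.1, coef0=0.0):
    """SIG kernel: tanh(gamma * x . y + coef0)."""
    return math.tanh(gamma * linear(x, y) + coef0)

x, y = [1.0, 2.0], [2.0, 0.5]
print(linear(x, y), polynomial(x, y), round(rbf(x, y), 4), round(sigmoid(x, y), 4))
# 3.0 64.0 0.1969 0.2913
```

The binary flood/no-flood classifier evaluates such a kernel between the test point and the support vectors found during training.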
Flood Detection Using Social Media Big Data Streams, Fig. 2 Data flow diagram for metadata feature processing
using TF-IDF
State-of-the-Art NLP Systems for Flood Detection Using Big Data Streams
Recently, various novel systems have been proposed to detect floods through text analysis of social media big data streams (Bischke et al. 2017; Hanif et al. 2017; Ahmad et al. 2017a,b; Avgerinakis et al. 2017; Nogueira 2017). Most of these approaches are evaluated on the standard public dataset from the Multimedia Satellite Task at MediaEval 2017. This dataset consists of the metadata of images uploaded by social media users, which may or may not contain evidence of flooding. The metadata contains various fields, including a relevant description of the individual image, user tags, title, the date on which the image was uploaded, and the device with which it was captured. The main objective is to extract and fuse the content of events which are present in satellite imagery and social media (Bischke et al. 2017). Many diverse techniques are reported, including metadata features such as term frequency-inverse document frequency (TF-IDF) (Bischke et al. 2017; Hanif et al. 2017) and word/text embeddings from a large collection of titles, descriptions, and user tags (Tkachenko et al. 2017). Among them, traditional features like TF-IDF are quite useful, since TF-IDF integrates the concept of term frequency, which counts the occurrences of semantically similar concepts ("flood," "river," "damage") in the given user tags. Table 2 summarizes the performances of several NLP-based techniques.
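The TF-IDF weighting over per-image user tags can be sketched as follows; this is the standard textbook formulation, not necessarily the exact variant used in the cited systems:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists (e.g., the user tags of each image).
    Returns one {term: weight} dict per document, using tf * log(N / df)
    with raw term counts -- a common textbook variant."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

tags = [["flood", "river", "damage"],
        ["river", "boat"],
        ["sunset", "beach"]]
weights = tf_idf(tags)
# "flood" occurs in 1 of 3 documents, so its weight in doc 0 is log(3) ~ 1.099;
# a term occurring in every document would get log(1) = 0.
```

Rare, flood-specific tags thus receive high weights, while ubiquitous tags are suppressed, which is what makes TF-IDF useful for the metadata classifiers above.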
Flood Detection from Social Media Visual Big Data Streams: Machine Learning and Deep Learning Techniques

Visual images available on social media can be used effectively for flood management systems. Figure 3 shows some sample images from the MediaEval 2017 satellite task. Over the years, various computer vision techniques have been investigated to extract local/global visual features from these images (Bischke et al. 2017). These features include AutoColorCorrelogram, EdgeHistogram, Color and Edge Directivity Descriptor (CEDD), Color Layout, Fuzzy Color and Texture Histogram (FCTH), Joint Composite Descriptor (JCD), Gabor, ScalableColor, Tamura, SIFT, Colour SIFT, and local binary patterns (LBP). Machine learning techniques such as support vector machine (SVM), decision tree (DT), and kernel discriminant analysis using spectral regression (SR-KDA) are then used to distinguish between flood and non-flood images.

Recently, deep learning has been quite successful in image classification tasks. Deep learning is an advanced field of machine learning which models the human brain and builds similar neural networks for the execution of tasks, passing the input through a hierarchy of multiple layers to produce the proper information. In contrast to classical machine learning, features are extracted automatically from the provided input, which saves a great deal of human effort and computing resources (LeCun et al. 2015). Because features are extracted and processed on the fly according to the required task, more computation is needed, which is why graphical processing units (GPUs) are used for efficient results. For improved accuracy, neural networks require a large amount of data so that exact features may be predicted clearly. Deep learning has various applications in different fields, including natural language processing (NLP), com-
Flood Detection Using Social Media Big Data Streams, Fig. 3 Some images from MediaEval 2017 satellite task
(Bischke et al. 2017)
puter vision, audio processing, and so on. Especially in computer vision, deep learning has produced exceptional improvements along various dimensions, including image segmentation, compression, localization, transformation, sharpening, completion, colorization, and prediction. Table 3 summarizes the performances of several ML- and DL-based techniques.

Transfer Learning
The effectiveness of a deep network is directly related to the availability of training data: a greater amount of training data yields more accurate output. However, strong training requires a huge quantity of resources. Transfer learning was proposed to resolve the issue of effective training; it does not require training a neural network from scratch. Instead, the mechanism makes small modifications to an extensively trained network, one trained on large datasets and powerful machines. These pre-trained neural networks can then be used in a similar or different domain.

Importance of Deep Learning (DL) Algorithms in Event Prediction
During disasters and crises, accurate situational awareness and rapid communication are important to properly manage the situation. Conventional algorithms perform their functionality according to predefined steps and are inappropriate for managing crises. By contrast, deep learning algorithms perform their task according to the input provided to them and are more appropriate for responding with contextual awareness, so that accelerated action may be taken. ImageNet dataset: ImageNet is a dataset which contains 15 million labeled images categorized into about 22,000 categories (Jiang et al. 2013). A yearly contest is organized to produce better results using ImageNet, and the winners of this competition have revealed new models, including AlexNet, QuocNet, Inception (GoogLeNet), BN-Inception-v2, and Inception version 3. These pre-trained models are released so that they can be executed on machines with low computational and storage resources. One or more layers of these networks are altered according to the domain, and the network is then retrained accordingly. The retrained network is more efficient thanks to the strong training it acquired on the ImageNet dataset. Inception version 3 was introduced in 2016 (Chen and Han 2016), with boosted performance on the ILSVRC 2012 classification benchmark. In comparison with the state-of-
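The transfer-learning recipe described above (freeze a pre-trained feature extractor, retrain only a small task-specific head) can be illustrated with a toy numeric model. The fixed random projection below stands in for frozen convolutional layers and is purely an assumption for illustration:

```python
import random

random.seed(0)

# "Pre-trained" feature extractor: a fixed random projection plus ReLU,
# standing in for frozen convolutional layers (illustrative only).
W = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]

def features(x):
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

def predict(head, bias, x):
    score = sum(h * f for h, f in zip(head, features(x))) + bias
    return 1 if score > 0 else 0

def train_head(data, epochs=200, lr=0.1):
    """Perceptron training of the task-specific head only; the extractor W
    is never updated -- that is the transfer-learning idea."""
    head, bias = [0.0] * len(W), 0.0
    for _ in range(epochs):
        for x, label in data:
            err = label - predict(head, bias, x)
            if err:
                f = features(x)
                head = [h + lr * err * fi for h, fi in zip(head, f)]
                bias += lr * err
    return head, bias

data = [([1.0, 1.0], 1), ([2.0, 2.0], 1), ([-1.0, -1.0], 0), ([-2.0, -2.0], 0)]
head, bias = train_head(data)
```

In a real setting, `W` would be the frozen layers of an ImageNet-pre-trained network such as Inception v3, and only the final classification layer would be retrained on the flood images.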
Flood Detection Using Social Media Big Data Streams, Fig. 4 Data flow diagram showing fusion of meta- and
visual data
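The fusion shown in Fig. 4 can be sketched as a simple late fusion of the two classifiers' confidence scores. The weighting scheme below is an assumption for illustration, not the method of any specific cited system:

```python
def late_fusion(meta_conf, visual_conf, w_meta=0.5):
    """Weighted average of the metadata-based and image-based flood
    confidences; w_meta is a tunable assumption, not a published value."""
    return w_meta * meta_conf + (1.0 - w_meta) * visual_conf

def predict(meta_conf, visual_conf, threshold=0.5):
    return "flood" if late_fusion(meta_conf, visual_conf) >= threshold else "no flood"

print(predict(0.9, 0.4))  # fused confidence 0.65 -> "flood"
```

Here a strong textual signal (e.g., the tags "flood" and "monsoon") can outvote an ambiguous image, and vice versa, which is the point of fusing the two streams.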
Recently, several datasets for flood-related emergency situations have been shared across the world by different countries and agencies. Moreover, there are integrated systems/frameworks for emergency reporting. These systems use several state-of-the-art machine learning approaches to determine the response in critical situations. The integration of image and social media textual data provides the right mixture for effective prediction and monitoring. The size of social media data and its integration, filtration, and categorization to the relevant situation offer many challenges. On the other hand, its effective use in crisis detection, resilience, learning, and capacity building for real-time and online analysis cannot be denied.

Cross-References

Big Data Analysis for Social Good
Big Data in Social Networks

References

Abbott MB et al (1991) Hydroinformatics: information technology and the aquatic environment. Avebury Technical
Ahmad K, Konstantin P, Riegler M, Conci N, Holversen P (2017a) CNN and GAN based satellite and social media data fusion for disaster detection. In: Proceedings of the MediaEval 2017 workshop, Dublin
Ahmad S, Ahmad K, Ahmad N, Conci N (2017b) Convolutional neural networks for disaster images retrieval. In: Proceedings of the MediaEval 2017 workshop, Dublin
Avgerinakis K, Moumtzidou A, Andreadis S, Michail E, Gialampoukidis I, Vrochidis S, Kompatsiaris I (2017) Visual and textual analysis of social media and satellite images for flood detection @ multimedia satellite task MediaEval 2017. In: Proceedings of the MediaEval 2017 workshop, Dublin
Basnyat B, Anam A, Singh N, Gangopadhyay A, Roy N (2017) Analyzing social media texts and images to assess the impact of flash floods in cities. In: 2017 IEEE international conference on smart computing (SMARTCOMP). IEEE, pp 1–6
Bischke B, Bhardwaj P, Gautam A, Helber P, Borth D, Dengel A (2017) Detection of flooding events in social multimedia and satellite imagery using deep neural networks. In: Proceedings of the MediaEval 2017 workshop, Dublin
Bramer M (2007) Principles of data mining, vol 180. Springer, London
Chen Y, Han D (2016) Big data and hydroinformatics. J Hydroinf 18(4):599–614
Coltin B, McMichael S, Smith T, Fong T (2016) Automatic boosted flood mapping from satellite data. Int J Remote Sens 37(5):993–1015
Dufty N et al (2016) Twitter turns ten: its use to date in disaster management. Aust J Emerg Manag 31(2):50
Eilander D, Trambauer P, Wagemaker J, van Loenen A (2016) Harvesting social media for generation of near real-time flood maps. Procedia Eng 154:176–183
Fohringer J, Dransch D, Kreibich H, Schröter K (2015) Social media as an information source for rapid flood inundation mapping. Nat Hazards Earth Syst Sci 15(12):2725–2738
Hanif M, Tahir MA, Khan M, Rafi M (2017) Flood detection using social media data and spectral regression based kernel discriminant analysis. In: Proceedings of the MediaEval 2017 workshop, Dublin
Imran M, Mitra P, Castillo C (2016) Twitter as a lifeline: human-annotated twitter corpora for NLP of crisis-related messages. arXiv preprint arXiv:1605.05894
Jiang L, Cai Z, Zhang H, Wang D (2013) Naive bayes text classifiers: a locally weighted learning approach. J Exp Theor Artif Intell 25(2):273–286
Jongman B, Wagemaker J, Romero BR, de Perez EC (2015) Early flood detection for rapid humanitarian response: harnessing near real-time satellite and twitter signals. ISPRS Int J Geo-Inf 4(4):2246–2266
Kussul N, Shelestov A, Skakun S (2008) Grid system for flood extent extraction from satellite images. Earth Sci Inf 1(3):105
Lamovec P, Matjaz M, Ostir K (2013) Detection of flooded areas using machine learning techniques: case study of the Ljubljana moor floods in 2010. Dis Adv 6:4–11
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Nguyen DT, Joty S, Imran M, Sajjad H, Mitra P (2016) Applications of online deep learning for crisis response using social media information. arXiv preprint arXiv:1610.01030
Nogueira K (2017) Data-driven flood detection using neural networks. In: Proceedings of the MediaEval 2017 workshop, Dublin
Pohl D, Bouchachia A, Hellwagner H (2012) Automatic sub-event detection in emergency management using social media. In: Proceedings of the 21st international conference on World Wide Web. ACM, pp 683–686
Pohl D, Bouchachia A, Hellwagner H (2016) Online indexing and clustering of social media data for emergency management. Neurocomputing 172:168–179
Ramaswamy B, Likith Ponnanna PB, Vishruth K (2017) Urban flood forecast using machine learning on real time sensor data. Trans Mach Learn Artif Intell 5(5):69
Santoro A, Raposo D, Barrett DG, Malinowski M, Pascanu R, Battaglia P, Lillicrap T (2017) A simple neural
Framework-Based Scale-Out RDF Systems 771
network module for relational reasoning. arXiv preprint arXiv:1706.01427
Shamsi J, Khojaye MA, Qasmi MA (2013) Data-intensive cloud computing: requirements, expectations, challenges, and solutions. J Grid Comput 11(2):281–310
Tehrany MS, Pradhan B, Mansor S, Ahmad N (2015) Flood susceptibility assessment using GIS-based support vector machine model with different kernel types. Catena 125:91–101
Tkachenko N, Zubiaga A, Procter RN (2017) WISC at MediaEval 2017: multimedia satellite task. In: Proceedings of the MediaEval 2017 workshop

Forecasting Business Process Future

Predictive Business Process Monitoring

Framework-Based Scale-Out RDF Systems

Marcin Wylot1 and Sherif Sakr2
1 ODS, TU Berlin, Berlin, Germany
2 Institute of Computer Science, University of Tartu, Tartu, Estonia

Synonyms

Hadoop-based RDF query processors; Spark-based RDF query processors

Definitions

RDF, the Resource Description Framework, has been recognized as a de facto standard for describing resources in a semi-structured manner. In particular, RDF is a graph-based format which allows named links to be defined between resources in the form of triples (subject, predicate, object), also called statements. A statement expresses a relationship (defined by a predicate) between resources (the subject and the object). The relationship is always from subject to object (it is directional). The same resource can be used in multiple triples playing the same or different roles, e.g., it can be used as a subject in one triple, as well as a predicate or an object in another one. This ability enables the definition of multiple connections between the triples, and hence the creation of a connected graph of data. Such a graph can be represented as nodes that stand for the resources and edges capturing the relationships between the nodes. RDF has been widely adopted in several application domains. For example, a growing number of ontologies and knowledge bases storing millions to billions of facts (e.g., DBpedia, Probase, and Wikidata) are now publicly available (Poggi et al. 2008). In addition, key search engines like Google and Bing are providing increasing support for RDF. This entry gives an overview of various systems that have been designed to exploit the Hadoop and Spark frameworks for building scale-out RDF processing engines.

Overview

In 2004, Google made a seminal contribution to the big data community by introducing the MapReduce model (Dean and Ghemawat 2004). It is a simple and powerful programming model for developing scalable parallel applications that can process large datasets across multiple machines. In particular, MapReduce makes programmers think in a data-centric style, allowing them to concentrate on dataset transformations, while MapReduce takes care of the distributed execution and fault tolerance details transparently (Sakr et al. 2013). The Apache Hadoop project was introduced by Yahoo! as an open-source framework that implemented the concepts of MapReduce. However, one of the main limitations of the Hadoop framework is that it requires the entire output of each map and reduce task to be materialized in a local file on the Hadoop Distributed File System (HDFS) before it can be consumed by the next stage. This materialization step allows for the implementation of a simple and elegant checkpoint/restart fault-tolerant mechanism;
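The triple model described in the Definitions above can be made concrete with a toy in-memory store, showing the same resource playing different roles across statements (a sketch, not any particular RDF library):

```python
# Each statement is a (subject, predicate, object) triple, and the same
# resource may appear in any of the three roles.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("knows", "rdf:type", "rdf:Property"),  # "knows" plays the subject role here
]

def match(pattern, store):
    """Return all triples matching a (subject, predicate, object) pattern,
    where None acts as a wildcard."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Edge view of the connected graph: nodes are resources, edges carry predicates.
edges = [(s, o, p) for s, p, o in triples]

print(match((None, "knows", None), triples))
# [('alice', 'knows', 'bob'), ('bob', 'knows', 'carol')]
```

All the scale-out systems surveyed below operate on exactly this data model, differing only in how the triples are partitioned and joined at scale.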
however, it dramatically harms system performance. The Spark project was developed as a big data processing framework that tackles this challenge (Zaharia et al. 2010). In particular, the fundamental programming abstraction of Spark is the Resilient Distributed Dataset (RDD) (Zaharia et al. 2010), which represents a logical collection of data partitioned across machines. In principle, having the RDD as an in-memory data structure enables Spark to exploit its functional programming paradigm by allowing user programs to load data into a cluster's memory and query it repeatedly. In addition, users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. Spark therefore takes the concepts of Hadoop to the next level by loading the data into distributed memory and relying on a cheaper shuffle during data processing (Sakr 2016).

Examples of Systems

Hadoop-Based RDF Systems
SHARD (Rohloff and Schantz 2010) is one of the first Hadoop-based approaches that used a clause-iteration technique to evaluate SPARQL queries against RDF datasets. It is designed to exploit MapReduce-style jobs for high parallelization of SPARQL queries. In particular, SHARD iterates over the clauses in a query to incrementally bind query variables to RDF graph nodes while conforming to all of the query constraints. An iteration algorithm is used to coordinate the iterated MapReduce jobs, with one iteration for each clause in the query. In principle, the Map step filters each of the bindings, and the Reduce step removes duplicates, where the key values for both Map and Reduce are the bound variables in the SELECT clause. The initial map step identifies all feasible bindings of graph data to variables in the first query clause; its output key is the list of variable bindings, and the output values are set to null. The initial reduce step removes duplicate bindings without further modifying the output of the initial map step. The intermediate MapReduce jobs continue to construct query responses by iteratively binding graph data to variables in later clauses as new variables are introduced and then joining these new bindings to the previously bound variables, such that the joined bound variables align with iteratively increasing subsets of the query clauses. The intermediate steps execute MapReduce operations simultaneously over both the graph data and the previously bound variables, which were saved to disk for this purpose. This iteration of MapReduce joins continues until all clauses are processed and variables are assigned which satisfy the query clauses. A final MapReduce step filters the bound variable assignments to obtain just the variable bindings requested in the SELECT clause of the original SPARQL query.

Husain et al. (2011) describe an approach to store RDF data in the Hadoop Distributed File System (HDFS) by partitioning and organizing the data files and executing dictionary encoding. In particular, this approach applies two partitioning components: the Predicate Split (PS) component, which splits the RDF data into predicate files, and the Predicate Object Split (POS) component, which then splits these predicate files into smaller files based on the type of their objects. Query processing is implemented on top of Hadoop using a heuristic cost-based algorithm that produces a query plan with the minimal number of Hadoop jobs for joining the data files and evaluating the input query. The cost of the generated query plan is bounded by the log of the total number of variables in the given SPARQL query.

Figure 1 illustrates the architecture of the HadoopRDF system (Huang et al. 2011a), which has been introduced as a scale-out architecture combining the distributed Hadoop framework with a centralized RDF store, RDF-3X (Neumann and Weikum 2010), for querying RDF databases. The data partitioner of HadoopRDF executes a disjoint partitioning of the input RDF graph by vertex, using a graph partitioning algorithm that allows triples which are close to each other in the RDF graph to be allocated to the same node. The main advantage of this partitioning strategy
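SHARD's clause-iteration technique described above can be sketched without the MapReduce machinery: each triple pattern is evaluated in turn, and its new bindings are joined with those accumulated so far (hypothetical data; plain Python loops stand in for the iterated Hadoop jobs):

```python
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "age", "42"),
]

def bind(pattern, triple, binding):
    """Extend a binding dict if the triple matches the pattern, else None.
    Terms starting with '?' are variables."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None          # conflicts with an earlier binding
        elif term != value:
            return None              # constant term does not match
    return new

def evaluate(clauses, store):
    bindings = [{}]
    for clause in clauses:           # one "MapReduce iteration" per clause
        bindings = [b2 for b in bindings for t in store
                    if (b2 := bind(clause, t, b)) is not None]
    return bindings

query = [("?x", "knows", "?y"), ("?y", "age", "?a")]
print(evaluate(query, triples))  # [{'?x': 'alice', '?y': 'bob', '?a': '42'}]
```

In SHARD proper, each pass over `store` is a parallel map, and the duplicate removal and join bookkeeping happen in the reduce steps.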
Framework-Based Scale-Out RDF Systems, Fig. 1 MapReduce system architecture (Huang et al. 2011b). (©2011 VLDB Endowment. Reprinted with permission)
is that it effectively reduces network communication at query time. To further reduce the communication overhead, HadoopRDF replicates some triples on multiple machines. In particular, on each worker, the data replicator decides which triples are on the boundary of its partition and replicates them based on specified n-hop guarantees. HadoopRDF automatically decomposes the input query into chunks which can be evaluated independently with zero communication across partitions and uses the Hadoop framework to combine the resulting distributed chunks. It leverages HadoopDB (Abouzeid et al. 2009) to manage the partitioning of query evaluation across high-performance single-node database systems and the Hadoop framework.

The PigSPARQL system (Schätzle et al. 2013) compiles SPARQL queries into the Pig query language (Olston et al. 2008), a data analysis platform over the Hadoop framework. Pig uses a fully nested data model and provides relational-style operators (e.g., filters and joins). In PigSPARQL, a SPARQL query is parsed to generate an abstract syntax tree, which is subsequently compiled into a SPARQL algebra tree. Using this tree, PigSPARQL applies various optimizations on the algebra level, such as the early evaluation of filters, and uses selectivity information for reordering the triple patterns. Finally, PigSPARQL traverses the optimized algebra tree bottom-up and generates an equivalent sequence of Pig Latin expressions for every SPARQL algebra operator. For query execution, Pig automatically maps the resulting Pig Latin script onto a sequence of Hadoop jobs. An advantage of using PigSPARQL as an intermediate layer between SPARQL and Hadoop is that it is independent of the actual Hadoop version or implementation details. RAPID+ (Ravindra et al. 2011) is another Pig-based system that uses an algebraic approach for optimizing and evaluating SPARQL queries on top of the Hadoop framework. It uses a data model and algebra, the Nested TripleGroup Data Model and Algebra (NTGA) (Kim et al. 2013), which includes support for expressing graph pattern matching queries.
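HadoopRDF's boundary replication described above can be sketched for a 1-hop guarantee: each triple lives on its subject's partition, and triples that cross the partition cut are copied to the object's partition as well (an illustrative toy, not the actual HadoopRDF code):

```python
# Four vertices in a cycle, split across two workers.
triples = [
    ("a", "p", "b"), ("b", "p", "c"), ("c", "p", "d"), ("d", "p", "a"),
]
partition = {"a": 0, "b": 0, "c": 1, "d": 1}  # vertex -> worker

def assign_with_replication(triples, partition):
    workers = {0: set(), 1: set()}
    for s, p, o in triples:
        workers[partition[s]].add((s, p, o))   # home partition of the subject
        if partition[o] != partition[s]:       # boundary triple: replicate it
            workers[partition[o]].add((s, p, o))
    return workers

workers = assign_with_replication(triples, partition)
# ("b", "p", "c") crosses the cut, so both workers hold a copy; with this
# guarantee, any 1-hop subquery can run locally with no network traffic.
```

Larger n-hop guarantees replicate progressively wider rings of boundary triples, trading storage for fewer cross-partition joins.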
The SHAPE system (Lee and Liu 2013) uses a semantic hash partitioning approach that combines locality-optimized RDF graph partitioning with cost-aware query partitioning for processing queries over big RDF graphs. In practice, the approach utilizes access locality to partition big RDF graphs across multiple compute nodes by maximizing the intra-partition processing capability and minimizing the inter-partition communication cost. The SHAPE system is implemented on top of the Hadoop framework, with the master server as the coordinator and the set of slave servers as the workers. In particular, RDF triples are fetched into the data partitioning module hosted on the master server, which partitions the data across the set of slave servers. The SHAPE system uses RDF-3X (Neumann and Weikum 2010) on each slave server and uses Hadoop to join the intermediate results generated by subqueries. For query processing, the master node serves as the interface for SPARQL queries and performs distributed query execution planning for each query received. The SHAPE system classifies query processing into two types: intra-partition processing and inter-partition processing. Intra-partition processing is used for queries that can be executed fully in parallel on each server, by locally searching the subgraphs matching the triple patterns of the query without any inter-partition coordination. Inter-partition processing is used for queries that cannot be executed on any single partition server and therefore need to be decomposed into a set of subqueries such that each subquery can be evaluated by intra-partition processing. In this scenario, processing the query requires multiple rounds of coordination and data transfer across a set of partition servers. Thus, the main goal of the cost-aware query partitioning phase is to generate locality-optimized query execution plans that can effectively minimize the inter-partition communication cost of distributed query processing.

CliqueSquare (Goasdoué et al. 2015; Djahandideh et al. 2015) is another Hadoop-based RDF data management platform for storing and processing big RDF datasets. With the central goal of minimizing the number of MapReduce jobs and the data transfer between nodes during query evaluation, CliqueSquare exploits the built-in data replication mechanism of the Hadoop Distributed File System (HDFS), where each partition has three replicas by default, to partition the RDF dataset in three different ways. In particular, for the first replica, CliqueSquare partitions triples based on their subject, property, and object values. For the second replica, CliqueSquare stores all subject, property, and object partitions of the same value within the same node. Finally, for the third replica, CliqueSquare groups all the subject partitions within a node by the value of the property in their triples; similarly, it groups all object partitions based on their property values. In addition, CliqueSquare implements a special treatment for triples whose property is rdf:type, which would otherwise form an unwieldy large property partition: it splits the rdf:type property partition into several smaller partitions according to their object value. For SPARQL query processing, CliqueSquare relies on a clique-based algorithm which produces query plans that minimize the number of MapReduce stages. The algorithm is based on the variable graph of a query and its decomposition into clique subgraphs. It works iteratively to identify cliques and to collapse them by evaluating the joins on the common variables of each clique; the process ends when the variable graph consists of only one node. Since triples related to a particular resource are colocated on one node, CliqueSquare can perform all first-level joins in RDF queries (SS, SP, SO, PP, PS, PO, etc.) locally on each node and reduce the data transfer through the network. In particular, it allows queries composed of 1-hop graph patterns to be processed in a single MapReduce job, which enables a significant performance advantage.

Spark-Based RDF Systems
S2RDF (SPARQL on Spark for RDF) (Schätzle et al. 2015b) introduced a relational partitioning schema for encoding RDF data called ExtVP (Extended Vertical Partitioning), which extends the vertical partitioning (VP) schema introduced by Abadi et al. (2007) and uses a semi-join-based
preprocessing to efficiently minimize query input size by taking into account the possible join correlations between the underlying encoding tables of the RDF data and join indices (Valduriez 1987). The main idea of the vertical partitioning scheme is to split the RDF triple table into m tables, where m is the number of unique properties in the RDF database. Each of the m tables consists of two columns: the first column stores the subjects that are described by the corresponding property, while the second column stores the object values. Subjects that are not described by a particular property are simply omitted from the table for that property. ExtVP precomputes the possible join relations between partitions (i.e., tables). The main goal of ExtVP is to reduce unnecessary I/O operations, comparisons, and memory consumption during the execution of join operations by avoiding dangling tuples in the input tables of the joins, i.e., tuples that do not find a join partner. In particular, S2RDF determines the subsets of a VP table VP_p1 that are guaranteed to find at least one match when joined with another VP table VP_p2, where p1 and p2 are query predicates. S2RDF uses this information to precompute a number of semi-join reductions (Bernstein and Chiu 1981) of VP_p1. The relevant semi-joins between tables in VP are determined by the possible joins that can occur when combining the results of triple patterns during query execution. Clearly, ExtVP comes at the cost of some additional storage overhead in comparison to the basic vertical partitioning technique. Therefore, ExtVP does not exhaustively precompute all possible join reductions. Instead, an optional selectivity threshold can be specified so that only tables whose reduction relative to the original tables is large enough are materialized. This mechanism makes it possible to control and reduce the storage overhead while preserving most of the performance benefit. S2RDF stores the RDF data in the Parquet columnar storage format on the Hadoop Distributed File System (HDFS). S2RDF is built on top of Spark, and its query evaluation is based on Spark SQL (Armbrust et al. 2015), the relational interface of Spark. It parses a SPARQL query into the corresponding algebra tree, applies some basic algebraic optimizations (e.g., filter pushing), and traverses the algebra tree bottom-up to generate the equivalent Spark SQL expressions. For the generated SQL expression, S2RDF uses the precomputed semi-join tables, if they exist, or alternatively falls back to the base encoding tables.

SparkRDF (Chen et al. 2014, 2015) is another Spark-based RDF engine, which partitions the RDF graph into MESGs (Multi-layer Elastic SubGraphs) according to relations (R) and classes (C) by building five kinds of indices (C, R, CR, RC, CRC) with different granularities to support efficient evaluation of the different query triple patterns. SparkRDF creates an index file for every index structure and stores these files directly in HDFS; they are the only representation of triples used during query execution. The indices are modeled as RDSGs (Resilient Discreted SubGraphs), a collection of in-memory subgraphs partitioned across nodes. SPARQL queries are evaluated over these indices using a series of basic operators (e.g., filter, join). All intermediate results are represented as RDSGs and maintained in distributed memory to support faster join operations. SparkRDF uses a selectivity-based greedy algorithm to build a query plan with an optimal execution order of the query triple patterns, aiming to effectively reduce the size of the intermediate results. In addition, it uses a location-free pre-partitioning strategy that avoids the expensive shuffling cost of distributed join operations; in particular, it ignores the partitioning information of the indices and repartitions the data with the same join key to the same node.

The S2X (SPARQL on Spark with GraphX) (Schätzle et al. 2015a) RDF engine has been implemented on top of GraphX (Gonzalez et al. 2014), an abstraction for graph-parallel computation implemented on top of Spark (Zaharia et al. 2010). It uses the graph-parallel abstractions of GraphX to implement the graph pattern matching constructs of SPARQL. A similar approach has been followed by Goodman and Grunwald (2014) for implementing an RDF engine on top of the
Poggi A, Lembo D, Calvanese D, De Giacomo G, Lenzerini M, Rosati R (2008) Linking data to ontologies. In: Spaccapietra S (ed) Journal on data semantics X. Springer, Berlin/Heidelberg, pp 133–173
Ravindra P, Kim H, Anyanwu K (2011) An intermediate algebra for optimizing RDF graph pattern matching on MapReduce. In: The semantic web: research and applications – 8th extended semantic web conference, ESWC 2011, Heraklion, Crete, 29 May – 2 June 2011, Proceedings, Part II, pp 46–61. https://doi.org/10.1007/978-3-642-21064-8_4
Rohloff K, Schantz RE (2010) High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store. In: Programming support innovations for emerging distributed applications. ACM, p 4
Sakr S (2016) Big data 2.0 processing systems – a survey. Springer briefs in computer science. Springer. https://doi.org/10.1007/978-3-319-38776-5
Sakr S, Liu A, Fayoumi AG (2013) The family of MapReduce and large-scale data processing systems. ACM Comput Surv 46(1). https://doi.org/10.1145/2522968.2522979
Schätzle A, Przyjaciel-Zablocki M, Hornung T, Lausen G (2013) PigSPARQL: a SPARQL query processing baseline for big data. In: Proceedings of the ISWC 2013 posters & demonstrations track, Sydney, 23 Oct 2013, pp 241–244. http://ceur-ws.org/Vol-1035/iswc2013_poster_16.pdf
Schätzle A, Przyjaciel-Zablocki M, Berberich T, Lausen G (2015a) S2X: graph-parallel querying of RDF with GraphX. In: 1st international workshop on big-graphs online querying (Big-O(Q))
Schätzle A, Przyjaciel-Zablocki M, Skilevic S, Lausen G (2015b) S2RDF: RDF querying with SPARQL on Spark. CoRR abs/1512.07021. http://arxiv.org/abs/1512.07021
Valduriez P (1987) Join indices. ACM Trans Database Syst 12(2):218–246. https://doi.org/10.1145/22952.22955
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: HotCloud

Functional Benchmark

Microbenchmark
performing SAM compressors encode reads carrying the same sequence variants only once or employ highly optimized statistical models for various SAM fields.

In a recent benchmarking study to evaluate the effectiveness of available compression methods, the MPEG HTS compression working group compiled an HTS data set consisting of both raw (FASTQ) and aligned (SAM or BAM) read data with a wide spectrum of characteristics for ensuring statistically meaningful results (Numanagić et al. 2016). On this data set, compression performance, running times, memory usage, and parallelization capabilities were measured for the available compression tools, including both industry-scale tools and research-oriented prototypes. The study declared no overall winner that performs well on every data type and under every performance measure employed. Therefore, an integrated solution that chooses the specific approach offering the best performance for the specific application and the specific input data type(s) (with respect to the performance measure of importance) would yield the best outcome, both for raw and aligned sequence data.

FASTQ Compression Tools

FASTQ is a standard text format for storing unmapped HTS data provided as a set of reads (Cock et al. 2009). Each record (representing a read) in a FASTQ file consists of a read identifier, sequence information, a comment, and quality score (Ewing et al. 1998) information, each separated by newline characters. If the sequencer is able to produce paired-end reads, FASTQ files usually come in the so-called library format consisting of two FASTQ files, each containing the reads from the associated strand, such that the k-th read from the first file corresponds to the k-th read from the second file.

General FASTQ Compressors

FASTQ files are often compressed with general-purpose tools, such as Gzip (Deutsch 1996) and bzip2 (Seward 1998), resulting in suboptimal compression ratios since these tools treat FASTQ as a simple plaintext file. Another popular format, although only used internally in the large sequencing databanks, is NCBI's SRA (Leinonen et al. 2010), which uses the LZ-77 scheme to encode metadata.

General FASTQ compressors typically decouple the distinct streams (read names, sequences, comments, and quality scores) and compress each of them separately. Examples include DSRC and DSRC2 (Deorowicz and Grabowski 2011; Roguski and Deorowicz 2014), FQC (Dutta et al. 2015), LFQC (Nicolae et al. 2015), Fqzcomp (Bonfield and Mahoney 2013; Holland and Lynch 2013), and Slimfastq (Josef 2014). Instead of directly compressing each stream as text symbols, these tools may first reformat each field, e.g., tokenizing read identifiers so as to encode the common prefix of the identifiers only once, or applying 2-bit nucleotide encoding. Quality scores are usually compressed by statistical modeling and entropy coding.

Alignment-Based FASTQ Compressors

The order of records within a FASTQ file is arbitrary, and thus altering their order to improve compression performance does not result in loss of information. Indeed, some of the newer approaches reorder the reads in an input FASTQ file, typically with the goal of approximating their order within the genome sequence. Such reordering is helpful, e.g., when coupled with an encoding scheme that replaces each read with a pointer to a reference genome (together with a list of mismatches, if the alignment is not perfect). The reference can either be user-provided or constructed via de novo assembly, which is usually performed by representing the read collection as a de Bruijn graph and traversing it in a way that "covers" all reads.

Provided with a reference, alignment-based compression tools (explicitly or implicitly) sort the reads with respect to their mapping position in the reference. The reads, represented as pointers to the reference, can then be encoded, e.g., via run-length encoding. Naturally, the compression performance of this general approach increases
Genomic Data Compression 781
with the read coverage (the average number of reads that "cover" each position in the genome) and the repeat content of the genome.

Tools that employ this general strategy include LW-FQZip (Zhang et al. 2015), which uses an available reference genome for extracting the pointers, as well as Quip (Jones et al. 2012), Leon (Benoit et al. 2015), k-Path (Kingsford and Patro 2015), and KIC (Zhang et al. 2016), which first perform a de novo assembly of the reads into longer contigs.

2018). Finally, Assembltrie uses a fast, greedy assembly of the reads into trees (Ginart et al. 2018) (rather than strings) to achieve combinatorial optimality with respect to the number of symbols used in the constructed reference. Note that some of the abovementioned compression tools are research-oriented products and do not support compression of read identifiers/quality scores.
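The pointer-plus-mismatches encoding used by the alignment-based tools above can be sketched as follows. The reference string, reads, and mapping positions are hypothetical, and real tools add entropy coding (e.g., run-length coding of the position deltas) on top of such pointers.

```python
def encode_read(reference, read, pos):
    """Represent a read as (position, length, mismatch list) against the reference."""
    window = reference[pos:pos + len(read)]
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), mismatches

def decode_read(reference, pos, length, mismatches):
    """Rebuild a read from the reference window plus its mismatch list."""
    read = list(reference[pos:pos + length])
    for i, base in mismatches:
        read[i] = base
    return "".join(read)

reference = "ACGTACGTTTGACCAGTACGTGGA"
# Hypothetical aligned reads: (sequence, mapping position reported by an aligner).
aligned = sorted(
    [("ACGTACCT", 0), ("CGTTTGAC", 5), ("GACCAGTA", 10), ("GTACGTGG", 15)],
    key=lambda r: r[1],  # sort by mapping position, as alignment-based tools do
)

encoded, prev = [], 0
for read, pos in aligned:
    p, length, mm = encode_read(reference, read, pos)
    encoded.append((p - prev, length, mm))  # delta-encode the positions
    prev = p
```

Decoding walks the position deltas and reapplies each mismatch list; here the first read carries one mismatch (G replaced by C at offset 6) while the others match the reference exactly, so they compress down to their pointers alone.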
improvements offered by Scramble are now part of the CRAM format.

Variation-Sensitive Compression and Other SAM Compressors

In both SAM and CRAM files, reads harboring the same sequence variation are redundantly encoded due to the independent encoding of each read. In order to eliminate this redundancy, a newer tool, DeeZ (Hach et al. 2014), implicitly assembles the underlying genome in order to encode these variations only once. A similar approach is followed by CBC (Ochoa et al. 2014) and TSC (Voges et al. 2016). All of these tools treat distinct SAM fields separately and apply a variety of compression techniques to each field. Finally, some of the best-performing tools in terms of compression rate, such as Quip and sam_comp (Bonfield and Mahoney 2013), employ highly optimized statistical models for various SAM fields at the expense of slower decompression.

Final Remarks

A recent benchmarking study on existing HTS compression methods (Numanagić et al. 2016) provided a comprehensive comparison of FASTQ and SAM compressors, revealing that some of the newly available tools achieve significant gains over universal lossless compression methods such as Gzip and bzip2 (as well as the standard Samtools for SAM compression), both in terms of compression rate and speed. Some of these tools provide significantly better compression rates at the expense of higher running time (both for compression and decompression) or larger memory usage. It is also important to note that the majority of the available tools are optimized for Illumina-style short, fixed-length reads (typically only a few hundred base pairs long) with a low (1%) sequencing error rate. Compressing long, variable-length, and error-prone read collections, such as data obtained with sequencers from Pacific Biosciences or Oxford Nanopore, has proven to be a more challenging task.

References

Benoit G et al (2015) Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics 16:288
Bonfield JK (2014) The scramble conversion tool. Bioinformatics 30(19):2818–2819
Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM format sequencing data. PLoS ONE 8:e59190
Chandak S, Tatwawadi K, Weissman T (2018) Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 34:558–567
Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (2009) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38:1767–1771
Cox AJ, Bauer MJ, Jakobi T, Rosone G (2012) Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics 28:1415–1419
CRAM format specification (version 3.0) (2017) https://samtools.github.io/hts-specs/CRAMv3.pdf
Deorowicz S, Grabowski S (2011) Compression of DNA sequence reads in FASTQ format. Bioinformatics 27:860–862
Deutsch LP (1996) GZIP file format specification version 4.3. https://tools.ietf.org/html/rfc1952
Dutta A, Haque MM, Bose T, Reddy CVSK, Mande SS (2015) FQC: a novel approach for efficient compression, archival, and dissemination of FASTQ datasets. J Bioinform Comput Biol 13:1541003
Ewing B, Hillier L, Wendl MC, Green P (1998) Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8:175–185
Fritz MHY, Leinonen R, Cochrane G, Birney E (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res 21:734–740
Ginart AA et al (2018) Optimal compressed representation of high throughput sequence data via light assembly. Nat Commun 9:566
Grabowski S, Deorowicz S, Roguski Ł (2014) Disk-based compression of data from genome sequencing. Bioinformatics 31:1389–1395
Hach F, Numanagić I, Alkan C, Sahinalp SC (2012) SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28:3051–3057
Hach F, Numanagić I, Sahinalp SC (2014) DeeZ: reference-based compression by local assembly. Nat Methods 11:1082–1084
Holland RC, Lynch N (2013) Sequence squeeze: an open contest for sequence compression. GigaScience 2:5
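The stream-decoupling strategy of the general FASTQ compressors discussed above can be sketched as follows. The records are synthetic, and zlib stands in for the specialized per-stream coders (identifier tokenization, 2-bit nucleotide packing, statistical quality modeling) that the real tools use.

```python
import zlib

def split_streams(fastq_text):
    """Decouple a FASTQ file into its four per-record streams."""
    lines = fastq_text.strip().split("\n")
    # Records are 4 lines each: identifier, sequence, comment ('+'), qualities.
    return [lines[i::4] for i in range(4)]

def compress_streams(streams):
    """Compress each stream separately instead of the file as plain text."""
    return [zlib.compress("\n".join(s).encode(), 9) for s in streams]

def decompress_streams(blobs):
    return [zlib.decompress(b).decode().split("\n") for b in blobs]

# Synthetic FASTQ data with hypothetical read names.
fastq = "\n".join("@read_%d\nACGTACGTAC\n+\nIIIIHHHGGF" % i for i in range(50))

streams = split_streams(fastq)
blobs = compress_streams(streams)
assert decompress_streams(blobs) == streams  # decoupling is lossless
```

Because each stream has its own alphabet and statistics (names share a common prefix, sequences use four symbols, qualities a narrow ASCII range), compressing them separately lets a coder adapt to each, which is the intuition behind DSRC-style designs.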
Geo-replication Models 783
in different locations around the world. A bank user can access a local branch and perform transactions.

Overview

Geo-replication models have emerged as an important technique for distributed applications over the Internet, such as cloud computing. Wide area network (WAN) communication is a fundamental barrier to developing large-scale Internet-based applications. User requests incur at least one round-trip across a WAN. While this might be acceptable for some applications, such as web browsing, an important class of applications (e.g., online games and e-commerce applications) is actually latency sensitive. For instance, Amazon reported that every 100 ms of slowdown degrades revenue by one percent. The WAN latency issue becomes more challenging as the market for distributed systems is fueled by cloud computing, big data, exascale computing, and so on.

Geo-replicated systems decrease WAN-based latency by bringing data closer to users. The replicated system model consists of a set of data centers, each maintaining a copy of the database state. A database contains a set of replicated data items. Each data item has a value at any one time and a type that determines the set of operations that can be invoked on the data object. Users interact with the database system through running applications. A client accesses a single replica, referred to as its local replica, and invokes a set of operations on the data objects stored in the database.

Different geo-replicated systems offer different services, ranging from simple read and write operations to ACID transactions (Gray and Reuter 1993). COPS (Lloyd et al. 2011) and Amazon's Dynamo (DeCandia et al. 2007) are examples of key-value geo-replicated storage systems with read and write operations. COPS (Lloyd et al. 2011) supports read-only operations with a consistent view of multiple keys, but it does not allow multi-key write operations. Eiger (Lloyd et al. 2013), the extended version of COPS, supports both read-only and write-only transactions across multiple keys. However, Eiger does not support arbitrary transactions with reads and writes across multiple keys.

Some geo-replicated systems, such as Cassandra (Lakshman and Malik 2010) and Yahoo's PNUTS (Cooper et al. 2008), extend read and write semantics by allowing clients to perform high-level update and query operations over data objects. A client's query simply returns the value of an object without changing the database state, but update operations modify the state. Consider PNUTS (Cooper et al. 2008), a geo-replicated data storage system that supports a simple relational data model, organizing data into tables of records (or objects). Each record has a version; updating a record creates a new version of the record. PNUTS's API provides several primitive calls for read and update requests with different guarantees. For instance, PNUTS's read-any semantics allows a local replica to return a stale version of the data, while PNUTS's read-latest semantics returns the most current version. For update requests, PNUTS allows records to be atomically created or updated, and its update API includes test-and-set semantics, which allows updating a record only if the current version of that record equals a given required version.

Recent geo-replicated systems support arbitrary transactions among multiple data objects, which enable the encapsulation of a group of operations into a single unit guaranteeing the well-known properties of atomicity, consistency, isolation, and durability (ACID). For instance, Google's Spanner (Corbett et al. 2012) proposes a transactional data storage system with ACID semantics across multiple data centers.

To maintain the state of all replicas of a geo-replicated model consistently, two main synchrony protocols are used: synchronous replication and asynchronous replication.

Synchronous Replication

Ideally, a geo-replicated system should ensure strong consistency, providing the behavior of a single centralized system. Implementing strong consistency semantics, such as linearizability (Herlihy and Wing 1990) and serializability
(Papadimitriou 1979), requires a global total order of updates (synchronous replication). Linearizability provides the intuition of a single replica that executes all operations in real-time order. Serializability (SER) ensures that every execution of operations is equivalent to some serial execution. In a synchronous replication approach, when a user issues a request at a local replica, before proceeding with the request, this replica must synchronize with all other replicas to find out whether there are concurrent conflicts, such as writes to the same data object. For instance, in a bank account application with deposit and debit operations, if one replica wants to perform a debit operation, it must first synchronize with all other replicas to stop other concurrent debits that would make the balance negative.

Such a synchronous protocol ensures that all replicas execute the application's requests in the same order, and if some operations conflict, only one of them can be accepted; declined operations may be ignored, or the application may choose to retry them. This synchronous and strongly consistent approach simplifies reasoning about the behavior of a program in a geo-replicated system, but the synchronization comes at a significant cost in availability (or performance).

Examples

Google's Megastore (Baker et al. 2011) implements a synchronous geo-replication model for a NoSQL data store where data is partitioned into small databases, called entity groups. Each entity group is synchronously replicated across data centers via the Paxos consensus protocol (Gray and Lamport 2006). A replicated transaction log is assigned to each entity group. Megastore provides transactional ACID semantics within an entity group by enforcing that transactions accessing the entity group access the group's log mutually exclusively, i.e., only one transaction can update the log at any time. Such a synchronous approach serializes the transactions running in the same group.

Google's Spanner (Corbett et al. 2012) is another geo-replicated model that supports synchronous replication across data centers using Paxos (Gray and Lamport 2006), ordering transactions with regard to a real-time clock, i.e., transactions commit in an order equivalent to their real-time order. Consider a bank application in which Alice transfers 100 dollars into Bob's account, and sometime later Bob reads his account. A database providing real-time ordering will guarantee that Bob sees the 100 dollars in his account. To implement real-time ordering in a distributed system, Spanner introduces a new API, called TrueTime, which bounds the clock uncertainty by specifying an error margin for each local physical clock. So, a transaction's timestamp is an interval between the earliest time at which the transaction could possibly have occurred and the latest such time. Given the intervals, Spanner orders transaction T1 before transaction T2 iff T1.latest < T2.earliest. To ensure atomicity and isolation within a data center, Spanner relies on two-phase commit and two-phase locking.

Asynchronous Replication

Experience with real-world applications shows that the synchronous approach often causes more synchronization than the application really needs (Bailis et al. 2013; Wang et al. 2002). To avoid this cost, one alternative approach is to allow a replica to execute a client's request without synchronizing a priori with the other replicas. Instead, it returns a response immediately to the client; the request is propagated to all replicas in the background. Eventually, every replica performs the same request, possibly with a delay. We call this asynchronous replication.

For instance, COPS (Lloyd et al. 2011, 2013) is a key-value geo-replicated system that supports asynchronous writes; each client can access a copy of the data stored in its local replica and independently perform operations.

Concurrency and Conflict Resolutions

Consistency problems arise when multiple replicas independently (i.e., without synchronizing with other replicas) update the same data object. Concurrent updates to multiple replicas of the same data object may execute in different orders among replicas; this may cause conflicts. For instance, consider Alice and Bob both access a
shared file foo at different replicas r1 and r2, respectively. The type of file foo is a register with operations to read the value or to write a new value. A write to the register overwrites its last successfully written value. Alice writes a to the file foo. Concurrently, Bob writes b to the same file. After exchanging the updates, the content of file foo will be different at replicas r1 and r2. This violates the expectation of convergence to the same state. A replicated system is said to be convergent if all replicas that have observed the same set of updates have equivalent state.

There are different approaches used by replicated systems to resolve such conflicts raised by concurrent executions. A common policy is to accept one of the two conflicting updates and ignore the other. For example, COPS (Lloyd et al. 2011) exploits the last-writer-wins (LWW) rule (also called Thomas's write rule (Johnson and Thomas 1976)), which assigns a unique timestamp to every update; in case of concurrent conflicts, the update with the higher timestamp is selected, and the older one is ignored.

The second approach is for the replicated system to support application-specific conflict resolution (DeCandia et al. 2007; Satyanarayanan et al. 1990; Terry et al. 1995). For example, the data store simply keeps the set of all concurrently written values of an object and presents them to the application; it is up to the application how to resolve the conflicts. This approach is prevalent in replicated file systems, such as LOCUS (Popek et al. 1981), Coda (Satyanarayanan et al. 1990), Ficus (Guy et al. 1990), and Roam (Ratner et al. 1999), and in version control systems, such as Git or SVN.

Alternatively, a replicated system may take advantage of replicated data types, such as CRDTs (Shapiro et al. 2011), which automatically resolve conflicting updates, ensuring that the database itself has knowledge of the resolution semantics in order to integrate conflict resolution into the replication protocol.

Therefore, an asynchronous geo-replicated system may use one or more conflict resolution strategies to optimistically resolve conflicts. For instance, Walter (Sovran et al. 2011) is a geo-replicated transactional key-value data store that uses two approaches: preferred replicas and counting sets (csets). The idea of a preferred replica is similar to the primary copy in the state machine replication model. Each object is assigned a preferred replica where updates to the object can be performed locally. A transaction commits locally within a single replica r if the preferred replica of all objects written by this transaction is r. Otherwise, committing the transaction at a non-preferred replica requires synchronization with all other replicas. Although the preferred replica approach amortizes the cost of distributed transactions, it does not scale well if data objects are modified by multiple users at different replicas. Walter provides another technique, built upon a commutative data type called cset, to avoid conflicts. Like commutative replicated data types (Shapiro et al. 2011), a cset supports commutative operations, which address write conflicts on the same data objects.

Application Invariants

An application might have a set of invariants, which are explicitly stated. An invariant is a property over the database state that determines the safe behavior of the application in terms of the correct values that may be observed by users accessing the data objects. Unfortunately, asynchronous replication trades sequential safety semantics for availability. For instance, the bank application may have an invariant enforcing that the account balance always be non-negative. If the balance is initially $100 and two users, connecting to different bank branches, try to withdraw $60 without synchronization, both withdrawals will succeed, but the balance will become negative (−$20).

Key Research Findings

A major challenge in the design of geo-replicated systems is to ensure consistency among data centers. The CAP theorem (Gilbert and Lynch 2002) asserts that a replicated system cannot promise both availability and consistency in the presence of failures. Thus, designers of geo-replicated systems face a vexing choice between exploiting the strong consistency guarantees of synchronous
replication, which promises standard sequential behavior to applications but is slow and fragile, and asynchronous replication, which is highly available and responsive but exposes users to concurrency anomalies.

No single replication model can best suit all purposes: none can provide acceptable consistency guarantees for all applications without sacrificing performance and availability (Sovran et al. 2011).

Recent geo-replicated systems, such as Gemini (Li et al. 2012) and Walter (Sovran et al. 2011), are evolving toward hybrid models, which provide different levels of consistency depending on the given application requirements. In the hybrid approach, the replicated data storage provides the asynchronous replication model by default, and synchronizations are added upon request (Li et al. 2012; Balegas et al. 2015; Terry et al. 2013). For instance, Gemini (Li et al. 2012) is a geo-replicated system implementing a hybrid consistency model called RedBlue. The RedBlue consistency model allows both synchronous and asynchronous replication models to coexist by classifying application operations as red or blue. Blue operations commute with all others; they are fast and always available, even when disconnected. Red operations must be totally ordered, requiring system-wide synchronization. RedBlue relies on programmer-specified invariants to determine where synchronization is necessary. Consider a replicated bank application where clients can access different bank branches and perform deposit and debit operations. The application invariant is that the account balance be non-negative. Deposit operations are blue, i.e., clients can add to their account in all circumstances, but debits are red, because the system needs to stop a debit that would make the balance negative. If the client cannot connect with the bank server, all debits are blocked.

Examples of Application

Geo-replicated systems bring tremendous benefits for applications with high availability and scalability requirements, from social networks (Wittie et al. 2010) (e.g., Facebook and Twitter rely on geo-replicated storage) and multi-player online games to e-commerce applications such as Amazon (DeCandia et al. 2007).

Future Directions for Research

Navigating the trade-off between consistency and performance (or availability) is not clear-cut in a geo-replicated system. Although hybrid consistency models significantly improve performance, there are still many issues that must be addressed to exploit such techniques effectively; this requires carefully analyzing application semantics and identifying its consistency requirements. Static analysis and verification approaches are promising: they help programmers remove synchronization from critical execution paths, adding synchronization only when strictly needed.

A number of verification tools for reasoning about application executions in a geo-replicated setting have already been proposed (Li et al. 2014; Gotsman et al. 2016; Kaki et al. 2018). Li et al. (2014) have presented a static analysis for classifying operations as synchronous or asynchronous in the RedBlue consistency model. The analysis checks whether executing an operation preserves a given integrity invariant; if not, the analysis concludes that the operation requires synchronization, i.e., it is a red operation. Bailis et al. (2015) have proposed a necessary and sufficient condition, called invariant confluence (I-confluence), to check whether operation executions on a replicated database need synchronization or not. The work analyzes the various operations of an application and its desired invariants in order to detect the necessary synchronization. Recently developed static analysis tools (Gotsman et al. 2016; Kaki et al. 2018) use a form of rely-guarantee (RG) reasoning (Jones 1983), a well-known method for the verification of concurrent programs. A successful analysis proves that any execution of operations over replicated data, under a given synchronization protocol, maintains the given invariants.
Gray J, Reuter A (1993) Transaction processing: concepts cations (DEXA). IEEE Computer Society, Los Alami-
and techniques. Morgan Kaufmann, San Francisco. tos, pp 96–104. http://doi.ieeecomputersociety.org/10.
ISBN:1-55860-190-2 1109/DEXA.1999.795151
Guy R, Heidemann JS, Mak W, Popek GJ, Roth- Satyanarayanan M, Kistler JJ, Kumar P, Okasaki ME,
meier D (1990) Implementation of the Ficus repli- Siegel EH, Steere DC (1990) Coda: a highly available
cated file system. In: USENIX conference proceedings, file system for a distributed workstation environment.
pp 63–71 IEEE Trans Comput 39:447–459
Herlihy M, Wing J (1990) Linearizability: a correcte- Shapiro M, Preguiça N, Baquero C, Zawirski M (2011)
ness condition for concurrent objects. ACM Trans Conflict-free replicated data types. In: Défago X, Petit
Program Lang Syst 12(3):463–492. https://doi.org/10. F, Villain V (eds) International symposium on stabiliza-
1145/78969.78972 tion, safety, and security of distributed Systems (SSS).
Johnson PR, Thomas RH (1976) The maintenance of Lecture notes in computer science, vol 6976. Springer,
duplicate databases. Internet request for comments Grenoble, pp 386–400
RFC 677. Information sciences institute. http://www. Sovran Y, Power R, Aguilera MK, Li J (2011) Transac-
rfc-editor.org/rfc.html tional storage for geo-replicated systems. In: Sympo-
Jones CB (1983) Specification and design of (parallel) sium on operating systems principles (SOSP). Associ-
G
programs. In: IFIP congress, North-Holland ation for Computing Machinery, Cascais, pp 385–400
Kaki G, Nagar K, Najafzadeh M, Jagannathan S (2018) Terry DB, Theimer MM, Petersen K, Demers AJ, Spre-
Alone together: compositional reasoning and inference itzer MJ, Hauser CH (1995) Managing update conflicts
for weak isolation. In: Proceedings of the 45rd annual in Bayou, a weakly connected replicated storage sys-
ACM SIGPLAN-SIGACT symposium on principles of tem. In: Symposium on operating systems principles
programming languages, POPL’18 (SOSP), ACM SIGOPS. ACM Press, Copper Moun-
Lakshman A, Malik P (2010) Cassandra: a decentralized tain, pp 172–182. http://www.acm.org/pubs/articles/
structured storage system. SIGOPS Oper Syst proceedings/ops/224056/p172-terry/p172-terry.pdf
Rev 44(2):35–40. https://doi.org/10.1145/1773912. Terry DB, Prabhakaran V, Kotla R, Balakrishnan M,
1773922 Aguilera MK, Abu-Libdeh H (2013) Consistency-
Li C, Porto D, Clement A, Gehrke J, Preguiça N, Ro- based service level agreements for cloud storage. In:
drigues R (2012) Making geo-replicated systems fast as SOSP
possible, consistent when necessary. In: Symposium on Wang AI, Reiher P, Bagrodia R, Kuenning G (2002)
operating systems design and implementation (OSDI), Understanding the behavior of the conflict-rate metric
Hollywood, pp 265–278 in optimistic peer replication. In: Proceedings of the
Li C, Leitão J, Clement A, Preguiça N, Rodrigues R, 13th international workshop on database and expert
Vafeiadis V (2014) Automating the choice of con- systems applications, pp 757–761. https://doi.org/10.
sistency levels in replicated systems. In: Proceedings 1109/DEXA.2002.1045989
of the 2014 USENIX conference on USENIX an- Wittie MP, Pejovic V, Deek L, Almeroth KC, Zhao BY
nual technical conference, USENIX ATC’14. USENIX (2010) Exploiting locality of interest in online social
Association, Berkeley, pp 281–292. http://dl.acm.org/ networks. In: Proceedings of the 6th international con-
citation.cfm?id=2643634.2643664 ference, Co-NEXT’10. ACM, New York, pp 25:1–
Lloyd W, Freedman MJ, Kaminsky M, Andersen DG 25:12. https://doi.org/10.1145/1921168.1921201
(2011) Don’t settle for eventual: scalable causal consis-
tency for wide-area storage with COPS. In: Symposium
on operating system principles (SOSP). Association for
Computing Machinery, Cascais, pp 401–416. https://
doi.org/10.1145/2043556.2043593
Lloyd W, Freedman MJ, Kaminsky M, Andersen
Geo-Scale Transaction
DG (2013) Stronger semantics for low-latency geo- Processing
replicated storage. In: Networked systems design and
implementation (NSDI), Lombard, pp 1–14 Faisal Nawab
Papadimitriou CH (1979) The serializability of concurrent
University of California, Santa Cruz, CA, USA
database updates. J ACM 26(4):631–653. http://portal.
acm.org/citation.cfm?id=322154.322158
Popek G, Walker B, Chow J, Edwards D, Kline C, Rudisin
G, Thiel G (1981) Locus: a network transparent, high Synonyms
reliability distributed system. In: Symposium on oper-
ating systems principles (SOSP). ACM, pp 169–177
Ratner D, Reiher P, Popek G (1999) Roam: a scalable
Geo-distributed transaction processing; Geo-
replication system for mobile computing. In: Interna- replicated transaction processing; Global-scale
tional workshop on database & expert systems appli- transaction processing
design of novel transaction processing protocols specifically for geo-scale deployments.

Paxos (Lamport 1998) is a fault-tolerant consensus protocol that has been widely used to implement transaction processing systems that provide strong guarantees. In these implementations, Paxos provides a state machine replication (SMR) component that guarantees the totally ordered execution of requests (e.g., transactions). Due to its wide use in data management systems, there have been many attempts to build Paxos-based protocols for Geo-Scale Transaction Processing, such as Megastore (Baker et al. 2011), Paxos-CP (Patterson et al. 2012), Egalitarian Paxos (Moraru et al. 2013), and MDCC (Kraska et al. 2013; Pang et al. 2014).

Megastore (Baker et al. 2011) uses Paxos to execute transactions on multiple nodes in different data centers. Paxos is used to assign transactions to log positions and to make all nodes agree on the assigned order. Then, each node executes the transactions according to the assigned order. This guarantees that all nodes maintain a consistent copy of the data, since they all execute transactions in the same order. Assigning the position for a request requires at least one round of communication between a Paxos leader and the other nodes. Paxos-CP (Patterson et al. 2012) extends Megastore by increasing the utilization of Paxos communication. Specifically, in the presence of concurrent transactions, Paxos-CP allows combining transactions in the same log position rather than starting a new round of messaging.

MDCC (Kraska et al. 2013; Pang et al. 2014) leverages a follow-up Paxos variant, called Fast Paxos (Lamport 2006), that exhibits better performance characteristics for geo-scale deployments compared to the original Paxos protocol. Fast Paxos, unlike the original Paxos protocol, is a leaderless protocol, meaning that leader election, a major component of Paxos' design, is not needed. Therefore, users at different locations can all process transactions without having to go through a leader. This comes at the cost of the larger quorums (i.e., groups of nodes) that are needed to process requests. In Paxos, a simple majority suffices to process a request (from a leader), whereas Fast Paxos requires a fast quorum that is typically larger than a majority of nodes. Egalitarian Paxos (EPaxos) (Moraru et al. 2013) proposes a new variant of Paxos that combines many attractive features for geo-scale deployments. First, EPaxos is a leaderless protocol; thus, it enjoys the benefits of leaderless protocols over the original Paxos protocol, as discussed above for Fast Paxos. Additionally, EPaxos reduces the quorum size to be smaller than what is typically needed by MDCC and achieves the optimal quorum size (i.e., three nodes) for the widely used case of five nodes. EPaxos also utilizes many useful features to increase concurrency, such as allowing nonconflicting operations to execute concurrently. This is done via a multidimensional log, similar to generalized Paxos (Lamport 2005), rather than a simple log of requests that enforces an ordering of events even if they do not conflict.

Paxos was also shown to be suitable to act as a synchronous communication layer integrated with a transaction commit protocol (Corbett et al. 2012; Glendenning et al. 2011; Mahmoud et al. 2013). For example, Spanner (Corbett et al. 2012) and Replicated Commit (Mahmoud et al. 2013) commit using variants of two-phase commit (2PC) and strict two-phase locking (S2PL) and leverage Paxos to replicate across data centers. Spanner partitions data and assigns a leader for each partition. Each partition is synchronously replicated, for fault tolerance, to a set of replicas using Paxos. Transactions that span more than one partition are executed by coordinating between the corresponding partition leaders. This architecture separates consistency from durability, allowing each to be optimized in isolation. Also, transactions that access a single partition are not subject to the multi-partition coordination. Replicated Commit executes transactions by getting votes from a majority of data centers. A data center's positive vote is a promise not to allow the execution of conflicting transactions. Once a majority of positive votes is collected, the transaction proceeds to commit. Replicated Commit conflates consistency and durability, which reduces the number of redundant messages. Additionally,
since there are no leaders, Replicated Commit exhibits better load-balancing characteristics. However, a majority vote is always needed for all transactions, even ones that access a single partition.

To ameliorate the cost of wide-area communication, proactive coordination techniques were developed that exchange data continuously to enable early detection of conflicts (Nawab et al. 2013, 2015b). Helios (Nawab et al. 2015b), a work based on proactive coordination, finds the optimal transaction commit latency via a theoretical model of coordination and proposes a protocol design that achieves that optimal latency. Helios shows that, for any pair of nodes, the sum of their transaction latencies cannot be lower than the round-trip time latency between them (Nawab et al. 2015b).

Timestamp-based and time synchronization techniques have also been explored for Geo-Scale Transaction Processing systems (Corbett et al. 2012; Du et al. 2013a,b). Although the distance between data centers may be large, Spanner (Corbett et al. 2012) shows that accurate time synchronization is achievable using specialized infrastructure such as atomic clocks and GPS. Without specialized hardware, accurate time synchronization is challenging. However, other geo-scale systems have explored the use of loosely synchronized clock methods (Du et al. 2013a,b; Nawab et al. 2015b).

Weak and Relaxed Transactional Semantics
A major source of performance overhead in Geo-Scale Transaction Processing protocols is the coordination between nodes to guarantee consistent outcomes. To overcome the overhead of coordination, many lines of work investigate the trade-off between consistency and performance. In such investigations, consistency is relaxed or weakened to enable higher levels of performance. Central to this study is understanding the implications of relaxing and weakening consistency and what this means for users and application developers. For example, application developers may have to cope with anomalies that arise due to the weaker consistency levels.

An extreme point in the performance-consistency trade-off spectrum is the weakening of consistency and isolation guarantees in favor of performance (Cooper et al. 2008; DeCandia et al. 2007; Bailis and Ghodsi 2013). Adopting this approach entails weakening the guarantees of how conflicts between requests are handled. For example, eventual consistency delays the arbitration of conflicts and only guarantees that conflicts are resolved eventually, thus creating the possibility of temporary inconsistency. Also, this approach may entail weakening the access abstractions, such as providing single-key access abstractions rather than multi-object transactions. This limits the interface available to application developers; however, it is easier to manage and enables higher levels of performance.

In addition to the two extremes in the consistency-performance trade-off spectrum, there are consistency levels in between that exhibit various performance and consistency characteristics. Exploring consistency levels in the context of Geo-Scale Transaction Processing aims to overcome the high cost of wide-area coordination while providing some consistency guarantees for application developers. To this end, there have been many efforts to study weaker consistency guarantees and then extend them to adapt to geo-scale environments. Causal consistency, which is inspired by the causal ordering principle (Lamport 1978), is one such consistency level. Causal consistency guarantees that events (e.g., transactions) are ordered at all nodes according to their causal relations. Causal relations in this context denote ordering guarantees between events at the same node (total order), ordering of events at a node before any subsequently received events (happens-before relation), and ordering due to transitive total-order or happens-before causal relations. There have been many Geo-Scale Transaction Processing systems based on causal consistency (Lloyd et al. 2011, 2013; Bailis et al. 2013; Nawab et al. 2015a; Du et al. 2013a). An example is COPS (Lloyd et al. 2011, 2013), which extends causal consistency with a convergence guarantee and calls it Causal+ consistency. COPS provides
a read-only transactional interface that guarantees Causal+ execution via a two-round protocol.

Another relaxed consistency guarantee that has been explored for Geo-Scale Transaction Processing is snapshot isolation (SI) (Berenson et al. 1995), which guarantees that a transaction reads a consistent snapshot of data and that write-write conflicts are detected. Geo-Scale Transaction Processing systems have been built by extending the concept of SI (Sovran et al. 2011; Daudjee and Salem 2006; Lin et al. 2005, 2007; Du et al. 2013b). An example is Walter (Sovran et al. 2011), which builds upon SI and proposes Parallel SI (PSI). PSI relaxes the snapshot and write-write conflict definitions to allow replicating data asynchronously between data centers, which is not possible in SI.

Coordination-Free Processing
Another approach to tackle the consistency-performance trade-off is to explore classes of transactions that can be executed without coordination while maintaining consistency. This approach is called coordination-free processing and has been explored in the context of Geo-Scale Transaction Processing (Bailis et al. 2014; Roy et al. 2015; Zhang et al. 2013). An example of this work is transaction chains (Zhang et al. 2013), which analyze an application's transactions and build an execution plan for distributed transactions with the objective of a fast response time, equivalent to executing at a single data center with no wide-area communication.

Data Placement
The topology of the geo-scale system plays a significant role in transaction performance. Therefore, placing data and workers on carefully chosen data centers has the potential of optimizing performance. Placement has been tackled in the context of geo-scale systems (Zakhary et al. 2016, 2018; Wu et al. 2013; Vulimiri et al. 2015; Agarwal et al. 2010; Endo et al. 2011; Pu et al. 2015; Shankaranarayanan et al. 2014; Sharov et al. 2015). Some of these placement frameworks specifically target optimizing placement for Geo-Scale Transaction Processing (Zakhary et al. 2016, 2018; Sharov et al. 2015). For example, Sharov et al. (2015) propose an optimization formulation for placement with the objective of minimizing latency for leader-based transaction processing systems such as Spanner (Corbett et al. 2012). Zakhary et al. (2016, 2018) propose placement formulations for quorum-based transaction processing systems such as Paxos-based systems.

Managing the Consistency-Performance Trade-Off
Rather than a fixed Geo-Scale Transaction Processing protocol, there have been studies on dynamic protocols that adapt their consistency level and protocol characteristics based on the current performance (Wu et al. 2015; Terry et al. 2013; Ardekani and Terry 2014). For example, CosTLO (Wu et al. 2015) introduces redundancy in messaging to lower the variance of latency. Another example is Pileus and Tuba (Terry et al. 2013; Ardekani and Terry 2014), which enable applications to dynamically control the consistency level by declaring consistency and latency priorities. According to these priorities, the system adapts its protocol.

Conclusion

Geo-Scale Transaction Processing provides easy-to-use abstractions for access to data that is globally replicated or distributed. One of the major challenges faced by Geo-Scale Transaction Processing is the large wide-area latency between nodes. To overcome this challenge, redesigns of transaction processing systems have been proposed, and studies of the consistency-performance trade-off have been conducted. These approaches are shown to optimize the performance of Geo-Scale Transaction Processing by treating the wide-area latency as the main bottleneck.

Examples of Application

Geo-Scale Transaction Processing is used by web and cloud applications.
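For instance, a cloud application choosing among the Paxos variants surveyed above can compare the quorum sizes that largely determine wide-area commit latency. The sketch below uses commonly stated formulas, which should be checked against the cited papers: a simple majority for classic Paxos, ceil(3n/4) as a typical Fast Paxos fast-quorum size, and f + floor((f+1)/2) for the EPaxos fast path with n = 2f + 1 nodes. Function names are illustrative.

```python
import math

def paxos_quorum(n):
    # Classic Paxos: a simple majority of the n nodes.
    return n // 2 + 1

def fast_paxos_quorum(n):
    # Fast Paxos: fast quorums are typically larger than a majority;
    # ceil(3n/4) is the commonly quoted size (Lamport 2006).
    return math.ceil(3 * n / 4)

def epaxos_fast_quorum(n):
    # EPaxos fast path for n = 2f + 1 nodes: f + floor((f+1)/2)
    # (Moraru et al. 2013).
    f = (n - 1) // 2
    return f + (f + 1) // 2

for n in (3, 5, 7):
    print(n, paxos_quorum(n), fast_paxos_quorum(n), epaxos_fast_quorum(n))
# For n = 5: majority = 3, Fast Paxos fast quorum = 4, EPaxos fast path = 3
```

This reproduces the comparison in the text: for the widely used five-node deployment, EPaxos commits on the fast path after contacting only three nodes, while a Fast Paxos fast quorum needs four.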
Future Directions for Research

An avenue of future research is to expand beyond data centers and build Geo-Scale Transaction Processing protocols that utilize edge resources.

Cross-References

Coordination Avoidance
Geo-Replication Models
Weaker Consistency Models/Eventual Consistency

References

Agarwal S, Dunagan J, Jain N, Saroiu S, Wolman A, Bhogan H (2010) Volley: automated data placement for geo-distributed cloud services. In: Proceedings of the 7th USENIX conference on networked systems design and implementation, NSDI'10. USENIX Association, Berkeley, pp 2–2. http://dl.acm.org/citation.cfm?id=1855711.1855713

Ardekani MS, Terry DB (2014) A self-configurable geo-replicated cloud storage system. In: Proceedings of the 11th USENIX conference on operating systems design and implementation, OSDI'14. USENIX Association, Berkeley, pp 367–381. http://dl.acm.org/citation.cfm?id=2685048.2685077

Bailis P, Ghodsi A (2013) Eventual consistency today: limitations, extensions, and beyond. Queue 11(3):20:20–20:32. http://doi.acm.org/10.1145/2460276.2462076

Bailis P, Ghodsi A, Hellerstein JM, Stoica I (2013) Bolt-on causal consistency. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, SIGMOD'13. ACM, New York, pp 761–772. http://doi.acm.org/10.1145/2463676.2465279

Bailis P, Fekete A, Franklin MJ, Ghodsi A, Hellerstein JM, Stoica I (2014) Coordination avoidance in database systems. Proc VLDB Endow 8(3):185–196. https://doi.org/10.14778/2735508.2735509

Baker J, Bond C, Corbett JC, Furman JJ, Khorlin A, Larson J, Leon J, Li Y, Lloyd A, Yushprakh V (2011) Megastore: providing scalable, highly available storage for interactive services. In: CIDR 2011, fifth biennial conference on innovative data systems research, Asilomar, 9–12 Jan 2011, pp 223–234. Online proceedings. http://cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf

Berenson H, Bernstein P, Gray J, Melton J, O'Neil E, O'Neil P (1995) A critique of ANSI SQL isolation levels, pp 1–10. http://doi.acm.org/10.1145/223784.223785

Bernstein PA, Goodman N (1981) Concurrency control in distributed database systems. ACM Comput Surv (CSUR) 13(2):185–221

Bernstein PA, Hadzilacos V, Goodman N (1987) Concurrency control and recovery in database systems. Addison-Wesley, Reading

Cooper BF, Ramakrishnan R, Srivastava U, Silberstein A, Bohannon P, Jacobsen HA, Puz N, Weaver D, Yerneni R (2008) PNUTS: Yahoo!'s hosted data serving platform. Proc VLDB Endow 1(2):1277–1288. https://doi.org/10.14778/1454159.1454167

Corbett JC, Dean J, Epstein M, Fikes A, Frost C, Furman JJ, Ghemawat S, Gubarev A, Heiser C, Hochschild P, Hsieh W, Kanthak S, Kogan E, Li H, Lloyd A, Melnik S, Mwaura D, Nagle D, Quinlan S, Rao R, Rolig L, Saito Y, Szymaniak M, Taylor C, Wang R, Woodford D (2012) Spanner: Google's globally-distributed database, pp 251–264. http://dl.acm.org/citation.cfm?id=2387880.2387905

Daudjee K, Salem K (2006) Lazy database replication with snapshot isolation. In: Proceedings of the 32nd international conference on very large data bases, VLDB'06. VLDB Endowment, pp 715–726. http://dl.acm.org/citation.cfm?id=1182635.1164189

DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: Proceedings of the twenty-first ACM SIGOPS symposium on operating systems principles, SOSP'07. ACM, New York, pp 205–220. http://doi.acm.org/10.1145/1294261.1294281

Du J, Elnikety S, Roy A, Zwaenepoel W (2013a) Orbe: scalable causal consistency using dependency matrices and physical clocks. In: Proceedings of the 4th annual symposium on cloud computing, SOCC'13. ACM, New York, pp 11:1–11:14. http://doi.acm.org/10.1145/2523616.2523628

Du J, Elnikety S, Zwaenepoel W (2013b) Clock-SI: snapshot isolation for partitioned data stores using loosely synchronized clocks. In: Proceedings of the 2013 IEEE 32nd international symposium on reliable distributed systems, SRDS'13. IEEE Computer Society, Washington, pp 173–184. https://doi.org/10.1109/SRDS.2013.26

Endo PT, de Almeida Palhares AV, Pereira NN, Goncalves GE, Sadok D, Kelner J, Melander B, Mangs JE (2011) Resource allocation for distributed cloud: concepts and research challenges. IEEE Netw 25(4):42–46. https://doi.org/10.1109/MNET.2011.5958007

Glendenning L, Beschastnikh I, Krishnamurthy A, Anderson T (2011) Scalable consistency in scatter. In: Proceedings of the twenty-third ACM symposium on operating systems principles, SOSP'11. ACM, New York, pp 15–28. http://doi.acm.org/10.1145/2043556.2043559

Kemme B, Jimenez-Peris R, Patino-Martinez M (2010) Database replication. Synth Lect Data Manage 2(1):1–153. http://www.morganclaypool.com/doi/abs/10.2200/S00296ED1V01Y201008DTM007
Kraska T, Pang G, Franklin MJ, Madden S, Fekete A (2013) MDCC: multi-data center consistency. In: Proceedings of the 8th ACM European conference on computer systems, EuroSys'13. ACM, New York, pp 113–126. http://doi.acm.org/10.1145/2465351.2465363

Lamport L (1978) Time, clocks, and the ordering of events in a distributed system. Commun ACM 21(7):558–565. http://doi.acm.org/10.1145/359545.359563

Lamport L (1998) The part-time parliament. ACM Trans Comput Syst 16(2):133–169. http://doi.acm.org/10.1145/279227.279229

Lamport L (2005) Generalized consensus and Paxos. Technical report MSR-TR-2005-33, Microsoft Research

Lamport L (2006) Fast Paxos. Distrib Comput 19(2):79–103

Lin Y, Kemme B, Patiño Martínez M, Jiménez-Peris R (2005) Middleware based data replication providing snapshot isolation. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data, SIGMOD'05. ACM, New York, pp 419–430. http://doi.acm.org/10.1145/1066157.1066205

Lin Y, Kemme B, Patino-Martinez M, Jimenez-Peris R (2007) Enhancing edge computing with database replication. In: Proceedings of the 26th IEEE international symposium on reliable distributed systems, SRDS'07. IEEE Computer Society, Washington, pp 45–54. http://dl.acm.org/citation.cfm?id=1308172.1308219

Lloyd W, Freedman MJ, Kaminsky M, Andersen DG (2011) Don't settle for eventual: scalable causal consistency for wide-area storage with COPS. In: Proceedings of the twenty-third ACM symposium on operating systems principles, SOSP'11. ACM, New York, pp 401–416. http://doi.acm.org/10.1145/2043556.2043593

Lloyd W, Freedman MJ, Kaminsky M, Andersen DG (2013) Stronger semantics for low-latency geo-replicated storage. In: Proceedings of the 10th USENIX conference on networked systems design and implementation, NSDI'13. USENIX Association, Berkeley, pp 313–328. http://dl.acm.org/citation.cfm?id=2482626.2482657

Mahmoud H, Nawab F, Pucher A, Agrawal D, El Abbadi A (2013) Low-latency multi-datacenter databases using replicated commit. Proc VLDB Endow 6(9):661–672. https://doi.org/10.14778/2536360.2536366

Moraru I, Andersen DG, Kaminsky M (2013) There is more consensus in Egalitarian parliaments. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles, SOSP'13. ACM, New York, pp 358–372. http://doi.acm.org/10.1145/2517349.2517350

Nawab F, Agrawal D, El Abbadi A (2013) Message futures: fast commitment of transactions in multi-datacenter environments. In: CIDR 2013, sixth biennial conference on innovative data systems research, Asilomar, 6–9 Jan 2013. Online proceedings. http://cidrdb.org/cidr2013/Papers/CIDR13_Paper103.pdf

Nawab F, Arora V, Agrawal D, El Abbadi A (2015a) Chariots: a scalable shared log for data management in multi-datacenter cloud environments. In: Proceedings of the 18th international conference on extending database technology, EDBT 2015, Brussels, 23–27 Mar 2015, pp 13–24. https://doi.org/10.5441/002/edbt.2015.03

Nawab F, Arora V, Agrawal D, El Abbadi A (2015b) Minimizing commit latency of transactions in geo-replicated data stores. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD'15. ACM, New York, pp 1279–1294. http://doi.acm.org/10.1145/2723372.2723729

Pang G, Kraska T, Franklin MJ, Fekete A (2014) Planet: making progress with commit processing in unpredictable environments. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, SIGMOD'14. ACM, New York, pp 3–14. http://doi.acm.org/10.1145/2588555.2588558

Patterson S, Elmore AJ, Nawab F, Agrawal D, El Abbadi A (2012) Serializability, not serial: concurrency control and availability in multi-datacenter datastores. Proc VLDB Endow 5(11):1459–1470. https://doi.org/10.14778/2350229.2350261

Pu Q, Ananthanarayanan G, Bodik P, Kandula S, Akella A, Bahl P, Stoica I (2015) Low latency geo-distributed data analytics. In: Proceedings of the 2015 ACM conference on special interest group on data communication, SIGCOMM'15. ACM, New York, pp 421–434. http://doi.acm.org/10.1145/2785956.2787505

Roy S, Kot L, Bender G, Ding B, Hojjat H, Koch C, Foster N, Gehrke J (2015) The homeostasis protocol: avoiding transaction coordination through program analysis. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD'15. ACM, New York, pp 1311–1326. http://doi.acm.org/10.1145/2723372.2723720

Shankaranarayanan PN, Sivakumar A, Rao S, Tawarmalani M (2014) Performance sensitive replication in geo-distributed cloud datastores. In: Proceedings of the 2014 44th annual IEEE/IFIP international conference on dependable systems and networks, DSN'14. IEEE Computer Society, Washington, pp 240–251. https://doi.org/10.1109/DSN.2014.34

Sharov A, Shraer A, Merchant A, Stokely M (2015) Take me to your leader!: online optimization of distributed storage configurations. Proc VLDB Endow 8(12):1490–1501. https://doi.org/10.14778/2824032.2824047

Sovran Y, Power R, Aguilera MK, Li J (2011) Transactional storage for geo-replicated systems. In: Proceedings of the twenty-third ACM symposium on operating systems principles, SOSP'11. ACM, New York, pp 385–400. http://doi.acm.org/10.1145/2043556.2043592

Terry DB, Prabhakaran V, Kotla R, Balakrishnan M, Aguilera MK, Abu-Libdeh H (2013) Consistency-based service level agreements for cloud storage. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles, SOSP'13. ACM, New York, pp 309–324. http://doi.acm.org/10.1145/2517349.2522731
in parallel. Similar to CPUs, GPUs also contain processing cores, registers, multiple levels of cache, and a global memory unit. However, GPUs differ from CPUs in the design of each of these components. Instead of the small number of complex cores available on CPUs, GPUs use a large number of relatively simple cores that work in a single instruction, multiple data (SIMD) fashion to process large amounts of data. Modern GPUs contain more than 100x the number of cores in modern CPUs. This allows GPUs to execute the same operation on a large number of data items in parallel. GPUs are able to fit such a large number of cores in a single silicon die due to the relative simplicity of each individual core. These cores lack complex components such as branch prediction units and have much smaller caches than CPU cores.

The GPU cores also have access to a much larger number of registers and a higher-bandwidth global memory unit, which is usually smaller (in size) than the main memory available to CPUs. The high bandwidth memory (HBM) units available in modern GPUs can provide close to 10x the bandwidth of the main memory accessible to CPUs. This availability of a larger number of registers and the much higher global memory bandwidth of GPUs ensure fast data access for the large number of parallel cores, thus making it possible to achieve much better throughput than CPUs. Figure 1 shows an overview of the internal architecture of GPUs.

The high level of parallelism provided by GPUs makes them an attractive platform for many high-performance computing tasks (Owens et al. 2007, 2008; Nickolls and Dally 2010). Over the past few years, GPUs have become popular for applications like 3D rendering (Kruger and Westermann 2003; Cates et al. 2018; Liu et al. 2004), deep learning (Chetlur et al. 2014; Bergstra et al. 2011), online analytical processing (Paul et al. 2016; Heimel et al. 2013), and weather modeling (Shimokawabe et al. 2010; Mielikainen et al. 2012). Such applications often involve executing the same instruction on a large number of data entries in parallel and have a minimal amount of incoherent branching. Hence, they benefit from the huge parallelism provided by GPUs without encountering the penalty associated with divergent execution.

In spite of being able to achieve high performance for parallel tasks, GPUs are not suitable for running generic tasks like an operating system. Hence, any system that takes advantage of GPUs for accelerating tasks needs a CPU component to run generic tasks like schedulers and resource managers. GPU-based hardware platforms are therefore platforms that usually use GPUs in conjunction with CPUs to accelerate specific tasks. These hardware platforms can broadly be classified into single-package and multi-package platforms, based on the level of integration between the CPU and GPU. Single-package platforms contain a GPU and a CPU on a single semiconductor package (Wikipedia 2011). Multi-package platforms are those that contain a CPU and a GPU in two separate packages, and the communication between the two happens over interconnections like PCIe or NVLink.

Single Package

In a single-package platform, the CPU and the GPU components are colocated within the same semiconductor package. Such systems can contain a single silicon die or multiple silicon dies interconnected using a silicon bridge. These single-package platforms can further be divided into (1) mobile platforms used in small handheld devices and (2) desktop platforms used in larger systems for more high-performance tasks.

Mobile Platforms
Most SoCs used in mobile devices these days contain GPUs in some form. Some of the most common examples include NVIDIA Tegra (NVIDIA 2015), Qualcomm Snapdragon SoCs (Qualcomm 2017), and Apple's A-series SoCs (Wikipedia 2017). Mobile GPU platforms are usually designed to minimize power consumption and are usually cooled passively due to space restrictions.

The latest Tegra X1 SoC from NVIDIA is the successor to the Tegra K1 SoC and contains an ARM CPU combined with an NVIDIA Maxwell
GPU. The ARM CPU is a 64-bit A57 with four cores and 2 MB of L2 cache, while the GPU is a 256-core Maxwell-generation GPU. The latest Snapdragon 835 SoC from Qualcomm contains the Qualcomm Kryo 280 CPU, a 64-bit CPU based on the ARM Cortex technology. And the A11 SoC from Apple, the successor to the A10 SoC, contains a 64-bit ARMv8 CPU coupled with a triple-core GPU designed by Apple.

Desktop Platforms
Desktop GPU platforms are generally much more capable than mobile platforms, as these devices can afford to consume much more power and are usually cooled by dedicated hardware. Some examples of commercially available hardware platforms of this kind include the AMD Ryzen Accelerated Processing Units (APUs) (AMD 2017a) and Intel's latest effort to integrate AMD Radeon Graphics with its eighth-generation CPUs (AnandTech 2017) in the same physical package using EMIB technology.

The latest AMD Ryzen 7 2700U APU from AMD, the successor to the Fusion series APUs, consists of a quad-core CPU with up to 3.8 GHz boost clock and a Vega 10 GPU with ten CUs operating at up to 1.3 GHz. Finally, the latest announcement comes from Intel regarding their plans to combine an eighth-generation Intel CPU with AMD Radeon graphics. Unlike the other two approaches, Intel plans to interlink two separate silicon dies (one containing an Intel CPU and the other containing an AMD GPU) into a single package using the embedded multi-die interconnect bridge (EMIB) technology, where the GPU and the CPU communicate over a high-speed silicon bridge.

Advantages and Limitations
The main advantage of such a single-package design is the high bandwidth and low latency of communication between the CPU and the GPU. Another advantage is that, in many cases, due to the close integration between the hardware devices, the GPU is able to directly access data from the CPU, such as the contents of the L3 cache. Such systems also allow the use of a single unified memory unit (accessible to both the CPU and the GPU) instead of having separate memory units for the CPU and the GPU. This helps avoid the overhead of explicitly moving data from the main memory to the GPU global memory, as in the case of multi-package systems.

The main limitation of single-package systems comes in the form of higher heat generation and limitations on package size. Integrating CPUs and GPUs into the same silicon die or semiconductor package leads to higher energy output, due to more high-performance components being crammed into a small space. This can limit the capabilities of the system as a whole. Also, mass production of very large semiconductor packages is a difficult endeavor; hence, the size of the CPU or GPU that can fit within the unified system will be much more limited than in multi-package systems, leading to less powerful individual components.

Multi-package

Multi-package systems consist of GPUs and CPUs in two separate semiconductor packages on the same physical board or even on two separate physical boards. Examples of such systems include standalone GPU hardware that can be connected to CPUs over interconnects like NVLink and PCIe. Both AMD and NVIDIA release such standalone or discrete GPUs every year. In these systems, the GPUs contain separate physical global memory which stores all the data that can be processed by the GPU. Due to their much larger size and higher power requirements, multi-package systems are not suitable for mobile devices and are usually restricted to desktop devices.
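The cost of explicitly moving data into the separate GPU global memory can be seen with a back-of-envelope model. This is a minimal sketch; all bandwidth figures below are illustrative assumptions, not measurements of any particular device:

```python
# Toy model of the interconnect overhead in a multi-package
# (discrete GPU) setup versus a unified-memory single-package
# setup. Bandwidth numbers are assumed for illustration only.

def kernel_time(data_bytes, gpu_throughput):
    """Time the GPU spends actually computing over the data."""
    return data_bytes / gpu_throughput

def discrete_gpu_time(data_bytes, pcie_bandwidth, gpu_throughput):
    """Discrete GPU: the data must first cross the interconnect."""
    transfer = data_bytes / pcie_bandwidth
    return transfer + kernel_time(data_bytes, gpu_throughput)

def unified_memory_time(data_bytes, gpu_throughput):
    """Single-package design: no explicit host-to-device copy."""
    return kernel_time(data_bytes, gpu_throughput)

GB = 1e9
t_discrete = discrete_gpu_time(4 * GB, 16 * GB, 500 * GB)  # 0.258 s
t_unified = unified_memory_time(4 * GB, 500 * GB)          # 0.008 s
```

With these assumed numbers, the host-to-device copy dominates the total time of the discrete GPU, which is why faster interconnects such as NVLink, and the unified memory of single-package designs, matter in practice.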
The latest P100 discrete GPU from NVIDIA (2016) contains more than 3500 individual cores that can operate at frequencies of up to 1.4 GHz. The P100 also has two variants: NVLink based and PCIe based. The PCIe-based model only supports the PCIe bus for communication, while the NVLink version also supports the NVLink interconnect, which can provide up to 5x the speed of the PCIe interconnect. The P100 also supports up to 16 GB of high-bandwidth memory. The latest RX Vega 64 GPU (AMD 2017b) from AMD, on the other hand, contains over 4000 cores operating at up to 1.4 GHz, with up to 8 GB of high-bandwidth memory.

Advantages and Limitations
The main advantage of the multi-package GPU platforms is that these systems can support much more powerful individual components when compared to single-package designs. This is because separate packages can be cooled more efficiently, and individual components (GPU and CPU) can be much larger in size when compared to single-package systems, where both components need to fit into a single semiconductor package. Further, the use of interconnects like PCIe means that more GPUs can be added to the platform at a later stage (based on the CPU and the available PCIe lanes), making such systems more flexible and upgradeable when compared to single-package systems.

However, multi-package systems suffer from major bottlenecks when it comes to actual performance. Such systems are usually bottlenecked by the speed of the interconnect between the CPU and the GPU. This is because any data that needs to be processed by the GPU has to be transferred from the system main memory to the GPU global memory via the PCIe bus. Further, these discrete GPUs lack access to the CPU cache and in many cases even lack direct access to the CPU main memory, leading to high communication overhead between the CPU and the GPU.

Cross-References
Search and Query Accelerators

References
AMD (2017a) AMD Ryzen mobile. http://www.amd.com/en/products/ryzen-processors-laptop
AMD (2017b) AMD Vega 64. https://gaming.radeon.com/en/product/vega/radeon-rx-vega-64/
AnandTech (2017) Intel to create 8th generation CPUs with AMD Radeon graphics. https://www.anandtech.com/show/12003/intel-to-create-new-8th-generation-cpus-with-amd-radeon-graphics-with-hbm2-using-emib
Bergstra J, Bastien F, Breuleux O, Lamblin P, Pascanu R, Delalleau O, Desjardins G, Warde-Farley D, Goodfellow IJ, Bergeron A, Bengio Y (2011) Theano: deep learning on GPUs with Python. In: Big learn workshop, NIPS'11
Cates JE, Lefohn AE, Whitaker RT (2018) GIST: an interactive, GPU-based level set segmentation tool for 3D medical images. Med Image Anal 8(3):217–231. https://doi.org/10.1016/j.media.2004.06.022
Chetlur S, Woolley C, Vandermersch P, Cohen J, Tran J, Catanzaro B, Shelhamer E (2014) cuDNN: efficient primitives for deep learning. CoRR. http://arxiv.org/abs/1410.0759
Heimel M, Saecker M, Pirk H, Manegold S, Markl V (2013) Hardware-oblivious parallelism for in-memory column-stores. Proc VLDB Endow 6(9):709–720. https://doi.org/10.14778/2536360.2536370
Kruger J, Westermann R (2003) Acceleration techniques for GPU-based volume rendering. In: Proceedings of the 14th IEEE visualization 2003 (VIS'03), IEEE Computer Society, Washington, DC, p 38. https://doi.org/10.1109/VIS.2003.10001
Liu Y, Liu X, Wu E (2004) Real-time 3D fluid simulation on GPU with complex obstacles. In: Proceedings of the 12th Pacific conference on computer graphics and applications, PG 2004, pp 247–256. https://doi.org/10.1109/PCCGA.2004.1348355
Mielikainen J, Huang B, Huang HLA, Goldberg MD (2012) Improved GPU/CUDA based parallel weather and research forecast (WRF) single moment 5-class (WSM5) cloud microphysics. IEEE J Sel Top Appl Earth Obs Remote Sens 5(4):1256–1265. https://doi.org/10.1109/JSTARS.2012.2188780
Nickolls J, Dally WJ (2010) The GPU computing era. IEEE Micro 30(2):56–69. https://doi.org/10.1109/MM.2010.41
NVIDIA (2015) NVIDIA Tegra X1. http://www.nvidia.com/object/tegra-x1-processor.html
NVIDIA (2016) NVIDIA Tesla P100. https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
Owens JD, Luebke D, Govindaraju N, Harris M, Krüger J, Lefohn AE, Purcell TJ (2007) A survey of general-purpose computation on graphics hardware. Comput Graphics Forum 26(1):80–113. https://doi.org/10.1111/j.1467-8659.2007.01012.x
Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC (2008) GPU computing. Proc IEEE 96(5):879–899. https://doi.org/10.1109/JPROC.2008.917757
Paul J, He J, He B (2016) GPL: a GPU-based pipelined query processing engine. In: Proceedings of the 2016 international conference on management of data (SIGMOD'16). ACM, New York, pp 1935–1950. https://doi.org/10.1145/2882903.2915224
Qualcomm (2017) Snapdragon 835 Mobile Platform. https://www.qualcomm.com/products/snapdragon/processors/835
Shimokawabe T, Aoki T, Muroi C, Ishida J, Kawano K, Endo T, Nukada A, Maruyama N, Matsuoka S (2010) An 80-fold speedup, 15.0 TFlops full GPU acceleration of non-hydrostatic weather model ASUCA production code. In: Proceedings of the 2010 ACM/IEEE international conference for high performance computing, networking, storage and analysis (SC'10). IEEE Computer Society, Washington, DC, pp 1–11. https://doi.org/10.1109/SC.2010.9
Wikipedia (2011) Semiconductor package. https://en.wikipedia.org/wiki/Semiconductor_package
Wikipedia (2017) Apple mobile application processors. https://en.wikipedia.org/wiki/Apple_mobile_application_processors?oldid=752966303

Grammar-Based Compression

[…] and hyperedge-replacement grammars. The entry (Bannai 2016) focuses on grammar construction algorithms for strings. Here, only a few grammar construction algorithms are discussed. As a prototypical example of algorithms over grammars, the equivalence problem is considered.

Key Research Findings

String Grammars
A string s consists of a sequence a1 ⋯ an of letters ai of an alphabet Σ. The length n of a string s = a1 ⋯ an is denoted by |s|.
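The SLPs (straight-line programs) discussed below, whose formal definition (Definition 1) is referenced throughout this entry, are context-free grammars in which every nonterminal has exactly one rule and therefore derives exactly one string. A minimal sketch (the rule encoding is an assumption for illustration, not the entry's notation) shows how an SLP with k rules can derive a string of length 2^k:

```python
# Minimal sketch of a straight-line program (SLP): every
# nonterminal has exactly one rule, so the grammar derives
# exactly one string. A rule maps a nonterminal to a list of
# nonterminals and/or terminal letters (encoding assumed here).

def val(rules, symbol):
    """Expand a symbol to the unique string it derives."""
    if symbol not in rules:  # a terminal letter
        return symbol
    return "".join(val(rules, x) for x in rules[symbol])

def power_slp(k):
    """An SLP with k rules deriving a^(2^k): exponential compression."""
    rules = {"X0": ["a", "a"]}
    for i in range(1, k):
        rules[f"X{i}"] = [f"X{i-1}", f"X{i-1}"]
    return rules, f"X{k-1}"

rules, start = power_slp(5)
assert val(rules, start) == "a" * 32  # 32 letters from only 5 rules
```

The exponential gap between |P| and |val(P)| in this sketch is exactly why the algorithmic questions below (smallest grammar, random access, equality) are studied on the compressed representation directly.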
[… given that ∑_{i∈I} ⋯ = m, it holds that ∏_{i∈I} ⋯ ≤ 3^⌈m/3⌉.]

The origins of grammar-based compression are often attributed to Nevill-Manning and Witten (1997), Kieffer and Yang (2000), and Kieffer et al. (2000); however, as pointed out in Casel et al. (2016), the concept appears already in Storer and Szymanski (1982, 1978) as a particular version of their "external pointer macro."

Grammar Compressors
A (string) grammar compressor is an algorithm that takes as input a string s and computes a small SLP P with val(P) = s. Several known compression schemes (e.g., LZ78 and, with a small blowup, also LZ77 (Rytter 2003)) can be seen as grammar compressors. Determining, for a given string s and an integer k, whether there exists an SLP P with |P| ≤ k is NP-complete (Charikar et al. 2005). In fact, even for "1-level SLPs", i.e., SLPs where only the start rule contains nonterminals, the problem remains NP-complete (even for an alphabet of size 5) (Casel et al. 2016); the same paper also presents an algorithm which uses dynamic programming to construct a smallest grammar in time O(3^|s|).

RePair (Larsson and Moffat 1999) is an example of a grammar compressor that runs in linear time and performs well in practice. The idea of RePair is to iteratively replace repeating digrams (= two consecutive symbols) by new nonterminals (with their corresponding rules), until no repeating digram exists. The approximation ratio of RePair, i.e., the maximum over any string s of length n of the size of the SLP produced by RePair for s divided by the size of a smallest grammar for s, is bounded by O((n/log n)^(2/3)). Another well-known grammar compressor is Bisection (Kieffer et al. 2000), with an approximation ratio bounded by O((n/log n)^(1/2)). The idea is to recursively cut the string s into two pieces, starting at position 2^j, where j is the largest integer such that 2^j < |s|. Afterward, a new nonterminal is generated for every distinct string of length greater than one produced in the cutting process. Then |R(X)| = 2 for every nonterminal. A space-optimal online grammar compressor is presented in Takabatake et al. (2017); see Bannai (2016) for a more detailed discussion of approximation algorithms for a smallest grammar. It should be noted that the length of the LZ77 factorization of a string s is a lower bound on the size of any SLP for s (Rytter 2003; Charikar et al. 2005).

Random Access
Given an SLP P, one can construct in linear time (on a RAM) a data structure which supports random access to any given position p of val(P) in time O(log N), where N = |val(P)|; the substring starting at p of length m can be computed in additional O(m) time (Bille et al. 2015). Moreover, for a string representing the traversal string of a tree (consisting of opening and closing parentheses, plus (possibly) node labels), they construct in linear time (on a RAM) a data structure that supports a wide range of tree navigation operations (such as child, parent, level ancestor, lowest common ancestor, etc.).

An algorithm that runs in polynomial time with respect to |P| is called a speed-up algorithm. Many speed-up algorithms are known, for instance, for pattern matching, regular expression matching, edit-distance computation (Hermelin et al. 2009; Takabatake et al. 2016), and equality checking; see (Lohrey 2012) for a survey.

Equality Checking
Equality checking, i.e., deciding if val(P1) = val(P2) for SLPs P1, P2, can be achieved in time O((|P1| + |P2|)^2). This can be shown by recompression (Jez 2015). Recompression is the application of two operations in alternation: (1) block compression and (2) pair compression. Block compression means to represent a string s of the form a1^n1 a2^n2 ⋯ ak^nk by the string a1^(n1) ⋯ ak^(nk), where the ai^(ni) are new symbols. Given a partition Σl, Σr of the alphabet Σ, pair compression means to replace each digram ab with a ∈ Σl and b ∈ Σr by the new symbol ⟨ab⟩. It can be shown that for a string s that does not contain any factor of the form aa for a ∈ Σ, a partition Σl, Σr can be computed so that pair compression generates a string of length at most 1/4 + (3/4)|s|. Two strings are equal if and only if their block or pair compressed versions are
equal. Thus, equality of two strings s1, s2 can be tested by repeatedly applying the three operations, (i) block compression, (ii) finding a partition, and (iii) pair compression, until one string of length one is obtained. At most O(log |s1|) repetitions are needed. It can be shown that the three operations can be efficiently applied to SLPs P1, P2 directly. For instance, to carry out block compression over an SLP, one first computes for each nonterminal X the symbols a, b and maximal numbers i, j such that the string generated by X equals a^i α b^j, and then uses this information to construct a block-compressed SLP. The finally obtained SLPs are isomorphic (which can be checked in linear time) if and only if val(P1) = val(P2). This yields an algorithm for equality checking of SLPs that runs in time O((|P1| + |P2|)^2).

For simplicity, only ranked trees are considered for the rest of this section. While minimal SLRTs can give exponential compression (viz., a full binary tree), they are not able to compress a monadic tree of the form a(a(⋯ a(e) ⋯)). A straight-line context-free tree grammar (over Σ) (SLCT) is a triple (N, S, R), where N is a ranked alphabet of nonterminals, S ∈ N is the start nonterminal, and a rule R(X) is a ranked tree over N ∪ Σ and parameters y1, …, ym of rank zero. It is required that the DAG condition (see Definition 1 and the sentence below it) holds. A rule R(X) is applied to a subtree X(t1, …, tm) by replacing the subtree by the tree obtained from R(X) by replacing each yj by tj for 1 ≤ j ≤ m. Note that SLCTs can achieve double-exponential compression: let Tn be the SLCT with […]

[…] rank of nonterminals; hence it is desirable to keep this number small. TreeRePair works well in practice but, from a theoretical point of view, does not approximate a smallest grammar well. For a linear-time approximation algorithm that generalizes the recompression idea from SLPs to TSLPs, it has been shown that for a tree t of size n, a TSLP G of size O(rg + rg log(n)) is produced, where r is the maximal rank of a node label in t and g is the size of a smallest grammar (Jez and Lohrey 2016). For another compressor, which generalizes Bisection to trees, it was shown that a TSLP of size O(n / log_σ n) is produced, where n is the size of the input tree and σ is the number of node labels (Ganardi et al. 2017).

Equality Checking
Similarly as for SLPs, many speed-up algorithms are known for TSLPs, for instance, testing membership in a regular tree language or testing equality. Often the complexity of algorithms on TSLPs depends on the maximal number of parameters of a nonterminal. It has been shown that any given TSLP can be transformed into an equivalent one where each nonterminal uses at most one parameter. For such a TSLP, each rule has one of four forms:

(1) f(A1, …, An),
(2) B(C),
(3) f(A1, …, Ai, y1, Ai+1, …, An),
(4) B(C(y1)).

Testing equality of TSLPs T1, T2 can easily be reduced to the string case by constructing SLPs P1, P2 for the strings dflr(val(Ti)). The size of Pi is O(m|Ti|), where m is the maximal rank of a nonterminal of Ti. TSLPs can also be used to compress unordered trees, by considering the tree val(G) as unordered. Isomorphism of unordered trees represented by TSLPs Ti can be solved in polynomial time (Lohrey et al. 2015), by computing TSLPs Ti' with val(Ti') = canon(val(Ti)). The canon of a tree t is obtained from t by ordering the subtrees of each node length-lexicographically with respect to their traversal string (according to some fixed order on Σ). For instance, canon(f(d(a), c(a), b)) = f(b, c(a), d(a)), if c is smaller than d. The result is extended to unrooted unordered trees by constructing TSLPs for rooted versions of the trees of the given TSLPs, rooted at the (at most two) center nodes of the trees.

Graph Grammars
There are two well-known notions of context-free graph grammars: node-replacement (NR) grammars and hyperedge-replacement (HR) grammars (Engelfriet 1997). Compression via straight-line HR grammars has been considered recently (Maneth and Peternek 2017).

Let Σ be an alphabet and Γ a ranked alphabet with rk(γ) ≥ 1 for every γ ∈ Γ. A (directed hyper)graph over (Σ, Γ) is a tuple (V, E, λ, att, ext), where V and E are finite sets of nodes and (hyper)edges, and λ maps each v ∈ V to Σ and each e ∈ E to Γ. The attachment mapping att maps each e ∈ E to a string over V of length n = rk(λ(e)). The string ext over V defines the sequence of external nodes. Typically, the hypergraphs we compress do not have external nodes (i.e., ext equals the empty string ε); however, in a graph g appearing in the right-hand side of a grammar rule, external nodes indicate how g is merged with the host graph (see below). The graph is of rank k if k = |ext|. A conventional (node- and edge-labeled) graph is a hypergraph where (i) each γ ∈ Γ has rank 2, (ii) if att(e) = v1 v2, then edge e is directed from v1 to v2, and (iii) ext = ε.

A straight-line hyperedge-replacement grammar (over (Σ, Γ)) (SLHR) is a triple (N, S, R), where N is a ranked alphabet of nonterminals; S ∈ N is the start nonterminal; for every X ∈ N, the rule R(X) is a hypergraph over (Σ, N ∪ Γ) of rank rk(X); and the DAG condition (see Definition 1 and the sentence below it) holds. Let e be an X-labeled edge with att(e) = v1 ⋯ vk. Then a rule R(X) is applied to e by removing e, inserting a disjoint copy of R(X), and then identifying the i-th external node of the copy of R(X) with the node vi. The external nodes of an
HR grammar have a similar role as the parameters of a context-free tree grammar.

Let g = (V, E, λ, att, ext) be a graph. The node size of g is |V|. The edge size of g is |{e ∈ E | rk(λ(e)) ≤ 2}| + ∑{rk(λ(e)) | rk(λ(e)) ≥ 3}. The size of g is the sum of its node and edge sizes. The size of an SLHR grammar is ∑_{X∈N} |R(X)|.

Consider the graph g in the right of Fig. 1. This graph is edge-labeled and has no node labels. The nodes are ordered according to the numbers shown in the nodes. The size of the graph is 69. The left of Fig. 1 shows the "derivation DAG" (cf. the DAG condition of Definition 1) of an SLHR grammar G representing g. A rule R(X) = h is written X → h, and simply h in case of the start nonterminal S. The size of G is 34 (17 nodes, 14 edges, and 1 hyperedge of size 3). Note that there is a smaller grammar, obtained by removing the nodes 2 to 5 from the start graph (and using A-hyperedges of rank 1). The external nodes are colored black, with their order according to the node numbers. Clearly, val(G) = g. The node numbers can be obtained by following a specific order of the nonterminal edges in each rule, namely, lexicographically w.r.t. the sequence of incident nodes. The node numbers of nodes 1 to 5 in R(S) stay as they are. The A-edge from node 1 to node 2 is derived first. This generates a new red node labeled 6 (the first free label). Next is the C-edge from 1 to 6, which creates the blue node 8. Next, the green nodes 9, 10, and 11 are generated by application of the B-rule.

(Figure not reproduced.) Grammar-Based Compression, Fig. 1: Derivation DAG of an SLHR grammar G together with the graph val(G) represented by G.

Grammar Compressors
RePair compression can be generalized to hypergraphs by defining a digram to be a graph […]
Given a regular expression α and two node numbers u, v, it can be determined in time O(|α||G|) whether there is a path from u to v that matches α. The existence of such nodes can be determined in the same time bound. Many problems concerning SLHR grammars are still open. For instance, for a given k, it is not known whether isomorphism of two graphs given by SLHRs in which each right-hand side has size ≤ k can be solved in polynomial time. The corresponding graphs are of bounded tree width, and for uncompressed graphs of bounded tree width, isomorphism is decidable in polynomial time (Bodlaender 1990).

Cross-References
Graph Data Models
Graph Pattern Matching
RDF Compression

References
Bannai H (2016) Grammar compression. In: Encyclopedia of algorithms. Springer, pp 861–866. https://doi.org/10.1007/978-1-4939-2864-4_635
Bille P, Landau GM, Raman R, Sadakane K, Satti SR, Weimann O (2015) Random access to grammar-compressed strings and trees. SIAM J Comput 44(3):513–539. https://doi.org/10.1137/130936889
Bodlaender HL (1990) Polynomial algorithms for graph isomorphism and chromatic index on partial k-trees. J Algorithms 11(4):631–643. https://doi.org/10.1016/0196-6774(90)90013-5
Casel K, Fernau H, Gaspers S, Gras B, Schmid ML (2016) On the complexity of grammar-based compression over fixed alphabets. In: Proceedings of the 43rd international colloquium on automata, languages, and programming, ICALP 2016, 11–15 July 2016, Rome, pp 122:1–122:14. https://doi.org/10.4230/LIPIcs.ICALP.2016.122
Charikar M, Lehman E, Liu D, Panigrahy R, Prabhakaran M, Sahai A, Shelat A (2005) The smallest grammar problem. IEEE Trans Inf Theory 51(7):2554–2576. https://doi.org/10.1109/TIT.2005.850116
Downey PJ, Sethi R, Tarjan RE (1980) Variations on the common subexpression problem. J ACM 27(4):758–771. http://doi.acm.org/10.1145/322217.322228
Engelfriet J (1997) Context-free graph grammars. In: Rozenberg G, Salomaa A (eds) Handbook of formal languages: beyond words, vol 3. Springer, Berlin/Heidelberg, pp 125–213. https://doi.org/10.1007/978-3-642-59126-6_3
Ershov AP (1958) On programming of arithmetic operations. Commun ACM 1(8):3–9
Ganardi M, Hucke D, Jez A, Lohrey M, Noeth E (2017) Constructing small tree grammars and small circuits for formulas. J Comput Syst Sci 86:136–158. https://doi.org/10.1016/j.jcss.2016.12.007
Hermelin D, Landau GM, Landau S, Weimann O (2009) A unified algorithm for accelerating edit-distance computation via text-compression. In: Proceedings of the 26th international symposium on theoretical aspects of computer science, STACS 2009, 26–28 Feb 2009, Freiburg, pp 529–540. https://doi.org/10.4230/LIPIcs.STACS.2009.1804
Jez A (2015) Faster fully compressed pattern matching by recompression. ACM Trans Algorithms 11(3):20:1–20:43. http://doi.acm.org/10.1145/2631920
Jez A, Lohrey M (2016) Approximation of smallest linear tree grammar. Inf Comput 251:215–251. https://doi.org/10.1016/j.ic.2016.09.007
Kieffer JC, Yang E (2000) Grammar-based codes: a new class of universal lossless source codes. IEEE Trans Inf Theory 46(3):737–754. https://doi.org/10.1109/18.841160
Kieffer JC, Yang E, Nelson GJ, Cosman PC (2000) Universal lossless compression via multilevel pattern matching. IEEE Trans Inf Theory 46(4):1227–1245. https://doi.org/10.1109/18.850665
Larsson NJ, Moffat A (1999) Offline dictionary-based compression. In: Data compression conference, DCC 1999, Snowbird, 29–31 Mar 1999, pp 296–305. https://doi.org/10.1109/DCC.1999.755679
Liu Q, Yang Y, Chen C, Bu J, Zhang Y, Ye X (2008) RNACompress: grammar-based compression and informational complexity measurement of RNA secondary structure. BMC Bioinform 9(1):176. https://doi.org/10.1186/1471-2105-9-176
Lohrey M (2012) Algorithmics on SLP-compressed strings: a survey. Groups Complex Cryptol 4(2):241–299. https://doi.org/10.1515/gcc-2012-0016
Lohrey M, Maneth S, Mennicke R (2013) XML tree structure compression using RePair. Inf Syst 38(8):1150–1167. https://doi.org/10.1016/j.is.2013.06.006
Lohrey M, Maneth S, Peternek F (2015) Compressed tree canonization. In: Proceedings of the 42nd international colloquium on automata, languages, and programming, ICALP 2015, part II, Kyoto, 6–10 July 2015, pp 337–349. https://doi.org/10.1007/978-3-662-47666-6_27
Maneth S, Peternek F (2017) Grammar-based graph compression. CoRR abs/1704.05254. http://arxiv.org/abs/1704.05254
Maneth S, Sebastian T (2010) Fast and tiny structural self-indexes for XML. CoRR abs/1012.5696. http://arxiv.org/abs/1012.5696
Nevill-Manning CG, Witten IH (1997) Identifying hierarchical structure in sequences: a linear-time algorithm. J Artif Intell Res 7:67–82. https://doi.org/10.1613/jair.374
Graph Benchmarking
between partitions (departments). Therefore, it is possible to keep each department on a machine to parallelize the workload. On the other hand, graph processing is interested in the relationships between entities and queries the relationship structure rather than individual entities. Therefore, when considering the relationship graph of employees, partitions (departments) are expected to be queried together.

The lack of locality in graph data and the need to run an algorithm for multiple iterations before convergence make data shuffling between partitions the major cost. Furthermore, due to the power-law property of many real graphs, each partition/machine needs to handle a different amount of workload. Most parallel systems try to minimize the data shuffling cost while maintaining a balanced workload. They take different approaches to do that. Therefore, benchmarking these systems should take into consideration several aspects, such as memory usage, CPU utilization, graph data, and so on.

This entry briefly introduces the readers to graph data, real graph properties, and synthetic graph generation. Then, it discusses common workloads for graph benchmarking. After that, it summarizes existing benchmarks. The last section discusses experiment design, metrics to be measured, and evaluating the parallel cost.

Graph Data

A graph G is defined as G = (V, E), where V is a set of vertices and E is a set of edges, such that |V| = N and |E| = M. For an edge e = (u, v, Pe), e ∈ E if v, u ∈ V, and Pe is a set of properties associated with edge e. A vertex v ∈ V also has a set of properties, Pv. Pe and Pv can be empty. If Pv ≠ ∅ or Pe ≠ ∅, the graph is known as a property graph. This definition is flexible enough to accommodate several graph types, such as directed and undirected graphs, weighted edges, and bipartite graphs, among others.

The type of graph used depends on the semantics of the application at hand. The web graph is a directed graph because each edge represents a hyperlink from one web page to another. Some social networks, such as Facebook, adopt the "friendship" relationship; hence, the friends graph in Facebook is undirected. However, Twitter adopts the "following" relationship; hence, it is represented as a directed graph. A recommendation system graph, between customers and products, is a bipartite graph, where V = V1 ∪ V2 such that V1 represents customers, V2 represents products, and edges between the two sets represent preferences.

There are several sources of real graph data in the literature, such as SNAP (Leskovec and Krevl 2014) and LWA (Boldi et al. 2011). Some projects are dedicated to preparing large real graphs; for example, Common Crawl has 64 billion edges. These real datasets are complementary to synthetic datasets. Synthetic datasets are equally important because they allow a benchmark to examine different properties of the same dataset. The main challenge in generating a synthetic dataset is to ensure that it represents the real world. The remainder of this section will discuss properties of real graphs and synthetic graph generators.

Real Graphs Properties
The problem of studying the patterns and properties of real web and social graphs has been addressed extensively in the literature (Leskovec et al. 2005b; Faloutsos et al. 1999; Chakrabarti et al. 2010). Table 1 shows a few real graph datasets (Boldi et al. 2004, 2011; Boldi and Vigna 2004; Leskovec and Krevl 2014) and their properties. These properties include graph size, average and maximum degrees, diameter, and the percentage of nodes included in the strongly connected component.

Social networks and web graphs are very large and have a high growth rate in terms of vertices and edges. They are sparse; the average vertex degree is significantly smaller than the number of vertices; however, the graph density increases as the graph grows. The diameter, which is the longest shortest path between any two vertices in the graph, is typically small and shrinks as the graph grows. Most of these graphs have one very large connected component and a few very small components that are usually ignored. The vertex
degree follows the power-law distribution with a very long tail.

Existing graph generators try to represent web and social graph properties. There are typically seven properties:

• Very large
• Sparse
• Small diameter
• One large connected component
• Community structure
• Power-law degree distribution
• Dynamic

Erdös-Renyi
The ER (Erdös and Rényi 1960) generation model can generate random graphs such that any possible graph with N vertices and M edges has the same probability. These random graphs can be large and sparse.

Watts-Strogatz
The WS (Watts and Strogatz 1998) model tries to add the small-world (short diameter) property and high clustering (one large component) to the generated random graphs. Instead of creating edges randomly as in the ER model, it has two parameters, β and K, such that:

M = NK/2,  N ≫ K ≫ ln(N) ≫ 1

Barabasi-Albert
The BA (Albert and Barabási 2002) model is based on the preferential attachment mechanism, which means that some value is distributed to
agents based on what they already have. This mechanism can generate graphs with a power-law distribution; the distributed value in this case is the number of edges.

The algorithm starts with an initial random graph G0 with N0 vertices. Every new vertex v added to the graph is connected to m < N0 vertices. The probability of edge (v, vi) is pi = di / (2M), such that di is the degree of vertex vi and M is the number of edges in the graph so far.

This approach can create a large, sparse, dynamic graph with a small diameter and a power-law distribution for the vertex degree, but it does not create graphs with community structure.

[…] and Özsu 2014), on the other hand, is built on top of RTG but is very flexible. It is capable of generating directed/undirected, weighted/unweighted, and bipartite/unipartite/multipartite graphs. RTG enhances the properties of the generated graphs by introducing some dynamic graph properties, such as shrinking diameter and densification. Structure preference was also introduced in RTG to match the real-life graph property of having community structures, meaning that nodes form groups and possibly groups of groups. These groups are usually created by connecting similar nodes together to form a community.
Kronecker Graphs
Kronecker graphs model (Leskovec et al. 2005a) Workloads
was proposed as an easy to analyze mathematical
model for graph generation. The model is based There are three main workload categories in
on the Kronecker product of an initial matrix graph applications: online queries, updates, and
which represents the adjacency matrix of an ini- iterative queries. Online queries are read only and
tial graph. The Kth power of the initial matrix interactive, usually do not have high complexity,
represents the generated graph. and do not require multiple computation
The Kronecker product of two matrices A iterations; the answer is expected to be computed
and B is matrix such that each value in position quickly. Updates are also interactive, but they
(i; j ) is a.i;j / B. The naive approach of fitting change the graph’s structure by changing its
this model to real graphs is expensive and needs parameters or its topology. Updates also include
exponential time, but the authors proposed an adding or removing a vertex or an edge. Iterative
efficient and scalable algorithm called KronFit. queries are heavy analytical queries, possibly
requiring multiple passes over the data, and
RTG (Random Typing Graphs) usually take a long time to complete. They may
Miller (1957) showed that typing random char- include some updates, but the changes to the
acters on a keyboard with only one character graph do not need to occur in real time.
identified as a separator between words can gen-
erate a list of words that follow a power-law Online Queries
distribution. RTG (Akoglu and Faloutsos 2009) Many graph applications include only online
builds on this by connecting each pair of con- queries such as a simple check of the availability
secutive words by an edge. The final result is a of customers’ relationship to a retail shop, an
graph whose node degrees follow the power-law update for the news feed in social media, etc.
distribution. Examples are:
Although Kronecker graph generator is able to
create graphs similar to real-world graphs and en-
ables easier analysis of the mathematical model, • Find a vertex or edge by its properties
it has some disadvantages such as generation • Report K-hop neighborhood of a query
of multinomial/lognormal degree distributions in- • Reachability or shortest path query between
stead of power-law and being not flexible. Water- two vertices
loo Graph Benchmark (WGB) generator (Ammar • Pattern matching query
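One of the online queries listed above, the K-hop neighborhood query, can be answered with a breadth-first search bounded at depth K. A minimal sketch over an adjacency-list graph (the toy graph below is invented for illustration):

```python
from collections import deque

def k_hop_neighborhood(adj, source, k):
    """Return all vertices reachable from `source` in at most k hops,
    using breadth-first search over an adjacency-list graph."""
    seen = {source: 0}          # vertex -> hop distance from source
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        if seen[u] == k:        # do not expand beyond k hops
            continue
        for v in adj.get(u, ()):
            if v not in seen:
                seen[v] = seen[u] + 1
                frontier.append(v)
    return set(seen) - {source}

# Toy graph: 1 -> 2 -> 3 -> 4
adj = {1: [2], 2: [3], 3: [4]}
print(k_hop_neighborhood(adj, 1, 2))  # {2, 3}
```

Because each vertex is visited at most once, the query touches only the small portion of the graph within K hops, which is why such queries are expected to be answered interactively.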
If the experiment includes multiple dimensions, then evaluating the possibilities of each dimension is an independent experiment. It is important to change one dimension at a time in order to draw meaningful conclusions. Below is a list of some example dimensions:

• Datasets
• Workloads
• Systems
• Number of cores in each machine (scale-up)
• Number of machines in the cluster (scale-out)

In the rest of this section, we discuss two important considerations during the design process of the benchmarking experiment.

Metrics Measures
A good benchmarking experiment should measure resource utilization and system performance. High resource usage is better if the system is expected to run in isolation, while low resource usage is better if the system is expected to run in an environment where systems share resources and run simultaneously. Nonetheless, evaluating system performance should be done in isolation, with no resource sharing across systems or across experiments.

Benchmarks should report the following measures:

when necessary. Ideally, the total response time should equal load + execute + save. However, it should be reported separately, because it represents the end-to-end processing time and occasionally includes some overhead that might not be explicitly reported by some systems, such as the time of repartitioning, networking, and synchronization.

Parallelization COST
Evaluating system scalability is very important. However, parallelization usually comes at a cost. Parallel algorithms, whether in multi-core or multi-machine environments, are usually more complicated. COST (McSherry et al. 2015) stands for Configuration that Outperforms a Single Thread and is used in the literature to evaluate the overhead of parallel algorithms and systems. The main idea is that parallel algorithms are often not optimal due to the special design considerations for parallel processing across machines, such as synchronization and communication overhead. The COST factor represents the response time of one thread divided by the response time of a parallel system. The baseline should be an efficient single-threaded implementation, such as the single-thread execution of the algorithms in the GAP Suite (Beamer et al. 2015) on a machine with large memory.
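The COST comparison described above can be sketched as a small post-processing step over measured runtimes; the timing numbers below are invented placeholders, not measurements of any real system:

```python
def cost_factor(single_thread_secs, parallel_secs):
    """COST factor as described above: response time of an efficient
    single-threaded baseline divided by the parallel system's response
    time. A factor below 1 means the parallel configuration is slower
    than the single thread."""
    return single_thread_secs / parallel_secs

# Hypothetical runtimes (seconds) for one algorithm, keyed by core count.
runtimes = {1: 95.0, 8: 120.0, 64: 40.0, 512: 12.0}
baseline = 30.0  # hypothetical optimized single-thread implementation

# Smallest configuration that outperforms the single thread, if any.
winners = sorted(c for c, t in runtimes.items() if t < baseline)
print(winners[0] if winners else None)  # 512
```

Note how the hypothetical system needs hundreds of cores before it beats the optimized single thread, which is exactly the kind of overhead a COST-style evaluation is meant to expose.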
References

Anderson MJ, Sundaram N, Satish N, Patwary MMA, Willke TL, Dubey P (2016) Graphpad: optimized graph primitives for parallel and distributed platforms. In: Proceedings of the 30th international parallel and distributed processing symposium, pp 313–322
Bader DA, Madduri K (2005) Design and implementation of the HPCS graph analysis benchmark on symmetric multiprocessors. In: International conference on high-performance computing, pp 465–476
Batarfi O, Shawi R, Fayoumi A, Nouri R, Beheshti SMR, Barnawi A, Sakr S (2015) Large scale graph processing systems: survey and an experimental evaluation. Clust Comput 18(3):1189–1213
Beamer S, Asanović K, Patterson D (2015) The GAP benchmark suite. arXiv preprint arXiv:1508.03619
Biggs N, Lloyd EK, Wilson RJ (1986) Graph theory 1736–1936. Clarendon Press, New York
Boldi P, Vigna S (2004) The webgraph framework I: compression techniques. In: Proceedings of the 13th international conference on world wide web, pp 595–601
Boldi P, Codenotti B, Santini M, Vigna S (2004) Ubicrawler: a scalable fully distributed web crawler. Softw Pract Exp 34(8):711–726
Boldi P, Rosa M, Santini M, Vigna S (2011) Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In: Proceedings of the 20th international conference on world wide web, pp 587–596
Chakrabarti D, Faloutsos C, McGlohon M (2010) Graph mining: laws and generators. In: Aggarwal CC (ed) Managing and mining graph data. Springer, pp 69–123
Ciglan M, Averbuch A, Hluchy L (2012) Benchmarking traversal operations over graph databases. In: Proceedings of the workshops of the 28th international conference on data engineering. IEEE, pp 186–189
Erdös P, Rényi A (1960) On the evolution of random graphs. In: Publication of the mathematical institute of the hungarian academy of sciences, pp 17–61
Erling O, Averbuch A, Larriba-Pey J, Chafi H, Gubichev A, Prat A, Pham MD, Boncz P (2015) The LDBC social network benchmark: interactive workload. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 619–630
Faloutsos M, Faloutsos P, Faloutsos C (1999) On power-law relationships of the internet topology. In: ACM SIGCOMM computer communication review, vol 29. ACM, pp 251–262
Han M, Daudjee K, Ammar K, Özsu MT, Wang X, Jin T (2014) An experimental comparison of pregel-like graph processing systems. Proc VLDB Endow 7(12):1047–1058
Hong S, Depner S, Manhardt T, Lugt JVD, Verstraaten M, Chafi H (2015) PGX.D: a fast distributed graph processing engine. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, pp 1–12
Iosup A, Hegeman T, Ngai WL, Heldens S, Prat-Pérez A, Manhardt T, Chafi H, Capotă M, Sundaram N, Anderson M et al (2016) LDBC Graphalytics: a benchmark for large-scale graph analysis on parallel and distributed platforms. Proc VLDB Endow 9(13):1317–1328
Leskovec J, Krevl A (2014) SNAP datasets: Stanford large network dataset collection. http://snap.stanford.edu/data
Leskovec J, Chakrabarti D, Kleinberg J, Faloutsos C (2005a) Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication. In: Proceedings of the 9th European conference on principles of data mining and knowledge discovery, vol 5, pp 133–145
Leskovec J, Kleinberg J, Faloutsos C (2005b) Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining, pp 177–187
Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM (2012) Distributed GraphLab: a framework for machine learning in the cloud. Proc VLDB Endow 5(8):716–727
Lu Y, Cheng J, Yan D, Wu H (2014) Large-scale distributed graph computing systems: an experimental evaluation. Proc VLDB Endow 8(3):281–292
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 135–146
McSherry F, Isard M, Murray DG (2015) Scalability! But at what COST? In: Proceedings of the 15th USENIX conference on hot topics in operating systems
Miller GA (1957) Some effects of intermittent silence. Am J Psychol 70(2):311–314
Murphy RC, Wheeler KB, Barrett BW, Ang JA (2010) Introducing the Graph 500. Cray Users Group (CUG)
Wang L, Zhan J, Luo C, Zhu Y, Yang Q, He Y, Gao W, Jia Z, Shi Y, Zhang S, Zheng C, Lu G, Zhan K, Li X, Qiu B (2014) BigDataBench: a big data benchmark suite from internet services. In: International symposium on high performance computer architecture, pp 488–499
Watts DJ, Strogatz SH (1998) Collective dynamics of 'small-world' networks. Nature 393(6684):440–442

Graph Centralities
See Graph Invariants

Graph Compression
See (Web/Social) Graph Compression
Graph Data Integration and Exchange 815
whose string satisfies a given regular expression.

• Two-way RPQs (2RPQs). Such queries enhance RPQs with the ability to traverse edges backward.
• Nested Regular Expressions (NREs). These extend 2RPQs with an existential (nesting) operator in the manner of XPath (XPathWorkingGroup 2016), allowing for branching/non-linear navigation.
• Conjunctive RPQs (CRPQs), which are n-ary queries consisting in a conjunction of RPQs.
• Conjunctive 2RPQs (C2RPQs), which are n-ary queries consisting in a conjunction of 2RPQs.
• Conjunctive NREs (CNREs), which are n-ary queries consisting in a conjunction of NREs.

CNREs are an extension of both NREs and C2RPQs, and as such they provide the most general class of queries among the above.

Graph Schema Mappings
Schema mappings are essential ingredients of any data exchange and data integration scenario, since they embody the logical rules relating source and target (respectively, global) schemas. Such rules typically express query containment or equivalence, for queries expressed in the appropriate language; accordingly, for graph databases, rules may feature any class of graph queries, including the six categories listed above.

A pair of source and target graph databases (GS, GT) satisfies a source-to-target dependency φS(x̄) → φT(x̄) iff φS(GS) ⊆ φT(GT). A pair of source and target graph databases (GS, GT) satisfies a source-to-target mapping Mst, denoted by (GS, GT) ⊨ Mst, iff (GS, GT) satisfies all the source-to-target dependencies in Mst.

Graph data integration mappings. In a graph data integration setting Ω = ({ΣSi}, ΣG, M) with source schemas ΣSi and global schema ΣG, rules in the mapping M may state query containment of a source query into a global query (sound sources (Lenzerini 2002)), in which case one obtains mappings similar to data exchange. However, data integration scenarios may also feature rules stating containment of a global query into a source query (i.e., φ(x̄) → ψ(x̄) with φ over ΣG and ψ over ΣSi), or stating equivalence. Rule and mapping satisfiability semantics are similar to data exchange.

LAV and GAV mappings. Two classes of mappings are of special interest in data integration and data exchange:

• Local As View (LAV) mappings are mappings where in all rules the source query is atomic, i.e., consists in a single symbol of ΣS (respectively, ΣSi).
• Global As View (GAV) mappings are mappings where in all rules the target (respectively, global) query is atomic, i.e., consists in a single symbol of ΣT (respectively, ΣG).
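Among the query classes above, plain RPQs are the simplest to evaluate; for a word RPQ (a regular expression that is just a fixed label sequence, a case also discussed later in this entry), evaluation reduces to following the labels hop by hop. A minimal sketch, over an invented toy graph:

```python
def eval_word_rpq(edges, word):
    """Evaluate a word RPQ: return all (start, end) node pairs connected
    by a path whose edge-label sequence equals `word`.
    `edges` is a set of (source, label, target) triples."""
    # Pairs reachable by the empty prefix: every node to itself.
    nodes = {u for u, _, v in edges} | {v for u, _, v in edges}
    pairs = {(n, n) for n in nodes}
    # Extend every partial path by one labeled edge per word position.
    for label in word:
        pairs = {(s, v) for (s, u) in pairs
                 for (x, l, v) in edges if x == u and l == label}
    return pairs

g = {("a", "knows", "b"), ("b", "knows", "c"), ("b", "likes", "c")}
print(eval_word_rpq(g, ["knows", "knows"]))  # {('a', 'c')}
```

General RPQs with operators such as Kleene star are evaluated analogously, by pairing graph nodes with the states of an automaton for the expression instead of with word positions.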
Due to the semantics of mappings, there may be infinitely many solutions for a given source graph database and data exchange setting. Work on data exchange has therefore placed a focus on the computation of a so-called universal solution: a database with incomplete information which intuitively "represents all possible solutions." In graph data exchange, Barceló et al. (2013) have introduced a similar concept, namely, the universal representatives. These have been defined as graph patterns (Barceló et al. 2011) where the set of nodes N comprises both node ids and labeled nulls, and edges are labeled with expressions without variables over a schema Σ. Such graph patterns model the incompleteness of universal representatives in two ways: unknown values for nodes (via the labeled nulls) and imprecise relations between the nodes (as witnessed by the expressions on edges).

A homomorphism from a graph pattern as above to the graph database G = (V, E) is a mapping h : N → V such that (i) h is the identity over N ∩ V and (ii) for each edge of the pattern of the form (u, exp, v), (h(u), h(v)) ∈ Qexp(G), where Qexp(x, y) = (x, exp, y). We denote by RepΣ the set of all graph databases G over Σ such that there exists a homomorphism from the pattern to G. Given a graph data exchange setting Ω = (ΣS, ΣT, Mst) and a graph database GS over ΣS, Barceló et al. (2013) then formally define universal representatives as follows: a graph pattern over ΣT is a universal representative for GS under Ω iff RepΣT = SolΩ(GS).

2. Query answering. This problem is of interest for both data exchange and data integration. In data exchange, we are interested in answering queries over the target schema, whereas in data integration, queries are stated over the global schema. The semantics adopted for both settings are those of certain answers, that is, the intersection of the result sets of the query Q for all data exchange solutions, respectively, all possible global instances compliant with the source instances and the mapping. The query answering problem then consists of deciding, given a query Q over the target (respectively, global) schema, whether a given tuple of node ids belongs to the certain answers of Q. As above, complexity analysis can be carried out in terms of both data complexity (the input consists in the source database(s)) and combined complexity (including the schemas, the mapping, and the query). Oftentimes, the query complexity is also considered.

Query answering is closely related to universal representatives, since it is often the case that the certain answers for a query can be efficiently obtained by evaluating the query over such representatives. Another important related topic is that of query rewriting, which for data exchange and data integration typically involves reformulating a target (or global) query into a query expressed over the source(s).

Key Research Findings
RPQs, 2RPQs, and C2RPQs, regarding both data and combined complexity. Query answering under sound views in particular, corresponding to a typical LAV setting, is shown to be coNP-complete in data complexity and PSPACE-complete in combined complexity for RPQ queries over RPQ views; these bounds extend to 2RPQs, but the problem reaches EXPSPACE-completeness in combined complexity for the broader class of C2RPQs. A subclass of C2RPQs, called T2RPQs, is however identified, where the query body is tree-shaped; for such queries over C2RPQ views, query answering is shown to be PSPACE-complete.

With the work of Calvanese et al. (2012), the authors further go beyond view-based query processing and LAV/GAV mappings, to address sound (i.e., source-to-target only) GLAV mappings, though in a setting restricted to RPQ mappings and RPQ queries. Similar to the LAV results, the paper shows that in such a setting query answering is coNP-complete in data complexity and PSPACE-complete in combined complexity, whereas deciding whether a non-empty rewriting exists is EXPSPACE-complete. In addition, the authors show that deciding whether a rewriting of a target query is perfect, i.e., it always computes the certain answers when evaluated over the source, is NEXPTIME-complete.

Graph Data Exchange
Barceló et al. (2013) lay the foundations of graph data exchange by mainly focusing on the complexity of query answering and universal representative computation. They also investigate the means of obtaining tractable procedures for solving the above problems of interest.

Barceló et al. (2013)'s study considers the general CNRE class both for queries asked over the target (i.e., those for which certain answers are required) and for queries involved in the mappings. The paper is the first to define universal representatives for graph data exchange, corresponding to graph patterns with NRE-labeled edges. Furthermore, the authors show how the chase-based procedure used to build universal solutions in the relational setting (Fagin et al. 2005) can be directly adapted to the graph setting, with data complexity in NLOGSPACE and combined complexity in PSPACE.

As argued by the authors, however, focusing on data complexity is insufficient for graph settings, in particular due to the very large instances typically involved (e.g., social networks, scientific databases). Since the hardness results for combined complexity are a consequence of the hardness of CNRE evaluation, the paper further investigates the restriction of source queries in mappings, from CNREs to the simpler class of NREs. Such NRE-restricted mappings comprise only rules of the form (x, exp, y) → ψT(x, y), with ψT a binary CNRE over ΣT. As shown in the paper, in the presence of NRE-restricted mappings, the universal representative computation becomes tractable: its combined complexity is O(|GS|² · |Ω|). Similar bounds are shown to hold for acyclic source CNREs.

Concerning query answering, the authors start by emphasizing the difficulty of the certain answers problem. Indeed, while the problem has been shown to be PSPACE-complete in combined complexity for RPQ queries and mappings, it is shown to be EXPSPACE-complete for CNRE queries and mappings. On the other hand, going from RPQs to CNREs keeps data complexity in coNP, but the authors refine the lower bound by exhibiting a setting featuring RPQ mappings and a "word" RPQ query, for which the problem is shown to be coNP-complete. Since a word RPQ query can be translated into relational conjunctive queries, such a setting shows how query answering in graph data exchange importantly diverges from the relational setting, where computing certain answers to conjunctive queries is known to be tractable in data complexity (Fagin et al. 2005). The hardness results shown also point toward the insufficiency of NRE-restricted mappings for achieving feasible query answering.

The pursuit of tractability then leads the authors to examine further restrictions on the mappings and the queries. A new restriction on the target queries in the mappings leads to the definition of rigid mappings, where the right-hand side does not employ disjunction or recursion. Such mappings in turn are useful for achieving tractability in several cases. For an NRE query exp and mappings that are both NRE-restricted and rigid, the combined complexity of query answering is indeed shown to be O(|GS|² · |Ω| · |exp|), by relying on the universal representative computation. If one further restricts the setting to mappings that are NRE-restricted and GAV, the combined complexity for NRE query answering becomes linear, i.e., O(|GS| · |Ω| · |exp|), this time by employing query rewriting techniques. Moreover, if one only examines data complexity, in the case of CNREs and rigid mappings, query answering can be solved in polynomial time.

Last but not least, the authors initiate the investigation of tractability in data complexity for answering tame CRPQs, which are CRPQs that do not employ any Kleene star, over mappings that do not have existential quantifiers on the right-hand side (e.g., NRE mappings), thus leading to universal representatives without labeled nulls. For such queries and mappings, the authors derive two interesting conditions under which query answering becomes tractable and further prove the maximality of the shown classes w.r.t. tractability.

Target Constraints
A typical relational data exchange setting (Fagin et al. 2005) comprises, in addition to the source and target schemas and the source-to-target mapping, a set of rules applicable to the target database, called target constraints. Contrary to the relational setting, in which target constraints were essential components of a scenario, these constraints have not been included in the foundational definitions of Barceló et al. (2013), and have been so far poorly adopted in graph data exchange.

Borrowing from the relational setting (Beeri and Vardi 1984; Fagin et al. 2005), one may consider graph data exchange scenarios comprising two classes of target constraints, namely:

• target tuple-generating dependencies (or simply t-tgds), i.e., φT(x̄) → ψT(x̄), with φT and ψT being graph queries of matching arity over the target schema ΣT
• target equality-generating dependencies (or simply t-egds), i.e., φT(x̄) → (x1 = x2), with φT being a query over the target schema ΣT and x1, x2 among the variables of x̄.

Target tgds have the same query containment semantics as s-t tgds, whereas target egds express equalities among node ids, and as such intuitively involve merging nodes in the data exchange solution. As is the case in the relational setting, target constraints have an important influence upon the existence of a data exchange solution.

Boneva et al. (2015) have added limited classes of target constraints to the setting of Barceló et al. (2013) by focusing on t-egds and limited forms of t-tgds. The paper studies the complexity of the existence of solutions and of query answering under the assumption that the source schema and instance are fixed (query complexity). Another hypothesis made in their work is to be able to export relational tuples from the source rather than graph patterns. The results are quite negative, in the sense that both the existence of solutions and query answering in the presence of arbitrary t-egds and t-tgds are shown to be NP-hard and coNP-hard, respectively. Because of this intractability and the difficulty of equating two constants in the case of t-egds, the authors then focus on a simpler class of constraints. These are called same-as tgds, and they correspond to full GAV t-tgds in which the right-hand side contains a unique binary atom sameAs. Adding these constraints makes the existence of solutions trivial, as it suffices to compute the solutions with s-t tgds as in Barceló et al. (2013) and then add the corresponding tuples to the sameAs relation. However, even with this limited fragment of target constraints, query answering stays coNP-hard.

Mapping Management for Data Graphs
The recent work of Francis and Libkin (2017) enriches the landscape of graph mapping management studies by considering data graphs. These
are edge-labeled graphs where furthermore vertices are adorned with data values. Formally, letting V be a countably infinite set of node ids, and D a countably infinite set of values, a node in a data graph is a pair (v, d) ∈ V × D. A data graph is then of the form G = (V, E) with V ⊆ V × D and E ⊆ V × Σ × V.

Data graphs are very suitable for capturing property graphs, which are the models employed by some of the existing graph database management systems, such as Neo4j (NeoTechnology 2017). As shown by the authors, however, adding data to the picture makes query answering sensibly harder than in the standard (i.e., without data adornment) graph database case.

Francis and Libkin (2017) consider mappings based on RPQs and in turn investigate query answering for data RPQs, which were defined in Libkin et al. (2015, 2016). These are extensions of RPQs with regular expressions that mix alphabet letters (i.e., edge labels) and data values, in other words, queries that mix navigational properties and data. The three main classes of data RPQs under scrutiny are (i) data path queries, (ii) regular expressions with equality (REEs, or equality RPQs), and (iii) regular expressions with memory (REMs, or memory RPQs).

A first strong result of the paper shows that query answering is in general undecidable in data complexity for data RPQs (with a notable exception for the data path queries and REEs without inequalities, for which it is in coNP). The undecidability is moreover shown to hold even in a simple setting comprising a relational/reachability LAV mapping and an equality RPQ.

As identified by the authors, the issue comes precisely from the reachability target queries and thus suggests that in order to recover decidability one needs to focus on mappings where the target queries exclude any form of recursive navigation and only consist in word RPQs. For such mappings, called "relational mappings", query answering becomes coNP in data complexity and PSPACE in combined complexity. Moreover, the problem becomes NLOGSPACE in data complexity for relational mappings and data path queries comprising at most one subexpression of the form c≠, retrieving paths whose extremities have different data values.

Pursuing the search for tractability while restricting their focus to relational mappings, the authors investigate two very interesting alternatives for efficient query answering for data RPQs, relying on universal- and canonical-like solutions. For restrictions of REMs and REEs to equalities (no inequalities allowed), the authors show that query answering can be solved by relying on the construction of so-called least informative solutions, which employ labeled nulls for both node ids and data values, in a manner similar to standard relational data exchange canonical universal solutions. Accordingly, data complexity for query answering becomes NLOGSPACE for all the considered queries, whereas combined complexity becomes PSPACE for restricted REMs and PTIME for restricted REEs. A very interesting take further concerns data RPQs in general, without restrictions. It involves providing an under-approximation of a query's result set by relying on universal solutions that employ SQL-like nulls (null values for which all equalities and inequalities evaluate to false) for the data on nodes. Query answering over such a null-based universal solution is shown to be NLOGSPACE in data complexity, and PSPACE (respectively, PTIME) in combined complexity for REMs (respectively, REEs).

Future Directions for Research

Despite the early work on graph data integration and exchange surveyed in this entry, several open problems can be identified and might deserve the attention of our community in the near future. In particular, concerning the coverage of the characteristics of the data exchange setting, one missing component is certainly embodied by target constraints, which have been poorly exploited so far. The definition of universal representatives as in Barceló et al. (2013) cannot be easily adopted in the case of target constraints (Boneva et al. 2015). Indeed, even for full GAV
for both, RDF and PGM, by exposing data model-specific query languages and transformations between them (Oracle 2017a; Amazon Neptune 2017).

GDBMSs can be broadly classified into native and nonnative systems. Often GDBMSs provide different degrees of nativeness by choosing carefully which core system components should be graph-aware. A GDBMS is considered to be native if it (1) provides fast data access—often referred to as index-free adjacency access—(2) and a graph-specific query runtime, including graph-specific query optimization and processing strategies, (3) and if it exposes a graph data model, such as the RDF or the property graph data model, and corresponding access methods or query languages directly to the user.

Fully native GDBMSs implement graph-aware components for all three core components (Neo4j 2017; TigerGraph 2017; MemGraph 2017; Sevenich et al. 2016; Amazon Neptune 2017; Tanase et al. 2014; Cayley 2017; Datastax 2017). In contrast, partially native GDBMSs provide graph-aware implementations for selected components, but typically lack a deep integration of graph-specific components within the core database engine. Most RDBMSs are only to some extent graph-aware, most commonly by exposing the graph data model through some user interface to the end user (Oracle 2017a; Rudolf et al. 2013; Simmen et al. 2014; Sun et al. 2015; Microsoft 2017; AgensGraph 2017). Moreover, partially native GDBMSs include multi-model databases, which combine different data models within a single system (OrientDB 2017; ArangoDB 2017).

Each of the GDBMSs is tailored to specific graph workloads, which are the key drivers for the system design and the level of graph awareness. In the following, we discuss two major graph workloads that are predominant in realistic graph use cases.

Operational Graph Processing The ability to handle graph topologies where the structural shape of vertices and edges evolves over time is a fundamental requirement of an operational GDBMS. Structural changes describe modifications to the structure of a single vertex/edge, i.e., the addition, manipulation, and removal of attributes on vertices and edges. Especially use cases that target social networks and communication networks demand a flexible and scalable graph persistence that can cope with frequent updates to vertex/edge attributes as well as changes to the graph topology. By topological changes we refer to modifications of the graph topology, i.e., the addition and removal of vertices and edges. In addition to the ability to modify the graph, an operational GDBMS also must provide transactional guarantees comprising atomicity, consistency, isolation, and durability. A query workload in such an operational scenario often performs short-running queries accessing only a small portion of the graph. This often includes graph pattern matching queries, which aim at finding query embeddings in the data graph.

Graph Analytics Graph analytics workloads often consist of tasks that are computationally intensive and typically take seconds, minutes, or sometimes hours to complete on large graphs. This is in contrast to operational graph processing, where queries are typically short-running and often consume only little time to complete. To target these computationally intensive and mostly read-only analytical workloads, specialized systems have been created that provide efficient data structures for storing relatively static graphs and that typically process graph algorithms or graph queries in a parallel and/or distributed fashion. Because such systems do not always provide strong transactional guarantees, they are typically called "graph analytics toolkits" rather than graph databases.

Storage Representations

Depending on their level of graph awareness, GDBMSs implement different storage representations. Broadly speaking, the design space can be divided into native and nonnative graph representations.

Native, Read-Only/Read-Mostly Immutable, native graph representations tailored for
read-mostly workloads are often implemented using a CSR data structure (Saad 2003; Brameller et al. 1976). A CSR data structure has the advantage that it provides high temporal and spatial locality when accessing the neighborhood of a vertex. This is often observed for in-memory GDBMSs, such as ORACLE PGX (Sevenich et al. 2016; Hong et al. 2015) and SYSTEM-G (Tanase et al. 2014). If the graph is too large to fit into the main memory of a single machine, suitable alternatives are processing the graph on cheaper yet fast SSDs, applying compression techniques to the graph data, or sharding the graph data across multiple machines in a cluster (Tanase et al. 2014). Updates to the CSR data structure can be applied either in batches (Haubenschild et al. 2016) or by snapshotting the CSR data structure (Macko et al. 2015).

Native, Transactional  NEO4J is the original, prevalent system for native, transactional processing of property graphs. Its graph representation is based on the notion of index-free adjacency, which provides O(1) cost for traversing a relationship and has been implemented using a proprietary set of record files. It uses direct offset pointers for navigational data access to avoid the additional lookup costs of tree data structures. The record files utilize additional pointers to reduce the overhead of traversing dense nodes as well as to generally minimize the number of records that need to be accessed during a graph traversal. The main storage is complemented by additional indexes for direct by-value lookup and full-text search.

Relational, Read-Only  The OPENCYPHER project recently released CYPHER FOR APACHE SPARK (openCypher implementers group 2017a), a new implementation of OPENCYPHER that optimizes and compiles queries for execution by the relational Catalyst engine of APACHE SPARK SQL. Graphs are represented as sets of disjoint scan tables for nodes and relationships with a given label. Using the immutable and lazily materialized tables of APACHE SPARK naturally provides snapshotting, partial caching, and partitioning capabilities. This graph representation relies on the use of a graph schema, since the scan tables themselves are expected to be statically typed by APACHE SPARK.

Relational, Transactional  A graph can be seamlessly represented as a set of vertex/edge tables, where each record in a vertex table corresponds to a single vertex and each record in an edge table represents a single edge (Rudolf et al. 2013; Simmen et al. 2014; Oracle 2017a). Each vertex and edge attribute is mapped to a single column in the corresponding vertex and edge table, respectively. For graphs without or with only few vertex or edge attributes, a database consisting of two universal tables—one for vertices and one for edges—is a commonly used physical representation. Recently, non-relational extensions of RDBMSs have been applied to encode graph topology information (Microsoft 2017) and vertex/edge properties using JSON document storage functionality (Sun et al. 2015). Similarly, RDF GDBMSs often rely on relational storage layouts to represent RDF triples through universal tables (Erling 2012) and by using exhaustive, relational indexing techniques (Neumann and Weikum 2008).

Query Languages

Graphs are queried using both declarative and imperative languages. Declarative languages are used by operational DBMSs due to their ease of use and the ability to rely on a vast body of research for query optimization, while imperative languages are used for graph analytics workloads due to their versatility and the high level of manual control they provide over query execution.

Declarative QLs
Graph pattern matching, i.e., finding and filtering instances of connected nodes and relationships, is the central functionality of declarative graph query languages. Pattern matching may be expressed using visual "Ascii-Art" graph patterns as pioneered by OPENCYPHER or via explicit recursion. Typically, the connections in a pattern are described in terms of a regular expression
over the labels along a matched path. This is called a regular path query (RPQ).

Languages use different pattern matching semantics: Homomorphic matching returns all matches but may lead to infinite results. Isomorphic matching prevents this by disallowing repeated nodes or relationships per match but is computationally intractable for certain patterns. Single shortest path matching is tractable and never leads to infinite results but is non-deterministic.

Languages additionally provide fundamental relational capabilities, such as aggregation, slicing, or sorting of results.

OPENCYPHER 9 (Francis et al. 2017; openCypher implementers group 2017b) is the language of the OPENCYPHER community. Originally developed by NEO4J as a more high-level alternative to GREMLIN, it is implemented by multiple vendors (SAP HANA, REDIS, AGENSGRAPH for Postgres, CYPHER FOR APACHE SPARK), as well as research projects. OPENCYPHER provides graph pattern matching using isomorphism, relational capabilities (including tabular query composition, basic subqueries, query parameters, and calling external algorithms), single and all shortest path matching and all path enumeration using first-class path values, as well as both an update and a schema definition language. The current OPENCYPHER 9 supports no alternation between non-atomic labels. However, full RPQs and path patterns have been specified as a separate language extension. The next version of the language, OPENCYPHER 10, is currently being developed with support for a richer type system, more powerful subqueries, and graph query composition for passing graphs between queries and views.

PGQL (Property Graph Query Language) (van Rest et al. 2016; Oracle 2017b) is a pattern matching query language developed at ORACLE and is closely aligned to SQL. Its latest version (Oracle 2017b) has powerful regular path expressions with arbitrary filter predicates on vertices and edges along paths. Its pattern matching semantics is based on homomorphism, and the language provides basic relational support (ORDER BY, SELECT, etc.).

SQL SERVER provides an extension to SQL that allows for creating a graph by turning existing relational tables into vertex or edge tables and provides graph querying support through specification of fixed-size graph patterns (Microsoft 2017).

SPARQL (W3C 2013) is the declarative query language used for the RDF data model. It provides pattern matching and basic relational capabilities, as well as constructing and returning graphs.

G-CORE (Angles et al. 2017) is a language designed as a community effort between industry and academia, meant to guide the evolution of existing and future graph query languages. G-CORE is composable, meaning that graphs are the input and the output of queries, and treats paths as first-class citizens.

Traversals form the foundation of GREMLIN (Rodriguez 2015), a language that organizes graph computation as a series of steps for pattern matching, sorting, aggregating, and data flow logic, as well as a step for basic pattern matching and even graph computation. GREMLIN was originally closely tied to the Groovy programming language. Recent versions compile to a bytecode that is shared between implementations. However, real-world GREMLIN often also contains inline logic written in the client programming language.

Imperative QLs
Imperative languages have been the dominant paradigm for analytical graph processing in high-performance computing but also in handwritten applications (e.g., the Boost Graph Library for C++). Additionally, many classical graph algorithms were originally developed in an imperative setting, and therefore, many systems have chosen an imperative query language to enable applications to implement such algorithms.

Low Level
Low-level, imperative query languages expose an imperative API to directly access the graph in a fine-granular manner.

Vertex-centric systems such as PREGEL (Malewicz et al. 2010) and SIGNAL/COLLECT (Stutz et al. 2010) rely on a vertex-
centric message-passing abstraction to perform massively parallel asynchronous or synchronous graph computations. Each vertex executes a user-provided compute() function (the query), which receives messages from neighbors, updates internal state, and sends messages to neighbors as well as to result collectors to advance the computation until the result converges.

Imperative APIs like BLUEPRINTS (originated at NEO4J) enable programs written in an imperative, general-purpose programming language to directly access the graph. A key development of such APIs is the notion of a traversal: a semi-declarative description of how to search the graph starting from a fixed vertex using different search strategies (BFS, DFS) and user-provided filtering predicates.

High Level
GRAPHSCRIPT (Paradies et al. 2017) is a domain-specific, read-only graph query language tailored to serve advanced graph analysis tasks and to ease the specification of custom, complex graph algorithms. GRAPHSCRIPT is statement-based and provides high-level, graph-specific language constructs to formulate graph algorithms in a user-friendly and intuitive manner while offering competitive execution performance compared to manually tuned graph algorithms in a low-level programming language. It facilitates an imperative programming model with control flow elements and graph querying operations.

GREEN-MARL (Hong et al. 2012, 2013), currently being developed at ORACLE LABS, is among the first domain-specific graph query languages targeting graph analytics. The language has graph, vertex, and edge types as first-class citizens and exposes high-level constructs for graph traversal and iteration, such as breadth-first search (BFS), depth-first search (DFS), and incoming and outgoing neighbor iteration.

Graph Processing

On a high level, current graph processing approaches fall into either the category of "native graph processing," in which systems are built from scratch and optimized specifically for graph processing, or "relational graph processing," in which existing RDBMSs or tools are repurposed and extended for graph processing.

Native Graph Processing
Besides the storage aspect, as discussed before, there are three other important aspects to consider when designing a native GDMS.

Compilation Strategies  For processing of graph queries (e.g., SPARQL, OPENCYPHER, or PGQL queries), the typical approach of GDMSs is that of cost-based query optimization, which is similar to the approach in traditional RDBMSs. Dynamic programming is used to explore the plan space and to estimate a cost for the various possible plans using a set of heuristics, precomputed data statistics, or dynamic data sampling. However, graph queries impose additional challenges since they are often recursive in nature, and estimation errors grow even faster than for relational queries, which typically only have a fixed number of joins. There are systems (e.g., APACHE SPARK) that are adopting mid-query re-optimization to overcome such problems.

For processing of graph algorithms (e.g., GRAPHSCRIPT or GREEN-MARL procedures), the typical approach is that of code generation rather than interpretation, because the overhead of compiling generated code is typically negligible, as graph analytics algorithms are often computationally intensive and have long-living executions. In ORACLE PGX, GREEN-MARL procedures are translated into optimized and parallel JAVA or C++ programs through code generation. In SAP HANA, GRAPHSCRIPT procedures are directly compiled into parallelized, low-level, LLVM-based executable code and cached for later usage. GRAPHSCRIPT generates code which directly accesses internal data structures and APIs, thereby eliminating function calls and property-type interpretations. NEO4J provides the OPENCYPHER language for querying data as well as a programmatic API for imperative graph processing. OPENCYPHER queries are translated into an operator tree that is
a mix of relational and graph-specific operators for graph traversal and that is executed using both interpretation and compilation to Java bytecode.

In-Memory Processing  To avoid disk I/O being the bottleneck in graph processing, graphs are often fully or partially loaded into memory. This is especially meaningful for the structural data of the graph, as edge traversal is a crucial operation in graph processing. Nonstructural data, such as labels and properties on vertices or edges, may be partially loaded into memory, depending on available space. In ORACLE PGX, graphs are loaded into memory in their entirety, and scaling out (i.e., partitioning the graph over multiple machines) is necessary when graphs are too large to fit into the memory of a single machine. SAP HANA allows fully or partially loading vertex and edge tables on a column level into memory. Lightweight compression techniques aim at keeping the memory footprint low. For GRAPHSCRIPT, static code analysis is used to determine the accessed vertex/edge properties and to load only the corresponding subgraph into memory. In NEO4J, graphs are primarily stored on disk and mapped into memory using a custom page cache, which is optimized for concurrent, low-latency, fast random-access graph traversal. In practice, the majority of the graph working set of a database cluster machine usually resides in memory.

Parallel Processing  Parallel processing of graphs is necessary to take full advantage of the available multi-threading capabilities of modern computing architectures. In transactional graph workloads, there are often large numbers of short-living tasks, so multiple tasks are executed concurrently. In analytical graph workloads, however, there are typically fewer but heavier, longer-living tasks, so the typical strategy is that of intra-task parallelization rather than inter-task parallelization. In NEO4J, OLTP graph data is loaded into separate, specialized in-memory data structures for efficient parallel graph analytical processing. Traversals in graph queries or algorithms can naturally be processed in a parallel breadth-first search manner, such that at each step, the available threads are given an equal number of partial solutions, and threads synchronize at the end of each step to produce a new set of partial solutions for the next step (Raman et al. 2014; Hong et al. 2012). Although easily parallelizable, the breadth-first search approach may suffer from a large memory footprint, which can become an issue in graph query processing, as the number of partial solutions can grow exponentially with the size of the graph. Therefore, depth-first search processing, which can be parallelized through work-stealing techniques, may provide a suitable alternative (Roth et al. 2017).

Indexing Techniques  Graphs often have labels on vertices and edges for indicating types (e.g., "Person" or "knows") and also have properties on vertices or edges for providing unique identifiers (e.g., an "email" or "ISBN" property). Because these labels and identifiers are frequently used in filter predicates of graph queries or graph algorithms, they require specialized indexes to speed up processing. For example, in ORACLE PGX and NEO4J, secondary indexes can be created, which retrieve vertices with a particular label in constant time. Furthermore, there is an index for mapping a vertex identifier to the respective vertex, again in constant time. Both indexes can help avoid scanning over the entire set of vertices. NEO4J always maintains indexes for retrieving vertices with certain labels. Furthermore, NEO4J allows the creation of indexes on properties of nodes with a given label, as well as additional index structures for full-text indexing and for globally indexing all properties.

In addition to the primary data location as relational tables, transient secondary index structures, such as adjacency lists or CSR, can be used to accelerate compute-intensive graph analytics tasks by creating them prior to query execution (Hauck et al. 2015). Since the majority of complex graph algorithms and workflows perform heavyweight, long-running computations on the entire graph, the overhead of the index construction is outweighed by the actual execution time of the task.

Besides indexes that are created ahead of query execution, there are also indexes that
are created during query execution. Some indexing techniques are based on "supernodes" (or "landmarks"), which are vertices with a disproportionate number of incoming or outgoing edges and are common in real-world graphs. Since the supernodes are well connected, they are traversed many times, and computations are cached to avoid expensive recomputations (Valstar et al. 2017).

Relational Graph Processing
Naturally, existing RDBMSs are being extended to process graph data. SAP HANA is the first full-fledged RDBMS that tightly integrates graph processing into the database kernel, allowing relational and graph operations to be seamlessly combined within the same database engine (Rudolf et al. 2013; Paradies et al. 2015). OPENCYPHER queries are translated into relational operators, while for processing of GRAPHSCRIPT, there is an additional set of graph-specific operators, such as graph traversal, shortest path computation, and subgraph induction, that are not mapped to relational operators but are instead implemented as native graph operators in the RDBMS execution engine (Paradies et al. 2017).

SQLGRAPH (Sun et al. 2015) translates GREMLIN into a combination of pure SQL functions, user-defined functions, common table expressions, and stored procedures. The latter are only used for graph updates without side effects and for recursive traversal queries without an explicit recursion depth boundary.

TERADATA ASTER (Simmen et al. 2014), a large-scale analytics platform, provides an iterative, vertex-oriented, JAVA-based programming abstraction, which can be used to implement bulk-synchronous graph algorithms. Graph analytics functions, both pre-built and user-implemented ones, are modeled as polymorphic table operators that can be invoked directly from SQL.

VIRTUOSO (Erling 2012) is a hybrid columnar relational/graph DBMS with a SPARQL interface. Internally, VIRTUOSO translates SPARQL queries into SQL with graph-specific extensions, such as the computation of the transitive closure using a dedicated operator.

CYPHER FOR APACHE SPARK (openCypher implementers group 2017a) is a new implementation of Cypher atop the relational Catalyst engine of APACHE SPARK. CYPHER FOR APACHE SPARK is noteworthy in providing experimental support for graph query composition and working with multiple graphs via proposed OPENCYPHER 10 language extensions (openCypher implementers group 2017b). Graphs are directly transformed into new graphs by transforming the underlying immutable scan tables.

Discussion

GDMSs are very new, and various advances in storage representation, query interfaces, and processing techniques are to be expected. In particular, we see four trends that will significantly influence the design and implementation choices of upcoming single-node graph processing systems. (1) We observe an increasing interest in the use of high-level graph query languages for graph analysis. (2) We see a convergence toward hybrid transactional/analytic graph processing, which combines graph analysis as performed by highly tuned graph processing systems with transactional graph processing as provided by GDMSs or traditional RDBMSs. (3) We observe that existing systems are moving toward becoming multi-model systems (relational, RDF, PGM), as evidenced by GDMSs adopting relational features and RDBMSs being extended with graph capabilities. This is supported by convergence on a common definition of the property graph model between different languages, which is closely related to the ongoing discussions on working with multiple graphs, graph query composition, and schema models for graph data. (4) We observe an interesting line of research on compressed and skew-adaptive storage of graphs, which is important since efficiency of memory bandwidth usage is an ongoing challenge in large multiprocessor machines.
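The CSR layout mentioned under storage representations, and picked up again by trend (4), can be made concrete with a minimal sketch. The following is plain, illustrative Python (not any particular system's implementation): two flat arrays, one of per-vertex offsets and one of neighbor IDs, so that a vertex's out-neighborhood is a single contiguous slice.

```python
# Minimal CSR (compressed sparse row) sketch of a directed graph.
# Illustrative only; real GDBMSs add compression, properties, and indexes.

def build_csr(num_vertices, edges):
    """Build offset and neighbor arrays from an edge list of (src, dst) pairs."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]      # prefix sums -> start of each row
    neighbors = [0] * len(edges)
    cursor = offsets[:-1].copy()          # next free slot per vertex
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors

def out_neighbors(offsets, neighbors, v):
    """Neighborhood access is one contiguous slice: good cache locality."""
    return neighbors[offsets[v]:offsets[v + 1]]

# Tiny graph: 0 -> 1, 0 -> 2, 2 -> 1
offsets, neighbors = build_csr(3, [(0, 1), (0, 2), (2, 1)])
```

The contiguous slice is what gives CSR its traversal locality; the flip side, as noted above, is that in-place updates are awkward, which is why systems batch updates or snapshot the structure.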
It was not until the 1980s that the need arose to design graph data models for databases, which gave rise to a "Golden Age" or "Classic Period" of graph databases that produced manifold models. The motivations were also varied, among them: representing the semantics of databases, providing natural database interfaces, generalizing the hierarchical and network models, representing knowledge structures, dealing with object-oriented features, providing transparent representation for the manipulation of graph data, and modeling hypertext (see Angles and Gutierrez 2008 for a detailed survey of this period).

These models did not reach mainstream users. It was not until the early twenty-first century that actual applications of graph databases began in practice, mainly due to huge datasets representing data in the form of graphs (e.g., social networks, complex networks, biological networks, etc.). The need to have data and its relationships directly accessible in their graph representation, as well as a query language to systematically retrieve data, motivated the design of several commercial systems.

We will concentrate on the three main models that are becoming both the most popular and the most studied:

1. The RDF model. Designed to be a universal representation language for resources, its basic structure consists of triples of resources of the form (s, p, o) representing the subject (s), the predicate (p), and the object (o) of a statement about these resources. Each statement (a, p, b) can be viewed as a two-node graph a → b whose edge is labeled p. Thus, a set of RDF triples is a "graph" (even though mathematically such sets of triples are not strictly graphs; e.g., the two statements (a, p, b) and (a, q, p) are allowed, in which p occurs both as an edge label and as a node).
2. The property graph model (e.g., in Neo4j, etc.). This is a more standard model that mathematically is a graph with labels on the nodes and edges and usually multiple properties associated with them. There are several variations on the basic model (identifiers for nodes, multiple labels, etc.).
3. The nested model. This is a revival of the classic OEM (Object Exchange Model) (Papakonstantinou et al. 1995), whose most popular representation today is JSON. Each object is a set of key-value pairs, in which values can again be JSON objects. This gives a graph structure that is enriched in multiple ways (Bourhis et al. 2017).

In this entry, we will study each of these models. Earlier surveys include that by Angles and Gutierrez (2008), while surveys on query languages for graph data models, which also define the graph models under consideration, include those by Wood (2009), Barceló Baeza (2013), and Angles et al. (2017).

Key Research Findings

The simplest form of graph data model is that which was proposed centuries ago. A graph comprises a finite set of nodes and a finite set of edges, where each edge connects a pair of nodes. Graphs can be either directed or undirected, depending on whether the relationship between the pairs of nodes is symmetric or not.

More formally, an undirected graph G is a pair (N, E), where N is a finite set of nodes and E is a finite set of edges, where each edge is of the form {u, v} for u, v ∈ N. A directed graph G is defined similarly, except that each edge is an ordered pair of the form (u, v) for u, v ∈ N; in other words, E ⊆ N × N.

Given that there is only one "type" of edge in directed and undirected graphs, their ability to model relationships is rather limited. As a result, it is common to add labels to the edges. These labels can then be used to denote different relationships, such as friend-of, cousin-of, etc. Since directed graphs are more general than undirected ones (an undirected graph can always be modeled by a directed one in which there is a pair of directed edges between any pair of nodes connected by an undirected edge), we will only consider directed graphs from now on.

An edge-labeled graph G is a triple (N, E, Lab), where N is a finite set of nodes, E
is a finite set of edges, and Lab is a set of labels. Each edge is of the form (u, l, v) for u, v ∈ N and l ∈ Lab; in other words, E ⊆ N × Lab × N.

Edge-labeled graphs can be used to model the data comprising an extensive number of domains, and most current models use them. A good example is the Resource Description Framework (RDF).

While edge-labeled graphs provide a useful way to model many kinds of data, they still have their limitations: each edge only has one piece of information associated with it, namely, its label, and each node similarly has a single identifier. Sometimes it is important to be able to add more information about an edge, for example, who added it to the graph or for how long it is valid. Nodes too may need to have a number of properties associated with them, especially if they correspond to real-world objects. These requirements have given rise to what have been called property graphs.

The Resource Description Framework
The Resource Description Framework (RDF) is a W3C recommendation (Cyganiak et al. 2014), originally designed to represent metadata on the web but more recently used as a representation layer for the so-called semantic web (Berners-Lee et al. 2001). As mentioned above, an RDF graph can be viewed as an example of an edge-labeled graph; that is, although an RDF graph is a set of triples of the form (u, l, v), where u, l, and v are universal resources, one can view u and v as nodes and l as an edge label of a graph.

An atomic RDF expression is a triple comprising a subject u (the resource being described), a predicate l (the property of the resource being described), and an object v (the value of the property). Known resources are identified by IRIs (Internationalized Resource Identifiers), while unknown resources are denoted by so-called blank nodes. This set of terms can be partitioned into the disjoint sets of IRIs, blank nodes, and literals. However, in RDF further constraints are applied to the permitted sets of nodes, labels, and triples.

Example 1 (Simple person example) A basic RDF graph describing John Doe and his age is the following: G = {(I1, name, B1), (B1, first, "John"), (B1, last, "Doe"), (I1, age, 32)}, where I1 = http://my.org/pers#12, which graphically looks as follows:

[Figure: the graph G, with edges I1 –name→ B1 and I1 –age→ 32, and B1 –first→ "John" and B1 –last→ "Doe".]

In G, the node with content http://my.org/pers#12 is a known resource identified by an IRI. The node B1 is a blank node as mentioned above. The predicates (name, first, last, age) are part of the schema (ontology). Blank nodes and predicates also have IRIs (not shown to save space). The rest are value types (strings and integers in this case). Any resource (IRI) can be further referenced in a triple in another graph, in this way integrating different graphs.

As mentioned, some distinguished resources conform to schemas. A schema in RDF is simply a reserved vocabulary, usually further specified, which can be used to describe other resources. Examples above are "name," "age," etc. The RDF specification includes a basic vocabulary (RDF Schema) consisting of type, class, property, subClass, subProperty, and the like, plus some basic deductive rules, e.g., transitivity of subClass, etc. (Brickley and Guha 2014).

Although the original idea of RDF was that of a metadata specification for the web, it became a standard to build graph databases with RDF data. In this direction, a standard query language, called SPARQL (Harris and Seaborne 2013), was introduced to query RDF data.

The Property Graph Model
A property graph G is a directed multigraph with possibly multiple labels on nodes and edges. Formally, G is a tuple (N, E, Lab, Prop, Val, ρ, λ, σ), where N and E are finite sets of nodes and edges; Prop is a finite set of properties; Lab and Val are sets of labels and values, respectively;
and ρ, λ, and σ are functions defined as follows (Angles et al. 2017):

• ρ : E → (N × N) is a total function defining the edges, where, for e ∈ E, ρ(e) = (u, v) means that e is a directed edge from u to v for u, v ∈ N.
• λ : (N ∪ E) → Lab is a total function providing the label for each node and edge. If v ∈ N and e ∈ E, then λ(v) is the label of node v and λ(e) is the label of edge e.
• σ : (N ∪ E) × Prop → Val is a partial function providing values for those properties defined for particular nodes and edges. If v ∈ N, e ∈ E, and p ∈ Prop, then σ(v, p) is the value of property p for node v, if defined (and similarly for edge e).

Example 2 Consider again the simple person example used in Example 1. This can be represented as a property graph as follows:

• Nodes N = {n1, n2, n3, n4, n5}.
• Edges E = {e1 = (n1, n2), e2 = (n1, n5), e3 = (n2, n3), e4 = (n2, n4)}.
• Node labels λ(n1) = (), λ(n2) = (), λ(n3) = "John", λ(n4) = "Doe", λ(n5) = 32.
• Edge labels λ(e1) = "name", λ(e2) = "age", λ(e3) = "first", λ(e4) = "second".

Here we have identifiers for all nodes and edges. The rest – strings and numbers – are values, including (), which represents the absence of any value. Only nodes can be further referenced. Edges have identifiers only for the purpose of attaching properties to them.

Graphically, the example can be represented as follows:

[Figure: the property graph, with edges n1 –name→ n2 and n1 –age→ n5 (value 32), and n2 –first→ n3 (value "John") and n2 –second→ n4 (value "Doe").]

The schema in current property graph databases plays essentially the role of integrity constraints to ensure safe modeling or improve performance. So a schema includes the possible properties for nodes and edges, the corresponding types, possible checking for existence or cardinality, etc. Some systems include indexes as part of the schema.

Current property graph database systems include storage, a query language (although there is no widely adopted standard), APIs, and application functionalities such as analytics, visualization, and tools to produce code or assemble with other applications. The Neo4j Graph Platform is a native graph database supporting transactional applications and graph analytics. It has developed its own query language called Cypher (https://neo4j.com). JanusGraph is designed for storing and querying graphs distributed across a multi-machine cluster. It uses Gremlin as its query language (http://janusgraph.org/). DataStax Enterprise Graph is part of the DataStax Enterprise (DSE) data platform. It is built on Apache Cassandra (http://cassandra.apache.org) and also uses the Gremlin query language. It supports distribution and benefits from the other functionalities of DSE (https://www.datastax.com/products/datastax-enterprise-graph).

The Nested Data Model
The need for a data model that was more flexible than the relational data model originally arose when trying to integrate heterogeneous data into a single representation. This gave rise to the Object Exchange Model (OEM) and subsequently to alternative representations for what was to be termed semi-structured data. The most recent concrete proposal in this area is JSON (JavaScript Object Notation).

JSON is a lightweight representation based on data types used in JavaScript and is useful when exchanging data between applications. Apart from the primitive values true, false, and null, and the primitive types of strings and numbers, JSON provides two structuring types, namely, objects and arrays. Objects are represented by key-value pairs (dictionaries), while arrays are simply sequences of values. The values appearing in objects and arrays can be arbitrary
834 Graph Data Models
JSON constructs, thereby allowing arbitrary nest- Several DBMSs support JSON as a data
ing of structures. format. Those that use it as their main data
format are also referred to as document
Example 3 Consider once again the person ex- stores, document-oriented databases, or JSON
ample from Example 1. This can be represented databases. The most well-known of these
in JSON as follows: include MongoDB (https://www.mongodb.com),
CouchDB (http://couchdb.apache.org), and Azure
{ Cosmos DB (https://azure.microsoft.com/en-gb/
"name": { services/cosmos-db/). For example, MongoDB
"first": "John", offers a schema language (its own and JSON
"last": "Doe" Schema) and a proprietary query language
}, to select and query JSON documents. Along
"age": 32 with these document stores, there are NoSQL
} databases, such as MarkLogic (http://www.
marklogic.com) and OrientDB (http://orientdb.
The basic structure in JSON is the key-value pair, com) that support JSON as one of their data
with key-value pairs being grouped into objects, formats. Finally, many popular SQL DBMSs,
each delimited by braces. Each key within an such as MySQL (https://www.mysql.com),
object must be a string and must also be unique. PostgreSQL (https://www.postgresql.org), and
Each value can be be a basic value (number, Oracle (https://www.oracle.com/database/index.
string, etc.) or a nested expression. The underly- html), support JSON.
ing structure is a graph, in this case:
graph structure. Properties in the form of labels are described inside these structures.

The RDF model keeps the nodes as primitives and enriches the edges with a label that can itself be a node. This label allows the structure to be enriched. But note that edges do not properly have identifiers (they are pairs of nodes). One can get RDF "graphs" such as (a, b, c), (b, p, d), and (d, q, c) that can be represented as the labeled edges {a –b→ c, b –p→ d, d –q→ c}. This allows great flexibility, by allowing one to "describe" the edge label b recursively. The price for this flexibility is the complexity of implementing the model efficiently (because there is a mix between properties of objects and the objects themselves).

The property model keeps identifiers for nodes and edges and enriches the structure by adding additional properties over them. These properties do not have (as in the nested and RDF models) any intersection with the primitive nodes and edges. This model keeps the types (nodes, edges, properties) clearly disjoint, and thus classical graph algorithms can be implemented directly.

Conclusions

In this entry we have concentrated on only three categories of graph data model: edge-labeled graphs as exemplified by RDF, the property graph model, and the nested graph model. We believe these are currently the graph models that are most widely studied and implemented. Many other graph data models are described in the survey by Angles and Gutierrez (2008).

Cross-References

References

graph query languages. ACM Comput Surv 50(5):68:1–68:40
Barceló Baeza P (2013) Querying graph databases. In: Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI symposium on principles of database systems. ACM, New York, pp 175–188
Berners-Lee T, Hendler J, Lassila O (2001) The semantic web. Sci Am 284(5):34–43
Bourhis P, Reutter JL, Suárez F, Vrgoč D (2017) JSON: data model, query languages and schema specification. In: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI symposium on principles of database systems. ACM, New York, pp 123–135
Brickley D, Guha R (2014) RDF schema 1.1. http://www.w3.org/TR/2014/REC-rdf-schema-20140225/, W3C Recommendation 25 Feb 2014
Chen PPS (1976) The entity-relationship model – toward a unified view of data. ACM Trans Database Syst 1(1):9–36
Cyganiak R, Wood D, Lanthaler M (2014) RDF 1.1 concepts and abstract syntax. http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/, W3C Recommendation 25 Feb 2014
Gyssens M, Paredaens J, Van den Bussche J, Van Gucht D (1990) A graph-oriented object database model. In: Proceedings of the 9th symposium on principles of database systems. ACM Press, pp 417–424
Harris S, Seaborne A (2013) SPARQL 1.1 query language. http://www.w3.org/TR/2013/REC-sparql11-query-20130321/, W3C Recommendation 21 Mar 2013
Lehmann FW, Rodin EY (eds) (1992) Semantic networks in artificial intelligence. International series in modern applied mathematics and computer science, vol 24. Pergamon Press, Oxford
Levene M, Poulovassilis A (1991) An object-oriented data model formalised through hypergraphs. IEEE Trans Knowl Data Eng 6(3):205–224
Papakonstantinou Y, Garcia-Molina H, Widom J (1995) Object exchange across heterogeneous information sources. In: Proceedings of the eleventh international conference on data engineering. IEEE, pp 251–260
Wood PT (2009) Graph database. In: Liu L, Özsu MT (eds) Encyclopedia of database systems. Springer, Boston, pp 1263–1266
Distinct root-based KWS defines GQ as a minimal rooted tree which (1) contains a distinct root node vr and at least one content node vi ∈ V(ti) as a leaf for each ti ∈ Q, and (2) len(vr, vi) ≤ r for a predefined hop bound r, for each leaf vi. Here, GQ is minimal if no subtree of GQ is an answer of Q in G. The function F(GQ) is defined as F(GQ) = Σ_{ti ∈ Q} dist(vr, vi), where vi ranges over the content nodes. The answers of such queries can be found in O(|Q|(|V| log |V| + |E|)) time (cf. Yu et al. 2010).

Steiner tree-based KWS (Bhalotia et al. 2002) differs from its distinct root-based counterpart in that it uses a different cost function F(GQ), defined as Σ_{e ∈ GQ} w(e), i.e., the total weight of the edges in the Steiner tree GQ. It is NP-hard (Yu et al. 2010) to evaluate a Steiner tree-based query, since it requires computing a minimum weighted Steiner tree, a known NP-hard problem (Ding et al. 2007). Both exact (Ding et al. 2007) and approximation algorithms (Byrka et al. 2013) have been developed to evaluate such queries.

Steiner graph-based KWS (Li et al. 2008; Kargar and An 2011) finds GQ as a graph rather than a tree. For Steiner graph-based queries with a number r, GQ is a Steiner graph that contains content and Steiner nodes (i.e., nodes on shortest paths between two content nodes), with either its radius bounded by r (i.e., an r-radius Steiner graph (Li et al. 2008)) or the distance between any two content nodes bounded by r (i.e., an r-clique (Kargar and An 2011)). For an answer GQ with nodes {v1, …, vn}, its cost F(GQ) is computed as Σ_{i ∈ [1,n]} Σ_{j ∈ [i+1,n]} dist(vi, vj), i.e., the total pairwise distance of the content nodes in GQ. It is in general NP-hard to evaluate Steiner graph-based queries (Yu et al. 2010). Approximate algorithms have been developed for such queries to find r-radius graphs (Li et al. 2008) and r-cliques (Kargar and An 2011).

Keyword query suggestion. Keyword query suggestion and its variants (e.g., query refinement and reformulation) have been studied to suggest new queries that better describe the search intent for graph exploration. Most prior work adopts information retrieval techniques that make use of query logs and user feedback (Carpineto and Romano 2012). Keyword query suggestion for graphs has been studied in Tran et al. (2009) by suggesting structured queries computed from the keywords over RDF data. In order to summarize the results of KWS over structured data, tag clouds (Koutrika et al. 2009) discover the most significant words retrieved as a part of the initial results. Provable quality guarantees on the answers are not addressed in these works. Query suggestion has been studied for XML data (Zeng et al. 2014), with a focus on coping with missed matches. An approximate answer of the original query Q is computed by expanding its answer with content nodes of the same type, and new keywords are suggested to replace those without a match in Q.

Exploratory Graph Analysis

One of the earliest attempts employing examples as search conditions is query-by-example (Zloof 1975). The main idea was to help the user in the query formulation, allowing her to specify the shape of the results in terms of templates for tuples, i.e., examples. Query by example has lately been revisited, and the use of examples has found application in graph data. The definition of example has transformed from a mere template to a representative of the intended results the user would like to retrieve. Examples for a graph G = (V, E) can be either a set of nodes Q ⊆ V, a tuple of nodes (v1, …, vn) ∈ V^n, or a subgraph Q ⊆ G, where with an abuse of notation ⊆ indicates both set and structural inclusion.

Set of nodes as examples. Several analyses can be performed when a set of example nodes Q ⊆ V is provided; this includes the discovery of dense graph regions (Gionis et al. 2015; Ruchansky et al. 2015), central nodes (Tong and Faloutsos 2006), and communities (Staudt et al. 2014; Perozzi et al. 2014). Discrepancy maximization in graphs (Gionis et al. 2015) aims at finding a connected subgraph G′ = (V′, E′), G′ ⊆ G, that maximizes the discrepancy δ(G′) = α|V′ ∩ Q| − |V′ ∖ Q|, α > 1; in other words,
G′ is a subgraph containing more nodes from Q than other nodes. Discrepancy maximization is NP-hard. On a similar line, the centerpiece subgraph problem (CEPS) (Tong and Faloutsos 2006) finds the maximum weight subgraph that contains at least k query nodes and at most b other nodes. A generalization of the CEPS problem, the minimum Wiener connector problem (MWC) (Ruchansky et al. 2015) computes dense regions of a graph potentially representing community structures, given a set of nodes as input. MWC aims at finding a subgraph of G induced by Q, denoted G[Q], that minimizes the sum of the pairwise distances among the nodes in Q, i.e., arg min_{G[Q]} Σ_{{u,v} ⊆ Q} d_{G[Q]}(u, v).

Examples have also been used as a means to detect communities, where a community is a set of nodes in which every node is connected to many more nodes inside the community than outside. One approach is to consider each node v ∈ Q as a representative of a different community (Staudt et al. 2014). In graphs with attributes on nodes, FocusCO (Perozzi et al. 2014) introduced a clustering algorithm that learns a similarity function among nodes, based on the attributes in the examples. Formally, FocusCO takes an attributed graph G = (V, E, F), where F ∈ R^{n×f} is a feature matrix and Fi represents the attribute vector for node vi, and returns clusters that better represent the seed nodes Q. FocusCO also returns outliers that do not belong to any community identified so far. (For community detection, see chapter "Community Analytics in Graphs.")

Node tuples as examples. In graph query by example (GQBE) (Jayaram et al. 2015b) a tuple of nodes (v1, …, vn) ∈ V^n is a representative of the result tuples. For instance, if the pair (Yahoo, Jerry Yang) is provided, the result is expectedly other (IT company, founder) pairs. The solution first computes the minimal answer graph connecting the query nodes and then finds other subgraphs isomorphic to such a graph.

Subgraphs as examples. Subgraphs are more expressive than nodes and tuples; hence they can be exploited to obtain more accurate results. In the case of graphs, the example can constitute an exemplar query (Mottin et al. 2016) Q ⊆ G, and a result is a subset of G congruent to Q. Exemplar queries are a flexible paradigm that allows the definition of multiple congruence relations between the input example and the intended results, such as isomorphism or strong simulation (Ma et al. 2014). Moreover, the paradigm supports efficient retrieval of top-k results.

A generalization of exemplar queries, PANDA (Xie et al. 2017), studies partial topology-based network search, that is, finding the connections (paths) between structures node-label isomorphic to different user inputs. PANDA first materializes all isomorphic graphs, then groups them into connected components, and finally finds undirected shortest paths among them. More techniques on graph pattern queries are introduced in chapters "Graph Pattern Matching" and "Graph Query Processing."

Reformulation of Graph Queries

Users with exploratory intent typically provide indefinite queries that are likely to return either too few or too many results. For this reason, query reformulation techniques aim at modifying the query to lead the user to the intended result (Mottin et al. 2015; Hurtado et al. 2008; Islam et al. 2015). A reformulation for a query Q ⊆ G on a labeled graph G = (V, E, ℓ), with labeling function ℓ on nodes and edges, is another query Q′ that is either a supergraph Q′ ⊇ Q or a subgraph Q′ ⊆ Q of Q. Therefore, Q′ can be more specific (supergraph) or more generic (subgraph) than Q. Queries considered in graph query reformulation are usually subgraph isomorphism queries (Lee et al. 2012). Query reformulation has been proposed for collections of graphs, or graph databases, as well as for large graphs.

Query reformulation in graph databases. Query reformulation in graphs can apply to collections of graphs, or graph databases, D = {G1, …, Gn}, where each Gi is a graph Gi = (Vi, Ei, ℓi). In such graph databases, the query Q returns the subset of the collection R(Q) ⊆ D containing the structure Q.
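The discrepancy objective described above rewards subgraphs rich in example nodes and penalizes the rest. A minimal Python sketch follows; the greedy expansion is purely illustrative and is not the algorithm of Gionis et al. (2015):

```python
# Sketch of the discrepancy score delta(G') = alpha*|V' ∩ Q| - |V' \ Q|
# from the "set of nodes as examples" discussion. The greedy expansion
# below is only an illustration, not the algorithm of Gionis et al. (2015).
def discrepancy(subgraph_nodes, examples, alpha=2.0):
    inside = len(subgraph_nodes & examples)
    outside = len(subgraph_nodes - examples)
    return alpha * inside - outside

# Toy graph as adjacency sets; Q is the example node set.
graph = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2, 5}, 5: {4}}
Q = {1, 2, 5}

# Greedily grow a connected subgraph from one example node while the
# score keeps improving (can get stuck in a local optimum, as here).
nodes = {min(Q)}
improved = True
while improved:
    improved = False
    frontier = {v for u in nodes for v in graph[u]} - nodes
    for v in sorted(frontier):
        if discrepancy(nodes | {v}, Q) > discrepancy(nodes, Q):
            nodes.add(v)
            improved = True

print(nodes, discrepancy(nodes, Q))
```

With these toy inputs the greedy walk stops before reaching the example node 5, since passing through the non-example node 4 temporarily lowers the score, illustrating why the problem is hard.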
Graph query reformulation (GQRef) (Mottin et al. 2015) discovers query reformulations having small overlap and high coverage; thus, ideally, no result is missing from the reformulated queries, yet the reformulations have little redundancy. The objective is to find a set R of k reformulations with high coverage and diversity (i.e., small overlap), maximizing f(R) = |⋃_{Q ∈ R} R(Q)| + Σ_{Q1,Q2 ∈ R} |R(Q1) ∪ R(Q2)|, thereby trading off coverage and diversity. Maximizing f is NP-hard, but an approximate greedy solution can guarantee a (1/2)-approximation.

Alternatively, AutoG (Yi et al. 2017) proposes a top-k reformulation approach for visual autocompletion of queries. AutoG returns the k best relaxations according to a ranking function that favors a large number of results for each reformulation and large diversity. Such a ranking function (excluding normalization) is u(R) = α Σ_{Q ∈ R} |R(Q)|/|D| + (1 − α) Σ_{Q1,Q2 ∈ R} dist(Q1, Q2), where α ∈ [0, 1] strikes a balance between coverage and a user-defined distance dist among queries. For visual query analysis, see chapter "Visual Graph Querying."

Query reformulation in large graphs. Few works explicitly explore query reformulation in large graphs. Early attempts in this direction propose to enrich the SPARQL semantics with a RELAX clause that returns a set of reformulations following some predefined rules (such as expanding edges with a specific label) (Hurtado et al. 2008). In Arenas et al. (2014) the SPARQL reformulations are extended to OWL logic, where the semantics of each reformulation is captured by the RDF taxonomy. The user can interactively accept or reject the possible expansions of a query such that the navigation is consistent with the rules.

Query reformulation is also employed in query debugging to understand why a query returned too many or too few answers (Vasilyeva et al. 2016). In order to explain the dependency between the number of results and the query, Vasilyeva et al. (2016) first compute the maximum common subgraph (MCS) between the graph G and the query Q and then use a differential graph to profile the query by adding edges from the differential graph to the MCS. The maximum common subgraph is the largest connected graph G′ that is a subgraph of both Q and G.

VIIQ (Jayaram et al. 2015a) employs query logs to rank the possible reformulations of the query Q. VIIQ's ranking function generates correlation paths among the sequence of reformulations in the current query Q and the sequences in the query log. Among the candidate single edges that can be added to Q, the ranking function proposes those that have a larger support in the query log.

Why-not queries in graphs. Why-not queries additionally require the user to provide a set of missing answers from the results of Q, and the system proposes a new query Q′ returning those results and the least number of irrelevant results. The only work in this direction is Islam et al. (2015). In order to compute why-not queries, the maximum common subgraph (MCS) between the query Q and the missing answers D is computed. The best query is found by minimizing the maximum distance between Q′ and the missing answers D, i.e., arg min_{Q′} max{Δ(Q′, G) | G ∈ D}, where, given G1 = (V1, E1, ℓ1) and G2 = (V2, E2, ℓ2), Δ(G1, G2) = |E1| + |E2| − 2|MCS(G1, G2)|.

Key Applications

The need for graph exploration is evident in user-friendly knowledge exploration (Achiezra et al. 2010), "why-not" query processing (Tran and Chan 2010), and database security, where highly correlated queries are identified to reverse engineer sensitive entity information (Tran et al. 2014). Another application is interactive sensemaking of complex graph data (Pienta et al. 2015), with broader applications in literature search and collaboration recommendation in academic networks, attack and anomaly detection in cyber networks, and event detection in social media.
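A coverage-plus-diversity ranking of the kind used by AutoG can be illustrated on toy inputs. In the Python sketch below, the result sets and the Jaccard-style stand-in for the user-defined query distance are invented for illustration:

```python
from itertools import combinations

# Illustrative sketch in the spirit of AutoG's ranking
# u(R) = alpha * sum |R(Q)|/|D| + (1 - alpha) * sum dist(Q1, Q2).
# The result sets and distance below are made-up toy inputs.
def u(reformulations, results, db_size, dist, alpha=0.5):
    coverage = sum(len(results[q]) / db_size for q in reformulations)
    diversity = sum(dist(q1, q2) for q1, q2 in combinations(reformulations, 2))
    return alpha * coverage + (1 - alpha) * diversity

# Toy data: three candidate reformulations with their result sets.
results = {"Q1": {1, 2, 3}, "Q2": {2, 3}, "Q3": {7, 8, 9, 10}}
db_size = 10

# Toy distance: Jaccard distance between result sets, standing in for a
# user-defined distance between the queries themselves.
def dist(q1, q2):
    a, b = results[q1], results[q2]
    return 1 - len(a & b) / len(a | b)

print(u(["Q1", "Q2"], results, db_size, dist))
print(u(["Q1", "Q3"], results, db_size, dist))  # the more diverse pair scores higher
```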
Tong H, Faloutsos C (2006) Center-piece subgraphs: problem definition and fast solutions. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 404–413
Tran QT, Chan CY (2010) How to conquer why-not questions. In: SIGMOD, pp 15–26
Tran T, Wang H, Rudolph S, Cimiano P (2009) Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data. In: ICDE, pp 405–416
Tran QT, Chan CY, Parthasarathy S (2014) Query reverse engineering. In: VLDB, pp 721–746
Vasilyeva E, Thiele M, Bornhövd C, Lehner W (2016) Answering "why empty?" and "why so many?" queries in graph databases. J Comput Syst Sci 82(1):3–22
Wang H, Aggarwal CC (2010) A survey of algorithms for keyword search on graph data. In: Managing and mining graph data, pp 249–273
Xie M, Bhowmick SS, Cong G, Wang Q (2017) PANDA: toward partial topology-based search on large networks in a single machine. VLDB J 26(2):203–228
Yi P, Choi B, Bhowmick SS, Xu J (2017) AutoG: a visual query autocompletion framework for graph databases. VLDB J 26(3):347–372
Yu JX, Qin L, Chang L (2010) Keyword search in relational databases: a survey. IEEE Data Eng Bull 33(1):67–78
Zeng Y, Bao Z, Ling TW, Jagadish H, Li G (2014) Breaking out of the mismatch trap. In: ICDE, pp 940–951
Zloof MM (1975) Query by example. In: Proceedings of the May 19–22, 1975, national computer conference and exposition. ACM, pp 431–438

Graph Generation and Benchmarks

guided the research and development of database systems over the years, speeding up their progress and their impact on society. Among many benchmarking initiatives, the Transaction Processing Performance Council (TPC 2017) family of benchmarks is the best example of influential database benchmarks.

Industry and academia are aware of the benefits benchmarking can provide to the evolution and adoption of graph database technologies, and as such, many graph benchmarking initiatives have emerged, such as Erling et al. (2015), Iosup et al. (2016), Armstrong et al. (2013), or Bagan et al. (2017), just to cite a few of them. However, designing a benchmark is not a trivial task. A benchmark has several components that need to be carefully designed/selected. For instance, what is the target audience of the benchmark? Which are the datasets the benchmark works with? How do we quantitatively measure the performance of a system running the benchmark? These are just a few questions that must be answered when designing a benchmark.

In this chapter, we review the elements a graph benchmark consists of and how the different existing graph database benchmarks address them. Finally, we discuss the different open challenges in graph benchmarking.
Queries and query mixes. There are different strategies usually followed when designing the queries of a benchmark workload. One commonly adopted strategy is based on analyzing a reference use case (e.g., the operation of a social network web site) in depth. From this analysis, representatives of the most relevant queries (e.g., a friend recommendation query, a group recommendation query, etc.) are synthesized, which exhibit either a read or a write mode. Additionally, the query mix is carefully designed to reproduce the query frequencies typically observed. This strategy aims to accurately model the reference use case with the goal of providing the user with insight into how different SUTs will perform in a similar but real setting. Benchmarks following this approach are typically known as macro-benchmarks.

However, the purpose of a benchmark is not only to assess the current status of an industry but also to drive its evolution and path to maturity. Thus, some benchmarks do not only focus on accurately modeling the reference use case queries but also on making the queries hard, to challenge the different industry and academia actors. This is usually achieved by understanding how queries are typically executed on the target systems, and by identifying which low-level operations or operators dominate the execution time of typical queries issued against the SUTs and become their bottlenecks. Then, queries are crafted (in combination with the input data) to deliberately stress such low-level operations and to push the SUTs to their limits. When following this strategy, the benchmark designer must find a trade-off between the realism of the queries and their hardness. This also applies to the definition of the query mix, which may consist of frequencies that do not necessarily strictly reproduce those observed in a reference use case. For instance, one might decide to increase the frequency of particular operations (e.g., write operations) to stress the systems in this aspect. This approach has the advantage of testing the SUTs more comprehensively (bringing them out of their comfort zone), which in turn stimulates their evolution toward usages not considered by the reference use case, making the whole industry advance.

By pushing this strategy to an extreme, one typically ends up with a micro-benchmark, i.e., a benchmark whose queries are directly the low-level operations/operators used to build more complex queries (e.g., getting the k-hop neighborhood of a node, computing the path between two nodes, performing a join, etc.).

The data. The performance of most queries in database applications is data driven, that is, it is highly influenced by the characteristics of the data used. This is especially true for graph database applications, which usually perform expensive operations. For instance, one might think about the computation of the k-hop neighborhood of a given node. Depending on the query node, its degree, and the diameter of the graph, a hypothetically small two- or three-hop neighborhood could lead to visiting almost all nodes of the graph (Backstrom et al. 2012). Thus, when designing the workload, designers must carefully choose the data to be used, not only to meet functional requirements (e.g., the presence of attributes that are involved in the queries) but also other characteristics that may affect the performance of the SUTs (number of joins, number of recursive steps, number of disjuncts, and so on).

In general, benchmarks use two types of datasets: real datasets gathered from the real reference use case and synthetically generated datasets. The former have the advantage of containing the desired characteristics, but they are typically scarce, since their owners are reluctant to share them, and they do not always have the desired scale. For this reason, many benchmark designers opt for using synthetic graph generators, paying attention to accurately model the characteristics observed in their realistic counterparts. This strategy has the advantage that graphs at different scales can be modeled, allowing the definition of scale factors to test the SUTs at different scales. The importance of synthetic graph generation has grown significantly during the last years, and thus there is a considerable amount of research on the topic.
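The k-hop neighborhood operation mentioned above is easy to state but, as discussed, can touch almost the whole graph; a bounded breadth-first search makes this concrete (a Python sketch; the toy adjacency list stands in for a benchmark dataset):

```python
from collections import deque

# k-hop neighborhood of a node: all nodes reachable in at most k edge
# traversals. This is the kind of low-level operation a micro-benchmark
# would time directly; the toy graph below stands in for a real dataset.
def k_hop_neighborhood(graph, source, k):
    seen = {source: 0}          # node -> hop distance from source
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if seen[u] == k:        # do not expand beyond the hop bound
            continue
        for v in graph.get(u, ()):
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return set(seen) - {source}

graph = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
assert k_hop_neighborhood(graph, 1, 2) == {2, 3, 4}
assert k_hop_neighborhood(graph, 1, 3) == {2, 3, 4, 5}
```

On a skewed real graph the same call issued at a hub node visits vastly more nodes than at a peripheral node, which is exactly why the data must be chosen as carefully as the queries.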
Real networks like social networks, biological networks, scientific collaboration networks, or the World Wide Web share particular characteristics, including a skewed degree distribution that can be approximated with a power law with a long tail, a community structure, a small diameter, or a large largest connected component (Girvan and Newman 2002; Newman 2003; Boccaletti et al. 2006). As such, many synthetic graph generation techniques focus on reproducing such observed characteristics, including the method proposed by Newman et al. (2001), the R-MAT method proposed by Chakrabarti et al. (2004), the Kronecker graphs proposed by Leskovec et al. (2005), the LFR graph generator introduced by Lancichinetti et al. (2008), or the BTER graph generator described by Kolda et al. (2014), just to cite a few of them. Typically, graph benchmarks adopt these techniques to build their own graph generators, since the former are mostly focused on the structure of the graph but do not consider the generation of properties, usually required by benchmarks.

The Scoring Metric
The main goal of a benchmark is to provide an unbiased, fair measure of how a system performs for a given task. In order to provide such a measure, one or more scoring functions are defined. These scoring functions typically measure things such as throughput or latency, to give a sense of the performance of the SUTs. However, some benchmarks define scoring functions that measure a trade-off between performance and cost, where cost is defined as the cost of operating the SUT, including the cost of purchasing the system and sometimes also its maintenance during a period of time. As a consequence, the user can know which is the best system given his budget, or how much he needs to spend for the desired throughput and/or latency. Also, some use cases allow systems to return partial or incomplete results. Thus, some benchmarks define scoring functions to measure how complete the results of a query are.

Finally, in those cases where a benchmark defines several scoring functions, it is also common practice to define a combined metric that unifies all such scores into a single score value in order to ease the comparison of systems.

The Auditing Rules
The third main component of a benchmark is the auditing rules. The auditing rules are a set of rules that define the conditions under which the benchmark execution must be performed in order to consider the score valid and to ensure that it has not been gamed. The rules define various aspects, such as the process to follow in order to execute the benchmark and publish its results, whether it must be executed by an independent auditor or not, what information about the SUT must be disclosed to the audience, or how to compute the cost of the SUT.

Key Research Findings

LUBM
The LUBM (Lehigh University Benchmark) (Guo et al. 2005) is a macro-benchmark designed for the Semantic Web that has also been used to benchmark graph database systems due to the more expressive property graph data model. The benchmark contains 14 queries crafted to be both realistic and diverse. In order to assess this diversity, the authors define a set of features for each query (e.g., input size, selectivity, complexity, etc.) and use them to guide the definition of the queries, trying to maximize the space of features covered.

The query mix consists of ten executions of each query. The benchmark uses several scoring functions, including the loading time, the repository size (i.e., the size used to store the loaded data), and the query response time per query, which is computed as the average execution time of the ten executions of each query. Additionally, the benchmark uses the notion of completeness and soundness for each query, which captures how precise and complete the returned results are (partial results are allowed). Finally, the benchmark proposes a combined metric that unifies all the aforementioned scoring functions into a single number.
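The skewed, power-law-like degree distributions that synthetic generators aim to reproduce emerge already from a minimal preferential-attachment process. The Python sketch below is for illustration only; it is none of the cited generators (R-MAT, Kronecker, LFR, BTER):

```python
import random

# Minimal preferential-attachment generator: each new node attaches to an
# endpoint of a uniformly chosen existing edge, so high-degree nodes attract
# more edges and a skewed, heavy-tailed degree distribution emerges. This is
# an illustration only, not R-MAT, Kronecker, LFR, or BTER.
def preferential_attachment(n, seed=0):
    rng = random.Random(seed)
    edges = [(0, 1)]                       # start from a single edge
    for new in range(2, n):
        u, v = rng.choice(edges)           # picking a random edge and then
        target = rng.choice((u, v))        # one endpoint samples nodes in
        edges.append((new, target))        # proportion to their degree
    return edges

edges = preferential_attachment(10_000)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

# Skew check: a few hubs collect far more edges than the median node.
assert max(degree.values()) > 10 * sorted(degree.values())[len(degree) // 2]
```

A benchmark generator would additionally have to attach properties (names, timestamps, attribute values) to these nodes and edges, which is precisely the gap the entry notes in purely structural generators.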
LUBM uses a synthetic data generator, namely, the Univ-Bench Artificial (UBA) data generator, that generates datasets representing universities, their departments, the staff, the students, and the courses. The minimum unit of data generation (used to scale up the dataset) is the university, and the number of different departments, professors, students, and courses per university is based on the authors' "reasonable assumptions" but not backed by real data. Finally, the LUBM benchmark defines neither a set of auditing rules nor any cost-based scoring function.

BSBM
The Berlin SPARQL Benchmark (Bizer and Schultz 2009) is a macro-benchmark set in an e-commerce use case, where products are offered by different vendors and reviews are posted by consumers on various review sites. The benchmark initially targeted RDF stores, both those that execute SPARQL directly and those that rely on SPARQL-to-SQL rewritings, but it has also been used to test graph databases.

The benchmark uses a synthetic data generator to generate the different classes like product, vendor, person, or review, just to cite a few of them. The generator scales based on the number of generated products, and for each of them a description and reviews are generated. Then, the number of persons depends on the number of reviews generated, until each review is assigned to one and only one person. The data distributions are tuned to follow the authors' "realistic assumptions", including normal distributions for the number of reviews per product and the number of reviews per person.

The workload consists of 12 queries that represent the interactions of users with the e-commerce site. The authors define a query mix as the succession of queries that emulates the search and navigation pattern of a consumer looking for a product, which consists of 25 query executions. The final workload is composed of different query mixes representing different users using the e-commerce site concurrently.

The BSBM benchmark uses three scoring functions: query mixes per hour (QMpH) (i.e., the number of users served per hour), queries per second (QpS), and load time (LT), but neither a combined metric nor a cost-based metric is provided. Finally, the benchmark defines the set of steps to be properly executed, including the process of loading, system warm-up, and execution, but a formal auditing process is not specified.

LinkBench
LinkBench (Armstrong et al. 2013) is a macro-benchmark created by Facebook to test the performance of alternative database systems when dealing with Facebook's social graph. The benchmark replicates Facebook's data model and the transactional query mix used on their production MySQL store.

The benchmark provides a data generator that generates a social graph similar to that of Facebook, based on statistics obtained from the original Facebook data, and that can be scaled based on the desired number of users of the social network. Similarly, the authors profiled the operations received in the MySQL back end and identified ten different types of basic operations (including read and write operations) that operate on different nodes and edges of the social graph, as well as their corresponding frequencies, which form the query mix. The workload also tries to reproduce the observed proportions of hot data (data accessed frequently) and cold data (data barely accessed).

The benchmark defines several performance metrics that need to be reported. These include the latency at the 50th, 75th, 95th, and 99th percentiles and the throughput. They define the throughput per dollar as a cost/performance metric. Additionally, they suggest gathering resource utilization statistics (e.g., CPU usage, I/O operations, etc.) at regular intervals for understanding performance and efficiency, but they provide neither a combined nor a cost-based scoring function. Finally, they do not define auditing rules.

LDBC SNB Interactive
The Social Network Benchmark: Interactive workload (SNB-I) (Erling et al. 2015) is a
macro-benchmark created by the Linked Data Benchmark Council, an industry and academia initiative to build benchmarks for graph-based technologies.

SNB-I models the operation of a social network site, similar to that of Facebook in terms of its data model. SNB-I provides a synthetic data generator that simulates a social network consisting of persons, posts, comments, images, likes, groups, etc. The data generator models the characteristics typically observed in real networks, including a Facebook-like degree distribution (Ugander et al. 2011), spiky post/comment distributions based on real events (Leskovec et al. 2009), correlated attributes of entities and among connected entities (McPherson et al. 2001), etc. The benchmark defines several scale factors (SF1, SF3, ..., SF1000), named after the number of GB on disk taken by the generated dataset files.

The workload is a transactional workload targeting graph databases, RDBMSs with graph extension support, and RDF systems. It consists of 14 complex reads, which are complex yet interactive queries that explore the neighborhood of a given node; 7 short reads, which are short queries used to retrieve user information such as friendships, messages, likes, etc.; and 8 update queries, which modify the state of the social network.

Complex reads are designed using a novel "choke-point"-based approach, which consists of identifying scenarios that are challenging for SUTs and deliberately designing queries that stress such scenarios while remaining use-case based.

The query mix is generated at the beginning of the execution by the provided driver, such that each query instance is given a scheduled issue time. The query mix is designed so that no complex read dominates the execution, given that queries have different complexities whose runtime depends on the expected amount of data touched. Additionally, updates and their scheduled issue times are read from data provided by the data generator, while short reads are interleaved between complex reads and updates in such a way that a final read-write proportion of 80/20 is obtained.

Regarding the scoring function, the benchmark uses throughput and throughput per cost. Additionally, the benchmark defines a "valid run" condition, which requires 95% of the queries to start no later than 2 s after their scheduled issue time.

Finally, the benchmark defines a comprehensive set of auditing rules that specify both the process to follow to execute the benchmark (done by an official benchmark auditor) and the information that must be disclosed about the SUTs.

LDBC Graphalytics
The LDBC Graphalytics benchmark (Iosup et al. 2016) is an industry-backed macro-benchmark that targets graph processing systems specialized in analytics on large graphs, including distributed systems and heterogeneous systems based on GPUs.

The workload consists of six traditional structure-based (no properties are used) graph analytic algorithms: breadth-first search, PageRank, weakly connected components, community detection with label propagation, local clustering coefficient, and single-source shortest paths.

The Graphalytics benchmark uses both real graphs and synthetic graphs of different scales, named after the traditional T-shirt size labeling (S, M, L, XL, XXL, etc.), with a renewal process that is revisited every 2 years. For generating synthetic graphs, the benchmark uses both the SNB-I data generator and the R-MAT data generator.

The benchmark uses several performance metrics, including the loading time; the makespan, which is the time of executing the algorithm including the system-specific overheads; and the processing time, which is the actual time of executing the algorithm. Additionally, they collect the edges processed per second (EPS) and the edges and vertices processed per second (EVPS). The benchmark also specifies an EVPS/cost metric.

Graphalytics comprehensively defines the steps to follow to execute the full benchmark, including the experiments to run and measurements to take, as well as all the information that needs to be disclosed.
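The relation between the Graphalytics-style timing metrics and the derived EPS/EVPS values can be sketched in a few lines. The function below and all numbers are purely illustrative and are not part of the benchmark specification:

```python
# Sketch of Graphalytics-style derived metrics for a single algorithm run.
# All timing values and graph sizes here are hypothetical.

def run_metrics(load_s, makespan_s, processing_s, n_vertices, n_edges):
    """Derive EPS, EVPS, and the system overhead from raw measurements."""
    eps = n_edges / processing_s                  # edges processed per second
    evps = (n_vertices + n_edges) / processing_s  # edges and vertices per second
    overhead_s = makespan_s - processing_s        # system-specific overhead
    return {"EPS": eps, "EVPS": evps, "overhead_s": overhead_s, "load_s": load_s}

m = run_metrics(load_s=12.0, makespan_s=8.0, processing_s=5.0,
                n_vertices=1_000_000, n_edges=10_000_000)
print(m["EPS"], m["EVPS"], m["overhead_s"])  # 2000000.0 2200000.0 3.0
```

The makespan/processing-time split is what separates the framework's overhead from the algorithm's actual work.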
So far, no auditing rules have been proposed, but their definition is on the benchmark's development road map.

gMark
gMark (Bagan et al. 2017; gMark 2016) is a synthetic graph instance and query workload generator, which is both domain- and query-language-independent. It allows the creation of domain-specific use cases by means of a highly configurable set of input parameters. Contrary to other synthetic benchmarks, gMark is neither a macro-benchmark nor a micro-benchmark, since it allows simulating both cases and all the variants in between. The gMark benchmark is also workload-centric, as it leverages the presence of multiple queries in the workload as opposed to single-query executions. It is system- and language-independent in that it allows generating queries for a broad variety of systems and query languages (among which SPARQL, openCypher, LogicQL, and SQL are included in the default package). It is extensible to upcoming systems and graph query languages.

The graph configuration consists of the size of the graph (number of nodes), the edge predicates, the node types, and the occurrence constraints (expressed as percentages of occurrence in the generated graph instance), along with the in-degree and out-degree distributions associated with each endpoint of an edge. Such distributions vary among uniform, Gaussian, and Zipfian and allow representing many realistic graph datasets. The query workload configuration allows specifying the workload size (in terms of number of queries), the minimum and maximum arity of the generated queries, the shape classes desired in the generation (out of chain, star, cycle, and star-chain queries), the selectivity of the queries (representing the number of results returned by the queries, which depends on the graph size and topology and can be chosen among constant, linear, and quadratic), the probability of multiplicity (in terms of the proportion of recursive queries in the generated query workload), and the query size (the maximum and minimum number of conjuncts, disjuncts, etc.).

The parameters in the query workload indirectly suggest the scoring metrics to adopt in order to compare the systems under test. For instance, one can decide to measure the coverage of recursive queries by the SUTs or their robustness with respect to quadratic selectivities or to the number of queries in the workload. Nevertheless, gMark does not impose a fixed set of scoring metrics to be adopted for all cross-comparisons among the SUTs.

Finally, the gMark benchmark does not define a set of auditing rules to measure the cost of the systems under test. However, since it has proved effective in reproducing existing benchmarks and providing expanded query workloads, such as WatDiv (Aluç et al. 2014) and LDBC SNB (Erling et al. 2015), one might adopt their respective auditing rules on the expanded query workloads.

Usage-Driven Query Analysis and Benchmarking
The rise of SPARQL endpoints has made it possible to collect query logs containing manually written queries as formulated by end users, as well as machine-generated queries issued through specific APIs. DBpedia 2010 logs have been examined in Morsey et al. (2011) in order to come up with a set of the 25 most frequently used queries, which can be considered prototypical queries in performance studies of triple stores in the so-called DBpedia SPARQL benchmark. The identification of representative user queries has recently been carried out for Wikidata datasets and has led to the definition of a collection of 309 Wikidata user example queries derived from real Wikidata queries (Wikidata 2018). These queries are intended to showcase system behavior and/or highlight characteristics of the Wikidata dataset and are thus meant to produce a nice output. Recent large SPARQL query logs have lately been analyzed in Bonifati et al. (2017), encompassing DBpedia, BioPortal, SWDF, British Museum, LinkedGeoData, and OpenBioMed queries. These logs are inspected in order to understand the structural characteristics of the queries, their syntactic features, their disparate shapes, and their graph and hypergraph classification, so as to understand their complexity. Moreover, such true query logs contain the "development process" of queries: users start by formulating a query and gradually refining it until they are satisfied with it. Query logs can thus reveal streaks of the same evolving query, which correspond to gradual modifications of a seed query.

The above works show that much work is left to be done in order to harvest the characteristics of real user queries and feed future benchmarking initiatives devoted to targeting the queries that users are coveting. They also highlight the importance of benchmarks that are self-adaptive to evolving users' needs.

Future Directions of Research

Despite the fact that there has been a considerable effort from both industry and academia to create comprehensive and equitable graph benchmarks, their adoption is still far from being achieved. The graph processing industry is not yet mature for this adoption, and the number of industrial actors involved in the definition of these benchmarks is still small. Even though this number has lately been increasing, a paradigm shift is needed in order for the involved companies to fiercely compete for the best benchmark scores. The opposite occurs in academia, where graph processing is a hot topic and many benchmarks have been proposed, as discussed in this chapter.

Implementing an industry-grade benchmark is usually a time-consuming task that requires considerable effort, not only in implementing the necessary queries but also in setting up the execution environment properly. Thus, the evolution of graph benchmarking must put a special emphasis on lowering the entry bar as much as possible, for instance by providing the required tools and software to ease adoption (such as workload drivers, reference query implementations, dataset repositories, etc.). A non-negligible issue is, for instance, tied to the comparison of results and their reproducibility in a benchmarking environment. Another matter concerns the definition of a standard graph query language, which would certainly help the design of valid and widely recognized benchmarks.

Another aspect is the generation of data for benchmarking purposes. The performance of graph queries is heavily affected by the characteristics of the underlying data, both structurally and in terms of attributes, which influence the size of the search space and the memory access patterns, among others. Properly understanding the relationship between different data characteristics and the performance of graph queries is one of the goals of benchmarking, and data generators that help tune the characteristics of such data are of high importance. Moreover, the creation of tools to assist in the exploration and identification of the performance bottlenecks of graph databases and their relationship with the input data is a research direction worth investigating in the near future.

Cross-References
Graph Data Integration and Exchange
Graph Data Models
Graph Path Navigation
Graph Pattern Matching
Graph Query Languages

References
Aluç G, Hartig O, Özsu MT, Daudjee K (2014) Diversified stress testing of RDF data management systems. In: The semantic web – ISWC 2014 – Proceedings of the 13th international semantic web conference, Part I, Riva del Garda, 19–23 Oct 2014, pp 197–212
Armstrong TG, Ponnekanti V, Borthakur D, Callaghan M (2013) LinkBench: a database benchmark based on the Facebook social graph. In: SIGMOD. ACM, pp 1185–1196
Backstrom L, Boldi P, Rosa M, Ugander J, Vigna S (2012) Four degrees of separation. In: WebSci. ACM, pp 33–42
Bagan G, Bonifati A, Ciucanu R, Fletcher GH, Lemay A, Advokaat N (2017) gMark: schema-driven generation of graphs and queries. IEEE TKDE 29(4):856–869
Bizer C, Schultz A (2009) The Berlin SPARQL benchmark. Int J Semant Web Inf Syst 5(2):1–24
Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang DU (2006) Complex networks: structure and dynamics. Phys Rep 424(4):175–308
For instance, properties like "the graph is connected," "the graph is planar," and so on are all examples of binary graph invariants. Scalar invariants have values in a field (e.g., $V = \mathbb{R}$); for instance, "the number of connected components," "the average length of a shortest path," and "the maximum size of a connected component" are all examples of scalar graph invariants. Distribution invariants take as value a distribution; for instance, "the degree distribution" or the "shortest-path-length distribution" are examples of distribution invariants.

It is convenient to extend the definition of graph invariants to the case where the output is in fact a function assigning a value to each node. Formally, a $V$-valued node-level graph invariant is a function $\varphi$ that maps every $G \in \mathcal{G}$ to an element of $V^{N_G}$ (i.e., to a function from $N_G$ to $V$), such that for every isomorphism $f \colon G \to H$ one has $\varphi(G)(x) = \varphi(H)(f(x))$.

For example, suppose that $\varphi$ assigns to every graph $G$ the function that maps each node $x \in N_G$ to the number of triangles (closed cycles of length three) involving $x$; if $G \cong H$ and $x' \in N_H$ is the node of $H$ that corresponds to $x \in N_G$, the number of triangles of $H$ involving $x'$ is the same as the number of triangles of $G$ involving $x$, that is, $\varphi(G)(x) = \varphi(H)(x')$.

Introducing node-level graph invariants is crucial in order to be able to talk about graph centralities (Newman 2010). A node-level real-valued graph invariant that aims at estimating node importance is called a centrality (index or measure); given a centrality index $c \colon N_G \to \mathbb{R}$, we interpret $c(x) > c(y)$ as a sign of the fact that $x$ is "more important" (more central) than $y$. Since the notion of being "important" can be declined to have different meanings in different contexts, many indices were proposed in the literature (Boldi and Vigna 2014); some of the most important ones will be mentioned in this section.

A notable, simple example of centrality in an undirected graph is simply node degree; in a friendship network, degree centrality amounts to measuring the importance of a node (person) by the number of friends (s)he has. As simple as it may seem, degree contains a lot of information, and in some contexts it is actually almost as good as other more sophisticated alternatives (Craswell et al. 2003). We take this opportunity to notice that most centrality measures proposed in the literature were actually described originally only for undirected graphs; adapting those measures to the directed case often implies choosing between looking at incoming or outgoing paths: in the case of degree, for example, we should choose whether to consider in-degrees or out-degrees. In most cases, the so-called "negative" version (the one that considers incoming paths) is more meaningful than the "positive" alternative (Boldi and Vigna 2014). In-degree, for instance, is probably the oldest measure of importance ever used, as it is equivalent to majority voting in elections (where the presence of an arc $\langle x, y \rangle$ means that $x$ voted for $y$).

Local Properties
The average degree of a graph, $m/n$, and its density, $m/n^2$ ($e/\binom{n}{2}$ in the case of undirected graphs), determine how close the graph is to being a clique. A family of graphs is said to be sparse if their average degree is $o(n)$, dense if it is $\Theta(n)$.

A more fine-grained invariant measuring the density of a graph is given by its full (in/out-)degree distribution, a discrete distribution whose probability mass function evaluated at $k$ is the fraction of nodes with (in/out-)degree $k$. The degree distribution is quite distinctive and contains by itself much information about the structure of the graph. The degrees of a random graph (in the sense of Erdős and Rényi 1959) follow a binomial distribution. It was noted, however, that many networks corresponding to empirical phenomena have a peculiarly different behavior and were called scale-free: a graph is scale-free (Barabási and Albert 1999) if its degree distribution follows a power law, that is, if the cumulative function $F(\cdot)$ of its degree distribution has the property that

$$1 - F(x) \propto x^{-\gamma}$$

for some constant $\gamma > 1$.
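The degree distribution just described is easy to compute empirically. The following sketch (over a toy graph chosen only for illustration) derives the probability mass function and the tail $1 - F(x)$ that appears in the scale-free definition:

```python
from collections import Counter

def degree_distribution(adj):
    """P(k) = fraction of nodes with degree k, from an adjacency-set representation."""
    n = len(adj)
    counts = Counter(len(neigh) for neigh in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

# Toy undirected star: node 0 is connected to nodes 1..4.
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
pmf = degree_distribution(adj)
print(pmf)  # {1: 0.8, 4: 0.2}

# Tail 1 - F(x): fraction of nodes with degree strictly greater than x.
ccdf = lambda x: sum(p for k, p in pmf.items() if k > x)
print(ccdf(1))  # 0.2
```

On a scale-free network, plotting `ccdf` on log-log axes would show an approximately straight line of slope $-\gamma$.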
Graph Invariants
The degree of a node tells us how many direct connections a node has but does not give any information about how much those neighbors are connected with one another. An informative index in this direction (a node-level graph invariant itself) is the clustering coefficient (Watts and Strogatz 1998) of node $x$, defined for undirected graphs by

$$\frac{\sum_{y \in N(x)} |N(y) \cap N(x)|}{d(x)\,(d(x) - 1)}.$$

This is the probability that two neighbors of $x$ chosen at random turn out to be connected by an edge: its extreme values are 0 if $G[N(x)]$ is an independent set and 1 if $G[N(x)]$ is a clique. It is possible to extend this definition to the directed case, but we shall not discuss this extension here. Of course the distribution of clustering coefficients [...]

[...] are pairwise disjoint, so they define a partition of the node set whose parts are called connected components. In the case of undirected graphs, there can exist no edge between vertices belonging to different components; this is not true for directed graphs, but a similar property is true of cycles (there can be no cycles involving nodes in different components). An undirected connected acyclic graph (i.e., a connected forest) is called a tree.

The number of paths ending at a node $x$ can be used to estimate its importance in the network; for example, Katz's index (Katz 1953) of node $x$ is defined as

$$\sum_{t=0}^{k} \beta^t \left| G^t(-, x) \right|$$
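Reading $|G^t(-, x)|$ as the number of paths of length $t$ ending at $x$, the Katz sum above can be sketched in plain Python; the toy graph is illustrative only:

```python
def katz_index(adj, beta=0.1, k=10):
    """Truncated Katz index: sum_{t=0..k} beta^t * (#paths of length t ending at each node)."""
    n = len(adj)
    paths = [1.0] * n   # t = 0: one (empty) path ends at every node
    katz = [1.0] * n
    for t in range(1, k + 1):
        new = [0.0] * n
        for x in range(n):
            for y in adj[x]:        # extend every path ending at x with the arc (x, y)
                new[y] += paths[x]
        paths = new
        katz = [c + beta ** t * p for c, p in zip(katz, paths)]
    return katz

# Toy directed graph: two arcs pointing at node 2.
adj = {0: [2], 1: [2], 2: []}
print(katz_index(adj, beta=0.5))  # [1.0, 1.0, 2.0]
```

When $\beta$ is smaller than the reciprocal of the dominant eigenvalue, the sum converges as $k$ grows, which is what makes the spectral characterization of Katz's index discussed below possible.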
A smoother alternative is closeness centrality, which corresponds, up to constants, to (the reciprocal of the) average distance. More precisely, the closeness centrality (Bavelas 1948) of $x$ is defined as

$$\frac{1}{\sum_y d(y, x)}.$$

The summation ranges over the nodes $y$ for which $d(y, x)$ is finite (otherwise, in non-strongly connected graphs, almost all nodes would have null centrality). An alternative definition, which works better in the non-strongly connected case, is harmonic centrality (Harris 1954; Boldi and Vigna 2014), defined for a node $x$ as

$$\sum_{y \neq x} \frac{1}{d(y, x)}$$

where this time the summation is over all $y \neq x$, with the proviso that $1/{+\infty} = 0$.

Eccentricity, closeness, and harmonic centrality may all be seen as variants of the same idea of measuring the centrality of a node $x$ through the set of distances $d(-, x)$: such centralities are called geometric (Boldi and Vigna 2014). A quite popular centrality index, which is also related to shortest paths, but not to distances in the strict sense, is betweenness (Anthonisse 1971; Freeman 1977); the betweenness centrality of a node $x$ is defined as

$$\sum_{u,v} \frac{\sigma_{u,v}(x)}{\sigma_{u,v}}$$

where $\sigma_{u,v}$ is the number of shortest paths between $u$ and $v$, and $\sigma_{u,v}(x)$ is the number of those passing through $x$.

Many further invariants are based on the eigenvalues and eigenvectors of the adjacency matrix of the graph or of some matrices derived from it. There is by now an immense body of literature showing the deep and unexpected connections between spectral properties and topological structure, in some cases leading to interesting node-level invariants.

The first and most obvious spectral centrality (called the Wei-Berge index) is the left dominant eigenvector of the adjacency matrix $G$ itself (Berge 1958). But other centrality measures are of spectral nature; for instance, it turns out that Katz's index is the left dominant eigenvector of a perturbed version of the adjacency matrix:

$$\beta G + (1 - \beta\lambda)\, e\, \mathbf{1}^T \qquad (1)$$

where $\lambda$ is the dominant eigenvalue of $G$ and $e$ is a right dominant eigenvector of $G$ such that $\mathbf{1}^T e = 1$ (Vigna 2016).

As another example, define $\bar{G}$ as the matrix obtained by dividing each nonnull row of $G$ by its $L_1$-norm and substituting null rows with $1/n$. The left dominant eigenvector of the matrix

$$\alpha \bar{G} + (1 - \alpha)\,(\mathbf{1}/n)\,\mathbf{1}^T \qquad (2)$$

is the well-known PageRank centrality index (Brin and Page 1998). Since $\bar{G}$ has dominant eigenvalue 1, the analogy with (1) should be clear. The popularity of PageRank is related to the fact that it was supposedly used by Google to rank websites in their search engine results.
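A minimal power-iteration sketch of the matrix in (2), with null rows handled by the uniform substitution described above; the graph and the parameter values are illustrative only:

```python
def pagerank(adj, alpha=0.85, iters=100):
    """Left dominant eigenvector of alpha*Gbar + (1-alpha)*(1/n)*ones, i.e., PageRank."""
    n = len(adj)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - alpha) / n] * n          # uniform (1-alpha)/n teleport term
        for x in range(n):
            out = adj[x]
            if out:                            # row of Gbar: out-arcs, each weighted 1/|out|
                share = alpha * rank[x] / len(out)
                for y in out:
                    new[y] += share
            else:                              # null row replaced by the uniform row 1/n
                for y in range(n):
                    new[y] += alpha * rank[x] / n
        rank = new
    return rank

# Toy graph: 0 -> 1, 1 -> 0, 2 -> 0 (node 2 only points outward).
adj = {0: [1], 1: [0], 2: [0]}
r = pagerank(adj)
print(round(sum(r), 6))     # 1.0 — the rank vector is a probability distribution
print(r[0] > r[1] > r[2])   # True
```

Each iteration multiplies the current vector by the matrix in (2) from the left, so the fixed point is precisely its left dominant eigenvector.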
A weighted graph associates a weight (a positive real number, usually) to each node or arc (edge, in the case of undirected graphs). Weighted graphs appear in a variety of situations, and typically one wants to take weights into account when dealing with a weighted graph. All the definitions and properties discussed above can be adapted to the weighted case. For example, the length of a path in an arc-weighted graph is usually defined as the sum of the weights of the arcs along the path; obviously, under this definition, shortest paths no longer have the minimum number of arcs, and the distance between two nodes is not an integer anymore, in general. In some cases, the modification needed to deal with the weighted case is straightforward; in other cases, some domain knowledge (of the meaning behind the weights) is required to decide how the definitions should be modified.

An uncertain graph (Jin et al. 2011) is a special type of weighted graph whose weights are interpreted as probabilities (and, therefore, are constrained to be values between 0 and 1). An uncertain graph U with n nodes may be interpreted as a compact way to define a distribution over the graphs of n nodes, in which only arcs of U may appear and every arc is drawn independently with a probability equal to its weight. Uncertain graphs are a natural generalization of Erdős-Rényi random graphs (Erdős and Rényi 1959), which are equivalent to uncertain cliques with constant weights.

A multigraph (Gross and Tucker 1987) is a graph whose arcs form a multiset rather than a set; in other words, there may be more than one arc connecting two nodes. Again, many of the definitions carry over to this case with almost no effort.

A k-hypergraph (Berge 1984) is an undirected graph whose edges (called "hyperedges") are not pairs, but rather sets of nodes of cardinality k.

References
Anthonisse JM (1971) The rush in a directed graph. Technical Report BN 9/71, Mathematical Centre, Amsterdam
Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
Bavelas A (1948) A mathematical model for group structures. Hum Organ 7:16–30
Berge C (1958) Théorie des graphes et ses applications. Dunod, Paris
Berge C (1984) Hypergraphs: combinatorics of finite sets, vol 45. Elsevier, North Holland
Boldi P, Vigna S (2014) Axioms for centrality. Internet Math 10(3–4):222–262
Bollobás B (1998) Modern graph theory. Graduate texts in mathematics. Springer, Heidelberg
Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 30(1):107–117
Chakrabarti S, Ester M, Fayyad U, Gehrke J, Han J, Morishita S, Piatetsky-Shapiro G, Wang W (2006) Data mining curriculum: a proposal (version 1.0). In: Intensive working group of ACM SIGKDD curriculum committee, p 140
Craswell N, Hawking D, Upstill T (2003) Predicting fame and fortune: PageRank or indegree? In: Proceedings of the Australasian document computing symposium, ADCS2003, pp 31–40
Cvetkovic D, Rowlinson P (2004) Spectral graph theory. In: Beineke LW, Wilson RJ, Cameron PJ (eds) Topics in algebraic graph theory, vol 102. Cambridge University Press, Cambridge, p 88
Diestel R (2012) Graph theory, 4th edn. Graduate texts in mathematics, vol 173. Springer, New York
Erdős P, Rényi A (1959) On random graphs, I. Publicationes Mathematicae (Debrecen) 6:290–297
Freeman LC (1977) A set of measures of centrality based on betweenness. Sociometry 40(1):35–41
Gross JL, Tucker TW (1987) Topological graph theory. Series in discrete mathematics and optimization. Wiley, New York
Harris CD (1954) The market as a factor in the localization of industry in the United States. Ann Assoc Am Geogr 44(4):315–348
Jin R, Liu L, Aggarwal CC (2011) Discovering highly reliable subgraphs in uncertain graphs. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 992–1000
Katz L (1953) A new status index derived from sociometric analysis. Psychometrika 18(1):39–43
Lovász L (2012) Large networks and graph limits. Colloquium publications, vol 60. American Mathematical Society, Providence
Newman M (2003) The structure and function of complex networks. SIAM Rev 45:167–256
Newman MEJ (2010) Networks: an introduction. Oxford University Press, Oxford/New York
Vigna S (2016) Spectral ranking. Netw Sci 4(4):433–445
Watts D, Strogatz S (1998) Collective dynamics of 'small-world' networks. Nature 393(6684):440–442
temporal graphs. A particular and novel variation of dynamic graphs is stream graphs, where edge (and possibly node) streams arrive in real time.

Graph analytics has been steadily gaining momentum in the data management community, since many real-world applications are naturally modeled as graphs, in particular in the domain of social network analysis. Given the extensive use of graphs to represent practical problems, multidimensional analysis of graph data and graph data warehouses are promising fields for research and application development. There is a need to perform graph analysis from different perspectives and at multiple granularities. This poses new challenges to the traditional OLAP technology and is generically called Graph OLAP. The remainder of this entry will address research on this topic.

Key Research Findings

This section starts with a review of graph summarization. Graph OLAP is then discussed. Finally, relevant contributions on analytics over dynamic graphs are presented.

Graph Summarization
Graph summarization is a critical operation for multilevel analysis of graph data and must be accounted for when addressing OLAP on graphs. However, early work on graph summarization was aimed at different goals, mainly trying to find a smaller representation of the input graph that can reveal patterns in the original data while preserving specific structural properties. This is why graph summarization is also called graph coarsening. Thus, graph summarization aims at (a) reducing the data volume by producing a graph summary of the original graph, (b) speeding up algorithms, (c) noise elimination, and (d) effective visualization. Graph summarization will be discussed below, although its goal is in some sense broader than that of Graph OLAP, which mainly refers to exploratory and interactive data analysis on graphs.

The work by Tian et al. (2008) introduces the SNAP operation (standing for summarization by grouping nodes on attributes and pairwise relationships) to produce a summary graph by grouping nodes based on node attributes and relationships selected by the user. The k-SNAP operation is proposed, allowing drilling down and rolling up at different aggregation levels. The work also presents an algorithm to evaluate the SNAP operation, shows that the k-SNAP computation is NP-complete, and proposes two heuristic methods to approximate the k-SNAP results. Nodes are aggregated based on the strength of the relationships between them.

OLAP on Graph Data
A first approach to performing OLAP on graphs is called Graph OLAP (Chen et al. 2009) and presents a multidimensional and multilevel view of graphs. To illustrate the idea, consider a set of authors working in a given field, say data warehouses. If two persons coauthored x papers in the DaWaK 2009 conference, a link with a collaboration frequency attribute x is added between them. For every conference in every year, there is a coauthor graph describing the collaboration patterns between researchers. Thus, each graph can be viewed as a snapshot of the overall collaboration network. These graphs can be aggregated in an OLAP style. For instance, graphs can be aggregated to obtain the collaborations by conference type and year for all pairs of authors. For this, nodes and edges in each snapshot graph must be aggregated according to the conference type (e.g., database conferences) and by year. For example, if there is a link between two authors in the SIGMOD and VLDB conferences, the nodes and the edges will be in the aggregated graph corresponding to the conference type "Databases". More complex patterns can be obtained, for example, by merging the authors belonging to the same institution, making it possible to obtain patterns of collaboration between researchers of the same institutions. In Graph OLAP, dimensions are classified as informational and topological. The former are close to traditional OLAP dimension hierarchies using information of the snapshot levels, for example,
Conference → Field → All. They can be used to aggregate and organize snapshots as explained above. Topological dimensions can be used for operating on nodes and edges within individual networks. For example, a hierarchy for authors like AuthorId → Institution belongs to a topological dimension, since author institutions do not define snapshots. These definitions yield two different kinds of Graph OLAP operations. A roll-up over an informational dimension overlays and joins snapshots (but does not change the objects), while a roll-up over a topological dimension merges nodes in a snapshot, modifying its structure.

Graph Cube (Zhao et al. 2011) is a model for graph data warehouses that supports OLAP queries on large multidimensional networks, accounting for both attribute aggregation and structure summarization of the networks. A multidimensional network consists of a collection of vertices, each of which contains a set of multidimensional attributes describing the node properties. For example, in a social network, nodes can represent persons, and multidimensional attributes may include UserID, Gender, City, etc. Thus, the multidimensional attributes in the graph vertices define the dimensions of the graph cube. Measures are aggregated graphs summarized according to some criteria. As opposed to Graph OLAP, where the model consists of several snapshots, in Graph Cube there is only one large network, leading to a graph summarization problem. For example, consider a small social network with three nodes, two of them corresponding to male individuals in the network, while the other one corresponds to a female. A graph that summarizes the connections between genders will have two nodes, one labeled male and the other labeled female. The edges between them will be annotated with the number of connections of some kind. For instance, if in the original graph there were two connections between the two male persons (in both directions), the summarized graph will contain a cycle over the male node, annotated with a "2". If there was just one connection between a woman and a man, there will be an edge between the nodes male and female annotated with a "1".

Based on the idea of defining graphs at different levels of aggregation, denoted graph cuboids, Distributed Graph Cube (Denis et al. 2013) presents a distributed framework for graph cube computation and aggregation, implemented using Spark and Hadoop. This proposal addresses homogeneous graphs. Along the same lines, Pagrol (Wang et al. 2014) introduces a MapReduce framework for distributed OLAP analysis of homogeneous attributed graphs. Pagrol extends the model of Graph Cube by considering the attributes of the edges as dimensions. The idea behind these two proposals is similar: to efficiently compute all possible aggregations of a homogeneous graph.

To handle heterogeneous graphs, and also starting from the topological and informational dimensions introduced in Chen et al. (2009), Yin et al. (2012) proposed a data warehousing model based on the notion of entity dimensions. In this work, two basic operations are defined, namely, rotate and stretch, which allow defining different views of the same graph, turning entities into relationships and vice versa.

Gómez et al. (2017) propose a formal multidimensional data model for graph analysis, where graphs are node- and edge-labeled directed multi-hypergraphs, called graphoids, defined at several different levels of granularity, according to dimension hierarchies associated with them. Over this model, the authors define Graph OLAP operations. This model addresses heterogeneous graphs.

Dynamic Graph Analytics
The methods discussed above apply, basically, to static graphs. The work on dynamic graph analytics has mainly focused on graph mining rather than on graph summarization and temporal Graph OLAP. However, some works on graph summarization and temporal OLAP are briefly commented on below.

An early approach to dynamic graph summarization consisted in creating an aggregate graph that summarizes the input dynamic one, based on how recent and frequent the interactions were. This is called an approximation graph (Hill et al. 2006; Sharan and Neville 2008). The approximation graph is built by aggregating the node interactions across time, giving a weight to them, and this weight is higher when the interactions [...]

[...] efficiently over a summary rather than on the whole network. Summarization is also relevant for graph visualization since, normally, graphs [...]
are more recent and frequent. These solutions are too large to visualize in an effective way.
are used for mining and visualization. Another Pattern discovery is also a field of application
approach is based on compression techniques. of graph summarization and Graph OLAP.
This is the case of TimeCrunch (Shah et al. Patterns or outliers that are not immediately
2015), which aims at extracting meaningful pat- noticeable over the whole graph can become
terns from temporal data, in order to describe evident on certain summarizations along some
a dynamic graph. The method first generates dimensions.
static subgraph instances in each defined time
stamp, labels each instance using a dictionary
of static templates (e.g., star, clique, chain), and Future Directions of Research
G
puts them together to form temporal instances.
Then, it labels each structure using a dictionary There is a consensus in the database commu-
of temporal signatures (e.g., one-shot, ranged, nity in that novel models and techniques are
periodic). Finally, it selects for the summary the required for enabling multidimensional analysis
temporal structures that produces the shortest of big data (Cuzzocrea et al. 2013). From the
lossless decomposition (MDL) for describing the discussion in this entry, it follows that several
time-evolving graph. challenges arise: on the one hand, traditional
A temporal graph model has been presented data warehousing and OLAP operations on cubes
in Cattuto et al. (2013), where temporal data are clearly not sufficient to address the current
are organized in so-called frames, namely, the data analysis requirements; on the other hand,
finest unit of temporal aggregation. A frame is OLAP operations and models can expand the
associated with a time interval and allows to possibilities of graph analysis beyond the tra-
retrieve the status of the social network during ditional graph-based computation, like shortest-
such interval. A limitation of the model is that it path, centrality analysis and so on. Nevertheless,
does not allow registering changes in attributes from the discussion in the present entry, it is
of the nodes and that frame nodes become too clear that there is still no consensus on what
cluttered with edges. Graph OLAP means. Each proposal commented
A proposal for temporal OLAP on graphs was above presents a different model, with different
introduced by Campos et al. (2016). Here, a operators. Therefore, the main challenge for the
temporal model for graphs is presented, based on OLAP and databases communities is to agree in
Property Graphs, along with a query language, the model and operators for Graph OLAP.
which extends Cypher with temporal capabili- Many challenges in Graph OLAP occur
ties. in the field of dynamic graphs. As expressed
above, summarization techniques for these
kinds of graphs have not been studied as in
Examples of Applications the case of static graphs. The addition of the
time dimension introduces many problems, not
Graph summarization can be applied in several only theoretical but also practical. Analogously
different settings. On the technical side, for to the case of temporal relational databases,
improving graph query efficiency, for example, recording the history of graph data involves
some queries can be answered using a summary enormous amounts of data, which is also highly
graph instead of the whole one. This happens dependent on the time granularity that is chosen.
in social network analysis when issuing a Temporal query languages for graphs and
typical query to compute whether an edge exists query processing techniques must be developed.
between two nodes, which can be evaluated more Temporal graph summarization also opens many
research opportunities in the graph visualization field. Finally, stream processing for (dynamic) graphs is still in its infancy, and there is a fertile field of research there.

Cross-References

Graph Data Management Systems
Graph Data Models
Graph Pattern Matching
Graph Processing Frameworks
Graph Query Processing
Historical Graph Management
Visual Graph Querying

References

Campos A, Mozzino J, Vaisman AA (2016) Towards temporal graph databases. In: Proceedings of the 10th Alberto Mendelzon international workshop on foundations of data management. CEUR workshop proceedings, vol 1644. CEUR-WS.org
Cattuto C, Quaggiotto M, Panisson A, Averbuch A (2013) Time-varying social networks in a graph database. In: Proceedings of the 1st international workshop on graph data management experiences and systems, p 11
Chen C, Yan X, Zhu F, Han J, Yu P (2009) Graph OLAP: a multi-dimensional framework for graph data analysis. Knowl Inf Syst 21(1):41–63
Cuzzocrea A, Bellatreche L, Song IY (2013) Data warehousing and OLAP over big data: current challenges and future research directions. In: Proceedings of the 16th international workshop on data warehousing and OLAP. ACM, pp 67–70
Denis B, Ghrab A, Skhiri S (2013) A distributed approach for graph-oriented multidimensional analysis. In: Proceedings of the IEEE international conference on big data, pp 9–16
Gómez LI, Kuijpers B, Vaisman AA (2017) Performing OLAP over graph data: query language, implementation, and a case study. In: Proceedings of the international workshop on real-time business intelligence and analytics, pp 6:1–6:8
Harris S, Seaborne A (2012) SPARQL 1.1 query language. https://www.w3.org/TR/2013/REC-sparql11-query-20130321/
Hartig O (2014) Reconciliation of RDF* and property graphs. CoRR abs/1409.3288. http://arxiv.org/abs/1409.3288
Hayes P, Patel-Schneider PF (2014) RDF semantics. http://www.w3.org/TR/rdf-mt/
Hill S, Agarwal D, Bell R, Volinsky C (2006) Building an effective representation for dynamic networks. J Comput Graph Stat 15(3):1–25
Shah N, Koutra D, Zou T, Gallagher B, Faloutsos C (2015) TimeCrunch: interpretable dynamic graph summarization. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining, pp 1055–1064
Sharan U, Neville J (2008) Temporal-relational classifiers for prediction in evolving domains. In: Proceedings of the 8th IEEE international conference on data mining, pp 540–549
Tian Y, Hankins RA, Patel JM (2008) Efficient aggregation for graph summarization. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, pp 567–580
Wang Z, Fan Q, Wang H, Tan KL, Agrawal D, Abbadi AE (2014) Pagrol: parallel graph OLAP over large-scale attributed graphs. In: Proceedings of the IEEE 30th international conference on data engineering, pp 496–507
Yin M, Wu B, Zeng Z (2012) HMGraph OLAP: a novel framework for multi-dimensional heterogeneous network analysis. In: Proceedings of the 15th international workshop on data warehousing and OLAP. ACM, pp 137–144
Zhao P, Li X, Xin D, Han J (2011) Graph cube: on warehousing and OLAP multidimensional networks. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 853–864

Graph Parallel Computation API

Graph Processing Frameworks

Graph Partitioning: Formulations and Applications to Big Data

Christian Schulz¹ and Darren Strash²
¹University of Vienna, Vienna, Austria
²Department of Computer Science, Colgate University, Hamilton, NY, USA

Definitions

Given an input graph G = (V, E) and an integer k ≥ 2, the graph partitioning problem is to divide V into k disjoint blocks of vertices V_1, V_2, …, V_k, such that V_1 ∪ V_2 ∪ … ∪ V_k = V, while
Although the vast majority of techniques for graph partitioning are for graphs that fit into memory on a single machine, these techniques are sufficient for many big data problems. For example, to process large datasets in parallel, a graph representing the communication required for these computations, or the dependencies between computations, can be formed. While the data may not fit into memory, the graph may. For graphs that fit into memory on a single machine, the fastest techniques in practice include variants of the KL/FM local search algorithm (Fiduccia and Mattheyses 1982), which efficiently swaps nodes between partitions and keeps the assignment with the lowest cut size. For balancing quality and speed, the multilevel approach (Hendrickson and Leland 1995; Sanders and Schulz 2011) contracts subgraphs of the graph in a coarsening phase, solves the smaller instance with a high-quality (but slow) partitioner, and uncontracts while applying local search. For high-quality partitions, methods using max-flow/min-cut are used (Sanders and Schulz 2011, 2013). Other techniques include evolutionary methods (Kim et al. 2011), diffusion (Meyerhenke and Sauerwald 2012), and spectral methods (Alpert et al. 1999). For an overview of in-memory techniques, see the survey by Buluç et al. (2016).

Though these techniques have an important place in graph partitioning, they are sparingly used to process massive datasets – those that do not fit into the working memory of a single machine.

Techniques for Massive Graphs

Recent research focuses on solving much larger instances, either by using external memory (i.e., the disk) or in a distributed setting. These methods use variants of the label propagation algorithm initially proposed by Raghavan et al. (2007) for graph clustering. Label propagation is used as a parallel local search algorithm (Ugander and Backstrom 2013; Aydin et al. 2016; Slota et al. 2017), and in some approaches it is additionally used to find clusterings that are contracted in a multilevel scheme (Meyerhenke et al. 2017; Akhremtsev et al. 2015). Using these techniques, complex networks with up to a trillion edges can be partitioned.

Objective Functions

Cut Size. As discussed above, the most common objective function is to minimize the total cut size Σ_{i<j} ω(E_{ij}). A related objective is to minimize the maximum cut size max_{i<j} ω(E_{ij}). Minimizing the cut size is useful in reducing the amount of computation required by combinatorial optimization problems, where the strategy is to subdivide the graph into blocks that are solved independently (possibly in parallel) and to combine these solutions optimally (using branch and bound along the boundary vertices) or approximately by stitching together solutions across the cut edges (Lamm et al. 2017).

Communication Volume. In parallel computing, the communication volume is often more representative than the cut (Hendrickson and Kolda 2000). Let V_ℓ be the block containing vertex v. Then we let D(v) denote the number of blocks in which vertex v has a neighbor, excluding the block containing v. That is, D(v) = |{ j : ∃ u ∈ V_j, V_j ≠ V_ℓ, s.t. {u, v} ∈ E }|. For a block V_i, the total communication volume is comm(V_i) := Σ_{v ∈ V_i} c(v)·D(v). Similar to cut size, one can minimize the maximum communication volume max_{1≤i≤k} comm(V_i) or the total communication volume Σ_{1≤i≤k} comm(V_i). Suppose that c maps a vertex to its data size and whole blocks of vertices are stored on a single machine; then the total communication volume represents the total amount of bandwidth needed to communicate data between adjacent blocks simultaneously, and the maximum communication volume represents the maximum bandwidth required for communicating from a single block. Choosing the total or maximum variant of the objective function here is typically dictated by the topology of the interconnection network (Hendrickson and Kolda 2000).

Other Objectives are less common; these include the maximum degree in the quotient graph (the graph formed by blocks and their connections), expansion and conductance (Lang and Rao 2004), and optimizing the "shape" of partitions (Meyerhenke and Sauerwald 2012).
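Both objective functions are straightforward to evaluate for a given partition; a minimal sketch is shown below (the toy graph, function names, and unit weights ω and vertex sizes c are all illustrative, not from any particular partitioning library):

```python
# Sketch: computing the cut-size and communication-volume objectives for a
# given partition. `adj` maps each vertex to its neighbors, `block[v]` is the
# block of v, `w` gives edge weights omega, and `c` gives vertex data sizes.

def total_cut(adj, block, w):
    """Total cut size: sum of the weights of edges crossing blocks."""
    cut = 0
    for u in adj:
        for v in adj[u]:
            if u < v and block[u] != block[v]:  # count each edge once
                cut += w[(u, v)]
    return cut

def comm_volume(adj, block, c):
    """Per-block communication volume comm(V_i) = sum over v in V_i of c(v)*D(v)."""
    comm = {}
    for v in adj:
        # D(v): number of distinct foreign blocks among v's neighbors
        foreign = {block[u] for u in adj[v]} - {block[v]}
        comm[block[v]] = comm.get(block[v], 0) + c[v] * len(foreign)
    return comm

# Tiny example: a path 0-1-2-3 split into blocks {0,1} and {2,3}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
block = {0: 0, 1: 0, 2: 1, 3: 1}
w = {(0, 1): 1, (1, 2): 1, (2, 3): 1}
c = {v: 1 for v in adj}

print(total_cut(adj, block, w))    # 1: only the edge {1, 2} is cut
print(comm_volume(adj, block, c))  # {0: 1, 1: 1}
```

The total and maximum communication volume are then just `sum(...)` and `max(...)` over the per-block values, mirroring the two variants of the objective discussed above.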
and communication in order to run on a parallel computer. In this model, vertices denote computation and edges represent communication. In large-scale simulations, models can contain billions of elements, and simulations involve more than 100k processors (Zhou et al. 2010). Graph partitioning is used to partition the model and then distribute the computation over a given number of processing elements while minimizing the communication overhead. For large-scale simulations, it can be helpful to choose groups of mesh nodes (Fietz et al. 2012). We very briefly outline the finite element method (leaving out technical details) and then relate it to graph partitioning.

In scientific simulations one often wants to compute an approximation to some partial differential equation that models a physical process. We use the heat equation to illustrate this. Let Ω be a region. The system that one wants to solve here is −Δu = f in Ω with u = 0 on ∂Ω, where Δ = Σ_i ∂_ii is the sum of the second-order derivatives. The problem asks for a function u in some function space V that solves the system. However, the solution space usually does not have finite dimension. Hence, one seeks an approximation to the system. One possibility to do so is to use weak formulations of the system: find u ∈ V such that a(u, v) = (f, v) := ∫_Ω f·v dx for all v ∈ V, where a(u, v) := ∫_Ω ∇u·∇v dx is a continuous bilinear map. In a second step, the solution space is approximated by a vector space of finite dimension, e.g., choosing V_h := span{φ_1, …, φ_n} ⊂ V for some φ_i. The weak formulation now becomes: find u_h ∈ V_h such that a(u_h, φ) = (f, φ) for all φ ∈ V_h. First, since V_h has finite dimension, there are α_i ∈ ℝ such that u_h = Σ_{i=1}^n α_i φ_i. Hence, one wants to find α_i such that a(Σ_{i=1}^n α_i φ_i, φ) = (f, φ) for all φ ∈ V_h, which is easily rewritten as Σ_{i=1}^n α_i a(φ_i, φ_j) = (f, φ_j) for all j ∈ {1, …, n}. This, however, means that one can solve a linear equation system Ax = b with a_{ij} = a(φ_i, φ_j) and b_i = (f, φ_i). Also note that due to the approximation that is done here, the output solution may differ from the real solution, i.e., there are errors involved. Roughly speaking, in practice one increases n to reduce the error, which is why the linear equation systems become very large. However, typically the arising equation systems are sparse and are solved using an iterative method such as the Jacobi method, which requires sparse matrix-vector multiplication as a core component. In essence, at each time step, one has an approximate solution x^k to the linear equation system and computes a new approximation x^{k+1} using the equation x^{k+1} = D^{−1}(b − R x^k), where D = diag(A) and R = A − D. The process is repeated until the residual error is small. One approach to parallelize the computation is to distribute the columns of A and x equally over the processors such that communication is minimized. To build our model of computation and communication, one rewrites the iterative formula as x_i^{k+1} = (1/a_{ii}) (b_i − Σ_{j≠i} a_{ij} x_j^k). It is now easy to define a graph/model G = ({1, …, n}, E) which has one node per column of the matrix and edges such that e = {i, j} ∈ E ⟺ a_{i,j} ≠ 0. Since the amount of computation for a node is proportional to its degree, one uses vertex degrees as vertex weights. Additionally, one associates x_i^k, x_i^{k+1}, and A_{i,·} with node i ∈ V. After partitioning the model, PE l obtains the columns associated with block V_l, stores x_i^k, x_i^{k+1}, A_{i,·}, and computes x_i^{k+1} for i ∈ V_l.

Decades ago, minimizing the edge cut was the way to go, since in meshes it is highly correlated with communication volume. For networks with complex structure, it is more important to optimize the total/maximum communication volume (Buluç and Madduri 2012).

Scalable Distributed Systems. Large companies that offer web services to users, like Amazon, Facebook, Google, or Twitter, face the problem of distributed system design. Assigning components in a distributed system has a significant impact on performance, as well as resource usage. A problem that often arises is to assign users to servers or data records to storage units (Tran et al. 2012). This is a prototypical big data application, as the underlying networks that need partitioning often contain billions of vertices and trillions of edges (Shalita et al. 2016). Shalita et al. (2016) disclose two concrete applications used at Facebook: HTTP request
routing optimization and storage sharding optimization.

For example, in HTTP request routing, HTTP requests have to be assigned to compute clusters. If an HTTP request arrives, any data required to serve the request has to be fetched from external storage servers and is cached in a local main-memory-based cache. Intuitively, it is clear that HTTP requests that request similar data should be assigned to the same compute cluster to use the main-memory-based caches efficiently. As friends and socially similar users tend to consume the same data, the static assignment is done by partitioning a graph where vertices represent users and edges represent friendships between them. Here, the edge cut is used as the optimization objective.

Large-Scale Biology Applications. There are numerous big data applications in biology that use graph partitioning techniques, for example, 3D cell nuclei segmentation and the partitioning of sequencing datasets for parallel genomics analyses.

Recent advances have enabled researchers to do in vivo investigation of biological structures. The instances can contain images with a total size of about 10 TB per embryo (Tomer et al. 2012). This creates demand for automated methods to process these images. Arz et al. (2017) use graph partitioning to recursively split the connected components of a binarized image. The graph model of a connected component is built by setting foreground voxels to be vertices and inserting edges between neighboring cells. Edge weights depend on the likelihood that vertices belong together.

Nowadays, high-throughput sequencing machines allow researchers to generate massive numbers of short genomic fragments (reads) per run. Sequencing has become an important technology for a wide range of applications. In many of these applications, mapping sequenced reads to their potential genomic origin is the first fundamental step for subsequent analyses. The large amounts of data generated by such machines present significant computational challenges for analysis. Jammula et al. (2017) present a parallel algorithm for partitioning large-scale read datasets in order to facilitate distributed-memory parallel analyses. Roughly speaking, a de Bruijn graph represents overlaps between the reads that have been computed by the sequencing machine. The algorithm works as follows: build a de Bruijn graph from the given set of input reads, compact it to reduce its size, partition the resulting graph using a standard parallel graph partitioning algorithm, and then generate a partition of the read dataset from the partition of the de Bruijn graph. This enables further analyses that run on the distributed-memory machine to run efficiently.

Cross-References

Graph Processing Frameworks
Large-Scale Graph Processing System
Parallel Graph Processing

References

Akhremtsev Y, Sanders P, Schulz C (2015) (Semi-)external algorithms for graph partitioning and clustering. In: Proceedings of the 17th workshop on algorithm engineering and experiments (ALENEX 2015). SIAM, pp 33–43
Alpert CJ, Kahng AB, Yao SZ (1999) Spectral partitioning with multiple eigenvectors. Discret Appl Math 90(1):3–26
Andreev K, Räcke H (2006) Balanced graph partitioning. Theory Comput Syst 39(6):929–939
Arz J, Sanders P, Stegmaier J, Mikut R (2017) 3D cell nuclei segmentation with balanced graph partitioning. CoRR abs/1702.05413
Aydin K, Bateni M, Mirrokni V (2016) Distributed balanced partitioning via linear embedding. In: Proceedings of the ninth ACM international conference on web search and data mining. ACM, pp 387–396
Bichot C, Siarry P (eds) (2011) Graph partitioning. Wiley, London
Bourse F, Lelarge M, Vojnovic M (2014) Balanced graph edge partition. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, KDD'14. ACM, pp 1456–1465
Buluç A, Madduri K (2012) Graph partitioning for scalable distributed graph computations. In: Proceedings of the 10th DIMACS implementation challenge, contemporary mathematics. AMS, pp 83–102
Buluç A, Meyerhenke H, Safro I, Sanders P, Schulz C (2016) Recent advances in graph partitioning. In:
Kliemann L, Sanders P (eds) Algorithm engineering. Springer, Cham, pp 117–158
Fiduccia CM, Mattheyses RM (1982) A linear-time heuristic for improving network partitions. In: Proceedings of the 19th conference on design automation, pp 175–181
Fietz J, Krause M, Schulz C, Sanders P, Heuveline V (2012) Optimized hybrid parallel lattice Boltzmann fluid flow simulations on complex geometries. In: Proceedings of Euro-Par 2012 parallel processing. LNCS, vol 7484. Springer, pp 818–829
Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) PowerGraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX symposium on operating systems design and implementation (OSDI 12). USENIX, pp 17–30
Hendrickson B, Kolda TG (2000) Graph partitioning models for parallel computing. Parallel Comput 26(12):1519–1534
Hendrickson B, Leland R (1995) A multilevel algorithm for partitioning graphs. In: Proceedings of the ACM/IEEE conference on supercomputing '95. ACM
Hyafil L, Rivest R (1973) Graph partitioning and constructing optimal decision trees are polynomial complete problems. Technical report 33, IRIA – Laboratoire de Recherche en Informatique et Automatique
Jammula N, Chockalingam SP, Aluru S (2017) Distributed memory partitioning of high-throughput sequencing datasets for enabling parallel genomics analyses. In: Proceedings of the 8th ACM international conference on bioinformatics, computational biology, and health informatics, ACM-BCB'17. ACM, pp 417–424
Karypis G, Aggarwal R, Kumar V, Shekhar S (1999) Multilevel hypergraph partitioning: applications in VLSI domain. IEEE Trans Very Large Scale Integr (VLSI) Syst 7(1):69–79
Kim J, Hwang I, Kim YH, Moon BR (2011) Genetic approaches for graph partitioning: a survey. In: Proceedings of the 13th genetic and evolutionary computation conference (GECCO). ACM, pp 473–480
Lamm S, Sanders P, Schulz C, Strash D, Werneck RF (2017) Finding near-optimal independent sets at scale. J Heuristics 23(4):207–229
Lang K, Rao S (2004) A flow-based method for improving the expansion or conductance of graph cuts. In: Proceedings of the 10th international integer programming and combinatorial optimization conference. LNCS, vol 3064. Springer, pp 383–400
Langguth J, Sourouri M, Lines GT, Baden SB, Cai X (2015) Scalable heterogeneous CPU-GPU computations for unstructured tetrahedral meshes. IEEE Micro 35(4):6–15
Li L, Geda R, Hayes AB, Chen Y, Chaudhari P, Zhang EZ, Szegedy M (2017) A simple yet effective balanced edge partition model for parallel computing. Proc ACM Meas Anal Comput Syst 1(1):14:1–14:21
McCune RR, Weninger T, Madey G (2015) Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput Surv 48(2):25:1–25:39
Meyerhenke H, Sauerwald T (2012) Beyond good partition shapes: an analysis of diffusive graph partitioning. Algorithmica 64(3):329–361
Meyerhenke H, Sanders P, Schulz C (2017) Parallel graph partitioning for complex networks. IEEE Trans Parallel Distrib Syst 28:2625–2638
Raghavan UN, Albert R, Kumara S (2007) Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E 76(3):036106
Rahimian F, Payberah AH, Girdzijauskas S, Haridi S (2014) Distributed vertex-cut partitioning. In: IFIP international conference on distributed applications and interoperable systems. Springer, pp 186–200
Sanders P, Schulz C (2011) Engineering multilevel graph partitioning algorithms. In: Proceedings of the 19th European symposium on algorithms. LNCS, vol 6942. Springer, pp 469–480
Sanders P, Schulz C (2013) High quality graph partitioning. In: Proceedings of the 10th DIMACS implementation challenge – graph clustering and graph partitioning. AMS, pp 1–17
Schloegel K, Karypis G, Kumar V (2003) Graph partitioning for high-performance scientific simulations. In: Dongarra J, Foster I, Fox G, Gropp W, Kennedy K, Torczon L, White A (eds) Sourcebook of parallel computing. Morgan Kaufmann Publishers, San Francisco, pp 491–541
Shalita A, Karrer B, Kabiljo I, Sharma A, Presta A, Adcock A, Kllapi H, Stumm M (2016) Social hash: an assignment framework for optimizing distributed systems operations on social networks. In: Argyraki KJ, Isaacs R (eds) 13th USENIX symposium on networked systems design and implementation, NSDI. USENIX Association, pp 455–468
Slota GM, Rajamanickam S, Devine K, Madduri K (2017) Partitioning trillion-edge graphs in minutes. In: Proceedings of the 31st IEEE international parallel and distributed processing symposium (IPDPS 2017), pp 646–655
Tomer R, Khairy K, Amat F, Keller PJ (2012) Quantitative high-speed imaging of entire developing embryos with simultaneous multiview light-sheet microscopy. Nat Methods 9(7):755–763
Tran DA, Nguyen K, Pham C (2012) S-clone: socially-aware data replication for social networks. Comput Netw 56(7):2001–2013
Ugander J, Backstrom L (2013) Balanced label propagation for partitioning massive graphs. In: Proceedings of the sixth ACM international conference on web search and data mining, WSDM'13. ACM, pp 507–516
Zhou M, Sahni O, Devine KD, Shephard MS, Jansen KE (2010) Controlling unstructured mesh partitions for massively parallel simulations. SIAM J Sci Comput 32(6):3201–3227
Graph Path Navigation
of variables that is disjoint with L. Then, given x, y ∈ V and ℓ ∈ L, the following is a basic query

  x →^ℓ y

that asks for the pairs of nodes in a graph database connected by an edge labeled ℓ. The semantics of such a query Q on a graph database G = (N, E), denoted by ⟦Q⟧_G, is the set of matchings h: {x, y} → N such that (h(x), ℓ, h(y)) ∈ E.

Example 2 Evaluating the basic query Q1 = x →^knows y over the graph database G in Fig. 1 produces as a result:

       x       y
  h1   Alice   George
  h2   George  Paul
  h3   Alice   Paul

[Fig. 1 (graph not reproduced here): a sample graph database whose nodes include Alice, George, and Paul, connected by edges with labels such as knows, friendOf, creatorOf, and type.]

overview of path queries and some of their extensions; for detailed surveys, the reader is referred to Barceló (2013) and Angles et al. (2017).

Regular Path Queries

A path p in a graph database G is a sequence e1, e2, …, en of edges of the form e_i = (a_i, ℓ_i, b_i) such that n ≥ 0 and b_j = a_{j+1} for every j ∈ {1, …, n − 1}. That is, each edge in a path starts in the node where the previous edge ends. If n = 0, we refer to p as the empty path. We write labels(p) for the string of labels formed by the sequence of edges in p, that is, labels(p) = ε if n = 0 and labels(p) = ℓ1 ℓ2 ⋯ ℓn otherwise. Finally, we say that a1 is the starting node of p and bn is the ending node of p.

  ans(u1, …, uk) :- ⋀_{i=1}^{n} v_i →^{r_i} w_i    (1)

where each v_i →^{r_i} w_i, for i ∈ {1, …, n}, is an RPQ that may include constants, and u1, …, uk is a sequence of pairwise distinct variables among those that occur as the v_i's and w_i's. □

The left-hand side of (1) is called the head of the query, and its right-hand side is called the body. Variables in the body of (1) need not be pairwise distinct; in fact, sharing variables allows joining results of the RPQs in the body.

  ans(x, y) :- x →^{creatorOf} z ∧ y →^{creatorOf} z ∧ z →^{type} book

Let g: {x, y} → N and h: {x, y, z} → N be matchings such that g(x) = John, g(y) = Paul, h(x) = John, h(y) = Paul, and h(z) = Databases. We have that g ∈ ⟦Q5⟧_G since g = h|_{x,y}, h|_{x,z} ∈ ⟦x →^{creatorOf} z⟧_G, h|_{y,z} ∈ ⟦y →^{creatorOf} z⟧_G, and h|_{z} ∈ ⟦z →^{type} book⟧_G. □
linked by a simple path whose label satisfies r. However, evaluation of RPQs under the simple path semantics becomes NP-complete even in data complexity (Mendelzon and Wood 1995).

While this casts doubt on its practical applicability, a closely related semantics was implemented in a practical graph DBMS, namely, Neo4j (Robinson et al. 2013). There, paths in graph patterns can only be matched by paths that do not have repeated edges. While still NP-complete in data complexity in the worst case, in practice this semantics could behave well, as worst-case scenarios do not frequently occur in real life.

Extensions

(C)RPQs have been extended with several features. One extension is a nesting operator that allows one to perform existential tests over the nodes of a path, similarly to what is done in the XML navigational language XPath (Pérez et al. 2010). Another extension is the ability to compare paths (Barceló et al. 2012). The idea is that paths are named in an RPQ, like x →^{p:r} y, and then the name p can be used in the query. Without defining them formally, we give an example:

  ans(x, y) :- x →^{p1:a} z ∧ z →^{p2:b} y ∧ equal_length(p1, p2)

says that there is a path p1 from x to z labeled with a's and a path p2 from z to y labeled with b's, and these paths have equal lengths (expressed by the predicate equal_length(p1, p2)). The equal-length predicate belongs to the well-known class of regular relations, but the above query says that there is a path from x to y whose label is of the form a^n b^n for some n. This, of course, is not a regular property, even though we only used regular languages and relations in the query. Such extended CRPQs still behave well: their data complexity remains in NLOGSPACE, and their combined complexity goes up to PSPACE, which is exactly the combined complexity of relational algebra. However, such additions must be handled with care: if we add common predicates on paths such as subsequence or subword, query evaluation becomes completely infeasible or even undecidable. We refer to Barceló et al. (2012) and Barceló (2013) for further discussions.

The last extension consists of adding mechanisms for checking regular properties over an extended data model in which each node carries data values from an infinite domain. This models graph databases as they occur in real life, namely, property graphs, where nodes and edges can carry tuples of key-value pairs. Extending RPQs and CRPQs requires choosing a proper analog of regular expressions. There are many proposals, but with few exceptions they lead to very high complexity of query evaluation. The one that maintains good complexity is given by register automata. Such extensions still have NLOGSPACE data complexity and PSPACE combined complexity and come with different syntactic ways of describing navigation that mixes the data and topology of the graph. We refer the reader to Libkin et al. (2016) for details and to the surveys by Barceló (2013) and Angles et al. (2017) for a thorough review of the literature on extensions of CRPQs.

Cross-References

Graph Data Management Systems
Graph Pattern Matching
Graph Query Languages
Graph Query Processing

References

Alkhateeb F, Baget J, Euzenat J (2009) Extending SPARQL with regular expression patterns (for querying RDF). J Web Sem 7(2):57–73
Angles R, Arenas M, Barceló P, Hogan A, Reutter JL, Vrgoc D (2017) Foundations of modern graph query languages. ACM Comput Surv 50(5)
Barceló P (2013) Querying graph databases. In: Proceedings of the 32nd ACM symposium on principles of database systems, PODS 2013, pp 175–188
Barceló P, Libkin L, Lin AW, Wood PT (2012) Expressive languages for path queries over graph-structured data. ACM Trans Database Syst 37(4):31
Barrett CL, Jacob R, Marathe MV (2000) Formal-language-constrained path problems. SIAM J Comput 30(3):809–837
Calvanese D, De Giacomo G, Lenzerini M, Vardi MY (2000) Containment of conjunctive regular path queries with inverse. In: Proceedings of the 7th international conference on principles of knowledge representation and reasoning, KR 2000, pp 176–185
Holland DA, Braun UJ, Maclean D, Muniswamy-Reddy KK, Seltzer MI (2008) Choosing a data model and query language for provenance. In: 2nd international provenance and annotation workshop (IPAW)
Libkin L, Martens W, Vrgoč D (2016) Querying graphs with data. J ACM 63(2):14
Mendelzon AO, Wood PT (1995) Finding regular simple paths in graph databases. SIAM J Comput 24(6):1235–1258
Paths TFP (2009) Use cases in property paths task force. http://www.w3.org/2009/sparql/wiki/TaskForce:PropertyPaths#Use_Cases
Pérez J, Arenas M, Gutierrez C (2010) nSPARQL: a navigational language for RDF. J Web Sem 8(4):255–270
Robinson I, Webber J, Eifrem E (2013) Graph databases, 1st edn. O'Reilly Media

Graph Pattern Matching

Yinghui Wu1 and Arijit Khan2
1Washington State University, Pullman, WA, USA
2Nanyang Technological University, Singapore, Singapore

Definitions

In the context of searching a set of small graphs (graph transactions), graph pattern matching finds all graph transactions that are similar to a query graph Q for a particular similarity measure. In the context of searching a single graph G, graph pattern matching is to find all the occurrences of a query graph Q in a given data graph G, specified by a matching function. The remainder of this entry discusses various aspects of graph pattern matching in a single graph.

Overview

This entry discusses the foundations of the graph pattern matching problem: formal definitions, query languages, and complexity bounds.

Data graph. A data graph G = (V, E, L) is a directed graph with a finite set of nodes V, a set of edges E ⊆ V × V, and a function L that assigns node and edge content. Given a query Q = (VQ, EQ, LQ), a node v ∈ V in G with content L(v) (resp. an edge e ∈ E with content L(e)) that satisfies the constraint LQ(u) of a node u ∈ VQ (resp. LQ(eu) of an edge eu ∈ EQ) is a candidate match of u (resp. eu). Common data graph models include RDF and property graphs (see chapter "Graph Data Models").
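The data graph and candidate-match definitions above can be illustrated with a small sketch. The Python encoding below (dictionaries for L, a set for E) and the simple label-equality constraint are illustrative assumptions, not part of the entry:

```python
# Illustrative sketch: a data graph G = (V, E, L) as plain dictionaries,
# and candidate-match computation for the nodes of a query Q.

def candidate_matches(G_labels, Q_labels):
    """For each query node u, collect the data nodes v whose content
    satisfies u's constraint (here: simple label equality)."""
    return {
        u: {v for v, lv in G_labels.items() if lv == lu}
        for u, lu in Q_labels.items()
    }

# Data graph: node -> label, plus an edge set E ⊆ V × V
# (the edge set is consulted by the matching function, not shown here).
G_labels = {1: "person", 2: "person", 3: "city"}
G_edges = {(1, 2), (1, 3), (2, 3)}

# Query graph: a "person" node u and a "city" node w.
Q_labels = {"u": "person", "w": "city"}

print(candidate_matches(G_labels, Q_labels))
# {'u': {1, 2}, 'w': {3}}
```

With these inputs, both person nodes are candidates for u, while only the city node is a candidate for w; a matching function then selects among candidates subject to the edge constraints.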
Subgraph pattern queries. Traditional graph pattern queries are defined by subgraph isomorphism (Ullmann 1976): (1) each node in Q has a unique label, and (2) the matching function F specifies a subgraph isomorphism, which is an injective function that enforces an edge-to-edge matching and label equality.

The subgraph isomorphism problem is NP-complete. Ullmann proposed the first practical backtracking algorithm for subgraph isomorphism search in graphs (Ullmann 1976). The algorithm enumerates exact matches by incrementing partial solutions and prunes partial solutions when it determines that they cannot be completed. A number of methods such as VF2 (Cordella et al. 2004), QuickSI (Shang et al. 2008), GraphQL (He and Singh 2008), GADDI (Zhang et al. 2009), and SPath (Zhao and Han 2010) have been proposed for graph pattern queries defined by subgraph isomorphism. These algorithms follow the backtracking principle of Ullmann (1976) and improve the efficiency by exploiting different join orders and various pruning techniques. Lee et al. conducted an experimental study (Lee et al. 2012) to compare these methods and experimentally verified that QuickSI, designed for handling small graphs, often outperforms its peers for both small and large graphs.

The major challenge for querying big graphs is twofold: (1) the query models have high complexity; for example, subgraph isomorphism is NP-hard; and (2) it is often hard for end users to write accurate graph pattern queries. Major research effort has been focused on two aspects: (a) developing computationally efficient query models and algorithms and (b) user-friendly graph search paradigms.

Key Research Findings

Computationally Efficient Query Models

Graph simulation. The class of simulation-based query models includes graph simulation and bounded simulation (Fan et al. 2010), strong simulation (Ma et al. 2011), and pattern queries with regular expressions (Fan et al. 2011). A graph simulation R ⊆ VQ × V relaxes subgraph isomorphism to an edge-to-edge matching relation. For any pair (u, v) ∈ R, u and v have the same label, and each child u′ of u must have at least one match v′ among the children of v, such that (u′, v′) ∈ R. Strong simulation (Ma et al. 2011) enforces simulation for both the parents and the children of a node in Q. Regular expressions are posed on edges in Q to find paths whose concatenated edge labels satisfy the regular expression (Fan et al. 2011). All these query models can be evaluated in low polynomial time.

Approximate pattern matching. Approximate graph matching techniques make use of cost functions to find approximate (top-k) matches with fast solutions. The answer quality is usually characterized by the aggregated proximity among matches determined by their neighborhood, which is more effective in coping with ambiguous queries with possible mismatches in data graphs. Traditional cost measures include graph edit distance, maximum common subgraph, number of mismatched edges, and various graph kernels. Index techniques and synopses are typically used to improve the efficiency of pattern matching in large data graphs (see chapter "Graph Indexing and Synopses").

Notable examples include PathBLAST (Kelley et al. 2004), SAGA (Tian et al. 2006), NetAlign (Liang et al. 2006), and IsoRank (Singh et al. 2008), which target bioinformatics applications. G-Ray (Tong et al. 2007) aims to maintain the shape of the query via effective path exploration. TALE (Tian and Patel 2008) uses edge misses to measure the quality of a match. NESS and NeMa (Khan et al. 2011, 2013) incorporate more flexibility in graph pattern matching by allowing an edge in the query graph to be matched with a path up to a certain number of hops in the target graph. Both methods identify the top-k matches. This is useful in searching knowledge graphs and social networks.

Making Big Graphs Small

When the query class is inherently expensive, it is usually nontrivial to reduce the complex-
ity. The second category of research aims to identify a small and relevant fraction GQ of a large-scale graph G that suffices to provide approximate matches for query Q. That is, the match Q(GQ) is the same as, or close to, its exact counterpart Q(G). Several strategies have been developed for graph pattern matching under this principle.

Making graph queries bounded. Bounded evaluability of graph pattern queries aims to develop algorithms with a provable computation cost that is determined by a small fraction of the graph. Notable examples include query-preserving graph compression (Fan et al. 2012a) and querying with bounded resources (Fan et al. 2014b; Cao et al. 2015). (1) Given a query class Q, query-preserving compression applies a fast once-for-all preprocessing to compress the data graph G to a small, query-able graph GQ, which guarantees that for any query instance from Q, Q(G) = Q(GQ). Compression schemes have been developed for graph pattern queries defined by reachability and graph simulation. (2) Querying with bounded resources adds an additional size bound on GQ. The method in Cao et al. (2015) makes use of a class of access constraints to estimate and fetch the amount of data needed to evaluate graph queries. When no such constraints exist, another method (Fan et al. 2014b) dynamically induces GQ on the fly to give approximate matches. Both guarantee to access a GQ of bounded size. (3) Graph views are also used to achieve bounded evaluability, where query processing only refers to a set of materialized views with bounded size (Fan et al. 2014a).

Incremental evaluation. When the data graph G bears changes ΔG, bounded evaluability is achieved by applying incremental pattern matching (Fan et al. 2013). While ΔG is typically small in practice, GQ is induced by combining ΔG and a small fraction of G (called the "affected area") that is necessarily visited to update Q(G) to its counterpart in the updated graph. An algorithm guarantees bounded computation if it incurs a cost determined by the neighbors of GQ (up to a certain hop) and Q only, independent of the size of G. Bounded incremental algorithms have been identified for several graph pattern queries including reachability and graph simulation (Fan et al. 2013).

Parallelizing sequential computation. When the data graph G is too large for a single processor, bounded evaluability can be achieved by distributing G to several processors. Each processor copes with a fragment GQ of G with a manageable size and bounded evaluability, guaranteed by partitioning, load balancing, and other optimization techniques. Existing models include bulk (synchronous and asynchronous) vertex-centric computation and the graph-centric model, which are designed to parallelize node- and graph-level computations, respectively (see chapter "Parallel Graph Processing").

Several characterizations of such models have been proposed. Notable examples include scale independence (Armbrust et al. 2009), which ensures that a parallel system enforces strict invariants on the cost of query execution as data size grows, and parallel scalability (Fan et al. 2014c, 2017), which ensures that the system has polynomial speed-up over sequential computation as more processors are used. A notable example is the set of techniques that aim to automatically parallelize sequential graph computation without recasting it (Fan et al. 2017). The declarative platform defines parallel computation with three core functions for any processor: local computation, incremental computation (upon receiving new messages), and assembling of partial answers from workers. Several parallel scalable algorithms have been developed under this principle for graph pattern queries including reachability and graph simulation (Fan et al. 2012b, 2014c).

User-Friendly Pattern Matching

A third category of research aims to develop algorithms that can better infer the search intent of graph queries, especially when end users have no prior knowledge of the underlying graph data. We review two approaches specialized for graph pattern matching.
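As a concrete illustration of the simulation-based models discussed in this section, graph simulation can be computed by a naive fixpoint that starts from all label-compatible pairs and repeatedly removes violating ones. The encoding below is an illustrative Python sketch, not code from the entry:

```python
# Illustrative sketch: naive fixpoint computation of graph simulation.
# Graphs are given as label dicts and child-adjacency dicts.

def graph_simulation(Q_labels, Q_children, G_labels, G_children):
    """Return the maximum simulation relation R ⊆ VQ × V."""
    # Start from all label-compatible (query node, data node) pairs.
    R = {(u, v) for u in Q_labels for v in G_labels
         if Q_labels[u] == G_labels[v]}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(R):
            # Every child u2 of u needs some child v2 of v with (u2, v2) in R.
            ok = all(any((u2, v2) in R for v2 in G_children.get(v, ()))
                     for u2 in Q_children.get(u, ()))
            if not ok:
                R.discard((u, v))
                changed = True
    return R

# Query: x(a) -> y(b).  Data graph: 1(a) -> 2(b), and 3(a) with no children.
Q_labels, Q_children = {"x": "a", "y": "b"}, {"x": ["y"]}
G_labels = {1: "a", 2: "b", 3: "a"}
G_children = {1: [2]}
print(sorted(graph_simulation(Q_labels, Q_children, G_labels, G_children)))
# [('x', 1), ('y', 2)]
```

The loop removes (x, 3) because node 3 has no child that can match the query child y; worklist-based refinements of this fixpoint run in low polynomial time, in contrast to the NP-hardness of subgraph isomorphism.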
References

Cao Y, Fan W, Huai J, Huang R (2015) Making pattern queries bounded in big graphs. In: ICDE, pp 161–172
Cordella LP, Foggia P, Sansone C, Vento M (2004) A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans Pattern Anal Mach Intell 26(10):1367–1372
Fan W, Li J, Ma S, Tang N, Wu Y, Wu Y (2010) Graph pattern matching: from intractable to polynomial time. PVLDB 3(1–2):264–275
Fan W, Li J, Ma S, Tang N, Wu Y (2011) Adding regular expressions to graph reachability and pattern queries. In: ICDE, pp 39–50
Fan W, Li J, Wang X, Wu Y (2012a) Query preserving graph compression. In: SIGMOD, pp 157–168
Fan W, Wang X, Wu Y (2012b) Performance guarantees for distributed reachability queries. PVLDB 5(11):1304–1316
Fan W, Wang X, Wu Y (2013) Incremental graph pattern matching. TODS 38(3):18:1–18:47
Fan W, Wang X, Wu Y (2014a) Answering graph pattern queries using views. In: ICDE
Fan W, Wang X, Wu Y (2014b) Querying big graphs within bounded resources. In: SIGMOD
Fan W, Wang X, Wu Y, Deng D (2014c) Distributed graph simulation: impossibility and possibility. PVLDB 7(12):1083–1094
Fan W, Xu J, Wu Y, Yu W, Jiang J, Zheng Z, Zhang B, Cao Y, Tian C (2017) Parallelizing sequential graph computations. In: SIGMOD
He H, Singh A (2008) Graphs-at-a-time: query language and access methods for graph databases. In: SIGMOD
He H, Wang H, Yang J, Yu PS (2007) BLINKS: ranked keyword searches on graphs. In: SIGMOD
Jayaram N, Khan A, Li C, Yan X, Elmasri R (2015) Querying knowledge graphs by example entity tuples. TKDE 27(10):2797–2811
Kargar M, An A (2011) Keyword search in graphs: finding r-cliques. PVLDB 4(10):681–692
Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T (2004) PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res 32:83–88
Khan A, Ranu S (2017) Big-graphs: querying, mining, and beyond. In: Handbook of big data technologies. Springer, Cham, pp 531–582
Khan A, Li N, Guan Z, Chakraborty S, Tao S (2011) Neighborhood based fast graph search in large networks. In: SIGMOD
Khan A, Wu Y, Aggarwal C, Yan X (2013) NeMa: fast graph search with label similarity. PVLDB 6(3):181–192
Lee J, Han WS, Kasperovics R, Lee JH (2012) An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB 6(2):133–144
Liang Z, Xu M, Teng M, Niu L (2006) NetAlign: a web-based tool for comparison of protein interaction networks. Bioinformatics 22(17):2175–2177
Ma S, Cao Y, Fan W, Huai J, Wo T (2011) Capturing topology in graph pattern matching. PVLDB 5(4):310–321
Mottin D, Lissandrini M, Velegrakis Y, Palpanas T (2014) Exemplar queries: give me an example of what you need. PVLDB 7(5):365–376
Shang H, Zhang Y, Lin X, Yu J (2008) Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB 1(1):364–375
Singh R, Xu J, Berger B (2008) Global alignment of multiple protein interaction networks with application to functional orthology detection. PNAS 105(35):12763–12768
Tian Y, Patel JM (2008) TALE: a tool for approximate large graph matching. In: ICDE
Tian Y, McEachin R, Santos C, States D, Patel J (2006) SAGA: a subgraph matching tool for biological graphs. Bioinformatics 23(2):232–239
Tong H, Faloutsos C, Gallagher B, Eliassi-Rad T (2007) Fast best-effort pattern matching in large attributed graphs. In: KDD
Ullmann JR (1976) An algorithm for subgraph isomorphism. J ACM 23:31–42
Zhang S, Li S, Yang J (2009) GADDI: distance index based subgraph matching in biological networks. In: EDBT
Zhao P, Han J (2010) On graph query optimization in large networks. In: VLDB
Zheng W, Cheng H, Zou L, Yu JX, Zhao K (2017) Natural language question/answering: let users talk with the knowledge graph. In: CIKM

Graph Processing Frameworks

Arturo Diaz-Perez1, Alberto Garcia-Robledo2, and Jose-Luis Gonzalez-Compean1
1Cinvestav Tamaulipas, Victoria, Mexico
2Universidad Politecnica de Victoria, Victoria, Mexico

Synonyms

Graph parallel computation API; Graph processing library; Large-scale graph processing system

Definitions

A graph processing framework (GPF) is a set of tools oriented to process graphs. Graph vertices are used to model data, and edges model
relationships between vertices. Typically, a GPF includes an input data stream, an execution model, and an application programming interface (API) having a set of functions implementing specific graph algorithms. In addition, some GPFs provide support for vertices and edges annotated with arbitrary properties of any kind and number.

Graphs are being used to model large phenomena and their relationships in different application domains such as social network analysis (SNA), biological network analysis, and link analysis for fraud/cybercrime detection. Since real graphs can be large, complex, and dynamic, GPFs have to deal with the three challenges of data growth: volume, velocity, and variety.

The programming API of a GPF is simple enough to facilitate efficient algorithm execution on graphs while abstracting away low-level details. On the one hand, the basic computation entity in a GPF can be graph-centric, vertex-centric, or edge-centric. On the other hand, as big data drives graph sizes beyond the memory capacity of a single machine, GPFs partition and distribute data among computing nodes implementing a communication model.

Overview

There is a growing number of GPFs. Doekemeijer and Varbanescu (2014) review 83 different frameworks. To briefly explain the most relevant aspects of GPFs, 11 proposals, shown in Table 1, have been selected in this entry. Four main dimensions can be used to classify GPF frameworks: (1) the target platform, (2) the computation model, (3) the graph processing approach, and (4) the communication model.

The target platform is one of the major distinctions between frameworks. Most GPFs are distributed-memory frameworks (DMFs), but there are some shared-memory frameworks (SMFs) that exploit the parallel capabilities of a single machine.

Many graph processing algorithms need to traverse the graph in some way, and they iterate until a global state is reached. Because of that, the computation model can be seen as a sequence of MapReduce invocations that pass the entire state of the graph from one stage to the next. DMFs like Surfer, Pegasus, and GBase are based on MapReduce.

The bulk-synchronous parallel (BSP) model was created to simplify the design of software for parallel systems. BSP is a two-step process: perform task computing on local data, then communicate the results. Both steps are performed iteratively and synchronously, repeating them for as many iterations as needed. Each compute/communication iteration is called a superstep, with synchronization of the parallel tasks occurring at the superstep barriers.

DMFs like Pregel, GraphX, Giraph, and GiraphX, and the hybrid framework Totem, are based on the BSP model of computation.
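The BSP superstep structure described above can be sketched in a few lines. The toy job, the message format, and the quiescence-based termination rule below are illustrative assumptions, not part of the entry:

```python
# Illustrative BSP sketch: each superstep performs local computation on
# every partition, then exchanges results; the synchronization barrier is
# implicit because messages become visible only in the next superstep.

def bsp_run(partitions, compute, max_supersteps=10):
    """partitions: dict part_id -> local state.  compute(state, inbox)
    returns (new_state, [(dest_partition, message), ...])."""
    inboxes = {p: [] for p in partitions}
    for _ in range(max_supersteps):
        outboxes = {p: [] for p in partitions}
        # Step 1: local computation on each partition.
        for p in partitions:
            partitions[p], msgs = compute(partitions[p], inboxes[p])
            for dest, m in msgs:
                outboxes[dest].append(m)
        # Step 2: communicate results across the superstep barrier.
        inboxes = outboxes
        if not any(outboxes.values()):   # global quiescence: terminate
            break
    return partitions

# Toy job: agree on the global maximum of the per-partition values.
def compute(state, inbox):
    new = max([state] + inbox)
    # Broadcast at the first superstep (empty inbox) or when the value
    # changed; otherwise stay silent so the job reaches quiescence.
    send = (new != state) or not inbox
    msgs = [(d, new) for d in (0, 1, 2)] if send else []
    return new, msgs

print(bsp_run({0: 5, 1: 9, 2: 1}, compute))   # {0: 9, 1: 9, 2: 9}
```

The destinations (0, 1, 2) are hard-coded for the three-partition example; a real framework would route messages through its partitioner.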
all associated neighbors are assigned to the same node.

Pregel handles the underlying organization and resource allocation of the graph data and splits computation into BSP-style supersteps. Supersteps always end with a synchronization barrier, which guarantees that messages sent in a given superstep are received at the beginning of the next superstep.

Vertices may change state between active and inactive, depending on the overall state of execution. Pregel terminates when all vertices halt and no more messages are exchanged.

GraphX is a distributed graph engine built on top of Spark, which is a MapReduce-like general data-parallel engine currently released as an Apache project (Xin et al. 2013). Spark's programming model consists of:

1. An in-memory data abstraction called resilient distributed datasets (RDDs), and
2. A set of deterministic parallel operations on RDDs that can be invoked using data primitives such as map, filter, join, and group by.

A Spark program is a directed acyclic graph (DAG) of Spark's primitives, which are applied to a set of initial or transformed RDDs.

GraphX extends Spark's RDD abstraction to introduce the resilient distributed graph (RDG), which associates records with vertices and edges and provides a collection of expressive computational graph primitives.

GraphX relies on a flexible vertex-cut partitioning to encode graphs as horizontally partitioned collections. A vertex-cut partitioning promotes evenly assigning edges to machines while letting vertices span multiple machines. A vertex-data table is used to represent each vertex partition; an edge table is used to represent each edge partition, i.e., those vertices having an edge crossing from one partition to another. A vertex map indexes all copies of each vertex.

Apache Flink is a stream processing framework developed by the Apache Software Foundation. Flink's runtime natively supports the execution of iterative algorithms. Gelly is Apache Flink's graph processing API and library. Gelly's programming model exploits Flink's delta iteration operators. Delta iterations enable Gelly to minimize the number of vertices that need to be recomputed during an iteration as the iterative algorithm proceeds. This can enhance performance when some vertices converge to their final value faster than others.

Gelly also supports the gather-sum-apply (GSA) model, with functions defined by the user and wrapped in map, reduce, and join operators. GSA also makes use of delta iteration operators.

Apache Giraph is an open-source project which uses Java, rather than C/C++, to implement the Pregel specification on top of the Hadoop framework infrastructure (Avery 2011). Giraph uses a vertex-centric programming abstraction adapting the BSP model. Giraph runs synchronous graph processing jobs as map-only jobs and uses Hadoop (HDFS) for storing graph data. Giraph employs Apache ZooKeeper for periodic checkpointing, failure recovery, and coordinating superstep execution. Giraph executes in memory, which can speed up job execution; however, it can fail when processing a workload beyond the physically available memory.

Single-Machine Graph Processing Frameworks

An important issue of TLAV frameworks is how data is shared between vertex programs. In shared memory, vertex data are shared variables that can be directly read or modified by other vertex programs.

Shared memory is often implemented by TLAV frameworks developed for a single machine. Instead of dealing with consistency when accessing remote vertices using a mutual exclusion protocol, access to vertices stored in shared memory can be serialized.

GiraphX is a synchronous shared-memory implementation following the Giraph model (Tasci and Demirbas 2013). Serialization of shared vertices is provided through locking of border vertices, without local cached copies.
steps with the Gelly framework API in the Scala language:

import org.apache.flink.api.scala._
import org.apache.flink.graph.scala._
import org.apache.flink.types.NullValue

object DegreeDistribution {
  def main(args: Array[String]) {
    // Set up the execution environment
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Step 1: load and distribute the graph data
    val graph = Graph.fromCsvReader[Long, NullValue, NullValue](
      pathEdges = "data/socLiveJournal1.csv",
      env = env)

    // Step 2: calculate the number of vertices
    val n = graph.numberOfVertices()

    // Step 3: calculate (vertex ID, degree) for all vertices
    val degrees = graph.getDegrees()

    // Steps 4 and 5: calculate the degree distribution
    val degreeDist = degrees
      .map { degreeTuple => degreeTuple._2 }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)
      .map { degreeCount =>
        (degreeCount._1, degreeCount._2 / n.toDouble)
      }

    // Step 6: collect and print out the result
    degreeDist.print()
  }
}

Step 1 is implemented with the Graph.fromCsvReader method, which enables Gelly to load and build a Gelly graph dataset from an edge-list CSV file of the LiveJournal graph, where the first field is the source vertex u, the second field is the target vertex v, and the third optional field is the edge value.

The numberOfVertices method is used to retrieve the number of vertices, implementing step 2. Likewise, the getDegrees method is used to compute the node degrees, implementing step 3. getDegrees builds a Flink dataset of vertex-degree pairs (u, ku) for all vertices.

Steps 4–6 are implemented in a MapReduce style by using Flink's dataset methods to map, group, aggregate, and collect data. To implement step 4, the degree field of the (u, ku) pairs dataset is separated to keep only the sequence of all vertex degrees k1, k2, ..., kn. Then, each degree k is paired with the value "1" through the map method to obtain a dataset of pairs (k, 1). Next, the (k, 1) pairs dataset is grouped by the degree field, summing up the value field to compute the degree frequencies with the groupBy and sum methods. This produces a new dataset of degree-frequency pairs (k, nk).

As for step 5, the frequency field of the (k, nk) pairs dataset is divided by the total number of vertices n, again through the map method, to obtain the final dataset of degree-probability pairs (k, P(k)), which represents the graph degree distribution. Step 6 collects and prints out the (k, P(k)) pairs dataset by making use of the print dataset method.

Figure 1 shows the log-log plot of the LiveJournal social network degree distribution obtained with the Gelly example. The close-to-linear plot shows that the graph degree distribution might follow a power law, suggesting that the LiveJournal graph topology encodes relevant properties of the underlying social phenomenon.

Figure 2 shows the results of an experimental evaluation of the degree distribution Gelly application. All experiments were executed on a server with 32 cores (Intel Xeon, 2.5 GHz) and 64 GB of RAM, on which CentOS 7, Java 8, Scala 2.1.8, SBT 1.0.4, and Flink 1.4.0 were installed. The vertical axis shows the speedup and millions of TEPS (traversed edges per second) achieved by Gelly when using different numbers of threads (horizontal axis). The performance of Gelly improves nonlinearly with the number of threads. The speedup values for 2, 8, 16, and 32 cores were 1.93, 5.86, 7.39, and 7.58, respectively.

Future Directions for Research

Because of the increasing number and variety of available GPFs, users now face the challenge of selecting an appropriate platform for their specific application and requirements. First, a user might search for efficient implementations of graph algorithms. However, there exists a large variety of graph processing algorithms which can
be categorized into several groups by functionality and consumption of resources.

Graph Processing Frameworks, Fig. 1 Plot of the LiveJournal social network degree distribution obtained with the Gelly GPF. The horizontal axis represents the degree k; the vertical axis represents the degree distribution P(k). Both axes are in logarithmic scale.

Efficient low-level implementation of such algorithms is subject to substantial implementation effort. Parallelizing graph algorithms with efficient performance is a very challenging task. Although TLAV is a simple and easy-to-use approach to build graph processing applications, its implementation often leads to a significant amount of overhead. Efficient computation of graph partitions remains a relevant topic in GPFs.

GPF performance is also a key aspect. Identifying the system parameters to be monitored and the metrics that can be used to characterize GPF performance might be a difficult task. The general performance aspects can be raw processing power, resource usage, scalability, and processing overheads. These aspects can be observed by monitoring traditional system parameters like the lifetime of each processing job, the CPU and network load, the OS memory consumption, and the disk I/O, to name some. Those observed parameters should be translated into relevant performance metrics such as vertices per second, edges per second, horizontal and vertical scalability, computation and communication time, and time overhead.

Datasets need to be considered as well. In general, designing a good benchmark is a challenging task due to the many aspects that should be considered, which can influence the adoption and the usage scenarios of the benchmark. Unfortunately, there is a clear lack of standard benchmarks that can be employed in GPF evaluation.

Finally, building GPFs for hybrid architectures is still under development. The main challenge in heterogeneous computing frameworks is to partition the graph in such a way that the resulting partitions are mapped to processing elements
according to their different architectural characteristics. In heterogeneous computing, the power of different processors varies, so the load assigned to each processor should not be the same. Graph partitioning for hybrid architectures should focus on minimizing computation time rather than minimizing communications.

Graph Processing Frameworks, Fig. 2 Speed-up and TEPS produced by Gelly using different numbers of threads for the LiveJournal social network degree distribution.

Cross-References

BSP Programming Model
Distributed File Systems
Large-Scale Graph Processing System
Parallel Graph Processing

References

Avery C (2011) Giraph: large-scale graph processing infrastructure on Hadoop. Proc Hadoop Summit Santa Clara 11(3):5–9
Chen R, Weng X, He B, Yang M (2010) Large graph processing in the cloud. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 1123–1126
Doekemeijer N, Varbanescu AL (2014) A survey of parallel graph processing frameworks. Delft University of Technology
Gharaibeh ENA, Costa LB, Ripeanu M (2013) Totem: accelerating graph processing on hybrid CPU+GPU systems. In: GPU technology conference
Kang U, Tong H, Sun J, Lin CY, Faloutsos C (2011a) GBase: a scalable and general graph management system. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1091–1099
Kang U, Tsourakakis CE, Faloutsos C (2011b) Pegasus: mining peta-scale graphs. Knowl Inf Syst 27(2):303–325
Kyrola A, Blelloch GE, Guestrin C (2012) GraphChi: large-scale graph computation on just a PC. In: USENIX symposium on OSDI
Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM (2012) Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc VLDB Endow 5(8):716–727
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system
Graph Query Languages 883
for large-scale graph processing. In: Proceedings of ture types of a data model, in any combination
the 2010 ACM SIGMOD international conference on desired.
management of data. ACM, pp 135–146
Tasci S, Demirbas M (2013) Giraphx: parallel yet
In the context of graph data management, a
serializable large-scale graph processing. In: Euro- graph query language (GQL) defines the way to
pean conference on parallel processing. Springer, pp retrieve or extract data which have been mod-
458–469 eled as a graph and whose structure is defined
Xin RS, Gonzalez JE, Franklin MJ, Stoica I (2013)
Graphx: a resilient distributed graph system on spark.
by a graph data model. Therefore, a GQL is
In: First international workshop on graph data manage- designed to support specific graph operations,
ment experiences and systems. ACM, p 2 such as graph pattern matching and shortest path
finding.
G
Graph Processing Library Overview
Graph Processing Frameworks
Research on graph query languages has at least
30 years of history. Several GQLs were proposed
during the 1980s, most of them oriented to study
and define the theoretical foundations of the area.
During the 1990s, the research on GQLs was
Graph Properties overshadowed by the appearance of XML, which
was arguably seen as the main alternative to
Graph Invariants relational databases for scenarios involving semi-
structured data. With the beginning of the new
millennium, however, the area of GQLs gained
new momentum, driven by the emergence of ap-
plications where cyclical relationships naturally
Graph Query Languages occur, including social networks, the Semantic
Web, and so forth. Currently GQLs are a funda-
Renzo Angles1 , Juan Reutter2 , and mental tool that supports the manipulation and
Hannes Voigt3 analysis of complex big data graphs. Detailed
1
Universidad de Talca, Talca, Chile survey of graph query language research can be
2
Pontificia Universidad Catolica de Chile, found in Angles et al. (2017b).
Santiago, Chile
3
Dresden Database Systems Group, Technische
Universität Dresden, Dresden, Germany
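Both operations named above can be stated over the simplest data model used in this entry, a set of (source, label, target) edge triples. As a minimal illustrative sketch (the node names and data are made up, and real GQL engines compile such queries into optimized plans rather than running a direct BFS), shortest-path finding over labeled edges can be written as:

```python
from collections import deque

# Hypothetical toy graph database: a set of (source, label, target) triples.
EDGES = {
    ("ana", "friend", "bob"), ("bob", "friend", "ana"),
    ("bob", "friend", "eva"), ("eva", "friend", "dan"),
}

def shortest_path(edges, source, target, label):
    """Return one shortest path of label-edges from source to target, or None."""
    parents = {source: None}        # node -> predecessor on a shortest path
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:             # reconstruct the path by walking parents back
            path = []
            while v is not None:
                path.append(v)
                v = parents[v]
            return path[::-1]
        for (s, a, t) in edges:     # expand only edges carrying the requested label
            if s == v and a == label and t not in parents:
                parents[t] = v
                queue.append(t)
    return None

print(shortest_path(EDGES, "ana", "dan", "friend"))  # prints ['ana', 'bob', 'eva', 'dan']
```

Scanning the whole edge set per step is the simplest possible formulation; an adjacency index would replace the inner loop in any serious implementation.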
Key Research Findings
Since our focus is on the fundamentals of GQLs, we concentrate on the simplest data model for graphs, i.e., directed labeled graphs. Later on, we review how these fundamentals are translated into practical query languages in more complex graph data models.

Let Σ be a finite alphabet. A graph database G over Σ is a pair (V, E), where V is a finite set of nodes and E ⊆ V × Σ × V is a set of edges. That is, we view each edge as a triple (v, a, v′) ∈ V × Σ × V, whose interpretation is an a-labeled edge from v to v′ in G.

Graph Pattern Matching
The idea of patterns is to define small subgraphs of interest to look for in an input graph. For example, in a social network one can match the following pattern to look for a clique of three individuals that are all friends with each other:

    [Pattern: nodes x, y, and z, pairwise connected by friend edges in both directions]

Let 𝒱 = {x, y, z, …} be an infinite set of variables. In its simplest form, a graph pattern is simply a graph P = (V_P, E_P) where all the nodes in V_P are replaced by variables. For example, the pattern above can be understood as a graph over the alphabet Σ = {friend} whose set of nodes is {x, y, z} and whose edges are {(x, friend, y), (y, friend, x), (x, friend, z), (z, friend, x), (y, friend, z), (z, friend, y)}.

The semantics of patterns is given using the notion of a match. A match of a pattern P = (V_P, E_P) over a graph G = (V_G, E_G) is a mapping μ from variables to constants such that the graph resulting from replacing each variable with the corresponding constant in the pattern is actually a subgraph of G, i.e., for every pattern node x ∈ V_P, we have that μ(x) ∈ V_G, and for each edge (x, a, y) ∈ E_P, we have that (μ(x), a, μ(y)) belongs to E_G.

The problem of pattern matching has been extensively studied in the literature, and several other notions of matching have been considered. The basic notion defined above is known as a homomorphism and constitutes the most common semantics. Additionally, one could also consider simple path semantics, which does not permit repeating nodes and is used in, e.g., the languages G (Cruz et al. 1987b) and G+ (Cruz et al. 1989); injective homomorphisms, which disallow two variables to be mapped to the same element; or semantics based on graph simulations (see, e.g., Miller et al. 2015 for a survey).

Although the set of matches from P to G can already be seen as the set of answers for the evaluation of P over G, one is normally interested in tuples of nodes: if P uses variables x1, …, xn, then we define the result of evaluating P over G, which we denote by P(G), as the set of tuples {(μ(x1), …, μ(xn)) | μ is a match from P to G}.

Extensions. Graph patterns can be extended by adding numerous features. For starters, one could allow variables in the edge labels of patterns or allow a mixture between constants and variables in nodes or edges. In this case, mappings must be the identity on constant values, and if an edge (x, y, z) occurs in a pattern, then μ(y) must be a label in Σ, so that (μ(x), μ(y), μ(z)) is an edge in the graph upon which we are matching. Patterns can also be augmented with projection, and since the answer is a set of tuples, we can also define set operations (union, intersection, difference, etc.) over them. Likewise, one can define filters that force the matches to satisfy certain Boolean conditions such as x ≠ y. Another operator is the left outer join, also known as the optional operator. This operator allows querying incomplete information (an important characteristic of graph databases).

Graph Patterns as Relational Database Queries. Graph patterns can also be defined as conjunctive queries over the relational representation of graph databases (see, e.g., Abiteboul et al. 1995 for a good introduction to relational database theory). In order to do this, given an alphabet Σ, we define σ(Σ) as the relational schema that consists of one binary predicate symbol Ea for each symbol a ∈ Σ. For readability purposes, we identify Ea with a,
for each symbol a ∈ Σ. Each graph database G = (V, E) over Σ can be represented as a relational instance D(G) over the schema σ(Σ): the database D(G) consists of all facts of the form Ea(v, v′) such that (v, a, v′) is an edge in G (for this we assume that D includes all the nodes in V). It is then not difficult to see that graph patterns are actually conjunctive queries over the relational representation of graphs, that is, formulas of the form Q(x̄) = ∃ȳ φ(x̄, ȳ), where x̄ and ȳ are tuples of variables and φ(x̄, ȳ) is a conjunction of relational atoms from σ(Σ) that use variables from x̄ and ȳ. For example, the pattern shown above can be written as the relational query Q(x, y, z) = friend(x, y) ∧ friend(y, x) ∧ friend(x, z) ∧ friend(z, x) ∧ friend(y, z) ∧ friend(z, y).

The correspondence between patterns and conjunctive queries allows us to reuse the broad corpus of literature on conjunctive queries in our graph scenario. We also use this correspondence when defining more complex classes of graph patterns.

Path Queries
The idea of a path query is to select pairs of nodes in a graph whose relationship is given by a path (of arbitrary length) satisfying certain properties. For example, one could use a path query in a social network to extract all nodes that can be reached from an individual via friend-of-a-friend paths of arbitrary length.

The simplest navigational querying mechanism for graph databases is the regular path query (RPQ) (Abiteboul et al. 1999; Cruz et al. 1987a; Calvanese et al. 2002), which allows navigation to be specified by means of regular expressions. Formally, an RPQ Q over Σ is a regular language L ⊆ Σ*, and it is specified using a regular expression R. Its semantics is defined in terms of paths. A path π from v0 to vm in a graph G = (V, E) is a sequence (v0, a0, v1), (v1, a1, v2), …, (v_{m−1}, a_{m−1}, v_m), for some m ≥ 0, where each (v_i, a_i, v_{i+1}), for i < m, is an edge in E. In particular, all the v_i's are nodes in V and all the a_j's are letters in Σ. The label of π is the word a0 ⋯ a_{m−1} ∈ Σ*.

The idea behind regular path queries is to select all pairs of nodes connected by a path whose label belongs to the language of the defining regular expression. That is, given a graph database G = (V, E) and an RPQ R, the evaluation of R over G, denoted R(G), is the set of all pairs (v, v′) ∈ V × V such that there is a path π between v and v′ and the label of π is a word in the language of R.

More Complex Path Queries. RPQs have been extended in a myriad of ways. For example, there are two-way regular path queries (2RPQs) (Calvanese et al. 2000), which add to RPQs the ability to traverse edges backward via inverse symbols a⁻; nested regular path queries (Barceló et al. 2012b), which further extend 2RPQs with node tests similar to what is done in propositional dynamic logic or XPath expressions; and several proposals adding variables to path queries (Wood 1990; Santini 2012; Fionda et al. 2015). There are also proposals of more expressive forms of path queries, using, for example, context-free languages (Hellings 2014), and several formalisms aimed at comparing pieces of data that may be associated with the nodes in a graph (see, e.g., Libkin and Vrgoč 2012). Another type of path query, called property paths, extends RPQs with a mild form of negation (Kostylev et al. 2015).

Adding Path Queries to Patterns. Path queries and patterns can be naturally combined to form what is known as conjunctive regular path queries, or CRPQs (Florescu et al. 1998; Calvanese et al. 2000). Just as with graph patterns and conjunctive queries, we can define CRPQs in two ways. From a pattern perspective, CRPQs amount to replacing some of the edges in a graph pattern by a regular expression, such as in the following CRPQ, which looks for two paths with an indirect friend n in common:

    x -friend⁺-> n <-friend⁺- y

However, a CRPQ over an alphabet Σ can also be seen as a conjunctive query over a relational representation of graphs that includes a predicate for each regular expression constructed from Σ.
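The homomorphism-based match semantics and the RPQ semantics defined above can both be sketched with a naive evaluator. This is an illustrative toy over a hypothetical triple set: it enumerates all variable assignments and computes transitive closure by brute force, which no real engine would do.

```python
from itertools import product

# Hypothetical toy graph: a triangle a, b, c of mutual friends, plus c -> d.
EDGES = {("a", "friend", "b"), ("b", "friend", "a"),
         ("a", "friend", "c"), ("c", "friend", "a"),
         ("b", "friend", "c"), ("c", "friend", "b"),
         ("c", "friend", "d")}
NODES = {v for e in EDGES for v in (e[0], e[2])}

# The triangle pattern from the text: x, y, z pairwise connected by friend edges.
PATTERN = [("x", "friend", "y"), ("y", "friend", "x"), ("x", "friend", "z"),
           ("z", "friend", "x"), ("y", "friend", "z"), ("z", "friend", "y")]

def matches(pattern, nodes, edges):
    """Homomorphism semantics: yield every assignment mu of pattern variables
    to graph nodes that maps each pattern edge onto an edge of the graph."""
    variables = sorted({v for (s, _, t) in pattern for v in (s, t)})
    for assignment in product(nodes, repeat=len(variables)):
        mu = dict(zip(variables, assignment))
        if all((mu[s], a, mu[t]) in edges for (s, a, t) in pattern):
            yield tuple(mu[v] for v in variables)

def rpq_plus(edges, label):
    """Evaluate the RPQ label⁺: all pairs joined by a nonempty path of
    label-edges (naive transitive closure by fixpoint iteration)."""
    reach = {(s, t) for (s, a, t) in edges if a == label}
    while True:
        new = {(s, t2) for (s, t) in reach for (t1, t2) in reach if t == t1} - reach
        if not new:
            return reach
        reach |= new

print(sorted(matches(PATTERN, NODES, EDGES))[:2])  # [('a', 'b', 'c'), ('a', 'c', 'b')]
print(("a", "d") in rpq_plus(EDGES, "friend"))     # True: a -> c -> d
```

Note that the homomorphism semantics returns all six orderings of the one triangle {a, b, c}, exactly as the tuple-based definition of P(G) prescribes.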
In our case, the corresponding conjunctive query for the pattern above is written as Q(x, y) = friend⁺(x, n) ∧ friend⁺(y, n). Note that here we are using n as a constant; since CRPQs are again patterns, they can be augmented with any of the features we listed for patterns, including constants, projection, set operations, filters, and outer joins. We can also obtain more expressive classes of queries by including more expressive forms of path queries: using 2RPQs instead of RPQs gives rise to C2RPQs, or conjunctive 2RPQs, and using nested regular expressions gives rise to CNREs, which are also known as conjunctive nested path queries, or CNPQs (Bienvenu et al. 2014; Bourhis et al. 2014). Finally, Barceló et al. (2012a) consider restricting several path queries in a CRPQ at the same time by means of regular relations. The resulting language is capable of expressing, for example, that the two paths of friends in the pattern above must be of the same length.

Algorithmic Issues. The most important algorithmic problem studied for GQLs is query answering. This problem asks, given a query Q with k variables, a graph G, and a tuple ā of k values from G, whether ā belongs to the evaluation of Q over G. All of the classes of path queries we have discussed can be evaluated quite efficiently. For example, RPQs, 2RPQs, and NREs can all be evaluated in time linear in the size of Q and of G. Barceló (2013) provides a good survey on the subject.

When adding path queries to patterns, one again hopes that the resulting language will not be more difficult to evaluate than standard patterns or conjunctive queries. And indeed this happens to be the case: the evaluation problem for CRPQs, C2RPQs, and CNREs is NP-complete, just as for patterns. In fact, one can show this for any pattern augmented with a form of path query, as long as the evaluation of these path queries is in PTIME.

The second problem that has received considerable attention is query containment. This problem asks, given queries Q1 and Q2, whether the answers of Q1 are always contained in the answers of Q2. Since all queries we presented are based on graph patterns, the lower bound for this problem is NP, as containment is known to be NP-complete already for simple graph patterns. In contrast to evaluation, adding path queries to patterns can significantly alter the complexity of containment: for CRPQs the problem is already EXPSPACE-complete (Calvanese et al. 2000), and depending on the features considered, it may even become undecidable.

Beyond Patterns
For the navigational languages we have seen thus far, paths are the only form of recursion allowed, but to express certain types of queries, we may require more expressive forms of recursion. As an example, suppose now that our social network admits the labels friend and works_with, the latter joining two nodes that are colleagues. Now imagine that we need to find all nodes x and y that are connected by a path where each subsequent pair of nodes in the path is connected by both friend and works_with relations. We cannot express this query as a regular expression over paths, so instead we adopt Datalog as a framework for expressing the repetition of patterns themselves. Consider, for example, a rule of the form:

    P(x, y) ← friend(x, y), works_with(x, y).

This rule is just another way of expressing the pattern that computes all pairs of nodes connected both by friend and works_with relations. In this case, P represents a new relation, known as an intensional predicate, that is intended to compute all pairs of nodes that satisfy this pattern. We can then use P to specify the rule:

    Ans(x, y) ← P⁺(x, y).

Now both rules together form what is known as a regular query (Reutter et al. 2015) or nested positive 2RPQs (Bourhis et al. 2014, 2015). The idea is that P⁺(x, y) represents all pairs x, y connected by a path of nodes, all of them satisfying the pattern P (i.e., connected both by friend and works_with relations). Regular queries can be understood as the language resulting from extending
graph patterns with edges that can be labeled not only with a path query but with any arbitrary repetition of a binary pattern.

The idea of using Datalog as a query language for graph databases comes from Consens and Mendelzon (1990), where it was introduced under the name GraphLog. GraphLog opened up several possibilities for designing new query languages for graphs, but one of its problems was that it was too expressive, and therefore problems such as query containment were undecidable for arbitrary GraphLog queries. However, interest in Datalog for graphs has returned in the past few years, as researchers have started to restrict Datalog programs in order to obtain query languages that allow expressing the recursion or repetition of a given pattern while at the same time maintaining good algorithmic properties for the query answering problem. For example, the complexity of evaluating regular queries is the same as for C2RPQs, and the containment problem is just one exponential higher. Besides regular queries, other noteworthy graph languages follow by restricting to monadic Datalog programs (Rudolph and Krötzsch 2013), programs whose rules are either guarded or frontier-guarded (Bourhis et al. 2015; Bienvenu et al. 2015), and first-order logic with transitive closure (Bourhis et al. 2014).

Composability
A particularly important feature of query languages is composability. Composable query languages allow queries to be formulated with respect to the results of other queries expressed in the same language. Composability is an important principle for the design of simple but powerful programming languages in general and query languages in particular (Date 1984). To be composable, the input and the output of a query must have the same (or a compatible) data model.

CQs (and C2RPQs) are not inherently composable, since their output is a relation while their input is a graph. In contrast, RPQs (and 2RPQs) allow query composition, since the set of pairs of nodes resulting from an RPQ can obviously be interpreted as edges. Regular queries are composable for the same reason. In general, Datalog-based graph query languages are composable over relations but not necessarily over graphs.

In the context of modern data analytics, data is collected to a large extent automatically, by hardware and software sensors, in fine granularity and at low abstraction. Where users interact with data, they typically think, reason, and talk about entities of larger granularity and higher abstraction. For instance, social network data is collected in terms of friend relationships and individual messages sent from one person to another, while the user is often interested in communities, discussion threads, topics, etc. User-level concepts are often multiple abstraction levels higher than the concepts in which data is captured and stored. The query language of a database system is the main means for users to derive information in terms of their user-level concepts from a database oblivious of those concepts. To offer a capability of conceptual abstraction, a query language has to (1) be able to create entirely new entities within the data model to lift data into higher-level concepts and (2) be composable over its data model, so that users can express a stack of multiple such conceptual abstraction steps in the same language.

RPQs (and 2RPQs) and regular queries allow conceptual abstraction for edges. For instance, ancestor edges can be derived from sequences of parent edges. Conceptual abstraction for nodes has been considered only recently (Rudolf et al. 2014; Voigt 2017) and allows an entirely new output graph to be derived (or constructed) from the input graph. In the simplest form of graph construction, a new node is derived from each tuple resulting from a pattern match (C2RPQs, etc.), e.g., deriving a marriage node from every pair of persons that are mutually connected by married-to edges. The more general and more expressive variant of graph construction includes aggregation and allows a new node to be derived from every group of tuples, given a set of variables as grouping keys, e.g., deriving a parents node from every group of adults all connected by parent-to edges to the same child.
References

Angles R, Arenas M, Barceló P, Hogan A, Reutter JL, Vrgoc D (2017b) Foundations of modern query languages for graph databases. ACM Comput Surv 68(5):1–40
Barceló P (2013) Querying graph databases. In: Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS 2013, pp 175–188
Barceló P, Libkin L, Lin AW, Wood PT (2012a) Expressive languages for path queries over graph-structured data. ACM Trans Database Syst (TODS) 37(4):31
Barceló P, Pérez J, Reutter JL (2012b) Relative expressiveness of nested regular expressions. In: Proceedings of the Alberto Mendelzon workshop on foundations of data management (AMW), pp 180–195
Bienvenu M, Calvanese D, Ortiz M, Simkus M (2014) Nested regular path queries in description logics. In: Proceedings of the international conference on principles of knowledge representation and reasoning (KR)
Bienvenu M, Ortiz M, Simkus M (2015) Navigational queries based on frontier-guarded datalog: preliminary results. In: Proceedings of the Alberto Mendelzon workshop on foundations of data management (AMW), p 162
Bourhis P, Krötzsch M, Rudolph S (2014) How to best nest regular path queries. In: Informal proceedings of the 27th international workshop on description logics
Bourhis P, Krötzsch M, Rudolph S (2015) Reasonable highly expressive query languages. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 2826–2832
Brijder R, Gillis JJM, Van den Bussche J (2013) The DNA query language DNAQL. In: Proceedings of the international conference on database theory (ICDT). ACM, pp 1–9
Calvanese D, De Giacomo G, Lenzerini M, Vardi MY (2000) Containment of conjunctive regular path queries with inverse. In: Proceedings of the international conference on principles of knowledge representation and reasoning (KR), pp 176–185
Calvanese D, De Giacomo G, Lenzerini M, Vardi MY (2002) Rewriting of regular expressions and regular path queries. J Comput Syst Sci (JCSS) 64(3):443–465
Consens M, Mendelzon A (1990) GraphLog: a visual formalism for real life recursion. In: Proceedings of the ACM symposium on principles of database systems (PODS), pp 404–416
Cruz I, Mendelzon A, Wood P (1987a) A graphical query language supporting recursion. In: ACM special interest group on management of data 1987 annual conference (SIGMOD), pp 323–330
Cruz IF, Mendelzon AO, Wood PT (1987b) A graphical query language supporting recursion. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 323–330
Cruz IF, Mendelzon AO, Wood PT (1989) G+: recursive queries without recursion. In: Proceedings of the international conference on expert database systems (EDS). Addison-Wesley, pp 645–666
Date CJ (1984) Some principles of good language design (with especial reference to the design of database languages). SIGMOD Rec 14(3):1–7
Dries A, Nijssen S, De Raedt L (2009) A query language for analyzing networks. In: Proceedings of the ACM international conference on information and knowledge management (CIKM). ACM, pp 485–494
Fionda V, Pirrò G, Consens MP (2015) Extended property paths: writing more SPARQL queries in a succinct way. In: Proceedings of the conference on artificial intelligence (AAAI)
Florescu D, Levy AY, Suciu D (1998) Query containment for conjunctive queries with regular expressions. In: Proceedings of the ACM symposium on principles of database systems (PODS), pp 139–148
Haase P, Broekstra J, Eberhart A, Volz R (2004) A comparison of RDF query languages. In: Proceedings of the international Semantic Web conference (ISWC), pp 502–517
Harris S, Seaborne A (2013) SPARQL 1.1 query language. W3C recommendation. http://www.w3.org/TR/sparql11-query/
Hellings J (2014) Conjunctive context-free path queries. In: Proceedings of the international conference on database theory (ICDT), pp 119–130
Kostylev EV, Reutter JL, Romero M, Vrgoč D (2015) SPARQL with property paths. In: Proceedings of the international Semantic Web conference (ISWC), pp 3–18
Libkin L, Vrgoč D (2012) Regular path queries on graphs with data. In: Proceedings of the international conference on database theory (ICDT), pp 74–85
Martín MS, Gutierrez C, Wood PT (2011) SNQL: a social networks query and transformation language. In: Proceedings of the Alberto Mendelzon workshop on foundations of data management (AMW)
Miller JA, Ramaswamy L, Kochut KJ, Fard A (2015) Research directions for big data graph analytics. In: Proceedings of the IEEE international congress on big data, pp 785–794
Prud’hommeaux E, Seaborne A (2008) SPARQL query language for RDF. W3C recommendation. http://www.w3.org/TR/rdf-sparql-query/
Reutter JL, Romero M, Vardi MY (2015) Regular queries on graph databases. In: Proceedings of the international conference on database theory (ICDT), pp 177–194
Rodriguez MA (2015) The Gremlin graph traversal machine and language. In: Proceedings of the international workshop on database programming languages. ACM
Rudolf M, Voigt H, Bornhövd C, Lehner W (2014) SynopSys: foundations for multidimensional graph analytics. In: Castellanos M, Dayal U, Pedersen TB, Tatbul N (eds) BIRTE'14, business intelligence for the real-time enterprise, 1 Sept 2014. Springer, Hangzhou, pp 159–166
Rudolph S, Krötzsch M (2013) Flag & check: data access with monadically defined queries. In: Proceedings of the symposium on principles of database systems (PODS). ACM, pp 151–162
Santini S (2012) Regular languages with variables on graphs. Inf Comput 211:1–28
van Rest O, Hong S, Kim J, Meng X, Chafi H (2016) PGQL: a property graph query language. In: Proceedings of the workshop on graph data-management experiences and systems (GRADES)
Voigt H (2017) Declarative multidimensional graph queries. In: Proceedings of the 6th European business intelligence summer school (BISS). LNBIP, vol 280. Springer, pp 1–37
Wood PT (1990) Factoring augmented regular chain programs. In: Proceedings of the international conference on very large data bases (VLDB), pp 255–263


Graph Query Processing

S. Salihoglu (1) and N. Yakovets (2)
(1) University of Waterloo, Waterloo, ON, Canada
(2) Eindhoven University of Technology, Eindhoven, Netherlands

Definitions

While being eminently useful in a wide variety of application domains, the high expressiveness of graph queries makes them hard to optimize and, hence, challenging to process efficiently. We discuss a number of state-of-the-art approaches which aim to overcome these challenges, focusing specifically on the planning, optimization, and execution of two commonly used types of declarative graph queries: subgraph queries and regular path queries.

Overview

In this chapter, we give a broad overview of the general techniques for processing two classes of graph queries: subgraph queries, which find instances of a query subgraph in a larger input graph, and regular path queries (RPQs), which find pairs of nodes in the graph connected by a path matching a given regular expression. Certainly, software systems that manage and process graph-structured data, such as RDF triple stores (e.g., Neumann and Weikum 2010; Virtuoso), graph database management systems (GDBMSes) (e.g., Kankanamge et al. 2017; Neo4j), XML databases (Gou and Chirkova 2007), or extensions of relational databases that support graphs (Färber et al. 2012), also process other types of queries over graph-structured data. For example, some GDBMSes and RDF triple stores have query languages, such as Cypher and SPARQL, which can express typical relational operations on data, such as group-by-and-aggregate, filters, projections, and ordering, as well as graph-specific queries such as shortest-path queries. We focus on subgraph queries and RPQs as they are two of the most commonly appearing graph-specific queries in these systems. We note that some systems, such as Pregel (Malewicz et al. 2010) and some linear algebra software, have APIs to express other graph algorithms, such as PageRank, algorithms to find connected components, or BFS traversals, which we also do not cover in this chapter.

The high-level pipeline in existing systems to process subgraph queries and RPQs follows the classic pipeline in RDBMSes and consists of two steps: (1) logical query plan generation and optimization and (2) physical query plan generation, which outputs a physical plan that is evaluated on the actual graph data. We focus more on techniques to generate logical plans, which we think exhibit more commonality across systems than physical plans, which are highly system-dependent.

Subgraph Queries
We consider the most basic form of subgraph queries, in which the goal is to find instances of a directed subgraph Q, without any filters on its nodes or edges, in an input graph G. As a running example, consider the triangle query in Fig. 1a. Any subgraph query Q can be equivalently written as a relational multiway self-join query as follows. We label each vertex in Q with a distinct attribute ai. The equivalent multiway join query contains an edge(ai, aj) relation
[Figure omitted.] Graph Query Processing, Fig. 1 Triangle query and example edge-at-a-time and vertex-at-a-time plans. (a) Triangle query. (b) Edge-at-a-time plan. (c) Vertex-at-a-time plan
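A minimal sketch of a vertex-at-a-time plan for the cyclic triangle query edge(a1, a2), edge(a2, a3), edge(a3, a1), using the kind of smaller-list-first intersection that worst-case optimal join algorithms rely on. The edge list is hypothetical, and each directed triangle is emitted once per rotation of its starting edge:

```python
from collections import defaultdict

# Hypothetical toy directed graph as an edge list.
EDGES = [(1, 2), (2, 3), (3, 1), (2, 4), (4, 1), (1, 4)]

out_nbrs = defaultdict(set)   # adjacency indexes, as a real system would keep
in_nbrs = defaultdict(set)
for s, t in EDGES:
    out_nbrs[s].add(t)
    in_nbrs[t].add(s)

def triangles(edges):
    """Vertex-at-a-time plan: match the edge (a1, a2), then extend the partial
    match to a3 by intersecting a2's out-neighbors with a1's in-neighbors,
    iterating over the smaller of the two sets and probing the larger."""
    for a1, a2 in edges:
        small, large = sorted((out_nbrs[a2], in_nbrs[a1]), key=len)
        for a3 in small:
            if a3 in large:
                yield (a1, a2, a3)

# prints [(1, 2, 3), (1, 2, 4), (2, 3, 1), (2, 4, 1), (3, 1, 2), (4, 1, 2)]
print(sorted(triangles(EDGES)))
```

Probing the smaller set first is what makes the intersection cost proportional to the smallest adjacency list, the property that separates fast vertex-at-a-time plans from edge-at-a-time join plans.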
[…] compute Π_{a1,…,am}(…) = Q. Partial subgraphs […] to a new vertex ai by effectively intersecting all of the edges that are incident on ai in Q from already matched vertices a1, …, a_{i−1}. Figure 1c shows the partial subgraphs that are matched when evaluating the triangle query vertex-at-a-time.

Readers might find the difference between vertex-at-a-time and edge-at-a-time approaches superficial. Indeed, depending on how the intersections are computed, these approaches can even be equivalent. As an example, suppose a "vertex-at-a-time" algorithm takes (a1, a2) partial matches, e.g., (1, 2), and for each outgoing edge of the vertex matching a2, in our case vertex 2, checks if the destination of that edge has an edge back to a1, in our case vertex 1. This effectively computes the intersection of 2's out-neighbors and 1's in-neighbors but is also equivalent to the edge-at-a-time join plan in Fig. 1b. Perhaps surprisingly, if the intersections are done in a particular way, there can be significant differences between these approaches. In particular, when extending a partial match (a1, …, a_{i−1}) to ai, the cost of intersecting all of the adjacent neighbor lists of ai must take time proportional to the smallest adjacency list size, for each partial match. (This can be achieved, for example, by first checking the sizes of the adjacency lists that need to be intersected and then probing the elements of the smallest list in the others.) Recent theoretical results have shown that using such fast intersections can make vertex-at-a-time approaches asymptotically faster than edge-at-a-time approaches (Ngo et al. 2012). For example, one can construct input graphs with N edges such that any edge-at-a-time plan will take Ω(N²) time to compute the triangle query, yet vertex-at-a-time approaches will take O(N^{3/2}) time. These results were shown in the context of the new worst-case optimal join algorithms, such as the NPRR (Ngo et al. 2012) and Leapfrog Triejoin (Veldhuizen 2012) algorithms, which evaluate queries one attribute at a time by such fast intersections.

Vertex-at-a-time approaches are used by many systems and algorithms, such as the TurboISO (Han et al. 2013) and VF2 (Cordella et al. 2004) algorithms, which do not necessarily perform fast intersections, and the EmptyHeaded (Aberger et al. 2016), LogicBlox (Aref et al. 2015), and Graphflow (Kankanamge et al. 2017) systems, which use worst-case optimal multiway join algorithms.

Query Decompositions
Queries that contain many vertices and edges are a common scalability challenge for evaluating subgraph queries. A popular technique used by both edge-at-a-time and vertex-at-a-time strategies is to decompose a query Q into smaller subqueries SQ1, …, SQk, evaluate each subquery separately, and then join them, instead of matching the edges or vertices of the query one by one. Bushy binary join plans are one type of decomposition. Decomposition has two main benefits:

• Caching: Decomposition avoids recomputing partially matched subgraphs SQi multiple times by caching and reusing them.
• Filtering: If the results of an SQi in the decomposed query are (partially) empty, we can (partially) avoid evaluating some SQj. As a simple example, consider the bow tie query in Fig. 2, which consists of an asymmetric triangle (a1, a2, a3) and a symmetric triangle (a3, a4, a5) sharing vertex a3. Suppose a vertex u is not part of any asymmetric triangles for a3; then we can skip computing the symmetric triangles of u for a3.

[Figure omitted.] Graph Query Processing, Fig. 2 Bow tie query

Decomposition is used by many systems and algorithms (Aberger et al. 2016; Lai et al. 2016; Neumann and Weikum 2010; Zhao and Han 2010). There are many possible decompositions of large queries, and different systems use different decompositions. For
Graph Query Processing 893
example, SEED (Lai et al. 2016) and SPath (Zhao and Han 2010), precompute and index several subgraphs, such as triangles or some paths, and pick decompositions that use these subgraphs. As another example, EmptyHeaded picks a special type of decomposition called generalized hypertree decomposition (GHD) and aims to have a small number of edges in each SQ_i, which are evaluated using a vertex-at-a-time approach, but then joined using binary joins.

Compressed Output Representations

Another scalability challenge for systems and algorithms is the possibly very large outputs. On a graph with N edges, the number of matches of a query Q can be as large as N^(ρ*(Q)), where ρ*(Q) is a quantity called the fractional edge cover of Q (Atserias et al. 2013). For example, there can be Ω(N^(3/2)) triangles and Ω(N^3) bow ties. As a simple demonstrative example, an input graph that contains a clique of size 1000 will contain at least (1000 choose 5), roughly a quadrillion, many 5-cliques, which no system can effectively output in a reasonable time.

Several systems and algorithms handle this problem by using a compressed representation of the outputs for some queries. The techniques are varied, and we give high-level descriptions of several examples from the literature. These compression techniques are closely related to the technique of factorized databases from relational systems, which has been used in several systems (Olteanu and Schleich 2016).

• Clique compression: SEED (Lai et al. 2016) detects maximal cliques in input graphs as a preprocessing step to avoid matching exponentially many instances of subgraphs in these cliques. Instead, these cliques are returned as output, and applications can enumerate the instances in them.
• Cartesian Product-based Representation: Reference (Qiao et al. 2017) proposes a technique to decompose Q into multiple subqueries (based on the vertex cover of Q) such that matches to the entire query can be represented as the union of a set of Cartesian products of the matches of these subqueries. For example, a directed two-path query edge(a1, a2), edge(a2, a3) can be represented as the union of Cartesian products of incoming and outgoing edges of each vertex.

Some systems and algorithms also compress the intermediate partial matches instead of the outputs. For example, TurboISO (Han et al. 2013) compresses two nodes ai and aj in Q that have the same neighborhoods into a single supernode (called a neighbor equivalence class) and constructs a smaller query Q′. That is because if in an output ai matches u and aj matches v, there must be another output in which ai matches v and aj matches u. The algorithm evaluates Q′ in a way that supernodes match sets of vertices in the graph. These sets are then expanded in different permutations to get outputs of Q. This technique effectively compresses the intermediate results that the algorithm generates.

Regular Path Queries

Regular path queries (RPQs) extend subgraph queries with nontrivial navigation on a graph. This navigation is performed by returning pairs of vertices in a graph such that paths between the vertices in a pair match a given regular expression. Formally, given a graph G as a tuple ⟨V, Σ, E⟩ for which V is a set of vertices, Σ is a set (alphabet) of labels, and E ⊆ V × Σ × V is a set of directed, labeled edges, a regular path query Q is a tuple ⟨x, r, y⟩, where x and y are free variables which range over vertices and r is a regular expression. An answer to Q over a graph G is a vertex pair ⟨s, t⟩ ∈ V × V such that there exists a path p from vertex s to vertex t that matches r. The result of a query Q on graph G is the collection of all answers of Q over G.

The semantics of regular path queries play an important role in their evaluation. Specifically, matching paths can be simple (no vertex is visited twice) or arbitrary (no such restriction). Further,
the result of the query can be counting (∀), i.e., a query returns a bag of vertex pairs with each pair appearing in the result for each different matching path in a graph. On the other hand, the result can be existential (∃, non-counting), i.e., a query returns a set of vertex pairs with each pair being evidence that a matching path exists in a graph. Therefore, depending on the variations above, four types of semantical configurations for RPQs are distinguished: simple/∃, simple/∀, arbitrary/∃, and arbitrary/∀.

Intuitively, simple semantics have an implicit mechanism for dealing with cycles in a graph, while the arbitrary configuration needs to deal with cycles explicitly to avoid unbounded computation. Further, ∀-semantics is very useful in SQL-style aggregation queries which ∃-semantics cannot handle.

It has been shown that arbitrary/∃ is the only tractable configuration for a full fragment of regular expressions. Hence, a vast majority of systems which evaluate RPQs do so under arbitrary/∃ semantics. SPARQL, a widely used graph query language, adopted arbitrary/∃ RPQs in its property paths feature introduced as a part of the SPARQL 1.1 specification. However, other semantical configurations are also used. For example, Cypher, a graph query language used in the popular neo4j graph database, is based on CHAREs (regular expressions with limited use of Kleene expressions) and semantics similar to simple/∀, but with paths with no repeated edges.

In the following subsections, we give a brief overview of different methods used in the evaluation of regular path queries. In the literature, we identify three main groups of approaches: relational algebra-based, Datalog-based, and automata-based RPQ evaluation methods. Note that, due to lack of space, we do not discuss a wide body of work which uses specialized data structures (e.g., reachability indexes) in the evaluation of RPQs (e.g., see Gubichev et al. 2013; Fletcher et al. 2016). These methods are, in general, complementary to the methods we discuss below.

As a running example, consider a graph GDBP and a query QDBP as shown in Fig. 3. Graph GDBP contains information from both the English and Dutch DBPedia datasets. Query QDBP aims to find all places p in both the Dutch and English datasets in which the city of Eindhoven is located. This query demonstrates the usefulness of RPQs in the navigation of unknown graphs, as it uses succinct but nontrivial Kleene transitive expressions to concurrently resolve both equivalence and geographical closures without prior knowledge of an underlying graph.

Relational Algebra and Datalog-Based Approaches

Authors of (Losemann and Martens 2012) proposed an RPQ evaluation algorithm based on a dynamic programming paradigm. Their approach
Graph Query Processing, Fig. 3 Example of an interlinked English and Dutch DBPedia graph GDBP
Graph Query Processing, Fig. 4 A parse tree for a regular expression in QDBP, an α-RA tree, and a Datalog program for QDBP. (a) A parse tree. (b) An α-RA tree. (c) A Datalog program
can be modeled by the relational algebra extended by an additional operator α which computes the transitive closure of a given relation. This algebra, called α-RA (Agrawal 1988), can then be used to construct an α-RA expression tree (e.g., see Fig. 4b) based on the bottom-up traversal of a parse tree of the regular expression (Fig. 4a) in a given query.

Many GDBMSes implement RPQ evaluation by adapting the α-RA approach since it models relational techniques well and, thus, can be easily and efficiently implemented in most relational databases which support SQL-style recursion (e.g., PostgreSQL, Oracle, IBM DB2, and others) and relational triple stores (e.g., Virtuoso, IBM DB2 RDF store, and others). Evaluation based on α-extended relational algebra is also considered in (Dey et al. 2013; Yakovets et al. 2013).

Native support of recursion makes Datalog a suitable language to express regular path queries. For example, authors of (Consens et al. 1995) propose a simple strategy to convert an α-RA tree into a Datalog program (e.g., as shown in Fig. 4c). Then, this program can be submitted to any Datalog-compliant engine (e.g., LogicBlox) in order to evaluate a given RPQ.

Note, however, that both the α-RA tree and the corresponding Datalog program obtained by the translation procedure are not necessarily the most efficient, as all the issues related to query planning apply to the tree and its translation as well (e.g., translation methods, join orders, query rewrites, sideways information passing, and many others) (Consens et al. 1995; Yakovets et al. 2016).

Finite Automata-Based Approaches

Research on regular expressions, of course, well precedes the introduction of regular path queries. Specifically, regular expressions have been introduced in formal languages theory as a notation for patterns which generate words over a given alphabet. The dual problem to generation is recognition, and finite state automata (FA) are the recognizers for regular expressions. Specifically, given a regular expression r, there exists a number of methods to construct a corresponding automaton which recognizes the language of r. Note that there exist infinitely many FA which can recognize r, and many approaches exist which aim to construct an automaton deemed efficient for a particular scenario, e.g., finite automata obtained by using minimization.

The implementation of the G+ query language (Mendelzon and Wood 1995) includes the first method which uses FA for the evaluation of regular path queries on graphs. The algorithm works as follows. First, an automaton AQDBP (e.g., see Fig. 5a) which recognizes the given regular expression r is constructed by using standard methods (e.g., Thompson's construction (Thompson 1968)) and then, optionally, minimized (Fig. 5b). Then, the graph database GDBP is converted into a finite automaton AGDBP with graph vertices becoming automaton states and
Graph Query Processing, Fig. 5 Example of automata for QDBP and a corresponding product automaton. (a) An ε-NFA for QDBP. (b) A minimized NFA for QDBP. (c) A product automaton PDBP = AQDBP × AGDBP
graph edges becoming automaton transitions in AGDBP. Given AQDBP and AGDBP, a product automaton PDBP = AQDBP × AGDBP (Fig. 5c) is constructed. Finally, automaton PDBP is then tested for non-emptiness by checking whether any accepting state(s) can be reached from the starting state(s). Then, pairs of starting and corresponding reachable accepting states in PDBP form the answer pairs for QDBP. The idea of using FA in the evaluation of RPQs is used in Kochut and Janik (2007), Koschmieder and Leser (2012), Mendelzon and Wood (1995), Pérez et al. (2010), Yakovets et al. (2016), and Zauner et al. (2010). We briefly summarize some of these approaches below.

To avoid the construction of a full product automaton, authors of Kochut and Janik (2007) propose to evaluate RPQs by running a bidirectional breadth-first search (biBFS) in the graph. Specifically, given a reachability query (i.e., an RPQ where variables are bound to constants on both ends), two FA are constructed. The first automaton recognizes the regular language defined by the original expression, and the second automaton accepts the reversed language, which is also regular. The algorithm uses the steps of the biBFS to expand the search frontiers and to connect paths. Before each vertex in a graph is placed on the search frontier for the next expansion, a check is performed whether the partial path leading up to this vertex is not rejected by the appropriate (direct or reverse) automaton.

Authors of Koschmieder and Leser (2012) present an RPQ evaluation method which executes the search simultaneously in the graph and the automaton. This is achieved by exploring the graph using a BFS while marking the vertices in the graph with the corresponding states in the automaton. Essentially, in this method, edges in the graph are traversed only if the corresponding transition in the automaton allows it. This way, the construction of a full product automaton is avoided, and the automaton essentially acts as an evaluation plan for a given RPQ.

The WaveGuide system (Yakovets et al. 2016) takes the idea of using an FA as an evaluation plan further. The authors show that the effective plan spaces that result from FA-based and α-RA-based approaches are, in fact, incomparable. Hence, to find the best plan, both implicit plan spaces have to be considered. With this in mind, the authors design a cost-based optimizer for RPQs which subsumes and extends FA- and α-RA-based approaches. Query plans in WaveGuide are based on the notion of wavefronts: stratified automata extended with seeds, append, prepend, and view transitions.

Research Directions

One important research direction in evaluating subgraph queries is understanding when the vertex-at-a-time algorithms perform better
than edge-at-a-time algorithms. For example, some early work in the area (Nguyen et al. 2015) suggests that the worst-case optimal join algorithms outperform some systems that use binary join plans for some cyclic queries, but the topic requires a more thorough study.

Often applications do not just find subgraphs but perform other computations on the found subgraph outputs. Studying what kind of queries can be further executed on these outputs, when the outputs are compressed, is another research direction.

Finally, while it has been shown that RPQs allow for a rich plan space, it is still a challenge to generate good query evaluation plans with the lowest execution costs. More research is needed to come up with accurate cost and cardinality estimation techniques used in graph query processing. This task is further complicated by the limited and often correlated structure of graph data.

Cross-References

Graph Data Management Systems
Graph Path Navigation
Graph Pattern Matching
Graph Query Languages
Graph Representations and Storage
Parallel Graph Processing

References

Aberger CR, Tu S, Olukotun K, Ré C (2016) EmptyHeaded: a relational engine for graph processing. In: SIGMOD
Agrawal R (1988) Alpha: an extension of relational algebra to express a class of recursive queries. IEEE TSE 14(7):879–885
Aref M, ten Cate B, Green TJ, Kimelfeld B, Olteanu D, Pasalic E, Veldhuizen TL, Washburn G (2015) Design and implementation of the LogicBlox system. In: SIGMOD
Atserias A, Grohe M, Marx D (2013) Size bounds and query plans for relational joins. SIAM J Comput 42(4):1737–1767
Consens MP, Mendelzon AO, Vista D, Wood PT (1995) Constant propagation versus join reordering in datalog. In: Rules in database systems. Springer, Berlin, Heidelberg, pp 245–259
Cordella L, Foggia P, Sansone C, Vento M (2004) A (sub)graph isomorphism algorithm for matching large graphs. TPAMI 26(10):1367–1372
Dey S, Cuevas-Vicenttín V, Köhler S, Gribkoff E, Wang M, Ludäscher B (2013) On implementing provenance-aware regular path queries with relational query engines. In: Proceedings of the joint EDBT/ICDT 2013 workshops. ACM, pp 214–223
Färber F, Cha SK, Primsch J, Bornhövd C, Sigg S, Lehner W (2012) SAP HANA database: data management for modern business applications. SIGMOD Rec 40(4):45–51
Fletcher GH, Peters J, Poulovassilis A (2016) Efficient regular path query evaluation using path indexes. In: Proceedings of the 19th international conference on extending database technology
Gou G, Chirkova R (2007) Efficiently querying large XML data repositories: a survey. TKDE 19(10):1381–1403
Gubichev A, Bedathur SJ, Seufert S (2013) Sparqling Kleene: fast property paths in RDF-3X. In: Workshop on graph data management experiences and systems. ACM, pp 14–20
Han WS, Lee J, Lee JH (2013) TurboISO: towards ultrafast and robust subgraph isomorphism search in large graph databases. In: SIGMOD
Kankanamge C, Sahu S, Mhedbhi A, Chen J, Salihoglu S (2017) Graphflow: an active graph database. In: SIGMOD
Kochut KJ, Janik M (2007) SPARQLeR: extended SPARQL for semantic association discovery. In: The semantic web: research and applications. Springer, Berlin, pp 145–159
Koschmieder A, Leser U (2012) Regular path queries on large graphs. In: Scientific and statistical database management. Springer, Berlin/Heidelberg, pp 177–194
Lai L, Qin L, Lin X, Zhang Y, Chang L (2016) Scalable distributed subgraph enumeration. In: VLDB
Lai L, Qin L, Lin X, Chang L (2017) Scalable subgraph enumeration in MapReduce: a cost-oriented approach. VLDB J 26(3):421–446
Losemann K, Martens W (2012) The complexity of evaluating path expressions in SPARQL. In: Proceedings of the 31st symposium on principles of database systems. ACM, pp 101–112
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: SIGMOD
Mendelzon A, Wood P (1995) Finding regular simple paths in graph databases. SIAM J Comput 24(6):1235–1258
Neumann T, Weikum G (2010) The RDF-3X engine for scalable management of RDF data. VLDB J 19:91–113
Ngo HQ, Porat E, Ré C, Rudra A (2012) Worst-case optimal join algorithms. In: PODS
Nguyen D, Aref M, Bravenboer M, Kollias G, Ngo HQ, Ré C, Rudra A (2015) Join processing for graph patterns: an old dog with new tricks. In: GRADES workshop
Graph Representations and Storage
Marcus Paradies (SAP SE, Walldorf, Germany; SAP SE, Berlin, Germany) and Hannes Voigt (Dresden Database Systems Group, Technische Universität Dresden, Dresden, Germany)

Definitions

Graph representation concerns the layout of graph data in the linearly addressed storage of computers (main memory, SSD, hard disk, etc.). General objectives for graph representations are (1) to use little storage space (space-efficiency) while (2) allowing operations and queries on the graph to be executed in a short time (time-efficiency). Hence, every graph representation technique is subject to a specific space–time trade-off. It depends on the particular scenario (data, workload, available resources, etc.) whether the space–time trade-off of a certain representation technique is worthwhile.

Adjacency Matrix. An adjacency matrix, the most basic storage structure for graphs, stores the topology as a binary matrix representing each edge in the graph by an individual bit stored at coordinates (u, v) for vertices u, v ∈ V (Cormen et al. 2001). With a quadratic space complexity of O(|V|²), an adjacency matrix is only space-efficient for very dense graphs. However, often graphs are sparse in practice; therefore a pure adjacency matrix representation is rather uncommon in graph data management and graph processing systems.

Adjacency List. Being more suitable for sparse graphs, an adjacency list stores for each vertex a linked list of adjacent vertices and has a space complexity of O(|V| + |E|). Technically, an adjacency list can be implemented either by a pointer array to represent vertices and a set of linked lists to represent adjacent vertices or by an array of arrays. While a list-based adjacency data structure allows higher in-place update rates, an array-based adjacency list exhibits a better CPU
cache utilization for retrieving the adjacency of a vertex through a strictly sequential and contiguous memory access pattern. This is particularly important when the graph is stored in memory.

CSR. The compressed sparse row (CSR) data structure is a variation of the adjacency list. It has been developed in the context of sparse matrix multiplications and is widely used in graph processing systems (Gustavson 1978). The CSR data structure represents the graph topology as an edge array consisting of the target vertices of all edges in sequential order. To retrieve the adjacency of a single vertex, the offset array is used to compute the boundary values of the adjacency. In comparison to the adjacency list and the adjacency matrix, a CSR is more compact since the entire topology is represented in a single, large, continuous chunk of memory. The downside is that adding and removing vertices/edges is expensive and results in a partial reconstruction of the CSR to maintain the ordering. A detailed experimental analysis of several graph representations is given in Blandford et al. (2004).

only. Bits are assigned according to a level-wise traversal of the tree, i.e., first all bits of level one, then all bits of level two, and so on. Although the data structure provides a concise representation of the graph topology, multi-relational graphs, i.e., multiple edges between a pair of vertices, are not supported. Additionally, the k²-ary tree is not designed for dynamic graphs with frequent edge insertions/deletions as each of these operations might trigger a rewriting of multiple internal nodes in the tree.

Adaptive Bit Set Representations. Aberger et al. (2016) propose to represent edges in an immutable and adaptive CSR data structure, which allows storing a single neighborhood either in a dense or a sparse representation. The actual representation is selected by an optimizer based on collected statistics about the density and the cardinality of the neighborhood set. The authors define the density of a neighborhood set as the cardinality of the set divided by the value range of the set. Real-world graphs tend to have a large skew in their neighborhood density distribution.
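The dense-versus-sparse choice behind such adaptive neighborhood representations can be sketched as follows. This is a minimal illustration, not code from any of the cited systems: the helper names and the 0.25 density threshold are our own assumptions.

```python
# Sketch of an adaptive neighborhood representation: dense neighborhoods
# become bit sets over their value range, sparse ones stay sorted arrays.
# The 0.25 threshold is an arbitrary illustrative choice.
def density(neighbors):
    value_range = max(neighbors) - min(neighbors) + 1
    return len(neighbors) / value_range

def represent(neighbors, threshold=0.25):
    neighbors = sorted(set(neighbors))
    if density(neighbors) >= threshold:
        lo = neighbors[0]
        bits = 0
        for v in neighbors:
            bits |= 1 << (v - lo)          # dense: one bit per value in range
        return ("bitset", lo, bits)
    return ("array", neighbors)            # sparse: sorted id array

def contains(rep, v):
    if rep[0] == "bitset":
        _, lo, bits = rep
        return 0 <= v - lo and bool((bits >> (v - lo)) & 1)
    return v in rep[1]
```

A bit-set neighborhood supports constant-time membership tests and word-wise intersections, while the sorted array avoids wasting a bit for every unused value in a wide, sparsely populated range.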
to reduce the storage overhead implied by reoccurring, long identifiers. The addition of indexes allows for faster lookups. The downside is that a triple table requires many join operations to read an entity and its accompanying properties.

Universal Table. Alternatively, triples can also be stored in a universal table. The universal table maps each distinct property key to a separate column, and each object is represented as a single record in the table. Precisely, each row represents an s, each column a p, and each cell o so that s[p] = o if there is a triple (s, p, o) and s[p] = NULL otherwise. This approach is very common for PGM data. In typical PGM data, nodes and edges do not share many property keys, so that it is beneficial to use two universal tables, a vertex table and an edge table (Rudolf et al. 2013; Paradies et al. 2015). This approach allows the join-free reconstruction of a vertex/an edge through a selection and a subsequent attribute projection. Both vertices and edges can have an arbitrary number of attributes, though. Hence, universal tables require an efficient handling of NULL values.

Emerging Schemas. A solution that reduces the number of joins without introducing too many NULL values tries to exploit the schema inherent in the graph data. Since graph data usually has no upfront defined and rigid database schema, finding a suitable normalized database design for it is a nontrivial and tedious task. The simple property-class pattern partitions entities by class (RDF) or label (PGM) (Abadi et al. 2007). More general property tables cluster properties that frequently occur together into a table. Individual entities may be represented in many of these property tables, depending on which property cluster they fully or partially instantiate (Abadi et al. 2007). For a given dataset, a good set of property tables can be found automatically. Approaches are (1) utilizing standard clustering algorithms such as k-means (MacQueen 1967), cf. (Chu et al. 2007), (2) starting from every distinct property set and merging infrequent ones (Pham et al. 2015), or (3) partitioning entities on the fly based on schema similarity into small fixed-sized partitions, such as memory blocks or disk pages (Herrmann et al. 2014a,b). The latter approach is lightweight, does not require any upfront computation, and can readapt if the inherent schema changes over time.

Schema Hashing. Another technique for NULL value reduction in universal tables is schema hashing (Bornea et al. 2013). Here, the universal table consists of n column groups. One column group consists of a property column and a value column, i.e., one column group represents a predicate–object pair or a property key–value pair from RDF or PGM, respectively. A hash function maps a predicate/property to one of the n column groups. With a well-designed hash function, properties are packed densely into the table with only few collisions. In case of a collision, the entity has to spill into an additional row. The same principle can be applied to labeled edges; then a column group consists of three columns, edge-id, label, and target-vertex (Sun et al. 2015). In this way, the complete neighborhood of a vertex can be represented in a single row.

Separate Property Storage. Instead of storing schema-flexible properties in the same data structure as the graph topology, some systems store properties in a separate data structure. For instance, it is possible to utilize a document store or the JSON (Crockford 2006) extension of a relational DBMS to store all properties of an entity as a single JSON document in a single table cell of a separate vertex or edge attribute table (Sun et al. 2015). Another approach is to separate properties into a key–value table. By doing so, vertices and edges can be stored as fixed-length records in a vertex and an edge table, respectively. Fixed-length records can be directly addressed, which allows O(1) access complexity to adjacent vertices.

Indexing

Exhaustive Indexing. State of the art in RDF representation is sixfold indexing, which
exhaustively indexes every permutation of the three triple elements, S, P, and O (Neumann and Weikum 2008; Weiss et al. 2008). In total, this approach results in six indexes: SPO, SOP, PSO, POS, OSP, and OPS. Likewise, every graph can be indexed on the source (S) and target (T) of the edges exhaustively by storing the adjacency once in forward order (ST) and once in reverse order (TS) (Hauck et al. 2015; Shun and Blelloch 2013). Exhaustive indexing of graphs is feasible since dictionary and delta compression allow for compact index representations. Furthermore, exhaustive indexing provides considerable performance gains since it facilitates direct neighborhood access over incoming as well as outgoing edges and efficient sort merge join operations on every combination of incident edges.

Aggregate Indexing. Additionally, aggregate indexes provide direct access to degree information (Neumann and Weikum 2008; Shun and Blelloch 2013). An aggregate index on S provides out-degrees; one on T provides in-degrees. RDF systems can maintain up to nine aggregate indexes: SP, SO, PS, PO, OS, OP, S, P, and O. Aggregate indexes can be stored even more compactly than regular graph indexes. They are useful for queries that project to a subset of query variables, where only the existence of not fully qualified triples needs to be checked. Further aggregate indexes provide useful statistics, which allow choosing more efficient query plans (Neumann and Weikum 2008) or processing strategies (Shun and Blelloch 2013).

Reachability and Distance Indexing. A large body of work concerns index structures supporting reachability and distance queries. Given a graph, reachability queries answer whether for a given pair of nodes u and v there is a path between them, and distance queries return the minimum length of the path connecting a given pair of nodes u and v. Distance is measured either in number of edges or as the sum of edge weights along the path. Most approaches index a graph by computing a label for each node, which encodes the reachability or the distance to other nodes in the graph.

For instance, a classic approach computes 2-hop labels, where each node v is assigned a label L(v) = (Lin(v), Lout(v)) such that Lin(v) and Lout(v) are a set of nodes from which v can be reached and a set of nodes that are reachable from v, respectively. The labels are chosen in such a way that for any given pair of nodes u and v, the intersection of Lout(u) and Lin(v) is not empty if and only if v is reachable from u. The approach can be generalized to distance queries by storing a distance d(x, v) and d(v, x) together with every x in Lin(v) and Lout(v), respectively, such that the minimum of d(u, x) + d(x, v) among all x ∈ Lout(u) ∩ Lin(v) is equal to (or sufficiently approximates) the distance between u and v. The challenge is to efficiently compute a minimal set of such labels.

For reachability indexes, another common approach uses ranges to encode reachable nodes in a label. Therefore, all nodes are assigned a label identifier by (1) collapsing all strongly connected components into single nodes, (2) traversing the resulting graph depth-first from some (artificially introduced) root node, and (3) numbering all nodes in post-order. Then, all nodes x reachable from a given node v can be succinctly listed in L(v) as a set of ranges of label identifiers so that a node u is reachable from v if u ∈ L(v) (Yu and Cheng 2010). The size of these labels can be further reduced by just storing approximations that can answer most negative reachability queries (resulting in false) and efficiently steer a depth-first search in the graph for the remaining queries. For more details on reachability and distance indexing, we refer the reader to surveys on reachability queries (Yu and Cheng 2010; Su et al. 2017) and distance queries (Sommer 2014; Bast et al. 2016).

Structural Indexing. It is also possible to store a summarization of a graph, which can help to speed up conjunctive and regular path queries. For this topic we refer the reader to the entry on graph indexing and synopses.
[Figure: an example node-link diagram; nodes are labeled Aadil, Andre, Beverly, Carol, Diane, Ed, Farid, Fernando, Garth, Heather, Izdihar, Jane, Latif, and Mawsil]
writing algorithms to mine the data for evidence seen in the visualization.

In this chapter, a graph G = (V, E) consists of a set of nodes (vertices) V and a set of edges E, which are pairs of nodes. Denote by n = |V| and m = |E| the number of nodes and edges, respectively. If there is an edge from node i to node j, we denote that as i → j. If the graph is undirected, then we denote the edge as i ↔ j and call i and j neighboring (or adjacent) nodes.

Node-Link Diagrams and Layout Aesthetics

By far the most common type of graph layout is the so-called node-link diagram as seen in Fig. 1. Here nodes are represented by points or simple geometric shapes like ellipses or rectangles, whereas edges are drawn as simple curves linking the corresponding pair of nodes. In this chapter we restrict our attention to node-link diagrams; for alternative types of graph representations, see the survey of von Landesberger et al. (2011).

The algorithmic graph layout problem consists in finding node positions in the plane (or in three-dimensional space) and edge representations as straight lines or simple curves such that the resulting layout faithfully depicts the graph and certain aesthetic quality criteria are satisfied and optimized. We list the most common criteria. The influence of most of them on human graph reading tasks has been empirically confirmed (Purchase 1997).

• crossings: the fewer edge crossings the better (a layout without crossings exists only for planar graphs)
• bends: the fewer edge bends the better; ideally edges are straight-line
• edge lengths: use as uniform edge lengths as possible
• angular resolution: angles between edges at the same node should be large
• crossing angles: angles of pairs of crossing edges should be large
• area and aspect ratio: the layout should be as compact as possible
• edge slopes: few and regularly spaced edge slopes should be used (e.g., orthogonal layouts use only horizontal and vertical edges)
• neighborhood: neighbors of each node in the graph should be neighbors in the layout as well

The above list is not comprehensive, and there may be additional application-specific constraints and criteria or global aspects such as displaying symmetries. Moreover, some criteria may contradict each other. Typically only a subset of criteria is optimized by a graph layout algorithm, and trade-offs between different criteria need to be considered.

Key Research Findings

Undirected Graph Drawing
The layout algorithms and techniques presented in this section do not make use of edge directions, but they can obviously be applied to both undirected and directed graphs. Undirected graphs are often represented by node-link diagrams with
906 Graph Visualization
1  input: graph G = (V, E), initial positions x, tolerance tol, and nominal edge length K
2  set step = initial step length
3  repeat
4      x0 = x
5      for (i ∈ V) {
6          f = 0                                 // f is a 2/3D vector
7          for (j ↔ i, j ∈ V) f = f + Fa(i, j)   // attractive force, see equation (1)
8          for (j ≠ i, j ∈ V) f = f + Fr(i, j)   // repulsive force, see equation (2)
9          xi = xi + step · (f / ||f||)          // update position of node i
       }
10 until (||x − x0|| < tol · K)
11 return x
straight-line edges. We denote by x_i the location of node i in the layout. Here x_i is a point in two- or three-dimensional Euclidean space.

Spring embedders (Eades 1984; Fruchterman and Reingold 1991; Kamada and Kawai 1989) are the most widely used layout algorithms for undirected graphs. They attempt to find aesthetic node placement by representing the problem as one of minimizing the energy of a physical system. The guiding principles are that nodes that are connected by an edge should be near each other, while no nodes should be too close to each other (cf. neighborhood aesthetic). Depending on the exact physical model, we could further divide the spring embedders into two types. For convenience, we call the first type spring-electrical model and the second type spring/stress model.

electrical charges (e.g., positive) that push them apart. Specifically, there is an attractive spring force exerted on node i from its neighbor j, which is proportional to the squared distance between these two nodes:

    F_a(i, j) = −(‖x_i − x_j‖² / K) · (x_i − x_j)/‖x_i − x_j‖,   i ↔ j,   (1)

where K is a parameter related to the nominal edge length of the final layout. The repulsive electrical force exerted on node i from any node j is inversely proportional to the distance between these two nodes:

    F_r(i, j) = (K² / ‖x_i − x_j‖) · (x_i − x_j)/‖x_i − x_j‖,   i ≠ j.   (2)
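The listed iteration, combined with forces (1) and (2), can be sketched in Python. This is a minimal illustration, not a production layout engine: the geometric step decay used for cooling is a simplification of the adaptive step-length schemes used in practice (e.g., Hu 2005), and the quadratic force loop omits any force approximation.

```python
import numpy as np

def spring_electrical_layout(edges, n, K=1.0, step=0.1, decay=0.9,
                             iters=200, tol=1e-3, x=None):
    """Force-directed layout in the spring-electrical model: attractive
    force (1) between neighbors, repulsive force (2) between all pairs.
    The step decay is a simplification of adaptive cooling schemes."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    if x is None:
        x = np.random.default_rng(0).random((n, 2))
    for _ in range(iters):
        x0 = x.copy()
        for i in range(n):
            f = np.zeros(2)
            for j in range(n):
                if j == i:
                    continue
                d = x[i] - x[j]
                dist = np.linalg.norm(d) + 1e-12
                if j in adj[i]:
                    f += -(dist ** 2 / K) * d / dist   # attractive, eq. (1)
                f += (K ** 2 / dist) * d / dist        # repulsive, eq. (2)
            if np.linalg.norm(f) > 0:
                x[i] = x[i] + step * f / np.linalg.norm(f)
        step *= decay                                  # cooling (simplification)
        if np.linalg.norm(x - x0) < tol * K:
            break
    return x
```

For two adjacent nodes the two forces balance exactly at distance K (since ‖d‖²/K = K²/‖d‖ implies ‖d‖ = K), so a single-edge graph settles near unit separation when K = 1.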
Approximation techniques based on space decomposition data structures, such as the Barnes-Hut algorithm, can be used to approximate the repulsive forces efficiently (Tunkelang 1999; Quigley 2001; Hachul and Jünger 2004).

For large graphs, the force-directed algorithm, which uses the steepest descent process and repositions one node at a time to minimize the energy locally, is likely to be stuck at local minima, because the physical system of springs and electrical charges can have many local minimum configurations. This can be overcome by the multilevel technique. In this technique, a sequence of smaller and smaller graphs is generated from the original graph; each captures the essential connectivity information of its parent. The force-directed algorithm can be applied to this sequence of graphs, from small to large, each time using the layout of the smaller graph as the start layout for the larger graph. Combining the multilevel algorithm and the force approximation technique, algorithms based on the spring-electrical model can be used to lay out graphs with millions of nodes and edges in seconds (Walshaw 2003; Hu 2005).

There are several ways to minimize the spring energy (3). A force-directed algorithm could be applied, where the force exerted on node i from all other nodes j (j ≠ i) is

    F(i, j) = −w_ij (‖x_i − x_j‖ − d_ij) · (x_i − x_j)/‖x_i − x_j‖.   (4)

In recent years the stress majorization technique (Gansner et al. 2004) became a preferred way to minimize the stress model due to its robustness. In the stress majorization process, systems of linear equations are repeatedly solved:

    L_w X = L_{w,d} Y,   (5)

where X and Y are matrices of dimension n × 2, and the weighted Laplacian matrix L_w is positive semi-definite and has elements

    (L_w)_ij = Σ_{l≠i} w_il   if i = j,
               −w_ij          if i ≠ j.   (6)
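The majorization iteration of equation (5) can be sketched as follows. This is a minimal sketch under the common assumptions that w_ij = 1/d_ij² and that the (singular) linear systems are solved in the least-squares sense, which fixes the layout only up to translation; the function names are illustrative.

```python
import numpy as np

def stress(X, d, w):
    """Stress energy: sum over i<j of w_ij (||x_i - x_j|| - d_ij)^2."""
    n = len(X)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += w[i, j] * (np.linalg.norm(X[i] - X[j]) - d[i, j]) ** 2
    return s

def stress_majorization(d, dim=2, iters=100, seed=1):
    """Repeatedly solve L_w X = L_{w,d} Y, as in equation (5), where Y is
    the layout from the previous iteration and w_ij = 1/d_ij^2."""
    n = d.shape[0]
    diag = np.eye(n, dtype=bool)
    w = np.zeros_like(d)
    w[~diag] = 1.0 / d[~diag] ** 2
    # Constant weighted Laplacian L_w, elements as in equation (6)
    Lw = -w.copy()
    Lw[diag] = w.sum(axis=1)
    Y = np.random.default_rng(seed).random((n, dim))
    for _ in range(iters):
        # Layout-dependent right-hand-side Laplacian L_{w,d}
        Lwd = np.zeros_like(d)
        for i in range(n):
            for j in range(n):
                if i != j:
                    dist = np.linalg.norm(Y[i] - Y[j]) + 1e-12
                    Lwd[i, j] = -w[i, j] * d[i, j] / dist
        Lwd[diag] = -Lwd.sum(axis=1)
        # Singular system; least-squares solve fixes translation ambiguity
        Y = np.linalg.lstsq(Lw, Lwd @ Y, rcond=None)[0]
    return Y
```

For a three-node path with graph distances (1, 1, 2), the iteration drives the stress toward zero, since a collinear layout realizes these distances exactly.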
Directed Graph Drawing
Directed graphs express relations that have a source and a target, e.g., senders and recipients of messages or hierarchical relationships in an organization. Using arrowheads to indicate directions, any undirected graph layout algorithm can be used for visualizing directed graphs. Yet, often the edge directions carry important information that should directly influence the layout geometry. In this section we discuss layout algorithms that make use of edge directions.

Layered Graph Layout
Layered graph layout is particularly useful for hierarchical graphs, which have no or only few cyclic relationships. For such graphs it is natural to ask for a drawing, in which all or most edges point into the same direction, e.g., upward. A classic algorithmic framework for upward graph layouts is layered graph drawing, often also known as Sugiyama layout named after the fundamental paper by Sugiyama et al. (1981). Each node is assigned to a layer, and for each edge it is required that the layer of the source node is below the layer of the target node. For a detailed survey of layered graph layout methods, see Healy and Nikolov (2014).

The process of computing layered graph layouts consists of a series of four steps. Each step influences the input to the subsequent steps and thus also the achievable layout quality.

1. Cycle removal. If the graph contains directed cycles, this step determines a small set of so-called feedback edges whose reversal yields an acyclic graph. The subsequent steps will draw this acyclic graph, and at the end of the process the original edge directions are restored. Finding a minimum set of edges to reverse is equivalent to finding a so-called minimum feedback arc set, a classical NP-hard problem (Karp 1972). Several heuristics and approximation algorithms can be used to find small feedback edge sets in practice (Gansner et al. 1993; Eades et al. 1993).
2. Layer assignment. This step assigns each node to a horizontal layer, which can be thought of as an integer y-coordinate, such that all edges point upward. Multiple nodes can be assigned to the same layer. Typical goals in this phase are to find compact layer assignments that use few layers and distribute nodes evenly to the layers, while ensuring that all edges point from a lower layer to a higher layer. After the layer assignment, some edges may span several layers. Such long edges are usually subdivided by dummy nodes to guarantee that, for the next step, all edges connect nodes in neighboring layers. In order to keep edges short, a natural goal is to minimize the number of dummy nodes needed. Suitable layer assignments can be computed efficiently for most optimization goals, e.g., minimizing the number of layers based on topological sorting of the graph. For compact layouts with few dummy nodes and bounds on the number of nodes per layer, the ILP-based algorithm of Gansner et al. (1993) provides good results.
3. Crossing minimization. The crossing minimization step needs to determine the node order in each layer such that the number of edge crossings among the edges between two neighboring layers is minimized. Crossing minimization between two layers is again an NP-hard optimization problem (Eades and Wormald 1994). Several heuristics and (non-polynomial) exact algorithms for minimizing crossings between neighboring layers are known and evaluated in the experimental study by Jünger and Mutzel (1997). For computing node orders in all layers, usually a layer-by-layer sweep method is applied that cycles through all layers and optimizes node orders in neighboring layers, but global multilevel optimization methods exist as well (Chimani et al. 2011).
4. Edge routing. Once layers and node orders in each layer are fixed, it remains to assign x-coordinates to all nodes (including dummy nodes) that respect the node orders and optimize the resulting edge shapes. A typical optimization goal is to draw edges as vertical as possible with few bends for long edges (or with at most two bends with a strictly vertical middle segment) while
maintaining a minimum node spacing in each layer and avoiding node-edge overlaps. Algorithms for coordinate assignment either use mathematical programming (Sugiyama et al. 1981; Gansner et al. 1993) or a linear-time sweep method (Brandes and Köpf 2002).

Layouts for Specific Graph Classes
Previous sections covered visualization algorithms for general undirected and directed graphs. There is a variety of tailored algorithms for more specific graph classes, including prominent examples such as trees and planar graphs. These algorithms exploit structural properties when optimizing certain layout aesthetics. It is beyond the scope of this chapter to cover such specific graph classes and their visualization algorithms. Rather we refer to Rusu (2013) for a survey on tree layout and to Duncan and Goodrich (2013) and Vismara (2013) for surveys on planar graph layout. Further general books on graph drawing algorithms (Di Battista et al. 1999; Tamassia 2013) contain chapters on specific layout algorithms.

• Graphviz is implemented in C++ and hosts layout algorithms for both undirected (multilevel spring-electrical and stress models) and directed graphs (layered layout), as well as various graph theory algorithms. Graphviz is incorporated in Doxygen and available via R and Python.
• OGDF is a C++ class library for automatic layout of diagrams. Developed and supported by German and Australian researchers, it contains spring-electrical algorithms with fast multipole force approximation, as well as layered, orthogonal, and planar layout algorithms.
• Tulip is a C++ framework originating from the University of Bordeaux I for developing interactive information visualization applications. One of the goals of Tulip is to facilitate the reuse of components; it integrates OGDF graph layout algorithms as plugins.

The following are a few other free software packages, each with its own unique merits, but not designed to work on very large graphs. For example,
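The crossing-minimization step of the layered (Sugiyama) pipeline described earlier is commonly approached with the barycenter heuristic: the nodes of one (free) layer are ordered by the average position of their neighbors in the adjacent fixed layer. A minimal sketch, with illustrative function names not taken from any of the libraries above:

```python
from itertools import combinations

def count_crossings(order_top, order_bottom, edges):
    """Count pairwise edge crossings between two neighboring layers.
    edges are (top_node, bottom_node) pairs."""
    pos_t = {v: i for i, v in enumerate(order_top)}
    pos_b = {v: i for i, v in enumerate(order_bottom)}
    c = 0
    for (a, b), (u, v) in combinations(edges, 2):
        # Two edges cross iff their endpoints appear in opposite orders.
        if (pos_t[a] - pos_t[u]) * (pos_b[b] - pos_b[v]) < 0:
            c += 1
    return c

def barycenter_pass(order_top, order_bottom, edges):
    """Reorder the free bottom layer by the barycenter (average fixed-layer
    position) of each node's neighbors."""
    pos_t = {v: i for i, v in enumerate(order_top)}
    nbrs = {v: [] for v in order_bottom}
    for a, b in edges:
        nbrs[b].append(pos_t[a])
    def bary(v):
        return sum(nbrs[v]) / len(nbrs[v]) if nbrs[v] else 0.0
    return sorted(order_bottom, key=bary)
```

On a two-layer graph whose edges form a reversal (0-c, 1-b, 2-a), a single barycenter pass reorders the bottom layer from a, b, c to c, b, a and removes all three crossings; in a full layer-by-layer sweep this pass is repeated up and down the layer stack.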
library also made it possible to display graphs with millions of nodes and edges. Furthermore, there has also been progress in abstracting the visual complexity of large graphs, for example, by grouping similar nodes together, and representing certain subgraphs such as cliques as a motif. But as graphs become increasingly large, complex, and time-dependent, there are a number of challenges to be addressed.

The Increasing Size of the Graphs
The size of graphs is increasing exponentially over the years (Davis and Hu 2011). Social networks are one area where we have graphs of unprecedented size. As of late 2017, Facebook, for example, has over 2.07 billion monthly active users, while Twitter has over 330 million. Other data sources may be smaller but just as complex. For instance, Wikipedia currently has 5.5 million interlinked articles, while Amazon offers around 400 million items, with recommendations connecting each item to other like items. All these pale in comparison when we consider that there are 100 billion interconnected neurons in a typical human brain and trillions of websites on the Internet. Each of these graphs evolves over time. Furthermore, graphs like these tend to exhibit the small-world characteristic, where it is possible to reach any node from any other in a small number of hops. All these features present challenges to existing algorithms (Leskovec et al. 2009).

The unique features of these networks call for rethinking of the algorithms and visual encoding approaches. The large size of some networks means that finding a layout for such a graph can take hours even with the fastest algorithms available. There have been works on using GPUs and multiple CPUs, e.g., (Ingram et al. 2009), to speed up the computation by a factor of 10–100.

Even though state-of-the-art graph layout algorithms can handle graphs with many millions of nodes and billions of edges, with so many nodes and edges, the traditional node-link diagram representation is at its limit. A typical computer screen only has a few million pixels, and we are running out of pixels just to render every node.

One solution is to use a display with high resolution, including display walls, and various novel ways to manipulate such a display (e.g., gesture- or touch-based control). But even the largest possible display is unlikely to be sufficient for rendering some of the largest social networks. One estimate puts human eyes as having a resolution of just over 500 million pixels. Therefore, even if we can make a display with higher resolution, our eyes can only make partial use of such a display at any given moment.

Since the purpose of visualization is to help us understand the structures and anomalies in the data, for very large graphs it is likely that we need algorithms to discover structures and anomalies first (Akoglu et al. 2010) and display these in a less cluttered form, while allowing the human operator to drill down to details when needed.

Node-link diagram representation, while most common, may not be the most user-friendly to the general public, nor is it the most pixel-efficient. Other representations, such as a combination of node-link diagrams and matrices (Henry et al. 2007), have been proposed.

Large complex networks call for fast and interactive visualization systems to navigate around the massive amount of information. A number of helpful techniques for exploring graphs interactively, such as link sliding (Moscovich et al. 2009), have been proposed. Further research in this area, particularly at a scale that helps to make sense out of networks of billions of nodes and edges, is essential in helping us to understand large network data.

Finally, the stress model is currently the best algorithm for drawing graphs with predefined edge lengths. Improving its high computation and memory complexity is likely to remain an active area of research as well.

Time-Varying and Complex Graphs
All the large and complex networks mentioned earlier are time-evolving. How to visualize such dynamic networks is an active area of research (Frishman and Tal 2007), both in terms of graph
layout (van Ham and Wattenberg 2008), and in displaying such graphs. For example, time-varying graphs can be displayed as an animation or as small multiples. Researchers have been studying whether to use one form or the other (Archambault and Purchase 2016), when measured in terms of preservation of the users' mental maps, and in improving comprehension and information recall. Dynamic network visualization will likely continue to be an area of strong interest.

Visualizing multivariate graphs is another challenging area. In a multivariate graph, each node and edge may be associated with several attributes. For example, in a social network, each node is a person with many possible attributes. Each edge represents a friendship, which could entail multiple attributes as well, such as the type of friendship (e.g., school, work, church) and the length and strength of the friendship. Displaying this information in a way that helps our understanding of all these complex attributes is a challenge. It requires careful consideration of both visual design and interaction techniques (Wattenberg 2006; Kerren et al. 2014).

Cross-References

Big Data Visualization Tools
Graph Exploration and Search
Visualization
Visualization Techniques

References

Akoglu L, McGlohon M, Faloutsos C (2010) Oddball: spotting anomalies in weighted graphs. In: Proceedings of the 14th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD 2010)
Archambault D, Purchase HC (2016) Can animation support the visualisation of dynamic graphs? Inf Sci 330(C):495–509
Brandes U, Köpf B (2002) Fast and simple horizontal coordinate assignment. In: Mutzel P, Jünger M, Leipert S (eds) Graph drawing (GD'01). LNCS, vol 2265. Springer, Berlin/Heidelberg, pp 31–44
Brandes U, Pich C (2007) Eigensolver methods for progressive multidimensional scaling of large data. In: Proceedings of the 14th international symposium on graph drawing (GD'06). LNCS, vol 4372, pp 42–53
Chimani M, Hungerländer P, Jünger M, Mutzel P (2011) An SDP approach to multi-level crossing minimization. In: Algorithm engineering and experiments (ALENEX'11), pp 116–126
Davis TA, Hu Y (2011) University of Florida sparse matrix collection. ACM Trans Math Softw 38:1–18
Di Battista G, Eades P, Tamassia R, Tollis IG (1999) Algorithms for the visualization of graphs. Prentice-Hall, Upper Saddle River
Duncan CA, Goodrich MT (2013) Planar orthogonal and polyline drawing algorithms, chap 7. In: Tamassia R (ed) Handbook of graph drawing and visualization. CRC Press, Hoboken, pp 223–246
Eades P (1984) A heuristic for graph drawing. Congressus Numerantium 42:149–160
Eades P, Wormald NC (1994) Edge crossings in drawings of bipartite graphs. Algorithmica 11:379–403
Eades P, Lin X, Smyth WF (1993) A fast and effective heuristic for the feedback arc set problem. Inf Process Lett 47(6):319–323
Frishman Y, Tal A (2007) Online dynamic graph drawing. In: Proceedings of eurographics/IEEE VGTC symposium on visualization (EuroVis), pp 75–82
Fruchterman TMJ, Reingold EM (1991) Graph drawing by force directed placement. Softw Pract Exp 21:1129–1164
Gansner ER, Koutsofios E, North SC, Vo KP (1993) A technique for drawing directed graphs. IEEE Trans Softw Eng 19(3):214–230
Gansner ER, Koren Y, North SC (2004) Graph drawing by stress majorization. In: Proceedings of the 12th international symposium on graph drawing (GD'04). LNCS, vol 3383. Springer, pp 239–250
Gansner ER, Hu Y, North SC (2013) A maxent-stress model for graph layout. IEEE Trans Vis Comput Graph 19(6):927–940
Hachul S, Jünger M (2004) Drawing large graphs with a potential field based multilevel algorithm. In: Proceedings of the 12th international symposium on graph drawing (GD'04). LNCS, vol 3383. Springer, pp 285–295
Healy P, Nikolov NS (2014) Hierarchical drawing algorithms. In: Tamassia R (ed) Handbook of graph drawing and visualization, chap 13. CRC Press, Boca Raton, pp 409–454
Henry N, Fekete JD, McGuffin MJ (2007) Nodetrix: a hybrid visualization of social networks. IEEE Trans Vis Comput Graph 13:1302–1309
Hu Y (2005) Efficient and high quality force-directed graph drawing. Math J 10:37–71
Ingram S, Munzner T, Olano M (2009) Glimmer: multilevel MDS on the GPU. IEEE Trans Vis Comput Graph 15:249–261
Jünger M, Mutzel P (1997) 2-layer straightline crossing minimization: performance of exact and heuristic algorithms. J Graph Algorithms Appl 1(1):1–25
Kamada T, Kawai S (1989) An algorithm for drawing general undirected graphs. Inform Process Lett 31:7–15
Karp RM (1972) Reducibility among combinatorial problems. In: Miller RE, Thatcher JW, Bohlinger JD (eds) Complexity of computer computations, pp 85–103
Kerren A, Purchase H, Ward MO (eds) (2014) Multivariate network visualization: Dagstuhl seminar #13201, Dagstuhl Castle, 12–17 May 2013. Revised discussions. Lecture notes in computer science, vol 8380. Springer, Cham
Kruskal JB (1964) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29:1–27
Kruskal JB, Seery JB (1980) Designing network diagrams. In: Proceedings of the first general conference on social graphics, U.S. Department of the Census, Washington, DC, pp 22–50. Bell Laboratories technical report no. 49
van Ham F, Wattenberg M (2008) Centrality based visualization of small world graphs. Comput Graph Forum 27(3):975–982
von Landesberger T, Kuijper A, Schreck T, Kohlhammer J, van Wijk JJ, Fekete JD, Fellner DW (2011) Visual analysis of large graphs: state-of-the-art and future research challenges. Comput Graph Forum 30(6):1719–1749
Leskovec J, Lang K, Dasgupta A, Mahoney M (2009) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math 6:29–123
Moscovich T, Chevalier F, Henry N, Pietriga E, Fekete J (2009) Topology-aware navigation in large networks. In: CHI '09: Proceedings of the 27th international conference on human factors in computing systems. ACM, New York, pp 2319–2328
Ortmann M, Klimenta M, Brandes U (2016) A sparse stress model. In: Graph drawing and network visualization – 24th international symposium, GD 2016, Athens, revised selected papers, pp 18–32
Purchase HC (1997) Which aesthetic has the greatest effect on human understanding? In: Proceedings of the 5th international symposium on graph drawing (GD'97). LNCS. Springer, pp 248–261
Quigley A (2001) Large scale relational information visualization, clustering, and abstraction. PhD thesis, Department of Computer Science and Software Engineering, University of Newcastle
Rusu A (2013) Tree drawing algorithms. In: Tamassia R (ed) Handbook of graph drawing and visualization, chap 5. CRC Press, Boca Raton, pp 155–192
Sugiyama K, Tagawa S, Toda M (1981) Methods for visual understanding of hierarchical systems. IEEE Trans Syst Man Cybern SMC-11(2):109–125
Tamassia R (2013) Handbook of graph drawing and visualization. Chapman & Hall/CRC, Boca Raton
Tunkelang D (1999) A numerical optimization approach to general graph drawing. PhD thesis, Carnegie Mellon University
Vismara L (2013) Planar straight-line drawing algorithms. In: Tamassia R (ed) Handbook of graph drawing and visualization, chap 6. CRC Press, Boca Raton, pp 193–222
Walshaw C (2003) A multilevel algorithm for force-directed graph drawing. J Graph Algorithms Appl 7:253–285
Wattenberg M (2006) Visual exploration of multivariate graphs. In: Proceedings of the SIGCHI conference on human factors in computing systems (CHI'06). ACM, New York, pp 811–819

Green Big Data

Energy Implications of Big Data
itself. Each tool has its own application and advantages. The Hadoop ecosystem comprises software for data processing such as Apache Hive, Apache Spark, and Apache Drill; Apache Oozie for job orchestration; Apache Mahout and Apache Spark's MLlib for machine learning applications; and many others.

Hadoop History
The Apache Hadoop project started as a part of Apache Nutch, a scalable open-source web crawler project. Nutch was created in 2002 as a component of the Apache Lucene initiative. Doug Cutting was the creator of Apache Lucene and started Nutch trying to build a web crawler to democratize search engine algorithms. He soon realized that in order to do that he would need to scale Nutch to the complexity of the Internet. Hadoop was the answer to this quest.

Doug Cutting conceived Hadoop inspired by Google's 2003 and 2004 papers on Google's distributed file system (GFS) (Ghemawat et al. 2003) and distributed programming paradigm MapReduce (Dean and Ghemawat 2008). The GFS paper described the foundations of distributed storage for data-intensive applications used by the company. It explained how to deploy such a platform on commodity hardware while providing fault tolerance and scalable performance. The MapReduce paper introduced a straightforward paradigm to do distributed programming using the GFS architecture. The programmer would be responsible for providing a Map and a Reduce function to process the data, while all resource allocation and application management were handled by the platform. These two components put together made it possible to scale Nutch, and Doug soon realized that Hadoop's scalability had great applicability to other areas.

In 2006, Hadoop emerged as an independent Apache project, and around that same time, Doug Cutting joined Yahoo!. The company saw the tool's potential and created a dedicated team to develop it. The name Hadoop was chosen by Doug, inspired by his kid's toy, a yellow stuffed elephant. By 2008, Hadoop was already a top-level Apache project, and by 2010 a whole ecosystem of tools was created around Hadoop, such as Apache HBase, Hive, and Pig.

The first version of Hadoop, known as Hadoop 1, consisted of two main components: MapReduce and HDFS. In this initial version, MapReduce was responsible for managing job resources. Around 2012, YARN was introduced in Hadoop 2, making it much more polyvalent and open to new processing engines such as Apache Spark. Currently, the use of the MapReduce paradigm for data processing has declined, being replaced by higher-level APIs such as Hive or by more dynamic processing tools like Spark. HDFS remains a widely adopted distributed file system.

Hadoop Components

HDFS
The Hadoop Distributed File System is an abstraction layer between the regular operating system file system and a data processing framework such as MapReduce, Spark, or HBase. It was written in Java in a shared-nothing master/worker architecture. In Hadoop, computers are nodes in a cluster, and they exchange information through the network (usually using TCP/IP socket communication). The master node orchestrates the cluster and the workers do the computation. HDFS was designed to handle big datasets while providing high availability and fault tolerance.

The master, the NameNode, is responsible for storing all file and directory metadata, such as file access permissions and file locations. The file locations are not stored persistently on disk in the NameNode; they stay in memory for quicker access. Note that this limits the amount of data that it is possible to store in a Hadoop system: the number of files cannot be larger than what the NameNode can store in memory. This is also one of the reasons why HDFS deals better with large files than with a large number of small files; small files have greater chances of overloading the NameNode's memory. Usually, each file, block, or directory takes about 150 bytes of memory.

The Secondary NameNode exists to help the NameNode, by copying the NameNode file metadata. Note that these copies are not guaranteed
to be updated, so in case of failure, the Secondary NameNode is not a hot standby to the NameNode. With Apache ZooKeeper, however, it is possible to arrange for the Secondary NameNode to be a hot standby. It is also possible to have multiple NameNodes taking care of multiple clusters together with the so-called HDFS Federation, a concept introduced in Hadoop 2.

The worker DataNodes are responsible for storing and processing data. When storing a file in HDFS, each file is divided into blocks, and each block is replicated and stored separately in different DataNodes. The default replication factor is three and the default block size is 128 MB. The blocks are actually located in the regular OS file system, but one accesses them through HDFS's API.

Replicating data is important for the failover mechanism, for high availability, and for load balancing. Node failures are detected by the Hadoop system with a heartbeat type of signaling. Each DataNode sends heartbeat signals to the NameNode every 3 s. If a DataNode fails, the other two replicas can be used to ensure high availability. In that case, when the NameNode stops receiving the heartbeat signal, it will copy the lost DataNode's data into another DataNode to maintain the default replication factor. To replicate the lost data, the NameNode uses one of the other two copies of the data available in HDFS. Now, assume that no DataNode is failing but many of them are overloaded with processing jobs. Having three data replicas helps ensure that processing power will be available to new jobs, and that processing is distributed more or less equally among nodes. Furthermore, it helps ensure the data locality principle of sending the job to the data (and not otherwise) so as to not overload the network traffic.

There are three main ways of interacting with HDFS: a command-line interface, a Java API, and a web interface with Apache Hue. The command-line interface is similar to POSIX, presenting commands such as mkdir, ls, and cat. When writing a file into HDFS with the command-line interface, a file can be inserted in the system with the following command:

$ hadoop fs -put my_file.txt

Under the hood, the client is sending a message to the NameNode asking permission to write a file. If the client's permission is granted, the NameNode answers with DataNode locations to write the blocks. The file my_file.txt is then divided into blocks, and each block is copied to a different DataNode. Note that the file does not pass through the NameNode itself, which would cause a bottleneck in the system. Instead, the client sends the block to the closest DataNode, and this DataNode in turn sends the block to the second closest DataNode, and so on.

As mentioned, Hadoop replicates each block three times. This, as with most Hadoop parameters, can be configured by the cluster administrator. Two data blocks are copied to different machines on the same rack, trying to reduce the network traffic as much as possible. This feature is called rack awareness. A third copy is placed on a different rack to ensure data availability in case of a rack failure. To ensure data integrity, a checksum of the data is calculated every time it is read or written. All of these actions are handled by the system and are transparent to the user.

YARN
YARN was introduced in Hadoop version 2 to separate the resource management function from the processing layer of Hadoop, thus providing a broader framework. Because of YARN, Hadoop was able to support other processing models such as iterative programming or graph processing. The platform contains a Resource Manager to distribute the resources of the cluster among all applications. YARN provides each application with an ApplicationMaster, which talks with the Resource Manager to negotiate resources. Another important entity is the NodeManager, which is responsible for launching and monitoring each application.

MapReduce
MapReduce is a programming paradigm written in Java to run on top of HDFS, inspired by Google's MapReduce paper (Dean and Ghemawat 2008). It was mainly envisioned for batch processing on embarrassingly parallel problems, where the two functions Map and
Reduce process data in a resilient and distributed sources for the task in the same DataNode where
manner. Many real-world computer science data is located. If it can’t, it will try a node on
problems have already been modeled into the the same rack, and if that is not available as well,
MapReduce paradigm (Rajaraman and Ullman it will go for any other node in the cluster. It is
2011). important to highlight that the Resource Manager
Suppose that a group of people want to count will execute the Map and Reduce functions in a
how many books are there in a library. They separate JVM, so in case of any failures, the rest
decide to divide the work, and each person is of the system should not be affected.
responsible for counting books in certain shelves. There will be one Map task per block of
The group finally gets together and sum up each data. The tracking of tasks is made by YARN
individual count to obtain an overall amount. through heartbeat signals sent from the tasks to
This same principle is used in the MapReduce the Resource Manager. The system executes the
framework. The Map phase applies operations in tasks in parallel, and if any of the tasks fail,
separated chunks of the dataset, and the Reduce it simply re-executes that single task, and not
phase can be used to combine results appropri- the entire application program. The intermediate
ately. The Map and Reduce functions come from results of Map tasks are written persistently to
functional programming and are easily paralleliz- disk, and not into HDFS. This is done because it
In this paradigm, the user is responsible for specifying a Map and a Reduce function and the input and output locations to read and write data. All distributed programming details, such as input partitioning, job failures, and resource scheduling, are handled by the system so that the user can focus on processing data. This framework allows users with no distributed programming experience to do distributed computing.

MapReduce operates with key/value pairs. The input of a Map function should be in the key/value format, such as (K1, V1), and the output a list of key/value pairs, list(K2, V2). The Reduce function should receive as input a tuple (K2, list(V2)) and return as output a list of key/value pairs, for example, list(K3, V3). Note that the input of the Reduce function must match the output of the Map function, being an aggregation of values by key. Other than that, key/value pairs can be of any kind.

Given a Map and a Reduce function, the system will partition the dataset in blocks, as mentioned before, and create several Map and Reduce tasks to run on the data. The client application will talk to an agent called the Resource Manager, from the YARN component, to ask for resources to run the application. The Resource Manager will try to allocate a resource container in a DataNode close to the data, based on the data locality principle.

It would be too expensive to write all intermediate data into HDFS, and in case the data is lost, it is less expensive to just run the Map task again. Intermediate results are written to disk and sorted in a phase known as Shuffle. All data with the same key is grouped together and fed to a single Reduce task using a hash function. One Reduce task might receive input from many or all Map tasks. The Reduce task might be in the same or in another DataNode in the system; this is determined by the Resource Manager. The Reduce phase does not wait for all Map tasks to finish to start shuffling data, but it has to wait for all Map tasks to finish to complete the shuffle. The shuffling phase is usually the most time-consuming part of a MapReduce job, since possibly all data has to be ordered. To help optimize this phase, the MapReduce framework provides Combiners to aggregate and reduce data before sending it to the Reduce function. Combiners can be applied in some MapReduce jobs. Finally, once Reduce is done, its output is written to HDFS.

Several packages and libraries were created to support the development of MapReduce applications. The Hadoop Streaming component provides an API for users to write Map and Reduce functions in languages other than Java, like Python, Ruby, or C. A library, MRUnit, was created with JUnit to support testing in MapReduce.
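The Map/Reduce contract and the shuffle described above can be imitated in a few lines of plain Python. This is a single-process sketch for illustration, not the Hadoop API; names such as `run_job` are ours:

```python
from collections import defaultdict

def map_fn(_, line):
    """Map: (K1=offset, V1=line) -> list of (K2=word, V2=1)."""
    return [(word, 1) for word in line.split()]

def combine_fn(key, values):
    """Combiner: pre-aggregate one Map task's output before the shuffle."""
    return [(key, sum(values))]

def reduce_fn(key, values):
    """Reduce: (K2, list(V2)) -> list of (K3=word, V3=count)."""
    return [(key, sum(values))]

def run_job(records, mapper, reducer, combiner=None):
    groups = defaultdict(list)
    # Map phase: each input record is processed independently,
    # as separate Map tasks would process their input splits.
    for k1, v1 in records:
        local = defaultdict(list)
        for k2, v2 in mapper(k1, v1):
            local[k2].append(v2)
        # The optional combiner runs on each Map task's local output.
        for k2, vs in local.items():
            pairs = combiner(k2, vs) if combiner else [(k2, v) for v in vs]
            for k, v in pairs:
                groups[k].append(v)   # shuffle: group all values by key
    # Reduce phase: one reducer call per key group, keys in sorted order.
    out = []
    for k2 in sorted(groups):
        out.extend(reducer(k2, groups[k2]))
    return dict(out)

counts = run_job([(0, "big data big deal"), (1, "big data")],
                 map_fn, reduce_fn, combine_fn)
# counts == {'big': 3, 'data': 2, 'deal': 1}
```

With or without the combiner the final counts are identical; the combiner only shrinks what crosses the (here simulated) network during the shuffle.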
Hadoop Software 917
The Apache Oozie project can be useful for the development of applications that use several Map and Reduce tasks sequentially, since it works as a job orchestrator and scheduler.

MapReduce Limitations
The MapReduce paradigm facilitates distributed programming, with the idea of breaking down a dataset and applying the same function in parallel to all portions of the data. There are some drawbacks nonetheless. MapReduce is mainly applicable to embarrassingly parallel programs, where communication between processes is minimal and the algorithm is easily parallelized. To do iterative analysis or communication-intensive applications, users have to concatenate several Map and Reduce functions sequentially, which may greatly increase the running time. The reason for this is that each time a Map function sends data to a Reduce function, it writes data on disk. Therefore, the overhead of writing to and reading from disk makes iterative analysis on MapReduce less attractive. Apache Spark was developed to overcome this issue. It processes data in memory, applying functions sequentially without writing to disk. It creates a directed acyclic graph of tasks, where the state of the transformations applied to the data is saved for fault tolerance. This mechanism, together with Spark's simple API, made it a great resource for data processing. In 2016, Hadoop's creator Doug Cutting declared that Spark is the evolution of the MapReduce framework (Cutting 2016).

Examples of Application
Google publicly announced in 2004 that it used MapReduce on its search engine with more than 20 TB of data, as well as in clustering applications for Google News, construction of graph applications, and several others (Rajaraman and Ullman 2011).
Hadoop has also been used extensively by Yahoo!. In 2008, Yahoo! announced that it used a 10,000-core cluster to index its search engine, and in 2009 it could sort 1 terabyte of data in 62 s (White 2015). By 2011, Yahoo! had a 42,000-node Hadoop cluster with petabytes of storage (Harris 2013).

References

Cutting D (2016) https://www.youtube.com/watch?v=Phjif53vAhM. Accessed 20 Oct 2017
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113. https://doi.org/10.1145/1327452.1327492
Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. SIGOPS Oper Syst Rev 37(5):29–43. https://doi.org/10.1145/1165389.945450
Harris D (2013) https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/. Accessed 20 Oct 2017
Rajaraman A, Ullman JD (2011) Mining of massive datasets. Cambridge University Press, New York
White T (2015) Hadoop: the definitive guide, 4th edn. O'Reilly Media, Sebastopol

Hadoop Ecosystem

Hadoop
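The Hadoop Streaming component mentioned in this entry passes records between the framework and user programs as tab-separated text lines, with a sort between the Map and Reduce stages. A minimal Python pair in that style, written here as functions over lists rather than processes reading stdin, purely for illustration:

```python
def streaming_mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper prints."""
    out = []
    for line in lines:
        for word in line.split():
            out.append(f"{word}\t1")
    return out

def streaming_reducer(sorted_records):
    """Consume records already sorted by key, summing counts per word."""
    results, current, total = [], None, 0
    for rec in sorted_records:
        key, value = rec.split("\t")
        if key != current:
            if current is not None:
                results.append((current, total))
            current, total = key, 0
        total += int(value)
    if current is not None:
        results.append((current, total))
    return results

# The framework sorts mapper output by key before the reducer sees it,
# simulated here with sorted().
mapped = streaming_mapper(["big data big deal", "big data"])
reduced = streaming_reducer(sorted(mapped))
# reduced == [('big', 3), ('data', 2), ('deal', 1)]
```

In an actual Streaming job, the two functions would read `sys.stdin` and print to stdout, and Hadoop would run them as external processes wired together by the shuffle.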
temperature, static electricity, and humidity surrounding the IT facilities could impact the reliability of the hardware, where a harsh environment could significantly reduce the durability and raise the failure rates of storage components.
• Storage media (Mielke et al. 2008; Pinheiro et al. 2007; Schroeder and Gibson 2007; Zaman 2016; Furrer et al. 2017): Different storage media have different reliability characteristics. In modern data storage architectures, commonly used storage media include the hard disk drive (HDD), the solid-state drive (SSD), and magnetic tape. These media could coexist in a single data storage architecture, storing different types of data according to reliability, performance, and cost-effectiveness requirements. Details of the storage media will be described in section "Reliability of Storage Media."
• Media controller (Zaman 2016; Patterson et al. 1988; Taylor 2014): Above the very bottom storage media, there could be another layer of hardware, named the media controller, that is designed to improve the reliability and performance of the storage facilities. For example, multiple hard disk drives could be grouped by a RAID controller to build a RAID array, in which different configurations can be applied to improve the throughput performance and the reliability of the data, respectively. For solid-state drives, a media controller (or SSD controller) is a part of the SSD firmware, in which error correction is conducted using error correction coding technology, and wear leveling is conducted to make sure all the flash cells experience the same amount of erase/write cycles so that the chip ages evenly. Different error correction codes and wear leveling algorithms can be applied to the SSD controller by reprogramming it to maximize the reliability of the drive.
• Processing hardware (Dabrowski 2009; Pinheiro et al. 2007; Schroeder et al. 2016): Unexpected non-storage hardware failures, such as a motherboard or RAM failure, could also cause loss of data in the data storage architecture, especially when the data writing process is in progress. Compared to the reliability of storage media, these hardware parts cause less impact on the reliability of the data (Pinheiro et al. 2007; Schroeder et al. 2016), and techniques regarding this type of failure have been well developed for a fairly long period of time (Dabrowski 2009).
• Software reliability mechanism: Different reliability assurance approaches are also applied in data management software and operating systems to make sure that hardware malfunctions cause minimal impact on the application and the data. In this field, many techniques have been developed for decades, such as data replication (Li et al. 2015), erasure coding (Rashmi et al. 2014), checkpoint/snapshot (Dabrowski 2009), and recovery (Rashmi et al. 2013). Details of this topic are out of scope of this entry and will be covered in chapter "Data Replication and Encoding" of the encyclopedia.

Reliability of Storage Media

The reliability of storage media is the fundamental factor for a big data storage architecture. Simply put, storage media with lower reliability cause a higher probability and frequency of data loss and lower data availability. Given the large amounts of data stored, the impact on big data applications is significantly magnified. Even with software reliability mechanisms applied to provide certain fault tolerance and data recovery abilities, the probability that data is totally lost can still be significantly increased. Analyses have shown that most server failures in a data center are due to the storage media (Vishwanath and Nagappan 2010). As mentioned earlier, each storage medium has its own features, including response time, total throughput, failure rate, failure pattern, etc. Therefore, big data storage architectures can apply a mixed storage solution by combining HDD, SSD, and magnetic tape cartridges to provide comprehensive data storage solutions that meet the different requirements of different big data applications.
920 Hardware Reliability Requirements
For example, SSD and HDD can be applied as tier 1 storage for newly generated data or data that is frequently accessed and requires high access speed and low response time, and magnetic tape can be applied as tier 2 storage for data of very large volume, the so-called cold data, which is not accessed very frequently as it takes hours to days to access. In current large-scale data centers, this solution is widely applied (Huang et al. 2012; Harris 2014). However, with the drop of storage cost by using HDD, another combination is getting more and more popular, where SSD serves as tier 1, HDD serves as tier 2, and magnetic tape is not used. In this section, we talk about the reliability features of HDD and SSD.

HDD
The reliability of HDDs has been investigated for decades in both academia and industry, where the disk failure rate is considered the main indicator. In a big data storage architecture, environmental factors are maintained in the optimum range, and unexpected drops to the ground, as in normal home usage, will not happen. However, the wear-out of the internal physical parts could still cause a final failure of the drive. A typical hard disk's lifespan in a storage architecture is about 4 years. With the increase of the age of hard disks, existing studies have shown that the disk failure rate follows what is often called a "bathtub" curve, as shown in Fig. 1, where the disk failure rate is higher in the disk's early life (infant mortality), drops during the first year, remains relatively constant for the remainder of the disk's useful lifespan (middle life phase), and rises again at the end of the disk's lifetime (wear-out) (Xin et al. 2005). An investigation conducted on a large disk drive population by Google has observed results very consistent with this model (Pinheiro et al. 2007). For ease of representation, the International Disk Drive Equipment and Materials Association (IDEMA) proposed a compromise presentation for disk failure rates, as shown in Fig. 2, where the lifespan of each disk is divided into different life stages with different failure rates, namely, 0–3 months, 3–6 months, 6–12 months, and 1 year to the end of the design lifespan (the wear-out phase of the disk is considered out of the useful life of the disk and hence not included in this model) (IDEMA 1998). In Pinheiro et al. (2007), it is reported that the average annual replacement rate of HDDs is 2–9%. More recently in the data storage industry, hard disk reliability statistical reports are published regularly that review the failure rates of disks with different brands, capacities, performances, and prices, tracking the reliability of the disks as end users experience it (Klein 2017). The report has shown that the latest models of disks with sizes ranging from 3 to 8 TB normally have an annual failure rate between 0.33% and 3.25%, where some extreme ones can be as low as 0.15% or as high as 31.58%.

SSD
The outstanding performance in I/O bandwidth and random IOPS (input/output operations per second) has made SSD highly desirable in current big data storage architectures. Current SSDs use NAND flash memory cells to retain data by trapping electrons inside. A process called tunneling is used to move electrons in and out of the cells, but the back-and-forth traffic erodes the physical structure of the cell, leading to breaches that can render it useless (Zaman 2016). Therefore, different from HDD, an SSD's capability of storing data fails gradually, and its reliability is evaluated by the raw bit error rate (RBER) or sometimes the uncorrectable bit error rate (UBER), as it is easier to observe. Existing research regarding the reliability of SSD based on controlled lab tests under synthetic workloads predicts exponential growth of RBER with the number of program/erase cycles (Mielke et al. 2008; Cai et al. 2012). However, in actual data center environments, it is observed that RBER grows linearly with program/erase cycles and, as with HDD, is also significantly affected by the age of the SSD (Schroeder et al. 2016). Compared to HDD, SSD has a lower replacement rate, with 4–10% being replaced over a 4-year period, compared to the HDD annual replacement rate of 2–9%. However, it also has a higher probability of generating uncorrectable bits, which means that while less likely to fail during usage, SSD is more likely to lose data.
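As a rough illustration of how the annual failure rates quoted above interact with the replication techniques discussed earlier, assuming independent drive failures (a simplification; correlated failures make real loss rates higher):

```python
def annual_loss_probability(afr, replicas):
    """Probability that all `replicas` independent copies of a piece of data
    are lost to drive failures in the same year, given a per-drive annual
    failure rate `afr` (independence assumption, illustration only)."""
    return afr ** replicas

# AFR range quoted above for recent 3-8 TB drives: 0.33%-3.25% (Klein 2017).
afr_low, afr_high = 0.0033, 0.0325

single_copy_worst = annual_loss_probability(afr_high, 1)
three_replicas_worst = annual_loss_probability(afr_high, 3)
```

Even at the worst-case 3.25% AFR, three-way replication pushes the naive annual loss probability down to roughly 3.4e-5, which is why software mechanisms such as replication and erasure coding can mask relatively unreliable media.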
Hardware Reliability Requirements, Fig. 1 Failure rate curve of hard disk drive (disk failure rate over time: infant mortality, middle life phase, wear-out)

Hardware Reliability Requirements, Fig. 2 IDEMA failure rate curve of hard disk drive (disk failure rate over the life stages 0–3 months, 3–6 months, 6–12 months, and 12 months to end of lifespan)
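The staged IDEMA presentation of Fig. 2 can be modeled as a piecewise-constant failure-rate function. The per-stage rates below are placeholder values chosen only to mirror the curve's shape (elevated infant mortality settling to a lower steady rate); they are not figures from the standard:

```python
def idema_failure_rate(age_months, stage_rates):
    """Piecewise-constant failure rate over the IDEMA life stages:
    0-3 months, 3-6 months, 6-12 months, 12 months to end of design lifespan.
    `stage_rates` supplies one rate per stage."""
    boundaries = (3, 6, 12)  # upper bounds of the first three stages, months
    for bound, rate in zip(boundaries, stage_rates):
        if age_months < bound:
            return rate
    return stage_rates[3]    # final stage: 12 months to end of design lifespan

# Hypothetical annualized failure rates, one per stage (illustration only).
rates = (0.060, 0.035, 0.025, 0.020)
```

Sampling the function at, say, 1, 4, 7, and 24 months reproduces the step shape of Fig. 2: a high early rate that decreases stage by stage.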
2. From the hardware perspective, specifically, layered storage will become the trend. Firstly, as the unit price of SSD drops, it can be expected that SSD will be more widely used to maximize the I/O throughput and IOPS of the system to meet the high-velocity requirement of big data applications. Secondly, HDD will still play a major role in the storage architecture due to its higher capacity, better durability, and lower data loss rate to meet the high-volume and reliability requirements of the data. Thirdly, although magnetic tape still provides the lowest unit storage cost and largest storage capacity, the time-consuming access to data makes it less desirable for many, if not all, big data applications, and hence it could be less applied in future data storage structures.

References

Cai Y et al (2012) Error patterns in MLC NAND flash memory: measurement, characterization, and analysis. In: Design, automation & test in Europe conference & exhibition, Dresden
Dabrowski C (2009) Reliability in grid computing systems. Concurr Comput Pract Experience 21(8):927–959
Furrer S, Lantz MA, Reininger P (2017) 201 Gb/in2 recording areal density on sputtered magnetic tape. IEEE Trans Magn 99:1–1
Gersback T (2012) Technical article: energy-efficient climate control "keep cool" in the data centre. RITTAL Pty, New South Wales
Harris R (2014) Amazon's Glacier secret: BDXL. [cited 2017]; Available from https://storagemojo.com/2014/04/25/amazons-glacier-secret-bdxl/
Huang C et al (2012) Erasure coding in Windows Azure storage. In: USENIX annual technical conference, Boston
IDEMA (1998) R2–98: specification of hard disk drive reliability. IDEMA Standards
Klein A (2017) Hard drive stats for Q2 2017. [cited 2017]; Available from https://www.backblaze.com/blog/hard-drive-failure-stats-q2-2017/
Li W, Yang Y, Yuan D (2015) Ensuring cloud data reliability with minimum replication by proactive replica checking. IEEE Trans Comput 65:1–13
Loken C et al (2010) SciNet: lessons learned from building a power-efficient top-20 system and data centre. J Phys Conf Ser 256(1):1–35
Mielke N et al (2008) Bit error rate in NAND flash memories. In: IEEE international reliability physics symposium, Phoenix
Patterson D, Gibson G, Katz R (1988) A case for redundant arrays of inexpensive disks (RAID). In: ACM SIGMOD international conference on the management of data, Chicago
Pinheiro E, Weber W, Barroso LA (2007) Failure trends in a large disk drive population. In: USENIX conference on file and storage technologies, San Jose
Rashmi KV et al (2013) A solution to the network challenges of data recovery in erasure-coded distributed storage systems: a study on the Facebook warehouse cluster. In: USENIX HotStorage, San Jose
Rashmi KV et al (2014) A hitchhiker's guide to fast and efficient data reconstruction in erasure-coded data centers. In: SIGCOMM, Chicago
Schroeder B, Gibson G (2007) Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? In: USENIX conference on file and storage technologies, San Jose
Schroeder B, Lagisetty R, Merchant A (2016) Flash reliability in production: the expected and the unexpected. In: USENIX conference on file and storage technologies, Santa Clara
Taylor C (2014) SSD vs. HDD: performance and reliability. [cited 2017]; Available from http://www.enterprisestorageforum.com/storage-hardware/ssd-vs.-hdd-performance-and-reliability-1.html
Vishwanath KV, Nagappan N (2010) Characterizing cloud computing hardware reliability. In: ACM symposium on cloud computing, Indianapolis
Wood T et al (2010) Disaster recovery as a cloud service: economic benefits & deployment challenges. In: HotCloud, Boston
Xin Q, Schwarz TJE, Miller EL (2005) Disk infant mortality in large storage systems. In: IEEE international symposium on modeling, analysis, and simulation of computer and telecommunication systems, Atlanta
Zaman R (2016) SSD vs HDD: which one is more reliable? [cited 2017]; Available from https://therevisionist.org/reviews/ssd-vs-hdd-one-reliable/

Hardware-Assisted Compression

K. Sayood and S. Balkir
Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA

Definitions

Video compression: Compact representation of digital video.
Hardware-Assisted Compression 923
Discrete cosine transform (DCT): An orthonormal transform used in many compression applications.
Arithmetic coding: An entropy coding technique that is particularly useful for alphabets with a skewed probability distribution.
LZ77: A dictionary-based sequence compression algorithm which adaptively builds its dictionary through an optimal parsing of the "past" of the sequence.

Introduction

The information revolution has resulted in the ubiquity of the use of compression. As the explosion in the spread of information persists, the need for energy-efficient compression is driving the development of more and more hardware-assisted compression, despite the increasing power of processors. The need is greatest where the resource constraints are most severe – the constraints being relative to the application. Video compression deals with a huge amount of temporally sensitive information and thus is highly time-constrained. As video has become the dominant user of the available internet bandwidth, compression algorithms have become more and more complex in order to satisfy both the bandwidth constraints and the increasing quality requirements, the latter due to significant improvements in display technology. This makes software implementations of video compression algorithms increasingly insufficient for the task at hand. At the other end of the spectrum, in terms of the volume of data, are wireless sensor networks, which generally deal with significantly less data per node but have to process them under severe power constraints.
Because of the opportunistic nature of the deployment of hardware-assisted compression, it is difficult to provide a systematic description of the field; therefore, we will organize our discussion around the most popular applications – video compression, wireless sensor networks, and computer memory – and then take a somewhat meandering path through the other applications that do not fit into these categories. We will not focus on the specific hardware approaches, as each particular application has a specific set of design requirements and performance metrics.

Video and Image Compression

The general approach used in video compression standards is as follows (Sayood 2017): Each frame is divided up into blocks – often of different sizes. For most of the image frames, a prediction for the block is generated using the previous, and sometimes future, reconstructed frames. The difference is transformed using some variant of the discrete cosine transform (Rao and Yip 1990), and the coefficients are encoded using some variant of run-length encoding and an entropy coder. The reconstructed frames to be used for prediction are filtered using a loop filter. The need for hardware assistance in video compression was clear from the time of the initial video compression standards, such as H.261 (ITU-T Recommendation H.261 1993) and MPEG-1 (Mitchell et al. 1997), and hardware implementations of each part of the video coding structure have been proposed. The hardware assistance was initially focused on the discrete cosine transform, which is computationally expensive and thus tends to be a bottleneck. A survey of early efforts in this direction can be found in Pirsch et al. (1995). The discrete cosine transform (DCT) is a separable transform (Sayood 2017), which means it can be implemented by taking a one-dimensional transform of the rows followed by a one-dimensional transform of the columns, or vice versa. As this is significantly simpler than a direct implementation of the two-dimensional DCT, many of the hardware implementations are for the one-dimensional DCT (Martisius et al. 2011; Amer et al. 2005), most based on the efficient Loeffler algorithm (Loeffler et al. 1989). There are also designs for the direct implementation of the two-dimensional DCT (Chang et al. 2000; Gong et al. 2004).
The latest standard, variously known as High Efficiency Video Coding, H.265, and MPEG-H Part 2, uses a mixture of block sizes and an integer approximation to the DCT. A number of implementations for the transform used in HEVC have been proposed (Pastuszak 2014; Meher et al. 2014; Park et al. 2012).
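The separability property can be checked directly in a few lines: a 2-D DCT-II computed as row transforms followed by column transforms matches the direct double-sum definition. This is a numerical sketch of the textbook orthonormal DCT-II, not a hardware-oriented fast algorithm such as Loeffler's:

```python
import math

def dct_1d(x):
    """Orthonormal 1-D DCT-II."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def dct_2d_separable(block):
    """2-D DCT via 1-D transforms: rows first, then columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[i][j] for i in range(len(rows))])
            for j in range(len(rows[0]))]
    # cols[j][i] holds coefficient (i, j); transpose back to row-major order.
    return [[cols[j][i] for j in range(len(cols))] for i in range(len(cols[0]))]

def dct_2d_direct(block):
    """Direct double-sum definition of the 2-D DCT-II."""
    n, m = len(block), len(block[0])
    def c(k, size):
        return math.sqrt(1 / size) if k == 0 else math.sqrt(2 / size)
    return [[c(u, n) * c(v, m) * sum(
                block[i][j]
                * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                * math.cos(math.pi * (2 * j + 1) * v / (2 * m))
                for i in range(n) for j in range(m))
             for v in range(m)] for u in range(n)]

block = [[52, 55, 61, 66], [70, 61, 64, 73], [63, 59, 55, 90], [67, 85, 69, 72]]
a = dct_2d_separable(block)
b = dct_2d_direct(block)
```

For an N×N block, the separable route costs O(N^3) multiplications against O(N^4) for the direct double sum, which is exactly why row/column decomposition dominates hardware designs.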
Other common components of most video coding algorithms are the loop filter (Cho et al. 2015; Khurana et al. 2006), the motion-adaptive prediction, and the entropy coder. The latter has been a particular target for hardware-assisted design, as the coder has become more and more complex with each succeeding standard. The most widely deployed entropy coder is the context-adaptive binary arithmetic coder (CABAC). The binary arithmetic coder itself is relatively simple to implement; however, the number of contexts and the approaches for generating the contexts using rate distortion optimization have become increasingly complex and computationally expensive. Hence there is an increase in the number of hardware implementations for the CABAC encoder and decoder (Nunez-Yanez 2006; Penge et al. 2013; Chen and Sze 2015).
In recent years, a number of companies have begun marketing complete hardware solutions targeting HEVC applications. To name a few, Pixelworks has developed and is marketing HEVC Ultra HD System-on-Chip solutions for the consumer electronics industry (www.pixelworks.com), while VITEC is providing end-to-end video streaming solutions that are based on compact and portable HEVC hardware encoders (www.vitec.com). System-On-Chip Technologies offers high-performance HEVC IP cores that are suitable for integration into midrange Xilinx and Intel FPGAs (www.soctechnologies.com).
The video coding standards are targeted to the very important broadcast market. However, there are more specific applications for video compression. One application area that has been particularly active has been CMOS imagers with focal plane compression. Because of the freedom from standards, a number of innovative approaches have been proposed, such as Olyaei and Genov (2007), Leon-Salas et al. (2006), Leon-Salas et al. (2007), Chen et al. (2011), Oliveira et al. (2013), and Estevo Filho et al. (2013), which use differential encoding and vector quantization, while others (Lin et al. 2008; Cardoso and Gomes 2014; Wang and Leon-Salas 2012) use time-frequency multiresolution approaches.

Wireless Sensor Applications

The constraints on wireless sensor applications have more to do with energy than time. Compression is often essential in wireless sensor network applications, as communication costs are substantially higher than processing costs (Anastasi et al. 2009). Despite this advantage, the amount of processing required for compression still imposes a high cost in terms of energy use. This is particularly true for visual sensor networks, where, while it is imperative that the video sequence or images be compressed, the processing costs can be prohibitive. Therefore, it is entirely understandable that this area is ripe for the use of hardware-assisted compression. A survey of many different image compression approaches used in wireless sensor networks can be found in Hasan et al. (2014). Mobile sensor networks for environmental monitoring tend to be small and miserly in their use of energy, making the use of hardware-assisted compression useful (Sarmiento et al. 2010). Physiological sensors are another obvious target for hardware-assisted compression. Wearable devices such as cardiac monitors (Alvarado et al. 2012; Deepu and Lian 2015; Deepu et al. 2016), which need to function over a long period of time, have strict energy constraints which can be best satisfied with hardware-assisted compression. This is also true for sensors which need to function inside the body, such as capsules used in wireless endoscopy (Karargyris and Bourbakis 2010). In this procedure, a capsule containing a video sensor (Wahid et al. 2008; Turcza and Duplaga 2013) records images of the gastrointestinal tract as it passes through the patient.

Memory Systems

As programs and datasets have increased in size and complexity, computer memory has become more and more crowded, leading to increased levels of paging.
Paging in memory systems can cause a considerable slowdown in many applications. It has long been recognized that compression in memory systems can help alleviate this problem. Issues that are specific to compression in memory include speed and the heterogeneity of the data to be compressed. To address these problems, IBM developed the IBMLZ1 compression chip (Cheng and Duyanovich 1995), which used the LZ77 (Ziv and Lempel 1977) algorithm and a Huffman coder (Sayood 2017) to provide compression. This algorithm uses the patterns contained in the history of the sequence as a dictionary to compress the current values. This makes the compression scheme locally adaptive and addresses the data heterogeneity issue. In their chip (as in other technologies using dictionary-based coding), the dictionary is stored in content-addressable memory. The first commercial system to use compression to help reduce the memory load was the IBM Memory Expansion Technology (MXT) (Tremaine et al. 2001; Franaszek et al. 2001), which again used the LZ77 algorithm for compression. Latency issues were handled by using a 32 MB cache. Since the appearance of the MXT technology, memory sizes have grown considerably, and the complexities of the LZ77 algorithm can result in high latencies. A number of algorithms have been developed to reduce the latency by using modified forms of the algorithm (Kjelso et al. 1996; Benini et al. 2002; Sardashti and Wood 2014), with the most well-known being the Frequent Pattern Compression (FPC) schemes (Ekman and Stenstrom 2005). More recent approaches, such as Zhao et al. (2015), use a multifaceted approach where the data is compressed using one of several simple, low-latency schemes. An excellent primer on compression in memory can be found in Sardashti et al. (2015). An interesting recent proposal is to do away with the requirement that all contents of memory be accurately preserved (Ranjan et al. 2017). In this approach, contents of memory that can be approximated are compressed using lossy compression, which provides substantially more compression than lossless compression.

Conclusion

While most hardware-assisted compression applications fall into the three groups mentioned above, they by no means exhaust the applications of hardware-assisted compression. Consider Schmitz et al. (2017), which describes a recently developed video processing chip that, not constrained by the standards, uses differential decorrelation to allow processing at rates of 1000 frames per second, or Amarú et al. (2014), which uses logic operations to provide decorrelation, or a hardware-assisted set partitioning in hierarchical trees (SPIHT) (Said and Pearlman 1996) implementation (Kim et al. 2015). As devices become smaller and more functional, the role of hardware-assisted compression is likely to spread much farther.

Cross-References

Data Replication and Encoding
Delta Compression Techniques
Dimension Reduction
Grammar-Based Compression
Graph Compression
In-Memory Transactions

References

Alvarado AS, Lakshminarayan C, Principe JC (2012) Time-based compression and classification of heartbeats. IEEE Trans Biomed Eng 59(6):1641–1648
Amarú L, Gaillardon P-E, Burg A, De Micheli G (2014) Data compression via logic synthesis. In: 19th Asia and South Pacific design automation conference (ASP-DAC). IEEE, pp 628–633
Amer I, Badawy W, Jullien G (2005) A high-performance hardware implementation of the H.264 simplified 8×8 transformation and quantization [video coding]. In: IEEE international conference on acoustics, speech, and signal processing, proceedings (ICASSP'05), vol 2. IEEE, pp ii–1137
Anastasi G, Conti M, Francesco MD, Passarella A (2009) Energy conservation in wireless sensor networks: a survey. Ad Hoc Netw 7(3):537–568
Benini L, Bruni D, Macii A, Macii E (2002) Hardware-assisted data compression for energy minimization in systems with embedded processors. In: Proceedings of the conference on design, automation and test in Europe. IEEE Computer Society, p 449
Cardoso BB, Gomes JGRC (2014) CMOS imager with focal-plane image compression based on the EZW algorithm. In: 2014 IEEE 5th Latin American symposium on circuits and systems, pp 1–4
Chang T-S, Kung C-S, Jen C-W (2000) A simple processor core design for DCT/IDCT. IEEE Trans Circuits Syst Video Technol 10(3):439–447
Cheng J-M, Duyanovich LM (1995) Fast and highly reliable IBMLZ1 compression chip and algorithm for storage. In: IEEE [IEE95], pp 143–154
Chen YH, Sze V (2015) A deeply pipelined CABAC decoder for HEVC supporting level 6.2 high-tier applications. IEEE Trans Circuits Syst Video Technol 25(5):856–868
Chen S, Bermak A, Wang Y (2011) A CMOS image sensor with on-chip image compression based on predictive boundary adaptation and memoryless QTD algorithm. IEEE Trans Very Large Scale Integr (VLSI) Syst 19(4):538–547
Cho S, Kim H, Kim HY, Kim M (2015) Efficient in-loop filtering across tile boundaries for multi-core HEVC hardware decoders with 4K/8K-UHD video applications. IEEE Trans Multimedia 17(6):778–791
Deepu CJ, Lian Y (2015) A joint QRS detection and data compression scheme for wearable sensors. IEEE Trans Biomed Eng 62(1):165–175
Deepu CJ, Zhang XY, Wong DLT, Lian Y (2016) An ECG-on-chip with joint QRS detection & data compression for wearable sensors. In: 2016 IEEE international symposium on circuits and systems (ISCAS). IEEE, pp 2908–2908
Ekman M, Stenstrom P (2005) A robust main-memory compression scheme. In: ACM SIGARCH computer architecture news, vol 33. IEEE Computer Society, pp 74–85
Estevo Filho R de M, Gomes JGRC, Petraglia A (2013) Codebook improvements for a CMOS imager with focal-plane vector quantization. In: 2013 IEEE 4th Latin American symposium on circuits and systems (LASCAS), pp 1–4
Franaszek PA, Heidelberger P, Poff DE, Robinson JT (2001) Algorithms and data structures for compressed-memory machines. IBM J Res Dev 45(2):245–258
Gong D, He Y, Cao Z (2004) New cost-effective VLSI implementation of a 2-D discrete cosine transform and its inverse. IEEE Trans Circuits Syst Video Technol 14(4):405–415
Hasan KK, Ngah UK, Salleh MFM (2014) Efficient hardware-based image compression schemes for wireless sensor networks: a survey. Wirel Pers Commun 77(2):1415–1436
ITU-T Recommendation H.261 (1993) Video codec for audiovisual services at p × 64 kbit/s
Karargyris A, Bourbakis N (2010) Wireless capsule endoscopy and endoscopic imaging: a survey on various methodologies presented. IEEE Eng Med Biol Mag 29(1):72–83
Khurana G, Kassim AA, Chua TP, Mi MB (2006) A pipelined hardware implementation of in-loop deblocking filter in H.264/AVC. IEEE Trans Consum Electron 52(2):536–540
Kim S, Lee D, Kim H, Truong NX, Kim J-S (2015) An enhanced one-dimensional SPIHT algorithm and its implementation for TV systems. Displays 40:68–77
Kjelso M, Gooch M, Jones S (1996) Design and performance of a main memory hardware data compressor. In: Beyond 2000: proceedings of the 22nd EUROMICRO conference hardware and software design strategies, EUROMICRO 96. IEEE, pp 423–430
Leon-Salas WD, Balkir S, Sayood K, Hoffman MW, Schemm N (2006) A CMOS imager with focal plane compression. In: Proceedings of IEEE international symposium on circuits and systems, ISCAS 2006. IEEE, 4 pp
Leon-Salas WD, Balkir S, Sayood K, Schemm N, Hoffman MW (2007) A CMOS imager with focal plane compression using predictive coding. IEEE J Solid-State Circuits 42(11):2555–2572
Lin Z, Hoffman MW, Leon WD, Schemm N, Balkir S (2008) A CMOS image sensor with focal plane SPIHT image compression. In: 2008 IEEE international symposium on circuits and systems, pp 2134–2137
Loeffler C, Ligtenberg A, Moschytz GS (1989) Practical fast 1-D DCT algorithms with 11 multiplications. In: 1989 international conference on acoustics, speech, and signal processing, ICASSP-89. IEEE, pp 988–991
Martisius I, Birvinskas D, Jusas V, Tamosevicius Z (2011) A 2-D DCT hardware codec based on the Loeffler algorithm. Elektronika ir Elektrotechnika 113(7):47–50
Mitchell JL, Pennebaker WB, Fogg CE, LeGall DJ (1997) MPEG video compression standard. Chapman and Hall, London
Meher PK, Park SY, Mohanty BK, Lim KS, Yeo C (2014) Efficient integer DCT architectures for HEVC. IEEE Trans Circuits Syst Video Technol 24(1):168–178
Nunez-Yanez YL, Chouliaras VA, Alfonso D, Rovati FS (2006) Hardware assisted rate distortion optimization with embedded CABAC accelerator for the H.264 advanced video codec. IEEE Trans Consum Electron 52(2):590–597
Oliveira FDVR, Haas HL, Gomes JGRC, Petraglia A (2013) CMOS imager with focal-plane analog image compression combining DPCM and VQ. IEEE Trans Circuits Syst I Regul Pap 60(5):1331–1344
Olyaei A, Genov R (2007) Focal-plane spatially oversampling CMOS image compression sensor. IEEE Trans Circuits Syst I Regul Pap 54(1):26–34
Park J-S, Nam W-J, Han S-M, Lee S-S (2012) 2-D large inverse transform (16×16, 32×32) for HEVC (high efficiency video coding). JSTS: J Semicond Technol Sci 12(2):203–211
Pastuszak G (2014) Hardware architectures for the H.265/HEVC discrete cosine transform. IET Image Process 9(6):468–477
Hardware-Assisted Transaction Processing 927
or 1000s of cores in a processor). Today's commodity server hardware, on which the majority of data management applications, including OLTP, run, typically has multiple multicore processors. Such hardware is called multisocket multicore hardware.

Overview

Transaction processing has been at the forefront of many influential advancements in the data management ecosystem (Diaconu et al. 2013; Exadata 2015; Kongetira et al. 2005; Stonebraker et al. 2007). In recent years, many of these advancements have been triggered by developments in the computer architecture community: multicore parallelism, large main memories, hardware transactional memory, nonvolatile memory, remote direct memory access, etc. These developments led to a rethinking of the overall system design of transaction processing systems. In turn, several specialized transaction processing systems have evolved, adopting a variety of system designs, each with their unique pros and cons. The focus of this entry is to survey the techniques proposed in recent years to scale up transaction processing on modern multicore hardware and to assess how suitable these techniques are in the context of emerging many-core hardware and trends in the computer architecture community in general.

Up until around 2005, software systems benefited easily from the advancements in the hardware industry. Thanks to Moore's law (Moore 1965), computer architects were able to boost the performance of processors by adding more and more complexity within a core (e.g., aggressive pipelining, superscalar execution, out-of-order execution, and branch prediction). Around 2005, however, power draw and heat dissipation prevented processor vendors from leaning on more complex microarchitectural techniques for higher performance. Instead, they started adding more cores on a single processor, creating multicores, to enable exponentially increasing parallelism (Olukotun et al. 1996). As a result, software developers no longer had the luxury of being blind to the advancements in hardware. Processors, and in turn software systems, did not automatically get faster with each new generation of hardware. Traditional software systems, including transaction processing, had to go through fundamental design changes to be able to exploit the kind of parallelism offered by having multiple cores in a processor.

The traditional transaction processing systems were not designed with abundant horizontal parallelism in mind. Even though they can handle several concurrent requests very efficiently while satisfying the ACID properties on a few single-core processors, their performance suffers on multicore architectures due to their tightly coupled and centralized internal components and unpredictable accesses to shared data (Johnson et al. 2014). Such components and access patterns lead to various scalability bottlenecks on multicores and hinder the parallel execution of even non-conflicting requests.

In the rest of this entry, we first categorize the communication patterns, and in turn the types of critical sections, in transaction processing to better understand the underlying scalability problems. Then, we review recent works that tackle the scalability of transaction processing systems, proposing novel system designs both based on data partitioning and avoiding data partitioning. Finally, we discuss the issues that await modern transaction processing systems on emerging hardware, identify the significant research challenges in this context, and conclude the entry.

Communication Patterns in Transaction Processing

To analyze the communication patterns of a transaction processing system, we look at the types of critical sections it executes. Transaction processing systems employ several types of communication and synchronization. Locking operates at the logical (application) level to enforce isolation and atomicity among transactions.
Hardware-Assisted Transaction Processing, Fig. 1 Communication patterns in transaction processing based on the
type of contention they create as the parallelism increases (Johnson et al. 2014)
Latching operates at the physical (database page) level to enforce the consistency of the physical data as concurrent transactions read and write this data. Finally, at the lowest levels, critical sections ensure the consistency of the system's internal state. Locks and latches form a crucial part of that internal state and are themselves protected by critical sections underneath. Therefore, analyzing the behavior of critical sections captures nearly all forms of communication in a transaction processing system.

Critical sections, in turn, fall into three categories depending on the nature of the contention they tend to cause in the system, as Fig. 1 illustrates. Fixed critical sections are the ones that a fixed number of threads use to protect their communication. Therefore, the contention is limited, and it cannot increase as the underlying parallelism increases. Producer-consumer-style communication, mainly found inside the transaction manager component of a transaction processing system, is an example of this type of critical section. At the other extreme, unbounded critical sections are the ones that all concurrent worker threads in the system might contend for. Therefore, as hardware parallelism increases, the degree of contention also has the potential to increase. Such critical sections inevitably grow into a bottleneck. Making them shorter or less frequent provides a little slack but does not fundamentally improve scalability. Locking and latching are good examples of this category of critical sections. Finally, cooperative critical sections are the ones where concurrent worker threads can aggregate their requests and execute them all at once later. They are not as lightweight as the fixed critical sections. However, they offer good resistance against contention, since adding more threads to the system gives more opportunity for threads to combine work rather than competing directly for the critical section. Batching log requests is an example of cooperative communication in a transaction processing system.

The real key to scalability lies in converting all unbounded communication to either the fixed or the cooperative type, thus removing the potential for bottlenecks to arise. The following sections analyze the various proposals for building more scalable transaction processing systems in the context of this categorization of communication patterns and critical sections. Note that this categorization is not unique to data management systems and can be used to reason about the scalability of other types of software systems as well.
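The difference between unbounded and cooperative critical sections can be sketched with a toy log buffer: under the unbounded pattern every thread enters the same global critical section once per record, while under the cooperative pattern threads only enqueue their requests and a single combiner writes the whole batch inside one critical-section entry. The following Python sketch is purely illustrative; the class and method names are our own and do not come from any cited system:

```python
import threading
from queue import Queue

class UnboundedLog:
    """Every append enters the same global critical section:
    contention grows with the thread count (the 'unbounded' pattern)."""
    def __init__(self):
        self.lock = threading.Lock()
        self.records = []
        self.critical_entries = 0

    def append(self, rec):
        with self.lock:                  # all threads contend here
            self.critical_entries += 1
            self.records.append(rec)

class CooperativeLog:
    """Threads enqueue requests; one combiner drains the queue and
    writes the whole batch in a single critical-section entry."""
    def __init__(self):
        self.queue = Queue()
        self.lock = threading.Lock()
        self.records = []
        self.critical_entries = 0

    def append(self, rec):
        self.queue.put(rec)              # no shared critical section here

    def combine(self):
        with self.lock:                  # one entry flushes many requests
            self.critical_entries += 1
            while not self.queue.empty():
                self.records.append(self.queue.get())

def run(log, n_threads=4, n_ops=100):
    def worker(tid):
        for i in range(n_ops):
            log.append((tid, i))
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if isinstance(log, CooperativeLog):
        log.combine()
    return log

unbounded = run(UnboundedLog())
cooperative = run(CooperativeLog())
print(len(unbounded.records), unbounded.critical_entries)     # 400 400
print(len(cooperative.records), cooperative.critical_entries) # 400 1
```

Both variants persist the same 400 records, but the cooperative one pays for a single critical-section entry regardless of the number of threads, which is why batching log requests resists contention as parallelism grows.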
Key Research Findings

Proposals that depart from traditional transaction processing in order to scale up on multicores can be grouped into two broad categories. The first category applies some sort of data partitioning, physical or logical, to bound the concurrent accesses on data. The second category prefers not to depend on data partitioning and instead relies on lighter-weight concurrency control mechanisms, lock-free algorithms, and multiversioning. The former has the potential to eliminate the most problematic critical sections in a system but requires the data to be almost perfectly partitionable. The latter does not require the data to be perfectly partitionable but does not eliminate the critical sections either; it instead reduces or minimizes their impact. More importantly, both approaches face challenges when it comes to scaling up on multisocket multicore (Porobic et al. 2012) or many-core hardware (Yu et al. 2014).

Scaling Up with Partitioning
There are various ways to partition the data, from extreme physical partitioning to lightweight logical partitioning to hybrid partitioning that combines the two. This section briefly surveys these partitioning mechanisms, going over some of the systems that adopt them.

Physical Partitioning
Shared-nothing systems physically partition the data and deploy multiple database instances. In this setup, one can tune which part of the data belongs to which instance and the number of threads assigned to each instance. Therefore, physically partitioned systems can regulate the data access patterns and the contention on the data. Systems like VoltDB (http://www.voltdb.com) (the commercial version of H-Store (Stonebraker et al. 2007)) apply the idea of physical partitioning to the extreme and deploy single-threaded database instances. When a workload is perfectly partitionable, i.e., each transaction is handled by a single database instance and data accesses are balanced across the instances, this design eliminates all unbounded communication within the database engine and has the potential to achieve perfect scalability.

When a workload is not amenable to partitioning, though, physical partitioning suffers from multi-partition/distributed transactions (Curino et al. 2010; Helland 2007; Pavlo et al. 2011; Porobic et al. 2012) or load imbalance (Pavlo et al. 2012). Dynamic repartitioning to minimize distributed transactions or to balance load is highly costly due to the amount of data to be moved from one partition to another and the reorganization of the index structures (Serafini et al. 2014; Tözün et al. 2013). In addition, traditional physiological logging (Mohan et al. 1992) does not favor this design, since maintaining a partitioned log has high overhead, especially for non-partitionable workloads (Johnson et al. 2012). Therefore, physically partitioned systems either use a shared log (Lomet et al. 1992), or eliminate it completely (Stonebraker et al. 2007) and provide durability through replication, or adopt more lightweight forms of logging (Malviya et al. 2014).

Logical or Physiological Partitioning
Instead of physically partitioning the data and adopting a shared-nothing design, one can also just logically partition the data in a shared-everything setting to regulate the data access patterns and the contention over the shared data. More specifically, logical partitioning decentralizes the lock manager and has each partition maintain its own lock manager. Each logical partition can be owned by either a single worker thread (Pandis et al. 2010) or a few worker threads (Lahiri et al. 2001). This design eliminates the unbounded communication due to locking. In addition, dynamic repartitioning is cheap under logical partitioning, since no data movement is necessary (Tözün et al. 2013).

On the other hand, logical partitioning is not able to regulate the accesses to shared database pages. Therefore, it cannot eliminate the unbounded communication due to latching. Physiological partitioning (Pandis et al. 2011; Schall and Härder 2015) enhances logical partitioning and partitions physical data page accesses as well. This type of partitioning can eliminate the unbounded communication due to both locking and latching while still reaping the benefits of a shared-everything setting.

However, both logical and physiological partitioning introduce multi-partition transactions in a shared-everything system. Such transactions can be handled by some extra fixed communication to coordinate the actions of a transaction that belong to different partitions.
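The logical-partitioning idea — a decentralized lock manager where each partition owns its own lock table and a transaction's coordination cost is the number of partitions it touches — can be sketched as follows. This is a minimal, single-threaded illustration in the spirit of the designs cited above, not code from any of them; the class names and the byte-sum routing function are our own:

```python
class Partition:
    """A logical partition owning its own lock table, so
    single-partition transactions never touch a centralized manager."""
    def __init__(self, pid):
        self.pid = pid
        self.lock_table = {}   # key -> owning transaction id

    def lock(self, key, txn):
        if self.lock_table.get(key, txn) != txn:
            return False       # conflict, visible only inside this partition
        self.lock_table[key] = txn
        return True

    def release(self, txn):
        self.lock_table = {k: t for k, t in self.lock_table.items() if t != txn}

class LogicallyPartitionedStore:
    def __init__(self, n_partitions=4):
        self.partitions = [Partition(i) for i in range(n_partitions)]

    def partition_of(self, key):
        # deterministic routing of keys to logical partitions
        return self.partitions[sum(key.encode()) % len(self.partitions)]

    def run_txn(self, txn, keys):
        touched = {self.partition_of(k) for k in keys}
        ok = all(self.partition_of(k).lock(k, txn) for k in keys)
        if not ok:
            for p in touched:
                p.release(txn)
        # len(touched) > 1 means extra coordination is needed
        return ok, len(touched)

store = LogicallyPartitionedStore()
ok1, parts1 = store.run_txn("t1", ["a"])       # single-partition transaction
ok2, parts2 = store.run_txn("t2", ["a", "b"])  # spans partitions; "a" is held by t1
print(ok1, parts1, ok2, parts2)
```

A transaction that stays inside one partition contends only with co-located transactions, while a multi-partition transaction pays the fixed coordination cost the text describes.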
While such fixed communication is usually lightweight and not a burden for scalability, as section "Communication Patterns in Transaction Processing" describes, it has to be handled within a processor socket as much as possible. Otherwise, such fixed communication can easily become a bottleneck as well due to nonuniform memory accesses, as Porobic et al. (2014) show and as section "Future Directions of Research: Toward Many Cores" elaborates.

Scaling Up Without Partitioning
While partitioning-based techniques have been quite popular, recent years have seen the rise of systems work that relies on lock-free techniques for maintaining the consistency of internal data structures and on optimistic and multiversion concurrency control to ensure the isolation and atomicity of concurrent transactions. There are various commercial transaction processing systems that follow this approach: HyPer (Kemper et al. 2013), Hekaton (Diaconu et al. 2013), MemSQL (http://www.memsql.com/), SAP HANA (Lee et al. 2013), etc. In addition to typical OLTP workloads, these systems also benefit workloads that have short-running read/write transactions running concurrently with long-running read-only transactions or queries, or workloads that need hybrid transactional/analytical processing (HTAP), since they do not block transactions with pessimistic locking techniques.

However, these systems do not eliminate the unbounded communication altogether. Rather, they minimize the time spent on such communication through optimistic concurrency control, decentralized record locks, frequent use of atomic operations, read-copy-update mechanisms, or hardware transactional memory (Larson et al. 2011; Leis et al. 2014; Levandoski et al. 2013; Sewall et al. 2011; Tu et al. 2013). Even though these mechanisms perform and scale well on architectures with a few (one to four) processor sockets without depending on data partitioning, scaling them up on larger multisocket multicore hardware or on emerging many-core processors is not straightforward (David et al. 2013; Porobic et al. 2014; Yu et al. 2014). In addition, similarly to the statically partitioned designs, they suffer under update-heavy workloads with hotspots (Narula et al. 2014).

Future Directions of Research: Toward Many Cores

Commodity servers that data-intensive applications typically run on have been offering more and more opportunities for parallelism over the years, thanks to Moore's law, even though its form and level have been changing over time. As mentioned in section "Overview", this parallelism was mainly implicit (more complexity within a core) until roughly a decade ago. Multicores, on the other hand, offer explicit parallelism where tasks can run concurrently. A software system has to exploit both types of parallelism if one wants to utilize modern hardware fully.

Nowadays we tend to have multiple multicore processors in server hardware, typically two to four sockets on commodity servers. Such hardware does not offer uniform core-to-core communication anymore. Cores within a processor socket tend to communicate at least an order of magnitude faster than cores on different sockets (David et al. 2013), creating nonuniform communication within a server. In the future, we expect even more cores in a processor and even more processors in commodity servers. Core-to-core access latency is going to be nonuniform even within a processor socket. Therefore, system designs targeting only parallelism, and not the nonuniformity of communication across cores, are not ideal.

Unfortunately, the traditional multicore hardware design also faces challenges. Even though Moore's law still holds today, Dennard scaling (Dennard et al. 1974), which enabled keeping the power density of the transistors constant, does not. The supply voltage required to power all the transistors does not decrease at a proportional rate (Esmaeilzadeh et al. 2011).
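The breakdown of Dennard scaling can be made concrete with the classic first-order scaling arithmetic, modeling dynamic power as P ≈ C·V²·f. This is a simplified textbook model (the constants below are arbitrary), not a computation from the cited papers:

```python
# Dynamic power: P ~ C * V^2 * f; power density is P per unit area.
def power_density(c, v, f, area):
    return c * v**2 * f / area

k = 2.0  # linear shrink factor per generation
base = power_density(c=1.0, v=1.0, f=1.0, area=1.0)

# Dennard era: capacitance and voltage scale by 1/k, frequency by k,
# area by 1/k^2 -> power density stays constant.
dennard = power_density(1.0 / k, 1.0 / k, 1.0 * k, 1.0 / k**2)

# Post-Dennard: the supply voltage no longer scales down.
post = power_density(1.0 / k, 1.0, 1.0 * k, 1.0 / k**2)

print(base, dennard, post / base)  # 1.0 1.0 4.0
```

With voltage scaling, each shrink keeps the power density flat; without it, density grows by k² per generation, which is exactly the pressure that pushed designs toward more cores and, eventually, toward dark silicon.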
Putting more cores in a processor rather than more complexity within a core helped mitigate this problem up until now, but it is not going to help solve it any longer. Even if we end up incorporating more cores on a single processor, we are not going to be able to use them all at the same time. This trend is called dark silicon, and it fundamentally alters the focus of hardware designs (Hardavellas et al. 2011). In this new era, the focus has to shift toward optimizing energy per instruction.

This focus on energy and the ever-growing scale of data-intensive applications have increased the interest in building hardware components that are specialized for data-intensive processing tasks. Most of the efforts in this direction so far have targeted traditional analytics workloads (Balkesen et al. 2018; Wu et al. 2014) or, more recently, cloud search, machine learning, and AI (Putnam et al. 2014; Jouppi et al. 2017). There have been efforts toward building specialized hardware components for transaction processing as well (Johnson and Pandis 2013). Even though it might still not be feasible to design a whole chip specifically for a data-intensive application, building hardware components that accelerate frequent operations of an application might be, especially if the accelerator can benefit similar operations in other applications as well. The challenge is to find operations that would justify the cost of building specialized hardware.

Alternatively, many hardware vendors have started to ship co-processors: low-power processing units shipped together with power-hungry ones, or regular CPUs shipped together with FPGAs or GPUs on the side (https://www.arm.com/products/processors/biglittleprocessing.php) (https://www.altera.com/solutions/acceleration-hub/overview.html) (https://www.ibm.com/support/knowledgecenter/en/POWER9/p9hdx/POWER9welcome.htm). Finding ways to exploit such heterogeneous platforms is going to be another big challenge for data-intensive systems.

Conclusions

Despite various advancements in both academia and industry, tackling ever-increasing parallelism and heterogeneity on modern and emerging hardware is still a hot challenge for transaction processing systems. None of the major proposals for concurrency control is good enough for many-core hardware (1000 cores) (Yu et al. 2014), even when one assumes uniform cores. As long as there is unbounded communication in a system, even the most scalable systems on the current generation of hardware will fail to scale up on future generations. In addition, scheduling tasks on today's multisocket hardware, or on hardware with mild specialization, is not straightforward. Focusing on parallelism as the only dimension for scaling up transaction processing systems is insufficient. We have to take the nonuniformity and heterogeneity of the underlying hardware into account.

More and more emerging hardware is going to be nonuniform and heterogeneous in terms of the provided functionality. Therefore, it also becomes essential for emerging transaction processing systems to decide on the optimal design options based on the processor types they are running on. If the system is running on hardware that has a combination of different processing units (aggressive, low-power, special purpose, etc.), it should be able to automatically decide which types of requests should run on which types of processors. If the system is running on hardware that supports hardware transactional memory, it should be able to know which types of critical sections would benefit from switching the synchronization primitive to transactions and which ones should keep their existing synchronization method. If there are tasks that would benefit from running on GPUs or could be accelerated via FPGAs, the system should know when to off-load these tasks to these units. Nevertheless, this requires a thorough understanding of the specific requirements of a system that can exploit the specific features of various hardware types. In addition, it requires interdisciplinary collaborations, especially involving people from the data management, computer architecture, and compiler communities. Furthermore, to come up with more economically viable solutions in this area, we need to reach out to cloud providers so that they start building
software and hardware infrastructures with heterogeneity in mind.

Cross-References

Emerging Hardware Technologies
Hardware-Assisted Transaction Processing: NVM
In-Memory Transactions

References

Balkesen C, Kunal N, Giannikis G, Fender P, Sundara S, Schmidt F, Wen J, Agrawal S, Raghavan A, Varadarajan V, Viswanathan A, Chandrasekaran B, Idicula S, Agarwal N, Sedlar E (2018) A many-core architecture for in-memory data processing. In: SIGMOD
Curino C, Jones E, Zhang Y, Madden S (2010) Schism: a workload-driven approach to database replication and partitioning. PVLDB 3:48–57
David T, Guerraoui R, Trigonakis V (2013) Everything you always wanted to know about synchronization but were afraid to ask. In: SOSP, pp 33–48
Dennard RH, Gaensslen FH, Yu HN, Rideout VL, Bassous E, Leblanc AR (1974) Design of ion-implanted MOSFETs with very small physical dimensions. IEEE J Solid-State Circuits 9:256–268
Diaconu C, Freedman C, Ismert E, Larson PA, Mittal P, Stonecipher R, Verma N, Zwilling M (2013) Hekaton: SQL server's memory-optimized OLTP engine. In: SIGMOD, pp 1243–1254
Esmaeilzadeh H, Blem E, St Amant R, Sankaralingam K, Burger D (2011) Dark silicon and the end of multicore scaling. In: ISCA, pp 365–376
Exadata (2015) Oracle Corp.: Exadata database machine. http://www.oracle.com/technetwork/database/exadata/overview/index.html
Gray J, Reuter A (1992) Transaction processing: concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco
Hardavellas N, Ferdman M, Falsafi B, Ailamaki A (2011) Toward dark silicon in servers. IEEE Micro 31(4):6–15
Helland P (2007) Life beyond distributed transactions: an apostate's opinion. In: CIDR, pp 132–141
Johnson R, Pandis I (2013) The bionic DBMS is coming, but what will it look like? In: CIDR
Johnson R, Pandis I, Stoica R, Athanassoulis M, Ailamaki A (2012) Scalability of write-ahead logging on multicore and multisocket hardware. VLDB J 21:239–263
Johnson R, Pandis I, Ailamaki A (2014) Eliminating unscalable communication in transaction processing. VLDB J 23(1):1–23
Jouppi NP, Young C, Patil N, Patterson D, Agrawal G, Bajwa R, Bates S, Bhatia S, Boden N, Borchers A, Boyle R, Cantin P, Chao C, Clark C, Coriell J, Daley M, Dau M, Dean J, Gelb B, Ghaemmaghami TV, Gottipati R, Gulland W, Hagmann R, Ho CR, Hogberg D, Hu J, Hundt R, Hurt D, Ibarz J, Jaffey A, Jaworski A, Kaplan A, Khaitan H, Killebrew D, Koch A, Kumar N, Lacy S, Laudon J, Law J, Le D, Leary C, Liu Z, Lucke K, Lundin A, MacKean G, Maggiore A, Mahony M, Miller K, Nagarajan R, Narayanaswami R, Ni R, Nix K, Norrie T, Omernick M, Penukonda N, Phelps A, Ross J, Ross M, Salek A, Samadiani E, Severn C, Sizikov G, Snelham M, Souter J, Steinberg D, Swing A, Tan M, Thorson G, Tian B, Toma H, Tuttle E, Vasudevan V, Walter R, Wang W, Wilcox E, Yoon DH (2017) In-datacenter performance analysis of a tensor processing unit. In: ISCA, pp 1–12
Kemper A, Neumann T, Finis J, Funke F, Leis V, Mühe H, Mühlbauer T, Rödiger W (2013) Transaction processing in the hybrid OLTP&OLAP main-memory database system HyPer. IEEE DEBull 36(2):41–47
Kongetira P, Aingaran K, Olukotun K (2005) Niagara: a 32-way multithreaded SPARC processor. IEEE Micro 25(2):21–29
Lahiri T, Srihari V, Chan W, MacNaughton N, Chandrasekaran S (2001) Cache fusion: extending shared-disk clusters with shared caches. In: VLDB, pp 683–686
Larson PA, Blanas S, Diaconu C, Freedman C, Patel JM, Zwilling M (2011) High-performance concurrency control mechanisms for main-memory databases. PVLDB 5(4):298–309
Lee J, Kwon YS, Farber F, Muehle M, Lee C, Bensberg C, Lee JY, Lee A, Lehner W (2013) SAP HANA distributed in-memory database system: transaction, session, and metadata management. In: ICDE, pp 1165–1173
Leis V, Kemper A, Neumann T (2014) Exploiting hardware transactional memory in main-memory databases. In: ICDE, pp 580–591
Levandoski J, Lomet D, Sengupta S (2013) The Bw-tree: a B-tree for new hardware platforms. In: ICDE, pp 302–313
Lomet D, Anderson R, Rengarajan TK, Spiro P (1992) How the Rdb/VMS data sharing system became fast. Technical Report CRL-92-4, DEC
Malviya N, Weisberg A, Madden S, Stonebraker M (2014) Rethinking main memory OLTP recovery. In: ICDE, pp 604–615
Mohan C, Haderle D, Lindsay B, Pirahesh H, Schwarz P (1992) ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM TODS 17(1):94–162
Moore G (1965) Cramming more components onto integrated circuits. Electronics 38(6):82–85
Narula N, Cutler C, Kohler E, Morris R (2014) Phase reconciliation for contended in-memory transactions. In: OSDI, pp 511–524
934 Hardware-Assisted Transaction Processing: NVM
Hardware-Assisted Transaction Processing: NVM, Table 1 Comparison of storage technologies. (Source: ITRS 2011)

                        DRAM     PCM     STT-RAM   Memristor   NAND Flash   HDD
Write energy [pJ/bit]   0.004    6       2.5       4           0.00002      10^9
Endurance               >10^16   >10^8   >10^15    >10^12      >10^4        >10^4
Page size               64 B     64 B    64 B      64 B        4–16 KB      512 B
Page read latency       10 ns    50 ns   35 ns     <10 ns      25 us        5 ms
Page write latency      10 ns    500 ns  100 ns    20–30 ns    200 us       5 ms
Erase latency           N/A      N/A     N/A       N/A         2 ms         N/A
Cell area [F^2]         6        4–16    20        4           1–4          N/A
Hardware-Assisted Transaction Processing: NVM, Fig. 1 Complex memory hierarchy (byte-addressable tiers: CPU caches L1–L3 at ~2 ns, RAM at ~10–100 ns, NVM at ~200 ns reads and ~1–10 μs writes; an access gap separates these from the block-addressable tiers: Flash at ~25–80 μs reads and ~500–800 μs writes, and HDD at ~5 ms)
small IoT devices, smartphones, consumer devices, but also large data centers. However, the definitive position is still unclear. NVM can function as a RAM replacement but can also be exposed as a conventional block device. The former calls for persistent memory controllers with a consistent, energy-backed write queue (WQ) as well as an extended CPU instruction set (e.g., CLWB, CLFLUSH, etc.). The latter promotes faster market introduction at the cost of underutilizing the technological potential and employing suboptimal backward compatibility.

Furthermore, a variety of different and novel interface definitions have lately emerged. Depending on the targeted utilization and application, these approaches focus on various aspects of NVM characteristics. A broad range of lightweight file systems (PMFS (Dulloor et al. 2014), NOVA (Xu and Swanson 2016), etc.), including kernel drivers and user-mode libraries, were proposed in the last half decade. But also a potpourri of novel access mechanisms for nonvolatile storage (PAllocator (Oukid et al. 2017), pmemlib (Intel 2017), etc.), adapting and extending conventional concepts of memory management, are topics of current research. Arulraj et al. (2015) propose an NVM allocator extension providing a naming mechanism that ensures the validity of memory address pointers even after a restart, hence providing nonvolatile pointers. The allocators may also offer atomicity (Schwalb et al. 2015) by avoiding dangling pointers, since NVM leaks upon a restart.

Moreover, CPU instruction set architectures (ISA) have to be extended by new primitives like CLWB, CLFLUSH, or SFENCE (Oukid and Lehner 2017; Arulraj and Pavlo 2017) to handle NVM durability in combination with the WQ. To avoid consistency issues, modified cache lines need to be written back in order. ISA extensions for this purpose are needed, since data modifications made by a transaction may still reside in the CPU caches and may not be directly flushed to the NVM, causing durability and consistency issues in case of a power loss. Supporting some type of soft or atomic writes natively on NVM is an intensive research area (Schwalb et al. 2015; Oukid et al. 2017; Intel 2017).
accumulate updates to the respective page. In (2017), which avoids copies along the critical I/O
addition, partial differential logging of Kim et al. path due to logging by permanently maintaining
(2010) explores the idea of reducing write am- an entire copy of relevant database objects. If
plification. Many of these principles have been extended and applied to storage management for NVM (Lee et al. 2010a,b; Gao et al. 2015; Kim et al. 2011). To further minimize the write volume and perform NVM-friendly writes, storage managers use techniques such as epochs, change ordering, and group commit (Pelley et al. 2013b).

Volos et al. (2011) address the interface to modern storage technologies with Mnemosyne and target the creation and management of NVM as well as consistency handling in the presence of failures. Likewise, Intel (2017) defines a set of libraries designed to provide a useful API for server applications, e.g., a transactional object store.

REWIND (Chatzistergiou et al. 2015a) proposes a user-mode library to manage transactional updates directly in the application to minimize the overhead of the conventional access stack. PAllocator is proposed in Oukid et al. (2017) as a fail-safe persistent NVM allocator with a special focus on high concurrency, capacity, and scalability. Ogleari et al. (2016) present an approach that goes one step further in that it handles a sort of undo/redo log directly in hardware.

Transaction Management and Logging

Due to their I/O characteristics, both Flash and NVM are claimed to speed up logging. Numerous methods have been developed for Flash (On et al. 2012; Chen 2009), with some concepts (e.g., group commit) being adapted by approaches for NVM. As a consequence, Kim et al. (2016), Gao et al. (2015), and Wang and Johnson (2014) demonstrate how the logging load shifts from I/O-bound to software-bound when utilizing NVM.

In fact, a whole field of persistent programming models for NVRAM as a RAM replacement has emerged (Boehm and Chakrabarti 2016; Kolli et al. 2016; Coburn et al. 2011). This includes lightweight logging schemes (Chatzistergiou et al. 2015b) and approaches for main-memory DBMSs (DeBrabant et al. 2014b). An alternative approach is Kamino-Tx (Memaripour et al. 2017), which maintains a backup copy of the data: if a transaction aborts, objects can be recovered from this copy, while updates are handled asynchronously. However, Huang et al. (2014) point out that simply replacing all disks with NVM leads to no cost-effective optimum in terms of monetary cost per transaction. As an alternative, the proposed system NV-Logging addresses the overheads and scalability limitations of centralized log buffers and therefore enables per-transaction logging.

Liu et al. (2014) first propose the novel unified working-memory and persistent-store architecture NVM Duet, which provides the required consistency and durability guarantees. Later, Liu et al. (2017) present DudeTM, a crash-consistent durable transaction system that avoids the drawbacks of both undo and redo logging while providing NVM access directly through CPU load and store instructions.

FOEDUS (Kimura 2015) is a holistic approach addressing OLTP DBMSs on novel hardware. FOEDUS exploits OCC and is based on Masstree (Mao et al. 2012) and Silo (Tu et al. 2013). It introduces the concept of dual pages to exploit NVRAM for large cold data and DRAM for hot data, eliminating the need for mapping tables that would reduce concurrency and scalability. In addition, dual pages are coupled with a technique called stratified snapshots for mirroring.

Recovery and Instant Restart

Since the characteristics of modern storage technologies, especially NVM, can be exploited to reduce the recovery time of data management systems, multiple approaches have been proposed. The concept of instant restart is introduced in Oukid et al. (2014, 2015) and Schwalb et al. (2016), while instant recovery is described in Graefe et al. (2015, 2016), Graefe (2015), and Sauer et al. (2015). A novel logging and recovery protocol, write-behind logging, is defined by Arulraj et al. (2016); its key idea is to log what parts have changed rather than how they were changed, which gives the DBMS the opportunity to recover nearly instantaneously from system failures.
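The distinction drawn here, logging what parts changed rather than how they changed, can be illustrated with a deliberately simplified sketch. This is a toy contrast, not the actual protocol of Arulraj et al. (2016); all class and field names are invented for illustration.

```python
# Toy contrast between write-ahead logging (record *how* data changed:
# before/after images, written before the update) and a write-behind style
# (record only *what* changed; the durable NVM values serve for recovery).
# All names here are invented for this sketch.

class WriteAheadLog:
    """Conventional WAL: before/after images are logged before the update."""
    def __init__(self):
        self.records = []

    def update(self, store, key, new_value):
        # Log how the data changed (before image, after image), then apply.
        self.records.append((key, store.get(key), new_value))
        store[key] = new_value

class WriteBehindLog:
    """Write-behind style: durably write the new value first, then note only
    *which* parts changed; recovery re-reads the already durable values."""
    def __init__(self):
        self.dirty = set()

    def update(self, store, key, new_value):
        store[key] = new_value      # the NVM write is itself durable
        self.dirty.add(key)         # what changed, not how

store = {}
wbl = WriteBehindLog()
wbl.update(store, "a", 1)
wbl.update(store, "b", 2)
print(sorted(wbl.dirty))  # the log holds only the changed locations
```

Because the write-behind log carries no before/after images, its volume is independent of value sizes, which is part of why such schemes can shrink recovery work on NVM.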
Hardware-Assisted Transaction Processing: NVM 939
Chi P, Lee WC, Xie Y (2014) Making B+-tree efficient in PCM-based main memory. In: Proceedings of ISLPED, pp 69–74
Coburn J, Caulfield AM, Akel A, Grupp LM, Gupta RK, Jhala R, Swanson S (2011) NV-heaps: making persistent objects fast and safe with next-generation, non-volatile memories. SIGPLAN Not 46(3):105–118
Condit J, Nightingale EB, Frost C, Ipek E, Lee B, Burger D, Coetzee D (2009) Better I/O through byte-addressable, persistent memory. In: Proceedings of SOSP
DeBrabant J, Arulraj J, Pavlo A, Stonebraker M, Zdonik SB, Dulloor S (2014a) A prolegomenon on OLTP database systems for non-volatile memory. In: International workshop on accelerating data management systems using modern processor and storage architectures – ADMS'14, Hangzhou, 1 Sept 2014, pp 57–63
DeBrabant J, Arulraj J, Pavlo A, Stonebraker M, Zdonik SB, Dulloor S (2014b) A prolegomenon on OLTP database systems for non-volatile memory. In: Proceedings of ADMS
Dulloor SR, Kumar S, Keshavamurthy A, Lantz P, Reddy D, Sankaran R, Jackson J (2014) System software for persistent memory. In: Proceedings of EuroSys, pp 15:1–15:15
Gao S, Xu J, Härder T, He B, Choi B, Hu H (2015) PCM-logging: optimizing transaction logging and recovery performance with PCM. TKDE 27(12):3332–3346
Graefe G (2015) Instant recovery for data center savings. SIGMOD Rec 44(2):29–34
Graefe G, Sauer C, Guy W, Härder T (2015) Instant recovery with write-ahead logging. Datenbank-Spektrum 15(3):235–239
Graefe G, Guy W, Sauer C (2016) Instant recovery with write-ahead logging: page repair, system restart, media restore, and system failover. Synthesis lectures on data management, 2nd edn. Morgan & Claypool Publishers, San Rafael
Hardock S, Petrov I, Gottstein R, Buchmann A (2017) From in-place updates to in-place appends: revisiting out-of-place updates on flash. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD'17. ACM, New York, pp 1571–1586. https://doi.org/10.1145/3035918.3035958
Huang J, Schwan K, Qureshi MK (2014) NVRAM-aware logging in transaction systems. Proc VLDB Endow 8(4):389–400. https://doi.org/10.14778/2735496.2735502
Intel (2017) libpmem: persistent memory programming. http://pmem.io
JEDEC (2017) JEDEC DDR5 and NVDIMM-P standards under development. https://www.jedec.org/news/pressreleases/jedec-ddr5-nvdimm-p-standards-under-development
Kim YR, Whang KY, Song IY (2010) Page-differential logging: an efficient and DBMS-independent approach for storing data into flash memory. In: Proceedings of SIGMOD, pp 363–374
Kim K, Lee S, Moon B, Park C, Hwang JY (2011) IPL-P: in-page logging with PCRAM. PVLDB 4(12):1363–1366
Kim WH, Kim J, Baek W, Nam B, Won Y (2016) NVWAL: exploiting NVRAM in write-ahead logging. SIGOPS Oper Syst Rev 50(2):385–398
Kimura H (2015) FOEDUS: OLTP engine for a thousand cores and NVRAM. In: Proceedings of SIGMOD
Kolli A, Pelley S, Saidi A, Chen PM, Wenisch TF (2016) High-performance transactions for persistent memories. SIGPLAN Not 51(4):399–411
Lee S, Moon B, Park C, Hwang JY, Kim K (2010a) Accelerating in-page logging with non-volatile memory. IEEE Data Eng Bull 33(4):41–47
Lee S, Moon B, Park C, Hwang JY, Kim K (2010b) Accelerating in-page logging with non-volatile memory. IEEE Data Eng Bull 33(4):41–47
Liu RS, Shen DY, Yang CL, Yu SC, Wang CYM (2014) NVM Duet: unified working memory and persistent store architecture. In: Proceedings of ASPLOS. ACM, New York, pp 455–470. https://doi.org/10.1145/2541940.2541957
Liu M, Zhang M, Chen K, Qian X, Wu Y, Zheng W, Ren J (2017) DudeTM: building durable transactions with decoupling for persistent memory. In: Proceedings of ASPLOS. ACM, New York, pp 329–343. https://doi.org/10.1145/3037697.3037714
Ma L, Arulraj J, Zhao S, Pavlo A, Dulloor SR, Giardino MJ, Parkhurst J, Gardner JL, Doshi K, Zdonik S (2016) Larger-than-memory data management on modern storage hardware for in-memory OLTP database systems. In: Proceedings of the 12th international workshop on data management on new hardware, DaMoN'16. ACM, New York, pp 9:1–9:7. https://doi.org/10.1145/2933349.2933358
Mao Y, Kohler E, Morris RT (2012) Cache craftiness for fast multicore key-value storage. In: Proceedings of EuroSys
Memaripour A, Badam A, Phanishayee A, Zhou Y, Alagappan R, Strauss K, Swanson S (2017) Atomic in-place updates for non-volatile main memories with Kamino-Tx. In: Proceedings of EuroSys, pp 499–512. https://doi.org/10.1145/3064176.3064215
Mihnea A, Lemke C, Radestock G, Schulze R, Thiel C, Blanco R, Meghlan A, Sharique M, Seifert S, Vishnoi S, Booss D, Peh T, Schreter I, Thesing W, Wagle M, Willhalm T (2017) SAP HANA adoption of non-volatile memory. Proc VLDB Endow 10(12):1754–1765. https://doi.org/10.14778/3137765.3137780
Ogleari MA, Miller EL, Zhao J (2016) Relaxing persistent memory constraints with hardware-driven undo+redo logging. STABLE Tech Rep 002:1–18. https://pdfs.semanticscholar.org/0910/14f8f886b88d60d981816c12665ec4667672.pdf
On ST, Xu J, Choi B, Hu H, He B (2012) Flag commit: supporting efficient transaction recovery in flash-based DBMSs. IEEE Trans Knowl Data Eng 24(9):1624–1639
Oukid I, Booss D, Lespinasse A, Lehner W, Willhalm T, Gomes G (2017) Memory management techniques for large-scale persistent-main-memory systems. Proc VLDB Endow 10(11):1166–1177. https://doi.org/10.14778/3137628.3137629
Oukid I, Lehner W (2017) Data structure engineering for byte-addressable non-volatile memory. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD'17. ACM, New York, pp 1759–1764. https://doi.org/10.1145/3035918.3054777
Oukid I, Booss D, Lehner W, Bumbulis P, Willhalm T (2014) SOFORT: a hybrid SCM-DRAM storage engine for fast data recovery. In: Proceedings of the tenth international workshop on data management on new hardware – DaMoN'14. ACM Press, New York, pp 1–7. https://doi.org/10.1145/2619228.2619236
Oukid I, Lehner W, Kissinger T, Willhalm T, Bumbulis P (2015) Instant recovery for main memory databases. In: Proceedings of CIDR
Oukid I, Lasperas J, Nica A, Willhalm T, Lehner W (2016) FPTree: a hybrid SCM-DRAM persistent and concurrent B-tree for storage class memory. In: Proceedings of the 2016 international conference on management of data, SIGMOD'16. ACM, New York, pp 371–386. https://doi.org/10.1145/2882903.2915251
Pavlo A, Angulo G, Arulraj J, Lin H, Lin J, Ma L, Menon P, Mowry TC, Perron M, Quah I, Santurkar S, Tomasic A, Toor S, Aken DV, Wang Z, Wu Y, Xian R, Zhang T (2017) Self-driving database management systems. In: CIDR. http://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf
Pelley S, Wenisch TF, Gold BT, Bridge B (2013a) Storage management in the NVRAM era. Proc VLDB Endow 7(2):121–132. https://doi.org/10.14778/2732228.2732231
Pelley S, Wenisch TF, Gold BT, Bridge B (2013b) Storage management in the NVRAM era. Proc VLDB Endow 7(2):121–132
Sadoghi M, Ross KA, Canim M, Bhattacharjee B (2016) Exploiting SSDs in operational multiversion databases. VLDB J 25(5):651–672
Sauer C, Graefe G, Härder T (2015) Single-pass restore after a media failure. In: Proceedings of BTW, pp 217–236
Schwalb D, Berning T, Faust M, Dreseler M, Plattner H (2015) NVM malloc: memory allocation for NVRAM. In: Proceedings of ADMS, pp 61–72
Schwalb D, Faust M, Dreseler M, Flemming P, Plattner H (2016) Leveraging non-volatile memory for instant restarts of in-memory database systems. In: 2016 IEEE 32nd international conference on data engineering (ICDE), pp 1386–1389. https://doi.org/10.1109/ICDE.2016.7498351
Tu S, Zheng W, Kohler E, Liskov B, Madden S (2013) Speedy transactions in multicore in-memory databases. In: Proceedings of SOSP, pp 18–32
Viglas SD (2012) Adapting the B-tree for asymmetric I/O. In: Proceedings of the 16th East European conference on advances in databases and information systems, ADBIS'12. Springer, Berlin/Heidelberg, pp 399–412. https://doi.org/10.1007/978-3-642-33074-2_30
Viglas SD (2015) Data management in non-volatile memory. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD'15. ACM, New York, pp 1707–1711. https://doi.org/10.1145/2723372.2731082
Volos H, Tack AJ, Swift MM (2011) Mnemosyne: lightweight persistent memory. In: Proceedings of ASPLOS. https://doi.org/10.1145/1950365.1950379
Volos H, Nalli S, Panneerselvam S, Varadarajan V, Saxena P, Swift MM (2014) Aerie: flexible file-system interfaces to storage-class memory. In: Proceedings of EuroSys, pp 14:1–14:14
Wang T, Johnson R (2014) Scalable logging through emerging non-volatile memory. Proc VLDB Endow 7(10):865–876
Xu J, Swanson S (2016) NOVA: a log-structured file system for hybrid volatile/non-volatile main memories. In: Proceedings of FAST
Yang J, Wei Q, Chen C, Wang C, Yong KL, He B (2015) NV-Tree: reducing consistency cost for NVM-based single level systems. In: Proceedings of FAST, pp 167–181

HDFS

Hadoop

Healthcare

Big Data for Health

Healthcare Ontologies

Big Semantic Data Processing in the Life Sciences Domain
942 Hierarchical Process Discovery
Hierarchical Process Discovery, Fig. 1 Flat process model (Conforti et al. 2014, 2016b)

Hierarchical Process Discovery, Fig. 2 The two-phase approach: Event Log → Sub-Logs Extraction → Sub-Processes Discovery → Hierarchical Process Model
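The two-phase pipeline of Fig. 2 can be made concrete with a minimal sketch: split an event log into sub-logs, then run a discovery step on each sub-log and nest the results. The clustering input and the per-sub-log "discovery" (a bare directly-follows relation) are deliberately naive stand-ins for the techniques surveyed in this entry; all names are invented.

```python
# Minimal sketch of the two-phase approach: (1) sub-log extraction,
# (2) per-sub-log discovery. Naive stand-ins, invented names.

def extract_sublogs(log, clusters):
    """Phase 1: project each trace onto each activity cluster."""
    sublogs = {name: [] for name in clusters}
    for trace in log:
        for name, acts in clusters.items():
            sub = [a for a in trace if a in acts]
            if sub:
                sublogs[name].append(sub)
    return sublogs

def discover(sublog):
    """Phase 2 stand-in: a 'model' is just the directly-follows relation."""
    return sorted({(t[i], t[i + 1]) for t in sublog for i in range(len(t) - 1)})

log = [["a", "b", "d", "e"], ["a", "b", "d", "f"]]
clusters = {"main": {"a", "b"}, "sub": {"d", "e", "f"}}
model = {name: discover(sl) for name, sl in extract_sublogs(log, clusters).items()}
print(model["main"])  # [('a', 'b')]
```

In a real technique, the clusters would come from pattern detection, graph clustering, or data-attribute analysis, and the discovery step would be a full automated process discovery algorithm.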
The technique proposed by Li et al. (2011) is based on the idea that the execution of a sub-process leaves a specific footprint in an event log. Such a footprint takes the shape of a recurrent execution pattern (Bose and van der Aalst 2009), which can then be used to detect the presence of a sub-process. To detect a sub-process, the authors rely on two patterns: (i) the tandem arrays pattern, which corresponds to the execution of loops, and (ii) the maximal repeats pattern, which corresponds to maximal common subsequences of activities across several process instances. Using these two patterns, sequences of events that can be grouped together within the same sub-log are identified and several sub-logs are extracted. To limit the number of sub-logs generated, only the most frequent patterns out of all patterns detected are considered.

Maggi et al. (2014) consider sub-processes as pockets of flexibility within which events of an event log may either be structured (i.e., distributed as a sequence for which the following event can be easily predicted) or unstructured. This concept of structured and unstructured events is used to split the event log into sub-logs. This is achieved by grouping within the same sub-log all subsequent structured events, while unstructured events are grouped together when they are situated in between sequences of structured events.

Finally, the technique proposed by Sun and Bauer (2016) is based on the idea that activities belonging to the same sub-process will be densely connected and their events will frequently follow each other within the event log. Following this concept, the authors propose to detect densely connected activities by employing graph clustering techniques on top of a causal activity graph, i.e., a graph depicting the causal relations between activities. The resulting clusters are then used to generate several sub-logs (one for each cluster), where each sub-log contains all events belonging to activities within the same cluster.

Data-Based Techniques

In contrast to control flow-based techniques, data-based techniques rely on data attributes belonging to events for the identification of sub-logs and their hierarchical structure. Techniques belonging to this category are the ones proposed by Conforti et al. (2014, 2016b), Wang et al. (2015), and Weber et al. (2015).

The technique proposed by Conforti et al. (2014, 2016a,b) identifies a hierarchy of sub-logs within the event log by relying on approximate functional and inclusion dependency discovery techniques (Huhtala et al. 1999; Bauckmann et al. 2010; Zhang et al. 2010). Additionally, using approximate dependency discovery techniques (Zhang et al. 2010), the technique can identify and filter out noise from the event log, which may be caused by data entry errors, missing event records, or infrequent behavior (Conforti et al. 2016b).

Similarly to Conforti et al. (2014, 2016a,b), the technique proposed by Weber et al. (2015) is based on the idea that a hierarchy between
Hierarchical Process Discovery, Fig. 4 Hierarchical process model – core BPMN with activity markers
i.e., a timer event or a message event. Specifically, a timer boundary event is used when the duration of a sub-process never exceeds a certain amount of time, while a message boundary event is used in all remaining cases.

Table 1 provides a brief summary of all techniques presented, with the relative subcategory classification.

Conclusion and Final Considerations

In previous sections, we discussed the importance of hierarchical process discovery, we highlighted the advantages that it has over classical automated process discovery, and we presented several techniques that can be used to discover hierarchical process models. In this final section, we discuss some aspects that should be considered when undertaking the discovery of a hierarchical process model.

Hierarchical process discovery and the quality of a discovered hierarchical process model are influenced by two main factors: the quality of the event log and the effectiveness of the classical automated process discovery technique used for the discovery of each sub-process.

The quality of an event log is a deciding factor in the choice of hierarchical process discovery technique to be adopted. In particular, if data attributes are missing or if the quality of the information recorded is inadequate, data-based techniques will fail to identify possible sub-logs within the event log, which in turn will result in the discovery of a flat process model.

Similarly, the quality of a discovered hierarchical process model is affected by the effectiveness of the classical automated process discovery technique used for the discovery of each sub-process. In particular, techniques performing poorly will produce imprecise and complex sub-processes, which will in turn result in imprecise and complex hierarchical process models.

These two aspects should be taken into consideration when attempting to gain insights about a business process. Business process management practitioners should consider hierarchical process discovery techniques only when the quality of the event log is adequate. In scenarios where logs are of poor quality (e.g., logs containing incorrectly recorded events and where data attributes are unreliable or missing), control flow-based hierarchical process discovery techniques should be preferred over data-based hierarchical process discovery techniques. Finally, the discovery of each sub-process should be handled through the use of robust classical automated process discovery techniques, for which we refer the reader to the entry on automated process discovery, where a detailed analysis of all available techniques is offered.

Finally, the problem of hierarchical process discovery shares several similarities with the problem of artifact-centric process discovery. Both families of methods aim at discovering a modular process model from an event log. However, they differ in terms of their modularization approach. Specifically, artifact-centric process discovery differs from hierarchical process discovery in that it seeks to discover a process model
that consists of a set of interacting artifacts, each one with an independent life cycle. For a detailed discussion on artifact-centric process discovery, we refer to the corresponding entry in this encyclopedia.

Cross-References

Artifact-Centric Process Mining
Automated Process Discovery
Decomposed Process Discovery and Conformance Checking

References

Bauckmann J, Leser U, Naumann F (2010) Efficient and exact computation of inclusion dependencies for data integration. Technical report 34, Hasso-Plattner-Institute
Bose RPJC, van der Aalst WMP (2009) Abstractions in process mining: a taxonomy of patterns. In: Proceedings of the 7th international conference on business process management. Lecture notes in computer science, vol 5701. Springer, pp 159–175
Conforti R, Dumas M, García-Bañuelos L, La Rosa M (2014) Beyond tasks and gateways: discovering BPMN models with subprocesses, boundary events and activity markers. In: Proceedings of the 12th international conference on business process management. Lecture notes in computer science, vol 8659. Springer, pp 101–117
Conforti R, Augusto A, Rosa ML, Dumas M, García-Bañuelos L (2016a) BPMN miner 2.0: discovering hierarchical and block-structured BPMN process models. In: Proceedings of the BPM demo track 2016, co-located with the 14th international conference on business process management. CEUR workshop proceedings, vol 1789, CEUR-WS.org, pp 39–43
Conforti R, Dumas M, García-Bañuelos L, La Rosa M (2016b) BPMN miner: automated discovery of BPMN process models with hierarchical structure. Inf Syst 56:284–303
Huhtala Y, Kärkkäinen J, Porkka P, Toivonen H (1999) TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput J 42(2):100–111
Leemans SJJ, Fahland D, van der Aalst WMP (2013) Discovering block-structured process models from event logs – a constructive approach. In: Proceedings of the 34th international conference on application and theory of Petri nets and concurrency. Lecture notes in computer science, vol 7927. Springer, pp 311–329
Li J, Bose RPJC, van der Aalst WMP (2011) Mining context-dependent and interactive business process maps using execution patterns. In: Proceedings of business process management workshops. Lecture notes in business information processing, vol 66. Springer, pp 109–121
Maggi FM, Slaats T, Reijers HA (2014) The automated discovery of hybrid processes. In: Proceedings of the 12th international conference on business process management. Lecture notes in computer science, vol 8659. Springer, pp 392–399
Object Management Group (OMG) (2011) Business process model and notation (BPMN) ver. 2.0. http://www.omg.org/spec/BPMN/2.0
Sun Y, Bauer B (2016) A graph and trace clustering-based approach for abstracting mined business process models. In: Proceedings of the 18th international conference on enterprise information systems. SciTePress, pp 63–74
van der Aalst WMP, Weijters T, Maruster L (2004) Workflow mining: discovering process models from event logs. IEEE Trans Knowl Data Eng 16(9):1128–1142
Wang Y, Wen L, Yan Z, Sun B, Wang J (2015) Discovering BPMN models with sub-processes and multi-instance markers. In: Proceedings of the On the Move (OTM) confederated international conferences. Lecture notes in computer science, vol 9415. Springer, pp 185–201
Weber I, Farshchi M, Mendling J, Schneider J (2015) Mining processes with multi-instantiation. In: Proceedings of the 30th annual ACM symposium on applied computing. ACM, pp 1231–1237
Weijters AJMM, Ribeiro JTS (2011) Flexible heuristics miner (FHM). In: Proceedings of the IEEE symposium on computational intelligence and data mining. IEEE, pp 310–317
Zhang M, Hadjieleftheriou M, Ooi BC, Procopiuc CM, Srivastava D (2010) On multi-column foreign key discovery. Proc VLDB Endow 3(1):805–814

Historical Graph Management

Udayan Khurana1 and Amol Deshpande2
1IBM Research AI, TJ Watson Research Center, New York, NY, USA
2Computer Science Department, University of Maryland, College Park, MD, USA

Definitions

Real-world graphs evolve over time, with continuous addition and removal of vertices and
edges, as well as frequent changes in their attribute values. For decades, work in graph analytics was restricted to a static perspective of the graph. In recent years, however, we have witnessed an increasing abundance of timestamped observational data, fueling an interest in performing richer analysis of graphs along a temporal dimension. However, the traditional graph data management systems that were designed for static graphs provide inadequate support for such temporal analyses. We present a summary of recent advances in the field of historical graph data management. They involve compact storage of large graph histories, efficient retrieval of temporal subgraphs, and effective interfaces for expressing historical graph queries, all essential for enabling temporal graph analytics.

Overview

There is an increasing availability of timestamped graph data – from social and communication networks to financial networks, biological networks, and so on. Analysis of the history of a graph presents valuable insights into the underlying phenomena that produced the graph. For instance, the evolution of a graph or community can be seen through the evolution of metrics such as diameter or density; change in the importance of key entities in a network can be tracked by measuring change in network centrality; and so on.

Traditional graph data management systems based on static graphs provide insufficient support for temporal analysis over historical graphs. This is due to a fundamental lack in the modeling of change of information. If performed using conventional graph systems, such tasks become prohibitively expensive in storage costs, memory requirements, or execution time and are often unfriendly or infeasible for an analyst to even express in the first place. The frequent change of a temporal graph also poses a significant challenge to algorithm design, because the overwhelming majority of graph algorithms assume static graph structures. One would have to design special algorithms for each application to accommodate the dynamic aspects of graphs. To support general-purpose computations, most of the emerging temporal graph systems adopt a strategy that separates graph updates from graph computation. More specifically, although updates are continually applied to a temporal graph, graph computation is only performed on a sequence of successive static views of the temporal graph. For simplicity, most systems adopt a discretized-time approach, so that the time domain is the set of natural numbers, i.e., t ∈ ℕ. As per the terminology of temporal relational databases, this discussion considers valid time (as opposed to transaction time) as the underlying temporal dimension for historical analyses. Valid time denotes the time period during which a fact is true with respect to the real world. Transaction time is the time when a fact is stored in the database. It is worth noting that there is a related but orthogonal body of work which we do not touch upon in this chapter. It deals with the need to do real-time analytics on streaming data as it is being generated; there, the scope of the analysis typically only includes the latest snapshot or the snapshots from a recent window, and the key challenge is to deal with the high rate at which the data is often generated.

Broadly speaking, this chapter's focus is to briefly introduce the reader to recent advances in historical graph data management for temporal graph analytics. There are many different types of analyses that may be of interest. For example, an analyst may wish to study the evolution over time of well-studied static graph properties, such as centrality measures, density, or conductance, or the search and discovery of temporal patterns, where the events that constitute the pattern are spread out over time. Comparative analysis, such as juxtaposition of a statistic over time or, perhaps, computing aggregates such as max or mean over time, gives another style of knowledge discovery into temporal graphs. Most of all, even a primitive notion of simply being able to access past states of the graphs and perform simple static graph analytics empowers a data scientist with the capacity to perform analysis in arbitrary and unconventional patterns. Supporting such a diverse set of temporal analytics
Historical Graph Management, Fig. 1 A temporal graph can be represented across two different dimensions – time and entity. It lists retrieval tasks (black), graph operations (red), and example queries (magenta) at different granularities of time and entity size (Khurana 2015)
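The discretized valid-time model described in the Overview (each fact holds over an interval of natural-number time points) can be made concrete with a toy snapshot-retrieval sketch. The `TemporalGraph` class and its methods are our own illustration, not the API of any system discussed in this entry.

```python
# Toy sketch of snapshot retrieval over a temporal graph under the
# discretized valid-time model (t in the natural numbers). Each edge
# carries a validity interval; snapshot(t) materializes the static
# view at valid time t. All names are invented for this sketch.

class TemporalGraph:
    def __init__(self):
        self.edges = []  # tuples (u, v, start, end), valid over [start, end)

    def add_edge(self, u, v, start, end):
        self.edges.append((u, v, start, end))

    def snapshot(self, t):
        """Static view at valid time t: edges whose interval covers t."""
        return {(u, v) for (u, v, s, e) in self.edges if s <= t < e}

g = TemporalGraph()
g.add_edge("a", "b", 0, 10)
g.add_edge("b", "c", 5, 8)
print(sorted(g.snapshot(6)))  # [('a', 'b'), ('b', 'c')]
print(sorted(g.snapshot(9)))  # [('a', 'b')]
```

A naive interval scan like this is linear in the number of edge versions; the storage and indexing schemes surveyed below (deltas, snapshot groups, multiversion arrays) exist precisely to make such retrievals efficient at scale.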
changes. Their hybrid is often referred to as the Copy+Log approach. The Copy approach is clearly infeasible for frequently changing graphs because of intractable storage needs; the Log approach, while storing minimal information, makes it hard to reconstruct any snapshot or temporal subset of the graph.

Delta-Based Encodings

The use of "deltas", or graph differences, is a powerful approach to register the changes in a graph over a period of time. If carefully organized, deltas can provide efficient retrieval at low storage costs. The DeltaGraph index (Khurana and Deshpande 2013a), for instance, provides highly efficient retrieval of individual snapshots of the historical graph for specific time instances. It organizes the historical graph data in a hierarchical data structure, whose lowest level corresponds to the snapshots of the network over time and whose interior nodes correspond to graphs constructed by "combining" the lower-level snapshots in some fashion; the interior nodes are typically not valid snapshots as of any specific time point. Neither the lowest-level graph snapshots nor the graphs corresponding to the interior nodes are actually stored explicitly. Instead, for each edge of the index structure, a delta, i.e., the difference between the two graphs corresponding to its endpoints, is computed, and these deltas are explicitly stored. In addition, the graph corresponding to the root is explicitly stored. Given those, any specific snapshot can be constructed by traversing any path from the root to the node corresponding to the snapshot in the index and by appropriately combining the information present in the deltas. The use of different "combining" functions leads to different points in the performance-storage trade-off, with intersection being the most natural such function. This index structure is especially effective for multi-snapshot retrieval queries, which are expected to be common in temporal analysis, as it can share the computation and retrieval of deltas across the multiple snapshots. While it is efficient at snapshot retrieval, it is not suitable for fetching graph primitives such as histories of nodes or neighborhoods over specified periods of time. The Temporal Graph Index (TGI) (Khurana and Deshpande 2016), however, is an extension of DeltaGraph that partitions deltas into so-called microdeltas and uses chaining of related microdeltas across horizontal nodes in the DeltaGraph; this allows retrieval of different temporal graph primitives, including neighborhood versions, node histories, and graph snapshots. Such a microdelta-based design is well suited for distributed storage and parallel retrieval in a cloud. It allows efficient retrieval not only of entire snapshots but also of individual neighborhoods or temporal histories of individual neighborhoods.

Storing by Time Locality

Another powerful approach to store and process temporal graphs is based on time locality. Chronos (Han et al. 2014) targets time-range graph analytics, requiring computation on the sequence of static snapshots of a temporal graph within a time range. An example is the analysis of the change in each vertex's PageRank over a given time range. Obviously, the most straightforward approach of applying the computation to each snapshot separately is too expensive. Chronos achieves efficiency by exploiting the locality of temporal graphs. It also leverages time locality to store temporal graphs on disk in a compact way. The layout is organized in snapshot groups. A snapshot group G_{t1,t2} contains the state of G in the time range [t1, t2], by including a checkpoint of the snapshot of G at t1 followed by all the updates made till t2. The snapshot group is physically stored as edge files and vertex files in a time-locality fashion. For example, an edge file begins with an index to each vertex in the snapshot group, followed by segments of vertex data. The segment of a vertex, in turn, first contains the set of edges associated with the vertex at the start time of the snapshot group, followed by all the edge updates to the vertex. A link structure is further introduced to link edge updates related to the same vertex/edge, so that the state of a vertex/edge at a given time t can be efficiently constructed.

Indexing Using Multiversion Arrays

LLAMA (Macko et al. 2015) presents an approach where an evolving graph is modeled as a time
series of graph snapshots, where each batch of incremental updates produces a new graph snapshot. The graph storage is read-optimized, while the update buffer is write-optimized. It augments the compact read-only CSR representation to support mutability and persistence. Specifically, a graph is represented by a single vertex table and multiple edge tables, one per snapshot. The vertex table is organized as a large multiversioned array (LLAMA) that uses a software copy-on-write technique for snapshotting, and the record of each vertex v in the vertex table maintains the necessary information to track v's adjacency list from the edge tables across snapshots. The array of records is partitioned into equal-sized data pages, and an indirection array is constructed that contains pointers to the data pages. The indirection array fits in L3 cache. To create a new snapshot, the indirection array is copied, with references to outdated pages replaced by references to the newly modified pages. Thus, unmodified pages do not need to be copied across snapshots. LLAMA stores 16 consecutive snapshots of the vertex table in each file, so that disk space can be easily reclaimed from deleted snapshots. The edge table for a snapshot i is organized as a fixed-length array that stores adjacency list fragments consecutively, where each adjacency list fragment contains the edges of a vertex added in snapshot i. An adjacency list fragment of vertex v also stores a continuation record, which points to the next fragment for v, or null if there are no more edges. To support edge deletion, each edge table maintains a deletion vector, which is an array that encodes in which snapshot an edge was deleted.

Analytical Frameworks

Running graph analytics requires dealing with two main aspects – the runtime system components and the interfaces to express the analytical task. Let us discuss the essentials of both.

Runtime Environment Aspects

In-Memory Graph Layouts. Like the storage and indexing of temporal graphs, dealing with temporal redundancy is important in the in-memory data structures on which analytics are executed. For example, when running a graph algorithm or evaluating a metric on various snapshots of a graph, it might be prohibitively expensive to store all the snapshots in memory at the same time. However, most graph libraries do require a plain, isolated version of each graph. The most common solution to this problem is to overlay multiple versions on one another, using bitmaps and an additional lookup table to establish the validity of a graph component in one or more versions. This reduces any redundancy of nodes, edges, or attributes that occurs across multiple versions while still providing access to each snapshot as a static graph through a simple interface. GraphPool (Khurana and Deshpande 2013a) and Chronos (Han et al. 2014), among others, use variations of this technique. In addition, Chronos also employs such a structure for locality-aware batch scheduling (LABS) of graph computation. More specifically, LABS batches the processing of a vertex across all the snapshots, as well as the information propagation to a neighboring vertex for all the snapshots.

Parallelism. Temporal graph analytics presents opportunities for parallelization in two different ways. First, evaluating a property across multiple snapshots can be easily parallelized. Second, several static graph computation operations, such as PageRank, can be run in parallel themselves. Determining the optimum level of parallelism for an analytical task can be complex and depends on the exact computation steps, the nature of the graph data, and the resources available. TGAF (Khurana and Deshpande 2016) provides parallelism by expressing all temporal graphs as RDDs of time-evolving nodes or subgraphs and letting the underlying Spark infrastructure plan the parallelism. Leveraging Spark parallelism provides abundant benefits, along with simplicity of design. However, there are occasions on which certain tasks can be customized for better results. A major drawback of TGAF is the lack of the space saving obtained through overlaid structures such as GraphPool. LABS and LLAMA show effective locality-based multi-core parallelism.
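The copy-on-write snapshotting behind LLAMA's multiversioned array can be illustrated with a small sketch. The class below is a toy model under stated assumptions (Python lists as data pages, a tiny page size, and invented names), not LLAMA's actual implementation:

```python
# Toy sketch of LLAMA-style copy-on-write snapshotting over an indirection
# array of data pages. Names and the page size are illustrative assumptions.

PAGE_SIZE = 4  # records per data page (kept tiny for illustration)

class MultiversionArray:
    def __init__(self):
        self.snapshots = []  # each snapshot is a frozen indirection array
        self.current = []    # indirection array under construction

    def set(self, index, value):
        """Write a value, copying the touched page so older snapshots survive."""
        page_no, offset = divmod(index, PAGE_SIZE)
        while page_no >= len(self.current):
            self.current.append([None] * PAGE_SIZE)
        page = self.current[page_no]
        # Copy-on-write: if the page is shared with the last snapshot, copy it.
        if self.snapshots and page_no < len(self.snapshots[-1]) \
                and self.snapshots[-1][page_no] is page:
            page = list(page)
            self.current[page_no] = page
        page[offset] = value

    def snapshot(self):
        """Freeze the current state by copying only the small indirection array."""
        self.snapshots.append(list(self.current))
        return len(self.snapshots) - 1

    def get(self, snap_id, index):
        page_no, offset = divmod(index, PAGE_SIZE)
        return self.snapshots[snap_id][page_no][offset]
```

Because only the indirection array is copied per snapshot, unmodified pages remain shared between snapshots, which is what keeps snapshot creation cheap.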
(Bahmani et al. 2010), change in centrality of vertices, path lengths of vertex pairs, etc. (Pan and Saramäki 2011), shortest paths evolution (Ren et al. 2011), and such, provide useful analytical algorithms for changing graphs.

Future Directions for Research

The topic of historical graph data management is ripe with research opportunities. In spite of the advances in temporal graph querying, there is considerable scope for designing a more expressive and rigorous algebra and language. The work in graph pattern matching and mining appears to lack a temporal dimension. A related but important topic is that of finding interesting temporal graphs from timestamped datasets such as system logs. In terms of systems research, integrating a historical graph system and a streaming graph system into one presents interesting challenges and opportunities for researchers in the field.

Cross-References

Graph Data Management Systems
Graph OLAP
Graph Representations and Storage
Link Analytics in Graphs
Parallel Graph Processing

References

Ahn JW, Taieb-Maimon M, Sopan A, Plaisant C, Shneiderman B (2011) Temporal visualization of social network dynamics: prototypes for nation of neighbors. In: Salerno J, Yang SJ, Nau D, Chai S-K (eds) Social computing, behavioral-cultural modeling and prediction. Springer, New York, pp 309–316
Ahn JW, Plaisant C, Shneiderman B (2014) A task taxonomy for network evolution analysis. IEEE Trans Vis Comput Graph 20(3):365–376
Asur S, Parthasarathy S, Ucar D (2009) An event-based framework for characterizing the evolutionary behavior of interaction graphs. ACM Trans Knowl Discov Data (TKDD) 3(4):16
Bahmani B, Chowdhury A, Goel A (2010) Fast incremental and personalized pagerank. In: Proceedings of the international conference on very large data bases (VLDB)
Beck F, Burch M, Diehl S, Weiskopf D (2014) The state of the art in visualizing dynamic graphs. In: EuroVis STAR 2
Han W, Miao Y, Li K, Wu M, Yang F, Zhou L, Prabhakaran V, Chen W, Chen E (2014) Chronos: a graph engine for temporal graph analysis. In: Proceedings of the ninth European conference on computer systems. ACM, p 1
Khurana U (2015) Historical graph data management. PhD thesis, University of Maryland
Khurana U, Deshpande A (2013a) Efficient snapshot retrieval over historical graph data. In: Proceedings of the IEEE international conference on data engineering, pp 997–1008
Khurana U, Deshpande A (2013b) Hinge: enabling temporal network analytics at scale. In: Proceedings of the ACM SIGMOD international conference on management of data. ACM, pp 1089–1092
Khurana U, Deshpande A (2016) Storing and analyzing historical graph data at scale. In: Proceedings of the international conference on extending database technology, pp 65–76
Khurana U, Nguyen VA, Cheng HC, Ahn JW, Chen X, Shneiderman B (2011) Visual analysis of temporal trends in social networks using edge color coding and metric timelines. In: Proceedings of the IEEE third international conference on social computing, pp 549–554
Kumar R, Novak J, Tomkins A (2010) Structure and evolution of online social networks. In: Faloutsos C et al (eds) Link mining: models, algorithms, and applications. Springer, New York, pp 337–357
Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution: densification and shrinking diameters. ACM Trans Knowl Discov Data (TKDD) 1(1):2
Macko P, Marathe VJ, Margo DW, Seltzer MI (2015) LLAMA: efficient graph analytics using large multiversioned arrays. In: Proceedings of the IEEE international conference on data engineering, pp 363–374
Navlakha S, Kingsford C (2011) Network archaeology: uncovering ancient networks from present-day interactions. PLoS Comput Biol 7(4):e1001119
Pan RK, Saramäki J (2011) Path lengths, correlations, and centrality in temporal networks. Phys Rev E 84(1):016105
Ren C, Lo E, Kao B, Zhu X, Cheng R (2011) On querying historical evolving graph sequences. In: Proceedings of the international conference on very large data bases (VLDB)
Salzberg B, Tsotras VJ (1999) Comparison of access methods for time-evolving data. ACM Comput Surv (CSUR) 31(2):158–221
Tang L, Liu H, Zhang J, Nazeri Z (2008) Community evolution in dynamic multi-mode networks. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 677–685
954 Hive
Overview

Hive enables data warehousing in the Apache Hadoop ecosystem. It can run in traditional Hadoop clusters or in cloud environments. It can work with data sets as large as multiple petabytes. Initially Hive was used mainly for ETL and batch processing. While still supporting these use cases, it has evolved to also support data warehousing use cases such as reporting, interactive queries, and business intelligence. This evolution has been accomplished by adopting many common data warehousing techniques while adapting those techniques to the Hadoop ecosystem. It is implemented in Java.

Running on Hadoop, Scalability, and High Availability

Running on Hadoop gives Hive the ability to run over large data sets with a high degree of parallelism. Hadoop clusters vary in size from a few nodes to tens of thousands of nodes. And since Hadoop provides connectors to other distributed storage systems, Hive can be used with these distributed storage systems as well. This allows Hive, on a large cluster or in the cloud, to operate on very large data sets.

Hadoop provides high availability and retry in the face of system failure for the tasks it executes.
Hive, Fig. 1 Hive components (the user submits queries to and receives results from HiveServer2; the Metastore, YARN application master, LLAP, and Hive tasks run in Hadoop)
Hive has built HiveServer2 and the Metastore to be highly available, since they do not run as tasks in Hadoop. Multiple instances of these components can be run simultaneously so that any single component failure does not render the system unavailable. Since the Metastore relies on an RDBMS to store the metadata (see below), users must choose a highly available RDBMS in order to avoid a single point of failure.

Columnar Data Storage

For traditional data warehousing queries, it is well known that columnar formats provide superior performance. There are two major columnar formats in the big data ecosystem, Apache ORC (Huai et al. 2014) and Apache Parquet. Hive supports both. ORC was originally a part of Hive and is tightly integrated into a number of Hive's newer features, such as ACID and LLAP. The rest of this section will focus on ORC, but much of what is said here applies to Parquet as well.

Columnar formats often store each column in a separate file. However, HDFS does not support co-location of related files; hence this does not work in Hadoop. Instead, ORC stores all of the columns in a single file. An ORC file is split into stripes. All of the values for a given column are stored contiguously within a stripe. These columns are compressed in type-specific ways (e.g., strings are dictionary encoded), and run-length encoding is used for most types. The file can be further compressed using standard techniques such as zlib or Snappy. Information on column offsets is placed at the end of the file so that a reader can easily read only the desired columns.

ORC also records basic statistics (e.g., minimum and maximum values) per stripe and row group (a collection of 10,000 rows in a stripe) which it uses to support predicate pushdown, allowing it to further trim the data it reads. For example, if a query requests users living in Canada and, when reading a particular row group, ORC discovers that the minimum value in the country column for that row group is Denmark, it will skip reading those rows.

"Unstructured" and Text Data

Hadoop is useful for storing very large quantities of "unstructured" data. Unstructured is a misnomer, in that this data does have structure. It is unstructured in that it has not been normalized and cleansed for storing in an RDBMS. This unstructured data may define its own structure (e.g., JSON, Apache Avro) or be a format with a predefined structure (e.g., MPEG3). It may also be a text file of comma separated values (CSV). Hive can read and write data in JSON, CSV, and Avro. It can also read pre-existing data if a user creates a table with the existing data as its data file(s).
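The min/max statistics described above support a simple skip test: a row group can be skipped whenever the requested value falls outside its recorded range. The sketch below is illustrative only; the stats type and sample data are assumptions, not ORC's actual reader API:

```python
# Hypothetical sketch of ORC-style predicate pushdown using per-row-group
# min/max statistics. RowGroupStats and the sample data are assumptions.

from dataclasses import dataclass

@dataclass
class RowGroupStats:
    minimum: str
    maximum: str

def may_contain(stats, value):
    """A row group can only contain `value` if it lies within [min, max]."""
    return stats.minimum <= value <= stats.maximum

# Row groups of a 'country' column with their recorded statistics.
row_groups = [
    (RowGroupStats("Argentina", "Brazil"), ["Argentina", "Brazil"]),
    (RowGroupStats("Denmark", "France"),   ["Denmark", "France"]),
    (RowGroupStats("Albania", "Egypt"),    ["Albania", "Canada", "Egypt"]),
]

# Only row groups whose [min, max] range can contain "Canada" are read.
to_read = [rows for stats, rows in row_groups if may_contain(stats, "Canada")]
```

As in the Denmark example above, the second row group is skipped because "Canada" sorts before its minimum value; the first is skipped because "Canada" sorts after its maximum.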
partitions, which puts pressure on HDFS (since it has an upper bound on the number of files it can support in the file system) and the Metastore, since it has to store metadata and statistics for each partition.

When a query's WHERE clause includes a predicate on the partition column(s) that can be statically computed, the Metastore can determine which partitions are required for the query. However, in the case where input data is trimmed via a join condition or subquery, Hive cannot statically determine the appropriate partitions to read. This scenario is common with fact and dimension tables. In order to efficiently execute such a query, Hive prunes partitions at runtime. When assigning segments of data to individual tasks, if it determines that the join will drop all records in the partition (based on the partition key or ORC statistics), it will remove that task from the job. This is a crucial performance optimization that allows Hive to efficiently execute data warehousing-style queries.

Query Execution

Hive offers different options for query execution. Originally all queries were executed using Hadoop's MapReduce framework. MapReduce scales very well for large data sets, but it suffers from a number of issues as an execution framework for SQL queries. Queries must be constructed as a chain of MapReduce jobs, each of which writes its results to HDFS in between. MapReduce is a very rigid model. Each job takes one input, applies the map function to it, shuffles the data between machines in order to group the records by key, applies the reduce function, and produces one output. The relational operator set can be pushed into this model, but it is neither simple nor efficient.

MapReduce worked well as an engine in Hive's early days when it was focused solely on ETL and large batch jobs. As Hive began to transition to more data warehousing use cases, it needed to support more complex SQL and provide much better latency for query performance.

To address this, Hive was extended to use other Hadoop-based engines, the first of which was Apache Tez. Tez jobs are expressed as a directed acyclic graph (DAG), where nodes are processing tasks and edges are data flows between tasks. Multiple in- and outbound edges per node are supported. Tez also supports modifying the DAG at runtime based on feedback during processing (Saha et al. 2015). These features provide Hive with a much simpler and more efficient model for executing its operators.

Another engine chosen was Apache Spark. Spark also supports arbitrary graphs of tasks, with intermediate data being stored in the memory of the task machines. This makes it both more flexible and much faster than MapReduce (Zaharia 2010).

Hive currently supports Tez, Spark, and MapReduce, though MapReduce has been deprecated in recent versions.

The use of Tez or Spark addresses many of the performance issues seen with MapReduce, in some cases improving performance by a factor of 100 or more (Shanklin 2014). However, even with these new engines, performance issues remained due to the fact that Hadoop runs each task in a dedicated JVM. This slows down queries in three ways:

• When a job is submitted, it takes a minimum of 2–5 s to spawn the task JVMs.
• Java's just-in-time compiler (JIT) is not invoked until each task is part way through processing its data, even though the task is running the same code as many other tasks before it.
• There are limited options for caching data; each task must read its data from HDFS.

To address these issues, the Hive team developed LLAP (Live Long and Process). LLAP makes use of traditional MPP techniques. A set of long-running YARN tasks is allocated in the Hadoop cluster. Each node has a cache of data. This cache is column and ORC row-group aware so that only hot portions of the data are cached.
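The runtime partition pruning described above can be sketched as a filter over a fact table's partitions, keeping only those whose partition key survives the dimension-side condition. Everything here (names, sample data) is an illustrative assumption, not Hive's implementation:

```python
# Illustrative sketch of Hive-style runtime partition pruning for a
# fact-dimension join. The function name and sample data are assumptions.

def prune_partitions(fact_partitions, surviving_dim_keys):
    """fact_partitions: {partition_key: rows}. surviving_dim_keys: join keys
    that survive the dimension-side filter. Partitions that cannot produce
    any join result are dropped, so no task is scheduled for them."""
    return {key: rows for key, rows in fact_partitions.items()
            if key in surviving_dim_keys}

fact = {
    "2017-01": [("order1", "2017-01")],
    "2017-02": [("order2", "2017-02")],
    "2017-03": [("order3", "2017-03")],
}
# Suppose the dimension-side subquery keeps only February and March.
surviving = prune_partitions(fact, {"2017-02", "2017-03"})
```

The January partition is never scanned, which is exactly the effect of removing its task from the job.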
Any node can cache any segment of the data. This prevents bottlenecks when a portion of the data is very hot and many readers wish to access it. When a job has a task ready for execution, the task is sent to an LLAP node. If the data to be accessed by that task is already cached on an LLAP node and that node has processing capacity, the task will be placed there. If the node does not have capacity, then the task will be placed on a node with capacity, which will then also cache the required data. LLAP nodes can exchange data in the same way that Tez tasks can, thus enabling the whole query to be executed in these standing servers. An area of active development for LLAP is effective workload management, as provided by many mature data warehousing systems.

MPP-like systems such as LLAP perform very well for traditional reporting and BI queries. However, they are not optimal for handling queries that need to read very large amounts of data. Consider a fact-to-fact join over 1 year of data where each input is a terabyte. The intermediate results will likely be quite large. It is unlikely that a full year of data will be in the cache, since most queries are over more recent data. And reading the full year into the cache will force out more valuable recent data. LLAP will not be able to dynamically add nodes to handle the large intermediate results. In contrast, this is exactly the situation where Hadoop performs well, since it can spawn sufficient tasks to handle reading and shuffling the data in parallel. In order to support interactive queries as well as large batch jobs, Hive allows users to run queries either in LLAP or using Tez, Spark, or MapReduce.

Whether running on another engine such as Tez or running in LLAP, Hive provides all of the relational operators to execute its queries. These include standard operators such as projection, filter, group, sort, and join. Hive supports several types of joins:

• Broadcast join. A table, usually small enough to fit into memory, is broadcast to each task and joined against a fragment of the larger table.
• Sort-merge-bucket (SMB) join. This depends on the data already being partitioned and sorted on the join key. In this case it will materialize the join via a merge join on each segment of data.
• Distributed hash join. This redistributes records from all inputs to a number of tasks based on the join key and then materializes the join in each task. This is by far the most expensive join type since it requires redistributing all the input data across the network. However, it scales much better than a broadcast join and does not require the data to be already sorted to be efficient, like the SMB join does.

Initially Hive's operators used the textbook database method of building a small set of operators that handle all data types and pull one record at a time through the operator tree. This model does not perform well for data warehousing queries. It does not efficiently make use of pipelining in the CPU, as the need to support multiple data types creates many branches in the code. It also tends to produce CPU cache misses since there is no guarantee that records are co-located in memory, forcing expensive loads from main memory. This can be addressed by operating on records in batches that are sized to fit the CPU's cache and reworking operators to be data-type specific and to minimize branches (Boncz et al. 2005). Hive reworked its operators to conform to this pattern. Hive refers to these new operators as "vectorized."

SQL

The Hive project has stated a goal of supporting all the segments of ANSI SQL 2011 that are relevant for data warehousing operations. It has not yet achieved that goal but is tracking progress toward it (Apache Hive SQL Conformance 2017). In addition to supporting standard SQL data types, Hive also supports additional Java-like types. These include strings (unbounded character strings, similar to the RDBMS text type, but stored in row), lists (arrays of a given type), structs (a predefined set of names and types; ANSI row types), and maps (a free form mapping
and other components. Via plug-ins to Hive, Ranger or Sentry can examine the query plan and authorize or reject a query. They can also choose to modify the query plan. For example, Ranger uses this feature to anonymize sensitive data before returning it to the user.

Since Hive runs in Hadoop, it also has to decide what user to execute its jobs as. Traditionally Hadoop jobs run as the user submitting the job. When running with Tez, Spark, or MapReduce, Hive can follow this paradigm, which it refers to as running in "doAs" mode, since it is doing the operations as the user. This works well in environments where a number of different tools are being used to read and write data. It also allows the use of UDFs. However, security options are limited in this scenario, since Hive is dependent on the ability of the user to access the files that store the data.

Hive can also run all jobs as a dedicated user, referred to as the "hive user." This gives the ability to provide fine-grained security, since Hive owns all of the data and thus controls access to it. However, it is not safe to allow UDFs when running as a super user. Only system-provided functions and administrator-"blessed" UDFs are allowed in this scenario.

References

Apache Hive SQL Conformance (2017) https://cwiki.apache.org/confluence/display/Hive/Apache+Hive+SQL+Conformance. Accessed 11 Nov 2017
Boncz P et al (2005) MonetDB/X100: hyper-pipelining query execution. In: Proceedings of the 2005 CIDR conference, Asilomar, pp 225–237
Huai Y et al (2014) Major technical advancements in Apache Hive. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, Snowbird, Utah, 22–27 June 2014
Saha B et al (2015) Apache Tez: a unifying framework for modeling and building data processing applications. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, Melbourne, 31 May–4 June 2015
Shanklin C (2014) Benchmarking Apache Hive 13 for enterprise Hadoop. https://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/. Accessed 9 Nov 2017
Vavilapalli V et al (2013) Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th annual symposium on cloud computing, Santa Clara, 1–3 Oct 2013
Zaharia M (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing, Boston, 22–25 June 2010

Holistic Schema Matching
times in an incremental way. For instance, one can use one of the schemas as the initial integrated schema and incrementally match and merge the next source with the intermediate result until all source schemas are integrated. This has the advantage that only N-1 match and integration steps are needed for N schemas.

Such binary integration strategies have already been considered in early approaches to schema integration (Batini et al. 1986). More recently, this strategy has been applied within the Porsche approach (Saleem et al. 2008) to automatically merge many tree-structured XML schemas. The approach holistically clusters all matching elements in the nodes of the integrated schema. The placement of new source elements not found in the (intermediate) integrated schema is based on a simplistic heuristic only. A general problem of incremental merge approaches is that the final merge result typically depends on the order in which the input schemas are matched and merged.

A full pairwise matching has been applied in (Hu et al. 2011) to match the terms of more than 4000 web-extracted ontologies. The match process using a state-of-the-art match tool took about 1 year on six computers, showing the insufficient scalability of unrestricted pairwise matching.

A holistic matching of concepts in linked sources of the so-called Data Web has been proposed in (Gruetze et al. 2012). The authors first cluster the concepts (based on keywords from their labels and descriptions) within different topical groups and then apply pairwise matching of concepts within groups to finally determine clusters of matching concepts. The approach is a good first step toward automatic holistic schema matching but suffered from coverage and scalability limitations: topical grouping was possible for less than 17% of the considered one million concepts, and matching for the largest group of 5K concepts took more than 30 h.

The matching between many schemas/ontologies can be facilitated by the reuse of previously determined mappings (links) between such sources, especially if such mappings are available in the Data Web or in repositories such as BioPortal and LinkLion (Nentwig et al. 2014). Such a reuse of mappings has already been proposed for pairwise schema and ontology matching based on a repository of schemas and mappings (Do and Rahm 2002; Madhavan et al. 2005; Rahm 2011). A simple and effective approach is based on the composition of existing mappings to quickly derive new mappings. In particular, one can derive a new mapping between schemas S1 and S2 by composing existing mappings, e.g., mappings between S1 and Si and between Si and S2 for any intermediate schema Si. Such composition approaches have been investigated in (Gross et al. 2011) and were shown to be very fast and also effective. A promising strategy is to utilize a hub schema (ontology) per domain, such as UMLS in the biomedical domain, to which all other schemas are mapped. Then one can derive a mapping between any two schemas by composing their mappings with the hub schema.

Holistic Matching of Web Forms and Web Tables

The holistic integration of many schemas has mainly been studied for simple schemas such as web forms and web tables, typically consisting of only a few attributes. Previous work for web forms concentrated on their integration within a mediated schema, while for web tables, the focus has been on the semantic annotation and matching of attributes.

The integration of web forms has been studied to support a meta-search across deep web sources, e.g., for comparing products from different online shops. Schema integration mainly entails grouping or clustering similar attributes from the web forms, which is simpler than matching and merging complex schemas. As a result, scalability to dozens of sources is typically feasible. Figure 1 illustrates the main idea for three simple web forms to book flights. The attributes in the forms are first clustered, and then the mediated schema is determined from the bigger clusters referring to attributes available in most sources. Queries on the mediated schema can then be answered on most sources.

Holistic Schema Matching, Fig. 1 Integration of web forms (query interfaces) into a mediated schema for flight bookings

Proposed approaches for the integration of web forms include WISE-Integrator and MetaQuerier (He et al. 2004; He and Chang 2003). Matching and clustering of attributes is mainly based on the linguistic similarity of the attribute names (labels). The approaches also observe that similarly named attributes co-occurring in the same schema (e.g., FirstName and LastName) do not match and should not be clustered together. Das Sarma et al. (2008) proposes the automatic generation of a so-called probabilistic mediated schema from N input schemas, which is in effect a ranked list of several mediated schemas.

A much larger number (thousands and millions) of sources have to be dealt with in the case of web-extracted datasets like web tables made available within open data repositories such as data.gov or datahub.io. The collected datasets are typically from diverse domains and initially not integrated at all. To enable their usability, e.g., for query processing or web search, it is useful to group the datasets into different domains and to semantically annotate attributes.

Several approaches have been proposed for such a semantic annotation and enrichment of attributes by linking them to concepts of knowledge graphs like YAGO, DBpedia, or Probase (Limaye et al. 2010; Wang et al. 2012). In (Balakrishnan et al. 2015), attributes are mapped to concepts of the Google Knowledge Graph (which is based on Freebase) by mapping the attribute values of web tables to entities in the knowledge graph and thus to the concepts of these entities.

The enriched attribute information as well as the attribute instances can be used to match and cluster related web tables to facilitate the combined use of their information. This has only been studied to a limited degree so far. In particular, the Infogather system (Yakout et al. 2012) utilizes enriched attribute information to match web tables with each other. To limit the scope, they determine topic-specific schema match graphs that only consider schemas similar to a specific query table. The match graphs help to determine matching tables upfront before query answering and to holistically utilize information from matching tables. Eberius et al. (2013) uses instance-based approaches to match the attributes of web tables considering the degree of overlap in the attribute values, but they do not utilize the similarity of attribute names or their annotations.

The holistic integration of both web forms and web tables is generally only relevant for schemas of the same application domain. For a very large number of such schemas, it is thus important to first categorize schemas by domain. Several approaches have been proposed for the automatic domain categorization problem of web forms (Barbosa 2007; Mahmoud and Aboulnaga 2010), typically based on a clustering of attribute names and the use of further features such as explanatory text in the web page where the form is placed. While some approaches (Barbosa 2007) considered the domain categorization for only a few predefined domains, Mahmoud and Aboulnaga (2010) cluster schemas into a previously unknown number of domain-like groups that may overlap. This approach has also been used in Eberius et al. (2013) for a semantic grouping of related web tables. The categorization of web tables should be based on semantically enriched attribute information, the attribute instances, and possibly information from the table context in the web pages (Balakrishnan et al. 2015). The latter study observed that a large portion of web tables includes a so-called "subject" attribute (often the first attribute) indicating the kinds of entities represented in a table (companies, cities, etc.), which already allows a rough categorization of the tables.
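The attribute-clustering idea behind mediated-schema construction for web forms (Fig. 1) can be sketched as follows. This is a deliberately simplistic illustration: the similarity test, threshold, and sample flight-booking forms are assumptions, and real systems such as WISE-Integrator use much richer linguistic matching. It does, however, encode the observation above that attributes co-occurring in the same form should not be clustered together:

```python
# Toy sketch: cluster similar attribute labels from several web forms and
# derive a mediated schema from the clusters covering multiple sources.
# The similarity heuristic, threshold, and sample forms are assumptions.

from difflib import SequenceMatcher

forms = [
    ["From", "To", "Departure Date", "Return Date"],
    ["Origin", "Destination", "Depart", "Return"],
    ["From City", "To City", "Departure Date"],
]

def similar(a, b, threshold=0.6):
    a, b = a.lower(), b.lower()
    return a in b or b in a or SequenceMatcher(None, a, b).ratio() >= threshold

def cluster_attributes(forms):
    clusters = []  # each cluster is a list of (form_id, label) pairs
    for form_id, form in enumerate(forms):
        for label in form:
            for cluster in clusters:
                # Attributes from the same form must not share a cluster.
                if all(fid != form_id for fid, _ in cluster) \
                        and any(similar(label, other) for _, other in cluster):
                    cluster.append((form_id, label))
                    break
            else:
                clusters.append([(form_id, label)])
    return clusters

def mediated_schema(forms, min_sources=2):
    """Keep only clusters whose attribute appears in several sources."""
    return [sorted({lbl for _, lbl in c}) for c in cluster_attributes(forms)
            if len(c) >= min_sources]
```

With these sample forms, labels like "Depart" and "Departure Date" fall into one cluster, while singletons such as "Origin" (which this crude heuristic cannot relate to "From") are excluded from the mediated schema, mirroring how the bigger clusters determine the mediated schema.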
Compared to the huge amount of previous work on pairwise schema matching, research on holistic schema matching is still at an early stage. As a result, more research is needed to improve the current approaches for a largely automatic, effective, and efficient integration of many schemas and ontologies. In addition to approaches for determining an initial integrated schema/ontology, there is a need for incremental schema integration approaches that can add new schemas without having to completely reintegrate all source schemas. The use of very large sets of web-extracted datasets such as web tables also needs more research to categorize and semantically integrate these datasets.

To deal with a very large number of input schemas, it is important to avoid the quadratic complexity of a pairwise mapping between all input schemas. One approach, which is especially suitable for integrating complex schemas and ontologies, is to incrementally match and integrate the input schemas. More research is needed here to evaluate how known approaches for pairwise matching and integration could be used or extended for such a setting.

For simpler schemas, or if only a clustering of concepts/attributes is aimed at, grouping the input schemas by domain or mutual similarity is able to reduce the overall complexity, since matching can then be restricted to the schemas of the same group. To limit the effort for matching schema elements within such groups, performance techniques from entity resolution such as blocking (Koepcke and Rahm 2010) can further be utilized. More research is thus needed to investigate these ideas for a very large number of

References

Balakrishnan S, Halevy AY, Harb B, Lee H, Madhavan J, Rostamizadeh A, Shen W, Wilder K, Wu F, Yu C (2015) Applying web tables in practice. In: Proceedings of the CIDR
Barbosa L, Freire J, Silva A (2007) Organizing hidden-web databases by clustering visible web documents. In: Proceedings of the international conference on data engineering, pp 326–335
Batini C, Lenzerini M, Navathe SB (1986) A comparative analysis of methodologies for database schema integration. ACM Comput Surv 18(4):323–364
Bodenreider O (2004) The unified medical language system (UMLS): integrating bio-medical terminology. Nucleic Acids Res 32(suppl 1):D267–D270
Das Sarma A, Dong X, Halevy AY (2008) Bootstrapping pay-as-you-go data integration systems. In: Proceedings of the ACM SIGMOD conference
Do HH, Rahm E (2002) COMA – a system for flexible combination of schema matching approaches. In: Proceedings of the international conference on very large data bases, pp 610–621
Dong X, Gabrilovich E, Heitz G, Horn W, Lao N, Murphy K, Strohmann T, Sun S, Zhang W (2014) Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of KDD, New York, pp 601–610
Eberius J, Damme P, Braunschweig K, Thiele M, Lehner W (2013) Publish-time data integration for open data platforms. In: Proceedings of the ACM workshop on open data
Euzenat J, Shvaiko P (2013) Ontology matching, 2nd edn. Springer, Berlin/Heidelberg
Gross A, Hartung M, Kirsten T, Rahm E (2011) Mapping composition for matching large life science ontologies. In: Proceedings of the ICBO, pp 109–116
Gruetze T, Boehm C, Naumann F (2012) Holistic and scalable ontology alignment for linked open data. In: Proceedings of linked data on the web
He B, Chang KC (2003) Statistical schema matching across web query interfaces. In: Proceedings of the ACM SIGMOD conference, pp 217–228
He H, Meng W, Yu CT, Wu Z (2004) Automatic integration of web search interfaces with WISE-integrator. VLDB J 13(3):256–273
Hu W, Chen J, Zhang H, Qu Y (2011) How matchable are four thousand ontologies on the semantic web. In:
Proceedings of the ESWC, Springer LNCS 6643
schemas/datasets. Koepcke H, Rahm E (2010) Frameworks for entity match-
ing: a comparison. Data Knowl Eng 69(2):197–210
Limaye G, Sarawagi S, Chakrabarti S (2010) Annotating
and searching web tables using entities, types and
relationships. PVLDB 3(1–2):1338–1347
Cross-References Madhavan J, Bernstein PA, Doan A, Halevy AY (2005)
Corpus-based schema matching. In: Proceedings of the
IEEE ICDE conference
Large-Scale Entity Resolution Mahmoud HA, Aboulnaga A (2010) Schema cluster-
Large-Scale Schema Matching ing and retrieval for multi-domain pay-as-you-go data
HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases 965
integration systems. In: Proceedings of the ACM SIG- typed metadata and support for fast, rich metadata
MOD conference search. However, previous attempts failed mainly
Nentwig M, Soru T, Ngonga Ngomo A, Rahm E (2014)
LinkLion: a link repository for the web of data. In:
due to the reduced performance introduced by
Proceedings of the ESWC (Satellite Events), Springer adding database operations to the file system’s
LNCS 8798 critical path. However, recent improvements in
Rahm E (2011) Towards large-scale schema and ontol- the performance of distributed in-memory on-
ogy matching. In: Schema matching and mapping.
Springer, Berlin/Heidelberg, Germany
line transaction processing databases (NewSQL
Rahm E (2016) The case for holistic data integration. In: databases) led us to reinvestigate the possibility
Proceedings of the ADBIS conference, Springer LNCS of using a database to manage file system meta-
9809 data, but this time for a distributed, hierarchical
Rahm E, Bernstein PA (2001) A survey of approaches to
automatic schema matching. VLDB J 10(4):334–350
file system, the Hadoop file system (HDFS).
Raunich S, Rahm E (2014) Target-driven merging of The single-host metadata service of HDFS is
taxonomies with ATOM. Inf Syst 42:1–14 a well-known bottleneck for both the size of
Saleem K, Bellahsene Z, Hunt E (2008) PORSCHE: HDFS clusters and their throughput. In this en-
performance oriented SCHEma mediation. Inf Syst H
33(7–8):637–657
try, we characterize the performance of differ-
Wang J, Wang H, Wang Z, Zhu KQ (2012) Understanding ent NewSQL database operations and design the
tables on the web. In: Proceedings of the ER confer- metadata architecture of a new drop-in replace-
ence, Springer LNCS 7532, pp 141–155 ment for HDFS using, as far as possible, only
Yakout M, Ganjam K, Chakrabarti K, Chaudhuri S (2012)
Infogather: entity augmentation and attribute discovery those high-performance NewSQL database ac-
by holistic matching with web tables. In: Proceedings cess patterns. Our approach enabled us to scale
of the ACM SIGMOD conference, pp 97–108 the throughput of all HDFS operations by an
order of magnitude.
HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 1 In HDFS, a single namenode manages the namespace metadata. For high availability, a log of metadata changes is stored on a set of journal nodes using quorum-based replication. The log is subsequently replicated asynchronously to a standby namenode. In contrast, HopsFS supports multiple stateless namenodes and a single leader namenode that all access metadata stored in transactional shared memory (NDB). In HopsFS, leader election is coordinated using NDB, while ZooKeeper coordinates leader election for HDFS.
We use the database as a shared memory to implement a leader election and membership management service (Salman Niazi et al. 2015).

In both HDFS and HopsFS, datanodes are connected to all the namenodes. Datanodes periodically send a heartbeat message to all the namenodes to notify them that they are still alive. The heartbeat also contains information such as the datanode capacity, its available space, and its number of active connections. Heartbeats update the list of datanode descriptors stored at the namenodes. The datanode descriptor list is used by the namenodes for future block allocations, and it is not persisted in either HDFS or HopsFS, as it is rebuilt on system restart using heartbeats from the datanodes.

HopsFS clients support random, round-robin, and sticky policies to distribute the file system operations among the namenodes. If a file system operation fails due to namenode failure or overloading, the HopsFS client transparently retries the operation on another namenode (after backing off, if necessary). HopsFS clients refresh the namenode list every few seconds, enabling new namenodes to join an operational cluster. HDFS v2.X clients are fully compatible with HopsFS, although they do not distribute operations over the namenodes, as they assume there is a single active namenode.

MySQL's NDB Distributed Relational Database

HopsFS uses MySQL's Network Database (NDB) to store the file system metadata. NDB is an open-source, real-time, in-memory, shared-nothing, distributed relational database management system.

An NDB cluster consists of three types of nodes: NDB datanodes that store the tables, management nodes, and database clients; see Fig. 3. The management nodes are only used to disseminate the configuration information and to detect network partitions. The client nodes access the database using the SQL interface via a MySQL Server or using the native APIs implemented in C++, Java, JPA, and JavaScript/Node.js.
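The client-side namenode selection and transparent retry behavior described above can be sketched as follows. This is a minimal illustration, not HopsFS client code; the class name, policy strings, and use of `ConnectionError` to signal an unreachable or overloaded namenode are all hypothetical:

```python
import random
import time

class NamenodeSelector:
    """Client-side choice of a namenode under the three policies
    described above (random, round-robin, sticky), with transparent
    retry on another namenode after a failure."""

    def __init__(self, namenodes, policy="round_robin"):
        self.namenodes = list(namenodes)  # refreshed periodically in a real client
        self.policy = policy
        self._next = 0                    # round-robin cursor
        self._sticky = random.choice(self.namenodes)

    def pick(self):
        if self.policy == "random":
            return random.choice(self.namenodes)
        if self.policy == "sticky":
            return self._sticky
        nn = self.namenodes[self._next % len(self.namenodes)]
        self._next += 1
        return nn

    def execute(self, op, retries=3, backoff=0.1):
        """Try the operation on one namenode; on failure, back off and
        retry transparently on another namenode."""
        last_err = None
        for attempt in range(retries):
            nn = self.pick()
            try:
                return op(nn)
            except ConnectionError as err:   # namenode failed or overloaded
                last_err = err
                if self.policy == "sticky":  # fail over the sticky choice
                    others = [n for n in self.namenodes if n != nn]
                    self._sticky = random.choice(others or self.namenodes)
                time.sleep(backoff * (2 ** attempt))
        raise last_err
```

A sticky client keeps its operations on one namenode, while the random and round-robin policies spread load across all namenodes; in every case a failed operation is retried on another namenode after a backoff, as the text describes.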
HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases 967
Figure 3 shows an NDB cluster setup consisting of four NDB datanodes. Each NDB datanode consists of multiple transaction coordinator (TC) threads, which are responsible for performing two-phase commit transactions; local data management (LDM) threads, which are responsible for storing and replicating the data partitions assigned to the NDB datanode; send and receive threads, which are used to exchange the data between the NDB datanodes and the clients; and IO threads, which are responsible for performing disk IO operations.

MySQL Cluster horizontally partitions the tables, that is, the rows of the tables are uniformly distributed among the database partitions stored on the NDB datanodes. The NDB datanodes are organized into node replication groups of equal sizes to manage and replicate the data partitions. The size of the node replication group is the replication degree of the database. In the example setup, the NDB replication degree is set to two (the default value); therefore, each node replication group contains exactly two NDB datanodes. The first replication node group consists of NDB datanodes 1 and 2, and the second replication node group consists of NDB datanodes 3 and 4. Each node group is responsible for storing and replicating all the data partitions assigned to the NDB datanodes in the replication node group.

By default, NDB hashes the primary key column of the tables to distribute the table's rows among the different database partitions. Figure 3 shows how the inodes table for the namespace shown in Fig. 2 is partitioned and stored in the NDB cluster database. In production deployments each NDB datanode may store multiple data partitions, but, for simplicity, the datanodes shown in Fig. 3 store only one data partition. Replication node group 1 is responsible for storing and replicating partitions 1 and 2 of the inodes table. The primary replica of partition 1 is stored on NDB datanode 1, and the backup replica of partition 1 is stored on the other datanode of the same replication node group, that is, NDB datanode 2. For example, NDB datanode 1 is responsible for storing a data partition that stores the lib, conf, and libB inodes, and the replica of this data partition is stored on NDB datanode 2. NDB also supports user-defined partitioning of the stored tables, that is, it can partition the data based on a user-specified table column. User-defined partitioning provides greater control over how the data is distributed among different database partitions, which helps in implementing very efficient database operations.

NDB only supports the read committed transaction isolation level, which is not sufficient for implementing practical applications. However, NDB supports row-level locking, which can be used to isolate transactions operating on the same datasets. NDB supports exclusive (write) locks, shared (read) locks, and read committed (read the last committed value) locks. Using these locking primitives, HopsFS isolates different file system operations trying to mutate the same subset of inodes.
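The default hash partitioning and node-group replication described above can be sketched as follows. This is only an illustration: the hash function, node names, and row layout are hypothetical rather than NDB's actual implementation, and the user-defined partitioning example uses a parent-ID column of the kind HopsFS partitions on:

```python
import hashlib

NUM_PARTITIONS = 4
# Two replication node groups of size two (replication degree 2),
# mirroring the four-datanode example setup of Fig. 3.
NODE_GROUPS = {0: ["ndbd1", "ndbd2"], 1: ["ndbd3", "ndbd4"]}

def hash_partition(value):
    """Map a key to a partition by hashing (NDB's real hash differs)."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def partition_of(row, partition_key=None):
    """Default scheme: hash the primary key. User-defined partitioning:
    hash a user-specified column instead (e.g., parent_id), so that
    related rows land in the same partition."""
    key = row[partition_key] if partition_key else row["id"]
    return hash_partition(key)

def replicas_of(partition):
    """A partition lives in one node group: one primary replica plus
    backup replicas on the other datanodes of that group."""
    group = NODE_GROUPS[partition % len(NODE_GROUPS)]
    return {"primary": group[0], "backups": group[1:]}

# With user-defined partitioning on parent_id, all children of one
# directory map to a single partition:
children = [{"id": i, "parent_id": 7} for i in range(10, 15)]
parts = {partition_of(row, partition_key="parent_id") for row in children}
# len(parts) == 1
```

The point of the sketch is the contrast: hashing the primary key spreads rows uniformly, while hashing a user-chosen column deliberately co-locates related rows in one partition, which is what makes the efficient, partition-local operations discussed later possible.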
HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 2 A sample file system namespace. The diamond shapes represent directories, and the circles represent files
HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 3 System architecture diagram of MySQL's Network Database (NDB) cluster. The NDB cluster consists of three types of nodes, that is, database clients, NDB management nodes, and NDB database datanodes
datanode that holds the data; however, it will incur additional network traffic.

Primary Key (PK) Operation
The PK operation is the most efficient operation in NDB; it reads/writes/updates a single row stored in a database shard.

Batched Primary Key Operations
Batch operations use batching of PK operations to enable higher throughput while efficiently using the network bandwidth.

Partition-Pruned Index Scan (PPIS)
A PPIS is a distribution-aware operation that exploits the distribution-aware transactions and application-defined partitioning features of NDB to implement a scalable index scan, such that the scan is performed on a single database partition.

Distributed Index Scan (DIS)
A DIS is an index scan operation that executes on all database shards. It is not a distribution-aware operation, causing it to become increasingly slow for increasingly larger clusters.

Distributed Full Table Scan (DFTS)
DFTS operations are not distribution aware and do not use any index; thus, they read all the rows of a table stored in all database shards. DFTS operations incur high CPU and network costs; therefore, these operations should be avoided in implementing any database application.

Comparing Different Database Operations
Figure 4 shows the comparison of the throughput and latency of different database operations. These microbenchmarks were performed on a four-node NDB cluster setup running NDB version 7.5.6. All the experiments were run on premise using Dell PowerEdge R730xd servers (Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz, 256 GB RAM, 4 TB 7200 RPM HDD) connected using a single 10 GbE network adapter. We ran 400 concurrent database clients, which were uniformly distributed across 20 machines. The experiments are repeated twice. The first set of experiments consists of read-only operations, and the second set consists of read-write operations where all the data that is read is updated and stored back in the database.

Distributed full table scan operations and distributed index scan operations do not scale, as these operations have the lowest throughput and the highest end-to-end latency among all the database operations; therefore, these operations must be avoided in implementing file system operations. Primary key, batched primary key, and partition-pruned index scan operations are scalable database operations that can deliver very high throughput and have low end-to-end operational latency.

The design of the database schema, that is, how the data is laid out in different tables, the design of the primary keys of the tables, the types/number of indexes for different columns of the tables, and the data partitioning scheme for the tables, plays a significant role in choosing an appropriate (efficient) database operation to read/update the data, which is discussed in detail in the following sections.

HopsFS Distributed Metadata

In HopsFS the metadata is stored in the database in normalized form, that is, instead of storing the full file paths with each inode as in Thomson and Abadi (2015), HopsFS stores individual file path components. Storing the normalized data in the database has many advantages; for example, the rename file system operation can be efficiently implemented by updating a single row in the database that represents the inode. If complete file paths were stored for each inode, then it would not only consume a significant amount of precious database storage, but a simple directory rename operation would also require updating the file paths for all the children of that directory subtree.

Key tables in the HopsFS metadata schema are shown in the entity relational (ER) diagram in Fig. 5. The inodes for files and directories are stored in the inodes table. Each inode is represented by a single row in the table, which also
HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 4 Throughput and latency of different database operations (PK, Batch, PPIS, DIS, and DFTS; 1, 5, and 10 rows read or updated per operation). (a) Read-only operations' throughput. (b) Read-only operations' latency. (c) Read-write operations' throughput. (d) Read-write operations' latency
HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 5 Entity relational diagram for the HopsFS schema
stores information such as the inode ID, name, parent inode ID, permission attributes, user ID, and size. As HopsFS does not store complete file paths with each inode, the inode's parent ID is used to traverse the inodes table to discover the complete path of the inode. The extended metadata for the inodes is stored in the extended metadata entity. The quota table stores information about the disk space consumed by directories with quota restrictions enabled. When a client wants to update a file, it obtains a lease on the file, and the file lease information is stored in the lease table. Small files that are a few kilobytes in size are stored in the database for low-latency access. Such files are stored in the small files' data blocks tables.

Non-empty files may contain multiple blocks stored in the blocks table. The location of each replica of a block is stored in the replicas table. During its life cycle a block goes through various phases. Blocks may become under-replicated if a datanode fails, and such blocks are stored in the under-replicated blocks table. HopsFS periodically checks the health of all the data blocks. HopsFS ensures that the number of replicas of each block does not fall below the threshold required to provide high availability of the data. The replication monitor sends commands to datanodes to create more replicas of the under-replicated blocks. The blocks undergoing replication are stored in the pending replication blocks table. Similarly, a
replica of a block has various states during its life cycle. When a replica gets corrupted due to disk errors, it is moved to the corrupted replicas table. The replication monitor will create additional copies of the corrupt replicas by copying healthy replicas of the corresponding blocks, and the corrupt replicas are moved to the invalidated replicas table for deletion. Similarly, when a file is deleted, the replicas of the blocks belonging to the file are moved to the invalidated replicas table. Periodically the namenodes read the invalidated replicas tables and send commands to the datanodes to permanently remove the replicas from the datanodes' disks. Whenever a client writes to a new block's replica, this replica is moved to the replica under construction table. If too many replicas of a block exist (e.g., due to recovery of a datanode that contains blocks that were re-replicated), the extra copies are stored in the excess replicas table.

Metadata Partitioning

The choice of partitioning scheme for the hierarchical namespace is a key design decision for distributed metadata architectures. HopsFS uses user-defined partitioning for all the tables stored in NDB. We base our partitioning scheme on the expected relative frequency of HDFS operations in production deployments and the cost of the different database operations that can be used to implement the file system operations. Read-only metadata operations, such as list, stat, and file read operations, alone account for 95% of the operations in Spotify's HDFS cluster (Niazi et al. 2017). Based on the microbenchmarks of the different database operations, we have designed the metadata partitioning scheme such that frequent file system operations are implemented using the scalable database operations, such as primary key, batched primary key, and partition-pruned index scan operations. Index scans and full table scans are avoided, where possible, as they touch all database partitions and scale poorly.

HopsFS partitions inodes by their parents' inode IDs, resulting in inodes with the same parent inode being stored in the same database partition. That enables efficient implementation of the directory listing operation. When listing files in a directory, we use a hinting mechanism to start the transaction on a transaction coordinator located on the database partition that holds the child inodes for that directory. We can then use a pruned index scan to retrieve the contents of the directory locally. File inode-related metadata, that is, blocks and replicas and their states, is partitioned using the file's inode ID. This results in the metadata for a given file all being stored in a single database partition, again enabling efficient file operations. Note that the file inode-related entities also contain the inode's foreign key (not shown in the entity relational diagram, Fig. 5), which is also the partition key, enabling HopsFS to read the file inode-related metadata using partition-pruned index scans.

It is important that the partitioning scheme distributes the metadata evenly across all the database partitions. An uneven distribution of the metadata across the database partitions leads to poor performance of the database. It is not uncommon for big data applications to create tens of millions of files in a single directory (Ren et al. 2013; Patil et al. 2007). The HopsFS metadata partitioning scheme does not lead to hot spots, as the database partition that contains the immediate children (files and directories) of a parent directory only stores the names and access attributes, while the inode-related metadata (such as blocks, replicas, and associated states) that comprises the majority of the metadata is spread across all the database partitions, because it is partitioned by the inode IDs of the files. For hot files/directories, that is, files/directories that are frequently accessed by many concurrent clients, the performance of the file system will be limited by the performance of the database partitions that hold the required data for the hot files/directories.

HopsFS Transactional Operations

File system operations in HopsFS are divided into two main categories, that is, non-recursive file system operations that operate on a single file or directory, also called inode operations, and recursive file system operations that operate
on directory subtrees which may contain a large number of descendants, also called subtree operations. The file system operations are divided into these two categories because databases, for practical reasons, put a limit on the number of read and write operations that can be performed in a single database transaction, that is, a large file system operation may take a long time to complete, and the database may time out the transaction before the operation successfully completes. For example, creating a file, mkdir, or deleting a file are inode operations that read and update a limited amount of metadata in each transactional file system operation. However, recursive file system operations such as delete, move, or chmod on large directories may read and update millions of rows. As subtree operations cannot be performed in a single database operation, HopsFS breaks the large file system operation into small transactions. Using application-controlled locking for the subtree operations, HopsFS ensures that the semantics and consistency of the file system operation remain the same as in HDFS.

Inode Hint Cache

In HopsFS all file system operations are path based, that is, all file system operations operate on file/directory paths. When a namenode receives a file system operation, it traverses the file path to ensure that the file path is valid and that the client is authorized to perform the given file system operation. In order to recursively resolve the file path, HopsFS uses primary key operations to read each constituent inode one after the other. HopsFS partitions the inodes table using the parent ID column, that is, the constituent inodes of the file path may reside on different database partitions. In production deployments, such as in the case of Spotify, the average file system path length is 7, that is, on average it would take 7 round trips to the database to resolve the file path, which would increase the end-to-end latency of the file system operation. However, if the primary keys of all the constituent inodes of the path are known to the namenode in advance, then all the inodes can be read from the database in a single batched primary key operation. For reading multiple rows, batched primary key operations scale better than iterative primary key operations and have lower end-to-end latency. Recent studies have shown that in production deployments, such as at Yahoo, the file system access patterns follow a heavy-tailed distribution, that is, only 3% of the files account for 80% of the file system operations (Abad 2014); therefore, it is feasible for the namenodes to cache the primary keys of the recently accessed inodes. The namenodes store the primary keys of the inodes table in a local least recently used (LRU) cache called the inode hints cache. We use the inode hint cache to resolve the frequently used file paths using a single batch query. The inode cache does not require cache coherency mechanisms to keep it consistent across the namenodes, because if the inode hints cache contains invalid primary keys, then the batch operation will fail, and the namenode will fall back to recursively resolving the file path using primary key operations. After recursively reading the path components from the database, the namenodes also update the inode cache. Additionally, only the move operation changes the primary key of an inode. Move operations are a very tiny fraction of production file system workloads; for example, at Spotify less than 2% of the file system operations are file move operations.

Inode Operations

Inode operations go through the following phases.

Pre-transaction Phase
• The namenode receives a file system operation from the client.
• The namenode checks the local inode cache, and if the file path components are found in the cache, then it creates a batch operation to read all file components.

Transactional File System Operation
• The inode cache is also used to set the partition key hint for the transactional file system operation. The inode ID of the last valid path component is used as the partition key hint.
• The transaction is enlisted on the desired NDB datanode according to the partition hint. If the
locks on all inodes in the subtree in the same total order used to lock inodes. This is repeated down the subtree to the leaves, and a tree data structure containing the inodes in the subtree is built in memory at the namenode. The tree is later used by some subtree operations, such as the move and delete operations, to process the inodes. If the subtree operations protocol fails to quiesce the subtree due to concurrent file system operations on the subtree, it is retried with exponential backoff.

Phase 3 – Executing the FS Operation: Finally, after locking the subtree and ensuring that no other file system operation is operating on the same subtree, it is safe to perform the requested file system operation. In the last phase, the subtree operation is broken into smaller operations, which are run in parallel to reduce the latency of the subtree operation. For example, when deleting a large directory, starting from the leaves of the directory subtree, the contents of the directory are deleted in batches of multiple files.

Handling Failed Subtree Operations

As the locks for subtree operations are controlled by the application (i.e., the namenodes), it is important that the subtree locks are cleared when a namenode holding a subtree lock fails. HopsFS uses a lazy approach for cleaning up the subtree locks left by failed namenodes. The subtree lock flag also contains information about the namenode that holds the subtree lock. Using the group membership service, the namenode maintains a list of all the active namenodes in the system. During the execution of file system operations, when a namenode encounters an inode with a subtree lock flag set that does not belong to any live namenode, the subtree operation flag is reset.

It is imperative that failed subtree operations do not leave the file system metadata in an inconsistent state. For example, the deletion of a subtree progresses from the leaves to the root of the subtree. If the recursive delete operation fails, then the inodes of the partially deleted subtree remain connected to the namespace. When a subtree operation fails due to a namenode failure, the client resubmits the operation to another alive namenode to complete the file system operation.

Storing Small Files in the Database

HopsFS supports two file storage layers, in contrast to the single file storage service in HDFS, that is, HopsFS can store data blocks on HopsFS datanodes as well as in the distributed database. Large files' blocks are stored on the HopsFS datanodes, while small files (≤64 KB) are stored in the distributed database. The technique of storing the data blocks with the metadata is also called inode stuffing. In HopsFS, an average file requires 1.5 KB of metadata. As a rule of thumb, if the size of a file is less than the size of the metadata (in our case 1 KB or less), then the data block is stored in memory with the metadata. Other small files are stored in on-disk data tables. The latest solid-state drives (SSDs) are recommended for storing small files' data blocks, as typical workloads produce a large number of random reads/writes on disk for small amounts of data.

Inode stuffing has two main advantages. First, it simplifies the file system operations protocol for reading/writing small files, that is, many network round trips between the client and datanodes are avoided, significantly reducing the expected latency for operations on small files. Second, it reduces the number of blocks that are stored on the datanodes and reduces the block reporting traffic on the namenode. For example, when a client sends a request to the namenode to read a file, the namenode retrieves the file's metadata from the database. In the case of a small file, the namenode also fetches the data block from the database. The namenode then returns the file's metadata along with the data block to the client. Compared to HDFS, this removes the additional step of establishing a validated, secure communication channel with the datanodes (Kerberos, TLS/SSL sockets, and a block token are all required for secure client-datanode communication), resulting in lower latencies for file read operations (Salman Niazi et al. 2017).
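The size-based placement rule for inode stuffing described above can be sketched as follows. This is only an illustration; the function and tier names are hypothetical, while the thresholds are the ~1 KB rule of thumb and the 64 KB small-file limit from the text:

```python
# Thresholds taken from the text: files no larger than the ~1 KB of
# metadata are stuffed in memory with the metadata, other files up to
# 64 KB go to on-disk database tables, and larger files go to the
# HopsFS datanodes.
IN_MEMORY_LIMIT = 1 * 1024       # bytes
SMALL_FILE_LIMIT = 64 * 1024     # bytes

def storage_tier(file_size):
    """Pick the storage layer for a file's data blocks by size."""
    if file_size <= IN_MEMORY_LIMIT:
        return "db-memory"       # stuffed with the metadata in memory
    if file_size <= SMALL_FILE_LIMIT:
        return "db-disk"         # on-disk small-file data tables
    return "datanodes"           # regular block storage on datanodes
```

For the two database tiers, the namenode can return the data block together with the metadata, which is why small-file reads avoid the secure client-datanode channel and its extra round trips.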
HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 6 Throughput of different file system operations in HopsFS and HDFS. As HDFS supports only one active namenode, the throughput for HDFS file system operations is represented by the horizontal (blue) lines. (a) Mkdir. (b) Create empty file. (c) Append file. (d) Read file (getBlockLocations). (e) Stat dir. (f) Stat file. (g) List dir. (h) List file. (i) Rename file. (j) Delete file. (k) Chmod dir. (l) Chmod file. (m) Chown dir. (n) Chown file
Xeon(R) CPU E5-2620 v3 @ 2.40 GHz, 256 GB RAM, 4 TB 7200 RPM HDDs) connected using a single 10 GbE network adapter. Unless stated otherwise, NDB, version 7.5.3, was deployed on 12 nodes configured to run using 22 threads each, and the data replication degree was 2. All other configuration parameters were set to default values. Moreover, in our experiments Apache HDFS, version 2.7.2, was deployed on 5 servers. We did not colocate the file system clients with the namenodes or with the database nodes. As we are only evaluating metadata performance, all the tests created files of zero length (similar to the NNThroughputBenchmark, Shvachko 2010). Testing with non-empty files requires an order of magnitude more HDFS/HopsFS datanodes and provides no further insight.

Figure 6 shows the throughput of different file system operations as a function of the number of namenodes in the system. HDFS supports only one active namenode; therefore, the throughput of the file system is represented by a (blue) horizontal line in the graphs. From the graphs it is quite clear that HopsFS outperforms HDFS for all file system operations and has significantly better performance than HDFS for the most common file system operations. For example, HopsFS improves the throughput of create file and read file operations by 56.2X and 16.8X, respectively.

Additional experiments for industrial workloads, write-intensive synthetic workloads, metadata scalability, and performance under failures are available in the HopsFS papers (Niazi et al. 2017; Ismail et al. 2017).

Conclusions

In this entry, we introduced HopsFS, an open-source, highly available, drop-in alternative for HDFS that provides highly available metadata that scales out in both capacity and throughput by adding new namenodes and database nodes. We designed HopsFS file system operations to be deadlock-free by meeting the two preconditions for deadlock freedom: all operations follow the same total order when taking locks, and locks are never upgraded. Our architecture supports a pluggable database storage engine, and other NewSQL databases could be used, providing they have good support for cross-partition transactions and techniques such as partition-pruned index scans. Most importantly, our architecture now opens up metadata in HDFS for tinkering: custom metadata only involves adding new tables and cleaner logic at the namenodes.

Cross-References

Distributed File Systems
In-memory Transactions
Hadoop

References

Abad CL (2014) Big data storage workload characterization, modeling and synthetic generation. PhD thesis, University of Illinois at Urbana-Champaign
Hammer-Bench (2016) Distributed metadata benchmark to HDFS. https://github.com/smkniazi/hammer-bench. [Online; Accessed 1 Jan 2016]
Ismail M, Gebremeskel E, Kakantousis T, Berthou G, Dowling J (2017) Hopsworks: improving user experience and development on hadoop with scalable, strongly consistent metadata. In: 2017 IEEE 37th international conference on distributed computing systems (ICDCS), pp 2525–2528
Ismail M, Niazi S, Ronström M, Haridi S, Dowling J (2017) Scaling HDFS to more than 1 million operations per second with HopsFS. In: Proceedings of the 17th IEEE/ACM international symposium on cluster, cloud and grid computing, CCGrid '17. IEEE Press, Piscataway, pp 683–688
Niazi S, Haridi S, Dowling J (2017) Size matters: improving the performance of small files in HDFS. https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdfl. [Online; Accessed 30 June 2017]
Niazi S, Ismail M, Haridi S, Dowling J, Grohsschmiedt S, Ronström M (2017) HopsFS: scaling hierarchical file system metadata using NewSQL databases. In: 15th USENIX conference on file and storage technologies (FAST'17). USENIX Association, Santa Clara, pp 89–104
Noll MG (2015) Benchmarking and stress testing an hadoop cluster with TeraSort, TestDFSIO & Co. http://www.michael-noll.com/blog/2011/04/09/benchmarking-
and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/. [Online; Accessed 3 Sept 2015]
Ovsiannikov M, Rus S, Reeves D, Sutter P, Rao S, Kelly J (2013) The quantcast file system. Proc VLDB Endow 6(11):1092–1101
Patil SV, Gibson GA, Lang S, Polte M (2007) GIGA+: scalable directories for shared file systems. In: Proceedings of the 2nd international workshop on petascale data storage: held in conjunction with supercomputing '07, PDSW '07. ACM, New York, pp 26–29
Ren K, Kwon Y, Balazinska M, Howe B (2013) Hadoop's adolescence: an analysis of hadoop usage in scientific workloads. Proc VLDB Endow 6(10):853–864
Salman Niazi GB, Ismail M, Dowling J (2015) Leader election using NewSQL systems. In: Proceeding of DAIS 2015. Springer, pp 158–172
Shvachko KV (2010) HDFS scalability: the limits to growth. Login Mag USENIX 35(2):6–16
Thomson A, Abadi DJ (2015) CalvinFS: consistent WAN replication and scalable metadata management for distributed file systems. In: 13th USENIX conference on file and storage technologies (FAST 15). USENIX Association, Santa Clara, pp 1–14

HTAP
Hybrid OLTP and OLAP

Hybrid Data Warehouse
Hybrid Systems Based on Traditional Database Extensions

Hybrid DBMS-Hadoop Solution
Hybrid Systems Based on Traditional Database Extensions

Hybrid OLTP and OLAP

Jana Giceva (1) and Mohammad Sadoghi (2)
(1) Department of Computing, Imperial College London, London, UK
(2) University of California, Davis, CA, USA

Synonyms

HTAP; Hybrid transactional and analytical processing; Operational analytics; Transactional analytics

Definitions

Hybrid transactional and analytical processing (HTAP) refers to system architectures and techniques that enable modern database management systems (DBMS) to perform real-time analytics on data that is ingested and modified in the transactional database engine. It is a term that was originally coined by Gartner, where Pezzini et al. (2014) highlight the need of enterprises to close the gap between analytics and action for better business agility and trend awareness.

Overview

The goal of running transactions and analytics on the same data has been around for decades, but has not been fully realized due to technology limitations. Today, businesses can no longer afford to miss the real-time insights from data that is in their transactional system, as they may lose competitive edge unless business decisions are made on latest data (analytics on latest data implies allowing the query to run on any desired level of isolation, including dirty read, committed read, snapshot read, repeatable read, or serializable) or fresh data (analytics on fresh data implies running queries on a recent snapshot of data that may not necessarily be the latest possible snapshot when the query execution began or a
consistent snapshot). As a result, in recent years in both academia and industry, there has been an effort to address this problem by designing techniques that combine the transactional and analytical capabilities and integrate them in a single hybrid transactional and analytical processing (HTAP) system.

Online transaction processing (OLTP) systems are optimized for write-intensive workloads. OLTP systems employ data structures that are designed for a high volume of point access queries, with a goal to maximize throughput and minimize latency. Transactional DBMSs typically store data in a row format, relying on indexes and efficient mechanisms for concurrency control. Online analytical processing (OLAP) systems are optimized for heavy read-only queries that touch large amounts of data. The data structures used are optimized for storing and accessing large volumes of data to be transferred between the storage layer (disk or memory) and the processing layer (e.g., CPUs, GPUs, FPGAs). Analytical DBMSs store data in column stores with fast scanning capability that is gradually eliminating the need for maintaining indexes. Furthermore, to keep high performance and to avoid the overhead of concurrency, these systems only batch updates at predetermined intervals. This, however, limits the data freshness visible to analytical queries.

Therefore, given the distinct properties of transactional data and (stale) analytical data, most enterprises have opted for a solution that separates the management of transactional and analytical data. In such a setup, analytics is performed as part of a specialized decision support system (DSS) in isolated data warehouses. The DSS executes complex long-running queries on data at rest, and the updates from the transactional database are propagated via an expensive and slow extract-transform-load (ETL) process. The ETL process transforms data from a transactional-friendly to an analytics-friendly layout, indexes the data, and materializes selected pre-aggregations. Today's industry requirements are in conflict with such a design. Applications want to interface with fewer systems and try to avoid the burden of moving the transactional data with the expensive ETL process to the analytical warehouse. Furthermore, systems try to reduce the amount of data replication and the cost that it brings. More importantly, enterprises want to improve data freshness and perform analytics on operational data. Ideally, systems should enable applications to immediately react on facts and trends learned by posing an analytical query within the same transactional request. In short, the choice of separating OLTP and OLAP is becoming obsolete in light of the exponential increase in data volume and velocity and the necessity to enrich the enterprise to operate based on real-time insights.

In the rest of this entry, we discuss design decisions, key research findings, and open questions in the database community revolving around exciting and imminent HTAP challenges.

Key Research Findings

Among existing HTAP solutions, we differentiate between two main design paradigms: (1) operating on a single data representation (e.g., row or column store format) vs. multiple representations of data (e.g., both row and column store formats) and (2) operating on a single copy of data (i.e., a single data replica) vs. maintaining multiple copies of data (through a variety of data replication techniques).

Examples of systems that operate on a single data representation are SAP HANA by Sikka et al. (2012), HyPer by Neumann et al. (2015), and L-Store by Sadoghi et al. (2016a,b, 2013, 2014). They provide a unified data representation and enable users to analyze the latest (not just fresh data) or any version of data. Here we must note that operating on a single data representation does not necessarily mean avoiding the need for data replication. For instance, the initial version of HyPer relied on a copy-on-write mechanism from the underlying operating system, while SAP HANA keeps the data in a main and a delta, which contains the most recent updates still not merged with the main. L-Store maintains the history of changes to the data (the delta) alongside the lazily maintained latest copy of data in a unified representation form.
Other systems follow the more classical approach of data warehousing, where multiple data representations are exploited, either within a single instance of the data or by dedicating a separate replica for different workloads. Example systems are SQL Server's column-store index proposed by Larson et al. (2015), Oracle's in-memory dual-format by Lahiri et al. (2015), SAP HANA SOE's architecture described by Goel et al. (2015), and the work on fractured mirrors by Ramamurthy et al. (2002), to name a few.

Unified Data Representation

The first main challenge for single data representation HTAP systems is choosing (or designing) a suitable data storage format for hybrid workloads. As noted earlier, there are many systems whose storage format was optimized to support efficient execution of a particular class of workloads. Two prominent example formats are row store (NSM) and column store (DSM), but researchers have also shown the benefit of alternative stores like partition attributes across (PAX) proposed by Ailamaki et al. (2001) that strike a balance between the two extremes. OLTP engines typically use a row-store format with highly tuned data structures (e.g., lock-free skip lists, latch-free Bw-trees), which provide low latencies and high throughput for operational workloads. OLAP engines have been repeatedly shown to perform better if data is stored in column stores. Column stores provide better support for data compression and are more optimized for in-memory processing of large volumes of data. Thus, the primary focus was on developing new storage layouts that can efficiently support both OLTP and OLAP workloads.

The main challenge is the conflict in the properties of data structures suitable to handle a large volume of update-heavy requests on the one hand and fast data scans on the other hand. Researchers have explored where to put the updates (whether to apply them in-place, append them in a log-structured format, or store them in a delta and periodically merge them with the main), how to best arrange the records (row-store, column-store, or a PAX format), and how to handle multiple versions and when to do garbage collection.

Sadoghi et al. (2016a,b, 2013, 2014) proposed L-Store (lineage-based data store), which combines the real-time processing of transactional and analytical workloads within a single unified engine. L-Store bridges the gap between managing the data that is being updated at a high velocity and analyzing a large volume of data by introducing a novel update-friendly lineage-based storage architecture (LSA). This is achieved by developing a contention-free and lazy staging of data from write-optimized into read-optimized form in a transactionally consistent manner, without the need to replicate data, to maintain multiple representations of data, or to develop multiple loosely integrated engines that limit real-time capabilities.

The basic design of LSA consists of two core ideas, as captured in Fig. 1. First, the base data is kept in read-only blocks, where a block is an ordered set of objects of any type. The modification to the data blocks (also referred to as base pages) is accumulated in the corresponding lineage blocks (also referred to as tail pages). Second, a lineage mapping links an object in the data blocks to its recent updates in a lineage block. This essentially decouples the updates from the physical location of objects. Therefore, via the lineage mapping, both the base and updated data are retrievable. A lazy, contention-free background process merges the recent updates with their corresponding read-only base data in order to construct new consolidated data blocks, while data is continuously being updated in the foreground without any interruption. The merge is necessary to ensure optimal performance for the analytical queries. Furthermore, each data block internally maintains the lineage of the updates consolidated thus far. By exploiting the decoupling and lineage tracking, the merge process, which only creates a new set of read-only consolidated data blocks, is carried out completely independently from update queries, which only append changes to lineage blocks and update the lineage mapping. Hence, there is no contention in the write paths of update queries and the merge process. That is a fundamental property necessary to build a highly scalable distributed storage layer that is updatable.
Hybrid OLTP and OLAP, Fig. 1 The lineage-based storage architecture: primary and secondary indexes point into read-only base data blocks; updates are appended to append-only lineage blocks; lineage tracking and a lineage mapping connect the base data to its updates, which are lazily consolidated into new read-only data blocks holding the latest versions

[...] is the data transformation only recommended for read-only data? If the former, when does the data transformation take place - is it incremental or immediate? Does it happen as part of query execution, or is it done as part of a background process?

Grund et al. (2010) with their HYRISE system propose the use of the partially decomposed storage model and automatically partition tables into vertical partitions of varying depth, depending on how the columns are accessed. Unfortunately, the bandwidth savings achieved with this storage model come with an increased CPU cost. A follow-up work by Pirk et al. (2013) improves the performance by combining the partially decomposed storage model with just-in-time (JiT) compilation of queries, which eliminates the CPU-inefficient function calls. Based on the work in HYRISE, Plattner (2009) in SAP developed HANA, where tuples start out in a row-store and are then migrated to a compressed column-store storage manager. Similarly, MemSQL, Microsoft's SQL Server, and Oracle also support the two storage formats. One of the main differentiators among these systems is who decides on the partitioned layout: i.e., whether the administrator manually specifies which relations and attributes should be stored as a column store, or the system derives it automatically. The latter can be achieved by monitoring and identifying hot vs. cold data or which attributes are accessed together in OLAP queries. There are a few research systems that have demonstrated the benefits of using adaptive data stores that can transform the data from one format to another with an evolving HTAP workload. Dittrich and Jindal (2011) show how to maintain multiple copies of the data in different layouts and use a logical log as a primary storage structure before creating a secondary physical representation from the log entries. In H2O, Alagiannis et al. (2014) maintain the same data in different storage formats and improve performance of the read-only workload by leveraging multiple processing engines. The work by Arulraj et al. (2016) performs data reorganization on cold data. The reorganization is done as part of an incremental background process that does not impact the latency-sensitive transactions but improves the performance of the analytical queries.

Data Replication

Operating on a single data replica may come at a cost due to an identified trade-off between degree of data freshness, system performance, and predictability of throughput and response time for a hybrid workload mix. Thus, the challenges for handling HTAP go beyond the layout in which the data is stored. A recent study by Psaroudakis et al. (2014) shows that systems like SAP HANA and HyPer, which rely on partial replication, experience interference problems between the transactional and analytical workloads. The observed performance degradation is attributed to both resource sharing (physical cores and shared CPU caches) and synchronization overhead when querying and/or updating the latest data.

This observation motivated researchers to revisit the benefits of data replication and address the open challenge of efficient update propagation. One of the first approaches for addressing hybrid workloads was introduced by Ramamurthy et al. (2002), where the authors proposed replicating the data and storing it in two data formats (row and columnar). The key advantage is that the data layouts and the associated data structures, as well as the data processing techniques, can be tailored to the requirements of the particular workload. Additionally, the system can make efficient use of the available hardware resources and their specialization. This is often viewed as a loose form of an HTAP system. The main challenge for this approach is maintaining the OLAP replica up to date and avoiding the expensive ETL process. BatchDB by Makreshanski et al. (2017) relies on a primary-secondary form of replication, with the primary replica handling the OLTP workload and the updates being propagated to a secondary replica that handles the OLAP workload (Fig. 2). BatchDB successfully addresses the problem of performance interference by spatially partitioning resources to the
two data replicas and their execution engines (e.g., either by allocating a dedicated machine and connecting the replicas with RDMA over InfiniBand, or by allocating separate NUMA nodes within a multi-socket server machine). The system supports high-performance analytical query processing on fresh data with snapshot isolation guarantees. It achieves that by queuing queries and updates on the OLAP side and scheduling them in batches, executing one batch at a time. BatchDB ensures that the OLAP queries see the latest snapshot of the data by using a lightweight propagation algorithm, which incurs a small overhead on the transactional engine to extract the updates and efficiently applies them on the analytical side at a faster rate than what has been recorded as a transactional throughput in the literature so far. Such a system design solves the problem of high performance and isolation for hybrid workloads and, most importantly, enables data analytics on fresh data. The main problem is that it limits the functionality of what an HTAP application may do with the data. For instance, it does not support interleaving of analytical queries within a transaction and does not allow analysts to go back in time and explore prior versions of the data to explain certain trends.

Oracle GoldenGate and SAP HANA SOE also leverage data replication and rely on specialized data processing engines for the different replicas. In GoldenGate, Oracle uses a publish/subscribe mechanism and propagates the updates from the operational database to the other replicas if they have subscribed to receive the updates log.

Examples of Applications

There are many possible applications of HTAP for modern businesses. In today's dynamic data-driven world, real-time advanced analytics (e.g., planning, forecasting, and what-if analysis) becomes an integral part of many business processes rather than a separate activity after the events have happened. One example is online retail. The transactional engine keeps track of data including the inventory list, the products, and the registered customers, and manages all the purchases. An analyst can use this online data to understand customer behavior and come up with better strategies for product placement, optimized pricing, discounts, and personalized recommendations, as well as to identify products which are in high demand and do proactive inventory refill. Another business use for HTAP comes from the financial industry sector. The transactional engine already supports millions of banking transactions per second and real-time fraud detection and action, which can save billions of dollars. Yet another example of real-time detection and action is inspired by content delivery networks (CDNs), where businesses need to monitor the network of web servers that deliver and distribute the content in real time. An HTAP system can help identify distributed denial-of-service (DDoS) attacks, find locations with spikes to redistribute the traffic bandwidth, get the top-k customers in terms of traffic usage, and be able to act on it immediately.

[...] an important game changer for databases, and as in-memory processing enabled much of the technology for approaching true HTAP, there is discussion that the coming computing heterogeneity is going to significantly influence the design of future systems, as highlighted by Appuswamy et al. (2017).
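The batch-oriented propagation used by replication-based designs such as BatchDB, described under "Data Replication" above, can be sketched minimally. This toy (all names hypothetical) queues extracted updates and analytical queries on the secondary replica and runs them one batch at a time, so every query batch sees one consistent, fresh snapshot.

```python
from collections import deque

class AnalyticalReplica:
    """Toy batch scheduler in the spirit of primary-secondary HTAP
    replication: updates extracted from the OLTP primary and incoming
    OLAP queries are queued, then executed one batch at a time, so a
    whole query batch runs against a single consistent snapshot."""

    def __init__(self):
        self.data = {}                  # secondary (OLAP) copy of the data
        self.pending_updates = deque()  # updates extracted from the primary
        self.pending_queries = deque()  # analytical queries awaiting a batch

    def propagate(self, updates):
        # lightweight extraction on the OLTP side just enqueues updates
        self.pending_updates.extend(updates)

    def submit(self, query):
        self.pending_queries.append(query)

    def run_batch(self):
        # 1) apply all queued updates, advancing the snapshot
        while self.pending_updates:
            key, value = self.pending_updates.popleft()
            self.data[key] = value
        # 2) run all queued queries against that single snapshot
        results = [q(self.data) for q in self.pending_queries]
        self.pending_queries.clear()
        return results

replica = AnalyticalReplica()
replica.propagate([("eur", 100), ("usd", 250)])
replica.submit(lambda snapshot: sum(snapshot.values()))
assert replica.run_batch() == [350]   # query saw the freshly applied batch
```

Executing updates and queries in alternating batches is what gives snapshot-consistent reads without per-query synchronization; the cost, as the entry notes, is that analytics cannot be interleaved inside a transaction.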
Cross-References
References

[...] database engines with native graph processing. CoRR abs/1709.06715
Lahiri T, Chavan S, Colgan M, Das D, Ganesh A, Gleeson M, Hase S, Holloway A, Kamp J, Lee TH, Loaiza J, Macnaughton N, Marwah V, Mukherjee N, Mullick A, Muthulingam S, Raja V, Roth M, Soylemez E, Zait M (2015) Oracle database in-memory: a dual format in-memory database. In: ICDE, pp 1253–1258. https://doi.org/10.1109/ICDE.2015.7113373
Larson PA, Birka A, Hanson EN, Huang W, Nowakiewicz M, Papadimos V (2015) Real-time analytical processing with SQL server. PVLDB 8(12):1740–1751
Makreshanski D, Giceva J, Barthels C, Alonso G (2017) BatchDB: efficient isolated execution of hybrid OLTP+OLAP workloads for interactive applications. In: SIGMOD'17, pp 37–50
Najafi M, Sadoghi M, Jacobsen H (2015) The FQP vision: flexible query processing on a reconfigurable computing fabric. SIGMOD Rec 44(2):5–10
Najafi M, Zhang K, Sadoghi M, Jacobsen H (2017) Hardware acceleration landscape for distributed real-time analytics: virtues and limitations. In: ICDCS, pp 1938–1948
Neumann T, Mühlbauer T, Kemper A (2015) Fast serializable multi-version concurrency control for main-memory database systems. In: SIGMOD, pp 677–689
Pezzini M, Feinberg D, Rayner N, Edjali R (2014) Hybrid transaction/analytical processing will foster opportunities for dramatic business innovation. https://www.gartner.com/doc/2657815/hybrid-transactionanalytical-processing-foster-opportunities
Pilman M, Bocksrocker K, Braun L, Marroquín R, Kossmann D (2017) Fast scans on key-value stores. PVLDB 10(11):1526–1537
Pirk H, Funke F, Grund M, Neumann T, Leser U, Manegold S, Kemper A, Kersten ML (2013) CPU and cache efficient management of memory-resident databases. In: ICDE, pp 14–25
Plattner H (2009) A common database approach for OLTP and OLAP using an in-memory column database. In: SIGMOD, pp 1–2
Psaroudakis I, Wolf F, May N, Neumann T, Böhm A, Ailamaki A, Sattler KU (2014) Scaling up mixed workloads: a battle of data freshness, flexibility, and scheduling. In: TPCTC 2014, pp 97–112
Ramamurthy R, DeWitt DJ, Su Q (2002) A case for fractured mirrors. In: VLDB'02, pp 430–441
Sadoghi M, Ross KA, Canim M, Bhattacharjee B (2013) Making updates disk-I/O friendly using SSDs. PVLDB 6(11):997–1008
Sadoghi M, Canim M, Bhattacharjee B, Nagel F, Ross KA (2014) Reducing database locking contention through multi-version concurrency. PVLDB 7(13):1331–1342
Sadoghi M, Bhattacherjee S, Bhattacharjee B, Canim M (2016a) L-store: a real-time OLTP and OLAP system. CoRR abs/1601.04084
Sadoghi M, Ross KA, Canim M, Bhattacharjee B (2016b) Exploiting SSDs in operational multiversion databases. VLDB J 25(5):651–672
Sikka V, Färber F, Lehner W, Cha SK, Peh T, Bornhövd C (2012) Efficient transaction processing in SAP HANA database: the end of a column store myth. In: SIGMOD'12, pp 731–742
Stonebraker M, Cetintemel U (2005) "One size fits all": an idea whose time has come and gone. In: ICDE, pp 2–11
Teubner J, Woods L (2013) Data processing on FPGAs. Synthesis lectures on data management. Morgan & Claypool Publishers, San Rafael

Hybrid Systems Based on Traditional Database Extensions

Yuanyuan Tian
IBM Research – Almaden, San Jose, CA, USA

Synonyms

Hadoop EDW Integration; Hybrid DBMS-Hadoop Solution; Hybrid Data Warehouse; Hybrid Warehouse

Definitions

A hybrid system based on traditional database extensions refers to a federated system between Hadoop and an enterprise data warehouse (EDW). In such a system, a query may need to combine data stored in both Hadoop and an EDW.

Overview

The co-existence of Hadoop and enterprise data warehouses (EDWs), together with the new application requirement of correlating data stored in both systems, has created the need for a special federation between Hadoop-like big data platforms and EDWs. This entry presents the motivation behind such hybrid systems, highlights the unique challenges of building them, surveys existing hybrid solutions, and finally discusses potential future directions.
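A hybrid (federated) query of the kind defined above can be sketched in miniature: a coordinator pushes a sub-query to each side (EDW and Hadoop) and then joins the partial results. This toy uses in-memory stand-ins for both systems; all names and data are hypothetical.

```python
# Toy federation: push a filtering sub-query to each "system",
# then join the partial results at the coordinator.

edw_orders = [  # stand-in for warehouse-resident, up-to-date data
    {"order_id": 1, "cust": "alice", "total": 120},
    {"order_id": 2, "cust": "bob", "total": 80},
]
hadoop_clicks = [  # stand-in for HDFS-resident log data
    {"cust": "alice", "clicks": 37},
    {"cust": "carol", "clicks": 12},
]

def edw_subquery(min_total):
    # pushed down to the EDW side: filter plus projection
    return {r["cust"]: r["total"] for r in edw_orders if r["total"] >= min_total}

def hadoop_subquery():
    # pushed down to the Hadoop side: projection only
    return {r["cust"]: r["clicks"] for r in hadoop_clicks}

def hybrid_join(min_total):
    # the coordinator joins the two partial results on the customer key
    orders, clicks = edw_subquery(min_total), hadoop_subquery()
    return {c: (orders[c], clicks[c]) for c in orders.keys() & clicks.keys()}

assert hybrid_join(100) == {"alice": (120, 37)}
```

The sketch also hints at the entry's central challenge: real solutions must decide how much work each side does and how the partial results are moved, ideally in parallel rather than through a single serial connection.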
[...] processing. This is fundamentally different from databases. Second, a database can have indexes, but a SQL-on-Hadoop processor cannot exploit the same record-level indexes, because it does not control the data; data can be inserted outside the query processor's control. In addition, HDFS does not provide the same level of performance for small reads as a local file system, as it was mainly designed for large scans. Finally, HDFS does not support update in place; as a result, SQL-on-Hadoop systems are mostly used for processing data that are not or infrequently updated. Although the latest version of Hive has added transaction support, it is done through deltas, thus more appropriate for batch updates. These differences between the two systems make this problem both hybrid and asymmetric.

• There are also differences in terms of capacity and cluster setup. While EDWs are typically deployed on high-end hardware, Hadoop (or a SQL-on-Hadoop system) is deployed on commodity servers. A Hadoop cluster typically has a much larger scale (up to 10,000s of machines), hence has more storage and computational capacity than an EDW. In fact, today more and more enterprise investment has shifted from EDWs to Hadoop.

• Finally, the hybrid warehouse is also not a typical federation between two databases. Existing federation solutions (Josifovski et al. 2002; Adali et al. 1996; Shan et al. 1995; Tomasic et al. 1998; Papakonstantinou et al. 1995) use a client-server model to access the remote databases and move the data. In particular, they use JDBC/ODBC interfaces for pushing down a maximal sub-query and retrieving its results. Such a solution ingests the result data serially through the single JDBC/ODBC connection and hence is only feasible for small amounts of data. In the hybrid warehouse case, a new solution that connects at a lower system layer is needed to exploit the massive parallelism on both the Hadoop side and the EDW side and move data between the two in parallel. This requires nontrivial coordination and communication between the two systems.

Table 1 summarizes the differences between an EDW and Hadoop, and hence highlights the challenges of hybrid warehouse solutions.

Hybrid Warehouse Solutions Based on Database Extensions

As the Hadoop ecosystem and related big-data technologies mature, the hybrid warehouse solutions have also evolved through several iterations.

Parallel Movement of Data in Bulk

The first iteration of hybrid warehouse solutions came right after EDW vendors realized that Hadoop is a major disruption to traditional EDWs. Both EDW and Hadoop vendors started to allow data to be moved between EDWs and Hadoop efficiently, and to improve performance most solutions leverage the massive parallelism in Hadoop and EDWs to transfer data in parallel. Sqoop is such a general tool from
Hybrid Systems Based on Traditional Database Extensions, Table 1 Differences between EDWs and Hadoop

                  EDW                                        Hadoop
Data ownership    Own its data; control data                 Work with existing files on shared
                  organization and partitioning              storage; cannot dictate data layout
Index support     Build and exploit index                    Scan-based only, no index support
Update            Efficient record-level updates             Good for batch updates
Capacity          High-end servers; smaller cluster size     Commodity machines; larger cluster size
                                                             (up to 10,000s of nodes)
the Hadoop camp that imports database tables onto HDFS in different formats and exports the data in HDFS directories as database tables. Many EDW vendors have also provided their own version of "Hadoop connectors." Özcan et al. (2011) introduced a utility to bulk-load HDFS data to IBM Db2. Teradata Connector for Hadoop (Teradata 2013) supports bi-directional bulk load between Teradata systems and Hadoop ecosystem components, such as HDFS and Hive. HP Vertica Connector for HDFS (Vertica 2016) allows bulk-loading HDFS data into HP Vertica. Oracle Loader for Hadoop (Oracle 2012) bulk-loads data from Hadoop to Oracle Database.

Batch movement of data between EDWs and Hadoop has several limitations. On one hand, with this approach, one data set resides in two places, so maintaining it requires extra care. In particular, HDFS still does not have a good solution to do updates in place, whereas warehouses always have up-to-date data. As a result, an application needs to reload EDW data to Hadoop frequently to keep the data fresh. On the other hand, HDFS tables are usually pretty big, so it is not always feasible to load them into the database. Such bulk reading of HDFS data into the database introduces an unnecessary burden on the carefully managed database resources.

EDWs Directly Reading HDFS Data

To avoid data replication, the second wave of hybrid warehouse solutions keeps data where it is and allows EDWs to directly read HDFS data through external tables or table functions. For example, HP Vertica Connector for HDFS (Vertica 2016), besides the bulk-loading approach described in the previous subsection, also supports exposing the HDFS data as external tables to query from. Özcan et al. (2011) proposed a table function called HDFSRead to load HDFS data at runtime to IBM Db2 and a new DB2InputFormat to ingest data from Db2 to Hadoop. Oracle Direct Connector for HDFS (Oracle 2012) directly accesses data on HDFS as external tables from Oracle Database.

In this second iteration of hybrid solutions, the EDWs treat HDFS as an extended storage and pull data when needed from HDFS. However, the query processing is performed only inside the database. This is probably due to the fact that in the early days, people mostly viewed Hadoop as a good option for cheap storage, and the processing support on Hadoop was still limited (MapReduce was not easy to program against). But as the first SQL-on-Hadoop system, Hive, as well as other systems exposing high-level language abstractions on top of MapReduce, such as Pig (Olston et al. 2008) and Jaql (Beyer et al. 2011), started to emerge, data processing on Hadoop became easier and more powerful.

Query Pushdown to Hadoop

Although the second wave of hybrid solutions does not require data replication, it still loads an entire data set at runtime into the database for processing, even though only a small fraction of the data may be needed (e.g., through filter and projection). The processing burden on databases is still very high. Riding on the SQL-on-Hadoop boom, the third wave of hybrid warehouse solutions starts to offload some data processing to the Hadoop side, by pushing mostly filters and projections down to Hadoop. Oracle Big Data SQL (McClary 2014) allows Oracle Exadata to query remote data on Hadoop via external tables. But, different from the external table approach in Oracle Direct Connector for HDFS, Oracle Big Data SQL is able to push down filters and projections to the Hadoop side by employing smart scans on Hadoop data. Teradata QueryGrid (Teradata 2016) provides a unified data architecture to query data in Teradata and Aster databases as well as other data sources. In particular, it enables a Teradata or Aster database to call up Hive to query data on Hadoop and bring the results back to Teradata or Aster to join with other tables. Microsoft Polybase (DeWitt et al. 2013) also exposes HDFS data as external tables in SQL Server PDW and pushes filters and projections down to the Hadoop side. In addition, it also considers pushing down joins if both tables are stored in HDFS.
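The data-movement savings from pushdown can be illustrated with a short, self-contained Python simulation. This is a toy sketch only: the table layout, predicate, and helper names are made up for illustration and do not correspond to any vendor's connector API.

```python
# Toy simulation of filter/projection pushdown (not a real connector API):
# compare what is shipped to the EDW with and without pushdown.

def ship_without_pushdown(table):
    # The database pulls every row with all columns, then filters locally.
    return list(table)

def ship_with_pushdown(table, predicate, columns):
    # The filter and projection run on the Hadoop side, so only qualifying
    # rows, restricted to the referenced columns, cross the network.
    return [{c: row[c] for c in columns} for row in table if predicate(row)]

# Hypothetical table: 1000 rows, 10% in region "EU", plus a wide payload column.
table = [{"id": i, "region": "EU" if i % 10 == 0 else "US", "payload": "x" * 100}
         for i in range(1000)]
pulled = ship_without_pushdown(table)
pushed = ship_with_pushdown(table, lambda r: r["region"] == "EU", columns=["id"])
print(len(pulled), len(pushed))  # 1000 full rows versus 100 narrow rows
```

Even in this toy setting the pushed-down scan ships a tenth of the rows and drops the wide payload column entirely, which is exactly the effect the smart-scan style systems above exploit at scale.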
In all these systems, however, a join between an HDFS table and an EDW table is always executed inside the EDW. This is probably due to the assumption that SQL-on-Hadoop systems do not perform joins efficiently. Although this was true for the early SQL-on-Hadoop solutions, such as Hive (Thusoo et al. 2009) before version 0.13, it is not clear whether the same still holds for current-generation solutions such as IBM Big SQL (Gray et al. 2015), Impala (Kornacker et al. 2015), and Presto (Traverso 2013). There has been a significant shift during the last few years in the SQL-on-Hadoop solution space, in which many new systems have moved away from MapReduce to shared-nothing parallel database architectures. They run SQL queries using their own long-running daemons executing on every HDFS DataNode. Instead of materializing intermediate results, these systems pipeline them between computation stages. Many new SQL-on-Hadoop systems also employ columnar store technologies, vector-wise processing, and even runtime code generation for better query performance.

Moreover, HDFS tables are usually much bigger than database tables, so it is not always feasible to ingest HDFS data and perform joins in the database. Another important observation is that enterprises are investing more in big data systems such as Hadoop and less in expensive EDW systems. As a result, there is more capacity on the Hadoop side to leverage.

Collaborative Query Processing

Finally, with the advancement in the SQL-on-Hadoop land, it is possible for EDWs and Hadoop to collaboratively process a query, even including executing a join together.

Tian et al. (2015) studied a number of distributed join algorithms for joining an HDFS table with an EDW table in the context of a hybrid warehouse. Besides algorithms that execute joins inside the EDW, this work particularly explores algorithms that conduct the join on the Hadoop side, including a novel join algorithm called zigzag join, which heavily exploits the use of Bloom filters (Bloom 1970) on both sides to minimize the data movement in join processing. Through extensive experimental study, Tian et al. (2015) concluded that with a sophisticated SQL execution engine on the Hadoop side, which most of the new-generation SQL-on-Hadoop systems qualify as, it is better to move the smaller table to the side of the bigger table (after filter and projection are applied on the tables), whether it is in HDFS or in the EDW. Moreover, Bloom filters are very effective in reducing data movement in a hybrid warehouse setting, and they can even be applied on both sides in one join algorithm. In a follow-up work, Tian et al. (2016) proposed a cost model of various join algorithms for a future optimizer in the hybrid warehouse environment to choose the right algorithms.

Conclusion and Future Directions

Hadoop has become an important data repository in the enterprise as the center for all business analytics, from SQL queries to machine learning to reporting. At the same time, EDWs continue to support critical business analytics, so EDWs will coexist with Hadoop. That is why there is a strong need for hybrid warehouse solutions. The hybrid warehouse is different from traditional data federation. It requires exploiting massive parallelism on both sides for efficient data movement, and better collaborative query processing to minimize the data that needs to be moved, especially given the powerful SQL processing capabilities on Hadoop in recent years. With more and more enterprise investment on the Hadoop side, it is wise to better utilize the resources of the Hadoop side to do more of the processing.

In the future, EDWs and Hadoop will continue to coexist and evolve on their own, and the need for hybrid warehouse solutions will continue to grow. We need an even tighter integration between the two systems, in which an automated optimizer can decide where to store data and where the data processing is carried out, without the users even knowing (Portnoy 2013).
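Returning to the Bloom-filter exchange that the zigzag join relies on, the core mechanism can be sketched in a few lines of Python. This is an illustrative toy, not the actual zigzag join implementation: the `BloomFilter` class, the `rows_to_ship` helper, and the row layout are all hypothetical, and a real system would exchange such filters in both directions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array (illustrative)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _probes(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # No false negatives; occasional false positives are re-checked
        # later by the actual join.
        return all((self.bits >> p) & 1 for p in self._probes(key))

def rows_to_ship(hdfs_rows, edw_join_keys):
    # One direction of the idea: the EDW builds a Bloom filter over its join
    # keys and sends it to Hadoop, which ships back only rows that might match.
    bf = BloomFilter()
    for key in edw_join_keys:
        bf.add(key)
    return [row for row in hdfs_rows if bf.might_contain(row["key"])]
```

The filter itself is at most m bits (here 1024), so it is cheap to transmit, while every row it eliminates never crosses the network; applying the same exchange in both directions is what gives the zigzag join its savings.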
Unlike other popular MPP analytic DBMSs, several of which are forks of Postgres, Impala was written from scratch in C++ and Java. It maintains Hadoop's flexibility with regard to the choice of processing frameworks by utilizing standard components (HDFS http://hadoop.apache.org/, Kudu https://kudu.apache.org/, Hive Metastore https://hive.apache.org/, Sentry https://sentry.apache.org/, etc.) and is able to read the majority of the file formats that are popular with Hadoop users (e.g., text, Parquet https://parquet.apache.org/, Avro https://avro.apache.org/, etc.). Impala implements principles pioneered for traditional parallel DBMSs, such as data-partitioned parallelism and pipelining, and runs on a collection of daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure.

User View of Impala

Impala is a query engine which is integrated into the Hadoop environment and relies on a number of standard Hadoop infrastructure components to deliver an RDBMS-like experience. The standardized file formats and component APIs that are part of Hadoop allow Impala to offer very broad SQL analytic functionality against a wide variety of data formats as well as storage options. The latter include on-premises storage as well as cloud object storage such as AWS S3 and Azure Data Lake Store.

However, due to the looser coupling between the query engine and the storage managers, there are some notable differences in the areas of supported SQL update functionality and transactional semantics. In both cases, Impala surfaces the functionality made available by the storage managers themselves but does not try to augment it to bring it up to a uniform functional standard.

Another aspect in which Impala differs from a monolithic analytic DBMS is the approach to physical schema design. Physical schema design allows the user to optimize the physical layout of the table data to match the requirements of the analytic workload and is therefore an important aspect of performance optimization. Since Impala does not manage physical data itself, the functionality related to physical schema design that is surfaced in Impala also depends on what is supported by the storage managers.

Impala was targeted at standard business intelligence environments and to that end supports most relevant industry standards: clients connect via ODBC or JDBC, authentication is handled via Kerberos or LDAP, and authorization functionality (supplied via Sentry) follows the standard SQL roles and privileges.

In order to query data that resides in one of the supported Hadoop storage managers, the user creates tables via the familiar CREATE TABLE statement. This provides the logical schema of the data as well as the storage manager-specific parameters needed for Impala to access the underlying physical data. Subsequently the user can reference that table in a SELECT statement, with the familiar relational semantics and operations (joins, grouping, etc.). In particular, a single SELECT statement can reference multiple tables, regardless of their underlying storage managers, making it possible to join on-premises HDFS data with S3-resident data, etc.

Complex Types Support

Data ingested into Hadoop often arrives in a non-relational form because the source system is not relational or because the data is hierarchical and more naturally produced in a de-normalized fashion. Preserving the original structure of the data eliminates the need to flatten and remodel the data before querying it and alleviates the cognitive burden for users who are used to the existing structure. To this end, Impala supports querying tables with the complex types array, map, and struct, which are familiar from Hive and supported in the shared Hive metastore. At present, this functionality is only available for the Parquet file format but in principle can be supported for any nested relational file format, including JSON, Avro, etc. Parquet stores nested data using a
Impala 995

Impala, Fig. 1 Overview of Impala's architecture and Hadoop ecosystem components. Executors can also read from cloud blob stores like S3 and ADLS (not shown). (The original figure shows client interfaces submitting DDL/DML to a set of fully MPP, distributed coordinators, whose metadata cache is collected/updated from the Hive Metastore, Sentry, and the HDFS NameNode.)
ecution plan, and then coordinates the execution across the participating Impala executors in the cluster
• A query executor, which accepts execution plan fragments from a coordinator and executes them

There is no central "master" process that handles query coordination for all client requests, and the actual roles assumed by each deployed impalad process are configurable so that it is possible to independently scale up the number of coordinators and executors according to the throughput requirements of the workload.

In a typical on-premise deployment of Impala, one impalad process runs on every node that also runs a DataNode process (for HDFS) or tablet server process (for Kudu). This allows Impala to take advantage of data locality and do a maximal amount of local preprocessing before sending data across the network.

One design goal for Impala is to avoid synchronous RPCs to shared cluster components during query execution, which would severely limit throughput scalability for concurrent workloads. For that reason, each impalad coordinator caches all of the metadata relevant for query planning.

Query Planning

Impalads with the coordinator role are responsible for compiling SQL text into query plans executable by the Impala executors. Impala's query compiler is written in Java and consists of a fully featured SQL parser and cost-based query optimizer. In addition to the basic SQL features (select, project, join, group by, order by, limit), Impala supports in-line views, uncorrelated and correlated subqueries (transformed into joins), and all variants of outer joins as well as explicit left/right semi- and anti-joins, analytic window functions, and extensions for nested data.

The query compilation process follows a traditional division into stages: query parsing, semantic analysis, and query planning/optimization. The latter stage constructs an executable plan in two phases: (1) single-node planning and (2) plan parallelization and fragmentation.

In the first phase, the parse tree is translated into a single-node plan tree consisting of the following operators: scan (HDFS, Kudu, HBase), hash join, cross join, union, hash aggregation, sort, top-n, analytic evaluation, and subplan (for nested data evaluation). This phase is responsible for expression rewrites (e.g., constant folding), converting subqueries into joins, assigning predicates at the lowest possible operator in the tree, inferring predicates based on equivalence classes, pruning table partitions, setting limits/offsets, and applying column projections, as well as performing cost-based optimizations such as join ordering and coalescing analytic window functions to minimize total execution cost. Cost estimation is based on table and partition cardinalities plus distinct value counts for each column. Impala uses a quadratic algorithm to determine join order, instead of the more common dynamic programming approach, due to the absence of "interesting orders" (all operators are hash based).

The second planning phase takes the single-node plan as input and produces a distributed execution plan, consisting of a tree of plan fragments. A plan fragment is the basic unit of work sent to executors. It consumes data from any number of child fragments and sends data to at most one parent fragment. The general goal is to minimize data movement and maximize scan locality (in HDFS, local reads are considerably cheaper than remote ones). This is done by adding exchange operators to the plan where necessary and by adding extra operators to minimize data movement across the network (e.g., local pre-aggregation). This second optimization phase decides the join strategy (broadcast or partitioning) for every join operator (the join order is fixed at that point). A broadcast join replicates the entire build side of a join to all machines executing the hash join; a partitioning join repartitions both join inputs on the join expressions. Impala takes into account the existing data partitions of the join inputs and chooses the strategy that is estimated to minimize the amount of data transferred over the network.
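The essence of the broadcast-versus-partitioning decision is a comparison of estimated network costs. The following Python sketch is a deliberately simplified stand-in for Impala's real cost model (which, as noted above, also accounts for existing data partitions); the byte counts and node count are made-up inputs.

```python
def choose_join_strategy(build_bytes, probe_bytes, num_nodes):
    # Broadcast join: the whole build side is replicated to every node.
    broadcast_cost = build_bytes * num_nodes
    # Partitioning join: both inputs are re-shuffled across the network once.
    partition_cost = build_bytes + probe_bytes
    return "broadcast" if broadcast_cost <= partition_cost else "partitioning"

# A small dimension table joined to a big fact table favors broadcasting:
print(choose_join_strategy(build_bytes=10 * 2**20, probe_bytes=100 * 2**30,
                           num_nodes=50))  # -> broadcast
# Two large inputs favor repartitioning:
print(choose_join_strategy(build_bytes=40 * 2**30, probe_bytes=100 * 2**30,
                           num_nodes=50))  # -> partitioning
```

Replicating a 10 MB build side 50 times costs far less than shuffling a 100 GB probe side, which is why small dimension tables are almost always broadcast in practice.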
Recent advancements in self-adjusting computation (Acar 2005; Acar et al. 2010; Ley-Wild et al. 2009) overcome the limitations of dynamic algorithms. Self-adjusting computation transparently benefits existing applications, without requiring the design and implementation of dynamic algorithms. At a high level, self-adjusting computation enables incremental updates by creating a dynamic dependence graph of the underlying computation, which records control and data dependencies between the sub-computations. Given a set of input changes, self-adjusting computation performs change propagation, where it reuses the memoized intermediate results for all sub-computations that are unaffected by the input changes, and recomputes only those parts of the computation that are transitively affected by the input change. As a result, self-adjusting computation computes only on a subset (delta) of the computation instead of recomputing everything from scratch.

Approximate computing. Approximate computation is based on the observation that many data analytics jobs are amenable to an approximate, rather than the exact, output (Krishnan et al. 2016; Quoc et al. 2017b,d). For such an approximate workflow, it is possible to trade accuracy by computing over a partial subset instead of the entire input data to achieve low latency and efficient utilization of resources. Over the last two decades, researchers in the database community proposed many techniques for approximate computing including sampling (Al-Kateb and Lee 2010), sketches (Cormode et al. 2012), and online aggregation (Hellerstein et al. 1997). These techniques make different trade-offs with respect to the output quality, supported query interface, and workload. However, the early work in approximate computing was mainly targeted toward the centralized database architecture, and it was unclear whether these techniques could be extended in the context of big data analytics. Recently, sampling-based approaches have been successfully adopted for distributed data analytics (Agarwal et al. 2013; Goiri et al. 2015). These systems show that it is possible to have a trade-off between the output accuracy and performance gains (also efficient resource utilization) by employing sampling-based approaches for computing over a subset of data items. However, these "big data" systems target batch processing workflows and cannot provide the required low-latency guarantees for stream analytics.

Combining incremental and approximate computing. This entry makes the observation that the two computing paradigms, incremental and approximate computing, are complementary. Both computing paradigms rely on computing over a subset of data items instead of the entire dataset to achieve low latency and efficient cluster utilization. Therefore, this entry proposes to combine these paradigms together in order to leverage the benefits of both. Furthermore, incremental updates can be achieved without requiring the design and implementation of application-specific dynamic algorithms, while also supporting approximate computing for stream analytics.

The high-level idea is to design a sampling algorithm that biases the sample selection to the memoized data items from previous runs. This idea is realized by designing an online sampling algorithm that selects a representative subset of data items from the input data stream. Thereafter, the sample is biased to include data items for which there are already memoized results from the previous runs, while preserving the proportional allocation of data items of different (strata) distributions. Next, the user-specified streaming query is performed on this biased sample by making use of self-adjusting computation, and the user is provided with an incrementally updated approximate output with error bounds.

INCAPPROX

System Overview

INCAPPROX is designed for real-time data analytics on online data streams. At the high level, the online data stream consists of data items from diverse sources of events or sub-streams. INCAPPROX uses a stream aggregator
1002 Incremental Approximate Computing
(e.g., Apache Kafka (Kafka – A high-throughput distributed messaging system)) that integrates data from these sub-streams, and thereafter, the system reads this integrated data stream as the input.

INCAPPROX facilitates user querying on this data stream and supports a user interface that consists of a streaming query and a query budget. The user submits the streaming query to the system as well as specifies a query budget. The query budget can either be in the form of latency guarantees/SLAs for data processing, desired result accuracy, or computing resources available for query processing. INCAPPROX makes sure that the computation done over the data remains within the specified budget. To achieve this, the system makes use of a mixture of incremental and approximate computing for real-time processing over the input data stream and emits the query result along with the confidence interval or error bounds.

System Model

In this section, the system model assumed in this work is presented.

Programming model. INCAPPROX supports a batched stream processing programming model. In batched stream processing, the online data stream is divided into small batches or sets of records, and for each batch, a distributed data-parallel job is launched to produce the output.

Computation model. The computation model of INCAPPROX for stream processing is sliding window computations. In this model the computation window slides over the input data stream, where newly arriving input data items are added to the window, and the old data items are dropped from the window as they become less relevant to the analysis.

In sliding window computations, there is a substantial overlap of data items between two successive computation windows, especially when the size of the window is large relative to the slide interval. This overlap of unchanged data items provides an opportunity to update the output incrementally.

Assumptions. INCAPPROX makes the following assumptions. Possible methods to enforce them are discussed in section "Discussion" and in Krishnan et al. (2016).

1. This work assumes that the input stream is stratified based on the source of event, i.e., the data items within each stratum follow the same distribution and are mutually independent. Here a stratum refers to one sub-stream. If multiple sub-streams have the same distribution, they are combined to form a stratum.
2. This work assumes the existence of a virtual function that takes the user-specified budget as the input and outputs the sample size for each window based on the budget.

Building Blocks

The techniques and the motivation behind INCAPPROX's design choices are briefly described.

Stratified sampling. In a streaming environment, since the window size might be very large, for a realistic rate of execution, INCAPPROX performs approximation using samples taken within the window. But the data stream might consist of data from disparate events. As such, INCAPPROX needs to make sure that every sub-stream is considered fairly to have a representative sample from each sub-stream. For this the system uses stratified sampling (Al-Kateb and Lee 2010). Stratified sampling ensures that data from every stratum is selected and none of the minorities are excluded. For statistical precision, INCAPPROX uses proportional allocation of each sub-stream to the sample (Al-Kateb and Lee 2010). It ensures that the sample size of each sub-stream is in proportion to the size of the sub-stream in the whole window.

Self-adjusting computation. For incremental sliding window computations, INCAPPROX uses self-adjusting computation (Acar 2005; Acar et al. 2010) to reuse the intermediate results of sub-computations across successive runs of jobs. In this technique the system maintains a dependence graph between sub-computations
of a job and reuses memoized results for sub-computations that are unaffected by the changed input in the computation window.

Algorithm

This section presents an overview of the INCAPPROX algorithm. The algorithm computes a user-specified streaming query as a sliding window computation over the input data stream. The user also specifies a query budget for executing the query, which is used to derive the sample size (sampleSize) for the window using a cost function (see sections "Assumptions" and "Discussion"). The cost function ensures that processing remains within the query budget.

For each window, INCAPPROX first adjusts the computation window to the current start time t by removing all old data items from the window (timestamp < t). Similarly, the system also drops all old data items from the list of memoized items (memo) and the respective memoized results of all sub-computations that are dependent on those old data items.

Next, the system reads the new incoming data items in the window. Thereafter, it performs proportional stratified sampling (detailed in Krishnan et al. 2016) on the window to select a sample of the size provided by the cost function. The stratified sampling algorithm ensures that samples from all strata are proportional and no stratum is neglected.

Next, INCAPPROX biases the stratified sample to include items from the memoized sample, in order to enable the reuse of memoized results from previous sub-computations. The biased sampling algorithm biases samples specific to each stratum, to ensure reuse and, at the same time, retain proportional allocation (Krishnan et al. 2016).

Thereafter, on this biased sample, the system runs the user-specified query as a data-parallel job incrementally, i.e., it reuses the memoized results for all data items that are unchanged in the window, and updates the output based on the changed (or new) data items. After the job finishes, the system memoizes all the items in the sample and their respective sub-computation results for reuse in the subsequent windows (Krishnan et al. 2016).

The job provides an estimated output which is bound to a range of error due to approximation. INCAPPROX performs error estimation to estimate this error bound and define a confidence interval for the result as output ± error bound (Krishnan et al. 2016).

The entire process repeats for the next window, with updated windowing parameters and sample size. (Note that the query budget can be updated across windows during the course of stream processing to adapt to the user's requirements.)

Estimation of Error Bounds

In order to provide a confidence interval for the approximate output, the error bounds due to approximation are computed.

Approximation for aggregate functions. Aggregate functions require results based on all the data items or groups of data items in the population. But since the system computes only over a small sample from the population, an estimated result based on the weightage of the sample can be obtained.

Consider an input stream S, within a window, consisting of n disjoint strata S_1, S_2, ..., S_n whose union is S. Suppose the ith stratum S_i has B_i items and each item j has an associated value v_ij. Consider the example of taking the sum of these values across the whole window, represented as Σ_{i=1..n} Σ_{j=1..B_i} v_ij. To find an approximate sum, INCAPPROX first selects a sample from the window based on stratified and biased sampling, i.e., from each ith stratum S_i in the window, the system samples b_i items. Then it estimates the sum from this sample as τ̂ = Σ_{i=1..n} ((B_i / b_i) Σ_{j=1..b_i} v_ij) ± ε, where the error bound ε is defined as:

ε = t_{f, 1−α/2} √(V̂ar(τ̂))    (1)

Here, t_{f, 1−α/2} is the value of the t-distribution (i.e., t-score) with f degrees of freedom and a confidence level of 1 − α. The degree of freedom f is expressed as:
f = Σ_{i=1..n} b_i − n    (2)

The estimated variance for the sum, V̂ar(τ̂), is represented as:

V̂ar(τ̂) = Σ_{i=1..n} B_i (B_i − b_i) s_i² / b_i    (3)

where s_i² is the sample variance of the ith stratum's sample.

Discussion

The design of INCAPPROX is based on the assumptions in section "Assumptions". Enforcing these assumptions is beyond the scope of this entry; however, this section discusses some of the approaches that could be used to meet the assumptions.
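Equations (1), (2), and (3) can be exercised with a short Python sketch. Two caveats: the proportional-allocation helper is an illustrative companion to the stratified sampling described earlier, and the exact t-score would need a t-distribution table (or SciPy), so the sketch substitutes the standard-normal quantile, which closely approximates t_{f, 1−α/2} for the large degrees of freedom typical of big windows.

```python
from statistics import NormalDist, variance

def proportional_allocation(stratum_sizes, sample_size):
    # Proportional allocation: each stratum's share of the sample budget
    # matches its share of the window (largest-remainder rounding).
    window = sum(stratum_sizes)
    exact = [sample_size * b / window for b in stratum_sizes]
    alloc = [int(x) for x in exact]
    leftovers = sorted(range(len(exact)), key=lambda i: alloc[i] - exact[i])
    for i in leftovers[: sample_size - sum(alloc)]:
        alloc[i] += 1
    return alloc

def approx_sum_with_error(stratum_sizes, samples, alpha=0.05):
    # stratum_sizes: B_i per stratum; samples: the b_i sampled values v_ij.
    # Estimated sum: tau-hat = sum_i (B_i / b_i) * sum_j v_ij
    est = sum(B / len(s) * sum(s) for B, s in zip(stratum_sizes, samples))
    # Eq. (3): Var-hat = sum_i B_i * (B_i - b_i) * s_i^2 / b_i
    var_est = sum(B * (B - len(s)) * variance(s) / len(s)
                  for B, s in zip(stratum_sizes, samples))
    # Eq. (2): f = sum_i b_i - n (unused once the t-score is replaced by the
    # normal quantile below, but shown for completeness).
    f = sum(len(s) for s in samples) - len(stratum_sizes)
    score = NormalDist().inv_cdf(1 - alpha / 2)  # ~ t_{f,1-alpha/2} for large f
    return est, score * var_est ** 0.5           # Eq. (1): error = score * sqrt(Var-hat)
```

For instance, a single stratum of B_1 = 100 items sampled as [10, 10, 10, 10] yields the estimate 100/4 * 40 = 1000 with a zero error bound, since the sample variance vanishes.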
sample selection to the memoized data items from previous runs.

References

Acar UA (2005) Self-adjusting computation. PhD thesis, Carnegie Mellon University
Acar UA, Cotter A, Hudson B, Türkoğlu D (2010) Dynamic well-spaced point sets. In: Proceedings of the 26th annual symposium on computational geometry (SoCG)
Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In: Proceedings of the ACM European conference on computer systems (EuroSys)
Al-Kateb M, Lee BS (2010) Stratified reservoir sampling over heterogeneous data streams. In: Proceedings of the 22nd international conference on scientific and statistical database management (SSDBM)
Angel S, Ballani H, Karagiannis T, O'Shea G, Thereska E (2014) End-to-end performance isolation through virtual datacenters. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)
Bhatotia P (2015) Incremental parallel and distributed systems. PhD thesis, Max Planck Institute for Software Systems (MPI-SWS)
Bhatotia P, Wieder A, Akkus IE, Rodrigues R, Acar UA (2011a) Large-scale incremental data processing with change propagation. In: Proceedings of the conference on hot topics in cloud computing (HotCloud)
Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquini R (2011b) Incoop: MapReduce for incremental computations. In: Proceedings of the ACM symposium on cloud computing (SoCC)
Bhatotia P, Dischinger M, Rodrigues R, Acar UA (2012a) Slider: incremental sliding-window computations for large-scale data analysis. Technical Report MPI-SWS-2012-004
Bhatotia P, Rodrigues R, Verma A (2012b) Shredder: GPU-accelerated incremental storage and computation. In: Proceedings of the USENIX conference on file and storage technologies (FAST)
Bhatotia P, Acar UA, Junqueira FP, Rodrigues R (2014) Slider: incremental sliding window analytics. In: Proceedings of the 15th international middleware conference (Middleware)
Bhatotia P, Fonseca P, Acar UA, Brandenburg B, Rodrigues R (2015) iThreads: a threading library for parallel incremental computation. In: Proceedings of the 20th international conference on architectural support for programming languages and operating systems (ASPLOS)
Brodal GS, Jacob R (2002) Dynamic planar convex hull. In: Proceedings of the 43rd annual IEEE symposium on foundations of computer science (FOCS)
Chiang YJ, Tamassia R (1992) Dynamic algorithms in computational geometry. In: Proceedings of the IEEE
Coles S (2001) An introduction to statistical modeling of extreme values. Springer, London/New York
Cormode G, Garofalakis M, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Found Trends Databases 4(1–3):1–294
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)
Dziuda DM (2010) Data mining for genomics and proteomics: analysis of gene and protein expression data. Wiley, Hoboken
Efron B, Tibshirani R (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci 1(1):54–75
Ganapathi AS (2009) Predicting and optimizing system utilization and performance via statistical machine learning. Technical Report No. UCB/EECS-2009-181
Goiri I, Bianchini R, Nagarakatte S, Nguyen TD (2015) ApproxHadoop: bringing approximations to MapReduce frameworks. In: Proceedings of the twentieth international conference on architectural support for programming languages and operating systems (ASPLOS)
Gunda PK, Ravindranath L, Thekkath CA, Yu Y, Zhuang L (2010) Nectar: automatic management of data and computation in datacenters. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)
He B, Yang M, Guo Z, Chen R, Su B, Lin W, Zhou L (2010) Comet: batched stream processing for data intensive distributed computing. In: Proceedings of the ACM symposium on cloud computing (SoCC)
Hellerstein JM, Haas PJ, Wang HJ (1997) Online aggregation. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD)
Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the ACM European conference on computer systems (EuroSys)
Kafka – A high-throughput distributed messaging system. http://kafka.apache.org. Accessed Nov 2017
Krishnan DR, Quoc DL, Bhatotia P, Fetzer C, Rodrigues R (2016) IncApprox: a data analytics system for incremental approximate computing. In: Proceedings of the 25th international conference on world wide web (WWW)
Ley-Wild R, Acar UA, Fluet M (2009) A cost semantics for self-adjusting computation. In: Proceedings of the annual ACM SIGPLAN-SIGACT symposium on principles of programming languages (POPL)
Liu S, Meeker WQ (2014) Statistical methods for estimating the minimum thickness along a pipeline. Technometrics 57(2):164–179
Logothetis D, Olston C, Reed B, Webb K, Yocum K (2010) Stateful bulk processing for incremental analytics. In: Proceedings of the ACM symposium on cloud computing (SoCC)
Lohr S (2009) Sampling: design and analysis, 2nd edn. Cengage Learning, Boston
Masud MM, Woolam C, Gao J, Khan L, Han J, Hamlen KW, Oza NC (2012) Facing the reality of data stream classification: coping with scarcity of labeled data.
Wieder A, Bhatotia P, Post A, Rodrigues R (2010a) Brief announcement: modelling MapReduce for optimal execution in the cloud. In: Proceedings of the 29th ACM SIGACT-SIGOPS symposium on principles of distributed computing (PODC)
Wieder A, Bhatotia P, Post A, Rodrigues R (2010b) Conductor: orchestrating the clouds. In: Proceedings of the
Knowl Inf Syst 33(1):213–244 4th international workshop on large scale distributed
Murray DG, McSherry F, Isaacs R, Isard M, Barham P, systems and middleware (LADIS)
Abadi M (2013) Naiad: a timely dataflow system. In: Wieder A, Bhatotia P, Post A, Rodrigues R (2012) Orches-
Proceedings of the twenty-fourth ACM symposium on trating the deployment of computations in the cloud
operating systems principles (SOSP) with conductor. In: Proceedings of the 9th USENIX
Olston C et al (2011) Nova: continuous pig/hadoop work- symposium on networked systems design and imple-
flows. In: Proceedings of the ACM SIGMOD interna- mentation (NSDI)
tional conference on management of data (SIGMOD) Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I
Peng D, Dabek F (2010) Large-scale incremental process- (2013) Discretized streams: fault-tolerant streaming
ing using distributed transactions and notifications. In: computation at scale. In: Proceedings of the twenty-
Proceedings of the USENIX conference on operating fourth ACM symposium on operating systems princi-
systems design and implementation (OSDI) ples (SOSP)
Popa L, Budiu M, Yu Y, Isard M (2009) DryadInc: reusing I
work in large-scale computations. In: Proceedings of
the conference on hot topics in cloud computing (Hot-
Cloud) Incremental Learning
Quoc DL, Martin A, Fetzer C (2013) Scalable and real-
time deep packet inspection. In: Proceedings of the
2013 IEEE/ACM 6th international conference on util- Online Machine Learning Algorithms over
ity and cloud computing (UCC) Data Streams
Quoc DL, Yazdanov L, Fetzer C (2014) Dolen: user-side Online Machine Learning in Big Data Streams:
multi-cloud application monitoring. In: International Overview
conference on future internet of things and cloud (FI-
CLOUD) Recommender Systems over Data Streams
Quoc DL, D’Alessandro V, Park B, Romano L, Fetzer C Reinforcement Learning, Unsupervised Meth-
(2015a) Scalable network traffic classification using ods, and Concept Drift in Stream Learning
distributed support vector machines. In: Proceedings of
the 2015 IEEE 8th international conference on cloud
computing (CLOUD)
Quoc DL, Fetzer C, Felber P, Rivière É, Schiavoni V, Incremental Sliding Window
Sutra P (2015b) Unicrawl: a practical geographically
distributed web crawler. In: Proceedings of the 2015
Analytics
IEEE 8th international conference on cloud computing
(CLOUD) Pramod Bhatotia1 , Umut A. Acar2 , Flavio P.
Quoc DL, Beck M, Bhatotia P, Chen R, Fetzer C, Strufe T Junqueira3 , and Rodrigo Rodrigues4
(2017a) Privacy preserving stream analytics: the mar- 1
The University of Edinburgh and Alan Turing
riage of randomized response and approximate com-
puting. https://arxiv.org/abs/1701.05403 Institute, Edinburgh, UK
2
Quoc DL, Beck M, Bhatotia P, Chen R, Fetzer C, Strufe T CMU, Pittsburgh, PA, USA
(2017b) PrivApprox: privacy-preserving stream analyt- 3
Dell EMC, Hopkinton, MA, USA
ics. In: Proceedings of the 2017 USENIX conference 4
INESC-ID, IST (University of Lisbon), Lisbon,
on USENIX annual technical conference (USENIX
ATC) Portugal
Quoc DL, Chen R, Bhatotia P, Fetzer C, Hilt V, Strufe T
(2017c) Approximate stream analytics in Apache Flink
and Apache Spark streaming. CoRR, abs/1709.02946
Quoc DL, Chen R, Bhatotia P, Fetzer C, Hilt V, Strufe T Introduction
(2017d) StreamApprox: approximate computing for
stream analytics. In: Proceedings of the international There is a growing need to analyze large feeds
middleware conference (Middleware)
The Apache Commons Mathematics Library. http://
of data that are continuously collected. Either
commons.apache.org/proper/commons-math. due to the nature of the analysis or in order to
Accessed Nov 2017 bound the computational complexity of analyzing
a monotonically growing data set, this processing often resorts to a sliding window analysis. In this type of processing, the scope of the analysis is limited to a recent interval over the entire collected data, and, periodically, newly produced inputs are appended to the window and older inputs are discarded from it as they become less relevant to the analysis.
The basic approach to sliding window data processing is to recompute the analysis over the entire window whenever the window slides. Consequently, even old, unchanged data items that remain in the window are reprocessed, thus consuming unnecessary computational resources and limiting the timeliness of results.
We can improve on this using an incremental approach, which normally relies on the programmers of the data analysis to devise an incremental update mechanism (He et al. 2010; Logothetis et al. 2010; Peng and Dabek 2010), i.e., an incremental algorithm (or dynamic algorithm) containing the logic for incrementally updating the output as the input changes. Research in the algorithms and programming languages communities shows that while such incremental algorithms can be very efficient, they can be difficult to design, analyze, and implement even for otherwise simple problems (Chiang and Tamassia 1992; Ramalingam and Reps 1993; Demetrescu et al. 2004; Acar 2005). Moreover, such incremental algorithms often assume a uniprocessor computing model and, due to their natural complexity, do not lend themselves well to parallelization, making them ill suited for parallel and distributed data analysis frameworks (Dean and Ghemawat 2004; Isard et al. 2007).
Given the efficiency benefits of incremental computation, an interesting question is whether it would be possible to achieve these benefits without requiring the design and implementation of incremental algorithms on an ad hoc basis. Previous work on systems like Incoop (Bhatotia et al. 2011a,b), Nectar (Gunda et al. 2010), HaLoop (Bu et al. 2010), DryadInc (Popa et al. 2009), or CIEL (Murray et al. 2011) shows that such gains are possible to obtain in a transparent way, i.e., without changing the original (single-pass) data analysis code. However, these systems resort to the memoization of sub-computations from previous runs, which still requires time proportional to the size of the whole data rather than the change itself. Furthermore, these systems are meant to support arbitrary changes to the input and as such do not take advantage of the predictability of changes in sliding window computations to improve the timeliness of the results.
In this chapter, we propose SLIDER, a system for incremental sliding window computations where the work performed by incremental updates is proportional to the size of the changes in the window (the "delta") rather than the whole data. In SLIDER, the programmer expresses the computation corresponding to the analysis using either MapReduce (Dean and Ghemawat 2004) or another dataflow language that can be translated to the MapReduce paradigm (e.g., Hive (Apache Software Foundation 2017) or Pig (Olston et al. 2008)). This computation is expressed by assuming a static, unchanging input window. The system then guarantees the automatic and efficient update of the output as the window slides. The system makes no restrictions on how the window slides, allowing it to shrink on one end and to grow on the other end arbitrarily (though, as we show, more restricted changes lead to simpler algorithms and more efficient updates). SLIDER thus offers the benefits of incremental computation in a fully transparent way.
Our approach to automatic incrementalization is based on the principles of self-adjusting computation (Acar 2005; Acar et al. 2009), a technique from the programming-languages community that we apply to sliding window computations. In self-adjusting computation, a dynamic dependency graph records the control and data dependencies of the computation, so that a change-propagation algorithm can update the graph as well as the data whenever the data changes. One key contribution of this chapter is a set of novel data structures to represent the dependency graph of sliding window computations that allow for performing work proportional primarily to the size of the slide, incurring only a logarithmic, rather than linear, dependency on the size of the window.
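The flavor of such logarithmic-cost updates can be illustrated with a toy, single-machine sketch (this is an illustration only, not SLIDER's actual data structures): a balanced aggregation tree over a fixed-size window, where sliding the window replaces the oldest item and fixes up only the O(log n) tree nodes on one root-to-leaf path, instead of re-aggregating all n items.

```python
# Toy sketch of delta-proportional window updates (illustrative, not SLIDER's
# implementation). A balanced tree over the window; sliding updates one
# root-to-leaf path. The combine function is assumed to be associative and
# commutative (e.g., sum), since the window is stored as a ring buffer.

class WindowTree:
    def __init__(self, items, combine):
        self.n = len(items)
        self.combine = combine
        self.tree = [None] * (2 * self.n)
        self.tree[self.n:] = list(items)            # leaves hold window items
        for i in range(self.n - 1, 0, -1):          # build internal nodes bottom-up
            self.tree[i] = combine(self.tree[2 * i], self.tree[2 * i + 1])
        self.oldest = 0                             # ring-buffer position of oldest item

    def result(self):
        return self.tree[1]                         # root holds the window aggregate

    def slide(self, new_item):
        # Evict the oldest item by overwriting its leaf, then recombine only
        # the ancestors of that leaf: O(log n) work per slide.
        i = self.n + self.oldest
        self.tree[i] = new_item
        while i > 1:
            i //= 2
            self.tree[i] = self.combine(self.tree[2 * i], self.tree[2 * i + 1])
        self.oldest = (self.oldest + 1) % self.n

w = WindowTree([1, 2, 3, 4], lambda a, b: a + b)
assert w.result() == 10
w.slide(9)              # window is now [2, 3, 4, 9]
assert w.result() == 18
```

A full recomputation would combine all n items on every slide; here each slide touches only one leaf and its ancestors, which is the asymptotic behavior the entry's contraction trees achieve in a distributed setting.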
To the best of our knowledge, this is the first technique for updating distributed incremental computations that achieves efficiency of this kind.
We further improve the proposed approach by designing a split processing model that takes advantage of structural properties of sliding window computations to improve response times. Under this model, we split the application processing into two parts: foreground and background processing. The foreground processing takes place right after the update to the computation window and minimizes the processing time by combining new data with a precomputed intermediate result. The background processing takes place after the result is produced and returned, paving the way for an efficient foreground processing by precomputing the intermediate result that will be used in the next incremental update.
We implemented SLIDER by extending Hadoop and evaluated its effectiveness by applying it to a variety of micro-benchmarks and applications, including three real-world cases (Bhatotia et al. 2012b, 2014). Our experiments show that SLIDER can deliver significant performance gains while incurring only modest overheads for the initial run (the non-incremental preprocessing run).

Background and Overview

In this section, we present some background and an overview of the design of SLIDER.

Self-Adjusting Computation
Self-adjusting computation is a field that studies ways to incrementalize programs automatically, without requiring significant changes to the code base (Acar 2005; Bhatotia 2015). For automatic incrementalization, in self-adjusting computation, the system constructs and maintains a dynamic dependency graph that contains the input data to a program, all sub-computations, and the data and control dependencies in between, e.g., which outputs of sub-computations are used as inputs to other sub-computations, and which sub-computations are created by another. The dynamic dependency graph enables a change propagation algorithm to update the computation and the output by propagating changes through the dependency graph, re-executing all sub-computations transitively affected by the change, reusing unaffected computations, and deleting obsolete sub-computations that no longer take place. The change propagation algorithm has been shown to be effective for a broad class of computations called stable and has even helped solve algorithmically sophisticated problems in a more efficient way than more ad hoc approaches (e.g., Acar et al. 2010; Sümer et al. 2011).

Design Overview
Our primary goal is to design techniques for automatically incrementalizing sliding window computations to efficiently update their outputs when the window slides. We wish to achieve this without requiring the programmer to change any of the code, which is written assuming that the windows do not slide, i.e., that the data remains static. To this end, we apply the principles of self-adjusting computation to develop a way of processing data that leads to a stable computation, i.e., one whose overall dependency structure does not change dramatically when the window slides. In this chapter, we apply this methodology to the MapReduce model (Dean and Ghemawat 2004), which enables expressive, fault-tolerant computations on large-scale data, thereby showing that the approach can be practical in modern data analysis systems. Nonetheless, the techniques we propose can be applied more generally, since we only require that the computation can be divided in a parallel, recursive fashion. In fact, the techniques that we propose can be applied to other large-scale data analysis models such as Dryad (Isard et al. 2007), Spark (Zaharia et al. 2012), and Pregel (Malewicz et al. 2010), which allow for recursively breaking up a computation into sub-computations.
Assuming that the reader is familiar with the MapReduce model, we first outline a basic design and then identify the limitations of this basic approach and describe how we overcome them. For simplicity, we assume that each job includes a single Reduce task (i.e., all mappers output
tuples with the same key). By symmetry, this assumption causes no loss of generality; in fact, our techniques and implementation apply to multiple keys and reducers.

The basic design. Our basic design corresponds roughly to the scheme proposed in prior work, Incoop (Bhatotia et al. 2011a,b), where new data items are appended at the end of the previous window and old data items are dropped from the beginning. To update the output incrementally, we launch a Map task for each new "split" (a partition of the input that is handled by a single Map task) that holds new data and reuse the results of Map tasks operating on old but live data. We then feed the newly computed results together with the reused results to the Reduce task to compute the final output.

Contraction trees and change propagation. The basic design suffers from an important limitation: it cannot reuse any work of the Reduce task. This is because the Reduce task takes as input all values for a given key (<ki>, <v1, v2, v3, ..., vn>), and therefore a single change triggers a recomputation of the entire Reduce task. We address this by organizing the Reduce tasks as a contraction tree and proposing a change propagation algorithm that can update the result by performing traversals on the contraction tree, while keeping it balanced (low-depth) and thus guaranteeing efficiency.
Contraction trees (used in Incoop (Bhatotia et al. 2011b), Camdoop (Costa et al. 2012), and iMR (Logothetis et al. 2011)) leverage Combiner functions of MapReduce to break a Reduce task into smaller sub-computations that can be performed in parallel. Combiner functions originally aim at saving bandwidth by offloading parts of the computation performed by the Reduce task to the Map task. To use Combiners, the programmer specifies a separate Combiner function, which is executed on the machine that runs the Map task and performs part of the work done by the Reduce task in order to preprocess various <key,value> pairs, merging them into a smaller number of pairs. The Combiner function takes a sequence of <key,value> pairs as both its input and output type. This new usage of Combiners is compatible with the original Combiner interface but requires associativity for grouping different Map outputs. To construct a contraction tree, we break up the work done by the (potentially large) Reduce task into many applications of the Combiner functions. More specifically, we split the Reduce input into small groups and apply the Combiner to pairs of groups recursively in the form of a balanced tree until we have a single group left. We apply the Reduce function to the output of the last Combiner.
When the window slides, we employ a change-propagation algorithm to update the result. Change propagation runs the Map on the newly introduced data and propagates the results of the Map tasks through the contraction tree that replaced the Reduce task. One key design challenge is how to organize the contraction tree so that this propagation requires a minimal amount of time. We overcome this challenge by designing contraction trees and the change propagation algorithm such that change propagation always guarantees that the contraction trees remain balanced, guaranteeing low depth. As a result, change propagation can perform efficient updates by traversing only a short path of the contraction tree. In this approach, performing an update not only updates the data but also the structure of the contraction tree (so that it remains balanced), guaranteeing that updates remain efficient regardless of the past history of window slides.
In addition to contraction trees and change propagation, we also present a split processing technique that takes advantage of the structural properties of sliding window computation to improve response times. Split processing splits the total work into background tasks, which can be performed when no queries are executing, and foreground tasks, which must be performed in order to respond to a query correctly. By preprocessing certain computations in the background, split processing can improve performance significantly (up to 40% in our experiments).
The techniques that we present here are fully general: they apply to sliding window computations where the window may be resized arbitrarily by adding as much new data at the end and removing as much data from the beginning as desired. As we describe, however, more restricted updates where, for example, the window is only expanded by append operations (append-only) or where the size of the window remains the same (fixed-size windows) lead to a more efficient and simpler design, algorithms, and implementation.

Slider Architecture

In this section, we give an overview of our implementation. Its key components are described in the following.

Implementation of self-adjusting trees. We implemented our prototype of SLIDER based on Hadoop. We implemented the three variants of the self-adjusting contraction tree by inserting an additional stage between the shuffle stage (where the outputs of Map tasks are routed to the appropriate Reduce task) and the sort stage (where the inputs to the Reduce task are sorted). To prevent unnecessary data movement in the cluster, the new contraction phase runs on the same machine as the Reduce task that will subsequently process the data.
SLIDER maintains a distributed dependency graph from one run to the next and pushes the changes through the graph by recomputing sub-computations affected by the changed input. For this purpose, we rely on a memoization layer to remember the inputs and outputs of various tasks and the nodes of the self-adjusting contraction trees. A shim I/O layer provides access to the memoization layer. This layer stores data in an in-memory distributed data cache, which provides both low-latency access and fault tolerance.

Split processing knob. SLIDER implements the split processing capability as an optional step that is run offline on a best-effort basis. This stage is scheduled during low cluster utilization periods as a background task. It runs as a distributed data processing job (similar to MapReduce) with only one offline preprocessing stage.

In-memory distributed data caching. To provide fast access to memoized results, we designed an in-memory distributed data caching layer. Our use of in-memory caching is motivated by two observations: First, the number of sub-computations that need to be memoized is limited by the size of the sliding window. Second, main memory is generally underutilized in data-centric computing (Ananthanarayanan et al. 2012). The distributed in-memory cache is coordinated by a master node, which maintains an index to locate the memoized results. The master implements a simple cache replacement policy, which can be changed according to the workload characteristics. The default policy is Least Recently Used (LRU).

Fault tolerance. Storing memoized results in an in-memory cache is beneficial for performance but can lead to reduced cache effectiveness when machines fail, as it requires unnecessary recomputation (especially for long-running jobs). We therefore conduct a background replication of memoized results to provide fault tolerance by creating two replicas of each memoized result (similar to RAMCloud (Ongaro et al. 2011)). This way, fault tolerance is handled transparently: when a new task wants to fetch a memoized result, it reads that result from one of the replicas. To ensure that the storage requirements remain bounded, we developed a garbage collection algorithm that frees the storage used by results that fall out of the current window.

Scheduler modifications. The implementation of SLIDER modifies Hadoop's scheduler to become aware of the location of memoized results. Hadoop's scheduler chooses any available node to run a pending Reduce task, only taking locality into account when scheduling Map tasks by biasing toward the node holding the input. SLIDER's scheduler adapts previous work in data locality scheduling (Bhatotia et al. 2011b; Wieder et al. 2010a,b, 2012) by attempting to run Reduce tasks on the machine that contains the memoized outputs of the combiner phase. When that target machine is overloaded, it migrates tasks from the overloaded node to another node, including
the relevant memoized results. Task migration is important to prevent a significant performance degradation due to straggler tasks (Zaharia et al. 2008).

Dataflow query processing. As SLIDER is based on Hadoop, computations can be implemented using the MapReduce programming model. While MapReduce is gaining in popularity, many programmers are more familiar with query interfaces, as provided by SQL or LINQ. To ease the adoption of SLIDER, we also provide a dataflow query interface that is based on Pig (Olston et al. 2008). Pig consists of a high-level language similar to SQL and a compiler, which translates Pig programs to sequences of MapReduce jobs. As such, the jobs resulting from this compilation can run in SLIDER without adjustment. This way, we transparently run Pig-based sliding window computations in an incremental way.

techniques that can achieve optimal update times (Acar 2005). This work, however, primarily targets sequential computations. The recent work on iThreads (Bhatotia et al. 2015) supports parallel incremental computation. In contrast, we developed techniques that are specific to sliding window computations and operate on big data by adapting the principles of self-adjusting computation for a large-scale parallel and distributed execution environment.

Database systems. There is substantial work from the database community on incrementally updating a database view (i.e., a predetermined query on the database) as the database contents change (Ceri and Widom 1991). In contrast, our focus on the MapReduce model, with a different level of expressiveness, and on sliding window computations over big data brings a series of different technical challenges.
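The Combiner-based contraction described earlier, splitting the Reduce input into small groups and applying an associative Combiner to pairs of groups in a balanced tree, can be sketched as follows (a minimal single-machine illustration; the function names are ours, not SLIDER's API):

```python
# Sketch of the contraction-tree construction (illustrative only): groups of
# Reduce input are pairwise-combined level by level until one group remains;
# Reduce then runs only on that final combined group. The combiner must be
# associative; no commutativity is needed, since group order is preserved.

def contract(groups, combiner):
    """Pairwise-combine adjacent groups, level by level, until one remains."""
    while len(groups) > 1:
        nxt = []
        for i in range(0, len(groups) - 1, 2):
            nxt.append(combiner(groups[i], groups[i + 1]))
        if len(groups) % 2:          # odd group is carried up to the next level
            nxt.append(groups[-1])
        groups = nxt
    return groups[0]

# Word-count style example: each group is a partial count produced by a Map
# task, the Combiner adds partial counts, and Reduce emits the final value.
partial_counts = [3, 1, 4, 1, 5]
assert contract(partial_counts, lambda a, b: a + b) == 14
```

In the distributed setting described in the Background section, each internal node of this tree is a memoized sub-computation, so when the window slides only the combiner applications on the root-to-leaf paths of the changed groups need to be re-executed.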
transitively affected by those changes. Second, our design includes several novel techniques that are specific to sliding window computations, which are important to improve the performance for this mode of operation. The follow-up work on IncApprox (Krishnan et al. 2016) extends the approach of SLIDER to support incremental approximate computing for sliding window analytics. In particular, IncApprox combines incremental computation and approximate computation (Quoc et al. 2017a,b,c,d).

Batched stream processing. Stream processing engines (He et al. 2010; Condie et al. 2010; Olston et al. 2011) model the input data as a stream, with data-analysis queries being triggered upon bulk appends to the stream. These systems are designed for append-only data processing, which is only one of the cases we support. Compared to Comet (He et al. 2010) and Nova (Olston et al. 2011), SLIDER avoids the need to design a dynamic algorithm, thus preserving transparency relative to single-pass non-incremental data analysis. Hadoop Online (Condie et al. 2010) is transparent but does not attempt to break up the work of the Reduce task, which is one of the key contributions and sources of performance gains of our work.

Conclusion

In this chapter, we present techniques for incrementally updating sliding window computations as the window slides. The idea behind our approach is to structure distributed data-parallel computations in the form of balanced trees that can perform updates in asymptotically sublinear time, thus much more efficiently than recomputing from scratch. We present several algorithms for common instances of sliding window computations; describe the design of a system, SLIDER (Bhatotia et al. 2012b, 2014), that uses these algorithms; and present an implementation along with an extensive evaluation. Our evaluation shows that (1) SLIDER is effective on a broad range of applications, (2) SLIDER drastically improves performance compared to recomputing from scratch, and (3) SLIDER significantly outperforms several related systems. These results show that some of the benefits of incremental computation can be realized automatically without requiring programmer-controlled hand incrementalization. The asymptotic efficiency analysis of self-adjusting contraction trees is available (Bhatotia 2016).

References

Acar UA (2005) Self-adjusting computation. PhD thesis, Carnegie Mellon University
Acar UA, Blelloch GE, Blume M, Harper R, Tangwongsan K (2009) An experimental analysis of self-adjusting computation. ACM Trans Program Lang Syst (TOPLAS) 32(1):1–53
Acar UA, Cotter A, Hudson B, Türkoğlu D (2010) Dynamic well-spaced point sets. In: Proceedings of the 26th annual symposium on computational geometry (SoCG)
Ananthanarayanan G, Ghodsi A, Wang A, Borthakur D, Shenker S, Stoica I (2012) PACMan: coordinated memory caching for parallel jobs. In: Proceedings of the 9th USENIX conference on networked systems design and implementation (NSDI)
Apache Software Foundation (2017) Apache Hive
Babcock B, Datar M, Motwani R, O'Callaghan L (2002) Sliding window computations over data streams. Technical report
Bhatotia P (2015) Incremental parallel and distributed systems. PhD thesis, Max Planck Institute for Software Systems (MPI-SWS)
Bhatotia P (2016) Asymptotic analysis of self-adjusting contraction trees. CoRR, abs/1604.00794
Bhatotia P, Wieder A, Akkus IE, Rodrigues R, Acar UA (2011a) Large-scale incremental data processing with change propagation. In: Proceedings of the conference on hot topics in cloud computing (HotCloud)
Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquini R (2011b) Incoop: MapReduce for incremental computations. In: Proceedings of the ACM symposium on cloud computing (SoCC)
Bhatotia P, Rodrigues R, Verma A (2012a) Shredder: GPU-accelerated incremental storage and computation. In: Proceedings of the USENIX conference on file and storage technologies (FAST)
Bhatotia P, Dischinger M, Rodrigues R, Acar UA (2012b) Slider: incremental sliding-window computations for large-scale data analysis. Technical report MPI-SWS-2012-004, MPI-SWS. http://www.mpi-sws.org/tr/2012-004.pdf
Bhatotia P, Acar UA, Junqueira FP, Rodrigues R (2014) Slider: incremental sliding window analytics. In: Proceedings of the 15th international middleware conference (Middleware)
Bhatotia P, Fonseca P, Acar UA, Brandenburg B, Rodrigues R (2015) iThreads: a threading library for parallel incremental computation. In: Proceedings of the 20th international conference on architectural support for programming languages and operating systems (ASPLOS)
Bu Y, Howe B, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. In: Proceedings of the international conference on very large data bases (VLDB)
Ceri S, Widom J (1991) Deriving production rules for incremental view maintenance. In: Proceedings of the international conference on very large data bases (VLDB)
Chiang Y-J, Tamassia R (1992) Dynamic algorithms in computational geometry. In: Proceedings of the IEEE
Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R (2010) MapReduce online. In: Proceedings of the 7th USENIX conference on networked systems design and implementation (NSDI)
Costa et al (2012) Camdoop: exploiting in-network aggregation for big data applications. In: Proceedings of the 9th USENIX conference on networked systems design and implementation (NSDI)
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)
Demetrescu C, Finocchi I, Italiano G (2004) Handbook on data structures and applications. Chapman & Hall/CRC, Boca Raton
Gunda PK, Ravindranath L, Thekkath CA, Yu Y, Zhuang L (2010) Nectar: automatic management of data and computation in datacenters. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)
He B, Yang M, Guo Z, Chen R, Su B, Lin W, Zhou L (2010) Comet: batched stream processing for data intensive distributed computing. In: Proceedings of the ACM symposium on cloud computing (SoCC)
Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the ACM European conference on computer systems (EuroSys)
Krishnan DR, Quoc DL, Bhatotia P, Fetzer C, Rodrigues R (2016) IncApprox: a data analytics system for incremental approximate computing. In: Proceedings of the 25th international conference on world wide web (WWW)
Logothetis D, Olston C, Reed B, Webb KC, Yocum K (2010) Stateful bulk processing for incremental analytics. In: Proceedings of the ACM symposium on cloud computing (SoCC)
Logothetis D, Trezzo C, Webb KC, Yocum K (2011) In-situ MapReduce for log processing. In: Proceedings of the 2011 USENIX conference on USENIX annual
ACM SIGMOD international conference on management of data (SIGMOD)
Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation (NSDI)
Olston C et al (2008) Pig Latin: a not-so-foreign language for data processing. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD)
Olston C et al (2011) Nova: continuous pig/hadoop workflows. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD)
Ongaro D, Rumble SM, Stutsman R, Ousterhout J, Rosenblum M (2011) Fast crash recovery in RAMCloud. In: Proceedings of the twenty-third ACM symposium on operating systems principles (SOSP)
Peng D, Dabek F (2010) Large-scale incremental processing using distributed transactions and notifications. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)
Quoc DL, Beck M, Bhatotia P, Chen R, Fetzer C, Strufe T (2017a) Privacy preserving stream analytics: the marriage of randomized response and approximate computing. https://arxiv.org/abs/1701.05403
Quoc DL, Beck M, Bhatotia P, Chen R, Fetzer C, Strufe T (2017b) PrivApprox: privacy-preserving stream analytics. In: Proceedings of the 2017 USENIX annual technical conference (USENIX ATC)
Quoc DL, Chen R, Bhatotia P, Fetzer C, Hilt V, Strufe T (2017c) Approximate stream analytics in Apache Flink and Apache Spark Streaming. CoRR, abs/1709.02946
Quoc DL, Chen R, Bhatotia P, Fetzer C, Hilt V, Strufe T (2017d) StreamApprox: approximate computing for stream analytics. In: Proceedings of the international middleware conference (Middleware)
Ramalingam G, Reps T (1993) A categorized bibliography on incremental computation. In: Proceedings of the ACM SIGPLAN-SIGACT symposium on principles of programming languages (POPL)
Sümer O, Acar UA, Ihler A, Mettu R (2011) Adaptive exact inference in graphical models. J Mach Learn
Wieder A, Bhatotia P, Post A, Rodrigues R (2010a) Brief announcement: modelling MapReduce for optimal execution in the cloud. In: Proceedings of the 29th ACM SIGACT-SIGOPS symposium on principles of distributed computing (PODC)
Wieder A, Bhatotia P, Post A, Rodrigues R (2010b) Conductor: orchestrating the clouds. In: Proceedings of the 4th international workshop on large scale distributed systems and middleware (LADIS)
Wieder A, Bhatotia P, Post A, Rodrigues R (2012) Orchestrating the deployment of computations in the cloud with conductor. In: Proceedings of the 9th USENIX
technical conference (USENIX ATC) symposium on networked systems design and imple-
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, mentation (NSDI)
Leiser N, Czajkowski G (2010) Pregel: a system for Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I
large-scale graph processing. In: Proceedings of the (2008) Improving MapReduce performance in hetero-
Alice or finding other users within 300 meters around Alice. Spatial indices have been studied extensively to support such queries. In the big data era, there may be millions of spatial queries and places of interest to be processed at the same time. This poses new challenges to both the scalability and the efficiency of spatial indexing techniques. Parallel spatial index structures have been developed to address these challenges; they are summarized in the following section.
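Such a radius query (e.g., the 300-meter search around Alice) can be illustrated with a minimal uniform-grid index. This is a hedged sketch: the cell width, coordinates, and function names below are all hypothetical and not taken from any system discussed in this entry.

```python
from collections import defaultdict
from math import hypot

CELL = 300.0  # grid cell width in meters (hypothetical tuning choice)

def build_grid(points):
    """Bucket each (x, y) point into the grid cell that contains it."""
    grid = defaultdict(list)
    for p in points:
        grid[(int(p[0] // CELL), int(p[1] // CELL))].append(p)
    return grid

def range_query(grid, center, radius):
    """Report all points within `radius` of `center`, probing only
    the grid cells that can overlap the query circle."""
    cx, cy = int(center[0] // CELL), int(center[1] // CELL)
    reach = int(radius // CELL) + 1
    result = []
    for i in range(cx - reach, cx + reach + 1):
        for j in range(cy - reach, cy + reach + 1):
            for (x, y) in grid.get((i, j), []):
                if hypot(x - center[0], y - center[1]) <= radius:
                    result.append((x, y))
    return result

users = [(10.0, 20.0), (250.0, 40.0), (900.0, 900.0)]
grid = build_grid(users)
alice = (0.0, 0.0)
nearby = range_query(grid, alice, 300.0)  # users within 300 m of Alice
```

The point of the grid is that only a constant number of cells around the query point are examined, instead of scanning every stored location.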
Indexing
Indexing, Fig. 1 Two-level distributed indexing (a global index over the data partitions; local indices within each partition)
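The two-level scheme of Fig. 1 can be sketched as follows (illustrative code only, with hypothetical names and data): a global index keeps one minimum bounding rectangle (MBR) per partition and prunes partitions that cannot contain query results, while a local index, here an x-sorted list standing in for a grid or R-tree, answers the query inside each surviving partition.

```python
from bisect import bisect_left, bisect_right

class Partition:
    """One data partition with a tiny 'local index': points sorted by x."""
    def __init__(self, points):
        self.points = sorted(points)          # local index: x-sorted list
        xs = [p[0] for p in self.points]
        ys = [p[1] for p in self.points]
        # MBR of this partition, published to the global index
        self.mbr = (min(xs), min(ys), max(xs), max(ys))

    def query(self, x1, y1, x2, y2):
        """Local range query: binary-search the x range, then filter on y."""
        lo = bisect_left(self.points, (x1, float("-inf")))
        hi = bisect_right(self.points, (x2, float("inf")))
        return [p for p in self.points[lo:hi] if y1 <= p[1] <= y2]

def global_query(partitions, x1, y1, x2, y2):
    """Global index: check each partition's MBR and prune non-overlapping
    partitions before delegating to their local indices."""
    out = []
    for part in partitions:
        ax1, ay1, ax2, ay2 = part.mbr
        if ax1 <= x2 and x1 <= ax2 and ay1 <= y2 and y1 <= ay2:
            out.extend(part.query(x1, y1, x2, y2))
    return out

parts = [Partition([(1, 1), (2, 5)]), Partition([(10, 10), (12, 3)])]
hits = global_query(parts, 0, 0, 3, 6)  # only the first partition is probed
```

In a distributed setting, the global index would live on a coordinator while each partition (and its local index) resides on a worker machine, so pruning by MBR also saves network round trips.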
More recent work on parallelizing spatial indices uses the MapReduce and Spark frameworks, which have become standard parallel computation frameworks in the past decade. These frameworks hide parallelism details such as synchronization and simplify parallel computation into a few generic operations (e.g., Map and Reduce). The two-level index structure as described above is still commonly used, although some of the proposed techniques only use either the global index or the local indices. We describe a few representative index structures using these two frameworks next.

MapReduce-Based Indexing
Hadoop-GIS (Aji et al. 2013) is a Hadoop (an open-source implementation of the MapReduce framework)-based spatial data warehousing system. It builds a two-level (global and local) index structure over the data objects to support parallel processing of spatial queries. The global index is pre-built by first partitioning the data space with a regular grid and then recursively partitioning every grid cell containing more than a predefined number of data objects n_max into two equi-sized cells at the middle, until each cell contains no more than n_max objects. Data objects in the same cell are associated with the same unique id and stored in the same block. Multiple cells are grouped into a block for storage to suit the block size of HDFS (the distributed file system used by Hadoop). The minimum bounding rectangles (MBR) of the grid cells are stored in the global index. A local index is built at query time over the data objects assigned to the same machine for query processing.

SpatialHadoop (Eldawy and Mokbel 2013) extends Hadoop to provide built-in support for spatial queries. It also uses a global index and a local index. Grid-based and R-tree-based indices can both be used for the global index, where the grid cells and leaf nodes, respectively, correspond to partitions that are stored in HDFS blocks. A local index is built for each partition, which can also use grid or R-tree-based indices. AQWA (Aly et al. 2015) builds a global index only, which uses the k-d tree. This index also stores the number of data points of each partition to help query processing.

A few other studies (e.g., Nishimura et al. 2011; Hsu et al. 2012; Zhang et al. 2014) use space-filling curves such as the Z-curve (Orenstein and Merrett 1984) to map multidimensional coordinates to one-dimensional values. This enables storing the data objects in NoSQL databases such as HBase, where the mapped coordinate value and the id of a data object are stored as a key-value pair. Data update
and spatial queries can then be handled via the APIs of NoSQL databases.

When a hierarchical index such as the R-tree or the k-d tree is built with MapReduce, dynamic index building procedures that insert the data objects into the index individually lack efficiency. A common assumption is that the entire data set is available; in this case, the entire index structure over the data set can be bulk-loaded at once. A straightforward approach is to bulk-load the hierarchical index level by level, where each level incurs a round of MapReduce computation (Achakeev et al. 2012). Sampling has been used to reduce the number of MapReduce rounds needed. Papadopoulos and Manolopoulos (2003) propose a generic procedure for parallel spatial index bulk-loading. This procedure uses sampling to estimate the data distribution, which guides the partitioning of the data space. Data objects in the same partition are assigned to the same machine for index building. SpatialHadoop uses a similar idea for R-tree bulk-loading, where an R-tree is built on a sample data set and the MBRs of the leaf nodes are used for data partitioning. Agarwal et al. (2016) also use a sampling-based technique but bulk-load k-d trees. This technique involves multiple rounds of sampling. Each round uses a sample data set small enough to be processed by a single machine. A k-d tree is built on the sample data set, and data objects are assigned to the partitions created by the k-d tree. For the partitions that contain too many data objects to be stored together, another round of sampling- and k-d-tree-based partitioning is performed. Repeating the procedure above, it takes O(poly log_s n) MapReduce rounds to bulk-load a k-d tree, where s denotes the number of data points that can be processed by a single machine in the MapReduce cluster.

Spark-Based Indexing
Spark models a data set as a resilient distributed dataset (RDD) and stores it across the machines in the cluster. Spatial indices on Spark focus on optimizing for the RDD storage architecture. GeoSpark (Yu et al. 2015) builds a local index such as a quad-tree on the fly for the data objects in each RDD partition, while the partitions are created by a uniform grid partitioning over the data space. SpatialSpark (You et al. 2015) also builds a local index for the data objects in each RDD partition, which can be stored together with the data objects in the RDD partition. SpatialSpark supports binary space partitioning and tile partitioning in addition to uniform grid partitioning over the data space. STARK (Hagedorn et al. 2017) uses R-trees for the local indices, while uniform grid partitioning and cost-based binary space partitioning are used for creating the partitions. Simba (Xie et al. 2016) uses both global and local indices (e.g., R-trees). The global index is held in memory in the master node, while the local indices are held in each RDD partition.

Examples of Application

Spatial indexing techniques are used in spatial databases to support applications in both business and daily scenarios such as targeted advertising, digital mapping, and location-based social networking. In these applications, the locations (coordinates) of large sets of places of interest as well as users are stored in a spatial database. Spatial indices are built on top of such data to provide fast data access, facilitating spatial queries such as finding customers within 100 meters of a shopping mall (to push shopping vouchers), finding the nearest restaurants for Alice, and finding users living in nearby suburbs who share some common interests (for friendship recommendations).

Future Directions for Research

The indexing techniques discussed above do not consider handling location data with frequent updates. Such data is becoming more and more common with the increasing popularity of positioning-system-enabled devices such as smartphones, which enables tracking and querying the continuously changing locations of users or data objects of interest (Zhang et al. 2012; Ward et al. 2014; Li et al. 2015). How to update the spatial indices and to record multiple versions of the
data (Lomet et al. 2008) with a high frequency is nontrivial. While parallel frameworks such as MapReduce scale well to massive data, the indices stored in the distributed data storage may not be updated as easily as indices stored on a standalone machine. A periodic rebuild of the indices is an option, but it may not keep the indices up to date between rebuilds and hence may provide inaccurate query answers. More efficient index update techniques are needed. Another important direction to pursue is efficient indexing techniques to support spatial data mining. Many data mining algorithms are computationally intensive and require iterative accesses to (high-dimensional) data. Spatial indices designed to cope with high-dimensional data (Zhang et al. 2004; Jagadish et al. 2005) and the special access patterns of such algorithms would have a significant impact in improving the usability of such algorithms.

Cross-References

Query Processing: Computational Geometry
Query Processing: Joins
Query Processing – kNN

References

Achakeev D, Seidemann M, Schmidt M, Seeger B (2012) Sort-based parallel loading of R-trees. In: Proceedings of the 1st ACM SIGSPATIAL international workshop on analytics for big geospatial data, pp 62–70
Agarwal PK, Fox K, Munagala K, Nath A (2016) Parallel algorithms for constructing range and nearest-neighbor searching data structures. In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI symposium on principles of database systems (PODS), pp 429–440
Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz J (2013) Hadoop GIS: a high performance spatial data warehousing system over MapReduce. Proc VLDB Endow 6(11):1009–1020
Aly AM, Mahmood AR, Hassan MS, Aref WG, Ouzzani M, Elmeleegy H, Qadah T (2015) AQWA: adaptive query workload aware partitioning of big spatial data. Proc VLDB Endow 8(13):2062–2073
Eldawy A, Mokbel MF (2013) A demonstration of SpatialHadoop: an efficient MapReduce framework for spatial data. Proc VLDB Endow 6(12):1230–1233
Finkel RA, Bentley JL (1974) Quad trees: a data structure for retrieval on composite keys. Acta Informatica 4(1):1–9
Gaede V, Günther O (1998) Multidimensional access methods. ACM Comput Surv 30(2):170–231
Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data (SIGMOD), pp 47–57
Hagedorn S, Götze P, Sattler K (2017) Big spatial data processing frameworks: feature and performance evaluation. In: Proceedings of the 20th international conference on extending database technology (EDBT), pp 490–493
Hsu YT, Pan YC, Wei LY, Peng WC, Lee WC (2012) Key formulation schemes for spatial index in cloud data managements. In: 2012 IEEE 13th international conference on mobile data management (MDM), pp 21–26
Jagadish HV, Ooi BC, Tan KL, Yu C, Zhang R (2005) iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans Database Syst 30(2):364–397
Koudas N, Faloutsos C, Kamel I (1996) Declustering spatial databases on a multi-computer architecture. In: Proceedings of the 5th international conference on extending database technology: advances in database technology (EDBT), pp 592–614
Li C, Gu Y, Qi J, Zhang R, Yu G (2015) A safe region based approach to moving KNN queries in obstructed space. Knowl Inf Syst 45(2):417–451
Lomet D, Hong M, Nehme R, Zhang R (2008) Transaction time indexing with version compression. Proc VLDB Endow 1(1):870–881
Nishimura S, Das S, Agrawal D, Abbadi AE (2011) MD-HBase: a scalable multi-dimensional data infrastructure for location aware services. In: Proceedings of the 2011 IEEE 12th international conference on mobile data management (MDM), vol 01, pp 7–16
Orenstein JA, Merrett TH (1984) A class of data structures for associative searching. In: Proceedings of the 3rd ACM SIGACT-SIGMOD symposium on principles of database systems (PODS), pp 181–190
Papadopoulos A, Manolopoulos Y (2003) Parallel bulk-loading of spatial data. Parallel Comput 29(10):1419–1444
Schnitzer B, Leutenegger ST (1999) Master-client R-trees: a new parallel R-tree architecture. In: Proceedings of the 11th international conference on scientific and statistical database management (SSDBM), pp 68–77
Ward PGD, He Z, Zhang R, Qi J (2014) Real-time continuous intersection joins over large sets of moving objects using graphic processing units. VLDB J 23(6):965–985
Xie D, Li F, Yao B, Li G, Zhou L, Guo M (2016) Simba: efficient in-memory spatial analytics. In: Proceedings of the 2016 SIGMOD international conference on management of data (SIGMOD), pp 1071–1085
Indexing for Graph Query Evaluation
George Fletcher¹ and Martin Theobald²
¹Technische Universiteit Eindhoven, Eindhoven, Netherlands
²Université du Luxembourg, Luxembourg, Luxembourg

Definitions

Given a graph, an index is a data structure supporting a map from a collection of keys to a collection of elements in the graph. For example, we may have an index on node labels which, given a node label as search key, facilitates accelerated access to all nodes of the graph having the given label. The evaluation of queries on graph databases is often facilitated by index data structures. An index can be the primary representation of the graph or can be a secondary access path to elements of the graph.

Data Model. The general data model we consider for this article is that of a directed, labeled multigraph G = (V, E, L, λ_V, λ_E), consisting of a set of vertices V, a set of edges E over V, a set of labels L, a vertex-labeling function λ_V : V → L, and an edge-labeling function λ_E : E → L. For most applications, we assume λ_V to be bijective, i.e., all vertices have a distinct label (then also called a "vertex identifier"), while λ_E is neither necessarily injective nor surjective, i.e., edge labels may occur multiple times among edges.

Query Model. In analogy to our data model, we also consider a subgraph pattern matching query G_Q = (V_Q, E_Q, L, Vars, λ_Q) to be a labeled, directed multigraph, consisting of a set of query vertices V_Q, a set of query edges E_Q over V_Q, two disjoint sets of labels L and variables Vars with L ∩ Vars = ∅, and a labeling function λ_Q : V_Q ∪ E_Q → L ∪ Vars. Also here, by
assuming unique vertex labels, we impose that E_Q ⊆ V_Q × (L ∪ Vars) × V_Q.

Given a data graph G = (V, E, L) and a query graph G_Q = (V_Q, E_Q, L, Vars), the semantics of evaluating G_Q on G consists of identifying all isomorphic subgraphs G′ = (V′, E′, L) of G, with V′ ⊆ V and E′ ⊆ E ∩ (V′ × L × V′), i.e., a bijection f : V′ ∪ L → V_Q ∪ L ∪ Vars such that (v1, l, v2) ∈ E′ ⟹ (f(v1), f(l), f(v2)) ∈ E_Q. For query evaluation, we impose the additional condition that for vertex and edge labels l ∈ L, it must hold that f(l) = l, i.e., the given labels in the query graph must also be matched by the data subgraph, such that only variables in the query graph are replaced by labels in the data subgraph. Some applications relax the semantics of query evaluation such that all homomorphic embeddings are constructed, i.e., f is not necessarily injective. Under homomorphic semantics, subgraph pattern matching queries can also be seen as conjunctive queries over relations consisting of edge-labeled adjacency lists with schema V × L × V and be solved by a fixed number of joins over these relations.

RDF and SPARQL. Directed labeled multigraphs without repeating vertex labels are particularly well suited for indexing RDF data (RDF 2014) and for processing basic-graph-pattern (BGP) queries expressible in SPARQL 1.1 (SPA 2013). Due to the unique-name assumption and the corresponding usage of unique resource identifiers (URIs), both prevalent in the RDF data model, edge-labeled adjacency lists of the form V × L × V resolve to subject-predicate-object (SPO) triples over the underlying RDF data graph. Most database-oriented indexing and query processing frameworks consider a collection of SPO triples either as a single large relation, or they horizontally partition the collection into multiple SO relations, one for each distinct predicate P (or "property" in RDF terminology). Moreover, various value- and grammar-based synopses and corresponding indexing techniques for RDF data have been developed recently. To further scale out the indexing strategies to a distributed setting, horizontal and vertical partitioning, summarization, and related strategies have been developed, which we will cover in more detail in the following subsections.

SPARQL also supports complex-graph-pattern (CGP) queries, whose labeled property paths allow a user to formulate transitive reachability constraints among the query variables. Very few works so far have focused on the combined processing of relational joins with additional graph-reachability predicates, e.g., Gubichev et al. (2013) and Sarwat et al. (2013). Przyjaciel-Zablocki et al. (2011) investigate an initial approach to evaluate property paths via MapReduce. Currently, only very few commercial RDF engines, such as Virtuoso (Erling and Mikhailov 2009), are available for processing property paths in SPARQL 1.1.

Value-Based Indexing

Value-based indexes are the simplest form of graph data structure, typically used as the primary representation of the graph in the database. Such indexes map values in the graph (node identifiers, edge identifiers, node labels, edge labels) to the nodes or edges associated with the search key. Among the centralized approaches, native RDF stores like 4store (Harris et al. 2009), SW-Store (Abadi et al. 2009), MonetDB-RDF (Sidirourgos et al. 2008), and RDF-3X (Neumann and Weikum 2010a) have been carefully designed to keep pace with the growing scale of RDF collections. Efficient centralized architectures either employ various forms of horizontal (Neumann and Weikum 2010a; Harris et al. 2009) and vertical (Abadi et al. 2009; Sidirourgos et al. 2008) partitioning schemes or apply sophisticated encoding schemes over the subjects' properties (Atre et al. 2010; Yuan et al. 2013). We refer the reader to Luo et al. (2012) for a comprehensive overview of recent approaches for value-based indexing and query processing techniques for labeled graphs in the form of RDF.

Relational Approaches. The majority of existing RDF stores, both centralized and distributed, follow a relational approach toward
storing and indexing RDF data. By treating RDF data as a collection of triples, early systems like 3Store (Harris and Gibbins 2003) use relational databases to store RDF triples in a single large three-column table. This approach lacks scalability, as query processing (then using SQL) involves many self-joins over this single table. Later systems follow more sophisticated approaches for storing RDF triples. More recent approaches, such as SW-Store by Abadi et al. (2009), vertically partition RDF triples into multiple property tables. State-of-the-art centralized approaches, such as HexaStore (Weiss et al. 2008) and RDF-3X (Neumann and Weikum 2010a), employ index-based solutions by storing triples directly in B+-trees over multiple, redundant SPO permutations. Including all permutations and projections of the SPO attributes, this may result in up to 15 such B+-trees (Neumann and Weikum 2010a). Coupled with sophisticated statistics and query-optimization techniques, these centralized, index-based approaches are still very competitive, as recently shown in Tsialiamanis et al. (2012).

Join-Order Optimization. Determining the optimal join order for a query plan is arguably the main factor that impacts the query processing performance of a relational engine. RDF-3X (Neumann and Weikum 2010a) thus performs an exhaustive plan enumeration in combination with a bottom-up dynamic programming (DP) algorithm and aggressive pruning in order to identify the best join order. TriAD (Gurajada et al. 2014) adopts the DP algorithm described in Neumann and Weikum (2010a) and further adapts it to finding both the best exploration order for the summary graph and the best join order for the subsequent processing against the SPO indexes. Moreover, by including detailed distribution information and the ability to run multiple joins in parallel into the underlying cost model of the optimizer, TriAD is able to generate query plans that are more specifically tuned toward parallel execution than with a purely selectivity-based cost model.

Join-Ahead Pruning. Join-ahead pruning is a second main factor that influences the performance of a relational query processor. In join-ahead pruning, triples that might not qualify for a join are pruned even before the actual join operator is invoked. This pruning of triples ahead of the join operators may thus save a substantial amount of computation time for the actual joins. Instead of the sideways information passing (SIP) strategy used in RDF-3X (Neumann and Weikum 2010a,b), which is a runtime form of join-ahead pruning, TriAD (Gurajada et al. 2014) employs a similar kind of pruning via graph compression, discussed below.

Reachability and Regular-Path Indexing

The reachability problem in directed graphs is one of the most fundamental graph problems and has thus been tackled by a plethora of centralized approaches; see Su et al. (2017) for a recent overview. All methods aim to find tradeoffs between query time and indexing space which, for a directed graph G(V, E), lie between O(|V| + |E|) for both query time and space consumption when no indexes are used, and O(1) query time with O(|V|²) space consumption when the transitive closure is fully materialized. Distributed reachability for single-source, single-target queries has so far been considered in Fan et al. (2012b). For multisource, multi-target reachability queries, only comparably few centralized approaches have been developed (Gao and Anyanwu 2013; Then et al. 2014) that provide suitable indexing and processing strategies. Both are based on a notion of equivalence sets among graph vertices, which effectively resolves to a preprocessing and indexing step over the data graph to predetermine these sets. Efficient centralized approaches, however, are naturally limited to the main memory of a single machine and usually do not consider a parallel, i.e., multi-threaded, execution of a reachability query. Then et al. (2014), on the other hand, focused on the query-time optimization of multisource BFS searches. Gubichev et al. (2013) have studied the use of reachability indexes for the evaluation of richer property-path queries.
Berkeley's GraphX (Xin et al. 2013) (based on Spark; Zaharia et al. 2010), Apache Giraph (Ching et al. 2015), and IBM's very recent Giraph++ (Tian et al. 2013), on the other hand, allow for the scalable processing of graph algorithms over massive, distributed data graphs. All of these provide generic APIs for implementing various kinds of algorithms, including multisource, multi-target reachability queries. Graph indexing and query optimization, however, break the node-centric computing paradigm of Pregel and Giraph, where major amounts of the graph are successively shuffled through the network in each of the underlying MapReduce iterations (the so-called supersteps). Giraph++ is a very recent approach to overcome this overly myopic form of node-centric computing, which led to a new type of distributed graph processing that is coined graph-centric computing in Tian et al. (2013). By exposing intra-node state information as well as the internode partition structure to the local compute nodes, Giraph++ is a great step toward making these graph computations more context-aware. Further distributed approaches include Gurajada and Theobald (2016), Harbi et al. (2015), Peng et al. (2016), and Potter et al. (2016).

Cross-References

Graph Data Models
Graph Partitioning: Formulations and Applications to Big Data
Graph Query Languages
Graph Representations and Storage

References

Abadi DJ, Marcus A, Madden SR, Hollenbach K (2009) SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J 18(2):385–406
Atre M, Chaoji V, Zaki MJ, Hendler JA (2010) Matrix "Bit" loaded: a scalable lightweight join query processor for RDF data. In: WWW. ACM, pp 41–50
Ching A, Edunov S, Kabiljo M, Logothetis D, Muthukrishnan S (2015) One trillion edges: graph processing at facebook-scale. PVLDB 8(12):1804–1815
Erling O, Mikhailov I (2009) Virtuoso: RDF support in a native RDBMS. In: SWIM, pp 501–519
Fan W, Li J, Wang X, Wu Y (2012a) Query preserving graph compression. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2012, Scottsdale, 20–24 May 2012, pp 157–168
Fan W, Wang X, Wu Y (2012b) Performance guarantees for distributed reachability queries. PVLDB 5(11):1304–1315
Fletcher G, Van Gucht D, Wu Y, Gyssens M, Brenes S, Paredaens J (2009) A methodology for coupling fragments of XPath with structural indexes for XML documents. Inf Syst 34(7):657–670
Fletcher G, Gyssens M, Leinders D, Van den Bussche J, Van Gucht D, Vansummeren S (2015) Similarity and bisimilarity notions appropriate for characterizing indistinguishability in fragments of the calculus of relations. J Logic Comput 25(3):549–580
Fletcher G, Gyssens M, Paredaens J, Van Gucht D, Wu Y (2016) Structural characterizations of the navigational expressiveness of relation algebras on a tree. J Comput Syst Sci 82(2):229–259
Message Passing Interface Forum (2015) MPI: a message passing interface standard, version 3.1. https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
Gao S, Anyanwu K (2013) PrefixSolve: efficiently solving multi-source multi-destination path queries on RDF graphs by sharing suffix computations. In: WWW, pp 423–434
Gubichev A, Bedathur SJ, Seufert S (2013) Sparqling Kleene: fast property paths in RDF-3X. In: GRADES, pp 14:1–14:7
Gurajada S, Theobald M (2016) Distributed set reachability. In: SIGMOD, pp 1247–1261
Gurajada S, Seufert S, Miliaraki I, Theobald M (2014) TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. In: SIGMOD, pp 289–300
Harbi R, Abdelaziz I, Kalnis P, Mamoulis N (2015) Evaluating SPARQL queries on massive RDF datasets. Proc VLDB Endow 8(12):1848–1851. https://doi.org/10.14778/2824032.2824083
Harris S, Gibbins N (2003) 3store: efficient bulk RDF storage. In: PSSS, pp 1–15
Harris S, Lamb N, Shadbolt N (2009) 4store: the design and implementation of a clustered RDF store. In: SSWS
Hellings J, Fletcher G, Haverkort H (2012) Efficient external-memory bisimulation on DAGs. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2012, Scottsdale, 20–24 May 2012, pp 553–564
Huang J, Abadi DJ, Ren K (2011) Scalable SPARQL querying of large RDF graphs. PVLDB 4(11):1123–1134
Lin J, Dyer C (2010) Data-intensive text processing with MapReduce. Synthesis lectures on human language technologies. Morgan & Claypool Publishers, San Rafael
Luo Y, Picalausa F, Fletcher GHL, Hidders J, Vansummeren S (2012) Storing and indexing massive RDF
datasets. In: Virgilio RD, Guerra F, Velegrakis Y (eds) Semantic search over the web. Springer, Berlin/Heidelberg, pp 31–60. https://doi.org/10.1007/978-3-642-25008-8_2
Luo Y, de Lange Y, Fletcher G, De Bra P, Hidders J, Wu Y (2013a) Bisimulation reduction of big graphs on MapReduce. In: Big data – proceedings of the 29th British National conference on databases, BNCOD 2013, Oxford, 8–10 July 2013, pp 189–203
Luo Y, Fletcher G, Hidders J, Wu Y, De Bra P (2013b) External memory k-bisimulation reduction of big graphs. In: 22nd ACM international conference on information and knowledge management, CIKM'13, San Francisco, 27 Oct–1 Nov 2013, pp 919–928
Maccioni A, Abadi DJ (2016) Scalable pattern matching over compressed graphs via dedensification. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, 13–17 Aug 2016, pp 1755–1764
Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: SIGMOD, pp 135–146
Maneth S, Peternek F (2016) Compressing graphs by grammars. In: 32nd IEEE international conference on data engineering, ICDE 2016, Helsinki, 16–20 May 2016, pp 109–120
Milo T, Suciu D (1999) Index structures for path expressions. In: Database theory – ICDT'99, proceedings of the 7th international conference, Jerusalem, 10–12 Jan 1999, pp 277–295
Neumann T, Weikum G (2010a) The RDF-3X engine for scalable management of RDF data. VLDB J 19(1):91–113
Neumann T, Weikum G (2010b) x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. PVLDB 3:256–263
Peng P, Zou L, Özsu MT, Chen L, Zhao D (2016) Processing SPARQL queries over distributed RDF graphs. VLDB J 25(2):243–268. https://doi.org/10.1007/s00778-015-0415-0
Picalausa F, Luo Y, Fletcher G, Hidders J, Vansummeren S (2012) A structural approach to indexing triples. In: The semantic web: research and applications – proceedings of the 9th extended semantic web conference, ESWC 2012, Heraklion, 27–31 May 2012, pp 406–421
Picalausa F, Fletcher G, Hidders J, Vansummeren S (2014) Principles of guarded structural indexing. In: Proceedings of the 17th international conference on database theory (ICDT), Athens, 24–28 Mar 2014, pp 245–256
Potter A, Motik B, Nenov Y, Horrocks I (2016) Dis-
RDF (2014) Resource description framework. http://www.w3.org/RDF/
Rohloff K, Schantz RE (2011) Clause-iteration with MapReduce to scalably query datagraphs in the SHARD graph-store. In: DIDC, pp 35–44
Sarwat M, Elnikety S, He Y, Mokbel MF (2013) Horton+: a distributed system for processing declarative reachability queries over partitioned graphs. PVLDB 6(14):1918–1929
Seufert S, Anand A, Bedathur SJ, Weikum G (2013) FERRARI: flexible and efficient reachability range assignment for graph indexing. In: ICDE, pp 1009–1020
Sidirourgos L, Goncalves R, Kersten M, Nes N, Manegold S (2008) Column-store support for RDF data management: not all swans are white. PVLDB 1(2):1553–1563
SPA (2013) SPARQL 1.1 overview. https://www.w3.org/TR/sparql11-overview/
Su J, Zhu Q, Wei H, Yu JX (2017) Reachability querying: can it be even faster? IEEE Trans Knowl Data Eng 29(3):683–697
Then M, Kaufmann M, Chirigati F, Hoang-Vu T, Pham K, Kemper A, Neumann T, Vo HT (2014) The more the merrier: efficient multi-source graph traversal. PVLDB 8(4):449–460
Tian Y, Balmin A, Corsten SA, Tatikonda S, McPherson J (2013) From "think like a vertex" to "think like a graph". PVLDB 7(3):193–204. http://www.vldb.org/pvldb/vol7/p193-tian.pdf
Tsialiamanis P, Sidirourgos L, Fundulaki I, Christophides V, Boncz PA (2012) Heuristics-based query optimisation for SPARQL. In: EDBT, pp 324–335
Webber J (2012) A programmatic introduction to Neo4j. In: SPLASH, pp 217–218
Weiss C, Karras P, Bernstein A (2008) Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1):1008–1019
Xin RS, Gonzalez JE, Franklin MJ, Stoica I (2013) GraphX: a resilient distributed graph system on Spark. In: GRADES
Yuan P, Liu P, Wu B, Jin H, Zhang W, Liu L (2013) TripleBit: a fast and compact system for large scale RDF data. PVLDB 6(7):517–528
Zaharia M, Chowdhury NMM, Franklin M, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. Technical report UCB/EECS-2010-53, EECS Department, UC Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-53.html
Zeng K, Yang J, Wang H, Shao B, Wang Z (2013) A distributed graph engine for web scale RDF data. PVLDB 6(4):265–276
tributed RDF query answering with dynamic data Zhang X, Chen L, Tong Y, Wang M (2013) EAGRE: to-
exchange. In: ISWC, pp 480–497. https://doi.org/10. wards scalable I/O efficient SPARQL query evaluation
1007/978-3-319-46523-4_29 on the cloud. In: ICDE, pp 565–576
Przyjaciel-Zablocki M, Schätzle A, Hornung T, Lausen G Zou L, Mo J, Chen L, Özsu MT, Zhao D (2011) gStore:
(2011) RDFPath: path query processing on large RDF answering SPARQL queries via subgraph matching.
graphs with MapReduce. In: ESWC, pp 50–64 PVLDB 4(8):482–493
Influence Analytics in Graphs
“Influence analytics in graphs” is the study of the dynamics of influence propagation over a network. Influence includes anything that can be transferred or shaped through network interactions, such as pieces of information, actions, behavior, or opinions. Influence analytics includes modeling the process of influence propagation and addressing the algorithmic challenges that arise in solving optimization problems related to the dynamic propagation process.

Overview

Diffusion processes on networks are among the most important dynamics studied in network analysis and can represent social influence. A social network acts as a medium for the spread of information, ideas, and innovations among its members. The idea of the influence maximization problem, which arose in the context of viral marketing, is to control the spread so that an idea or product is widely adopted through word-of-mouth effects. More specifically, by targeting a small number of influential individuals, say, by giving them free samples of the product, the goal is to trigger a cascade of influence by which friends recommend the product to other friends, and many individuals ultimately try it. Rephrased in graph terminology, the goal is to find a small set of nodes, called a seed set, in a given network that maximizes the number of influenced nodes at the end of an influence propagation process.
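The seed-set search just described is typically attacked by greedy hill climbing: repeatedly add the node with the largest marginal gain in estimated spread. The following Python sketch is illustrative rather than taken from this entry; the `spread` oracle, the node names, and the toy influence values are invented for the example.

```python
def greedy_seed_set(nodes, spread, k):
    """Pick k seeds by greedy hill climbing on a spread oracle.

    `spread(seeds)` must return the (estimated) number of nodes
    influenced when `seeds` is the initially activated set.
    """
    seeds = set()
    for _ in range(k):
        # Add the node with the largest marginal gain in spread.
        best = max((n for n in nodes if n not in seeds),
                   key=lambda n: spread(seeds | {n}))
        seeds.add(best)
    return seeds

# Toy oracle: each node independently contributes a known spread value.
influence = {"a": 5, "b": 3, "c": 9, "d": 1}
oracle = lambda s: sum(influence[n] for n in s)
print(sorted(greedy_seed_set(influence, oracle, 2)))  # ['a', 'c']
```

In practice the oracle would average many Monte Carlo runs of a propagation model rather than use a fixed table; when the spread function is monotone and submodular, this greedy scheme carries the (1 − 1/e) approximation guarantee discussed later in this entry.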
Influence Analytics in Graphs, Fig. 1 Example of a propagation process in the independent cascade model. Black nodes, black nodes with white boundary, and white nodes are active, activated in the current step, and inactive, respectively. Real numbers on arcs denote the success probability of activating the node in the head when the tail node is active. Step 0: The top-left vertex is initially activated. Step 1: The top-middle vertex is activated, whereas the bottom-left vertex remains inactive. The arcs once used in the activation process will no longer be used (dashed arcs). Step 2: The bottom-middle vertex is activated. Step 3: The bottom-left and bottom-right vertices are activated.
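The independent cascade process of Fig. 1 can be simulated directly. The sketch below is illustrative (the adjacency representation and the `rng` hook for forcing outcomes are assumptions of the example): each arc out of a newly activated node is tried exactly once with its success probability, matching the "dashed arcs" rule in the caption.

```python
import random

def independent_cascade(arcs, seeds, rng=random.random):
    """One run of the independent cascade (IC) model.

    arcs: dict mapping node -> list of (neighbor, success_probability).
    seeds: initially active nodes.  Returns the set of active nodes.
    Each arc gets a single activation chance, as in Fig. 1.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly = []
        for u in frontier:
            for v, p in arcs.get(u, ()):
                if v not in active and rng() < p:
                    active.add(v)
                    newly.append(v)
        frontier = newly
    return active

arcs = {"s": [("a", 0.7), ("b", 0.2)], "a": [("b", 0.5)]}
# Forcing every arc to fire shows the cascade reaching all reachable nodes:
print(sorted(independent_cascade(arcs, {"s"}, rng=lambda: 0.0)))  # ['a', 'b', 's']
```

Averaging the size of `independent_cascade(...)` over many random runs gives the Monte Carlo spread estimate that influence maximization algorithms optimize.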
Influence Analytics in Graphs, Fig. 2 Example of a propagation process in the linear threshold model. Black nodes, black nodes with white boundary, and white nodes are active, activated in the current step, and inactive, respectively. Real numbers on arcs denote their weights, and ones on nodes denote their thresholds. The top-left vertex is initially activated in Step 0, and then the top-middle, bottom-middle, and bottom-left vertices are activated in Steps 1, 2, and 3, respectively.
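The linear threshold dynamics of Fig. 2 admit a similarly small sketch (illustrative; the in-arc lists, weights, and thresholds below are invented): a node activates once the total weight of its active in-neighbors reaches its threshold.

```python
def linear_threshold(in_arcs, threshold, seeds):
    """Linear threshold (LT) model.

    in_arcs: node -> list of (in_neighbor, weight).
    threshold: node -> activation threshold.
    A node becomes active when the summed weight of its active
    in-neighbors is at least its threshold.
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, incoming in in_arcs.items():
            if v in active:
                continue
            w = sum(weight for u, weight in incoming if u in active)
            if w >= threshold[v]:
                active.add(v)
                changed = True
    return active

in_arcs = {"b": [("a", 0.7)], "c": [("a", 0.3), ("b", 0.4)]}
threshold = {"b": 0.5, "c": 0.6}
print(sorted(linear_threshold(in_arcs, threshold, {"a"})))  # ['a', 'b', 'c']
```

Unlike the IC model, this process is deterministic once the thresholds are fixed; randomness in the LT model comes from drawing the thresholds uniformly at random.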
of the said likelihood. A subtle point is that there may be many different ways to partition an actual action log into bins, and different partitions may lead to different values for the likelihood. The question of finding the “optimal” partitioning is not addressed by them. One inherent challenge with this approach is that the algorithm can take considerable time to learn model parameters on large networks and action logs.

Goyal et al. (2010) followed a simple frequentist approach and used a maximum likelihood estimator (MLE) for learning the edge weights under the generalized threshold model, a generalization of the LT model. In addition to a framework for learning model parameters, they show that their framework can handle the case where influence probabilities change (specifically, decay) over time; indeed, in practice, one can expect influence to decay over time. Using this, they show that in addition to predicting user activation, they can also predict the most likely time of activation of users. They also propose notions of user and action influenceability, together with ways of measuring them from an action log, and show that for users with higher influenceability, both activation prediction and prediction of the most likely activation time can be done more accurately. The learning algorithm proposed in Goyal et al. (2010) scales to large networks and action logs: it makes just one pass over the action log.

There has been significant subsequent work on learning influence models from given action propagation data, including improvements and extensions, which are omitted for lack of space.

Opinion Formation

Different from influence propagation models such as the IC and LT models, where nodes are in a binary state of being active or inactive, in opinion formation models the state of a node v ∈ V is a real value z_v which captures the opinion of the node. Typically, the opinion values range over the interval [0, 1] (sometimes [−1, 1]), where z_v = 1 signifies a strong positive opinion, while z_v = 0 signifies a strong negative opinion.

Similar to the influence propagation models, the underlying principle behind opinion formation models is that the state (opinion) of a node is determined by the states of the nodes in its neighborhood. Following the intuition that the opinions of individuals in real life are influenced by the opinions in their social circle, opinion formation models assume a propagation of opinions in the network. However, instead of activating the nodes, the opinion formation process adjusts the nodes’ opinions according to a specific model. A variety of opinion formation processes have been proposed that aim to model different aspects of opinion dynamics (Degroot 1974; Hegselmann and Krause 2002; Castellano et al. 2009), such as consensus (convergence to a single opinion), polarization (clustering of opinions around two extreme positions), or plurality (existence of multiple clusters of opinions). This article focuses on the algorithmic challenges that arise in these processes. More specifically, it focuses on the problem of opinion maximization, where the goal is to maximize (say) the positive opinion in the network.

The Friedkin and Johnsen Model

A widely adopted opinion formation model is the model of Friedkin and Johnsen (1990). In this model, the network is undirected, and every node v ∈ V has an internal opinion value s_v ∈ [0, 1] and an expressed opinion value z_v ∈ [0, 1]. The internal opinion corresponds to the prior beliefs of the individual, and it is fixed and unchanged throughout the opinion formation process. The expressed opinion is malleable, and its value is the result of the influences exerted on the node by its neighbors in the graph.

Mathematically, the expressed opinion is defined as follows:

    z_v = (s_v + Σ_{u ∈ N_v} z_u) / (1 + |N_v|),    (1)

where N_v = {w ∈ V : vw ∈ E} is the set of neighbors of v. Equation (1) intuitively means that the expressed opinion of a node is determined by taking the average over its own internal opinion and the expressed opinions of its neighbors.
The expressed opinions can be computed by solving a system of linear equations. Let s be the vector of internal opinions and z the vector of expressed opinions. Then

    z = (I + L)^{-1} s,    (2)

where L is the Laplacian matrix of the adjacency matrix of the social graph.

The expressed opinions can also be viewed as the output of a dynamic system which iteratively updates the opinion values. The opinion formation model proceeds as follows:

• In step 0, initialize z_v^0 = s_v for each node v ∈ V.
• In step t ≥ 1, all nodes update their expressed opinions as follows:

      z_v^t = (s_v + Σ_{u ∈ N_v} z_u^{t−1}) / (1 + |N_v|)

• Repeat the process until all expressed opinion values converge.

There are several interpretations of the Friedkin and Johnsen model. Bindel et al. (2015) showed that the opinion values are the result of the selfish behavior of nodes, where a node v tries to minimize its personal cost of adopting opinion value z_v, defined as

    C(z_v) = (z_v − s_v)^2 + Σ_{u ∈ N_v} (z_v − z_u)^2

The first term in the cost function is the cost incurred to node v for compromising its own internal beliefs. The second term is the cost incurred to node v for nonconforming with the opinions of its social neighborhood.

A different interpretation is offered by Gionis et al. (2013), who model the opinion formation process as a random walk on a graph with absorbing nodes. Intuitively, the social network is viewed as an electrical network, where edges are resistances. Each node is connected to its own voltage source, with voltage equal to the internal opinion value of the node (in the absorbing random walk, these are absorbing nodes). The voltage value at each node is equal to the expressed opinion of the node.

Opinion Maximization
In the opinion maximization problem (Gionis et al. 2013), the goal is to maximize the sum of the opinion values in the network. This sum captures the average opinion in the network, which should be as favorable as possible. There are two ways to affect the average opinion in the network:

I. Convince nodes to change their internal opinions to be as positive as possible (set s_v = 1). In the electrical network interpretation, this corresponds to setting the voltage of the power source of v to +1.
II. Convince nodes to adopt a positive expressed opinion (set z_v = 1). In this case the expressed opinion value remains fixed, and it is not iteratively updated. In the electrical network interpretation, this corresponds to turning node v into a voltage source with voltage +1.

The goal is to find a small set of nodes S ⊆ V such that changing their opinions maximizes the sum of expressed opinion values. More formally, let G denote the social network. Let g(z; G, s) = Σ_{v ∈ V} z_v be the sum of opinion values, and let g(z; S, G, s) be the sum of opinion values after changing the opinions of the nodes in S, by setting either their internal or expressed opinion to 1. The goal is to find a set S of k nodes so as to maximize g(z; S, G, s).

In Case I, the problem can be solved optimally in polynomial time by simply selecting S to be the nodes with the lowest internal opinion value. This follows from Eq. (2) and the fact that g(z; S, G, s) = Σ_{v ∈ V} s_v (Gionis et al. 2013).

In Case II, the problem is NP-hard. Similar to influence maximization, the value of g as a
function of the set S is monotone and submodular, and thus, the greedy algorithm provides a (1 − 1/e)-approximation to the optimal solution.

Future Directions

Although influence maximization and opinion maximization have been intensively studied, there are still many challenges that should be addressed.

The greedy method achieves a (1 − 1/e)-approximation to the optimal solution both for the influence maximization and opinion maximization problems. However, empirical performance seems to be much better than this theoretical guarantee, and this gap should be filled.

The main focus of the influence maximization problem is to maximize the expected number of activated nodes. However, maximizing the number of activated nodes in less propagated scenarios is also important. To address this problem, Ohsaka and Yoshida (2017) exploited the notion of conditional value at risk (CVaR) from the field of financial risk measurement, which is the expected number of activated nodes in the worst α-fraction of the scenarios, where α ∈ (0, 1) is a parameter specified by a user. They proposed a polynomial-time algorithm that produces a portfolio over initial seed sets with a provable guarantee on its CVaR. Optimizing other measures on probability distributions, such as variance, will be interesting.

Information propagation models require some parameters on arcs (edge probabilities or edge weights), which are typically unknown for real networks. As previously discussed, approaches for learning these parameters have been developed. However, there may be noise in the learned parameters, on account of the action log being noisy or because of the inherent uncertainty of the learning problem. Hence, it is natural to introduce intervals on those parameters and seek an initial seed set that has a good approximation guarantee no matter how the parameters are chosen from the associated intervals. This problem is called robust influence maximization, and some positive and negative results are obtained by Chen et al. (2016) and He and Kempe (2016). Tightening the gap between them is an interesting problem.

Influence and opinion maximization can be considered together, in a model where nodes carry an opinion value and the goal is to maximize the overall opinion of the activated nodes (e.g., see the work of Zhang et al. 2013). This formulation opens the possibility of combining techniques from both problems to design better models and algorithms.

The idea of selecting a few nodes to affect node opinions has also been applied to tasks other than opinion maximization. For example, Matakos et al. (2017) applied this idea for reducing polarization, while Golshan and Terzi (2017) applied it for minimizing tension in a related model. It would be interesting to consider other applications of this idea in different settings.

Cross-References

Big Data Analysis for Social Good
Big Data in Social Networks
Link Analytics in Graphs

References

Bindel D, Kleinberg JM, Oren S (2015) How bad is forming your own opinion? Games Econ Behav 92:248–265
Borgs C, Brautbar M, Chayes J, Lucier B (2014) Maximizing social influence in nearly optimal time. In: Proceedings of the 25th annual ACM-SIAM symposium on discrete algorithms (SODA), pp 946–957
Castellano C, Fortunato S, Loreto V (2009) Statistical physics of social dynamics. Rev Mod Phys 81(2):591–646
Chen W, Lakshmanan LVS, Castillo C (2013) Information and influence propagation in social networks. Synthesis lectures on data management. Morgan & Claypool Publishers, San Rafael
Chen W, Lin T, Tan Z, Zhao M, Zhou X (2016) Robust influence maximization. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 795–804
Degroot MH (1974) Reaching a consensus. J Am Stat Assoc 69(345):118–121
In-Memory Transactions

perform when subsequent accesses are sequential because they benefit from spatial and temporal locality.

Hash-Based Indexing
Hash-based indexing has been proposed for data storage in main memory transaction processing systems such as Hekaton (Diaconu et al. 2013). Multiple attributes can be indexed; each indexable attribute will have a separate in-memory hash table structure. An in-memory hash table is created by hashing the attribute of every tuple in a table. Every hash bucket is the root of a singly linked list that stores all the tuples that hash to the same value. By tracking individual tuples, this design avoids spurious conflicts during concurrent operations. In addition, using a singly linked list allows for lock-free operations on the hash table. Readers start their search at the hash table bucket and follow pointers to the next data item. Modifying the linked list uses atomic compare-and-swap (CAS) instructions.

Avoiding lock contention and blocking, however, requires lock-free memory allocation and reclamation. A lock-free memory allocator guarantees progress regardless of whether some threads are delayed or even killed and regardless of scheduling policies. Prior work has investigated how memory allocators can offer freedom from deadlocks, gracefully handle asynchronous signals, safely handle priority inversion, and tolerate thread preemption and thread termination (Michael 2004b). Another challenge is safe memory reclamation. In particular, the problem is how to allow memory to be freed in a lock-free manner that guarantees that no thread currently accesses this memory. One solution from prior work keeps the pointers that a thread accesses in a globally visible structure and carefully monitors when a reclamation is safe (Michael 2004a).

Bw-Tree
A tree-based in-memory data structure is the Bw-Tree (Levandoski et al. 2013). The Bw-Tree organizes data in pages and closely resembles the B-link tree data structure (Lehman and Yao 1981). Nodes in the Bw-Tree contain key and pointer pairs, sorted by key, to direct searches to lower levels of the Bw-Tree. In addition, all nodes contain a side link pointer that points to the immediate right sibling on the same level of the tree. The Bw-Tree thus offers logarithmic access time for key lookups and linear access time for queries on a one-dimensional key range. Range queries are particularly efficient as they access contiguous data items. Unlike disk-based structures that impose a pre-defined page size, pages in a Bw-Tree are elastic: they can grow to arbitrarily large sizes until split by a system management operation.

The main distinction from disk-oriented tree structures is that all page references in the Bw-Tree are logical references to a page identifier. This level of indirection allows readers to access the data structure in a lock-free manner. An in-memory hash table, called the mapping table, maintains the current mapping of page identifiers to physical memory locations. The Bw-Tree never updates in place (i.e., it never modifies the contents of a page). Instead, updates in a Bw-Tree create delta records that are prepended to existing pages and describe the update. The memory address of the delta record is atomically installed in the mapping table using the atomic compare-and-swap (CAS) instruction. Hence, a concurrent read request that looks up the physical location of a logical page identifier in the mapping table will either see the new page address (that contains the delta record) or the old page address (that skips the delta record), but it will never see an inconsistent state.

Masstree
Another tree-based in-memory data structure is Masstree (Mao et al. 2012). Masstree combines properties of a trie and a B+-tree in one data structure. The Silo transaction processing engine uses a structure similar to Masstree for data storage (Tu et al. 2013). Masstree tracks individual tuples and carefully coordinates writes for high concurrency.

Masstree consists of multiple trees arranged in layers. Each layer is indexed by a different 8-byte slice of the key. Therefore, layer 0 has a single tree that is indexed by bytes 0–7 of the key; trees in layer 1, the next deeper layer, are indexed by bytes 8–15; trees in layer 2 by bytes 16–23; and so forth. Each tree contains interior nodes and border (leaf) nodes. Interior nodes form the trie, whereas border nodes store values (akin to a B+-tree) or pointers to trees in deeper layers. Masstree creates additional layers as needed when an insertion cannot use an existing tree.

Masstree is designed for highly concurrent accesses. Instead of a lock, each node in the Masstree structure has a version counter. A writer increments the version counter of a node to denote an in-progress update and increments the version number again after the update has been installed. Readers coordinate with writers by checking the version of a node before a read and comparing this number to the version after the operation. If the two version numbers differ or a version number is an odd number, the reader may have observed an inconsistent state and will retry. Therefore, conflicting requests in Masstree may restart, but readers will never write globally accessible shared memory.

Concurrency Control Algorithms
In-memory concurrency control algorithms guarantee the atomicity and isolation of in-memory transactions. The goal is serializability, that is, to allow transactions to execute concurrently while appearing to the user that they run in isolation. If the concurrency control algorithm detects an operation that violates serializability, such as two transactions updating the same item at the same time, it can delay or abort a transaction. The challenge is detecting and resolving any conflicts efficiently.

A widely used concurrency control protocol is single-version two-phase locking (Bernstein et al. 1987). In this protocol, transactions acquire read or write locks on a tuple, perform the operation, and release the locks at the end of the transaction. Isolation levels lower than serializable permit a transaction to release read locks earlier.

Single-version two-phase locking poses three problems for in-memory transactions. First, writers block readers in single-version two-phase locking. It is counterintuitive to many users why a transaction that only reads should wait for update transactions, especially if the query does not need to access the latest data. In addition, blocking introduces the possibility of deadlocks where two transactions wait on each other. This requires the system either to perform expensive deadlock detection to abort a transaction that participates in a wait-for cycle or to burden the user with specifying a timeout parameter after which the transaction will be restarted. Finally, single-version two-phase locking triggers additional memory writes to the locks: a transaction needs to perform twice as many writes to locks (one write to acquire, one write to release) as the number of records that it will access.

For these reasons, in-memory transaction processing research has looked for alternatives to single-version two-phase locking for in-memory transactions. Concurrency control algorithms for in-memory transactions can be classified into partitionable and non-partitionable categories.

Concurrency Control for Partitionable Workloads
Many transaction processing workloads exhibit a fairly predictable access pattern that allows a transaction processing system to partition the database such that most transactions only access a single partition. This access predictability is often because of the hierarchical structure of database entities, such as bank branches and accounts, customers and orders, or suppliers and parts.

The H-Store prototype (Kallman et al. 2008) abandons complex concurrency control protocols and just executes transactions serially on different database partitions. Multi-partition transactions, however, become more expensive as they require coordination across partitions, even if these partitions are assigned to different threads in the same node.

The DORA prototype (Pandis et al. 2010) proposes a shared-everything design that assigns data to threads. Transactions are decomposed into subtransactions that form a transaction flow graph. DORA then carefully routes subtransactions to the appropriate partition for execution to maintain serializability.
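Masstree's version-counter coordination described above follows a seqlock-like pattern. The Python sketch below is purely conceptual: Python offers no real atomic instructions or memory fences, so this only illustrates the protocol's logic (an odd version means an update is in progress; a reader retries until it sees the same even version before and after its read).

```python
class VersionedNode:
    """Conceptual sketch of version-counter coordination (not truly atomic)."""

    def __init__(self, value):
        self.version = 0          # even: stable, odd: update in progress
        self.value = value

    def write(self, new_value):
        self.version += 1         # becomes odd: concurrent readers retry
        self.value = new_value
        self.version += 1         # becomes even: update installed

    def read(self):
        while True:
            before = self.version
            if before % 2 == 1:   # update in progress, spin and retry
                continue
            value = self.value
            if self.version == before:  # version unchanged: snapshot is consistent
                return value

node = VersionedNode("old")
node.write("new")
print(node.read())  # new
```

Note how the reader never writes shared state: only the writer touches the version counter, which is exactly the property the entry attributes to Masstree's readers.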
frequency of checkpointing. Early lock release, consolidated buffer allocation, and asynchronous I/O move the latency of hardening log frames out of the critical path (Johnson et al. 2010). The checkpointing frequency is a trade-off between steady-state performance and recovery time: frequent checkpointing takes CPU cycles away from transaction processing, while checkpointing infrequently means substantial recovery work on failure.

Other Research Directions

In-memory transaction processing is a very active and evolving research field. Other research efforts improve the performance of in-memory transaction processing in the presence of heterogeneous workloads (Kim et al. 2016) and hotspots (Ren et al. 2016; Wang and Kimura 2016) or take advantage of determinism in transaction execution (Thomson et al. 2012). Another research direction considers properties of the serialization graph to enforce serializability (Wang et al. 2017) and reduce false aborts (Yuan et al. 2016). Multi-versioning has been shown to ameliorate lock contention through an indirection table (Sadoghi et al. 2014). Finally, new logging and recovery protocols promise to leverage byte-addressable non-volatile memory for in-memory transaction durability (Arulraj et al. 2016; Kimura 2015).

References

Arulraj J, Perron M, Pavlo A (2016) Write-behind logging. PVLDB 10(4):337–348
Bernstein PA, Hadzilacos V, Goodman N (1987) Concurrency control and recovery in database systems. Addison-Wesley, Reading
Curino C, Zhang Y, Jones EPC, Madden S (2010) Schism: a workload-driven approach to database replication and partitioning. PVLDB 3(1):48–57
Diaconu C, Freedman C, Ismert E, Larson PA, Mittal P, Stonecipher R, Verma N, Zwilling M (2013) Hekaton: SQL Server's memory-optimized OLTP engine. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data (SIGMOD'13). ACM, New York, pp 1243–1254. https://doi.org/10.1145/2463676.2463710
Harizopoulos S, Abadi DJ, Madden S, Stonebraker M (2008) OLTP through the looking glass, and what we found there. In: SIGMOD'08: proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, New York, pp 981–992. https://doi.org/10.1145/1376616.1376713
Johnson R, Pandis I, Stoica R, Athanassoulis M, Ailamaki A (2010) Aether: a scalable approach to logging. Proc VLDB Endow 3(1–2):681–692. https://doi.org/10.14778/1920841.1920928
Kallman R, Kimura H, Natkins J, Pavlo A, Rasin A, Zdonik S, Jones EPC, Madden S, Stonebraker M, Zhang Y, Hugg J, Abadi DJ (2008) H-Store: a high-performance, distributed main memory transaction processing system. Proc VLDB Endow 1(2):1496–1499. https://doi.org/10.1145/1454159.1454211
Kim K, Wang T, Johnson R, Pandis I (2016) ERMIA: fast memory-optimized database system for heterogeneous workloads. In: Proceedings of the 2016 international conference on management of data (SIGMOD'16). ACM, New York, pp 1675–1687. https://doi.org/10.1145/2882903.2882905
Kimura H (2015) FOEDUS: OLTP engine for a thousand cores and NVRAM. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data (SIGMOD'15). ACM, New York, pp 691–706. https://doi.org/10.1145/2723372.2746480
Kung HT, Robinson JT (1981) On optimistic methods for concurrency control. ACM Trans Database Syst 6(2):213–226. https://doi.org/10.1145/319566.319567
Larson P, Blanas S, Diaconu C, Freedman C, Patel JM, Zwilling M (2011) High-performance concurrency control mechanisms for main-memory databases. PVLDB 5(4):298–309
Lehman PL, Yao SB (1981) Efficient locking for concurrent operations on B-trees. ACM Trans Database Syst 6(4):650–670. https://doi.org/10.1145/319628.319663
Levandoski JJ, Lomet DB, Sengupta S (2013) The Bw-Tree: a B-tree for new hardware platforms. In: 2013 IEEE 29th international conference on data engineering (ICDE), pp 302–313. https://doi.org/10.1109/ICDE.2013.6544834
Mao Y, Kohler E, Morris RT (2012) Cache craftiness for fast multicore key-value storage. In: Proceedings of the 7th ACM European conference on computer systems (EuroSys'12). ACM, New York, pp 183–196. https://doi.org/10.1145/2168836.2168855
Michael MM (2004a) Hazard pointers: safe memory reclamation for lock-free objects. IEEE Trans Parallel Distrib Syst 15(6):491–504. https://doi.org/10.1109/TPDS.2004.8
Michael MM (2004b) Scalable lock-free dynamic memory allocation. SIGPLAN Not 39(6):35–46. https://doi.org/10.1145/996893.996848
Mohan C, Haderle D, Lindsay B, Pirahesh H, Schwarz P (1992) ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Trans Database Syst 17(1):94–162. https://doi.org/10.1145/128765.128770
Pandis I, Johnson R, Hardavellas N, Ailamaki A (2010) Data-oriented transaction execution. PVLDB 3(1):928–939
Integration-Oriented Ontology
Integration-Oriented Ontology, Fig. 1 Example of query execution in the Semantic Web
common consensus when sharing data, such as the RDF Data Cube Vocabulary or the Data Catalog Vocabulary.

Data providers make available their datasets via SPARQL endpoints that implement a protocol where, given a SPARQL query, a set of triples is obtained, properly annotated with respect to an ontology. Hence, the data analysts' queries must separately fetch the sets of triples of interest from each endpoint and integrate the results. Furthermore, such settings assume that both TBox and ABox are materialized in the ontology. Figure 1 depicts the Semantic Web architecture, where the data analyst is directly responsible for first decomposing the query and issuing the different pieces to several homogeneous SPARQL endpoints and afterward collecting all the results and composing the single required answer.

The alternative architecture for data integration is that of a federated schema that serves as a unique point of entry, which is responsible for mediating the queries to the respective sources. Such federated information systems are commonly formalized as a triple ⟨G, S, M⟩, such that G is the global schema, S is the source schema, and M are the schema mappings between G and S; see Lenzerini (2002). With such a definition, two major problems arise, namely, how to design mappings capable of expressing complex relationships between G and S and how to translate and compose queries posed over G into queries over S. The former led to the two fundamental approaches for schema mappings, global-as-view (GAV) and local-as-view (LAV), while the latter has generically been formulated as the problem of rewriting queries using views; see Halevy (2001) for a survey.

Still, providing an integrated view over a heterogeneous set of data sources is a challenging problem, commonly referred to as the data variety challenge, that traditional data integration techniques fail to address in Big Data contexts given the heterogeneity of formats in data sources; see Horrocks et al. (2016). Current attempts, namely, ontology-based data access (OBDA), adopt ontologies to represent G, as depicted in Fig. 2. In such a case, unlike in the Semantic Web, the TBox resides in the ontology as sets of triples, but the ABox remains in the data sources, in its source format. The most prominent OBDA approaches are based on generic reasoning in description logics (DLs) for query rewriting (see Poggi et al. 2008). They present data analysts with a virtual knowledge graph (i.e., G), corresponding to the TBox of the domain of interest, in an OWL 2 QL ontology; see Grau et al. (2012). Such a TBox is built upon the DL-Lite family of DLs, which allows representing conceptual models with polynomial
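The GAV side of the GAV/LAV distinction can be illustrated with a minimal unfolding sketch. Everything here is an illustrative assumption — the relation name, the source queries, and the `unfold` helper are hypothetical and not part of any cited system:

```python
# Hypothetical global relation and source queries, for illustration only.
# GAV: each relation of the global schema G is defined as a view over the
# source schema S (here, a union of SQL queries against two sources).
GAV_MAPPINGS = {
    "Person": [
        "SELECT id, name FROM src_a.users",
        "SELECT pid, full_name FROM src_b.people",
    ],
}

def unfold(global_relation):
    """GAV unfolding: rewrite a query over G by replacing the global
    relation with its source-level definition."""
    return " UNION ".join(GAV_MAPPINGS[global_relation])
```

Under LAV, the direction is reversed — each source is described as a view over G — which is why answering queries then requires rewriting queries using views rather than simple unfolding.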
Grind (see Hansen and Lutz 2017), or Clipper (see Eiter et al. 2012). Ontop provides an end-to-end solution for OBDA, which is integrated with existing Semantic Web tools such as Protégé or Sesame. It assumes a DL-Lite ontology for G, and while processing an ontology-mediated query, it applies several optimization steps. Oppositely, Grind assumes that G is formulated using the EL DL, which has more expressivity than DL-Lite but does not guarantee that a first-order logic rewriting exists for the queries. Clipper relies on Horn-SHIQ ontologies, a DL that extends DL-Lite and EL. All these works assume relational sources and translate first-order logic expressions into SQL queries. Nonetheless, recent works present approaches to adopt NoSQL stores as underlying sources, such as MongoDB; see Botoeva et al. (2016).

Regarding vocabulary-based approaches, we can find an RDF metamodel to be instantiated into G and S for the purpose of accommodating schema evolution in the sources, using RDF named graphs to implement LAV mappings; see Nadal et al. (2018). Such an approach simplifies the definition of LAV mappings for nontechnical users, which has commonly been characterized as a difficult task. Query answering, however, is restricted to only traversals over G. Another proposal is GEMMS, a metadata management system for Data Lakes that automatically extracts metadata from the sources to annotate them with respect to the proposed metamodel; see Quix et al. (2016).

Future Directions for Research

Despite the fact that much work has been done on the foundations of OBDA and description logics, two factors complicate the problem in the context of Big Data, namely, size and lack of control over the data sources. Given this lack of control and the heterogeneous nature of data sources, it is necessary to further study the kinds of mappings and query languages that can be devised for different data models. Nevertheless, this is not enough; new governance mechanisms must be put in place, giving rise to Semantic Data Lakes, understood as huge repositories of disparate files that require advanced techniques to annotate their content and efficiently facilitate querying. Related to this efficiency, more attention must be paid to query optimization and execution, especially when combining independent sources and data streams. Finally, from another point of view, given that the goal of integration-oriented ontologies is to enable access to nontechnical users, it is necessary to empower them by providing clear and traceable query plans that, for example, trace provenance and data quality metrics throughout the process.

Cross-References

Data Integration
Federated RDF Query Processing
Ontologies for Big Data
Schema Mapping

References

Artale A, Kontchakov R, Wolter F, Zakharyaschev M (2013) Temporal description logic for ontology-based data access. In: IJCAI 2013, Proceedings of the 23rd international joint conference on artificial intelligence, Beijing, 3–9 Aug 2013, pp 711–717
Bizer C, Heath T, Berners-Lee T (2009) Linked data – the story so far. Int J Semant Web Inf Syst 5(3):1–22
Botoeva E, Calvanese D, Cogrel B, Rezk M, Xiao G (2016) OBDA beyond relational DBs: a study for MongoDB. In: 29th international workshop on description logics
Calvanese D, De Giacomo G, Lembo D, Lenzerini M, Rosati R (2007) Tractable reasoning and efficient query answering in description logics: the DL-Lite family. J Autom Reason 39(3):385–429
Calvanese D, Cogrel B, Komla-Ebri S, Kontchakov R, Lanti D, Rezk M, Rodriguez-Muro M, Xiao G (2017) Ontop: answering SPARQL queries over relational databases. Semantic Web 8(3):471–487
Cyganiak R, Das S, Sundara S (2012) R2RML: RDB to RDF mapping language. W3C recommendation, W3C. http://www.w3.org/TR/2012/REC-r2rml-20120927/
Eiter T, Ortiz M, Simkus M, Tran T, Xiao G (2012) Query rewriting for Horn-SHIQ plus rules. In: Proceedings of the twenty-sixth AAAI conference on artificial intelligence, Toronto, 22–26 July 2012
Feier C, Kuusisto A, Lutz C (2017) Rewritability in monadic disjunctive datalog, MMSNP, and expressive description logics (invited talk). In: 20th international conference on database theory, ICDT 2017, Venice, 21–24 Mar 2017, pp 1:1–1:17
1044 Integrity
Gottlob G, Orsi G, Pieris A (2011) Ontological queries: rewriting and optimization. In: Proceedings of the 27th international conference on data engineering, ICDE 2011, 11–16 Apr 2011, Hannover, pp 2–13
Grau BC, Fokoue A, Motik B, Wu Z, Horrocks I (2012) OWL 2 web ontology language profiles, 2nd edn. W3C recommendation, W3C. http://www.w3.org/TR/2012/REC-owl2-profiles-20121211
Gruber T (2009) Ontology. In: Tamer Özsu M, Liu L (eds) Encyclopedia of database systems. Springer, New York/London, pp 1963–1965
Halevy AY (2001) Answering queries using views: a survey. VLDB J 10(4):270–294
Hansen P, Lutz C (2017) Computing FO-rewritings in EL in practice: from atomic to conjunctive queries. In: 16th international semantic web conference (ISWC), pp 347–363
Horrocks I, Giese M, Kharlamov E, Waaler A (2016) Using semantic technology to tame the data variety challenge. IEEE Internet Comput 20(6):62–66
Jiménez-Ruiz E, Kharlamov E, Zheleznyakov D, Horrocks I, Pinkel C, Skjæveland MG, Thorstensen E, Mora J (2015) BootOX: practical mapping of RDBs to OWL 2. In: The semantic web – ISWC 2015 – 14th international semantic web conference, Bethlehem, 11–15 Oct 2015, Proceedings, Part II, pp 113–132
Keet CM, Ongoma EAN (2015) Temporal attributes: status and subsumption. In: 11th Asia-Pacific conference on conceptual modelling, APCCM 2015, Sydney, pp 61–70
Lenzerini M (2002) Data integration: a theoretical perspective. In: 21st ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (PODS), pp 233–246
Levy AY, Rajaraman A, Ordille JJ (1996) Querying heterogeneous information sources using source descriptions. In: 22nd international conference on very large data bases (VLDB), pp 251–262
Lutz C, Wolter F, Zakharyaschev M (2008) Temporal description logics: a survey. In: 15th international symposium on temporal representation and reasoning, TIME 2008, Université du Québec à Montréal, 16–18 June 2008, pp 3–14
Nadal S, Romero O, Abelló A, Vassiliadis P, Vansummeren S (2018) An integration-oriented ontology to govern evolution in big data ecosystems. Information Systems, Elsevier. https://doi.org/10.1016/j.is.2018.01.006
Pérez-Urbina H, Motik B, Horrocks I (2009) A comparison of query rewriting techniques for DL-Lite. In: Proceedings of the 22nd international workshop on description logics (DL 2009), Oxford, 27–30 July 2009
Poggi A, Lembo D, Calvanese D, De Giacomo G, Lenzerini M, Rosati R (2008) Linking data to ontologies. J Data Semant 10:133–173
Pottinger R, Halevy AY (2001) MiniCon: a scalable algorithm for answering queries using views. VLDB J 10(2–3):182–198
Quix C, Hai R, Vatov I (2016) GEMMS: a generic and extensible metadata management system for data lakes. In: 28th international conference on advanced information systems engineering (CAiSE), pp 129–136, CAiSE Forum
Varga J, Romero O, Pedersen TB, Thomsen C (2014) Towards next generation BI systems: the analytical metadata challenge. In: 16th international conference on data warehousing and knowledge discovery (DaWaK), pp 89–101
Wood D, Cyganiak R, Lanthaler M (2014) RDF 1.1 concepts and abstract syntax. W3C recommendation, W3C. http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225

Integrity

Security and Privacy in Big Data Environment

Intelligent Disk

Active Storage

Interactive Visualization

Big Data Visualization Tools

Introduction to Stream Processing Algorithms

Nicoló Rivetti
Faculty of Industrial Engineering and Management, Technion – Israel Institute of Technology, Haifa, Israel

Definitions

Data streaming focuses on estimating functions over streams, which is an important task in data-intensive applications. It aims at approximating functions or statistical measures over (distributed) massive stream(s) in poly-logarithmic space over the size and/or the domain
size (i.e., number of distinct items) of the stream(s).

Overview

Many different domains are concerned by the analysis of streams, including machine learning, data mining, databases, information retrieval, and network monitoring. In all these fields, it is necessary to quickly and precisely process a huge amount of data. This also applies to any other data issued from distributed applications such as social networks or sensor networks. Given these settings, the real-time analysis of large streams, relying on full-space algorithms, is often not feasible. Two main approaches exist to monitor massive data streams in real time with a small amount of resources: sampling and summaries.

Computing information-theoretic measures in the data stream model is challenging essentially because one needs to process a huge amount of data sequentially, on the fly, and by using very little storage with respect to the size of the stream. In addition, the analysis must be robust over time to detect any sudden change in the observed streams (which may be an indicator of some malicious behavior).

Data Streaming Models

Following Muthukrishnan (2005), consider a stream σ = ⟨t_1, …, t_j, …, t_m⟩, where each item t is drawn from a universe of size n. In more detail, t denotes the value of the item in the set [n] = {1, …, n}, while the subscript j denotes the position in the stream in the index sequence [m] = {1, …, m}. Then m is the size (or number of items) of the stream σ, and n is the number of distinct items. Notice that, in general, both n and m are large (e.g., network traffic) and unknown. The stream σ implicitly defines a frequency vector f = ⟨f_1, …, f_t, …, f_n⟩, where f_t is the frequency of item t in the stream σ, i.e., the number of occurrences of item t in σ. The empirical probability distribution is defined as p = ⟨p_1, …, p_t, …, p_n⟩, where p_t = f_t/m is the empirical probability of occurrence of item t in the stream σ. Our aim is to compute some function φ on σ. In this model we are allowed only to access the sequence in its given order (no random access), and the function must be computed in a single pass (online). The challenge is to achieve this computation in sublinear space and time with respect to m and n. In other words, the space and time complexities should be at most o(min{m, n}), but the goal is O(log(mn)). However, since reaching the latter bound is quite hard (if not impossible), one deals in general with complexities such as polylog(m, n). In general, φ(x) is said to be polylog(x) if ∃C > 0 such that φ(x) = O(log^C x). By abusing the notation, we then define polylog(m, n) as ∃C, C′ > 0 such that φ(m, n) = O(log^C m + log^{C′} n).

Sliding window model — In many applications, there is a greater interest in considering only the "recently observed" items than the whole semi-infinite stream. This gives rise to the sliding window model (Datar et al. 2002): items arrive continuously and expire after exactly M steps. The relevant data is then the set of items that arrived in the last M steps, i.e., the active window. A step can either be defined as a sample arrival, in which case the active window contains exactly M items and the window is said to be count-based; otherwise, the step is defined as a time tick, either logical or physical, and the window is said to be time-based. More formally, let us consider the count-based sliding window and recall that, by definition, t_j is the item arriving at step j. In other words, the stream at step m is the sequence ⟨t_1, …, t_{m−M}, t_{m−M+1}, …, t_m⟩. Then, the desired function has to be evaluated over the stream σ(M) = ⟨t_{m−M+1}, …, t_m⟩. Notice that if M = m, the sliding window model boils down to the classical data streaming model. Two related models may be named as well: the landmark window and jumping window models. In the former, a step (landmark) is chosen, and the function is evaluated on the stream following the landmark. In the jumping window model, the sliding window is split into sub-windows and moves forward only when the most recent sub-window is full (i.e., it jumps forward).
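As an exact, full-space baseline, the count-based sliding window can be sketched as below: the function φ is evaluated over σ(M), the last M items only. The class name and the use of a deque are illustrative assumptions; the point of the sliding-window literature is precisely to approximate what this evaluates in less than O(M) space.

```python
from collections import deque

class CountBasedWindow:
    """Exact count-based sliding window: keeps the last M items, so any
    function phi can be evaluated over the active window (O(M) space)."""

    def __init__(self, M):
        self.window = deque(maxlen=M)  # an item expires after exactly M steps

    def update(self, item):
        self.window.append(item)       # the oldest item is evicted automatically

    def evaluate(self, phi):
        return phi(list(self.window))  # phi over sigma(M) = last M items
```

For instance, with `phi = sum` over a bit stream this is exactly the basic counting problem discussed below.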
In general, solving a problem in the sliding window model is harder than in the classical data streaming model: using little memory (low space complexity) implies some kind of data aggregation. The problem is then how to slide the window forward, i.e., how to remove the updates performed by the items that exited the window and how to introduce new items. For instance, basic counting, i.e., counting the number of 1s in a stream of bits, is a simple problem whose solution is a prerequisite for the efficient maintenance of a variety of more complex statistical aggregates. While this problem is trivial in the data streaming model, it becomes a challenge in the sliding window model (Datar et al. 2002).

Distributed streaming model and functional monitoring — In large distributed systems, it is most likely critical to gather various aggregates over data spread across a large number of nodes. This can be modeled by a set of nodes, each observing a stream of items. These nodes collaborate to evaluate a given function over the global distributed stream. For instance, current network management tools analyze the input streams of a set of routers to detect malicious sources or to extract user behaviors (Anceaume and Busnel 2014; Ganguly et al. 2007).

The distributed streaming model (Gibbons and Tirthapura 2001) considers a set of nodes making observations and cooperating, through a coordinator, to compute a function over the union of their observations. More formally, consider a set of s nodes S_1, …, S_i, …, S_s that do not communicate directly among themselves but only with a coordinator (i.e., a central node or base station). The communication channel can be either one-way or bidirectional. The distributed stream σ is the union of the sub-streams σ_1, …, σ_i, …, σ_s locally read by the s nodes, i.e., σ = σ_1 ∪ … ∪ σ_s, where σ_i ∪ σ_j is a stream containing all the items of σ_i and σ_j with any interleaving. The generic stream σ_i is the input to the generic node S_i. Then we have m = m_1 + … + m_s, where m_i is the size (or number of items) of the i-th sub-stream σ_i of the distributed stream σ. Each sub-stream σ_i implicitly defines a frequency vector f_i = ⟨f_{1,i}, …, f_{t,i}, …, f_{n,i}⟩, where f_{t,i} is the frequency of item t in the sub-stream σ_i, i.e., the number of occurrences of item t in the sub-stream σ_i. Then, we also have f_t = f_{t,1} + … + f_{t,s}. The empirical probability distribution is defined as p_i = ⟨p_{1,i}, …, p_{t,i}, …, p_{n,i}⟩, where p_{t,i} = f_{t,i}/m_i is the empirical probability of occurrence of item t in the stream σ_i.

With respect to the classical data streaming model, we want to compute the function φ(σ) = φ(σ_1 ∪ … ∪ σ_s). This model introduces the additional challenge of minimizing the communication complexity (in bits), i.e., the amount of data exchanged between the s nodes and the coordinator.

The distributed functional monitoring problem (Cormode et al. 2011; Cormode 2011) has arisen only in the 2000s. With respect to the distributed streaming model, functional monitoring requires a continuous evaluation of φ(σ) = φ(σ_1 ∪ … ∪ σ_s) at each time instant j ∈ [m]. Restricting φ to be monotonic, it is possible to distinguish two cases: threshold monitoring (i.e., raising an alert if φ(σ) exceeds a threshold τ) and value monitoring (i.e., approximating φ(σ)). It can then be shown (Cormode et al. 2011) that value monitoring can be reduced, with a factor O((1/ε) log R) (where R is the largest value reached by the monitored function), to threshold monitoring. However, this reduction no longer holds in the more general setting of non-monotonic functions (e.g., entropy). Notice that, to minimize the communication complexity and maximize the accuracy, the algorithms designed in this model tend to minimize the communication when nothing relevant happens while reacting as fast as possible in response to outstanding occurrences or patterns.

Building Blocks

Approximation and randomization — To compute φ on σ in these models, most of the research relies on ε or (ε, δ) approximations. An algorithm A that evaluates the stream σ in a single pass (online) is said to be an (ε, δ) additive approximation of the function φ on a stream σ if, for any prefix of size m of the input stream σ, the
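The identity f_t = f_{t,1} + … + f_{t,s} of the distributed streaming model can be sketched as follows, assuming exact (uncompressed) local counters — real protocols instead ship small summaries to bound the communication complexity. Class and function names are hypothetical:

```python
from collections import Counter

class StreamNode:
    """One of the s nodes: observes its sub-stream sigma_i and keeps the
    local frequency vector f_i (exact here, for illustration)."""

    def __init__(self):
        self.freq = Counter()

    def observe(self, item):
        self.freq[item] += 1

def coordinator_merge(nodes):
    """Coordinator view of the global stream: f_t = sum_i f_{t,i},
    obtained with one message per node."""
    total = Counter()
    for node in nodes:
        total.update(node.freq)  # add this node's local frequencies
    return total
```

Sending the whole vector f_i is exactly what the sketches of the next section avoid.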
data streams became of utmost importance in the last decade (e.g., in the context of network monitoring, Big Data, etc.). IP network management, among others, is a paramount field in which the ability to estimate the stream frequency distribution in a single pass (i.e., online) and quickly can bring huge benefits. For instance, it could rapidly detect the presence of anomalies or intrusions by identifying sudden changes in the communication patterns.

Data streaming model — Considering the data streaming model, the Count-Sketch algorithm (Charikar et al. 2002) was an initial result satisfying the following accuracy bound: P[|f̂_t − f_t| ≥ ε‖f_{−t}‖_2] ≤ δ, where f_{−t} represents the frequency vector f without the scalar f_t (i.e., |f_{−t}| = n − 1).

A few years later, a similar algorithm, the Count-Min sketch (Cormode and Muthukrishnan 2005), was presented. The bound on the estimation accuracy of the Count-Min is weaker than the one provided by the Count-Sketch, in particular for skewed distributions. On the other hand, the Count-Min shaves a 1/ε factor from the space complexity and halves the hash functions. The Count-Min consists of a two-dimensional matrix F of size r × c, where r = ⌈log(1/δ)⌉ and c = ⌈e/ε⌉, where e is Euler's constant. Each row is associated with a different two-universal hash function h_i, i ∈ [r], with h_i: [n] → [c]. For each item t_j, it updates each row: ∀i ∈ [r], F[i, h_i(t)] ← F[i, h_i(t)] + 1. That is, the entry value is the sum of the frequencies of all the items mapped to that entry. Since each row has a different collision pattern, upon request of f̂_t, we want to return the entry associated with t minimizing the collisions' impact. In other words, the algorithm returns, as the f_t estimation, the entry associated with t with the lowest value: f̂_t = min_{i∈[r]} {F[i, h_i(t)]}. Listing 1 presents the global behavior of the Count-Min algorithm. The space complexity of this algorithm is O((1/ε) log(1/δ)(log m + log n)) bits, while the update time complexity is O(log(1/δ)). Therefore, the following quality bounds hold: P[|f̂_t − f_t| ≥ ε‖f‖_1] ≤ δ, while f_t ≤ f̂_t is always true.

Sliding window model — There are two extensions of the Count-Min algorithm to the sliding window model. The ECM-Sketch replaces the entries of the Count-Min matrix with instances of the wave data structure (Gibbons and Tirthapura 2014), which provides an ε_wave approximation for the basic counting problem in the sliding window model. A more recent solution (Rivetti et al. 2015a) superseded the ECM-Sketch performances. It has a similar approach, replacing the entries with a novel data structure to keep track of the changes in the frequency distribution of the stream over time.

Applications

Distributed Denial of Service Detection
Network monitoring, as mentioned previously, is one of the classical fields of application given the natural fit of the data streaming models to network traffic. The detection of distributed denial of service (DDoS) can be reduced to the distributed heavy hitters problem, first studied in Manjhi et al. (2005), by making the assumption that the attack is locally detectable by routers. Further solutions relaxed the model but relied on other assumptions such as the existence of a known gap between legit and attack traffic (Zhao et al. 2010) or a priori knowledge of the IP address distribution (Zhao et al. 2006) (i.e., the distributed bag model). Finally, Yi and Zhang (2013) propose a solution in the spirit of the distributed functional monitoring model to minimize the communication cost between the nodes and the coordinator. However, each router maintains at any time the exact frequency of each IP address in its local network stream, i.e., their solution is extremely memory expensive. A recent solution (Anceaume et al. 2015), based on the Count-Min, detects distributed denial of service without relying on any assumption on the stream characteristics.
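The Count-Min update rule (F[i, h_i(t)] ← F[i, h_i(t)] + 1 for every row) and query rule (minimum over the rows) can be sketched in Python as below. The hash family ((a·t + b) mod p) mod c and the fixed seed are implementation choices for the sketch, not part of the original formulation:

```python
import math
import random

class CountMinSketch:
    """Sketch of the Count-Min structure: an r x c matrix F with one
    2-universal hash function per row."""

    def __init__(self, epsilon, delta, seed=1):
        self.r = math.ceil(math.log(1.0 / delta))  # rows: ceil(log 1/delta)
        self.c = math.ceil(math.e / epsilon)       # columns: ceil(e/epsilon)
        self.F = [[0] * self.c for _ in range(self.r)]
        rng = random.Random(seed)
        self.p = (1 << 61) - 1  # a Mersenne prime larger than the universe
        self.h = [(rng.randrange(1, self.p), rng.randrange(self.p))
                  for _ in range(self.r)]

    def _bucket(self, row, t):
        a, b = self.h[row]
        return ((a * t + b) % self.p) % self.c

    def update(self, t, count=1):
        # for each row i: F[i, h_i(t)] += count
        for i in range(self.r):
            self.F[i][self._bucket(i, t)] += count

    def estimate(self, t):
        # f-hat_t = min over rows; never underestimates (f_t <= f-hat_t)
        return min(self.F[i][self._bucket(i, t)] for i in range(self.r))
```

Since every update adds to one cell per row, each estimate is at least the true frequency and exceeds it only on hash collisions, which the minimum over rows makes unlikely.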
Stream Processing Optimization
Data streaming has also been extensively applied to databases, for instance, providing efficient estimation of the join size (Alon et al. 1996; Vengerov et al. 2015). There are many query plan optimizations performed in database systems that can leverage such estimations.

A more recent application field is stream processing (Hirzel et al. 2014) and complex event processing. Load balancing in distributed computing is a well-known problem that has been extensively studied since the 1980s (Cardellini et al. 2002). Two ways to perform load balancing can be identified in stream processing systems (Hirzel et al. 2014): either when placing the operators on the available machines (Carney et al. 2003; Cardellini et al. 2016) or when assigning load to the parallelized instances of an operator O. If O is stateless, then tuples are randomly assigned to instances (random or shuffle grouping). Extending the Count-Min (Cormode and Muthukrishnan 2005), Rivetti et al. (2016a) present an algorithm aimed at reducing the overall tuple completion time in shuffle-grouped stages. Key grouping is another grouping technique used by stream processing frameworks to simplify the development of parallel stateful operators. Through key grouping, a stream of tuples is partitioned into several disjoint sub-streams depending on the values contained in the tuples themselves. Each operator instance, target of one sub-stream, is guaranteed to receive all the tuples containing a specific key value. A common solution to implement key grouping is through hash functions that, however, are known to cause load imbalances on the target operator instances when the input data stream is characterized by a skewed value distribution. Gedik (2014) proposes a solution to this problem leveraging consistent hashing and the lossy counting (Manku and Motwani 2002) algorithm. In Rivetti et al. (2015b) the authors propose an improvement over Gedik (2014) using the space-saving (Metwally et al. 2005) algorithm instead. The basic intuition is that heavy hitters induce most of the unbalance in the load distribution. Then, in both works the authors leverage solutions to the heavy hitter problem to isolate or carefully handle the keys that have a frequency larger than or close to 1/k. On the other hand, Nasir et al. (2015) applied the power of two choices approach to this problem, i.e., relaxing the key grouping constraints by spreading each key on two instances, thus restricting the applicability of their solution but also achieving better balancing.

Caneill et al. (2016) present another work focusing on load balancing from a different angle. In topologies with sequences of stateful operators, and thus key-grouped stages, the likelihood that a tuple is sent to different operator instances located on different machines is high. Since the network is a bottleneck for many stream applications, this behavior significantly degrades their performance. The authors use the space-saving (Metwally et al. 2005) algorithm to identify the heavy hitters and analyze the correlation between keys of subsequent key-grouped stages. This information is then leveraged to build a key-to-instance mapping that improves stream locality for stateful stream processing applications.

The goal of the load-shedding (or admission control) technique (Hirzel et al. 2014) is to drop some of the input in order to keep the input load under the maximum capacity of the system. Without load shedding, the load would pile up, inducing buffer overflows and thrashing. In Rivetti et al. (2016b) the authors provide a novel load-shedding algorithm based on an extension of the Count-Min (Cormode and Muthukrishnan 2005), targeted at stream processing systems.
Vengerov D, Menck AC, Zait M, Chakkappen SP (2015) Join size estimation subject to filter conditions. Proc VLDB Endow 8(12):1530–1541
Yi K, Zhang Q (2013) Optimal tracking of distributed heavy hitters and quantiles. Algorithmica 65:206–223
Zhao Q, Lall A, Ogihara M, Xu J (2010) Global iceberg detection over distributed streams. In: Proceedings of the 26th IEEE international conference on data engineering, ICDE
Zhao Q, Ogihara M, Wang H, Xu J (2006) Finding global icebergs over distributed data sets. In: Proceedings of the 25th ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, PODS

Inverted Index Compression

Giulio Ermanno Pibiri and Rossano Venturini
Department of Computer Science, University of Pisa, Pisa, Italy

Definitions

The data structure at the core of nowadays large-scale search engines, social networks, and storage architectures is the inverted index. Given a collection of documents, consider for each distinct term t appearing in the collection the integer sequence ℓ_t, listing in sorted order all the identifiers of the documents (docIDs in the following) in which the term appears. The sequence ℓ_t is called the inverted list or posting list of the term t. The inverted index is the collection of all such lists. The scope of this entry is to survey the most important encoding algorithms developed for efficient inverted index compression and fast retrieval.

the inverted lists corresponding to the terms of the query set. Efficient list intersection relies on the operation NextGEQ_t(x), which returns the integer z ∈ ℓ_t such that z ≥ x. This primitive is used because it permits skipping over the lists to be intersected.

Because of the many documents indexed by search engines and the stringent performance requirements dictated by the heavy load of user queries, the inverted lists often store several millions (even billions) of integers and must be searched efficiently. In this scenario, compressing the inverted lists of the index appears as a mandatory design phase since it can introduce a twofold advantage over a non-compressed representation: feed faster memory levels with more data in order to speed up the query processing algorithms and reduce the number of storage machines needed to host the whole index. This has the potential of letting a query algorithm work in internal memory, which is faster than the external memory system by several orders of magnitude.

Chapter Notation
All logarithms are binary, i.e., log x = log_2 x, x > 0. Let B(x) represent the binary representation of the integer x and U(x) its unary representation, that is, a run of x zeroes plus a final one: 0^x 1. Given a binary string B, let |B| represent its length in bits. Given two binary strings B_1 and B_2, let B_1 B_2 be their concatenation. We indicate with S(n, u) a sequence of n integers drawn from a universe of size u and with S[i, j] the subsequence starting at position i and including endpoint S[j].
assign the smallest code word as possible in order and the remainder r D x q 2k 1. The
to minimize the number of bits used to represent quotient is encoded in unary, i.e., with U.q/,
the sequence. whereas the remainder is written in binary with k
In particular, since we are dealing with in- bits. Therefore, jRk .x/j D q Ck C1 bits. Clearly,
verted lists that are monotonically increasing by the closer 2k is to the value of x the smaller the
construction, we can subtract from each element value of q: this implies a better compression and
the previous one (the first integer is left as it faster decoding speed.
is), making the sequence be formed by integers Given the parameter k and the constant 2k
greater than zero known as delta-gaps (or just d - that is computed ahead, decoding Rice codes is
gaps). This popular delta-encoding strategy helps simple too: count the number of zeroes up to the
in reducing the number of bits for the codes. Most one, say there are q of these, then multiply 2k
of the literature assumes this sequence form, and, by q, and finally add the remainder, by reading
as a result, compressing such sequences of d - the next k bits. Finally, add one to the computed
gaps is a fundamental problem that has been result.
studied for decades.
Variable-Byte
Elias’ Gamma and Delta The codes described so far are also called bit-
The two codes we now describe have been in- aligned as they do not represent an integer using
troduced by Elias (1975) in the 1960s. Given an a multiple of a fixed number of bits, e.g., a byte.
integer x > 0, .x/ D 0jB.x/j1 B.x/, where The decoding speed can be slowed down by the
jB.x/j D dlog.x C 1/e. Therefore, j .x/j D many operations needed to decode a single inte-
2dlog.x C 1/e 1 bits. Decoding .x/ is simple: ger. This is a reason for preferring byte-aligned or
first count the number of zeroes up to the one, say even word-aligned codes when decoding speed is
there are n of these, then read the following n + 1 bits, and interpret them as x.
The key inefficiency of γ lies in the use of the unary code for the representation of |B(x)| − 1, which may become very large for big integers. The δ code replaces the unary part 0^(|B(x)|−1) with γ(|B(x)|), i.e., δ(x) = γ(|B(x)|) · B(x). Notice that, since we are representing with γ the quantity |B(x)| = ⌈log(x + 1)⌉, which is guaranteed to be greater than zero, δ can represent value zero too. The number of bits required by δ(x) is |γ(|B(x)|)| + |B(x)|, which is 2⌈log(|B(x)| + 1)⌉ + ⌈log(x + 1)⌉ − 1. The decoding of δ codes follows automatically from the one of γ codes.

Golomb-Rice
Rice and Plaunt (1971) introduced a parameter code that can better compress the integers in a sequence if these are highly concentrated around some value. This code is actually a special case of the Golomb code (Golomb 1966), hence its name. The Rice code of x with parameter k, Rk(x), consists in two pieces: the quotient q = ⌊(x − 1)/2^k⌋ and the remainder r = x − q · 2^k − 1. The quotient is encoded in unary and the remainder in binary using k bits, so that Rk(x) takes q + 1 + k bits overall.

Byte-aligned codes are preferable whenever decoding speed is the main concern. Variable-Byte (VByte) (Salomon 2007) is the most popular and simplest byte-aligned code: the binary representation of a nonnegative integer is split into groups of 7 bits which are represented as a sequence of bytes. In particular, the 7 least significant bits of each byte are reserved for the data, whereas the most significant (the 8-th), called the continuation bit, is equal to 1 to signal continuation of the byte sequence. The last byte of the sequence has its 8-th bit set to 0 to signal, instead, the termination of the byte sequence. The main advantage of VByte codes is decoding speed: we just need to read one byte at a time until we find a value smaller than 128. Conversely, the number of bits to encode an integer cannot be less than 8; thus VByte is only suitable for large numbers, and its compression ratio may not be competitive with the one of bit-aligned codes for small integers.

Various enhancements have been proposed in the literature to improve the (sequential) decoding speed of VByte. This line of research focuses on finding a suitable format of control and data streams in order to reduce the probability of a branch misprediction, which leads to higher instruction throughput and helps keep the CPU pipeline fed with useful instructions (Dean 2009), and on exploiting the parallelism of SIMD instructions (single-instruction-multiple-data) of modern processors (Stepanov et al. 2011; Plaisance et al. 2015; Lemire et al. 2018).
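The byte layout just described can be sketched in a few lines of Python. This is a minimal illustration with function names of our own choosing; production decoders add the branch-free and SIMD-based optimizations cited above.

```python
def vbyte_encode(x):
    """Encode a nonnegative integer as a VByte byte sequence.

    The 7 least significant bits of each byte carry data; the 8th
    (most significant) bit is 1 to signal continuation and 0 on the
    last byte of the sequence."""
    out = []
    while x >= 128:
        out.append((x & 0x7F) | 0x80)  # continuation bit set
        x >>= 7
    out.append(x)                      # last byte: continuation bit clear
    return bytes(out)

def vbyte_decode(buf, pos=0):
    """Decode one integer starting at buf[pos]; return (value, new_pos)."""
    x, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        x |= (b & 0x7F) << shift
        if b < 128:                    # termination byte found
            return x, pos
        shift += 7
```

Note that this sketch stores the 7-bit groups from least to most significant; the opposite convention (and an inverted meaning of the continuation bit) also appears in practice.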
Inverted Index Compression 1053
List Compressors
Differently from the compressors introduced in the previous subsection, the compressors we now consider encode a whole integer list, instead of representing each single integer separately. Such compressors often outperform the ones described before for sufficiently long inverted lists because they take advantage of the fact that inverted lists often contain clusters of close docIDs, e.g., runs of consecutive integers, that are far more compressible than highly scattered sequences. The reason for the presence of such clusters is that the indexed documents themselves tend to be clustered, i.e., there are subsets of documents sharing the very same set of terms. Therefore, not surprisingly, list compressors greatly benefit from reordering strategies that focus on reassigning the docIDs in order to form larger clusters of docIDs.

An amazingly simple strategy, yet very effective for Web pages, is to assign identifiers to documents according to the lexicographical order of their URLs (Silvestri 2007). A recent approach has instead adopted a recursive graph bisection algorithm to find a suitable reordering of docIDs (Dhulipala et al. 2016). In this model, the input graph is a bipartite graph in which one set of vertices represents the terms of the index and the other set represents the docIDs. Since a graph bisection identifies a permutation of the docIDs, the goal is the one of finding, at each step of recursion, the bisection of the graph which minimizes the size of the graph compressed using delta-encoding.

Block-Based
A relatively simple approach to improve both compression ratio and decoding speed is to encode a block of contiguous integers. This line of work finds its origin in the so-called frame of reference (FOR) (Goldstein et al. 1998).

Once the sequence has been partitioned into blocks (of fixed or variable length), each block is encoded separately. An example of this approach is binary packing (Anh and Moffat 2010; Lemire and Boytsov 2013), where blocks of fixed length are used, e.g., 128 integers. Given a block S[i, j], we can simply represent its integers using a universe of size ⌈log(S[j] − S[i] + 1)⌉ bits by subtracting the lower bound S[i] from their values. Plenty of variants of this simple approach have been proposed (Silvestri and Venturini 2010; Delbru et al. 2012; Lemire and Boytsov 2013). Recently, it has also been shown (Ottaviano et al. 2015) that using more than one compressor to represent the blocks, rather than simply representing all blocks with the same compressor, can introduce significant improvements in query time within the same space constraints.

Among the simplest binary packing strategies, Simple-9 and Simple-16 (Anh and Moffat 2005, 2010) combine very good compression ratio and high decompression speed. The key idea is to try to pack as many integers as possible in a memory word (32 or 64 bits). As an example, Simple-9 uses 4 header bits and 28 data bits. The header bits provide information on how many elements are packed in the data segment using equally sized code words. The 4 header bits distinguish among 9 possible configurations. Similarly, Simple-16 has 16 possible configurations.

PForDelta
The biggest limitation of block-based strategies is their space inefficiency whenever a block contains at least one large value, because this forces the compressor to encode all values with the number of bits needed to represent this large value (universe of representation). This has been the main motivation for the introduction of PForDelta (PFD) (Zukowski et al. 2006). The idea is to choose a proper value k for the universe of representation of the block, such that a large fraction, e.g., 90%, of its integers can be written in k bits each. This strategy is called patching. All integers that do not fit in k bits are treated as exceptions and encoded separately using another compressor.
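As an illustration of patching, the following hypothetical Python sketch picks the smallest width k that lets at least 90% of a block's integers fit, and separates the remaining values as exceptions; how the exceptions and the escape slots are actually encoded is left to a second compressor, as described above. Names and the escape convention are ours.

```python
def patch_block(block, threshold=0.9):
    """Split a block into (k, low_values, exceptions).

    k is the smallest bit width such that at least `threshold` of the
    integers fit in k bits; values that do not fit are recorded as
    (position, value) exceptions to be encoded separately, and their
    slots are filled with an escape value (0 in this sketch)."""
    for k in range(1, 33):
        fitting = sum(1 for v in block if v < (1 << k))
        if fitting >= threshold * len(block):
            break
    exceptions = [(i, v) for i, v in enumerate(block) if v >= (1 << k)]
    low = [v if v < (1 << k) else 0 for v in block]
    return k, low, exceptions
```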
More precisely, two configurable parameters are chosen: a base value b and a universe of representation k, so that most of the values fall in the range [b, b + 2^k − 1] and can be encoded with k bits each by shifting (delta-encoding) them into the range [0, 2^k − 1]. To mark the presence of an exception, we also need a special escape symbol; thus we have [0, 2^k − 2] as the range of possible configurations for the integers in the block.

The variant OptPFD (Yan et al. 2009), which selects for each block the values of b and k that minimize its space occupancy, has been demonstrated to be more space-efficient and only slightly slower than the original PFD (Yan et al. 2009; Lemire and Boytsov 2013).

Elias-Fano
The encoding we are about to describe was independently proposed by Elias (1974) and Fano (1971), hence its name. Given a monotonically increasing sequence S(n, u), i.e., S[i − 1] ≤ S[i] for any 1 ≤ i < n, with S[n − 1] < u, we write each S[i] in binary using ⌈log u⌉ bits. The binary representation of each integer is then split into two parts: a low part consisting in the rightmost ℓ = ⌈log(u/n)⌉ bits, that we call low bits, and a high part consisting in the remaining ⌈log u⌉ − ℓ bits, that we similarly call high bits. Let us call ℓ_i and h_i the values of the low and high bits of S[i], respectively. The Elias-Fano representation of S is given by the encoding of the high and low parts.

The array L = [ℓ_0, …, ℓ_{n−1}] is written explicitly in n⌈log(u/n)⌉ bits and represents the encoding of the low parts. Concerning the high bits, we represent them in negated unary using a bit vector of n + 2^⌊log n⌋ ≤ 2n bits as follows. We start from a 0-valued bit vector H, and set the bit in position h_i + i, for all i ∈ [0, n). Finally, the Elias-Fano representation of S is given by the concatenation of H and L and overall takes EF(S(n, u)) ≤ n⌈log(u/n)⌉ + 2n bits.

While we can opt for an arbitrary split into high and low parts, ranging from 0 to ⌈log u⌉, it can be shown that ℓ = ⌈log(u/n)⌉ minimizes the overall space occupancy of the encoding (Elias 1974). Figure 1 shows an example of encoding for the sequence S(12, 62) = [3, 4, 7, 13, 14, 15, 21, 25, 36, 38, 54, 62].

Despite its simplicity, it is possible to randomly access an integer from a sequence encoded with Elias-Fano without decompressing the whole sequence. We refer to this operation as Access(i), which returns S[i]. The operation is supported using an auxiliary data structure, built on the bit vector H, able to efficiently answer Select1(i) queries, which return the position in H of the i-th 1 bit. This auxiliary data structure is succinct in the sense that it is negligibly small in asymptotic terms, compared to EF(S(n, u)), requiring only o(n) additional bits (Navarro and Mäkinen 2007; Vigna 2013).

Using the Select1 primitive, it is possible to implement Access(i) in O(1) time. We basically have to relink together the high and low bits of an integer, previously split up during the encoding phase. The low bits ℓ_i are trivial to retrieve, as we need to read the range of bits L[iℓ, (i + 1)ℓ). The retrieval of the high bits deserves, instead, a bit more care. Since we write in negated unary how many integers share the same high part, we have a bit set for every integer of S and a zero for every distinct high part. Therefore, to retrieve the high bits of the i-th integer, we need to know how many zeros are present in H[0, Select1(i)). This quantity is evaluated on H in O(1) as Rank0(Select1(i)) = Select1(i) − i. Finally, linking the high and low bits is as simple as: Access(i) = ((Select1(i) − i) << ℓ) | ℓ_i, where << is the left shift operator and | is the bitwise OR.

The query NextGEQ(x) is supported in O(1 + log(u/n)) time as follows. Let h_x be the high bits of x. Then, for h_x > 0, i = Select0(h_x) − h_x + 1 indicates that there are i integers in S whose high bits are less than h_x. On the other hand, j = Select0(h_x + 1) − h_x gives us the position at which the elements having high bits greater than h_x start. The corner case h_x = 0 is handled by setting i = 0. These two preliminary operations take O(1). Now we have to conclude our search in the range S[i, j], having skipped a potentially large range of elements that, otherwise, would have had to be compared with x.
Inverted Index Compression, Fig. 1 An example of Elias-Fano encoding. We distinguish in light gray the missing high bits, i.e., the ones not belonging to the integers of the sequence among the 2^⌊log n⌋ possible ones
We finally determine the successor of x by binary searching in this range, which may contain up to u/n integers. The time bound follows.

Partitioned Elias-Fano
One of the most relevant characteristics of Elias-Fano is that it only depends on two parameters, i.e., the length and universe of the sequence. As inverted lists often present groups of close docIDs, Elias-Fano fails to exploit such natural clusters. Partitioning the sequence into chunks can better exploit such regions of close docIDs (Ottaviano and Venturini 2014).

The core idea is as follows. The sequence S(n, u) is partitioned into n/b chunks, each of b integers. The first level L is made up of the last elements of each chunk, i.e., L = [S[b − 1], S[2b − 1], …, S[n − 1]]. This level is encoded with Elias-Fano. The second level is represented by the chunks themselves, which can be again encoded with Elias-Fano. The main reason for introducing this two-level representation is that now the elements of the i-th chunk are encoded with a smaller universe, i.e., L[i] − L[i − 1] − 1, for i > 0. As the problem of choosing the best partition is posed, an algorithm based on dynamic programming can be used to find in O(n log_{1+ε}(1/ε)) time a partition whose cost (i.e., the space taken by the partitioned encoded sequence) is at most a factor (1 + ε) away from the optimal one, for any 0 < ε < 1 (Ottaviano and Venturini 2014).

This inverted list organization introduces a level of indirection when resolving NextGEQ queries. However, this indirection only causes a very small time overhead compared to plain Elias-Fano on most of the queries (Ottaviano and Venturini 2014). On the other hand, partitioning the sequence greatly improves the space efficiency of its Elias-Fano representation.

Binary Interpolative
Binary Interpolative coding (BIC) (Moffat and Stuiver 2000) is another approach that, like Elias-Fano, directly compresses a monotonically increasing integer sequence without a first delta-encoding step. In short, BIC is a recursive algorithm that first encodes the middle element of the current range and then applies this encoding step to both halves. At each step of recursion, the algorithm knows the reduced ranges that will be used to write the middle elements in fewer bits during the next recursive calls.

More precisely, consider the range S[i, j]. The encoding step writes the quantity S[m] − low − m + i using ⌈log(hi − low − j + i)⌉ bits, where S[m] is the range middle element, i.e., the integer at position m = (i + j)/2, and low and hi are, respectively, the lower bound and upper bound of the range S[i, j], i.e., two quantities such that low ≤ S[i] and hi ≥ S[j]. The algorithm proceeds recursively, by applying the same encoding step to both halves, [i, m − 1] and [m + 1, j], by setting hi = S[m] − 1 for the left half and low = S[m] + 1 for the right half. At the beginning, the encoding algorithm starts with i = 0, j = n − 1, low = S[0], and hi = S[n − 1]. These quantities must also be known at the beginning of the decoding phase. Apart from the initial lower and upper bounds, all the other values of low and hi are computed on the fly by the algorithm.
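The recursion can be sketched in Python as follows. This is an illustrative simplification with names of our own choosing: it emits (offset, universe) pairs in preorder instead of writing packed bits, and it stops only on ranges that exactly fill their bounds (hi − low = j − i), a slightly more conservative run condition than the one stated above, which keeps encoder and decoder trivially symmetric.

```python
def bic_encode(S, i, j, low, hi, out):
    """Simplified Binary Interpolative coding.

    Emits (offset, universe) pairs in preorder; a real coder would
    write each offset in ceil(log2(universe + 1)) bits. A range whose
    size equals its universe ([low, hi] holds exactly j - i + 1
    values) is an implicit run of consecutive integers: emit nothing."""
    if i > j:
        return
    if hi - low == j - i:                     # saturated: implicit run
        return
    m = (i + j) // 2
    lo_m, hi_m = low + (m - i), hi - (j - m)  # bounds on S[m]
    out.append((S[m] - lo_m, hi_m - lo_m))
    bic_encode(S, i, m - 1, low, S[m] - 1, out)
    bic_encode(S, m + 1, j, S[m] + 1, hi, out)

def bic_decode(pairs, i, j, low, hi, S):
    """Inverse of bic_encode; `pairs` is an iterator over the emitted
    pairs, S a preallocated output list."""
    if i > j:
        return
    if hi - low == j - i:                     # run detected from bounds
        for k in range(i, j + 1):
            S[k] = low + (k - i)
        return
    m = (i + j) // 2
    offset, _ = next(pairs)
    S[m] = low + (m - i) + offset
    bic_decode(pairs, i, m - 1, low, S[m] - 1, S)
    bic_decode(pairs, m + 1, j, S[m] + 1, hi, S)
```

On the sequence of Fig. 1 the first emitted offset is 7 (for the middle element 15), as in the worked example below; because of the simplified run condition, this sketch may emit a few pairs that the original scheme elides.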
Inverted Index Compression, Fig. 2 An example of Binary Interpolative coding. We highlight in bold font the middle element currently encoded: above each element we report a pair where the first value indicates the element actually encoded and the second value the universe of representation
Figure 2 shows an encoding example for the same sequence of Fig. 1. We can interpret the encoding as a preorder visit of the binary tree formed by the recursive calls the algorithm performs. The encoded elements in the example are, in order: [7, 2, 0, 0, 18, 5, 3, 16, 1, 7]. Moreover, notice that whenever the algorithm works on a range S[i, j] of consecutive integers, such as the range S[3, 4] = {13, 14} in the example, i.e., the ranges for which the condition S[j] − S[i] = j − i holds, it stops the recursion by emitting no bits at all. The condition is again detected at decoding time, and the algorithm implicitly decodes

Future Directions for Research
This concluding section is devoted to recent approaches that encode many inverted lists together to obtain higher compression ratios. This is possible because the inverted index naturally presents some amount of redundancy, caused by the fact that many docIDs are shared between its lists. In fact, as already motivated at the beginning of section "List Compressors", the identifiers of similar documents will be stored in the inverted lists of the terms they share.