
Optimization of Complex Dataflows with User-Defined Functions

ASTRID RHEINLÄNDER and ULF LESER, Humboldt-Universität zu Berlin


GOETZ GRAEFE, Google

In many fields, recent years have brought a sharp rise in the size of the data to be analyzed and the
complexity of the analysis to be performed. Such analyses are often described as dataflows specified in
declarative dataflow languages. A key technique to achieve scalability for such analyses is the optimization
of the declarative programs; however, many real-life dataflows are dominated by user-defined functions
(UDFs) to perform, for instance, text analysis, graph traversal, classification, or clustering. This calls for
specific optimization techniques as the semantics of such UDFs are unknown to the optimizer.
In this article, we survey techniques for optimizing dataflows with UDFs. We consider methods developed
over decades of research in relational database systems as well as more recent approaches spurred by the
popularity of Map/Reduce-style data processing frameworks. We present techniques for syntactical dataflow
modification, approaches for inferring semantics and rewrite options for UDFs, and methods for dataflow
transformations both on the logical and the physical levels. Furthermore, we give a comprehensive overview
on declarative dataflow languages for Big Data processing systems from the perspective of their built-in
optimization techniques. Finally, we highlight open research challenges with the intention to foster more
research into optimizing dataflows that contain UDFs.
CCS Concepts: • General and reference → Surveys and overviews; • Information systems → Query
optimization; MapReduce-based systems; MapReduce languages;
Additional Key Words and Phrases: Dataflow optimization, user-defined functions, big data processing sys-
tems
ACM Reference Format:
Astrid Rheinländer, Ulf Leser, and Goetz Graefe. 2017. Optimization of complex dataflows with user-defined
functions. ACM Comput. Surv. 50, 3, Article 38 (May 2017), 39 pages.
DOI: http://dx.doi.org/10.1145/3078752

1. INTRODUCTION
For many applications, recent years have brought a sharp rise in the size of datasets
to be analyzed and the complexity of the analysis to be performed. This development
gave rise to a class of systems recently termed Big Data processing frameworks [Wu
et al. 2014], which typically consist of a language to specify complex analyses that
should be performed on the input data, a compiler that translates programs in these
languages into executable pipelines of individual programs (or operators), and a runtime
engine that can execute these pipelines on a given infrastructure, typically a
cluster of shared-nothing computers. Two cornerstones in the development of such
Big Data systems were the presentation of the Map/Reduce programming model in
2004 [Dean and Ghemawat 2004] and the subsequent availability of an open source

This work is supported by the German Research Foundation under grant FOR 1306.
Authors’ addresses: A. Rheinländer and U. Leser, Department of Computer Science, Humboldt-Universität
zu Berlin, 10099 Berlin, Germany; emails: {rheinlae, leser}@informatik.hu-berlin.de; G. Graefe, Google, 10
N Livingston St, Madison WI 53703, USA; email: goetzg@google.com.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for components of this work owned
by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from Permissions@acm.org.
2017 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 0360-0300/2017/05-ART38 $15.00
DOI: http://dx.doi.org/10.1145/3078752

ACM Computing Surveys, Vol. 50, No. 3, Article 38, Publication date: May 2017.

implementation of this framework with Apache Hadoop,1 which substantially eased
the parallel execution of data analysis tasks on very large datasets. A particular aspect
of the Map/Reduce model appealing to many developers and researchers is its focus on
custom data processing operations [Marx 2013; Howe et al. 2008; Berriman and Groom
2011], reflecting the observation that many types of analysis require domain-specific
functionality.
The Map/Reduce framework emerged from research in distributed systems. Another
pillar of Big Data processing frameworks are relational databases; clearly, the coarse-
grained architecture of Big Data frameworks as outlined above closely resembles that
of typical relational databases and their query execution sub-system. Many Big Data
frameworks build on experiences from database research in that they offer a declarative
language for specifying the dataflows to be executed. Such a language should (a) allow
the expression of complex dataflows in the form of queries or scripts (expressiveness),
(b) enable the adaptation of the final execution plan to the properties of the concrete
input data and execution infrastructure at hand (optimizability), and (c) provide a rich
and easily extensible set of data analytics operators (domain adaptability). The latter
is typically achieved by allowing the usage of arbitrary functions in those languages.
These functions may be predefined and available in the form of libraries or user-
provided within the query. In this work, we use the term “user-defined function” (UDF)
to denote such functions, that is, operators used in a dataflow program other than the
traditional relational ones.
Accordingly, several declarative dataflow languages have been proposed (e.g., Thusoo
et al. [2009], Olston et al. [2008], Beyer et al. [2011], Heise et al. [2012], and Alexandrov
et al. [2015]), most of which allow mixing of relational operators and UDFs to support
a wide range of applications. This mixture poses serious challenges to an optimizer:
While relational operations (filter, joins, aggregates) are semantically well understood,
which is reflected in a rich set of semantic-preserving rewrite rules, properties of a UDF
are not known per se to the optimizer. Thus, finding and applying query rewriting to
dataflows with UDFs must use special techniques. Without such techniques, dataflows
cannot be reordered, which may lead to grossly suboptimal execution plans [Hueske
et al. 2012].
Consider the exemplary abstract dataflow shown in Figure 1, which analyzes a
knowledge base to find information on bankruptcies of organizations within a certain
time frame and the persons involved in the bankruptcies. Figure 1(a) shows the unop-
timized dataflow as it could be specified in a dataflow language. Documents from the
knowledge base are captured at two different points in time and stored in the datasets
“new” and “old.” First, a complex operator for information extraction is applied, which
internally consists of three methods to extract organizations, persons, and dates from
the texts (see left-most box). Afterwards, the two document collections are joined by
document ID to retrieve all documents that describe a bankruptcy in the dataset “new,”
but do not mention such an event in the older version. The remaining documents are
grouped by year and month but also by year only to account for incomplete date in-
formation. Finally, grouped documents are unioned to a single dataset and returned
to the user. In total, this unoptimized dataflow consists of 10 operators (including the
expanded UDF ie_pipeline), 6 of which are non-relational. An optimized version of
the dataflow is displayed in Figure 1(b); it consists of only 6 operators, 4 of which are
non-relational (in grey). Enumerating such a plan, which produces the same result set
as the original dataflow, is a core task of any dataflow optimizer and requires knowl-
edge on the properties of the UDFs in the plan, for instance, to infer that the complex

1 http://hadoop.apache.org/, retrieved: 2016-10-31.


Fig. 1. Example of a dataflow with UDFs before (left) and after optimization (right). UDFs are in grey. The
complex UDF ie_pipeline consists of three operators (shown in the dashed box on the left), which perform
different information extraction (IE) tasks.

UDF ie_pipeline can be broken up into three atomic UDFs, that the order of these
can be changed, and that two of them can be safely pushed through the join. We use an
extended version of this exemplary dataflow throughout this survey to highlight the
techniques and methods that allow such optimizations.
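The payoff of such reorderings can be illustrated with a minimal, hypothetical sketch (the operator bodies and document contents are invented for illustration and do not come from the survey's example): applying a selective filter before an expensive extraction UDF produces the same result while invoking the UDF far less often.

```python
# Hypothetical sketch: reordering a cheap, selective filter before an
# expensive UDF reduces the number of expensive calls. This rewrite is
# valid here only because the filter predicate does not depend on the
# UDF's output.

def annotate_entities(doc):
    # stand-in for an expensive information-extraction UDF
    return {**doc, "entities": [w for w in doc["text"].split() if w.istitle()]}

def mentions_bankrupt(doc):
    return "bankrupt" in doc["text"]

docs = [
    {"id": 1, "text": "Acme went bankrupt in May"},
    {"id": 2, "text": "Globex acquired Initech"},
]

# Unoptimized order: annotate every document, then filter (2 UDF calls).
plan_a = [d for d in map(annotate_entities, docs) if mentions_bankrupt(d)]

# Optimized order: filter first, annotate only the survivors (1 UDF call).
plan_b = [annotate_entities(d) for d in docs if mentions_bankrupt(d)]

assert plan_a == plan_b  # semantically equivalent plans
```

Recognizing that the two plans are equivalent is exactly what requires knowledge of the UDF's read and write behavior, which the following sections address.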
In the following, we survey techniques for optimizing dataflows with UDFs in Big
Data processing systems at different stages of a typical optimization process. We hereby
focus on techniques that grow the search space for UDF dataflow optimization; tech-
niques for plan enumeration and cost estimation are out of the scope of this survey.
Specifically, we first review techniques for syntactical dataflow transformation followed
by a discussion of methods for the analysis of operator semantics to derive precedence
relationships and rewrite options between pairs of dataflow operations. Third, we study
logical and physical dataflow transformations, where the former addresses optimizing
the execution order of operators and the latter decides on concrete operator execution
strategies and the choice of operator implementations. Logical and physical optimization
are often intertwined in the optimization process; consider, for example, join optimiza-
tion, where the join order is determined logically and the concrete selection of join algo-
rithms is part of physical optimization. Logical optimization relies on rewrite rules,
information on the semantics of operations contained in a dataflow, and cardinality es-
timates, whereas physical optimization employs properties of the data to be analyzed
(e.g., uniqueness or sortedness of attributes, cardinality estimates, attribute count,
etc.), the dataflow itself (e.g., types of operators), and of the platform used for execu-
tion (e.g., number and configuration of compute nodes) together with rewrite rules for
dataflow optimization. We illustrate each technique using concrete dataflow examples
and describe the availability of each method in existing systems. Finally, we provide
an overview on declarative dataflow languages for Big Data processing systems and
characterize these languages regarding technical properties and available optimiza-
tion techniques. We conclude this article with an outlook to important future research
directions.


1.1. Map/Reduce-Style Data Processing with UDFs


Query optimization with UDFs has, to some extent, also been studied in the database
community, and several of the techniques we summarize in this work are rooted in
this field. However, a proper handling of UDFs has tremendously gained in importance
since the advent of Big Data processing frameworks, which are typically used for highly
domain-specific types of data analysis. Many of these systems can be traced back to
the presentation of the Map/Reduce programming model [Dean and Ghemawat 2004]
to ease certain types of parallel data processing in large, distributed shared-nothing
clusters. Essentially, the Map/Reduce programming model parallelizes data process-
ing based on two parallelization functions (so-called second-order functions, Map and
Reduce) together with first-order functions, which provide the concrete implementation
of the tasks to be performed. These first-order functions have to be implemented by
the developer and encapsulate most of the analytical functionality in a Map/Reduce
pipeline. Since then, many extensions and improvements for parallel data process-
ing have been proposed, especially additional parallelization functions [Battré et al.
2010; Zaharia et al. 2010] and the usage of high-level dataflow languages. While early
languages (e.g., PigLatin [Olston et al. 2008] or JaQL [Beyer et al. 2011]) compiled
these programs into pipelines of pure Map/Reduce steps, more recent proposals exploit
the power of additional parallelization primitives, which often leads to more efficient
execution plans (e.g., AQL/Asterix [Alsubaiee et al. 2014] or Meteor/Stratosphere [Heise
et al. 2012]). All these languages try to keep a declarative flavor to make program-
ming simpler, and all have strong support for UDFs to be usable in as many areas
as possible. However, many of the existing Map/Reduce-style data analytics systems
treat UDFs as black boxes, which greatly diminishes the opportunities for dataflow
optimization.

1.2. Previous Surveys


Multiple surveys in the area of Big Data processing systems have been published in
the past years, each focussing on different aspects of these systems and topics that are
complementary to our work. One of the first surveys on Big Data processing systems
that build upon the Map/Reduce programming model was presented by Lee et al.
[2012]. After introducing the Map/Reduce programming model and its variants, the
authors discuss advantages and drawbacks of the model and provide an overview of
approaches and strategies enhancing this model. Important application areas (e.g.,
biomedical data analytics, statistics, data warehousing) and open research challenges
are also summarized.
Later, Sakr et al. [2013] surveyed Big Data processing systems that build on the
Map/Reduce programming model. Enhancements of the original approach by Dean
and Ghemawat are discussed (e.g., iterations, data placement and storage formats,
optimization of join processing) and the authors provide an overview on Big Data
processing systems and SQL-like query languages built on top of these systems.
Rumi et al. [2014] survey approaches for task and application scheduling regarding
different optimization goals (e.g., data locality, skewness reduction, system utilization)
in the Hadoop ecosystem.
Doulkeridis and Nørvåg [2014] provide an overview of Map/Reduce-based data pro-
cessing techniques followed by an analysis of the most significant limitations of the
underlying execution model and suggest a wide range of performance improvements
to overcome these limitations. This work also provides a broad and insightful overview
of existing approaches for certain aspects of dataflow optimization (e.g., data access,
fair work allocation, non-redundant processing). Yet the optimization of UDFs is not the
focus of this work.


Probably the most comprehensive survey to date was authored by Li et al. [2014]; it
presents an overview of the Map/Reduce programming model, systems implementing
this model, extensions and enhancements compared to the original proposal, as well as
languages for specifying analytical dataflows for database-style applications. Therein,
it focuses on approaches for implementation, scheduling, and optimization of relational
operators using the Map/Reduce programming model but disregards dataflows with
UDFs.
Bajaber et al. [2016] likewise review the state of the art in Big Data processing systems
for huge datasets. Discussed systems are categorized in a taxonomy and distinguished
based on their application domain, that is, systems for general purpose, database style,
graph analytics, and stream processing. Again, dataflow optimization with UDFs is not
included.
By focussing on the processing and optimization of dataflows with UDFs, our arti-
cle complements the perspectives on Big Data processing presented in the previous
surveys.
1.3. Target Audience
This survey is intended for readers interested in techniques for performance optimiza-
tion of dataflows whose functionality is predominantly encoded in UDFs. Readers with
a background in distributed systems and Big Data analytics may profit by learning how
such computations can be optimized using techniques inspired by decades of research
in relational database systems. Experts in relational data processing may be inspired
by our comprehensive description of the role UDFs play in large-scale data analysis
flows and our account of related optimization techniques. For each presented approach,
we discuss typical usage scenarios and potential benefits using real-world dataflows;
furthermore, we summarize their availabilities in existing Big Data processing sys-
tems.
1.4. Article Organization
Section 2 introduces terminology, basic concepts, and mechanisms of dataflow pro-
cessing in Big Data processing systems. Section 3 discusses techniques for syntactical
dataflow transformation. A major challenge in dataflow optimization is to determine
precedence constraints between UDFs, since they are not part of an algebra with
clearly defined rewrite rules. Approaches that address this challenge are presented in
Section 4. Section 5 discusses techniques for dataflow transformations on the logical
and on the physical level. Section 6 gives an overview on existing dataflow languages
available for Big Data processing systems and summarizes key characteristics re-
garding application areas, targeted processing platforms, and optimization techniques.
Finally, this survey concludes in Section 7 with an outlook to future research directions.
2. TERMINOLOGY AND CONCEPTS
This section introduces basic terminology and fundamentals of dataflow processing
in Big Data processing systems. Furthermore, it contains a complex example of a
dataflow with many UDFs, which is used in the following sections to explain premises
and effectiveness of the surveyed optimization techniques.
In this survey, dataflow optimization refers to two orthogonal strategies for reducing
the total execution costs of a dataflow by (1) minimizing time consumption by maxi-
mizing the output per time unit for a fixed set of resources or by (2) minimizing the
resource consumption necessary to compute the output for a fixed time budget. A Big
Data processing system processes analytical workloads on massive datasets in a par-
allel way by using several parallel threads on a multi-core machine, by executing the
workload on different machines in a distributed environment, or both. In contrast to


workloads processed by relational database systems, dataflows processed by Big Data
processing systems are usually long running, specified ad hoc, and executed only once
over a certain set of input data.
In the remainder of this article, we use the following simple and general definition of a
dataflow. We call an unordered set of records, which may or may not adhere to a schema
definition, a dataset. An operator o transforms a list of input datasets I = [I1, . . . , In] into
a list of output datasets O = [O1, . . . , Om] by applying a function f to I.
A dataflow is a connected directed acyclic graph D(V, E) with the following proper-
ties:
—any vertex without incoming edges is called data source,
—any vertex without outgoing edges is called data sink, and
—any vertex with both incoming and outgoing edges is an operator.
Note that we focus on deterministic, stateless dataflows. Dataflows containing iter-
ations, operations involving processing of random-numbers, or window-based stream
processors are out of scope of this survey.
Two dataflows D, D′ are semantically equivalent2 if D and D′ are deterministic and
produce the same output datasets O given the same input datasets I. Degrees of freedom
regarding how a result set is computed enable the optimizer to change the
execution order of dataflow operators to derive the result more efficiently. By the term
query we mean a high-level representation of a dataflow, which is either formulated in
a structured, textual language, for example, in Beyer et al. [2011], Heise et al. [2012],
Olston et al. [2008], and Thusoo et al. [2009], or, alternatively, is created by drag and
drop of operators in a graphical user interface.3
Dataflow operators can be relational (e.g., selection, projection, join), non-relational,
or relational operators that are configured with UDFs (e.g., a join operator, where the
join condition evaluates a similarity function). We use the terms user-defined operator
and UDF synonymously to refer to all non-relational operators integrated into Big
Data processing systems by developers or supplied by users as part of a dataflow. We
also consider relational operators, which are configured with a UDF as user-defined
operators. We distinguish between logical and physical dataflows. The former is an
abstract, algebraic representation of all operations to be performed in a dataflow. The
latter defines a concrete execution plan for a dataflow that consists of parallelization
functions (e.g., map, reduce) and concrete implementations for each operator as well
as data shipment strategies among sources, operators, and sinks.
Optimization attempts to determine the best possible plan for a given query; how-
ever, there are different reasons why an optimized execution plan selected by a dataflow
optimizer may not be the best possible plan. First, the number of possible plan alterna-
tives for a given dataflow may be too large to be considered completely, which is often
the case for large dataflows with many degrees of freedom. Second, the semantics of
UDFs is often not available to the optimizer, which hampers an optimal placement of
these operations. Third, statistics for UDFs are often imprecise and thus hinder accu-
rate modelling of operator costs and data transfer. In contrast to relational database
settings, where datasets are assumed to be queried repeatedly and a priori computed
statistics are available, dataflows are usually executed only once and statistics collec-
tion on very large datasets may be prohibitively expensive. Therefore, optimization in
Big Data processing systems is often not carried out in a cost-based manner but employs heuristic
rules for plan space pruning. To find efficient operator execution orders for dataflows

2 In programming languages, semantic equivalence is also known as behavioral equivalence [Reichel 1984].
3 Hadoop User Experience, http://gethue.com/, retrieved October 31, 2016.


Fig. 2. Overview of query and dataflow processing in Big Data processing systems.

in general, different constructive approaches have been proposed that evaluate prece-
dence constraints determined earlier to construct alternative dataflows using bottom-
up or top-down plan enumeration algorithms [Burge et al. 2005; Hueske et al. 2012;
Srivastava et al. 2006]. Note that plan enumeration algorithms are not in the scope of
this survey; we refer the reader to Chaudhuri [1998], Kossmann [2000], and Ioannidis
[1996].
2.1. Query and Dataflow Processing in Big Data Processing Systems
Parallel database management systems (DBMS) and data analytics systems that build
on the Map/Reduce paradigm are complementary technologies, each focussing on dif-
ferent aspects of data processing at a very large scale. While Map/Reduce-style systems
excel at performing complex analytical tasks on huge sets of semi- and unstructured
data, parallel DBMS shine when huge sets of mostly structured data are queried re-
peatedly [Stonebraker et al. 2010]. Yet query and dataflow processing is conceptually
similar in both technologies (see Figure 2).
Queries formulated in dataflow languages are translated into parallel, executable
dataflow programs using a compilation process similar to query optimization in rela-
tional database systems. A query is translated into an abstract parse tree and syntacti-
cally transformed in a first optimization step by analyzing the utilization of variables,
operators, and predicates. If queries are nested, then syntactic transformation attempts
to unnest the query to facilitate more comprehensive rewriting in downstream opti-
mization. A dataflow consisting of logical operators is created, which is afterwards
logically optimized, for example, by reordering operators, operator decomposition, or
redundancy elimination. Such optimizations can only be applied if information on
operator semantics and the operator’s potential for reordering with other operators
is available. Alternatively, operator semantics can be determined in a separate opti-
mization step that analyzes precedence relationships within a given dataflow and its
contained operators to infer concrete rewrite options. The optimized logical dataflow is
translated into a physical dataflow and optimized physically to reduce both communi-
cation and computation costs on the available hardware, for example, by introducing
early aggregation and caching into the dataflow or by choosing specific operator im-
plementations and parallelization schemes based on the properties of the data to be
processed. Finally, the executable code of the parallel dataflow program is created and
executed in parallel on the available hardware infrastructure.
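The "early aggregation" mentioned above can be sketched in a few lines; this is an illustrative example, not any system's implementation, with invented partitions and keys, and Python's Counter standing in for a grouping operator.

```python
# Hedged sketch of early (map-side) aggregation: pre-aggregating each
# partition locally before shuffling reduces communication volume while
# yielding the same final counts.
from collections import Counter

partitions = [["a", "b", "a"], ["b", "b", "c"]]

# Without early aggregation: ship every record, aggregate at the reducer.
naive = Counter(rec for part in partitions for rec in part)

# With early aggregation: aggregate each partition locally, ship only the
# much smaller partial counts, then merge them at the reducer.
partials = [Counter(part) for part in partitions]
combined = sum(partials, Counter())

assert naive == combined                           # identical results
shipped_naive = sum(len(p) for p in partitions)    # 6 records shipped
shipped_combined = sum(len(c) for c in partials)   # 4 partial counts shipped
assert shipped_combined < shipped_naive
```

This rewrite is only safe for aggregation functions that can be merged from partial results (e.g., counts and sums), which is one reason the optimizer needs semantic knowledge about the operators involved.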
2.2. Dataflows with UDFs by Example
In the following sections, we introduce a multitude of different optimization techniques
and show their potential for optimizing dataflows with UDFs. We explain their premises
by means of an example query script and its corresponding dataflow. Whenever neces-
sary, we slightly modify and extend this example throughout this survey to clarify the
benefits of certain optimizations. Our example query is formulated in Meteor [Heise


et al. 2012], a declarative dataflow language designed for processing dataflows that
contain UDFs using the Big Data processing system Stratosphere [Alexandrov et al.
2014]. In Meteor, variables start with a ’$’ sign and each statement is delimited with
a ’;’ sign.4
Our example query analyzes two dumps of Wikipedia articles gathered at different
points in time to determine NASDAQ-listed companies, which went bankrupt within a
certain time frame, together with an investigation of persons related to these compa-
nies. The corresponding logical dataflow is shown in Figure 3, and the concrete Meteor
query is contained in Listing 1. We added comments (i.e., green text starting with ’//’
in Listing 1) to precisely identify operators occurring multiple times in the depicted
dataflow. The dataflow is a directed acyclic graph (DAG) and consists of 10 operators, 2
of which are complex. When expanding complex operators, the dataflow consists of 22
operators, 13 of which are non-relational.
Specifically, the first line of the query imports a package of non-relational operators
that perform tasks in the areas of information extraction (IE) and natural language
processing (see below for details). Lines 3–12 contain the ad hoc definition of a com-
plex UDF consisting of several IE operators for sentence and token annotation, for
annotating multiple entities in texts (i.e., organizations, persons, and dates), and for
extracting relationships between persons and organizations. The datasets to be ana-
lyzed are retrieved from the file system in Lines 14 and 15. Analysis of both datasets is
carried out through calling the UDF ie_pipeline and by a subsequent filter operation
(Lines 17–20 for dataset “new” and Lines 22–25 for dataset “old,” respectively). The
newer dataset is filtered for articles that contain the term “bankrupt” (Lines 19 and
20) and the older dataset is filtered for articles that contain the term “NASDAQ” but do
not contain the term “bankrupt” (Lines 23–25). Additionally, locations are annotated
in the dataset “new” (see Line 18).
Both datasets are joined on the document ID to retain only those documents where
companies are described as bankrupt in the newer dataset but were not described as
such in the older dataset (Line 27). Afterwards, two grouping and a union operator (see
Lines 29–43) are applied to group the joined data by year and month (Lines 35–41)
and also by year only (Lines 29–33) to account for incomplete date annotations. The
output format of both grouping operators is specified in the query by the JSON schema
following the keyword into (see Lines 30–33 and 37–41, respectively). Information
regarding the respective months and years in the output are retrieved from the first
item belonging to a specific group (indicated by ’[0]’ in the query), whereas infor-
mation regarding relationships are retrieved from all items belonging to the group
(indicated by ’[*]’ in the query). Unnecessary attributes of the grouped data are pro-
jected out by using the into keyword. The result set is filtered for companies that went
bankrupt between 2014 and 2016 (Lines 45–47) and is finally written to the file system.
3. SYNTACTIC DATAFLOW TRANSFORMATION
In this section, we discuss dataflow transformations on the syntactical level in the
presence of UDFs. Specifically, we present techniques for inlining of functions and
variables, operator unnesting, and the simplification of operators and predicates. Some
of the presented techniques are applicable in a broader scope and are not limited to
dataflows with UDFs. Therefore, a special focus of this section is to discuss the impact of
these techniques as an enabler for downstream optimizations of dataflows with UDFs.
The transformation of dataflows on the syntactic level is applied as a first step
within the optimization process. Given a query script, which describes a dataflow with

4 Throughout this article, we underline all operators contained in listings and print keywords that introduce
operator predicates in italics to ease the readability of listed queries.


Fig. 3. Dataflow for Meteor query shown in Listing 1. UDFs are in grey.

UDFs, the following techniques may be applied: (1) rule-based inlining of variables
and functions, (2) query unnesting and group-by simplifications, and (3) operator and
predicate simplifications.
The goal of syntactic transformation is not only to reduce complexity within the
query script but also to discover potential for inter-operator parallelism of contained
operators. Although the techniques discussed below are not specific to dataflows
that contain UDFs, the presented transformations greatly impact the effectiveness of


Listing 1. Meteor query for advanced information extraction combining UDFs, non-relational, and relational
operators.

downstream logical optimization in such flows. By first applying operator simplification
and query unnesting, complex UDFs are translated into concrete dataflow operators,
which enables advanced dataflow transformations in succeeding steps of the optimiza-
tion process (see Sections 4 and 5).

3.1. Rule-Based Variable and Function Inlining


Variable and function inlining replaces a variable or function call with the variable’s
value or the function body, similar to macro expansion known from procedural pro-
gramming and scripting languages. Variables can be inlined if they are referenced


Listing 2. Excerpt of Meteor query from Listing 1, UDF expansion.

only once in expressions or function calls of a query. Similarly, calls of UDFs can be
expanded with the function’s body if the function’s call-tree is not recursive. In these
cases, parameters are replaced by local variables, which are possibly inlined to further
simplify the query.
Function inlining is important as an enabler for downstream plan transformations
that are not possible without. Consider Listing 2, which contains an excerpt of the query
from Listing 1. Lines 3–12 of the script declare a UDF called ie_pipeline to analyze
texts. Internally, this UDF calls six other UDFs (Lines 4–9) and one filter operator
(Lines 10 and 11). When translating the complete query without UDF expansion, the
resulting dataflow consists of only two consecutive operators, ie_pipeline and filter,
whereas it consists of eight operators after expansion of ie_pipeline, as shown in
Figure 4.
In principle, the expanded dataflow has six possible operator execution orders of the
annotate entities operators that are semantically equivalent to the initial query (see
Section 4). Moreover, the expanded dataflow also contains two adjacent filter operators,
which can be reordered with any database-style optimizer.
Variable and function inlining is available in some optimizers for parallel analytics
systems [Beyer et al. 2011; Murray et al. 2011]. The direct benefits of inlining have
been evaluated selectively in Big Data settings; for example, Ackermann et al. [2012]
evaluated the benefits of operation fusion in the context of the domain-specific language
Jet.
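To make the mechanics concrete, the following Python sketch illustrates rule-based inlining on a hypothetical tuple-based expression representation (this is not the actual Meteor compiler): a non-recursive UDF call is expanded by binding actual arguments to formal parameters and substituting the function body, as in macro expansion.

```python
# Hypothetical representation: a query is a nested tuple, where
# ("call", name, args...) denotes a UDF call and ("var", name) a variable.

def inline(expr, functions, bindings):
    """Replace variable references with their values and non-recursive
    UDF calls with the function body (macro expansion)."""
    if not isinstance(expr, tuple):
        return expr
    kind = expr[0]
    if kind == "var" and expr[1] in bindings:
        return inline(bindings[expr[1]], functions, bindings)
    if kind == "call" and expr[1] in functions:
        params, body = functions[expr[1]]  # assumed non-recursive
        args = [inline(a, functions, bindings) for a in expr[2:]]
        local = dict(bindings, **dict(zip(params, args)))
        return inline(body, functions, local)
    return tuple(inline(e, functions, bindings) for e in expr)

# A UDF double(x) = mul(x, 2) is expanded at its call site:
functions = {"double": (("x",), ("mul", ("var", "x"), 2))}
query = ("filter", ("call", "double", ("field", "price")))
assert inline(query, functions, {}) == \
       ("filter", ("mul", ("field", "price"), 2))
```

After inlining, the call boundary disappears and the operators inside the former UDF body become visible to downstream rewrites.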

3.2. Group-by Simplification


Computing aggregates and groupings is one of the most frequent operations in data
analytics applications that involve UDFs, since very often, results computed by the UDFs
need to be aggregated for interpretation by human data analysts. Often, the grouping
functions contain expensive user-defined predicates [Jacobs 2009] and optimization of
such operations is crucial for the overall performance of data analytics programs. Next
to logical and physical dataflow transformations regarding aggregations and groupings,
which are discussed in Section 5, simplification of such operations is also a powerful
technique.
Group-by operations can be simplified by eliminating redundant grouping columns
in cases where “equals” selection predicates are specified on them. Consider Listing 3,


Fig. 4. Expansion of UDFs during query simplification. UDFs are in grey.

Listing 3. Excerpt of Meteor query from Listing 1, optimization of group-by.

Listing 4. Excerpt of Meteor query from Listing 1, elimination of redundant grouping attributes.

which modifies the query from Listing 1 in Line 28 by adding a filter operator that
retains only records where the month June is annotated. Lines 31–33 group the dataset by
year and month. Since Line 28 only produces results for a certain month, the group-by
operation in Lines 31–33 can be altered as shown in Listing 4. This simplification
reduces the number of key columns to be considered and simplifies aggregation
operations performed later in a dataflow. Reduction of grouping columns is applied


Listing 5. Excerpt of Meteor query from Listing 1, grouping sets optimization.

in relational database systems, for example, in IBM DB2 (see footnote 5) and SQL Server [Simmen
et al. 1996], yet we are not aware of any dataflow optimizer currently employing this
technique.
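The effect of this simplification can be illustrated with a small Python sketch (attribute names mirror the year/month example; this is illustrative, not any system's implementation): once an "equals" predicate fixes a column, that column partitions nothing and can be dropped from the grouping key.

```python
from collections import defaultdict

def group_count(rows, keys):
    """Group rows by the given key columns and count group sizes."""
    groups = defaultdict(int)
    for row in rows:
        groups[tuple(row[k] for k in keys)] += 1
    return dict(groups)

rows = [{"year": 2015, "month": m} for m in ("May", "June", "June")]
filtered = [r for r in rows if r["month"] == "June"]  # "equals" predicate

# With 'month' fixed by the predicate, grouping by (year, month) and by
# (year,) induce the same partitioning; the redundant column can go:
assert list(group_count(filtered, ("year", "month")).values()) == \
       list(group_count(filtered, ("year",)).values())
```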
Another simplification technique for group-by operations in dataflows, called grouping
sets, applies to situations where multiple group-by operators process the same
grouping keys. Grouping sets allow one to selectively specify grouping columns and thus
to express several grouping operations within a single operation. In the example of
Listing 5, two grouping sets are specified, namely (1) the grouping set (y,m) for grouping
by year and month and (2) the grouping set (y) for grouping by year only. Grouping sets are
logically equivalent to expressing several individual group-by operations and connecting
them with union operations. For example, Lines 29–43 of Listing 1 are logically
equivalent to the statement shown in Listing 5. Such a rewriting is beneficial in
many cases, since it collapses three operators (i.e., two group-bys and one union) into
a single one that evaluates all grouping predicates without the need of a subsequent
union operation. Grouping set operations are well known from data warehousing sys-
tems [Gray et al. 1997] and they are also available in the Big Data processing system
Hive [Thusoo et al. 2009].
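A minimal Python sketch (illustrative toy data only) confirms the logical equivalence of the two formulations, i.e., several group-bys connected by a union versus one operator evaluating all grouping sets in a single pass:

```python
from collections import Counter

rows = [("2015", "May"), ("2015", "June"), ("2016", "June")]

def group_sizes(rows, key):
    """Group rows by the key function and count each group's size."""
    return Counter(key(r) for r in rows)

# Two individual group-bys whose results are combined by a union ...
separate = group_sizes(rows, lambda r: (r[0], r[1])) \
         + group_sizes(rows, lambda r: (r[0],))

# ... versus one operator evaluating both grouping sets in a single pass:
single_pass = Counter()
for y, m in rows:
    for key in ((y, m), (y,)):   # grouping sets (y, m) and (y)
        single_pass[key] += 1

assert separate == single_pass
```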
3.3. Query Unnesting
The unnesting of nested queries is another important technique for optimizing
dataflows with UDFs, since nested queries involving UDFs are prone to cause high
memory and CPU loads during query processing (see footnote 6). Sub-queries can be distinguished
into uncorrelated and correlated sub-queries. In correlated sub-queries, the nested
query references one or more datasets processed by the outer part of the query.
Optimization of such queries has been addressed since the early days of relational query
optimization research. A classification of nested queries was first introduced by Kim
[1982], based on which different techniques for unnesting were developed by Ganski
and Wong [1987].
Uncorrelated sub-queries are independent of the outer query and can therefore be
optimized and executed separately, whereas optimization of correlated sub-queries is
significantly more difficult and often leads to inferior plans [Neumann 2009]. Opti-
mizers typically unnest the nested part of the correlated query as much as possible to
evaluate the nested part separately and combine it with the outer part by an explicit
join operation [Selinger et al. 1979]. This allows the dataflow optimizer to identify
more efficient rewrite options, for example, by pushing projections or selections from
the outer to the inner part of the query.

5 IBM DB2 for z/OS Technical Overview, https://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg248180.html, retrieved October 31, 2016.
6 For example, consider bug reports of Hive, https://issues.apache.org/jira/browse/HIVE-372 and https://issues.apache.org/jira/browse/HIVE-5494, both retrieved October 31, 2016.


Listing 6. Correlated sub-query inspired by Listing 1.

Listing 6 shows an excerpt of Listing 1, which is modified with a correlated sub-query
filtering the join results for all years where at least 10 bankruptcies of NASDAQ-listed
companies were found. The nested part of the query is connected to the filter operation
in Line 28. It starts by isolating date annotations from the remaining annotations
(Lines 29–32), which are subsequently transformed into a normal form by a
user-defined operator normalize (Line 33). Normalized dates are grouped by year, and
occurrences of years are counted (Lines 34–38). Finally, the grouped data are filtered
for those years occurring at least 10 times (Line 39). After computing the correlated
sub-query, the dataflow continues unaltered.
Since the nested part of the query is used as a filter predicate, it needs to be computed
prior to the actual execution of the outer part and no rewrite options can be applied
immediately. By unnesting Lines 29–40 of Listing 6, the partial dataflow depicted in
Figure 5 is created and optimizations can be applied, for example, by reordering the
filter operator (see Line 41) as indicated by the gray dashed arrows. To combine the
results of the nested part of the query with the remaining outer part, an explicit outer
join operator is introduced into the dataflow.
Nested sub-queries are available, for example, in Impala [Kornacker et al. 2015],
which allows both correlated and uncorrelated sub-queries in the WHERE clause and
rewrites these into join queries. The Scope optimizer [Chaiken et al. 2008] optimizes
correlated sub-queries by applying combinations of outer and semi-joins together with
user-defined combiners.
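The following Python sketch illustrates the idea on toy data (hypothetical relations; this is not the actual rewrite performed by Impala or Scope): the correlated variant re-evaluates the sub-query for every outer record, while the unnested variant aggregates the inner relation once and combines the result with the outer part via a join-like key lookup.

```python
from collections import Counter

outer = [("AAPL", 2001), ("ENRN", 2001), ("PETS", 2002)]  # (company, year)
inner = [2001, 2001, 2002]          # years with detected bankruptcies
threshold = 2

# Correlated evaluation: the sub-query is recomputed per outer row.
correlated = [r for r in outer
              if sum(1 for y in inner if y == r[1]) >= threshold]

# Unnested evaluation: aggregate the inner part once, then combine the
# qualifying years with the outer part (a key lookup standing in for the
# explicit join introduced by the optimizer).
qualifying = {y for y, c in Counter(inner).items() if c >= threshold}
unnested = [r for r in outer if r[1] in qualifying]

assert correlated == unnested == [("AAPL", 2001), ("ENRN", 2001)]
```

Besides avoiding repeated evaluation, the unnested form exposes the filter and aggregation as ordinary operators that can be reordered like any other.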

3.4. Algebraic Dataflow and Predicate Simplification


UDFs are often configured with complex predicates, which greatly impact the per-
formance of the UDF itself and the overall dataflow. To combine multiple predicates,
Boolean expressions are applied, which might contain redundancies, and a naïve
evaluation of such predicates leads to unnecessary computation.
Simplification of redundant predicates involves rewritings based on the well-known
Boolean algebra, which were summarized for relational query optimization by Özsu
and Valduriez [1999, Chapter 9]. Some of the transformation rules can be applied in
any situation, such as the removal of double negation. For example, the filter predicate
applied in Lines 45–47 of the Meteor query shown in Listing 1 can be simplified using
laws of the Boolean algebra into the statement shown in Listing 7.


Fig. 5. Unnested correlated sub-query from Listing 6. Grey dashed arrows indicate reorder options in the
unnested query.

Listing 7. Excerpt of Meteor query from running example, algebraic predicate simplification.

Other rules directly affect the execution order of predicates; for example, consider
two predicates p1 and p2, which are combined with the Boolean or operator (∨). Since ∨
is commutative (i.e., executing p1 ∨ p2 is equivalent to executing p2 ∨ p1 ), any execution
order of p1 and p2 is correct but may yield different costs [Minker and Minker 1980].
Therefore, each such rewrite should be based on cost estimates or a combination of
cost estimates and probability values for predicates p1 and p2 [Hanani 1977; Brown
1998]. Rewriting should also ensure that the semantics of logical predicates is preserved.
For example, the Boolean operator ∧ is sometimes implemented with lazy evaluation
(denoted by &&). In contrast to p1 ∧ p2 , the semantics of p1 && p2 is defined as “first
evaluate p1 and evaluate p2 only if p1 is true.” Since some expressions of p1 and p2
might have side effects, they are not commutative and should thus not be rewritten
to p2 && p1 . Simplification of Boolean expressions is available in many compilers for
dataflow languages, for example, in PigLatin [Olston et al. 2008], Hive [Thusoo et al.
2009], or Spark SQL [Zaharia et al. 2010].
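The algebraic laws behind such rewrites can be checked exhaustively over all truth assignments, as in the small Python sketch below; note that it covers only side-effect-free predicates, in line with the caveat about lazily evaluated && above.

```python
from itertools import product

# Each law holds under every truth assignment, so an optimizer may apply
# it without changing query results (for side-effect-free predicates):
for p1, p2 in product([False, True], repeat=2):
    assert (not (not p1)) == p1              # double negation
    assert (p1 and (p1 or p2)) == p1         # absorption
    assert (p1 or p2) == (p2 or p1)          # commutativity of "or"
    assert (not (p1 and p2)) == ((not p1) or (not p2))  # De Morgan
```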
Removal of idempotent UDFs is particularly important for dataflow simplification.
We call a unary UDF o idempotent if it produces the same result regardless of how
often it is applied to some input I: o(o(I)) ≡ o(I). An n-ary UDF is idempotent if,
when applied to n equal input values, it returns that value as its result. For
example, a function that determines the minimum of two equal inputs is idempotent,
since min(I1 , I1 ) = I1 ; a UDF that merges text annotations is also idempotent.


Eliminating multiple calls to idempotent UDFs from a dataflow before its execution
is always beneficial, for example, selections and projections on the same predicates
are idempotent [Özsu and Valduriez 1999]. For UDFs, idempotency needs to be anno-
tated or derived to eliminate redundant executions of o. This technique is, for instance,
applied in the SOFA optimizer [Rheinländer et al. 2015].
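A simple sketch of this elimination for unary operators annotated (or derived) as idempotent; the pipeline representation and operator names are illustrative, not SOFA's internal model.

```python
def eliminate_idempotent(pipeline, idempotent):
    """Drop adjacent duplicate calls, exploiting o(o(I)) == o(I) for
    operators known to be idempotent."""
    result = []
    for op in pipeline:
        if result and result[-1] == op and op in idempotent:
            continue  # redundant call: the previous occurrence suffices
        result.append(op)
    return result

pipeline = ["tokenize", "merge annotations", "merge annotations", "filter"]
assert eliminate_idempotent(pipeline, {"merge annotations"}) == \
       ["tokenize", "merge annotations", "filter"]
```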
4. SEMANTIC ANALYSIS OF UDFS
To enable dataflow rewritings that involve reordering, introduction, or removal of op-
erators, UDF metadata (e.g., to describe selectivity estimates or semantic properties
like associativities or commutativities) need to be available to the optimizer. Accu-
rately describing the semantics of UDFs in terms of optimization is one of the most
persistent challenges for dataflow optimization, since UDFs exhibit all sorts of
behaviors that are difficult to describe in a general, optimizer-friendly way. In contrast to
relational operators, whose semantics is well understood [Özsu and Valduriez 1999],
UDFs have different properties than the classical relational ones and finding the right
set of properties for describing UDFs in terms of optimization is not trivial. Moreover,
optimizing dataflows with UDFs requires novel approaches for dataflow transforma-
tions, which are usually captured in concrete rewrite rules or described by precedence
constraints contained in a dataflow. This section discusses advantages and drawbacks
of different techniques for accurately describing UDF semantics, that is, by providing
manual annotations, by automatically analyzing program code, and by using hybrid
approaches.
We say two operators oi , o j contained in a dataflow D are in a precedence relation
if D contains a path from oi to o j and o j accesses information modified or created by
oi , but not vice versa (i.e., oi does not access information modified or created by o j ) [Cytron
et al. 1991]. In the context of dataflows, precedence relationships are data dependen-
cies. We call the set of all precedence relationships between pairs of operators within a
dataflow precedence constraints. Two operators oi , o j can be reordered if either a spe-
cific rewrite rule exists or if they are not in a precedence relation. Rewrite rules either
allow reordering pairs of operators oi , o j or enable replacements of operator oi with a
semantically equivalent sub-flow of operators (ok, . . . , ol ). In the following, we distin-
guish three classes of approaches that address the problem of describing the semantics
of UDFs, namely (1) approaches that annotate UDFs manually, (2) approaches that
perform automated code analysis to infer semantics, and (3) approaches that combine
(1) and (2).
4.1. Annotation of UDF Semantics
Manually annotating UDFs with properties that describe their semantics, their be-
havior, and defining concrete rewrite rules for UDFs requires a deep knowledge of
the application domain. Domain- and application-specific rewrite rules are available
for different areas, such as dataflows for information extraction, ETL flows, or scien-
tific workflows. In the following, we briefly summarize some of these domain-specific
approaches.
Wachsmuth et al. [2011] define rules for rewriting information extraction dataflows
based on domain-specific knowledge. These rules include decomposition of extraction
tasks, early filtering, and lazy evaluation. User-defined extraction tasks are distin-
guished into required and optional tasks. Optional tasks are only executed if needed
to satisfy a concrete information need. For example, an extraction dataflow, as shown
on the left of Figure 4, is decomposed into its components as shown on the right side
of this figure and each task is examined whether it is needed to produce the output
result. In this example, all tasks are needed and thus included in the execution plan
for the shown flow.


Ogasawara et al. [2011] propose an algebraic approach for task definition for the
scientific workflow system Chiron [Ogasawara et al. 2013] to optimize scientific work-
flows with many UDFs. UDFs are manually assigned to a fixed set of six operations
(e.g., Map, Reduce, Filter) based on the ratio of consumption and production between
input and output records. For example, the annotate entities UDF from Example 1
would be assigned to Map if it produces and consumes exactly one record. Similarly,
a group by UDF would be mapped to Reduce, since it groups a set of n records into
a single record based on a given grouping attribute. By applying these assignments,
UDFs obtain clear semantics and can be optimized through algebraic transformations
and by analyzing producer-consumer dependencies between operators.
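A toy Python sketch of such an assignment based on consumption/production ratios (a simplification of Chiron's six operations; the encoding of ratios and the residual category are hypothetical):

```python
def classify(consumed, produced):
    """Map a UDF to an algebraic operation from the ratio of records it
    consumes and produces ('n' denotes a key group; simplified sketch)."""
    if consumed == 1 and produced == 1:
        return "Map"      # one record in, one record out
    if consumed == "n" and produced == 1:
        return "Reduce"   # a key group collapses into a single record
    if consumed == 1 and produced in (0, 1):
        return "Filter"   # a record either passes or is dropped
    return "Other"        # remaining ratios need further operations

assert classify(1, 1) == "Map"       # e.g., annotate entities
assert classify("n", 1) == "Reduce"  # e.g., a group by UDF
```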
Simitsis et al. [2005] reason about manually annotated properties of ETL flows
such as schema information, operator adjacency in ETL processes, or the number of
consumers and producers to determine possible optimizations.
Carman Jr. et al. [2015] provide rewrite rules for processing XML documents with
XQuery on top of Algebricks [Borkar et al. 2015] and the Big Data processing system
Hyracks [Borkar et al. 2011]. Rules for rewriting path expressions attempt to remove
redundant parts of the dataflows introduced by unnesting operations (see Section 5.2).
Further rewrite rules perform plan transformations to enable a parallel computation
of certain XQuery constructs (e.g., data access, aggregation, join).
Kougka and Gounaris [2013] annotate and analyze binding patterns to determine
precedence constraints between operators in dataflows. This technique is derived from
query optimization in the context of web services [Srivastava et al. 2006]. First,
a global schema is defined for the dataflow and annotations of produced and con-
sumed attributes are created. Consider Figure 6, which shows a global schema for
the ie_pipeline of Listing 1 (see Lines 3–14) together with information on produced
and consumed attributes. Recall that precedence constraints exist between any two
operators a, b if a produces attributes consumed by b or vice versa. In our example
from Figure 6, precedence constraints exist, for example, between sentence and token
annotation and between entity and relationship annotation. The only operator that
can be placed at any position of ie_pipeline is filter f2, since it is not in a precedence
relationship with any other operator.
To improve optimization of dataflows with UDFs, Spark, AsterixDB, and Stratosphere
allow users to manually annotate UDFs and to provide rewrite rules for them. In the
following section, we discuss methods to automatically derive UDF semantics, such as
producer-consumer relationships, through code analysis.

4.2. Inference of UDF Semantics Through Code Analysis


Inferring the semantics of UDFs and precedence relationships between pairs of UDFs
automatically without manual annotations provided by the user requires techniques
that deeply analyze the UDF’s program code without executing it.
Hueske et al. [2012] show that analyzing a few properties of UDFs suffices to enable
semantically equivalent reordering of user-defined selections, joins, and certain aggre-
gations without having access to UDF annotations. Particularly, they define read and
write sets, which are similar to the notion of producers and consumers. Provided that
complete information on read/write sets is available, two consecutive record-at-a-time
UDFs A, B (e.g., filter, projection, map, . . .) can be safely reordered if no read/write
conflicts between A and B exist, that is, if A does not access attributes modified by B
and A does not modify attributes modified or accessed by B and vice versa. Similarly,
key-at-a-time UDFs (e.g., reduce, aggregations, joins) can be reordered if they have no
read/write conflicts and if either all or no records with a specific grouping key k are re-
moved during reordering (so-called key group preservation). Read/write sets can either


Fig. 6. Global schema and read/write set information for ie_pipeline sub-flow. Orange boxes indicate for
each UDF produced (written) attributes and green boxes indicate consumed (read) attributes. UDFs are in
grey.

Listing 8. Three address code for filter f2 used for static code analysis.

be annotated manually or be obtained from static code analysis as shown in Hueske et al. [2013].
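The reordering condition for record-at-a-time UDFs can be stated compactly; the following Python sketch checks two UDFs for read/write conflicts (the attribute indices are illustrative, loosely following Figure 6):

```python
def conflict_free(a_reads, a_writes, b_reads, b_writes):
    """Two record-at-a-time UDFs may be reordered if neither one writes
    an attribute the other reads or writes (no read/write conflicts)."""
    return not (a_writes & (b_reads | b_writes)) and \
           not (b_writes & (a_reads | a_writes))

# filter f2 only reads attribute 1; a hypothetical annotator reads
# attribute 3 and writes attribute 4 -> no conflict, safe to swap:
assert conflict_free(a_reads={1}, a_writes=set(),
                     b_reads={3}, b_writes={4})
# two UDFs that both write attribute 4 conflict and may not be swapped
# on read/write information alone:
assert not conflict_free({3, 4}, {4}, {3, 4}, {4})
```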
To enable code analysis, a given dataflow is first analyzed to infer a global schema and
each dataflow operator is translated into three-address code (TAC). Figure 6 displays
a global schema for the UDF ie_pipeline from Listing 1 together with annotations of
read and written attributes. Green marked attributes in Figure 6 correspond to the
read set of each UDF and orange marked attributes constitute the write set of each
UDF. Listing 8 shows the TAC of the filter f2 from our running example. The read set is
determined by scanning the UDF’s code for statements of the form $t:=getField($i,n),
which assign the content of attribute n of the input record i to a temporary variable
$t, and all subsequent uses of $t in the code. If $t is used, then it is added to the read
set of the UDF. In Listing 8, Line 2 contains a getField statement, which assigns the
content of the attribute at index position 1 of the input record to a temporary variable
$a. Subsequently, $a is used to perform the actual filter operation (see Line 3). Since no
other attribute of the input record is accessed, the read set only consists of this attribute.


In the same way, static code analysis detects explicit operations that copy, project,
modify, or add single attributes to the output to infer the write set of a UDF. Here,
the static code analysis algorithm starts to collect statements of the form emit($o),
which emit the output record of a UDF, and tracks the origin of $o to identify column
projections. Remaining attributes of the write set are identified by collecting statements
of the form setField($o,n,$a) from the TAC of a UDF. In our example from Listing 8,
the write set is empty, since no attribute is modified.
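A much-simplified Python sketch of this scan over TAC statements (regex-based and hypothetical; the full analysis of Hueske et al. additionally tracks all subsequent uses of temporaries and the origin of emitted records):

```python
import re

def read_write_sets(tac_lines):
    """Derive read/write sets from three-address code by scanning for
    getField/setField statements (simplified sketch)."""
    reads, writes = set(), set()
    for line in tac_lines:
        for m in re.finditer(r"getField\(\$\w+,\s*(\d+)\)", line):
            reads.add(int(m.group(1)))
        for m in re.finditer(r"setField\(\$\w+,\s*(\d+),", line):
            writes.add(int(m.group(1)))
    return reads, writes

# TAC of filter f2 (cf. Listing 8): only attribute 1 is read, none written.
tac = ["$a := getField($i, 1)",
       "if contains($a, 'June'): emit($i)"]
assert read_write_sets(tac) == ({1}, set())
```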
Guo et al. [2012] apply code analysis not only to determine precedence relationships
between UDFs to enable code reordering and early filtering but also to identify op-
tions for reducing attributes automatically. To perform attribute reduction, the set of
output attributes, which is used by subsequent UDFs in the dataflow topology, is deter-
mined. Unused output attributes and UDFs, which solely contribute to computing the
removed attributes, are removed. For example, in Figure 6 the attribute at index 0 can
be removed, because it is not accessed by any of the UDFs. Fan et al. [2015] present
an approach for optimizing the placement of filters inside the body of a UDF using
static code analysis. Static code analysis as carried out in Manimal [Cafarella and Ré
2010; Jahani et al. 2011] also does not consider UDF semantics but statically analyzes
the code of Map/Reduce programs to identify selections, projections, and opportuni-
ties for data compression. Static code analysis as presented by Hueske et al. [2012]
is implemented in Stratosphere, Manimal [Cafarella and Ré 2010; Jahani et al. 2011]
is available for Hadoop, and the techniques of Fan et al. [2015] and Guo et al. [2012]
are implemented in the Scope system.
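The core of the attribute-reduction idea can be sketched in a few lines of Python (read sets and attribute indices are illustrative; a faithful implementation would also remove UDFs that only compute pruned attributes):

```python
def removable_attributes(schema, udfs, output_attrs):
    """Attributes that no UDF reads and that do not appear in the final
    output can be pruned from the dataflow."""
    read_somewhere = set().union(*(u["reads"] for u in udfs))
    return set(schema) - read_somewhere - set(output_attrs)

# Read sets loosely modeled after Figure 6 (indices are illustrative):
udfs = [{"reads": {1}}, {"reads": {3}}, {"reads": {3, 4}}]
schema = {0, 1, 2, 3, 4}
# attribute 0 is never read and not required in the output:
assert removable_attributes(schema, udfs, output_attrs={2, 4}) == {0}
```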
4.3. Hybrid Approaches
Although static code analysis already enables quite a few optimizations for dataflows
with UDFs, including semantic information on the UDFs' behavior and concrete
rewrite rules for UDFs into the optimization process is crucial to enable advanced
dataflow transformations. As explained earlier, annotating semantic information and
rewrite rules for many combinations of UDFs is time consuming and requires a deep
knowledge of the application domain. Therefore, hybrid approaches, which combine
manual annotations with automated UDF analysis and static code analysis, aim at
comprehensive optimization of dataflows with UDFs. To minimize annotation effort,
Rheinländer et al. [2015] propose to arrange UDFs and properties relevant to dataflow
optimization (e.g., parallelization function, number of input and output channels, alge-
braic properties) in taxonomies. Relationships within and between those taxonomies
concisely describe the semantics of related UDFs. Figure 7 shows an excerpt of a tax-
onomy for information extraction UDFs from Listing 1 together with an excerpt of a
property taxonomy, which describes UDF properties relevant for dataflow optimiza-
tion. Black arrows between UDFs (properties, respectively) indicate generalization-
specialization relationships, for example, organization is a specialization of the entity
annotation UDF. Bold blue arrows indicate dependencies between UDFs, for exam-
ple, the UDF for entity annotation requires token annotation to be executed upfront.
Dashed orange arrows characterize properties of UDFs relevant for optimization; for
example, a UDF annotate has a single input and processes one record at a time. Since
all UDFs arranged below are specializations of annotate, they inherit the optimization
properties related to annotate.
Rewrite templates as shown in the Datalog code in Listing 9 are built on UDF
properties and abstract UDFs. For example, the first rewrite template (Lines 1–3)
states that a distributive, dual-input bag-at-a-time UDF (e.g., union, intersection) can
be rewritten with a single-input record-at-a-time UDF (e.g., annotate, filter, record
transform). The second template (Lines 5 and 6) is specific to the annotate UDFs
shown in Figure 7 and states that any two annotate UDFs X, Y can be reordered if X is


Fig. 7. Excerpt of an operator (bottom, blue cornered boxes) and property (top, green rounded boxes) taxon-
omy for UDFs together with relationships between operator pairs and operator-property pairs.

Listing 9. Exemplary rewrite templates in Datalog notation. Templates are instantiated with concrete
operators and evaluated according to the taxonomic relationships shown in Figure 7.

not in a precedence relationship with Y. To decide whether X and Y can be reordered,
the templates are instantiated with concrete UDF pairs and all conditions in the body
of the template are checked for X and Y by reasoning along taxonomy relationships. X
and Y can be reordered, if either a rule reorder(X,Y) evaluates to true or if read/write
set analysis yields no conflicts.
Consider the read/write set annotations for the three consecutive annotate enti-
ties UDFs shown in Figure 6. Using read/write set analysis only, reordering of these
UDFs is not possible, since each annotate entities UDF modifies attribute 4 of the
global schema. At the same time, all annotate entities UDFs have an append-only
semantics, that is, they only add entity annotations to records with possibly existing
annotations but never delete any existing entity annotations. For this reason, reordering
annotate entities operators is a safe operation that yields semantically equivalent
plans, and the rewrite template shown in Listing 9 enables such a reordering. The
semantics of annotate entities is captured in the operator taxonomy of Figure 7 by the
fact that no annotate entities UDF is in a precedence relationship with any other
annotate entities UDF.
Vassiliadis et al. [2009] propose taxonomies for characterizing semantics of non-
relational functions for ETL workflows. Rheinländer et al. [2015] use the taxonomy
and the rewrite template scheme introduced above together with automated code and
read/write set analysis for dataflow optimization in Stratosphere.
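A toy Python sketch of template instantiation over such a taxonomy (operator names, the taxonomy, and the precedence set are illustrative, not SOFA's actual model): ancestors are collected along is-a edges, and the template fires only if no precedence relationship holds between any generalizations of the two operators.

```python
# Toy taxonomy (child -> parent) and precedence constraints:
is_a = {"organization": "annotate entities", "person": "annotate entities",
        "annotate entities": "annotate", "annotate tokens": "annotate"}
precedence = {("annotate tokens", "annotate entities")}

def ancestors(op):
    """The operator itself plus all its generalizations."""
    result = {op}
    while op in is_a:
        op = is_a[op]
        result.add(op)
    return result

def reorderable(x, y):
    """Template: two annotate UDFs may be reordered if no precedence
    relationship holds between them (checked along the taxonomy)."""
    ax, ay = ancestors(x), ancestors(y)
    if "annotate" not in ax or "annotate" not in ay:
        return False  # this template only applies to annotate UDFs
    return not any((a, b) in precedence or (b, a) in precedence
                   for a in ax for b in ay)

assert reorderable("organization", "person")        # both pure annotators
assert not reorderable("annotate tokens", "person") # tokens come first
```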

5. OPTIMIZATION BY DATAFLOW TRANSFORMATION


This section surveys techniques for dataflow transformation on the logical and physical
level in the presence of UDFs, which can be applied after a dataflow has been trans-


formed syntactically, contained UDFs have been analyzed regarding their semantics,
and existing precedence relationships and rewrite options have been identified. Logi-
cal dataflow transformation optimizes the logical execution order of dataflow operators
while respecting precedence constraints. Physical dataflow transformation optimizes
concrete execution strategies of a dataflow, for example, by choosing appropriate op-
erator implementations depending on the data to be analyzed and the execution en-
vironment, by optimizing data transfer across compute nodes, or by optimizing the
execution of operators through introducing intermediate caching and indexes. In the
following, we review techniques for logical and physical transformation, which we con-
sider most promising for dataflows with UDFs, namely (1) operator composition and
decomposition, (2) redundancy elimination, (3) predicate and operator migration, (4)
partial aggregations, (5) semi-join reducers, and (6) choice of operator implementation.
5.1. Operator Composition and Decomposition
Complex operators are abstractions of partial dataflows that contain multiple simpler
operators, and we refer to these as elementary operators in the following. Elementary
operators are, for example, non-relational operators for annotating texts or for parsing
XML documents or relational operators, such as filter, join, or group by. In contrast
to UDFs defined ad hoc as employed in Listing 1, complex operators are directly em-
bedded into dataflow languages and provide shortcuts to adding multiple elementary
operators to a dataflow [Rheinländer et al. 2015; Simitsis et al. 2012]. The semantics
of complex operators is often more clearly defined compared to ad hoc UDFs and can
be inferred by an optimizer through evaluating provided operator annotations or code
analysis. Decomposing complex operators into contained elements is important for op-
timizing dataflows with UDFs, since complex operators exhibit different semantics
than their elementary components in many cases as described below. Furthermore,
these operators are mostly designed to address non-relational needs, although some
contained parts may be of relational origin (e.g., filter, projection). For example, con-
sider a complex operator tokenize for splitting text into separate words. This UDF
internally consists of two elementary operators, annotate tokens and split tokens.
The annotate tokens operator recognizes start and end positions of individual words in
text by analyzing white spaces and other delimiters. The split tokens operator splits
the text into separate words based on the boundaries detected by annotate tokens.
Tokenize and its elementary components differ in terms of prerequisites regarding the
properties of the input data (i.e., existing sentence boundary annotations), accessed and
modified attributes of the input data, and the ratio between input (I) and output (O)
data as shown in the left part of Figure 8. The operator annotate tokens only produces
information on existing token boundaries in the input, while split tokens actually
splits the input data into multiple smaller units. This yields many output records for
each analyzed text depending on the number of identified token boundaries. Thus,
executing the split tokens operator as late as possible is highly beneficial in many
complex analytical flows. Such a rewrite option is not detectable without decomposing
tokenize into its elementary components first, since the global behavior of tokenize
conforms to the split tokens operation, where text is analyzed and in turn is split up
into individual words.
In other cases, only the reverse operation, that is, composing multiple elementary
operators into a single complex UDF, enables logical operator reorderings. Consider the
right part of Figure 8, which displays two text processing UDFs, expand synonyms and
set default, in the upper part. In the figure, we list for each UDF properties relevant
during the optimization process, that is, the read/write set of consumed and produced
attributes (indicated by dark-green and orange colored fields), the I/O ratio of the
respective operators, and the set of prerequisites necessary for a successful execution of

ACM Computing Surveys, Vol. 50, No. 3, Article 38, Publication date: May 2017.
38:22 A. Rheinländer et al.

Fig. 8. Composition (right) and decomposition (left) of complex, user-defined operators.

the respective UDF. For example, for executing the UDF tokenize successfully, sentence
annotation needs to be performed in advance. Similarly, split tokens requires that
token boundaries are annotated.
The UDF expand synonyms augments existing entity annotations with synonyms
from a dictionary and set default selects one of these synonyms as representative
and removes all others. Reordering these two UDFs with other operators is limited,
since both consume and produce existing entity annotations. However, a complex UDF
normalize entity, as shown at the bottom of the figure, is potentially reorderable. After
composing expand synonyms and set default, modifications of entity annotations affect
only the internals of existing annotations, and the number of records and annotations
remains unchanged. An analogous example of the benefits of operator composition in the
relational world is that a cross-product followed by a selection can be replaced with a
join if the selection has a predicate involving the two inputs to the cross-product.
During optimization on the logical level, dataflows with UDFs should therefore be
examined for contained complex operators, which are decomposed into elementary
components, and the rewritten dataflow should be analyzed again to detect further
reorder options. Similarly, consecutive elementary operators in a dataflow should be
analyzed whether a composition of these yields beneficial reorderings. Simitsis et al.
[2005] introduce a rather general approach called factorize and distribute for optimizing
operators with two inputs by composition and decomposition in the context of ETL
dataflows. The left part of Figure 9 displays a partial dataflow consisting of a dual-
input operator o3 together with two single-input operators o1 , o2 . The operators o1 , o2
have the same semantics and functionality but are applied to different inputs I1 , I2
before executing o3 . Assuming that o1 and o2 are not in a precedence relation with o3 ,
the dataflow can be rewritten in the following way: o1 and o2 are removed from the
graph, and the dual-input operator o3 is applied to I1 and I2 to produce an intermediate
dataset I′. A new operator o4, which combines the functionality of o1 and o2, is inserted
after o3 and processes I′. Note that this transformation can also be applied in the
reverse direction by replacing o4 with o1 and o2 . Operator decomposition is available
as part of the SOFA optimizer [Rheinländer et al. 2015] for Stratosphere.
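The factorize rewrite can be sketched with hypothetical stand-ins: a per-record lowercase operator plays the role of the identical operators o1 and o2, and a union plays the dual-input operator o3. The rewrite is only valid under the stated assumption that o1, o2 are not in a precedence relation with o3.

```python
# Hypothetical stand-ins for the operators in Figure 9.
def lowercase(records):              # o1 and o2: identical semantics
    return [r.lower() for r in records]

def union(i1, i2):                   # o3: dual-input operator
    return i1 + i2

I1, I2 = ["Foo"], ["BAR"]
original = union(lowercase(I1), lowercase(I2))   # o3(o1(I1), o2(I2))
factorized = lowercase(union(I1, I2))            # o4(o3(I1, I2))
assert original == factorized        # the rewrite preserves the result
```

Factorizing saves one operator instantiation; the reverse (distribute) direction can instead expose per-input parallelism.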
5.2. Redundancy Elimination
Redundancy is often introduced by view resolution and similar data access macros,
that is, the view defines more than is required for the dataflow at hand. Redundancy
Optimization of Complex Dataflows with User-Defined Functions 38:23

Fig. 9. Factorize and distribute of user-defined operators.

elimination is particularly important for dataflows with UDFs, since these flows of-
ten contain many compute-intensive UDFs. We consider operators to be redundant if
results produced by such an operator are never consumed by any other operator or
data sink contained in a dataflow, that is, data produced by such an operator does not
contribute to the final result set in any form. Such operators can be safely removed to
(a) shrink the size of the dataflow to remove complexity from downstream optimization
and scheduling phases and, more importantly, (b) to reduce the overall execution time.
Dead code elimination originally emerged from compiler optimization, and a clas-
sical approach was introduced by Cytron et al. [1991]. In this approach, a dataflow
is transformed into an intermediate representation called a static single assignment
(SSA). SSA requires that (1) all intermediate datasets are identified by variables,
which are properly defined before usage, and (2) that every variable is assigned only
once in a dataflow to create explicit representations of producer-consumer chains. This
requirement can be achieved by versioning existing dataflow variables.
Necessary and redundant operators can now be identified as follows: First, all data
sinks are added to a working list (a queue). An algorithm for analyzing the critical path of a
dataflow repeatedly extracts one item k from the working list and marks it as required
for the dataflow. All operators, which have not yet been marked as required and which
produce an output that is consumed by k, are added to the working list. The algorithm
continues until the working list is empty. All remaining operators that
have not been marked as required do not contribute to any result set of the dataflow
and can therefore be safely removed from the optimized dataflow.
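The worklist procedure can be sketched as follows; the sink, r2, and loc identifiers mirror the SSA example in Figure 10, while the remaining operator names and the producer map are invented for illustration.

```python
# Sketch of the worklist-based critical-path analysis described above.
# Operator names beyond sink/r2/loc are invented.
def find_required(sinks, producers):
    """producers maps an operator id to the ids of operators whose
    output it consumes; returns the set of required operators."""
    required = set()
    worklist = list(sinks)
    while worklist:
        k = worklist.pop()
        if k in required:
            continue
        required.add(k)                        # k contributes to a sink
        worklist.extend(producers.get(k, []))  # so do all its producers
    return required

producers = {
    "sink": ["r2"],
    "r2": ["join"],
    "join": ["src1", "src2"],
    "loc": ["src1"],       # loc's output never reaches a sink
}
required = find_required(["sink"], producers)
dead = set(producers) - required   # operators safe to remove
print(dead)  # -> {'loc'}
```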
Consider Figure 10, where each operator from Listing 1 is assigned a variable in
SSA, shown in yellow cornered boxes beneath each operator. The critical operator path
is also shown and is marked with red arrows. Individual steps conducted during critical
path analysis are shown on the right side of the figure bottom-up. Analysis starts by
adding the variable sink to the working list (see first step), which is subsequently
marked as required and removed from the working list. At the same time, all variables
identifying operators that produce a result set consumed by sink are added to the
working list. In our example, variable r2 is added to the working list (see second step).
The algorithm continues with this procedure until all sources are added to the set of
required operators in step 35 and the working list is empty. One operator, which is
identified by the SSA variable loc, remains unmarked after all sources were processed.
This operator is unnecessary to compute any result set and can therefore be removed
from the dataflow.


Fig. 10. Dead code elimination using static single assignment (SSA) and critical path analysis. SSA assign-
ments are contained in yellow boxes beneath each operator and the critical path is marked with red arrows.
Dataflow edges are not displayed to ease readability.

In complex analytical dataflows, common sub-expressions occur frequently: intermediate
results are processed in many different ways (e.g., joined, aggregated, transformed),
and a repeated execution of such common expressions is dispensable.
Silva et al. [2012] propose an extension to the Scope optimizer, leveraging query finger-
prints and physical properties of shared groups to identify and optimize the execution
of common sub-expressions in distributed settings. Neumann [2005] discusses shar-
ing and re-using common operators and sub-expressions in the context of relational
databases to generate efficient, DAG-shaped query plans.
In Big Data processing systems, redundant operator removal is carried out by using
code analysis similar to SSA as available in SCOPE and Apache Flink [Fan et al. 2015;
Alexandrov et al. 2015]. Alternatively, redundant operator removal can be performed
by applying rewrite rules based on operator semantics [Rheinländer et al. 2015] or
by a structural analysis of the dataflow graph [Heise et al. 2012] as available in
Stratosphere.

5.3. Predicate and Operator Migration
A fundamental difference between the optimization of relational queries and of dataflows
with UDFs is the presence of expensive predicates, which in the latter case are often
executed in combination with non-relational or filter operations and are applied to
unstructured or semi-structured data. Consider a filter operation that filters a large
collection of XML documents based on XPath predicates. To evaluate the filter condition,
both the XPath expression and all documents need to be parsed and compared
to each other, which is often expensive [Gottlob et al. 2005]. In contrast, filter predi-
cates in relational settings, where key columns, numbers, or strings are predominantly
compared, are much cheaper to evaluate.
In general, early filter execution potentially reduces the set of candidate records
dramatically. Therefore, many dataflow optimizers (e.g., PigLatin [Olston et al. 2008],
Jaql [Beyer et al. 2011]) contain a heuristic that pushes filter predicates as close
as possible to the data sources similar to query optimizers in the relational world.
In the presence of expensive predicates, however, it might be more efficient to first
execute other operators that also reduce the number of records before evaluating the
expensive predicate. In the context of query optimization in object-relational databases,
Hellerstein proposed a predicate migration algorithm, which optimizes the placement
of expensive filter operations by assigning a rank to each filter predicate based on
selectivity and cost estimates [Hellerstein 1998]. Hellerstein showed that ordering
filter predicates in ascending rank order, without changing the execution order of
non-filter operators, yields optimal execution plans for linear dataflows. For
non-linear dataflows, predicate migration cannot guarantee optimal plans, since in
such settings, not only filter ranks need to be considered, but also implicit filter orders
need to be respected. For example, a filter that considers attributes from two different
data sources cannot be applied before the two data sources themselves are joined. To
overcome this limitation, Chaudhuri and Shim [1999] present a different algorithm
with complete rank ordering and search space pruning, which is polynomial in the
number of user-defined filter predicates.
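As an illustration, rank-based ordering might be sketched as follows; the rank formula (selectivity − 1) / cost follows Hellerstein's predicate migration work, while the predicates and their statistics are invented.

```python
# Illustrative rank-based ordering of filter predicates.
def order_filters(filters):
    """filters: (name, selectivity, cost_per_record) triples; returns the
    names in ascending rank order, rank = (selectivity - 1) / cost."""
    return [name for name, sel, cost in
            sorted(filters, key=lambda f: (f[1] - 1.0) / f[2])]

filters = [
    ("expensive_xpath_filter", 0.1, 100.0),  # rank = -0.009
    ("cheap_keyword_filter", 0.5, 1.0),      # rank = -0.5
]
print(order_filters(filters))
# the cheap filter runs first despite its weaker selectivity,
# because its rank is lower
```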
A different approach for reducing the execution costs of expensive predicates aims at
avoiding redundant invocations of expensive predicates or UDFs on duplicate values
through caching computed results. A well-known approach for caching the results of
UDFs is memoization, which builds a memory-based hash table storing the UDF’s
computed values for different input values [Michie 1968]. Hellerstein and Naughton
[1996] combine memoization and sorting in a hybrid caching scheme and show that this
approach yields significant improvements over applying either memoization or sorting
in most cases.
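A minimal memoization sketch using Python's built-in cache may illustrate the idea; the word-count UDF is an invented stand-in for any expensive, deterministic function.

```python
import functools

@functools.lru_cache(maxsize=None)   # hash table of previously computed results
def expensive_udf(text):
    # stand-in for a costly, deterministic analysis of `text`
    return len(text.split())

records = ["big data", "big data", "user defined function", "big data"]
results = [expensive_udf(r) for r in records]
print(results)                            # -> [2, 2, 3, 2]
print(expensive_udf.cache_info().misses)  # -> 2 (only two actual invocations)
```

Sorting the input by the UDF argument, as in the hybrid scheme above, additionally allows the cache to hold only a single entry at a time.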
To the best of our knowledge, handling of expensive predicates is not contained in any
data analytics system, and even cost-based dataflow optimizers do not specifically
address this problem.
5.4. Partial Aggregation
In contrast to UDFs that process single records one by one and that can be executed
on arbitrary data partitionings (e.g., annotator, filter, record transformer), UDFs based
on processing groups of records require a certain data partitioning scheme such that
all records, which exhibit the same key or grouping attribute, are processed together.
To reduce costs for data shipment in situations where a UDF o1 is executed before a
key-based UDF o2 , partial aggregations (so-called combiners) can be inserted into a
dataflow. Combiners pre-compute partial aggregation results locally at the processing
site of o1 directly before sending data to o2 .
Consider Figure 11, which shows an excerpt of the join and grouping operators con-
tained in the dataflow from Listing 1. We assume that the dataflow is executed on four
different nodes. In the upper part of the figure, the partial dataflow is shown with-
out introducing a combiner and the lower part displays the introduction of combiners.
Next to the dataflow, the respective record layout of the processed data is shown. With-
out introducing a combiner, the attributes 0–4 remain unchanged until executing the


Fig. 11. Partial dataflow executed on four nodes without combiner (top) and with combiner (bottom) to
reduce network traffic.

grouping operator, that is, many records with many (potentially complex) attributes
are sent over the network to remote compute nodes. The combiner exhibits the same
functionality as the subsequent grouping operation by aggregating joined records by
year and removing all annotations stored in attributes 0–4. Annotated relations are
grouped and are stored in attribute 5. Implicit schema reduction is applied in the com-
biner before network transfer, that is, first, the partial grouping operation is executed
and, second, contents of attributes not necessary for the final grouping operator are
removed. This significantly reduces the amount of data to be transferred. Finally, the
partial results are sent to the appropriate nodes to compute the final grouping.
Combiners can be applied to optimize the computation of averages, correlation, re-
gression, or any other grouping function that is associative and commutative (e.g., sum,
min, or max). Combiners are available in all state-of-the-art Big Data processing sys-
tems. Hash-based combine strategies, which create a hash map on the key attributes
to determine records to be combined, are usually more efficient and provide a better
combine rate than sort-based implementations and are therefore supported by many
systems (e.g., Tenzing, Flink, Spark).
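The combiner idea can be sketched as a local, hash-based pre-aggregation; a per-year count is an invented example of an associative, commutative aggregate standing in for the grouping of Figure 11.

```python
from collections import defaultdict

# Hash-based combiner sketch: each node pre-aggregates locally, so at most
# one record per key crosses the network.
def combine(partition):
    partial = defaultdict(int)
    for year, count in partition:
        partial[year] += count
    return list(partial.items())

def final_group(combined_partitions):
    total = defaultdict(int)
    for partition in combined_partitions:
        for year, count in partition:
            total[year] += count
    return dict(total)

node1 = [("2016", 1), ("2017", 1), ("2016", 1)]   # local records on node 1
node2 = [("2016", 1), ("2017", 1)]                # local records on node 2
shipped = [combine(node1), combine(node2)]        # 4 records shipped, not 5
result = final_group(shipped)
print(result)  # -> {'2016': 3, '2017': 2}
```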
Next to partial aggregation before data shipment, placement of aggregation-based
operators in dataflows is crucial. Push-down or pull-up of aggregations as introduced
by Yan and Larson [1994] optimizes the placement of record groupings depending on
the selectivity of join operations. Pushing group-bys towards the data sources reduces
the number of input records for operations that combine datasets (e.g., join). Pulling
group-bys towards the data sinks is deemed beneficial in settings where selective filters
are contained in a dataflow, which significantly reduce the number of input records
for the aggregation operation. Follow-up research [Yan and Larson 1995] describes
eager and lazy aggregation for optimizing settings where the achieved amount of data

ACM Computing Surveys, Vol. 50, No. 3, Article 38, Publication date: May 2017.
Optimization of Complex Dataflows with User-Defined Functions 38:27

reduction does not justify pushing or pulling the entire group-by but is beneficial only for
parts of the input. To perform eager or lazy aggregation, the aggregation function f
itself needs to be decomposable, such that, when applied to two inputs I1, I2, f(I1 ∪a
I2) ≡ f′(f′(I1), f′(I2)) holds. For example, aggregate functions such as min or sum are
decomposable, since min(I1 ∪a I2) is equivalent to min(min(I1), min(I2)). The decision to
perform eager or lazy aggregation should be taken cost based and, to our knowledge,
this technique has not yet been evaluated or implemented in the context of Big Data
processing systems.
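The decomposability condition can be illustrated directly; the datasets are invented, and the (sum, count) trick for the average is a standard way of making an aggregate that is not directly decomposable eligible for partial aggregation.

```python
# Decomposability sketch: f(I1 ∪ I2) = f'(f'(I1), f'(I2)), with invented data.
I1, I2 = [7, 3, 9], [5, 2, 8]

# min is decomposable with f' = min itself
assert min(I1 + I2) == min(min(I1), min(I2))

# avg is decomposable only via a (sum, count) pair merged before division
def partial_avg(part):
    return (sum(part), len(part))

def merge_avg(partials):
    s = sum(p[0] for p in partials)
    c = sum(p[1] for p in partials)
    return s / c

assert merge_avg([partial_avg(I1), partial_avg(I2)]) == sum(I1 + I2) / 6
```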
5.5. Optimization of Communication Costs by Semi-join Reduction and Other Methods
Network I/O is a main cost factor in dataflows, since executing such flows on Big
Data processing systems often requires shipping massive amounts of data to remote
compute nodes. In fact, Das Sarma et al. [2013] found that communication and data
shipment costs often dominate the total execution costs of dataflows. Transferring
data over the network requires broadcasting, re-partitioning, and shuffling of the data,
which greatly impacts the overall performance of data analytics systems. Thus, a main
objective during dataflow optimization is to reduce communication and data shipment
costs as much as possible.
In distributed database systems, semi-join reducers were introduced in the context
of join optimization with the goal of identifying matching records before an actual join
operator is executed. By inserting semi-join reducers into a dataflow, an optimizer can
prevent shipping records over the network that are not part of the final result, not
only for join operators but also for UDFs building on joins. Consider Figure 12, which
shows a distributed setting with three compute nodes, where a UDF ⋈3(A, B, C) for
performing a three-way join on the datasets A, B, C shall be executed. A is a mid-sized
dataset with many attributes stored on node 1, B is a large dataset with few attributes
stored on node 2, and C is a small dataset with a medium number of attributes for
each record stored on node 3. The computation of the three-way join can be performed
on any of these nodes at different costs. A rule of thumb in such scenarios is to ship the
smaller datasets to the nodes where the larger datasets reside to reduce communication
costs. Since dataset B is the largest dataset in our example, a cost-based optimizer
decides that all computation should be carried out at node 2, where B is stored and
A and C need to be shipped to node 2. Introducing semi-joins at nodes 1 and 3 now
attempts to reduce communication costs by first identifying records that qualify for
the join condition. Therefore, the dataflow optimizer introduces a projection and distinct
operation on the join attributes a and u of dataset B at node 2 and sends the resulting
intermediate datasets B′ and B″ to nodes 1 and 3, respectively. At nodes 1 and 3,
semi-joins of the form A ⋉ B′ and B″ ⋉ C, respectively, are executed to identify records
that contribute to the final result set of the three-way join.
These records are now sent to node 2, where the final join result is determined.
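The reducer idea, restricted to the two-way case A ⋈ B for brevity, might be sketched as follows; the record layout and contents are invented.

```python
# Two-way semi-join reducer sketch: node 2 ships only the distinct join-key
# column of B; node 1 ships back only A-records that can find a join partner.
def semijoin_reduce(A, B, key):
    b_keys = {rec[key] for rec in B}                 # project + distinct on B
    return [rec for rec in A if rec[key] in b_keys]  # semi-join A ⋉ B'

A = [{"a": 1, "payload": "x"},
     {"a": 2, "payload": "y"},
     {"a": 3, "payload": "z"}]
B = [{"a": 1, "u": 10}, {"a": 3, "u": 30}]
reduced = semijoin_reduce(A, B, "a")
print(reduced)  # the record with a=2 is never shipped to node 2
```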
Instead of sending entire datasets or join attributes over the network, bit vec-
tors [Chan and Ioannidis 1998; Valduriez 1987] or Bloom filters [Bloom 1970] can
be applied to exclude irrelevant records without actually evaluating the join predicate.
To accomplish this, a bit vector v of size n is created. Each value of the join attribute a
of dataset B is transformed into a new value in the interval [1, . . . , n] by using an ap-
propriate hash function and the corresponding bits in a bit vector v are set accordingly.
Similarly, a bit vector w is created for the join attribute u. The bit vector v is sent to
the remote node 1 and bit vector w is sent to node 3 together with the corresponding
hash functions to identify potential join partners, which are ultimately sent to node 2
for join processing. The sizes of both v, w shipped to nodes 1 and 3 are significantly
smaller compared to performing a projection on the join key column on node 2 and
sending the entire column to the remote nodes as employed in semi-join reducers. On


Fig. 12. Semi-join reduction of user-defined three-way join operator to decrease network traffic.

the other hand, the size of the set of join candidates shipped by bit vector filtering may
be significantly larger due to hash key collisions, since the benefits of filtering highly
depend on the choice of a proper hash function and the size of the bit vector. Bit vector
filtering is not only valuable for n-way joins but also for UDFs that involve two phases,
such as custom intersections, groupings, and so on. In the first phase, such operators
can compute a bit vector of the key attributes, which is populated to remote nodes to
skip records in the second phase [Miner and Shook 2012].
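Bit-vector filtering with a single hash function can be sketched as follows (a Bloom filter would use several); the vector size and datasets are invented. Collisions may admit false positives, but a genuine join partner is never dropped.

```python
# Bit-vector filtering sketch with one hash function.
def build_bitvector(keys, n):
    v = [False] * n
    for k in keys:
        v[hash(k) % n] = True
    return v

def probe(records, key, v):
    n = len(v)
    # keep only records whose join key might occur on the other side
    return [r for r in records if v[hash(r[key]) % n]]

b_keys = [10, 30]                      # join-key values of dataset B
v = build_bitvector(b_keys, n=64)      # 64 bits shipped instead of a column
A = [{"a": 10}, {"a": 20}, {"a": 30}]
candidates = probe(A, "a", v)          # a=20 is filtered out locally
```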
Semi-join reducers were first introduced by Bernstein et al. [1981] and later improved
by Apers et al. [1983] to reduce data shipment costs in distributed database queries.
Although a study showed that this technique was beneficial in distributed database
systems in the 1980s only for some types of queries [Bernstein and Goodman 1981],
it was re-discovered in the 1990s and 2000s to optimize different types of queries, for
example, star joins or top-n queries [Stocker et al. 2001]. Another potential optimization
similar to semi-joins is to, first, project out all attributes that are not part of the join
and primary keys; second, perform the join; and, third, fetch all tuples belonging to
the join result. This technique is effective in situations where a lot of data needs to be
processed by a join and has been applied, for example, in similarity join processing.
In Big Data processing systems, where huge datasets are shipped between nodes, this
technique is also promising to reduce costs of data transfer, particularly, when n-way
joins are involved. To the best of our knowledge, semi-join reducers are currently not
contained in any data analytics system’s optimizer but can be added manually by the


developer. There are plans to integrate semi-join reducers into the Calcite optimizer for
Hive with promising initial results, yet integration has not been finished as of October
2016.7
Apart from introducing semi-join reducers, the data shipment strategy itself is an
important part for adjustment during dataflow optimization [Lu and Carey 1985;
Mackert and Lohman 1986]. Intuitively, when two or more datasets or partitions shall
be combined in a distributed setting, two data shipment strategies are deemed bene-
ficial. These strategies are (1) ship as a whole, which ships entire datasets to remote
nodes, and (2) fetch as needed, where compute nodes fetch records as needed. Clearly,
both approaches have severe disadvantages, that is, the first results in high volumes
of shipped data and the second yields a vast amount of messages that are exchanged
between the compute nodes. Different strategies have been proposed to overcome this
problem, namely the introduction of on-the-fly indexing, data compression, and parti-
tion pulling. Another option for optimizing network transfer and data shuffling costs is
to pull and replicate entire partitions on certain compute nodes if they are sufficiently
small [Graefe 2009]. In the same way, Alexandrov et al. [2015] propose to re-use exist-
ing partitions by first computing sets of interesting partitionings for a given dataflow
and enforcing such partitioning in early stages of the dataflow execution. In addition,
data analytics systems, such as Flink, Spark, or Impala, commonly apply techniques
for data compression before shipment across the network.
5.6. Choice of Operator Implementation
Once a logical plan for a dataflow is fixed, concrete algorithms and execution strategies
for each operator need to be chosen. Choosing appropriate operator implementations
is as important as the previous optimization techniques, since data properties such
as distribution, sortedness, or cardinality impact operator and dataflow performance
dramatically. Choosing join algorithms adaptively is discussed in Blanas et al. [2010].
For UDFs, however, choosing appropriate alternatives based on data properties be-
comes even more difficult because of the lack of appropriate statistics, and, often, only
a few different UDF implementations are available to a data analytics system (although
UDF variants exist). In Big Data processing systems, annotations of different physical
properties, such as input sizes, the ratio between input and output sizes and key cardi-
nalities, and the number of produced records per UDF function call are used by Battré
et al. [2010] and Alexandrov et al. [2014] to decide on a parallelization strategy and
to choose specific join operator implementations for dataflow execution. The optimizer
for PigLatin contains specific mechanisms to choose appropriate join and aggregation
algorithms for skewed data distributions [Gates et al. 2013].
6. DATAFLOW LANGUAGES AND OPTIMIZATION IN MAP/REDUCE-STYLE SYSTEMS
Since the mid-2000s, many dataflow languages with implementations for different Big
Data processing systems have been proposed. These languages aim at simplifying the
writing of scalable data analytics programs through dataflows and to enable automatic
optimization of such flows by appropriate compilers and optimizers. In the following, we
summarize these languages and their key characteristics in terms of type, data model,
intermediate representation, targeted infrastructure, and available optimization tech-
niques regarding UDFs. For all presented languages, we compare general language
properties in Table I and technical properties in Table II.
Dremel [Melnik et al. 2010], also known as BigQuery, is a language designed
at Google to support interactive ad hoc analyses of very large read-only datasets.
BigQuery provides the core set of Dremel language features, which is available to
7 http://issues.apache.org/jira/browse/CALCITE-468, retrieved October 31, 2016.


Table I. Overview of Dataflow Languages for Big Data Processing Systems. Part 1: General Language Properties

Table II. Overview of Dataflow Languages for Big Data Processing Systems. Part 2: Technical Properties
and Optimization


external developers, whereas Dremel is only available within the company. Next to
relational queries, the language supports inter- and intra-record aggregations, top-k
queries, and UDFs. Recently, an open-source implementation of Dremel has become available in
the Apache Drill [Hausenblas and Nadeau 2013] project, which supports querying a va-
riety of non-relational data stores, such as HDFS, MongoDB, or Google Cloud Storage.8
Queries formulated in Drill are executed in a custom, massively parallel execution en-
gine. Extensions of Dremel to support tree-structured data [Afrati et al. 2014] and the
possibility of extending Drill with UDFs to enable domain-specific applications have
been proposed. Drill applies a compiler pipeline that contains byte-code analysis and
rewriting to enable automatic code optimization. Optimizations of aggregation and join
queries are carried out manually by the developer; UDFs are not optimized.
Tenzing [Chattopadhyay et al. 2011] is another language developed by Google in-
tended to support SQL and relational online analytical processing (OLAP) applications
using batch processing on Map/Reduce infrastructures. A cost-based optimizer is avail-
able to optimize aggregations and joins, for which different join algorithms are also
available. Tenzing is extensible with UDFs, which are not optimized.
PigLatin [Olston et al. 2008] was originally developed at Yahoo research and is a
declarative scripting language similar to the relational algebra to support analytical
queries on top of Hadoop. It is extensible with UDFs and supports relational and
arithmetic operators. Though PigLatin was originally developed for Hadoop, compilation to
other platforms, such as Tez, Spark, or Storm, is under development. Optimization is
carried out rule based in a limited, database-style manner. Join optimization is per-
formed manually and techniques to optimize processing of skewed data are available.
UDFs are not optimized.
JaQL [Beyer et al. 2011] is a functional scripting language for processing semi-
structured data on top of Hadoop, where operators are expressed as functions and
dataflow programs are compositions of functions. Optimization is carried out rule based
on the logical level and includes variable and function inlining, filter push-down, and
optimization of field access.
Hive [Thusoo et al. 2009] is a data warehousing system on top of Hadoop, Spark,
and Tez, which provides HiveQL, a SQL-like query language. It contains a rule-based
optimizer capable of optimizing joins and aggregations, predicate push-downs, data
pruning, and dynamic partition pruning. Optionally, the included optimizer can be
replaced by a cost-based optimizer built on top of Apache Calcite9 that uses cardinality
estimates to generate efficient execution strategies.
Scope [Chaiken et al. 2008; Zhou et al. 2012], proposed in 2008, is one of the earliest
declarative scripting languages for large-scale data processing, intended to
process structured records similarly to relational databases. It resembles
PigLatin, is extensible with UDFs, and contains a limited set of SQL-style oper-
ators, which are optimized cost based by rewriting sub-expressions using the Cascades
framework [Graefe 1995]. Later, an enhanced optimizer has been included that per-
forms code analysis to optimize imperative UDFs using techniques such as column
reduction, early filtering, and optimization of data shuffling [Guo et al. 2012]. Scope
comprehensively addresses dataflow optimization, including techniques for adequately
modeling properties of ordering, grouping, and partitioning and for exploiting this
information during the optimization process. Recently, Microsoft released
a novel query language named U-SQL,10 which provides SQL operations and

8 For an overview of supported infrastructures, see http://drill.apache.org/, retrieved March 24, 2017.
9 https://calcite.apache.org/, retrieved October 31, 2016.
10 http://usql.io/, retrieved October 31, 2016.


UDF functionality for the storage and analysis platform Azure. Optimization and exe-
cution in this system is carried out based on the techniques developed for Scope.
DryadLINQ [Yu et al. 2008; Isard and Yu 2009] enables data-parallel programming
in the .NET environment by extending the declarative-imperative language and pro-
gramming model LINQ [Meijer et al. 2006]. LINQ has an SQL-like syntax, which is
enriched with lambda expressions and anonymous types. Optimization is carried out
rule based and based on annotations of algebraic operator properties provided by the
developer. The DryadLINQ project was stopped in 2010 in favor of Hadoop and Spark;
however, an academic version providing source code is still available.
Meteor [Heise et al. 2012] is a declarative and extensible dataflow language for
Stratosphere, the research branch of Apache Flink. Next to general-purpose oper-
ators, it contains many domain-specific operators for information extraction, data
cleansing, and web data extraction. UDFs are implemented as first-class algebraic
operators, which are provided in self-contained libraries consisting of the operators'
implementations, their syntax, and optional semantic annotations, which allows an
optimizer to access and exploit this information for logical dataflow rewriting at
compile time [Rheinländer et al. 2015]. Optimization is carried out in a hybrid way,
where manual annotation is combined with operator and static code analysis to infer
rewrite options.
AsterixQL (AQL) [Alsubaiee et al. 2014] is a declarative language for Aster-
ixDB, which provides native support for storing, indexing, and analyzing nested
data by using FLWOR statements originally introduced in XQuery. AQL is shipped
with many general-purpose and domain-specific operators, for example, to execute
similarity-based or range-based queries. Optimization is carried out using algebraic
rewrite rules in the Algebricks framework and can be enhanced by manually providing
operator annotations. Recently, a connector interface between AsterixDB and Spark
has been demonstrated, allowing users to submit AQL queries through Spark to As-
terixDB to produce intermediate result partitions, which are processed by Spark for
advanced analytics [Alkowaileet et al. 2016].
Spark SQL [Armbrust et al. 2015] is the successor of Shark [Xin et al. 2013] and
provides an API to Spark, which allows querying structured data with SQL inside
Spark programs. Next to relational operators, libraries for domain-specific applica-
tions such as streaming, graph processing, and machine learning are available. Spark
SQL queries are translated and optimized using Catalyst, an extensible optimizer
that performs two-phase optimization on abstract syntax trees similar to optimization
in database systems. Catalyst applies many different optimization techniques; the
optimization of UDFs, however, is not natively supported and must be added manually
through additional rewrite rules.
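As a rough illustration of Catalyst-style rule-based rewriting, the following self-contained Python sketch applies a rewrite rule bottom-up over an expression tree until a fixpoint is reached; the node types and the constant-folding rule are simplified stand-ins for Catalyst's plan nodes and rule batches, not its real API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lit:
    value: int

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def constant_fold(node):
    """Rule: Add(Lit, Lit) -> Lit."""
    if isinstance(node, Add) and isinstance(node.left, Lit) and isinstance(node.right, Lit):
        return Lit(node.left.value + node.right.value)
    return node

def transform_up(node, rule):
    """Apply a rule bottom-up over the tree, like Catalyst's transformUp."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)

def to_fixpoint(node, rule):
    """Re-apply the rule until the plan no longer changes."""
    while True:
        rewritten = transform_up(node, rule)
        if rewritten == node:
            return node
        node = rewritten

expr = Add(Add(Lit(1), Lit(2)), Lit(3))
print(to_fixpoint(expr, constant_fold))  # Lit(value=6)
```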
Sawzall [Pike et al. 2005] is a procedural, domain-specific language developed for
Google’s Map/Reduce infrastructure for analyzing log records and was made available
as open source in 2010. However, since the underlying runtime system is not publicly
available, it can only be applied to analyze small to mid-sized datasets. An open-source
re-implementation of the Sawzall compiler and runtime for Hadoop is available
under the name Sizzle11; an optimizer is not included.
Impala [Kornacker et al. 2015] is an Apache Incubator project initiated by Cloudera
that enables real-time SQL queries on top of Hadoop. To enable real-time processing,
a custom, parallel, and distributed query processor has been developed, similar to
that of a parallel DBMS. Optimization is carried out in a two-stage approach: A query is
first translated and locally optimized and subsequently compiled into a parallel plan,
which is physically optimized and executed.
11 http://sizzlelanguage.blogspot.de/, retrieved October 31, 2016.

ACM Computing Surveys, Vol. 50, No. 3, Article 38, Publication date: May 2017.
Deep language embedding of domain-specific languages into host languages not only
enables a high-level, declarative task description with transparent data parallelism
but also, in principle, allows for holistic optimization that combines compiler
optimizations with advanced relational and domain-specific optimizations. Jet
[Ackermann et al. 2012] is a framework for deeply embedding domain-specific lan-
guages for large-scale analytics into Scala on top of Hadoop and Spark. It combines
optimization techniques from compilers (e.g., loop fusion, dead code elimination) with
mechanisms for projection insertion and operator fusion. Emma [Alexandrov et al.
2015] is also deeply embedded in Scala and uses monad comprehensions in a layered
intermediate representation together with a complex, multi-staged compiler-optimizer
pipeline to generate efficient code for Flink.
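The operator-fusion idea exploited by such embedded compilers can be illustrated in a few lines of Python: consecutive per-element map operators are composed into a single function so the data is traversed once instead of once per operator. This is a conceptual sketch, not code generated by Jet or Emma.

```python
from functools import reduce

def fuse_maps(functions):
    """Collapse a chain of per-element functions into one composed function."""
    return reduce(lambda f, g: (lambda x: g(f(x))), functions)

# Three logical map operators fused into a single pass over the data.
pipeline = [lambda x: x + 1, lambda x: x * 2, str]
fused = fuse_maps(pipeline)

data = [1, 2, 3]
print([fused(x) for x in data])  # ['4', '6', '8']
```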

7. CONCLUSION AND FUTURE DIRECTIONS


We surveyed techniques for optimizing dataflows with UDFs, which may be applied at
different stages of the optimization process in Big Data processing systems. While some
of the discussed techniques are available in a few current systems, comprehensive
optimization of UDFs and non-relational operators that takes the semantics of UDFs into
account is still an open challenge. In the following, we summarize the most promising
and urgent research questions, which, from our point of view, remain unaddressed
in the context of dataflows with UDFs.
Holistic dataflow optimization. We believe that dataflow optimization would
greatly benefit from advancements in the structure of the optimization process itself.
All modern dataflow systems approach optimization in multiple stages: dataflows are
first simplified, possibly reordered logically, and finally optimized physically. This may
lead to sub-optimal execution strategies, since none of these stages can access and ex-
ploit all relevant information for optimization. For example, operator decomposition is
useful to enlarge options for logical operator reordering but possibly prevents operator
chaining on the physical level. We envision a holistic optimization approach, where all
relevant information on UDF semantics, rewrite, and simplification rules as well as
information on physical data and infrastructure properties is used in a single optimiza-
tion phase to determine an optimal parallel execution strategy. Since such an approach
implies many more degrees of freedom for an optimizer to investigate, efficient plan
enumeration strategies and cost models need to be researched as well [Graefe 1994,
1995; Lohman 1988].
Efficient plan enumeration for complex dataflows. Unlike relational queries,
which mostly translate to linear or tree-shaped execution plans, dataflows often ex-
hibit DAG-shaped logical and physical plans. How to efficiently traverse the space of
alternative execution strategies for such plans, possibly applying branch-and-bound
pruning strategies, is an open and challenging question, especially because exhaustive
enumeration is NP-hard for arbitrary dataflows [Burge et al. 2005]. From our own
experience in the Stratosphere project [Rheinländer et al. 2015], plan enumeration
may already become a real issue for mid-sized dataflows with 15 to 20 operators.
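A minimal sketch of branch-and-bound enumeration for one restricted case, ordering commuting filter-like operators with per-tuple costs and selectivities, might look as follows; the cost model and the operator statistics are invented for illustration.

```python
import itertools
import math

# name: (per-tuple cost, selectivity); illustrative numbers only.
OPS = {"a": (1.0, 0.5), "b": (4.0, 0.1), "c": (2.0, 0.9)}

def plan_cost(order):
    """Expected per-input-tuple cost: each operator pays its cost on the
    fraction of tuples surviving all earlier operators."""
    total, surviving = 0.0, 1.0
    for op in order:
        cost, sel = OPS[op]
        total += surviving * cost
        surviving *= sel
    return total

def branch_and_bound(prefix=(), cost=0.0, surviving=1.0, best=(math.inf, None)):
    """Enumerate orderings depth-first, pruning any partial plan whose
    accumulated cost already exceeds the best complete plan found so far."""
    if cost >= best[0]:
        return best
    remaining = [op for op in OPS if op not in prefix]
    if not remaining:
        return (cost, prefix)
    for op in remaining:
        c, s = OPS[op]
        best = branch_and_bound(prefix + (op,), cost + surviving * c, surviving * s, best)
    return best

cost, order = branch_and_bound()
# Sanity check against exhaustive enumeration of all 3! orderings.
assert cost == min(plan_cost(p) for p in itertools.permutations(OPS))
print(order, cost)
```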
Incremental and adaptive dataflow optimization. Resources available to Big
Data processing systems may change during the execution of a dataflow on cluster
or cloud infrastructures. For example, the number of compute nodes or the size of
the memory available to the system may shrink or grow. This not only requires
adaptive scheduling mechanisms but also calls for incremental dataflow optimization
techniques that enable re-planning of dataflows during execution. Rather
than solely relying on a two-phase approach, where dataflows are first optimized and
then executed, we believe that the development of optimization strategies for altering
an optimized, but possibly not optimal, dataflow during execution yields significant


benefits in terms of resource consumption or execution time [Cole and Graefe 1994;
Avnur and Hellerstein 2000].
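One simple building block for such incremental re-optimization is a trigger that compares estimated with observed intermediate cardinalities after an operator finishes and initiates re-planning of the remaining dataflow when they diverge; the threshold-based rule below is a hypothetical sketch, not taken from any particular system.

```python
def should_reoptimize(estimated: float, observed: float, factor: float = 2.0) -> bool:
    """Trigger re-planning when estimate and observation differ by more
    than `factor` in either direction (factor is an illustrative choice)."""
    if estimated <= 0:
        return observed > 0
    ratio = observed / estimated
    return ratio > factor or ratio < 1.0 / factor

print(should_reoptimize(1000, 1200))   # False: estimate close enough
print(should_reoptimize(1000, 50000))  # True: gross underestimate
```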
Benchmarking UDF optimization. In this article, we surveyed a multitude of
promising methods for optimizing dataflows with UDFs, each addressing different
aspects of the optimization process. However, the concrete benefits of the presented
methods, alone and in combination, have not yet been evaluated systematically. We
believe there are two reasons for this. First, we are not aware of any optimizer for a Big Data
processing system implementing all or at least a majority of the discussed methods;
instead, existing systems often rely on heuristics or manual dataflow transformations
performed by the developer. Second, we are not aware of any benchmark in the area
of large-scale data processing that considers truly non-relational workloads. Quite a
few benchmarks for Big Data analytics have been designed in the past few years with
many of them focusing on SQL-style processing of mainly structured data [Amplab
2014; Ghazal et al. 2013] or graph processing [Barnawi et al. 2014; Batarfi et al. 2015;
Han et al. 2014]. The most comprehensive benchmark to date is BigBench [Ghazal
et al. 2013], which also includes a few dataflows with UDFs executed on semi- or
unstructured data but focuses mainly on analyzing structured data. We believe that
establishing a benchmark for dataflows with UDFs could be very helpful to gain deep
insights into the benefits of the presented methods given a heterogeneous workload
containing many UDFs.
Reliable statistics for UDFs. As of today, another severe drawback to the optimiza-
tion of dataflows with UDFs is the lack of reliable statistics regarding the UDF’s run-
time performance and intermediate data sizes. In contrast to RDBMSs, where queries
are often repeatedly executed over rather static datasets, dataflows with UDFs in
Big Data processing systems are usually long running and process very large ad hoc
inputs only once or a few times. In our opinion, accurately estimating properties of
such functions relevant to dataflow optimization for very large inputs and deriving
reliable predictions for the UDF’s performance is one of the greatest challenges for
future research in dataflow optimization.
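One pragmatic, if imperfect, way to obtain such statistics is to profile the black-box UDF on a random sample of its input before optimization; the sketch below estimates selectivity and per-tuple cost this way (the function names and the sampling scheme are our own illustration, not part of any surveyed system).

```python
import random
import time

def profile_udf(udf, data, sample_size=100, seed=0):
    """Estimate a filter-style UDF's selectivity and per-tuple runtime
    by executing it on a random sample of the input."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    start = time.perf_counter()
    passed = sum(1 for record in sample if udf(record))
    elapsed = time.perf_counter() - start
    return {"selectivity": passed / len(sample),
            "cost_per_tuple": elapsed / len(sample)}

# Hypothetical filter UDF: keep even values (true selectivity 0.5).
stats = profile_udf(lambda x: x % 2 == 0, list(range(10_000)))
print(stats["selectivity"])
```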
With progress in these areas, we hope that optimization of complex dataflows that
contain UDFs will have a significant impact on improving the performance of
non-relational analytics at large scale.

REFERENCES
Stefan Ackermann, Vojin Jovanovic, Tiark Rompf, and Martin Odersky. 2012. Jet: An embedded DSL for
high performance big data processing. In Proceedings of the International Workshop on End-to-end
Management of Big Data (BigData’12), held in conjunction with VLDB 2012. 1–10.
Foto N. Afrati, Dan Delorey, Mosha Pasumansky, and Jeffrey D. Ullman. 2014. Storing and querying
tree-structured records in Dremel. Proc. VLDB Endow. 7, 12 (2014), 1131–1142.
Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid
Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid
Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke.
2014. The stratosphere platform for big data analytics. VLDB J. 23, 6 (2014), 939–964.
Alexander Alexandrov, Andreas Kunft, Asterios Katsifodimos, Felix Schüler, Lauritz Thamsen, Odej Kao,
Tobias Herb, and Volker Markl. 2015. Implicit parallelism through deep language embedding. In Pro-
ceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD’15).
47–61.
Wail Y. Alkowaileet, Sattam Alsubaiee, Michael J. Carey, Till Westmann, and Yingyi Bu. 2016. Large-scale
complex analytics on semi-structured datasets using AsterixDB and Spark. Proc. VLDB Endow. 9, 13
(2016), 1585–1588.
Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael
Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover,
Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria
Pirzadeh, Vassilis Tsotras, Rares Vernica, Jian Wen, and Till Westmann. 2014. AsterixDB: A scalable,
open source BDMS. Proc. VLDB Endow. 7, 14 (2014), 1905–1916.


Amplab. 2014. Big Data Benchmark. (2014). Retrieved October 31, 2016 from https://amplab.cs.berkeley.edu/
benchmark/.
P. M. G. Apers, A. R. Hevner, and S. B. Yao. 1983. Optimization algorithms for distributed queries. IEEE
Trans. Softw. Eng. 9, 1 (1983), 57–68.
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng,
Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational data
processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management
of Data (SIGMOD’15). 1383–1394.
Ron Avnur and Joseph M. Hellerstein. 2000. Eddies: Continuously adaptive query processing. In Proceedings
of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD’00).
Fuad Bajaber, Radwa Elshawi, Omar Batarfi, Abdulrahman Altalhi, Ahmed Barnawi, and Sherif Sakr. 2016.
Big data 2.0 processing systems: Taxonomy and open challenges. J. Grid Comput. 14, 3 (2016), 379–405.
Ahmed Barnawi, Omar Batarfi, Seyed-Mehdi-Reza Beheshti, Radwa El Shawi, Ayman G. Fayoumi, Reza
Nouri, and Sherif Sakr. 2014. On characterizing the performance of distributed graph computation plat-
forms. In Proceedings of the 6th TPC Technology Conference (TPCTC’14)—Performance Characterization
and Benchmarking. 29–43.
Omar Batarfi, Radwa El Shawi, Ayman G. Fayoumi, Reza Nouri, Seyed-Mehdi-Reza Beheshti, Ahmed Bar-
nawi, and Sherif Sakr. 2015. Large scale graph processing systems: Survey and an experimental evalu-
ation. Cluster Comput. 18, 3 (2015), 1189–1213.
Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. 2010.
Nephele/PACTs: A programming model and execution framework for web-scale analytical processing.
In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC’10). 119–130.
Philip A. Bernstein and Nathan Goodman. 1981. Power of natural semijoins. SIAM J. Comput. 10, 4 (1981),
751–771.
Philip A. Bernstein, Nathan Goodman, Eugene Wong, Christopher L. Reeve, and James B. Rothnie Jr. 1981.
Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst. 6, 4 (1981),
602–625.
G. Bruce Berriman and Steven L. Groom. 2011. How will astronomy archives survive the data tsunami?
Commun. ACM 54, 12 (2011), 52–56.
Kevin S. Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed Y. Eltabakh, Carl-Christian
Kanne, Fatma Özcan, and Eugene J. Shekita. 2011. Jaql: A scripting language for large scale semistruc-
tured data analysis. Proc. VLDB Endow. 4, 12 (2011), 1272–1283.
Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, and Yuanyuan Tian. 2010. A
comparison of join algorithms for log processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD
International Conference on Management of Data (SIGMOD’10). 975–986.
Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7
(1970), 422–426.
Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible
and extensible foundation for data-intensive computing. In Proceedings of the 27th IEEE International
Conference on Data Engineering (ICDE’11). 1151–1162.
Vinayak R. Borkar, Yingyi Bu, E. Preston Carman Jr., Nicola Onose, Till Westmann, Pouria Pirzadeh,
Michael J. Carey, and Vassilis J. Tsotras. 2015. Algebricks: A data model-agnostic compiler backend for
big data languages. In Proceedings of the 6th ACM Symposium on Cloud Computing (SoCC’15). 422–433.
Anthony Peter Graham Brown. 1998. Optimization of the order in which the comparisons of the components
of a Boolean query expression are applied to a database record stored as a byte stream.
Jen Burge, Kamesh Munagala, and Utkarsh Srivastava. 2005. Ordering Pipelined Query Operators with
Precedence Constraints. Technical Report. Stanford University.
Michael J. Cafarella and Christopher Ré. 2010. Manimal: Relational optimization for data-intensive pro-
grams. In Proceedings of the 13th International Workshop on the Web and Databases (WebDB’10). 10:1–
10:6.
E. Preston Carman Jr., Till Westmann, Vinayak R. Borkar, Michael J. Carey, and Vassilis J. Tsotras. 2015.
A scalable parallel XQuery processor. In Proceedings of the IEEE International Conference on Big Data
(Big Data’15). 164–173.
Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren
Zhou. 2008. SCOPE: Easy and efficient parallel processing of massive datasets. Proc. VLDB Endow. 1,
2 (2008), 1265–1276.
Chee Yong Chan and Yannis E. Ioannidis. 1998. Bitmap index design and evaluation. In Proceedings of the
1998 ACM SIGMOD International Conference on Management of Data (SIGMOD’98). 355–366.


Biswapesh Chattopadhyay, Liang Lin, Weiran Liu, Sagar Mittal, Prathyusha Aragonda, Vera Lychagina,
Younghee Kwon, and Michael Wong. 2011. Tenzing: A SQL implementation on the MapReduce framework.
Proc. VLDB Endow. 4, 12 (2011), 1318–1327.
Surajit Chaudhuri. 1998. An overview of query optimization in relational systems. In Proceedings of the
17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’98). ACM,
34–43.
Surajit Chaudhuri and Kyuseok Shim. 1999. Optimization of queries with user-defined predicates. ACM
Trans. Database Syst. 24, 2 (1999), 177–228.
Richard L. Cole and Goetz Graefe. 1994. Optimization of dynamic query evaluation plans. In Proceedings of
the 1994 ACM SIGMOD International Conference on Management of Data (SIGMOD’94). 150–160.
Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. 1991. Efficiently
computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang.
Syst. 13, 4 (1991), 451–490.
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In
Proceedings of the 6th USENIX Symposium on Operating System Design and Implementation (OSDI’04).
10–10.
Christos Doulkeridis and Kjetil Nørvåg. 2014. A survey of large-scale analytical query processing in
MapReduce. VLDB J. 23, 3 (2014), 355–380.
Xuepeng Fan, Zhenyu Guo, Hai Jin, Xiaofei Liao, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Wei Lin,
Jingren Zhou, and Lidong Zhou. 2015. Spotting code optimizations in data-parallel pipelines through
periscope. IEEE Trans. Parallel Distrib. Syst. 26, 6 (2015), 1718–1731.
Richard A. Ganski and Harry K. T. Wong. 1987. Optimization of nested SQL queries revisited. SIGMOD Rec.
16, 3 (1987), 23–33.
Alan Gates, Jianyong Dai, and Thejas Nair. 2013. Apache pig’s optimizer. IEEE Data Eng. Bull. 36, 1 (2013),
34–45.
Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, and Hans-Arno Ja-
cobsen. 2013. BigBench: Towards an industry standard benchmark for big data analytics. In Proceedings
of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD’13). 1197–1208.
Georg Gottlob, Christoph Koch, and Reinhard Pichler. 2005. Efficient algorithms for processing XPath
queries. ACM Trans. Database Syst. 30, 2 (2005), 444–491.
Goetz Graefe. 1994. Volcano—An extensible and parallel query evaluation system. IEEE Trans. Knowl. Data
Eng. 6, 1 (1994), 120–135.
Goetz Graefe. 1995. The cascades framework for query optimization. IEEE Data Eng. Bull. 18, 3 (1995),
19–29.
Goetz Graefe. 2009. Parallel query execution algorithms. In Encyclopedia of Database Systems, Ling Liu and
M. Tamer Özsu (Eds.). Springer, 2030–2035.
Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank
Pellow, and Hamid Pirahesh. 1997. Data cube: A relational aggregation operator generalizing group-by,
cross-tab, and sub-totals. Data Min. Knowl. Discov. 1, 1 (1997), 29–53.
Zhenyu Guo, Xuepeng Fan, Rishan Chen, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Chang Liu, Wei
Lin, Jingren Zhou, and Lidong Zhou. 2012. Spotting code optimizations in data-parallel pipelines through
periSCOPE. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Imple-
mentation (OSDI’12). 121–133.
Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, and Tianqi Jin. 2014.
An experimental comparison of Pregel-like graph processing systems. Proc. VLDB Endow. 7, 12 (2014),
1047–1058.
Michael Z. Hanani. 1977. An optimal evaluation of Boolean expressions in an online query system. Commun.
ACM 20, 5 (1977), 344–347.
Michael Hausenblas and Jacques Nadeau. 2013. Apache drill: Interactive ad-hoc analysis at scale. Big Data
1, 2 (2013), 100–104.
Arvid Heise, Astrid Rheinländer, Marcus Leich, Ulf Leser, and Felix Naumann. 2012. Meteor/Sopremo: An
extensible query language and operator model. In Proceedings of the International Workshop on End-to-
end Management of Big Data (BigData’12), held in conjunction with VLDB 2012. 1–10.
Joseph M. Hellerstein. 1998. Optimization techniques for queries with expensive methods. ACM Trans.
Database Syst. 23, 2 (1998), 113–157.
Joseph M. Hellerstein and Jeffrey F. Naughton. 1996. Query execution techniques for caching expensive
methods. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data
(SIGMOD’96). 423–434.


Doug Howe, Maria Costanzo, Petra Fey, Takashi Gojobori, Linda Hannick, Winston Hide, David P Hill,
Renate Kania, Mary Schaeffer, Susan St Pierre, and others. 2008. Big data: The future of biocuration.
Nature 455, 7209 (2008), 47–50.
Fabian Hueske, Aljoscha Krettek, and Kostas Tzoumas. 2013. Enabling operator reordering in data flow
programs through static code analysis. CoRR abs/1301.4200 (2013), 1–4.
Fabian Hueske, Mathias Peters, Matthias Sax, Astrid Rheinländer, Rico Bergmann, Aljoscha Krettek, and
Kostas Tzoumas. 2012. Opening the black boxes in data flow optimization. Proc. VLDB Endow. 5, 11
(2012), 1256–1267.
Yannis E. Ioannidis. 1996. Query optimization. ACM Comput. Surv. 28, 1 (1996), 121–123.
Michael Isard and Yuan Yu. 2009. Distributed data-parallel computing using a high-level programming
language. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data
(SIGMOD’09). 987–994.
Adam Jacobs. 2009. The pathologies of big data. Commun. ACM 52, 8 (2009), 36–44.
Eaman Jahani, Michael J. Cafarella, and Christopher Ré. 2011. Automatic optimization for mapreduce
programs. Proc. VLDB Endow. 4, 6 (2011), 385–396.
Won Kim. 1982. On optimizing an SQL-like nested query. ACM Trans. Database Syst. 7, 3 (1982), 443–469.
Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin
Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, and others. 2015. Impala: A modern,
open-source SQL engine for Hadoop. In Proceedings of the 7th Biennial Conference on Innovative Data
Systems (CIDR’15). 1–10.
Donald Kossmann. 2000. The state of the art in distributed query processing. ACM Comput. Surv. 32, 4
(2000), 422–469.
Georgia Kougka and Anastasios Gounaris. 2013. Declarative expression and optimization of data-intensive
flows. In Proceedings of the 15th International Conference on Data Warehousing and Knowledge Discovery
(DaWaK’13). 13–25.
Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon. 2012. Parallel data
processing with MapReduce: A survey. SIGMOD Rec. 40, 4 (2012), 11–20.
Feng Li, Beng Chin Ooi, M. Tamer Özsu, and Sai Wu. 2014. Distributed data management using MapReduce.
ACM Comput. Surv. 46, 3 (2014), 31:1–31:42.
Guy M. Lohman. 1988. Grammar-like functional rules for representing query optimization alternatives. In
Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data (SIGMOD’88).
18–27.
Hongjun Lu and Michael J. Carey. 1985. Some experimental results on distributed join algorithms in a
local network. In Proceedings of the 11th International Conference on Very Large Data Bases (VLDB’85).
292–304.
Lothar F. Mackert and Guy M. Lohman. 1986. R* optimizer validation and performance evaluation for
distributed queries. In Proceedings of the 12th International Conference on Very Large Data Bases
(VLDB’86).
Vivien Marx. 2013. Biology: The big challenges of big data. Nature 498, 7453 (2013), 255–260.
Erik Meijer, Brian Beckman, and Gavin Bierman. 2006. LINQ: Reconciling object, relations and XML in the
.net framework. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of
Data (SIGMOD’06). 706–706.
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and
Theo Vassilakis. 2010. Dremel: Interactive analysis of web-scale datasets. In Proceedings of the 36th
International Conference on Very Large Data Bases (VLDB’10). 330–339.
Donald Michie. 1968. “Memo” functions and machine learning. Nature 218, 5136 (1968), 19–22.
Donald Miner and Adam Shook. 2012. MapReduce Design Patterns: Building Effective Algorithms and
Analytics for Hadoop and Other Systems. O’Reilly Media, Inc.
Jack Minker and Rita G. Minker. 1980. Optimization of Boolean expressions-historical developments. IEEE
Ann. Hist. Comput. 2, 3 (1980), 227–238.
Derek Gordon Murray, Michael Isard, and Yuan Yu. 2011. Steno: Automatic optimization of declarative
queries. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and
Implementation (PLDI’11). 121–131.
Thomas Neumann. 2005. Efficient Generation and Execution of DAG-Structured Query Graphs. Ph.D. Dis-
sertation. University of Mannheim, Germany.
Thomas Neumann. 2009. Query optimization (in relational databases). In Encyclopedia of Database Systems,
Ling Liu and M. Tamer Özsu (Eds.). Springer, 2273–2278.


Eduardo Ogasawara, Jonas Dias, Vitor Silva, Fernando Chirigati, Daniel Oliveira, Fabio Porto, Patrick
Valduriez, and Marta Mattoso. 2013. Chiron: A parallel engine for algebraic scientific workflows. Concurr.
Comput.: Pract. Exper. 25, 16 (2013), 2327–2341.
Eduardo S. Ogasawara, Daniel de Oliveira, Patrick Valduriez, Jonas Dias, Fabio Porto, and Marta Mattoso.
2011. An algebraic approach for data-centric scientific workflows. Proc. VLDB Endow. 4, 12 (2011),
1328–1339.
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig latin:
A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International
Conference on Management of Data (SIGMOD’08). 1099–1110.
M. Tamer Özsu and Patrick Valduriez. 1999. Principles of Distributed Database Systems (2nd ed.). Prentice-
Hall, Inc.
Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. 2005. Interpreting the data: Parallel analysis
with Sawzall. Sci. Program. 13, 4 (2005), 277–298.
Horst Reichel. 1984. Behavioural equivalence—a unifying concept for initial and final specifications. In
Proceedings of the 3rd Hungarian Computer Science Conference, Akademiai Kiado.
Astrid Rheinländer, Arvid Heise, Fabian Hueske, Ulf Leser, and Felix Naumann. 2015. SOFA: An extensible
logical optimizer for UDF-heavy data flows. Inf. Syst. 52 (2015), 96–125.
Giulia Rumi, Claudia Colella, and Danilo Ardagna. 2014. Optimization techniques within the Hadoop
ecosystem: A survey. In Proceedings of the 16th International Symposium on Symbolic and Numeric
Algorithms for Scientific Computing (SYNASC’14). 437–444.
Sherif Sakr, Anna Liu, and Ayman G. Fayoumi. 2013. The family of MapReduce and large-scale data
processing systems. ACM Comput. Surv. 46, 1 (2013), Article 11.
Anish Das Sarma, Foto N. Afrati, Semih Salihoglu, and Jeffrey D. Ullman. 2013. Upper and lower bounds
on the cost of a Map-Reduce computation. Proc. VLDB Endow. 6, 4 (2013), 277–288.
P. Griffiths Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price.
1979. Access path selection in a relational database management system. In Proceedings of the 1979
ACM SIGMOD International Conference on Management of Data (SIGMOD’79). 23–34.
Yasin N. Silva, Paul-Ake Larson, and Jingren Zhou. 2012. Exploiting common subexpressions for cloud query
processing. In Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE’12).
1337–1348.
Alkis Simitsis, Panos Vassiliadis, and Timos Sellis. 2005. Optimizing ETL processes in data warehouses. In
Proceedings of the 21st International Conference on Data Engineering (ICDE’05). 564–575.
Alkis Simitsis, Kevin Wilkinson, Malu Castellanos, and Umeshwar Dayal. 2012. Optimizing analytic data
flows for multiple execution engines. In Proceedings of the 2012 ACM SIGMOD International Conference
on Management of Data (SIGMOD’12). 829–840.
David Simmen, Eugene Shekita, and Timothy Malkemus. 1996. Fundamental techniques for order opti-
mization. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data
(SIGMOD’96). 57–67.
Utkarsh Srivastava, Kamesh Munagala, Jennifer Widom, and Rajeev Motwani. 2006. Query optimization
over web services. In Proceedings of the 32nd International Conference on Very Large Data Bases
(VLDB’06). 355–366.
Konrad Stocker, Donald Kossmann, Reinhard Braumandl, and Alfons Kemper. 2001. Integrating semi-join-
reducers into state of the art query processors. In Proceedings of the 17th International Conference on
Data Engineering (ICDE’01). 575–584.
Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexan-
der Rasin. 2010. MapReduce and parallel DBMSs: Friends or foes? Commun. ACM 53, 1 (2010), 64–71.
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete
Wyckoff, and Raghotham Murthy. 2009. Hive: A warehousing solution over a map-reduce framework.
Proc. VLDB Endow. 2, 2 (2009), 1626–1629.
Patrick Valduriez. 1987. Join indices. ACM Trans. Database Syst. 12, 2 (1987), 218–246.
Panos Vassiliadis, Alkis Simitsis, and Eftychia Baikousi. 2009. A taxonomy of ETL activities. In Proceedings
of the 12th Int. Workshop on Data Warehousing and OLAP (DOLAP’09). 25–32.
Henning Wachsmuth, Benno Stein, and Gregor Engels. 2011. Constructing efficient information extrac-
tion pipelines. In Proceedings of the 20th ACM Int. Conf. on Information and Knowledge Management
(CIKM’11). 2237–2240.
Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding. 2014. Data mining with big data. IEEE Trans.
Knowl. Data Eng. 26, 1 (2014), 97–107.


Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2013. Shark:
SQL and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on
Management of Data (SIGMOD’13). 13–24.
Weipeng P. Yan and Per-Ake Larson. 1994. Performing group-by before join [query processing]. In Proceedings
of the 10th International Conference on Data Engineering (ICDE’94). 89–100.
Weipeng P. Yan and Per-Ake Larson. 1995. Eager aggregation and lazy aggregation. In Proceedings of the
21th International Conference on Very Large Data Bases (VLDB’95). 345–357.
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon
Currey. 2008. DryadLINQ: A system for general-purpose distributed data-parallel computing using a
high-level language. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and
Implementation (OSDI’08). 1–14.
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark:
Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in
Cloud Computing (HotCloud’10). 10–10.
Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012.
SCOPE: Parallel databases meet mapreduce. VLDB J. 21, 5 (2012), 611–636.

Received November 2016; revised March 2017; accepted April 2017
