The Dataflow Model
ABSTRACT
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile
usage statistics, and sensor networks). At the same time,
consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by
features of the data themselves, in addition to an insatiable
hunger for faster answers. Meanwhile, practicality dictates
that one can never fully optimize along all dimensions of correctness, latency, and cost for these types of input. As a result, data processing practitioners are left with the quandary
of how to reconcile the tensions between these seemingly
competing propositions, often resulting in disparate implementations and systems.
We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern
data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under
the assumption that we will never know if or when we have
seen all of our data, only that new data will arrive, old data
may be retracted, and the only way to make this problem
tractable is via principled abstractions that allow the practitioner the choice of appropriate tradeoffs along the axes of
interest: correctness, latency, and cost.
In this paper, we present one such approach, the Dataflow
Model¹, along with a detailed examination of the semantics
it enables, an overview of the core principles that guided its
design, and a validation of the model itself via the real-world
experiences that led to its development.
¹We use the term Dataflow Model to describe the processing model of Google Cloud Dataflow [20], which is based upon technology from FlumeJava [12] and MillWheel [2].
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii.
Proceedings of the VLDB Endowment, Vol. 8, No. 12
Copyright 2015 VLDB Endowment 2150-8097/15/08.
1. INTRODUCTION
[Figure 1: Example windowing patterns (Fixed, Sliding, Sessions), shown as five numbered windows applied across three keys.]
1.1 Unbounded/Bounded vs Streaming/Batch

1.2 Windowing
period may be less than the size, which means the windows
may overlap. Sliding windows are also typically aligned;
even though the diagram is drawn to give a sense of sliding
motion, all five windows would be applied to all three keys in
the diagram, not just Window 3. Fixed windows are really
a special case of sliding windows where size equals period.
Sessions are windows that capture some period of activity over a subset of the data, in this case per key. Typically
they are defined by a timeout gap. Any events that occur
within a span of time less than the timeout are grouped
together as a session. Sessions are unaligned windows. For
example, Window 2 applies to Key 1 only, Window 3 to Key
2 only, and Windows 1 and 4 to Key 3 only.
1.3 Time Domains
[Figure 2: Time domain skew. Event-time progress is plotted against processing time (12:01 through 12:04), with the actual watermark lagging behind the ideal watermark.]

• ParDo for generic parallel processing. Each input element to be processed (which itself may be a finite collection) is provided to a user-defined function (called a DoFn in Dataflow), which can yield zero or more output elements per input. For example, consider an operation which expands all prefixes of the input key, duplicating the value across them:

  ↓ ParDo(ExpandPrefixes)

• GroupByKey for key-grouping (key, value) pairs:

  ↓ GroupByKey
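To make the two primitives concrete, here is a minimal Python sketch; the helper names `par_do` and `group_by_key` and the sample data are illustrative assumptions, not the Dataflow SDK:

```python
from collections import defaultdict

def expand_prefixes(key, value):
    """A DoFn-style function: yields zero or more outputs per input.
    Here it emits every prefix of the key, duplicating the value."""
    for i in range(1, len(key) + 1):
        yield key[:i], value

def par_do(elements, do_fn):
    """Apply a user-defined function to each element independently."""
    return [out for k, v in elements for out in do_fn(k, v)]

def group_by_key(elements):
    """Collect all values that share a key."""
    grouped = defaultdict(list)
    for k, v in elements:
        grouped[k].append(v)
    return dict(grouped)

expanded = par_do([("fix", 1), ("fit", 2)], expand_prefixes)
# [("f", 1), ("fi", 1), ("fix", 1), ("f", 2), ("fi", 2), ("fit", 2)]
grouped = group_by_key(expanded)
# {"f": [1, 2], "fi": [1, 2], "fix": [1], "fit": [2]}
```

Note that ParDo is elementwise and so applies naturally to unbounded data, whereas GroupByKey must somehow decide when it has gathered all the values for a key.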
pipeline. As we've made very clear above, notions of completeness are generally incompatible with correctness, so we won't rely on watermarks as such. They do, however, provide a useful notion of when the system thinks it likely that
all data up to a given point in event time have been observed,
and thus find application in not only visualizing skew, but
in monitoring overall system health and progress, as well as
making decisions around progress that do not require complete accuracy, such as basic garbage collection policies.
In an ideal world, time domain skew would always be
zero; we would always be processing all events immediately
as they happen. Reality is not so favorable, however, and
often what we end up with looks more like Figure 2. Starting
around 12:00, the watermark starts to skew more away from
real time as the pipeline lags, diving back close to real time
around 12:02, then lagging behind again noticeably by the
time 12:03 rolls around. This dynamic variance in skew
is very common in distributed data processing systems, and
will play a big role in defining what functionality is necessary
for providing correct, repeatable results.
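The dynamic skew described above can be quantified directly from monitoring samples; the helper below is a hypothetical illustration, not part of the model:

```python
def event_time_skew(samples):
    """Given (processing_time, watermark) pairs sampled from a running
    pipeline, return how far the watermark lags processing time at each
    sample. In the ideal case the skew would be zero everywhere."""
    return [(pt, pt - wm) for pt, wm in samples]

# A pipeline that lags, catches back up, then lags again (seconds):
skew = event_time_skew([(60, 10), (120, 110), (180, 90)])
# skew == [(60, 50), (120, 10), (180, 90)]
```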
2. DATAFLOW MODEL

2.1 Core Primitives

2.2 Windowing
2.2.1 Window Assignment

  ↓ AssignWindows(Sliding(2m, 1m))
(k, v1, 12:00, [11:59, 12:01)),
(k, v1, 12:00, [12:00, 12:02)),
(k, v2, 12:01, [12:00, 12:02)),
(k, v2, 12:01, [12:01, 12:03))

2.2.2 Window Merging

  ↓ AssignWindows(Sessions(30m))
  ↓ DropTimestamps
  ↓ GroupByKey
  ↓ MergeWindows(Sessions(30m))
(k1, [(v1, [13:02, 13:50)),
      (v3, [13:57, 14:27)),
      (v4, [13:02, 13:50))]),
(k2, [(v2, [13:14, 13:44))])
  ↓ GroupAlsoByWindow
Window merging occurs as part of the GroupByKeyAndWindow operation, and is best explained in the context of an
example. We will use session windowing since it is our motivating use case. Figure 4 shows four example data, three
for k1 and one for k2 , as they are windowed by session, with
a 30-minute session timeout. All are initially placed in a
default global window by the system. The sessions implementation of AssignWindows puts each element into a single window that extends 30 minutes beyond its own timestamp; this window denotes the range of time into which
later events can fall if they are to be considered part of the
same session. We then begin the GroupByKeyAndWindow
operation, which is really a five-part composite operation:
  ↓ ExpandToElements
(k1, [v1, v4], 13:50, [13:02, 13:50)),
(k1, [v3], 14:27, [13:57, 14:27)),
(k2, [v2], 13:44, [13:14, 13:44))
• GroupAlsoByWindow - For each key, groups values by window. After merging in the prior step, v1 and v4 are now in identical windows, and thus are grouped together at this step.
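The composite operation can be sketched in a few lines of Python. This is an illustrative reduction under stated assumptions: the helper names are invented, timestamps are minutes since midnight (reconstructed from the windows shown above), and per-element timestamps are simply dropped rather than re-assigned to window ends as in the example:

```python
def merge_windows(windows):
    """MergeWindows(Sessions): merge any overlapping per-key
    proto-windows into a single wider window."""
    merged = []
    for start, end in sorted(windows):
        if merged and start < merged[-1][1]:          # overlaps current session
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def group_by_key_and_window(elements, gap):
    """Composite: assign session proto-windows, group by key, merge
    windows, group values by merged window, expand back to elements."""
    by_key = {}
    for key, value, ts in elements:                   # AssignWindows(Sessions(gap))
        by_key.setdefault(key, []).append((value, (ts, ts + gap)))
    out = []
    for key, pairs in by_key.items():                 # GroupByKey
        for window in merge_windows([w for _, w in pairs]):   # MergeWindows
            values = [v for v, (s, _) in pairs if window[0] <= s < window[1]]
            out.append((key, values, window))         # GroupAlsoByWindow + ExpandToElements
    return out

# The four example data, timestamps as minutes since midnight:
data = [("k1", "v1", 782), ("k2", "v2", 794), ("k1", "v3", 837), ("k1", "v4", 800)]
result = group_by_key_and_window(data, 30)
# includes ("k1", ["v1", "v4"], (782, 830)), i.e. [13:02, 13:50)
```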
2.2.3 API
2.3 Triggers & Incremental Processing
2.4 Examples

[Figure 5: Example data points plotted by event time and processing time, with the actual (dashed) and ideal (dotted) watermarks overlaid; the datum with value 9 arrives late, behind the watermark.]
watermark. The straight dotted line with slope of one represents the ideal watermark, i.e. if there were no event-time
skew and all events were processed by the system as they
occurred. Given the vagaries of distributed systems, skew is
a common occurrence; this is exemplified by the meandering
path the actual watermark takes in Figure 5, represented by
the darker, dashed line. Note also that the heuristic nature
of this watermark is exemplified by the single late datum
with value 9 that appears behind the watermark.
[Figure residue omitted: global-windowing executions of the example pipeline, plotted as event time versus processing time with actual and ideal watermarks.]
Let us now return to the other option for supporting unbounded sources: switching away from global windowing.
To start with, let us window the data into fixed, two-minute
Accumulating windows:
[Figure residue omitted: fixed-window executions of the example pipeline, plotted as event time versus processing time with actual and ideal watermarks.]
Next, consider this pipeline executed on a streaming engine, as in Figure 12. Most windows are emitted when the
watermark passes them. Note however that the datum with
value 9 is actually late relative to the watermark. For whatever reason (mobile input source being offline, network partition, etc.), the system did not realize that datum had not
yet been injected, and thus, having observed the 5, allowed
the watermark to proceed past the point in event time that
would eventually be occupied by the 9. Hence, once the
9 finally arrives, it causes the first window (for event-time
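The behavior described for the late datum with value 9 can be sketched as a per-window accumulation that re-emits a refined result when data arrives behind the watermark. This is a hypothetical illustration, not the Dataflow runtime: the class and method names are invented, times are in seconds, and windows are fixed two-minute sums:

```python
from collections import defaultdict

class AccumulatingWindowSum:
    """Running sums over fixed windows that re-emit a refined result
    when a late element arrives for a window the watermark passed."""

    def __init__(self, size):
        self.size = size
        self.sums = defaultdict(int)
        self.fired = set()            # windows already emitted on time

    def add(self, value, event_time, watermark):
        window = (event_time // self.size) * self.size
        self.sums[window] += value
        if window in self.fired and window + self.size <= watermark:
            # Late datum: accumulating mode re-emits the updated sum.
            return ("update", window, self.sums[window])
        return None

    def advance_watermark(self, watermark):
        """Emit every window the watermark has newly moved past."""
        emitted = []
        for window in sorted(self.sums):
            if window + self.size <= watermark and window not in self.fired:
                self.fired.add(window)
                emitted.append(("on-time", window, self.sums[window]))
        return emitted

# A 5 arrives on time; the watermark passes its window; a late 9 then
# revises that window's result from 5 to 14:
w = AccumulatingWindowSum(120)
w.add(5, 10, 0)                  # None: window still open
w.advance_watermark(120)         # [("on-time", 0, 5)]
w.add(9, 50, 130)                # ("update", 0, 14)
```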
3.2 Design Principles

Though much of our design was motivated by the real-world experiences detailed in Section 3.3 below, it was also guided by a core set of principles that we believed our model should embody:

• Never rely on any notion of completeness.

• Be flexible, to accommodate the diversity of known use cases, and those to come in the future.
• Not only make sense, but also add value, in the context of each of the envisioned execution engines.

• Encourage clarity of implementation.

• Support robust analysis of data in the context in which they occurred.
While the experiences below informed specific features of
the model, these principles informed the overall shape and
character of it, and we believe ultimately led to a more comprehensive and general result.
3.3 Motivating Experiences

As we designed the Dataflow Model, we took into consideration our real-world experiences with FlumeJava and MillWheel over the years. Things which worked well, we made sure to capture in the model; things which worked less well motivated changes in approach. Here are brief summaries of some of these experiences that influenced our design.
3.3.1

3.3.2
time does so by calculating sessions. Thus, support for sessions became paramount in our design. As shown in Figure
14, generating sessions in the Dataflow Model is trivial.
3.3.3
Two teams with billing pipelines built on MillWheel experienced issues that motivated parts of the model. Recommended practice at the time was to use the watermark as a
completion metric, with ad hoc logic to deal with late data
or changes in source data. Lacking a principled system for
updates and retractions, a team that processed resource utilization statistics ended up leaving our platform to build a
custom solution (the model for which ended up being quite similar to the one we developed concurrently). Another billing
team had significant issues with watermark lags caused by
stragglers in their input. These shortcomings became major
motivators in our design, and influenced the shift of focus
from one of targeting completeness to one of adaptability
over time. The results were twofold: triggers, which allow
the concise and flexible specification of when results are materialized, as evidenced by the variety of output patterns possible over the same data set in Figures 7-14; and incremental processing support via accumulation (Figures 7 and
8) and retractions (Figure 14).
3.3.4
3.3.5
Another pipeline that we considered built trees of user activity (essentially session trees) across a large Google property. These trees were then used to build recommendations
tailored to users' interests. The pipeline was noteworthy in
that it used processing-time timers to drive its output. This
was due to the fact that, for their system, having regularly
updated, partial views on the data was much more valuable
than waiting until mostly complete views were ready once
the watermark passed the end of the session. It also meant
that lags in watermark progress due to a small amount
of slow data would not affect timeliness of output for the
3.3.6 Anomaly Detection: Data-Driven & Composite Triggers
In the MillWheel paper, we described an anomaly detection pipeline used to track trends in Google web search
queries. When developing triggers, their diff detection system motivated data-driven triggers. These differs observe
the stream of queries and calculate statistical estimates of
whether a spike exists or not. When they believe a spike is
happening, they emit a start record, and when they believe
it has ceased, they emit a stop. Though you could drive
the differ output with something periodic like Trill's punctuations, for anomaly detection you ideally want output as
soon as you are confident you have discovered an anomaly;
the use of punctuations essentially transforms the streaming system into micro-batch, introducing additional latency.
While practical for a number of use cases, it ultimately is
not an ideal fit for this one, thus motivating support for
custom data-driven triggers. It was also a motivating case
for trigger composition, because in reality, the system runs
multiple differs at once, multiplexing the output of them according to a well-defined set of logic. The AtCount trigger
used in Figure 9 exemplified data-driven triggers; Figures 10-14 utilized composite triggers.
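A minimal sketch of how such triggers might compose is shown below; `AtCount` mirrors the trigger named above, while `AnyOf` is an invented stand-in for the multiplexing logic, not an API from the paper:

```python
class AtCount:
    """Data-driven trigger: fires once a pane holds at least n elements."""
    def __init__(self, n):
        self.n = n
        self.seen = 0
    def on_element(self):
        self.seen += 1
        return self.seen >= self.n

class AnyOf:
    """Composite trigger: offers each element to every child and fires
    if any child fires, multiplexing several signals into one."""
    def __init__(self, *children):
        self.children = children
    def on_element(self):
        fired = [child.on_element() for child in self.children]  # no short-circuit
        return any(fired)

spike = AnyOf(AtCount(2), AtCount(3))
spike.on_element()    # False: no child satisfied yet
spike.on_element()    # True: AtCount(2) fires
```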
4. CONCLUSIONS
5. ACKNOWLEDGMENTS
We thank all of our faithful reviewers for their dedication, time, and thoughtful comments: Atul Adya, Ben Birt,
Ben Chambers, Cosmin Arad, Matt Austern, Lukasz Cwik,
Grzegorz Czajkowski, Walt Drummond, Jeff Gardner, Anthony Mancuso, Colin Meek, Daniel Myers, Sunil Pedapudi,
Amy Unruh, and William Vambenepe. We also wish to recognize the impressive and tireless efforts of everyone on the
Google Cloud Dataflow, FlumeJava, MillWheel, and related
teams that have helped bring this work to life.
6. REFERENCES