
Sliding-Window Aggregation Algorithms

Kanat Tangwongsan (1), Martin Hirzel (2), and Scott Schneider (2)

(1) Mahidol University International College, Salaya, Thailand
(2) IBM Research AI, Yorktown Heights, NY, USA

Synonyms

Sliding-window fold; Stream reduce; SWAG

Definition

An aggregation is a function from a collection of data items to an aggregate value. In sliding-window aggregation, the input collection consists of a window over the most recent data items in a stream. Here, a stream is a potentially infinite sequence of data items, and the decision on which data items are most recent at any point in time is given by a window policy. A sliding-window aggregation algorithm updates the aggregate value, often using incremental-computation techniques, as the window contents change over time, as illustrated in Fig. 1.

Sliding-Window Aggregation Algorithms, Fig. 1 Sliding-window aggregation definitions
[Figure: the stream ... 2 7 6 3 5 3 1 4 7 2 6 ... shown with a window at three positions; the aggregation max produces the aggregate values 6, 5, and 7. Labels: stream, window, aggregation, aggregate value.]

Overview

Sliding-window aggregation summarizes a collection of recent streaming data, capturing the most recent happenings as well as some history. Including some history provides context for decisions, which would be missing if only the current data item were used. Using the most recent data helps identify and react to present trends, which would be diluted if all data from the beginning of time were included.

Aggregation is one of the most fundamental data processing operations. This is true in general, not just in stream processing. Aggregation is versatile: it can compute counts,
© Springer International Publishing AG 2018
S. Sakr, A. Zomaya (eds.), Encyclopedia of Big Data Technologies,
https://doi.org/10.1007/978-3-319-63962-8_157-1

averages, or maxima, index data structures, sketches such as Bloom filters, and many more. In databases, it shows up as a basic relational algebra operator called group-by-aggregate and denoted $\gamma$ (Garcia-Molina et al. 2008). In spreadsheets, it shows up as a function from a range of cells to a summary statistic (Sajaniemi and Pekkanen 1988). In programming languages, it shows up as a popular higher-order function called fold (Hutton 1999). In MapReduce, it shows up as reduce (Dean and Ghemawat 2004), which has been leveraged in many tasks, including computations that do not diminish the volume of data.

In stream processing, aggregation plays a similarly central role. But unlike the abovementioned cases, which focus on data at rest, streaming aggregation must handle data in motion. In particular, sliding-window aggregation must handle inserting new data items into the window as they arrive and evicting old data items from the window as they expire. Supporting this efficiently poses algorithmic challenges, especially for non-invertible aggregation functions such as max, for which there is no way to “subtract off” expiring items.

From an algorithmic perspective, handling sliding windows with both insertion and eviction is more challenging than handling just insertion. Yet, there are two cases where eviction does not matter: unbounded and tumbling windows. Unbounded windows appear, for instance, in CQL (Arasu et al. 2006). Because they grow indefinitely, it is sufficient to update aggregations upon insert and not keep the data item itself around; they never need to call evict. Tumbling windows are more common; because they clear the entire contents of the window at the same time, there is no need to call evict on individual elements of the window.

Sliding windows are most commonly first-in, first-out (FIFO), resembling the behavior of a queue. What to keep in a sliding window and how often the aggregation is computed are controlled by policies, cataloged elsewhere (Gedik 2013); they may be count-based (e.g., the past 128 elements) or time-based (e.g., the past 12 min), among others. Regardless of policies, FIFO sliding-window aggregation (SWAG) can be formulated as an abstract data type with the following operations:

• insert(v) appends the value v to the window.
• evict() removes the “oldest” value in the window.
• query() returns the aggregation of the values in the window.

Metrics of interest in SWAG implementations are throughput, latency, and memory footprint. SWAG implementations also differ in generality: to enhance efficiency, aggregation operations, when feasible, are applied incrementally, that is, by modifying a running sum of sorts in response to data items arriving or leaving the window. To what extent this can be exploited depends on the nature of the aggregation operation.

Past work (Gray et al. 1996; Tangwongsan et al. 2015) has cast most aggregation operations as binary operators, written $\oplus$, and has categorized them based on algebraic properties. Table 1 lists common aggregation operations with their properties and groups them into categories. An aggregation operator is invertible if there exists some function $\ominus$ such that $(x \oplus y) \ominus y = x$ for all $x$ and $y$. Using $\ominus$, SWAGs can implement eviction as an undo. A function is associative if $x \oplus (y \oplus z) = (x \oplus y) \oplus z$ for all $x$, $y$, and $z$. SWAGs can take advantage of associativity by applying $\oplus$ at arbitrary places inside the window. Without associativity, SWAGs are restricted to applying $\oplus$ only at the end, upon insertion. A function is commutative if $x \oplus y = y \oplus x$ for all $x$ and $y$. SWAGs are able to ignore the insertion order of data items for commutative aggregation operators. An aggregation operator is rank-based if it relies upon an ordering by some attribute of each data item, for instance, to find the $i$th-smallest.

Table 2 presents an overview of the SWAG algorithms presented in this article, with their asymptotic complexity, space usage, and restrictions. The most straightforward SWAG algorithm is called Recalc, since it always recalculates all values. Upon any insert or evict, Recalc walks the entire window and recomputes the

Sliding-Window Aggregation Algorithms, Table 1 Aggregation operations. Check marks (✓), crosses (✗), and question marks (?) indicate that a property is true for all, false for all, or false for some of the given group, respectively

| Category | Invertible | Associative | Commutative | Rank-based |
|---|---|---|---|---|
| Sum-like: sum, count, average, standard deviation, ... | ✓ | ✓ | ✓ | ✗ |
| Collect-like: collect list, concatenate strings, ith-youngest, ... | ✓ | ✓ | ✗ | ? |
| Median-like: median, percentile, ith-smallest, ... | ✓ | ✓ | ✓ | ✓ |
| Max-like: max, min, argMax, argMin, maxCount, ... | ✗ | ✓ | ? | ✗ |
| Sketch-like: Bloom filter (Bloom 1970), CountMin (Cormode and Muthukrishnan 2005), HyperLogLog (Flajolet et al. 2007) | ✗ | ✓ | ✓ | ✗ |

Sliding-Window Aggregation Algorithms, Table 2 Summary of aggregation algorithms and their properties, where $n$ is the window size and $n_{\max}$ is the size of the smallest contiguous range that contains all the shared windows

| Algorithm | Time | Space | Restrictions |
|---|---|---|---|
| Recalc | Worst-case $O(n)$ | $O(n)$ | None |
| Subtract-on-Evict (SOE) | Worst-case $O(1)$ | $O(n)$ | Sum-like or collect-like |
| Order Statistics Tree (OST) (Hirzel et al. 2016) | Worst-case $O(\log n)$ | $O(n)$ | Median-like |
| Reactive Aggregator (RA) (Tangwongsan et al. 2015) | Average $O(\log n)$ | $O(n)$ | Associative |
| DABA (Tangwongsan et al. 2017) | Worst-case $O(1)$ | $O(n)$ | Associative, FIFO |
| B-Int (Arasu and Widom 2004) | Shared $O(\log n_{\max})$ | $O(n_{\max})$ | Associative, FIFO |
| FlatFIT (Shein et al. 2017) | Average $O(1)$ | $O(n)$ | Associative, FIFO |

aggregation value by using all available elements. Each operation therefore takes $O(n)$ time, where $n$ is the current number of elements in the window. Recalc serves as the baseline comparison for all other SWAG algorithms. Subtract-on-evict (SOE) is an $O(1)$ algorithm, but it is not general: it can only be used when the aggregation is invertible. Upon every insert, SOE updates the current aggregation value using $\oplus$, and upon every evict, SOE updates that value using $\ominus$. The order statistics tree (OST) adds subtree statistics to the inner nodes of a balanced search tree (Hirzel et al. 2016). Values are put in both a queue and the tree, making both insert and evict $O(\log n)$. In return, query calls for aggregations such as the median or $p$th percentile also take $O(\log n)$ time, because such information can be derived by traversing a path that is no longer than the height of the tree.

This section gave a brief overview with background and some simple aggregation algorithms. In general, research into SWAG algorithms tries to avoid $O(n)$ costs (unlike Recalc) but maintain generality (unlike SOE and OST). The next section will discuss the more sophisticated algorithms from Table 2 that offer improvements toward this goal.
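As a minimal sketch of the two baseline algorithms described in this section (not code from the cited papers), the following Python classes implement the insert/evict/query interface. The class names, the `combine`, `inverse`, and `identity` parameters, and the `demo` helper are assumptions made for illustration. The subtract-on-evict variant is only exercised with sum, where undoing the oldest item is a plain subtraction.

```python
from collections import deque
from functools import reduce
import operator


class RecalcSWAG:
    """Recalc: re-fold the whole window after every insert or evict."""

    def __init__(self, combine, identity):
        self.combine = combine      # binary aggregation operator
        self.identity = identity    # result for an empty window
        self.window = deque()       # FIFO window contents
        self.value = identity

    def _recalc(self):
        # O(n): walks the entire window.
        self.value = reduce(self.combine, self.window, self.identity)

    def insert(self, v):
        self.window.append(v)
        self._recalc()

    def evict(self):
        self.window.popleft()
        self._recalc()

    def query(self):
        return self.value


class SubtractOnEvictSWAG:
    """SOE: O(1) updates, but only for invertible aggregations."""

    def __init__(self, combine, inverse, identity):
        self.combine = combine
        self.inverse = inverse      # assumed to undo combine for the oldest item
        self.window = deque()       # items are kept so evict knows what to undo
        self.value = identity

    def insert(self, v):
        self.window.append(v)
        self.value = self.combine(self.value, v)

    def evict(self):
        oldest = self.window.popleft()
        self.value = self.inverse(self.value, oldest)

    def query(self):
        return self.value


def demo(stream=(2, 7, 6, 3, 5), size=3):
    """Run a count-based window of `size` items over `stream`."""
    recalc_max = RecalcSWAG(max, float("-inf"))
    soe_sum = SubtractOnEvictSWAG(operator.add, operator.sub, 0)
    results = []
    for v in stream:
        recalc_max.insert(v)
        soe_sum.insert(v)
        if len(recalc_max.window) > size:
            recalc_max.evict()
            soe_sum.evict()
        results.append((recalc_max.query(), soe_sum.query()))
    return results
```

Note that max is paired with Recalc here because it has no inverse; passing it to the subtract-on-evict class would require an `inverse` function that does not exist.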
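The idea of keeping window items in both a queue and an ordered structure, as described for the order statistics tree, can be illustrated with a small Python sketch. This is not the OST of Hirzel et al.: as a loudly stated assumption, a sorted Python list stands in for the balanced tree, so insert and evict cost $O(n)$ here rather than $O(\log n)$. The names `SortedWindowSWAG`, `kth_smallest`, and `sliding_medians` are hypothetical.

```python
import bisect
from collections import deque


class SortedWindowSWAG:
    """Rank-based SWAG sketch: a FIFO queue plus a sorted view of the window.

    Assumption: a sorted list replaces the balanced order statistics tree,
    so updates are O(n) here, not O(log n).
    """

    def __init__(self):
        self.queue = deque()   # arrival order, identifies the oldest item
        self.ordered = []      # the same items, kept sorted by value

    def insert(self, v):
        self.queue.append(v)
        bisect.insort(self.ordered, v)

    def evict(self):
        oldest = self.queue.popleft()
        # Find one occurrence of the oldest value and remove it.
        del self.ordered[bisect.bisect_left(self.ordered, oldest)]

    def kth_smallest(self, k):
        """Return the k-th smallest item in the window (k is 0-based)."""
        return self.ordered[k]

    def median(self):
        """Return the lower median of the current window."""
        return self.ordered[(len(self.ordered) - 1) // 2]


def sliding_medians(stream, size):
    """Median of the last `size` items after each arrival."""
    swag = SortedWindowSWAG()
    out = []
    for v in stream:
        swag.insert(v)
        if len(swag.queue) > size:
            swag.evict()
        out.append(swag.median())
    return out
```

A real order statistics tree would store subtree sizes in its inner nodes so that `kth_smallest` follows a single root-to-leaf path.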

Key Research Findings

The most successful techniques in speeding up sliding-window aggregation have been data structuring and algorithmic techniques that yield asymptotic improvements. They are the most effective when the aggregation function meets certain algebraic requirements. For instance, there are important aggregation operations that are associative, but not necessarily invertible nor commutative.

Pre-aggregation of data items that will be evicted at the same time is a technique that can be applied together with all SWAG algorithms discussed in this article. When data items are co-evicted, the window need not store them individually but can instead store partial aggregations, reducing the effective window size $n$ in Table 2. Pre-aggregation algorithms include paned windows (Li et al. 2005), paired windows (Krishnamurthy et al. 2006), and Cutty windows (Carbone et al. 2016). Windows are sometimes coarsened to enable pre-aggregation, improving performance at the expense of some approximation.

B-Int (Arasu and Widom 2004), designed to facilitate sharing across windows, stores a “shared” window $S$ that contains inside it all the windows being shared. To facilitate fast queries, B-Int maintains pre-aggregated values for all base intervals that lie within $S$. Base intervals (more commonly known now as dyadic intervals) are intervals of the form $[2^{\ell} k,\; 2^{\ell}(k+1) - 1]$ with $\ell, k \geq 0$. The parameter $\ell$ defines the level of a base interval. This allows a query between the $i$-th data item and $j$-th data item within $S$ to be answered by combining at most $O(\log |i - j|)$ pre-aggregated values, resulting in logarithmic running time.

The Reactive Aggregator (RA) (Tangwongsan et al. 2015) is implemented via a balanced tree ordered by time, where internal nodes hold the partial aggregations of their subtrees, and offers $O(\log n)$ amortized time. Instead of the conventional approach to implementing balanced trees by frequent rebalancing, RA projects the tree over a complete perfect binary tree, which it stores in a flat array. This leads to higher performance than other tree-based SWAG implementations in practice, since it saves the time of rebalancing as well as the overheads of pointers and fine-grained memory allocation.

For latency-sensitive applications, the aggregation algorithm cannot afford a long pause. DABA (Tangwongsan et al. 2017) ensures that every SWAG operation takes $O(1)$ time in the worst case, not just on average. This is accomplished by extending Okasaki’s functional queue (Okasaki 1995) and removing dependencies on lazy evaluation and automatic garbage collection. FlatFIT (Shein et al. 2017) is another algorithm that achieves $O(1)$ time, although in the amortized sense. This is accomplished by storing pre-aggregated values in a tree-like index structure that promotes reuse, reminiscent of path compression in the union-find data structure.

There are a number of other generic techniques that tend to apply broadly to sliding-window aggregation. Window partitioning is sometimes used as a means to maintain group-by aggregation and obtain data parallelism through fission (Schneider et al. 2015). When stream data items arrive out of order, a holding buffer can be used to reorder them before they enter the window (Srivastava and Widom 2004). Alternatively, in the case that the stream is formed by merging multiple sub-streams, out-of-order arrival may be handled by pre-aggregating each data source separately and consolidating partial aggregation results as late as possible when doing an actual query (Krishnamurthy et al. 2010).

Examples of Application

Many applications of stream processing depend heavily upon sliding-window aggregation. This section describes concrete examples of applying sliding-window aggregation to real-world use cases. Understanding these examples helps one appreciate the problems and guides the design of solutions.
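The flat-array tree layout that the text attributes to the Reactive Aggregator can be illustrated with a small Python sketch. It is not the published RA data structure: as simplifying assumptions, the capacity is fixed and must be a power of two, empty slots hold an identity element, and the class name `FlatTreeSWAG` and helper `sliding_max` are hypothetical. Leaves form a circular buffer ordered by arrival, each internal node stores the aggregate of its two children, and each insert or evict refreshes one leaf-to-root path.

```python
class FlatTreeSWAG:
    """Perfect binary tree in a flat list; node i has children 2i and 2i+1."""

    def __init__(self, combine, identity, capacity):
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self.combine = combine          # associative binary operator
        self.identity = identity        # combine(identity, x) == x
        self.capacity = capacity
        self.tree = [identity] * (2 * capacity)  # leaves at capacity..2*capacity-1
        self.front = 0                  # leaf slot of the oldest item
        self.size = 0

    def _update(self, slot, value):
        # Write one leaf, then recompute its ancestors: O(log capacity).
        i = slot + self.capacity
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = self.combine(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def insert(self, v):
        if self.size == self.capacity:
            raise OverflowError("window is full")
        self._update((self.front + self.size) % self.capacity, v)
        self.size += 1

    def evict(self):
        if self.size == 0:
            raise IndexError("window is empty")
        self._update(self.front, self.identity)
        self.front = (self.front + 1) % self.capacity
        self.size -= 1

    def _fold(self, lo, hi):
        # Combine leaf slots lo..hi-1 in left-to-right order.
        left, right = self.identity, self.identity
        lo += self.capacity
        hi += self.capacity
        while lo < hi:
            if lo & 1:
                left = self.combine(left, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                right = self.combine(self.tree[hi], right)
            lo //= 2
            hi //= 2
        return self.combine(left, right)

    def query(self):
        end = self.front + self.size
        if end <= self.capacity:
            return self._fold(self.front, end)
        # The window wraps around: older suffix first, then newer prefix.
        return self.combine(self._fold(self.front, self.capacity),
                            self._fold(0, end - self.capacity))


def sliding_max(stream, size, capacity=8):
    """Max of the last `size` items after each arrival."""
    swag = FlatTreeSWAG(max, float("-inf"), capacity)
    out = []
    for v in stream:
        swag.insert(v)
        if swag.size > size:
            swag.evict()
        out.append(swag.query())
    return out
```

Because the query combines partial aggregates in arrival order, the sketch also works for associative operators that are neither invertible nor commutative, such as string concatenation with the empty string as identity.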

Medical service providers want to save lives by getting early warnings when there is a high likelihood that a patient’s health is about to deteriorate. For instance, the Artemis system analyzes data from real-time sensors on patients in a neonatal intensive care unit (Blount et al. 2010). Among other things, it counts how often the blood oxygen saturation and the mean arterial blood pressure fall below a threshold in a 20-s sliding window. If the counts exceed another threshold, Artemis raises an alert.

Financial agents engaged in algorithmic trading want to make money by buying and selling stocks or other financial instruments. Treleaven et al. review the current practice for how that works technologically (Treleaven et al. 2013). Streaming systems for algorithmic trading make their decisions based on predicted future prices. One of the inputs for these predictions is a moving average of the recent history of a price, for example, over a 1-hour sliding window.

Road traffic can be regulated using variable tolling to implement congestion-pricing policies. One of the most popular benchmarks for streaming systems, Linear Road, is based on variable tolling (Arasu et al. 2004). The idea is to regulate demand by charging higher tolls for driving on congested roads. To do this, the streaming system must determine whether a road is congested. This works by using sliding-window aggregation to compute the number and average speed of vehicles in a given road segment and time window.

The above list of use cases is by no means exhaustive; there are many more applications of sliding-window aggregation, for instance, in phone providers, security, and social media.

Future Directions for Research

Research on sliding-window aggregation has focused mainly on aggregation functions that are associative and on FIFO windows. Much less is known for other nontrivial scenarios. Is it possible to efficiently support associative aggregation functions on windows that are non-FIFO? Besides associativity and invertibility, what other properties can be exploited to develop general-purpose algorithms for fast sliding-window aggregation? How can SWAG algorithms take better advantage of multicore parallelism?

Cross-References

• Adaptive Windowing
• Incremental Sliding Window Analytics
• Stream Query Optimization
• Stream Window Aggregation Semantics and Optimisation

References

Arasu A, Widom J (2004) Resource sharing in continuous sliding window aggregates. In: Conference on very large data bases (VLDB), pp 336–347
Arasu A, Cherniack M, Galvez E, Maier D, Maskey AS, Ryvkina E, Stonebraker M, Tibbetts R (2004) Linear road: a stream data management benchmark. In: Conference on very large data bases (VLDB), pp 480–491
Arasu A, Babu S, Widom J (2006) The CQL continuous query language: semantic foundations and query execution. J Very Large Data Bases 15(2):121–142
Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13(7):422–426
Blount M, Ebling MR, Eklund JM, James AG, McGregor C, Percival N, Smith K, Sow D (2010) Real-time analysis for intensive care: development and deployment of the Artemis analytic system. IEEE Eng Med Biol Mag 29:110–118
Carbone P, Traub J, Katsifodimos A, Haridi S, Markl V (2016) Cutty: aggregate sharing for user-defined windows. In: Conference on information and knowledge management (CIKM), pp 1201–1210
Cormode G, Muthukrishnan S (2005) An improved data stream summary: the count-min sketch and its applications. J Algorithms 55(1):58–75
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Symposium on operating systems design and implementation (OSDI), pp 137–150
Flajolet P, Fusy E, Gandouet O, Meunier F (2007) HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Conference on analysis of algorithms (AofA), pp 127–146
Garcia-Molina H, Ullman JD, Widom J (2008) Database systems: the complete book, 2nd edn. Pearson/Prentice Hall, New Delhi
Gedik B (2013) Generic windowing support for extensible stream processing systems. Softw Pract Exp 44(9):1105–1128

Gray J, Bosworth A, Layman A, Pirahesh H (1996) Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-total. In: International conference on data engineering (ICDE), pp 152–159
Hirzel M, Rabbah R, Suter P, Tardieu O, Vaziri M (2016) Spreadsheets for stream processing with unbounded windows and partitions. In: Conference on distributed event-based systems (DEBS), pp 49–60
Hutton G (1999) A tutorial on the universality and expressiveness of fold. J Funct Program 9(1):355–372
Krishnamurthy S, Wu C, Franklin M (2006) On-the-fly sharing for streamed aggregation. In: International conference on management of data (SIGMOD), pp 623–634
Krishnamurthy S, Franklin MJ, Davis J, Farina D, Golovko P, Li A, Thombre N (2010) Continuous analytics over discontinuous streams. In: International conference on management of data (SIGMOD), pp 1081–1092
Li J, Maier D, Tufte K, Papadimos V, Tucker PA (2005) No pane, no gain: efficient evaluation of sliding-window aggregates over data streams. ACM SIGMOD Rec 34(1):39–44
Okasaki C (1995) Simple and efficient purely functional queues and deques. J Funct Program 5(4):583–592
Sajaniemi J, Pekkanen J (1988) An empirical analysis of spreadsheet calculation. Softw Pract Exp 18(6):583–596
Schneider S, Hirzel M, Gedik B, Wu KL (2015) Safe data parallelism for general streaming. IEEE Trans Comput 64(2):504–517
Shein AU, Chrysanthis PK, Labrinidis A (2017) FlatFIT: accelerated incremental sliding-window aggregation for real-time analytics. In: Conference on scientific and statistical database management (SSDBM), pp 5:1–5:12
Srivastava U, Widom J (2004) Flexible time management in data stream systems. In: Principles of database systems (PODS), pp 263–274
Tangwongsan K, Hirzel M, Schneider S, Wu KL (2015) General incremental sliding-window aggregation. In: Conference on very large data bases (VLDB), pp 702–713
Tangwongsan K, Hirzel M, Schneider S (2017) Low-latency sliding-window aggregation in worst-case constant time. In: Conference on distributed event-based systems (DEBS), pp 66–77
Treleaven P, Galas M, Lalchand V (2013) Algorithmic trading review. Commun ACM 56(11):76–85
