# Crash Course on Data Stream Algorithms

Part I: Basic Definitions and Numerical Streams

Andrew McGregor
University of Massachusetts Amherst

## Goals of the Crash Course

Goal: Give a flavor for the theoretical results and techniques from the hundreds of papers on the design and analysis of stream algorithms. "When we abstract away the application-specific details, what are the basic algorithmic ideas and challenges in stream processing? What is and isn't possible?"

Disclaimer: Talks will be theoretical/mathematical but shouldn't require much in the way of prerequisites.

Request: If you get bored, ask questions. If you get lost, ask questions. If you'd like to ask questions, ask questions.

## Outline

- Basic Definitions
- Sampling
- Sketching
- Counting Distinct Items
- Summary of Some Other Results

## Data Stream Model

Stream: m elements from a universe of size n, e.g.,

x_1, x_2, ..., x_m = 3, 5, 3, 7, 5, 4, ...

Goal: Compute a function of the stream, e.g., median, number of distinct elements, longest increasing subsequence.

Catch:

1. Limited working memory, sublinear in n and m
2. Access data sequentially
3. Process each element quickly

The model has origins in the 70s but has become popular in the last ten years because of a growing theory and wide applicability.

## Why's it become popular?

Practical Appeal: Faster networks, cheaper data storage, and ubiquitous data-logging result in massive amounts of data to be processed. Applications to network monitoring, query planning, I/O efficiency for massive data, sensor network aggregation.

Theoretical Appeal: Easy-to-state problems that are hard to solve. Links to communication complexity, embeddings, pseudo-random generators, approximation, compressed sensing.


## Sampling and Statistics

Sampling is a general technique for tackling massive amounts of data.

Example: To compute the median packet size of some IP packets, we could sample some and use the median of the sample as an estimate for the true median. Statistical arguments relate the size of the sample to the accuracy of the estimate.

Challenge: But how do you take a sample from a stream of unknown length, or from a "sliding window"?

## Reservoir Sampling

Problem: Find a uniform sample s from a stream of unknown length.

Algorithm: Initially s = x_1. On seeing the t-th element, set s ← x_t with probability 1/t.

Analysis: What's the probability that s = x_i at some time t ≥ i?

P[s = x_i] = (1/i) × (1 − 1/(i+1)) × ... × (1 − 1/t) = 1/t

To get k samples we use O(k log n) bits of space.
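The algorithm above is tiny in practice. A minimal sketch in Python (the function name and the explicit `rng` parameter are illustrative choices, not from the slides):

```python
import random

def reservoir_sample(stream, rng=None):
    """Uniform sample from a stream of unknown length.

    Initially s = x_1; on seeing the t-th element, replace s by x_t
    with probability 1/t.
    """
    rng = rng or random.Random()
    s = None
    for t, x in enumerate(stream, start=1):
        # Keep the new element with probability 1/t (always true for t = 1).
        if rng.random() < 1.0 / t:
            s = x
    return s
```

A single pass yields one sample; running k independent copies gives the k samples mentioned above.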

## Priority Sampling for Sliding Windows

Problem: Maintain a uniform sample from the last w items.

Algorithm:

1. For each x_i, pick a random value v_i ∈ (0, 1)
2. In a window x_{j−w+1}, ..., x_j, return the value x_i with the smallest v_i
3. To do this, maintain the set S of all elements in the sliding window whose v value is minimal among subsequent values

Analysis: The probability that the j-th oldest element is in S is 1/j, so the expected number of items in S is 1/w + 1/(w−1) + ... + 1/1 = O(log w). Hence, the algorithm only uses O(log w · log n) bits of memory.
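The set S in step 3 is exactly a monotone deque: kept entries have strictly increasing priorities from oldest to newest. A sketch in Python (names are illustrative, not from the slides):

```python
import random
from collections import deque

def sliding_window_samples(stream, w, rng=None):
    """After each element, yield a uniform sample from the last w items.

    Each x_i gets a random priority v_i in (0, 1); the window sample is the
    element with the smallest priority.  Only elements whose priority is
    minimal among all subsequent priorities are kept, so the deque has
    expected size O(log w).
    """
    rng = rng or random.Random()
    dq = deque()  # (index, item, priority) with strictly increasing priorities
    for t, x in enumerate(stream):
        v = rng.random()
        while dq and dq[-1][2] >= v:
            dq.pop()               # x_t beats older items with larger priority
        dq.append((t, x, v))
        while dq[0][0] <= t - w:
            dq.popleft()           # oldest kept item fell out of the window
        yield dq[0][1]             # smallest priority in window = the sample
```

A popped element can never again be the window minimum, since the element that popped it has a smaller priority and stays in the window longer.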

## Other Types of Sampling

Universe sampling: For a random i ∈_R [n], compute f_i = |{j : x_j = i}|.

Minwise hashing: Sample i ∈_R {i : there exists j such that x_j = i}.

AMS sampling: Sample x_J for J ∈_R [m] and compute r = |{j ≥ J : x_j = x_J}|. Handy when estimating quantities like Σ_i g(f_i), because

E[m(g(r) − g(r − 1))] = Σ_i g(f_i)
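The AMS identity can be sketched for a fixed stream as follows (offline for clarity; a streaming version would choose J by reservoir sampling and count r forward from there). The function name and the choice of g are illustrative; with g(z) = z² the estimator is unbiased for the second frequency moment Σ_i f_i²:

```python
import random

def ams_estimate(stream, g, rng=None):
    """One AMS estimate of sum_i g(f_i).

    Pick a uniform position J, let r = |{j >= J : x_j = x_J}|, and return
    m * (g(r) - g(r - 1)).  Summing over the f_i possible values of r for
    item i telescopes to g(f_i), so the expectation is sum_i g(f_i).
    """
    rng = rng or random.Random()
    m = len(stream)
    J = rng.randrange(m)
    r = sum(1 for j in range(J, m) if stream[j] == stream[J])
    return m * (g(r) - g(r - 1))
```

Averaging independent copies reduces the variance in the usual way.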


## Sketching

Sketching is another general technique for processing streams.

Basic idea: Apply a linear projection "on the fly" that takes high-dimensional data to a lower-dimensional space. Post-process the lower-dimensional image to estimate the quantities of interest.

## Estimating the Difference Between Two Streams

Input: Stream from two sources, x_1, x_2, ..., x_m ∈ ([n] ∪ [n])^m, i.e., each element is a value in [n] tagged as coming from either the first (red) or the second (blue) source.

Goal: Estimate the difference between the distribution of red values and blue values, e.g.,

Σ_{i∈[n]} |f_i − g_i|

where f_i = |{k : x_k = i and x_k is red}| and g_i = |{k : x_k = i and x_k is blue}|.

## p-Stable Distributions and Algorithm

Defn: A p-stable distribution µ has the following property: for X, Y, Z ∼ µ and a, b ∈ R,

aX + bY ∼ (|a|^p + |b|^p)^{1/p} Z

E.g., the Gaussian is 2-stable and the Cauchy distribution is 1-stable.

Algorithm:

- Generate a random matrix A ∈ R^{k×n} where A_{ij} ∼ Cauchy and k = O(ε^{−2})
- Compute the sketches Af and Ag incrementally
- Return median(|t_1|, ..., |t_k|) where t = Af − Ag

Analysis: By the 1-stability property, for Z_i ∼ Cauchy,

|t_i| = |Σ_j A_{ij}(f_j − g_j)| ∼ |Z_i| · Σ_j |f_j − g_j|

For k = O(ε^{−2}), since median(|Z_i|) = 1, with high probability,

(1 − ε) Σ_j |f_j − g_j| ≤ median(|t_1|, ..., |t_k|) ≤ (1 + ε) Σ_j |f_j − g_j|
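A toy version of the 1-stable sketch in Python. It draws the dense Cauchy entries explicitly (a standard Cauchy sample is tan(π(U − ½)) for uniform U) rather than generating them pseudo-randomly on the fly, and all names are illustrative:

```python
import math
import random
import statistics

def l1_sketch_estimate(f, g, k=400, rng=None):
    """Estimate sum_j |f_j - g_j| with a k-row Cauchy (1-stable) sketch."""
    rng = rng or random.Random()
    n = len(f)
    # Standard Cauchy entries: tan(pi * (U - 1/2)) for uniform U.
    A = [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
         for _ in range(k)]
    # The sketches Af and Ag are linear, so they could be maintained
    # incrementally; here we just compute t = A(f - g) directly.
    t = [sum(row[j] * (f[j] - g[j]) for j in range(n)) for row in A]
    # median(|Z|) = 1 for a standard Cauchy Z, so the median of the |t_i|
    # concentrates around sum_j |f_j - g_j|.
    return statistics.median(abs(x) for x in t)
```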

## A Useful Multi-Purpose Sketch: Count-Min Sketch

Heavy Hitters: Find all i such that f_i ≥ φm.

Range Sums: Estimate Σ_{i≤k≤j} f_k when i, j aren't known in advance.

Find k-Quantiles: Find values q_0 ≤ q_1 ≤ ... ≤ q_k such that q_0 = 0, q_k = n, and

Σ_{i≤q_{j−1}} f_i < jm/k ≤ Σ_{i≤q_j} f_i

Algorithm (Count-Min Sketch):

- Maintain an array of counters c_{i,j} for i ∈ [d] and j ∈ [w]
- Construct d random hash functions h_1, h_2, ..., h_d : [n] → [w]
- Update counters: on seeing value v, increment c_{i,h_i(v)} for each i ∈ [d]
- To get an estimate of f_k, return f̃_k = min_i c_{i,h_i(k)}

Analysis: For d = O(log 1/δ) and w = O(1/ε),

P[f_k ≤ f̃_k ≤ f_k + εm] ≥ 1 − δ
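A minimal Count-Min sketch in Python. Salted calls to Python's built-in `hash` stand in for the random hash functions of the analysis, and the class and method names are illustrative:

```python
import random

class CountMin:
    """Count-Min sketch: d rows of w counters, one hash function per row.

    The estimate min_i c[i][h_i(k)] never under-counts, and over-counts
    by more than eps * m with probability at most delta when w = O(1/eps)
    and d = O(log 1/delta).
    """
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.counters = [[0] * w for _ in range(d)]
        self.salts = [rng.getrandbits(64) for _ in range(d)]  # one per row

    def _bucket(self, i, v):
        return hash((self.salts[i], v)) % self.w

    def update(self, v):
        for i in range(self.d):
            self.counters[i][self._bucket(i, v)] += 1

    def estimate(self, v):
        return min(self.counters[i][self._bucket(i, v)] for i in range(self.d))
```

Because updates only increment counters, the same sketch works under arbitrary interleaving of items, and linear combinations of sketches support the range-sum and quantile queries above via dyadic decompositions.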


## Counting Distinct Elements

Input: Stream x_1, x_2, ..., x_m ∈ [n]^m.

Goal: Estimate the number of distinct values in the stream up to a multiplicative factor (1 + ε) with high probability.

## Algorithm

Algorithm:

1. Apply a random hash function h : [n] → [0, 1] to each element
2. Compute φ, the t-th smallest hash value seen, where t = 21/ε²
3. Return r̃ = t/φ as an estimate for r, the number of distinct items

Analysis:

1. The algorithm uses O(ε^{−2} log n) bits of space.
2. We'll show the estimate has good accuracy with reasonable probability:

P[|r̃ − r| ≤ εr] ≥ 9/10

## Accuracy Analysis

1. Suppose the distinct items are a_1, ..., a_r.
2. Over-estimation:

P[r̃ ≥ (1 + ε)r] = P[t/φ ≥ (1 + ε)r] = P[φ ≤ t/(r(1 + ε))]

3. Let X_i = 1[h(a_i) ≤ t/(r(1 + ε))] and X = Σ_i X_i, so E[X] = t/(1 + ε). Then

P[φ ≤ t/(r(1 + ε))] = P[X ≥ t] = P[X ≥ (1 + ε)E[X]]

4. By a Chebyshev analysis, P[X ≥ (1 + ε)E[X]] ≤ 1/(ε² E[X]) ≤ 1/20.
5. Under-estimation: A similar analysis shows P[r̃ ≤ (1 − ε)r] ≤ 1/20.


## Some Other Results

Correlations:
- Input: (x_1, y_1), (x_2, y_2), ..., (x_m, y_m)
- Goal: Estimate the strength of correlation between x and y via the distance between the joint distribution and the product of the marginals.
- Result: (1 + ε) approx in Õ(ε^{−O(1)}) space.

Linear Regression:
- Input: Stream defines a matrix A ∈ R^{n×d} and a vector b ∈ R^{n×1}
- Goal: Find x such that ‖Ax − b‖_2 is minimized.
- Result: (1 + ε) estimation in Õ(d² ε^{−1}) space.

## Some More Other Results

Histograms:
- Input: x_1, x_2, ..., x_m ∈ [n]^m
- Goal: Determine the B-bucket histogram H : [m] → R minimizing Σ_{i∈[m]} (x_i − H(i))²
- Result: (1 + ε) estimation in Õ(B² ε^{−1}) space.

Transpositions and Increasing Subsequences:
- Input: x_1, x_2, ..., x_m ∈ [n]^m
- Goal: Estimate the number of transpositions |{i < j : x_i > x_j}|; estimate the length of the longest increasing subsequence.
- Results: (1 + ε) approx in Õ(ε^{−1}) and Õ(ε^{−1}√n) space respectively.

## Thanks!

Blog: http://polylogblog.wordpress.com

Lectures: Piotr Indyk, MIT, http://stellar.mit.edu/S/course/6/fa07/6.895/

Books:
- "Data Streams: Algorithms and Applications", S. Muthukrishnan (2005)
- "Algorithms and Complexity of Stream Processing", A. McGregor and S. Muthukrishnan (forthcoming)