
CS85: Data Stream Algorithms

Lecture Notes, Fall 2009


Amit Chakrabarti
Dartmouth College
Latest Update: December 15, 2009
Contents

0 Preliminaries: The Data Stream Model
  0.1 The Basic Setup
  0.2 The Quality of an Algorithm's Answer
  0.3 Variations of the Basic Setup
1 Finding Frequent Items Deterministically
  1.1 The Problem
  1.2 The Misra-Gries Algorithm
  1.3 Analysis of the Algorithm
2 Estimating the Number of Distinct Elements
  2.1 The Problem
  2.2 The Algorithm
  2.3 The Quality of the Algorithm's Estimate
  2.4 The Median Trick
3 Finding Frequent Items via Sketching
  3.1 The Problem
  3.2 The Count-Min Sketch Algorithm
    3.2.1 The Quality of the Algorithm's Estimate
    3.2.2 Space Usage
  3.3 The Count Sketch Algorithm
  3.4 Analysis of the Algorithm
    3.4.1 Space Usage
    3.4.2 Comparison of Count Min and Count Sketch Algorithm
4 Estimating Distinct Items Using Hashing
  4.1 The Problem
  4.2 The BJKST Algorithm
  4.3 Analysis of the Algorithm
    4.3.1 Space Complexity
    4.3.2 Analysis: How Well Does The BJKST Algorithm Do?
  4.4 Future Work
5 Estimating Frequency Moments
  5.1 Background
  5.2 AMS Estimator for F_k
  5.3 Analysis of the Algorithm
6 Why does the amazing AMS algorithm work?
7 Finding the Median
  7.1 The Median / Selection Problem
  7.2 Munro-Paterson Algorithm (1980)
8 Approximate Selection
  8.1 Two approaches
  8.2 Greenwald - Khanna Algorithm
  8.3 Guha-McGregor Algorithm
9 Graph Stream Algorithms
  9.1 Graph Streams
  9.2 Connectedness Problem
  9.3 Bipartite Graphs
  9.4 Shortest Paths (Distances)
    9.4.1 Algorithm Analysis
10 Finding Matchings, Counting Triangles in Graph Streams
  10.1 Matchings
  10.2 Triangle Counting
11 Geometric Streams
  11.1 Clustering
    11.1.1 Metric Spaces
    11.1.2 Working with Graphs
  11.2 Doubling Algorithm
  11.3 Summaries
  11.4 k-center Problem
    11.4.1 Guha's Algorithm
    11.4.2 Space Bounds
12 Core-sets
  12.1 The Problem
  12.2 Construction
  12.3 Streaming algorithm
13 Communication Complexity
  13.1 Introduction to Communication Complexity
    13.1.1 EQUALITY problem
  13.2 Communication complexity in Streaming Algorithms
    13.2.1 INDEX problem
14 Reductions
15 Set Disjointness and Multi-Pass Lower Bounds
  15.1 Communication Complexity of DISJ
  15.2 ST-Connectivity
  15.3 Perfect Matching Problem
  15.4 Multiparty Set Disjointness (DISJ_{n,t})
Lecture 0
Preliminaries: The Data Stream Model
Scribe: Amit Chakrabarti
0.1 The Basic Setup
In this course, we shall be concerned with algorithms that compute some function of a massively long input stream σ. In the most basic model (which we shall call the vanilla streaming model), this is formalized as a sequence σ = ⟨a_1, a_2, . . . , a_m⟩, where the elements of the sequence (called tokens) are drawn from the universe [n] := {1, 2, . . . , n}. Note the two important size parameters: the stream length, m, and the universe size, n. If you read the literature in the area, you will notice that some authors interchange these two symbols. In this course, we shall consistently use m and n as we have just defined them.

Our central goal will be to process the input stream using a small amount of space s, i.e., to use s bits of random-access working memory. Since m and n are to be thought of as huge, we want to make s much smaller than these; specifically, we want s to be sublinear in both m and n. In symbols, we want

    s = o(min{m, n}) .

The holy grail is to achieve

    s = O(log m + log n) ,

because this amount of space is what we need to store a constant number of tokens from the stream and a constant number of counters that can count up to the length of the stream. Sometimes we can only come close and achieve a space bound of the form s = polylog(min{m, n}), where f(n) = polylog(g(n)) means that there exists a constant c > 0 such that f(n) = O((log g(n))^c).
The reason for calling the input a stream is that we are only allowed to access the input in streaming fashion,
i.e., we do not have random access to the tokens. We can only scan the sequence in the given order. We do consider
algorithms that make p passes over the stream, for some small integer p, keeping in mind that the holy grail is to
achieve p = 1. As we shall see, in our first few algorithms, we will be able to do quite a bit in just one pass.
0.2 The Quality of an Algorithm's Answer
The function we wish to compute, φ(σ) say, will usually be real-valued. We shall typically seek to compute only an estimate or approximation of the true value of φ(σ), because many basic functions can provably not be computed
exactly using sublinear space. For the same reason, we shall often allow randomized algorithms that may err with some small, but controllable, probability. This motivates the following basic definition.
Definition 0.2.1. Let A(σ) denote the output of a randomized streaming algorithm A on input σ; note that this is a random variable. Let φ be the function that A is supposed to compute. We say that the algorithm (ε, δ)-approximates φ if we have

    Pr[ |A(σ)/φ(σ) − 1| > ε ] ≤ δ .
Notice that the above definition insists on a multiplicative approximation. This is sometimes too strong a condition when the value of φ(σ) can be close to, or equal to, zero. Therefore, for some problems, we might instead seek an additive approximation, as defined below.

Definition 0.2.2. In the above setup, the algorithm A is said to (ε, δ)-additively-approximate φ if we have

    Pr[ |A(σ) − φ(σ)| > ε ] ≤ δ .
We have mentioned that certain things are provably impossible in sublinear space. Later in the course, we shall
study how to prove such impossibility results. Such impossibility results, also called lower bounds, are a rich field of
study in their own right.
0.3 Variations of the Basic Setup
Quite often, the function we are interested in computing is some statistical property of the multiset of items in the input stream σ. This multiset can be represented by a frequency vector f = (f_1, f_2, . . . , f_n), where

    f_j = |{i : a_i = j}| = number of occurrences of j in σ .

In other words, σ implicitly defines this vector f, and we are then interested in computing some function of the form Φ(f). While processing the stream, when we scan a token j ∈ [n], the effect is to increment the frequency f_j. Thus, σ can be thought of as a sequence of update instructions, updating the vector f.
With this in mind, it is interesting to consider more general updates to f: for instance, what if items could both arrive and depart from our multiset, i.e., if the frequencies f_j could be both incremented and decremented, and by variable amounts? This leads us to the turnstile model, in which the tokens in σ belong to [n] × {−L, . . . , L}, interpreted as follows:

    Upon receiving token a_i = (j, c), update f_j ← f_j + c .

Naturally, the vector f is assumed to start out at 0. In this generalized model, it is natural to change the role of the parameter m: instead of the stream's length, it will denote the maximum number of items in the multiset at any point of time. More formally, we require that, at all times, we have

    ‖f‖_1 = |f_1| + · · · + |f_n| ≤ m .

A special case of the turnstile model, that is sometimes important to consider, is the strict turnstile model, in which we assume that f ≥ 0 at all times. A further special case is the cash register model, where we only allow positive updates: i.e., we require that every update (j, c) have c > 0.
Lecture 1
Finding Frequent Items Deterministically
Scribe: Amit Chakrabarti
1.1 The Problem
We are in the vanilla streaming model. We have a stream σ = ⟨a_1, . . . , a_m⟩, with each a_i ∈ [n], and this implicitly defines a frequency vector f = (f_1, . . . , f_n). Note that f_1 + · · · + f_n = m.

In the MAJORITY problem, our task is as follows: if ∃ j : f_j > m/2, then output j, otherwise, output ⊥.

This can be generalized to the FREQUENT problem, with parameter k, as follows: output the set {j : f_j > m/k}.
In this lecture, we shall limit ourselves to deterministic algorithms for this problem. If we further limit ourselves to one-pass algorithms, even the simpler problem, MAJORITY, provably requires Ω(min{m, n}) space. However, we shall soon give a one-pass algorithm, the Misra-Gries Algorithm [MG82], that solves the related problem of estimating the frequencies f_j. As we shall see,

1. the properties of Misra-Gries are interesting in and of themselves, and

2. it is easy to extend Misra-Gries, using a second pass, to then solve the FREQUENT problem.
Thus, we now turn to the FREQUENCY-ESTIMATION problem. The task is to process σ to produce a data structure that can provide an estimate f̂_a for the frequency f_a of a given token a ∈ [n]. Note that a is given to us only after we have processed σ.
1.2 The Misra-Gries Algorithm
As with all one-pass data stream algorithms, we shall have an initialization section, executed before we see the stream,
a processing section, executed each time we see a token, and an output section, where we answer question(s) about
the stream, perhaps in response to a given query.
This algorithm uses a parameter k that controls the quality of the answers it gives. (Note: to solve the FREQUENT
problem with parameter k, we shall run the Misra-Gries algorithm with parameter k.) It maintains an associative array,
A, whose keys are tokens seen in the stream, and whose values are counters associated with these tokens. We keep at
most k − 1 counters at any time.
Initialize : A ← (empty associative array) ;

Process j :
1   if j ∈ keys(A) then
2       A[j] ← A[j] + 1 ;
3   else if |keys(A)| < k − 1 then
4       A[j] ← 1 ;
5   else
6       foreach ℓ ∈ keys(A) do
7           A[ℓ] ← A[ℓ] − 1 ;
8           if A[ℓ] = 0 then remove ℓ from A ;

Output : On query a, if a ∈ keys(A), then report f̂_a = A[a], else report f̂_a = 0 ;
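For concreteness, here is a short Python sketch that mirrors the pseudocode above; the class and method names are ours (not part of the original notes), and a dictionary plays the role of the associative array A.

import collections  # not strictly needed; a plain dict suffices

class MisraGries:
    """One-pass Misra-Gries frequency estimator with parameter k."""
    def __init__(self, k):
        self.k = k
        self.A = {}                      # token -> counter, at most k - 1 entries

    def process(self, j):
        if j in self.A:
            self.A[j] += 1
        elif len(self.A) < self.k - 1:
            self.A[j] = 1
        else:
            for key in list(self.A):     # decrement all stored counters (lines 6-8)
                self.A[key] -= 1
                if self.A[key] == 0:
                    del self.A[key]

    def estimate(self, a):
        # Theorem 1.3.1: f_a - m/k <= estimate <= f_a
        return self.A.get(a, 0)

For example, running MisraGries(k=10) over a stream and then querying estimate(a) returns a value that underestimates f_a by at most m/10.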
1.3 Analysis of the Algorithm
To process each token quickly, we could maintain the associative array A using a balanced binary search tree. Each key requires ⌈log n⌉ bits to store and each value requires at most ⌈log m⌉ bits. Since there are at most k − 1 key/value pairs in A at any time, the total space required is O(k(log m + log n)).

Now consider the quality of the algorithm's output. Let us pretend that A consists of n key/value pairs, with A[j] = 0 whenever j is not actually stored in A by the algorithm. Notice that the counter A[j] is incremented only when we process an occurrence of j in the stream. Thus, f̂_j ≤ f_j. On the other hand, whenever A[j] is decremented (in lines 7-8, we pretend that A[j] is incremented from 0 to 1, and then immediately decremented back to 0), we also decrement k − 1 other counters, corresponding to distinct tokens in the stream. Thus, each decrement of A[j] is witnessed by a collection of k distinct tokens (one of which is a j itself) from the stream. Since the stream consists of m tokens, there can be at most m/k such decrements. Therefore, f̂_j ≥ f_j − m/k. Putting these together we have the following theorem.
Theorem 1.3.1. The Misra-Gries algorithm with parameter k uses one pass and O(k(log m + log n)) bits of space, and provides, for any token j, an estimate f̂_j satisfying

    f_j − m/k ≤ f̂_j ≤ f_j .
Using this algorithm, we can now easily solve the FREQUENT problem in one additional pass. By the above theorem, if some token j has f_j > m/k, then its corresponding counter A[j] will be positive at the end of the Misra-Gries pass over the stream, i.e., j will be in keys(A). Thus, we can make a second pass over the input stream, counting exactly the frequencies f_j for all j ∈ keys(A), and then output the desired set of items.
Lecture 2
Estimating the Number of Distinct Elements
Scribe: Amit Chakrabarti
2.1 The Problem
As in the last lecture, we are in the vanilla streaming model. We have a stream σ = ⟨a_1, . . . , a_m⟩, with each a_i ∈ [n], and this implicitly defines a frequency vector f = (f_1, . . . , f_n). Let d = |{j : f_j > 0}| be the number of distinct elements that appear in σ.

In the DISTINCT-ELEMENTS problem, our task is to output an (ε, δ)-approximation (as in Definition 0.2.1) to d.

It is provably impossible to solve this problem in sublinear space if one is restricted to either deterministic algorithms (i.e., δ = 0), or exact algorithms (i.e., ε = 0). Thus, we shall seek a randomized approximation algorithm. In this lecture, we give a simple algorithm for this problem that has interesting, but not optimal, quality guarantees. The idea behind the algorithm is originally due to Flajolet and Martin [FM85], and we give a slightly modified presentation, due to Alon, Matias and Szegedy [AMS99].
2.2 The Algorithm
For an integer p > 0, let zeros(p) denote the number of zeros that the binary representation of p ends with. Formally,

    zeros(p) = max{i : 2^i divides p} .

We use the following very simple algorithm.

Initialize :
    Choose a random hash function h : [n] → [n] from a 2-universal family ;
    z ← 0 ;

Process j :
    if zeros(h(j)) > z then z ← zeros(h(j)) ;

Output : 2^{z + 1/2}
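A minimal Python sketch of this single-copy estimator follows; it is only an illustration, and it substitutes a standard affine hash h(x) = (a·x + b) mod P over a large prime field for the unspecified 2-universal family (so the hash range is [P] rather than [n], an assumption made for simplicity).

import random

P = 2**61 - 1                             # a Mersenne prime, assumed larger than n

def make_hash():
    """h(x) = (a*x + b) mod P, a standard 2-universal construction."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: (a * x + b) % P

def zeros(p):
    """Number of trailing zero bits of p (zeros(0) treated as 0)."""
    z = 0
    while p > 0 and p % 2 == 0:
        p //= 2
        z += 1
    return z

def fm_estimate(stream):
    """One copy of the estimator: track the max of zeros(h(j)) and return 2^(z + 1/2)."""
    h = make_hash()
    z = 0
    for j in stream:
        z = max(z, zeros(h(j)))
    return 2 ** (z + 0.5)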
The basic intuition here is that we expect 1 out of the d distinct tokens to hit zeros(h(j)) ≥ log d, and we don't expect any tokens to hit zeros(h(j)) ≫ log d. Thus, the maximum value of zeros(h(j)) over the stream, which is what we maintain in z, should give us a good approximation to log d. We now analyze this.
2.3 The Quality of the Algorithm's Estimate
Formally, for each j ∈ [n] and each integer r ≥ 0, let X_{r,j} be an indicator random variable for the event "zeros(h(j)) ≥ r", and let Y_r = ∑_{j : f_j > 0} X_{r,j}. Let t denote the value of z when the algorithm finishes processing the stream. Clearly,

    Y_r > 0  ⟺  t ≥ r .    (2.1)

We can restate the above fact as follows (this will be useful later):

    Y_r = 0  ⟺  t ≤ r − 1 .    (2.2)
Since h(j) is uniformly distributed over the (log n)-bit strings, we have

    E[X_{r,j}] = Pr[zeros(h(j)) ≥ r] = Pr[2^r divides h(j)] = 1/2^r .

We now estimate the expectation and variance of Y_r as follows. The first step of Eq. (2.3) below uses the pairwise independence of the random variables X_{r,j}, which follows from the 2-universality of the hash family from which h is drawn.

    E[Y_r] = ∑_{j : f_j > 0} E[X_{r,j}] = d/2^r .

    Var[Y_r] = ∑_{j : f_j > 0} Var[X_{r,j}] ≤ ∑_{j : f_j > 0} E[X_{r,j}²] = ∑_{j : f_j > 0} E[X_{r,j}] = d/2^r .    (2.3)
Thus, using Markov's and Chebyshev's inequalities respectively, we have

    Pr[Y_r > 0] = Pr[Y_r ≥ 1] ≤ E[Y_r]/1 = d/2^r , and    (2.4)

    Pr[Y_r = 0] = Pr[ |Y_r − E[Y_r]| ≥ d/2^r ] ≤ Var[Y_r]/(d/2^r)² ≤ 2^r/d .    (2.5)
Let d̂ be the estimate of d that the algorithm outputs. Then d̂ = 2^{t + 1/2}. Let a be the smallest integer such that 2^{a + 1/2} ≥ 3d. Using Eqs. (2.1) and (2.4), we have

    Pr[d̂ ≥ 3d] = Pr[t ≥ a] = Pr[Y_a > 0] ≤ d/2^a ≤ √2/3 .

Similarly, let b be the largest integer such that 2^{b + 1/2} ≤ d/3. Using Eqs. (2.2) and (2.5), we have

    Pr[d̂ ≤ d/3] = Pr[t ≤ b] = Pr[Y_{b+1} = 0] ≤ 2^{b+1}/d ≤ √2/3 .
These guarantees are weak in two ways. Firstly, the estimate d̂ is only of the same order of magnitude as d, and is not an arbitrarily good approximation. Secondly, the failure probabilities above are only bounded by the rather large √2/3 ≈ 47%. Of course, we could make the probabilities smaller by replacing the constant "3" above with a larger constant. But a better idea, that does not further degrade the quality of the estimate d̂, is to use a standard "median trick" which will come up again and again.
2.4 The Median Trick
Imagine running k copies of this algorithm in parallel, using mutually independent random hash functions, and outputting the median of the k answers. If this median exceeds 3d, then at least k/2 of the individual answers must exceed 3d, whereas we only expect k√2/3 of them to exceed 3d. By a standard Chernoff bound, this event has probability 2^{−Ω(k)}. Similarly, the probability that the median is below d/3 is also 2^{−Ω(k)}.

Choosing k = Θ(log(1/δ)), we can make the sum of these two probabilities work out to at most δ. This gives us an (O(1), δ)-approximation to d. Later, we shall give a different algorithm that will provide an (ε, δ)-approximation with arbitrarily small ε.

The original algorithm requires O(log n) bits to store (and compute) a suitable hash function, and O(log log n) more bits to store z. Therefore, the space used by this final algorithm is O(log(1/δ) · log n). When we reattack this problem with a new algorithm, we will also improve this space bound.
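Since the median trick is completely generic, a small Python sketch of it is given below; the make_estimator() interface (an object with process() and result() methods) is a hypothetical one chosen only for this illustration.

import statistics

def median_trick(make_estimator, stream, k):
    """Run k independently seeded copies of a basic estimator over the same
    stream and return the median of their answers; choosing k = Theta(log(1/delta))
    drives the failure probability below delta."""
    copies = [make_estimator() for _ in range(k)]
    for token in stream:
        for est in copies:
            est.process(token)
    return statistics.median(est.result() for est in copies)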
Lecture 3
Finding Frequent Items via Sketching
Scribe: Radhika Bhasin
3.1 The Problem
We return to the FREQUENCY-ESTIMATION problem that we saw in Lecture 1. We give two randomized one-pass
algorithms for this problem that work in the more general turnstile model. Thus, we have an input stream σ = ⟨a_1, a_2, . . .⟩, where a_i = (j_i, c_i) ∈ [n] × {−L, . . . , L}. As usual, this implicitly defines a frequency vector f = (f_1, . . . , f_n) where

    f_j = ∑_{i : j_i = j} c_i .

We want to maintain a sketch of the stream, which is a data structure based on which we can, at any time, quickly compute an estimate f̂_j of f_j, for any given j ∈ [n].
We shall give two different solutions to this problem, with incomparable quality guarantees. Both solutions maintain certain counters whose size is bounded not by the length of the stream, but by the maximum possible frequency. Thus, we let m denote the maximum possible value attained by any |f_j| as we process the stream. Whenever we refer to counters in our algorithms, they are assumed to be O(log m) bits long.
3.2 The Count-Min Sketch Algorithm
This sketch was introduced by Cormode and Muthukrishnan [CM05]. A Count-Min (CM) sketch with parameters (ε, δ) provides guarantees akin to an (ε, δ)-approximation, but not quite! The parameter ε is used in a different way, though δ still represents the probability of error. Read on.

The CM sketch consists of a two-dimensional t × k array of counters, plus t independent hash functions, each chosen uniformly at random from a 2-universal family. Each time we read a token, resulting in an update of f_j, we update certain counters in the array based on the hash values of j (the "count" portion of the sketch). When we wish to compute an estimate f̂_j, we report the minimum (the "min" portion) of these counters. The values of t and k are set, based on ε and δ, as shown below.
Initialize :
    t ← log(1/δ) ;
    k ← 2/ε ;
    C[1 . . . t][1 . . . k] ← 0 ;
    Pick t independent hash functions h_1, h_2, . . . , h_t : [n] → [k], each from a 2-universal family ;

Process (j, c) :
    for i = 1 to t do
        C[i][h_i(j)] ← C[i][h_i(j)] + c ;

Output : On query a, report f̂_a = min_{1 ≤ i ≤ t} C[i][h_i(a)]
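The sketch below is a small Python rendering of the same data structure; it is only illustrative, and it again substitutes affine hashing over a large prime field for the generic 2-universal family (the base-2 logarithm for t is an assumption of this sketch).

import math, random

P = 2**61 - 1

class CountMinSketch:
    """t = ceil(log2(1/delta)) rows, k = ceil(2/eps) columns."""
    def __init__(self, eps, delta):
        self.t = max(1, math.ceil(math.log2(1.0 / delta)))
        self.k = max(1, math.ceil(2.0 / eps))
        self.C = [[0] * self.k for _ in range(self.t)]
        self.h = [(random.randrange(1, P), random.randrange(0, P)) for _ in range(self.t)]

    def _bucket(self, i, x):
        a, b = self.h[i]
        return ((a * x + b) % P) % self.k

    def process(self, j, c=1):
        for i in range(self.t):
            self.C[i][self._bucket(i, j)] += c

    def estimate(self, a):
        return min(self.C[i][self._bucket(i, a)] for i in range(self.t))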
3.2.1 The Quality of the Algorithm's Estimate
We focus on the case when each token (j, c) in the stream satisfies c > 0, i.e., the cash register model. Clearly, in this case, every counter C[i][h_i(a)], corresponding to a token a, is an overestimate of f_a. Thus, we always have

    f_a ≤ f̂_a .
For a fixed a, we now analyze the excess in one such counter, say in C[i][h_i(a)]. Let the random variable X_i denote this excess. For j ∈ [n] \ {a}, let Y_{i,j} denote the contribution of j to this excess. Notice that j makes a contribution iff it collides with a under the i-th hash function, i.e., iff h_i(j) = h_i(a). And when it does contribute, it causes f_j to be added to this counter. Thus,

    Y_{i,j} = f_j , if h_i(j) = h_i(a), which happens with probability 1/k ,
            = 0 , otherwise.

Notice that the probability above is 1/k because h_i is 2-universal. Anyway, we now see that E[Y_{i,j}] = f_j/k. By linearity of expectation,

    E[X_i] = E[ ∑_{j=1, j≠a}^{n} Y_{i,j} ] = ∑_{j=1, j≠a}^{n} E[Y_{i,j}] = ∑_{j=1, j≠a}^{n} f_j/k = (‖f‖_1 − f_a)/k .

Using our choice of k, we have

    E[X_i] = (ε/2)(‖f‖_1 − f_a) ≤ (ε/2)‖f‖_1 .

So, by Markov's inequality,

    Pr[X_i ≥ ε‖f‖_1] ≤ 1/2 .
The above probability is for one counter. We have t such counters, mutually independent. The excess in the output, f̂_a − f_a, is the minimum of the excesses X_i, over all i ∈ [t]. Thus,

    Pr[f̂_a − f_a ≥ ε‖f‖_1] = Pr[ min{X_1, . . . , X_t} ≥ ε‖f‖_1 ]
                            = Pr[ for all i ∈ [t] : X_i ≥ ε‖f‖_1 ]
                            = ∏_{i=1}^{t} Pr[X_i ≥ ε‖f‖_1]
                            ≤ 1/2^t .

And using our choice of t, this probability is at most δ. Thus, we have shown that, with high probability,

    f_a ≤ f̂_a ≤ f_a + ε‖f‖_1 ,

where the left inequality always holds, and the right inequality fails with probability at most δ.
3.2.2 Space Usage
The Count-Min algorithm needs to store (1) the t × k array of counters, and (2) representations of the t hash functions. The array needs space O(kt log m) = O((1/ε) log(1/δ) log m). Assuming, w.l.o.g., that n is a power of 2, we can use a finite-field-based hash function (as in Homework 1), which will require O(log n) space for each of the t functions. Thus, the total space usage is at most

    O( log(1/δ) · ( (1/ε) log m + log n ) ) .
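To make the hash-function bookkeeping concrete, here is one standard 2-universal construction in Python (an affine map over a prime field, as used in the sketches above); it is offered only as an illustration and is not necessarily the finite-field construction from Homework 1.

import random

P = 2**61 - 1          # a Mersenne prime; we assume the universe size n < P

def make_two_universal(k):
    """Return h(x) = ((a*x + b) mod P) mod k.  Only the pair (a, b) needs to be
    stored, which is where the O(log n) bits per hash function come from."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % k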
3.3 The Count Sketch Algorithm
The Count Sketch algorithm allows us to estimate the frequencies of all items in the stream. The algorithm uses two families of hash functions, along with a two-dimensional t × k array of counters. The first hash function h_i maps input items onto [k], and the secondary hash function g_i maps input items onto {−1, +1}. Each input item j causes g_i(j) to be added to the entry C[i][h_i(j)] in row i, for 1 ≤ i ≤ t. For any row i, g_i(j) · C[i][h_i(j)] is an unbiased estimator for f_j. The estimate f̂_j is the median of these estimates over the t rows. The values of t and k are set, based on ε and δ, as shown below.
Initialize :
    t ← log(1/δ) ;
    k ← 3/ε² ;
    C[1 . . . t][1 . . . k] ← 0 ;
    Pick t independent hash functions h_1, h_2, . . . , h_t : [n] → [k], each from a 2-universal family ;
    Pick t independent secondary hash functions g_1, g_2, . . . , g_t : [n] → {−1, +1}, each from a 2-universal family ;

Process (j, c) :
    for i = 1 to t do
        C[i][h_i(j)] ← C[i][h_i(j)] + c · g_i(j) ;

Output : On query a, report f̂_a = median_{1 ≤ i ≤ t} g_i(a) · C[i][h_i(a)]
The value g_i(j) is either +1 or −1, and the choice is made at random. For any token a, g_i(a) will either always add one or always subtract one; it is constant among multiple occurrences of the token. If another token b collides with token a, the decision for g_i(b) is made independently of g_i(a).

Case 1: No collision of token a with other tokens: C[i][h_i(a)] = f_a · g_i(a).

Case 2: In case of collisions, other tokens will increment or decrement the counter with equal probability: C[i][h_i(a)] = f_a · g_i(a) + (error due to collisions).

Without loss of generality, assuming a = 1 and i = 1, a good estimate for C[1][h_1(1)] should be equal to f_1 · g_1(1).
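Before the analysis, a small Python rendering of the Count Sketch is given for concreteness; the sign hash g_i is derived from the parity of an affine hash, an illustrative stand-in for a proper 2-universal family into {−1, +1}.

import math, random, statistics

P = 2**61 - 1

class CountSketch:
    """t = ceil(log2(1/delta)) rows, k = ceil(3/eps^2) columns."""
    def __init__(self, eps, delta):
        self.t = max(1, math.ceil(math.log2(1.0 / delta)))
        self.k = max(1, math.ceil(3.0 / eps**2))
        self.C = [[0] * self.k for _ in range(self.t)]
        self.h = [(random.randrange(1, P), random.randrange(0, P)) for _ in range(self.t)]
        self.g = [(random.randrange(1, P), random.randrange(0, P)) for _ in range(self.t)]

    def _bucket(self, i, x):
        a, b = self.h[i]
        return ((a * x + b) % P) % self.k

    def _sign(self, i, x):
        a, b = self.g[i]
        return 1 if ((a * x + b) % P) % 2 == 0 else -1

    def process(self, j, c=1):
        for i in range(self.t):
            self.C[i][self._bucket(i, j)] += c * self._sign(i, j)

    def estimate(self, a):
        return statistics.median(self._sign(i, a) * self.C[i][self._bucket(i, a)]
                                 for i in range(self.t))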
3.4 Analysis of the Algorithm
For a fixed token a, we now analyze the excess in one such counter, say in C[i][h_i(a)]. Let the random variable X denote this excess. For j ∈ [n] \ {a}, let Y_j denote the contribution of j to this excess. Notice that j makes a contribution iff it collides with a under the i-th hash function, i.e., iff h_i(j) = h_i(a). And when it does contribute, it causes ±f_j to be added to this counter. Thus,

    Y_j = +f_j , which happens with probability 1/2k ,
        = −f_j , which happens with probability 1/2k ,
        =  0  , which happens with probability 1 − 1/k .

Notice that the probabilities above are 1/2k because h_i and g_i are both 2-universal. Consequently, E[Y_j] = 0.
Writing X as the sum of pairwise independent random variables (taking a = 1 as above),

    X = Y_2 + Y_3 + Y_4 + · · · + Y_n .

The random variables Y_2, Y_3, . . . , Y_n are pairwise independent, since the hash functions chosen are 2-universal. Therefore, by pairwise independence, we have

    Var[X] = Var[Y_2] + Var[Y_3] + · · · + Var[Y_n] .
The variance of a random variable Y_j is given by

    Var[Y_j] = E[Y_j²] − E[Y_j]² = E[Y_j²] − 0 = f_j² · (1/k) .

Therefore,

    Var[X] = (f_2² + · · · + f_n²)/k .

Writing this in norm form, we have

    Var[X] = (‖f‖_2² − f_1²)/k .

Defining K = √(‖f‖_2² − f_1²), we get

    Var[X] = K²/k ,

so the standard deviation is σ = K/√k.
Using Chebyshev's inequality,

    Pr[|X| ≥ εK] = Pr[ |X| ≥ (ε√k) · (K/√k) ] ≤ 1/(ε²k) ≤ 1/3 ,

where we applied Chebyshev in the form Pr[|X| ≥ λσ] ≤ 1/λ² with λ = ε√k and σ = K/√k, and used our choice k = 3/ε².

Thus, the final guarantee for the Count Sketch is

    |f̂_j − f_j| ≤ ε‖f‖_2 ,

where the inequality fails with probability at most δ, after amplifying the probability of success by taking the median of t = log(1/δ) independent instances (using a Chernoff bound).
3.4.1 Space Usage
The Count Sketch algorithm needs to store (1) the t × k array of counters, and (2) representations of the hash functions h_1, . . . , h_t and g_1, . . . , g_t. The array needs space O(kt log m) = O((1/ε²) log(1/δ) log m).
3.4.2 Comparison of Count Min and Count Sketch Algorithm
Guarantee

Count-Min Sketch: the guarantee is in norm 1:

    |f̂_a − f_a| ≤ ε · K_cms ≤ ε‖f‖_1 , where K_cms = ‖f‖_1 − |f_a| .

Count Sketch: the guarantee is in norm 2:

    |f̂_a − f_a| ≤ ε · K_cs ≤ ε‖f‖_2 , where K_cs = √(‖f‖_2² − f_a²) .

In general, ‖f‖_2 ≤ ‖f‖_1, so the Count Sketch gives a better guarantee. It is a much better guarantee if the frequencies are evenly distributed.

Space Required

Space required is measured by the number of counters, as the size of the counters is the same.

Count-Min Sketch: space = O((1/ε) log(1/δ)) counters.

Count Sketch: space = O((1/ε²) log(1/δ)) counters.

The Count Sketch algorithm uses more space but gives a better guarantee than the Count-Min Sketch.
Lecture 4
Estimating Distinct Items Using Hashing
Scribe: Andrew Cherne
4.1 The Problem
Again we are in the vanilla streaming model. We have a stream σ = ⟨a_1, a_2, a_3, . . . , a_n⟩, with each a_i ∈ [m], and this implicitly defines a frequency vector f = (f_1, . . . , f_m). (Note that, in this lecture, the roles of m and n are swapped relative to Lecture 0: here the universe is [m].) Let d = |{j : f_j > 0}| be the number of distinct elements that appear in σ. In the DISTINCT-ELEMENTS problem, our task is to output an (ε, δ)-approximation (as in Definition 0.2.1) to d. When d is calculated exactly, its value is equal to the zeroth frequency moment, denoted F_0. This is an important problem with many real world examples. Estimating the number of distinct elements has applications in databases, network statistics and Internet analysis.
4.2 The BJKST Algorithm
In this section we present the algorithm dubbed BJKST, after the names of the authors: Bar-Yossef, Jayram, Kumar, Sivakumar and Trevisan [BJK+04]. The original paper in which this algorithm is presented actually gives 3 algorithms, the third of which we are presenting.
4.3 Analysis of the Algorithm
4.3.1 Space Complexity
Since there are 3 distinct components to the structure of this algorithm, the total space required is the sum of these. First is the hash function h, which requires O(log m) space. The secondary hash function, g, requires O(log m + log(1/ε)) space. Finally we need the space for the buffer, which, using this clever implementation with the secondary hash function g, stores at most O(1/ε²) elements, each requiring O(log(1/ε) + log log m) space. Therefore the overall space complexity of the algorithm is O(log m + (1/ε²)(log(1/ε) + log log m)).
Initialize : choose a 2-universal hash function h : [m] → [m] ;
    let h_t = last t bits of h (0 ≤ t ≤ log m) ;
    t ← 0 ;
    B ← ∅ ;
    choose a secondary 2-universal hash function g : [m] → [O((log m)/ε²)] ;

Process j :
    if h_t(j) = 0^t then
        B ← B ∪ {g(j)} ;
        if |B| > c/ε² then
            t ← t + 1 ;
            rehash B ;

Output : |B| · 2^t ;
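The Python sketch below is one way to flesh this out; to make the "rehash B" step concrete it stores, for each buffered item, the pair (g(j), zeros(h(j))) rather than g(j) alone, and it uses affine hashing over a prime field as an illustrative 2-universal family. The constants and the range of g are assumptions chosen only for this sketch.

import math, random

P = 2**61 - 1

def _affine():
    return random.randrange(1, P), random.randrange(0, P)

def _zeros(v):
    z = 0
    while v > 0 and v % 2 == 0:
        v //= 2
        z += 1
    return z

class BJKST:
    def __init__(self, eps, c=576):
        self.cap = math.ceil(c / eps**2)             # buffer threshold c / eps^2
        self.t = 0
        self.B = {}                                  # g(j) -> largest zeros(h(j)) seen
        self.ha, self.hb = _affine()                 # primary hash h
        self.ga, self.gb = _affine()                 # secondary hash g
        self.grange = 4 * self.cap * 61              # stand-in for O(log m / eps^2)

    def process(self, j):
        z = _zeros((self.ha * j + self.hb) % P)
        if z >= self.t:                              # "last t bits of h(j) are zero"
            gj = ((self.ga * j + self.gb) % P) % self.grange
            self.B[gj] = max(z, self.B.get(gj, 0))
            while len(self.B) > self.cap:            # buffer overflow: raise t, rehash
                self.t += 1
                self.B = {g: zz for g, zz in self.B.items() if zz >= self.t}

    def estimate(self):
        return len(self.B) * 2 ** self.t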
4.3.2 Analysis: How Well Does The BJKST Algorithm Do?
Let t* be the final value of t after processing the stream. Then we declare

    FAILURE  ⟺  | |B| · 2^{t*} − d | > εd .

That is, if the deviation exceeds εd (where d is the number of distinct elements), then it is a failure event (this is just relative error).

Let X_{t,j} be an indicator random variable such that

    X_{t,j} = 1 if h_t(j) = 0^t (which happens with probability 1/2^t) ;
            = 0 otherwise,

and let X_t = ∑_{j : f_j > 0} X_{t,j}, a sum of d pairwise independent, identically distributed indicator random variables. That is, X_t is the number of elements that would be in the buffer using h_t.
    E[X_t] = d/2^t ,

    Var[X_t] = ∑_j Var[X_{t,j}] ≤ ∑_j E[X_{t,j}] = E[X_t] = d/2^t ,

so that

    Var[X_t] ≤ E[X_t] = d/2^t .
In terms of these variables (identifying |B| with X_{t*}), FAILURE ⟺ |X_{t*} · 2^{t*} − d| > εd. Continuing from above, dividing both sides by 2^{t*}, and taking the union over all possible values of t,

    FAILURE  ⟺  ⋃_{t=0}^{log m} ( |X_t − d/2^t| > εd/2^t  and  t* = t ) .

If t* = t, then the event above is just |X_{t*} · 2^{t*} − d| > εd (a union over all possible values of t).

For small values of t, the event |X_t − d/2^t| > εd/2^t is unlikely. Similarly, for large values of t, the event t* = t is unlikely.

Let s be an integer such that

    12/ε² ≤ d/2^s ≤ 24/ε² ,

where any level t ≥ s is considered "large" and any level t ≤ s − 1 is considered "small".
We start the analysis from t = 1, since at t = 0 there can be no failure event:

    Pr[FAILURE] ≤ ∑_{t=1}^{s−1} Pr( |X_t − d/2^t| > εd/2^t ) + ∑_{t=s}^{log m} Pr( t* = t )

                = ∑_{t=1}^{s−1} Pr( |X_t − E[X_t]| > εd/2^t ) + Pr( t* ≥ s ) ,

now in a form where Chebyshev applies to the first sum. For the second term,

    Pr[t* ≥ s] = Pr[ X_{s−1} > c/ε² ] ≤ E[X_{s−1}] · ε²/c    (by Markov's inequality).

Applying Chebyshev to each term of the first sum and the Markov bound above to the second term,

    Pr[FAILURE] ≤ ∑_{t=1}^{s−1} 2^t/(ε² d) + (d ε²)/(2^{s−1} c)
                ≤ 2^s/(ε² d) + (d ε²)/(2^{s−1} c)
                ≤ 1/12 + 48/c = 1/6 , with c = 576 ,

using 12/ε² ≤ d/2^s ≤ 24/ε² in the last step.
4.4 Future Work
In their paper, the authors note that this problem has been shown to have lower bounds of Ω(log m) and Ω(1/ε) (via a reduction from the INDEX problem). Interesting future work would be to devise an algorithm that uses polylog space, or to prove a lower bound of Ω(1/ε²).
Lecture 5
Estimating Frequency Moments
Scribe: Robin Chhetri
5.1 Background
We are in the vanilla streaming model. We have a stream σ = ⟨a_1, . . . , a_m⟩, with each a_i ∈ [n], and this implicitly defines a frequency vector f = (f_1, . . . , f_n). Note that f_1 + · · · + f_n = m.
We have already seen the Count-Min Sketch algorithm with parameters (ε, δ); recall that Pr[answer not within relative error ε] ≤ δ. The Count-Min Sketch gives

    f̂_i = f_i ± ε‖f‖_1

with high probability, using Õ(1/ε) space, where Õ(·) means that we ignore factors polynomial in log m, log n, log(1/δ), etc.

The Count Sketch algorithm gives

    f̂_i = f_i ± ε‖f‖_2

with high probability, and the space requirement is Õ(1/ε²). The question here is how to estimate ‖f‖_2.

Please note the Õ notation. Typically, the space is of the form O(f(ε) · log(1/δ)), but it is more convenient to express it in the form of Õ(·). Therefore, we will be using the Õ notation quite a bit.
The k-th frequency moment F_k is defined by

    F_k = ∑_{j=1}^{n} |f_j|^k .

Alon, Matias, and Szegedy (AMS '96) gave an (ε, δ)-approximation for computing this moment. This is what we will cover in this lecture.

First some background. The first frequency moment ‖f‖_1 is

    ‖f‖_1 = |f_1| + |f_2| + · · · + |f_n| .

If only positive updates are taken, that is, there are no deletions but only insertions, then

    ‖f‖_1 = f_1 + f_2 + · · · + f_n = m .

Clearly, m is trivial to compute.
In fact, maintaining a running sum also works as long as the frequencies stay nonnegative, i.e., in the strict turnstile model.

How do we find ‖f‖_2? Strictly speaking, we know that the square root of the sum of the squares of the frequencies is the second frequency norm, that is,

    ‖f‖_2 = √(f_1² + f_2² + · · · + f_n²) = √F_2 .

But unfortunately, we can't use this directly here. This is because we would need to find the individual frequencies f_1, f_2, and so on, and we don't have space for that computation.
Therefore, we give a general algorithm. The k-th frequency moment of the stream is

    F_k = F_k(σ) = f_1^k + f_2^k + · · · + f_n^k .

We assume that we are only taking positive updates, that is, f_j = |f_j|.

The zeroth moment of the set of frequencies,

    F_0 = f_1^0 + f_2^0 + · · · + f_n^0 ,

equals the number of distinct elements (DISTINCT-ELEMENTS); note that we adopt the convention 0^0 = 0 here, as some of the frequencies could be zero.

Similarly, the first moment is simply the sum of the frequencies, i.e., m, which takes O(log m) space:

    F_1 = f_1 + f_2 + · · · + f_n = m .

And the second moment is

    F_2 = f_1² + f_2² + · · · + f_n² ,

which takes Õ( (1/ε²)(log log n + log n) ) space.
5.2 AMS Estimator for F_k

We pick a uniformly random token from the stream. We count the number of times, r, that our token appears in the remainder of the stream (including the chosen occurrence), and we output m(r^k − (r − 1)^k). Please note that we don't know the length of the stream before we choose the token. Eventually, we will run many copies of this basic estimator to get our final estimate with better guarantees. The space for the algorithm is O(log m + log n).
Initialize : m ← 0 ; r ← 0 ; token ← ⊥ ;

Process j :
    m ← m + 1 ;
    with probability 1/m do:
        token ← j ;
        r ← 0 ;
    if j = token then
        r ← r + 1 ;

Output : m (r^k − (r − 1)^k) ;
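In Python, one copy of this basic estimator can be written as below (a straightforward transcription of the pseudocode, using reservoir sampling to pick the token; the function name is ours).

import random

def ams_fk_basic(stream, k):
    """One copy of the AMS basic estimator: outputs m * (r^k - (r-1)^k)."""
    m, r, token = 0, 0, None
    for j in stream:
        m += 1
        if random.random() < 1.0 / m:    # keep j as the sampled token with prob. 1/m
            token = j
            r = 0
        if j == token:                   # count occurrences from the sampled position on
            r += 1
    return m * (r**k - (r - 1)**k)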
5.3 Analysis of the Algorithm
How do we compute the expectation? Let N be the position of the token picked by the algorithm. Then N ∈_R [m], where ∈_R means that the position is uniformly random.

We can think of this choice as a two-step process:

1. Pick a random token value j ∈ [n], with Pr[picking j] = f_j/m.

2. Pick one of the f_j occurrences of j uniformly at random.

Therefore,

    E[m(R^k − (R − 1)^k)] = m · E[R^k − (R − 1)^k]
        = m ∑_{j=1}^{n} (f_j/m) ∑_{i=1}^{f_j} (1/f_j) ((f_j − i + 1)^k − (f_j − i)^k)
        = m ∑_{j=1}^{n} (f_j/m) ∑_{i=1}^{f_j} (1/f_j) (i^k − (i − 1)^k)
        = ∑_{j=1}^{n} ∑_{i=1}^{f_j} (i^k − (i − 1)^k)
        = ∑_{j=1}^{n} f_j^k = F_k ,

which is the k-th frequency moment. Therefore, the estimator has the right expectation. Even if the error of a single run is large, we can bring it down in subsequent runs.
algorithm, we can bring them down in subsequent runs.
Now, to compute the variance
Var[X] = E[X
2
] E[X]
2
E[X
2
]
21
LECTURE 5. ESTIMATING FREQUENCY MOMENTS
CS 85, Fall 2009, Dartmouth College
Data Stream Algorithms
We pick the token at random and use the occurrence,
Var[X] = mE
2
[(R
k
(R 1)
k
)
2
]
= m
2
n

j =1
f
j
m
f
j

i =1
1
f
j
(i
k
( j 1)
k
)
2
= m
n

j =1
f
j

i =1
(i
k
(i 1)
k
)
2
We are going to crudely estimate this using upper bounds, by using: (x
k
(x 1)
k
)
2
kx
k1
(x
k
(x 1)
k
)
from mean value theorem
Get:
Var[X] m
n

j =1
f
j

i =1
ki
k1
(i
k
(i 1)
k
) m
n

j =1
k f
k1
j
f
j

i =1
(i
k
(i 1)
k
)
= m
n

j =1
k f
k1
j
. f
k
j
= mk
n

j =1
f
2k1
j
i.e,
Var[X] kF
1
F
2k1
kn
1
1
k
F
2
k
Using
F
1
F
2k1
n
11,k
F
2
k
Here, the expectation is right, and we have a bound on the variance. We now average many independent copies to bring down the error.

Let X_1, X_2, . . . , X_T be independent and identically distributed copies of the basic estimator, with E[X_i] = μ and Var[X_i] ≤ V, and let Y = (1/T) ∑_{i=1}^{T} X_i, so that E[Y] = μ and Var[Y] ≤ V/T. Chebyshev's inequality gives

    Pr[ |Y − E[Y]| ≥ ε E[Y] ] ≤ Var[Y] / (ε² E[Y]²) ≤ V / (T ε² μ²) .

We want an (ε, 1/3)-guarantee, so we pick T to make this bound 1/3. Thus the choice

    T = 3V / (ε² μ²) ,

remembered as 3·Var[X_i] / (ε² E[X_i]²), suffices.

The basic estimator X has E[X] = F_k and Var[X] ≤ k n^{1 − 1/k} F_k². Therefore, for failure probability 1/3, the number of repetitions required is

    T = (3/ε²) · (k n^{1 − 1/k} F_k²) / F_k² = O( (1/ε²) k n^{1 − 1/k} ) .
For F_2, the space is

    O( (1/ε²) n^{1/2} (log m + log n) ) ,

which is not great, but is pretty good. In the next lecture, we will see something specific for F_2 that will bring it down to logarithmic space.
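The averaging step just described, together with the median trick from Lecture 2 layered on top to push the failure probability from 1/3 down to δ, can be packaged generically. The Python sketch below does exactly that; the make_basic() interface (objects with process() and result() methods) and the constant in the number of groups are assumptions chosen only for this illustration.

import math, statistics

def median_of_means(make_basic, stream, eps, delta, var_over_mean_sq):
    """Average T = 3 * (Var/E^2) / eps^2 copies of an unbiased basic estimator,
    then take the median over O(log(1/delta)) such averages."""
    T = max(1, math.ceil(3.0 * var_over_mean_sq / eps**2))
    groups = max(1, math.ceil(18 * math.log(1.0 / delta)))   # constant chosen loosely
    copies = [[make_basic() for _ in range(T)] for _ in range(groups)]
    for token in stream:
        for grp in copies:
            for est in grp:
                est.process(token)
    means = [sum(e.result() for e in grp) / T for grp in copies]
    return statistics.median(means)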
Lecture 6
Why does the amazing AMS algorithm work?
Scribe: Joe Cooley
Normal distribution: a probability distribution on the reals whose shape looks like the bell-shaped plot drawn in class.

Gaussian distribution: a high-dimensional analogue of a 1-dimensional normal distribution. More precisely, it is a vector (z_1, z_2, . . . , z_n) ∈ R^n where each z_i is drawn from a normal distribution. When drawing a random number from a normal distribution, it will most likely be close to 0; the probability decreases exponentially as the numbers become large.

The following equation describes (up to normalization) a Gaussian distribution:

    y = exp(−x²) ,

or more generally,

    y = exp(−(x − μ)²) .
One can take linear combinations of Gaussian random variables and still get Gaussians: for example, z, w ↦ Az + Bw. One can also form mixtures:

    with probability p, pick z_1 ~ Gaussian_1 ; with probability 1 − p, pick z_2 ~ Gaussian_2 .
Define F_2 = ∑_{j=1}^{n} f_j².

Initialize :
    Choose a random hash function h : [n] → {−1, +1} from a 4-universal family ;
    x ← 0 ;

Process j :
    x ← x + h(j) ;

Output : x²
This works because linear combinations of Gaussians give Gaussians. Gaussian distributions behave nicely under a certain set of conditions.
Consider the N(0, 1) distribution. The "0" refers to centering the distribution at μ = 0, and the "1" refers to the thickness of the curve, i.e., the standard deviation σ of the resulting distribution is 1.

Recall that in the discrete scenario, each outcome in the set of outcomes has a weight, and the weights sum to 1. What is a probability distribution on the reals, i.e., what is the probability of getting a number in the neighborhood of some x? A probability distribution on R can be specified in two ways.

1. A probability density function (pdf) f : R → [0, ∞) s.t. ∫_{−∞}^{∞} f(x) dx = 1, i.e., assign a weight to every real number such that the total integrates to 1.

2. A cumulative distribution function (cdf) F : R → [0, 1], F(t) = Pr[X ≤ t] = ∫_{−∞}^{t} f(x) dx, so that f(x) = F′(x). Note that X is drawn from the distribution, and F(t) is the probability that it is at most t.
Back to the N(0, 1) distribution. Its pdf is (1/√(2π)) e^{−x²/2}, and it has the following property. Let X_1, X_2, . . . , X_n ~ N(0, 1) be i.i.d., and let a_1, a_2, . . . , a_n ∈ R.
Theorem 6.0.1. a_1 X_1 + a_2 X_2 + · · · + a_n X_n ~ √(a_1² + a_2² + · · · + a_n²) · X, where X ~ N(0, 1).

(The proof involves a fair amount of calculus.)
For the intuition, think of (a_1, . . . , a_n) as a vector in n dimensions; its length (the L_2 norm) in the Euclidean sense is √(a_1² + a_2² + · · · + a_n²). Adding normal random variables scaled like this behaves like taking the length of orthogonal vectors in high dimensions.

Recall that

    ‖z‖_p = ( ∑_{i=1}^{n} |z_i|^p )^{1/p} , 0 < p ≤ 2 .
Definition 6.0.2. A distribution D on R is p-stable if, whenever X_1, X_2, . . . , X_n are i.i.d. (distributed according to D) and a = (a_1, . . . , a_n) ∈ R^n, we have a_1 X_1 + · · · + a_n X_n ~ ‖a‖_p · X, where X ~ D.

Theorem 6.0.3. N(0, 1) is 2-stable.

Theorem 6.0.4. For all p ∈ (0, 2], a p-stable distribution exists (and the proof is constructive).
How do we generate such reals? Given some precision b, we can generate real numbers in time linear in b (the number of bits needed); the underlying distribution will only be approximated to within that precision. (We just need a source of fair coin flips.)

Theorem 6.0.5. The Cauchy distribution, with pdf (1/π) · 1/(1 + x²), is 1-stable. (It is a fat-tail distribution.)

Intuitively (in a plot), the density is largest at 0 and the drop-off is slow, i.e., it has a fat tail. It has no well-defined variance (the variance is infinite): integrating from −∞ to ∞ gives ∞. To compare distributions with infinite variances, we find the highest moment that is still finite.
Back to algorithms. . .

Stream problem: estimate F_p = ‖f‖_p^p, or equivalently, estimate ‖f‖_p = L_p.

Algorithm A_1 (we'll handle reducing the memory later):
Initialize :
    Choose n random reals X_1, . . . , X_n from a p-stable distribution ;
    x ← 0 ;

Process j :
    x ← x + X_j ;

Output : |x|

Initialize :
    Choose n random reals X_1, . . . , X_n from a p-stable distribution ;
    x ← 0 ;

Process (j, c) :
    x ← x + c·X_j    // turnstile model ;

Output : |x|
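For p = 1, a small Python rendering of A_1 together with the median-of-k wrapper A_2 (defined below) is given here; it stores the n·k Cauchy variables explicitly (the PRNG trick mentioned at the end of the lecture is what removes this cost), and the constant in k is an arbitrary choice for this sketch.

import math, random, statistics

def cauchy():
    """Standard Cauchy sample (1-stable): tan(pi * (U - 1/2)) for U uniform in (0,1)."""
    return math.tan(math.pi * (random.random() - 0.5))

class L1Sketch:
    def __init__(self, n, eps, delta, c=16):
        self.k = max(1, math.ceil(c / eps**2 * math.log(1.0 / delta)))
        self.X = [[cauchy() for _ in range(n)] for _ in range(self.k)]   # random reals
        self.x = [0.0] * self.k                                          # k copies of A_1

    def process(self, j, c=1):
        # turnstile update (j, c): each copy adds c * X_j
        for i in range(self.k):
            self.x[i] += c * self.X[i][j]

    def estimate(self):
        # by 1-stability each x_i ~ ||f||_1 * Cauchy, and median(|Cauchy|) = 1
        return statistics.median(abs(v) for v in self.x)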
Assume each frequency is bounded by some constant, which bounds the counter size: storing x then requires O(log m) bits, provided we guarantee each |X_i| ≤ O(1).

Problems with real numbers: the precision of the X_i is not bounded, and their range is not bounded. We can simply truncate large values, because the probability of getting a very large value is small (see the plot); we just add this to the error.

Algorithm operation: when a token (j, c) arrives, the algorithm accesses the X_j portion of the random string.
Why does this work? Consider p = 1. Let X be the final value of x in the algorithm; then by 1-stability, X ~ ‖f‖_1 · Z, where Z ~ Cauchy, since

    X = f_1 X_1 + · · · + f_n X_n .

So in the 1-stable case we are using the Cauchy distribution scaled by the L_1 norm.

Why do we output |x| instead of x? Because the variance isn't defined, we can't use an expectation-based estimator (as in all the other solutions we've seen except for Misra-Gries). Instead, we are going to take the median of values and argue that this is reasonable. To do this, we need the median of a random variable. What does this mean? How do we take the median of a random variable?
Consider the cdf of a Cauchy distribution. What does "median" mean in this context? Intuitively, when taking a median, we arrange values on a line and take the middle value. When we have a continuous distribution, we'd like to choose the median so that half the mass is on one side and half on the other.

Definition 6.0.6. For a random variable R ~ D, the median of R (or of D) is a value t such that Pr[R ≤ t] = 1/2, i.e., t such that ∫_{−∞}^{t} φ(x) dx = 1/2 where φ is the pdf of D, i.e., t such that Φ(t) = 1/2, i.e., t = Φ^{−1}(1/2).
How do we know that a unique t exists to satisfy the median property? Certain assumptions on the derivative of the cdf must hold (we won't discuss those here).

Algorithm A_2 (Piotr Indyk, MIT): run k = (c/ε²) log(1/δ) copies of A_1 in parallel, and output the median of the answers.
Theorem 6.0.7. This gives an (ε, δ)-approximation to L_1 = ‖f‖_1 = F_1.

Consider that we have 2 streams, each defining a histogram. We'd like to know, for example, the L_1 distance between the two vectors. If we'd like to know the L_2 distance, we use the analogous algorithm where x² is output.
Proof. Essentially, use a good old Chernoff bound, but first suppose X ~ Cauchy (X is distributed according to the Cauchy distribution).

Claim: median[|X|] = 1 (and as a corollary, median[a·|X|] = |a| by scaling).

Proof. The pdf of |X| is g, where

    g(x) = 0 for x ≤ 0 , and g(x) = (2/π) · 1/(1 + x²) for x > 0 .

The median is the t such that

    1/2 = ∫_{−∞}^{t} g(x) dx = (2/π) ∫_{0}^{t} dx/(1 + x²) = (2/π) arctan(t) ,

so arctan(t) = π/4, i.e., t = tan(π/4) = 1.
What's the intuition behind what we're saying? If we take a single random variable from the |Cauchy| distribution, its median is 1. If we take a bunch of values, their median is even more likely to be close to 1. We can use the Chernoff bound to show this.

Claim: Let X_1, . . . , X_k (from the definition of A_2) ~ |Cauchy|. Then

    Pr[ median(X_1, . . . , X_k) ∉ [1 − O(ε), 1 + O(ε)] ] ≤ δ .

How do we prove this? Imagine plotting the cdf of |Cauchy|: 1 is the median, so the curve passes through 1/2 there. Suppose the median comes out too small, and examine one value:

    Pr[ X_1 ≤ 1 − O(ε) ] ≤ 1/2 − Ω(ε) ,

by continuity and the bounded derivative of the cdf of |Cauchy|: Pr[X ≤ 1] = 1/2, and Pr[X ≤ 0.99] is still close to 1/2 (we are using the bounded-derivative property we haven't discussed). So the expected number of copies falling below the threshold satisfies

    E[ #{i : X_i ≤ 1 − O(ε)} ] ≤ (1/2 − Ω(ε)) k ,

whereas the median can be that small only if at least k/2 of them fall below it. Plugging into the Chernoff bound, we have

    Pr[ median ≤ 1 − O(ε) ] ≤ exp(−ε² k · O(1)) ≤ δ/2 ,

and symmetrically for the median being too large.
What a messy algorithm! How many bits of precision do we need? How do we generate samples from the distribution? Etc.

A taste of later: we can generate a large random string from a small random string using a PRNG.

Based on the more general algorithm A_1, if we chose Gaussian variables, we could compute the L_2 norm. Using yesterday's algorithm (with ±1 values), x behaves like a Gaussian: if Y_1, Y_2, . . . , Y_n ∈_R {−1, +1}, then Y = (Y_1, . . . , Y_n) is close to Z = (Z_1, . . . , Z_n), where Z_i ~ N(0, 1), as n → ∞.

An open question posed by Piotr Indyk: is there some way to choose the values from a simple distribution with a small amount of independence that happens to work, i.e., to replace the large array of reals with {−1, +1} values, so that we don't have to rely on a heavy hammer (a PRNG)?
Lecture 7
Finding the Median
Scribe: Jon Denning
7.1 The Median / Selection Problem
We are working in the vanilla streaming model. Given a sequence of numbers σ = ⟨a_1, a_2, . . . , a_m⟩, a_i ∈ [n], and assuming the a_i's are distinct, output the median.

Suppose that the stream is sorted. Then the median of the stream is the value in the middle position (when m is odd) or the average of the two on either side of the middle (when m is even). More concisely, the median is the average of the values at the ⌊(m+1)/2⌋-th and ⌈(m+1)/2⌉-th positions. (Running two copies of our algorithm finds this mathematical median.)

This leads us to a slight generalization: finding the r-th element in a sorted set.
Definition 7.1.1. Given a set S of elements from a domain D with an imposed order, and an element x ∈ D, the rank of x is the count of elements in the set S that are not greater than x:

    rank(x, S) = |{y ∈ S : y ≤ x}| .

This definition of rank makes sense even if x does not belong to S. Note: we are working with sets (no repeats), so assume all elements are distinct; otherwise, we'd have to consider the rank of an element in a multiset.
There are multiple ways of finding the median. For example, sort the array and then pick the element in the r-th position. However, the running time is not the best, because sorting takes m log m time (not linear).

There is a way to solve this in linear time: partition the array into chunks of 5, find the median of each chunk, and recurse to find the median of those medians. The median of medians will have rank roughly in the middle of the whole array, so we find its rank t; if t > r, recurse on the elements below it (still looking for rank r), and if t ≤ r, recurse on the elements above it, looking for rank r − t, where r is the rank you want and t is the rank of the median of medians.
Unfortunately, there's no way to solve this median problem in one pass as it is stated. There is a theorem that says that, in one pass, selection requires Ω(min{m, n}) space (basically linear space).

Can we do something interesting in two passes? Yes! Like in the Misra-Gries algorithm, in two passes we can get the space usage down to close to Õ(√n) (the tilde hides some polylogarithmic factors). That is much smaller, but not yet logarithmic. So, let's use more passes! In p passes, we can achieve Õ(n^{1/p}) space. Making p large enough can get rid of the dependence on n: with p = O(log n) passes, we're down to Õ(1) space.
We will give an algorithm that uses p passes and O(n^{1/p}) space (up to logarithmic factors). It will be difficult to state in pseudocode, but it is very elegant.

The key concept is the i-sample (for an integer i ≥ 0): from a set of numbers, construct a sample by keeping roughly one out of every 2^i elements. For example, a 1-sample is about half the set, and a 2-sample is about a quarter of the set.

Definition 7.1.2. An i-sample of a population (set of numbers) of size 2^i · s (s is some parameter) is a sorted sequence of length s, defined recursively as follows, where A ∘ B denotes stream A followed by stream B:

    i = 0 :  0-sample(A) = sort(A) ;
    i > 0 :  i-sample(A ∘ B) = sort( evens((i − 1)-sample(A)) ∪ evens((i − 1)-sample(B)) ) .

In other words, when i > 0, construct an (i − 1)-sample of the first half and an (i − 1)-sample of the second half (each a sorted sequence of length s); then combine them by picking every even-positioned element from each and merging in sorted order.

The algorithm says: I want to run within a certain space bound, and that bound determines what sample size s I can afford. Process the stream and then compute the sample; turning a big population into a small sample takes many layers of sampling.

Claim: this sample gives us a good idea of the median. Idea: each pass shrinks the set of candidate medians. A small offline illustration of Definition 7.1.2 is given below.
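The following Python snippet is a direct, offline transcription of Definition 7.1.2 for illustration (the streaming version instead keeps one working sample per level, as described in the next section); it assumes the population length is exactly 2^i · s.

def i_sample(population, i):
    """i-sample of a population of size 2^i * s (Definition 7.1.2)."""
    data = list(population)
    if i == 0:
        return sorted(data)
    half = len(data) // 2
    left = i_sample(data[:half], i - 1)     # (i-1)-sample of the first half
    right = i_sample(data[half:], i - 1)    # (i-1)-sample of the second half
    evens = left[1::2] + right[1::2]        # every even-positioned element (2nd, 4th, ...)
    return sorted(evens)

# e.g. i_sample(range(1, 17), 2) returns a sorted sequence of length 4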
7.2 Munro-Paterson Algorithm (1980)
Given r ∈ [m], find the element of the stream with rank r. The feature of the Munro-Paterson algorithm is that it uses p passes and space O(m^{1/p} log^{2−2/p} m · log n).

We have a notion of filters for each pass. At the start of the h-th pass, we have filters a_h, b_h such that

    a_h ≤ (the rank-r element) ≤ b_h .

Initially, we set a_1 = −∞, b_1 = +∞. The main action of the algorithm: as we read the stream, we compute an i-sample. Let m_h be the number of elements in the stream between the filters a_h and b_h. With a_1 and b_1 as above, m_1 = m; but as a_h and b_h close in, m_h will shrink. During the pass, we will ignore any elements outside the interval [a_h, b_h], except to compute rank(a_h).

Build a t-sample of size s = S/log m, where S is the target space, and t is such that 2^t · s = m_h. It does not matter if this equation is exactly satisfied; we can always pad up.
Lemma 7.2.1. Let x_1 ≤ x_2 ≤ · · · ≤ x_s be the i-sample (of size s) of a population P, |P| = 2^i s. Then, for every j,

    2^i j ≤ rank(x_j, P) ≤ 2^i (i + j) .

To construct this sample in a streaming fashion, we keep a working sample at each level, plus the current working 0-sample that the new tokens go into. When the working samples at some level become full (i − 1)-samples, we combine the two (i − 1)-samples into an i-sample one level up. This uses 2s storage per level. (I.e., new data goes into a 0-sample; when it is full, the combined result goes into a 1-sample, and so on.)

At the end of the pass, what should we see in the sample? The final t-sample summarizes 2^t s elements, so we can locate ranks only at granularity about 2^t. Since we want rank r in the population, we look at rank about r/2^t within the sample.
Let's consider the j-th element of this sample. We have lower and upper bounds on the rank of this element in the population,

    2^i j ≤ rank(x_j, P) ≤ 2^i (i + j) ,

and we write L_{i,j} = 2^i j for the lower bound and U_{i,j} = 2^i (i + j) for the upper bound. (Note that the rank intervals of consecutive sample elements can overlap.)
As an example, suppose A, B, C, D are four consecutive elements of the sample, and picture their rank intervals [lower bound, upper bound] drawn against the target rank r (this was a figure in class): A's upper bound lies below r and D's lower bound lies above r, so A and D will serve as the new filters. We update the filters accordingly; if at any point a_h = b_h, we output a_h as the answer.
Proof. We proceed by induction on i.

When i = 0, everything is trivially correct, because the sample is the whole population: L_{0,j} = U_{0,j} = j.

Now consider an (i + 1)-sample, for i ≥ 0. It is generated by joining two i-samples, one of a "blue" population and one of a "red" population: we take the i-sample of the blue population and pick every other element, and similarly for the red population, and merge. For example, with

    i-sample_blue = b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8 b_9 b_10 ,  i-sample_red = r_1 r_2 r_3 r_4 r_5 r_6 r_7 r_8 r_9 r_10 ,

we might get

    (i + 1)-sample = sort(b_2 b_4 b_6 b_8 b_10 ∪ r_2 r_4 r_6 r_8 r_10) = b_2 b_4 r_2 b_6 r_4 b_8 r_6 b_10 r_8 r_10 .

Suppose, without loss of generality, that the j-th element of the (i + 1)-sample is a blue element (j = 6 in the example), so it is the k-th picked element from the blue i-sample (k = 4 in the example). Being the k-th picked element means it is the 2k-th element of the blue i-sample. Some elements preceding it must also have come from the red sample (j − k picked red elements, i.e., up to the 2(j − k)-th element of the red i-sample), because the merged sequence is sorted. Hence, by the inductive hypothesis applied to each sample over its respective population,

    rank(x_j, P) ≥ L_{i,2k} + L_{i,2(j−k)} .

Combining all these lower bounds (over the possible values of k),

    L_{i+1,j} = min_k ( L_{i,2k} + L_{i,2(j−k)} ) .    (7.1)

Note: the k-th picked element in the (i + 1)-sample is the 2k-th element in the blue i-sample, which is where the lower bound comes from.

Upper bound: let k be as before. Then

    U_{i+1,j} = max_k ( U_{i,2k} + U_{i,2(j−k+1)−1} ) .    (7.2)

We have to go up to the next-but-one red element to get the upper bound, because the red element one position up may be either bigger or smaller than our blue element. (Check that L_{i,j} and U_{i,j} as defined by (7.1) and (7.2) satisfy the lemma.)
Back to the problem: look at the upper bounds in the final sample. At some point we cross the target rank: U_{t,u} ≥ r but U_{t,u−1} < r, and all the elements before the u-th are certainly below the rank-r element. So in the final t-sample constructed, we look for the u-th element, where u is the minimum index such that U_{t,u} ≥ r, i.e., 2^t (t + u) ≥ r, i.e., t + u ≥ r/2^t. Since u is an integer, u = ⌈r/2^t⌉ − t. We set the new lower filter a to the u-th element of the sample for the next pass.

Similarly, we find the v-th element, where v is the maximum index such that L_{t,v} ≤ r, i.e., 2^t v ≤ r, i.e., v = ⌊r/2^t⌋. We set the new upper filter b to the v-th element for the next pass.

Based on the sample at the end of the pass, we thus update the filters to get a_{h+1} and b_{h+1}. Each pass whittles the number of elements under consideration down from m_h by a factor of about S/log² m.
Lemma 7.2.2. If there are m_h elements between the filters at the start of a pass, then there are O(m_h log² m / S) elements between the filters at the end of the pass.

    m_{h+1} = O( m_h log² m / S ) = O( m_h / (S / log² m) )   ⟹   m_h = O( m / (S / log² m)^{h−1} ) .
We are done making passes when the number of candidates is down to S. At this point, we have enough space to store all elements between the filters and can find the element of the desired rank directly.
The number of passes p required is such that, ignoring constants,

m / (S / log^2 m)^{p−1} = Θ(S)
⟹ m log^{2p−2} m = S^p
⟹ m^{1/p} log^{2−2/p} m = S,

where S is the number of elements from the stream that we can store.
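For instance, with p = 2 the last equation reads m^{1/2} log m = S, so two passes suffice whenever we can store about √m · log m stream elements; with more passes the space requirement drops further, roughly as m^{1/p} up to the polylog factor.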
We are trading off the number of passes for the amount of space.
Note: this computes the exact median (or the element of any desired rank). Also, the magnitudes of the elements make no difference: we only compare elements and do not care by how much one is smaller than another, so the algorithm is immune to the magnitudes of the elements.
Lecture 8
Approximate Selection
Scribe: Alina Djamankulova
8.1 Two approaches
The selection problem is to find an element of a given rank r, where the rank of an element is the number of elements less than or equal to it. The algorithm we looked at previously gives the exact answer: an element of exactly the target rank. Now we look at two different algorithms that take only one pass over the data and therefore give only approximate answers. We give only rough details of the algorithms, analysing them at a very high level. The Munro-Paterson algorithm we previously looked at considered this problem:
1. Given a target rank r, return the element of rank r. (Note that we are not looking for an approximation to the value of the rank-r element, which would not make much sense.)
Let us define the notion of relative rank, computed as follows:

relrank(x, S) = rank(x, S) / |S|

The relative rank is a number between 0 and 1. For example, the median is the element of relative rank 1/2. We can generalize this in terms of quantiles: the first quartile is the element of relative rank 1/4.
Given a target relative rank φ, return an element of relative rank in [φ − ε, φ + ε]
(ε: approximation parameter; the φ-quantile problem) [Greenwald-Khanna 01]
2. Random-order streams. Here, instead of allowing a worst-case ordering, we assume the stream order is random. We do not literally rearrange the data; we just suppose that the data does not arrive in the worst possible order. If we can solve the problem within the time and space limits under this assumption, we still expect good results, since the data is very unlikely to arrive in the single worst-case order.
Given a multiset of tokens, serialized into a sequence uniformly at random, we need to compute an element of approximately the given rank. We think of the input as a multiset, leaving aside how it is serialized.
Given r ∈ [m], return an element of rank r with high probability with respect to the randomness in the ordering (say 1 − δ)
(random-order selection) [Guha-McGregor 07]
8.2 Greenwald-Khanna Algorithm
The algorithm has some slight similarities with the Misra-Gries algorithm for the frequent elements problem. Maintain a collection of s tuples (v_1, g_1, Δ_1), ..., (v_s, g_s, Δ_s), where
v_i is a value from the stream seen so far;
g_i = min-rank(v_i) − min-rank(v_{i−1})   // gain: how much larger this rank is than that of the previously stored value; storing g_i makes updates easier and faster;
Δ_i = max-rank(v_i) − min-rank(v_i)   // range of uncertainty in the rank of v_i.
The minimum value and the maximum value seen so far are always stored, and the stored values are kept in increasing order:
min(σ) = v_1 ≤ v_2 ≤ ... ≤ v_s = max(σ).
Initialize:
  start reading elements from the stream;  m ← 0

Process a token v:
  find i such that v_i ≤ v < v_{i+1};
  v becomes the new v_{i+1}, with associated tuple (v, 1, ⌊2εm⌋);
  m ← m + 1;
  if v is a new minimum or a new maximum, set its Δ ← 0;
  check whether compression is possible (within the bound below) and, if so, compress   // roughly every 1/(2ε) elements:
    if ∃ i such that g_i + g_{i+1} + Δ_{i+1} ≤ ⌊2εm⌋ then
      remove the i-th tuple;
      set g_{i+1} ← g_i + g_{i+1}

Output: an appropriate element of the summary, chosen using the rank estimates below.

Space: O((1/ε) · log m · log n)
At any point, the G-K algorithm can report the rank of any element up to an additive error e, where

e = max_i ( (g_i + Δ_i) / 2 )

min-rank(v_i) = Σ_{j=1}^{i} g_j
max-rank(v_i) = Σ_{j=1}^{i} g_j + Δ_i

Given a query value v, identify i such that v_i ≤ v < v_{i+1}. Then rank(v) ∈ [min-rank(v_i), max-rank(v_{i+1})]. Writing a = min-rank(v_i) and b = max-rank(v_{i+1}), the filters a and b bracket the true rank of the element.
rank(v) ≥ min-rank(v_i) = Σ_{j=1}^{i} g_j
rank(v) ≤ max-rank(v_{i+1}) = Σ_{j=1}^{i+1} g_j + Δ_{i+1}

So the width of the rank range for v is at most g_{i+1} + Δ_{i+1} ≤ 2e. If we announce rank(v) ≈ (a + b)/2, then the error is at most (b − a)/2 ≤ e.
The algorithm maintains the invariant g_i + Δ_i ≤ 1 + ⌊2εm⌋, so ranks are known up to error (1 + 2εm)/2 = εm + 1/2.
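A simplified Python rendering of the above (ours, not from the notes; the class name, the greedy compression order and the simple query rule are choices made here, not the tuned Greenwald-Khanna algorithm) may help make the tuple bookkeeping concrete:

import math

class GKSummary:
    # Simplified G-K summary: tuples [value, g, Delta] kept sorted by value,
    # compressed under the invariant g_i + g_{i+1} + Delta_{i+1} <= 2*eps*m.
    def __init__(self, eps):
        self.eps = eps
        self.m = 0
        self.tuples = []

    def insert(self, v):
        self.m += 1
        cap = math.floor(2 * self.eps * self.m)
        i = 0
        while i < len(self.tuples) and self.tuples[i][0] < v:
            i += 1
        # a new minimum or maximum is stored exactly (Delta = 0)
        delta = 0 if i in (0, len(self.tuples)) else cap
        self.tuples.insert(i, [v, 1, delta])
        # compress: merge tuple j into tuple j+1 whenever the invariant allows
        j = 0
        while j + 1 < len(self.tuples):
            _, g1, _ = self.tuples[j]
            v2, g2, d2 = self.tuples[j + 1]
            if g1 + g2 + d2 <= cap:
                self.tuples[j + 1] = [v2, g1 + g2, d2]
                del self.tuples[j]
            else:
                j += 1

    def query(self, r):
        # return a stored value whose true rank is within O(eps*m) of r
        # (assumes at least one insertion has been made)
        min_rank, answer = 0, self.tuples[0][0]
        for v, g, _ in self.tuples:
            min_rank += g
            if min_rank > r:
                break
            answer = v
        return answer

For example, inserting the numbers 1 through 10000 with ε = 0.01 and querying rank 5000 returns a value within a few hundred of 5000 while storing only a small fraction of the input.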
8.3 Guha-McGregor Algorithm
Split the stream into pieces: σ = <S_1, E_1, S_2, E_2, ..., S_p, E_p>.
S: sample phase;  E: estimate phase.
This gives one pass and polylog space; the number of sample/estimate phases needed for polylog space is p = O(log log m).
Maintain filters a_h, b_h (1 ≤ h ≤ p), initially a_1 = −∞, b_1 = +∞, such that with high probability the element of relative rank φ lies in (a_h, b_h).
Sample phase: find the first token u ∈ (a_h, b_h) from S_h.
Estimate phase: compute ρ = relrank(u, E_h).
Update the filters, setting them as in a binary search: with high probability the true rank of u lies in an interval [ρm − ℓ, ρm + ℓ] around the estimate (by standard sampling concentration), and u becomes the new lower or upper filter according to whether this interval lies below or above the target rank.
Lecture 9
Graph Stream Algorithms
Scribe: Ryan Kingston
9.1 Graph Streams
In a graph stream, tokens represent edges of a graph whose vertex set is known in advance. Duplicate pairs can be interpreted in whatever way is convenient for the problem at hand, e.g. as weighted edges or parallel edges. Given n vertices {1, 2, ..., n}, the stream is σ = <(u_1, v_1), (u_2, v_2), ..., (u_m, v_m)>, with each (u_i, v_i) ∈ [n] × [n].
Finding the most interesting properties of the graph requires Ω(n) space. For example: is there a path of length 2 between vertices 1 and 2? Suppose the edges arrive as follows: first (1, v) for v ∈ S, where S ⊆ {3, 4, ..., n}, then (v, 2) for v ∈ T, where T ⊆ {3, 4, ..., n}. The question then boils down to: is S ∩ T ≠ ∅? This is the Disjointness problem, which is known to require Ω(n) communication. Graph streams therefore fall into the semi-streaming model, whose goals differ slightly from the streams we have seen previously: we try to make the ratio space/n as small as we can, and O(n · poly(log n)) space is our new holy grail.
9.2 Connectedness Problem

Initialize: F ← ∅, where F is a set of edges.
Process (u, v):
  if F ∪ {(u, v)} does not contain a cycle then
    F ← F ∪ {(u, v)}
Output: if |F| (the number of edges kept) = n − 1 then Yes, else No.

The UNION-FIND data structure can be used to determine connected components in the graph. For example, if the component of u and the component of v are equal, then (u, v) forms a cycle and the edge can be discarded; otherwise the
components can be merged. This ensures that the graph will be maintained as a forest. Note that this works only on
insertion-only graph streams (edges in the stream do not depart from the graph).
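As an illustration (ours, not from the notes; the names and the tiny example stream are hypothetical), here is a Python sketch of this one-pass connectivity test using union-find with path halving:

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n + 1))     # vertices 1..n
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x
    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False          # the edge would close a cycle: discard it
        self.parent[rx] = ry
        return True

def connected(n, edge_stream):
    # one-pass connectivity test keeping only a spanning forest F
    uf, forest_size = UnionFind(n), 0
    for u, v in edge_stream:
        if uf.union(u, v):
            forest_size += 1      # edge kept in F
    return forest_size == n - 1

# example: the path 1-2-3-4 on n = 4 vertices is connected
print(connected(4, [(1, 2), (2, 3), (3, 4)]))   # True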
9.3 Bipartite Graphs
A bipartite graph is a graph with no odd cycles, equivalently a 2-colorable graph. If a graph is bipartite, a depth-first search alternates back and forth between the two sides of the bipartition. Whether a graph is bipartite can therefore be determined by checking that it has no odd cycles.
Definition: Monotone properties are properties that are not destroyed by adding more edges to the graph. For example, once a graph is discovered to be connected, adding more edges cannot cause it to become disconnected. The same holds for non-bipartiteness: a graph that has been found to be non-bipartite remains non-bipartite after adding more edges.

Initialize: F ← ∅;  Ans ← Yes
Process (u, v):
  if F ∪ {(u, v)} does not contain a cycle then
    F ← F ∪ {(u, v)}
  else if the cycle created by (u, v) is odd then
    Ans ← No
Output: Ans

Outline of proof, given an input graph G:
1. The information retained allows us to compute a 2-coloring of G.
2. The information discarded does not affect this coloring or its validity.
3. The first edge to create an odd cycle violates the coloring.
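A Python sketch of the same idea (ours; it reuses the UnionFind class from the connectivity sketch above, and determines the parity of a cycle by finding the unique path between the endpoints in the stored forest F):

from collections import defaultdict

def path_len(adj, u, v):
    # length of the unique u-v path in the forest (DFS)
    stack, seen = [(u, 0)], {u}
    while stack:
        x, d = stack.pop()
        if x == v:
            return d
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append((y, d + 1))
    return -1   # cannot happen when union-find reported a cycle

def bipartite(n, edge_stream):
    adj = defaultdict(list)       # adjacency list of the forest F
    uf = UnionFind(n)
    ans = True
    for u, v in edge_stream:
        if uf.union(u, v):
            adj[u].append(v)
            adj[v].append(u)
        else:
            # (u, v) closes a cycle of length 1 + (u-v path length in F),
            # so the cycle is odd iff that path length is even
            if path_len(adj, u, v) % 2 == 0:
                ans = False
    return ans

print(bipartite(4, [(1, 2), (2, 3), (3, 4), (4, 1)]))  # 4-cycle: True
print(bipartite(3, [(1, 2), (2, 3), (3, 1)]))          # triangle: False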
9.4 Shortest Paths (Distances)
At the end of the stream, given vertices x, y, estimate

d_G(x, y) = min{ # edges in an x-y path in G },  with d_G(x, y) = ∞ if there is no x-y path in G.

Initialize: H ← ∅ (H will be a subgraph of G)
Process (u, v):
  if H ∪ {(u, v)} does not contain a cycle of length ≤ t then
    H ← H ∪ {(u, v)}
Output: given a query (x, y), output d_H(x, y)

Quality analysis: d_G(x, y) (the actual distance) ≤ d_H(x, y) ≤ t · d_G(x, y).
We say that H is a t-spanner of G if it contains no cycles of length ≤ t. A smaller t gives a more accurate answer but requires more space. The girth of a graph is the length of its shortest cycle; here we are ensuring that girth(H) > t.

Theorem 9.4.1 (Bollobás). A graph on n vertices with girth > t has O(n^{1+2/t}) edges.
9.4.1 Algorithm Analysis
Say H has m edges (and, without loss of generality, m > n). Delete from H all vertices of degree < m/n. Then:
the number of deleted edges is ≤ (m/n − 1)·n = m − n;
the number of edges remaining is ≥ m − (m − n) = n > 0.
Consider a breadth-first search from any root in the remaining subgraph H̃. For up to (t−1)/2 levels we get ≥ m/n children per vertex per level (because the branching factor is ≥ m/n), giving ≥ (m/n)^{(t−1)/2} distinct vertices. So

n ≥ number of vertices of H̃ ≥ (m/n)^{(t−1)/2}
⟹ n^{2/(t−1)} ≥ m/n
⟹ m ≤ n^{1 + 2/(t−1)}.
Lecture 10
Finding Matchings, Counting Triangles in
Graph Streams
Scribe: Ranganath Kondapally
10.1 Matchings
Definition 10.1.1. Given a graph G(V, E), a matching is a set of edges M ⊆ E such that no two edges in M share a vertex.
Problem: we are given a graph stream (a stream of edges) and want to find a maximum matching.
There are two types of maximum matchings that we will consider:
Maximum cardinality matching (MCM): we want to maximize the number of edges in the matching.
Maximum weight matching (MWM): edges are assigned weights and we want to maximize the sum of the weights of the edges in the matching.
Semi-streaming model: aim for space Õ(n) or thereabouts (we just want to do better than O(n^2) space usage).
Both the algorithms for MCM and MWM have the following common characteristics:
the approximation improves with the number of passes;
they maintain a matching (in memory) of the subgraph seen so far.
Input: stream of edges of an n-vertex graph.
Without space restrictions we can compute an MCM by starting with an empty set and repeatedly increasing the size of the matching by finding an augmenting path.
Definition 10.1.2. Augmenting path: a path whose edges are alternately in M and not in M, beginning and ending at unmatched vertices.
Theorem 10.1.3 (Structure theorem). If there is no augmenting path, then the matching is maximum.
So, when we can no longer find an augmenting path, the matching we have is an MCM by the above theorem.
Streaming algorithm for MCM: maintain a maximal matching (one to which no edge can be added without violating the matching property). In more detail:
Initialize: M ← ∅
Process (u, v): if M ∪ {(u, v)} is a matching, then M ← M ∪ {(u, v)}.
Output: |M|
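A minimal Python sketch of this one-pass greedy matching (ours; the example stream is hypothetical):

def greedy_matching(edge_stream):
    # keep an edge iff neither endpoint is already matched
    matched, M = set(), []
    for u, v in edge_stream:
        if u not in matched and v not in matched:
            M.append((u, v))
            matched.add(u)
            matched.add(v)
    return M    # |M| is a 2-approximation to the maximum matching size

# path 1-2-3-4 with the middle edge first: greedy keeps 1 edge, the MCM has 2
print(len(greedy_matching([(2, 3), (1, 2), (3, 4)])))   # 1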
Theorem 10.1.4. Let t̂ denote the output of the above algorithm. If t is the size of an MCM of G, then t/2 ≤ t̂ ≤ t.
Proof. Let M* be an MCM, with |M*| = t. Suppose |M| < t/2. Each edge in M kills (prevents from being added to M) at most two edges of M*. Therefore there exists an unkilled edge in M* that could have been added to M. So M is not maximal, which is a contradiction.
State of the art for MCM: for any ε, one can find a matching M such that (1 − ε)·t ≤ |M| ≤ t, using a constant (depending on ε) number of passes.
Outline: find a matching, as above, in the first pass. In passes 2, 3, ..., find a short augmenting path (of length depending on ε) and increase the size of the matching. A generalized version of the above structure theorem for matchings gives the quality guarantee.
MWM algorithm (with a parameter γ > 0):
Initialize: M ← ∅
Process (u, v): if M ∪ {(u, v)} is a matching, then M ← M ∪ {(u, v)}. Else, let C = {edges of M conflicting with (u, v)} (note |C| = 1 or 2). If wt(u, v) > (1 + γ)·wt(C), then M ← (M \ C) ∪ {(u, v)}.
Output: ŵ = wt(M)
The space usage of the above algorithm is O(n log n).
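A Python sketch of this weighted greedy rule (ours; the default γ and the toy stream are illustrative, not part of the notes):

def greedy_weighted_matching(edge_stream, gamma=0.5):
    # M maps each matched vertex to its matching edge (u, v, w)
    M = {}
    for u, v, w in edge_stream:
        C = {M[x] for x in (u, v) if x in M}          # conflicting edges, |C| <= 2
        if w > (1 + gamma) * sum(e[2] for e in C):
            for a, b, _ in C:                         # the conflicting edges die
                M.pop(a, None)
                M.pop(b, None)
            M[u] = M[v] = (u, v, w)                   # (u, v) is born
    edges = set(M.values())
    return sum(e[2] for e in edges), edges

# on the path 1-2-3 with weights 1 and 2, the heavier edge replaces the lighter
print(greedy_weighted_matching([(1, 2, 1.0), (2, 3, 2.0)])[0])   # 2.0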
Theorem 10.1.5. Let M* be an MWM. Then wt(M*)/k ≤ ŵ ≤ wt(M*), where k is a constant (depending on γ).
With respect to the above algorithm, we say that edges are born when we add them to M. They die when they are removed from M. They survive if they are present in the final M. Any edge that dies has a well-defined killer (the edge whose inclusion resulted in its removal from M).
We can associate a (killing) tree with each survivor: the survivor is the root, and the edge(s) that it killed (at most 2) are its children; the subtrees under the children are defined recursively.
Note: the trees so defined are node-disjoint (no edge belongs to more than one tree), since every edge has a unique killer, if any.
Let S = {survivors} and T(S) = ∪_{e∈S} { edges in the killing tree of e, not including e } (the descendants of e).
Claim 1: wt(T(S)) ≤ wt(S)/γ.
Proof. Consider one tree, rooted at e ∈ S. Since wt(edge) ≥ (1 + γ)·wt({edges killed by it}),

wt(level-i descendants) ≤ wt(level-(i−1) descendants) / (1 + γ),

and hence

wt(level-i descendants) ≤ wt(e) / (1 + γ)^i.

Therefore

wt(descendants of e) ≤ wt(e) · ( 1/(1+γ) + 1/(1+γ)^2 + ... ) = wt(e) · (1/(1+γ)) · 1/(1 − 1/(1+γ)) = wt(e)/γ.

Summing over survivors,

wt(T(S)) = Σ_{e∈S} wt(descendants of e) ≤ Σ_{e∈S} wt(e)/γ = wt(S)/γ.
Claim 2: wt(M*) ≤ (1 + γ)·( wt(T(S)) + 2·wt(S) ).
Proof. Let e*_1, e*_2, ... be the edges of M* in stream order. We prove the claim using the following charging scheme.
If e*_i is born, charge wt(e*_i) to e*_i itself, which lies in T(S) ∪ S.
If e*_i is not born, this is because of 1 or 2 conflicting edges:
  One conflicting edge e: note e ∈ S ∪ T(S). Charge wt(e*_i) to e. Since e*_i could not kill e, wt(e*_i) ≤ (1 + γ)·wt(e).
  Two conflicting edges e_1, e_2: note e_1, e_2 ∈ T(S) ∪ S. Charge wt(e*_i) · wt(e_j) / (wt(e_1) + wt(e_2)) to e_j, for j = 1, 2. Since e*_i could not kill e_1, e_2, we have wt(e*_i) ≤ (1 + γ)·(wt(e_1) + wt(e_2)).
As before, we maintain the property that the weight charged to an edge e is at most (1 + γ)·wt(e).
If an edge e is killed by e', transfer the charge from e to e'. Note that wt(e) ≤ wt(e'), so e' can indeed absorb this charge.
Edges in S may have to absorb up to 2(1 + γ) times their weight (they can conflict with two edges of M*, one at each endpoint). This proves the claim.
Combining Claims 1 and 2,

wt(M*) ≤ (1 + γ)·( wt(S)/γ + 2·wt(S) ) = (1 + γ)·( (1 + 2γ)/γ )·wt(S) = ( 1/γ + 3 + 2γ )·wt(S).

The best choice of γ (minimizing the above expression) is γ = 1/√2. This gives

wt(M*) / (3 + 2√2) ≤ ŵ ≤ wt(M*).

Open question: can we improve the above result (with a better constant)?
10.2 Triangle Counting
Given a graph stream, we want to estimate the number of triangles in the graph. Some known results about this problem:
We cannot multiplicatively approximate the number of triangles in o(n^2) space.
We can approximate the number of triangles up to some additive error.
If we are promised that the number of triangles is ≥ t, then we can multiplicatively approximate it.
Given an input stream of m edges of a graph on n vertices with at least t triangles, we can compute an (ε, δ)-approximation using space O( (1/ε^2) · log(1/δ) · mn/t ) as follows:
1. Pick an edge (u, v) uniformly at random from the stream.
2. Pick a vertex w uniformly at random from V \ {u, v}.
3. If (u, w) and (v, w) appear after (u, v) in the stream, then output m(n − 2); else output 0.
It is easy to show that the expectation of the output of the above algorithm equals the number of triangles. As before, we run several copies of the algorithm in parallel and take the average of their outputs as the answer. Using Chebyshev's inequality with the variance bound, the space usage comes out to O( (1/ε^2) · log(1/δ) · mn/t ).
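A Python sketch of one run of this estimator (ours, not from the notes; it assumes a simple graph with vertices labelled 1..n, and realizes step 1 in a single pass via reservoir sampling):

import random

def triangle_estimator(edges, n):
    # reservoir-sample an edge (u, v), pick w uniformly from V \ {u, v},
    # and check whether both (u, w) and (v, w) show up later in the stream;
    # the expected output equals the number of triangles
    m, u, v, w, seen = 0, None, None, None, 0
    for a, b in edges:
        m += 1
        if random.random() < 1.0 / m:          # (u, v) stays uniform over the stream
            u, v = a, b
            w = random.choice([x for x in range(1, n + 1) if x not in (a, b)])
            seen = 0
        elif {a, b} == {u, w} or {a, b} == {v, w}:
            seen += 1
    return m * (n - 2) if seen == 2 else 0

# averaging many independent runs concentrates around the triangle count
runs = [triangle_estimator([(1, 2), (1, 3), (2, 3), (3, 4)], 4) for _ in range(20000)]
print(sum(runs) / len(runs))   # close to 1 (the single triangle 1-2-3)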
Bar-Yossef, Kumar, Sivakumar [BKS02] algorithm: uses space Õ( (1/ε^2) · log(1/δ) · (mn/t)^2 ). Even though the space usage of this algorithm is more than that of the previous one, its advantage is that it is a sketching algorithm: it does not compute exactly a linear transformation of the stream, but we can compose sketches computed by the algorithm on any two streams, and it can handle edge deletions.
Algorithm: given the actual stream of edges, pretend that we are seeing a virtual stream of triples {u, v, w} with u, v, w ∈ V. Specifically,

actual token {u, v}  →  virtual tokens {u, v, w_1}, {u, v, w_2}, ..., {u, v, w_{n−2}},

where {w_1, w_2, ..., w_{n−2}} = V \ {u, v}.
Let F_k be the k-th frequency moment of the virtual stream and

T_i = | { {u, v, w} : u, v, w are distinct vertices with exactly i edges amongst them } |.

Note: T_0 + T_1 + T_2 + T_3 = (n choose 3).
F_2 = Σ_{u,v,w} (number of occurrences of {u, v, w} in the virtual stream)^2 = 1^2·T_1 + 2^2·T_2 + 3^2·T_3 = T_1 + 4T_2 + 9T_3.

Similarly, F_1 = T_1 + 2T_2 + 3T_3 (note that F_1 = length of the virtual stream = m(n − 2)) and F_0 = T_1 + T_2 + T_3.
If we had estimates for F_0, F_1, F_2, we could compute T_3 by solving the above equations. So we need to compute two sketches of the virtual stream, one to estimate F_0 and another to estimate F_2.
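Concretely, eliminating T_1 and T_2 from the three equations above gives

T_3 = F_0 − (3/2)·F_1 + (1/2)·F_2,

so plugging in the two sketch estimates (together with the exact value F_1 = m(n − 2)) yields the estimate of the triangle count.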
Lecture 11
Geometric Streams
Scribe: Adrian Kostrubiak
11.1 Clustering
11.1.1 Metric Spaces
When talking about clustering, we work in a metric space: an abstract setting in which points have certain distances between each other. A metric space consists of a set M with a distance function d : M × M → R_+ having the following properties:
1. d(x, y) = 0 ⟺ x = y (identity)
2. d(x, y) = d(y, x) (symmetry)
3. ∀ x, y, z: d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
Note that if properties (2) and (3) hold but property (1) does not, we are working in a semi-metric space.
One example of a metric space is R^n under the distance function d(x, y) = ‖x − y‖_p with p > 0.
11.1.2 Working with Graphs
We will be working with weighted graphs viewed as metric spaces: M is the vertex set, d is the shortest weighted path distance, and all weights are ≥ 0. Whenever the set of points is finite, a weighted graph is the general situation.
A general template for a clustering problem is as follows:
Input: a set of points from a metric space and some integer k > 0.
Goal: partition the points into k clusters in the best way, as measured by a cost function.
Possible objective (cost) functions:
Max diameter = max_c (max distance between two points in cluster c)
Three other cost functions involve representatives: given clusters c_1, ..., c_k with representatives r_1 ∈ c_1, ..., r_k ∈ c_k, and letting x_{i1}, x_{i2}, ... be the points in c_i,

1. k-center:  max_i ( max_j d(x_{ij}, r_i) )
2. k-median:  max_i ( Σ_j d(x_{ij}, r_i) )
3. k-means:   max_i ( sqrt( Σ_j d(x_{ij}, r_i)^2 ) )
11.2 Doubling Algorithm
The goal of the doubling algorithm is to (approximately) minimize the k-center objective function. The following is an outline of the algorithm.
Input stream: x_1, x_2, ..., x_m ∈ M, where (M, d) is a metric space, together with an integer k > 0, the maximum number of clusters we wish to compute.
Desired output: a set y_1, ..., y_k of representatives such that the k-center cost of using these representatives is (approximately) minimized.
When we see the (k + 1)-st point of the input stream, rescale the distances so that the minimum pairwise distance among the points seen so far is 1. At this point we know that OPT ≥ 1/2, where OPT is defined as the cost of the optimum clustering (two of the k + 1 points must share a cluster).
The algorithm now proceeds in phases. At the start of the i-th phase we know that OPT ≥ d_i/2 (with d_1 = 1). Phase i does the following:
Choose a point as a new representative if its distance from all current representatives is > 2·d_i.
If we are about to choose the (k + 1)-st representative, stop this phase and let d_{i+1} = 2·d_i.
If the algorithm ends with w < k clusters, we are O.K.: we can just add in some arbitrary clusters (these additional clusters could be empty or not).
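A Python sketch of the doubling algorithm (ours, not from the notes; it assumes the first k + 1 points are distinct, and the re-thinning subroutine is one simple way to realize the phase transition):

def doubling_kcenter(stream, k, dist):
    def thin(pts, r):
        # greedily keep a subset of pts that are pairwise more than r apart
        kept = []
        for p in pts:
            if all(dist(p, q) > r for q in kept):
                kept.append(p)
        return kept
    it = iter(stream)
    reps = [next(it) for _ in range(k + 1)]
    d = min(dist(p, q) for i, p in enumerate(reps) for q in reps[:i])  # rescaled d_1
    reps = thin(reps, 2 * d)
    for x in it:
        if min(dist(x, r) for r in reps) > 2 * d:
            reps.append(x)
            while len(reps) > k:      # about to exceed k reps: new phase, double d
                d *= 2
                reps = thin(reps, 2 * d)
    # every point seen so far is within O(d) of some representative,
    # giving a constant-factor approximation to the k-center cost
    return reps, d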
11.3 Summaries
A summary of a stream σ = <x_1, x_2, ...> with x_i ∈ U is any set S ⊆ U. Associated with the summary is a summarization cost A(σ, S).
With the k-center objective we will use the following summarization cost:

A(σ, S) = max_{x∈σ} ( min_{y∈S} d(x, y) ).

(Picture a stream in which most elements are ordinary points and a few have been chosen as representatives; a prefix σ has been processed and the remainder π of the stream has not yet been processed.)
We say A is a metric cost if for any streams σ, π and their respective summaries S, T we have

A(S ∘ π, T) − A(σ, S)  ≤  A(σ ∘ π, T)  ≤  A(S ∘ π, T) + A(σ, S).     (1)

We can think of this scenario as S summarizing σ, and T summarizing the whole stream σ ∘ π.
11.4 k-center Problem
Problem: summarize a stream while minimizing the summarization cost given by the function A. In order to do this, the following conditions are required:
1. A is a metric cost;
2. A has an α-approximate threshold algorithm A_α.
A threshold algorithm takes as input an estimate t and a stream σ, and
either produces a summary S (where |S| = k) such that A(σ, S) ≤ α·t,
or fails, proving that for every T (where |T| = k), A(σ, T) > t.
Looking at the doubling algorithm, any one phase of it is a 2-approximate threshold algorithm.
If S* is an optimum summary for σ, then running A_α with threshold t* = A(σ, S*) will surely not fail. Therefore it will produce a summary S with A(σ, S) ≤ α·t* = α·OPT.
11.4.1 Guha's Algorithm
Guha's algorithm (2009) is one algorithm designed to solve the k-center problem in the streaming model. The idea is to run multiple copies of A_α in parallel with different thresholds: we run p instances of A_α in parallel, and if the run with threshold t_i fails, then we know for sure that every run with a lower threshold must fail too.
The following is an outline of Guha's algorithm:
Keep p instances of A_α running in parallel, with contiguous geometrically increasing thresholds of the form (1 + ε)^i, where p is chosen so that (1 + ε)^p = α/ε.
Whenever q ≤ p of the instances fail, start up the next q instances of A_α, using the summaries from the failed instances in place of the prefix of the stream (which cannot be replayed). When an instance fails, kill all the other instances with lower thresholds.
(Note that the threshold of a restarted instance goes up from (1 + ε)^i to (1 + ε)^{i+p}.)
At the end of the input, output the summary from the lowest-threshold instance of A_α still alive.
We can think of this algorithm as raising the threshold of an instance by a factor (1 + ε)^p rather than starting a new instance.
11.4.2 Space Bounds
Let:
σ_i = the portion of the stream in phase i,
S_i = the summary of the stream at the end of phase i,
π = the final portion of the stream,
T = the final summary, and
t = the estimate (threshold) that A_α was using when the input ended.
Then A(S_j ∘ π, T) ≤ α·t (by the α-approximate property), and

A(σ_1 ∘ σ_2 ∘ ... ∘ σ_j ∘ π, T) ≤ A(S_j ∘ π, T) + A(σ_1 ∘ σ_2 ∘ ... ∘ σ_j, S_j)        by (1)
                                ≤ α·t + A(σ_1 ∘ σ_2 ∘ ... ∘ σ_j, S_j).

Since A_α was running with threshold t/(1 + ε)^p = t·ε/α at the end of σ_j, the α-approximate property gives

A(S_{j−1} ∘ σ_j, S_j) ≤ α · (t·ε/α) = t·ε,

and by the metric property,

A(σ_1 ∘ σ_2 ∘ ... ∘ σ_{j−1} ∘ σ_j, S_j) ≤ A(S_{j−1} ∘ σ_j, S_j) + A(σ_1 ∘ ... ∘ σ_{j−1}, S_{j−1})
                                        ≤ t·ε + A(σ_1 ∘ ... ∘ σ_{j−1}, S_{j−1}).

Repeating this argument (j − 1) times,

A(σ_1 ∘ σ_2 ∘ ... ∘ σ_j ∘ π, T) ≤ α·t + ε·t + ε^2·t + ... = α·t + t·( ε/(1 − ε) ).

Let S* be an optimum summary. The instance of A_α with threshold t/(1 + ε) failed on the input stream, so

OPT = A(σ_1 ∘ σ_2 ∘ ... ∘ σ_j ∘ π, S*) > t/(1 + ε).

So our output T has

cost ≤ α·t + t·( ε/(1 − ε) ) = t·( α + ε/(1 − ε) ) ≤ ( α + ε/(1 − ε) )·(1 + ε)·OPT = ( α + O(ε) )·OPT.

So we get an (α + O(ε))-approximate algorithm. Using this with k-center (α = 2), we are left with a (2 + ε')-approximation. The space required is

Õ(k·p) = Õ( k · log(α/ε) / log(1 + ε) ) = Õ( (k/ε) · log(α/ε) ).
Lecture 12
Core-sets
Scribe: Konstantin Kutzkow
12.1 The Problem
Core-sets:
arise in computational geometry;
naturally give streaming algorithms for several (chiefly geometric) problems.
Suppose we have a cost function C_P : R^d → R_+ parameterized by a set P of points in R^d, and C_P is monotone, i.e. if Q ⊆ P, then ∀ x ∈ R^d: C_Q(x) ≤ C_P(x).
Definition 12.1.1. Problem: Min-Enclosing-Ball (MEB). In this case C_P(x) = max_{y∈P} ‖x − y‖_2, i.e. the radius of the smallest ball centered at x containing all points of P ⊆ R^d.
Definition 12.1.2. We say Q ⊆ P is an α-core-set for P if

∀ T ⊆ R^d, ∀ x ∈ R^d:  C_{Q∪T}(x) ≤ C_{P∪T}(x) ≤ α·C_{Q∪T}(x)

for some α > 1.
The first inequality always holds by monotonicity of C. We will think of T as disjoint from P. We want α > 1 to be as small as possible.
12.2 Construction
Theorem 12.2.1. MEB admits a (1 + ε)-core-set of size O(1/ε^{(d−1)/2}) for R^d; in particular, for d = 2 the size is O(1/√ε).
Proof. Let v_1, v_2, ..., v_t ∈ R^d be t unit vectors such that for any unit vector u ∈ R^d there exists i ∈ [t] with ang(u, v_i) ≤ θ, where ang(u, v) = |arccos⟨u, v⟩|. (In R^d we need t = O(1/θ^{d−1}); in R^2 we need t = O(1/θ).) A core-set for P is

Q = ∪_{i=1}^{t} { the two extreme points of P along direction v_i } = ∪_{i=1}^{t} { arg max_{x∈P} ⟨x, v_i⟩, arg min_{x∈P} ⟨x, v_i⟩ }.

(Here arg min_{x∈P} f(x) denotes a point x_0 ∈ P with f(x_0) ≤ f(x) for all x ∈ P, for a real-valued function f : P → R; similarly for arg max.) Note that the size of the core-set is 2t = O(1/θ^{d−1}).
We need to show that

∀ T ⊆ R^d, ∀ x ∈ R^d:  C_{P∪T}(x) ≤ α·C_{Q∪T}(x)     (1)

for some small α. Suppose z is the furthest point from x in P ∪ T, i.e. z = arg max_{x'∈P∪T} ‖x − x'‖, so that C_{P∪T}(x) = ‖x − z‖. If z ∈ Q ∪ T, then clearly C_{Q∪T}(x) = C_{P∪T}(x) and we are done. If not, then z ∈ P \ Q. Let u = z − x and let v_i be a direction such that ang(u/‖u‖, v_i) ≤ θ; such a direction exists by construction, and w.l.o.g. ⟨v_i, u⟩ > 0. Let y = arg max_{x'∈P} ⟨x', v_i⟩. Then, by construction, y ∈ Q. By the choice of z and y,

‖y − x‖ ≤ ‖z − x‖   and   ⟨y, v_i⟩ ≥ ⟨z, v_i⟩,

and hence

‖y − x‖ ≥ ⟨y − x, v_i⟩ ≥ ⟨z − x, v_i⟩ ≥ ‖z − x‖·cos θ = C_{P∪T}(x)·cos θ,

i.e.

C_{Q∪T}(x) ≥ C_{P∪T}(x)·cos θ.

We have proved (1) with α = 1/cos θ. Setting θ such that 1/cos θ = 1 + ε, i.e. θ = arccos(1/(1 + ε)) = Θ(√ε), completes the proof.
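A Python sketch of this construction in the plane (ours, not from the notes; the direction spacing and the random point set are illustrative):

import math
import random

def meb_coreset_2d(P, theta):
    # keep the two extreme points of P along ~2*pi/theta evenly spaced directions
    t = max(3, math.ceil(2 * math.pi / theta))
    Q = set()
    for i in range(t):
        a = 2 * math.pi * i / t
        proj = lambda p: p[0] * math.cos(a) + p[1] * math.sin(a)
        Q.add(max(P, key=proj))
        Q.add(min(P, key=proj))
    return Q

def meb_cost(P, x):
    return max(math.dist(x, p) for p in P)

random.seed(1)
P = [(random.random(), random.random()) for _ in range(2000)]
Q = meb_coreset_2d(P, theta=0.2)
x = (0.5, 0.5)
print(len(Q), meb_cost(P, x) / meb_cost(Q, x))   # ratio is at most 1/cos(0.2) ~ 1.02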
12.3 Streaming algorithm
Theorem 12.3.1. Consider a problem of the form: minimize C_P(x), where the cost function C_P admits a (1 + O(δ))-core-set of size A(δ). Then the problem has a streaming algorithm using space O( A(ε/log m) · log m ) that achieves a (1 + ε)-approximation.
Before proving the theorem we present two simple lemmas.
Lemma 12.3.2 (Merge property). If Q is an α-core-set for P and Q' is a β-core-set for P', then Q ∪ Q' is an (αβ)-core-set for P ∪ P'.
Proof. The first inequality follows from monotonicity; for the second we have, for all T ⊆ R^d and x ∈ R^d,

C_{P∪P'∪T}(x) ≤ α·C_{Q∪P'∪T}(x) ≤ αβ·C_{Q∪Q'∪T}(x),

since Q and Q' are core-sets for P and P', respectively.
Lemma 12.3.3 (Reduce property). If Q is an α-core-set for P and R is a β-core-set for Q, then R is an (αβ)-core-set for P.
Proof. Since Q is an α-core-set for P, we have ∀ T ⊆ R^d, ∀ x ∈ R^d: C_{P∪T}(x) ≤ α·C_{Q∪T}(x), and analogously ∀ T, x: C_{Q∪T}(x) ≤ β·C_{R∪T}(x); thus ∀ T, x: C_{P∪T}(x) ≤ αβ·C_{R∪T}(x).
Now we prove the theorem.
Proof. Place incoming points in a buffer of size B. When the buffer is full, summarize it using a core-set and push that core-set up a binary tree in post-order, maintaining at most one core-set per level by merging (and then reducing) as needed.
If each core-set computed is a (1 + δ)-core-set, then by the merge and reduce properties the final (root) core-set is a ((1 + δ)^{2 log(m/B)})-core-set for the input stream. So the final approximation factor is about 1 + 2δ·log(m/B); set δ so that 2δ·log(m/B) = ε. The required space is log(m/B) (the number of core-sets kept) times A(ε/log m) (the size of each core-set), plus B (the size of the buffer), i.e. O( A(ε/log m) · log m ).
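A generic Python sketch of this merge-and-reduce scheme (ours; make_coreset is any routine returning a (1 + δ)-core-set of its input as a list, e.g. list(meb_coreset_2d(pts, theta)) from the sketch above):

def merge_and_reduce(stream, B, make_coreset):
    levels = {}                       # level -> the one core-set kept at that level
    buf = []
    def push(S, lvl):
        while lvl in levels:          # merge with the existing core-set, then reduce
            S = make_coreset(levels.pop(lvl) + S)
            lvl += 1
        levels[lvl] = S
    for x in stream:
        buf.append(x)
        if len(buf) == B:
            push(make_coreset(buf), 0)
            buf = []
    out = buf[:]                      # final summary: partial buffer plus all levels
    for S in levels.values():
        out += S
    return out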
Corollary: MEB admits a (1 + ε)-approximation in one streaming pass using space O( log^{3/2} m / ε^{1/2} ).
Lecture 13
Communication Complexity
Scribe: Aarathi Prasad
Recall the Misra-Gries algorithm for the MAJORITY problem, where we found the majority element in a data stream in 2 passes. We can show that it is not possible to find the element in 1 pass using sublinear space. Let us see how to establish such lower bounds.
13.1 Introduction to Communication Complexity
We first introduce an abstract model of communication complexity. There are two parties to this communication game, Alice and Bob. Alice's input is x ∈ X and Bob's input is y ∈ Y; X and Y are known before the problem is specified. We want to compute f(x, y), where f : X × Y → Z. Take, for instance, X = Y = [n] and Z = {0, 1} with

f(x, y) = (x + y) mod 2 = (x mod 2 + y mod 2) mod 2.

In this case Alice does not have to send her whole input x using ⌈log n⌉ bits; she can send x mod 2 to Bob using just 1 bit! Bob computes y mod 2 and uses Alice's message to find f(x, y). However, only Bob then knows the answer; he can choose to send the result back to Alice, but in this model it is not required that all players know the answer. Note that we are not concerned with memory usage; we try to minimise the number of bits communicated between Alice and Bob.
13.1.1 EQUALITY problem
This problem finds applications in, e.g., image comparison. Given X = Y = {0, 1}^n and Z = {0, 1},

EQ(x, y) = 1 if x = y, and 0 otherwise.

We will see that in this case communication between Alice and Bob requires n bits. For now we consider one-way communication, i.e. messages are sent in only one direction; for symmetric functions it does not matter who sends the message to whom.
Theorem 13.1.1. Alice must send Bob n bits in order to solve EQ in the one-way model: D^→(EQ) ≥ n.
Proof: Suppose Alice sends fewer than n bits to Bob. Then the number of different messages she might send is at most 2^1 + 2^2 + ... + 2^{n−1} = 2^n − 2. But Alice can have up to 2^n distinct inputs.
Recall the pigeonhole principle: if there are n pigeons and m holes with m < n, then at least 2 pigeons must go into one hole.
By the pigeonhole principle, there exist two different inputs x ≠ x' on which Alice sends the same message.
Let P(x, y) be Bob's output when the input is (x, y). We should have

P(x, y) = EQ(x, y).     (13.1)

Since P(x, y) is determined entirely by Alice's message on x and by Bob's input y, and Alice's message is the same on x and x', taking Bob's input to be x and using (13.1) we get

P(x, x) = EQ(x, x) = 1,     (13.2)
P(x', x) = EQ(x', x) = 0.     (13.3)

However, since Bob sees the same message from Alice for both inputs x and x', we must have P(x, x) = P(x', x), which is a contradiction.
Hence, to solve EQUALITY, Alice must be able to send 2^n distinct messages, i.e., she must send n bits.
Theorem 13.1.2. Using randomness, we can compute the EQ function with error probability ≤ 1/3 in the one-way model with message size O(log n):

R^→(EQ) = O(log n).

Proof: Consider the following protocol.
Alice picks a random prime p ∈ [n^2, 2n^2].
Alice sends Bob (p, x mod p), using O(log n) bits (treating the n-bit string x as an integer).
Bob checks whether y mod p = x mod p, outputs 1 if so and 0 otherwise.
If EQ(x, y) = 1, the output is always correct. If EQ(x, y) = 0, the protocol errs if and only if (x − y) mod p = 0.
Remember that x and y are fixed and p is random. Also, it matters that p is drawn from a range of large primes: if we chose p = 2, say, the error probability would be high.
Let x − y = p_1 · p_2 · ... · p_t be the prime factorisation of x − y, where p_1, p_2, ..., p_t need not be distinct. For the output EQ(x, y) to be incorrect, we need p ∈ {p_1, p_2, ..., p_t}. Hence

Pr[error] ≤ t / (number of primes in [n^2, 2n^2]).     (13.4)

Since |x − y| ≤ 2^n and p_1 · p_2 · ... · p_t ≥ 2^t, we get t ≤ n.
By the prime number theorem, the number of primes in [1..N] is ≈ N / ln N.
Hence the number of primes in [n^2, 2n^2] is

≈ 2n^2/ln(2n^2) − n^2/ln(n^2) = 2n^2/(ln 2 + 2 ln n) − n^2/(2 ln n) ≥ 1.9·n^2/(2 ln n) − n^2/(2 ln n) = 0.9·n^2/(2 ln n).

Using this estimate in (13.4),

Pr[error] ≤ n / ( 0.9·n^2 / (2 ln n) ) = 2 ln n / (0.9·n) ≤ 1/3.
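A Python sketch of this fingerprint protocol (ours, not from the notes; the trial-division primality test is only for illustration at small n):

import random

def is_prime(p):
    if p < 2:
        return False
    return all(p % q for q in range(2, int(p ** 0.5) + 1))

def random_prime(lo, hi):
    while True:
        p = random.randrange(lo, hi)
        if is_prime(p):
            return p

def eq_one_way(x_bits, y_bits):
    # Alice sends (p, x mod p); Bob compares with y mod p
    n = len(x_bits)
    x, y = int(x_bits, 2), int(y_bits, 2)
    p = random_prime(n * n, 2 * n * n)
    return (x % p) == (y % p)

print(eq_one_way("1100101", "1100101"))   # equal strings: always True
print(eq_one_way("1100101", "1100111"))   # unequal: False except with small probability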
13.2 Communication Complexity in Streaming Algorithms
Now let us connect communication complexity to streaming algorithms. Cut the stream into two parts: one part is with Alice and the other with Bob. We claim that if we can solve the underlying streaming problem, we can solve the corresponding communication problem as well.
Suppose there is a deterministic (or randomised) streaming algorithm computing f(x, y) using s bits of memory. Then D^→(f) ≤ s (respectively R^→(f) ≤ s): Alice runs the algorithm on her part of the stream, sends the memory contents (s bits) to Bob, and Bob continues the run on his part of the stream to compute the output.
Note that a one-pass streaming algorithm yields a one-way communication protocol. More generally, a p-pass streaming algorithm yields a protocol with p messages from Alice to Bob and p − 1 messages from Bob to Alice.
Hence a communication lower bound proven for a fairly abstract problem can be translated into a streaming lower bound: from a communication lower bound we obtain a streaming lower bound.
13.2.1 INDEX problem
Given X = {0, 1}^n, Y = [n], Z = {0, 1},

INDEX(x, j) = x_j = the j-th bit of x,   e.g., INDEX(1100101, 3) = 0.

Theorem 13.2.1. D^→(INDEX) ≥ n.
Proof: This can be proved using the pigeonhole principle, as for EQ.
Now suppose we have a 1-pass MAJORITY streaming algorithm A using s bits of space, and we are given an INDEX instance, say Alice's input x = 1100101 and Bob's input j = 3. The scheme is as follows. Given an instance (x, j) of INDEX, we construct streams σ, π of length m each (m = |x|):
Alice's input x is mapped to σ = a_1 a_2 ... a_m, where a_i = 2(i − 1) + x_i;
Bob's input j is mapped to π = b b ... b (b repeated m times), where b = 2(j − 1).
Alice and Bob simulate A on σ ∘ π, with Alice sending the memory contents after her half. If A reports that there is no majority element, output 1; else output 0. (The token b is a majority element of σ ∘ π if and only if x_j = 0.)
Hence a 1-pass MAJORITY algorithm requires Ω(m) space: by Theorem 13.2.1 the communication, and therefore the memory s, must be at least m = half the stream length. We also have R^→(INDEX) = Ω(n), so the same lower bound holds for randomized algorithms.
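A Python sketch of this reduction (ours; the exact-counting majority_alg stands in for whatever streaming MAJORITY algorithm is being lower-bounded):

from collections import Counter

def exact_majority(stream):
    val, cnt = Counter(stream).most_common(1)[0]
    return val if cnt > len(stream) // 2 else None   # None means "no majority"

def index_via_majority(x_bits, j, majority_alg):
    # Alice's half maps bit i to 2(i-1) + x_i; Bob's half repeats 2(j-1) m times.
    # A majority element exists iff x_j = 0, so output 1 exactly when there is none.
    m = len(x_bits)
    alice = [2 * (i - 1) + int(x_bits[i - 1]) for i in range(1, m + 1)]
    bob = [2 * (j - 1)] * m
    return 1 if majority_alg(alice + bob) is None else 0

print(index_via_majority("1100101", 3, exact_majority))   # 0, matching x_3 = 0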
Lecture 14
Reductions
Scribe: Priya Natarajan
Recap and an important point: last time we saw that 1-pass MAJORITY requires space Ω(n) by reducing from INDEX. We say Ω(n), but in our streaming notation m is the stream length and n is the size of the universe. We produced a stream σ ∘ π where m = 2N and n = 2N, and we concluded that the space s must be Ω(N). This looks like s = Ω(m) or s = Ω(n), but we have really proven the lower bound only for the smaller of the two; so, in fact, we have s = Ω(min{m, n}). For MAJORITY we could say Ω(n) because m and n were about the same. We will not be explicit about this point later, but most of our lower bounds have the form Ω(min{m, n}).
Today we will see some communication lower bounds. But first, here is a small table of results:

  f              D^→(f)    R^→_{1/3}(f)    D(f)        R_{1/3}(f)
  INDEX          n          Ω(n)           ⌈log n⌉      ⌈log n⌉
  EQUALITY       n          O(log n)        n           O(log n)
  DISJOINTNESS   Ω(n)       Ω(n)           Ω(n)         Ω(n)

Table 14.1: We will either prove, or have already seen, almost all of these results, except R^→_{1/3}(EQ). Also, we will only prove a special case of DISJ.
We start by proving D(EQ) ≥ n. To do this, we first need an important property of deterministic communication protocols.
Definition 14.0.2. In two-way communication, suppose Alice first sends message α_1 to Bob, to which Bob responds with message β_1; then Alice sends message α_2, to which Bob responds with β_2; and so on. The sequence of messages <α_1, β_1, α_2, β_2, ...> is called the transcript of the deterministic communication protocol.
Theorem 14.0.3 (Rectangle property of deterministic communication protocols). Let P be a deterministic communication protocol, and let X and Y be Alice's and Bob's input spaces. Let trans_P(x, y) be the transcript of P on input (x, y) ∈ X × Y. If trans_P(x_1, y_1) = trans_P(x_2, y_2) = τ (say), then trans_P(x_1, y_2) = trans_P(x_2, y_1) = τ.
Proof: We prove the result by induction.
Let τ = <α_1, β_1, α_2, β_2, ...>, and let τ' = <α'_1, β'_1, α'_2, β'_2, ...> be trans_P(x_2, y_1).
Since Alice's input does not change between (x_2, y_1) and (x_2, y_2), we have α'_1 = α_1.
Induction hypothesis: τ and τ' agree on the first j messages.
Case 1: j is odd, say j = 2k − 1 (k ≥ 1). So we have

α'_1 = α_1, α'_2 = α_2, ..., α'_{k−1} = α_{k−1}, α'_k = α_k   and   β'_1 = β_1, β'_2 = β_2, ..., β'_{k−1} = β_{k−1}, β'_k = ?

Now, β'_k depends on Bob's input, which does not change between (x_1, y_1) and (x_2, y_1), and on the transcript so far, which also does not change. So β'_k = β_k.
Case 2: j is even, say j = 2k (k ≥ 1). The proof for this case is similar to that of Case 1, except that we consider the inputs (x_2, y_2) and (x_2, y_1).
By induction, we have the rectangle property.
Now that we have the rectangle property at hand, we can proceed to prove:
Theorem 14.0.4. D(EQ) ≥ n.
Proof: Let P be a deterministic communication protocol solving EQ. We will show that P has at least 2^n different transcripts, which implies that P communicates at least n bits.
Suppose trans_P(x_1, x_1) = trans_P(x_2, x_2) = τ for some x_1 ≠ x_2. Then, by the rectangle property, trans_P(x_1, x_2) = τ as well, and since the output is determined by the transcript,

output on (x_1, x_1) = output on (x_1, x_2)
⟹ 1 = EQ(x_1, x_1) = EQ(x_1, x_2) = 0.   [Contradiction]

This means that the 2^n possible inputs of the form (x, x) lead to 2^n distinct transcripts.
Note: general (two-way) communication corresponds to multi-pass streaming.
Theorem 14.0.5. Any deterministic streaming algorithm for DISTINCT-ELEMENTS, i.e., F_0, must use space Ω(n/p), where p is the number of passes.
Before seeing the proof, let us make note of a couple of points.
Why Ω(n/p)? Over p passes, Alice and Bob exchange 2p − 1 messages; if the size of each message (i.e., the space usage) is s bits, then the total communication cost is s·(2p − 1) ≤ 2ps. If we can prove that the total communication must be at least n bits, then n ≤ 2ps, so s ≥ n/(2p), i.e., s = Ω(n/p).
Proof idea: We reduce from EQ. Suppose Alice's input x ∈ {0, 1}^n is 100110 and Bob's input y ∈ {0, 1}^n is 110110. Using her input, Alice produces the stream σ = <1, 2, 4, 7, 9, 10> (i.e., σ_i = 2(i − 1) + x_i). Similarly, Bob produces the stream π = <1, 3, 4, 7, 9, 10>.
Suppose Alice's string disagrees with Bob's in d places (i.e., the Hamming distance of x and y is d). Then F_0(σ ∘ π) = n + d. If x and y are equal then d = 0; that is, EQ(x, y) = 1 ⟺ d = 0. So

EQ(x, y) = 1 ⟹ d = 0 ⟹ F_0(σ ∘ π) = n,
EQ(x, y) = 0 ⟹ d ≥ 1 ⟹ F_0(σ ∘ π) ≥ n + 1.

However, unless the F_0 algorithm is exact, it is very difficult to differentiate between n and n + (a small d). As we proved earlier, we can only get an approximate value for F_0, so we would like that whenever x ≠ y, the Hamming distance between x and y is noticeably large. To get this large Hamming distance, we first run Alice's and Bob's inputs through an encoder that guarantees a certain distance. For example, running Alice's n-bit input x through an encoder might lead to a 3n-bit string C(x) (similarly, y and C(y) for Bob), and the encoder might guarantee that x ≠ y ⟹ the Hamming distance between C(x) and C(y) is ≥ 3n/100 (say).
Proof: Reduction from EQ.
Suppose Alice and Bob have inputs x, y ∈ {0, 1}^n. Alice computes x̃ = C(x), where C is an error-correcting code with encoded length 3n and minimum distance 3n/100. She then produces the stream σ = <σ_1, σ_2, ..., σ_{3n}>, where σ_i = 2(i − 1) + x̃_i. Similarly, Bob computes ỹ = C(y) and produces the stream π, where π_i = 2(i − 1) + ỹ_i. Now they simulate the F_0 streaming algorithm to estimate F_0(σ ∘ π).
If x = y, i.e., EQ(x, y) = 1, then x̃ = ỹ, so F_0(σ ∘ π) = 3n.
If x ≠ y, i.e., EQ(x, y) = 0, then Hamming(x̃, ỹ) ≥ 3n/100, so F_0(σ ∘ π) ≥ 3n + 3n/100 = 3n · 1.01.
If the F_0 algorithm gives a (1 ± ε)-approximation with ε ≤ 0.005, then we can distinguish between the above two situations and hence solve EQ. But D(EQ) ≥ n; therefore a p-pass deterministic F_0 algorithm needs space Ω(n/p).
Lecture 15
Set Disjointness and Multi-Pass Lower Bounds
Scribe: Zhenghui Wang
15.1 Communication Complexity of DISJ
Alice has an n-bit string x and Bob has an n-bit string y, where x, y ∈ {0, 1}^n. Here x and y represent subsets X, Y of [n] respectively; x and y are called characteristic vectors, or bitmasks, i.e.

x_j = 1 if j ∈ X, and 0 otherwise.

We wish to compute the function

DISJ(x, y) = 1 if X ∩ Y ≠ ∅, and 0 otherwise.

Theorem 15.1.1. R(DISJ) = Ω(n).
Proof. We prove the special case R^→(DISJ) = Ω(n), by reduction from INDEX. Given an instance (x, j) of INDEX, we create an instance (x', y') of DISJ, where x' = x and y' is the vector with

y'_i = 0 if i ≠ j, and 1 otherwise.

By construction, x' ∩ y' ≠ ∅ ⟺ x_j = 1, i.e. DISJ(x', y') = INDEX(x, j). Thus a one-way protocol for DISJ yields a one-way protocol for INDEX with the same communication cost. But R^→(INDEX) = Ω(n), so we have R^→(DISJ) = Ω(n).
The general (two-way) case is due to [Kalyanasundaram-Schnitger 87], [Razborov 90], and [Bar-Yossef-Jayram-Kumar-Sivakumar 02].
15.2 ST-Connectivity
ST-connectivity: given an input graph stream, are two specified vertices s and t connected?
Theorem 15.2.1. Solving ST-CONN requires Ω(n) space (so semi-streaming space is necessary).
Proof. By reduction from DISJ.
Let (x, y) be an instance of DISJ. Construct an instance of ST-CONN on a graph with vertex set {s, t, v_1, v_2, ..., v_n}: Alice gets the set of edges A = {(s, v_i) : i ∈ x} and Bob gets the set of edges B = {(v_i, t) : i ∈ y}. Then

s and t are connected in the resulting graph
⟺ there is a length-two path from s to t
⟺ ∃ i such that (s, v_i) ∈ A and (v_i, t) ∈ B
⟺ ∃ i ∈ x ∩ y
⟺ DISJ(x, y) = 1.

Translating this communication lower bound back to streaming, the space required by a p-pass algorithm is Ω(n/p).
15.3 Perfect Matching Problem
Given a graph on n vertices, does it have a perfect matching, i.e. a matching of size n/2?
Theorem 15.3.1. A one-pass streaming algorithm for this problem requires Ω(n^2) space.
Proof. Reduction from INDEX.
Suppose we have an instance (x, k) of INDEX, where x ∈ {0, 1}^{N^2} and k ∈ [N^2]. Construct a graph G with V_G = ∪_{i=1}^{N} {a_i, b_i, u_i, v_i} and E_G = E_A ∪ E_B, where the edges in E_A and E_B are introduced by Alice and Bob respectively:

E_A = { (u_i, v_j) : x_{f(i,j)} = 1 },  where f is a 1-1 correspondence between [N] × [N] and [N^2], e.g. f(i, j) = N(i − 1) + j;
E_B = { (a_l, u_l) : l ≠ i } ∪ { (b_l, v_l) : l ≠ j } ∪ { (a_i, b_j) },  where i, j ∈ [N] are such that f(i, j) = k.

By construction,

G has a perfect matching ⟺ (u_i, v_j) ∈ E_G ⟺ x_k = 1 ⟺ INDEX(x, k) = 1.

Thus the space usage of a one-pass streaming algorithm for perfect matching must be Ω(N^2). The number of vertices in the constructed instance is n = 4N, so the lower bound in terms of n is Ω((n/4)^2) = Ω(n^2).
15.4 Multiparty Set Disjointness (DISJ_{n,t})
Suppose there are t players, and each player A_i has an n-bit string x^i ∈ {0, 1}^n. Define

DISJ_{n,t} = 1 if ∩_{i=1}^{t} x^i ≠ ∅, and 0 otherwise.
Theorem 15.4.1 (Chakrabarti-Khot-Sun 03, Gronemeier 08). R(DISJ_{n,t}) = Ω(n/t).
Theorem 15.4.2. Approximating F_k in one pass, even with randomization allowed, requires Ω(m^{1−2/k}) space.
Proof. Consider a DISJ_{n,t} instance (x^1, x^2, ..., x^t) with the promise that every column of the t × n matrix whose rows are x^1, x^2, ..., x^t contains 0, 1 or t ones, and at most one column contains t ones. Player A_i converts x^i into a stream σ_i = <...> listing exactly those j with x^i_j = 1. We simulate the F_k algorithm on the stream σ = σ_1 ∘ σ_2 ∘ ... ∘ σ_t. The frequency vector of this stream consists of 0s and 1s only (if DISJ_{n,t} = 0), or of 0s and 1s plus a single entry equal to t (if DISJ_{n,t} = 1). In the first case F_k ≤ n, while in the second case F_k ≥ t^k.
Choose t such that t^k > 4n; then a 2-approximation to F_k can distinguish these two situations. Since a one-pass simulation uses t messages of s bits each, the Ω(n/t) communication bound gives memory usage

s = Ω(n/t^2) = Ω(m/t^2) = Ω( m / (m^{1/k})^2 ) = Ω(m^{1−2/k}),

taking t = Θ(m^{1/k}); note that the stream length is m ≤ n + t, so n = Θ(m).
Bibliography
[AMS99] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999. Preliminary version in Proc. 28th Annu. ACM Symp. Theory Comput., pages 20-29, 1996.
[BJK+04] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Proc. 6th International Workshop on Randomization and Approximation Techniques in Computer Science, pages 128-137, 2004.
[BKS02] Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 623-632, 2002.
[CM05] Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Alg., 55(1):58-75, 2005. Preliminary version in Proc. 6th Latin American Theoretical Informatics Symposium, pages 29-38, 2004.
[FM85] Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182-209, 1985.
[MG82] Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143-152, 1982.