
CS 473, Spring 2020

Homework 5
Due Wednesday, March 11, 2020 at 9pm

1. Suppose we are hashing from the universe U = {0, 1, . . . , 2^w − 1} of w-bit strings to a hash
table of size m = 2^ℓ; that is, we are hashing w-bit words into ℓ-bit labels. To define our
universal family of hash functions, we think of words and labels as boolean vectors, and we
specify our hash function by choosing a random ℓ × w boolean matrix.
For any ℓ × w matrix M of 0s and 1s, define the hash function h_M : {0, 1}^w → {0, 1}^ℓ by
the boolean matrix-vector product

    h_M(x) = Mx mod 2 = ⊕_{i=1}^{w} M_i x_i = ⊕_{i : x_i = 1} M_i,

where ⊕ denotes bitwise exclusive-or (that is, addition mod 2), M_i denotes the ith column
of M, and x_i denotes the ith bit of x.
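For concreteness, here is a minimal Python sketch of this hash family; the function names and the small parameter choices are illustrative, not part of the problem. Each column M_i is packed into an ℓ-bit integer, so h_M(x) is just the XOR of the columns selected by the bits of x:

```python
import random

def random_matrix(w, ell):
    """A random ell-by-w boolean matrix, stored as w columns,
    each column packed into an ell-bit integer."""
    return [random.getrandbits(ell) for _ in range(w)]

def matrix_hash(M, x):
    """h_M(x) = Mx mod 2: XOR together the columns M_i with x_i = 1."""
    h = 0
    for i, column in enumerate(M):
        if (x >> i) & 1:
            h ^= column
    return h

# Hash 8-bit words into 3-bit labels (w = 8, ell = 3, m = 8 buckets).
M = random_matrix(8, 3)
print([matrix_hash(M, x) for x in range(16)])
```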

(a) Prove that the family M = {h_M | M ∈ {0, 1}^{ℓ×w}} of all random-matrix hash functions
is universal.
Solution: Pick two arbitrary distinct words x ≠ y; we need to prove that
Pr_M[h_M(x) = h_M(y)] ≤ 1/m. Definition-chasing implies that for any matrix M,
we have h_M(x) = h_M(y) if and only if M(x ⊕ y) mod 2 = 0. Thus, it suffices to
prove for any non-zero word z ∈ {0, 1}^w that Pr_M[Mz mod 2 = 0] ≤ 1/m.
Fix an arbitrary non-zero word z, and suppose z_j = 1. Then we have

    Mz mod 2 = 0  ⟺  ⊕_i M_i z_i = 0  ⟺  ⊕_{i≠j} M_i z_i = M_j.

If we fix all the entries in M except the jth column M_j, the left side of the last
equation is a fixed vector, but the right side is still uniformly random. The
probability that the random vector M_j equals the fixed vector ⊕_{i≠j} M_i z_i is
exactly 1/m. ∎

Rubric: 4 points
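The bound in part (a) is easy to check empirically. The following sketch (with illustrative parameters w = 8 and ℓ = 3) estimates the collision probability for one fixed pair x ≠ y over many random matrices; the estimate should land near 1/m = 1/8:

```python
import random

def collision_rate(x, y, w=8, ell=3, trials=100_000):
    """Estimate Pr_M[h_M(x) = h_M(y)] over random ell-by-w matrices."""
    collisions = 0
    for _ in range(trials):
        M = [random.getrandbits(ell) for _ in range(w)]
        hx = hy = 0
        for i, column in enumerate(M):
            if (x >> i) & 1:
                hx ^= column
            if (y >> i) & 1:
                hy ^= column
        collisions += (hx == hy)
    return collisions / trials

print(collision_rate(0b00101101, 0b01100001))  # close to 1/8 = 0.125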

(b) Prove that M is not uniform.

Solution: h_M(0) = M·0 mod 2 = 0 for every h_M ∈ M, so Pr[h_M(0) = 0] = 1 > 1/m. ∎

Rubric: 1 point


(c) Now consider a modification of the previous scheme, where we specify a hash
function by a random matrix M ∈ {0, 1}^{ℓ×w} and an independent random offset vector
b ∈ {0, 1}^ℓ:

    h_{M,b}(x) = (Mx + b) mod 2 = (⊕_{i=1}^{w} M_i x_i) ⊕ b

Prove that the family M⁺ of all such functions is strongly universal.

Solution: Fix two arbitrary distinct words x ≠ y and two labels u and v; we
need to compute the probability that h(x) = u and h(y) = v.
Without loss of generality, suppose x_i = 0 and y_i = 1 for some index i. Fix
all the entries in M except in column M_i; then

    h(x) = ū ⊕ b    and    h(y) = v̄ ⊕ M_i ⊕ b

for some fixed vectors ū and v̄; specifically,

    ū = ⊕_{j≠i} M_j x_j    and    v̄ = ⊕_{j≠i} M_j y_j.

Definition-chasing now implies that h(x) = u and h(y) = v if and only if

    b = u ⊕ ū    and    M_i = u ⊕ ū ⊕ v ⊕ v̄.

The vectors b and M_i are uniform and independent, so the probability of this
event is exactly 1/m². ∎

Rubric: 4 points
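Because the parameters can be made tiny (the values w = 4 and ℓ = 2 below are illustrative), strong universality can be verified by exact enumeration rather than sampling: for any fixed x ≠ y and any labels u, v, exactly a 1/m² fraction of the (M, b) pairs map x to u and y to v.

```python
from itertools import product

W, ELL = 4, 2
M_SIZE = 1 << ELL                     # m = 2^ell = 4

def affine_hash(M, b, x):
    """h_{M,b}(x) = (Mx + b) mod 2, columns packed as integers."""
    h = b
    for i in range(W):
        if (x >> i) & 1:
            h ^= M[i]
    return h

x, y, u, v = 0b0011, 0b0101, 1, 2     # arbitrary distinct words and labels
hits = total = 0
for M in product(range(M_SIZE), repeat=W):   # every possible matrix...
    for b in range(M_SIZE):                  # ...and every offset vector
        total += 1
        hits += (affine_hash(M, b, x) == u and affine_hash(M, b, y) == v)
print(hits / total, 1 / M_SIZE**2)           # both print 0.0625 = 1/16
```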

(d) Prove that M⁺ is not 4-uniform.

Solution: For every h = h_{M,b} ∈ M⁺, we have h(0) ⊕ h(1) ⊕ h(2) = b ⊕ (M·1 ⊕ b) ⊕ (M·2 ⊕ b) = M·(1 ⊕ 2) ⊕ b = h(3), so these four hash values are never independent. ∎

Rubric: 1 point

(e) [Extra credit] Prove that M⁺ is actually 3-uniform.

Solution: Fix three distinct words x, y, z and three (not necessarily distinct)
labels s, t, u. Because x, y, z are distinct, there must be two indices i ≠ j such
that the bit-pairs (x_i, x_j), (y_i, y_j), (z_i, z_j) are also distinct. Fix all entries in M
except those in columns M_i and M_j. Up to permutations of the variable names,
there are only three cases to consider.

• Case 1: No (1, 1). Suppose x_i = x_j = y_j = z_i = 0 and y_i = z_j = 1. Then

    h(x) = s̄ ⊕ b    h(y) = t̄ ⊕ M_i ⊕ b    h(z) = ū ⊕ M_j ⊕ b

for some fixed vectors s̄, t̄, and ū. It follows that h(x) = s and h(y) = t and


h(z) = u if and only if

    b = s ⊕ s̄    M_i ⊕ b = t ⊕ t̄    M_j ⊕ b = u ⊕ ū

and therefore (by solving the system of three mod-2-linear equations) if and
only if

    M_i = s ⊕ s̄ ⊕ t ⊕ t̄    M_j = s ⊕ s̄ ⊕ u ⊕ ū    b = s ⊕ s̄

• Case 2: No (0, 1). Suppose x_i = x_j = y_j = 0 and y_i = z_i = z_j = 1. Then

    h(x) = s̄ ⊕ b    h(y) = t̄ ⊕ M_i ⊕ b    h(z) = ū ⊕ M_i ⊕ M_j ⊕ b

for some fixed vectors s̄, t̄, and ū. It follows that h(x) = s and h(y) = t and
h(z) = u if and only if

    b = s ⊕ s̄    M_i ⊕ b = t ⊕ t̄    M_i ⊕ M_j ⊕ b = u ⊕ ū

and therefore if and only if

    M_i = s ⊕ s̄ ⊕ t ⊕ t̄    M_j = t ⊕ t̄ ⊕ u ⊕ ū    b = s ⊕ s̄

• Case 3: No (0, 0). Suppose x_j = y_i = 0 and x_i = y_j = z_i = z_j = 1. Then

    h(x) = s̄ ⊕ M_i ⊕ b    h(y) = t̄ ⊕ M_j ⊕ b    h(z) = ū ⊕ M_i ⊕ M_j ⊕ b

for some fixed vectors s̄, t̄, and ū. It follows that h(x) = s and h(y) = t and
h(z) = u if and only if

    M_i ⊕ b = s ⊕ s̄    M_j ⊕ b = t ⊕ t̄    M_i ⊕ M_j ⊕ b = u ⊕ ū

and therefore if and only if

    M_i = t ⊕ t̄ ⊕ u ⊕ ū    M_j = s ⊕ s̄ ⊕ u ⊕ ū    b = s ⊕ s̄ ⊕ t ⊕ t̄ ⊕ u ⊕ ū

In all three cases, we have found unique values of M_i, M_j, and b such
that h(x) = s and h(y) = t and h(z) = u. Because these three vectors are uniform
and independent, we conclude that Pr[(h(x) = s) ∧ (h(y) = t) ∧ (h(z) = u)] = 1/m³,
as required. ∎

Rubric: Max 5 points extra credit. This is not the only correct proof, or even necessarily the
simplest proof.


2. (a) Prove that the item returned by GetOneSample(S) is chosen uniformly at random
from S.
Solution: Let P(i, n) denote the probability that GetOneSample(S) returns
the ith item in a given stream S of length n. We consider the behavior of the last
iteration of the algorithm. With probability 1/n, GetOneSample returns the
last item in the stream, and with probability (n − 1)/n, GetOneSample returns
the recursively computed sample of the first n − 1 elements of S. Thus, we have
the following recurrence for all positive integers i and n:

    P(i, n) = 0                          if i > n
    P(i, n) = 1/n                        if i = n
    P(i, n) = ((n − 1)/n) · P(i, n − 1)  if i < n

The recurrence includes the implicit base case P(i, 0) = 0 for all i. The induction
hypothesis implies that P(i, n − 1) = 1/(n − 1) for all i < n. It follows immediately
that P(i, n) = 1/n for all i ≤ n, as required. ∎

Rubric: 3 points.

Solution: See our solution to problem 3 (with k = 1). ∎

Rubric: Full credit if and only if (1) the proposed algorithm for problem 3 is actually equiv-
alent to GetOneSample when k = 1, (2) the proposed solution to problem 3 includes a
correct proof of uniformity, and (3) the proposed solution to problem 3 doesn't refer to
problem 2(a).
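Since parts (b)–(d) reason about the counter ℓ and line (⋆), here is a Python sketch of GetOneSample as the solutions use it (our rendering; the official pseudocode is in the handout): the ℓth item overwrites the current sample with probability 1/ℓ.

```python
import random
from collections import Counter

def get_one_sample(stream):
    """Reservoir sampling with k = 1: the ell-th item replaces the
    current sample with probability 1/ell."""
    sample = None
    ell = 0
    for x in stream:
        ell += 1
        if random.randrange(ell) == 0:  # true with probability 1/ell
            sample = x                   # line (*)
    return sample

# Part (a) in action: each of the 5 items appears about 20% of the time.
print(Counter(get_one_sample(range(5)) for _ in range(100_000)))
```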

(b) What is the exact expected number of times that GetOneSample(S) executes line (⋆)?

Solution: Let X_i be the indicator variable that equals 1 if line (⋆) is executed in
the ith iteration of the main loop; then X = Σ_i X_i is the total number of executions
of line (⋆). The algorithm immediately gives us E[X_i] = Pr[X_i = 1] = 1/i, so
linearity of expectation implies E[X] = Σ_{i=1}^{n} E[X_i] = Σ_{i=1}^{n} 1/i = H_n. ∎

Rubric: 2 points. 1 point for "O(log n)".

(c) What is the exact expected value of ℓ when GetOneSample(S) executes line (⋆) for
the last time?

Solution: Let ℓ* denote the value of ℓ when GetOneSample(S) executes line (⋆)
for the last time. Part (a) immediately implies that ℓ* is uniformly distributed
between 1 and n. It follows that E[ℓ*] = (n + 1)/2. ∎

Rubric: 2 points. 1 point for "O(n)".


(d) What is the exact expected value of ℓ when GetOneSample(S) executes line (⋆)
for the second time, or when the algorithm ends, whichever happens first?

Solution: Let ℓ₂ denote the value of ℓ when GetOneSample(S) executes
line (⋆) for the second time, or when the algorithm ends, whichever happens first. To
determine

    E[ℓ₂] = Σ_{i=1}^{n} Pr[ℓ₂ ≥ i],

we compute Pr[ℓ₂ ≥ i] for all i.
We immediately have Pr[ℓ₂ ≥ 1] = 1 and Pr[ℓ₂ ≥ i] = 0 for all i > n. For
any i such that 2 ≤ i ≤ n, we have ℓ₂ ≥ i if and only if line (⋆) is not executed in
iterations 2 through i − 1 of the main loop, so

    Pr[ℓ₂ ≥ i] = Π_{j=2}^{i−1} (j − 1)/j = 1/(i − 1).

We conclude that

    E[ℓ₂] = Σ_{i=1}^{n} Pr[ℓ₂ ≥ i] = 1 + Σ_{i=2}^{n} 1/(i − 1) = H_{n−1} + 1. ∎

Rubric: 3 points. This is not the only correct proof. 2 points for writing "O(log n)"; 1 point
for an exact answer that is incorrect but still O(log n).
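All three expectations are easy to sanity-check by simulation. The sketch below (ours, with an arbitrary n = 100) replays GetOneSample's random choices many times and compares the average number of executions of line (⋆), the average final value ℓ*, and the average ℓ₂ against H_n, (n + 1)/2, and H_{n−1} + 1:

```python
import random

def one_run(n):
    """Simulate the coin flips of GetOneSample on a stream of length n.
    Returns (executions of line (*), ell at the last execution,
    ell at the second execution, or n if there is no second one)."""
    executions, last, second = 0, 0, n
    for ell in range(1, n + 1):
        if random.randrange(ell) == 0:   # line (*) fires with prob. 1/ell
            executions += 1
            last = ell
            if executions == 2:
                second = ell
    return executions, last, second

n, runs = 100, 100_000
totals = [0.0, 0.0, 0.0]
for _ in range(runs):
    for slot, value in enumerate(one_run(n)):
        totals[slot] += value

def harmonic(k):
    return sum(1.0 / i for i in range(1, k + 1))

print([t / runs for t in totals])                        # observed averages
print([harmonic(n), (n + 1) / 2, harmonic(n - 1) + 1])   # parts (b), (c), (d)
```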


3. (This is a continuation of the previous problem.) Describe and analyze an algorithm that
returns a subset of k distinct items chosen uniformly at random from a data stream of
length at least k. Prove your algorithm is correct.

Solution (O(k) time per element, 8/10): Intuitively, we can chain together k inde-
pendent instances of GetOneSample, where the items rejected by each instance are
treated as the input stream for the next instance. Folding these k instances together
gives us the following algorithm, which runs in O(kn) time. The algorithm maintains
a single array sample[1 .. k].

    GetSamples(S, k):
        ℓ ← 0
        while S is not done
            x ← next item in S
            ℓ ← ℓ + 1
            for j ← 1 to min{k, ℓ}
                if Random(ℓ − j + 1) = 1
                    swap x ↔ sample[j]
        return sample[1 .. k]

Correctness follows inductively from the correctness of GetOneSample as follows.
The analysis in part (a) implies that sample[1] is equally likely to be any element
of S. Let S′[1 .. n − 1] be the sequence of items rejected by the first instance (the one
maintaining sample[1]); at each iteration ℓ ≥ 2, item S′[ℓ − 1] is either S[ℓ] (with
probability 1 − 1/ℓ) or the previous value of sample[1] (with probability 1/ℓ). The
output array sample[2 .. k] has the same distribution as the output array from
GetSamples(S′, k − 1). Thus, the inductive hypothesis implies that sample[2 .. k]
contains k − 1 distinct items chosen uniformly at random from S \ {sample[1]}. We
conclude that sample[1 .. k] contains k distinct items chosen uniformly at random
from S, as required. ∎
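A direct Python transcription of GetSamples, assuming Random(r) returns a uniform integer between 1 and r:

```python
import random

def get_samples(stream, k):
    """Chained reservoir sampling: instance j processes the items
    rejected by instances 1..j-1, so each element costs O(k) time."""
    sample = [None] * k
    ell = 0
    for x in stream:
        ell += 1
        for j in range(1, min(k, ell) + 1):
            if random.randint(1, ell - j + 1) == 1:  # Random(ell - j + 1) = 1
                sample[j - 1], x = x, sample[j - 1]  # swap x <-> sample[j]
    return sample

print(sorted(get_samples(range(100), 5)))  # 5 distinct uniform samples
```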

Solution (O(1) time per element, 10/10): To develop the algorithm, let's start with
a standard algorithm for a different problem. Recall from the lecture notes that the
Fisher-Yates algorithm randomly permutes an arbitrary array. (This algorithm is
called InsertionShuffle in the notes.)

    FisherYates(S[1 .. n]):
        for ℓ ← 1 to n
            j ← Random(ℓ)
            swap S[j] ↔ S[ℓ]
        return S

We can prove that each of the n! possible permutations of the input array has the
same probability of being output by FisherYates, by combining two observations:

• First, there are exactly n! possible outcomes of all the calls to Random. Specifi-
cally, the ℓth call to Random has ℓ possible outcomes, so the total number of
outcomes is exactly Π_{ℓ=1}^{n} ℓ = n!.

• Second, every permutation of S can be computed by FisherYates. Specifically,
the location of S[n] in the output array gives us a unique outcome for the last
call to Random, and (after undoing the last swap) the induction hypothesis
implies that there is a sequence of Random outcomes that produces any desired
permutation of S[1 .. n − 1]. (The base case n = 0 is trivial.)

In particular, each subset of k distinct elements of S has the same probability of
appearing in the prefix S[1 .. k] of the output array.
Now suppose we modify the Fisher-Yates algorithm in two different ways. First,
let the input be given in a stream rather than an explicit array; in particular, we do
not know the value of n in advance. Second, we maintain only the first k elements of
the output array.

    FYSample(S, k):
        ℓ ← 0
        while S is not done
            x ← next item in S
            ℓ ← ℓ + 1
            j ← Random(ℓ)
            if ℓ ≤ k
                sample[ℓ] ← x
                swap sample[j] ↔ sample[ℓ]
            else if j ≤ k
                sample[j] ← x
        return sample[1 .. k]

The k elements produced by this modified algorithm have exactly the same
distribution as the first k elements of the output of FisherYates. Thus, FYSample
chooses a k-element subset of the stream uniformly at random, as required. The
algorithm runs in O(n) time, using only O(k) space. ∎
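Here is a Python sketch of FYSample together with a quick empirical check: for a stream of 5 items and k = 2, each of the C(5, 2) = 10 possible 2-element subsets should appear about a tenth of the time.

```python
import random
from collections import Counter

def fy_sample(stream, k):
    """Streaming Fisher-Yates shuffle, keeping only the first k
    entries of the (virtual) shuffled array."""
    sample = [None] * k
    ell = 0
    for x in stream:
        ell += 1
        j = random.randint(1, ell)  # j <- Random(ell)
        if ell <= k:
            # Place x at position ell, then swap positions j and ell.
            sample[ell - 1] = x
            sample[j - 1], sample[ell - 1] = sample[ell - 1], sample[j - 1]
        elif j <= k:
            # Position ell is discarded; x lands at kept position j.
            sample[j - 1] = x
    return sample

counts = Counter(frozenset(fy_sample(range(5), 2)) for _ in range(100_000))
print(sorted(counts.values()))   # ten counts, each near 10,000
```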

Rubric: 10 points = 5 for the algorithm + 5 for the proof. Max 8 points for O(kn)-time algorithm;
scale partial credit. These are neither the only correct algorithms nor the only correct proofs for
these algorithms.
