
Lecture 2: Estimating Frequencies & Frequency Moments
Estimating Frequencies and Related Statistics
• Stream: $\sigma = a_1, \dots, a_m \in [n]^m$
• Frequency vector: $f = (f_1, \dots, f_n)$, where $f_i$ = number of times $i$ appears in $\sigma$
• Frequency query: given $i \in [n]$, return $f_i$
• Frequency moments: compute $F_k = \sum_{i=1}^{n} f_i^k$
• Heavy hitters: which elements appear most frequently in $\sigma$?
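As a concrete baseline, the quantities defined above can all be computed exactly from the frequency vector; a toy Python sketch (names are illustrative, not part of the lecture):

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum_i f_i^k, computed from the frequency vector f."""
    f = Counter(stream)                # f[i] = number of times i appears in sigma
    return sum(c ** k for c in f.values())

stream = [1, 2, 2, 3, 3, 3]            # f = (1, 2, 3)
print(frequency_moment(stream, 0))     # F_0 = 3 (number of distinct elements)
print(frequency_moment(stream, 1))     # F_1 = 6 (stream length m)
print(frequency_moment(stream, 2))     # F_2 = 1 + 4 + 9 = 14
```

The point of the lecture is that this exact computation needs $\Theta(n)$ space; the sketches below use far less.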
Today
• Distinct elements ($F_0$): KMV algorithm
• Frequency queries: CountMin and CountSketch
• Estimating 𝐹2 : tug-of-war sketch
KMV Algorithm for $F_0$
• Idealized version: choose $h : [n] \to [0,1]$ uniformly at random
• Consider $\{h(i) : f_i > 0\}$:
  • What do we expect the smallest value to be?
  • What do we expect the $k$-th smallest value to be?
Real Version of KMV
• Choose pairwise-independent $h : [n] \to \{1, \dots, M\}$
• Store the $k$ smallest values of $h(i)$ seen
• Output: let $v$ be the $k$-th smallest value; return $kM/v$
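A minimal sketch of the real version, assuming a random linear polynomial mod a prime as the pairwise-independent hash (all names and parameter choices here are illustrative):

```python
import random

def kmv_estimate(stream, k, M=2**31 - 2, seed=None):
    """KMV sketch of F_0: hash each item with a pairwise-independent
    h: items -> {1, ..., M}, keep the k smallest distinct hash values, and
    return k*M / (k-th smallest). Below k distinct values, count exactly."""
    rng = random.Random(seed)
    p = 2**31 - 1                               # prime modulus for the linear hash
    a, b = rng.randrange(1, p), rng.randrange(p)
    h = lambda x: (a * x + b) % p % M + 1       # h: items -> {1, ..., M}
    values = sorted({h(x) for x in stream})     # distinct hash values, ascending
    if len(values) < k:
        return len(values)                      # fewer than k distinct elements
    return k * M / values[k - 1]
```

(A streaming implementation would maintain the $k$ smallest values in a heap rather than materialize the whole set; this version just shows the estimator.)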
Intuition: “Expected Behavior”
• Output: $kM / (k\text{-th smallest value})$
• We hope: $k$-th smallest value $\approx kM/F_0$
• $X_i$ = indicator for “$h(i) \le kM/F_0$”
• $\Pr[X_i = 1] \approx k/F_0$
• $\mathrm{E}\left[\sum_{i : f_i > 0} X_i\right] \approx k$
Analysis: Upper Tail
Analysis: Lower Tail
Space Complexity
CountMin & CountSketch
• Basic idea: hash $h : [n] \to [k]$, with $k \ll n$ cells
• “Insert into cell $j \in [k]$” = ?
• How to query $f_i$?
(figure: stream $\sigma = a_1, \dots, a_i, \dots$ hashed into $k \ll n$ cells)
CountMin & CountSketch
Estimate quality comparison ($f_{-i}$ = frequency vector restricted to $[n] \setminus \{i\}$):
• CountMin: w.h.p., $f_i \le \hat{f}_i \le f_i + \epsilon \|f_{-i}\|_1$
• CountSketch: w.h.p., $|\hat{f}_i - f_i| \le \epsilon \|f_{-i}\|_2$
Since $\|\cdot\|_1 \ge \|\cdot\|_2$, CountSketch gives a better guarantee…
…but it takes more space.
CountMin: $f_i \le \hat{f}_i \le f_i + \epsilon \|f_{-i}\|_1$
(figure: stream $\sigma = a_1, \dots, a_i, \dots$ hashed by $h : [n] \to [k]$ into $k \ll n$ cells; each arrival adds $+1$ to its cell)
• At the end: $C[j] = \sum_{i : h(i) = j} f_i$
• Query $i$: return $\hat{f}_i = C[h(i)]$
Question: how big should $k$ be for $\mathrm{E}[\hat{f}_i]$ to be “good”?
CountMin: $f_i \le \hat{f}_i \le f_i + \epsilon \|f_{-i}\|_1$
• At the end: $C[j] = \sum_{i : h(i) = j} f_i$
• Query $i$: return $\hat{f}_i = C[h(i)]$
• With $k = 2/\epsilon$: $\mathrm{E}[\hat{f}_i - f_i] = \|f_{-i}\|_1 / k = (\epsilon/2) \|f_{-i}\|_1$
• By Markov: $\Pr[\hat{f}_i - f_i > \epsilon \|f_{-i}\|_1] < 1/2$
Estimate Quality
• Single estimate $\hat{f}_i$:
  • Always: $\hat{f}_i \ge f_i$
  • W.p. ½: $\hat{f}_i \le f_i + \epsilon \|f_{-i}\|_1$
• To amplify the success probability:
  • Compute independent estimates $\hat{f}_i^1, \dots, \hat{f}_i^\ell$
  • Output: $\hat{f}_i = \min\{\hat{f}_i^1, \dots, \hat{f}_i^\ell\}$
  • $\Pr[\hat{f}_i - f_i > \epsilon \|f_{-i}\|_1] < (1/2)^\ell \Rightarrow \ell = \log_2(1/\delta)$
CountMin: $f_i \le \hat{f}_i \le f_i + \epsilon \|f_{-i}\|_1$
(figure: $\ell = \log(1/\delta)$ independent rows $h_1, h_2, \dots, h_\ell : [n] \to [k]$, each with $k = 2/\epsilon$ cells; each arrival adds $+1$ in every row)
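The full CountMin structure can be sketched in a few lines of Python, using the same random linear hashes as before (an illustrative toy implementation, not production code):

```python
import random

class CountMin:
    """CountMin sketch, following the slides: ell = log(1/delta) rows,
    each with k = 2/eps counters; query returns the minimum over rows."""
    def __init__(self, k, ell, seed=None):
        rng = random.Random(seed)
        self.k, self.ell, self.p = k, ell, 2**31 - 1
        self.rows = [(rng.randrange(1, self.p), rng.randrange(self.p))
                     for _ in range(ell)]       # one pairwise-independent h per row
        self.C = [[0] * k for _ in range(ell)]

    def _h(self, row, x):
        a, b = self.rows[row]
        return (a * x + b) % self.p % self.k

    def update(self, x):                        # process one stream element
        for row in range(self.ell):
            self.C[row][self._h(row, x)] += 1

    def query(self, x):                         # f_x <= query(x) always;
        return min(self.C[row][self._h(row, x)] # w.h.p. <= f_x + eps*||f_-x||_1
                   for row in range(self.ell))
```

Note the one-sided error: every row overcounts, so taking the minimum over rows can only help.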
CountSketch
(figure: $h : [n] \to [k]$ and signs $r : [n] \to \{-1,+1\}$; each arrival adds $\pm 1$ to its cell, $k \ll n$ cells)
• At the end: $C[j] = \sum_{i : h(i) = j} r(i) f_i$
• Question: what is $\mathrm{E}[C[h(i)]]$, and how does it relate to $f_i$?
  • $\mathrm{E}[C[h(i)]] = r(i) \cdot f_i$ (each cross term has $\mathrm{E}[r(j)] = 0$)
• Return $\hat{f}_i = r(i) \cdot C[h(i)] = f_i + r(i) \sum_{j \ne i : h(j) = h(i)} r(j) f_j$
CountSketch
Estimate: $\hat{f}_i = f_i + r(i) \sum_{j \ne i : h(j) = h(i)} r(j) f_j$
• Expectation: $\mathrm{E}[\hat{f}_i] = f_i$
• Variance:
  • Let $B_j$ = indicator for “$h(i) = h(j)$”
  • $\mathrm{E}[B_j] = 1/k$
  • $\mathrm{Var}[\hat{f}_i] = \mathrm{E}[(\hat{f}_i - f_i)^2] = ?$
CountSketch
Estimate: $\hat{f}_i = f_i + r(i) \sum_{j \ne i : h(j) = h(i)} r(j) f_j$
• Expectation: $\mathrm{E}[\hat{f}_i] = f_i$
• Variance: $\mathrm{Var}[\hat{f}_i] = \frac{1}{k} \|f_{-i}\|_2^2$
• By Chebyshev: $\Pr\left[|\hat{f}_i - f_i| > \frac{2\|f_{-i}\|_2}{\sqrt{k}}\right] < \frac{1}{4}$
• Want $|\hat{f}_i - f_i| \le \epsilon \|f_{-i}\|_2$: set $k = \Theta(1/\epsilon^2)$
CountSketch: $|\hat{f}_i - f_i| \le \epsilon \|f_{-i}\|_2$
(figure: $\ell = \log(1/\delta)$ independent rows $h_1, h_2, \dots, h_\ell : [n] \to [k]$ with signs $r_1, r_2, \dots, r_\ell : [n] \to \{-1,+1\}$; each arrival adds $\pm 1$ in every row, $k = \Theta(1/\epsilon^2)$ cells per row; the rows are combined with a median)
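CountSketch differs from CountMin only in the random signs and in combining rows by median (the error is two-sided, so a minimum no longer works). An illustrative toy implementation, with the same caveats as before:

```python
import random
import statistics

class CountSketch:
    """CountSketch: each row hashes item i to a cell and adds the sign r(i);
    a row estimates f_i as r(i) * C[h(i)]; rows are combined with a median."""
    def __init__(self, k, ell, seed=None):
        rng = random.Random(seed)
        self.k, self.ell, self.p = k, ell, 2**31 - 1
        self.hs = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(ell)]
        self.rs = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(ell)]
        self.C = [[0] * k for _ in range(ell)]

    def _h(self, row, x):                       # cell index in [k]
        a, b = self.hs[row]
        return (a * x + b) % self.p % self.k

    def _r(self, row, x):                       # sign in {-1, +1}
        a, b = self.rs[row]
        return 1 if (a * x + b) % self.p % 2 == 0 else -1

    def update(self, x):
        for row in range(self.ell):
            self.C[row][self._h(row, x)] += self._r(row, x)

    def query(self, x):
        return statistics.median(self._r(row, x) * self.C[row][self._h(row, x)]
                                 for row in range(self.ell))
```

Use an odd number of rows $\ell$ so the median is one of the row estimates.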
Linear Sketching
Random linear mapping $M \in \mathbb{R}^{s \times n}$: the sketch of a stream $\sigma$ with frequency vector $f$ is $Mf$, and the answer is computed from $Mf$ alone.
CountMin as a Linear Sketch
What is $M$?
• Row $j$ of $M$ has a $1$ in column $i$ iff $h(i) = j$ (e.g., a row might look like 1 0 0 1 1 0 0 0 0 1)
• Cell $j$ of the sketch: $(Mf)_j = \sum_{i : h(i) = j} f_i$
CountSketch as a Linear Sketch
What is $M$?
• With signs $r : [n] \to \{-1,+1\}$: row $j$ of $M$ has entry $r(i)$ in column $i$ iff $h(i) = j$ (e.g., -1 0 0 +1 -1 0 0 0 0)
• Cell $j$ of the sketch: $(Mf)_j = \sum_{i : h(i) = j} r(i) f_i$
Linear Sketches
• Easy to merge: $M(\sigma_1 + \sigma_2) = M\sigma_1 + M\sigma_2$
• Nice for distributed settings: each site computes its own sketch $M\sigma_1$, $M\sigma_2$; adding the sketches gives the sketch of the combined stream
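The merge property is easy to check concretely. A toy Python sketch treating one CountMin row as an explicit linear map (names are illustrative):

```python
import random

def sketch(stream, h, k):
    """One CountMin row viewed as a linear map: sketch(sigma)_j = sum_{i: h(i)=j} f_i."""
    C = [0] * k
    for x in stream:
        C[h(x)] += 1
    return C

rng = random.Random(1)
k, p = 8, 2**31 - 1
a, b = rng.randrange(1, p), rng.randrange(p)
h = lambda x: (a * x + b) % p % k

s1, s2 = [1, 2, 2, 5], [2, 5, 5, 9]
merged = [u + v for u, v in zip(sketch(s1, h, k), sketch(s2, h, k))]
assert merged == sketch(s1 + s2, h, k)    # M(sigma1 + sigma2) = M*sigma1 + M*sigma2
```

The only requirement is that both sites share the same randomness (the same $h$).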
Linear Sketches

“Dimensionality reduction”: embed $v_1, \dots, v_k \in \mathbb{R}^N$ into $\mathbb{R}^n$, $n \ll N$, while preserving essential properties
- distances (Johnson–Lindenstrauss Lemma)
- ...
Break
Estimating Higher Frequency Moments
Higher Frequency Moments
• Reminder: $F_k = \sum_{i \in [n]} f_i^k$
• Higher $k$ ⇒ weighted towards the highest $f_i$
• $F_2$ captures the variance (skew) of $f$
• Higher moments have applications in databases (e.g., $F_2$ is the size of a self-join)
Estimating 𝐹2 : the Tug-Of-War Sketch
Tug-of-War Sketch
• Choose $h : [n] \to \{-1,+1\}$ from a 4-wise independent family
• $x \leftarrow 0$
• Process $a_i \in [n]$: $x \leftarrow x + h(a_i)$
• Output $x^2$

Why is this unbiased? Using $\mathrm{E}[h(i)h(j)] = 0$ for $i \ne j$ and $= 1$ for $i = j$:
$$\mathrm{E}[x^2] = \mathrm{E}\left[\left(\sum_{i \in [n]} h(i) f_i\right)^2\right] = \mathrm{E}\left[\sum_{i,j \in [n]} h(i) h(j) f_i f_j\right] = \|f\|_2^2 = F_2$$
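A single tug-of-war estimator fits in one counter. A toy Python sketch, assuming a random cubic polynomial mod a prime as the 4-wise independent family (an illustrative construction):

```python
import random

def tug_of_war(stream, seed=None):
    """One tug-of-war estimator of F_2: keep x = sum_i h(i) * f_i with signs
    h(i) in {-1,+1} drawn from a 4-wise independent family (here: the parity
    of a random degree-3 polynomial mod a prime)."""
    rng = random.Random(seed)
    p = 2**31 - 1
    coeffs = [rng.randrange(p) for _ in range(4)]   # degree-3 polynomial => 4-wise
    def h(i):
        v = 0
        for c in coeffs:                            # Horner evaluation mod p
            v = (v * i + c) % p
        return 1 if v % 2 == 0 else -1
    x = 0
    for a in stream:
        x += h(a)                                   # one counter, updated per element
    return x * x                                    # E[x^2] = F_2
```

A single copy has high variance; averaging independent copies (next slides) drives the error down.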
Tug-of-War Sketch
• $\mathrm{Var}[x^2] \le 2 F_2^2$ (4-wise independence controls the cross terms)
• By Chebyshev: $\Pr[|x^2 - F_2| > c \cdot F_2] \le 2/c^2$
• We want: $\Pr[|x^2 - F_2| > \epsilon \cdot F_2] \le \delta$ — how?
Median-of-Means Trick
Given an unbiased estimator $\hat{z}$ with $\mathrm{E}[\hat{z}] = z$, Chebyshev gives $\Pr[|\hat{z} - z| > c \cdot \mathrm{std}] \le 1/c^2$.

Step 1: take the mean of $\Theta(N)$ copies.
Reminder: sum of $N$ copies ⇒ expectation $\sim \Theta(N)$, std $\sim \Theta(\sqrt{N})$;
mean of $N$ copies ⇒ expectation $\sim \Theta(1)$, std $\sim \Theta(1/\sqrt{N})$.
To get an $\epsilon$-approximation: $N = \Theta(1/\epsilon^2)$.

Step 2: take the median of $\Theta(\log(1/\delta))$ means.

Space: $O\left(\frac{\log n \cdot \log(1/\delta)}{\epsilon^2}\right)$
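The two steps above can be sketched generically; a toy Python demo on a synthetic noisy estimator (the estimator and its parameters are made up for illustration):

```python
import random
import statistics

def median_of_means(samples, groups):
    """Median-of-means: split the samples into groups, average each group,
    then return the median of the group means."""
    m = len(samples) // groups
    means = [sum(samples[g * m:(g + 1) * m]) / m for g in range(groups)]
    return statistics.median(means)

# toy demo: a noisy unbiased estimator of z = 5 with std 10
rng = random.Random(0)
samples = [rng.gauss(5, 10) for _ in range(9 * 100)]
print(median_of_means(samples, 9))   # close to 5
```

Each group mean concentrates by Chebyshev; the median then boosts the constant success probability to $1 - \delta$ with only $\Theta(\log(1/\delta))$ groups.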
