
Lecture 2: Estimating Frequencies & Frequency Moments
Estimating Frequencies and Related Statistics
• Stream: $\sigma = a_1, \dots, a_m \in [n]^m$
• Frequency vector: $f = (f_1, \dots, f_n)$, where $f_i$ = number of times $i$ appears in $\sigma$
• Frequency query: given $i \in [n]$, return $f_i$
• Frequency moments: compute $F_k = \sum_{i=1}^{n} f_i^k$
• Heavy hitters: which elements appear most frequently in $\sigma$?
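As a concrete baseline, the quantities defined above can all be computed exactly from the frequency vector; a toy Python sketch (names are illustrative, not part of the lecture):

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum_i f_i^k, computed from the frequency vector f."""
    f = Counter(stream)                # f[i] = number of times i appears in sigma
    return sum(c ** k for c in f.values())

stream = [1, 2, 2, 3, 3, 3]            # f = (1, 2, 3)
print(frequency_moment(stream, 0))     # F_0 = 3 (number of distinct elements)
print(frequency_moment(stream, 1))     # F_1 = 6 (stream length m)
print(frequency_moment(stream, 2))     # F_2 = 1 + 4 + 9 = 14
```

The point of the lecture is that this exact computation needs $\Theta(n)$ space; the sketches below use far less.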
Today
• Distinct elements ($F_0$): KMV algorithm
• Frequency queries: CountMin and CountSketch
• Estimating 𝐹2 : tug-of-war sketch
KMV Algorithm for $F_0$
• Idealized version: choose $h : [n] \to [0,1]$ uniformly at random
• Consider $\{h(i) : f_i > 0\}$:
  • What do we expect the smallest value to be?
  • What do we expect the $k$-th smallest value to be?
Real Version of KMV
• Choose pairwise-independent $h : [n] \to \{1, \dots, M\}$
• Store the $k$ smallest values of $h(i)$ seen
• Output: let $v$ be the $k$-th smallest value; return $kM/v$
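A minimal sketch of the real version, assuming a random linear polynomial mod a prime as the pairwise-independent hash (all names and parameter choices here are illustrative):

```python
import random

def kmv_estimate(stream, k, M=2**31 - 2, seed=None):
    """KMV sketch of F_0: hash each item with a pairwise-independent
    h: items -> {1, ..., M}, keep the k smallest distinct hash values, and
    return k*M / (k-th smallest). Below k distinct values, count exactly."""
    rng = random.Random(seed)
    p = 2**31 - 1                               # prime modulus for the linear hash
    a, b = rng.randrange(1, p), rng.randrange(p)
    h = lambda x: (a * x + b) % p % M + 1       # h: items -> {1, ..., M}
    values = sorted({h(x) for x in stream})     # distinct hash values, ascending
    if len(values) < k:
        return len(values)                      # fewer than k distinct elements
    return k * M / values[k - 1]
```

(A streaming implementation would maintain the $k$ smallest values in a heap rather than materialize the whole set; this version just shows the estimator.)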
Intuition: “Expected Behavior”
• Output: $kM / (k\text{-th smallest value})$
• We hope: $k$-th smallest value $\approx kM/F_0$
• $X_i$ = indicator for “$h(i) \le kM/F_0$”
• $\Pr[X_i = 1] \approx k/F_0$
• $\mathrm{E}\left[\sum_{i : f_i > 0} X_i\right] \approx k$
Analysis: Upper Tail
Analysis: Lower Tail
Space Complexity
CountMin & CountSketch
• Basic idea: hash $h : [n] \to [k]$, with $k \ll n$ cells
• “Insert into cell $j \in [k]$” = ?
• How to query $f_i$?
(figure: stream $\sigma = a_1, \dots, a_i, \dots$ hashed into $k \ll n$ cells)
CountMin & CountSketch
Estimate quality comparison ($f_{-i}$ = frequency vector restricted to $[n] \setminus \{i\}$):
• CountMin: w.h.p., $f_i \le \hat{f}_i \le f_i + \epsilon \|f_{-i}\|_1$
• CountSketch: w.h.p., $|\hat{f}_i - f_i| \le \epsilon \|f_{-i}\|_2$
Since $\|\cdot\|_1 \ge \|\cdot\|_2$, CountSketch gives a better guarantee…
…but it takes more space.
CountMin: $f_i \le \hat{f}_i \le f_i + \epsilon \|f_{-i}\|_1$
(figure: stream $\sigma = a_1, \dots, a_i, \dots$ hashed by $h : [n] \to [k]$ into $k \ll n$ cells; each arrival adds $+1$ to its cell)
• At the end: $C[j] = \sum_{i : h(i) = j} f_i$
• Query $i$: return $\hat{f}_i = C[h(i)]$
Question: how big should $k$ be for $\mathrm{E}[\hat{f}_i]$ to be “good”?
CountMin: $f_i \le \hat{f}_i \le f_i + \epsilon \|f_{-i}\|_1$
• At the end: $C[j] = \sum_{i : h(i) = j} f_i$
• Query $i$: return $\hat{f}_i = C[h(i)]$
• With $k = 2/\epsilon$: $\mathrm{E}[\hat{f}_i - f_i] = \|f_{-i}\|_1 / k = (\epsilon/2) \|f_{-i}\|_1$
• By Markov: $\Pr[\hat{f}_i - f_i > \epsilon \|f_{-i}\|_1] < 1/2$
Estimate Quality
• Single estimate $\hat{f}_i$:
  • Always: $\hat{f}_i \ge f_i$
  • W.p. ½: $\hat{f}_i \le f_i + \epsilon \|f_{-i}\|_1$
• To amplify the success probability:
  • Compute independent estimates $\hat{f}_i^1, \dots, \hat{f}_i^\ell$
  • Output: $\hat{f}_i = \min\{\hat{f}_i^1, \dots, \hat{f}_i^\ell\}$
  • $\Pr[\hat{f}_i - f_i > \epsilon \|f_{-i}\|_1] < (1/2)^\ell \Rightarrow \ell = \log_2(1/\delta)$
CountMin: $f_i \le \hat{f}_i \le f_i + \epsilon \|f_{-i}\|_1$
(figure: $\ell = \log(1/\delta)$ independent rows $h_1, h_2, \dots, h_\ell : [n] \to [k]$, each with $k = 2/\epsilon$ cells; each arrival adds $+1$ in every row)
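The full CountMin structure can be sketched in a few lines of Python, using the same random linear hashes as before (an illustrative toy implementation, not production code):

```python
import random

class CountMin:
    """CountMin sketch, following the slides: ell = log(1/delta) rows,
    each with k = 2/eps counters; query returns the minimum over rows."""
    def __init__(self, k, ell, seed=None):
        rng = random.Random(seed)
        self.k, self.ell, self.p = k, ell, 2**31 - 1
        self.rows = [(rng.randrange(1, self.p), rng.randrange(self.p))
                     for _ in range(ell)]       # one pairwise-independent h per row
        self.C = [[0] * k for _ in range(ell)]

    def _h(self, row, x):
        a, b = self.rows[row]
        return (a * x + b) % self.p % self.k

    def update(self, x):                        # process one stream element
        for row in range(self.ell):
            self.C[row][self._h(row, x)] += 1

    def query(self, x):                         # f_x <= query(x) always;
        return min(self.C[row][self._h(row, x)] # w.h.p. <= f_x + eps*||f_-x||_1
                   for row in range(self.ell))
```

Note the one-sided error: every row overcounts, so taking the minimum over rows can only help.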
CountSketch
(figure: $h : [n] \to [k]$ and signs $r : [n] \to \{-1,+1\}$; each arrival adds $\pm 1$ to its cell, $k \ll n$ cells)
• At the end: $C[j] = \sum_{i : h(i) = j} r(i) f_i$
• Question: what is $\mathrm{E}[C[h(i)]]$, and how does it relate to $f_i$?
  • $\mathrm{E}[C[h(i)]] = r(i) \cdot f_i$ (each cross term has $\mathrm{E}[r(j)] = 0$)
• Return $\hat{f}_i = r(i) \cdot C[h(i)] = f_i + r(i) \sum_{j \ne i : h(j) = h(i)} r(j) f_j$
CountSketch
Estimate: $\hat{f}_i = f_i + r(i) \sum_{j \ne i : h(j) = h(i)} r(j) f_j$
• Expectation: $\mathrm{E}[\hat{f}_i] = f_i$
• Variance:
  • Let $B_j$ = indicator for “$h(i) = h(j)$”
  • $\mathrm{E}[B_j] = 1/k$
  • $\mathrm{Var}[\hat{f}_i] = \mathrm{E}[(\hat{f}_i - f_i)^2] = ?$
CountSketch
Estimate: $\hat{f}_i = f_i + r(i) \sum_{j \ne i : h(j) = h(i)} r(j) f_j$
• Expectation: $\mathrm{E}[\hat{f}_i] = f_i$
• Variance: $\mathrm{Var}[\hat{f}_i] = \frac{1}{k} \|f_{-i}\|_2^2$
• By Chebyshev: $\Pr\left[|\hat{f}_i - f_i| > \frac{2\|f_{-i}\|_2}{\sqrt{k}}\right] < \frac{1}{4}$
• Want $|\hat{f}_i - f_i| \le \epsilon \|f_{-i}\|_2$: set $k = \Theta(1/\epsilon^2)$
CountSketch: $|\hat{f}_i - f_i| \le \epsilon \|f_{-i}\|_2$
(figure: $\ell = \log(1/\delta)$ independent rows $h_1, h_2, \dots, h_\ell : [n] \to [k]$ with signs $r_1, r_2, \dots, r_\ell : [n] \to \{-1,+1\}$; each arrival adds $\pm 1$ in every row, $k = \Theta(1/\epsilon^2)$ cells per row; the rows are combined with a median)
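CountSketch differs from CountMin only in the random signs and in combining rows by median (the error is two-sided, so a minimum no longer works). An illustrative toy implementation, with the same caveats as before:

```python
import random
import statistics

class CountSketch:
    """CountSketch: each row hashes item i to a cell and adds the sign r(i);
    a row estimates f_i as r(i) * C[h(i)]; rows are combined with a median."""
    def __init__(self, k, ell, seed=None):
        rng = random.Random(seed)
        self.k, self.ell, self.p = k, ell, 2**31 - 1
        self.hs = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(ell)]
        self.rs = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(ell)]
        self.C = [[0] * k for _ in range(ell)]

    def _h(self, row, x):                       # cell index in [k]
        a, b = self.hs[row]
        return (a * x + b) % self.p % self.k

    def _r(self, row, x):                       # sign in {-1, +1}
        a, b = self.rs[row]
        return 1 if (a * x + b) % self.p % 2 == 0 else -1

    def update(self, x):
        for row in range(self.ell):
            self.C[row][self._h(row, x)] += self._r(row, x)

    def query(self, x):
        return statistics.median(self._r(row, x) * self.C[row][self._h(row, x)]
                                 for row in range(self.ell))
```

Use an odd number of rows $\ell$ so the median is one of the row estimates.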
Linear Sketching
Random linear mapping $M \in \mathbb{R}^{s \times n}$: the sketch of a stream $\sigma$ with frequency vector $f$ is $Mf$, and the answer is computed from $Mf$ alone.
CountMin as a Linear Sketch
What is $M$?
• Row $j$ of $M$ has a $1$ in column $i$ iff $h(i) = j$ (e.g., a row might look like 1 0 0 1 1 0 0 0 0 1)
• Cell $j$ of the sketch: $(Mf)_j = \sum_{i : h(i) = j} f_i$
CountSketch as a Linear Sketch
What is $M$?
• With signs $r : [n] \to \{-1,+1\}$: row $j$ of $M$ has entry $r(i)$ in column $i$ iff $h(i) = j$ (e.g., -1 0 0 +1 -1 0 0 0 0)
• Cell $j$ of the sketch: $(Mf)_j = \sum_{i : h(i) = j} r(i) f_i$
Linear Sketches
• Easy to merge: $M(\sigma_1 + \sigma_2) = M\sigma_1 + M\sigma_2$
• Nice for distributed settings: each site computes its own sketch $M\sigma_1$, $M\sigma_2$; adding the sketches gives the sketch of the combined stream
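The merge property is easy to check concretely. A toy Python sketch treating one CountMin row as an explicit linear map (names are illustrative):

```python
import random

def sketch(stream, h, k):
    """One CountMin row viewed as a linear map: sketch(sigma)_j = sum_{i: h(i)=j} f_i."""
    C = [0] * k
    for x in stream:
        C[h(x)] += 1
    return C

rng = random.Random(1)
k, p = 8, 2**31 - 1
a, b = rng.randrange(1, p), rng.randrange(p)
h = lambda x: (a * x + b) % p % k

s1, s2 = [1, 2, 2, 5], [2, 5, 5, 9]
merged = [u + v for u, v in zip(sketch(s1, h, k), sketch(s2, h, k))]
assert merged == sketch(s1 + s2, h, k)    # M(sigma1 + sigma2) = M*sigma1 + M*sigma2
```

The only requirement is that both sites share the same randomness (the same $h$).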
Linear Sketches

“Dimensionality reduction”: embed $v_1, \dots, v_k \in \mathbb{R}^N$ into $\mathbb{R}^n$, $n \ll N$, while preserving essential properties
- distances (Johnson–Lindenstrauss Lemma)
- ...
Break
Estimating Higher Frequency Moments
Higher Frequency Moments
• Reminder: $F_k = \sum_{i \in [n]} f_i^k$
• Higher $k$ ⇒ weighted towards the highest $f_i$
• $F_2$ captures the variance (skew) of $f$
• Higher moments have applications in databases (e.g., $F_2$ is the size of a self-join)
Estimating 𝐹2 : the Tug-Of-War Sketch
Tug-of-War Sketch
• Choose $h : [n] \to \{-1,+1\}$ from a 4-wise independent family
• $x \leftarrow 0$
• Process $a_i \in [n]$: $x \leftarrow x + h(a_i)$
• Output $x^2$

Why is this unbiased? Using $\mathrm{E}[h(i)h(j)] = 0$ for $i \ne j$ and $= 1$ for $i = j$:
$$\mathrm{E}[x^2] = \mathrm{E}\left[\left(\sum_{i \in [n]} h(i) f_i\right)^2\right] = \mathrm{E}\left[\sum_{i,j \in [n]} h(i) h(j) f_i f_j\right] = \|f\|_2^2 = F_2$$
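A single tug-of-war estimator fits in one counter. A toy Python sketch, assuming a random cubic polynomial mod a prime as the 4-wise independent family (an illustrative construction):

```python
import random

def tug_of_war(stream, seed=None):
    """One tug-of-war estimator of F_2: keep x = sum_i h(i) * f_i with signs
    h(i) in {-1,+1} drawn from a 4-wise independent family (here: the parity
    of a random degree-3 polynomial mod a prime)."""
    rng = random.Random(seed)
    p = 2**31 - 1
    coeffs = [rng.randrange(p) for _ in range(4)]   # degree-3 polynomial => 4-wise
    def h(i):
        v = 0
        for c in coeffs:                            # Horner evaluation mod p
            v = (v * i + c) % p
        return 1 if v % 2 == 0 else -1
    x = 0
    for a in stream:
        x += h(a)                                   # one counter, updated per element
    return x * x                                    # E[x^2] = F_2
```

A single copy has high variance; averaging independent copies (next slides) drives the error down.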
Tug-of-War Sketch
• $\mathrm{Var}[x^2] \le 2 F_2^2$ (4-wise independence controls the cross terms)
• By Chebyshev: $\Pr[|x^2 - F_2| > c \cdot F_2] \le 2/c^2$
• We want: $\Pr[|x^2 - F_2| > \epsilon \cdot F_2] \le \delta$ — how?
Median-of-Means Trick
Given an unbiased estimator $\hat{z}$ with $\mathrm{E}[\hat{z}] = z$, Chebyshev gives $\Pr[|\hat{z} - z| > c \cdot \mathrm{std}] \le 1/c^2$.

Step 1: take the mean of $\Theta(N)$ copies.
Reminder: sum of $N$ copies ⇒ expectation $\sim \Theta(N)$, std $\sim \Theta(\sqrt{N})$;
mean of $N$ copies ⇒ expectation $\sim \Theta(1)$, std $\sim \Theta(1/\sqrt{N})$.
To get an $\epsilon$-approximation: $N = \Theta(1/\epsilon^2)$.

Step 2: take the median of $\Theta(\log(1/\delta))$ means.

Space: $O\left(\frac{\log n \cdot \log(1/\delta)}{\epsilon^2}\right)$
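The two steps above can be sketched generically; a toy Python demo on a synthetic noisy estimator (the estimator and its parameters are made up for illustration):

```python
import random
import statistics

def median_of_means(samples, groups):
    """Median-of-means: split the samples into groups, average each group,
    then return the median of the group means."""
    m = len(samples) // groups
    means = [sum(samples[g * m:(g + 1) * m]) / m for g in range(groups)]
    return statistics.median(means)

# toy demo: a noisy unbiased estimator of z = 5 with std 10
rng = random.Random(0)
samples = [rng.gauss(5, 10) for _ in range(9 * 100)]
print(median_of_means(samples, 9))   # close to 5
```

Each group mean concentrates by Chebyshev; the median then boosts the constant success probability to $1 - \delta$ with only $\Theta(\log(1/\delta))$ groups.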
