
Approximate Frequency Counts over Data Streams

Gurmeet Singh Manku (Stanford University)
Rajeev Motwani (Stanford University)
Conference: VLDB 2002

1
Authors

GURMEET SINGH MANKU
 Completed his PhD at Stanford; currently working at Google
 Bachelor's degree from IIT Delhi
 Research areas: algorithm theory, information retrieval, distributed systems and parallel computing

RAJEEV MOTWANI
 Was a professor at Stanford University
 Primary research area: theoretical computer science
 Early advisor of Google and PayPal
 Did his PhD under Richard M. Karp (Turing Award winner)
 The Rajeev Motwani Building at IIT Kanpur (Dept. of Computer Science) is named after him

2
Motivation
 Data streams: network traffic, web server logs
 Storage issue: storing an entire data stream for its lifetime is impractical
 Even if storage were not an issue:
 if the number of unique elements in the stream is very high,
 it becomes difficult to maintain a frequency count for each element, since the data structure grows too large
 and query response time would increase

 Solution: scan the data stream once and maintain a small summary of it.

 Summary: keep frequency counts only for those elements that have some importance.
 In most cases, the higher-frequency elements are the significant ones.

3
Iceberg Queries
 Idea: identify the aggregates in a GROUP BY of a SQL query that exceed a user-specified threshold τ
 Algorithm (two passes; see the worked example after the table below):
 First pass: a set of counters is maintained
 The counters are then compressed into a bitmap, with 1 denoting a larger counter value
 Second pass: exact frequencies are maintained only for those elements whose bitmap value is 1

Attribute Count
a 10
b 35
c 40
d 5
e 20
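As an illustration (the threshold τ = 15 is assumed here purely for the example): in the first pass the counters above are compressed into the bitmap 0, 1, 1, 0, 1 for a through e, so the second pass maintains exact frequencies only for b, c and e.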

4
Frequent Itemsets and Association Rules
 Transaction: a subset of items from the set of all items, I.
 An itemset X ⊆ I has support s if X occurs in at least sN of the N transactions.
 Association rules:
 X → Y, where X ∩ Y = ∅ and X ∪ Y has support exceeding the user threshold s.
 Confidence of X → Y : support(X ∪ Y) / support(X)

 Algorithm (a worked example follows the table below):
 First pass: compute the set of candidate frequent itemsets
 Second pass: use the frequent itemsets to generate association rules
Item  Count      Item  Count      Item  Count
A     10         BC    15         BCD   12
B     20         CD    13
C     15         BD    16
D     18
Min Support = 11

5
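Using the counts above as a worked example: BCD has count 12 ≥ 11, so it is a frequent itemset, and confidence(B → C) = support(BC) / support(B) = 15 / 20 = 0.75.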
Problem Definition
• N : current length of the stream
• Input by the user:
• Support threshold, s ∈ (0,1)
• Error parameter, ε ∈ (0,1)
• ε << s

• Guarantees (a small numeric illustration follows):
• All items with true frequency > sN are output (no false negatives)
• No item with true frequency < (s − ε)N is output
• Estimated frequencies are less than the true frequencies by at most εN
• Such a summary is called an ε-deficient synopsis
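For example (numbers chosen purely for illustration): with s = 0.1, ε = 0.01 and N = 10^6, every item occurring more than 100,000 times is reported, nothing occurring fewer than 90,000 times is reported, and each reported count is below the true count by at most 10,000.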

6
Sticky Sampling
 A sampling-based algorithm.
 Parameters
 N, the number of elements seen so far
 User parameters: support s, error ε, probability of failure δ
 δ ∈ (0,1)

 A probabilistic algorithm, so it can fail, i.e. the ε-deficient synopsis may not be satisfied (with probability at most δ).

 Data structure, S
 Entry: (element e, estimated frequency f)

7
Incoming Stream of data

8
Divide the stream into windows
 Given s = 10%, ε = 1% and δ = 0.1%,
 t = (1/ε) ln(1/(sδ)) ≈ 921
 After the 2nd window, both the window size (w) and the sampling rate (r) double for each incoming window.
 Sampling rate r : an element is selected with probability 1/r

9
First window of size 2t, with sampling rate r = 1
 Sampling rate = 1 means picking every element and adding it to the data structure S.

10
After each window is complete
 For each element in the data structure S, repeatedly toss an unbiased coin.
 For each unsuccessful toss, decrement the frequency of that element by 1.
 On a successful toss, stop and move on to the next element.
 If the frequency of an element becomes 0, remove its entry from S.
 The number of coin tosses follows a geometric distribution.
 Ensures uniform sampling at the new rate (see the note below).
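A brief note on the coin tosses (a standard property of the geometric distribution, not stated explicitly on the slide): the number of decrements k made before the first successful toss has probability (1/2)^(k+1), so the expected decrement per element is (1 − p)/p = 1 for p = 1/2.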

11
Summary
 The next 2t elements are processed with sampling rate r = 2.
 Elements already present in the data structure S are still counted on every occurrence (i.e., effectively sampled with r = 1).
 Once the window is done, toss coins for each element in S as described above.
 Fetch the next 4t elements, with sampling rate r = 4.
 Repeat the same procedure, doubling the sampling rate and window size each time.
 Output: entries in S where f ≥ (s − ε)N (a code sketch follows)
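A minimal Python sketch of the procedure above (illustrative only; names such as sticky_sampling are my own, and the window bookkeeping follows the description on these slides):

import math
import random

def sticky_sampling(stream, s, eps, delta):
    """Minimal sketch of Sticky Sampling.
    Reports items whose estimated frequency f >= (s - eps) * N."""
    t = (1.0 / eps) * math.log(1.0 / (s * delta))  # e.g. s=0.1, eps=0.01, delta=0.001 gives t ~ 921
    S = {}                      # element -> estimated frequency f
    r = 1                       # current sampling rate: new elements enter S with probability 1/r
    window_end = 2 * t          # the first window holds 2t elements
    N = 0

    for e in stream:
        N += 1
        if e in S:
            S[e] += 1                           # tracked elements are counted on every occurrence
        elif random.random() < 1.0 / r:
            S[e] = 1                            # element sampled into the synopsis
        if N >= window_end:                     # window boundary: rate and window size double
            r *= 2
            window_end += r * t
            for elem in list(S):                # unbiased coin tosses to rescale existing counts
                while S[elem] > 0 and random.random() < 0.5:
                    S[elem] -= 1                # decrement once per unsuccessful toss
                if S[elem] == 0:
                    del S[elem]                 # drop entries whose count reached zero

    return {e: f for e, f in S.items() if f >= (s - eps) * N}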

12
Correctness of Sticky Sampling

13
[Table: for each coin toss (first toss, r = 2; second toss, r = 3; third toss, r = 4), the failure probability and the failure probability of selecting an element]

14
15
Lossy Counting
 A deterministic algorithm
 Parameters
 N, the number of elements seen so far
 User parameters: support s, error ε

 Data structure, D
 Entry: (element e, estimated frequency f, maximum possible error in f, ∆)

 Window size w = ceil(1/ε) = number of elements per window

 Number of windows = ceil(N/w)
16
Dividing the stream into windows (equal size)

 All windows are of equal size w
17
First window comes in
 Go through each element in the window and maintain a counter for each element.
 If the element already has an entry, increment its counter by 1.
 Otherwise, add a new entry with the counter initialized to 1.
 ∆ = b_current − 1 is also stored in each new entry; here b_current = 1.
 b_current is the current window (or bucket) number.

18
Adjust counts at window boundaries
 Prune D by deleting some entries at each bucket boundary.
Rule of deletion
 For an entry (e, f, ∆):
 if f + ∆ ≤ b_current, then delete that entry.

19
Next window
 If a new element arrives, add an entry with counter = 1 and ∆ = b_current − 1. Otherwise increment the counter of the existing entry by 1.
 ∆ records the maximum number of times the element could already have occurred (and been deleted) in earlier buckets; it is what lets a newly inserted element survive the deletion rule.
 Since deletion requires f + ∆ ≤ b_current, a brand-new entry with f = 1 and ∆ = b_current − 1 has f + ∆ = b_current and just survives the next boundary check. (A small numeric example follows.)
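For instance (illustrative numbers): an element first inserted during bucket 3 gets ∆ = 2; if by the end of bucket 5 its counter has only reached f = 2, then f + ∆ = 4 ≤ 5 and the entry is deleted, whereas f = 4 would give f + ∆ = 6 > 5 and the entry would survive.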

20
Summary of Algorithm
• Split the stream into equal-sized windows
• For each window
• For each element in the window
• If the element exists in D, increment its counter
• Else create a new entry in D: (e, 1, b_current − 1)

• At window boundaries,
• traverse every entry in D,
• and if f + ∆ ≤ b_current, delete the entry from D

• Output: entries with frequency f ≥ (s − ε)N (a code sketch follows)
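A minimal Python sketch of Lossy Counting as summarized above (the name lossy_counting and the element-at-a-time processing are illustrative; batching of the input is omitted):

import math

def lossy_counting(stream, s, eps):
    """Minimal sketch of Lossy Counting.
    Reports items whose estimated frequency f >= (s - eps) * N."""
    w = math.ceil(1.0 / eps)    # bucket (window) width
    D = {}                      # element -> (f, delta)
    N = 0

    for e in stream:
        N += 1
        b_current = math.ceil(N / w)            # current bucket number
        if e in D:
            f, delta = D[e]
            D[e] = (f + 1, delta)               # existing entry: increment counter
        else:
            D[e] = (1, b_current - 1)           # new entry with delta = b_current - 1
        if N % w == 0:                          # bucket boundary: prune low-count entries
            for elem, (f, delta) in list(D.items()):
                if f + delta <= b_current:
                    del D[elem]

    return {e: f for e, (f, delta) in D.items() if f >= (s - eps) * N}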

21
Lemmas (to prove correctness)

22
23
Sticky Sampling vs Lossy Counting
 Theoretically, the space complexity of Sticky Sampling is ≤ that of Lossy Counting.
 In practice, Lossy Counting takes much less space.
 Decreasing the minimum support means more items can qualify under the minimum-support criterion.

Memory requirements in terms of entries

SS : Sticky Sampling
LC : Lossy Counting
Worst : worst-case bounds
Zipf : skewed (Zipfian) distribution
Uniq : distribution in which all values are unique
 Probability of failure δ = 10^-4
 Length of stream, N = 10^7

24
Memory requirements at various stages of the streaming data

Support s = 1%, error parameter ε = 0.1s, probability of failure in Sticky Sampling δ = 10^-4

25
Theory to practice - Frequent Itemsets
 Input
 A stream of transactions
 Each transaction contains a set of items
 N, the number of transactions processed so far
 User inputs: support s, error ε

 Aim: find frequent itemsets

 Data structure, D
 Entry: (set, f, ∆)
 set : a subset of items
 f : approximate frequency of the set
 ∆ : maximum possible error in f

26
Data Structure - BUFFER
 The incoming transaction stream is divided into buckets of w = ceil(1/ε) transactions each.
 BUFFER repeatedly reads a batch of buckets of transactions into available main memory.
 β denotes the number of buckets in main memory.
 The transactions are laid out in one big array, and a bitmap is used to mark transaction boundaries.
 BUFFER first prunes away the item-ids whose frequency is less than εN.
 Then BUFFER sorts each transaction by its item-ids.

27
Data Structure - TRIE
 Conceptually: a forest of labelled nodes.
 Node label: <item-id, f, ∆, level>
 level : the distance of the node from its root

 A node represents the itemset consisting of the item-ids in that node and all of its ancestors.
 There is a 1-to-1 mapping between TRIE nodes and the entries of data structure D.
 Implementation: an array of entries of the form <item-id, f, ∆, level>, corresponding to the pre-order traversal of the underlying trees.
 This is equivalent to a lexicographic ordering of the subsets.

28
SETGEN
 Generates the subsets of item-ids, along with their frequencies in the current batch of transactions, in lexicographic order.
 A subset is enumerated iff either it already occurs in the TRIE or its frequency in the current batch is at least β.
 SETGEN pruning rule:
 If a subset does not make its way into the TRIE after application of both UPDATE_SET and NEW_SET, then no superset of it need be considered.
 This is similar to the Apriori pruning rule.

 UPDATE_SET : for each entry (set, f, ∆) in D, update f with the new counts, and if f + ∆ ≤ b_current, delete the entry from D.
 NEW_SET : if a set has frequency ≥ β in the current batch and does not already occur in D, add the entry (set, f, b_current − β). (A small sketch of these two rules follows.)
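A rough Python sketch of just the UPDATE_SET / NEW_SET bookkeeping described above (batch_counts is assumed to come from SETGEN's enumeration; the trie layout, lexicographic enumeration and Apriori-style pruning are deliberately omitted):

def apply_update_and_new_set(D, batch_counts, b_current, beta):
    """D maps itemset -> (f, delta); batch_counts maps itemset -> frequency in the current batch."""
    # UPDATE_SET: refresh counts of already-tracked sets, then prune those that no longer qualify
    for itemset, (f, delta) in list(D.items()):
        f += batch_counts.get(itemset, 0)
        if f + delta <= b_current:
            del D[itemset]
        else:
            D[itemset] = (f, delta)
    # NEW_SET: admit sets that are frequent within this batch but not yet tracked
    for itemset, count in batch_counts.items():
        if count >= beta and itemset not in D:
            D[itemset] = (count, b_current - beta)
    return D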

29
Implementation Details (example)

BUFFER (batches of transactions):
(A, C), (A, B, C, D), (C, D), …
(B, C), (A, C, D), (C, D), …
(A, D), (A, B, D), (B, C, D), …

[Figure: conceptual representation of the TRIE, with a 1-to-1 mapping to data structure D]

Data Structure D (set, f, ∆):
A    5  0
B    6  0
AB   4  0
C    8  0
AC   4  1
D    7  0
AD   4  2
BD   4  2
CD   3  2
ACD  3  2

OLD TRIE (pre-order array, set : f):
A 5, B 6, AB 4, C 8, AC 4, D 7, AD 4, BD 4, CD 3, ACD 3

NEW TRIE:
A 5, B 6, AB 4, C 8, AC 4, D 7, AD 4, BD 4, CD 3, ACD 3, ABD 2

Min Support Count = 2, Support(BC) = 1

30
Experimental Results
 Datasets are generated with the IBM test data generator.

Dataset name   Average transaction size   Average large itemset size   Size of dataset (10K unique items)
T10.I4.1000K   10                         4                            49 MB (1000K transactions)
T15.I6.1000K   15                         6                            69 MB (1000K transactions)

31
Comparison with Apriori
• Apriori algorithm: the publicly available package written by Christian Borgelt.
• The dlmalloc library by Doug Lea is used for figuring out memory requirements.
• mallinfo is used, instead of dlmalloc, for the proposed algorithms.

32
Time taken for the IBM test dataset T10.I4.1000K with 10K items.

33
Time taken for the IBM test dataset T15.I6.1000K with 10K items.

34
Time taken for frequent word-pairs in 100K web pages (iceberg queries)
[with threshold τ > 500, size: 54 MB]

35
Time taken for frequent word-pairs in 800K Reuters documents (iceberg query)
[with support s = 0.5%, size: 210 MB]

36
Conclusion
 The proposed algorithms outperform previous algorithms in terms of space requirements (main-memory footprint).
 They can be used to solve other frequency-counting problems, e.g. association rules, network traffic measurement, etc.
 The anytime behavior of the algorithms is a plus point.
 An inefficient implementation of the algorithms may slow down the stream (which can create issues in real-time environments).

37
Probabilistic Lossy Counting
 Gives the correct answer with probability (1 − δ).
 δ : a user parameter
 ∆ is computed after each window, for use in the next window.
 The computation of ∆ is based on the distribution of the data stream (a Zipfian distribution is assumed in the paper).
 False negatives can occur.
 Lower memory requirements.

PAPER: "Probabilistic Lossy Counting: An Efficient Algorithm for Finding Heavy Hitters", Xenofontas Dimitropoulos, Paul Hurley, Andreas Kind (IBM Zurich Research Lab), ACM SIGCOMM CCR, 2008.

38
