
Approximate Frequency Counts over Data Streams

Gurmeet Singh Manku (Stanford University)
Rajeev Motwani (Stanford University)
Conference: VLDB 2002

1
Authors

GURMEET SINGH MANKU
 Completed his PhD at Stanford; currently working at Google
 Bachelor's degree from IIT Delhi
 Research areas: algorithm theory, information retrieval, distributed systems and parallel computing

RAJEEV MOTWANI
 Was a professor at Stanford University
 Primary research area: theoretical computer science
 Early advisor of Google and PayPal
 Did his PhD under Richard M. Karp (Turing Award winner)
 The Rajeev Motwani Building at IIT Kanpur (Dept. of Computer Science) is named after him

2
Motivation
 Data streams: network traffic, web server logs
 Storage issue: storing an entire data stream for its lifetime is impractical
 Even if storage were not an issue:
 if the number of unique elements in the stream is very high,
 it becomes difficult to maintain a frequency count for each element, since the data structure grows too large
 and query response time would increase

 Solution: scan the data stream once and maintain a small summary of it.

 Summary: keep frequency counts only for those elements that have some importance.
 In most cases, the higher-frequency elements are the significant ones.

3
Iceberg Queries
 Idea: identify the aggregates in a GROUP BY of a SQL query that exceed a user-specified threshold τ
 Algorithm (two passes; see the worked example after the table below):
 First pass: a set of counters is maintained
 The counters are then compressed into a bitmap, with 1 denoting a larger counter value
 Second pass: exact frequencies are maintained only for those elements whose bitmap value is 1

Attribute Count
a 10
b 35
c 40
d 5
e 20
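As an illustration (the threshold τ = 15 is assumed here purely for the example): in the first pass the counters above are compressed into the bitmap 0, 1, 1, 0, 1 for a through e, so the second pass maintains exact frequencies only for b, c and e.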

4
Frequent Itemsets and Association Rules
 Transaction: a subset of items from the set of all items, I.
 An itemset X ⊆ I has support s if X occurs in at least sN of the N transactions.
 Association rules:
 X → Y, where X ∩ Y = ∅ and X ∪ Y has support exceeding the user threshold s.
 Confidence of X → Y : support(X ∪ Y) / support(X)

 Algorithm (a worked example follows the table below):
 First pass: compute the set of candidate frequent itemsets
 Second pass: use the frequent itemsets to generate association rules
Item  Count      Item  Count      Item  Count
A     10         BC    15         BCD   12
B     20         CD    13
C     15         BD    16
D     18
Min Support = 11

5
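Using the counts above as a worked example: BCD has count 12 ≥ 11, so it is a frequent itemset, and confidence(B → C) = support(BC) / support(B) = 15 / 20 = 0.75.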
Problem Definition
• N : current length of the stream
• Input by the user:
• Support threshold, s ∈ (0,1)
• Error parameter, ε ∈ (0,1)
• ε << s

• Guarantees (a small numeric illustration follows):
• All items with true frequency > sN are output (no false negatives)
• No item with true frequency < (s − ε)N is output
• Estimated frequencies are less than the true frequencies by at most εN
• Such a summary is called an ε-deficient synopsis
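For example (numbers chosen purely for illustration): with s = 0.1, ε = 0.01 and N = 10^6, every item occurring more than 100,000 times is reported, nothing occurring fewer than 90,000 times is reported, and each reported count is below the true count by at most 10,000.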

6
Sticky Sampling
 A sampling-based algorithm.
 Parameters
 N, the number of elements seen so far
 User parameters: support s, error ε, probability of failure δ
 δ ∈ (0,1)

 A probabilistic algorithm, so it can fail, i.e. the ε-deficient synopsis may not be satisfied (with probability at most δ).

 Data structure, S
 Entry: (element e, estimated frequency f)

7
Incoming Stream of data

8
Divide the stream into windows
 Given s = 10%, ε = 1% and δ = 0.1%,
 t = (1/ε) ln(1/(sδ)) ≈ 921
 After the 2nd window, both the window size (w) and the sampling rate (r) double for each incoming window.
 Sampling rate r : an element is selected with probability 1/r

9
First window of size 2t, with sampling rate r = 1
 Sampling rate = 1 means picking every element and adding it to the data structure S.

10
After each window is complete
 For each element in the data structure S, repeatedly toss an unbiased coin.
 For each unsuccessful toss, decrement the frequency of that element by 1.
 On a successful toss, stop and move on to the next element.
 If the frequency of an element becomes 0, remove its entry from S.
 The number of coin tosses follows a geometric distribution.
 Ensures uniform sampling at the new rate (see the note below).
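A brief note on the coin tosses (a standard property of the geometric distribution, not stated explicitly on the slide): the number of decrements k made before the first successful toss has probability (1/2)^(k+1), so the expected decrement per element is (1 − p)/p = 1 for p = 1/2.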

11
Summary
 The next 2t elements are processed with sampling rate r = 2.
 Elements already present in the data structure S are still counted on every occurrence (i.e., effectively sampled with r = 1).
 Once the window is done, toss coins for each element in S as described above.
 Fetch the next 4t elements, with sampling rate r = 4.
 Repeat the same procedure, doubling the sampling rate and window size each time.
 Output: entries in S where f ≥ (s − ε)N (a code sketch follows)
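A minimal Python sketch of the procedure above (illustrative only; names such as sticky_sampling are my own, and the window bookkeeping follows the description on these slides):

import math
import random

def sticky_sampling(stream, s, eps, delta):
    """Minimal sketch of Sticky Sampling.
    Reports items whose estimated frequency f >= (s - eps) * N."""
    t = (1.0 / eps) * math.log(1.0 / (s * delta))  # e.g. s=0.1, eps=0.01, delta=0.001 gives t ~ 921
    S = {}                      # element -> estimated frequency f
    r = 1                       # current sampling rate: new elements enter S with probability 1/r
    window_end = 2 * t          # the first window holds 2t elements
    N = 0

    for e in stream:
        N += 1
        if e in S:
            S[e] += 1                           # tracked elements are counted on every occurrence
        elif random.random() < 1.0 / r:
            S[e] = 1                            # element sampled into the synopsis
        if N >= window_end:                     # window boundary: rate and window size double
            r *= 2
            window_end += r * t
            for elem in list(S):                # unbiased coin tosses to rescale existing counts
                while S[elem] > 0 and random.random() < 0.5:
                    S[elem] -= 1                # decrement once per unsuccessful toss
                if S[elem] == 0:
                    del S[elem]                 # drop entries whose count reached zero

    return {e: f for e, f in S.items() if f >= (s - eps) * N}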

12
Correctness of Sticky Sampling

13
[Table: for each coin toss (first toss, r = 2; second toss, r = 3; third toss, r = 4), the failure probability and the failure probability of selecting an element]

14
15
Lossy Counting
 A deterministic algorithm
 Parameters
 N, the number of elements seen so far
 User parameters: support s, error ε

 Data structure, D
 Entry: (element e, estimated frequency f, maximum possible error in f, ∆)

 Window size w = ceil(1/ε) = number of elements per window

 Number of windows = ceil(N/w)
16
Dividing the stream into windows (equal size)

 All windows are of equal size w
17
First window comes in
 Go through each element in the window and maintain a counter for each element.
 If the element already has an entry, increment its counter by 1.
 Otherwise, add a new entry with the counter initialized to 1.
 ∆ = b_current − 1 is also stored in each new entry; here b_current = 1.
 b_current is the current window (or bucket) number.

18
Adjust counts at window boundaries
 Prune D by deleting some entries at each bucket boundary.
Rule of deletion
 For an entry (e, f, ∆):
 if f + ∆ ≤ b_current, then delete that entry.

19
Next window
 If a new element arrives, add an entry with counter = 1 and ∆ = b_current − 1. Otherwise increment the counter of the existing entry by 1.
 ∆ records the maximum number of times the element could already have occurred (and been deleted) in earlier buckets; it is what lets a newly inserted element survive the deletion rule.
 Since deletion requires f + ∆ ≤ b_current, a brand-new entry with f = 1 and ∆ = b_current − 1 has f + ∆ = b_current and just survives the next boundary check. (A small numeric example follows.)
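For instance (illustrative numbers): an element first inserted during bucket 3 gets ∆ = 2; if by the end of bucket 5 its counter has only reached f = 2, then f + ∆ = 4 ≤ 5 and the entry is deleted, whereas f = 4 would give f + ∆ = 6 > 5 and the entry would survive.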

20
Summary of Algorithm
• Split the stream into equal-sized windows
• For each window
• For each element in the window
• If the element exists in D, increment its counter
• Else create a new entry in D: (e, 1, b_current − 1)

• At window boundaries,
• traverse every entry in D,
• and if f + ∆ ≤ b_current, delete the entry from D

• Output: entries with frequency f ≥ (s − ε)N (a code sketch follows)
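A minimal Python sketch of Lossy Counting as summarized above (the name lossy_counting and the element-at-a-time processing are illustrative; batching of the input is omitted):

import math

def lossy_counting(stream, s, eps):
    """Minimal sketch of Lossy Counting.
    Reports items whose estimated frequency f >= (s - eps) * N."""
    w = math.ceil(1.0 / eps)    # bucket (window) width
    D = {}                      # element -> (f, delta)
    N = 0

    for e in stream:
        N += 1
        b_current = math.ceil(N / w)            # current bucket number
        if e in D:
            f, delta = D[e]
            D[e] = (f + 1, delta)               # existing entry: increment counter
        else:
            D[e] = (1, b_current - 1)           # new entry with delta = b_current - 1
        if N % w == 0:                          # bucket boundary: prune low-count entries
            for elem, (f, delta) in list(D.items()):
                if f + delta <= b_current:
                    del D[elem]

    return {e: f for e, (f, delta) in D.items() if f >= (s - eps) * N}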

21
Lemmas (to prove correctness)

22
23
Sticky Sampling vs Lossy Counting
 Theoretically, the space complexity of Sticky Sampling is ≤ that of Lossy Counting.
 In practice, Lossy Counting takes much less space.
 Decreasing the minimum support means more items can qualify under the minimum-support criterion.

Memory requirements in terms of entries

SS : Sticky Sampling
LC : Lossy Counting
Worst : worst-case bounds
Zipf : skewed (Zipfian) distribution
Uniq : distribution in which all values are unique
 Probability of failure δ = 10^-4
 Length of stream, N = 10^7

24
Memory requirements at various stages of the streaming data

Support s = 1%, error parameter ε = 0.1s, probability of failure in Sticky Sampling δ = 10^-4

25
Theory to practice - Frequent Itemsets
 Input
 A stream of transactions
 Each transaction contains a set of items
 N, the number of transactions processed so far
 User inputs: support s, error ε

 Aim: find frequent itemsets

 Data structure, D
 Entry: (set, f, ∆)
 set : a subset of items
 f : approximate frequency of the set
 ∆ : maximum possible error in f

26
Data Structure - BUFFER
 The incoming transaction stream is divided into buckets of w = ceil(1/ε) transactions each.
 BUFFER repeatedly reads a batch of buckets of transactions into available main memory.
 β denotes the number of buckets in main memory.
 The transactions are laid out in one big array, and a bitmap is used to mark transaction boundaries.
 BUFFER first prunes away the item-ids whose frequency is less than εN.
 Then BUFFER sorts each transaction by its item-ids.

27
Data Structure - TRIE
 Conceptually: a forest of labelled nodes.
 Node label: <item-id, f, ∆, level>
 level : the distance of the node from its root

 A node represents the itemset consisting of the item-ids in that node and all of its ancestors.
 There is a 1-to-1 mapping between TRIE nodes and the entries of data structure D.
 Implementation: an array of entries of the form <item-id, f, ∆, level>, corresponding to the pre-order traversal of the underlying trees.
 This is equivalent to a lexicographic ordering of the subsets.

28
SETGEN
 Generates the subsets of item-ids, along with their frequencies in the current batch of transactions, in lexicographic order.
 A subset is enumerated iff either it already occurs in the TRIE or its frequency in the current batch is at least β.
 SETGEN pruning rule:
 If a subset does not make its way into the TRIE after application of both UPDATE_SET and NEW_SET, then no superset of it need be considered.
 This is similar to the Apriori pruning rule.

 UPDATE_SET : for each entry (set, f, ∆) in D, update f with the new counts, and if f + ∆ ≤ b_current, delete the entry from D.
 NEW_SET : if a set has frequency ≥ β in the current batch and does not already occur in D, add the entry (set, f, b_current − β). (A small sketch of these two rules follows.)
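A rough Python sketch of just the UPDATE_SET / NEW_SET bookkeeping described above (batch_counts is assumed to come from SETGEN's enumeration; the trie layout, lexicographic enumeration and Apriori-style pruning are deliberately omitted):

def apply_update_and_new_set(D, batch_counts, b_current, beta):
    """D maps itemset -> (f, delta); batch_counts maps itemset -> frequency in the current batch."""
    # UPDATE_SET: refresh counts of already-tracked sets, then prune those that no longer qualify
    for itemset, (f, delta) in list(D.items()):
        f += batch_counts.get(itemset, 0)
        if f + delta <= b_current:
            del D[itemset]
        else:
            D[itemset] = (f, delta)
    # NEW_SET: admit sets that are frequent within this batch but not yet tracked
    for itemset, count in batch_counts.items():
        if count >= beta and itemset not in D:
            D[itemset] = (count, b_current - beta)
    return D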

29
Implementation Details (example)

BUFFER (batches of transactions):
(A, C), (A, B, C, D), (C, D), …
(B, C), (A, C, D), (C, D), …
(A, D), (A, B, D), (B, C, D), …

[Figure: conceptual representation of the TRIE, with a 1-to-1 mapping to data structure D]

Data Structure D (set, f, ∆):
A    5  0
B    6  0
AB   4  0
C    8  0
AC   4  1
D    7  0
AD   4  2
BD   4  2
CD   3  2
ACD  3  2

OLD TRIE (pre-order array, set : f):
A 5, B 6, AB 4, C 8, AC 4, D 7, AD 4, BD 4, CD 3, ACD 3

NEW TRIE:
A 5, B 6, AB 4, C 8, AC 4, D 7, AD 4, BD 4, CD 3, ACD 3, ABD 2

Min Support Count = 2, Support(BC) = 1

30
Experimental Results
 Datasets are generated with the IBM test data generator.

Dataset name   Average transaction size   Average large itemset size   Size of dataset (10K unique items)
T10.I4.1000K   10                         4                            49 MB (1000K transactions)
T15.I6.1000K   15                         6                            69 MB (1000K transactions)

31
Comparison with Apriori
• Apriori algorithm: the publicly available package written by Christian Borgelt.
• The dlmalloc library by Doug Lea is used for figuring out memory requirements.
• mallinfo is used, instead of dlmalloc, for the proposed algorithms.

32
Time taken for the IBM test dataset T10.I4.1000K with 10K items.

33
Time taken for the IBM test dataset T15.I6.1000K with 10K items.

34
Time taken for frequent word-pairs in 100K web pages (iceberg queries)
[with threshold τ > 500, size: 54 MB]

35
Time taken for frequent word-pairs in 800K Reuters documents (iceberg query)
[with support s = 0.5%, size: 210 MB]

36
Conclusion
 The proposed algorithms outperform previous algorithms in terms of space requirements (main-memory footprint).
 They can be used to solve other frequency-counting problems, e.g. association rules, network traffic measurement, etc.
 The anytime behavior of the algorithms is a plus point.
 An inefficient implementation of the algorithms may slow down the stream (which can create issues in real-time environments).

37
Probabilistic Lossy Counting
 Gives the correct answer with probability (1 − δ).
 δ : a user parameter
 ∆ is computed after each window, for use in the next window.
 The computation of ∆ is based on the distribution of the data stream (a Zipfian distribution is assumed in the paper).
 False negatives can occur.
 Lower memory requirements.

PAPER: "Probabilistic Lossy Counting: An Efficient Algorithm for Finding Heavy Hitters", Xenofontas Dimitropoulos, Paul Hurley, Andreas Kind (IBM Zurich Research Lab), ACM SIGCOMM CCR, 2008.

38
