Approximate Frequency Counts over Data Streams
Authors

GURMEET SINGH MANKU
Completed his PhD at Stanford; currently working at Google
Bachelor's from IIT Delhi (Computer Science)
Research areas: algorithm theory, information retrieval, distributed systems and parallel computing

RAJEEV MOTWANI
Was a professor at Stanford University
Primary research area: theoretical computer science
Early advisor of Google and PayPal
Did his PhD under Richard M. Karp (Turing Award winner)
The Rajeev Motwani Building at IIT Kanpur (Dept. of Computer Science) is named after him
Motivation

Data streams: network traffic, web server logs
Storage issue: storing data streams for their lifetime is infeasible
Even if storage were not an issue:
If the number of unique elements in the stream is too high, maintaining a frequency count for each element is difficult, since the data structure keeps growing
Query response time would increase
Iceberg Queries

Idea: identify aggregates in a GROUP BY of a SQL query that exceed a user-specified threshold Ƭ
Algorithm:
First pass: a set of counters is maintained; the counters are then compressed into a bitmap, with 1 denoting a large counter value
Second pass: exact frequencies are maintained only for those elements whose bitmap value is 1

Attribute   Count
a           10
b           35
c           40
d           5
e           20
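The two-pass scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's space-efficient version: the paper compresses hashed, shared counters into a bitmap, while here the first pass keeps exact coarse counts and the "bitmap" is simply the candidate set.

```python
from collections import Counter

def iceberg_query(data, threshold):
    """Two-pass iceberg query sketch.

    Pass 1: maintain coarse counters and compress them into a candidate
    set (the paper's bitmap) marking elements that may exceed the threshold.
    Pass 2: keep exact counts only for the marked candidates.
    """
    # Pass 1: coarse counting (exact here for simplicity; the paper
    # uses hashed/shared counters to save space).
    coarse = Counter(data)
    candidates = {x for x, c in coarse.items() if c >= threshold}

    # Pass 2: exact counts for candidates only.
    exact = Counter(x for x in data if x in candidates)
    return {x: c for x, c in exact.items() if c >= threshold}

# The attribute counts from the table above, with threshold 20:
stream = ["b"] * 35 + ["c"] * 40 + ["a"] * 10 + ["e"] * 20 + ["d"] * 5
print(iceberg_query(stream, 20))  # {'b': 35, 'c': 40, 'e': 20}
```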
Frequent Itemsets and Association Rules

Transaction: a subset of items from the set of all items I
An itemset X ⊆ I has support s if X occurs in more than sN of the N transactions
Association rules:
X → Y, where X ∩ Y = ϕ and X U Y has support exceeding the user threshold s
Confidence of X → Y: support(X U Y) / support(X)
Algorithm:
First pass: compute the set of candidate frequent itemsets
Second pass: use the frequent itemsets to generate association rules

Min support count = 11

Item   Count     Item   Count     Item   Count
A      10        BC     15        BCD    12
B      20        CD     13
C      15        BD     16
D      18
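The support/confidence definitions can be checked directly against the counts in the table above; a small illustrative computation (the `counts` dictionary just transcribes the table):

```python
# Itemset counts from the table above.
counts = {"A": 10, "B": 20, "C": 15, "D": 18,
          "BC": 15, "CD": 13, "BD": 16, "BCD": 12}
min_support = 11

def confidence(x, xy):
    """confidence(X -> Y) = support(X U Y) / support(X)."""
    return counts[xy] / counts[x]

# Frequent itemsets: those meeting the minimum support count.
frequent = [i for i, c in counts.items() if c >= min_support]
print(frequent)                 # ['B', 'C', 'D', 'BC', 'CD', 'BD', 'BCD']
print(confidence("B", "BCD"))   # 0.6  (rule B -> CD)
print(confidence("BC", "BCD"))  # 0.8  (rule BC -> D)
```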
Problem Definition
• N : current length of the stream
• Input by user:
• Support threshold s ϵ (0,1)
• Error parameter ε ϵ (0,1)
• ε << s
• Guarantees:
• All items with true frequency > sN are output (no false negatives)
• No item with true frequency < (s-ε)N is output
• Estimated frequencies are less than the true frequencies by at most εN
• Such an answer is called an ε-deficient synopsis
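A quick numeric reading of the three guarantees, with hypothetical parameters (s = 10%, ε = 1%, N = 10,000 are chosen for illustration only):

```python
# Hypothetical parameters: support s = 10%, error eps = 1%, N = 10,000.
s, eps, N = 0.10, 0.01, 10_000

always_reported = s * N          # true frequency above 1000: guaranteed output
never_reported = (s - eps) * N   # true frequency below 900: never output
max_undercount = eps * N         # estimates trail true counts by at most 100
```

Items with true frequency between 900 and 1000 may or may not be reported; that ambiguity is the price of the ε-deficient synopsis.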
Sticky Sampling

A sampling-based algorithm.
Parameters:
N, the number of elements seen so far
User parameters: support s, error ε, probability of failure δ, with δ ϵ (0,1)

Incoming stream of data
Divide the stream into windows

Given s = 10%, ε = 1% and δ = 0.1%, t ≈ 921
After the 2nd window, both the window size w and the sampling rate r double for each incoming window.
Sampling rate r: an element is selected with probability 1/r
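The value t ≈ 921 follows from the paper's bound t = (1/ε) · ln(1/(sδ)); a quick check:

```python
import math

# t = (1/eps) * ln(1/(s * delta)); with s = 10%, eps = 1%, delta = 0.1%:
s, eps, delta = 0.10, 0.01, 0.001
t = (1 / eps) * math.log(1 / (s * delta))
print(round(t))  # 921
```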
First window: 2t elements, with sampling rate r = 1

Sampling rate 1 means picking every element and adding it to the data structure.
After each window is complete

Repeatedly toss an unbiased coin for each element in the data structure S.
For each unsuccessful toss, decrement the frequency of that element by 1.
On a successful toss, stop and move to the next element.
If the frequency of an element becomes 0, remove its entry from S.
The number of coin tosses follows a geometric distribution; this ensures uniform sampling.
Summary

Process the next 2t elements with sampling rate r = 2.
Elements already present in the data structure S keep being counted on every occurrence (effectively sampled at rate 1).
Once the window is done, toss coins for each element in S as above.
Fetch the next 4t elements with sampling rate r = 4.
Repeat the same procedure, doubling the sampling rate and window size each time.
Output: entries in S where f ≥ (s-ε)N
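The whole procedure can be sketched as follows. This is a minimal rendering of the steps on the preceding slides, not the paper's implementation; the demo parameters at the bottom (including a deliberately loose δ that keeps the windows small enough to hit a boundary) are illustrative only.

```python
import math
import random

def sticky_sampling(stream, s, eps, delta):
    """Sketch of Sticky Sampling. S maps element -> estimated frequency.

    Window sizes are 2t, 2t, 4t, 8t, ... with sampling rates 1, 2, 4, ...
    where t = (1/eps) * ln(1/(s * delta)).
    """
    t = math.ceil((1 / eps) * math.log(1 / (s * delta)))
    S = {}
    r = 1                  # current sampling rate
    boundary = 2 * t       # first window holds 2t elements at rate 1
    n = 0
    for x in stream:
        n += 1
        if x in S:
            S[x] += 1                      # tracked elements always counted
        elif random.random() < 1.0 / r:
            S[x] = 1                       # sample a new element with prob 1/r
        if n == boundary:                  # window boundary: rate doubles
            r *= 2
            boundary += r * t
            # Diminish counts: decrement once per unsuccessful coin toss,
            # stopping at the first success (geometric distribution).
            for e in list(S):
                while S[e] > 0 and random.random() < 0.5:
                    S[e] -= 1
                if S[e] == 0:
                    del S[e]
    return {e: f for e, f in S.items() if f >= (s - eps) * n}

random.seed(42)  # reproducible demo run
stream = []
for i in range(200):
    stream += ["hot", "hot", f"rare{i}", "hot", f"x{i}"]  # "hot": 600 of 1000
result = sticky_sampling(stream, s=0.10, eps=0.01, delta=0.5)
print(sorted(result))  # ['hot']
```

Only the heavy hitter clears the (s-ε)N output threshold; the unique elements either get pruned at a boundary or stay with a count far below it.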
Correctness of Sticky Sampling

(Table: failure probability at each coin toss vs. the sampling rate, and the resulting probability of selecting an element.)
Lossy Counting

A deterministic algorithm.
Parameters:
N, the number of elements seen so far
User parameters: support s, error ε
Data structure D
Entry: (element e, frequency f, maximum possible error in f: ∆)
Divide the stream into equal-sized windows
First window comes in

Go through each element in the window and maintain a counter for each element:
If the element exists, increase its counter by 1.
Else, add an entry with the counter initialized to 1, together with ∆ = b_current - 1 (here b_current = 1).
b_current is the current window (or bucket) number.
Adjust counts at window boundaries

Prune D by deleting some entries at bucket boundaries.
Deletion rule: for an entry (e, f, ∆), if f + ∆ ≤ b_current, delete that entry.
Next window

If a new element arrives, add an entry with counter = 1 and ∆ = b_current - 1; otherwise increment the counter of the existing entry.
∆ helps a newly added element survive in D: it acts as if the algorithm had already decremented that element's frequency ∆ times, since the deletion rule is f + ∆ ≤ b_current.
Summary of Algorithm
• Split the stream into equal-sized windows
• For each window
• For each element in the window
• If the element exists in D, increment its counter
• Else create a new entry (e, 1, b_current - 1) in D
• At window boundaries
• Traverse each entry in D
• If f + ∆ ≤ b_current, delete the entry from D
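The steps above can be sketched directly. This is a minimal, illustrative rendering of the slide's pseudocode; the demo stream at the bottom is invented for the example.

```python
import math

def lossy_counting(stream, s, eps):
    """Sketch of Lossy Counting. D maps element -> (f, delta).

    Buckets hold w = ceil(1/eps) elements; entries with
    f + delta <= b_current are pruned at bucket boundaries.
    """
    w = math.ceil(1 / eps)
    D = {}
    n = 0
    for x in stream:
        n += 1
        b_current = math.ceil(n / w)
        if x in D:
            f, delta = D[x]
            D[x] = (f + 1, delta)             # existing entry: count up
        else:
            D[x] = (1, b_current - 1)          # new entry with its max error
        if n % w == 0:                         # bucket boundary: prune
            for e in list(D):
                f, delta = D[e]
                if f + delta <= b_current:
                    del D[e]
    return {e: f for e, (f, delta) in D.items() if f >= (s - eps) * n}

# Demo: 100 elements, one full bucket (w = 100 for eps = 1%).
stream = ["a"] * 60 + ["b"] * 30 + [f"rare{i}" for i in range(10)]
print(lossy_counting(stream, s=0.10, eps=0.01))  # {'a': 60, 'b': 30}
```

At the bucket boundary the ten singleton entries satisfy f + ∆ = 1 ≤ b_current and are pruned, while 'a' and 'b' survive and clear the (s-ε)N output threshold.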
Lemmas (to prove correctness)
Sticky Sampling vs Lossy Counting

Theoretically, the space complexity of Sticky Sampling is ≤ that of Lossy Counting.
In practice, Lossy Counting takes much less space.
Decreasing the minimum support means more items can qualify under the minimum-support criterion.
Memory requirements at various stages of streaming data

Support s = 1%, error parameter ε = 0.1s, probability of failure for Sticky Sampling δ = 10⁻⁴
Theory to practice: Frequent Itemsets

Input:
A stream of transactions; each transaction contains a set of items
N, the number of transactions processed so far
User inputs: support s, error ε
Data Structure: BUFFER

The incoming transaction stream is divided into buckets of w = ceil(1/ε) transactions each.
BUFFER repeatedly reads batches of transactions into available main memory.
β denotes the number of buckets in main memory.
The transactions are laid out in a big array, and a bitmap marks transaction boundaries.
BUFFER first prunes item-ids whose frequency is less than εN, then sorts each transaction by item-id.
Data Structure: TRIE

Conceptually: a forest of labeled nodes.
Labeled node: <item_id, f, ∆, level>
Level: distance of the node from the root.
A node represents the itemset consisting of the item-ids on that node and all its ancestors.
There is a 1-to-1 mapping between TRIE entries and entries of data structure D.
Implementation: an array of entries of the form <item_id, f, ∆, level> corresponding to a pre-order traversal of the underlying trees.
This is equivalent to a lexicographic ordering of the subsets.
SETGEN

Generates subsets of item-ids, along with their frequencies in the current batch of transactions, in lexicographic order.
A subset must be enumerated iff either it occurs in TRIE or its frequency in the current batch exceeds β.
SETGEN pruning rule: if a subset does not make its way into TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of it need be considered (like the Apriori pruning rule).
UPDATE_SET: for each entry (set, f, ∆) in D, update f with the new counts, and if f + ∆ ≤ b_current delete the entry from D.
NEW_SET: if a set has frequency ≥ β in the current batch and does not occur in D, add an entry (set, f, b_current - β).
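The UPDATE_SET and NEW_SET rules can be illustrated with a simplified sketch that operates on a plain dictionary. This deliberately ignores SETGEN's trie traversal and lexicographic enumeration; `batch_counts`, `b_current`, and `beta` are illustrative names standing in for the per-batch itemset counts, the current bucket number, and the number of buckets in the batch.

```python
def process_batch(D, batch_counts, b_current, beta):
    """Simplified UPDATE_SET / NEW_SET for one batch of beta buckets.

    D maps itemset (tuple) -> (f, delta); batch_counts maps itemset ->
    its count within the current batch.
    """
    # UPDATE_SET: refresh counts of known sets, prune stale ones.
    for itemset in list(D):
        f, delta = D[itemset]
        f += batch_counts.get(itemset, 0)
        if f + delta <= b_current:
            del D[itemset]
        else:
            D[itemset] = (f, delta)
    # NEW_SET: admit sets frequent within this batch but absent from D.
    for itemset, c in batch_counts.items():
        if itemset not in D and c >= beta:
            D[itemset] = (c, b_current - beta)
    return D

# Demo: one tracked set updated, one new set admitted.
D = {("a",): (3, 0)}
print(process_batch(D, {("a",): 2, ("b",): 5}, b_current=4, beta=3))
```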
Implementation Details

Example (min support count = 2): BUFFER holds transactions (A,C), (A,B,C,D), (C,D), …, (B,C), (A,C,D), (C,D), …, (A,D), (A,B,D), …; the conceptual TRIE over items A, B, C, D corresponds to entries in data structure D such as A:(f=5, ∆=0), B:(6, 0), AB:(4, 0), C:(8, 0), AC:(4, 1).
Subset counts in the new TRIE: A=5, B=6, AB=4, C=8, AC=4, D=7, AD=4, BD=4, CD=3, ACD=3, ABD=2, while support(BC) = 1 falls below the threshold.
Experimental Results

Datasets are generated with the IBM test data generator (10K unique items).

Dataset name   Avg. transaction size   Avg. large itemset size   Dataset size
T10.I4.1000K   10                      4                         49 MB (1000K transactions)
T15.I6.1000K   15                      6                         69 MB (1000K transactions)
Comparison with Apriori

• Apriori algorithm: a publicly available package written by Christian Borgelt.
• The dlmalloc library by Doug Lea is used for figuring out memory requirements.
• mallinfo is used, instead of dlmalloc, for the proposed algorithm.
Time taken for IBM test dataset T10.I4.1000K with 10K items.
Time taken for IBM test dataset T15.I6.1000K with 10K items.
Time taken for frequent word pairs in 100K web pages (iceberg queries) [with Ƭ > 500, size: 54 MB].
Time taken for frequent word pairs in 800K Reuters documents (iceberg query) [with support s = 0.5%, size: 210 MB].
Conclusion

The proposed algorithms outperform previous algorithms in terms of space requirements (main-memory footprint).
They can be used to solve other frequency-counting problems, such as association rules, network traffic analysis, etc.
The anytime behavior of the algorithms is a plus.
An inefficient implementation of the algorithms may slow down the stream, which can create issues in real-time environments.
Probabilistic Lossy Counting

Gives a correct answer with probability (1 - δ), where δ is a user parameter.
∆ is computed after each window, for the next window.
The computation of ∆ is based on the distribution of the data stream (the paper assumes a Zipf distribution).
False negatives can occur.
Lower memory requirements.

Paper: "Probabilistic Lossy Counting: An Efficient Algorithm for Finding Heavy Hitters", 2005, Xenofontas Dimitropoulos, Paul Hurley, Andreas Kind, IBM Zurich Research Lab.