
Count-min sketch 

• The Count-min sketch was proposed by Graham Cormode and
S. Muthukrishnan, who first described it in 2003 and later in the paper
"Approximating Data with the Count-Min Sketch", published in 2011.
• The Count-min sketch is a probabilistic data structure: a simple
technique for summarizing large amounts of frequency data.
• It keeps track of counts, i.e., how many times an element occurs in
a data set or stream.
• The exact count of an item can easily be tracked in Java using a
Hashtable or a HashMap, but the memory needed grows with the
number of distinct items.
• The Count-min sketch instead uses multiple hash functions to map
items onto the counters of a fixed-size matrix; each occurrence of an
item increments one counter per hash function.
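For comparison, the exact-counting approach with a Java map can be sketched as follows. This is only an illustration; the class and method names (`ExactCounter`, `countAll`) are made up for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ExactCounter {
    // Counts every item exactly. Memory grows with the number of
    // distinct items, which is what a Count-min sketch avoids.
    static Map<String, Integer> countAll(List<String> items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countAll(List.of("a", "b", "a", "c", "a"));
        System.out.println(counts.get("a")); // prints 3
    }
}
```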
Suppose the final entries at the item's hashed indices are 3, 2, 2, 2. Take the minimum of these
entries as the result: min(3, 2, 2, 2) is 2, which means the queried item is estimated to occur two
times in the input list.
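The update and min-query steps can be sketched in Java as below. This is a minimal illustration under assumptions, not a reference implementation: the class name, the seed constants, and the simple seeded hash are all chosen for this example, and a production sketch would normally use a pairwise-independent hash family. Here each row of the table belongs to one hash function; the orientation is arbitrary.

```java
class CountMinSketch {
    private final int depth;      // number of hash functions (one table row each)
    private final int width;      // number of counters per hash function
    private final long[][] table;
    private final int[] seeds;    // per-row multipliers (illustrative assumption)

    CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        for (int i = 0; i < depth; i++) {
            seeds[i] = (0x9E3779B9 * (i + 1)) | 1; // arbitrary odd constant per row
        }
    }

    // Maps an item to a counter index in the given row.
    private int index(String item, int row) {
        int h = item.hashCode() * seeds[row];
        h ^= (h >>> 16);                  // mix high bits into low bits
        return Math.floorMod(h, width);   // always in [0, width)
    }

    // Records one occurrence: one counter per row is incremented.
    void add(String item) {
        for (int row = 0; row < depth; row++) {
            table[row][index(item, row)]++;
        }
    }

    // Estimated count: the minimum over the item's counters.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][index(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch sketch = new CountMinSketch(4, 64);
        sketch.add("x");
        sketch.add("x");
        sketch.add("y");
        System.out.println(sketch.estimate("x")); // at least 2, never less
    }
}
```

Because counters are only ever incremented, the estimate can be too high but never below the true count.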
Issue with Count-min sketch and its solution:
• What if two or more elements get the same hash value? They all increment the same counter,
so that counter ends up larger than the true count of any one of them.
• Because of such hash collisions, the Count-min sketch can overcount frequencies, but it never
undercounts them. Taking the minimum limits the damage: the estimate is too high only if the
item collides with other items under every hash function.
• With more hash functions, it is less likely that all of an item's counters are inflated; with
fewer hash functions, it is more likely. Hence using several hash functions is recommended.
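One common way to pick the table size is the standard analysis of the Count-min sketch: with ⌈e/ε⌉ counters per hash function and ⌈ln(1/δ)⌉ hash functions, the estimate exceeds the true count by more than ε times the total number of items with probability at most δ. The helper below is a small sketch of that rule; the class and method names are hypothetical, and the bound assumes suitably independent hash functions.

```java
class SketchSizing {
    // Counters per hash function for an additive error of epsilon * (total items).
    static int width(double epsilon) {
        return (int) Math.ceil(Math.E / epsilon);
    }

    // Number of hash functions for a failure probability of at most delta.
    static int depth(double delta) {
        return (int) Math.ceil(Math.log(1.0 / delta));
    }

    public static void main(String[] args) {
        System.out.println(width(0.01)); // prints 272
        System.out.println(depth(0.01)); // prints 5
    }
}
```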
Applications of Count-min sketch:
• Compressed Sensing
• Networking
• NLP
• Stream Processing
• Frequency tracking
• Extension: Heavy-hitters
• Extension: Range-query
Count-Min Sketches — Approximately How
Many Times did People Watch Adele’s “Hello”?
• Let’s take an example. Suppose we have a 3-column, 4-row
Count-Min Sketch. The very first user on your platform decides
to watch “History of Japan”.
• When that video’s id is hashed by the three hashes (one per
column), we get 1, 2, and 1, indicating that we should increment the
counters at (row, column) positions (1,1), (2,2), and (1,3).
• Now suppose that another user watches “Hello” by Adele, which
hashes to 1, 1, and 4.
• Following that, yet another user watches “History of Japan”.
• Now suppose we want to know how many times users watched
“Hello”.
• “Hello” hashes to 1, 1, and 4, so we should examine the counts at
the cells (1,1), (1,2), and (4,3). These counts are 3, 1, and 1
respectively, so the estimate, their minimum, is 1.
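The walk-through above can be replayed in Java with the hash values hard-coded from the text. No real hash function is involved; the class and method names are made up for this illustration, and cells are addressed as (row, column) with one column per hash.

```java
import java.util.Arrays;

class HelloExample {
    // Row selected by hash 1, 2, 3 for each video (1-based, as in the text).
    static final int[] JAPAN = {1, 2, 1};
    static final int[] HELLO = {1, 1, 4};

    // One view: increment the selected row in each hash's column.
    static void watch(int[][] table, int[] rows) {
        for (int col = 0; col < rows.length; col++) {
            table[rows[col] - 1][col]++;
        }
    }

    // Reads the counters a video maps to, one per column.
    static int[] counters(int[][] table, int[] rows) {
        int[] result = new int[rows.length];
        for (int col = 0; col < rows.length; col++) {
            result[col] = table[rows[col] - 1][col];
        }
        return result;
    }

    // Replays the three views and returns the counters for "Hello".
    static int[] replay() {
        int[][] table = new int[4][3]; // 4 rows, 3 columns
        watch(table, JAPAN);
        watch(table, HELLO);
        watch(table, JAPAN);
        return counters(table, HELLO);
    }

    public static void main(String[] args) {
        int[] c = replay();
        System.out.println(Arrays.toString(c));                 // prints [3, 1, 1]
        System.out.println(Arrays.stream(c).min().getAsInt());  // prints 1
    }
}
```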

• “Hello” was watched only once, and yet, in the Count-Min Sketch
above, one of the counters says “Hello” was watched 3 times!
• This is because “Hello” and “Japan” collide on the first hash
function, so any views of “Japan”, which was watched twice, will
also count towards “Hello”.
• Essentially, the Count-Min Sketch is doubling-up, allowing
multiple events to share the same counter in order to preserve
space.
• The more cells in the table, the more counters we store in
memory, so the less doubling-up will occur, and the more
accurate our counters will be.
• Now consider a different sequence in which “Hello” is never
watched, but each of the watched videos collides with “Hello” in at
least one hash function.
• As a result, when we query the view count for Adele’s “Hello”,
we are incorrectly told that it has been watched once.
• We can reduce both the probability and magnitude of errors by
using a larger table.
• The Count-Min Sketch is therefore best suited to cases where the
goal is to understand general trends rather than to make precise
measurements.
