Hash Map & Bloom Filter
HashMap is similar to Hashtable, but it is unsynchronized. It also permits null keys
and null values; there can be at most one null key, but any number of null
values.
This implementation provides constant-time performance for the basic operations (get and
put), assuming the hash function disperses the elements properly among the buckets. Iteration
over collection views requires time proportional to the "capacity" of the HashMap instance
(the number of buckets) plus its size (the number of key-value mappings). Thus, it's very
important not to set the initial capacity too high (or the load factor too low) if iteration
performance is important.
An instance of HashMap has two parameters that affect its performance: initial capacity and
load factor. The capacity is the number of buckets in the hash table, and the initial capacity is
simply the capacity at the time the hash table is created. The load factor is a measure of how
full the hash table is allowed to get before its capacity is automatically increased. When the
number of entries in the hash table exceeds the product of the load factor and the current
capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the
hash table has approximately twice the number of buckets.
As a general rule, the default load factor (.75) offers a good tradeoff between time and space
costs. Higher values decrease the space overhead but increase the lookup cost (reflected in
most of the operations of the HashMap class, including get and put). The expected number of
entries in the map and its load factor should be taken into account when setting its initial
capacity, so as to minimize the number of rehash operations. If the initial capacity is greater
than the maximum number of entries divided by the load factor, no rehash operations will ever
occur.
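That sizing rule can be sketched in Java. The helper name and the figure of 1,000 expected entries below are our own illustration, not part of the HashMap API:

```java
import java.util.HashMap;
import java.util.Map;

public class CapacityDemo {
    // Smallest initial capacity that avoids rehashing for the expected
    // number of entries at the given load factor (hypothetical helper).
    static int capacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / loadFactor);
    }

    public static void main(String[] args) {
        // For 1,000 entries at the default load factor of 0.75, an initial
        // capacity of at least ceil(1000 / 0.75) = 1334 buckets means the
        // table never has to be rehashed while those entries are inserted.
        int capacity = capacityFor(1000, 0.75f);
        Map<String, Integer> map = new HashMap<>(capacity);
        for (int i = 0; i < 1000; i++) {
            map.put("key" + i, i);
        }
        System.out.println(capacity); // 1334
    }
}
```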
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large
capacity will allow the mappings to be stored more efficiently than letting it perform automatic
rehashing as needed to grow the table. Note that using many keys with the same hashCode()
is a sure way to slow down performance of any hash table. To ameliorate impact, when keys
are Comparable, this class may use comparison order among keys to help break ties.
Note that this implementation is not synchronized. If multiple threads access a hash map
concurrently, and at least one of the threads modifies the map structurally, it must be
synchronized externally.
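One standard way to do that external synchronization is to wrap the map with Collections.synchronizedMap at creation time, so no thread ever holds a reference to the bare HashMap. A minimal sketch (the key "a" is just an example):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SyncDemo {
    public static void main(String[] args) {
        // Wrap the HashMap immediately so only the synchronized view escapes.
        Map<String, Integer> map =
                Collections.synchronizedMap(new HashMap<>());
        map.put("a", 1);

        // Iteration over a synchronized wrapper still requires manually
        // locking on the wrapper itself for the duration of the loop.
        synchronized (map) {
            for (Map.Entry<String, Integer> e : map.entrySet()) {
                System.out.println(e.getKey() + "=" + e.getValue());
            }
        }
    }
}
```

For highly concurrent workloads, java.util.concurrent.ConcurrentHashMap is usually a better choice than a synchronized wrapper.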
Bloom Filters
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether
an element is present in a set.
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us
that the element either definitely is not in the set or may be in the set.
The base data structure of a Bloom filter is a Bit Vector. Here's a small one we'll use to
demonstrate:
bit:    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
Each cell in the top row represents a bit (all starting at 0), and the number
below it is its index. To add an element to the Bloom filter, we simply hash it
a few times and set the bits in the bit vector at the indices of those hashes
to 1.
FNV and Murmur are two simple hash functions often used for this. When you add
a string, each hash function produces an index, and the bit at each of those
indices is set to 1.
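That step can be sketched in Java. The 15-slot vector matches the figure above; the FNV-1a constants are the standard published ones, but the key "hello" and the class name are just illustration. A real filter would set bits from several hash functions (e.g. Murmur as a second one), not just one:

```java
public class BitVectorDemo {
    static final int SIZE = 15; // the 15-bit vector from the figure

    // 32-bit FNV-1a hash (standard offset basis and prime).
    static int fnv1a(String s) {
        int hash = 0x811c9dc5;          // FNV offset basis
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x01000193;          // FNV prime
        }
        return hash;
    }

    public static void main(String[] args) {
        boolean[] bits = new boolean[SIZE];
        // Hash the key and set the bit at (hash mod SIZE); floorMod keeps
        // the index non-negative even when the hash is negative.
        int index = Math.floorMod(fnv1a("hello"), SIZE);
        bits[index] = true;
        System.out.println("bit " + index + " is now set");
    }
}
```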
As we discussed above, a Bloom filter can also produce false positive results,
meaning that an element is deemed to be present when it was actually never
inserted. For a filter with m bits, k hash functions, and n inserted elements,
the probability of a false positive is approximately (1 - e^(-kn/m))^k.
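To make that formula concrete, here is a small sketch that evaluates it; the particular numbers (10,000 bits, 7 hash functions, 1,000 elements) are our own example:

```java
public class FalsePositiveRate {
    // Approximate false-positive probability for a Bloom filter with
    // m bits, k hash functions, and n inserted elements:
    // (1 - e^(-k*n/m))^k
    static double falsePositiveRate(int m, int k, int n) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        // 1,000 elements in 10,000 bits with 7 hash functions gives a
        // false-positive rate a little under 1%.
        System.out.printf("%.4f%n", falsePositiveRate(10_000, 7, 1_000));
    }
}
```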
The point of a bloom filter is to store whether a key was visited before. An example use-case
for a Bloom Filter might be a web crawler, which needs to store whether we've visited a
website before.
Bloom Filters are NOT about key-value pairs like Hash Maps. There's no "value" associated with
a key. The only thing we care about for a key is a boolean: true that we've seen this key before,
or false that this key is new.
Bloom filters are good because they use less memory than a hash map and still provide fast
lookup and insertion times.
The reason bloom filters aren't incredibly common is because it's possible they give wrong
answers: a Bloom Filter might say that you've visited a key before when you really haven't.
A hash function is simply a function that takes some input, and transforms that into some
number (which we'll use as the index in our array) as output. The same input will always
produce the same output.
A Bloom filter supports two operations:
• Inserting a key.
• Querying if a key was seen before.
Insertion works by inputting a key into all k hash functions, and for each outputted array index,
marking that array value as true.
Querying works by inputting a key into all k hash functions, and for each outputted array index,
checking whether that array value is true. If all array values are true, then the key was seen
before. If any array values are false, then the key wasn't seen before.
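The two operations above can be sketched in Java. This is a minimal illustration, not a production implementation: it derives its k indices from two base hashes (a common double-hashing trick) instead of k independent hash functions, and the sizes in main are arbitrary:

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes; // "k" in the text

    BloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th index from two base hashes (double hashing).
    // A real implementation would use stronger hashes such as Murmur.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = h1 >>> 16 | 1; // cheap odd second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    // Insertion: set the bit at every one of the k indices.
    void insert(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // Query: the key was "seen" only if all k bits are set.
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false; // definitely never inserted
            }
        }
        return true; // probably inserted (false positives possible)
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(1 << 16, 4);
        filter.insert("https://example.com");
        System.out.println(filter.mightContain("https://example.com")); // true
    }
}
```

An inserted key is always reported as present; a key that was never inserted is usually, but not always, reported as absent.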
Conclusion
Bloom filters are good because they are efficient on space, but the drawback is
that they can produce false positives: a filter may report that it has seen a
key before when it really hasn't.
We've only looked at basic Bloom filter examples to get the concept across:
optimizing a real Bloom filter takes tuning to choose a good k, and the right
choice depends on the amount of memory you have. The basic concept is easy
enough to code, but a production implementation is considerably more involved,
largely because designing good, independent hash functions is hard.