Hash Map & Bloom Filter
HashMap is similar to Hashtable, but it is unsynchronized. It also permits null keys
and null values; there can be at most one null key, but any number of null
values.
This implementation provides constant-time performance for the basic operations (get and
put), assuming the hash function disperses the elements properly among the buckets. Iteration
over collection views requires time proportional to the "capacity" of the HashMap instance
(the number of buckets) plus its size (the number of key-value mappings). Thus, it's very
important not to set the initial capacity too high (or the load factor too low) if iteration
performance is important.
An instance of HashMap has two parameters that affect its performance: initial capacity and
load factor. The capacity is the number of buckets in the hash table, and the initial capacity is
simply the capacity at the time the hash table is created. The load factor is a measure of how
full the hash table is allowed to get before its capacity is automatically increased. When the
number of entries in the hash table exceeds the product of the load factor and the current
capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the
hash table has approximately twice the number of buckets.
As a general rule, the default load factor (.75) offers a good tradeoff between time and space
costs. Higher values decrease the space overhead but increase the lookup cost (reflected in
most of the operations of the HashMap class, including get and put). The expected number of
entries in the map and its load factor should be taken into account when setting its initial
capacity, so as to minimize the number of rehash operations. If the initial capacity is greater
than the maximum number of entries divided by the load factor, no rehash operations will ever
occur.
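That sizing rule can be sketched in Java. The helper name and the figure of 1,000 expected entries below are our own illustration, not part of the HashMap API:

```java
import java.util.HashMap;
import java.util.Map;

public class CapacityDemo {
    // Smallest initial capacity that avoids rehashing for the expected
    // number of entries at the given load factor (hypothetical helper).
    static int capacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / loadFactor);
    }

    public static void main(String[] args) {
        // For 1,000 entries at the default load factor of 0.75, an initial
        // capacity of at least ceil(1000 / 0.75) = 1334 buckets means the
        // table never has to be rehashed while those entries are inserted.
        int capacity = capacityFor(1000, 0.75f);
        Map<String, Integer> map = new HashMap<>(capacity);
        for (int i = 0; i < 1000; i++) {
            map.put("key" + i, i);
        }
        System.out.println(capacity); // 1334
    }
}
```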
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large
capacity will allow the mappings to be stored more efficiently than letting it perform automatic
rehashing as needed to grow the table. Note that using many keys with the same hashCode()
is a sure way to slow down performance of any hash table. To ameliorate impact, when keys
are Comparable, this class may use comparison order among keys to help break ties.
Note that this implementation is not synchronized. If multiple threads access a hash map
concurrently, and at least one of the threads modifies the map structurally, it must be
synchronized externally.
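One standard way to do that external synchronization is to wrap the map with Collections.synchronizedMap at creation time, so no thread ever holds a reference to the bare HashMap. A minimal sketch (the key "a" is just an example):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SyncDemo {
    public static void main(String[] args) {
        // Wrap the HashMap immediately so only the synchronized view escapes.
        Map<String, Integer> map =
                Collections.synchronizedMap(new HashMap<>());
        map.put("a", 1);

        // Iteration over a synchronized wrapper still requires manually
        // locking on the wrapper itself for the duration of the loop.
        synchronized (map) {
            for (Map.Entry<String, Integer> e : map.entrySet()) {
                System.out.println(e.getKey() + "=" + e.getValue());
            }
        }
    }
}
```

For highly concurrent workloads, java.util.concurrent.ConcurrentHashMap is usually a better choice than a synchronized wrapper.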
Bloom Filters
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether
an element is present in a set.
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us
that the element either definitely is not in the set or may be in the set.
The base data structure of a Bloom filter is a Bit Vector. Here's a small one we'll use to
demonstrate:
bit:    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
Each cell in the top row represents a bit (all starting at 0), and the number
below it is its index. To add an element to the Bloom filter, we simply hash it
a few times and set the bits in the bit vector at the indices of those hashes
to 1.
FNV and Murmur are two simple hash functions often used for this. When you add
a string, each hash function produces an index, and the bit at each of those
indices is set to 1.
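That step can be sketched in Java. The 15-slot vector matches the figure above; the FNV-1a constants are the standard published ones, but the key "hello" and the class name are just illustration. A real filter would set bits from several hash functions (e.g. Murmur as a second one), not just one:

```java
public class BitVectorDemo {
    static final int SIZE = 15; // the 15-bit vector from the figure

    // 32-bit FNV-1a hash (standard offset basis and prime).
    static int fnv1a(String s) {
        int hash = 0x811c9dc5;          // FNV offset basis
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x01000193;          // FNV prime
        }
        return hash;
    }

    public static void main(String[] args) {
        boolean[] bits = new boolean[SIZE];
        // Hash the key and set the bit at (hash mod SIZE); floorMod keeps
        // the index non-negative even when the hash is negative.
        int index = Math.floorMod(fnv1a("hello"), SIZE);
        bits[index] = true;
        System.out.println("bit " + index + " is now set");
    }
}
```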
As we discussed above, a Bloom filter can also produce false positive results,
meaning that an element is deemed to be present when it was actually never
inserted. For a filter with m bits, k hash functions, and n inserted elements,
the probability of a false positive is approximately (1 - e^(-kn/m))^k.
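To make that formula concrete, here is a small sketch that evaluates it; the particular numbers (10,000 bits, 7 hash functions, 1,000 elements) are our own example:

```java
public class FalsePositiveRate {
    // Approximate false-positive probability for a Bloom filter with
    // m bits, k hash functions, and n inserted elements:
    // (1 - e^(-k*n/m))^k
    static double falsePositiveRate(int m, int k, int n) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        // 1,000 elements in 10,000 bits with 7 hash functions gives a
        // false-positive rate a little under 1%.
        System.out.printf("%.4f%n", falsePositiveRate(10_000, 7, 1_000));
    }
}
```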
The point of a bloom filter is to store whether a key was visited before. An example use-case
for a Bloom Filter might be a web crawler, which needs to store whether we've visited a
website before.
Bloom Filters are NOT about key-value pairs like Hash Maps. There's no "value" associated with
a key. The only thing we care about for a key is a boolean: true that we've seen this key before,
or false that this key is new.
Bloom filters are good because they use less memory than a hash map and still provide fast
lookup and insertion times.
The reason bloom filters aren't incredibly common is because it's possible they give wrong
answers: a Bloom Filter might say that you've visited a key before when you really haven't.
A hash function is simply a function that takes some input, and transforms that into some
number (which we'll use as the index in our array) as output. The same input will always
produce the same output.
A Bloom filter supports two operations:
• Inserting a key.
• Querying if a key was seen before.
Insertion works by inputting a key into all k hash functions, and for each outputted array index,
marking that array value as true.
Querying works by inputting a key into all k hash functions, and for each outputted array index,
checking whether that array value is true. If all array values are true, then the key was seen
before. If any array values are false, then the key wasn't seen before.
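The two operations above can be sketched in Java. This is a minimal illustration, not a production implementation: it derives its k indices from two base hashes (a common double-hashing trick) instead of k independent hash functions, and the sizes in main are arbitrary:

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes; // "k" in the text

    BloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th index from two base hashes (double hashing).
    // A real implementation would use stronger hashes such as Murmur.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = h1 >>> 16 | 1; // cheap odd second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    // Insertion: set the bit at every one of the k indices.
    void insert(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // Query: the key was "seen" only if all k bits are set.
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false; // definitely never inserted
            }
        }
        return true; // probably inserted (false positives possible)
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(1 << 16, 4);
        filter.insert("https://example.com");
        System.out.println(filter.mightContain("https://example.com")); // true
    }
}
```

An inserted key is always reported as present; a key that was never inserted is usually, but not always, reported as absent.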
Conclusion
Bloom filters are good because they are efficient on space, but the drawback is
that they can produce false positives: a filter may report that it has seen a
key before when it really hasn't.
We've only looked at basic Bloom filter examples to get the concept across:
optimizing a real Bloom filter takes tuning to choose a good k, and the right
choice depends on the amount of memory you have. The basic concept is easy
enough to code, but a production implementation is considerably more involved,
largely because designing good, independent hash functions is hard.