Optimize database queries with Bloom filters and materialized views

Optimization techniques
Bloom filters
Bloom filters
• Probabilistic data structure, check membership for a value in a set.
• How it works: S, set of n values → const * n bits

calculate hash(v) ∈ [1, const * n]
set bit hash(v) to 1
Test w ∈ S → h(w) = 1 ?
• Small probability of false positive. w1 ∈ S, w2 ∉ S h(w1) = h(w2)

Bloom filters
• To reduce the probability of false positives use k > 1 independent
hash functions.
• How it works: S, set of n values → const * n bits

calculate h1(v), h2(v) … hk(v) ∈ [1, const * n]
set bits h1(v), h2(v) … hk(v) to 1
Test w ∈ S → h1(v) = 1 and h2(v) = 1 … and hk(v) = 1 ?

v1 v2 v3
h2(v1) = 12
h1(v1) = 4 h3(v1) = 22
0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Small probability of false positive.
Probability of false negative = 0.

Bloom filters
• Used only to add elements or the test membership.
• Once an element is added to the filter it cannot be removed.
• If all bits are set to 1, the probability of false positives increases.

More space → more accuracy.
• More hash functions
Latency → more accuracy.
Bloom filters – independent hashing
• A family of hash functions 𝐻 = ℎ: 𝑈 → 1. . 𝑚 is k-independent if
∀ 𝑥1 , 𝑥2 … 𝑥𝑘 ∈ 𝑈 𝑘 and ∀ 𝑦1 , 𝑦2 … 𝑦𝑘 ∈ 1. . 𝑚 𝑘 :
1
• 𝑃𝑟ℎ∈𝐻 ℎ 𝑥1 = 𝑦1 ‫ ٿ‬ℎ 𝑥2 = 𝑦2 … ‫ ٿ‬ℎ 𝑥𝑘 = 𝑦𝑘 =
𝑚𝑘
• ℎ 𝑥1 uniformly distributed.
• ℎ 𝑥1 , ℎ 𝑥2 , … ℎ 𝑥𝑘 independent random variables.
v1 v2 v3
h2(v1) = 12
h1(v1) = 4 h3(v1) = 22
0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Small probability of false positive. false positive. Value w: B[h1(w)] = 1 B[h2(w)] =1 … B[hk[w]] = 1
Probability of false negative = 0. Each hash of w equals a hash of an element in the set
Bloom filters – accuracy
• m size of array, n number of elements in S, k number of hash
functions.
• Probability of false positive:
𝑘𝑛 𝑘
1
𝑃 = 1− 1− or
𝑚
𝑘𝑛 𝑘
−𝑚
𝑃 = 1− 𝑒
• m = 10 * n and k = 7 ≃ 0,01
functions.
h(w) != h1(v1)
• Probability of false positive:
𝑘𝑛 𝑘
1
𝑃 = 1− 1− or
𝑚
• m = 10 * n and k = 7 ≃ 0,01
functions. h1(w) != h1(v1)
• Probability of false positive: h1(w) != h1(v1)
𝑘 …..
1 𝑘𝑛
𝑃 = 1− 1− or h1(w) != hn(v1)
𝑚 h1(w) != h1(v2)
…
h1(w) != hn(v2)
…
• m = 10 * n and k = 7 ≃ 0,01
functions. h1(w) = h1(v1)
• Probability of false positive: or
𝑘 h1(w) = h1(v1)
1 𝑘𝑛
𝑃 = 1− 1− or …..
𝑚 h1(w) = hn(v1)
or
h1(w) = h1(v2)
…
h1(w) = hn(v2)
…
• m = 10 * n and k = 7 ≃ 0,01
Log Structured Merge-tree
Log Structured Merge Trees
• Optimize I/O operations.
• Used by: Bigtable, LevelDB, Apache Cassandra etc.
• Data organized in B+ trees.
• Advantages: leaves sequentially located,

leaves are full.
B+ tree
https://commons.wikimedia.org/wiki/File:Btree.png
LSMT L0 in memory
L1
on disk
L2
…….
LSMT
lookup L0 in memory
Separate
L1
lookups
on disk
L2
…….
LSMT
lookup L0 in memory
Separate
L1
lookups
on disk
L2
merge results
…….
LSMT
insert L0 in memory
insert if
memory …
available
on disk
…….
LSMT
insert L0 in memory
memory full L1
on disk
if L1 empty,
copy L0 → L1,
delete L0 …
…….
LSMT
insert L0 in memory
memory full L0,L1
on disk
rolling merge
…
…….
LSMT
update, delete L0 in memory
L0,L1
“deleted” or on disk
“updated”
recoreds,
udapted merge:
check if record …
is deleted or
updated
…….
LSMT
stepped-merge L0 L0_1 L0_2 … in memory
L1_1 L1_2 ….
on disk
…….
LSMT
stepped-merge L0 L0_1 L0_2 … in memory
L1_1 L1_2 ….
on disk
Bloom filters
optimize lookup
…
…….
Materialized views
Materialized views
• redundant data, contents can be inferred from the definition
• immediate view refresh
• deferred view refresh
• incremental update: modify only the affected parts of the

materialized view
Materialized views
• Join operation
• Selection
• Projection
• Aggregation

Optimize database queries with Bloom filters and materialized views

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Optimize database queries with Bloom filters and materialized views

Uploaded by

Copyright:

Available Formats

Optimization techniques

• How it works: S, set of n values → const * n bits

• Small probability of false positive. w1 ∈ S, w2 ∉ S h(w1) = h(w2)

• How it works: S, set of n values → const * n bits

Test w ∈ S → h1(v) = 1 and h2(v) = 1 … and hk(v) = 1 ?

Small probability of false positive.

Probability of false negative = 0.

• Once an element is added to the filter it cannot be removed.

• If all bits are set to 1, the probability of false positives increases.

• Used by: Bigtable, LevelDB, Apache Cassandra etc.

• Data organized in B+ trees.

• Advantages: leaves sequentially located,

memory full L0,L1

• immediate view refresh

• deferred view refresh

• incremental update: modify only the affected parts of the

You might also like