
Alternative Data Structures in Ruby

Tyler McMullen

Friday, February 19, 2010


Why?

• Speed
• Memory
• Clarity



What’s wrong with my favorite
data structure, X?



Nothing. (Maybe.)



• Bloom Filter
• BK-tree
• Splay Tree
• Trie

Bloom Filters

• Tests for existence in a set


• Probabilistic
• Minimal memory use



100 million strings in a Set

Traditional Set: minimum 10 GB

Bloom Filter (false-positive rate 0.00001): 280 MB
Bloom Filter (false-positive rate 0.001): 170 MB
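
The Bloom filter sizes above are consistent with the standard sizing formula, m = -n ln(p) / (ln 2)^2 bits for n items at false-positive rate p. A minimal sketch of that arithmetic follows, assuming the slide's figures are mebibytes; the helper names are illustrative and not from the talk.

```ruby
# Bits needed to hold item_count items at the given false-positive rate
# (standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2).
def bloom_filter_bits(item_count, false_positive_rate)
  (-item_count * Math.log(false_positive_rate) / (Math.log(2)**2)).ceil
end

# Number of hash functions that minimizes false positives for that size
# (k = m / n * ln 2).
def optimal_hash_count(bit_count, item_count)
  (bit_count.to_f / item_count * Math.log(2)).round
end

# Convert a bit count to mebibytes.
def mebibytes(bit_count)
  bit_count / 8.0 / (1024 * 1024)
end

[0.00001, 0.001].each do |rate|
  bits = bloom_filter_bits(100_000_000, rate)
  hashes = optimal_hash_count(bits, 100_000_000)
  puts format('p=%g: %.0f MiB, %d hash functions', rate, mebibytes(bits), hashes)
end
```

For 100 million items this works out to roughly 286 MiB and 171 MiB. Note that the size depends only on the item count and the error rate, not on how long the strings are.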

0 1 2 3 4 5 6 7

add: “to be or not to be”

0 1 2 3 4 5 6 7



add: “that is the question”

0 1 2 3 4 5 6 7



query: “whether ‘tis nobler”

0 1 2 3 4 5 6 7

NO MATCH



query: “to be or not to be”

0 1 2 3 4 5 6 7

MATCH



query: “in the mind to suffer”

0 1 2 3 4 5 6 7

FALSE MATCH
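
The add and query steps walked through above can be sketched in a few lines of Ruby. This is a minimal illustration, not the implementation used in the talk: the class name, the MD5-based double hashing, and the array of booleans are all assumptions made for readability. A real filter would pack the bits to get the memory savings quoted earlier.

```ruby
require 'digest/md5'

# Illustrative Bloom filter: a fixed-size bit array plus k hash positions per item.
class BloomFilter
  def initialize(size_in_bits, hash_count)
    @size = size_in_bits
    @hash_count = hash_count
    @bits = Array.new(size_in_bits, false) # one boolean per bit, for clarity only
  end

  # Set every position the item hashes to.
  def add(item)
    positions(item).each { |index| @bits[index] = true }
    self
  end

  # True if every position is set. Added items always match; items that were
  # never added can also match (a false match) when their bits collide.
  def include?(item)
    positions(item).all? { |index| @bits[index] }
  end

  private

  # Derive k positions from the two halves of one MD5 digest (double hashing).
  def positions(item)
    digest = Digest::MD5.hexdigest(item.to_s)
    first = digest[0, 16].to_i(16)
    second = digest[16, 16].to_i(16)
    (0...@hash_count).map { |i| (first + i * second) % @size }
  end
end

filter = BloomFilter.new(1024, 3)
filter.add('to be or not to be')
filter.add('that is the question')
filter.include?('to be or not to be')    # always true for an added item
filter.include?('in the mind to suffer') # usually false; true would be a false match
```

There is no remove operation here: clearing a bit could erase evidence of other items that share it.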

Request

File Server

Y exists? N

200 404

Request

Bloom Filter

File Server

Y exists? N

200 404

Bloom Filter

• Test for existence in set


• Tiny Memory Footprint
• Excellent Speed
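
The request flow shown earlier, with the filter sitting in front of the file server, might look like the sketch below. Everything here is hypothetical: the class name, the hash used as a backing store, and the lookup counter exist only for illustration. The filter can be any object that responds to `include?`; the usage lines use a plain `Set` as a stand-in so the example is self-contained.

```ruby
require 'set'

# Illustrative front end: ask the filter first, touch the backing store only
# when the filter says the path might exist.
class FilteredFileServer
  attr_reader :backend_lookups

  def initialize(filter, files)
    @filter = filter      # anything responding to include?, e.g. a Bloom filter
    @files = files        # hypothetical backing store: path => contents
    @backend_lookups = 0  # counts how often the slow path is taken
  end

  def get(path)
    # A "no" from the filter is definite, so answer 404 without a lookup.
    return [404, nil] unless @filter.include?(path)

    # A "yes" may be a false match, so the store still has the final word.
    @backend_lookups += 1
    @files.key?(path) ? [200, @files[path]] : [404, nil]
  end
end

files = { '/index.html' => '<h1>hello</h1>' }
# '/ghost.html' stands in for a false match: the filter says yes, the store says no.
filter = Set.new(['/index.html', '/ghost.html'])
server = FilteredFileServer.new(filter, files)
server.get('/index.html')   # found in the store
server.get('/missing.html') # rejected by the filter, no lookup
server.get('/ghost.html')   # passes the filter, then misses in the store
```

The design point is that a false match costs one wasted lookup, while every definite miss is answered without touching the store at all.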



BK-tree



BK-tree

• Finds items within a given distance of a target


• Reduces the search space
• Works in any metric space

Triangle Inequality
| d(x, y) - d(x, z) | ≤ d(y, z)

[Figure: three points with d(x, y) = 4 and d(x, z) = 1; d(y, z) is unknown]

Triangle Inequality
| 4 - 1 | ≤ d(y, z)


Triangle Inequality
3 ≤ d(y, z)
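
The lower bound used above follows from applying the ordinary triangle inequality twice, together with symmetry of d. A short derivation, using the slide's values d(x, y) = 4 and d(x, z) = 1, also gives the matching upper bound:

```latex
\begin{align*}
d(x, y) &\le d(x, z) + d(z, y) &&\Rightarrow\quad d(x, y) - d(x, z) \le d(y, z) \\
d(x, z) &\le d(x, y) + d(y, z) &&\Rightarrow\quad d(x, z) - d(x, y) \le d(y, z) \\
\lvert d(x, y) - d(x, z) \rvert &\le d(y, z) \le d(y, x) + d(x, z) \\
\lvert 4 - 1 \rvert = 3 &\le d(y, z) \le 4 + 1 = 5
\end{align*}
```

So knowing only the two distances from x already confines d(y, z) to the range 3 to 5, without computing it.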



BK-tree
taser
paste
shave
light
pastor
pasta



BK-tree
root: paste
  1 → pasta
  2 → pastor
  3 → taser
  4 → shave
  5 → light

BK-tree
query: "pastu"
d(pastu, paste) = 1

BK-tree
d(pastu, pasta) = 1
d(pastu, pastor) = 2



BK-tree

• Most often used for spelling correctors


• Work in any metric space
• Reduce the search space
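
The insert and search steps walked through above can be sketched as follows. This is an illustrative version, assuming Levenshtein (edit) distance as the metric; the class and method names are not from the talk. Each node keeps its children in a hash keyed by their distance from that node, and a search only descends into children whose edge label is within the threshold of the distance just measured, which is the triangle inequality at work.

```ruby
# Levenshtein distance, computed row by row.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Illustrative BK-tree over any distance function passed as a block.
class BKTree
  Node = Struct.new(:word, :children)

  def initialize(&distance)
    @distance = distance
    @root = nil
  end

  # Walk down the edge labeled with the distance to each node until a free slot is found.
  def add(word)
    if @root.nil?
      @root = Node.new(word, {})
      return self
    end
    node = @root
    loop do
      d = @distance.call(word, node.word)
      return self if d.zero? # already present
      if node.children[d].nil?
        node.children[d] = Node.new(word, {})
        return self
      end
      node = node.children[d]
    end
  end

  # Return all words within threshold of target, skipping subtrees that the
  # triangle inequality rules out.
  def query(target, threshold)
    matches = []
    stack = @root ? [@root] : []
    until stack.empty?
      node = stack.pop
      d = @distance.call(target, node.word)
      matches << node.word if d <= threshold
      node.children.each do |edge, child|
        stack << child if (edge - d).abs <= threshold
      end
    end
    matches.sort
  end
end

tree = BKTree.new { |a, b| levenshtein(a, b) }
%w[paste taser shave light pastor pasta].each { |word| tree.add(word) }
tree.query('pastu', 1) # visits paste, then only the children on edges 0 through 2
```

With "paste" inserted first it becomes the root, as on the slides, and a query for "pastu" with threshold 1 never measures the distance to "taser", "shave", or "light".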



Splay Tree



Tangent:
Access Patterns



Access Patterns

Usually assumed to be random or even.



Access Patterns

Rarely the case.



Splay Tree

• Self-adjusting binary search tree


• Brings most accessed items toward root
• The more uneven the access pattern, the
better



Splay Tree
7

4 11

2 6 9 13

1 3 5 4 8 10 12 14


Splay Tree
[Figure: accessing 9. After the first rotation, 9 has moved above 11 and is a child of the root 7.]

Splay Tree
[Figure: after the second rotation, 9 is the root.]


Splay Tree

• Made for very uneven access patterns


• Caches, Garbage collectors, etc...
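
The rotations pictured earlier can be sketched with a small recursive splay routine. This is one common textbook formulation, shown here as an assumption rather than the version from the talk; the class and method names are illustrative. Every lookup and insert splays the touched key to the root, so frequently used keys stay near the top.

```ruby
# Illustrative splay tree with insert and lookup.
class SplayTree
  Node = Struct.new(:key, :left, :right)

  attr_reader :root

  # Splay the closest key to the root, then split the tree around the new key.
  def insert(key)
    if @root.nil?
      @root = Node.new(key)
      return self
    end
    @root = splay(@root, key)
    return self if @root.key == key

    node = Node.new(key)
    if key < @root.key
      node.left = @root.left
      node.right = @root
      @root.left = nil
    else
      node.right = @root.right
      node.left = @root
      @root.right = nil
    end
    @root = node
    self
  end

  # Looking a key up also moves it (or its nearest neighbor) to the root.
  def include?(key)
    @root = splay(@root, key)
    !@root.nil? && @root.key == key
  end

  private

  def rotate_right(node)
    pivot = node.left
    node.left = pivot.right
    pivot.right = node
    pivot
  end

  def rotate_left(node)
    pivot = node.right
    node.right = pivot.left
    pivot.left = node
    pivot
  end

  # Bring key (or the last node on its search path) to the top of this subtree.
  def splay(node, key)
    return node if node.nil? || node.key == key

    if key < node.key
      return node if node.left.nil?
      if key < node.left.key    # zig-zig: same direction twice
        node.left.left = splay(node.left.left, key)
        node = rotate_right(node)
      elsif key > node.left.key # zig-zag: left, then right
        node.left.right = splay(node.left.right, key)
        node.left = rotate_left(node.left) unless node.left.right.nil?
      end
      node.left.nil? ? node : rotate_right(node)
    else
      return node if node.right.nil?
      if key > node.right.key    # zig-zig: same direction twice
        node.right.right = splay(node.right.right, key)
        node = rotate_left(node)
      elsif key < node.right.key # zig-zag: right, then left
        node.right.left = splay(node.right.left, key)
        node.right = rotate_right(node.right) unless node.right.left.nil?
      end
      node.right.nil? ? node : rotate_left(node)
    end
  end
end

tree = SplayTree.new
[7, 4, 11, 2, 6, 9, 13].each { |key| tree.insert(key) }
tree.include?(9)
tree.root.key # 9 is now at the root
```

Note that inserting keys this way does not reproduce the balanced starting tree from the slides, since each insert also splays; the shape always reflects the access history.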



Trie



Trie

• O(k) lookup, add, removal (k = key length, independent of item count)


• Ordered traversals
• Prefix matching
• Excellent memory usage (depending on
implementation)
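
A minimal sketch of the properties listed above, using nested hashes for the child links. The class is named `SimpleTrie` to make clear it is an illustration and not the `Trie` class used later in the talk; its method names are assumptions. Each operation walks one node per character, so its cost depends on the key length and not on how many words are stored.

```ruby
# Illustrative trie: one node per character, with a flag marking word ends.
class SimpleTrie
  Node = Struct.new(:children, :terminal)

  def initialize
    @root = Node.new({}, false)
  end

  # Create any missing nodes along the word's path, then mark the last one.
  def add(word)
    node = word.each_char.inject(@root) do |current, char|
      current.children[char] ||= Node.new({}, false)
    end
    node.terminal = true
    self
  end

  # A word is present only if its path exists and ends on a marked node.
  def include?(word)
    node = find(word)
    !node.nil? && node.terminal
  end

  # Prefix matching: every stored word that starts with prefix, in sorted order.
  def children(prefix)
    node = find(prefix)
    node.nil? ? [] : collect(node, prefix, [])
  end

  private

  # Follow the characters of string from the root; nil if the path breaks off.
  def find(string)
    string.each_char.inject(@root) do |current, char|
      return nil unless current.children.key?(char)
      current.children[char]
    end
  end

  # Depth-first walk that visits child characters in sorted order.
  def collect(node, prefix, found)
    found << prefix if node.terminal
    node.children.sort_by { |char, _| char }.each do |char, child|
      collect(child, prefix + char, found)
    end
    found
  end
end

trie = SimpleTrie.new
%w[thin trap bar burp].each { |word| trie.add(word) }
trie.include?('trap')   # true: the path t, r, a, p exists and ends a word
trie.include?('bupkis') # false: no "p" follows "bu"
trie.children('t')      # the stored words starting with "t"
```

A hash per node is simple but not compact; the memory advantages mentioned above depend on a denser node representation.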

Trie
add: “thin”
T



Trie
add: “trap”
T

H R

I A

N P



Trie
add: “bar”
B T

A H R

R I A

N P



Trie
add: “burp”
B T

U A H R

R R I A

P N P

Trie
query: “trap”
B T

U A H R

R R I A

P N P

Success!

Trie
query: “bumpkin”
B T

U A H R

R R I A

P N P

Trie
query: “bupkis”
B T

U A H R

R R I A

P N P

Fail!

Trie

Example: Autocompleter



Trie
class Autocompleter
  def initialize(words)
    @trie = Trie.new
    words.each { |word| @trie.add(word) }
  end

  # Returns the stored words that start with the given prefix.
  def query(word)
    @trie.children(word)
  end
end



Trie
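require 'json'
require 'rack'

class Autocompleter
  def initialize(words)
    @trie = Trie.new
    words.each { |word| @trie.add(word) }
  end

  # Rack entry point: responds with the completions for the requested prefix.
  def call(env)
    request = Rack::Request.new(env)
    # Assumption: the prefix arrives as a "word" query parameter; the slide
    # leaves `word` undefined.
    word = request.params['word'].to_s
    [200,
     { 'content-type' => 'application/json' },
     [@trie.children(word).to_json]]
  end
end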



Conclusion: Data structures are cool.



Questions?
