Alternative Data Structures in Ruby

Tyler McMullen
Friday, February 19, 2010

Why?
• Speed
• Memory
• Clarity

What’s wrong with my favorite data structure, X?

Nothing. (Maybe.)

• Bloom Filter
• BK-tree
• Splay Tree
• Trie

Bloom Filter

Bloom Filters
• Tests for existence in a set
• Probabilistic
• Minimal memory use

100 million strings in a Set

Traditional Set: Minimum 10 GB
Bloom Filter (0.00001 false-positive rate): 280 MB
Bloom Filter (0.001 false-positive rate): 170 MB
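
The Bloom filter figures above are in line with the standard sizing formula, m = -n ln(p) / (ln 2)^2 bits for n items at false-positive rate p. A rough sketch, assuming the values in parentheses are false-positive rates; `bloom_bits` and `bloom_hashes` are illustrative names, not part of the talk:

```ruby
# Standard Bloom filter sizing formulas (textbook results, not the talk's code).
# n: expected number of items, p: target false-positive rate.
def bloom_bits(n, p)
  (-n * Math.log(p) / (Math.log(2)**2)).ceil
end

# Optimal number of hash functions for m bits holding n items.
def bloom_hashes(m, n)
  ((m.to_f / n) * Math.log(2)).round
end

n = 100_000_000
[0.00001, 0.001].each do |p|
  m = bloom_bits(n, p)
  mib = m / 8.0 / (1024 * 1024)
  puts format("p=%g: %.0f MiB, %d hash functions", p, mib, bloom_hashes(m, n))
end
```

This comes out at roughly 286 MiB and 171 MiB, the same ballpark as the numbers on the slide.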

[Figure: a bit array with eight slots, numbered 0 to 7]

add: “to be or not to be”
add: “that is the question”

query: “whether ’tis nobler”
NO MATCH

query: “to be or not to be”
MATCH

query: “in the mind to suffer”
FALSE MATCH
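
The add and query steps above can be sketched in a few lines of Ruby. This is a minimal illustration, not the code from the talk: the `BloomFilter` class, its keyword arguments, and the choice of salted MD5 digests as hash functions are all assumptions, and the eight-slot default only mirrors the slide.

```ruby
require 'digest/md5'

# A deliberately tiny Bloom filter sketch (illustrative only).
class BloomFilter
  def initialize(size: 8, hashes: 2)
    @size = size
    @hashes = hashes
    @bits = Array.new(size, false)
  end

  # Set every bit position the item hashes to.
  def add(item)
    positions(item).each { |i| @bits[i] = true }
    self
  end

  # false means "definitely never added"; true means "probably added".
  def include?(item)
    positions(item).all? { |i| @bits[i] }
  end

  private

  # Derive @hashes positions by salting one digest with a seed (an assumption;
  # any set of independent hash functions would do).
  def positions(item)
    (0...@hashes).map do |seed|
      Digest::MD5.hexdigest("#{seed}:#{item}").to_i(16) % @size
    end
  end
end

filter = BloomFilter.new(size: 8, hashes: 2)
filter.add("to be or not to be").add("that is the question")

filter.include?("to be or not to be")   # always true: no false negatives
filter.include?("whether 'tis nobler")  # false, unless its bits happen to collide
```

With only eight slots, false matches are likely; sizing the bit array for the expected item count is what keeps the error rate low.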

File Server

[Diagram: Request → File Server → exists? → Y: 200, N: 404]

[Diagram: Request → Bloom Filter → File Server → exists? → Y: 200, N: 404]

Bloom Filter
• Test for existence in set
• Tiny Memory Footprint
• Excellent Speed

BK-tree

BK-tree
• find items within a distance of a target
• reduces search space
• works inside a metric space

Triangle Inequality
| d(x, y) - d(x, z) | ≤ d(y, z)

[Diagram: three points x, y, z with d(x, y) = 4 and d(x, z) = 1; d(y, z) is unknown]

| 4 - 1 | ≤ d(y, z)
3 ≤ d(y, z)
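
The form used on these slides is the reverse triangle inequality. A short derivation, assuming only that d is a metric (symmetric, and satisfying the usual triangle inequality):

```latex
\begin{align*}
d(x, y) &\le d(x, z) + d(z, y) &&\Rightarrow\quad d(x, y) - d(x, z) \le d(y, z) \\
d(x, z) &\le d(x, y) + d(y, z) &&\Rightarrow\quad d(x, z) - d(x, y) \le d(y, z) \\
&&&\Rightarrow\quad \lvert d(x, y) - d(x, z) \rvert \le d(y, z)
\end{align*}
```

This lower bound is what lets a BK-tree skip whole subtrees without computing any distances inside them.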

BK-tree
taser paste shave light pastor pasta

BK-tree
[Diagram: tree rooted at “paste”; each edge is labeled with the distance to the child: 1 pasta, 2 pastor, 3 taser, 4 shave, 5 light]

BK-tree
[Diagram: query “pastu” is at distance 1 from the root “paste”, so only edges 1 and 2 are followed; edges 3, 4, and 5 (taser, shave, light) are skipped. “pastu” is at distance 1 from “pasta” and distance 2 from “pastor”.]

BK-tree
• Most often used for spelling correctors
• Work in any metric space
• Reduce the search space
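
A minimal BK-tree sketch in Ruby, using Levenshtein edit distance as the metric and the word list from the slides. The `BKTree` class, its method names, and the `levenshtein` helper are illustrative assumptions, not the implementation from the talk.

```ruby
# Levenshtein edit distance between two strings (classic dynamic programming).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Minimal BK-tree: every edge is labeled with the distance between
# the parent word and the child word.
class BKTree
  Node = Struct.new(:word, :children) # children: { distance => Node }

  def initialize
    @root = nil
  end

  def add(word)
    return @root = Node.new(word, {}) if @root.nil?

    node = @root
    loop do
      d = levenshtein(word, node.word)
      return if d.zero? # already present
      child = node.children[d]
      return node.children[d] = Node.new(word, {}) if child.nil?

      node = child # descend along the edge labeled d
    end
  end

  # All stored words within `tolerance` edits of `target`.
  def search(target, tolerance)
    results = []
    stack = @root ? [@root] : []
    until stack.empty?
      node = stack.pop
      d = levenshtein(target, node.word)
      results << node.word if d <= tolerance
      # Triangle inequality: matches can only lie on edges whose label
      # is within `tolerance` of d, so every other subtree is skipped.
      node.children.each do |edge, child|
        stack << child if (edge - d).abs <= tolerance
      end
    end
    results
  end
end

tree = BKTree.new
%w[paste pasta pastor taser shave light].each { |w| tree.add(w) }
tree.search("pastu", 1).sort # => ["pasta", "paste"]
```

For the query “pastu” with a tolerance of 1, the root is at distance 1, so only the edges labeled 0 to 2 are explored.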

Splay Tree

Tangent: Access Patterns

Access Patterns
Usually assumed to be random or even.

Access Patterns
Rarely the case.

Splay Tree
• Self-adjusting binary search tree
• Moves each accessed item to the root, so frequently accessed items stay near the top
• The more uneven the access pattern, the better

Splay Tree
[Diagram: a binary search tree over the keys 1 to 14, rooted at 7. Over the following slides, node 9 is rotated upward until it becomes the root.]

Splay Tree
• Made for very uneven access patterns
• Caches, garbage collectors, etc.
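
A minimal splay tree sketch in Ruby, using the common recursive formulation of the splay step. The `SplayTree` class and its method names are illustrative assumptions, not code from the talk.

```ruby
# Minimal splay tree sketch (keys must be mutually comparable).
class SplayTree
  Node = Struct.new(:key, :left, :right)

  attr_reader :root

  def initialize
    @root = nil
  end

  # Plain binary-search-tree insert, then splay the new key to the root.
  def insert(key)
    @root = insert_node(@root, key)
    @root = splay(@root, key)
    self
  end

  # Returns true if key is present. Either way, the last node visited on
  # the search path is rotated up to the root.
  def include?(key)
    @root = splay(@root, key)
    !@root.nil? && @root.key == key
  end

  private

  def insert_node(node, key)
    return Node.new(key) if node.nil?

    if key < node.key
      node.left = insert_node(node.left, key)
    elsif key > node.key
      node.right = insert_node(node.right, key)
    end
    node
  end

  def rotate_right(node)
    pivot = node.left
    node.left = pivot.right
    pivot.right = node
    pivot
  end

  def rotate_left(node)
    pivot = node.right
    node.right = pivot.left
    pivot.left = node
    pivot
  end

  # Bring `key` (or the last node on its search path) to the top of `node`.
  def splay(node, key)
    return node if node.nil? || node.key == key

    if key < node.key
      return node if node.left.nil?

      if key < node.left.key    # zig-zig
        node.left.left = splay(node.left.left, key)
        node = rotate_right(node)
      elsif key > node.left.key # zig-zag
        node.left.right = splay(node.left.right, key)
        node.left = rotate_left(node.left) if node.left.right
      end
      node.left.nil? ? node : rotate_right(node)
    else
      return node if node.right.nil?

      if key > node.right.key    # zig-zig
        node.right.right = splay(node.right.right, key)
        node = rotate_left(node)
      elsif key < node.right.key # zig-zag
        node.right.left = splay(node.right.left, key)
        node.right = rotate_right(node.right) if node.right.left
      end
      node.right.nil? ? node : rotate_left(node)
    end
  end
end

tree = SplayTree.new
(1..14).each { |k| tree.insert(k) }
tree.include?(9) # => true
tree.root.key    # => 9
```

Because every lookup restructures the tree, a key that is requested repeatedly is found at or near the root on later lookups.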

Trie

Trie
• O(k) lookup, add, and removal for a key of length k, regardless of how many keys are stored
• Ordered traversals
• Prefix matching
• Excellent memory usage (depending on implementation)

Trie
add: “thin”
add: “trap”
add: “bar”
add: “burp”
[Diagram: a trie with the branches B-A-R, B-U-R-P, T-H-I-N, and T-R-A-P]

Trie
query: “trap”
Success!

Trie
query: “bupkis”
Fail!
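
The add and query walk-through above can be sketched with nested hashes. This is a minimal illustration under assumptions: the `add` and `children` method names mirror the calls in the talk's Autocompleter example, while `has_key?` and all of the internals are hypothetical and may differ from the library the talk used.

```ruby
# Minimal hash-based trie sketch (illustrative, not the talk's library).
class Trie
  Node = Struct.new(:children, :terminal)

  def initialize
    @root = Node.new({}, false)
  end

  # Walk or create one node per character, then mark the end of the word.
  def add(word)
    node = @root
    word.each_char { |ch| node = (node.children[ch] ||= Node.new({}, false)) }
    node.terminal = true
    self
  end

  def has_key?(word)
    node = find(word)
    !node.nil? && node.terminal
  end

  # All stored words that start with `prefix`, in sorted order.
  def children(prefix)
    node = find(prefix)
    return [] if node.nil?

    collect(node, prefix, [])
  end

  private

  # Follow one branch per character; nil as soon as a branch is missing.
  def find(string)
    node = @root
    string.each_char do |ch|
      node = node.children[ch]
      return nil if node.nil?
    end
    node
  end

  def collect(node, word, out)
    out << word if node.terminal
    node.children.keys.sort.each { |ch| collect(node.children[ch], word + ch, out) }
    out
  end
end

trie = Trie.new
%w[thin trap bar burp].each { |w| trie.add(w) }
trie.has_key?("trap")   # => true
trie.has_key?("bupkis") # => false
trie.children("b")      # => ["bar", "burp"]
```

The lookup cost depends only on the length of the key: “bupkis” fails at its third character, however many words are stored.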

Trie
Example: Autocompleter

Trie
    class Autocompleter
      def initialize(words)
        @trie = Trie.new
        words.each { |word| @trie.add(word) }
      end

      # Returns the stored words that start with the given prefix.
      def query(word)
        @trie.children(word)
      end
    end

Trie
    require 'json'
    require 'rack'

    class Autocompleter
      def initialize(words)
        @trie = Trie.new
        words.each { |word| @trie.add(word) }
      end

      # Rack entry point: responds with the completions as a JSON array.
      def call(env)
        request = Rack::Request.new(env)
        # The slide never assigns `word`; reading it from a "word" query
        # parameter is an assumption.
        word = request.params['word'].to_s
        [200,
         { 'content-type' => 'application/json' },
         [@trie.children(word).to_json]]
      end
    end

Conclusion: Data structures are cool.

Questions?
