
Algorithms, Fourth Edition
Robert Sedgewick | Kevin Wayne
http://algs4.cs.princeton.edu

5.2 Tries
‣ R-way tries
‣ ternary search tries
‣ character-based operations

Summary of the performance of symbol-table implementations

Order of growth of the frequency of operations.

                   typical case
implementation     search   insert   delete   ordered operations   operations on keys
red-black BST      log N    log N    log N    ✔                    compareTo()
hash table         1†       1†       1†                            equals(), hashCode()

† under uniform hashing assumption

Q. Can we do better?
A. Yes, if we can avoid examining the entire key, as with string sorting.
(Use array accesses to make R-way decisions instead of binary decisions.)

String symbol table basic API

String symbol table. Symbol table specialized to string keys.

public class StringST<Value>
    StringST()                         create an empty symbol table
    void put(String key, Value val)    put key-value pair into the symbol table
    Value get(String key)              return value paired with given key
    void delete(String key)            delete key and corresponding value
    ⋮

Goal. Faster than hashing, more flexible than BSTs.

String symbol table implementations cost summary

                            character accesses (typical case)                              dedup
implementation              search hit     search miss   insert     space (references)   moby.txt   actors.txt
red-black BST               L + c lg² N    c lg² N       c lg² N    4N                   1.40       97.4
hashing (linear probing)    L              L             L          4N to 16N            0.76       40.6

Parameters
• N = number of strings
• L = length of string
• R = radix

file         size     words    distinct
moby.txt     1.2 MB   210 K    32 K
actors.txt   82 MB    11.4 M   900 K

Challenge. Efficient performance for string keys.

Tries

Tries. [from retrieval, but pronounced "try"]
・Store characters in nodes (not keys).
・Each node has R children, one for each possible character.
(for now, we do not draw null links)

key      value
by       4
sea      6
sells    1
she      0
shells   3
shore    7
the      5

[Figure: trie for the keys above. The root has a link to the trie for all keys that start with s, which in turn links to the trie for all keys that start with she. The value for she is in the node corresponding to its last key character.]

Search in a trie

Follow links corresponding to each character in the key.
・Search hit: node where search ends has a non-null value.
・Search miss: reach null link or node where search ends has null value.

get("shells"): return value associated with last key character (return 3).
get("she"): search may terminate at an intermediate node (return 0).

get("shell"): no value associated with last key character (return null).

get("shelter"): no link to t (return null).

Insertion into a trie

Follow links corresponding to each character in the key.
・Encounter a null link: create new node.
・Encounter the last character of the key: set value in that node.

Ex. put("shore", 7)
Trie construction demo

[Figure: trie after inserting the keys by, sea, sells, she, shells, shore, and the.]
Trie representation: Java implementation

Node. A value, plus references to R nodes.

    private static class Node
    {
        private Object val;                  // use Object instead of Value since no generic array creation in Java
        private Node[] next = new Node[R];
    }

[Figure: Trie representation. Neither keys nor characters are explicitly stored; characters are implicitly defined by link index. Each node has an array of links and a value.]

R-way trie: Java implementation

    public class TrieST<Value>
    {
        private static final int R = 256;    // extended ASCII
        private Node root = new Node();

        private static class Node
        { /* see previous slide */ }

        public void put(String key, Value val)
        { root = put(root, key, val, 0); }

        private Node put(Node x, String key, Value val, int d)
        {
            if (x == null) x = new Node();
            if (d == key.length()) { x.val = val; return x; }
            char c = key.charAt(d);
            x.next[c] = put(x.next[c], key, val, d+1);
            return x;
        }
R-way trie: Java implementation (continued)

        public boolean contains(String key)
        { return get(key) != null; }

        public Value get(String key)
        {
            Node x = get(root, key, 0);
            if (x == null) return null;
            return (Value) x.val;            // cast needed
        }

        private Node get(Node x, String key, int d)
        {
            if (x == null) return null;
            if (d == key.length()) return x;
            char c = key.charAt(d);
            return get(x.next[c], key, d+1);
        }
    }

Trie performance

Search hit. Need to examine all L characters for equality.

Search miss.
・Could have mismatch on first character.
・Typical case: examine only a few characters (sublinear).

Space. R null links at each leaf.
(but sublinear space possible if many short strings share common prefixes)

Bottom line. Fast search hit and even faster search miss, but wastes space.
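The search cases traced earlier (hit, miss at a node with a null value, miss at a null link) can be reproduced with a small self-contained sketch. The class name TrieSketch and the input sentence are assumptions made for illustration; the sentence is chosen so that each key's value (the index of its last occurrence) agrees with the key-value table shown earlier. The sketch assumes keys contain only extended-ASCII characters.

```java
// Minimal R-way trie assembled from the slide code so it can be run on its own.
// Assumption: every key character has a code point below R = 256.
class TrieSketch<Value> {
    private static final int R = 256;          // extended ASCII
    private Node root = new Node();

    private static class Node {
        Object val;                            // Object, since Java has no generic array creation
        Node[] next = new Node[R];
    }

    public void put(String key, Value val) { root = put(root, key, val, 0); }

    private Node put(Node x, String key, Value val, int d) {
        if (x == null) x = new Node();                      // null link: create new node
        if (d == key.length()) { x.val = val; return x; }   // last character: set value
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    @SuppressWarnings("unchecked")
    public Value get(String key) {
        Node x = get(root, key, 0);
        if (x == null) return null;            // search ended at a null link
        return (Value) x.val;                  // may be null: node exists but holds no value
    }

    private Node get(Node x, String key, int d) {
        if (x == null) return null;
        if (d == key.length()) return x;
        return get(x.next[key.charAt(d)], key, d + 1);
    }

    public static void main(String[] args) {
        TrieSketch<Integer> st = new TrieSketch<>();
        String[] words = "she sells sea shells by the sea shore".split(" ");
        for (int i = 0; i < words.length; i++) st.put(words[i], i);
        System.out.println(st.get("shells"));   // 3
        System.out.println(st.get("she"));      // 0
        System.out.println(st.get("shell"));    // null
        System.out.println(st.get("shelter"));  // null
    }
}
```

Each node allocates an array of 256 links, which is the space cost the bottom line refers to.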

Deletion in an R-way trie

To delete a key-value pair:
・Find the node corresponding to key and set value to null.
・If node has null value and all null links, remove that node (and recur).

[Figure: delete("shells"). Set the value in the node for the last key character to null; that node then has a null value and all null links (delete node).]
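The two deletion steps can be sketched recursively: clear the value on the way down, then let each call return null when its node has become useless, which unlinks it from its parent. The class name TrieWithDelete and the nodeCount() helper are assumptions added here so the effect is observable; this is one way to write it, not necessarily the course's own code.

```java
// Sketch of deletion in an R-way trie (extended-ASCII keys assumed).
class TrieWithDelete<Value> {
    private static final int R = 256;
    private Node root = new Node();

    private static class Node {
        Object val;
        Node[] next = new Node[R];
    }

    public void put(String key, Value val) { root = put(root, key, val, 0); }

    private Node put(Node x, String key, Value val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    @SuppressWarnings("unchecked")
    public Value get(String key) {
        Node x = root;
        for (int d = 0; d < key.length() && x != null; d++) x = x.next[key.charAt(d)];
        return x == null ? null : (Value) x.val;
    }

    public void delete(String key) { root = delete(root, key, 0); }

    private Node delete(Node x, String key, int d) {
        if (x == null) return null;                       // key not present
        if (d == key.length()) x.val = null;              // step 1: set value to null
        else {
            char c = key.charAt(d);
            x.next[c] = delete(x.next[c], key, d + 1);
        }
        if (x.val != null) return x;                      // node still holds a key: keep it
        for (Node child : x.next)
            if (child != null) return x;                  // node still has a link: keep it
        return null;                                      // step 2: remove node (caller recurs)
    }

    // Hypothetical helper for illustration: number of nodes currently in the trie.
    public int nodeCount() { return count(root); }

    private int count(Node x) {
        if (x == null) return 0;
        int n = 1;
        for (Node child : x.next) n += count(child);
        return n;
    }
}
```

With only she and shells stored, delete("shells") removes the three nodes below she and stops at the node holding the value for she.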
String symbol table implementations cost summary

                            character accesses (typical case)                              dedup
implementation              search hit     search miss   insert     space (references)   moby.txt   actors.txt
red-black BST               L + c lg² N    c lg² N       c lg² N    4N                   1.40       97.4
hashing (linear probing)    L              L             L          4N to 16N            0.76       40.6
R-way trie                  L              log_R N       L          (R+1) N              1.12       out of memory

R-way trie.
・Method of choice for small R.
・Too much memory for large R.

Challenge. Use less memory, e.g., 65,536-way trie for Unicode!

Ternary search tries

・Store characters and values in nodes (not keys).
・Each node has 3 children: smaller (left), equal (middle), larger (right).

[Figure: first page of the paper "Fast Algorithms for Sorting and Searching Strings" by Jon L. Bentley and Robert Sedgewick.]

[Figure: TST representation of a trie. The root's left link leads to the TST for all keys that start with a letter before s; its middle link leads to the TST for all keys that start with s. Each node has three links.]
Search hit in a TST

get("sea"): return value associated with last key character.

Search miss in a TST

get("shelter"): no link to t (return null).

Ternary search trie construction demo

[Figure: ternary search trie after inserting the keys by, sea, sells, she, shells, shore, and the.]
Search in a TST

Follow links corresponding to each character in the key.
・If less, take left link; if greater, take right link.
・If equal, take the middle link and move to the next key character.

Search hit. Node where search ends has a non-null value.
Search miss. Reach a null link or node where search ends has null value.

[Figure: TST search example, get("sea"). Match: take middle link, move to next char. Mismatch: take left or right link, do not move to next char. Return value associated with last key character.]

26-way trie vs. TST

26-way trie. 26 null links in each leaf.
TST. 3 null links in each leaf.

Keys: now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad caw cue fee tap ago tar jam dug and

[Figure: 26-way trie (1035 null links, not shown) and TST (155 null links) for the keys above.]

TST representation in Java

A TST node is five fields:
・A value.
・A character c.
・A reference to a left TST.
・A reference to a middle TST.
・A reference to a right TST.

    private class Node
    {
        private Value val;
        private char c;
        private Node left, mid, right;
    }

[Figure: Trie node representations. Standard array of links (R = 26) versus ternary search tree (TST), each showing the link for keys that start with s and the link for keys that start with su.]

TST: Java implementation

    public class TST<Value>
    {
        private Node root;

        private class Node
        { /* see previous slide */ }

        public void put(String key, Value val)
        { root = put(root, key, val, 0); }

        private Node put(Node x, String key, Value val, int d)
        {
            char c = key.charAt(d);
            if (x == null) { x = new Node(); x.c = c; }
            if      (c < x.c)              x.left  = put(x.left,  key, val, d);
            else if (c > x.c)              x.right = put(x.right, key, val, d);
            else if (d < key.length() - 1) x.mid   = put(x.mid,   key, val, d+1);
            else                           x.val   = val;
            return x;
        }
TST: Java implementation (continued)

        public boolean contains(String key)
        { return get(key) != null; }

        public Value get(String key)
        {
            Node x = get(root, key, 0);
            if (x == null) return null;
            return x.val;
        }

        private Node get(Node x, String key, int d)
        {
            if (x == null) return null;
            char c = key.charAt(d);
            if      (c < x.c)              return get(x.left,  key, d);
            else if (c > x.c)              return get(x.right, key, d);
            else if (d < key.length() - 1) return get(x.mid,   key, d+1);
            else                           return x;
        }
    }

String symbol table implementation cost summary

                            character accesses (typical case)                              dedup
implementation              search hit     search miss   insert     space (references)   moby.txt   actors.txt
red-black BST               L + c lg² N    c lg² N       c lg² N    4N                   1.40       97.4
hashing (linear probing)    L              L             L          4N to 16N            0.76       40.6
R-way trie                  L              log_R N       L          (R+1) N              1.12       out of memory
TST                         L + ln N       ln N          L + ln N   4N                   0.72       38.7

Remark. Can build balanced TSTs via rotations to achieve L + log N worst-case guarantees.

Bottom line. TST is as fast as hashing (for string keys), space efficient.
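For experimentation, the TST code can be assembled into one self-contained class. The class name TSTSketch is an assumption, and so is the explicit empty-key check: the slide code calls key.charAt(d) unconditionally, which would fail on an empty string, so this sketch rejects empty keys up front.

```java
// Self-contained TST sketch assembled from the slide code.
// Assumption added here: empty keys are rejected, since charAt(0) would fail on them.
class TSTSketch<Value> {
    private Node root;

    private class Node {
        Value val;
        char c;
        Node left, mid, right;
    }

    public void put(String key, Value val) {
        if (key.isEmpty()) throw new IllegalArgumentException("key must be non-empty");
        root = put(root, key, val, 0);
    }

    private Node put(Node x, String key, Value val, int d) {
        char c = key.charAt(d);
        if (x == null) { x = new Node(); x.c = c; }
        if      (c < x.c)              x.left  = put(x.left,  key, val, d);     // mismatch: same character
        else if (c > x.c)              x.right = put(x.right, key, val, d);     // mismatch: same character
        else if (d < key.length() - 1) x.mid   = put(x.mid,   key, val, d + 1); // match: next character
        else                           x.val   = val;                           // match on last character
        return x;
    }

    public Value get(String key) {
        if (key.isEmpty()) throw new IllegalArgumentException("key must be non-empty");
        Node x = get(root, key, 0);
        return x == null ? null : x.val;
    }

    private Node get(Node x, String key, int d) {
        if (x == null) return null;
        char c = key.charAt(d);
        if      (c < x.c)              return get(x.left,  key, d);
        else if (c > x.c)              return get(x.right, key, d);
        else if (d < key.length() - 1) return get(x.mid,   key, d + 1);
        else                           return x;
    }

    public boolean contains(String key) { return get(key) != null; }
}
```

Unlike the R-way trie, a node here costs a fixed handful of fields regardless of alphabet size, and characters are compared rather than used as array indices, so any char value is acceptable in a key.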

TST with R² branching at root

Hybrid of R-way trie and TST.
・Do R²-way branching at root.
・Each of R² root nodes points to a TST.

[Figure: array of 26² roots (aa, ab, ac, …, zy, zz), each pointing to a TST.]

Q. What about one- and two-letter words?

String symbol table implementation cost summary

                            character accesses (typical case)                              dedup
implementation              search hit     search miss   insert     space (references)   moby.txt   actors.txt
red-black BST               L + c lg² N    c lg² N       c lg² N    4N                   1.40       97.4
hashing (linear probing)    L              L             L          4N to 16N            0.76       40.6
R-way trie                  L              log_R N       L          (R+1) N              1.12       out of memory
TST                         L + ln N       ln N          L + ln N   4N                   0.72       38.7
TST with R²                 L + ln N       ln N          L + ln N   4N + R²              0.51       32.7

Bottom line. Faster than hashing for our benchmark client.
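One possible shape for the hybrid is sketched below. The first two characters select one of R² roots, and the rest of the key (from the third character on) is stored in that root's TST. The handling of one- and two-letter keys is purely an assumption of this sketch, offered as one answer to the question above: their values are kept in two direct-indexed arrays. The class name HybridTST and all helper names are hypothetical.

```java
// Sketch: R^2-way branching at the root, TSTs below (one possible design, not the course's code).
// Assumption: the first two characters of every key have code points below R = 256.
class HybridTST<Value> {
    private static final int R = 256;
    private final Object[] oneCharVals = new Object[R];       // values of one-letter keys
    private final Object[] twoCharVals = new Object[R * R];   // values of two-letter keys
    private final Node[] roots = new Node[R * R];             // TSTs for keys of length >= 3

    private static class Node {
        Object val;
        char c;
        Node left, mid, right;
    }

    private static void check(String key) {
        if (key.isEmpty() || key.charAt(0) >= R || (key.length() > 1 && key.charAt(1) >= R))
            throw new IllegalArgumentException("key must be non-empty with first two characters below R");
    }

    // Index of the root selected by the first two characters.
    private static int slot(String key) { return key.charAt(0) * R + key.charAt(1); }

    public void put(String key, Value val) {
        check(key);
        if      (key.length() == 1) oneCharVals[key.charAt(0)] = val;
        else if (key.length() == 2) twoCharVals[slot(key)] = val;
        else    roots[slot(key)] = put(roots[slot(key)], key, val, 2);   // TST starts at third character
    }

    private Node put(Node x, String key, Object val, int d) {
        char c = key.charAt(d);
        if (x == null) { x = new Node(); x.c = c; }
        if      (c < x.c)              x.left  = put(x.left,  key, val, d);
        else if (c > x.c)              x.right = put(x.right, key, val, d);
        else if (d < key.length() - 1) x.mid   = put(x.mid,   key, val, d + 1);
        else                           x.val   = val;
        return x;
    }

    @SuppressWarnings("unchecked")
    public Value get(String key) {
        check(key);
        if (key.length() == 1) return (Value) oneCharVals[key.charAt(0)];
        if (key.length() == 2) return (Value) twoCharVals[slot(key)];
        Node x = roots[slot(key)];
        int d = 2;
        while (x != null) {
            char c = key.charAt(d);
            if      (c < x.c)              x = x.left;
            else if (c > x.c)              x = x.right;
            else if (d < key.length() - 1) { x = x.mid; d++; }
            else                           return (Value) x.val;
        }
        return null;
    }
}
```

This sketch spends more than the R² extra references listed in the cost table, because of the two auxiliary value arrays; a tighter layout is possible.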
TST vs. hashing

Hashing.
・Need to examine entire key.
・Search hits and misses cost about the same.
・Performance relies on hash function.
・Does not support ordered symbol table operations.

TSTs.
・Works only for string (or digital) keys.
・Examines only just enough key characters.
・Search miss may involve only a few characters.
・Supports ordered symbol table operations (plus extras!).

Bottom line. TSTs are:
・Faster than hashing (especially for search misses).
・More flexible than red-black BSTs. [stay tuned]

String symbol table API

Character-based operations. The string symbol table API supports several useful character-based operations.

key      value
by       4
sea      6
sells    1
she      0
shells   3
shore    7
the      5

Prefix match. Keys with prefix sh: she, shells, and shore.

Wildcard match. Keys that match .he: she and the.

Longest prefix. Key that is the longest prefix of shellsort: shells.

public class StringST<Value>
    StringST()                                   create a symbol table with string keys
    void put(String key, Value val)              put key-value pair into the symbol table
    Value get(String key)                        value paired with key
    void delete(String key)                      delete key and corresponding value
    ⋮
    Iterable<String> keys()                      all keys
    Iterable<String> keysWithPrefix(String s)    keys having s as a prefix
    Iterable<String> keysThatMatch(String s)     keys that match s (where . is a wildcard)
    String longestPrefixOf(String s)             longest key that is a prefix of s

Remark. Can also add other ordered ST methods, e.g., floor() and rank().
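Wildcard matching is listed in the API without code, so here is a hedged sketch of how keysThatMatch() might look on an R-way trie: walk the trie one pattern character at a time, following every non-null link when the pattern character is the wildcard and only the matching link otherwise. The class name WildcardTrie and the use of java.util.List for the result are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of wildcard matching in an R-way trie ('.' matches any single character).
// Assumption: keys contain only extended-ASCII characters.
class WildcardTrie<Value> {
    private static final int R = 256;
    private Node root = new Node();

    private static class Node {
        Object val;
        Node[] next = new Node[R];
    }

    public void put(String key, Value val) { root = put(root, key, val, 0); }

    private Node put(Node x, String key, Value val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    public List<String> keysThatMatch(String pattern) {
        List<String> out = new ArrayList<>();
        collect(root, "", pattern, out);
        return out;
    }

    // prefix is the sequence of characters on the path from the root to x.
    private void collect(Node x, String prefix, String pattern, List<String> out) {
        if (x == null) return;
        int d = prefix.length();
        if (d == pattern.length()) {                 // whole pattern consumed
            if (x.val != null) out.add(prefix);      // a key ends exactly here
            return;
        }
        char next = pattern.charAt(d);
        for (char c = 0; c < R; c++)
            if (next == '.' || next == c)            // wildcard: try every link
                collect(x.next[c], prefix + c, pattern, out);
    }
}
```

Because links are tried in increasing character order, matches come out in sorted order; for the example table, the pattern .he yields she and then the.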
Warmup: ordered iteration

To iterate through all keys in sorted order:
・Do inorder traversal of trie; add keys encountered to a queue.
・Maintain sequence of characters on path from root to node.

Collecting the keys in a trie (trace) for keys(), which is the same as keysWithPrefix(""):

key      q
b
by       by
s
se
sea      by sea
sel
sell
sells    by sea sells
sh
she      by sea sells she
shell
shells   by sea sells she shells
sho
shor
shore    by sea sells she shells shore
t
th
the      by sea sells she shells shore the

Ordered iteration: Java implementation

    public Iterable<String> keys()
    {
        Queue<String> queue = new Queue<String>();
        collect(root, "", queue);
        return queue;
    }

    // prefix = sequence of characters on path from root to x
    private void collect(Node x, String prefix, Queue<String> q)
    {
        if (x == null) return;
        if (x.val != null) q.enqueue(prefix);
        for (char c = 0; c < R; c++)
            collect(x.next[c], prefix + c, q);    // or use StringBuilder
    }
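The "or use StringBuilder" remark can be made concrete. The sketch below keeps one mutable StringBuilder for the path from the root: a character is appended before descending and removed after returning, so no new string is built per link. The class name OrderedTrie is an assumption, and java.util.ArrayList stands in for the Queue type used on the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of ordered iteration with a shared StringBuilder (extended-ASCII keys assumed).
class OrderedTrie<Value> {
    private static final int R = 256;
    private Node root = new Node();

    private static class Node {
        Object val;
        Node[] next = new Node[R];
    }

    public void put(String key, Value val) { root = put(root, key, val, 0); }

    private Node put(Node x, String key, Value val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    public List<String> keys() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    // prefix holds the characters on the path from the root to x.
    private void collect(Node x, StringBuilder prefix, List<String> out) {
        if (x == null) return;
        if (x.val != null) out.add(prefix.toString());   // a key ends at this node
        for (char c = 0; c < R; c++) {
            prefix.append(c);                            // extend path by c
            collect(x.next[c], prefix, out);
            prefix.deleteCharAt(prefix.length() - 1);    // restore path
        }
    }
}
```

A string is materialized only when a key is actually found, which is the saving over concatenating prefix + c at every link.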

Prefix matches

Find all keys in a symbol table starting with a given prefix.

Ex. Autocomplete in a cell phone, search bar, text editor, or shell.
・User types characters one at a time.
・System reports all matching strings.

Prefix matches in an R-way trie

Find all keys in a symbol table starting with a given prefix.

Prefix match in a trie, keysWithPrefix("sh"): find subtrie for all keys beginning with "sh", then collect keys in that subtrie.

key      q
sh
she      she
shel
shell
shells   she shells
sho
shor
shore    she shells shore

    public Iterable<String> keysWithPrefix(String prefix)
    {
        Queue<String> queue = new Queue<String>();
        Node x = get(root, prefix, 0);    // root of subtrie for all strings beginning with given prefix
        collect(x, prefix, queue);
        return queue;
    }
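A self-contained version of the prefix search is sketched below, as a way to try the autocomplete example. The class name PrefixTrie is an assumption, and java.util.ArrayList replaces the Queue type from the slide code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of keysWithPrefix() on an R-way trie (extended-ASCII keys assumed).
class PrefixTrie<Value> {
    private static final int R = 256;
    private Node root = new Node();

    private static class Node {
        Object val;
        Node[] next = new Node[R];
    }

    public void put(String key, Value val) { root = put(root, key, val, 0); }

    private Node put(Node x, String key, Value val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    private Node get(Node x, String key, int d) {
        if (x == null) return null;
        if (d == key.length()) return x;
        return get(x.next[key.charAt(d)], key, d + 1);
    }

    public List<String> keysWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node x = get(root, prefix, 0);       // root of the subtrie for the prefix, or null
        collect(x, prefix, out);             // collect every key in that subtrie
        return out;
    }

    private void collect(Node x, String prefix, List<String> out) {
        if (x == null) return;
        if (x.val != null) out.add(prefix);
        for (char c = 0; c < R; c++)
            collect(x.next[c], prefix + c, out);
    }
}
```

The cost of the first step depends only on the prefix length; the second step visits only the nodes of the matching subtrie.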


Longest prefix

Find longest key in symbol table that is a prefix of query string.

Ex. To send packet toward destination IP address, router chooses IP address in routing table that is longest prefix match.

Routing table keys (represented as 32-bit binary number for IPv4, instead of string):
"128"
"128.112"
"128.112.055"
"128.112.055.15"
"128.112.136"
"128.112.155.11"
"128.112.155.13"
"128.222"
"128.222.136"

longestPrefixOf("128.112.136.11") = "128.112.136"
longestPrefixOf("128.112.100.16") = "128.112"
longestPrefixOf("128.166.123.45") = "128"

Note. Not the same as floor: floor("128.112.100.16") = "128.112.055.15"

Longest prefix in an R-way trie

Find longest key in symbol table that is a prefix of query string.
・Search for query string.
・Keep track of longest key encountered.

Possibilities for longestPrefixOf():
・"she": search ends at end of string, value is not null; return she.
・"shell": search ends at end of string, value is null; return she (last key on path).
・"shellsort": search ends at null link; return shells (last key on path).
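The routing-table example can be checked with a short runnable sketch of the search-and-track idea: walk down the trie along the query and remember the depth of the last node that held a value. The class name LongestPrefixTrie is an assumption; the keys are the dotted strings listed above, treated as plain extended-ASCII text rather than 32-bit numbers.

```java
// Sketch of longestPrefixOf() on an R-way trie (extended-ASCII keys assumed).
class LongestPrefixTrie<Value> {
    private static final int R = 256;
    private Node root = new Node();

    private static class Node {
        Object val;
        Node[] next = new Node[R];
    }

    public void put(String key, Value val) { root = put(root, key, val, 0); }

    private Node put(Node x, String key, Value val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    // Returns the longest key that is a prefix of query, or "" if there is none.
    public String longestPrefixOf(String query) {
        int length = search(root, query, 0, 0);
        return query.substring(0, length);
    }

    // length = number of characters in the longest key seen so far on the path.
    private int search(Node x, String query, int d, int length) {
        if (x == null) return length;              // fell off the trie
        if (x.val != null) length = d;             // a key ends here: remember its length
        if (d == query.length()) return length;    // query exhausted
        char c = query.charAt(d);
        return search(x.next[c], query, d + 1, length);
    }

    public static void main(String[] args) {
        LongestPrefixTrie<Integer> table = new LongestPrefixTrie<>();
        String[] keys = { "128", "128.112", "128.112.055", "128.112.055.15", "128.112.136",
                          "128.112.155.11", "128.112.155.13", "128.222", "128.222.136" };
        for (int i = 0; i < keys.length; i++) table.put(keys[i], i);
        System.out.println(table.longestPrefixOf("128.112.136.11"));  // 128.112.136
        System.out.println(table.longestPrefixOf("128.112.100.16"));  // 128.112
        System.out.println(table.longestPrefixOf("128.166.123.45"));  // 128
    }
}
```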

Longest prefix in an R-way trie: Java implementation

Find longest key in symbol table that is a prefix of query string.
・Search for query string.
・Keep track of longest key encountered.

    public String longestPrefixOf(String query)
    {
        int length = search(root, query, 0, 0);
        return query.substring(0, length);
    }

    private int search(Node x, String query, int d, int length)
    {
        if (x == null) return length;
        if (x.val != null) length = d;
        if (d == query.length()) return length;
        char c = query.charAt(d);
        return search(x.next[c], query, d+1, length);
    }

T9 texting

Goal. Type text messages on a phone keypad.

Multi-tap input. Enter a letter by repeatedly pressing a key.
Ex. hello: 4 4 3 3 5 5 5 5 5 5 6 6 6

T9 text input. "a much faster and more fun way to enter text"
・Find all words that correspond to given sequence of numbers.
・Press 0 to see all completion options.
Ex. hello: 4 3 5 5 6

Q. How to implement?

www.t9.com
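One possible answer to the implementation question, offered only as a sketch: build a trie over digit sequences instead of letters, and store in each node the dictionary words whose keypad encoding ends there. The class name T9Sketch, the method names, and the restriction to lowercase a-z are all assumptions; the letter-to-digit mapping is the standard phone keypad (abc on 2 through wxyz on 9). The sketch covers only the lookup of words for a digit sequence, not the completion options.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: trie keyed on keypad digits; each node lists the words encoded by its path.
// Assumption: dictionary words consist of lowercase letters a-z only.
class T9Sketch {
    // Keypad digit for each letter a..z (abc=2, def=3, ghi=4, jkl=5, mno=6, pqrs=7, tuv=8, wxyz=9).
    private static final String DIGITS = "22233344455566677778889999";

    private static class Node {
        List<String> words = new ArrayList<>();
        Node[] next = new Node[10];               // one link per digit 0-9
    }

    private final Node root = new Node();

    private static int digitOf(char letter) {
        if (letter < 'a' || letter > 'z')
            throw new IllegalArgumentException("expected a lowercase letter a-z: " + letter);
        return DIGITS.charAt(letter - 'a') - '0';
    }

    // Insert a dictionary word under its digit encoding.
    public void add(String word) {
        Node x = root;
        for (int i = 0; i < word.length(); i++) {
            int d = digitOf(word.charAt(i));
            if (x.next[d] == null) x.next[d] = new Node();
            x = x.next[d];
        }
        if (!x.words.contains(word)) x.words.add(word);
    }

    // All dictionary words that correspond to the given sequence of digits.
    public List<String> wordsFor(String digits) {
        Node x = root;
        for (int i = 0; i < digits.length() && x != null; i++) {
            int d = digits.charAt(i) - '0';
            x = (d >= 2 && d <= 9) ? x.next[d] : null;   // only 2-9 carry letters
        }
        return x == null ? Collections.<String>emptyList()
                         : Collections.unmodifiableList(x.words);
    }
}
```

With hello in the dictionary, the sequence 43556 from the example retrieves it; sequences shared by several words (for instance good, home, and gone all encode to 4663) return every candidate in insertion order.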
Patricia trie

Patricia trie. [Practical Algorithm to Retrieve Information Coded in Alphanumeric]
・Remove one-way branching.
・Each node represents a sequence of characters.
・Implementation: one step beyond this course.

[Figure: Removing one-way branching in a trie, after put("shells", 1) and put("shellfish", 2). The standard trie has internal and external one-way branching; with no one-way branching, one node holds shell, with children s (value 1) and fish (value 2).]

Applications.
・Database search.
・P2P network search.
・IP routing tables: find longest prefix match.
・Compressed quad-tree for N-body simulation.
・Efficiently storing and querying XML documents.

Also known as: crit-bit tree, radix tree.

Suffix tree

Suffix tree.
・Patricia trie of suffixes of a string.
・Linear-time construction: well beyond scope of this course.

[Figure: suffix tree for BANANAS.]

Applications.
・Linear-time: longest repeated substring, longest common substring, longest palindromic substring, substring search, tandem repeats, ….
・Computational biology databases (BLAST, FASTA).

String symbol tables summary

A success story in algorithm design and analysis.

Red-black BST.
・Performance guarantee: log N key compares.
・Supports ordered symbol table API.
Hash tables.
・Performance guarantee: constant number of probes.
・Requires good hash function for key type.
Tries. R-way, TST.
・Performance guarantee: log N characters accessed.
・Supports character-based operations.

Bottom line. You can get at anything by examining 50-100 bits (!!!)