Professional Documents
Culture Documents
Distributed application
put(key, data) get (key) data
Distributed hash table (DHash)
lookup(key) node IP address
Lookup service (Chord)
Maintenance, optimization
Load balancing : maybe even change the key-hash node
mapping on the fly
Replicate entries on more nodes to increase robustness
Distributed Hash Table
The lookup problem
N2
N1 N3
Key=“title” Internet
Value=MP3 data… ? Client
Publisher
Lookup(“title”)
N4 N6
N5
Centralized lookup (Napster)
SetLoc(“title”, N4) N1 N2
N3
Client
Publisher@ N
4
DB Lookup(“title”)
Key=“title”
Value=MP3 data…
N9 N8
N7
N6
N6 N7 N8
N9
N1 N2
N3
Client
Publisher N4 Lookup(“title”)
Key=“title”
Value=MP3 data…
N6 N7 N8
N9
Routing challenges
Define a useful key nearness metric
Keep the hop count small
Keep the tables small
Stay robust despite rapid change
Function balances load: all nodes receive roughly the same number
of keys
When an Nth node joins (or leaves) the network, only an O(1/N)
fraction of the keys are moved to a different location
Chord IDs
Key identifier = SHA-1(key)
Node identifier = SHA-1(IP, Port)
Both are uniformly distributed
N105 K20
Circular 7-bit
ID space N32
N90
K80
A key is stored at its successor: node with next higher ID
Successor Nodes
identifier
node
6
X key
1
0 successor(1) = 1
7 1
identifier
successor(6) = 0 6 6 circle 2 2 successor(2) = 3
5 3
4
2
Node Joins and Departures
6 1
successor(6) = 7 0
7 1 successor(1) = 3
6 2
5 3
4
2 1
Consistent Hashing – Join and Departure
keys
0 1
7 1
keys
6 2
keys
5 3 2
4
Consistent Hashing – Node Dep.
keys
7
keys
0 1
7 1
keys
6 6 2
keys
5 3 2
4
Scalable Key Location
Queries for a given identifier can be passed around the circle via
these successor pointers
A Simple Key Lookup
Pseudo code for finding successor:
// ask node n to find the successor of id
n.find_successor(id)
if (id (n, successor])
return successor;
else
// forward the query around the circle
return successor.find_successor(id);
Scalable Key Location
ith entry in the table at node n contains the identity of the first
node, s, that succeeds n by at least 2i-1 on the identifier circle
7 1 1+2 0
2 3
1+21 3 3
1+22 5 0
6 2
7 1 2 [2,3) 3
3 [3,5) 3
5 [5,1) 0
6 2
7 1 2 [2,3) 3
3 [3,5) 3
5 [5,1) 06
finger table keys
start int. succ. 6 2
7 [7,0) 0
0 [0,2) 0 finger table keys
2 [2,6) 3
5 3 start int. succ. 2
4 [4,5) 06
4 5 [5,7) 06
7 [7,3) 0
Node Departures – with Finger Tables
finger table keys
start int. succ.
1 [1,2) 13
2 [2,4) 3
4 [4,0) 06
7 1 2 [2,3) 3
3 [3,5) 3
5 [5,1) 06
finger table keys
start int. succ. 6 6 2
7 [7,0) 0
0 [0,2) 0 finger table keys
2 [2,6) 3
5 3 start int. succ. 2
4 [4,5) 6
4 5 [5,7) 6
7 [7,3) 00
Simple Key Location Scheme
N1
N8
K45
N48
N14
N42
N38
N32 N21
Scalable Lookup Scheme
N1
Finger Table for N8
N56
N8 N8+1 N14
N51
N8+2 N14
N48 finger 6 finger 1,2,3
N8+4 N14
N14
N8+8 N21
finger 5
N8+16 N32
N42
finger 4 N8+32 N42
N38
N32 N21
finger [k] = first node that succeeds (n+2k-1)mod2m
Chord key location
Lookup in finger
table the
furthest node
that precedes
key
-> O(log n) hops
Scalable Lookup Scheme
N48
N14
N42
N38
N32 N21
Scalable Lookup Scheme
Each node forwards query at least halfway
along distance remaining to the target
¼ ½
1/8
1/16
1/32
1/64
1/128
N80
Dynamic Operations and Failures
Contains 6 functions:
create()
join()
stabilize()
notify()
fix_fingers()
check_predecessor()
Create()
n.stabilize()
x = successor.predecessor;
if (x (n, successor))
successor = x;
successor.notify(n);
Notify()
m thinks it might be n’s predecessor
n.notify(m)
if (predecessor is nil or m (predecessor, n))
predecessor = m;
Fix_fingers()
Periodically called to make sure that finger table entries
are correct
New nodes initialize their finger tables
Existing nodes incorporate new nodes into their finger tables
n.fix_fingers()
next = next + 1 ;
if (next > m)
next = 1 ;
finger[next] = find_successor(n + 2next-1);
Check_predecessor()
Periodically called to check whether
predecessor has failed
If yes, it clears the predecessor pointer, which can
then be modified by notify()
n.check_predecessor()
if (predecessor has failed)
predecessor = nil;
Theorem
np runs stabilize
n succ(np) = ns np asks ns for its predecessor (now n)
pred(ns) = np
nil
all predecessor and successor
pointers are now correct
68 68
Impact of node joins on lookups
70 70
Failure of nodes
Correctness relies on
correct successor
pointers
What happens, if N14,
N21, N32 fail
simultaneously?
How can N8 aquire N38
as successor?
71 71
Voluntary Node Departures
Can be treated as node failures
Two possible enhancements
Leaving node may transfers all its keys to its
successor
Leaving node may notify its predecessor and
successor about each other so that they can
update their links
Chord – facts
Every node is responsible for about K/N keys (N nodes, K keys)
The leaf set contains nodes which are close in the id-
space
For ex, the first row is populated with nodes that have no prefix in
common with n
Single digit entry in each row shows the corresponding digit of the
present node’s id – i.e. prefix matches the current id up to the
given value of p – the next row down or leaf set should be
examined to find a route.
Routing Table
Routing tables (RT) thus built achieve an effect similar to
Chord finger table
The detail of the routing information increases with the proximity of
other nodes in the id-space
Without a large no. of nearby nodes, the last rows of the RT are
only sparsely populated – intuitively, the id-space would need to be
fully exhausted with node-ids for complete RTs on all nodes
In populating the RT, there is a choice from the set of nodes with
the appropriate id-prefix
The RT and the leaf set are the two sources of information
relevant for routing
L
Nodes that are numerically closer
to the present Node (2b or
2x2b entry)
R
Common prefix with 10233102-
next digit-rest of NodeId
(log2b (N) rows, 2b-1
columns)
M
Nodes that are closest according
to the proximity metric (2b or
2x2b entry)
Routing
Key D arrives at nodeId A
Shares a longer prefix with the key than the local node
Case 2: The query is routed via leaf set (one step). This
increases the no. of hop by one
Routing performance
There are 3 cases:
Case 3: The key is neither covered by the leaf set nor
does the RT contains an entry with a longer matching
prefix than the current node
Error correction
Failed leaf node: contact a leaf node on the side of the failed
node and add appropriate new neighbor
Discussion:
Locality in the routing table
Route locality
Locating the nearest among k nodes
Locality in the routing table
Node A is near X
A’s R0 entries are close to A, A is close to X, and
triangulation inequality holds entries in X are
relatively near A.
Likewise, obtaining X’s neighborhood set from A is
appropriate.
B’s R1 entries are reasonable choice for R1of X
Entries in each successive row are chosen from an exponentially
decreasing set size.
The expected distance from B to any of its R1 entry is much
larger than the expected distance traveled from node A to B.
Second stage: X requests the state from each of the
nodes in its routing table and neighborhood set to
update its entries to closer nodes.
Routing locality
Each routing step moves the message closer to the
destination in the nodeId space, while traveling the
least possible distance in the proximity space.
Given that:
A message routed from A to B at distance d cannot
subsequently be routed to a node with a distance of less than d
from A
The expected distance traveled by a message during each
successive routing step is exponentially increasing
Since a message tends to make larger and larger
strides with no possibility of returning to a node within
di of any node i encountered on the route, the message
has nowhere to go but towards its destination
Locating the nearest among k nodes
Goal:
among the k numerically closest nodes to a key, a
message tends to first reach a node near the client.
Problem:
Since Pastry routes primarily based on nodeId
prefixes, it may miss nearby nodes with a different
prefix than the key.
Solution (using a heuristic):
Based on estimating the density of nodeIds, it
detects when a message approaches the set of k
and then switches to numerically nearest address
based routing to locate the nearest replica.
Arbitrary node failures and network
partitions
Node continues to be responsive, but
behaves incorrectly or even maliciously.
• Example: D=2
• 1’s neighbors: 2,3,4,6
• 6’s neighbors: 1,2,4,5
• Squares “wrap around”, e.g.,
7 and 8 are neighbors
• Expected # neighbors: O(D)
CAN : Routing
To get to <n1, n2, …, nD> from <m1, m2, …, mD>
choose a neighbor with smallest Cartesian distance from <m1,
m2, …, mD> (e.g., measured from neighbor’s center)
Bootstrap
node
new node
CAN: construction
Bootstrap
node
(x,y)
(x,y)
J
J new
1
3
1 3
2 4
2 4
Partition tree
Zoning
Zone reassignment
1
3
1 3 4
4
Partition tree
Zoning
Zone reassignment
1
3
1 3
2 4
2 4
Partition tree
Zoning
Zone reassignment
1
2
1 2 4
4
Partition tree
Zoning