
Peer-to-Peer (P2P) systems

DHT, Chord, Pastry, …


Unstructured vs Structured
• Unstructured P2P networks allow resources to be placed at any node. The network topology is arbitrary, and growth is spontaneous.
• Structured P2P networks simplify resource location and load balancing by defining a topology and rules for resource placement.
• They guarantee efficient search even for rare objects.

What are the rules???

Distributed Hash Table (DHT)


Hash Tables
• Store arbitrary keys and satellite data (value)
  put(key, value)
  value = get(key)
• Lookup must be fast
• Calculate a hash function h() on the key that returns a storage cell
• Chained hash table: store the key (and optional value) there
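A minimal sketch of a chained hash table with the put/get interface above (illustrative Python, not part of the original slides; bucket count and collision handling are arbitrary choices):

    class ChainedHashTable:
        def __init__(self, num_buckets=8):
            self.buckets = [[] for _ in range(num_buckets)]

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]   # h(key) -> storage cell

        def put(self, key, value):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)      # overwrite an existing key
                    return
            bucket.append((key, value))           # chain the new entry

        def get(self, key):
            for k, v in self._bucket(key):
                if k == key:
                    return v
            return None

    t = ChainedHashTable()
    t.put("song.mp3", b"...")
    assert t.get("song.mp3") == b"..."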
What Is a DHT?
• Single-node hash table:
  key = Hash(name)
  put(key, value)
  get(key) -> value
• How do I do this across millions of hosts on the Internet?
• Distributed Hash Table
(Figure) DHT layering: a distributed application calls put(key, data) / get(key) -> data on a distributed hash table (DHash), which calls lookup(key) -> node IP address on a lookup service (Chord) that runs across many nodes.

• Application may be distributed over many nodes
• DHT distributes data storage over many nodes
Why the put()/get() interface?

• API supports a wide range of applications
• DHT imposes no structure/meaning on keys


Distributed Hash Table
• Hash table functionality in a P2P network: lookup of data indexed by keys
• Key-hash → node mapping
  - Assign a unique live node to a key
  - Find this node in the overlay network quickly and cheaply
• Maintenance, optimization
  - Load balancing: maybe even change the key-hash → node mapping on the fly
  - Replicate entries on more nodes to increase robustness
Distributed Hash Table
The lookup problem

(Figure) A publisher stores Key="title", Value=MP3 data… at one of the nodes N1…N6; a client elsewhere on the Internet issues Lookup("title") and must locate the node holding the value.
Centralized lookup (Napster)
(Figure) The publisher at N4 registers SetLoc("title", N4) with a central database; the client sends Lookup("title") to the DB and is pointed to N4, where Key="title", Value=MP3 data… is stored.

Simple, but O(N) state and a single point of failure


Flooded queries (Gnutella)
(Figure) The publisher at N4 stores Key="title", Value=MP3 data…; the client's Lookup("title") is flooded to all reachable nodes until a node holding the key answers.

Robust, but large number of messages per lookup


Routed queries (Chord, Pastry, etc.)

(Figure) The publisher at N4 stores Key="title", Value=MP3 data…; the client's Lookup("title") is routed hop by hop through the overlay to the node responsible for the key.
Routing challenges
• Define a useful key nearness metric
• Keep the hop count small
• Keep the tables small
• Stay robust despite rapid change

• Freenet: emphasizes anonymity
• Chord: emphasizes efficiency and simplicity
What is Chord? What does it do?
• In short: a peer-to-peer lookup service
• Solves the problem of locating a data item in a collection of distributed nodes, considering frequent node arrivals and departures
• The core operation in most p2p systems is efficient location of data items
• Supports just one operation: given a key, it maps the key onto a node
Chord Characteristics
• Simplicity, provable correctness, and provable performance
• Each Chord node needs routing information about only a few other nodes
• Resolves lookups via messages to other nodes (iteratively or recursively)
• Maintains routing information as nodes join and leave the system
Mapping onto Nodes vs. Values
• Traditional name and location services provide a direct mapping between keys and values
• What are examples of values? A value can be an address, a document, or an arbitrary data item
• Chord can easily implement a mapping onto values by storing each key/value pair at the node to which that key maps
Napster, Gnutella etc. vs. Chord
• Compared to Napster and its centralized servers, Chord avoids single points of control or failure through a decentralized design
• Compared to Gnutella and its widespread use of broadcasts, Chord avoids the scalability problem by keeping only a small amount of routing information at each node
Addressed Difficult Problems (1)
• Load balance: distributed hash function, spreading keys evenly over nodes
• Decentralization: Chord is fully distributed, no node is more important than any other; this improves robustness
• Scalability: logarithmic growth of lookup costs with the number of nodes in the network, so even very large systems are feasible
Addressed Difficult Problems (2)

• Availability: Chord automatically adjusts its internal tables to ensure that the node responsible for a key can always be found
• Flexible naming: no constraints on the structure of the keys – the key-space is flat, giving flexibility in how to map names to Chord keys
Chord properties
• Efficient: O(log N) messages per lookup
  - N is the total number of servers
• Scalable: O(log N) state per node
• Robust: survives massive failures


The Base Chord Protocol (1)

• Specifies how to find the locations of keys
• How new nodes join the system
• How to recover from the failure or planned departure of existing nodes
Consistent Hashing
• The hash function assigns each node and key an m-bit identifier using a base hash function such as SHA-1
  - ID(node) = hash(IP, Port)
  - ID(key) = hash(key)
• Properties of consistent hashing:
  - The function balances load: all nodes receive roughly the same number of keys
  - When the Nth node joins (or leaves) the network, only an O(1/N) fraction of the keys are moved to a different location
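As a rough illustration of the ID assignment above (a sketch; the exact byte encoding of IP and port is an assumed detail):

    import hashlib

    m = 160                                   # SHA-1 gives 160-bit identifiers
    RING = 2 ** m

    def node_id(ip, port):
        return int(hashlib.sha1(f"{ip}:{port}".encode()).hexdigest(), 16) % RING

    def key_id(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % RING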
Chord IDs
• Key identifier = SHA-1(key)
• Node identifier = SHA-1(IP, Port)
• Both are uniformly distributed
• Both exist in the same ID space
• How to map key IDs to node IDs?


Chord IDs
• Consistent hashing (SHA-1) assigns each node and object an m-bit ID
• IDs are ordered in an ID circle ranging from 0 to 2^m − 1
• New nodes assume slots in the ID circle according to their ID
• Key k is assigned to the first node whose ID ≥ k, i.e. successor(k)
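A small sketch of successor(k) over a sorted list of node IDs (illustrative; the example ring matches the 7-bit figure on the next slide):

    import bisect

    def successor(node_ids, k):
        # node_ids: sorted list of node identifiers on the circle
        i = bisect.bisect_left(node_ids, k)
        return node_ids[i % len(node_ids)]        # wrap around past the largest ID

    ring = [32, 90, 105]
    assert successor(ring, 20) == 32
    assert successor(ring, 80) == 90
    assert successor(ring, 110) == 32             # wraps around to N32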
Consistent hashing
(Figure) A circular 7-bit ID space with nodes N32, N90, N105 and keys K5, K20, K80; each key is placed on the circle and stored at its successor node.
A key is stored at its successor: node with next higher ID
Successor Nodes
(Figure) Identifier circle for m = 3 (IDs 0–7) with nodes 0, 1, 3 and keys 1, 2, 6: successor(1) = 1, successor(2) = 3, successor(6) = 0.
Node Joins and Departures

(Figure) Joins and departures on the same circle: after node 7 joins, successor(6) = 7; after node 1 departs, successor(1) = 3.
Consistent Hashing – Join and Departure

• When a node n joins the network, certain keys previously assigned to n's successor now become assigned to n.
• When node n leaves the network, all of its assigned keys are reassigned to n's successor.
Consistent Hashing – Node Join
(Figure) Node join example on the 3-bit circle: node 0 holds keys {5, 7}, node 1 holds {1}, node 3 holds {2}; when node 2 joins, key 2 moves from node 3 to node 2.
Consistent Hashing – Node Dep.
(Figure) Node departure example on the same circle: when a node leaves, the keys it held are reassigned to its successor.
Scalable Key Location

• A very small amount of routing information suffices to implement consistent hashing in a distributed environment
• Each node need only be aware of its successor node on the circle
• Queries for a given identifier can be passed around the circle via these successor pointers
A Simple Key Lookup
• Pseudocode for finding the successor:

// ask node n to find the successor of id
n.find_successor(id)
  if (id ∈ (n, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);
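A hedged Python sketch of the same simple lookup; the interval test must handle wrap-around on the circle (attribute names such as node.id and node.successor are assumptions for illustration):

    def in_half_open_interval(x, a, b, mod):
        # membership test for x in (a, b] on a circular ID space of size mod
        a, b, x = a % mod, b % mod, x % mod
        if a < b:
            return a < x <= b
        return x > a or x <= b          # the interval wraps past zero

    def find_successor(node, key_id, mod):
        if in_half_open_interval(key_id, node.id, node.successor.id, mod):
            return node.successor
        return find_successor(node.successor, key_id, mod)   # walk around the circle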
Scalable Key Location

• The resolution scheme is correct, BUT inefficient:
  - it may require traversing all N nodes!


Acceleration of Lookups
• Lookups are accelerated by maintaining additional routing information
• Each node maintains a routing table with (at most) m entries (where 2^m is the size of the ID space), called the finger table
• The ith entry in the table at node n contains the identity of the first node, s, that succeeds n by at least 2^(i−1) on the identifier circle
  - s = successor(n + 2^(i−1)) (all arithmetic mod 2^m)
  - s is called the ith finger of node n, denoted by n.finger(i).node
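For illustration, the finger starts can be computed as follows (a sketch; it reproduces the start columns of the 3-bit finger-table figures below):

    def finger_starts(n, m):
        # the i-th finger of node n points at successor((n + 2^(i-1)) mod 2^m)
        return [(n + 2 ** (i - 1)) % (2 ** m) for i in range(1, m + 1)]

    assert finger_starts(0, 3) == [1, 2, 4]
    assert finger_starts(3, 3) == [4, 5, 7]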


Scalable Key Location – Finger Tables
(Figure) Identifier circle 0–7 with nodes 0, 1, 3 and their finger tables (start = n + 2^(i−1) mod 8):

  Node 0 (keys: 6)        Node 1 (keys: 1)        Node 3 (keys: 2)
  start      succ.        start      succ.        start      succ.
  0+2^0 = 1    1          1+2^0 = 2    3          3+2^0 = 4    0
  0+2^1 = 2    3          1+2^1 = 3    3          3+2^1 = 5    0
  0+2^2 = 4    0          1+2^2 = 5    0          3+2^2 = 7    0
Finger Tables
(Figure) The same finger tables written with start and interval [start, next start):

  Node 0 (keys: 6)        Node 1 (keys: 1)        Node 3 (keys: 2)
  start  int.   succ.     start  int.   succ.     start  int.   succ.
  1      [1,2)  1         2      [2,3)  3         4      [4,5)  0
  2      [2,4)  3         3      [3,5)  3         5      [5,7)  0
  4      [4,0)  0         5      [5,1)  0         7      [7,3)  0
Finger Tables - characteristics

• Each node stores information about only a small number of other nodes, and knows more about nodes closely following it than about nodes farther away
• A node's finger table generally does not contain enough information to determine the successor of an arbitrary key k
• Repeated queries to nodes that immediately precede the given key eventually lead to the key's successor
Node Joins – with Finger Tables
(Figure) Node 6 joins the circle with nodes 0, 1, 3: key 6 moves from node 0 to node 6, node 6 builds its own finger table (7 [7,0) → 0, 0 [0,2) → 0, 2 [2,6) → 3), and existing finger entries that pointed past 6 are updated from 0 to 6 (node 0's finger for [4,0), node 1's finger for [5,1), node 3's fingers for [4,5) and [5,7)).
Node Departures – with Finger Tables
(Figure) Node 1 departs from the circle with nodes 0, 1, 3, 6: key 1 is reassigned to node 1's successor (node 3), and finger entries that pointed to node 1 are updated (e.g. node 0's finger for [1,2) changes from 1 to 3); the remaining fingers of nodes 3 and 6 are unchanged.
Simple Key Location Scheme
(Figure) Simple key location: a lookup for K45 starting at N8 follows successor pointers N8 → N14 → N21 → N32 → N38 → N42 → N48, where N48 is the successor of key 45.
Scalable Lookup Scheme
(Figure) Finger table for node N8 on a ring with nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56:

  N8+1  → N14
  N8+2  → N14
  N8+4  → N14
  N8+8  → N21
  N8+16 → N32
  N8+32 → N42

finger[k] = first node that succeeds (n + 2^(k−1)) mod 2^m
Chord key location
• Look up in the finger table the furthest node that precedes the key
• → O(log N) hops
Scalable Lookup Scheme

// ask node n to find the successor of id


n.find_successor (id)
n0 = find_predecessor(id);
return n0.successor;

// ask node n to find the predecessor of id


n.find_predecessor (id)
n0 = n;
while (id not in (n0, n0.successor] )
n0 = n0.closest_preceding_finger(id);
return n0;

// return closest finger preceding id


n.closest_preceding_finger (id)
for i = m downto 1
if (finger[i].node belongs to (n, id))
return finger[i].node;
return n;
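A Python sketch of the same scalable lookup (attribute names .id, .successor and the list .finger are assumptions; in_half_open_interval is the wrap-around helper sketched earlier):

    def in_open_interval(x, a, b, mod):
        # membership test for x in (a, b) on the circular ID space
        a, b, x = a % mod, b % mod, x % mod
        return (a < x < b) if a < b else (x > a or x < b)

    def closest_preceding_finger(n, key_id, mod):
        for f in reversed(n.finger):                        # highest finger first
            if in_open_interval(f.id, n.id, key_id, mod):
                return f
        return n

    def find_predecessor(n, key_id, mod):
        while not in_half_open_interval(key_id, n.id, n.successor.id, mod):
            n = closest_preceding_finger(n, key_id, mod)
        return n

    def find_successor(n, key_id, mod):
        return find_predecessor(n, key_id, mod).successor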
Lookup Using Finger Table
(Figure) lookup(54) issued at N8: using the finger tables, the query is forwarded N8 → N42 → N51 → N56; N56 is the successor of key 54.
Scalable Lookup Scheme
• Each node forwards the query at least halfway along the distance remaining to the target
• Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N)
“Finger table” allows log(N)-time lookups

(Figure) The fingers of node N80 cover ½, ¼, 1/8, 1/16, 1/32, 1/64, 1/128 of the ring, so each hop at least halves the remaining distance to the target.
Dynamic Operations and Failures

Need to deal with:
• Node joins and stabilization
• Impact of node joins on lookups
• Failure and replication
• Voluntary node departures
Node Joins and Stabilization
• A node's successor pointer should be up to date
  - needed for correctly executing lookups
• Each node periodically runs a "stabilization" protocol
  - updates finger tables and successor pointers
Node Joins and Stabilization

• Contains 6 functions:
  - create()
  - join()
  - stabilize()
  - notify()
  - fix_fingers()
  - check_predecessor()
Create()

• Creates a new Chord ring

n.create()
  predecessor = nil;
  successor = n;
Join()
• Asks m to find the immediate successor of n.
• Doesn't make the rest of the network aware of n.

n.join(m)
  predecessor = nil;
  successor = m.find_successor(n);
Stabilize()
• Called periodically to learn about new nodes
  - Asks n's immediate successor about the successor's predecessor p
  - Checks whether p should be n's successor instead
  - Also notifies n's successor about n's existence, so that the successor may change its predecessor to n, if necessary

n.stabilize()
  x = successor.predecessor;
  if (x ∈ (n, successor))
    successor = x;
  successor.notify(n);
Notify()
• m thinks it might be n's predecessor

n.notify(m)
  if (predecessor is nil or m ∈ (predecessor, n))
    predecessor = m;
Fix_fingers()
• Periodically called to make sure that finger table entries are correct
  - New nodes initialize their finger tables
  - Existing nodes incorporate new nodes into their finger tables

n.fix_fingers()
  next = next + 1;
  if (next > m)
    next = 1;
  finger[next] = find_successor(n + 2^(next−1));
Check_predecessor()
• Periodically called to check whether the predecessor has failed
  - If yes, it clears the predecessor pointer, which can then be modified by notify()

n.check_predecessor()
  if (predecessor has failed)
    predecessor = nil;
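A consolidated Python sketch of these maintenance functions (a simplification: remote calls are plain method calls, find_successor and in_open_interval are the helpers from the earlier lookup sketch, and has_failed is an assumed liveness probe):

    class Node:
        def __init__(self, ident, mod):
            self.id, self.mod = ident, mod
            self.successor, self.predecessor = self, None

        def join(self, m):                        # join an existing ring via node m
            self.predecessor = None
            self.successor = find_successor(m, self.id, self.mod)

        def stabilize(self):                      # run periodically
            x = self.successor.predecessor
            if x is not None and in_open_interval(x.id, self.id, self.successor.id, self.mod):
                self.successor = x                # a closer successor has appeared
            self.successor.notify(self)

        def notify(self, n):                      # n thinks it may be our predecessor
            if self.predecessor is None or in_open_interval(
                    n.id, self.predecessor.id, self.id, self.mod):
                self.predecessor = n

        def check_predecessor(self, has_failed):
            if self.predecessor is not None and has_failed(self.predecessor):
                self.predecessor = None           # will be re-set later by notify()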
Theorem

• If any sequence of join operations is executed interleaved with stabilizations, then at some time after the last join the successor pointers will form a cycle on all nodes in the network
Stabilization Protocol

• Guarantees that nodes are added in a fashion that preserves reachability
Impact of Node Joins on Lookups
• If finger table entries are reasonably current
  - Lookup finds the correct successor in O(log N) steps
• If successor pointers are correct but finger tables are incorrect
  - Correct lookup, but slower
• If successor pointers are incorrect
  - Lookup may fail
Impact of Node Joins on Lookups
• Performance
  - If stabilization is complete
    - Lookup can be done in O(log N) time
  - If stabilization is not complete
    - Existing nodes' finger tables may not reflect the new nodes
      - Doesn't significantly affect lookup speed
    - Newly joined nodes can affect lookup speed if the new nodes' IDs fall between the target and the target's predecessor
      - The lookup then has to be forwarded through the intervening nodes, one at a time
Theorem
• If we take a stable network with N nodes with correct finger pointers, and another set of up to N nodes joins the network, and all successor pointers (but perhaps not all finger pointers) are correct, then lookups will still take O(log N) time with high probability
Source of Inconsistencies:
Concurrent Operations and Failures

• The basic "stabilization" protocol is used to keep nodes' successor pointers up to date, which is sufficient to guarantee correctness of lookups
• Those successor pointers can then be used to verify the finger table entries
• Every node runs stabilize periodically to find newly joined nodes
Stabilization after Join
(Diagram) Node n joins between np and ns:
• n joins: predecessor = nil; n acquires ns as its successor via some node n'
• n notifies ns that it is the new predecessor; ns acquires n as its predecessor (pred(ns) = n)
• np runs stabilize: np asks ns for its predecessor (now n), acquires n as its successor (succ(np) = n), and notifies n
• n acquires np as its predecessor
• All predecessor and successor pointers are now correct
• Fingers still need to be fixed, but old fingers will still work
Node joins and stabilization (example)
• N26 joins the system
• N26 acquires N32 as its successor
• N26 notifies N32
• N32 acquires N26 as its predecessor
• N26 copies keys
• N21 runs stabilize() and asks its successor N32 for its predecessor, which is now N26
• N21 acquires N26 as its successor
• N21 notifies N26 of its existence
• N26 acquires N21 as its predecessor
Failure Recovery
• The key step in failure recovery is maintaining correct successor pointers
• To help achieve this, each node maintains a successor-list of its r nearest successors on the ring
• If node n notices that its successor has failed, it replaces it with the first live entry in the list
• stabilize will correct finger table entries and successor-list entries pointing to the failed node
• Performance is sensitive to the frequency of node joins and leaves versus the frequency at which the stabilization protocol is invoked
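A minimal sketch of successor-list failover (assuming a successor_list ordered closest-first and an is_alive() probe, both illustrative names):

    def repair_successor(node, is_alive):
        # node.successor_list holds the node's r nearest successors, closest first
        for candidate in node.successor_list:
            if is_alive(candidate):
                node.successor = candidate        # first live entry becomes successor
                return candidate
        raise RuntimeError("all r successors failed")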
Impact of node joins on lookups
• All finger table entries are correct => O(log N) lookups
• Successor pointers correct, but fingers inaccurate => correct but slower lookups
Impact of node joins on lookups

 Stabilization completed => no influence


on performence
 Only for the negligible case that a large
number of nodes joins between the
target‘s predecessor and the target, the
lookup is slightly slower
 No influence on performance as long as
fingers are adjusted faster than the
69 network doubles in size 69
Failure of nodes
• Correctness relies on correct successor pointers
• What happens if N14, N21, N32 fail simultaneously?
• How can N8 acquire N38 as its successor?
Voluntary Node Departures
• Can be treated as node failures
• Two possible enhancements
  - The leaving node may transfer all its keys to its successor
  - The leaving node may notify its predecessor and successor about each other so that they can update their links
Chord – facts
• Every node is responsible for about K/N keys (N nodes, K keys)
• When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to and from the joining or leaving node)
• Lookups need O(log N) messages
• To re-establish routing invariants and finger tables after a node joins or leaves, only O(log² N) messages are required
Pastry
• Self-organizing overlay network of nodes
• With high probability, nodes with adjacent nodeIds are diverse in geography, ownership, jurisdiction, network attachment, etc.
• Pastry takes into account network locality (e.g. IP routing hops).
Pastry

• Instead of organizing the id-space as a Chord-like ring, routing is based on numeric closeness of identifiers
• The focus is not only on the number of routing hops, but also on network locality as a factor in routing efficiency
Pastry
• Identifier space:
  - Nodes and data items are uniquely associated with m-bit ids – integers in the range 0 to 2^m − 1 – where m is typically 128
  - Pastry views ids as strings of digits to the base 2^b, where b is typically chosen to be 4
  - A key is located on the node to whose node id it is numerically closest
Routing Goal

• Pastry routes messages to the node whose nodeId is numerically closest to the given key in less than log_{2^b}(N) steps:
  - "A heuristic ensures that among a set of nodes with the k closest nodeIds to the key, the message is likely to first reach a node near the node from which the message originates, in terms of the proximity metric"
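For example (illustrative numbers, not from the slides): with b = 4 and N ≈ 10^6 nodes, a lookup takes about log_16(10^6) ≈ 5 routing steps.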
Routing Information
• Pastry's node state is divided into 3 main elements
  - The routing table – similar to Chord's finger table – stores links to nodes spread across the id-space
  - The leaf set contains nodes which are close in the id-space
  - Nodes that are close together in terms of network locality are listed in the neighbourhood set
Routing Table
• A Pastry node's routing table is made up of log_{2^b}(N) (at most m/b) rows with 2^b − 1 entries per row
• On node n, entries in row i hold the identities of Pastry nodes whose node-id shares an i-digit prefix with n but whose (i+1)th digit differs
  - For example, the first row is populated with nodes that have no prefix in common with n
• When there is no node with an appropriate prefix, the corresponding entry is left empty
• The single-digit entry in each row marks the corresponding digit of the present node's own id – i.e. the prefix matches the current id up to that digit – so the next row down, or the leaf set, must be examined to find a route.
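A sketch of the prefix bookkeeping behind the routing table (illustrative Python; digit strings, b = 4 so digits are hexadecimal, and the guard for identical ids are assumptions):

    def shared_prefix_len(a, b):
        # number of leading digits the two id strings have in common
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def routing_slot(node_id, key):
        # a key sharing i leading digits with the node is looked up in row i,
        # in the column given by the key's next digit
        row = shared_prefix_len(node_id, key)
        if row == len(key):                # key equals the node id: nothing to route
            return None
        return row, int(key[row], 16)

    # e.g. node "65a1" and key "65b0" share 2 digits -> row 2, column 0xb
    assert routing_slot("65a1", "65b0") == (2, 11)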
Routing Table
• Routing tables (RT) built this way achieve an effect similar to Chord's finger table
  - The detail of the routing information increases with the proximity of other nodes in the id-space
• Without a large number of nearby nodes, the last rows of the RT are only sparsely populated – intuitively, the id-space would need to be fully exhausted with node-ids for complete RTs on all nodes
• In populating the RT, there is a choice from the set of nodes with the appropriate id-prefix
  - During the routing process, network locality can be exploited by selecting nodes which are close in terms of the network proximity metric
Leaf Set
• The routing tables sort node ids by prefix. To increase lookup efficiency, the leaf set L holds the |L| nodes numerically closest to n (|L|/2 smaller and |L|/2 larger; |L| is normally 2^b or 2 × 2^b)
• The RT and the leaf set are the two sources of information relevant for routing
• The leaf set also plays a role similar to Chord's successor list in recovering from failures of adjacent nodes
Neighbourhood Set

• Instead of numeric closeness, the neighbourhood set M is concerned with nodes that are close to the current node with regard to the network proximity metric
  - Thus, it is not involved in routing itself but in maintaining network locality in the routing information
Pastry Node State (Base 4)

(Figure) Node state of the Pastry node with nodeId 10233102 (base 4):
• L – leaf set: nodes that are numerically closest to the present node (2^b or 2 × 2^b entries)
• R – routing table: common prefix with 10233102, then next digit, then rest of nodeId (log_{2^b}(N) rows, 2^b − 1 columns)
• M – neighbourhood set: nodes that are closest according to the proximity metric (2^b or 2 × 2^b entries)
Routing
• A message with key D arrives at the node with nodeId A. Notation:
  - R_l^i: entry in the routing table at column i and row l
  - L_i: the i-th closest nodeId in the leaf set
  - D_l: the value of the l-th digit of the key D
  - shl(A, B): length of the prefix shared by A and B, in digits
Routing
• Routing is divided into two main steps:
• First, a node checks whether the key K is within the range of its leaf set
  - If so, K is located on one of the nearby nodes of the leaf set. The node forwards the query to the leaf set node numerically closest to K. In case this is the node itself, the routing process is finished.
Routing
• If K does not fall within the range of the leaf set, the query needs to be forwarded over a larger distance using the routing table
• In this case, a node n tries to pass the query on to a node which shares a longer common prefix with K than n itself
  - If there is no such entry in the RT, the query is forwarded to a node which shares a prefix with K of the same length as n but which is numerically closer to K than n (see the sketch below)
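A hedged sketch of this two-step forwarding decision (Python; leaf_set, routing_table and known_nodes() are assumed data structures, ids and keys are fixed-length hex strings so string comparison matches numeric order, and shared_prefix_len is from the earlier routing-table sketch):

    def numeric_dist(a, b):
        return abs(int(a, 16) - int(b, 16))

    def next_hop(node, key):
        # 1) leaf set: if the key lies in the leaf-set range, go to the
        #    numerically closest node (possibly the present node itself)
        leafs = node.leaf_set + [node.id]
        if min(leafs) <= key <= max(leafs):
            return min(leafs, key=lambda n: numeric_dist(n, key))
        # 2) routing table: row = shared prefix length, column = next digit of key
        l = shared_prefix_len(node.id, key)
        entry = node.routing_table[l].get(key[l])
        if entry is not None:
            return entry
        # 3) rare case: any known node with the same prefix length that is
        #    numerically closer to the key than the present node
        closer = [n for n in node.known_nodes()
                  if shared_prefix_len(n, key) >= l
                  and numeric_dist(n, key) < numeric_dist(node.id, key)]
        return min(closer, key=lambda n: numeric_dist(n, key)) if closer else node.id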
Routing

• This scheme ensures that routing loops do not occur, because the query is routed strictly to a node with a longer common identifier prefix than the current node, or to a numerically closer node with the same prefix length
Routing performance

• The routing procedure converges: each step takes the message to a node that either
  - shares a longer prefix with the key than the local node, or
  - shares as long a prefix with, but is numerically closer to, the key than the local node.
Routing performance
• Assumption: routing tables are accurate and there are no recent node failures
• There are 3 cases in the Pastry routing scheme:
• Case 1: Forward the query (according to the RT) to a node with a longer prefix match than the current node.
  - The number of nodes with longer prefix matches is thus reduced by at least a factor of 2^b in each step, so the destination is reached in log_{2^b}(N) steps.
Routing performance

• There are 3 cases:
• Case 2: The query is routed via the leaf set (one step). This increases the number of hops by one.
Routing performance
• There are 3 cases:
• Case 3: The key is neither covered by the leaf set nor does the RT contain an entry with a longer matching prefix than the current node
  - Consequently, the query is forwarded to a node with the same prefix length, adding an additional routing hop.
  - For a moderate leaf set size (|L| = 2 × 2^b), the probability of this case is less than 0.6%, so it is very unlikely that more than one additional hop is incurred.
Routing performance
• As a result, the complexity of routing remains O(log_{2^b} N) on average
• Higher values of b lead to faster routing but also increase the amount of state that needs to be managed at each node
• Thus, b is typically 4, but a Pastry implementation can choose an appropriate trade-off for a specific application
Join and Failure
• Join
  - Use routing to find the numerically closest node already in the network
  - Ask for state from all nodes on the route and initialize own state
• Error correction
  - Failed leaf node: contact a leaf node on the side of the failed node and add an appropriate new neighbor
  - Failed table entry: contact a live entry with the same prefix as the failed entry until a new live entry is found; if none is found, keep trying with longer-prefix table entries
Self Organization: Node Arrival

• The new node n is assumed to know a nearby Pastry node k, based on the network proximity metric
• Now n needs to initialize its RT, leaf set and neighbourhood set.
  - Since k is assumed to be close to n, the nodes in k's neighbourhood set are reasonably good choices for n, too.
  - Thus, n copies the neighbourhood set from k.
Self Organization: Node Arrival
• To build its RT and leaf set, n routes a special join message via k to a key equal to n
  - According to the standard routing rules, the query is forwarded to the node c with the numerically closest id; hence the leaf set of c is suitable for n, so n retrieves c's leaf set for itself.
  - The join request triggers all nodes which forwarded the query towards c to provide n with their routing information.
Self Organization: Node Arrival
• Node n's RT is constructed from the routing information of these nodes, starting at row 0.
  - As this row is independent of the local node id, n can use the entries at row zero of k's routing table
    - In particular, it is assumed that n and k are close in terms of the network proximity metric
    - Since k stores nearby nodes in its RT, these entries are also close to n.
  - In the general case of n and k not sharing a common prefix, n cannot reuse entries from any other row of k's RT.
Self Organization: Node Arrival
• The route of the join message from n to c leads via nodes v1, v2, …, with increasingly longer common prefixes between n and vi
• Thus, row 1 from the RT of v1 is also a good choice for the same row of the RT of n
• The same is true for row 2 on node v2, and so on
• Based on this information, the RT of n can be constructed.
Self Organization: Node Arrival
• Finally, the new node sends its node state to all nodes in its routing data so that these nodes can update their own routing information accordingly
  - In contrast to the lazy updates in Chord, this mechanism actively updates the state in all affected nodes when a new node joins the system
• At this stage, the new node is fully present and reachable in the Pastry network
Node Failure
• Node failure is detected when a communication attempt with another node fails. Routing requires contacting nodes from the RT and the leaf set, resulting in lazy detection of failures
• During routing, the failure of a single node in the RT does not significantly delay the routing process. The local node can choose to forward the query to a different node from the same row in the RT. (Alternatively, a node could store backup nodes with each entry in the RT)
Node Failure
• To replace a failed node at entry i in row j of its RT (R_j^i), a node contacts another node referenced in row j
  - Entries in the same row j of the remote node are valid for the local node, and hence it can copy entry R_j^i from the remote node into its own RT
  - In case that entry has failed as well, it can probe another node in row j for entry R_j^i
• If no live node with an appropriate nodeId prefix can be obtained in this way, the local node queries nodes from the preceding row R_{j−1}
Node Failure
• Repairing a failed entry in the leaf set of a node is straightforward – it utilizes the leaf sets of other nodes referenced in the local leaf set.
  - The node contacts the live leaf-set node with the largest index on the side of the failed node and copies the missing entry from its leaf set
  - If this node is unavailable, the local node can revert to leaf-set nodes with smaller indices
Node Departure

• Neighborhood set: to replace a departed member, a node asks the other members for their neighborhood sets M, checks the distance of each of the newly discovered nodes, and updates its own neighborhood set accordingly.
Locality
• "The route chosen for a message is likely to be good with respect to the proximity metric"
• Discussion:
  - Locality in the routing table
  - Route locality
  - Locating the nearest among k nodes
Locality in the routing table
• The joining node X is near node A
  - A's row-0 entries are close to A, A is close to X, and the triangle inequality holds => these entries are also relatively near X.
  - Likewise, obtaining X's neighborhood set from A is appropriate.
• B's row-1 entries are a reasonable choice for row 1 of X
  - Entries in each successive row are chosen from an exponentially decreasing set size.
  - The expected distance from B to any of its row-1 entries is much larger than the distance traveled from node A to B.
• Second stage: X requests the state from each of the nodes in its routing table and neighborhood set to update its entries to closer nodes.
Routing locality
• Each routing step moves the message closer to the destination in the nodeId space, while traveling the least possible distance in the proximity space.
• Given that:
  - a message routed from A to B at distance d cannot subsequently be routed to a node with a distance of less than d from A, and
  - the expected distance traveled by a message during each successive routing step increases exponentially,
• => since a message tends to make larger and larger strides, with no possibility of returning to a node within d_i of any node i encountered on the route, the message has nowhere to go but towards its destination
Locating the nearest among k nodes
• Goal:
  - among the k numerically closest nodes to a key, a message tends to first reach a node near the client.
• Problem:
  - since Pastry routes primarily based on nodeId prefixes, it may miss nearby nodes with a different prefix than the key.
• Solution (using a heuristic):
  - based on estimating the density of nodeIds, the heuristic detects when a message approaches the set of k nodes and then switches to routing based on numeric closeness to locate the nearest replica.
Arbitrary node failures and network
partitions
• A node continues to be responsive, but behaves incorrectly or even maliciously.
• Repeated queries fail each time, since they normally take the same route.
• Solution: routing can be randomized
  - the choice among multiple nodes that satisfy the routing criteria should be made randomly
CAN : Content Addressable
Network
• The hash value is viewed as a point in a D-dimensional Cartesian space
  - the hash value is the point <n1, n2, …, nD>
• Each node is responsible for a D-dimensional "cube" in the space
• Nodes are neighbors if their cubes "touch" at more than just a point

(Figure) Example with D = 2:
• 1's neighbors: 2, 3, 4, 6
• 6's neighbors: 1, 2, 4, 5
• Squares "wrap around", e.g., 7 and 8 are neighbors
• Expected # neighbors: O(D)
CAN : Routing
• To get to <n1, n2, …, nD> from <m1, m2, …, mD>
  - choose the neighbor with the smallest Cartesian distance to the destination <n1, n2, …, nD> (e.g., measured from the neighbor's center)

(Figure) Example: region 1 needs to send to the node covering X
• It checks all neighbors; node 2 is closest
• It forwards the message to node 2
• The Cartesian distance decreases monotonically with each transmission
• Expected # overlay hops: (D/4)·N^(1/D)
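A small sketch of this greedy forwarding rule (illustrative Python; coordinates are assumed to lie on a unit torus so distances wrap around):

    def torus_dist(p, q, size=1.0):
        # Euclidean distance with wrap-around in every dimension
        return sum(min(abs(a - b), size - abs(a - b)) ** 2 for a, b in zip(p, q)) ** 0.5

    def next_hop(neighbors, target):
        # neighbors: dict mapping neighbor id -> centre point of its zone
        return min(neighbors, key=lambda n: torus_dist(neighbors[n], target))

    # e.g. forwarding towards (0.5, 0.8)
    print(next_hop({"node2": (0.4, 0.7), "node3": (0.1, 0.2)}, (0.5, 0.8)))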
CAN : Join
• To join the CAN:
  - find some node in the CAN (via a bootstrap process)
  - choose a point in the space uniformly at random
  - using CAN routing, inform the node that currently covers that point; that node splits its space in half
    - 1st split along the 1st dimension
    - if the last split was along dimension i < D, the next split is along the (i+1)st dimension
    - e.g., for the 2-d case, split on the x-axis, then the y-axis
  - it keeps half the space and gives the other half to the joining node
• The likelihood of a rectangle being selected is proportional to its size, i.e., big rectangles are chosen more frequently
(Figure sequence) CAN join:
1) Discover some node "I" already in the CAN, via a bootstrap node
2) The new node picks a random point (x,y) in the space
3) I routes to (x,y) and discovers node J, the current owner of that zone
4) J's zone is split in half… the new node owns one half
CAN Failure recovery

• View the partitioning as a binary tree
  - Leaves represent regions covered by overlay nodes
  - Intermediate nodes represent "split" regions that could be "reformed"
  - Siblings are regions that can be merged together (forming the region that is covered by their parent)
CAN Failure Recovery
• Failure recovery when leaf S is removed
  - Find a leaf node T that is either
    - S's sibling, or
    - a descendant of S's sibling where T's sibling is also a leaf node
  - T takes over S's region (moves to S's position in the tree)
  - T's sibling takes over T's previous region
Maintenance
• Use zone takeover in case of failure or departure of a node
• Send your neighbor table to your neighbors at discrete time intervals t to show that you are alive
• If a neighbor does not report being alive within time t, take over its zone
• Zone reassignment is then needed
Zone reassignment

(Figure sequence) Partition tree and zoning before and after a takeover: when a node (e.g., node 2) disappears, the leaf chosen by the recovery rule (its sibling, or a suitable descendant of the sibling, e.g., node 4) takes over the vacated zone, and the partition tree is updated so that sibling zones can later be merged back into their parent region.
