
4 Fault-Tolerant Networks (ebook)

“With processors becoming cheaper, distributed systems are becoming
more commonplace; we look at some key network topologies (Mesh and
Mobile Ad Hoc Networks) and consider how to quantify and enhance
their fault-tolerance, in terms of:
* spare nodes (for mesh networks),
* multiple paths (for mobile ad hoc networks), and
* fault-tolerant routing (for mesh networks).”

Source Node and Destination Node
“Network links and switchboxes establish one
or more paths between the sender of the
message (the source) and its receiver (the
destination). These links and switchboxes can
be either unidirectional or bidirectional. The
specific organization, or topology, of the
network may provide only a single path
between a given source and a given destination,
in which case any fault of a link or switchbox
along the path will disconnect the source–
destination pair. Fault tolerance in networks is
thus achieved by having multiple paths
connecting source to destination, and/or spare
units that can be switched in to replace the
failed units.”

4.1.2 Computer Networks Measures
“The following measures express the
degradation of the dependability and
performance of a computer network in the
presence of faults:”

• Reliability (network or path)
• Bandwidth
• Connectability

Note: CSE4RFS focuses on Reliability (network or path) only – see next slide.

Concept of Network Reliability
“We define R(t), the network reliability at time
t, as the probability that all the nodes are
operational and can communicate with each
other over the entire time interval [0, t]. If no
redundancy exists in the network, R(t) is the
probability of no faults occurring up to time t.
If the network has spare resources in the form
of redundant nodes and/or multiple paths
between source–destination pairs, the fact that
the network is operational at time t means that
any failed processing node has been
successfully replaced by a spare, and even if
some links failed, every source–destination pair
can still communicate over at least one fault-
free path.”
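
As a rough illustration of the non-redundant case, the sketch below assumes (this is not stated in the quoted text) that components fail independently with constant failure rates, so each component has reliability exp(-rate * t) and R(t) is the product over all components. The function name and the example rates are hypothetical.

```python
import math

def network_reliability_no_redundancy(failure_rates, t):
    """R(t) with no redundancy: every component must survive [0, t].

    Assumption: independent failures with constant rates
    (exponentially distributed lifetimes).
    """
    return math.prod(math.exp(-rate * t) for rate in failure_rates)

# Hypothetical example: 4 nodes and 5 links, each with rate 0.001 per hour.
rates = [0.001] * 9
r = network_reliability_no_redundancy(rates, t=100.0)
```

Under these assumptions the result equals exp(-0.9), because the nine rates simply add up in the exponent.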

4.2.3 Rectangular Mesh Networks (Spare Nodes & Network Reliability)
“A two-dimensional N × M rectangular mesh
network is a simple example of a network
topology in which all the nodes are computing
nodes and there are no separate switchboxes
(see Figure 4.6). Most of the NM computing
nodes (except the boundary nodes) have four
incident links. To send a message to a node that
is not an immediate neighbour, a path from the
source of the message to its destination must be
identified and the message has to be forwarded
by all the intermediate nodes along that path.”
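
The link structure described above can be sketched as follows. The zero-based (row, col) coordinates and the function name are assumptions made for illustration only.

```python
def mesh_neighbours(row, col, n_rows, n_cols):
    """Return the nodes directly linked to (row, col) in an n_rows x n_cols mesh."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    # Keep only candidates that fall inside the mesh boundary.
    return [(r, c) for r, c in candidates if 0 <= r < n_rows and 0 <= c < n_cols]
```

An interior node gets four neighbours, a node on an edge gets three, and a corner node gets two, which is why only the non-boundary nodes have four incident links.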

The following figures:

Fig 4.6: Non-redundant Mesh Network

Fig 4.7: Redundant Mesh Network, 1 spare node for each primary node, i.e. (1,4) interstitial redundancy

Fig 4.8: Redundant Mesh Network, 4 spare nodes for each primary node, i.e. (4,4) interstitial redundancy

4 Spares for each Primary Node

E.g. Calculate reliability for mesh network with (1,4) interstitial redundancy.
“The reliability of the (1, 4) interstitial scheme
can be evaluated as follows. Let R(t) be the
reliability of every primary or spare node, and
let the mesh be of size N × M with both N and
M even numbers. In such a case, the mesh
contains N × M/4 clusters of four primary
nodes with a single spare node. The reliability
of a cluster (see next slide)”
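
The cluster formula itself sits on a slide that is not reproduced here. As a hedged sketch, assume that nodes fail independently, that links and switching are fault-free, and that a cluster survives as long as at most one of its five nodes (four primaries plus the spare) has failed. Writing R for the node reliability at a fixed time, the cluster reliability is then R^5 + 5R^4(1 - R), and the mesh reliability is that value raised to the power NM/4. The function names below are hypothetical.

```python
def cluster_reliability(r):
    """Probability that at most one of the 5 nodes in a cluster has failed."""
    return r**5 + 5 * r**4 * (1 - r)

def mesh_reliability_1_4(r, n, m):
    """Reliability of an n x m mesh with (1, 4) interstitial redundancy.

    Assumption: all n*m/4 clusters must be operational, and n and m are even.
    """
    if n % 2 or m % 2:
        raise ValueError("n and m must both be even")
    return cluster_reliability(r) ** ((n * m) // 4)

def mesh_reliability_no_spares(r, n, m):
    """Baseline for comparison: all n*m primary nodes must be fault-free."""
    return r ** (n * m)
```

Comparing `mesh_reliability_1_4(0.99, 4, 4)` with `mesh_reliability_no_spares(0.99, 4, 4)` gives a feel for how much a single shared spare per cluster helps.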

4.2.7 (Mobile) Ad Hoc Point-to-Point Networks (Multi Paths & Path Reliability)
“The computing nodes in a mobile computer
system are quite often interconnected
through a network that has no regular structure
and is instead ad hoc. Such interconnection networks,
also called mobile ad hoc point-to-point
networks, have typically more than a single
path between any two nodes, and are therefore
inherently fault tolerant. For this type
of network, we would like to be able to
calculate the path reliability, defined as the
probability that there exists an operational path
between two specific nodes, given the various
link failure probabilities.”

E.g. Calculate path reliability of a point-to-point network.
“Figure 4.13 (next slide) shows a network of
five directed links connecting four nodes.
We want to calculate the path reliability for the
source–destination pair N1–N4.”

E.g. (cont)
“The network includes three paths from N1 to N4, namely,
$P_1 = \{x_{1,2}, x_{2,4}\}$, $P_2 = \{x_{1,3}, x_{3,4}\}$ and
$P_3 = \{x_{1,2}, x_{2,3}, x_{3,4}\}$. Let $p_{i,j}$ denote the probability
that link $x_{i,j}$ is operational and define $q_{i,j} = 1 - p_{i,j}$. (Here too we
omit the dependence on time to simplify the notation.) We
assume that the nodes are fault-free; if the nodes can fail, we
incorporate their probability of failure into the failure
probability of the outgoing links. Clearly, for a path
from N1 to N4 to exist, at least one of $P_1$, $P_2$, or $P_3$ must be
operational. We may not, however, add the three
probabilities Prob{$P_i$ is operational}, because some events
will be counted more than once. The key to calculating
the path reliability is to construct a set of disjoint (or
mutually exclusive) events and then add up their
probabilities. For this example, the disjoint events that
allow N1 to send a message to N4 are (a) $P_1$ is up, (b) $P_2$ is
up but $P_1$ is down, and (c) $P_3$ is up but both $P_1$ and $P_2$ are
down. The path reliability is thus

$$\text{Reliability}(N1 \to N4) = p_{1,2}\,p_{2,4} + p_{1,3}\,p_{3,4}\,[1 - p_{1,2}\,p_{2,4}] + p_{1,2}\,p_{2,3}\,p_{3,4}\,[q_{1,3}\,q_{2,4}]$$”
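
The third term may deserve a word: if P3 is up, links (1,2) and (3,4) are operational, so P1 is down exactly when link (2,4) has failed and P2 is down exactly when link (1,3) has failed, which gives the factor q1,3 q2,4. The sketch below checks the disjoint-events expression against a brute-force sum over all 32 link states. It assumes independent link failures, and the probability 0.9 used for every link is a made-up example value.

```python
from itertools import product

LINKS = [(1, 2), (2, 4), (1, 3), (3, 4), (2, 3)]
PATHS = [[(1, 2), (2, 4)], [(1, 3), (3, 4)], [(1, 2), (2, 3), (3, 4)]]

def path_reliability_formula(p):
    """Disjoint-events expression for the N1 -> N4 path reliability."""
    q = {link: 1 - prob for link, prob in p.items()}
    return (p[1, 2] * p[2, 4]
            + p[1, 3] * p[3, 4] * (1 - p[1, 2] * p[2, 4])
            + p[1, 2] * p[2, 3] * p[3, 4] * q[1, 3] * q[2, 4])

def path_reliability_enumerated(p):
    """Sum the probability of every link-state combination with some path up."""
    total = 0.0
    for states in product([True, False], repeat=len(LINKS)):
        up = dict(zip(LINKS, states))
        if any(all(up[link] for link in path) for path in PATHS):
            prob = 1.0
            for link in LINKS:
                prob *= p[link] if up[link] else 1 - p[link]
            total += prob
    return total

# Hypothetical link probabilities: every link operational with probability 0.9.
p = {link: 0.9 for link in LINKS}
```

With these example values both functions agree on a path reliability of about 0.97119. Enumeration grows exponentially with the number of links, which is why the disjoint-events construction matters for anything larger.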

4.3 Fault-Tolerant Routing
“The objective of a fault-tolerant routing
strategy is to get a message from source to
destination despite a subset of the network
being faulty. The basic idea is simple:
if no shortest or most convenient path is
available because of link or node failures,
reroute the message through other paths to its
destination.”

4.3.2 Fault-Tolerant Routing in the Mesh
“The topology we consider is a two-dimensional rectangular
N × N mesh with at most N − 1 failures. The procedure can
be extended to meshes of dimension three or higher, and to
meshes with more than N − 1 failures. It is assumed that
all faulty regions are square. If they are not, additional nodes
are declared to have pseudo faults and are treated for routing
purposes as if they were faulty, so that the regions do become
square. Figure 4.17 (next slide) provides an example. Each
node knows the distance along each direction (east, west,
north, and south) to the nearest faulty region in that
direction.”
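
The per-direction distances mentioned above could be computed as in the following sketch. It is a centralised illustration only: in a real mesh each node would learn these values from its neighbours. The (x, y) coordinate convention (x grows to the east, y to the north), the function name and the use of None for "no fault in that direction" are all assumptions.

```python
def distances_to_faults(node, faulty, n):
    """Hops from `node` to the nearest faulty node in each compass direction.

    `faulty` is a set of (x, y) nodes (real or pseudo faults) in an n x n mesh.
    A value of None means no faulty node lies in that direction.
    """
    steps = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}
    result = {}
    for name, (dx, dy) in steps.items():
        x, y = node[0] + dx, node[1] + dy
        hops = 1
        result[name] = None
        # Walk outward until a fault is found or the mesh boundary is crossed.
        while 0 <= x < n and 0 <= y < n:
            if (x, y) in faulty:
                result[name] = hops
                break
            x, y, hops = x + dx, y + dy, hops + 1
    return result
```

For instance, with a 2 × 2 faulty square at {(3,1), (4,1), (3,2), (4,2)} in a 6 × 6 mesh, node (1,1) sees a fault two hops to the east and none in the other three directions.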

“Fault-tolerant routing in the mesh uses the idea of origin-
based routing by defining one node as the origin. By
restricting ourselves to the case in which there are no more
than N−1 failures in the mesh, we can ensure that the origin
is chosen so that its row and column do not have any faulty
nodes. Suppose we want to send a message from node S to
node D. The path from S to D is divided into an IN-path,
consisting of edges that take the message closer to the origin,
and an OUT-path, which takes the message farther away
from the origin, ultimately reaching the destination. Here,
distance is measured in terms of the number of hops along
the shortest path. In degenerate cases, either
the IN or the OUT path sets can be empty.
Key to the functioning of the algorithm is the notion of an
outbox associated with the destination node, D. The outbox is
the smallest rectangular region that contains within it both
the origin and the destination. See Figure 4.18 (next slide) for
an example. Next, we need to define safe nodes. A node V is
safe with respect to destination D and some set of faulty
nodes, F, if both the following conditions are met:
• Node V is in the outbox for D.
• Given the faulty set F, if neither V nor D is faulty, there exists
a fault-free OUT-path from V to D.”
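
A minimal sketch of the outbox and safe-node definitions follows. It assumes (x, y) integer coordinates and measures distance to the origin as the hop count in a fault-free mesh (Manhattan distance). As a simplification it treats a faulty V or D as unsafe. All function names are hypothetical.

```python
def in_outbox(v, origin, dest):
    """True if v lies in the smallest rectangle containing origin and dest."""
    (vx, vy), (ox, oy), (dx, dy) = v, origin, dest
    return min(ox, dx) <= vx <= max(ox, dx) and min(oy, dy) <= vy <= max(oy, dy)

def hops_from_origin(node, origin):
    """Hop count to the origin in a fault-free mesh (Manhattan distance)."""
    return abs(node[0] - origin[0]) + abs(node[1] - origin[1])

def has_fault_free_out_path(v, dest, origin, faulty):
    """Depth-first search over moves that each increase the distance from origin."""
    if v in faulty or dest in faulty:
        return False
    stack, seen = [v], {v}
    while stack:
        node = stack.pop()
        if node == dest:
            return True
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            # Staying inside the outbox loses nothing: a path that left it
            # would later need a step back toward the origin to reach dest.
            if (nxt not in seen and nxt not in faulty
                    and in_outbox(nxt, origin, dest)
                    and hops_from_origin(nxt, origin) > hops_from_origin(node, origin)):
                seen.add(nxt)
                stack.append(nxt)
    return False

def is_safe(v, dest, origin, faulty):
    """Both conditions: v is in the outbox and a fault-free OUT-path exists."""
    return in_outbox(v, origin, dest) and has_fault_free_out_path(v, dest, origin, faulty)
```

With a hypothetical origin at (0, 0) and destination (3, 2), node (1, 1) is safe when only (2, 1) is faulty, but becomes unsafe once (1, 2) fails as well, because both of its outward neighbours are then blocked.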

“Finally, we introduce the notion of a diagonal
band. Denote by (xA, yA) the Cartesian
coordinates of node A. Then the diagonal band
for a destination node D is
the set of all nodes V in the outbox for D
satisfying the condition that xV − yV =
xD − yD + e, where e ∈ {−1, 0, 1}.
For example, (xD, yD) = (3, 2) in Figure 4.18
and xD − yD = 3 − 2 = 1. Thus, any
node V within the outbox of D such that xV −
yV ∈ {0, 1, 2} is in its diagonal band.
It is relatively easy to show by induction that
the nodes of a diagonal band
for destination D are safe nodes with respect to
D. That is, once we get to a safe
node, there exists an OUT-path from that node
to D. Each step along an OUT-path
increases the distance of the message to the
origin: the message cannot therefore
be travelling forever in circles.”
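
The diagonal-band condition is easy to check directly. The sketch below uses the destination (3, 2) from the example. Since Figure 4.18 is not reproduced here, placing the origin at (0, 0) is purely an assumption, as are the function names.

```python
def in_outbox(v, origin, dest):
    """True if v lies in the smallest rectangle containing origin and dest."""
    (vx, vy), (ox, oy), (dx, dy) = v, origin, dest
    return min(ox, dx) <= vx <= max(ox, dx) and min(oy, dy) <= vy <= max(oy, dy)

def in_diagonal_band(v, origin, dest):
    """True if v is in the outbox and xV - yV = xD - yD + e with e in {-1, 0, 1}."""
    offset = (v[0] - v[1]) - (dest[0] - dest[1])
    return in_outbox(v, origin, dest) and offset in (-1, 0, 1)

# Assumed origin (0, 0); destination (3, 2) as in the example above.
origin, dest = (0, 0), (3, 2)
band = sorted((x, y) for x in range(4) for y in range(3)
              if in_diagonal_band((x, y), origin, dest))
```

Under these assumptions `band` holds the eight outbox nodes whose x − y value is 0, 1 or 2, including the origin and the destination.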

“The routing algorithm consists of three phases.

Phase 1. The message is routed on an IN-path until it reaches the
outbox. At the end of phase 1, suppose the message is in node U.

Phase 2. Compute the distance from U to the nearest safe node in each
direction, and compare this to the distance to the nearest faulty
region in that direction. If the safe node is closer than the fault,
route to the safe node. Otherwise, continue to route on the IN links.

Phase 3. Once the message is at a safe node U, if that node has a
safe, non-faulty neighbor V that is closer to the destination, send it
to V. Otherwise, U must be on the edge of a faulty region. In such a
case, move the message along the edge of the faulty region toward the
destination D, and turn toward the diagonal band when it arrives at
the corner of the faulty square.”

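
The Phase 2 comparison can be sketched as a small decision function. The dictionaries of per-direction distances, the use of None for "nothing in that direction", and the tie-breaking rule of preferring the nearest qualifying safe node are assumptions added for illustration. The quoted text does not say how to choose when several directions qualify.

```python
def phase2_next_move(safe_dist, fault_dist):
    """Pick a direction whose nearest safe node is closer than the nearest fault.

    Both arguments map direction names to hop counts, with None meaning
    "nothing in that direction". Returns a direction name, or None to
    signal "keep routing on the IN links".
    """
    best = None
    for direction, d_safe in safe_dist.items():
        if d_safe is None:
            continue  # no safe node this way
        d_fault = fault_dist.get(direction)
        if d_fault is None or d_safe < d_fault:
            # Assumed tie-break: prefer the closest reachable safe node.
            if best is None or d_safe < safe_dist[best]:
                best = direction
    return best
```

For example, a safe node three hops east behind a fault two hops east is rejected, while a safe node one hop north with no fault in that direction is chosen.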
“As an example, return to Figure 4.18 and
consider routing a message from node
S at the northwest end of the network to D. The
message first moves along the IN
links, getting ever closer to the origin. It enters
the outbox at node A. Since there is
a failure directly east of A, it continues on the
IN links until it reaches the origin.
Then it continues, skirting the edge of the
faulty region until it reaches node B. At
this point, it recognizes the existence of a safe
node immediately to the north and
sends the message through this node to the
destination.”
