
Modification of Extended Byzantine Agreement Algorithm For

An Improved Fault Tolerant Railway Signal Communication

By
Debobrata Podder
05CS3003
Indian Institute Of Technology –Kharagpur
Introduction
Indian Railways is the world's fourth largest railway network, after those of
the United States, Russia and China, and transports 20 million passengers
across the country. Safe and reliable railway signal communication is therefore
a top priority. Although distributed safety critical railway signaling systems
are built on fault tolerant and fail safe techniques to provide high safety and
reliability, railway communications still suffer from some issues of their own.
In this article we discuss one such issue and propose a possible solution.
The various safety critical system functions distributed geographically in a
railway yard can be grouped and interconnected to form a local area network.
For a dual node topology of the local area network, each node provides the
following functions:

1. It supports the fault tolerance provided in the given network topology.

2. It takes care of single processor fault within the same node.

3. It performs digital input/output to drive safety system functions within its
coverage area.

4. It ensures safe reaction in the safety system functions within its coverage
area in the event of multiple processor failures.

The reliability and safety of the node depend on the failure rates of the
components used in the node. Each node consists of four transputers. In this
article we try to improve the existing system by adding an extra transputer
to each node and using an extended Byzantine agreement algorithm.
Work Already Done
Basic Structure of a fail safe node
Each node consists of four transputers configured as a square mesh using the serial
links of the transputers. Of the four serial links of each transputer, one maintains
a communication link with a neighboring node, one performs input/output for
driving system functions, and the remaining two connect to the neighboring
transputers of the same node, one link each.
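
The square mesh can be pictured as a ring of four transputers, which gives two disjoint paths between any pair. The following is a minimal sketch under that reading; the 0 to 3 numbering and the names `ring_paths` and `are_disjoint` are illustrative assumptions, not part of the original design.

```python
def ring_paths(a, b, n=4):
    """Return the clockwise and counter-clockwise paths from a to b
    in a ring of n transputers numbered 0..n-1 (assumed numbering)."""
    cw = [(a + k) % n for k in range((b - a) % n + 1)]
    ccw = [(a - k) % n for k in range((a - b) % n + 1)]
    return cw, ccw


def are_disjoint(a, b, n=4):
    """True if the two paths share no intermediate transputer."""
    cw, ccw = ring_paths(a, b, n)
    return not (set(cw[1:-1]) & set(ccw[1:-1]))


if __name__ == "__main__":
    print(ring_paths(0, 2))   # two routes between opposite corners
    print(ring_paths(0, 1))   # direct link and the long way round
```

If one intermediate transputer on a path fails, the other path still connects the pair, which is the property the fault tolerance options below rely on.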

For providing fault tolerance the following options are available –

1. Majority voting of the outputs of four transputers.

2. Election of a leader from among the four transputers for a fixed tenure.

3. Leadership based on rotation.

Majority voting requires an external reliable majority voter, while electing a
leader from among the four transputers for a fixed amount of time requires an
election process, which involves time overhead. In addition, both of these options
require fault detection mechanisms. Selecting the leader on a rotation basis
for a fixed amount of time avoids the above disadvantages.
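
Leadership by rotation can be sketched as a pure function of time. This is only an illustration: the 50 ms tenure, the 0 to 3 numbering, the name `leader_at`, and the assumption that all transputers share a common notion of time are not taken from the original design.

```python
NUM_TRANSPUTERS = 4   # transputers per fail safe node (from the text)
TENURE_MS = 50        # assumed tenure length; illustrative value only


def leader_at(time_ms, tenure_ms=TENURE_MS, n=NUM_TRANSPUTERS):
    """Return the index (0..n-1) of the transputer leading at time_ms."""
    return (time_ms // tenure_ms) % n


if __name__ == "__main__":
    for t in (0, 49, 50, 120, 199, 200):
        print(t, leader_at(t))
```

Because every transputer can compute the current leader locally, no election round and no external voter is involved in this sketch.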

Analysis of Byzantine faults


In the fail safe node, conflicting messages about the health of the system can be
received from the two neighboring transputers. For example, one transputer calls
for a safe system shutdown on account of a failure, while another transputer,
suffering from a malicious fault, does not call for a safe system shutdown for the
same failure. This type of disagreement is the one described by the Byzantine
Generals problem.
Byzantine Agreement Algorithm
The Byzantine Generals problem is hard because faulty processors could lie
about the message they received. Let us now remove this possibility by
providing some mechanism to authenticate the messages. That is, suppose
each processor can append to its messages an unforgeable signature. Before
forwarding a message, a processor appends its own signature to the message
it received. The recipient can check the authenticity of each signature. Thus, if
a processor receives a message that has been forwarded through processors A
and B, it can check to see whether the signatures of A and B have been
appended to the message and if they are valid. Once again, we assume that all
processors have timers so that they can time out any (faulty) processor that
remains silent. In such a case, maintaining interactive consistency becomes
very easy. Here is an algorithm that does so:

Algorithm AByz(N,m)

Step A1. The original source signs its message ψ and sends it out to each of the
processors.

Step A2. Each processor i that receives a signed message ψ : A, where A is the
set of signatures appended to the message ψ, checks the number of signatures
in A. If this number is less than m + 1, it sends out ψ : A ∪ {i} (i.e., what it
received plus its own signature) to each of the processors not in set A. It also
adds this message, ψ, to its list of received messages.

Step A3: When a processor has seen the signatures of every other processor
(or has timed out), it applies some decision function to select from among the
messages it has received.
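
The three steps can be simulated in a few lines. The sketch below rests on several assumptions: signatures are modelled as a tuple of processor ids that the simulation never lets anyone alter (standing in for unforgeable signatures), a faulty non-source processor can only stay silent, the emptying of the message queue stands in for the timeout, and the decision function (a unique value wins, otherwise a safe default) is one possible choice. The names `abyz`, `source_sends`, `silent` and `DEFAULT` are illustrative.

```python
from collections import deque

DEFAULT = "SAFE"  # assumed fallback when conflicting values are seen


def abyz(n, m, source, source_sends, silent=frozenset()):
    """Simulate AByz(N, m) with signed messages.

    n            : number of processors, labelled 0..n-1
    m            : maximum number of faulty processors tolerated
    source       : id of the original source
    source_sends : dict, receiver id -> value the source signs for it
                   (a faulty source may send different values)
    silent       : faulty processors that never forward anything
    Returns a dict, processor id -> decided value.
    """
    received = {p: set() for p in range(n) if p != source}
    # Step A1: each queue entry is (receiver, value, signature chain).
    queue = deque((p, source_sends[p], (source,)) for p in received)
    while queue:
        p, value, sigs = queue.popleft()
        received[p].add(value)                 # Step A2: record the value
        if p in silent or len(sigs) >= m + 1:
            continue
        for q in received:                     # forward to non-signers only
            if q != p and q not in sigs:
                queue.append((q, value, sigs + (p,)))
    # Step A3: assumed decision function.
    return {p: (next(iter(v)) if len(v) == 1 else DEFAULT)
            for p, v in received.items()}
```

With a loyal source every processor sees a single value and adopts it. With a source that signs conflicting values, every loyal processor ends up with the same set of values and so falls back to the same default, which is the interactive consistency the algorithm is meant to provide.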
Extension of Byzantine Generals on a network of interconnected fail safe nodes with authenticated messages
Let each fail safe node have n transputers (Generals), where n is even. Each fail
safe node is a partially connected network in which the smallest number of
disjoint paths between any pair of transputers (Generals) is two. There are N
fail safe nodes interconnected to form n/2 multiple rings, with one of the fail
safe nodes acting as a command unit (central controller).
Let each transputer of a fail safe node be designated Tij, where i is the
number of the transputer within the node (1<=i<=n) and j is the node number
(1<=j<=N). The command unit guides each transputer of a fail safe node when it
is indecisive. In the n/2 multiple ring network, consider a fail safe node j
with its adjacent nodes j-1 and j+1. For all j, the activities of the
transputers of fail safe node j are monitored by the neighboring transputers of
nodes j-1 and j+1. Each of these runs a process S in parallel that observes the
activities of its neighboring transputer Tuv (where 1<=u<=n and 1<=v<=N) and
conveys them to the command unit. This process S is called the spy process and
differs from a General of a fail safe node in the following ways:

1. The spy process does not have any control on the outputs of fail safe node
or on the transputer (General) which it is spying.
2. The spy process does not require any additional resources; it runs in
parallel with the General's functions of the fail safe node on each
transputer.

Thus each transputer has the dual role of performing the functions of the fail
safe node to which it belongs and, at the same time, spying on the activities of
the General of the adjacent node connected to it, which it conveys to the
command unit. The spy does not take any decisions but conveys the following in a
message frame to the command unit:

1. Aberration in the incoming data rate from the General of the adjacent node
connected to it.
2. Deviation in the leadership time of the General of the adjacent node
connected to it.
The command unit decides whether a given General of a fail safe node is a
traitor or not based on the messages from the spy processes monitoring the
Generals.
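
The spy's message frame and one possible decision rule at the command unit might look as follows. The text does not specify the frame layout or the rule, so everything here is an assumption: the field names, the tolerances `rate_tol` and `lead_tol_ms`, and the rule that any single out-of-tolerance report marks a General as a traitor.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SpyReport:
    """Frame a spy sends to the command unit (field names are assumed)."""
    spy: Tuple[int, int]         # (i, j) of the reporting transputer
    general: Tuple[int, int]     # (u, v) of the observed General
    data_rate_error: float       # relative aberration in incoming data rate
    leadership_error_ms: float   # deviation in leadership time


def is_traitor(reports, general, rate_tol=0.10, lead_tol_ms=5.0):
    """Assumed rule: any report outside either tolerance marks a traitor."""
    return any(
        abs(r.data_rate_error) > rate_tol
        or abs(r.leadership_error_ms) > lead_tol_ms
        for r in reports
        if r.general == general
    )
```

The spy only fills in the frame; the comparison against tolerances happens at the command unit, matching the rule that the spy itself takes no decisions.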

The following assumptions are made:


1. There is an alternate external path available through the n/2 multiple ring
network for loyal Generals of each fail safe node for exchange of authenticated
messages.

2. If the number of traitor Generals of a fail safe node exceeds the number
of loyal Generals, the node is no longer valid; it is shut down and
isolated from the network.

Safe Shutdown Model


In this model, for a four transputer fail safe node, a transputer affected by a
first single Byzantine fault can exhibit arbitrary behavior, while the
remaining three non-faulty transputers arrive at the same result. The fault
detection mechanism of the fail safe node detects and isolates the faulty
transputer within 200 ms. The fail safe node continues to operate in degraded
mode with the three non-faulty transputers. If a second Byzantine fault occurs
in any one of the remaining three transputers, the two non-faulty transputers,
with the help of the command unit, independently initiate safe shutdown
action.
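
The safe shutdown model behaves like a three-state machine. The sketch below is one way to express it; the class and state names are illustrative assumptions, and the 200 ms detection time and the role of the command unit are not modelled.

```python
from enum import Enum


class NodeState(Enum):
    NORMAL = "normal"            # four healthy transputers
    DEGRADED = "degraded"        # one isolated, three still operating
    SAFE_SHUTDOWN = "shutdown"   # second fault: node shut down safely


class FailSafeNode:
    def __init__(self, transputers=4):
        self.active = set(range(transputers))
        self.state = NodeState.NORMAL

    def report_fault(self, transputer):
        """Isolate a faulty transputer and return the new node state."""
        if (self.state is NodeState.SAFE_SHUTDOWN
                or transputer not in self.active):
            return self.state           # nothing left to isolate
        self.active.discard(transputer)
        if self.state is NodeState.NORMAL:
            self.state = NodeState.DEGRADED
        else:
            self.state = NodeState.SAFE_SHUTDOWN
        return self.state
```

Note that the machine has no transition back from `DEGRADED` or `SAFE_SHUTDOWN`: once a transputer is isolated it stays isolated.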
Scope of improvement
Shutting down the fail safe node is not the best response to every fault. Some
faults are caused by momentary fluctuations and glitches, which can result in
aberrant behavior of the transputers and, as a consequence, of the fail safe
node. In such cases there is a good possibility that the transputers will
recover quickly. Our aim is to ensure that the nodes are not shut down
prematurely. To achieve that, we make some modifications to the above model.

Adding an extra transputer

The previously described model is based on partial ring networks connecting
each node with its neighboring nodes. In our model we construct a complete
network over all nodes by adding to each fail safe node an extra transputer
with a timer attached to it, and we adopt a token ring exchange mechanism to
exchange information.

Fig 1. A fail safe node with its extra transputer.
As shown in Fig 1, each extra transputer has four links: one outgoing
link to the extra transputer of some other fail safe node, one outgoing link to
the command unit, one incoming link from the existing four transputer
structure of the same node, and one incoming link from the extra
transputer of some other fail safe node.

Modification of existing extended Byzantine agreement algorithm
We need to modify the existing extended Byzantine agreement algorithm. Along
with the status of the fail safe node, every extra transputer also sends a
timestamp indicating the time at which the message was generated. The
message format is given below.

| Data | Time Stamp |

Functioning of this model


From time to time every fail safe node sends its message to another fail safe
node through the extra transputer, and the message travels through the entire
network until it has been received by every other node. If a General
encounters a temporary glitch, it sends a faulty message; the other nodes
recognize this and the faulty General is isolated. Instead of shutting the
system down entirely, the extra transputer sends the message to the command
unit and sets the timer. Two communication paths exist between nodes, one via
the partial ring network and the other via the all node network, so even if a
General is isolated there still remains a link over which it can send messages.
If the faulty General recovers from the fault, it sends another message, which
is received by the other nodes. If this message is received before the timer
expires, the extra transputer conveys it to the command unit. On checking the
time stamp, the command unit knows that the faulty General has recovered from
the fault. As a consequence, the General that was isolated is integrated back
into the four transputer unit, and a premature shutdown is avoided.
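
The timer-based recovery can be sketched as follows. This is a simplified illustration under stated assumptions: the message carries only a status string and a generation timestamp, the generation timestamp is used as a stand-in for the arrival time when checking the timer, and a recovery message that comes too late leads to shutdown (the text does not say what happens on expiry). The names `Message`, `RecoveryMonitor` and `window_s` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    data: str          # status of the fail safe node ("OK" or "FAULT" here)
    timestamp: float   # time the message was generated, in seconds


class RecoveryMonitor:
    """Tracks one isolated General at the command unit (assumed logic)."""

    def __init__(self, window_s):
        self.window_s = window_s   # assumed timer length
        self.fault_time = None     # timestamp of the first faulty message

    def on_message(self, msg):
        if msg.data == "FAULT":
            if self.fault_time is None:
                self.fault_time = msg.timestamp   # isolate and start timer
            return "ISOLATED"
        if self.fault_time is None:
            return "NORMAL"
        if msg.timestamp - self.fault_time <= self.window_s:
            self.fault_time = None                # recovered in time
            return "REINTEGRATED"
        return "SHUTDOWN"                         # timer already expired
```

The length of the timer window decides how long a glitch may last before it is treated as a permanent fault, so it would have to be chosen against the safety requirements of the yard.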
