
A REVIEW PAPER ON FAULT TOLERANCE IN DISTRIBUTED SYSTEM

Ahmed Ali Abbasi

Computer Science Dept, Baluchistan University of Information Technology, Engineering and
Management Sciences, Quetta.

aaabbasi99@gmail.com

ABSTRACT
Distributed systems and their challenges are an ongoing research area in computing. The
challenge we focus on in this paper is fault tolerance. A fault, whether caused by link failure,
resource failure, or any other reason, must be tolerated for the system to keep working smoothly
and accurately. Such faults can be detected and recovered from using a range of techniques. An
appropriate fault detector can avoid losses due to a system crash, and a reliable fault tolerance
technique can prevent system failure. This paper reviews fault tolerance and related issues, and
surveys the techniques proposed to achieve fault tolerance in distributed systems.

1. INTRODUCTION
The increasing use of computers and our increasing reliance on them have led to a need for
highly reliable computer systems. There are many areas where computers perform life-critical
tasks, such as flight control systems and patient-monitoring systems. Other application areas
include banking and stock markets. In these systems, failure of computers may lead to
catastrophe, great financial loss, or even loss of human life. Such applications need highly
dependable systems. Dependability means that a system can be trusted to perform the service for
which it has been designed. Dependability can be decomposed into reliability, availability,
safety, and security: reliability deals with continuity of service, availability with readiness
for use, safety with avoidance of catastrophic consequences on the environment, and security
with prevention of unauthorized access to and/or handling of information.
Coulouris defines a distributed system as “A system in which hardware or software components
located at networked computers communicate and coordinate their actions only by message
passing”; and Tanenbaum defines it as “A collection of independent computers that appear to the
users of the system as a single computer.” Fault tolerance is the property that enables a system
to continue operating properly in the event of the failure of some of its components.

Several types of fault can occur in a distributed system. They can be classified as follows:

Network faults: Faults that occur in the network due to network partition, packet loss, packet
corruption, destination failure, link failure, etc.
Physical faults: Faults that occur in hardware, such as faults in CPUs, memory, or storage.
Media faults: Faults that occur due to media head crashes.
Processor faults: Faults that occur in the processor due to operating system crashes, etc.
Process faults: Faults that occur due to shortage of resources, software bugs, etc.
Permanent faults: Failures caused by events such as accidentally cutting a wire or a power
breakdown. These failures are easy to reproduce. They can cause major disruptions, and some part
of the system may stop functioning as desired.
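
As a rough illustration of the classification above, the fault classes can be written as an
enumeration that a monitoring component might use to label a reported cause. The names
FaultClass, CAUSES, and classify, and the exact cause strings, are hypothetical; they only mirror
the categories listed above and are not taken from any of the reviewed papers.

```python
from enum import Enum
from typing import Optional


class FaultClass(Enum):
    """Fault classes as listed in the text above."""
    NETWORK = "network"
    PHYSICAL = "physical"
    MEDIA = "media"
    PROCESSOR = "processor"
    PROCESS = "process"
    PERMANENT = "permanent"


# Hypothetical lookup table: reported cause -> fault class.
CAUSES = {
    "network partition": FaultClass.NETWORK,
    "packet loss": FaultClass.NETWORK,
    "packet corruption": FaultClass.NETWORK,
    "link failure": FaultClass.NETWORK,
    "cpu fault": FaultClass.PHYSICAL,
    "memory fault": FaultClass.PHYSICAL,
    "storage fault": FaultClass.PHYSICAL,
    "media head crash": FaultClass.MEDIA,
    "operating system crash": FaultClass.PROCESSOR,
    "resource shortage": FaultClass.PROCESS,
    "software bug": FaultClass.PROCESS,
    "cut wire": FaultClass.PERMANENT,
    "power breakdown": FaultClass.PERMANENT,
}


def classify(cause: str) -> Optional[FaultClass]:
    """Return the fault class for a reported cause, or None if it is not in the table."""
    return CAUSES.get(cause.strip().lower())
```

A caller would pass a free-text cause such as "Packet Loss" and receive the matching class;
causes outside the table are returned as None rather than guessed.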

METHODOLOGY
We aim to review different fault tolerance techniques in distributed systems. For this
purpose, we have studied several papers related to fault tolerance in distributed systems and
provide a brief review of each.

RELATED WORK
Stoller, S., Schneider, F. (2003) present a method for automated analysis of fault tolerance in
distributed systems. The method is based on a data flow model of distributed computing. To make
the analysis more efficient, their framework includes abstractions for the contents, number, and
ordering of messages sent on each channel, and the method does not capture temporal relationships
between messages received by a component on different channels.

Levitin, G., Xie, M. & Zhang, T. (2007) consider the performance and reliability of
fault-tolerant software running on a hardware system that consists of multiple processing units.
The software consists of functionally equivalent but independently developed versions that start
execution simultaneously. The versions differ in computational complexity and reliability. The
system completes the task when the outputs of a pre-specified number of versions coincide. The
processing units are characterized by different availability and processing speed. It is assumed
that they are able to share the computational burden perfectly and that execution of each version
can be fully parallelized.
An algorithm based on the universal generating function technique is used to determine the
distribution of system task execution time. It allows analysts to evaluate the expected
performance (task execution time) and reliability (probability that the task is completed within
a given time) of a complex fault-tolerant hardware-software system consisting of non-identical
hardware components. It takes into account both the reliability of the software versions and the
availability of the hardware units.
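
The completion rule described above (the task finishes when the outputs of a pre-specified number
of versions coincide) can be sketched as follows. This is a minimal illustration of the voting
rule only, not the authors' universal generating function algorithm. The function name
first_agreed_output and the assumption that outputs arrive ordered by completion time are ours.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def first_agreed_output(outputs: Iterable[Hashable], k: int) -> Optional[Hashable]:
    """Return the first value produced by k versions, or None if no value reaches k.

    `outputs` is assumed to be ordered by the time each version finished.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    counts = Counter()
    for value in outputs:
        counts[value] += 1
        # The task completes as soon as k versions have produced the same output.
        if counts[value] == k:
            return value
    return None
```

With k = 2 and outputs arriving as 7, 9, 7, the task completes on the third output with the
value 7. If no output is ever repeated k times, the function returns None, which corresponds to
the task not completing.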

Gershteyn, Y. (2003) presents a paper that reviews fault-tolerance aspects of distributed systems.
Because partial failure is possible, the system should routinely recover from it without
disturbing overall performance. There is a large body of research on how to make distributed
systems fault-tolerant. The paper analyzes ideas on different aspects of fault tolerance, such as
cooperative leasing and scalable consistency maintenance in content distribution networks;
improving fault tolerance by replicating agents; and a problem-specific branch-and-bound
algorithm for asynchronous distributed systems.

Golding, R., Borowsky, E. present a paper that describes one component of a self-healing storage
system: the component that allows for automatic recovery of access to data when the power
comes back on after a large-scale outage. The protocol guarantees that service will be restored
quickly and automatically once enough failures are repaired. They have designed a replication
management system that automatically and correctly recovers from the large-scale failures they
expect in distributed data storage systems.

Dave, S., Raghuvanshi, A. (2012) provide a study of fault tolerance techniques in distributed
systems, especially replication and checkpointing. They also suggest a fault tolerance approach
that combines replication and checkpointing, implemented in Java RMI. They propose an improved
method to ensure consistency and evaluate it by simulating the distributed environment using
Java RMI. Their algorithm is simple and ensures consistency in a straightforward manner.
According to the authors, the work is intended as a reference for researchers and practitioners
who design and develop high-performance systems that tolerate multiple faults.
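
To make the combination of replication and checkpointing concrete, the following is a minimal
single-process sketch. It is not the authors' Java RMI implementation. The Replica and
CheckpointedService classes are hypothetical, and replicas are simulated as in-memory objects
rather than remote hosts. The service copies its state to every live replica at each checkpoint
and, after a fault, rolls back to the newest checkpoint held by any surviving replica.

```python
import copy
from typing import Any, Dict, List, Optional


class Replica:
    """Hypothetical in-memory store standing in for a remote replica."""

    def __init__(self) -> None:
        self.alive = True
        self.version = -1
        self.state: Optional[Dict[str, Any]] = None

    def store(self, version: int, state: Dict[str, Any]) -> bool:
        """Save a checkpoint; a failed replica silently refuses it."""
        if not self.alive:
            return False
        self.version = version
        self.state = copy.deepcopy(state)
        return True


class CheckpointedService:
    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas
        self.state: Dict[str, Any] = {}
        self.version = 0

    def checkpoint(self) -> int:
        """Copy the current state to every live replica; return how many copies were stored."""
        self.version += 1
        return sum(r.store(self.version, self.state) for r in self.replicas)

    def recover(self) -> bool:
        """Roll back to the newest checkpoint held by any live replica."""
        live = [r for r in self.replicas if r.alive and r.state is not None]
        if not live:
            return False
        newest = max(live, key=lambda r: r.version)
        self.state = copy.deepcopy(newest.state)
        self.version = newest.version
        return True
```

In this sketch, losing one replica does not lose the checkpoint as long as another replica holds
a copy; updates made after the last checkpoint are lost on recovery, which is the usual cost of
checkpoint-based rollback.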

Kumar, A., Yadav, R., Ranvijay, Jain, A. (2011) present a paper in which they investigate the
different fault tolerance techniques used in many real-time distributed systems. The focus is on
the types of fault occurring in the system, the fault detection techniques, and the recovery
techniques used. A fault, whether caused by link failure, resource failure, or any other reason,
must be tolerated for the system to keep working smoothly and accurately. Such faults can be
detected and recovered from using a range of techniques. An appropriate fault detector can avoid
losses due to a system crash, and a reliable fault tolerance technique can prevent system
failure. The paper shows how these methods are applied to detect and tolerate faults in various
real-time distributed systems.
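
One common form of fault detector is a timeout-based heartbeat monitor, sketched below as an
illustration. It is not taken from the paper under review. The HeartbeatDetector class and its
parameters are hypothetical, and a node is only suspected, not proven, to have failed, since a
slow link can delay heartbeats.

```python
import time
from typing import Callable, Dict, List


class HeartbeatDetector:
    """Suspects a node when no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self.clock = clock  # injectable so the detector can be tested without sleeping
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        """Record that a heartbeat from `node` has just arrived."""
        self.last_seen[node] = self.clock()

    def suspected(self) -> List[str]:
        """Return the nodes whose last heartbeat is older than the timeout, sorted by name."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)
```

Each node would call heartbeat periodically, and a recovery component would poll suspected to
decide when to trigger a recovery technique such as failover or rollback.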

REFERENCES
[1] Stoller, S.D., Schneider, F.B. (2003) 'Automated Analysis of Fault-Tolerance in Distributed
Systems'.

[2] Levitin, G., Xie, M. & Zhang, T. (2007) 'Reliability of Fault-Tolerant Systems with Parallel
Task Processing', European Journal of Operational Research, 177 (1), 420-430.

[3] Gershteyn, Y. (2003) 'Fault Tolerance in Distributed Systems'.

[4] Golding, R., Borowsky, E. ( ) 'Fault-Tolerant Replication Management in Large-Scale
Distributed Storage Systems'.

[5] Dave, S., Raghuvanshi, A. (2012) 'Fault Tolerance Techniques in Distributed System',
International Journal of Engineering Innovation & Research, Volume 1 (Issue 2).

[6] Kumar, A., Yadav, R.S., Ranvijay, Jain, A. (2011) 'Fault Tolerance in Real Time Distributed
System', International Journal on Computer Science and Engineering, Vol. 3 (No. 2).
