6

FAULT-TOLERANT
TECHNIQUES
To achieve true fault tolerance, a system must be able to combat both physical
and algorithmic faults.
However, most existing fault-tolerant techniques essentially apply to physical
faults only. Algorithmic fault tolerance is still in its infancy.
Often fault-tolerant techniques involve several issues, including:
Use of Redundancy
Fault Confinement
Fault Masking
Error Detection
Fault Diagnosis
Reconfiguration
Error Recovery.


USE OF REDUNDANCY
As previously mentioned, fault tolerance can be achieved through the use of
redundancy in a system, so that the Logical Machine continues to operate
correctly (according to specifications) even though there are (physical) faults
in the Physical Machine or algorithmic faults in the Logical Machine.
(Details on types of redundancy will be dealt with later.)

FAULT CONFINEMENT
Fault confinement prevents the effects of a specific fault from spreading to other
unaffected processes or components, and applies to both hardware and software
faults.
For example, the crash of a single processor in a distributed processing network
should not bring the entire system down.

FAULT MASKING
Fault masking techniques conceal the effects of a fault from the rest of the
system, i.e. prevent the Logical Machine from being corrupted by errors. These
techniques are implemented on the Physical Machine.



ERROR DETECTION
Error detection is the identification of errors that exist in a system. As stated
before, faults in the Physical Machine can cause errors to appear in states
of the Logical Machine. The detection of these errors ensures that the Log-
ical Machine continues to operate correctly. Therefore a failure in the error
detection algorithms may lead to system failure.
Often the most crucial factor involved in error detection algorithms is the
time taken to detect an error, which contributes to the latency problem and
attendant error propagation.
Error detection algorithms are often implemented at many levels, including
gate-to-gate interaction in hardware, memory parity checks, processor register,
etc. Generally error detection mechanisms can be categorised into the following
two types:
Self-checking (SC), whereby a module can detect internal errors concurrently
with normal operation; and
Nonself-checking (NSC), whereby a module does not have internal error
detection capability.
Both hardware and software methods are used for error detection. Some error
detection methods are:
Coding of data, such that invalid data (caused by errors) can be detected
by, for example, hardware using SC circuits;
Majority voting on the same data produced by different sources, where
discrepancy indicates a fault. This is an NSC type;
Watchdog timers, where timeout indicates a fault. It is imperative that
the timeout be made neither too short, resulting in false alarms, nor too
long, resulting in long error latency which, as we shall see, may degrade a
system's dependability;
Regular testing of individual units. A drawback of this approach is that
system complexity may affect the testability of the units.
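The watchdog timer method above can be sketched as follows. This is only a minimal illustration under stated assumptions: the class name `Watchdog`, its `kick` and `expired` methods, and the injectable `clock` parameter are hypothetical names chosen for the example, not part of any particular system.

```python
import time


class Watchdog:
    """Minimal watchdog: a fault is signalled if no kick arrives in time."""

    def __init__(self, timeout_s, clock=time.monotonic):
        # Too short a timeout gives false alarms; too long gives long error latency.
        self.timeout_s = timeout_s
        self._clock = clock          # injectable so the sketch can be tested
        self._last_kick = clock()

    def kick(self):
        """Called periodically by the monitored unit to show it is alive."""
        self._last_kick = self._clock()

    def expired(self):
        """True when the monitored unit has been silent longer than the timeout."""
        return self._clock() - self._last_kick > self.timeout_s
```

A supervising unit would poll `expired()` and treat a `True` result as an error signal for the monitored unit.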

FAULT DIAGNOSIS
Fault diagnosis is carried out to identify faulty units so that appropriate action
can be taken to correct the fault once the error detection mechanisms signal
an error.
Often fault diagnosis techniques are component-dependent: a fault diagnosis
method for a specific component may not be applicable to another component.
For example, the fault diagnosis mechanisms for disk drives are different from
those for memory boards: in memory, failed bits can be diagnosed by repeatedly
writing selected bit patterns to memory words and reading them back;
whereas, in disk drives, random track searches and head
movements are performed to diagnose and detect a faulty drive mechanism.
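The memory case can be illustrated with a short sketch. It assumes 8-bit words and a memory object that supports indexing; the function `diagnose_memory`, the chosen test patterns, and the `StuckBitMemory` class (a stand-in for faulty hardware) are all hypothetical and exist only for this example.

```python
def diagnose_memory(memory, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Return the addresses of words that fail to hold a written bit pattern."""
    faulty = set()
    for addr in range(len(memory)):
        saved = memory[addr]             # keep the current content
        for pattern in patterns:
            memory[addr] = pattern       # write a selected bit pattern
            if memory[addr] != pattern:  # read it back and compare
                faulty.add(addr)
        memory[addr] = saved             # restore the original content
    return sorted(faulty)


class StuckBitMemory:
    """Hypothetical 8-bit memory with some bits stuck at 1 at one address."""

    def __init__(self, size, bad_addr, stuck_mask):
        self._cells = [0] * size
        self._bad_addr = bad_addr
        self._stuck_mask = stuck_mask

    def __len__(self):
        return len(self._cells)

    def __getitem__(self, addr):
        value = self._cells[addr]
        return value | self._stuck_mask if addr == self._bad_addr else value

    def __setitem__(self, addr, value):
        self._cells[addr] = value & 0xFF
```

Calling `diagnose_memory(StuckBitMemory(8, 3, 0x04))` reports address 3, because the all-zeros pattern cannot be read back from the word with the stuck bit.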

RECONFIGURATION
Once an error has been detected and/or fault diagnosis has been carried out,
system reconfiguration must be initiated, in that the faulty unit must be isolated
and replaced (if spares are available).
If spares are not available and graceful degradation is permitted, the system
can continue to run at a reduced level of performance.
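The reconfiguration decision described above might be sketched as below. The function name `reconfigure`, the list-based representation of units, and the `allow_degradation` flag are assumptions made purely for illustration.

```python
def reconfigure(active, spares, faulty, allow_degradation=True):
    """Return (new_active, remaining_spares) after isolating the faulty unit."""
    remaining = [unit for unit in active if unit != faulty]  # isolate the faulty unit
    spares = list(spares)
    if spares:
        remaining.append(spares.pop(0))   # replace it with an available spare
    elif not allow_degradation:
        raise RuntimeError("no spare available and degradation is not permitted")
    # With no spare but degradation permitted, the system runs on fewer units.
    return remaining, spares
```

For example, `reconfigure(["A", "B", "C"], [], "B")` leaves the system running on units A and C only, which corresponds to graceful degradation.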

ERROR RECOVERY
Error recovery is initiated by signals from the error detection mechanisms.
Error recovery involves returning the system to some normal level of operation
after an error has been detected and the system is reconfigured. Error recovery
may also involve shutting the system down safely (safe shutdown) if necessary.
Error recovery algorithms may include such functions as error correction, fault
location, and exclusion or repair of faulty units. The exact functions of error
recovery algorithms are determined by the system design requirements.
7
TYPES OF REDUNDANCY
As stated previously, fault tolerance can be achieved through the use of re-
dundancy.
Redundancy is the provision of extra resources, information and/or time beyond what is
required for normal system operation. This obviously involves some inherent
costs, including those of hardware and software, effects on system performance,
and the penalties of space and power.
Usually redundancy techniques can be classified into the following categories:
Hardware Redundancy
Software Redundancy
Information Redundancy
Time Redundancy.


HARDWARE REDUNDANCY
Here additional hardware is employed, usually for the purpose of either de-
tecting or tolerating faults.
There are three types of hardware redundancy: static (passive) hardware
redundancy, dynamic (active) hardware redundancy, and hybrid hardware
redundancy.

STATIC HARDWARE REDUNDANCY


Here, a system has replicated powered-up units, each doing exactly the same
tasks, provided with the same inputs, and synchronised in some way.

TMR
The most common form of static hardware redundancy is TMR (Triple Modular
Redundancy). A TMR system uses three identical units and each unit
computes its own result. Majority voting is then used to determine the output.
This selected output will be correct so long as two out of the three units and
the voter of the TMR system are working correctly.
If one of the units fails, the voter will discard the faulty output, thus ensuring
a correct result. The other non-faulty units will have no way of knowing of the
fault and will continue to operate. In this way the fault is said to have been
masked.
However, if the voter itself fails, the whole system will fail. To reduce
the chances of this single point of failure, additional voters, usually triplicated,
can be used.
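The voting step of a TMR system can be sketched in software as follows. The function name `majority_vote` is an assumption for this example, and the sketch assumes unit outputs are hashable values that can be compared for equality. A real voter is typically a hardware circuit.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of units, or raise if none."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority: too many units disagree")
    return value
```

With three units, `majority_vote([42, 42, 7])` returns 42, so the single faulty output is masked. If all three units disagree, no majority exists and the sketch raises an error instead of guessing.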

NMR
The general form of the TMR approach is N-modular redundancy (NMR),
where N is often selected as an odd integer greater than or equal to three,
which ensures that majority voting works. The higher N is, the greater the
number of faulty units that can be tolerated, but the higher the cost that will
be incurred. Therefore the major tradeoff in the NMR approach is the fault
tolerance achieved versus the required amount and cost of hardware.
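The tradeoff can be made concrete with a small calculation. With N units and majority voting, a majority of correct units remains as long as at most (N - 1) / 2 units are faulty. The function name `max_masked_faults` is a hypothetical name used only in this sketch.

```python
def max_masked_faults(n):
    """Number of faulty units an NMR system with n units can mask by majority vote."""
    if n < 3 or n % 2 == 0:
        raise ValueError("n is assumed to be an odd integer >= 3")
    return (n - 1) // 2
```

For N = 3, 5 and 7 this gives 1, 2 and 3 tolerated faulty units respectively, so each additional masked fault costs two more units.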

Advantages
The advantages of the static redundancy approach are: fault masking (which
is less complex than other fault diagnosis or discrimination mechanisms), and
relatively higher reliability.

Disadvantages
The disadvantages are: synchronisation overheads, a possibility of a single
point of failure (in the form of a single voter), and substantial cost in hardware.

DYNAMIC HARDWARE REDUNDANCY


Dynamic hardware redundancy is a standby-sparing type of redundancy whereby
redundancy is employed in the form of spare units.
These spare units may be either powered up, running in parallel with the
identical active units (or idling), or unpowered. The unpowered spares
have the advantage of a lower failure rate but the disadvantage of requiring
time for switching into the system to replace the faulty active units. The
choice of powered-up or unpowered spare units depends on applications.
In contrast to static hardware redundancy techniques, dynamic hardware re-
dundancy approaches do not aim to prevent faults in the Physical Machine
from producing errors in the Logical Machine. Errors are detected in the
Logical Machine.
In dynamic hardware redundancy, whenever a fault occurs in an active unit
fault diagnosis is carried out to identify and locate the faulty unit, so that this
can be replaced by an identical working spare unit which automatically takes
over the system's operation.
In summary the following error recovery algorithm is involved:
Isolate the faulty module
Switch in a standby unit
Remove any error from the Logical Machine.
(This may be achieved through a mechanism called rollback recovery
using checkpoints. The state of the system is stored at each checkpoint
to allow the system to revert to the previous checkpoint, i.e. to restore
the system state at the last checkpoint before the fault occurred.)
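Rollback recovery using checkpoints can be sketched as below. This is a minimal in-memory illustration: the class name `CheckpointedSystem` and its methods are assumptions for the example, and a real system would store checkpoints in storage that survives the fault.

```python
import copy


class CheckpointedSystem:
    """Keeps one saved copy of the state so that errors can be rolled back."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)   # initial checkpoint

    def checkpoint(self):
        """Store the current state as the latest checkpoint."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current (possibly erroneous) state and restore the checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)
```

After the faulty module has been isolated and a standby unit switched in, calling `rollback()` removes the error from the state, and computation resumes from the last checkpoint.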

DUPLEX
One of the most common forms of dynamic hardware redundancy is a duplex
system in which two identical units (one of which is a powered-up spare unit)
run and perform the same computations in parallel. The results of these
computations are compared, and if there are discrepancies, this means there
is an error.
In the event of discrepancies, the comparator is unable to determine which
of the two units is faulty, so techniques such as diagnostic programs and
self-checking circuits are used to identify the faulty unit. Once identified, the faulty
unit is switched out of the system. If it is the active unit which is faulty, the
powered-up spare unit takes over the system's operation.
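The compare-then-diagnose behaviour of a duplex system might be sketched as follows. The function `duplex_step` and the `self_test` callback (standing in for a diagnostic program) are hypothetical names, and units are modelled simply as callables.

```python
def duplex_step(active, spare, x, self_test):
    """Run both units on x; return (result, name of faulty unit or None)."""
    a, b = active(x), spare(x)
    if a == b:
        return a, None                 # results agree: no error detected
    # The comparator only knows the results differ, so diagnostics are run.
    if not self_test(active):
        return b, "active"             # switch out the active unit, use the spare
    if not self_test(spare):
        return a, "spare"              # switch out the spare, keep the active unit
    raise RuntimeError("results differ but diagnosis found no faulty unit")
```

The last branch covers faults that the diagnostic does not reveal, such as a transient fault. In that case the sketch cannot decide which result to trust.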
Example duplex systems include Voyager in the space flight area and ESSs in
the telecommunications area.

Advantages
A level of dependability can possibly be sustained for a relatively long period
of time. Systems can be restored quickly to an operational state using recovery
techniques.

A Disadvantage
It is not applicable to systems that do not permit momentary erroneous
results to be generated in the Logical Machine.

HYBRID HARDWARE REDUNDANCY


The hybrid hardware redundancy approach takes advantage of the attractive
features of both the static and dynamic approaches.
The hybrid approach can usually achieve the highest dependability, but is
typically the most expensive.

SOFTWARE REDUNDANCY
Here extra software (or lines of code) is added to a system to provide fault
tolerance.
There are two types of redundant software: redundant operational software
and redundant temporal or spatial software.

OPERATIONAL SOFTWARE REDUNDANCY


This technique is used as an alternative to hardware redundancy to detect or
tolerate faults that can occur in hardware.
Here programs or additional lines of code are employed to achieve tolerance to
hardware faults. These include any software used in relation to fault detection,
fault diagnosis, fault masking, reconfiguration, and recovery. Also included are
backup copies of critical programs. But the technique is mainly associated with
error detection (programs are used to carry out consistency checks or capability
checks) and recovery (programs are used to invoke additional modules, to carry
out state recovery/restoration and to establish an operational state).
Issues in choosing between hardware and software implementations of tolerance
to hardware faults are: (i) software is more flexible, (ii) hardware is usually
faster, (iii) little is known about writing correct recovery procedures, and (iv)
software reliability is difficult to evaluate.

TEMPORAL AND SPATIAL SOFTWARE REDUNDANCY
Here additional software is used to detect or tolerate faults due to software,
i.e. software fault tolerance.
Often, software faults arise from software design or coding flaws. As such, a
software fault detection technique must be able to detect software design or
coding mistakes. (This is the subject of Software Fault Tolerance, which
will be dealt with in detail later.)

INFORMATION REDUNDANCY

Here, extra information is added to data words in a system and is only used for
error detection or error recovery. This additional information is not necessary
if the system is fault free.
Examples include:
Hamming codes for error correction in data communication systems
Inclusion of a parity bit for checking a faulty bit
Duplication codes for error detection in memory systems
Checksum for error detection of data transferred in packet-switched net-
works.
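The simplest of these examples, the parity bit, can be sketched as below. The sketch assumes even parity and represents a data word as a list of 0/1 integers; the function names are hypothetical.

```python
def add_even_parity(bits):
    """Append a parity bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True if the word (data bits plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0
```

A single parity bit detects any odd number of flipped bits but misses an even number of flips, and it cannot locate the faulty bit. Correction requires more redundant bits, as in Hamming codes.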

TIME REDUNDANCY

Time redundancy involves a repetition of any instruction, computation, code
segment or program to achieve detection of errors (resulting from transient or
intermittent faults) or fault tolerance. An example is the retransmission of a message
when an acknowledge signal fails to arrive before a timeout.
Time redundancy is important in the last stage of a rollback recovery, i.e.
restoring the system state to the last checkpoint.
Care must be taken when using the time redundancy approach: the repeated
action should not be taken as a new process and a process that should be
carried out once should not be repeated.
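One common way to respect this caution is to tag each message with a sequence number so that the receiver can recognise a retransmission. The sketch below assumes that approach. `send_with_retry`, `Receiver`, and the `send` callback (which returns True when an acknowledgement arrives in time) are hypothetical names for this example.

```python
def send_with_retry(send, message, max_attempts=3):
    """Retransmit until send() reports an acknowledgement; True on success."""
    for _ in range(max_attempts):
        if send(message):
            return True
    return False


class Receiver:
    """Applies each sequence number at most once, so retransmissions are harmless."""

    def __init__(self):
        self.seen = set()
        self.applied = []

    def deliver(self, seq, payload):
        if seq not in self.seen:          # a repeated message is not a new action
            self.seen.add(seq)
            self.applied.append(payload)
        return True                       # acknowledge even when it is a duplicate
```

If the first acknowledgement is lost, the sender retransmits, but the receiver applies the payload only once.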
