FAULT-TOLERANT TECHNIQUES
To achieve true fault tolerance, a system must be able to combat both physical
and algorithmic faults.
However, most existing fault-tolerant techniques essentially apply to physical
faults only. Algorithmic fault tolerance is still in its infancy.
Often fault-tolerant techniques involve several issues, including:
Use of Redundancy
Fault Confinement
Fault Masking
Error Detection
Fault Diagnosis
Reconfiguration
Error Recovery.
USE OF REDUNDANCY
As previously mentioned, fault tolerance can be achieved through the use of
redundancy in a system, so that the Logical Machine continues to operate
correctly (according to specifications) even though there are (physical) faults
in the Physical Machine or algorithmic faults in the Logical Machine.
(Details on types of redundancy will be dealt with later.)
FAULT CONFINEMENT
Fault confinement prevents the effects of a specific fault from spreading to other
unaffected processes or components, and applies to both hardware and software
faults.
For example, the crash of a single processor in a distributed processing network
should not bring the entire system down.
FAULT MASKING
Fault masking techniques conceal the effects of a fault from the rest of the
system, i.e. prevent the Logical Machine from being corrupted by errors.
ERROR DETECTION
Error detection is the identification of errors that exist in a system. As stated
before, faults in the Physical Machine can cause errors to appear in states
of the Logical Machine. The detection of these errors ensures that the
Logical Machine continues to operate correctly. Therefore a failure in the error
detection algorithms may lead to system failure.
Often the most crucial factor involved in error detection algorithms is the
time taken to detect an error, which contributes to the latency problem and
attendant error propagation.
Error detection algorithms are often implemented at many levels, including
gate-to-gate interaction in hardware, memory parity checks, processor register,
etc. Generally error detection mechanisms can be categorised into the following
two types:
Self-checking (SC), whereby a module can detect internal errors
concurrently with normal operation; and
Nonself-checking (NSC), whereby a module does not have internal error
detection capability.
Both hardware and software methods are used for error detection. Some error
detection methods are:
Coding of data, such that invalid data (caused by errors) can be detected
by, for example, hardware using SC circuits;
Majority voting on the same data produced by different sources, where
discrepancy indicates a fault. This is an NSC type;
Watchdog timers, where timeout indicates a fault. It is imperative that
the timeout be neither too short, resulting in false alarms, nor too
long, resulting in long error latency, which, as we shall see, may degrade a
system's dependability;
Regular testing of individual units. A drawback of this approach is that
system complexity may affect the testability of the units.
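As an illustration of the watchdog timer method listed above, here is a minimal software sketch. The class name, the kick/expired interface and the injectable clock are assumptions made for this example and do not come from the original notes; a hardware watchdog would normally reset or interrupt the processor rather than be polled.

```python
import time


class WatchdogTimer:
    """Software watchdog: signals an error if not kicked within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock            # injectable clock (assumption, eases testing)
        self._last_kick = clock()

    def kick(self):
        """Called by the monitored unit to show that it is still alive."""
        self._last_kick = self._clock()

    def expired(self):
        """True when no kick has arrived within the timeout period."""
        return self._clock() - self._last_kick > self.timeout
```

The choice of `timeout` carries the tradeoff described above: a small value raises false alarms when the monitored unit is merely slow, while a large value lengthens the error latency.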
FAULT DIAGNOSIS
Fault diagnosis is carried out to identify faulty units so that appropriate action
can be taken to correct the fault once the error detection mechanisms signal
an error.
Often fault diagnosis techniques are component-dependent: a fault diagnosis
method for a specific component may not be applicable to another component.
For example, the fault diagnosis mechanisms for disk drives are different from
those for memory boards: in memory, failed bits can be diagnosed by
repeatedly writing selected bit patterns to memory words and reading them back
from the memory words; whereas, in disk drives, random track searches and head
movements are performed to diagnose and detect a faulty drive mechanism.
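The memory case can be sketched as a simple pattern test. This is only an illustration: the function name, the four 8-bit patterns and the StuckBitMemory fault model are assumptions for the example, not a description of any particular memory board's diagnostics.

```python
def diagnose_memory(memory, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Return {address: mask of failed bits} for an 8-bit-wide memory.

    `memory` is any object supporting len(), memory[addr] reads and writes.
    """
    failed = {}
    for addr in range(len(memory)):
        saved = memory[addr]
        bad_bits = 0
        for pattern in patterns:
            memory[addr] = pattern
            bad_bits |= memory[addr] ^ pattern   # bits that read back wrong
        memory[addr] = saved                      # restore the original content
        if bad_bits:
            failed[addr] = bad_bits
    return failed


class StuckBitMemory:
    """Hypothetical faulty memory: selected bits of one word are stuck at 1."""

    def __init__(self, size, stuck_addr, stuck_mask):
        self._cells = [0] * size
        self._stuck_addr = stuck_addr
        self._stuck_mask = stuck_mask

    def __len__(self):
        return len(self._cells)

    def __getitem__(self, addr):
        value = self._cells[addr]
        if addr == self._stuck_addr:
            value |= self._stuck_mask            # the stuck bits always read as 1
        return value

    def __setitem__(self, addr, value):
        self._cells[addr] = value & 0xFF
```

Calling `diagnose_memory(StuckBitMemory(8, 3, 0x04))` reports address 3 with bit mask 0x04, while a healthy memory yields an empty result.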
RECONFIGURATION
Once an error has been detected and/or fault diagnosis has been carried out,
system reconfiguration must be initiated, in that the faulty unit must be
isolated and replaced (if spares are available).
If spares are not available and graceful degradation is permitted, the system
can continue to run at a reduced level of performance.
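The reconfiguration policy described above (isolate the faulty unit, replace it if a spare exists, otherwise degrade gracefully if permitted) might be sketched as follows. The function name, the list-based representation of units and the allow_degradation flag are assumptions for illustration only.

```python
def reconfigure(active_units, faulty_unit, spares, allow_degradation=True):
    """Isolate `faulty_unit` and replace it from `spares` if possible.

    Returns the new list of active units. Raises RuntimeError when no spare
    exists and graceful degradation is not permitted (or no unit is left).
    """
    remaining = [u for u in active_units if u != faulty_unit]   # isolate
    if spares:
        return remaining + [spares.pop(0)]                      # replace
    if allow_degradation and remaining:
        return remaining                                        # run degraded
    raise RuntimeError("no spare available and degradation not permitted")
```

With one spare available, `reconfigure(["A", "B", "C"], "B", ["S1"])` returns `["A", "C", "S1"]`; once the spares are used up, a further fault leaves the system running on fewer units.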
ERROR RECOVERY
Error recovery is initiated by signals from the error detection mechanisms.
Error recovery involves returning the system to some normal level of operation
after an error has been detected and the system is reconfigured. Error recovery
may also involve shutting the system down safely (safe shutdown) if necessary.
Error recovery algorithms may include such functions as error correction, fault
location, and exclusion or repair of faulty units. The exact functions of error
recovery algorithms are determined by the system design requirements.
TYPES OF REDUNDANCY
As stated previously, fault tolerance can be achieved through the use of
redundancy.
Redundancy consists of the extra resources, information and/or time beyond what is
required for normal system operation. This obviously involves some inherent
costs, including those of hardware, software, effects on system performance,
and the penalties of space and power.
Usually redundancy techniques can be classified into the following categories:
Hardware Redundancy
Software Redundancy
Information Redundancy
Time Redundancy.
HARDWARE REDUNDANCY
Here additional hardware is employed, usually for the purpose of either
detecting or tolerating faults.
There are three types of hardware redundancy: static (passive) hardware
redundancy, dynamic (active) hardware redundancy, and hybrid hardware
redundancy.
TMR
The most common form of static hardware redundancy is TMR (Triple Modular
Redundancy). A TMR system uses three identical units and each unit
computes its own result. Majority voting is then used to determine the output.
This selected output will be correct so long as two out of the three units and
the voter of the TMR system are working correctly.
If one of the units fails, the voter will discard the faulty output, thus ensuring
a correct result. The other non-faulty units will have no way of knowing of the
fault and will continue to operate. In this way the fault is said to have been
masked.
However, if the voter itself fails, the whole system will fail. To reduce
the chances of this single point of failure, additional voters, usually triplicated,
can be used.
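A minimal sketch of the majority voter in a TMR arrangement is given below. The function name is an assumption, and the sketch models the voter in software over hashable output values; a real TMR voter is usually a hardware circuit operating bit by bit.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three units.

    Raises ValueError if all three outputs disagree (no majority exists).
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three units disagree")
    return value
```

For instance, `tmr_vote(7, 7, 9)` yields 7: the single faulty output is outvoted and never reaches the rest of the system, which is the masking effect described above.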
NMR
The general form of the TMR approach is N-modular redundancy (NMR),
where N is often selected as an odd integer greater than or equal to three,
which ensures that majority voting works. The higher N is, the greater the
number of faulty units that can be tolerated, but the higher the cost that will
be incurred. Therefore the major tradeoff in the NMR approach is the fault
tolerance achieved versus the required amount and cost of hardware.
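The tradeoff can be made concrete with a small calculation. With majority voting, an N-unit system masks up to (N - 1) / 2 faulty units (rounded down). The sketch below also estimates the probability that a majority of units work, under two assumptions that are not stated in the original notes: the voter is perfect, and units fail independently with the same reliability r. The function names are hypothetical.

```python
from math import comb


def tolerable_faults(n):
    """Number of faulty units an N-modular system with majority voting can mask."""
    return (n - 1) // 2


def nmr_reliability(n, r):
    """Probability that a majority of n units work, each with reliability r.

    Assumes a perfect voter and statistically independent unit failures.
    """
    majority = n // 2 + 1
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k)
               for k in range(majority, n + 1))
```

For N = 3 the sum reduces to 3r^2 - 2r^3, so units with r = 0.9 give about 0.972. Moving to N = 5 tolerates two faulty units instead of one, at the cost of two more units and a larger voter.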
Advantages
The advantages of the static redundancy approach are: fault masking (which
is less complex than other fault diagnosis or discrimination mechanisms), and
relatively higher reliability.
Disadvantages
The disadvantages are: synchronisation overheads, a possibility of a single
point of failure (in the form of a single voter), and substantial cost in hardware.
DUPLEX
One of the most common forms of dynamic hardware redundancy is a duplex
system, in which two identical units (one of which is a powered-up spare unit)
run and perform the same computations in parallel. The results of these
computations are compared, and any discrepancy indicates an error.
In the event of discrepancies, the comparator is unable to determine which
of the two units is faulty, so techniques such as diagnostic programs and
self-checking circuits are used to identify the faulty unit. Once identified, the faulty
unit is switched out of the system. If it is the active unit which is faulty, the
powered-up spare unit takes over the system's operation.
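The compare, diagnose and switch-over sequence can be sketched as below. This is an illustrative model only: the units are represented as Python callables, and `self_test` stands in for whatever diagnostic program or self-checking circuit a real system provides; both are assumptions of this example. The sketch also assumes that when the active unit fails its diagnostic, the spare is healthy.

```python
def duplex_step(active, spare, inputs, self_test):
    """Run one computation on both units and handle any discrepancy.

    `active` and `spare` are callables computing the same function.
    `self_test(unit)` is a hypothetical diagnostic returning True if healthy.
    Returns (result, active_unit, spare_unit); a switched-out unit becomes None.
    """
    out_active, out_spare = active(inputs), spare(inputs)
    if out_active == out_spare:
        return out_active, active, spare        # no error detected
    # The comparator only knows the units disagree; diagnosis decides which failed.
    if not self_test(active):
        return out_spare, spare, None           # spare takes over operation
    if not self_test(spare):
        return out_active, active, None         # faulty spare is switched out
    raise RuntimeError("discrepancy detected but diagnosis found no faulty unit")
```

Note that the erroneous output exists until the comparison and diagnosis complete, which is why this scheme does not mask faults in the way static redundancy does.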
Example duplex systems include Voyager in the space flight area and ESSs in
the telecommunications area.
Advantages
A level of dependability can possibly be sustained for a relatively long period
of time. Systems can be restored quickly to an operational state using recovery
techniques.
A Disadvantage
It is not applicable to systems that do not permit momentary erroneous
results to be generated in the Logical Machine.
SOFTWARE REDUNDANCY
Here extra software (or lines of code) is added to a system to provide fault
tolerance.
There are two types of redundant software: redundant operational software
and redundant temporal or spatial software.
INFORMATION REDUNDANCY
Here, extra information is added to data words in a system and is only used for
error detection or error recovery. This additional information is not necessary
if the system is fault free.
Examples include:
Hamming codes for error correction in data communication systems
Inclusion of a parity bit for checking a faulty bit
Duplication codes for error detection in memory systems
Checksum for error detection of data transferred in packet-switched
networks.
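Two of the examples above can be sketched briefly: a single even-parity bit, which detects one flipped bit, and a Hamming(7,4) code, which locates and corrects one flipped bit. The function names and the list-of-bits representation are assumptions for illustration; practical systems implement these codes in hardware or with optimised bit operations.

```python
def add_even_parity(data_bits):
    """Append one parity bit so that the total number of 1s is even."""
    return data_bits + [sum(data_bits) % 2]


def parity_ok(word):
    """True if the word (data bits plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0


def hamming74_encode(d):
    """Encode 4 data bits as 7 bits; parity bits sit at positions 1, 2 and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # checks positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # checks positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # checks positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Correct at most one flipped bit.

    Returns (data_bits, error_position), where error_position is 1-based
    and 0 means no error was found.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4     # the syndrome gives the faulty position
    if position:
        c[position - 1] ^= 1            # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position
```

The three extra bits of the Hamming code, like the single parity bit, carry no information in a fault-free system; they exist only so that errors can be detected or corrected.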
TIME REDUNDANCY