
CHAPTER THREE

Building a Fault Tolerant System


What is a Fault?
• A system fails when it cannot meet its promises (specifications).
• An error is part of a system state that may lead to a failure.
• A fault is the cause of the error.
• Faults can be:
Transient (appear once and then disappear)
Intermittent (appear, disappear, then reappear)
Permanent (appear and persist until repaired)
• Faults can be caused by:
Software and hardware design errors
Operator errors
Physical damage
Transient Fault
• All applications that communicate with remote services and resources
must be sensitive to transient faults.
• Transient faults include the momentary loss of network connectivity
to components and services, the temporary unavailability of a service,
or timeouts that arise when a service is busy.
• These faults are often self-correcting, and if the action is repeated
after a suitable delay it is likely to succeed.
Transient Fault: General guidelines
• The following guidelines will help you to design a suitable transient
fault handling mechanism for your applications:
Determine if there is a built-in retry mechanism
Determine if the operation is suitable for retrying
Determine an appropriate retry count and interval
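• As an illustration of these guidelines, the following is a minimal sketch of a retry wrapper with a fixed retry count and a delay between attempts; the function name, retry count, interval and exception types are illustrative choices, not prescribed values.

```python
import time

def call_with_retries(operation, retries=3, delay_seconds=2,
                      transient_errors=(ConnectionError, TimeoutError)):
    """Retry a zero-argument operation that may fail with a transient error.

    `retries`, `delay_seconds` and `transient_errors` are illustrative
    defaults; a real application would choose them per operation.
    """
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except transient_errors:
            if attempt == retries:
                raise  # give up: the fault may not be transient after all
            time.sleep(delay_seconds)  # wait before the next attempt
```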
Intermittent Faults
• An intermittent network problem is a problem that occurs in your
network for a short time, and then seemingly goes away.
• The intermittent problem may not occur again until some time in the
future, if at all. Even so, it is important to catch intermittent problems because:
Intermittent network problems can still affect users and business operations
They can have a lasting impact on your network
They can cause more damage once they happen again
They may be a sign of a bigger problem within your network
Intermittent…
• Some common intermittent network problems include:
High bandwidth usage
Packet loss
Jitter
Latency
Permanent Faults
• They persist indefinitely (or at least until repair) after their
occurrence.
• They are caused by irreversible device failures within a component
due to damage, fatigue or improper manufacturing.
• Once a permanent fault has occurred, the faulty component can be
restored by replacement or repair.
Dependability Concept
Availability & Reliability
Safety and Maintainability
Why Dependability?
Approaches to Achieving Dependability
Fault avoidance
Fault Removal
Fault Forecasting
Fault Tolerance
• Fault tolerant computing is the art and science of building computing
systems that continue to operate satisfactorily in the presence of
faults.
Fault Tolerance Concept Taxonomy
Error Detection
• Error detection is the detection of errors caused by noise or other
impairments during transmission from the transmitter to the receiver.
• An error is present when the information the receiver obtains does not match
the information the sender transmitted.
• During transmission, digital signals suffer from noise that can introduce
errors in the binary bits travelling from sender to receiver.
• That means a 0 bit may change to 1 or a 1 bit may change to 0.
• There are many schemes of error detection:
Parity bits
Checksum
Cyclic redundancy checks
Cryptographic hash functions
Parity bits
• Blocks of data from the source are passed through a parity-bit generator,
which appends a parity bit of:
1 if the block contains an odd number of 1’s, and
0 if the block contains an even number of 1’s
• This scheme makes the total number of 1’s even, which is why it is
called even parity checking.
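• A minimal sketch of even-parity generation and checking for a block of bits (pure illustration, not tied to any particular hardware) is shown below.

```python
def even_parity_bit(bits):
    """Return the parity bit that makes the total number of 1's even."""
    return sum(bits) % 2  # 1 if the block has an odd number of 1's, else 0

def parity_ok(bits_with_parity):
    """A received block is accepted if its total number of 1's is even."""
    return sum(bits_with_parity) % 2 == 0

block = [1, 0, 1, 1, 0, 1, 0]             # data bits (four 1's: even)
sent = block + [even_parity_bit(block)]   # appended parity bit is 0
assert parity_ok(sent)
```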
Checksum
• In the checksum error-detection scheme, the data is divided into k
segments, each of m bits.
• At the sender’s end, the segments are added using 1’s complement
arithmetic to get the sum. The sum is complemented to get the
checksum.
• The checksum segment is sent along with the data segments.
• At the receiver’s end, all received segments (data plus checksum) are added
using 1’s complement arithmetic, and the sum is complemented.
• If the result is zero, the received data is accepted; otherwise it is
discarded.
Checksum…
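• As a concrete illustration of the scheme above, here is a minimal sketch assuming k segments of m = 8 bits and 1’s complement addition; the segment values are arbitrary examples.

```python
MASK = 0xFF  # m = 8-bit segments

def ones_complement_sum(segments):
    """Add 8-bit segments using 1's complement arithmetic (wrap carries around)."""
    total = 0
    for s in segments:
        total += s
        total = (total & MASK) + (total >> 8)  # fold any carry back in
    return total

def make_checksum(segments):
    return (~ones_complement_sum(segments)) & MASK  # complement of the sum

def verify(segments_with_checksum):
    """Receiver adds everything (data + checksum) and complements; zero means accept."""
    return (~ones_complement_sum(segments_with_checksum)) & MASK == 0

data = [0x25, 0x62, 0x3F, 0x52]
packet = data + [make_checksum(data)]
assert verify(packet)
```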
System Recovery
• We have talked a lot about fault tolerance, but not yet about what
happens after a fault has occurred.
• A process that exhibits a failure has to be able to recover to a correct
state.
• Fundamental to fault tolerance is recovery from an error.
• Error recovery means to replace an erroneous state with an error-free state.
• There are two types of recovery:
Backward Recovery
Forward Recovery
Backward Recovery
• The goal of backward recovery is to bring the system from its present
erroneous state back to a previously correct state.
• For this, the system’s state must be recorded (checkpointed) from time to
time; each time a state is recorded, a checkpoint is said to be made. When
things go wrong, the system is restored to the most recent checkpoint.
• It is the most widely used method, since it is generally applicable and can be
integrated into the middleware layer of a distributed system.
• Example:
o Reliable communication through the retransmission of lost or damaged packets.
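• A minimal sketch of checkpoint-based backward recovery is shown below; the process state is just a dictionary here, whereas a real system would persist checkpoints to stable storage.

```python
import copy

class CheckpointedProcess:
    def __init__(self, state):
        self.state = state
        self.checkpoint = copy.deepcopy(state)  # last known-good state

    def save_checkpoint(self):
        """Record the current state from time to time."""
        self.checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Backward recovery: replace the erroneous state with the checkpointed one."""
        self.state = copy.deepcopy(self.checkpoint)

p = CheckpointedProcess({"balance": 100})
p.save_checkpoint()
p.state["balance"] = -1        # an error corrupts the state
p.rollback()                   # bring the system back to the prior correct state
assert p.state["balance"] == 100
```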
Disadvantages
• Checkpointing and restoring a process to its previous state are costly and can
become performance bottlenecks (a bottleneck being a condition in which data
flow is limited by computer or network resources).
• No guarantee can be given that the error will not recur, which may take an application
into a loop of recovery.
• Some actions may be irreversible; e.g., deleting a file, or handing over cash to a customer.
Forward Recovery
• The goal of forward recovery is to bring the system from its present erroneous
state to a correct new state (not a previous state) from which it can continue to
execute.
• It has to be known in advance which errors may occur, so that those errors can
be corrected.
• Example:
o Reliable communication via erasure correction, where a lost or damaged packet is
reconstructed from other successfully delivered packets, e.g. using an (n,k) block
erasure code.
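• A minimal sketch of forward recovery through erasure correction is shown below. It uses a single XOR parity packet (a very simple (k+1, k) scheme rather than a general (n,k) block code), which lets the receiver rebuild any one lost packet.

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity_packet(packets):
    """XOR of all data packets; sent alongside them as redundancy."""
    return reduce(xor_bytes, packets)

def rebuild_lost(survivors, parity):
    """Forward recovery: reconstruct the single missing packet from the survivors."""
    return reduce(xor_bytes, survivors, parity)

packets = [b"abcd", b"efgh", b"ijkl"]   # equal-length data packets
parity = parity_packet(packets)
survivors = [packets[0], packets[2]]    # packet 1 was lost in transit
assert rebuild_lost(survivors, parity) == packets[1]
```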
Fault Masking
• Fault Masking is a structural redundancy technique that completely masks
faults within a set of redundant modules.
• Redundancy is a key technique for hiding failures.
• Redundancy, however, can have an adverse impact on the performance of
a system.
• For example, it can increase resource consumption.
• Four different types of redundancy are used in fault-tolerant systems:
Hardware redundancy
Information redundancy
Software redundancy
Time redundancy
Hardware Redundancy
• There are basically two approaches in hardware redundancy:
Addition of replicated modules, and
Use of extra circuits for fault detection.
Hardware Redundancy: Module Replication
• To avoid wrong results and actions, it is desirable that a failing module stops
execution when a fault is detected, with a small fault latency (that is, as soon
as possible after the failure). Modules having this property are called fail-fast.
• We also talk about fail-silent modules, which are modules that deliver only correct
results (in both the value and time domains).
• If a module cannot deliver a correct result (because of a failure), it delivers no result at
all.
• Making a module fail-fast can be done by duplication.
• Two identical copies of a module are employed, with a comparator checking the output
of the two copies.
• When the output differs, a fault is detected.
• The fault is detected immediately (the fault latency is small), and it is therefore fail-fast.
• This is a widely used technique, because it is easy to realize, and relatively cheap.
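• A minimal software analogy of the duplicate-and-compare idea is sketched below; the two "modules" are simply two calls to the same function here, whereas in hardware they would be identical physical copies checked by a comparator circuit.

```python
class ComparisonMismatch(Exception):
    """Raised as soon as the two copies disagree (fail-fast behaviour)."""

def duplicated(module_a, module_b, inputs):
    """Run two identical copies on the same input stream and compare their outputs."""
    out_a, out_b = module_a(inputs), module_b(inputs)
    if out_a != out_b:
        raise ComparisonMismatch("outputs differ: fault detected")
    return out_a  # outputs agree, result is passed on

# usage: both copies compute the same function, so normally no fault is signalled
result = duplicated(sum, sum, [1, 2, 3])
assert result == 6
```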
Hardware Redundancy: Module Replication…
• The cost of duplication is twice that of an equivalent simplex system,
and duplicate checking is supported in hardware by several
microprocessors.
• Two examples are the Motorola 88000 and the AMD Am29000, which
have a master/slave mode determined by a “test” pin.
• The outputs of the slave copy are disabled, although it sees the same
input stream as the master.
• It performs the same operations as the master and compares its results with
the master’s outputs.
Hardware Redundancy: Module Replication…
• Another approach is to use three or more modules (N-plex), and apply voting
rather than comparison.
• If we have three, we have enough redundant information to mask the failure of
one of the modules.
• The masking is accomplished by means of a majority vote on the three outputs.
• This is called triple modular redundancy (TMR).
• We say that such a module is fail-vote, since it requires a majority.
• We could also let the voters first sense which modules are available and then use
the majority of the available modules.
• This is called fail-fast voting, and can operate with less than a majority of the
modules, which gives better reliability than just fail-vote.
• To increase the reliability further, we can make these designs recursive; that
is, fail-fast/fail-vote modules connected by a comparator or voter.
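• A minimal sketch of TMR-style majority voting over three module outputs is shown below; it is pure illustration, since a hardware voter would be a combinational circuit rather than software.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a majority of the modules, masking one faulty module."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: too many modules failed")
    return value

# one of the three replicated modules produces a wrong value; the vote masks it
assert majority_vote([42, 42, 17]) == 42
```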
Information Redundancy
• Information redundancy is the addition of extra information to data,
to allow error detection and correction.
• This is typically error-detecting codes, error-correcting codes (ECC),
and self-checking circuits.
Information Redundancy: Error-Detection (and Correction) Codes

• Parity codes are used in most modern computers for memory error
detection.
• This is a simple code that does not require much additional hardware.
• Another, more advanced code is the m-of-n code.
• This is a code that requires code words to be n bits in length and to
contain exactly m ones.
• Cyclic and checksum codes are also common.
• To check arithmetic operations, arithmetic codes can be used.
• Arithmetic codes are preserved by arithmetic operations
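• A minimal sketch of an m-of-n code check is shown below, where a valid code word must be exactly n bits long and contain exactly m ones; the 2-of-5 parameters are a hypothetical example.

```python
def is_valid_m_of_n(word, m=2, n=5):
    """A code word is valid only if it has length n and exactly m ones."""
    return len(word) == n and word.count(1) == m

assert is_valid_m_of_n([1, 0, 1, 0, 0])        # valid: 5 bits, two 1's
assert not is_valid_m_of_n([1, 1, 1, 0, 0])    # detected: three 1's (error)
```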
Information Redundancy: Consistency Checking

• This is a verification that the results produced are reasonable.


• Examples are range checks, e.g. address checking and arithmetic
operation checking.
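• A minimal sketch of a range-style consistency check follows; the address bounds are hypothetical.

```python
def check_address(addr, low=0x1000, high=0x7FFF):
    """Consistency check: a computed address must fall inside the valid region."""
    if not (low <= addr <= high):
        raise ValueError(f"address {addr:#x} outside valid range: result not reasonable")
    return addr

check_address(0x2000)     # passes
# check_address(0x9000) would raise, flagging an unreasonable result
```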
Information Redundancy: Self-Checking Logic
• Failure in a comparator element at the top of the hierarchy can be
disastrous (checking-the-checker problem).
• This single point of failure can be eliminated through self-checking and fail-
safe logic design.
• A circuit is said to be self-checking if it has the ability to automatically
detect the existence of a fault, without the need for any externally applied
stimulus.
• When the circuit is fault-free and presented with a valid input code word, it
should produce a correct output code word.
• If a fault exists, however, the circuit should produce an invalid output code
so that the existence of the fault can be detected.
• Self-checking is needed to make fail-silent modules
Software Redundancy
• In this chapter we focus on hardware, but to fully understand the use of the
hardware techniques, it is important to know about the software techniques used on top
of the hardware.
• In fact, software is the most challenging problem in the area of fault tolerance.
• As mentioned earlier, today’s hardware is relatively reliable compared to the software.
• There are some important differences between software and hardware errors.
• Physical errors (in hardware) will not recur after they have been discovered and
corrected.
• Unfortunately, this is not the case with software errors.
• In the process of correcting a programming error, new errors are likely to be created.
• Software development is also a more complex and immature art than hardware design.
N-Version Programming and Software Fault-Tolerance

• We have two major software fault-tolerance techniques:


N-version programming: Write the program N times, then operate all N programs in
parallel, and take a majority vote for each answer. This is an analogy to the N-plexing
of hardware modules.
Transactions: Write the program as a transaction. Use a consistency check at the
end, and if the conditions are not met, restart. It should work the second time...
• The big disadvantage with N-version programming is its cost. It is
expensive, and repair is not trivial.
• It is also difficult to maintain.
• To get a majority, we need to have at least 3 versions.
• Programmers tend to make the same kinds of mistakes, so there is a certain risk of
the same error appearing in the majority of the programs.
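• A minimal sketch of N-version execution with a majority vote over the answers is shown below; the three versions are trivially different implementations of the same specification, purely for illustration.

```python
from collections import Counter

def version_a(xs): return sum(xs)
def version_b(xs): return sum(sorted(xs))   # independently written variant
def version_c(xs):                          # a third, loop-based variant
    total = 0
    for x in xs:
        total += x
    return total

def n_version_run(versions, inputs):
    """Run all N versions on the same input and accept the majority answer."""
    answers = [v(inputs) for v in versions]
    answer, count = Counter(answers).most_common(1)[0]
    if count <= len(answers) // 2:
        raise RuntimeError("no majority among the versions")
    return answer

assert n_version_run([version_a, version_b, version_c], [3, 1, 2]) == 6
```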
Time Redundancy
• Hardware and information redundancy require extra hardware.
• This can be avoided by performing an operation several times on the same
module and checking the results, instead of performing it in parallel on
several modules and comparing the outputs.
• This reduces the amount of hardware at the expense of using
additional time, and is especially suitable if faults are mostly
transient.
• It could also be used to distinguish between permanent and transient
faults.
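• A minimal sketch of time redundancy follows: the same operation is executed twice on the same module and the results are compared, so a transient fault that corrupts only one execution is detected (a disagreement that persists across repetitions would instead point to a permanent fault).

```python
def run_with_time_redundancy(operation, inputs, repetitions=2):
    """Repeat the operation on the same inputs and compare the results."""
    results = [operation(inputs) for _ in range(repetitions)]
    if len(set(results)) != 1:
        raise RuntimeError("results differ between repetitions: transient fault suspected")
    return results[0]

assert run_with_time_redundancy(sum, (1, 2, 3)) == 6
```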
Reconfiguration
• Reconfiguration is the “process of eliminating a faulty entity from a
system and restoring the system to some operational condition or
state”.
• When reconfiguration is used, the designer must be concerned
with fault detection, fault location, fault containment, and fault
recovery.
Fault Tolerance Implementation
• Fault-tolerant systems use backup components that automatically
take the place of failed components, ensuring no loss of service.
These include:
Hardware systems that are backed up by identical or equivalent systems. For
example, a server can be made fault tolerant by using an identical server
running in parallel, with all operations mirrored to the backup server.
Software systems that are backed up by other software instances. For
example, a database with customer information can be continuously
replicated to another machine. If the primary database goes down,
operations can be automatically redirected to the second database.
Power sources that are made fault tolerant using alternative sources. For
example, many organizations have power generators that can take over in
case main line electricity fails.
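A minimal sketch of the software-level failover idea described above: a query goes to the primary instance, and if it fails the request is automatically redirected to the replicated backup. The instance objects and their `query` method are hypothetical stand-ins for a real database client.

```python
class FailoverClient:
    """Route requests to the primary; fall back to the backup if the primary fails."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def query(self, request):
        try:
            return self.primary.query(request)
        except ConnectionError:
            # primary is down: redirect the operation to the replicated backup
            return self.backup.query(request)
```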
