NETWORKS
Sy-Yen Kuo
郭斯彥
NTUEE 1 KUO
Software Fault-Tolerance
Fault tolerance in the software domain is not as
well understood as fault tolerance in the
hardware domain
‒ Controversial opinions exist on whether reliability
can be used to evaluate software.
‒ Software failures are mostly due to the activation
of design faults by specific input sequences.
‒ This makes the reliability of a software module
dependent on the environment that generates
input to the module over time.
» Ariane 5 rocket accident
Software fault-tolerance
Many current techniques for software fault
tolerance attempt to leverage the
experience of hardware redundancy
schemes
‒ software N-version programming closely
resembles hardware N-modular redundancy
‒ recovery blocks use the concept of retrying
the same operation in the expectation that the
problem is resolved on the second try.
Problems
Traditional hardware fault tolerance
techniques were developed to fight
‒ primarily, permanent component faults
‒ secondarily, transient faults caused by environmental
factors
They do not offer sufficient protection
against design and specification faults,
which are dominant in software.
Design diversity
Problems
A software system usually has a very
large number of states
‒ a collision avoidance system required on most
commercial aircraft in the U.S. has 10^40
states
Software states do not exhibit adequate
regularity to allow grouping them into
equivalence classes.
‒ Such regularity is common for digital hardware
Problems
The large number of states implies that
only a very small part of software system
can be verified for correctness.
‒ Traditional testing and debugging methods are
not feasible for large systems.
‒ Formal methods promise higher coverage;
however, they are very complex
» a specification using formal logic may be of the
same size or even larger than the code.
Difficulties
Improvements in software development methodologies
reduce the incidence of faults, yielding fault avoidance
Need for test and verification
Formal verification techniques, such as proof of
correctness, can be applied to rather small programs
Potential for faulty translation of user requirements
Conventional testing is hit-or-miss.
‒ “Program testing can show the presence of bugs but never
show their absence,” - Dijkstra, 1972.
There is a lack of good fault models
Approaches to Software Fault Tolerance
ROBUSTNESS: The extent to which software continues to
operate despite introduction of invalid inputs.
Example: 1. Check input data
=>ask for new input
=>use default value and raise flag
2. Self checking software
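The "check input data" example can be sketched as follows. This is only an illustration under assumed names: `checked_speed`, the range limits, and `DEFAULT_SPEED` are hypothetical and do not come from the slides.

```python
from typing import Tuple

DEFAULT_SPEED = 0.0  # hypothetical safe default, chosen only for this sketch


def checked_speed(raw: str, low: float = 0.0, high: float = 300.0) -> Tuple[float, bool]:
    """Validate one input value.

    Returns (value, flag). The flag is True when the input was invalid
    and the default value was substituted ("use default value and raise flag").
    """
    try:
        value = float(raw)
    except ValueError:
        # Not a number at all: fall back to the default and raise the flag.
        return DEFAULT_SPEED, True
    # Range check. NaN fails this test too, because comparisons with NaN are False.
    if not (low <= value <= high):
        return DEFAULT_SPEED, True
    return value, False
```

A caller that prefers the other strategy on the slide ("ask for new input") could loop on the flag instead of accepting the default.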
FAULT CONTAINMENT: Faults in one module should not
affect other modules.
Reasonable checks
Example:
Watchdog timers
Overflow/divide-by-zero detection
Assertion checking
FAULT TOLERANCE: Provides uninterrupted operation in
presence of program faults through multiple
implementations of a given function
Approaches to Software FT
N Version Programming
Recovery Blocks
Process Pairs
Robust Data Structures
…
Recovery Blocks/N-Version Programming
Charles Babbage
“When the formula to be computed is very complicated, it
may be algebraically arranged for computation in two or
more totally distinct ways, and two or more sets of cards
may be made. If the same constants are now employed
with each set, and if under these circumstances the
results agree, we may be quite secure of the accuracy
of them all.”
Multi-Version Techniques
Multi-version techniques use two or more
versions of the same software module, which
satisfy design diversity requirements.
– different teams, different coding languages or
different algorithms can be used to maximize the
probability that all the versions do not have common
faults
N-version Programming
Resembles N-modular hardware
redundancy
N different software implementations of a
module are executed concurrently.
The selection algorithm (voter) decides
which of the answers is correct
‒ a voter is application independent
‒ this is an advantage over recovery block fault
detection mechanism, requiring application
dependent acceptance tests
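A minimal sketch of the scheme described above, assuming three hypothetical versions of a squaring routine (`version_a`, `version_b`, `version_c` are invented for illustration). For simplicity the versions run one after another here rather than concurrently; the voter only compares outputs and knows nothing about the application.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def version_a(x: int) -> int:
    return x * x


def version_b(x: int) -> int:
    return x ** 2


def version_c(x: int) -> int:
    # Deliberately faulty for negative inputs (design fault used for the demo).
    return x * abs(x)


def n_version_execute(versions: Sequence[Callable[[int], int]], x: int) -> Optional[int]:
    """Run every version on the same input and return the strict-majority output.

    Returns None when no output is produced by more than half of the versions.
    """
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            # A crashed version simply casts no vote.
            continue
    if not outputs:
        return None
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes > len(versions) // 2 else None
```

With all three versions, the input -3 still yields 9 because the faulty version is outvoted; with only `version_a` and `version_c` there is no majority and the sketch returns None.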
N-version Programming
Voters
There are many different types of voters:
‒ formalized majority voter
» selects majority
‒ generalized median voter
» selects the median of the values
‒ formalized plurality voter
» partitions the set of outputs based on metric
equality and selects the output from the largest
group
‒ weighted averaging
» combines the outputs in a weighted average
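The four voter types listed above could look roughly like this for numeric outputs. This is a simplified sketch: the tolerance `eps`, the helper `_partition`, and the choice to return the smallest member of the winning group are assumptions made for illustration, not definitions from the slides.

```python
import statistics
from typing import List, Optional, Sequence


def _partition(outputs: Sequence[float], eps: float) -> List[List[float]]:
    """Group sorted outputs; a value joins a group if it is within eps of the group's first member."""
    groups: List[List[float]] = []
    for value in sorted(outputs):
        if groups and abs(value - groups[-1][0]) <= eps:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


def majority_voter(outputs: Sequence[float], eps: float = 1e-6) -> Optional[float]:
    """Return a value from the group holding more than half of the outputs, else None."""
    largest = max(_partition(outputs, eps), key=len)
    return largest[0] if len(largest) > len(outputs) / 2 else None


def plurality_voter(outputs: Sequence[float], eps: float = 1e-6) -> float:
    """Return a value from the largest group, even if it is not a majority."""
    return max(_partition(outputs, eps), key=len)[0]


def median_voter(outputs: Sequence[float]) -> float:
    """Return the median of the outputs."""
    return statistics.median(outputs)


def weighted_average_voter(outputs: Sequence[float], weights: Sequence[float]) -> float:
    """Combine the outputs in a weighted average."""
    return sum(w * o for w, o in zip(weights, outputs)) / sum(weights)
```

Note the difference between the first two: on the outputs 1, 1, 2, 3, 4 the plurality voter still answers 1, while the majority voter reports that no majority exists.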
Voting
The selection algorithms are normally
developed taking into account the
consequences of error
‒ For applications where reliability is important, the
selection algorithm should be designed so that the
selected result is correct with a very high
probability
‒ If availability is an issue, the selection algorithm is
expected to produce an output even if it is incorrect
‒ For applications where safety is the main concern,
the selection algorithm is required to correctly
distinguish the erroneous version and mask its
results
N Self-Checking Programming
N self-checking programming combines
recovery block concept with N version
programming
The checking is performed either by using
acceptance tests, or by using comparison.
Examples of applications of N self-
checking programming:
‒ Lucent 5ESS phone switch
‒ Airbus A-340 airplane
N self-checking programming
using acceptance tests
N self-checking programming
using comparison
Comparison
N self-checking programming using
acceptance tests
‒ The use of separate acceptance test for each
version is the main difference of this technique
from recovery blocks
N self-checking programming using
comparison
‒ resembles triplex-duplex hardware redundancy
‒ An advantage over N self-checking
programming using acceptance tests is that the
application independent decision algorithm is
used for fault detection
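N self-checking programming using comparison can be sketched as below. The variants (`double`, `add_twice`, `faulty_double`) are hypothetical, and the components are tried in order here, whereas a real NSCP system would typically execute them in parallel.

```python
from typing import Callable, Sequence, Tuple

Variant = Callable[[int], int]
Component = Callable[[int], Tuple[bool, int]]


def double(x: int) -> int:
    return 2 * x


def add_twice(x: int) -> int:
    return x + x


def faulty_double(x: int) -> int:
    # Deliberate design fault: wrong for negative inputs.
    return 2 * x if x >= 0 else 0


def self_checking_pair(variant_1: Variant, variant_2: Variant) -> Component:
    """Build a self-checking component from two variants and a comparison."""
    def component(x: int) -> Tuple[bool, int]:
        a, b = variant_1(x), variant_2(x)
        # The comparison is application independent: it only checks agreement.
        return (a == b, a)
    return component


def nscp(components: Sequence[Component], x: int) -> int:
    """Deliver the result of the first component whose internal check passes."""
    for component in components:
        ok, value = component(x)
        if ok:
            return value
    raise RuntimeError("all self-checking components failed")


primary = self_checking_pair(double, faulty_double)
backup = self_checking_pair(double, add_twice)
```

For the input -4 the primary pair disagrees with itself, so the result -8 is taken from the backup pair.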
Design Diversity
The most critical issue in multi-version
software fault tolerance techniques is
assuring independence between the
different versions of software through
design diversity
Software systems are vulnerable to
common design faults if they are
developed by the same design team, by
applying the same design rules and using
the same software tools
Design Diversity
Decisions to be made when developing a
multi-version software system include
‒ which modules are to be made redundant
» usually less reliable modules are chosen
‒ the level of redundancy
» procedure, process, whole system
‒ the required number of redundant versions
‒ the required diversity
» diverse specification, algorithm, code, programming
language, testing technique
‒ rules of isolation between the development teams
N-Version Programming
N independent teams of programmers develop
software to same specifications - N versions
are run in parallel - output voted on
If programs are developed independently -
very unlikely that they will fail on same inputs
Assumption - failures are statistically
independent; probability of failure of an
individual version = q
Probability of no more than m failures out of N
versions
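Under the independence assumption stated above, the number of failed versions follows a binomial distribution with parameters N and q, so this probability can be written as the following sum (a standard binomial expression, using the slide's symbols N, m, and q):

```latex
P(\text{no more than } m \text{ failures out of } N)
  = \sum_{i=0}^{m} \binom{N}{i} \, q^{i} \, (1-q)^{N-i}
```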
Independent vs. Correlated
Versions
Correlated failures between versions can increase
overall failure probability by orders of magnitude
‒ Example: N=3, can tolerate up to one failed version for any input;
q = 0.0001 - an incorrect output once every ten thousand runs
‒ If versions are stochastically independent - failure probability of the
3-version system is q^3 + 3q^2(1-q) ≈ 3×10^-8
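A quick numerical check of this example, assuming independent failures so that the number of failed versions is binomial. The function name `system_failure_probability` is made up for this sketch; the system is taken to fail whenever more versions fail than it can tolerate.

```python
from math import comb


def system_failure_probability(n: int, tolerated: int, q: float) -> float:
    """Probability that more than `tolerated` of n independent versions fail.

    Each version fails independently with probability q (the slide's assumption).
    """
    return sum(
        comb(n, i) * q**i * (1 - q) ** (n - i)
        for i in range(tolerated + 1, n + 1)
    )


# The slide's example: N = 3, one failed version tolerated, q = 0.0001.
p_system = system_failure_probability(3, 1, 1e-4)
print(f"3-version system: {p_system:.4e}  single version: {1e-4:.4e}")
```

Under independence the 3-version system comes out at roughly 3×10^-8, several thousand times better than a single version; correlated failures erode exactly this margin.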
Causes of Version Correlation
Common specifications - errors in specifications will
propagate to software
Intrinsic difficulty of problem - algorithms may be more
difficult to implement for some inputs, causing faults
triggered by same inputs
Common algorithms - algorithm itself may contain
instabilities in certain regions of input space - different
versions have instabilities in same region
Cultural factors - Programmers make similar mistakes in
interpreting ambiguous specifications
Common software and hardware platforms - if same
hardware, operating system, and compiler are used - their
faults can trigger a correlated failure
Achieving Version Independence -
Incidental Diversity
Forcing developers of different modules to work
independently of one another
Teams working on different modules are forbidden to
directly communicate
Questions regarding ambiguities in specifications or any
other issue have to be addressed to some central
authority who makes any necessary corrections and
updates all teams
Inspection of software carefully coordinated so that
inspectors of one version do not leak information about
another version
Achieving Version Independence -
Methods for Forced Diversity
Diverse specifications
Diverse hardware and operating systems
Diverse development tools and compilers
Diverse programming languages
Versions with differing capabilities
Diverse Specifications
♦ Most software failures are due to requirements specification
♦ Diversity can begin at specification stage - specifications may be
expressed in different formalisms
♦ Specification errors will not coincide across versions - each
specification will trigger a different implementation fault profile
Diverse Hardware and Operating Systems
Output depends on interaction between application
software and its platform – OS and processor
Both processors and operating systems are notorious for
the bugs they contain
A good idea to complement software design diversity
with hardware and OS diversity - running each version
on a different processor type and OS
Diverse Development Tools and Compilers
♦ May make possible "notational diversity" reducing extent
of positive correlation between failures
♦ Diverse tools and compilers (which may themselves be faulty) for different
versions may allow for greater reliability
Diverse Programming Languages
Programming language affects software quality
Examples:
‒ Assembler - more error-prone than a higher-level language
‒ Nature of errors different - in C programs - easy to overflow
allocated memory - impossible in a language that strictly manages
memory
‒ No faulty use of pointers in Fortran - has no pointers
‒ Lisp is a more natural language for some artificial intelligence (AI)
algorithms than are C or Fortran
Diverse programming languages may have diverse
libraries and compilers - will have uncorrelated (or even
better, negatively-correlated) failures
Choice of Programming Language
Should all versions use best language for problem or some versions
be in other less suited languages?
‒ If same language - lower individual fault rate but positively correlated
failures
‒ If different languages - individual fault rates may be greater, but the
overall failure rate of N-version system may be smaller if less correlated
failures
‒ Tradeoff difficult to resolve - no analytical model exists - extensive
experimental work is necessary
Versions With Differing Capabilities
♦ Example: One rudimentary version providing less accurate but still
acceptable output
♦ 2nd simpler, less fault-prone and more robust
♦ If the two do not agree - a 3rd version can help determine which is
correct
♦ If 3rd very simple, formal methods may be used to prove correctness
Single Version vs. N Versions
Assumption: developing N versions - N times as
expensive as developing a single version
Some parts of development process may be common,
e.g. - if all versions use same specifications, only one set
needs to be developed
Management of an N-version project imposes additional
overheads
Costs can be reduced - identify most critical portions of
code and only develop versions for these
Given a total time and money budget - two choices:
‒ (a) develop a single version using the entire budget
‒ (b) develop N versions
No good model exists to choose between the two
An Assumption of Independence in N-Version
Programming ?
Do the N versions of a program fail independently (similar to
hardware)?
Are faults unrelated?
Does Prob(failure of N-version system) = [Prob(failure of
one version)]^N ?
‒ If so, then the system reliability can be very high
Why such an assumption may be false?
‒ People make same mistakes, e.g. incorrect treatment of
boundary conditions
‒ Some parts of a problem more difficult than others
• statistics show similarity in programmer’s view of
“difficult” regions
Application of
N-Version Programming
Example of N-version programming:
Boeing 777
Boeing flight computer architecture consists of three Primary Flight
Computers (PFC) – left, center, and right – each of same design
and manufacturer.
Each PFC consists of three diverse computing lanes (command,
standby, and monitor) to form a triple redundant computational
entity.
Each lane is implemented using different microprocessors – AMD
29050, Motorola 68040, and Intel 80486
– achieve tolerance against hardware design faults.
Each PFC executes software written in Ada and compiled using
three different Ada compilers.
Each PFC lane uses dedicated ARINC 629 terminal to connect to
data bus.
Example of N-version programming:
Boeing 777 (cont.)
All three PFCs receive the same inputs and all are active.
Only one lane in a channel operates in the command role
– generates the surface command to its data bus.
A command lane in each PFC receives the proposed surface
commands from the other PFC channels and uses majority voting
to determine the correct surface command (signal).
Selected surface commands from all three PFCs are available to
the actuator control electronics and only one is used (based on
predetermined priority schedule) to drive the surface.
The remaining lanes (monitor and standby) are used for cross-
lane monitoring to detect errors, identify a faulty lane, and
reconfigure the controller.
Boeing Computer Architecture
[Figure: PFC (Primary Flight Computer) channels; Version 1, Version 2, and Version 3 feed a voter.]
Example of N-self-checking programming:
Airbus A330, A340
The Airbus architecture employs multiple self-checking flight
computers.
‒ Two computers – primary and backup.
Each computer has two channels: control (e.g., to govern a control
surface) and monitoring (to ensure correct operation of the control
channel)
The two computers are:
‒ designed and fabricated by different manufacturers to eliminate
common manufacturing faults.
‒ one computer is based on 68010-microprocessor
‒ the other one on 80186-microprocessor.
This architecture results in four independently developed software
packages designed and implemented (using different programming
languages) to common specifications.
N-self-checking Programming
N-self-checking programming (NSCP), is a variation of
N-version programming.
A self-checking component results from:
‒ the association with each variant of an acceptance
test on output results of the variant (parallel recovery
block)
‒ the association of two variants together with a
comparison algorithm.
Airbus Computer Architecture
[Figure: Airbus flight computer. A control channel and a monitoring channel, each with a processor (68010 or 80186), memory (RAM & ROM), power supply (28 VDC), input/output, and watchdog; relays gate the outputs to the actuators. The primary pair (Version 1, Version 2) and the backup pair (Version 1', Version 2') are each compared, and a switch selects between primary and backup.]
Recovery blocks
Combines the checkpoint-and-restart
approach with the standby sparing
redundancy scheme
n different implementations of the same
program
‒ Only one of the versions is active
‒ If an error is detected by the acceptance test,
a retry signal is sent to the switch
‒ The system is rolled back to the state stored in
the checkpoint memory and the execution is
switched to another module
Recovery Block
Approach
N versions, one running - if it
fails, execution is switched to
a backup
‒ Example - primary + 3 secondary
versions
‒ Primary executed – output passed
to acceptance test
‒ If output is not accepted - system
state is rolled back and secondary
1 starts, and so on
‒ If all fail - computation fails
Success of recovery block approach depends on failure
independence of different versions and quality of
acceptance test
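The recovery block flow above (checkpoint, primary, acceptance test, rollback, next alternate) can be sketched as follows. All names here are hypothetical: the state is modelled as a plain dictionary, and the two sorting alternates and the acceptance test exist only to exercise the mechanism.

```python
import copy
from typing import Callable, Dict, List, Sequence

State = Dict[str, List[int]]


class RecoveryBlockFailure(Exception):
    """Raised when every alternate has been tried and rejected."""


def recovery_block(
    state: State,
    alternates: Sequence[Callable[[State], List[int]]],
    acceptance_test: Callable[[State, List[int]], bool],
) -> List[int]:
    checkpoint = copy.deepcopy(state)  # state saved on entry to the block
    for alternate in alternates:
        try:
            result = alternate(state)
            if acceptance_test(checkpoint, result):
                return result
        except Exception:
            pass  # an exception counts as a failed acceptance test
        # Roll back to the checkpoint before switching to the next alternate.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockFailure("all alternates failed the acceptance test")


def fast_but_faulty_sort(state: State) -> List[int]:
    state["data"].pop()  # deliberate fault: corrupts the shared state
    return sorted(state["data"])


def simple_sort(state: State) -> List[int]:
    return sorted(state["data"])


def sorted_and_complete(entry_state: State, result: List[int]) -> bool:
    """Acceptance test: same length as the input and in non-decreasing order."""
    same_length = len(result) == len(entry_state["data"])
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return same_length and ordered
```

The acceptance test is application dependent, which is the cost of this approach compared with a voter; its quality bounds what the recovery block can catch.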
Recovery blocks
Recovery blocks
Similarly to cold and hot standby sparing,
different versions can be executed either
serially, or concurrently
‒ Serial execution may require the use of checkpoints
to reload the state before the next version is
executed
‒ The cost in time of trying multiple versions serially
may be too expensive, especially for a real-time
system.
‒ A concurrent system requires n redundant hardware
modules, a communications network to connect
them and the use of input and state consistency
algorithms.
Restoration of System State
Restoring system state is automatic
Taking a copy of entire system state on entry
to each recovery block is too costly
Use Recovery Caches or “Recursive”
Caches
When a process is to be backed up, it is to a
state just before entry to primary alternate
Only NONLOCAL variables that have been
MODIFIED have to be reset
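The idea of saving only the nonlocal variables that were actually modified can be illustrated with a small sketch. The class `RecoveryCache` and its methods are assumptions for illustration; it models nonlocal variables as entries of a dictionary and only tracks assignments made through `write`, not in-place mutation of a stored object.

```python
from typing import Any, Dict, Set

_MISSING = object()  # marks a variable that did not exist on entry


class RecoveryCache:
    """Remember the entry value of each variable the first time it is written."""

    def __init__(self, variables: Dict[str, Any]) -> None:
        self._vars = variables
        self._saved: Dict[str, Any] = {}

    def write(self, name: str, value: Any) -> None:
        # Save the original value only once, on the first modification.
        if name not in self._saved:
            self._saved[name] = self._vars.get(name, _MISSING)
        self._vars[name] = value

    def saved_names(self) -> Set[str]:
        return set(self._saved)

    def rollback(self) -> None:
        """Reset only the modified variables to their values on entry."""
        for name, old in self._saved.items():
            if old is _MISSING:
                self._vars.pop(name, None)
            else:
                self._vars[name] = old
        self._saved.clear()
```

Compared with copying the entire state on entry, the cost here is proportional to the number of variables an alternate actually touches.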
Recovery blocks
If all n versions are tried and fail, the
module invokes the exception handler to
communicate to the rest of the system a
failure to complete its function
Recovery blocks technique heavily
depends on design diversity
Summary
RB is equivalent to stand-by sparing (passive
dynamic redundancy) in HW fault-tolerant
architectures
NVP is equivalent to N-modular redundancy (static
redundancy) in HW fault-tolerant architectures
NSCP is equivalent to active dynamic redundancy
‒ A self-checking component results either from:
• The association of an acceptance test to a version
• The association of two variants with a comparison
algorithm
‒ Fault-tolerance is provided by the parallel execution of N ≥ 2
self-checking components