
DEPENDABLE COMPUTER SYSTEMS AND NETWORKS

Module 5 – Software Fault Tolerance

Sy-Yen Kuo
郭斯彥

NTUEE 1 KUO
Software Fault-Tolerance
 Fault tolerance in the software domain is not as
well understood as fault tolerance in the
hardware domain
‒ Controversial opinions exist on whether reliability
can be used to evaluate software.
‒ Software failures are mostly due to the activation
of design faults by specific input sequences.
‒ This makes the reliability of a software module
dependent on the environment that generates
input to the module over time.
» Ariane 5 rocket accident

Software fault-tolerance
 Many current techniques for software fault
tolerance attempt to leverage the
experience of hardware redundancy
schemes
‒ software N-version programming closely
resembles hardware N-modular redundancy
‒ recovery blocks use the concept of retrying
the same operation in expectation that the
problem is resolved after the second try.

Problems
 Traditional hardware fault tolerance
techniques were developed to fight
‒ permanent component faults primarily
‒ transient faults caused by environmental factors
secondarily.
 They do not offer sufficient protection
against design and specification faults,
which are dominant in software.

Design diversity

 By simply triplicating a software module
and voting on its outputs we cannot
tolerate a fault in the module because all
copies have identical faults
 A design diversity technique has to be
applied.
‒ requires creation of diverse and equivalent
specifications so that programmers can design
software which does not share common faults
‒ this is widely accepted to be a difficult task

Problems
 A software system usually has a very
large number of states
‒ a collision avoidance system required on most
commercial aircraft in the U.S. has 10^40
states
 Software states do not exhibit adequate
regularity to allow grouping them into
equivalence classes.
‒ Such regularity is common for digital hardware

Problems
 The large number of states implies that
only a very small part of software system
can be verified for correctness.
‒ Traditional testing and debugging methods are
not feasible for large systems.
‒ Formal methods promise higher coverage;
however, they are very complex
» a specification using formal logic may be of the
same size or even larger than the code.

 Due to incomplete verification, many
design faults are not diagnosed and are
not removed from the software

Causes of Software Errors
 Designing and writing software is very difficult -
essential and accidental causes of software errors
 Essential difficulties
‒ Understanding a complex application and operating environment
‒ Constructing a structure comprising an extremely large number of
states, with very complex state-transition rules
‒ Software is subject to frequent modifications - new features are added
to adapt to changing application needs
‒ Hardware and operating system platforms can change with time - the
software has to adjust appropriately
‒ Software is often used to paper over incompatibilities between
interacting system components
 Accidental difficulties - Human mistakes
 Cost considerations - use of Commercial Off-the-Shelf
(COTS) software - not designed for high-reliability
applications

Why Software Fault Tolerance?
 Can increase software reliability via fault avoidance using
software engineering and testing methodologies
 Large and complex systems → fault avoidance not
successful
 Redundancy in software may be needed to detect,
isolate, and recover from software failures
 Software is difficult to prove correct

    HARDWARE FAULTS                SOFTWARE FAULTS
1.  Faults time-dependent          Faults time-invariant
2.  Duplicate hardware detects     Duplicate software not effective
3.  Mainly due to random cause     Complexity is the main cause

Difficulties
 Improvements in software development methodologies
reduce the incidence of faults, yielding fault avoidance
 Need for test and verification
 Formal verification techniques, such as proof of
correctness, can be applied to rather small programs
 Potential of faulty translation of user requirements
 Conventional testing is hit-or-miss.
‒ “Program testing can show the presence of bugs but never
show their absence,” - Dijkstra, 1972.
 There is a lack of good fault models

Approaches to Software Fault Tolerance
 ROBUSTNESS: The extent to which software continues to
operate despite introduction of invalid inputs.
Example: 1. Check input data
            => ask for new input
            => use default value and raise flag
         2. Self-checking software
 FAULT CONTAINMENT: Faults in one module should not
affect other modules.
Example: Reasonable checks
         Watchdog timers
         Overflow/divide-by-zero detection
         Assertion checking
 FAULT TOLERANCE: Provides uninterrupted operation in
presence of program faults through multiple
implementations of a given function
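The robustness and fault-containment examples above can be sketched in a few lines. This is only an illustration: the names `check_input`, `CheckedInput`, and `safe_ratio` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class CheckedInput:
    value: float
    used_default: bool  # flag raised when the default replaced an invalid input


def check_input(raw, low, high, default):
    """Robustness: validate the input, fall back to a default and raise a flag."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return CheckedInput(default, True)
    if not (low <= value <= high):  # range check; also rejects NaN
        return CheckedInput(default, True)
    return CheckedInput(value, False)


def safe_ratio(numerator, denominator):
    """Fault containment: a divide-by-zero is caught inside this module."""
    try:
        return numerator / denominator, True
    except ZeroDivisionError:
        return 0.0, False  # the second element tells the caller the result is invalid
```

A caller would inspect `used_default` (or the boolean returned by `safe_ratio`) and decide whether to ask for new input or continue with the default.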

Approaches to Software FT
N Version Programming
Recovery Blocks
Process Pairs
Robust Data Structures
…

Recovery Blocks/N-Version Programming

 An observation from 1830:

Charles Babbage
“When the formula to be computed is very complicated, it
may be algebraically arranged for computation in two or
more totally distinct ways, and two or more sets of cards
may be made. If the same constants are now employed
with each set, and if under these circumstances the
results agree, we may be quite secure of the accuracy
of them all.”

Multi-Version Techniques
 Multi-version techniques use two or more
versions of the same software module, which
satisfy design diversity requirements.
‒ different teams, different coding languages or
different algorithms can be used to maximize the
probability that the versions do not share common
faults

N-version Programming
 Resembles N-modular hardware
redundancy
 N different software implementations of a
module are executed concurrently.
 The selection algorithm (voter) decides
which of the answers is correct
‒ a voter is application independent
‒ this is an advantage over the recovery block fault
detection mechanism, which requires application-
dependent acceptance tests
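A minimal sketch of concurrent execution followed by a vote, assuming the versions are plain Python callables with hashable results. The three `square_*` functions are made-up versions, one of them with a deliberate design fault; `run_n_versions` is a hypothetical name.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_n_versions(versions, x):
    """Run all versions on the same input concurrently, then take a majority vote."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        futures = [pool.submit(version, x) for version in versions]
    results = []
    for future in futures:
        try:
            results.append(future.result())
        except Exception:
            pass  # a version that crashes simply casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if 2 * count > len(versions):  # strict majority of all versions
            return value
    raise RuntimeError("no majority among the versions")


def square_a(x):
    return x * x


def square_b(x):
    return x ** 2


def square_c(x):
    # Deliberate design fault for inputs of 10 or more.
    return x * x if x < 10 else x * x + 1
```

Note that the voter here never looks at what the versions compute, which is the application independence mentioned above.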

[Figure: N-version programming]

Voters
 There are many different types of voters:
‒ formalized majority voter
» selects majority
‒ generalized median voter
» selects the median of the values
‒ formalized plurality voter
» partitions the set of outputs based on metric
equality and selects the output from the largest
group
‒ weighted averaging
» combines the outputs in a weighted average
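The four voter types listed above might look as follows for numeric outputs. This is a simplified sketch: the function names, the tolerance `eps`, and the grouping rule in the plurality voter (distance to the first member of a group) are assumptions made for illustration, not the formal definitions.

```python
from collections import Counter
from statistics import median


def majority_voter(outputs):
    """Return the value produced by more than half of the versions, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if 2 * count > len(outputs) else None


def median_voter(outputs):
    """Return the median of the version outputs."""
    return median(outputs)


def plurality_voter(outputs, eps):
    """Group outputs lying within eps of a group's first member; pick the largest group."""
    groups = []
    for out in outputs:
        for group in groups:
            if abs(out - group[0]) <= eps:
                group.append(out)
                break
        else:
            groups.append([out])
    return max(groups, key=len)[0]  # representative of the largest group


def weighted_average_voter(outputs, weights):
    """Combine all outputs in a weighted average."""
    return sum(w * o for w, o in zip(weights, outputs)) / sum(weights)
```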

Voting
 The selection algorithms are normally
developed taking into account the
consequences of error
‒ For applications where reliability is important, the
selection algorithm should be designed so that the
selected result is correct with a very high
probability
‒ If availability is an issue, the selection algorithm is
expected to produce an output even if it is incorrect
‒ For applications where safety is the main concern,
the selection algorithm is required to correctly
distinguish the erroneous version and mask its
results

N Self-Checking Programming
 N self-checking programming combines
the recovery block concept with N-version
programming
 The checking is performed either by using
acceptance tests, or by using comparison.
 Examples of applications of N self-checking
programming:
‒ Lucent 5ESS phone switch
‒ Airbus A340 airplane

[Figure: N self-checking programming using acceptance tests]

[Figure: N self-checking programming using comparison]

Comparison
 N self-checking programming using
acceptance tests
‒ The use of a separate acceptance test for each
version is the main difference between this technique
and recovery blocks
 N self-checking programming using
comparison
‒ resembles triplex-duplex hardware redundancy
‒ An advantage over N self-checking
programming using acceptance tests is that an
application-independent decision algorithm is
used for fault detection

Design Diversity
 The most critical issue in multi-version
software fault tolerance techniques is
assuring independence between the
different versions of software through
design diversity
 Software systems are vulnerable to
common design faults if they are
developed by the same design team, by
applying the same design rules and using
the same software tools

Design Diversity
 Decisions to be made when developing a
multi-version software system include
‒ which modules are to be made redundant
» usually less reliable modules are chosen
‒ the level of redundancy
» procedure, process, whole system
‒ the required number of redundant versions
‒ the required diversity
» diverse specification, algorithm, code, programming
language, testing technique
‒ rules of isolation between the development teams

N-Version Programming
 N independent teams of programmers develop
software to the same specifications - N versions
are run in parallel - output voted on
 If programs are developed independently -
very unlikely that they will fail on same inputs
 Assumption - failures are statistically
independent; probability of failure of an
individual version = q
 Probability of no more than m failures out of N
versions
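
Under the independence assumption stated above, the number of failing versions is binomially distributed, so this probability can be written as the sum below. The symbol P(N, m, q) is introduced here only for illustration.

```latex
P(N, m, q) = \sum_{i=0}^{m} \binom{N}{i} \, q^{i} \, (1 - q)^{N - i}
```

A system that tolerates up to m failed versions then fails with probability 1 - P(N, m, q).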

Independent vs. Correlated Versions
 Correlated failures between versions can increase
overall failure probability by orders of magnitude
‒ Example: N=3, can tolerate up to one failed version for any input;
q = 0.0001 - an incorrect output once every ten thousand runs
‒ If versions are stochastically independent - failure probability of 3-
version system is q^3 + 3q^2(1 − q) ≈ 3 × 10^−8
‒ Suppose versions are statistically dependent and there is one
fault, causing system failure, common to two versions, exercised
once every million runs
‒ Failure probability of 3-version system increases to over 10^−6,
more than 30 times the failure probability of uncorrelated system
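
The arithmetic of this example can be checked with a short script. It assumes the binomial model for independent versions and, as a simplification, adds the common-fault rate directly to the independent failure probability; the function and variable names are hypothetical.

```python
from math import comb


def system_failure_prob(n, m, q):
    """P(more than m of n independent versions fail), each failing with probability q."""
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(m + 1, n + 1))


q = 1e-4                                     # failure probability of one version
independent = system_failure_prob(3, 1, q)   # q^3 + 3q^2(1 - q), roughly 3e-8

common_fault_rate = 1e-6                     # common fault exercised once per million runs
correlated = common_fault_rate + independent # simplifying assumption: rates add

ratio = correlated / independent             # a little over 30
print(f"independent: {independent:.3e}")
print(f"correlated:  {correlated:.3e}")
print(f"ratio:       {ratio:.1f}")
```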

Causes of Version Correlation
 Common specifications - errors in specifications will
propagate to software
 Intrinsic difficulty of problem - algorithms may be more
difficult to implement for some inputs, causing faults
triggered by same inputs
 Common algorithms - algorithm itself may contain
instabilities in certain regions of input space - different
versions have instabilities in same region
 Cultural factors - Programmers make similar mistakes in
interpreting ambiguous specifications
 Common software and hardware platforms - if same
hardware, operating system, and compiler are used - their
faults can trigger a correlated failure

Achieving Version Independence - Incidental Diversity
 Forcing developers of different modules to work
independently of one another
 Teams working on different modules are forbidden to
directly communicate
 Questions regarding ambiguities in specifications or any
other issue have to be addressed to some central
authority who makes any necessary corrections and
updates all teams
 Inspection of software carefully coordinated so that
inspectors of one version do not leak information about
another version

Achieving Version Independence - Methods for Forced Diversity
 Diverse specifications
 Diverse hardware and operating systems
 Diverse development tools and compilers
 Diverse programming languages
 Versions with differing capabilities

Diverse Specifications
♦ Most software failures due to requirements specification
♦ Diversity can begin at specification stage - specifications may be
expressed in different formalisms
♦ Specification errors will not coincide across versions - each
specification will trigger a different implementation fault profile

Diverse Hardware and Operating Systems
 Output depends on interaction between application
software and its platform – OS and processor
 Both processors and operating systems are notorious for
the bugs they contain
 A good idea to complement software design diversity
with hardware and OS diversity - running each version
on a different processor type and OS
Diverse Development Tools and Compilers
♦ May make possible "notational diversity" reducing extent
of positive correlation between failures
♦ Diverse tools and compilers (may be faulty) for different
versions may allow for greater reliability

Diverse Programming Languages
 Programming language affects software quality
 Examples:
‒ Assembler - more error-prone than a higher-level language
‒ Nature of errors different - in C programs - easy to overflow
allocated memory - impossible in a language that strictly manages
memory
‒ No faulty use of pointers in Fortran - has no pointers
‒ Lisp is a more natural language for some artificial intelligence (AI)
algorithms than are C or Fortran
 Diverse programming languages may have diverse
libraries and compilers - will have uncorrelated (or even
better, negatively-correlated) failures

Choice of Programming Language
 Should all versions use best language for problem or some versions
be in other less suited languages?
‒ If same language - lower individual fault rate but positively correlated
failures
‒ If different languages - individual fault rates may be greater, but the
overall failure rate of N-version system may be smaller if less correlated
failures
‒ Tradeoff difficult to resolve - no analytical model exists - extensive
experimental work is necessary
Versions With Differing Capabilities
♦ Example: One rudimentary version providing less accurate but still
acceptable output
♦ 2nd simpler, less fault-prone and more robust
♦ If the two do not agree - a 3rd version can help determine which is
correct
♦ If 3rd very simple, formal methods may be used to prove correctness

Single Version vs. N Versions
 Assumption: developing N versions - N times as
expensive as developing a single version
 Some parts of development process may be common,
e.g. - if all versions use same specifications, only one set
needs to be developed
 Management of an N-version project imposes additional
overheads
 Costs can be reduced - identify most critical portions of
code and only develop versions for these
 Given a total time and money budget - two choices:
‒ (a) develop a single version using the entire budget
‒ (b) develop N versions
 No good model exists to choose between the two

An Assumption of Independence in N-Version Programming?
 Do the N versions of a program fail independently (similar to
hardware)?
 Are faults unrelated?
 Does Prob(failure of N-version system) = [Prob(failure of
one version)]^N ?
‒ If so, then the system reliability can be very high
 Why may such an assumption be false?
‒ People make the same mistakes, e.g. incorrect treatment of
boundary conditions
‒ Some parts of a problem are more difficult than others
• statistics show similarity in programmers' views of
“difficult” regions

Application of N-Version Programming

Example of N-version programming:
Boeing 777
 Boeing flight computer architecture consists of three Primary Flight
Computers (PFC) – left, center, and right – each of same design
and manufacturer.
 Each PFC consists of three drivers computing lanes (command,
standby, and monitor) to form a triple redundant computational
entity.
 Each lane is implemented using a different microprocessor – AMD
29050, Motorola 68040, and Intel 80486
– to achieve tolerance against hardware design faults.
 Each PFC executes software written in Ada and compiled using
three different Ada compilers.
 Each PFC lane uses dedicated ARINC 629 terminal to connect to
data bus.

Example of N-version programming:
Boeing 777 (cont.)
 All three PFCs receive the same inputs and all are active.
 Only one lane in a channel operates in the command role
– generates the surface command to its data bus.
 A command lane in each PFC receives the proposed surface
commands from the other PFC channels and uses majority voting
to determine the correct surface command (signal).
 Selected surface commands from all three PFCs are available to
the actuator control electronics and only one is used (based on
predetermined priority schedule) to drive the surface.
 The remaining lanes (monitor and standby) are used for cross-
lane monitoring to detect errors, identify a faulty lane, and
reconfigure the controller.

Boeing Computer Architecture

[Figure: Boeing computer architecture. Recovered labels: C, R; PFC – Primary Flight Control; Version 1, Version 2, and Version 3 feeding a voter.]

Example of N-self-checking programming:
Airbus A330, A340
 The Airbus architecture employs multiple self-checking flight
computers.
‒ Two computers – primary and backup.
 Each computer has two channels: control (e.g., to govern a control
surface) and monitoring (to ensure correct operation of the control
channel)
 The two computers are:
‒ designed and fabricated by different manufacturers to eliminate
common manufacturing faults.
‒ one computer is based on 68010-microprocessor
‒ the other one on 80186-microprocessor.
 This architecture results in four independently developed software
packages designed and implemented (using different programming
languages) to common specifications.

N-self-checking Programming
 N-self-checking programming (NSCP) is a variation of
N-version programming.
 A self-checking component results from either:
‒ the association with each variant of an acceptance
test on output results of the variant (parallel recovery
block), or
‒ the association of two variants together with a
comparison algorithm.
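
The two ways of building a self-checking component can be sketched as below. This is a simplified illustration: the components are evaluated one after another rather than in parallel, and all names (`acceptance_component`, `comparison_component`, `nscp`, `faulty_sqrt`) are hypothetical.

```python
import math


def acceptance_component(variant, acceptance_test):
    """Self-checking component: one variant plus an acceptance test on its output."""
    def run(x):
        y = variant(x)
        return (y, True) if acceptance_test(x, y) else (None, False)
    return run


def comparison_component(variant_a, variant_b, tol=1e-9):
    """Self-checking component: two variants whose outputs are compared."""
    def run(x):
        a, b = variant_a(x), variant_b(x)
        return (a, True) if abs(a - b) <= tol else (None, False)
    return run


def nscp(components, x):
    """Deliver the result of the first component that passes its own check."""
    for component in components:
        y, ok = component(x)
        if ok:
            return y
    raise RuntimeError("every self-checking component reported an error")


def faulty_sqrt(x):
    return x / 2  # deliberate design fault, used only for the demonstration


primary = comparison_component(math.sqrt, faulty_sqrt)
backup = comparison_component(math.sqrt, lambda x: x ** 0.5)
result = nscp([primary, backup], 16.0)  # primary disagrees internally, backup delivers
```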

Airbus Computer Architecture
[Figure: Airbus computer architecture. Recovered labels: 28 VDC supply; control and monitoring channels, each with a processor (68010 or 80186), memory (RAM and ROM), power supply, input/output, watchdog, and relay; output to actuators. Primary computer: Version 1 and Version 2 with compare; backup computer: Version 1' and Version 2' with compare; a switch between primary and backup.]

Recovery blocks
 Combines the checkpoint and restart
approach with the standby sparing
redundancy scheme
 n different implementations of the same
program
‒ Only one of the versions is active
‒ If an error is detected by the acceptance test,
a retry signal is sent to the switch
‒ The system is rolled back to the state stored in
the checkpoint memory and the execution is
switched to another module

Recovery Block Approach
 N versions, one running - if it fails, execution is switched to
a backup
‒ Example - primary + 3 secondary versions
‒ Primary executed – output passed to acceptance test
‒ If output is not accepted - system state is rolled back and
secondary 1 starts, and so on
‒ If all fail - computation fails
 Success of recovery block approach depends on failure
independence of different versions and quality of
acceptance test
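
A minimal sketch of the recovery block control flow described above, assuming the system state is held in a dictionary and that a full copy is an acceptable checkpoint. The sorting alternates and the acceptance test are made-up examples.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in turn; roll the state back when the acceptance test fails."""
    checkpoint = copy.deepcopy(state)  # checkpoint taken on entry
    for alternate in alternates:
        try:
            result = alternate(state)
            if acceptance_test(state, result):
                return result
        except Exception:
            pass  # an exception counts as a detected error
        state.clear()
        state.update(copy.deepcopy(checkpoint))  # roll back before the next alternate
    raise RuntimeError("all alternates failed the acceptance test")


def buggy_sort(state):
    state["data"].pop()  # deliberate design fault: loses an element and corrupts the state
    return sorted(state["data"])


def simple_sort(state):
    return sorted(state["data"])


def accept(state, result):
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and len(result) == state["n"]


state = {"data": [3, 1, 2], "n": 3}
output = recovery_block(state, [buggy_sort, simple_sort], accept)
```

The primary fails the length check, the state is restored from the checkpoint, and the secondary produces the accepted output.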

Recovery blocks
 Similarly to cold and hot standby sparing,
different versions can be executed either
serially, or concurrently
‒ Serial execution may require the use of checkpoints
to reload the state before the next version is
executed
‒ The cost in time of trying multiple versions serially
may be too expensive, especially for a real-time
system.
‒ A concurrent system requires n redundant hardware
modules, a communications network to connect
them and the use of input and state consistency
algorithms.

Restoration of System State
 Restoring system state is automatic
 Taking a copy of entire system state on entry
to each recovery block is too costly
 Use Recovery Caches or “Recursive”
Caches
 When a process is to be backed up, it is restored to the
state just before entry to the primary alternate
 Only NONLOCAL variables that have been
MODIFIED have to be reset
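
The idea of resetting only the modified nonlocal variables can be sketched with a small cache that records a variable's prior value on its first write. This is an illustration under simplifying assumptions: the class `RecoveryCache` and its methods are hypothetical, variables live in a dictionary, and values changed in place (for example a list that is appended to) are not captured.

```python
class RecoveryCache:
    """Records the prior value of each nonlocal variable the first time it is modified."""

    _MISSING = object()  # marks variables that did not exist on entry

    def __init__(self, store):
        self.store = store  # dictionary holding the nonlocal variables
        self.saved = {}     # name -> value on entry to the recovery block

    def write(self, name, value):
        if name not in self.saved:  # only the first write needs to be recorded
            self.saved[name] = self.store.get(name, self._MISSING)
        self.store[name] = value

    def rollback(self):
        """Reset only the variables that were modified."""
        for name, old in self.saved.items():
            if old is self._MISSING:
                self.store.pop(name, None)
            else:
                self.store[name] = old
        self.saved.clear()

    def commit(self):
        """Discard the saved values once the acceptance test has passed."""
        self.saved.clear()
```

Compared with copying the entire state on entry, the cost here grows with the number of variables written rather than with the size of the state.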

Recovery blocks
 If all n versions are tried and fail, the
module invokes the exception handler to
communicate to the rest of the system a
failure to complete its function
 The recovery blocks technique depends
heavily on design diversity

Summary
 RB is equivalent to the stand-by sparing (of
passive dynamic redundancy) in HW fault-tolerant
architectures
 NVP is equivalent to N-modular redundancy (static
redundancy) in HW fault-tolerant architectures
 NSCP is equivalent to active dynamic redundancy
‒ A self-checking component results either from:
• The association of an acceptance test to a version
• The association of two variants with a comparison
algorithm
‒ Fault-tolerance is provided by the parallel execution of N ≥ 2
self-checking components
