
Invited paper to FTCS-25, the 25th IEEE International Symposium on Fault-Tolerant Computing,

Pasadena, California, USA, June 27-30, 1995, Special Issue, pp. 42-54.

Dependable Computing: Concepts, Limits, Challenges

Jean-Claude Laprie
LAAS-CNRS
7, Avenue du Colonel Roche
31077 Toulouse, France

Abstract

Our society is faced with an ever increasing dependence on computing systems, which leads to questioning the limits of their dependability, and the challenges raised by those limits. In order to respond to these questions, a global conceptual and terminological framework is needed, which is first given. The limits and challenges in dependability are then addressed, from technical and financial viewpoints. The recognition that design faults are the major limiting factor leads to recommending the extension of fault tolerance from products to their production process.

(The work reported in this paper has been partially supported by the ESPRIT Basic Research Action PDCS-2 (Predictably Dependable Computing Systems, Action no. 6362).)

Introduction

Our society has become increasingly dependent on computing systems, and this dependency is especially felt upon the occurrence of failures. Recent examples of nation-wide computer-caused or -related failures are the 15 January 1990 telephone outage in the USA, or the 26-27 June 1993 credit card denial of authorization in France. The consequences of such events relate primarily to economics; however, some outages can endanger human lives as second-order effects, or even directly, as in the London Ambulance Service failure of 26-27 November 1992. As a consequence of such events, which can only be termed disasters, the consciousness of our vulnerability to computer failures is developing, as witnessed by the following quotation from the report Computing the Future: A Broader Agenda for Computer Science and Engineering [COM 92]: "Finally, computing has resulted in costs to society as well as benefits. Amidst growing concerns in some sectors of society with respect to issues such as unemployment, invasions of privacy, and reliance on fallible computer systems, the computer is no longer seen as an unalloyed positive force in the society".

Faced with this situation, a natural question is "To which extent can we rely on computers?", or, more precisely, "What are the limits of computing systems dependability?". The question which comes next is "What are the challenges which we are faced with, as a result of these limits, and in order to overcome them?". Responses to these questions need to be formulated within a conceptual and terminological framework, which in turn is influenced by the analysis of the limits in dependability, and by the challenges raised by dependability. Such a framework can hardly be found in the many standardization efforts: as a consequence of their specialization (telecommunications, avionics, rail transportation, nuclear plant control, etc.), they usually do not consider all possible sources of failures which can affect computing systems, nor do they consider all attributes of dependability.

The considerations expressed in the above two paragraphs have guided the contents of the paper, which is composed of three sections. The first section is devoted to the main definitions relating to the dependability concept. The second section addresses the limits of dependability, and the third section discusses some challenges which computer science and industry are likely to face in the near future.

1 The Dependability Concept

In this section, condensed definitions for dependability are first given, which are then put into perspective with respect to their evolution over time and other existing definitions. The definitions given in the first paragraph are commented upon, and supplemented, in the Annex.

1.1 Basic definitions

Dependability is defined as that property of a computer system such that reliance can justifiably be placed on the service it delivers. The service delivered by a system is its behavior as it is perceptible by its user(s); a user is another system (human or physical) which interacts with the former.

Depending on the application(s) intended for the system, different emphasis may be put on different facets of dependability, i.e. dependability may be viewed according to different, but complementary, properties, which enable the attributes of dependability to be defined:
• the readiness for usage leads to availability,
• the continuity of service leads to reliability,
• the non-occurrence of catastrophic consequences on the environment leads to safety,
• the non-occurrence of unauthorized disclosure of information leads to confidentiality,
• the non-occurrence of improper alterations of information leads to integrity,
• the ability to undergo repairs and evolutions leads to maintainability.

Associating integrity and availability with respect to authorized actions, together with confidentiality, leads to security.

A system failure occurs when the delivered service deviates from fulfilling the system function, the latter being what the system is aimed at. An error is that part of the system state which is liable to lead to subsequent failure: an error affecting the service is an indication that a failure occurs or has occurred. The adjudged or hypothesized cause of an error is a fault.

The development of a dependable computing system calls for the combined utilization of a set of methods and techniques which can be classed into:
• fault prevention: how to prevent fault occurrence or introduction,
• fault tolerance: how to ensure a service up to fulfilling the system's function in the presence of faults,
• fault removal: how to reduce the presence (number, seriousness) of faults,
• fault forecasting: how to estimate the present number, the future incidence, and the consequences of faults.

The notions introduced up to now can be grouped into three classes and are summarized by figure 1:
• the impairments to dependability: faults, errors, failures; they are undesired (but not in principle unexpected) circumstances causing or resulting from un-dependability (whose definition is very simply derived from the definition of dependability: reliance cannot, or will not any longer, be placed on the service);
• the means for dependability: fault prevention, fault tolerance, fault removal, fault forecasting; these are the methods and techniques enabling one a) to provide the ability to deliver a service on which reliance can be placed, and b) to reach confidence in this ability;
• the attributes of dependability: availability, reliability, safety, confidentiality, integrity, maintainability; these a) enable the properties which are expected from the system to be expressed, and b) allow the system quality resulting from the impairments and the means opposing them to be assessed.

DEPENDABILITY
  ATTRIBUTES:   AVAILABILITY, RELIABILITY, SAFETY, CONFIDENTIALITY, INTEGRITY, MAINTAINABILITY
  MEANS:        FAULT PREVENTION, FAULT TOLERANCE, FAULT REMOVAL, FAULT FORECASTING
  IMPAIRMENTS:  FAULTS, ERRORS, FAILURES

Figure 1 - The dependability tree

1.2 Evolution and situation with respect to definitions given in standards

The definitions given in section 1.1 result from gradual evolutions within the Fault-Tolerant Computing community, and especially the IFIP Working Group 10.4. These evolutions took place from the mid-seventies to the early nineties [Carter 1982, Laprie & Costes 1982, Laprie 1985, Avizienis & Laprie 1986, Laprie 1992a, Laprie 1995].

A major strength of the dependability concept, as it is formulated in this paper, is its integrative nature, which enables the more classical notions of reliability, availability, safety, security, and maintainability to be put into perspective; they are then seen as attributes of dependability. The fault-error-failure model is central to the understanding and mastering of the various impairments which may affect a system, and it enables a unified presentation of these impairments, while preserving their specificities via the various fault classes which can be defined (see Annex). The model provided for the means for dependability is extremely useful, as those means are much more orthogonal to each other than the usual classification according to the attributes of dependability, with respect to which the design of any real system has to perform trade-offs, due to the fact that these attributes tend to be in conflict with each other.

The definitions of dependability which exist in current standards differ from the definition we have given. Two such differing definitions are:
• "The collective term used to describe the availability performance and its influencing factors: reliability performance, maintainability performance and maintenance support performance" [ISO 91].
• "The extent to which the system can be relied upon to perform exclusively and correctly the system task(s) under defined operational and environmental conditions over a defined period of time, or at a given instant of time" [CEI 92].

The ISO definition is clearly centered upon availability. This is no surprise, as this definition can be traced back to the definition given by the international organization for telephony, the CCITT [CCITT 84], although it could be argued that it seems strange to take a definition formulated in a given context and to use it as it is in a broader context. However, the willingness to grant dependability a generic character is noteworthy, going beyond availability as it was usually defined, and thus restoring the role of reliability and maintainability. In this respect, the ISO/CCITT definition is consistent with the definition given in [Hosford 60] for dependability: "the probability that a system will operate when needed".

The second definition, from [CEI 92], introduces the notion of reliance, and as such is much closer to our definition. This CEI definition however stays fundamentally within the spirit of an extension of availability.

2 Limits in Dependability

This section is aimed at identifying and discussing the limits in the dependability of computer systems in terms of fault classes. Three major classes of faults are considered: physical faults, design faults, and interaction faults (see the Annex, § A.1, for precise definitions of these fault classes).

2.1 Limits in terms of fault classes

Software, and thus design faults, are generally recognized as being the current bottleneck for dependability in critical applications, be they money- or life-critical. A simple reason is that the computer systems involved in such applications are tolerant to physical faults.

An examination of statistics for largely deployed systems confirms this state of affairs, as exemplified by figure 2, which compares failure data for non-fault-tolerant and fault-tolerant traditional systems, i.e. whose architecture is typically a central computing system accessed through a network of terminals. A close examination of those statistics shows that fault tolerance provides an improvement of two orders of magnitude in terms of time to failure, and that, on the average, fault-tolerant computing systems become obsolete before system failure.

                          Non-fault-tolerant systems      Fault-tolerant systems
                          (Japan, 1383 organizations      (Tandem Computers [Gray 90];
                          [Watanabe 86]; USA, 450         Bell Northern Research
                          companies [FIND/SVP 93])        [Cramp et al. 92])

Mean Time To Failure      6 to 12 weeks                   21 years [Tandem]

Average outage duration
after failure             1 to 4 hours

Failure sources           Hardware               50%     Software               65%
                          Software               25%     Operations-Procedures  10%
                          Communications-                Hardware                8%
                          Environment            15%     Environment             7%
                          Operations-Procedures  10%

Figure 2 - Failure sources and frequencies for computing systems

The importance of design faults is confirmed when considering figure 3, which gives the results of a comprehensive survey of large scale client-server networks (several thousands of workstations), that is, a type of architecture which tends to replace the traditional architectures. Although the data displayed on figure 3 relate to non-fault-tolerant systems, the dominance of design faults comes from the fact that, in such architectures, a user usually exploits a large number of various software, located on a similarly large number of servers.

[Figure 3: bar chart of user outage (in minutes for 1000 users, up to 5·10^6) per outage category (physical, design, operations, environment, configuration change), broken down into client, network, server, and all.]

Figure 3 - User outage in non-fault-tolerant client-server networks [Wood 94]

A further examination of the data of figures 2 and 3 shows that fault-tolerant traditional systems and client-server networks exhibit the same ordering of fault classes in terms of their contribution to system failure: 1) design faults, 2) operations (i.e. interaction) faults, 3) physical faults, 4) environment. In addition, figure 3 shows that configuration changes resulting from adaptive or perfective maintenance (upgrades, new version installs) become a major source of concern.

Interaction faults becoming the second source of system failure is not due to human operators becoming more error-prone, but simply to our progressive mastering of physical faults, and to the permutation among failure sources which ensues. This permutation effect has been noticeable for a long time in fault-tolerant systems [Toy 78, Davis & Giloth 81], and is supported by the statistics of figure 4, which show that, in commercial flights, human errors have become the first source of accidents, in spite of having significantly decreased in absolute terms over the years.
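The magnitude of the improvements reported in figure 2 can be checked with some back-of-the-envelope arithmetic (an illustrative sketch only; the point values below are picked from the ranges quoted in the figure):

```python
HOURS_PER_WEEK = 168
WEEKS_PER_YEAR = 52

def availability(mttf_h, mttr_h):
    # steady-state availability of an alternating working/outage process
    return mttf_h / (mttf_h + mttr_h)

# non-fault-tolerant system, pessimistic end of the ranges:
# MTTF of 6 weeks, 4-hour outages
a_nft = availability(6 * HOURS_PER_WEEK, 4.0)
downtime_h_per_year = (1 - a_nft) * 365 * 24   # roughly 35 hours per year

# ratio of the two mean times to failure: 21 years versus 6 to 12 weeks
ratio_low = 21 * WEEKS_PER_YEAR / 12    # about 90
ratio_high = 21 * WEEKS_PER_YEAR / 6    # about 180
```

The MTTF ratio of roughly 90 to 180 is indeed consistent with the two orders of magnitude quoted in the text.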
Accidents per million takeoffs    1970-1978      1979-1986
  Technical defects               1.49 (45%)     0.43 (33%)
  Weather                         0.82 (25%)     0.33 (26%)
  Human error                     1.03 (30%)     0.53 (41%)
  Total                           3.34           1.29

Figure 4 - Primary cause of accident of domestic commercial flights in the US [Ruegger 90]

In addition, a significant proportion of interaction faults can actually be traced to design faults [Norman 83], be they due either a) to a poor design of the man-machine interface or interaction procedures, or b) to a lack of assistance from the system to its human operators in tasks where human reasoning and judgement is ultimately necessary, such as facing situations of multiple system faults.

Coming back to fault-tolerant systems, design faults can affect a) software in the classical sense, i.e. both application and executive software, and b) fault tolerance mechanisms, which are usually at least partially software-implemented. Although design faults can clearly affect hardware as well (as exemplified by the recent infamous Pentium problem), such a discussion is out of the scope of the paper. As other papers in this volume address the design aspects of both dependable software and fault-tolerant systems, we focus in the following two paragraphs on the evaluation side.

2.2 Software reliability

The domain of software reliability evaluation is in a paradoxical situation: although we have seen that software is the current bottleneck of computing systems dependability, current practice does not generally (with a few exceptions such as [Donnelly et al. 92]) involve evaluation of software reliability, in spite of a huge amount of literature on the topic. In fact, reliability evaluation is simply ignored in the vast majority of software developments, and most published work on evaluations based on experimental data relating to real systems consists of post mortem studies, performed a posteriori, without direct influence on the development process. Such a state of practice is supported by the current standards for the development of software, which usually simply ignore reliability evaluation.

An initial cause of this paradoxical situation has been, for many years, the belief that methods and techniques for developing software would ultimately enable the production of fault-free software. This belief, which has now been widely recognized as undue, has indeed been detrimental to software reliability evaluation, which is nevertheless facing practical difficulties.

The vast majority of the published literature on the evaluation of software reliability is based on failure data exploited via reliability growth models [Xie 91]. Such studies, when conducted during development, are relevant to operational life only if the software execution profile is representative of operational life, and thus only if such studies are conducted during the final phases of development, i.e. validation and qualification tests. A problem is then that the times to failure may simply be large enough to make the application of reliability growth models impractical, due to the (hoped for) scarcity of failure data. Stated in equivalent terms, predictions based on failure data are of interest only if enough data are available to be statistically representative, that is, if there are "enough" faults, or if the data come from a large enough base of software copies whose input data can reasonably be considered as stochastically independent, or both. Such situations are encountered during the initial phases of testing, or for large bases of deployed software in the field. In the former situation (initial phases of testing), reliability estimations can hardly be predictive of the operational reliability because the execution profiles, aimed at finding faults, are different from operational profiles; however, processing failure data in order to determine whether the software reliability is growing or decreasing, i.e. performing trend analysis [Ascher & Feingold 84, Kanoun & Laprie 94], can be extremely helpful in judging the efficacy of the tests conducted. In the latter situation (large bases of deployed software in the field), the reliability estimations obviously cannot be used for guiding the development process. However, they provide information which can be exploited for posterior products, and they can be extremely useful for maintenance planning purposes.
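One commonly used statistic for such trend analysis is the Laplace factor; the following is a minimal sketch, assuming n failure times observed over a fixed interval (the failure-time data below are made up for illustration):

```python
import math

def laplace_factor(failure_times, T):
    """Laplace trend statistic for failure times observed over (0, T].
    Values below about -1.96 suggest significant reliability growth
    (inter-failure times tend to lengthen); values above +1.96 suggest
    significant reliability decrease, at the 5% level."""
    n = len(failure_times)
    return (sum(failure_times) / n - T / 2) / (T * math.sqrt(1 / (12 * n)))

# failures thinning out over time: reliability growth
growth = laplace_factor([5, 12, 25, 60, 140, 320], 400)

# failures clustering late in the interval: reliability decrease
decrease = laplace_factor([120, 260, 310, 350, 380, 395], 400)
```

Note that this only judges the trend of the failure process; it does not by itself yield a failure-rate prediction.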
Another class of models estimates reliability from the explicit consideration of non-failed executions, possibly together with failures [Nelson 73, Parnas et al. 90]. Such models assume that no correction is performed should a failure occur, and can thus be termed stable reliability models. The problem which then arises, besides the representativeness of the execution profile according to which the tests are conducted, is the number of executions necessary for estimating reliability levels which are commensurate with reasonable expectations.

It is generally agreed that a practical current limit in the assessment of a failure rate before operational use lies in the range 10^-2/h to 10^-4/h, be the corresponding evaluations conducted using reliability growth models or stable reliability models [Parnas et al. 90, Thevenod & Waeselynck 91]. On the other hand, measured failure rates in operation range from 10^-3/h [Chillarege et al. 95] to a few 10^-6/h [Kanoun & Sabourin 87, Gray 90], or even down to 5·10^-8/h [Laryd 94].
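The order of magnitude of this practical limit can be illustrated with a simple zero-failure demonstration calculation (a sketch which treats each operational hour as an independent Bernoulli trial, itself a strong assumption):

```python
import math

def failure_free_hours_needed(target_rate_per_h, confidence=0.95):
    """Hours of failure-free operation needed to claim, at the given
    confidence level, that the hourly failure probability is below the
    target: smallest n with (1 - p)^n <= 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - target_rate_per_h))

n_mid = failure_free_hours_needed(1e-4)   # about 3 * 10^4 hours, i.e. over 3 years
n_safe = failure_free_hours_needed(1e-9)  # about 3 * 10^9 hours: plainly out of reach
```

This is essentially the argument of [Butler & Finelli 93]: demonstrating ultra-high reliability levels by testing a single product in isolation would require hundreds of thousands of years of failure-free operation.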
It appears from the above discussion that the current approaches cannot generally satisfy the need of performing, prior to deployment, reliability evaluations of software systems in order to forecast their (future) reliability in operation. A common cause of the limitations of the various approaches is that they consider products in isolation from the process which produced them. Considering that software written from scratch is the exception, and that the current usual approach is rather to make evolutions from existing software, a logical consequence is to enhance the predictions performed for a given product with field data relative to previous, similar, software products. We have conducted preliminary studies on this idea, using a Bayesian approach, as what is looked for is the incorporation of prior knowledge, deduced from past experience, in order to derive posterior confidence in the reliability of the software under consideration, viewed as the new generation of a software family [Laprie 1992b]. The cornerstone of the approach is clearly the notion of similarity between the various generations of a software family. From the viewpoint of reliability prediction, and in a Bayesian context, dissimilarities due to additional failure sources will impact the prior distribution by reducing its ability to accommodate strong beliefs in its representativeness. On the other hand, instead of, or in conjunction with, such "negative" dissimilarities, we can expect "positive" dissimilarities to exist, resulting for instance from advances in the development and validation methods and techniques. In order to explore this notion of similarity, experimental data relative to families of products have been processed [Kaâniche et al. 1994]. However, those studies, which we term "product-in-a-process", are still at a preliminary stage, and need to be refined in order to be directly applicable.
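The flavor of this Bayesian reuse of field data can be conveyed by a deliberately simplified conjugate-prior sketch (this is not the model of [Laprie 1992b]; the Beta-Binomial choice and all numbers are hypothetical):

```python
# Field experience with the previous generation of the family: failures
# observed over a number of executions ("demands") give a Beta prior on
# the per-demand failure probability of the new generation.
prev_failures, prev_demands = 3, 20_000          # hypothetical field data
alpha0, beta0 = prev_failures + 1, prev_demands - prev_failures + 1
prior_mean = alpha0 / (alpha0 + beta0)

# Test campaign on the new generation: failure-free executions sharpen
# the estimate instead of starting from complete ignorance.
new_failures, new_demands = 0, 5_000             # hypothetical test results
alpha1 = alpha0 + new_failures
beta1 = beta0 + (new_demands - new_failures)
posterior_mean = alpha1 / (alpha1 + beta1)
```

The similarity question discussed above shows up here as the weight granted to the prior: a dissimilar predecessor would call for a flattened (less informative) prior before the update.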
Let us conclude this section by considering two broad classes of critical software systems: money-critical software (e.g. telecommunications, transaction processing), and life-critical software (e.g. flight control, nuclear plant monitoring, railway signalling). Current figures of residual fault densities for operational software fall in the range 0.01 to 10 faults per thousand lines of code (comments excluded), or kSLOC. Published data show that money-critical software systems lie in the upper end, say from 1 to 10 faults/kSLOC [Levendel 95], and life-critical software in the lower end, say from 0.01 to 1 fault/kSLOC [Bush 90]. Money-critical software systems are very large, and their size usually falls in the range of millions, or tens of millions, of lines of code. The (large) number of residual faults in operation clearly makes reliability evaluation relevant from a statistical viewpoint. Current life-critical software systems are much smaller, falling in the range of a few thousand to a few tens of thousands of lines of code. The (hoped for) vanishingly small number of residual faults in operation makes attempting reliability predictions for such software irrelevant from a statistical viewpoint. However, the rapid growth in the size of life-critical software may make reliability evaluations relevant in principle; in such an event, the stringent reliability objectives requested for life-critical computing systems will necessitate approaches similar to what we have described as "product-in-a-process", in order to overcome the practical difficulties suffered by the traditional approaches, which consider products in isolation [Butler & Finelli 93].

2.3 Fault tolerance effectiveness

The imperfections of fault tolerance, i.e. the lack of fault tolerance coverage, constitute a severe limitation to the increase in dependability which can be obtained. Such imperfections are due either a) to design faults affecting the fault tolerance mechanisms with respect to the fault assumptions stated during the design, the consequence of which is a lack of error and fault handling coverage, or b) to fault assumptions which differ from the faults really occurring in operation, resulting in a lack of fault assumption coverage, which can in turn be due either i) to failed component(s) not behaving as assumed, that is a lack of failure mode coverage, or ii) to the occurrence of correlated failures, that is a lack of failure independence coverage. The influence of a lack of error and fault handling coverage [Bouricius et al. 69, Arnold 73] has been shown to be such that it not only drastically limits the dependability improvement, but that in some cases adding further redundancies can result in lowering dependability [Dugan & Trivedi 89]. Similar effects can result from the lack of failure mode coverage: conservative fault assumptions (e.g., Byzantine faults) will result in a higher failure mode coverage, at the expense of an increase in redundancy and more complex fault tolerance mechanisms, which can lead to an overall decrease in the system dependability [Powell 92].
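The effect of imperfect coverage can be illustrated with a small Markov-style sketch of an n-unit standby system (an illustrative model, not that of [Dugan & Trivedi 89]: each fault is assumed either covered, with probability c, leading to successful reconfiguration, or uncovered, leading directly to system failure):

```python
def mission_unreliability(n_units, fail_rate_per_h, coverage, mission_h, steps=20_000):
    """Probability of system failure during the mission, for n identical
    units where covered faults are tolerated until one unit is left
    (simple Euler integration of the corresponding Markov chain)."""
    state = [0.0] * n_units   # state[k]: probability that k units have failed
    state[0] = 1.0
    failed = 0.0
    dt = mission_h / steps
    for _ in range(steps):
        snapshot = list(state)
        for k in range(n_units):
            flow = snapshot[k] * (n_units - k) * fail_rate_per_h * dt
            state[k] -= flow
            if k == n_units - 1:
                failed += flow                       # last unit: any fault is fatal
            else:
                state[k + 1] += flow * coverage      # covered fault: reconfigure
                failed += flow * (1 - coverage)      # uncovered fault: system failure
    return failed

# With modest coverage, a triplex comes out *less* dependable than a duplex,
# because the extra unit raises the exposure to uncovered faults:
u2 = mission_unreliability(2, 1e-3, 0.8, 50)
u3 = mission_unreliability(3, 1e-3, 0.8, 50)
```

With perfect coverage (c = 1) the ordering reverses and the extra redundancy pays off, which is exactly the non-monotonic behavior described in the text.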
The evaluation of the error and fault handling coverage has recently received considerable attention, leading to a number of studies on fault injection (see e.g. [Arlat et al. 90, Iyer & Tang 94, Karlsson et al. 94]): the intricacy of the fault tolerance mechanisms severely limits approaches relying purely on modeling [Dugan & Trivedi 89].

Let us finally mention that, if on one hand deficiencies of fault tolerance coverage are unavoidable, on the other hand a fault-tolerant system usually tolerates, in some situations, more faults than expected, and thus exhibits an extra-coverage. For instance, a system designed for tolerating
single faults will usually not tolerate all single faults because of the coverage deficiencies, and will however tolerate some combinations of multiple faults. Such a result, which is classical in error correcting codes (see e.g. [Siewiorek & Swarz 92]), applies to systems as well: it is reported in [Gray 90] that 20% of the fault chains had a length (i.e. the number of successive faults leading to failure) larger than 2. This notion of extra-coverage can be generalized in the light of the (unavoidable) robustness any system exhibits due to unintentional redundancies; as an example, figure 5 gives the results of fault-injection experiments conducted on the Delta-4 system (more than 20000 faults injected), which show that 12% of the errors were tolerated although not having been detected by the error detection mechanisms.

[Figure 5: diagram from injected faults to errors, detected errors, tolerated errors, and failures, with branch proportions of 94%, 85%, 99%, 12%, 3%, and 1%.]

Figure 5 - Results of fault-injection experiments in Delta-4 [Arlat et al. 91]

3 Challenges raised by dependability

We have so far focused on technical issues, which we now complement with financial considerations in order to better appraise the challenges we are faced with.

Figure 6 gathers two types of statistical data relating to informatics in France: the cost of computer failures as evaluated by the insurers' association, and the computer industry revenue and results. This figure shows the high cost of computer failures, both in absolute value and in relative value, as it amounts to about 5% of the total revenue of the computer industry. Such figures are likely to apply comparably in other industrialized countries; for instance, a recent survey conducted in the USA [FIND/SVP 93] has estimated the activity of large companies in industry and services to suffer an annual loss of 4 billion dollars as a consequence of accidental faults only.

[Figure 6: chart over 1984-1993, in billion 1993 FF, of the cumulative revenue and net cumulative result of the first 500 computer companies in France (construction, distribution, services) [source: 01 Références], together with the total cost of computer failures for private organizations and its allocation to accidental and malicious faults [source: APSAIRD/CLUSIF].]

Figure 6 - Cost of computer failures in France as compared to the revenue and results of the computer industry

If the correlation between the computer industry revenue and the cost of computer failures exhibited by figure 6 is not interrupted, then the dramatic development of computer applications currently promised by multi-media applications and the transformation of computer networks into the so-called information highways will be accompanied by an equally dramatic increase in the cost of their failures, which may ultimately impede their very development.

Let us now turn to life-critical software. What has been exposed in the previous section, i.e. the statistical evidence of software being the current bottleneck of dependability, together with the recognition that probabilistic assessment of software reliability to levels commensurate with safety requirements (e.g. 10^-9/h or 10^-5 per demand) is currently out of reach, has led to highly labor-intensive approaches for the development and validation of operational life-critical software. Be they undertaken via traditional software engineering approaches or via mathematically formal approaches, orders of magnitude of the effort dedicated to the development and validation of such software are in the range of 10 man-years per 1000 lines of code, for software ranging from a few thousands to a few tens of thousands
lines of code [Craigen et al. 1993, OFTA 1994]. Clearly, such development efforts are hardly sustainable for software which would be one or several orders of magnitude larger. In addition, verification and validation activities typically amount to 75% of the total development costs for such critical software, and the real benefit in terms of reliability improvement to be gained from any increase in those costs is doubtful.

A clear consequence of what precedes is that there is an urgent need for the computer industry to adopt fault tolerance on a much broader scale than at present. Such a recommendation applies obviously to all the classes of faults we have considered, i.e. physical faults, design faults, and interaction faults. The recommendation applies equally to intentionally malicious faults: coming back to figure 6, the high cost of those faults clearly shows that the current approaches, dominated by fault avoidance, are unsatisfactory, and the promising initial results for tolerating such faults [Joseph & Avizienis 1988, Rabin 1989, Deswarte et al. 1991] deserve to be pursued and broadened.

As, ultimately, any fault affecting human artefacts, as computer systems are, can be traced to a design fault, a natural recursion leads to the fault tolerance of not only the products, but also of the production process. Such is the challenge for the era which opens with the Silver Jubilee of the Fault-Tolerant Computing Symposia: extending fault tolerance from products to their production processes.

Acknowledgements. I dedicate this paper to all my colleagues and friends with whom I have had the pleasure to share many ideas and (hard) work over the years. Special mentions go to the members of the LAAS Research Group on Dependable Computing and Fault Tolerance, Alain Costes, Jean Arlat, Yves Crouzet, Yves Deswarte, Jean-Charles Fabre, Mohamed Kaaniche, Karama Kanoun, David Powell, Pascale Thévenod, with thoughts to the memory of Christian Béounes, to Jean-Paul Blanquart from LIS (the Laboratory for Dependability Engineering), and to all the members of IFIP WG 10.4, especially Al Avizienis, Bill Carter, John Meyer, Brian Randell and Yoshi Tohma.

Annex

A.1 The impairments to dependability

Of primary importance are the impairments to dependability, as we have to know what we are faced with.

Failure occurrence has been defined with respect to the function of a system, not with respect to its specification. Indeed, if an […] uncovering a specification fault. In the latter, recognizing that the event is undesired (and is in fact a failure) can only be performed after its occurrence, for instance via its consequences.

A system may not, and generally does not, always fail in the same way. The ways a system can fail are its failure modes, which may be characterized according to three viewpoints: domain, perception by the system users, and consequences on the environment, as indicated by figure A1.

DOMAIN:                       VALUE FAILURES, TIMING FAILURES
PERCEPTION BY SEVERAL USERS:  CONSISTENT FAILURES, INCONSISTENT (BYZANTINE) FAILURES
CONSEQUENCES ON ENVIRONMENT:  from BENIGN FAILURES to CATASTROPHIC FAILURES

Figure A1 - The failure modes

A class of failures relating to both value and timing are the halting failures: system activity, if any, is no longer perceptible to the users. According to how the system interacts with its user(s), such an absence of activity may take the form of a) frozen outputs (a constant value service is delivered; the constant value delivered may vary according to the application, e.g. last correct value, some predetermined value, etc.), or of b) silence (no message sent in a distributed system). A system whose failures can be (or, more generally, are to an acceptable extent) only halting failures is a fail-halt system; the situations of frozen outputs and of silence lead respectively to fail-passive systems and to fail-silent systems [Powell et al. 88].

A system whose failures can only be (or, more generally, are to an acceptable extent) benign failures is a fail-safe system. The notion of failure severity, resulting from grading the consequences of failures upon the system environment, enables the notion of criticality to be defined: the criticality of a system is the highest severity of its (possible) failure modes. The relation between failure modes and failure severities is highly application-dependent. However, there exists a broad class of applications where inoperation is considered a naturally safe position (e.g. ground transportation, energy production), whence the direct correspondence which is often made between fail-halt and fail-safe [Mine & Koga 67, Nicolaidis et al. 89]. Fail-halt systems (either fail-passive or fail-silent) and fail-safe systems are, however, examples of fail-controlled systems, i.e. systems which are designed and realized in order that they may only fail (or may fail to an acceptable extent) according to restrictive modes of failure, e.g. frozen output as opposed to delivering erratic values, silence as opposed to babbling, consistent failures as opposed to inconsistent ones; fail-controlled systems may in addition be defined via imposing some internal state condition or accessibility, as in the so-called fail-stop systems [Schlichting & Schneider 83].

Our classification of faults is performed in two steps: a) elementary fault classes are first defined, according to various viewpoints, as indicated in figure A2; b) the combination of
unacceptable behavior is generally identified as a failure due to these elementary faults leads to the combined fault classes
a deviation from the compliance with the specification, it may given by figure A3. The labels associated to those combined
happen that such a behavior complies with the specification, fault classes are consistent with the labels commonly used in
and be however unacceptable for the system user(s), thus

7
order to point in a condensed manner at one or several fault classes.

Figure A2 - The elementary fault classes
[Figure: faults classed by phenomenological cause (physical faults, human-made faults), nature (accidental faults; intentional, non-malicious faults; intentionally malicious faults), phase of creation or occurrence (development faults, operational faults), system boundaries (internal faults, external faults), and persistence (permanent faults, temporary faults).]

Figure A3 - The combined fault classes
[Figure: matrix relating the elementary fault classes to the combined fault classes: physical faults, design faults, interaction faults, malicious logic, and intrusions.]

Two comments regarding the human-made fault classes:
1) Intentional, non-malicious, design faults result generally from tradeoffs, either a) aimed at preserving acceptable performance or at facilitating the system utilization, or b) induced by economic considerations. Intentional, non-malicious interaction faults may result from the action of an operator either aimed at overcoming an unforeseen situation, or deliberately violating an operating procedure without having realized the possibly damaging consequences of his or her action. These classes of intentional non-malicious faults share the characteristic that, often, it is realized that they were faults only after an unacceptable system behavior, thus a failure, has occurred.
2) Malicious logics encompass development faults such as Trojan horses, logic or timing bombs and trapdoors, as well as operational faults (for the considered system) such as viruses or worms [Landwehr et al. 93].

The creation and manifestation mechanisms of faults, errors, and failures may be summarized as follows:
1) A fault is active when it produces an error. An active fault is either a) an internal fault which was previously dormant and which has been activated by the computation process, or b) an external fault. Most internal faults cycle between their dormant and active states. Physical faults can directly affect the hardware components only, whereas human-made faults may affect any component.
2) An error may be latent or detected. An error is latent when it has not been recognized as such; an error is detected by a detection algorithm or mechanism. An error may disappear before being detected. An error may, and in general does, propagate; by propagating, an error creates other — new — error(s). During operation, the presence of active faults is determined only by the detection of errors.
3) A failure occurs when an error "passes through" the system-user interface and affects the service delivered by the system. A component failure results in a fault a) for the system which contains the component, and b) as viewed by the other component(s) with which it interacts; the failure modes of the failed component then become fault types for the components interacting with it.

These mechanisms enable the "fundamental chain" to be completed:

• • • ➙ failure ➙ fault ➙ error ➙ failure ➙ fault ➙ • • •

The arrows in this chain express the causality relationship between faults, errors and failures. They should not be interpreted in a restrictive manner: a) by propagation, several errors can be generated before a failure occurs, and b) an error can lead to a fault without a failure being observed, if the observation is not performed, as a failure is an event occurring at the interface between two components.

The complexity of the creation, activation and manifestation of faults leads to a multiplicity of possible causes of failures. This complexity leads to the definition of fault classes which are more abstract than the classes which have been considered so far, be it a) for classifying faults uncovered during operation, or b) for stating fault assumptions when designing a system. Two such fault classes are:
• configuration change faults [Wood 94]: the service delivered by the system is affected, subsequent to a maintenance action, either evolutive or perfective (e.g. introduction of a new software version on a network server);
• timing faults, leading to timing failures, and omission faults, such that the system does not respond to a request, thus leading to halting failures [Cristian et al. 85].

A.2 The means for dependability

In this section, we briefly examine in turn those means for dependability which explicitly consider the notion of fault, i.e. fault tolerance, fault removal and fault forecasting.

Fault tolerance. Fault tolerance [Avizienis 67] is carried out by error processing and by fault treatment [Anderson & Lee 81]. Error processing is aimed at removing errors from the computational state, if possible before failure occurrence; fault treatment is aimed at preventing faults from being activated — again.

Error processing can be carried out via three primitives:
• error detection, which enables an erroneous state to be identified as such;
• error diagnosis, which enables the assessment of the damage caused by the detected error, or by errors propagated before detection;
• error recovery, where an error-free state is substituted for the erroneous state; this substitution may take on three forms:
  - backward recovery, where the erroneous state transformation consists of bringing the system back to a state already occupied prior to error occurrence; this involves the establishment of recovery points, which are points in time during the execution of a process for which the then current state may subsequently need to be restored;
  - forward recovery, where the erroneous state transformation consists of finding a new state, from which the system can operate (frequently in a degraded mode);
  - compensation, where the erroneous state contains enough redundancy to enable its transformation into an error-free state.

When backward or forward recovery is utilized, it is necessary that error detection precede error recovery. Backward and forward recovery are not exclusive: backward recovery may be attempted first; if the error persists, forward recovery may then be attempted. In forward recovery, it is necessary to assess the damage caused by the detected error, or by errors propagated before detection; damage assessment can — in principle — be ignored in the case of backward recovery, provided that the mechanisms enabling the transformation of the erroneous state into an error-free state have not been affected [Anderson & Lee 81].

Associating, within a component, its functional processing capability with error detection mechanisms leads to the notion of self-checking component, either in hardware [Carter & Schneider 68, Wakerly 78, Nicolaidis et al. 89] or in software [Yau & Cheung 75, Laprie et al. 90a]; one of the important benefits of the self-checking component approach is the ability to give a clear definition of error confinement areas [Siewiorek & Johnson 82]. When error compensation is performed in a system made up of self-checking components partitioned into classes executing the same tasks, state transformation is nothing else than switching within a class from a failed component to a non-failed one. On the other hand, compensation may be applied systematically, even in the absence of errors, then providing fault masking (e.g. in majority vote). However, this can at the same time correspond to an unnoticed decrease in redundancy. So, practical implementations of masking generally involve error detection, which may then be performed after the state transformation. As opposed to fault masking, implementing error processing via error recovery after error detection has taken place is generally referred to as error detection and recovery.

The first step in fault treatment is fault diagnosis, which consists of determining the cause(s) of error(s), in terms of both location and nature. Then come the actions aimed at fulfilling the main purpose of fault treatment: preventing the fault(s) from being activated again, i.e. fault passivation. This is carried out by preventing the component(s) identified as being faulty from being invoked in further executions. If the system is no longer capable of delivering the same service as before, a reconfiguration may take place, which consists in modifying the system structure in order that the non-failed components enable the delivery of an acceptable, although degraded, service; a reconfiguration may involve giving up some tasks, or re-assigning tasks among non-failed components.

The preceding definitions apply to physical faults as well as to design faults: the class(es) of faults which can actually be tolerated depend(s) on the fault hypothesis which is being considered in the design process, and thus relies on the independence of redundancies with respect to the process of fault creation and activation. An example is provided by considering tolerance of physical faults and tolerance of design faults. A (widely-used) method to attain fault tolerance is to perform multiple computations through multiple channels. When tolerance of physical faults is foreseen, the channels may be identical, based on the assumption that hardware components fail independently; such an approach is not suitable for the tolerance of design faults, where the channels have to provide identical services through separate designs and implementations [Elmendorf 72, Randell 75, Avizienis 78], i.e. through design diversity [Avizienis & Kelly 84].

An important aspect in the coordination of the activity of multiple components is that of preventing error propagation from affecting the operation of non-failed components. This aspect becomes particularly important when a given component needs to communicate some information to other components that is private to that component. Typical examples of such single-source information are local sensor data, the value of a local clock, the local view of the status of other components, etc. The consequence of this need to communicate single-source information from one component to other components is that non-failed components must reach an agreement as to how the information they obtain should be employed in a mutually consistent way. Specific attention has been devoted to this problem in the field of distributed systems (see e.g. atomic broadcast [Cristian et al. 85a], clock synchronization [Lamport & Melliar-Smith 85, Kopetz & Ochsenreiter 87] or membership protocols [Cristian 88]). It is important to realize, however, that the inevitable presence of structural redundancy in any fault-tolerant system implies distribution at one level or another, and that the agreement problem therefore remains in existence. Geographically localized fault-tolerant systems may employ solutions to the agreement problem that would be deemed too costly in a "classical" distributed system of components communicating by messages (e.g. inter-stages [Lala 86], multiple stages for interactive consistency [Frison & Wensley 82]).

Fault removal. Fault removal is composed of three steps: verification, diagnosis, correction. Verification is the process of checking whether the system adheres to properties, termed the verification conditions [Cheheyl et al. 81]. The verification techniques can be classed according to whether or not they involve exercising the system.

Verifying a system without actual execution is static verification. The verification can be conducted:
• on the system itself, in the form of a) static analysis (e.g. inspections or walk-through [Myers 79], data flow analysis [Osterweil et al. 76], complexity analysis [McCabe 76],
compiler checks, etc.) or b) proof-of-correctness (inductive assertions [Hoare 69]);
• on a model of the system behavior (e.g. Petri nets, finite state automata), leading to behavior analysis [Diaz 82].

Verifying a system through exercising it constitutes dynamic verification; the inputs supplied to the system can be either symbolic in the case of symbolic execution, or valued in the case of verification testing, usually simply termed testing.

Exhaustively testing a system with respect to all its possible inputs is generally impractical. The methods for the determination of the test patterns can be classed according to two viewpoints: criteria for selecting the test inputs, and generation of the test inputs.

The techniques for selecting the test inputs may in turn be classed according to three viewpoints:
• the purpose of the testing: checking whether the system satisfies its specification is conformance testing, whereas testing aimed at revealing faults is fault-finding testing;
• the system model: depending on whether the system model relates to the function or to the structure of the system, one is led respectively to functional testing and to structural testing;
• the fault model: the existence of a fault model leads to fault-based testing [Morell 90], aimed at revealing specific classes of faults (e.g. stuck-at faults in hardware production [Roth et al. 67], physical faults affecting the instruction set of a microprocessor [Thatte & Abraham 78], design faults in software [Goodenough & Gerhart 75, DeMillo et al. 78, Howden 87]); if there is no fault model, one is then led to criteria-based testing, where the criteria may relate, for software, to path sensitization [Howden 76], to utilization of the program variables [Rapps & Weyuker 85], to input boundary values [Myers 79], etc.

Combining the various viewpoints leads to the testing approaches, where the distinction between hardware and software is important, since hardware testing is mainly aimed at removing production faults, whereas software testing is concerned only with design faults:
• structural testing, when applied to hardware, generally means fault-finding, fault-based testing, whereas it generally means fault-finding, non-fault-based testing when applied to software;
• functional testing, when applied to hardware, generally means fault-finding, fault-based testing, whereas it generally means either conformance or fault-finding, criteria-based testing when applied to software;
• mutation testing of software [DeMillo et al. 78] is fault-finding, structural, fault-based testing.

The generation of the test inputs may be deterministic or probabilistic:
• in deterministic testing, test patterns are predetermined by a selective choice according to the adopted criteria;
• in random, or statistical, testing, test patterns are selected according to a defined probability distribution on the input domain; the distribution and the number of input data are determined according to the adopted criteria [David & Thevenod-Fosse 81, Duran & Ntafos 84, Thevenod-Fosse & Waeselynck 91].

Fault forecasting. Fault forecasting is conducted by performing an evaluation of the system behavior with respect to fault occurrence or activation. Evaluation has two aspects:
• ordinal evaluation, aimed at identifying, classifying and ordering the failure modes, or the methods and techniques aimed at avoiding them;
• probabilistic evaluation, aimed at evaluating in terms of probabilities some of the attributes of dependability.

The methods and tools enabling ordinal and probabilistic evaluations are either specific (e.g., failure mode and effect analysis for qualitative evaluation, or Markov chains for quantitative evaluation), or can be used for performing both forms of evaluation (e.g. reliability block diagrams, fault trees).

The definition of the measures of dependability necessitates that the notions of proper and improper service first be defined:
• proper service, where the delivered service fulfills the system function;
• improper service, where the delivered service does not fulfill the system function.

A failure is thus a transition from proper to improper service, and the transition from improper to proper service is a restoration. Quantifying the alternation of proper-improper service delivery then enables reliability and availability to be defined as measures of dependability:
• reliability: a measure of the continuous delivery of proper service — or, equivalently, of the time to failure;
• availability: a measure of the delivery of proper service with respect to the alternation of proper and improper service.

A third measure, maintainability, is usually considered, which may be defined as a measure of the time to restoration from the last experienced failure, or, equivalently, of the continuous delivery of improper service.

As a measure, safety can be seen as an extension of reliability. Let us group the state of proper service together with the states of improper service subsequent to benign failures into a safe state (in the sense of being free from catastrophic damage, not from danger); safety is then a measure of continuous safeness, or, equivalently, of the time to catastrophic failure. Safety can thus be considered as reliability with respect to catastrophic failures.

In the case of multi-performing systems, several services can be distinguished, as well as several modes of service delivery, ranging from full capacity to complete disruption, which can be seen as distinguishing less and less proper service deliveries. Performance-related measures of dependability for such systems are usually grouped under the notion of performability [Meyer 78, Smith & Trivedi 88].

When performing an evaluation, the approaches differ significantly according to whether the system is considered as being in stable reliability or in reliability growth, which may be defined as follows [Laprie et al. 90b]:
• stable reliability: the system's ability to deliver proper service is preserved (stochastic identity of the successive times to failure);
• reliability growth: the system's ability to deliver proper service is improved (stochastic increase of the successive times to failure).
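Whether observed failure data exhibit stable reliability or reliability growth can be checked with a trend test. The sketch below is an illustrative Python rendition — not code from the paper — of the Laplace test commonly used in reliability trend analyses such as [Kanoun & Laprie 94]: a factor near zero is consistent with stable reliability, while a markedly negative factor points to reliability growth (stochastically increasing times to failure).

```python
import math

def laplace_factor(interfailure_times):
    """Laplace trend factor for a failure-truncated sequence of
    inter-failure times.

    Failures occur at cumulative instants t_1 < ... < t_n in (0, t_n].
    Under stable reliability (homogeneous Poisson failure process) the
    factor is approximately standard normal; markedly negative values
    indicate reliability growth, markedly positive values indicate
    reliability decrease.
    """
    n = len(interfailure_times)
    if n < 2:
        raise ValueError("at least two inter-failure times are needed")
    # Cumulative failure instants t_1 .. t_n.
    instants, t = [], 0.0
    for x in interfailure_times:
        t += x
        instants.append(t)
    horizon = instants[-1]
    # Mean of the first n-1 failure instants, compared with the
    # midpoint of the observation window.
    mean_instant = sum(instants[:-1]) / (n - 1)
    return (mean_instant - horizon / 2) / (
        horizon * math.sqrt(1.0 / (12 * (n - 1))))

# Steadily lengthening inter-failure times: clearly negative factor,
# i.e. reliability growth.
print(laplace_factor([5, 8, 13, 21, 34, 55, 89, 144]))
# Constant inter-failure times: factor 0.0, i.e. stable reliability.
print(laplace_factor([10.0] * 20))
```

The failure-truncated form of the factor, and the conventional two-sided 1.96 significance threshold one would apply to it, are assumptions of this sketch rather than prescriptions of the text.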
When evaluating fault-tolerant systems, the coverage of the error processing and fault treatment mechanisms has a very significant influence [Bouricius et al. 69, Arnold 73]; its evaluation can be performed either through modeling [Dugan & Trivedi 89] or through testing, then called fault injection [Arlat et al. 90, Gunneflo et al. 89].

A.3 The attributes of dependability

The attributes of dependability have been defined in section 1.1 according to different properties, which may be more or less emphasized depending on the application intended for the computer system under consideration. Especially, reliability, safety and confidentiality may or may not be required according to the application.

Integrity is a prerequisite for availability, reliability and safety, but may not be so for confidentiality (for instance when considering attacks via covert channels or passive listening). The definition given for integrity — absence of improper alterations of information — generalizes the usual definitions, which relate to the notion of authorized actions only (e.g., prevention of the unauthorized amendment or deletion of information [CEC 91], assurance of approved data alterations [Jacob 91]); naturally, when a system implements an authorization policy, "improper" encompasses "unauthorized".

Whether a system holds the properties which have enabled the attributes of dependability to be defined should be interpreted in a relative, probabilistic sense, and not in an absolute, deterministic sense: due to the unavoidable presence or occurrence of faults, systems are never one hundred percent available, reliable, safe, or secure.

The definition given for maintainability goes deliberately beyond corrective maintenance, aimed at preserving or improving the system's ability to deliver a service fulfilling its function (relating to reparability only), and encompasses, via evolvability, the other forms of maintenance: adaptive maintenance, which adjusts the system to environmental changes (e.g. change of operating systems or system databases), and perfective maintenance, which improves the system's function by responding to customer- and designer-defined changes, which may involve removal of specification faults [Ramamoorthy 84].

Security has not been introduced as a single attribute of dependability, in agreement with the usual definitions of security, which view it as a composite notion, namely «the combination of confidentiality, the prevention of the unauthorized disclosure of information, integrity, the prevention of the unauthorized amendment or deletion of information, and availability, the prevention of the unauthorized withholding of information» [CEC 91].

The variations in the emphasis to be put on the attributes of dependability have a direct influence on the appropriate balance of the techniques addressed in the previous section to be employed in order that the resulting system be dependable. This is an all the more difficult problem as some of the attributes are antagonistic (e.g. availability and safety, availability and security), necessitating that trade-offs be performed. Considering the three main design dimensions of a computer system, i.e. cost, performance and dependability, the problem is further exacerbated by the fact that the dependability dimension is less understood than the cost-performance design space [Siewiorek & Johnson 82].

References

Arlat et al. 90 J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.-C. Fabre, J.-C. Laprie, E. Martins and D. Powell, "Fault injection for dependability validation — a methodology and some applications", IEEE Trans. on Software Engineering, vol. 16, no. 2, Feb. 1990, pp. 166-182.

Arlat et al. 91 J. Arlat, Y. Crouzet, E. Martins, D. Powell, "Dependability testing report LA3 - Fault injection on the extended self-checking NAC", LAAS Report 91.396, Delta-4 ref. R91.119/I1/P, Dec. 1991.

Anderson & Lee 81 T. Anderson, P.A. Lee, Fault Tolerance — Principles and Practice, Prentice Hall, 1981.

Arnold 73 T.F. Arnold, "The concept of coverage and its effect on the reliability model of repairable systems", IEEE Trans. on Computers, vol. C-22, June 1973, pp. 251-254.

Ascher & Feingold 84 H. Ascher, H. Feingold, Repairable Systems Reliability, vol. 7, Lecture Notes in Statistics, New York, USA, Dekker, 1984.

Avizienis 67 A. Avizienis, "Design of fault-tolerant computers", in Proc. Fall Joint Computer Conf., 1967, pp. 733-743.

Avizienis 78 A. Avizienis, "Fault tolerance, the survival attribute of digital systems", Proceedings of the IEEE, vol. 66, no. 10, Oct. 1978, pp. 1109-1125.

Avizienis & Kelly 84 A. Avizienis, J.P.J. Kelly, "Fault tolerance by design diversity: concepts and experiments", Computer, vol. 17, no. 8, Aug. 1984, pp. 67-80.

Avizienis & Laprie 86 A. Avizienis, J.C. Laprie, "Dependable computing: from concepts to design diversity", Proceedings of the IEEE, vol. 74, no. 5, May 1986, pp. 629-638.

Bouricius et al. 69 W.G. Bouricius, W.C. Carter, P.R. Schneider, "Reliability modeling techniques for self-repairing computer systems", in Proc. 24th ACM National Conf., 1969, pp. 295-309.

Bush 90 M.W. Bush, "Getting started on metrics - Jet Propulsion Laboratory productivity and quality", Proc. 12th IEEE Int. Conf. on Software Engineering, Nice, France, March 1990, pp. 133-142.

Butler & Finelli 93 R.W. Butler, G.B. Finelli, "The infeasibility of quantifying the reliability of life-critical real-time software", IEEE Trans. on Software Engineering, vol. 19, no. 1, Jan. 1993, pp. 3-12.

Carter 82 W.C. Carter, "A time for reflection", in Proc. 12th IEEE Int. Symp. on Fault Tolerant Computing (FTCS-12), Santa Monica, California, June 1982, p. 41.

Carter & Schneider 68 W.C. Carter, P.R. Schneider, "Design of dynamically checked computers", in Proc. IFIP'68 Cong., Amsterdam, 1968, pp. 878-883.

CCITT 84 Terms and definitions concerning quality of service, availability and reliability, Recommendation G 106, CCITT, 1984; in French.

CEC 91 Information Technology Security Evaluation Criteria, harmonized criteria of France, Germany, the Netherlands, the United Kingdom, Commission of the European Communities, 1991.

CEI 92 Industrial-process measurement and control — Evaluation of system properties for the purpose of system assessment. Part 5: Assessment of system dependability, draft, Publication 1069-5, CEI Secretariat, Feb. 1992.

Cheheyl et al. 81 M.H. Cheheyl, M. Gasser, G.A. Huff, J.K. Miller, "Verifying security", Computing Surveys, vol. 13, no. 3, Sep. 1981, pp. 279-339.

Chillarege et al. 95 R. Chillarege, S. Biyani, J. Rosenthal, "Measurement of failure rate in widely distributed software", in Proc. 25th IEEE Int. Symp. on Fault Tolerant Computing (FTCS-25), Pasadena, California, June 1995.

COM 92 "Computing the Future", report of the Committee to Assess the Scope and Direction of Computer Science and Technology of the National Research Council, Communications of the ACM, vol. 35, no. 11, Nov. 1992, pp. 30-40.
Craigen et al. 93 D. Craigen, S. Gerhart, T. Ralston, "An international survey of industrial applications of formal methods", report NIST GCR 93/626, National Institute of Standards and Technology, March 1993.

Cramp et al. 92 R. Cramp, M.A. Vouk, W. Jones, "On operational availability of a large software-based telecommunications system", Proc. 3rd Int. Symp. on Software Reliability Engineering, Research Triangle Park, North Carolina, Oct. 1992, pp. 358-366.

Cristian et al. 85 F. Cristian, H. Aghili, R. Strong, D. Dolev, "Atomic broadcast: from simple message diffusion to Byzantine agreement", in Proc. 15th IEEE Int. Symp. on Fault Tolerant Computing (FTCS-15), Ann Arbor, Michigan, June 1985, pp. 200-206.

Cristian 88 F. Cristian, "Agreeing on who is present and who is absent in a synchronous distributed system", in Proc. 18th IEEE Int. Symp. on Fault Tolerant Computing (FTCS-18), Tokyo, June 1988, pp. 206-211.

David & Thevenod-Fosse 81 R. David, P. Thévenod-Fosse, "Random testing of integrated circuits", IEEE Trans. on Instrumentation and Measurement, vol. IM-30, no. 1, March 1981, pp. 20-25.

Davis & Giloth 81 E.A. Davis, P.K. Giloth, "No. 4 ESS: performance objectives and service experience", The Bell System Technical Journal, vol. 60, no. 6, July-Aug. 1981, pp. 1203-1224.

DeMillo et al. 78 R.A. DeMillo, R.J. Lipton, F.G. Sayward, "Hints on test data selection: help for the practicing programmer", Computer, April 1978, pp. 34-41.

Deswarte et al. 91 Y. Deswarte, L. Blain, J.C. Fabre, "Intrusion tolerance in distributed computing systems", Proc. 1991 IEEE Symposium on Research in Security and Privacy, Oakland, USA, 20-22 May 1991, pp. 110-121.

Diaz 82 M. Diaz, "Modeling and analysis of communication and cooperation protocols using Petri net based models", Computer Networks, vol. 6, no. 6, Dec. 1982, pp. 419-441.

Donnelly et al. 92 M.M. Donnelly, W.W. Everett, J.D. Musa, G. Wilson, "Software reliability engineering - best current practice", AT&T Document 45370B-930326-01TM, Oct. 1992.

Dugan & Trivedi 89 J.B. Dugan, K.S. Trivedi, "Coverage modeling for dependability analysis of fault-tolerant systems", IEEE Trans. on Computers, vol. 38, no. 6, June 1989, pp. 775-787.

Duran & Ntafos 84 J.W. Duran, S.C. Ntafos, "An evaluation of random testing", IEEE Trans. on Software Engineering, vol. SE-10, no. 4, July 1984, pp. 438-444.

Elmendorf 72 W.R. Elmendorf, "Fault-tolerant programming", in Proc. 2nd IEEE Int. Symp. on Fault Tolerant Computing (FTCS-2), Newton, Massachusetts, June 1972, pp. 79-83.

FIND/SVP 93 "The Impact of Online Computer Systems Downtime on American Businesses", FIND/SVP Survey, 1993.

Goodenough & Gerhart 75 J.B. Goodenough, S.L. Gerhart, "Toward a theory of test data selection", IEEE Trans. on Software Engineering, vol. SE-1, no. 2, June 1975, pp. 156-173.

Gray 90 J. Gray, "A census of Tandem system availability between 1985 and 1990", IEEE Trans. on Reliability, vol. 39, no. 4, Oct. 1990, pp. 409-418.

Gunneflo et al. 89 U. Gunneflo, J. Karlsson, J. Torin, "Evaluation of error detection schemes using fault injection by heavy-ion radiation", in Proc. 19th IEEE Int. Symp. on Fault Tolerant Computing

ISO 91 Quality Concepts and Terminology, Part One: Generic Terms and Definitions, Document ISO/TC 176/SC 1 N 93, Feb. 1992.

Jacob 91 J. Jacob, "The basic integrity theorem", Proc. Int. Symp. on Security and Privacy, Oakland, CA, USA, 1991, pp. 89-97.

Joseph & Avizienis 88 M.K. Joseph, A. Avizienis, "A fault tolerance approach to computer viruses", in Proc. 1988 Symp. on Security and Privacy, Oakland, April 1988, pp. 52-58.

Kaâniche et al. 94 M. Kaâniche, K. Kanoun, M. Cukier and M. Bastos Martini, "Software reliability analysis of three successive generations of a switching system", in Proc. 1st European Dependable Computing Conference (EDCC-1), Berlin, Germany, 1994.

Kanoun & Laprie 94 K. Kanoun, J.-C. Laprie, "Software reliability trend analyses: from theoretical to practical considerations", IEEE Trans. on Software Engineering, vol. 20, no. 9, 1994, pp. 740-747.

Kanoun & Sabourin 87 K. Kanoun, T. Sabourin, "Software dependability of a telephone switching system", Proc. 17th IEEE Int. Symp. on Fault-Tolerant Computing (FTCS-17), Pittsburgh, PA, USA, June 1987, pp. 236-241.

Karlsson et al. 94 J. Karlsson, P. Lidén, P. Dahlgren and R. Johansson, "Using heavy-ion radiation to validate fault-handling mechanisms", IEEE Micro, vol. 14, no. 1, Feb. 1994, pp. 8-23.

Kopetz & Ochsenreiter 87 H. Kopetz, W. Ochsenreiter, "Clock synchronization in distributed real-time systems", IEEE Trans. on Computers, vol. C-36, no. 8, Aug. 1987, pp. 933-940.

Lamport & Melliar-Smith 85 L. Lamport, P.M. Melliar-Smith, "Synchronizing clocks in the presence of faults", Journal of the ACM, vol. 32, no. 1, Jan. 1985, pp. 52-78.

Landwehr et al. 93 C.E. Landwehr, A.R. Bull, J.P. McDermott, W.S. Choi, "A taxonomy of computer security flaws, with examples", Naval Research Laboratory, Report no. NRL/FR/5542-93-9591, Nov. 1993.

Laprie & Costes 82 J.C. Laprie, A. Costes, "Dependability: a unifying concept for reliable computing", Proc. 12th IEEE Int. Symp. on Fault Tolerant Computing (FTCS-12), Santa Monica, California, June 1982, pp. 18-21.

Laprie 85 J.C. Laprie, "Dependable computing and fault tolerance: concepts and terminology", in Proc. 15th IEEE Int. Symp. on Fault Tolerant Computing (FTCS-15), Ann Arbor, Michigan, June 1985, pp. 2-11.

Laprie et al. 90a J.C. Laprie, J. Arlat, C. Béounes, K. Kanoun, "Definition and analysis of hardware- and software-fault-tolerant architectures", IEEE Computer, vol. 23, no. 7, July 1990, pp. 39-51.

Laprie et al. 90b J.C. Laprie, C. Béounes, M. Kaâniche, K. Kanoun, "The KAT (knowledge-action-transformation) approach to the modeling and evaluation of reliability and availability growth", IEEE Trans. on Software Engineering, vol. 17, no. 4, April 1991, pp. 370-382.

Laprie 92a J.C. Laprie (Ed.), Dependability: Basic Concepts and Terminology, Springer-Verlag, Vienna, 1992.

Laprie 92b J.C. Laprie, "For a product-in-a-process approach to software reliability evaluation", Proc. 3rd Int. Symp. on Software Reliability Engineering, Research Triangle Park, NC, Oct. 1992, pp. 134-139.

Laprie 95 J.C. Laprie, "Dependability - Its attributes, impairments, and
Computing (FTCS-19), Chicago, June 1989, pp. 340-347. means”, in Predicably Dependable Computing Systems, B.Randell,
Hoare 69 C.A.R. Hoare, "An axiomatic basis for computer J.C. Laprie, H. Kopetz, B. Littlewood, eds, Springer-Verlag, 1995,
programming", Communications of the ACM, vol. 12, no. 10, Oct. pp. 3-18.
1969, pp. 576-583. Laryd 94 A. Laryd, “Operating experioence of software in
Hosford 60 J.E. Hosford, "Measures of dependability", Operations programmable equipment used in ABB atom nuclear I&C
Research, vol. 8, no. 1, 1960, pp. 204-206. applications”, IAEA TCM, Helsinki, Finland, June 1994.
Howden 76 W.E. Howden, "Reliability of the path analysis testing Levendel 95 Y. Levendel, “The cost effectiveness of telecommunication
strategy", IEEE Trans. on Software Engineering, vol. SE-2, no. 3, service dependability”, in Software Fault Tolerance, M. R. Lyu, ed.,
Sep. 1976, pp. 208-215. Wiley, 1995, pp. 279-314.
Howden 87 W.E. Howden, Functional Program Testing and Ananlysis, McCabe 76 T.J. McCabe, "A complexity measure", IEEE Trans. on
McGraw-Hill, 1987. Software Engineering, vol. SE-2, no. 4, Dec. 1976, pp. 308-320.
Iyer & Tang 94 R. K. Iyer and D. Tang, “Experimental analysis of
computer system dependability”, UIUC Report, 1994.

12
Meyer 78 J.F. Meyer, "On evaluating the performability of degradable computing systems", in Proc. 8th IEEE Int. Symp. on Fault Tolerant Computing (FTCS-8), Toulouse, France, June 1978, pp. 44-49.
Mine & Koga 67 H. Mine, Y. Koga, "Basic properties and a construction method for fail-safe logical systems", IEEE Trans. on Electronic Computers, vol. EC-16, no. 6, June 1967, pp. 282-289.
Morell 90 L.J. Morell, "A theory of fault-based testing", IEEE Trans. on Software Engineering, vol. 16, no. 8, Aug. 1990, pp. 844-857.
Myers 79 G.J. Myers, The Art of Software Testing, John Wiley & Sons, 1979.
Nelson 73 E.C. Nelson, A Statistical Basis for Software Reliability, Report no. TRW-SS-73-02, TRW Software Series, 1973.
Nicolaidis et al. 89 M. Nicolaidis, S. Noraz, B. Courtois, "A generalized theory of fail-safe systems", in Proc. 19th IEEE Int. Symp. on Fault Tolerant Computing (FTCS-19), Chicago, USA, June 1989, pp. 398-406.
Norman 83 D.A. Norman, "Design rules based on analyses of human error", Communications of the ACM, vol. 26, no. 4, April 1983, pp. 254-258.
OFTA 94 French Observatory for Advanced Techniques, ARAGO 15, Fault-Tolerant Computing, Masson, Paris, 1994; in French: Observatoire Français des Techniques Avancées, Informatique Tolérante aux Fautes.
Osterweil et al. 76 L.J. Osterweil, L.D. Fosdick, "DAVE — a validation error detection and documentation system for Fortran programs", Software Practice and Experience, Oct.-Dec. 1976, pp. 473-486.
Parnas et al. 90 D.L. Parnas, A.J. van Schouwen, S.P. Kwan, "Evaluation of safety-critical software", Communications of the ACM, vol. 33, no. 6, June 1990, pp. 636-648.
Powell et al. 88 D. Powell, G. Bonn, D. Seaton, P. Veríssimo, F. Waeselynck, "The Delta-4 approach to dependability in open distributed computing systems", in Proc. 18th IEEE Int. Symp. on Fault Tolerant Computing (FTCS-18), Tokyo, Japan, June 1988, pp. 246-251.
Powell 92 D. Powell, "Failure mode assumptions and assumption coverage", Proc. 22nd IEEE Int. Symp. on Fault-Tolerant Computing (FTCS-22), Boston, July 1992, pp. 386-395.
Rabin 89 M.O. Rabin, "Efficient dispersal of information for security, load balancing and fault tolerance", Journal of the ACM, vol. 36, no. 2, April 1989, pp. 335-348.
Ramamoorthy 84 C.V. Ramamoorthy, A. Prakash, W.T. Tsai, Y. Usuda, "Software engineering: problems and perspectives", IEEE Computer, Oct. 1984, pp. 191-209.
Randell 75 B. Randell, "System structure for software fault tolerance", IEEE Trans. on Software Engineering, vol. SE-1, no. 2, June 1975, pp. 220-232.
Rapps & Weyuker 85 S. Rapps, E.J. Weyuker, "Selecting software test data using data flow information", IEEE Trans. on Software Engineering, vol. SE-11, no. 4, April 1985, pp. 367-375.
Roth et al. 67 J.P. Roth, W.G. Bouricius, P.R. Schneider, "Programmed algorithms to compute tests to detect and distinguish between failures in logic circuits", IEEE Trans. on Electronic Computers, vol. EC-16, Oct. 1967, pp. 567-579.
Ruegger 90 B. Ruegger, "Human error in the cockpit", Swiss Reinsurance Company, 1990.
Schlichting & Schneider 83 R.D. Schlichting, F.B. Schneider, "Fail-stop processors: an approach to designing fault-tolerant computing systems", ACM Trans. on Computing Systems, vol. 1, no. 3, Aug. 1983, pp. 222-238.
Siewiorek & Johnson 82 D.P. Siewiorek, D. Johnson, "A design methodology for high reliability systems: the Intel 432", in D.P. Siewiorek, R.S. Swarz, The Theory and Practice of Reliable System Design, Digital Press, 1982, pp. 737-767.
Siewiorek & Swarz 92 D.P. Siewiorek, R.S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital Press, 1992.
Smith & Trivedi 88 R.M. Smith, K.S. Trivedi, A.V. Ramesh, "Performability analysis: measures, an algorithm, and a case study", IEEE Trans. on Computers, vol. 37, no. 4, April 1988, pp. 406-417.
Thatte & Abraham 78 S.M. Thatte, J.A. Abraham, "A methodology for functional level testing of microprocessors", in Proc. 8th IEEE Int. Symp. on Fault Tolerant Computing (FTCS-8), Toulouse, France, June 1978, pp. 90-95.
Thevenod & Waeselynck 91 P. Thévenod-Fosse, H. Waeselynck, "An investigation of statistical software testing", Journal of Software Testing, Verification and Reliability, vol. 1, no. 2, 1991, pp. 5-25.
Toy 78 W.N. Toy, "Fault-tolerant design of local ESS processors", Proceedings of the IEEE, vol. 66, no. 10, Oct. 1978, pp. 1126-1145.
Wakerly 78 J.F. Wakerly, Error Detecting Codes, Self-Checking Circuits, and Applications, New York: Elsevier North-Holland, 1978.
Watanabe 86 E. Watanabe, "Survey on Computer Security", Japan Information Development Corp., 1986.
Wood 94 A. Wood, "NonStop availability in a client/server environment", Tandem Technical Report 94.1, March 1994.
Xie 91 M. Xie, Software Reliability Modeling, Singapore: World Scientific, 1991.
Yau & Cheung 75 S.S. Yau, R.C. Cheung, "Design of self-checking software", in Proc. 1975 Int. Conf. on Reliable Software, Los Angeles, USA, April 1975, pp. 450-457.