
Software Reliability

Categorising and specifying the reliability of software systems

©Ian Sommerville 1995 Software Engineering, 5th edition. Chapter 18 Slide 1


Objectives
 To discuss the problems of reliability
specification and measurement
 To introduce reliability metrics and to discuss
their use in reliability specification
 To describe the statistical testing process
 To show how reliability predictions may be made from statistical test results



What is reliability?
 Probability of failure-free operation for a
specified time in a specified environment for a
given purpose
 This means quite different things depending on
the system and the users of that system
 Informally, reliability is a measure of how well
system users think it provides the services they
require



System Reliability Specification
 Hardware reliability
• probability a hardware component fails
 Software reliability
• probability a software component will produce an
incorrect output
• software does not wear out
• software can continue to operate after a bad result
 Operator reliability
• probability system user makes an error



Software reliability
 Cannot be defined objectively
• Reliability measurements which are quoted out of context are
not meaningful
 Requires operational profile for its definition
• The operational profile defines the expected pattern of software
usage
 Must consider fault consequences
• Not all faults are equally serious. System is perceived as more
unreliable if there are more serious faults



Failure Probabilities
 If there are two independent components in a
system and the operation of the system depends
on them both, then the probability of system failure is
P(S) = P(A) + P(B) - P(A)P(B) ≈ P(A) + P(B) for small failure probabilities
 If a component is replicated n times, then the
probability of failure is
P(S) = P(A)^n
since the system fails only if all n replicas fail at once
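The two formulas above can be checked with a few lines of Python (a sketch; the component failure probabilities are invented):

```python
# Failure probability of a system built from independent components.

def series_failure(p_a, p_b):
    """System needs both A and B: it fails if either fails, so
    P(S) = P(A) + P(B) - P(A)*P(B), which is close to P(A) + P(B)
    when both probabilities are small."""
    return p_a + p_b - p_a * p_b

def replicated_failure(p_a, n):
    """n independent replicas of A: the system fails only when all
    replicas fail at once, so P(S) = P(A)**n."""
    return p_a ** n

print(series_failure(0.01, 0.01))    # ~0.0199, close to P(A) + P(B)
print(replicated_failure(0.01, 3))   # ~1e-06
```

Replication drives the failure probability down very quickly, which is why redundancy is the standard hardware-reliability technique.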



Failures and faults
 A failure corresponds to unexpected run-time
behaviour observed by a user of the software
 A fault is a static software characteristic which
causes a failure to occur
 Faults need not necessarily cause failures. They
only do so if the faulty part of the software is used
 If a user does not notice a failure, is it a failure?
Remember most users don’t know the software
specification



Hardware Reliability Metrics
 Hardware metrics are not suitable for software
since they are based on the notion of
component failure
 Software failures are often design failures
 Often the system is available after the failure has
occurred
 Hardware components can wear out



Software Reliability Metrics
 Reliability metrics are units of measure for
system reliability
 System reliability is measured by counting the
number of operational failures and relating these
to demands made on the system at the time of
failure
 A long-term measurement program is required to
assess the reliability of critical systems



Input/output mapping
[Figure: the input set I contains a subset Ie of inputs that cause erroneous outputs; the program maps these to the erroneous output subset Oe of the output set]



Reliability improvement
 Reliability is improved when software faults which
occur in the most frequently used parts of the
software are removed
 Removing x% of software faults will not
necessarily lead to an x% reliability improvement
 In a study, removing 60% of software defects
actually led to a 3% reliability improvement
 Removing faults with serious consequences is the
most important objective



Reliability perception
[Figure: the set of possible inputs contains a subset of erroneous inputs; Users 1, 2 and 3 each exercise different input subsets, so each perceives a different reliability]


Reliability and formal methods
 The use of formal methods of development may
lead to more reliable systems as it can be proved
that the system conforms to its specification
 The development of a formal specification forces
a detailed analysis of the system which discovers
anomalies and omissions in the specification
 However, formal methods may not actually
improve reliability



Reliability and formal methods
 The specification may not reflect the real
requirements of system users
 A formal specification may hide problems
because users don’t understand it
 Program proofs usually contain errors
 The proof may make assumptions about the
system’s environment and use which are incorrect



Reliability and efficiency
 As reliability increases system efficiency tends to
decrease
 To make a system more reliable, redundant code
must be included to carry out run-time checks,
etc. This tends to slow it down



Reliability and efficiency
 Reliability is usually more important than
efficiency
 No need to utilise hardware to fullest extent
as computers are cheap and fast
 Unreliable software isn't used
 Hard to improve unreliable systems
 Software failure costs often far exceed system
costs
 Costs of data loss are very high



Reliability metrics
 Hardware metrics not really suitable for
software as they are based on component
failures and the need to repair or replace a
component once it has failed. The design is
assumed to be correct
 Software failures are always design failures.
Often the system continues to be available in
spite of the fact that a failure has occurred.



Reliability metrics
 Probability of failure on demand
• This is a measure of the likelihood that the system will fail when
a service request is made
• POFOD = 0.001 means 1 out of 1000 service requests result in
failure
• Relevant for safety-critical or non-stop systems
 Rate of occurrence of failures (ROCOF)
• Frequency of occurrence of unexpected behaviour
• ROCOF of 0.02 means 2 failures are likely in each 100
operational time units
• Relevant for operating systems, transaction processing systems



Reliability metrics
 Mean time to failure
• Measure of the time between observed failures
• MTTF of 500 means that the time between failures is 500 time units
• Relevant for systems with long transactions e.g. CAD systems
 Availability
• Measure of how likely the system is available for use. Takes
repair/restart time into account
• Availability of 0.998 means software is available for 998 out of
1000 time units
• Relevant for continuously running systems e.g. telephone switching
systems



Availability
 Measures the fraction of time system is really
available for use
 Takes repair and restart times into account
 Relevant for non-stop continuously running
systems (e.g. traffic signal)



Probability of Failure on Demand
 Probability system will fail when a service request
is made
 Useful when requests are made on an intermittent
or infrequent basis
 Appropriate for protection systems where service
requests may be rare and consequences can be
serious if the service is not delivered
 Relevant for many safety-critical systems with
exception handlers



Rate of Occurrence of Failures
 Reflects rate of failure in the system
 Useful when system has to process a large
number of similar requests that are relatively
frequent
 Relevant for operating systems and transaction
processing systems



Mean Time to Failure
 Measures time between observable system
failures
 For stable systems MTTF = 1/ROCOF
 Relevant for systems when individual
transactions take lots of processing time (e.g.
CAD or WP systems)



Failure Consequences - part 1
 Reliability does not take consequences into
account
 Transient faults have no real consequences but
other faults might cause data loss or corruption
 May be worthwhile to identify different classes of
failure, and use different metrics for each



Failure Consequences - part 2
 When specifying reliability both the number of
failures and the consequences of each matter
 Failures with serious consequences are more
damaging than those where repair and recovery is
straightforward
 In some cases, different reliability specifications
may be defined for different failure types



Examples
Failure class               Example                                 Metric

Permanent,                  ATM fails to operate with any card;     ROCOF = .0001
non-corrupting              must restart to correct                 Time unit = days

Transient,                  Magnetic stripe can't be read on        POFOD = .0001
non-corrupting              undamaged card                          Time unit = transactions



Reliability measurement
 Measure the number of system failures for a
given number of system inputs
• Used to compute POFOD
 Measure the time (or number of transactions)
between system failures
• Used to compute ROCOF and MTTF
 Measure the time to restart after failure
• Used to compute AVAIL
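As a sketch of how these measurements turn into the four metrics, with invented failure data:

```python
# Deriving POFOD, ROCOF, MTTF and availability from measured data.
# All numbers here are made up for illustration.

demands = 10_000                       # service requests observed
failures_on_demand = 10                # requests that failed
pofod = failures_on_demand / demands   # 0.001: 1 failure per 1000 demands

inter_failure_times = [120, 80, 200, 100]   # time units between failures
total_time = sum(inter_failure_times)       # 500 time units of operation
rocof = len(inter_failure_times) / total_time   # failures per time unit
mttf = total_time / len(inter_failure_times)    # mean time to failure

repair_times = [1, 2, 1, 1]                 # restart time after each failure
avail = total_time / (total_time + sum(repair_times))   # fraction of time usable

print(pofod, rocof, mttf, round(avail, 3))
```

Note that for this stable data set MTTF comes out as exactly 1/ROCOF, matching the relationship stated on the MTTF slide.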



Time units
 Time units in reliability measurement must be
carefully selected. Not the same for all systems
 Raw execution time (for non-stop systems)
 Calendar time (for systems which have a
regular usage pattern e.g. systems which are
always run once per day)
 Number of transactions (for systems which are
used on demand)



Reliability specification
 Reliability requirements are only rarely
expressed in a quantitative, verifiable way.
 To verify reliability metrics, an operational
profile must be specified as part of the test
plan.
 Reliability is dynamic - reliability specifications
related to the source code are meaningless.
• No more than N faults/1000 lines.
• This is only useful for a post-delivery process analysis.



Failure classification

Failure class Description


Transient Occurs only with certain inputs
Permanent Occurs with all inputs
Recoverable System can recover without operator intervention
Unrecoverable Operator intervention needed to recover from failure
Non-corrupting Failure does not corrupt system state or data
Corrupting Failure corrupts system state or data



Steps to a reliability specification
 For each sub-system, analyse the
consequences of possible system failures.
 From the system failure analysis, partition
failures into appropriate classes.
 For each failure class identified, set out the
reliability using an appropriate metric. Different
metrics may be used for different reliability
requirements.



Bank auto-teller system
 Each machine in a network is used 300 times a
day
 Bank has 1000 machines
 Lifetime of software release is 2 years
 Each machine handles about 200,000 transactions
over the software's lifetime
 About 300,000 database transactions in total per
day



Examples of a reliability spec.

Failure class       Example                                     Reliability metric

Permanent,          The system fails to operate with any        ROCOF
non-corrupting      card which is input. Software must be       1 occurrence/1000 days
                    restarted to correct failure.

Transient,          The magnetic stripe data cannot be read     POFOD
non-corrupting      on an undamaged card which is input.        1 in 1000 transactions

Transient,          A pattern of transactions across the        Unquantifiable! Should never
corrupting          network causes database corruption.         happen in the lifetime of
                                                                the system



Specification validation
 It is impossible to empirically validate very high
reliability specifications
 No database corruptions means POFOD of less
than 1 in 200 million
 If a transaction takes 1 second, then simulating
one day’s transactions takes 3.5 days
 It would take longer than the system’s lifetime to
test it for reliability
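The arithmetic behind this argument can be checked directly, using the figures from the auto-teller example (1000 machines, 300 uses per day, a 2-year release lifetime, 1 second per transaction):

```python
# Checking the numbers in the specification-validation argument.

transactions_per_day = 300 * 1000      # 300,000 database transactions per day
lifetime_days = 2 * 365
lifetime_transactions = transactions_per_day * lifetime_days
print(lifetime_transactions)           # 219,000,000 - roughly "1 in 200 million"

seconds_to_simulate_one_day = transactions_per_day * 1   # 1 second each
print(seconds_to_simulate_one_day / (24 * 3600))         # ~3.47 days of testing
```

So replaying just one day of real traffic takes about three and a half days of test time, which is why a "no corruptions ever" requirement cannot be validated by testing.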



Reliability economics
 Because of very high costs of reliability
achievement, it may be more cost effective to
accept unreliability and pay for failure costs
 However, this depends on social and political
factors. A reputation for unreliable products may
lose future business
 Depends on system type - for business systems in
particular, modest reliability may be adequate



Costs of increasing reliability
[Figure: cost rises increasingly steeply as required reliability moves from Low through Medium, High and Very high to Ultra-high]

Topics covered
 Definition of reliability
 Reliability and efficiency
 Reliability metrics
 Reliability specification
 Statistical testing and operational profiles
 Reliability growth modelling
 Reliability prediction



Statistical testing
 Testing software for reliability rather than fault
detection
 Test data selection should follow the predicted
usage profile for the software
 Measuring the number of errors allows the
reliability of the software to be predicted
 An acceptable level of reliability should be
specified and the software tested and amended
until that level of reliability is reached



Statistical testing procedure
 Determine operational profile of the software
 Generate a set of test data corresponding to
this profile
 Apply tests, measuring amount of execution
time between each failure
 After a statistically valid number of tests have
been executed, reliability can be measured
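The procedure above can be sketched in a few lines of Python. The operational profile, the input-class names and the system under test are all invented stand-ins, and real statistical testing would run actual test cases rather than a simulated failure probability:

```python
# Sketch of statistical testing: draw inputs according to the operational
# profile, run the system, and record the time between observed failures.
import random

operational_profile = {"withdraw": 0.6, "balance": 0.3, "deposit": 0.1}

def run_system(input_class):
    """Stand-in for executing the software under test; returns True on
    failure. Here we just pretend a 0.1% failure probability."""
    return random.random() < 0.001

def statistical_test(n_tests, seed=42):
    random.seed(seed)                      # deterministic for the example
    classes = list(operational_profile)
    weights = list(operational_profile.values())
    inter_failure_times, since_last_failure = [], 0
    for _ in range(n_tests):
        input_class = random.choices(classes, weights=weights)[0]
        since_last_failure += 1
        if run_system(input_class):
            inter_failure_times.append(since_last_failure)
            since_last_failure = 0
    return inter_failure_times

times = statistical_test(100_000)
print(len(times), sum(times) / max(len(times), 1))  # failures seen, mean time between them
```

The recorded inter-failure times are exactly the raw data that the reliability metrics and growth models later in the chapter consume.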



Statistical testing difficulties
 Uncertainty in the operational profile
• This is a particular problem for new systems with no
operational history. Less of a problem for replacement systems
 High costs of generating the operational profile
• Costs are very dependent on what usage information is
collected by the organisation which requires the profile
 Statistical uncertainty when high reliability is
specified
• Difficult to estimate level of confidence in operational profile
• Usage pattern of software may change with time



An operational profile
[Figure: number of inputs plotted against input classes]



Operational profile generation
 Should be generated automatically whenever
possible
 Automatic profile generation is difficult for
interactive systems
 May be straightforward for ‘normal’ inputs but it
is difficult to predict ‘unlikely’ inputs and to
create test data for them
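Where usage information is collected, deriving the profile itself is simple frequency counting. A minimal sketch, with an invented usage log and class names:

```python
# Building an operational profile from a (hypothetical) usage log:
# the profile is just the relative frequency of each input class.
from collections import Counter

usage_log = ["withdraw", "withdraw", "balance", "withdraw", "deposit", "balance"]
counts = Counter(usage_log)
profile = {cls: n / len(usage_log) for cls, n in counts.items()}
print(profile)   # e.g. withdraw used in half of all recorded interactions
```

The hard part, as the slide notes, is not this computation but obtaining a representative log and covering the 'unlikely' inputs that rarely appear in it.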



Topics covered
 Definition of reliability
 Reliability and efficiency
 Reliability metrics
 Reliability specification
 Statistical testing and operational profiles
 Reliability growth modelling
 Reliability prediction



Reliability growth modelling
 Growth model is a mathematical model of the
system reliability change as it is tested and faults
are removed
 Used as a means of reliability prediction by
extrapolating from current data
 Depends on the use of statistical testing to
measure the reliability of a system version



Equal-step reliability growth

[Figure: reliability (ROCOF) improves by the same amount after each fault repair at times t1 to t5]
Observed reliability growth
 Simple equal-step model but does not reflect
reality
 Reliability does not necessarily increase with
change as the change can introduce new faults
 The rate of reliability growth tends to slow down
with time as frequently occurring faults are
discovered and removed from the software
 A random-growth model may be more accurate



Random-step reliability growth
[Figure: reliability (ROCOF) improves by a different amount at each repair time t1 to t5; one fault repair adds a new fault and decreases reliability (increases ROCOF)]



Growth models choice
 Many different reliability growth models have
been proposed
 No universally applicable growth model
 Reliability should be measured and observed data
should be fitted to several models
 Best-fit model should be used for reliability
prediction
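The fit-several-models approach can be sketched as a least-squares comparison. The ROCOF data, the two candidate models and their coefficients below are all invented for illustration; real work would use established growth models and proper parameter estimation:

```python
# Fit two toy growth models to observed ROCOF values and keep the
# better-fitting one (smaller sum of squared errors).

rocof = [0.050, 0.039, 0.030, 0.024, 0.019]   # measured after each test period
t = list(range(1, len(rocof) + 1))

def sse(model):
    """Sum of squared errors of a model against the observations."""
    return sum((model(ti) - ri) ** 2 for ti, ri in zip(t, rocof))

linear = lambda ti: 0.0555 - 0.0077 * ti      # equal-step (linear) decay
exponential = lambda ti: 0.064 * 0.78 ** ti   # exponential decay

best = min([("linear", linear), ("exponential", exponential)],
           key=lambda m: sse(m[1]))
print("best-fit model:", best[0])
```

Whichever model fits best is then extrapolated forward to predict when the required ROCOF will be reached, as the prediction figure later in the chapter shows.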



Topics covered
 Definition of reliability
 Reliability and efficiency
 Reliability metrics
 Reliability specification
 Statistical testing and operational profiles
 Reliability growth modelling
 Reliability prediction



Reliability prediction
[Figure: measured reliability points with a fitted reliability model curve, extrapolated to the required reliability level to give the estimated time of reliability achievement]
Key points
 Reliability is usually the most important dynamic
software characteristic
 Professionals should aim to produce reliable
software
 Reliability depends on the pattern of usage of the
software. Faulty software can be reliable
 Reliability requirements should be defined
quantitatively whenever possible



Key points
 There are many different reliability metrics. The
metric chosen should reflect the type of system
and the application domain
 Statistical testing is used for reliability
assessment. Depends on using a test data set
which reflects the use of the software
 Reliability growth models may be used to predict
when a required level of reliability will be
achieved

