Software Reliability: CIS 376 Bruce R. Maxim UM-Dearborn

Software Reliability
CIS 376
Bruce R. Maxim
UM-Dearborn
Functional and Non-functional Requirements
System functional requirements may
specify error checking, recovery features,
and system failure protection
System reliability and availability are
specified as part of the non-functional
requirements for the system.
System Reliability Specification

Hardware reliability
probability a hardware component fails
Software reliability
probability a software component will produce an
incorrect output
software does not wear out
software can continue to operate after a bad result
Operator reliability
probability system user makes an error
Failure Probabilities
If a system depends on two independent components,
A and B, and both must work for the system to work,
then the probability of system failure is
P(S) = P(A) + P(B) - P(A)P(B)
which is approximately P(A) + P(B) when both
probabilities are small
If a component is replicated n times, the system
fails only when all n replicas fail at once, so
P(S) = P(A)^n
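A minimal sketch of the two cases, assuming independent failures. The sum P(A) + P(B) is the usual small-probability approximation; the exact value for independent components is 1 - (1 - P(A))(1 - P(B)). The function names and the sample probabilities are illustrative, not part of the course material.

```python
from math import prod


def series_failure_probability(failure_probs):
    """Failure probability when every independent component must work."""
    return 1.0 - prod(1.0 - p for p in failure_probs)


def replicated_failure_probability(p, n):
    """Failure probability when all n independent replicas must fail."""
    return p ** n


# Hypothetical component failure probabilities.
p_a, p_b = 0.01, 0.02

exact = series_failure_probability([p_a, p_b])
approx = p_a + p_b
replicated = replicated_failure_probability(p_a, 3)

print(f"series: exact={exact:.6f}, approximation={approx:.6f}")
print(f"three replicas of A: {replicated:.2e}")
```

The approximation slightly overstates the exact value, and the gap grows as the component failure probabilities grow.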
Functional Reliability Requirements

The system will check all operator
inputs to see that they fall within their
required ranges.
The system will check all disks for bad
blocks each time it is booted.
The system must be implemented using a
standard implementation of Ada.
Non-functional Reliability Specification
The required level of reliability must be
expressed quantitatively.
Reliability is a dynamic system attribute.
Source code reliability specifications are
meaningless (e.g. N faults/1000 LOC)
An appropriate metric should be chosen to
specify the overall system reliability.
Hardware Reliability Metrics

Hardware metrics are not suitable for
software since they are based on the
notion of component failure
Software failures are often design failures
Often the system is available after the
failure has occurred
Hardware components can wear out
Software Reliability Metrics

Reliability metrics are units of measure for system
reliability
System reliability is measured by counting the
number of operational failures and relating these
to demands made on the system at the time of
failure
A long-term measurement program is required to
assess the reliability of critical systems
Reliability Metrics - part 1

Probability of Failure on Demand (POFOD)
POFOD = 0.001
The service fails for one in every 1000 requests
Rate of Fault Occurrence (ROCOF)

ROCOF = 0.02
Two failures in every 100 operational time units
Reliability Metrics - part 2

Mean Time to Failure (MTTF)
average time between observed failures (aka
MTBF)
Availability = MTBF / (MTBF+MTTR)

MTBF = Mean Time Between Failure
MTTR = Mean Time to Repair
Reliability = MTBF / (1+MTBF)
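As a rough illustration of the availability formula, the sketch below computes availability and the implied yearly downtime. The MTBF and MTTR figures are hypothetical and both are assumed to be in hours.

```python
def availability(mtbf, mttr):
    """Availability = MTBF / (MTBF + MTTR), both in the same time unit."""
    if mtbf < 0 or mttr < 0 or mtbf + mttr == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf / (mtbf + mttr)


# Hypothetical figures, in hours.
hours_between_failures = 500.0
hours_to_repair = 2.0

a = availability(hours_between_failures, hours_to_repair)
downtime_hours_per_year = (1.0 - a) * 365 * 24

print(f"availability = {a:.4f}")
print(f"expected downtime = {downtime_hours_per_year:.1f} hours/year")
```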
Time Units
Raw Execution Time
non-stop system
Calendar Time
If the system has regular usage patterns
Number of Transactions
demand type transaction systems
Availability
Measures the fraction of time system is
really available for use
Takes repair and restart times into account
Relevant for non-stop continuously running
systems (e.g. traffic signal)
Probability of Failure on Demand

Probability system will fail when a service request is
made
Useful when requests are made on an intermittent or
infrequent basis
Appropriate for protection systems, where service
requests may be rare and consequences can be serious
if service is not delivered
Relevant for many safety-critical systems with
exception handlers
Rate of Fault Occurrence

Reflects rate of failure in the system
Useful when system has to process a large
number of similar requests that are
relatively frequent
Relevant for operating systems and
transaction processing systems
Mean Time to Failure

Measures time between observable system
failures
For stable systems MTTF = 1/ROCOF
Relevant for systems when individual
transactions take lots of processing time
(e.g. CAD or WP systems)
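A small sketch of the MTTF = 1/ROCOF relationship for a stable system, using the earlier example of two failures per 100 operational time units. The function names are illustrative.

```python
def rocof(failure_count, operational_time):
    """Observed failures per unit of operational time."""
    if operational_time <= 0:
        raise ValueError("operational_time must be positive")
    return failure_count / operational_time


def mttf_from_rocof(rate):
    """For a stable system, MTTF is the reciprocal of ROCOF."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate


rate = rocof(failure_count=2, operational_time=100.0)
mttf = mttf_from_rocof(rate)

print(f"ROCOF = {rate} failures per time unit")
print(f"MTTF is about {mttf:.0f} time units")
```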
Failure Consequences - part 1

Reliability does not take consequences into
account
Transient faults have no real consequences
but other faults might cause data loss or
corruption
May be worthwhile to identify different
classes of failure, and use different metrics
for each
Failure Consequences - part 2

When specifying reliability both the number of
failures and the consequences of each matter
Failures with serious consequences are more
damaging than those where repair and recovery are
straightforward
In some cases, different reliability specifications
may be defined for different failure types
Failure Classification
Transient - only occurs with certain inputs
Permanent - occurs on all inputs
Recoverable - system can recover without
operator help
Unrecoverable - operator has to help
Non-corrupting - failure does not corrupt system
state or data
Corrupting - system state or data are altered
Building Reliability Specification

For each sub-system analyze consequences
of possible system failures
From system failure analysis partition
failures into appropriate classes
For each class set out the appropriate
reliability metric
Examples
Failure Class              | Example                                | Metric
Permanent, Non-corrupting  | ATM fails to operate with any card,    | ROCOF = .0001
                           | must restart to correct                | Time unit = days
Transient, Non-corrupting  | Magnetic stripe can't be read on       | POFOD = .0001
                           | undamaged card                         | Time unit = transactions
Specification Validation
It is impossible to empirically validate very high
reliability specifications
No database corruption really means
POFOD < 1 in 200 million
If each transaction takes 1 second to verify,
simulation of one day's transactions takes
3.5 days
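The arithmetic behind these figures can be worked through as below. The daily transaction volume is not stated on the slide; it is inferred from the 3.5-day figure at one second per transaction. The assumption that roughly 200 million failure-free transactions are the minimum needed to support the POFOD claim is also mine, and a real statistical argument would need more.

```python
SECONDS_PER_DAY = 24 * 60 * 60

seconds_per_transaction = 1.0
simulation_days_per_real_day = 3.5

# Daily volume implied by "one day's transactions take 3.5 days to simulate".
transactions_per_day = (
    simulation_days_per_real_day * SECONDS_PER_DAY / seconds_per_transaction
)

# Assumed minimum number of failure-free transactions to observe.
target_transactions = 200_000_000
test_days = target_transactions * seconds_per_transaction / SECONDS_PER_DAY
test_years = test_days / 365

print(f"implied transactions per day: {transactions_per_day:,.0f}")
print(f"sequential test time: {test_days:,.0f} days (about {test_years:.1f} years)")
```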
Statistical Reliability Testing

Test data needs to follow typical
software usage patterns
Measuring numbers of errors needs to be
based on errors of omission (failing to do
the right thing) and errors of commission
(doing the wrong thing)
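One way to make test data follow usage patterns is to draw test inputs from an operational profile and count the failures. The sketch below assumes a hypothetical three-class profile and a toy stand-in for the system under test; neither comes from the slides.

```python
import random

# Hypothetical operational profile: input class -> share of real usage.
OPERATIONAL_PROFILE = {"withdraw": 0.70, "balance": 0.25, "transfer": 0.05}


def draw_test_classes(profile, n, seed=0):
    """Sample n input classes with frequencies matching the profile."""
    rng = random.Random(seed)
    classes = list(profile)
    weights = [profile[c] for c in classes]
    return rng.choices(classes, weights=weights, k=n)


def estimate_pofod(test_classes, system_under_test):
    """Fraction of sampled requests on which the system fails."""
    failures = sum(1 for c in test_classes if not system_under_test(c))
    return failures / len(test_classes)


def toy_system(input_class):
    """Stand-in system: returns True on success, fails every transfer."""
    return input_class != "transfer"


sample = draw_test_classes(OPERATIONAL_PROFILE, 10_000)
pofod = estimate_pofod(sample, toy_system)
print(f"estimated POFOD = {pofod:.4f}")
```

Because the toy system fails on exactly the class that makes up 5% of usage, the estimate should land near 0.05. With a uniform sample over the three classes it would be closer to one third, which is why the profile matters.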
Difficulties with Statistical Reliability Testing
Uncertainty when creating the operational
profile
High cost of generating the operational
profile
Statistical uncertainty problems when high
reliabilities are specified
Safety Specification
Each safety specification should be specified
separately
These requirements should be based on hazard and
risk analysis
Safety requirements usually apply to the system as
a whole rather than individual components
System safety is an emergent system property
Safety Life Cycle - part 1

Concept and scope definition
Hazard and risk analysis
Safety requirements specification
safety requirements derivation
safety requirements allocation
Planning and development
safety related systems development
external risk reduction facilities
Safety Life Cycle - part 2

Deployment
safety validation
installation and commissioning
Operation and maintenance

System decommissioning
Safety Processes
Hazard and risk analysis
assess the hazards and risks associated with the system
Safety requirements specification

specify system safety requirements
Designation of safety-critical systems

identify sub-systems whose incorrect operation can
compromise entire system safety
Safety validation
check overall system safety
Hazard Analysis Stages

Hazard identification
identify potential hazards that may arise
Risk analysis and hazard classification

assess risk associated with each hazard
Hazard decomposition
seek to discover potential root causes for each hazard
Risk reduction assessment

describe how each hazard is to be taken into account
when system is designed
Fault-tree Analysis
Hazard analysis method that starts with an
identified fault and works backwards to the
cause of the fault
Can be used at all stages of hazard analysis
It is a top-down technique that may be
combined with bottom-up hazard analysis
techniques, which start with system failures
that lead to hazards
Fault-tree Analysis Steps

Identify hazard
Identify potential causes of hazards
Link combinations of alternative causes
using OR or AND symbols as appropriate
Continue process until root causes are
identified (result will be an AND/OR tree or a
logic circuit in which the root causes are the leaves)
How does it work?

What would a fault tree look like for a
hazard such as data deleted?
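One possible answer, sketched as an AND/OR tree with the root causes as leaves. The specific causes listed here are hypothetical examples chosen for illustration, not the tree from the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: list = field(default_factory=list)


def occurs(node, active_causes):
    """True if the event at this node occurs given the active root causes."""
    if node.gate == "LEAF":
        return node.name in active_causes
    results = [occurs(child, active_causes) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


def root_causes(node):
    """Collect the leaves of the tree, i.e. the root causes."""
    if node.gate == "LEAF":
        return [node.name]
    return [name for child in node.children for name in root_causes(child)]


# Hypothetical fault tree for the hazard "data deleted".
tree = Node("data deleted", "OR", [
    Node("operator error", "AND", [
        Node("wrong delete command issued"),
        Node("no confirmation prompt"),
    ]),
    Node("storage failure", "AND", [
        Node("disk failure"),
        Node("backup missing"),
    ]),
    Node("software fault in cleanup routine"),
])

print(root_causes(tree))
print(occurs(tree, {"disk failure"}))                    # AND gate not satisfied
print(occurs(tree, {"disk failure", "backup missing"}))  # AND gate satisfied
```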
Risk Assessment
Assess the hazard severity, hazard probability, and
accident probability
Outcome of risk assessment is a statement of
acceptability
Intolerable (must never occur)
ALARP (as low as reasonably practicable given cost and
schedule constraints)
Acceptable (consequences are acceptable and no extra
cost should be incurred to reduce it further)
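A rough sketch of how an assessment might be mapped onto the three acceptability classes. The ordinal scales, the severity-times-probability score, and the thresholds are all assumptions made for illustration. A real assessment would also consider accident probability and would set the boundaries by judgment, not by a fixed formula.

```python
# Hypothetical ordinal scales.
SEVERITY = {"minor": 1, "serious": 2, "catastrophic": 3}
PROBABILITY = {"improbable": 1, "occasional": 2, "frequent": 3}


def classify_risk(severity, probability, intolerable_at=6, acceptable_at=2):
    """Map a hazard onto intolerable / ALARP / acceptable (assumed thresholds)."""
    score = SEVERITY[severity] * PROBABILITY[probability]
    if score >= intolerable_at:
        return "intolerable"
    if score <= acceptable_at:
        return "acceptable"
    return "ALARP"


for hazard in [("catastrophic", "occasional"),
               ("serious", "occasional"),
               ("minor", "occasional")]:
    print(hazard, "->", classify_risk(*hazard))
```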
Risk Acceptability
Determined by human, social, and political
considerations
In most societies, the boundaries between
regions are pushed upwards with time
(meaning risk becomes less acceptable)
Risk assessment is always subjective (what is
acceptable to one person is ALARP to
another)
Risk Reduction
System should be specified so that hazards do not
arise or result in an accident
Hazard avoidance
system designed so hazard can never arise during normal
operation
Hazard detection and removal

system designed so that hazards are detected and
neutralized before an accident can occur
Damage limitation
system designed to minimize accident consequences
Security Specification
Similar to safety specification
not possible to specify quantitatively
usually stated in "system shall not" terms rather
than "system shall" terms
Differences
no well-defined security life cycle yet
security deals with generic threats rather than
system specific hazards
Security Specification Stages - part 1

Asset identification and evaluation
data and programs are identified along with the level of
protection each requires
degree of protection depends on asset value
Threat analysis and risk assessment

security threats are identified and the risks associated
with each are estimated
Threat assignment
identified threats are related to assets so that each asset
has a list of associated threats
Security Specification Stages - part 2

Technology analysis
available security technologies are assessed for their
applicability against the threats
Security requirements specification

where appropriate these will identify the security
technologies that may be used to protect against different
threats to the system
