Availability and Reliability
Availability and reliability, 2013
Slide 1
Principal dependability
properties
Reliability
The probability of failure-free
system operation over a specified
time in a given environment for a
given purpose
Availability
The probability that a system, at a
point in time, will be operational and
able to deliver the requested services
Availability specification
Both reliability and availability
attributes can be expressed as
numbers:
Availability of 0.999 means that the
system is up and running for 99.9% of
the time.
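To make the 0.999 figure concrete, the short sketch below converts an availability value into the downtime it permits. The function name `allowed_downtime_hours` and the choice of a 365-day year are illustrative assumptions, not part of the original slides.

```python
def allowed_downtime_hours(availability: float,
                           period_hours: float = 365 * 24) -> float:
    """Hours of downtime permitted over the period for a given availability.

    Assumes a 365-day year (8760 hours) unless another period is given.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


downtime = allowed_downtime_hours(0.999)
print(f"{downtime:.2f} hours per year")  # prints "8.76 hours per year"
```

Under these assumptions, an availability of 0.999 leaves room for a little under nine hours of downtime in a year.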
Reliability specification
Probability of failure on demand
(POFOD) of 0.0001 means that on
average 1 in 10,000 demands for
service from a system will fail in
some way
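The POFOD figure above can be turned into rough expectations with a few lines of arithmetic. This is a minimal sketch that assumes every demand fails independently with the same probability, which real systems may not satisfy; the function names are illustrative.

```python
def expected_failures(pofod: float, demands: int) -> float:
    """Average number of failed demands, given POFOD and a demand count."""
    return pofod * demands


def prob_at_least_one_failure(pofod: float, demands: int) -> float:
    """Chance of seeing one or more failures, assuming independent demands."""
    return 1.0 - (1.0 - pofod) ** demands


# With POFOD = 0.0001, 10,000 demands produce one failure on average,
# yet there is still a sizeable chance of seeing no failure at all.
print(expected_failures(0.0001, 10_000))
print(round(prob_at_least_one_failure(0.0001, 10_000), 3))
```

Note that "1 in 10,000 on average" does not mean a failure is certain within 10,000 demands: under the independence assumption the chance of at least one failure is roughly 63%.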
Availability and reliability
Availability and reliability are closely
related
Obviously if a system is unavailable it is
not delivering the specified system
services.
However, it is possible to have
systems with low reliability that
must be available.
So long as system failures can be
repaired quickly and do not damage
data, some system failures may not
be a problem.
Availability is therefore best
considered as a separate attribute
reflecting whether or not the
system can deliver its services.
Availability takes repair time into
account, if the system has to be
taken out of service to repair
faults.
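One common way to see how repair time enters availability is the steady-state formula availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to repair. The formula and the example figures below are not from the slides; they are a sketch under the assumption that failures and repairs alternate with stable averages.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system A: fails often (every 100 h) but restarts in 6 minutes.
frequent_but_quick = steady_state_availability(100, 0.1)
# Hypothetical system B: fails rarely (every 1000 h) but takes 10 h to repair.
rare_but_slow = steady_state_availability(1000, 10)

print(f"{frequent_but_quick:.4f}")  # prints "0.9990"
print(f"{rare_but_slow:.4f}")       # prints "0.9901"
```

In this illustration the less reliable system is the more available one, because its failures are repaired quickly.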
Availability perception
Availability is usually expressed as
a percentage of the time that the
system is available to deliver
services e.g. 99.9%.
Subjective availability
The number of users affected by
the service outage.
Loss of service in the middle of the
night is less important for many
systems than loss of service during
peak usage periods.
The length of the outage.
The longer the outage, the more the
disruption. Several short outages are
less likely to be disruptive than one long
outage. Long repair times are a
particular problem.
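The sketch below illustrates why a plain uptime percentage can hide the first of these factors. It compares two hypothetical outages of equal length using "user-minutes lost" (duration multiplied by users affected), a measure introduced here purely for illustration. It does not model the second factor, the extra disruption caused by long outages.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    duration_minutes: float
    users_affected: int


def uptime_percentage(outages: Iterable[Outage], period_minutes: float) -> float:
    """Conventional availability: share of the period with no outage."""
    downtime = sum(o.duration_minutes for o in outages)
    return 100.0 * (1.0 - downtime / period_minutes)


def user_minutes_lost(outages: Iterable[Outage]) -> float:
    """Illustrative impact measure: outage length weighted by users affected."""
    return sum(o.duration_minutes * o.users_affected for o in outages)


MONTH_MINUTES = 30 * 24 * 60  # a 30-day month, assumed for the example

night_outage = [Outage(duration_minutes=30, users_affected=10)]
peak_outage = [Outage(duration_minutes=30, users_affected=5000)]

# Both outages give the same uptime percentage...
print(uptime_percentage(night_outage, MONTH_MINUTES) ==
      uptime_percentage(peak_outage, MONTH_MINUTES))  # prints "True"
# ...but the peak-time outage costs far more user-minutes.
print(user_minutes_lost(night_outage), user_minutes_lost(peak_outage))
```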
Reliability metrics
Probability of failure on demand
(POFOD)
Probability that a system will not
deliver a service correctly when
requested
Used for systems where demands are
infrequent and intermittent
Rate of occurrence of failure
(ROCOF)
Number of system failures in a given
time period
Used for transaction processing
systems with frequent and regular
transactions
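Both metrics can be estimated from simple operational counts. The sketch below assumes such counts are available (failed demands, total demands, failures, operating hours); the function names and the sample figures are hypothetical. Treating mean time to failure as the reciprocal of ROCOF is a further assumption that holds when the failure rate is roughly constant.

```python
def estimate_pofod(failed_demands: int, total_demands: int) -> float:
    """Observed fraction of demands that were not served correctly."""
    if total_demands <= 0:
        raise ValueError("total_demands must be positive")
    return failed_demands / total_demands


def estimate_rocof(failures: int, operating_hours: float) -> float:
    """Observed failures per hour of operation."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


def mttf_from_rocof(rocof_per_hour: float) -> float:
    """Mean time to failure in hours, assuming a constant failure rate."""
    return 1.0 / rocof_per_hour


# Hypothetical observations, for illustration only.
pofod = estimate_pofod(failed_demands=3, total_demands=30_000)
rocof = estimate_rocof(failures=2, operating_hours=1000)
print(pofod, rocof, mttf_from_rocof(rocof))
```

POFOD suits systems that are called on rarely, where each demand matters; ROCOF suits systems that run continuously, where the failure rate over time is the natural measure.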
Fault
A characteristic of a software system that can lead to a
system error.
Error
An erroneous system state that can lead to system behavior
that is unexpected by system users.
Failure
An event that occurs at some point in time when the system
does not deliver a service as expected by its users.
Faults-errors-failures
Fault -> Error -> Failure
Faults and failures
Failures are usually a result of
system errors.
The incorrect state causes
undesirable system behaviour
Incorrect state is a consequence of
executing faulty code
However, faults do not necessarily
result in system errors
The erroneous system state resulting
from the fault may be transient and
corrected before an error arises.
The faulty code may never be
executed.
Errors do not necessarily lead to
system failures
The error can be corrected by built-in
error detection and recovery
The failure can be protected against
by built-in protection facilities. These
may, for example, protect system
resources from system errors
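The toy example below walks through the fault, error and failure chain in code. The `Account` class and its deliberately missing check are invented for illustration. The fault stays dormant for ordinary inputs, and when it does produce an erroneous state, a validity check detects it and a rollback restores the previous state, so no failure is visible to the user.

```python
class Account:
    """Toy account whose withdraw() contains a deliberate fault."""

    def __init__(self, balance: float) -> None:
        self.balance = balance

    def withdraw(self, amount: float) -> None:
        # FAULT: there is no check that amount <= balance.
        self.balance -= amount

    def state_is_valid(self) -> bool:
        # A negative balance is the erroneous state (the error).
        return self.balance >= 0


def withdraw_with_recovery(account: Account, amount: float) -> bool:
    """Run the faulty operation, detect an erroneous state, and roll back."""
    checkpoint = account.balance          # save state before the operation
    account.withdraw(amount)
    if not account.state_is_valid():      # error detection
        account.balance = checkpoint      # recovery: restore the saved state
        return False                      # request refused, state intact
    return True


acct = Account(100)
print(withdraw_with_recovery(acct, 30), acct.balance)   # prints "True 70"
print(withdraw_with_recovery(acct, 500), acct.balance)  # prints "False 70"
```

The first call never exercises the fault, so no error arises. The second call triggers it, but the error is corrected before it can become a failure such as an overdrawn account being reported to the user.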
Reliability achievement
Fault avoidance
Development techniques are used
that either minimise the
possibility of mistakes or trap
mistakes before they result in the
introduction of system faults.
Fault detection and removal
Verification and validation
techniques that increase the
probability of detecting and
correcting errors before the
system goes into service are
used.
Fault tolerance
Run-time techniques are used to
ensure that system faults do not
result in system errors and/or
that system errors do not lead to
system failures.
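One possible run-time technique is to keep an alternative implementation in reserve and check each result before accepting it. The sketch below loosely follows that idea (often described as a recovery block); it is not taken from the slides, and every name in it is hypothetical. The primary variant contains a deliberate fault, and the acceptance test is only a plausibility check that will not catch every wrong answer.

```python
from typing import Callable, Sequence

Variant = Callable[[Sequence[float]], float]


def tolerant_call(variants: Sequence[Variant],
                  acceptable: Callable[[Sequence[float], float], bool],
                  values: Sequence[float]) -> float:
    """Try each variant in turn; return the first result that passes the check."""
    for variant in variants:
        try:
            result = variant(values)
        except Exception:
            continue  # this variant crashed; fall through to the next one
        if acceptable(values, result):
            return result
    raise RuntimeError("all variants failed the acceptance test")


def fast_mean(values: Sequence[float]) -> float:
    # FAULT (deliberate): divides by len - 1 instead of len.
    return sum(values) / (len(values) - 1)


def simple_mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def mean_in_range(values: Sequence[float], result: float) -> bool:
    # Plausibility check: a mean must lie between the smallest and largest value.
    return min(values) <= result <= max(values)


variants = [fast_mean, simple_mean]
print(tolerant_call(variants, mean_in_range, [2.0, 4.0]))  # prints "3.0"
print(tolerant_call(variants, mean_in_range, [5.0]))       # prints "5.0"
```

In the first call the faulty variant returns an implausible value that the acceptance test rejects; in the second it raises an exception. In both cases the backup variant supplies the result, so the fault does not become a failure.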
Summary
Availability is the probability that a
system will be available when a
service request is made
Reliability is the probability that a
system will deliver a service as
expected by users
Summary
Software faults can lead to state errors,
which can lead to operational failures
Fault avoidance, detection and
tolerance are strategies for
achieving reliability