
CHAPTER 3: CRITICAL SYSTEMS

Software failures are relatively common. In most cases, these failures cause
inconvenience but no serious, long-term damage. However, in some systems failure can
result in significant economic losses, physical damage or threats to human life. These
systems are called critical systems. Critical systems are technical or socio-technical
systems that people or businesses depend on. If these systems fail to deliver their services
as expected then serious problems and significant losses may result.
There are three main types of critical systems:

1. Safety-critical systems: A system whose failure may result in injury, loss of life or
serious environmental damage. An example is a control system for a chemical
manufacturing plant.
2. Mission-critical systems: A system whose failure may result in failure of some
goal-directed activity. An example is the navigational system for a spacecraft.
3. Business-critical systems: A system whose failure may result in very high costs
for the business using that system. An example is the customer accounting system
in a bank.

The most important emergent property of a critical system is its dependability. There are
several reasons for this:

1. Systems that are unreliable, unsafe or insecure are often rejected by their users: If
users don’t trust a system, they will refuse to use it.
2. System failure cost may be enormous: For some applications, such as an aircraft
navigation system, the cost of system failure is orders of magnitude greater than
the cost of the control system.
3. Untrustworthy systems may cause information loss: Data is very expensive to
collect and maintain; it may sometimes be worth more than the computer system
on which it is processed. A great deal of effort and money may have to be spent
duplicating valuable data to guard against data corruption.

The high cost of critical systems failure means that trusted methods and techniques must be
used for development. Consequently, critical systems are usually developed using well-
tried techniques rather than newer techniques that have not been subject to extensive
practical experience. Critical systems developers prefer to use older techniques whose
strengths and weaknesses are understood rather than new techniques which may appear to
be better but whose long-term problems are unknown.
Although a small number of control systems may be completely automatic, most critical
systems are socio-technical systems where people monitor and control the operation of
computer-based systems. While system operators can help recover from problems, they
can also cause problems if they make mistakes. There are three ‘system components’
where critical systems failures may occur:
1. System hardware may fail because of mistakes in its design, because components
fail as a result of manufacturing errors, or because the components have reached
the end of their natural life.
2. System software may fail because of mistakes in its specification, design or
implementation.
3. Human operators of the system may fail to operate the system correctly. As
hardware and software have become more reliable, failures in operation are now
probably the largest single cause of system failures.

It is particularly important that designers of critical systems take a holistic, systems
perspective rather than focus on a single aspect of the system. If the hardware, software
and operational processes are designed separately without taking the potential
weaknesses of other parts of the system into account, then it is more likely that errors will
occur at interfaces between the various parts of the system.

3.1 A SIMPLE SAFETY-CRITICAL SYSTEM

The critical system being considered here is a medical system that simulates the operation
of the pancreas. The system chosen is intended to help people who suffer from diabetes.
Diabetes is a relatively common condition where the human pancreas is unable to
produce sufficient quantities of a hormone called insulin, which is needed to metabolise glucose in
the blood. The conventional treatment of diabetes involves regular injections of
genetically engineered insulin. Diabetics measure their blood sugar levels using an
external meter and then calculate the dose of insulin that they should inject.
The problem with this treatment is that the amount of insulin required does not depend only
on the current blood glucose level but also on the time since the last insulin injection was
taken. This can lead to very low levels of blood glucose (if there is too much insulin) or
very high levels of blood glucose (if there is too little insulin).
Advances in miniaturised sensors mean that it is now possible to develop automated insulin
delivery systems. These systems monitor blood sugar levels and deliver an appropriate dose
of insulin when required. Insulin delivery systems like this already exist for the treatment
of hospital patients.
The following figure shows the components and organisation of the insulin pump:
The following figure is a data-flow model that illustrates how an input blood sugar level
is transformed to a sequence of pump control commands:

There are two high-level dependability requirements for this insulin pump system:

1. The system shall be available to deliver insulin when required.
2. The system shall perform reliably and deliver the correct amount of insulin to
counteract the current level of blood sugar.

Failure of the system could cause excessive doses of insulin to be delivered and this
could threaten the life of the user.
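
As a rough illustration of the data-flow described above, the control software can be pictured
as a loop that reads the sensor, computes a dose from the current and previous readings, and
commands the pump, never exceeding a safe single dose. This is only a sketch: the sensor and
pump interfaces (read_sensor, pump_deliver), the thresholds and the dose rule are assumptions
made for illustration, not part of the pump's actual specification.

# Illustrative sketch of the insulin pump control loop (Python).
# Interfaces, thresholds and the dose rule are assumptions, not the real specification.

SAFE_MAX_SINGLE_DOSE = 4        # hypothetical upper limit on one dose

def compute_dose(current_level, previous_level):
    """Simplified dose rule: inject only if blood sugar is high and rising."""
    if current_level <= previous_level:     # stable or falling: do nothing
        return 0
    if current_level < 7:                   # hypothetical safe band
        return 0
    rise = current_level - previous_level
    return min(round(rise), SAFE_MAX_SINGLE_DOSE)   # respect the safety limit

def control_loop(read_sensor, pump_deliver):
    """Transform a stream of blood sugar readings into pump control commands."""
    previous = read_sensor()
    while True:                             # availability: keep running, respond when required
        current = read_sensor()
        dose = compute_dose(current, previous)
        if dose > 0:
            pump_deliver(dose)              # reliability: deliver the computed amount
        previous = current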

3.2 SYSTEM DEPENDABILITY

The dependability of a computer system is a property of the system that equates to its
trustworthiness. Trustworthiness essentially means the degree of user confidence that the
system will operate as they expect and that the system will not fail in normal use.

There are four principal dimensions to dependability:

1. Availability: The availability of a system is the probability that it will be up and
running and able to deliver useful services at any given time.
2. Reliability: The reliability of a system is the probability, over a given period of
time, that the system will correctly deliver services as expected by the user.
3. Safety: The safety of a system is a judgement of how likely it is that the system
will cause damage to people or its environment.
4. Security: The security of a system is a judgement of how likely it is that the
system can resist accidental or deliberate intrusions.

These are complex properties that can be decomposed into a number of other, simpler
properties. For example, security includes integrity (ensuring that the system's programs
and data are not damaged) and confidentiality (ensuring that information can only be
accessed by people who are authorised). Reliability includes correctness (ensuring the
system services are as specified), precision (ensuring information is delivered at an
appropriate level of detail) and timeliness (ensuring that information is delivered when it
is required).

As well as these four main dimensions, other system properties can also be considered
under dependability:

1. Repairability: System failures are inevitable, but the disruption caused by failure
can be minimised if the system can be repaired quickly. For this, it must be
possible to diagnose the problem, access the component that has failed and make
changes to fix that component.
2. Maintainability: As systems are used, new requirements emerge. It is important to
maintain the usefulness of a system by changing it to accommodate these new
requirements.
3. Survivability: Survivability is the ability of a system to continue to deliver service
whilst it is under attack and, potentially, while part of the system is disabled.
Work on survivability focuses on identifying key system components and
ensuring that they can deliver a minimal service.
4. Error tolerance: This property reflects the extent to which the system has been
designed so that user input errors are avoided and tolerated. When user errors
occur, the system should, as far as possible, detect these errors and either fix them
automatically or request the user to re-input their data (a small example follows this list).
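
As a small illustration of error tolerance, the sketch below checks a manually entered blood
glucose reading against a plausible range and asks the user to re-enter it rather than passing
a suspect value on. The prompt function and the range limits are invented for illustration.

# Illustrative error-tolerance check on user input (Python); limits are hypothetical.

def read_glucose_reading(prompt_user=input):
    """Keep asking until the user supplies a plausible numeric reading."""
    while True:
        raw = prompt_user("Enter blood glucose reading (mmol/L): ")
        try:
            value = float(raw)
        except ValueError:
            continue                        # not a number: detect the error and re-prompt
        if 1.0 <= value <= 35.0:            # hypothetical plausibility range
            return value
        # out of range: treat as a probable input error and ask again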

Because of additional design, implementation and validation costs, increasing the
dependability of a system can significantly increase development costs. In particular,
validation costs are high for critical systems.

The following figure shows the relationship between costs and incremental improvements
in dependability. The higher the dependability that you need, the more you have to spend
on testing to check that you have reached that level. Because of the exponential nature of
this cost/dependability curve, it is not possible to demonstrate that a system is 100%
dependable, as the costs of dependability assurance would then be infinite.
3.3 AVAILABILITY AND RELIABILITY

The reliability of a system is the probability that the system’s services will be correctly
delivered as specified. The availability of a system is the probability that the system will
be up and running to deliver these services to users when they request them.

Although they are closely related, it cannot be assumed that a reliable system will always
be available and vice versa. For example, some systems can have a high availability
requirement but a much lower reliability requirement. If users expect continuous service
then the availability requirements are high. An example of a system where availability is
more critical than reliability is a telephone exchange switch. Users expect a dial tone when
they pick up a phone, so the system has high availability requirements; however, if a fault
causes a connection to fail, the caller can simply hang up and redial, so occasional failures
of this kind are tolerable and the reliability requirement is less demanding.

A further distinction between these characteristics is that availability does not simply
depend on the system itself but also on the time needed to repair the faults that make the
system unavailable.

System reliability and availability may be defined more precisely as follows:

1. Reliability: The probability of failure-free operation over a specified time in a
given environment for a specific purpose.
2. Availability: The probability that a system, at a point in time, will be operational
and able to deliver the requested services.
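
Because these definitions are probabilistic, both attributes can be estimated from observed
operating data. A common convention, which is not given in this chapter and should be treated
as an assumption, estimates availability from the mean time to failure (MTTF) and the mean
time to repair (MTTR); this also reflects the point above that availability depends on repair
time as well as on failure rate.

# Rough availability estimate from observed failure-free periods and repair times (Python).
# The MTTF / (MTTF + MTTR) formula is a common convention assumed here, not taken from the text.

def availability(uptimes_hours, repair_times_hours):
    """Fraction of time the system is operational and able to deliver its services."""
    mttf = sum(uptimes_hours) / len(uptimes_hours)             # mean time to failure
    mttr = sum(repair_times_hours) / len(repair_times_hours)   # mean time to repair
    return mttf / (mttf + mttr)

# Example: long failure-free periods, but slow repairs still reduce availability.
print(availability([900, 1100, 1000], [24, 48, 36]))           # about 0.965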

The definition of reliability states that the environment in which the system is used and
the purpose that it is used for must be taken into account. If you measure system
reliability in one environment, you can’t assume that the reliability will be the same in
another environment where the system is used in a different way.
A further difficulty with these definitions is that they do not take into account the severity
of failure or the consequences of unavailability. People, naturally, are more concerned
about system failures that have serious consequences, and their perception of system
reliability is influenced by these consequences.

A strict definition of reliability relates the system implementation to its specification.
That is, the system is behaving reliably if its behaviour is consistent with that defined in
the specification. However, a common cause of perceived unreliability is that the system
specification does not match the expectations of the system users.

Reliability and availability are compromised by system failures. These may be a failure to
provide a service, a failure to deliver a service as specified, or the delivery of a service in
such a way that it is unsafe or insecure. When discussing reliability, it is helpful to
distinguish between the terms fault, error and failure as shown below.

System failure: An event that occurs at some point in time when the system does not
deliver a service as expected by its users.

System error: An erroneous system state that can lead to system behaviour that is
unexpected by system users.

System fault: A characteristic of a software system that can lead to a system error. For
example, failure to initialise a variable could lead to that variable having the wrong
value when it is used.

Human error or mistake: Human behaviour that results in the introduction of faults into
a system.
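
The variable-initialisation example above can be made concrete before looking at how
reliability is improved. In the Python sketch below (the billing scenario is invented for
illustration), the fault is that the accumulator is not reset for each bill, the error is the
stale program state this produces, and the failure is the wrong total reported to the user.
Note that the fault only leads to a failure for some input sequences.

# Fault, error and failure illustrated with a deliberately faulty function (Python).

total = 0                     # fault: initialised once, instead of once per bill

def bill_for(prices):
    global total
    # The missing "total = 0" here is the fault. It lies dormant until the
    # function is called a second time.
    for price in prices:
        total += price
    return total

print(bill_for([2, 3]))       # 5  -- correct: the fault has not yet caused an error
print(bill_for([4]))          # 9  -- stale state (error) now produces a visible failure (should be 4)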

The distinction between the above terms helps to identify three complementary
approaches that are used to improve the reliability of a system:

1. Fault avoidance: Development techniques are used that either minimise the
possibility of mistakes or trap mistakes before they result in the introduction of
system faults.
2. Fault detection and removal: Verification and validation techniques are used that
increase the probability of detecting and correcting errors before the system is
put into use.
3. Fault tolerance: Run-time techniques are used to ensure that system faults do not
result in system errors, or that system errors do not lead to system failures (a
simple run-time check is sketched below).
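
Fault tolerance can be illustrated with a simple run-time check: before acting on a computed
value, the system validates it against what is physically sensible, so that an erroneous
computation does not propagate into a dangerous output. The dose limit and the exception are
assumptions for illustration, continuing the insulin pump example.

# Run-time defensive check to stop a system error becoming a system failure (Python).
# The limit and fallback behaviour are hypothetical.

MAX_DOSE = 4                  # hypothetical maximum safe single dose

class UnsafeDoseError(Exception):
    """Raised when a computed dose fails the run-time safety check."""

def checked_delivery(dose, pump_deliver):
    """Deliver a dose only if it passes a plausibility check; otherwise signal an alarm."""
    if not (0 <= dose <= MAX_DOSE):
        # The computed value is outside the safe range: deliver nothing and
        # report the error so it cannot turn into an unsafe delivery (failure).
        raise UnsafeDoseError(f"computed dose {dose} outside [0, {MAX_DOSE}]")
    if dose > 0:
        pump_deliver(dose)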

Software faults cause software failures when the faulty code is executed with a set of
inputs that expose the software fault. The code works properly for most inputs. The
following figure shows a software system as a mapping from a set of inputs to a set of
outputs. Given an input or input sequence, the program responds by producing a
corresponding output.
The reliability of the system is the probability that a particular input will lie outside
the set of inputs that cause erroneous outputs. Different people will use the system in
different ways, so this probability is not a static system attribute but depends on the
system's environment. The following figure illustrates this: the set of inputs produced by
User 2 intersects with the erroneous input set, so User 2 will experience some system
failures.

The overall reliability of a program, therefore, mostly depends on the number of inputs
causing erroneous outputs during normal use of the system by most users. Software faults
that occur only in exceptional situations have little effect on the system’s reliability.
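
Because reliability depends on how the system is used, it can be estimated empirically by
running the program on inputs drawn from a particular user's operational profile and measuring
the proportion that are handled without failure. The sketch below is generic: the program under
test and the two usage profiles are placeholders that simply mimic the User 1 / User 2
situation described above.

# Estimating reliability from an operational profile (Python); program and profiles are placeholders.
import random

def program_under_test(x):
    """Toy program containing a fault: it fails when x == 13."""
    return 100 / (x - 13)

def estimate_reliability(program, profile, trials=10000):
    """Fraction of inputs drawn from the usage profile that execute without failure."""
    failures = 0
    for _ in range(trials):
        try:
            program(profile())
        except Exception:
            failures += 1
    return 1 - failures / trials

user1 = lambda: random.randint(20, 100)    # never hits the faulty input: sees perfect reliability
user2 = lambda: random.randint(0, 25)      # occasionally hits x == 13: sees some failures
print(estimate_reliability(program_under_test, user1))   # 1.0
print(estimate_reliability(program_under_test, user2))   # roughly 0.96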
