
Teach Computer Science

Reliability in computer systems

teachcomputerscience.com
1. Revision notes

Introduction
The reliability of a computer-related component is its ability to
perform consistently according to its specification. As the world
becomes more computerised, people want assurance that a system is
reliable and will continue to work properly even if a few modules
fail.
Many things can go wrong with a computer. Hardware might fail to
operate. Software might contain bugs or errors. Human errors can
also make the system inefficient. Security is very important to systems,
as there might even be a deliberate attack. Disasters like power cuts,
flooding or an earthquake affect the operation of systems too.

Figure 1: Sources of system failure

Critical systems
Critical systems must be highly reliable, as their failure may have a
great impact on human lives. Programmers developing critical systems
use conservative, well-established techniques to design the system
rather than new ones, because the advantages and weaknesses of a
conservative technique are well known. A new technique is only
adopted after its long-term effects have been analysed, even though
it might seem to be more efficient.

Critical systems can be classified into groups, as shown in the figure
below.

Figure 2: Classification of critical systems

a) Safety-critical systems:
Safety-critical systems are designed to avoid danger to human lives
and the environment. For example, temperature control of nuclear
reactors, air traffic control, unmanned train systems, etc.
b) Mission-critical systems:
Mission-critical systems are responsible for achieving the goals of
the overall system, so their failure affects its overall performance.
For example, the navigation system of an aircraft.
c) Business-critical systems:
Business-critical systems are designed to avoid loss of business,
economic loss and loss of reputation. For example, banking systems,
stock-trading systems, etc.
d) Security-critical systems:
Security-critical systems are designed to protect sensitive information
that can be misused when in the wrong hands. For example, defence
systems containing information about the country’s security can be
misused if attacked by cyber terrorists.

Backup
To make the system reliable, a copy of all data and files can be stored.
These duplicate data and files are stored in a separate server or
storage drive. This protects the data from being lost due to failure.
Backup is also useful when the data is accidentally overwritten.
Different types of backup are explained in detail in operating systems
(article 5).
Backup procedure
The team responsible for the backup procedure performs the backup
according to a well-defined schedule. Backup disks are to be stored in
a secure location. Disks and tapes kept in a secure place away from
the main site, ideally in fireproof storage, are called an off-site
backup. The data can also be backed up over the Internet using cloud
technology.
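
The backup and restore steps can be sketched in a few lines of Python. This is only a minimal illustration, not a real backup tool: the function names `back_up` and `restore` and the checksum step are assumptions made for this example, and a real procedure would also follow the schedule and keep copies off-site.

```python
import hashlib
import shutil
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and check that the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also keeps the file timestamps
    if file_checksum(source) != file_checksum(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target


def restore(backup: Path, destination: Path) -> None:
    """Replace a lost or overwritten file with its backed-up copy."""
    shutil.copy2(backup, destination)
```

Comparing checksums after copying is one simple way to confirm that the duplicate is usable before it is needed.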
Disaster recovery
Disaster recovery is the process of getting back lost data from the
backup after a system failure. Let us consider the example of a
hardware failure. To recover from this failure, the hardware is repaired
or replaced with new hardware. The data is recovered from the backup
and copied to the hardware. Examples of precautionary measures that
an organisation takes to avoid disaster are an uninterruptible power
supply (UPS), surge protectors (to limit the power surges reaching
electronic equipment), fire prevention and anti-virus software.

Redundancy
Redundancy is the duplication of critical parts of a computer system to
improve reliability. If the primary system fails, the backup or reserve
system steps in. Redundancy is very important in critical systems like
aircraft systems. If any hardware or software fails during a flight, the
redundant system steps in to avoid failure.
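
As a rough sketch of how a reserve system steps in, the Python below tries each redundant unit in turn and uses the first one that works. The names `call_with_failover`, `primary_sensor` and `reserve_sensor` are hypothetical, and the fault is simulated; real aircraft systems are far more elaborate.

```python
from typing import Callable, Sequence


def call_with_failover(units: Sequence[Callable[[], float]]) -> float:
    """Try each redundant unit in order and return the first good result."""
    errors = []
    for unit in units:
        try:
            return unit()
        except Exception as exc:  # this unit failed, so try the next one
            errors.append(exc)
    raise RuntimeError(f"All {len(units)} redundant units failed: {errors}")


def primary_sensor() -> float:
    raise ConnectionError("primary sensor offline")  # simulated fault


def reserve_sensor() -> float:
    return 101.3  # simulated reading


# The primary fails, so the reading comes from the reserve unit.
reading = call_with_failover([primary_sensor, reserve_sensor])
print(reading)
```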

Types of redundancy:

Figure 3: Types of Redundancy


a) Hardware redundancy
A spare copy of a critical hardware device is fitted so that the
system keeps running if one fails. For example, instead of having one
power supply, a system is
provided with two power supplies. The two power supplies are
arranged in a parallel setup so that they can be easily switched if one
of them fails.
Redundant array of independent disks (RAID) is a data storage
technology in which multiple physical disk drives are used to store
redundant data. Depending on the RAID level used, if one (or in some
cases more than one) disk drive fails, the data can still be
recovered from the remaining drives.
b) Software redundancy
Redundant software is used to replace the original program in case it
fails.
c) Data redundancy
Redundant data in the backup can replace the original data in case the
original data is lost or overwritten accidentally.
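
The idea behind storing redundant data on several drives can be sketched as a simple mirror, in the spirit of RAID level 1. This is an illustrative model only: the class `MirroredStore` is made up for this example, each "disk" is just a Python dictionary, and other RAID levels use different schemes.

```python
class MirroredStore:
    """Keeps the same blocks on every disk, in the spirit of RAID 1."""

    def __init__(self, disk_count=2):
        # Each simulated disk maps a block number to the bytes stored there.
        self.disks = [{} for _ in range(disk_count)]
        self.failed = set()

    def write(self, block, data):
        """Write the block to every disk that is still working."""
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block] = data

    def fail_disk(self, index):
        """Simulate a drive failure: its contents become unreadable."""
        self.failed.add(index)
        self.disks[index].clear()

    def read(self, block):
        """Return the block from the first working disk that holds it."""
        for index, disk in enumerate(self.disks):
            if index not in self.failed and block in disk:
                return disk[block]
        raise IOError(f"Block {block} is lost on every disk")


store = MirroredStore(disk_count=2)
store.write(0, b"customer records")
store.fail_disk(0)
print(store.read(0))  # still readable from the second disk
```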

Fault tolerance
Fault tolerance is a property that enables a system to operate properly
even if the system undergoes one or more failures. This property is
essential for life-critical systems. A fault-tolerant design enables a
system to continue operating, possibly at a reduced level, rather than
failing completely when some of its parts fail. In this way, the data
is protected from damage, intrusion or disclosure. A system that fails
gracefully, that is, one that operates at a reduced level after some
component failures, is called a fail-soft system. For example, a
building may operate with reduced lighting and fewer elevators if the
power fails.
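
The building example can be sketched as a small Python function that keeps the most essential services running within whatever power is available. The service names and power figures are invented for illustration and are not taken from any real building.

```python
def services_to_run(available_power_kw):
    """Return the services that fit in the power budget, most vital first."""
    # (name, power draw in kW), ordered from most to least essential.
    # These figures are assumptions made for this example.
    services = [
        ("emergency lighting", 2.0),
        ("elevator A", 8.0),
        ("remaining lighting", 15.0),
        ("elevator B", 8.0),
    ]
    running = []
    for name, draw in services:
        if draw <= available_power_kw:
            running.append(name)
            available_power_kw -= draw
    return running


print(services_to_run(50.0))  # mains power: every service runs
print(services_to_run(12.0))  # generator only: reduced service
```

With less power the function returns a shorter list instead of nothing at all, which is the fail-soft behaviour described above.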

Defensive programming
Software can be made more reliable by adding extra checks. These
checks warn the user if the program is not working in the desired
manner, which enables the user to take action. This is called
defensive programming. Without these extra checks, the program might
crash without any warning.
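
A minimal sketch of such extra checks is shown below, using a hypothetical `withdraw` function for a bank account. The function and its rules are assumptions made for this example. Each check rejects bad input with a clear message instead of letting the program carry on with a wrong value.

```python
def withdraw(balance, amount):
    """Return the new balance, rejecting requests that make no sense."""
    # bool is a kind of int in Python, so it is excluded explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise TypeError("amount must be a number")
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


try:
    withdraw(100.0, 250.0)
except ValueError as problem:
    print(f"Warning: {problem}")  # the user is told what went wrong
```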

Measuring reliability
The reliability of a system is measured using statistical parameters
that help to predict how reliable the system is. The reliability
factors are:
a) Percentage of time
The percentage of time for which the service was available and
operational during a particular month.
b) Number of hours
Number of hours denotes the amount of time the system has
operated without reporting any problems.

c) Downtime
Downtime is the period during which a system has broken down or is
out of action. Zero downtime refers to a system that is available all
the time.
d) Mean time between failures (MTBF)
Mean time between failures is calculated by taking the average of the
time between failures of a system.
e) Mean time to failure (MTTF)
Mean time to failure is the average length of time for which the
system is expected to operate before it fails.

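[Timeline diagram: System failure, then Time to repair, then Resumes
normal operation, then Time to failure, then the next System failure.
Time between failures spans the whole interval from one system
failure to the next.]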

Figure 4: Measuring reliability
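
The parameters above can be worked out with a short Python sketch. The function names and the sample figures are made up for illustration. Following Figure 4, the time between failures is taken here as the time to repair plus the time to failure.

```python
from statistics import mean


def availability_percent(uptime_hours, downtime_hours):
    """Percentage of the period for which the system was operational."""
    return 100 * uptime_hours / (uptime_hours + downtime_hours)


def mean_time_to_failure(times_to_failure):
    """Average running time from resuming operation to the next failure."""
    return mean(times_to_failure)


def mean_time_between_failures(times_to_repair, times_to_failure):
    """Average time from one failure to the next, as drawn in Figure 4."""
    cycles = [r + f for r, f in zip(times_to_repair, times_to_failure)]
    return mean(cycles)


# Hypothetical log of three failure cycles, all in hours.
repairs = [2, 4, 3]
running = [98, 196, 147]

print(mean_time_to_failure(running))
print(mean_time_between_failures(repairs, running))
print(availability_percent(sum(running), sum(repairs)))
```

For this sample log the system ran for 441 hours and was down for 9 hours in total.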

2. Activity

Activity 1
Duration: 15 minutes

1. You are a programmer developing a banking system. Explain in
detail the important parts of the system and the ways in which
these parts may fail in the box below.

2. What measures will you take to protect the banking system from
the possible failures you identified in question 1? Explain in
detail in the box below.

3. End of topic questions
1. Where is backup stored?
2. What are the different types of redundancy? How are they useful
in improving the reliability of a system?
3. What is a fault-tolerant system?
4. What is a fail-soft system?
5. How can the reliability of a system be measured? Write down the
different parameters with a line of explanation.
