
WHITE PAPER

ELIMINATING SINGLE
POINTS OF FAILURE
How to design for redundancy in mission-critical
backup power systems

Eliminating single points of failure is of paramount importance in mission-critical backup power
systems. Designing for redundancy throughout the installation is the most efficient way of ensuring
uninterrupted uptime despite adverse conditions such as controller failure and loss of communication.
Redundancy is not limited to any one particular feature or technology; instead, system designers
need to view it as a general design approach that should be applied to all aspects of the backup power
system, and all systems that it interfaces with.

If power fails in bank IT systems, customers and employees cannot use ATMs or access account
information. If power fails in an airport control tower or hospital operating theatre, human life could be
at stake. In many applications, power failure is simply not an option, and such applications therefore
require resilient power systems capable of delivering the necessary electricity even in adverse
conditions. While resilience is particularly important in critical power systems found in hospitals, data
centres, or broadcasting corporations, it applies in any installation where power outages could have
serious consequences.

Single points of failure and redundancy: definition
A key consideration when designing resilient power systems is to avoid single points of failure. A single
point of failure is any component in a system that would cause the entire system to stop operating if it
failed. The list of components that could become single points of failure in a typical power system is
long and includes breakers, controllers, transformers, and communication lines.

Eliminating single points of failure is the only effective way of ensuring resilience in power systems,
and the only effective way of eliminating them is to provide redundancy for these components in the
power system design; in other words, to add backup or duplicate components that can take over in case
of component failure.

In this whitepaper, we will look at some common single points of failure, providing recommendations
for eliminating them by designing for redundancy. It is, however, important to realise that redundancy is
not a question of adding any one particular component; it is a design approach.

Single point of failure:
Any component in a system that would cause the entire system to stop operating if it failed.

Redundancy:
Installing backup or duplicate components that can take over in case of component failure in order to
eliminate single points of failure.

REDUNDANCY IS NOT A FEATURE.
IT IS A DESIGN APPROACH
“Redundancy is not just about hooking up an extra CAN bus cable between two controllers”, explains
René Kristensen, Critical Power System Specialist and Accredited Tier Designer at DEIF, a global market
leader in control solutions for decentralised power production on land or at sea. “It’s a design approach
where you methodically review every critical part of the system, constantly asking, ‘What would happen
if this component failed?’ If the results of component failure are not acceptable, providing redundancy
for that component is the answer.”

“Redundancy is about methodically reviewing every critical part of the system, constantly asking,
‘What would happen if this component failed?’”
René Kristensen, DEIF

According to results presented at the 2018 Data Center World conference, 48% of all critical data centre
failures were caused by equipment failure or inadequate system design. Providing redundancy for critical
components allows you to eliminate both of these risks: even if component failure does occur, the system
design provides resilience and security of supply through redundancy.

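[Figure: pie chart of the causes of critical data centre failures. Categories: equipment failure, system
design, human error, equipment design, installation error, commissioning or test deficiency, maintenance
oversight, natural disaster. Slice values shown: 28%, 20%, 18%, 13%, 10%, 4%, 4%, 3%.]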

Equipment failure or inadequate system design account for almost half of all critical data centre failures.
Source: results presented at the 2018 Data Center World conference
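
The review that Kristensen describes ("What would happen if this component failed?") can be sketched in a few lines of code. This is a minimal illustration only, and it rests on a simplifying assumption that is not part of the whitepaper: the system is modelled as a set of supply paths, each a series chain of components, and the load is fed as long as at least one path is fully intact. The function name and the component names are hypothetical.

```python
from typing import List, Set


def single_points_of_failure(supply_paths: List[List[str]]) -> Set[str]:
    """Return the components whose failure alone would leave the load unsupplied.

    Assumption: each supply path is a series chain of components, and the load
    is fed as long as at least one path has no failed component. A component is
    then a single point of failure exactly when it sits on every path.
    """
    if not supply_paths:
        return set()
    common = set(supply_paths[0])
    for path in supply_paths[1:]:
        common &= set(path)
    return common


# Hypothetical system: one mains feed and two gensets sharing a single busbar.
paths = [
    ["mains", "mains breaker", "busbar"],
    ["genset 1", "genset breaker 1", "busbar"],
    ["genset 2", "genset breaker 2", "busbar"],
]
print(single_points_of_failure(paths))  # {'busbar'}
```

In this hypothetical layout, the sources and breakers are duplicated, but the shared busbar is on every path and is therefore the component that still needs redundancy.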

THE UPTIME INSTITUTE TIER RATING SYSTEM
The data centre business is a good example of a business where uptime is key. The Tier Standard developed by the US-based
Uptime Institute is a widely recognised standard for data centre reliability and performance. It defines four levels of resilience
that can be used by data centre operators to certify that their installations deliver a certain level of performance, and by
customers to define requirements when selecting a data storage solution.

Tier level   Description

Tier I       Basic capacity: a single path for power and cooling and few, if any, redundant and backup
             components.

Tier II      Single path for power and cooling and some redundant and backup components
             (e.g. chiller, UPS, generator).

Tier III     Multiple paths for power and cooling and concurrently maintainable: the system can be
             updated and maintained without taking it offline.
             In a Tier III installation, the number of all critical system components must be N+1; in
             other words, there must be the number of components required, plus one additional
             component for redundancy purposes. If a Tier III installation requires 3 gensets, for
             example, the actual number of gensets installed must be 4.

Tier IV      Completely fault tolerant (no single point of failure; automatic response to any failure
             to ensure service continuity without any operator action needed).
             In a Tier IV installation, the number of all critical components must be 2N; in other
             words, double the number of components required for normal operation. This ensures full
             redundancy and fault tolerance but also comes at a significantly higher price than Tier III.

While the Tier Standard applies to data centres only, it can be used as a source of inspiration for designing resilient power
systems in other applications as well. For mission-critical applications, Tier III or IV is recommended.
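
The N+1 and 2N rules in the table reduce to simple arithmetic. The sketch below illustrates them; the function names are hypothetical, and it assumes that every unit has the same capacity, so that the system keeps working as long as at least N units remain in service.

```python
def installed_units(required: int, scheme: str) -> int:
    """Return how many units to install for a redundancy scheme ("N", "N+1" or "2N")."""
    if required < 1:
        raise ValueError("required must be at least 1")
    if scheme == "N":
        return required
    if scheme == "N+1":
        return required + 1
    if scheme == "2N":
        return 2 * required
    raise ValueError(f"unknown scheme: {scheme}")


def still_covers_load(required: int, installed: int, failed: int) -> bool:
    """True if enough units remain in service after `failed` units drop out."""
    return installed - failed >= required


# The Tier III example from the table: 3 gensets required, 4 installed.
print(installed_units(3, "N+1"))  # 4
print(installed_units(3, "2N"))   # 6
```

With 3 gensets required, an N+1 installation tolerates one failed unit, while a 2N installation tolerates three.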

COMMON SINGLE POINTS OF FAILURE
While the following list of common single points of failure in backup power systems is not exhaustive, it provides useful tips
on how to design for redundancy.

Failure to start
The most common single point of failure in a critical power application is failure to start; in other words, the backup power
source (typically a genset) is not available during a grid blackout because it cannot start.
The solution is to include more than one starting method. There are two aspects to this:

• The genset can be equipped with an extra starter system.


• The genset controller may offer a double starter feature.

In an installation with both of these precautions implemented, the double starter feature can toggle between the two starter
systems until the genset cranks, or carry out several attempts with the standard starter system before switching to the extra starter
system. The exact functionality depends on the controller and on your installation and requirements. DEIF controllers support
double starter relay outputs. Providing extra starter systems and a controller capable of toggling between them is an efficient way
of eliminating single points of failure. Even if the starter features work as they should, lack of fuel or insufficient battery power can
result in genset starting failure. Getting a genset controller with advanced battery testing features adds
another layer of security and helps you detect potential starting issues early on.
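
The two double starter strategies described above (toggling between the starter systems, or making several attempts on the standard starter before switching) can be sketched as follows. This is an illustration only and not the logic of any DEIF controller: the function names, the mode names, and the `try_crank` callback are hypothetical.

```python
from typing import Callable, List, Optional


def starter_sequence(max_attempts: int, mode: str, attempts_before_switch: int = 3) -> List[str]:
    """Return which starter system is used on each crank attempt."""
    if mode == "toggle":
        # Alternate between the two starter systems on every attempt.
        return ["standard" if i % 2 == 0 else "extra" for i in range(max_attempts)]
    if mode == "fallback":
        # Try the standard starter a few times, then switch to the extra one.
        return ["standard" if i < attempts_before_switch else "extra" for i in range(max_attempts)]
    raise ValueError(f"unknown mode: {mode}")


def start_genset(try_crank: Callable[[str], bool], max_attempts: int = 6,
                 mode: str = "toggle", attempts_before_switch: int = 3) -> Optional[str]:
    """Return the starter that cranked the genset, or None if every attempt failed."""
    for starter in starter_sequence(max_attempts, mode, attempts_before_switch):
        if try_crank(starter):
            return starter
    return None


# Hypothetical fault: the standard starter is dead, the extra starter works.
print(start_genset(lambda starter: starter == "extra", mode="fallback"))  # extra
```

A return value of None corresponds to the start failure alarm case, where neither starter system could crank the genset within the allowed number of attempts.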

Figure (diagram labels: Mains 1, Genset 1, Genset 2): A simple critical power system with a mains
connection and two backup gensets. In case of a mains blackout, the gensets can only provide the
necessary backup power if they actually start.

Figure: Genset starter motors. Installing a redundant starter system is a good way of eliminating a
single point of failure in the backup power system.

Master controller failure
Backup power systems are often designed with a master PLC controlling a number of slave controllers
on the gensets, breakers, and other important system components. While this design philosophy is
widespread, it introduces another potential single point of failure: If the master PLC fails, the whole
system fails because there is no master controller to ensure that gensets start when they should,
breakers open or close as required, and so on.

You can design for redundancy by programming and installing a backup master PLC or by using a
multi-master power management system. In such a system, there is an intelligent controller on all
system components, and all controllers are interlinked in a communication network (which must be
designed with full redundancy, too; more on this below). One controller assumes the master controller
role, and if it fails, the master role is automatically transferred to the next controller in the network.
There is therefore no master/slave architecture as in the master PLC system. If the master controller
fails, only part of the system is temporarily lost while all remaining components remain in service
controlled by a different master controller.

When a controller fails, you need to take corrective action to get it up and running again, but with a
multi-master architecture, the entire power system does not fail just because a single controller does.
all controllers are interlinked in a communication network

Figure (diagram labels: controllers CAN ID 01 (master) to CAN ID 06, breakers TB17, TB18, and BTB33,
two loads): A multi-master power management system can handle controller failure because all
controllers are capable of assuming the master controller role. If the current master controller (in this
example, the grid breaker controller with CAN ID 01) fails, the master role is transferred to the next
controller on the communication network (CAN ID 02).
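
The handover in this example can be sketched as a simple selection rule. The sketch assumes that the "next controller" is the healthy controller with the lowest CAN ID; this is an assumption made for illustration and is not taken from DEIF documentation. The function name is hypothetical.

```python
from typing import Dict, Optional


def current_master(healthy: Dict[int, bool]) -> Optional[int]:
    """Return the CAN ID that holds the master role, or None if no controller is healthy.

    Assumption: the role goes to the healthy controller with the lowest CAN ID.
    """
    candidates = [can_id for can_id, ok in healthy.items() if ok]
    return min(candidates) if candidates else None


controllers = {1: True, 2: True, 3: True, 4: True, 5: True, 6: True}
print(current_master(controllers))  # 1
controllers[1] = False              # the grid breaker controller fails
print(current_master(controllers))  # 2
```

Because every controller can evaluate the same rule on the same availability information, no dedicated master unit is needed, which is what removes the single point of failure.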

Controller failure
Even if a system is designed with a multi-master
architecture, the failure of a single controller can still
weaken the system, making it vulnerable to complete
failure if more controllers fail. For designers wanting to
safeguard against controller failure, it is possible to take
another approach: installing a redundant controller at
every important controller location in the system.
Designing such a system requires the use of intelligent,
connected controllers that constantly exchange
availability information. One controller is designated
to be in control while the other is in supervision mode,
monitoring the main controller and taking over control in
case the main controller goes offline or becomes unstable.
This shift is done on the fly, ensuring uninterrupted
system control; a feature also known as “hot standby”.
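
A hot standby pair of this kind is commonly supervised with heartbeats. The sketch below illustrates the idea under stated assumptions: the class name, the 0.5 second timeout, and the rule that control does not switch back automatically are all hypothetical choices, not a description of any particular controller.

```python
class HotStandbyPair:
    """Two controllers at one location: one in control, one in supervision mode."""

    def __init__(self, timeout_s: float = 0.5) -> None:
        self.timeout_s = timeout_s  # assumed heartbeat timeout
        self.active = "primary"
        self.last_heartbeat = {"primary": 0.0, "secondary": 0.0}

    def heartbeat(self, unit: str, now: float) -> None:
        """Record that `unit` reported itself available at time `now` (seconds)."""
        self.last_heartbeat[unit] = now

    def supervise(self, now: float) -> str:
        """Hand over control if the active unit is silent and the standby is alive."""
        standby = "secondary" if self.active == "primary" else "primary"
        active_silent = now - self.last_heartbeat[self.active] > self.timeout_s
        standby_alive = now - self.last_heartbeat[standby] <= self.timeout_s
        if active_silent and standby_alive:
            self.active = standby
        return self.active


pair = HotStandbyPair(timeout_s=0.5)
pair.heartbeat("primary", 0.0)
pair.heartbeat("secondary", 0.0)
print(pair.supervise(0.2))   # primary
pair.heartbeat("secondary", 1.0)
print(pair.supervise(1.1))   # secondary (primary has been silent for 1.1 s)
```

In this sketch the former primary stays in supervision mode after it recovers; whether control should fail back automatically is a design decision for the real system.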

Figure (diagram legend: ① Primary, ② Secondary): By installing redundant controllers at every
important controller location in the backup power system, you can safeguard the system against
controller failure.

Loss of communication
In any backup power system, the PLCs or controllers managing the breakers, grid connections, gensets,
and other key components need to be connected. If they cannot communicate, they cannot adequately
control the power system. Loss of communication is a typical single point of failure; communication
cables are vulnerable to mechanical stress and excessive heat, and if they break, communication is no
longer possible.

The solution is to eliminate this single point of failure by installing dual communication wiring with
true redundancy. Most power systems are controlled using a CAN bus or similar; the presence of two
redundant communication lines (A and B) makes this an attractive option for ensuring uninterrupted
communication: if one channel fails, the other takes over. There are, however, a couple of important
points to bear in mind:

• It is important to physically route the two channels completely separately, in separate ducts and
preferably along separate routes in the building. If both channels run in the same duct, both can be
taken out simultaneously: Anything that can damage one cable will likely damage both if they are
located next to one another. In order to achieve true redundancy, it is also important to use
controllers that have physically separate communication ports and do not route both channels
through the same port.

• It is important to select a network topology that provides redundancy in case of cable breaks.
A single-string link network is vulnerable, in that all controllers further down the line will lose all
communication if the first cable link is broken. Use ring or star networks instead, as they provide
better network redundancy. If the connection to a particular node fails in a star network, for
example, the connections to all other nodes will still be available. The entire star network can be
replicated in two physical locations to increase security, for example by setting up two parallel CAN
bus networks using a star topology. If a ring network is broken in one place, communication
between nodes will still be possible by using the rest of the ring network.

Figure: Diagram of controllers connected using CAN bus. If channel A should fail (because of a cable
break or similar), controller communication is possible over the redundant B channel.

Figure (diagram labels: Single string, Ring, Star): Single-string link networks are vulnerable to
communication failure. Use ring or star networks whenever possible, as they provide better redundancy.
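
The effect of a single cable break on the three topologies can be checked with a small connectivity model. The sketch below rests on assumptions that go beyond the whitepaper: each cable is a link between two nodes, the centre of the star is a passive junction ("hub") that does not itself fail, and there are at least three controllers. All names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple, Union

Node = Union[int, str]
Link = Tuple[Node, Node]
HUB = "hub"  # assumed passive junction of the star; not counted as a controller


def string_links(n: int) -> List[Link]:
    return [(i, i + 1) for i in range(n - 1)]


def ring_links(n: int) -> List[Link]:
    return string_links(n) + [(n - 1, 0)]


def star_links(n: int) -> List[Link]:
    return [(HUB, i) for i in range(n)]


def largest_group(n: int, links: List[Link]) -> int:
    """Count the controllers in the largest group that can still communicate."""
    adjacency: Dict[Node, Set[Node]] = {i: set() for i in range(n)}
    adjacency[HUB] = set()
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen: Set[Node] = set()
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over one connected group
            node = queue.popleft()
            if node != HUB:
                size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def worst_single_break(n: int, links: List[Link]) -> int:
    """Largest communicating group after the most damaging single cable break."""
    return min(
        largest_group(n, [link for link in links if link != broken])
        for broken in links
    )


for name, build in [("string", string_links), ("ring", ring_links), ("star", star_links)]:
    print(name, worst_single_break(5, build(5)))
# string 3
# ring 5
# star 4
```

With five controllers, the worst single break splits the string into groups of at most three, leaves the ring fully connected, and isolates only one controller in the star, which matches the recommendation above.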

What happens if the CAN bus fails?
In CAN bus-based communication networks, it is possible to add an extra layer of redundancy and
resilience by setting up an analogue communication channel between genset controllers (if the
controllers support this feature). The AGC-4 automatic genset controller from DEIF supports this
feature. If both CAN bus channels fail, the connected controllers will continue operation in a stable
load sharing mode. In this way, the generators remain in stable operation even though power
management is temporarily lost, ensuring stable frequency and voltage by sharing the load in the
system until CAN bus communication can be re-established. Electronic droop can be used to the same
effect over an analogue connection.

Breaker or coil failure
Breakers are absolutely crucial to any backup power and distribution system. In simple terms, without
dependable and well-functioning breakers, power will not go where it must go, and without redundancy
for the breaker, this component becomes a potential single point of failure.

Breakers, and the actuator coils that open and close them, are electromechanical components, and like
any such components, they can wear out. “When a breaker fails, it is usually caused by coil failure”,
explains René Kristensen. “Remember that an actuator close coil is latched 99.9% of the time. This can
cause it to stick, in which case you cannot trip the breaker. Breakers are also subject to wear and tear
which can wear them down. It’s important to cycle them periodically to check that they work, and you
should always follow the service instructions regarding lubrication, cleaning, and so on”.

In order to eliminate single points of failure in your breakers, you need to provide redundancy for all
important breakers, for example by installing parallel grid connections with separate breakers.

Figure: The DEIF AGC-4 automatic genset controller. An analogue load sharing line can be connected,
ensuring continued operation in a stable load sharing mode in case of complete CAN bus failure.

Figure (diagram labels: Mains 1, Mains 2, Genset 1, Genset 2): A power system with redundant
breakers retains its ability to function even if one breaker fails.
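
As background to the droop-based load sharing mentioned in the CAN bus fallback discussion, the sketch below shows the general textbook principle of speed droop, not DEIF's implementation. It assumes that all gensets use the same droop percentage and the same no-load frequency, and it ignores voltage and reactive power. The 4% droop, the 50 Hz no-load frequency, and the function name are illustrative assumptions.

```python
from typing import List, Tuple


def droop_load_sharing(load_kw: float, ratings_kw: List[float],
                       droop: float = 0.04,
                       f_no_load_hz: float = 50.0) -> Tuple[float, List[float]]:
    """Return (steady-state frequency in Hz, load per genset in kW) under equal droop.

    Each genset lowers its frequency linearly from f_no_load_hz at zero load to
    f_no_load_hz * (1 - droop) at rated load. With identical droop settings, all
    gensets settle at the same per-unit loading, so they share the load in
    proportion to their ratings without exchanging any data.
    """
    total_rating = sum(ratings_kw)
    if total_rating <= 0:
        raise ValueError("total rating must be positive")
    loading = load_kw / total_rating  # common per-unit loading
    frequency = f_no_load_hz * (1.0 - droop * loading)
    shares = [rating * loading for rating in ratings_kw]
    return frequency, shares


# Hypothetical plant: a 400 kW and an 800 kW genset feeding a 600 kW load.
frequency, shares = droop_load_sharing(600.0, [400.0, 800.0])
print(round(frequency, 2), shares)  # 49.0 [200.0, 400.0]
```

The trade-off is visible in the result: load is shared proportionally without any communication link, but the frequency settles slightly below the no-load value until power management is restored.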

Tip: Draw-out breakers
Breaker replacement is much safer and more reliable when using a so-called draw-out breaker; in other
words, a breaker built with a draw-out construction which allows you to remove and replace it without
dismantling the switchgear and disturbing the power circuit connections. The drawback to this solution
is the cost. When using draw-out breakers, it is important to also use controllers capable of responding
intelligently and safely when the breaker is racked out to testing position or all the way out of the
system.

UPS failure
Most critical power systems are designed with an uninterruptible power supply (UPS) capable of feeding
critical loads in the period following a grid blackout and until enough gensets have been started to
cover the power requirement.

While UPS failure is rare, it will result in power outage until genset power (or another backup power
source) can be brought online. In some applications, this is not acceptable: The US NFPA, for example,
requires that full backup power is available in hospitals within 10 seconds of a grid blackout, which can
be hard or impossible to achieve with multiple gensets and no UPS. For this reason, UPS systems should
be designed for redundancy when power interruptions are not acceptable. Note that this also involves
selecting the right type of UPS, and selecting a control architecture that allows the UPS modules to
communicate with the genset controllers (or other backup power sources used in the critical power
system).

Figure: Replacing a draw-out breaker, like this one in the picture, is much safer and more reliable than
replacing a hard-wired breaker that requires you to dismantle switchgear and disassemble electrical
connections.

Figure (diagram labels: Main supply, UPS1, UPS2, UPS3, Rectifier, Inverter, Static isolator, Maintenance
bypass, Critical load): A de-centralised UPS system with three modules. If one module fails or is in
need of maintenance or replacement, the next module takes over.

ADDITIONAL CONSIDERATIONS
On the preceding pages, we have described some of the most common, and important, single points of
failure that you need to consider when designing a reliable power system. The list is not exhaustive, and
it does not include systems and components that are outside the scope of the power system. As stated
initially, however, redundancy is a design approach, and you should apply the same principles to all
components and systems at your site. In a data centre, for example, you need to ensure redundant
cooling, as cooling system failure can quickly have catastrophic consequences for the server racks.
Depending on your business and application, redundant lighting, access control, and communications
equipment could be necessary. And if your operation relies on dependable 24/7 access to surveillance
data and error logs, you should take steps to ensure that this data is stored with redundancy in case a
hard disk or server rack fails.

Performance and system cost
Redundancy and resilience come at a price, and you should always weigh your performance and safety
requirements against the potential installation and operating costs. If you do not need full redundancy,
there is no need for the significant investment that an N+1 or 2N system represents. On the other hand,
for mission-critical applications, there is often no choice: Customer requirements, safety considerations,
or legal obligations may require you to design and implement a solution that delivers at least partial
redundancy.

Testing and connectivity
Be sure to define testing schedules, and stick to them! Be aware of operating conditions; high humidity
or heat, for example, may require you to carry out particular tasks more often than set forth in the
manufacturer’s recommendations. And remember that the power system does not work in isolation.
Should it interface with a BMS or SCADA system? Should it rely on inputs from other systems
elsewhere? If so, such systems and all interfaces and connections should be designed and tested in
order to eliminate single points of failure here, too.

Fast power delivery
Depending on your application, other factors may need to be considered, too. Does full backup power
need to be available within 10 seconds, for example, as required by US NFPA requirements? In that case,
solutions that can speed up genset start-up and excitation procedures will very probably be required.
A critical power system that delivers power too late is not much better than one that does not deliver
any power at all. When reliability and resilience are important, this could be an important issue to
address.

WORK WITH EXPERIENCED PARTNERS
There are many technical and legal details to consider when designing any kind of power system. We
recommend partnering with experienced system designers and solution vendors who can assist you in
defining, designing, developing, testing, and commissioning a system that meets your requirements.

For more information on designing for redundancy in power systems, contact DEIF. We have the
experience and know-how to help you avoid single points of failure and develop a solution that fulfils
your requirements, no matter if you need full 2N redundancy or a more economical and less fault
tolerant solution.

FIND MORE INFORMATION ON DEIF.COM
Case study
See how a Norwegian hospital upgraded its backup power system and got higher reliability.

Case study
See how Denmark’s biggest co-location provider built an N+1 configuration with DEIF’s AGC-4 controller.

Whitepapers
Find more free whitepapers, for example on critical power
system design and on the advantages of using multi-master
power management systems in critical power.

www.deif.com/whitepapers

DEIF A/S
Frisenborgvej 33, 7800 Skive, Denmark
Tel. +45 9614 9614
Learn more at deif.com
