
PUBLICLY AVAILABLE SPECIFICATION

PAS 77:2006

IT Service Continuity Management


Code of Practice

ICS code: 35.020 NO COPYING WITHOUT BSI PERMISSION EXCEPT AS PERMITTED BY COPYRIGHT LAW


This Publicly Available Specification comes into effect on 11 August 2006


BSI 11 August 2006 ISBN 0 580 49047 5



Contents

Foreword
Introduction
1 Scope
2 Terms and definitions
3 Abbreviations
4 IT Service Continuity management
5 IT Service Continuity strategy
6 Understanding risks and impacts within your organization
7 Conducting business criticality and risk assessments
8 IT Service Continuity plan
9 Rehearsing an IT Service Continuity plan
10 Solutions architecture and design considerations
11 Buying Continuity Services
Annex A (informative) Conducting business criticality and risk assessments
Annex B (informative) IT Architecture Considerations
Annex C (informative) Virtualization
Annex D (informative) Types of site models
Annex E (informative) High availability
Annex F (informative) Types of resilience
Bibliography



Foreword
This Publicly Available Specification (PAS) has been prepared by the British Standards Institution (BSI) in partnership with Adam Continuity, Dell Corporation, Unisys and SunGard. Acknowledgement is given to the following organizations, which have been involved in the development of this code of practice:
Adam Continuity
Dell Corporation
SunGard
Unisys

Contributors: Oscar O'Connor (Lead Author), John Pollard, Richard Pursey, Andrew Roles, Brian Hayden, Douglas Craig, Stafford Hunt.

As a code of practice, this PAS takes the form of guidance and recommendations. It should not be quoted as if it were a specification, and particular care should be taken to ensure that claims of compliance are not misleading.

This Publicly Available Specification has been prepared and published by BSI, which retains its ownership and copyright. BSI reserves the right to withdraw or amend this Publicly Available Specification on receipt of authoritative advice that it is appropriate to do so. This Publicly Available Specification will be reviewed at intervals not exceeding two years, and any amendments arising from the review will be published as an amended Publicly Available Specification and publicized in Update Standards.

This Publicly Available Specification is not to be regarded as a British Standard.

This Publicly Available Specification does not purport to include all the necessary provisions of a contract. Users are responsible for its correct application. Compliance with this Publicly Available Specification does not of itself confer immunity from legal obligations.

Attention is drawn to the following statutory instruments and regulations:
Basel II: International Convergence of Capital Measurement and Capital Standards: a Revised Framework. Bank for International Settlements Press and Communications, 2005.
The Civil Contingencies Act 2004. Cabinet Office: The Stationery Office.
The Data Protection Act 1998. British Parliament: The Stationery Office.
The Higgs Report on the Role of Non-Executive Directors. Department of Trade and Industry: The Stationery Office, 2001.
The Sarbanes-Oxley Act. 107th Congress of the United States of America, 2002.
The Turnbull Report on Corporate Governance. Department of Trade and Industry: The Stationery Office, 1998.
The Orange Book: Management of Risk, Principles and Concepts. HM Treasury, 2004.




Introduction
This code of practice provides guidance on IT Service Continuity Management (ITSCM). It is intended to complement, rather than replace or supersede, other publications such as PAS 56, BS ISO/IEC 20000, BS ISO/IEC 17799:2005 and ISO 9001 (see Bibliography for further information).
PAS 56 provides guidance on best practice in Business Continuity Management; while it mentions the need for IT Service Continuity, it does not provide the detailed guidelines found in this code of practice.

BS ISO/IEC 20000 provides guidance on best practice in Service Management and, like PAS 56, mentions IT Service Continuity, but not at the level of detail presented in this code of practice.

BS ISO/IEC 17799:2005 provides detailed guidance on best practice in information security management, which is one aspect of IT Service Continuity Management. This code of practice does not directly address information security or physical and environmental security, as these areas are covered by BS ISO/IEC 17799:2005.

ISO 9001 provides guidance on best practice in Quality Management Systems. When implementing any recommendations found within this code of practice, the reader is encouraged to apply the quality assurance and control recommendations found in ISO 9001.

Many organizations believe that a loss of systems infrastructure will not happen to them, or that a loss of such infrastructure will have a relatively low impact. However, while many of those organizations might believe that they have invested in adequate systems resilience, it is often apparent that such confidence is misplaced. In an age in which information technology is becoming ever more pervasive and increasingly critical to the day-to-day operations of many organizations, it is clear that the ability to continue to operate with any degree of success is likely to be severely compromised following loss of IT services. In addition, it is evident that the duration of a tolerable IT outage is becoming ever shorter.

As Figure 1 suggests, there is a continuous cycle in the relationships between several important documents. The IT Strategy defines the organization's key policies and direction regarding information technology, systems and services.
From this, the IT Service Continuity Strategy can be defined to ensure that the policies and standards for IT Service Continuity directly and explicitly support the objectives set out in the IT Strategy. This then enables the organization to define its IT Architecture based upon the requirements and objectives set out in both the IT Strategy and ITSC Strategy. Once the architecture is defined, the organization can then define IT Service Continuity Plans for each element of the architecture. Feedback from (amongst many other sources) rehearsing the ITSC Plans can subsequently be used as input to the next iteration of the IT Strategy.




Figure 1 The relationship between the IT Strategy, ITSC Strategy, IT Architecture and ITSC Plan
[Figure: a continuous cycle linking four elements: IT Strategy, IT Service Continuity Strategy, IT Architecture and IT Service Continuity Plan]

Whilst it is true that major events such as bombs, fires and floods make headline news, the majority of IT-related incidents fall into the category of quiet calamities that only affect an individual or a small subset of the organization. Examples of such common incidents include the theft of a mobile worker's notebook computer, the failure of an important business application and the corruption of important or confidential data. These incidents have the potential to damage an organization's brand or public image and its reputation, not to mention its revenues and customer service. Such damage has the potential to destroy that organization unless appropriate action is taken to implement IT Service Continuity (ITSC).

In order to retain an appropriate sense of perspective, this document refers to incidents and events rather than disasters. Since the Asian Tsunami of 2004 and Hurricane Katrina in 2005, the phrase "disaster recovery" has taken on dimensions previously unknown, and the authors felt it was inappropriate to describe the failure of IT systems, however disruptive, using the same language.

Throughout the document the reader may encounter terminology which is used in other standards. To avoid ambiguity, the reader should refer to the definitions section to understand how such terminology is used in this document, which may differ from other standards.

This document is intended to be read by a number of different audiences:

Executive and Senior Management, to gain a high-level understanding of the fundamental interdependencies between Corporate Governance, Business Continuity and IT Service Continuity, in order to make better-informed investment decisions relating to ITSCM;

Middle Management, to understand how decisions should be made regarding IT Service Continuity such that critical business processes survive disruption (ideally) or at the very least have the ability to recover from disruption in the timescales required by the organization;

IT Management, to understand the decision-making processes required in order to ensure that IT Service Continuity strategies and plans fully support business priorities;

IT Support and Operations, to gain a practical insight into how IT Service Continuity strategies should be drawn up and implemented in such a way as to add value to the organization as well as protecting it from IT-related incidents;

Regulators, auditors, insurance and benchmarking organizations, to understand what best practice in IT Service Continuity Management implies for organizations, so that these measures can be assessed as part of wider reviews of Corporate Governance and resilience.

This code of practice is designed for organizations of all shapes and sizes, whether in the private or public sectors.




It should not be regarded as a step-by-step guide to implementing IT Service Continuity Management, but as guidance on the aspects of ITSCM which organizations should consider when investing in this area. Not all activities described herein will be applicable or appropriate for all organizations. In particular, small organizations should aim to use this code of practice as a reference guide to help them make informed decisions about what level of ITSCM would be appropriate for them, given their individual characteristics.

Throughout this code of practice certain terms have been used which may cause confusion. Such confusion is naturally not the intention of the authors, so the following guidance should be borne in mind when reading this document:

The term "business" is used when referring to the non-IT elements of an organization. This should not be taken to imply that this code of practice is aimed purely at private sector or commercial bodies. In each such instance, the term is used merely as convenient shorthand to avoid over-complicating the language used herein.

This code of practice refers to "rehearsing" ITSC Plans. Other publications in this field have referred to "testing" and also to "exercising". The authors regard these terms as largely interchangeable and have opted to use the term "rehearsing" in this context, as it implies not only testing that ITSC Plans are accurate and capable of being implemented, but also that the people required to implement them are guided, supported and provided with feedback on their own personal performance as well as that of the Plans. The authors did not feel that either of the other terms quite conveyed the necessary emphasis on this aspect.

The term "data centre" is used to mean any location or facility where core information technology services are housed, whether that be the ultra-modern data centres that major organizations use or under the desk where a one-person business keeps its file server.
No inference should be drawn regarding the applicability of guidance or recommendations to any type of non-data centre environment.

1 Scope
This Publicly Available Specification (PAS) explains the principles and some recommended techniques for IT Service Continuity management. It is intended for use by persons responsible for implementing, delivering and managing IT Service Continuity within an organization.

This PAS provides a generic framework and guidelines for a continuity programme, including the following topics:
a) the required management structure, roles and responsibilities for implementing IT Service Continuity management;
b) how business criticality, risk assessments and business impact assessments should be performed to produce usable results;
c) what business continuity plans contain, and the steps required to respond to, and recover from, the identified risks within the context of specified business processes;
d) how the development, rehearsal and deployment of the Business Continuity plan does not have to cost more in terms of money, risk or reputation than taking no action;
e) why a framework and capability should be developed for the organization to respond effectively to unexpected disruption.

This document is not intended to be used as step-by-step instructions for conducting any of the activities described herein. It is intended to provide an overview of a complete process, on the assumption that information will already exist within the organization that would otherwise be identified by activities described in this document. Where this is the case, users of this document are encouraged to review the information in their possession to ensure that it includes all of the details required, and that it is up to date and accurate.



2 Terms and definitions


For the purpose of this PAS, the following terms and definitions apply.

2.1 abnormal service
level of service that deviates from the levels agreed for normal operations
NOTE Usually as a result of an incident causing disruption to normal service levels.

2.2 action plan
schedule of activities, lead times and dependencies of activities in order to address a particular requirement

2.3 asynchronous replication
periodic physical replication of data from one storage system to another
NOTE Typically over a wide area network.

2.4 atomic
requirement, transaction or objective which is self-contained, i.e. cannot be broken down further

2.5 audit log shipping
automated process for transferring records of transactions (audit logs) between primary and secondary systems

2.6 business continuity management plan
document that sets out to ensure resumption of critical business functions in the event of either an incident or an unforeseen event that threatens the business

2.7 clustered system
two or more computer systems configured in such a manner that, in the event of failure of a system or a service run on it, operation is transferred to another system within the cluster

2.8 cold back-up site
site that provides the space, but not the infrastructure, needed to resume operations quickly

2.9 continuity procedures
set of predefined procedures to be followed in the event of an incident which disrupts normal service levels

2.10 data availability measure
a system's ability to deliver a predetermined level of data access during a system failure

2.11 dependency modelling
activity used to determine the inter-relationships and dependencies between functions and/or processes and how they affect the system or organization as a whole

2.12 disk imaging
method of copying the complete hard disk of a computer into a single file, from which the gathered image can be distributed to one or more computers to minimize the time and effort of creating computers with software and configurations identical to the original

2.13 domain
logical association of a defined environment and the assets within that pre-defined environment

2.14 downtime vs. cost vs. benefit model
model which analyses the costs of downtime, and of the measures required to minimize downtime in the event of an incident, and compares them against the benefits available to the organization from services being resumed

2.15 duplexed
ability to simultaneously send and receive data through a medium in both directions
NOTE When used to describe disk devices or disk connectivity it implies duplication or mirroring.

2.16 fail-back
return of service/operation from a fail-over site

2.17 fail-over
ability for services offered by a component, server or system to be undertaken automatically by another component, server or system in the event of its failure, so that losing that device, server or system has a minimal impact on the service or services offered

2.18 failure modes and effects analysis (FMEA)
structured quality method to identify and counter weak points in the early conception phase of products and processes1)

2.19 incident
event that disrupts normal IT services
NOTE This usage differs from that in ITIL [1].

2.20 incident recovery
activities required to respond effectively to an incident, with the primary objective being to ensure the resumption of normal service levels

2.21 I/O processors
processors which allow servers, workstations and storage subsystems to transfer data faster, reduce communication bottlenecks and improve overall system performance by offloading I/O processing functions from the host CPU2)

2.22 IP address
logical address of a system within an IP network
NOTE The IP address uniquely identifies computers on a network. An IP address can be private, for use on a Local Area Network (LAN), or public, for use on the Internet or another WAN. IP addresses can be determined statically (assigned to a computer by a system administrator) or dynamically (assigned by another device on the network on demand).

2.23 IT architecture
overall design of an organization's information technology and services, including both physical and logical entities

2.24 IT infrastructure
physical devices which comprise an organization's information technology and services architecture

2.25 IT service
set of related information technology (and probably non-information technology) functionality which is provided to end-users as a service
NOTE Examples of IT services include messaging, business applications, file and print services, network services, and help desk services3).

2.26 IT Service Continuity Management
discipline which supports the overall Business Continuity Management process by ensuring that the required information technology technical and services facilities (including computer systems, networks, applications, telecommunications, technical support and service desk) can be recovered within required, and agreed, business timescales

2.27 last mile telecoms provider
organization responsible for the provision of telecommunications services from the national or local telecommunications infrastructure to a specific location

2.28 latency
delay due to the time it takes to transmit data from one location to another

2.29 maintenance procedures
procedures applied by an organization to ensure that its IT infrastructure is maintained in optimum condition through both proactive and reactive measures

2.30 Monte Carlo analysis
means of statistical evaluation of mathematical functions using random samples, often used in risk analysis of highly complex systems

2.31 Network Attached Storage (NAS)
storage device that can be attached to the network for the purpose of file sharing
NOTE In essence a NAS device is simply a file server.

2.32 network protocol
technological rules, codes, encryption, and data transmission and receiving techniques which allow networks to operate

2.33 operations bridge
central facility used for monitoring and managing systems, services and networks

2.34 paper test
mechanism for proving the hypothetical effectiveness of a process by working through scenarios in a discursive forum

2.35 point in time (PIT)
consistent copy of the data taken at the same instant in time for one or more systems

2.36 recovery procedures
procedures which result in the restoration of services following an incident

2.37 redundant routing
resilient approach to data networking in which there are a minimum of two routes from each node in the network

2.38 rehearsing
the critical testing of ITSC strategies and plans, rehearsing the roles of team members and staff, and testing the recovery or continuity of an organization's systems (e.g. technology, telephony, administration) to demonstrate ITSC competence and capability
NOTE A rehearsal may involve invoking business continuity procedures, but is more likely to involve the simulation of a business continuity incident, announced or unannounced, in which participants role-play in order to assess what issues may arise, prior to a real invocation.

2.39 replication appliance
device which provides functionality to replicate data to other storage systems

2.40 risk
combination of the probability of an event and its consequence [ISO Guide 73:2002]

2.41 risk communication
exchange or sharing of information about risk between the decision-maker and other stakeholders [ISO Guide 73:2002]

2.42 risk management plan
document that sets out to define a list of activities, lead times and dependencies in order to mitigate one or more identified risks

2.43 risk mitigation
set of actions that will affect either the probability of the risk occurring or its impact should the risk occur; these are commonly summarized as transferring, tolerating, terminating or treating the risk

2.44 risk monitoring
iterative process of the risk owner checking and reporting on any changes in the status of the risk log in terms of risk proximity, impact and response

2.45 stateful/stateless
terms describing whether a computer or computer program is designed to note and remember one or more preceding events in a given sequence of interactions with a user, another computer or program, a device, or another outside element
NOTE Stateful means the computer or program keeps track of the state of interaction, usually by setting values in a storage field designated for that purpose. Stateless means there is no record of previous interactions, and each interaction request has to be handled based entirely on the information that comes with it. Stateful and stateless are derived from the usage of "state" as a set of conditions at a moment in time. (Computers are inherently stateful in operation, so these terms are used in the context of a particular set of interactions, not of how computers work in general.)

2.46 storage array
two or more hard disk drives working in unison to improve fault tolerance and performance

2.47 synchronous replication
instantaneous physical replication of data from one storage area to another, typically over a high-speed interconnect such as Fibre Channel

2.48 test scripts
definition of the specific tests to be enacted when proving the functionality and operation of a system or service

2.49 vulnerability report
report which identifies the specific vulnerabilities of a specific system or service

2.50 work schedule
defined set of activities and deliverables which, once completed, will result in the desired outcome of a procedure or project

2.51 Zero Data Loss (ZDL)
remote replication method that guarantees not to lose any live data

2.52 zoning
allocation of resources for device load balancing, and for selectively allowing access to data only to specific systems
NOTE Zoning allows an administrator to control who can see what in a SAN.

1) http://www.fmeainfocentre.com/
2) http://www.intel.com/design/iio/
3) http://whatis.com

3 Abbreviations
For the purpose of this PAS, the following abbreviations apply.
BCM Business Continuity Management
BCMP Business Continuity Management Plan
BCMT Business Continuity Management Team
BCSG Business Continuity Steering Group
CMT Crisis Management Team
DAS Direct Attached Storage
DBMS Database Management System
IMT Incident Management Team
I/O Input/Output
IT Information Technology (also includes Information Systems (IS))
ITIL Information Technology Infrastructure Library
ITSC Information Technology Service Continuity
NAS Network Attached Storage
OS Operating System
RAID Redundant Array of Independent Disks
RPO Recovery Point Objective
RTO Recovery Time Objective
SAN Storage Area Network
UPS Uninterruptible Power Supply
WAN Wide Area Network


4 IT Service Continuity management


Information Technology Service Continuity (ITSC) is the collection of policies, standards, processes and tools through which organizations not only improve their ability to respond when major system failures occur, but also improve their resilience to major incidents such that critical systems and services do not fail. It is related to a number of disciplines and should be undertaken with a complete and thorough understanding of the organization's policies, standards, processes and supporting services for:
a) Business Continuity Management;
b) Major Incident and Crisis Management;
c) Corporate Governance and Risk Management;
d) Information Technology (IT) Governance;
e) Information Security and Data Protection.

ITSC management should also have a significant influence on IT strategy to identify information systems and services which require high levels of resilience, availability and capacity.

The purpose of risk management and ITSC management is not simply to be able to say that risk-based control mechanisms have been implemented. The management of risk can result in many tangible and intangible benefits to the organization if implemented with commitment and the right motivation. Risk management can be used to improve product quality, productivity, financial performance and working conditions. These benefits should be at the forefront of the participants' thinking throughout the process. Risk management can be seen as a positive approach to improving all aspects of the organization's performance. Every stakeholder can make a significant, positive contribution by considering, on a regular basis, the ways in which the organization's ability to achieve its objectives could be at risk. In order to do so, the organization's objectives should be communicated clearly to everyone involved in their achievement, and communication on risks should be encouraged.
ITSC management addresses risks that could cause a sudden and serious impact, such that they could immediately threaten the continuity of the business. These typically include:
a) loss, damage or denial of access to key infrastructure services;
b) failure or non-performance of critical providers, distributors or other third parties;
c) loss or corruption of key information;
d) sabotage, extortion or commercial espionage;
e) deliberate infiltration of, or attack on, critical information systems.

Business Continuity Management (BCM) is concerned with managing risks to ensure that at all times an organization can continue operating to, at least, a pre-determined minimum level. The BCM process involves reducing the risk to an acceptable level and planning for the recovery of business processes should a risk materialize and a disruption to the business occur. In essence, ITSC management should be part of the overall Business Continuity plan and not dealt with in isolation.
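Clause 2 defines risk as the combination of the probability of an event and its consequence (after ISO Guide 73:2002). A minimal sketch of how that definition can be applied to rank risks such as those listed above follows; the scoring scheme, register entries and all figures are this example's own illustrative assumptions, not values from the PAS.

```python
def risk_score(probability: float, impact: float) -> float:
    """Risk expressed as probability multiplied by consequence, per the
    ISO Guide 73-based definition of risk in Clause 2."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * impact

# Hypothetical risk register: (description, annual probability, impact in currency units).
register = [
    ("loss of key infrastructure services", 0.10, 500_000),
    ("failure of a critical provider", 0.05, 750_000),
    ("corruption of key information", 0.20, 200_000),
]

# Rank risks by expected annual loss, highest first, to prioritize mitigation effort.
ranked = sorted(register, key=lambda r: risk_score(r[1], r[2]), reverse=True)
```

Note that a low-probability, high-impact risk (the provider failure here) can score below a more frequent, cheaper one; a real assessment would also weigh qualitative factors, as Clause 7 and Annex A discuss.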



5 IT Service Continuity strategy


5.1 Defining an IT Service Continuity strategy
NOTE The following frameworks are well regarded and respected, and can be used for additional information when creating a comprehensive IT Strategy that will assist in clarifying and defining your ITSC Strategy. For additional reading please refer to:
CMM, Capability Maturity Model, http://www.itservicecmm.org;
CobiT 4.0, Control Objectives for Information and related Technology, http://www.itgi.org;
ITIL, IT Infrastructure Library, http://www.itil.co.uk.

An ITSC strategy should define the direction and high-level methods that will meet IT service level objectives. It should ensure that the business is never compromised by a lack of IT availability beyond acceptable, predefined and regularly reviewed levels of uptime and performance. The ITSC strategy should be agreed at Board level and ideally be fully endorsed by the CEO. A Board member should be accountable for the strategy, and the strategy should be referred to when deciding on new business initiatives, including mergers and acquisitions, directional change and any decision that could have an impact on ITSC.
In devising an organization's ITSC strategy, it is advisable to consider four discrete but linked stages in the management of a major incident, as shown in Figure 2:
a) initial response: covering the initial actions required to ensure the safety and welfare of people affected by the incident, to activate the relevant incident management teams and to determine the level of response appropriate to the incident;
b) service recovery: this may take place in a number of stages depending upon the needs and scale of the organization, but should involve the restoration of all required services, in priority order, to pre-agreed (possibly degraded) levels of service;
c) service delivery in abnormal circumstances: until the organization is ready and able to resume normal service operations, there is still a need to continue to operate the required services at the pre-agreed service levels, until circumstances permit these abnormal services to be failed back to business as usual and decommissioned;
d) normal service resumption: as with service recovery, the resumption of normal service may take place in stages according to the needs and priorities of the organization. Only when each service has been validated and verified as being back to normal should the secondary systems be decommissioned. This stage is only complete when all of the organization's IT services are restored to normal service levels.
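The four stages above form a simple linear progression, which can be sketched as a small state machine. This is purely an illustration of the lifecycle described in 5.1; the class and transition names are this example's own, not terminology defined by the PAS.

```python
from enum import Enum, auto

class IncidentStage(Enum):
    """Illustrative names for the four major-incident stages of 5.1."""
    INITIAL_RESPONSE = auto()
    SERVICE_RECOVERY = auto()
    ABNORMAL_SERVICE_DELIVERY = auto()
    NORMAL_SERVICE_RESUMPTION = auto()

# Allowed forward transitions: each stage leads to exactly one successor.
TRANSITIONS = {
    IncidentStage.INITIAL_RESPONSE: IncidentStage.SERVICE_RECOVERY,
    IncidentStage.SERVICE_RECOVERY: IncidentStage.ABNORMAL_SERVICE_DELIVERY,
    IncidentStage.ABNORMAL_SERVICE_DELIVERY: IncidentStage.NORMAL_SERVICE_RESUMPTION,
}

def advance(stage: IncidentStage) -> IncidentStage:
    """Move to the next stage; resumption of normal service is terminal."""
    if stage not in TRANSITIONS:
        raise ValueError(f"{stage.name} is the final stage")
    return TRANSITIONS[stage]
```

Treating the lifecycle this way makes it easy for a rehearsal tool or incident log to reject out-of-order stage changes, reinforcing the point that each stage should be completed and verified before the next begins.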

Figure 2 Major Incident Management
[Figure: a cycle of four stages: initial response, service recovery, service delivery in abnormal circumstances, resumption of normal service]



The ITSC strategy should enable the organization to plan for and rehearse the whole life cycle of a major incident, from the point of initial disruption, through recovery and abnormal service, to the point where normal service levels are once again guaranteed. The strategy should be developed from a clear understanding of the organization's need for IT services and the agreed service levels that are required from time to time, taking into account:
a) priority for key business units at given moments in time;
b) peak loads on the business;
c) strategically important business periods, e.g. reporting periods, manufacturing deadlines etc.;
d) compliance with Business Continuity Management Plans and objectives;
e) investment vs. risk;
f) impact of failure or loss;
g) recovery time objectives;
h) acceptable levels of downtime and performance;
i) system changes and upgrades;
j) new projects;
k) interdependencies;

l) compliance with legislation;
m) deadline management;
n) rehearsing recovery plans;
o) data protection;
p) data availability;
q) plan maintenance;
r) education and awareness programmes for all IT staff.

The strategy should not define the detailed tactics, but should set the direction of the individual components of an ITSC plan.
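Weighing investment against risk, impact of failure and acceptable downtime (items e), f) and h) above) is what the "downtime vs. cost vs. benefit model" of Clause 2 does. A minimal sketch of such a comparison follows; the function names and all figures are this example's own hypothetical inputs.

```python
def downtime_cost(hours_down: float, cost_per_hour: float) -> float:
    """Direct cost of an outage, assuming a flat hourly cost of downtime."""
    return hours_down * cost_per_hour

def measure_is_worthwhile(hours_without: float, hours_with: float,
                          cost_per_hour: float, annual_measure_cost: float) -> bool:
    """A continuity measure pays for itself when the downtime cost it avoids
    exceeds what the measure costs to run. All inputs are estimates an
    organization would derive from its own impact assessments."""
    avoided = downtime_cost(hours_without, cost_per_hour) - downtime_cost(hours_with, cost_per_hour)
    return avoided > annual_measure_cost

# Example: an incident expected to cause 40 hours of downtime unmitigated,
# reduced to 4 hours with a standby facility costing 250,000 per year,
# at a downtime cost of 10,000 per hour.
justified = measure_is_worthwhile(40, 4, 10_000, 250_000)
```

In practice the hourly cost of downtime is rarely flat (it varies with the strategically important business periods noted in item c) above), so a real model would weight hours by business period rather than use a single rate.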

5.2 Creating an ITSC strategy


The ITSC strategy should ideally be a by-product of a Business Continuity Management Plan (BCMP), but can be defined without one. Where a BCMP exists, those responsible for IT service levels are likely to have contributed to the plan and to be aware already of its implications for the IT strategy and direction. As shown in Figure 3, an ITSC strategy should have six main elements, all of which are part of a continuous cyclical process.

Figure 3 ITSC strategy elements
[Figure: a continuous cycle of six elements: Understand Requirements; Review Strategy and Update Objectives; Understand Dependencies; Instill Continuity Culture; Rehearse, Exercise and Audit; Monitor]


The six elements in the ITSC strategy process are:
a) Understanding the business requirements and agreeing service levels. The ITSC strategy should be aligned with corporate strategy and in line with the pre-defined business goals of the organization. To provide a sound base to start from, and to allow for continued and controlled growth of the ITSC strategy, a defined service level should be agreed. This will allow for clearly defined service levels which are specific and can be measured, analysed and improved upon.
NOTE Refer to ITIL [1] for guidance on agreeing service levels.
b) Reviewing the IT strategic vision and updating objectives. Change is inevitable, and allowances should be made to ensure that the ITSC strategy is always aligned with the overall IT strategy and business goals. The overall IT strategy and its ITSC strategy should be constantly assessed and updated.
c) Performing regular risk assessment and dependency modelling (internal and external). Changes in the nature of the risks and dependencies should be expected; in order to ensure that these variations are controlled and taken into account in the ITSC strategy, they should be reassessed regularly. The quantity and quality of the regular dependency modelling exercises will depend on the nature of the environment.
d) Building and embedding an ITSC culture from new project inception. When initiating an IT project, consideration should be given to how the deliverables will support or enhance the ITSC strategy.
e) Exercising, maintaining and auditing continuity and recovery plans. Through continued rehearsal, maintenance and auditing of the continuity and recovery plans, an organization can ensure that:
1) the plans are constantly improved upon and kept up to date;
2) staff are familiar with the plans' contents;
3) the plans are fit for purpose and relevant;
4) resource requirements for the management of a major incident are understood and planned for;
5) visibility and governance are provided for the teams exercising the plans and for the organization as a whole.
f) Monitoring performance. Constant improvements can only be made if the pre-defined service levels agreed in the first stage are constantly monitored, measured, analysed and reported on. Monitoring performance should be central to this.
Everyone within the organization who is likely to participate should be reminded of the importance of initiating an ITSC programme and the priority it should be afforded.
NOTE Experience shows that a better quality of response is achieved if appropriately worded instructions are received from an executive level within the organization. This is particularly true within larger organizations where a variety of individuals and levels of management are likely to be involved.
A suitable mandate should be sent to the relevant department heads and others who may be involved in the programme as a matter of priority before the commencement of the programme.
The business should define the levels of service it expects from the IT department. It should clearly define priorities, allowing the Head of IT to determine which services are given greater protection, resilience and redundancy than others.
IT resilience costs money. The depth and detail of an ITSC strategy will, for many companies, be driven by risk versus cost. A strategy should be developed that delivers improvement over time, focusing on key issues and priorities where risks are high and the impact of loss or failure significant. This is not a simple algorithm and will require an impact and risk analysis before budgets can be assigned and a strategy ratified. An initial strategy should be chosen either by defining a strategy around the priorities and calculating the associated costs of delivery, or by allocating a budget and then building a strategy around the available budget.
NOTE It is rare that an organization will have implemented an ITSC plan from the outset of IT infrastructure implementation. Many IT environments have evolved over time and are often a combination of a number of inconsistent strategies.

5.3 The role of the Board and Executive


At Board or Executive level, business priorities should be defined, focusing on key deliverables such as manufacturing deadlines, financial reporting, customer service levels etc. The impact of loss or failure should be defined; this is often a by-product of a BCMP. These priorities should help to define which areas of the IT infrastructure need strengthening to improve resilience.
The Head of IT should then determine which IT processes and systems the key business priorities depend upon. This dependency modelling should show the impact of individual or collective component failure. It should allow the strategist to identify vulnerable weaknesses in an IT infrastructure and measure the impact of failure at an operational level.
The Board of Directors (or equivalent) should have a vision for the organization's future, a growth plan and a series of objectives. This information acts as a long-term steer for the ITSC strategy, which will in time help steer investment decisions and corporate direction.
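Dependency modelling of the kind described above can be prototyped very simply. The sketch below walks a dependency map to find everything affected by a component failure; all of the component and service names are invented for illustration and are not drawn from this PAS:

```python
# Hypothetical dependency model: business priorities depend on IT services,
# which depend on infrastructure components. All names are invented.

DEPENDS_ON = {
    "financial_reporting": ["ledger_service"],
    "customer_service": ["crm_service", "telephony"],
    "ledger_service": ["db_cluster", "san"],
    "crm_service": ["db_cluster", "web_farm"],
}

def affected_by(component, model=DEPENDS_ON):
    """Return every item that directly or indirectly depends on `component`,
    i.e. the collective impact of that single component failing."""
    impacted = set()
    changed = True
    while changed:
        changed = False
        for item, deps in model.items():
            if item not in impacted and (
                component in deps or impacted.intersection(deps)
            ):
                impacted.add(item)
                changed = True
    return impacted
```

Running `affected_by("db_cluster")` over this invented model shows both IT services and, transitively, both business priorities being hit, which is exactly the kind of collective-failure visibility the clause calls for.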


5.4 Identifying requirements and weaknesses


The foundation of an ITSC strategy should be to ensure embedded resilience throughout the IT infrastructure. The Head of IT should commission an internal review of all areas of potential weakness, from single points of failure through to redundancy, supply chain dependence and general IT housekeeping processes such as secure back-up and restore technology. From this review, a strategy of improved resilience can be determined. An ITSC strategy should make use of history and trend reports highlighting downtime experience, proven areas of weakness and service level reports.
Acting on a vulnerability report is highly likely to involve expenditure. Key to implementing an ITSC strategy is to measure the cost of downtime vs. resilience expenditure, i.e. impact and risk vs. cost. This should be done by working with the Board to calculate the cost of downtime for key business functions on an hourly basis. This determines budgets and also focuses attention on the service level agreements required as a result of measured downtime. Department Heads should provide their service level requirements and the level of uptime they require within a steady state, as well as the recovery time objectives after an incident. These should be measured using a downtime vs. cost vs. benefit model.
Advancements in technology carry an inherent demand for constant change and improvement. Enhancements to an IT infrastructure should be planned, rehearsed and carefully managed, with clear contingency plans in place should the implementation fail. The ITSC strategy should take into account a tolerance for downtime for major system upgrades and changes. Should scheduled downtime be unacceptable, then plans should be put in place for duplicate environments running parallel systems. Agreement should be reached on levels of foreseen downtime prior to defining the strategy.
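The hourly downtime costing described above lends itself to a very small model. The sketch below is illustrative only; all figures are invented, and a real assessment would use the organization's own impact analysis rather than these placeholder numbers:

```python
# Illustrative downtime vs. cost vs. benefit comparison. All figures are
# invented placeholders, not derived from this PAS or any real assessment.

def annual_downtime_cost(hourly_cost, expected_hours_down_per_year):
    """Cost of downtime for one business function over a year."""
    return hourly_cost * expected_hours_down_per_year

def worth_investing(hourly_cost, hours_without, hours_with, annual_spend):
    """Crude test: does the proposed resilience spend cost less than the
    downtime it is expected to avoid over a year?"""
    saved = annual_downtime_cost(hourly_cost, hours_without - hours_with)
    return annual_spend < saved
```

For example, a function costing 10,000 per hour of downtime, expected to lose 8 hours a year without investment and 1 hour with it, would justify an annual resilience spend below 70,000 on this crude model; the model deliberately ignores intangibles such as reputational damage, which the clause's impact analysis would also have to weigh.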
Due consideration and research should be undertaken on current and forthcoming legislative requirements, as well as good practice guidelines on aspects of business continuity and IT resilience. The Board (or equivalent), and especially Non-Executive Directors if present, should steer the organization towards compliance and healthy business management.
The ITSC strategy should also include an ongoing and continuous process for change management, including the involvement of third party suppliers as well as internal customers.
Agreement should be reached at Board or Executive level on levels of investment and priorities for expenditure on IT resilience. The Board should also agree its policy on outsourcing risk management to third party suppliers, e.g. incident recovery companies and third party maintenance organizations (see Clause 11).

5.5 Management structure and roles


5.5.1 General
The management structure should be a standard, three-tiered structure used widely within both the private and public sectors for major incident and crisis management. The three tiers are:
a) Bronze, operational level: Incident Management Team (IMT);
b) Silver, tactical level: Business Continuity Management Team (BCMT);
c) Gold, strategic level: Crisis Management Team (CMT).
In terms of relative size and subsidiarity, the relationship between these teams is illustrated in Figure 4.

Figure 4 Management structure

[Figure 4 shows the three tiers: Gold, Crisis Management Team; Silver, Business Continuity Management Team; Bronze, Incident Management Team.]


Depending upon the seriousness of the incident, ITSC should be managed at both Silver (BCMT) and Bronze (IMT) levels, with Bronze teams being established for each IT service, coordinated by a dedicated IT service Silver team. The Gold team (CMT) should provide high level direction, prioritisation and coordination, and should take sole and direct responsibility for communicating with external stakeholders, including the media, emergency services and public authorities where that is appropriate or required.
Some organizations (mainly large) have internal teams specializing in Corporate Risk and Business Continuity Management, from whom support and assistance should be sought when constructing, rehearsing and invoking ITSC plans. As these teams are most likely to comprise members of the organization's management team, i.e. not business continuity or incident management specialists, it is possible that for both rehearsals and live activations additional specialist support may be appropriate.
Consideration should also be given to the resource requirements implied by the activation of these teams and their associated ITSC plans, since there is a possibility that more resources will be required during the initial stages than are directly available within the organization.
5.5.2 Incident Management Team
In the event of disruption, whether from a predicted source or elsewhere, the IMT for the affected IT service should be activated. The IMT should determine whether the nature and extent of the disruption warrant the deployment of the relevant BCMP, and if so should:
a) determine the nature of the disruption and, if necessary, coordinate with the organization's Problem Management4) function to adapt existing procedures for the initial response to the disruption;
b) implement the selected procedures, securing the required resources through the BCMT where appropriate;
c) identify, and where appropriate adapt, the relevant continuity procedures to ensure that the business continues to operate in as near to normal a manner as possible for the duration of the disruption. All such activities should be coordinated through the BCMT;
d) identify, and where appropriate adapt, the relevant recovery procedures to ensure that the business recovers from disruption in a timely and controlled manner once the root cause of the disruption has been eliminated. All such activities should be coordinated through the BCMT;
e) activate the BCMT, which investigates the requirement for further Bronze teams to be activated. Whilst the IMT is active, all activities should be coordinated through the BCMT to ensure that no action taken by one IMT conflicts with actions taken by others;
f) communicate with all parts of the organization affected by the disruption on a regular basis regarding progress and the actions initiated by the IMT;
g) organize, once recovery actions have been completed, a thorough review of its management of the disruption so that all relevant lessons from the experience can be learned and incorporated into procedures and training programmes.
5.5.3 Business Continuity Management Team
The BCMT should determine the nature and extent of the disruption and should:
a) coordinate the activation and management of all relevant IMTs;
b) coordinate the allocation of resources to IMTs;
c) coordinate the management of undisrupted business;
d) manage communication with regulators, investors, the media, associates and staff;
e) ensure that the active IMTs have all the facilities, people and other resources that they require to mount effective response, continuity and recovery operations;
f) where appropriate, activate the CMT.
5.5.4 Crisis Management Team
The CMT should be activated when an incident, or a combination of incidents, has such wide-ranging impact as to require organization-wide response coordination. When active, the CMT should take responsibility for the coordination of all active BCMTs and for all communication with all external stakeholders, especially customers, suppliers, regulators, the media and the public.
5.5.5 Learning lessons
In order to ensure that each real or rehearsed invocation of the ITSC management structure contributes to the ongoing improvement of the ITSC strategy and related plans, each team should maintain a comprehensive journal, including details of:
a) the reasons for the team being activated, including details of the disruption that occurred and justification for the team being activated;
b) any amendments made to the ITSC strategy and related plans and procedures as a result of the actual disruption being different in character to that predicted;
c) all decisions made during the disruption, including supporting evidence;

4) http://whatis.com


d) all events transpiring during the disruption, their effects and likely causes;
e) all actions taken and evidence of their results;
f) all communication in relation to the disruption, including the other parties involved, the nature of the communication and what information was passed in each direction.
This journal should cover the period from the time the team is activated to the time it stands down. All entries in the journal should include details of the date and time the entry was made, and by whom. The completed journals should be used to support future reviews of business and IT service continuity plans and their effectiveness. Therefore, stringent change control should be applied to these journals, and no changes of any nature should be permitted once the team has stood down.
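The journal requirements above, timestamped attributed entries, and no changes once the team has stood down, can be sketched as a small append-only structure. This is an illustration of the discipline, not a record format mandated by this PAS; the field names are assumptions:

```python
# Sketch of the team journal described in 5.5.5: append-only entries with
# date/time and author, closed to all changes once the team stands down.
# Field names and structure are invented for illustration.

from datetime import datetime, timezone

class IncidentJournal:
    def __init__(self, team):
        self.team = team
        self.entries = []
        self.stood_down = False

    def record(self, author, category, text):
        """Append one entry; refuse any entry after stand-down."""
        if self.stood_down:
            raise RuntimeError("journal is closed: team has stood down")
        self.entries.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "author": author,
            "category": category,  # e.g. "decision", "event", "action"
            "text": text,
        })

    def stand_down(self):
        """Freeze the journal for use in future plan reviews."""
        self.stood_down = True
```

In practice the same discipline would be enforced organizationally (change control, write-once storage) rather than solely in code, but the invariant is the same: entries accumulate with date, time and author, and the record is immutable once the team stands down.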

5.6 IT Service Continuity in a changing environment

Business is by its very nature dynamic. It changes regularly, and with change comes risk: not only the risk of failure but the risk of destabilizing existing policies and strategies. The ITSC strategy should therefore be resilient to change and also adaptable. The key factors that should be considered to ensure that the ITSC strategy and plans remain appropriate for the organization as it and its environment change include the following.
a) Board level responsibility and accountability for the ITSC strategy helps to keep the strategy current as the organization changes, develops and grows. BCM and ITSCM should be a high-profile ingredient in Board level thinking and should be the most important aspect of any continuity plan.
b) The change management process should include all parties responsible for the ITSC strategy, both its compilation and its delivery. No change to the IT infrastructure should be considered until the implications of the change have been assessed and understood and contingency plans have been rehearsed.
c) The procurement process for new IT systems should include sign-off that resilience has not been compromised by even the most simple of upgrades or improvements. Non-IT expenditure can still have an impact on IT resilience, for example recruitment (system overheads) or marketing campaigns (web site activity).
d) Due diligence on merger and acquisition (M&A) activity should include a resilience assessment. Often, M&A activity can bring perceived cost saving benefits such as branch or office closure. This can also reduce resilience through loss of fail-over sites, loss of secondary systems and loss of inherent redundancy.
e) Service levels (e.g. uptime statistics) should be reviewed as a Board agenda item each month. Trend analysis can show even a slight decline in service, which can be an indicator of bigger problems.
f) Testing and rehearsing contingency and recovery plans is an essential ingredient in keeping an ITSC strategy current. Ensuring that a department or application can be recovered fully after failure helps ensure that simple errors and problems are minimized. This includes performing complete data back-ups as well as testing third party suppliers.
g) Suppliers' ability to maintain appropriate levels of service should be regularly assessed. Including suppliers such as incident recovery and maintenance providers in the change management loop is highly recommended.
h) Remunerating staff against service levels can help ensure that the relevant level of awareness reaches all levels of the organization.
i) Plans should be subject to internal and/or external audit.
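The trend analysis of monthly service-level statistics mentioned above can be illustrated with a very small check: flag a service whose uptime has fallen for several consecutive months, even if each individual figure still looks acceptable. The three-month window and the figures are assumptions for illustration:

```python
# Illustrative trend check over monthly uptime percentages, as in the
# Board-level service-level review. The window length (3 months) and
# sample figures are invented assumptions.

def declining(uptimes, months=3):
    """True if the last `months` readings are strictly decreasing,
    i.e. a sustained (if slight) decline in service."""
    tail = uptimes[-months:]
    return len(tail) == months and all(
        a > b for a, b in zip(tail, tail[1:])
    )
```

A series such as 99.99, 99.98, 99.95, 99.90 still looks healthy month by month, but the sustained slide is exactly the kind of early indicator of bigger problems that the clause asks the Board review to catch.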


6 Understanding risks and impacts within your organization


6.1 General
Risks are prevalent within any environment. Before commencing any ITSC programme there should be an understanding of potential risks and impacts.
The loss of IT (staff, management or infrastructure) typically results in the loss of the ability to operate and manage an organization's systems infrastructure, with the resultant degradation or loss of critical applications and data. How this affects an organization depends on what it does, its key processes, their dependence on technology, and the duration of the disruption. For example, businesses in the financial sector frequently depend on financial and/or market information feeds and applications in order to manage time bound investments or transactions. An inability to manage investments and other financial vehicles would have a potentially serious impact on the organization's balance sheet and revenues.
In order to fully understand how a disruption in IT service can affect an organization it is necessary to conduct a business criticality and risk assessment (see Annex A), which will identify critical activities and the degree to which they depend on IT. It should also identify the required recovery timescales (RTOs) for the IT services which are vital to those critical activities, as well as the currency of the data which is used in the recovery of those IT services.
Every organization's risk level will be different; however, the outcome of the risk assessment should provide it with sufficient information to evaluate its vulnerabilities in a rational manner and to decide how to deal with them: by eliminating the risk altogether, by investing in resources to mitigate the exposure, or by preparing beforehand for the consequences of the risk, such as having appropriate incident management in place.
By adopting this twin track approach at the start of the ITSC programme one should obtain an understanding of the organization's dependencies on IT infrastructure in terms of the impacts of infrastructure failure (as a whole or in part), and an appreciation of the vulnerabilities present which could give rise to an incident which precipitates those impacts.

6.2 Vulnerability assessment

In parallel with an impact analysis, the potential vulnerabilities prevalent within IT service delivery which might give rise to disruption should be determined. This information can be obtained through a risk assessment which should review the IT infrastructure's exposure in terms of:
a) system resilience and availability;
b) key suppliers and agreements;
c) documentation;
d) hardware and software assets;
e) storage;
f) back-up regimes;
g) staff exposure;
h) staff training;
i) location of buildings and facilities;
j) IT security;
k) systems monitoring;
l) power;
m) data communications;
n) archiving;
o) IT environment and monitoring;
p) telephony;
q) any other relevant exposure.
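One common way to make such an exposure review comparable across areas is a simple likelihood-times-impact score. The sketch below is an assumption-laden illustration only: the 1 to 5 scales, the scores and the selection of areas are invented, and Annex A gives fuller guidance on conducting the assessment itself:

```python
# Illustrative risk-scoring pass over a few of the exposure areas listed
# in 6.2. The 1-5 likelihood/impact scales and all scores are invented
# assumptions; a real assessment would follow Annex A.

exposures = {
    "back-up regimes": {"likelihood": 3, "impact": 5},
    "power": {"likelihood": 2, "impact": 4},
    "staff training": {"likelihood": 4, "impact": 2},
}

def ranked(exposures):
    """Rank exposure areas by likelihood x impact, highest risk first."""
    return sorted(
        exposures,
        key=lambda area: (
            exposures[area]["likelihood"] * exposures[area]["impact"]
        ),
        reverse=True,
    )
```

The ranking gives the "rational manner" the clause asks for when deciding which vulnerabilities to eliminate, mitigate or prepare for first.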


7 Conducting business criticality and risk assessments


A critical initial activity in the development of an ITSC strategy or plan is to identify all business processes and the departments or business functions responsible for their operation, and to categorize each function and process according to its criticality to the business. Subsequently, all IT services which support each business process should be identified and their criticality to the operation of those business processes assessed.
NOTE 1 More detailed guidance is available in Annex A.
NOTE 2 Specific guidance on conducting risk assessments relating to information security can also be found in BS ISO/IEC 17799:2005.
ITSC management addresses the ways in which the following types of activity could be disrupted, stopped or have their performance degraded to unacceptable levels:
a) operation of IT services and processes;
b) IT service resumption following a disruption or failure;
c) new IT service or information systems development projects;
d) readiness and operation of ITSC required to comply with statutory or regulatory requirements.
The organization should be regarded in two ways:
a) physically: an organization exists on one or more sites, each site comprising buildings, which can be broken down in a variety of ways (floor, wing, corridor, office etc.);
b) organizationally: most organizations are structured into a number of Directorates, each of which comprises a number of functions, which in turn comprise departments, processes and activities.
Naturally this naming convention is not intended to be an accurate description of all organizations but a theme which can be readily recognized. It is possible for each physical component to support a number of organizational components, and it is equally possible for a single organizational component to be situated in a number of different physical locations. In order to avoid duplication of effort, the risk assessment process should examine the organization and its IT services from both of these perspectives.
The criticality of each business process should have a direct impact on the criticality of supporting IT services. Suggested designations are shown in Table 1.

Table 1 Business criticality categories

Mandatory: vital to enable the organization to meet statutory or other (internally or externally) imposed requirements.
Critical: vital to the day-to-day operation of the organization.
Strategic: important for the implementation of the long term strategy.
Tactical: important for the achievement of the short to medium term performance objectives of the organization.
NOTE If a system or service cannot readily be assigned to any of these categories, the organization may wish to consider whether that system or service has any ongoing purpose. If however a system or service can be assigned to more than one category the organization should decide on which single category designation will be used.
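The single-category rule in the NOTE above implies some priority ordering when a service could plausibly fall into several categories. The sketch below encodes one such ordering; the ordering itself is an assumption for illustration, since this PAS leaves the choice of single category to the organization:

```python
# Sketch of the Table 1 categories with the NOTE's single-category rule.
# The priority ordering shown (Mandatory first) is an assumption, not
# stated by PAS 77: the organization decides which designation wins.

CATEGORIES = ["Mandatory", "Critical", "Strategic", "Tactical"]

def assign(candidates):
    """Pick one category when a service could fall into several,
    taking the highest-priority candidate under the assumed ordering."""
    for category in CATEGORIES:
        if category in candidates:
            return category
    # Per the NOTE: a service fitting no category may have no ongoing purpose.
    return None
```

A service that looks both Tactical and Critical would be recorded simply as Critical, keeping each system or service under exactly one designation as the NOTE requires.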

The process of assessing business criticality and risk should be managed to ensure that the assessment of physical risks is coordinated with, but not dominated by, the assessment of organizational risks. Neither assessment is more important than the other, but each has its part to play in ensuring that the business as a whole adopts a position in which all types of risk are managed as effectively as possible. The inherent complexity in all organizations implies that any risk assessment method should be adaptable to the different circumstances within each part of the organization. NOTE See Annex A for details on how to conduct business criticality and risk assessments.


8 IT Service Continuity plan


8.1 Definition of an ITSC plan
The ITSC plan is a simple, clear, unambiguous and all encompassing set of documents that define the actions required to restore IT services in the event of an incident. An ITSC plan is a series of working documents which are constantly rehearsed, updated, modified and improved. Depending on the organization's requirements, the ITSC plan can be one document or a series of connected documents. It can be printed on paper or held as electronic/on-line documents. However, the ITSC plan should be readily available in the right place, at the right time and to the right people when an incident occurs, which might mean having hard copies accessible.
The ITSC plan for each service should provide detailed procedures and step-by-step guidelines for each stage in the incident management process, as described in Figure 2 in Clause 5.1.

8.2 Defining an architecture

Before building an ITSC plan, the IT infrastructure should be reviewed to determine whether it has all the components and technology required to allow IT services to continue in the event of an incident. If not, then the systems should be updated to include ITSC components, such as resilient, high availability or redundant systems and data replication mechanisms. This should be done by defining an IT architecture that includes these components. Much as the architecture of an office block will include fire escapes and emergency exits, the IT architecture may include components whose sole purpose is to ensure service continuity.
There are a number of common IT models which can be adopted to facilitate ITSC. Building an IT architecture for a site does not have to be an onerous task; commonly accepted models for IT resiliency and ITSC can be used (see 10.3). Selection of the appropriate model(s) depends on many things, including IT architecture and service continuity considerations.

8.3 Key Service Continuity Factors

There are three key factors which should be balanced prior to deciding on the IT architecture:
a) Recovery Time Objective (RTO): how quickly after an incident the IT service needs to be restored.
b) Recovery Point Objective (RPO): the point in the processing cycle from which the IT service can be resumed.
NOTE This could be at some consistent point prior to the incident, e.g. the time of the last back-up. It comes down to answering the question: how much live data can I afford to lose? If the answer is none, then this will have a big impact on the third factor, cost.
c) Cost: typically, the smaller the RTO and RPO values, the higher the cost of the solution. Essentially, the cost of the technology increases as the time to recover and the amount of data that can be lost decrease. Since the availability and cost of technology solutions change over time, these decisions should be reviewed on a regular basis.
See Annex B for a more detailed discussion of IT Architecture considerations which influence service continuity. See also Annex C for a detailed discussion of virtualization and how such technologies might be used to build resilience into the IT architecture and also assist continuity planning.

8.4 Populating the IT Service Continuity plan

8.4.1 General
If the IT infrastructure supports multiple services (for example, a bank could provide separate independent cashier and mortgage application services), then the ITSC plan should be considered in multiple ways. One aspect is the total failure of a site (or sites); another is the failure of individual IT services within a site.
An ITSC plan should be part of a wider Business Continuity Management Plan and, as such, should adhere to any standards and terminology defined by it. If following an ITIL model for incident and problem management, then the ITSC plan should also fall in line with ITIL processes. The model ITSC plan should contain the procedures to follow from initial response through to resumption of normal service following an incident (see 5.1).
8.4.2 Teams to populate the ITSC plan
In order to populate each part of the plan the following should be prepared.
a) Nominate members of management to form Incident Management Teams. The main role of these teams is to manage the recovery processes for each technology platform, each IT service and all required site facilities. The members of these teams should be trained to understand their responsibilities in the event of an incident.
b) Develop escalation and process flow charts so that, once the decision has been made to invoke, the correct ITSC procedures are followed to allow recovery to commence as quickly as possible.
c) Develop detailed procedures specifying how to recover each component of the IT systems. Although operations


staff will understand how to operate the systems on a day-to-day basis, they may not know how to recover the applications and databases in the event of a major incident. The procedures should be made as comprehensive as possible and maintained up to date as the systems change.
8.4.3 Initial response to an incident: invocation of ITSC procedures
If the organization has an IT service desk, the IMT should be contacted when the service desk is first made aware of an incident. If there are many members of the IMT, then a cascade process may be adopted. Either way, the person making the initial contact should record who has been contacted and the response they received. If a person on the IMT is unavailable, then their responsibilities should be fulfilled by their designated deputy within the team. In any instance, leaving phone messages is not a sufficient response to the incident.
NOTE There may be a complex set of instructions or simply instructions to contact the members of the IMT.
Depending on the seriousness of the incident, the IMT may opt to activate either or both of the BCMT and CMT. If the initial assessment concludes that the IMT can manage the incident directly, then the other groups can be stood down. The rules and decision making criteria for activation and escalation will be organization specific and should be developed as part of the ITSC plan.
NOTE How the organization defines an incident will affect how the problem is escalated. For example, one site defines a disaster as an event which is likely to render the whole site unavailable for a considerable length of time. A major incident is defined as an event which is likely to render a single system or multiple systems unavailable for a considerable length of time. Any event that is outside of these definitions is handled as a business as usual event, dealt with in the normal way by the service desk.
8.4.4 Problem assessment
The IMT has the responsibility of assessing the impact of the incident.
Where possible, assessment should be made by those with the most domain or system knowledge. Critical time can be lost to prevarication or indecision over whether systems should be failed over. Likewise critical time can be lost by failing over the system to a remote site too quickly when a simple local recovery or waiting for a system to be repaired would have been sufficient. The IMT should develop a set of detailed criteria based on past experience and escalate based on whether the current situation meets those criteria. Where the IMT identifies a potential impact upon the organization beyond IT services, it is its responsibility to activate the BCMT using the organizations defined processes.
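A set of detailed escalation criteria of the kind described above can be expressed as simple rules. The sketch below mirrors the example definitions in the NOTE to 8.4.3 (disaster, major incident, business as usual); the four-hour "considerable length of time" threshold and the field names are invented assumptions, since real criteria are organization specific:

```python
# Illustrative escalation rules based on the example definitions in 8.4.3:
# disaster (whole site), major incident (one or more systems), otherwise
# business as usual. The 4-hour threshold is an invented assumption.

CONSIDERABLE_HOURS = 4  # assumed meaning of "a considerable length of time"

def classify(whole_site_down, systems_down, expected_hours):
    """Classify an event so the right level of response is activated."""
    if whole_site_down and expected_hours >= CONSIDERABLE_HOURS:
        return "disaster"           # escalate: activate BCMT (and possibly CMT)
    if systems_down and expected_hours >= CONSIDERABLE_HOURS:
        return "major incident"     # IMT manages, BCMT kept informed
    return "business as usual"      # handled normally by the service desk
```

Pre-agreed rules like these reduce the prevarication the clause warns about: the decision to fail over, or not, follows from criteria agreed in calm conditions rather than being improvised under pressure.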

8.4.5 Roles and responsibilities
The plan should include a full description of the predefined teams: the IMT and the constituent specialist recovery teams for platforms, services and facilities. These descriptions should also contain each member's role and responsibility and current contact information.
NOTE 1 Any documents containing phone numbers should be constantly updated. Electronic documentation that is linked to directory systems is useful for keeping the plan up to date, provided the directory systems are resilient enough to withstand incidents.
NOTE 2 IMT members should be geographically dispersed in order to withstand environmental incidents.
8.4.6 Procedures to follow
These procedures should be prepared in readiness for this document and should be frequently updated as a result of rehearsals and actual invocations. If a site has multiple IT systems, there should be multiple procedures which form part of the overall recovery procedures.
NOTE Procedures could be developed as hyperlinked electronic documentation connected via a single high level index. This has the advantage over paper that new versions of sub-documents can be released without having to replace all the documents in the set, and everyone has access to the most up-to-date information.
All procedure documentation should be readily available at all points of the enterprise, even in an incident scenario. It may be advisable for IMT members to keep the latest copy of the plan where they will always have access to it, e.g. at home or on their company laptops. The contents of the plan(s) are likely to contain sensitive or confidential information and should always be held securely, with appropriate measures taken to ensure that the contents cannot be accessed by unauthorised personnel (see BS ISO/IEC 17799:2005 for further information on the kinds of measures which can be appropriate).
Following a major incident there may not be the time or equipment necessary to print copies of plans, so any documentation you create should be either easy to read from a screen, or printed out on a regular basis. The procedures may take many forms. One useful form is a flowchart that shows the various possible high-level steps that should be followed and decisions that should be made. A common form of this process flowcharting is known as a swim-lane diagram. Each lane represents an individual recovery process for one system. Figure 5 shows the highest level process flow for a financial institution which runs only three financial systems: savings, mortgages and insurance. How each system is recovered depends on the chosen architecture of each system. In this example, the savings system uses

synchronous remote mirroring of the savings database. Recovery takes the form of enabling the remote mirrors on the remote system, recovering the database environment and then allowing branch traffic to access the system from the remote site. The mortgage system uses a combination of tape back-ups and audit log shipping. To recover this environment, first reload the last known copy of the database from tape and then bring it up to date by re-applying the audit records read from the audit logs. The insurance system is a high availability clustered system which automatically fails over to the back-up site to provide almost uninterrupted service.

NOTE In this example there are no interdependencies between the individual systems. This may not be the case in reality. Quite often one system needs to be recovered before another can be brought online.
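The tape-plus-audit-log recovery described for the mortgage system can be sketched in a few lines of Python. This is an illustrative sketch only — the JSON file formats, field names and file paths are assumptions for the example, not part of this PAS:

```python
import json

def restore_from_backup(backup_path):
    """Reload the last known copy of the database from the backup image."""
    with open(backup_path) as f:
        return json.load(f)  # e.g. {"account_id": balance, ...}

def replay_audit_logs(db, log_path):
    """Bring the restored copy up to date by re-applying audit records in order."""
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)  # one JSON audit record per line
            db[record["account"]] = record["new_balance"]
    return db

# Typical recovery sequence for the mortgage system:
# db = restore_from_backup("mortgage_backup.json")   # last point-in-time backup
# db = replay_audit_logs(db, "mortgage_audit.log")   # roll forward to currency
```

The key property is ordering: audit records must be applied in the sequence they were written, or the recovered database will not match the pre-incident state.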

Figure 5 Example of a high level process flow chart for service continuity management
Trigger: disaster / major component event.

1) Contact EMT members.
2) Assess scale of disaster; switch all branch networks to remote site; re-route help desk and operations calls to remote site.
3) Prepare backup site for full production running. If the main site is no longer usable and safe: notify branches of major disaster invocation; call in Disaster Operations Team; establish disaster operations bridge at remote site; switch remote access ports to remote site. End site preparation.
4) Failover savings systems to remote backup. If the savings systems are not available: up mirror of savings database packs; short recovery of DBMS environment; restart DBMS support runs; allow branch traffic for savings systems. End of recovery of savings systems.
5) Failover mortgage systems to remote backup. If the mortgage systems are not available: reload mortgage systems from last PIT backup tapes; re-apply DBMS audit logs to mortgage database; validate mortgage database for corrupted entries; restart DBMS support runs; allow branch traffic for mortgage systems. End of recovery of mortgage systems.
6) Failover insurance systems to remote backup. If the insurance systems are not available: cluster failover insurance systems to remote backup; allow branch traffic for insurance systems. End of recovery of insurance systems.
7) End failover checks.


Each process in this flow chart should be documented separately, with its own flowchart if necessary, highlighting each task that forms the process. The documented procedures should provide detailed step-by-step instructions. The level of detail required in the plan will depend on the skill level of the intended audience. Each task shown in the top-level process flow chart should be accompanied by a summary sheet containing the items shown in Figure 6.

Figure 6 Task summary sheet

Task A-4: Call-in the fail-over operations team
Task description: Contact remote site on-call operations staff and request extra coverage at the remote site.
Essential documentation: Current remote site Operations Contact List (Contact-List.doc); Emergency Call Out Procedure (Emergency-Call-Out-.doc)
Action takes place at: Wolverhampton Back-up Site
Task completed by: Remote Site Operations Support Manager
Preceding tasks: A-3
Time to complete task: 10 minutes
Requestor: Incident Management Team (BCM Manager)
Full description/reason for action: There is a need to provide full operations coverage at the remote site to augment the normal skeleton staff; the emergency on-call procedures for operations therefore need to be invoked.
Status check (ensure this section is completed and signed off): Name, Signature, Time, Status and Comments
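A task summary sheet of this kind lends itself to a machine-readable form, so that preceding-task dependencies can be checked and ordered automatically. The following Python sketch is illustrative: the field names echo Figure 6, but the data structure, the helper function and the example Task A-3 are our assumptions, not part of this PAS:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    description: str
    completed_by: str
    time_to_complete_min: int
    preceding: list = field(default_factory=list)  # "Preceding tasks" field

def execution_order(tasks):
    """Topologically sort tasks so that every task follows its preceding tasks."""
    by_id = {t.task_id: t for t in tasks}
    order, seen = [], set()
    def visit(task):
        if task.task_id in seen:
            return
        seen.add(task.task_id)
        for pred in task.preceding:
            visit(by_id[pred])  # ensure predecessors are scheduled first
        order.append(task.task_id)
    for t in tasks:
        visit(t)
    return order

tasks = [
    Task("A-3", "Establish operations bridge", "Ops Manager", 15),
    Task("A-4", "Call-in the fail-over operations team",
         "Remote Site Operations Support Manager", 10, preceding=["A-3"]),
]
print(execution_order(tasks))  # ['A-3', 'A-4']
```

Keeping the sheets in this form also makes it cheap to detect a reference to a task that does not exist, which is easy to miss in paper documentation.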

8.4.7 Fail-back

Although it may not be possible to plan for all post-failover scenarios, where for example there has been total devastation of the production site, basic planning should be undertaken and the high-level steps understood. When returning service to the original system or site, detailed plans should be created for the fail-back process. In these circumstances it is unlikely that fail-back will be a straightforward reversal of the fail-over steps, and a separate set of procedures is likely to be required. A full fail-back plan should therefore be in place, with the same quality and standard of documentation as for the fail-over. Figure 7 shows an example fail-back plan for the fictitious fail-over considered in 8.4.6.


Figure 7 Example of a high level process flow chart for fail-back


1) EMT request fail-back.
2) Orderly shutdown of backup site.
3) Prepare production site for full production running: switch all branch networks to production site; re-route help desk and operations calls to production site; establish operations bridge at production site; switch remote access ports to production site. End site preparation.
4) Restore savings systems to production site: up production mirror of savings database packs; short recovery of DBMS environment; restart DBMS support runs; allow branch traffic for savings systems. End of recovery of savings systems.
5) Restore mortgage systems to production site: reload mortgage systems backup tapes; re-apply DBMS audit logs to mortgage database; restart DBMS support runs; allow branch traffic for mortgage systems. End of recovery of mortgage systems.
6) Restore insurance systems to production site: cluster failover insurance system to production site; allow branch traffic for insurance systems. End of recovery of insurance systems.
7) End fail-back.


9 Rehearsing an IT Service Continuity plan


9.1 Introduction
The delivery of, and the feedback from, any rehearsal is one of the most interesting and fruitful parts of any business continuity programme. However, its success depends almost entirely on the way in which it is approached and developed. Good solid preparation ensures a sound delivery and everybody benefits from the exercise; poor preparation leads to an ineffective rehearsal and the whole programme suffers. One unsatisfactory experience in an ill-conceived rehearsal will cause most participants to want to distance themselves from the whole concept of business continuity. On the other hand, a well-prepared exercise will provide all of the participants with a profitable experience. They will be fully engaged in the opportunity to learn from practical experience, and will become more competent whilst gaining confidence in themselves as well as in the plans and procedures.

It is important for the organization's staff to be aware of, and to recognize, the differences between a service continuity rehearsal and an actual invocation. The main difference is the high degree of planning and preparation required for each rehearsal, to ensure that there is little or no impact upon the live systems and that the rehearsal objectives are met wherever possible. All identified resources should be booked and available both for the rehearsal itself and for the planning and preparation it requires. During any rehearsal the live systems will still be running and therefore have to be maintained and supported.

9.2 Roles and responsibilities

The service continuity manager:
a) is responsible for service continuity;
b) is the service continuity management process owner;
c) leads the development of the service continuity recovery plan;
d) is the person who invokes the service continuity recovery plan;
e) is a senior member of the IT function;
f) does not need to be technical;
g) should understand the IT priorities of the users;
h) should not delegate responsibility;
i) should have cover during absence.

The service continuity recovery team:
a) participates in the rehearsing and invocation of the service continuity recovery plan;
b) includes technical staff for technical procedures;
c) includes users for rehearsing and during actual invocation;
d) includes departmental representatives for communication and coordination (in rehearsing and in invocation);
e) is led by the service continuity manager.

9.3 Rehearsal guidelines

Staff resources, costs and implications should be considered by the organization when planning a rehearsal. Staff resources are the most important element, as without them the rehearsal would be difficult, if not impossible, to conduct. The staff involved should have the skills appropriate to the rehearsal, including platform, storage management and application knowledge.

NOTE These resources are required not only for the actual rehearsal but also for pre-rehearsal meetings, and sufficient time should be allowed for preparation and planning. Senior management buy-in to this is essential.

Costs are perhaps the most sensitive consideration of any rehearsal, as they are not insignificant. Each rehearsal should therefore be scoped to yield maximum benefit and to progress the organization's overall continuity objectives. There are also implications to conducting rehearsals of which the organization's senior management need to be made aware. For example, whilst preparing, planning and attending the rehearsal, staff are not doing their day jobs, which impacts existing services, processes, systems and projects. Senior management should be aware of this and plan accordingly. Whilst everything is done to minimize the disruption rehearsals can have on the business, the following should also be considered:
a) Is it possible to time the rehearsal to cause the least disruption to business functions?
b) How much will the rehearsal cost? Is this appropriate for the additional confidence gained over other forms of rehearsing, such as a tabletop or scenario exercise?
c) Does the rehearsal scope continue to progress against the agreed rehearsing strategy and associated annual plans?


d) How can staff be trained to cope with the situation if they do not experience it in rehearsal mode?
e) Once the BCMP is in operation, how will you return to normal business operations? Are there specific issues here that warrant rehearsing in their own right?
f) How different are the circumstances of an actual invocation likely to be relative to those of a rehearsal?

NOTE For example, it may be advisable to use copies of live systems and data in a rehearsal; the emotional environment of a rehearsal is likely to be more relaxed than in a real incident; etc.

9.4 Business user rehearsing

Whilst the organization's technical staff perform the service continuity rehearsal, the business users should validate the recovered applications and services. They should therefore understand their role so that they can prepare appropriately. All business users who take part in service continuity rehearsals should be aware of the artificial environment, the benefits of rehearsing, and the preparation and use of rehearsal scripts and data input for validation. The environment used for an exercise might not be identical to the live environment in an actual invocation, so participants should be aware of and understand the differences; for example, they might not have access to current data, or logons might be different. Business users should rehearse to validate the recovery and to feel confident that their applications and services can be recovered. Rehearsing should also provide valuable feedback to the organization, ensure the recovery is achieved as expected and offer opportunities for improvement. Business users should develop rehearsal scripts which can be followed during a rehearsal to ensure that the appropriate elements for a particular rehearsal are tested. Rehearsal scripts also provide valid input into the audit process. As the rehearsals become more complex, they should be as realistic as possible, so that data can be tracked through the various recovered systems from front office to back office. The input data should be validated, and the results of running the rehearsal scripts, transactions and batch jobs should be checked against pre-defined expectations.

9.5 Strategy

To achieve the organization's ITSC objectives, a combination of the following recommendations should be considered. The frequency of exercises will depend on the individual circumstances of your organization, but accepted best practice is to exercise plans at least once a year.
a) Callout rehearsals should be conducted regularly; in addition, a surprise callout rehearsal should be conducted involving all departments and the IMT.
b) Walk-through reviews of recovery plans, emergency management plans and departmental plans.
c) Scenario-based walk-through exercises for the IMT, support teams and individual departments.
d) Component rehearsing (e.g. individual departments, business processes, IT systems, voice and data network links), for instance when new systems are implemented, when there have been previous rehearsal failures, when changes occur or for previously unrehearsed components. Component testing should also be considered during periods when a more comprehensive test cannot be completed, e.g. testing that network traffic can be redirected to the fail-over site, that users can connect to the fail-over site and that live data can be restored at the fail-over site.
e) Integration rehearsals (e.g. multiple systems and/or business processes). Where IT services rely upon combinations of information systems working together, the organization should reassure itself that it is capable not only of recovering the individual systems but also of recovering them in such a way as to provide the required services by interacting as expected.
f) Relocation rehearsals (technical and business recovery), whereby key parts of the business relocate to, and operate from, the recovery site, covering scenarios such as the loss of the main facility, an IT switch or critical business processes.
g) Fail-over rehearsals of the live IT environment to the recovery site (including verification by users) and business relocation rehearsals.
h) Major incident simulations, which should include scenario-based role-playing exercises, IT fail-over, business relocation and full fail-back rehearsals.

In all cases, results should be documented and updates to the appropriate continuity plans completed within four weeks of each rehearsal. All rehearsing should be carefully managed and coordinated to ensure low risk to the business but maximum return on the effort put in.
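A rolling rehearsal schedule of the kind recommended above can be tracked with a trivial script. The sketch below is illustrative only — the annual interval reflects the "at least once a year" best practice, but the component names and dates are invented:

```python
from datetime import date, timedelta

REHEARSAL_INTERVAL = timedelta(days=365)  # accepted best practice: at least annually

def due_for_rehearsal(last_rehearsed, today=None):
    """Return components whose last rehearsal is over a year old, or which
    have never been rehearsed (recorded as None)."""
    today = today or date.today()
    due = []
    for component, last in last_rehearsed.items():
        if last is None or today - last > REHEARSAL_INTERVAL:
            due.append(component)
    return sorted(due)

schedule = {
    "callout rehearsal": date(2006, 3, 1),
    "component: network fail-over": None,          # never rehearsed
    "full relocation rehearsal": date(2004, 11, 5),
}
print(due_for_rehearsal(schedule, today=date(2006, 6, 1)))
```

A report like this gives the Business Continuity function an objective basis for prioritizing the next rehearsal rather than relying on memory.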

9.6 Rehearsal programme management


To support the rehearsal programme an adequate management framework should be in place as illustrated in Figure 8.


Figure 8 Suggested Programme Management Organization

Compliance/Audit Team
Business Continuity Steering Group
Business Continuity Coordinator
IT Rehearsal Working Group
Business Continuity Rehearsal Group

The suggested roles are as follows:
a) Business Continuity Coordinator: the key facilitator of the Business Continuity function.
b) Compliance/Audit: to oversee recovery rehearsals and exercises, ensuring they meet regulatory requirements and satisfy external auditors.
c) Business Continuity Steering Group (BCSG): oversight committee for the entirety of the business continuity function, consisting of senior representation from all business areas to reflect the business-wide impact of business continuity planning and management.
NOTE As part of the rehearsal strategy, the organization's Business Continuity function should maintain a rolling rehearsal schedule. The Business Continuity Steering Group should sign off the rehearsal programme when it is issued.
d) IT Rehearsal Working Group: responsible for planning the technical IT aspects of recovery rehearsals.

e) Business Continuity Rehearsal Group: chaired by the Business Continuity Coordinator and including representatives from the IT Support Groups and Compliance/Audit. The Business Continuity Rehearsal Group reports to the BCSG and is responsible for:
1) planning and executing all ad hoc infrastructure rehearsing and regular full-scale service continuity simulation rehearsals;
2) agreeing the rehearsal scope and objectives with the business, via the BCSG;
3) pre-rehearsal planning and preparation;
4) production of the rehearsal plan document;
5) coordination of activities during the rehearsal;
6) post-rehearsal reporting;
7) follow-up of actions arising.


The Business Continuity Rehearsal Group should be a business-led rather than an IT-led group. It should meet regularly, as required to meet the above responsibilities; typically this will be monthly, increasing in frequency in the weeks before a rehearsal.

9.7 Rehearsal planning process

9.7.1 Rehearsal plan contents
An effective rehearsal should have:
a) a body responsible for control and coordination;
b) objectives and success criteria;
c) a rehearsal plan and schedule;
d) a reversion plan allowing restoration back to live service at certain key points;
e) briefing of participants;
f) management and coordination;
g) event logs and rehearsal feedback forms;
h) independent observers;
i) post-rehearsal reporting, follow-up and an action plan.
Post-rehearsal reporting should draw on a variety of sources, e.g. comparing helpdesk calls during the test with the normal volume of calls for that day and time, to see whether an increased number of incidents was recorded.

9.7.2 Rehearsal planning principles
The rehearsal process includes a number of principles, which should be applied throughout the planning process:
a) Document an overall rehearsal strategy with a desired objective to be reached within a clearly defined timeframe, which should include the move to rehearsing invocation.
b) Involve the customers in the service continuity rehearsing process.
c) Document and agree a detailed annual plan and rehearsal programme which relates to the overall rehearsing strategy.
d) Set real and achievable objectives with realistic dates.
e) Ensure that all critical daily tasks and housekeeping routines are included.
f) Include Business Continuity aspects and Business Recovery rehearsing in the plans.
g) Include scenario planning/rehearsing with a generic priority list.
h) Promote continuous improvement by following up actions, suggestions and ideas from previous rehearsals.
i) Include the Service Continuity Management team in rehearsals and test their abilities.

9.7.3 The importance of rehearsing
Rehearsing is a vital part of the long-term BCM lifecycle: it proves the viability of recovery plans and highlights areas for further improvement. It also provides an ideal training opportunity for those involved in the key activities. Rehearsals are held precisely so that areas of weakness can be identified and new processes implemented to improve resilience. It is crucial that rehearsals are seen as positive tasks and that any internal political influences are eliminated, so that the focus on business resilience and continuity is maintained. The overall aims of the rehearsing strategy are to ensure effective crisis management and to enable live processing to be moved to the recovery site(s) on a regular basis, becoming part of business as usual.
NOTE Even the most comprehensive rehearsal does not cover everything. For example, in a service disruption where there has been injury or even death to colleagues, the reaction of staff to the crisis cannot be rehearsed, and the plans should make allowance for this.
Rehearsals should have clearly defined objectives and critical success factors, which will be used to determine the success or otherwise of the exercise as well as of the BCP itself. A full rehearsal should replicate the invocation of all standby arrangements, including the recovery of business processes and the involvement of external parties. This should test the completeness of the plans and confirm:
a) time objectives, e.g. to recover the key business processes within a certain time period;
b) staff preparedness and awareness;
c) staff duplication and potential over-commitment of key resources during invocation of the BCP;
d) the responsiveness, effectiveness and awareness of external parties.
Rehearsals may be announced or unannounced; in the latter case, however, senior management should give approval in advance, otherwise it may be difficult to achieve commitment.

9.7.4 Rehearsal objectives
The rehearsal strategy should meet the objectives to:
a) validate emergency callout procedures and contact details contained in the recovery plans;
b) ensure key staff are familiar with their Incident Management, Business Recovery and Technical Recovery plans;
c) prove the ability to recover the technical IT and communications infrastructure;
d) prove the ability of critical staff to relocate to and work from the nominated recovery site(s);
e) validate the effectiveness and accuracy of the documented IT and Business Recovery plans.

9.7.5 Planning a rehearsal
All parts of each rehearsal should be planned in advance; without such planning and preparation the following could occur:
a) objectives are not met and live systems are adversely affected;
b) the rehearsal fails, causing the staff involved to disassociate themselves from Business Continuity and service continuity rehearsal;
c) the identified resources (staff and other) are not available when required, or are not appropriate in terms of skill sets, adequate communications links or server specification;
d) there is nothing to measure progress against and therefore no opportunity to improve the rehearsing process;
e) the expectations of the organization's staff and customers are not met or remain unknown.
NOTE In many ways each rehearsal can be viewed as a project, in that it has defined start and end points and should have agreed objectives and desired outcomes. For guidance on best practice in project management and planning, the reader should refer to PRINCE2 [2] and/or the Project Management Institute's Project Management Body of Knowledge [3].


10 Solutions architecture and design considerations


10.1 General
Service continuity may be achieved in many ways ranging from replicating every single IT component to removing all known single points of failure from those components. There are many available models to choose from as illustrated in Figure 9. An organization may, however, favour one particular model but then also use components of several others to complete the IT architecture.

Figure 9 Infrastructure Architecture Models for Business/Service Continuity

Layers: Site; Application; Data; Platform.
Models: site recovery; site/data centre failover; application failover/load balancing; redundant systems; SAN, NAS and DAS; backup and restore; rapid equipment replacement; high availability system features.

If the IT architecture is changed to support ITSC, this should be checked to ensure it does not compromise continuity or security. A review of the complete environment should therefore be undertaken to ensure security is maintained at the same level. This should include a thorough examination of alternative/back-up sites and the network links between them. The following should be considered:
a) Is the replication of data exposing client data?
b) Are the service continuity rehearsal plans secure, or could they be used to identify weaknesses in the IT architecture?
c) Are there unused service continuity rehearsal Internet Protocol (IP) addresses which a hacker could use during normal operation to gain access to the network?
The classic approach to ITSC is to use a two-site model, in which a back-up site can continue to provide a service when the main site is disabled or destroyed by an incident. There are a number of ways in which this remote site model may be implemented (see Annex D), depending upon the organization's requirements.

10.2 System resilience


Typically, any system running mission-critical applications should be locally resilient. This means that the central system has no known single points of failure, such as power supplies, CPUs or I/O processors. In addition, paths to multiple peripherals are duplicated or duplexed, and disk devices are mirrored or part of a Redundant Array of Independent Disks (RAID) configuration. Loss of any single component should not cause an interruption to service. Further information can be found in Annex E.
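The effect of duplicating components can be quantified with standard availability arithmetic: components in series multiply their availabilities, while redundant (parallel) components leave the system down only when every copy fails. The figures in this sketch are invented for illustration, not taken from this PAS:

```python
def serial(*availabilities):
    """Availability of components that all must work: values multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(a, n):
    """Availability of n redundant copies: the system is down only if all fail."""
    return 1.0 - (1.0 - a) ** n

# Hypothetical server with PSU, CPU board and I/O path, each 99% available.
single = serial(0.99, 0.99, 0.99)        # all single components: ~0.9703
duplexed = serial(parallel(0.99, 2),     # dual power supplies
                  0.99,                  # CPU board (still a single point of failure)
                  parallel(0.99, 2))     # duplexed I/O paths
print(f"{single:.4f} -> {duplexed:.4f}")
```

The calculation makes the design trade-off visible: duplexing two of the three components removes most of the outage probability, and what remains is dominated by the surviving single point of failure.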

10.3 Application resilience


Application software may also play a part in system resilience by creating cluster systems viewed as a single system by the outside world but implemented physically as
multiple independent systems with automated fail-over between hosts. There could be issues relating to the sharing of databases (see D.2). Clustering and database sharing should be implemented if there are concerns about hardware or even software stability. Any application resiliency mechanism should ensure recovery of data to consistent points. For example, if a database has its data on one volume and its indices on another, the application should ensure that updates to the disks are either all applied or none applied, i.e. the update is atomic. Databases that are resilient in this way are said to have Atomic, Consistent, Isolated and Durable (ACID) properties. A stateless server is one that provides a service but retains no transaction state information between interactions with the client. Each transaction is atomic, i.e. self-contained, and has no relation to preceding or following interactions. An example of this type of server is a web server; web applications are typically stateless. Stateless servers are therefore good candidates for the creation of server farms: large groups of servers that all offer the same level of service. When optimum load is exceeded, another server running the same stateless server software can be added.

10.4 Network resilience


The network should be resilient and capable of handling the fail-over approach. There should be adequate communication bandwidth between sites to allow production to switch from one site to another and for performance to remain acceptable for business needs. Where appropriate, networks should use dual paths between critical systems, both within a site and between sites, with all components replicated (switches, network cards, etc.). Single points of failure should be identified and a risk analysis performed to determine whether the risk is acceptable. Alternative network providers should be considered for inter-site links. This includes the "last mile" from any major trunks to the site, with cabling routed independently, following separate routes into the building and terminating at physically separate communication equipment.
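At the application level, dual paths are only useful if software actually falls back to the secondary path when the primary fails. The following Python sketch shows the generic pattern; the path names and the `send` callback are placeholders for whatever transport an organization actually uses:

```python
def send_with_failover(payload, paths, send):
    """Try each network path in order; return the first path that accepts the payload."""
    last_error = None
    for path in paths:
        try:
            send(path, payload)
            return path
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next path
    raise ConnectionError(f"all paths failed; last error: {last_error}")

# Stand-in for a real transport; here the primary link is down.
def fake_send(path, payload):
    if path == "primary":
        raise ConnectionError("primary link down")

used = send_with_failover(b"heartbeat", ["primary", "secondary"], fake_send)
print(used)  # secondary
```

In practice the same logic usually lives in the network layer (routing protocols, bonded interfaces) rather than in application code, but the rehearsal point stands either way: the fallback path should be exercised, not merely provisioned.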

10.5 Data resilience


Typically, computer systems are reliant on the resilience of their disk-based data storage. There are many models that can be adopted to ensure data resilience, some of which are described in Annex F. Organizations should select the most appropriate model or models.


11 Buying Continuity Services


11.1 General
Buying continuity services is not a simple process. Any organization that chooses to minimize its risks by outsourcing to a third party should assess the viability and sustainability of the service it is buying. This is especially the case for continuity services, which may never be used and are hard to rehearse outside a controlled and pre-planned environment. Paradoxically, it is quite possible that buying continuity services from an external supplier could compromise an ITSC plan if the due diligence on that supplier and its services has not been thorough. An organization should understand how a continuity services organization (the supplier) makes money. The supplier invests in resources (buildings, infrastructure, IT equipment, etc.) that may be required by a client if an incident or failure occurs. To ensure that the continuity services are economically viable and thereby affordable to a client, and also to ensure the supplier is profitable, it syndicates those resources across as many clients as possible. The supplier then manages the chance (risk) of more than one client invoking the service, and thereby demanding access to those same resources, simultaneously. The implication is that if the supplier does not manage the risk of multiple, simultaneous invocations both professionally and reasonably, then in the event of a major incident the buyer could be denied access to the very resources it has subscribed to, and could thereby struggle to achieve IT, and with it business, resumption. There is a range of questions to which satisfactory answers should be required when buying any service or product from an organization, irrespective of industry. This section focuses on the specialist due diligence required when buying continuity services, and assumes the reader is already versed in standard purchasing practices such as financial due diligence and validating the accreditations of a supplier.
Further information on best practice in this area can be found at the Chartered Institute of Purchasing and Supply5).

11.2 Syndication management

There is a high chance that companies based in close proximity could be affected by the same incident or event that disrupts IT and Business Continuity. There are many examples of this, notably the terrorist attacks on New York in 2001 and on London in 2005, accidents such as the Buncefield oil terminal explosion, and natural disasters such as the Asian tsunami and Hurricane Katrina. The supplier should be able to demonstrate its risk management system and the methods it uses to ensure that the risks of multiple, simultaneous invocations (which major incidents and natural disasters imply) are as low as possible. It is also highly advisable to assess the method of syndication used by the supplier and match it against the levels of risk that your organization will find acceptable. For example, the supplier may offer lower prices if the buyer is prepared to accept a higher syndication ratio (risk).

11.3 Syndication ratios

The supplier could quote a ratio of clients that it will allow to concurrently subscribe to a particular resource, e.g. 25 clients sharing one computer. However, this ratio is just one aspect of the risk level of which a buyer should be aware, and it should not be accepted on its own as a satisfactory indication of the chances of gaining access to the subscribed resource should an incident occur. The supplier should be able to produce automatically a risk listing of:
a) its clients;
b) their industries;
c) their locations;
d) the resources under cover;
e) the number of times it has sold those same resources;
f) the speed with which the resources are to be delivered and/or made available;
g) the length of time the resources may be required after an incident.
This report should be made freely available to the buyer, who can then determine whether the risk of buying from the supplier is acceptable. Risk management is a dynamic process: the buyer of continuity services should periodically request and review the syndication report from the supplier, and thereby continually reassess its own risk position.

11.4 Location of clients


It is important, when buying Continuity Services, to understand not only the number of clients sharing a resource but also their location. As an example, it may be unlikely for an organization to find it acceptable to share

5) http://www.cips.org

BSI 11 August 2006

27

PAS 77:2006

the same resource with another client of the supplier in the same building, street or close area.
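The syndication-ratio and client-location considerations above can be combined into a simple screening check. The following sketch is purely illustrative: the function, parameter names and thresholds are assumptions for this example and are not part of this PAS.

```python
# Illustrative screen of a supplier's quoted syndication terms against a
# buyer's own risk appetite. Field names and thresholds are hypothetical.

def acceptable_syndication(quoted_ratio, max_ratio, client_locations, own_location):
    """Return True if the quoted sharing ratio is within the buyer's
    appetite and no other syndicated client shares the buyer's location."""
    if quoted_ratio > max_ratio:
        return False
    return own_location not in client_locations

# A supplier quoting 25:1 sharing, with no co-located clients, against a
# buyer prepared to accept at most 30:1.
print(acceptable_syndication(25, 30, ["Leeds", "Bristol"], "London"))  # True
```

In practice the full syndication report (items a) to g) in 11.3) would feed a richer assessment; this check only captures the two headline factors.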

11.5 Risk presented by other clients

In addition to the location of other clients, it is also crucial to understand who those clients are and the industry they are in. By doing so, the buyer becomes able to determine the likely threat those other clients could place on its own ITSC plans, e.g. whether their very presence could constitute a threat, or whether they could be a target for extremists, which could have a knock-on effect on your own ITSC. This is a dynamic equation and will often produce a range of risk positions dependent upon the current political climate. As an example, it would be appropriate to know if you are subscribing to syndicated resources that are shared with an organization that could be classed as a welfare, social or political risk, e.g. an organization that could be the target of an animal rights group, a forestry business that could be threatened by an environmental pressure group, or an organization known to sympathize with a particular side in an area of political unrest.
It is often difficult to know exactly what risks other clients may actually place on you. Clearly there are limits to what you can do; however, it is often worth imagining (if not actually conducting) a helicopter scan over your premises and the surrounding areas of other clients. You may, for example, not be aware that your neighbour is storing gas cylinders in their work yard or is charging fuel tanks next to your building. You may be closer to a flood plain than you had originally thought, or there could be building works going on that could, by accident, cut your telecoms lines. The Buncefield oil terminal explosion proved to be a classic example of a single incident causing direct and consequential issues for many companies.

11.6 Location and Physical Security

When there is an incident that has to be managed by the police and other emergency/security services, an area could be cordoned off for safety and ongoing incident management purposes. This could mean that access to premises is denied and buildings are evacuated. When buying a continuity service, the buyer should expect that its chosen supplier does not sell the same resources to any other companies within the same geographic location. The buyer should, in advance of subscribing to the service, understand the typical size of an exclusion zone enforced by the security services, and determine what is a reasonable and satisfactory area within which to demand exclusive access to the syndicated resources. The buyer can then ask the supplier to prove that it has allocated the requisite exclusion zone.
Another consideration is that of the physical and environmental security measures in force at the recovery site. These should be equivalent to those for the primary location and should be regularly audited against specific and detailed requirements.

11.7 Rehearsing

An ITSC plan should always be rehearsed to ensure it is current and appropriate to meet the required ITSC service levels. It is crucial when procuring continuity services from a supplier that the services are rehearsed. A supplier will have a finite amount of resource (both equipment and people). It is important when buying a service that the supplier's resources are known to the buyer, to help it gauge the chances of service provision when an incident or interruption occurs. This information should be made readily available by the supplier; however, one way to gauge the amount of available resource is to request scheduled and unscheduled rehearsals. If a supplier is under-resourced to meet its contractual obligations, it is unlikely to be able to honour rehearsals scheduled at short notice. Should this happen, alarm bells should ring: a lack of resource for a rehearsal, when all is relatively calm and quiet, is likely to mean over-stretched resources and over-syndicated services. If doubt exists, the buyer should ask deeper questions to ensure its own risk management levels have not been compromised.


Annex A (informative) Conducting business criticality and risk assessments


A.1 General
The approach described here is a variant of Failure Modes and Effects Analysis (FMEA). A variant is suggested because the standard FMEA approach assumes application to a business process and concentrates on the causes and effects of disruption or failure of steps in that process.
NOTE Such an approach would not be directly applicable to a departmental management process or a project, though sufficient common ground exists for the approach to be adapted for those circumstances.
A variant is also required because, since the development of FMEA, the risk management industry has come to accept widely that the concept of risk includes both threats and opportunities. Figure A.1 indicates the steps in the risk assessment process, which results in the development of an ITSC plan (see Clause 8).

Figure A.1 Risk assessment process

[Figure A.1 shows a cyclical process: Process and Risk Identification, then Response Selection, then Response Planning, then Assign Responsibility and Implementation, leading to the IT Service Continuity Plan, followed by Rehearse and Learn Lessons, which feeds back into identification.]

NOTE Where systems and/or IT services are involved in safety critical environments, such as on oil rigs, nuclear power plants etc., more sophisticated approaches to risk management such as Monte Carlo Analysis may be more appropriate.


A.2 Process and risk identification


At the heart of any process for assessing risk should be a set of risk types that can be easily understood by those conducting the assessment. In the case of a physical risk assessment, this involves identifying the hierarchy of IT services that will be the subject of the assessment, the owners of each service and the dependencies between them. In the case of an organizational risk assessment, it involves identifying the organization's structure and the processes for which each node in the structure is responsible, the owners of each process and the dependencies between them.
NOTE This does not mean that all risk assessments should start with a business modelling exercise, since in many cases this information will already be available. Where the information exists, common sense suggests that it would be prudent to review it to ensure continued accuracy, but under no circumstances should effort be expended in reproducing work that already exists in an acceptable form.
The object of the risk assessment should therefore be to define the possible changes, understand how likely they are, and understand how each change would impact IT service provision. The types of risk that can be identified include changes to:
a) business process or activity, including risks ranging from catastrophic failure through minor disruption to positive improvement in productivity;
b) dependencies, including risks ranging in effect from the collapse of a critical supplier of goods or services to the temporary failure of an information flow from another business process;
c) plant or equipment;
d) buildings and environment;
e) information technology or systems;
f) information security, including confidentiality, integrity and availability;
g) projects, including risks associated with not delivering the specified solution, risks associated with the solution and risks associated with its delivery.
In assessing the types of risk to which a physical or organizational component of the business could be subject, the assessment should be well informed and based on verifiable evidence. Where possible and appropriate, the views of acknowledged experts should be called upon to ensure that the assessment of the nature and likelihood of a particular risk is as realistic as possible. All risks identified during this activity should be described in the ITSC plan. At this stage it is only necessary to record summary details for each risk, including a name, which should convey something of the nature of the risk, and a one- or two-sentence description of its nature. The probability of a risk occurring should be determined according to Table A.1.

Table A.1 Probability of risk occurring


Low — The risk is not expected to occur more than once per year.
Medium — The risk is not expected to occur more than once per quarter.
High — The risk is expected to occur at least once per month.
Very High — The risk is known to exist or is expected to occur frequently and/or regularly.
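Table A.1 can be read as a banding of estimated occurrence frequency. The following is a minimal sketch of one possible encoding; the function name and the numeric cut-offs are illustrative assumptions, not part of this PAS.

```python
def probability_band(occurrences_per_year):
    """Map an estimated occurrence frequency to a Table A.1 band.
    The numeric cut-offs are one illustrative reading of the table."""
    if occurrences_per_year <= 1:
        return "Low"         # not more than once per year
    if occurrences_per_year <= 4:
        return "Medium"      # not more than once per quarter
    if occurrences_per_year < 52:
        return "High"        # at least once per month
    return "Very High"       # frequent and/or regular

print(probability_band(0.5))  # Low
print(probability_band(10))   # High
```

In a real assessment the band would usually be assigned by expert judgement rather than a single number, as A.2 notes.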


In performing a risk assessment, one should identify not only the immediate effects of the risk occurring but also the impact on the business of those effects. For example, the effect of a hard disk problem could be the corruption of some data stored on that disk, whilst the business impact of corrupt data relating to customer accounts could be significant cash flow problems, which could also adversely affect the organization's reputation for excellence. In general, the assessment of each risk should consider the impact on:
a) the environment;
b) the financial performance of the organization;
c) the health and safety of employees and the public;
d) the morale of employees;
e) productivity and process efficiency;
f) product quality;
g) business controls;
h) regulatory or legislative compliance;
i) the reputation of the organization with its customers, investors, staff and suppliers;
j) political impact at local, regional, national and international level.
When assessing the impact of a risk, one should ensure that the assessment is well informed and based upon verifiable evidence; hence, expert opinion should be called upon where it is possible and appropriate to do so. Table A.2 categorizes the impact of a risk.

Table A.2 Impact of risk


Low — Expected to have a minor negative impact. The damage would not be expected to have a long-term detrimental effect. Example: very short-term (less than five minutes) power failure.
Medium — Expected to have a moderate negative impact. The impact could be expected to have short to medium term detrimental effects. Example: short-term (less than one hour) failure of email system.
High — Expected to have a significant negative impact. The impact could be expected to have significant medium to long term effects. Example: unexpected failure of online banking system resulting from an unknown cause.
Very High — Expected to have an immediate and very significant negative impact. The impact could be expected to have significant long term effects and potentially catastrophic short term effects. Example: data centre destroyed by fire or flood.


A.3 Response selection


Implementing a risk response should only be done if the tangible and intangible benefits of doing so outweigh the tangible and intangible costs. In addition, the tangible and intangible costs of preparing the response and ultimately of deploying it should not outweigh the costs of taking no action. Since success in business involves a degree of risk taking, there will be risks that the business is happy to accept in the expectation that doing so will result in improved profitability, market share or other tangible benefits. The body responsible for deciding which responses should be implemented should consider the questions listed in Table A.3.

Table A.3 Questions


Question: Is the risk likely to result in a positive outcome?
Options: If so, a response should be devised which causes the risk to occur and maximizes the benefit derived from it. If not, consideration should be given to a response which would avoid, eliminate or mitigate the risk or its impact.

Question: Is the risk sufficiently likely, or its impact sufficiently significant, to justify implementing the response?
Options: If so, some form of response would appear appropriate.

Question: Would a decision not to develop a response leave the organization or its officers open to civil or criminal litigation?
Options: If so, prudence would suggest that some form of appropriate response should be developed.

Question: Would the benefits (both in terms of risk mitigation and other consequential improvements) to the business from implementing the response outweigh both the costs of taking no action and the costs associated with the implementation?
Options: If not, consideration should be given to alternative approaches which cost less to implement or, in some cases, to whether the organization is prepared to accept the risk.

In order to ensure that risk management represents a viable and positive investment in the future of the business, a cost-benefit analysis for each possible risk response should be conducted. The objective of this exercise is to determine whether the benefits of taking action will outweigh the costs of taking no action. This analysis is then fed into the decision-making process for selecting the responses to be implemented. From the risk profiles (see A.2), documented in the ITSC plan, obtain details of:
a) the estimated cost of taking no action in the event that the risk occurs, i.e. the impact cost;
b) the estimated development and implementation costs of existing and new counter-measures;
c) the estimated costs that would be prevented or averted by implementing the proposed counter-measures.
In addition to these financial costs, other factors should be taken into account, such as the organization's reputation, employee health, safety and morale, environmental protection, security, and the confidence of investors, customers and regulators. In each case, an estimate of the impact on the intangible factors should be made for taking no action, for preventing the risk and for implementing the proposed counter-measures. By examining the intangible costs in conjunction with the financial costs, a broader picture is seen. This can be fed into the process of deciding whether a response should be implemented for the risk(s) in question.
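The financial side of this cost-benefit analysis can be sketched as a simple comparison of expected loss against net implementation cost. This is an illustrative simplification; the function and figures are hypothetical, and intangible factors still require separate qualitative assessment.

```python
# Illustrative cost-benefit screen for a single risk response.
# All names and figures are hypothetical examples, not defined by PAS 77.

def response_justified(impact_cost, probability, implementation_cost,
                       averted_cost):
    """Compare the expected cost of taking no action with the cost of
    implementing the counter-measure, net of other costs it averts."""
    expected_loss = impact_cost * probability
    net_cost = implementation_cost - averted_cost
    return expected_loss > net_cost

# A 100k impact with a 30% annual likelihood, against a 20k counter-measure
# that also averts 5k of other costs.
print(response_justified(100_000, 0.3, 20_000, 5_000))  # True
```

A negative result does not end the matter: as Table A.3 notes, litigation exposure or reputational factors may still justify a response.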


A.4 Response planning


A.4.1 General
Once decisions have been made regarding the risk responses that are appropriate for the circumstances, the implementation of each element of the response should be carefully planned. Risk response planning is concerned with ensuring that resources are deployed effectively and efficiently, paying particular attention to maximizing the benefit to the business from implementing the response. For each response, the most appropriate people should be involved in its development and implementation. As such, the participants may work in areas of the business other than that affected by the risk, or may indeed work for other stakeholders. The plan for implementing each risk response should identify:
a) the scope and objectives of the response, such as RTO, RPO etc.;
b) planning assumptions;
c) pre-requisites;
d) a summary of resources required (people, facilities, equipment, money);
e) a work breakdown, identifying the sequence of activities required to implement the response, including the resources required for each step and estimates of the effort and elapsed time required.
Based upon the work breakdown for all required risk responses, a schedule of work should be created in which the timing of each activity is determined by the availability of the time, effort and people required to complete it.
A.4.2 Assign risk category
Based on Table A.2, the definitions of risk categories, as deduced from predictions of likelihood and business impact, have been slightly modified, as shown in Figure A.2.

Figure A.2 Risk categories

[Figure A.2 is a matrix plotting Risk Likelihood (Low, Medium, High, Very high) against Business Impact (Low, Medium, High, Very high). Combinations of high likelihood and/or high impact fall into Category One, intermediate combinations into Category Two, and combinations of low likelihood and low impact into Category Three.]
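One possible reading of the category matrix can be sketched in code. The cell boundaries below are an illustrative interpretation of Figure A.2, not a normative rule, and the function name is an assumption for this example.

```python
# Illustrative mapping of likelihood and impact bands to a risk category.
LEVELS = {"Low": 1, "Medium": 2, "High": 3, "Very High": 4}

def risk_category(likelihood, impact):
    """Assign Category One/Two/Three from likelihood and impact bands.
    The score thresholds are one illustrative reading of Figure A.2."""
    score = LEVELS[likelihood] + LEVELS[impact]
    if score >= 6:
        return "Category One"    # the organization should certainly respond
    if score >= 4:
        return "Category Two"    # the organization should consider responding
    return "Category Three"      # the organization should consider accepting

print(risk_category("Very High", "High"))  # Category One
print(risk_category("Low", "Medium"))      # Category Three
```

As the text below notes, no firm interpretation suits every organization; any such mapping should be calibrated to the organization's own appetite.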


Having assigned the risk category, details of the likelihood, impact and risk category should be added to the risk description in the ITSC plan. At this stage the ITSC plan should contain the risks grouped by category, with Category One risks listed first. To interpret these categories further:
a) a Category One risk is one to which the organization should certainly respond;
b) a Category Two risk is one to which the organization should consider responding;
c) a Category Three risk is one which the organization should consider accepting.
No two organizations are the same, and thus no firm guidance on interpreting these categories can be given without it being inappropriate to a significant percentage of the audience. Hence, though the guidance above is intentionally vague, it helps to frame the questions the organization should be asking itself at this stage of the process.
A.4.3 Develop risk profile
For each risk identified as falling into Categories One and Two, a risk profile should be developed, which defines:
a) the nature of the risk and the events likely to trigger it;
b) the probability of the risk occurring, including details of any circumstances where the likelihood of the risk could change;
c) details of the potential impact of the risk on the business, including estimates of the cost to the business of taking no action to prevent or mitigate its impact;
d) details of the symptoms likely to be displayed in the event that the risk occurs and the ways in which these symptoms could be detected;
e) an assessment of the likelihood of detecting the risk and measures that could be taken to increase that probability;
f) details of existing counter-measures designed to monitor the risk, prevent it from occurring or mitigate its impact, including estimates of the costs of implementing and maintaining these counter-measures;
g) proposals for additional counter-measures, or changes to those in place, to prevent the risk from occurring and to mitigate its impact, including details of the facilities, equipment and personnel required, and estimates of the time, effort and cost required to implement and maintain these new counter-measures;
h) estimated savings accruing from implementing the proposed counter-measures in the event that the risk occurs;
i) estimated consequential savings likely to accrue from implementing the proposed counter-measures in the event that the risk does not occur.
This information provides the basis for a cost-benefit analysis, which should support decision making on how each risk should be addressed by risk monitoring, risk mitigation, risk communication and business continuity planning activities. Details of the risk profile are added to the ITSC plan.
A.4.4 Assess probability of detection
The probability of symptoms of the risk being detected should be determined according to Table A.4.

Table A.4 Probability of risk detection


Low — The symptoms expected to be displayed when the risk occurs will not be obvious or easy to detect without specialized monitoring processes. Example: disk hardware error causing infrequent and random errors when writing information to disk.
Medium — The symptoms expected to be displayed when the risk occurs will be detectable with basic or standard monitoring processes. Example: malicious intrusion onto the corporate network, failure of an online or batch process to complete successfully, etc.
High — The symptoms displayed when the risk occurs will be immediately apparent. Example: failure of email system, power failure, natural disaster etc.
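The fields of an A.4.3 risk profile can be captured in a simple structured record. The field names below paraphrase items a) to i) of A.4.3 and the example values are hypothetical; nothing here is prescribed by this PAS.

```python
from dataclasses import dataclass, field

# Illustrative record for one risk profile; field names paraphrase A.4.3 a)-i).
@dataclass
class RiskProfile:
    name: str
    nature_and_triggers: str
    probability: str                        # Table A.1 band
    impact: str                             # Table A.2 band
    cost_of_no_action: float
    symptoms: list = field(default_factory=list)
    detection_probability: str = "Medium"   # Table A.4 band
    existing_counter_measures: list = field(default_factory=list)
    proposed_counter_measures: list = field(default_factory=list)
    savings_if_risk_occurs: float = 0.0
    consequential_savings: float = 0.0

profile = RiskProfile(
    name="Email outage",
    nature_and_triggers="Mail server hardware failure",
    probability="Medium",
    impact="Medium",
    cost_of_no_action=15_000.0,
    symptoms=["Queued outbound mail", "User reports"],
)
print(profile.detection_probability)  # Medium
```

Recording profiles in a structured form like this makes it easier to keep the ITSC plan's risk listing sorted by category and to feed the cost fields into the A.3 cost-benefit analysis.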


A.4.5 Response selection
A basic model for determining appropriate responses is based upon risk categorization and likelihood of detection. Having categorized the identified risks and decided whether a response ought to be implemented, the nature of that response should be influenced not only by the potential impact or likelihood of the risk occurring, but also by the organization's ability to detect that it has occurred. For example, in planning a response to a risk such as the example given for a low probability of detection, the organization might be well advised to consider implementing specialized monitoring processes and/or equipment to make detection of the risk more likely. In the case of the medium probability example, the organization could implement one of a number of common firewall and intrusion detection tools to both identify and prevent such intrusions.
A.4.6 Assign responsibility and implement
Having determined the appropriate response to the risk, the actions implied should be planned such that resource utilization and cost information is available for cost-benefit analysis. The cost-benefit analysis is an important part of the decision-making process for determining which of the potential response actions are justified and will therefore be implemented. It is also important information to retain when a decision is taken not to act in response to a risk, as it demonstrates that a formal and rigorous thought process was followed in arriving at that decision. Details of the decisions taken on the proposed response actions should be added to the ITSC plan and summarized in an action plan and work schedule.

A.5 Rehearse and learn lessons


An ITSC plan is only likely to be effective if it is regularly rehearsed and if the lessons from those rehearsals are fed back into updated plans. Clause 9 provides guidance on how to plan and conduct such rehearsals for maximum effectiveness.


Annex B (informative) IT Architecture Considerations


When selecting an appropriate IT architecture, the following non-exclusive options should be considered.
a) Location and distance between sites: if failing over from one site to another, the network path distance between the two sites should be carefully considered. If the two sites are too close together, for example on a campus, they could be affected by the same natural disaster. If they are too far apart, the cost of connecting the two sites with suitable telecommunications and/or courier services could become prohibitive. Most importantly, the distance between the sites could have a negative impact on the way in which the IT systems operate. If the chosen model includes synchronous replication, then the greater the distance the greater the latency, introducing delays in the transfer of data between sites which could in turn affect application performance.
b) Number of sites: the number of sites used should be considered a major factor in the IT architecture. For example, a company might have corporate offices in three cities, e.g. Paris, London and Munich, with partial processing of data in each local centre. The IT architecture would typically define methods for ensuring that the London centre could carry on the processing of Paris and Munich work in the event of their total and combined loss. There could be a mutual recovery strategy for all three, so that the systems, network and storage at each site are sized to cope with the combined traffic and work of the other two in addition to its own.
c) Site security: for some types of organization, the security of information or premises is of paramount concern, especially when handling protectively marked or classified information. For example, the protective marking relating to a secondary site may be at a lower level than that of the primary site, especially if it is routinely used for development, testing or training purposes. In these circumstances, the architecture of both premises and IT should be designed and implemented in such a way as to make the up-rating of the site's protective marking as straightforward as possible.
d) Staff access and proximity: if there is a remote site strategy, then in the event of an incident staff could have to work from the remote site for extended periods of time. This can lead to compounded issues due to staff being separated from their family and/or dependants, or to the cost of finding and paying for hotel accommodation for staff near the remote site for extended periods.
e) Remote access: an alternative to shipping people to the remote site is to provide them with remote access to it via dial-up, or through the Internet using Virtual Private Network (VPN) or similar technology. This means that people can work from home in order to support the remote systems, but it could also introduce additional security issues.
f) Dark site vs. manned site: the IT architecture could dictate that the recovery site is typically unmanned during normal operation, with all operation being handled from a central operations bridge. In this case, in the event of an incident the central bridge may not be available, and so an alternative bridge will also be required. Running a dark remote site could also mean that there are no operations staff at the remote site who can help recover systems in the event of an incident.
g) Skill level of staff: in the event of a major incident, key staff could be unavailable, either as a result of the incident or simply because of holidays or sickness. The ability of the IT infrastructure to continue to operate could depend on the ability of the remaining staff to handle the surviving systems. If staff are not trained to provide cover across the board, then the recovery strategy could be at risk. Specifically, recovery procedures for the IT architecture should be written without assuming any detailed knowledge, so that they can be implemented by as many members of the team as possible.
h) Telecoms connectivity and redundant routing: as already stated, distance between sites can be an issue. If network connectivity is leased from a network supplier, then depending on the requirements, the proximity to existing high-speed trunks could present cost issues. To provide resilience, there should be contracts with different trunk providers to ensure continuity of service and redundant routing. There should also be separate contracts with different last-mile telecoms providers to ensure service continuity.
i) Level of automation required: the IT architecture may include the requirement to make fail-over and fail-back completely automated. Many sites desire a completely automated fail-over to occur when problems are detected. Some sites have found that it is impractical to automate everything and prefer the ability to instigate fail-over manually after local recovery attempts have been ruled out. Either way, perfecting automation could involve cost and time to develop and rehearse.


j) Redundant routing of communications: the ability to communicate in a period of disruption is fundamental to the successful management of an incident. Whilst there may be multiple redundant phone lines into and out of sites, check that the telephony provider does not route all of these lines through one common exchange, which could be affected by an incident at that exchange. In addition, since email systems can be affected by an incident, it may be prudent to maintain a number of independent email accounts on external Internet Service Providers (ISPs) for use in an emergency. Consideration should be given to providing multiple forms of communication, such as SMS, pagers, external (non-corporate) email systems, pre-agreed brief coded messages (to avoid overloading the networks and to speed communications) and so on.
k) Third party connectivity and external links: if the organization depends on the services of a third party provider (for example, in the financial world many companies use third party credit reference agencies), those services should be accessible from the remote site. The contract with the third party should provide a guaranteed level of service in the event of an incident.


Annex C (informative) Virtualization


C.1 General
Virtualization, although considered a new technology by many people, has actually been with us since the early mainframe days, when administrators were able to partition memory, processing and disk resources to create a virtual machine. This same technology has now been widely adopted in three key areas: storage, server and network virtualization. Although each of these is technically very different, the concept of virtualization remains the same: take a physical resource and partition it into multiple virtual resources, or consolidate multiple resources into a single virtual resource. Virtualization allows you to maximize utilization of the physical resource while simplifying management through fewer physical devices.

C.2 Network virtualization

Network virtualization allows you to take the components in your network infrastructure and either consolidate them into fewer networks or divide an existing network into smaller segments. For example, you could take a single 48-port network switch and partition it into four segments, each with 12 ports. This allows you to create four isolated networks and utilize all ports on the network switch. It also makes managing the network easier, as there is only one physical switch.

C.3 Storage virtualization

Storage virtualization provides a means to hide the complexity of a storage infrastructure behind a virtual layer. The main advantage of doing this is simplified management. There are three ways to implement storage virtualization: a) use an appliance; b) in the network fabric; c) locally in the storage array. There are pros and cons to each of these methods.
There are quite a few virtualization appliances on the market today that all more or less do the same thing. The appliance will usually sit between the storage arrays and fabric switches. This is called an in-band appliance. All data passing between the host and storage arrays also passes through the appliance. One concern about this approach is that the appliance might become a bottleneck. The second option is an appliance that sits on the edge of the SAN fabric. This is known as an out-of-band appliance. An advantage of this model is that only a small amount of metadata needs to be passed to the appliance, thus eliminating the bottleneck problem. Most appliances also support clustering, so that the appliance does not become a point of failure. A disadvantage of the appliance approach is that adding an additional device increases the complexity and management of the SAN.
Fabric-based virtualization places the virtualization technology inside the SAN fabric switches. This increases the processing and memory requirements of the switch but has the added advantage of reducing overall complexity. This technology is still at a relatively early stage, but there are already a number of competing products on the market. There is, however, some caution around how much intelligence should be implemented at the fabric level. There also needs to be some standardization at the fabric level so that fabrics with multi-vendor switches are fully interoperable.

C.4 Server virtualization

Deploying virtualization software on a server allows you to partition the server into multiple virtual servers and then host an independent OS and applications on each of these virtual machines. Server virtualization abstracts the OS and applications from the underlying hardware, which helps protect applications from hardware peculiarities and makes it much easier to migrate applications onto new hardware platforms. The management console allows you to configure how much memory and processing resource each virtual machine can have, and to monitor how many resources on the physical server each virtual machine is consuming. Replication technologies built into the virtualization software allow you to quickly clone and deploy virtual machines. By integrating with some of the major software deployment tools, it is also possible to rapidly deploy applications onto virtual machines. Some virtualization software also allows the relocation of virtual machines between separate physical servers. This can be policy-driven, so that in the event of a server failure the virtual machines can be moved to a new physical server.
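The 48-port partitioning arithmetic mentioned for network virtualization in C.2 can be sketched as a simple calculation; the function is illustrative only and does not correspond to any particular vendor's configuration interface.

```python
def partition_ports(total_ports, segments):
    """Split a switch's port numbers evenly into isolated segments,
    e.g. 48 ports into four 12-port virtual networks."""
    size = total_ports // segments
    return [list(range(i * size + 1, (i + 1) * size + 1))
            for i in range(segments)]

vlans = partition_ports(48, 4)
print(len(vlans), len(vlans[0]))  # 4 12
```

Real partitioning is done through the switch's own management interface (e.g. VLAN configuration); this sketch only illustrates the capacity arithmetic.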

38

BSI 11 August 2006

PAS 77:2006

Annex D (informative) Types of site models


D.1 General
There are a number of basic site models that can be adopted to provide resilience. The requirements from the ITSC strategy will have a major influence on which model is selected, and this may have significant implications for the IT architecture. The decision therefore requires input and careful consideration from many areas of the organization, and the final selection is likely to be an iterative process as the costs and implications become more thoroughly understood.

D.2 Active/Contingency
This model introduces a remote or back-up site that is used for recovery only at the time of an incident. It is often referred to as a cold back-up site since, at the point of incident, it usually consists of either an empty computer room, or a computer room populated with inactive computers in an un-initialized state. An alternative to this static computer room is a mobile computer suite, provided with generators etc., that may be set up in the parking lot of an incident-stricken company. Similarly, hotel rooms and other rented office space may be turned into incident back-up sites to temporarily house new computer equipment. Specialist companies exist that can ship equipment quickly to help minimize the costs and increase the viability of cold back-up sites. These companies are skilled in the rapid deployment and delivery of pre-configured systems and resources, from servers and PCs through to telephone switches, structured cabling and furniture. An alternative is to share machine room space with a supplier or business partner, providing reciprocal arrangements for computer room space. Care should be exercised here: no such arrangement should be undertaken until all the risks of co-hosting another company's equipment are fully understood. The advantages and disadvantages associated with this model are listed in Table D.1.

Table D.1 Advantages/Disadvantages associated with Active/Contingency model


Advantages:
- Typically lower cost than Active/Active.
- If buying access to the contingency site from a supplier, the service will typically be treated as revenue/operational expenditure rather than capital, which can have advantages for some organizations.
- Limited investment in unused infrastructure, and removes the need to upgrade continuity equipment when upgrading production.
- Additional support skills may be available if using a third party to provide the service.
- May be possible to utilize space across other sites within the organization, reducing or removing the need for a specific cold site.

Disadvantages:
- Typically a slower fail-over than other approaches.
- As systems are built at the point of recovery, very rigorous change and configuration management is required to ensure fail-over procedures are up to date.
- The process is likely to require a high level of technical skill to deal with complex recovery issues.
- If using a shared recovery site, an additional risk that another organization may also require or be using the site.


D.3 Active/Active
At the other end of the spectrum from the Active/Contingency model is the Active/Active model. As the name implies, in normal operation both sites are up and running, accepting work and balancing the load across all computers at both sites. In the event of an incident or system failure at one site, all work is routed to the second site, which has been sized to accept the workload increase with little or no reduction in throughput. The advantages and disadvantages associated with this model are listed in Table D.2.

Table D.2 Advantages/Disadvantages associated with Active/Active model


Advantages:
- Fast recovery from an incident.
- Improved confidence in the ability to fail over, as much of the resilience equipment is being actively used at each site.
- Recovery procedures can be simplified and/or automated, as much of the infrastructure will be up and running.
- May improve utilization of the infrastructure over other models.
- Less overhead on change and configuration management, as the sites are being continually exercised and so issues are likely to be identified more quickly than where equipment is not being used.
- Makes live fail-over rehearsals easier to implement.

Disadvantages:
- Can be more difficult to implement and manage than other models.
- May require additional load balancing technology to allow services to be split across sites, for example to route Internet traffic to two separate sites.
- Complex database issues. If a database is to be active at multiple sites then a mechanism is required to externalize and manage updates so that data at the sites is kept synchronized. Some organizations approach this by running a cluster with the database only active at one of the sites at any one time.
- Limited separation between sites. To achieve the desired level of performance the parts of the Active/Active pair are often close together.


D.4 Active/Alternate (Active/Passive)


In the Active/Alternate model, production runs at one site with a warm standby mirror copy of the production system maintained at a second site. In the event of a failure, production work moves from the main site to the warm standby site with little or no interruption to service. This requires either synchronous (Zero Data Loss) or asynchronous (Point in Time) replication of data. The advantages and disadvantages associated with this model are listed in Table D.3.

Table D.3 Advantages/Disadvantages associated with the Active/Alternate model


Advantages:
- Either site can be nominated as the production site on a scheduled basis, providing confidence in the solution.
- Makes live fail-over rehearsals easier to implement.
- Updates and maintenance can be scheduled at either site by switching service to the other site.

Disadvantages:
- The fail-over to the alternate site can have more impact on service than in the Active/Active model, though it is still typically better than other models.
- Systems at the alternate site must be kept in step with the active site, and as with the other models this will be a greater overhead than for the Active/Active model.
- Limited separation between sites. To achieve the desired level of performance the parts of the Active/Alternate pair are often close together.

D.5 Active/Back-up
In the Active/Back-up model two separate computer suites are maintained, but production runs at only one site; the back-up systems hosted at the remote site are only enabled when an incident strikes. One way of addressing the software licence issue is to utilize the back-up systems as development, test or training platforms. Many IT companies will reduce the cost of software licences if a system is only used for development work, and will allow production licences to be transferred to the back-up site when an incident strikes, although this can incur additional cost. The advantages and disadvantages associated with this model are listed in Table D.4.

Table D.4 Advantages/Disadvantages associated with the Active/Back-up model


Advantages:
- Can reduce the number of software licences required, as a warm standby system doing no productive work may still incur the cost of operating system, database, and communications software licences.

Disadvantages:
- If using the back-up site systems for development, test and/or training, then in the event of a major incident that facility is no longer available.
- If the incident is the result of, say, a faulty software release, there may not be access to the development resources, such as source files and documentation, required to provide a resolution.
- Security implications when running production from the back-up site.
- Slower to activate, as existing services may have to be stopped before fail-over can be initiated.


D.6 Multi-site Models/Hybrids


The two-site model is sufficient for most companies, but some, especially multi-nationals, by necessity adopt a three or four site model where one or more sites can take over the work of the other sites if required. The advantages and disadvantages associated with this model are listed in Table D.5.

Table D.5 Advantages/Disadvantages associated with Multi-site models


Advantages:
- Reduced impact following a major incident at one site, as production is spread across multiple sites.
- Requires less spare capacity for resilience, as load is spread across multiple sites.

Disadvantages:
- The approach can have complex implications, and give rise to issues such as visibility of data and scalability of systems.


Annex E (informative) High availability


E.1 General
High availability refers to the ability of a computer system and its hosted resources to withstand failures. These failures can range from component-level hardware failures to complete site failures. Availability is commonly measured in 9s, with five 9s being the highest level.
NOTE For instance, a system with five 9s availability allows for approximately five minutes of downtime per year.
While five 9s or near-continuous business operation is often desired, solutions guaranteeing zero downtime are often cost-prohibitive to implement, especially after weighing the risks of failure and determining what level of downtime is acceptable for your needs, as shown in Figure E.1.
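The note above can be checked with simple arithmetic: an availability of n nines leaves a fraction of 10^-n of the year as permissible downtime. A minimal sketch:

```python
# Downtime permitted per year for an availability level of n "nines".
# Uses a 365.25-day year; five 9s works out at roughly 5.3 minutes/year.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(nines):
    unavailability = 10 ** -nines   # e.g. five 9s -> 0.00001
    return unavailability * MINUTES_PER_YEAR

for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):8.1f} min/year")
```

Three 9s allows roughly 8.8 hours per year; each additional 9 cuts the allowance by a factor of ten, which is why cost rises so steeply in Figure E.1.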

Figure E.1 Downtime vs. cost. The figure plots system availability (99.0%, 99.9%, 99.99%, 99.999%) against cost, rising from Commercial Availability through High Availability, Fault Resilient and Fault Tolerant to Continuous Processing.


Availability spans many discrete layers both within and outside the infrastructure, with each layer providing additional levels of fault tolerance and/or recovery. The first layer is the physical building and security level that prevents unauthorised access to the proximity of the site. Each layer also includes a subset of systems, such as electrical, mechanical, cooling and security, that can be used independently or combined to provide improved levels of availability.

E.2 Platform availability

High availability at the platform layer is achieved using systems that incorporate redundant components for cooling, power, disk, memory, etc. In most instances these components are also hot swappable to reduce downtime. The majority of computer manufacturers now offer these features as standard on server level systems. As a minimum the computer system should provide dual power inlets, redundant power supplies and cooling fans.

E.2.1 Power
Within the data centre there should also be some form of backup uninterruptible power supply (UPS) to protect against power surges and outages. The UPS should be sized according to the number of systems it needs to support and how long they need to be kept running in the event of a power failure. The two main types of UPS are standby and online. Standby UPS are suitable for PCs and other small, non mission critical appliances; in the event of a power failure the standby UPS will automatically switch to battery backup. Online UPS units are more suited to mission critical systems as they continuously monitor the power source to protect against line surges and brownouts, along with automatically switching to battery backup during a power outage. Battery backup UPS units are suitable for keeping systems running during short outages. To provide for extended power outages a standby generator can be used. As an added layer of protection against power outages some companies also use dual sourced power for their data centre. This approach helps protect against power outages due to electricity grid related problems. In this scenario the dual power inlets on the server system are fed from separate power sources.

E.2.2 Cooling
Providing cooling in the data centre is also very important when building a highly available system. All systems in the data centre will generate heat. In cooling terms, effectiveness is measured by the amount of heat that can be controlled. When sizing cooling, the total heat being generated in the data centre should be calculated, including that generated by computer and communications equipment, lights, and people.
NOTE For the computer and communications equipment it is best to use the peak load figure based on the maximum power requirements of the device. This is normally shown on the device's power requirements label or the system specification document.
The size of the data centre, and the positioning of the systems within it, should also be taken into account as both will impact on the cooling requirements.

E.2.3 Systems monitoring
At the platform level all equipment should be fully monitored (preferably in real-time) to alert of any possible failures. All the large computer and communications vendors provide software utilities to monitor their systems. For ease of administration in a multi-vendor environment it can also be beneficial to deploy an enterprise class systems management suite. This will allow all vendors' systems to be monitored from a single console. Some vendors also provide a feature called "phone home", where a system will send an alert via email or a similar transport mechanism to the vendor's support personnel, alerting them of a failing or failed component. The vendor can then dispatch an engineer to resolve the problem.

E.2.4 Warranty and support
All data centre equipment should have an appropriate level of warranty for maintenance and troubleshooting support. Most system vendors provide a tiered warranty structure ranging from next business day to 24/7x365 with a 2 hour fix. The level of warranty purchased for a system is largely dependent on how mission critical the system is to the business.
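The heat load calculation described in E.2.2 can be sketched numerically. The conversion factor of 3.412 BTU/h per watt is standard; the 100 W per person figure is a rough assumption used only for illustration and should be replaced with site-specific data:

```python
# Rough data centre heat load estimate (E.2.2): sum the peak power draw of
# the equipment, add lighting and occupants, then convert watts to BTU/h.
# ASSUMPTION: ~100 W of sensible heat per person, for illustration only.

WATTS_TO_BTU_PER_HOUR = 3.412

def heat_load_btu_per_hour(equipment_peak_watts, lighting_watts, people):
    total_watts = sum(equipment_peak_watts) + lighting_watts + people * 100
    return total_watts * WATTS_TO_BTU_PER_HOUR

# Two servers at their label-rated peak, 1 kW of lighting, two staff present.
load = heat_load_btu_per_hour([5000, 3000], lighting_watts=1000, people=2)
print(round(load, 1))  # roughly 31,390 BTU/h for a 9,200 W load
```

The result is what a cooling plant would be sized against, before allowances for room size, airflow and equipment positioning.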

E.3 Data availability


The level of data availability is governed by the underlying storage hardware and the features it provides to protect its stored data. These features can range from standard Redundant Array of Independent Disks (RAID) implementations to more advanced data replication technologies such as snapshots and synchronous mirroring.
E.3.1 Redundant Array of Independent Disks (RAID)
RAID provides both high availability and improved I/O performance at the disk level by using mirroring and striping techniques to duplicate data across multiple disks. At present there are around twelve different RAID levels, some of which are proprietary to specific storage vendors. The most commonly supported RAID levels in everyday use are RAID 0, 1, 5, 0+1, and 10. Selecting which RAID type to implement is usually a trade-off between performance and cost. As a rule of thumb


RAID levels that utilise the most disks provide the highest level of redundancy and performance. Each RAID level has its own advantages and disadvantages which are summarized in Table E.1.

Table E.1 Advantages/Disadvantages of RAID levels


RAID 0
Description: Data striped across two or more disks.
Advantages: Easy to implement. Very good read/write performance. No parity overhead.
Disadvantages: No fault tolerance/redundancy. A single disk failure causes data loss. Not suitable for high availability.

RAID 1
Description: Data mirrored between two disks. Requires a minimum of two disks.
Advantages: 100% disk redundancy. Improved read performance over RAID 0. Simple design.
Disadvantages: Only 50% disk utilization. Limited redundancy (single disk failure). RAID function requires additional processing.

RAID 5
Description: Data and parity striped across multiple disks. Requires a minimum of three disks.
Advantages: Very good read performance. Parity is distributed across all disks. Maximum utilization of disk resources.
Disadvantages: Write penalty due to parity calculation. Slow rebuild after drive failure. Limited redundancy (single disk failure).

RAID 0+1
Description: Mirrored RAID 0 segments. Requires a minimum of four disks.
Advantages: High I/O throughput, read and write. Same overhead as RAID 1.
Disadvantages: Very expensive to implement. Only 50% disk utilization. Limited redundancy (single disk failure).

RAID 10
Description: Striped RAID 1 segments. Requires a minimum of four disks.
Advantages: Very good read/write performance. Same overhead as RAID 1. Can withstand single drive failures across RAID 1 segments.
Disadvantages: Very expensive to implement. Only 50% disk utilization.
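The disk utilization trade-offs in Table E.1 can be expressed as a short calculation. This is an illustrative sketch for the five levels listed, assuming equal-sized disks:

```python
# Usable capacity for the RAID levels in Table E.1, assuming n equal disks.
# RAID 0 uses every disk; RAID 1 keeps one of two copies; RAID 5 gives up
# one disk's worth of space to parity; RAID 0+1 and 10 mirror everything,
# hence the 50% utilization noted in the table.

def usable_gb(level, disks, disk_gb):
    if level == "0":
        return disks * disk_gb
    if level == "1":
        return disk_gb                 # two-disk mirror
    if level == "5":
        return (disks - 1) * disk_gb   # minimum of three disks
    if level in ("0+1", "10"):
        return (disks // 2) * disk_gb  # minimum of four disks
    raise ValueError(f"unknown RAID level: {level}")

for level, disks in [("0", 4), ("1", 2), ("5", 4), ("0+1", 4), ("10", 4)]:
    print(f"RAID {level}: {usable_gb(level, disks, 500)} GB usable from {disks} x 500 GB")
```

Comparing RAID 5 (1,500 GB usable from four 500 GB disks) with RAID 10 (1,000 GB from the same disks) makes the cost/performance trade-off concrete.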


E.3.2 Direct Attached Storage (DAS)
This is one of the most frequently used storage methods for both internal and external storage. It simply consists of directly attaching the storage device or disks to a computer system using a RAID controller. For internal storage the RAID controller can be either a Peripheral Component Interconnect/SUN Bus (PCI/SBUS) card or an integrated device. For external storage a number of options are available depending on the intelligence of the storage device. If the external device is just a bunch of disks (JBOD), these will need to be connected to an internal RAID controller. If the external storage device includes disks and a disk controller then an appropriate host bus adapter will need to be installed in the server system.
NOTE For instance, the external device might be a fibre channel array, in which case a fibre channel host bus adapter will be installed in the server.
The main disadvantage with DAS is that it creates islands of isolated storage that can only be accessed by the locally attached server. Each pool of storage also needs to be managed separately. Because of its simplicity, DAS usually has more single points of failure than other storage models.
E.3.3 Network Attached Storage (NAS)
Most NAS devices use an underlying proprietary Operating System (OS) and file system so that they can be used in a heterogeneous environment. A NAS device will typically present both Network File System (NFS) and Common Internet File System (CIFS) file systems to end users, but internally these file systems are usually stored as a separate proprietary file system. The majority of NAS devices operate at the file level, although some of the newer models, which offer features like mirroring and snapshot technology, operate at the block level. NAS devices have become very popular due to the ease with which they can be deployed and centrally managed. Gigabit networking has also helped to expand their usability in the data centre.
Most NAS devices also incorporate high availability features at the platform level to protect against disk, power and controller failure. Some of the higher end models support clustering of the NAS device to protect against unit failure. The biggest disadvantage with NAS devices is their potential to saturate the local network during peak usage, which can limit their use for I/O intensive applications. Recent improvements in Local Area Network (LAN) connectivity speeds have helped to offset this limitation.
E.3.4 Storage Area Networks (SAN)
SAN storage has become increasingly popular over the last few years. Its centralized design helps storage administrators easily deploy and manage storage

resources. The SAN fabric is responsible for carrying data between the host servers and the target storage arrays. The fabric is a dedicated fibre based network designed for high availability through the use of multiple data paths between the hosts and the storage array. The SAN array (also known as the storage array) is the subsystem which houses the power supplies, fans, disks, disk controllers and the array's operating system. SAN arrays are designed for the high availability of mission critical data. All SAN arrays operate at the block level and are independent of the file systems they host, which makes them well suited to heterogeneous environments. In addition to the platform redundancy built into the SAN array there are a number of other features inherent to SAN storage, including Snapshot, Mirroring and Cloning functionality. Snapshot technology allows point in time copies of data to be created, mainly for the purpose of backup. The Cloning feature also creates point in time copies of data but, unlike Snapshot technology, which uses disk pointers to create an instant point in time copy, cloning physically copies every block of data to a new disk. Cloning takes longer than a snapshot but provides better availability by creating a full duplicate of the source disk. Both Snapshot and Cloning store their information in the local array. For increased availability the data can be replicated to a remote array using mirroring technology. There are two main types of mirroring: synchronous and asynchronous. Synchronous mirroring copies each block of data to the remote array and waits for an acknowledgement before writing the block of data to the local array. This ensures that the remote and local copies of data are always consistent. Implementing synchronous replication between two sites will typically require a high speed link such as dark fibre with Dense Wavelength Division Multiplexing (DWDM). Synchronous replication is also limited to network path distances of 200 km. Replicating data over extended distances requires asynchronous mirroring, which allows writes to the local array to continue as normal and then be replicated to the remote array at fixed intervals. The one disadvantage of asynchronous replication is the possibility of losing data that has not yet been replicated to the remote site should the local site fail.
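The data-loss exposure of asynchronous mirroring can be illustrated with a small model. The timestamps and the fixed replication interval are invented for illustration; real arrays ship changes on their own schedules:

```python
# Worst-case data loss (RPO exposure) under asynchronous mirroring:
# writes are shipped to the remote array at fixed intervals, so anything
# written after the last completed shipment is lost if the local site fails.

def unreplicated_mb(writes, interval_s, failure_time_s):
    """writes: list of (timestamp_s, size_mb) applied to the local array.

    Returns the MB written since the last shipment, i.e. lost at failure.
    """
    last_shipment = (failure_time_s // interval_s) * interval_s
    return sum(mb for t, mb in writes if last_shipment <= t <= failure_time_s)

# Shipments every 300 s; the local site fails at t = 700 s.
writes = [(100, 10), (400, 20), (650, 5)]
print(unreplicated_mb(writes, 300, 700))  # only the 5 MB write at t=650 is lost
```

Shortening the replication interval reduces the exposure but increases link traffic, which is exactly the trade-off between synchronous and asynchronous mirroring described above.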


E.4 Application availability


One of the major demands on IT personnel is ensuring that business critical applications are kept online, and in particular that the application layer is not impacted by hardware failures. A common approach to achieving this goal is the use of clustering technologies, which cover a number of areas from the file system through to the computer node.
E.4.1 Clustered file system
A clustered file system is a file system that is distributed between multiple computer nodes. Each node holds part of the file system, but to the end user it appears as a single file system. Clustered file systems are frequently used in high performance compute environments that require very high data throughput between compute nodes. Replicating the file system between multiple nodes provides added protection against hardware failures. For performance reasons most clustered file systems tend to be hosted on fibre channel storage arrays. Disadvantages of clustered file systems include the high cost of deploying and managing them in large configurations.
E.4.2 Application cluster
Application clusters are somewhat similar in design to clustered file systems. The application is scaled out across multiple compute nodes to provide both higher availability and performance. To the host the application appears as a single resource. In most cases application clusters use a clustered file system for shared data access. To identify failures in the cluster, a heartbeat packet is exchanged between nodes to ensure that each node in the cluster is online. If a node fails to respond to a heartbeat packet then that node is identified as being offline and all its resources are redistributed to the remaining nodes. The application cluster can also dynamically re-assign processing tasks if threshold policies are in place. Today there are a number of application cluster products on the market for web serving and transactional databases.
One of the advantages of using an application cluster is the ability to scale out the application as additional processing power is required. Deploying an application cluster is a complex task, but this is slowly being addressed by software vendors.
E.4.3 Computer cluster services
All the major operating systems include some form of cluster services support, which can be installed during the OS deployment or as an additional add-on. Cluster services fall into one of two categories:
a) Load balancing cluster services are frequently used for distributing network load across multiple hosts. Take for

example a large web farm with fifty web servers. In order to balance web requests across all fifty servers, each server has load balancing services installed along with a virtual IP address. All fifty servers share the same virtual IP address, so each time a web request is received all fifty web servers intercept the request, but the load balancing software uses a set of rules to determine which server should process it.
b) Fail-over cluster services can be classified as either "shared everything" or "shared nothing". Like load balancing cluster services they can be installed as part of the operating system, and there are also a large number of third party cluster services applications that integrate with the major OSs. Typically cluster services will require a shared storage resource and a network port on each cluster node for sending and receiving heartbeat information.
1) The shared everything model allows all nodes in the cluster to have shared access to the cluster resources. In order to achieve this, the cluster needs to use a distributed lock manager to control node access. Some users have questioned the scalability of the shared everything model because of the overhead and complexity of managing the resource locking.
2) The shared nothing model also uses shared storage, but only one node in the cluster has access to a resource at any given point in time. The shared nothing model is often referred to as Active/Passive or Active/Active. An Active/Passive cluster is one where one node in the cluster manages all the resources and the second node acts as a fail-over node that takes ownership of the active resources in the event of a failure. In the Active/Active model both nodes in the cluster actively host independent resources; in the event of a failure the surviving node takes ownership of all resources. Both models of fail-over clustering are widely used for applications such as web, file and print, messaging and database servers.
Some of the higher end clustering services software packages also include data replication features. These features can be used to build what are known as stretch clusters, which allow the distance between the cluster nodes to be increased.
NOTE An example of a stretch cluster is where one node is located in London whilst the second node is located in Manchester.
The replication engine ensures that the data copies at both locations are consistent. In the event of a failure the replication engine brings the secondary copy of data online so that the surviving node can take ownership of all the cluster resources. This process can take a couple of minutes but, importantly, there is no outage at the application level and the fail-over is transparent to the end users.
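The heartbeat-and-failover behaviour described in E.4.2 and E.4.3 can be sketched as a toy model. The timeout value, node names and the "fewest resources" placement rule are all illustrative assumptions, not the behaviour of any specific cluster product:

```python
# Toy heartbeat failure detection and resource fail-over (E.4.2/E.4.3):
# a node silent for longer than the timeout is declared offline and its
# resources are redistributed across the surviving nodes.

def fail_over(resources, last_heartbeat, now, timeout_s=10.0):
    """resources: {resource: owner_node}; last_heartbeat: {node: timestamp}."""
    dead = {n for n, t in last_heartbeat.items() if now - t > timeout_s}
    alive = sorted(n for n in last_heartbeat if n not in dead)
    new_owners = {}
    for resource, owner in sorted(resources.items()):
        if owner in dead:
            # Move to the surviving node that owns the fewest resources so far.
            counts = {n: list(new_owners.values()).count(n) for n in alive}
            new_owners[resource] = min(alive, key=lambda n: counts[n])
        else:
            new_owners[resource] = owner
    return new_owners

heartbeats = {"node1": 0.0, "node2": 95.0, "node3": 98.0}  # node1 went silent
print(fail_over({"db": "node1", "files": "node1", "web": "node2"},
                heartbeats, now=100.0))
```

A real cluster adds quorum voting and fencing to avoid "split brain" situations where both halves believe the other has failed, but the detect-then-reassign loop is the same.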


Annex F (informative) Types of resilience


F.1 General
This annex provides a brief overview of some of the replication approaches that can be used to protect and recover data. As technology progresses other options will become viable, and any selection should include a review of the products available. As with the choice of a site model described in Annex D, the ITSC strategy will be a major factor in selecting the appropriate replication mechanism(s), and the resultant choice will have an effect on the IT architecture.

F.2 Media back-up/restore

Creating data back-ups on an alternate media (usually tape) is still the de facto method of ensuring there is a secure Point In Time (PIT) copy of vital data, but for many companies the tape back-up has become a second level back-up which is simply insurance against disk or application based replication failures. However, some legacy applications still rely on tape as the primary back-up.
NOTE 1 In this context the term tape back-up applies equally well to any other physical media used to create a back-up of live data which is then physically removed offsite.
NOTE 2 If tape back-up is used either as a primary or secondary back-up it is normal to create a schedule of back-ups which reflects the working pattern of the system being backed up.

F.2.1 Example 1
An order entry system is open for orders from 09:00 to 17:00 every day, Monday through Saturday (the online day). After 17:00 batch processes extract all new orders from the orders file and process them, distributing shipment orders to the warehouse and build orders to the factory floor. The order history file is then updated. On Sunday the system is unavailable. The back-up cycle for this system may be:
a) full back-up of the entire database once per week on Sunday;
b) back-up of all new orders, daily after 17:00 but before batch processing commences;
c) back-up of all shipment orders and build orders after batch processing;
d) back-up of the order history file before the start of the online day.
This cycle is highly tailored to a known and fixed file based processing cycle.

F.2.2 Example 2
Modern tape back-up systems adopt a more holistic approach to backing up the entire system, as illustrated in the following example. A software development system is available 24 x 7 for use by a group of 100+ developers writing and testing software. Source files and executables are created and updated on an ad hoc basis throughout the day and at various times throughout the night. Developers work typical eight hour shifts during the core hours of 07:00 to 19:00 but, in order to meet deadlines, will sometimes work through the night. The least busy period is Sunday night through early Monday morning. The back-up cycle for this system adopts the following pattern:
a) full system back-up of all files on Sunday night;
b) full system back-up of all files changed since Sunday, at midnight every day;
c) incremental back-up of all files changed since midnight, twice per day, once at midday and once at 21:00.
In this example only those files that are unopened are backed up. If a file is left open by an application it will not be backed up, simply because its contents cannot be guaranteed to be consistent. It is possible to override this restriction and create "dirty" back-ups, but this should only be done with an understanding of the applications in use, as data could be in an uncertain state. Conversely, if open files are ignored certain files may never be backed up, leaving the organization exposed to the risk of data loss.

F.2.3 Media Issues
Many of the issues related to tape back-up of open databases are dealt with by application or system based back-ups. Faced with the issues related to creating Zero Data Loss (ZDL) back-ups of live, permanently open databases, systems and database vendors have created their own back-up software that works in concert with the database system to enable online production databases to be saved and restored with no data loss, and with the ability to back out updates performed by failed transactions. These back-up mechanisms rely on the creation of log files, often referred to as audit trails, which reflect every update to the database and allow databases to be restored in a consistent fashion should systems fail.

F.2.4 Media Storage
If tape-based back-up is the primary back-up mechanism, then copies of the back-ups should be taken offsite at the earliest opportunity. Many companies exist to provide secure data storage and can be contracted to collect data


at multiple times during the day. If keeping back-ups on site, they should be stored in a controlled environment or a fireproof safe, ideally at some distance from the original source. It is a common mistake to make critical back-ups but leave them sitting in the office reception or on the loading dock for hours waiting for an offsite courier to take them to a secure store. If the back-up copies are held at a back-up site it can be beneficial to load the files from tape to disk. This verifies the back-up (the tapes can be read) and can also reduce the recovery time. A more sophisticated variation on the standard tape back-up mechanism is to perform remote back-ups. Usually only seen on high-end systems with high speed fibre links, the data is backed up to tape at the remote back-up site, providing both a back-up and a secure offsite copy at the same time.
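The back-up cycles in F.2.1 and F.2.2 amount to selecting files against a reference time. A minimal sketch of the full/differential/incremental selection from Example 2, with file names and timestamps invented for illustration, and open files skipped as described above:

```python
# File selection for the Example 2 back-up cycle (F.2.2): a weekly full
# back-up, a daily back-up of everything changed since the full, and
# twice-daily incrementals of files changed since midnight. Open files
# are skipped because their contents cannot be guaranteed consistent.

def select_files(files, mode, last_full_ts, midnight_ts):
    """files: {name: (mtime, is_open)}. Returns names to back up, sorted."""
    if mode == "full":
        changed_since = float("-inf")
    elif mode == "differential":        # all files changed since Sunday's full
        changed_since = last_full_ts
    elif mode == "incremental":         # files changed since midnight
        changed_since = midnight_ts
    else:
        raise ValueError(mode)
    return sorted(name for name, (mtime, is_open) in files.items()
                  if mtime > changed_since and not is_open)

files = {"main.c": (3.0, False), "build.log": (7.0, False), "app.db": (9.0, True)}
print(select_files(files, "differential", last_full_ts=2.0, midnight_ts=6.0))
print(select_files(files, "incremental", last_full_ts=2.0, midnight_ts=6.0))
```

Note that `app.db`, being open, is never selected; this is exactly the exposure the text warns about, and why databases need the log-based mechanisms described in F.2.3.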

be in step with other applications being recovered by the organization. Thus if data needs to be synchronized between applications, additional recovery steps may be required to resynchronize the applications.

F.5 Host-based replication/mirroring


Host-based replication is a catch-all term referring to a technique whereby replication is achieved through the duplication of disk write requests, either via some specially installed software, or via the operating system itself. Essentially each write request is captured and duplicated on an alternate disk device, providing a synchronized copy of the original database on a completely separate set of disks. Often referred to as Host Mirroring, this form of replication originated with mainframe systems long before RAID technology was invented. Not only did it provide a mirror copy at the device/volume level to ensure data resilience, it also provided some performance improvements by toggling read requests between the pair of disks. However this type of replication has some inherent drawbacks: a) Delay the application waits for both writes to complete. As the primary reason for creating host mirroring was data security, the application is forced to wait for both writes to complete thereby potentially slowing it down. b) Host loading since the host is responsible for issuing the second write request, it also incurs a CPU overhead whilst performing it. c) Distance since both disk devices are seen as channel attach disks, then the second copy is typically limited to short distances. A more recent slant on this approach is available in the open systems world through host based file/disk replication software. These products typically require extra resources on the server, such as network adapters in order to function. They also use network protocols to replicate Input/Output (I/O) and are subject to the limitations of the operating system in this area.

F.3 Database management system-based replication


Many database vendors involved in writing Database Management Systems (DBMS) aimed at mission critical applications have not been able to rely on systems based back-up and replication software because it either doesnt exist or is unreliable. As a consequence they have incorporated data replication functionality into the database software itself. Not only does the DBMS provide a mechanism for updating records, but it also provides a mechanism for creating a database replica on a remote system. One mechanism for this is called log shipping. Updates to the database are captured to a log file, periodically batches of updates are shipped across a wide area network to a remote system hosting a copy of the database from the primary site. The remote system then reapplies the updates to the remote database, effectively keeping the two databases synchronised. In the event of a loss of the primary site, the remote system takes over processing. In this method updates are grouped into batches and file transferred across the network. This asynchronous shipment method means that the remote system is always a few updates behind the live system and as a consequence there is likely to be some data loss (i.e. any updates since the last log was shipped) should the main site be impacted by an incident. This time lag may or may not be an issue, depending on the RPO for the system.

F.6 Storage array-based replication


This method moves the job of replication down to the level of the storage controller. Firmware running in the disk control unit is responsible for the replication of data from one volume to another volume in a remote site. This form of replication can and does function over large distances and requires high speed and high bandwidth network links between the sites. The cost of implementation of this mechanism can be prohibitive except for large mission critical applications. The advantage of this mechanism is that it is divorced from the path of the I/O in as much as it imposes no extra

F.4 Application based replication


Alternatively the replication can be built into an application, either as part of a standard package or as a custom development by the organization. With this approach the recovery point for the application may not

BSI 11 August 2006

49

PAS 77:2006

load on the host system. Handling replication at the control unit level also allows the control unit to create multiple copies or snapshots of the volumes. These snapshots can be used to drive separate applications such as overnight batch processing and their careful use can vastly improve overall system throughput. However it should be noted that depending on how it is used, storage array based replication can introduce I/O latency and potentially delay the I/O completion.
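The snapshots mentioned above are typically implemented using a copy-on-write scheme: taking the snapshot is nearly free, and old block contents are preserved only when the live volume overwrites them. The following sketch (Python, an illustrative model rather than any real storage product's behaviour) shows the idea.

```python
# Illustrative copy-on-write snapshot model. A snapshot copies no data
# up front; old block contents are preserved only on first overwrite.
class Volume:
    def __init__(self, blocks):
        self.blocks = dict(blocks)   # live data: block number -> contents
        self.snapshots = []          # per-snapshot map of preserved blocks

    def snapshot(self):
        """Taking a snapshot is cheap: it records nothing initially."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block, data):
        """Before overwriting a block, preserve the old contents in any
        snapshot that has not yet saved its own copy of that block."""
        for snap in self.snapshots:
            if block not in snap:
                snap[block] = self.blocks.get(block)
        self.blocks[block] = data

    def read_snapshot(self, snap_id, block):
        """Read the point-in-time view: preserved copy if one exists,
        otherwise the block is unchanged and the live copy is valid."""
        snap = self.snapshots[snap_id]
        return snap[block] if block in snap else self.blocks.get(block)

vol = Volume({0: "orders-v1", 1: "history-v1"})
sid = vol.snapshot()
vol.write(0, "orders-v2")            # the live volume moves on...
print(vol.blocks[0])                 # orders-v2 (live view)
print(vol.read_snapshot(sid, 0))     # orders-v1 (point-in-time view)
```

This is why a snapshot can safely feed overnight batch processing while the live volume continues to take updates: the batch job sees a consistent point-in-time image.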

F.7 Storage Area Network-based replication


In a Storage Area Network (SAN), storage devices are attached to a fabric of Fibre Channel links which operate at very high speeds. One approach to replication at the SAN level is to introduce a replication appliance into the SAN, presenting itself as both a disk volume and a host at the same time. Through special device drivers embedded in the host systems, write requests are replicated to the replication appliance. The appliance can then ship the writes across a Wide Area Network (WAN) to a remote location, where another set of replication appliances distributes the write requests to a duplicate set of disk devices. An advantage of this mechanism is that standard communications lines can be used to carry the replication traffic, since replication appliances may also compress the data being sent.
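The reason compression lets standard communications lines carry the replication traffic is that intercepted write batches are often highly redundant. The sketch below (Python, with an invented record format; real appliances operate on raw block payloads, not JSON) illustrates the batching-and-compression step and its reversal at the remote site.

```python
import json
import zlib

# Hypothetical batch of intercepted write requests (block address + data).
# Business data is often repetitive, so it compresses well; here the
# payload is deliberately redundant to make the effect obvious.
writes = [{"block": i, "data": "A" * 512} for i in range(100)]

payload = json.dumps(writes).encode()
compressed = zlib.compress(payload)

# The compressed batch needs far less WAN bandwidth than the raw writes.
print(len(payload), len(compressed))

# At the remote site the appliance reverses the process before applying
# the writes to the duplicate set of disk devices.
restored = json.loads(zlib.decompress(compressed))
assert restored == writes
```

Actual compression ratios depend entirely on the data; encrypted or already-compressed data gains little, which is worth checking before sizing the WAN links.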

F.8 Disk replication modes


Almost all of the disk replication mechanisms described can operate in one of two modes, and it is worth considering which of these modes best fits the RTO, RPO and cost requirements.

a) Synchronous (Zero Data Loss). Each write request is replicated to a remote system, and the issuing system effectively waits until it receives an I/O-complete status back from the remote site. This ensures that all write requests are securely completed at the remote site, and the back-up is guaranteed to be synchronized with the original copy. It should be noted that dedicated links between the two sites are required to achieve acceptable throughput.

b) Asynchronous (Point In Time). The write request is issued to the remote system, but the local application continues without waiting for an acknowledgement of I/O completion. While this improves performance, it creates a window in which data loss can occur if the local disk subsystem is destroyed with writes pending. Asynchronous replication is also called a Point In Time back-up because it reflects the consistent state of the disk at a specified point in time.
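The trade-off between the two modes can be made concrete with a small sketch (Python; the remote write is simulated with a sleep, and all names are illustrative). The synchronous path blocks until the remote copy is safe; the asynchronous path queues the write and returns immediately, leaving a window of pending writes that could be lost.

```python
import queue
import threading
import time

remote_disk = {}          # stands in for the disk set at the remote site

def remote_write(block, data):
    time.sleep(0.01)      # stand-in for network round-trip latency
    remote_disk[block] = data

def synchronous_write(local_disk, block, data):
    """Zero Data Loss: the caller waits until the remote copy is safe."""
    local_disk[block] = data
    remote_write(block, data)          # blocks until remote I/O completes

pending = queue.Queue()                # writes not yet applied remotely

def asynchronous_write(local_disk, block, data):
    """Point In Time: the caller continues at once; a background shipper
    drains the queue, so queued writes can be lost in an incident."""
    local_disk[block] = data
    pending.put((block, data))

def shipper():
    while True:
        block, data = pending.get()
        remote_write(block, data)
        pending.task_done()

threading.Thread(target=shipper, daemon=True).start()

local = {}
synchronous_write(local, 0, "safe")    # returns only once remote is current
asynchronous_write(local, 1, "fast")   # returns immediately; remote lags
pending.join()                         # demo only: wait for the backlog
print(remote_disk)
```

In the synchronous case the application's write latency includes the round trip to the remote site, which is why dedicated low-latency links are needed; in the asynchronous case the depth of the pending queue at the moment of an incident is exactly the data exposed to loss, and should be compared against the RPO.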


Bibliography
Standards publications
PAS 56:2003, Guide to Business Continuity Management
BS ISO/IEC 20000, Information technology – Service management
ISO/IEC 17799:2005, Code of Practice for Information Security Management
ISO Guide 73:2002, Risk management – Vocabulary – Guidelines for use in standards

Other publications
[1] IT Infrastructure Library (ITIL). Office of Government Commerce: The Stationery Office.
[2] PRINCE2 Maturity Model (P2MM). Office of Government Commerce (OGC).
[3] Project Management Body of Knowledge. Project Management Institute (PMI).

Further Reading
TR 19:2005, Technical Reference for Business Continuity Management (BCM). Spring Singapore.
Emergency Preparedness: Guidance on Part 1 of the Civil Contingencies Act 2004, its associated Regulations and non-statutory arrangements. Home Office: The Stationery Office.
Generally Accepted Practices for Business Continuity Practitioners. Disaster Recovery Journal and DRI International, 2005.
Business Continuity. CBI with Computacenter, 2002.
A Risk Management Standard. The Institute of Risk Management, The Association of Insurance and Risk Managers and The National Forum for Risk Management in the Public Sector, 2002.
Microsoft Operations Framework: A Pocket Guide. Van Haren Publishing. ISBN 9077212108.
Management of Risk: Guidance for Practitioners. Office of Government Commerce: The Stationery Office.
A Guide to Business Continuity Planning. James C. Barnes. ISBN 0-471-53015-8.


BSI British Standards Institution


BSI is the independent national body responsible for preparing British Standards. It presents the UK view on standards in Europe and at the international level. It is incorporated by Royal Charter.
Revisions
British Standards are updated by amendment or revision. Users of British Standards should make sure that they possess the latest amendments or editions.

We would be grateful if anyone finding an inaccuracy or ambiguity while using this Publicly Available Specification would inform Customer Services. Tel: +44 (0)20 8996 9001. Fax: +44 (0)20 8996 7001. Email: orders@bsi-global.com

BSI offers members an individual updating service called PLUS which ensures that subscribers automatically receive the latest editions of standards.

Information on standards
BSI provides a wide range of information on national, European and international standards through its Library and its Technical Help to Exporters Service. Various BSI electronic information services are also available which give details on all its products and services.

Copyright
Copyright subsists in all BSI publications. BSI also holds the copyright, in the UK, of the publications of the international standardization bodies. Except as permitted under the Copyright, Designs and Patents Act 1988, no extract may be reproduced, stored in a retrieval system or transmitted in any form or by any means (electronic, photocopying, recording or otherwise) without prior written permission from BSI.

This does not preclude the free use, in the course of implementing the standard, of necessary details such as symbols, and size, type or grade designations. If these details are to be used for any purpose other than implementation, then the prior written permission of BSI must be obtained.

Details and advice can be obtained from the Copyright & Licensing Manager. Tel: +44 (0)20 8996 7070. Fax: +44 (0)20 8996 7553. Email: copyright@bsi-global.com

BSI, 389 Chiswick High Road, London W4 4AL.

Contact the Information Centre


Tel: +44 (0)20 8996 7111. Fax: +44 (0)20 8996 7048. Email: info@bsi-global.com

Subscribing members of BSI are kept up to date with standards developments and receive substantial discounts on the purchase price of standards. For details of these and other benefits contact Membership Administration. Tel: +44 (0)20 8996 7002. Fax: +44 (0)20 8996 7001. Email: membership@bsi-global.com

Information regarding online access to British Standards via British Standards Online can be found at http://www.bsi-global.com/bsonline

Further information about BSI is available on the BSI website at http://www.bsi-global.com

Buying standards
Orders for all BSI, international and foreign standards publications should be addressed to Customer Services. Tel: +44 (0)20 8996 9001. Fax: +44 (0)20 8996 7001. Email: orders@bsi-global.com

Standards are also available from the BSI website at http://www.bsi-global.com

In response to orders for international standards, it is BSI policy to supply the BSI implementation of those that have been published as British Standards, unless otherwise requested.


British Standards Institution
389 Chiswick High Road
London W4 4AL
United Kingdom
http://www.bsi-global.com
ISBN 0 580 49047 5