Professional Documents
Culture Documents
Best Practices
Disaster Recovery Best Practices in Cloud Computing
DISCLAIMER
This document has been prepared by Cloud Management Office (CMO) under the Ministry of
Electronics and Information Technology (MeitY). This document is advisory in nature and aims
to provide information in respect of the GI Cloud (MeghRaj) Initiative.
While every care has been taken to ensure that the contents of this document are accurate
and up to date, the readers are advised to exercise discretion and verify the precise current
provisions of law and other applicable instructions from the original sources. It represents
practices as on the date of issue of this document, which is subject to change without notice.
The readers are responsible for making their own independent assessment of the information
in this document.
In no event shall MeitY or its' contractors be liable for any damages whatsoever (including,
without limitation, damages for loss of profits, business interruption, loss of information)
arising out of the use of or inability to use this Document.
Table of Contents
1 Purpose ............................................................................................................................................................... 6
2 Background ....................................................................................................................................................... 7
Annexure 1............................................................................................................................................................... 29
1 Purpose
MeitY has introduced MeghRaj initiative to utilize and harness the benefits of Cloud
Computing in order to accelerate delivery of e-services in the country. It has led to a
significant adoption of Cloud technology across Government Departments. As there is a
significant upsurge in digital information throughout the Government eco-system, there is
an increased need of preparing our digital ecosystem to overcome any disasters, without
5 a considerable impact on public services. This document will assist User Departments in
evaluating and considering the best practices suitable for their respective departments in
terms of Disaster recovery and ensuring business continuity.
6.1
6.4
6.5
6.6
6.9
2 Background
As there is a constant rise in information systems and electronic data, the rise in
vulnerability of such data has also increased exponentially. The disruptions can be seen
ranging from mild (e.g., short-term power outage, disk drive failure) to severe (e.g., site
destruction, fire). Though the vulnerabilities may be minimized or eliminated through
management, operational, or technical controls, as part of the departments resiliency
5 effort, however, it is virtually impossible to completely eliminate all risks.
One of the challenges for User Departments is ensuring the operations remain unaffected,
even during adverse times. Disasters can strike at any moment, leading to socio-economic
6.1 and reputational losses.
The guideline focuses on describing Disaster recovery planning and detailing out the
considerations and best practices which should be followed to mitigate the risk of system
6.2 and service unavailability by providing effective and efficient solutions to enhance business
continuity.
6.5
6.6
6.9
The goal for any Department with DR is to continue operating as close to normal as
possible. Preparing for a DR requires a comprehensive approach that
Disaster Recovery in Cloud entails storing critical data and applications in Cloud storage
and failing over to a secondary site in case of any disaster. Cloud Computing services are
provided on a pay-as-you-go basis and can be accessed from anywhere and at any point
of time. Backup and Disaster Recovery in Cloud Computing can be automated, requiring
5
minimum manual interventions.
6.1
6.2
6.3
6.4
6.5
6.6
6.9
6.1
6.2
6.3
6.4
6.5
6.6
6.9
Recovery Time • RTO refers to the time an application can be down without
6.1 Objective (RTO) causing significant damage to the business
6.6
6.9
5 Types of Disasters
A disaster can be related to any incident (both intentional and/or non-intentional) that causes
severe damage to the operations and data of any Organization.
6.1
There can be various scenarios of disasters for which Departments should be prepared
beforehand. The outages can range from a simple application failure to the disaster of
whole Data centre. Below table shows some of the scenarios and he way Departments can
6.2 deal with such outages.
6.3
6.4
6.5
6.6
6.9
Backup of data exist but there In such cases, Departments need to back
5
are no plans for Disaster up their data regularly so that they can
management retrieve their data on the newly replaced
systems in case of failure.
6.1
There is a backup data plan Such Departments cannot tolerate to
and external site to keep the keep their systems down for an extended
backed-up data period. They have an arrangement to
6.2
restore the required backed-up data
which is kept at external site also called as
data Off-siting
6.3
Remote, redundant sites as Departments which have multiple data
backup centres (at least two) that are located far
away from each other. These data centres
6.4 are interlinked with a strong
communication network that facilitates the
quick transfer of data in case of any disaster
6.5 at either of these centres.
6.9
VI. Documenting DR
I. Identify the
Plan (roles and VII. Validating DR
6.1 criticality of Data &
Responsibility, Readiness
Applications
governance, SLA)
6.2
6.5
Journey towards DR
6.6
6.9
MeitY shall launch IGCSF Toolkit, as a part of Risk & Security Assessment Decision making
5 framework, which will help Departments to categorize their critical applications. The impact
is divided into three categories, viz.
Based on the impacts (high, medium and low), Departments can categorize their respective
6.2 application.
The Departments data remains equally critical as the data has evolved fast from mere excel
or spreadsheet records to representing communication such as e-mail and important
7
digital documents. However, not all data in an enterprise is mission critical. It is important
to classify data and define the associated metrics for retention, retrieval and archival.
Missing this can increase costs exponentially (storage, backup, management, etc.).
Classification helps in narrowing down the actual data that needs to be recovered in the
case of a disaster.
6.3
User Departments should classify application and data based on criticality,
as all the data and applications cannot be mission critical.
6.4
A survey that Forrester conducted in 2017 found that only 18% of organizations use either
DRaaS (Disaster Recovery as a Service) offerings or public cloud IaaS offerings. On the
other hand, Gartner estimates that the size of the DRaaS market will exceed that of the
6.5
market for more traditional subscription-based DR services by 2018.
6.6
6.9
User Departments can select between internal and external Disaster recovery sites, based
on their respective requirements. Below diagrams depicts the major difference between
5 the two sites.
6.1 When to use: Require aggressive RTO, When to use: When Departments require
require control over all aspects of the DR cost effective DR Sites.
process.
Considerations: An outside provider
6.2 Considerations: Expensive than an owns and operates an external DR site.
external site, Internal site needs to be
3 types: Hot site, Warm site, Cold site
built up completely by the Department
6.3
Distance is a key element in disaster recovery. A closer site is easier to manage, but it
6.4 should be far enough that it's not impacted by the drive up and staff costs.
6.6 Used for business- Data is replicated but Not ready for
critical apps servers may not be automatic failover
Fully functional DC ready High risk of data loss
Ready in the event of Takes time to bring up May take weeks to
disaster servers to recover recover, as data from
It can be of 2 types: application in warm site backup have to be
o Active-Active- Both Designed to be used loaded into servers
sites are live for no- business critical Minimal infrastructure
o Active-Passive- Data apps
is replicated in passive
site
6.9
6.6
6.9
Departments that rely on mission critical data and cannot compromise on RTOs,
can effectively leverage synchronous replication, while Organizations that can
endure longer RTOs but need cost effective disaster recovery can use asynchronous
replication. Also, Application based replication is the least preferred replication
5 option due to dependency on individual Application vendor.
6.1
6.2
6.3
6.4
6.5
6.6
6.9
6.1 There are two major factors which impact bandwidth requirement decision. Figure below
explains the factors
6.2
6.3
6.4
6.5
6.6
DRaaS pricing structure: DRaaS is often made up of several pricing components, including:
• Replicated data storage cost
• Software licensing costs (for disaster recovery and business continuity software to
provide data replication)
• Computing infrastructure cost
• Bandwidth cost
Some DRaaS providers only charge for storage and software licensing when the service is not
6.9 actually being used, adding compute infrastructure and bandwidth costs if the service is
activated in the case of a disaster. Others charge for all components in the form of a "service
availability fee," regardless of whether or not the service is actually used.
Disaster recovery Management Tool - Disaster recovery management tool is a part of DRaaS
solution. It helps an Organization to maintain or quickly resume its mission-critical functions
after a disaster. It is used to facilitate preventative planning and execution for catastrophic
events that can significantly damage a computer, server, or network. It allows an organization
to run instances of its applications in the provider's cloud. The obvious advantage is that the
time to return the application to production, assuming networking issues can be worked out,
is greatly reduced because there is no need to restore data across the Internet.
5
Some of the key features of a disaster recovery software are:
Ease of use
Monitoring capabilities
6.1
Automatic backup of critical data and systems
Quick disaster recovery with minimal user interaction.
Flexible options for recovery
6.2 Recovery point and recovery time objectives
Compatibility with physical servers
Easy billing structure
6.3 Options for the backup target
6.4
While selecting DRaaS, Departments should consider the following:
DRaaS works on pay as you go model, so Organizations should select Service
providers which provide different DRaaS service for different classes of
6.5 applications.
6.9
It should cover all details such as physical and logical architecture, dependencies (inter-
and intra-application), interface mapping, authentication, etc. Application dependency
6.2 matrix, interface diagrams and application to physical/virtual server mapping play an
important role in defining how applications interact with each other to deliver various
functionalities.
The DRP Coordinator shall have comprehensive decision-making powers, member from
the higher Authority expected to lead the DR activities.
6.6 Crisis Management Team (CMT):
The Crisis Management Team shall comprise of Management level personnel who shall
analyze the damage at DC, advise the DRP Coordinator for Disaster Declaration, and initiate
the recovery of Operations at the DR Site.
The Damage Assessment Team shall comprise of a management and technical expertise
mixture of personnel who shall assess & report the damage at DC and take steps to
minimize the extent of the same.
6.2
Roles and responsibilities of the Business continuity Committee should be clearly defined and
well communicated in the Departments.
6.3 6.6.2 Segregation of Responsibility between CSP, MSP and a User
Department
The segregation of roles and responsibilities between a Department, MSP and CSP can
6.4 be seen in the below mentioned matrix:
Plan Maintenance
6.6
Management Actions (Escalations, MANAGED BY USER DEPARTMENT AND MSP
Declaration and Orchestration)
Application Validation
System Recovery
6.9 Applications
Database
7
Middleware
UCompute
ser Departments should focus on recovering the application services and not
(servers) MANAGED BY CSP
just servers. This is where sequence of the activities to be performed for restoring
Network
the operation plays an important role. The technical recovery plan for each
application
Storage/Dataand service needs to document all the details of the activities that
need to be performed along with the sequence of the activities.
5 Alternate Site
6.9
The replicated resources to be reviewed based on sector wise requirements (At least
twice a year (six months)), to ensure resources are effectively replicated and identified
5 resources are in alignment with the business priorities and goals. The list below includes
elements that should be reviewed on a regular basis to support critical server definition
requirements.
6.1 DR Drill
Business impact analysis and risk assessment
Application dependencies and interdependencies
Backup procedure
6.2
o Offsite storage for essential records
o Data retention policies
Recovery time objectives (RTO)
6.3 Recovery point objectives (RPO)
IT and Senior management signoff
6.4
Departments to plan for thorough testing the DR plan to unearth issues such
as hardcoded IP addresses, host file entries, license file/ key, configuration
details, dependency on other applications/ services, etc., resulting in the need
6.5 to update the DR plan to make it robust. While it might not be possible to test
all key business applications during a DR drill, it is important to focus on
recovering the application(s)/service(s) and not just on servers. In essence, an
untested DR plan is only a strategy.
6.6
6.9
Applications and environment can be rebuilt but if Departments loose data during a
6.1
disaster it becomes difficult to return to normal operations. So, the foremost important
thing is securing data. Efficient data replication technologies go a long way toward
addressing that concern.
6.2
There are three major scenarios. These have been explained in detail in GI Cloud
Architecture Guideline.
6.3
6.4
Figure 5
6.6 While choosing the DRS and replication methodology, Department needs to follow
the sector-wise guidelines
DR setup/drills should be followed based on sector-wise regulatory requirements
The manpower deployed at NS/ DRS should have same expertise as for Primary
Data Centre in order to make DRS/NS functional in short notice, independently.
6.9
6.2
6.3
6.4
6.5
6.6
6.9
Annexure 1
Terminologies related to DR
Some of the key terms used for Disaster Recovery are described below:
Disaster Recovery - The process of restoring and maintaining the data, systems,
applications and other technical resources on which a business depends.
5 Application Recovery - A component of Disaster Recovery that deals with the
restoration of business system and data, after the operating system environment has
been restored or replaced.
Disaster recovery as a Service - Disaster recovery as a service (DRaaS) is
6.1
the replication and hosting of physical or virtual servers to provide failover in the
event of a man-made or natural catastrophe.
Business Continuity - A system of planning, for recovering and maintaining both the
6.2 IT and other functions within a Department regardless of the type of interruption. In
addition to the IT infrastructure, it covers people, workplaces, facilities, equipment,
processes, etc.
6.3 Business Impact Analysis - A collection of information on a wide range of areas from
recovery assumptions and critical business processes to interdependencies and critical
staff that is then analyzed to assess impact a disaster may have.
High Availability - High availability describes a system’s ability to continue processing
6.4
and functioning for a certain period of time - normally a very high percentage of time,
for example 99.999%. High availability can be implemented into an Organization’s IT
infrastructure by reducing any single points of failure using redundant components.
6.5 Hot Sites versus Remote Sites - A disaster recovery hot site is a remote physical
location where you can maintain copies of all of your critical systems, such as data,
documents, etc. A remote site provides a secondary instance or replica of IT
environment, without office infrastructure. It can be securely accessed and used
6.6
remotely by respective Organizations, through standard Internet connections from
anywhere.
Mission-critical - A computer system or application that is essential to the functioning
of business and its processes.
Production or Primary Site - The primary site contains the original data that cannot
be recreated.
Recovery Time Objective (RTO) - The RTO is the duration of time and service level
within which a business process must be restored after a disruption in order to avoid
unacceptable losses. RTO begins when a disaster hits and does not end until all systems
are up and running. Meeting tighter RTO, windows requires positioning secondary
6.9
data so that it can be accessed faster.
Recovery Point Objective (RPO) - The RPO is the point in time to which an
Organization must recover data. The RPO is the “acceptable loss” determined by an
7
Organization.in a disaster situation. The RPO dictates which replication method will be
required (i.e. backups, snapshots, continuous replication).
Redundancy - A system of using multiple sources, devices or connections so that no
single point of failure will completely stop the flow of information.
Risk Assessment - The identification and prioritization of potential business risk and
disruptions based on severity and likelihood of occurrence.
Secondary Site (or DR Site) - The secondary site contains information and
applications that are built from the primary repository information. This site is
activated whenever the primary site becomes unavailable.
DR Drill: DR Drill is a routine activity done by an organization to check if there is
business continuity in case if the DC site is down due to an unexpected event.
*Note: provided tables content is listed based on best practices, configuration and setup may
vary as per department needs