
Disaster Recovery for a BSS Data Center



Disaster Recovery: The Lighter Side



Section 1

Disaster Recovery Overview



What is a Disaster?
• A hazard that has come to realization
• Perceived tragedy
– Natural calamity
– Man-made catastrophe
• Disasters are the consequence of
inappropriately managed risks



Risks to be Addressed



What is Disaster Recovery from an IT Perspective?

• Timely and effective restoration of IT services following a major incident

• Any plan or set of procedures implemented by a business to maintain uptime and/or prevent data loss in the event of a system failure



Disaster Recovery
• People
– Staff, Outsourced Technology
• Process
– Crisis Management Process
• Technology (IT)
– Hardware, Software


Metrics for Disaster Recovery (1/2)
• Driven by two metrics
– Recovery Time Objective (RTO): for how long can the service be interrupted?
– Recovery Point Objective (RPO): how much data loss is acceptable?



Metrics for Disaster Recovery (2/2)
[Timeline figure: a disaster is declared at 10 a.m.; the RPO window extends backwards from that point, the RTO window forwards to service restoration]

RPO: Amount of data lost from a failure, measured as the amount of time from the disaster event
RTO: Targeted amount of time to restart a business service after a disaster event



Understanding RPO and RTO
• Cost of downtime per hour (see the sketch below)
– Employee cost per hour + Cost of problem repair + Cost of employee overtime
– Loss of customers
– Damage to company reputation
• Recovery Point Objective (RPO)
– The point in time to which the data must be recovered
– The acceptable amount of data loss in a disaster situation
• Recovery Time Objective (RTO)
– The duration of time within which a business process must be restored after a disaster (underlying infrastructure and application components are restored first)
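
To make the downtime-cost bullet concrete, here is a minimal sketch that simply adds up the cost components listed above; the function, parameter names and figures are illustrative assumptions, not values from this document.

```python
# Hypothetical downtime-cost estimate based on the components listed above.
# All input figures are illustrative assumptions, not values from the deck.

def downtime_cost_per_hour(employees_idled, employee_cost_per_hour,
                           repair_cost_per_hour, overtime_cost_per_hour,
                           lost_revenue_per_hour, reputation_penalty_per_hour):
    """Sum the per-hour cost components of an outage."""
    return (employees_idled * employee_cost_per_hour
            + repair_cost_per_hour
            + overtime_cost_per_hour
            + lost_revenue_per_hour
            + reputation_penalty_per_hour)

if __name__ == "__main__":
    hourly = downtime_cost_per_hour(
        employees_idled=200,                  # staff unable to work
        employee_cost_per_hour=500,           # loaded cost per employee
        repair_cost_per_hour=50_000,          # vendor / problem-repair charges
        overtime_cost_per_hour=20_000,        # catch-up overtime
        lost_revenue_per_hour=300_000,        # lost customers / orders
        reputation_penalty_per_hour=100_000,  # rough proxy for brand damage
    )
    rto_hours = 4  # target recovery time
    print(f"Estimated cost of a {rto_hours}-hour outage: {hourly * rto_hours:,.0f}")
```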



Investment Scenario



High Availability vs. Disaster Tolerance
• High Availability
– Providing redundancy within a data center to maintain the service (with
or without a short outage)
• Hardware failures
• Software failures
• Human error

• Disaster Tolerance
– Providing redundancy between data centers to restore the service
quickly (tens of minutes) after certain disasters (using dedicated equipment)
• Power loss
• Fire, flood, earthquake
• Sabotage, terrorism



Availability Events (1/2)
• Planned Outages
– Network and power related changes
– Hardware repair
– Hardware and/or software upgrades
– Software maintenance
• OS
• Database
• Applications
– Data backup and storage management
• As data grows in size, tape backup is less effective
• What data must be archived?
• How is the data archived?



Availability Events (2/2)
• Unplanned Outages
– Hardware failure
• Server, storage, network, power
– Software failure
• Crashes, errors, hangs, etc.
• OS and applications
– Human error
• Hardware, software, data
– Disasters (man-made and otherwise)



What causes the most Downtime?

Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008



Measure of Availability

[Chart: hours of downtime per year, per IT service]

Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008
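
For context on what "hours of downtime per year" means in availability terms, here is a quick conversion from availability percentage to allowed downtime; the service levels shown are illustrative, not taken from the Gartner chart.

```python
# Convert an availability percentage into allowed downtime per year.
# Purely arithmetic; the specific service levels shown are illustrative.

HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_hours_per_year(availability_pct):
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct:>7}% available -> {downtime_hours_per_year(pct):8.2f} hours of downtime/year")
```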



Section 2

Architecture & Sizing for Disaster Recovery



2-Site Architecture
• 100% capacity at the Primary Site + 100% at the DR Site
• Database changes are more frequent, hence log-based replication of the database between the Primary and DR sites
• Synchronous replication is not possible because of WAN bandwidth constraints
• Asynchronous replication is possible (a minimal log-shipping sketch follows below)
• RPO: depends on how much data has to be replicated
• RTO: depends upon people + processes
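
To make the log-based, asynchronous replication in this design concrete, here is a minimal sketch of a log-shipping loop. It is not a real replication product; the paths, host layout and 15-minute interval are illustrative assumptions.

```python
# Hypothetical sketch of asynchronous, log-based replication:
# archive logs accumulate at the primary and are shipped to the DR site
# on a schedule, so the DR copy lags by up to one shipping interval (the RPO).
import os
import shutil
import time

LOG_DIR = "/u01/archivelogs"          # where the primary database writes archive logs
DR_STAGING = "/dr_mount/archivelogs"  # DR-site staging area (e.g., over a WAN mount)
SHIP_INTERVAL_SECONDS = 15 * 60       # ship every 15 minutes -> worst-case RPO ~ 15 min

def ship_archive_logs():
    """Copy any archive logs not yet present at the DR site."""
    for name in sorted(os.listdir(LOG_DIR)):
        target = os.path.join(DR_STAGING, name)
        if not os.path.exists(target):
            shutil.copy2(os.path.join(LOG_DIR, name), target)
            print(f"shipped {name}")

if __name__ == "__main__":
    while True:
        ship_archive_logs()
        time.sleep(SHIP_INTERVAL_SECONDS)  # data generated in this window is at risk
```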



2-Site Architecture: Working



[Diagram: the Primary Site (ACTIVE) and the DR Site each have an application tier (application servers), a database tier (DBCI servers in a cluster) and a storage tier (SAN storage volume groups holding application files and archive logs); the SAN storage is replicated asynchronously from the Primary Site to the DR Site over dark fibre]



2-Site Architecture
• Advantages
– Simple to manage
– Less expensive than other solutions
– Only one replication link needs to be procured
• Disadvantages
– The business impact of a 15-minute RPO cannot be quantified (it could be high or low)
– Cannot estimate what kind of data loss will occur
– The RTO of the DR site cannot be committed to the business because of lost transactions
3-Site Architecture (for RPO=0)



• For RPO=0
– Must have synchronous replication of the database
– Synchronous replication has limitations on distance (40 to 60 km)
– Hence the database cannot be replicated synchronously over long distances
– But it can be replicated synchronously over short distances
– So a 3-site (Primary, Near, DR) solution can achieve RPO=0 (almost); see the sketch below
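
A minimal sketch of why synchronous replication gives RPO=0: the primary acknowledges a commit only after the Near site has received the data. The class and method names are hypothetical, purely to illustrate the semantics.

```python
# Hypothetical illustration of synchronous vs. asynchronous replication semantics.
# With synchronous replication the commit is acknowledged only after the
# near site has the data, so no committed transaction can be lost (RPO = 0).

class NearSite:
    def __init__(self):
        self.log = []

    def receive(self, record):
        self.log.append(record)      # durably stored at the near site
        return True                  # acknowledgement back to the primary

class PrimarySite:
    def __init__(self, near_site, synchronous=True):
        self.near_site = near_site
        self.synchronous = synchronous
        self.pending = []

    def commit(self, record):
        if self.synchronous:
            # Wait for the near site's ack before telling the client "committed".
            # Over ~40-60 km the round trip is small; over long distances the
            # added latency makes synchronous commits impractical.
            self.near_site.receive(record)
            return "committed (safe at near site)"
        else:
            # Asynchronous: acknowledge immediately, replicate later.
            self.pending.append(record)
            return "committed (at risk until shipped)"

if __name__ == "__main__":
    primary = PrimarySite(NearSite(), synchronous=True)
    print(primary.commit({"txn": 42, "amount": 100}))
```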



• In which cases will RPO be zero?
– Regional disasters which don't destroy the Primary and Near sites at the same time
– For all kinds of DC failures, RPO=0 can be achieved
– In case of a regional disaster which wipes out both the Primary and Near sites, RPO will depend upon the link between the Primary and DR sites (it could be 15 minutes, depending upon the size of the link)



[Diagram: Primary Site, Near Site and DR Site, each with application servers, DBCI servers and SAN storage volume groups (application files and archive logs); storage is replicated synchronously from the Primary Site to the Near Site, and asynchronously over a WAN link to the DR Site]



3-Site Architecture: Working

[Diagram: Site A (PROD) connected to Site B (Near/bunker site) over dark fibre, distance < 25 km; Site C is the DR site]



3-Site DR Considerations
• What must a Near site have?
– Different and multiple power sources / power grids
– Network termination exactly the same as the Primary DC (if the Near site has to be used for Primary-site operations)
– Replication links from multiple vendors (no SPOF)
– A link to the DR site



What should be in the Near Site?
– Option 1: Full 100% replica of the Primary Site
• High cost (infrastructure + people)
– Servers, storage, firewalls, switches, backup, power sources
– Applications, databases, etc.
– Security, personnel, processes
– Network connectivity
• Would protect against any local problems at the Primary DC



What should be in the Near Site?
• Option 2: Split configuration between the Primary and Near sites
– Database servers split between the Primary and Near sites (extended cluster)
– When the Primary DC fails, operations move to the Near Site
– Maintenance and continuous upkeep of the Near Site is essential
– Redundancy required for application servers, firewalls, routers, servers, backup, etc.



What should be in the Near Site?
• Option 3: Minimalist
– Treat the Near site only as the means to RPO=0, not for operations
– Replicate storage continuously for RPO=0
– Keep only the hardware needed to push data from the Near site to the DR site in case of a Primary DC failure
– Keeps the simplicity of 2-site DR while adding the RPO=0 of the 3-site design
– RPO=0 is not achieved if the Primary and Near sites go down together



Section 3

Connectivity to DR Site



Connectivity

The majority of businesses deploy wide area networks (WANs) to connect the remote parts of the business back to centralized resources.

Bandwidth is always an issue in disaster recovery. If you're replicating data for potential failover, both locally and remotely, then your bandwidth issues become more complicated.

We want to establish a DR site that's far enough away that it won't be affected by the same disaster, but not so far away that WAN bandwidth costs will be prohibitive.



The physical distance involved will often dictate the type of replication used to move data between sites.

There are two types of replication:
1) Synchronous replication
2) Asynchronous replication

Synchronous replication moves data in real time so that the data center and DR site contain the same data from moment to moment, but synchronous data transfers often need high-bandwidth links.

Asynchronous replication moves data on a bandwidth-available basis. This allows data movement using cheaper, lower-bandwidth connections, but presents a possibility of data loss because the data center and DR site may be out of sync by up to several hours (a rough lag estimate is sketched below).
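
As a rough back-of-the-envelope illustration of why asynchronous replication lags, here is a quick calculation relating the data change rate to the available WAN bandwidth; the figures and variable names are illustrative assumptions, not values from this document.

```python
# Rough estimate of asynchronous replication lag.
# If data changes faster than the WAN link can ship it, a backlog builds up;
# the effective RPO is bounded by how far behind the DR copy can fall.
# All numbers below are illustrative assumptions.

change_rate_gb_per_hour = 20   # data churned at the primary site
link_mbps = 34                 # effective WAN bandwidth (e.g., an E3-class link)

link_gb_per_hour = link_mbps / 8 * 3600 / 1024   # Mbps -> GB/hour
print(f"Link can ship ~{link_gb_per_hour:.1f} GB/hour")

if link_gb_per_hour >= change_rate_gb_per_hour:
    print("Link keeps up: lag is bounded by the shipping interval.")
else:
    backlog_gb_per_hour = change_rate_gb_per_hour - link_gb_per_hour
    print(f"Backlog grows by ~{backlog_gb_per_hour:.1f} GB every hour; "
          "the DR site falls progressively further behind (RPO keeps growing).")
```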



With the popularity of IP connectivity there are many connectivity options available. Connectivity for a SAN can be achieved in several ways:
• Ethernet
• FC (Fibre Channel)
• iSCSI (Internet Small Computer System Interface)
• FCIP (Fibre Channel over IP)
• FCoE (Fibre Channel over Ethernet)
• The sites can also be connected by a VPN, which provides cost benefits

1) Ethernet
Traditional Ethernet ports support 10/100 Mbps -- far slower than Fibre Channel. Ethernet bandwidth is increasing, and 10 Gigabit Ethernet (10GigE) is widely available for data centers.

2) Fibre Channel
Early FC implementations ran at 1 Gbps per port, and 2 Gbps reigned until recently. Today, 4 Gbps FC is readily available and 10 Gbps implementations are appearing on some high-end systems and director-class switches.
3) iSCSI (Internet Small Computer System Interface)
iSCSI transfers data over LANs, WANs or the Internet and supports storage management over long distances.
The emergence of iSCSI eases these challenges by encapsulating SCSI commands into IP packets for transmission over an Ethernet connection, rather than a Fibre Channel connection.
iSCSI still has two disadvantages for storage:
• At 1 GigE, it does not perform as fast as Fibre Channel.
• Ethernet will drop packets during network congestion.
These problems may be alleviated soon, thanks to the emergence of 10 GigE and Data Center Ethernet.

4) FCIP
FCIP translates Fibre Channel commands and data into IP packets, which can be exchanged between distant Fibre Channel SANs. It's important to note that FCIP only works to connect Fibre Channel SANs, whereas iSCSI can run on any Ethernet network.

5) FCoE
Storage vendors are working on a Fibre Channel over Ethernet (FCoE) standard to enable SAN and LAN convergence.
Requirements

• To establish WAN connectivity between the Central Location and 2 remote locations for the Data Transfer Application.
• A leased-line-based network design, primarily to be used for implementing the Online Data Transfer Application, with automatic ISDN backup connectivity.
• Connectivity from the Central Location to the remote locations at 64 Kbps to 2 Mbps.
• The connectivity is to be always on.
• The network devices are to be SNMP managed.
• Provision for future scalability.



DAX Network
Central Location:
At the central location, Dax recommended the customer opt for one DX-2650 Modular Access Router with one 10/100 port, 4 NM slots and VoIP module support.
The router was populated as follows:
• Slot 1 – 2-port Sync/Async Serial Module (speeds up to 2 Mbps)
• Slot 2 – 4-port ISDN U module
• The remaining 2 slots were left free for future scalability

Remote Location:
At each remote location, Dax recommended a DX-1721 Modular Router with one 10/100 port and 4 WAN slots for WAN/VoIP modules.
Each DX-1721 was loaded with the following modules:
• Slot 1 – ISDN S/T module for providing automatic backup connectivity
• Slot 2 – 1-port high-speed Sync/Async Serial WAN interface module for connecting the leased-line link at 64 Kbps up to 2 Mbps
• The remaining 2 slots were left free for future scalability



Section 4

Backup Solution



Possible Options
• Backup and recovery from tape
• Host-based replication
• Storage-based replication
• Data replication infrastructure
• Replicating databases
• A comparison of the various disaster recovery
solutions
• Metro clusters
Backup and Recovery from Tape
RAID technology is used to provide high levels of data availability, but it cannot protect against data loss if the data is deleted (accidentally or otherwise) or corrupted.

The tapes can be cloned, i.e., copied to new media, to allow them to be stored off-site in a disaster recovery location.

This is the least expensive of all the options.

It is only really applicable as the primary disaster recovery mechanism for non-critical services, i.e. services with RPOs where data loss and longer RTOs are acceptable.



Host-based replication
The remote mirror software works at the OS kernel level to intercept writes to underlying logical devices as well as to physical devices, such as disk slices and hardware RAID protected LUNs.

It then forwards these writes on to one or more remote Solaris OS-based nodes connected through an IP-based network.

Two modes of data transfer: synchronous mode replication and asynchronous mode replication.



Storage-Based Data Replication
Performs data replication on the CPUs or controllers resident in the storage systems.

Two modes, synchronous and asynchronous, but the software operates at a much lower level.

Consequently, storage-based replication software can replicate data held by applications such as Oracle OPS and Oracle RAC even though the I/Os to a single LUN might be issued by several nodes concurrently.

The software provides remote replication through disk-based journaling (see the sketch below).

Journaling techniques can improve levels of reliability and robustness in remote copying operations, thereby also providing better data recovery capabilities.
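
A minimal sketch of the disk-based journaling idea, not any vendor's actual implementation (the class and method names are hypothetical): each write is appended to an ordered journal, which is later shipped and replayed at the remote copy.

```python
# Hypothetical sketch of disk-based journaling for remote replication:
# every write is appended to a journal; the journal is shipped and replayed
# in order at the remote site, which makes recovery and resynchronization easier.

class JournalingReplicator:
    def __init__(self):
        self.journal = []            # ordered record of writes (block, data)
        self.remote_volume = {}      # replica at the DR site

    def write(self, block, data, local_volume):
        local_volume[block] = data          # primary write completes locally
        self.journal.append((block, data))  # journaled for later shipping

    def replay_to_remote(self):
        """Ship and apply journal entries in order, emptying the journal."""
        while self.journal:
            block, data = self.journal.pop(0)
            self.remote_volume[block] = data

if __name__ == "__main__":
    primary = {}
    repl = JournalingReplicator()
    repl.write(block=7, data=b"customer record", local_volume=primary)
    repl.write(block=7, data=b"updated record", local_volume=primary)
    repl.replay_to_remote()
    assert repl.remote_volume[7] == b"updated record"
    print("remote replica is consistent with the primary")
```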



Data replication infrastructure



Replicating databases

The RDBMS portfolios from IBM and Oracle include a wide range of tools to manage and administer the data held in their respective databases, DB2 and Oracle.

The RDBMS software is designed to handle logical changes to the underlying data, so it offers considerably greater flexibility and lower network traffic than a corresponding block-based replication solution (a rough comparison is sketched below).
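
To illustrate the "lower network traffic" point, here is a rough comparison of logical (log-based) versus block-based replication traffic for a single small update; the byte counts are assumptions chosen only for illustration.

```python
# Illustrative comparison: logical (redo/log-based) vs. block-based replication
# traffic for one small row update. The byte counts are assumptions chosen only
# to show why logical replication usually ships far less data.

row_update_logical_bytes = 200   # a redo/log record describing the change
block_size_bytes = 8 * 1024      # typical database block size
blocks_touched = 3               # e.g., data block + index block + undo block

block_based_bytes = blocks_touched * block_size_bytes

print(f"Logical replication ships ~{row_update_logical_bytes} bytes")
print(f"Block-based replication ships ~{block_based_bytes} bytes "
      f"({block_based_bytes // row_update_logical_bytes}x more for this update)")
```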



Metro Clusters

The ability to cluster systems across hundreds of kilometers using Dense Wavelength Division Multiplexers (DWDM) and SAN-connected Fibre Channel storage devices.

Cluster deployments that try to combine availability and disaster recovery by separating the two halves of the cluster and its storage between two widely separated data centers.

The physically separated cluster nodes work identically but offer the added benefits of protecting against local disasters and eliminating the requirement for a dedicated disaster recovery environment.



Section 5

Costing



• Investments in DR don't increase top-line revenue, though they will likely let you retain more of your profits through cost avoidance and corporate viability.
• Building the business case requires a different
approach that calculates the cost of downtime,
defines specific requirements, identifies realistic
risks, selects cost-effective technologies and
services, and shows a commitment to disaster
recovery planning and preparedness as an
ongoing program.



SEVEN KEY STEPS FOR DISASTER RECOVERY SPENDING
• Implement a continuity management process.
• Conduct a business impact analysis (BIA) and risk assessment.
• Calculate the cost of downtime.
• Develop impact scenarios that address all risks, not just “disasters.”
• Position DR as a competitive necessity.
• Develop a DR services catalog.
• Align DR technology investments with other IT initiatives.



Item                         Assumption                                    Qty      Unit Price (INR)   Cost (INR crores)

Capex
DC site                      33% of space in sq ft                         20,000   25,000             50
Servers                      33% of CPUs                                   2,000    500,000            100
Storage                      33% of storage in TB                          2,000    400,000            80
Network                      10% of server cost                                                        10
Software                     15% of storage cost                                                       12
Implementation - Consulting  10% of Capex                                                              20
Total                                                                                                  272

Opex
Bandwidth                                                                           100,000            50
Power                        Rs. 50,000 per kW per annum, 6 kW per rack    600      300,000            18
Manpower                     6 NOC seats, 20 on-site                                                   10
AMC                          6% of Capex                                                               15
Total                                                                                                  93
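
As a quick sanity check of the totals in the table above, here is a small recomputation from the table's own figures (1 crore = 10^7 INR); this is illustrative arithmetic, not part of the original costing.

```python
# Recompute the costing table's totals (figures taken from the table above).
CRORE = 10**7  # 1 crore = 10,000,000 INR

capex = {
    "DC site":  20_000 * 25_000 / CRORE,   # 50 crore
    "Servers":   2_000 * 500_000 / CRORE,  # 100 crore
    "Storage":   2_000 * 400_000 / CRORE,  # 80 crore
}
capex["Network"]  = 0.10 * capex["Servers"]   # 10 crore
capex["Software"] = 0.15 * capex["Storage"]   # 12 crore
capex["Implementation - Consulting"] = 20     # the table's own figure
print("Capex total:", sum(capex.values()), "crore")   # 272 crore

opex = {
    "Bandwidth": 50,
    "Power": 600 * (6 * 50_000) / CRORE,  # 600 racks x 6 kW x Rs. 50,000/kW = 18 crore
    "Manpower": 10,
    "AMC": 15,                            # the table's own figure
}
print("Opex total:", sum(opex.values()), "crore")      # 93 crore
```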



Thank You

