
Disaster Recovery for a BSS Data Center



Disaster Recovery: The Lighter Side



Section 1

Disaster Recovery Overview



What is a Disaster?
• A hazard that has come to realization
• Perceived tragedy
– Natural calamity
– Man-made catastrophe
• Disasters are the consequence of
inappropriately managed risks



Risks to be Addressed



What is Disaster Recovery from an IT Perspective?

• Timely and effective restoration of IT services following a major incident

• Any plan or set of procedures implemented by a business to maintain uptime and/or prevent data loss in the event of a system failure



Disaster Recovery
• People
– Staff, Outsourced Technology
• Process
– Crisis Management Process
• Technology (IT)
– Hardware, Software


Metrics for Disaster Recovery (1/2)
• Driven by two metrics
– Recovery Time Objective (RTO): for how long can the service be interrupted?
– Recovery Point Objective (RPO): how much data loss is acceptable?



Metrics for Disaster Recovery (2/2)
[Timeline figure: a disaster is declared at 10 a.m.; the RPO window extends backwards from that point, the RTO window forwards to service restoration]

RPO: Amount of data lost from a failure, measured as the amount of time from the disaster event
RTO: Targeted amount of time to restart a business service after a disaster event



Understanding RPO and RTO
• Cost of downtime per hour (see the sketch below)
– Employee cost per hour + Cost of problem repair + Cost of employee overtime
– Loss of customers
– Damage to company reputation
• Recovery Point Objective (RPO)
– The point in time to which the data must be recovered
– The acceptable amount of data loss in a disaster situation
• Recovery Time Objective (RTO)
– The duration of time within which a business process must be restored after a disaster (underlying infrastructure and application components are restored first)
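
To make the downtime-cost bullet concrete, here is a minimal sketch that simply adds up the cost components listed above; the function, parameter names and figures are illustrative assumptions, not values from this document.

```python
# Hypothetical downtime-cost estimate based on the components listed above.
# All input figures are illustrative assumptions, not values from the deck.

def downtime_cost_per_hour(employees_idled, employee_cost_per_hour,
                           repair_cost_per_hour, overtime_cost_per_hour,
                           lost_revenue_per_hour, reputation_penalty_per_hour):
    """Sum the per-hour cost components of an outage."""
    return (employees_idled * employee_cost_per_hour
            + repair_cost_per_hour
            + overtime_cost_per_hour
            + lost_revenue_per_hour
            + reputation_penalty_per_hour)

if __name__ == "__main__":
    hourly = downtime_cost_per_hour(
        employees_idled=200,                  # staff unable to work
        employee_cost_per_hour=500,           # loaded cost per employee
        repair_cost_per_hour=50_000,          # vendor / problem-repair charges
        overtime_cost_per_hour=20_000,        # catch-up overtime
        lost_revenue_per_hour=300_000,        # lost customers / orders
        reputation_penalty_per_hour=100_000,  # rough proxy for brand damage
    )
    rto_hours = 4  # target recovery time
    print(f"Estimated cost of a {rto_hours}-hour outage: {hourly * rto_hours:,.0f}")
```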



Investment Scenario



High Availability vs. Disaster Tolerance
• High Availability
– Providing redundancy within a data center to maintain the service (with
or without a short outage)
• Hardware failures
• Software failures
• Human error

• Disaster Tolerance
– Providing redundancy between data centers to restore the service
quickly (tens of minutes) after certain disasters (using dedicated equipment)
• Power loss
• Fire, flood, earthquake
• Sabotage, terrorism



Availability Events (1/2)
• Planned Outages
– Network and power related changes
– Hardware repair
– Hardware and/or software upgrades
– Software maintenance
• OS
• Database
• Applications
– Data backup and storage management
• As data grows in size, tape backup is less effective
• What data must be archived?
• How is the data archived?



Availability Events (2/2)
• Unplanned Outages
– Hardware failure
• Server, storage, network, power
– Software failure
• Crashes, errors, hangs, etc.
• OS and applications
– Human error
• Hardware, software, data
– Disasters (man-made and otherwise)



What causes the most Downtime?

Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008



Measure of Availability

[Chart: hours of downtime per year, per IT service]

Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008
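
For context on what "hours of downtime per year" means in availability terms, here is a quick conversion from availability percentage to allowed downtime; the service levels shown are illustrative, not taken from the Gartner chart.

```python
# Convert an availability percentage into allowed downtime per year.
# Purely arithmetic; the specific service levels shown are illustrative.

HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_hours_per_year(availability_pct):
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct:>7}% available -> {downtime_hours_per_year(pct):8.2f} hours of downtime/year")
```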



Section 2

Architecture & Sizing for Disaster Recovery



2-Site Architecture
• 100% capacity at the Primary Site + 100% at the DR Site
• Database changes are more frequent, hence log-based replication of the database between the Primary and DR sites
• Synchronous replication is not possible because of WAN bandwidth constraints
• Asynchronous replication is possible (a minimal log-shipping sketch follows below)
• RPO: depends on how much data has to be replicated
• RTO: depends upon people + processes
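
To make the log-based, asynchronous replication in this design concrete, here is a minimal sketch of a log-shipping loop. It is not a real replication product; the paths, host layout and 15-minute interval are illustrative assumptions.

```python
# Hypothetical sketch of asynchronous, log-based replication:
# archive logs accumulate at the primary and are shipped to the DR site
# on a schedule, so the DR copy lags by up to one shipping interval (the RPO).
import os
import shutil
import time

LOG_DIR = "/u01/archivelogs"          # where the primary database writes archive logs
DR_STAGING = "/dr_mount/archivelogs"  # DR-site staging area (e.g., over a WAN mount)
SHIP_INTERVAL_SECONDS = 15 * 60       # ship every 15 minutes -> worst-case RPO ~ 15 min

def ship_archive_logs():
    """Copy any archive logs not yet present at the DR site."""
    for name in sorted(os.listdir(LOG_DIR)):
        target = os.path.join(DR_STAGING, name)
        if not os.path.exists(target):
            shutil.copy2(os.path.join(LOG_DIR, name), target)
            print(f"shipped {name}")

if __name__ == "__main__":
    while True:
        ship_archive_logs()
        time.sleep(SHIP_INTERVAL_SECONDS)  # data generated in this window is at risk
```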



2-Site Architecture: Working



[Diagram: the Primary Site (ACTIVE) and the DR Site each have an application tier (application servers), a database tier (DBCI servers in a cluster) and a storage tier (SAN storage volume groups holding application files and archive logs); the SAN storage is replicated asynchronously from the Primary Site to the DR Site over dark fibre]



2-Site Architecture
• Advantages
– Simple to manage
– Less expensive than other solutions
– Only one replication link needs to be procured
• Disadvantages
– The business impact of a 15-minute RPO cannot be quantified (it could be high or low)
– Cannot estimate what kind of data loss will occur
– The RTO of the DR site cannot be committed to the business because of lost transactions
3-Site Architecture (for RPO=0)



• For RPO=0
– Must have synchronous replication of the database
– Synchronous replication has limitations on distance (40 to 60 km)
– Hence the database cannot be replicated synchronously over long distances
– But it can be replicated synchronously over short distances
– So a 3-site (Primary, Near, DR) solution can achieve RPO=0 (almost); see the sketch below
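
A minimal sketch of why synchronous replication gives RPO=0: the primary acknowledges a commit only after the Near site has received the data. The class and method names are hypothetical, purely to illustrate the semantics.

```python
# Hypothetical illustration of synchronous vs. asynchronous replication semantics.
# With synchronous replication the commit is acknowledged only after the
# near site has the data, so no committed transaction can be lost (RPO = 0).

class NearSite:
    def __init__(self):
        self.log = []

    def receive(self, record):
        self.log.append(record)      # durably stored at the near site
        return True                  # acknowledgement back to the primary

class PrimarySite:
    def __init__(self, near_site, synchronous=True):
        self.near_site = near_site
        self.synchronous = synchronous
        self.pending = []

    def commit(self, record):
        if self.synchronous:
            # Wait for the near site's ack before telling the client "committed".
            # Over ~40-60 km the round trip is small; over long distances the
            # added latency makes synchronous commits impractical.
            self.near_site.receive(record)
            return "committed (safe at near site)"
        else:
            # Asynchronous: acknowledge immediately, replicate later.
            self.pending.append(record)
            return "committed (at risk until shipped)"

if __name__ == "__main__":
    primary = PrimarySite(NearSite(), synchronous=True)
    print(primary.commit({"txn": 42, "amount": 100}))
```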



• In which cases will RPO be zero?
– Regional disasters which don't destroy the Primary and Near sites at the same time
– For all kinds of DC failures, RPO=0 can be achieved
– In case of a regional disaster which wipes out both the Primary and Near sites, RPO will depend upon the link between the Primary and DR sites (it could be 15 minutes, depending upon the size of the link)



[Diagram: Primary Site, Near Site and DR Site, each with application servers, DBCI servers and SAN storage volume groups (application files and archive logs); storage is replicated synchronously from the Primary Site to the Near Site, and asynchronously over a WAN link to the DR Site]



3-Site Architecture: Working

[Diagram: Site A (PROD) connected to Site B (Near/bunker site) over dark fibre, distance < 25 km; Site C is the DR site]



3-Site DR Considerations
• What must a Near site have?
– Different and multiple power sources / power grids
– Network termination exactly the same as the Primary DC (if the Near site has to be used for Primary-site operations)
– Replication links from multiple vendors (no SPOF)
– A link to the DR site



What should be in the Near Site?
– Option 1: Full 100% replica of the Primary Site
• High cost (infrastructure + people)
– Servers, storage, firewalls, switches, backup, power sources
– Applications, databases, etc.
– Security, personnel, processes
– Network connectivity
• Would protect against any local problems at the Primary DC



What should be in the Near Site?
• Option 2: Split configuration between the Primary and Near sites
– Database servers split between the Primary and Near sites (extended cluster)
– When the Primary DC fails, operations move to the Near Site
– Maintenance and continuous upkeep of the Near Site is essential
– Redundancy required for application servers, firewalls, routers, servers, backup, etc.



What should be in the Near Site?
• Option 3: Minimalist
– Treat the Near site only as the means to RPO=0, not for operations
– Replicate storage continuously for RPO=0
– Keep only the hardware needed to push data from the Near site to the DR site in case of a Primary DC failure
– Keeps the simplicity of 2-site DR while adding the RPO=0 of the 3-site design
– RPO=0 is not achieved if the Primary and Near sites go down together



Section 3

Connectivity to DR Site



Connectivity

The majority of businesses deploy wide area networks (WANs) to connect the remote parts of the business back to centralized resources.

Bandwidth is always an issue in disaster recovery. If you're replicating data for potential failover, both locally and remotely, then your bandwidth issues become more complicated.

We want to establish a DR site that's far enough away that it won't be affected by the same disaster, but not so far away that WAN bandwidth costs will be prohibitive.



The physical distance involved will often dictate the type of replication used to move data between sites.

There are two types of replication:
1) Synchronous replication
2) Asynchronous replication

Synchronous replication moves data in real time so that the data center and DR site contain the same data from moment to moment, but synchronous data transfers often need high-bandwidth links.

Asynchronous replication moves data on a bandwidth-available basis. This allows data movement using cheaper, lower-bandwidth connections, but presents a possibility of data loss because the data center and DR site may be out of sync by up to several hours (a rough lag estimate is sketched below).
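
As a rough back-of-the-envelope illustration of why asynchronous replication lags, here is a quick calculation relating the data change rate to the available WAN bandwidth; the figures and variable names are illustrative assumptions, not values from this document.

```python
# Rough estimate of asynchronous replication lag.
# If data changes faster than the WAN link can ship it, a backlog builds up;
# the effective RPO is bounded by how far behind the DR copy can fall.
# All numbers below are illustrative assumptions.

change_rate_gb_per_hour = 20   # data churned at the primary site
link_mbps = 34                 # effective WAN bandwidth (e.g., an E3-class link)

link_gb_per_hour = link_mbps / 8 * 3600 / 1024   # Mbps -> GB/hour
print(f"Link can ship ~{link_gb_per_hour:.1f} GB/hour")

if link_gb_per_hour >= change_rate_gb_per_hour:
    print("Link keeps up: lag is bounded by the shipping interval.")
else:
    backlog_gb_per_hour = change_rate_gb_per_hour - link_gb_per_hour
    print(f"Backlog grows by ~{backlog_gb_per_hour:.1f} GB every hour; "
          "the DR site falls progressively further behind (RPO keeps growing).")
```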



With the popularity of IP connectivity there are many connectivity options available. Connectivity for a SAN can be achieved in several ways:
• Ethernet
• FC (Fibre Channel)
• iSCSI (Internet Small Computer System Interface)
• FCIP (Fibre Channel over IP)
• FCoE (Fibre Channel over Ethernet)
• The sites can also be connected by a VPN, which provides cost benefits

1) Ethernet
Traditional Ethernet ports support 10/100 Mbps -- far slower than Fibre Channel. Ethernet bandwidth is increasing, and 10 Gigabit Ethernet (10GigE) is widely available for data centers.

2) Fibre Channel
Early FC implementations ran at 1 Gbps per port, and 2 Gbps reigned until recently. Today, 4 Gbps FC is readily available and 10 Gbps implementations are appearing on some high-end systems and director-class switches.
3) iSCSI (Internet Small Computer System Interface)
iSCSI transfers data over LANs, WANs or the Internet and supports storage management over long distances.
The emergence of iSCSI eases these challenges by encapsulating SCSI commands into IP packets for transmission over an Ethernet connection, rather than a Fibre Channel connection.
iSCSI still has two disadvantages for storage:
• At 1 GigE, it does not perform as fast as Fibre Channel.
• Ethernet will drop packets during network congestion.
These problems may be alleviated soon, thanks to the emergence of 10 GigE and Data Center Ethernet.

4) FCIP
FCIP translates Fibre Channel commands and data into IP packets, which can be exchanged between distant Fibre Channel SANs. It's important to note that FCIP only works to connect Fibre Channel SANs, whereas iSCSI can run on any Ethernet network.

5) FCoE
Storage vendors are working on a Fibre Channel over Ethernet (FCoE) standard to enable SAN and LAN convergence.
Requirements

• To establish WAN connectivity between the Central Location and 2 remote locations for the Data Transfer Application.
• A leased-line-based network design, primarily to be used for implementing the Online Data Transfer Application, with automatic ISDN backup connectivity.
• Connectivity from the Central Location to the remote locations at 64 Kbps to 2 Mbps.
• The connectivity is to be always on.
• The network devices are to be SNMP managed.
• Provision for future scalability.



DAX Network
Central Location:
At the central location, Dax recommended the customer opt for one DX-2650 Modular Access Router with one 10/100 port, 4 NM slots and VoIP module support.
The router was populated as follows:
• Slot 1 – 2-port Sync/Async Serial Module (speeds up to 2 Mbps)
• Slot 2 – 4-port ISDN U module
• The remaining 2 slots were left free for future scalability

Remote Location:
At each remote location, Dax recommended a DX-1721 Modular Router with one 10/100 port and 4 WAN slots for WAN/VoIP modules.
Each DX-1721 was loaded with the following modules:
• Slot 1 – ISDN S/T module for providing automatic backup connectivity
• Slot 2 – 1-port high-speed Sync/Async Serial WAN interface module for connecting the leased-line link at 64 Kbps up to 2 Mbps
• The remaining 2 slots were left free for future scalability



Section 4

Backup Solution



Possible Options
• Backup and recovery from tape
• Host-based replication
• Storage-based replication
• Data replication infrastructure
• Replicating databases
• A comparison of the various disaster recovery
solutions
• Metro clusters
Backup and Recovery from Tape
RAID technology is used to provide high levels of data availability, but it cannot protect against data loss if the data is deleted (accidentally or otherwise) or corrupted.

The tapes can be cloned, i.e., copied to new media, to allow them to be stored off-site in a disaster recovery location.

This is the least expensive of all the options.

It is only really applicable as the primary disaster recovery mechanism for non-critical services, i.e. services with RPOs where data loss and longer RTOs are acceptable.



Host-based replication
The remote mirror software works at the OS kernel level to intercept writes to underlying logical devices as well as to physical devices, such as disk slices and hardware RAID protected LUNs.

It then forwards these writes on to one or more remote Solaris OS-based nodes connected through an IP-based network.

Two modes of data transfer: synchronous mode replication and asynchronous mode replication.



Storage-Based Data Replication
Performs data replication on the CPUs or controllers resident in the storage systems.

Two modes, synchronous and asynchronous, but the software operates at a much lower level.

Consequently, storage-based replication software can replicate data held by applications such as Oracle OPS and Oracle RAC even though the I/Os to a single LUN might be issued by several nodes concurrently.

The software provides remote replication through disk-based journaling (see the sketch below).

Journaling techniques can improve levels of reliability and robustness in remote copying operations, thereby also providing better data recovery capabilities.
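
A minimal sketch of the disk-based journaling idea, not any vendor's actual implementation (the class and method names are hypothetical): each write is appended to an ordered journal, which is later shipped and replayed at the remote copy.

```python
# Hypothetical sketch of disk-based journaling for remote replication:
# every write is appended to a journal; the journal is shipped and replayed
# in order at the remote site, which makes recovery and resynchronization easier.

class JournalingReplicator:
    def __init__(self):
        self.journal = []            # ordered record of writes (block, data)
        self.remote_volume = {}      # replica at the DR site

    def write(self, block, data, local_volume):
        local_volume[block] = data          # primary write completes locally
        self.journal.append((block, data))  # journaled for later shipping

    def replay_to_remote(self):
        """Ship and apply journal entries in order, emptying the journal."""
        while self.journal:
            block, data = self.journal.pop(0)
            self.remote_volume[block] = data

if __name__ == "__main__":
    primary = {}
    repl = JournalingReplicator()
    repl.write(block=7, data=b"customer record", local_volume=primary)
    repl.write(block=7, data=b"updated record", local_volume=primary)
    repl.replay_to_remote()
    assert repl.remote_volume[7] == b"updated record"
    print("remote replica is consistent with the primary")
```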



Data replication infrastructure



Replicating databases

The RDBMS portfolios from IBM and Oracle include a wide range of tools to manage and administer the data held in their respective databases, DB2 and Oracle.

The RDBMS software is designed to handle logical changes to the underlying data, so it offers considerably greater flexibility and lower network traffic than a corresponding block-based replication solution (a rough comparison is sketched below).
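
To illustrate the "lower network traffic" point, here is a rough comparison of logical (log-based) versus block-based replication traffic for a single small update; the byte counts are assumptions chosen only for illustration.

```python
# Illustrative comparison: logical (redo/log-based) vs. block-based replication
# traffic for one small row update. The byte counts are assumptions chosen only
# to show why logical replication usually ships far less data.

row_update_logical_bytes = 200   # a redo/log record describing the change
block_size_bytes = 8 * 1024      # typical database block size
blocks_touched = 3               # e.g., data block + index block + undo block

block_based_bytes = blocks_touched * block_size_bytes

print(f"Logical replication ships ~{row_update_logical_bytes} bytes")
print(f"Block-based replication ships ~{block_based_bytes} bytes "
      f"({block_based_bytes // row_update_logical_bytes}x more for this update)")
```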



Metro Clusters

The ability to cluster systems across hundreds of kilometers using Dense Wavelength Division Multiplexers (DWDM) and SAN-connected Fibre Channel storage devices.

Cluster deployments that try to combine availability and disaster recovery by separating the two halves of the cluster and its storage between two widely separated data centers.

The physically separated cluster nodes work identically but offer the added benefits of protecting against local disasters and eliminating the requirement for a dedicated disaster recovery environment.



Section 5

Costing



• Investments in DR don't increase top-line revenue, though they will likely let you retain more of your profits through cost avoidance and corporate viability.
• Building the business case requires a different
approach that calculates the cost of downtime,
defines specific requirements, identifies realistic
risks, selects cost-effective technologies and
services, and shows a commitment to disaster
recovery planning and preparedness as an
ongoing program.



SEVEN KEY STEPS FOR DISASTER RECOVERY SPENDING
• Implement a continuity management process.
• Conduct a business impact analysis (BIA) and risk assessment.
• Calculate the cost of downtime.
• Develop impact scenarios that address all risks, not just “disasters.”
• Position DR as a competitive necessity.
• Develop a DR services catalog.
• Align DR technology investments with other IT initiatives.



Item                         Assumption                                    Qty      Unit Price (INR)   Cost (INR crores)

Capex
DC site                      33% of space in sq ft                         20,000   25,000             50
Servers                      33% of CPUs                                   2,000    500,000            100
Storage                      33% of storage in TB                          2,000    400,000            80
Network                      10% of server cost                                                        10
Software                     15% of storage cost                                                       12
Implementation - Consulting  10% of Capex                                                              20
Total                                                                                                  272

Opex
Bandwidth                                                                           100,000            50
Power                        Rs. 50,000 per kW per annum, 6 kW per rack    600      300,000            18
Manpower                     6 NOC seats, 20 on-site                                                   10
AMC                          6% of Capex                                                               15
Total                                                                                                  93
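
As a quick sanity check of the totals in the table above, here is a small recomputation from the table's own figures (1 crore = 10^7 INR); this is illustrative arithmetic, not part of the original costing.

```python
# Recompute the costing table's totals (figures taken from the table above).
CRORE = 10**7  # 1 crore = 10,000,000 INR

capex = {
    "DC site":  20_000 * 25_000 / CRORE,   # 50 crore
    "Servers":   2_000 * 500_000 / CRORE,  # 100 crore
    "Storage":   2_000 * 400_000 / CRORE,  # 80 crore
}
capex["Network"]  = 0.10 * capex["Servers"]   # 10 crore
capex["Software"] = 0.15 * capex["Storage"]   # 12 crore
capex["Implementation - Consulting"] = 20     # the table's own figure
print("Capex total:", sum(capex.values()), "crore")   # 272 crore

opex = {
    "Bandwidth": 50,
    "Power": 600 * (6 * 50_000) / CRORE,  # 600 racks x 6 kW x Rs. 50,000/kW = 18 crore
    "Manpower": 10,
    "AMC": 15,                            # the table's own figure
}
print("Opex total:", sum(opex.values()), "crore")      # 93 crore
```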



Thank You

