
Term Project

Disaster Recovery Plan for Archive Concepts, LLC

by Brian Miller

Drexel University
Professor James D. Baranello
CT-415 Disaster Recovery & Continuity Planning
Fall Quarter 2012
Friday, November 2nd, 2012 (Part One)
Saturday, December 15th, 2012 (Part Two)

Brian Miller Info:


Student ID: 12171222
bmiller@archiveconcepts.com

Page 1 of 40
Table of Contents
Abstract
Disclaimer
Section One
    Introduction
Organizational Background
    Backup Strategy
FEMA Disaster Scenario: Earthquake
    Emergency Scenario
    Threat, Risk and Impact
    Recovery Strategy
FEMA Disaster Scenario: Explosion
    Emergency Scenario
    Threat, Risk and Impact
    Recovery Strategy
Section Two
    Introduction
Disaster Recovery Plan
    Mission Statement
    Definitions
    Stakeholders
    Scope of Work
    Disaster Recovery Timeframe
    Team Participants and Responsibility Matrix
    Resources
    Tools
    Sites
    Recovery Procedure
Section Three
    Introduction
Testing the DRP
    Semi Annual Review
    Semi Annual Testing
Project Summary
Appendices
    Appendix I
    Appendix II
References

________

Abstract:
________

This paper consists of an analysis of a real company as it relates to a disaster recovery

plan. The company is Archive Concepts, LLC. I am the legitimate owner of Archive Concepts.

However, some of the facts about the company need to be changed to accurately describe what a

disaster recovery plan for a medium to large business should consist of. This paper is divided

into three parts: section one, section two and section three. Section one will discuss the

background of the organization, define and explain two disaster scenarios, analyze the risk, threat

and impact to the organization and discuss the recovery strategy. Section two will contain the

recovery team mission statement, scope of work for the recovery team, the timeframe to

recovery, team participants, tools, sites in which to recover and the recovery procedures.

Stakeholders will be identified, along with how they are impacted and how they will be updated

throughout the recovery process. Section three discusses the importance of testing the disaster

recovery plan and reviewing it on a regular basis for accuracy and relevance. The end goal is to

have a clear understanding of a disaster recovery initiative for Archive Concepts.

__________

Disclaimer:
__________

The company Archive Concepts, LLC is a real business entity. However, some of the

details about the company are false. Some regional offices are needed along with data centers to

accurately describe what a disaster recovery plan should present for a medium to large

organization. The company assets, resources, personnel, and other pertinent information will be

dramatized for the purpose of writing this paper to fulfill the requirements of the assignment and

ultimately to produce an appropriate disaster recovery plan.

___________

Section One:
___________

Introduction:

Section one discusses the background of the organization, defines and explains two

disaster scenarios, analyzes the risk, threat and impact to the organization, and discusses the

recovery strategy for the organization. The disaster scenarios are chosen from the Federal

Emergency Management Agency (FEMA) categories available from the document “Are You

Ready? An In-Depth Guide to Citizen Preparedness” which can be downloaded here:

http://www.fema.gov/pdf/areyouready/areyouready_full.pdf (Federal Emergency Management Agency, 2004). The

disaster scenarios chosen and discussed in this section are earthquakes and explosions.

________________________
Organizational Background:
________________________

Archive Concepts LLC (AC) is a full-service IT company offering a wide variety of

products and services. Archive Concepts has been serving the Pocono and surrounding areas

since the year 2000 and regionally since 2005. The company specializes in corporate networking

and systems implementation, PC and server sales/repair/installation, web design, document

management systems, virtual desktops, data storage solutions and managed hosting. Archive

Concepts has its corporate headquarters in East Stroudsburg, Pennsylvania, with primary

datacenter operations in a nearby Tier 1 facility and two regional offices, one in Atlanta, Georgia,

and the other in San Diego, California.

The primary datacenter facility of the company uses the latest in technological advances

and utilizes a large VMware infrastructure connected to a SAN. The facility is Tier 1 and is

hardened against environmental disasters such as flooding, hurricanes, and tornadoes, being

located in a mountainous, non-flood-zone area. The facility has standard security features such as

electronic key card entry, hand scan, fire prevention, video surveillance, backup power and

efficient floor cooling. The secondary datacenter is located in a Tier 2 facility in San Diego. That

facility also has standard security features such as electronic key card entry, hand scan, fire

prevention, video surveillance, backup power and efficient floor cooling but is more susceptible

to natural disasters such as earthquakes and hurricanes.

Backup Strategy:

The backup solution for Archive Concepts is quite elaborate but does not take advantage

of a hot site because, at this time, there is no justification for one. The primary backup strategy

leverages storage solutions offered by NetApp, ONTAP Snap Technology, DoubleTake

Replication Software, Symantec BackupExec Software and Overland LTO5 Tape Libraries. The

company refers to the backup as the “Company-Wide Backup Solution” (CWBS). The company

has critical data in all offices. CWBS consists of DoubleTake replication software loaded on any

server in the field that needs data to be backed up (referred to as source servers). Those source

servers replicate the data to the IDC on two CWBS target servers connected to our NetApp SAN

(Storage Area Network) system. All datastores in the VMware environment are also stored on

the NetApp. The NetApp appliances are the primary location for all data. The following chart

lists the technologies, all covered by a 24/7, four-hour-response premium support contract

that will fully replace all hardware that fails or is damaged during a disaster event:

Table 1: Primary and Secondary IDC Backup Equipment

Server Name              Purpose/Backup Targets                      Tape Library(ies) Attached        Connection
-----------------------  ------------------------------------------  --------------------------------  -----------------
IDC-NETAPP01,            Primary storage solution for all managed    None                              FibreChannel
IDC-NETAPP02             hosting data, VMware datastores and
(Model FAS3240 Cluster)  other CWBS data.

SDI-NETAPP01,            Backup storage solution for all managed     None                              FibreChannel
SDI-NETAPP02             hosting data, VMware datastores and
(Model FAS2240-4         other CWBS data.
Cluster)

SDI-BACKUP01             Primary server used to back up against      (2) Overland 30-tape dual-drive   Both FibreChannel
                         vaulted data to backup storage.             LTO5 libraries

IDC-CWBS01,              All CWBS data. These are VM servers         None                              FibreChannel
IDC-CWBS02               within the virtual environment.

As mentioned, all data is stored on the primary NetApp appliances located at the Tier 1

datacenter facility. AC has a ten-node clustered VMware virtual infrastructure environment

where all datastores are located on IDC-NETAPP01 and IDC-NETAPP02 appliances. The two

servers IDC-CWBS01 and IDC-CWBS02 are within the virtual environment. All critical field

data is replicated in real time via DoubleTake replication software to partitions on IDC-CWBS01

and IDC-CWBS02. The filer controllers IDC-NETAPP01 and IDC-NETAPP02 have Data

ONTAP snap technology enabled. Local snapshots of all volumes (which contain VM datastores

and other data) are enabled and occur on an hourly, weekly and monthly basis as defined within

the data protection policies.

The NetApp filers SDI-NETAPP01 and SDI-NETAPP02 are located in the San Diego

Tier 2 datacenter facility. Data ONTAP snapmirror technology replicates the local snapshots of

IDC-NETAPP01 and IDC-NETAPP02 to the first controller SDI-NETAPP01 on an hourly,

weekly and monthly basis as defined in the data protection policies. Controller SDI-NETAPP01

then “vaults” each of those snapshots to the second controller SDI-NETAPP02 for longer term

retention. The server SDI-BACKUP01 via the Symantec BackupExec server executes a backup

against the vaulted snapshots contained on filer SDI-NETAPP02. For backup rotations, AC

primarily uses the Grandfather-Father-Son (GFS) method:

• The Grandfather rotation runs 2-3 times a month, depending on how long the month is. It
  always starts on a Friday night and runs until completion throughout the weekend. This
  rotation always demands a new tape; it never appends to tapes that may already have
  data on them. This ensures that when we take the tapes offsite, no other rotation will
  depend on the missing data.

• The Father rotation runs 2 times a month. It always starts on a Friday night and runs until
  completion throughout the weekend. Like the Grandfather rotation, it always demands a
  new tape; it never appends to tapes that may already have data on them. This ensures that
  if we take the tapes offsite, no other rotation will depend on the missing data.

• The Son rotation occurs every Monday through Thursday night of each week. It
  performs a differential backup against the last Grandfather or Father that was run. This
  rotation appends to other Son tapes to maximize the space used on each tape.

All Grandfather tapes are shipped off site to an Iron Mountain secure facility for long

term storage. The Grandfather tapes contain a full backup of the vaulted snapshots and therefore

all data contained within them. The Father rotations are recycled after 180 days and Son rotations

are recycled after 45 days. The Father and Son tapes are taken off site with local IT resources.
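
As a rough illustration of the retention rules above, the following Python sketch computes when a tape becomes eligible for recycling and whether a given night is a Son night. The names RETENTION_DAYS, recycle_date and is_son_night are hypothetical and do not belong to any backup product. Treating Grandfather tapes as never recycled is an assumption based on their long term offsite storage.

```python
from datetime import date, timedelta
from typing import Optional

# Retention periods in days, taken from the rotation policy described above.
# None means the tape is kept indefinitely (assumption: Grandfather tapes
# held at the offsite facility are never recycled).
RETENTION_DAYS = {
    "grandfather": None,
    "father": 180,
    "son": 45,
}


def recycle_date(rotation: str, written_on: date) -> Optional[date]:
    """Return the first date a tape may be reused, or None if never."""
    days = RETENTION_DAYS[rotation.lower()]
    if days is None:
        return None
    return written_on + timedelta(days=days)


def is_son_night(day: date) -> bool:
    """Son rotations run Monday through Thursday nights."""
    # date.weekday() returns 0 for Monday and 3 for Thursday.
    return day.weekday() <= 3
```

For example, a Father tape written on Friday, November 2nd, 2012 would become reusable 180 days later, while a Grandfather tape written the same night would return None.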

The following chart is a visual representation of the backup strategy:

_________________________________

FEMA Disaster Scenario: Earthquake:


_________________________________

Emergency Scenario:

A severe earthquake is among the most destructive natural disasters. Depending on the severity of

the event, it could cause total destruction of a datacenter facility. In this scenario we discuss the

risk, threat and impact of a complete loss of the datacenter and the subsequent recovery effort to

bring the production environment back online. The scenario consists of a high-impact earthquake

hitting either datacenter location, East Stroudsburg or San Diego. The result of the

earthquake for the organization is a complete loss of the datacenter, but the devastation to the area is not so

heavy that the facility cannot be rebuilt.

Threat, Risk and Impact:

The first objective is to identify the threat. Knowing the threat one can then focus on the

assets that are directly affected by the given threat. The most vulnerable assets will be on the top

of the list and will be those that are recovered first. Once the threat to the

organization has been identified, one can then determine the risk that it poses to the assets. Lastly, after

determining the risk, each asset should be analyzed as to the impact it would cause to the

organization in the event of a total loss.

In this case the threat is an earthquake. The risk of an earthquake affecting datacenter

operations at the primary Tier 1 facility in East Stroudsburg, Pennsylvania, is very low. The

datacenter is located in a mountainous area away from any major fault lines; as slim as the

probability is, the possibility still exists. The risk of an earthquake affecting datacenter

operations at the secondary Tier 2 facility in San Diego, California, is much higher than at the

primary facility. California is known to be earthquake prone. The assets most critical to Archive

Concepts are its personnel, the datacenter facility itself and all of the equipment contained within

the facility. The most vulnerable specific assets are listed in Table 1 above.

The primary datacenter contains all live production data. The secondary datacenter

contains a replica of the production data from the primary datacenter. A loss of the primary

datacenter facility would have a greater impact on operations as it is the live environment. This is

precisely the reason why the primary datacenter facility is located in an area where very few

natural disasters occur. If an earthquake were to hit the primary datacenter and completely

destroy the facility, there would be service interruption to company personnel and customers.

However, with minimal effort, the primary production processes can be transferred to the

secondary datacenter facility since it does have a fully replicated mirror of the production data. A

temporary VMware environment can be implemented quickly at the secondary location which

can access the mirrored data. If an earthquake were to hit the secondary datacenter and

completely destroy the facility, there would be no service interruption to company personnel and

customers as that site’s purpose is only for backup.
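
The threat, risk and impact reasoning above can be sketched as a simple likelihood-times-impact ranking. The numeric scales below are illustrative assumptions and not figures from the plan. Mapping "much higher" to the label "high" is also an assumption, and the names SiteRisk, LIKELIHOOD and IMPACT are hypothetical.

```python
from dataclasses import dataclass

# Ordinal scales are illustrative assumptions, not figures from the plan.
LIKELIHOOD = {"very low": 1, "low": 2, "moderate": 3, "high": 4}
IMPACT = {"no service interruption": 1, "service interruption": 5}


@dataclass(frozen=True)
class SiteRisk:
    site: str
    likelihood: str
    impact: str

    @property
    def score(self) -> int:
        # Classic qualitative risk score: likelihood multiplied by impact.
        return LIKELIHOOD[self.likelihood] * IMPACT[self.impact]


earthquake = [
    SiteRisk("East Stroudsburg (primary)", "very low", "service interruption"),
    SiteRisk("San Diego (secondary)", "high", "no service interruption"),
]

# Highest score first, so the site to address first is at the top.
ranked = sorted(earthquake, key=lambda r: r.score, reverse=True)
for entry in ranked:
    print(f"{entry.site}: {entry.score}")
```

Under these assumed weights the primary site ranks first despite its very low likelihood, because its loss interrupts service. Different weights could reverse the order, so the scales should be agreed on before any such ranking is relied upon.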

Recovery Strategy:

First it is most prudent to address the recovery of the primary datacenter facility in the

event of a massive earthquake. A complete loss of the facility and operations would cause

interruption of service. The recovery strategy is to transfer datacenter operations to the secondary

San Diego facility. Converting the mirrored snapshots of the primary facility to the backup

NetApp SAN will take place first. Hardware for creating a temporary VMware environment is in

inventory. Configuring ESXi is a straightforward and relatively quick process.

The VMware environment can then be attached to the already existing mirrored copies of

VMware boot LUNs and datastores. Client services can be completely restored

within a 24-hour period.
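
The failover sequence and the 24-hour target described above can be sketched as a small checklist helper. This is a minimal sketch: the step wording is paraphrased from the text, and the names FAILOVER_STEPS, within_recovery_window and next_step are hypothetical.

```python
from datetime import datetime, timedelta

# 24-hour restoration target stated in the recovery strategy.
RECOVERY_WINDOW = timedelta(hours=24)

# Failover order as described in the text; step wording is paraphrased.
FAILOVER_STEPS = (
    "Convert mirrored snapshots on the backup NetApp SAN",
    "Configure ESXi on the inventory hardware",
    "Attach the VMware environment to mirrored boot LUNs and datastores",
    "Restore client services",
)


def within_recovery_window(declared_at: datetime, restored_at: datetime) -> bool:
    """True if services came back inside the 24-hour target."""
    return restored_at - declared_at <= RECOVERY_WINDOW


def next_step(completed: int) -> str:
    """Return the next failover step given how many are already done."""
    if completed >= len(FAILOVER_STEPS):
        return "Failover complete"
    return FAILOVER_STEPS[completed]
```

A recovery coordinator could call next_step with the number of finished steps to see what comes next, and call within_recovery_window after the event to record whether the target was met.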

Once restoration of services has been implemented at the backup facility, the corporate

facilities manager can then turn his attention to recovering the primary facility. The facilities

manager will coordinate with insurance, vendors and contractors to restore the primary

datacenter facility to its fully functioning state. Other senior-level management and information

technology professionals will also be involved in the recovery effort. The table below describes

additional personnel and their responsibilities:

Responsibility Matrix
These plans/points assume employee safety takes first priority.

Role                      Has Ability to                            Actions Responsible for
------------------------  ----------------------------------------  --------------------------------------------------------------
General Manager           • Declare Emergency                       • Ensuring impacted staff are provided real-time instructions
                          • Request DR plan to be activated         • GM to keep RGM informed
                                                                    • Keeps clients and others updated
                                                                    • General status communications / point person

Regional General Manager  • Declare Emergency                       • Keeps HR director updated
                          • Request DR plan to be activated         • Keeps Facility director updated
                          • Approval authority to activate DR plan  • Keeps CIO updated
                                                                    • Ensuring General Manager is aware of options
                                                                    • Keeping Senior Management updated on event handling
                                                                    • Coordination with Human Resources on event handling
                                                                    • Participates in general status communications

Senior Management         • Declare Emergency                       • Overall Oversight / set direction
                          • Request DR plan to be activated         • Coordination with Executive Management Team
                          • Approval authority to activate DR plan

Information Technology    • Request DR plan to be activated         • Initiates Technology tasks per DR plan
                                                                    • Communicates status of technology
                                                                    • Updates GM as to IT tasks/progress of DR plan action status

Human Resources           • Request DR plan to be activated         • Ensures any legal, employee safety concerns are being addressed
(Dir/Delegate)            • Approval authority to activate DR plan

Facilities                • Request DR plan to be activated         • Keeping Senior Management updated on event handling
(Dir/Delegate)
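
The "Has Ability to" column of the responsibility matrix above can be encoded as a lookup table, which makes it easy to check who may declare an emergency, request activation, or approve it. This is a minimal sketch: the short ability labels and the function names are hypothetical, and the two-person request-then-approve check is an assumed workflow that the plan does not spell out.

```python
from typing import Dict, List, Set

# Abilities per role, transcribed from the responsibility matrix.
# "declare" = Declare Emergency, "request" = Request DR plan to be activated,
# "approve" = Approval authority to activate DR plan.
ABILITIES: Dict[str, Set[str]] = {
    "General Manager": {"declare", "request"},
    "Regional General Manager": {"declare", "request", "approve"},
    "Senior Management": {"declare", "request", "approve"},
    "Information Technology": {"request"},
    "Human Resources": {"request", "approve"},
    "Facilities": {"request"},
}


def roles_with(ability: str) -> List[str]:
    """List the roles that hold a given ability, in alphabetical order."""
    return sorted(role for role, granted in ABILITIES.items() if ability in granted)


def can_activate(requester: str, approver: str) -> bool:
    """Assumed workflow: one role requests activation and another approves it."""
    may_request = "request" in ABILITIES.get(requester, set())
    may_approve = "approve" in ABILITIES.get(approver, set())
    return may_request and may_approve
```

Here roles_with("approve") returns Human Resources, Regional General Manager and Senior Management, matching the three roles shown with approval authority in the matrix.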

When the primary datacenter has been restored to its original state, the data at the San Diego

facility will be considered the production data. The data will need to be mirrored to the primary

datacenter facility using snap technology. The VMware environment will be rebuilt and a cut

over to the mirrored data will be scheduled. Primary production processes will be transferred to

the East Stroudsburg datacenter facility and San Diego will once again become the backup site.

Regularly scheduled snapshots, mirrors and vaults will resume on the pre-disaster schedule.

Secondly, if the San Diego facility were to be destroyed, recovery would be a

straightforward process, since it is the backup site and its loss does not impact users or customers. The

facilities manager will coordinate with insurance, contractors and vendors to restore the facility

to its original state. Information technology engineers will restore the replacement hardware

to the original configuration. Replication of data can then be resumed from the primary datacenter

facility.

_______________________________
FEMA Disaster Scenario: Explosion:
_______________________________

Emergency Scenario:

In this emergency scenario an explosion occurs at the primary datacenter facility as a

result of a terrorist attack. A terrorist could potentially inflict a great deal of damage

by destroying a Tier 1 datacenter facility. This is especially true if it belonged to a major carrier such as

AT&T, where many customers, some very large and important, have their technology

infrastructure hosted. The potential impact to the entire country could be massive. Depending on

the severity of the event, it could cause total destruction of a datacenter facility. In this scenario

we discuss the risk, threat and impact of a complete loss of the datacenter and the subsequent

recovery effort to bring the production environment back online. The scenario consists of an

explosion due to terrorist activity at either datacenter location, East Stroudsburg or

San Diego. The result of the terrorist attack for the organization is a complete loss of the

datacenter, but the devastation to the area is not so heavy that the facility cannot be rebuilt.

Threat, Risk and Impact:

The first objective is to identify the threat. Knowing the threat one can then focus on the

assets that are directly affected by the given threat. The most vulnerable assets will be on the top

of the list and will be those that are recovered first. Once the threat to the

organization has been identified, one can then determine the risk that it poses to the assets. Lastly, after

determining the risk, each asset should be analyzed as to the impact it would cause to the

organization in the event of a total loss.

In this case the threat is an explosion caused by a terrorist attack. The risk of a terrorist

attack affecting datacenter operations at the primary Tier 1 facility in East Stroudsburg,

Pennsylvania, is very low. The datacenter is located in a mountainous area outside of any major

cities or heavily populated areas; as slim as the probability is, the possibility still exists. The

risk of a terrorist attack affecting datacenter operations at the secondary Tier 2 facility in San

Diego, California, is much higher than at the primary facility since it is a more heavily populated

area. Terrorists are known to want to take as many lives as possible; they do not necessarily think

about technological infrastructure damage at a datacenter facility. The assets most critical to

Archive Concepts are its personnel, the datacenter facility itself and all of the equipment

contained within the facility. The most vulnerable specific assets are listed in Table 1 above.

The primary datacenter contains all live production data. The secondary datacenter

contains a replica of the production data from the primary datacenter. A loss of the primary

datacenter facility would have a greater impact on operations as it is the live environment. This is

precisely the reason why the primary datacenter facility is located in an area where very few

natural disasters occur and is located outside of any densely populated area. If a terrorist attack

were to happen at the primary datacenter and completely destroy the facility, there would be

service interruption to company personnel and customers. However, with minimal effort, the

primary production processes can be transferred to the secondary datacenter facility since it does

have a fully replicated mirror of the production data. A temporary VMware environment can be

implemented quickly at the secondary location which can access the mirrored data. If a terrorist

attack were to happen at the secondary datacenter and completely destroy the facility, there

would be no service interruption to company personnel and customers as that site’s purpose is

only for backup.

Recovery Strategy:

The recovery strategy for a terrorist attack will be the same as for an earthquake, since in both

scenarios the facilities are destroyed but still recoverable over time. First it is most prudent to

address the recovery of the primary datacenter facility in the event of a terrorist attack. A

complete loss of the facility and operations would cause interruption of service. The recovery

strategy is to transfer datacenter operations to the secondary San Diego facility. Converting the

mirrored snapshots of the primary facility to the backup NetApp SAN will take place first.

Hardware for creating a temporary VMware environment is in inventory.

Configuring ESXi is a straightforward and relatively quick process. The VMware environment

can then be attached to the already existing mirrored copies of VMware boot LUNs and

datastores. Client services can be completely restored within a 24-hour period.

Once restoration of services has been implemented at the backup facility, the corporate

facilities manager can then turn his attention to recovering the primary facility. The facilities

manager will coordinate with insurance, vendors and contractors to restore the primary

datacenter facility to its fully functioning state. Other senior-level management and information

technology professionals will also be involved in the recovery effort. The table below describes

additional personnel and their responsibilities:

Responsibility Matrix
These plans/points assume employee safety takes first priority.

Role                      Has Ability to                            Actions Responsible for
------------------------  ----------------------------------------  --------------------------------------------------------------
General Manager           • Declare Emergency                       • Ensuring impacted staff are provided real-time instructions
                          • Request DR plan to be activated         • GM to keep RGM informed
                                                                    • Keeps clients and others updated
                                                                    • General status communications / point person

Regional General Manager  • Declare Emergency                       • Keeps HR director updated
                          • Request DR plan to be activated         • Keeps Facility director updated
                          • Approval authority to activate DR plan  • Keeps CIO updated
                                                                    • Ensuring General Manager is aware of options
                                                                    • Keeping Senior Management updated on event handling
                                                                    • Coordination with Human Resources on event handling
                                                                    • Participates in general status communications

Senior Management         • Declare Emergency                       • Overall Oversight / set direction
                          • Request DR plan to be activated         • Coordination with Executive Management Team
                          • Approval authority to activate DR plan

Information Technology    • Request DR plan to be activated         • Initiates Technology tasks per DR plan
                                                                    • Communicates status of technology
                                                                    • Updates GM as to IT tasks/progress of DR plan action status

Human Resources           • Request DR plan to be activated         • Ensures any legal, employee safety concerns are being addressed
(Dir/Delegate)            • Approval authority to activate DR plan

Facilities                • Request DR plan to be activated         • Keeping Senior Management updated on event handling
(Dir/Delegate)

When the primary datacenter has been restored to its original state, the data at the San Diego

facility will be considered the production data. The data will need to be mirrored to the primary

datacenter facility using snap technology. The VMware environment will be rebuilt and a cut

over to the mirrored data will be scheduled. Primary production processes will be transferred to

the East Stroudsburg datacenter facility and San Diego will once again become the backup site.

Regularly scheduled snapshots, mirrors and vaults will resume on the pre-disaster schedule.

Secondly, if the San Diego facility were to be destroyed, recovery would be a

straightforward process, since it is the backup site and its loss does not impact users or customers. The

facilities manager will coordinate with insurance, contractors and vendors to restore the facility

to its original state. Information technology engineers will restore the replacement hardware

to the original configuration. Replication of data can then be resumed from the primary datacenter

facility.

___________

Section Two:
___________

Introduction:

Section two defines the mission statement of the Archive Concepts disaster recovery

team, the scope of work for the team, the timeframe for recovering from the disaster, team participants,

resources, tools, sites that will be recovered, and recovery procedures. The Archive Concepts

stakeholders will be identified, along with how they are impacted and how the team will respond to and

inform them during a disaster. Essentially, section two is the disaster recovery plan for the

Archive Concepts organization. In section one technology and scenarios are discussed. Section

two applies these technologies and formulates the actual disaster recovery plan, including recovery

timelines and objectives.

____________________

Disaster Recovery Plan:


____________________

Mission Statement:

The following plan is designed to provide Archive Concepts staff information, guidance

and other direction regarding an operationally impacting event. The mission of the disaster

recovery team and the recovery effort is to protect all Archive Concepts company assets, to

ensure the safety of all company associates, and to ensure the continued high level of service to

the Archive Concepts customer and user base. The disaster recovery team will be prepared to

recover all mission-critical business systems and services to another datacenter location and

ensure that pre-arranged vendor agreements are in place. The mission of the disaster recovery plan is to clearly

define responsibilities, actions and procedures to recover the Archive Concepts system,

communication and network environments in the event of a disaster. The DRP has three main

objectives: recover the physical network, recover the applications, and minimize the impact on

the business all within acceptable time frames as defined by the disaster recovery team and

executive management.

Definitions:

• Business Continuity: Best effort to keep business operations as close to normal as possible
  during an event.

• Disaster: Any situation, with or without advance notice, that poses a severe potential risk to
  employees and/or causes an ongoing and possibly sustained impact to operations.

• Disaster Recovery Plan: The document that defines the resources, actions, tasks, and data
  required to manage the business recovery process in the event of a disaster. The plan is
  designed to assist in restoring the business processes within the stated disaster recovery
  goals. Tim J. Smith states, “The business objective of Disaster Recovery is to manage
  system outages” which aligns with the ideals of the Archive Concepts disaster recovery
  team (Smith, 2002).

• Recovery Time Objective: The amount of downtime that can elapse before an outage threatens
  the survival of the organization and its mission-critical processes through lost revenue and
  reputation loss.

• Stakeholders: Any individuals who have a vested interest in the outcome of the disaster
  recovery policy and any subsequent disaster recovery effort.

Stakeholders:

The internal stakeholders of the Archive Concepts disaster recovery plan policies and

recovery efforts are Senior Management/Executives, General Managers, Regional General

Managers, Information Technology professionals, Human Resources and the Facilities Manager.

External stakeholders include Archive Concepts' primary vendors and customers.

Impact:

In the event of a complete disruption of services, all stakeholders will be affected in some

fashion. A business impact analysis was conducted to determine the impact of a disaster on the

operations of each operating unit within Archive Concepts. The business impact analysis

complements the disaster recovery plan by identifying those applications and systems with the

greatest impact on the business in the event of a disaster. Performing an impact analysis

allows the most effective recovery time period to be defined for each system and application.

Recovery times are established and accepted by the stakeholders.

The following chart outlines the Archive Concepts classification of systems, critical or

non-critical, the impact and optimal recovery time period:

Impact Matrix

| System | Classification | Impact to Operations | Coordinated Action |
|---|---|---|---|
| Email | Critical | Email is a critical application for business purposes. If this service is unavailable, staff members are unable to effectively communicate with each other and customers. | Upstream SPAM partner has the ability to hold company email in the queue, thus preventing undeliverable messages. |
| SQL Databases | Critical | SQL databases contain critical customer and Archive Concepts proprietary data. With services offline none of the staff can work. | SQL databases can be brought online at the disaster recovery site (San Diego). |
| Internet Access | Critical | Internet access is most critical for email communication. With no Internet access there would be no email communication. This would impact business and internal communication operations. | If necessary, VPN can be established to the disaster recovery facility for individuals to access the Internet. |
| ADP Payroll System | Non-Critical | ADP Payroll is accessed by the Payroll Department. This is a non-business impacting service that can be offline for a reasonable amount of time. If necessary, Payroll operations can temporarily be transferred to ADP. | The ADP Payroll system is in the virtualized environment. The VMware environment will be brought online at the disaster recovery site (San Diego). |
| iPro | Critical | iPro contains customer data. Customers who rely on the ability to access this data would be impacted negatively if the service were unavailable. | The iPro system is in the virtualized environment. The VMware environment will be brought online at the disaster recovery site (San Diego). |
| Web Applications | Critical | These are customer facing web applications. These applications interface with data such as iPro. | All web applications are in the virtualized environment. The VMware environment will be brought online at the disaster recovery site (San Diego). |
| ACentral | Non-Critical | ACentral is an internal web application for use by Archive Concepts only. This system can be offline for a reasonable amount of time. | The ACentral system is in the virtualized environment. The VMware environment will be brought online at the disaster recovery site (San Diego). |
| Great Plains | Critical | Microsoft Great Plains is a critical internal application for the operations, financial and human resources departments. If service is disrupted, staff cannot work effectively. | The Great Plains system is in the virtualized environment. The VMware environment will be brought online at the disaster recovery site (San Diego). |
| VMware Environment | Critical | Archive Concepts has an extensive VMware environment consisting of many production virtual servers. A service disruption would cause reputation damage and lost revenue. | The virtualized environment data is completely replicated to the recovery site in real time. The VMware environment will be brought online at the disaster recovery site (San Diego). |
| Storage Systems | Critical | Many systems are interlocked with the NetApp storage systems. In particular, VMware houses all the datastores on the NetApp storage system. It is business critical to keep the storage system online and accessible. Operations would be crippled with a service interruption. | The primary NetApp storage system is mirrored in real time to the disaster recovery site. Primary storage operations will be shifted to the San Diego disaster recovery site. |
| Network Layer | Critical | The network layer is the backbone of the IT operations. Service interruption would bring all services offline and would be customer and revenue impacting. | IT will assess the failure point and work to define a work-around or send replacement equipment. |
| Phone Communication Systems | Critical | Customers would not be able to communicate with Archive Concepts if the communication systems are offline. This would affect revenue and reputation. | IT staff will use the phone system (if available) to forward calls or attempt to forward calls at the carrier level. |

Status Updates:

Status updates will be given by various individuals as defined in the responsibility matrix.

Scope of Work:

Responsibility for the development and maintenance of the Archive Concepts disaster

recovery plan is assumed by the Information Technology department. Specific

responsibility for ensuring the plan is maintained and tested is assigned to the Infrastructure

Team within the Information Technology department. The end user community is responsible for

coordinating with the Help Desk Manager for their information technology requirements in the

event of a disaster.

Examples of events that would be in scope for this plan:

• Natural events - Weather emergencies/Fire/Flood/Hurricane/Earthquakes.

• Major facility events - Telecom circuit/catastrophic phone system failure/sustained power outage in building.

• Man-made events - Bomb threat, terrorism, impacting accidents, sabotage.

• Major infrastructure events - Critical equipment failure, such as network router or voice router failure.

Events not in scope:

• Planned, short term, local facility issues (building electrical maintenance).

• Individual or small group impacting events (workstation failures).

Systems in scope for this plan:

• Cisco Phones (incoming calls on primary line/extensions, inter office calling, dialing out).

• Network (Layer and access to data centers).

• Applications/Systems (ACentral, Great Plains, SQL, email, VMware).

• Storage Systems (NetApp, HP SANs).

• Workstations (desktops/laptops).

Disaster Recovery Timeframe:

To determine the maximum time frame allowable, the following Archive Concepts

operating departments were interviewed:

• Information Technology

• Finance & Accounting

• Business Development: Sales and Marketing

• Operations

• Purchasing

• Human Resources

• Facilities

• Customer Service

By interviewing all individuals, the recovery team is able to determine the maximum time

frame that each department can be without functionality of a given system without incurring

severe operational impact. The Recovery Time Objective is defined in business days as the

elapsed time between the disaster and the point where the systems must be functional again.

The recovery plan involves restoring the most critical systems first, such as the network and storage

systems. Without these two systems online it would not be possible to bring VMware and the

application servers back online. The least critical services such as ACentral and the ADP Payroll

system will be focused on last. Archive Concepts must have critical systems back online within a

24 hour period to sufficiently conduct business and care for its customers' needs.

The following chart outlines the recovery time frame for each system:

Disaster Recovery Timeframe Matrix

| System | Classification | Time to Recovery |
|---|---|---|
| Email | Critical | 9-24 Hours |
| SQL Databases | Critical | 0-8 Hours |
| Internet Access | Critical | 9-24 Hours |
| ADP Payroll System | Non-Critical | 25-72 Hours |
| iPro | Critical | 9-24 Hours |
| Web Applications | Critical | 9-24 Hours |
| ACentral | Non-Critical | 25-72 Hours |
| Great Plains | Critical | 9-24 Hours |
| VMware Environment | Critical | 0-8 Hours |
| Storage Systems | Critical | 0-8 Hours |
| Network Layer | Critical | 0-8 Hours |
| Phone Communication Systems | Critical | 9-24 Hours |
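
The timeframe matrix can also be kept in a machine-readable form so that the recovery order and the 24 hour requirement for critical systems can be checked automatically. The following Python sketch is only an illustration: the recovery windows are copied from the matrix above, while the dictionary layout and the function names are assumptions made for this example and are not part of the Archive Concepts plan.

```python
# Recovery windows (in hours) copied from the Disaster Recovery Timeframe Matrix.
# The tuple layout (classification, earliest hour, latest hour) and the helper
# names are illustrative assumptions, not part of the plan itself.
RECOVERY_MATRIX = {
    "Email": ("Critical", 9, 24),
    "SQL Databases": ("Critical", 0, 8),
    "Internet Access": ("Critical", 9, 24),
    "ADP Payroll System": ("Non-Critical", 25, 72),
    "iPro": ("Critical", 9, 24),
    "Web Applications": ("Critical", 9, 24),
    "ACentral": ("Non-Critical", 25, 72),
    "Great Plains": ("Critical", 9, 24),
    "VMware Environment": ("Critical", 0, 8),
    "Storage Systems": ("Critical", 0, 8),
    "Network Layer": ("Critical", 0, 8),
    "Phone Communication Systems": ("Critical", 9, 24),
}

# The plan states that critical systems must be back online within 24 hours.
CRITICAL_LIMIT_HOURS = 24


def recovery_order(matrix):
    """Return system names sorted by latest allowed recovery hour, then by name."""
    return sorted(matrix, key=lambda name: (matrix[name][2], name))


def critical_violations(matrix, limit=CRITICAL_LIMIT_HOURS):
    """Return critical systems whose recovery window ends after the limit."""
    return [
        name
        for name, (classification, _start, end) in matrix.items()
        if classification == "Critical" and end > limit
    ]
```

Calling recovery_order(RECOVERY_MATRIX) lists the 0-8 hour systems first and the non-critical 25-72 hour systems last. An empty result from critical_violations(RECOVERY_MATRIX) confirms that every critical system falls within the 24 hour requirement.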

Team Participants and Responsibility Matrix:

According to Christina Cappelletti in the research paper “Designing and Implementing a

Disaster Recovery Plan”, “A team needs to be assembled that will respond in the event of a

disaster. This team should include a member or representative of Senior Management, members

from the IS Department that will perform the assessment and recovery, representatives from

Facilities, and members from the Business and User Communities to determine what level of

recovery is needed and to verify that recovery is complete” (Cappelletti, 2002). The Archive

Concepts disaster recovery plan reflects this opinion as outlined in the following table:

Responsibility Matrix

These plans/points assume employee safety takes first priority.

| Role | Has Ability to | Actions Responsible for |
|---|---|---|
| General Manager | Declare Emergency; Request DR plan to be activated | Ensuring impacted staff are provided real-time instructions; GM to keep RGM informed; Keeps clients and others updated; General status communications / point person |
| Regional General Manager | Declare Emergency; Request DR plan to be activated; Approval authority to activate DR plan | Keeps HR director updated; Keeps Facility director updated; Keeps CIO updated; Ensuring General Manager is aware of options; Keeping Senior Management updated on event handling; Coordination with Human Resources on event handling; Participates in general status communications |
| Senior Management | Declare Emergency; Request DR plan to be activated; Approval authority to activate DR plan | Overall oversight / set direction; Coordination with Executive Management Team |
| Information Technology | Request DR plan to be activated; Updates GM as to IT action status | Initiates Technology tasks per DR plan; Communicates status of technology tasks/progress of DR plan |
| Human Resources (Dir/Delegate) | Request DR plan to be activated; Approval authority to activate DR plan | Ensures any legal, employee safety concerns are being addressed |
| Facilities (Dir/Delegate) | Request DR plan to be activated | Keeping Senior Management updated on event handling |
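
The abilities in the responsibility matrix can be expressed as a simple lookup, which makes it easy to confirm who may declare an emergency, request activation, or approve activation. The Python sketch below is a minimal illustration: the role names and abilities come from the matrix, but the constant names, the function names, and the idea of checking a requester and an approver together are assumptions made for this example.

```python
# Abilities copied from the Responsibility Matrix. The constant and function
# names are illustrative assumptions, not part of the plan.
DECLARE = "declare emergency"
REQUEST = "request DR plan activation"
APPROVE = "approve DR plan activation"

ABILITIES = {
    "General Manager": {DECLARE, REQUEST},
    "Regional General Manager": {DECLARE, REQUEST, APPROVE},
    "Senior Management": {DECLARE, REQUEST, APPROVE},
    "Information Technology": {REQUEST},
    "Human Resources": {REQUEST, APPROVE},
    "Facilities": {REQUEST},
}


def can(role, ability):
    """Return True when the role holds the given ability; unknown roles hold none."""
    return ability in ABILITIES.get(role, set())


def activation_authorized(requester, approver):
    """Assumed workflow: one role requests activation and another role approves it."""
    return can(requester, REQUEST) and can(approver, APPROVE)
```

For example, activation_authorized("Information Technology", "Senior Management") is True, while activation_authorized("Facilities", "General Manager") is False because the General Manager does not hold approval authority in the matrix.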

Resources:

The Archive Concepts disaster recovery team will leverage many internal and external

resources. Internal resources can include business analysts, end users, application owners,

managers and any other resource that is required. External resources will mainly consist of

primary vendors such as NetApp, VMware, CDW, and a variety of others on an as needed basis.

Tools:

The primary tool used for disaster recovery is the NetApp SnapMirror technology, which

allows all data from the primary East Stroudsburg datacenter facility to be replicated to the

secondary San Diego backup facility. Since Archive Concepts systems are virtualized and stored

on the NetApp storage system, this makes for a safe and easy recovery process. The following flow

chart describes the replication process:

All other flat files that may be contained within other storage systems are taken care of by

DoubleTake replication technology. All data is destined to a NetApp storage system where it is

then SnapMirrored to the backup location.

Sites:

There are two sites covered within this disaster recovery document, the primary East

Stroudsburg, PA datacenter and the backup facility in San Diego, CA.

Recovery Procedure:

Upon assessment of damage and activation of disaster recovery processes, the IT

leadership will determine the appropriate data recovery strategy. The data recovery processes

shall reflect Archive Concepts’ information system priorities as outlined in the disaster recovery

timeframe matrix. Data recovery activities shall take place in a pre-planned sequential fashion so

that system components can be restored in a logical manner and should take into consideration:

Personnel: The IT leadership and workforce members as well as all the disaster recovery team

members involved in disaster recovery processes will be the most valuable resource. These

individuals may be asked to work above and beyond normal working hours. Archive Concepts

will provide resources to meet their personal and professional needs.

Communication: Notification of internal and external business partners will be carried out by

appropriate individuals on an as needed basis.

Salvage of Existing IT Equipment: Initial data recovery efforts will be targeted at protecting and

preserving the current media, equipment, applications and systems. A priority will be to identify

if the network layer and NetApp storage system are recoverable. The IT equipment will be further

protected from the elements or removed to a safe location away from the disaster site if

necessary, and an immediate shift in focus to the secondary backup site in San Diego will be

initiated.

Designate Recovery Site: It will be necessary to determine if the data recovery efforts can be

carried out at the original primary East Stroudsburg site or moved to the secondary backup San

Diego location. The choice of using the primary site or the secondary backup site will be

dependent on the damage and estimated recovery time of the primary data center location.

Backup Equipment: The datacenter facility in San Diego has mirrored data to an existing NetApp

storage system. Primary operations can be transferred to this facility. A new VMware

environment can be quickly built and attached to the datastores already existing at the backup

facility. However, the recovery process will rely heavily on the ability of the Archive Concepts

vendors to quickly provide replacements for the resources which were not thought of or cannot

be salvaged from the primary facility. Emergency procurement processes will be implemented to

allow the IT leadership to quickly replace equipment, supplies, software and any other items

required for disaster recovery efforts.

Restoration of Data from Backups: Data recovery will rely on the availability of the backup data

from the secondary backup site. Initial data recovery efforts will focus on restoring the VMware

environment virtual servers by pre-determined priority.

Restoration of Applications: IT leadership will work with the individual departments and

application owners to restore each running application. This should be a painless process since

all VM images contained working application profiles that were replicated in real time prior to

the disaster. However, it is possible that application owners must resolve issues as a result of the new

environment.

Move Back to Primary East Stroudsburg Site: Since the disaster recovery process has taken place

at the secondary backup San Diego site, the systems that have been brought online at the

secondary site will need to be replicated to the original site when it becomes available. After

that, the primary data center operations can be shifted back to the original location in East

Stroudsburg where the corporate office resides.

Summary of Recovery Procedure:

Data is replicated to the secondary San Diego facility in real time. All Archive Concepts

systems are in a virtualized environment. All VM datastores, which contain the virtual server images,

are stored on the NetApp systems. Since all data is current, an

interim VM environment can be used at the secondary backup facility. When the primary

datacenter facility has been restored, current data from San Diego will be replicated back to East

Stroudsburg where primary datacenter operations can then be shifted.
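
The sequencing described in the recovery procedure (network and storage first, then VMware, then the virtualized applications) can be sketched as a dependency graph. The Python example below uses the standard library graphlib module (Python 3.9 or later) and is a hedged illustration only. The network, storage, VMware and virtualized application dependencies follow the plan text; the dependencies shown for SQL Databases, Internet Access, Email and Phone Communication Systems are assumptions made for this example.

```python
from graphlib import TopologicalSorter

# Each system maps to the systems that must be online before it can be restored.
# Network/storage -> VMware -> virtualized applications follows the plan text.
# The edges for SQL Databases, Internet Access, Email and Phone Communication
# Systems are assumptions made for illustration.
DEPENDS_ON = {
    "Network Layer": set(),
    "Storage Systems": set(),
    "VMware Environment": {"Network Layer", "Storage Systems"},
    "SQL Databases": {"Network Layer", "Storage Systems"},
    "Internet Access": {"Network Layer"},
    "Email": {"Internet Access"},
    "Phone Communication Systems": {"Network Layer"},
    "iPro": {"VMware Environment"},
    "Web Applications": {"VMware Environment", "iPro"},
    "Great Plains": {"VMware Environment"},
    "ADP Payroll System": {"VMware Environment"},
    "ACentral": {"VMware Environment"},
}


def recovery_waves(depends_on):
    """Group systems into waves; each wave needs only systems from earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

With the dependencies above, recovery_waves(DEPENDS_ON) places the Network Layer and Storage Systems in the first wave and Web Applications in the last. The waves express technical prerequisites only; within a wave, the order of work would still follow the priorities in the Disaster Recovery Timeframe Matrix.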

____________

Section Three:
____________

Introduction:

Section three is dedicated to providing information on the testing of the Archive

Concepts disaster recovery plan document, policies and procedures. The disaster recovery plan

documentation is considered a living document and will need to be regularly reviewed for

accuracy given the changes in the business environment. The objective of tests is to obtain the

most value from the disaster recovery procedures. The use of test objectives and success criteria

makes it possible to evaluate the effectiveness of the disaster recovery plan and supports business continuity.

_______________

Testing the DRP:


_______________

Semi Annual Review:

Maintenance of the Archive Concepts disaster recovery plan and the other business

continuity policies is critical to the success of an actual recovery and the stability of the

company. The disaster recovery plan must reflect changes to the system and networking

environments that are defined within the disaster recovery plan and be updated as appropriate. If

new technology is implemented then it needs to be addressed and added to the disaster recovery

plan. Conversely, if systems are decommissioned, they need to be removed from the disaster

recovery plan and objectives need to be redefined appropriately. The Archive Concepts disaster

recovery plan and other business continuity policies should be reviewed on a semiannual basis.

This is currently defined as a quarterly review. Each department should review the systems

it owns and report to IT any changes that have been made to the environment since the

last review. The IT department will then update the disaster recovery plan appropriately.

Semi Annual Testing:

Testing a disaster recovery plan can be very complex depending on the environment. The

overall objective of a disaster recovery plan is to replicate parts of or the entire existing IT

production environment at an alternate site until normal operations have been resumed at the

primary facility. An actual test of the recovery procedures defined within the Archive Concepts

disaster recovery plan should be conducted at least twice a year to ensure the backup

technologies and procedures are still functioning and relevant. With the current technologies in

use by Archive Concepts this should be a relatively painless process. Since all critical data is on

the NetApp system, engineers can check the replication status at any time to ensure compliance.

If there are any faults in replication status, they can be addressed at the time of discovery. If all

mirror replications have a mirrored status then all is well. Since all of the production data,

including CIFS shares, VM datastores and an assortment of other LUNs, are present on the primary

NetApp storage system, all of this data will be replicated to the secondary San Diego facility.

Testing procedures should include implementing a test VM environment, snapmirroring a

replicated VM datastore to another volume at the San Diego facility and attaching that volume to

the test VM environment. This is exactly the procedure used during disaster recovery, minus the need

for snapmirroring to another volume. Once the datastore is mounted to the test VM environment,

access to the individual virtual server images and configuration files will be granted. At that

point the test VM can be powered on to test access to the operating system and production

data. Any errors uncovered during the testing can be addressed without any impact to the

production environment.
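
The replication status check described above can be partly automated once the mirror relationships have been collected from the storage system. The Python sketch below is a minimal illustration under stated assumptions: the MirrorRelationship record, its field names, the "mirrored" state string and the 60 minute lag threshold are all hypothetical, and how the values are gathered from the NetApp system is outside the scope of the example.

```python
from dataclasses import dataclass


# Hypothetical record describing one mirror relationship. The field names and
# the way the values are collected from the storage system are assumptions.
@dataclass
class MirrorRelationship:
    source: str
    destination: str
    state: str          # assumed to read "mirrored" when the relationship is healthy
    lag_minutes: float  # time since the last successful transfer


def unhealthy_mirrors(relationships, healthy_state="mirrored", max_lag_minutes=60.0):
    """Return relationships that are not in the healthy state or lag too far behind."""
    return [
        rel
        for rel in relationships
        if rel.state.lower() != healthy_state.lower()
        or rel.lag_minutes > max_lag_minutes
    ]
```

An empty result means every relationship reports the healthy state within the allowed lag. Any relationship that is returned should be investigated at the time of discovery, as described above. The threshold should be tuned to match the replication schedule actually in use.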

The testing procedures should not interrupt production processes. Testing the transfer of

the network layer to the secondary San Diego facility is not possible during business hours.

However, a simulation of the transfer can be completed with software such as Cisco Packet

Tracer, where network equipment configurations can be loaded into simulated IOS devices to model the

actual Archive Concepts network layer. Accurate tests can be performed using this method to

ensure packets are routed between facilities in the event of a disaster at the primary East

Stroudsburg facility.

Once all testing is complete and errors are addressed, an update to the disaster recovery

plan documentation should be conducted. Once documentation is updated, senior IT leadership

should review and give final authorization to the recommended documentation changes. Again,

the disaster recovery plan as well as the testing processes are living entities and should be

regularly reviewed and tested.

________________

Project Summary:
________________

This paper consists of an analysis of a real company as it relates to a disaster recovery

plan. The company is Archive Concepts, LLC. I am the legitimate owner of Archive Concepts.

However, some of the facts about the company had to be changed to accurately describe what a

disaster recovery plan for a medium to large business should consist of. This paper is divided

into three parts: section one, section two and section three. Section one discussed the background

of the organization, defined and explained two disaster scenarios, analyzed the risks, threats and

impacts to the organization and discussed the recovery strategy. Section two contained the

recovery team mission statement, scope of work for the recovery team, the timeframe to

recovery, team participants, tools, sites in which to recover and the recovery procedures.

Stakeholders were identified, along with how they are impacted and how they would be updated

throughout the recovery process. Section three discussed the importance of testing the disaster

recovery plan and reviewing it on a regular basis for accuracy and relevance. The end goal is to

have a clear understanding of a disaster recovery initiative for Archive Concepts.

This paper gave me the opportunity to explore different ideas and write about my own

engineering experiences. The topics explained in this document are actual technologies I have

used for successful backup procedures and disaster recoveries. In the end I can use the

information contained in this document for future reference to draft disaster recovery plans for

other companies I deal with. I will obviously continue my research into new technologies to

improve backup and disaster recovery procedures.

__________

Appendices:
__________

Appendix I:

See file: BrianMiller-Week10-TermProject-CT415-DRP-BCP-Visio.vsd.

This Visio file shows the logical flow of data replication to the secondary San Diego

datacenter facility.

Appendix II:

See file: BrianMiller-Week10-TermProject-CT415-DRP-BCP.pptx.

This file is a PowerPoint presentation to Executive Management briefly describing the

Archive Concepts backup and recovery procedures.

__________

References:
__________

Bahan, C. (2003, June). The Disaster Recovery Plan. Retrieved December 10, 2012, from SANS.org: http://www.sans.org/reading_room/whitepapers/recovery/disaster-recovery-plan_1164

Cappelletti, C. (2002). Designing and Implementing a Disaster Recovery Plan. Retrieved December 10, 2012, from SANS.org: http://www.giac.org/paper/gsec/579/designing-implementing-disaster-recovery-plan/101250

Federal Emergency Management Agency. (2004, August 22). Are You Ready Guide. Retrieved October 20, 2012, from Ready.gov: http://www.fema.gov/pdf/areyouready/areyouready_full.pdf

Smith, T. J. (2002, November). CSA Explains Disaster Recovery. Retrieved December 10, 2012, from The Wiglaf Journal: http://www.wiglafjournal.com/uncategorized/2002/11/csa-explains-disaster-recovery/

Swanson, M. (2010, May). Contingency Planning Guide for Federal Information Systems. Retrieved December 10, 2012, from csrc.nist.gov: http://csrc.nist.gov/publications/nistpubs/800-34-rev1/sp800-34-rev1_errata-Nov11-2010.pdf
