
CIT 470: Advanced Network and System Administration
Disaster Recovery

CIT 470: Advanced Network and System Administration Slide #1

Topics
1. Planning
2. Disasters
3. Mitigation


What is a Disaster Recovery Plan?


1. Considers potential disasters.
2. Describes how to mitigate potential disasters.
3. Makes preparations to enable quick restoration of services.
4. Identifies key services, how quickly they need to be restored, and in what order.
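One way to make the last point concrete is to record a maximum tolerable downtime for each service and sort on it. The sketch below is only an illustration: the `Service` class, the `restoration_order` helper, and the service names and deadlines are hypothetical, not taken from any real plan.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    max_downtime_hours: float  # how quickly the service must be restored


def restoration_order(services):
    """Return services sorted so the tightest deadline is restored first."""
    return sorted(services, key=lambda s: s.max_downtime_hours)


# Hypothetical services and deadlines, for illustration only.
services = [
    Service("file server", 8),
    Service("DNS", 1),
    Service("e-mail", 4),
]

for s in restoration_order(services):
    print(f"{s.name}: restore within {s.max_downtime_hours} h")
```

A real plan would also have to account for dependencies between services (for example, most services need DNS first), which this simple sort does not model.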

Disaster Recovery Plans
1. Define (un)acceptable loss.
How much could you lose in a disaster?
2. Back up everything.
Back up data, metadata, and instructions on how to restore your system.
3. Organize everything.
Can you find the backup tapes you need when disaster strikes?


Disaster Recovery Plans


4. Protect against disasters.
Natural and human disasters.
5. Document what you have done.
The plan must be detailed enough for people to follow in a disaster without additional information.
6. Test, test, test.
A disaster recovery plan that has not been tested is not a plan; it's a proposal.


Define loss
Loss of service
How much employee productivity is lost?
How much customer revenue is lost?
Loss of data
Irreplaceable data
Medical image records
Stock purchases
Re-creatable data. At what cost?
Code for a software product
Simulation results

Back Up Everything
On a system
Project directories
Home directories
System files (fstab, kernel, passwd, LVM, MBR)
Types of systems
Laptops
Connect, then back up to the backup server on command.
Desktops
Store everything on network disks.
Servers
Permanent connection to backup system.


Organize Everything
What resources do you back up?
On what schedule?
Media organization
Bar code labels on each tape.
Stored securely at proper temperature and humidity.
Media database
Maps servers/drives to tapes and their locations.
Indicates whether tapes are on- or off-site.
Must itself be backed up, with a human-readable label.
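A minimal sketch of such a media database follows, assuming a flat list of records is enough. The `TapeRecord` fields, the helper names, and the sample bar codes, servers, and locations are all hypothetical. `readable_dump` produces a plain-text listing that could be printed and stored alongside the tapes.

```python
from dataclasses import dataclass


@dataclass
class TapeRecord:
    barcode: str   # bar code label on the tape
    server: str    # server that was backed up
    drive: str     # drive or filesystem on that server
    location: str  # where the tape is stored
    offsite: bool  # True if the tape is stored off-site


def tapes_for(records, server, drive):
    """Return every tape holding a backup of the given server and drive."""
    return [r for r in records if r.server == server and r.drive == drive]


def readable_dump(records):
    """Render the database as plain text, one tape per line, sorted by bar code."""
    lines = []
    for r in sorted(records, key=lambda rec: rec.barcode):
        site = "off-site" if r.offsite else "on-site"
        lines.append(f"{r.barcode}  {r.server}:{r.drive}  {r.location} ({site})")
    return "\n".join(lines)


# Hypothetical entries, for illustration only.
records = [
    TapeRecord("T0002", "web1", "/home", "vault A", False),
    TapeRecord("T0001", "web1", "/home", "storage vendor", True),
    TapeRecord("T0003", "db1", "/var/lib", "vault A", False),
]

print(readable_dump(records))
```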


Protect against Disasters


On-site vaults.
Off-site storage.
Test your media regularly.
Store documentation securely too.

Document
Store documentation in portable format.
Ensure documentation is accessible in a disaster.
Keep paper copies on- and off-site.


Test
Can other people understand procedures?
Test a sample of tapes on a regular (weekly) basis.
Attempt a full system recovery twice a year.
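The weekly sample test could be sketched as follows, assuming a checksum was recorded for each backup when it was written. The function names, the tape labels, and the in-memory byte strings standing in for restored data are hypothetical; a real test would restore from the media itself.

```python
import hashlib
import random


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a block of data."""
    return hashlib.sha256(data).hexdigest()


def pick_sample(labels, k, seed=None):
    """Choose up to k tape labels at random for this week's test."""
    rng = random.Random(seed)
    ordered = sorted(labels)
    return rng.sample(ordered, min(k, len(ordered)))


def verify(restored: bytes, recorded_checksum: str) -> bool:
    """True if the restored data matches the checksum taken at backup time."""
    return sha256_of(restored) == recorded_checksum


# Hypothetical checksums recorded when the backups were written.
recorded = {
    "T0001": sha256_of(b"home directories"),
    "T0002": sha256_of(b"project directories"),
    "T0003": sha256_of(b"system files"),
}

for label in pick_sample(recorded, 2):
    print("test restore of", label)
```

A failed `verify` call means the tape, the drive, or the restore procedure needs attention before a real disaster exposes the problem.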


What is a Disaster?
A catastrophic event that causes loss of data and/or service.

Human disasters
Accidental errors or intentional acts.
Typo, backhoe, or hacker tools.
Natural disasters
Small scale: Hardware or power failure.
Large scale: Hurricane, earthquake, fire.

Types of Disasters
User errors
Accidental file deletion / overwrite.
Very common. Snapshots can automate recovery.
Sysadmin errors
Accidental mass file destruction.
Regular backups limit the loss.
Drive failure
Single disk failure: RAID can prevent loss.
System failure
Loss of an entire system.
RAID won’t help. Need backups.

Types of Disasters
Power/Network Failure
Need UPS/generator or redundant network.
Software Failure
Software corrupts its own or another application's data store.
Need regular and perhaps historical backups.
Security Breach
An attacker or worm destroys or corrupts data.
Need long-term historical backups.
Natural Disaster
Potential loss of the entire data center, including backups.
Need off-site backups to restore data.
Need off-site (virtual) data center to restore service.
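The need for historical backups after a security breach can be illustrated with a small sketch: a breach may go unnoticed for weeks, so the most recent backup may already contain corrupted data. The `latest_clean_backup` helper and the dates below are hypothetical.

```python
from datetime import date


def latest_clean_backup(backup_dates, compromised_on):
    """Return the newest backup taken before the compromise, or None."""
    clean = [d for d in backup_dates if d < compromised_on]
    return max(clean) if clean else None


# Hypothetical backup history and estimated compromise date.
backups = [date(2024, 2, 1), date(2024, 3, 1), date(2024, 4, 1)]
compromised_on = date(2024, 3, 10)

print(latest_clean_backup(backups, compromised_on))
```

If retention covered only the last few weeks, every remaining backup could postdate the compromise and the function would return `None`.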

Risk Analysis
Evaluate the risk cost of a disaster:
Risk cost = Cost * Probability
Determines budget for disaster mitigation.
Ex: power failure
70% chance per year
Average downtime: 4 hours
Average web site revenue / hour: $1000
Budget = 4 hrs * (1000 $/hr) * 0.7/yr = $2800/yr
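The power failure example can be reproduced with a few lines of code. This is only a sketch of the Cost * Probability calculation; the function name `annual_risk_cost` is hypothetical, and the figures are the ones from the example.

```python
def annual_risk_cost(hours_down, revenue_per_hour, probability_per_year):
    """Expected yearly loss: cost of one incident times its yearly probability."""
    cost_per_incident = hours_down * revenue_per_hour
    return cost_per_incident * probability_per_year


# Power failure: 4 hours down, $1000/hour, 70% chance per year.
budget = annual_risk_cost(4, 1000, 0.7)
print(f"Mitigation budget: ${budget:.0f}/yr")
```

Spending more than this amount per year on mitigating power failures would cost more than the expected loss.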

Disaster Mitigation
Power Failures
UPS
Generator
System Failures
Redundancy: CPU, ECC RAM, NICs, power
Cluster of servers
Network Failures
Multiple Internet connections from different ISPs.


Disaster Mitigation
Drive Failures
RAID
Backups
Accidental Deletion
Snapshots
Backups
Security Incident
Backups


Redundant Site
Redundant site at a different location
– Location far enough away to be unaffected by whatever disaster took down the primary site.
– Automatic or manual switchover.
• DNS names with short expiration times (TTLs).
Cheaper solution: use existing second site
– Duplicate critical services at both data centers.
– Rebuild less critical servers at second site.

