Understanding Downtime

Topic Last Modified: 2005-05-20

Downtime can significantly impact the availability of your messaging system. It is important that
you familiarize yourself with the various causes of downtime and how they affect your
messaging system.

Planned and Unplanned Downtime


Unplanned downtime is downtime that occurs as a result of a failure (for example, a hardware
failure or a system failure caused by improper server configuration). Because administrators do
not know when unplanned downtime could occur, users are not notified of outages in advance. In
contrast, planned downtime is downtime that occurs when an administrator shuts down the
system at a scheduled time. Because planned downtime is scheduled, administrators can plan for
it to occur at a time that least affects productivity.

To reduce or eliminate planned downtime, you can implement server clustering. While you
perform maintenance on a primary node, the cluster keeps messaging available for your
organization by temporarily failing over Exchange services to a standby computer in the
Exchange cluster. For more information about clustering, see Planning for Exchange
Clustering.

The following table lists common causes of downtime and specific examples for each cause.

Causes of downtime and examples of each cause

Causes of downtime              Examples

Planned administrative          Upgrades for hardware components, firmware, drivers,
downtime                        operating system, and software applications.

Component failures              Faulty server components, such as memory chips, fans,
                                system boards, and power supplies.

                                Faulty storage subsystem components, such as failed
                                disk drives and disk controllers.

                                Faulty network components, such as routers and
                                network cabling.

Software defects or failures    Drive stops responding, operating system stops
                                responding or reboots, viruses, or file corruption.

Operator error or malicious     Accidental or intentional file deletion, unskilled
users                           operation, or experimentation.

System outages or maintenance   Software or systems requiring reboot, or system
                                board failure.

Local disaster                  Fires, storms, and other localized disasters.

Regional disaster               Earthquakes, hurricanes, floods, and other regional
                                disasters.

Failure Types
An integral part of implementing a highly available messaging system is ensuring that no
single point of failure can render a server or network unavailable. Before you deploy your
Exchange 2003 messaging system, familiarize yourself with the following failure types and
plan for each of them.

Note:
For detailed information about how to minimize the impact of the following failure types, see
Making Your Exchange 2003 Organization Fault Tolerant.

Storage Failures

Two common storage failures that can occur are hard disk failures and storage controller failures.
There are several methods you can use to protect against individual storage failures. One method
is to use redundant array of independent disks (RAID) to provide redundancy of the data on your
storage subsystem. Another method is to use storage vendors who provide advanced storage
solutions, such as Storage Area Network (SAN) solutions. These advanced storage solutions
should include features that allow you to replace damaged storage devices and individual
storage controller components without losing access to the data. For more information about
RAID and SAN technologies, see Planning a Reliable Back-End Storage Solution.

Network Failures

Common network failures include failed routers, switches, hubs, and cables. To help protect
against such failures, there are various fault tolerant components you can use in your network
infrastructure. Fault tolerant components also help provide highly available connectivity to
network resources. As you consider methods for protecting your network, be sure to consider all
network types (such as client access and management networks). For information about network
hardware, see "Server-Class Network Hardware" in Component-Level Fault Tolerant Measures.

Component Failures

Common server component failures include failed network interface cards (NICs), memory
(RAM), and processors. As a best practice, you should keep spare hardware available for each of
the key server components (for example, NICs, RAM, and processors). In addition, many
enterprise-level server platforms provide redundant hardware components, such as redundant
power supplies and fans. Hardware vendors build computers with redundant, hot-swappable
components, such as Peripheral Component Interconnect (PCI) cards and memory. These
components allow you to replace damaged hardware without removing the computer from
service.

For information about using redundant components and spare hardware components, see
Component-Level Fault Tolerant Measures.
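
The effect of redundant components on a single point of failure can be illustrated with a small calculation. The following is a minimal sketch, assuming that components fail independently of each other and that the availability figures are hypothetical values chosen only for illustration. The function names are not part of any Exchange or Windows tool.

```python
def series_availability(availabilities):
    """Availability of a chain in which every component must be working."""
    result = 1.0
    for value in availabilities:
        result *= value
    return result


def parallel_availability(availability, copies):
    """Availability of redundant copies where one working copy is enough.

    Assumes the copies fail independently (an assumption, not a guarantee).
    """
    return 1.0 - (1.0 - availability) ** copies


# Hypothetical availability of a single power supply.
single_power_supply = 0.99

# Two redundant power supplies: the server loses power only if both fail.
redundant_power_supplies = parallel_availability(single_power_supply, 2)

# A server needs power, memory, and a NIC at the same time (series chain).
# The 0.999 values are hypothetical as well.
without_redundancy = series_availability([single_power_supply, 0.999, 0.999])
with_redundancy = series_availability([redundant_power_supplies, 0.999, 0.999])

print(f"Without redundant power supplies: {without_redundancy:.5f}")
print(f"With redundant power supplies:    {with_redundancy:.5f}")
```

Under these assumptions, the weakest non-redundant component dominates the result, which is why removing single points of failure usually starts with the least reliable part.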

Computer Failures

You must promptly address application failures or any other problem that affects a computer's
performance. To minimize the impact of a computer failure, there are two solutions you can
include in your disaster recovery plan: a standby server solution and a server clustering solution.

In a standby server solution, you keep one or more preconfigured computers readily available. If
a primary server fails, a standby server replaces it. For information about using standby
servers, see "Spare Components and Standby Servers" in Component-Level Fault Tolerant
Measures.

With server clustering, your applications and services are available to your users even if one
cluster node fails. This is possible either by failing over the application or service (transferring
client requests from one node to another) or by having multiple instances of the same application
available for client requests.
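
The failover behavior described above can be sketched as a toy model. This is a conceptual illustration only. It does not use or imitate the Windows Cluster service or any Exchange API, and the `Node` and `Cluster` classes and the node names are hypothetical.

```python
class Node:
    """A cluster node that is either healthy or failed."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


class Cluster:
    """Routes requests to one active node and fails over when it is down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        if not self.nodes:
            raise ValueError("a cluster needs at least one node")
        self.active = self.nodes[0]

    def fail_over(self):
        """Make the first healthy node the active node."""
        for node in self.nodes:
            if node.healthy:
                self.active = node
                return
        raise RuntimeError("no healthy node available")

    def handle(self, request):
        """Serve a request, transferring it to another node if needed."""
        if not self.active.healthy:
            self.fail_over()
        return f"{self.active.name} handled {request}"


cluster = Cluster([Node("node-a"), Node("node-b")])
print(cluster.handle("request 1"))

# Simulate a failure of the active node. The next request is served
# by the standby node instead of being lost.
cluster.nodes[0].healthy = False
print(cluster.handle("request 2"))
```

The sketch models only the first approach in the paragraph above (failing over a single active instance). Running multiple instances at the same time would instead distribute requests across all healthy nodes.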

Note:
Server clustering can also help you maintain a high level of availability if one or more computers
must be temporarily removed from service for routine maintenance or upgrades.

For information about Network Load Balancing (NLB) and server clustering, see "Fault Tolerant
Infrastructure Measures" in System-Level Fault Tolerant Measures.

Site Failures

In extreme cases, an entire site can fail due to power loss, natural disaster, or other unusual
occurrences. To protect against such failures, many businesses are deploying mission-critical
solutions across geographically dispersed sites. These solutions often involve duplicating a
messaging system's hardware, applications, and data to one or more geographically remote sites.
If one site fails, the other sites continue to provide service (either through automatic failover or
through disaster recovery procedures performed at the remote site), until the failed site is
repaired. For more information, see "Using Multiple Physical Sites" in System-Level Fault
Tolerant Measures.

Costs of Downtime
Calculating some of the costs you experience as a result of downtime is relatively easy. For
example, you can easily calculate the replacement cost of damaged hardware. However, the
resulting costs from losses in areas such as productivity and revenue are more difficult to
calculate.

The following table lists the costs that are involved when calculating the impact of downtime.

Costs of downtime

Category                Cost involved

Productivity            Number of employees affected by loss of messaging
                        functionality and other IT assets

                        Number of administrators needed to manage a site
                        increases with frequency of downtime

Revenue                 Direct losses
                        Compensatory payments
                        Lost future revenues
                        Billing losses
                        Investment losses

Financial performance   Revenue recognition
                        Cash flow
                        Lost discounts (A/P)
                        Payment guarantees
                        Credit rating
                        Stock price

Damaged reputation      Customers
                        Suppliers
                        Financial markets
                        Banks
                        Business partners

Other expenses          Temporary employees
                        Equipment rental
                        Overtime costs
                        Extra shipping costs
                        Travel expenses
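
The categories that are easier to quantify (productivity, direct revenue, and other expenses) can be combined into a rough first estimate. The following is a minimal sketch under stated assumptions: the formula, the parameter names, and every figure in the example are hypothetical, and harder-to-measure categories such as damaged reputation and financial performance are deliberately left out.

```python
def downtime_cost(hours_down, employees_affected, hourly_cost_per_employee,
                  productivity_loss, revenue_per_hour, other_expenses=0.0):
    """Rough estimate of the cost of one outage.

    productivity_loss is the fraction (0.0 to 1.0) of each affected
    employee's work that is lost while messaging is unavailable.
    """
    if not 0.0 <= productivity_loss <= 1.0:
        raise ValueError("productivity_loss must be between 0.0 and 1.0")

    productivity = (hours_down * employees_affected
                    * hourly_cost_per_employee * productivity_loss)
    revenue = hours_down * revenue_per_hour
    return {
        "productivity": productivity,
        "revenue": revenue,
        "other_expenses": other_expenses,
        "total": productivity + revenue + other_expenses,
    }


# Hypothetical outage: 2 hours, 100 employees who each lose half of their
# productivity, 1,000 in direct revenue lost per hour, 500 in overtime costs.
estimate = downtime_cost(
    hours_down=2,
    employees_affected=100,
    hourly_cost_per_employee=50.0,
    productivity_loss=0.5,
    revenue_per_hour=1000.0,
    other_expenses=500.0,
)
for category, amount in estimate.items():
    print(f"{category}: {amount:.2f}")
```

An estimate like this is a lower bound, because it ignores the costs that are difficult to calculate.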

Impact of Downtime
Availability becomes increasingly important as businesses continue to increase their reliance on
information technology. As a result, the availability of mission-critical information systems is
often tied directly to business performance or revenue. Based on the role of your messaging
service (for example, how critical the service is to your organization), downtime can produce
negative consequences such as customer dissatisfaction, loss of productivity, or an inability to
meet regulatory requirements.

However, not all downtime is equally costly; the greatest expense is caused by unplanned
downtime. Outside of a messaging service's core service hours, the amount of downtime—and
corresponding overall availability level—may have little to no impact on your business. If a
system fails during core service hours, the result can have significant financial impact. Because
unplanned downtime is rarely predictable and can occur at any time, you should evaluate the cost
of unplanned downtime during core service hours.
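
To see why the measurement window matters, the same outage can be expressed as an availability percentage over the whole year and over core service hours only. The following is a minimal sketch. The core service window (10 hours a day, 5 days a week, 52 weeks a year) and the 4-hour outage are hypothetical values.

```python
HOURS_PER_YEAR = 24 * 365


def availability_percent(service_hours, downtime_hours):
    """Percentage of the measured service window during which the system was up."""
    if service_hours <= 0:
        raise ValueError("service_hours must be positive")
    if not 0 <= downtime_hours <= service_hours:
        raise ValueError("downtime_hours must be between 0 and service_hours")
    return 100.0 * (service_hours - downtime_hours) / service_hours


# Hypothetical core service window: 10 hours a day, 5 days a week, 52 weeks.
core_hours = 10 * 5 * 52

# Hypothetical unplanned outage that falls entirely within core service hours.
outage_hours = 4

overall = availability_percent(HOURS_PER_YEAR, outage_hours)
core = availability_percent(core_hours, outage_hours)

print(f"Measured over the whole year: {overall:.3f}%")
print(f"Measured over core hours:     {core:.3f}%")
```

The outage looks smaller when it is averaged over the whole year than when it is measured against the hours in which the service is actually needed, so the core-hours figure is the more useful one for evaluating cost.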

Because downtime affects businesses differently, it is important that you select the proper
response for your organization. The following table lists different impact levels (based on
severity), including the impact each level has on your organization.

Downtime impact levels and corresponding effect on business

Impact level    Description                         Business impact

Impact level 1  Minor impact on business results.   Low: Minimal availability requirement.

Impact level 2  Disrupts the normal business        Low: Prevention of business loss improves
                processes. Minimal loss of revenue, return on investment and profitability.
                low recovery cost.

Impact level 3  Substantial revenue is lost; some   Medium: Prevention of business loss improves
                is recoverable.                     return on investment and profitability.

Impact level 4  Significant impact on core business High: Prevention of lost revenue improves
                activities. Affects medium-term     business results. Business risk outweighs
                results.                            the cost of the solution.

Impact level 5  Strong impact on core business      High: Business risk outweighs the cost of
                activities. Affects medium-term     the solution.
                results. Company's survival may be
                at risk.

Impact level 6  Very strong impact on core business Extreme: Management of the business risk is
                activities. Immediate threat to the essential. Cost of the solution is secondary.
                company's survival.
