Downtime can significantly impact the availability of your messaging system. It is important that
you familiarize yourself with the various causes of downtime and how they affect your
messaging system.
To eliminate or minimize planned downtime, you can implement server clustering. Even while you
perform maintenance on a primary node, server clustering provides continuous messaging
availability for your organization by temporarily failing over Exchange services to a standby
computer in the Exchange cluster. For more information about clustering, see Planning for
Exchange Clustering.
The following table lists common causes of downtime and specific examples for each cause.
Component failures: Faulty storage subsystem components, such as failed disk drives and disk
controllers.
Failure Types
An integral aspect of implementing a highly available messaging system is to ensure that no
single point of failure can render a server or network unavailable. Before you deploy your
Exchange 2003 messaging system, you must familiarize yourself with the following failure types
that may occur and plan accordingly.
Note:
For detailed information about how to minimize the impact of the following failure types, see
Making Your Exchange 2003 Organization Fault Tolerant.
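To illustrate why a single point of failure matters, the following minimal sketch compares the
availability of a chain of components with the availability of the same chain after one
component is made redundant. It assumes that components fail independently, which real systems
only approximate, and the 0.99 availability figures are hypothetical values chosen for
illustration, not measurements from any Exchange deployment.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must be working."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of a redundant group that works if at least one member works."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures: a server, a switch, and a storage unit, each 99% available.
single_path = series_availability([0.99, 0.99, 0.99])

# Replace the single switch with two redundant switches (independence assumed).
redundant_switch = redundant_availability([0.99, 0.99])
improved_path = series_availability([0.99, redundant_switch, 0.99])

print(f"Single path:           {single_path:.4%}")
print(f"With redundant switch: {improved_path:.4%}")
```

In this model every non-redundant component lowers the availability of the whole path, so
removing one single point of failure raises the result but does not compensate for the others.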
Storage Failures
Two common storage failures that can occur are hard disk failures and storage controller failures.
There are several methods you can use to protect against individual storage failures. One method
is to use redundant array of independent disks (RAID) to provide redundancy of the data on your
storage subsystem. Another method is to use storage vendors who provide advanced storage
solutions, such as Storage Area Network (SAN) solutions. These advanced storage solutions
should include features that allow you to replace damaged storage devices and individual
storage controller components without losing access to the data. For more information about
RAID and SAN technologies, see Planning a Reliable Back-End Storage Solution.
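As a rough illustration of how RAID can provide data redundancy, the following sketch shows the
XOR parity idea used by parity-based RAID levels such as RAID 5. It is a simplified model and
not a description of any particular storage controller; the block contents and the helper name
xor_blocks are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical data blocks, each held on a different disk.
data_blocks = [b"MAIL", b"BOX1", b"LOGS"]

# The parity block is the XOR of all data blocks and is held on another disk.
parity = xor_blocks(data_blocks)

# Simulate losing the disk that holds the second block, then rebuild that
# block from the surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, any one missing block can be recomputed from the remaining
blocks, which is why an array of this kind can tolerate the failure of a single disk.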
Network Failures
Common network failures include failed routers, switches, hubs, and cables. To help protect
against such failures, you can use various fault-tolerant components in your network
infrastructure. Fault-tolerant components also help provide highly available connectivity to
network resources. As you consider methods for protecting your network, be sure to consider all
network types (such as client access and management networks). For information about network
hardware, see "Server-Class Network Hardware" in Component-Level Fault Tolerant Measures.
Component Failures
Common server component failures include failed network interface cards (NICs), memory
(RAM), and processors. As a best practice, you should keep spare hardware available for each of
the key server components (for example, NICs, RAM, and processors). In addition, many
enterprise-level server platforms provide redundant hardware components, such as redundant
power supplies and fans. Hardware vendors build computers with redundant, hot-swappable
components, such as Peripheral Component Interconnect (PCI) cards and memory. These
components allow you to replace damaged hardware without removing the computer from
service.
For information about using redundant components and spare hardware components, see
Component-Level Fault Tolerant Measures.
Computer Failures
You must promptly address application failures or any other problem that affects a computer's
performance. To minimize the impact of a computer failure, there are two solutions you can
include in your disaster recovery plan: a standby server solution and a server clustering solution.
In a standby server solution, you keep one or more preconfigured computers readily available. If
a primary server fails, a standby server replaces it. For information about using standby
servers, see "Spare Components and Standby Servers" in Component-Level Fault Tolerant
Measures.
With server clustering, your applications and services are available to your users even if one
cluster node fails. This is possible either by failing over the application or service (transferring
client requests from one node to another) or by having multiple instances of the same application
available for client requests.
Note:
Server clustering can also help you maintain a high level of availability if one or more computers
must be temporarily removed from service for routine maintenance or upgrades.
For information about Network Load Balancing (NLB) and server clustering, see "Fault Tolerant
Infrastructure Measures" in System-Level Fault Tolerant Measures.
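The failover behavior described in this section can be pictured with a toy model. The following
sketch is not how Windows server clustering or NLB is implemented; it only shows the idea of
transferring client requests from a failed node to another node. The ClusterNode class, the
route_request function, and the node names are all hypothetical.

```python
class ClusterNode:
    """A toy cluster node that either handles a request or reports a failure."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{self.name} handled {request}"


def route_request(request, nodes):
    """Try each node in order and return the first successful response."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # fail over to the next node in the list
    raise RuntimeError("no cluster node is available")


primary = ClusterNode("node-a")
standby = ClusterNode("node-b")
cluster = [primary, standby]

before = route_request("send mail", cluster)  # served by the primary node
primary.healthy = False                       # simulate a failure or maintenance
after = route_request("send mail", cluster)   # served by the standby node

print(before)
print(after)
```

The same routing applies whether the primary node has failed or has been deliberately taken out
of service, which is why clustering helps with both unplanned and planned downtime.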
Site Failures
In extreme cases, an entire site can fail due to power loss, natural disaster, or other unusual
occurrences. To protect against such failures, many businesses are deploying mission-critical
solutions across geographically dispersed sites. These solutions often involve duplicating a
messaging system's hardware, applications, and data to one or more geographically remote sites.
If one site fails, the other sites continue to provide service (either through automatic failover or
through disaster recovery procedures performed at the remote site), until the failed site is
repaired. For more information, see "Using Multiple Physical Sites" in System-Level Fault
Tolerant Measures.
Costs of Downtime
Calculating some of the costs you experience as a result of downtime is relatively easy. For
example, you can easily calculate the replacement cost of damaged hardware. However, the
resulting costs from losses in areas such as productivity and revenue are more difficult to
calculate.
The following table lists the costs that are involved when calculating the impact of downtime.
Costs of downtime
Compensatory payments
Billing losses
Investment losses
Revenue recognition
Cash flow
Credit rating
Stock price
Customers
Suppliers
Banks
Business partners
Other expenses:
Temporary employees
Equipment rental
Overtime costs
Travel expenses
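As a starting point for the harder-to-calculate costs, the following sketch adds up a rough
estimate for a single outage. The formula, the parameter names, and every figure in the example
are assumptions made for illustration; an actual estimate depends on how your organization
measures productivity and revenue, and it would include further cost areas from the table.

```python
def downtime_cost(hours, employees_affected, hourly_labor_cost,
                  productivity_loss, hourly_revenue, other_expenses=0.0):
    """Return a rough cost breakdown for one outage.

    productivity_loss is the assumed fraction (0.0 to 1.0) of each affected
    employee's output that is lost while the messaging system is down.
    """
    productivity = hours * employees_affected * hourly_labor_cost * productivity_loss
    revenue = hours * hourly_revenue
    total = productivity + revenue + other_expenses
    return {
        "productivity": productivity,
        "revenue": revenue,
        "other_expenses": other_expenses,
        "total": total,
    }


# Hypothetical outage: 4 hours, 200 employees, half of their output lost.
estimate = downtime_cost(
    hours=4,
    employees_affected=200,
    hourly_labor_cost=40.0,
    productivity_loss=0.5,
    hourly_revenue=1000.0,
    other_expenses=2500.0,  # for example, overtime and equipment rental
)

for category, amount in estimate.items():
    print(f"{category}: {amount:,.2f}")
```

Keeping the breakdown by category makes it easier to see which assumption dominates the total
and to revise that assumption as better data becomes available.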
Impact of Downtime
Availability becomes increasingly important as businesses continue to increase their reliance on
information technology. As a result, the availability of mission-critical information systems is
often tied directly to business performance or revenue. Based on the role of your messaging
service (for example, how critical the service is to your organization), downtime can produce
negative consequences such as customer dissatisfaction, loss of productivity, or an inability to
meet regulatory requirements.
However, not all downtime is equally costly; the greatest expense is caused by unplanned
downtime. Outside of a messaging service's core service hours, the amount of downtime (and the
corresponding overall availability level) may have little to no impact on your business. By
contrast, if a system fails during core service hours, the failure can have a significant
financial impact. Because unplanned downtime is rarely predictable and can occur at any time,
you should evaluate the cost of unplanned downtime during core service hours.
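To show why the timing of an outage matters, the following sketch computes the availability
level for the same outage against two different baselines: all hours in a month, and core
service hours only. The 30-day month, the core schedule of 10 hours on each of 22 business days,
and the 2-hour outage are hypothetical values.

```python
def availability(service_hours, downtime_hours):
    """Return the fraction of service_hours during which the system was up."""
    if service_hours <= 0:
        raise ValueError("service_hours must be positive")
    if not 0 <= downtime_hours <= service_hours:
        raise ValueError("downtime_hours must be between 0 and service_hours")
    return 1 - downtime_hours / service_hours


outage_hours = 2  # hypothetical outage that falls entirely within core hours

# Measured against every hour of a 30-day month.
overall = availability(service_hours=30 * 24, downtime_hours=outage_hours)

# Measured against core service hours only (assumed: 10 hours x 22 business days).
core = availability(service_hours=10 * 22, downtime_hours=outage_hours)

print(f"Overall availability:    {overall:.2%}")
print(f"Core-hours availability: {core:.2%}")
```

The same outage produces a noticeably lower figure when it is measured against core service
hours, which is the figure that reflects the impact on the business.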
Because downtime affects businesses differently, it is important that you select the proper
response for your organization. The following table lists different impact levels (based on
severity), including the impact each level has on your organization.
Impact level | Description | Business impact
Impact level 1 | Minor impact on business results. | Low: Minimal availability requirement.
Impact level 2 | Disrupts the normal business processes. Minimal loss of revenue, low recovery cost. | Low: Prevention of business loss improves return on investment and profitability.
Impact level 3 | Substantial revenue is lost; some is recoverable. | Medium: Prevention of business loss improves return on investment and profitability.
Impact level 4 | Significant impact on core business activities. Affects medium-term results. | High: Prevention of lost revenue improves business results. Business risk outweighs the cost of the solution.
Strong impact on core
business activities.