Fundamentals of Availability Transcript

Slide 1
Welcome to the course: Fundamentals of Availability.
Slide 2: Welcome
For best viewing results, we recommend that you maximize your browser window now. The screen controls
allow you to navigate through the eLearning experience. Using your browser controls may disrupt the
normal play of the course. Click the attachments link to download supplemental information for this course.
Click the Notes tab to read a transcript of the narration.
Slide 3: Learning Objective

At the end of this course, you will be able to:
• Understand the key terms associated with availability
• Understand the difference between availability and reliability
• Recognize threats to availability
• Calculate cost of downtime
Slide 4: Introduction
In our rapidly changing business world, highly available systems and processes are of critical importance
and are the foundation upon which successful businesses rely. So much so, that according to the National
Archives and Records Administration in Washington, D.C., 93% of businesses that have lost availability in
their data center for 10 days or more have filed for bankruptcy within one year. The cost of one episode of
downtime can cripple an organization. Take for example an e-business. In a case of downtime, not only
would they potentially lose thousands or even millions of dollars in lost revenue, but their top competitor is
only a mouse-click away. Therefore loss is translated not only to lost revenue but also to a loss in customer
loyalty. The challenge of maintaining a highly available network is no longer the responsibility of IT
departments alone; it extends to management and department heads, as well as the boards that
govern company policy. For this reason, having a sound understanding of the factors that lead to high
availability, threats to availability, and ways to measure availability is imperative regardless of your business
sector.
© 2013 Schneider Electric. All rights reserved. All trademarks provided are the property of their respective owners.
Slide 5: Measuring Business Value

Measuring Business Value begins with an understanding of the Physical Infrastructure.
Physical Infrastructure is the foundation upon which Information Technology (IT) and telecommunication
Networks reside.
Physical Infrastructure consists of the Racks, Power, Cooling, Fire Prevention/Security, Management, and
Services.

Business value for an organization, in general terms, is based on three core objectives:
1. Increasing revenue
2. Reducing costs
3. Better utilizing assets
Regardless of the line of business, these three objectives ultimately lead to improved earnings and cash
flow. Investments in Physical Infrastructure are made because they both directly and indirectly impact these
three business objectives. Managers purchase items such as generators, air conditioners, physical security
systems, and Uninterruptible Power Supplies to serve as “insurance policies.” For any network or data
center, there are risks of downtime from power, security and thermal problems, and investing in Physical
Infrastructure mitigates these and other risks. So how does this impact the three core business objectives
above (revenue, cost, and assets)? When systems are down, revenue streams are slowed or stopped, business
costs and expenses are incurred, and assets are underutilized or underproductive. Therefore, the more
effective the strategy is at reducing downtime from any cause, the more value it has to the business in
meeting all three objectives.

Historically, assessment of Physical Infrastructure business value was based on two core criteria: availability
and upfront costs. Increasing the availability (uptime) of the Physical Infrastructure system and ultimately of
the business processes allows a business to continue to bring in revenues and better optimize the use (or
productivity) of assets. Imagine a credit card processing company whose systems are unavailable – credit
card purchases cannot be processed, halting the revenue stream for the duration of the downtime. In
addition, employees are not able to be productive without their systems online. And minimizing the upfront
cost of the Physical Infrastructure results in a greater return on that investment. If the Physical Infrastructure
cost is low and the risk / cost of downtime is high, the business case becomes easier to justify.
While these arguments still hold true, today’s rapidly changing IT environments are dictating two additional
criteria for assessing Physical Infrastructure business value. One is Agility. Business plans must be agile to
deal with changing market conditions, opportunities, and environmental factors. Investments that lock
resources limit the ability to respond in a flexible manner. And when this flexibility or agility is not present,
lost opportunity is the predictable result.
The other is Sustainability. It is imperative that data center owners have a solid action plan to achieve
sustainability goals and commitments.
1. Develop a plan that includes a bold and actionable strategy with clear objectives and prioritized actions.
2. Implement efficient designs that invest in technologies that improve energy efficiency and lower the carbon
footprint, such as SF6-free switchgear and liquid cooling, which could reduce overall IT and infrastructure energy
consumption by 15 percent.
3. Drive operational efficiency with connected systems that collect data to provide visibility, track energy
usage, and benchmark performance.
4. Buy renewable energy, which can be accomplished in three main ways: credits, on-site build, and off-site
build.
5. Decarbonize your supply chain by choosing vendors that embrace the circular economy, with circularity designed
into their products.
Slide 8: Five 9’s of Availability
A term that is commonly used when discussing availability is ‘5 Nines’ (5 9’s). 5 9’s refers to a network
that is accessible 99.999% of the time. Although widely used, the term is often misunderstood and can be
misleading. We’ll explain why a little later on in the course.
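To put the percentage in perspective, the short sketch below converts an availability figure into the downtime it permits per year. The function name and the 365-day year are illustrative assumptions, not part of the course material. Under those assumptions, 99.999% works out to roughly 5.26 minutes of downtime per year.

```python
# Assumption: a 365-day year (leap years ignored) is used as the reference period.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Compare two through five nines.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% available -> {minutes:.2f} minutes of downtime per year")
```

Note that the result is a total only; it says nothing about how many separate outages make up that total.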
Slide 9: Key Terms

There are many additional terms associated with availability, business continuity and disaster recovery.
Before we go any further, let’s define some of these terms.
Reliability is the ability of a system or component to perform its required functions under stated conditions
for a specified period of time.
Availability, on the other hand, is the degree to which a system or component is operational and accessible
when required for use. It can be viewed as the likelihood that the system or component is in a state to
perform its required function under given conditions at a given instant in time. Availability is determined by a
system’s reliability, as well as its recovery time when a failure does occur. When systems have long
continuous operating times, failures are inevitable. Availability is often the measure of interest because, once a
failure does occur, the critical variable becomes how quickly the system can be recovered. In the data center,
having a reliable system design is the most critical variable, but when a failure occurs, the most important
consideration must be getting the IT equipment and business processes up and running as fast as possible
to keep downtime to a minimum.
Slide 10: Key Terms

Upon considering any availability or reliability value, one should always ask for a definition of failure. Moving
forward without a clear definition of failure is like advertising the fuel efficiency of an automobile as “miles
per tank” without defining the capacity of the tank in liters or gallons. To address this ambiguity, one should
start with one of the two basic definitions of a failure given by the IEC (International Electrotechnical
Commission):
1. The termination of the ability of the product as a whole to perform its required function.
2. The termination of the ability of any individual component to perform its required function but not
the termination of the ability of the product as a whole to perform.
Slide 11: Key Terms

MTBF, or Mean Time Between Failures, is a basic measure of a system’s reliability. It is typically represented in
units of hours. The higher the MTBF number is, the higher the reliability of the product.
MTTR, or Mean Time to Recover (or Repair), is the expected time to recover a system from a failure. This may
include the time it takes to diagnose the problem, the time it takes to get a repair technician onsite, and the
time it takes to physically repair the system. Like MTBF, MTTR is represented in units of hours. MTTR
impacts availability and not reliability. The longer the MTTR, the worse off a system is. Simply put, if it takes
longer to recover a system from a failure, the system is going to have a lower availability. As the MTBF goes
up, availability goes up. As the MTTR goes up, availability goes down.
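The relationship between MTBF, MTTR, and availability can be sketched with the commonly used steady-state formula, availability = MTBF / (MTBF + MTTR). The course does not state this formula explicitly, so treat it, along with the function name and the sample figures, as an illustrative assumption.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as the fraction of time the system is up.

    Assumes the common model: availability = MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the direction of each effect.
    print(f"MTBF 10,000 h, MTTR 4 h: {availability(10_000, 4):.5%}")
    print(f"MTBF 20,000 h, MTTR 4 h: {availability(20_000, 4):.5%}")  # higher MTBF, higher availability
    print(f"MTBF 10,000 h, MTTR 8 h: {availability(10_000, 8):.5%}")  # higher MTTR, lower availability
```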
Slide 12: The Limitations of 99.999%

As mentioned earlier, 5 9’s is a misleading term because its use has become diluted. 5 9’s has
been used to refer to the amount of time that the data center systems are available. In other words, a data
center that has achieved 5 9’s is functioning 99.999% of the time. But the frequency of failure is only one part
of the equation. The other part of the availability equation is how long it takes to recover from failure.
Let’s take for example two data centers that are both considered 99.999% available. In one year, Data
Center A lost power once, and the outage lasted a full 5 minutes. Data Center B lost power 10 times, but for only
30 seconds each time. Both data centers were without power for a total of 5 minutes. The missing
detail is the recovery time. Any time systems fail, there is a recovery time to get back to an operational state,
which includes the time for servers to be rebooted, data to be recovered, and corrupted systems to be
repaired. This recovery process could take minutes, hours, days, or even weeks. Now, if you
consider again the two data centers, you will see that Data Center B, with its 10 outages, will actually
have a much longer total duration of downtime than Data Center A, which had only one outage,
because Data Center B must recover from failure 10 times. It is
because of this dynamic that reliability is equally important to this discussion of availability. The reliability of a
data center speaks to the frequency of downtime in a given time frame. There is an inverse
relationship: as the time frame increases, reliability decreases. Availability, however, is only the percentage of
time the systems are operational in a given duration.
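The two-data-center comparison can be made concrete with a small sketch. The 30-minute recovery time per outage is a purely hypothetical figure, and the exponential reliability model (a constant failure rate) is a common textbook assumption rather than something stated in the course.

```python
import math


def total_downtime_minutes(outages: int, outage_minutes: float,
                           recovery_minutes: float) -> float:
    """Total downtime when every outage is followed by its own recovery period."""
    return outages * (outage_minutes + recovery_minutes)


def reliability(time_frame_hours: float, mtbf_hours: float) -> float:
    """Probability of running failure-free for the whole time frame.

    Assumes a constant failure rate: R(t) = exp(-t / MTBF).
    """
    return math.exp(-time_frame_hours / mtbf_hours)


if __name__ == "__main__":
    ASSUMED_RECOVERY_MINUTES = 30.0  # hypothetical recovery time per outage

    # Data Center A: one 5-minute outage. Data Center B: ten 30-second outages.
    a = total_downtime_minutes(1, 5.0, ASSUMED_RECOVERY_MINUTES)
    b = total_downtime_minutes(10, 0.5, ASSUMED_RECOVERY_MINUTES)
    print(f"Data Center A: {a:.0f} minutes of effective downtime")
    print(f"Data Center B: {b:.0f} minutes of effective downtime")

    # Reliability falls as the time frame grows (hypothetical MTBF of 50,000 hours).
    for hours in (1_000, 10_000, 50_000):
        print(f"R({hours} h) = {reliability(hours, 50_000):.3f}")
```

With the same 5 minutes of lost power, the data center with more frequent failures accumulates far more effective downtime once recovery is counted.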
Slide 13: Factors that Affect Availability and Reliability

Numerous factors affect data center availability and reliability. Some
of these include AC power conditions, lack of adequate cooling in the data center, equipment failure, natural
and artificial disasters, and human errors.
Slide 14: AC Power Conditions
Let’s look first at the AC power conditions. Power quality anomalies are organized into seven categories
based on wave shape:
1. Transients
2. Interruptions
3. Sag / Undervoltage
4. Swell / Overvoltage
5. Waveform distortion
6. Voltage fluctuations
7. Frequency variations
Slide 15: Inadequate Cooling

Another factor that poses a significant threat to availability is a lack of cooling in the IT environment. IT
equipment such as servers and storage generates heat. In the data center environment, where a large quantity
of heat is being generated, the potential exists for significant downtime unless this heat is removed from the
space.

Cooling systems are needed in the data center to remove this heat. However, if the cooling is not distributed
properly, hot spots can occur.

Hot spots within the data center further threaten availability. In addition, inadequate cooling significantly
detracts from the lifespan and availability of IT equipment. It is recommended that when designing the data
center layout, a hot aisle/cold aisle configuration is used. Hot spots can also be alleviated by the use of
properly sized cooling systems, and supplemental spot coolers and air distribution units.
Slide 18: Equipment Failures

The health of IT equipment is an important factor in ensuring a highly available system, as equipment
failures pose a significant threat to availability. Failures can occur for a variety of reasons, including
damage caused by prolonged improper utility power. Other causes include prolonged exposure to
elevated or decreased temperatures or humidity, component failure, and equipment age.
Slide 19: Natural and Artificial Disasters
Disasters also pose a significant threat to availability. Hurricanes, tornadoes, floods, and the often
subsequent blackouts that occur after these disasters all create tremendous opportunity for downtime. In
many of these cases, downtime is prolonged due to damage sustained by the power grid or the physical site
of the data center itself.
Slide 20: Human Error

According to Gartner Group, the largest single cause of downtime is human error or personnel issues. One
of the most common causes of intermittent downtime in the data center is poor training. Data center staff or
contractors should be trained on procedures for application failures/hangs, system update/upgrades, and
other tasks that can create problems if not done correctly.

Another problem is poor documentation. As staff sizes have shrunk, and with all the changes in the data
center due to rapid product cycles, it’s harder and harder to keep the documentation current. Patches can
go awry as incorrect software versions are updated. Hardware fixes can fail if the wrong parts are used.

Another area of potential downtime is management of systems. System Management has fragmented from
a single point of control to vendors, partners, ASPs, outsource suppliers, and even a number of internal
groups. With a variety of vendors, contractors and technicians freely accessing the IT equipment, errors are
inevitable. Technologies like AI and data analytics are enabling a reduction in human error, as maintenance
programs shift from calendar-based to condition-based.
Slide 23: Cost of Downtime

It is important to understand the cost of downtime to a business, and specifically, how that cost changes as
a function of outage duration. Lost revenue is often the most visible and easily identified cost of downtime,
but it is only the tip of the iceberg when discussing the real costs to the organization. In many cases, the
cost of downtime per hour remains constant. In other words, a business that loses at a rate of 100 dollars
per hour in the first minute of downtime will also lose at the same rate of 100 dollars per hour after an hour
of downtime. An example of a company that might experience this type of profile is a retail store, where a
constant revenue stream is present. When the systems are down, there is a relatively constant rate of loss.
Some businesses, however, may lose the most money within the first 500 milliseconds of downtime and then
lose very little thereafter. For example, a semiconductor fabrication plant loses the most money in the first
moments of an outage because when the process is interrupted, the silicon wafers that were in production
can no longer be used and must be scrapped.

Still others may lose at a lower rate for a short outage (since revenue is not lost but simply delayed),
but as the duration lengthens, there is an increased likelihood that the revenue will not be recovered.
Regarding customer satisfaction, a short outage may often be acceptable, but as the duration increases,
more customers will become increasingly upset. An example of this might be a car dealership, where
customers are willing to delay a transaction for a day. With significant outages, however, public knowledge
often results in damaged brand perception and inquiries into company operations. All of these effects
result in a downtime cost that begins to accelerate quickly as the duration becomes longer.
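The three cost profiles described above can be sketched as simple functions of outage duration. The function names, the linear growth model for the accelerating case, and all dollar figures other than the 100 dollars per hour used earlier are illustrative assumptions.

```python
def constant_rate_cost(downtime_hours: float, rate_per_hour: float) -> float:
    """Loss accrues at the same hourly rate for the whole outage (retail store profile)."""
    return rate_per_hour * downtime_hours


def front_loaded_cost(downtime_hours: float, initial_loss: float,
                      rate_per_hour: float = 0.0) -> float:
    """Most of the loss lands at the start of the outage (fabrication plant profile)."""
    if downtime_hours <= 0:
        return 0.0
    return initial_loss + rate_per_hour * downtime_hours


def accelerating_cost(downtime_hours: float, base_rate_per_hour: float,
                      rate_growth_per_hour: float) -> float:
    """The hourly loss rate itself grows with duration (car dealership profile).

    Assumes the rate rises linearly, rate(t) = base + growth * t, so the
    cumulative cost is base * t + growth * t**2 / 2.
    """
    return (base_rate_per_hour * downtime_hours
            + 0.5 * rate_growth_per_hour * downtime_hours ** 2)


if __name__ == "__main__":
    for hours in (0.5, 1.0, 4.0, 24.0):
        print(f"{hours:>5} h | constant: {constant_rate_cost(hours, 100.0):>10.2f}"
              f" | front-loaded: {front_loaded_cost(hours, 50_000.0, 10.0):>10.2f}"
              f" | accelerating: {accelerating_cost(hours, 20.0, 15.0):>10.2f}")
```

Comparing the columns shows why the same outage duration can be trivial for one business and severe for another.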

Costs associated with downtime can be classified as direct and indirect. Direct costs are easily identified
and measured in terms of hard dollars. Examples include:
1. Wages and costs of employees who are idled due to the unavailability of the network. Although
some employees will be idle, their salaries and wages continue to be paid. Other employees may
still do some work, but their output will likely be diminished.
2. Lost revenues are the most obvious cost of downtime because if you cannot process customers,
you cannot conduct business. Electronic commerce magnifies the problem, as eCommerce sales
are entirely dependent on system availability.
3. Wages and cost increases due to induced overtime or time spent checking and fixing systems. The
employees who were idled by the system failure are probably the same employees who will
go back to work and recover the system via data entry. They not only have to do their ‘day job’ of
processing current data, but they must also re-enter any data that was lost due to the system crash,
or enter new data that was handwritten during the system outage. This means additional hours of
work, most often on an overtime basis.
4. Legal costs. Depending on the nature of the affected systems, the legal costs associated with downtime can be
significant. For example, if downtime problems result in a significant drop in share price,
shareholders may initiate a class-action suit if they believe that management and the board were
negligent in protecting vital assets. In another example, if two companies form a business
partnership in which one company’s ability to conduct business is dependent on the availability of
the other company’s systems, then, depending on the legal structure of the partnership, the
company whose systems failed may be liable to its partner for profits lost during any significant downtime event.
Indirect costs are not easily measured, but impact the business just the same. In 2000, Gartner Group
estimated that 80% of all companies calculating downtime were including indirect costs in their calculations
for the first time.
Examples include: reduced customer satisfaction; lost opportunity of customers that may have gone to
direct competitors during the downtime event; damaged brand perception; and negative public relations.
Slide 27: Cost of Downtime by Industry Sector

A business’s downtime costs are directly related to its industry sector.
For example, Energy and Telecommunications organizations may experience lost revenues on the order of
2 to 3 million dollars an hour. Manufacturing, Financial Institutions, Information Technology, Insurance,
Retail and Pharmaceuticals all stand to lose over 1 million dollars an hour.
Slide 28: Calculating Cost of Downtime

There are many ways to calculate cost of downtime for an organization. For example, one way to estimate
the revenue lost due to a downtime event is to look at normal hourly sales and then multiply that figure by
the number of hours of downtime.
Remember, however, that this is only one component of a larger equation and, by itself, seriously
underestimates the true loss. Another example is loss of productivity.
The most common way to calculate the cost of lost productivity is to first take an average of the hourly
salary, benefits and overhead costs for the affected group. Then, multiply that figure by the number of hours
of downtime.
Because companies are in business to earn profits, the value employees contribute is usually greater than
the cost of employing them.
Therefore, this method provides only a very conservative estimate of the labor cost of downtime.
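The two calculations described above can be sketched as follows. The function names and sample figures are hypothetical, and the sketch assumes the average hourly cost is a per-employee figure that is multiplied by the number of affected employees. As the narration notes, both results are conservative estimates and cover only part of the true loss.

```python
def lost_revenue(normal_hourly_sales: float, downtime_hours: float) -> float:
    """Estimate revenue lost: normal hourly sales times hours of downtime."""
    return normal_hourly_sales * downtime_hours


def lost_productivity(avg_hourly_cost_per_employee: float,
                      affected_employees: int,
                      downtime_hours: float) -> float:
    """Estimate labor cost of downtime for the affected group.

    avg_hourly_cost_per_employee is the average of hourly salary, benefits,
    and overhead (assumed here to be expressed per employee).
    """
    return avg_hourly_cost_per_employee * affected_employees * downtime_hours


if __name__ == "__main__":
    # Hypothetical example: a 3-hour outage.
    revenue = lost_revenue(normal_hourly_sales=10_000.0, downtime_hours=3.0)
    labor = lost_productivity(avg_hourly_cost_per_employee=60.0,
                              affected_employees=25, downtime_hours=3.0)
    print(f"Estimated lost revenue:      ${revenue:,.2f}")
    print(f"Estimated lost productivity: ${labor:,.2f}")
    print(f"Combined (conservative):     ${revenue + labor:,.2f}")
```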
Slide 29: Summary
• To stay competitive in today’s global marketplace, businesses must strive to achieve high levels of
availability and reliability. 99.999% availability is a commonly stated target for most businesses.
• Power outages, inadequate cooling, natural and artificial disasters, and human errors pose a
significant barrier to high availability.
• The direct and indirect costs of downtime in many business sectors can be exorbitant, and are often
enough to bankrupt an organization.
• Therefore, it is critical for businesses today to calculate their level of availability in order to reduce
risks and increase overall reliability and availability.
Slide 30: Thank You!

Thank you for participating in this course.
