Application Operations

Module 3: Service Design

Capsule 6: Availability Management – Part 2

Audio Script

Slide 1: Background music.

Slide 2: Welcome to this video course on Application Operations.

Slide 3: You are now watching module 3 on service design. Capsule 6 discusses
availability management basics in detail.

Slide 4: Andrew and Grace discussed the overview of Availability Management in the
previous capsule. They will continue their discussion in this capsule. Come, let’s join them.

Hello Grace.

Hi, Andrew.

Slide 5: At the end of this video, you will be able to:

• Describe the need for Availability Management.
• Define Availability and Reporting Period.
• Explain the process of calculating Availability, Mean Time To Repair or MTTR,
and Mean Time Between Failures or MTBF.

Slide 6: Welcome back, friends!

Grace, we shall continue with our discussion on Availability Management.

Sure, Andrew.

Slide 7: You need to appoint an Availability Manager if you are implementing Availability
Management in your organization.

The Availability Manager is the single point of responsibility for IT systems availability and a
champion of Availability planning within the IT organization.

Slide 8: You need to define the extent of the Availability Management role by listing your
key IT systems and Key Services. Don’t be tempted to approach this from a technologist’s
perspective by asking ‘what systems do we have?’. Instead, consider the key business
functions performed by your business and, from these, identify the Key Services involved in
delivering them. For each business function, you need to define an impact assessment
describing the impact on the business of a loss of Key Services.

Slide 9: Some examples of the key online internet services provided by a bank include:

• Checking an account statement.
• Transferring funds.
• Opening a fixed deposit.

Some examples of ancillary services include:

• Paying utility bills.
• Insurance.
• Mutual funds.

Slide 10: There are many ways to define your Availability Requirements. The best way is to
use a Service Level Agreement to define the required hours of operation for a Key Service.

After defining the SLA, you need to define Availability and the reporting period, such as
weekly or monthly. Furthermore, you can choose to define:

• Maximum hours of downtime, which will be expressed simply in hours and minutes.
• Downtime as a percentage of availability.
• Maximum number of non-availability events.
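The three optional limits listed above can be checked against the outages recorded in one reporting period. The following is only a minimal sketch: the names AvailabilityTarget and breaches are hypothetical, the sample figures are invented, and "downtime as a percentage" is interpreted here as downtime divided by the agreed service time for the period, which is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class AvailabilityTarget:
    """Hypothetical container for the three optional limits (not from the script)."""
    max_downtime: timedelta       # maximum downtime allowed in the reporting period
    max_downtime_percent: float   # assumed: downtime as a percentage of agreed service time
    max_events: int               # maximum number of non-availability events


def breaches(target, agreed_service_time, outages):
    """Return the names of the limits exceeded by a list of outage durations."""
    total = sum(outages, timedelta())
    percent = 100 * (total / agreed_service_time)
    failed = []
    if total > target.max_downtime:
        failed.append("max_downtime")
    if percent > target.max_downtime_percent:
        failed.append("max_downtime_percent")
    if len(outages) > target.max_events:
        failed.append("max_events")
    return failed


# Invented weekly example: 55 agreed service hours and two outages.
weekly_target = AvailabilityTarget(timedelta(hours=2), 2.0, 3)
weekly_outages = [timedelta(minutes=45), timedelta(minutes=90)]
print(breaches(weekly_target, timedelta(hours=55), weekly_outages))
```

In this invented week the total downtime is 2 hours 15 minutes, so the hours limit and the percentage limit are both exceeded while the event count is not.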

Slide 11: In an ideal scenario, the IT organization provides a service and this service has
users. When the users cannot access the service, it is termed Unavailable.

However, there are other factors to consider, particularly those connected with Quality of
Service. For example, you have an Internet Banking system that normally takes 1-2 seconds
to display the account statement, and this is perfectly acceptable. If the time taken
to display the account statement shoots up to 60 seconds one day, can this be termed
Unavailability? What if the transaction time is 10 seconds instead of 60? Is this also
Unavailability?

Then what is Unavailability?

If the performance of a service, or the quality of an IT service, is degraded enough to cause
significant business impact, you should consider the event as Unavailability.
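One way to make this rule measurable is to agree a response-time threshold with the business and treat anything slower as Unavailability. The sketch below assumes a hypothetical threshold of 30 seconds. The script does not give a figure, so the real value would have to come from the business impact assessment.

```python
def is_unavailable(response_seconds, impact_threshold_seconds=30.0):
    """Classify one measurement as Unavailability.

    response_seconds is None when the service could not be reached at all.
    impact_threshold_seconds is a hypothetical value, not taken from the script.
    """
    if response_seconds is None:
        return True
    return response_seconds >= impact_threshold_seconds


# Response times from the Internet Banking example, plus a complete outage.
for seconds in (2, 10, 60, None):
    print(seconds, is_unavailable(seconds))
```

With this assumed threshold the 60-second response counts as Unavailability and the 10-second response does not. A stricter business could reach the opposite conclusion for the 10-second case simply by agreeing a lower threshold.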

Slide 12: Creating contingency and recovery plans gives you a series of steps that will
minimize the business impact. You must make sure that these are clearly documented as
well as rehearsed and tested.

You must also make sure that the recovery and restart procedures are written down, and that the
Incident Management staff know where to find them. There are numerous situations where
servers or switches have ‘crashed’ but only a couple of people, who are not around
when you need them, know the restart or recovery procedure.

Slide 13: IT Services are delivered by the IT infrastructure.

Your organization's IT infrastructure is defined and documented within the Configuration
Management Database or CMDB, which gives a detailed representation of each of the
components that combine to deliver these Key Services.

Slide 14: Measuring Availability, and comparing actual Availability against business
requirements, is an absolutely fundamental activity for each Key Service. Displayed on
screen is the formula to calculate basic availability as a percentage.

The formula comprises AST, which is the Agreed Service Time possible over the period for
which the calculation is being made, and DT, the actual downtime recorded over the
period for which the calculation is being made. Agreed Service Time is the period when the
service is supposed to be available.

For example, a service that is offered from eight AM to seven PM, from Monday to Friday,
52 weeks per year, will have an AST of eleven multiplied by five multiplied by 52, which
equals 2860 hours per year.

Let’s take another example. Suppose the application suffered a downtime of 200 hours.
Then the availability would be as follows.
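The on-screen formula and result are not part of this script. The closing summary states the formula as AST minus DT, divided by AST, multiplied by one hundred. The sketch below applies it, assuming the 200 hours of downtime are measured against the 2860-hour AST from the previous example. The function name availability_percent is illustrative.

```python
def availability_percent(agreed_service_hours, downtime_hours):
    """Basic availability as a percentage: (AST - DT) / AST * 100."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime_hours <= agreed_service_hours:
        raise ValueError("downtime must lie between 0 and the agreed service time")
    return (agreed_service_hours - downtime_hours) / agreed_service_hours * 100


# AST for a service offered 8 AM to 7 PM, Monday to Friday, 52 weeks per year.
ast_hours = 11 * 5 * 52
print(ast_hours)                                        # 2860
print(round(availability_percent(ast_hours, 200), 2))   # 93.01
```

Under that assumption the service was available for about 93 percent of its agreed service time.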

Slide 15: The degree of availability of a component or service is often expressed as a
number of nines.

The number of nines refers to the percentage of time for which the system is available.
Three nines means that a configuration item or service is available 99.9% of the time and
five nines means 99.999%. These numbers become more significant when you look at them
in terms of downtime over a fixed period.

For example, based on 24x7x365 service hours, a system with 99.9% availability can be
expected to experience about 8.76 hours of downtime in a year, while a system with 99.999%
availability will be down for just 315 seconds in a year.
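The conversion from a number of nines to yearly downtime is simple arithmetic. The sketch below assumes a 365-day year of round-the-clock service, and it reproduces the figures quoted above to within rounding. The function name annual_downtime_hours is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent):
    """Expected downtime per year, in hours, for a given availability percentage."""
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR


for label, percent in [("three nines", 99.9), ("five nines", 99.999)]:
    hours = annual_downtime_hours(percent)
    print(f"{label}: {hours:.2f} hours, or {hours * 3600:.0f} seconds")
```

Each extra nine divides the permitted downtime by ten, which is why the step from three nines to five nines shrinks hours of downtime to a few minutes.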

Slide 16: In this graph the incident is broken down into individual stages within a lifecycle
that can be timed and measured.

The time the server is up and running is called uptime. The moment the motherboard
crashes and the server goes down, it is said that an incident has happened. The time lag
between the server going down and somebody detecting that the server has gone down is
called Time to detect. The time to create a record for this issue is called Time to record.
Once a record has been created there will be some Time to diagnose, to figure out what's
wrong with the server. Once diagnosed, there will be a Time to repair the issue, which in
this case ends with the report that the motherboard has been fixed. At that point the server has
not yet come up, and neither has the service that depends on it. The time taken after
the repair to bring the server back up is the Time to recover, and the time taken to restore
the service to the users as it was is the Time to restore.

The time from the server going down to the restoration of the service is called Downtime, or
Mean Time To Repair when averaged over failures. The time elapsed from one failure to the
next failure is called Uptime, or Mean Time Between Failures when averaged over failures.
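Because each stage of the incident lifecycle can be timed, the Downtime of one incident is the sum of its stage durations. The sketch below uses invented durations for the motherboard example. Only the stage names come from the script.

```python
from datetime import timedelta

# Hypothetical stage durations for a single incident, in lifecycle order.
stages = {
    "detect": timedelta(minutes=10),
    "record": timedelta(minutes=5),
    "diagnose": timedelta(minutes=45),
    "repair": timedelta(minutes=120),
    "recover": timedelta(minutes=20),
    "restore": timedelta(minutes=10),
}

# Downtime runs from the failure until the service is restored to the users.
downtime = sum(stages.values(), timedelta())
longest_stage = max(stages, key=stages.get)

print("Downtime:", downtime)
print("Longest stage:", longest_stage)
```

Timing the stages separately shows where an improvement would shorten Downtime the most. In these invented figures that is the repair stage.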

Slide 17: Let's understand how all this helps us in calculating availability. Take a look at this
diagram displayed on screen.

The section above the red line is the Uptime and the section below the line is the Downtime.
In this example the server uptime was 30 days. Then the server went down for a day and
after that the server was up again for 45 days and so on.

The first term is Mean Time Between Failures or MTBF or Uptime. It is defined as the
average time elapsed from one failure to the next, which relates to the reliability of the
service. Mean Time Between Failures is the total uptime divided by the number of
breakdowns.

In this example, the total uptime is 115 days, which is the sum of 30, 45 and 40 days. The
number of breakdowns is 3, as the system went down three times. So the Mean Time
Between Failures is 115 divided by 3, which is 38.33 days.

Mean Time To Repair or MTTR or Downtime is defined as the average time it takes to repair
something after a failure.

Mean Time To Repair is the total downtime divided by the number of breakdowns.

Here, the total downtime is 4 days, which is the sum of 1, 2 and 1 days. The number of
breakdowns is 3, as the system went down three times. So the Mean Time To Repair is 4
divided by 3, which is 1.33 days.
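The two averages can be reproduced from the figures in the diagram. The sketch below also derives an availability figure as total uptime divided by total elapsed time. That last step is not stated in the script. It is the usual way of relating uptime and downtime to availability, and it is added here as an assumption.

```python
uptimes_days = [30, 45, 40]    # periods above the red line
downtimes_days = [1, 2, 1]     # periods below the red line
breakdowns = len(downtimes_days)

mtbf_days = sum(uptimes_days) / breakdowns    # Mean Time Between Failures
mttr_days = sum(downtimes_days) / breakdowns  # Mean Time To Repair

# Assumption (not in the script): availability over the observed window.
total_days = sum(uptimes_days) + sum(downtimes_days)
availability = sum(uptimes_days) / total_days * 100

print(round(mtbf_days, 2))     # 38.33
print(round(mttr_days, 2))     # 1.33
print(round(availability, 2))  # 96.64
```

A higher MTBF or a lower MTTR both raise the availability figure, which is why the two measures are tracked together.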

Overall, Availability Management optimizes and monitors IT services in order to provide
continued services that are cost-effective and compliant with Service Level Agreements.

Slide 18: We have come to the end of this capsule.

Thank you for watching this capsule on Availability Management Basics.

Good bye!

Slide 19: In this video, you learnt the following key points:

• The Availability Manager is the single point of responsibility for IT systems availability
and a champion of Availability planning within the IT organization.
• After defining the SLA, you define Availability and the reporting period, such as weekly or
monthly. Availability reporting can be defined in terms of:
    o Maximum hours of downtime, which will be expressed simply in hours and
    minutes.
    o Downtime as a percentage of availability.
    o Maximum number of non-availability events.
• If the performance of a service, or the quality of an IT service, is degraded enough to
cause significant business impact, you should consider the event as Unavailability.
• Creating contingency and recovery plans gives you a series of steps that will
minimize the business impact.
• Availability is equal to AST minus DT, divided by AST, and multiplied by one hundred,
where:
    o AST is the Agreed Service Time possible over the period for which the calculation
    is being made.
    o DT is the actual downtime recorded over the period for which the calculation
    is being made.
• The time from the server going down to the restoration of the service is called
Downtime.
• MTBF is defined as the average time elapsed from one failure to the next, which
relates to the reliability of the service.
• MTTR or Downtime is defined as the average time it takes to repair something after
a failure.

Slide 20: Thank you for watching this video on availability management basics.
