
Availability Management

- Premanand Lotlikar

31st July, 2007

• Introduction
• Objective of Availability Mgmt
• Basic Concepts
• Benefits
• Relationship with other processes
• Activities in Availability Mgmt
• Process Control
• Key Performance Indicators
• Cost
• Possible Problems
Objective of Availability Mgmt
• Determining availability requirements in close
collaboration with customers
• Guaranteeing the level of availability established
for the IT services
• Monitoring the availability of the IT services
• Proposing improvements in the IT infrastructure
and services with a view to increasing levels of
availability
• Supervising compliance with the OLAs and UCs
agreed with internal and external service
providers
Basic Concepts
• High Availability means
– IT service is continuously available to the customer
– Little downtime
– Rapid service recovery
• Availability of service depends on
– Complexity of the IT infrastructure architecture
– Reliability of the components
– Ability to respond quickly and effectively to faults
– Quality of maintenance by support and suppliers
Basic Concepts
• Reliability means
– Service is available for an agreed period without
interruption
• Includes resilience
• Calculated using statistics
• Determined by
– Reliability of the components
– Ability of service/component to operate despite failure
– Preventive maintenance
Basic Concepts
• Maintainability needed to
– Keep the services in operations
– Restore services when they fail
• Includes
– Taking measures to prevent faults
– Detecting faults
– Diagnosing faults, including automatic diagnosis
by the components themselves
– Resolving the fault
– Restoring the service
Basic Concepts
• Mean Time to Repair (MTTR)
– Average time between the occurrence of a fault
and service recovery
• Mean Time Between Failures (MTBF)
– Average time between recovery from one incident
and the occurrence of the next
• Mean Time Between System Incidents (MTBSI)
– Average time between the occurrence of two
consecutive incidents
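The three measures above can be computed from an incident log. The following is a minimal sketch, assuming each incident is recorded as a (fault occurred, service restored) pair; the sample dates and the function names are hypothetical and are not part of the original slides.

```python
from datetime import datetime

# Hypothetical incident log: (fault occurred, service restored), in time order.
incidents = [
    (datetime(2007, 7, 1, 9, 0), datetime(2007, 7, 1, 11, 0)),
    (datetime(2007, 7, 11, 9, 0), datetime(2007, 7, 11, 10, 0)),
    (datetime(2007, 7, 21, 9, 0), datetime(2007, 7, 21, 12, 0)),
]


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


def mean(values):
    return sum(values) / len(values)


def mttr(log):
    """Average time from the occurrence of a fault to service recovery."""
    return mean([hours(restored - fault) for fault, restored in log])


def mtbf(log):
    """Average time from recovery of one incident to the next fault."""
    gaps = [hours(log[i + 1][0] - log[i][1]) for i in range(len(log) - 1)]
    return mean(gaps)


def mtbsi(log):
    """Average time between the occurrence of two consecutive faults."""
    gaps = [hours(log[i + 1][0] - log[i][0]) for i in range(len(log) - 1)]
    return mean(gaps)


print(mttr(incidents), mtbf(incidents), mtbsi(incidents))
```

With this sample log the repair times are 2, 1 and 3 hours, and consecutive faults are 240 hours apart, so each uptime interval is 240 hours minus the preceding repair time.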
Benefits
• Fulfillment of the agreed service levels.
• Reduction in the costs associated with a
given level of availability.
• The customer perceives a better quality of
service.
• The levels of availability progressively
increase.
• The number of incidents is reduced.
Inputs - Outputs
Relationship with other processes
• Service Level Mgmt is responsible for
negotiating & managing availability
• Availability is one of the most important
elements in an SLA
Relationship with other processes
• Configuration Mgmt has information about
the infrastructure and can provide valuable
information to Availability Mgmt
Relationship with other processes
• Changes in capacity can often affect the
availability of a service
• Changes to availability will affect capacity
• These 2 processes exchange info about
– Scenarios for upgrading
– Phasing out IT components
– Availability trends that may need changes to
capacity
Relationship with other processes
• Problem Mgmt is directly involved in
identifying and resolving the causes of
actual or potential availability problems
Relationship with other processes
• Incident Mgmt provides reports with
information about recovery times, repair
times etc. This information is used to
determine the achieved availability.
Relationship with other processes
• Change Mgmt informs Availability Mgmt
about the Forward Schedule of Change (FSC)
• Availability Mgmt informs Change Mgmt
about maintenance related to new services
and elements.
Activities in Availability Mgmt
• Planning
• Monitoring
• Determining the availability requirements
• Designing for availability
• Designing for recoverability
• Security issues
• Maintenance management
• Developing the Availability Plan
Determining the availability requirements
• Must be undertaken before the SLA is
agreed
• Should address both new IT services and
changes to existing services
• Clearly defining availability requirements
early is essential to prevent confusion and
Determining the availability requirements
• Should identify:
– Key business functions
– Agreed definition of IT service downtime
– Quantifiable availability requirements
– Quantifiable impact on the business functions
of unscheduled IT service downtime
– Business hours of customer
– Agreements about maintenance windows
Designing for availability
• Vulnerabilities affecting availability
standards should be identified early
• This will prevent
– Excessive development costs
– Unplanned expenditure at later stages
– Additional cost by suppliers
– Overall delays
Designing for recoverability
• Uninterrupted availability is rarely feasible
• Design for recoverability involves
– Effective Incident Mgmt
– Appropriate escalation
– Communication
– Backup and recovery procedures
– Tasks, responsibilities and authority clearly
defined
Key Security issues
• Security and reliability are closely linked
• High availability can be supported by
effective information security
• This includes:
– Determining who is authorized to access
secure areas
– Determining which critical authorizations may
be issued
Maintenance management
• There will always be scheduled windows of
unavailability
• These periods can be used for preventive
maintenance
• Maintenance must be carried out when
impact on services can be minimized
Developing the Availability Plan
• Long term plan concerning availability
over the next few years
• It is not the implementation plan for
Availability Mgmt
• The plan requires liaison with areas such as
– Service Level Mgmt
– IT Service Continuity Mgmt
– Capacity Mgmt
– Change Mgmt
Methods and Techniques
• Component Failure Impact Analysis (CFIA)
– Uses an Availability matrix with strategic
components and their roles in each service
– Horizontal Analysis
– Vertical Analysis
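A CFIA availability matrix can be sketched as a small table in code. This is an illustrative sketch only: the component and service names are hypothetical, and it assumes a common layout with components as rows and services as columns, so that reading across a row (horizontal) shows how many services depend on one component and reading down a column (vertical) shows how many components one service depends on.

```python
# Hypothetical availability matrix: rows are components, columns are services.
# True means the service depends on the component.
matrix = {
    "router": {"email": True, "payroll": True, "intranet": True},
    "mail_server": {"email": True, "payroll": False, "intranet": False},
    "db_server": {"email": False, "payroll": True, "intranet": True},
}


def horizontal_analysis(m):
    """Per component: number of services that depend on it."""
    return {component: sum(row.values()) for component, row in m.items()}


def vertical_analysis(m):
    """Per service: number of components it depends on."""
    counts = {}
    for row in m.values():
        for service, depends in row.items():
            counts[service] = counts.get(service, 0) + int(depends)
    return counts


print(horizontal_analysis(matrix))
print(vertical_analysis(matrix))
```

In this sample the router supports all three services, which marks it as a candidate single point of failure worth closer attention.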
Fault Tree Analysis
• Used to identify chain of events leading to
failure of IT service
• Distinguishes following events:
– Basic Event: power outages or operator error
– Resulting Event: resulting from combination of
earlier events
– Conditional Event: events that occur only in
certain conditions
– Trigger Event: events that cause other events
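The way basic events combine into a resulting event can be illustrated with a tiny fault tree. This sketch assumes the AND/OR gates commonly used in fault tree diagrams, which the slides do not describe explicitly; the UPS failure event and all function names are hypothetical.

```python
def and_gate(*events):
    """The resulting event occurs only if all input events occur."""
    return all(events)


def or_gate(*events):
    """The resulting event occurs if any input event occurs."""
    return any(events)


def service_fails(power_outage, ups_fails, operator_error):
    # Resulting event: power is lost only when the mains power fails
    # AND the (hypothetical) UPS also fails.
    power_lost = and_gate(power_outage, ups_fails)
    # Top event: the service fails on loss of power OR operator error.
    return or_gate(power_lost, operator_error)


print(service_fails(power_outage=True, ups_fails=False, operator_error=False))
print(service_fails(power_outage=True, ups_fails=True, operator_error=False))
```

Walking the tree this way shows which combinations of basic events are enough to bring the service down.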
Availability Calculations
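• Availability is commonly defined as a
percentage as follows:

Availability (%) = (AST − DT) / AST × 100

where AST is the agreed service time and DT is
the downtime within that time

• For example, if the service is 24/7 and
over the last month the system has been
down for four hours to carry out
maintenance, the real availability of the
system was:

(720 − 4) / 720 × 100 ≈ 99.4%
(taking a 30-day month, i.e. 720 hours)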
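The calculation above can be sketched in a few lines. This is a minimal sketch, assuming the usual definition of availability as agreed service time minus downtime, divided by agreed service time; the function name is hypothetical, and the 30-day month is an assumption because the slides do not say how long the month was.

```python
def availability_percent(agreed_service_hours, downtime_hours):
    """Availability as a percentage of the agreed service time."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime_hours <= agreed_service_hours:
        raise ValueError("downtime must lie within the agreed service time")
    uptime = agreed_service_hours - downtime_hours
    return uptime / agreed_service_hours * 100


# 24/7 service over an assumed 30-day month, with four hours of downtime.
agreed = 24 * 30
print(round(availability_percent(agreed, 4), 2))  # 99.44
```

A longer or shorter month changes the result only slightly, but both parties should agree on the measurement period as well as on the definition of downtime.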
Process Control
• Critical Success Factors
– Business must have clearly defined
availability objectives
– SLM must have been set up to formalize
agreements
– Both parties must use the same definitions of
availability and downtime
Process Control
• Key Performance Indicators
– Percentage availability per service
– Downtime duration
– Downtime frequency
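These three indicators can be derived per service from a list of outage records. The following is a rough sketch: the record format, the service names and the 720-hour agreed service time are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical outage records for one month: (service, downtime in hours).
outages = [("email", 2.0), ("email", 1.0), ("payroll", 4.0)]
AGREED_SERVICE_HOURS = 720  # assumed: 24/7 service over a 30-day month


def availability_kpis(records, agreed_hours):
    """Per service: percentage availability, downtime duration and frequency."""
    duration = defaultdict(float)
    frequency = defaultdict(int)
    for service, downtime in records:
        duration[service] += downtime
        frequency[service] += 1
    return {
        service: {
            "availability_pct": (agreed_hours - duration[service]) / agreed_hours * 100,
            "downtime_hours": duration[service],
            "downtime_count": frequency[service],
        }
        for service in duration
    }


for service, kpi in availability_kpis(outages, AGREED_SERVICE_HOURS).items():
    print(service, kpi)
```

Reporting duration and frequency alongside the percentage matters because many short outages and one long outage can produce the same availability figure.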
Possible Problems
• The real availability of the service is not monitored
• There is no commitment to the process in the IT
organization
• The appropriate software tools and personnel are not
available
• The availability objectives do not match the customer's
needs
• There is a lack of coordination with other processes.
• Internal and external service providers do not recognize
the authority of the Availability Manager as a result of a
lack of support from management
Thank you!