
AVAILABILITY

Lecture 8
References
■ Norman Fenton and James Bieman, Software Metrics: A
Rigorous and Practical Approach, Third Edition, CRC
Press, Boca Raton, FL, 2014 – Chapters 10, 11.
■ Nikola Milanovic, Models, Methods and Tools for
Availability Assessment of IT-Services and Business
Processes, Chapter 3, TU Berlin, 2011.
■ Service Availability Forum – saforum.org
■ NIST/SEMATECH e-Handbook of Statistical Methods,
http://www.itl.nist.gov/div898/handbook/, 2013.
■ + 2 case studies (see the Case-studies section)
Availability

■ Your boss tells you that your new system should have
“five nines” availability. You know that “five nines”
means 99.999% availability.
■ Is this reasonable or not?
■ How many hours of downtime per year does that allow?
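A quick way to answer the second question is to turn an availability target into a downtime budget; a minimal Python sketch (the "nines" levels are the usual ones, chosen here for illustration):

```python
# Convert an availability target into an annual downtime budget.
HOURS_PER_YEAR = 365 * 24  # 8760 h, ignoring leap years

def downtime_per_year_h(availability: float) -> float:
    """Allowed downtime, in hours per year, for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

for label, a in [("two nines", 0.99), ("three nines", 0.999),
                 ("four nines", 0.9999), ("five nines", 0.99999)]:
    h = downtime_per_year_h(a)
    print(f"{label:11s} {a:.5f}: {h:8.3f} h/year ({h * 60:7.1f} min/year)")

# "Five nines" allows only about 0.088 h, i.e. roughly 5.3 minutes, per year.
```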
Availability
Availability = f (failure)
Measuring failure
■ Mean Time Between Failures (MTBF)
– repairable items
– the mean time that elapses between failures (assuming a
constant failure rate)
– MTBF = total operating time / number of failures
■ Mean Time To Failure (MTTF)
– non-repairable items
– the mean time expected until the first failure
– MTTF = total operating time / number of units under test
■ Failure rate λ
– number of failures / total operating time
– λ = 1/MTBF
■ Mean Time To Repair (MTTR)
– the total time spent performing all corrective or
preventive maintenance repairs divided by the total
number of those repairs
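As a rough illustration, a minimal Python sketch that derives these metrics from a hypothetical outage log (the timestamps and observation window are invented for the example):

```python
# Hypothetical outage log: (failure_time_h, repair_done_h) pairs over a
# tracked window; all numbers are invented for illustration.
outages = [(100.0, 100.5), (310.0, 312.0), (505.0, 505.25)]
window_h = 720.0  # a 30-day month of observation

n_failures = len(outages)
downtime_h = sum(end - start for start, end in outages)
uptime_h = window_h - downtime_h

mtbf = uptime_h / n_failures      # total operating time / number of failures
mttr = downtime_h / n_failures    # total repair time / number of repairs
lam = 1.0 / mtbf                  # failure rate λ = 1/MTBF

print(f"MTBF = {mtbf:.2f} h, MTTR = {mttr:.2f} h, λ = {lam:.5f} failures/h")
```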
Intuition on Availability
Availability = uptime/total time = MTBF/(MTBF + MTTR)

If the system goes down once a week for 5 hours, what
is the availability?
■ Number of hours per week = 7 * 24 = 168
■ Availability = (168 - 5)/168 = 0.97 = 97%

If the typical time between failures is 48 hours, and the
typical time the system is down is 10 minutes, what is
the availability?
■ Availability = MTBF/(MTBF + MTTR) = 48/[48 + (10/60)] ≈ 0.996 = 99.6%
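Both calculations can be checked with a few lines of Python; the helper below simply encodes A = MTBF/(MTBF + MTTR):

```python
def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)

# Down 5 h per 168-h week => 163 h of uptime per failure.
print(f"{availability(163, 5):.3f}")       # 0.970 -> 97%

# Failure every 48 h, 10-minute repair.
print(f"{availability(48, 10 / 60):.4f}")  # 0.9965 -> ~99.6%
```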
Example
■ IDC compared the availability of three different
server platforms for ERP systems (such as SAP or
Oracle), which were 99.98%, 99.67%, and 99.9%, for
an average of 99.85%.
■ Your system’s requirement is to have downtime of less
than 10 hours per year, and you need to decide which
server platform to use.
■ What does the availability of your software need to
be? Assume that the server platforms include disk,
network, CPU . . . everything except the software.
Example answered
■ System Availability >= 99.88% (10 hours of downtime per year)
■ System Availability = Software Availability * Server
Platform Availability

          Server 1   Server 2   Server 3   Average
          99.98%     99.67%     99.9%      99.85%

■ With the average server availability, Software
Availability = 99.88/99.85 > 1, which is impossible.
■ With the best platform, Software Availability =
99.88/99.98 = 99.90%, which is about 8 h 46 min of
downtime per year.
=> if you can only have 10 hours of outage, and the
hardware takes X hours, your software can only be
down 10 - X.
Availability Factors
■ Frequency of outages (MTBF)
■ Duration of outage (MTTR)
– System Complexity: Complicated systems take longer
to start and restart, which makes the outages longer.
– Problem Severity: The more severe the problem, the
more time it takes to fully resolve and restore the
system, including lost data or lost work.
– Support Personnel Availability: The length of outage
includes the time it takes to find the on-call personnel,
get them online and allow them to diagnose and fix the
problem.
– Other Miscellaneous Factors: for example, a third-party
supplier who cannot fix a part quickly enough, or a
situation where you cannot take the system offline to fix
your application because other applications are
running on the server.
Example
■ If the typical time between failures is 48 hours, and the
typical time the system is down is 10 minutes, what is
the availability if it also takes 2 hours to get the right
person to bring the system back up?
■ Availability = 48/[48 + (2 + 10/60)] = 0.957
=> 95.7%, versus 99.6% without the 2 hours to get the
right person.
Another Example
■ Your system averages 10 outages every 100 hours. 90% of
the outages are 5 minutes each and 10% of the outages are
2 hours each.
=> Total outage time = 9 * 5 min + 1 * 2 h = 2.75 hours, so
Availability = (100 - 2.75)/100 = 97.25%.
■ You analyze the long outages and determine that you can
improve the error handling for these outages so that
recovery is simplified and the outage duration is
reduced to 15 minutes.
■ What would be the impact of the improvement on
availability?
■ The overall outage duration per 100 hours drops from 2.75
hours to 1 hour, so availability increases to 99.00%.
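A small sketch of the same computation in Python (the outage profile is the one from this example):

```python
# Availability from an outage profile of (count, duration_h) pairs
# per 100-hour window; numbers mirror this example.
def availability_from_outages(profile, window_h: float = 100.0) -> float:
    downtime_h = sum(count * duration for count, duration in profile)
    return (window_h - downtime_h) / window_h

before = [(9, 5 / 60), (1, 2.0)]      # nine 5-min outages, one 2-h outage
after = [(9, 5 / 60), (1, 15 / 60)]   # the long outage cut to 15 min

print(f"before: {availability_from_outages(before):.4f}")  # 0.9725
print(f"after:  {availability_from_outages(after):.4f}")   # 0.9900
```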
Outage Scope
■ To improve system availability, you want to
reduce both the scope and the severity of the
outages
■ A key fault-tolerant design concept is that of
“degraded mode”
■ Ex. you could partition the database so that
even if a part of the database is corrupted and
needs to be restored, the rest of the system
continues to function.
Complexities in Measuring
Availability
■ How would you measure availability for a banking
system in which the server and networks are up 100%,
except for scheduled maintenance from 2 a.m. to 3 a.m.
on Sunday mornings, and 50% of the ATM machines
fail once per month, with each taking a day to fix?
⇒ complexities of scheduled vs. unscheduled downtime,
⇒ system vs. end-terminal availability
■ Continuous Availability (CA) => the system is to be
available 24/7 with no outages
■ High Availability (HA) => the system or application is to
be available during the committed hours of availability,
with no unplanned or unscheduled outages
ATM example
■ How would you measure availability for a banking system in
which the server and networks are up 100%, except for
scheduled maintenance from 2 a.m. to 3 a.m. on Sunday
mornings, and 50% of the ATM machines fail once per
month, with each taking a day to fix?
■ System Availability:
– Continuous Availability = 716/720 = 0.9944 = 99.44%
– High Availability = 100%
■ Average End-Client Availability:
– Continuous Availability = [720 - (24 * 0.5)]/720 *
(System-CA) = 0.9833 * 0.9944 = 0.9778 = 97.78%
– High Availability = [720 - (24 * 0.5)]/720 * (System-HA)
= 0.9833 * 1 = 98.33%
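The same arithmetic in a short Python sketch (assuming a 30-day, 720-hour month, as the slide does):

```python
# System vs. average end-client availability for the ATM example,
# assuming a 30-day month (720 h).
MONTH_H = 720.0

maintenance_h = 4.0                               # four 1-h Sunday windows
system_ca = (MONTH_H - maintenance_h) / MONTH_H   # 716/720 ≈ 0.9944
system_ha = 1.0                                   # all outages are scheduled

# Half of the ATMs fail once a month and take a day (24 h) to fix,
# so the average ATM loses 12 h per month.
client_factor = (MONTH_H - 24.0 * 0.5) / MONTH_H  # ≈ 0.9833

print(f"end-client CA: {client_factor * system_ca:.4f}")  # ≈ 0.9778
print(f"end-client HA: {client_factor * system_ha:.4f}")  # ≈ 0.9833
```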
Degraded mode operations
■ For the past month, 99% availability except for
degraded mode for 10 hours, which impacted
4% of the system
■ Effective Availability (EA) = (710/720) * 0.99 +
(10/720) * (0.04*0 + 0.96*1) = 0.976 + 0.013 =
0.989 = 98.9%
Typical availability function
[Figure: failure density, availability, and reliability as functions of time]
Uniform distribution
The probability that the component fails in any time
period of equal length is the same.

Ex. A component that will fail within the next 5 days,
with equal probability on each day =>
f(t) = 1/5 for t = 1 to 5
f(t) = 0 for t > 5
Exponential distribution
■ The probability of failure changes
exponentially over time.
■ An exponential distribution:
f(t) = λ e^(−λt)
λ – failure rate
λ = 1/MTBF
■ The hazard rate f(t)/R(t) is constant and equal
to λ, where R(t) = 1 − F(t) is the reliability
function.
Demonstration
Mean time between failures
(MTBF)
Example 1
You track a system for one month.
It fails on days 2, 7, 8, 16, 22, and 31.
To calculate the MTBF, you look at the intervals
between failures.

Assuming you started on day 1, the intervals are 1, 5, 1,
8, 6, and 9 days.

MTBF = (1 + 5 + 1 + 8 + 6 + 9)/6 = 30/6 = 5 days
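For larger logs the same calculation is easy to script; a minimal Python sketch using the failure days above:

```python
# MTBF from the failure days in Example 1.
failure_days = [2, 7, 8, 16, 22, 31]
start_day = 1  # tracking starts on day 1

# Intervals between consecutive failures (the first interval is
# measured from the start of observation).
points = [start_day] + failure_days
intervals = [b - a for a, b in zip(points, points[1:])]

mtbf = sum(intervals) / len(intervals)
print(intervals, mtbf)  # [1, 5, 1, 8, 6, 9] 5.0
```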


Example 2
You track the length of time it takes a system
to fail and find that:

f(1) = 10%; f(2) = 15%; f(3) = 20%; f(4) = 25%;
f(5) = 30%; all other f(x) = 0.

MTBF = 0.1*1 + 0.15*2 + 0.2*3 + 0.25*4 + 0.30*5
     = 0.1 + 0.3 + 0.6 + 1 + 1.5 = 3.5
Example 3
Consider an exponential distribution. If the
MTBF = 2 days, what is the probability of the
system failing on day 10?

=> λ = 1/MTBF = 1/2 = 0.5
⇒ f(t) = 0.5 * e^(−0.5t)
⇒ f(10) = 0.5 * e^(−5) ≈ 0.0034 = 0.34%
The Probability of Failure
During a Time Interval
■ The probability of failure between any two points of
time a and b is the area under the curve of f(t)
■ Uniform distribution: P(a ≤ T ≤ b) = c * (b − a)
■ Exponential distribution: P(a ≤ T ≤ b) = e^(−λa) − e^(−λb)
Examples
The probability of failure between day 3 and day 4 for a
uniform distribution with c = 1/5 = 0.2 is 0.2 * (4 − 3) = 0.2, or
20%.

The probability of failure between day 5 and day 6 for
an exponential distribution is e^(−5λ) − e^(−6λ), which
depends on λ.
■ For λ = 1, the probability of failure is 0.004.
■ For λ = 0.1, it is 0.057.
=> the probability of failure between day 5 and day 6 is 5.7%
when λ = 0.1.
■ For this example, MTBF = 1/0.1 = 10.
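A short Python sketch of both interval probabilities (the values match the examples above):

```python
import math

def p_fail_uniform(a: float, b: float, c: float) -> float:
    """P(a <= T <= b) for a uniform failure density f(t) = c."""
    return c * (b - a)

def p_fail_exponential(a: float, b: float, lam: float) -> float:
    """P(a <= T <= b) = e^(-lam*a) - e^(-lam*b) for an exponential density."""
    return math.exp(-lam * a) - math.exp(-lam * b)

print(p_fail_uniform(3, 4, 0.2))       # 0.2
print(p_fail_exponential(5, 6, 1.0))   # ≈ 0.0043
print(p_fail_exponential(5, 6, 0.1))   # ≈ 0.0577
```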
F(t): The Probability of Failure
by time t
■ Is equivalent to the probability of failure during the
interval 0 to t
■ F(t) is a cumulative distribution function (CDF)
■ For the uniform distribution example,
F(t) = c * t, where t is an integer
■ For the exponential distribution,
F(t) = 1 − e^(−λt)
Example
Availability computation
■ λ = 1/MTBF (failure rate)
■ μ = 1/MTTR (repair rate)
■ p_up – probability that the system is up
■ p_down – probability that the system is down

p_up + p_down = 1
λ * p_up = μ * p_down => p_up = μ/(μ + λ)
A = p_up = (1/MTTR)/[(1/MTBF) + (1/MTTR)] = MTBF/(MTBF + MTTR)
Analytical availability models

■ Combinatorial models
– Reliability Block Diagrams (RBD)
– Fault Trees
– Reliability Graphs
■ State-space models
– Markov Models
– Stochastic Petri Nets (SPN)
– Markov Reward Models (MRM)
Reliability Block Diagrams

■ Series configuration: all components must work,
A_series = A1 * A2 * ... * An
■ Parallel configuration: the system works if at least one
component works,
A_parallel = 1 − (1 − A1) * (1 − A2) * ... * (1 − An)
Proof
K out of N
The system works if at least k of its n identical components
(each with availability A) work:
A(k of n) = Σ_(i=k..n) C(n, i) * A^i * (1 − A)^(n−i)
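A minimal Python sketch of the three block-diagram formulas (the component availabilities in the demo calls are illustrative):

```python
import math

def series(avails):
    """Series RBD: all blocks must be up."""
    result = 1.0
    for a in avails:
        result *= a
    return result

def parallel(avails):
    """Parallel RBD: at least one block must be up."""
    all_down = 1.0
    for a in avails:
        all_down *= (1.0 - a)
    return 1.0 - all_down

def k_out_of_n(k: int, n: int, a: float) -> float:
    """At least k of n identical blocks with availability a are up."""
    return sum(math.comb(n, i) * a**i * (1.0 - a)**(n - i)
               for i in range(k, n + 1))

print(series([0.999, 0.998]))   # ≈ 0.9970
print(parallel([0.99, 0.99]))   # 0.9999
print(k_out_of_n(2, 3, 0.99))   # ≈ 0.9997
```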
Example
Exponential distribution
For exponentially distributed times to failure and repair:
MTTF = 1/λ
MTTR = 1/μ
Availability – Subsystem 1 (CPU,
MEM, CON)
Availability – Subsystem 2 (UPS,
UPS, CONTROL)
Availability – Subsystem 3 (Disks)
Adding all up
Fault Tree
■ The model focuses on the failure space
■ OR gates correspond to series configurations; AND gates
to parallel configurations
K out of N
Same example
Resulting availability
Markov models
■ Components are not independent (e.g., load
balancing)
■ Stochastic processes:
– A family of random variables X(t) defined on a
sample space.
– The sample space = the set of all possible
evolutions of the states.
– When the value of X(t) (i.e., the state) changes, the
process has performed a state transition.
– The state space can be either discrete or
continuous. If the state space is discrete,
the process is called a chain.
Discrete Time Markov Chains
(DTMC)
■ a discrete-state stochastic process in which the current state
depends only on the immediately preceding state
■ X(t) = a discrete-state stochastic process
■ P(X(t_n) = j) = the probability that the process is in state j at
time t_n
■ X(t) is a Markov chain if, for any ordered time sequence
t_0 < t_1 < ... < t_n, the conditional probability of the process
being in any state j at time t_n is given by:
P[X(t_n) = j | X(t_{n−1}) = i_{n−1}, X(t_{n−2}) = i_{n−2}, ..., X(t_0) = i_0]
= P[X(t_n) = j | X(t_{n−1}) = i_{n−1}]
■ If this conditional probability is invariant with respect to
the time origin t_0, the Markov chain is homogeneous
■ The probability that the process is in state j at time t:
π_j(t) = P(X(t) = j)

■ The vector of all state probabilities:
π(t) = [π_0(t), π_1(t), …]
■ The probability that the system is in state j at time t,
given that it was in state i at time u, is the transition
probability: P_{i,j}(t−u) = P[X(t) = j | X(u) = i]
■ The matrix P(t) is the square matrix of the transition
probabilities P_{ij}(t), given u = 0
■ The Chapman-Kolmogorov equation states that, for a
system to be in state j at some time t, it must have been
in some state i at time u and must have made zero or
more transitions to reach j:
π(t) = π(u) * P(t−u)
■ For a DTMC: π(n+1) = π(n) * P, hence π(n) = π(0) * P^n
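A small numeric sketch of the last relation (the 2-state transition matrix below is arbitrary, not the cache example on the next slide):

```python
import numpy as np

# Arbitrary 2-state DTMC transition matrix (rows sum to 1).
P = np.array([[0.9, 0.1],
              [0.6, 0.4]])
pi0 = np.array([1.0, 0.0])  # start in state 0

# pi(n) = pi(0) * P^n
for n in (1, 2, 10):
    print(n, pi0 @ np.linalg.matrix_power(P, n))
# The distribution converges to the limiting probability vector [6/7, 1/7].
```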

Example
■ Web application with two processes: user login
(authentication/authorization) and search.
■ An in-memory cache on the server side with two memory blocks.
Cache access time is constant and synchronized, and both caches
can be accessed simultaneously.
■ A request for cache1 is generated with probability q1
■ A request for cache2 is generated with probability q2 = 1 − q1
■ State (i,j); i = number of requests waiting for cache1; j = number
of requests waiting for cache2
[State-transition diagram: e.g., both processes generate a request
for cache2 with probability q2²; one process generates a request
for cache1 and the other one for cache2 with probability 2*q1*q2]
Example
■ State 0 – (1,1)
■ State 1 – (2,0)
■ State 2 – (0,2)

Transition probability matrix

Chapman-Kolmogorov system of equations

Limiting probability vector
■ The solution of the system = the limiting probability vector
Limiting probability vector
■ What do these probabilities have to do with availability?
■ A continuous-time repairable system with failure rate λ
and repair rate μ is defined by the generator matrix
Q = [[−λ, λ], [μ, −μ]]
Markov Analysis of Continuous Time
Markov Chains

⇒ Instantaneous availability function:
A(t) = μ/(λ + μ) + λ/(λ + μ) * e^(−(λ+μ)t)

⇒ Steady-state availability:
A = lim(t→∞) A(t) = μ/(λ + μ) = MTBF/(MTBF + MTTR)

⇒ Expected uptime in (0,t): ∫_0^t A(u) du
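A minimal Python sketch of the instantaneous availability function; the rates below (failure about every 48 h, 30-minute repair) are illustrative:

```python
import math

def inst_availability(t: float, lam: float, mu: float) -> float:
    """A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu)*t),
    for a 2-state repairable system that starts in the up state."""
    return (mu + lam * math.exp(-(lam + mu) * t)) / (lam + mu)

lam, mu = 1 / 48, 1 / 0.5  # fail about every 48 h, repair in about 30 min
for t in (0.0, 0.5, 1.0, 5.0):
    print(f"A({t}) = {inst_availability(t, lam, mu):.5f}")
# A(t) decays from 1 toward the steady state mu/(lam+mu) ≈ 0.98969.
```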
2-Component Non-Repairable
System
■ Generator matrix Q

■ Chapman – Kolmogorov differential system of


equations

■ Solution
2-component Repairable Sys.
■ Just one component is repairable

■ 2 repairable components – non-shared repair

■ 2 repairable components – shared repair


Comparing strategies
Quantitative Tools
■ Analytical and Simulation
– Isograph Toolbox (FaultTree+, AvSim+, Reliability
Workbench, NAP, AttackTree+),
– METFAC,
– Möbius,
– OpenSesame,
– Reliass,
– Sharpe 2002,
– SPNP,
– Figaro,
– BQR,
– Mathworks Stateflow.
Benchmark tools
■ Eclipse Test and Performance Tools Platform
(TPTP),
■ Exhaustif,
■ FAIL-FCI,
■ Networked Fault Tolerance and Performance
Evaluator (NFTAPE),
■ Quake.
Techniques to improve availability

■ fault tolerance,
■ save/restore parallelism,
■ “safe libraries,”
■ active fault detection,
■ isolation and partitioning (for degraded
modes),
■ increased boot speeds,
■ concurrent release and fix upgrades
■ software rejuvenation
Case-studies
1. Alturkistani, F. M., & Alaboodi, S. S. (2017). An
Analytical Model for Availability Evaluation of
Cloud Service Provisioning System. International
Journal of Advanced Computer Science and
Applications, 8(6), 240-247.
2. Melo, C., Dantas, J., Oliveira, A., Oliveira, D., Fé,
I., Araujo, J., ... & Maciel, P. (2018). Availability
models for hyper-converged cloud computing
infrastructures. In 2018 Annual IEEE International
Systems Conference (SysCon) (pp. 1-7). IEEE.
[1]
■ proposes an analytical model for evaluating the availability of
cloud service provisioning systems, focusing on IaaS;
architecture-based: it relies on the National Institute of
Standards and Technology Cloud Computing Reference
Architecture (NIST-CCRA)
■ availability is evaluated at two levels:
– the system level
■ Reliability Block Diagrams (RBDs) are used to model the
system’s failures by considering series/parallel
arrangements of the cloud components/subsystems
– the component level
■ Probabilistic models and operational measures.
■ Failure and repair data are modeled and analyzed using
probability distributions and statistical inference
■ Operational measures are derived and used to estimate
each component’s availability
NIST-CCRA service
orchestration model
IaaS Provisioning System (IPS) RBD
MTBF and MTTR for IPS components
IPS Availability
■ Hardware availability A1
■ Virtualized platform availability A2
■ Application availability A3 (depends on its MTBF
and MTTR)
■ A_IPS = A1 x A2 x A3
Simulation results of IPS mean system
availability
[2]
■ evaluates hyper-convergence, one of the ways to reach HA
in cloud computing environments, based on the OpenStack
platform
■ uses Dynamic RBDs (DRBDs), which consider dependencies in
dynamic systems, priorities between repairs, and resource sharing
[Figures: OpenStack service layout; OpenStack with hyper-convergence]
Availability Models
Baseline RBD model (Controller ∧ Compute ∧ Storage)

DRBD for Controller

DRBD for Compute Node; RBD for Storage (OSD)

Redundant architectures
Double:
(Controller 1 ∨ Controller 2) ∧
(Compute 1 ∨ Compute 2) ∧
(OSD 1 ∨ OSD 2)
Triple: the same structure with three replicas of each node

Hyper-convergent:
(Controller Monitor 1 ∨ Controller Monitor 2 ∨ Controller Monitor 3) ∧
(Compute OSD 1 ∨ Compute OSD 2 ∨ Compute OSD 3).
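To compare such layouts numerically, a sketch reusing the series/parallel formulas from the RBD slides; the per-node availabilities are illustrative guesses, not values from the paper:

```python
# Rough comparison of the baseline, double and triple layouts; the
# per-node availabilities below are invented for illustration.
def series(avails):
    result = 1.0
    for a in avails:
        result *= a
    return result

def parallel(avails):
    all_down = 1.0
    for a in avails:
        all_down *= (1.0 - a)
    return 1.0 - all_down

ctrl, comp, osd = 0.999, 0.995, 0.998

baseline = series([ctrl, comp, osd])
double = series([parallel([ctrl] * 2), parallel([comp] * 2), parallel([osd] * 2)])
triple = series([parallel([ctrl] * 3), parallel([comp] * 3), parallel([osd] * 3)])

print(f"baseline: {baseline:.6f}")  # ≈ 0.992
print(f"double:   {double:.6f}")    # ≈ 0.99997
print(f"triple:   {triple:.6f}")
```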
Results
Wrap-up
■ Availability is a key metric for many systems
■ The main limitation of the analytical methods presented
so far is the need to parameterize model elements:
even after an arbitrarily complex formal model has
been structurally developed (e.g., a combination of
series and parallel configurations in an RBD, or a set of
states and transitions in a Markov chain) that
corresponds to the process or service description, it is
still necessary to assign parameters to the model
elements.
■ Parameters, such as failure and repair rates as well as
adequate distributions, can in principle be derived
from historical data (past system behavior) or based
on measurements.
