Lecture 8
References
■ Software Metrics: A Rigorous and Practical Approach,
by Norman Fenton, James Bieman, Third Edition, CRC
Press, Inc., Boca Raton, FL, 2014 – Chapters 10, 11
■ Dr.-Ing. Nikola Milanovic, Models, Methods and Tools
for Availability Assessment of IT-Services and
Business Processes, Chapter 3, TU Berlin 2011.
■ Service Availability Forum - saforum.org
■ NIST/SEMATECH e-Handbook of Statistical Methods,
http://www.itl.nist.gov/div898/handbook/, 2013.
■ Plus 2 case studies
Availability
■ Reasonable or not?
■ How many hours of downtime per year does that
allow?
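The downtime question can be answered with one line of arithmetic. The sketch below is illustrative only: the availability targets in the loop are hypothetical examples, not values from the lecture, and a year is assumed to have 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def downtime_hours_per_year(availability: float) -> float:
    """Downtime per year (hours) allowed by an availability given as a fraction in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show how quickly the budget shrinks.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        print(f"{target:.3%} -> {downtime_hours_per_year(target):.2f} hours/year")
```

Each additional "nine" divides the allowed downtime by ten, which is why the question "reasonable or not?" depends heavily on the target.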
Availability
Availability = f (failure)
Measuring failure
■ Mean Time Between Failures (MTBF)
– Repairable items
– mean operating time between consecutive failures of a
component (assuming a constant failure rate)
– MTBF = Total operating time/Number of failures
■ Mean Time to Failure (MTTF)
– Non-repairable items
– mean time expected until the first failure
– MTTF = Total operating time/number of units under test
■ Failure rate λ
– Number of failures/Total operating time
– λ = 1/MTBF
■ Mean Time to Repair (MTTR)
– total time spent on all corrective or preventive
maintenance repairs divided by the total number of
those repairs
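The four measures above are simple ratios. The following is a minimal sketch of them; the function names and the sample figures (1000 operating hours, 4 failures, 10 repair hours) are hypothetical and only serve to illustrate the definitions.

```python
def mtbf(total_operating_time: float, failures: int) -> float:
    """Mean Time Between Failures for repairable items."""
    return total_operating_time / failures


def mttf(total_operating_time: float, units_under_test: int) -> float:
    """Mean Time to Failure for non-repairable items."""
    return total_operating_time / units_under_test


def failure_rate(total_operating_time: float, failures: int) -> float:
    """Failure rate lambda = failures / operating time = 1 / MTBF."""
    return failures / total_operating_time


def mttr(total_repair_time: float, repairs: int) -> float:
    """Mean Time to Repair: total repair time divided by number of repairs."""
    return total_repair_time / repairs


if __name__ == "__main__":
    # Hypothetical observation window: 1000 h of operation, 4 failures, 10 h of repair.
    print("MTBF  :", mtbf(1000.0, 4), "h")
    print("lambda:", failure_rate(1000.0, 4), "failures/h")
    print("MTTR  :", mttr(10.0, 4), "h")
```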
Intuition on Availability
Availability = uptime / total time = MTBF / (MTBF + MTTR)
You analyze the long outages and determine that you can
improve the error handling process for these outages so
that the recovery is simplified and the outage duration will
be reduced to 15 minutes.
■ What would be the impact of the improvement on
availability?
■ Over a 100-hour period, the overall outage duration is reduced
from 2.45 hours to 1 hour. Availability would increase from
97.55% to 99.00%.
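The figures in this example follow directly from availability = uptime / total time. A small sketch that reproduces them is shown below; the function names are illustrative, and only the 100-hour period and the 2.45-hour and 1-hour outage totals come from the slide.

```python
def availability_from_downtime(total_hours: float, downtime_hours: float) -> float:
    """Availability = uptime / total time."""
    return (total_hours - downtime_hours) / total_hours


def availability_from_mtbf(mtbf: float, mttr: float) -> float:
    """Equivalent steady-state form: MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    before = availability_from_downtime(100.0, 2.45)
    after = availability_from_downtime(100.0, 1.0)
    print(f"before improvement: {before:.2%}")
    print(f"after improvement : {after:.2%}")
```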
Outage Scope
■ To improve system availability, you want to
reduce both the scope and severity of the
outages
■ A key fault-tolerant design concept is
“degraded mode”
f(t) = λ e^{-λt}
λ – failure rate
λ = 1/MTBF
■ Uniform distribution
■ Exponential distribution
Examples
The probability of failure between day 3 and day 4 for a
uniform distribution with c = 1/5 = 0.2 is 0.2(4-3) = 0.2 or
20%.
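The uniform example can be checked numerically and compared with the exponential case. This is a sketch under stated assumptions: the uniform density is c = 1/lifetime on [0, lifetime], the exponential probability uses the standard result P(a < T < b) = e^{-λa} - e^{-λb}, and the rate λ = 0.2 per day in the demo is a hypothetical value picked for comparison.

```python
import math


def uniform_interval_probability(a: float, b: float, lifetime: float) -> float:
    """P(a < T < b) for a failure time uniform on [0, lifetime], density c = 1/lifetime."""
    if not 0.0 <= a <= b <= lifetime:
        raise ValueError("require 0 <= a <= b <= lifetime")
    c = 1.0 / lifetime
    return c * (b - a)


def exponential_interval_probability(a: float, b: float, rate: float) -> float:
    """P(a < T < b) for an exponential failure time with failure rate `rate`."""
    if not 0.0 <= a <= b:
        raise ValueError("require 0 <= a <= b")
    return math.exp(-rate * a) - math.exp(-rate * b)


if __name__ == "__main__":
    # Slide example: c = 1/5, failure between day 3 and day 4.
    print("uniform    :", uniform_interval_probability(3.0, 4.0, 5.0))
    # Hypothetical rate of 0.2 failures/day for the exponential case.
    print("exponential:", exponential_interval_probability(3.0, 4.0, 0.2))
```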
μ – repair rate
μ = 1/MTTR
p_up + p_down = 1
λ · p_up = μ · p_down ⇒ p_up = μ / (μ + λ)
A = p_up = (1/MTTR) / (1/MTBF + 1/MTTR) = MTBF / (MTBF + MTTR)
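The balance equations can be verified numerically. The sketch below solves the two-state model for given MTBF and MTTR and checks that the result satisfies both equations; the MTBF and MTTR values in the demo are hypothetical.

```python
def steady_state(lam: float, mu: float) -> tuple[float, float]:
    """Return (p_up, p_down) for a two-state up/down model.

    lam is the failure rate (1/MTBF), mu is the repair rate (1/MTTR).
    """
    p_up = mu / (mu + lam)
    p_down = lam / (mu + lam)
    return p_up, p_down


if __name__ == "__main__":
    mtbf_hours, mttr_hours = 250.0, 2.5  # hypothetical values
    lam, mu = 1.0 / mtbf_hours, 1.0 / mttr_hours
    p_up, p_down = steady_state(lam, mu)
    print("p_up + p_down      =", p_up + p_down)
    print("lam*p_up, mu*p_down =", lam * p_up, mu * p_down)
    print("A = MTBF/(MTBF+MTTR) =", mtbf_hours / (mtbf_hours + mttr_hours), "vs", p_up)
```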
Analytical availability models
■ Combinatorial models
– Reliability Block Diagrams (RBD)
– Fault Trees
– Reliability Graphs
■ State-space models
– Markov Models
– Stochastic Petri Nets (SPN)
– Markov Reward Models (MRM)
Reliability Block Diagrams
■ Series configuration
■ Parallel configuration
Proof
K out of N
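For reference, the standard RBD formulas for these three configurations can be sketched as follows, assuming statistically independent components (and, for k-out-of-n, identical components). The function names and the sample availability of 0.9 are illustrative.

```python
from math import comb, prod


def series(availabilities: list[float]) -> float:
    """Series: the system is up only if every component is up."""
    return prod(availabilities)


def parallel(availabilities: list[float]) -> float:
    """Parallel: the system is down only if every component is down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def k_out_of_n(k: int, n: int, a: float) -> float:
    """At least k of n identical, independent components (availability a) are up."""
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    print("series(0.9, 0.9)  :", series([0.9, 0.9]))
    print("parallel(0.9, 0.9):", parallel([0.9, 0.9]))
    print("2-out-of-3 at 0.9 :", k_out_of_n(2, 3, 0.9))
```

Series and parallel are the two extremes of k-out-of-n: k = n gives the series result and k = 1 gives the parallel result.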
Example
Exponential distribution
MTTF
MTTR
Availability – Subsystem 1 (CPU,
MEM, CON)
Availability – Subsystem 2 (UPS,
UPS, CONTROL)
Availability – Subsystem 3 (Disks)
Adding all up
Fault Tree
■ Model focuses on the failure space
⇒ Instantaneous availability function
⇒ Steady-state availability function
⇒ Expected uptime in (0,t)
⇒ Instantaneous availability
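For the simplest case these measures have closed forms. The sketch below assumes a single repairable component modeled as a two-state Markov chain with constant failure rate λ and repair rate μ, starting in the up state at t = 0; the formulas on the original slides are not recoverable from the text, so this is an illustration of the listed measures, not a reproduction of them.

```python
import math


def instantaneous_availability(t: float, lam: float, mu: float) -> float:
    """A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) t), with A(0) = 1."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


def steady_state_availability(lam: float, mu: float) -> float:
    """Limit of A(t) as t grows: mu / (lam + mu)."""
    return mu / (lam + mu)


def expected_uptime(t: float, lam: float, mu: float) -> float:
    """Integral of A(x) over (0, t): expected time spent in the up state."""
    s = lam + mu
    return (mu / s) * t + (lam / s**2) * (1.0 - math.exp(-s * t))


if __name__ == "__main__":
    lam, mu = 0.1, 1.0  # hypothetical rates per hour
    for t in (0.0, 1.0, 5.0, 50.0):
        print(t, instantaneous_availability(t, lam, mu), expected_uptime(t, lam, mu))
    print("steady state:", steady_state_availability(lam, mu))
```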
2-Component Non-Repairable
System
■ Generator matrix Q
■ Solution
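The matrix and solution from the original slide are not recoverable from the text, so the following is only a sketch under explicit assumptions: two identical, independent components in parallel, each with constant failure rate λ, and states 0 = both up, 1 = one up, 2 = both failed (absorbing). It builds the generator matrix Q, integrates dP/dt = P·Q with a plain Runge-Kutta scheme, and compares the result with the closed-form state probabilities for this assumed model.

```python
import math


def generator_matrix(lam: float) -> list[list[float]]:
    """Generator Q for the assumed 3-state chain; each row sums to zero."""
    return [
        [-2.0 * lam, 2.0 * lam, 0.0],  # both up -> one up at rate 2*lam
        [0.0, -lam, lam],              # one up -> both failed at rate lam
        [0.0, 0.0, 0.0],               # both failed is absorbing (no repair)
    ]


def closed_form(t: float, lam: float) -> list[float]:
    """State probabilities at time t, starting with both components up."""
    p_both = math.exp(-2.0 * lam * t)
    p_one = 2.0 * math.exp(-lam * t) - 2.0 * math.exp(-2.0 * lam * t)
    return [p_both, p_one, 1.0 - p_both - p_one]


def vec_mat(p: list[float], q: list[list[float]]) -> list[float]:
    """Row vector times matrix."""
    n = len(p)
    return [sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]


def integrate(q: list[list[float]], p0: list[float], t_end: float, steps: int = 1000) -> list[float]:
    """Solve dP/dt = P Q from 0 to t_end with classical 4th-order Runge-Kutta."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = vec_mat(p, q)
        k2 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k1)], q)
        k3 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k2)], q)
        k4 = vec_mat([x + h * k for x, k in zip(p, k3)], q)
        p = [x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


if __name__ == "__main__":
    lam = 0.1  # hypothetical failure rate per hour
    numeric = integrate(generator_matrix(lam), [1.0, 0.0, 0.0], 5.0)
    print("numeric    :", numeric)
    print("closed form:", closed_form(5.0, lam))
    print("reliability R(5) = P0 + P1 =", numeric[0] + numeric[1])
```

Under these assumptions the system reliability is R(t) = 2e^{-λt} - e^{-2λt}, the sum of the two non-failed state probabilities.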
2-component Repairable Sys.
■ Just 1 component is repairable
■ fault tolerance
■ save/restore parallelism
■ “safe libraries”
■ active fault detection
■ isolation and partitioning (for degraded modes)
■ increased boot speeds
■ concurrent release and fix upgrades
■ software rejuvenation
Case-studies
1. Alturkistani, F. M., & Alaboodi, S. S. (2017). An
Analytical Model for Availability Evaluation of
Cloud Service Provisioning System. International
Journal of Advanced Computer Science and
Applications, 8(6), 240-247.
2. Melo, C., Dantas, J., Oliveira, A., Oliveira, D., Fé,
I., Araujo, J., ... & Maciel, P. (2018). Availability
models for hyper-converged cloud computing
infrastructures. In Systems Conference
(SysCon), 2018 Annual IEEE International (pp.
1-7). IEEE.
[1]
■ Proposes an architecture-based analytical model for evaluating the
availability of cloud service provisioning systems, focusing on IaaS;
it relies on the National Institute of Standards and Technology
Cloud Computing Reference Architecture (NIST-CCRA)
■ availability is evaluated at two levels:
– the system-level
■ Reliability Block Diagrams (RBDs) are used to model the
system’s failures by considering series/parallel
arrangements of the cloud components/subsystems
– the component-level
■ Probabilistic model and operational measures.
■ Failure and repair data are modeled and analyzed using
probability distributions and statistical inferences
■ Operational measures are derived and used to estimate
component’s availability
NIST-CCRA service
orchestration model
IaaS Provisioning System (IPS) RBD
MTBF and MTTR for IPS components
IPS Availability
■ Hardware availability A1
Hyper-converged
(Controller Monitor1 ∨ Controller Monitor2 ∨ Controller Monitor3) ∧
(Compute OSD1 ∨ Compute OSD2 ∨ Compute OSD3).
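Read as an availability structure, the expression above says the system is up when at least one controller/monitor node and at least one compute/OSD node are up. A minimal sketch of how that could be evaluated is given below, assuming independent node failures; the node availability of 0.9 is a hypothetical placeholder, not a value from the case study.

```python
from math import prod


def any_of(availabilities: list[float]) -> float:
    """OR of independent components: up if at least one is up."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def all_of(availabilities: list[float]) -> float:
    """AND of independent components: up only if all are up."""
    return prod(availabilities)


def hyperconverged_availability(controller_monitors: list[float],
                                compute_osds: list[float]) -> float:
    """(CM1 or CM2 or CM3) and (OSD1 or OSD2 or OSD3)."""
    return all_of([any_of(controller_monitors), any_of(compute_osds)])


if __name__ == "__main__":
    nodes = [0.9, 0.9, 0.9]  # hypothetical per-node availability
    print("system availability:", hyperconverged_availability(nodes, nodes))
```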
Results
Wrap-up
■ Availability is a key metric for many systems
■ The main problem with the analytical methods presented so
far is the need to parameterize model elements. Even after
an arbitrarily complex formal model corresponding to the
process or service description has been structurally
developed (e.g., a combination of series and parallel
configurations in an RBD, or a set of states and transitions
in a Markov chain), parameters must still be assigned to its
elements.
■ Parameters such as failure and repair rates, as well as
adequate distributions, can in principle be derived
from historical data (past system behavior) or from
measurements.