
SOFTWARE RELIABILITY
Part 1
Reliability
• Correct service = the delivered service fulfills the system
function
• Incorrect service = the delivered service does not fulfill the system
function
• Failure = transition from correct to incorrect service
• To quantify:
• Reliability = continuity of correct service
• Time to failure
• Availability = readiness for correct service
• Frequency of failure
Reliability model
Is this suitable for software reliability?
• Hardware = real, physical component
• Software = intangible, informational component
• Reliability = continuity of correct service
• System = hardware + software, so the model must cover both
System reliability tasks
“Reliability assurance of
combined hardware and
software systems requires
implementation of a
thorough, integrated set of
reliability modeling, allocation,
prediction, estimation and test
tasks. These tasks allow on-
going evaluation of the
reliability of system, subsystem
and lower-tier designs. The
results of these analyses are
used to assess the relative
merit of competing design
alternatives, to evaluate the
reliability progress of the
design program, and to
measure the final, achieved
product reliability through
demonstration testing.
At each step in the design-
evaluate-design process, the
metrics used to predict product
reliability provide a mechanism
for a total quality management
system to provide ongoing
control and refinement
of the design process.”
Hw and sw
Reliability part
Reliability/
management
part
Example: missile guidance system
• On February 25, 1991, during the Gulf War, an American Patriot Missile battery
in Dharan, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The
Scud struck an American Army barracks and killed 28 soldiers.
• The cause: an inaccurate calculation of the time since boot due to computer
arithmetic errors.
• Specifically, the time in tenths of a second as measured by the system's internal clock was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24 bit fixed point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error. Indeed, the Patriot battery had been up around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds. (The number 1/10 equals 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + .... In other words, the binary expansion of 1/10 is 0.0001100110011001100110011001100.... The 24 bit register in the Patriot stored instead 0.00011001100110011001100, introducing an error of 0.0000000000000000000000011001100... binary, or about 0.000000095 decimal. Multiplying by the number of tenths of a second in 100 hours gives 0.000000095 × 100 × 60 × 60 × 10 = 0.34.) A Scud travels at about 1,676 meters per second, and so travels more than half a kilometer in this time. This was far enough that the incoming Scud was outside the "range gate" that the Patriot tracked. Ironically, the fact that the bad time calculation had been improved in some parts of the code, but not all, contributed to the problem, since it meant that the inaccuracies did not cancel.
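The chopping error can be reproduced with exact rational arithmetic. A minimal sketch, chopping 1/10 so that it matches the stored register value quoted above (23 bits after the radix point):

```python
import math
from fractions import Fraction

# Chop 1/10 as the Patriot's fixed-point register did (sketch; the stored
# value quoted above keeps 23 bits after the radix point).
BITS = 23
exact = Fraction(1, 10)
chopped = Fraction(math.floor(exact * 2**BITS), 2**BITS)  # truncated value
err = float(exact - chopped)          # per-tick chopping error, ~9.5e-8

ticks = 100 * 60 * 60 * 10            # tenths of a second in 100 hours uptime
drift = err * ticks                   # accumulated clock error, ~0.34 s
distance = drift * 1676               # metres a Scud covers in that time
```

At ~1,676 m/s the 0.34 s drift corresponds to well over half a kilometer of tracking error, as the text states.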
Pay attention:
Once the system is split into modules, it must be evaluated
both at each module and as a whole
System reliability and a bit of Math
• Reliability R(t) = conditional probability that a system
functions correctly during the time interval [t0,t], if at the
initial point t0, the system works properly.
• Example (two identical modules in parallel, and triple modular redundancy):
  • R_SS = R + (1-R)·R = 2R - R^2
  • R_TMR = R·R·R + (1-R)·R·R + R·(1-R)·R + R·R·(1-R) = 3R^2 - 2R^3
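The two formulas can be checked numerically. A minimal sketch (SS read as two identical modules in parallel, TMR with a perfect voter):

```python
def r_ss(r):
    # Two identical modules in parallel: R_SS = 2R - R^2
    return 2 * r - r**2

def r_tmr(r):
    # Triple modular redundancy with majority voting: R_TMR = 3R^2 - 2R^3
    return 3 * r**2 - 2 * r**3

# Redundancy only pays off when the individual module is already reliable:
good = r_tmr(0.9)   # 0.972, better than a single 0.9 module
bad = r_tmr(0.4)    # 0.352, worse than a single 0.4 module
```

This is exactly the crossover visible in the plot that follows: the TMR curve is above the R line only for module reliability above 0.5.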
System vs Module Reliability
[Plot: system reliability vs module reliability R, both from 0 to 1, comparing the TMR curve, the SS curve, and the single-module line R.]
The System, the Module and the Software
• software ≠ hardware
• in each mode of a software system's (CSCI) operation, different software modules (CSCs) will be executing
• each mode will have a unique time of operation associated with it
• A model should be developed for the software portion of a system to illustrate the modules which will be operating during each system mode, and to indicate the duration of each system mode.
Software Reliability Models
• Over 200 models
• since the early 1970s
• how to quantify software reliability still remains largely unsolved
• No single model that can be used in all situations
• No model is complete or even representative.
• One model may work well for a certain class of software, but may be
completely off track for other kinds of problems.
• Most software models contain:
• Assumptions
• Factors
• A mathematical function that relates the reliability with the factors.
The mathematical function is usually higher order exponential or
logarithmic
Software Reliability Models Comparison – 1/2
• How they work (both kinds): observe and accumulate failure data, then analyze it with statistical inference
• Prediction models
  • What data? Historical data
  • When? Usually made prior to the development or test phases; can be used as early as the concept phase
• Estimation models
  • What data? Data from the current software development effort
  • When? Usually made later in the life cycle
Software Reliability Models Comparison – 2/2
• Prediction models
  • Pros: software reliability can be predicted early in the development phase; enhancements can be initiated to improve the reliability
  • Cons: an "educated guess"; with no or little historical data the predictions may have substantial errors
  • Examples: Musa's Execution Time Model, Putnam's Model, and Rome Laboratory models TR-92-51 and TR-92-15
• Estimation models
  • Pros: more accurate values
  • Cons: can only be used after some data have been collected; enhancements may be difficult to implement
  • Examples: fault count/fault rate estimation models (exponential distribution, Weibull distribution); Bayesian fault rate estimation models (Thompson and Chelson's model)
Software Reliability Prediction
• Process: Collect data and metrics → Use model → Predict initial failure rate (λ0) → Predict growth parameters → Estimate time and resources
• Inputs: historical data, current data, fault content
Many, many metrics
• Product metrics
• Software size
• Lines Of Code (LOC), or LOC in thousands (KLOC)
  • source code is used (SLOC, KSLOC)
  • comments and other non-executable statements are not counted
  • cannot faithfully compare software not written in the same language
• Function point metric
  • a count of inputs, outputs, master files, inquiries, and interfaces
  • measures the functional complexity of the program
  • is independent of the programming language
  • used primarily for business systems; not proven in scientific or real-time applications
• Complexity-oriented metrics
• simplify the code into a graphical representation
• McCabe's Complexity Metric.
• Test coverage metrics
• software reliability is a function of the portion of software that has been successfully verified or tested.
• Project management metrics
• Cost
• Process metrics
• estimate, monitor and improve the reliability and quality of software, e.g.: ISO-9000
• Fault and failure metrics
• Mean Time Between Failures (MTBF) MTBF=MTTF+MTTR
• Mean Time To Failure (MTTF)
• Mean time to repair (MTTR)
• number of faults found during testing
• failures (or other problems) reported by users after delivery are collected
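The three fault/failure metrics above are tied together by MTBF = MTTF + MTTR. A small sketch over a hypothetical failure log:

```python
# Hypothetical failure log: (hours when a failure occurred, hours when its
# repair finished). All values are invented for illustration.
log = [(100.0, 102.0), (250.0, 251.0), (400.0, 403.0)]

uptimes, prev_up = [], 0.0
for failed, repaired in log:
    uptimes.append(failed - prev_up)   # operating time before this failure
    prev_up = repaired

mttf = sum(uptimes) / len(uptimes)             # mean time to failure
mttr = sum(r - f for f, r in log) / len(log)   # mean time to repair
mtbf = mttf + mttr                             # MTBF = MTTF + MTTR
```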
Different models, different metrics
Using those metrics – MTTF
• MTTF = Mean time to failure
• For a constant failure rate λ: MTTF = 1/λ
• Distribution of failures?
  • Exponential
  • Weibull
  • Logarithmic
  • Etc.
Example (Xing)
• A module with a constant failure rate λ will survive 200 hours without failure with probability 0.97
• MTTF?
• What is the probability that it survives 1000 hours?
• Solution:
  • R(t) = exp(-λt); with t = 200, R(200) = 0.97
    => -λ·200 = ln(0.97)
    => λ = -0.030459 / -200 = 1.523×10^-4 per hour
  • MTTF = 1/λ = 6566.16 hours
  • R(1000) = exp(-λ·1000) = 0.858
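The solution can be verified in a few lines:

```python
import math

R_200 = 0.97                       # survival probability over 200 hours
lam = -math.log(R_200) / 200       # constant failure rate, per hour
mttf = 1 / lam                     # mean time to failure, ~6566 hours
R_1000 = math.exp(-lam * 1000)     # survival probability over 1000 hours
```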
This helps, why?
• Process (as above): Collect data and metrics → Use model → Predict initial failure rate (λ0) → Predict growth parameters → Estimate time and resources
• Find out R for each module
• Compute system R
  • Composition of modules
  • Predict the system's reliability behavior
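Composing module reliabilities into a system figure can be sketched as follows. The failure rates and module names are hypothetical, and a series system is assumed (every module must work); parallel or TMR compositions would use the formulas given earlier instead:

```python
import math

# Hypothetical per-module constant failure rates, per hour (invented values):
rates = {"guidance": 2e-5, "comms": 5e-5, "display": 1e-4}

def system_R(t):
    # Series composition: all modules must work, so reliabilities multiply.
    r = 1.0
    for lam in rates.values():
        r *= math.exp(-lam * t)
    return r

r_mission = system_R(1000)   # equals exp(-(sum of rates) * t)
```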
Side-note:
Good math skills are golden
Money matters
Ariane 5
• On June 4, 1996 an unmanned Ariane 5 rocket launched
by the European Space Agency exploded just forty
seconds after lift-off. The rocket was on its first voyage,
after a decade of development costing $7 billion. The
destroyed rocket and its cargo were valued at $500
million.
• It turned out that the cause of the failure was a software error in the
inertial reference system. Specifically a 64 bit floating point number
relating to the horizontal velocity of the rocket with respect to the
platform was converted to a 16 bit signed integer. The number was
larger than 32,767, the largest integer storable in a 16 bit signed
integer, and thus the conversion failed.
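The failure mode is easy to illustrate. In Ada the failed conversion raised an Operand Error exception; the sketch below instead shows C-style wraparound, with a hypothetical out-of-range value, simply to demonstrate that such a number cannot be represented in 16 signed bits:

```python
import ctypes

# 40000 is a hypothetical stand-in for the out-of-range horizontal-velocity
# value; anything above 32,767 cannot be represented as a signed 16-bit int.
value = 40000
as_int16 = ctypes.c_int16(value).value   # silently wraps around
```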
Software Failure Reporting and Corrective
Action System (FRACAS).
List of Known Fault Types
Orthogonal Defect Classification
• Industry-used method
• Collected data is regularly analyzed to determine:
  • the maturity of the software associated with defect types,
  • the most common failure modes, so that they can be eliminated, and
  • the causality of problems and associated methods to reduce defects
Software is good because:
Musa model
Step 1: System reliability requirements
• the number of inherent faults at the beginning of system test
• faults in the code, as opposed to failures, which are the resulting
incorrect outputs of executed faults
• estimate for inherent fault density can be based on KSLOC,
Function Points
• defect removal efficiency = proportion of faults removed
from code to faults removed plus new faults introduced.
• rate of failure identification and rate of failure correction
• highly dependent on the sophistication of both the software process
and software development personnel utilized on a program.
Musa model
Step 2: Black box approach
• the predictions derived using methods as described
above become the basis for initial allocations of a
software reliability requirement to the system operational
modes or functions.
• software requirement is not allocated to modules (CSCs)
• highly dependent on the environment (stimuli)
• many of the software failures that occur result from interface problems
between software modules and timing problems somewhere in the
system
• system-mode profile, functional profile, or operational
profile
Musa model
Step 3: allocate the failure rate goals to software CSCI
Musa model
Step 4: Equal apportionment applied to sequential software CSCIs
• used only when λG, the failure rate goal of the software aggregate, and N, the number of software CSCIs in the aggregate, are known. The aggregate's failure rate goal is either specified in the requirements or is the result of an allocation performed at a higher level in the system hierarchy.
• Steps
1. Determine λG, the failure rate goal for the software aggregate
2. Determine N, the number of software CSCIs in the aggregate
3. For each software CSCI, assign the failure rate goal
   λiG = λG failures per hour, i = 1, 2, ..., N
Musa model
Step 5: Equal apportionment applied to concurrent software CSCIs
• used only when λG, the failure rate goal of the software aggregate, and N, the number of software CSCIs in the aggregate, are known
• Steps
1. Determine λG, the failure rate goal for the software aggregate
2. Determine N, the number of software CSCIs in the aggregate
3. For each software CSCI, assign the failure rate goal
   λiG = λG/N failures per hour, i = 1, 2, ..., N
Musa model
Step 6: Optimized allocation based on system-mode profile
• This procedure assumes that all system modes are independent, i.e., no two system modes operate at the same time.
• Steps
1. Determine λG, the failure rate goal for the software aggregate
2. Determine N, the number of system modes in the aggregate
3. For each system mode, determine the proportion of time that the system will be operating in that mode; in other words, assign occurrence probabilities to the system modes. The failure rate relationship of the software system can then be established by λG = Σ pi·λiG, where pi is the occurrence probability of the i-th system mode. There are potentially an infinite number of combinations of these values that will meet the λG requirement.
4. For each system mode, identify the software modules (CSCs or, in some cases, CSCIs) that are utilized during the system mode operation. Add the estimated source lines of code (SLOC) for each module. Total the SLOC count for each system mode.
5. Use the SLOC count for each system mode as an input to the Musa model. This is the source instruction parameter (IS).
6. Supply values for each of the following parameters to the Musa model:
• B: fault reduction efficiency factor (between 0 and 1)
• r: instruction execution rate (e.g., 25 MIPS)
• Qx : code expansion ratio (# of object instructions per SLOC)
• mi : failure identification effort per failure (man-hours)
• mf: failure correction effort per failure (man-hours)
7. Select a target failure rate for each system mode such that the combination results in the aggregate system requirement being met. Note that the value for expected failures experienced (m) will be calculated by the model. Plugging this into the equations for failure identification personnel (XI) and failure correction personnel (XF) provides the total test time (in man-hours) required to meet the target failure rate. (Failure identification includes time to set up and run tests and to identify whether a failure has occurred. Failure correction includes time to debug the software, find the underlying fault causing the failure, and make the software change that fixes the associated fault.)
8. Add the values for XF and XI for each of the system modes to arrive at the total system test time. Record this value.
9. Try N new values for the failure rates of the N system modes. Record the total test time.
10. Tweak the failure rate values up and down until an optimal value which minimizes system test time is achieved. These final failure rate values are the set to be allocated to the N system modes.
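Step 3's weighted-sum relationship can be checked with a short sketch. The mode profile below (occurrence probabilities and per-mode failure rate goals) is entirely hypothetical:

```python
# Hypothetical system-mode profile: (occurrence probability p_i,
# failure rate goal lam_iG for that mode), invented for illustration.
modes = [(0.6, 0.001), (0.3, 0.004), (0.1, 0.010)]

# Occurrence probabilities of mutually exclusive modes must sum to 1.
assert abs(sum(p for p, _ in modes) - 1.0) < 1e-9

# Aggregate failure rate goal: lam_G = sum over modes of p_i * lam_iG
lam_G = sum(p * lam for p, lam in modes)   # 0.0028 failures per hour here
```

Any combination of per-mode goals whose weighted sum stays at or below the aggregate requirement is admissible, which is why steps 9 and 10 search over them to minimize total test time.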