Part 2
Where are we now?
• Previous course
• System vs software reliability
• Model
• Module vs operation mode
• Software Reliability Prediction
• Metrics
• Software FRACAS
• Musa model
• This course
• Operational profile
• Human reliability
• SRE best practices
Factors Influencing Software Reliability
• A user’s perception of the reliability of a software
depends upon two categories of information.
• The number of faults present in the software.
• The ways users operate the system.
• This is known as the operational profile.

• The fault count in a system is influenced by the following.

• Size and complexity of code
• Characteristics of the development process used
• Education, experience, and training of development personnel
• Operational environment
Human error analysis and reliability
• explore difficulties of use early in design with the aim of
improving design
• hence comparable with other usability and walkthrough techniques
• assessing likelihood of human error of a developed
design as part of an assessment process
• hence comparable with other reliability assessment techniques
• “It must be shown by analysis, substantiated where
necessary by test, that as far as reasonably practicable all
design precautions have been taken to prevent human
errors in production, maintenance and operation causing
hazardous or catastrophic effect”
• Used extensively in the nuclear power industry
Different approaches
Engineering approach
•Quantitative ‘decomposition’
•Human treated as a “component”
•The mechanistic assumption: “The
human / mind as a fallible machine”
•The atomistic assumption: Human
performance can be adequately
described by considering individual
elements of the performance. Total
performance is an aggregate of the
individual performance elements

Cognitive approach
•models and theories of cognitive
functions which underlie human
•Cognitive psychology still immature
•Problem: human cognition is not
directly observable
Quantification techniques
• HEART: a human performance model-based technique
utilizing some standard probabilities
• A data-based method for assessing and reducing human error to
improve operational performance.
• J.C. Williams (1988) IEEE Fourth Conference on Human Factors and
Power Plants (pp.436-450)
• Based on long-term sizeable human reliability database; weighting
factors based on HF literature.
• Assumes human performance usually deteriorates when Error
Producing Conditions (EPCs) interact
• SLIM: a utility-based technique using team based
• THERP: earliest method
HEART generic categories
Generic task: nominal human unreliability (5th-95th percentile)
(A) Totally unfamiliar, performed at speed with no real idea of likely consequences: 0.55 (0.35-0.97)
(B) Shift or restore system to a new or original state on a single attempt without supervision or procedures: 0.26 (0.14-0.42)
(C) Complex task requiring high level of comprehension and skill: 0.16 (0.12-0.28)
(D) Fairly simple task performed rapidly or given scant attention: 0.09 (0.06-0.13)
(E) Routine, highly practiced, rapid task involving relatively low level of skill: 0.02 (0.007 -
(F) Restore or shift a system to original or new state following procedures, with some checking: 0.003 (0.0008 -
(G) Completely familiar, well-designed, highly practiced, routine task occurring several times per hour, performed to highest possible standards by highly motivated, highly-trained and experienced personnel, with time to correct potential error, but without the benefit of significant job aids: 0.0004 (0.00008 -
(H) Respond correctly to system command even when there is an augmented or automated supervisory system providing accurate interpretation of system state: 0.00002 (0.000006 -
Error producing Conditions (EPCs) (selection) Factor
Unfamiliarity with a situation which is potentially important but which only occurs infrequently or which is novel 17
A shortage of time available for error detection and correction 11
A low signal-noise ratio 10
A means of suppressing or over-riding information or features which is too easily accessible 9
No obvious means of reversing an unintended action 8
A need to unlearn a technique and apply one which requires the application of an opposing philosophy 6
The need to transfer specific knowledge from task to task without loss 5.5
Ambiguity in the required performance standards 5
A means of suppressing or over-riding information or features which is too easily accessible 4
A mismatch between perceived and real risk. 4
No clear, direct and timely confirmation of an intended action from the portion of the system over which control is 4
Operator inexperience (e.g., a newly qualified tradesman but not an expert) 3
A mismatch between the educational achievement level of an individual and the requirements of the task 2
Little opportunity to exercise mind and body outside the immediate confines of a job 1.8
Little or no intrinsic meaning in a task 1.4
High level emotional stress 1.3
Evidence of ill-health amongst operatives especially fever. 1.2
Low workforce morale 1.2
A poor or hostile environment 1.15
Prolonged inactivity or highly repetitious cycling of low mental workload tasks (1st half hour) 1.1
(thereafter) 1.05
Disruption of normal work sleep cycles 1.1
Task pacing caused by the intervention of others 1.06
Additional team members over and above those necessary to perform task normally and satisfactorily (per additional team member) 1.03
How does it all come together?
• Find out task level:
• (E) Routine, highly practiced, rapid task involving relatively low level of skill
• R=1-0.02
• Are there any EPCs?
• A mismatch between perceived and real risk. E1=4
• Little or no intrinsic meaning in a task E2=1.4
• Additional team members (per member) E3=1.03*3
• Assess proportion of EPC ( ≠ 1)
• P1= 0.5, P2=0.2, P3=0.5
• Assess effect = ((E-1)*P)+1
• F1=2.5, F2=1.08, F3=2.045
• Assessed probability of failure
• 0.02*2.5*1.08*2.045=0.11043
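The calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and capping the result at 1 is an assumption added here so the value stays a valid probability.

```python
from math import prod


def assessed_effect(epc_factor: float, proportion: float) -> float:
    """Assessed effect of one EPC, as on the slide: ((E - 1) * P) + 1."""
    return (epc_factor - 1.0) * proportion + 1.0


def heart_failure_probability(nominal: float, epcs: list[tuple[float, float]]) -> float:
    """Nominal unreliability multiplied by the assessed effect of every EPC.

    The cap at 1.0 is an assumption of this sketch, not part of the slide.
    """
    effects = [assessed_effect(factor, proportion) for factor, proportion in epcs]
    return min(1.0, nominal * prod(effects))


# Generic task (E): nominal human unreliability 0.02.
# Each EPC is (factor E, assessed proportion P), taken from the example above.
epcs = [
    (4.0, 0.5),       # mismatch between perceived and real risk
    (1.4, 0.2),       # little or no intrinsic meaning in a task
    (1.03 * 3, 0.5),  # additional team members, as written on the slide
]

p_fail = heart_failure_probability(0.02, epcs)
print(round(p_fail, 5))  # approximately 0.11043, matching the slide
```

The assessed effects come out as 2.5, 1.08 and 2.045, so the nominal 2% error probability grows to roughly 11% once the error producing conditions are taken into account.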
Human + software reliability
How do they interact?
The operational profile
• A software-based product’s reliability depends on just how
a customer will use it.
• Making a good reliability estimate depends on testing the product
as if it were in the field.
• The operational profile
• quantitative characterization of how a system will be used
• Works also for hardware, human components
• Can be used for the whole system
Who develops an operational profile?
• Developed by:
• systems engineers
• high-level designers (architecture)
• test planners
• Strong participation by:
• product planning
• marketing professionals
• key customers, if available

• Developed by John Musa et al at AT&T to guide testing.

• An AT&T PBX switching system combined an operational profile with
other quality improvement techniques.
• Adopted by HP to re-organize system-test process for multi-processor
operating system.
• HP system-test process revision reduced system-test time by 50%
• First published results, 1987; active use since.
Overall creation process
• A progressively narrowing perspective from customers
down to operation
• At each step, quantify how often each of the elements in
that step will be used; convert to probabilities
• Process has been refined many times. Most AT&T
applications have been real-time telecommunications

• Profile = A set of disjoint alternatives and the probability that each will occur
• On the way to creating the operational profile, several intermediate
profiles will be produced
• The usage data is not a profile until you add the probability info
• Example:
• 100 X -type transactions an hour
• 500 Y-type transactions an hour
• 300 Z-type transactions an hour
• Interesting but not useful until you know total number of
transactions per hour so you can compute probabilities
• 100/2000 = .05; 500/2000 = .25; 300/2000 = .15
• Completeness check: do the probabilities add up to 1?
• .05 + .25 + .15 = .45 Missing some.
• (Note: Can use raw data to re-create appropriate traffic levels in
• How accurate does this (combination of data and
probabilities) need to be?
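As a small illustration of the transaction example above, the following Python sketch turns hourly counts into probabilities and runs the completeness check. The helper names are hypothetical; the total of 2000 transactions per hour is the figure used in the slide's arithmetic.

```python
def to_profile(counts: dict[str, float], total: float) -> dict[str, float]:
    """Convert usage counts into occurrence probabilities."""
    return {name: count / total for name, count in counts.items()}


def missing_probability(profile: dict[str, float]) -> float:
    """Completeness check: how much probability mass is unaccounted for."""
    return 1.0 - sum(profile.values())


counts = {"X": 100, "Y": 500, "Z": 300}   # transactions per hour
profile = to_profile(counts, total=2000)  # total transactions per hour
gap = missing_probability(profile)

print(profile)
print("unaccounted probability:", round(gap, 2))
```

A non-zero gap (here 0.55) signals that some transaction types are still missing from the list.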
Degree of Accuracy Required
• What is the economic gain expected from better decisions
resulting from more accurate data? (classic “risk
management” question!)
• In practice, often use “informed engineering judgment”
rather than formal economic analysis
• Emphasis on the word informed
• Infrequently executed functions of a highly critical nature
ARE important, e.g.,
• pilot ejection from cockpit
• overheating nuclear reactor shutdown procedures
• Must incorporate notion of criticality as well as use of
5 steps to create the operational profile
1. customer profile
2. user profile
3. system-mode profile
4. functional profile
5. operational profile
• (based on all of the above)
The O. P. Triangle
[Figure: triangle; recoverable labels: Customer groups, User groups]
Example: retail store market
1. customer profile
• Customer groups:
• large retail stores
• small chains
• grocery chains
2. user profile
• User groups:
• Cashiers
• marketing analysts
• I-S specialists
3. system-mode profile
• System-modes:
• I-S specialists do database cleanup and also report generation
4. functional profile
• Functions
• each mode has several functions (e.g., various reports in report generation)
• Note use of word function is from user perspective, i.e. user task
5. operational profile
• Operations:
• user functions are mapped onto the software product’s operations
Some steps may be unnecessary
Uniformity of detail is not required
Step 1: Customer profile
Ex: software spreadsheet
Customer                   Occurrence Probability
Educational Institution    0.45
Business Organization      0.35
Individual Home User       0.20
For instance, schools might use them for tabulating and updating student grades. Businesses might use them mainly for financial and operations controls. Home users could keep track of their monthly income and expenses, as well as investments and savings plans.
The customer profile is the list of customer types and the associated probabilities. These probabilities are simply the proportions of time that each type of customer would be using the system.
Step 2: User profile
Customer / User            Occurrence Probability
Educational Institution    0.45
  Secretary                50%
  Managers                 30%
  Teachers                 20%
Business Organization      0.35
  Secretary                40%
  Managers                 60%
Individual Home User       0.20
  Individuals              100%
The user profile is the set of user types and their associated probabilities of using the system.
Within a customer group: use the proportion of customer group’s usage that the user group represents.
If can’t determine usage, use the number of users as proportion of the total users in that group.
Combine same user groups found in different customer groups.
Step 2: User profile (combined across customer groups)
User                 Occurrence Probability
Secretary            0.5*0.45+0.4*0.35=0.365
Managers             0.3*0.45+0.6*0.35=0.345
Teachers             0.2*0.45=0.09
Other individuals    0.2
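The combination of user groups across customer groups can be sketched in Python as follows. The dictionary layout and the function name are illustrative assumptions; the numbers are the ones from the spreadsheet example.

```python
from collections import defaultdict

# Step 1: customer profile
customer_profile = {
    "Educational Institution": 0.45,
    "Business Organization": 0.35,
    "Individual Home User": 0.20,
}

# Share of each customer group's usage attributed to each user group
users_within_customer = {
    "Educational Institution": {"Secretary": 0.5, "Managers": 0.3, "Teachers": 0.2},
    "Business Organization": {"Secretary": 0.4, "Managers": 0.6},
    "Individual Home User": {"Other individuals": 1.0},
}


def combine_user_profile(customers, users_by_customer):
    """Weight each user group by its customer group and merge identical user groups."""
    user_profile = defaultdict(float)
    for customer, p_customer in customers.items():
        for user, share in users_by_customer[customer].items():
            user_profile[user] += p_customer * share
    return dict(user_profile)


user_profile = combine_user_profile(customer_profile, users_within_customer)
for user, probability in user_profile.items():
    print(f"{user}: {probability:.3f}")
```

Because every customer group's shares sum to 1 and the customer probabilities sum to 1, the resulting user profile also sums to 1.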
Step 3: System Mode Profile
System mode              Occurrence Probability
Batch Mode               0.35
User-Interactive Mode    0.65
A system mode is a way that a system can operate. The system includes both hardware and software. Most systems have more than one mode of operation. For example, system testing may take place in batch mode or user-interactive mode. An airplane flight consists of takeoff and ascent mode, level flight mode and descent and land mode. An automobile may be in normal mode or four-wheel drive; it may also be in normal mode or cruise control.
System modes can be thought of as independent segments of a system operation or various different ways of using a system. A system can switch among modes sequentially, or it can permit several modes to operate concurrently, sharing the same system resources. For each system mode, if there are more than one or two, an operational profile (and sometimes functional profile) should be developed. There are no technical limits on how many system modes may be
Short recap - Operational Profile Development
• Musa, J.D., “Operational Profiles in Software Reliability
Engineering,” IEEE Software Magazine, March 1993
Functional profile – 1/2
• After a good system mode profile has been developed, the
focus should turn to evaluation of each system mode for the
functions performed during that mode, and then assigning
probabilities to each of the functions.
• Functions
• are essentially tasks that an external entity such as a user can perform
with the system.
• for example, a user of an e-mail system would want the following functions: create message, look up address, send message, open message
• are based on what activities the customer wants the system to be able
to perform.
• Developing a functional profile is, in that respect, a part of developing
• A functional profile need not have a defined number of
functions, but generally contains 20 to more than a hundred.
The number will vary based on project size, number of system
modes, environmental considerations, and function breadth.
Functional profile – 2/2
• The functional profile can be either explicit or implicit,
depending on the key input variables
• A key input variable is an external parameter which affects the
execution path a software system traverses based on the
different values the parameter takes on.
• consist of ranges of variables that cause different operations to be
• These various ranges are referred to as levels.
• A profile is explicit if each element is designated by
simultaneously specifying the levels of all key input variables
needed for its identification.
• A profile is implicit if it is expressed by subprofiles of each key input variable.
• That is, each key environmental parameter is assigned probabilities
associated with the ranges it can legally use.
Implicit Profile
Subprofile C
Key input variable value    Occurrence probability
X1                          0.6
X2                          0.3
X3                          0.1
Subprofile D
Key input variable value    Occurrence probability
Y1                          0.7
Y2                          0.2
Y3                          0.1
Explicit Profile
Key input variable value    Occurrence probability
X1Y1                        0.42
X2Y1                        0.21
X1Y2                        0.12
X3Y1                        0.07
X1Y3                        0.06
X2Y2                        0.06
X2Y3                        0.03
X3Y2                        0.02
X3Y3                        0.01
Example
Suppose there are two key independent parameters, X and Y, each taking on three discrete values. Nine operations can be defined based on the combinations of the variables.
The main advantage of using the implicit profile is that a significantly smaller number of elements need to be specified, as few as the sum of the number of levels of key input variables. The explicit profile can have as many as the product of the number of levels for each variable. For five variables with five levels, assuming complete independence, the implicit profile requires only 25 elements whereas the explicit profile would call for 5^5, or 3,125 elements.
In most cases it is not necessary to generate the explicit profile, because it exists by default from the implicit profile.
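To illustrate how the explicit profile follows from the implicit one, here is a minimal Python sketch. It assumes the key input variables are independent, as the example does; the function name `explicit_profile` is hypothetical.

```python
from itertools import product

# Implicit profile: one subprofile per key input variable
subprofile_x = {"X1": 0.6, "X2": 0.3, "X3": 0.1}
subprofile_y = {"Y1": 0.7, "Y2": 0.2, "Y3": 0.1}


def explicit_profile(*subprofiles):
    """Expand independent subprofiles into the explicit profile.

    Each element is one combination of levels; its probability is the
    product of the level probabilities (valid only under independence).
    """
    explicit = {}
    for combo in product(*(sp.items() for sp in subprofiles)):
        name = "".join(level for level, _ in combo)
        probability = 1.0
        for _, level_probability in combo:
            probability *= level_probability
        explicit[name] = probability
    return explicit


explicit = explicit_profile(subprofile_x, subprofile_y)
for name, probability in sorted(explicit.items(), key=lambda kv: -kv[1]):
    print(name, round(probability, 2))

# Element counts for five variables with five levels each
implicit_size = 5 * 5   # sum of the numbers of levels
explicit_size = 5 ** 5  # product of the numbers of levels
print(implicit_size, explicit_size)
```

The implicit form stores 6 numbers for this example and the explicit form 9; the gap widens quickly as variables are added.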
How to develop a function list
• Construct work-flow chart showing overall process, including
software, hardware, and people.
• The work flow shows the context and suggests necessary
• Usually done during requirements phase

• Basic requirements definition: ensure that almost all important input values (commands, their variables, global data) and environment variables are covered by the defined functions.
• Function differentiation is independent of that.
• The more refined the differentiation, the more detailed profile
you obtain.

• Wait a second….
It’s not that simple…
1. Generate an initial function list
• features and capabilities needed by the users
• organized by functions relevant to each key input variable if an implicit
profile is used
2. Determine environmental variables
• environmental variables characterize the conditions that influence the paths
traversed by a program, but do not correspond directly to features
• Ex: hardware configuration and traffic load
3. Create final function list
• environmental and feature variables should be examined for dependencies
• Partial dependencies can cause difficulties because all possible
combinations of levels of both variables may need to be listed
• The final number of functions in the list is then calculated as the product of
the number of functions in the initial list and the number of environmental
variable levels, minus the combinations of initial functions and
environmental variable values that do not occur.
4. Assign occurrence probabilities
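The size rule in step 3 can be sketched as follows. The function names are those of the statistical-package example used on the following slides; the excluded combination is purely hypothetical and only shows how non-occurring pairs are subtracted.

```python
from itertools import product


def final_function_list(functions, env_levels, excluded=frozenset()):
    """All (function, environmental level) pairs except those that cannot occur."""
    return [
        (function, level)
        for function, level in product(functions, env_levels)
        if (function, level) not in excluded
    ]


functions = ["Standard Deviation", "Correlation", "Analysis of Variance", "Regression"]
env_levels = ["X", "Y"]

full_list = final_function_list(functions, env_levels)

# Hypothetical: suppose Regression never runs under environmental level Y
reduced_list = final_function_list(
    functions, env_levels, excluded={("Regression", "Y")}
)

print(len(full_list), len(reduced_list))
```

With 4 initial functions and 2 environmental levels the product is 8 entries; each combination that does not occur removes one.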
Sample final function list
Function Environmental Variable
Standard Deviation X
Correlation X
Analysis of Variance X
Regression X
Final function list
Function                System Mode Occurrence    Overall Occurrence Probability
Standard Deviation      0.60                      0.12
Correlation             0.22                      0.044
Analysis of Variance    0.10                      0.02
Regression              0.08                      0.016
Environmental Profile
Variable count    Occurrence Probability
One (X)           0.6
Multiple (Y)      0.4
For the assignment of occurrence probabilities, the ideal data source consists of usage measurements taken on the latest release or a similar system. These measurements may be obtained from system logs or data storage devices. Occurrence probabilities computed with the historical data should be updated to account for new functions, users, or environments.
In the event that a system is completely new the functional profile might be very inaccurate. It should still be developed, however, and updated as more is known about how the system will be operated. The process of predicting usage forces interaction with the customer, which can be very important. The required dialogue may highlight the relative importance of the various functions, indicating that some functions may not be necessary while others are most
Reducing the number of functions should increase reliability
Final Functional Profile Segment
All adds up (actually, it multiplies)
Function                Environmental Variable    System Mode Occurrence Probability
Standard Deviation      X                         0.072
                        Y                         0.048
Correlation             X                         0.0264
                        Y                         0.0176
Analysis of Variance    X                         0.012
                        Y                         0.008
Regression              X                         0.0096
                        Y                         0.0064
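The multiplication behind this table can be reproduced with a short Python sketch. It assumes that the environmental variable is independent of the function, which is what the slide's numbers imply; the variable names are illustrative.

```python
# Overall occurrence probability of each function (from the final function list)
overall = {
    "Standard Deviation": 0.12,
    "Correlation": 0.044,
    "Analysis of Variance": 0.02,
    "Regression": 0.016,
}

# Environmental profile: one variable (X) or multiple variables (Y)
environment = {"X": 0.6, "Y": 0.4}

# Each final entry is the function probability times the environmental
# probability (assumes the two are independent).
final_profile = {
    (function, level): p_function * p_level
    for function, p_function in overall.items()
    for level, p_level in environment.items()
}

for (function, level), probability in final_profile.items():
    print(f"{function:22s}{level}  {probability:.4f}")
```

The eight entries sum to the same total as the four overall probabilities, since the environmental profile only splits each function's share.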
Final step: operational profile
• The functional profile is a user-
oriented view of system capabilities.
From the developers’ perspective, it
is operations that actually
implement the functions.
• Operations are usually the focus of
• An operation represents a task
being accomplished by the system
from the viewpoint of the people
who will test the system. To allocate
testing effort and develop a test
description, the operational profile
must be available for the purposes
of test planning.
How to get to the operational profile
1. Divide execution into runs
2. Identify input space
3. Partition input space
4. Occurrence Probabilities
• Test selections
• according to their occurrence probabilities
• Prioritize development
• This was a much more radical concept in 1993.

• Attention! The operational profile may create an unrealistic set of tests because the list of operations is too long
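Selecting tests according to occurrence probabilities can be sketched with weighted random sampling. The operation names and probabilities below are hypothetical placeholders, and `select_tests` is an illustrative helper rather than part of any SRE tool.

```python
import random
from collections import Counter


def select_tests(profile: dict[str, float], n_tests: int, seed: int = 0) -> list[str]:
    """Draw operations to test, each with probability equal to its occurrence probability."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    operations = list(profile)
    weights = [profile[operation] for operation in operations]
    return rng.choices(operations, weights=weights, k=n_tests)


# Hypothetical operational profile
profile = {"op_A": 0.70, "op_B": 0.25, "op_C": 0.05}

selected = select_tests(profile, n_tests=1000)
counts = Counter(selected)
print(counts)
```

Frequently used operations receive most of the test runs, while rare ones are still exercised occasionally. Rare but critical operations would need extra weight on top of this.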
Let’s reduce the number of operations
1. Reduce the number of run types.
   1. Reduce the size of the input variable list
      1. Reduce functionality.
      2. Reduce the number of possible hardware configurations.
      3. Restrict the environment the program must operate in.
      4. Reduce the number of fault types.
      5. Reduce unnecessary interactions between successive runs
         1. Minimize the input variables that application programs can access at any one time.
         2. Reinitialize variables between runs.
         3. Use synchronous, as opposed to asynchronous, design.
   2. Reduce the number of levels of the input variables
2. Increase the number of run types grouped per operation.
3. Ignore the remaining set of run types expected to have total occurrence probability appreciably less than the failure intensity objective
• Financial and billing systems are commonly data driven.
• Suppose a cable television billing system was designed as an account processing
system. This system processes the charge entries for each account for the
current billing period and generates bills. The reliability to evaluate is the
probability of generating a correct bill. This involves determining the reliability over
the time required to process the bill and its entries.
• Assume that the design was not anticipated when the functional profile was
developed, so the relationship between the functional profile and operational
profile is complex. For instance, typical functions might have been bill processing,
bill correction, and delinquency identification.
• The account-processing system has an operational profile that relates to account
attributes. Its operations are classified by customer type (business or residential),
service type (basic, expanded basic, premium package), and payment status
(paid, delinquent).
• Assume that 90 percent of the customers are residential and 10 percent are
businesses. Forty percent of the customers subscribe to the basic cable service.
Half of all customers receive expanded basic, and the remaining 10 percent pay
for the full premium package. History shows that 2 percent of the accounts are
delinquent, on average.
Example: Operations and the associated probabilities
Operation                            Occurrence Probability
Residential, Expanded Basic, Paid    0.4410
Residential, Basic, Paid             0.3528
Residential, Premium, Paid           0.0882
Business, Expanded, Paid             0.0490
Business, Basic, Paid                0.0392
Business, Premium, Paid              0.0098
Residential, Expanded, Delinquent    0.0090
Residential, Basic, Delinquent       0.0072
Residential, Premium, Delinquent     0.0018
Business, Expanded, Delinquent       0.0010
Business, Basic, Delinquent          0.0008
Business, Premium, Delinquent        0.0002
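The table can be regenerated from the three account attributes with the Python sketch below. It assumes customer type, service type and payment status are independent, which is what the slide's figures imply; the variable names are illustrative.

```python
from itertools import product

customer_type = {"Residential": 0.90, "Business": 0.10}
service_type = {"Basic": 0.40, "Expanded Basic": 0.50, "Premium": 0.10}
payment_status = {"Paid": 0.98, "Delinquent": 0.02}

# One operation per combination of attributes; probabilities multiply
# because the attributes are assumed to be independent.
operational_profile = {
    (customer, service, status): p_customer * p_service * p_status
    for (customer, p_customer), (service, p_service), (status, p_status) in product(
        customer_type.items(), service_type.items(), payment_status.items()
    )
}

# List operations from most to least frequent, as in the table
for operation, probability in sorted(
    operational_profile.items(), key=lambda kv: kv[1], reverse=True
):
    print(", ".join(operation), f"{probability:.4f}")
```

Sorting by probability makes it easy to see that the six "Paid" operations account for 98% of all runs.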

Best practices
Design for reliability
• Functional and Non-functional Requirements
• System functional requirements may specify
• error checking
• recovery features
• system failure protection
• Non-functional requirements
• System reliability
• Hardware reliability
• probability a hardware component fails
• Software reliability
• probability a software component will produce an incorrect output
• software does not wear out
• software can continue to operate after a bad result
• Operator reliability
• probability system user makes an error
• Availability
• Functional Reliability Requirements
• The system will check all operator inputs to see that they fall within their required ranges.
• The system will check all disks for bad blocks each time it is booted.
• The system must be implemented using a standard implementation of Ada
• Non-functional Reliability Specification
• The required level of reliability must be expressed quantitatively.
• Reliability is a dynamic system attribute.
• Source code reliability specifications are meaningless (e.g. N faults/1000 LOC)
• An appropriate metric should be chosen to specify the overall system reliability
• Probability of Failure on Demand (POFOD)
• POFOD = 0.001
• On average, one in every 1000 service requests fails
• Rate of Occurrence of Failures (ROCOF)
• ROCOF = 0.02
• Two failures per 100 operational time units
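As a minimal sketch, both metrics can be estimated from observed counts. The function names are hypothetical; the sample figures are the ones quoted above.

```python
def pofod(failed_demands: int, total_demands: int) -> float:
    """Probability of failure on demand: failed requests / all requests."""
    return failed_demands / total_demands


def rocof(failures: int, operational_time_units: float) -> float:
    """Rate of occurrence of failures: failures per operational time unit."""
    return failures / operational_time_units


print(pofod(1, 1000))  # one failed request in 1000
print(rocof(2, 100))   # two failures in 100 operational time units
```

POFOD suits systems where service requests are intermittent, while ROCOF suits systems in continuous use. The time unit for ROCOF has to be stated with the figure.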
Building Reliability Specification
1. For each sub-system analyze consequences of possible system failures
2. From system failure analysis partition failure into appropriate classes
3. For each class set out the appropriate reliability metric
Failure Class                Example                                                Metric
Permanent, Non-corrupting    ATM fails to operate with any card, must restart to    ROCOF = .0001, Time unit = days
Transient, Non-corrupting    Magnetic stripe can't be read on undamaged card        POFOD = .0001, Time unit =
Specification validation
• It is impossible to empirically validate high reliability
• No database corruption really means POFOD class < 1 in
200 million
• If each transaction takes 1 second to verify, simulation of
one day’s transactions takes 3.5 days
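The arithmetic behind these bullets can be checked with a short sketch. The slide does not state the daily transaction volume; the 300,000 transactions per day used below is an assumption, chosen because it reproduces the "3.5 days" figure at 1 second per transaction.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def simulation_days(transactions: int, seconds_per_transaction: float = 1.0) -> float:
    """Wall-clock days needed to verify the given number of transactions one by one."""
    return transactions * seconds_per_transaction / SECONDS_PER_DAY


# ASSUMPTION: 300,000 transactions per day (not stated on the slide)
one_day_volume = 300_000
print(round(simulation_days(one_day_volume), 2))  # about 3.47 days

# Observing even one failure at a rate of 1 in 200 million takes far longer
print(round(simulation_days(200_000_000) / 365, 1))  # about 6.3 years
```

Under these assumptions, demonstrating a POFOD below 1 in 200 million by testing alone would take years of simulated transactions, which is why such requirements cannot be validated empirically.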
Statistical Testing
• The test data used needs to follow typical software usage
• Measuring numbers of errors needs to be based on errors
of omission (failing to do the right thing) and errors of
commission (doing the wrong thing)
• Uncertainty when creating the operational profile
• High cost of generating the operational profile
• Statistical uncertainty problems when high reliabilities are
Six steps to SRE
1. Quantify product usage by specifying how frequently customers
will use various features and how frequently various environmental
conditions that influence processing will occur.
2. Define quality quantitatively with your customers by defining
failures and failure severities and by specifying the balance among
the key quality objectives of reliability, delivery date, and cost to
maximize customer satisfaction.
3. Employ product usage data and quality objectives to guide design
and implementation of your product and to manage resources to
maximize productivity (i.e., customer satisfaction per unit cost).
4. Measure reliability of reused software and acquired software
components delivered to you by suppliers, as an acceptance
5. Track reliability during test and use this information to guide
product release.
6. Monitor reliability in field operation and use results to guide new
feature introduction, as well as product and process improvement.
Why user opinion matters
• 80% AT&T users – the most important quality attribute = RELIABILITY
• AT&T developed the operational profile idea
• SRE will help your project
• Satisfy customer needs more precisely.
• Having precise reliability requirements focuses development on meeting your
customers’ reliability needs. Reliability requirements enable system testers to
concretely verify that the finished product meets customers’ needs before it is
• Deliver earlier.
• Delivering the exact reliability needed by the customer avoids wasting time for
unneeded extra testing.
• Increase productivity.
• By using the functional and operational profiles to focus resources on the high-usage
functions or operations and by developing and testing for exactly the reliability needed,
productivity is improved.
• Plan project resources better.
• Before testing begins, SRE supports prediction of the amount of system test resources
needed, avoiding unnecessary waste and disruption due to unpleasant surprises.
SRE activities
Feasibility; Requirements and Development plan
•Determine functional profile
•Define and classify failures
•Identify customer reliability needs
•Conduct trade-off studies
•Set reliability objectives
Design and Implementation
•Allocate reliability among components
•Engineer to meet reliability objectives
•Focus resources based on functional profile
•Manage fault introduction and propagation
•Measure reliability of acquired software
System test and Field Trial
•Determine operational profile
•Conduct reliability growth testing
•Track testing progress
•Project additional testing needed
•Certify reliability objectives are met
Post Delivery; Operation and Maintenance
•Project post-release staff needs
•Monitor field reliability vs. objectives
•Track customer satisfaction with reliability
•Time new feature introduction by monitoring reliability
•Guide product and process improvement with reliability measures
Cost and release date trade-offs
People involved in SRE