Software Reliability

Carnegie Mellon University, 18-849b Dependable Embedded Systems, Spring 1999. Author: Jiantao Pan (jpan@cmu.edu)

Abstract:
Software Reliability is the probability of failure-free software operation for a specified period of time in a specified environment. It is an important factor affecting overall system reliability, and it differs from hardware reliability in that it reflects design perfection rather than manufacturing perfection. The high complexity of software is the major contributing factor to Software Reliability problems. Software Reliability is not a direct function of time, although researchers have come up with models relating the two. Software Reliability modeling has matured considerably, but before using a modeling technique we must carefully select the model that best suits our case. Measurement in software is still in its infancy: no good quantitative methods have been developed to represent Software Reliability without excessive limitations. Various approaches can be used to improve the reliability of software; however, it is hard to balance development time and budget against software reliability.

Contents:
- Introduction
- Key Concepts
- Definition
- Software failure mechanisms
- The bathtub curve for software reliability
- Available tools, techniques, and metrics
- Software reliability models
- Software reliability metrics
- Software reliability improvement techniques
- Relationship to other topics
- Conclusions
- Annotated Reference List & Further Reading

Introduction: Embedded Software -- Embedded Disasters?

With the advent of the computer age, computers, as well as the software running on them, are playing a vital role in our daily lives. We may not have noticed, but appliances such as washing machines, TVs, telephones, and watches are having their analog and mechanical parts replaced by CPUs and software. The computer industry is booming exponentially. With continuously lowering cost and improved control, processors and software-controlled systems offer compact design, flexible handling, rich features, and competitive cost. Like machinery replaced craftsmanship in the industrial revolution, computers and intelligent parts are quickly pushing their mechanical counterparts out of the market.

Unlike mechanical parts such as bolts and levers, or electronic parts such as transistors and capacitors, software does not age, rust, wear out, deform, or crack. There is no environmental constraint for software to operate as long as the hardware processor it runs on can operate. Furthermore, software has no shape, color, material, or mass; it can not be seen or touched, but it has a physical existence and is crucial to system functionality.

People used to believe that "software never breaks". Intuitively, software will stay "as is" unless there are problems in hardware that change the storage content or data path. Without being proven wrong, optimistic people would think that once the software runs correctly, it will be correct forever. A series of tragedies and chaos caused by software proves this to be wrong. These events will always have their place in history.

The tragedy of Therac 25 [Therac 25], a computer-controlled radiation-therapy machine, in the year 1986, caused by the software not being able to detect a race condition, alerts us that it is dangerous to abandon our old but well-understood mechanical safety controls and surrender our lives completely to software-controlled safety mechanisms.

Software can make decisions, but it can be just as unreliable as human beings. The British destroyer Sheffield was sunk because the radar system identified an incoming missile as "friendly". [Sheffield] The defense system has matured to the point that it will not mistake the rising moon for incoming missiles, but gas-field fires, descending space junk, etc. are other examples that can be misidentified as incoming missiles by the defense system. [Neumann95]

Software can also have small unnoticeable errors or drifts that can culminate into a disaster. On February 25, 1991, during the Gulf War, a chopping error that missed 0.000000095 second in precision in every 10th of a second, accumulating for 100 hours, made the Patriot missile fail to intercept a Scud missile. 28 lives were lost. [Patriot]

Fixing problems may not necessarily make the software more reliable. On the contrary, new serious problems may arise. In 1991, after changing three lines of code in a signaling program which contains millions of lines of code, the local telephone systems in California and along the Eastern seaboard came to a stop. [Telephone outage]

Once perfectly working software may also break if the running environment changes. After the success of the Ariane 4 rocket, the maiden flight of Ariane 5 ended up in flames when design defects in the control software were unveiled by the faster horizontal drifting speed of the new rocket. [Ariane 5]

There are many more scary stories to tell. This makes us wonder whether software is reliable at all, and whether we should use software in safety-critical embedded applications. You can hardly ruin your clothes if the embedded software in your washing machine issues erroneous commands, and there is a 50% chance you will be happy if the ATM machine miscalculates your money. But in airplanes, heart pacemakers, radiation therapy machines, etc., the reliability of software is simply a matter of life and death. With processors and software permeating the safety-critical embedded world, a software error can easily claim people's lives. Are we embedding potential disasters while we embed software into systems?

Key Concepts

Definition

According to ANSI, Software Reliability is defined as: the probability of failure-free software operation for a specified period of time in a specified environment. [ANSI91][Lyu95] Although Software Reliability is defined as a probabilistic function and comes with the notion of time, we must note that, different from traditional Hardware Reliability, Software Reliability is not a direct function of time. Electronic and mechanical parts may become "old" and wear out with time and usage, but software will not rust or wear out during its life cycle. Software will not change over time unless intentionally changed or upgraded.

Software Reliability is an important attribute of software quality, together with functionality, usability, performance, serviceability, capability, installability, maintainability, and documentation. Software Reliability is hard to achieve, because the complexity of software tends to be high. While any system with a high degree of complexity, including software, will be hard to bring to a certain level of reliability, system developers tend to push complexity into the software layer, with the rapid growth of system size and the ease of doing so by upgrading the software. For example, large next-generation aircraft will have over one million source lines of software on-board; next-generation air traffic control systems will contain between one and two million lines; the upcoming International Space Station will have over two million lines on-board and over ten million lines of ground support software; and several major life-critical defense systems will have over five million source lines of software. [Rook90] While the complexity of software is inversely related to software reliability, it is directly related to other important factors in software quality, especially functionality, capability, etc. Emphasizing these features will tend to add more complexity to software.
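To make the probabilistic definition concrete, the models discussed later usually write it in the classical reliability form sketched below. This is only an illustrative sketch: the constant failure intensity lambda is an assumption borrowed from hardware reliability theory, and, as noted above, treating software reliability as a simple function of time is exactly the step that must be applied with care.

```latex
% R(t): probability that the software operates failure-free over [0, t]
% in the specified environment.
R(t) \;=\; P(\text{no failure in } [0, t])
% Under an assumed constant failure intensity \lambda (failures per unit
% of execution time), this takes the familiar exponential form, with the
% mean time to failure as its integral:
R(t) \;=\; e^{-\lambda t}, \qquad
\mathrm{MTTF} \;=\; \int_0^{\infty} R(t)\, dt \;=\; \frac{1}{\lambda}
```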

Software failure mechanisms

Software failures may be due to errors, ambiguities, oversights or misinterpretation of the specification that the software is supposed to satisfy, carelessness or incompetence in writing code, inadequate testing, incorrect or unexpected usage of the software, or other unforeseen problems. [Keiller91] While it is tempting to draw an analogy between Software Reliability and Hardware Reliability, software and hardware have basic differences that make their failure mechanisms different. Hardware faults are mostly physical faults, while software faults are design faults, which are harder to visualize, classify, detect, and correct. [Lyu95] Design faults are closely related to fuzzy human factors and the design process, of which we do not have a solid understanding. In hardware, design faults may also exist, but physical faults usually dominate. In software, we can hardly find a strict counterpart to the hardware manufacturing process, if the simple action of uploading software modules into place does not count. Therefore, the quality of software will not change once it is uploaded into storage and starts running. Trying to achieve higher reliability by simply duplicating the same software modules will not work, because design faults can not be masked off by voting.

A partial list of the distinct characteristics of software compared to hardware is listed below [Keene94]:

- Failure cause: Software defects are mainly design defects.
- Wear-out: Software does not have an energy-related wear-out phase. Errors can occur without warning.
- Repairable system concept: Periodic restarts can help fix software problems.
- Time dependency and life cycle: Software reliability is not a function of operational time.
- Environmental factors: Do not affect software reliability, except that they might affect program inputs.
- Reliability prediction: Software reliability can not be predicted from any physical basis, since it depends completely on human factors in design.
- Redundancy: Can not improve software reliability if identical software components are used.
- Interfaces: Software interfaces are purely conceptual rather than visual.
- Failure rate motivators: Usually not predictable from analyses of separate statements.
- Built with standard components: Well-understood and extensively tested standard parts help improve maintainability and reliability. But in the software industry, we have not observed this trend. Code reuse has been around for some time, but to a very limited extent. Strictly speaking, there are no standard parts for software, except some standardized logic structures.

The bathtub curve for Software Reliability

Over time, hardware exhibits the failure characteristics shown in Figure 1, known as the bathtub curve. Periods A, B and C stand for the burn-in phase, the useful life phase and the end-of-life phase. A detailed discussion about the curve can be found in the topic Traditional Reliability.

Figure 1. Bathtub curve for hardware reliability

Software reliability, however, does not show the same characteristics as hardware. [RAC96] There are two major differences between the hardware and software curves. One difference is that in the last phase, software does not have an increasing failure rate as hardware does. In this phase, software is approaching obsolescence; there is no motivation for any upgrades or changes to the software, so the failure rate will not change. The second difference is that in the useful-life phase, software will experience a drastic increase in failure rate each time an upgrade is made. The failure rate levels off gradually, partly because of the defects found and fixed after the upgrades. A possible curve is shown in Figure 2 if we project software reliability on the same axes.

Figure 2. Revised bathtub curve for software reliability

The upgrades in Figure 2 imply feature upgrades, not upgrades for reliability. For feature upgrades, the complexity of software is likely to increase, since the functionality of the software is enhanced. Even bug fixes may be a reason for more software failures, if the bug fix induces other defects into the software. For reliability upgrades, it is possible to incur a drop in software failure rate, if the goal of the upgrade is enhancing software reliability, such as a redesign or reimplementation of some modules using better engineering approaches, such as the clean-room method. A proof can be found in the results from the Ballista project, robustness testing of off-the-shelf software components. Figure 3 shows the testing results of fifteen POSIX compliant operating systems. From the graph we see that for QNX and HP-UX, the robustness failure rate increases after the upgrade. But for SunOS, IRIX and Digital UNIX, the robustness failure rate drops when the version numbers go up. Since software robustness is one aspect of software reliability, this result indicates that the upgrades of those systems shown in Figure 3 should have incorporated reliability upgrades.
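The robustness testing mentioned above can be pictured with a toy harness: drive a component with a catalog of exceptional parameter values and count the calls that escape with an unexpected exception. This is only an illustrative sketch in the spirit of such testing; the value catalog, the two-way classification, and the fragile_divide example are invented for demonstration and are far simpler than the actual Ballista methodology.

```python
import itertools

# A small catalog of exceptional values: each parameter is driven with
# values a caller could plausibly (if wrongly) pass in.
EXCEPTIONAL_VALUES = [None, 0, -1, "", "not-a-number", float("nan"), [], 2**31]

def robustness_test(func, arity):
    """Call `func` with every combination of exceptional values and count
    the calls where an unexpected exception escapes. A clean ValueError or
    TypeError is treated as graceful rejection; anything else is counted
    as a robustness failure."""
    failures = 0
    total = 0
    for args in itertools.product(EXCEPTIONAL_VALUES, repeat=arity):
        total += 1
        try:
            func(*args)
        except (ValueError, TypeError):
            pass                       # graceful rejection counts as robust
        except Exception:
            failures += 1              # unexpected escape: robustness failure
    return failures, total

def fragile_divide(a, b):
    """Example component under test: does not guard against bad inputs."""
    return a / b

if __name__ == "__main__":
    failed, total = robustness_test(fragile_divide, arity=2)
    print(f"robustness failures: {failed}/{total} exceptional calls")
```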

Available tools, techniques, and metrics

Since Software Reliability is one of the most important aspects of software quality, Reliability Engineering approaches are practiced in the software field as well. Software Reliability Engineering (SRE) is the quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability. [IEEE95]

Software Reliability Models

A proliferation of software reliability models have emerged as people try to understand the characteristics of how and why software fails, and try to quantify software reliability. Over 200 models have been developed since the early 1970s, but how to quantify software reliability still remains largely unsolved. Interested readers may refer to [RAC96] and [Lyu95]. As many models as there are, and many more emerging, none of them can capture a satisfying amount of the complexity of software; constraints and assumptions have to be made for the quantifying process. Therefore, there is no single model that can be used in all situations. No model is complete or even representative. One model may work well for a certain set of software, but may be completely off track for other kinds of problems.

Most software models contain the following parts: assumptions, factors, and a mathematical function that relates reliability to the factors. The mathematical function is usually higher-order exponential or logarithmic.

Software modeling techniques can be divided into two subcategories: prediction modeling and estimation modeling. [RAC96] Both kinds of modeling techniques are based on observing and accumulating failure data and analyzing it with statistical inference. Using prediction models, software reliability can be predicted early in the development phase and enhancements can be initiated to improve the reliability. The major differences between the two kinds of models are shown in Table 1.

Table 1. Difference between software reliability prediction models and software reliability estimation models

ISSUE                          | PREDICTION MODELS                                                                            | ESTIMATION MODELS
DATA REFERENCE                 | Uses historical data                                                                         | Uses data from the current software development effort
WHEN USED IN DEVELOPMENT CYCLE | Usually made prior to development or test phases; can be used as early as the concept phase | Usually made later in the life cycle (after some data have been collected); not typically used in concept or development phases
TIME FRAME                     | Predict reliability at some future time                                                      | Estimate reliability at either the present or some future time

Representative prediction models include Musa's Execution Time Model, Putnam's Model, and Rome Laboratory models TR-92-51 and TR-92-15, etc. Representative estimation models include exponential distribution models, the Weibull distribution model, Thompson and Chelson's model, etc. Exponential models and the Weibull distribution model are usually named as classical fault count/fault rate estimation models, while Thompson and Chelson's model belongs to the Bayesian fault rate estimation models.
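As a concrete illustration of the estimation side of Table 1, the sketch below fits a simple exponential (Goel-Okumoto style) fault-count model, mu(t) = a(1 - e^(-bt)), to hypothetical cumulative failure data with a crude grid search, then uses it to estimate remaining faults and the probability of surviving a further test interval. The function names, the fitting method, and the sample data are assumptions made for this example, not part of any particular tool.

```python
import math

def mu(t, a, b):
    """Expected cumulative number of failures by time t under an
    exponential (Goel-Okumoto style) model: mu(t) = a * (1 - exp(-b*t))."""
    return a * (1.0 - math.exp(-b * t))

def fit_exponential_model(times, counts):
    """Crude least-squares fit of (a, b) by grid search.
    times  : cumulative execution time at each observation
    counts : cumulative failures observed by that time"""
    best = (None, None, float("inf"))
    max_count = max(counts)
    for a in [max_count * (1.0 + 0.05 * i) for i in range(1, 61)]:
        for b in [0.001 * j for j in range(1, 1001)]:
            err = sum((mu(t, a, b) - n) ** 2 for t, n in zip(times, counts))
            if err < best[2]:
                best = (a, b, err)
    return best[0], best[1]

def reliability(x, T, a, b):
    """Probability of failure-free operation for a further interval x,
    given testing up to time T: R(x|T) = exp(-(mu(T+x) - mu(T)))."""
    return math.exp(-(mu(T + x, a, b) - mu(T, a, b)))

if __name__ == "__main__":
    # Hypothetical failure data: cumulative failures observed at each
    # cumulative test-time checkpoint (hours).
    times  = [10, 20, 30, 40, 50, 60, 70, 80]
    counts = [12, 20, 26, 30, 33, 35, 36, 37]
    a, b = fit_exponential_model(times, counts)
    print(f"estimated total faults a = {a:.1f}, detection rate b = {b:.4f}")
    print(f"faults remaining ~ {a - mu(times[-1], a, b):.1f}")
    print(f"R(10h | 80h tested) ~ {reliability(10, times[-1], a, b):.3f}")
```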

The field has matured to the point that software models can be applied in practical situations and give meaningful results, and, second, there is no one model that is best in all situations. [Lyu95] Because of the complexity of software, any model has to make extra assumptions, and only limited factors can be put into consideration. Most software reliability models ignore the software development process and focus on the results -- the observed faults and/or failures. By doing so, complexity is reduced and abstraction is achieved; however, the models tend to specialize, applying only to a portion of the situations and a certain class of problems. We have to carefully choose the right model that suits our specific case. Furthermore, the modeling results can not be blindly believed and applied.

Software Reliability Metrics

Measurement is commonplace in other engineering fields, but not in software engineering. Though frustrating, the quest to quantify software reliability has never ceased. Until now, we still have no good way of measuring software reliability. Measuring software reliability remains a difficult problem because we don't have a good understanding of the nature of software. There is no clear definition of what aspects are related to software reliability, and we can not find a suitable way to measure most of them. Even the most obvious product metrics, such as software size, have no uniform definition. If we can not measure reliability directly, it is tempting to measure something related to reliability that reflects its characteristics. The current practices of software reliability measurement can be divided into four categories: [RAC96]

Product metrics

Software size is thought to be reflective of complexity. Lines Of Code (LOC), or LOC in thousands (KLOC), is an intuitive initial approach to measuring software size. But there is not a standard way of counting. Typically, source code is used (SLOC, KSLOC) and comments and other non-executable statements are not counted. This method can not faithfully compare software not written in the same language. The advent of code reuse and code generation techniques also casts doubt on this simple method.

The function point metric is a method of measuring the functionality of a proposed software development based upon a count of inputs, outputs, master files, inquiries, and interfaces. The method can be used to estimate the size of a software system as soon as these functions can be identified. It is a measure of the functional complexity of the program. It measures the functionality delivered to the user and is independent of the programming language. It is used primarily for business systems; it is not proven in scientific or real-time applications.

Complexity is directly related to software reliability, so representing complexity is important. Complexity-oriented metrics determine the complexity of a program's control structure by simplifying the code into a graphical representation. A representative metric is McCabe's Complexity Metric.
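A minimal sketch of the product metrics just described, assuming very naive counting rules (whole-line comments only, and a fixed keyword list standing in for the full control-flow graph behind McCabe's metric):

```python
import re

# Rough decision-point keywords; counting them approximates McCabe's
# cyclomatic complexity (decisions + 1) without building the control-flow graph.
DECISION_RE = re.compile(r"\b(if|elif|for|while|and|or|case|except)\b")

def product_metrics(source: str) -> dict:
    """Compute simple, language-naive product metrics for a piece of source
    code: physical lines, source lines (SLOC) excluding blanks and whole-line
    comments, and an approximate cyclomatic complexity."""
    lines = source.splitlines()
    sloc = [
        ln for ln in lines
        if ln.strip() and not ln.strip().startswith(("#", "//", "/*", "*"))
    ]
    decisions = sum(len(DECISION_RE.findall(ln)) for ln in sloc)
    return {
        "physical_lines": len(lines),
        "sloc": len(sloc),
        "approx_cyclomatic_complexity": decisions + 1,
    }

if __name__ == "__main__":
    sample = """\
# toy example
def classify(x):
    if x < 0 or x > 100:
        return "out of range"
    for bound in (10, 50, 90):
        if x < bound:
            return "below " + str(bound)
    return "high"
"""
    print(product_metrics(sample))
```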

Test coverage metrics are a way of estimating faults and reliability by performing tests on software products, based on the assumption that software reliability is a function of the portion of software that has been successfully verified or tested. A detailed discussion about various software testing methods can be found in the topic Software Testing.

Project management metrics

Researchers have realized that good management can result in better products. Research has demonstrated that a relationship exists between the development process and the ability to complete projects on time and within the desired quality objectives. Costs increase when developers use inadequate processes. Higher reliability can be achieved by using a better development process, risk management process, configuration management process, etc.

Process metrics

Based on the assumption that the quality of the product is a direct function of the process, process metrics can be used to estimate, monitor and improve the reliability and quality of software. ISO-9000 certification, or "quality management standards", is the generic reference for a family of standards developed by the International Standards Organization (ISO).

Fault and failure metrics

The goal of collecting fault and failure metrics is to be able to determine when the software is approaching failure-free execution. Minimally, both the number of faults found during testing (i.e., before delivery) and the failures (or other problems) reported by users after delivery are collected, summarized and analyzed to achieve this goal. Test strategy is highly related to the effectiveness of fault metrics, because if the testing scenario does not cover the full functionality of the software, the software may pass all tests and yet be prone to failure once delivered. Usually, failure metrics are based upon customer information regarding failures found after release of the software. The failure data collected is then used to calculate failure density, Mean Time Between Failures (MTBF) or other parameters to measure or predict software reliability.
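For illustration, the sketch below turns a hypothetical post-release failure log into the two fault/failure metrics named above, failure density and MTBF. The data and the function names are invented for this example.

```python
from statistics import mean

def failure_metrics(failure_times_hours, ksloc):
    """Toy fault/failure metrics from field data.
    failure_times_hours : cumulative operating time at each observed failure
    ksloc               : size of the delivered software in KLOC
    Returns failure count, failure density (failures per KLOC) and MTBF (hours)."""
    interarrival = [
        t2 - t1
        for t1, t2 in zip([0.0] + failure_times_hours[:-1], failure_times_hours)
    ]
    return {
        "failures": len(failure_times_hours),
        "failure_density_per_ksloc": len(failure_times_hours) / ksloc,
        "mtbf_hours": mean(interarrival),
    }

if __name__ == "__main__":
    # Hypothetical post-release failure log (cumulative operating hours).
    failures = [120.0, 310.0, 700.0, 1500.0, 2600.0]
    print(failure_metrics(failures, ksloc=85))
```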

Software Reliability Improvement Techniques

Good engineering methods can largely improve software reliability. Before the deployment of software products, testing, verification and validation are necessary steps. Software testing is heavily used to trigger, locate and remove software defects. Software testing is still in its infant stage; testing is crafted to suit specific needs in various software development projects in an ad-hoc manner. Various analysis tools such as trend analysis, fault-tree analysis, Orthogonal Defect Classification and formal methods, etc., can also be used to minimize the possibility of defect occurrence after release and therefore improve software reliability. After deployment of the software product, field data can be gathered and analyzed to study the behavior of software defects. Fault tolerance and fault/failure forecasting techniques will be helpful techniques and guiding rules to minimize fault occurrence or the impact of faults on the system.

Relationship to other topics

Software Reliability is a part of software quality. It relates to many areas where software quality is concerned.

Traditional/Hardware Reliability: The initial quest in software reliability study is based on an analogy with traditional and hardware reliability. Many of the concepts and analytical methods that are used in traditional reliability can be used to assess and improve software reliability too. However, software reliability focuses on design perfection rather than manufacturing perfection, as traditional/hardware reliability does.

Software Fault Tolerance: Software fault tolerance is a necessary part of a system with high reliability. It is a way of handling unknown and unpredictable software (and hardware) failures (faults) [Lyu95], by providing a set of functionally equivalent software modules developed by diverse and independent production teams (a minimal voting sketch appears at the end of this section). The assumption is the design diversity of the software, which itself is difficult to achieve.

Software Testing: Software testing serves as a way to measure and improve software reliability. It plays an important role in the design, implementation, validation and release phases. It is not a mature field. Advances in this field will have great impact on the software industry.

Social & Legal Concerns: As software permeates every corner of our daily life, software-related problems and the quality of software products can cause serious problems, such as the Therac-25 accident. The defects in software are significantly different from those in hardware and other components of the system: they are usually design defects, and a lot of them are related to problems in specification. The unfeasibility of completely testing a software module complicates the problem, because bug-free software can not be guaranteed for a moderately complex piece of software. No matter how hard we try, a defect-free software product can not be achieved. Guaranteeing no known bugs is certainly not a good-enough approach to the problem. Losses caused by software defects cause more and more social and legal concerns.
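The design-diversity idea behind software fault tolerance can be illustrated with a minimal N-version voter. The three square-root routines and the majority rule below are invented for the example; a toy like this is not a substitute for genuinely independent development teams.

```python
from collections import Counter

def n_version_vote(implementations, inputs):
    """Minimal N-version programming voter: run functionally equivalent
    implementations (ideally developed by independent teams) on the same
    inputs and accept the majority result. A tie or total disagreement is
    reported as a detected failure instead of a silently wrong answer."""
    results = []
    for impl in implementations:
        try:
            results.append(impl(*inputs))
        except Exception:
            results.append(None)  # a crashed version simply loses its vote
    tally = Counter(results).most_common()
    if not tally or tally[0][1] <= len(implementations) // 2:
        raise RuntimeError("no majority agreement among versions")
    return tally[0][0]

# Three hypothetical, "independently written" square-root routines.
def sqrt_newton(x):
    guess = x if x > 1 else 1.0
    for _ in range(50):
        guess = 0.5 * (guess + x / guess)
    return round(guess, 6)

def sqrt_binary_search(x):
    lo, hi = 0.0, max(1.0, x)
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if mid * mid < x else (lo, mid)
    return round(lo, 6)

def sqrt_buggy(x):
    return round(x / 2, 6)  # deliberately wrong, to be out-voted

if __name__ == "__main__":
    print(n_version_vote([sqrt_newton, sqrt_binary_search, sqrt_buggy], (2.0,)))
```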

Conclusions

Software reliability is a key part of software quality. The study of software reliability can be categorized into three parts: modeling, measurement and improvement.

Software reliability modeling has matured to the point that meaningful results can be obtained by applying suitable models to the problem. Many models exist, but no single model can capture a necessary amount of the software's characteristics. Assumptions and abstractions must be made to simplify the problem. There is no single model that is universal to all situations.

Software reliability measurement is naive. Measurement is far from commonplace in software, as it is in other engineering fields. "How good is the software, quantitatively?" As simple as the question is, there is still no good answer. Software reliability can not be directly measured, so other related factors are measured to estimate software reliability and compare it among products. Development process, faults and failures found are all factors related to software reliability.

Software reliability improvement is hard. The difficulty of the problem stems from an insufficient understanding of software reliability and, in general, of the characteristics of software. Until now there is no good way to conquer the complexity problem of software. Complete testing of a moderately complex software module is infeasible, so defect-free software products can not be assured. Realistic constraints of time and budget severely limit the effort put into software reliability improvement.

As hard as the problem is, promising progress is still being made toward more reliable software. More standard components and better processes are being introduced in the software engineering field.

Ensuring software reliability is no easy task. If not considered carefully, software reliability can be the reliability bottleneck of the whole system. As more and more software creeps into embedded systems, we must make sure they don't embed disasters.

Software Testing Life Cycle

Software testing has its own life cycle that meets every stage of the SDLC. The software testing life cycle diagram can help one visualize the various software testing life cycle phases. They are:

1. Requirement Stage
2. Test Planning
3. Test Analysis
4. Test Design
5. Test Verification and Construction
6. Test Execution
7. Result Analysis
8. Bug Tracking
9. Reporting and Rework
10. Final Testing and Implementation
11. Post Implementation


Requirement Stage

This is the initial stage of the life cycle process, in which the developers take part in analyzing the requirements for designing a product. Testers can also involve themselves, as they can think from the users' point of view, which the developers may not. Thus a panel of developers, testers and users can be formed. Formal meetings of the panel can be held in order to document the requirements discussed, which can be further used as the software requirements specification, or SRS.

Test Planning

Test planning is predetermining a plan well in advance to reduce further risks. Without a good plan, no work can lead to success, be it software-related or routine work. A test plan document plays an important role in achieving a process-oriented approach. Once the requirements of the project are confirmed, a test plan is documented. The test plan structure is as follows (a minimal machine-readable sketch of such a plan appears after the list):

1. Introduction: This describes the objective of the test plan.
2. Test items: The items that are referred to in preparing this document are listed here, such as the SRS and the project plan.
3. Features to be tested: This describes the coverage area of the test plan, i.e. the list of features that are to be tested, based on the implicit and explicit requirements from the customer.
4. Features not to be tested: The incorporated or comprised features that can be skipped from the testing phase are listed here. Features that are out of scope of testing, like incomplete modules or those of low severity (e.g. GUI features that don't hamper the further process), can be included in the list.
5. Approach: This is the test strategy, which should be appropriate to the level of the plan. It should be in acceptance with the higher and lower levels of the plan.
6. Item pass/fail criteria: Related to the show-stopper issue. The criterion used has to explain which test item has passed or failed.
7. Suspension criteria and resumption requirements: The suspension criterion specifies the criterion that is to be used to suspend all or a portion of the testing activities, whereas the resumption criterion specifies when testing can resume with the suspended portion.
8. Test deliverables: This includes a list of documents, reports and charts that are required to be presented to the stakeholders on a regular basis during testing and when testing is completed.
9. Testing tasks: This stage is needed to avoid confusion over whether defects should be reported for a future function. This also helps users and testers to avoid incomplete functions and prevent waste of resources.
10. Environmental needs: The special requirements of the test plan, depending on the environment in which the application has to be designed, are listed here.
11. Responsibilities: This phase assigns responsibilities to the person who can be held responsible in case of a risk.
12. Staffing and training needs: Training on the application/system and on the testing tools to be used needs to be given to the staff members who are responsible for the application.
13. Risks and contingencies: This emphasizes the probable risks, the various events that can occur, and what can be done in such a situation.
14. Approval: This decides who can approve the process as complete and allow the project to proceed to the next level, depending on the level of the plan.
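For illustration only, the numbered structure above can be captured in a small machine-readable skeleton. The field names mirror the list, and the example values are hypothetical.

```python
# Hypothetical test plan skeleton mirroring the numbered structure above.
test_plan = {
    "introduction": "Objective: verify the billing module against SRS v1.2.",
    "test_items": ["SRS v1.2", "project plan"],
    "features_to_be_tested": ["invoice creation", "payment posting"],
    "features_not_to_be_tested": ["report color themes (low-severity GUI)"],
    "approach": "Risk-based functional testing at the system level.",
    "item_pass_fail_criteria": "No open show-stopper defects.",
    "suspension_and_resumption": {
        "suspend_when": "build fails smoke test",
        "resume_when": "smoke test passes on a new build",
    },
    "test_deliverables": ["test cases", "defect reports", "summary charts"],
    "testing_tasks": ["unit retest support", "integration test execution"],
    "environmental_needs": ["staging database", "two client workstations"],
    "responsibilities": {"test_lead": "A. Tester"},
    "staffing_and_training_needs": ["tool training for new testers"],
    "risks_and_contingencies": ["late builds -> reduce regression scope"],
    "approval": {"approved_by": "QA manager"},
}

if __name__ == "__main__":
    for section, value in test_plan.items():
        print(f"{section}: {value}")
```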

Test Analysis

Once the test plan documentation is done, the next stage is to analyze what types of software testing should be carried out at the various stages of the SDLC.

Test Design

Test design is done based on the requirements of the project documented in the SRS. This phase decides whether manual or automated testing is to be done. In automation testing, the different paths for testing are to be identified first, and the writing of scripts has to be done if required. There originates a need for an end-to-end checklist that covers all the features of the project.

Test Verification and Construction

In this phase the test plans, the test design and the automated script tests are completed. Stress and performance testing plans are also completed at this stage. When the development team is done with a unit of code, the testing team is required to help them in testing that unit and reporting any bug found. Integration testing and bug reporting are done in this phase of the software testing life cycle.

Test Execution

Planning and execution of the various test cases is done in this phase. Once the unit testing is completed, the functionality of the tests is checked in this phase. At first, top-level testing is done to find out top-level failures, and bugs are reported immediately to the development team to get the required workaround. Test reports have to be documented properly and the bugs have to be reported to the development team.

Result Analysis

Once the bug is fixed by the development team, i.e. after the successful execution of the test case, the testing team has to retest it to compare the expected values with the actual values, and declare the result as pass/fail.

Bug Tracking

This is one of the important stages, as the Defect Profile Document (DPD) has to be updated for letting the developers know about the defect. The Defect Profile Document contains the following fields (a minimal sketch of such a record follows the list):

1. Defect Id: Unique identification of the defect.
2. Test Case Id: Test case identification for that defect.
3. Description: Detailed description of the bug.
4. Summary: This field contains some keyword information about the bug, which can help in minimizing the number of records to be searched.
5. Defect Submitted By: Name of the tester who detected/reported the bug.
6. Date of Submission: Date at which the bug was detected and reported.
7. Status: This field displays the current status of the bug.
8. Severity: Degree of severity of the defect.
9. Priority: Priority of fixing the bug.
10. Assigned To: Name of the developer who is supposed to fix the bug.
11. Build No.: Number of test runs required.
12. Version No.: The version information of the software application in which the bug was detected and fixed.
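A minimal sketch of one DPD entry, with the fields listed above expressed as a record type; the types, defaults, and example values are illustrative assumptions rather than a prescribed format.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DefectRecord:
    """One entry of a Defect Profile Document (DPD), mirroring the fields
    listed above. Types and defaults are illustrative assumptions."""
    defect_id: str                 # unique identification of the defect
    test_case_id: str              # test case that exposed the defect
    description: str               # detailed description of the bug
    summary: str                   # keyword summary used for searching
    submitted_by: str              # tester who detected/reported the bug
    date_of_submission: date       # when the bug was detected and reported
    status: str = "open"           # current status (open / fixed / retest / closed)
    severity: str = "medium"       # degree of severity of the defect
    priority: int = 3              # priority of fixing the bug (1 = highest)
    assigned_to: str = ""          # developer expected to fix the bug
    build_no: str = ""             # build number (per the DPD list: number of test runs required)
    version_no: str = ""           # version in which the bug was detected/fixed

if __name__ == "__main__":
    bug = DefectRecord(
        defect_id="DPD-0042",
        test_case_id="TC-113",
        description="Withdrawal of 0.00 crashes the session handler.",
        summary="crash, withdrawal, zero amount",
        submitted_by="A. Tester",
        date_of_submission=date(1999, 4, 12),
        assigned_to="B. Developer",
    )
    print(bug)
```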

In short, the contents of a bug report explain all the above mentioned things.

Reporting and Rework

Testing is an iterative process. Once a bug is reported and the development team fixes it, it has to undergo the testing process again to assure that the bug found is resolved. Regression testing has to be done. Once the Quality Analyst assures that the product is ready, the software is released for production. Before release, the software has to undergo one more round of top-level testing. Thus testing is an ongoing process.

Final Testing and Implementation

This phase focuses on the remaining levels of testing, such as acceptance, load, stress, performance and recovery testing. The application needs to be verified under specified conditions with respect to the SRS. Various documents are updated and the different matrices for testing are completed at this stage of the software testing life cycle.

Post Implementation

Once the tests are evaluated, the recording of errors that occurred during the various levels of the software testing life cycle is done. Creating plans for improvement and enhancement is an ongoing process; this helps to prevent similar problems from occurring in future projects. In short, planning for the improvement of the testing process for future applications is done in this phase.
