Lewis Sykalski

A Case Study In Reliability Analysis

0. Abstract

The project consisted of an analysis of dependability gains achieved through the use of design diversity across two
Oracle releases. Bug reports from Oracle Metalink were stimulated across multiple releases, and
failure analysis was completed to determine the effectiveness of a design diversity fault tolerance
approach. In addition, reliability analysis and prediction were performed on the NCW Data
Collector software, using both failure logs generated from past simulation events and a
software reliability analysis tool called CASRE (Computer Aided Software Reliability Estimation).

1. Introduction

It has been established that accounting for fault tolerance and performing reliability analysis not
only bring better quality to the product but also help lower costs by allowing the analyst to detect
reliability trends earlier in the lifecycle, when they can still be corrected. However, it is
important to realize that there is an equilibrium for each product at which the utility of
doing more reliability analysis and fault tolerance work is overcome by the utility of stopping.
Furthermore, fully reliable software can never be achieved. A more attainable objective is
to produce software that is reasonably reliable, or that meets the customers'
requirement for software reliability. Despite the importance and cost benefits, reliability is quite
often ignored by enterprises, and when the schedule starts to slip the established QA/reliability
activities are abandoned rather rapidly. This project will attempt to demonstrate the importance of
reliability activities as they relate to the software environment at my work. It will also allow me to
garner experience with these reliability methodologies in the hope that I might employ some of them in my
everyday tasking at work.

2. Background

The Data Collector software provides a means for collecting data associated with an
experiment for later comprehensive analysis. The software is written in Java and is architecturally
very modular and versatile. Furthermore, it has the flexibility to support remote databases using
the JDBC and Java RMI APIs. Its sole purpose is to listen to network DIS (Distributed Interactive
Simulation) PDU (Protocol Data Unit) traffic in the form of UDP datagram packets, as well as XML
packets on a separate port, and subsequently refine and record it to an Oracle database. There is a
front-end program (Hyperion Interactive Reporting Studio) that serves as a graphical window into the
database; however, it will not be the focus of the reliability analysis. The diagram below, Figure 2.1,
illustrates the simulation environment and the possible sources of network PDUs for collection.
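
As a rough illustration of that collection path (this is only a sketch and not the Data Collector's actual code; the port number, buffer size, and the recordToDatabase hook are assumptions), a DIS listener amounts to a blocking UDP receive loop that hands each datagram off for refinement and recording:

import java.net.DatagramPacket;
import java.net.DatagramSocket;

// Minimal sketch of a UDP PDU listener in the style described above.
// The port number and the recordToDatabase() hook are hypothetical.
public class PduListenerSketch {
    private static final int DIS_PORT = 3000;      // assumed DIS PDU port
    private static final int MAX_PDU_BYTES = 8192; // assumed maximum datagram size

    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket(DIS_PORT)) {
            byte[] buffer = new byte[MAX_PDU_BYTES];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                socket.receive(packet);   // blocks until a datagram arrives
                byte[] pdu = java.util.Arrays.copyOf(packet.getData(), packet.getLength());
                recordToDatabase(pdu);    // refine and persist (e.g., via JDBC)
            }
        }
    }

    // Placeholder for the refine-and-record step (JDBC insert into Oracle).
    private static void recordToDatabase(byte[] pdu) {
        System.out.println("Received PDU of " + pdu.length + " bytes");
    }
}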


The different simulation players in Figure 2.1 are further described in Table 2.1 below in terms of
their relation to data collection.

Player: Description

CAOC (Combined Air Operations Center): Responsible for sending XML-based EW Reports to players.
DC (Data Collector): Program responsible for collection and refinement of experiment data.
EADSIM: Environment model that controls certain red players. Responsible for Entity State, Detonation, and Electromagnetic Emission PDUs.
FUSION: FUSION algorithm which transmits fused EW Reports to all players in the form of XML-based packets.
Humvee Sim: A man-in-the-loop simulation for controlling a hummer. Entity State, Detonation, Fire PDUs.
JABE: Environment model that controls certain blue players. Responsible for all cockpit PDUs defined by Man-in-the-loop Cockpit below.
JIMM: Environment model that controls certain red players. Responsible for Entity State, Detonation, and Electromagnetic Emission PDUs.
JSAF: Environment model that controls certain red players. Responsible for Entity State, Detonation, and Electromagnetic Emission PDUs.
JTAC (Joint Terminal Attack Controller): Responsible for sending XML-based NTISR Assignments to players.
Man-in-the-loop Cockpit: A manned player (F-16, F-22, or F-35) who sends a wide range of PDUs for collection (Entity State, Detonation, XML-based, etc.).
Man-in-the-loop Threat Sims: A man-in-the-loop simulation for controlling a SAM site. Entity State, Detonation, Fire PDUs are the pertinent PDUs that these transmit to DC.
Other Sims: A wide variety of other sims that speak DIS. Entity State, Detonation, Fire PDUs are the main PDUs from these that Data Collection is interested in.
Police Car Sim: A man-in-the-loop simulation for controlling a hummer. Entity State, Detonation, Fire PDUs.
VBMS (Virtual Battlespace Management Software): A God's Eye Viewer. Not responsible for transmission of any PDUs.
WCS (White Control Station): A God's Eye Viewer. Not responsible for transmission of any PDUs needed by data collection.

Table 2.1: Simulation Player Descriptions

The DIS standard defines 67 different PDU types, arranged into 12 families. The data collector software collects & refines data from the following DIS PDUs & XML PDUs:

• (Entity Information/Interaction family) – Entity State PDU
• (Warfare family) – Fire PDU, Detonation PDU
• (Distributed Emission Regeneration family) – Electromagnetic Emission PDU
• (XML Custom PDUs) – Emcon Status, EW Report, EW Fused Report, NTISR Assignment, MBMS

3. Problem

It's been 8 months now since I came into possession of this software. Unfortunately, in the past, reliability has been unacceptable and the software has established a bad reputation. I've heard many horror stories from my new colleagues of lost data during past experiments due to crashes, configuration, and transport issues. Despite many of my reliability revisions, however, I still encounter many crashes and other failures. This is due to the fact that each event is its own environment, with different PDUs of interest, different scenarios, different software loads, and different corporate entities involved. Furthermore, problems are exacerbated at times by the fact that the other entities also enjoy a lax quality assurance environment. This results most often in the receipt of "garbage" datagrams, which I must account for and recover from in my implementation.

From what I gauge from the environment, the software for the most part is allowed to miss a few data PDUs here and there; it is not overly important, as we have no control over the source and you get what you get. Reliability is important, however: when a catastrophic failure occurs and data for the experiment is lost, that is unacceptable. Most importantly, if the program aborts, the simulation run must be thrown away, wasting the time of everyone involved. In some of our larger experiments, this could be upwards of 40 people.

To prevent some of these kinds of things from occurring I have personally added a few fail-safe measures. For example, most exceptions are wrapped with try/catch blocks to fail safely and continue. In addition, if the TCP socket to Oracle crashes, I will close the connection and reinitialize. I have also built into the program a decent level of verbosity; should a problem occur, it will be more readily transparent and easily located.
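
As an illustration of the kind of wrapping just described (a sketch only, not the Data Collector's actual code; the connection URL, credentials, table name, and method names are hypothetical), a guarded record attempt catches the failure, logs it verbosely, and re-establishes the JDBC connection before continuing:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of the try/catch + reconnect fail-safe described above.
// URL, credentials, and the table/column names are placeholders.
public class FailSafeRecorderSketch {
    private static final String JDBC_URL = "jdbc:oracle:thin:@//dbhost:1521/ncw"; // hypothetical
    private Connection connection;

    public FailSafeRecorderSketch() throws SQLException {
        connection = DriverManager.getConnection(JDBC_URL, "collector", "secret");
    }

    // Record one refined PDU; on failure, log it, reinitialize the connection, and carry on.
    public void record(byte[] pdu) {
        try (PreparedStatement ps =
                 connection.prepareStatement("INSERT INTO pdu_log (payload) VALUES (?)")) {
            ps.setBytes(1, pdu);
            ps.executeUpdate();
        } catch (SQLException e) {
            System.err.println("Record failed, reinitializing connection: " + e); // verbose logging
            reinitialize();   // fail safely and continue with the next PDU
        }
    }

    private void reinitialize() {
        try {
            if (connection != null) connection.close();   // drop the broken socket
            connection = DriverManager.getConnection(JDBC_URL, "collector", "secret");
        } catch (SQLException e) {
            System.err.println("Reconnect attempt failed: " + e);
        }
    }
}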

In choosing an application project, I thought it important to examine both components. Overall reliability, after all, is the sum of the individual component reliabilities. With Oracle, a fault containment strategy must be employed. After reading a paper by Gashi, Popov, Strigini about the use of Design Diversity in DBMS products, it was decided I would analyze a similar approach utilizing multiple versions of Oracle. If the strategy was then found to be useful, it could be employed as a fault-tolerance/containment technique for Oracle. For the Java NCW data collector, because of the short turn-around time required for this project, it was decided that reliability trend analysis & prediction would be more beneficial. This strategy, if employed properly, could help detect issues before it was too late in the development/integration cycle to resolve.

4. Strategy

4.1 Design Diversity Strategy

Approximately 20 bug reports for both Oracle 9i and 10g will be taken from the Oracle Metalink site employing a pseudo-random approach. My main requirement was that the bugs come with a bug script or a detailed description on how to reproduce. In order to get a controlled set of inputs, I thus needed to be very careful about my employed selection criteria for bug scripts. The following criteria are conditions I hoped to satisfy in my selection of bug scripts:

1. Easy to Reproduce: While I would like to consider myself an expert in this domain, I am not. In order to satisfy this criterion, the bug would require a well-detailed bug report or a detailed description of the stimuli so it can be reproduced.

2. Date Independent: I wanted to remove dates as a factor, especially in the selection of the Oracle 9.2 bugs, as it would be more possible that a bug would be fixed in a later version should I pick an earlier date. To do this, I sorted on other columns, effectively ignoring the report date column.

3. Type Independent: While some types will naturally be easier to reproduce and thus appear more frequently in my sample, I will do my best to largely ignore this attribute. Thus I hope the type distribution to be a fair representation of the frequency of that error type within a release.

Bugs were then run on both versions of Oracle and failures were documented. Results were classified by failure type, and self-evidence as well as divergence was noted.

4.2 Reliability of NCW DC Strategy

In preparation for this activity, both S-Plus and CASRE were downloaded and examined for suitability. CASRE was determined to be more suitable for this project as it was easier to understand and required less overhead by the analyzer; CASRE was therefore used in lieu of S-Plus, effectively removing this dependency. Suitable log files were also gathered from around the labs. Once gathered, they were then organized by both run and experiment. I thus resolved myself to analyze reliability both across lifecycle phases (integration/execution) and across simulation events.

For this component, prep work was done to determine key words of interest. In order to facilitate this, a simple JavaScript was written to parse the log files into an intermediate format, extracting Time of Program Start, Time of Program Termination, Time of Thread Terminations, and Exception or Failure Messages. Severity of failure was then determined by examining the exception in the intermediate log file and comparing it with the Failure Descriptions in Table 4.2 below. In addition, failure information was then translated by hand into CASRE's internal format; the definition of this format can be found in Appendix C.1. For the reliability analysis portion, the NCW Data Collector log files were then run through CASRE to determine reliability through a variety of metrics.
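
The parser itself was a simple JavaScript; purely to illustrate the keyword-extraction idea in the Data Collector's own language, a rough Java equivalent is sketched below. The log phrases, file layout, and intermediate output format shown here are assumptions, not the actual ones used:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Illustrative keyword scan of a run log into an intermediate, line-oriented format.
// The keywords and log layout are hypothetical.
public class LogScanSketch {
    public static void main(String[] args) throws IOException {
        Path log = Paths.get(args.length > 0 ? args[0] : "run.log");
        List<String> lines = Files.readAllLines(log);
        for (String line : lines) {
            if (line.contains("Program started")) {
                System.out.println("START|" + line);
            } else if (line.contains("Program terminated")) {
                System.out.println("STOP|" + line);
            } else if (line.contains("Thread terminated")) {
                System.out.println("THREAD_STOP|" + line);
            } else if (line.contains("Exception") || line.contains("FAILURE")) {
                // Severity would later be assigned by comparing the exception text
                // against the failure descriptions in Table 4.2.
                System.out.println("FAILURE|" + line);
            }
        }
    }
}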

Severity Code   Failure Description
9               Failure Causes Machine to be Rebooted, Causing Catastrophic Loss
8               Failure Causes Program Abort
7               Failure Causes Program Thread Abort
5               Failure Causes Record Not to be Written, Thread Continues
3               Failure Causes Incorrect Data to be Written, Thread Continues
1               Failure is Caught, Handled, and Recovers Correctly

Table 4.2: Reliability Analysis Severity Definition

In this exercise I performed analysis on 10 runs from 2 different unclassified experiments, CALOE-08 and MAGTF-08 (please see Appendix A.1 for in-depth event descriptions). I could not go back any farther due to a lack of log files caused by poor configuration management by my predecessor, and even if I managed to scrape up some log files, the verbosity in failure logging would not allow for meaningful reliability analysis. This broke the exercise into 4 distinct phases: failure count data generation, preliminary reliability analysis, reliability trend analysis, and reliability prediction. In addition, I also planned for analysis of runs from the integration phase of MAGTF-08 and the execution phase to gauge reliability between life-cycle phases. The activities then would be guided by the built-in analysis functions within CASRE. The CASRE User's Guide includes flow charts detailing usage of the tool (see Appendix C.3: CASRE Usage Flow). Reliability analysis was then to be performed using CASRE's analysis methodology and built-in charting mechanisms.

5. Activities & Results

5.1 Design Diversity Activity & Results

An established set of metrics was chosen based off of the journal article Fault Tolerance via Diversity for Off-the-Shelf Products: A Study with SQL Database Servers [Gashi, Popov, Strigini], which became the basis for this exercise. Definitions of evidence and divergence are provided in Appendix A: Glossary. SQL failure type definitions are also provided in Appendix B. Bug reports were first chosen from 9.2 and run against both 9.2 and 10.0 respectively. Results are reported in Table 5.1.A.

Bug #     Type               9.2 S.E.   10.0 Fails?   10.0 S.E.   Divergent
2357784   Internal Error     X          NO            N/A         X
2299898   Performance/Hang   X          NO            N/A         X
2202561   Incorrect Results             NO            N/A
2221401   Incorrect Results             NO            N/A
2739068   Incorrect Results             NO            N/A
2683540   Incorrect Results             NO            N/A
2991842   Incorrect Results             NO            N/A
2200057   Internal Error     X          NO            N/A
2405258   Internal Error     X          NO            N/A
2716265   Internal Error     X          NO            N/A
2054241   Performance/Hang   X          NO            N/A
2485871   Internal Error     X          NO            N/A
2670497   Internal Error     X          NO            N/A
2659126   Internal Error     X          NO            N/A         X
2064478   Internal Error     X          NO            N/A
2624737   Internal Error     X          NO            N/A         X
1918751   Internal Error     X          NO            N/A
2286290   Incorrect Results             NO            N/A         X
2700474   Incorrect Results             NO            N/A
2576353   Internal Error     X          NO            N/A

Table 5.1.A: Oracle 9.2 Bug Classification Activity

Bug reports were then chosen from 10.0 and run against both 10.0 and 9.2 respectively. Results are reported in Table 5.1.B.

Bug #     Type               10.0 S.E.   9.2 Fails?   9.2 S.E.   Divergent
5731063   Internal Error     X           NO           N/A
3664284   Incorrect Results              NO           N/A
4582808   Incorrect Results              NO           N/A
3895678   Internal Error     X           YES          X
3893571   Internal Error     X           YES          X
3903063   Incorrect Results              YES
3912423   Internal Error     X           NO           N/A
4029857   Engine Crash       X           YES          X
4156695   Incorrect Results              YES
2929556   Internal Error     X           YES          X          X
3255350   Performance/Hang   X           NO           N/A
3887704   Internal Error     X           NO           N/A
3405237   Engine Crash       X           YES          X
3952322   Feature Unusable   X           YES          X
4033889   Incorrect Results              NO           N/A
4060997   Internal Error     X           YES          X
4134776   Internal Error     X           NO                      X
4149779   Incorrect Results              NO           N/A
2964132   Internal Error     X           YES          X
3361118   Internal Error     X           YES          X

Table 5.1.B: Oracle 10.0 Bug Classification Activity

In summarizing the results:

                          Oracle 9.2 Bug Scripts       Oracle 10.0 Bug Scripts
                          Oracle 9.2   Oracle 10.0     Oracle 10.0   Oracle 9.2
Total Bug Scripts         20           -               20            -
Failure Observed          20           0               20            11
Performance/Hang   S.E    2            0               1             0
Internal Error     S.E    11           0               10            6
Engine Crash       S.E    0            0               2             2
Incorrect Result   S.E    0            0               1             1
                   N.S.E  7            0               6             2
Other              S.E    0            0               0             0
                   N.S.E  0            0               0             0

Oracle 9.2 Scripts: 13 Self-Evident, 7 Non-Self-Evident.
Oracle 10.0 Scripts: 1 product failing: 9 (5 SE / 4 NSE); both products failing: 11 (1 Divergent (1 SE / 0 NSE), 10 Non-Divergent (8 SE / 2 NSE)).

Total Bug Scripts: 40
Total Bug Failures: 40
1 out of 2 Products Failing: S.E 18, N.S.E 11
Both DBMS Products Failing: Non-Divergent: S.E 8, N.S.E 2; Divergent: S.E 1, N.S.E 0

According to the Gashi, Popov, Strigini analysis methods, the number of failures not detected by design diversity is the number of Non-Divergent, Non-Self-Evident failures across all scripts wherein both products fail. This is due to the fact that these failures are not able to be observed and happen identically across both DBMS products. Doing this calculation for my scripts: 2/40 (5%) of failures were not detectable.

5.2 Reliability of NCW DC Activity & Results

The reliability analysis activity was performed by first generating the input CASRE datafiles (see Appendix C.2 for files). All datafiles included each and every run from the integration dry runs or execution event runs for their respective events. Execution event runs use a Design Of Experiments (DOE) standard deviation set of 10 runs to represent an experiment. To elaborate, the extra runs that are provided are runs that were either redone due to a variety of factors (pilot error, cockpit error, etc.) or where data collection had to be restarted (not necessarily the run) due to program abort. As working data collection is a requirement even on bad runs, we can not throw these out. Integration runs were not limited in the same fashion. They were then grouped into dependent variable sets (integration/execution or event/event) to track reliability with regards to the independent variable. The files were then run through CASRE individually to track reliability progress within a simulation set. The following charts were then generated for both the CALOE Execution / MAGTF Execution and the MAGTF Integration / MAGTF Execution partitions (see Appendix C.4.1-C.4.5):

• Failure Count: A plot of the number of failures observed in a test interval as a function of the test interval number.
• Time Between Failures: A plot of time since the last failure as a function of failure number.
• Failure Intensity: The failure intensity (failures observed per unit time) as a function of total elapsed testing time.
• Cumulative Failures: The total number of failures observed as a function of total time elapsed.
• Test Interval Length: A plot of the lengths of each test interval as a function of test interval number.

Reliability trend analysis was then performed to determine if the failure count data showed reliability growth. Two separate tests were employed (see Appendix C.4.6-C.4.7; a computational sketch of both follows this list):

• Running Average: The running average of the number of failures per interval for failure count data. If the running average decreases with time (fewer failures per test interval), reliability growth is indicated.
• Laplace Test: The null hypothesis for this test is that occurrences of failures can be described as a homogeneous Poisson process. If the test statistic decreases with increasing failure number (test interval number), then the null hypothesis can be rejected in favor of reliability growth at an appropriate significance level. If the test statistic increases with failure number (test interval number), then the null hypothesis can be rejected in favor of decreasing reliability.
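
For reference, a common way to compute these two trend indicators directly from failure-count data is sketched below. The Laplace factor shown is the usual textbook form for grouped data with equal-length test intervals; CASRE's exact implementation may differ, so this is illustrative only (the sample counts are arbitrary):

// Trend indicators for failure-count data n[0..k-1] over equal-length test intervals.
// Standard textbook forms; not taken from CASRE's source.
public class TrendTestSketch {

    // Running average of failures per interval after each interval.
    static double[] runningAverage(int[] n) {
        double[] avg = new double[n.length];
        int sum = 0;
        for (int i = 0; i < n.length; i++) {
            sum += n[i];
            avg[i] = (double) sum / (i + 1);
        }
        return avg;
    }

    // Laplace factor for grouped data: values at or below roughly -1.645 suggest
    // reliability growth at about the 5% significance level.
    static double laplaceFactor(int[] n) {
        int k = n.length;
        double total = 0, weighted = 0;
        for (int i = 0; i < k; i++) {
            total += n[i];
            weighted += i * n[i];          // intervals indexed 0..k-1
        }
        double mean = weighted / total;
        return (mean - (k - 1) / 2.0) / Math.sqrt((k * k - 1) / (12.0 * total));
    }

    public static void main(String[] args) {
        int[] counts = {6, 4, 5, 3, 2, 2, 1};   // arbitrary sample interval counts
        System.out.println("Running average: " + java.util.Arrays.toString(runningAverage(counts)));
        System.out.println("Laplace factor: " + laplaceFactor(counts));
    }
}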

0. Applying the ranking scheme as shown below (the default provided by CASRE) with highest priority given to goodness of fit. Similarly. which showed how much more likely it was that one model will produce more accurate predictions than the other. by ignoring these criteria as a first-stage screening I was able to get model ranking. Furthermore. This was probably due to the moderate level of variance in the data (some test intervals with many failures next to clean test intervals with 0 failures. and hence the software’s reliability is increasing. we could be more restrictive in our analysis. overall this strategy seems well-suited for detecting faults. but it doesn’t necessarily mean we will be successful. 6 were N.0 bug scripts) have the potential for not being able to recovered from. Using the metric of Both DMBS products failing. yielding a 2/20 (10%) detection rate. For the Running Arithmetic Average Test this was evaluated by observing that the running arithmetic average decreases as the test interval number increased. The MAGTF integration / MAGTF Execution set representing different lifecycle phases exhibited more reliability growth than the CALOE execution / MAGTF Execution. Using more releases.S. I charted the prequential-likelihood.2 Reliability of NCW DC Results Analysis My expectations were low as the data was sampled by test interval and more discrete and granular in nature. However. This meant that the null hypothesis could be rejected the null hypothesis. if we chose to do so. Just because the fault is able to be detected. If this was so the number of failures observed per test interval was decreasing. 6. Lewis Sykalski bugs would likely be contained through other means. This was expected as the integration effort as compared to the execution phase is normally more fraught with problems needing to be resolved..61 indicating 5%+ significance in reliability growth. I did not arrive at any reliability curve that fit the model to CASRE’s default Goodness of Fit (GOF) significance.E & 4 out of 6 of these failures would be detected by utilizing a past release. . After employing the models using default parameters. Despite some of the other ways of slicing the data leading to less fruitful percentages. Examining just the 20 bug scripts for Oracle 10. the raw data proved to have a moderate to high level of reliability growth correlation. Sure. Additionally. Again both trend tests exhibited more growth trending in the MAGTF integration / MAGTF Execution set than in the CALOE execution / MAGTF Execution. we find that 10/40 (25% for both sets of bug scripts) or 11/20 (55% of Oracle 10. that occurrences of failures follow a Homogeneous Poisson Process (a Poisson process in which the rate remains unchanged over time) in favor of the hypothesis of reliability growth at the α% significance level. The reliability trend analysis exhibited statistically significant correlation utilizing both means indicating the possibility of reliability modeling. using different products I would expect would detect close to 100% of faults. we can employ rollback or other fault tolerance strategies once detected.. does not necessarily mean it is able to be recovered from. For the Laplace Test this was evaluated by observing the results of the test that the test statistic was less than or equal to -1. This data did not necessarily follow the Goodness of Fit (GOF) variance. or a more diverse set of releases would likely yield even better percentages. However. 
I arrived at the relative rankings as also shown below.
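
For reference, the prequential likelihood referred to above is, in its standard form in the software reliability literature (given here generally; CASRE's exact computation may differ):

    PL_n = \prod_{i=1}^{n} \hat{f}_i(t_i)

that is, the product of the one-step-ahead predictive densities \hat{f}_i that a model assigned to each observation t_i actually seen next. The ratio of two models' prequential likelihoods then indicates how much more probable one model made the observed data than the other, which is the sense in which one model "will produce more accurate predictions."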

Both sets yielded wildly different top reliability models, as shown above (Yamada S-Shaped vs. NHPP [intervals]). In fact, the #1 model was the #3 model in the opposing set. At first I thought this may be indicative of a divergent answer (after all, we are modeling the same software). But actually this is an expected result, as it demonstrates that different reliability growth models are better suited for a lifecycle phase data set (integration/execution) than for an event execution set, due to the influence of the prior data set.

The reliability model predictions arrive at different reliability growth curves. What is more interesting is that reliability growth is increasing more near the end in the Execution/Execution set than in the Integration/Execution set, but in both cases, reliability is increasing. The execution/execution set prediction intervals represent new events in the execution lifecycle phase, while the integration/execution set prediction intervals represent continued time in the same execution lifecycle or a transition to a maintenance lifecycle (note: we don't have such a phase in our software development lifecycle).

So to interpret this: if transitioning to the beginning of the execution lifecycle phase from the current point in time [MAGTF Execution End], the reliability models tell us that there is more growth to be had. However, if continuing with more runs within the execution phase from the current point in time [MAGTF Execution End], the reliability models tell us that we are already at a relatively stable point. However, since we have a positive growth trend between events, it likely won't be as rough as MAGTF execution or CALOE before it. However, all this could change if a string of poor performances arises.

7. Follow-up Actions/Summary

Since results from the design diversity experiment show a strong benefit, a case could be made for design diversity to tolerate OTS faults. However, realistically, nothing will be done in the near-term as I really don't have the time; I am at present a one-man team. In addition, I got yelled at by my manager for even investigating a transition to Oracle 11. I will, however, tuck these results in the back of my head should I ever get relief from a new team member.

Reliability analysis results of the NCW Data Collector, while useful to see where I am at, will likely result in no follow-up action. My manager (the same one who yelled at me for investigating a transition to Oracle 11) has a firm belief that we have sufficient reliability at present and gets overly upset when I bring up the prospect of restructuring the code to improve fault-tolerance and efficiency. The most important factor, in my opinion, in my conclusion that nothing will be done is the fact that this type of activity will not generate any interest from management. Reliability, while important in the sense that data collection doesn't abort and waste everyone's time, is not a primary focus of our low to medium fidelity simulation. In fact, most problems can be corrected in the integration phase leading up to the experiment. Trending/predictions could be employed to determine reliability short-falls when new functionality is added, which in turn could be facilitated by adding auto-logging capability to NCW to write failures to a CASRE log format. So for the time being I will continue to avoid focusing on forecasting reliability and instead exert effort towards fault containment and resolution.
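
If the auto-logging idea above were ever pursued, the emitted file would only need to follow the failure-count layout excerpted in Appendix C.1 (a units keyword, then one line per test interval). The sketch below is a minimal illustration of that idea, not existing NCW code; the fixed 40-hour interval length and the single severity class are assumptions:

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

// Sketch of emitting CASRE failure-count data: a units keyword on the first line,
// then "interval  errorCount  intervalLength  severity" per test interval.
public class CasreCountWriterSketch {

    static void write(String path, List<Integer> failuresPerInterval) throws IOException {
        final double intervalLengthHours = 40.0;   // assumed fixed test interval length
        try (PrintWriter out = new PrintWriter(path)) {
            out.println("Hours");
            for (int i = 0; i < failuresPerInterval.size(); i++) {
                out.printf("%d %.1f %.1f %d%n",
                        i + 1,                                   // interval number
                        (double) failuresPerInterval.get(i),     // number of errors
                        intervalLengthHours,                     // interval length
                        1);                                      // error severity (single class assumed)
            }
        }
    }

    public static void main(String[] args) throws IOException {
        write("casre_counts.txt", Arrays.asList(3, 1, 0, 2, 1));
    }
}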

8. Lessons Learned

The one thing that I thought was most beneficial from this project was the gained practical knowledge of the tool CASRE. Gaining practical experience, in my opinion, is more important than a deep-seated understanding of the models. I learned that having Failure Count Data is more limiting in the world of software reliability than Time Between Failure data. This is because certain models necessitate Time Between Failures data. Time Between Failure data is more granular than Failure Count Data and can be transformed backwards to Failure Count Data with 100% precision. The transformation from Failure Count Data to Time Between Failure data, on the other hand, introduces error, as random sampling must be employed. Finally, I also learned that there is much volatility in individual models and not to read too far into specific numbers. However, by examining trends in multiple models, we can gauge future reliability.

Appendix A: Glossary

A.1 Simulation Event & Concept Descriptions

CALOE08 (Combat Air Level of Engagement): An experimentation event held in '08 which focused on monitoring Level of Engagement in a Combat Air Environment factored by the level of NTISR (Non-Traditional Intelligence, Surveillance, and Reconnaissance) enabled by Information Technology.

MAGTF08 (Marine Air Ground Task Force): An experimentation event held in '08 focused on MAGTF operations, wherein a balanced air-ground, combined arms task organization of Marine Corps forces under a single commander is employed to accomplish a specific mission.

Network Centric Warfare (NCW): Experimentation focused on translating an information advantage, enabled primarily by information technology, into a competitive warfighting advantage (either lethality or survivability) through information sharing and robust networking.

A.2 SQL Failure Type Definitions

Divergent failures: Any failures where the DBMS products return different results. This could be 1 out of 2, or up to n-1 out of n. (e.g., if a bug exists in 9.2 and not 10.0, then it would be divergent)

Nondivergent failures: The ones for which two (or more) DBMS products fail with identical symptoms. (e.g., if a bug exists in 9.2 and 10.0, then it would be non-divergent)

Self-evident failures: Failures that can be detected through observation (e.g., engine crash failures, internal failures signaled by DBMS product exceptions, or performance failures).

Non-self-evident failures: Failures that can not be detected through observation (e.g., incorrect result failures without DBMS product exceptions, with acceptable response time).

Appendix B: Design Diversity of Oracle Analysis

B.1 SQL Failure Type Definitions

Internal Error: An internal error within the DBMS product, generating an exception message.

Performance/Hang: A self-evident error resulting in obvious loss of performance or product hang.

Incorrect Results: An incorrect result returned by the DBMS product. Most likely non-self-evident in nature.

Loss of Service/Crash: A self-evident error resulting in loss of the Oracle service or a crash of the Oracle engine.

Feature Unusable: An error that made a feature unusable. Considered a miscellaneous error.

Appendix C: CASRE Reliability Analysis

C.1 CASRE Failure Counts Input File Format

The following is an excerpt from the CASRE 3.0 User's Guide defining the Failure Counts Input File Format. The first row in a file of failure counts must be one of the following seven keywords: Seconds, Minutes, Hours, Days, Weeks, Months, Years. For failure count data, this keyword names the units in which the lengths of each test interval are expressed. The second through last rows in a file of failure count data have the following fields:

Interval Number (int)   Number of Errors (float)   Interval Length (float)   Error Severity (int)

The following is an example of a failure count data file:

Hours
1 5.0 40.0 1
2 3.0 40.0 1
3 7.0 40.0 3
4 5.0 40.0 2
5 4.0 40.0 1
6 4.0 40.0 1
7 3.0 40.0 1

It can be located in Appendix C – Section 2.

C.2 CASRE Input Datafiles

CASRE_Execution_CALOE08.txt
CASRE_Execution_MAGTF08.txt
CASRE_Integration_MAGTF08.txt
CASRE13.txt
CASRE23.txt

C.3 CASRE Usage Flow

C.4 CASRE Charts

C.4.1 CASRE Failure Count

C.4.2 CASRE Time Between Failures

C.4.3 CASRE Failure Intensity

C.4.4 CASRE Cumulative Failures

4. Lewis Sykalski C.5 CASRE Test Interval Length C.6 CASRE Running Average .4.

C.4.7 CASRE Laplace Test

C.4.8 CASRE Prediction Setup

C.4.9 CASRE Cumulative Failure Prediction (next 15 ints)

C.4.10 CASRE Reliability Prediction (next 15 ints)

C.4.11 CASRE Prequential Likelihood (log) (next 15 ints)

References:

Fault Tolerance via Diversity for OTS Products: A Study w/ SQL Database Servers. Ilir Gashi, Peter Popov, Lorenzo Strigini. IEEE Transactions on Dependable and Secure Computing, Washington: Oct-Dec 2007, Vol. 4, Iss. 4, p. 280.

Computer Aided Software Reliability Estimation (CASRE) User's Guide Version 3.0. Allen P. Nikora. March 23, 2000.