Fault Tree Analysis

P.L. Clemens February 2002 4th Edition

Topics Covered
Fault Tree Definition Developing the Fault Tree Structural Significance of the Analysis Quantitative Significance of the Analysis Diagnostic Aids and Shortcuts Finding and Interpreting Cut Sets and Path Sets Success-Domain Counterpart Analysis Assembling the Fault Tree Analysis Report Fault Tree Analysis vs. Alternatives Fault Tree Shortcoming/Pitfalls/Abuses
All fault trees appearing in this training module have been drawn, analyzed, and printed using FaultrEaseTM, a computer application available from: Arthur D. Little, Inc./Acorn Park/ Cambridge, MA., 02140-2390 – Phone (617) 8645770.

2
8671

First – A Bit of Background
Origins of the technique Fault Tree Analysis defined Where best to apply the technique What the analysis produces Symbols and conventions
3
8671

Origins Fault tree analysis was developed in 1962 for the U. Air Force by Bell Telephone Laboratories for use with the Minuteman system…was later adopted and extensively applied by the Boeing Company…is one of many symbolic logic analytical techniques found in the operations research discipline. 4 8671 .S.

Numerical probabilities of occurrence can be entered and propagated through the model to evaluate probability of the foreseeable. The pathways interconnect contributory events and conditions. undesirable loss event.The Fault Tree is A graphic “model” of the pathways within a system that can lead to a foreseeable. Only one of many System Safety analytical tools and techniques. using standard logic symbols. undesirable event. 5 8671 .

6 8671 . (a must!) Indiscernible mishap causes (i. Numerous potential contributors to a mishap.e. Caveat: Large fault trees are resource-hungry and should not be undertaken without reasonable assurance of need. autopsies).e.. Complex or multi-element systems/processes. perceived threats of loss. i.. Already-identified undesirable events.Fault Tree Analysis is Best Applied to Cases with Large. high risk.

7 8671 . Qualitative/quantitative insight into probability of the loss event selected for analysis. Guidance for redeploying resources to optimize control of risk.Fault Tree Analysis Produces Graphic display of chains of events/conditions leading to the loss event. Identification of those potential contributors to failure that are “critical. Identification of resources committed to preventing failure. Documentation of analytical results.” Improved understanding of system characteristics.

– FAILURE • Loss.g. or 2) by a failure (see below). etc. All failures cause faults. 8 8671 *System element: a subsystem. a blown fuse.Some Definitions – FAULT • An abnormal undesirable state of a system or a system element* induced 1) by presence of an improper command or absence of a proper one. or the relay coil has burned out and will not close the contacts when commanded – the relay has failed. A system which has been shut down by safety features has not faulted. by a system or system element*. e. . piece part..g. e. not all faults are caused by failures. assembly. a pressure vessel bursts – the vessel fails. of functional integrity to perform as intended. component. relay contacts corrode and will not pass rated current closed. A protective device which functions as intended has not failed.

the failed element is overstressed/underqualified for its burden.. or calibrated for the application.. the failed element has been improperly designed. fatigue failure of a relay spring within its rated lifetime. leakage of a valve seal within its pressure rating. E.g. or installed. – SECONDARY FAILURE • Failure induced by exposure of the failed element to environmental and/or service stresses exceeding its intended ratings.Definitions – PRIMARY (OR BASIC) FAILURE • The failed element has seen no exposure to environmental or service stresses exceeding its ratings to perform. E. 9 8671 .g. or selected.

10 8671 . No sabotage.. Markov… – Fault rates are constant… = 1/MTBF = K – The future is independent of the past – i.e.Assumptions and Limitations I I I I Non-repairable system. mutually exclusive states. future states available to the system depend only upon its present state and pathways now available to it. not upon how it got where it is. Bernoulli… – Each system element analyzed has two.

(Called “Leaf. must be only these four (1) necessary and (2) sufficient to cause symbols. not developed further.The Logic Symbols TOP Event – forseeable. individual.” “Initiator. Events and Gates are not component parts of the system being analyzed. Any input.” or “Basic. undesirable event. They are symbols representing the logic of the analysis.or Intermediate event – describing a system state produced by antecedent events. “And” Gate – produces output if all inputs co-exist. They are bi-modal. individually must be (1) necessary and (2) sufficient to cause the output event Basic Event – Initiating fault/failure.”) The Basic Event marks the limit of resolution of the analysis. They function flawlessly. 11 8671 OR AND . All inputs. Most Fault Tree “Or” Gate – produces output if any input Analyses can be carried out using exists. the output event. toward which all fault tree logic paths flow.

” or “Basic”) indicates limit of analytical resolution.Steps in Fault Tree Analysis 1 3 Identify undesirable TOP event Link contributors to TOP by logic gates 2 Identify first-level contributors 5 Link second-level contributors to TOP by logic gates Identify second-level contributors 4 Basic Event (“Leaf.” “Initiator. 12 8671 6 Repeat/continue .

NO YES Don’t let gates feed gates. 13 8671 .Some Rules and Conventions Do use single-stem gate-feed inputs.

g. Use same name for same event/condition throughout the analysis. (Use index numbering for large trees. A large bass will not plug the hole in the hull. Lightning will not recharge the battery.) Say WHAT failed/faulted and HOW – e. 14 8671 .. “Switch Sw-418 contacts fail closed” Don’t expect miracles to “save” the system.More Rules and Conventions Be CONSISTENT in naming fault events/conditions.

No miracles! . – There’ll be a power outage when the worker’s hand contacts the highvoltage conductor.Some Conventions Illustrated Flat Tire ? Air Escapes From Casing Tire Pressure Drops Tire Deflates 15 8671 Initiators must be statistically independent of one another. Name basics consistently! MAYBE – A gust of wind will come along and correct the skid. – A sudden cloudburst will extinguish the ignition source.

Identifying TOP Events Explore historical records (own and others). Development “what-if” scenarios. Identify potential mission failure contributors.” 16 8671 . Use “shopping lists. Look to energy sources.

Example TOP Events Wheels-up landing Mid-air collision Subway derailment Turbine engine FOD Rocket failure to ignite Irretrievable loss of primary test data Dengue fever pandemic Sting failure Inadvertent nuke launch Reactor loss of cooling Uncommanded ignition Inability to dewater buoyancy tanks TOP events represent potential high-penalty losses (i. high risk).. Either severity of the outcome or frequency of occurrence can produce high risk.e. 17 8671 .

from external causes Unprotected body contact with potential greater than 40 volts Foreign object weighing more than 5 grams and having density greater than 3. 18 8671 .“Scope” the Tree TOP Too Broad Computer Outage Improved Outage of Primary Data Collection computer. To “scope.500 Exposed Conductor Foreign Object Ingestion Jet Fuel Dispensing Leak “Scoping” reduces effort spent in the analysis by confining it to relevant considerations.” describe the level of penalty or the circumstances for which the event becomes intolerable – use modifiers to narrow the event description. exceeding eight hours.2 gm/cc Fuel dispensing fire resulting in loss exceeding $2.

EFFECT Examples: Electrical power fails off Low-temp.043 btu/ft2/ sec Relay K-28 contacts freeze closed Transducer case ruptures CAUSE Proc. an action verb. under a given gate. 19 8671 . However. Step 42 omitted • (3) and. and individually under an OR gate. each fault must be independent of all (1) EACH others. Alarm fails off Solar q > 0.Adding Contributors to the Tree (2) must be an INDEPENDENT* FAULT or FAILURE CONDITION (typically described by a noun. each element must be an immediate contributor to the level above NOTE: As a group under an AND gate. contributing elements must be both necessary and sufficient to serve as immediate cause for the output event. the CONTRIBUTING same fault may ELEMENT appear at other points on the tree. and specifying modifiers) * At a given level.

Example Fault Tree Development Constructing the logic Spotting/correcting some common errors Adding quantitative data 20 8671 .

An Example Fault Tree Late for Work Undesirable Event Sequence Initiation Failures Oversleep Transport Failures Life Support Failures Process and Misc. or sequence of operation . System Malfunctions Causative Modalities* 21 8671 ? * Partitioned aspects of system function. physical arrangement. subdivided as the purpose.

Sequence Initiation Failures Oversleep No “Start” Pulse Natural Apathy Biorhythm Fails 22 8671 Artificial Wakeup Fails ? .

Verifying Logic Oversleep Does this “look” correct? Should the gate be OR? No “Start” Pulse Natural Apathy Artificial Wakeup Fails Biorhythm Fails ? 23 8671 .

too! .Test Logic in SUCCESS Domain Oversleep Redraw – invert all statements and gates “trigger” Wakeup Succeeds “motivation” No “Start” Pulse Failure Domain Natural Apathy “Start” Pulse Works Success Domain Natural High Torque BioRhythm Fails Artificial Wakeup Fails BioRhythm Fails Artificial Wakeup Works ? 24 8671 ? If it was wrong here……it’ll be wrong here.

Artificial Wakeup Fails Artificial Wakeup Fails Alarm Clocks Fail Nocturnal Deafness Main Plug-in Clock Fails Backup (Windup) Clock Fails Power Outage Faulty Innards Forget to Set Faulty Mechanism Forget to Set Forget to Wind Electrical Fault Mechanical Fault What does the tree tell up about system vulnerability at this point? Hour Hand Falls Off 25 8671 Hour Hand Jams Works .

Background for Numerical Methods Relating PF to R The Bathtub Curve Exponential Failure Distribution Propagation through Gates PF Sources 26 8671 .

Reliability and Failure Probability Relationships I I I I S = Successes F = Failures S Reliability… R =(S+F) Failure Probability… PF = F (S+F) S R + PF = (S+F)+ F ≡ 1 (S+F) = Fault Rate = 1 MTBF 27 8671 .

For exposure intervals that are brief (T < 0. BU O RN UT 0. for λT ≤ 20%) 1. PF is approximated within 2% by λT. faults occur at random times.63 0. During these periods.Significance of PF Random Failure T λ0 0 Fault probability is modeled acceptably well as a function of exposure interval (T) by the exponential.2 MTBF). PF ≅ λT (within 2%.0 t λ = 1 / MTBF (In B fa UR nt N M IN or ta lity ) 0 The Bathtub Curve Most system elements have fault rates (λ = 1/MTBF) that are constant (λ0) over long periods of useful life.5 PF = 1 – ε–λT ℜ = ε–λT 0 0 28 8671 T Exponentially Modeled Failure Probability 1 MTBF .

element failures produces system failure.B ≤ 0. independent. independent elements must fail to produce system failure. R + PF ≡ 1 ℜT = ℜA + ℜ B – ℜA ℜ B PF = 1 – ℜT PF = 1 – (ℜ A + ℜ B – ℜA ℜ B) PF = 1 – [(1 – PA) + (1 – PB) – (1 – PA)(1 – PB)] PF = PA + PB – PA PB …for PA. ℜT = ℜ A ℜB PF = 1 – ℜT PF = 1 (ℜA ℜB) PF = 1 – [(1 – PA)(1 – PB)] For 2 Inputs AND Gate Both of two.2 PF ≅ PA + PB with error ≤ 11% [Union / ∪] PF = PA PB [Intersection / ∩] “Rare Event Approximation” For 3 Inputs PF = PA PB PC Omit for approximation PF = PA + PB + PC – PA PB – PA PC – PB PC + PA PBPC 29 8671 .ℜ and PF Through Gates OR Gate Either of two.

1 P1 2 P2 PT = P1 P2 30 8671 PT = P1 + P2 – P1 P2 Usually negligible .PF Propagation Through Gates AND Gate… PT = Π Pe TOP PT = P1 P2 [Intersection / ∩] OR Gate… PT ≅ Σ Pe TOP PT ≅ P1+ P2 [Union / ∪] 1 P1 2 P2 1&2 are INDEPENDENT events.

“Ipping” Gives Exact OR Gate Solutions Failure TOP PT = ? Success TOP PT =Π (1 – Pe) Failure TOP PT = 1 P1 2 P2 3 P3 1 2 3 P3 = (1 – P3) 1 P1 2 P2 P1 = (1 – P1) The ip operator ( ) is the P2 = (1 – P2) co-function of pi (Π). Its use is rarely justifiable. It PT = Pe= 1 – Π (1 – Pe) provides an exact solution for propagating PT = 1 – [(1 – P1) ( 1 – P2) (1 – P3 … (1 – Pn )] probabilities through the OR gate. 31 8671 Π Pe 3 P3 Π Π .

Exclusive OR Gate… PT = P1 + P2 – 2 (P1 x P2) Opens when any one (but only one) event occurs.More Gates and Symbols Inclusive OR Gate… PT = P1 + P2 – (P1 x P2) Opens when any one or more events occur. M 32 8671 . the Rare Event ApproxiPT ≅ Σ Pe mation may be used for small values of Pe. For all OR Gate cases. Mutually Exclusive OR Gate… PT = P1 + P2 Opens when any one of two or more events occur. All other events are then precluded.

33 8671 . External Event An event normally expected to occur. Inhibit Gate Opens when (single) input event occurs in presence of enabling condition. Undeveloped Event An event not further developed. Conditioning Event Applies conditions or restrictions to other symbols.Still More Gates and Symbols Priority AND Gate PT = P1 x P2 Opens when input events occur in predetermined sequence.

Some Failure Probability Sources Manufacturer’s Data Industry Consensus Standards MIL Standards Historical Evidence – Same or Similar Systems Simulation/testing Delphi Estimates ERDA Log Average Method 34 8671 .

1 = 0.. DOE 76-45/11. September 1982.e.5 = 0.5 times the lower bound and 0. • Average the logarithms of the upper and lower bounds. for the example shown. 5. Thus.04 0. 35 8671 .55 times the upper bound * Reference: Briscoe..01 + 0.” System Safety Development Center.0316228 Probability 2 2 Bound 10–1 Note that. • The antilogarithm of the average of the logarithms of the upper and lower bounds is less than the upper bound and greater than the lower bound by the same factor.05 0. but upper and lower credible bounds can be judged… • Estimate upper and lower credible bounds of probability for the phenomenon in question. “Risk Management Guide. the arithmetic average would be… 0. Glen J.0316+ PL Lower Probability Bound 10–2 0.1 PU Upper Log PL + Log PU Log Average = Antilog = Antilog (–2) + (–1) = 10–1.055 2 i. 0.03 0.07 0. it is geometrically midway between the limits of estimation.Log Average Method* If probability is not estimated easily. SSDC-11.0 2 0.01 0.

including numerous industry-specific proprietary listings 36 8671 .” (Table XI-1). “Reactor Safety Study – An Assessment of Accident Risks in US Commercial Nuclear Power Plants.More Failure Probability Sources WASH-1400 (NUREG-75/014). 1986 Many others.” 1975 IEEE Standard 500 Government-Industry Data Exchange Program (GIDEP) Rome Air Development Center Tables NUREG-0492. “Fault Tree Handbook.

0035 29.” Prentice Hall . “Handbook of System and Product Safety.0 5.016 80.01 Average 1.0 3.0 0.0 10.0048 41.0 0.Typical Component Failure Rates Failures Per 106 Hours Device Semiconductor Diodes Transistors Microwave Diodes MIL-R-11 Resistors MIL-R-22097 Resistors Rotary Electrical Motors Connectors 37 8671 Minimum 0.10 Maximum 10.0 0.0 0.0 12.60 0.0 500.0 0.10 0.0 22.0 Source: Willie Hammer.10 3.0 10.

labeled.0001-0.003 avg.S.2-0.Typical Human Operator Failure Rates Activity *Error of omission/item embedded in procedure *Simple arithmetic error with self-checking *Inspector error of operator oversight *General rate/high stress/ dangerous activity **Checkoff provision improperly used **Error of omission/10-item checkoff list **Carry out plant policy/no check on operator **Select wrong control/group of identical.01 (0.005-0.) Sources: * WASH-1400 (NUREG-75/014).1-0.5 avg. “Handbook of Human Reliability Analysis with Emphasis on 38 Nuclear Power Plant Applications.001-0.005 (0.) 0. “Reactor Safety Study – An Assessment of Accident Risks in U.09 (0.05 (0.3 0.” 1980 8671 . Commercial Nuclear Power Plants.01 avg.” 1975 **NUREG/CR-1278. controls Error Rate 3 x 10–3 3 x 10–2 10–1 0.) 0.001 avg.) 0.

Some Factors Influencing Human Operator Failure Probability Experience Stress Training Individual self discipline/conscientiousness Fatigue Perception of error consequences (…to self/others) Use of guides and checklists Realization of failure on prior attempt Character of Task – Complexity/Repetitiveness 39 8671 .

Faults/Year………..83 x 10–2 Power Outage 3. x 10–3 2/1 Forget to Wind 1. x 10–2 3/1 Electrical Fault 3. x 10–8 Hour Hand Jams Works 2. x 10–4 Forget to Set 8. 2/1 Assume 260 operations/year Alarm Clocks Fail 3.82 x 10–2 Faulty Innards 1..34 x 10–4 approx. x 10–4 1/10 .1 / yr Nocturnal Deafness Negligible Backup (Windup) Clock Fails 1. x 10–4 1/10 Forget to Set 8. x 10–4 1/15 Hour Hand Falls Off 40 8671 3. 0. x 10–4 1/20 4.Artificial Wakeup Fails Artificial Wakeup Fails KEY: Faults/Operation……….34 x 10–4 Main Plug-in Clock Fails 1. x 10–3 2/1 Faulty Mechanism 4.8. x 10–2 3/1 Mechanical Fault 8. X 10–3 Rate.

S. “The Loss Rate Concept in Safety Engineering” * National Safety Council.T. Motor vehicles fatalities Death by disease (U. R.) U.L.S. Earth destroyed by extraterrestrial hit ≅10–2.10–3/exp MH† ≅10–3/exp hr† ≅10–4/exp hr† ≅10–5/exp hr† ≅10–6/exp MH† ≅10–6/exp MH ≅10–7-10–8/exp MH† ≅10–9/exp MH* ≅10–10/exp hr‡ ≅10–14/exp hr† 41 8671 † Browning..” Tenth International System Safety Conference .HOW Much PT is TOO Much? Consider “bootstrapping” comparisons with known risks… Human operator error (response to repetitive stimulus) Internal combustion engine failure (spark ignition) Pneumatic instrument recorder failure Distribution transformer failure U. “Analytical Methods Applicable to Risk Assessment & Prevention.S. lifetime avg. Employment fatalities Death by lightning Meteorite (>1 lb) hit on 103x 103 ft area of U. J.S.. “Accident Facts” ‡ Kopecek.

Apply Scoping What power outages are of concern? Power Outage 1 X 10–2 3/1 Not all of them! Only those that… • Are undetected/uncompensated • Occur during the hours of sleep • Have sufficient duration to fault the system This probability must reflect these conditions! 42 8671 .

” Professional Safety – March 1980 43 8671 .Single-Point Failure “A failure of one independent element of a system which causes an immediate hazard to occur and/or causes the whole system to fail.

1. PT = 0.01.1 may cost much less than one element having P = 0.Some AND Gate Properties TOP PT = P1 x P2 1 2 Cost: Assume two identical elements having P = 0. 44 8671 .01 Two elements having P = 0. Freedom from single point failure: Redundancy ensures that either 1 or 2 may fail without inducing TOP.

Fault Hand Falls/ Jams Works Gearing Fails Other Mech.Failures at Any Analysis Level Must Be Don’t • Independent of each other • True contributors to the level above Mechanical Fault Faulty Innards Do Independent Hand Falls Off Hand Jams Works Elect. Fault Alarm Failure Alarm Failure True Contributors Alarm Clock Fails Toast Burns Backup Clock Fails Alarm Clock Fails Backup Clock Fails 45 8671 .

will induce the occurrence of two or more fault tree elements. if it occurs.” Oversight of Common Causes is a frequently found fault tree flaw! 46 8671 .Common Cause Events/Phenomena “A Common Cause is an event or a phenomenon which.

and that source fails.Common Cause Oversight – An Example Unannunciated Intrusion by Burglar Microwave ElectroOptical Seismic Footfall Acoustic DETECTOR/ALARM FAILURES 47 8671 Four. The AND gate to the TOP event seems appropriate. wholly independent alarm systems are provided to detect and annunciate intrusion. suppose the four systems share a single source of operating power. Redundancy appears to be absolute. and there are no backup sources? . But. No two of them share a common operating principle.

leading to the TOP event through an OR gate. 48 8671 . if it occurs. OTHER COMMON CAUSES SHOULD ALSO BE SEARCHED FOR. will disable all four alarm systems.Common Cause Oversight Correction Unannunciated Intrusion by Burglar Detector/Alarm Failure Detector/Alarm Power Failure Microwave Electro-Optical Seismic Footfall Acoustic Basic Power Failure Emergency Power Failure Here. power source failure has been recognized as an event which. Power failure has been accounted for as a common cause event.

Example Common Cause Fault/Failure Sources Utility Outage – Electricity – Cooling Water – Pneumatic Pressure – Steam Moisture Corrosion Seismic Disturbance 49 8671 Dust/Grit Temperature Effects (Freezing/Overheat) Electromagnetic Disturbance Single Operator Oversight Many Others .

Separately powering/servicing/maintaining redundant elements. 50 8671 . Using independent operators/inspectors.Example Common Cause Suppression Methods Separation/Isolation/Insulation/Sealing/ Shielding of System Elements. Using redundant elements having differing operating principles.

Unannunciated Intrusion by Burglar SYSTEM CHALLENGE Detector/Alarm Failure Intrusion By Burglar Detector/Alarm System Failure Detector/Alarm Power Failure Burglar Present Barriers Fail Microwave Electro-Optical Seismic Footfall Acoustic Basic Power Failure Emergency Power Failure 51 8671 . The logic criteria of necessity and sufficiency must be satisfied.Missing Elements? Contributing elements must combine to satisfy all conditions essential to the TOP event.

Problem: Using the test for infection and the anti-toxin. However. if the test indicates need for it. The test for infection produces a false positive result in 2% of all uninfected astronauts and a false negative result in one percent of all infected astronauts. Anti-toxin administered during the incubation period is 100% effective in preventing the disease when administered to an infected astronaut. lassitude. Incubation period is 13 days. For a week thereafter. what is the probability that a returning astronaut will be a victim of malpractice? 52 8671 . it produces disorientation. Both treatment of an uninfected astronaut and failure to treat an infected astronaut constitute in malpractice. and a very crabby outlook.Example Problem – Sclerotic Scurvy – The Astronaut’s Scourge BACKGROUND: Sclerotic scurvy infects 10% of all returning astronauts. for an uninfected astronaut. confusion. A test can be used during the incubation period to determine whether an astronaut has been infected. and intensifies all undesirable personality traits for about seven days. victims of the disease display symptoms which include malaise.

succumb to side effects .001 Should the test be used? False Negative Test 0.02 10% of returnees are infected – 90% are not infected 1% of infected cases test falsely negative.9 False Positive Test 0. succumb to disease 53 8671 2% of uninfected cases test falsely positive. receive treatment.1 Healthy Astronaut 0.01 Infected Astronaut 0. receive no treatment.019 What is the greatest contributor to this probability? Treat Needlessly (Side Effects) 0.018 Fail to Treat Infection (Disease) 0.Sclerotic Scurvy Malpractice Malpractice 0.

Cut Sets AIDS TO… System Diagnosis Reducing Vulnerability Linking to Success Domain 54 8671 .

will cause the TOP event to occur. if all occur. will cause the TOP event to occur. 55 8671 . A MINIMAL CUT SET is a least group of fault tree initiators which. if all occur.Cut Sets A CUT SET is any group of fault tree initiators which.

As the construction progresses: Replace the letter for each AND gate by the letter(s)/number(s) for all gates/initiators which are its inputs. Display these horizontally. in matrix columns. Each newly formed OR gate replacement row must also contain all other entries found in the original parent row. Proceeding stepwise from TOP event downward.Finding Cut Sets 1. in matrix rows. 56 8671 . construct a matrix using the letters and numbers. Starting immediately below the TOP event. assign a unique letter to each gate. 2. Display these vertically. Ignore all tree elements except the initiators (“leaves/basics”). Replace the letter for each OR gate by the letter(s)/number(s) for all gates/initiators which are its inputs. 3. and assign a unique number to each initiator. The letter representing the TOP event gate becomes the initial matrix entry.

Finding Cut Sets 4. eliminate any row that contains all elements found in a lesser row. By inspection. Each row of this matrix is a Boolean Indicated Cut Set. A final matrix results. 57 8671 . The rows that remain are Minimal Cut Sets. displaying only numbers representing initiators. Also eliminate redundant elements within rows and rows that duplicate other rows.

If a basic initiator appears more than once. starting with the TOP “A” gate. represent it by the same number at each appearance. – Assign numbers to basic initiators. (TOP gate is “A. – Construct a matrix.A Cut Set Example PROCEDURE: – Assign letters to gates. TOP A B 1 C 2 D 4 2 3 58 8671 .”) Do not repeat letters.

replace it horizontally. its inputs. its inputs. . Minimal Cut Set rows are least groups of initiators which will induce TOP. 1 & C. 1 2 2 3 1 4 C is an AND gate. replace it horizontally. replace it vertically. 2 & 3. is an OR gate. Each requires a new row. Each requires a new row. is an OR gate. its inputs. replace it vertically. 59 8671 1 2 2 2 3 1 4 2 4 3 These BooleanIndicated Cut Sets… …reduce to these minimal cut sets. 2 & 4. Replace as before. B is an OR gate. B & D. its inputs. A is an AND gate.A Cut Set Example A B D 1 D C D 1 D 2 D 3 TOP event gate is A. 1 2 2 D 3 1 4 D (top row). the initial matrix entry. D (second row).

for which the Minimal Cut Sets were derived. For example. 60 8671 .An “Equivalent” Fault Tree An Equivalent Fault Tree can be constructed from Minimal Cut Sets. these Minimal Cut Sets… 1 2 1 2 3 4 1 2 1 TOP Boolean Equivalent Fault Tree 4 2 3 …represent this Fault Tree… …and this Fault Tree is a Logic Equivalent of the original.

9 gates 24 initiators 1 2 3 4 5 6 TOP Minimal cut sets 1/3/5 1/3/6 1/4/5 1/4/6 2/3/5 2/3/6 2/4/5 2/4/6 61 8671 1 3 5 1 3 6 1 4 5 1 4 6 2 3 5 2 3 6 2 4 5 2 4 6 .Equivalent Trees Aren’t Always Simpler 4 gates 6 initiators This Fault Tree has this logic equivalent.

Another Cut Set Example Compare this case to the first Cut Set example – note differences. TOP gate was AND. 3 62 8671 TOP A B C 6 D F 3 E 5 G 4 4 1 . TOP gate here is OR. 1 In the first example. 2 Proceed as with first example.

Another Cut Set Example Construct Matrix – make step-by-step substitutions…
A B C 1 D F 6 1 2 F D I E 1 2 3 5 G 6 1 E

Boolean-Indicated Cut Sets Minimal Cut Sets
1 2 3 5 G 1 3 1 4 6 1 3 1 1 3 2 5 G 3 4 5 1 6 1 1 1 3 2 3 4 4

5

6

6

Note that there are four Minimal Cut Sets. Co-existence of all of the initiators in any one of them will precipitate the TOP event.

An EQUIVALENT FAULT TREE can again be constructed…
63
8671

Another “Equivalent” Fault Tree
These Minimal Cut Sets… represent this Fault Tree – a Logic Equivalent of the original tree.
TOP 1 1 1 3 2 3 4 4 5 6

1
64
8671

2

1

3

1

4

3

4

5

6

From Tree to Reliability Block Diagram
TOP A

Blocks represent functions of system elements. Paths through them represent success.
“Barring” terms (n) denotes consideration of their success properties. C

3 5 4 6 1

B

2

3 1

4

1
D F

6 3
E

2

5
G

3

4

4

1

TOP The tree models a system fault, in failure domain. Let that fault be System Fails to Function as Intended. Its opposite, System Succeeds to function as intended, can be represented by a Reliability Block Diagram in which success flows through system element functions from left to right. Any path through the block diagram, not interrupted by a fault of an element, results in system success.

65
8671

) Minimal Cut Sets . (It contains 1/3. a true Minimal Cut Set.Cut Sets and Reliability Blocks TOP A 3 2 B C 3 1 4 4 5 1 6 1 D F 6 3 E 2 5 G 1 1 1 3 2 3 4 4 5 6 3 4 4 1 66 8671 Each Cut Set (horizontal rows in the matrix) interrupts all left-to-right paths through the Reliability Block Diagram Note that 3/5/1/6 is a Cut Set. but not a Minimal Cut Set.

Cut Set Uses Evaluating PT Finding Vulnerability to Common Causes Analyzing Common Cause Probability Evaluating Structural Cut Set “Importance” Evaluating Quantitative Cut Set “Importance” Evaluating Item “Importance” 67 8671 .

the product of probabilities for events within the Cut Set. Pk = Π Pe = P1 x P2 x P3 x…Pn 1 6 .Cut Set Uses/Evaluating PT TOP A PT Minimal Cut Sets 1 1 C 2 3 4 4 5 6 B 1 3 6 1 D F 2 E 3 5 G Pt ≅ Σ P k = P 1 x P2 + P1 x P3 + P1 x P4 + P3 x P4 x P5 x P6 Note that propagating probabilities through an “unpruned” tree.e. would produce a falsely high PT. is the probability that the Cut Set being considered will induce TOP. 1 2 3 5 4 6 1 3 1 4 3 5 3 4 4 1 68 8671 Cut Set Probability (Pk).. using Boolean-Indicated Cut Sets rather than minimal Cut Sets. i .

Some may have no Common ADVICE: Moisture proof one or more items. subscript designators.g…. 69 3m 4m 4m 1v 8671 . e. Cause vulnerability – receive no subscripts. l = location (code where) m = moisture h = human operator Minimal Cut Sets q = heat 1 v 2h f = cold 6m v = vibration 1v 3 m …etc.Cut Set Uses/Common Cause Vulnerability TOP A B C 1v D F Uniquely subscript initiators. using letter indicators of common cause susceptibility. Moisture is a Common Cause Some Initiators may be vulnerable to several Common Causes and receive several corresponding and can induce TOP. 1v 4 m 2h E 3m 5m G 3m 4m 5m 6m All Initiators in this Cut Set are vulnerable to moisture.

and (2) inducing all terms within the affected cut set.Analyzing Common Cause Probability TOP PT System Fault These must be OR Common-Cause Induced Fault Analyze as usual… …others Moisture Vibration Human Operator Heat 70 8671 Introduce each Common Cause identified as a “Cut Set Killer” at its individual probability level of both (1) occurring. .

Analyzing Structural Importance enables qualitative ranking of contributions to System Failure.Cut Set Structural “Importance” TOP A Minimal Cut Sets 1 1 C 2 3 4 4 5 6 B 1 3 6 1 D F 2 E 3 5 G 3 4 4 1 All other things being equal… • A LONG Cut Set signals low vulnerability • A SHORT Cut Set signals higher vulnerability • Presence of NUMEROUS Cut Sets signals high vulnerability …and a singlet cut set signals a Potential Single-Point Failure. 71 8671 .

Cut Set Quantitative “Importance” TOP A PT B C The quantitative importance of a Cut Set (Ik) is the numerical probability that. attack Cut Sets having greater Importance. Generally. given that TOP has occurred. short Cut Sets have greater Importance. that Cut Set has induced it. long Cut Sets have lesser Importance. To reduce system vulnerability most effectively. Pk Ik = PT 6 …where Pk = Π Pe = P3 x P4 x P5 x P6 Minimal Cut Sets 1 D F 2 E 3 5 G 1 1 1 3 2 3 4 4 5 6 3 4 4 1 Analyzing Quantitative Importance enables numerical ranking of contributions to System Failure. 72 8671 .

Ne = Number of Minimal Cut Sets containing Item e Ne Ie ≅ Σ Ike Minimal Cut Sets 1 1 1 3 73 8671 Ike = Importance of the Minimal Cuts Sets containing Item e Example – Importance of item 1… 2 3 4 4 5 6 I1 ≅ (P1 x P2) + (P1 x P3) + (P1 x P4) PT . given that TOP has occurred.Item ‘Importance” The quantitative Importance of an item (Ie) is the numerical probability that. that item has contributed to it.

Path Sets Aids to… Further Diagnostic Measures Linking to Success Domain Trade/Cost Studies 74 8671 .

Path Sets will be the result. Then proceed using matrix construction as for Cut Sets. 75 8671 .Path Sets A PATH SET is a group of fault tree initiators which. Path Sets are complements of Cut Sets. TO FIND PATH SETS* change all AND gates to OR gates and all OR gates to AND. *This Cut Set-to-Path-Set conversion takes advantage of de Morgan’s duality theorem. will guarantee that the TOP event cannot occur. if none of them occurs.

guarantee against TOP 6 occurring 1 G 2 E 3 5 3 4 5 6 3 4 1 1 1 2 1 1 1 3 76 8671 2 3 4 4 5 6 3 4 4 1 …and these Path Sets “Barring” terms (n) denotes consideration of their success properties .A Path Set Example TOP A B C This Fault Tree has these Minimal Cut sets 1 D F Path Sets are least groups of initiators which. if they cannot occur.

.Path Sets and Reliability Blocks TOP A 3 B C 2 3 1 4 4 5 1 6 1 D F 6 3 E 2 5 G 1 1 1 1 3 4 5 6 3 4 4 1 77 8671 2 3 4 Path Sets Each Path Set (horizontal rows in the matrix) represents a left-toright path through the Reliability Block Diagram.

To minimize failure probability. minimize path set probability. Pick the countermeasure ensemble(s) giving the most favorable ∆ Pp / ∆ $.) . a b c d e 78 8671 1 1 1 1 2 3 4 5 6 3 4 PPe Sprinkle countermeasure resources amongst the Path Sets. Compute the probability decrement for each newly adjusted Path Set option. (Selection results can be verified by computing ∆ PT/ ∆ $ for competing candidates.Pat Sets and Trade Studies 3 2 3 1 4 4 6 Path Sets Pp PPa PPb PPc PPd $ $a $b $c $d $e 5 1 Pp ≅ Σ Pe Path Set Probability (Pp) is the probability that the system will suffer a fault at one or more points along the operational route modeled by the path.

Rank items using Ie’ Ie ≅ Σ Ike Identify items amenable to improvement. Examine/alter system architecture – increase path set/cut set ratio. schedule) AND Does the new countermeasure… • Introduce new HAZARDS? • Cripple the system? 79 8671 . – Derate components (increase robustness/reduce Pe).} Ik= Pk/ PT Identify items amenable to improvement. N Evaluate item importance. – Fortify maintenance/parts replacement (increase MTBF). Pp ≅ Σ Pe } For all new countermeasures. THINK… • COST • EFFECTIVENESS • FEASIBILITY (incl. } e Evaluate path set probability.Reducing Vulnerability – A Summary Inspect tree – find/operate on major PT contributors… – Add interveners/redundancy (lengthen cut sets). Evaluate Cut Set Importance. Reduce PP at most favorable ∆P/∆ $. Rank items using Ik.

Some Diagnostic and Analytical Gimmicks A Conceptual Probabilistic Model Sensitivity Testing Finding a PT Upper Limit Limit of Resolution – Shutting off Tree Growth State-of-Component Method When to Use Another Technique – FMECA 80 8671 .

Some Diagnostic Gimmicks Using a “generic” all-purpose fault tree… TOP PT 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 81 8671 .

The “peg count” ratio for each wheel is determined by 13 14 12 15 probability for that initiator. Spin all initiator wheels once for each system exposure interval.Think “Roulette Wheels” TOP PT A convenient. 26 28 29 27 25 21 P22 = 3 x 10–3 1. thought-tool model of probabilistic tree modeling… 1 2 3 4 5 6 7 10 11 16 17 22 23 24 Imagine a roulette wheel representing 9 8 each initiator.000 peg spaces 997 white 3 red 82 8671 30 31 32 33 34 . Wheels “winning” in 20 18 19 gate-opening combinations provide a path to the TOP.

21 ´ ∆ PT 18 19 now. get a better Pe. Is the corresponding PT acceptable? If not.Compute PT for a nominal value of Pe. Perform a crude sensitivity test to obtain quick relief from worry… or.Compute PT for a value of Pe at its upper credible limit. recompute PT 20 for a new Pe = Pe + ∆ Pe.1 in a large tree. 31 32 33 34 30 83 8671 . there’s a bothersome initiator with 9 8 an uncertain Pe.Use Sensitivity Tests TOP PT Gaging the “nastiness” of untrustworthy initiators… 1 2 3 4 5 6 7 10 11 12 P10 = ? 16 17 22 23 24 25 Embedded within the tree. work to ~27 28 26 29 Find a value for Pe having less uncertainty…or… 2. Then. to justify the urgency of need for more exact input data: 13 14 15 1. compute the “Sensitivity” of Pe = ´ ∆ Pe If this sensitivity exceeds ≈ 0.

Find a Max PT Limit Quickly The “parts-count” approach gives a sometimes-useful early estimate of PT… TOP PT 1 2 3 4 5 6 7 PT cannot exceed an8 upper bound given by: 9 PT(max) = Σ Pe = P1 + P2 + P3 + …Pn 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 84 8671 .

use the tree to guide risk reduction. The tree analysis provides probability. The TOP statement gives severity. If NO. ANALYZE 4 3 5 2 NO FURTHER DOWN THAN IS NECESSARY TO ENTER PROBABILITY DATA WITH CONFIDENCE. SOME EXCEPTIONS… 8 6 9 7 1. 16 17 18 19 20 21 ? Initiators / leaves / basics define the LIMIT OF RESOLUTION of the analysis. stop. Is risk acceptable? If YES.How Far Down Should a Fault Tree Grow? Severity TOP PT Probability 1 Where do you stop the analysis? The analysis is a Risk Management enterprise.) An event within the tree has alarmingly high probability. Dig deeper beneath it to find the source(s) of the high probability.) Mishap autopsies must sometimes analyze down to the cotter-pin level to produce a “credible cause” list. 10 11 12 13 14 15 2. ? 85 8671 .

State-of-Component Method
Relay K-28 Contacts Fail Closed

WHEN – Analysis has proceeded to the device level – i.e., valves, pumps, switches, relays, etc. HOW – Show device fault/failure in the mode needed for upward propagation.
Relay K-28 Secondary Fault

Basic Failure/ Relay K-28

Relay K-28 Command Fault

Install an OR gate. Place these three events beneath the OR. This represents faults from environmental and service stresses for which the device is not qualified – e.g., component struck by foreign object, wrong component selection/installation. (Omit, if negligible.)

This represents internal “self” failures under normal environmental and service stresses – e.g., coil burnout, spring failure, contacts drop off…
86
8671

Analyze further to find the source of the fault condition, induced by presence/absence of external command “signals.” (Omit for most passive devices – e.g., piping.)

The Fault Tree Analysis Report
Title Company Author Date etc.

Executive Summary (Abstract of complete report) Scope of the analysis… Say what is analyzed
Brief system description and TOP Description/Severity Bounding what is not analyzed. Analysis Boundaries Interfaces Treated Physical Boundaries Resolution Limit Operational Boundaries Exposure Interval Operational Phases Others… Human Operator In/out

The Analysis

Show Tree as Figure. Discussion of Method (Cite Refs.) Include Data Sources, Software Used Cut Sets, Path Sets, etc. Presentation/Discussion of the Tree as Tables. Source(s) of Probability Data (If quantified) Common Cause Search (If done) Sensitivity Test(s) (If conducted) Cut Sets (Structural and/or Quantitative Importance, if analyzed) Path Sets (If analyzed) Trade Studies (If Done)

Findings…

TOP Probability (Give Confidence Limits) Comments on System Vulnerability Chief Contributors Candidate Reduction Approaches (If appropriate) Risk Comparisons (“Bootstrapping” data, if appropriate) Is further analysis needed? By what method(s)?

Conclusions and Recommendations…
87
8671

FTA vs. FMECA Selection Criteria*
Selection Characteristic Safety of public/operating/maintenance personnel Small number/clearly defined TOP events Indistinctly defined TOP events Full-Mission completion critically important Many, potentially successful missions possible “All possible” failure modes are of concern High potential for “human error” contributions High potential for “software error” contributions Numerical “risk evaluation” needed Very complex system architecture/many functional parts Linear system architecture with little/human software influence System irreparable after mission starts
88
8671

Preferred FTA FMECA √ √ √ √ √ √ √ √ √ √ √ √

*Adapted from “Fault Tree Analysis Application Guide,” Reliability Analysis Center, Rome Air Development Center.

Initiators at a given analysis level beneath a common gate must be independent of each other. Events/conditions at any analysis level must be true. 89 8671 . immediate contributors to next-level events/conditions.Fault Tree Constraints and Shortcomings Undesirable events must be foreseen and are only analyzed singly. Each fault/failure initiator must be constrained to two conditional modes when modeled in the tree. All significant contributors to fault/failure must be anticipated. Each Initiator’s failure rate must be a predictable constant.

3 x 10–7/hour acceptable for a catastrophe? Overlooking common causes – Will a roof leak or a shaking floor wipe you out? Misapplication – Would Event Tree Analysis (or another technique) serve better? Scoping changes in mid-tree 90 8671 .0232 x 10–5…+/–? Credence in preposterously low probabilities – 1.666 x 10–24/hour Unpreparedness to deal with results (particularly quantitative) – Is 4.Common Fault Tree Abuses Over-analysis – “Fault Kudzu” Unjustified confidence in numerical results – 6.

Identifying potential single point failures. FAULT TREE ANALYSIS is a risk assessment enterprise. Guiding system reconfiguration to reduce vulnerability. Optimizing resource deployment to control vulnerability.Fault Tree Payoffs Gaging/quantifying system failure probability. Assessing system Common Cause vulnerability. Identifying Man Paths to disaster. 91 8671 . Supporting trade studies with differential analyses. Risk Probability is the result of the tree analysis. Risk Severity is defined by the TOP event.

Closing Caveats Be wary of the ILLUSION of SAFETY. Low probability does not mean that a mishap won’t happen! THERE IS NO ABSOLUTE SAFETY! An enterprise is safe only to the degree that its risks are tolerable! Apply broad confidence limits to probabilities representing human performance! A large number of systems having low probabilities of failure means that A MISHAP WILL HAPPEN – somewhere among them! P1 + P2+ P3+ P4 + ----------Pn ≈ 1 More… 92 8671 .

99 0.7 System Behavior I Constant I Constant System Properties Service Stresses I Constant Environmental Stresses Don’t drive the numbers into the ground! 93 8671 .997 0.Caveats Do you REALLY have enough data to justify QUANTITATIVE ANALYSIS? For 95% confidence… We must have no failures in Assumptions: I Stochastic to give PF ≅… and ℜ ≅ … 1.9 0.97 0.000 tests 300 tests 100 tests 30 tests 10 tests 3 x 10–3 10–2 3 x 10–2 10–1 3 x 10–1 0.

Analyze Only to Turn Results Into Decisions “Perform an analysis only to reach a decision. It is not effective to do so. Grose George Washington University 94 8671 . V.” Dr.L. It is a waste of resources. Do not perform an analysis if that decision can be reached without it.