PO Box 10412
Palo Alto, CA 94303-0813
USA
800.313.3774
650.855.2121
askepri@epri.com
www.epri.com
3002000509
Final Report, June 2013
DISCLAIMER OF WARRANTIES AND LIMITATION OF LIABILITIES
(A) MAKES ANY WARRANTY OR REPRESENTATION WHATSOEVER, EXPRESS OR IMPLIED, (I) WITH
RESPECT TO THE USE OF ANY INFORMATION, APPARATUS, METHOD, PROCESS, OR SIMILAR ITEM
DISCLOSED IN THIS DOCUMENT, INCLUDING MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE, OR (II) THAT SUCH USE DOES NOT INFRINGE ON OR INTERFERE WITH PRIVATELY OWNED
RIGHTS, INCLUDING ANY PARTY'S INTELLECTUAL PROPERTY, OR (III) THAT THIS DOCUMENT IS SUITABLE
TO ANY PARTICULAR USER'S CIRCUMSTANCE; OR
(B) ASSUMES RESPONSIBILITY FOR ANY DAMAGES OR OTHER LIABILITY WHATSOEVER (INCLUDING ANY
CONSEQUENTIAL DAMAGES, EVEN IF EPRI OR ANY EPRI REPRESENTATIVE HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES) RESULTING FROM YOUR SELECTION OR USE OF THIS DOCUMENT OR
ANY INFORMATION, APPARATUS, METHOD, PROCESS, OR SIMILAR ITEM DISCLOSED IN THIS
DOCUMENT.
REFERENCE HEREIN TO ANY SPECIFIC COMMERCIAL PRODUCT, PROCESS, OR SERVICE BY ITS TRADE
NAME, TRADEMARK, MANUFACTURER, OR OTHERWISE, DOES NOT NECESSARILY CONSTITUTE OR
IMPLY ITS ENDORSEMENT, RECOMMENDATION, OR FAVORING BY EPRI.
Électricité de France
THE TECHNICAL CONTENTS OF THIS DOCUMENT WERE NOT PREPARED IN ACCORDANCE WITH THE
EPRI NUCLEAR QUALITY ASSURANCE PROGRAM MANUAL THAT FULFILLS THE REQUIREMENTS OF 10 CFR
50, APPENDIX B AND 10 CFR PART 21, ANSI N45.2-1977 AND/OR THE INTENT OF ISO-9001 (1994).
USE OF THE CONTENTS OF THIS DOCUMENT IN NUCLEAR SAFETY OR NUCLEAR QUALITY
APPLICATIONS REQUIRES ADDITIONAL ACTIONS BY USER PURSUANT TO THEIR INTERNAL PROCEDURES.
NOTE
For further information about EPRI, call the EPRI Customer Assistance Center at 800.313.3774 or
e-mail askepri@epri.com.
Electric Power Research Institute, EPRI, and TOGETHER…SHAPING THE FUTURE OF ELECTRICITY are
registered service marks of the Electric Power Research Institute, Inc.
Copyright © 2013 Electric Power Research Institute, Inc. All rights reserved.
Acknowledgments
The following organizations, under contract to the Electric Power
Research Institute (EPRI), prepared this report:
Principal Investigators
B. Geddes
M. Bailey
L. Freil
J. Thomas
B. Antoine
N. Geddes
Principal Investigator
D. Blanchard
Électricité de France
6, Quai Watier
78400 Chatou, France
Principal Investigator
N. Thuy
Product Description
This report documents an investigation of the use of various hazard
and failure analysis methods to reveal potential vulnerabilities in
digital instrumentation and control (I&C) systems before they are
put into operation. The report looks at six approaches, ranging from
well-established practices to novel methods still transitioning from
academic demonstrations to practical, realistic applications. It
includes step-by-step procedures and worked examples, applying
each of the methods to sample problems based on actual cases to
assess the methods for effectiveness, range of applicability and
practicality of use by nuclear plant engineers and their suppliers.
Background
The lack of established practical methods for evaluating and
managing potential failure modes and mechanisms of digital I&C
systems is adversely affecting design, risk assessment, and licensing
efforts involving digital equipment. Results include undesired and
costly plant transients, significant increases in system costs and
complexity without commensurate safety benefits, and difficulty in
obtaining regulatory acceptance of designs that can improve
dependability and reduce overall risk. Traditional failure analysis
methods developed for hardware-based systems, primarily failure
modes and effects analysis (FMEA), are proving less effective and
more costly than desired. This report extends the earlier project
results documented in the 2011 EPRI report 1022985, Failure
Analysis of Digital Instrumentation and Control Equipment and Systems
– Demonstration of Concept. Most significantly, it offers improved
procedures, more examples, and more detailed discussion of the
methods, including new approaches that combine methods to help
improve the effectiveness and efficiency of the analysis.
Objectives
The research was intended to identify hazard and failure analysis
methods that could be applied to improve current practices,
demonstrate their potential effectiveness using realistic nuclear plant
examples, and develop a practical methodology for use by utility
engineers and their suppliers.
Approach
Building on the 2011 demonstration of concept results, the project
team developed additional worked examples of varying complexity to
better understand the strengths, weaknesses, and applicability of the
approaches. They looked at six methods: functional FMEA, design
FMEA, a top-down method using fault tree analysis (FTA),
HAZard and OPerability (HAZOP) analysis, systems theoretic
process analysis (STPA), and purpose graph analysis (PGA). Based
on lessons learned from the examples, the project team developed
step-by-step procedures for each of the methods. The notion of
potential hybrid or blended methods that combine top-down and
bottom-up approaches to improve efficiency and effectiveness was
given additional attention, with identification of logical transfer
points from one method to another. The digital failure analysis
taxonomy that was started in 2011 was expanded to include
additional devices. Utility engineers and technical experts from the
project team reviewed the results and provided feedback that was
subsequently incorporated.
Results
For each of the approaches studied, the report contains a detailed
description of the method, a step-by-step procedure, worked
examples, and a discussion of the method’s strengths and weaknesses.
Some methods focus on causes and effects of component failures.
Others also consider undesired behaviors that do not involve
component failures. This is particularly important for complex digital
systems because a significant percentage of mishaps involve undesired
behaviors that occur under unanticipated or untested operating
conditions, but with all components operating as designed. The
report also discusses the steps involved in planning hazard analysis
activities in the context of a plant modification effort.
Keywords
Digital instrumentation and control
Failure analysis
Failure modes and effects analysis (FMEA)
Fault tree analysis
Hazard analysis
Software hazard analysis
Table of Contents
5.1 Top Down Method Overview and Objectives Using
Fault Tree Techniques ....................................................... 5-2
5.2 Procedure for Top Down Method Using Fault Tree
Techniques ...................................................................... 5-3
5.3 Applying the Top Down Results .................................. 5-29
5.4 Top Down Examples ................................................. 5-30
5.5 Top Down Strengths ................................................. 5-52
5.6 Top Down Limitations ............................................... 5-52
Recommendations ........................................................... A-2
List of Figures
Figure 4-7 Circulating Water System DCS Segment ................. 4-44
Figure 4-8 CWS MOV Control Circuit & Logic ........................ 4-45
Figure 4-9 Failure Mode Tree Using FMEA Results as an
Input ............................................................................. 4-66
Figure 5-1 BWR Safety Functions (Top Down) ........................... 5-7
Figure 5-2 PWR Safety Functions (Top Down) ......................... 5-10
Figure 5-3 BWR Generation Functions (Top Down) .................. 5-16
Figure 5-4 PWR Generation Functions (Top Down) .................. 5-18
Figure 6-1 BWR Balance of Plant ............................................ 6-5
Figure 6-2 BWR Trip Sequence of Events after LOOP ................. 6-9
Figure 7-1 A Classification of Control Flaws Leading to
Hazards.......................................................................... 7-4
Figure 7-2 Accidents, Hazards, Unsafe Control Actions &
Control Flaws .................................................................. 7-6
Figure 7-3 Basic Control Structure ........................................... 7-9
Figure 7-4 Basic Control Structure with Human Operator ......... 7-10
Figure 7-5 Control Actions, Process Model Variables (PMVs)
and PMV States ............................................................. 7-12
Figure 7-6 Structure of a Hazardous Control Action ................. 7-13
Figure 7-7 HPCI-RCIC Flow Control System (System Level) ........ 7-23
Figure 7-8 System-Level HPCI-RCIC Flow Control Structure ........ 7-24
Figure 7-9 System-Level HPCI-RCIC Process Models ................. 7-25
Figure 7-10 HPCI-RCIC Flow Control System (Component
Level) ............................................................................ 7-34
Figure 7-11 Component-Level HPCI-RCIC Flow Control
Structure........................................................................ 7-35
Figure 7-12 Component-Level HPCI-RCIC Process Models ......... 7-36
Figure 8-1 BWR Main Steam Pressure Switches and MSIV
Closure Logic................................................................... 8-7
Figure 8-2 State Graph with a Low Level Sub-State .................... 8-8
Figure 8-3 Main Steam Sub-State ............................................ 8-8
Figure 8-4 Notional Top Level State Graph for a BWR ............. 8-10
Figure 8-5 Top Level Process Graph for a BWR ....................... 8-12
Figure 8-6 Alternative Processes in a Process Graph ................ 8-13
Figure 8-7 Layered Goals and Processes in a Process Graph .... 8-13
Figure 8-8 Checking for State and Goal Associations in the
Purpose Graph .............................................................. 8-14
Figure 8-9 Notional Top Level Process Graph for a BWR ......... 8-15
Figure 8-10 Notional Top-Level BWR Purpose Graph............... 8-18
Figure 8-11 HPCI State Graph .............................................. 8-32
Figure 8-12 HPCI Process Graph ........................................... 8-37
Figure 8-13 HPCI Purpose Graph .......................................... 8-42
Figure 8-14 One of the Indirect Goal Interactions in the
HPCI System .................................................................. 8-44
Figure 8-15 CWS State Graph.............................................. 8-49
Figure 8-16 CWS Process Graph .......................................... 8-54
Figure 8-17 CWS Purpose Graph ......................................... 8-61
Figure B-1 A Hierarchy of Failure Mechanisms, Modes and
Effects ............................................................................. B-2
Figure B-2 Linking a Taxonomy Sheet to an FMEA
Worksheet ...................................................................... B-5
Figure B-3 Linkage between Taxonomy Sheets .......................... B-6
Figure B-4 Hierarchy of Software Interactions & Faults ............. B-39
Figure C-1 Relay Contact Symbol ........................................... C-3
Figure D-1 a) Top down logic for response to trip of an
operating circulating water pump ..................................... D-7
Figure D-2 a) Top down logic for loss of circulating water
system due to spurious trips ............................................ D-10
Figure D-3 Top down logic for loss of circulating water
system ......................................................................... D-12
Figure D-4 Potential dominant contributors to circulating
water system failure ....................................................... D-13
List of Tables
Table 5-6 HPCI & RCIC Components Controlled by I&C
Equipment (Safety Functions) ........................................... 5-35
Table 5-7 HPCI and RCIC Digital System Failure Modes .......... 5-38
Table 5-8 HPCI/RCIC Generation Functions ........................... 5-39
Table 5-10 CWS Components Controlled by I&C Equipment
(Safety & Generation) ..................................................... 5-49
Table 5-11 CWS Component vs. Digital System Failure
Modes .......................................................................... 5-51
Table 6-1 Sample HAZOP Worksheet ...................................... 6-3
Table 6-2 HAZOP Guide Words ............................................. 6-7
Table 6-3 CWS Controls HAZOP Worksheet .......................... 6-15
Table 7-1 Suggested Process Model Format ............................ 7-11
Table 7-2 Combining Control Actions with Affected Process
Models ......................................................................... 7-13
Table 7-3 Sample STPA Worksheet ....................................... 7-14
Table 7-4 HPCI-RCIC Turbine Controls: System-Level
Hazards vs. Accidents or Losses ...................................... 7-22
Table 7-5 Select HPCI-RCIC Flow Control Actions .................... 7-26
Table 7-6 Excerpt of STPA Results for Control Action 3 ............ 7-27
Table 7-7 Excerpt from List of HPCI-RCIC Hazardous Control
Actions ......................................................................... 7-28
Table 7-8 Potential Causes of Hazardous Control Action
No. 7 ........................................................................... 7-29
Table 7-9 Excerpt of STPA Results for Control Action 5 ............ 7-37
Table 7-10 Potential Causes of HCA 1................................... 7-38
Table 8-1 Ten Characteristics Evaluated in PGA Basic Step
2 .................................................................................... 8-5
Table 8-2 Sample PGA Preliminary Observables Table .............. 8-7
Table 8-3 Top-Level BWR State and Event Table (Partial) .......... 8-11
Table 8-4 Top-Level BWR Goal Table (Partial) ......................... 8-16
Table 8-5 Top-Level BWR Process Table (Partial)...................... 8-17
Table 8-6 Top-Level BWR State Analysis Table ........................ 8-20
Table 8-7 Top Level BWR Goal Analysis Table........................ 8-22
Table 8-8 Top Level BWR Process Interaction Table (Partial) ..... 8-25
Table 8-9 Alternatives for Mitigating Information
Degradation .................................................................. 8-28
Table 8-10 HPCI Observables............................................... 8-33
Table 8-11 HPCI States & Events ........................................... 8-33
Table 8-12 HPCI Goals ........................................................ 8-38
Table 8-13 HPCI Processes ................................................... 8-40
Table 8-14 HPCI State & Events Analysis Results ..................... 8-43
Table 8-15 HPCI Goal Interactions ........................................ 8-44
Table 8-16 HPCI Process Interactions ..................................... 8-45
Table 8-17 CWS Observables .............................................. 8-50
Table 8-18 CWS States & Events .......................................... 8-51
Table 8-19 CWS Goals ....................................................... 8-55
Table 8-20 CWS Processes .................................................. 8-58
Table 8-21 CWS State & Events Analysis Results ..................... 8-62
Table 8-22 CWS Goal Interactions........................................ 8-64
Table 8-23 CWS Process Interactions .................................... 8-65
Table 9-1 Comparative Strengths & Limitations of Each
Method ........................................................................... 9-4
Table A-1 Guidance Documents Assessed ............................... A-3
Table B-1 Taxonomy Devices and Components ......................... B-2
Table B-2 Basic Types of Defensive Measures ........................... B-4
Table D-1 Combinations of Failures (Cut Sets) Leading to
Loss of Circulating Water ................................................. D-6
Section 1: Introduction
1.1 Background
The lack of established, practical methods for evaluating and managing potential
failure modes and mechanisms of digital instrumentation and control systems is
adversely affecting design, risk assessment and licensing efforts involving digital
equipment. Operating experience and digital I&C project experience point to:
• System designs that overlook vulnerabilities that can lead to undesired plant
events
• Significant increases in system costs and complexity without apparent
commensurate safety benefits
• Failure analyses that are very expensive and impractical to apply to designs
• Difficulty obtaining regulatory acceptance of analysis and design approaches
that can improve dependability and reduce overall risk.
This project investigated several methods for hazard & failure analysis of
industrial systems, ranging from well-established mature practices to innovative
methods still transitioning from academic demonstrations to practical, realistic
applications. The methods were applied to sample problems based on actual
nuclear plant experience to assess their effectiveness, range of applicability and
practicality for use by nuclear plant engineers and their suppliers.
1.2 Purpose/Objectives
Each of the hazard analysis methods described herein was researched and further
developed in order to meet the following objectives:
• Evaluate the capability of each method for identifying potential
vulnerabilities in a digital I&C system, including hazardous interactions with
plant components and plant systems
• Demonstrate the workability of each method on practical examples based on
experiences reported by EPRI members
• Provide a step-by-step procedure for each method so that users can adapt
them into a procedure format
• Provide worked examples to demonstrate each method in a step-by-step
manner
• Use the results to identify the comparative strengths and limitations of each
method
• Provide guidance on how to blend multiple methods to gain efficiencies in
the analysis, limit the analytical effort, or limit corrective actions such as
design changes or the application of administrative controls to the identified
hazards
1.3 Scope
This guideline describes the following six hazard analysis methods, including
discussions of their ranges of applicability, step-by-step procedures, strengths and
limitations, and worked examples based on actual nuclear plant cases:
1. Functional Failure Modes & Effects Analysis (FFMEA) Method
The Functional FMEA method takes a top down approach to identifying
the potential causes of postulated functional failures of plant system-level
functions and processes without necessarily identifying and analyzing
specific sets of equipment and their individual failure modes. Thus the
FFMEA method is well suited for analyzing a system at the conceptual
design phase in order to identify functional hazards or hazardous
conditions that should be addressed in later phases of the development
lifecycle. The FFMEA method is described in detail in Section 3.
2. Design Failure Modes & Effects Analysis (DFMEA) Method
The Design FMEA method takes a bottom up approach to identifying the
effects of postulated failure mechanisms and failure modes at a user-
determined level of interest. DFMEA is the method most often used by
equipment vendors, I&C engineers, and other stakeholders in the digital
I&C community. It is the traditional bottom-up approach that is
described in various standards such as IEEE Std. 352-1987 (Reference 1).
3. Top Down Method Using Fault Tree Analysis (FTA) Techniques
The Top Down (FTA) method treats I&C systems as parts of a larger
integrated plant design. It postulates failures of high level safety and
generation related functions and identifies the plant mechanical and
electrical equipment needed for these functions, along with the digital
I&C systems that control them. This top down approach can thereby
focus the failure analysis of the system by identifying the potentially
important failure modes of the mechanical and electrical components
controlled or actuated by the digital system. Digital system hazards that
can lead to important plant component failure modes can be further
evaluated using the FTA technique, or the analyst can link the Top Down
(FTA) results to another hazard analysis method. The Top Down (FTA)
method is described in Section 5.
4. Hazard and Operability Analysis (HAZOP) Method
A HAZard and OPerability (HAZOP) analysis is a systematic review of a
process (e.g., system design), using “guide words,” to visualize the ways in
which a system can malfunction. The HAZOP analysis searches for
possible deviations from the design intent that can occur in components,
operator or maintenance technician actions, or material elements (e.g., air,
water, steam), and determines whether the consequences of such
deviations can result in hazards. The HAZOP method is described in
Section 6.
5. Systems Theoretic Process Analysis (STPA) Method
The STPA method is one part of a set of new or refined systems
engineering methods developed by researchers at the Massachusetts
Institute of Technology (MIT), under the heading of Systems-Theoretic
Accident Model and Processes (STAMP). Per Reference 19:
“The primary reason for developing STPA was to include the new causal
factors identified in STAMP that are not handled by the older techniques
[FMEA, FTA, HAZOP, and others].”
The STPA method is included in this study because it effectively addresses
potentially hazardous interactions in digital I&C systems, including
hazards introduced by unintended software behaviors and component
interactions (not just potential component failures). Note that in this
guideline, the term “loss” is often used instead of the term “accident” as
described by MIT to avoid confusion with the more limiting nuclear
industry term, “nuclear accident.” The STPA method is described in
Section 7.
6. Purpose Graph Analysis (PGA) Method
A Purpose Graph is a figure that illustrates the “Observable,” “State,”
“Goal” and “Process” features of a system. Purpose Graphs are used in
Systems Engineering design and analysis activities. The Purpose Graph is
composed of a State Graph placed side-by-side with a Process Graph.
The PGA method is useful for identifying potential digital systems
hazards that can arise from unexpected component or system behaviors by
providing insights into redundancy and diversity success paths, direct and
indirect consequences of failures to meet designed performance levels even
when no faults are present, desired and undesired interactions between
aspects of normal system state changes, incompatible goals, and
incompatible processes. The PGA method is described in Section 8.
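To make the Top Down (FTA) idea in item 3 more concrete, the following minimal Python sketch computes the minimal cut sets of a small fault tree. The tree, its gate structure, and all event names are hypothetical and are not taken from the worked examples in this guideline. The expansion is a simple top-down substitution and is not the algorithm of any particular FTA tool.

```python
from itertools import product

# Hypothetical fault tree (illustrative only, not from this guideline).
# Each gate maps to (gate_type, [inputs]); names not in GATES are basic events.
GATES = {
    "LOSS_OF_FLOW": ("OR", ["PUMP_A_AND_B_FAIL", "DCS_CONTROLLER_FAILS"]),
    "PUMP_A_AND_B_FAIL": ("AND", ["PUMP_A_FAILS", "PUMP_B_FAILS"]),
    "PUMP_A_FAILS": ("OR", ["PUMP_A_MECHANICAL", "DCS_CONTROLLER_FAILS"]),
    "PUMP_B_FAILS": ("OR", ["PUMP_B_MECHANICAL", "DCS_CONTROLLER_FAILS"]),
}


def cut_sets(event):
    """Return the cut sets of `event` as a set of frozensets of basic events."""
    if event not in GATES:
        return {frozenset([event])}  # a basic event is its own cut set
    gate_type, inputs = GATES[event]
    child_sets = [cut_sets(child) for child in inputs]
    if gate_type == "OR":
        # Any one input suffices: take the union of the children's cut sets.
        return set().union(*child_sets)
    # AND gate: every input is needed, so combine one cut set from each child.
    return {frozenset().union(*combo) for combo in product(*child_sets)}


def minimal_cut_sets(event):
    """Drop any cut set that strictly contains another cut set."""
    sets = cut_sets(event)
    return {s for s in sets if not any(other < s for other in sets)}


if __name__ == "__main__":
    ordered = sorted(minimal_cut_sets("LOSS_OF_FLOW"),
                     key=lambda s: (len(s), sorted(s)))
    for cs in ordered:
        print(sorted(cs))
```

In this assumed tree, the shared controller appears as a single-event cut set, while the two mechanical failures only defeat the function together. This is the kind of result a top-down analysis can hand to a more detailed method for further evaluation.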
This guideline is focused on methods for analyzing digital I&C systems in various
contexts to determine if potential hazards exist that could lead to accidents or losses.
Accidents or Losses
Hazards
Second, note that the word failure does not appear anywhere. Hazards
are not identical to failures - failures can occur without resulting in a
hazard and a hazard may occur without any precipitating failures. C. O.
Miller, one of the founders of System Safety, cautioned that
"distinguishing hazards from failures is implicit in understanding the
difference between safety and reliability."
intended to be analyzed in the context of abnormal conditions and events
(ACES).
Context
The role of context also helps determine if a failure mode, design feature, or
other characteristic of a digital component or system is hazardous by viewing
them under various postulated conditions. The hazard analysis methods
described in this guideline provide techniques for systematically identifying the
conditions under which failure modes, design features or other characteristics of a
digital component or system are hazardous or not hazardous.
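As an illustration of how context can be examined systematically, the sketch below enumerates every combination of a control behavior and a set of conditions, and flags the combinations that a rule marks as hazardous. The variables, the behaviors, and the hazard rule are all hypothetical and exist only for illustration. They are not taken from the examples in this guideline and do not represent a real plant requirement.

```python
from itertools import product

# Hypothetical context variables and their possible states (illustrative only).
CONTEXT_VARIABLES = {
    "tank_level": ["low", "normal", "high"],
    "fill_pump_running": [True, False],
}

# Hypothetical control behaviors to examine in each context.
BEHAVIORS = ["start_pump_provided", "start_pump_not_provided"]


def is_hazardous(behavior, context):
    """Assumed hazard rule for this sketch, not a real plant requirement."""
    level = context["tank_level"]
    if behavior == "start_pump_provided":
        return level == "high"  # filling a tank that is already high
    # Command not provided: hazardous only if fill is needed and the pump is idle.
    return level == "low" and not context["fill_pump_running"]


def enumerate_contexts():
    """Yield (behavior, context, hazardous) for every combination."""
    names = list(CONTEXT_VARIABLES)
    for behavior in BEHAVIORS:
        for values in product(*(CONTEXT_VARIABLES[n] for n in names)):
            context = dict(zip(names, values))
            yield behavior, context, is_hazardous(behavior, context)


if __name__ == "__main__":
    for behavior, context, hazardous in enumerate_contexts():
        label = "HAZARDOUS" if hazardous else "not hazardous"
        print(f"{behavior:25s} {context} -> {label}")
```

The same behavior is hazardous in some rows and not in others, which is the point of analyzing a design feature under postulated conditions instead of in isolation.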
Additional Concepts
The simple example described below introduces some additional concepts, such
as a “Safety Constraint,” which can be thought of as a design constraint intended
to ensure safety. This example expresses some of the key concepts used or
described in this guideline.
Consider the act of running with a pair of scissors in your hands. Scissors are like
knives, and present a contradiction; a sharp pair of scissors is a safe pair of
scissors because they can serve their purpose without using excessive force, but a
sharp pair of scissors can also cut you. So we are taught as children to never run
when we have scissors in our hands, because if we fall, we might cut ourselves.
This is an example of a safety constraint (don’t run when you have scissors in your
hand) that is designed to prevent an accident (getting cut) due to a potentially
hazardous condition (having scissors in your hand when you are running). This
example provides context through a combination of a potentially hazardous piece
of equipment (scissors) and its environment (in your hand while you are running).
The likelihood of an accident is increased when the potentially hazardous condition
of having scissors in your hand is combined with the act of running. We could
propose other rules for reducing or eliminating this hazard by banning the use of
scissors to cut anything, or requiring scissors to be blunt. But sharp scissors are so
useful for their intended purpose, we are willing to live with the risk of an
accident as long as we teach and reinforce the rule of not running when we have
them in our hands. We accept that by complying with this rule, we can live with
reasonable assurance that we won’t get cut when we have scissors in our hands.
Technical Experience
system, but neither was the system designed for the plant conditions that it
encountered, leading to an event.
The configuration analyzed was different from the operating configuration.
The digital system that was analyzed, and even tested in some cases, was not
exactly the same as the system that was installed, commissioned, and turned
over to operations. On paper the failure analysis and tests showed acceptable
results, but in reality the system response to some failure modes was different
from what was expected.
Depth of analysis. There is no consensus method for determining the needed
level of detail in the analysis. When can the analysis stop at system-level
failure modes, and when should it penetrate to the deepest levels of a system,
including an assessment of individual devices and piece-parts that make up
each digital component or computing unit? This question leads to the
problem of integrating failure analysis results from two distinct domains: the
plant system domain, which is familiar to the owner/operator’s engineers,
and the I&C technology domain, which is familiar to the platform vendor’s
engineers. This problem is further exacerbated by a limited ability of the
engineers to communicate across the gap between their domains of expertise,
or combine the results of analyses performed in different domains.
Software failure modes. The term “software failure” is still being used by
some, but it can create confusion and misunderstandings. The term can be
misleading, because software doesn’t really fail; it does exactly what it is
designed to do. Under certain conditions, software design errors can wreak
havoc in digital systems, but they are not “failures.” It would be helpful to
replace the notion of “software failure modes” with a concept and terms that
better fit the reality, such as “hazardous behaviors that can be introduced via
software” and “unintended or undesired behaviors.” An updated approach to
hazard analysis for digital systems may be a key factor in rectifying this
problem.
Senior management awareness. Some contributors reported that I&C
engineers, project managers and middle managers used their own judgment
and experience to assess the acceptability or risk of identified failure modes
and their effects on the plant, when in fact there was an unwritten
expectation that the responsibility for these decisions rests solely with senior
management (e.g., station manager or site VP). In these cases, staff personnel
were able to convince themselves that the risks due to the effects of some
identified and potentially hazardous failure modes were acceptable, and did
not report the results to senior management. Later, after an operating event
exposed the failure mode and its unacceptable effects, the senior management
response was to require a modification of the system to prevent recurrence,
and change procedures so that failure analysis results that could critically
affect the plant are shared at the highest levels for decision-making before
implementation.
Project Experience
Most contributors to this methodology were familiar with the Failure Modes and
Effects Analysis (FMEA) method. The FMEA method is typically used on
digital upgrade activities, and over time has become the de facto choice among
I&C engineers (in owner/operator and supplier organizations) because it is more
familiar to them and is described in existing policies and procedures. The Fault
Tree Analysis (FTA) is also familiar to owner/operators, especially in the context
of the facility Probabilistic Risk Assessment (PRA), and in some cases fault trees
or the PRA itself are used to inform and assess digital system designs.
However, project experience with these methods has not always been good,
especially on large and more complex systems, such as a complete protection
system upgrade, or the application of Distributed Control System (DCS)
technology on multiple control system segments (e.g., main turbine control,
feedwater control, etc.). Contributors have reported the following issues:
• Sometimes FMEAs are too big, expensive and difficult to manage. In some
cases, FMEA worksheets have run into thousands of pages when the analyst
considers all failure modes of each and every component.
• Sometimes FMEA results are not timely enough to make a meaningful
difference. When an FMEA becomes too unwieldy, inevitably it is
completed later than planned, and in some cases the owner/operator is faced
with a decision to live with some of the identified vulnerabilities, because it is
too late or too costly to rework the system design.
• Sometimes it is difficult for I&C engineers to fully understand and make use
of fault trees and/or the PRA. In some cases, I&C engineers responsible for
digital upgrade projects didn’t know what questions to ask of the PRA
engineers, or how to ask them. In some cases, I&C is not modeled in the
existing PRAs available for use in projects.
The project team used the guidance provided by EPRI TR-104595, Abnormal
Conditions and Events (ACES) Analysis for Instrumentation and Control (I&C)
Systems, and EPRI interim report 1022985, Failure Analysis of Digital
Instrumentation and Control Equipment and Systems – Demonstration of Concept, as
input to developing this guideline.
The “ACES Report”
The ACES Report (TR-104595) provides the structure for failure analyses of
digital upgrades and a summary of evaluation techniques. The guidance
contained herein effectively expands the information already contained within the
ACES topical report, which was an early attempt at addressing hazards. The
problem of analyzing hazards and finding potentially undesired behaviors was
already understood at that time. The difference now is that the industry has
better developed methods and real examples to work with.
Demonstration of Concept
EPRI Report 1022985 (Reference 15) was the result of initial research on
applications of failure analysis methods used in today’s digital upgrade activities.
Several methods for performing failure analysis of digital systems were explored
during this “demonstration of concept” research. The research evaluated the
top-down approach of a Fault Tree Analysis (FTA) and the bottom-up approach of a
Failure Modes and Effects Analysis (FMEA).
The purpose of evaluating both top-down and bottom-up approaches was to
investigate the possibility of developing a hybrid approach that could use
top-down and bottom-up techniques in a complementary manner. In principle, the
top-down approach would identify critical functional failures, and the scope of
the bottom-up approach would be limited to component failures that could lead
to the critical functional failures. The objective was a method that would be both
more effective in finding potential vulnerabilities and less costly to apply than
conventional methods.
Detailed examples with failure analysis tables and results were included in EPRI
1022985 to demonstrate how the guidance in the report could be used in failure
analysis efforts. Several of the examples have been enhanced and carried forward
into this document.
1.7 Contents of this Guideline
Table 1-1 provides an overview of the contents of the remaining sections of this
guideline:
Table 1-1
Summary of Guideline Contents
1.8 How to Use this Guideline
This guideline is a large and comprehensive body of work, and therefore users
will benefit by taking the following steps before proceeding with a hazard analysis
activity:
a. Read Sections 1 through 3 to get an overview of each method, definitions of
key terms, and a discussion of how to select an effective method or
blend of methods for a given situation and plan the hazard analysis activities
b. Scan Appendices A through D for awareness and potential aid in the hazard
analysis
c. Identify the most likely method or blend of methods for the given problem
d. Read the sections and examples on the candidate methods identified in step c
e. Select the method or methods to apply
f. Plan the appropriate activities, and proceed. Reference the detailed sections,
examples and relevant appendices, as appropriate
Step e. is likely to be the most difficult, because the choice of method(s) depends
on several factors such as:
the scope of the digital I&C project
the scope of the hazard analysis
familiarity of methods to various stakeholders
how hazards are identified and characterized at various levels of interest
how methods can be used in various system lifecycle phases
the potential need for a facilitator or outside expertise
Sections 3.3 and 3.4 provide specific guidance on method selection for a wide
range of situations.
Section 2: Definitions
Accident (or Loss): An undesired and unplanned event that results in a loss
(including loss of human life or injury, property damage, environment pollution,
and so on). (Reference 19) (this definition is broader than the typical nuclear
plant definition of accident)
Basic Event: A basic fault that requires no further development in a fault tree
(Reference 35). Usually representative of a component and one of its failure
modes.
Behavior: The evolution of the input, processing and output states of a digital
computing system over time. By decomposition, the evolution of the states of a
subsystem or component over time. Some of the meaning of this term is similar
to the use of the term “Function,” as in functional requirements or function
decomposition.
Control Systems: Those systems used for normal operation that are not relied
upon to perform safety functions following anticipated operational occurrences or
accidents. The control systems evaluated using [Standard Review Plan (SRP)]
Chapter 7 are those which control plant processes having a significant impact on
plant safety, but are not wholly incorporated into systems addressed by other
SRP chapters. (Reference 6)
Cut Set: A combination of component failures which, if they all occur, will cause
the top event of a fault tree to occur. (Reference 35)
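As a minimal illustration of the cut set definition, the following Python sketch expands a small fault tree into its minimal cut sets. The gate structure, the event names and the helper functions are hypothetical and chosen only for illustration; they are not taken from Reference 35 or from any plant PRA.

```python
from itertools import product

# Hypothetical fault tree: each gate maps to (gate type, list of inputs).
# Any name that is not a key in TREE is treated as a basic event.
TREE = {
    "LOSS_OF_FEEDWATER": ("OR", ["FRV_CLOSES", "BOTH_PUMPS_TRIP"]),
    "FRV_CLOSES": ("OR", ["CONTROLLER_HALTS", "VALVE_ACTUATOR_FAILS"]),
    "BOTH_PUMPS_TRIP": ("AND", ["PUMP_A_TRIPS", "PUMP_B_TRIPS"]),
}


def cut_sets(event, tree):
    """Return the cut sets (as frozensets of basic events) for an event."""
    if event not in tree:
        return {frozenset([event])}
    gate, inputs = tree[event]
    child_sets = [cut_sets(child, tree) for child in inputs]
    if gate == "OR":
        # Any one input is sufficient: take the union of the children's cut sets.
        return set().union(*child_sets)
    if gate == "AND":
        # All inputs are required: combine one cut set from each child.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate type: {gate}")


def minimal_cut_sets(event, tree):
    """Drop any cut set that strictly contains another cut set."""
    sets = cut_sets(event, tree)
    return {s for s in sets if not any(other < s for other in sets)}


if __name__ == "__main__":
    for cs in sorted(minimal_cut_sets("LOSS_OF_FEEDWATER", TREE), key=sorted):
        print(sorted(cs))
```

In this sketch the two single-event cut sets correspond to single failure vulnerabilities, while the pump cut set requires both failures to occur together.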
limitations. The design basis identifies and supports the reasons a design
requirement is established. (Reference 7)
or operator. (Reference 2)
discipline. In common usage, the terms “error” and “bug” are used to express this
meaning. (Reference 2)
Fault Tree: A graphic model of the various parallel and sequential combinations
of faults that will result in the occurrence of a predefined undesired event.
(Reference 35)
Guide Word: Word or phrase which expresses and defines a specific type of
deviation from an element’s design intent (Reference 33)
For the purpose of this guidance, the term “hazard” is used to describe an
unwanted or unacceptable system behavior that could lead to an accident or loss,
or prevent an appropriate system response to an accident or loss condition.
Hazard Analysis: (1) A process that explores and identifies conditions that are
not identified by the normal design review and testing process. The scope of
hazard analysis extends beyond plant design basis events by including abnormal
events and plant operations with degraded equipment and plant systems. Hazard
analysis focuses on system failure mechanisms rather than verifying correct
system operation (Reference 9); (2) The process of identifying hazards and their
potential causal factors. (Reference 19). Conceptually, “hazard analysis” may be
considered somewhat broader than “failure analysis” in the sense that it also
considers situations in which there can be losses in the absence of any failures of
systems, subsystems or components. This document uses the two terms
interchangeably in the broader context.
Insertion Mechanism: For faults, the pathway of processes and conditions that
resulted in the presence of the fault, but not its discovery. Insertion mechanisms
are often linked to the stages of the development and production process (e.g.,
design, tool behavior, etc.)
License Basis: Documented elements that the NRC has considered in granting
and maintaining the license for the facility. These include the combined
operating license application (COLA); safety evaluation report (SER); design
control documents (DCDs); technical specifications; Inspections, Tests, Analyses
& Acceptance Criteria (ITAAC); and other commitments made under the
corrective action program. (Reference 7)
Non-Fatal Fault: A software fault that allows program execution to continue, but
with incorrect behavior.
Non Plausible Outcome Failure: A non-fatal fault with output errors that do not
satisfy output expectations or specifications (i.e., a form of soft failure).
Part: Section of the system which is the subject of immediate study. Note: A part
may be physical (e.g. hardware) or logical (e.g. step in an operational sequence).
(Reference 33)
Plausible Outcome Failure: A non-fatal fault with output that appears to satisfy
output expectations but contains errors (i.e., a form of soft failure).
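The distinction between the two soft failure types can be illustrated with a short Python sketch. The expected output range, the function name and the sample values are hypothetical. The sketch only shows that a simple range check flags a non-plausible output, while a plausible but wrong output passes the same check unnoticed.

```python
# Hypothetical output expectation: a valve demand signal in percent.
EXPECTED_RANGE = (0.0, 100.0)


def classify_output(actual, correct, expected_range=EXPECTED_RANGE):
    """Classify a computed output against the correct value and the expected range."""
    low, high = expected_range
    if actual == correct:
        return "correct"
    if low <= actual <= high:
        # Wrong, but it looks reasonable to a range check or an operator.
        return "plausible outcome failure"
    # Wrong, and visibly outside what the specification allows.
    return "non-plausible outcome failure"


if __name__ == "__main__":
    print(classify_output(actual=55.0, correct=55.0))
    print(classify_output(actual=42.0, correct=55.0))
    print(classify_output(actual=250.0, correct=55.0))
```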
Protection System: 1) the part of the sense and command features involved in
generating those signals used primarily for the reactor trip system and engineered
safety features. (Reference 8), or 2) those I&C systems which initiate safety
actions to mitigate the consequences of design basis events. The protection
systems include the reactor trip system (RTS) and the engineered safety features
actuation system (ESFAS). (Reference 6)
Software Hazard: A process or resulting outcome that has the potential under at
least some conditions to result in an unplanned event or series of events causing
damage to equipment or the environment and/or death, injury or illness to
personnel. Hazards may be graded by the extent of the damage and injury
potential.
Abbreviations & Acronyms
Comm (Communication)
I/O (Input/Output)
SG (Steam Generator)
SW (Service Water)
Xformer (Transformer)
Section 3: Planning Hazard Analysis Activities
A hazard analysis activity may be performed in accordance with a one-time plan,
on a project-specific basis, or it may be performed on a recurring basis (i.e.,
project by project) in accordance with written procedures. In either case, hazard
analysis activities should include the determination of the scope, objectives,
analysis methods, resources, schedule, acceptance criteria, and documentation.
The scope of the hazard analysis activity should be consistent with the project
scope. Project scope information typically outlines the scope of the design change
that will be performed in terms of affected systems, structures or components
(SSC), including an outline of the components in the system that are being
modified and their interfaces to other SSCs.
The objectives of the analysis should be determined after the scope of the project
and analysis have been determined. The objectives should encompass items that
involve equipment functions, success (or failure) criteria, and other project
objectives. The determination of the analysis objectives can be driven by
compliance objectives, although some objectives may be subjective based on the
risk impacts of the system or component being modified. The use of objectives to
outline the purpose of the analysis will allow the analysis to focus on the critical
aspects of the systems or components being analyzed. The following list provides
some potential objectives to consider before selecting and performing a specific
hazard analysis method:
Identify single failure vulnerabilities
Prevent loss of safety functions or critical functions
Prevent inadvertent actuation
Validate adequate redundancy
Comply with regulatory requirements
Prevent personal injury
Protect equipment
Differentiate and protect architectural segments
Develop periodic testing requirements
Aid in the analysis of field failures and consideration of design changes
Make use of best available engineering resources
Develop specific functional and performance requirements
Accept system by plant personnel and management
Before selecting and applying one or more hazard analysis methods, it is helpful
to identify the “level of interest,” as this will vary depending on the specifics of
the project. The specific characteristics of the project and/or analysis drive the
level of interest. For example, the impacts of a system level change on the plant
may require a different analysis than a software upgrade in a device.
items in each layer that make up a plant system or a digital system. This view,
while somewhat abstract, is used in this guideline to show where different hazard
analysis methods can be applied, in a singular manner or in a blended manner,
that suits the systems and components to be analyzed at any particular level of
interest, consistent with the objectives of the analysis.
Notice in Figure 3-1 that plant functions, systems and components are
distinguished from digital systems, components and devices, so that the analyst
can identify items and interfaces at any single level and determine how they
interact with adjacent levels. Because this guideline is about hazard analysis of
digital I&C systems, it describes various hazard analysis methods using this view,
and how some are applied from the top down, some are applied from the bottom
up, and how some methods can be blended to gain certain efficiencies.
[Figure 3-1 diagram: a hierarchy with the following levels, from top to bottom.
Plant Functions;
Plant Systems 1 to n (e.g., Main Turbine, Main Generator, Feedwater, Rod Control,
Reactor Coolant, Turbine Bypass, Switchyard, Electrical, Plant Computer, Reactor
Protection, Eng. Safety Features);
Plant Components 1 to n (e.g., Pumps, Valves, Vessels, Compressors, Breakers,
Switchgear, Xformers, Heaters, Pipes, Ducts, Air Handlers);
Digital Systems 1 to n (e.g., S/G Level, FPT Speed, Main Turbine EHC, NSSS
Controls, Plant Computer, Reactor Trip, ESFAS);
Digital Components 1 to n (e.g., Controllers, Comm Modules, I/O Modules,
Indicators, Power Supplies, Workstations, Servers, Sensors, Actuators);
Devices 1 to n (e.g., CPU, A/D, D/A, RAM, ROM, Watchdog);
Parts and Software (e.g., Operating System, Firmware, Applications,
Configuration Data).
The legend distinguishes “Plant Functions, Systems & Components” from “Digital
Systems, Components & Devices.”]
Figure 3-1
A Hierarchical View
The hazard analysis methods described in this guideline may be complementary
to other analytical techniques that may be applied to a digital I&C system or a
larger set of plant systems and components, such as:
Probabilistic Risk Assessment (PRA)
Validation and Verification (V&V)
Design Review
Reliability Analysis
Diversity and Defense in Depth Analysis
System Modeling
Abnormal Conditions and Events Analysis
Details about the techniques listed above and how they can be applied to digital
I&C systems can be obtained from documents that are referenced in this report.
Table 3-1 lists the full name and analytical scope of the six methods described in
this report, the characteristics of the hazards that each method is designed to
reveal, and the Section number where each method is described in detail.
Before selecting and applying any candidate methods, users of this guideline
should:
1. Carefully review the applicable procedures and worked examples provided in
the related sections.
2. Consider a blended approach where two methods are applied in order to:
take advantage of readily available results from one method (e.g., plant-
specific fault trees that are maintained for use in the facility PRA), and
use them as an input to another method
use the results from one method to limit the effort required by another
method
use the results from one method to identify the potentially critical
hazards to be further evaluated by another method, and limit the need
for corrective actions to those which address critical hazards
Figures 3-2 through 3-11 compare the methods described in this guideline, in
various contexts, allowing users to assess their anticipated project or analysis
scope and objectives against the relative strengths of each method, and select the
method(s) that are best suited for the task at hand.
Table 3-1
Comparative Scope of Hazard Analysis Methods and their Identified Hazard Characteristics
FMEA Methods at Various Levels of Interest
Figure 3-2 illustrates two different FMEA methods (both are described in detail
in Section 3), and how they may be applied at various levels of interest. The
Design FMEA is a bottom-up method that can be applied at any level of
interest. The analyst selects the level to meet his/her objectives. For example,
digital I&C platform vendors are likely to be interested in demonstrating the
reliability of their systems and components, and will typically apply the Design
FMEA method from the device or piece-parts level (i.e., the very bottom), up to
the digital system level, but on a generic basis. On the other hand, a system
integrator, owner/operator or architect/engineer is more interested in reliability at
the plant system level, and will typically apply the Design FMEA method from
the digital component level and up, or from the digital system level and up, on a
system-specific or plant-specific basis. A taxonomy of failure mechanisms, modes
and effects for typical digital devices and components is provided in Appendix B,
with guidance on how to use it.
[Figure 3-2 diagram: the Figure 3-1 hierarchy annotated with failure mechanisms,
failure modes and failure effects at adjacent levels, from the device level up to
the plant component level. Two annotations mark who typically performs the Design
FMEA: “System Integrator or Owner/Operator (or A/E by proxy)” and “Digital
Platform Vendor.”]
Figure 3-2
FMEA Methods at Various Levels of Interest
The hazards identified by the Functional FMEA and Design FMEA methods
are limited to those that can lead to the failures identified at the levels of interest.
Notice that these methods are not designed to evaluate software failures, because
software does not fail (a necessary condition to be postulated for an FMEA).
Software misbehaviors are a design problem. The HAZOP and PGA methods
can also be applied at various levels of interest, comparable to the Functional
FMEA method.
Figure 3-3 illustrates the scope of the Top Down method, using fault tree
techniques, at various levels of interest. Classical Fault Tree Analysis (FTA) uses
terms such as “events” at the top of the fault tree and “faults” at various lower
layers. In this guideline, the terms used in the FMEA methods (failure
mechanisms, modes and effects) are also used in the Top Down method so that
the results of the two methods can be compared side-by-side to confirm results or
blended in a manner that gains efficiencies. For more on blended methods, see
Section 3.4.
As in the FMEA methods, the hazards that can be identified by the Top Down
method are limited to those that can lead to the failures identified at the levels of
interest, and this method is not designed to evaluate software failures, because
software does not fail (a necessary condition before it can be postulated for a fault
tree).
[Figure 3-3 diagram: the Figure 3-1 hierarchy with an FTA shown at each of several
levels of interest (plant functions, plant systems, plant components, digital
systems). At each level the FTA relates failure effects to the failure modes and
failure mechanisms at the levels below, down to the device level.]
Figure 3-3
Top Down (FTA) Method at Various Levels of Interest
STPA Method at Various Levels of Interest
Figure 3-4 illustrates how the STPA method, described in Section 7, can be
applied at various levels of interest. In this case, the only direct correlation
between the system/component hierarchy and the STPA method is at the point
where losses are identified. After losses are identified at the appropriate level of
interest, the STPA method systematically breaks them down into hazards,
hazardous control actions, and control flaws that can lead to hazardous control
actions. This approach essentially makes STPA a top down method, but only in
the sense that losses (identified at the level of interest) are the starting point.
There is no direct correlation between the subsequent steps in the STPA method
and lower levels in the system/component hierarchy.
Notice that software is identified at the bottom of Figure 3-4, because the STPA
method does not presume faults or failures. Instead, it identifies hazardous
control actions, even if there are no faults or failures, that can arise from
hardware or software design issues.
[Figure 3-4 diagram: the Figure 3-1 hierarchy with STPA shown at several levels of
interest (plant systems, plant components, digital systems). At each level, losses
are broken down into hazards, hazardous control actions (HCA) and control flaws.]
Figure 3-4
STPA at Various Levels of Interest
Figures 3-5 through 3-8 present qualitative comparisons of the various hazard
analysis methods, in various contexts, to give a sense of their ranges of
applicability, effectiveness and ease of use. Coverage in these figures refers to the
ability of the method to identify hazards. No method is completely effective; the
ability of each method to identify a wide range of hazards depends on the context
of the analysis, for example the depth of analysis (a single loop controller at the
digital component level vs. a complex, highly integrated control system at the
digital system level) or the kind of behavior of interest (anticipated failure modes
vs. unanticipated behaviors).
Figure 3-5 shows that the Design FMEA (DFMEA) method is most effective at
identifying failure modes and effects at the device, component and sub-system
levels of a system, because it can readily postulate credible failure modes and
determine the resulting effects based on known and understood failure
mechanisms, from the bottom-up. Appendix B of this guideline describes typical
digital I&C device and component failure modes, as well as typical software
interactions and faults, and related defensive measures that can be applied.
The Functional FMEA (FFMEA), HAZOP, STPA and PGA methods are
effective across the subsystem, system, and plant levels of abstraction, as well as
interactions between the plant and its environs, because these methods postulate
system behaviors using “guide words” (i.e., postulated conditions) in one form or
another, then determine if these behaviors are hazardous at a functional level.
These methods are not constrained by hardware or software functional
allocations.
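The guide word idea described above can be sketched in Python as a simple worksheet generator. The guide words and function names below are illustrative assumptions, not the lists used in this guideline. Each function is paired with each guide word to form a postulated behavior, which the analyst then assesses for hazard at the functional level.

```python
from itertools import product

# Illustrative guide words; the guideline's own lists may differ.
GUIDE_WORDS = ["NO", "MORE", "LESS", "EARLY", "LATE"]

# Hypothetical functions at the chosen level of interest.
FUNCTIONS = ["Control steam generator level", "Control feed pump turbine speed"]


def postulate_deviations(functions, guide_words):
    """Pair every function with every guide word to form candidate deviations."""
    return [
        {"function": f, "guide_word": g, "hazardous": None}  # None = not yet assessed
        for f, g in product(functions, guide_words)
    ]


if __name__ == "__main__":
    worksheet = postulate_deviations(FUNCTIONS, GUIDE_WORDS)
    print(len(worksheet), "deviations to assess")
    for row in worksheet[:3]:
        print(f'{row["guide_word"]}: {row["function"]}')
```

Because the worksheet is built from functions rather than from equipment, it does not depend on how those functions are allocated to hardware or software.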
The Top Down (FTA) method is effective at the system and plant levels of
abstraction because it is a top-down method that focuses on preserving critical
functions, as opposed to analyzing component failures.
Figure 3-5
Relative Coverage of Methods in the Context of Depth of Analysis
Relative Usefulness of Methods in the Context of System Lifecycle Phases
Figure 3-6 shows that the Design FMEA method is useful at four distinct phases
of a project or system lifecycle:
1. It can be used in the concept phase to identify single points of failure, usually
between the proposed solution and interfacing equipment;
2. It can be used to assess the detailed design for unacceptable failure modes and
effects, and thus inform any design changes that may be necessary;
3. It can be used in the test phase to validate system responses that are expected
due to component failures; and
4. It can be used in the Operations and Maintenance (O&M) phase of the
system lifecycle to aid in the development of periodic test or preventive
maintenance procedures, system monitoring plans, and troubleshooting and
cause analysis activities.
The Functional FMEA, HAZOP, STPA and PGA methods can be particularly
useful in assessing conceptual designs, assisting in the development of functional
and performance requirements, and assessing the detailed design to assure that
desired behaviors are well understood and implemented, and that undesired
behaviors are well understood and eliminated, prevented, or effectively mitigated
either in the design or through administrative controls before entering the O&M
phase.
The Top Down method is useful in the conceptual design and detailed design
phases, for assessing the design against success or failure criteria in the context of
critical safety or generation functions, and in the O&M phase, in the context of
the plant PRA for assessing operational and maintenance risks, maintenance rule
activities, and the significance determination process.
[Figure 3-6 plot: vertical axis “Usefulness”; labeled series for DFMEA, for FFMEA,
HAZOP, STPA and PGA, and for FTA.]
Figure 3-6
Relative Usefulness of Methods in the Context of System Lifecycle Phases
Figure 3-7 shows the relative effectiveness of hazard analysis methods in terms of
their ability to reveal expected vs. relatively unexpected behaviors. The Design
FMEA, Functional FMEA and Top Down (FTA) methods typically identify
system or component behaviors as a result of postulated failure modes and failure
mechanisms (expected behaviors) within the constraints of the analysis boundary,
and the results are usually well understood. However, operating experience has
shown that these methods do not consistently reveal strange, unexpected
behaviors that can arise from infrequent or unusual operating conditions,
unanticipated equipment modes (e.g., automatic, manual, standby, halted, reset,
latched, etc.), adverse plant or system conditions that don’t involve failures, or
interactions between systems and components that don’t ordinarily appear to be
functionally coupled.
On the other hand, the Functional FMEA, HAZOP, STPA and PGA methods
force consideration of functional misbehaviors without necessarily constraining
the analysis to specific pieces of equipment and their failure modes, or hardware
or software functions allocated to that equipment. While the Functional FMEA
method includes the notion of postulated functional failures, it does so at the
plant process level using a series of guide words in a manner similar to HAZOP,
where digital system faults and failures are not always necessary to create a hazard
at the plant system level. Thus, the Functional FMEA (FFMEA) method is
shown in Figure 3-7 as something in between the other sets of methods. These
methods provide a complement to the Design FMEA and Top Down methods
because they can reveal otherwise strange or unexpected behaviors in a system
design, and they can more fully inform the development of system requirements
so that strange and unexpected behaviors are much less likely to make their way
into the detailed design and ultimately operations and maintenance of the digital
system.
[Figure 3-7 plot: vertical axis “Coverage”; the only surviving series label is FFMEA.]
Figure 3-7
Relative Coverage of Methods in the Context of System Behaviors
Figure 3-8 illustrates the relative familiarity of each method in the context of
various users that are likely to pick up and apply this guideline. This guideline is
written for technically competent engineers who work with digital I&C
equipment and systems, but users should acknowledge that roles and
responsibilities can vary considerably.
Paradoxically, the Design FMEA is perhaps the method that is most familiar to
I&C engineers, but when it comes to digital I&C systems and components, the
responsibility for performing a Design FMEA on the digital I&C system or
equipment is almost always assigned to the equipment vendor (or the system
integrator, who is then responsible for interfacing with equipment vendors).
Experience has shown that equipment vendors don’t always provide a thorough
or high quality Design FMEA, and if they do provide one, it doesn’t go beyond
the “customer connections,” leaving the responsibility for assessing failure modes
of the full system to the I&C engineer. The Functional FMEA method offers a
strong complement to the Design FMEA by identifying the critical system-level
failure modes before asking an equipment vendor for a Design FMEA, thus
bringing the most attention to the equipment failure modes that intersect with
the critical system-level failure modes.
The Functional FMEA and HAZOP methods are proven and widely used in
other industries (e.g., automotive, petrochemical), but they are relatively
unknown in the I&C engineering community in the nuclear power industry.
Therefore, a facilitator may be necessary for assisting those users who are likely to
apply these methods on an infrequent basis. The use of fault trees in the Top
Down method is proven and widely used in the nuclear power industry, but not
necessarily by I&C engineers, digital equipment vendors, and their proxies, for
whom this guidance is written. Therefore, a facilitator may be necessary for
applying the Top Down method as well.
Finally, the STPA and PGA methods are recent advancements in hazard analysis
methods, emerging from academia and finding their way into various industries.
Textbooks and academic papers describe these methods, and EPRI has
performed some research into their effectiveness. Because they are found to be
promising in their ability to identify unexpected behaviors, some of which have
been found in nuclear operating experience, detailed procedures and worked
examples have been provided in Sections 7 and 8 of this guideline. However, these
methods and procedures will likely require a facilitator to enable their application
on a digital I&C project, and users may need help from experts who either developed
the method or are day-to-day practitioners.
[Figure 3-8 plot: vertical axis “Familiarity”; labeled series for DFMEA, FTA,
FFMEA and HAZOP, and STPA and PGA.]
Figure 3-8
Relative Familiarity of Methods in the Context of Various Users
3.4 Consider a Blended Approach
Each of the methods in this guideline taken to its extreme could be effective in
identifying most of the hazards associated with a digital system. But, taken to
extremes, any single method is likely to:
1. be costly
2. not be performed in a timely manner
3. provide results that are too extensive to be readily understood by those who
must utilize them
4. lose focus on the corrective actions that are worth pursuing
This guideline was not developed for the sole purpose of selecting any one
method to perform a hazard analysis on a given digital system, but neither does it
preclude the use of one preferred method. It would be an unusual digital system
for which a single method could be expected to be ‘best’. Therefore, the
discussion of each of the hazard analysis methods in this guideline emphasizes
that the described steps are not the only way to implement the method; variations
are likely, and the steps can be blended with or replaced by steps described for
other methods in this guideline.
Several blended approaches are described in this Section, but they are not the
only blended approaches that may be devised by analysts.
As long as there is a nexus between the following items assessed by any two
methods, then a blended approach may be useful:
Systems or components to be analyzed
The way hazards are characterized (see Table 3-1)
The following examples show how a few selected methods can be blended to
achieve efficiencies in analysis and design.
Example 3-1. Blending FTA (or Functional FMEA) Results
with a Design FMEA
See Figure 3-9, and consider a digital feedwater control system upgrade in a PWR.
The hazard analysis approach is to first obtain the existing fault trees for the facility,
and identify the faults (or failure effects) that have an adverse effect on the
feedwater system and ultimately the plant. This approach takes advantage of readily
available information that is maintained for use in the facility PRA. Using the existing
fault trees, the analyst can identify the following thread (among others):
Plant System: Feedwater
FTA Failure Effect: Loss of Feedwater
Plant Component: Feedwater Regulating Valve (FRV)
FTA Failure Mode / Design FMEA Failure Effect: FRV Closure
Digital System: Feedwater Controls
Digital System Failure Mode: Output to FRV Fails Low
Digital Component Failure Mechanism: Halted Controller
Although this example may seem trivial, a blended approach for a large complex
system can identify the critical failure modes for a number of threads, help focus
design efforts on the most limiting cases, and avoid wasted effort on unnecessary
design activities or corrective actions for non-critical digital system failure modes.
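The filtering step in Example 3-1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `DfmeaRow` structure and the `critical_rows` helper are hypothetical, the first Design FMEA row follows the thread in Example 3-1, and the second row is invented only to show a non-critical failure mode being screened out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DfmeaRow:
    component: str
    failure_mechanism: str
    failure_mode: str
    failure_effect: str  # effect at the next level up (the nexus with the FTA)


# Failure modes taken from the existing fault trees (Example 3-1 thread).
CRITICAL_FTA_FAILURE_MODES = {"FRV Closure"}

# Design FMEA rows; the first follows Example 3-1, the second is hypothetical.
DFMEA_ROWS = [
    DfmeaRow("Feedwater Controls", "Halted Controller",
             "Output to FRV Fails Low", "FRV Closure"),
    DfmeaRow("Feedwater Controls", "Indicator Failure",
             "Loss of Local Display", "No Effect on FRV"),
]


def critical_rows(rows, critical_modes):
    """Keep only DFMEA rows whose effect matches a critical FTA failure mode."""
    return [row for row in rows if row.failure_effect in critical_modes]


if __name__ == "__main__":
    for row in critical_rows(DFMEA_ROWS, CRITICAL_FTA_FAILURE_MODES):
        print(f"{row.failure_mechanism} -> {row.failure_mode} -> {row.failure_effect}")
```

The match relies on the FTA failure mode and the Design FMEA failure effect being expressed in the same terms, which is the nexus the example describes.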
[Figure 3-9 diagram: the Figure 3-1 hierarchy showing FFMEA failure effects at the
plant functions level, FTA failure effects and failure modes at the plant system
level, and a Design FMEA at the device level.]
Figure 3-9
Blending Functional FMEA (FFMEA) or FTA Results with a Design FMEA (DFMEA)
Extending the results described in Example 3-1, the Top Down (FTA) method
has the potential benefit of reducing the effort needed to analyze digital system
hazards when combined with other methods described in this guideline, as
shown in Table 3-2:
Table 3-2
Blending the Top Down (FTA) Method with Other Hazard Analysis Methods
Example 3-2. Blending a Digital Platform Design FMEA with a Plant
System Design FMEA
See Figure 3-10. This example briefly describes how a Design FMEA (DFMEA)
provided by a digital I&C platform (or system) vendor can be blended with a plant
system Design FMEA. This is not an unusual occurrence, because historically the
DFMEA method has been applied on many digital I&C platforms and digital I&C
upgrade projects, and an integrated view is necessary before the analyst can
conclude that the resulting digital upgrade design does not produce any
unacceptable failure modes and effects.
However, neither the vendor nor the owner/operator typically has the qualifications,
knowledge or experience to prepare one integrated DFMEA, leaving the
owner/operator (or architect/engineer by proxy) with the problem of accepting the
platform DFMEA from the vendor and extending the results to the plant system level,
typically by preparing another DFMEA at that level (i.e., the level of interest as
described in Section 3.2).
Using the same (Example 3-1) feedwater control system upgrade project at a PWR,
the owner/operator can prepare the following thread (among others), from the
bottom up:
Digital Device Failure Mechanism: CPU Stops Running (from platform DFMEA)
Digital Component Failure Mode: Controller Halts (from platform DFMEA)
Digital System Effect: Outputs Rail Low (from platform DFMEA)
Digital System Failure Mechanism: Loss of Signal (identified in plant system
DFMEA)
Plant Component Failure Mode: Closed FRV (identified in plant system DFMEA)
Plant System Failure Effect: Loss of Feedwater (identified in plant system DFMEA)
Blending two Design FMEAs that are prepared at two different levels of interest
provides the integrated view that is necessary for identifying the critical failure
modes for a number of threads in a large complex system. Again, this approach
helps focus design efforts on the most limiting cases, and avoid wasted effort on
unnecessary design activities or corrective actions for non-critical failure modes and
effects.
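The blending step in this example can be sketched in code. The Python fragment below is a minimal illustration only, not part of the guideline's method. The `Row` record layout, the `blend` function, and the `handoff` table that pairs a platform-level effect with a plant-level failure mechanism are assumptions introduced here; the entries themselves are copied from the thread above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Row:
    """One DFMEA row: mechanism -> failure mode -> effect (hypothetical layout)."""
    mechanism: str
    mode: str
    effect: str

# Rows taken from the thread in Example 3-2.
platform_dfmea = [Row("CPU Stops Running", "Controller Halts", "Outputs Rail Low")]
plant_dfmea = [Row("Loss of Signal", "Closed FRV", "Loss of Feedwater")]

# Assumed hand-off table: the analyst decides which platform-level effect
# shows up as which failure mechanism at the plant system level.
handoff = {"Outputs Rail Low": "Loss of Signal"}

def blend(platform_rows, plant_rows, handoff):
    """Chain each platform row to the plant rows whose mechanism it feeds."""
    threads = []
    for p in platform_rows:
        mechanism = handoff.get(p.effect)  # None if no hand-off is defined
        for s in plant_rows:
            if s.mechanism == mechanism:
                threads.append([p.mechanism, p.mode, p.effect,
                                s.mechanism, s.mode, s.effect])
    return threads

threads = blend(platform_dfmea, plant_dfmea, handoff)
for t in threads:
    print(" -> ".join(t))
```

A platform effect with no entry in the hand-off table produces no thread, which is one simple way to flag vendor DFMEA results that have not yet been extended to the plant system level.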
[Figure 3-10 diagram: PLANT FUNCTIONS at the top. Digital Component 1, Digital Component 2, ..., Digital Component n (failure modes) and Device 1, Device 2, ..., Device n (failure mechanisms) are shown against two analyses: the Plant System Design FMEA (plant functions, systems and components) and the Digital Platform Design FMEA (digital systems, components and devices). Annotation: use the digital platform DFMEA output (failure effects) as an input (failure mechanisms) to the plant system DFMEA.]
Figure 3-10
Blending a Digital Platform FMEA with a Digital System FMEA
Example 3-3. Blending Functional FMEA (or FTA) Results with STPA
This example presents a blended approach that is somewhat different from those
described in Examples 3-1 and 3-2. Note that before two methods can be blended,
there should be a nexus at an appropriate system or component level of interest (as
described in Section 3.2), and the way in which hazards are characterized should
be similar.
Figure 3-11 illustrates this concept, but this time showing a nexus between Functional
FMEA failure modes or FTA failure mechanisms and the losses to be considered by the
STPA method. Continuing with the same proposed digital feedwater control system
upgrade at a PWR, the analyst can prepare the following thread (among others),
from the top down using the FTA or Functional FMEA methods:
Plant System: Feedwater
Functional FMEA Failure Mode or FTA Failure Effect: Loss of Feedwater
Actuated Plant Component: Feedwater Regulating Valve (FRV)
Functional FMEA Failure Mechanism or FTA Failure Mode: Spurious Closure of FRV
STPA Loss: Loss of Feedwater
STPA Hazard: Spurious Closure of FRV
STPA Hazardous Control Action (HCA): Digital feedwater system provides close
command to FRV when conditions are normal
STPA Control Flaw: Incomplete process model (e.g., in the software)
This example shows how the results from a top down method can be used to inform the
STPA method, and get down to the level where hazardous conditions may be present
even if there are no system or component failures. Once again, this approach helps
focus design efforts on the most limiting cases, and avoid wasted effort on unnecessary
design activities or corrective actions for non-critical failure modes and effects.
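The nexus described in this example can be checked mechanically once both threads are written down. The following Python sketch is illustrative only: the dictionary keys and the `has_nexus` function are hypothetical names chosen here, and exact string matching is a simplification of the analyst's judgment about whether hazards are characterized in a similar way. The entries are copied from the thread above.

```python
# Top-down (Functional FMEA or FTA) thread from Example 3-3.
top_down = {
    "system": "Feedwater",
    "failure_effect": "Loss of Feedwater",      # FFMEA failure mode / FTA failure effect
    "component": "Feedwater Regulating Valve (FRV)",
    "failure_mode": "Spurious Closure of FRV",  # FFMEA failure mechanism / FTA failure mode
}

# STPA thread from Example 3-3.
stpa = {
    "loss": "Loss of Feedwater",
    "hazard": "Spurious Closure of FRV",
    "hca": "Digital feedwater system provides close command to FRV "
           "when conditions are normal",
    "control_flaw": "Incomplete process model (e.g., in the software)",
}

def has_nexus(top_down, stpa):
    """True when the top-down results line up with the STPA loss and hazard."""
    return (top_down["failure_effect"] == stpa["loss"]
            and top_down["failure_mode"] == stpa["hazard"])

print(has_nexus(top_down, stpa))
```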
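[Figure 3-11 diagram: PLANT FUNCTIONS, Plant Systems 1 through n, and Plant Components 1 through n are shown against three methods. FFMEA: failure effects, failure modes, failure mechanisms. FTA: failure effects, failure modes. STPA: losses, hazards, hazardous control actions (HCA). At the plant system level, FFMEA failure modes and FTA failure effects align with STPA losses; at the plant component level, FFMEA failure mechanisms and FTA failure modes align with STPA hazards.]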
Figure 3-11
Blending Functional FMEA or FTA Results with STPA
Technical Resources
Technical Information
Equipment Access
During hazard analysis activities, the system equipment may need to be reviewed
or accessed. If access to the system is needed during the design and test phases, it
should be identified in the project plan and schedule. Examples of the types of
access requirements consist of equipment walkdowns, equipment inspections,
and equipment testing (factory and site acceptance). The use of access time to
verify and/or validate the hazard analysis information should be specified in the
project plan. The results from the hazard analysis can be included in the test
phases to ensure that the expected response is actually demonstrated by the
system. In addition, walkdowns and inspections can ensure that the system
connections and design meet the expectations of the design documentation that
was used during the hazard analysis.
As part of the identification of the resources needed for the hazard analysis, a
schedule should be developed that outlines the milestones for the analysis. The
milestones will ensure that the analysis development is matched with the various
project lifecycle phases. This will allow for any results from the analysis to be
factored into the system development to mitigate any problems identified by the
analysis. The standard project lifecycles which would need to be aligned with the
analysis milestones consist of the project definition phase, conceptual design
phase, final design review, design testing, and implementation.
Table 3-3 provides the lifecycle or project phase and the corresponding analysis
milestones that would be aligned:
Table 3-3
Project Phases vs. Analysis Milestones
At project initiation there should be a clear definition of the project that details
the intended scope and schedule. The hazard analysis plan can then be
developed, consistent with the intended project scope and synchronized with the
specific project milestones.
After the project is defined, conceptual design of the system begins.
As the conceptual design is developed, a preliminary hazard analysis needs to be
developed to identify potential vulnerabilities in the conceptual design so the
flaws can be eliminated or mitigated prior to getting into the detailed design
activities. The preliminary hazard analysis will serve as the foundation for the
hazard analysis, which will be a living document during the design phase of the
project.
As the design effort progresses, the hazard analysis should be updated at each
lifecycle phase to ensure that problems are identified as early as possible to
minimize the impact of changes needed to address the vulnerabilities. Such
updates can be viewed as iterative. The periodic update points should be
identified in the project schedule and potentially would be aligned with a
30/60/90 percent design review milestone.
When the final design is approved, the hazard analysis should be approved as
well. The final hazard analysis will be based on the approved design for the
project and will serve to demonstrate that the objectives of the analysis have been
satisfied. As system testing and implementation occur, the hazard analysis may
need to be revised to address changes that are made to the design to resolve any
identified problems.
This section is adapted from the Industrial Design Engineering Wiki, available
at http://www.wikid.eu/index.php/Function_analysis. In general, a Function
Analysis provides useful input to a Preliminary Hazard Analysis (PHA), as
described in Section 3.7, because it provides a clear representation of the
functions to be assessed at the level of interest.
The principle of Function Analysis is first to list, describe or specify wanted and
unwanted system, subsystem or component behaviors, and then to infer from
there what the parts, including hardware and software units (which are yet to be
selected and developed into an integrated system) should do. Function Analysis
forces designers to distance themselves from known products and components in
considering the question: what is the new system, subsystem or component
intended to do and how could it do that?
Function Analysis (FA) Procedure
FA Step 1: Gather and assess source information, such as the Final Safety
Analysis Report (FSAR), design and/or system descriptions, PRA success
criteria, system drawings, and any other information that describes the functional
requirements or characteristics of the system or components of interest.
FA Step 2: Describe the main function of the system or process in the form of a
black box. If one main function cannot be described, go to the next step.
FA Step 4: Elaborate the Function Structure. Fit in additional functions (or sub-
functions) which were left out in Steps 2 and 3, and find variations so as to find
the best Function Structure. Variation possibilities include moving the system
boundary, changing the sequence of sub-functions, and splitting or combining
functions or sub-functions. Exploring various possibilities is the essence of
Function Analysis: it allows for an exploration and generation of possible
solutions to the design problem.
Additional Guidance
Development of Function Structure variants is recommended. A statement
of a problem does not typically or imperatively lead to one particular
Function Structure. The strength of Function Analysis lies in the possibility
of creating and comparing, at an abstract level, alternatives for functions and
their structuring.
Certain sub-functions appear in almost all design problems. Knowledge of
elementary or general functions helps in seeking solution-specific functions.
The development of a Function Structure is an iterative process, which can
start from analyzing an existing design or with a first outline of an idea for a
new solution.
Function structures should be kept as simple as possible. The integration of
various functions into one functional block (i.e., a function carrier, such as a
steam generator level control system) is often a useful means in this respect.
Block diagrams of functions should remain conveniently arranged; use simple
and informative symbols. For more on functional symbols and other
representations, see Appendix C.
In industrial design engineering and system design, it is not always possible
to apply structuring principles. In the context of digital I&C systems in
nuclear power plants, functions and processes are better described in terms of
safety, generation, and equipment reliability objectives. A high level, generic
Function/Process Map for a typical Boiling Water Reactor is provided in
Figure 4-1.
FA Step 5: Document the results. The results of the Function Analysis can be
documented in a stand-alone engineering document (e.g., calculation or analysis
package), or they can be documented in the front end of a specific hazard analysis
document that results from using one or more of the hazard analysis methods
described in this guideline.
Per IEEE Std. 1228-1994 (Reference 40), a Preliminary Hazard Analysis (PHA)
(and any additional hazard analyses performed on the entire system or any
portion of the system) identifies:
1. Hazardous system states, typically at the digital system level. However, if the
Function Analysis results are described at the plant component or plant
system level, then hazardous system states would be identified at that level.
In either case, the hazardous system states become constraints (i.e., “must not
do” requirements) that get transferred into the set of digital system
requirements.
2. Sequences of actions that can cause the system to enter a hazardous state
3. Sequences of actions intended to return the system from a hazardous state to
a nonhazardous state
4. Actions intended to mitigate the consequences of accidents or losses
There are two basic approaches for performing a Preliminary Hazard Analysis
(PHA):
The Table Top method involves one or more organized meetings, where the
identified individuals come together and review, discuss and identify potential
hazards that may be introduced or affected by the digital I&C project. The
number of identifiable hazards typically ranges from 3 to 5, and in some cases
may reach 6 to 8.
The Table Top method for performing a PHA relies on the judgment and
experience of individuals knowledgeable in the design, operations, maintenance,
and licensing basis of the potentially affected systems, sub-systems or
components. Such individuals and any additional resources that may be needed
should be identified as described in Section 3.5.
The result is a list of hazards for further consideration as one or more of the
Hazard Analysis methods described in this guideline is/are selected and applied.
Note that the results of the Function Analysis are still a prerequisite for
performing a PHA when users of this guideline jump to the application of one or
more specific Hazard Analysis methods in the conceptual design phase of a
project.
Also note that top-down Hazard Analysis methods such as Functional FMEA,
Top Down, STPA and PGA require identification of functions, in one form or
another, as an early step in the process. The Function Analysis results should be
directly applicable or adaptable (with little additional effort) in these cases.
Acceptance
The hazard analysis plan (if one is used), the project plan (i.e., project risk
analysis), or hazard analysis procedures should specify the criteria that will be
used to determine the acceptability of the analysis. The acceptance criteria will be
developed from the objectives that are identified as described in Section 3.1. For
example, if the objectives included that the analysis would identify single failure
vulnerabilities in the design, then the acceptance criteria could include the
determination that no single failure vulnerabilities exist or that any identified
vulnerabilities have been corrected.
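The single failure example above can be expressed as a simple check. This Python sketch is a hedged illustration: the finding records, their `id` and `corrected` fields, and the function name are hypothetical and are not prescribed by this guideline.

```python
def single_failure_criterion_met(vulnerabilities):
    """Accept when no single failure vulnerabilities exist,
    or when every identified vulnerability has been corrected."""
    return all(v["corrected"] for v in vulnerabilities)

# Hypothetical findings from an analysis.
findings = [
    {"id": "SF-1", "corrected": True},
    {"id": "SF-2", "corrected": False},
]

# Items that still need correction or justification.
open_items = [v["id"] for v in findings if not v["corrected"]]

print(single_failure_criterion_met(findings), open_items)
```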
The project plan or hazard analysis procedures should identify how to address
problems that are unresolved or unmitigated by the design. The level of
justification for the unresolved or unmitigated problems should be specified.
For areas that are not analyzed or cannot be analyzed, the acceptance criteria and
project plan should describe how unanalyzed design areas are to be dispositioned,
up to and including rejection of the system design. Unanalyzed areas of the
design may be acceptable for simple designs in low risk systems or components.
II. References
III. Definitions
Procedures and project plans should specify the hazard analysis documentation
that will be developed at each point in the project or system lifecycle, including
the Operations and Maintenance phase. Any existing documentation that will be
revised as part of the analysis activities will also be specified.
Hazard analysis deliverables that are developed for new technologies introduced
into the plant should be baselined upon completion of a project, then maintained
in a controlled manner for supporting changes. If a change affects a function or
hazard analysis result, the hazard analysis should be updated, and maintained
going forward.
Section 4: Failure Modes and Effects
Analysis (FMEA) Methods
This section describes methods for performing hazard analysis using two FMEA
methods, Functional FMEA (FFMEA) and Design FMEA (DFMEA).
Although not referred to as hazard analysis methods in typical nuclear industry
parlance, the FMEA methods are treated as hazard analysis methods in this
document because they can be used to identify hazardous failures that can lead to
an accident or loss. Annex D of IEEE Std. 7-4.3.2 – 2003 (Reference 9) includes
the following statement (emphasis added):
The FMEA method was first derived and applied in military applications in the
1950’s under MIL-STD-1629 (Reference 27). This method was later used in the
1960’s and 1970’s in the aerospace, automotive, food & beverage and commercial
nuclear power industries, with an emphasis on safety. The automotive industry
added a top-down view to the basic FMEA method (a bottom-up, inductive
view of system, component or device failure mechanisms, modes and effects) by
developing a perspective on causes of failure modes in manufacturing process
steps that could lead to component, assembly or vehicle failures.
This guideline describes two FMEA methods: the Functional FMEA (FFMEA)
method and the Design FMEA (DFMEA) method. The Functional FMEA method
takes a “top down” approach by assessing system-level functions and processes
without necessarily identifying and analyzing specific sets of equipment and their
failure modes. Thus the Functional FMEA method is more suitable for
analyzing a system at the conceptual design phase in order to identify functional
hazards or hazardous conditions that should be addressed in later phases of the
lifecycle.
The Design FMEA method is one that should be more familiar to equipment
vendors, I&C engineers, and other stakeholders in the digital I&C community.
It is the traditional bottom-up approach that is described in various standards
such as IEEE Std. 352-1987 (Reference 1).
In general, the Functional FMEA is well suited for identifying hazardous failure
modes that can help limit the focus or scope of a Design FMEA. The Functional
FMEA should be performed by plant staff (or a designated contractor such as an
Architect/Engineer firm) early in the modification process, before an equipment
vendor or third-party integrator is asked to perform a Design FMEA. The
completed Functional FMEA can be an input to the Design FMEA activity so
the analyst can readily identify the functional or process-related failure modes
that should be eliminated, prevented or mitigated by the detailed design.
The Functional FMEA method is adapted to digital I&C systems in the nuclear
industry by considering the plant system functions and processes that are sensed,
controlled and indicated by digital I&C equipment. A Functional FMEA can be
particularly useful if it is applied before a Design FMEA is executed, when the
results can be used to reduce the scope of the Design FMEA to the failure
mechanisms that can arise from the affected plant functions and processes.
Prerequisite
The results of a Function Analysis, as described in Section 3.6, are a useful input
to the Functional FMEA (FFMEA) because they provide a well-organized set of
functions that can feed into the first two steps of the FFMEA procedure.
The first step in the FFMEA process is to draw a Function/Process Map, which
is a hierarchical view of plant system functions and processes of interest to the
analyst. The Function/Process Map uses the results of the Function Analysis
method described in Section 3.6. A generic Function/Process Map for a typical
BWR is provided in Figure 4-1. Note that it does not list or describe any specific
equipment or systems, structures or components beyond the heaviest components
(e.g., reactor, main turbine).
The focus on plant functions and processes is consistent with the expectations of
the AIAG Reference Manual on FFMEA, and is helpful because it supports a
top-down view of critical functions without forcing a complete bottom-up
analysis of all credible equipment failure modes and effects as expected by the
Design FMEA method (Section 4.3). The resulting Function/Process Map
therefore describes functions and processes at a level of abstraction that does not
need to identify specific equipment.
Note that the generic Function/Process Map presented in Figure 4-1 resembles a
fault tree to some extent, with the exception of logic symbols. It does not
represent success or failure criteria, or contiguous processes; it is simply a
hierarchical view of basic plant functions. However, the plant-specific fault tree
used in the PRA is likely to be a good input document for developing the
function/process map from a functional point of view (i.e., ignoring the failure
logic).
[Figure 4-1 diagram, Sheet 1: BWR Plant Operations branches into Safety, Equipment Protection, and Power Generation. Each branch continues on its own sheet (To Safety Map, To Equipment Protection Map, To Power Generation Map).]
Figure 4-1
Generic BWR Function/Process Map (Sheet 1 of 3)
[Figure 4-1 diagram (continued), Safety map, linked to the BWR Plant Operations Map and the Power Generation Map. Upper-level blocks: Fire Protection; Radiation Protection; Safety Tagging; Industrial Safety Manual; Primary Coolant System Integrity; Shutdown Rx and Maintain Safe Shutdown; Limit Releases to Environment. Lower-level blocks include Sense Smoke/Fire; Initiate Fire Suppression; Sense Radiation; Radiation Indications & Alarms; Isolate Area; Equipment Tagout Indications; Reactivity Control; coolant inventory control; and primary and secondary containment control (isolation, pressure, temperature, inventory).]
[Figure 4-1 diagram (continued), Equipment Protection map and Power Generation map, each linked to the BWR Plant Operations Map. Equipment Protection blocks: Isolate Energy Sources; Trip Rotating Equipment; Fire Protection (To Safety Map). Power Generation blocks: Reactor (From Safety Map); Main Turbine; Main Generator.]
Table 4-1
Sample Functional FMEA Worksheet
Header fields:
High Level Function/Process (check one): ( ) Safety  ( ) Equipment Protection  ( ) Power Generation
Equipment:
Lifecycle Phase:
Rev:
Checked by/Date:
Approval/Date:
Columns:
Row No. | Function | Process | Requirement(s) | Potential Failure Mode | Potential Effect(s) of Failure | Potential Cause(s)/Mechanism of Failure | Current Prevent/Detect Method: Prevention | Current Prevent/Detect Method: Detection | Recommended Action
FFMEA Step 2: Identify the functions and related processes of interest.
In a digital upgrade project, the functions and processes of interest are typically
those that are affected by the systems or components that are being replaced or
modified, and are usually relatively easy to identify at a functional level. For larger
upgrades that affect multiple plant process systems, there may be multiple
functions or processes that are differentiated by functional segments in the
architecture. Using the Function/Process Map developed in Step 1, highlight or
list the lowest function/process blocks that are affected by the equipment,
systems or components of interest.
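Selecting the lowest affected blocks can be sketched as a walk over a tree. The Python fragment below is a minimal illustration under stated assumptions: the nested-dictionary representation, the `leaves` helper, and the placement of the two lower-level functions under Safety and Equipment Protection are hypothetical choices made for this sketch. The block names themselves appear in Figure 4-1 and in Example 4-1.

```python
# Hypothetical nested-dict form of a Function/Process Map.
# An empty dict marks a lowest-level block.
function_map = {
    "BWR Plant Operations": {
        "Safety": {"High Pressure Injection": {}},
        "Equipment Protection": {"Trip Turbine Driven Equipment": {}},
        "Power Generation": {},
    }
}

def leaves(tree, path=()):
    """Yield the full path to every lowest-level block in the map."""
    for name, children in tree.items():
        here = path + (name,)
        if children:
            yield from leaves(children, here)
        else:
            yield here

# Blocks the analyst judges to be affected by the equipment of interest
# (an assumed input for this sketch).
affected = {"High Pressure Injection", "Trip Turbine Driven Equipment"}

blocks_of_interest = [p for p in leaves(function_map) if p[-1] in affected]
for path in blocks_of_interest:
    print(" / ".join(path))
```

Keeping the full path with each block also records which high level function it falls under.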
Identify the following items in the appropriate blocks at the top of the worksheet:
FFMEA Number, Sheet Number, Revision, Lifecycle Phase
High Level Function/Process
- Nuclear Safety
- Power Generation
- Equipment Protection
Equipment
If more than one of the high level functions identified in the “High Level
Function/Process” block is affected (Nuclear Safety, Power Generation or
Equipment Protection), then a separate FFMEA worksheet should be prepared
for each high level function. It is not unusual for digital I&C equipment
functions to affect all three high level functions one way or another.
FFMEA Step 6: Identify the lowest level Functions, Processes and related
Requirements.
On each worksheet, under the column labeled “Function,” list the lowest level
functions from the Function/Process Map that are performed by or affected by
the identified equipment. In many cases, there may only be one or two entries in
the “Function” column on each worksheet.
Under the “Process” column, identify the basic Processes that are used to fulfill
each Function. In the context of the FFMEA method, a basic process may be
one or more of the following fundamental processes, characterized by the
properties of the system:
Energy Storage (thermal, electric, fluid, fuel, etc.)
Energy Transport (fluid flow, current flow, etc.)
Energy Addition (pumping, heating, boiling, charging, generating, etc.)
Energy Reduction (relieving, cooling, condensing, discharging, motoring, etc.)
Energy Conversion (nuclear to thermal, thermal to kinetic, kinetic to electric,
etc.)
Energy Containment
For each identified Process, briefly list the associated functional or performance
requirements in the “Requirements” column.
FFMEA Step 7: Using the FFMEA Guide Words, postulate the failure modes
of each Process.
For each Process identified in Step 6, postulate the following Guide Words and
list the results under the “Potential Failure Mode” column:
1. No Function
2. Partial Function
3. Over Function
4. Degraded Function
5. Intermittent Function
6. Unintended Function
These FFMEA Guide Words are designed to answer the “what can go wrong?”
question as it relates to each Process identified in Step 6. Each of the FFMEA
Guide Words is postulated and evaluated individually against each identified
Process, thus making the FFMEA method effective and useful for identifying
single failures, both active and passive, and the resulting effects.
It is not necessary to identify potential failure modes for all six Guide Words if
one or more Guide Words is not applicable or not credible. For example, if a
Process of initiating a safety function is being considered under the general
heading of Nuclear Safety, then the Guide Word “Unintended Function” is not
applicable if the Process is defined as one that is performed on demand due to an
accident condition (i.e., the function is actually intended, so the idea of an
“unintended function” doesn’t make any sense in this context). However, when
evaluating the same Process (initiating a safety function) under the general
heading of Power Generation, then “Unintended Function” would be evaluated
as a spurious actuation because the Process is defined as one that is required on
demand.
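Step 7 amounts to pairing every Process with every Guide Word and then setting aside the pairs that are not applicable or not credible. The Python sketch below shows one way to generate candidate worksheet rows. The function name, the row fields, and the particular pair that is skipped are hypothetical; the Guide Words and Process names are taken from the lists above.

```python
from itertools import product

GUIDE_WORDS = [
    "No Function", "Partial Function", "Over Function",
    "Degraded Function", "Intermittent Function", "Unintended Function",
]

def postulate(processes, not_applicable=frozenset()):
    """Return one candidate row per (Process, Guide Word) pair,
    skipping pairs the analyst has marked as not applicable."""
    rows = []
    for process, word in product(processes, GUIDE_WORDS):
        if (process, word) in not_applicable:
            continue
        rows.append({"process": process, "potential_failure_mode": word})
    return rows

processes = ["Energy Addition", "Energy Transport"]

# Assumed analyst judgment for this sketch only.
skip = {("Energy Addition", "Unintended Function")}

rows = postulate(processes, skip)
print(len(rows))
```

Recording the skipped pairs explicitly, rather than simply omitting them, leaves a trace of why a Guide Word was not evaluated.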
FFMEA Step 8: Determine the resulting effects that each Process Failure Mode
can have on the system of interest and the plant.
This step involves following the Potential Failure Modes identified in Step 7 out
to their effects at the system and plant level. This step requires knowledge of the
system or equipment of interest and how it can affect plant operations in terms of
safety, power generation, and equipment protection. This step may require some
cross-discipline support from design engineers, system engineers, or component
engineers who are technically competent in these areas. The results of this step
are entered in the FFMEA worksheet under the column labeled “Potential
Effects of Failure.”
This step typically requires some knowledge of the equipment that is or would be
involved in the potential failures. The results are listed under the column
“Potential Cause(s)/Mechanism of Failure.”
there is a postulated functioning of high pressure coolant injection (HPCI) when
there is no valid demand for HPCI. In other words, the HPCI pump will trip,
using the trip/throttle valve, if there is a spurious actuation and reactor level
reaches a high level setpoint.
Design features and functions can only be credited if they are independent from
the functions and processes that are within the scope of the FFMEA.
Continuing with the HPCI example, if the level of interest (per Section 3.2) is
the HPCI flow control system, as shown in Example 4-1 below, and the
trip/throttle function (implemented via sensors, bistables and the trip/throttle
valve) is outside of the system of interest, then credit can be taken for the HPCI
trip function to prevent or mitigate reactor overfill in the event of a postulated
functional actuation of the flow control system.
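The independence condition can be approximated as a check that a credited feature shares no components with the scope of the FFMEA. The Python sketch below is a simplification: set disjointness stands in for a full independence review, the function name is hypothetical, and the contents of the flow control system scope are assumed for illustration. The trip function components are those named in the paragraph above.

```python
def creditable(feature_components, scope_components):
    """A feature may be credited only if it shares nothing with the scope."""
    return set(feature_components).isdisjoint(scope_components)

# Assumed scope of the HPCI flow control system for this sketch.
scope = {"digital governor", "digital positioner", "governor valve"}

# Trip/throttle function components named in the text.
trip_function = {"sensors", "bistables", "trip/throttle valve"}

print(creditable(trip_function, scope))
```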
The following examples were originally developed for EPRI 1022985 (Reference
15). They are repeated here with some minor changes that make them more
complete, and some editorial changes that show how they follow the FMEA
procedures described in Sections 4.2 and 4.3. The first example demonstrates the
Functional FMEA method, and the second example demonstrates the Design
FMEA method.
Example 4-1. HPCI-RCIC Turbine Controls Functional FMEA (continued)
Figure 4-3 satisfies the prerequisite for a Function Analysis (FA). It illustrates a
section of the overall, high-level BWR Function/Process Map that was provided in
Figure 4-1, and it is further developed to show the lower-level functions of “High
Pressure Injection” and “Trip Turbine Driven Equipment” and their related processes
that are necessary for satisfying the overall functions of Safety, Equipment Protection,
and Power Generation. These lower-level functions and processes will be used to
initiate the FFMEA worksheets in later steps.
FFMEA Step 2: Identify the functions and related processes of interest.
Figure 4-3 highlights the functions and related processes of interest for this example.
FFMEA Step 3: Write a summary description.
Summary descriptions of the HPCI and RCIC systems are provided below:
HPCI Summary Description
The design basis function of the HPCI system is reactor inventory control to ensure
the reactor core is adequately cooled to limit maximum fuel cladding temperature
following a small-break loss-of-coolant-accident (LOCA) which does not rapidly
depressurize the reactor pressure vessel (10CFR50.46 Criterion 1). HPCI also
provides a reactor inventory control function following other initiating events such as
transients, stuck open safety relief valves (SRVs), medium-break LOCAs and
anticipated transient without scram (ATWS).
HPCI can be initiated manually, or it will initiate automatically via high drywell
pressure or Low-Low reactor water level. The maximum response time allowed to
achieve rated flow is 60 seconds in the design basis analysis, but can be much
longer and still be successful, particularly given best estimate assumptions and given
non-LOCA initiating events.
When HPCI is initiated, the system initiation signal opens the turbine steam
admission valve. A limit switch on the valve changes state as the valve begins
to open, thus sending an enable signal to the
digital governor via one set of contacts and to the digital positioner via a second set
of contacts. The governor responds by sending a governor valve position demand
signal that ramps the turbine speed up to a preferred initial speed, then switches to
PID control in order to respond automatically to changes in system load. The
purpose of the ramp function is to enable controlled acceleration of the turbine and
avoid initial overspeed transients that may encroach upon the mechanical overspeed
trip limit. To support the initial response of the turbine when an enable signal is
received, the governor valve is preset to a partially open position.
The HPCI pump is a two stage component (booster pump + main pump), driven by a
single steam turbine. The pump takes suction from the condensate storage tank (CST)
until it reaches low level, then the suction source is switched to the suppression pool.
The pump supplies water to the reactor vessel via the feedwater line, or it can be
aligned in recirculation mode to discharge to the CST during surveillance tests. The
HPCI turbine is driven by Main Steam, which exhausts to the suppression pool after
leaving the turbine.
The HPCI turbine is automatically tripped on any HPCI isolation signal, high turbine
exhaust pressure, high reactor water level, low pump suction pressure, or
mechanical overspeed. The turbine can also be tripped manually from the Main
Control Room (MCR), the Remote Shutdown Panel (RSP), or locally at the turbine.
The turbine is tripped by closing the trip/throttle valve shown in Figure 4-1, thus
isolating the steam supply.
RCIC Summary Description
The design basis function of the RCIC system is reactor inventory control to provide
makeup water to the reactor vessel during reactor shutdown and isolation when the
main condenser and feedwater system are unavailable. RCIC also provides a
reactor inventory makeup function following initiating events such as non-isolation
transients and stuck open SRVs.
RCIC can be initiated manually, or it will initiate automatically via Low-Low reactor
water level. There is no automatic initiation of RCIC on drywell pressure. The design
basis maximum response time allowed to achieve rated flow is 60 seconds but can
be much longer and still be successful.
When RCIC is initiated, the system initiation signal opens the turbine steam
admission valve. A limit switch on the valve changes state when the valve
reaches 20% open, thus sending an enable signal to
the digital governor via one set of contacts and to the digital positioner via a second
set of contacts. The governor responds by sending a governor valve position
demand signal that ramps the turbine speed up to a preferred initial speed, then
switches to PID control in order to respond automatically to changes in system load.
The purpose of the ramp function is to enable controlled acceleration of the turbine
and avoid initial overspeed transients that may encroach upon the mechanical
overspeed trip limit. To support the initial response of the turbine when an enable
signal is received, the governor valve is preset to a partially open position.
The RCIC pump is driven by a single steam turbine. The pump takes suction from the
condensate storage tank (CST) until it reaches low level, then the suction source is
switched to the suppression pool. The pump supplies water to the reactor vessel via
the feedwater line, or it can be aligned in recirculation mode to discharge to the
CST during surveillance tests. The RCIC turbine is driven by Main Steam which
exhausts to the suppression pool after leaving the turbine.
The RCIC turbine is automatically tripped on any RCIC isolation signal, high turbine
exhaust pressure, high reactor water level, low pump suction pressure, or
mechanical overspeed. The turbine can also be tripped manually from the MCR, the
RSP, or locally at the turbine. The turbine is tripped by closing the trip/throttle valve
shown in Figure 4-1, thus isolating the steam supply.
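The trip conditions listed above amount to a logical OR of automatic and manual inputs. A minimal sketch follows, assuming each condition is available as a boolean; the field and function names are hypothetical, and the mechanical overspeed trip is modeled as a flag purely for illustration.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class TripInputs:
    """One boolean per trip condition named in the summary description."""
    isolation_signal: bool = False
    high_turbine_exhaust_pressure: bool = False
    high_reactor_water_level: bool = False
    low_pump_suction_pressure: bool = False
    mechanical_overspeed: bool = False
    manual_trip_mcr: bool = False
    manual_trip_rsp: bool = False
    manual_trip_local: bool = False


def active_trip_causes(inputs: TripInputs) -> List[str]:
    """Names of all trip conditions that are currently true."""
    return [name for name, active in asdict(inputs).items() if active]


def trip_throttle_valve_should_close(inputs: TripInputs) -> bool:
    """Any single active condition is enough to close the trip/throttle valve."""
    return bool(active_trip_causes(inputs))
```

Returning the list of active causes, rather than only the OR result, is a design choice that makes it easier to see which condition produced a trip.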
FFMEA Step 4: Prepare a Functional FMEA worksheet.
Functional FMEA worksheets are provided in Table 4-2.
FFMEA Step 5: On each worksheet, fill out the header rows.
In this example there are three sheets, differentiated by major BWR function in the
upper left corner of the worksheet.
FFMEA Step 6: Identify the lowest level Functions, Processes and related Requirements.
In this example, the results of Step 6 are shown in Table 4-2 under the columns
labeled “Function,” “Process” and “Requirements.”
Note that the entries in the “Function” and “Process” columns are transposed directly
from the Function/Process Map provided in Figure 4-3. The “Requirements” column
entries are derived from the plant FSAR, Technical Specifications, and System
Descriptions.
FFMEA Step 7: Using the FFMEA Guide Words, postulate the failure modes of
each Process
In this example, the results of Step 7 are shown in Table 4-2 under the column
labeled “Potential Failure Mode.” Note that for most Processes identified in this
example, only 3 or 4 of the 6 FFMEA Guide Words yield a result, where the other 2
or 3 Guide Words are not applicable. In each case, the failures are postulated as
single failures, albeit from a process point of view.
Only one Process, “Turbine/Pump provides required coolant flow,” was evaluated
with 5 Guide Words, shown in Rows 1 through 5 of the “Power Generation”
worksheet (sheet 3 of 3 in Table 4-2). This particular Process picked up the
postulated failure of “Unintended Function” because it is relevant in the context of
Power Generation functions. However, this same postulated failure is not applicable
in the context of Safety functions because the Safety requirement is stated in a
manner that HPCI or RCIC flow is required on demand (i.e., during an accident or
surveillance test), when the function of High Pressure Injection is actually intended.
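To illustrate how a requirement turns into postulated failure modes, the sketch below compares an observed response against the Safety requirement in Table 4-2 (rated flow on demand within 60 seconds) and returns the matching "less than", "more than", or "late" failure mode. The function name, the returned labels, and the 5% tolerance band are assumptions for illustration, not part of the FFMEA method.

```python
from typing import Optional


def postulate_flow_failure_mode(
    rated_gpm: float,
    achieved_gpm: float,
    seconds_to_rated: float,
    limit_s: float = 60.0,    # response time requirement from Table 4-2
    tolerance: float = 0.05,  # hypothetical acceptance band around rated flow
) -> Optional[str]:
    """Return the postulated functional failure mode, or None if the requirement is met."""
    if achieved_gpm < rated_gpm * (1.0 - tolerance):
        return "Less than required flow"
    if achieved_gpm > rated_gpm * (1.0 + tolerance):
        return "More than required flow"
    if seconds_to_rated > limit_s:
        return "Required flow, but late"
    return None
```

For example, `postulate_flow_failure_mode(5000, 5000, 90)` returns the "late" mode for an HPCI response that reaches rated flow after 90 seconds.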
FFMEA Step 8: Determine the resulting effects that each Functional Failure Mode
can have on the system of interest and the plant.
In this example, these effects are listed in Table 4-2 in the column labeled “Potential
Effect(s) of Failure.” Note that the effects are described in terms of impact on
equipment (e.g., turbine trip) or system function (e.g., loss of HPCI or RCIC).
FFMEA Step 9: Determine the Potential Causes or Failure Mechanisms for each
identified Functional Failure Mode.
In this example, the Potential Causes or Failure Mechanisms for each Functional
Failure Mode are listed in Table 4-2 in the column labeled “Potential
Cause(s)/Mechanism of Failure.” Note that the identified causes are generally
mechanical or electrical in nature, and none of them are directly attributable to any
digital I&C equipment because the FFMEA method looks at failure modes and effects
at a functional level before any specific digital equipment is identified.
Note that almost all of the identified methods of Prevention or Detection take
advantage of typical programs and processes that are in place at a typical nuclear
plant, including Preventive Maintenance, Procedures, Chemistry, Human
Performance, Surveillance Testing, ASME Section 11 Testing, and System Alarms. In
a few cases, “Software V&V” is identified as a method of Prevention, thus drawing
attention to the potential digital I&C solution that may be considered for replacing or
upgrading the HPCI or RCIC flow control system.
FFMEA Step 10: Provide Recommended Actions.
In this example, the Recommended Actions listed in Table 4-2 are centered on using
the Design FMEA method to further explore failure modes and effects of any
proposed digital I&C solution that can result in the related Functional Failure Modes.
For example, rows 1 through 4 on Sheet 1 of 3 in Table 4-2 show “Software V&V”
as a method of preventing turbine trips, failed initiation, late initiation, ramp rate too
slow, and other causes of a failed turbine/pump flow Process. These “causes” or
“failure mechanisms” at the Process level can be considered “failure modes” at the
Equipment level, which can be evaluated using the Design FMEA method.
FFMEA Step 12: Apply the results.
Because this example is constructed at the conceptual design stage of a hypothetical
project, the results of this Functional FMEA in Table 4-2 would be provided to the
appropriate analyst responsible for performing a Design FMEA downstream in the
project lifecycle. In other words, the Functional and Design FMEAs are linked with
respect to digital I&C equipment failure modes and effects that can result in
hazardous (i.e., unwanted) effects on system Functions and Processes.
Figure 4-2
HPCI/RCIC System Diagram
[Figure 4-3: function/process map. Top level: BWR Plant Operations, branching to Safety, Equipment Protection, and Power Generation. The Example 4-1 functions descend through Reactor Coolant Inventory Control to High Pressure Inventory Control (shown beside Low Pressure Inventory Control). Processes shown under High Pressure Inventory Control: Turbine/Pump Provides Flow; Steam Supply; Suction Supply; Coolant Path to Rx; Sense Flow Deviation; Sense Level Deviation; Sense Pressure Deviation; Sense Temp. Deviation; Sense Overspeed Condition.]
Figure 4-3
High Pressure Injection Function/Process Map
Table 4-2
HPCI/RCIC Flow Control System Functional FMEA Worksheets
High Level Process/Functional Area (check one): (X) Safety  ( ) Equipment Protection  ( ) Power Generation
Equipment: HPCI/RCIC Flow Control System
Lifecycle Phase: Conceptual Design    Rev: 0a
Checked by/Date:    Approval/Date:
Function (all rows): High Pressure Injection
Row No. | Process | Requirement(s) | Potential Failure Mode | Potential Effect(s) of Failure | Potential Cause(s)/Mechanism of Failure | Current Prevention | Current Detection | Recommended Action
2 | Turbine/pump provides required coolant flow | 5000 gpm (HPCI), 500 gpm (RCIC) @ 1000 psi, on demand, within 60 seconds | Less than 5000 gpm (HPCI) or 500 gpm (RCIC) | Less than adequate Rx inventory, possibly leading to core damage (medium LOCA) | 1. HPCI starts, but turbine trips; 2. Turbine speed too low; 3. Incorrect setpoint | 1. Software V&V; 2. ESFAS PM; 3. Turbine PM; 4. Setpoint Control Program; 5. Human Performance | 1. ESFAS Test; 2. System Flow Test; 3. Alarms | Evaluate flow control system failure modes via DFMEA
3 | (as row 2) | (as row 2) | More than 5000 gpm (HPCI) or 500 gpm (RCIC) | HPCI or RCIC turbine trip on high Rx level (via trip valve) | 1. Turbine speed too high; 2. Incorrect flow setpoint | (as row 2) | (as row 2) | (as row 2)
4 | (as row 2) | (as row 2) | 5000 gpm (HPCI) or 500 gpm (RCIC), but after 60 seconds | Less than adequate Rx inventory, possibly leading to core damage | 1. Late initiation signal (or late response); 2. Ramp rate too slow; 3. Turbine trips | (as row 2) | (as row 2) | (as row 2)
5 | Steam Supply to Turbine | Supply high quality saturated steam at 1000 psig | No steam flow | Loss of Rx inventory, leading to core damage | 1. Steam line break; 2. Inadvertent isolation | 1. H2O Chem.; 2. Human Performance | 1. Section 11 Test; 2. Alarms |
6 | (as row 5) | (as row 5) | Poor steam quality (high moisture) | Turbine degradation, eventual loss of Rx inventory | 1. High carryover from Rx | Rx PM | 1. System Flow Test; 2. Turbine PM |
7 | (as row 5) | (as row 5) | Steam pressure < 1000 psig | Turbine can run as low as 150 psig, then low pressure systems take over | 1. Steam line leak; 2. Steam line partial blockage | 1. H2O Chem.; 2. FME Program | 1. Section 11 Test; 2. Alarms |
8 | (as row 5) | (as row 5) | Steam pressure > 1000 psig | Relief valves lift, steam pressure/flow transients | 1. Steam hammer; 2. Rx pressure transient | 1. Ops Procedures; 2. Human Performance (cell spans rows 8 and 9) | Alarms |
9 | Suction Supply to Pump | Supply clean, demineralized water with adequate NPSH | No water flow | Loss of Rx inventory, leading to core damage | 1. Empty CST or Torus; 2. Inadvertent isolation | (as row 8) | 1. Alarms; 2. CST/Torus Surveillance |
10 | (as row 9) | (as row 9) | Foreign material in water | 1. Pump damage, less than adequate flow; 2. Clogged strainer, low NPSH, less than adequate flow | 1. Inadequate FME controls; 2. Material degradation | 1. Human Performance; 2. H2O Chemistry | 1. System Flow Test; 2. Chemistry Samples |
11 | (as row 9) | (as row 9) | Less than adequate NPSH | 1. Pump cavitation, eventual damage, less than adequate flow | 1. Low water level in CST or Torus; 2. Pipe obstruction | 1. Ops Procedures; 2. FME Program | CST/Torus Surveillance Test |
12 | Coolant Flow Path to Rx | Maintain pressure boundary integrity, capable of 5000 gpm @ 1000 psi | Loss of pressure boundary | Loss of Rx inventory, leading to core damage | 1. Pipe break; 2. Intersystem leak | 1. H2O Chemistry; 2. Human Performance | Alarms |
13 | (as row 12) | (as row 12) | Capacity less than 5000 gpm | Less than adequate Rx inventory, possibly leading to core damage (medium LOCA) | 1. Pipe leak; 2. Intersystem leak | (as row 12) | (as row 12) |
14 | (as row 12) | (as row 12) | Less than 1000 psi | Less than adequate Rx inventory, possibly leading to core damage | (as row 13) | (as row 12) | (as row 12) |
Table 4-2 (continued)
HPCI/RCIC Flow Control System Functional FMEA Worksheets
High Level Process/Functional Area (check one): ( ) Safety  (X) Equipment Protection  ( ) Power Generation
Equipment: HPCI/RCIC Flow Control System
Lifecycle Phase: Conceptual Design    Rev: 0a
Checked by/Date:    Approval/Date:
Row No. | Function | Process | Requirement(s) | Potential Failure Mode | Potential Effect(s) of Failure | Potential Cause(s)/Mechanism of Failure | Current Prevention | Current Detection | Recommended Action
Table 4-2 (continued)
HPCI/RCIC Flow Control System Functional FMEA Worksheets
High Level Process/Functional Area (check one): ( ) Safety  ( ) Equipment Protection  (X) Power Generation
Equipment: HPCI/RCIC Flow Control System
Lifecycle Phase: Conceptual Design    Rev: 0a
Checked by/Date:    Approval/Date:
Row No. | Function | Process | Requirement(s) | Potential Failure Mode | Potential Effect(s) of Failure | Potential Cause(s)/Mechanism of Failure | Current Prevention | Current Detection | Recommended Action
4.4 Design FMEA (DFMEA) Procedure
The following steps can be used to perform the Design FMEA (DFMEA)
method. This procedure is not the only way to implement the method; variations
are likely, depending on the owner/operator’s engineering and configuration
management program, and its implementing policies and procedures.
Prerequisite
The results of a Function Analysis, as described in Section 3.6, are useful inputs
to the DFMEA because they provide a well-organized set of functions that can
feed into the steps of the DFMEA procedure that consider failure modes and
their effects on the associated systems.
It may be necessary to prepare more than one version of the block diagram in
order to represent different system conditions that may arise in the operations
and maintenance phase of its lifecycle. An example would be one version that
shows a normal system condition during plant operations, and another version
that shows the system out-of-service in a maintenance mode or configuration
(e.g., with a configuration tool connected to an available port on a controller). In
this case, each version of the block diagram would be analyzed using the
remaining steps in this procedure. If portions of each version overlap or share
common characteristics, then it may not be necessary to repeat the analysis for
those portions.
be multiple boundaries that are differentiated by functional segments in the
architecture.
On new plant projects, it may be necessary to break down the digital I&C
architecture into systems and sub-systems, differentiated by functional segments
in the architecture.
Operating experience shows that Design FMEAs do not always account for
equipment interfaces that are actually used in the finished system, including
interfaces that are used on a temporary or intermittent basis (References 10, 11
and 16). Therefore, this step can be strengthened by the following methods:
Verify the equipment interfaces described in the technical information that is
provided with the digital system or components of interest (e.g., technical
manual)
Examine the interfaces on the actual equipment if it is available, via
walkdown or inspection (e.g., terminal blocks and data communication ports)
The goal of this step for either method is to demonstrate that all of the digital
equipment interfaces used in the target application are accounted for in the block
diagram.
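The goal stated above can be checked mechanically once the interface lists exist. The sketch below, with hypothetical function and parameter names, reports any interface found in the technical information or on the actual equipment that does not appear in the block diagram.

```python
from typing import Iterable, List


def unaccounted_interfaces(
    block_diagram: Iterable[str],
    technical_information: Iterable[str],
    walkdown: Iterable[str],
) -> List[str]:
    """Interfaces seen in the manual or on the equipment but absent from the block diagram."""
    found = set(technical_information) | set(walkdown)
    return sorted(found - set(block_diagram))
```

An empty result supports the claim that the block diagram accounts for every interface identified by either method; a non-empty result lists the interfaces that still need to be added or dispositioned.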
Using the results of the Function Analysis (per Section 3.6), write a summary
description of the basic functions of the components inside the boundary drawn
in Step 2, and their interfaces with other equipment or components that cross the
boundary. The purpose of this section is to help anyone reading the DFMEA
understand the basic functions of the system or components being analyzed. It is
not necessary to develop or repeat a comprehensive system description, such as
would be found in a typical plant system description. The summary description
should be developed only to the extent that it supports the analysis.
DFMEA Step 5: On each worksheet, identify the interfacing components,
signals, power supplies, and other interfaces that can affect the functions or
performance of the components of interest.
The typical approach for this step is to examine the block diagram, identify each
interface that crosses the boundary drawn in Step 2, and identify the system or
component outside the boundary that provides the interface. The results of this
Step are entered on the DFMEA worksheet under the column labeled
“Component Identification.”
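One way to picture this step is to treat the block diagram as a list of connections and keep only those with exactly one end inside the analysis boundary. The sketch below is illustrative: the function name and data layout are assumptions, and the sample connections are paraphrased from the component functions in Table 4-4 with the governor and positioner inside the boundary.

```python
from typing import Iterable, List, Set, Tuple

Connection = Tuple[str, str, str]  # (source, destination, signal)


def boundary_interfaces(connections: Iterable[Connection], inside: Set[str]) -> List[Tuple[str, str]]:
    """Return (outside component, signal) for each connection that crosses the boundary."""
    crossing = []
    for source, destination, signal in connections:
        if (source in inside) != (destination in inside):
            outside = destination if source in inside else source
            crossing.append((outside, signal))
    return crossing


CONNECTIONS = [
    ("Magnetic Pickup (MPU)", "Governor", "actual turbine speed"),
    ("Limit Switch (LS)", "Governor", "enable"),
    ("Limit Switch (LS)", "Positioner", "enable"),
    ("24 VDC Power", "Governor", "24 VDC"),
    ("24 VDC Power", "Positioner", "24 VDC"),
    ("Governor", "Positioner", "valve position demand"),  # internal, does not cross
    ("Positioner", "Actuator", "valve position signal"),
    ("Resolver Feedback", "Positioner", "actuator stem position"),
]
INSIDE = {"Governor", "Positioner"}
```

Each pair returned by `boundary_interfaces(CONNECTIONS, INSIDE)` is a candidate entry for the "Component Identification" column.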
For each entry in the “Component Identification” column, determine its failure
modes using available technical information. The results of this step are entered
under the column labeled “Failure Modes.”
DFMEA Step 7: Determine the likely failure mechanisms associated with each
failure mode identified in Step 6.
Failure mechanisms are included in the FMEA worksheet because they provide
some insight for assessing the use of system design features and defensive
measures that may be available to help reduce the likelihood of such failure
mechanisms.
DFMEA Step 8: Determine the resulting effects that each interfacing system or
component failure mode can have on the components of interest, and the
resulting effects on the system.
This step involves following the device, component, or sub-system failure modes
out to their effects at the system level. This step requires some knowledge of the
plant system, including its control system, the controlled elements, the process
elements, and their mechanical or electrical properties. This step may require
some cross-discipline support from design engineers, system engineers, or
component engineers who are technically competent in these areas. The results of
this step are entered in the DFMEA worksheet under the column labeled
“Effects on System.”
DFMEA Step 9: Determine the methods of detection for each failure mode
identified in Step 6.
See Section 4.5 for guidance on selecting and applying methods of detection
during system development. The results of this Step are entered in the DFMEA
worksheet under the column labeled “Method of Detection.”
The “Remarks” column in the FMEA worksheet is used by the analyst to explain
unusual results or identify measures that could be taken to help prevent or
mitigate the identified failure mode.
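The worksheet columns filled in by the steps above can be captured as a simple record, which makes it easy to spot rows that are still incomplete. This is a sketch only; the class and function names are hypothetical and the field names follow the column labels of the sample worksheet in Table 4-3.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DfmeaRow:
    """One worksheet row; one field per column of the sample DFMEA worksheet."""
    component_identification: str
    functions: str
    failure_mode: str
    failure_mechanisms: str
    effect_on_system: str
    method_of_detection: str
    remarks: str = ""  # optional: unusual results or preventive/mitigating measures


def missing_entries(row: DfmeaRow) -> List[str]:
    """Names of required columns that are still blank in this row."""
    return [
        f.name
        for f in fields(row)
        if f.name != "remarks" and not getattr(row, f.name).strip()
    ]
```

Treating "Remarks" as optional is an assumption made here because the text describes it as a place for explanations rather than a required result.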
When reviewing a system for single failure vulnerabilities, a design engineer can
use the Design FMEA to evaluate the system design. However, the DFMEA
method calls for a detailed analysis of all functions and components: each
component in the system is listed, and the impact of its failure on the system is
evaluated for each failure mode. This approach is very detailed and produces
redundant reviews for systems with multiple channels or redundant components.
The use of redundancy in digital systems does create some additional challenges
with addressing common mode failures, and interdivisional or interchannel
communication impacts and dependencies. These challenges need to be
addressed in the failure analysis. The DFMEA failure analysis method can credit
redundant divisions or channels in a carte blanche manner, up front, when
developing the scope of the analysis. If the following criteria are satisfied, then
the scope of a Design FMEA can be reduced to a single redundancy in terms of
the components and interfaces that are analyzed:
Redundancy Boundary – The redundancy boundary denotes the set of
equipment where systems or components are identical and a single fault or
failure is contained in a single redundancy without adversely impacting the
overall function of the system. The analysis should be able to clearly identify
the extent to which divisions, channels or other redundancies are actually
redundant, vs. those systems, sub-systems or components that are not
redundant. For example, in a four division protection system, such as the one
illustrated in Figure 4-4, there are four distinct, separate and independent
divisions. On the other hand, a master/slave architecture such as the one
illustrated in Figure 4-5 will have limited redundancy, where there are
elements that are shared by both redundancies such as a single controlled
element (e.g., a control valve).
Dependencies – For system architectures that share data, signals or other
information between redundancies, the sharing of such data, signals or
information must be assessed to determine if any one redundancy is
dependent on one or more of the other redundancies in order to satisfy
functional or performance requirements, including behaviors that are
required to respond to faults and failures in the other redundancies. For
example, master/slave and triple modular redundant (TMR) architectures are
likely to require sharing of module status or signal information between
redundancies so that in the event of a fault or failure in one redundancy,
another one can detect the fault or failure and maintain adequate
functionality within specified performance requirements. In the event that
module or component status information or signals are shared, then the
Design FMEA must clearly describe this dependency and how it can be
credited for responding to each of the module or component failure modes
that are analyzed within a single redundancy.
Figure 4-4
Multi-Divisional System with Complete, Independent Redundancy
[Figure 4-5 diagram labels: two channels, each with Power, Inputs, Processor, and Output; Status; Controlled Element; Redundancy Boundary.]
Figure 4-5
Redundancy Boundary for a Master/Slave Architecture
Table 4-3
Sample Design FMEA Worksheet
Component Identification | Function(s) | Failure Modes | Failure Mechanisms | Effect on System | Method of Detection | Remarks
4.5 Design FMEA Examples
Example 4-2. HPCI-RCIC Turbine Controls Design FMEA (continued)
DFMEA Step 2: Draw a boundary around the components of interest.
Figure 4-6 shows an analysis boundary around the digital governor and digital
positioner. These components are of interest to the analysis in this example because
they form a digital upgrade that replaces obsolete equipment. The components
shown outside the boundary are original equipment that will remain as-is, outside
the scope of the plant change project. Nevertheless, some of the components outside
the boundary have interfaces that cross the boundary, and will have failure modes
that will be accounted for in the analysis.
DFMEA Step 3: Write a summary description.
Table 4-4, which meets the prerequisite for a Function Analysis (in this case at a
component level), provides a listing of the principal components shown in
Figure 4-6, and their functions. Summary descriptions are provided below:
HPCI Summary Description
The design basis function of the HPCI system is reactor inventory control to ensure
the reactor core is adequately cooled to limit maximum fuel cladding temperature
following a small-break loss-of-coolant-accident (LOCA) which does not rapidly
depressurize the reactor pressure vessel (10CFR50.46 Criterion 1). HPCI also
provides a reactor inventory control function following other initiating events such as
transients, stuck open safety relief valves (SRVs), medium-break LOCAs and
anticipated transient without scram (ATWS).
HPCI can be initiated manually, or it will initiate automatically via high drywell
pressure or Low-Low reactor water level. The maximum response time allowed to
achieve rated flow is 60 seconds in the design basis analysis, but can be much
longer and still be successful, particularly given best estimate assumptions and given
non-LOCA initiating events.
When the HPCI is initiated, the system initiation signal opens the turbine steam
admission valve. When the steam admission valve opens, a limit switch on the valve
changes state when the valve reaches 20% open, thus sending an enable signal to the
digital governor via one set of contacts and to the digital positioner via a second set
of contacts. The governor responds by sending a governor valve position demand
signal that ramps the turbine speed up to a preferred initial speed, then switches to
PID control in order to respond automatically to changes in system load. The
purpose of the ramp function is to enable controlled acceleration of the turbine and
avoid initial overspeed transients that may encroach upon the mechanical overspeed
trip limit. To support the initial response of the turbine when an enable signal is
received, the governor valve is preset to a partially open position.
The HPCI pump is a two stage component (booster pump + main pump), driven by a
single steam turbine. The pump takes suction from the condensate storage tank (CST)
until it reaches low level, then the suction source is switched to the suppression pool.
The pump supplies water to the reactor vessel via the feedwater line, or it can be
aligned in recirculation mode to discharge to the CST during surveillance tests. The
HPCI turbine is driven by Main Steam, which exhausts to the suppression pool after
leaving the turbine.
The HPCI turbine is automatically tripped on any HPCI isolation signal, high turbine
exhaust pressure, high reactor water level, low pump suction pressure, or
mechanical overspeed. The turbine can also be tripped manually from the Main
Control Room (MCR), the Remote Shutdown Panel (RSP), or locally at the turbine.
The turbine is tripped by closing the trip/throttle valve shown in Figure 4-6, thus
isolating the steam supply.
RCIC Summary Description
The design basis function of the RCIC system is reactor inventory control to provide
makeup water to the reactor vessel during reactor shutdown and isolation when the
main condenser and feedwater system are unavailable. RCIC also provides a
reactor inventory makeup function following initiating events such as non-isolation
transients and stuck open SRVs.
RCIC can be initiated manually, or it will initiate automatically via Low-Low reactor
water level. There is no automatic initiation of RCIC on drywell pressure. The design
basis maximum response time allowed to achieve rated flow is 60 seconds but can
be much longer and still be successful.
When RCIC is initiated, the system initiation signal opens the turbine steam
admission valve. When the steam admission valve opens, a limit switch on the valve
changes state when the valve reaches 20% open, thus sending an enable signal to
the digital governor via one set of contacts and to the digital positioner via a second
set of contacts. The governor responds by sending a governor valve position
demand signal that ramps the turbine speed up to a preferred initial speed, then
switches to PID control in order to respond automatically to changes in system load.
The purpose of the ramp function is to enable controlled acceleration of the turbine
and avoid initial overspeed transients that may encroach upon the mechanical
overspeed trip limit. To support the initial response of the turbine when an enable
signal is received, the governor valve is preset to a partially open position.
The RCIC pump is driven by a single steam turbine. The pump takes suction from the
condensate storage tank (CST) until it reaches low level, then the suction source is
switched to the suppression pool. The pump supplies water to the reactor vessel via
the feedwater line, or it can be aligned in recirculation mode to discharge to the
CST during surveillance tests. The RCIC turbine is driven by Main Steam which
exhausts to the suppression pool after leaving the turbine.
The RCIC turbine is automatically tripped on any RCIC isolation signal, high turbine
exhaust pressure, high reactor water level, low pump suction pressure, or
mechanical overspeed. The turbine can also be tripped manually from the MCR, the
RSP, or locally at the turbine. The turbine is tripped by closing the trip/throttle valve
shown in Figure 4-6, thus isolating the steam supply.
DFMEA Step 4: Prepare an FMEA worksheet for each device or component of
interest.
FMEA worksheets are provided in Table 4-5 for the digital governor and Table 4-6
for the digital positioner. In this example there are two FMEA worksheets,
differentiated by “subsystem” in the upper left corner, because there are two
components of interest in the analysis.
DFMEA Step 5: On each worksheet, identify the interfacing components, signals,
power supplies, and other interfaces that can affect the functions or performance of
the components of interest.
In this example, the results of Step 5 are shown in Tables 4-5 and 4-6 under the
column labeled “Component Identification.”
DFMEA Step 6: Determine the failure modes of each interfacing component,
signal, power supply or other interface.
In this example, the results of Step 6 are shown in Tables 4-5 and 4-6 under the
column labeled “Failure Modes.”
DFMEA Step 7: Determine the failure mechanisms associated with each failure
mode identified in Step 6.
In this example, the results of Step 7 are shown in Tables 4-5 and 4-6 under the
column labeled “Failure Mechanisms.”
DFMEA Step 8: Determine the resulting effects that each interfacing component
failure mode can have on the components of interest, and the resulting effects on the
system.
In this example, these effects are listed in Tables 4-5 and 4-6, in the column labeled
“Effects on System.”
DFMEA Step 9: Determine the methods of detection for each failure mode
identified in Step 6.
In this example, the methods of detection are listed in Tables 4-5 and 4-6 in the
column labeled “Method of Detection.” Note that for all of the identified failure
modes, the identified method of detection is “Periodic Test” (or audit) because at the
conceptual design stage shown in this example, there are no hardware or software
features that have been identified yet that can detect, mitigate and provide an
indication and/or alarm associated with the identified failure modes. Hardware
and/or software features that can provide methods of detecting the identified failure
modes would be expected to emerge later in the development phase of this example
project, and the FMEA worksheets would be updated accordingly.
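Because every failure mode in this example starts out as detectable only by "Periodic Test," it can help to track how that count changes as the worksheets are updated. The sketch below assumes each worksheet row is a dictionary whose "method_of_detection" entry is a comma-separated string; that layout and the function name are hypothetical.

```python
from typing import Dict, List


def periodic_test_only(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rows whose only listed method of detection is a periodic test."""
    selected = []
    for row in rows:
        methods = {m.strip().lower() for m in row["method_of_detection"].split(",") if m.strip()}
        if methods == {"periodic test"}:
            selected.append(row)
    return selected
```

Comparing `len(periodic_test_only(rows))` between worksheet revisions gives a rough measure of how many failure modes have gained an alarm or self-diagnostic as the design matures.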
DFMEA Step 10: Provide remarks.
In this example, the remarks listed in Tables 4-5 and 4-6 are centered on features
and functions that should be picked up in the design phase of the project; these
features and functions essentially become “defensive measures” against the
identified failure modes.
DFMEA Step 11: Analyze redundancies.
In this example, there are no redundant components in the system of interest.
However, in some BWRs, at the plant level, the HPCI and RCIC systems may provide
redundancy in terms of safety functions, in combination with other independent
systems (e.g., automatic depressurization). Therefore, the FMEA in this example is
complete, at least at the conceptual design phase. Tables 4-5 and 4-6 are equally
applicable to the conceptual design of the HPCI and RCIC turbine control systems.
DFMEA Step 12: Apply the results.
Because this example is constructed at the conceptual design stage of a hypothetical
project, the results of the preliminary FMEA in Tables 4-5 and 4-6 would be applied
in later phases. Therefore, the following Application Notes are provided for the
system designers, based in large part on the results in the “Remarks” column of the
FMEA worksheets.
Note that as this project would progress beyond the conceptual design phase, the
FMEA would be updated to reflect design details and methods of detection, and the
finished results would be validated. The final FMEA product would then be used to
support licensing activities and development or validation of periodic test
procedures, alarm management methods, maintenance procedures, and
troubleshooting and cause analysis guidance.
Application Notes
The following insights were obtained from the FMEA “Remarks” column in
Tables 4-5 and 4-6 for later use in developing the detailed hardware design and application
logic for the governor and positioner components. In addition, the FMEA should be
updated during the design phase of the project to account for changes in
component or system level effects as these application notes are factored into the
design, and ultimately the finished FMEA should be validated in the project test
phases (e.g., FAT, SAT or Post-Installation) to confirm the analytical results.
It is expected that the detailed control system design and the application logic be
modified as needed to detect, mitigate or eliminate the undesired effects currently
described in Tables 4-5 and 4-6. The number of failure modes currently detectable
only by periodic test should be reduced as much as practically achievable through
the use of signal validation and alarm methods.
1. Provide signal validation methods in the application logic for the governor and
positioner. Signal validation methods can include:
a. Out of Range Checks (where analog signals present less than a “live zero”
such as 4.0 mADC or 1.0 VDC, or greater than calibrated span such as
20.0 mADC or 5.0 VDC)
b. Median Select (provide three redundant signals, select the middle signal)
c. High Rate of Change (determine the maximum credible rate of change of a
signal, in units such as %/second, and design a simple filter or rate detection
algorithm that allows a signal to pass through if its rate of change is less
than the maximum credible rate)
2. Provide indications and alarms associated with the failure modes where alarms
are described in the Remarks column in Tables 4-5 and 4-6. A general trouble
alarm may suffice for the governor and positioner (each), as long as a local or
remote indication is provided for determining the failure mode that caused the
alarm. Alarms should be provided by the governor and positioner to the plant
annunciator system in the main control room, via contact or solid state outputs.
3. The taxonomy sheets in Appendix B of this guideline were used to inform the
FMEA worksheets. Numerous internal and external defensive measures are
potentially available as described in the taxonomy sheets, and should be
assessed and included in the final design. Internal defensive measures are those
that are implemented within the components of interest, such as memory integrity
test features that could be embedded within the operating system of the
governor. External defensive measures are those that are implemented outside
the component of interest, such as an input signal validation algorithm
implemented within the positioner in order to detect and alarm misbehaving
governor output signals.
4. Apply security controls described in NEI 08-09 or RG 5.71. The governor and
positioner are critical digital assets that are required to meet the cyber security
rule (10 CFR 73.54).
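The three signal validation methods named in item 1 can be sketched as follows. This is a minimal illustration in Python, not governor or positioner vendor logic. The function names, the 4 to 20 mADC default range, and the choice to hold the last accepted value when the rate check fails are assumptions made for this example.

```python
from statistics import median


def out_of_range(value_ma, live_zero=4.0, span_max=20.0):
    """Return True when an analog signal is below live zero or above span."""
    return value_ma < live_zero or value_ma > span_max


def median_select(a, b, c):
    """Select the middle of three redundant signals."""
    return median([a, b, c])


def rate_check(previous, current, dt_s, max_rate_per_s):
    """Pass the new sample only if its rate of change is credible.

    Returns (accepted_value, rate_alarm). Holding the previous value on a
    violation is an assumption of this sketch, not a requirement of the text.
    """
    rate = abs(current - previous) / dt_s
    if rate < max_rate_per_s:
        return current, False
    return previous, True


# Hypothetical samples: one failed-low input, one outlier among three
# redundant signals, and one step change of 40% in 0.1 s against a
# 100 %/s credible limit.
print(out_of_range(3.2))
print(median_select(12.1, 19.9, 12.0))
print(rate_check(50.0, 90.0, 0.1, 100.0))
```

In each case the validation result would also drive the trouble alarm described in item 2, so that a rejected signal is announced rather than silently replaced.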
Table 4-4
Principal HPCI/RCIC Turbine Control Components and Functions
Component | Functions
Main Control Room and Remote Shutdown Panel Flow Indicating Controllers | 1. Provide automatic speed demand output to governor on converting a fixed setpoint flow to a turbine speed. 2. Provide manual speed demand output to governor as set by operator. 3. Provide indications of flow setpoint, actual flow, and % output.
Hand switch (HS) | Switch speed demand signal from MCR or RSP M/A stations
Limit Switch (LS) | Provide enable signal to governor and positioner when steam admission valve position is > 20% open
Governor | Provide automatic governor valve position demand signal to digital positioner to compensate for error between actual turbine speed (from MPU) and demanded turbine speed (from M/A stations)
Magnetic Pickup (MPU) | Provide actual turbine speed signal to governor
24 VDC Power | Provide clean, filtered 24 VDC power to governor and positioner
Governor Program Interface | Provide a port for connecting a programming device to enable configuration changes and configuration audits
Positioner | Provide automatic governor valve position signal to actuator to compensate for error between actual governor valve position (from resolver feedback) and demanded valve position (from governor)
Resolver Feedback | Provide actuator stem position signal to positioner (actuator stem is coupled directly to governor valve stem)
Actuator | Position the governor valve to the position demanded by the positioner
Governor Valve | Throttle the steam supplied to the turbine
Trip/Throttle Valve | Isolate the steam supply to the turbine when a turbine trip signal is received
Steam Admission Valve | Admit steam to the turbine when a system initiation signal is received
[Figure 4-6 block diagram. Legend: FIC: Flow Indicating Controller; MCR: Main Control Room; RSP: Remote Shutdown Panel; PID: Proportional/Integral/Derivative; HS: Handswitch. Labeled elements: Analysis Boundary, MCR FIC and RSP FIC (PID, Flow Setpoint, RCIC: 500 gpm; HPCI: 5000 gpm), HS, Speed Demand, Governor (PID, Program Interface, 24 VDC), Position Demand, Positioner (PID), Enable Signal, System Initiation, LS, Actuator, Resolver Feedback, Magnetic PickUp (MPU), Governor Valve, Trip/Throttle Valve, Steam Admission Valve, steam from Main Steam, flow to Reactor from Torus or Condensate Storage Tank.]
Figure 4-6
HPCI/RCIC Turbine Control System Block Diagram
Table 4-5
HPCI/RCIC Governor Design FMEA Worksheet
Component Identification | Function(s) | Failure Modes | Failure Mechanisms | Effect on System | Method of Detection | Remarks
Magnetic Pickup (MPU) | Provide actual turbine speed signal to governor | Output Fails Offscale High | Ringing, double or triple counting | Turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Periodic Test | 1. Include signal validation in the governor application logic. 2. Provide MCR and RSP alarm connection to governor.
Magnetic Pickup (MPU) | (same) | Output Fails Offscale Low | Mounting failure (falls off) | Turbine overspeeds, trips on high reactor level or mechanical overspeed | Periodic Test | (same as previous row)
Magnetic Pickup (MPU) | (same) | Excessive Drift | Degradation | Indeterminate; depends on magnitude and direction of drift | Periodic Test | 1. Consider triple MPUs, use signal validation to select best one. 2. Provide MCR and RSP alarm connection to governor.
24 VDC Power | Provide clean, filtered 24 VDC power to digital governor and digital positioner | Voltage below specification | Failed power source (battery, charger, bus, voltage regulator) | Governor stops, outputs go to shelf state, turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Periodic Test | 1. Provide power loopback to analog input. 2. Provide MCR and RSP alarm connection to governor.
24 VDC Power | (same) | Voltage above specification | Failed voltage regulator | Governor overvoltage protection causes it to stop, outputs go to shelf state, turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Periodic Test | (same as previous row)
Table 4-6
HPCI/RCIC Positioner Design FMEA Worksheet
Component Identification | Function(s) | Failure Modes | Failure Mechanisms | Effect on System | Method of Detection | Remarks
Governor | Provide automatic governor valve position demand signal to digital positioner to compensate for error between actual turbine speed and demanded turbine speed | Output Fails Offscale High | Saturated output circuit | Turbine overspeeds, trips on high reactor level or mechanical overspeed | Periodic Test | 1. Provide multiple outputs of the position demand signal from governor to positioner. 2. Include signal validation in the positioner application logic. 3. Provide MCR and RSP alarm connection to positioner.
Governor | (same) | Output Fails Offscale Low | Governor failure or loss of power to Governor | Turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Periodic Test | (same as previous row)
Governor | (same) | Output Fails As-Is | Governor lockup via HW or SW defect | Indeterminate; depends on fail as-is value - likely to result in reactor overfill or underfill, followed by turbine trip | Periodic Test | 1. Ensure governor is supplied with a HW-based watchdog timer that sets outputs to preferred state. 2. Provide MCR and RSP alarm connection to positioner.
Resolver Feedback | Provide actuator stem position signal to positioner (actuator stem is coupled directly to governor valve stem) | Output Fails Offscale High | Resolver circuit failure (internal to actuator) | Turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Periodic Test | 1. Include signal validation in the governor application logic. 2. Provide MCR and RSP alarm connection to positioner.
Resolver Feedback | (same) | Output Fails Offscale Low | Loss of power to actuator | Turbine overspeeds, trips on high reactor level or mechanical overspeed | Periodic Test | (same as previous row)
Resolver Feedback | (same) | Inaccurate signal | Resolver circuit degradation | Indeterminate; depends on magnitude and direction of error | Periodic Test |
Resolver Feedback | (same) | Failed mechanical connection between actuator and governor valve | Wear, corrosion, or fatigue at connection point | Governor valve returns to spring-closed position, less than adequate HPCI or RCIC flow | Periodic Test |
24 VDC Power | Provide clean, filtered 24 VDC power to digital governor and digital positioner | Voltage below specification | Failed power source (battery, charger, bus, voltage regulator) | Positioner stops, outputs go to shelf state, turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Periodic Test | 1. Provide power loopback to analog input. 2. Provide MCR and RSP alarm connection to positioner.
24 VDC Power | (same) | Voltage above specification | Failed voltage regulator | Positioner overvoltage protection causes it to stop, outputs go to shelf state, turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Periodic Test | (same as previous row)
Example 4-3. Circ Water System Controls Design FMEA
DFMEA Step 1: Draw a block diagram of the system of interest.
Figure 4-7 provides a block diagram of a hypothetical Distributed Control System
(DCS) functional segment allocated to Circulating Water System (CWS) control
functions.
The DCS architecture in this example is based on a brief search of publicly available
information on more complex, non-1E DCS system architectures, resulting in the
selection of certain features of various DCS architectures in use today. Most non-
safety DCS architectures include several functional segments. This example examines
a Circulating Water segment in isolation because it is sufficiently complex and
functionally isolated from other segments to reveal insights.
DFMEA Step 2: Draw a boundary around the components of interest.
Figure 4-7 shows an analysis boundary around two “divisions” of logic and I/O
cabinets.
DFMEA Step 3: Write a summary description.
Table 4-7, which meets the prerequisite for a Function Analysis (in this case at the
component level), provides a listing of the principal components shown in Figures 4-7
and 4-8, and their functions. A summary description is provided below:
CWS System Description
The circulating water system (CWS) under investigation supplies cooling water to
remove heat from the main condensers, under varying conditions of power plant
operation and site environmental conditions.
The CWS does not have a safety-related function and has no safety design basis.
The power generation design basis of the CWS is to remove heat load during
startup, normal shutdown, transient condition, or turbine trip (when a portion of the
main steam is bypassed to the main condenser via the turbine bypass valves).
See the lower portion of Figure 4-7. The CWS draws water from the cooling tower
basins, and returns water to the CWS cooling tower basins after passing through the
main condenser.
The CWS supplies cooling water at the specified flow rate to condense the steam in
the condenser. The CWS is automatically isolated in the event of gross leakage into
the turbine building (TB) condenser area to prevent flooding of the Turbine Building.
The CWS is designed such that a failure in a CWS component (piping, cooling
tower, expansion joint, pump, etc.) does not have a detrimental effect on any safety-
related equipment.
The CWS is composed of six 25%-capacity circulating water pumps and two
cooling towers (each with their own basin). Other typical CWS components, such as
make-up pumps, waterbox isolation valves, and cooling tower fans are omitted from
this paper because they have no bearing on the analysis. During normal operations
at 100% power, two pumps are running in each basin, with one pump on standby
in each basin.
The circulating water pumps are located in the cooling tower basins, take suction
from the basin, and pump water through the main condenser under varying plant
loads and design basis weather conditions. The cooling towers are each sized for
75% of normal power operation load. The discharge pipe from each of the
circulating water pumps is connected to a common pipe that delivers water to each
condenser. The discharge pipe from each pump is equipped with a Motor Operated
Valve (MOV) to enable isolation. The isolation MOV prevents backflow through its
associated pump when it is idle.
Basic DCS Design
The basic design of the non-1E Distributed Control System (DCS) includes two sets of
logic cabinets (A & B), two sets of I/O cabinets (A & B) and a set of human-system
interface (HSI) workstations. All of the cabinets and workstations are connected to
redundant data communication busses (Comm 1 and Comm 2).
The upper portion of Figure 4-7 illustrates this basic DCS architecture via the
segment that monitors the CWS pumps and controls their discharge valves. Other
DCS segments associated with other non-1E functions are omitted for clarity.
The DCS monitors and controls non-1E equipment using a master/slave controller
architecture. In Figure 4-7, the master controller for all 6 MOVs is shown in Logic
Cabinet A. The controller in Logic Cabinet B is in “slave” mode, following the status
of the Master controller, and is able to take control of the MOVs in the event of a
failure of the Master. Logic Cabinets A & B are located in an equipment room
adjacent to the Main Control Room.
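The master/slave arrangement described above can be sketched as follows. The text does not state how the slave detects a failed master, so the heartbeat timeout used here is an assumption, as are the class names, attribute names, and the one-second default. The sketch is illustrative only and does not represent any particular DCS platform.

```python
class Controller:
    def __init__(self, name, role):
        self.name = name
        self.role = role            # "master", "slave", or "failed"
        self.mov_commands = {}      # last commanded state per MOV


class RedundantPair:
    """Two controllers, each programmed to control all six MOVs."""

    def __init__(self, timeout_s=1.0):
        self.a = Controller("Logic Cabinet A", "master")
        self.b = Controller("Logic Cabinet B", "slave")
        self.timeout_s = timeout_s          # assumed heartbeat timeout
        self.last_heartbeat_s = 0.0

    def master_cycle(self, now_s, commands):
        """Master executes the logic; the slave follows the master's status."""
        self.a.mov_commands = dict(commands)
        self.b.mov_commands = dict(commands)
        self.last_heartbeat_s = now_s

    def slave_cycle(self, now_s):
        """Slave takes control of the MOVs if the master heartbeat is stale."""
        stale = now_s - self.last_heartbeat_s > self.timeout_s
        if self.b.role == "slave" and stale:
            self.a.role = "failed"
            self.b.role = "master"
        return self.b.role


pair = RedundantPair(timeout_s=1.0)
pair.master_cycle(0.0, {"MOV-1": "open"})
print(pair.slave_cycle(0.5))   # master heartbeat still fresh
print(pair.slave_cycle(2.0))   # no heartbeat for 2 s, slave takes over
```

Because the slave tracks the last commanded state, a takeover in this sketch does not by itself change any MOV position.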
I/O Cabinets A & B are each remotely located in a separate, secure and
environmentally controlled structure near each cooling tower, some distance
from the Main Control Room. I/O Cabinet A contains digital input modules
that monitor the position of the 4 kV breakers that provide power to the motors for
Pump-1 thru Pump-3, and digital output modules that position MOV-1 thru MOV-3
(open or closed). Likewise, I/O Cabinet B provides the same functions for Pump-4
thru Pump-6 and MOV-4 thru MOV-6. Note that the digital DCS equipment does not
control the CWS pumps; this function is allocated to HS-1 and the 4 kV switchgear.
The application logic for opening or closing MOV-1 thru MOV-6 runs in each
controller in Logic Cabinets A & B. Figure 4-8 illustrates this logic for MOV-1, but is
typical for all six MOVs. Please note that the logic in a typical CWS pump and
valve design is more complex than shown in Figure 4-8. It is simplified here to
provide a reasonably sufficient demonstration of the Design FMEA method on a DCS
segment.
CWS Pump-1 Functional Sequences
To further describe the CWS Pump controls, the following functional sequences are
helpful.
The following sequence will initiate operation of CWS Pump-1:
a. An operator at one of the two HSI workstations will select MOV-1 and command
it to open
b. An “open” command will be included in a message that passes between the HSI
workstation and Logic Cabinet A via the COMM1 and COMM2 busses
c. The application software in the Master Controller will send the command to
DO1 in I/O Cabinet A through the COMM1 and COMM2 busses
d. Digital Output 1 (DO1) will close
e. Relay R1 will energize
f. Contact R1-1 will close
g. MOV-1 will move in the open direction until both limit switches LS1 and LS2
open (note that Figure 4-8 shows MOV-1 already in the open position)
h. Because HS1 is spring-return-to-auto, contact HS1-1 is normally closed
i. Limit switch LS5 will close when MOV-1 reaches 20% open (upon opening)
j. The Close coil in the 4 kV switchgear for Pump-1 will energize and contact C1
will seal-in
k. Pump-1 will start
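The start sequence above can be sketched as a small simulation. This is an illustration under stated assumptions, not the actual circuit: the valve travel is modeled in fixed percentage steps, HS1-1 is treated as closed whenever HS1 rests in AUTO, and the function and variable names are hypothetical.

```python
def start_sequence(hs1_in_auto=True, step_pct=10):
    """Simulate Pump-1 start after an MOV-1 "open" command.

    Returns a list of (mov_position_pct, pump_running) samples.
    """
    do1_closed = True            # steps c-d: the open command closes DO1
    r1_energized = do1_closed    # step e: relay R1 energizes
    position = 0                 # assumed starting point: MOV-1 fully closed
    c1_sealed_in = False
    trace = []
    while r1_energized and position < 100:
        position = min(100, position + step_pct)   # step g: MOV travels open
        ls5_closed = position >= 20                # step i: LS5 closes at 20%
        hs1_1_closed = hs1_in_auto                 # step h: assumed contact state
        if ls5_closed and hs1_1_closed:
            c1_sealed_in = True                    # step j: Close coil, C1 seals in
        trace.append((position, c1_sealed_in))     # step k: pump runs once sealed in
    return trace


for position, running in start_sequence()[:3]:
    print(position, running)
```

The sketch makes the interlock explicit: the pump breaker closes only after LS5 reports that MOV-1 is at least 20% open, and it stays closed through the seal-in contact as the valve continues to travel.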
In the event of a trip of CWS Pump-1, the following sequence will occur:
a. The Trip coil in the 4 kV switchgear will energize and contact T1 will seal-in
(either due to an automatic pump trip signal, such as overcurrent protection, or
manually through use of HS1)
b. The breaker for Pump-1 will open
c. Contact T2 will close (indicating that the trip coil is energized and the pump
breaker is open)
d. Pump-1 will stop
e. Digital Input 1 in I/O Cabinet A will sense that contact T2 is closed
f. Messages passing from I/O Cabinet A to Logic Cabinet A via the COMM1 and
COMM2 busses will include data indicating that contact T2 is closed (thus
indicating Pump-1 is “Off”)
g. The application software in the Master Controller will register the status of Pump-
1 and will automatically initiate a “close” command to MOV-1
h. The close command will be included in messages from Logic Cabinet A to I/O
Cabinet A
i. Digital Output 1 (DO1) will open
j. Relay R1 will de-energize
k. Contact R1-2 will close
l. MOV-1 will move in the closed direction until limit switches LS3 and LS4 open
(MOV closed)
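The automatic closure of MOV-1 on a pump trip can be sketched as scan-based application logic. Figure 4-8 is described as simplified and its exact logic is not reproduced here, so this sketch makes its own assumptions: the automatic close is triggered on the transition of contact T2 from open to closed, operator commands are otherwise latched, and the class and method names are hypothetical.

```python
class Mov1Logic:
    """Illustrative Master Controller logic for MOV-1 (not the Figure 4-8 logic)."""

    def __init__(self, t2_closed=True):
        self.do1_closed = False            # DO1 closed drives MOV-1 open via R1-1
        self._t2_was_closed = t2_closed    # pump assumed "Off" at start

    def scan(self, t2_closed, operator_command=None):
        if t2_closed and not self._t2_was_closed:
            self.do1_closed = False        # steps g-i: automatic close command
        elif operator_command == "open":
            self.do1_closed = True
        elif operator_command == "close":
            self.do1_closed = False
        self._t2_was_closed = t2_closed
        r1_energized = self.do1_closed     # step j: R1 follows DO1
        return {"DO1": self.do1_closed,
                "R1-1": r1_energized,      # closed to drive MOV-1 open
                "R1-2": not r1_energized}  # step k: closed to drive MOV-1 closed


logic = Mov1Logic()
logic.scan(t2_closed=True, operator_command="open")   # operator opens MOV-1
logic.scan(t2_closed=False)                           # Pump-1 running
print(logic.scan(t2_closed=True))                     # Pump-1 trips
```

Triggering on the transition, rather than on the state of T2, is what allows an operator to open MOV-1 while the pump is still off. Whether the real application logic does this is not stated in the text.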
In the event of a manually commanded closure of MOV-1 from one of the HSI
workstations, the following sequence will occur:
a. An operator will select MOV-1 and command it to close
b. A “close” command will be included in a message that passes between the HSI
workstation and Logic Cabinet A via the COMM1 and COMM2 busses
c. The application software in the Master Controller will send the command to
DO1 in I/O Cabinet A through the COMM1 and COMM2 busses
d. Digital Output 1 (DO1) will open
e. Relay R1 will de-energize
f. Contact R1-2 will close
g. MOV-1 will move in the closed direction until limit switches LS3 and LS4 open
(MOV closed)
h. Limit switch LS6 will close
i. The Trip coil in the 4 kV switchgear will energize and contact T1 will seal-in
j. Contact T2 will close
k. Pump-1 will stop
DFMEA Step 4: Prepare an FMEA worksheet for each device or component of
interest.
Design FMEA worksheets are provided in Table 4-8 for I/O Cabinet A, Table 4-9
for Logic Cabinet A, and Table 4-10 for the HSI Workstations.
In this example there are three FMEA worksheets, differentiated by “subsystem” in
the upper left corner, because there are three subsystems of interest in the analysis.
DFMEA Step 5: On each worksheet, identify the interfacing components, signals,
power supplies, and other interfaces that can affect the functions or performance of
the components of interest.
In this example, the results of Step 5 are shown in Tables 4-8, 4-9, and 4-10 under
the column labeled “Component Identification.”
DFMEA Step 6: Determine the failure modes of each interfacing component,
signal, power supply or other interface.
In this example, the results of Step 6 are shown in Tables 4-8, 4-9, and 4-10 under
the column labeled “Failure Modes.”
DFMEA Step 7: Determine the failure mechanisms associated with each failure
mode identified in Step 6.
In this example, the results of Step 7 are shown in Tables 4-8, 4-9, and 4-10 under
the column labeled “Failure Mechanisms.”
DFMEA Step 8: Determine the resulting effects that each interfacing component
failure mode can have on the components of interest, and the resulting effects on the
system.
In this example, these effects are listed in Tables 4-8, 4-9, and 4-10, in the column
labeled “Effect on System.”
DFMEA Step 9: Determine the methods of detection for each failure mode
identified in Step 6.
In this example, the methods of detection are listed in Tables 4-8, 4-9, and 4-10 in
the column labeled “Method of Detection.” Note that hardware or software features
have been identified that can detect, mitigate and provide an indication and/or
alarm associated with the identified failure modes.
DFMEA Step 10: Provide remarks.
In this example, the remarks listed in Tables 4-8, 4-9, and 4-10 are centered on
typical alarms and indications that would be provided in a DCS segment such as the
one described in this example. They are omitted from the logic shown in Figure 4-8
for brevity.
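The worksheet columns populated in Steps 5 through 10 can be represented as a simple record, which makes it straightforward to check that no column has been left empty. The sketch below is one possible representation, not a format prescribed by this guideline; the class name, field names, and the completeness check are assumptions. The example row is taken from Table 4-8.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorksheetRow:
    component: str                                                   # Step 5
    function: str
    failure_mode: str                                                # Step 6
    failure_mechanisms: List[str] = field(default_factory=list)     # Step 7
    effects_on_system: List[str] = field(default_factory=list)      # Step 8
    methods_of_detection: List[str] = field(default_factory=list)   # Step 9
    remarks: List[str] = field(default_factory=list)                # Step 10

    def missing_columns(self) -> List[str]:
        """Names of the Step 7 to 9 columns that are still empty."""
        checks = {
            "Failure Mechanisms": self.failure_mechanisms,
            "Effect on System": self.effects_on_system,
            "Method of Detection": self.methods_of_detection,
        }
        return [name for name, values in checks.items() if not values]


row = WorksheetRow(
    component="4 kV Switchgear Pump-1",
    function="Provide contact closure input, via dry contact T2, to DI1",
    failure_mode="Switchgear contact T2 fails open (with breaker closed)",
    failure_mechanisms=["Contact failure"],
    effects_on_system=["False indication of Pump-1 ON"],
    methods_of_detection=["Conflicting indications"],
    remarks=['Develop alarm logic for "pump on" AND "MOV-1 closed"'],
)
print(row.missing_columns())
```

A row that reports missing columns has not yet completed Steps 7 through 9 and would be a candidate for review before the results are applied in Step 12.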
DFMEA Step 11: Analyze redundancies.
In this example, Tables 4-8, 4-9, and 4-10 provide sufficient information for a
Design FMEA because each redundancy is identical, and meets the criteria for
analyzing a single redundancy described in Section 4.4.
DFMEA Step 12: Apply the results.
The results of this example could be used to verify adequate coverage of equipment
failure modes; verify expected alarms and indications of failures; validate the results
during a FAT, SAT, or Post-Mod Test activity; and update operations procedures and
alarm response guides as needed. The following application notes are also
considered:
DFMEA Application Notes
The following insights were obtained from the FMEA “Remarks” column, and should
be assessed for possible inclusion in any planned modifications to the CWS control
system.
1. Typical alarms and associated logic are assumed to be implemented within the
DCS to annunciate loss of one or more modules.
2. It is assumed that adequate time is available for an operator to recognize and
respond to a loss of one CWS pump with a manual action before the turbine
trips on low condenser vacuum.
3. For a failed digital input module, such as DI1 (see Table 4-8, Sheet 2 of 3),
alarm logic should be developed for the case of conflicting indications such as
“pump on” concurrent with “MOV closed.”
4. The taxonomy sheets in Appendix B of this guideline were used to inform the
FMEA worksheets. Numerous internal and external defensive measures are
potentially available as described in the taxonomy sheets, and should be
assessed and included in the final design. Internal defensive measures are those
that are implemented within the components of interest, such as memory integrity
test features that could be embedded within the operating system of the
governor. External defensive measures are those that are implemented outside
the component of interest, such as an input signal validation algorithm
implemented within the positioner in order to detect and alarm misbehaving
governor output signals.
5. Apply security controls described in NEI 08-09 or RG 5.71. The components in
the CWS control system are critical digital assets that are required to meet the
cyber security rule, 10 CFR 73.54.
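The conflicting-indication alarm suggested in application note 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the MOV closed indication is a hypothetical input (the text does not identify which limit switch would supply it), and the persistence delay is an assumption added to ride through normal valve travel. The class and parameter names are hypothetical.

```python
class ConflictAlarm:
    """Alarm on "pump on" concurrent with "MOV closed" after a delay."""

    def __init__(self, delay_s=5.0):
        self.delay_s = delay_s     # assumed persistence time before alarming
        self._since_s = None       # time at which the conflict was first seen

    def update(self, now_s, t2_closed, mov_closed):
        pump_on = not t2_closed    # pump status as sensed by DI1 via contact T2
        if pump_on and mov_closed:
            if self._since_s is None:
                self._since_s = now_s
            return now_s - self._since_s >= self.delay_s
        self._since_s = None       # indications agree again, reset the timer
        return False


alarm = ConflictAlarm(delay_s=5.0)
print(alarm.update(0.0, t2_closed=False, mov_closed=True))   # conflict just began
print(alarm.update(5.0, t2_closed=False, mov_closed=True))   # conflict persisted
```

With a failed digital input that falsely reports contact T2 open, the pump appears to be on while MOV-1 is closed, and this logic would raise the alarm once the condition persists.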
[Figure 4-7 diagram. An ANALYSIS BOUNDARY encloses Logic Cabinet A (MASTER CONTROLLER) and Logic Cabinet B (SLAVE CONTROLLER), each connected to COMM 1 and COMM 2, with each controller programmed to control all six valves (Master/Slave), together with two I/O cabinets, each containing modules DI1 to DI3 and DO1 to DO3. Also labeled: 4 KV, the MOV motors (M), COOLING TOWER A with PUMP-1 to PUMP-3 and MOV-1 to MOV-3, and COOLING TOWER B with PUMP-4 to PUMP-6 and MOV-4 to MOV-6. Normal Operation: two valves open in each basin.]
Figure 4-7
Circulating Water System DCS Segment
[Figure 4-8 diagram. Labeled elements: 4KV Switchgear Pump-1 (Control Power, contacts HS1-1, HS1-2, LS5, LS6, C1, T1, T2; C: Close Coil; T: Trip Coil); handswitch HS1 contact table (positions TRIP, AUTO, CLOSE); contact T2 wired to I/O Cabinet A, DI-1; ANALYSIS BOUNDARY enclosing I/O Cabinet A, Logic Cabinet A (MASTER CONTROLLER, with logic inputs Manual OPEN, Pump 1 OFF, Manual CLOSE), HSI 1 and HSI 2; COMM 1 and COMM 2 to/from other cabinets; MOV-1 Control Circuit (Control Power, C, O, Stop).]
Figure 4-8
CWS MOV Control Circuit & Logic
Table 4-7
Principal CWS Components and Functions
Component | Functions
4 kV Switchgear Pump-1 | 1. Connect or disconnect 4 kV electric power to the terminals on the motor that drives Pump-1. 2. Provide contact closure input, via dry contact T2, to digital input DI1 in I/O Cabinet A (contact T2 is closed when the Trip Coil is energized).
Relay R1 | Interface between Logic Cabinet A, DO1, and the control circuit for MOV-1
Control Power | Provide clean, filtered power to the coil of relay R1 when DO1 is closed
Instrument AC power | Provide clean, filtered, redundant 120 VAC power to the DCS cabinets (internal cabinet power supplies not shown)
Slave Controller | Execute the application software logic (including takeover if the Master Controller fails)
Table 4-8
CWS I/O Cabinet A FMEA Worksheets
Component
Function(s) Failure Modes Failure Mechanisms Effect on System Method of Detection Remarks
Identification
1. Loss of Pump-1
1. Faulty protection devices
2. Automatic closure of
or circuits
Inadvertent MOV-1 (if not closed)
2. Operator error
trip 3. Operator opens standby
3. Spurious closure of MOV-
MOV and associated pump
1 (induced by PCS failure)
starts
1. False indication of
Pump-1 OFF
2. MOV-1 closes
1. Faulty switchgear 3. Pump-1 deadheads 1. Typical alarms and associated
Spurious interlocks against closed MOV-1 logic not shown in Figures 1 and
1. Connect or disconnect 4 Kv closure 2. MOV-1 Limit Switch LS5 4. Overload protection trips 2 for simplicity
fails closed Pump-1 breaker 1. Indications on HSI1
electric power to the terminals 2. Assume adequate time
5. Operator opens standby and HSI2
on the motor that drives Pump- available for operator to initiate
MOV and associated pump 2. Alarms
1 operation of standby pump
4 Kv Switchgear
2. Provide contact closure input, starts before turbine trip on low
Pump 1
via dry contact T2, to digital vacuum
1. False indication of
input DI1 in I/O Cabinet A
Pump-1 OFF
(contact T2 is closed when the
2. MOV-1 closes
Trip Coil is energized)
Switchgear contact 3. Pump-1 deadheads
T2 fails closed 1. Debris against closed MOV-1
(with breaker 2. Contact short 4. Overload protection trips
open) Pump-1 breaker
5. Operator opens standby
MOV and associated pump
starts
Switchgear contact
1. False indication of Develop alarm logic for "pump
T2 fails open (with 1. Contact failure Conflicting indications
Pump-1 ON on" AND "MOV-1 closed"
breaker closed)
False indication of
contact T2 open 1. False indication of Develop alarm logic for "pump
Internal failure mechanism Conflicting indications
(when actually Pump-1 ON on" AND "MOV-1 closed"
closed)
1. False indication of
Pump-1 OFF 1. Typical alarms and associated
2. MOV-1 closes logic not shown in Figures 5-2
Loss of control 3. Pump-1 deadheads and 5-3 for simplicity
Provide control power to the 1. Indications on HSI1
power 1. Breaker opens on fault against closed MOV-1 2. Assume adequate time
Control Power coil of relay R1 when DO1 is and HSI2
(with MOV-1 2. Inadvertent breaker trip 4. Overload protection trips available for operator to initiate
closed 2. Alarms
open) Pump-1 breaker operation of standby pump
5. Operator opens standby before turbine trip on low
MOV, associated pump vacuum
starts
Table 4-9
CWS Logic Cabinet A FMEA Worksheets
Component
Function(s) Failure Modes Failure Mechanisms Effect on System Method of Detection Remarks
Identification
1. CPU Halt
Controller
2. CPU Crash
Lockup
3. Stopped internal clock
Table 4-10
CWS HSI Workstation FMEA Worksheets
Component
Function(s) Failure Modes Failure Mechanisms Effect on System Method of Detection Remarks
Identification
1. Loss of heartbeat
1. Design flaw
Workstation locks 1. HSI2 still available signal Typical alarms and associated logic
2. Mfg. defect
up 2. Controllers not affected 2. Alarm not shown in Figures 5-2 and 5-3
3. Bit error
(display freeze) 3. No effect on system 3. Conflicting for simplicity
4. Overheating
indications
1. Design flaw
2. Mfg. defect
3. Bit error
4. Failed 1. HSI2 still available
Workstation
connection(s) 2. Controllers not affected 1. Blank display
1. Provide indications and an shutdown
5. Overheating 3. No effect on system
operator interface for manually 6. Power supply
controlling plant components failure
connected to the plant control 7. Power supply dip
HSI1
system
2. Send and receive data to/from
PCS controllers via COMM1 or
COMM2 Erroneous MOV-1 1. Operator error Assuming HSI displays are
1. MOV-1 opens Indications on HSI1
open command 2. Logic error programmed to display pump
2. Pump-1 starts and HSI2
(when closed) 3. Bit error status as sensed by DI1
1. MOV-1 closes
1. Typical alarms and associated
2. Pump-1 deadheads
logic not shown in Figures 5-2 and
against closed MOV-1
Erroneous MOV-1 1. Operator error 1. Indications on HSI1 5-3 for simplicity
3. Overload protection trips
close command 2. Logic error and HSI2 2. Assume adequate time available
Pump-1 breaker
(when open) 3. Bit error 2. Alarms for operator to initiate operation of
4. Operator opens standby
standby pump before turbine trip
MOV, associated pump
on low vacuum
starts
4.6 Applying the FMEA Results
The FFMEA and DFMEA processes and results can be used in support of the
following activities:
Platform Development
The FMEA results can be used by the equipment vendor to improve
component designs through the platform or component development lifecycle
process, and ultimately support calculations that demonstrate equipment
reliability claims.
Application Development
The DFMEA process should be applied on the digital system. The level of detail
in the FMEA should be driven down to the individual components that make up
the system. Appendix B includes taxonomy sheets for typical components found
in digital I&C systems.
The FMEA results can then be used by the integrator to improve system designs
through the application development lifecycle process. The conceptual design
phase of the lifecycle process should include a preliminary hazards analysis, which
can take the form of a preliminary FMEA, such as the one described in Example
4-1. A preliminary FMEA should be used to identify and reduce or eliminate
potential vulnerabilities in the system as the design activities progress. Some
vulnerabilities may be eliminated or mitigated to a reasonable extent through one
or more defensive measures that are realized through design requirements and/or
plant programs and processes. For guidance on applying defensive measures in
digital I&C systems, see References 20 and 21.
The FMEA should be updated through the design process, or when the design is
complete, to reflect the finished design at an appropriate application baseline. For
guidance on determining baselines, see EPRI 1022991 (Reference 18). Note that
the FMEA should reflect the design details (e.g., all interfaces), but it should still
reflect postulated failure modes and mechanisms, even if the detailed design can
demonstrate that the likelihood of some failure modes is reasonably low.
The finished FMEA should be validated, at least to the extent that failure
mechanisms can be tested without extraordinary conditions or destructive
methods, in the test phase of the application development lifecycle. FMEA
validation test cases can be executed at the Factory Acceptance Test (FAT), Site
Acceptance Test (SAT) or during post-installation testing. Additional guidance
on testing is provided in Reference 32.
Case Study 4-1. Inadequate Configuration Control and Testing
An operating plant purchased a digital rod control system from a third-party
integrator. The system was equipped with control rod drive motor power supplies
that each provided a feedback signal, proportional to the power supply output
voltage, to a digital controller. The controller application program included a
function block to monitor the feedback signals and raise an alarm when any
power supply output voltage approached limits, and if a voltage limit was
exceeded, then disable, or turn off the power supply.
A system-level FMEA was developed by the system integrator. It included
component-level failure modes, and in some cases, went down to the device level
failure mechanisms in some of the components. The internal chip that provides
the feedback signal from each power supply was evaluated in the FMEA, which
concluded that the signal could drift out of tolerance, and if it did, would be
automatically detected via the warning alarm prior to disabling the associated
power supply.
During the FAT, local power quality issues in the FAT environment caused control
rod power supply voltage variations, which in turn caused the controller to raise
numerous alarms. Considering these alarms a nuisance, the integrator and the plant
owner/operator agreed to temporarily disable them, and continued with
the FAT.
When the system was installed at the plant, the alarms were still disabled, and
the system was placed into service after testing. Later, when power supply
voltage variations occurred, the digital controller disabled two power supplies
that signaled they were out of tolerance, but there were no alarms, and the result
was a dropped control rod without any warning when the plant was at 100%
power.
While the primary lesson learned from this Case Study is arguably about
inadequate configuration control, because the disabled alarm functions were not
properly re-enabled before the system was placed in service, another lesson learned
is about validation of the system FMEA. The system FMEA clearly described an alarm
feature that would indicate feedback signal drift or a malfunction in a control rod
power supply, but there were no test cases executed at the SAT (in a pre-
installation environment) or during post-installation testing to validate these
alarms.
Test cases should be developed and executed to validate FMEA results, at least to
the extent that they can be executed without requiring extraordinary conditions or
destructive methods.
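The kind of FMEA validation test case that was missing in Case Study 4-1 can be sketched as follows. This is an illustration only: the function names, the voltage limits, and the warning margin are hypothetical values chosen for the example, and the monitoring function is a simplified stand-in for the controller function block described in the case study.

```python
def monitor_power_supply(voltage, low_limit, high_limit, warn_margin,
                         alarms_enabled=True):
    """Return (alarm_raised, supply_disabled) for one feedback sample."""
    exceeded = voltage < low_limit or voltage > high_limit
    approaching = (voltage < low_limit + warn_margin
                   or voltage > high_limit - warn_margin)
    alarm = alarms_enabled and approaching
    return alarm, exceeded


def fmea_alarm_validated(alarms_enabled):
    """Test case: a voltage drifting toward a limit must raise the alarm
    before the power supply is disabled, as the FMEA claims."""
    alarm, disabled = monitor_power_supply(
        voltage=27.5, low_limit=20.0, high_limit=28.0,   # hypothetical values
        warn_margin=1.0, alarms_enabled=alarms_enabled)
    return alarm and not disabled


print(fmea_alarm_validated(alarms_enabled=True))    # as-designed configuration
print(fmea_alarm_validated(alarms_enabled=False))   # alarms left disabled
```

Executed against the as-installed configuration, a test case of this kind fails when the alarms are still disabled, which would have exposed the condition before the system was placed in service.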
Often, multiple FMEAs are involved or may be available in the course of a digital
I&C project:
Generic platform or component Design FMEA by the equipment vendor
In order to support reliability claims, some equipment vendors perform
Design FMEAs on their equipment, typically down to the failure modes of
the individual devices (e.g., CPUs, Analog to Digital Converters, resistors,
capacitors, etc.) that make up each platform component (e.g., controllers,
I/O modules, power supplies, etc.). Such FMEAs should be performed on a
component-by-component basis, and systematically analyze the failure
modes, failure mechanisms and methods of detection for each internal device
in a given component.
Additionally, defensive measures in terms of hardware design features,
software features, or limits and precautions in the use of a given module can
be assessed. The Taxonomy of failure modes, failure mechanisms and
defensive measures provided in Appendix B of this guideline can be used as
an aid to assess the adequacy of a Design FMEA provided by or available
from an equipment vendor.
When a piece of vendor equipment is applied in a solution, and a generic
equipment-level Design FMEA is available, the Design FMEA should be
assessed for internal failure modes that can propagate to the equipment
interfaces, to ensure that the system-level Design FMEA adequately
assesses those failure modes.
Functional FMEA by Owner/Operator
As described in Section 4.2, the Functional FMEA method is helpful for
identifying failure modes at the basic function/process level, before
equipment-specific functions are allocated or assessed. The Functional
FMEA should bound the functions and processes that would be provided by,
affected by, or interfaced with equipment-based solutions (i.e., systems or
components) to be analyzed later under a Design FMEA or other suitable
hazard analysis method.
The Functional FMEA is typically expected to be performed by I&C
Engineers or their designees (e.g., an Architect/Engineer firm), with support
from individuals knowledgeable in the basic design and operation of the
affected plant systems, including system engineers, operators, and reliability
or PRA engineers.
Equipment-Level Design FMEA by System Integrator
a. The physical, functional, and data interfaces assessed in the Design
FMEA should be systematically compared to the actual interfaces
provided in the finished solution to verify that all interfaces and their
failure modes are fully and completely addressed, including unused
interfaces or interfaces that may be used infrequently.
b. Factory Acceptance Test (FAT) and/or Site Acceptance Test (SAT) test
cases should be developed to systematically validate the results of the
Design FMEA, to the extent that test cases are non-destructive. Some
failure modes and the expected effects and related methods of detection
are simple to test, such as a failed low analog signal at an appropriate
interface, or turning off a power supply to validate that an expected
indication or alarm is raised.
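The interface comparison described in item a can be sketched as a set difference between the interfaces assessed in the Design FMEA and those found in the finished solution. This is an illustrative sketch only; the function name and the interface identifiers are assumptions made for the example.

```python
def compare_interfaces(assessed, actual):
    """Compare interfaces assessed in the Design FMEA with the as-built interfaces."""
    assessed, actual = set(assessed), set(actual)
    return {
        # Present in the finished solution but missing from the FMEA.
        "not_assessed": sorted(actual - assessed),
        # Listed in the FMEA but absent from the finished solution.
        "not_present": sorted(assessed - actual),
    }

# Hypothetical interface identifiers, including an unused spare port.
fmea_interfaces = ["AI-01 flow signal", "DO-03 valve demand", "ETH-1 data link"]
as_built_interfaces = ["AI-01 flow signal", "DO-03 valve demand",
                       "ETH-1 data link", "ETH-2 spare port"]

result = compare_interfaces(fmea_interfaces, as_built_interfaces)
print(result["not_assessed"])
```

Here the unused spare port appears in the "not_assessed" list, which is the kind of unused or infrequently used interface that item a asks the analyst to address.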
Plant System-Level Design FMEA by Owner/Operator
In some cases, a system-level Design FMEA may be useful or required in
order to demonstrate how the results of an equipment-level Design FMEA
interact with interfacing systems or components that are not assessed at the
equipment level. The Owner/Operator or designee is typically responsible for
a system-level Design FMEA, and may choose to revise or append the
equipment-level Design FMEA to account for system-level failure modes
and effects, or a separate, stand-alone Design FMEA may be preferred. In
either case, the finished FMEA product(s) should cover the interactions
between the new or modified equipment and the plant systems or
components that are not modified. Such interactions are typically assessed at
the equipment interfaces.
System-level Design FMEAs should also be validated during system testing,
via SAT and/or Post-Modification Testing (PMT) activities, to the extent
that test cases are practical or non-destructive. Operating experience includes
cases in which failure modes of plant interfaces with new or modified equipment
were not tested at the FAT or SAT due to limitations, and were not tested during
PMT due to an oversight in the Mod Test Plan, leading to unexpected behaviors
when such interfaces failed.
Linking results
When multiple FMEAs are developed or provided, they should be assessed
for adequate coverage of equipment interfaces, adequate overlap between
digital I&C equipment and interfacing plant systems and components, and
adequate methods of detection for translation into operations and
maintenance procedures.
Methods of Detection
should also be balanced against their potential to interfere with safety or mission-
critical functions.
Some digital systems are capable of alarm management approaches that enable
detection and logging of degraded conditions that don’t cross the threshold of an
alarm condition. In these cases, an alarm management philosophy should be
established that balances the need for automatic system alarms against the
likelihood of creating nuisance alarms.
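As a simple illustration of such a philosophy, a degraded condition can be logged at a lower threshold and alarmed only at a higher one. The sketch below assumes a single measured deviation and two thresholds; the function name, the threshold values, and the three-level classification are hypothetical.

```python
def classify_condition(deviation, log_threshold, alarm_threshold):
    """Classify a measured deviation as NORMAL, LOG (degraded, log only), or ALARM."""
    if log_threshold > alarm_threshold:
        raise ValueError("log_threshold must not exceed alarm_threshold")
    magnitude = abs(deviation)
    if magnitude >= alarm_threshold:
        return "ALARM"
    if magnitude >= log_threshold:
        return "LOG"
    return "NORMAL"

# Hypothetical thresholds, in percent deviation from nominal supply voltage.
for deviation in (0.5, -1.5, 3.0):
    print(deviation, classify_condition(deviation, log_threshold=1.0, alarm_threshold=3.0))
```

Widening the band between the two thresholds reduces nuisance alarms, at the cost of relying on someone to review the log.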
If FMEA results show that a system is vulnerable to failure modes that can
significantly affect equipment reliability, then the results should be
communicated up to senior management for review and decision-making before
proceeding any further through the development lifecycle.
Case Study 4-2. Unresolved Single Point Vulnerabilities
A digital upgrade project included an objective to eliminate single point
vulnerabilities in one of the mission-critical plant control systems. At the end of the
detailed design phase, the project FMEA was updated to reflect design details,
but it showed some remaining single point vulnerabilities that could not be
removed without significant rework. Because the senior management team had
communicated that removing single point vulnerabilities is a high priority for the
station, the project team communicated their finding to the senior management
team, with a recommendation that the project schedule and budget be adjusted
to enable the rework.
Lesson Learned
While the senior management team was disappointed with this finding, they
appreciated the opportunity to review its implications and provide direction to the
project team. Their direction was to defer installation until the system design
could be reworked in order to remove the vulnerabilities, thus preventing
installation of a system that would not meet station objectives.
Licensing
An FMEA is one of the failure analysis methods that can be used to support
licensing activities under 10CFR50 (for operating plants) or under 10CFR52 (for
new plants).
For operating plants, the EPRI guideline on licensing of digital upgrades, TR-
102348 Rev 1 (Reference 4), describes a lifecycle approach to system
development activities that is joined to failure analysis and licensing activities,
such as preparation of 10CFR50.59 evaluations or License Amendment
Requests. Figure 3-1 in TR-102348, sometimes referred to as “the bus-bar
diagram,” illustrates four steps under the heading of “Failure Analysis” that
support and work in parallel with design and licensing activities:
1. Identify system-level failures and their effects on the plant. System-level
failures can occur in the form of single failures or common-cause failures, and
they can be forced by misbehaviors in interfacing systems, or by abnormal
conditions and events. System-level failures would be identified in an FMEA
under the heading “Effect on System.”
2. Identify potential causes of system failures. In an FMEA, potential causes of
system failures would be identified under the headings “Failure Modes” or
“Failure Mechanisms.” Such potential causes of system failures should be
considered as technical causes or direct causes of system failures, and should
not be confused with root or apparent causes of system failure events, which
typically include programmatic or human performance characteristics.
3. Assess significance and risk of failures. This failure analysis step (among
other activities) helps determine the likelihood and consequences of
malfunctions and accidents, which are the key concepts in the 50.59 rule.
4. Identify resolution. This failure analysis step involves the activities described
herein, in terms of how to use the FMEA results.
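The first two steps above can be read directly from an FMEA worksheet by grouping rows on the "Effect on System" column. The following sketch assumes a worksheet held as a list of dictionaries keyed by the column headings named in the text; the row contents are hypothetical.

```python
from collections import defaultdict

# Hypothetical worksheet rows keyed by the FMEA column headings named in the text.
worksheet = [
    {"Failure Mode": "Output fails low",
     "Failure Mechanism": "Open circuit",
     "Effect on System": "Loss of flow control"},
    {"Failure Mode": "Input fails as-is",
     "Failure Mechanism": "Frozen analog-to-digital converter",
     "Effect on System": "Loss of flow control"},
    {"Failure Mode": "Display blank",
     "Failure Mechanism": "Backlight failure",
     "Effect on System": "Loss of local indication"},
]

def causes_by_system_effect(rows):
    """Group (failure mode, failure mechanism) pairs under each system-level effect."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Effect on System"]].append(
            (row["Failure Mode"], row["Failure Mechanism"]))
    return dict(grouped)

for effect, causes in causes_by_system_effect(worksheet).items():
    print(effect, "<-", causes)
```

Each key is a system-level failure (Step 1), and the list under it holds the potential technical causes (Step 2) that would feed the significance and resolution steps.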
Case Study 4-3. Inadequate Licensing Evaluation
An operating plant installed a digital rod control system under the 50.59 rule
(without prior NRC review and approval). The application included a function
that would allow ganged rod movement, but it was disabled, pending regulatory
review of a License Amendment Request to allow use of this specific function.
While reviewing the license amendment request (LAR), the NRC raised some
questions about the installed digital system, and performed an inspection at the
facility.
The inspectors determined that "…the licensee had not properly evaluated
questions associated with software common cause failure and the potential for
spurious, uncontrolled withdrawal of four control rods." The inspection report
adds: "The inspectors were concerned that the Rod Control System, as a highly
safety significant system, should have been evaluated, under 10 CFR 50.59,
assuming software common cause failures, because under certain software
failures the plant could potentially be placed in a condition outside its design
bases by causing unanalyzed abnormal operating occurrences." Later, the NRC
issued Information Notice 2010-10 in response to this inspection.
The owner/operator determined that the root cause of the inspection finding was
an unsupported determination that a software common cause failure of the Rod
Control System was not credible. EPRI TR-102348 Rev 1 suggests that "with
respect to failures due to software, including common cause failures, the key to
addressing these failure modes in licensing is having performed appropriate
design, analysis and evaluation activities to provide reasonable assurance that
such failures have a very low likelihood."
Lesson Learned
The owner/operator had two opportunities regarding failure analysis activities
that may have helped to prevent the inspection finding:
1. Performing a preliminary FMEA in the conceptual design phase can help the
detailed design by quickly identifying critical functions and key failure modes
to avoid. A preliminary FMEA can also help to identify the safety analysis
events that could be adversely impacted by the change.
2. In the detailed design phase, evaluating and crediting not only software
quality process measures, but also the design features and defensive measures
that protect against CCF, can determine whether protection is adequate. If so,
then the evaluation may have provided the technical basis for asserting that
the likelihood of a malfunction due to software is sufficiently low so that it
need not be considered further in the 50.59 evaluation.
In the context of software common cause failures, TR-102348 Rev 1 defines
"sufficiently low" as "...much lower than the likelihood of failures that are
considered in the UFSAR (for example, single failures) and comparable to other
common cause failures that are not considered in the UFSAR (such as design
flaws, maintenance errors, and calibration errors)."
Periodic Testing
The FMEA can be used to identify failure modes and effects that can only be
detected through periodic testing. For digital I&C systems that function on
demand, or change modes of operation as plant conditions change, periodic
testing should be considered for detecting such failure modes before they can
adversely impact system operation. Periodic testing may be as simple as logging
into an engineering or maintenance workstation and retrieving diagnostic
information for review, or walking down the system and inspecting local
indications (e.g., status LEDs or power supply lamps). Periodic testing may
require taking the system or part of the system out-of-service for the purpose of
injecting signals or simulating plant conditions and observing the system
response.
The FMEA may be used as an input to fault trees and the plant Probabilistic
Risk Assessment, especially if digital system and component failure modes are
different from the original analog system. Often, digital system software is (or
can be) designed to force a specific response to component-level failure modes,
such as “fail open” or “fail as-is,” which should be accounted for in the PRA, at
least for those systems that are modeled to that extent in the PRA, using the
FMEA as an input for changes to the PRA.
For Tech Spec system failure modes and effects that cannot be automatically
indicated and/or alarmed, thus leaving surveillance testing as the only viable
method of detection, the PRA can be used to determine an acceptable
surveillance interval. If the PRA is used to determine the surveillance interval,
then the proposed design should be modeled in a technically adequate PRA, and
the results in terms of change in Core Damage Frequency (CDF) and Large
Early Release Frequency (LERF) should be assessed. The PRA can assess the
change in risk similar to a Maintenance Rule a(4) assessment; the failure
probability of I&C components can be assumed to be proportional to the
surveillance interval, and acceptance guidance can be based on Regulatory Guide
1.174 (Reference 29). If several surveillance intervals are to be assessed, the
collective change in CDF and LERF also should be examined. NEI 04-10
(Reference 28) provides additional guidance.
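The proportionality assumption mentioned above can be illustrated with a short calculation. This is a sketch only: it applies the linear scaling stated in the text, and the base probability and the intervals are hypothetical values, not data from any plant or PRA.

```python
def scaled_failure_probability(p_base, interval_base, interval_new):
    """Scale a failure probability linearly with the surveillance interval."""
    if interval_base <= 0 or interval_new <= 0:
        raise ValueError("intervals must be positive")
    p_new = p_base * (interval_new / interval_base)
    return min(p_new, 1.0)  # a probability cannot exceed 1

p_quarterly = 1.0e-3  # hypothetical failure probability at a 3-month interval
p_annual = scaled_failure_probability(p_quarterly, interval_base=3, interval_new=12)
print(p_annual)
```

The scaled value would then replace the original failure probability in the PRA model so that the resulting change in CDF and LERF can be assessed against the acceptance guidance.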
perform its intended safety function. In this respect, the item is deemed
equivalent to an item designed and manufactured under a 10 CFR 50, Appendix
B quality assurance program. The CGD process is accomplished by identifying
the critical characteristics of the item and subsequently verifying their
acceptability by inspections, tests, or analysis supplemented as necessary by
commercial grade surveys, product inspections or hold point witnessing at the
manufacturer’s facility, or analysis of historical records for acceptable
performance.
For additional guidance on applying the FMEA method and results in CGD
activities, see References 23, 24 and 25.
Troubleshooting & Cause Analysis
FMEA (as well as Top Down) results can be used as an input to troubleshooting
and cause analysis activities when digital I&C equipment fails or misbehaves.
One method developed by Exelon involves the use of a “Failure Mode Tree”
(FMT). The FMT method systematically postulates possible failure modes that
may have caused a system problem or event, then compares available evidence to
support or refute each possible cause. Each failure mode is listed in a tree format,
under a “Problem Statement”, and available evidence is listed under each failure
mode. Evidence is gathered from system logs, diagnostic information,
measurements, tests, inspections and other sources using simple or complex
troubleshooting plans.
Each failure mode in the FMT is transposed into a table that includes columns
for validation or action steps, expected results, and actual results. The end
product is a package of information that systematically supports troubleshooting
and immediate (or technical) cause determinations, which is especially helpful for
complex systems. It should be noted that the Exelon FMT method is not used
for Root Cause or Apparent Cause Analysis activities, which are beyond the
scope of this guideline.
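The tree and support/refute structure described above can be sketched as a small data model. This is not the Exelon procedure itself; the class names and the rule that any refuting evidence removes a candidate are simplifying assumptions. The sample entries are taken from Case Study 4-5.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    description: str
    supports: bool  # True supports the postulated failure mode, False refutes it

@dataclass
class PostulatedFailureMode:
    name: str
    evidence: list = field(default_factory=list)

    def status(self):
        # Assumed rule for this sketch: any refuting evidence rules the mode out.
        if not self.evidence:
            return "open"
        if any(not e.supports for e in self.evidence):
            return "refuted"
        return "supported"

@dataclass
class FailureModeTree:
    problem_statement: str
    failure_modes: list = field(default_factory=list)

    def remaining_candidates(self):
        """Names of failure modes that have not been refuted."""
        return [fm.name for fm in self.failure_modes if fm.status() != "refuted"]

tree = FailureModeTree(
    "HPCI System Fails to Reach Required Flow During Surveillance Test",
    [
        PostulatedFailureMode(
            "Failed open system enable signal",
            [Evidence("Walkdown: limit switch on steam admission valve misaligned", True)]),
        PostulatedFailureMode(
            "Failed demand signal to governor valve positioner",
            [Evidence("System log: no failures of the demand signal", False)]),
    ],
)
print(tree.remaining_candidates())
```

Failure modes with no evidence yet stay "open", which keeps them visible until the troubleshooting plan has gathered something to support or refute them.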
Case Study 4-5. Troubleshooting and Cause Analysis
The system described in Examples 4-1 and 4-2 is the subject of this Case Study,
which is about using Design FMEA results to inform a complex system
troubleshooting and cause analysis activity. The method begins with defining a
Problem Statement, which in this case is shown at the top of the Failure Mode Tree
on the left side of Figure 4-9 as “HPCI System Fails to Reach Required Flow During
Surveillance Test.”
The Failure Mode Tree is further developed by postulating potential failure modes
that could cause the defined problem. In this case, the HPCI Flow Control System
FMEAs developed in Examples 4-1 and 4-2 are used as an input. Many potential
causes would be considered, but only two are shown in Figure 4-9 due to space
limitations.
Digital system logs are obtained, equipment inspections are performed, and other
data is acquired as evidence for performing a support/refute analysis of each
potential cause. In this Case Study, physical evidence obtained by inspection
during an equipment walkdown shows that the Limit Switch on the HPCI turbine
steam admission valve is significantly misaligned. On the other hand, a HPCI Flow
Control System log shows that there were no failures of the demand signal to the
governor valve positioner.
The HPCI Flow Control System Design FMEA, of which an excerpt is shown on the
right side of Figure 4-9, indicates that a misaligned Limit Switch is a failure
mechanism that can lead to a failed open system enable signal, ultimately leading
to closure of the HPCI turbine governor valve (or failure to open on demand). This
evidence supports the misaligned limit switch as the cause of the defined problem,
and other evidence refutes all other potential causes.
Lesson Learned
System FMEAs can be used as an effective input to troubleshooting and cause
analysis activities, especially when complex digital I&C equipment is involved. The
Failure Mode Tree and Support/Refute methods developed by Exelon can use
system FMEAs, among other sources of information, to systematically identify
causes of system failures.
System FMEAs should be maintained and available as controlled documents after a
digital upgrade project is completed in order to support troubleshooting and cause
analysis activities. They can be controlled as their own uniquely identified
document type, or they can be stored and retrieved as a form of “system
calculation” or inserted into the vendor technical manual. The equipment database
should indicate a link to the appropriate controlled document that contains the
FMEA so that it can be readily retrieved from the document control system.
Figure 4-9
Failure Mode Tree Using FMEA Results as an Input
[Diagram. Problem Statement: HPCI Fails to Reach Required Flow During Surveillance Test]
4.7 FMEA Strengths
The FMEA method is focused on single failure mechanisms and/or single failure
modes. This focal point is a strength in terms of its ability to identify single
failure modes for demonstrating compliance with the single failure criterion and
for identifying single point vulnerabilities.
Simplicity
Users of this guideline can take advantage of the generic failure taxonomy
provided in Appendix B for identifying likely failure modes, failure mechanisms,
and defensive measures. Users can also derive or expand their own taxonomy for
use in repeated applications of the FMEA method on various projects.
The FMEA methods described in this guideline are widely used in multiple
industries, and FMEA standards, guides, procedures and training programs are
readily available. Nuclear power plant I&C design engineers and I&C equipment
designers are usually trained and experienced in mechanical, electrical,
electronics, nuclear and other discipline-specific engineering fundamentals. The
idea of postulating component and device failures is consistent with their training
and experience with design and support of I&C equipment. Therefore, the
FMEA method is one of the most accessible and understood hazard analysis
methods available to the I&C engineering community, within nuclear power
utilities and among equipment vendors.
Software Hazards
The FMEA method typically considers hardware failures only, where it can be
applied effectively. To date, methods for identifying software failures
and determining their effects remain a research problem, especially since there is
no clear industry and regulatory consensus on the meaning of “software failure.”
This research problem is being studied heavily in several industrial
sectors, and the software failure taxonomy sheets provided in Appendix B are a
summary of currently available guidance. Users of this guideline can venture into
“software failures” using the FMEA methods described herein, but should be
cautioned that this approach has not gained wide acceptance in the nuclear power
industry. For additional insights on this research topic, see Reference 22.
The FMEA method is useful for analyzing failure modes and effects between
components of interest and between interfacing systems and components.
However, it may not assess the effects of all interfaces if the boundary is not
drawn correctly, or if the block diagram does not account for all interfaces that
actually cross the boundary in the implemented system.
Section 5: Top Down Method Using Fault
Tree Analysis (FTA) Techniques
Fault tree analysis is a technique that generally is used to identify combinations of
components and their failure modes leading to failure of systems to perform their
intended functions. Fault tree analysis has been applied as a method to study
system design for over fifty years (Reference 34). It has gained acceptance in
numerous industries, among them defense, aerospace, chemical, transportation,
automotive, robotics and nuclear power (both research and commercial reactors).
Fault trees are deductive logic models (Reference 35). They begin by defining the
occurrence of a top event at a facility or within its systems (e.g., core damage,
failure to provide generation above a selected capacity factor, or failure of a
system to perform a given function) which then is broken down into failures of
trains within the systems being modeled and ultimately components and their
failure modes that would contribute to the occurrence of the top event. As the
name implies, fault trees are constructed in failure space. Fault trees focus
on failures because, for complex systems with built-in redundancy, the ways a
system can fail are generally fewer, and made up of smaller sets of components,
than the ways it can succeed.
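The deductive breakdown of a top event into component failure modes can be illustrated with a small fault tree and its minimal cut sets. The sketch below is a simple recursive expansion with absorption, suitable only for small trees; the tuple representation of gates and the example events are hypothetical.

```python
from itertools import product

def minimize(sets):
    """Drop any cut set that strictly contains another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}

def cut_sets(node):
    """Return the minimal cut sets of a node as a set of frozensets of basic events.

    A node is either a basic event (str) or a tuple (gate, children),
    where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any cut set of any child causes the gate output.
        combined = set().union(*child_sets)
    elif gate == "AND":
        # One cut set from every child is needed at the same time.
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimize(combined)

# Hypothetical top event: loss of a two-train system with a shared power supply.
top_event = ("OR", [
    ("AND", ["Train A pump fails", "Train B pump fails"]),
    "Shared power supply fails",
])

for cs in sorted(cut_sets(top_event), key=lambda s: (len(s), sorted(s))):
    print(sorted(cs))
```

A cut set with a single basic event, such as the shared power supply here, corresponds to a single point vulnerability for the top event.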
Fault trees can be used to quantify the failure probability of a system or collection
of systems (or, conversely, estimate their reliability). More importantly, however,
fault trees generate qualitative insights regarding the design of a plant and its
systems. In the guidance provided in this report, it is the qualitative or
deterministic aspects of fault tree analysis that are considered in the failure
analysis of digital systems. Among the qualitative information that may be
derived from fault trees in performing a review of the design of a digital system
are:
1. confirmation of the functions that are most useful for the digital system to
provide (including those that may be beyond the primary purpose of the
digital system)
2. identification of the important failure modes of the plant components that
are to be actuated or controlled by the digital system (as well as determining
the failure modes that are not important)
3. understanding the context of the digital system in the plant design as a whole
4. validation that the architecture of the digital system is consistent with success
criteria for the systems that it supports.
Note that some of these qualitative insights may not require a fault tree model to
be developed for the digital system itself and that fault trees providing much of
the above information may already be available in support of other plant
programs (e.g., as a part of the plant specific PRA). In that regard, the guidance
in this section is directed at taking advantage of existing fault trees from the PRA
as opposed to developing new fault trees for the purpose of performing the failure
analysis.
Section 1.5 effectively defines the objectives of this report as providing guidance
to ensure that a digital system failure analysis is as complete as practical while
requiring a reasonable effort to perform. The following is an overview of a top
down methodology that is directed at those objectives with a focus on taking
advantage of fault tree techniques.
The proposed top down approach begins by recognizing that I&C systems are a
part of a larger integrated plant design. By themselves, they cannot accomplish
the functions needed to ensure safe and efficient operation of the plant without
the equipment they actuate, monitor or control. For that reason, the top down
approach begins by defining high level safety and generation related functions
and works its way down to the interface between the I&C systems and the plant
mechanical and electrical equipment that performs these functions. The
primary objective of this top down review, therefore, is to focus the scope of
digital system failure modes that should be investigated in the failure analysis of
the system by identifying the potentially important failure modes of the
mechanical and electrical components controlled or actuated by the digital
system.
As part of the top down review of safety and generation functions, consideration
of what is modeled in the plant specific PRA is encouraged. The PRA contains
fault tree logic for many of the plant systems that may be influenced by the
digital systems under review, including some that are generation-related. Taking
advantage of models already developed for the PRA can limit the effort required
to define the failure modes of interest for the digital systems.
5.2 Procedure for Top Down Method Using Fault Tree
Techniques
Prerequisite
The first three steps described below for the fault tree analysis approach
effectively accomplish the Function Analysis described in Section 3.6. To aid in
the Function Analysis as an input to the Top Down method, example frontline
and support system safety and generation functions are listed in this Section.
The following steps describe one approach for performing a digital system hazard
analysis using fault trees as input. These steps are not the only way to implement
the method; variations are likely, and can be blended with or replaced by steps
described for other methods in this guideline. The analyst is encouraged to
review and modify the fault trees presented in this guideline as needed to reflect
their plant specific design.
The fourth step in the fault tree analysis approach converts the results of the fault
tree based Failure Analysis into the failure modes of interest for the digital system
under review. This step effectively represents the PHA described in Section 3.7.
An obvious initial step in the failure analysis of a digital system is to identify the
scope of the I&C system under review. For the purpose of performing a top down
analysis of the identified system(s), it is not necessary that the design of the system
be complete or that details of the design be available. In fact, the first few steps of
the top down analysis are sufficiently general that they would apply whether the
I&C in question consists of a small set of individual I&C components within a
specific plant system or involves a plant wide digital I&C review including balance
of plant as well as safety systems. As noted in Section 5.1, the key information that
eventually will be needed in implementing the top down analysis will be the
identification of the non-I&C mechanical and electrical components that the I&C
system under review actuates or controls, along with their failure modes.
Top Down Step 2: Define Plant Level Functions & Develop System Level Fault
Tree Logic
Activities at a nuclear facility are directed toward the primary goals of nuclear
safety and efficient plant operation. The following are suggested for defining
high level safety and generation functions in performing a top down digital
system failure analysis.
Safety Functions
The three key safety functions listed in 10CFR50.2 are a reasonable starting
point for defining high level safety functions in that they encompass the most
important considerations regarding protecting the health and safety of the public
including events that go beyond the design basis. They are consistent with lower
level functions considered in the plant's safety analysis, the emergency operating
procedures (EOPs) and the plant specific PRA.
1. Ensure primary coolant system integrity
2. Shut down the reactor and maintain safe shutdown
3. Prevent significant releases (e.g., those in excess of 10CFR100)
Generation Functions
Three key functions can be defined that are each necessary for the production of
energy for delivery to the grid.
1. Energy conversion to steam and inventory control
2. Steam flow and condensation
3. Conversion of energy to electricity and delivery to the grid
The three safety functions are required for defense-in-depth purposes with
respect to ensuring the health and safety of the public such that no single
function is relied upon to the exclusion of the others (e.g., containment cannot be
credited by itself in preventing significant releases without also having a means of
providing adequate core cooling and vice versa). Loss of any one of the
generation functions will result in a plant shutdown or load reduction with the
loss of electrical power production. The failure of these key safety and generation
functions can be used to define the top events of fault trees intended to model
safe and efficient plant operation.
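For the generation functions, the statement that loss of any one function results in a shutdown or load reduction corresponds to OR logic at the top of the tree. A minimal sketch follows, assuming the three function names listed above are used as identifiers; the function name `loss_of_generation` is hypothetical.

```python
GENERATION_FUNCTIONS = (
    "Energy conversion to steam and inventory control",
    "Steam flow and condensation",
    "Conversion of energy to electricity and delivery to the grid",
)

def loss_of_generation(failed_functions):
    """Top event: true if any one of the key generation functions is lost (OR logic)."""
    failed = set(failed_functions)
    unknown = failed - set(GENERATION_FUNCTIONS)
    if unknown:
        raise ValueError(f"unknown function(s): {sorted(unknown)}")
    return any(f in failed for f in GENERATION_FUNCTIONS)

print(loss_of_generation([]))
print(loss_of_generation(["Steam flow and condensation"]))
```

Lower levels of the tree would replace each function name with the logic for the frontline systems that provide it.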
Note that Table 5-1, Figure 5-1 and Figure 5-2 develop the top down logic to
the plant frontline system level along with the failure modes of those systems. A
frontline system is a plant system that directly provides the function specified in
the first column of the table. It is recognized that there are also supporting
systems that are necessary for the frontline systems to accomplish their
functions. Consideration of support systems is discussed in subsequent steps. In
Step 2 of this procedure, it is suggested that top logic need only be developed to
the extent that it identifies plant systems which directly support plant functions.
Table 5-2 lists the frontline functions and systems necessary for generation at the
plant level.
Figure 5-3 and Figure 5-4 each provide possible top level generation related logic
for BWRs and PWRs respectively. Plant level generation functions are as
follows. As with the safety functions, top logic in this step needs to be developed
only down to the point at which the frontline systems that perform the generation
functions are identified.
Table 5-1
Frontline Functions/Systems for Nuclear Safety at the Plant Level
Figure 5-1
BWR Safety Functions (Top Down)
[Diagram levels: Basic Safety Functions; Plant Level Safety Functions]
Figure 5-2
PWR Safety Functions (Top Down)
[Diagram levels: Basic Safety Functions; Plant Level Safety Functions]
Table 5-2
Frontline Functions/Systems for Generation at the Plant Level
Type of Function | BWR System Designator | BWR Description | PWR System Designator | PWR Description
Primary Functions
Reactivity Control | RR | Reactor Recirculation | CVCS | Charging/Letdown
Reactivity Control | RRFC | Reactor Recirculation Flow Control | --- | ---
Reactivity Control | CRD | Control Rod Drive | CRD | Control Rod Drive
Reactivity Control | NBI | Nuclear Boiler Instrumentation | NBI | Nuclear Boiler Instrumentation
Reactor Inventory Makeup/Heat Removal | --- | --- | PCP | Reactor Recirculation
Reactor Inventory Makeup/Heat Removal | --- | --- | CVCS | Charging/Letdown
Reactor Inventory Makeup/Heat Removal | RF | Reactor Feedwater | RF | Reactor Feedwater
Reactor Inventory Makeup/Heat Removal | RFC | Reactor feed control | RFC | Reactor feed control
Reactor Inventory Makeup/Heat Removal | MC | Main Condensate | MC | Main Condensate
Reactor Inventory Makeup/Heat Removal | CM | Condensate Makeup | CM | Condensate Makeup
Flow of Steam to Turbine | TGC | Turbine Electro-Hydraulic Controls | TGC | Turbine Electro-Hydraulic Controls
Flow of Steam to Turbine | MS | Main Steam | MS | Main Steam
Flow of Steam to Turbine | AR | Air removal | AR | Air removal
Flow of Steam to Turbine | OG | Offgas | OG | Offgas
Flow of Steam to Turbine | AOG | Augmented Offgas | |
Conversion of Steam Energy to Power | TG | Turbine Generator | TG | Turbine Generator
Conversion of Steam Energy to Power | TGI | Turbine Generator Supervisory Instrumentation | TGI | Turbine Generator Supervisory Instrumentation
Figure 5-3
BWR Generation Functions (Top Down)
[Figure: top event Loss of Generation Functions (GENERATION_FUNC). Basic Generation Functions: Energy Conversion to Steam and Inventory Control; Steam Flow and Condensation; Conversion of Energy to Electricity. Plant level functions: Reactivity Control (RX_CTL); Reactor Inventory Control / Heat Removal (RX_INV_CTL); Main Steam System; Steam Condensation (COND); Turbine Generator (Electrical); Generator Supervisory Instrumentation. Systems shown: CRD, NBI, RR, RRFC, FW, RFC, MC, CND, CWS, AR, ES, OG, CD, AOG (Augmented Offgas).]
Figure 5-4
PWR Generation Functions (Top Down)
[Figure: top event Loss of Generation Functions (GENERATION_FUNC). Basic Generation Functions: Energy Conversion to Steam and Inventory Control; Steam Flow and Condensation; Conversion of Energy to Electricity. Plant level functions: Reactivity Control; Reactor Inventory Control / Heat Removal; Main Steam System; Steam Condensation; Turbine Generator (Electrical); Generator Supervisory Instrumentation. Systems shown: Control Rod Drive Mechanisms, Charging / Letdown, Main Condenser, Circulating Water, Primary Coolant Pumps, Condensate, NBI, AR, ES, FW, MC, OG, CD, RFC.]
Top Down Step 3: Identify Actuated/Controlled Components and their Failure
Modes
Logically, the next step in the top down process would be to develop fault trees
for each of the plant systems identified in the preceding Tables and Figures. The
objective of these fault trees would be to identify the mechanical and electrical
components that are to be actuated or controlled by the digital system under
review and understand their failure modes. Having knowledge of the failure
modes of these components may help to focus the review of the digital system by
eliminating the need to consider digital failure modes that would not contribute
to loss of safety and generation functions important to plant operation or its
response to transients and accidents.
This procedure suggests that further development of fault trees may not be
necessary, however. Rather, at this stage of the evaluation, advantage can be
taken of the fault trees that already have been developed for a given plant, in
support of the plant specific PRA.
Safety Functions
Table 5-3 provides a suggested format for obtaining relevant component and
failure mode information from the PRA for safety functions:
Table 5-3
Format for Capturing Component Failure Mode Information from the PRA
between the first five columns of Table 5-3 and the Safety Function column
is intentional. It is likely that reports extracted from the PRA will easily
contain the first five columns, but it may be necessary for the analyst to
manually identify the related or affected safety functions.
For any given system modeled in the PRA (or for the entire PRA), a listing of
the Basic Events and a Description is simple to generate from the PRA. A
database relating the Basic Events to the Tag IDs for components may also be
available (the Tag ID may make up a part of the Basic Event name in many
cases). Often, the description of the Basic Event included in the PRA may
simply reflect information already a part of the Basic Event name (e.g., the Tag
ID and its Failure Mode). In this case, it may be useful to expand the definition
to be more descriptive of the component and its function (e.g., ‘charging pump
fails to provide flow to the reactor’ as opposed to ‘P-101 FTR’).
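The expansion of terse Basic Event names described above can be sketched in a few lines. This is a minimal illustration only: the "Tag ID, space, failure mode code" naming convention, the failure mode codes, the component lookup, and the row fields are all assumptions made for the sketch, not the layout of Table 5-3 or of any particular PRA tool.

```python
from dataclasses import dataclass, field

# Hypothetical failure mode codes; a real PRA defines its own conventions.
FAILURE_MODES = {
    "FTR": "fails to run",
    "FTS": "fails to start",
    "FTO": "fails to open",
    "FTC": "fails to close",
}


@dataclass
class BasicEventRow:
    basic_event: str
    tag_id: str
    failure_mode: str
    description: str
    # Left empty here: the analyst identifies affected safety functions manually.
    safety_functions: list = field(default_factory=list)


def parse_basic_event(name, component_names):
    """Split an assumed '<Tag ID> <code>' name and build a fuller description."""
    tag_id, code = name.rsplit(" ", 1)
    component = component_names.get(tag_id, tag_id)
    mode_text = FAILURE_MODES.get(code, code)
    return BasicEventRow(name, tag_id, code, f"{component} {mode_text}")


# Hypothetical equipment list entry relating a Tag ID to a component name.
row = parse_basic_event("P-101 FTR", {"P-101": "charging pump"})
print(row.description)
```

The safety function field is deliberately left for manual entry, mirroring the point that reports extracted from the PRA are unlikely to supply it.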
Generation Functions
many of the components and their failure modes are the same regardless of
what system functions are being considered.
Generation related systems that have Initiating Event Fault Trees or Safety
Function Fault Trees modeling the same components as needed to support
power operation can be reviewed in a manner similar to that described for the
Safety System Fault Trees. For each system, a listing of Basic Events along with
its associated Tag ID and Failure Mode can be provided from the PRA. The
description of the components and their failure modes can be modified to discuss
how failure of the component would contribute to the potential for generation
loss. Finally, the plant level generation functions supported by these components
would be identified.
As noted above, Initiating Event Fault Trees may not contain logic for
supporting systems. However, for those PRAs utilizing Initiating Event Fault
Trees, supporting systems also may be modeled as initiators. A review of
supporting systems for each of the Initiating Event Fault Trees should be
performed, and for any supporting system that is not also modeled, a substitute
method of identifying its components and failure modes should be selected,
which could include relying on the Safety System Fault Trees for the supporting
systems.
Table 5-4 provides an initial starting point for the review of support systems as
initiators of plant trips or load reductions. The objective for all generation related
systems is to list the Basic Events, Tag IDs, Failure Modes and a Description of
how the component supports each generation related function.
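Table 5-4
Supporting Functions/Systems for Generation at the Plant Level

| Type of Function | Function | BWR System Designator | BWR Description | PWR System Designator | PWR Description |
|---|---|---|---|---|---|
| Supporting Functions | Motive Power | SWY | Switchyard | SWY | Switchyard |
| Supporting Functions | Motive Power | EE | Electrical Equipment | SPS | Station Power |
| Supporting Functions | Control Power | EE | Instrument AC | IAC | Instrument AC |
| Supporting Functions | Control Power | DC | DC Power | DC | DC Power |
| Supporting Functions | Control Power | IA | Instrument Air (Pneumatic Supply) | IAS | Instrument Air |
| Supporting Functions | Control Power | SA | Service Air | CAS | Service Air |
| Supporting Functions | Control Power | TGF | Turbine Electro-Hydraulic Control Fluid | EHC | Turbine electro hydraulic control |
| Supporting Functions | Control Power | RRMG | Reactor Recirculation Motor/Generator Set | --- | --- |
| Supporting Functions | Equipment Cooling | SWS | Service Water | SWS / NSW | Service Water / Non-critical SW |
| Supporting Functions | Equipment Cooling | TBCCW | Turbine Building Equipment Cooling | CCW | Component Cooling Water |
| Supporting Functions | Equipment Cooling | RBCCW | Reactor Bldg. Equipment Cooling | --- | --- |
| Supporting Functions | Equipment Cooling | DGJW | Diesel Generator Jacket Water | --- | --- |
| Supporting Functions | Lubrication | LOGT | Turbine Lube Oil (instrumentation) | LO | Turbine Lube Oil |
| Supporting Functions | Lubrication | LO | Turbine Lube Oil (mechanical) | --- | --- |
| Supporting Functions | Lubrication | RFLO | Reactor Feed Lube Oil | LO | Feedwater Lube Oil |
| Supporting Functions | Lubrication | RRLO | Reactor Recirculation Lube Oil | --- | --- |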
Table 5-4 (continued)
Supporting Functions/Systems for Generation at the Plant Level

| Type of Function | Function | BWR System Designator | BWR Description | PWR System Designator | PWR Description |
|---|---|---|---|---|---|
| Supporting Functions | HVAC | HV | Reactor Bldg. Heating, Ventilation, Air Conditioning | HVAC | Aux Bldg. Heating, Ventilation, Air Conditioning |
| Auxiliary Functions | Seals | | | PCP Seal Cooling | Containment Component Cooling |
| Auxiliary Functions | RCS Integrity | SRV | Safety Relief Valves | SRV / PORVs | Safety Relief Valves |
| Auxiliary Functions | RCS Integrity | RPV | Reactor Pressure Vessel | PCS | Primary Coolant System |
| Auxiliary Functions | | NB | Nuclear Boiler | --- | --- |
| Auxiliary Functions | Reactor Water Chemistry | RWCU / CVCS | Reactor Water Cleanup / Charging / Letdown | CVC | Charging / Letdown |
| Auxiliary Functions | Reactor Water Chemistry | CF | Condensate Demineralizers | CND | Condensate System |
Table 5-4 (continued)
Supporting Functions/Systems for Generation at the Plant Level

| Type of Function | BWR System Designator | BWR Description | PWR System Designator | PWR Description |
|---|---|---|---|---|
| Regulatory Functions | --- | --- | EFW | Emergency Feedwater |
| Regulatory Functions | HPCI | High Pressure Injection | HPSI | High Pressure Safety Injection |
| Regulatory Functions | RCIC | Reactor Core Isolation Cooling | --- | --- |
| Regulatory Functions | LPCI | Low Pressure Coolant Injection | LPSI | Low Pressure Safety Injection |
| Regulatory Functions | CS | Core Spray | --- | --- |
| Regulatory Functions | RHR | Residual Heat Removal | RHR | Residual Heat Removal |
| Regulatory Functions | DG | Diesel Generators | EDG | Diesel Generators |
| Regulatory Functions | DGFO | Diesel Generator Fuel Oil | FO | DG Fuel Oil |
| Regulatory Functions | PC | Primary Containment | PC | Primary Containment |
| Regulatory Functions | PCIS | Primary Containment Isolation System | CIS | Primary Containment Isolation System |
| Regulatory Functions | --- | --- | ESFAS | Engineered Safety Feature Actuation |
There will be some systems that support generation that will not be modeled in
the PRA either as an initiating event or in support of a safety function (e.g.,
turbine and generator systems, feedwater heating, and reactor recirculation
systems). For these systems, the top down approach would require a method
other than use of existing fault trees. Alternatives include developing a list of Tag
IDs for each system beginning with Piping & Instrumentation Diagrams
(P&ID), equipment lists and the assistance of system engineers. Such a process
may already have been undertaken as a part of implementation of AP-913
(Reference 37).
For the purpose of the failure analysis, identifying more than just the critical
components (coming out of AP-913) would be necessary in supporting
subsequent steps of the digital system failure analysis, because a somewhat
simplified consideration of single point vulnerabilities may not provide a
complete list of components that would be used in defining digital system failure
modes. Again, the purpose of a list of generation-related components for each
support system, whether developed by fault tree or simply producing a table, is to
identify Tag IDs, Failure Modes, and a Description of how the components
might contribute to failures of the supporting systems (Table 5-4) to perform
their functions and the effects of those failures on the frontline systems
(Table 5-2) in terms of their ability to perform their functions.
Top Down Step 4: Relate Actuated Component Failure Modes to Digital System
Failure Modes
This step of the top down process examines the interface between the digital
system and the mechanical and electrical components that it controls or actuates.
The preceding steps define the failure modes for these mechanical and electrical
components in support of plant operation and their response to transients or
accidents. The following is a relatively simple approach to translating these
failure modes to digital I&C failure modes at the system level.
At the system level, a few digital I&C failure modes may be all that is necessary
to identify when implementing a top down, focused failure analysis. For example:
No signal when one is needed
A delayed signal subsequent to when it is needed
A signal when one is not needed
A protective trip signal at an inappropriate time
Control signal too high
Control signal too low
Rate of change of control signal inappropriate, given plant process rate of
change
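The system level failure modes listed above can be captured as a small enumeration and related to component failure modes, which is one way to produce rows for a basis table. The sketch below is illustrative only: the screening rules, the Tag IDs, and the row layout are hypothetical and are not taken from Table 5-5.

```python
from enum import Enum


class DigitalFailureMode(Enum):
    NO_SIGNAL = "No signal when one is needed"
    DELAYED_SIGNAL = "A delayed signal subsequent to when it is needed"
    SPURIOUS_SIGNAL = "A signal when one is not needed"
    INAPPROPRIATE_TRIP = "A protective trip signal at an inappropriate time"
    SIGNAL_TOO_HIGH = "Control signal too high"
    SIGNAL_TOO_LOW = "Control signal too low"
    INAPPROPRIATE_RATE = "Rate of change of control signal inappropriate"


# Hypothetical screening rules: which digital failure modes could plausibly
# cause a given component failure mode. An analyst would define the real ones.
SCREENING_RULES = {
    "fails to start": [DigitalFailureMode.NO_SIGNAL,
                       DigitalFailureMode.DELAYED_SIGNAL],
    "spurious start": [DigitalFailureMode.SPURIOUS_SIGNAL],
    "fails to throttle": [DigitalFailureMode.SIGNAL_TOO_HIGH,
                          DigitalFailureMode.SIGNAL_TOO_LOW,
                          DigitalFailureMode.INAPPROPRIATE_RATE],
}


def basis_rows(components):
    """Return (tag_id, component failure mode, digital failure mode) tuples."""
    rows = []
    for tag_id, component_mode in components:
        for mode in SCREENING_RULES.get(component_mode, []):
            rows.append((tag_id, component_mode, mode.value))
    return rows


# Hypothetical Tag IDs, for illustration only.
for row in basis_rows([("P-101", "fails to start"),
                       ("V-202", "fails to throttle")]):
    print(row)
```

Component failure modes with no rule produce no rows, which corresponds to digital failure modes that need not be pursued further for that component.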
For the purpose of documenting the basis for selection of a given digital system
failure mode, a simple table should suffice, as provided in Table 5-5:
Table 5-5
Formatting the Basis for Selection of Digital System Failure Modes
Top Down Step 5: Make a Decision (Continue or Transition to another Method)
After completing the first 4 Steps of this procedure to identify relevant failure
modes at the digital system level, a decision now needs to be made as to how
much further to continue the Top Down method. Two options are available:
Option 1: Stop Here and Transition to One of the Other Hazard Analysis
Methods
Those results of the other methods described in this guideline that did not
contribute to any of the digital system failure modes identified by the Top Down
method could be set aside, leaving only the digital system failure modes that are
relevant to the overall plant design. Even if the analysis of a digital I&C system
using one of the other methods is in progress, as the results become available they
can be compared to the failure modes coming out of the Top Down method.
For those failure modes that are relevant, the designer can investigate additional
methods for preventing, reducing the potential for, or coping with those
failure modes. For those failure modes which are not relevant to the mechanical
and electrical equipment being controlled by the digital system, further effort to
address those failure modes can be reduced or eliminated altogether.
Option 2: Continue Pursuing the Top Down Method into the Digital System
Itself
This Option may be useful if the results from other methods are not yet available
for the digital system, or if an investigation of the impact of combinations of
failures within the digital system is of interest. The circulating water system
illustrated in Figure 4-7 and Figure 4-8 is examined in Example 4-5 using a
fault tree to identify vulnerabilities in a digital system design that involve more
than single failures. Step 6 summarizes this approach and provides
recommendations regarding an appropriate level of detail if this Option is
selected.
Top Down Step 6: Extend the Top Down Method to the Digital I&C System
Even if the digital system being analyzed plays a significant role in a
plant-level function, EPRI 1025278 suggests that the detail in the fault tree logic for a
digital system should be developed no lower than the computing unit level within
the system. The computing unit level would consist of major components of the
system such as sensors, function controllers, communication processors and
voting logic. Having developed fault tree logic to this level, the remainder of the
failure analysis from a top down perspective could be completed using a glass box
approach (i.e., going deeper into the digital system).
It should be kept in mind that at any time during the top down development of a
fault tree for a digital system, results or information produced by other hazard
analysis methods (e.g., FMEA, HAZOP, STPA or PGA), may become
available, thus lessening any interest in further development of the fault tree
model. It only is necessary to develop the fault tree to the point that it provides a
link to the safety or generation functions that the digital system or its
components support within the overall integrated plant design. At any point in
the development of the fault tree, insights regarding the impact of failure of the
parts of the digital system that have been modeled thus far can be summarized
and provided for integration with information that is available from other hazard
analysis methods, particularly those described in this guideline.
For more information and general guidance on the FTA method, see EPRI
1025278 (Reference 38).
The results of the Top Down approach to hazard analysis of digital I&C systems
can be applied to the following activities.
The Top Down method using the fault tree analysis technique is not limited to
design basis events. It can also be used to identify system functions beyond those
for which a system was originally intended, and confirm the system’s ability to
support these functions. In addition, fault tree analysis systematically evaluates
the effect of multiple concurrent failures assisting in the identification of
potential common-cause effects including those that may involve dependencies
on plant conditions and/or locations. An effective fault tree analysis can identify
where diversity is of value or where it is not of value, and provide an engineering
rationale for these decisions based on the overall plant design and its operating
characteristics.
Fault tree analysis allows for the propagation of the effects of postulated failures
throughout the logic model including those due to multiple failures. This
capability permits testing the design of a digital system to confirm that the plant
can continue to operate, or that safety systems function satisfactorily, in the
presence of failures with which the system was intended to cope. Solving the fault
tree for its cut sets also can indicate the extent to which the digital system itself
contributes to power reductions and safety system failures and can identify where
potential vulnerabilities may be in that regard.
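As a rough illustration of what solving a fault tree for its cut sets involves, the sketch below expands AND/OR gates from the top down and discards any cut set that contains a smaller one. The gate names, basic event names, and tree structure are hypothetical, and the simple recursive expansion is a stand-in for the algorithms used in PRA software, not a description of them.

```python
from itertools import product


def minimize(sets):
    """Keep only cut sets that have no proper subset among the others."""
    unique = set(sets)
    return [s for s in unique if not any(other < s for other in unique)]


def cut_sets(node, gates):
    """Return the minimal cut sets for a node in a tree of AND/OR gates."""
    if node not in gates:          # a basic event is its own cut set
        return [frozenset([node])]
    kind, children = gates[node]
    child_sets = [cut_sets(child, gates) for child in children]
    if kind == "OR":               # any child's cut set fails the gate
        result = [cs for sets in child_sets for cs in sets]
    else:                          # AND: one cut set from every child is needed
        result = [frozenset().union(*combo) for combo in product(*child_sets)]
    return minimize(result)


# Hypothetical logic at the computing unit level.
GATES = {
    "LOSS_OF_CONTROL": ("OR", ["SENSOR", "BOTH_CONTROLLERS", "CTRL_A_AND_SENSOR"]),
    "BOTH_CONTROLLERS": ("AND", ["CTRL_A", "CTRL_B"]),
    "CTRL_A_AND_SENSOR": ("AND", ["CTRL_A", "SENSOR"]),
}

for cs in cut_sets("LOSS_OF_CONTROL", GATES):
    print(sorted(cs))
```

In this hypothetical tree the single event SENSOR and the pair CTRL_A with CTRL_B are the minimal cut sets; the CTRL_A with SENSOR combination is dropped because SENSOR alone already causes the top event.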
related degradation. However, it should be kept in mind that this is not true of
software, which neither fails randomly nor has aging related failure
mechanisms.
The following examples were originally developed for EPRI 1022985 (Reference
15). Here they are presented again, this time adjusting the analysis to illustrate
the Top Down method described in this Section. The first example takes
advantage of an existing plant-specific PRA in its application of fault trees (Step
4 of Section 5.2). In the second example, development of fault trees into the
digital system itself is performed (Step 6 of Section 5.2).
Example 5-1. Top Down Analysis of HPCI/RCIC Turbine Controls Using
Fault Trees
Top Down Step 1: Define the I&C Systems to be Analyzed
This example illustrates the use of fault trees to perform a Top Down analysis of the
same HPCI/RCIC turbine control system that was defined in Section 4 and presented
in Examples 4-1 and 4-2. The identified “system” is shown within the “Analysis
Boundary” box of Figure 4-6.
The HPCI/RCIC control system is designed to maintain flow at the required setpoint
when the in-service flow controller is in automatic mode. The flow control system
consists of a flow element, a flow transmitter, the flow controllers, a digital turbine
speed governor, a digital valve positioner, and feedback loops from sensors. The
flow controller applies a Proportional-Integral-Derivative (PID) control algorithm that
adjusts the speed demand output of the controller to compensate for any errors
between the flow setpoint and the actual flow signal provided by a flow transmitter
downstream of the HPCI/RCIC pump. The digital turbine speed governor, working
with the digital valve positioner, automatically adjusts the position of the governor
valve to match the actual speed of the turbine to the speed demanded by the flow
controller.
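The flow control loop described above can be illustrated with a minimal discrete PID sketch. The gains, output limits, units, and time step are hypothetical and chosen only for illustration; this is not the algorithm of the actual HPCI/RCIC flow controller.

```python
class PIDFlowController:
    """Minimal discrete PID: flow error in, clamped speed demand out."""

    def __init__(self, kp, ki, kd, out_min, out_max):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def update(self, flow_setpoint, flow_measured, dt):
        error = flow_setpoint - flow_measured
        self.integral += error * dt
        # No derivative action on the first sample (no previous error yet).
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        demand = (self.kp * error
                  + self.ki * self.integral
                  + self.kd * derivative)
        # Clamp the speed demand to the assumed output range.
        return min(max(demand, self.out_min), self.out_max)


# Hypothetical tuning and signal ranges (percent of span).
controller = PIDFlowController(kp=2.0, ki=0.5, kd=0.0,
                               out_min=0.0, out_max=100.0)
speed_demand = controller.update(flow_setpoint=50.0, flow_measured=40.0, dt=1.0)
print(speed_demand)
```

A speed demand pinned at either limit of the output range is one way the "control signal too high" and "control signal too low" failure modes discussed in this Section could appear at the controller output.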
Top Down Step 2: Define Plant Level Functions & Develop System Level Fault Tree
Logic
Figure 5-1, which meets the prerequisite for a Function Analysis (FA), provides a top
down view of basic high level safety functions for a BWR 4 broken down into plant
level safety functions and eventually identifying the systems which provide support
for these plant level functions.
On the first page of Figure 5-1, three high level basic safety functions are
considered:
Primary coolant system integrity
Shut down the reactor and maintain safe shutdown
Limit releases to the environment
The three basic safety functions can be broken down further into what will be
described as plant level safety functions. The first page of Figure 5-1 also identifies
what may be considered as plant level safety functions for a BWR. Plant level safety
functions can be related to those functions accomplished by the plant emergency
operating procedures (EOP) and/or modeled in the plant specific probabilistic risk
assessment (PRA):
Primary coolant system integrity
− Primary coolant piping
− Primary coolant overpressure protection
− Primary coolant loss through interfacing systems
o Systems inside containment
o Systems outside containment
Shut down the reactor and maintain safe shutdown
− Reactivity control (subcriticality)
− Reactor coolant inventory control
o High pressure inventory control
o Low pressure inventory control
Limit releases to the environment
− Primary containment control
o Containment isolation
o Containment pressure control
o Containment temperature control
− Secondary containment control
Beneath each of these plant level functions in Figure 5-1 are listed the plant systems
that support these functions for a typical BWR (a BWR 4). Considering that the focus
of this top down review is on HPCI and RCIC, it should be noted that HPCI and
RCIC components play a role in all three basic safety functions. In addition to their
obvious reactor inventory control function at high reactor pressure, HPCI and RCIC
steam line isolation valves play a primary coolant system integrity function and a
containment isolation function. A review of the PRA reveals fault tree logic for HPCI
and RCIC that supports all three of these functions.
The first section of Figure 5-3 provides a listing of three basic generation functions
which, in turn, are broken down into plant level generation functions:
Reactor
− Reactivity control (maintain reactor power level)
− Reactor inventory control
Turbine
− Flow of steam to turbine
− Condenser operation
Generator
− Conversion of steam energy to power
Functions that support the systems that provide plant level generation functions are
also summarized in Table 5-4:
Control Power/Pneumatic supply
Equipment cooling
Lubrication
HVAC
Auxiliary functions are also shown in Table 5-4 that, if lost, may not directly affect
any of the primary or supporting generation related functions but eventually could
lead to a manual shutdown. These auxiliary functions generally are related to
maintaining reactor and fuel conditions.
Finally, a regulatory function is shown that is related to the operability of plant safety
systems. Again, these do not affect the ability to generate power directly, but reflect
limiting conditions for operation as found in the Technical Specifications.
A review of the plant specific PRA did not identify any explicit contribution to reactor
trip resulting from HPCI or RCIC components. If there is a contribution, it likely is
rolled up in the data that supports selection of initiating events and their frequencies.
For this reason, a more qualitative assessment is performed to document any
potential impact of HPCI and RCIC on reactor power operation. This qualitative
assessment is shown in Table 5-6 (note that the shaded row identifies those
component functions that may be affected by the digital upgrade defined in Step 1
and Figure 4-6). From Table 5-8, it can be seen that HPCI and RCIC systems may
impact reactor power control and reactor inventory makeup functions through the
spurious operation of either system. Also, as they are systems that are governed by
Technical Specification requirements, HPCI and RCIC can have an impact on
generation for regulatory reasons.
Top Down Step 3: Identify Actuated/Controlled Components and their Failure
Modes
At this point, development of detailed system level fault tree logic would highlight
what components within the system would support each function and what failure
modes are associated with each of these components. However, detailed fault tree
models already may be available in the plant specific PRA. It may be possible to
take advantage of these existing fault tree models to identify key components and
their failure modes that are controlled by the digital I&C.
From a safety perspective, Table 5-6 lists the major components (by Tag ID) within
the HPCI and RCIC systems supporting the primary coolant makeup function that
also are actuated or controlled by I&C equipment. In addition, the table includes the
failure modes associated with these components in terms of their potential adverse
effects on the ability of the systems to makeup to the reactor. Note that the shaded
cells in Table 5-6 are those components that are affected by the digital upgrade
defined in Step 1 (see Figure 4-6).
The shaded cells in Table 5-8 identify the impact that the HPCI/RCIC systems can
have on generation. Note that there are several HPCI system failure modes that
could lead to a plant trip or shutdown, but RCIC impact on generation is limited to
regulatory availability. Table 5-7 lists the Tag ID and Failure Modes for HPCI and
RCIC components that could lead to the loss of generating functions identified in
Table 5-8. As noted earlier, the PRA does not model the contribution of HPCI and
RCIC to reactor trips or shutdowns explicitly and, therefore, there are no basic
events in the fault trees that represent component failure modes that lead to
generation losses.
Top Down Step 4: Relate Actuated Component Failure Modes to Digital System
Failure Modes
The shaded row in Table 5-8 shows that the only HPCI/RCIC components affected
by the upgrade of the I&C to digital systems are the governor valves themselves.
Important governor valve failure modes are as follows:
Failure to open the governor valve sufficiently to provide inventory makeup to
the reactor at rates necessary to maintain reactor level.
− For HPCI, during events involving stuck open safety valve or LOCAs (small
or medium), this would be a flow rate roughly equivalent to the rate at
which reactor inventory is leaving the primary coolant system through the
breach in the primary coolant system. For RCIC, during events involving a
stuck open SRV, this would mean a flow rate less than the design basis for
the system (several hundred gpm).
− For HPCI and RCIC, during non-LOCA transients, this would be a flow rate
roughly equivalent to decay heat inventory losses.
Failure to throttle the governor valve sufficiently to prevent a turbine trip due to
overspeed.
− For RCIC, an inadvertent overspeed trip would likely disable the system for
all of its functions.
− For HPCI, an overspeed trip may affect its capability to provide adequate
makeup for the largest LOCAs (e.g., medium LOCA or the upper end of the
small LOCA range). Due to its relatively large flow rate, inadvertent
overspeed trip of HPCI is not likely to inhibit its ability to provide adequate
makeup for decay heat.
Given these governor valve failure modes, the failure modes of the digital control
system for the governor valve are listed in Table 5-7:
Control signal too low (which would result in too much throttling and insufficient
flow)
Control signal too high (which could result in a possible overspeed trip of the
turbine and ultimately insufficient flow)
Table 5-9 lists the results of assessing HPCI/RCIC components for their impact on
generation. While there are several components and their failure modes that could
lead to a plant trip or shutdown, given that the only component impacted by the
digital I&C under review in this example is the governor valve, this table indicates
that there are no effects on generation associated with the HPCI/RCIC turbine
control system.
Top Down Step 5: Make a Decision (Continue or Transition to another Method)
At this point in the analysis, a decision should be made as to whether detailed fault
tree modeling of the turbine control system would be of value. Given the limited
scope of the functions associated with the HPCI/RCIC turbine control system, an
appropriate approach would be to simply transfer the results of the Top Down
analysis to those responsible for the design of the turbine control system and suggest
that a Design FMEA (per Section 4.4 of this guidance) focus on identifying and
addressing digital speed control system failure modes that could lead to a control
signal that is too high or too low. Therefore, the Top Down method applied to this
example ends at Step 5.
Table 5-6
HPCI & RCIC Components Controlled by I&C Equipment (Safety Functions)
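Table 5-7
HPCI and RCIC Digital System Failure Modes

| System | Tag ID | Failure Mode | Digital System Failure Mode(s) | Safety Function(s) |
|---|---|---|---|---|
| HPCI | HO-008 | Fail to remain open and throttle | Control signal too low (which would result in too much throttling and insufficient flow); Control signal too high (which could result in a possible overspeed trip of the turbine and ultimately insufficient flow) | Reactor Inventory Control |
| RCIC | HO-009 | Fail to remain open and throttle | Control signal too low (which would result in too much throttling and insufficient flow); Control signal too high (which could result in a possible overspeed trip of the turbine and ultimately insufficient flow) | Reactor Inventory Control |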
Table 5-8
HPCI/RCIC Generation Functions

| Type of Function | Function | BWR System Designator | Description | HPCI or RCIC Controls |
|---|---|---|---|---|
| Primary Functions | Reactor Inventory Makeup/Heat Removal | RR | Reactor Recirculation | |
| Primary Functions | Reactor Inventory Makeup/Heat Removal | RF | Reactor Feedwater | |
| Primary Functions | Reactor Inventory Makeup/Heat Removal | RFC | Reactor feed control | Spurious operation of HPCI requires runback of feedwater flow to prevent high reactor level trip of feedwater pumps |
| Primary Functions | Reactor Inventory Makeup/Heat Removal | MC | Main Condensate | |
| Primary Functions | Reactor Inventory Makeup/Heat Removal | CM | Condensate Makeup | |
| Primary Functions | Flow of Steam to Turbine | TGC | Turbine Electro-Hydraulic Controls | |
| Primary Functions | Flow of Steam to Turbine | MS | Main Steam | |
| Primary Functions | Condenser Operation | AR | Air removal | |
| Primary Functions | Condenser Operation | OG | Offgas | |
| Primary Functions | Condenser Operation | AOG | Augmented Offgas | |
| Primary Functions | Condenser Operation | CW | Circulating Water | |
| Primary Functions | Condenser Operation | CD | Condensate Drains | |
| Primary Functions | Condenser Operation | ES | Extraction Steam | |
| | Conversion of Steam Energy to Power | TG | Turbine Generator | |
| | Conversion of Steam Energy to Power | TGI | Turbine Generator Supervisory Instrumentation | |
| | Motive Power | EE | Electrical Equipment | |
| | | EE | Instrument AC | |
| | | DC | DC Power | |
| | Lubrication | LOGT | Turbine Lube Oil (I&C) | |
| | Lubrication | LO | Turbine Lube Oil (Mech.) | |
| | Lubrication | RFLO | Reactor Feed Lube Oil | |
| | Lubrication | RRLO | Reactor Recirculation Lube Oil | |
| | HVAC | HV | Reactor Bldg. HVAC | |
| | Seals | | | |
| Regulatory Functions | | RHR | Residual Heat Removal | |
| Regulatory Functions | | HPCI | High Pressure Injection | To the extent that HPCI controls are inoperable for an extended period, a plant shutdown could result. (HPCI LCO is 14 days) |
| Regulatory Functions | | RCIC | Reactor Core Isolation Cooling | To the extent that RCIC controls are inoperable for an extended period, a plant shutdown could result. (RCIC LCO is 14 days) |
| Regulatory Functions | | CS | Core Spray | |
| Regulatory Functions | | DG | Diesel Generators | |
| Regulatory Functions | | DGFO | Diesel Generator Fuel Oil | |
| Regulatory Functions | | PC | Primary Containment | |
| Regulatory Functions | | PCIS | Primary Containment Isolation System | |
Example 5-2. Top Down Analysis of CWS Controls Using Fault Trees
Top Down Step 1: Define the I&C Systems to be Analyzed
This example illustrates the use of fault trees to perform a Top Down analysis of the
same control system for circulating water that was defined in Example 4-3 of Section
4 in the application of the DFMEA method. The circulating water system consists of
six 25% capacity pumps distributed in two divisions. During normal operations at
100% power, two pumps are running in each division, with one pump on standby in
each division; four running pumps are necessary for operation of the plant at full
power. The CWS controls are shown within the “Analysis Boundary” box of
Figure 4-7.
The basic design of the circulating water control system includes two sets of logic
cabinets, two sets of I/O cabinets and a set of HSI workstations. All of the cabinets
and workstations are connected to redundant data communication busses (Comm 1
and Comm 2).
I/O Cabinet A contains digital input modules that monitor the position of the 4KV
breakers that provide power to the motors for three of the circulating water pumps
and digital output modules that position their associated discharge valves (open or
closed). Likewise, I/O cabinet B provides the same functions for the remaining three
pumps and discharge valves.
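The pump success criterion described above (four of six 25% capacity pumps needed for full power) can be checked by brute force enumeration. The pump labels below are hypothetical, and the sketch assumes that a standby pump starts whenever it is needed and that pumps A1 to A3 are the three served by I/O Cabinet A; neither assumption is stated in the system description.

```python
from itertools import combinations

# Hypothetical pump labels: three pumps per division.
PUMPS = ["A1", "A2", "A3", "B1", "B2", "B3"]
PUMPS_REQUIRED_FOR_FULL_POWER = 4


def full_power_available(failed):
    """True if enough pumps remain available to support 100% power."""
    available = set(PUMPS) - set(failed)
    return len(available) >= PUMPS_REQUIRED_FOR_FULL_POWER


def minimal_failure_sets():
    """Enumerate the smallest pump failure combinations that preclude full power."""
    found = []
    for size in range(1, len(PUMPS) + 1):
        for combo in combinations(PUMPS, size):
            candidate = frozenset(combo)
            already_covered = any(m < candidate for m in found)
            if not full_power_available(candidate) and not already_covered:
                found.append(candidate)
    return found


failure_sets = minimal_failure_sets()
print(len(failure_sets))
print(frozenset({"A1", "A2", "A3"}) in failure_sets)
```

Under these assumptions every minimal failure set contains three pumps, so no single pump failure prevents full power operation, while a failure that disables all three pumps behind one I/O cabinet would.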
Top Down Step 2: Define Plant Level Functions & Develop System Level Fault Tree
Logic
Figure 5-2, which meets the prerequisite for a Function Analysis (per Section 3.6),
provides a top down view of basic high level safety functions for a PWR broken
down into plant level safety functions and eventually identifying the systems which
provide support for these plant level functions.
On the first page of Figure 5-2, three high level basic safety functions are
considered:
Primary coolant system integrity
Shut down the reactor and maintain safe shutdown
Limit releases to the environment
The three basic safety functions can be broken down further into what will be
described as plant level safety functions. The first page of Figure 5-2 identifies what
may be considered plant level safety functions for a typical PWR. Plant level safety
functions can be related to those functions accomplished by the plant emergency
operating procedures (EOP) and/or modeled in the plant-specific probabilistic risk
assessment (PRA).
Primary coolant system integrity
− Primary coolant piping
− Primary coolant overpressure protection
− Primary coolant loss through interfacing systems
o Systems inside containment
o Systems outside containment
Shut down the reactor and maintain safe shutdown
− Reactivity control (subcriticality)
− Secondary heat removal
− Reactor coolant inventory control
o High pressure inventory control
o Low pressure inventory control
Limit releases to the environment
− Primary containment control
o Containment isolation
o Containment pressure control
o Containment temperature control
− Secondary containment control
Beneath each of the plant level functions in Figure 5-2, plant systems that support
these functions for a typical PWR are listed. The focus of this Top Down analysis is on
circulating water, but it is not considered to be a frontline system in the PRA and does
not appear in Figure 5-2. However, review of the fault tree logic and dependency
matrices for the frontline systems shown in Figure 5-2 shows that the main condenser,
which is supported by circulating water, ultimately provides support to two plant level
safety functions:
Reactor inventory control – through the operation of turbine driven feedwater
pumps which require a condenser vacuum
Secondary heat removal – through the maintenance of CST inventory (e.g.,
avoiding the need to makeup to the CSTs from systems such as demineralized
water or fire protection in order to maintain an adequate long term AFW pump
suction source)
The first section of Figure 5-4 provides a listing of three basic generation functions
which, in turn, are broken down into plant level generation functions:
Reactor
− Reactivity control (maintain reactor power level)
− Reactor inventory control
Turbine
− Flow of steam to turbine
− Condenser operation
− Steam generator inventory control
Generator
− Conversion of steam energy to power
Functions that support the systems that provide plant level generation functions are
also summarized in Figure 5-4:
Control power/Pneumatic supply
Equipment cooling
Lubrication
HVAC
Auxiliary functions are also shown that, if lost, may not directly affect any of the
primary or supporting generation related functions but eventually could lead to a
manual shutdown. These auxiliary functions generally are related to maintaining
reactor and fuel conditions.
Finally, a regulatory function is shown that is related to the operability of plant safety
systems. Again, these do not affect the ability to generate power directly, but reflect
limiting conditions for operation as found in the Technical Specifications.
From Figure 5-4, as expected, it can be seen that circulating water impacts the
condenser operation as a frontline system. While the PRA for this plant does not have
initiating event fault trees, a review of the fault trees used to perform accident
sequence quantification also identifies the main condenser and, hence circulating
water, as support systems for operation of the turbine driven feedwater pumps. The
success criteria for the circulating water system differ in its support of power
generation vs. post-trip decay heat removal (fewer trains are needed post-trip).
However, individual components needed for circulating water to support these plant
level functions and their failure modes are the same for either function.
Top Down Step 3: Identify Actuated/Controlled Components and their Failure
Modes
Because the circulating water system is modeled in the PRA for this plant,
simply listing the major components that are controlled by I&C and their failure
modes as modeled in the PRA may be all that is necessary to complete this step.
Table 5-10 lists the CWS Tag IDs and Failure Modes for components in the
circulating water system that are actuated by the I&C equipment described in this
example. Table 5-10 also lists the PRA Basic Events representing these Components
and their Failure Modes, the normal state of these Components, and the state
required for each Component to support its required function.
The circulating water system is not modeled in all PRAs. In this situation, the top
down approach would require development of a simple fault tree for this system.
Figures D-1 through D-3 (see Appendix D) provide such a fault tree using the success
criteria for the circulating water system in support of full power operation.
Development of two fault trees is considered. Both fault tree models assume four of
the six circulating water pumps must be in service to support full power operation (or,
conversely, failure of three of the six pumps is assumed to result in high condenser
pressure, i.e., loss of vacuum):
1. System response to tripping of an operating CWS pump
2. Operation of CWS Components when not called upon to operate (e.g., spurious
closure of a pump discharge valve)
The important components and their failure modes include the Tag IDs and Failure
Modes identified in Table 5-10.
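The four-of-six success criterion can be checked mechanically. The following is a minimal Python sketch that considers pump train hardware only (no control system failures and no operator action). The pump labels P1 through P6 follow the basic event names in Table 5-10; the function names are illustrative assumptions.

```python
from itertools import combinations

PUMPS = ["P1", "P2", "P3", "P4", "P5", "P6"]
REQUIRED_IN_SERVICE = 4  # four of six pumps, per the success criterion above


def supports_full_power(failed_pumps):
    """True if enough pump trains remain to meet the four-of-six criterion."""
    available = len(PUMPS) - len(set(failed_pumps))
    return available >= REQUIRED_IN_SERVICE


def failing_combinations(order):
    """All combinations of `order` failed pump trains that defeat the criterion."""
    return [c for c in combinations(PUMPS, order) if not supports_full_power(c)]


if __name__ == "__main__":
    for order in (1, 2, 3):
        print(order, len(failing_combinations(order)))
```

Under these assumptions no single or double pump train failure defeats the criterion, and every combination of three does, which is consistent with the statement that failure of three of the six pumps is assumed to fail the function.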
Top Down Step 4: Relate Actuated Component Failure Modes to Digital System
Failure Modes
Given the CWS Components and their Failure Modes identified in Step 3, it is
relatively easy to develop a list of failure modes for the digital I&C equipment at the
system level. Table 5-11 provides the results, which are summarized below:
No control signal to isolate a valve (on loss of a pump)
No control signal to open a valve and start a pump (on operator action to initiate
this signal)
A control signal when one is not needed (spurious closure of a pump discharge
valve)
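The relationship between component failure modes and digital system failure modes is a lookup, and can be sketched as one. The Python example below transcribes the relationships from Table 5-11. The key layout (component type, failure mode) and the function name are assumptions made for illustration.

```python
# Component failure mode -> digital system failure mode, as listed in Table 5-11.
# The (component type, failure mode) key layout is an illustrative assumption.
DIGITAL_FAILURE_MODES = {
    ("discharge valve", "fail to remain open"):
        "control signal when one is not needed",
    ("discharge valve", "fail to close"):
        "no control signal to isolate a valve",
    ("discharge valve", "fail to open"):
        "no control signal to open a valve and start a pump",
    ("circuit breaker", "fail to close"):
        "no control signal to open a valve and start a pump",
}


def components_affected_by(digital_mode):
    """List the component failure modes that one digital failure mode can cause."""
    return sorted(k for k, v in DIGITAL_FAILURE_MODES.items() if v == digital_mode)


if __name__ == "__main__":
    for mode in sorted(set(DIGITAL_FAILURE_MODES.values())):
        print(mode, "->", components_affected_by(mode))
```

Reading the table in this direction shows that a single digital failure mode (no signal to open a valve and start a pump) maps to failure modes of two different component types.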
Top Down Step 5: Make a Decision (Continue or Transition to another Method)
Having identified the key digital system failure modes in Step 4, the results could be
turned over to the designer for use as input to a Design FMEA at this point. However,
it is assumed for this example that there is a need to confirm that the success criteria
for the digital control system is consistent with the overall design of the circulating
water system. Further development of the fault tree logic (to include portions of the
digital control system itself) is provided in Step 6.
Top Down Step 6: Extend the Top Down Method to the Digital I&C System
Trip and/or isolation of a single circulating water pump is assumed to leave the
system with insufficient capacity to support full power operation. However, the
increase in condenser pressure (loss of vacuum) in response to a reduction in circulating water flow is
gradual, allowing time for the operators to open the discharge isolation valve and
start one or both of the idle circulating water pumps. The time available for the
operators to initiate this action and avoid a plant trip is several minutes.
Given this context, the plant control system impacts the circulating water system in
one of three ways:
Support normal operation of the system by allowing operators to monitor system
performance and realign the system for the purpose of rotating equipment, etc.
Response to the trip of a circulating water pump by automatically isolating the
affected pump (this prevents reverse flow through the tripped pump and an even
greater reduction in flow through the condenser than from just the loss of the pump)
and support for operator action to start and un-isolate one of the idle circulating
water pumps.
Spurious actuation of circulating water equipment when not called upon to operate
(e.g., spurious closure of the circulating water pump discharge isolation valve).
The focus of the top down analysis is on the last two of these three functions. The top
down analysis takes the form of fault trees similar to those used in a nuclear power
plant PRA, but without the same level of detail and without requiring failure rates for
quantification.
Appendix D contains the circulating water system related fault trees used for the top
down evaluation of the plant control system shown in Figure 4-7. Figures D-1a
through D-1c define the system in support of maintaining plant operation should a
circulating water pump trip occur. Figures D-2a and D-2b present a top down review
of the system with respect to the potential for the system to lead to a spurious plant
trip.
Results
The top logic in Appendix D was used to identify the combinations of failures (i.e.,
cut sets) that must occur to lead to the inability of the circulating water system to
support full power operation. Table D-1 presents the dominant contributors to failure
of the system to perform its function. As expected, analysis using this top down logic
confirms that no single component failure leads to the loss of the ability of the
circulating water system to provide an adequate heat sink in support of full power
operation. The bulk of the combinations of failures that must occur to lose adequate
circulating water flow consist of three or more components and their failure modes
(i.e., pumps fail to run, breakers fail to remain closed, discharge MOVs fail to remain
open, in combinations of three).
The number of failures required for the circulating water system not to be able to
perform its heat sink function is not unexpected given that four pumps are required to
support plant operation while there are two standby spare pump trains available.
However, there are approximately twenty cut sets that consist of only pairs of
components and their failure modes that can lead to failure of the circulating water
system. Many of these twenty pairs include components from the plant control system.
These combinations of failures can be found in Table D-1 and are highlighted in
Figure D-4.
Eight combinations of failures consist entirely of pairs of communication module
failures. These pairs of failures come from the spurious actuation top logic. Total loss
of communications for an entire division of circulating water can occur if both
communication modules in the redundant communication loops in that division
fail. This leads to no input to the digital output modules for that division. Under
these conditions, the discharge isolation valves for all three pumps in the affected
division close leaving only the three pumps in the unaffected division. As the plant
requires four circulating water pumps to support full power operation, loss of the
pairs of communications modules results in insufficient circulating water pump flow.
Four of the remaining cut sets consisting of pairs of failures include a digital output
module failure combined with failure of the operators to initiate the standby trains of
circulating water in time to avoid a low condenser vacuum trip. These failures also
come from the spurious actuation top logic. Loss of a single digital output module
results in a false isolation signal to the discharge isolation MOV in the affected pump
train. As only three pumps are now providing circulating water flow, starting of one
of the standby trains is required. Failure of the operators to initiate one of the standby
trains in time results in the circulating water flow not being able to support full power
operation.
Other plant control system components (digital input modules, the master controller,
slave controller and operator workstations) appear with hardware and I&C failures in
combinations of three or more. That these components require multiple additional
failures before they can lead to conditions in which the plant cannot operate at full
power reflects the fact that there are two spare circulating water pump trains and the
operators can initiate the standby trains to mitigate loss of these components.
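The way these cut sets arise can be illustrated with a brute-force search over a much simplified model. The Python sketch below is not the Appendix D fault tree. It assumes two divisions of three pumps, two running pumps and one idle spare per division, one pair of communication modules per division, one digital output module per pump train, and a single operator action that places both spares in service. All event names (`COMM-A1`, `DO-P1`, `PUMP-P1`, `OPERATOR`) are hypothetical. Because of these simplifications, the sketch does not reproduce the cut set counts reported in Table D-1.

```python
from itertools import combinations

# Assumed lineup for illustration: two divisions of three pumps, two pumps
# running and one idle spare in each division.
DIVISIONS = {"A": ("P1", "P2", "P3"), "B": ("P4", "P5", "P6")}
STANDBY = {"P3", "P6"}
REQUIRED = 4  # four of six pumps needed for full power operation

BASIC_EVENTS = (
    [f"COMM-{d}{n}" for d in DIVISIONS for n in (1, 2)]
    + [f"DO-{p}" for pumps in DIVISIONS.values() for p in pumps]
    + [f"PUMP-{p}" for pumps in DIVISIONS.values() for p in pumps]
    + ["OPERATOR"]
)


def pumps_delivering(failed):
    """Count pump trains delivering flow once the operators have responded."""
    count = 0
    for div, pumps in DIVISIONS.items():
        if {f"COMM-{div}1", f"COMM-{div}2"} <= failed:
            continue  # both communication modules lost: the division isolates
        for p in pumps:
            if f"DO-{p}" in failed or f"PUMP-{p}" in failed:
                continue  # spurious isolation or pump hardware failure
            if p in STANDBY and "OPERATOR" in failed:
                continue  # idle spare is never placed in service
            count += 1
    return count


def minimal_cut_sets(max_order):
    """Brute-force search for minimal cut sets up to the given order."""
    found = []
    for order in range(1, max_order + 1):
        for combo in combinations(BASIC_EVENTS, order):
            events = frozenset(combo)
            if any(cs <= events for cs in found):
                continue  # contains a smaller cut set, so it is not minimal
            if pumps_delivering(events) < REQUIRED:
                found.append(events)
    return found


if __name__ == "__main__":
    for cs in minimal_cut_sets(2):
        print(sorted(cs))
```

Even in this reduced model the same pattern appears: no single failure defeats the system, the communication module pair in a division is a cut set on its own, and a digital output module failure on a running train becomes a cut set only when combined with failure of the operator action.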
What it Means
It would seem unusual to have a circulating water system design that from a hydraulic
and mechanical standpoint essentially is designed to accommodate multiple failures,
yet is potentially vulnerable to pairs of failures in the control system. The reasons lie
in several places:
Circulating water success criterion
While there are two divisions of circulating water, each apparently with a standby
spare pump train, it is necessary to have pumps from both divisions in service in order
to support full power operation (four of six pumps). Combinations of component
failure that lead to loss of a single division of circulating water result in insufficient
flow to avoid high condenser pressure. There are pairs of control system
components (communication units, in particular) that can lead to loss of an entire
division of circulating water.
Control system component failure modes
Failure modes of selected individual components in the control system result in the
loss of individual pump trains. For example, the digital output modules revert
to their shelf state when an input signal is not available. This, in turn, generates an
isolation signal to the discharge valve in the affected pump train.
A final insight coming out of the top down approach takes the form of a qualitative
ranking of the importance of various control system components, particularly relative
to one another. While no one component is critical to generation, a subset of control
system components is relatively important in supporting adequate circulating water
flow. These components include communications modules and digital output devices.
Absent design changes, these components would be those for which it would be
desirable to ensure their dependability from a design perspective and provide a high
degree of reliability from a maintenance perspective. Other control system
components (master and slave controllers, input modules, workstations) have less
impact on system operation: multiple and diverse component failures must occur in
addition to a failure of one of these components before the system cannot perform its
function, and their failure alone is not likely to trigger a plant transient.
Table 5-10
CWS Components Controlled by I&C Equipment (Safety & Generation)

CWS Tag ID | Failure Modes | PRA Basic Events | Normal Config. | Config Req'd to Support Function | Auto | Comment
Circuit Breaker CB-01 through CB-06 | Fail to remain closed | CWS-CBCO-CB-01 through CWS-CBCO-CB-06 | Closed | Closed | Pump protection | Trip on overcurrent, low voltage, manual
Circuit Breaker CB-01 through CB-06 | Fail to close | CWS-CBOO-CB-01 through CWS-CBOO-CB-06 | Open | Closed | Discharge valve position starts pump | Partial opening of valve required before breaker closes to prevent deadhead of pump
Pump Discharge Valve MO-01 through MO-06 | Fail to remain open | CWS-MVOC-MO-01 through CWS-MVOC-MO-06 | Open | Open | |
Pump Discharge Valve MO-01 through MO-06 | Fail to close | CWS-MVOO-MO-01 through CWS-MVOO-MO-06 | Open | Closed | Close on opening of pump breaker | Prevent flow diversion through idle pump
Pump Discharge Valve MO-01 through MO-06 | Fail to open | CWS-MVCC-MO-01 through CWS-MVCC-MO-06 | Closed | Open | Manually open | Sufficient time available for manual start on loss of another train before function is lost
Circ Water Pump | Fail to run | CWS-PMFR-P1 through CWS-PMFR-P6 | Run | Run | Discharge valve position closes breaker | Four trains needed to support power operation, one train needed post trip.
Circ Water Pump | Fail to start | CWS-PMFS-P1 through CWS-PMFS-P6 | Idle | Start | |
Table 5-11
CWS Component vs. Digital System Failure Modes

System | Tag ID | CWS Component Failure Mode | Digital System Failure Mode(s) | Safety/Generation Functions
Circulating Water | MO-01 through MO-06 | Fail to remain open | A signal when one is not needed (spurious closure of a pump discharge valve) | Condenser operation; SG inventory control
Circulating Water | MO-01 through MO-06 | Fail to close | No signal to isolate valve (on loss of a pump) | Condenser operation; SG inventory control
Circulating Water | MO-01 through MO-06 | Failure to open | No signal to open a valve and start a pump (on operator action to initiate this signal) | Condenser operation; SG inventory control
Circulating Water | CB-01 through CB-06 | Fail to close | No signal to open a valve and start a pump (on operator action to initiate this signal) | Condenser operation; SG inventory control
5.5 Top Down Strengths
Fault tree analysis provides a view of the role a system plays within the overall
integrated plant design. This integrated perspective even includes events going
beyond the design basis and considers the effects of failures not only within the
digital system but at the level of the systems in which the digital system is
installed, at the plant function level for both safety and generation functions.
Existing logic
Top down analysis methods may be able to take advantage of existing fault tree
logic that has been developed in support of the plant specific PRA. Components
and failure modes that are included in the PRA also may be appropriate for
consideration in evaluating generation related functions.
Focus on failures
The focus of fault tree analysis on failure modes limits the ability of the method
to consider interactions between systems or components that can lead to adverse
behaviors under plant states in which no failures are present. Existing fault tree
logic may be incomplete for evaluating plant conditions in which everything
performed as designed but an unacceptable outcome still occurred.
Complexity of models
Fault tree logic models can be large, difficult to display on a few pages or screens
and require specialized software to present and review. Should development of
new fault trees be needed, the effort can be burdensome if not managed
effectively.
The last two items listed above are limitations given traditional approaches to
development of fault trees. It may be possible to borrow techniques from some of
the other methods to address these limitations.
For example, the HAZOP method described in Section 6 uses guide words to
assess the state of a system or component under review. These guide words are
applied without regard to whether the system or components in question have
succeeded or failed. They then lead to subsequent questions regarding what plant
conditions can lead to the state defined by the guide word, whether or not it
involves successful operation of the component or is a result of its failure. A
similar approach can be taken once Tag IDs and their failure modes have been
identified from the plant specific PRA. That is, ask an additional question as to
what legitimate plant conditions can lead to the system and component being in
the so called ‘failure mode’ modeled in the PRA. If those legitimate conditions
have not been considered explicitly in the accident sequences of the PRA, the
failure analysis can be extended beyond what is included in the fault trees to review
those conditions in a format similar to that used in HAZOPs or, alternately, the
fault tree could be expanded to model the events that lead to those plant
conditions.
The PGA method described in Section 8 has, as one of its steps, the
development of tables that contain Goals and Processes. The objective of this
step is to identify where conflicting or incompatible goals may exist. The
conflicts that may be identified are irrespective of the success or failure of the
systems under review. A similar approach can be taken beginning with the Tag
IDs and Failure Modes coming from the plant specific PRA. Understanding the
function(s) that are being supported by the Tag ID for specific failure modes and
asking whether there are any functional successes that are directly incompatible
may lead to the identification of plant conditions which are not considered in the
PRA but could lead to adverse outcomes even though the systems and
components under review perform as designed.
Section 6: Hazard & Operability Analysis
(HAZOP) Method
Per Reference 12, a HAZOP, or HAZard and OPerability analysis, is a
systematic review of a process (e.g., system design), using “guide words,” to
visualize the ways in which a system can malfunction. The HAZOP analysis
searches for possible deviations from the design intent that can occur in
components, operator or maintenance technician actions, or material elements
(e.g., air, water, steam), and whether the consequences of such deviations can
result in a hazard.
Reference 31 adds:
Safety and reliability in the design of a plant initially relies upon the
application of various codes of practice, or design codes and standards.
These represent the accumulation of knowledge and experience of both
individual experts and the industry as a whole. Such application is
usually backed up by the experience of the engineers involved, who
might well have been previously concerned with the design,
commissioning or operation of similar plant. However, it is considered
that although codes of practice are extremely valuable, it is important to
supplement them with an imaginative anticipation of deviations that
might occur because of, for example, equipment malfunction or
operator error. In addition, most companies will admit to the fact that
for a new plant, design personnel are under pressure to keep the project
on schedule. This pressure always results in errors and oversights. The
Hazop Study is an opportunity to correct these before such changes
become too expensive, or 'impossible' to accomplish.
Reference 33 states:
Table 6-1
Sample HAZOP Worksheet
6.2 HAZOP Procedure
Prerequisite
The results of a Function Analysis, as described in Section 3.6, are a useful input
to the HAZOP analysis because they provide a well-organized set of functions that
can feed into Step 3 of the HAZOP procedure (identify design intentions and
success criteria).
The HAZOP procedure works best when the assessment team is gathered
together in one or more meetings with the purpose of executing the HAZOP
procedure steps described below. When the right people are together and query
each other on potential process deviations and their likely causes in a
cross-disciplinary manner, a more complete assessment will emerge and provide more
opportunities for identifying unwanted and potentially hazardous system
behaviors.
A “process part” in the context of the HAZOP method means that portion of the
plant system or process that is of interest to the analyst. A process part can be a
section of a passive element in a system or process, such as a main steam or
feedwater line, or a tank or vessel, such as a steam generator or main condenser.
A process part can also be an active process element such as a pump or valve. A
process part can also be a high level function in a plant that encompasses multiple
systems or trains.
Figure 6-1 illustrates an example, simplified view of the Balance of Plant (BOP)
systems in a BWR. The piping sections and major components represent “parts”
of the process that can be expressed in terms of process conditions, such as
temperature, flow and level. This example, which is used here to demonstrate the
HAZOP procedure, is based on the EPRI Utility Requirements Document
(Reference 43), Volume II Section 3.4.5, in which advanced reactors are required
to have load rejection capability (to have some capacity to continue operating as
an island on loss of offsite power without a reactor trip). This is a design feature
that was available for some of the first generation of US nuclear power plants.
The example examines an event early in the life of a BWR-1 facility with such a
load rejection capability, where the reactor tripped after experiencing a transient
condition in the BOP systems that were thought to be designed for such
transients (that were not supposed to result in a reactor trip).
Figure 6-1
BWR Balance of Plant
[Diagram: reactor at 100% RTP; turbine control valve (CV) at 100% flow; turbine bypass valve (TBV) at 0% flow; high pressure and low pressure turbines; generator at 100% MWe; condenser; condensate storage tank with makeup and reject lines; condensate pump; LP and HP feedwater heaters; feedwater pump.]
HAZOP Step 3: Determine Design Intention and Success Criteria
This step requires a clear statement of the design intention of the process part
under consideration, and the success criteria (or acceptance criteria) that are used
to demonstrate that the design intention is met. Per Reference 33, the term
“design intent” is defined as:
Continuing with the BWR example, the “design intentions” of the main
condenser (i.e., the part) are to:
a. Condense the exhaust from the low pressure turbines when the reactor and
the main turbine/generator are at 100% power, or
b. Condense up to 95% of the main steam supplied by the reactor when it is at
100% power and the turbine bypass valve is open.
The “success criteria” for this design intention are for condenser pressure to remain
below its high pressure setpoints and hotwell level to remain between the upper
and lower operational limits.
If condenser vacuum degrades past the 22.5 inches Hg setpoint, then a reactor and turbine trip
would be initiated. To avoid this trip, the circulating water system is sized to
accommodate more than 100% rated thermal power.
If the hotwell level increases from the normal operating band to the upper limit,
the reject valve shown in Figure 6-1 will open, and the condensate pump will
dump the excess inventory in the hotwell to the condensate storage tank in order
to protect the condenser and main turbine from an overfill condition. The line
between the condensate pump discharge and the condensate storage tank is sized
to accommodate full condensate flow.
For example, if the makeup valve between the condensate storage tank and the
main condenser were to malfunction and open, when it should be closed, then
hotwell level will increase to the point where the reject valve will open to
compensate for the inadvertent addition of condensate inventory.
Likewise, if the hotwell level decreases from the normal operating band to the
lower limit, the makeup valve will open, thus increasing the hotwell level and
protecting the condensate pump from inadequate suction pressure.
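The hotwell level control behavior described above can be written as a small piece of logic. The following Python sketch assumes simple threshold behavior with no deadband or time delay. The function name and the numerical limits used in the example are hypothetical, and the level is in arbitrary units.

```python
def hotwell_valve_demand(level, lower_limit, upper_limit):
    """Return (makeup_open, reject_open) for a sensed hotwell level.

    Assumes plain threshold logic: makeup opens at or below the lower limit,
    reject opens at or above the upper limit.
    """
    if lower_limit >= upper_limit:
        raise ValueError("lower_limit must be below upper_limit")
    makeup_open = level <= lower_limit
    reject_open = level >= upper_limit
    return makeup_open, reject_open


if __name__ == "__main__":
    # Hypothetical limits of 20 and 80 (arbitrary units).
    for sensed in (10, 50, 85):
        print(sensed, hotwell_valve_demand(sensed, 20, 80))
```

The logic acts on the sensed level, so it responds in the same way to a real change in inventory and to an indicated change in level.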
The next step is to identify the elements or attributes that characterize the
selected process part(s). An “element” is defined, per Reference 33, as:
involved, the activity being carried out, the equipment employed, etc.
Material should be considered in a general sense and includes data,
software, etc.
In the BWR example, the element involved is the water in the hotwell basin of
the condenser.
Table 6-2 provides the “guide words” that are used in the HAZOP procedure to
assess postulated conditions that could be “deviations” from the design intention
identified in Step 3. The underlying idea is to propose each of the guide words in
the context of the design intention and see if the affected process part deviates
from its design intention.
In the BWR example, starting with the “Not” guide word, the following
statement is proposed:
When the turbine bypass valve is open, the reactor is at 100% power,
and the condenser is condensing 95% of the main steam, hotwell level
does not remain within operating limits.
Notice how this statement includes the design intention (condensing 95% of main
steam) and a proposed deviation from the success criteria (within limits) using a
guide word (not).
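Pairing one design intention with every guide word in turn is a mechanical step that can be prepared before the team meets. The Python sketch below generates one discussion prompt per guide word. The guide word list is a commonly cited basic set and is an assumption here; the list in Table 6-2 governs. The function name and prompt wording are also illustrative.

```python
# Assumed basic guide word set for illustration; Table 6-2 governs.
GUIDE_WORDS = ("not", "more", "less", "as well as", "part of", "reverse", "other than")


def deviation_prompts(design_intention, success_criterion, guide_words=GUIDE_WORDS):
    """Pair one design intention with each guide word to prompt team discussion."""
    return [
        f"Given: {design_intention}. Guide word '{gw}' applied to: {success_criterion}?"
        for gw in guide_words
    ]


if __name__ == "__main__":
    prompts = deviation_prompts(
        "condenser is condensing 95% of main steam with the turbine bypass valve open",
        "hotwell level remains within operating limits",
    )
    for prompt in prompts:
        print(prompt)
```

The prompts only seed the discussion. Deciding whether a given combination is a credible deviation remains a judgment for the assessment team.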
Table 6-2
HAZOP Guide Words
HAZOP Step 6: List Possible Causes of Deviations
The next step is to identify and list the possible causes of the deviations identified
in Step 4.
In the BWR example, possible causes of the stated deviation (hotwell level
outside of normal operating limits) could be as follows:
95% turbine bypass flow + inadvertent opening of hotwell makeup valve leads
to high hotwell level
Greater than 95% turbine bypass flow leads to high hotwell level
95% turbine bypass flow + inadvertent opening of hotwell reject valve leads
to low hotwell level
Less than 95% turbine bypass flow leads to low hotwell level
Two-phase conditions in the hotwell basin lead to high hotwell level (i.e.,
swell)
The next step is to evaluate the consequences of the deviations identified in Step
4.
In this example, a high level condition in the hotwell will cause the reject valves
to open, diverting condensate flow to the condensate storage tank. The resulting
effect on the feedwater system is a reduction in feedpump suction pressure,
leading to a feedpump trip, which then causes reactor water level to decrease to
the point of reaching an automatic reactor trip.
In fact, this condition was experienced by the BWR facility that provided the
background for this example. Figure 6-2 illustrates the scenario by the following
sequence of events (using the labels provided in the figure):
A. A Loss of Offsite Power (LOOP) event occurs. By design, the main turbine
control valve (CV) closes to 5% flow, and the turbine bypass valve (TBV)
opens to 95%. The reactor remains in Mode 1 at hot full power, and the
main generator remains connected to house loads, running at 5% power
(MWe).
B. When the turbine bypass valve opens, the condenser experiences pressure and
temperature fluctuations that reach a “flashing” condition resulting in a two-
phase mix in the hotwell basin. At first, condenser pressure increases, but
stays below the turbine exhaust pressure trip setpoint. When the pressure
decreases back to the normal vacuum condition, a phase change begins to
occur in the hotwell, from the liquid to the vapor phase, resulting in the two-
phase mixture.
C. The two-phase mix results in a sensed (i.e., indicated) increase in hotwell
level.
D. A “high level” signal is transmitted to the reject valve, which promptly opens
as designed.
E. Full condensate flow is diverted to the condensate storage tank, as designed.
F. The feedwater pump trips on low suction pressure, as designed.
G. The reactor trips on low water level, as designed.
Notice that all of the components involved in this scenario behaved exactly as
designed, although it may be true that the BOP system design criteria never
considered the possibility of a two-phase condition in the hotwell due to a
temperature/pressure transient in the main condenser.
Figure 6-2
BWR Trip Sequence of Events after LOOP
[Diagram: the Figure 6-1 balance of plant layout annotated with events A through G: (A) LOOP, with CV at 5% flow, TBV at 95% flow, reactor at 100% RTP and generator at 5% MWe; (B) two-phase conditions in the condenser due to the pressure transient; (C) hotwell level increase; (D) reject valve opens on high hotwell level; (E) full condensate pump flow diverted; (F) feedwater pump trips on low suction pressure; (G) reactor trip on low water level.]
The next step is to identify any safeguards (i.e., features, functions, administrative
controls, etc.) that exist that can prevent the deviations from occurring in the first
place.
In the BWR example, there are no existing safeguards to prevent the high
hotwell level deviation; otherwise, the event would not have occurred. A review
of existing safeguards could have led to at least some recognition of the possibility
of the deviation that was experienced.
The HAZOP procedure concludes with a list of action items associated with
each identified deviation. In practice, a HAZOP worksheet like the sample
provided in Table 6-1 captures the results of all 9 steps of the procedure. If any
particular action item meets the criteria for entry into the facility corrective action
program, then one or more condition reports should be initiated and cross-
referenced to the HAZOP worksheet. Section 6.4 provides a worked example
using a suggested HAZOP worksheet format.
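A worksheet row can be captured as a simple record so that open items are easy to find. The Python sketch below assumes one field per result of the procedure steps described in this section. The field names, the `HazopEntry` class and the `needs_follow_up` rule are illustrative assumptions and are not the column layout of Table 6-1.

```python
from dataclasses import dataclass, field


@dataclass
class HazopEntry:
    """One worksheet row; field names are assumed, not taken from Table 6-1."""
    process_part: str
    design_intention: str
    element: str
    guide_word: str
    deviation: str
    causes: list = field(default_factory=list)
    consequences: list = field(default_factory=list)
    safeguards: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    condition_reports: list = field(default_factory=list)

    def needs_follow_up(self):
        """A row with consequences but no safeguards or actions is still open."""
        return bool(self.consequences) and not (self.safeguards or self.actions)


# Row populated from the BWR example in this section.
example_entry = HazopEntry(
    process_part="Main condenser",
    design_intention="Condense up to 95% of main steam with the turbine bypass valve open",
    element="Water in the hotwell basin",
    guide_word="not",
    deviation="Hotwell level does not remain within operating limits",
    causes=["Two-phase conditions in the hotwell basin (swell)"],
    consequences=["Reject valve opens, feedwater pump trips, reactor trips on low level"],
)

if __name__ == "__main__":
    print(example_entry.needs_follow_up())
```

The `condition_reports` field is where cross-references to the corrective action program would be recorded once an action item is entered there.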
In this BWR example, and in the actual BWR facility that experienced the event,
one of the resulting action items was to modify the plant response to a load
rejection as follows:
Retain the existing reject valve “open” permissive on high hotwell level
Provide automatic trip of a single recirculating water pump on signals that
result in full opening of the turbine bypass valve. Thermal hydraulic analysis
of the reduction in flow to the reactor on loss of a single pump confirmed
that the void increase in the core would cause a temporary rise in reactor
level. The high reactor level would result in a relatively early throttling of
feedwater flow by the flow control valves. This reduction in feedwater flow
allowed feedwater pump suction to remain above the low suction pressure
setpoint even if the reject valves were open due to a false high hotwell level
signal. On stabilizing conditions in the condenser and hotwell, the resulting
steam flow through the bypass valve given a tripped recirculating water
pump was significantly less than rated flow while reactor power was still
more than sufficient to support house loads using the main generator and
avoid a plant trip.
For each process part, or for each element associated with a given part, the
HAZOP procedure is repeated until the hazard analysis scope is satisfied. For
guidance on developing hazard analysis scope and objectives, refer to Section 3.1.
6.3 Applying the HAZOP Results
As with other hazard analysis methods described in this guideline, the results of a
HAZOP analysis can be used in support of the following activities:
Application Development
The HAZOP results can be used by the integrator to improve system designs
through the application development lifecycle process. The conceptual design
phase of the lifecycle process should include a preliminary hazards analysis, using
one of the approaches described in Section 3.7. A preliminary HAZOP analysis
can be used to identify and reduce or eliminate potential vulnerabilities in the
system as the design activities progress. Some vulnerabilities may be prevented or
mitigated to a reasonable extent through one or more defensive measures that are
realized through design requirements and/or plant programs and processes. For
guidance on applying defensive measures in digital I&C systems, see References
20 and 21.
The HAZOP analysis should be updated through the design process, or when
the design is complete, to reflect the finished design at an appropriate application
baseline. For guidance on determining baselines, see EPRI 1022991 (Reference
18).
The finished HAZOP analysis should be validated, at least to the extent that the
behaviors or corrective actions identified in the analysis can be tested without
extraordinary conditions or destructive methods, in the test phase of the
application development lifecycle. HAZOP validation test cases can be executed
at the Factory Acceptance Test (FAT), Site Acceptance Test (SAT) or during
post-installation testing. Additional guidance on testing is provided in EPRI
1025282 (Reference 32).
Licensing
IEEE Definition of Hazard: A condition that is a prerequisite to an
accident. Hazards include external events as well as conditions internal
to computer hardware or software. (Reference 9)
The IEEE definition, accepted by the NRC, considers internal and external
events and conditions. The HAZOP method can be useful because it considers
plant process deviations that can be caused by, or mitigated by control system
actions.
For brevity, one worked example of the HAZOP method is provided, using the
same Circ Water System (CWS) controls described in previous examples. Figure
4-7 and Figure 4-8 provide diagrams of the CWS controls that are evaluated in
this example.
Example 6-1. Circ Water System Controls HAZOP
HAZOP Step 1: Form an Assessment Team
A multidisciplined team was formed, made up of expertise from digital I&C design,
mechanical systems design, systems engineering, digital control system product
design and PRA knowledge domains. The team met twice, first to review the CWS
control system design and initiate the HAZOP worksheet, and again to review the
results and confirm the recommended corrective actions. A HAZOP method facilitator
was on the phone with the team for both meetings, and offered valuable guidance on
the selection of the process parts to be assessed, and how to effectively use the guide
words to postulate deviations.
The team members' initials were recorded at the top of the HAZOP worksheet that
was initiated in the first team meeting, provided in Table 6-3.
HAZOP Step 2: Select a Process Part
For this example, the HAZOP analysis team selected the COMM 1 “part” of the I/O
cabinet in CWS control system process illustrated in Figure 4-7. The process part was
identified on the HAZOP worksheet.
HAZOP Step 3: Determine Design Intention and Success Criteria
The design intention of the COMM 1 module is a function that would be listed or
described in a prerequisite Function Analysis (FA). The design intention of COMM 1
in a given I/O cabinet is to pass, or communicate data that is addressed to or from
the I/O modules in that cabinet. The success criterion is to communicate the data
without any errors or losses of the COMM 1 data link that connects the I/O cabinet
to other cabinets. The design intention and success criteria were recorded on the top
rows of the HAZOP worksheet.
HAZOP Step 4: Identify Elements/Attributes
For this example, the HAZOP team identified one of the elements/attributes of the
design intention (data communication to/from I/O modules) as the “signaling voltage
on the physical interface (indicating the presence of a modulated carrier).” In other
words, the electrical characteristics at the physical layer described by the Open
Systems Interconnect (OSI) 7-layer model. This element was recorded in the second
column of the HAZOP worksheet.
HAZOP Step 5: Apply Guide Words to Develop Possible Deviations
The “guide words” provided in Table 6-2 were used to assess postulated conditions
that could be “deviations” from the design intention identified in Step 3. Each guide
word was listed in its own row in the HAZOP worksheet, and the resulting deviations
were recorded in the “Deviation” column. For example, the guide word “No” could
result in the deviation “no carrier signal.”
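The guide-word step can be sketched in code. The following is a minimal illustration only: the guide-word list is an assumed subset (Table 6-2 remains the authoritative list), and the function name `postulate_deviations` and the phrasing of each deviation are hypothetical.

```python
# Hypothetical sketch: pair each guide word with an element to prompt a deviation.
# GUIDE_WORDS is an assumed subset; use the list in Table 6-2 in practice.
GUIDE_WORDS = ["No", "More", "Less", "Other than"]

def postulate_deviations(element, guide_words=GUIDE_WORDS):
    """Return one worksheet row (a dict) per guide word for the given element."""
    return [
        {"element": element, "guide_word": gw, "deviation": f"{gw} {element}".lower()}
        for gw in guide_words
    ]

rows = postulate_deviations("carrier signal")
for row in rows:
    print(row["guide_word"], "->", row["deviation"])
```

The generated phrases are only prompts for team discussion; a deviation that makes no physical sense for the element would be reworded or set aside by the team.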
HAZOP Step 6: List Possible Causes of Deviations
The HAZOP team discussed and debated possible causes of each deviation listed in
the HAZOP worksheet (Table 6-3).
For example, continuing with the “No carrier signal” deviation, three possible causes
are as follows:
A broken wire
A dead COMM 1 module
A failed backplane
The results of this step are recorded in the “possible causes” column of the HAZOP
worksheet.
HAZOP Step 7: Evaluate Consequences of Deviations
The team carefully examined Figure 4-7 and Figure 4-8 to determine and evaluate the
consequences of the deviations listed in the HAZOP worksheet.
The consequences associated with the “No carrier signal” deviation listed in
Table 6-3 are as follows:
No consequence, other than loss of one COMM module. The redundant COMM
module maintains communication (i.e., the data link) with other cabinets.
A possible “failed backplane” cause of the “No carrier signal” deviation will
result in loss of both COMM modules, a complete failure to communicate data to
other cabinets in the CWS control system, and due to the basic design and
architecture of the controls, will result in loss of the circulating water system
pumps.
HAZOP Step 8: Identify Existing Safeguards to Prevent Deviations
For each of the deviations and their possible causes, existing safeguards were
identified and evaluated for their potential to prevent or mitigate the deviation. Upon
review of the completed worksheet in Table 6-3, it is apparent that existing
safeguards are available for all deviations and their causes except for one; that
being a failed backplane.
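One way to make the safeguard review repeatable is to hold the worksheet rows as records and filter for causes with no safeguard. This is a sketch under stated assumptions: the `WorksheetRow` type is hypothetical, and the pairing of each safeguard with a particular cause is illustrative rather than taken from Table 6-3. Only the absence of a safeguard for the failed backplane comes from the example.

```python
from dataclasses import dataclass, field

@dataclass
class WorksheetRow:
    deviation: str
    cause: str
    consequence: str
    safeguards: list = field(default_factory=list)

# Safeguard-to-cause pairings below are illustrative assumptions, not Table 6-3 content.
rows = [
    WorksheetRow("No carrier signal", "Broken wire",
                 "Loss of one COMM module; redundant module maintains the data link",
                 ["Wiring standard"]),
    WorksheetRow("No carrier signal", "Dead COMM 1 module",
                 "Loss of one COMM module; redundant module maintains the data link",
                 ["Internal control system diagnostics"]),
    WorksheetRow("No carrier signal", "Failed backplane",
                 "Loss of both COMM modules and of the circulating water pumps",
                 []),
]

def unsafeguarded(worksheet):
    """Return rows with no existing safeguard; these are candidates for action items."""
    return [r for r in worksheet if not r.safeguards]

gaps = unsafeguarded(rows)
print([r.cause for r in gaps])
```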
HAZOP Step 9: Develop Action Items
Action items were developed and assigned to the appropriate team members. In most
cases, the action items were written to confirm the applicability and effectiveness of
existing safeguards such as wiring standard, periodic test procedures, or internal
control system diagnostic features.
One action item (highlighted in yellow) stood out of this example, requesting a
design review of the CWS control system architecture and a proposal for a design
change to prevent the loss of all 3 CWS pumps due to a failed backplane. The plant
described by this example requires 4 out of 6 CWS pumps to be operating to avoid
ultimate heat sink issues that would lead to an inadequate condenser vacuum,
causing a turbine trip.
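The arithmetic behind this action item can be checked with a short sketch. It assumes that all six pumps are running before the failure, which the example does not state explicitly; the function name is hypothetical.

```python
TOTAL_PUMPS = 6      # from the example
REQUIRED_PUMPS = 4   # minimum running pumps stated in the example

def heat_sink_adequate(pumps_lost, running_before=TOTAL_PUMPS):
    """True if enough CWS pumps remain running after losing `pumps_lost` pumps."""
    return running_before - pumps_lost >= REQUIRED_PUMPS

print(heat_sink_adequate(2))  # 4 pumps remain
print(heat_sink_adequate(3))  # 3 pumps remain, below the 4 required
```

Under that assumption, the plant tolerates the loss of two pumps but not the three that share the failed backplane.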
What is interesting about this example is that it readily identified a failure of a single
passive element (the backplane) that leads to an unacceptable result (turbine trip).
The FMEA and FTA methods, applied to the same example, did not reveal this
vulnerability.
The FTA method has the potential for identifying such vulnerabilities if modeling of
common cause failures is considered, although specific root causes of these failure
modes may not be identified explicitly without further effort.
Table 6-3
CWS Controls HAZOP Worksheet
6.5 HAZOP Strengths
Systems View
The HAZOP method takes a system view. The results are useful for input to the
requirements definition phase of a digital I&C project because they result in a
goal-driven design from the beginning. Goals include safety, reliability, power
generation, etc.
The HAZOP method can provide insights into system behaviors beyond what is
typically revealed by FMEA and Top Down, because it considers the behaviors
of active and passive plant elements without necessarily postulating specific
failures.
Unexpected Behaviors
The HAZOP method can help identify unexpected and strange system behaviors
that may not otherwise be thought credible or possible. For example, it can
identify adverse interactions between components and systems that would on the
surface appear to have no potential interactions at all.
When the data is reduced to the final list of corrective actions, the results can
typically be readily used to inform requirements, identify and apply defensive
measures, and demonstrate system acceptability.
The final results can also be used as an input to another method to help avoid
searches for faults and failures that don’t lead to hazards.
Interactions
HAZOP is a hazard identification technique which considers system
parts individually and methodically examines the effects of deviations
on each part. Sometimes a serious hazard will involve the interaction
between a number of parts of the system. In these cases the hazard may
need to be studied in more detail using techniques such as event tree
and fault tree analyses.
Trained Facilitator
The principal investigators of this guideline researched the HAZOP method and
developed the worked examples provided in Section 6.4. As the examples were
developed using the team approach described in the HAZOP procedure (Section
6.2), it became apparent that the team’s experience with other methods such as
FMEA and Top Down drove an overly narrow consideration of active plant
components and their failure modes. This narrow-minded focus missed the point
that the HAZOP method provides the most benefit by considering deviations
from the design intentions of plant process parts, which can be active or passive
elements in the plant. A trained facilitator helped the team recognize the error
traps created by their own mindsets and get back on the right track.
Section 7: Systems Theoretic Process
Analysis (STPA) Method
Systems Theoretic Process Analysis (STPA), a hazard analysis method, is one
part of a set of new or refined system safety engineering methods developed by
Dr. Nancy Leveson and her team at the Massachusetts Institute of Technology
(MIT), under the heading of Systems-Theoretic Accident Model and Processes
(STAMP). This work has been published in Dr. Leveson’s book, Engineering a
Safer World – Systems Thinking Applied to Safety (Reference 19).
The following guidance is not intended to alter the STPA method described in
Reference 19. This guidance is adapted to the extent that it demonstrates the
usefulness of STPA in performing hazard analysis of digital I&C systems in
commercial nuclear power plants.
The primary reason for developing STPA was to include the new causal
factors identified in STAMP that are not handled by the older
techniques [FMEA, FTA, HAZOP, and others]. More specifically,
the hazard analysis technique should include design errors, including
software flaws; component interaction accidents; cognitively complex
human decision-making errors; and social, organizational, and
management factors contributing to accidents. In short, the goal is to
identify accident scenarios that encompass the entire accident process,
not just the electromechanical components.
The notion of “worst case environment conditions” also needs some explanation. As
used in this guideline, it is meant to convey the idea that STPA is meant to
consider the states of the environment around the system in their abnormal
conditions. This was the fundamental approach proposed in the EPRI “ACES
Report” (Reference 14), where the digital system design functions were intended
to be analyzed in the context of abnormal conditions and events (ACES). The
STPA method is a natural extension of this idea, and is more systematic than the
methods proposed in the ACES Report. The key in the STPA method is to
avoid assuming that environmental conditions around a digital system are in their
normal states; it leads the analyst down the path of considering abnormal
conditions by using guide words that force consideration of control actions in
various contexts (using process model variables and their various states).
The hazard convention used in this guideline is the same convention used in the
definition of hazard provided in Reference 19 (i.e., hazards are system states or
conditions, not events).
system that performs an automatic or semi-automatic control or protective
function.
The term “control action,” as it is used in the STPA method, describes the effect
that a controller (human, machine, or both) has on an actuator and ultimately the
controlled process. Control actions can be safe, or unsafe, and may depend on
their context. In one context, a control action can be considered safe, while in
another context it may be unsafe. For example, an unplanned, automatic main
turbine trip may be considered safe in the context of protecting the main
turbine/generator set, but it may also be considered unsafe in the context of
nuclear safety because it is an initiating event that can challenge safety systems.
Therefore, the term “safety,” as it is used in the STPA method, is not necessarily
synonymous with the term “nuclear safety” that is used in the commercial nuclear
power industry. Reference 19 defines safety as “freedom from accidents (loss
events),” and it is the definition used here.
Causal Factors
The “causal factors identified in STAMP,” mentioned in the leading paragraph, are
built around the concept of a “control structure,” illustrated in Figure 7-1.
and controlled processes, and (3) communication and coordination
among controllers and decision makers.
Although these ideas are introduced at an abstract level, they can be applied
systematically on complex systems, and decomposed to any level of detail that
serves the objectives of the analysis. The worked examples provided in Section
7.3 demonstrate how various levels of abstraction can be systematically applied on
real systems.
[Figure 7-1 is a control loop diagram (Controller, Actuator, Controlled Process,
Sensor) annotated with the following control flaws:]
Controller: Inadequate Control Algorithm (flaws in creation, process changes,
incorrect modification or adaptation); Process Model Inconsistent, Incomplete,
or Incorrect
Controller to Actuator: Inappropriate, Ineffective or Missing Control Action
Actuator: Inadequate Operation; Delayed Operation
Controlled Process: Component Failures or Changes Over Time; Conflicting
Control Actions (from Controller 2); Process Input Missing or Wrong;
Unidentified or Out-of-Range Disturbance; Process Output Contributes to
System Hazard
Sensor: Inadequate Operation; Incorrect or No Information Provided;
Measurement Inaccuracies; Feedback Delays
Sensor to Controller: Inadequate or Missing Feedback; Feedback Delays
Figure 7-1
A Classification of Control Flaws Leading to Hazards
Credit: Dr. Nancy G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety,
published by The MIT Press
Control Flaws
The idea of causal factors is transformed into a set of control flaws that can be
superimposed on the control structure. The control flaws are shown in Figure 7-1
in red text. MIT researchers have not yet found any evidence from their
investigations of accidents (losses) or complex systems that the set of control
flaws illustrated in Figure 7-1 is incomplete. Each of the control flaws (e.g.,
delayed actuator operation, measurement inaccuracies, inadequate or missing
feedback, inadequate control algorithm, etc.) is fully described in Reference 19.
Control Flaws vs. Causal Factors
A Hierarchical View
As described in Section 1.4, hazards can lead to losses, and the purpose of a
hazard analysis is to identify hazards so they can be eliminated, reduced or
mitigated. STPA extends this hierarchy to include control flaws (causal factors),
with the underlying principle that if the analyst can find and eliminate control
flaws, then resulting potential hazards may be eliminated, and accidents
prevented.
Accident(s) or Loss(es): An undesired and unplanned event that results in a loss
(including loss of human life or injury, property damage, environment pollution,
and so on). (Reference 19)
Hazard(s): A system state or set of conditions that, together with a particular
set of worst-case environment conditions, will lead to an accident (loss).
(Reference 19)
Figure 7-2
Accidents, Hazards, Unsafe Control Actions & Control Flaws
The term context as it is used in the STPA method means the system or
environmental state, or combination of states, in which a control action is
provided. Different contexts can lead to different conclusions regarding hazards.
For example, one context that shows an increasing pump speed can be beneficial
if system flow is too low, but hazardous if pump speed is too high and
approaching an equipment limit.
Prerequisite
The results of a Function Analysis, as described in Section 3.6, are a useful input
to the STPA analysis because it provides a well-organized set of functions that
can feed into the steps of the STPA procedure that identify the control structure
and process models.
Figure 7-1 shows one controller, and one control action (the down arrow
between the controller and the actuator). Therefore, there is one control action
that would be evaluated further under the STPA method, which classifies control
action behaviors as follows:
Control Action is Provided
Control Action is Not Provided
Control Action is Provided Too Early
Control Action is Provided Too Late
Control Action is Stopped Too Soon
Notice that the bolded words bear some resemblance to the guide-words used in
the HAZOP method.
The STPA method postulates these control action behaviors in various contexts to
determine if they are hazardous. If a control action is hazardous, then it is an
Unsafe Control Action.
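The five behaviors and the context-dependent judgment can be sketched as follows. This is an illustration, not part of the STPA method in Reference 19: the `Behavior` enumeration mirrors the list above, while the fill-valve control action, the tank-level contexts, and the judgment function are hypothetical stand-ins for the team's engineering judgment.

```python
from enum import Enum

class Behavior(Enum):
    PROVIDED = "is provided"
    NOT_PROVIDED = "is not provided"
    TOO_EARLY = "is provided too early"
    TOO_LATE = "is provided too late"
    STOPPED_TOO_SOON = "is stopped too soon"

def unsafe_control_actions(control_action, contexts, is_hazardous):
    """Return (behavior, context) pairs that the judgment function flags as hazardous."""
    return [
        (b, c) for b in Behavior for c in contexts
        if is_hazardous(control_action, b, c)
    ]

# Hypothetical judgment: opening a fill valve is hazardous when tank level is high,
# and withholding the action is hazardous when tank level is low.
def example_judgment(control_action, behavior, context):
    if behavior is Behavior.PROVIDED:
        return context == "tank level high"
    if behavior is Behavior.NOT_PROVIDED:
        return context == "tank level low"
    return False

ucas = unsafe_control_actions(
    "open fill valve", ["tank level high", "tank level low"], example_judgment
)
for behavior, context in ucas:
    print(f"open fill valve {behavior.value} when {context}")
```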
The focus on control actions and contexts is particularly useful when analyzing
digital I&C systems for the presence of hazards that may be introduced by
software.
Figure 7-1 shows several more control flaws in other parts of the control loop
that could lead to hazards, which in turn could lead to losses. Identifying the
presence of these other control flaws is the object of STPA Basic Step 2.
Basic Step 2 requires an analysis of the potential causes of the Unsafe Control
Actions (UCA) identified in Basic Step 1. In essence, for each UCA, the analyst
will “go around the loop” in the control structure and consider if any of the
potential control flaws in other parts of the loop could cause the controller to
“command” the UCA. It is important to remember that a UCA can be active in
the sense that it is a control action that may be provided (or provided too early)
and lead to a hazard, or passive in the sense that it is not provided (or provided
too late or stopped too soon) and lead to a hazard.
One of the strengths of STPA is that it limits the evaluation to only the control
flaws that can lead to hazards.
This guideline expands the basic STPA procedure described in Reference 19 into
more discrete steps, as follows:
The analysis begins with determining a system boundary, which requires
identification of the plant system (or systems), and their interfaces, that can affect
or be affected by an activity.
For a digital upgrade activity, the system boundary would encompass the digital
equipment and the plant systems or components that can influence or be
influenced by the digital equipment. The output of the Function Analysis
method described in Section 3.6 should be used as an input to the STPA analysis.
One method for identifying the system boundary on a digital upgrade project
would be to:
1. Identify the digital equipment
2. Identify the process elements that the digital equipment is expected to
protect or control
3. Identify the equipment that interfaces between the digital equipment and the
process elements
4. Identify remaining digital equipment interfaces, and the equipment that
might be connected to them
5. Identify any other equipment or processes that can affect the environment
around the equipment and processes identified in Steps 1 through 4.
The output from Step 1 is a drawing that represents the equipment, process
elements, and their interfaces and interconnection. The drawing should include
physical and functional representations. Appendix C provides a generic list of
equipment types and process elements, as well as physical and functional
representations that can be used in a system drawing.
Note that STPA results may be sensitive to where the system boundary is placed.
One of the strengths of this method is its ability to identify interactions between
components that would otherwise not appear to interact, such as components
that appear to be physically and functionally independent. Another strength of
STPA is its ability to identify adverse component interactions, even if none of the
components have failed or malfunctioned. Therefore, care should be taken when
identifying the boundary to avoid missing components and interfaces that may
interact with the system.
Lost or reduced generation
Any other loss that is of concern to the owner/operator
The output from this step is a short and simple list of losses. For more detailed
guidance, see Reference 19.
Using the results from Step 2, list the possible system-level hazards that could
lead to each loss. The system-level hazards are a function of the controlled
process elements and their ability to cause a loss. As described in Section 1.4, it is
important to use a clear definition of hazard, and apply it consistently. As in Step
2, the list of hazards should be short and simple.
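A simple traceability check between the loss list and the hazard list can be sketched as below. The identifiers (L1, H1, H2), the hazard descriptions, and the helper names are hypothetical placeholders; only "Lost or reduced generation" is taken from the loss list in this guideline.

```python
# Losses (Step 2) and hazards (Step 3); entries are illustrative placeholders.
losses = {"L1": "Lost or reduced generation"}
hazards = {
    "H1": {"description": "Inadequate condenser vacuum", "leads_to": ["L1"]},
    "H2": {"description": "Feedwater pump suction pressure too low", "leads_to": ["L1"]},
}

def untraced_hazards(hazards, losses):
    """Return hazard IDs that reference no known loss (a traceability gap)."""
    return [hid for hid, h in hazards.items()
            if not any(loss_id in losses for loss_id in h["leads_to"])]

def uncovered_losses(hazards, losses):
    """Return loss IDs that no listed hazard leads to."""
    covered = {loss_id for h in hazards.values() for loss_id in h["leads_to"]}
    return [lid for lid in losses if lid not in covered]

print(untraced_hazards(hazards, losses))
print(uncovered_losses(hazards, losses))
```

Empty results from both checks indicate that every hazard traces to a loss and every loss has at least one hazard listed against it.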
Using the results of the Function Analysis described in Section 3.6, the next step
is to draw the control structure. Start with a basic, rudimentary structure, more or
less consistent with the control structures illustrated in Figure 7-3.
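[Figure 7-3 shows a Controller (containing a Process Model) above a Controlled
Process, with Control Actions passing down from the controller and Feedback
Signals passing up from the controlled process.]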
Figure 7-3
Basic Control Structure
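[Figure 7-4 shows a Human Controller (Control Action Generation, Model of
Automation, Model of Controlled Process), influenced by Training & Procedures
and Environmental Conditions, connected through a Human-System Interface to an
Automated Controller (Control Algorithm, Model of Controlled Process). The
Automated Controller acts through Actuators on the Controlled Process, which
has Process Inputs and Process Outputs, and receives feedback through Sensors.]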
Figure 7-4
Basic Control Structure with Human Operator
Credit: Dr. Nancy G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety,
published by The MIT Press
The control structure should have at least one controller and a representation of
the control actions and feedback signals between the controller and the controlled
process. Figure 7-3 meets the minimum criteria for a control structure, and may
be adequate for a variety of situations. Creating the Process Model details comes
in Step 5.
Figure 7-4 provides a more resolved control structure that separates the human
and automated controllers, and how the control actions are directly applied to
actuators via the automated controller (solid down arrows) and indirectly applied
as intended by the human operator (dashed down arrow). In both Figures, a
Process Model is represented by a box in each controller. Creating the Process
Model details comes in Step 5.
A more detailed or resolved control structure can be prepared to break down the
basic control structure into more discrete components if desired. However, it is
useful to complete the STPA analysis at the basic system-level before proceeding
with a more detailed analysis because it can provide significant insights before
expending more effort at a detailed level.
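A basic control structure can also be captured as data before it is drawn in detail. The sketch below is a hypothetical representation (the class names, the `is_minimal` check, and the pump example are assumptions), intended only to show the minimum criteria described above.

```python
from dataclasses import dataclass, field

@dataclass
class Controller:
    name: str
    control_actions: list = field(default_factory=list)   # down arrows
    feedback_signals: list = field(default_factory=list)  # up arrows

@dataclass
class ControlStructure:
    controllers: list
    controlled_process: str

    def is_minimal(self):
        """Check for at least one controller with both control actions and feedback."""
        return any(c.control_actions and c.feedback_signals for c in self.controllers)

basic = ControlStructure(
    controllers=[Controller("Automated controller", ["start pump", "stop pump"], ["flow"])],
    controlled_process="Fluid system",
)
print(basic.is_minimal())
```

A more resolved structure, such as one with separate human and automated controllers, would add further `Controller` entries without changing the check.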
STPA Step 5: Create Process Model(s)
Mismatched or conflicting process models arise when one of the process models
is incorrect or incomplete, which amounts to a control flaw. Step 7 is designed to
identify this flaw (among others).
Process model variables (PMVs) are essentially the up arrows and sideways
arrows in the control structure created in Step 4. Step 6 describes the PMVs and
their states in greater detail.
The output of this step is a table that lists Process Model Variables (PMV) and
their possible States. PMVs are readily identified from the control structure as the
feedback signals and other inputs to a given controller. Possible PMV states
include open, closed, on, off, increasing, decreasing, as-needed, or other
characteristics that simply describe PMV behaviors.
Table 7-1
Suggested Process Model Format
Controller Name
Process Model Variables                        PMV States
PMV1 (Controller Feedback Signal or Input)     State 1, State 2, ... State n
PMV2 (Controller Feedback Signal or Input)     State 1, State 2, ... State n
PMVn (Controller Feedback Signal or Input)     State 1, State 2, ... State n
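The suggested process model format maps naturally onto a small dictionary. The following sketch uses a hypothetical controller and hypothetical PMVs; the rendering function is an assumption added for illustration.

```python
# Hypothetical process model for one controller: PMV name -> possible states.
process_model = {
    "controller": "Pump speed controller",
    "pmvs": {
        "system flow": ["too low", "normal", "too high"],
        "pump speed": ["increasing", "decreasing"],
    },
}

def format_process_model(model):
    """Render the process model as plain-text rows: PMV name once, one state per line."""
    lines = [f"Controller: {model['controller']}"]
    for pmv, states in model["pmvs"].items():
        for i, state in enumerate(states):
            label = pmv if i == 0 else ""
            lines.append(f"{label:<14}{state}")
    return "\n".join(lines)

print(format_process_model(process_model))
```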
STPA Step 6: Identify Hazardous Control Actions
Examine the control structure, and for each controller, identify the control
actions (down arrows) and their basic characteristics in terms of their effects
or influences on the next controller or the controlled process that it acts upon.
Figure 7-5 illustrates how some key terms are used in the STPA process.
When a control action from a given controller acts upon another
controller, it is expressed by the manner in which its action is expected to
influence the state of one or more of the Process Model Variables in that
controller (e.g., increase desired flow, decrease desired flow, etc.).
When a control action from a given controller acts upon the controlled
process, it is expressed by the manner in which its action is expected to
influence the state of one or more of the controlled process elements (e.g.,
increase valve position, decrease valve position, start pump, stop pump,
etc.).
Figure 7-5
Control Actions, Process Model Variables (PMVs) and PMV States
[Figure 7-5 annotates the basic control structure (Controller with Process
Model, Control Actions, Feedback Signals, Controlled Process) with example
entries:]
Other Inputs or Conditions: Plant Condition, Plant Mode, Others...
Control Actions (CAs): Increase, Decrease, Open, Close, Hold, Switch, Others...
Process Model Variables: Pressure, Flow, Temperature, Voltage, Current,
Others...
PMV States: Normal, Accident, Increasing, Decreasing, As Needed, On, Off,
Mode 1, Automatic, Manual, Others...
The result is a list of CAs for each controller and how they can influence
the state of a process model variable in the next controller or controlled
process element. At this point, it is helpful to begin building a worksheet
or table that combines the CAs from each controller with the Process
Model table for the next controller or controlled process elements that
they influence or act upon. Table 7-2 provides a suggested format for a
worksheet or table:
Table 7-2
Combining Control Actions with Affected Process Models
Controller N         Next Controller or Controlled Process Element
                     PMVs                                          PMV States
CA1 (Influence 1)    PMV1 (Controller Feedback Signal or Input)    State 1 ... State n
                     PMVn (Controller Feedback Signal or Input)    State 1 ... State n
CAn (Influence n)    PMV1 (Controller Feedback Signal or Input)    State 1 ... State n
                     PMVn (Controller Feedback Signal or Input)    State 1 ... State n
The key to this step is to determine the contexts in which each control action
can be hazardous. Contexts are a function of process model variables and
their states. A context can be simple, comprising one PMV with two possible
states (e.g., valve is open, or valve is closed), or it can be more complex,
comprising two or more PMVs, each with two or more states. (e.g., valve is
open and turbine speed is increasing and tank level is decreasing).
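Because a context is a combination of PMV states, the full set of contexts is the Cartesian product of the state lists. The sketch below illustrates this with hypothetical PMVs and states; the function name is an assumption.

```python
from itertools import product
from math import prod

# Hypothetical PMVs; state counts (2, 2, 3) chosen for illustration.
pmv_states = {
    "valve": ["open", "closed"],
    "turbine speed": ["increasing", "decreasing"],
    "tank level": ["increasing", "steady", "decreasing"],
}

def enumerate_contexts(pmv_states):
    """Return every combination of PMV states as a dict (one context per dict)."""
    names = list(pmv_states)
    return [dict(zip(names, combo)) for combo in product(*pmv_states.values())]

contexts = enumerate_contexts(pmv_states)
expected = prod(len(states) for states in pmv_states.values())
print(len(contexts), expected)  # both are 2 * 2 * 3 = 12
print(contexts[0])
```

With five postulated behaviors per control action, these 12 contexts imply 60 judgments for a single control action, which shows how quickly the effort grows as PMVs or states are added.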
Using the results from step (a), postulate the following Behaviors for each
Control Action, and determine if it is hazardous in each context:
1. Control Action Is Provided
2. Control Action Is Not Provided
3. Control Action Is Provided Too Early
4. Control Action Is Provided Too Late
5. Control Action Is Stopped Too Soon
Figure 7-6
Structure of a Hazardous Control Action
It is helpful to organize a team of knowledgeable individuals such as
system engineers, operators, and design engineers, and hold one or more
team meetings to consider each context and determine if it is or could be
hazardous (hazards having been identified in Step 3).
The result of step (b) is an expanded version of the table created in step (a).
A sample of a suggested worksheet format is provided in Table 7-3. In this
sample, the worksheet would be produced five times for each CA; once for
each of the five postulated CA behaviors. This example shows five hazards
and three PMVs; the first two each have two possible states, the third
PMV has three possible states. Note that the STPA worksheet can have
any number of PMVs and PMV states.
Table 7-3
Sample STPA Worksheet
Columns: Row | PMV1 (Name) | PMV2 (Name) | PMV3 (Name) | Is Situation Already
Hazardous? | Is CA Behavior Hazardous? | Related Hazard | Comments (Situational
Context)
Rows     PMV1 (Name)   PMV2 (Name)   PMV3 (Name)
1-3      State 1       State 1       State 1
4-6      State 1       State 1       State 2
7-9      State 1       State 1       State 3
10-12    State 1       State 2       State 1
13-15    State 1       State 2       State 2
16-18    State 1       State 2       State 3
19-21    State 2       State 1       State 1
22-24    State 2       State 1       State 2
25-27    State 2       State 1       State 3
28-30    State 2       State 2       State 1
31-33    State 2       State 2       State 2
34-36    State 2       State 2       State 3
A few observations can be made about Table 7-3:
Each row denotes a specific context, which is a requirement for
determining if a postulated control action behavior is hazardous.
Sometimes a context (i.e., combination of PMV states) is inherently
hazardous, which is the purpose of the column labeled “Is situation
already hazardous?” The control action may mitigate the hazard, or make
it worse, or have no effect at all; this should be noted in the comments
column.
When there are more PMVs or more possible PMV states, the number
of contexts to be evaluated can grow significantly, thus requiring more
effort.
The analyst or the team performing the analysis should consider each context
and attempt to answer the question “Is CA Behavior Hazardous” as “Yes” or
“No.” Sometimes it is difficult to answer definitively because the context may
have conflicting PMV constraints (e.g., a CA that ultimately increases the
speed of a pump is beneficial if system flow is too low, but hazardous if, at
the same time, the pump speed is too high and approaching equipment
limits). In these cases, it is acceptable to put “Maybe” or some other notation
that indicates a question for further analysis in Step 7.
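The "Yes", "No", or "Maybe" judgment for the pump example can be sketched as a small rule. The rule set below is hypothetical and is built only from the pump example in the text; a real worksheet entry reflects the team's judgment, not a formula.

```python
def is_ca_hazardous(context):
    """Answer 'Yes', 'No', or 'Maybe' for the CA 'increase pump speed' in a context.

    Hypothetical rule set built from the pump example in the text.
    """
    helps = context["system flow"] == "too low"   # the CA is beneficial
    harms = context["pump speed"] == "too high"   # the CA approaches equipment limits
    if helps and harms:
        return "Maybe"  # conflicting constraints: carry forward to Step 7
    return "Yes" if harms else "No"

print(is_ca_hazardous({"system flow": "too low", "pump speed": "too high"}))
print(is_ca_hazardous({"system flow": "normal", "pump speed": "too high"}))
print(is_ca_hazardous({"system flow": "too low", "pump speed": "normal"}))
```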
Caution
It is easy to fall into the trap of thinking that some contexts are absurd or can’t
exist. For example, if a process model for a steam turbine-driven pump in a fluid
system includes turbine speed as one PMV and system flow as another PMV,
then one might be tempted to dismiss the contextual combination of “turbine
speed too high” and “system flow too low.” However, these types of strange
behaviors might occur due to problems or malfunctions in the controlled process,
such as debris in the line or equipment degradation (e.g., a damaged pump
impeller), and it may be that the postulated CA for such a context is precisely the
right thing to do.
It is important to not throw out any contexts, no matter how strange, because
experience shows that strange behaviors are the ones we least expect and fail to account
for in system design and operation, yet they still manifest themselves and lead to
accidents or losses.
Results
When the STPA worksheet is completed, the results should be reduced to a list
of Hazardous Control Actions by transposing each row of the worksheet that
indicates when a postulated Control Action is hazardous. The format of each
Hazardous Control Action should follow the structure presented in Figure 7-6.
As described at the beginning of this section, Step 6 delivers the result of Basic
Step 1 of the STPA method.
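As a minimal sketch of this reduction, the Python fragment below filters a worksheet down to its hazardous rows and sets aside the "Maybe" rows for Step 7. The worksheet fields, the placeholder control action "CA-X", and the sentence pattern are assumptions; the structure given in Figure 7-6 remains the authoritative format.

```python
# Hypothetical worksheet rows for one control action; all values are illustrative.
worksheet = [
    {"ca": "CA-X", "behavior": "provided",
     "context": {"turbine speed": "too high", "system flow": "too low"},
     "ca_hazardous": "Yes", "hazards": ["H3"]},
    {"ca": "CA-X", "behavior": "provided",
     "context": {"turbine speed": "as needed", "system flow": "too low"},
     "ca_hazardous": "No", "hazards": []},
    {"ca": "CA-X", "behavior": "not provided",
     "context": {"turbine speed": "too low", "system flow": "too low"},
     "ca_hazardous": "Maybe", "hazards": ["H1"]},
]

def to_statement(row):
    """Transpose one worksheet row into a single sentence (assumed pattern)."""
    conditions = " and ".join(f"{pmv} is {state}"
                              for pmv, state in row["context"].items())
    return f"{row['ca']} is {row['behavior']} when {conditions}"

def reduce_worksheet(rows):
    """Split rows into hazardous control actions and open questions for Step 7."""
    hcas = [to_statement(r) for r in rows if r["ca_hazardous"] == "Yes"]
    open_questions = [to_statement(r) for r in rows if r["ca_hazardous"] == "Maybe"]
    return hcas, open_questions

hcas, open_questions = reduce_worksheet(worksheet)
for statement in hcas:
    print(statement)
```

Rows answered "No" drop out of the list, while "Maybe" rows are carried forward rather than discarded.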
STPA Step 7: Identify Potential Causes of Hazardous Control Actions
The analysis team that performed Step 6 should remain intact, and perform this
step by considering each of the control flaws presented in Figure 7-1 in the
context of each Hazardous Control Action identified in Step 6.
The team should be careful to not discount or dismiss any potential causes, even
if the team is aware of adequate defensive measures that would reasonably reduce
the likelihood of such causes of the hazardous control action to an acceptable
level. The purpose of the STPA method is to systematically identify the potential
causes of hazardous control actions first, without prejudice, so that later steps in
the system design lifecycle can:
eliminate, reduce, or mitigate such hazards to an acceptable level, or
confirm that the proposed (or existing) design and administrative controls are
adequate as-is.
As described at the beginning of this section, Step 7 delivers the result of Basic
Step 2 of the STPA method.
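One way to keep this step systematic is to build a checklist with one item per pairing of hazardous control action and control flaw. The sketch below assumes three flaw categories taken from the headings of Table 7-8 (Figure 7-1 is the authoritative list and may contain more); the HCA labels, field names, and the example defense are hypothetical.

```python
from itertools import product

# Partial list of control flaw categories, taken from the headings of
# Table 7-8. Figure 7-1 is the authoritative source and may list more.
CONTROL_FLAWS = [
    "Inadequate Feedback",
    "Inadequate Execution of Control Action",
    "Inadequate Process Inputs, Physical System",
]

def build_checklist(hcas, flaws=CONTROL_FLAWS):
    """Create one review item per (hazardous control action, control flaw) pair."""
    return [{"hca": hca, "flaw": flaw, "causes": [], "reviewed": False}
            for hca, flaw in product(hcas, flaws)]

def record_cause(checklist, hca, flaw, cause, known_defense=None):
    """Record a potential cause. A known defense is noted but never removes the cause."""
    for item in checklist:
        if item["hca"] == hca and item["flaw"] == flaw:
            item["causes"].append({"cause": cause, "known_defense": known_defense})
            item["reviewed"] = True
            return item
    raise KeyError((hca, flaw))

checklist = build_checklist(["HCA 1", "HCA 2"])  # placeholder HCA labels
record_cause(checklist, "HCA 1", "Inadequate Feedback",
             "sensor failure", known_defense="redundant sensor")  # hypothetical defense
remaining = [(i["hca"], i["flaw"]) for i in checklist if not i["reviewed"]]
print(len(checklist), len(remaining))  # 6 items, 5 still to review
```

Recording the known defense alongside the cause, instead of deleting the cause, keeps the identification step free of prejudice while preserving information for later design decisions.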
See Section 7.3 for guidance on applying STPA results, which should be used to
identify design changes, administrative controls, or a combination of both in
order to eliminate, reduce, or mitigate such hazards to an acceptable level.
Application Development
Each of the potential causes of each hazardous control action should be evaluated
by a team of knowledgeable individuals responsible for system design, test,
operations, and maintenance. For each potential cause, the team should decide if
it can be eliminated, prevented, or mitigated to a reasonable extent through one
or more defensive measures that are realized through design requirements and/or
plant programs and processes. For guidance on applying defensive measures in
digital I&C systems, see References 20 and 21.
Ideally, this evaluation is performed early, at the conceptual design phase, so that
safety-driven requirements are inserted before detailed design begins. The results
of the STPA analysis should be reviewed again as the detailed design emerges, to
determine if any of the design details substantially altered the control structure
used in the analysis. Of particular concern would be new interfaces or functions
that were not accounted for in the STPA analysis.
If the STPA method is applied late in a project for some reason, the
owner/operator should be prepared to stop the project and rework the design if
the STPA results clearly indicate potential hazards that are not effectively
eliminated, prevented, or mitigated to a reasonable extent.
Because STPA results are focused on hazardous control actions, the results can
be used to provide a more focused approach when other methods may be applied.
For example, the FMEA method requires a bottom-up analysis of all devices in a
component, or all components in a system, which can become very large, time
consuming, and costly if there is a large number of devices or components. If the
STPA method is applied first, then an FMEA can focus exclusively on the
devices or components that could cause a hazardous control action, and
determine the failure modes or failure mechanisms that could lead to such
hazardous actions.
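The scoping idea can be sketched in a few lines of Python. The component list and the mapping from hazardous control actions to implicated components are hypothetical; in practice they would come from the Step 7 results.

```python
# Hypothetical component list for a flow control system.
all_components = [
    "flow indicating controller", "handswitch", "governor",
    "positioner", "magnetic pickup", "limit switch", "actuator",
]

# Hypothetical mapping: components named in the potential causes of each HCA.
hca_components = {
    "HCA 1": {"governor", "magnetic pickup", "limit switch"},
    "HCA 2": {"positioner", "actuator"},
}

def fmea_scope(components, hca_components):
    """Keep only components that could contribute to a hazardous control action."""
    implicated = set().union(*hca_components.values())
    return [c for c in components if c in implicated]

scope = fmea_scope(all_components, hca_components)
print(f"FMEA scope: {len(scope)} of {len(all_components)} components")
```

The FMEA then examines failure modes and mechanisms only for the components in `scope`.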
Licensing
The IEEE definition, accepted by the NRC, considers internal and external
events and conditions, while the STPA definition considers system-level and
environmental conditions. This does not mean the STPA method is not suitable
for a licensing activity; in fact it is quite useful because it considers internal
conditions in the form of causes of Hazardous Control Actions (STPA Basic Step
2). In other words, identification of Hazardous Control Actions, and their causes,
appears compatible with the IEEE definition of “Hazard” and therefore may be
suitable for licensing activities.
Example 7-1. System-Level HPCI-RCIC Turbine Controls STPA
(continued)
H2: Radioactive materials released
(e.g., pressure too high, coolant leak, air leak)
H3: Equipment operated beyond physical limits
(e.g., turbine overspeed)
H4: Inadvertent equipment operation during maintenance
(e.g., unexpected actuator or valve movement/pinch points, false or misleading
indications)
H5: Reactor shutdown
This list of system-level hazards is as short and simple as the list of losses. Table 7-4
provides a simple cross-reference that shows how any given hazard can lead to one
or more losses.
However, different contexts are implied in Table 7-4. As this example unfolds, it will
become apparent that some HPCI-RCIC flow control actions are hazardous when
they are provided when there is no demand from the plant protection system (one
context), and hazardous when they are not provided when there is a demand from
the plant protection system (another context). At this point in the analysis, the entries
in Table 7-4 implicitly reflect these different contexts.
STPA Step 4: Draw the Control Structure
As described in the procedure provided in Section 7.2, using the results of a
Function Analysis (per Section 3.6), this step starts with a system-level control
structure, shown in Figure 7-8. At this point, the control structure can be verified as
complete and correct within the system boundary identified in Step 1, and it can be
used to complete the analysis at the system level.
In Figure 7-8, the Control Actions that will be evaluated later are represented by the
down arrows. By inspection, the Control Actions include two possible actions by the
operator and two possible actions by the flow control system. The process model
variables (PMVs) are the up arrows in Figure 7-8, with the addition of the plant
conditions as a sideways arrow, for a total of five PMVs. The control structure used
in this example is relatively simple. In practice it can be much more complicated,
depending on the scope and boundaries of the problem.
STPA Step 5: Create Process Model
At the system level shown in Figure 7-8, there are two basic “controllers”: a
human operator and a Flow Control System.
The controllers and their process models are represented in Figure 7-9. Notice that this
figure is just a variation of the control structure created in Step 4, rearranged to make
room for the process models. This variation of the control structure shows control
actions down the left side, a truncated view of the controlled process at the bottom,
and feedback signals and other inputs up the right side. The process models are
shown in the tables located inside each controller box.
At this point in the STPA method, the process models are captured in tables or
spreadsheets and carried forward to the next steps. When working with bigger
tables and spreadsheets in later steps, it is helpful to refer to Figure 7-9 because it
readily shows the relationships between the control actions, the process model
variables, and the process model states.
STPA Step 6: Identify Hazardous Control Actions
In this example, it is assumed that administrative procedures are in place that
require the operator to leave the flow indicating controllers in automatic mode, at a
fixed flow setpoint, at all times. Of course in real life, procedures can be wrong and
operators can make mistakes, so the application of this method on an actual project
should avoid such assumptions. They are only used here to allow reducing the
number of system contexts to be analyzed for the sake of brevity.
In this example, control action CA3, shown in Figure 7-9, is analyzed against five
process model variables and their states, as shown in Table 7-5.
For brevity, this example is limited to the analysis of CA3 as Providing its control
action and Not Providing its control action in the contexts of the five PMVs. A full
analysis would postulate each of the following behaviors:
Provided
Not Provided
Too Early
Too Late
Stopped Too Soon
The intermediate results of the “CA3 is Provided” analysis are shown in Table 7-6,
where the following observations can be made:
Almost all of the PMV combinations, or contexts, are already hazardous. For
these situations, the question becomes “does the control action make the hazard
worse, mitigate the hazard, or make no difference?” In some
cases, the answer is “Maybe”: increasing system flow is the correct response,
in terms of reactor limits, when system flow is too low, but at the same time
increasing the valve position might worsen the equipment damage hazard.
Half of the contexts result in “No Response” when there is an accident and there
is no system enable signal.
The bottom two rows are reduced, for brevity, to the case where there is not an
accident, the valve position doesn’t matter, the turbine speed is too high, and
the system flow doesn’t matter.
− If a system enable signal is received under these conditions, then increasing
the governor valve position is hazardous because it would worsen the effect
of a spurious actuation, thus worsening the effects of unwanted system flow,
possibly causing the reactor to reach a limit (e.g., power transient or
overfill).
− If a system enable is not received under these conditions, then increasing
the governor valve position might be hazardous, causing an unwanted
turbine speed transient that could result in an equipment damage hazard,
perhaps if there is a leaky steam admission valve. In this case, it is assumed
that downstream process valves remain closed, thus eliminating any hazard
to the reactor.
In this example, the intermediate results of the “CA3 is Not Provided” analysis are
omitted for brevity.
The final results of Step 6 (Table 7-6) are reduced into the list of Hazardous Control
Actions shown in Table 7-7. Rows in Table 7-6 that don’t show hazardous control
actions are not included in Table 7-7. Additionally, rows from Table 7-6 that have
identical hazardous control actions for multiple combinations of process model
variable states have been consolidated into single rows. The notes that accompany
this Table provide some insights as to why these actions are hazardous.
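The consolidation can be sketched in Python by treating "*" as a wildcard. The rows below are rows 2 through 6 of Table 7-7 as read from the table; the three-state sets for valve position, turbine speed, and system flow, and the reading of "Yes"/"No" as system enable present/absent, are assumptions.

```python
ANY = None  # wildcard, shown as "*" in Table 7-7

SPEED_FLOW_STATES = ("too low", "as needed", "too high")  # assumed state set
VALVE_STATES = ("too closed", "as needed", "too open")    # assumed state set

# Rows 2-6 of Table 7-7 ("CA3 is provided"), as read from the table.
ROWS = [
    {"no": 2, "accident": True, "valve": {"too open", "as needed"},
     "speed": ANY, "flow": ANY, "enable": True, "hazard": "H3"},
    {"no": 3, "accident": True, "valve": {"too closed"},
     "speed": {"too high", "as needed"}, "flow": ANY, "enable": True, "hazard": "H3"},
    {"no": 4, "accident": True, "valve": {"too closed"},
     "speed": {"too low"}, "flow": {"too high", "as needed"}, "enable": True, "hazard": "H3"},
    {"no": 5, "accident": False, "valve": ANY,
     "speed": {"too high"}, "flow": {"too high"}, "enable": True, "hazard": "H1"},
    {"no": 6, "accident": False, "valve": ANY,
     "speed": {"too high"}, "flow": ANY, "enable": False, "hazard": "H3"},
]

def covers(allowed, actual):
    """A wildcard covers any state; otherwise the state must be listed."""
    return allowed is ANY or actual in allowed

def matching_rows(context, rows=ROWS):
    """Return the numbers of the consolidated rows that cover a concrete context."""
    return [r["no"] for r in rows
            if r["accident"] == context["accident"]
            and r["enable"] == context["enable"]
            and all(covers(r[k], context[k]) for k in ("valve", "speed", "flow"))]

def expanded_count(row):
    """Number of concrete contexts that one consolidated row stands for."""
    count = 1
    for key, domain in (("valve", VALVE_STATES),
                        ("speed", SPEED_FLOW_STATES),
                        ("flow", SPEED_FLOW_STATES)):
        count *= len(domain) if row[key] is ANY else len(row[key])
    return count

print(expanded_count(ROWS[0]))               # row 2 covers 2 x 3 x 3 = 18 contexts
print(sum(expanded_count(r) for r in ROWS))  # rows 2-6 together cover 38 contexts
```

Under these assumptions, five consolidated rows stand in for 38 worksheet rows, which is why the reduced list is much easier to review than the intermediate table.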
STPA Step 7: Identify Potential Causes of Hazardous Control Actions
For the purposes of this example, Hazardous Control Action No. 7 was selected for
further evaluation in Step 7. The analysis team performed this step by considering
each of the control flaws presented in Figure 7-1 in the context of Hazardous
Control Action No. 7. The team was careful to not discount or dismiss any potential
causes, even if the team was aware of adequate defensive measures that would
reasonably reduce the likelihood of such causes of the hazardous control action.
Table 7-8 provides the results of this team assessment, where the following
observations can be made:
All potential causes listed could be analyzed further and in more detail given
more information about the system.
This higher-level control loop should take into account additional aspects like:
− HPCI/RCIC is achieving desired flow rate at the pump, but downstream
leaks or blockage are causing insufficient flow rate at the reactor
− Upstream problems like water supply depleted, steam pressure inadequate,
leaks/blockage
− HPCI/RCIC system unable to achieve necessary flow rate, or the max flow
rate is achieved but not sufficient to cool reactor
STPA Step 8: Apply the Results
In Step 7, none of the potential causes of Hazardous Control Action No. 7 were
dismissed or eliminated because the purpose of the STPA method is to identify
hazardous control actions. It is up to the users of this method to decide what to do
about the results.
In this example, the project team evaluated Table 7-8 and determined one or more
defensive measures against each potential cause, some of which become design
requirements (e.g., signal validation), and some of which become defensive
measures during the operations and maintenance phase of the system lifecycle (e.g.,
sensor calibration). In many cases, plant programs and processes can be credited
as defensive measures.
By making a positive determination of all reasonable causes of a potentially
hazardous control action, the STPA results can systematically demonstrate how
system requirements will prevent or mitigate some hazards, and how existing
programs and processes will prevent or mitigate other hazards.
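A simple record structure can make this demonstration explicit. The sketch below uses the two kinds of defensive measure named above (design requirements and operations and maintenance measures); the class name, the field names, and the pairing of each cause with a measure are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class PotentialCause:
    """One potential cause of a hazardous control action and its defenses."""
    description: str
    design_requirements: list = field(default_factory=list)
    om_measures: list = field(default_factory=list)  # operations and maintenance

    def is_covered(self):
        """True when at least one defensive measure is credited."""
        return bool(self.design_requirements or self.om_measures)

# Illustrative pairings only; a real assessment would come from the project team.
causes = [
    PotentialCause("flow rate signal corrupted during transmission",
                   design_requirements=["signal validation"]),
    PotentialCause("flow rate signal incorrect",
                   om_measures=["sensor calibration"]),
    PotentialCause("actual flow rate is outside sensor's operating range"),
]

uncovered = [c.description for c in causes if not c.is_covered()]
print(uncovered)
```

Any cause left in `uncovered` still needs a design requirement, a credited plant program or process, or an explicit decision by the project team.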
Table 7-4
HPCI-RCIC Turbine Controls: System-Level Hazards vs. Accidents or Losses
Accidents or Losses: A1 A2 A3 A4 A5
H1 Reactor Exceeds Limits: X X
H2 Radioactive Material Release: X X
H3 Equipment Operated Beyond Limits: X X
H4 Inadvertent Equip. Operation During Maintenance: X
H5 Reactor Trip or Shutdown: X
Figure 7-7
HPCI-RCIC Flow Control System (System Level)
(Diagram labels: Operator, with Process Model; Plant Conditions; Flow Control System, with Process Model; System Initiation Signal; System Flow Rate; Turbine Speed; Valve Position; Open/Close Commands; System Enable; Controlled Process: Actuator, M, LS, Governor Valve, Trip/Throttle Valve, Steam Admission Valve, Magnetic PickUp, FLOW, From Main Steam, To Reactor, From Torus or Condensate Storage Tank)
Figure 7-8
System-Level HPCI-RCIC Flow Control Structure
(Diagram labels: Operator process model. Process Model Variables and Process Model States: Plant Conditions (Normal, Accident); Selected Controller Location (Main Control Room, Remote Shutdown Panel); Flow Indicating Controller Mode (Manual, Automatic); System Flow (Too Low, At Desired Flow, Too High). Other labels: Plant Conditions; Indicated Flow; CA4: Decrease; Actual Position; Governor Valve Actuator; Governor Valve)
Figure 7-9
System-Level HPCI-RCIC Process Models
Table 7-5
Select HPCI-RCIC Flow Control Actions
Table 7-6
Excerpt of STPA Results for Control Action 3
Table 7-7
Excerpt from List of HPCI-RCIC Hazardous Control Actions
Flow control system provides increase governor valve position (CA3) when:
No.  Plant condition           Valve position is      Turbine speed is       System flow is         System enable is  Related hazard
2    there is an accident      too open or as needed  *                      *                      Yes               H3
3    there is an accident      too closed             too high or as needed  *                      Yes               H3
4    there is an accident      too closed             too low                too high or as needed  Yes               H3
5    there is not an accident  *                      too high               too high               Yes (Note 2)      H1
6    there is not an accident  *                      too high               *                      No (Note 3)       H3
Flow control system does not provide increase governor valve position (CA3) when:
Notes
1. A Hazardous Control Action because flow control system does not respond at all when there is an accident and no system enable
2. A Hazardous Control Action because increasing the governor valve position (CA3) worsens the effect of a spurious system actuation
3. Might be a Hazardous Control Action if it causes turbine speed to reach a limit when turbine speed is already too high and there is no system
enable (possibly due to a leaky steam admission valve?)
4. A Hazardous Control Action because system flow is too low during an accident, regardless of the states of the other process model variables,
including the system enable signal
Table 7-8
Potential Causes of Hazardous Control Action No. 7
Hazardous Control Action No. 1: “Increase governor valve position” command (CA5) is
provided when: there is an accident, turbine speed is too high, and system enable is present
Inadequate Feedback
  Enable signal sent to controller before there is a valid demand on HPCI/RCIC
    − enable provided when steam admission valve is not open (broken or misaligned LS)
    − steam admission valve commanded open when there is no demand on HPCI/RCIC (spurious ESFAS signal)
  Enable signal sent to controller when there is a demand on HPCI/RCIC, but delayed
    − enable provided when steam admission valve is opened, but too late (misaligned LS or LS setpoint too high)
    − steam admission valve opens too slowly when commanded by ESFAS Initiation Signal (excessive stem thrust)
    − steam admission valve commanded open too late when there is a demand on HPCI/RCIC (ESFAS delay)
  HPCI/RCIC pump flow rate signal to controller is missing, delayed, incorrect, too infrequent, or has inadequate resolution
    − signal corrupted during transmission
    − sensor failure
    − sensor design flaw
    − sensor operates correctly but actual flow rate is outside sensor’s operating range
    − fluid type is not as expected (water vs. steam?)
  Governor valve position signal to controller is missing, delayed, incorrect, too infrequent, or has inadequate resolution
    − problems with communication path
    − actual position is beyond sensor’s range
    − sensor reports actuator position and it doesn’t match valve position
    − sensor correctly reports valve position but position doesn’t match assumed area/shape
Inadequate Execution of Control Action
  Increase governor valve position command is provided in this context, but the command does not produce an increase in governor valve position
    − command sent but does not reach governor valve actuator
    − command received but governor valve is already fully open
    − command parameter is outside actuator’s operating range
    − command conflicts with another command
    − actuator failure
    − valve stuck
    − not powered
  Increase governor valve position command is provided and the governor valve position increases, but the amount of increase is not as commanded
    − valve response time too slow (design flaw, physical failure, valve worn, etc.)
Inadequate Process Inputs, Physical System
    − power missing or inadequate
    − hardware failure (e.g., memory bit errors)
Example 7-2. Component-Level HPCI-RCIC Turbine Controls STPA
STPA Step 1: Identify System Boundary
The system boundary identified in this example is the same boundary identified in
Example 7-1 and Figure 4-6.
STPA Step 2: Identify Accidents (Losses)
The accidents (losses) identified in this example are the same accidents identified in
Example 7-1:
A1: People exposed to radioactivity
A2: Environment contaminated
A3: Equipment damage, up to and including core meltdown
(economic loss)
A4: Personnel injury or death
A5: Loss of generation
STPA Step 3: Identify System-Level Hazards
Even when applying the STPA method at the component level, the hazards are still
identified at the system-level because the ultimate purpose of the analysis is to
eliminate, reduce or mitigate hazards that can lead to accidents or other losses.
Therefore, the hazards identified in this example are the same hazards identified in
Example 7-1:
H1: Reactor exceeds limits
H2: Radioactive materials released
H3: Equipment operated beyond physical limits
H4: Inadvertent equipment operation during maintenance
H5: Reactor shutdown
Table 7-4 still provides a simple cross-reference that shows how any given hazard
can lead to one or more losses.
STPA Step 4: Draw the Control Structure
In order to develop requirements or analyze for the presence of hazards created at
the component level, a more resolved control structure is required. The hypothetical
digital upgrade in this example involves a digital governor and a digital positioner,
and for a number of reasons, described at the end, it is useful to deepen the
analysis to the component level. Therefore, a more refined control structure is
provided in Figure 7-11, where the flow control system is resolved into the flow
indicating controllers, the handswitch, the governor, and the positioner.
In Figure 7-11, the Control Actions that will be evaluated later are represented by
the down arrows. By inspection, the Control Actions include various actions by the
operator, the desired speed output from the flow controllers, the desired position
output from the governor, and ultimately the position demand output from the
positioner.
STPA Step 5: Create Process Model
At the component level shown in Figure 7-11 there are four “controllers” and
therefore four process models. The controllers are as follows:
1. Human Operator
2. In-service Flow Indicating Controller (1 of 2 identical FICs)
3. Governor
4. Positioner
Each controller and its process model are represented in Figure 7-12. Notice that this
figure is just a variation of the control structure created in Step 4, rearranged to make
room for the process models. This variation of the control structure shows control
actions down the left side, a truncated view of the controlled process at the bottom,
and feedback signals and other inputs up the right side. The process models are
shown in the tables located inside each controller box.
At this point in the STPA method, the process models are captured in tables or
spreadsheets and carried forward. Figure 7-12 is helpful in later steps because it
illustrates the relationships between the control actions, the process model variables,
and the process model states.
STPA Step 6: Identify Hazardous Control Actions
In this example, control action CA5 (Increase Desired Position), shown in Figure 7-12,
is analyzed against two process model variables (Turbine Speed and System
Enable). For brevity, this example is limited to the analysis of CA5 as Providing its
control action in the contexts of the two PMVs. A full analysis would postulate each
of the following behaviors:
Provided
Not Provided
Too Early
Too Late
Stopped Too Soon
The intermediate results of the “CA5 is Provided” analysis are shown in Table 7-9,
where the following observations can be made:
First, the number of rows in Table 7-9 is only 6, a dramatic reduction from the
size of the results table (Table 7-6) when the analysis was done on one control
action for the whole system in Example 7-1. This data reduction is achieved
because in this example, only one controller is analyzed. This approach points
to the benefit of avoiding very large combinatorial sets by isolating one
controller at a time, but the downside is that other “contexts” can be missed if
other Process Model Variables associated with other controllers are not factored
into the analysis.
If one assumes the analysis is done in the context of a demand on the HPCI
system due to plant conditions that indicate an accident, then there are three
immediately recognizable hazardous conditions in Table 7-9 when there is no
“System Enable” signal, regardless of turbine speed. If the System Enable signal
is not present, the governor will not provide any output to the positioner, and the
HPCI system will be inoperable.
The only definitive hazardous control action identified in Table 7-9 is one in
which the governor provides an increasing valve position demand signal (CA5)
when turbine speed is already too high and the system is enabled. Unless other
protective actions are provided, this control action will lead to a turbine
overspeed issue and/or an overfill condition in the reactor.
A preexisting hazardous state is indicated if turbine speed is already too low
before CA5 acts upon the system. Otherwise, CA5 is not hazardous.
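The observations above can be restated as a small rule set. The Python sketch below is a reading of the prose, not a transcription of Table 7-9: it assumes a demand on the HPCI system, three turbine speed states, and a row ordering in which row 1 is the hazardous control action; the labels and function name are illustrative.

```python
from itertools import product

SPEEDS = ("too high", "as needed", "too low")  # assumed turbine speed states
ENABLES = (True, False)                         # System Enable present or absent

def assess_ca5_provided(speed, enabled):
    """Classify one context for 'CA5 is provided', given a demand on HPCI."""
    if not enabled:
        # No governor output to the positioner: HPCI is inoperable,
        # regardless of turbine speed.
        return "pre-existing hazardous condition"
    if speed == "too high":
        # Raising valve position demand at high speed: overspeed or overfill.
        return "hazardous control action"
    if speed == "too low":
        return "pre-existing hazardous condition"
    return "not hazardous"

rows = [
    {"row": n, "turbine_speed": speed, "system_enable": enabled,
     "result": assess_ca5_provided(speed, enabled)}
    for n, (enabled, speed) in enumerate(product(ENABLES, SPEEDS), start=1)
]

for r in rows:
    print(r["row"], r["turbine_speed"], r["system_enable"], r["result"])
```

Two PMVs with three and two states yield six rows, of which exactly one is a definitive hazardous control action under these rules.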
STPA Step 7: Identify Potential Causes of Hazardous Control Actions
Row 1 in Table 7-9 was identified as a Hazardous Control Action. In this example, it
is labeled HCA1, or Hazardous Control Action 1.
The possible causes of HCA1 are listed in Table 7-10.
STPA Step 8: Apply the Results
The results of this analysis would be applied during the conceptual design or
requirements definition phase of the digital upgrade project, to assure the following:
A reliable means of providing a system enable signal. These STPA results
indicate this may be the Achilles’ heel of the whole control system. Methods
could include redundant and/or diverse contact closure input schemes, or
avoiding use of the valve stem-mounted limit switch as a signal source, and
using a direct input from the system initiation source (ESFAS in this case).
A reliable means of avoiding a false or inaccurate turbine speed signal.
Methods could include a redundant and/or diverse speed sensor scheme.
Performing software V&V activities, with particular emphasis on development of
test cases and validation testing to demonstrate that hazardous control action 1
(provide CA5 when turbine speed is too high) is prevented.
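A validation test case of this kind might look like the following Python sketch. Both command functions, the speed values, and the overspeed threshold are hypothetical stand-ins and do not represent any actual governor logic; the point is only to show a test sweep that fails a design lacking the protective check and passes one that has it.

```python
def guarded_command(speed_rpm, desired_rpm, enabled, overspeed_rpm):
    """Hypothetical governor logic that never raises demand at high speed."""
    if not enabled:
        return "none"
    if speed_rpm >= overspeed_rpm:
        return "decrease"
    if speed_rpm < desired_rpm:
        return "increase"
    if speed_rpm > desired_rpm:
        return "decrease"
    return "hold"

def naive_command(speed_rpm, desired_rpm, enabled, overspeed_rpm):
    """Hypothetical logic without the overspeed check, for comparison."""
    if not enabled:
        return "none"
    return "increase" if speed_rpm < desired_rpm else "hold"

def hca1_violations(command_fn, speeds, desired_values, overspeed_rpm):
    """List contexts where CA5 (increase) is provided while speed is too high."""
    return [(s, d) for s in speeds for d in desired_values
            if s >= overspeed_rpm
            and command_fn(s, d, True, overspeed_rpm) == "increase"]

# Illustrative values only. The 6000 rpm case models a desired speed above
# the limit, for example from a faulty upstream demand.
SPEEDS = [3000, 4500, 5000]
DESIRED = [4000, 6000]
OVERSPEED = 4800

print(hca1_violations(guarded_command, SPEEDS, DESIRED, OVERSPEED))
print(hca1_violations(naive_command, SPEEDS, DESIRED, OVERSPEED))
```

Including an out-of-range desired speed in the sweep matters because the hazardous context is most likely to arise from abnormal inputs, not from normal operation.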
FIC: Flow Indicating Controller
MCR: Main Control Room
RSP: Remote Shutdown Panel
PID: Proportional/Integral/Derivative
HS: Handswitch
(Diagram labels: MCR FIC; RSP FIC; HS; Governor; Positioner; PID; PID Program; Speed Demand; Position Demand; Flow Setpoint (RCIC: 500 gpm; HPCI: 5000 gpm); System Initiation Signal; Enable; 24 VDC; Resolver Interface; Actuator; Feedback; M; LS; From Main Steam; Steam Admission Valve; Trip/Throttle Valve; Governor Valve; Magnetic PickUp (MPU); FLOW; To Reactor; From Torus or Condensate Storage Tank)
Figure 7-10
HPCI-RCIC Flow Control System (Component Level)
(Diagram labels: Operator, with Process Model; Plant Conditions; Adjust Flow (Manual); Set Flow (Auto); Auto or Manual; System Flow; Desired Speed; Controlled Process: Actuator, M, LS, Governor Valve, Trip/Throttle Valve, Steam Admission Valve, Magnetic PickUp, FLOW, From Main Steam, To Reactor, From Torus or Condensate Storage Tank)
Figure 7-11
Component-Level HPCI-RCIC Flow Control Structure
(Diagram labels: Operator process model. Process Model Variables and Process Model States: Plant Conditions (Normal, Accident); Selected Controller Location (Main Control Room, Remote Shutdown Panel); Flow Indicating Controller Mode (Manual, Automatic); System Flow (Too Low, At Desired Flow). Other labels: Plant Conditions; CA1: Increase Desired Flow; CA8: Decrease; Actual Position; Governor Valve Actuator; Governor Valve)
Figure 7-12
Component-Level HPCI-RCIC Process Models
Table 7-9
Excerpt of STPA Results for Control Action 5
Columns: Row; PMV1 Turbine Speed; PMV2 System Enable; Analysis Results (Is Situation Already Hazardous?; Is CA Behavior Hazardous?; Related Hazards; Comments (Situational Context))
Table 7-10
Potential Causes of HCA 1
Hazardous Control Action No. 1: “Increase governor valve position” command (CA5) is
provided when: there is an accident, turbine speed is too high, and system enable is present
Inadequate Feedback
  Enable signal sent to controller before there is a valid demand on HPCI/RCIC
    − enable provided when steam admission valve is not open (broken or misaligned LS)
    − steam admission valve commanded open when there is no demand on HPCI/RCIC (spurious ESFAS signal)
  Enable signal sent to controller when there is a demand on HPCI/RCIC, but delayed
    − enable provided when steam admission valve is opened, but too late (misaligned LS or LS setpoint too high)
    − steam admission valve opens too slowly when commanded by ESFAS Initiation Signal (excessive stem thrust)
    − steam admission valve commanded open too late when there is a demand on HPCI/RCIC (ESFAS delay)
  HPCI/RCIC pump flow rate signal to controller is missing, delayed, incorrect, too infrequent, or has inadequate resolution
    − signal corrupted during transmission
    − sensor failure
    − sensor design flaw
    − sensor operates correctly but actual flow rate is outside sensor’s operating range
    − fluid type is not as expected (water vs. steam?)
  Governor valve position signal to controller is missing, delayed, incorrect, too infrequent, or has inadequate resolution
    − problems with communication path
    − actual position is beyond sensor’s range
    − sensor reports actuator position and it doesn’t match valve position
    − sensor correctly reports valve position but position doesn’t match assumed area/shape
Inadequate Execution of Control Action
  Increase governor valve position command is provided in this context, but the command does not produce an increase in governor valve position
    − command sent but does not reach governor valve actuator
    − command received but governor valve is already fully open
    − command parameter is outside actuator’s operating range
    − command conflicts with another command
    − actuator failure
    − valve stuck
    − not powered
  Increase governor valve position command is provided and the governor valve position increases, but the amount of increase is not as commanded
    − valve response time too slow (design flaw, physical failure, valve worn, etc.)
Inadequate Process Inputs, Physical System
    − power missing or inadequate
    − hardware failure (e.g., memory bit errors)
7.5 STPA Strengths
High Coverage
Systems View
The STPA method is essentially a top-down method that takes a system view.
The results are useful for input to the requirements definition phase of a digital
I&C project because they result in a safety-driven design from the beginning.
Unexpected Behaviors
The STPA method can identify unexpected and strange system behaviors that
may not otherwise be thought credible or possible. For example, it can identify
adverse interactions between components and systems that would on the surface
appear to have no potential interactions at all.
When the data is reduced to the final list of Hazardous Control Actions and
their potential causes, the results can typically be readily used to inform
requirements, identify and apply defensive measures, and demonstrate system
acceptability.
The final results can also be used as an input to another method to help avoid
searches for faults and failures that don’t lead to hazards.
Single Failures
The STPA method does not readily identify the effects of postulated single
failures unless each Process Model Variable is considered in isolation, which goes
against its purpose. Therefore, STPA results are not well suited as an input to a
single failure analysis or identifying single point vulnerabilities.
Some of the intermediate tables that can result from the STPA method can
become very large and tedious to manage and evaluate if there are more than a
few Process Model Variables. Section 7.7 describes likely developments in the
future of the STPA method to address this problem.
Trained Facilitator
At the time of publication, MIT researchers were developing new STPA tools
for creating control structures and visualizing results; checking for interactions
and influences; and tools for automating and reducing data to more readily usable
sets. This work holds promise for making the STPA method more accessible to a wider
range of users.
Section 8: Purpose Graph Analysis (PGA) Method
A Purpose Graph is a figure that illustrates the Observable, State, Goal and
Process features of a system. Purpose Graphs are used in Systems Engineering
design and analysis activities. The Purpose Graph is composed of a State Graph
placed side-by-side with a Process Graph.
Purpose Graph Analysis (PGA) can be used as a form of Hazard Analysis. The
PGA method is particularly useful for identifying potential digital systems
hazards that can arise from unexpected component or system behaviors by
providing insights into the following issues:
Redundancy of success paths in the system
Diversity of success paths in the system
Direct and indirect consequences of failures to meet designed performance
levels, even when no faults are present
Desired and undesired interactions between aspects of normal system state
changes
Incompatible goals. Large systems with many active components can easily
develop conflicts between the goals of different parts of the system. Most
complex systems are designed to have goal conflict resolution approaches
within them, but often suffer from a lack of completeness of these
approaches. Hazards occur when conflicting goals are not detected and
resolved in a timely way during operations.
Incompatible processes. Even when a large system is free of goal conflicts,
there may be hazardous interactions between the processes that are being
used to achieve the goals. These hazardous interactions may occur during
normal operations, even in the absence of faults and failures. Because the
design of system components is often distributed across many organizations,
these potential adverse process interactions may not be identified using
standard design practices.
8.1 PGA Overview and Objectives
This section describes the basic steps for constructing the Purpose Graph and
analyzing it for potential hazards. To illustrate the method in the context of a
detailed procedure, the top-level analysis of a portion of a typical Boiling Water
Reactor (BWR) is provided in Section 8.2, with worked examples of specific
systems and subsystems of a BWR digital safety system provided in Section 4.5.
In this basic step, two graphs are developed and juxtaposed. These graphs are
represented as both drawings and tables that describe them:
a. State Graph: The State Graph is a hierarchical graph (but not a tree!) of the
States of a system and its relevant subsystems and components. The
following terms are used in the expression of State Graphs:
Observable: A value that can be directly sensed or observed within the system
or its environment. Observables form the basis for the attributes and values
of a Sub-State, and are placed at the lowest level in the State Graph.
expected range of values. Attributes are similar to the Process Model
Variables used in the STPA method described in Section 7.
b. Process Graph: The Process Graph is a hierarchical graph (also not a tree!) of
the higher-level Goals, Processes and sub-Goals (and sub-processes) of the
system and its relevant subsystems and components. The following terms are
used in the expression of Process Graphs:
Goal: A desired set of values for one or more Sub-States. A Goal can be
compared to the actual values of the Sub-States and evaluated as either
Satisfied or Unsatisfied. Activation of a bistable function at a fixed setpoint
(e.g., reactor trip) is an example of a Goal, but other Goals in a system may
be more abstract, such as Safety or Availability.
Constraint: A combination of system Sub-State values that defines the
operating envelope of the “Goal-achieving” ability of a Process. Constraints
are used in many places within a Purpose Graph Analysis, but always to
indicate a Sub-State Attribute value relationship that must be satisfied in
order for a relationship (link) to hold true.
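As a rough illustration of these definitions, a Goal can be treated as a predicate over Sub-State Attribute values that evaluates to Satisfied or Unsatisfied. The sketch below is not part of the PGA method itself. The class name, the Attribute name `main_steam_pressure` and the setpoint value (in arbitrary units) are hypothetical and chosen only to show the bistable case.

```python
from dataclasses import dataclass
from typing import Callable, Dict

SubStateValues = Dict[str, float]

@dataclass
class Goal:
    """A desired set of values for one or more Sub-State Attributes."""
    name: str
    desired: Callable[[SubStateValues], bool]

    def evaluate(self, values: SubStateValues) -> str:
        # A Goal is compared to actual values and is either Satisfied or Unsatisfied.
        return "Satisfied" if self.desired(values) else "Unsatisfied"

# Hypothetical fixed setpoint in arbitrary units (not taken from any plant document).
LOW_PRESSURE_SETPOINT = 100.0

pressure_goal = Goal(
    name="Main steam pressure at or above setpoint",
    desired=lambda v: v["main_steam_pressure"] >= LOW_PRESSURE_SETPOINT,
)

normal = pressure_goal.evaluate({"main_steam_pressure": 120.0})
low = pressure_goal.evaluate({"main_steam_pressure": 80.0})
```

A Constraint could be written the same way, as a predicate that must hold for a given link in the graph to apply.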
The construction and analysis of system States in the State Graph and Goals and
Processes in the Process Graph use composition and decomposition techniques
driven by Purpose:
When juxtaposed, a State Graph and a Process Graph form a Purpose Graph that
can reveal the Purpose of system Goals and Processes in the context of its States,
Attributes, and Observables, thus leading to the name “Purpose Graph Analysis.”
In Basic Step 2, the graphs and tables prepared in Basic Step 1 are analyzed
against a set of ten Characteristics that reveal important strengths, weaknesses and
interactions in the digital system. Three areas of analysis can be performed:
States, Goals and Processes. This basic step probes deeply into State, Goal and
Process characteristics in order to identify system behaviors, both desired and
undesired (e.g., hazardous). The ten Characteristics evaluated in Basic Step 2 are
provided in Table 8-1, organized under their analysis headings. Note that some
Characteristics would be expected and desired, and some Characteristics would
not be expected or desired.
assessing for available or proposed measures that can eliminate, prevent or
mitigate such characteristics.
Table 8-1
Ten Characteristics Evaluated in PGA Basic Step 2
8.2 PGA Procedure
The following steps are recommended for performing the PGA method. This
procedure is not the only way to implement the method; variations may be
suitable for different projects.
Prerequisite
The results of a Function Analysis, as described in Section 3.6, are a useful input
to the PGA analysis because it provides a well-organized set of functions that can
feed into the steps of the PGA procedure that identify the system states, goals
and processes.
The first step of the PGA method is the construction of the preliminary State
Graph for the system being evaluated. Design and licensing basis information
and conceptual or detailed design information (e.g., specifications, drawings,
system descriptions, etc.) are used as inputs for this Step. The State Graph is
constructed as a drawing then augmented with a data table that lists Sub-States
and their attributes. In the early iterations of the State Graph, Sub-States are
composed from observables or lower-level Sub-States. Construction of the State
Graph is broken down to the following five sub-steps:
Figure 8-1 and Table 8-2 show sample information from a portion of a typical
Boiling Water Reactor (BWR). Figure 8-1 illustrates the arrangement of Main
Steam pressure switches, connected to a common instrument line manifold, that
are used to sense steam line breaks and initiate closure of all MSIVs using one-
out-of-two-taken-twice logic. The information in Table 8-2 will be expanded
and illustrated within the context of the PGA method as each step of the PGA
procedure unfolds.
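The one-out-of-two-taken-twice arrangement can be sketched in a few lines. The sketch below assumes that PS1 and PS2 form one trip pair and PS3 and PS4 form the other; the actual grouping is not stated here and would come from the plant logic diagrams. The function name is illustrative only.

```python
def one_out_of_two_taken_twice(ps1: bool, ps2: bool, ps3: bool, ps4: bool) -> bool:
    """Return True when the MSIV closure logic is satisfied.

    Assumed grouping (not confirmed by the text): PS1/PS2 are one pair,
    PS3/PS4 are the other. Each pair trips if at least one of its switches
    trips, and closure requires both pairs to trip.
    """
    pair_a = ps1 or ps2
    pair_b = ps3 or ps4
    return pair_a and pair_b

single_switch = one_out_of_two_taken_twice(True, False, False, False)
both_pairs = one_out_of_two_taken_twice(True, False, False, True)
same_pair_only = one_out_of_two_taken_twice(True, True, False, False)
```

Under this grouping a single tripped switch does not close the MSIVs, and neither do two tripped switches in the same pair.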
[Diagram not reproduced. Recoverable labels: Reactor; MSIV; PS1, PS2, PS3, PS4; Close MSIVs.]
Figure 8-1
BWR Main Steam Pressure Switches and MSIV Closure Logic
Table 8-2
Sample PGA Preliminary Observables Table
A low level Sub-State represents the composition of one or more observables into
a named Sub-State, as in the State Graph per Figure 8-2 below. An observable
can be composed into many separate Sub-States; there is no exclusive
relationship between an observable (or any Sub-State) and its parent Sub-State.
The links in the State Graph show “parent-child” relationships that reflect
dependency, where the parent Sub-State depends on the values of the children
Sub-States. The Sub-State in Figure 8-2 is typical of a Sub-State that aggregates
sensor data, such as a voting arrangement like the one-out-of-two-taken-twice
logic shown on the right side of Figure 8-1. Note that the Main Steam Pressure
Sub-State is not represented in Figure 8-2 as high, low, true, false or any other
State value; consideration and evaluation of State values is performed in a later
step. By itself, the State Graph is a representation of relevant system States and
related Observables, regardless of their possible values.
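One way to picture the non-exclusive parent-child relationship is to store the State Graph as a directed graph in which a node may have several parents. The following is a minimal sketch under that assumption. The class and method names are illustrative, and the second parent, "Pressure Switch Health," is a hypothetical Sub-State added only to show that an Observable can feed more than one parent.

```python
from collections import defaultdict

class StateGraph:
    """Hierarchical graph of Sub-States and Observables (not a tree)."""

    def __init__(self):
        self.children = defaultdict(set)  # parent Sub-State -> child nodes
        self.parents = defaultdict(set)   # child node -> parent Sub-States

    def compose(self, parent, children):
        # A link means the parent Sub-State depends on the values of the child.
        for child in children:
            self.children[parent].add(child)
            self.parents[child].add(parent)

graph = StateGraph()
switches = ["PS1", "PS2", "PS3", "PS4"]
graph.compose("Main Steam Pressure", switches)
graph.compose("Pressure Switch Health", switches)  # hypothetical second parent

ps1_parents = graph.parents["PS1"]
```

Note that the graph records only which nodes exist and how they depend on each other; no State values are stored at this stage.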
[Diagram not reproduced. Recoverable labels: Steam Pressure PS1; Steam Pressure PS2; Steam Pressure PS3; Steam Pressure PS4.]
Figure 8-2
State Graph with a Low Level Sub-State
[Diagram not reproduced. Recoverable labels: Main Steam; Main Steam Pressure; Main Steam Isolation Valves; Reactor Power; Reactor Control; Reactor Coolant; Steam Pressure PS1, PS2, PS3, PS4. Legend: State, Observable.]
Figure 8-3
Main Steam Sub-State
A higher-level Sub-State should be defined and added to the State Graph when
it is useful to collect up values from several lower level states into a larger Sub-
State (composition). To guide this step, the following questions are considered:
Is it useful to group (compose) two or more lower level Sub-States into a
higher level Sub-State?
Can all of the relevant operating events be associated with Sub-States? If an
event (alarm, process trigger, etc.) can’t be associated with a particular Sub-
State, it is an indication that Sub-States are missing.
After a higher level Sub-State is added to the graph as a new parent, are
there any other Sub-States that should influence the newly created higher-
level Sub-State?
- If they exist, link them to the new Sub-State as children
- If the influence is not already represented as a Sub-State, add the needed
new Sub-State to the State Graph and link it to the parent as a child.
In Figure 8-3, the lower level Sub-States that are children of Main Steam
Isolation Valves and Reactor Power State are not yet defined. They were
composed directly in Step 1.c) as other Sub-States that influence the State of the
Main Steam system. Step 1.c) can be used to define the Sub-State children for
these additional States; or Step 1.a) and Step 1.b) can be used to compose them
from their related Observables.
After adding Sub-States and links to the State Graph to capture the preliminary
dependency relationships between the Sub-States, the construction of the Process
Graph should begin. The steps for building the Process Graph are described in
PGA Step 2.
As the Process Graph is built, the State Graph should be repeatedly assessed to
determine if it is complete with respect to the Goals defined in the Process
Graph. A State Graph is complete when it is possible to associate all of the Goals
in the Process Graph and all of the operating constraints of the Processes in the
Process graph to specific Sub-States in the State Graph.
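The completeness criterion lends itself to a simple mechanical check. The sketch below assumes the analyst keeps a set of Sub-State names plus two mappings, one from Goals to their related Sub-States and one from Processes to the Sub-States used in their Constraints. The function name and the data layout are assumptions, and the Sub-State set is deliberately incomplete so that the check reports something.

```python
def missing_state_associations(sub_states, goal_states, process_constraint_states):
    """Return Goals and Processes that refer to Sub-States absent from the State Graph."""
    missing = {}
    for kind, mapping in (("Goal", goal_states), ("Process", process_constraint_states)):
        for name, needed in mapping.items():
            absent = sorted(set(needed) - set(sub_states))
            if absent:
                missing[(kind, name)] = absent
    return missing

# Deliberately incomplete set of Sub-States for illustration.
sub_states = {"Reactor Control", "Reactor Coolant", "Main Steam Pressure"}

# Goal associations follow Table 8-4; the Process mapping is hypothetical.
goal_states = {
    "Protect Core": ["Reactor Power", "Reactor Coolant"],
    "Provide Main Steam Supply": ["Main Steam Pressure"],
}
process_constraint_states = {"Rx Steam Production": ["Main Steam Pressure"]}

gaps = missing_state_associations(sub_states, goal_states, process_constraint_states)
```

An empty result would indicate that every Goal and every Process Constraint can be associated with a Sub-State already in the State Graph.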
As the State Graph approaches completeness, the State and Event Table is
constructed. All of the Sub-States in the State Graph are listed in tabular form as
shown in Table 8-3; for expediency, this one shows only three of the Sub-States
from Figure 8-4. Each entry in the State table captures the definition of the Sub-
State, its attributes, and the operational events that are associated with the Sub-
State and its values.
The State and Event Table is used in the analysis phase of the PGA method
described in PGA Steps 3, 4 and 5.
[Diagram not reproduced. Recoverable labels: Overall Plant State; Reactor Control; Reactor Coolant.]
Figure 8-4
Notional Top Level State Graph for a BWR
Table 8-3
Top-Level BWR State and Event Table (Partial)
Once the construction of the State Graph has reached Step 1.c), construction of
the Process Graph should be started. It is important to remember that
throughout Process Graph Step 2.a) to Step 2.c), the State Graph will
remain at Step 1.c) and may be revisited several times.
As with the State Graph, the Process Graph will be built first as a drawing, then
as two tables that include the details about the Goals and Processes that make up
the drawing. The Process graph will be constructed mostly by the decomposition
of Processes into Sub-Goals and identifying all of the possible Processes that can
satisfy Goals.
The links in Figure 8-5 indicate that the Sub-Goals are necessary for the correct
performance of the parent Process. Note that the Process “BWR Plant
Operations” is very abstract, and has very broad scope. Similarly, its Sub-Goals
are abstract, with broad scope. The succeeding steps of the Process Graph will
add more and more detail and specificity by decomposing plant operations into
finer and finer Goals and Processes from the top down, as opposed to the State
Graph which composes observables and lower level Sub-States into higher level
states, from the bottom up. At the lowest level of the Process Graph, the
Processes can be directly performed by operators and machines.
[Diagram not reproduced. Recoverable labels: BWR Plant Operations; Electric Power Production; Plant Safety; Plant System Readiness. Legend: Process, Goal.]
Figure 8-5
Top Level Process Graph for a BWR
The construction of the Process Graph continues by defining Processes for each
Sub-Goal. At the high levels of the Process Graph, these Processes will still be
abstract. More than one Process can be defined as a child of a Goal. These
sibling Processes are alternative ways to achieve the Goal, and represent the
presence of design characteristics like diversity and redundancy. To guide the
identification of the Process children of a Goal, consider these questions:
Are there diverse ways to accomplish the Goal? Typical examples are
multiple diverse subsystems, or the independent manual actions of the
human operators.
Are there redundant resources for accomplishing the Goal?
If these questions are evaluated as true, then alternative Processes can be defined
as children of the Goal. Sibling Processes that are defined as alternatives should
be distinct in some way from each other, a feature normally apparent in Step 2.c).
Figure 8-6 shows the Goal of “Plant System Readiness” (a Sub-Goal from Figure
8-5). Note that while the illustrated Processes are not necessarily mutually
exclusive, they are distinct. It would be considered proper for all of these
Processes to be going on in parallel with each other and concurrently with many
other activities within the plant. In some cases, particularly at the lower levels of
the Process Graph, sibling Processes may in fact be mutually exclusive. While
exclusivity of alternative Processes has no influence on their status as siblings, the
property of exclusivity in alternate Processes will be evaluated in the analysis
phase.
[Diagram not reproduced. Recoverable labels: Plant System Readiness; Scheduled Maintenance Plan; On Condition Maintenance; System Replacement/Upgrades. Legend: Goal, Process.]
Figure 8-6
Alternative Processes in a Process Graph
Constraints are implied in the connections between alternative Processes that can
each satisfy a connected Goal. Constraints provide a sense of context that limits
the applicability of an identified Process for achieving its parent Goal.
Constraints are expressed in terms of the related States and Sub-States in the
State Graph. Constraints are useful when analyzing Process Interactions, and can
be listed in the Process Table (see Table 8-5).
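To make the role of Constraints concrete, the sketch below treats each Constraint as a predicate over Sub-State values and uses it to select which alternative Processes are applicable for a Goal. The Process names come from this section's Electric Power Production example; the Attribute `turbine_status` and its values are hypothetical, as is the function name.

```python
def applicable_processes(constraints, sub_state_values):
    """Return the alternative Processes whose Constraint holds for the given values."""
    return [name for name, holds in constraints.items() if holds(sub_state_values)]

# Hypothetical Constraints keyed by Process name; "turbine_status" is an assumed Attribute.
constraints = {
    "Turbine Start-Up": lambda s: s["turbine_status"] == "starting",
    "Steam Turbine Electric Generation": lambda s: s["turbine_status"] == "online",
    "Turbine Shutdown": lambda s: s["turbine_status"] == "stopping",
}

selected = applicable_processes(constraints, {"turbine_status": "online"})
none_apply = applicable_processes(constraints, {"turbine_status": "tripped"})
```

When no Constraint holds, as in the second call, the Goal has no applicable Process in that context, which is the kind of condition the analysis phase is meant to surface.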
For each Process that is defined, its Sub-Goals are defined and linked to it as
children. It is not allowed to link Processes directly to other Processes, or Goals
directly to other Goals. In the Process Graph, the layers of Processes and Goals
are interleaved (i.e., layered), so that Processes have only Goal children and Goals
have only Process children.
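The interleaving rule can be checked mechanically once each node is tagged as a Goal or a Process. The following is a minimal sketch, assuming the Process Graph is held as a list of parent-child links. The function name is illustrative, and the last link is a deliberately invalid Goal-to-Goal link added to show what a violation looks like.

```python
def interleaving_violations(node_kind, links):
    """Return links that join two nodes of the same kind (Goal-Goal or Process-Process)."""
    return [(parent, child) for parent, child in links if node_kind[parent] == node_kind[child]]

node_kind = {
    "BWR Plant Operations": "Process",
    "Electric Power Production": "Goal",
    "Plant Safety": "Goal",
    "Plant System Readiness": "Goal",
    "Scheduled Maintenance Plan": "Process",
}

links = [
    ("BWR Plant Operations", "Electric Power Production"),
    ("BWR Plant Operations", "Plant Safety"),
    ("BWR Plant Operations", "Plant System Readiness"),
    ("Plant System Readiness", "Scheduled Maintenance Plan"),
    ("Plant Safety", "Plant System Readiness"),  # invalid on purpose: Goal linked to Goal
]

violations = interleaving_violations(node_kind, links)
```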
[Diagram not reproduced. Recoverable labels: Electric Power Production. Legend: Goal.]
Figure 8-7
Layered Goals and Processes in a Process Graph
In Figure 8-7, the three alternative Processes for the Goal of “Electric Power
Production” are shown, along with their decomposed Sub-Goals; in this case, the
three alternative Processes are in fact mutually exclusive. Sub-Goals represent
steps or activities necessary to carry out the parent Processes, and the more
abstract the parent Processes, the more abstract their children Sub-Goals. Notice
in Figure 8-7 that the Sub-Goals “Meet House Load,” “Provide Main Steam
Supply,” “Meet External Electric Load” and “Remove Excess Heat,” which are the
children of the three alternatives, are partially shared higher up in the Process
Graph, but the alternative Processes “Turbine Start-Up,” “Steam Turbine
Electric Generation” and “Turbine Shutdown” each have a different set of Sub-
Goals. For example, the “Steam Turbine Electric Generation” Process has the
Sub-Goal “Meet External Electric Load,” while the other Processes do not
share that Sub-Goal.
When new Sub-Goals are defined, it is necessary to return to Step 1.d) of the
State Graph construction Process and verify that the system Sub-States that
correspond with the newly defined Sub-Goals are defined.
[Diagram not reproduced. Recoverable labels: Plant Electric Load Balance; Electric Power Production.]
Figure 8-8
Checking for State and Goal Associations in the Purpose Graph
Figure 8-9 provides a portion of the final Process Graph that results from
extending the initial Process Graph in Figure 8-7 by iterating Step 2.b) and Step
2.c) until it includes the lowest level Sub-Goals and their children Processes.
Note that at each Process layer, the Processes are increasingly explicit (i.e., with
decreasing abstraction). Table 8-4 provides a list of Goals and related Sub-States,
and Table 8-5 provides a list of Processes and related Goals. The Goal “Protect
Core” and its related Process, “Respond to All Reactor Events,” are highlighted
to show their places and associations in these tables. The procedure for preparing
the Goal and Process tables is provided in Step 2.d) below.
The complete Purpose Graph for the notional BWR in this procedure is
provided in Figure 8-10.
As in the State Graph, the Process Graph drawing is supported by two tables,
one for identifying the attributes and state relationships for Goals and the other
for identifying the related Processes. These tables are built by entering each Goal
in the Goal Table and each Process in the Process Table. These tables will be
used directly in the analysis phase of the PGA method.
[Diagram not reproduced. Recoverable labels: BWR Plant Operations; Turbine Start Up; Steam Turbine Electric Generation; Turbine Shutdown; Integrated Safety Operations; Scheduled Maintenance Plan; On Condition Maintenance; System Replacement/Upgrades; Meet House Electric Load; Meet External Electric Load; Remove Excess Heat; Provide Main Steam Supply; Protect Core; Protect Containment; Protect Equipment; Maintain Subsystem; Replace Subsystem; Repair Subsystem; Detect System Variances; Respond to All Reactor Events; Respond to All Containment Events; Respond to LOCA; Rx Steam Production; Normal Condenser Operations; Periodic Inspection; Surveillance Testing. Remaining labels are not recoverable.]
Figure 8-9
Notional Top Level Process Graph for a BWR
Table 8-4
Top-Level BWR Goal Table (Partial)
Goals | Description | Attributes | Related Sub-States (from Table 8-3)
Protect Core | Prevent fuel degradation, up to and including core damage | Fuel temp.; Core geometry | Reactor Power; Reactor Coolant
Provide Main Steam Supply | Produce main steam with the required press, temp & flow | Main Steam press.; Main Steam temp.; Main Steam Flow | Main Steam Pressure
Table 8-5
Top-Level BWR Process Table (Partial)
Processes | Description | Attributes | Related Goals (from Table 8-4) | Constraints
Respond to All Reactor Events | Detect reactor events and initiate protective action | Rx Flux; Rx Temp | Protect Core | Coolable Core Geometry
Rx Steam Production | Use reactor heat generation and feedwater supply to make steam | Main steam temp.; Main steam press.; Main steam flow | Provide Main Steam Supply | Core is within expected operating range; Steam production matches main steam demand (i.e., pressure, flow, quality)
[Diagram not reproduced. Recoverable labels: BWR Plant Operations; Overall Plant State; Electric Power Production; Plant Safety; Plant System Readiness; Plant Electric Load Balance; Safety State; Total Electric Power Demand; Scheduled Maintenance Plan; On Condition Maintenance; System Replacement/Upgrades; Turbine Start Up; Turbine Shutdown; Integrated Safety Operations. Remaining labels are not recoverable.]
Figure 8-10
Notional Top-Level BWR Purpose Graph
PGA Step 3: Analyze States and Events
Once the Purpose Graph representations have been assembled, their links and
nodes can be analyzed to reveal system behavior issues. The analysis phase of the
PGA method requires system knowledge and engineering judgment, and is more
effective when performed by a team made up of knowledgeable design
engineering, system engineering, operations, and vendor engineering personnel.
This step in the PGA procedure analyzes the State Graph for the State
Characteristics listed in Table 8-1:
1. State Redundancy: Are there multiple means to determine the Attribute
values that make up a Sub-State, so that loss of an information source does
not prevent determining the Characteristics of the Sub-State?
2. State Interdependence: What attributes of a given Sub-State depend on
other Sub-States, and what Attributes are directly measurable?
3. Attribute Diversity: Are there diverse means to determine the Attribute
values that make up a given Sub-State?
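The redundancy and diversity questions can be approximated by counting information sources per Attribute. The sketch below assumes each Attribute is recorded with a list of (source, kind) pairs; that data layout and the function name are assumptions, not part of the PGA procedure. The four Main Steam pressure switches are used as the example.

```python
def analyze_attribute(sources):
    """sources: list of (source_id, source_kind) pairs for one Attribute."""
    ids = {source_id for source_id, _ in sources}
    kinds = {kind for _, kind in sources}
    return {
        "redundant": len(ids) > 1,   # more than one means of determining the value
        "diverse": len(kinds) > 1,   # more than one kind of means
    }

main_steam_pressure_sources = [
    ("PS1", "pressure switch"),
    ("PS2", "pressure switch"),
    ("PS3", "pressure switch"),
    ("PS4", "pressure switch"),
]

result = analyze_attribute(main_steam_pressure_sources)
```

In this illustration the Attribute is redundant but not diverse, because all four sources are the same kind of device. Judging the strength or depth of each Characteristic still requires engineering judgment beyond a simple count.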
The State Analysis Table is constructed by listing each Sub-State and its related
Attributes, as identified in PGA Step 1, with additional columns for the three
State Characteristics listed above. Each State Characteristic is then analyzed for
its presence and strength (or depth) by examining the State Graph. Continuing
with the three top-level BWR States provided in Table 8-3, a sample State
Analysis Table is provided via Table 8-6:
Table 8-6
Top-Level BWR State Analysis Table
States | Attributes | Redundancy | Interdependence | Attribute Diversity
another State having a value that prevents achieving the critical Goal(s)
associated with the given State.
2. Indirect Goal Interaction: An indirect Goal interaction occurs when the two
Goals are not directly incompatible, but there is no feasible Process that
could start in the State defined by one of the Goals and arrive at the State
required by the other Goal.
Table 8-7
Top Level BWR Goal Analysis Table
In most cases when Direct Goal Interactions are identified (in which two Goals
are incompatible by definition), they are expected on the basis of the system
design. As a result, few Direct Goal Interactions are potentially hazardous. In
Table 8-7 there are no identified Direct Goal Interactions because the listed
Goals are high in the Process Graph; however, several Direct Goal Interactions
are found and described in the worked examples provided in Section 8.4.
Indirect Goal Interactions are much more difficult to design around, and many
Indirect Goal Interactions are potentially hazardous. Two potentially hazardous
Indirect Goal Interactions are highlighted in red in Table 8-7 because for the
Goals of “Electric Power Production” and “Provide Main Steam Supply” there is
no feasible Process that could start in the State defined by either one of these
Goals and arrive at the State required by the “Plant Safety” Goal.
For example, by inspecting Figure 8-10, one can see that the “Electric Power
Production” Goal associates with the “Electric Power Production” State, and the
“Plant Safety” Goal associates with the “Safety” State, and there is no Process that
can support both Goal/State pairs at the same time under all Event conditions
(e.g., LOCA). Of course, this Indirect Goal Interaction is already understood and
recognized in the facility design basis, which requires a turbine trip when there is
a reactor trip in response to design basis events. However, it serves to illustrate
the systematic manner in which the PGA method can reveal potential hazards.
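The test for an Indirect Goal Interaction resembles a reachability question: starting from the State associated with one Goal, can any chain of Processes arrive at the State required by the other? The following is a minimal sketch under the assumption that feasible Process steps are recorded as State-to-State transitions. The transition data and the intermediate State "Turbine Tripped" are hypothetical; the second data set mirrors the turbine trip described above.

```python
from collections import deque

def reachable(transitions, start, target):
    """Breadth-first search over State transitions, each enabled by some Process."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == target:
            return True
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def has_indirect_interaction(transitions, from_state, to_state):
    """Flag the pair when no chain of Processes leads from from_state to to_state."""
    return not reachable(transitions, from_state, to_state)

# Hypothetical transitions with no path to the Safety State.
without_trip = {"Electric Power Production": {"Electric Power Production"}}

# Hypothetical transitions once a turbine trip Process is available.
with_trip = {
    "Electric Power Production": {"Turbine Tripped"},
    "Turbine Tripped": {"Safety"},
}

flagged = has_indirect_interaction(without_trip, "Electric Power Production", "Safety")
resolved = has_indirect_interaction(with_trip, "Electric Power Production", "Safety")
```

The check is directional, so a full analysis would test each ordered pair of Goals of interest.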
Step 5.a) Identify Potentially Hazardous Process Redundancy Characteristics
For Process redundancy analysis, the Process Graph is inspected for Singletons,
which are Goals that have only a single Process defined as a means to meet the
Goal. Not all singletons are potential hazards, since in some cases the Process is
an abstraction that has broad scope to be performed in many ways. For mid to
low level Goals, however, singletons are a sign of lack of redundancy, which may
be hazardous in some contexts.
In the Top Level BWR Process Graph provided in Figure 8-9, there are several
singletons. As examples, the following are noteworthy:
The Plant Safety Goal has a singleton Process, Integrated Safety Operations.
This is an example of a broad scope Process that is not a redundancy concern.
The Provide Main Steam Supply Goal has the singleton BWR Steam
Production. While other forms of steam production are possible, no
redundancy or diversity is expected for this Process.
The Remove Excess Heat Goal has a singleton child Process, Normal
Condenser Operations. This is more problematic, and could be potentially
hazardous; the Process Graph should be examined carefully to ensure that
lower level child Goals of this Process offer strong redundancy or diversity.
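Singleton detection is a straightforward inspection that can be automated once the Process Graph is tabulated. The sketch below assumes a mapping from each Goal to its child Processes, populated with the examples named in this section; the function name is illustrative.

```python
def find_singletons(processes_by_goal):
    """Return the Goals that have exactly one child Process."""
    return sorted(goal for goal, processes in processes_by_goal.items() if len(processes) == 1)

processes_by_goal = {
    "Plant Safety": ["Integrated Safety Operations"],
    "Provide Main Steam Supply": ["BWR Steam Production"],
    "Remove Excess Heat": ["Normal Condenser Operations"],
    "Electric Power Production": [
        "Turbine Start-Up",
        "Steam Turbine Electric Generation",
        "Turbine Shutdown",
    ],
}

singletons = find_singletons(processes_by_goal)
```

The list only identifies candidates. Whether a given singleton is a redundancy concern depends on how abstract its Process is, which remains a judgment for the analysis team.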
When a Goal is not a singleton, the Processes that are identified as being able to
satisfy the Goal (Process siblings) are inspected for Process interdependence. To
be fully independent, the sibling Processes should not have Sub-Goal instances in
common. The highest level of independence is to be mutually exclusive. If two
siblings are not mutually exclusive and they have a Sub-Goal instance in
common with circumstances under which the common Sub-Goal cannot be
satisfied, the Processes with the common Sub-Goal instance will both fail and
may be potentially hazardous. In the Top Level BWR Process Graph provided in
Figure 8-9, the following example of a potentially hazardous Process
Interdependency is seen:
All of the Process children of the Goal “Electric Power Production” have a
common child, “Remove Excess Heat.” If this Goal fails, all Processes
involved with Electric Power Production will fail, including the Turbine
Shutdown Process. This observation makes the three Processes
interdependent, and elevates the significance of the Remove Excess Heat
Sub-Goal.
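Process interdependence among siblings can be screened by intersecting their Sub-Goal sets. In the sketch below, the shared "Remove Excess Heat" Sub-Goal and the assignment of "Meet External Electric Load" to Steam Turbine Electric Generation follow the text; the remaining Sub-Goal assignments are assumptions made only to complete the example.

```python
def shared_sub_goals(sub_goals_by_process):
    """Return the Sub-Goals common to every sibling Process."""
    sets = [set(sub_goals) for sub_goals in sub_goals_by_process.values()]
    return set.intersection(*sets) if sets else set()

# Sub-Goal assignments other than those stated in the text are assumed.
sub_goals_by_process = {
    "Turbine Start-Up": {"Meet House Load", "Provide Main Steam Supply", "Remove Excess Heat"},
    "Steam Turbine Electric Generation": {
        "Meet House Load",
        "Meet External Electric Load",
        "Provide Main Steam Supply",
        "Remove Excess Heat",
    },
    "Turbine Shutdown": {"Remove Excess Heat"},
}

common = shared_sub_goals(sub_goals_by_process)
```

A non-empty result means that failure of any listed Sub-Goal would affect every sibling at once, so the siblings are not fully independent.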
Table 8-8
Top Level BWR Process Interaction Table (Partial)
Not all Process interactions identified in the Process Interaction table will be
potentially hazardous. This is particularly true for higher-level Processes that
have broad scope. For Processes with Sub-Goal interactions and resource
interactions, the hazard potential is the loss of ability to perform the Process
under circumstances where it may be needed.
Side-Effect interactions are among the most difficult to detect and design for,
but the ability of Purpose Graph Analysis to detect these interactions is one of its
major benefits. In many cases, detection and avoidance of side-effect Process
interactions is left to the operations crew and to training and procedures, rather
than explicit system design measures. This approach is marginally successful in
practice; thus side-effect interactions should be considered potentially hazardous
and given careful attention.
Three Processes listed in Table 8-8 show Side-Effect Interactions, and are highlighted
in red because they are considered to be potentially hazardous. Operating experience in
multiple industries has shown that the operations and maintenance staff is not always
well prepared to recognize side-effect interactions in actual operational performance
until equipment damage or personnel injury is imminent.
8.3 Applying the PGA Results
By providing a list of potentially hazardous system design issues, the PGA results
can be used to derive or modify system requirements in order to prevent or
mitigate some hazards; and to leverage existing programs and processes that can
prevent or mitigate other hazards.
Application Development
Each of the potential hazards identified by the PGA method should be evaluated
by a team of knowledgeable individuals responsible for system design, test,
operations, and maintenance activities. The team should decide if each identified
hazard can be eliminated, prevented, or mitigated to a reasonable extent through
one or more defensive measures that are realized through design requirements
and/or plant programs and processes. For guidance on applying defensive
measures in digital I&C systems, see References 20 and 21.
Ideally, this evaluation is performed early, at the conceptual design phase, so that
safety-driven requirements are inserted before detailed design begins. The results
of the PGA analysis should be reviewed again as the detailed design emerges, to
determine if any of the design details substantially altered the control structure
used in the analysis. Of particular concern would be new interfaces or functions
that were not accounted for in the preliminary PGA.
If the PGA method is applied late in a project for some reason, the
owner/operator should be prepared to stop the project and rework the design if
the PGA results clearly indicate potential hazards that are not effectively
eliminated, prevented, or mitigated to a reasonable extent.
Mitigation of Information Degradation
Digital control systems can experience information degradation as a result of
several fundamental issues. While some of the issues are very familiar as faults
and failures, other issues act to potentially degrade information even in the
absence of failures:
Loss of information. Although most often associated with a faulty sensor or
communications device, this type of degradation can also occur if a software
process halts. Loss of information may be intermittent or persistent. There
are different approaches to dealing with suspected loss of information, and
these approaches can produce very different effects.
Incorrect or unexpectedly noisy information. This familiar form of
degradation is also often associated with a faulty sensor or a degraded
communications channel. The incorrect or noisy information may be
intermittent or persistent. In addition, software processes with non-fatal
errors may create incorrect or noisy information.
Sampling. Digital systems require that the continuous time real world
information is sampled to produce discrete digital values. Both the sampling
frequency and the sampling precision can result in information degradation.
While the speed and precision of digital systems today serve to reduce this
issue, sampling is still a source of information degradation for rapidly
fluctuating information.
Information incompleteness. For even modest systems, it is impractical to
attempt to sense all of the important information about the system.
Assumptions and design choices must be made about how many sensors will
be used and where they will be located in an effort to capture system state
information. Similarly, software algorithm design must also select the specific
input parameters that will be used in its calculations, nearly always a subset of
the information that is available.
Synchronicity. The information in a digital control system is distributed
across time as a result of sampling, communications and calculation times.
Because the state of the system cannot be determined simultaneously across
all of its components, information degradation results, even when the system
is operating perfectly. The greater the mismatch between the fundamental
dynamics of the controlled system and its digital controls, the greater the
information degradation.
Estimating, smoothing and filtering. To offset the effects of sampling,
incompleteness and the lack of synchronicity, digital control systems
commonly use software algorithms for estimating, smoothing and filtering of
information. These algorithms can be helpful under normal conditions, but
may mask important changes at the extremes or edges of their performance.
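The masking effect of smoothing can be seen with a simple moving average, used here only as one illustrative filter. In the sketch below the signal values, window length, and alarm limit are all arbitrary assumptions chosen to show a one-sample excursion that exceeds the limit in the raw data but not in the filtered data.

```python
def moving_average(samples, window):
    """Trailing moving average; early outputs use however many samples exist so far."""
    out = []
    for i in range(len(samples)):
        start = max(0, i - window + 1)
        chunk = samples[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

ALARM_LIMIT = 30.0          # arbitrary units, assumed for illustration

raw = [10.0] * 10
raw[5] = 50.0               # one-sample excursion above the limit

smoothed = moving_average(raw, window=5)

raw_exceeds = max(raw) > ALARM_LIMIT
smoothed_exceeds = max(smoothed) > ALARM_LIMIT
```

With a five-sample window the excursion is averaged down to 18.0, so logic that acts only on the filtered value would not see the limit being crossed.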
Unrecognized and unmitigated information degradation can result in an incorrect
understanding of the situation, and subsequently, in inappropriate selection of goals and
activation of processes leading to an accident or loss, even in the absence of a failure
condition. Three main approaches have been used to mitigate information degradation:
Redundancy. By having multiple identical sensors or communications
channels or software processes, the likelihood of some forms of information
degradation can be reduced. Redundancy is particularly aligned with loss of
information or incorrect and noisy information that results from a random
disturbance or failure mechanism, whether intermittent or persistent.
Diversity. By having multiple, diverse sensors or communications channels or
software processes, the likelihood of information degradation can be reduced.
Diversity has a broader effect than redundancy, but a much higher cost in
terms of system complexity.
Independence. By isolating a flow of information from other sources of
information, the effects of degradations in the other sources cannot
propagate to the isolated, independent flow. While independence can reduce
the effects of sensor, communications and software failures, it does little to
offset the other sources of information degradation in the independent flow.
These three mitigation approaches are not independent of each other and have
limitations to their effectiveness. The hazards due to information degradation in a
system can be assessed from the PGA results by considering the extent to which
these three mitigations are present in the State Graph for the DCS. In Table 8-9
below, some of the combinations of redundancy, diversity and independence are
discussed from a mitigation effectiveness view and a cost and complexity view.
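The limits of redundancy can be illustrated with median selection across identical channels. Median selection is an assumed technique chosen for this sketch, not one prescribed by this guideline, and the readings are made-up values in arbitrary units.

```python
from statistics import median

def select_reading(readings):
    """Median selection across redundant channels (assumed technique for illustration)."""
    return median(readings)

healthy = [100.2, 99.8, 100.0]
one_lost = [100.2, 99.8, 0.0]         # one channel fails low
common_bias = [110.2, 109.8, 110.0]   # the same disturbance shifts every channel

healthy_value = select_reading(healthy)
one_lost_value = select_reading(one_lost)
biased_value = select_reading(common_bias)
```

The median rejects the single failed channel, but when every identical channel is shifted by the same disturbance the selected value carries the full error. That is the kind of degradation for which diversity, rather than redundancy alone, would be considered.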
Table 8-9
Alternatives for Mitigating Information Degradation
Provide Input to Other Hazard Analysis Methods
Because PGA results are focused on the identification of systematic hazards, the
results can be used to provide a more focused approach when other methods may
be applied.
For example, the FMEA method requires a bottom-up analysis of all devices in a
component, or all components in a system, which can become very large, time
consuming, and costly if there is a large number of devices or components. If the
PGA method is applied first, then an FMEA can focus exclusively on the devices
or components that could cause or contribute to hazards, and then determine the
failure modes or failure mechanisms that could lead to such hazards.
The following examples of the PGA method are provided, using the same
example systems used throughout this guideline.
Example 8-1. HPCI Turbine Controls PGA
The hypothetical turbine control system digital upgrade project examined in
Example 4-2 (Figure 4-6) is also examined here, this time using the PGA method.
This example limits the analysis to the HPCI system for expediency. Table 4-4 from
Example 4-2 satisfies the prerequisite for a Function Analysis in this example.
PGA Step 1: Construct the State Graph
In Section 8.2, the Purpose Graph Analysis procedure was illustrated with a top-
level State Graph and Process Graph for a notional Boiling Water Reactor system.
This example extends the top-level BWR State Graph and Process Graph to the
HPCI and RCIC systems.
The safety functions of the HPCI and RCIC systems are identified in Example 4-4
(Fault Tree Analysis) as 1) maintain reactor coolant inventory, 2) maintain primary
coolant system integrity, and 3) containment isolation. In addition, spurious
activation of the HPCI system could result in low reactor water temperature, leading
to a reactor trip on high flux.
A preliminary HPCI State Graph is provided in Figure 8-11. It is an extension of the
top-level BWR State Graph provided in Figure 8-4, which identifies sub-states that
are associated with the reactor state, main steam state, reactor coolant inventory
state, feedwater state and reactor control state. The Reactor Coolant Inventory sub-
state can be seen to depend on the sub-states for the Feedwater Pumps and on the
HPCI Performance State. The Reactor Coolant Inventory State is, in turn, directly
related to the production of electric power, but is also part of the safety state of the
plant. In addition, the HPCI system state is a part of the safety system readiness sub-
state of the plant.
The HPCI Observables Table is provided via Table 8-10; note that the Observables
are identified simply by inspecting Figure 4-6 for equipment that provides state
indications and other measured values.
The HPCI States and Events Table is provided in Table 8-11. Each sub-state in the
State Graph is defined in terms of its attributes and values. Associated with each
sub-state are the events of interest that can be detected and reported from the sub-
state. For some sub-states, not all known events are shown in this table due to the
focus on the HPCI system.
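One possible way to hold a row of the States and Events Table in software is sketched below in Python. The `SubState` class and its `validate` method are hypothetical names introduced for illustration; the attribute values and events are taken from the HPCI Operational row of Table 8-11.

```python
from dataclasses import dataclass, field


@dataclass
class SubState:
    name: str
    attributes: dict                      # attribute name -> allowed values
    events: list = field(default_factory=list)

    def validate(self, reading):
        """Return the attribute names whose reported value is not allowed."""
        return [
            key for key, value in reading.items()
            if key not in self.attributes or value not in self.attributes[key]
        ]


hpci_operational = SubState(
    name="HPCI Operational",
    attributes={
        "Demand": ("ON", "OFF"),
        "Operating State": ("Tagged-out", "Ready", "Under Test",
                            "Operating", "Tripped"),
        "Coolant source": ("CST", "Suppression Pool"),
        "Output": ("reactor feed", "recirculation"),
    },
    events=["HPCI Trip", "HPCI not operational"],
)

# "Standby" is not one of the Operating State values listed in the table.
bad = hpci_operational.validate({"Demand": "ON", "Operating State": "Standby"})
print(bad)
```

Keeping the allowed values next to each attribute makes it straightforward to flag a reported value that the table does not define.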
PGA Step 2: Construct the Process Graph
The preliminary Process Graph is provided in Figure 8-12 as an extension of the
top-level BWR Process Graph provided in Figure 8-9. While the primary purpose of
the HPCI system is to perform safety functions, it also interacts with Plant Readiness
goals and processes. Furthermore, in the event of spurious operations, the HPCI
system has the potential to cause low reactor coolant temperature, leading to a
high flux reactor trip.
The HPCI Operation is a Process that is connected to several Goals. It is composed
of sub-goals and lower level processes that describe the manner of operation of the
HPCI system. Because of the relationships between the HPCI system and the
feedwater system, for the purpose of providing Reactor Coolant Inventory, the
Processes for the feedwater system are included in this example.
In Figure 8-12, Normal HPCI Operation can be used to satisfy the Reactor Coolant
Inventory goal during the process BWR Steam Production as a non-exclusive
alternative to Feedwater Pump Operation. Normal RCIC Operation is also a non-
exclusive alternative to Feedwater Pump Operation and Normal HPCI Operation.
Also, note that the CST Inventory process is used by both the Feedwater Pump
Operation process (via the Condensate Pump Operation process and the Hotwell
Level) and the Normal HPCI Operation process as one of its potential sources for
coolant.
In addition to the Normal HPCI Operation process, there is a second process HPCI
Surveillance Test that shares many of the sub-goals of the Normal HPCI Operation
process, except that the coolant is re-circulated rather than directed to the reactor
coolant inventory. This second way to operate the HPCI is connected to the goal of
testing the HPCI as a part of the process of Surveillance Testing of safety critical
subsystems.
Table 8-12 provides the Goal Table for the HPCI Process Graph, and Table 8-13
provides the Process Table. The HPCI Goal Table and the HPCI Process Table
include, respectively, higher level goals or higher level processes from the top-level
BWR Process Graph provided in Figure 8-9.
The finished HPCI Purpose Graph (State and Process Graphs side-by-side) is
provided in Figure 8-13.
PGA Step 3: Analyze States and Events
The analysis of the HPCI State Graph considers State Redundancy, State
Interdependence, and State Diversity. This example is focused on the “HPCI
Operational” and “HPCI Performance” States.
Potentially Hazardous State Characteristics
The results are listed in Table 8-14, and summarized below:
1. The HPCI governor and positioner and the HPCI system in general show low or
non-existent levels of redundancy. This result is not surprising because the HPCI
system is a single train in the larger set of Emergency Core Cooling Systems
(ECCS) where redundancy is demonstrated among multiple systems.
2. The HPCI design also shows low levels of diversity and higher levels of
interdependency for states and processes.
PGA Step 4: Analyze Goals
The analysis of the HPCI Process Graph considers Direct Goal Interactions and
Indirect Goal Interactions. In this example, the Goals listed in Table 8-15 are
compared pair-wise to the goals listed in Table 8-12 (HPCI Goal Table).
Potentially Hazardous Goal Interactions
Table 8-15 lists the resulting Direct and Indirect Goal Interactions. The direct goal
interactions noted for HPCI were easily recognized. The indirect goal interactions
are of greater interest, revealing four such interactions that are potentially
hazardous and would be assessed for design alternatives or defensive measures:
The success of HPCI Rated Flow Achieved goal may interfere with the
Feedwater Temperature goal, an issue that was noted in the Top Down
analysis (Section 1).
Under some conditions, the HPCI Off-line goal could result in reduced ability to
satisfy the Reactor Coolant Inventory goal.
Under some conditions, the HPCI Water Supply goal may interfere with the use
of CST Inventory to meet the Hotwell level goal.
If there was reduced Main Steam Supply provided by the reactor, the HPCI
Steam Supply goal may not be met.
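The search for indirect goal interactions can be approximated as a search for goal pairs that draw on a common process. The Python sketch below uses a small, illustrative subset of the goal-to-process relationships described in this example; the `supports` mapping and the function name are assumptions, and a shared process only marks a candidate for review, not a confirmed hazard.

```python
from itertools import combinations

# Illustrative subset: goal -> processes it draws on, based on the
# interactions described in the text. This is not the full Process Graph.
supports = {
    "HPCI Water Supply": {"CST Inventory", "Suppression Pool Inventory"},
    "Hotwell Level": {"CST Inventory"},
    "HPCI Steam Supply": {"HPCI Main Steam Supply"},
}


def indirect_candidates(goal_to_processes):
    """Return goal pairs that share at least one supporting process."""
    found = []
    for a, b in combinations(sorted(goal_to_processes), 2):
        shared = goal_to_processes[a] & goal_to_processes[b]
        if shared:
            found.append((a, b, sorted(shared)))
    return found


candidates = indirect_candidates(supports)
print(candidates)
```

With this subset, the only pair reported is HPCI Water Supply and Hotwell Level, which share CST Inventory.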
PGA Step 5: Analyze Processes
The Processes in the HPCI Process Graph (Figure 8-12) are analyzed for Sub-Goal,
Resource and Side-Effect interaction issues through an analysis of pair-wise
combinations of the Processes listed in Table 8-13. The results are listed in Table 8-
16.
Potentially Hazardous Process Redundancy Issues
For process redundancy analysis, the Process Graph is inspected for Singletons,
which are goals that have only a single process identified as a means to meet the goal.
Not all singletons are a cause for concern, since in some cases the process is an
abstraction that has broad scope to be performed in many ways. For mid- to low-
level goals, however, singletons are a sign of a lack of redundancy. For the HPCI
Process Graph, there are two singletons related to the HPCI:
The HPCI Steam Supply goal has only a single process, Main Steam Supply,
for satisfying the goal.
The HPCI Rated Flow Achieved goal has only the governor-positioner process
as its means to satisfy the goal.
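The singleton inspection lends itself to a simple check. The Python sketch below assumes the Process Graph has been reduced to a mapping from each goal to the processes that can satisfy it; the mapping shown is an illustrative subset built from the Related Goals column of Table 8-13, and the function name is hypothetical.

```python
# Illustrative subset: goal -> processes identified as able to satisfy it.
satisfied_by = {
    "Reactor Coolant Inventory": ["Feedwater Pump Operation", "HPCI Operation"],
    "HPCI Water Supply": ["CST Inventory", "Suppression Pool Inventory"],
    "HPCI Steam Supply": ["HPCI Main Steam Supply"],
    "HPCI Rated Flow Achieved": ["Govern Steam to HPCI Turbine"],
    "Steam Supply Stopped": [
        "Close Trip Throttle Valve",
        "Close Governor Valve",
        "Close Steam Admission valve",
    ],
}


def singletons(goal_to_processes):
    """Return the goals that have exactly one process able to satisfy them."""
    return sorted(
        goal for goal, processes in goal_to_processes.items()
        if len(processes) == 1
    )


single = singletons(satisfied_by)
print(single)
```

The check only lists candidates; deciding whether a given singleton is a concern still depends on how abstract the process is, as noted above.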
Potentially Hazardous Process Interdependency Characteristics
When a goal is not a singleton, the processes that are identified as being able to
satisfy the goal (process siblings) are inspected for process interdependence. To be
fully independent, the sibling processes should not have sub-goal instances in
common. If a sub-goal instance is in common and circumstances exist under which
the common sub-goal cannot be satisfied, the processes with the common sub-goal
instance will both fail.
For the HPCI, the processes HPCI Operation and HPCI Surveillance Test, although
not siblings, have 3 common sub-goals. In this case, the process interdependence is
desirable, since a failure of the HPCI Surveillance Test is intended to reveal
problems with the HPCI Operation process. These two processes are not siblings of
the same goal, and their lack of independence does not reduce any redundancies.
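A minimal Python sketch of the sibling interdependence check follows. Because the three sub-goals shared by HPCI Operation and HPCI Surveillance Test are not named in the text, the process and sub-goal names below are purely hypothetical placeholders, as is the function name.

```python
from itertools import combinations


def shared_subgoals(subgoals_of, siblings):
    """Map each pair of sibling processes to the sub-goals they share."""
    result = {}
    for a, b in combinations(sorted(siblings), 2):
        common = set(subgoals_of[a]) & set(subgoals_of[b])
        if common:
            result[(a, b)] = common
    return result


# Hypothetical sibling processes and sub-goals, for illustration only.
subgoals_of = {
    "Process P1": {"Sub-goal X", "Sub-goal Y"},
    "Process P2": {"Sub-goal Y", "Sub-goal Z"},
    "Process P3": {"Sub-goal W"},
}
pairs = shared_subgoals(
    subgoals_of, ["Process P1", "Process P2", "Process P3"]
)
print(pairs)
```

Any pair that appears in the result would fail together if the common sub-goal could not be satisfied, so it does not provide full redundancy for the goal the siblings serve.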
Potentially Hazardous Process Interaction Characteristics
The Processes in the HPCI Process Graph are analyzed for Sub-Goal Interactions,
Resource Interactions, and Side-Effect Interactions. In this example, the HPCI
processes listed in Table 8-16 are compared pair-wise to the processes listed in the
HPCI Process Table (Table 8-13).
Most of the process interactions found for the HPCI were easily understood as
being either incompatible processes by design, or as processes with known shared
resources. Only the side-effect interactions were found to be potentially
hazardous, in two cases:
HPCI Operation could have a side effect on Reactivity Management as a result
of the lack of pre-heating for the HPCI coolant flow, leading to a high-flux trip.
HPCI Operation could have a side effect on Condensate Feed Pressure as a
result of low hotwell levels if the CST is used to supply the HPCI instead of
supplying the hotwell makeup.
None of these process interactions were the result of the proposed digital upgrade
for the HPCI turbine controls illustrated in Figure 4-6.
Figure 8-11
HPCI State Graph
(Graphic not reproduced. The legend distinguishes State and Observable nodes. Labeled states include Main Steam, Reactor Safety, Safety Systems Readiness, Reactor Power, Main Steam Pressure, Reactor Control, Reactor Coolant, Main Feedwater, HP Feedwater Heater, LP Feedwater Heater, Feedwater Pump, Turbine Extraction Steam, Condensate Pump, Hotwell, CST, HPCI Performance, and HPCI Operational.)
Table 8-10
HPCI Observables
Observables | Description | Links to Sub-States
HPCI Turbine Steam Admission Valve Position | Sensed from a limit switch on the Steam Admission Valve. HPCI does not directly sense the System Initiation Signal. | HPCI Operational
HPCI Measured flow | Sensed from flowmeter on pump output | HPCI Performance
HPCI Flow Setpoint | Provided by Operator at FIC | HPCI Performance
Turbine Speed Demand | Output of FIC (Auto or Manual) | HPCI Performance
Measured Turbine Speed | Sensed by mag. pickup on turbine shaft | HPCI Performance
HPCI Trip/Throttle Valve Position | Manual valve | HPCI Operational
HPCI Governor Valve Position | Sensed from actuator resolver | HPCI Performance
Table 8-11
HPCI States & Events
States | Description | Attributes | Events
Main Steam | The state of the main steam being generated by the Rx | Steam flow; Steam temperature | Not analyzed in this example
Reactor Safety | The overall state of reactor safety | Not analyzed in this example | Not analyzed in this example
Safety System Readiness | The overall state of readiness of the safety systems | Not analyzed in this example | Not analyzed in this example
Reactor Power | The thermal power and reactivity state of the reactor | Reactivity; Temperature; Pressure; Coolant flow; Void fraction | High drywell pressure; High flux state; High reactor temperature
Main Steam Pressure | The pressure of the steam within the main steam lines | Main steam pressure; Safety relief valve positions | Stuck Safety Relief Valve
Reactor Control | The overall state of the reactor controls | Not analyzed in this example | Not analyzed in this example
Reactor Coolant | The state of the coolant flowing through the reactor core | Reactor coolant level; Main FW Temp; Main FW Flow; Rx Recirculation Flow; HPCI flow; RCIC flow | Low-Low Rx Water Level; High Rx Water Level; LOCA
Main Feedwater | The state of the feedwater at the reactor vessel | Main FW Temp; Main FW Pressure; Main FW Flow | Low feedwater pressure
High Pressure Feedwater Heater | The state of the HP Feedwater heating process | Main FW Temp; Main FW Pressure; Main FW Flow; Extraction Steam Flow | Not analyzed in this example
Low Pressure Feedwater Heater | The state of the LP Feedwater heating process | Main FW Temp; Main FW Pressure; Main FW Flow; Extraction Steam Flow | Not analyzed in this example
Turbine Extraction Steam | The state of the turbine extraction steam to the feedwater heaters | Extraction steam temp; Extraction steam pressure | Not analyzed in this example
Feedwater Pump | The operational state of the feedwater pumps | Feedwater pump 1; Feedwater pump 2; Feedwater supply pressure; Recirculation valves | Feedwater pump not operational; Low supply pressure to feedpump
Condensate Pump | The operational state of the condensate pumps | Condensate Pump 1; Condensate Pump 2; Recirculation valves | Condensate pump not operational
Hotwell | The state of the hotwell of condensate that feeds the condensate pumps | Hotwell level; Hotwell temperature | Hotwell low level; Hotwell high level; Excessive hotwell temperature
Condensate Storage Tank | The state of the condensate storage tank (CST) | CST Level; CST Temperature | Low CST Level
HPCI Operational | The sub-state that describes the operation of HPCI | Demand (ON, OFF); Operating State (Tagged-out, Ready, Under Test, Operating, Tripped); Exception (overspeed, low suction, unexpected operation, DC power out); Coolant source (CST, Suppression Pool); Output (reactor feed, recirculation) | HPCI Trip; HPCI not operational
HPCI Performance | The sub-state that describes the performance of the HPCI | Main steam valve position (Open, Closed); HPCI flow setpoint (value); HPCI measured flow (value); HPCI turbine speed demand (value); HPCI measured turbine speed (value); HPCI governor valve position (value) | Turbine Overspeed; Low flow; Unexpected turbine operation; Failed governor valve actuator; High turbine outlet pressure
Figure 8-12
HPCI Process Graph
(Graphic not reproduced. Labeled nodes include Main Steam Supply Provided, Remove Excess Heat, Reactivity, Reactivity Management, Feedwater Temperature, Reactor Coolant Inventory, Detect HPCI Variances, HPCI Off-line, and Apply High Pressure Heating.)
Table 8-12
HPCI Goals
Goals | Description | Attributes | Related Sub-States from Table 8-11
Main Steam Supply | Produce main steam with the pressure, temperature and flow specified | Main Steam Press; Main Steam Temp; Main Steam Flow | Main Steam
Remove Excess Heat | Prevent the heat from the reactor and steam from reaching dangerous levels | None identified in this analysis | None identified in this analysis
Reactivity | Meet the specified reactivity parameters | Reactivity | Reactor Power State
Main Feedwater Temperature | Provide Feedwater at the desired temperature from the high pressure feedwater heating process | Main FW Temp | Main Feedwater
Reactor Coolant Inventory | Maintain desired coolant levels in the reactor | Reactor Coolant Level | Reactor Coolant
HPCI Off-line | Return HPCI to an off-line condition | None identified in this example | HPCI Operational
Detect HPCI Variances | Develop evidence that the HPCI system is completely functional | None identified in this example | HPCI Operational
Low Pressure Feedwater Preheat | Provide the desired feedwater preheat at the low pressure heating stage | Feedwater temp; Feedwater press; Feedwater flow | Low Pressure Feedwater Heater
Feedwater Pressure | Provide desired feedwater pressure and flow into the reactor and into the high pressure feedwater heater | Feedwater press; Feedwater flow | Feedwater
Turbine Extract Steam State | Provide desired steam pressure, flow & temperature to the FW heaters | Steam pressure; Steam temperature | Turbine extraction steam state
Feedwater Supply Pressure | Sufficient supply pressure to the Feedwater pumps for safe operation | Feedwater supply pressure | Feedwater Pump
Feedwater Pumps | Having the Feedwater pumps operating correctly in the desired state | Feedwater pumps status | Feedwater Pump
Condensate Pumps | Having the Condensate pumps operating correctly in the desired state | Condensate pumps status | Condensate Pump
Hotwell Level | Maintain the desired level of coolant in the hotwell | Hotwell condensate level; Hotwell condensate temperature | Hotwell
HPCI Water Supply | HPCI pump has adequate water supply and suction | Source; Suction | HPCI Operational
HPCI Turbine Steam Supply | HPCI turbine has sufficient steam supply | None identified in this example | HPCI Operational
HPCI Steam Supply Stopped | Stop flow of steam to HPCI turbine mechanisms | None identified in this example | HPCI Operational
Rated HPCI Flow Achieved | Flow produced by the HPCI meets rated flow | HPCI measured flow; desired HPCI flow demand | HPCI Performance
HPCI Water Recirculated | The desired destination of HPCI pump output is recirculated to source | Destination | HPCI Operational
Table 8-13
HPCI Processes
Processes | Description | Attributes | Related Goals from Table 8-12
BWR Steam Production | Use reactor heat generation and feedwater supply to make steam | Main steam temperature; Main steam pressure; Main steam flow | Main Steam Supply
Emergency Core Cooling | Keep reactor at safe temperature during transients | Reactor core temperature | Remove Excess Heat
Normal Condenser Operations | Use main high and low pressure condensers to condense remaining steam | None identified in this analysis | Remove Excess Heat
Shutdown Equipment | Stop the operation of an item of equipment | Equipment Item | Protect Equipment
Surveillance Testing | Conduct tests of safety systems to determine that the systems are fully functional | Completion date | Detect System variances
Reactivity management | Use feedwater flow and temperature to control reactor heat generation | Feedwater temperature; Feedwater flow; Void fraction | Reactivity
High Pressure Feedwater Heater | Use turbine extraction steam to pre-heat the feedwater to the desired temperature | Feedwater temperature | Feedwater Temperature
Feedwater Recirculation Pumps | Use the pumps to re-circulate feedwater in the reactor core | Feedwater flow | Feedwater Pressure
Feedwater Pump Operation | Use the Feedwater pumps to provide feedwater level, pressure and flow to the reactor | Feedwater level; Feedwater pressure; Feedwater flow | Feedwater Pressure; Reactor Coolant Inventory
Condensate Pump Feed Pressure | Use the Condensate pumps to provide supply pressure to the Feedwater pumps | Feedwater supply pressure | Feedwater Supply Pressure
HPCI Operation | Operate HPCI to supply water to reactor | HPCI flow setpoint; HPCI measured flow | Reactor Coolant Inventory
HPCI Surveillance Test | Operate HPCI with pump output re-circulated | HPCI flow setpoint; HPCI measured flow | Detect HPCI Variances
Govern Steam to HPCI Turbine | Use the HPCI governor and positioning controllers to position the turbine governor valve to control steam to turbine | HPCI flow setpoint; HPCI measured flow; HPCI Turbine speed demand; HPCI measured Turbine speed | HPCI Rated Flow Achieved
HPCI Main Steam Supply | Use the Steam Admission Valve to allow main steam pressure to the HPCI turbine | Steam Admission valve position | HPCI Steam Supply
CST Inventory | Draw coolant for the HPCI pump from the CST | HPCI source | HPCI Water Supply
Suppression Pool Inventory | Draw coolant for the HPCI pump from the suppression pool | HPCI source | HPCI Water Supply
Close Trip Throttle Valve | Close the trip throttle valve or governor valve to stop HPCI turbine operation | HPCI Turbine speed | Steam Supply Stopped
Close Governor Valve | Close the governor valve to stop HPCI turbine operation | HPCI Turbine speed | Steam Supply Stopped
Close Steam Admission valve | Close the Main Steam Admission valve to stop HPCI turbine operation | HPCI Turbine speed | Steam Supply Stopped
Figure 8-13
HPCI Purpose Graph
(Graphic not reproduced. The figure places the HPCI State Graph and the HPCI Process Graph side by side; the legend distinguishes State, Observable, Goal, and Process nodes.)
Table 8-14
HPCI State & Events Analysis Results
States | Attributes | Redundancy | Inter-dependence | Diversity
HPCI Operational | Demand (ON, OFF); Operating State (Tagged-out, Ready, Under Test, Operating, Tripped); Exception (overspeed, low suction, unexpected operation, DC power out); Coolant source (CST, Suppression Pool); Output (reactor feed, recirculation) | There is an apparent low amount of redundancy in the state information. | The State (position) of the Main Steam Admission Valve limit switch represents the State of the System Initiation signal (On or Off), resulting in a very high interdependence | There are many ways to influence the “HPCI Operational” State (five different Observables)
HPCI Performance | Steam admission valve position (Open, Closed); HPCI flow setpoint (value); HPCI measured flow (value); HPCI turbine speed demand (value); HPCI measured turbine speed (value); HPCI governor valve position (value) | There is an apparent low amount of redundancy in the state information. | The performance of the HPCI system depends on a few highly related information sources | There is only one way to influence the “HPCI Performance” State (via the “HPCI Operational” State)
Table 8-15
HPCI Goal Interactions
Goals | Direct Goal Interactions | Indirect Goal Interactions
Detect HPCI Variances | HPCI Off-line | None
HPCI Off-line | Detect HPCI Variances | Reactor Coolant Inventory
HPCI Rated Flow Achieved | HPCI Water Recirculated; Steam Supply Stopped | Feedwater Temperature
HPCI Steam Supply | HPCI Steam Supply Stopped | Main Steam Supply Provided
HPCI Water Supply | None | Hotwell Level
HPCI Water Recirculated | HPCI Rated Flow Achieved; Steam Supply Stopped | None
HPCI Steam Supply Stopped | HPCI Steam Supply; HPCI Rated Flow Achieved | None
Figure 8-14
One of the Indirect Goal Interactions in the HPCI System
Table 8-16
HPCI Process Interactions
Example 8-2. CWS Control System PGA
The hypothetical Circ Water System control system examined in Example 4-3 (Figure
4-7 and Figure 4-8) is also examined here, this time using the PGA method. Table 4-
7 from Example 4-3 satisfies the prerequisite for a Function Analysis in this example.
PGA Step 1: Construct the State Graph
In Section 8.2, the Purpose Graph Analysis procedure was illustrated with a top-
level State Graph and Process Graph for a notional Boiling Water Reactor system.
This example extends the top-level BWR State Graph and Process Graph to the
CWS system. The function of the CWS system is to remove excess heat from the
plant and exchange it with the ultimate heat sink.
The State Graph for this example is provided in Figure 8-15, which omits some of
the state information in order to keep the state graph drawing from becoming
cluttered with repeated detail. The sub-states for the High Pressure Condensers and
the Low Pressure Condensers depend upon the Circulating Water Flow sub-state, as
well as the Turbine State and Bypass Steam State. Similarly, the CWS Division A
sub-state can be seen to depend on the sub-states for the CWS Pump Train A1 sub-
state, the other 2 Division A pump train states (not shown in the Figure) and on the
Comm Channel State.
As noted, the CWS has two divisions, A and B, each with three pump trains for a
total of 6 pump trains. The sub-states of the pump trains are shown for only a single
pump, Pump A1. The other 5 pump trains (2 additional in Division A, and 3 in
Division B) are identical in their sub-state structure to that shown for Pump A1.
Also for simplicity, the Comm Channel State sub-state for each division includes
the state of both Channel 1 and Channel 2. The Controller State sub-state also
includes the state of both Controller A and Controller B, as well as the current
assignment of the Master for the two controllers.
The CWS Observables table is provided in Table 8-17, and the State table is
provided in Table 8-18.
PGA Step 2: Construct the Process Graph
The top-level BWR Process Graph provided in Figure 8-5 illustrates the three main
goals of Electric Generation, Plant Safety and Plant Readiness. These goals are
supported by processes with sub-goals and sub-processes as described in Section
8.2. The CWS supports electric power production and influences safety functions,
represented by the Goal to Remove Excess Heat. However, the CWS also interacts
with Plant Readiness goals and processes.
The preliminary Process Graph is shown in Figure 8-16. The CWS operation is a
process that is connected to goals for Low Pressure Condenser and High Pressure
Condenser operating conditions. It, in turn, is composed of sub-goals and lower level
processes that describe the manner of operation of the CWS as described in this
report. Again, because of the relationship between the CWS purpose of Removing
Excess Heat and the process of creating condensate, which in turn is the
source for the feedwater, the processes for the feedwater system are included in this
analysis.
In Figure 8-16, CWS Operation is used to satisfy the LP Condenser Conditions goal
and the HP Condenser Conditions goal that are part of the processes for LP
condenser Operations and HP Condenser Operations, and ultimately the goal to
Remove Excess Heat. Also, note that the Hotwell Management process is influenced
by the performance of the condensers, which are influenced in turn by the
performance of the CWS.
In addition to the CWS Operation process, there is a second process “Shutdown
CWS Component” that is connected to the readiness goals for the plant’s non-safety
systems via sub-goals for Repair Subsystem and Service Subsystem.
To facilitate the discussion of the Process Graph, it is helpful to use two tables, one
for Goals and one for Processes. Table 8-19 provides the Goal Table for the CWS
Process Graph, and Table 8-20 provides the Process Table. The CWS Goal Table
and the CWS Process Table include, respectively, higher level goals or higher level
processes from the top-level BWR Process Graph provided in Figure 8-5.
The finished Purpose Graph, which is a juxtaposition of the State Graph and the
Process Graph, is provided in Figure 8-17.
PGA Step 3: Analyze States and Events
The analysis of the CWS State Graph considers State Redundancy, State
Interdependence, and State Diversity. As with the STPA method, potential hazards
can include any losses that are considered unacceptable, including lost generation.
Potentially Hazardous State Characteristics
The results are listed in Table 8-21. The CWS is a moderately complex subsystem,
with many sources of data and many options for configuring its subsystem
components. For the higher-level states, there are generally multiple sources for data,
with moderate degrees of direct measurement and dependence on other sub-state
values. In most cases, there are diverse means of determining state values. At lower
levels of state, there is less redundancy and diversity, indicating potential hazards,
but less scope of influence of the state values. The following are selected as
representative of this observation:
Pump train MOV state. Because there are multiple limit switches to sense MOV
position, there is redundancy. Since all switches operate in the same manner,
there is no diversity.
Digital Input (DI) State and Digital Output (DO) State. Determining the state of
these components is partially measurable and partially dependent on the state
of other components. In some cases, the DI or DO may be able to report their
state over the Comm channels, but in other cases, another component (such as
the Controller acting as master) may need to query the component to infer its
state. All understanding of the state of DI or DO must be sent over the Comm
Channels.
PGA Step 4: Analyze Goals
The analysis of the CWS Process Graph considers Direct Goal Interactions and
Indirect Goal Interactions. In this example, the Goals listed in Table 8-22 are
compared pair-wise to the Goals listed in Table 8-19 (CWS Goal Table).
Potentially Hazardous Goal Interactions
Table 8-22 lists the resulting Direct and Indirect Goal Interactions. The direct goal
interactions noted for CWS were easily recognized. The indirect goal interactions
were of greater interest, and revealed 3 such interactions that could be assessed as
potential hazards:
Repair and servicing of CWS components. Because of the interactions between
the goals for CWS configurations involving both Division A and B pump trains,
opportunities to service or repair more than one CWS component at a time,
including those components in the digital control subsystem, must be carefully
considered.
Heat removal and condenser operations. Sub-goal changes within the CWS can
affect heat removal and the balance of condenser operating conditions across
the HP and LP condensers. As an example, a change in the number of CWS
pumps that are on-line may cause condenser vacuum and temperature
transients.
The CWS Flow goal can influence the amount and temperature of condensate
that is collected in the Hotwell, and in turn, the supply of condensate to the
Feedwater system.
The digital design for the CWS provides opportunities to trigger some of the goal
interactions because of its effects on the number of pumps that are online at one
time. In particular, if a pump train digital output component loses its
communications, its shelf state is to close the MOV and trip the associated pump.
This may produce condenser vacuum and temperature transients resulting in a plant
trip, particularly if the entire I/O cabinet for a Division has lost communications and
all of its pumps are tripped.
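The effect of the shelf state on the number of pumps online can be illustrated with a small Python sketch. The class and function names are hypothetical, and the sketch assumes that all six pump trains are running beforehand and that any train whose digital output has lost communications ends up tripped, as the shelf state description above suggests.

```python
from dataclasses import dataclass


@dataclass
class PumpTrain:
    name: str
    division: str
    comm_ok: bool = True

    @property
    def running(self):
        # Shelf state from the text: on lost communications the digital
        # output closes the MOV and trips the associated pump.
        return self.comm_ok


def build_cws():
    """Two divisions (A and B), each with three pump trains."""
    return [
        PumpTrain(f"{division}{index}", division)
        for division in ("A", "B")
        for index in (1, 2, 3)
    ]


def lose_division_comms(trains, division):
    """Model an I/O cabinet for one division losing communications."""
    for train in trains:
        if train.division == division:
            train.comm_ok = False


trains = build_cws()
lose_division_comms(trains, "A")
running = [train.name for train in trains if train.running]
print(running)
```

In this simplified model, losing communications for Division A leaves only the three Division B pumps online, which is the kind of abrupt change in circulating water flow that could produce the condenser transients described above.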
PGA Step 5: Analyze Processes
The Processes in the CWS Process Graph (Figure 8-16) are analyzed for Sub-Goal,
Resource and Side-Effect interaction issues through an analysis of pair-wise
combinations of the Processes listed in Table 8-20 (CWS Process Table). The results
are listed in Table 8-23.
Potential Process Hazards
While many of the process interactions found for the CWS were easily understood
as being either incompatible processes by design, or as processes with known
shared resources, there were several notable process interactions that could be
assessed further as potential hazards:
Communication channels as a resource. Because of the central role of the
Communication Channels in setting and maintaining the CWS division pump
train configurations, there are strong potential resource interactions with other
users of the Communication Channels. As an example, a malfunctioning process
in another subsystem that also uses the Comm Channels may saturate the
channel bandwidth, blocking the delivery of signals in the CWS.
Lost communications behavior of the digital control system components. Two
process issues can arise: an unintended pump start when lost communications
trigger a DO shelf state process; and the loss of clear
master/slave relationships between the controllers.
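The first hazard above, a shared channel saturated by another subsystem, can be illustrated with a toy queue model. This is a sketch under assumptions: the channel capacity, the first-come, first-served ordering without message priority, and all names are hypothetical and are not taken from the CWS design.

```python
CHANNEL_CAPACITY = 10  # messages per cycle; hypothetical figure


def offered_traffic(other_count, cws_commands):
    """Queue the other subsystem's messages ahead of the CWS commands."""
    other = [("OTHER", f"msg-{i}") for i in range(other_count)]
    cws = [("CWS", command) for command in cws_commands]
    return other + cws


def delivered_cws_commands(queue, capacity=CHANNEL_CAPACITY):
    """Return the CWS commands that fit within one cycle's capacity."""
    return [payload for source, payload in queue[:capacity] if source == "CWS"]


commands = ["open MOV A1", "start pump A1"]
normal = delivered_cws_commands(offered_traffic(2, commands))
saturated = delivered_cws_commands(offered_traffic(50, commands))
print(normal)
print(saturated)
```

With normal traffic both CWS commands are delivered; when the other subsystem floods the channel, none are. A real channel may use priorities or reserved bandwidth, which this sketch deliberately omits.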
Figure 8-15
CWS State Graph
[Graphic not reproduced. Node labels recoverable from the graphic include plant-level states (Electric Power Production State, Heat State, HP Condenser State, LP Condenser State, Turbine State, Reactor Main Steam State, Reactor Power State, CST State), CWS states (Circulating Water Flow, CWS Supply State, CWS System State, CWS Division A State, CWS Division B State, Controller State, Pump Trip State) and observables (LS1 to LS6 Pos., T2 Pos., T2 Closed, HSI Command).]
Table 8-17
CWS Observables
Table 8-18
CWS States & Events
States | Description | Attributes | Events
Electric Power Production State | The state of electric power being produced by the plant | Current; Voltage; Quality | Not analyzed in this example
Heat State | The state of the overall heat balance of the plant from operations | Total residual heat | Not analyzed in this example
Reactor Main Steam State | The state of the main steam being generated by the reactor (from the BWR Top Level State Graph) | Steam pressure; Steam flow; Steam temperature | Not analyzed in this example
Reactor Coolant State | The state of the coolant flowing through the reactor core | Reactor coolant level; Main feedwater temperature; Main feedwater flow; Feedwater recirculation; HPCI flow; RCIC flow | Low-Low reactor coolant level; High reactor coolant level; LOCA
Condenser State | The state of operations and conditions of the condenser | Steam flow; Vacuum; Pipe side temperature; Shell side temperature | Condenser vacuum trip; Condenser temperature trip
Feedwater Pump State | The operational state of the feedwater pumps | Feedwater pump 1; Feedwater pump 2; Feedwater supply pressure; Recirculation valves | Feedwater pump not operational; Low supply pressure to feedpump
Table 8-18 (continued)
CWS States & Events
States | Description | Attributes | Events
Circulating Water Flow | The state of the water flow to the condenser | Water pressure; Water flow; Water inlet temperature; Water outlet temperature |
CWS Supply State | The state of the water and conditions in the CWS supply, such as the cooling basins | Water temperature; Water level |
CWS State | The overall condition and operating status of the CWS | Percent capacity; Percent readiness | Capacity alarm
Condensate Pump State | The operational state of the condensate pumps | Condensate Pump 1; Condensate Pump 2; Recirculation valves | Condensate pump not operational
Hotwell State | The state of the hotwell of condensate that feeds the condensate pumps | Hotwell level; Hotwell temperature | Hotwell low level; Hotwell high level; Excessive hotwell temperature
CWS Division A State | The operating state of the Division A equipment of the CWS | Pump status; Valve status; Controller status; Communications channel status | Controller fail; Logic cabinet A comms fail; I/O Cabinet A comms fail
CWS Division B State | The operating state of the Division B of the CWS | Pump status; Valve status; Controller status; Communications channel status | Controller fail
Table 8-18 (continued)
CWS States & Events
States | Description | Attributes | Events
CWS Pump Train A1 State | The operating state of the equipment in Pump Train A1 | Pump status; Valve status; DO status; DI status |
Pump A1 State | The operating state of the A1 pump | Pump status | Pump trip
A1 MOV State | The position and operating condition of the A1 MOV | Valve position; Limit Sw 1&2; Limit Sw 3&4; Limit Sw 5 | Valve fails to move
4KV Switchgear State | The operating condition and state of the pump switchgear | Limit Sw 5; Limit Sw 6; Contact T2 |
Digital Input State | The status of the DI board for the pump train | DI status |
Digital Output State | The status of the DO board for the pump train | DO status |
Division A Comm Channel State | The status of the comm. channels in the Division A Logic cabinet and I/O cabinet | Logic A channel 1; Logic A channel 2; I/O A channel 1; I/O A channel 2 |
Division B Comm Channel State | The status of the comm. channels in the Division B Logic cabinet and I/O cabinet | Logic B channel 1; Logic B channel 2; I/O B channel 1; I/O B channel 2 |
Controller State | The status of the master and slave controllers in Logic cabinets A and B | Assigned Master; Logic A controller status; Logic B controller status |
Figure 8-16
CWS Process Graph
[Graphic not reproduced. Node labels recoverable from the graphic include goals (Main Steam Supply Provided, Remove Excess Heat, Repair Subsystem, Service Subsystem, HP Heat Removed, LP Heat Removed, Reactor Coolant Inventory, Both CWS Divisions Controlled, Pump A1 Online, Pump A1 Offline, Signal MOV Open, MOV Open, Pump A1 Started) and processes (BWR Steam Production, Normal Condenser Operations, Shutdown CWS Component, Use Controller A as Master, Use Controller B as Master, Normal Start Pump A1, Normal Stop Pump A1, Both on Channel 1, Both on Channel 2, I/O Cabinet A and B Comm Channels 1 and 2, Lost Comms DO A1 Shelf State, MOV A1 Control Sequence).]
Table 8-19
CWS Goals
Goals | Description | Attributes | Related sub-states from Table 8-18
Main Steam Supply Provided | Produce main steam with the pressure, temperature and flow specified | Main Steam Pressure; Main Steam Temperature; Main Steam Flow | Reactor Main Steam State
Remove Excess Heat | Prevent the heat from the reactor and steam from reaching dangerous levels | None identified in this analysis | None identified in this analysis
Reactor Coolant Inventory | Maintain reactor coolant at desired levels and temperature | Coolant level; Coolant temperature | Reactor Coolant State
Repair Subsystem | Correct subsystem conditions that resulted in the subsystem not meeting readiness goals | Subsystem ID | Non safety systems readiness; CWS Division A State; CWS Division B State
Service Subsystem | Complete planned on-condition servicing for a subsystem | Subsystem ID | Non safety systems readiness; CWS Division A State; CWS Division B State
LP Heat Removed | The desired amount of heat is being removed by the LP condensers | LP condenser temperature | Heat State
HP Heat Removed | The desired amount of heat is being removed by the HP condensers | HP condenser temperature | Heat State
Condenser Steam In | The desired amount of steam is flowing into the LP condensers | | LP Condenser State
Condensate Out | The desired amount of condensate is leaving the LP condensers | | LP Condenser State
Table 8-19 (continued)
CWS Goals
Goals | Description | Attributes | Related sub-states from Table 8-18
Condenser Conditions | The desired internal conditions are met within the LP condenser | LP condenser vacuum; LP condenser temperature | LP Condenser State
Feedwater Supply Pressure | Sufficient supply pressure to the Feedwater pumps for safe operation | Feedwater supply pressure | Feedwater Pump State
Feedwater Pumps | Having the Feedwater pumps operating correctly in the desired state | Feedwater pumps status | Feedwater Pump State
Cooling Basin | The desired conditions are met in the Cooling Basin | Basin Temperature; Basin Level | CWS Supply State
CWS Flow | The desired flow and temperature of the circulating water | Circ water flow; Circ water temperature | Circulating Water Flow
Cooling Tower A Conditions | The desired status and state of the cooling tower in CWS Division A | Temperature drop; Flow | CWS Division A State
Condensate Pumps | Having the Condensate pumps operating correctly in the desired state | Condensate pumps status | Condensate Pump State
Both CWS Divisions Controlled | Have a functioning master controller for the A and B CWS Divisions | Assigned Master; Controller A status; Controller B status | Controller State
Division A 2 Pumps Online | Have 2 pumps from Division A online | Pumps online | CWS Division A State
Division B 2 Pumps Online | Have 2 pumps from Division B online | Pumps online | CWS Division B State
Pump A1 Online | Have Pump Train A1 online | Pumps status | Pump A1 State
Table 8-19 (continued)
CWS Goals
Goals | Description | Attributes | Related sub-states from Table 8-18
Pump A1 Offline | Have Pump A1 offline | Pump status | Pump A1 State
Communicate Between Controllers | Have at least one comms channel operating between the Division A and B controllers | Logic A channel 1; Logic A channel 2; Logic B channel 1; Logic B channel 2 | Both on channel 1; Both on channel 2
Communicate with Division B Pumps | Have at least one comms channel between the Assigned master and the Div B I/O cabinet | Assigned master; Channel 1; Channel 2 | Division B Comm Channel State
Communicate with Division A Pumps | Have at least one comms channel between the Assigned master and the Div A I/O cabinet | Assigned master; Channel 1; Channel 2 | Division A Comm Channel State
Signal MOV Open | Deliver signal to pump train DO to open MOV | MOV state | MOV State
MOV Open | Achieve full open of the MOV | MOV State | MOV State
Pump A1 Started | Pump is energized and moving circ water | 4KV Switchgear status | 4KV Switchgear State
Table 8-20
CWS Processes
Processes | Description | Attributes | Related Goals from Table 8-19
BWR Steam Production | Use reactor heat generation and feedwater supply to make steam | Main steam temperature; Main steam pressure; Main steam flow | Main Steam Supply
Normal Condenser Operations | Use main high and low pressure condensers to condense remaining steam | None identified in this analysis | Remove Excess Heat
Shutdown CWS Component | Stop the operation of an item of CWS equipment | Equipment Item | Repair Equipment (from Top Level BWR goals)
Condenser Operation | Operate the condenser within limits to remove heat | Temperature; Vacuum | Heat Removed
Hotwell Management | Maintain the Hotwell level and temperature to supply condensate to the feedwater system | Hotwell Level; Hotwell Temperature | HP Condensate Out; LP Condensate Out
Circulating Water Operation | Use circulating water to cool the HP and LP condensers | | CWS Flow
CWS 3A + 1B | Use a pump train configuration with 3 pumps from Division A and 1 from Division B to supply the circ water | | CWS Flow
CWS 1A + 3B | Use a pump train configuration with 1 pump from Division A and 3 from Division B to supply the circ water | | CWS Flow
Table 8-20 (continued)
CWS Processes
Processes | Description | Attributes | Related Goals from Table 8-19
CWS 2A + 2B | Use a pump train configuration with 2 pumps from Division A and 2 from Division B to supply the circ water | | CWS Flow
Condensed Steam | Use condensate from the HP and LP condensers to provide condensate for use in the feedwater system | | Hotwell Level
CST Inventory | Draw coolant for the Feedwater system to the Hotwell from the CST | Hotwell makeup | Hotwell Level
Pumps A1 and A2 Online | Bring Pumps A1 and A2 to operating status using the pump train controls | Pump status | Div A 2 Pumps Online
Use Controller A as Master | Set the CWS master controller to Logic cabinet A | Controller A status; Assigned Master | Both Divisions Controlled
Use Controller B as Master | Set the CWS master controller to Logic cabinet B | Controller B status; Assigned Master | Both Divisions Controlled
Normal Start Pump A1 | Use the normal controlled start to bring pump A1 online | Pump status | Pump A1 Online
Normal Stop Pump A1 | Use the normal controlled stop to bring pump A1 offline | Pump status | Pump A1 Offline
Both on channel 1 | Operate both master and slave controller on channel 1 | Assigned master | Communicate Between Controllers
Table 8-20 (continued)
CWS Processes
Processes | Description | Attributes | Related Goals from Table 8-19
Both on channel 2 | Operate both master and slave controller on channel 2 | Assigned master | Communicate Between Controllers
I/O Cabinet A Comm channel 1 | Use comm. channel 1 to communicate with the Div A I/O cabinet pump trains | Assigned master | Communicate with Division A Pumps; Signal MOV Open
Lost comms A1 Shelf State | Revert to the shelf state of the A1 pump train if lost comms, which is ON | | Signal MOV Open
Figure 8-17
CWS Purpose Graph
[Graphic not reproduced. The recoverable node labels combine the state and observable nodes shown in Figure 8-15 with the goal and process nodes shown in Figure 8-16, together with additional nodes such as Feedwater Supply Pressure, Feedwater Pumps On, HP and LP Condenser Conditions, Hotwell Management, Cooling Basin Conditions, CWS Flow, CWS 3A+1B, CWS 1A+3B, CWS 2A+2B, Div A 2 Pumps Online and Div B 2 Pumps Online.]
Table 8-21
CWS State & Events Analysis Results
States | Attributes | Redundancy | Interdependence | Diversity
Circulating Water Flow | Water pressure; Water flow; Water inlet temperature; Water outlet temperature | Multiple sources of data are available | Both directly measurable and determinable from other state data | Diverse means of determination exist.
CWS Supply State | Water temperature; Water level | Multiple sources of data are available | Directly measurable | Diverse means of determination exist.
CWS State | Percent capacity; Percent readiness | Multiple sources of data are available | Dependent on other state data | Diverse means of determination exist.
CWS Division A State | Pump status; Valve status; Controller status; Communications channel status | Multiple sources of data are available | Dependent on other state data | Diverse means of determination exist.
CWS Division B State | Pump status; Valve status; Controller status; Communications channel status | Multiple sources of data are available | Dependent on other state data | Diverse means of determination exist.
CWS Pump Train A1 State | Pump status; Valve status; DO status; DI status | Multiple sources of data are available | Dependent on other state data | Diverse means of determination exist.
Pump A1 State | Pump status; Pump current | Multiple sources of data are available | Partially measurable and dependent on other state data | Diverse means of determination exist.
Table 8-21 (continued)
CWS State & Events Analysis Results
States | Attributes | Redundancy | Interdependence | Diversity
A1 MOV State | Valve position; Limit Sw 1&2; Limit Sw 3&4; Limit Sw 5 | Multiple sources of data are available | Directly measurable | No diversity is provided.
4KV Switchgear State | Limit Sw 5; Limit Sw 6; Contact T2 | Only single sources of data are available | Directly measurable | No diversity is provided.
Digital Input State | DI status | Only single sources of data are available | Partially measurable and dependent on other state data | No diversity is provided.
Digital Output State | DO status | Only single sources of data are available | Partially measurable and dependent on other state data | No diversity is provided.
Division A Comm Channel State | Logic A channel 1; Logic A channel 2; I/O A channel 1; I/O A channel 2 | Only single sources of data are available | Directly measurable | No diversity is provided.
Division B Comm Channel State | Logic B channel 1; Logic B channel 2; I/O B channel 1; I/O B channel 2 | Only single sources of data are available | Directly measurable | No diversity is provided.
Controller State | Assigned Master; Logic A controller status; Logic B controller status | Multiple sources of data are available | Directly measurable | Diverse means of determination exist.
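The Redundancy and Diversity columns of Table 8-21 lend themselves to a simple screen for states whose determination relies on a single, non-diverse data source. The sketch below encodes those two columns as transcribed from the table; the tuple layout and function name are hypothetical, and the screen is only an illustration of how the results might be filtered.

```python
# (state, redundancy, diverse means of determination exist?) from Table 8-21
TABLE_8_21 = [
    ("Circulating Water Flow", "multiple", True),
    ("CWS Supply State", "multiple", True),
    ("CWS State", "multiple", True),
    ("CWS Division A State", "multiple", True),
    ("CWS Division B State", "multiple", True),
    ("CWS Pump Train A1 State", "multiple", True),
    ("Pump A1 State", "multiple", True),
    ("A1 MOV State", "multiple", False),
    ("4KV Switchgear State", "single", False),
    ("Digital Input State", "single", False),
    ("Digital Output State", "single", False),
    ("Division A Comm Channel State", "single", False),
    ("Division B Comm Channel State", "single", False),
    ("Controller State", "multiple", True),
]


def weakly_observed(rows):
    """Return states with only a single data source and no diversity."""
    return [state for state, redundancy, diverse in rows
            if redundancy == "single" and not diverse]


candidates = weakly_observed(TABLE_8_21)
for state in candidates:
    print(state)
```

The states returned are the switchgear, digital I/O and communication channel states, which are the same elements involved in the lost-communications hazards identified earlier in this example.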
Table 8-22
CWS Goal Interactions
Goals | Direct goal interactions | Indirect goal interactions
Repair Subsystem | None found | Repair Subsystem; Service Subsystem
Service Subsystem | None found | Repair Subsystem; Service Subsystem
Heat Removed | None found | Repair Subsystem; Service Subsystem
Condensate Out | None found | None found
Condenser Conditions | None found | Hotwell Level; LP Condenser Conditions
Condensate Out | None found | None found
Cooling Basin | None found | None found
CWS Flow | None found | Pump A1 Offline
Cooling Tower A Conditions | None found | None found
Hotwell Level | None found | HP Condenser Conditions; LP Condenser Conditions
Both CWS Divisions Controlled | None found | None found
Division A 2 Pumps Online | None found | Pump A1 Offline
Division B 2 Pumps Online | None found | None found
Pump A1 Online | Pump A1 Offline | None found
Pump A1 Offline | Pump A1 Online | CWS Flow
Communicate Between Controllers | None found | None found
Communicate with Division B Pumps | None found | None found
Communicate with Division A Pumps | None found | None found
Signal MOV Open | Signal MOV Closed | None found
MOV Open | MOV Closed | None found
Pump A1 Started | Pump A1 Stopped | None found
Table 8-23
CWS Process Interactions
High Coverage
The PGA method is designed to provide very high coverage of potential hazards.
This coverage is very useful because the results can eliminate, reduce or mitigate
hazards when performing system requirements generation and design activities.
Systems View
The PGA method is essentially a top-down method that takes a system view.
The results are useful for input to the requirements definition phase of a digital
I&C project because they result in a safety-driven or hazard-avoidance design from
the beginning.
Unexpected Behaviors
The PGA method can identify unexpected and strange system behaviors that
may not otherwise be thought credible or possible. For example, it can identify
adverse interactions between components and systems that would on the surface
appear to have no potential interactions at all.
Simplified Results
When the data is reduced to the final list of potential hazards to be addressed,
the results can typically be readily used to inform requirements, identify and
apply defensive measures, and demonstrate system acceptability.
The final results can also be used as an input to another method to help avoid
searches for faults and failures that don’t necessarily lead to hazards.
Single Failures
The PGA method does not readily identify the effects of postulated single
failures. Therefore, PGA results are not well suited as an input to a single failure
analysis or identifying single point vulnerabilities.
Trained Facilitator
This method requires the ability to evaluate various abstractions presented by the
graphs and tables for potential interactions. It is possible to overlook or dismiss
possibly hazardous State, Goal or Process interactions without a trained
facilitator on the assessment team.
Section 9: Conclusions &
Recommendations
The hazard analysis guidance in this document covers a wide range of methods
and practices, some mature and well-proven, others emergent and still works in
progress in terms of their immediate applications in the nuclear power industry.
Proven methods, such as FMEA and use of Fault Trees, are well established in
the commercial nuclear power industry, and have their place. This guidance
provides step-by-step procedures and worked examples for these proven methods
so that users can immediately apply them on digital I&C projects and achieve
effective results.
Methods that show promise and emergence in the nuclear power industry,
including HAZOP, STPA and PGA, are also described in this guideline, with
step-by-step procedures and worked examples that can be compared to similar
examples that were developed for the FMEA and Top Down methods.
However, for these emergent methods, the conclusions and recommendations
reported here should be considered qualitative and preliminary. It is clear that
these emergent methods have the potential to immediately and significantly
improve on current industry practices. However, they are new to most utility
engineers, who will likely need training and the help of facilitators to gain
proficiency in them. Therefore, future work on technical transfer mechanisms will
be important in deploying these methods and particularly in getting to the point
where utility engineers can confidently and efficiently apply them to real plant
problems. Technical transfer mechanisms may include formal training,
workshops, and other approaches for bringing these emergent methods to the
same levels of maturity and competence as the more proven methods (FMEAs
and use of Fault Trees in the Top Down Method).
9.1 Conclusions
Table 9-1 compares strengths and limitations of the hazard analysis methods,
based on results of the investigations and examples used in the current study. The
following observations can be made:
1. The FMEA methods are well suited for postulating single failures and their
effects on other systems, sub-systems or components, and they can make use
of the proposed failure taxonomy provided in Attachment B. However, these
methods are not well suited for use in identifying misbehaviors or hazards
beyond single failures, such as multiple hardware failures or unintended
interactions of hardware and software components.
2. The Top Down method can evaluate the effects of single and multiple
failures, and takes an integrated view of the plant design. It focuses on
functional faults and failures, as opposed to unintended behaviors that do not
involve component failures. It can also result in complex fault tree models.
3. The HAZOP, STPA and PGA methods offer the following strengths:
- Cover hazards beyond faults and failures
- Integrated view of plant design
- Identify unexpected behaviors and interactions
4. However, HAZOP, STPA and PGA share the following limitations:
- Need a trained facilitator
5. STPA and PGA have the following additional limitations:
- Do not pinpoint single failures for easy identification
- Can produce tedious intermediate results (large tables)
6. At the conceptual design stage, the Design FMEA method can identify
“application notes,” such as insights derived from the FMEA regarding
failure mechanisms and potential mitigation methods that can be used to
influence the detailed design in order to produce a more robust solution.
- In the simple system example (HPCI/RCIC turbine control system)
described in Sections 4 and 1, the overlap between FMEA and Top
Down analysis results was nearly complete, suggesting that one method
or the other is sufficient to demonstrate a robust solution. Based on this
example, performing both methods on such simple systems or
components may be a wasted effort, because one method appears unlikely
to reveal vulnerabilities that are not also revealed by the other.
- For the more complex example (CWS DCS described in Sections 4 and
1), the FMEA approach was found to be focused on single failures and
did not identify vulnerabilities inherent in the system architecture.
However, the FMEA approach was useful in identifying specific
vulnerabilities within the components identified by the Top Down
analysis as being the most critical in terms of the system success criteria.
7. For more complex systems, it appears that a top down failure analysis can be
useful in influencing the system architectural design to avoid vulnerabilities
that can lead to undue safety or generation risks.
8. When designing or reviewing the design of I&C systems, it is important to
develop an understanding of the top level success criteria for the process
systems or components being actuated or controlled. Without a good
understanding of these success criteria, the potential exists to weaken or
eliminate the effectiveness of apparent redundancies that may be designed
into the I&C system. The cut-sets produced by a top down analysis can
reveal vulnerabilities inherent in the architecture of the digital I&C system
itself, and when combined with a clear understanding of the analytical
success criteria, can be used to help produce a more robust design.
9. The taxonomy described in Appendix B was a useful aid in preparing the
FMEA worksheets for the two example problems. The taxonomy can be
applied to hazard analysis activities, and can be used to assess the availability
of defensive measures within systems, sub-systems, components or devices of
interest.
10. Some of the methods can be used effectively in a blended approach. For
example:
- The Functional FMEA (FFMEA) method can be used to identify
hazardous functions at the plant system or process level that can be
further scrutinized using the Design FMEA (DFMEA) method. If there
is no need to systematically identify and evaluate all digital I&C system
failure modes, then the FFMEA results can be used to limit the scope of
the DFMEA analysis.
- The Top Down method includes a step for transitioning to the Design
FMEA method, thus limiting the scope of the Design FMEA to the
digital I&C system failure modes that can adversely affect actuated
components, which in turn adversely affect plant systems.
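The blended FFMEA and DFMEA approach described in item 10 amounts to a scoping filter, which can be sketched as follows. The function names, component names and hazard judgments below are hypothetical placeholders chosen for illustration; they are not results from the worked examples in this guideline.

```python
# Hypothetical FFMEA output: plant function -> judged hazardous?
ffmea_hazardous = {
    "Supply circulating water flow": True,
    "Indicate basin level": False,
}

# Hypothetical mapping of digital I&C components to the functions they support.
component_functions = {
    "Pump train A1 DO board": {"Supply circulating water flow"},
    "Division A I/O cabinet comm channel 1": {"Supply circulating water flow"},
    "Basin level indicator": {"Indicate basin level"},
}


def dfmea_scope(ffmea, components):
    """Keep only components that support at least one hazardous function."""
    hazardous = {function for function, flag in ffmea.items() if flag}
    return sorted(name for name, functions in components.items()
                  if functions & hazardous)


scope = dfmea_scope(ffmea_hazardous, component_functions)
for name in scope:
    print(name)
```

Only the components tied to a hazardous function are carried forward to the DFMEA, which is the scope reduction the blended approach is intended to provide when a full evaluation of all failure modes is not needed.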
Table 9-1
Comparative Strengths & Limitations of Each Method
9.2 Recommendations
1. Further work is needed for developing and applying tools for dealing with
large sets of data that can be produced by the STPA method.
2. Technical Transfer mechanisms such as industry training, a computer-based
training (CBT) module, and industry workshops should be developed for
enabling use of this guidance by owner/operator engineers, system integrators
and equipment vendors, especially on the advanced methods (STPA and
PGA).
3. Additional demonstrations of the various hazard analysis methods, including
combinations of the methods on real plant systems and proposed
modifications, over a range of scales and complexity are needed to improve
the current knowledge base and help refine the deployment of the methods.
Section 10: References
1. IEEE Std. 352-1987, “IEEE Guide for General Principles of Reliability
Analysis of Nuclear Power Generating Station Safety Systems”
2. IEEE Std. 610.12-1990, “IEEE Standard Glossary of Software Engineering
Terminology”
3. IEEE Std. 100-2000, “The Authoritative Dictionary of IEEE Standards
Terms”
4. EPRI TR-102348, “Guideline on Licensing Digital Upgrades: EPRI TR-102348,
Revision 1, NEI 01-01”
5. NEI 96-07, Rev 1, “Guidelines for 10 CFR 50.59 Implementation”
6. NUREG 0800, “Standard Review Plan”
7. EPRI 1022684, “Elements of Pre-Operational and Operational
Configuration Management for a New Nuclear Facility”
8. IEEE Std. 603-1998, “IEEE Standard Criteria for Safety Systems for
Nuclear Power Generating Stations”
9. IEEE Std. 7-4.3.2-2003, “IEEE Standard Criteria for Digital Computers in
Safety Systems of Nuclear Power Generating Stations”
10. EPRI 1016722, “Digital Instrumentation & Control Operating Experience
Lessons Learned”
11. EPRI 1022247, “Digital Instrumentation & Control Operating Experience
Lessons Learned Volume II – Case Studies 6-10”
12. “An Introduction to Hazard and Operability Studies – The Guide Word
Approach,” by R. Ellis Knowlton, Seventh Printing
13. EPRI 1023010, “Combinatorial Testing for Digital I&C Systems,” 2011
14. EPRI TR-104595, “Abnormal Conditions and Events Analysis for
Instrumentation and Control Systems, Vol. 1: Methodology for Nuclear
Power Plant Digital Upgrades; Vol. 2: Survey and Evaluation of Industry
Practices” (1995)
15. EPRI 1022985, “Failure Analysis of Digital Instrumentation & Control
Equipment and Systems – Demonstration of Concept”
16. EPRI 1016731, “Operating Experience Insights on Common-Cause Failures
in Digital Instrumentation and Control Systems”
17. EPRI 1011710, “Handbook for Evaluating Critical Digital Equipment and
Systems”
18. EPRI 1022991, “Guideline on Configuration Management for Digital
Instrumentation & Control Equipment and Systems”
19. “Engineering a Safer World – Systems Thinking Applied to Safety,” by Dr.
Nancy G. Leveson; MIT Press, Cambridge MA; ISBN 978-0-262-01662-9
20. EPRI 1021077, “Estimating Failure Rates in Highly Reliable Digital
Systems”
21. EPRI 1019182, “Protecting Against Digital Common-Cause Failure:
Combining Defensive Measures and Diversity Attributes”
22. NUREG/IA-254, “Suitability of Fault Modes and Effects Analysis for
Regulatory Assurance of Complex Logic in Digital Instrumentation and
Control Systems” June, 2011
23. EPRI NP-5652, “Guideline for the Utilization of Commercial Grade Items
in Nuclear Safety Related Applications (NCIG-07)”
24. EPRI TR-102260, “Supplemental Guidance for the Application of EPRI
Report NP-5652 on the Utilization of Commercial Grade Items”
25. EPRI TR-106439, “Guideline on Evaluation and Acceptance of
Commercial Grade Digital Equipment for Nuclear Safety Applications”
26. “Potential Failure Modes and Effects Analysis (FMEA) Reference Manual,
Fourth Edition,” June 2008, by the Automotive Industry Action Group;
ISBN 978-1-60534-136-1.
27. MIL-STD-1629A, “Military Standard: Procedures for Performing A Failure
Mode, Effects, and Criticality Analysis (24 Nov 1980)”
28. NEI 04-10, Rev. 1 “Risk Informed Technical Specification Initiative 5b,
Risk Informed Method for Control of Surveillance Frequencies”
29. RG 1.174, Rev. 1 “An Approach for Using Probabilistic Risk Assessment in
Risk-Informed Decisions on Plant-Specific Changes to the Licensing Basis”
(November 2002)
30. “Instrument Engineer’s Handbook” (three volume set), 4th Edition, by Bela
G. Liptak; CRC Press; ISBN 9781466571716.
31. http://www.lihoutech.com; website for Lihou Technical & Software
Services; 150 Shenley Fields Rd., Selly Oak, Birmingham, B29 5BT United
Kingdom.
32. EPRI Report 1025282, “Guideline on Testing Digital Instrumentation and
Control Systems”
33. IEC 61882-2001, “Hazard and Operability Studies (HAZOP Studies) –
Application Guide”
34. “Launch Control Safety Study,” Watson, H. A., Bell Labs, 1961
35. NUREG-0492, “Fault Tree Handbook,” USNRC, 1981
36. EPRI 1013490, “Support System Initiating Events: Identification and
Quantification Guideline,” Electric Power Research Institute, 2006.
37. AP-913 Rev.1, “Equipment Reliability Process Description,” Institute of
Nuclear Power Operations, 2001.
38. EPRI 1025278, “Modeling Digital I&C in Nuclear Power Plant
Probabilistic Risk Assessments,” Electric Power Research Institute, 2012
39. Regulatory Guide 1.177 Rev. 1, “An Approach for Plant Specific Risk
Informed Decisionmaking: Technical Specifications,” USNRC, 2011
40. IEEE Std. 1228-1994, “IEEE Standard for Software Safety Plans”
41. EPRI 1016722, “Digital Instrumentation & Control Operating Experience
Lessons Learned – Case Studies,” 2008
42. EPRI 1022247, “Digital Instrumentation & Control Operating Experience
Lessons Learned – Volume II,” 2010
43. EPRI TR-016780, “Advanced Light Water Reactor Requirements
Document,” Volume II, Rev 8, 1999
Appendix A: Overview of Available
Guidance
Purpose
Assessment Summary
This assessment determined that the currently available guidance listed in Table
A-1 describes, at various levels of detail, the basis for performing failure
analyses, and collectively outlines the basic methods and formats for
producing the expected deliverables. These guidance documents show
that failure analysis can be performed top-down (fault tree
analysis) as well as bottom-up (failure mode and effects analysis).
Other methods such as software hazard analysis, software integrated critical path,
system modeling, walkthroughs (code reviews) and software sneak circuit analysis
are also discussed in the documents. Each method has its advantages, but
completing a failure analysis on a complex system upgrade with any of them
can require an exhaustive effort.
Recommendations
a. Trials of the failure analysis guidance on real plant upgrades
b. Assessment of the failure analysis guidance against actual initiated events
c. Development of failure analysis guidance for reusable and COTS
software
9. System dependencies on communications are an area that should be included
in the EPRI Digital Failure Analysis guidance document.
10. For FMEA guidance, the guidance from the NASA failure analysis
procedure to identify mitigation corrective actions, owners, and resolutions as
part of the failure analysis efforts should be included in the EPRI Digital
Failure Analysis guidance.
11. NUREG-0492 points out that analysis of complex systems may need to be
performed by a team approach. The NASA failure analysis procedure also
provides guidance on using a team approach for the failure analysis. The
EPRI Digital Failure Analysis guidance should provide instructions for use
of an analysis team.
12. Several of the reports and papers that were reviewed for this technical report
outlined limitations with software failure analysis and reliability modeling.
This report should consider additional reviews of the benefits and limitations
for software analysis to determine the need to continue efforts to perform
software failure analysis or to develop methods to bound software failures.
Table A-1
Guidance Documents Assessed
Appendix B: Taxonomy of Failure Modes,
Failure Mechanisms, Faults,
and Defensive Measures
Purpose
The purpose of this digital failure analysis taxonomy is to support digital
failure analysis activities by doing the following:
• Describe typical digital devices and components
• Describe a hierarchy of typical digital devices, components, and systems, and
  how failure mechanisms, failure modes and effects can propagate up through
  the hierarchy
• List typical failure mechanisms that can affect typical digital devices and
  components
• List the typical device or component failure modes that result from typical
  failure mechanisms
• List the possible defensive measures that could be implemented (or validated)
  for preventing or mitigating typical failure mechanisms associated with a
  device or component
• Describe how to use this Taxonomy in digital failure analysis activities
Table B-1 lists the devices and components described with this taxonomy. Note
that only a handful of devices are described for this guideline, in order to
demonstrate the taxonomy concept.
Table B-1
Taxonomy Devices and Components
[Figure B-1 diagram: a hierarchy with Plant Functions at the top, followed by
Plant Components 1 to n (plant functions, systems and components), then Digital
Components 1 to n and Devices 1 to n (digital systems, components and devices).
The links between levels are labeled Failure Mechanisms, Failure Modes and
Failure Effects.]
Figure B-1
A Hierarchy of Failure Mechanisms, Modes and Effects
Figure B-1 illustrates a basic hierarchy that can be applied to digital devices,
components, sub-systems and systems. The analyst responsible for evaluating the
potential misbehaviors of the device, component, sub-system or system of interest
can perform the analysis at any level in this hierarchy.
On the other hand, the same CPU device may also be susceptible to lower-level
failure mechanisms, such as manufacturing defects or age related degradation
that lead to its own failure modes. This distinction might be important for an
analyst, such as a product engineer at a DCS vendor, who is interested in
evaluating these failure mechanisms to determine the controller failure modes,
and measures that can be used to prevent or mitigate such failure modes.
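The propagation described above can be sketched as a lookup across taxonomy sheets. The following is a minimal illustration only: the entries are abbreviated from Sheets B-1a and B-6a, while the dictionary layout and the `trace_mechanisms` helper are assumptions made for this example and are not part of the guideline.

```python
# Hypothetical, abbreviated taxonomy sheets: each maps a failure mode to the
# failure mechanisms that can cause it. Entries come from Sheets B-1a and
# B-6a; the data layout is an assumption for illustration.
DEVICE_SHEET_CPU = {
    "CPU Halt": ["Power Supply Off", "Power Supply Dip"],
    "CPU Crash": ["Manufacturing Defect", "Failed Connections", "Overheating"],
}

COMPONENT_SHEET_TYPE1 = {
    "Controller Lockup": ["CPU Halt", "CPU Crash", "Stopped internal clock"],
}


def trace_mechanisms(failure_mode, sheets):
    """Expand a failure mode down through the hierarchy.

    `sheets` is ordered from the highest level to the lowest. A mechanism
    that is itself a failure mode on a lower sheet is expanded further;
    anything else is reported as a lowest-level mechanism.
    """
    if not sheets:
        return [failure_mode]
    top, lower = sheets[0], sheets[1:]
    if failure_mode not in top:
        return trace_mechanisms(failure_mode, lower)
    leaves = []
    for mechanism in top[failure_mode]:
        for leaf in trace_mechanisms(mechanism, lower):
            if leaf not in leaves:
                leaves.append(leaf)
    return leaves


if __name__ == "__main__":
    sheets = [COMPONENT_SHEET_TYPE1, DEVICE_SHEET_CPU]
    for leaf in trace_mechanisms("Controller Lockup", sheets):
        print(leaf)
```

An analyst working at the component level would stop at the first sheet, while a product engineer interested in device-level causes would pass both sheets, as in the usage shown.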
The failure mode table in the first sheet is color coded to represent basic types of
defensive measures as shown in Table B-2:
Table B-2
Basic Types of Defensive Measures
Color Code | Key | Hardware Defensive Measure | Software Defensive Measure
Blue | Measure applied during operation | Run-time diagnostics implemented in software/firmware | External diagnostic comparison by user or diverse software means
Orange | Measure applied during specification and development | Pre-installation, start-up or boot tests (e.g., POST) implemented in hardware/software/firmware | Design, implementation and compilation standards and checks
Green | Measure applied by Qualification Testing | Qualification tests in the target environment | Qualification testing on target platform environment
Black | Measure applied by Administrative Controls | (none listed) | (none listed)
Excerpt from Table 5-6
Functional Level Diagram: See Figure 5-1 (Sheet 1 of 2)
System: HPCI, RCIC
Subsystem: Positioner
Design Phase: Conceptual
Rev: 0a
Component Identification: Governor
Function(s): Provide automatic governor valve position demand signal to
digital positioner to compensate for error between actual turbine speed and
demanded turbine speed
Failure Modes: Output Fails Offscale High; Output Fails Offscale Low; Output
High Rate of Change
  Failure Mechanisms (common to these three failure modes):
    1. CPU Data Corruption
    2. CPU Logic Error
    3. D/A Device Error
    4. Lost or corrupted RAM data
  Output Fails Offscale High
    Effect on System: Turbine overspeeds, trips on high reactor level or
    mechanical overspeed
    Method of Detection: Periodic Test
  Output Fails Offscale Low
    Effect on System: Turbine slows to minimum speed, less than adequate HPCI
    or RCIC flow
    Method of Detection: Periodic Test
  Remarks (Offscale High and Offscale Low):
    1. Provide multiple outputs of the position demand signal from governor
       to positioner
    2. Include signal validation in the positioner application logic
    3. Provide MCR and RSP alarm connection to positioner
  Output High Rate of Change
    Effect on System: Rapid change in turbine speed and pump flow
    Method of Detection: Periodic Test
    Remarks:
    1. Include rate detection in signal validation logic
    2. Provide alarm connection to positioner
Failure Mode: Controller Lockup
  Failure Mechanisms:
    1. CPU Halt
    2. CPU Crash
    3. Stopped internal clock
  Effect on System: Indeterminate; depends on fail as-is value - likely to
  result in reactor overfill or underfill, followed by turbine trip
  Method of Detection: Periodic Test
Failure Mode: Failure to Boot or Reset
  Failure Mechanisms:
    1. CPU Data Corruption
    2. CPU Logic Error
    3. Lost or corrupted ROM data
  Effect on System: Loss of turbine control, less than adequate HPCI or RCIC
  flow
  Method of Detection: Periodic Test
  Remarks:
    1. Ensure governor is supplied with a HW-based watchdog timer that sets
       outputs to preferred state
    2. Provide MCR and RSP alarm connection to positioner
Failure Mode: Dead Controller
  Failure Mechanisms:
    1. Failed internal power supply
    2. Line voltage below spec
  Effect on System: Turbine slows to minimum speed, less than adequate HPCI
  or RCIC flow
  Method of Detection: Periodic Test
Figure B-2
Linking a Taxonomy Sheet to an FMEA Worksheet
Figure B-3
Linkage between Taxonomy Sheets
Sheet B-1a: Central Processor Device Failure Modes
The number of input and output connectors and bits can vary
greatly by the processor design. Also, some low power
processors don’t have the Heat Sink contact in the center
because heat generation is not as much of a problem.
Commonly the processor clock speed runs at a multiple of the
input clock signal. Data I/O is commonly handled using
various protocols, implemented by on-chip peripherals.
Common bit widths* of processors: 16, 32, and 64-bit
Failure
Failure Mechanism Defensive Measures
Mode
CPU Halt 1. Power Supply Off Do not turn off power Supply.
2. Power Supply Dip Do not let power dip.
CPU Logic 1. Power Supply Dip Do not let power dip.
Error 2. Bit Errors (radiation, Quality testing.
EMI) Ensure proper shielding around
3. Design Flaws controller to protect against radiation
4. Manufacturing Defect and EMI.
5. Failed Connections Integrity tests before and while
(including internal bond- running.**
wire and lead free solder Use of diverse microprocessor
interconnect failure) Architectural diversity/redundancy
6. Overheating Integrity tests before and while running
7. Part Wear Out (e.g., **
due to various age-related Quality testing of processor cooling
degradation mechanisms, systems.
exacerbated by small Temperature monitors on or near the
feature size) processor.
Ensure cooling systems are properly
mounted
CPU Data 1. Bit Errors (radiation, Quality testing.
Corruption EMI) Ensure proper shielding around
2. Design Flaws controller to protect against radiation
3. Manufacturing Defect and EMI.
4. Failed Connections Integrity tests before and while
(including internal bond- running..**
wire and lead free solder Use of diverse microprocessor
interconnect failure) Architectural diversity/redundancy
5. Overheating Integrity tests before and while running.
6. Part Wear Out (e.g., **
due to various age-related Quality testing of processor cooling
degradation mechanisms, systems.
exacerbated by small Temperature monitors on or near the
feature size) processor.
Ensure cooling systems are properly
mounted
Use devices with feature size >= 350nm
Specify devices with ceramic packaging
Specify components using leaded solder
CPU Crash 1. Manufacturing Defect Integrity tests before and while running.
2. Failed Connections **
(including internal bond- Quality testing of processor cooling
wire and lead free solder systems.
interconnect failure) Temperature monitors on or near the
3. Overheating processor.
Ensure cooling systems are properly
mounted
Permanent 1. Overheating Use devices with feature size >= 350nm
CPU Damage 2. Part Wear Out (e.g., Architectural diversity/redundancy
due to various age-related Specify devices with ceramic packaging
degradation mechanisms, Specify components using leaded solder
exacerbated by small
feature size)
Sheet B-1b: Central Processor Device Description
A few characteristics apply to all processors, regardless of type:
power requirements, bit width, and clock speed. Power
requirements can vary significantly; more power generally means more heat
generated. Also, the higher the clock speed, generally the higher the power
requirements. Processors designed for use in embedded systems tend to be
manufactured to require less power or produce less heat, but this is not always the
case. Clock speed determines the number of instruction cycles per second.
Generally faster clock speeds mean faster processing times; however, this is not a
perfect measure, because different processor instructions take different
numbers of cycles to complete.
Bit width determines the maximum size integer the processor can handle in a
single operation. This is an important factor because it affects the range and
precision of integer and floating point operations, and it also generally
determines the maximum amount of RAM a processor can address. Some processors
support higher internal bit widths for floating point operations. Bit widths
for most processors are powers of 2 starting at 8; 16, 32, and 64-bit
processors are the most common.
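As a rough illustration of how bit width bounds integer range and addressable memory, the following sketch computes the limits for the common widths. It assumes two's complement signed integers and byte-addressable memory with an address as wide as the processor word; real devices may use a different address width, so the function names and results are illustrative only.

```python
def max_unsigned(bits):
    """Largest unsigned integer representable in `bits` bits."""
    return 2 ** bits - 1


def max_signed(bits):
    """Largest two's complement signed integer representable in `bits` bits."""
    return 2 ** (bits - 1) - 1


def addressable_bytes(address_bits):
    """Bytes reachable with an `address_bits`-wide address.

    Assumes byte-addressable memory; the address width is an input because
    it does not always equal the data width.
    """
    return 2 ** address_bits


if __name__ == "__main__":
    for width in (16, 32, 64):
        print(width, max_unsigned(width), max_signed(width), addressable_bytes(width))
```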
Since the late 20th Century the development of microprocessors has been driven
by the requirements of high-volume consumer electronics, mobile
communications and home/business computing in a free market. The requirements
of these markets differ in many ways from those of industrial, safety, and
nuclear applications. Because of the tremendous cost of developing and manufacturing,
microprocessors produced today are optimized for these high-volume markets;
this is typically manifested in increased design-complexity and reduction in the
semiconductor feature-size – simplistically speaking both techniques improve
performance. Unfortunately, increased design complexity presents real challenges
in the safety-justification of devices using microprocessors and, after a certain
point long-since passed in commercial processes, reducing feature-sizes reduces
the usable life of components. When short component lifetimes are combined
with the short production-runs associated with the high-volume consumer
electronics market, the Utility user/OEM may be left with a looming
obsolescence problem.
The most common bit sizes of these processors are 32 and 64-bit processors.
These processors are most commonly found in general purpose computers and
servers. CISC processors commonly incorporate microcode; this is essentially
embedded software (typically immutable) which allows complex instructions to
be decomposed and executed as multiple steps of simpler instructions.
Where present, the correctness of this microcode should be considered as part of
the safety justification process.
Other Components:
Cache Memory
Microprocessor Cores
On-Chip Peripherals
Multi-core devices
Die Shrink
Package Style
Sheet B-2a: RAM Device Failure Modes
Failure Mode Failure Mechanism Defensive Measure
* Note: Some types of RAM require their data to be periodically refreshed, or the data will
be lost. A refresh operation is almost always managed by some sort of memory controller;
increasingly this is incorporated into the same integrated circuit. In some cases, however,
the refresh must be done in software (usually through the CPU as part of a timing
interrupt).
Sheet B-2b: RAM Device Description
Also, in the document below, several different relative speed comparisons are
made. These are just general trends, as bandwidths and response times can vary
even between sub-categories of these chips. Also, many of these chips have
versions used for applications where small bit errors can cause problems, and have
extra components that allow for detection and fixing of these bit errors.
increase reliability in unfavorable conditions. Data densities are much higher, and
this type of RAM is much cheaper. Though this type of RAM is slightly more
susceptible than SRAM to heat or radiation based bit-flips, modern
manufacturing methods have reduced this effect significantly to the point where
such occurrences are rather rare. Also, there are several methods that can be used
to detect and recover (sometimes without even any loss of data) from such bit
errors.
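One well-known way to detect and recover from a single flipped bit without loss of data is a Hamming code. The sketch below uses a Hamming(7,4) code, chosen here only as a small textbook example; the source does not say which scheme a given RAM device uses, and production memories typically apply wider codes in hardware.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword.

    Codeword positions 1..7 hold p1, p2, d1, p3, d2, d3, d4, where each
    parity bit covers the positions whose index has the matching bit set.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the 1-based
    position of the bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single bit flip in storage
    print(hamming74_decode(word))
```

This code corrects one bit per codeword. Two flipped bits in the same codeword would be miscorrected, which is why practical schemes add an extra parity bit to detect double errors.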
DDR SDRAM – (Double Data Rate SDRAM) – transfers twice as many words of data
in a single clock cycle as SDRAM. Later versions of DDR (such as
DDR2, DDR3, etc.) follow this trend and again double the number of reads
and writes per clock cycle.
1T (or 1T/1C) DRAM – A different design for DRAM that doesn't use
capacitors to store individual bits. Otherwise behaves the same as regular
DRAM.
Sheet B-3a: ROM Device Failure Modes
Failure Mode Failure Mechanism Defensive Measure
2. Manufacturing Defect Determine expected life spans of
3. Inadvertent Exposure components to determine
to UV rays (only probability of failure in given
UVPROM or EPROM) time-frame.
Integrity tests before and while
running.
Hardware and software data
verification, using methods such
as parity bits and checksums.
Software techniques to distribute
use across the chip to increase
the life span of the component.
Shield equipment from accidental
UV exposure. Cover the Erase
Window with opaque sticker.
Sheet B-3b: ROM Device Description
ROM devices are generally used in order to store data while the device or
component is turned off. All forms of ROM are considered highly stable, and
can go without power for years without losing their data. In many applications,
the main downside to ROM devices is that many of them are Read-Only, and
cannot actually be written to, or can be written to only once. These devices have
to be carefully programmed, and replaced if their data ever gets damaged. Some
forms of ROM can be re-written, but usually in large data blocks, and can only
be done fairly slowly.
Compared to most RAM devices, the read speed for most ROM devices is too
slow to use effectively during the standard operation of a component. Many
components therefore copy the data out of their ROM devices into RAM devices and
access the data from there. This can introduce vulnerabilities, since RAM is
generally less stable than ROM.
The error detection and/or correction strategies are similar to those for RAM
devices. The most common methods include using parity bits to check for single
bit flips, and checksums in general to check for larger amounts of corrupted data.
Some data correction strategies call for using a combination of checksums and
redundancy in order to recover lost data when possible. In most cases however, if
a ROM device has any data errors, it needs to be either reprogrammed, or
replaced entirely. Fortunately, ROM devices tend to have significantly longer life
expectancies than other devices.
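The two detection methods named above can be sketched as follows. This is an illustration under stated assumptions: even parity is computed per byte, and CRC-32 from the Python standard library (`zlib.crc32`) stands in for "a checksum"; an actual device may use a different parity convention or checksum algorithm, and the helper names are hypothetical.

```python
import zlib


def even_parity_bit(byte):
    """Parity bit that makes the total count of 1 bits even."""
    return bin(byte).count("1") % 2


def parity_ok(byte, stored_parity):
    """True when the byte still matches the parity bit stored with it."""
    return even_parity_bit(byte) == stored_parity


def image_ok(image, expected_crc):
    """True when the CRC-32 of a memory image matches the recorded value."""
    return zlib.crc32(image) == expected_crc


if __name__ == "__main__":
    byte = 0b10110010
    parity = even_parity_bit(byte)
    print(parity_ok(byte ^ 0b00000001, parity))  # single bit flip
    print(parity_ok(byte ^ 0b00000011, parity))  # double bit flip

    image = bytes(range(256))          # stand-in for ROM contents
    recorded = zlib.crc32(image)       # value recorded at programming time
    corrupted = bytearray(image)
    corrupted[10] ^= 0b00000011
    print(image_ok(bytes(corrupted), recorded))
```

The usage shows why both methods appear together: a parity bit catches a single flipped bit but passes a double flip in the same byte, while the checksum over the whole image flags the corruption that parity missed.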
Mask ROM – (Mask Read Only Memory) – This is the oldest form of ROM, and
the term ‘Mask’ refers to the most common method of manufacturing. Each
chip is manufactured with the data encoded on the chip and cannot be changed.
This is generally the cheapest and most dense form of ROM, and is commonly
used in applications where the values can never change, such as math lookup
tables. A common development cycle might call for using other forms of ROM
chips first, and then finalizing to Mask ROM to cut production costs in the
production run. However, this practice is becoming less and less common as
other forms of ROM become cheaper and grant greater flexibility.
every time even a single correction has to be made. Erasing is achieved by
exposing the internal circuitry of the chip to UV light through the Erase
Window, which clears the internal logic gates of their charge, allowing the chip’s
contents to be re-written. Sunlight can start erasing an EPROM chip within
weeks, and fluorescent lighting can erase an EPROM chip in months to years.
Extended overexposure to UV can permanently clear the chip so that it can no
longer store any data. Also, dust around and inside the chip can sometimes
shield parts of the circuitry from the UV light and cause incomplete
erasures. Versions of EPROM have been manufactured as OTP (One Time
Programmable) EPROM and have the window covered with an opaque
substance in order to prevent the erasure of the chip. This behaves like PROM,
but is still EPROM because of the way it is manufactured. Accidental
exposure to UV can inadvertently clear the memory of UV ROM chips.
Flash Memory (F-ROM) – This is a type of EEPROM that can be erased and
re-written much faster than standard EEPROM. Newer versions can have
greater durability and endurance (the number of erase-write cycles before chip
failure) than standard EEPROM. Some forms of Flash Memory have write
protection modes that disable writes except under special conditions, making them
behave more like standard ROM.
Sheet B-4a: A/D Converter Device Failure Modes
Analog to Digital converters are devices that are used to convert an analog
signal to a digital signal. To do this the input analog voltage is compared to a
‘standard’ voltage in order to assign a discrete value to the signal. The
different methods of converting analog to digital include: flash,
successive-approximation, ramp-compare, integrating, delta-encoded, and pipeline.
[Block diagram labels: Supply Voltage, Reference Voltage, Sensor Signal (can be
more than 1 if MUX is on-board), Clock, ADC, Status, Serial Data Out (to CPU)]
Failure Mode Failure Mechanism Defensive Measures
Inaccurate Output 1. Power Supply Off Quality testing.
Signal 2. Power Supply Dip Ensure proper shielding
3. Design Flaws around device to
4. Manufacturing Defect protect against radiation and
5. Failed Connections. EMI.
6. Incorrect Calibration Integrity tests before and
while running.
Ensure Proper Calibration
throughout signal
range**
* Many A/D converters require accurate calibration in order to convert an analog signal
to a digital signal. This is often done by providing a known input. This can be done
internally using power converters and capacitors to provide a steady comparison
signal. If this signal fluctuates or is inaccurate, then it can result in inaccurate readings
of the real signal.
** Many digital converters have different natural fluctuations in the output as the
signal moves up and down the available range. This means the device must be
calibrated at several different levels along the full range of the possible input signals.
Sheet B-4b: A/D Converter Device Description
Analog to Digital Converters are devices that convert a continuous analog signal
to a discrete value. There are many different methods of doing this, and there are
several key factors to consider when using an A/D Converter. The most common
methods are flash, successive-approximation, ramp-compare, integrating, delta-
encoded, pipeline and sigma-delta converters. The important features across
all the different analog to digital converters are sampling speed, sampling
frequency, resolution, accuracy, and non-linearity.
The sampling speed is the speed at which the A/D converter can convert the input
analog signal to a digital signal. Depending on the type of converter this can be
anywhere from nanoseconds to milliseconds. Generally the more accurate the
converter the slower the conversion, so the required sampling speed must be
considered when the converter is used in an application where many samples must
be taken in a short period of time. Sampling frequency is how many times per
second a sample of the analog signal is taken. In the case of an analog signal
that is known to oscillate, the converter should have a sampling frequency of at
least double the highest frequency of interest in the analog signal.
Resolution describes the precision of the output digital signal and the number of
bits used to describe a discrete analog value. In situations where the analog signal
may have a wide range, yet minor fluctuations are important, a greater-resolution
converter is needed. The accuracy of a converter describes its ability to reliably
convert an analog value to the digital value that describes the signal. The non-
linearity describes the behavior of some A/D converters where an even increment
of the analog signal doesn’t convert to an even increment of the digital signal.
These types of converters need to be calibrated at several different analog signal
levels.
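To make the link between resolution and the smallest detectable change concrete, the following sketch computes the step size of an ideal, perfectly linear converter and the code it would assign to an input voltage. The reference voltages and function names are assumptions for illustration; a real converter adds the accuracy and non-linearity errors described above.

```python
def lsb_size(v_ref_low, v_ref_high, bits):
    """Voltage represented by one count of an ideal `bits`-bit converter."""
    return (v_ref_high - v_ref_low) / (2 ** bits)


def ideal_adc_code(voltage, v_ref_low, v_ref_high, bits):
    """Output code of an ideal converter, clamped to the valid code range."""
    step = lsb_size(v_ref_low, v_ref_high, bits)
    code = int((voltage - v_ref_low) / step)
    return max(0, min(2 ** bits - 1, code))


if __name__ == "__main__":
    # A 0 to 10 V input range resolved with 8 bits and with 12 bits.
    for bits in (8, 12):
        print(bits, lsb_size(0.0, 10.0, bits), ideal_adc_code(5.0, 0.0, 10.0, bits))
```

With the same 0 to 10 V range, moving from 8 to 12 bits shrinks the step from about 39 mV to about 2.4 mV, which is the sense in which a wide-range signal with important minor fluctuations needs a higher-resolution converter.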
Flash Converter – These are the fastest analog to digital converters. They run the
signal through a bank of comparators and resistors in order to convert the signal in
a single pass. Because the resolution is determined by the number of comparators
that can fit on a single chip, these types of chips tend to have fairly low
resolution. Also, they can be error prone, since the failure of any comparator can
create a lot of noise in the output signal. These are most commonly used in high-
bandwidth applications like video conversion because of the speed needed to
transfer that much data.
of the signal line. However, because the timing is done with a simple oscillating
circuit, temperature changes can affect the accuracy of these converters. Though
this effect can be minor, these converters need to be calibrated at the intended
operating temperature in order to be accurate, and even then additional measures
must be taken to ensure output signal accuracy.
Pipeline Converter - These converters combine some of the features of Flash and
successive approximation converters in order to have a converter that works faster
than a successive approximation converter, and more accurately than a flash
converter while keeping the overall size of the converter much smaller.
Sheet B-5a: D/A Converter Device Failure Modes
Digital to Analog converters are devices that convert a discrete digital signal
to an analog signal by sending rectangular electrical pulses at extremely high
frequencies. This analog signal can then be passed through a filter that smooths
out the signal. The most common types of DACs are pulse-width modulators,
successive-approximation DACs, and thermometer-coded DACs.
[Block diagram labels: Supply Voltage, Reference Voltage, Read/Write, Load
Register, Reset, DAC, Data Bit 1, Data Bit 2, ... Data Bit n (up to “n” bits; 8,
12, 16 bits typical), Voltage or Current Signal Out]
Sheet B-5b: D/A Converter Device Description
Though these are the basic types, it is not uncommon for DACs to be built using
a combination of two or more different basic types. This is usually done in order
to combine the strengths of different types. For example, some use thermometer-
coded circuits for the most significant bits, but an R2R ladder for the least
significant bits. This still allows for the extremely fast conversion while reducing
the overall manufacturing complexity.
Pulse Width Modulators - These types of DACs are the simplest and are most
commonly used for electric motor control. A stable current is applied for variable
amounts of time to the output line. The amount of time per cycle is determined
by the digital value passed into the converter. The output is an analog signal that
oscillates at a known frequency where the width of the peak varies and where the
average voltage on the line is determined by the input signal. This output can
then be passed through a filter that smooths out the peaks or shifts the noise
into a range that won’t interfere with the analog equipment on the circuit.
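The relationship between the digital input and the average output can be sketched as below. This assumes an ideal pulse width modulator whose output switches between 0 V and a fixed high level, and assumes that the full-scale code maps to a 100% duty cycle; both are illustrative assumptions, as are the function names.

```python
def pwm_duty_cycle(code, bits):
    """Fraction of each cycle the output is high for a given input code.

    Assumes the full-scale code (2**bits - 1) maps to 100% duty cycle.
    """
    full_scale = 2 ** bits - 1
    if not 0 <= code <= full_scale:
        raise ValueError("code out of range for the given bit width")
    return code / full_scale


def pwm_average_voltage(code, bits, v_high):
    """Average (filtered) output voltage of an ideal PWM stage."""
    return pwm_duty_cycle(code, bits) * v_high


if __name__ == "__main__":
    for code in (0, 64, 128, 255):
        print(code, round(pwm_average_voltage(code, 8, 5.0), 3))
```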
Power Steering – This type of converter uses a set of parallel resistors to
convert the digital values to an analog value. The way
these resistors are configured allows for significantly fewer resistors on the same
circuit in order to achieve the output voltage. The largest downside to this type of
DAC is that it tends to be fairly inaccurate, so it is often used in situations where
speed of conversion and lowest cost are the highest priorities.
Sheet B-6a: Type 1 Controller Component Failure Modes
This is a basic layout of a standalone controller, labeled “Type 1” in this
guideline. A Type 1 controller is capable of performing typical I&C loop
functions without the need for any other modules.
A Type 1 controller typically contains CPU, RAM, ROM, A/D Converter, D/A
Converter, HSI, Clock, Watchdog Timer, and internal Power Supply devices (see
related Taxonomy sheets).
[Block diagram labels: Alarm, Clock, W/D Timer, Power Supply, Line Voltage, RAM,
CPU, ROM, Internal Bus, A/D, D/A, HSI, Input Signals, Output Signals, Operator
Inputs, Operator Display]
Failure
Failure Mode Defensive Measures
Mechanisms
Controller Lockup 1. CPU Halt See CPU Device Taxonomy Sheet
2. CPU Crash B-1a
3. Stopped internal Configure W/D Timer to detect,
clock alarm, and force outputs to
preferred state
Dead Controller 1. Failed internal Implement redundant,
power supply uninterruptible line power
2. Line voltage below
spec
Outputs Fail High 1. CPU Data See CPU Device Taxonomy sheet
Outputs Fail Low Corruption B-1a
2. CPU Logic Error See RAM Device Taxonomy Sheet
Output High Rate
3. D/A Device Error B-2a
of Change
4. Lost or corrupted See D/A Device Taxonomy Sheet
RAM data B-5a
Implement “loopback” signal by
connecting outputs to spare inputs
and check for deviations via SW
logic
Implement redundant controller,
validate output from primary
controller, takeover if needed
Loss of Input Signal 1. CPU Data See CPU Device Taxonomy sheet B-1a
Processing Corruption See RAM Device Taxonomy Sheet B-2a
2. CPU Logic Error See A/D Device Taxonomy Sheet B-4a
3. A/D Device Error Implement redundant controller to
4. Lost or corrupted takeover if needed
RAM data
Loss of Operator 1. CPU Data See CPU Device Taxonomy sheet
Interface Corruption B-1a
2. CPU Logic Error See RAM Device Taxonomy Sheet
3. HSI Device Error B-2a
4. Lost or corrupted Implement redundant controller to
RAM data takeover if needed
Failure to Boot or 1. CPU Data See CPU Device Taxonomy sheet
Reset Corruption B-1a
2. CPU Logic Error See RAM Device Taxonomy Sheet
3. Lost or corrupted B-2a
ROM data Implement redundant controller to
takeover if needed
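Two of the defensive measures in the sheet above, the watchdog timer and the “loopback” deviation check, can be modeled in a few lines. The sketch is a software model of the logic only: an actual watchdog is normally a hardware device independent of the CPU it monitors, and the class name, timeout, and tolerance values here are hypothetical.

```python
class WatchdogTimer:
    """Model of a watchdog that alarms and forces a preferred output state."""

    def __init__(self, timeout_s, preferred_output):
        self.timeout_s = timeout_s
        self.preferred_output = preferred_output
        self.last_kick_s = 0.0
        self.alarm = False

    def kick(self, now_s):
        """Called by the application on every healthy scan cycle."""
        self.last_kick_s = now_s

    def check(self, now_s, commanded_output):
        """Return the output to drive; the alarm latches once tripped."""
        if now_s - self.last_kick_s > self.timeout_s:
            self.alarm = True
        return self.preferred_output if self.alarm else commanded_output


def loopback_deviation(commanded, read_back, tolerance):
    """True when an output read back on a spare input deviates too far."""
    return abs(commanded - read_back) > tolerance


if __name__ == "__main__":
    wd = WatchdogTimer(timeout_s=0.5, preferred_output=0.0)
    wd.kick(1.0)
    print(wd.check(1.2, 42.0))   # application still running
    print(wd.check(2.0, 42.0))   # no kick for 1.0 s: controller lockup
    print(loopback_deviation(12.0, 13.0, tolerance=0.1))
```

The alarm is latched in this sketch so that a late kick does not silently restore the commanded output; whether a real design should latch or self-reset is a decision for the analyst.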
Sheet B-6b: Type 1 Controller Component Description
A Type 1 controller will have an integrated operator display that may be in a dot-
matrix, LED, or LCD form factor, or some combination thereof. Operator
inputs may be through the use of pushbuttons or touchscreen elements.
Type 1 controllers will have one or more analog outputs (current loops or voltage
outputs) and one or more digital outputs.
Type 1 controllers will have an embedded operating system that may include on-
board “function blocks” that can be called via an application-specific
configuration table, or it may require a separate application program to be
generated through the use of an application engineering tool.
Type 1 controllers will come with some form of CPU, RAM, ROM, Internal
Clock, Watchdog Timer, A/D Converter, D/A Converter, Internal Power
Supply and HSI devices.
Some Type 1 controllers may have a dedicated data communication port (Serial,
Ethernet or fieldbus).
Sheet B-7a: Type 2 Controller Component Failure Modes
Failure
Failure Mode Defensive Measures
Mechanisms
Controller Lockup 1. CPU Halt See CPU Device Taxonomy Sheet B-
2. CPU Crash 1a
3. Stopped internal Configure W/D Timer to detect,
clock alarm, and force outputs to preferred
state
Dead Controller 1. Failed internal Implement redundant, uninterruptible
power supply line power
2. Line voltage below
spec
Loss of Data 1. CPU Data See CPU Device Taxonomy sheet B-
Processing Corruption 1a
2. CPU Logic Error See RAM Device Taxonomy Sheet B-
3. Lost or corrupted 2a
RAM data Implement redundant controller,
4. Failed Backplane takeover if needed
Interface
Failure to Boot or 1. CPU Data See CPU Device Taxonomy sheet B-
Reset Corruption 1a
2. CPU Logic Error See RAM Device Taxonomy Sheet B-
3. Lost or corrupted 2a
ROM data Implement redundant controller to
takeover if needed
Sheet B-7b: Type 2 Controller Component Description
Type 2 controllers will have an embedded operating system that may include on-
board “function blocks” that can be called via an application-specific
configuration table, or it may require a separate application program to be
generated through the use of an application engineering tool.
Type 2 controllers will come with some form of CPU, RAM, ROM, Internal
Clock, Watchdog Timer, Internal Power Supply and Backplane Interface
devices.
Some Type 2 controllers may have a dedicated data communication port (Serial,
Ethernet or fieldbus).
Sheet B-8a: Data Communication Component Failure Modes
Failure
Failure Mode Defensive Measures
Mechanism
Failure to 1. Power Supply Off Do not turn off power Supply.
Send/Receive 2. Power Supply Dip Do not let power dip.
network signals 3. Failed Quality Testing
Connections Ensure Communication Module is
4. Manufacturing properly installed
Defects Network Initialization and integrity
tests*
Network Integrity tests during
operation*
Corruption of Data 1. Power Supply Dip Do not let power dip.
2. Design Flaws Quality testing.
3. Manufacturing Ensure proper shielding around
Defect module to
4. Failed protect against radiation and EMI.
Connections Integrity tests during operation.**
5. Memory Errors Memory Tests during operation **
6. Processor Logic Memory Redundancy
Errors Architectural diversity/redundancy
Integrity tests before installation **
Loss of Data 1. Power Supply Off Quality testing.
2. Manufacturing Ensure proper shielding around
Defect controller to
3. Failed protect against radiation and EMI.
Connections Memory Redundancy
4. Processor Crashes Architectural diversity/redundancy
5. Memory Failures Integrity tests before and while
running. **
* The most common method of ensuring proper and continuous connectivity of any
sort of communication device is to require periodic signals to and from other
network components. If no signals are received within a fixed period of time
(generally significantly less than 0.1 sec for real-time systems), assume some sort of
network connection failure. Testing for direct connection (through the backplane
using some bus architecture) to a controller can be done more quickly and reliably
using a similar method, usually with much shorter timeout periods.
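The periodic-signal check described in this note can be sketched as follows. This is a minimal illustration only: the class name, method names and the 0.05 sec timeout are assumptions chosen for the example, not values taken from this guideline or from any vendor product.

```python
class HeartbeatMonitor:
    """Declares a connection failed when no periodic signal arrives in time.

    Hypothetical sketch: times are passed in explicitly (in seconds) so the
    logic can be exercised without a real network or a real clock.
    """

    def __init__(self, timeout_s: float, start_s: float) -> None:
        self.timeout_s = timeout_s
        # Treat start-up as the most recent "signal" so the first period is timed.
        self.last_seen_s = start_s

    def record_signal(self, now_s: float) -> None:
        # Called whenever a periodic signal is received from the peer.
        self.last_seen_s = now_s

    def connection_failed(self, now_s: float) -> bool:
        # Assume a failure once the silent period exceeds the fixed timeout.
        return (now_s - self.last_seen_s) > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=0.05, start_s=0.0)
monitor.record_signal(0.04)
print(monitor.connection_failed(0.08))  # signal seen 0.04 sec ago
print(monitor.connection_failed(0.10))  # silent for longer than the timeout
```

The same structure could serve for a backplane check by choosing a much shorter timeout.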
** The data integrity of data transmitted over the network lines can be tested by
using packets with checksums, parity bits, and other such integrity tests. Integrity
tests for data sent to/from the controller can be done similarly, as well as having
redundant memory storage in this device.
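As an illustration of the packet integrity tests mentioned in this note, the sketch below appends a CRC-32 checksum to a payload and verifies it on receipt. The packet layout (payload followed by a 4-byte checksum) and the function names are assumptions made for the example; real protocols define their own framing and may use parity bits or other checks instead.

```python
import zlib


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (assumed layout, big-endian)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_packet(packet: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    if len(packet) < 4:
        return None  # too short to contain a checksum
    payload, trailer = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None  # data was corrupted somewhere along the line
    return payload


packet = make_packet(b"valve closed")
corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]  # flip one bit
print(check_packet(packet))
print(check_packet(corrupted))
```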
Sheet B-8b: Data Communication Component Description
When receiving data, the device converts the incoming signals to digital form and
saves the data to memory. It then checks the integrity of the data against whatever
integrity information is included in the packet. Once it has determined whether the
information has been corrupted, it sends a confirmation of receipt or a
resend request back across the line. If the quantity of data to be received is greater
than the maximum size of a packet, the communication module can receive
several packets and reassemble the data before sending it to the controller.
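The receive sequence described above (check integrity, answer with a confirmation or a resend request, reassemble multi-packet data) might look like the following sketch. The sequence-number scheme, the "ACK"/"RESEND" strings and the class name are hypothetical; they are not drawn from any particular protocol.

```python
import zlib


class Receiver:
    """Collects checked packets and reassembles them in sequence order."""

    def __init__(self, total_packets: int) -> None:
        self.total_packets = total_packets
        self.fragments = {}  # sequence number -> payload

    def receive(self, seq: int, payload: bytes, crc: int) -> str:
        # Reject packets with an impossible sequence number or a bad checksum.
        if not 0 <= seq < self.total_packets:
            return "RESEND"
        if zlib.crc32(payload) != crc:
            return "RESEND"
        self.fragments[seq] = payload
        return "ACK"

    def reassemble(self):
        # Only hand data to the controller once every packet has arrived.
        if len(self.fragments) < self.total_packets:
            return None
        return b"".join(self.fragments[i] for i in range(self.total_packets))


rx = Receiver(total_packets=2)
print(rx.receive(1, b"def", zlib.crc32(b"def")))  # packets may arrive out of order
print(rx.receive(0, b"abc", zlib.crc32(b"abc")))
print(rx.reassemble())
```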
There are many different protocols that communication modules can use. The
protocol determines how the module is supposed to packetize data, and
when and how often to send packets across the network. Some modules support
a few different protocols and must be configured to use the desired protocol or, in
a few cases, can auto-detect the protocol being used by other devices on the
network.
Software Faults
The assessment of system hazards that result from software faults has remained a
controversial issue across many industries where software plays mission critical or
safety critical roles. Although software behaviors remain inextricably connected
to the hardware environment in which the software executes, the nature of
software faults and failures is fundamentally different from that of electronic
hardware. As noted earlier in this Appendix, hardware faults and failures can be
traced to changes in the properties of the electronic materials that in turn can be
predicted and assessed using conventional concepts such as wear and other age-
related degradation mechanisms.
Software, of course, does not wear out in any normal sense of the term. It might
be expected that once a unit of software was known to be complete and correct, it
would remain fault-free forever. There are two important reasons why this is
neither an accurate view of industry experience with software systems nor a useful
view from a systems theory point of view.
To address this first challenge, both the application software and its entire tool
chain need to be assessed for hazards due to behavioral complexity. While some
aspects of this issue can be approached from the perspective of standards for the
software development process, extensive testing of the integrated
hardware/software system in its native environment (or in a high fidelity replica
of the native environment) remains the most effective way to discover software
faults in complex software.
B-36
produce anomalous behaviors (“failures”) resulting both from software faults and
from software interactions.
Figure B-4 illustrates a hierarchy of software interaction and faults, which are
broken down and described in Taxonomy Sheets B-9 through B-11. In using
these Taxonomy Sheets, software failure modes and failure mechanisms should
be included with the hardware failure modes and mechanisms of components,
sub-systems and systems as additional sources of hazard. At the component level,
the lowest 3 levels of software failure taxonomy should be considered. At higher
levels (sub-system, system) the Level 1 Binaries issues may be deferred. The
Level 4 Architecture issues, as well as the Levels 2 and 3 should be evaluated for
all subsystems and systems that interact with the proposed modifications, even if
no changes are proposed in the other interacting portions of the system.
The following terms are used in this section of Appendix B, specific to the
software interactions and faults described in Taxonomy Sheets B-9 through B-
11. Their definitions are provided in Section 1, and are repeated here for
convenience:
Behavior: The evolution of the input, processing and output states of a digital
computing system over time. By decomposition, the evolution of the states of a
subsystem or component over time. Some of the meaning of this term is similar
to the use of the term Function, as in functional requirements or function
decomposition.
Error. (1) The difference between a computed, observed, or measured value or
condition and the true, specified, or theoretically correct value or condition. For
example, a difference of 30 meters between a computed result and the correct
result. (2) An incorrect step, process, or data definition. For example, an incorrect
instruction in a computer program. (3) An incorrect result. For example, a
computed result of 12 when the correct result is 10. (4) A human action that
produces an incorrect result. For example, an incorrect action on the part of a
programmer or operator. (Reference 2)
Insertion Mechanism: For faults, the pathway of processes and conditions that
resulted in the presence of the fault, but not its discovery. Insertion mechanisms
are often linked to the stages of the development and production process (e.g.,
design, tool behavior, etc.)
Non-Fatal Fault: A software fault that allows program execution to continue, but
with incorrect behavior.
Non Plausible Outcome Failure: A non-fatal fault with output errors that do not
satisfy output expectations or specifications (i.e., a form of soft failure).
Plausible Outcome Failure: A non-fatal fault with output that appears to satisfy
output expectations but contains errors (i.e., a form of soft failure).
Software Hazard: A process or resulting outcome that has the potential under at
least some conditions to result in an unplanned event or series of events causing
damage to equipment or the environment and/or death, injury or illness to
personnel. Hazards may be graded by the extent of the damage and injury
potential.
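The difference between the Plausible Outcome Failure and Non Plausible Outcome Failure definitions above can be illustrated with a short sketch. It assumes, purely for illustration, that "output expectations" can be expressed as a numeric range and that the correct value is known to the analyst; the function and parameter names are hypothetical.

```python
def classify_outcome(output: float, correct: float, low: float, high: float) -> str:
    """Classify an output against the correct value and an expected range.

    Assumption for this sketch: the range [low, high] stands in for the
    output expectations or specifications.
    """
    if output == correct:
        return "correct"
    if low <= output <= high:
        # Wrong, but looks acceptable: a range check alone cannot catch it.
        return "plausible outcome failure"
    return "non-plausible outcome failure"


# A computed result of 12 when the correct result is 10, expected range 0..100.
print(classify_outcome(12.0, 10.0, 0.0, 100.0))
print(classify_outcome(250.0, 10.0, 0.0, 100.0))
```

The point of the sketch is that a simple range check on the output only reveals the non-plausible case; the plausible case needs an independent source of the correct value, such as a diverse calculation.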
[Figure B-4, diagram not reproduced. Labels shown: Level 4 (System Architecture), two units each with CPU, Memory, Input/Output and Bus; Level 3 (Application, OS SW), with Requirements, Design, Implementation, Application Development Tools, and Application and Operating System Software Source Codes; Level 2 (Tools), with Compiler and Loader; Level 1 (Binaries), with CPU, Memory, Input/Output and Bus.]
Figure B-4
Hierarchy of Software Interactions & Faults
Sheet B-9a: Level 1 (Binaries) Interactions & Faults
Failure Mode: Process halt with exception (Fatal)
  Failure Mechanisms: Hardware fault; Compiler/Loader error; Application software fault; Architecture error
  Defensive Measures: Hardware defensive measures; Compiler/Loader validation procedures; Compiler/Loader testing on target hardware; Application validation procedures; Application testing on target hardware; Diversity of applications; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user restart procedure
Failure Mode: Process halt without an exception (Fatal)
  Failure Mechanisms: Hardware fault
  Defensive Measures: Hardware defensive measures
Failure Mode: Process indefinite loop (Non-fatal)
  Failure Mechanisms: Compiler/Loader error; Application software error
  Defensive Measures: Compiler/Loader validation procedures; Compiler/Loader testing on target hardware; Application validation procedures; Application testing on target hardware; Diversity of applications; Ensure user visibility into application outcomes; Ensure safe user restart procedure; Ensure safe user process termination procedure
Failure Mode: Arithmetic Logic Unit error (Non-fatal)
  Failure Mechanisms: Hardware fault
  Defensive Measures: Hardware defensive measures; Application testing on target hardware; Diversity of applications; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user process termination procedure
Failure Mode: Digital input error (Non-fatal)
  Failure Mechanisms: Hardware fault; Device driver fault
  Defensive Measures: Hardware defensive measures; Application testing on target hardware; Diversity of applications; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes
Failure Mode: Digital output error (Non-fatal)
  Failure Mechanisms: Hardware fault; Device driver fault
  Defensive Measures: Hardware defensive measures; Application testing on target hardware; Diversity of applications; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes
Sheet B-9b: Level 1 (Binaries) Description
The earlier discussions of the CPU (Sheet B-1) and Memory (Sheets B-2 and B-
3) provide a hardware view of digital system faults. To frame the faults and
failure modes of software, it is important to understand how software can execute
incorrectly at the binary level even with no hardware failure. To illustrate this
idea, a Von Neumann CPU architecture is described, and although this is a very
common architecture, it is not the only type of computing hardware.
Most CPUs have built-in “monitor” logic that uses the manifest to check the
Program Counter and any requests to fetch data from memory, verifying that the
Program Counter is getting its next instruction from a legal program area of
memory and that a data request is fetching from a legal data area of memory. If
the monitor detects an incorrect value for either of these, it can generate an
exception and halt processing for that software component instance. Another
example of execution monitoring is a watchdog timer. This logic looks for the
last instruction of the software component instance as defined by its manifest,
and if it does not see this endpoint instruction after a defined interval of time on
the CPU clock, it raises an exception.
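The two kinds of execution monitoring described above can be sketched in software form. This is a simplified model, not a description of any real CPU: the memory areas are represented as Python ranges, time is counted in abstract clock ticks, and all names are assumptions introduced for the example.

```python
class ExecutionException(Exception):
    """Raised when the monitor or the watchdog detects a violation."""


def check_fetch(address: int, legal_area: range) -> None:
    """Monitor check: a fetch must fall inside its legal memory area."""
    if address not in legal_area:
        raise ExecutionException(f"illegal fetch at address {address:#x}")


class WatchdogTimer:
    """Raises an exception if the endpoint instruction is not seen in time."""

    def __init__(self, interval_ticks: int) -> None:
        self.interval_ticks = interval_ticks
        self.started_at = None  # None means nothing is currently being timed

    def start(self, now_ticks: int) -> None:
        self.started_at = now_ticks

    def endpoint_seen(self) -> None:
        # The last instruction of the component instance was executed.
        self.started_at = None

    def check(self, now_ticks: int) -> None:
        if self.started_at is None:
            return
        if now_ticks - self.started_at > self.interval_ticks:
            raise ExecutionException("watchdog interval elapsed before endpoint")


program_area = range(0x0000, 0x1000)  # assumed legal program area
check_fetch(0x0040, program_area)      # legal fetch, no exception

watchdog = WatchdogTimer(interval_ticks=100)
watchdog.start(now_ticks=0)
watchdog.check(now_ticks=100)          # still within the interval
watchdog.endpoint_seen()               # endpoint reached, timer disarmed
```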
Exceptions found during execution can be handled in several ways. Two of the
most common actions are (1) to restart the software component instance by
resetting the Program Counter to its first instruction; or (2) allow another
running software component instance (e.g., the operating system) to access the
exception and determine what action is appropriate.
A software component instance can receive data from input devices and send data
to output devices and storage devices through its data memory. In some CPU
instruction sets, any area of data memory can be used as an input or output. In
other CPUs, special regions of memory are designated for input and output.
Errors in input values can cause faulty program execution, and similarly, errors in
program execution may cause errors in output values.
Sheet B-10a: Level 2 (Tools) Interactions & Faults
Failure Mode: Compiler fatal translation error
  Failure Mechanisms: Compiler/Loader error
  Defensive Measures: Compiler/Loader validation procedures; Compiler/Loader testing on target hardware; Application testing on target hardware; Diversity of applications
Failure Mode: Compiler non-fatal translation error
  Failure Mechanisms: Compiler/Loader error
  Defensive Measures: Compiler/Loader validation procedures; Compiler/Loader testing on target hardware; Application testing on target hardware; Diversity of applications
Failure Mode: Loader program error (fatal)
  Failure Mechanisms: Compiler/Loader error
  Defensive Measures: Compiler/Loader validation procedures; Compiler/Loader testing on target hardware; Application testing on target hardware; Diversity of applications
Failure Mode: Loader data error (non-fatal)
  Failure Mechanisms: Compiler/Loader error
  Defensive Measures: Compiler/Loader validation procedures; Compiler/Loader testing on target hardware; Application testing on target hardware; Diversity of applications
Sheet B-10b: Level 2 (Tools) Description
Early efforts to program computers directly in binary quickly revealed how time
consuming and error prone this approach was. The development of assemblers,
compilers and loaders allowed the creation of binary executables from higher-
level computer languages. Assemblers and compilers are used to generate a
software component object code file. A loader is used to place the instructions in
the object code file into their correct memory locations on the host hardware
platform to create the software component instance. These tools are computer
programs themselves, and therefore have all of the hazards at the execution level
as well as new mechanisms for introducing faults into the binary executable code
of the software component instance.
Some widely used modern languages like C and C++ are known to have weak
compilers: the fact that source code compiles into object code without errors
or warnings does not mean that the code is error free, or that it will even
execute at all. To attack this problem and achieve trusted object code generation,
the Department of Defense developed the Ada language and compiler. The
verification and validation of this language and its family of compilers took over
10 years and many hundreds of millions of dollars. In the end, the DoD
abandoned the Ada language because it was professionally difficult to get
software developers to learn and use the language, and it became economically
unviable.
Sheet B-11a: Level 3 (Application & OS Source Codes)
Interactions & Faults
Failure Mode: Flawed design allocation
  Failure Mechanisms: Application software error (non-fatal)
  Defensive Measures: Application validation procedures; Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes
Failure Mode: Flawed interface definition
  Failure Mechanisms: Application software error (non-fatal)
  Defensive Measures: Application validation procedures; Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes
Failure Mode: Flawed logic or algorithm design
  Failure Mechanisms: Application software error (non-fatal)
  Defensive Measures: Application validation procedures; Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware
Failure Mode: Flawed interface implementation
  Failure Mechanisms: Application software error (fatal or non-fatal)
  Defensive Measures: Application validation procedures; Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware
Failure Mode: Severely flawed logic or algorithm implementation
  Failure Mechanisms: Application software error (fatal or non-fatal)
  Defensive Measures: Application validation procedures; Application testing on target hardware; Diversity of applications; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes
Failure Mode: Mildly flawed logic or algorithm implementation
  Failure Mechanisms: Application software error (non-fatal)
  Defensive Measures: Application validation procedures; Application testing on target hardware; Diversity of applications; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user process termination procedure
Failure Mode: Inappropriate but allowed use of language constructs
  Failure Mechanisms: Application software error (non-fatal)
  Defensive Measures: Application validation procedures; Application testing on target hardware; Diversity of applications; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user process termination procedure
Failure Mode: Configuration parameters out of bounds
  Failure Mechanisms: User error (fatal or non-fatal)
  Defensive Measures: Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user process termination procedure; Oversight and review of parameter settings
Sheet B-11b: Level 3 (Application & OS Source Codes)
Description
Let B be the set of all possible system behaviors. Let B* be the set of
desired behaviors that are required by a set of requirements statements,
R, and let B^ be the set of behaviors that are prohibited by R. R is said
to be logically closed if it can be shown that B* + B^ = B
In most cases, the set R leaves a very big gap between what R explicitly states (B*
and B^) and the entire set B. The difference between R and B amounts to
“unstated” requirements in terms of desired, but unspecified behaviors (B*), and
prohibited, but unspecified behaviors (B^). This situation exists not just for the
application source code requirements, but also for the language implementation
(i.e., compilers, interpreters) and for the Operating System itself. The current
practice of reviewing requirements statements frequently with the stakeholders
during development has been marginally successful in reducing requirements
errors, which can be errors of omission (unspecified behaviors that should be
specified) and errors of commission (specified behaviors that are expressed
incorrectly).
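The closure condition and the resulting gap can be made concrete with finite sets. The sketch below reads the "+" in the closure condition as set union, which is an assumption about the notation, and uses a tiny invented behavior set; the behavior names are illustrative only.

```python
def is_logically_closed(all_behaviors: set, desired: set, prohibited: set) -> bool:
    """R is closed when desired (B*) and prohibited (B^) together cover B."""
    return (desired | prohibited) == all_behaviors


def unstated_behaviors(all_behaviors: set, desired: set, prohibited: set) -> set:
    """Behaviors in B that R neither requires nor prohibits (the gap)."""
    return all_behaviors - (desired | prohibited)


# Hypothetical behavior sets for a single valve controller.
B = {
    "close valve on pump trip",
    "open valve on operator demand",
    "close valve spuriously",
    "ignore operator demand",
}
B_star = {"close valve on pump trip", "open valve on operator demand"}
B_hat = {"close valve spuriously"}

print(is_logically_closed(B, B_star, B_hat))
print(unstated_behaviors(B, B_star, B_hat))
```

In this toy case the requirements are not closed, and the one uncovered behavior is an error of omission in the sense used above.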
B-50
the timing, accuracy and memory usage of alternative approaches. Common
errors introduced in design are underestimating the computing time and
maximum memory footprint of an algorithm and its related data structures,
failing to guard against out-of-range data values, and failing to ensure that the
software follows the expected execution paths. Over the past 20 years, the use of
standard design representations such as Unified Modeling Language (UML) and
the documentation of standard “design patterns” have helped reduce the variance
in the ways that the software is designed.
Because the current ability remains weak in using inspections to detect errors
introduced in requirements, design and implementation, software development
success can depend strongly on extensive testing, diagnostics and repair of the
source code. A limitation to the test of complex software is the difficulty of
generating the complete set of test cases that exercise all of the logic paths in the
software. During test and “debugging” of the software, disciplined change
management and configuration control are important in preventing the
introduction of new errors during the repair of others.
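The difficulty of exercising every logic path can be seen from a simple count. The sketch assumes an idealized program made of independent two-way branches executed in sequence; real software has loops and dependent conditions, so this is only an order-of-magnitude illustration, and the function names are hypothetical.

```python
from itertools import product


def count_paths(n_branches: int) -> int:
    """Number of distinct paths through n independent two-way branches."""
    return 2 ** n_branches


def enumerate_paths(n_branches: int):
    """Yield each path as a tuple of branch outcomes (False/True)."""
    return product((False, True), repeat=n_branches)


print(count_paths(3))                  # small enough to test exhaustively
print(len(list(enumerate_paths(3))))   # matches the count above
print(count_paths(30))                 # already over a billion paths
```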
Experience and case studies of software application development show that the
larger and more complex the software, the larger the number of latent faults it
likely contains. While some languages are simply more verbose than others, the
relationship between functional size and complexity and fault rates is not
disputable.
Sheet B-12a: Level 4 (System Architecture) Interactions &
Faults
Level 4 Software Architecture faults arise as a result of interactions between
software units. While some Level 4 faults result from the propagation of data
between faulty software units (Level 3 or lower faults), it is also possible to have
Level 4 faults when all software units are operating normally (an absence of
Level 3 or lower faults).
The types of fault at Level 4 depend strongly on the types of architectures, from
strongly isolated to strongly coupled.
Failure Mode Failure Mechanism Defensive Measures
Ensure safe user process
termination procedure
Failure Mode: Channel fatal outcome
  Failure Mechanisms: Hardware fault
  Defensive Measures: Hardware defensive measures; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into architecture outcomes; Ensure safe user process termination procedure
Failure Mode: Input data out of range
  Failure Mechanisms: Application software fault in Sender
  Defensive Measures: Application validation procedures; Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user process termination procedure
Failure Mode: Input data in range but incorrect
  Failure Mechanisms: Application software fault in Sender
  Defensive Measures: Application validation procedures; Application testing on target hardware; Diversity of applications; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user process termination procedure
Failure Mode: Required input data not received
  Failure Mechanisms: Hardware fault; Application software fault in Sender
  Defensive Measures: Hardware defensive measures; Application validation procedures; Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user restart procedure; Ensure safe user process termination procedure
Failure Mode: Expected input data not received
  Failure Mechanisms: Hardware fault; Application software fault in Sender
  Defensive Measures: Hardware defensive measures; Application validation procedures; Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user restart procedure; Ensure safe user process termination procedure
Failure Mode: Process starvation
  Failure Mechanisms: Hardware fault; Operating system error
  Defensive Measures: Hardware defensive measures; Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user restart procedure; Ensure safe user process termination procedure
Failure Mode: Process degradation
  Failure Mechanisms: Hardware fault; Operating system error
  Defensive Measures: Hardware defensive measures; Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user restart procedure; Ensure safe user process termination procedure
Failure Mode: Process resource contention
  Failure Mechanisms: Hardware fault; Operating system error
  Defensive Measures: Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user restart procedure; Ensure safe user process termination procedure
Failure Mode: Process incorrect termination
  Failure Mechanisms: Operating system error
  Defensive Measures: Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user restart procedure
Failure Mode: Faulty process execution
  Failure Mechanisms: Operating system error
  Defensive Measures: Application testing on target hardware; Diversity of applications; Architecture validation procedures; Architecture testing on target hardware; Ensure user visibility into application outcomes; Ensure user visibility into architecture outcomes; Ensure safe user process termination procedure
Sheet B-12b: Level 4 (System Architecture) Description
B-56
processing unit. The core processing architecture uses a separate software
component, the Operating System (OS), to control which application software
components are loaded into memory and which of the loaded components are
executing. The core processing unit usually has many
hardware resources, such as analog in/out converters, video processing, network
communications devices, and mass storage devices that the OS can make
available to the application software processes by software interactions (“calls”)
with the OS software behaviors. Many modern Operating Systems support calls
directly between separately executing application software components, allowing
the separate applications to share data and to make calls on each other’s
functional behaviors. In addition to its loader functions, the OS normally will
have a Scheduler function that shifts execution dynamically between multiple
“running” applications so that the processing and peripheral resources of the core
processing unit are shared across the application software components over time
(“multi-processing”).
Despite the evolution of standards and the economic incentives to increase the
performance and reliability of OS and applications, there are very few trusted OS
today. In some industry verticals, the development of a trusted OS and trusted
applications is so important that these systems are highly proprietary and not
available to other developers. The Operating Systems in wide use today are
known to have many weaknesses, and the pace of OS configuration changes and
patches is quite fast, leading to great pressure on the applications developers to
update their software for each OS change or be left out of the market.
B-57
limits the interactions across the segments of core processing, but allows rich
interactions within the segments.
B-58
Appendix C: Physical and Functional
Representations
A physical representation would typically include several of the following
elements:
Digital System Components
- Controllers
- Input or Output modules
- Data Communication Modules
- Network Components
- Media Converters
- Power Supplies
- Workstations
- Servers
Interfacing Components
- Other Controllers or Hand/Auto Stations
- Handswitches
- Limit or Position Switches
- Sensors
- Indicators
- Alarms
- Relays
- Firewalls or Data Diodes
Connections
- Analog signals
- Digital signals
- Data communications
- Clock signals
- Power
- Grounds
- Maintenance ports (e.g., laptop connection point)
- Factory ports (i.e., used by the vendor only)
Controlled Components
- Pumps
- Valves
- Breakers
- Switchgear
- Motor Control Centers
Process Elements
- Reactor
- Heat Exchangers
- Tanks
- Steam Generators
- Pipes
- Flow Elements
Symbols used in a block diagram should follow the same conventions used for
other plant drawings, such as piping and instrumentation diagrams or electrical
elementary diagrams. For some components, state representation should be
drawn into the symbol that is used to represent the physical component. For
example, a relay contact can be represented in an FMEA block diagram as
normally closed, using the following symbol and a note:
Note 1: Relay contact shown as normally closed (de-energized)
Figure C-1
Relay Contact Symbol
Appendix D: Circulating Water System
(CWS) Top Down Analysis
Section 4, Figure 4-7 and Figure 4-8 in the body of this guideline introduce and
describe a distributed control system for a circulating water system at a nuclear
power plant. In this Appendix, the top down logic for relevant portions of the
circulating water system and the control system are developed along with a
discussion of the results.
The distributed control system has a direct impact on the availability of the
circulating water system in two ways:
- Response to the trip of a circulating water pump by automatically isolating the
  affected pump (this prevents reverse flow through the tripped pump and an
  even greater reduction in flow through the condenser than from just the loss
  of the pump) and support for operator action to start and un-isolate one of the
  idle circulating water pumps.
- Spurious actuation of circulating water equipment when not called upon to operate
  (e.g., spurious closure of the circulating water pump discharge isolation valve).
The top event in Figure D-1a represents insufficient circulating water flow
initiated by the trip of an operating circulating water pump. This figure is a partial
fault tree focusing on one train of circulating water (Train 1). Shown under gate
CWS-TR1-01 are the trip of Pump 1 or the spurious opening of the breaker for
the pump. System response to tripping of the pump would include automatic
isolation of the discharge valve, the logic for which is shown under gate G006.
Failure to isolate the failed pump could be a result of the discharge MOV failing
to close or failure of the plant control system to initiate a closure signal (Gate
G010 – developed further in Figure D-1b).
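The gate relationships described in this paragraph can be written as Boolean logic. The sketch below covers only the two gates named in the text; the basic event names are invented for the example, and combining the two gates with AND to represent an unisolated trip is an assumed reading of the paragraph, since the figure itself is not reproduced here.

```python
def gate_cws_tr1_01(pump1_trips: bool, breaker1_spuriously_opens: bool) -> bool:
    """OR gate: Train 1 is lost by a pump trip or a spurious breaker opening."""
    return pump1_trips or breaker1_spuriously_opens


def gate_g006(mov1_fails_to_close: bool, g010_no_closure_signal: bool) -> bool:
    """OR gate: isolation fails if the MOV sticks or no closure signal is sent."""
    return mov1_fails_to_close or g010_no_closure_signal


def unisolated_trip(pump1_trips: bool, breaker1_spuriously_opens: bool,
                    mov1_fails_to_close: bool, g010_no_closure_signal: bool) -> bool:
    """Assumed combination: the train is lost AND it is not isolated."""
    lost = gate_cws_tr1_01(pump1_trips, breaker1_spuriously_opens)
    not_isolated = gate_g006(mov1_fails_to_close, g010_no_closure_signal)
    return lost and not_isolated


# Pump 1 trips and the control system fails to send the closure signal.
print(unisolated_trip(True, False, False, True))
# Pump 1 trips but isolation works as designed.
print(unisolated_trip(True, False, False, False))
```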
If a tripped pump were not to be isolated, the pump would coast down and
reverse flow through the pump would begin. The loss of the pump plus the
additional diversion of flow would be roughly equivalent to the loss of two
circulating water pumps. Given the need for four-pump flow to keep the plant at
full power, loss of only one additional circulating water pump is required before
inadequate circulating water flow would be expected to lead to a trip on high
condenser vacuum. Top down logic for loss of the additional pump is developed
under Gate G004, which considers loss of any of the remaining five pumps
(Pumps 2 through 6).
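The pump-count reasoning above can be checked with simple arithmetic. The sketch assumes six pumps in total and that each unisolated tripped pump diverts roughly one running pump's worth of flow, which is one way to read "equivalent to the loss of two circulating water pumps"; the function names are hypothetical.

```python
REQUIRED_PUMP_EQUIVALENTS = 4  # four-pump flow needed for full power (from the text)


def effective_flow(pumps_running: int, unisolated_tripped_pumps: int) -> int:
    """Net flow in pump equivalents, assuming each unisolated tripped pump
    diverts about one pump's worth of flow through reverse flow."""
    return pumps_running - unisolated_tripped_pumps


def flow_sufficient(pumps_running: int, unisolated_tripped_pumps: int) -> bool:
    return effective_flow(pumps_running, unisolated_tripped_pumps) >= REQUIRED_PUMP_EQUIVALENTS


# Pump 1 tripped and not isolated, Pumps 2 through 6 all running.
print(flow_sufficient(pumps_running=5, unisolated_tripped_pumps=1))
# Same condition, but one of the remaining five pumps is also lost.
print(flow_sufficient(pumps_running=4, unisolated_tripped_pumps=1))
```

Under these assumptions, one additional pump loss is enough to drop below four-pump flow, which matches the statement in the text.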
Recall that two of the circulating water pumps are in standby. Therefore, they
must be started by the operators in order to have sufficient circulating water flow
to avoid a plant trip on high condenser vacuum. Figure D-1a also presents this
logic for one of the standby pumps.
The logic for starting of a standby pump (Train 3) is presented under Gate CWS-
3. The loss of this pump train may occur due to failure of the pump to start,
failure of the breaker to close (both under Gate G002) or failure of the plant
control system to initiate the pump train (Gate G013 – developed further in
Figure D-1c). Similar logic is also developed for Train 4, the other standby train.
A pump, once operating, may become unavailable if it fails to run, the breaker fails
to remain closed or the discharge isolation valve fails to remain open. This logic is
presented under Gates CWS-TR3-01 and CWS-TR3-02 for Train 3. Logic
similar to this is developed for all five pumps, reflecting the possibility of any of
these trains failing given that initially they are running successfully.
In Figure D-1b, the top down logic for the plant control system is developed in
support of the system function to isolate a pump that has tripped. Failure to produce
a signal to close the discharge MOV for the tripped pump may be due to the digital
input module failing to sense that the pump breaker has opened (DI1), the communications
network failing to transmit this information from the I/O cabinet to the logic
cabinet and back (two network loops – Gate G015-A-FF), the master logic
controller failing to interpret the information correctly and provide an output
signal to close the valve, or the digital output module failing to provide a
signal to close the MOV. With respect to the master logic controller (Gate
G039), its failure is backed up by a slave controller. Loss of this backup source of a
closure signal to the discharge MOV could be due to failure of the watchdog timer
which monitors the status of the master logic controller, failure of the slave
controller itself or loss of the two communications networks. Note that the master
controller and the slave controller are in separate divisions of the plant control
system. For a controller to transmit information to the MOVs in the opposite
division through a given communication loop, loss of any of the four
communications modules in that loop fails the communications loop as shown
under Gate G037. For a controller to transmit information to the MOVs in the
same division through a given communication loop, only the communications
units in that division within that loop can contribute to loss of communication to
the MOVs from the controller (as shown under Gate G015-A-FF).
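The control system logic described for Figure D-1b can be sketched as Boolean functions. The structure below is an interpretation of the text only: the event and function names are invented, and treating the master controller failure and the loss of its backup as an AND combination under Gate G039 is an assumption based on the statement that the slave controller backs up the master.

```python
from typing import Iterable


def loop_fails(module_failures: Iterable[bool]) -> bool:
    """A communication loop fails if any contributing module fails.

    Pass all four modules for cross-division traffic, or only the modules
    in the controller's own division for same-division traffic.
    """
    return any(module_failures)


def backup_lost(watchdog_failed: bool, slave_failed: bool,
                both_networks_lost: bool) -> bool:
    """The slave controller backup is unavailable if any of these occurs."""
    return watchdog_failed or slave_failed or both_networks_lost


def g039_controller_output_lost(master_failed: bool, watchdog_failed: bool,
                                slave_failed: bool, both_networks_lost: bool) -> bool:
    """Assumed AND: the master fails and its backup is also lost."""
    return master_failed and backup_lost(watchdog_failed, slave_failed,
                                         both_networks_lost)


def isolation_signal_fails(di_failed: bool, network_lost: bool,
                           controller_output_lost: bool, do_failed: bool) -> bool:
    """Any one of the four contributors defeats the closure signal."""
    return di_failed or network_lost or controller_output_lost or do_failed


# Master fails, but the watchdog and slave controller take over.
controller = g039_controller_output_lost(True, False, False, False)
print(isolation_signal_fails(False, False, controller, False))
# Master fails and the watchdog timer has also failed.
controller = g039_controller_output_lost(True, True, False, False)
print(isolation_signal_fails(False, False, controller, False))
```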
In Figure D-1c, the top down logic for opening an MOV on a standby pump and
starting the pump is shown. Failure to initiate a standby pump in the event that an
operating pump trips leaves the plant with insufficient circulating water flow to
maintain full power operation. The action to start a standby pump is modeled as
D-2
an action that the operators take in response to the tripped pump. Failure to start
the standby pump can occur if the operators do not take action in time (event
CWS-PMOA-OPENMOV in Figure D-1c), the workstations and
communication network loops do not transmit the operator’s signal to the I/O
cabinets or the digital output device does not pass the signal on to the MOV
circuitry.
The top down logic presented in Figure D-1a represents the different means of
failing to isolate a single train of circulating water (Train 1) should the pump in
that train trip during plant operation. The logic focuses on the mechanical
equipment that make up the pump train. Similar top down logic exists for all six
circulating water pump trains.
The top down logic presented in Figure D-1b represents the plant control system
as it is required to produce an automatic isolation signal for the discharge MOV
for a pump that has tripped. The pump train represented in the logic is again
pump train 1. Similar logic has been developed for all six circulating water pump
trains. Figure D-1c presents top logic for the plant control system as it is needed
to start a standby pump manually. There are two standby trains associated with
the example circulating water system and similar top down logic has been
developed for both.
Isolation of an operating circulating water pump train can occur for several
reasons: in response to trip of the circulating water pump in that train, spurious
opening of the breaker to the pump, spurious closure of the discharge valve for the
pump or initiation of a spurious signal to close the discharge valve. Figure D-2a
presents the top down logic for the first three of these failures while the logic for
the spurious signal is shown in Figure D-2b. Figure D-2a also shows logic for
starting a standby pump (Gate G013). This is the same logic that was developed
above under Figure D-1c.
While the attached logic is for pump train 3, similar logic is developed for all six
pumps. The logic for starting a pump is applicable only to the two standby pumps
(Trains 3 and 4).
Results
On development of the top logic for each of the six circulating water pump trains,
the logic is combined in a manner that reflects the success criterion for the
circulating water system. Figure D-3 presents this logic.
As noted earlier, flow through the condenser equivalent to that for four circulating
water pumps is assumed to be required to support full power operation. If an
operating circulating water pump were to trip and not be isolated successfully, the
loss of flow from the pump plus the reverse flow through the affected pump train
is equivalent to loss of flow from two pumps. This means that loss of one
additional pump is all that is necessary to reduce circulating water flow to the
point it can no longer support full power operation. The logic under Gate
CWS-TOP-FF reflects this criterion. Note that the gate has as input the logic for only
the four operating trains. As the two standby trains are not in service, they cannot
contribute to the loss of circulating water flow other than to fail to start and run
in response to loss of one of the other operating trains.
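One way to read the success criterion is as a count of pump equivalents across the six trains. The sketch below is an interpretation under stated assumptions, not the report's model: the state names are hypothetical, a train that is running (or a standby train that starts) counts as +1, a failed or unstarted train counts as 0, and a tripped train that is not isolated counts as -1 because of reverse flow.

```python
from enum import Enum


class TrainState(Enum):
    """Hypothetical train states, valued in pump equivalents of flow."""
    AVAILABLE = 1       # running, or a standby train started successfully
    UNAVAILABLE = 0     # failed and isolated, or standby not started
    REVERSE_FLOW = -1   # tripped and not isolated (diverts one pump's flow)


REQUIRED_PUMP_EQUIVALENTS = 4  # assumed full power requirement from the text


def net_flow(states):
    """Sum the pump-equivalent contribution of every train."""
    return sum(state.value for state in states)


def supports_full_power(states):
    """True if net flow meets the four-pump-equivalent criterion."""
    return net_flow(states) >= REQUIRED_PUMP_EQUIVALENTS


A, U, R = TrainState.AVAILABLE, TrainState.UNAVAILABLE, TrainState.REVERSE_FLOW
# One unisolated trip with both standby trains started leaves 5 - 1 = 4.
one_unisolated_trip = [R, A, A, A, A, A]
# One further loss drops the net flow to 3, below the criterion.
trip_plus_one_loss = [R, U, A, A, A, A]
```

With this accounting, three ordinary train failures or one unisolated trip plus one further failure both fall below four pump equivalents.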
The two sets of top logic are combined and the logic reduced to identify the
combinations of failures (cut sets) that will result in the circulating water system
not being able to support full power operation.
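The reduction of combined logic to cut sets can be sketched with sets of basic events. This is a generic illustration of gate expansion and absorption, not the tool or algorithm used for the report; the helper names are hypothetical and the small example tree is invented for illustration, reusing event names from Table D-1.

```python
from itertools import product


def basic(name):
    """A basic event is a single cut set containing one event."""
    return [frozenset([name])]


def or_gate(*inputs):
    """OR: any cut set of any input is a cut set of the gate."""
    return [cut_set for inp in inputs for cut_set in inp]


def and_gate(*inputs):
    """AND: take one cut set from each input and merge them."""
    return [frozenset().union(*combo) for combo in product(*inputs)]


def minimize(cut_sets):
    """Drop duplicates and any cut set that contains a smaller cut set."""
    minimal = []
    for candidate in sorted(set(cut_sets), key=len):
        if not any(kept <= candidate for kept in minimal):
            minimal.append(candidate)
    return minimal


# Illustrative tree only: the triple is absorbed by the pair it contains.
example_top = or_gate(
    and_gate(basic("CWS-PMOA-OPENMOV"), basic("CWS-PMFR-P1")),
    and_gate(basic("CWS-PMOA-OPENMOV"), basic("CWS-PMFR-P1"),
             basic("CWS-CBCO-CB-02")),
)
example_minimal = minimize(example_top)
```

Sorting by size before filtering means every kept cut set is checked only against smaller or equal ones, which is sufficient for absorption.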
As flow from the equivalent of four trains of circulating water is needed, it is to
be expected that the bulk of the combinations consist of three failures (i.e., pumps
fail to run, breakers fail to remain closed, discharge MOVs fail to remain open in
combinations of three). A number of these cut sets are shown in Table D-1.
However, it can be seen that there are approximately twenty cut sets that consist of
only pairs of failures. Many of these twenty pairs include components from the
plant control system.
The first eight pairs of failures in Table D-1 contain only communication module
failures. These pairs of failures come from the spurious actuation top logic (Gate
CWS-TOP-SS). Total loss of communications for an entire division of
circulating water can occur if a communication module in each of the two
communications loops in that division were to fail. This leads to no input to the
digital output modules for that division. Under these conditions, the discharge
isolation valves for all three pumps in the affected division close leaving only the
three pumps in the unaffected division. As the plant is assumed to require four
circulating water pumps to support full power operation, loss of the pairs of
communications modules results in insufficient circulating water pump flow.
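The eight communication-module pairs follow from taking one module from each of the two loops within a division. The sketch below enumerates them using the event names that appear in Table D-1; the function name and the grouping of modules into loops are assumptions made for illustration.

```python
from itertools import product


def division_comm_pairs(division):
    """Pairs of module failures, one per loop, that disable both loops."""
    loop1 = [f"CWS-CMFF-I/O{division}-COMM1", f"CWS-CMFF-LC{division}-COMM1"]
    loop2 = [f"CWS-CMFF-I/O{division}-COMM2", f"CWS-CMFF-LC{division}-COMM2"]
    # Any one module from loop 2 combined with any one from loop 1.
    return [frozenset(pair) for pair in product(loop2, loop1)]


comm_pairs = division_comm_pairs("A") + division_comm_pairs("B")
```

Each division contributes four pairs, giving the eight pairs listed at the top of Table D-1; no pair mixes modules from the two divisions.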
Four of the remaining cut sets consisting of pairs of failures include a digital
output module failure combined with failure of the operators to initiate the
standby trains of circulating water in time to avoid a low condenser vacuum trip.
These failures also come from the spurious actuation top logic. Loss of a single
digital output module results in a false isolation signal to the discharge isolation
MOV in the affected pump train. As only three pumps are now providing
circulating water flow, starting of one of the standby trains is required. Failure of
the operators to initiate one of the standby trains in time results in the circulating
water flow not being able to support full power operation.
Other plant control system components appear with hardware and I&C failures in
combinations of three or more. These components include digital input modules,
the master controller, slave controller and operator workstations. That these
components require multiple additional failures before they can lead to conditions
in which the plant cannot operate at full power reflects the fact that there are two
spare circulating water pump trains and the operators can initiate the standby
trains to mitigate loss of these components.
Table D-1
Combinations of Failures (Cut Sets) Leading to Loss of Circulating Water
CWS-CMFF-I/OA-COMM2 CWS-CMFF-I/OA-COMM1
CWS-CMFF-I/OA-COMM2 CWS-CMFF-LCA-COMM1
CWS-CMFF-I/OB-COMM2 CWS-CMFF-I/OB-COMM1
CWS-CMFF-I/OB-COMM2 CWS-CMFF-LCB-COMM1
CWS-CMFF-LCA-COMM2 CWS-CMFF-I/OA-COMM1
CWS-CMFF-LCA-COMM2 CWS-CMFF-LCA-COMM1
CWS-CMFF-LCB-COMM2 CWS-CMFF-I/OB-COMM1
CWS-CMFF-LCB-COMM2 CWS-CMFF-LCB-COMM1
(Eight pairs above: failure of pairs of communication units)
CWS-PMOA-OPENMOV CWS-CBCO-CB-01
CWS-PMOA-OPENMOV CWS-CBCO-CB-02
CWS-PMOA-OPENMOV CWS-CBCO-CB-05
CWS-PMOA-OPENMOV CWS-CBCO-CB-06
CWS-PMOA-OPENMOV CWS-IOSS-I/OA-D01
CWS-PMOA-OPENMOV CWS-IOSS-I/OA-D02
CWS-PMOA-OPENMOV CWS-IOSS-I/OB-D02
CWS-PMOA-OPENMOV CWS-IOSS-I/OB-D03
(Four pairs above: digital output device failure in combination with failure of operator action to start a standby pump)
CWS-PMOA-OPENMOV CWS-MVOC-MO-01
CWS-PMOA-OPENMOV CWS-MVOC-MO-02
CWS-PMOA-OPENMOV CWS-MVOC-MO-05
CWS-PMOA-OPENMOV CWS-MVOC-MO-06
CWS-PMOA-OPENMOV CWS-PMFR-P1
CWS-PMOA-OPENMOV CWS-PMFR-P2
CWS-PMOA-OPENMOV CWS-PMFR-P5
CWS-PMOA-OPENMOV CWS-PMFR-P6
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-CBCO-CB-01
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-CBCO-CB-06
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-IOSS-I/OB-D03
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-MVOC-MO-06
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-PMFR-P1
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-PMFR-P6
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-CBCO-CB-01
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-CBCO-CB-04
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-CBCO-CB-06
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-CBOC-CB-04
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-IOFF-I/OA-D04
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-IOSS-I/OB-D01
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-IOSS-I/OB-D03
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-MVOC-MO-04
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-MVOC-MO-06
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-PMFR-P1
[Figure D-1a: fault tree. Top event: Loss of CWS flow due to flow diversion through pump train 1 (CWS-TR1-FLDIV). Events shown include: CWS Pump 1 trip; failure to isolate CWS Pump 1; CWS Pump 1 fails to run; circuit breaker for CWS Pump 1 fails to remain closed; CWS Pump 1 discharge valve fails to close; PCS does not automatically isolate CWS Pump 1; failure of circulating water pump trains 2 through 6. Gate labels shown: G002, G003, G004, G013.]
Figure D-1
a) Top down logic for response to trip of an operating circulating water pump
[Figure D-1b: fault tree for Gate G010, PCS does not automatically isolate CWS Pump 1. Events shown include: digital output device D01 fails to provide output signal; digital input device DI1 fails to sense Pump 1 tripped; loss of Division A communication networks (network 1 and network 2, each with I/O Cabinet A and Logic Cabinet A communication module functional failures); logic controllers fail to isolate an idle pump; master controller fails to function; slave controller fails to function; communications failure between slave controller and Div 1 components; watchdog timer fails to transfer control. Gate labels shown: G047, G049.]
[Figure D-1c: fault tree for Gate G013, Failure to start standby circ water pump. Gate labels shown beneath it: G087, G090.]
[Figure D-2a: fault tree. Top event: Failure of circulating water pump train 3 (CWS-3). Events shown include: pump train 3 fails to run; pump train 3 discharge valve spuriously closes; failure to open pump train 3 discharge valve; CWS Pump 3 fails to run; circuit breaker for CWS Pump 3 fails to remain closed; CWS Pump 3 discharge valve fails to remain open; spurious signal to close pump train 3 discharge valve; CWS Pump 3 start failures; CWS Pump 3 is in standby. Labels shown: G002, G013, CWS-PMFS-P3, CWS-CBOC-CB-03.]
Figure D-2
a) Top down logic for loss of circulating water system due to spurious trips
[Figure D-2b: fault tree for Gate G009-3, Spurious signal to close pump train 3 discharge valve. Events shown include: I/O Cabinet A digital output module 3 spurious signal; circuit breaker for CWS Pump 3 fails to remain closed; failure of Division A communications networks. Gate labels shown: G016-A, G017-A.]
[Figure D-3: fault tree. Top event: Loss of circ water system (CWS-TOP). Inputs: loss of the circ water system due to failure to isolate a tripped pump (CWS-TOP-FF), fed by loss of CWS flow due to flow diversion through pump trains 1, 2, 5 and 6; and loss of the circ water system due to spurious actuations (CWS-TOP-SS), fed by failure of circulating water pump trains 1, 2, 4 and 5 together with CWS-3 and CWS-6.]
Figure D-3
Top down logic for loss of circulating water system
[Figure D-4: schematic of the circulating water system and plant control system within the analysis boundary. Shown: Logic Cabinets A and B, each with COMM 1 and COMM 2 modules; master and slave controllers (each controller is programmed to control all six valves); digital input and output modules DI1 to DI3 and DO1 to DO3 in each division; 4 kV supply; PUMP-1 through PUMP-6 with motor-operated valves; three condensers; two cooling towers. Normal operation: two valves open in each basin. Highlighted contributors: pairs of communications modules, and digital output devices (in combination with action to start standby pump).]
Figure D-4
Potential dominant contributors to circulating water system failure
Export Control Restrictions
Access to and use of EPRI Intellectual Property is granted with the specific understanding and requirement that responsibility for ensuring full compliance with all applicable U.S. and foreign export laws and regulations is being undertaken by you and your company. This includes an obligation to ensure that any individual receiving access hereunder who is not a U.S. citizen or permanent U.S. resident is permitted access under applicable U.S. and foreign export laws and regulations. In the event you are uncertain whether you or your company may lawfully obtain access to this EPRI Intellectual Property, you acknowledge that it is your obligation to consult with your company’s legal counsel to determine whether this access is lawful. Although EPRI may make available on a case-by-case basis an informal assessment of the applicable U.S. export classification for specific EPRI Intellectual Property, you and your company acknowledge that this assessment is solely for informational purposes and not for reliance purposes. You and your company acknowledge that it is still the obligation of you and your company to make your own assessment of the applicable U.S. export classification and ensure compliance accordingly. You and your company understand and acknowledge your obligations to make a prompt report to EPRI and the appropriate authorities regarding any access to or use of EPRI Intellectual Property hereunder that may be in violation of applicable U.S. or foreign export laws or regulations.
The Electric Power Research Institute, Inc. (EPRI, www.epri.com) conducts research and development relating to the generation, delivery and use of electricity for the benefit of the public. An independent, nonprofit organization, EPRI brings together its scientists and engineers as well as experts from academia and industry to help address challenges in electricity, including reliability, efficiency, affordability, health, safety and the environment. EPRI also provides technology, policy and economic analyses to drive long-range research and development planning, and supports research in emerging technologies. EPRI’s members represent approximately 90 percent of the electricity generated and delivered in the United States, and international participation extends to more than 30 countries. EPRI’s principal offices and laboratories are located in Palo Alto, Calif.; Charlotte, N.C.; Knoxville, Tenn.; and Lenox, Mass.
Together...Shaping the Future of Electricity
Program:
Instrumentation and Control
© 2013 Electric Power Research Institute (EPRI), Inc. All rights reserved. Electric Power
Research Institute, EPRI, and TOGETHER...SHAPING THE FUTURE OF ELECTRICITY are
registered service marks of the Electric Power Research Institute, Inc.
3002000509