
2013 TECHNICAL REPORT

Hazard Analysis Methods for Digital Instrumentation and Control Systems

[Cover graphic: relative coverage of FFMEA, DFMEA, FTA, HAZOP, STPA, and PGA, spanning anticipated failure modes to unexpected behaviors]


Hazard Analysis Methods for
Digital Instrumentation and
Control Systems

This document does NOT meet the requirements of 10CFR50 Appendix B, 10CFR Part 21, ANSI N45.2-1977 and/or the intent of ISO-9001 (1994).

EPRI Project Manager


R. Torok

3420 Hillview Avenue


Palo Alto, CA 94304-1338
USA

PO Box 10412
Palo Alto, CA 94303-0813
USA

800.313.3774
650.855.2121

askepri@epri.com
www.epri.com

3002000509
Final Report, June 2013

DISCLAIMER OF WARRANTIES AND LIMITATION OF LIABILITIES

THIS DOCUMENT WAS PREPARED BY THE ORGANIZATION(S) NAMED BELOW AS AN ACCOUNT OF


WORK SPONSORED OR COSPONSORED BY THE ELECTRIC POWER RESEARCH INSTITUTE, INC. (EPRI).
NEITHER EPRI, ANY MEMBER OF EPRI, ANY COSPONSOR, THE ORGANIZATION(S) BELOW, NOR ANY
PERSON ACTING ON BEHALF OF ANY OF THEM:

(A) MAKES ANY WARRANTY OR REPRESENTATION WHATSOEVER, EXPRESS OR IMPLIED, (I) WITH
RESPECT TO THE USE OF ANY INFORMATION, APPARATUS, METHOD, PROCESS, OR SIMILAR ITEM
DISCLOSED IN THIS DOCUMENT, INCLUDING MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE, OR (II) THAT SUCH USE DOES NOT INFRINGE ON OR INTERFERE WITH PRIVATELY OWNED
RIGHTS, INCLUDING ANY PARTY'S INTELLECTUAL PROPERTY, OR (III) THAT THIS DOCUMENT IS SUITABLE
TO ANY PARTICULAR USER'S CIRCUMSTANCE; OR

(B) ASSUMES RESPONSIBILITY FOR ANY DAMAGES OR OTHER LIABILITY WHATSOEVER (INCLUDING ANY
CONSEQUENTIAL DAMAGES, EVEN IF EPRI OR ANY EPRI REPRESENTATIVE HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES) RESULTING FROM YOUR SELECTION OR USE OF THIS DOCUMENT OR
ANY INFORMATION, APPARATUS, METHOD, PROCESS, OR SIMILAR ITEM DISCLOSED IN THIS
DOCUMENT.

REFERENCE HEREIN TO ANY SPECIFIC COMMERCIAL PRODUCT, PROCESS, OR SERVICE BY ITS TRADE
NAME, TRADEMARK, MANUFACTURER, OR OTHERWISE, DOES NOT NECESSARILY CONSTITUTE OR
IMPLY ITS ENDORSEMENT, RECOMMENDATION, OR FAVORING BY EPRI.

THE FOLLOWING ORGANIZATIONS, UNDER CONTRACT TO EPRI, PREPARED THIS REPORT:

Southern Engineering Services, Inc.

Applied Reliability Engineering, Inc.

Électricité de France

THE TECHNICAL CONTENTS OF THIS DOCUMENT WERE NOT PREPARED IN ACCORDANCE WITH THE
EPRI NUCLEAR QUALITY ASSURANCE PROGRAM MANUAL THAT FULFILLS THE REQUIREMENTS OF 10 CFR
50, APPENDIX B AND 10 CFR PART 21, ANSI N45.2-1977 AND/OR THE INTENT OF ISO-9001 (1994).
USE OF THE CONTENTS OF THIS DOCUMENT IN NUCLEAR SAFETY OR NUCLEAR QUALITY
APPLICATIONS REQUIRES ADDITIONAL ACTIONS BY USER PURSUANT TO THEIR INTERNAL PROCEDURES.

NOTE

For further information about EPRI, call the EPRI Customer Assistance Center at 800.313.3774 or
e-mail askepri@epri.com.

Electric Power Research Institute, EPRI, and TOGETHER…SHAPING THE FUTURE OF ELECTRICITY are
registered service marks of the Electric Power Research Institute, Inc.

Copyright © 2013 Electric Power Research Institute, Inc. All rights reserved.
Acknowledgments

The following organizations, under contract to the Electric Power Research Institute (EPRI), prepared this report:

Southern Engineering Services, Inc.


331 Allendale Drive
Canton, GA 30115

Principal Investigators
B. Geddes
M. Bailey
L. Freil
J. Thomas
B. Antoine
N. Geddes

Applied Reliability Engineering, Inc.


1478 27th Avenue
San Francisco, CA 94122

Principal Investigator
D. Blanchard

Électricité de France
6, Quai Watier
78400 Chatou, France

Principal Investigator
N. Thuy

This report describes research sponsored by EPRI.

This publication is a corporate document that should be cited in the literature in the following manner:

Hazard Analysis Methods for Digital Instrumentation and Control Systems. EPRI, Palo Alto, CA: 2013. 3002000509.
The project team wishes to acknowledge the technical advisory group
on this project for their support in developing the technical approach
and examples described in the report:

R. Hopkins Rolls Royce, PLC


Paul Collinge Rolls Royce, PLC
Philip Mertens Rolls Royce, PLC
David Curtis Rolls Royce, PLC (Retired)
N. Thuy EDF
John Carey PSEG Nuclear, LLC (Retired)
Ron Jarrett Tennessee Valley Authority
J.C. Williams Exelon
James Pritchett Progress Energy
Eunchan Lee KHNP
Philip Brady PPL
David Woods PPL
James Snyder Constellation

Product Description

This report documents an investigation of the use of various hazard
and failure analysis methods to reveal potential vulnerabilities in
digital instrumentation and control (I&C) systems before they are
put into operation. The report looks at six approaches, ranging from
well-established practices to novel methods still transitioning from
academic demonstrations to practical, realistic applications. It
includes step-by-step procedures and worked examples, applying
each of the methods to sample problems based on actual cases to
assess the methods for effectiveness, range of applicability and
practicality of use by nuclear plant engineers and their suppliers.

Background
The lack of established practical methods for evaluating and
managing potential failure modes and mechanisms of digital I&C
systems is adversely affecting design, risk assessment, and licensing
efforts involving digital equipment. Results include undesired and
costly plant transients, significant increases in system costs and
complexity without commensurate safety benefits, and difficulty in
obtaining regulatory acceptance of designs that can improve
dependability and reduce overall risk. Traditional failure analysis
methods developed for hardware-based systems, primarily failure
modes and effects analysis (FMEA), are proving less effective and
more costly than desired. This report extends the earlier project
results documented in the 2011 EPRI report 1022985, Failure
Analysis of Digital Instrumentation and Control Equipment and Systems
– Demonstration of Concept. Most significantly, it offers improved
procedures, more examples, and more detailed discussion of the
methods, including new approaches that combine methods to help
improve the effectiveness and efficiency of the analysis.

Objectives
The research was intended to identify hazard and failure analysis
methods that could be applied to improve current practices,
demonstrate their potential effectiveness using realistic nuclear plant
examples, and develop a practical methodology for use by utility
engineers and their suppliers.

Approach
Building on the 2011 demonstration of concept results, the project
team developed additional worked examples of varying complexity to
better understand the strengths, weaknesses, and applicability of the
approaches. They looked at six methods: functional FMEA, design FMEA, a top-down method using fault tree analysis (FTA),
HAZard and OPerability (HAZOP) analysis, systems theoretic
process analysis (STPA), and purpose graph analysis (PGA). Based
on lessons learned from the examples, the project team developed
step-by-step procedures for each of the methods. The notion of
potential hybrid or blended methods that combine top-down and
bottom-up approaches to improve efficiency and effectiveness was
given additional attention, with identification of logical transfer
points from one method to another. The digital failure analysis
taxonomy that was started in 2011 was expanded to include
additional devices. Utility engineers and technical experts from the
project team reviewed the results and provided feedback that was
subsequently incorporated.

Results
For each of the approaches studied, the report contains a detailed
description of the method, a step-by-step procedure, worked
examples, and a discussion of the method’s strengths and weaknesses.
Some methods focus on causes and effects of component failures.
Others also consider undesired behaviors that do not involve
component failures. This is particularly important for complex digital
systems because a significant percentage of mishaps involve undesired
behaviors that occur under unanticipated or untested operating
conditions, but with all components operating as designed. The
report also discusses the steps involved in planning hazard analysis
activities in the context of a plant modification effort.

Applications, Value, and Use


The procedures and example problems in the report and the
taxonomy examples will help utility engineers involved in
implementing digital systems. Improved hazard analysis methods
may also be the key to resolving regulatory uncertainty regarding
understanding and managing software and digital system failure
modes. However, it appears that the most effective methods are also
the most difficult to apply and are new to the utility industry. Before
these methods are widely used, more work will be needed in the form
of demonstration projects, training, and industry workshops.

Keywords
Digital instrumentation and control
Failure analysis
Failure modes and effects analysis (FMEA)
Fault tree analysis
Hazard analysis
Software hazard analysis

Table of Contents

Section 1: Introduction ............................................1-1
1.1 Background ............................................................... 1-1
1.2 Purpose/Objectives .................................................... 1-1
1.3 Scope ....................................................................... 1-2
1.4 Key Concepts ............................................................ 1-4
1.5 Why New Guidance? ................................................ 1-7
1.6 Relationship to Other Work ......................................... 1-9
1.7 Contents of this Guideline ......................................... 1-11
1.8 How to Use this Guideline......................................... 1-13

Section 2: Definitions ..............................................2-1
Abbreviations & Acronyms ................................................ 2-5

Section 3: Planning Hazard Analysis Activities .........3-1
3.1 Determine Scope & Objectives .................................... 3-1
3.2 Identify the Level(s) of Interest ...................................... 3-2
3.3 Determine Appropriate Method(s) ................................ 3-3
3.4 Consider a Blended Approach .................................. 3-15
3.5 Determine Resources & Schedule ............................... 3-20
3.6 Function Analysis ..................................................... 3-22
3.7 Preliminary Hazard Analysis (PHA) ............................ 3-25
3.8 Hazard Analysis Acceptance, Documentation & Maintenance ....................... 3-27

Section 4: Failure Modes and Effects Analysis (FMEA) Methods ......................................4-1
4.1 FMEA Overview ........................................................ 4-1
4.2 Functional FMEA (FFMEA) Procedure............................ 4-2
4.3 Functional FMEA (FFMEA) Example ............................ 4-10
4.4 Design FMEA (DFMEA) Procedure.............................. 4-20
4.5 Design FMEA Examples ............................................ 4-27
4.6 Applying the FMEA Results ........................................ 4-54
4.7 FMEA Strengths ....................................................... 4-67
4.8 FMEA Limitations ..................................................... 4-67

Section 5: Top Down Method Using Fault Tree Analysis (FTA) Techniques ........................5-1
5.1 Top Down Method Overview and Objectives Using Fault Tree Techniques ....................... 5-2
5.2 Procedure for Top Down Method Using Fault Tree Techniques ..................................... 5-3
5.3 Applying the Top Down Results .................................. 5-29
5.4 Top Down Examples ................................................. 5-30
5.5 Top Down Strengths ................................................. 5-52
5.6 Top Down Limitations ............................................... 5-52

Section 6: Hazard & Operability Analysis (HAZOP) Method .....................................6-1
6.1 HAZOP Overview and Objectives ............................... 6-1
6.2 HAZOP Procedure ..................................................... 6-4
6.3 Applying the HAZOP Results ..................................... 6-11
6.4 HAZOP Example...................................................... 6-12
6.5 HAZOP Strengths ..................................................... 6-16
6.6 HAZOP Limitations ................................................... 6-16

Section 7: Systems Theoretic Process Analysis (STPA) Method.........................................7-1
7.1 STPA Overview and Objectives ................................... 7-1
7.2 STPA Procedure ......................................................... 7-6
7.3 Applying the STPA Results ......................................... 7-16
7.4 STPA Examples ........................................................ 7-18
7.5 STPA Strengths......................................................... 7-40
7.6 STPA Limitations ....................................................... 7-40
7.7 Future Developments in STPA ..................................... 7-41

Section 8: Purpose Graph Analysis (PGA) Method ....8-1
8.1 PGA Overview and Objectives .................................... 8-2
8.2 PGA Procedure .......................................................... 8-6
8.3 Applying the PGA Results.......................................... 8-26
8.4 PGA Examples......................................................... 8-29
8.5 PGA Strengths ......................................................... 8-66
8.6 PGA Limitations ....................................................... 8-67

Section 9: Conclusions & Recommendations .............9-1
9.1 Conclusions ............................................................... 9-1
9.2 Recommendations ...................................................... 9-5

Section 10: References ............................................10-1

Appendix A: Overview of Available Guidance ......... A-1
Purpose.......................................................................... A-1
Assessment Summary....................................................... A-1
Recommendations ........................................................... A-2

Appendix B: Taxonomy of Failure Modes, Failure Mechanisms, Faults, and Defensive Measures ................................................B-1
Purpose........................................................................... B-1
Typical Digital Devices and Components ............................ B-1
Hierarchy of Failure Mechanisms, Modes, and Effects .......... B-2
How to Read the Taxonomy Sheets .................................... B-3
How to Use the Taxonomy ................................................ B-4
Sheet B-1a: Central Processor Device Failure Modes ............ B-8
Sheet B-1b: Central Processor Device Description .............. B-10
Sheet B-2a: RAM Device Failure Modes ............................ B-14
Sheet B-2b: RAM Device Description ................................ B-16
Sheet B-3a: ROM Device Failure Modes ........................... B-18
Sheet B-3b: ROM Device Description................................ B-20
Sheet B-4a: A/D Converter Device Failure Modes.............. B-22
Sheet B-4b: A/D Converter Device Description .................. B-24
Sheet B-5a: D/A Converter Device Failure Modes.............. B-26
Sheet B-5b: D/A Converter Device Description .................. B-27
Sheet B-6a: Type 1 Controller Component Failure Modes ... B-29
Sheet B-6b: Type 1 Controller Component Description........ B-31
Sheet B-7a: Type 2 Controller Component Failure Modes ... B-32
Sheet B-7b: Type 2 Controller Component Description........ B-33
Sheet B-8a: Data Communication Component Failure Modes .......................... B-34
Sheet B-8b: Data Communication Component Description ..................... B-36
Sheet B-9a: Level 1 (Binaries) Interactions & Faults ............. B-41
Sheet B-9b: Level 1 (Binaries) Description.......................... B-43
Sheet B-10a: Level 2 (Tools) Interactions & Faults ............... B-45
Sheet B-10b: Level 2 (Tools) Description............................ B-46
Sheet B-11a: Level 3 (Application & OS Source Codes) Interactions & Faults ............... B-48
Sheet B-11b: Level 3 (Application & OS Source Codes) Description ........................ B-51
Sheet B-12a: Level 4 (System Architecture) Interactions & Faults .......................... B-53
Sheet B-12b: Level 4 (System Architecture) Description ....... B-57

Appendix C: Physical and Functional Representations ...................................... C-1

Appendix D: Circulating Water System (CWS) Top Down Analysis ....................................... D-1

List of Figures

Figure 3-1 A Hierarchical View ............................................... 3-3


Figure 3-2 FMEA Methods at Various Levels of Interest............... 3-7
Figure 3-3 Top Down (FTA) Method at Various Levels of
Interest ............................................................................ 3-8
Figure 3-4 STPA at Various Levels of Interest ............................. 3-9
Figure 3-5 Relative Coverage of Methods in the Context of
Depth of Analysis ........................................................... 3-10
Figure 3-6 Relative Usefulness of Methods in the Context of
System Lifecycle Phases................................................... 3-12
Figure 3-7 Relative Coverage of Methods in the Context of
System Behaviors ........................................................... 3-13
Figure 3-8 Relative Familiarity of Methods in the Context of
Various Users ................................................................ 3-14
Figure 3-9 Blending Functional FMEA (FFMEA) or FTA
Results with a Design FMEA (DFMEA) ............................... 3-16
Figure 3-10 Blending a Digital Platform FMEA with a Digital
System FMEA ................................................................ 3-19
Figure 3-11 Blending Functional FMEA or FTA Results with
STPA ............................................................................ 3-20
Figure 4-1 Generic BWR Function/Process Map (Sheet 1 of
3) ................................................................................... 4-3
Figure 4-2 HPCI/RCIC System Diagram ................................. 4-15
Figure 4-3 High Pressure Injection Function/Process Map......... 4-16
Figure 4-4 Multi-Divisional System with Complete,
Independent Redundancy ................................................ 4-24
Figure 4-5 Redundancy Boundary for a Master/Slave
Architecture ................................................................... 4-25
Figure 4-6 HPCI/RCIC Turbine Control System Block
Diagram ....................................................................... 4-33

Figure 4-7 Circulating Water System DCS Segment ................. 4-44
Figure 4-8 CWS MOV Control Circuit & Logic ........................ 4-45
Figure 4-9 Failure Mode Tree Using FMEA Results as an
Input ............................................................................. 4-66
Figure 5-1 BWR Safety Functions (Top Down) ........................... 5-7
Figure 5-2 PWR Safety Functions (Top Down) ......................... 5-10
Figure 5-3 BWR Generation Functions (Top Down) .................. 5-16
Figure 5-4 PWR Generation Functions (Top Down) .................. 5-18
Figure 6-1 BWR Balance of Plant ............................................ 6-5
Figure 6-2 BWR Trip Sequence of Events after LOOP ................. 6-9
Figure 7-1 A Classification of Control Flaws Leading to
Hazards.......................................................................... 7-4
Figure 7-2 Accidents, Hazards, Unsafe Control Actions &
Control Flaws .................................................................. 7-6
Figure 7-3 Basic Control Structure ........................................... 7-9
Figure 7-4 Basic Control Structure with Human Operator ......... 7-10
Figure 7-5 Control Actions, Process Model Variables (PMVs)
and PMV States ............................................................. 7-12
Figure 7-6 Structure of a Hazardous Control Action ................. 7-13
Figure 7-7 HPCI-RCIC Flow Control System (System Level) ........ 7-23
Figure 7-8 System-Level HPCI-RCIC Flow Control Structure ........ 7-24
Figure 7-9 System-Level HPCI-RCIC Process Models ................. 7-25
Figure 7-10 HPCI-RCIC Flow Control System (Component
Level) ............................................................................ 7-34
Figure 7-11 Component-Level HPCI-RCIC Flow Control
Structure........................................................................ 7-35
Figure 7-12 Component-Level HPCI-RCIC Process Models ......... 7-36
Figure 8-1 BWR Main Steam Pressure Switches and MSIV
Closure Logic................................................................... 8-7
Figure 8-2 State Graph with a Low Level Sub-State .................... 8-8
Figure 8-3 Main Steam Sub-State ............................................ 8-8
Figure 8-4 Notional Top Level State Graph for a BWR ............. 8-10
Figure 8-5 Top Level Process Graph for a BWR ....................... 8-12

Figure 8-6 Alternative Processes in a Process Graph ................ 8-13
Figure 8-7 Layered Goals and Processes in a Process Graph .... 8-13
Figure 8-8 Checking for State and Goal Associations in the
Purpose Graph .............................................................. 8-14
Figure 8-9 Notional Top Level Process Graph for a BWR ......... 8-15
Figure 8-10 Notional Top-Level BWR Purpose Graph............... 8-18
Figure 8-11 HPCI State Graph .............................................. 8-32
Figure 8-12 HPCI Process Graph ........................................... 8-37
Figure 8-13 HPCI Purpose Graph .......................................... 8-42
Figure 8-14 One of the Indirect Goal Interactions in the
HPCI System .................................................................. 8-44
Figure 8-15 CWS State Graph.............................................. 8-49
Figure 8-16 CWS Process Graph .......................................... 8-54
Figure 8-17 CWS Purpose Graph ......................................... 8-61
Figure B-1 A Hierarchy of Failure Mechanisms, Modes and
Effects ............................................................................. B-2
Figure B-2 Linking a Taxonomy Sheet to an FMEA
Worksheet ...................................................................... B-5
Figure B-3 Linkage between Taxonomy Sheets .......................... B-6
Figure B-4 Hierarchy of Software Interactions & Faults ............. B-39
Figure C-1 Relay Contact Symbol ........................................... C-3
Figure D-1 a) Top down logic for response to trip of an
operating circulating water pump ..................................... D-7
Figure D-2 a) Top down logic for loss of circulating water
system due to spurious trips ............................................ D-10
Figure D-3 Top down logic for loss of circulating water
system ......................................................................... D-12
Figure D-4 Potential dominant contributors to circulating
water system failure ....................................................... D-13

List of Tables

Table 1-1 Summary of Guideline Contents.............................. 1-10


Table 3-1 Comparative Scope of Hazard Analysis Methods
and their Identified Hazard Characteristics ......................... 3-5
Table 3-2 Blending the Top Down (FTA) Method with Other
Hazard Analysis Methods ............................................... 3-17
Table 3-3 Project Phases vs. Analysis Milestones ..................... 3-21
Table 4-1 Sample Functional FMEA Worksheet ......................... 4-6
Table 4-2 HPCI/RCIC Flow Control System Functional FMEA
Worksheets ................................................................... 4-17
Table 4-3 Sample Design FMEA Worksheet ........................... 4-26
Table 4-4 Principal HPCI/RCIC Turbine Control Components
and Functions ................................................................ 4-32
Table 4-5 HPCI/RCIC Governor Design FMEA Worksheet ....... 4-34
Table 4-6 HPCI/RCIC Positioner Design FMEA Worksheet ....... 4-37
Table 4-7 Principal CWS Components and Functions .............. 4-46
Table 4-8 CWS I/O Cabinet A FMEA Worksheets .................. 4-47
Table 4-9 CWS Logic Cabinet A FMEA Worksheets ................ 4-50
Table 4-10 CWS HSI Workstation FMEA Worksheets.............. 4-52
Table 5-1 Frontline Functions/Systems for Nuclear Safety at
the Plant Level .................................................................. 5-5
Table 5-2 Frontline Functions/Systems for Generation at the
Plant Level ..................................................................... 5-14
Table 5-3 Format for Capturing Component Failure Mode
Information from the PRA ................................................ 5-20
Table 5-4 Supporting Functions/Systems for Generation at
the Plant Level ................................................................ 5-23
Table 5-5 Formatting the Basis for Selection of Digital
System Failure Modes ..................................................... 5-27

Table 5-6 HPCI & RCIC Components Controlled by I&C
Equipment (Safety Functions) ........................................... 5-35
Table 5-7 HPCI and RCIC Digital System Failure Modes .......... 5-38
Table 5-8 HPCI/RCIC Generation Functions ........................... 5-39
Table 5-10 CWS Components Controlled by I&C Equipment
(Safety & Generation) ..................................................... 5-49
Table 5-11 CWS Component vs. Digital System Failure
Modes .......................................................................... 5-51
Table 6-1 Sample HAZOP Worksheet ...................................... 6-3
Table 6-2 HAZOP Guide Words ............................................. 6-7
Table 6-3 CWS Controls HAZOP Worksheet .......................... 6-15
Table 7-1 Suggested Process Model Format ............................ 7-11
Table 7-2 Combining Control Actions with Affected Process
Models ......................................................................... 7-13
Table 7-3 Sample STPA Worksheet ....................................... 7-14
Table 7-4 HPCI-RCIC Turbine Controls: System-Level
Hazards vs. Accidents or Losses ...................................... 7-22
Table 7-5 Select HPCI-RCIC Flow Control Actions .................... 7-26
Table 7-6 Excerpt of STPA Results for Control Action 3 ............ 7-27
Table 7-7 Excerpt from List of HPCI-RCIC Hazardous Control
Actions ......................................................................... 7-28
Table 7-8 Potential Causes of Hazardous Control Action
No. 7 ........................................................................... 7-29
Table 7-9 Excerpt of STPA Results for Control Action 5 ............ 7-37
Table 7-10 Potential Causes of HCA 1................................... 7-38
Table 8-1 Ten Characteristics Evaluated in PGA Basic Step
2 .................................................................................... 8-5
Table 8-2 Sample PGA Preliminary Observables Table .............. 8-7
Table 8-3 Top-Level BWR State and Event Table (Partial) .......... 8-11
Table 8-4 Top-Level BWR Goal Table (Partial) ......................... 8-16
Table 8-5 Top-Level BWR Process Table (Partial)...................... 8-17
Table 8-6 Top-Level BWR State Analysis Table ........................ 8-20
Table 8-7 Top Level BWR Goal Analysis Table........................ 8-22
Table 8-8 Top Level BWR Process Interaction Table (Partial) ..... 8-25

Table 8-9 Alternatives for Mitigating Information
Degradation .................................................................. 8-28
Table 8-10 HPCI Observables............................................... 8-33
Table 8-11 HPCI States & Events ........................................... 8-33
Table 8-12 HPCI Goals ........................................................ 8-38
Table 8-13 HPCI Processes ................................................... 8-40
Table 8-14 HPCI State & Events Analysis Results ..................... 8-43
Table 8-15 HPCI Goal Interactions ........................................ 8-44
Table 8-16 HPCI Process Interactions ..................................... 8-45
Table 8-17 CWS Observables .............................................. 8-50
Table 8-18 CWS States & Events .......................................... 8-51
Table 8-19 CWS Goals ....................................................... 8-55
Table 8-20 CWS Processes .................................................. 8-58
Table 8-21 CWS State & Events Analysis Results ..................... 8-62
Table 8-22 CWS Goal Interactions........................................ 8-64
Table 8-23 CWS Process Interactions .................................... 8-65
Table 9-1 Comparative Strengths & Limitations of Each
Method ........................................................................... 9-4
Table A-1 Guidance Documents Assessed ............................... A-3
Table B-1 Taxonomy Devices and Components ......................... B-2
Table B-2 Basic Types of Defensive Measures ........................... B-4
Table D-1 Combinations of Failures (Cut Sets) Leading to
Loss of Circulating Water ................................................. D-6

Section 1: Introduction
1.1 Background

The lack of established, practical methods for evaluating and managing potential
failure modes and mechanisms of digital instrumentation and control systems is
adversely affecting design, risk assessment and licensing efforts involving digital
equipment. Operating experience and digital I&C project experience point to:
 System designs that overlook vulnerabilities that can lead to undesired plant
events
 Significant increases in system costs and complexity without apparent
commensurate safety benefits
 Failure analyses that are very expensive and impractical to apply to designs
 Difficulty obtaining regulatory acceptance of analysis and design approaches
that can improve dependability and reduce overall risk.

Improving utility proficiency in failure and hazard analysis of digital systems is important to help ensure high plant reliability and safety system functionality.
Specific issues from operating experience, digital project experience and licensing
reviews are described in Section 1.5.

This project investigated several methods for hazard & failure analysis of
industrial systems, ranging from well-established mature practices to innovative
methods still transitioning from academic demonstrations to practical, realistic
applications. The methods were applied to sample problems based on actual
nuclear plant experience to assess their effectiveness, range of applicability and
practicality for use by nuclear plant engineers and their suppliers.

1.2 Purpose/Objectives

The purpose of this guideline is to provide comprehensive, practical, cost-effective methods for identifying hazards in digital systems before the systems are put into operation.

Each of the hazard analysis methods described herein was researched and further
developed in order to meet the following objectives:
 Evaluate the capability of each method for identifying potential
vulnerabilities in a digital I&C system, including hazardous interactions with
plant components and plant systems

 1-1 
 Demonstrate the workability of each method on practical examples based on
experiences reported by EPRI members
 Provide a step-by-step procedure for each method so that users can adapt
them into a procedure format
 Provide worked examples to demonstrate each method in a step-by-step
manner
 Use the results to identify the comparative strengths and limitations of each
method
 Provide guidance on how to blend multiple methods to gain efficiencies in
the analysis, limit the analytical effort, or limit corrective actions such as
design changes or the application of administrative controls to the identified
hazards

1.3 Scope

This guideline describes the following six hazard analysis methods, including
discussions of their ranges of applicability, step-by-step procedures, strengths and
limitations, and worked examples based on actual nuclear plant cases:
1. Functional Failure Modes & Effects Analysis (FFMEA) Method
The Functional FMEA method takes a top down approach to identifying
the potential causes of postulated functional failures of plant system-level
functions and processes without necessarily identifying and analyzing
specific sets of equipment and their individual failure modes. Thus the
FFMEA method is well suited for analyzing a system at the conceptual
design phase in order to identify functional hazards or hazardous
conditions that should be addressed in later phases of the development
lifecycle. The FFMEA method is described in detail in Section 4.
2. Design Failure Modes & Effects Analysis (DFMEA) Method
The Design FMEA method takes a bottom up approach to identifying the
effects of postulated failure mechanisms and failure modes at a user-
determined level of interest. DFMEA is the method most often used by
equipment vendors, I&C engineers, and other stakeholders in the digital
I&C community. It is the traditional bottom-up approach that is
described in various standards such as IEEE Std. 352-1987 (Reference 1).
3. Top Down Method Using Fault Tree Analysis (FTA) Techniques
The Top Down (FTA) method treats I&C systems as parts of a larger
integrated plant design. It postulates failures of high level safety and
generation related functions and identifies the plant mechanical and
electrical equipment needed for these functions, along with the digital
I&C systems that control them. This top down approach can thereby
focus the failure analysis of the system by identifying the potentially
important failure modes of the mechanical and electrical components
controlled or actuated by the digital system. Digital system hazards that
can lead to important plant component failure modes can be further

 1-2 
evaluated using the FTA technique, or the analyst can link the Top Down
(FTA) results to another hazard analysis method. The Top Down (FTA)
method is described in Section 5.
4. Hazard and Operability Analysis (HAZOP) Method
A HAZard and OPerability (HAZOP) analysis is a systematic review of a
process (e.g., system design), using “guide words,” to visualize the ways in
which a system can malfunction. The HAZOP analysis searches for
possible deviations from the design intent that can occur in components,
operator or maintenance technician actions, or material elements (e.g., air,
water, steam), and determines whether the consequences of such
deviations can result in hazards. The HAZOP method is described in
Section 6.
5. Systems Theoretic Process Analysis (STPA) Method
The STPA method is one part of a set of new or refined systems
engineering methods developed by researchers at the Massachusetts
Institute of Technology (MIT), under the heading of Systems-Theoretic
Accident Model and Processes (STAMP). Per Reference 19:
“The primary reason for developing STPA was to include the new causal
factors identified in STAMP that are not handled by the older techniques
[FMEA, FTA, HAZOP, and others].”
The STPA method is included in this study because it effectively addresses
potentially hazardous interactions in digital I&C systems, including
hazards introduced by unintended software behaviors and component
interactions (not just potential component failures). Note that in this
guideline, the term “loss” is often used instead of the term “accident” as
described by MIT to avoid confusion with the more limiting nuclear
industry term, “nuclear accident.” The STPA method is described in
Section 7. (A brief illustrative sketch of the HAZOP and STPA enumeration steps appears after this list.)
6. Purpose Graph Analysis (PGA) Method
A Purpose Graph is a figure that illustrates the “Observable,” “State,”
“Goal” and “Process” features of a system. Purpose Graphs are used in
Systems Engineering design and analysis activities. The Purpose Graph is
composed of a State Graph placed side-by-side with a Process Graph.
The PGA method is useful for identifying potential digital systems
hazards that can arise from unexpected component or system behaviors by
providing insights into redundancy and diversity success paths, direct and
indirect consequences of failures to meet designed performance levels even
when no faults are present, desired and undesired interactions between
aspects of normal system state changes, incompatible goals, and
incompatible processes. The PGA method is described in Section 8.
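
To make the mechanics of some of these methods more concrete before the detailed sections, the following minimal sketch (in Python) shows one way the combinatorial bookkeeping behind the HAZOP and STPA methods is often organized: guide words are paired with process parameters to generate candidate deviations, and control actions are paired with process model variable states to generate candidate contexts for unsafe control actions. The guide words, parameters, control actions, and process model variables shown are hypothetical placeholders, not entries from the worked examples in this report; the engineering judgment about which deviations or contexts are actually hazardous remains the substance of the methods described in Sections 6 and 7.

```python
from itertools import product

# Hypothetical HAZOP inputs: guide words are applied to process parameters
# to generate candidate deviations for team review (see Section 6).
GUIDE_WORDS = ["NO", "MORE", "LESS", "REVERSE", "EARLY", "LATE"]
PARAMETERS = ["flow", "pressure", "valve position"]          # placeholders

def hazop_deviations(guide_words, parameters):
    """Enumerate candidate deviations as 'GUIDE WORD parameter' strings."""
    return [f"{gw} {param}" for param, gw in product(parameters, guide_words)]

# Hypothetical STPA inputs: each control action is examined against the four
# types of unsafe control action and against combinations of process model
# variable states (see Section 7).
CONTROL_ACTIONS = ["open injection valve"]                   # placeholder
UCA_TYPES = ["not provided", "provided",
             "too early / too late", "stopped too soon / applied too long"]
PROCESS_MODEL = {                                            # placeholder states
    "vessel level": ["low", "normal", "high"],
    "system mode": ["standby", "injecting"],
}

def stpa_contexts(control_actions, uca_types, process_model):
    """Enumerate the contexts in which each control action type is reviewed."""
    names = list(process_model)
    rows = []
    for action, uca in product(control_actions, uca_types):
        for states in product(*(process_model[name] for name in names)):
            rows.append((action, uca, dict(zip(names, states))))
    return rows

if __name__ == "__main__":
    for deviation in hazop_deviations(GUIDE_WORDS, PARAMETERS)[:5]:
        print("Candidate deviation:", deviation)
    print(len(stpa_contexts(CONTROL_ACTIONS, UCA_TYPES, PROCESS_MODEL)),
          "candidate STPA contexts to screen")
```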

1.4 Key Concepts



This guideline is focused on methods for analyzing digital I&C systems in various
contexts to determine if potential hazards exist that could lead to accidents or losses.
 1-3 
Accidents or Losses

The notion of an “accident” takes on different meaning in different domains.


STPA practitioners at MIT use the term broadly to refer to any type of undesired
result. In the nuclear power industry, this term is used in the context of nuclear
safety, which is expressed in terms of nuclear fuel damage or release of radioactive
material. However, the term “accident” can be replaced with the term “loss” to
express other ideas such as lost generation, personnel injury or loss of life,
equipment damage, or any other losses deemed unacceptable. This guideline uses
the term “loss” rather than “accident,” to avoid confusing nuclear accidents with
other types of undesired results that could be the focus of a hazard analysis.

Hazards

The term “hazard” is a bit more difficult to understand. The systematic identification and prevention, mitigation or removal of “hazards” can prevent
accidents (or losses). But when engaged in a hazard analysis activity, it is possible
to confuse conditions with events. Dr. Nancy Leveson, MIT Professor of
Aeronautics and Astronautics and Engineering Systems, writes in her book
“Engineering a Safer World,” (Reference 19):

Hazard: A system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to an
accident (loss).

This definition requires some explanation. First, hazards may be defined in terms of conditions, as here, or in terms of events as long as
one of these choices is used consistently. While there have been
arguments about whether hazards are events or conditions, the
distinction is irrelevant and either can be used. The hazard for a
chemical plant could be stated as the release of chemicals (an event) or
chemicals in the atmosphere (a condition). The only difference is that
events are limited in time while the conditions caused by the event
persist over time until another event occurs that changes the prevailing
conditions. For different purposes, one choice might be advantageous
over the other.

Second, note that the word failure does not appear anywhere. Hazards
are not identical to failures - failures can occur without resulting in a
hazard and a hazard may occur without any precipitating failures. C. O.
Miller, one of the founders of System Safety, cautioned that
"distinguishing hazards from failures is implicit in understanding the
difference between safety and reliability."

The notion of “worst-case environmental conditions” needs some explanation. As used here, it is intended to convey the idea that the analyst should consider
operating modes and the states of the environment around the system in their
abnormal conditions. This was the fundamental approach proposed in the EPRI
“ACES Report” (Reference 14), where the digital system design functions were intended to be analyzed in the context of abnormal conditions and events
(ACES).

Conceptually, “hazard analysis” may be considered somewhat broader than “failure analysis” in the sense that it also considers situations in which there can
be losses in the absence of any failures of systems, subsystems or components.
This document uses the two terms interchangeably in the broader context.

Context

In practice, the identification of hazards should be limited to things that can be controlled or prevented. For example, radiation releases may be prevented or
controlled, but we can’t control or prevent the wind from blowing, so there is no
point in arguing that the only way to protect the public from radiation exposure
is to prevent people from living downwind of a nuclear power plant. This idea
brings some useful context to an assessment of potential hazards that could lead
to an accident (in this case, radiation exposure) or loss. In other words, it is much
more helpful to focus time and energy on reducing or eliminating releases.

The role of context also helps determine if a failure mode, design feature, or
other characteristic of a digital component or system is hazardous by viewing
it under various postulated conditions. The hazard analysis methods
described in this guideline provide techniques for systematically identifying the
conditions under which failure modes, design features or other characteristics of a
digital component or system are hazardous or not hazardous.

Additional Concepts

The simple example described below introduces some additional concepts, such
as a “Safety Constraint,” which can be thought of as a design constraint intended
to ensure safety. This example expresses some of the key concepts used or
described in this guideline.

Consider the act of running with a pair of scissors in your hands. Scissors are like
knives, and present a contradiction: a sharp pair of scissors is a safe pair of
scissors because they can serve their purpose without using excessive force, but a
sharp pair of scissors can also cut you. So we are taught as children to never run
when we have scissors in our hands, because if we fall, we might cut ourselves.
This is an example of a safety constraint (don’t run when you have scissors in your
hand) that is designed to prevent an accident (getting cut) due to a potentially
hazardous condition (having scissors in your hand when you are running). This
example provides context through a combination of a potentially hazardous piece
of equipment (scissors) and its environment (in your hand while you are running).

This example illustrates that a combination of two conditions is required to create a potential hazard: a) having scissors in your hand, and b) while you are running.
Notice that neither condition is considered hazardous by itself.

 1-5 
The likelihood of an accident is increased when the potentially hazardous condition
of having scissors in your hand is combined with the act of running. We could
propose other rules for reducing or eliminating this hazard by banning the use of
scissors to cut anything, or requiring scissors to be blunt. But sharp scissors are so
useful for their intended purpose, we are willing to live with the risk of an
accident as long as we teach and reinforce the rule of not running when we have
them in our hands. We accept that by complying with this rule, we can live with
reasonable assurance that we won’t get cut when we have scissors in our hands.
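
The same reasoning can be written down compactly. The minimal sketch below restates the scissors example, treating the hazard as a conjunction of two conditions and the safety constraint as a rule that prevents that conjunction; the condition names are illustrative only.

```python
# Scissors example: neither condition is hazardous alone; the combination is.
def hazardous(holding_scissors: bool, running: bool) -> bool:
    """A hazard exists only when both conditions hold at the same time."""
    return holding_scissors and running

def safety_constraint_met(holding_scissors: bool, running: bool) -> bool:
    """'Do not run while holding scissors': if enforced, the hazard cannot arise."""
    return not (holding_scissors and running)

assert not hazardous(holding_scissors=True, running=False)   # scissors alone: no hazard
assert not hazardous(holding_scissors=False, running=True)   # running alone: no hazard
assert hazardous(holding_scissors=True, running=True)        # combination: hazard
```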

1.5 Why New Guidance?

Various contributors to this guidance reported experiences that led to the need for it. The methods described herein, including guidance on how
to apply the results, are intended to help solve problems like those described
below. Some of the issues reported from experience are systemic to the quality of
the work done and won’t necessarily be solved by new methods. The guidance
here on traditional methods (e.g., FMEA) is intended to help users execute those
methods more effectively. The newer methods presented in this document are
intended to be useful both in facilitating the application of traditional methods to
make them more successful, and in enabling the analyst to discover potential
vulnerabilities that are difficult or impossible to reveal using traditional methods.
In addition, the step-by-step procedures in the worked examples help illustrate
each of the methods.

Technical Experience

Recent EPRI evaluations of digital operating experience (OE) (References 10, 11 & 16) revealed that in some cases the features credited in the failure analysis did
not behave as expected, indicating a weakness in the analysis or a weakness in the
testing program. Also, in some events, the FMEA that had been performed did
not consider a failure mode that later occurred, again indicating a need for
improvement in the failure analysis. Most of these cases involved non-safety
systems that are critical to plant operation and for which additional attention is
warranted. This research result, along with reports by contributors to this
guidance, calls attention to the following types of failure and hazard analysis
shortcomings that are being experienced:
 The failure analysis did not identify a failure mode that was actually
experienced. In some cases, the unidentified failure mode was discovered
after a plant trip.
 The failure analysis identified the experienced failure mode, but the stated
effects on the plant (i.e., the results) were incorrect. In some cases, the stated
effect of the failure mode was that the digital system would swap to another
component, such as a backup controller, or the digital system would raise an
alarm, but the failure would have no effect at the plant system level, when in
fact it did (usually resulting in a plant trip).
 The operating event occurred due to an unintended or unanticipated
interaction. In these cases, there were no actual failures in the digital I&C system, but neither was the system designed for the plant conditions that it
encountered, leading to an event.
 The configuration analyzed was different from the operating configuration.
The digital system that was analyzed, and even tested in some cases, was not
exactly the same as the system that was installed, commissioned, and turned
over to operations. On paper the failure analysis and tests showed acceptable
results, but in reality the system response to some failure modes was different
than expected.
 Depth of analysis. There is no consensus method for determining the needed
level of detail in the analysis. When can the analysis stop at system-level
failure modes, and when should it penetrate to the deepest levels of a system,
including an assessment of individual devices and piece-parts that make up
each digital component or computing unit? This question leads to the
problem of integrating failure analysis results from two distinct domains: the
plant system domain, which is familiar to the owner/operator’s engineers,
and the I&C technology domain, which is familiar to the platform vendor’s
engineers. This problem is further exacerbated by a limited ability of the
engineers to communicate across the gap between their domains of expertise,
or combine the results of analyses performed in different domains.
 Software failure modes. The term “software failure” is still being used by
some, but it can create confusion and misunderstandings. The term can be
misleading, because software doesn’t really fail; it does exactly what it is
designed to do. Under certain conditions, software design errors can wreak
havoc in digital systems, but they are not “failures.” It would be helpful to
replace the notion of “software failure modes” with a concept and terms that
better fit the reality, such as “hazardous behaviors that can be introduced via
software” and “unintended or undesired behaviors.” An updated approach to
hazard analysis for digital systems may be a key factor in rectifying this
problem.
 Senior management awareness. Some contributors reported that I&C
engineers, project managers and middle managers used their own judgment
and experience to assess the acceptability or risk of identified failure modes
and their effects on the plant, when in fact there was an unwritten
expectation that the responsibility for these decisions rests solely with senior
management (e.g., station manager or site VP). In these cases, staff personnel
were able to convince themselves that the risks due to the effects of some
identified and potentially hazardous failure modes were acceptable, and did
not report the results to senior management. Later, after an operating event
exposed the failure mode and its unacceptable effects, the senior management
response was to require a modification of the system to prevent recurrence,
and change procedures so that failure analysis results that could critically
affect the plant are shared at the highest levels for decision-making before
implementation.

 1-7 
Project Experience

Recent project experience of vendors and utilities on major digital upgrades shows that failure analysis activities have been more difficult and resource
intensive than originally planned, leading to significant delays and cost overruns.
This experience suggests that failure analysis methods for large, complex
upgrades are not defined or understood well enough by the industry to assure
predictable cost, schedule and quality of failure analysis deliverables. Improved
guidance and training will enable lower cost and improved predictability.

Most contributors to this methodology were familiar with the Failure Modes and
Effects Analysis (FMEA) method. The FMEA method is typically used on
digital upgrade activities, and over time has become the de facto choice among
I&C engineers (in owner/operator and supplier organizations) because it is more
familiar to them and is described in existing policies and procedures. The Fault
Tree Analysis (FTA) is also familiar to owner/operators, especially in the context
of the facility Probabilistic Risk Assessment (PRA), and in some cases fault trees
or the PRA itself are used to inform and assess digital system designs.

However, project experience with these methods has not always been good,
especially on large and more complex systems, such as a complete protection
system upgrade, or the application of Distributed Control System (DCS)
technology on multiple control system segments (e.g., main turbine control,
feedwater control, etc.). Contributors have reported the following issues:
 Sometimes FMEAs are too big, expensive and difficult to manage. In some
cases, FMEA worksheets have run into thousands of pages when the analyst
considers all failure modes of each and every component.
 Sometimes FMEA results are not timely enough to make a meaningful
difference. When an FMEA becomes too unwieldy, inevitably it is
completed later than planned, and in some cases the owner/operator is faced
with a decision to live with some of the identified vulnerabilities, because it is
too late or too costly to rework the system design.
 Sometimes it is difficult for I&C engineers to fully understand and make use
of fault trees and/or the PRA. In some cases, I&C engineers responsible for
digital upgrade projects didn’t know what questions to ask of the PRA
engineers, or how to ask them. In some cases, I&C is not modeled in the existing PRAs for use in projects.

1.6 Relationship to Other Work

The project team used the guidance provided by EPRI TR-104595, Abnormal
Conditions and Events (ACES) Analysis for Instrumentation and Control (I&C)
Systems and EPRI interim report 1022985 Failure Analysis of Digital
Instrumentation and Control Equipment and Systems – Demonstration of Concept as
input to developing this guideline.

 1-8 
The “ACES Report”

EPRI TR-104595 (Reference 14) provides information about Abnormal


Conditions and Events (ACES) Analysis for Instrumentation and Control
Systems with a focus on identifying and evaluating ACES in digital upgrades.
Effectively, what was called “ACES analysis” then (circa 1995) we are calling
“hazard analysis” today.

The ACES Report (TR-104595) provides the structure for failure analyses of
digital upgrades and a summary of evaluation techniques. The guidance
contained herein effectively expands the information already contained within the
ACES topical report, which was an early attempt at addressing hazards. The
problem of hazards analysis and finding potentially bad behaviors was understood
back then. The difference now is that industry has a lot more to work with in
terms of better developed methods and real examples.

Demonstration of Concept

EPRI Report 1022985 (Reference 15) was the result of initial research on
applications of failure analysis methods used in today’s digital upgrade activities.
Several methods for performing failure analysis of digital systems were explored
during this “demonstration of concept” research. The research evaluated the top
down approach of a Fault Tree Analysis (FTA) and the bottom up approach of a
Failure Modes and Effects Analysis (FMEA).

The proposed failure analysis methodology developed by NRC Research, under NUREG/CR-6962, describes one possible approach where the analyst performs a
Design FMEA on a digital system at various “levels”, beginning with the overall
system and possible failure effects on it.

The purpose for evaluating both top down and bottom up approaches was to
investigate the possibility of developing a hybrid approach that could use top-
down and bottom-up techniques in complementary manners. In principle, the
top-down approach would identify critical functional failures, and the scope of
the bottom-up approach would be limited to component failures that could lead
to the critical functional failures. The objective was a method that would be both
more effective in finding potential vulnerabilities, and be less costly to apply than
conventional methods.
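
A minimal sketch of that scoping idea follows, with hypothetical function, component, and failure mode names that are not drawn from this report: the top-down analysis supplies a set of critical functional failures, and the bottom-up Design FMEA is then limited to the failure modes of components mapped to those functions.

```python
# Hypothetical illustration of the hybrid approach: top-down results are used
# to scope the bottom-up Design FMEA. All names below are placeholders.
critical_functional_failures = {"loss of high pressure injection"}

# Mapping from plant-level functional failures to the digital components
# whose failures could contribute to them (from the top-down analysis).
function_to_components = {
    "loss of high pressure injection": ["flow controller", "valve positioner"],
    "loss of condenser vacuum alarm": ["annunciator gateway"],
}

# Component failure modes that a Design FMEA would examine (from a taxonomy).
component_failure_modes = {
    "flow controller": ["output frozen", "output fails high", "spurious restart"],
    "valve positioner": ["fails as-is", "drives full open"],
    "annunciator gateway": ["message lost"],
}

def scoped_dfmea_rows(critical_failures, func_map, failure_modes):
    """Return only the (function, component, failure mode) rows that could
    contribute to a critical functional failure found by the top-down step."""
    rows = []
    for function in sorted(critical_failures):
        for component in func_map.get(function, []):
            for mode in failure_modes.get(component, []):
                rows.append((function, component, mode))
    return rows

for row in scoped_dfmea_rows(critical_functional_failures,
                             function_to_components,
                             component_failure_modes):
    print(row)
```

In this sketch the annunciator gateway never appears in the scoped worksheet because no critical functional failure depends on it; that reduction in analysis effort is exactly what the hybrid approach is intended to provide.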

In addition to the failure mode analysis evaluations, EPRI 1022985 provides detailed taxonomy information related to several digital components, and that
taxonomy has been enhanced and carried forward into this report. As part of the
failure analysis efforts evaluating the FTA and FMEA methods, the component
failure information from the taxonomy sheets is used during the failure analysis.

Detailed examples with failure analysis tables and results were included in EPRI
1022985 to demonstrate how the guidance in the report could be used in failure
analysis efforts. Several of the examples have been enhanced and carried forward
into this document.

 1-9 
1.7 Contents of this Guideline

Table 1-1 provides an overview of the contents of the remaining sections of this
guideline:

Table 1-1
Summary of Guideline Contents

Section 2 (Definitions & Acronyms): Definitions of key terms (with references) and key acronyms

Section 3 (Planning Hazard Analysis Activities): Guidance for:
- Determining analysis scope & objectives
- Method selection
- Determining necessary resources & schedule
- Function Analysis (FA)
- Preliminary Hazards Analysis (PHA)
- Hazard analysis acceptance, documentation & maintenance

Section 4 (Failure Modes & Effects Analysis (FMEA) Method): Overview, procedure and worked examples for the Functional FMEA and Design FMEA methods. Includes case studies based on reported experience, and discussion of strengths and limitations of each method

Section 5 (Top Down Using Fault Tree Analysis (FTA) Method): Overview, procedure and worked examples for the Top Down (FTA) Method, including discussion of strengths and limitations

Section 6 (Hazard & Operability Analysis (HAZOP) Method): Overview, procedure and worked examples for the HAZOP Method, including discussion of strengths and limitations

Section 7 (Systems Theoretic Process Analysis (STPA) Method): Overview, procedure and worked examples for the STPA Method, including discussion of strengths and limitations

Section 8 (Purpose Graph Analysis (PGA) Method): Overview, procedure and worked examples for the PGA Method, including discussion of strengths and limitations

Section 9 (Conclusions & Recommendations): Conclusions (including a method comparison table), and recommendations for future work

Section 10 (References): List of documents referenced by this guideline

App. A (Overview of Available Guidance): Summary assessment of currently available industry standards and guidance related to failure analysis methods, with an emphasis on digital systems

App. B (Taxonomy of Failure Modes, Failure Mechanisms, Faults, and Defensive Measures):
- Descriptions of typical digital devices and components
- A hierarchy of typical digital devices, components, and systems, and how failure mechanisms, failure modes and effects can propagate up through the hierarchy
- Typical failure mechanisms that can affect typical digital devices and components
- Typical digital device or component failure modes that result from typical failure mechanisms
- Possible defensive measures that could be implemented (or validated) for preventing or mitigating typical failure mechanisms associated with a digital device or component
- How to use the Taxonomy in digital failure analysis activities

App. C (Physical & Functional Representations): Lists of typical physical and functional elements of digital systems that could be represented in hazard analysis activities

App. D (Circulating Water System (CWS) Top Down Analysis): Detailed analysis developed for one of the worked examples provided in Section 5.

 1-11 
1.8 How to Use this Guideline

This guideline is a large and comprehensive body of work, and therefore users
will benefit by taking the following steps before proceeding with a hazard analysis
activity:
a. Read Sections 1 through 3 to get an overview of each method, definitions of
key terms, and guidance on how to select an effective method or blend of
methods for a given situation and plan the hazard analysis activities
b. Scan Appendices A through D for awareness and potential aid in the hazard
analysis
c. Identify the most likely method or blend of methods for the given problem
d. Read the sections and examples on the candidate methods identified in step c
e. Select the method or methods to apply
f. Plan the appropriate activities, and proceed. Reference the detailed sections,
examples and relevant appendices, as appropriate

Step e. is likely to be the most difficult, because the choice of method(s) depends
on several factors such as:
 the scope of the digital I&C project
 the scope of the hazard analysis
 familiarity of methods to various stakeholders
 how hazards are identified and characterized at various levels of interest
 how methods can be used in various system lifecycle phases
 the potential need for a facilitator or outside expertise

Sections 3.3 and 3.4 provide specific guidance on method selection for a wide
range of situations.

 1-12 
Section 2: Definitions
Accident (or Loss): An undesired and unplanned event that results in a loss
(including loss of human life or injury, property damage, environment pollution,
and so on). (Reference 19) (this definition is broader than the typical nuclear
plant definition of accident)

Anomaly: Anything observed in the operation of software that deviates from


expectations based on previously verified software behaviors. (Reference 2)

Basic Event: A basic fault that requires no further development in a fault tree
(Reference 35). Usually representative of a component and one of its failure
modes.

Behavior: The evolution of the input, processing and output states of a digital
computing system over time. By decomposition, the evolution of the states of a
subsystem or component over time. Some of the meaning of this term is similar
to the use of the term “Function,” as in functional requirements or function
decomposition.

Component: One of the parts that make up a system. A component may be


hardware or software and may be subdivided into other components. Note: The
terms “module,” “component,” and “unit” are often used interchangeably or
defined to be subelements of one another in different ways depending upon the
context. The relationship of these terms is not yet standardized. (Reference 2)

Control Systems: Those systems used for normal operation that are not relied
upon to perform safety functions following anticipated operational occurrences or
accidents. The control systems evaluated using [Standard Review Plan (SRP)]
Chapter 7 are those which control plant processes having a significant impact on
plant safety, but are not wholly incorporated into systems addressed by other
SRP chapters. (Reference 6)

Cut Set: A combination of component failures which, if they all occur, will cause
the top event of a fault tree to occur. (Reference 35)

Design Basis: The high-level functional requirements, interfaces, and


expectations of a facility’s SSCs that are based on regulatory requirements or
facility analyses. Individual bases are contained in design information and may be
reflected in any combination of criteria, codes, standards, specifications,
computations, or analyses identifying pertinent constraints, qualifications, or

 2-1 
limitations. The design basis identifies and supports the reasons a design
requirement is established. (Reference 7)

Design Constraint: a restriction on how a system can achieve its purpose


(Reference 19).

Design Intent (or Intention): Designer’s desired, or specified range of behavior


for elements and characteristics (Reference 33)

Deviation: Departure from the design intent (Reference 33)

Digital Device: A device that operates on the basis of discrete numerical


techniques in which the variables are represented by coded pulses or states
(Reference 3)

Digital Upgrade: A modification to a plant system or component which involves


installation of equipment containing one or more computers. These upgrades are
often made to plant instrumentation and control (I&C) systems, but the term as
used in [Reference 5] also applies to the replacement of mechanical or electrical
equipment when the new equipment contains a computer (e.g., installation of a
new heating and ventilation system which includes controls that use one or more
embedded microprocessors). (Reference 4)

Element: Constituent of a part which serves to identify the part’s essential


features. Note: The choice of elements may depend upon the particular
application, but elements can include features such as the material involved, the
activity being carried out, the equipment employed, etc. Material should be
considered in a general sense and includes data, software, etc. (Reference 33)

Error: (1) The difference between a computed, observed, or measured value or


condition and the true, specified, or theoretically correct value or condition. For
example, a difference of 30 meters between a computed result and the correct
result. (2) An incorrect step, process, or data definition. For example, an incorrect
instruction in a computer program. (3) An incorrect result. For example, a
computed result of 12 when the correct result is 10. (4) A human action that
produces an incorrect result. For example, an incorrect action on the part of a
programmer or operator. (Reference 2)

Failure: The inability of a system or component to perform its required functions


within specified performance requirements. Note: The fault tolerance discipline
distinguishes between a human action (a mistake), its manifestation (a hardware
or software fault), the result of the fault (a failure), and the amount by which the
result is incorrect (the error). (Reference 2)

Fault: (1) A defect in a hardware device or component; for example, a short


circuit or broken wire. (2) An incorrect step, process, or data definition in a
computer program. Note: This definition is used primarily by the fault tolerance

 2-2 
discipline. In common usage, the terms “error” and “bug” are used to express this
meaning. (Reference 2)

Fault Tree: A graphic model of the various parallel and sequential combinations
of faults that will result in the occurrence of a predefined undesired event.
(Reference 35)

Fatal Error: An error that results in the complete inability of a system or


component to function. (Reference 2)

Function: (1) A defined objective or characteristic action of a system or


component. For example, a system may have inventory control as its primary
function; (2) A software module that performs a specific action, is invoked by the
appearance of its name in an expression, may receive input values, and returns a
single value. (Reference 2)

Guide Word: Word or phrase which expresses and defines a specific type of
deviation from an element’s design intent (Reference 33)

Hazard: (1) A condition that is a prerequisite to an accident. Hazards include


external events as well as conditions internal to computer hardware or software
(Reference 9); (2) A system state or set of conditions that, together with a
particular set of worst-case environment conditions, will lead to an accident
(loss). (Reference 19)

For the purpose of this guidance, the term “hazard” is used to describe an
unwanted or unacceptable system behavior that could lead to an accident or loss,
or prevent an appropriate system response to an accident or loss condition.

Hazard Analysis: (1) A process that explores and identifies conditions that are
not identified by the normal design review and testing process. The scope of
hazard analysis extends beyond plant design basis events by including abnormal
events and plant operations with degraded equipment and plant systems. Hazard
analysis focuses on system failure mechanisms rather than verifying correct
system operation (Reference 9); (2) The process of identifying hazards and their
potential causal factors. (Reference 19). Conceptually, “hazard analysis” may be
considered somewhat broader than “failure analysis” in the sense that it also
considers situations in which there can be losses in the absence of any failures of
systems, subsystems or components. This document uses the two terms
interchangeably in the broader context.

Insertion Mechanism: For faults, the pathway of processes and conditions that
resulted in the presence of the fault, but not its discovery. Insertion mechanisms
are often linked to the stages of the development and production process (e.g.,
design, tool behavior, etc.)

License Basis: Documented elements that the NRC has considered in granting
and maintaining the license for the facility. These include the combined
operating license application (COLA); safety evaluation report (SER); design

 2-3 
control documents (DCDs); technical specifications; Inspections, Tests, Analyses
& Acceptance Criteria (ITAAC); and other commitments made under the
corrective action program. (Reference 7)

Malfunction: In the context of 50.59, malfunction means the failure of a


structure, system, or component to perform its intended design functions as
described in the UFSAR (whether or not classified as safety-related in
accordance with 10 CFR 50, Appendix B). (Reference 5)

Non-Fatal Fault: A software fault that allows program execution to continue, but
with incorrect behavior.

Non Plausible Outcome Failure: A non-fatal fault with output errors that do not
satisfy output expectations or specifications (i.e., a form of soft failure).

Part: Section of the system which is the subject of immediate study. Note: A part
may be physical (e.g. hardware) or logical (e.g. step in an operational sequence).
(Reference 33)

Plausible Outcome Failure: A non-fatal fault with output that appears to satisfy
output expectations but contains errors (i.e., a form of soft failure).

Protection System: 1) the part of the sense and command features involved in
generating those signals used primarily for the reactor trip system and engineered
safety features. (Reference 8), or 2) those I&C systems which initiate safety
actions to mitigate the consequences of design basis events. The protection
systems include the reactor trip system (RTS) and the engineered safety features
actuation system (ESFAS). (Reference 6)

Risk: Combination of the probability of occurrence of harm and the severity of


that harm (Reference 33)

Safety: Freedom from accidents (loss events) (Reference 19)

Safety Constraint: A design constraint intended to assure safety.

Software Hazard: A process or resulting outcome that has the potential under at
least some conditions to result in an unplanned event or series of events causing
damage to equipment or the environment and/or death, injury or illness to
personnel. Hazards may be graded by the extent of the damage and injury
potential.

Unsafe Control Action: A controller command that violates a safety constraint.


(Derived from Reference 19)

 2-4 
Abbreviations & Acronyms

ACES (Abnormal Conditions and Events)

AFW (Auxiliary Feedwater)

A/D (Analog to Digital)

BWR (Boiling Water Reactor)

CIS (Containment Isolation System)

Comm (Communication)

CPU (Central Processing Unit)

CRD (Control Rod Drive)

CVCS (Chemical and Volume Control System)

CWS (Circulating Water System)

DCS (Distributed Control System)

D/A (Digital to Analog)

EHC (Electro-Hydraulic Control)

ESFAS (Engineered Safety Features Actuation System)

FMEA (Failure Modes & Effects Analysis)

FPS (Fire Protection System)

FPT (Feed Pump Turbine)

FTA (Fault Tree Analysis)

HAZOP (Hazard and Operability Analysis)

HPCI (High Pressure Coolant Injection)

HPSI (High Pressure Safety Injection)

I/O (Input/Output)

LAR (License Amendment Request)

LPCI (Low Pressure Coolant Injection)

LPSI (Low Pressure Safety Injection)


 2-5 
MCR (Main Control Room)

MOV (Motor Operated Valve)

NSSS (Nuclear Steam Supply System)

PCS (Primary Coolant System)

PHA (Preliminary Hazard Analysis)

PMV (Process Model Variable)

PORV (Pilot Operated Relief Valve)

RAM (Random Access Memory)

RCIC (Reactor Core Isolation Cooling)

RCP (Reactor Coolant Pump)

RHR (Residual Heat Removal)

ROM (Read Only Memory)

RSP (Remote Shutdown Panel)

RWCU (Reactor Water Clean Up)

SDC (Shutdown Cooling)

SG (Steam Generator)

SPC (Suppression Pool Cooling)

SRV (Safety Relief Valve)

SSA (Software Safety Analysis)

SSC (System, Structure or Component)

STPA (Systems Theoretic Process Analysis)

SW (Service Water)

S/G (Steam Generator)

TBPV (Turbine Bypass Valves)

UCA (Unsafe Control Action)

Xformer (Transformer)
 2-6 
Section 3: Planning Hazard Analysis
Activities
A hazard analysis activity may be performed in accordance with a one-time plan,
on a project-specific basis, or it may be performed on a recurring basis (i.e.,
project by project) in accordance with written procedures. In either case, hazard
analysis activities should include the determination of the scope, objectives,
analysis methods, resources, schedule, acceptance criteria, and documentation.

A hazard analysis activity begins in the definition phase of a project so the


identified scope, objectives, selected methods, resources, schedule and acceptance
criteria for the analysis are properly accounted for in the overall project plan. As a
project proceeds, adjustments to the project plan may be needed to incorporate
changes in the analysis activities due to initial findings or changing expectations.

The planning steps, further described below, are as follows:


1. Determine scope and objectives
2. Determine resources and schedule
3. Function analysis
4. Preliminary hazard analysis
5. Determine appropriate methods
6. Hazard analysis acceptance, documentation, and maintenance

3.1 Determine Scope & Objectives

The scope of the hazard analysis activity should be consistent with the project
scope. Project scope information typically outlines the scope of the design change
that will be performed in terms of affected systems, structures or components
(SSC), including an outline of the components in the system that are being
modified and their interfaces to other SSCs.

None of the methods described in this report, either individually or collectively,


are complete in their ability to find all undesired behaviors or hazards that may
exist in a digital system under review. Nor is it necessary or even important that
the methods be complete. Hazards in digital systems will continue to exist,
failures will occur and the plant designs must be robust in their ability to cope
with such failures. A review of operating experience confirms that the plant
 3-1 
designs already are robust in this regard (References 41 and 42). Therefore, the
objectives of the methods described in this guideline simply are to attain greater
coverage of undesired behaviors and hazards that may exist in digital systems
(i.e., address some of the limitations described in Section 1.2) while at the same
time reducing the effort needed to perform a hazard analysis and achieve that
coverage.

The objectives of the analysis should be determined after the scope of the project
and analysis have been determined. The objectives should encompass items that
involve equipment functions, success (or failure) criteria, and other project
objectives. Some analysis objectives may be driven by compliance requirements,
while others may be more subjective, reflecting the risk impacts of the system or
component being modified. Using objectives to outline the purpose of the
analysis allows the analysis to focus on the critical
aspects of the systems or components being analyzed. The following list provides
some potential objectives to consider before selecting and performing a specific
hazard analysis method:
 Identify single failure vulnerabilities
 Prevent loss of safety functions or critical functions
 Prevent inadvertent actuation
 Validate adequate redundancy
 Comply with regulatory requirements
 Prevent personal injury
 Protect equipment
 Differentiate and protect architectural segments
 Develop periodic testing requirements
 Aid in the analysis of field failures and consideration of design changes
 Make use of best available engineering resources
 Develop specific functional and performance requirements
 Gain acceptance of the system by plant personnel and management

3.2 Identify the Level(s) of Interest

Before selecting and applying one or more hazard analysis methods, it is helpful
to identify the “level of interest,” which is driven by the specific characteristics of
the project and/or the analysis. For example, the impacts of a system level change
on the plant may require a different analysis than a software upgrade in a device.

Figure 3-1 presents a hierarchy of plant functions, plant systems, plant


components, digital systems, digital components and digital devices. Examples of
items that can be found in each layer are provided in the text boxes, and although
this figure only highlights one item in each layer, in reality there are multiple

 3-2 
items in each layer that make up a plant system or a digital system. This view,
while somewhat abstract, is used in this guideline to show where different hazard
analysis methods can be applied, in a singular manner or in a blended manner,
that suits the systems and components to be analyzed at any particular level of
interest, consistent with the objectives of the analysis.

Notice in Figure 3-1 that plant functions, systems and components are
distinguished from digital systems, components and devices, so that the analyst
can identify items and interfaces at any single level and determine how they
interact with adjacent levels. Because this guideline is about hazard analysis of
digital I&C systems, it describes various hazard analysis methods using this view,
and how some are applied from the top down, some are applied from the bottom
up, and how some methods can be blended to gain certain efficiencies.

[Figure 3-1 is a block diagram of this hierarchy: Plant Functions at the top, then
Plant Systems 1 through n, Plant Components 1 through n, Digital Systems 1
through n, Digital Components 1 through n, Devices 1 through n, and Software
at the bottom. Text boxes beside each layer list example items: Main Turbine,
Main Generator, Feedwater, Rod Control, Reactor Coolant, Turbine Bypass,
Switchyard, Electrical, Plant Computer, Reactor Protection and Eng. Safety
Features at the plant system level; pumps, valves, vessels, compressors, breakers,
switchgear, transformers, heaters, pipes, ducts and air handlers at the plant
component level; S/G level, FPT speed, Main Turbine EHC, NSSS controls,
Plant Computer, Reactor Trip and ESFAS at the digital system level; controllers,
comm modules, I/O modules, indicators, power supplies, workstations, servers,
sensors and actuators at the digital component level; CPU, A/D, D/A, RAM,
ROM, watchdog and parts at the device level; and operating system, firmware,
applications and configuration data as software.]

Figure 3-1
A Hierarchical View

3.3 Determine Appropriate Method(s)

One or more appropriate hazard analysis methods should be selected to ensure


that the identified scope and objectives are adequately addressed. There are six
hazard analysis methods described in this guideline, suitable for use on digital
I&C systems and components, depending on the project or analysis objectives,
timing, and availability of qualified resources.

 3-3 
The hazard analysis methods described in this guideline may be complementary
to other analytical techniques that may be applied to a digital I&C system or a
larger set of plant systems and components, such as:
 Probabilistic Risk Assessment (PRA)
 Validation and Verification (V&V)
 Design Review
 Reliability Analysis
 Diversity and Defense in Depth Analysis
 System Modeling
 Abnormal Conditions and Events Analysis

Details about the techniques listed above and how they can be applied to digital
I&C systems can be obtained from documents that are referenced in this report.

Hazard Analysis Method Comparisons

Table 3-1 lists the full name and analytical scope of the six methods described in
this report, the characteristics of the hazards that each method is designed to
reveal, and the Section number where each method is described in detail.

Before selecting and applying any candidate methods, users of this guideline
should:
1. Carefully review the applicable procedures and worked examples provided in
the related sections.
2. Consider a blended approach where two methods are applied in order to:
 take advantage of readily available results from one method (e.g., plant-
specific fault trees that are maintained for use in the facility PRA), and
use them as an input to another method
 use the results from one method to limit the effort required by another
method
 use the results from one method to identify the potentially critical
hazards to be further evaluated by another method, and limit the need
for corrective actions to those which address critical hazards

Figures 3-2 through 3-11 compare the methods described in this guideline, in
various contexts, allowing users to assess their anticipated project or analysis
scope and objectives against the relative strengths of each method, and select the
method(s) that are best suited for the task at hand.

For guidance and examples of blended approaches, see Section 3.4.

 3-4 
Table 3-1
Comparative Scope of Hazard Analysis Methods and their Identified Hazard Characteristics

Functional Failure Modes & Effects Analysis (FFMEA) (Section 4)
Analysis Scope: Plant-level and system-level functions and processes
Identified Hazard Characteristics: Single functional failures of plant processes
and their causes:
 No Function
 Partial Function
 Over Function
 Degraded Function
 Intermittent Function
 Unintended Function
Hazards include faults and failures, as well as misbehaviors in the absence of any
faults or failures.

Design Failure Modes & Effects Analysis (DFMEA) (Section 4)
Analysis Scope: Functional interfaces at the system, component or device levels
Identified Hazard Characteristics: Single failures and their effects on other
sub-components, components or systems (hazardous or not). Hazards are limited
to faults or failures.

Top Down Using Fault Trees (FTA) (Section 5)
Analysis Scope: Plant and system functions, actuated components
Identified Hazard Characteristics: Actuated components fail to perform critical
functions, possibly due to associated digital system failures or misbehaviors.
Hazards are limited to faults or failures.

Hazard & Operability Analysis (HAZOP) (Section 6)
Analysis Scope: Active or passive plant processes and elements
Identified Hazard Characteristics: Unintended system behaviors under abnormal
operating conditions, revealed using guide words to assess process deviations
(e.g., more, less, early, late, etc.). Hazards include faults and failures, as well as
misbehaviors in the absence of any faults or failures.

Systems Theoretic Process Analysis (STPA) (Section 7)
Analysis Scope: Control actions by humans, systems, and components
Identified Hazard Characteristics: Control actions that are hazardous under
certain combinations of process model states; does not presume the presence of
faults or failures. Hazards include faults and failures, as well as misbehaviors in
the absence of any faults or failures.

Purpose Graph Analysis (PGA) (Section 8)
Analysis Scope: Plant systems, sub-systems or components and their functional
states, goals and processes
Identified Hazard Characteristics: Conditions that lead to undefined equipment
states; interactions, conflicts or unintended dependencies between process
elements. Hazards include faults and failures, as well as misbehaviors in the
absence of any faults or failures.
FMEA Methods at Various Levels of Interest

Figure 3-2 illustrates two different FMEA methods (both are described in detail
in Section 4), and how they may be applied at various levels of interest. The
Design FMEA is a bottom-up method that can be applied at any level of
interest. The analyst selects the level to meet his/her objectives. For example,
digital I&C platform vendors are likely to be interested in demonstrating the
reliability of their systems and components, and will typically apply the Design
FMEA method from the device or piece-parts level (i.e., the very bottom), up to
the digital system level, but on a generic basis. On the other hand, a system
integrator, owner/operator or architect/engineer is more interested in reliability at
the plant system level, and will typically apply the Design FMEA method from
the digital component level and up, or from the digital system level and up, on a
system-specific or plant-specific basis. A taxonomy of failure mechanisms, modes
and effects for typical digital devices and components is provided in Appendix B,
with guidance on how to use it.
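To illustrate the kind of record a Design FMEA builds up from the bottom, the
sketch below shows one possible worksheet row as a Python data structure. The
field names and example entries are assumptions made for illustration; the
procedure and worked examples in Section 4, together with the Appendix B
taxonomy, define the content actually used.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DesignFmeaRow:
    # Typical Design FMEA worksheet fields (illustrative only).
    item: str                         # device or component analyzed
    function: str                     # what the item is supposed to do
    failure_mode: str                 # how the item can fail to do it
    failure_mechanism: str            # credible cause of the failure mode
    local_effect: str                 # effect at the item's own level
    next_level_effect: str            # effect at the next level of interest
    defensive_measures: List[str] = field(default_factory=list)

row = DesignFmeaRow(
    item="Single loop controller",
    function="Maintain service water temperature via TCV-1A",
    failure_mode="Analog output fails low",
    failure_mechanism="Halted CPU",
    local_effect="Controller stops updating its output",
    next_level_effect="Temperature control valve drifts closed",
    defensive_measures=["Watchdog timer", "Periodic surveillance test"],
)
print(row.failure_mode, "->", row.next_level_effect)

Because each row carries the effect at the next level of interest, rows produced at
the device level can be chained upward through the digital component and digital
system levels in the manner shown in Figure 3-2.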

In contrast, the Functional FMEA method takes a top-down approach and is


likely to be applied by the owner/operator or architect/engineer. The Functional
FMEA method can be used to identify the causes of unwanted or unacceptable
failure mechanisms at the plant component level, which can be blended with the
results of a Design FMEA to focus design changes or other corrective actions on
mitigating or eliminating significant failures and avoid wasting resources on
failure effects that are of no consequence. For more on blending methods, see
Section 3.4.
[Figure 3-2 overlays the two FMEA methods on the hierarchy of Figure 3-1: the
Functional FMEA works from the top down, relating failure effects at the plant
function and plant system levels to failure modes and failure mechanisms at the
plant component and digital system levels, while the Design FMEA works from
the bottom up, relating failure mechanisms at the device and digital component
levels to failure modes and effects at the levels above. The Design FMEA is
typically performed by the digital platform vendor from the device level up to the
digital system level, and by the system integrator or owner/operator (or A/E by
proxy) from the digital component or digital system level up.]

Figure 3-2
FMEA Methods at Various Levels of Interest
The hazards identified by the Functional FMEA and Design FMEA methods
are limited to those that can lead to the failures identified at the levels of interest.
Notice that these methods are not designed to evaluate software failures, because
software does not fail (and a postulated failure is a necessary precondition for an
FMEA). Software misbehaviors are a design problem. The HAZOP and PGA methods
can also be applied at various levels of interest, comparable to the Functional
FMEA method.

Top Down (FTA) Method at Various Levels of Interest

Figure 3-3 illustrates the scope of the Top Down method, using fault tree
techniques, at various levels of interest. Classical Fault Tree Analysis (FTA) uses
terms such as “events” at the top of the fault tree and “faults” at various lower
layers. In this guideline, the terms used in the FMEA methods (failure
mechanisms, modes and effects) are also used in the Top Down method so that
the results of the two methods can be compared side-by-side to confirm results or
blended in a manner that gains efficiencies. For more on blended methods, see
Section 3.4.

As in the FMEA methods, the hazards that can be identified by the Top Down
method are limited to those that can lead to the failures identified at the levels of
interest, and this method is not designed to evaluate software failures, because
software does not fail (and a postulated failure is a necessary precondition for
inclusion in a fault tree).
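The logic exercised by the Top Down method can be illustrated with a minimal
fault tree evaluation sketch. The gates and basic events below are hypothetical
and greatly simplified relative to the plant-specific fault trees discussed in Section
5 and Appendix D; the point is only to show how the AND/OR structure
determines whether a postulated combination of failures produces the top event.

# Minimal fault tree evaluation sketch (hypothetical gates and basic events).
# AND gates require all inputs to occur; OR gates require any input to occur.
fault_tree = {
    "Loss of Feedwater":    ("OR",  ["FRV Spurious Closure", "Both FW Pumps Trip"]),
    "FRV Spurious Closure": ("OR",  ["Controller Output Fails Low", "FRV Actuator Fault"]),
    "Both FW Pumps Trip":   ("AND", ["FW Pump A Trip", "FW Pump B Trip"]),
}

def top_event_occurs(event, failed_basic_events):
    """True if 'event' occurs given the set of failed basic events."""
    if event not in fault_tree:                    # leaf: a basic event
        return event in failed_basic_events
    gate, inputs = fault_tree[event]
    results = [top_event_occurs(i, failed_basic_events) for i in inputs]
    return all(results) if gate == "AND" else any(results)

# One pump trip alone does not produce the top event; a single controller
# output failure does, flagging a potential single failure vulnerability.
print(top_event_occurs("Loss of Feedwater", {"FW Pump A Trip"}))               # False
print(top_event_occurs("Loss of Feedwater", {"Controller Output Fails Low"}))  # True

Any combination of basic events that produces the top event is a cut set;
enumerating the minimal cut sets is how the method identifies which component
failures matter to the critical function.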

[Figure 3-3 overlays the Top Down (FTA) method on the hierarchy of Figure 3-1:
a fault tree can be entered at the plant function, plant system, plant component or
digital system level, with the failure effects at the chosen level traced down
through failure modes and failure mechanisms at the levels below.]

Figure 3-3
Top Down (FTA) Method at Various Levels of Interest
STPA Method at Various Levels of Interest

Figure 3-4 illustrates how the STPA method, described in Section 7, can be
applied at various levels of interest. In this case, the only direct correlation
between the system/component hierarchy and the STPA method is at the point
where losses are identified. After losses are identified at the appropriate level of
interest, the STPA method systematically breaks them down into hazards,
hazardous control actions, and control flaws that can lead to hazardous control
actions. This approach essentially makes STPA a top down method, but only in
the sense that losses (identified at the level of interest) are the starting point.
There is no direct correlation between the subsequent steps in the STPA method
and lower levels in the system/component hierarchy.

Notice that software is identified at the bottom of Figure 3-4, because the STPA
method does not presume faults or failures. Instead, it identifies hazardous
control actions, even if there are no faults or failures, that can arise from
hardware or software design issues.
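One convenient way to keep STPA work products organized is to record them as
nested structures that mirror the breakdown just described (losses, hazards,
hazardous control actions, control flaws). The sketch below is a bookkeeping
illustration only; the class and field names are assumptions of this example, and
the feedwater entries are hypothetical placeholders rather than results from the
Section 7 worked examples.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ControlFlaw:
    description: str                  # e.g., flawed process model, missing feedback

@dataclass
class HazardousControlAction:
    control_action: str               # command issued by a controller
    context: str                      # conditions under which the action is hazardous
    flaws: List[ControlFlaw] = field(default_factory=list)

@dataclass
class Hazard:
    description: str
    hcas: List[HazardousControlAction] = field(default_factory=list)

@dataclass
class Loss:
    description: str                  # identified at the level of interest
    hazards: List[Hazard] = field(default_factory=list)

# Hypothetical example, for illustration only.
loss = Loss("Loss of feedwater to the steam generators", [
    Hazard("Spurious closure of the feedwater regulating valve", [
        HazardousControlAction(
            control_action="Close FRV",
            context="Provided while plant conditions are normal",
            flaws=[ControlFlaw("Incomplete process model of S/G level in the software")],
        )
    ])
])
print(len(loss.hazards), "hazard(s) traced to", loss.description)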

[Figure 3-4 overlays the STPA method on the hierarchy of Figure 3-1: losses are
identified at the plant function, plant system or plant component level and are
then broken down into hazards, hazardous control actions (HCAs) and control
flaws; software appears at the bottom of the hierarchy because no faults or
failures need be presumed.]

Figure 3-4
STPA at Various Levels of Interest

Figures 3-5 through 3-8 present qualitative comparisons of the various hazard
analysis methods, in various contexts, to give a sense of their ranges of
applicability, effectiveness and ease of use. Coverage in these figures refers to the
ability of the method to identify hazards. No method is completely effective; the
ability of each method to identify a wide range of hazards depends on the context
of the analysis, for example the depth of analysis (a single loop controller at the
digital component level vs. a complex, highly integrated control system at the
digital system level), or the nature of the behaviors involved (anticipated failure
modes vs. unanticipated behaviors).

Relative Coverage of Methods in the Context of Depth of Analysis

Figure 3-5 shows that the Design FMEA (DFMEA) method is most effective at
identifying failure modes and effects at the device, component and sub-system
levels of a system, because it can readily postulate credible failure modes and
determine the resulting effects based on known and understood failure
mechanisms, from the bottom-up. Appendix B of this guideline describes typical
digital I&C device and component failure modes, as well as typical software
interactions and faults, and related defensive measures that can be applied.

The Functional FMEA (FFMEA), HAZOP, STPA and PGA methods are
effective across the subsystem, system, and plant levels of abstraction, as well as
interactions between the plant and its environs, because these methods postulate
system behaviors using “guide words” (i.e., postulated conditions) in one form or
another, then determine if these behaviors are hazardous at a functional level.
These methods are not constrained by hardware or software functional
allocations.
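Because these methods postulate deviations rather than specific equipment
failures, the first pass of an analysis can be generated mechanically by crossing
the functions at the chosen level of interest with a guide word set. The sketch
below does exactly that; the two functions are hypothetical, the functional guide
words are those listed in Table 3-1, and the HAZOP guide words shown are the
examples cited in that table (the full set used in Section 6 may differ).

from itertools import product

# Guide word sets: functional failure categories from Table 3-1 and the example
# HAZOP deviation words cited there (Section 6 may use a larger set).
ffmea_guide_words = ["No Function", "Partial Function", "Over Function",
                     "Degraded Function", "Intermittent Function", "Unintended Function"]
hazop_guide_words = ["More", "Less", "Early", "Late"]

# Hypothetical functions at a plant system level of interest.
functions = ["Maintain S/G level at the programmed setpoint",
             "Transfer feedwater control to manual on operator demand"]

# First-pass worksheet: every (function, deviation) pair becomes a row for the
# analysis team to screen as hazardous, not hazardous, or not credible.
worksheet = [{"function": f, "deviation": d, "disposition": "to be screened"}
             for f, d in product(functions, ffmea_guide_words + hazop_guide_words)]

print(len(worksheet), "candidate deviations to screen")   # 20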

The Top Down (FTA) method is effective at the system and plant levels of
abstraction because it is a top-down method that focuses on preserving critical
functions, as opposed to analyzing component failures.

[Figure 3-5 plots the relative coverage of the methods against the depth of
analysis (digital devices, digital components, digital systems, plant components,
plant systems): the DFMEA curve is strongest toward the digital device and
digital component end, the FFMEA, HAZOP, STPA and PGA curve spans the
middle and upper levels, and the FTA curve is strongest toward the plant
component and plant system end, consistent with the discussion above.]

Figure 3-5
Relative Coverage of Methods in the Context of Depth of Analysis
Relative Usefulness of Methods in the Context of System Lifecycle Phases

Figure 3-6 shows that the Design FMEA method is useful at four distinct phases
of a project or system lifecycle:
1. It can be used in the concept phase to identify single points of failure, usually
between the proposed solution and interfacing equipment;
2. it can be used to assess the detailed design for unacceptable failure modes and
effects, and thus inform any design changes that may be necessary;
3. it can be used in the test phase to validate system responses that are expected
due to component failures; and
4. it can be used in the Operations and Maintenance (O&M) phase of the
system lifecycle to aid in the development of periodic test or preventive
maintenance procedures, system monitoring plans, and troubleshooting and
cause analysis activities.

The Functional FMEA, HAZOP, STPA and PGA methods can be particularly
useful in assessing conceptual designs, assisting in the development of functional
and performance requirements, and assessing the detailed design to assure that
desired behaviors are well understood and implemented, and that undesired
behaviors are well understood and eliminated, prevented, or effectively mitigated
either in the design or through administrative controls before entering the O&M
phase.

The STPA and PGA methods can be particularly helpful in determining


functional and performance requirements in terms of wanted and unwanted
system behaviors. If they are applied effectively, these methods will substantially
improve “coverage” of the system requirements, including the “must not do”
requirements, so that they can be explicitly stated, which in turn can inform the
development of test cases to validate the designed system against those
requirements.

The Top Down method is useful in the conceptual design and detailed design
phases, for assessing the design against success or failure criteria in the context of
critical safety or generation functions, and in the O&M phase, in the context of
the plant PRA for assessing operational and maintenance risks, maintenance rule
activities, and the significance determination process.

 3-11 
FFMEA, HAZOP, STPA, PGA FTA

Usefulness

DFMEA

Concept Requirements Design Implement Test O&M

Figure 3-6
Relative Usefulness of Methods in the Context of System Lifecycle Phases

Relative Coverage of Methods in the Context of System Behaviors

Figure 3-7 shows the relative effectiveness of hazard analysis methods in terms of
their ability to reveal expected vs. relatively unexpected behaviors. The Design
FMEA, Functional FMEA and Top Down (FTA) methods typically identify
system or component behaviors as a result of postulated failure modes and failure
mechanisms (expected behaviors) within the constraints of the analysis boundary,
and the results are usually well understood. However, operating experience has
shown that these methods do not consistently reveal strange, unexpected
behaviors that can arise from infrequent or unusual operating conditions,
unanticipated equipment modes (e.g., automatic, manual, standby, halted, reset,
latched, etc.), adverse plant or system conditions that don’t involve failures, or
interactions between systems and components that don’t ordinarily appear to be
functionally coupled.

On the other hand, the Functional FMEA, HAZOP, STPA and PGA methods
force consideration of functional misbehaviors without necessarily constraining
the analysis to specific pieces of equipment and their failure modes, or hardware
or software functions allocated to that equipment. While the Functional FMEA
method includes the notion of postulated functional failures, it does so at the
plant process level using a series of guide words in a manner similar to HAZOP,
where digital system faults and failures are not always necessary to create a hazard
at the plant system level. Thus, the Functional FMEA (FFMEA) method is
shown in Figure 3-7 as something in between the other sets of methods. These
methods provide a complement to the Design FMEA and Top Down methods
because they can reveal otherwise strange or unexpected behaviors in a system
design, and they can more fully inform the development of system requirements
so that strange and unexpected behaviors are much less likely to make their way

 3-12 
into the detailed design and ultimately operations and maintenance of the digital
system.

[Figure 3-7 plots the relative coverage of the methods along a spectrum from
anticipated failure modes to unexpected behaviors: the DFMEA and FTA curves
sit toward the anticipated failure mode end, the HAZOP, STPA and PGA curve
sits toward the unexpected behavior end, and the FFMEA curve falls in between.]

Figure 3-7
Relative Coverage of Methods in the Context of System Behaviors

Relative Familiarity of Methods in the Context of Various Users

Figure 3-8 illustrates the relative familiarity of each method in the context of
various users that are likely to pick up and apply this guideline. This guideline is
written for technically competent engineers who work with digital I&C
equipment and systems, but users should acknowledge that roles and
responsibilities can vary considerably.

Most often, I&C engineers who are members of the owner/operator


organization, or their contractors by proxy (e.g., architect/engineer firms), will
have relatively strong knowledge and capabilities in terms of plant system
functional and performance characteristics, design criteria, and regulatory
requirements. The Functional FMEA and HAZOP methods line up relatively
well against this level of knowledge and capabilities.

Paradoxically, the Design FMEA is perhaps the method that is most familiar to
I&C engineers, but when it comes to digital I&C systems and components, the
responsibility for performing a Design FMEA on the digital I&C system or
equipment is almost always assigned to the equipment vendor (or the system
integrator, who is then responsible for interfacing with equipment vendors).
Experience has shown that equipment vendors don’t always provide a thorough
or high quality Design FMEA, and when they do provide one, it often doesn’t
extend beyond the “customer connections,” leaving the responsibility for assessing
failure modes of the full system to the I&C engineer. The Functional FMEA method offers a
strong complement to the Design FMEA by identifying the critical system-level
failure modes before asking an equipment vendor for a Design FMEA, thus

 3-13 
bringing the most attention to the equipment failure modes that intersect with
the critical system-level failure modes.

The Functional FMEA and HAZOP methods are proven and widely used in
other industries (e.g., automotive, petrochemical), but they are relatively
unknown in the I&C engineering community in the nuclear power industry.
Therefore, a facilitator may be necessary for assisting those users who are likely to
apply these methods on an infrequent basis. The use of fault trees in the Top
Down method is proven and widely used in the nuclear power industry, but not
necessarily by I&C engineers, digital equipment vendors, and their proxies, for
whom this guidance is written. Therefore, a facilitator may be necessary for
applying the Top Down method as well.

Finally, the STPA and PGA methods are recent advancements in hazard analysis
methods, emerging from academia and finding their way into various industries.
Textbooks and academic papers describe these methods, and EPRI has
performed some research into their effectiveness. Because they have been found to
be promising in their ability to identify unexpected behaviors, some of which have
been observed in nuclear operating experience, detailed procedures and worked
examples have been provided in Sections 7 and 8 of this guideline. But these
methods and procedures will likely require a facilitator to enable their application
on a digital I&C project, and may need help from experts who either developed
the method or are day-to-day practitioners.

[Figure 3-8 plots the relative familiarity of the methods against the users likely to
apply them (equipment vendor, I&C engineer, facilitator, expert): the DFMEA is
most familiar to equipment vendors and I&C engineers, FTA and the FFMEA
and HAZOP pair typically call for a facilitator, and STPA and PGA may require
a facilitator or outside expert.]

Figure 3-8
Relative Familiarity of Methods in the Context of Various Users
3.4 Consider a Blended Approach

Each of the methods in this guideline taken to its extreme could be effective in
identifying most of the hazards associated with a digital system. But, taken to
extremes, any single method is likely to:
1. be costly
2. not be performed in a timely manner
3. provide results that are too extensive to be readily understood by those who
must utilize them
4. lose focus on the corrective actions that are worth pursuing

This guideline was not developed for the sole purpose of selecting any one
method to perform a hazard analysis on a given digital system, but neither does it
preclude the use of one preferred method. It would be an unusual digital system
for which a single method could be expected to be ‘best’. Therefore, the
discussion of each of the hazard analysis methods in this guideline emphasizes
that the described steps are not the only way to implement the method; variations
are likely, and the steps can be blended with or replaced by steps described for
other methods in this guideline.

Therefore, given the availability of multiple methods, it should not be necessary


to rely exclusively on any single method. A Functional FMEA performed on one
project may be useful in focusing a HAZOP analysis on a subsequent digital
upgrade for plant systems that support similar plant functions. Implementation of
more advanced methods such as STPA or PGA may find the fault tree logic
from the plant-specific PRA useful in developing a technical basis for dismissing
potential hazards and losses that are not relevant to the plant design. The
completeness of fault trees from the PRA can benefit from insights coming out
of the more advanced methods, particularly when hazards or accidents (losses)
involve system or component behaviors that require no failures.

The “best” approach to performing hazard analysis of digital systems is likely to


be a blend of approaches.

Several blended approaches are described in this Section, but they are not the
only blended approaches that may be devised by analysts.

As long as there is a nexus between the following items assessed by any two
methods, a blended approach may be useful:
 Systems or components to be analyzed
 The way hazards are characterized (see Table 3-1)

The following examples show how a few selected methods can be blended to
achieve efficiencies in analysis and design.

 3-15 
Example 3-1. Blending FTA (or Functional FMEA) Results
with a Design FMEA
See Figure 3-9, and consider a digital feedwater control system upgrade in a PWR.
The hazard analysis approach is to first obtain the existing fault trees for the facility,
and identify the faults (or failure effects) that have an adverse effect on the
feedwater system and ultimately the plant. This approach takes advantage of readily
available information that is maintained for use in the facility PRA. Using the existing
fault trees, the analyst can identify the following thread (among others):
 Plant System: Feedwater
 FTA Failure Effect: Loss of Feedwater
 Plant Component: Feedwater Regulating Valve (FRV)
 FTA Failure Mode / Design FMEA Failure Effect: FRV Closure
 Digital System: Feedwater Controls
 Digital System Failure Mode: Output to FRV Fails Low
 Digital Component Failure Mechanism: Halted Controller
Although this example may seem trivial, a blended approach for a large complex
system can identify the critical failure modes for a number of threads, help focus
design efforts on the most limiting cases, and avoid wasted effort on unnecessary
design activities or corrective actions for non-critical digital system failure modes.
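For bookkeeping across many such threads, the items in the Example 3-1 thread
can be captured in a simple record, as in the Python sketch below; the class and
field names are illustrative assumptions, not a prescribed worksheet format.

from dataclasses import dataclass

@dataclass
class BlendedThread:
    # One Example 3-1 thread, from plant system down to digital component.
    plant_system: str
    fta_failure_effect: str
    plant_component: str
    fta_failure_mode: str                     # also the Design FMEA failure effect
    digital_system: str
    digital_system_failure_mode: str
    digital_component_failure_mechanism: str

thread = BlendedThread(
    plant_system="Feedwater",
    fta_failure_effect="Loss of Feedwater",
    plant_component="Feedwater Regulating Valve (FRV)",
    fta_failure_mode="FRV Closure",
    digital_system="Feedwater Controls",
    digital_system_failure_mode="Output to FRV Fails Low",
    digital_component_failure_mechanism="Halted Controller",
)
print(thread.fta_failure_effect, "<-", thread.digital_component_failure_mechanism)

Collecting threads in this form makes it straightforward to sort or filter them by
plant system, failure effect or digital component when deciding where corrective
actions are warranted.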

[Figure 3-9 shows the Functional FMEA and FTA methods applied from the top
of the Figure 3-1 hierarchy and the Design FMEA applied from the bottom. As
the figure callout notes, the FFMEA or FTA methods can intersect with the
DFMEA at the Plant Component level, effectively using FFMEA or FTA results
to narrow the DFMEA corrective actions to the digital system failure modes that
result in undesired behaviors of the actuated components.]

Figure 3-9
Blending Functional FMEA (FFMEA) or FTA Results with a Design FMEA (DFMEA)
Extending the results described in Example 3-1, the Top Down (FTA) method
has the potential benefit of reducing the effort needed to analyze digital system
hazards when combined with other methods described in this guideline, as
shown in Table 3-2:

Table 3-2
Blending the Top Down (FTA) Method with Other Hazard Analysis Methods

Functional FMEA (FFMEA): The Top Down (FTA) method can confirm or
replace the FFMEA.

Design FMEA (DFMEA) and HAZOP: The Top Down (FTA) method can:
 Limit the extent of the effort by limiting the scope of failure modes that
matter in the accomplishing of safety and generation functions
 Provide an engineering rationale for why specific failure modes are not
important in regard to hazard analysis
 Reduce the need for detection measures to identify and cope with various
failure modes
 Assist in the integration of multiple evaluations

STPA and PGA: The Top Down method can provide an engineering rationale as
to why some digital system goals and behaviors are not significant, reducing the
need to investigate all behaviors.

 3-17 
Example 3-2. Blending a Digital Platform Design FMEA with a Plant
System Design FMEA
See Figure 3-10. This example briefly describes how a Design FMEA (DFMEA)
provided by a digital I&C platform (or system) vendor can be blended with a plant
system Design FMEA. This is not an unusual occurrence, because historically the
DFMEA method has been applied on many digital I&C platforms and digital I&C
upgrade projects, and an integrated view is necessary before the analyst can
conclude that the resulting digital upgrade design does not produce any
unacceptable failure modes and effects.
However, neither the vendor nor the owner/operator typically has the qualifications,
knowledge or experience to prepare one integrated DFMEA, leaving the
owner/operator (or architect/engineer by proxy) with the problem of accepting the
platform DFMEA from the vendor and extending the results to the plant system level,
typically by preparing another DFMEA at that level (i.e., the level of interest as
described in Section 3.2).
Using the same (Example 3-1) feedwater control system upgrade project at a PWR,
the owner/operator can prepare the following thread (among others), from the
bottom up:
 Digital Device Failure Mechanism: CPU Stops Running (from platform DFMEA)
 Digital Component Failure Mode: Controller Halts (from platform DFMEA)
 Digital System Effect: Outputs Rail Low (from platform DFMEA)
 Digital System Failure Mechanism: Loss of Signal (identified in plant system
DFMEA)
 Plant Component Failure Mode: Closed FRV (identified in plant system DFMEA)
 Plant System Failure Effect: Loss of Feedwater (identified in plant system DFMEA)
Blending two Design FMEAs that are prepared at two different levels of interest
provides the integrated view that is necessary for identifying the critical failure
modes for a number of threads in a large complex system. Again, this approach
helps focus design efforts on the most limiting cases, and avoid wasted effort on
unnecessary design activities or corrective actions for non-critical failure modes and
effects.
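The hand-off between the two analyses amounts to matching the failure effects
reported at the top of the platform DFMEA against the failure mechanisms
postulated at the bottom of the plant system DFMEA. The sketch below
illustrates one way to make that matching explicit and to flag platform effects
that have no counterpart; the effect and mechanism wording is hypothetical and
would in practice be agreed between the vendor and the owner/operator.

# Sketch of the hand-off between a platform DFMEA and a plant system DFMEA.
# The effect and mechanism wording below is hypothetical.
platform_dfmea_effects = ["Outputs Rail Low", "Outputs Freeze at Last Value",
                          "Loss of HMI Display"]

effect_to_plant_mechanism = {
    "Outputs Rail Low":             "Loss of Signal to FRV",
    "Outputs Freeze at Last Value": "Stale Demand Signal to FRV",
    # "Loss of HMI Display" has no mapped plant system mechanism -> flag it
}

for effect in platform_dfmea_effects:
    mechanism = effect_to_plant_mechanism.get(effect)
    if mechanism is None:
        print(f"UNMAPPED platform effect: {effect} (review for plant system impact)")
    else:
        print(f"{effect} -> plant system DFMEA failure mechanism: {mechanism}")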

 3-18 
PLANT FUNCTIONS

Plant Plant Plant Failure


System 1 System 2 System n Effects

Plant Plant Plant


Failure
Component 1 Component 2 Component n
Modes

Digital Digital Digital Failure Failure


System 1 System 2 System n Effects Mechanisms

Digital Digital Digital Failure Plant System Use the digital platform DFMEA
Component 1 Component 2 Component n Modes Design FMEA output (failure effects) as an
input (failure mechanisms) to
the plant system DFMEA

Failure
Device 1 Device 2 Device n Mechanisms

Digital
Plant Functions, Digital Systems,
Platform
Systems & Components Components & Devices Design FMEA

Figure 3-10
Blending a Digital Platform FMEA with a Digital System FMEA

Example 3-3. Blending Functional FMEA (or FTA) Results with STPA
This example presents a blended approach that is somewhat different from those
described in Examples 3-1 and 3-2. Note that before two methods can be blended,
there should be a nexus at an appropriate system or component level of interest (as
described in Section 3.2), and the way in which hazards are characterized should
be similar.
Figure 3-11 illustrates this concept, but this time showing a nexus between Functional
FMEA failure modes or FTA failure effects and the losses to be considered by the
STPA method. Continuing with the same proposed digital feedwater control system
upgrade at a PWR, the analyst can prepare the following thread (among others),
from the top down using the FTA or Functional FMEA methods:
 Plant System: Feedwater
 Functional FMEA Failure Mode or FTA Failure Effect: Loss of Feedwater
 Actuated Plant Component: Feedwater Regulating Valve (FRV)
 Functional FMEA Failure Mechanism or FTA Failure Mode: Spurious Closure of FRV
 STPA Loss: Loss of Feedwater
 STPA Hazard: Spurious Closure of FRV
 STPA Hazardous Control Action (HCA): Digital feedwater system provides close
command to FRV when conditions are normal
 STPA Control Flaw: Incomplete process model (e.g., in the software)
This example shows how the results from a top down method can be used to inform the
STPA method, and get down to the level where hazardous conditions may be present
even if there are no system or component failures. Once again, this approach helps
focus design efforts on the most limiting cases, and avoid wasted effort on unnecessary
design activities or corrective actions for non-critical failure modes and effects.
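In this blend, the top-down results effectively seed the first step of STPA. A
minimal sketch of that hand-off follows; the records are hypothetical, follow the
Example 3-3 thread, and the criticality flag is an assumption of this illustration.

# Sketch: seed the STPA loss list from top-down (FTA or FFMEA) results.
fta_results = [
    {"plant_system": "Feedwater", "failure_effect": "Loss of Feedwater",
     "failure_mode": "Spurious Closure of FRV", "critical": True},
    {"plant_system": "Feedwater", "failure_effect": "Degraded Level Indication",
     "failure_mode": "Single Transmitter Drift", "critical": False},
]

# Only the critical top-down results become STPA losses; the associated failure
# modes become candidate hazards for the STPA team to refine.
stpa_losses = [{"loss": r["failure_effect"], "candidate_hazard": r["failure_mode"]}
               for r in fta_results if r["critical"]]

for entry in stpa_losses:
    print("Loss:", entry["loss"], "| Candidate hazard:", entry["candidate_hazard"])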

 3-19 
FFMEA
Failure
PLANT FUNCTIONS Effects STPA
FTA

Failure
Plant Plant Plant Failure Losses
Modes Effects
System 1 System 2 System n

Hazards
Plant Plant Plant Failure Failure
Component 1 Component 2 Component n Mechanisms Modes
Hazardous
Control
Actions (HCA)

Digital Digital Digital


System 1 System 2 System n FTA or FFMEA Control
results can identify Flaws
the losses to be
considered in the
STPA method
Digital Digital Digital
Component 1 Component 2 Component n

Device 1 Device 2 Device n Software

Plant Functions, Digital Systems,


Systems & Components Components & Devices

Figure 3-11
Blending Functional FMEA or FTA Results with STPA

3.5 Determine Resources & Schedule

Technical Resources

Clear identification of roles and responsibilities of the resources assigned to


support the analysis is vital. Also, the special needs for expertise, technical
information, system access during design and testing, and system
modeling/simulations should be recognized. Hazard analysis procedures should
specify roles and responsibilities, including expectations for specific areas of
expertise for each role. Potential roles that should be covered include the lead
hazard analyst, facilitators or experts on the selected method, utility design
engineers, vendor design engineers, system integrators, utility system engineers,
and utility operations and maintenance personnel.

Technical Information

In order to support the development of the hazard analysis, information about


the system will be needed to ensure that the analysis correctly assesses the design.
The information needed during the lifecycle of the project will vary with each
project phase. Procedures should specify the details of the information needed to
support the analysis. Examples of types of information to outline in the plan
include vendor design drawings, functional requirements specification, system
modeling/simulation results, vendor application development guidance, test
procedures, and operation/maintenance procedures.

 3-20 
Equipment Access

During hazard analysis activities, the system equipment may need to be reviewed
or accessed. If access to the system is needed during the design and test phases, it
should be identified in the project plan and schedule. Examples of the types of
access requirements include equipment walkdowns, equipment inspections,
and equipment testing (factory and site acceptance). The use of access time to
verify and/or validate the hazard analysis information should be specified in the
project plan. The results from the hazard analysis can be included in the test
phases to ensure that the expected response is actually demonstrated by the
system. In addition, walkdowns and inspections can ensure that the system
connections and design meet the expectations of the design documentation that
was used during the hazard analysis.

As part of the identification of the resources needed for the hazard analysis, a
schedule should be developed that outlines the milestones for the analysis. The
milestones will ensure that the analysis development is matched with the various
project lifecycle phases. This will allow for any results from the analysis to be
factored into the system development to mitigate any problems identified by the
analysis. The standard project lifecycle phases that need to be aligned with the
analysis milestones are the project definition phase, conceptual design phase,
final design review, design testing, and implementation.

Table 3-3 provides the lifecycle or project phase and the corresponding analysis
milestones that would be aligned:

Table 3-3
Project Phases vs. Analysis Milestones

(Project or Lifecycle Phase: Analysis Milestone)
Project Definition: Hazard Analysis Method(s) Selection; Hazard Analysis Plan
or Procedure
Conceptual Design: Function Analysis
Regulatory Submittal (if required): Preliminary Hazard Analysis
Partial Design Complete: Hazard Analysis development
Final Design; Regulatory Audit or Inspection: Hazard Analysis approved
Design Testing: Hazard Analysis validated (revised if necessary)
Implementation: Hazard Analysis verification (revised if necessary)
Operations and Maintenance: Hazard Analysis maintenance

At project initiation there should be a clear definition of the project that details
the intended scope and schedule. The hazard analysis plan can then be

 3-21 
developed, consistent with the intended project scope and synchronized with the
specific project milestones.

After the project definition, conceptual design of the system begins.
As the conceptual design is developed, a preliminary hazard analysis needs to be
developed to identify potential vulnerabilities in the conceptual design so the
flaws can be eliminated or mitigated prior to getting into the detailed design
activities. The preliminary hazard analysis will serve as the foundation for the
hazard analysis, which will be a living document during the design phase of the
project.

As the design effort progresses, the hazard analysis should be updated at each
lifecycle phase to ensure that problems are identified as early as possible to
minimize the impact of changes needed to address the vulnerabilities. Such
updates can be viewed as iterative. The periodic update points should be
identified in the project schedule and potentially would be aligned with a
30/60/90 percent design review milestone.

When the final design is approved, the hazard analysis should be approved as
well. The final hazard analysis will be based on the approved design for the
project and will serve to demonstrate that the objectives of the analysis have been
satisfied. As system testing and implementation occurs, the hazard analysis may
need to be revised to address changes that are made to the design to resolve any
identified problems.

3.6 Function Analysis

This Section is adapted from the Industrial Design Engineering Wiki, available
at http://www.wikid.eu/index.php/Function_analysis. In general, a Function
Analysis provides useful input to a Preliminary Hazard Analysis (PHA), as
described in Section 3.7, because it provides a clear representation of the
functions to be assessed at the level of interest.

Function Analysis is a method for analyzing and developing a Function
Structure. A Function Structure is an abstract model of the behaviors of a system,
subsystem or component, without specifying or describing features such as shape,
dimensions and materials, or allocation of functions to hardware or software
units. It describes the functions of the system, subsystem or component, and its
parts, and indicates the mutual relations. The underlying idea is that a Function
Structure may be built from the top down, starting with the highest level of
functional abstraction, until functions are decomposed to elementary (or general)
functions.

The principle of Function Analysis is first to list, describe or specify wanted and
unwanted system, subsystem or component behaviors, and then to infer from
there what the parts, including hardware and software units (which are yet to be
selected and developed into an integrated system) should do. Function Analysis
forces designers to distance themselves from known products and components in
considering the question: what is the new system, subsystem or component
intended to do and how could it do that?

A Function Analysis is typically carried out at the beginning of a digital I&C
project, and should be a prerequisite for hazard analysis methods and activities
described in this guideline. Descriptions of three possible starting points for a
Function Analysis follow. Note that they may also be used in various
combinations:
1. A list or table of basic functions (e.g., Figure 3-1). Basic functions are the
plant functions, plant system functions or plant component functions that are
expected to be affected by or interact with the digital system. Basic functions
should be identified at the highest practical level, based on available
information, without necessarily attributing functions to specific pieces of
equipment.
For example, in a Function Analysis for a proposed digital feedwater control
system upgrade in a PWR, a basic function could be “perform three element
steam generator level control between 15% and 100% rated thermal power.”
On the other hand, a Function Analysis for a digital single loop controller
upgrade on a service water temperature control valve might identify a basic
function as “provide closed loop PID control for service water temperature
using temperature control valve 1A.” The function is necessarily constrained
at the plant component level (existing control valve).
2. A Function/Process Map. This can be drafted from scratch, based on or
extended from an existing plant-specific Function/Process Map, or developed
by extracting information available in fault trees developed for the plant (for
use in the PRA). For more on Function/Process Maps, see Section 4.2.
3. A collection of elementary (general) functions, for instance those described in
the Instrument Engineer’s Handbook (Reference 30). While it may appear that
elementary functions in Reference 30 are specific to the digital system, upon
closer examination they characterize elementary functions that couple a
digital system with plant processes and functions, without identifying specific
pieces of equipment, such as “closed loop feedback control on tank level
using a sensor function, a PID function, and an actuation function.” It is
helpful to think of the resulting set of elementary functions as those that
would become part of a function block diagram.

The outcome of the Function Analysis is a thorough understanding of the basic
functions and sub-functions (i.e., basic functions decomposed to more specific
functions) expected of the new or modified system, subsystem or component.
From functions and sub-functions the parts and components for the new or
modified system can be specified, developed, and assessed for potential hazards.

Note that a Function Analysis is an abstraction. It is not intended to describe
discrete functions at the component or part level, or allocate functions across
specific hardware or software elements. It is a basic process that precedes more
detailed design activities.

Function Analysis (FA) Procedure

The following steps describe how to perform a Function Analysis. In lieu of a
specific example in this Section, the worked examples for the hazard analysis
methods described in their respective Sections include the results of a Function
Analysis.

FA Step 1: Gather and assess source information, such as the Final Safety
Analysis Report (FSAR), design and/or system descriptions, PRA success
criteria, system drawings, and any other information that describes the functional
requirements or characteristics of the system or components of interest.

FA Step 2: Describe the main function of the system or process in the form of a
black box. If one main function cannot be described, go to the next step.

FA Step 3: Make a list of sub-functions. The lower levels of a Function/Process
Map offer a good starting point. For a complex system, a Function Structure may
be appropriate. There are several typical approaches to function structuring,
which may be applied singly or in combination:
1. Putting functions in a sequential order. To visualize functions in a sequential
order, one can simply list the functions.
2. Connecting inputs and outputs of flows between functions (matter, energy
and information flows). To visualize function flows, one can connect boxes
by arrows.
3. Hierarchy (main functions, sub-functions, sub-sub-functions, etc.). To
visualize hierarchy, draw a tree structure (for example, the Function/Process
Map described in Section 4.2).
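
For teams that capture the Function Structure electronically rather than on paper,
the hierarchy and flow relationships described above can be recorded with very
simple data structures. The following Python sketch is illustrative only; the
FunctionNode class and the example function names are assumptions for the
illustration, not part of the method.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FunctionNode:
    name: str
    children: List["FunctionNode"] = field(default_factory=list)

    def add(self, name: str) -> "FunctionNode":
        child = FunctionNode(name)
        self.children.append(child)
        return child


# Hypothetical fragment of a Function Structure for high pressure injection
root = FunctionNode("Reactor Coolant Inventory Control")
hpi = root.add("High Pressure Injection")
hpi.add("Supply steam to turbine")
hpi.add("Turbine/pump provides coolant flow")

# Flows (matter, energy, information) connect outputs of one function to inputs of another
flows: List[Tuple[str, str, str]] = [
    ("Supply steam to turbine", "Turbine/pump provides coolant flow", "energy (steam)"),
]


def print_structure(node: FunctionNode, depth: int = 0) -> None:
    # Print the hierarchy top-down, one function or sub-function per line
    print("  " * depth + node.name)
    for child in node.children:
        print_structure(child, depth + 1)


print_structure(root)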

FA Step 4: Elaborate the Function Structure. Fit in additional functions (or sub-
functions) which were left out in Steps 2 and 3, and explore variations so as to find
the best Function Structure. Variation possibilities include moving the system
boundary, changing the sequence of sub-functions, and splitting or combining
functions or sub-functions. Exploring various possibilities is the essence of
Function Analysis: it allows for an exploration and generation of possible
solutions to the design problem.

Additional Guidance
- Development of Function Structure variants is recommended. A statement
of a problem does not necessarily lead to one particular Function Structure.
The strength of Function Analysis lies in the possibility of creating and
comparing, at an abstract level, alternatives for functions and their structuring.
- Certain sub-functions appear in almost all design problems. Knowledge of
elementary or general functions helps in seeking solution-specific functions.
- The development of a Function Structure is an iterative process, which can
start from analyzing an existing design or with a first outline of an idea for a
new solution.
- Function Structures should be kept as simple as possible. The integration of
various functions into one functional block (i.e., a function carrier, such as a
steam generator level control system) is often a useful means in this respect.
- Block diagrams of functions should remain conveniently arranged; use simple
and informative symbols. For more on functional symbols and other
representations, see Appendix C.
- In industrial design engineering and system design, it is not always possible
to apply structuring principles. In the context of digital I&C systems in
nuclear power plants, functions and processes are better described in terms of
safety, generation, and equipment reliability objectives. A high level, generic
Function/Process Map for a typical Boiling Water Reactor is provided in
Figure 4-1.

FA Step 5: Document the results. The results of the Function Analysis can be
documented in a stand-alone engineering document (e.g., calculation or analysis
package), or they can be documented in the front end of a specific hazard analysis
document that results from using one or more of the hazard analysis methods
described in this guideline.

3.7 Preliminary Hazard Analysis (PHA)

In the preliminary or conceptual design phases of a project, preliminary hazards
that could be potentially created by or related to a proposed solution or
modification should be identified.

Per IEEE Std. 1228-1994 (Reference 40), a Preliminary Hazard Analysis (PHA)
(and any additional hazard analyses performed on the entire system or any
portion of the system) identifies:
1. Hazardous system states, typically at the digital system level. However, if the
Function Analysis results are described at the plant component or plant
system level, then hazardous system states would be identified at that level.
In either case, the hazardous system states become constraints (i.e., “must not
do” requirements) that get transferred into the set of digital system
requirements.
2. Sequences of actions that can cause the system to enter a hazardous state
3. Sequences of actions intended to return the system from a hazardous state to
a nonhazardous state
4. Actions intended to mitigate the consequences of accidents or losses

There are two basic approaches for performing a Preliminary Hazard Analysis
(PHA):

Table Top Method

The Table Top method involves one or more organized meetings, where the
identified individuals come together and review, discuss and identify potential
hazards that may be introduced or affected by the digital I&C project. The number
of identifiable hazards will typically range from 3 to 5, and in some cases may
range up to 6 to 8.

The Table Top method for performing a PHA relies on the judgment and
experience of individuals knowledgeable in the design, operations, maintenance,
and licensing basis of the potentially affected systems, sub-systems or
components. Such individuals and any additional resources that may be needed
should be identified as described in Section 3.5.

The Function Analysis results, as described in Section 3.6, should be used as an
input to the PHA. To assist in hazard identification, the table top discussions
should consider potential failures of identified functions that could lead to an
accident or loss, or accidents or losses that could potentially affect identified
functions in an adverse manner.

The result is a list of hazards for further consideration as one or more of the
Hazard Analysis methods described in this guideline is/are selected and applied.
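
Where Table Top results are tracked electronically, each identified hazard can be
captured as a simple record so that it carries cleanly into whichever hazard
analysis method is later selected. The following Python sketch is illustrative only;
the field names and example entries are assumptions, not a required format.

from dataclasses import dataclass
from typing import List


@dataclass
class PreliminaryHazard:
    hazard_id: str
    description: str                  # hazardous state or condition
    affected_functions: List[str]     # taken from the Function Analysis results
    potential_loss: str               # accident or loss the hazard could lead to
    notes: str = ""


pha_results: List[PreliminaryHazard] = [
    PreliminaryHazard(
        hazard_id="PHA-001",
        description="Spurious high pressure injection when not demanded",
        affected_functions=["High Pressure Injection", "Power Generation"],
        potential_loss="Reactor vessel overfill / high flux trip",
    ),
]

# The resulting list is carried forward into the selected hazard analysis method(s)
for hazard in pha_results:
    print(hazard.hazard_id, "-", hazard.description)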

Hazard Analysis Method

An alternate approach to performing a PHA is to select and apply one or more of
the Hazard Analysis methods described in this guideline, and apply it (or them)
during the conceptual design phase of a digital I&C project. The goal of this
approach is to produce a list of hazards for further consideration in later phases of
the project.

Note that the results of the Function Analysis are still a prerequisite for
performing a PHA when users of this guideline jump to the application of one or
more specific Hazard Analysis methods in the conceptual design phase of a
project.

Also note that top-down Hazard Analysis methods such as Functional FMEA,
Top Down, STPA and PGA require identification of functions, in one form or
another, as an early step in the process. The Function Analysis results should be
directly applicable or adaptable (with little additional effort) in these cases.

Document the Results

The results of the PHA can be documented in a stand-alone engineering
document (e.g., calculation or analysis package), or they can be documented in
the front end of a specific hazard analysis document that results from using one
or more of the hazard analysis methods described in this guideline.
3.8 Hazard Analysis Acceptance, Documentation &
Maintenance

Acceptance

The hazard analysis plan (if one is used), the project plan (i.e., project risk
analysis), or hazard analysis procedures should specify the criteria that will be
used to determine the acceptability of the analysis. The acceptance criteria will be
developed from the objectives that are identified as described in Section 3.1. For
example, if the objectives included that the analysis would identify single failure
vulnerabilities in the design, then the acceptance criteria could include the
determination that no single failure vulnerabilities exist or that any identified
vulnerabilities have been corrected.

The project plan or hazard analysis procedures should identify how to address
problems that are unresolved or unmitigated by the design. The level of
justification for the unresolved or unmitigated problems should be specified.

For areas that are not analyzed or cannot be analyzed, the acceptance criteria and
project plan should describe how unanalyzed design areas are to be dispositioned,
up to and including rejection of the system design. Unanalyzed areas of the
design may be acceptable for simple designs in low risk systems or components.
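
As a purely illustrative aid, the screening described above can be expressed as a
simple check over the analysis results: every identified problem must either be
resolved by the design or carry a documented justification, and any unanalyzed
areas must have an explicit disposition. The Python sketch below uses assumed
field names and is not part of this guideline.

from dataclasses import dataclass
from typing import List


@dataclass
class AnalysisItem:
    item_id: str
    resolved_by_design: bool
    justification: str = ""      # required if the item is not resolved by the design
    analyzed: bool = True
    disposition: str = ""        # required if the area was not analyzed


def acceptance_findings(items: List[AnalysisItem]) -> List[str]:
    # Return findings that would block acceptance of the hazard analysis
    findings = []
    for item in items:
        if not item.resolved_by_design and not item.justification:
            findings.append(item.item_id + ": unresolved problem with no justification")
        if not item.analyzed and not item.disposition:
            findings.append(item.item_id + ": unanalyzed area with no disposition")
    return findings


items = [
    AnalysisItem("HA-12", resolved_by_design=False,
                 justification="Low risk; bounded by existing analysis"),
    AnalysisItem("HA-15", resolved_by_design=False),
]
print(acceptance_findings(items))   # flags HA-15 only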

Documentation & Maintenance

Depending on the complexity of the design change, the hazard analysis
documentation may need to be in standalone, controlled documents or contained
in the design change package paperwork. A more complex design would be
expected to have standalone analysis documentation. A design change that is
straightforward could have the hazard analysis included in the design change
package paperwork.

A suggested structure of a standalone hazard analysis document or package is as
follows, but users may find alternate structures that are more suitable within their
engineering processes:

I. Purpose & Objectives

II. References

III. Definitions

IV. Function Analysis Results

V. Preliminary Hazard Analysis Results

VI. Final Hazard Analysis Results

VII. Conclusions & Recommendations

Procedures and project plans should specify the hazard analysis documentation
that will be developed at each point in the project or system lifecycle, including
the Operations and Maintenance phase. Any existing documentation that will be
revised as part of the analysis activities will also be specified.

Hazard analysis deliverables that are developed for new technologies introduced
into the plant should be baselined upon completion of a project, then maintained
in a controlled manner for supporting changes. If a change affects a function or
hazard analysis result, the hazard analysis should be updated, and maintained
going forward.

Another area to be addressed in the documentation section of the analysis plan is
the identification of the classification and review levels of the documentation. The
QA classification of the documentation needs to be identified in the plan, as does
the need for independently reviewed documentation.

Section 4: Failure Modes and Effects
Analysis (FMEA) Methods
This section describes methods for performing hazard analysis using two FMEA
methods, Functional FMEA (FFMEA) and Design FMEA (DFMEA).
Although not referred to as hazard analysis methods in typical nuclear industry
parlance, the FMEA methods are treated as hazard analysis methods in this
document because they can be used to identify hazardous failures that can lead to
an accident or loss. Annex D of IEEE Std. 7-4.3.2 – 2003 (Reference 9) includes
the following statement (emphasis added):

One method of determining hazards is through the use of analysis
techniques such as FTA and FMEA. IEEE Std. 603-1998 (5.15
through reference to IEEE Std. 352-1987) suggests using an FMEA
for performing reliability analyses. These techniques can be useful for
identifying potential hazards.

4.1 FMEA Overview

The FMEA method was first derived and applied in military applications in the
1950’s under MIL-STD-1629 (Reference 27). This method was later used in the
1960’s and 1970’s in the aerospace, automotive, food & beverage and commercial
nuclear power industries, with an emphasis on safety. The automotive industry
added a top-down view to the basic FMEA method (a bottom-up, inductive
view of system, component or device failure mechanisms, modes and effects) by
developing a perspective on causes of failure modes in manufacturing process
steps that could lead to component, assembly or vehicle failures.

This guidance refers to the basic, bottom-up FMEA method as a Design
FMEA, and the top-down FMEA method as a Functional FMEA.

Failure modes and effects analysis (FMEA) is a step-by-step approach for
identifying possible failures in a design, process, or product. “Failure modes”
means the ways, or modes, in which something might fail to meet a specified
functional or performance characteristic. “Effects analysis” refers to studying the
consequences of those failures.

This guideline describes two FMEA methods: the Functional FMEA (FFMEA)
and the Design FMEA (DFMEA) method. The Functional FMEA method
takes a “top down” approach by assessing system-level functions and processes
without necessarily identifying and analyzing specific sets of equipment and their
failure modes. Thus the Functional FMEA method is more suitable for
analyzing a system at the conceptual design phase in order to identify functional
hazards or hazardous conditions that should be addressed in later phases of the
lifecycle.

The Design FMEA method is one that should be more familiar to equipment
vendors, I&C engineers, and other stakeholders in the digital I&C community.
It is the traditional bottom-up approach that is described in various standards
such as IEEE Std. 352-1987 (Reference 1).

In general, the Functional FMEA is well suited for identifying hazardous failure
modes that can help limit the focus or scope of a Design FMEA. The Functional
FMEA should be performed by plant staff (or a designated contractor such as an
Architect/Engineer firm) early in the modification process, before an equipment
vendor or third-party integrator is asked to perform a Design FMEA. The
completed Functional FMEA can be an input to the Design FMEA activity so
the analyst can readily identify the functional or process-related failure modes
that should be eliminated, prevented or mitigated by the detailed design.

4.2 Functional FMEA (FFMEA) Procedure

The “Functional” FMEA, as described in this guidance, is derived from a
reference manual (Reference 26) developed and maintained by the Automotive
Industry Action Group (www.aiag.org). In the automotive industry, a “process” is
a sequence of manufacturing operations that produce finished items, assemblies,
or vehicles, and a “Functional FMEA” is intended to identify failure mechanisms
and failure modes that can occur in each step of the manufacturing process that
could ultimately yield an item, assembly, or vehicle that fails to meet
specifications. For example, a grinding operation may be one step in the
manufacture of an engine crankshaft, and failure to grind the shaft to
specifications may cause excessive wear or failure before the end of its specified
lifetime. The Functional FMEA method is differentiated from the bottom-up
Design FMEA method described in Section 4.4 because it is focused on
functions and processes that can affect an item or assembly, instead of the entire
set of credible failure mechanisms that are postulated and evaluated by the
Design FMEA method.

The Functional FMEA method is adapted to digital I&C systems in the nuclear
industry by considering the plant system functions and processes that are sensed,
controlled and indicated by digital I&C equipment. A Functional FMEA can be
particularly useful if it is applied before a Design FMEA is executed, when the
results can be used to reduce the scope of the Design FMEA to the failure
mechanisms that can arise from the affected plant functions and processes.

Prerequisite

The results of a Function Analysis, as described in Section 3.6, are a useful input
to the Functional FMEA (FFMEA) because they provide a well-organized set of
functions that can feed into the first two steps of the FFMEA procedure.

FFMEA Step 1: Draw a Function/Process Map

The first step in the FFMEA process is to draw a Function/Process Map, which
is a hierarchical view of plant system functions and processes of interest to the
analyst. The Function/Process Map uses the results of the Function Analysis
method described in Section 3.6. A generic Function/Process Map for a typical
BWR is provided in Figure 4-1. Note that it does not list or describe any specific
equipment or systems, structures or components beyond the heaviest components
(i.e., reactor, main turbine, etc.).

The focus on plant functions and processes is consistent with the expectations of
the AIAG Reference Manual on FFMEA, and is helpful because it supports a
top-down view of critical functions without forcing a complete bottom-up
analysis of all credible equipment failure modes and effects as expected by the
Design FMEA method (Section 4.4). The resulting Function/Process Map
therefore describes functions and processes at a level of abstraction that does not
need to identify specific equipment.

Note that the generic Function/Process Map presented in Figure 4-1 resembles a
fault tree to some extent, with the exception of logic symbols. It does not
represent success or failure criteria, or contiguous processes; it is simply a
hierarchical view of basic plant functions. However, the plant-specific fault tree
used in the PRA is likely to be a good input document for developing the
function/process map from a functional point of view (i.e., ignoring the failure
logic).

[Figure placeholder: BWR Plant Operations decomposes into Safety, Equipment
Protection, and Power Generation; each branch continues on its own sheet.]

Figure 4-1
Generic BWR Function/Process Map (Sheet 1 of 3)

[Figure placeholder: Sheet 2 expands the Safety branch into Personnel Safety and
Nuclear Safety. Personnel Safety covers functions such as Fire Protection,
Radiation Protection, Safety Tagging, and Industrial Safety; Nuclear Safety
covers functions such as Limit Releases to Environment, Primary Coolant System
Integrity, Reactivity Control, Primary Coolant Inventory Control (high and low
pressure), Containment Pressure and Temperature Control, and Maintain Safe
Shutdown.]

Figure 4-1 (continued)
Generic BWR Function/Process Map (Sheet 2 of 3)

[Figure placeholder: Sheet 3 expands the Equipment Protection and Power
Generation branches. Equipment Protection decomposes into Fire Protection,
Isolate Energy Sources (Isolate Electricity Supply, Isolate Process Line), and
Trip Rotating Equipment (Trip Motor Driven, Engine Driven, and Turbine Driven
Equipment), supported by sensing functions such as Sense Voltage/Frequency
Deviation, Excess Current, High Vibration, High Temperature, Flow/Level/
Pressure/Temperature Deviation, and Overspeed Condition. Power Generation
decomposes into Reactor, Main Turbine, and Main Generator functions, including
Reactivity Control, Reactor Coolant Inventory Control, Steam Flow to Turbine,
Condenser Operation, and Power Conversion.]

Figure 4-1 (continued)
Generic BWR Function/Process Map (Sheet 3 of 3)

Table 4-1
Sample Functional FMEA Worksheet

Worksheet header blocks: PFMEA Number; Prepared by/Date; Checked by/Date;
Approval/Date; Sheet; Rev; Lifecycle Phase; Equipment; High Level
Function/Process (check one): ( ) Safety, ( ) Equipment Protection,
( ) Power Generation.

Worksheet columns: Row No.; Function; Process; Requirement(s); Potential
Failure Mode; Potential Effect(s) of Failure; Potential Cause(s)/Mechanism of
Failure; Current Prevent/Detect Method (Prevention, Detection); Recommended
Action.

The Potential Failure Mode column prompts the analyst with the question
“What can go wrong?” using the Guide Words: No Function, Partial Function,
Over Function, Degraded Function, Intermittent Function, Unintended Function.

FFMEA Step 2: Identify the functions and related processes of interest.

In a digital upgrade project, the functions and processes of interest are typically
those that are affected by the systems or components that are being replaced or
modified, and are usually relatively easy to identify at a functional level. For larger
upgrades that affect multiple plant process systems, there may be multiple
functions or processes that are differentiated by functional segments in the
architecture. Using the Function/Process Map developed in Step 1, highlight or
list the lowest function/process blocks that are affected by the equipment,
systems or components of interest.

FFMEA Step 3: Write a summary description.

Write a summary description of the basic functions of the system or components
of interest to the analysis, and how these basic functions fulfill the higher level
functions and processes identified in FFMEA Step 2. The purpose of this section
is to help anyone reading the FFMEA understand the basic functions of the
system or components being analyzed. It is not necessary to develop or repeat a
comprehensive system description, such as would be found in a typical plant
system description. The summary description should be developed only to the
extent that it supports the analysis.

It is helpful in most cases to include a table that lists each component or
component type and its basic functions.

In more functionally complex systems, it may be helpful to include the functional
sequences that are used to startup or shutdown the plant system (that is being
controlled) in order to provide a more complete functional description of the
components of interest.

FFMEA Step 4: Prepare a FFMEA worksheet.

A blank FFMEA worksheet is provided in Table 4-1. The following steps
describe how to prepare the FFMEA worksheet.

FFMEA Step 5: On each worksheet, fill out the header rows.

Identify the following items in the appropriate blocks at the top of the worksheet:
- FFMEA Number, Sheet Number, Revision, Lifecycle Phase
- High Level Function/Process
  - Nuclear Safety
  - Power Generation
  - Equipment Protection
- Equipment

If more than one of the high level functions identified in the “High Level
Function/Process” block is affected (Nuclear Safety, Power Generation or
Equipment Protection), then a separate FFMEA worksheet should be prepared
for each high level function. It is not unusual for digital I&C equipment
functions to affect all three high level functions one way or another.

The “Equipment” block should identify the system or components of interest
that are described in FFMEA Step 3. A reference to a functional block diagram,
such as one that may be produced by the Function Analysis per Section 3.6, may
be included.

FFMEA Step 6: Identify the lowest level Functions, Processes and related
Requirements.

On each worksheet, under the column labeled “Function,” list the lowest level
functions from the Function/Process Map that are performed by or affected by
the identified equipment. In many cases, there may only be one or two entries in
the “Function” column on each worksheet.

Under the “Process” column, identify the basic Processes that are used to fulfill
each Function. In the context of the FFMEA method, a basic process may be
one or more of the following fundamental processes, characterized by the
properties of the system:
- Energy Storage (thermal, electric, fluid, fuel, etc.)
- Energy Transport (fluid flow, current flow, etc.)
- Energy Addition (pumping, heating, boiling, charging, generating, etc.)
- Energy Reduction (relieving, cooling, condensing, discharging, motoring, etc.)
- Energy Conversion (nuclear to thermal, thermal to kinetic, kinetic to electric, etc.)
- Energy Containment

For each identified Process, briefly list the associated functional or performance
requirements in the “Requirements” column.
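
If the worksheet is maintained in a spreadsheet or script rather than filled out by
hand, the Function, Process, and Requirement entries from this step can be seeded
directly from the Function/Process Map. The following Python sketch is
illustrative only; the entries shown are paraphrased from the HPCI/RCIC example
in Section 4.3 and the structure is an assumption, not a prescribed format.

from typing import List, Tuple

# (Function, Process, Requirement) rows seeded from the Function/Process Map
step6_rows: List[Tuple[str, str, str]] = [
    ("High Pressure Injection",
     "Turbine/pump provides required coolant flow",
     "5000 gpm (HPCI) or 500 gpm (RCIC) @ 1000 psi, on demand, within 60 seconds"),
    ("High Pressure Injection",
     "Steam Supply to Turbine",
     "Supply high quality saturated steam at 1000 psig"),
]

for function, process, requirement in step6_rows:
    print(function, "|", process, "|", requirement)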

FFMEA Step 7: Using the FFMEA Guide Words, postulate the failure modes
of each Process.

For each Process identified in Step 6, postulate the following Guide Words and
list the results under the “Potential Failure Mode Column”:
1. No Function
2. Partial Function
3. Over Function
4. Degraded Function
5. Intermittent Function
6. Unintended Function

These FFMEA Guide Words are designed to answer the “what can go wrong?”
question as it relates to each Process identified in Step 6. Each of the FFMEA
Guide Words is postulated and evaluated individually against each identified
Process, thus making the FFMEA method effective and useful for identifying
single failures, both active and passive, and the resulting effects.

It is not necessary to identify potential failure modes for all six Guide Words if
one or more Guide Words is not applicable or not credible. For example, if a
Process of initiating a safety function is being considered under the general
heading of Nuclear Safety, then the Guide Word “Unintended Function” is not
applicable if the Process is defined as one that is performed on demand due to an
accident condition (i.e., the function is actually intended, so the idea of an
“unintended function” doesn’t make any sense in this context). However, when
evaluating the same Process (initiating a safety function) under the general
heading of Power Generation, then “Unintended Function” would be evaluated
as a spurious actuation because the Process is defined as one that is required only
on demand.
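
Where the worksheet is maintained electronically, the mechanical part of this step,
postulating each Guide Word against each Process and skipping those judged not
applicable, can be sketched as follows. The applicability screen shown is a
placeholder for analyst judgment, and the names used are assumptions for the
illustration.

from typing import Dict, List, Set, Tuple

GUIDE_WORDS = [
    "No Function", "Partial Function", "Over Function",
    "Degraded Function", "Intermittent Function", "Unintended Function",
]

# Analyst judgment, recorded per (high level function, process); values are placeholders
not_applicable: Dict[Tuple[str, str], Set[str]] = {
    ("Safety", "Turbine/pump provides required coolant flow"): {"Unintended Function"},
}


def postulate(high_level: str, process: str) -> List[str]:
    # Candidate "Potential Failure Mode" entries for one Process
    skipped = not_applicable.get((high_level, process), set())
    return [gw + ": " + process for gw in GUIDE_WORDS if gw not in skipped]


for mode in postulate("Safety", "Turbine/pump provides required coolant flow"):
    print(mode)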

FFMEA Step 8: Determine the resulting effects that each Process Failure Mode
can have on the system of interest and the plant.

This step involves following the Potential Failure Modes identified in Step 7 out
to their effects at the system and plant level. This step requires knowledge of the
system or equipment of interest and how it can affect plant operations in terms of
safety, power generation, and equipment protection. This step may require some
cross-discipline support from design engineers, system engineers, or component
engineers who are technically competent in these areas. The results of this step
are entered in the FFMEA worksheet under the column labeled “Potential
Effects of Failure.”

FFMEA Step 9: Determine the Potential Causes or Failure Mechanisms for
each identified Process Failure Mode.

This step typically requires some knowledge of the equipment that is or would be
involved in the potential failures. The results are listed under the column
“Potential Cause(s)/Mechanism of Failure.”

FFMEA Step 10: Identify currently available methods of Prevention and
Detection for each Potential Cause or Failure Mechanism identified in Step 9.

Methods of prevention typically include design features, operations and
maintenance practices, procedures, and training. Methods of detection typically
include alarms, indications, and tests, and related procedures. In this step, it is
important to identify the Prevention and Detection methods that are currently
available. If currently available methods are not sufficient, and a new or revised
method is deemed necessary, then it should be noted in Step 11.

One example of a currently available method of prevention or detection would be
an equipment trip function that is designed to prevent BWR vessel overfill if
there is a postulated functioning of high pressure coolant injection (HPCI) when
there is no valid demand for HPCI. In other words, the HPCI pump will trip,
using the trip/throttle valve, if there is a spurious actuation and reactor level
reaches a high level setpoint.

Design features and functions can only be credited if they are independent from
the functions and processes that are within the scope of the FFMEA.
Continuing with the HPCI example, if the level of interest (per Section 3.2) is
the HPCI flow control system, as shown in Example 4-1 below, and the
trip/throttle function (implemented via sensors, bistables and the trip/throttle
valve) is outside of the system of interest, then credit can be taken for the HPCI
trip function to prevent or mitigate reactor overfill in the event of a postulated
functional actuation of the flow control system.

If there is no currently available method for preventing or mitigating the effects
of a postulated functional failure, then it is likely that some recommended actions
will be necessary.
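
The independence constraint described above, crediting only features that lie
outside the functions and processes being analyzed, can be expressed as a simple
screen. The following Python sketch is illustrative only; the field names are
assumptions, and the in-scope boundary corresponds to the level of interest
selected per Section 3.2.

from dataclasses import dataclass
from typing import List, Set


@dataclass
class Method:
    name: str
    implemented_by: str   # system or segment that provides the feature


def creditable(methods: List[Method], in_scope: Set[str]) -> List[Method]:
    # Keep only methods implemented outside the system(s) being analyzed
    return [m for m in methods if m.implemented_by not in in_scope]


candidates = [
    Method("HPCI turbine trip on high reactor level", "Trip/throttle valve logic"),
    Method("Flow setpoint limit in controller", "HPCI/RCIC flow control system"),
]
in_scope = {"HPCI/RCIC flow control system"}

for m in creditable(candidates, in_scope):
    print("Credit:", m.name)
# Any failure mode left with no creditable method should carry a Recommended Action (Step 11)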

FFMEA Step 11: Provide Recommended Actions.

The “Recommended Actions” column in the FFMEA worksheet is used by the
analyst to explain unacceptable results and what should be done about them, or
identify additional methods that should be implemented for preventing and/or
detecting failure mechanisms.

FFMEA Step 12: Apply the results.

Guidance for applying the results of a FFMEA is provided in Section 4.5.

4.3 Functional FMEA (FFMEA) Example

The following examples were originally developed for EPRI 1022985 (Reference
15). They are repeated here with some minor changes that make them more
complete, and some editorial changes that show how they follow the FMEA
procedures described in Sections 4.2 and 4.4. The first example demonstrates the
Functional FMEA method, and the second example demonstrates the Design
FMEA method.

Example 4-1. HPCI-RCIC Turbine Controls Functional FMEA
FFMEA Step 1: Draw a Function/Process Map
Figure 4-2 provides an overall view of a High Pressure Coolant Injection (HPCI) or
Reactor Core Isolation Cooling (RCIC) system, with an emphasis on the turbine-
driven pump and flow control systems that are part of HPCI or RCIC in a typical
BWR. Note that the flow control system is represented as a single functional block,
which is sufficient for the FFMEA method.

Figure 4-3 satisfies the prerequisite for a Function Analysis (FA). It illustrates a
section of the overall, high-level BWR Function/Process Map that was provided in
Figure 4-1, and it is further developed to show the lower-level functions of “High
Pressure Injection” and “Trip Turbine Driven Equipment” and their related processes
that are necessary for satisfying the overall functions of Safety, Equipment Protection,
and Power Generation. These lower-level functions and processes will be used to
initiate the FFMEA worksheets in later steps.
FFMEA Step 2: Identify the functions and related processes of interest.
Figure 4-3 highlights the functions and related processes of interest for this example.
FFMEA Step 3: Write a summary description.
Summary descriptions of the HPCI and RCIC systems are provided below:
HPCI Summary Description
The design basis function of the HPCI system is reactor inventory control to ensure
the reactor core is adequately cooled to limit maximum fuel cladding temperature
following a small-break loss-of-coolant-accident (LOCA) which does not rapidly
depressurize the reactor pressure vessel (10CFR50.46 Criterion 1). HPCI also
provides a reactor inventory control function following other initiating events such as
transients, stuck open safety relief valve (SRVs), medium-break LOCAs and
anticipated transient without scram (ATWS).
HPCI can be initiated manually, or it will initiate automatically via high drywell
pressure or Low-Low reactor water level. The maximum response time allowed to
achieve rated flow is 60 seconds in the design basis analysis, but can be much
longer and still be successful, particularly given best estimate assumptions and given
non-LOCA initiating events.
When the HPCI is initiated, the system initiation signal opens the turbine steam
admission valve. When the steam admission valve opens, a limit switch on the valve
changes state when the valve begins to open, thus sending an enable signal to the
digital governor via one set of contacts and to the digital positioner via a second set
of contacts. The governor responds by sending a governor valve position demand
signal that ramps the turbine speed up to a preferred initial speed, then switches to
PID control in order to respond automatically to changes in system load. The
purpose of the ramp function is to enable controlled acceleration of the turbine and
avoid initial overspeed transients that may encroach upon the mechanical overspeed
trip limit. To support the initial response of the turbine when an enable signal is
received, the governor valve is preset to a partially open position.
The HPCI pump is a two stage component (booster pump + main pump), driven by a
single steam turbine. The pump takes suction from the condensate storage tank (CST)
until it reaches low level, then the suction source is switched to the suppression pool.
The pump supplies water to the reactor vessel via the feedwater line, or it can be
aligned in recirculation mode to discharge to the CST during surveillance tests. The
HPCI turbine is driven by Main Steam, which exhausts to the suppression pool after
leaving the turbine.
The HPCI turbine is automatically tripped on any HPCI isolation signal, high turbine
exhaust pressure, high reactor water level, low pump suction pressure, or
mechanical overspeed. The turbine can also be tripped manually from the Main
Control Room (MCR), the Remote Shutdown Panel (RSP), or locally at the turbine.

The turbine is tripped by closing the trip/throttle valve shown in Figure 4-2, thus
isolating the steam supply.
RCIC Summary Description
The design basis function of the RCIC system is reactor inventory control to provide
makeup water to the reactor vessel during reactor shutdown and isolation when the
main condenser and feedwater system are unavailable. RCIC also provides a
reactor inventory makeup function following initiating events such as non-isolation
transients and stuck open SRVs.
RCIC can be initiated manually, or it will initiate automatically via Low-Low reactor
water level. There is no automatic initiation of RCIC on drywell pressure. The design
basis maximum response time allowed to achieve rated flow is 60 seconds but can
be much longer and still be successful.
When RCIC is initiated, the system initiation signal opens the turbine steam
admission valve. When the steam admission valve opens, a limit switch on the valve
changes state when the valve reaches 20% open, thus sending an enable signal to
the digital governor via one set of contacts and to the digital positioner via a second
set of contacts. The governor responds by sending a governor valve position
demand signal that ramps the turbine speed up to a preferred initial speed, then
switches to PID control in order to respond automatically to changes in system load.
The purpose of the ramp function is to enable controlled acceleration of the turbine
and avoid initial overspeed transients that may encroach upon the mechanical
overspeed trip limit. To support the initial response of the turbine when an enable
signal is received, the governor valve is preset to a partially open position.
The RCIC pump is driven by a single steam turbine. The pump takes suction from the
condensate storage tank (CST) until it reaches low level, then the suction source is
switched to the suppression pool. The pump supplies water to the reactor vessel via
the feedwater line, or it can be aligned in recirculation mode to discharge to the
CST during surveillance tests. The RCIC turbine is driven by Main Steam which
exhausts to the suppression pool after leaving the turbine.
The RCIC turbine is automatically tripped on any RCIC isolation signal, high turbine
exhaust pressure, high reactor water level, low pump suction pressure, or
mechanical overspeed. The turbine can also be tripped manually from the MCR, the
RSP, or locally at the turbine. The turbine is tripped by closing the trip/throttle valve
shown in Figure 4-2, thus isolating the steam supply.
FFMEA Step 4: Prepare a Functional FMEA worksheet.
Functional FMEA worksheets are provided in Table 4-2.
FFMEA Step 5: On each worksheet, fill out the header rows.
In this example there are three sheets, differentiated by major BWR function in the
upper left corner of the worksheet.
FFMEA Step 6: Identify the lowest level Functions, Processes and related Requirements.
In this example, the results of Step 6 are shown in Table 4-2 under the columns
labeled “Function,” “Process” and “Requirements.”
Note that the entries in the “Function” and “Process” columns are transposed directly
from the Function/Process Map provided in Figure 4-3. The “Requirements” column
entries are derived from the plant FSAR, Technical Specifications, and System
Descriptions.
FFMEA Step 7: Using the FFMEA Guide Words, postulate the failure modes of
each Process
In this example, the results of Step 7 are shown in Table 4-2 under the column
labeled “Potential Failure Mode.” Note that for most Processes identified in this
example, only 3 or 4 of the 6 FFMEA Guide Words yield a result, where the other 2
or 3 Guide Words are not applicable. In each case, the failures are postulated as
single failures, albeit from a process point of view.
Only one Process, “Turbine/Pump provides required coolant flow,” was evaluated
with 5 Guide Words, shown in Rows 1 through 5 of the “Power Generation”
worksheet (sheet 3 of 3 in Table 4-2). This particular Process picked up the
postulated failure of “Unintended Function” because it is relevant in the context of
Power Generation functions. However, this same postulated failure is not applicable
in the context of Safety functions because the Safety requirement is stated in a
manner that HPCI or RCIC flow is required on demand (i.e., during an accident or
surveillance test), when the function of High Pressure Injection is actually intended.
FFMEA Step 8: Determine the resulting effects that each Functional Failure Mode
can have on the system of interest and the plant.
In this example, these effects are listed in Table 4-2 in the column labeled “Potential
Effect(s) of Failure.” Note that the effects are described in terms of impact on
equipment (e.g., turbine trip) or system function (e.g., loss of HPCI or RCIC).
FFMEA Step 9: Determine the Potential Causes or Failure Mechanisms for each
identified Functional Failure Mode.
In this example, the Potential Causes or Failure Mechanisms for each Functional
Failure Mode are listed in Table 4-2 in the column labeled “Potential
Cause(s)/Mechanism of Failure.” Note that the identified causes are generally
mechanical or electrical in nature, and none of them are directly attributable to any
digital I&C equipment because the FFMEA method looks at failure modes and effects
at a functional level before any specific digital equipment is identified.
FFMEA Step 10: Identify currently available methods of Prevention and Detection.
Note that almost all of the identified methods of Prevention or Detection take
advantage of typical programs and processes that are in place at a typical nuclear
plant, including Preventive Maintenance, Procedures, Chemistry, Human
Performance, Surveillance Testing, ASME Section XI Testing, and System Alarms. In
a few cases, “Software V&V” is identified as a method of Prevention, thus drawing
attention to the potential digital I&C solution that may be considered for replacing or
upgrading the HPCI or RCIC flow control system.
FFMEA Step 11: Provide Recommended Actions.
In this example, the Recommended Actions listed in Table 4-2 are centered on using
the Design FMEA method to further explore failure modes and effects of any
proposed digital I&C solution that can result in the related Functional Failure Modes.
For example, rows 1 through 4 on Sheet 1 of 3 in Table 4-2 show “Software V&V”
as a method of preventing turbine trips, failed initiation, late initiation, ramp rate too
slow, and other causes of a failed turbine/pump flow Process. These “causes” or
“failure mechanisms” at the Process level can be considered “failure modes” at the
Equipment level, which can be evaluated using the Design FMEA method.

FFMEA Step 12: Apply the results.
Because this example is constructed at the conceptual design stage of a hypothetical
project, the results of this Functional FMEA in Table 4-2 would be provided to the
appropriate analyst responsible for performing a Design FMEA downstream in the
project lifecycle. In other words, the Functional and Design FMEAs are linked with
respect to digital I&C equipment failure modes and effects that can result in
hazardous (i.e., unwanted) effects on system Functions and Processes.

[Figure 4-2 is a simplified system diagram: the HPCI/RCIC turbine-driven pump
takes suction from the Condensate Storage Tank and supplies water to the reactor
via the feedwater line (Main Feedwater System), with Main Steam supplied to the
turbine through the governor, trip/throttle, and steam admission valves. The
HPCI/RCIC Flow Control System receives operator interaction, a flow signal, a
limit switch input from the steam admission valve, and the system initiation
signal. The figure also lists the following signals:]

System Initiation Signals (Open Steam Admission Valve & Process Valves):
1. Low Reactor Level (-48")
2. High Drywell Pressure (HPCI only; +2 psig)

System Isolation Signals (Trip Turbine & Close Process Valves):
1. High Steam Line Flow
2. High Area Temperature
3. Low Steam Line Pressure (HPCI only)
4. Low Reactor Pressure (RCIC only)
5. Manual

Turbine Trip Signals (Close Trip/Throttle Valve):
1. Any system isolation signal
2. High Steam Exhaust Pressure (150 psi)
3. High Reactor Level (+46")
4. Low pump suction pressure (15" Hg)
5. Turbine overspeed
6. Manual (local or remote)

Figure 4-2
HPCI/RCIC System Diagram

[Figure 4-3 extends the generic BWR Function/Process Map of Figure 4-1 down to
the level of interest for Example 4-1. Under Safety, Nuclear Safety and Reactor
Coolant Inventory Control decompose to High Pressure Inventory Control and the
Example 4-1 function High Pressure Injection; under Equipment Protection,
Isolate Energy Sources and Trip Rotating Equipment decompose to Isolate Process
Line and Trip Turbine Driven Equipment. The Example 4-1 processes shown are
Turbine/Pump Provides Flow, Steam Supply, Suction Supply, Coolant Path to Rx,
and the sensing processes (Sense Flow, Level, Pressure, and Temperature
Deviation; Sense Overspeed Condition).]

Figure 4-3
High Pressure Injection Function/Process Map

Table 4-2
HPCI/RCIC Flow Control System Functional FMEA Worksheets

PFMEA Number: Example 4-1    Sheet: 1 of 3    Rev: 0a    Lifecycle Phase: Conceptual Design
High Level Process/Functional Area: (X) Safety
Equipment: HPCI/RCIC Flow Control System

Function: High Pressure Injection

Process: Turbine/pump provides required coolant flow
Requirement: 5000 gpm (HPCI) or 500 gpm (RCIC) @ 1000 psi, on demand, within 60 seconds
Row 1 - No coolant flow. Effect: loss of Rx inventory, leading to core damage. Causes: failed initiation signal; tripped turbine (no reset). Prevention: Software V&V; ESFAS PM; Turbine PM. Detection: ESFAS Test; System Flow Test.
Row 2 - Less than 5000 gpm (HPCI) or 500 gpm (RCIC). Effect: less than adequate Rx inventory, possibly leading to core damage (medium LOCA). Causes: HPCI starts, but turbine trips; turbine speed too low; incorrect setpoint.
Row 3 - More than 5000 gpm (HPCI) or 500 gpm (RCIC). Effect: HPCI or RCIC turbine trip on high Rx level (via trip valve). Causes: turbine speed too high; incorrect flow setpoint.
Row 4 - 5000 gpm (HPCI) or 500 gpm (RCIC), but after 60 seconds. Effect: less than adequate Rx inventory, possibly leading to core damage. Causes: late initiation signal (or late response); ramp rate too slow.
Prevention (Rows 2-4): Software V&V; ESFAS PM; Turbine PM; Setpoint Control Program; Human Performance; Turbine trips. Detection (Rows 2-4): ESFAS Test; System Flow Test; Alarms.
Recommended Action (Rows 1-4): Evaluate flow control system failure modes via DFMEA.

Process: Steam Supply to Turbine
Requirement: Supply high quality saturated steam at 1000 psig
Row 5 - No steam flow. Effect: loss of Rx inventory, leading to core damage. Causes: steam line break; inadvertent isolation. Prevention: H2O Chemistry; Human Performance. Detection: Section XI Test; Alarms.
Row 6 - Poor steam quality (high moisture). Effect: turbine degradation, eventual loss of Rx inventory. Cause: high carryover from Rx. Prevention: Rx PM; Turbine PM. Detection: System Flow Test.
Row 7 - Steam pressure < 1000 psig. Effect: turbine can run as low as 150 psig, then low pressure systems take over. Causes: steam line leak; steam line partial blockage. Prevention: H2O Chemistry; FME Program. Detection: Section XI Test; Alarms.
Row 8 - Steam pressure > 1000 psig. Effect: relief valves lift, steam pressure/flow transients. Causes: steam hammer; Rx pressure transient. Prevention: Ops Procedures. Detection: Alarms.

Process: Suction Supply to Pump
Requirement: Supply clean, demineralized water with adequate NPSH
Row 9 - No water flow. Effect: loss of Rx inventory, leading to core damage. Causes: empty CST or Torus; inadvertent isolation. Prevention: Human Performance. Detection: Alarms; CST/Torus Surveillance.
Row 10 - Foreign material in water. Effects: pump damage, less than adequate flow; clogged strainer, low NPSH, less than adequate flow. Causes: inadequate FME controls; material degradation. Prevention: Human Performance; H2O Chemistry. Detection: System Flow Test; Chemistry Samples.
Row 11 - Less than adequate NPSH. Effect: pump cavitation, eventual damage, less than adequate flow. Causes: low water level in CST or Torus; pipe obstruction. Prevention: Ops Procedures; FME Program. Detection: CST/Torus Surveillance Test.

Process: Coolant Flow Path to Rx
Requirement: Maintain pressure boundary integrity, capable of 5000 gpm @ 1000 psi
Row 12 - Loss of pressure boundary. Effect: loss of Rx inventory, leading to core damage. Causes: pipe break; intersystem leak.
Row 13 - Capacity less than 5000 gpm. Effect: less than adequate Rx inventory, possibly leading to core damage (medium LOCA). Causes: pipe leak; intersystem leak. Prevention: H2O Chemistry; Human Performance. Detection: Alarms.
Row 14 - Less than 1000 psi. Effect: less than adequate Rx inventory, possibly leading to core damage.

Table 4-2 (continued)
HPCI/RCIC Flow Control System Functional FMEA Worksheets

PFMEA Number: Example 4-1    Sheet: 2 of 3    Rev: 0a    Lifecycle Phase: Conceptual Design
High Level Process/Functional Area: (X) Equipment Protection
Equipment: HPCI/RCIC Flow Control System

Function: Trip Turbine Driven Equipment

Process: Sense turbine speed and trip on overspeed condition
Requirement: Initiate turbine trip at 5000 rpm, regardless of other conditions
Row 1 - Failed mechanical overspeed sensing mechanism. Effects: turbine damage; loss of HPCI or RCIC. Cause: jammed or broken overspeed bolt/cam. Prevention: Turbine PM. Detection: Overspeed Test.
Row 2 - Failed T/T valve. Effects: turbine damage; loss of HPCI or RCIC. Cause: T/T valve stem stuck or broken. Prevention: Turbine PM. Detection: T/T Valve Stroke Test.
Row 3 - False sensing of overspeed condition. Effects: turbine trip; loss of HPCI or RCIC. Cause: misposition of overspeed bolt/cam. Prevention: Turbine PM. Detection: Overspeed Test.

Process: Sense turbine speed and stop if no enable signal and turbine rolling due to leaky steam admission valve
Requirement: Initiate governor valve closure if no enable signal and turbine speed > 1000 rpm
Row 4 - Failed-open governor valve. Effects: governor valve does not close; possible turbine trip on mechanical overspeed or Rx overfill; loss of HPCI or RCIC. Causes: false "open" command from governor; failed-open valve actuator. Prevention: Software V&V; Turbine PM. Detection: System Flow Test. Recommended Action: Evaluate flow control system failure modes via DFMEA.
Row 5 - False enable signal. Effects: same as Row 4. Cause: failed-closed limit switch contacts. Prevention: Turbine PM. Detection: System Flow Test. Recommended Action: Evaluate flow control system failure modes via DFMEA.
Row 6 - False sensing of overspeed condition. Effect: governor valve closes or remains closed. Cause: speed sensor has excessive drift or fails high. Prevention: Turbine PM. Detection: System Flow Test.

Table 4-2 (continued)
HPCI/RCIC Flow Control System Functional FMEA Worksheets

PFMEA Number: Example 4-1    Sheet: 3 of 3    Rev: 0a    Lifecycle Phase: Conceptual Design
High Level Process/Functional Area: (X) Power Generation
Equipment: HPCI/RCIC Flow Control System

Function: High Pressure Injection

Process: Turbine/pump provides required coolant flow
Requirement: 5000 gpm (HPCI) or 500 gpm (RCIC) @ 1000 psi, on demand, within 60 seconds
Rows 1-4 share the following Effects: failed surveillance test; enter Tech Spec action statement; shut down unit if operability not restored within 14 days; Maintenance Rule A1 impact.
Row 1 - No coolant flow. Causes: failed initiation signal; tripped turbine (no reset). Prevention: Software V&V; ESFAS PM; Turbine PM. Detection: ESFAS Test; System Flow Test.
Row 2 - Less than 5000 gpm (HPCI) or 500 gpm (RCIC). Causes: HPCI/RCIC starts, but turbine trips; turbine/pump speed too low; incorrect setpoint.
Row 3 - More than 5000 gpm (HPCI) or 500 gpm (RCIC). Causes: turbine speed too high; incorrect setpoint.
Row 4 - 5000 gpm (HPCI) or 500 gpm (RCIC), but after 60 seconds. Causes: late initiation signal (or late response); ramp rate too slow.
Prevention (Rows 2-4): Software V&V; ESFAS PM; Turbine PM; Setpoint Control Program; Human Performance. Detection (Rows 2-4): ESFAS Test; System Flow Test; Alarms.
Recommended Action (Rows 1-4): Evaluate flow control system failure modes via DFMEA.
Row 5 - Spurious flow. Effect: cold water addition to reactor possibly leading to high flux trip (HPCI). Causes: false initiation signal; misposition of steam admission valve. Prevention: ESFAS PM; Turbine PM; Human Performance. Detection: ESFAS Alarm; System Flow Test.

Process: Steam Supply to Turbine
Requirement: Supply high quality saturated steam at 1000 psig
Row 6 - No steam flow. Causes: steam line break; inadvertent isolation. Prevention: H2O Chemistry; Human Performance. Detection: Section XI Test; Alarms.
Row 7 - Poor steam quality (high moisture). Cause: high moisture carryover from Rx. Prevention: Rx PM; Turbine PM. Detection: System Flow Test.
Row 8 - Steam pressure too low. Causes: steam line leak; steam line partial blockage. Prevention: H2O Chemistry; FME Program. Detection: Section XI Test; Alarms.
Row 9 - Steam pressure too high. Causes: steam hammer; Rx pressure transient. Prevention: Ops Procedures. Detection: Alarms.

Process: Suction Supply to Pump
Requirement: Supply clean, demineralized water with adequate NPSH
Rows 10-12 share the following Effects: failed surveillance test; enter Tech Spec action statement; shut down unit if operability not restored within 14 days; Maintenance Rule A1 impact.
Row 10 - No water flow. Causes: empty CST or Torus; inadvertent isolation. Prevention: Human Performance. Detection: Alarms; CST/Torus Surveillance.
Row 11 - Foreign material in water. Causes: inadequate FME controls; material degradation. Prevention: Human Performance; H2O Chemistry. Detection: System Flow Test; Chemistry Samples.
Row 12 - Less than adequate NPSH. Causes: low water level in CST or Torus; pipe obstruction. Prevention: Ops Procedures; FME Program. Detection: CST/Torus Surveillance Test.

Process: Coolant Flow Path to Rx
Requirement: Maintain pressure boundary integrity, capable of 5000 gpm @ 1000 psi
Row 13 - Loss of pressure boundary. Causes: pipe break; intersystem leak.
Row 14 - Capacity less than 5000 gpm (HPCI) or 500 gpm (RCIC). Causes: pipe leak; intersystem leak. Prevention: H2O Chemistry; Human Performance. Detection: Alarms.
Row 15 - Less than 1000 psi.

4.4 Design FMEA (DFMEA) Procedure

The following steps can be used to perform the Design FMEA (DFMEA)
method. This procedure is not the only way to implement the method; variations
are likely, depending on the owner/operator’s engineering and configuration
management program, and its implementing policies and procedures.

This section, including the worked examples, is written on a stand-alone basis, as
if the analyst is only going to perform a Design FMEA. If another method (such
as Functional FMEA) is used as an input to a Design FMEA, then the analyst
would be taking a blended approach, which is described in Section 3.4.

Prerequisite

The results of a Function Analysis, as described in Section 3.6, are useful inputs
to the DFMEA because they provide a well-organized set of functions that can
feed into the steps of the DFMEA procedure that consider failure modes and
their effects on the associated systems.

DFMEA Step 1: Draw a block diagram of the system of interest.

The block diagram should be an integrated view of physical and functional
representations of the system, and it should be drawn to the extent that it
represents the components of interest to the analysis and their interfaces to other
components. See Appendix C for a list of elements that could be used in physical
and functional system representations.

It is helpful in some cases to add supplemental information to a block diagram in
order to fully describe the system physical and functional characteristics.
Supplemental information could include truth tables, limit switch state diagrams,
and itemized lists (e.g., trip functions).

It may be necessary to prepare more than one version of the block diagram in
order to represent different system conditions that may arise in the operations
and maintenance phase of its lifecycle. An example would be one version that
shows a normal system condition during plant operations, and another version
that shows the system out-of-service in a maintenance mode or configuration
(e.g., with a configuration tool connected to an available port on a controller). In
this case, each version of the block diagram would be analyzed using the
remaining steps in this procedure. If portions of each version overlap or share
common characteristics, then it may not be necessary to repeat the analysis for
those portions.

DFMEA Step 2: Draw a boundary around the components of interest.

In a digital upgrade project, the components of interest are typically the


components that are being replaced or modified, and are usually relatively easy to
identify. For larger upgrades that affect multiple plant process systems, there may
be multiple boundaries that are differentiated by functional segments in the
architecture.

On new plant projects, it may be necessary to break down the digital I&C
architecture into systems and sub-systems, differentiated by functional segments
in the architecture.

Operating experience shows that Design FMEAs do not always account for
equipment interfaces that are actually used in the finished system, including
interfaces that are used on a temporary or intermittent basis (References 10, 11
and 16). Therefore, this step can be strengthened by the following methods:
 Verify the equipment interfaces described in the technical information that is
provided with the digital system or components of interest (e.g., technical
manual)
 Examine the interfaces on the actual equipment if it is available, via
walkdown or inspection (e.g., terminal blocks and data communication ports)

The goal of this step for either method is to demonstrate that all of the digital
equipment interfaces used in the target application are accounted for in the block
diagram.
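
One way to make this goal checkable is to record the block diagram and the analysis boundary as simple data, and to compare the interfaces observed on the actual equipment (from the technical manual or a walkdown) against the interfaces drawn. The sketch below is illustrative only; the component and interface names are hypothetical:

```python
# Minimal sketch of an interface-coverage check against the block diagram.
# Component and interface names are hypothetical.
block_diagram = {
    # component: interfaces drawn on the block diagram
    "digital_governor": {"24VDC power", "MPU speed input", "enable contact",
                         "speed demand input", "position demand output",
                         "programming port"},
    "digital_positioner": {"24VDC power", "enable contact",
                           "position demand input", "resolver feedback",
                           "actuator output"},
}

boundary = {"digital_governor", "digital_positioner"}   # components of interest (Step 2)

# Interfaces found on the actual equipment (technical manual review or walkdown)
observed_interfaces = {
    "digital_governor": {"24VDC power", "MPU speed input", "enable contact",
                         "speed demand input", "position demand output",
                         "programming port", "ethernet service port"},   # extra port found
}

for component in boundary:
    drawn = block_diagram.get(component, set())
    observed = observed_interfaces.get(component, set())
    missing = observed - drawn
    if missing:
        print(f"{component}: interfaces not on block diagram -> {sorted(missing)}")
```

Any interface reported as missing, for example a temporary service or configuration port, would be added to the block diagram before proceeding with the analysis.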

DFMEA Step 3: Write a summary description.

Using the results of the Function Analysis (per Section 3.6), write a summary
description of the basic functions of the components inside the boundary drawn
in Step 2, and their interfaces with other equipment or components that cross the
boundary. The purpose of this section is to help anyone reading the DFMEA
understand the basic functions of the system or components being analyzed. It is
not necessary to develop or repeat a comprehensive system description, such as
would be found in a typical plant system description. The summary description
should be developed only to the extent that it supports the analysis.

It is helpful in most cases to include a table that lists each component or


component type and its basic functions.

In more functionally complex systems, it may be helpful to include the functional


sequences that are used to start up or shut down the plant system (that is being
controlled) in order to provide a more complete functional description of the
components of interest.

DFMEA Step 4: Prepare a DFMEA worksheet for each device or component of


interest.

A blank worksheet is provided in Table 4-3. The taxonomy sheets in Appendix B


are available as an aid. The following steps describe how to prepare the DFMEA
worksheet.
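
If the worksheets are kept electronically, the Table 4-3 columns map naturally onto a simple record structure. The sketch below shows one possible representation; the example row is paraphrased from the worked examples later in this section, and the output file name is arbitrary:

```python
# Minimal sketch of a DFMEA worksheet row using the Table 4-3 columns.
import csv
from dataclasses import dataclass, fields, asdict

@dataclass
class DfmeaRow:
    component_identification: str
    functions: str
    failure_modes: str
    failure_mechanisms: str
    effect_on_system: str
    method_of_detection: str
    remarks: str

rows = [
    DfmeaRow(
        component_identification="24 VDC power (example entry)",
        functions="Provide clean, filtered 24 VDC power to the controller",
        failure_modes="Voltage below specification",
        failure_mechanisms="Failed power source",
        effect_on_system="Controller stops; outputs go to shelf state",
        method_of_detection="Periodic test",
        remarks="Consider a power loopback to an analog input and an alarm",
    ),
]

with open("dfmea_worksheet.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(DfmeaRow)])
    writer.writeheader()
    writer.writerows(asdict(r) for r in rows)
```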

 4-21 
DFMEA Step 5: On each worksheet, identify the interfacing components,
signals, power supplies, and other interfaces that can affect the functions or
performance of the components of interest.

The typical approach for this step is to examine the block diagram, identify each
interface that crosses the boundary drawn in Step 2, and identify the system or
component outside the boundary that provides the interface. The results of this
Step are entered on the DFMEA worksheet under the column labeled
“Component Identification.”

DFMEA Step 6: Determine the failure modes of each interfacing component,


signal, power supply or other interface.

For each entry in the “Component Identification” column, determine its failure
modes using available technical information. The results of this step are entered
under the column labeled “Failure Modes.”

The taxonomy sheets provided in Appendix B can be used as an aid if there is a


sheet that closely resembles the component in question. A taxonomy sheet should
not be used by itself. The component or device vendor should be able to provide
technical information that describes the specific failure modes of their products.
Use engineering judgment when determining the failure modes of a device or
component. The results should be verifiable by a competent, independent
engineer.

Appendix B is provided as a prompting aid only, and is not intended to be a


complete representation of all types of digital I&C systems, components or
devices.
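
The prompting-aid idea can also be expressed as a simple lookup from device type to candidate failure modes, which the analyst then screens against vendor information and engineering judgment. The device-type keys below are illustrative, and the candidate lists are drawn from failure modes of the kinds that appear in the worked examples in Section 4.5; they are not a complete taxonomy:

```python
# Illustrative prompting aid: device type -> candidate failure modes to consider.
# Candidates must still be screened against vendor data and engineering judgment.
CANDIDATE_FAILURE_MODES = {
    "analog signal source": ["output fails offscale low", "output fails offscale high",
                             "output fails as-is", "output high rate of change",
                             "excessive drift"],
    "dry contact / limit switch": ["fail open", "fail closed"],
    "dc power supply": ["voltage below specification", "voltage above specification"],
    "configuration port": ["inadvertent or uncontrolled logic change"],
}

def prompt(device_type: str) -> list[str]:
    """Return candidate failure modes for a device type, or an empty list if unknown."""
    return CANDIDATE_FAILURE_MODES.get(device_type, [])

print(prompt("analog signal source"))
```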

DFMEA Step 7: Determine the likely failure mechanisms associated with each
failure mode identified in Step 6.

Failure mechanisms are included in the FMEA worksheet because they provide
some insight for assessing the use of system design features and defensive
measures that may be available to help reduce the likelihood of such failure
mechanisms.

Each taxonomy sheet in Appendix B includes typical defensive measures that


could be applied to reduce the likelihood of failure mechanisms in typical devices
and components that make up a digital system. Additional “coverage” of device
and component failure mechanisms may be obtained by contacting the device or
component manufacturer and asking for relevant technical information, or by
performing a Critical Digital Review using the guidance in EPRI Report
1011710 (Reference 17). For additional guidance on defensive measures in digital
systems, refer to EPRI 1019182 (Reference 21).

DFMEA Step 8: Determine the resulting effects that each interfacing system or
component failure mode can have on the components of interest, and the
resulting effects on the system.

 4-22 
This step involves following the device, component, or sub-system failure modes
out to their effects at the system level. This step requires some knowledge of the
plant system, including its control system, the controlled elements, the process
elements, and their mechanical or electrical properties. This step may require
some cross-discipline support from design engineers, system engineers, or
component engineers who are technically competent in these areas. The results of
this step are entered in the DFMEA worksheet under the column labeled
“Effects on System.”
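
Where the block diagram connectivity has been captured as data (see the sketch under Step 2), the bookkeeping part of this step can be organized as a simple walk from the failed interface to the components of interest it feeds; the judgment about the resulting component-level and system-level effects remains with the analyst and the supporting engineers. A minimal sketch with hypothetical names:

```python
# Minimal sketch: identify which components of interest are fed by a failed interface.
# Connectivity and names are hypothetical; the system-level effect is still an
# engineering judgment recorded by the analyst on the worksheet.
feeds = {
    # interfacing item -> components of interest it feeds
    "MPU speed signal": ["digital_governor"],
    "24VDC power": ["digital_governor", "digital_positioner"],
    "position demand signal": ["digital_positioner"],
}

def affected_components(failed_interface: str) -> list[str]:
    return feeds.get(failed_interface, [])

for comp in affected_components("24VDC power"):
    print(f"Assess the effect of '24VDC power' failure modes on {comp}, "
          f"then on the plant system")
```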

DFMEA Step 9: Determine the methods of detection for each failure mode
identified in Step 6.

See Section 4.5 for guidance on selecting and applying methods of detection
during system development. The results of this Step are entered in the DFMEA
worksheet under the column labeled “Method of Detection.”

DFMEA Step 10: Provide remarks.

The “Remarks” column in the FMEA worksheet is used by the analyst to explain
unusual results or identify measures that could be taken to help prevent or
mitigate the identified failure mode.

DFMEA Step 11: Analyze redundancies.

Digital I&C systems offer improvements over analog systems in fault tolerance,
failure detection, and reliability. These improvements are typically obtained by
building redundancy into the equipment through redundant channels or
components. In most cases, redundancy involves identical channels or
components in terms of equipment, software, and interfaces.

When reviewing a system for single failure vulnerability, a design engineer can
use the Design FMEA to evaluate the system design and assess it for single
failure vulnerabilities. However, the DFMEA method focuses on a detailed
analysis of all functions and components: it is developed by listing each
component in a system and evaluating, for each failure mode, the impact of that
component's failure on the system. This approach is very detailed and, for
systems with multiple channels or redundant components, leads to repetitive
reviews of identical equipment.

The use of redundancy in digital systems does create additional challenges in
addressing common mode failures and interdivisional or interchannel
communication impacts and dependencies. These challenges need to be
addressed in the failure analysis, so redundant divisions or channels should not
be credited carte blanche, up front, when developing the scope of the analysis.
If the following criteria are satisfied, however, the scope of a Design FMEA can
be reduced to a single redundancy in terms of the components and interfaces
that are analyzed (a minimal scope-reduction check is sketched after these
criteria):
 Redundancy Boundary – The redundancy boundary denotes the set of
equipment where systems or components are identical and a single fault or
failure is contained in a single redundancy without adversely impacting the
overall function of the system. The analysis should be able to clearly identify
the extent to which divisions, channels or other redundancies are actually
redundant, vs. those systems, sub-systems or components that are not
redundant. For example, in a four division protection system, such as the one
illustrated in Figure 4-4, there are four distinct, separate and independent
divisions. On the other hand, a master/slave architecture such as the one
illustrated in Figure 4-5 will have limited redundancy, where there are
elements that are shared by both redundancies such as a single controlled
element (e.g., a control valve).
 Dependencies – For system architectures that share data, signals or other
information between redundancies, the sharing of such data, signals or
information must be assessed to determine if any one redundancy is
dependent on one or more of the other redundancies in order to satisfy
functional or performance requirements, including behaviors that are
required to respond to faults and failures in the other redundancies. For
example, master/slave and triple modular redundant (TMR) architectures are
likely to require sharing of module status or signal information between
redundancies so that in the event of a fault or failure in one redundancy,
another one can detect the fault or failure and maintain adequate
functionality within specified performance requirements. In the event that
module or component status information or signals are shared, then the
Design FMEA must clearly describe this dependency and how it can be
credited for responding to each of the module or component failure modes
that are analyzed within a single redundancy.
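
As a bookkeeping aid, the two criteria can be checked against a brief description of the redundancies before the analysis scope is cut down. The sketch below is a minimal illustration with hypothetical division data:

```python
# Minimal sketch of the single-redundancy scope-reduction check (hypothetical data).
divisions = {
    "A": {"components": ("input module", "function processor", "output module"),
          "depends_on": set()},     # data/signals required from other divisions
    "B": {"components": ("input module", "function processor", "output module"),
          "depends_on": set()},
}

identical = len({d["components"] for d in divisions.values()}) == 1
independent = all(not d["depends_on"] for d in divisions.values())

if identical and independent:
    print("Criteria met: analyze a single redundancy and document the basis.")
else:
    print("Criteria not met: analyze each redundancy and every shared or "
          "cross-redundancy interface explicitly.")
```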

[Figure: four identical, independent divisions (A, B, C, D); each division consists
of Sensors, Input Modules, a Function Processor, Output Modules, and Relays
driving its own Controlled Elements.]

Figure 4-4
Multi-Divisional System with Complete, Independent Redundancy

[Figure: two redundant trains, each with its own Power, Inputs, Processor, and
Output; the processors exchange Status information, and both trains drive a
single shared Controlled Element. The Redundancy Boundary is indicated.]

Figure 4-5
Redundancy Boundary for a Master/Slave Architecture

DFMEA Step 12: Apply the results.

Guidance for applying the results of a DFMEA is provided in Section 4.5.

 4-25 
Table 4-3
Sample Design FMEA Worksheet

Worksheet header fields: System; Subsystem; Functional Level Diagram; Sheet; Design Phase; Rev

Worksheet columns: Component Identification | Function(s) | Failure Modes | Failure Mechanisms | Effect on System | Method of Detection | Remarks

 4-26 
4.5 Design FMEA Examples

Example 4-2. HPCI-RCIC Turbine Controls Design FMEA


DFMEA Step 1: Draw a block diagram of the system of interest.
Figure 4-6 provides a block diagram of a hypothetical upgrade to a turbine
governor control system on High Pressure Coolant Injection (HPCI) and Reactor Core
Isolation Cooling (RCIC) turbines. The turbine control system in this example is
relatively simple from a physical and functional point of view, but it is safety
significant. This diagram shows a proposed solution at the conceptual design stage
of this hypothetical project.
The HPCI and RCIC control systems are functionally identical. Pumps, valves,
turbines, etc. are sized differently to meet different flow requirements. Per Figure 4-6,
the primary control loop is via flow controllers with Manual/Auto (M/A) capability
in the Main Control Room (MCR) or on the Remote Shutdown Panel (RSP). The HPCI
control system is designed to maintain flow at the required setpoint when the in-
service flow controller is in automatic mode. The flow controller applies a
Proportional-Integral-Derivative (PID) control algorithm that adjusts the speed demand
output of the controller to compensate for any errors between the flow setpoint and
the actual flow signal provided by a flow transmitter downstream of the HPCI pump.
Note that Figure 4-6 shows three valves that directly affect the steam supply to the
turbine (governor valve, trip/throttle valve and steam admission valve). Not shown
are other process valves, such as the valves that connect the HPCI and RCIC systems
to the feedwater lines or the main steam lines. These other process valves are
opened and closed by system initiation or isolation signals from manual controls or
the plant protection system. However, these signals are listed in Figure 4-6 because
they can affect the state of the turbine steam admission valve or the turbine
trip/throttle valve.
The original turbine speed control system on the HPCI and RCIC turbines applied a
series of electronic modules that processed the signal from the M/A station and
provided an interface to hydraulic components on the HPCI pump skid. Ultimately,
the electronic modules and the hydraulic components automatically adjusted the
position of the governor valve to match the actual speed of the turbine to the speed
demanded by the flow controller.
The proposed upgrade is to replace the original electronic modules and most of the
hydraulic components with a digital governor and a digital positioner, per the
conceptual design shown in Figure 4-6, which illustrates the steam turbine, pump,
and piping, and the basic, closed loop functions of the flow control system. The flow
control system consists of a flow element, a flow transmitter, the flow controllers, a
digital turbine speed governor, a digital valve positioner, and feedback loops from
sensors.
The physical representations of the system in Figure 4-6 are shown in yellow boxes
or symbols, while the functional representations are shown in white boxes, or they
are self-evident in the shape of the symbols.

 4-27 
Example 4-2. HPCI-RCIC Turbine Controls Design FMEA (continued)
DFMEA Step 2: Draw a boundary around the components of interest.
Figure 4-6 shows an analysis boundary around the digital governor and digital
positioner. These components are of interest to the analysis in this example because
they form a digital upgrade that replaces obsolete equipment. The components
shown outside the boundary are original equipment that will remain as-is, outside
the scope of the plant change project. Nevertheless, some of the components outside
the boundary have interfaces that cross the boundary, and will have failure modes
that will be accounted for in the analysis.
DFMEA Step 3: Write a summary description.
Table 4-4, which meets the prerequisite for a Function Analysis (in this case at a
component level), provides a listing of the principal components shown in Figure 4-
6, and their functions. Summary descriptions are provided below:
HPCI Summary Description
The design basis function of the HPCI system is reactor inventory control to ensure
the reactor core is adequately cooled to limit maximum fuel cladding temperature
following a small-break loss-of-coolant-accident (LOCA) which does not rapidly
depressurize the reactor pressure vessel (10CFR50.46 Criterion 1). HPCI also
provides a reactor inventory control function following other initiating events such as
transients, stuck open safety relief valves (SRVs), medium-break LOCAs and
anticipated transient without scram (ATWS).
HPCI can be initiated manually, or it will initiate automatically via high drywell
pressure or Low-Low reactor water level. The maximum response time allowed to
achieve rated flow is 55 seconds in the design basis analysis, but can be much
longer and still be successful, particularly given best estimate assumptions and given
non-LOCA initiating events.
When the HPCI is initiated, the system initiation signal opens the turbine steam
admission valve. When the steam admission valve opens, a limit switch on the valve
changes state when the valve begins to open, thus sending an enable signal to the
digital governor via one set of contacts and to the digital positioner via a second set
of contacts. The governor responds by sending a governor valve position demand
signal that ramps the turbine speed up to a preferred initial speed, then switches to
PID control in order to respond automatically to changes in system load. The
purpose of the ramp function is to enable controlled acceleration of the turbine and
avoid initial overspeed transients that may encroach upon the mechanical overspeed
trip limit. To support the initial response of the turbine when an enable signal is
received, the governor valve is preset to a partially open position.
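
The ramp-then-PID start behavior described above can be sketched as follows. All numeric values, gains, and the simplified control law are hypothetical placeholders used only to illustrate the sequence (the derivative term is omitted for brevity); they are not taken from any actual governor product:

```python
# Minimal sketch of the governor start behavior: ramp the internal speed target
# to an initial speed, then hand off to closed-loop speed control.
RAMP_RATE_RPM_PER_S = 200.0    # hypothetical controlled-acceleration ramp rate
INITIAL_SPEED_RPM = 2000.0     # hypothetical preferred initial speed
KP, KI = 0.02, 0.005           # illustrative proportional/integral gains
DT = 0.1                       # control cycle, seconds
PRESET_DEMAND = 0.1            # valve preset partially open for the next start

def governor_demand(enabled: bool, speed_setpoint: float, actual_speed: float,
                    state: dict) -> float:
    """Return a governor valve position demand in the range 0..1 for one cycle."""
    state.setdefault("ramp_target", 0.0)
    state.setdefault("integral", 0.0)
    if not enabled:                                  # no enable signal from the limit switch
        state["ramp_target"] = 0.0
        state["integral"] = 0.0
        return PRESET_DEMAND
    if state["ramp_target"] < INITIAL_SPEED_RPM:     # controlled acceleration phase
        state["ramp_target"] = min(INITIAL_SPEED_RPM,
                                   state["ramp_target"] + RAMP_RATE_RPM_PER_S * DT)
        target = state["ramp_target"]
    else:                                            # hand off to closed-loop speed control
        target = speed_setpoint
    error = target - actual_speed
    state["integral"] += error * DT
    demand = KP * error + KI * state["integral"]
    return max(0.0, min(1.0, demand))

# Example: step through a few cycles after the enable signal is received
state = {}
for _ in range(3):
    print(round(governor_demand(True, 3600.0, 0.0, state), 3))
```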
The HPCI pump is a two stage component (booster pump + main pump), driven by a
single steam turbine. The pump takes suction from the condensate storage tank (CST)
until it reaches low level, then the suction source is switched to the suppression pool.
The pump supplies water to the reactor vessel via the feedwater line, or it can be
aligned in recirculation mode to discharge to the CST during surveillance tests. The
HPCI turbine is driven by Main Steam, which exhausts to the suppression pool after
leaving the turbine.
The HPCI turbine is automatically tripped on any HPCI isolation signal, high turbine
exhaust pressure, high reactor water level, low pump suction pressure, or
mechanical overspeed. The turbine can also be tripped manually from the Main
Control Room (MCR), the Remote Shutdown Panel (RSP), or locally at the turbine.
The turbine is tripped by closing the trip/throttle valve shown in Figure 4-6, thus
isolating the steam supply.
RCIC Summary Description
The design basis function of the RCIC system is reactor inventory control to provide
makeup water to the reactor vessel during reactor shutdown and isolation when the
main condenser and feedwater system are unavailable. RCIC also provides a
reactor inventory makeup function following initiating events such as non-isolation
transients and stuck open SRVs.
RCIC can be initiated manually, or it will initiate automatically via Low-Low reactor
water level. There is no automatic initiation of RCIC on drywell pressure. The design
basis maximum response time allowed to achieve rated flow is 50 seconds but can
be much longer and still be successful.
When RCIC is initiated, the system initiation signal opens the turbine steam
admission valve. When the steam admission valve opens, a limit switch on the valve
changes state when the valve reaches 20% open, thus sending an enable signal to
the digital governor via one set of contacts and to the digital positioner via a second
set of contacts. The governor responds by sending a governor valve position
demand signal that ramps the turbine speed up to a preferred initial speed, then
switches to PID control in order to respond automatically to changes in system load.
The purpose of the ramp function is to enable controlled acceleration of the turbine
and avoid initial overspeed transients that may encroach upon the mechanical
overspeed trip limit. To support the initial response of the turbine when an enable
signal is received, the governor valve is preset to a partially open position.
The RCIC pump is driven by a single steam turbine. The pump takes suction from the
condensate storage tank (CST) until it reaches low level, then the suction source is
switched to the suppression pool. The pump supplies water to the reactor vessel via
the feedwater line, or it can be aligned in recirculation mode to discharge to the
CST during surveillance tests. The RCIC turbine is driven by Main Steam which
exhausts to the suppression pool after leaving the turbine.
The RCIC turbine is automatically tripped on any RCIC isolation signal, high turbine
exhaust pressure, high reactor water level, low pump suction pressure, or
mechanical overspeed. The turbine can also be tripped manually from the MCR, the
RSP, or locally at the turbine. The turbine is tripped by closing the trip/throttle valve
shown in Figure 4-6, thus isolating the steam supply.
DFMEA Step 4: Prepare an FMEA worksheet for each device or component of
interest.
FMEA worksheets are provided in Table 4-5 for the digital governor and Table 4-6
for the digital positioner. In this example there are two FMEA worksheets,
differentiated by “subsystem” in the upper left corner, because there are two
components of interest in the analysis.
DFMEA Step 5: On each worksheet, identify the interfacing components, signals,
power supplies, and other interfaces that can affect the functions or performance of
the components of interest.
In this example, the results of Step 5 are shown in Tables 4-5 and 4-6 under the
column labeled “Component Identification.”

 4-29 
Example 4-2. HPCI-RCIC Turbine Controls Design FMEA (continued)
DFMEA Step 6: Determine the failure modes of each interfacing component,
signal, power supply or other interface.
In this example, the results of Step 6 are shown in Tables 4-5 and 4-6 under the
column labeled “Failure Modes.”
DFMEA Step 7: Determine the failure mechanisms associated with each failure
mode identified in Step 6.
In this example, the results of Step 7 are shown in Tables 4-5 and 4-6 under the
column labeled “Failure Mechanisms.”
DFMEA Step 8: Determine the resulting effects that each interfacing component
failure mode can have on the components of interest, and the resulting effects on the
system.
In this example, these effects are listed in Tables 4-5 and 4-6, in the column labeled
“Effects on System.”
DFMEA Step 9: Determine the methods of detection for each failure mode
identified in Step 6.
In this example, the methods of detection are listed in Tables 4-5 and 4-6 in the
column labeled “Method of Detection.” Note that for all of the identified failure
modes, the identified method of detection is “Periodic Test” (or audit) because at the
conceptual design stage shown in this example, there are no hardware or software
features that have been identified yet that can detect, mitigate and provide an
indication and/or alarm associated with the identified failure modes. Hardware
and/or software features that can provide methods of detecting the identified failure
modes would be expected to emerge later in the development phase of this example
project, and the FMEA worksheets would be updated accordingly.
DFMEA Step 10: Provide remarks.
In this example, the remarks listed in Tables 4-5 and 4-6 are centered on features
and functions that should be picked up in the design phase of the project; these
features and functions essentially become “defensive measures” against the
identified failure modes.
DFMEA Step 11: Analyze redundancies.
In this example, there are no redundant components in the system of interest.
However, in some BWRs, at the plant level, the HPCI and RCIC systems may provide
redundancy in terms of safety functions, in combination with other independent
systems (e.g., automatic depressurization). Therefore, the FMEA in this example is
complete, at least at the conceptual design phase. Tables 4-5 and 4-6 are equally
applicable to the conceptual design of the HPCI and RCIC turbine control systems.
DFMEA Step 12: Apply the results.
Because this example is constructed at the conceptual design stage of a hypothetical
project, the results of the preliminary FMEA in Tables 4-5 and 4-6 would be applied
in later phases. Therefore, the following Application Notes are provided for the
system designers, based in large part on the results in the “Remarks” column of the
FMEA worksheets.
Note that as this project would progress beyond the conceptual design phase, the
FMEA would be updated to reflect design details and methods of detection, and the
finished results would be validated. The final FMEA product would then be used to
support licensing activities and development or validation of periodic test
procedures, alarm management methods, maintenance procedures, and
troubleshooting and cause analysis guidance.
Application Notes
The following insights were obtained from the FMEA “Remarks” column in Tables 4-
5 and 4-6 for later use in developing the detailed hardware design and application
logic for the governor and positioner components. In addition, the FMEA should be
updated during the design phase of the project to account for changes in
component or system level effects as these application notes are factored into the
design, and ultimately the finished FMEA should be validated in the project test
phases (e.g., FAT, SAT or Post-Installation) to confirm the analytical results.
It is expected that the detailed control system design and the application logic
will be modified as needed to detect, mitigate, or eliminate the undesired effects currently
described in Tables 4-5 and 4-6. The number of failure modes currently detectable
only by periodic test should be reduced as much as practically achievable through
the use of signal validation and alarm methods.
1. Provide signal validation methods in the application logic for the governor and
positioner. Signal validation methods can include the following (illustrated in the
sketch after these application notes):
a. Out of Range Checks (where analog signals present less than a “live zero”
such as 4.0 mADC or 1.0 VDC, or greater than calibrated span such as
20.0 mADC or 5.0 VDC)
b. Median Select (provide three redundant signals, select the middle signal)
c. High Rate of Change (determine the maximum credible rate of change of a
signal, in units such as %/second, and design a simple filter or rate detection
algorithm that allows a signal to pass through if its rate of change is less
than the maximum credible rate)
2. Provide indications and alarms associated with the failure modes where alarms
are described in the Remarks column in Tables 4-5 and 4-6. A general trouble
alarm may suffice for the governor and positioner (each), as long as a local or
remote indication is provided for determining the failure mode that caused the
alarm. Alarms should be provided by the governor and positioner to the plant
annunciator system in the main control room, via contact or solid state outputs.
3. The taxonomy sheets in Appendix B of this guideline were used to inform the
FMEA worksheets. Numerous internal and external defensive measures are
potentially available as described in the taxonomy sheets, and should be
assessed and included in the final design. Internal defensive measures are those
that are implemented within the components of interest, such as memory integrity
test features that could be embedded within the operating system of the
governor. External defensive measures are those that are implemented outside
the component of interest, such as an input signal validation algorithm
implemented within the positioner in order to detect and alarm misbehaving
governor output signals.
4. Apply security controls described in NEI 08-09 or RG 5.71. The governor and
positioner are critical digital assets that are required to meet the cyber security
rule (10 CFR 73.54).
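
The three signal validation methods named in application note 1 can be prototyped as simple functions. The live-zero and span values below are the example values given in that note; the rate check and the example signal values are otherwise hypothetical:

```python
# Illustrative sketches of the signal validation methods from application note 1.
def out_of_range(signal_ma: float, live_zero: float = 4.0, span_top: float = 20.0) -> bool:
    """Flag a 4-20 mA signal that falls below its live zero or above its calibrated span."""
    return signal_ma < live_zero or signal_ma > span_top

def median_select(a: float, b: float, c: float) -> float:
    """Select the middle of three redundant signals."""
    return sorted((a, b, c))[1]

def rate_limited(new_value: float, old_value: float, dt_s: float,
                 max_rate_pct_per_s: float) -> float:
    """Pass a signal through only if its rate of change (in % of span per second)
    is below the maximum credible rate; otherwise hold the previous value."""
    rate = abs(new_value - old_value) / dt_s
    return new_value if rate <= max_rate_pct_per_s else old_value

# Example: validate three redundant flow signals (values in % of span, hypothetical)
signals = (41.8, 42.1, 97.5)        # third channel misbehaving
print(median_select(*signals))      # 42.1
```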

 4-31 
Table 4-4
Principal HPCI/RCIC Turbine Control Components and Functions

Main Control Room and Remote Shutdown Panel Flow Indicating Controllers: 1. Provide automatic speed demand output to governor on converting a fixed setpoint flow to a turbine speed. 2. Provide manual speed demand output to governor as set by operator. 3. Provide indications of flow setpoint, actual flow, and % output.

Hand switch (HS): Switch speed demand signal from MCR or RSP M/A stations.

Limit Switch (LS): Provide enable signal to governor and positioner when steam admission valve position is > 20% open.

Governor: Provide automatic governor valve position demand signal to digital positioner to compensate for error between actual turbine speed (from MPU) and demanded turbine speed (from M/A stations).

Magnetic Pickup (MPU): Provide actual turbine speed signal to governor.

24 VDC Power: Provide clean, filtered 24 VDC power to governor and positioner.

Governor Program Interface: Provide a port for connecting a programming device to enable configuration changes and configuration audits.

Positioner: Provide automatic governor valve position signal to actuator to compensate for error between actual governor valve position (from resolver feedback) and demanded valve position (from governor).

Resolver Feedback: Provide actuator stem position signal to positioner (actuator stem is coupled directly to governor valve stem).

Actuator: Position the governor valve to the position demanded by the positioner.

Governor Valve: Throttle the steam supplied to the turbine.

Trip/Throttle Valve: Isolate the steam supply to the turbine when a turbine trip signal is received.

Steam Admission Valve: Admit steam to the turbine when a system initiation signal is received.
 4-32 
FIC: Flow Indicating Controller
MCR: Main Control Room Analysis Boundary
RSP: Remote Shutdown Panel
PID: Proportional/Integral/Derivative Enable
HS: Handswitch
MCR FIC
HS Positioner
Speed Position
PID Demand PID S Demand PID System
Initiation
Flow Setpoint
Governor Enable Signal
(RCIC: 500gpm;
HPCI: 5000gpm)

PID Program M
24 Resolver
Interface VDC Actuator
Feedback
LS
RSP FIC From
Main
Steam
FLOW Governor Trip/ Steam
Magnetic
PickUp (MPU)
Valve Throttle Admission
To Valve Valve
Reactor From Torus or
Condensate
Storage Tank

System Initiation Signals System Isolation Signals Turbine Trip Signals


(Open Steam Admission Valve (Trip Turbine & Close Process Valves) (Close Trip/Throttle Valve)
& Process Valves) 1. High Steam Line Flow 1. Any system isolation signal
1. Low Reactor Level (-48") 2. High Area Temperature 2. High Steam Exhaust Pressure (150 psi)
2. High Drywell Pressure 3. Low Steam Line Pressure (HPCI only) 3. High Reactor Level (+46")
(HPCI only; +2 psig) 4. Low Reactor Pressure (RCIC only) 4. Low pump suction pressure (15" Hg)
5. Manual 5. Turbine overspeed
6. Manual (local or remote)

Figure 4-6
HPCI/RCIC Turbine Control System Block Diagram

 4-33 
Table 4-5
HPCI/RCIC Governor Design FMEA Worksheet

Functional Level Diagram Sheet: 1 of 3


System HPCI, RCIC See See
Figure 4-6 Design Phase: Conceptual
Figure 4-4
Subsystem Governor Rev: 0a

Component
Function(s) Failure Modes Failure Mechanisms Effect on System Method of Detection Remarks
Identification

Turbine slows to minimum


Output Fails Saturated output
speed, less than adequate Periodic Test
Offscale Low circuit
HPCI or RCIC flow
1. Include signal validation in the
1. Provide automatic speed governor application logic
demand output to compensate for 2. Provide MCR and RSP alarm
Main Control Room
error between fixed setpoint and connection to governor
and Remote M/A station failure or Turbine overspeeds, trips on
actual flow Output Fails
Shutdown Panel loss of power to M/A high reactor level or Periodic Test
2. Provide manual speed demand Offscale High
Flow Indicating station mechanical overspeed
output as set by operator
Controllers
3. Provide indications of flow
setpoint, actual flow, and % output Consider sending flow setpoint
Indeterminate; depends on signal and actual flow signal to
Output Fails fail as-is value - likely to result governor for validating speed
M/A station failure Periodic Test
As-Is in reactor overfill or underfill, demand signal (by comparing
followed by turbine trip actual demand signal to
expected demand signal)
Fail Open 1. Include signal validation in the
Turbine overspeeds, trips on
(when aligned to Broken or dirty governor application logic
high reactor level or Periodic Test
in-service M/A contacts 2. Provide MCR and RSP alarm
mechanical overspeed
station) connection to governor
1. Use a procedure to set the
Switch speed demand signal from
Handswitch M/A station that is not in-service
MCR or RSP M/A stations Fail Closed (when
Indeterminate; depends on as- to manual mode with a pre-
not aligned to in- Broken contacts,
left settings of M/A station Periodic Test determined output
service M/A conductive debris
not in-service 2. Perform periodic
station)
maintenance to assure
cleanliness

 4-34 
Table 4-5 (continued)
HPCI/RCIC Governor Design FMEA Worksheet

Functional Level Diagram Sheet: 2 of 3


System HPCI, RCIC See Figure 4-6 Design Phase: Conceptual
See Figure 4-4
Subsystem Governor Rev: 0a

Component
Function(s) Failure Modes Failure Mechanisms Effect on System Method of Detection Remarks
Identification
Turbine slows to minimum
Output Fails Ringing, double or
speed, less than adequate Periodic Test 1. Include signal validation in the
Offscale High triple counting
HPCI or RCIC flow governor application logic
Turbine overspeeds, trips on 2. Provide MCR and RSP alarm
Output Fails Mounting failure (falls
high reactor level or Periodic Test connection to governor
Magnetic Pickup Provide actual turbine speed signal Offscale Low off)
mechanical overspeed
(MPU) to governor
1. Consider triple MPUs, use
Indeterminate; depends on signal validation to select best
Excessive Drift Degradation magnitude and direction of Periodic Test one
drift 2. Provide MCR and RSP alarm
connection to governor
Governor stops, outputs go to
Failed power source
Voltage below shelf state, turbine slows to
(battery, charger, bus, Periodic Test
specification minimum speed, less than
voltage regulator)
adequate HPCI or RCIC flow 1. Provide power loopback to
Provide clean, filtered 24 VDC
Governor overvoltage analog input
24 VDC Power power to digital governor and
protection causes it to stop, 2. Provide MCR and RSP alarm
digital positioner
Voltage above Failed voltage outputs go to shelf state, connection to governor
Periodic Test
specification regulator turbine slows to minimum
speed, less than adequate
HPCI or RCIC flow

 4-35 
Table 4-5 (continued)
HPCI/RCIC Governor Design FMEA Worksheet

Functional Level Diagram Sheet: 3of 3


System HPCI, RCIC See Figure 4-6 Design Phase: Conceptual
See Figure 4-4
Subsystem Governor Rev: 0a

Component
Function(s) Failure Modes Failure Mechanisms Effect on System Method of Detection Remarks
Identification

Degradation, See Positioner worksheet;


Provide automatic governor valve
Shorted input conductive debris, same effect as offscale high Periodic Test
position signal to actuator to
faulty connection output from Governor
Positioner compenate for error between See Positioner worksheet
actual governor valve position and
see Positioner worksheet;
demanded governor valve position Degradation, faulty
Open input same effect as offscale low Periodic Test
connection
output from Governor
Steam admission valve opens,
Fail Open (with a but Governor
Misalignment; broken
system initiation not enabled; turbine rolls, Periodic Test
or dirty contacts
Steam signal) governor valve closes @
Perform periodic maintenance
Admission Provide enable signal 1000 rpm, no system flow
to assure alignment and
Valve to governor Governor enabled,
cleanliness
Limit Switch Fail Closed (with governor valve opens,
Misalignment;
no system no turbine response Periodic Test
shorted contacts
initiation signal) because steam admission
valve is closed
1. Establish security controls iaw
NEI 08-09 or RG 5.71 (harden
Provide a port for connecting a
Governor interface port, access controls,
programming device to enable Inadvertent Uncontrolled Indeterminate - depends on
Program Periodic Test or Audit etc.)
configuration changes and Logic Change connection extent of change
Interface 2. Provide alarm upon
configuration audits
connection of programming
device.

 4-36 
Table 4-6
HPCI/RCIC Positioner Design FMEA Worksheet
System: HPCI, RCIC (see Figure 4-6); Subsystem: Positioner; Functional Level: Diagram; Design Phase: Conceptual; Rev: 0a

(Sheet 1 of 2)

Component Identification: Governor
Function(s): Provide automatic governor valve position demand signal to digital positioner to compensate for error between actual turbine speed and demanded turbine speed
- Failure Mode: Output Fails Offscale High | Failure Mechanism: Saturated output circuit | Effect on System: Turbine overspeeds, trips on high reactor level or mechanical overspeed | Method of Detection: Periodic Test | Remarks: 1. Provide multiple outputs of the position demand signal from governor to positioner; 2. Include signal validation in the positioner application logic; 3. Provide MCR and RSP alarm connection to positioner
- Failure Mode: Output Fails Offscale Low | Failure Mechanism: Governor failure or loss of power to Governor | Effect on System: Turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Method of Detection: Periodic Test | Remarks: 1. Provide multiple outputs of the position demand signal from governor to positioner; 2. Include signal validation in the positioner application logic; 3. Provide MCR and RSP alarm connection to positioner
- Failure Mode: Output Fails As-Is | Failure Mechanism: Governor lockup via HW or SW defect | Effect on System: Indeterminate; depends on fail as-is value, likely to result in reactor overfill or underfill, followed by turbine trip | Method of Detection: Periodic Test | Remarks: 1. Ensure governor is supplied with a HW-based watchdog timer that sets outputs to preferred state; 2. Provide MCR and RSP alarm connection to positioner
- Failure Mode: Output High Rate of Change | Failure Mechanism: Step change in output via HW or SW defect | Effect on System: Rapid change in turbine speed and pump flow | Method of Detection: Periodic Test | Remarks: 1. Include rate detection in signal validation logic; 2. Provide MCR and RSP alarm connection to positioner

(Sheet 2 of 2)

Component Identification: Resolver Feedback
Function(s): Provide actuator stem position signal to positioner (actuator stem is coupled directly to governor valve stem)
- Failure Mode: Output Fails Offscale High | Failure Mechanism: Resolver circuit failure (internal to actuator) | Effect on System: Turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Method of Detection: Periodic Test | Remarks: 1. Include signal validation in the governor application logic; 2. Provide MCR and RSP alarm connection to positioner
- Failure Mode: Output Fails Offscale Low | Failure Mechanism: Loss of power to actuator | Effect on System: Turbine overspeeds, trips on high reactor level or mechanical overspeed | Method of Detection: Periodic Test | Remarks: 1. Include signal validation in the governor application logic; 2. Provide MCR and RSP alarm connection to positioner
- Failure Mode: Inaccurate signal | Failure Mechanism: Resolver circuit degradation | Effect on System: Indeterminate; depends on magnitude and direction of error | Method of Detection: Periodic Test
- Failure Mode: Failed mechanical connection between actuator and governor valve | Failure Mechanism: Wear, corrosion, or fatigue at connection point | Effect on System: Governor valve returns to spring-closed position, less than adequate HPCI or RCIC flow | Method of Detection: Periodic Test

Component Identification: 24 VDC Power
Function(s): Provide clean, filtered 24 VDC power to digital governor and digital positioner
- Failure Mode: Voltage below specification | Failure Mechanism: Failed power source (battery, charger, bus, voltage regulator) | Effect on System: Positioner stops, outputs go to shelf state, turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Method of Detection: Periodic Test | Remarks: 1. Provide power loopback to analog input; 2. Provide MCR and RSP alarm connection to positioner
- Failure Mode: Voltage above specification | Failure Mechanism: Failed voltage regulator | Effect on System: Positioner overvoltage protection causes it to stop, outputs go to shelf state, turbine slows to minimum speed, less than adequate HPCI or RCIC flow | Method of Detection: Periodic Test | Remarks: 1. Provide power loopback to analog input; 2. Provide MCR and RSP alarm connection to positioner

Example 4-3. Circ Water System Controls Design FMEA
DFMEA Step 1: Draw a block diagram of the system of interest.
Figure 4-7 provides a block diagram of a hypothetical Distributed Control System
(DCS) functional segment allocated to Circulating Water System (CWS) control
functions.
The DCS architecture in this example is based on a brief search of publicly available
information on more complex, non-1E DCS system architectures, resulting in the
selection of certain features of various DCS architectures in use today. Most non-
safety DCS architectures include several functional segments. This example examines
a Circulating Water segment in isolation because it is sufficiently complex and
functionally isolated from other segments to reveal insights.
DFMEA Step 2: Draw a boundary around the components of interest.
Figure 4-7 shows an analysis boundary around two “divisions” of logic and I/O
cabinets.
DFMEA Step 3: Write a summary description.
Table 4-7, which meets the prerequisite for a Function Analysis (in this case at the
component level), provides a listing of the principal components shown in Figures 4-7
and 4-8, and their functions. A summary description is provided below:
CWS System Description
The circulating water system (CWS) under investigation supplies cooling water to
remove heat from the main condensers, under varying conditions of power plant
operation and site environmental conditions.
The CWS does not have a safety-related function and has no safety design basis.
The power generation design basis of the CWS is to remove heat load during
startup, normal shutdown, transient condition, or turbine trip (when a portion of the
main steam is bypassed to the main condenser via the turbine bypass valves).
See the lower portion of Figure 4-7. The CWS draws water from the cooling tower
basins, and returns water to the CWS cooling tower basins after passing through the
main condenser.
The CWS supplies cooling water at the specified flow rate to condense the steam in
the condenser. The CWS is automatically isolated in the event of gross leakage into
the turbine building (TB) condenser area to prevent flooding of the Turbine Building.
The CWS is designed such that a failure in a CWS component (piping, cooling
tower, expansion joint, pump, etc.) does not have a detrimental effect on any safety-
related equipment.
The CWS is composed of six, 25% capacity circulating water pumps, and two
cooling towers (each with their own basin). Other typical CWS components, such as
make-up pumps, waterbox isolation valves, and cooling tower fans are omitted from
this paper because they have no bearing on the analysis. During normal operations
at 100% power, two pumps are running in each basin, with one pump on standby
in each basin.
The circulating water pumps are located in the cooling tower basins, take suction
from the basin, and pump water through the main condenser under varying plant
loads and design basis weather conditions. The cooling towers are each sized for
75% of normal power operation load. The discharge pipe from each of the
circulating water pumps is connected to a common pipe that delivers water to each
condenser. The discharge pipe from each pump is equipped with a Motor Operated
Valve (MOV) to enable isolation. The isolation MOV prevents backflow through its
associated pump when it is idle.
Basic DCS Design
The basic design of the non-1E Distributed Control System (DCS) includes two sets of
logic cabinets (A & B), two sets of I/O cabinets (A & B) and a set of human-system
interface (HSI) workstations. All of the cabinets and workstations are connected to
redundant data communication busses (Comm 1 and Comm 2).
The upper portion of Figure 4-7 illustrates this basic DCS architecture via the
segment that monitors the CWS pumps and controls their discharge valves. Other
DCS segments associated with other non-1E functions are omitted for clarity.
The DCS monitors and controls non-1E equipment using a master/slave controller
architecture. In Figure 4-7, the master controller for all 6 MOVs is shown in Logic
Cabinet A. The controller in Logic Cabinet B is in “slave” mode, following the status
of the Master controller, and is able to take control of the MOVs in the event of a
failure of the Master. Logic Cabinets A & B are located in an equipment room
adjacent to the Main Control Room.
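
Master/slave takeover of the kind described here is often driven by the slave monitoring the master's status over the redundant busses. The sketch below is a hypothetical illustration of that idea only; it is not the failover logic of any particular DCS product, and the timeout value is a placeholder:

```python
# Hypothetical sketch of slave-controller takeover based on master status messages.
import time

HEARTBEAT_TIMEOUT_S = 2.0    # illustrative value

class SlaveController:
    def __init__(self) -> None:
        self.in_control = False
        self.last_master_status_s = time.monotonic()

    def on_master_status(self, status_ok: bool) -> None:
        """Called whenever a master status message arrives on COMM1 or COMM2."""
        self.last_master_status_s = time.monotonic()
        if status_ok and self.in_control:
            self.in_control = False      # master healthy again; return to follow mode

    def poll(self) -> None:
        """Called each cycle: assume control if the master status has gone stale."""
        stale = time.monotonic() - self.last_master_status_s > HEARTBEAT_TIMEOUT_S
        if stale and not self.in_control:
            self.in_control = True       # slave takes control of the MOVs
```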
I/O cabinets A & B are each remotely located in a separate, secure and
environmentally controlled structure near each cooling tower, which are some
distance from the Main Control Room. I/O Cabinet A contains digital input modules
that monitor the position of the 4KV breakers that provide power to the motors for
Pump-1 thru Pump-3, and digital output modules that position MOV-1 thru MOV-3
(open or closed). Likewise, I/O cabinet B provides the same functions for Pump-4
thru Pump-6 and MOV-4 thru MOV-6. Note that the digital DCS equipment does not
control the CWS pumps; this function is allocated to HS-1 and the 4KV switchgear.
The application logic for opening or closing MOV-1 thru MOV-6 runs in each
controller in Logic Cabinets A & B. Figure 4-8 illustrates this logic for MOV-1, but is
typical for all six MOVs. Please note that the logic in a typical CWS pump and
valve design is more complex than shown in Figure 4-8. It is simplified here to
provide a reasonably sufficient demonstration of the Design FMEA method on a DCS
segment.
CWS Pump-1 Functional Sequences
To further describe the CWS Pump controls, the following functional sequences are
helpful; a simplified sketch of the corresponding controller logic follows the sequences.
The following sequence will initiate operation of CWS Pump-1:
a. An operator at one of the two HSI workstations will select MOV-1 and command
it to open
b. An “open” command will be included in a message that passes between the HSI
workstation and Logic Cabinet A via the COMM1 and COMM2 busses
c. The application software in the Master Controller will send the command to
DO1 in I/O Cabinet A through the COMM1 and COMM2 busses
d. Digital Output 1 (DO1) will close
e. Relay R1 will energize
f. Contact R1-1 will close

 4-40 
Example 4-3. Circ Water System Controls Design FMEA (continued)
g. MOV-1 will move in the open direction until both limit switches LS1 and LS2
open (note that Figure 4-8 shows MOV-1 already in the open position)
h. Because HS1 is spring-return-to-auto, contact HS1-1 is normally closed
i. Limit switch LS5 will close when MOV-1 reaches 20% open (upon opening)
j. The Close coil in the 4Kv switchgear for Pump-1 will energize and contact C1
will seal-in
k. Pump-1 will start
In the event of a trip of CWS Pump-1, the following sequence will occur:
a. The Trip coil in the 4 Kv switchgear will energize and contact T1 will seal-in
(either due to an automatic pump trip signal, such as overcurrent protection, or
manually through use of HS1)
b. The breaker for Pump-1 will open
c. Contact T2 will close (indicating that the trip coil is energized and the pump
breaker is open)
d. Pump-1 will stop
e. Digital Input 1 in I/O Cabinet A will sense that contact T2 is closed
f. Messages passing from I/O Cabinet A to Logic Cabinet A via the COMM1 and
COMM2 busses will include data indicating that contact T2 is closed (thus
indicating Pump-1 is “Off”)
g. The application software in the Master Controller will register the status of Pump-
1 and will automatically initiate a “close” command to MOV-1
h. The close command will be included in messages from Logic Cabinet A to I/O
Cabinet A
i. Digital Output 1 (DO1) will open
j. Relay R1 will de-energize
k. Contact R1-2 will close
l. MOV-1 will move in the closed direction until limit switches LS3 and LS4 open
(MOV closed)
In the event of a manually commanded closure of MOV-1 from one of the HSI
workstations, the following sequence will occur:
a. An operator will select MOV-1 and command it to close
b. A “close” command will be included in a message that passes between the HSI
workstation and Logic Cabinet A via the COMM1 and COMM2 busses
c. The application software in the Master Controller will send the command to
DO1 in I/O Cabinet A through the COMM1 and COMM2 busses
d. Digital Output 1 (DO1) will open
e. Relay R1 will de-energize
f. Contact R1-2 will close
g. MOV-1 will move in the closed direction until limit switches LS3 and LS4 open
(MOV closed)
h. Limit switch LS6 will close
i. The Trip coil in the 4 Kv switchgear will energize and contact T1 will seal-in

 4-41 
Example 4-3. Circ Water System Controls Design FMEA (continued)
j. Contact T2 will close
k. Pump-1 will stop
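
The parts of these sequences handled by the Master Controller application software (passing operator commands to DO1, and automatically closing MOV-1 when a pump trip is sensed through DI1) can be sketched as follows. The names mirror Figure 4-8, but the logic is a simplification of the description above, not vendor code:

```python
# Simplified sketch of the Master Controller application logic for one pump/MOV pair,
# following the functional sequences above.  I/O is represented by a callable.
from typing import Callable, Optional

def mov1_logic(operator_command: Optional[str], pump1_running: bool,
               set_do1: Callable[[bool], None]) -> None:
    """operator_command: "open", "close", or None (from an HSI workstation message).
    pump1_running: status derived from DI1 (contact T2 closed indicates Pump-1 off).
    set_do1: closes (True) or opens (False) digital output DO1, which drives relay R1."""
    if operator_command == "open":
        set_do1(True)        # energize R1: MOV-1 drives open, permitting the pump start
    elif operator_command == "close":
        set_do1(False)       # de-energize R1: MOV-1 drives closed
    elif not pump1_running:
        set_do1(False)       # pump trip sensed via DI1: automatically close MOV-1

# Example with a stand-in output function
def set_do1(closed: bool) -> None:
    print(f"DO1 {'closed' if closed else 'open'}")

mov1_logic(operator_command=None, pump1_running=False, set_do1=set_do1)   # prints "DO1 open"
```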
DFMEA Step 4: Prepare an FMEA worksheet for each device or component of
interest.
Design FMEA worksheets are provided in Table 4-8 for I/O Cabinet A, Table 4-9
for Logic Cabinet A, and Table 4-10 for the HSI Workstations.
In this example there are three FMEA worksheets, differentiated by “subsystem” in
the upper left corner, because there are three subsystems of interest in the analysis.
DFMEA Step 5: On each worksheet, identify the interfacing components, signals,
power supplies, and other interfaces that can affect the functions or performance of
the components of interest.
In this example, the results of Step 5 are shown in Tables 4-8, 4-9, and 4-10 under
the column labeled “Component Identification.”
DFMEA Step 6: Determine the failure modes of each interfacing component,
signal, power supply or other interface.
In this example, the results of Step 6 are shown in Tables 4-8, 4-9, and 4-10 under
the column labeled “Failure Modes.”
DFMEA Step 7: Determine the failure mechanisms associated with each failure
mode identified in Step 6.
In this example, the results of Step 7 are shown in Tables 4-8, 4-9, and 4-10 under
the column labeled “Failure Mechanisms.”
DFMEA Step 8: Determine the resulting effects that each interfacing component
failure mode can have on the components of interest, and the resulting effects on the
system.
In this example, these effects are listed in Tables 4-8, 4-9, and 4-10, in the column
labeled “Effects on System.”
DFMEA Step 9: Determine the methods of detection for each failure mode
identified in Step 6.
In this example, the methods of detection are listed in Tables 4-8, 4-9, and 4-10 in
the column labeled “Method of Detection.” Note that hardware or software features
have been identified that can detect, mitigate and provide an indication and/or
alarm associated with the identified failure modes.
DFMEA Step 10: Provide remarks.
In this example, the remarks listed in Tables 4-8, 4-9, and 4-10 are centered on
typical alarms and indications that would be provided in a DCS segment such as the
one described in this example. They are omitted from the logic shown in Figure 4-8
for brevity.
DFMEA Step 11: Analyze redundancies.
In this example, Tables 4-8, 4-9, and 4-10 provide sufficient information for a
Design FMEA because each redundancy is identical, and meets the criteria for
analyzing a single redundancy described in Section 4.4.
DFMEA Step 12: Apply the results.
The results of this example could be used to verify adequate coverage of equipment
failure modes; verify expected alarms and indications of failures; validate the results
during a FAT, SAT, or Post-Mod Test activity; and update operations procedures and
alarm response guides as needed. The following application notes are also
considered:
DFMEA Application Notes
The following insights were obtained from the FMEA “Remarks” column, and should
be assessed for possible inclusion in any planned modifications to the CWS control
system.
1. Typical alarms and associated logic are assumed to be implemented within the
DCS to annunciate loss of one or more modules.
2. It is assumed that adequate time is available for an operator to recognize and
respond to a loss of one CWS pump with a manual action before the turbine
trips on low condenser vacuum.
3. For a failed digital input module, such as DI1 (see Table 4-8, Sheet 2 of 3),
alarm logic should be developed for the case of conflicting indications such as
“pump on” concurrent with “MOV closed” (a minimal sketch of this check follows
these notes).
4. The taxonomy sheets in Appendix B of this guideline were used to inform the
FMEA worksheets. Numerous internal and external defensive measures are
potentially available as described in the taxonomy sheets, and should be
assessed and included in the final design. Internal defensive measures are those
that are implemented within the components of interest, such as memory integrity
test features that could be embedded within the operating system of the
controllers. External defensive measures are those that are implemented outside
the component of interest, such as alarm logic in the Master Controller that
detects and annunciates misbehaving or conflicting inputs from the I/O cabinets.
5. Apply security controls described in NEI 08-09 or RG 5.71. The components in
the CWS control system are critical digital assets that are required to meet the
cyber security rule, 10 CFR 73.54.
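
The conflicting-indication alarm suggested in application note 3 reduces to a simple condition on the two status inputs. A minimal sketch (the signal names follow the example; the alarm handling itself is hypothetical):

```python
# Minimal sketch of the conflicting-indication alarm from application note 3.
def conflicting_indication(pump_on: bool, mov_closed: bool) -> bool:
    """Alarm when the DCS sees the pump running while its discharge MOV indicates
    closed, which in this example points to a failed digital input or switchgear
    contact rather than a valid plant state."""
    return pump_on and mov_closed

if conflicting_indication(pump_on=True, mov_closed=True):
    print("ALARM: Pump-1 indicated ON with MOV-1 indicated CLOSED - check DI1/T2")
```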

 4-43 
ANALYSIS BOUNDARY
Logic Cabinet A Logic Cabinet B
COMM 2 COMM 2
COMM 1 COMM 1
Each Controller Is
MASTER SLAVE
Programmed to Control All
CONTROLLER CONTROLLER
Six Valves (Master/Slave)

I/O Cabinet A I/O Cabinet B


COMM 1 COMM 1
COMM 2 COMM 2

D D D D D D D D D D D D
I O I O I O O I O I O I
1 1 2 2 3 3 1 1 2 2 3 3

4 KV

CONDENSER CONDENSER CONDENSER

M M M M M M
COOLING M M M M M M COOLING
TOWER TOWER
A B
MOV-1 MOV-2 MOV-3 MOV-4 MOV-5 MOV-6

Normal Operation
PUMP-1 PUMP-2 PUMP-3 (Two Valves Open in PUMP-4 PUMP-5 PUMP-6
Each Basin)

Figure 4-7
Circulating Water System DCS Segment

 4-44 
[Figure: control circuit and logic for MOV-1 (typical for all 6 MOVs; MOV-1 shown open). The 4 KV switchgear for Pump-1 includes the Close and Trip coils, seal-in contacts C1 and T1, status contact T2 wired to digital input DI-1 in I/O Cabinet A, handswitch HS1 (TRIP/AUTO/CLOSE) with contacts HS1-1 and HS1-2, and limit switch permissives LS5 (closes at 20% open) and LS6. The MOV-1 control circuit, fed from Instrument AC control power, includes open and close contactors, contacts TS1, TS2, and LS1 through LS4, and relay R1 with contacts R1-1 and R1-2; relay R1 is driven by digital output DO-1. Logic Cabinet A (Master Controller) and HSI 1 and HSI 2 communicate with the I/O cabinet over the COMM 1 and COMM 2 busses. A limit switch state table indicates which of LS1 through LS6 are actuated in the open, intermediate, and closed valve positions.]

Figure 4-8
CWS MOV Control Circuit & Logic

Table 4-7
Principal CWS Components and Functions

Pump-1 4Kv Switchgear: 1. Connect or disconnect 4 Kv electric power to the terminals on the motor that drives Pump-1. 2. Provide contact closure input, via dry contact T2, to digital input DI1 in I/O Cabinet A (contact T2 is closed when the Trip Coil is energized).

MOV-1: Isolate the discharge of Pump-1.

I/O Cabinet A Digital Input 1 (DI1): 1. Sense the state of contact T2. 2. Interface with modules COMM1 and COMM2 in I/O Cabinet A.

I/O Cabinet A Digital Output 1 (DO1): 1. Interface with modules COMM1 and COMM2 in I/O Cabinet A. 2. Open or close its output contact in response to COMM1 and COMM2 data. 3. Upon loss of communication, fail to shelf state (open).

Relay R1: Interface between Logic Cabinet A, DO1, and the control circuit for MOV-1.

I/O Cabinet A COMM1: 1. Acquire data from input modules in I/O Cabinet A. 2. Deliver data to output modules in I/O Cabinet A. 3. Send and receive messages to/from various addresses on the COMM1 bus.

I/O Cabinet A COMM2: 1. Acquire data from input modules in I/O Cabinet A. 2. Deliver data to output modules in I/O Cabinet A. 3. Send and receive messages to/from various addresses on the COMM2 bus.

Control Power: Provide clean, filtered power to the coil of relay R1 when DO1 is closed.

Instrument AC power: Provide clean, filtered, redundant 120 VAC power to the DCS cabinets (internal cabinet power supplies not shown).

HSI Workstations (HSI1 and HSI2): 1. Provide indications and an operator interface for manually controlling plant components connected to the plant control system. 2. Send and receive data to/from PCS controllers via COMM1 or COMM2.

Logic Cabinet A COMM1: 1. Acquire inbound messages that are addressed to the Master Controller. 2. Deliver data to the Master Controller. 3. Acquire data from the Master Controller. 4. Deliver outbound messages to various addresses on the COMM1 bus.

Logic Cabinet A COMM2: 1. Acquire inbound messages that are addressed to the Master Controller. 2. Deliver data to the Master Controller. 3. Acquire data from the Master Controller. 4. Deliver outbound messages to various addresses on the COMM2 bus.

Master Controller: Execute the application software logic.

Slave Controller: Execute the application software logic (including takeover if the Master Controller fails).

Table 4-8
CWS I/O Cabinet A FMEA Worksheets

Sheet: 1 of 3
System: CWS (see Figure 4-7 and Figure 4-8)
Subsystem: I/O Cabinet A
Functional Level Diagram: see Figures 4-5 and 4-6
Design Phase: Detailed
Rev: 0a

Component Identification: 4 Kv Switchgear, Pump-1
Function(s):
1. Connect or disconnect 4 Kv electric power to the terminals on the motor that drives Pump-1.
2. Provide contact closure input, via dry contact T2, to digital input DI1 in I/O Cabinet A (contact T2 is closed when the Trip Coil is energized).

Failure Mode: Inadvertent trip
Failure Mechanisms: 1. Faulty protection devices or circuits; 2. Operator error; 3. Spurious closure of MOV-1 (induced by PCS failure)
Effect on System: 1. Loss of Pump-1; 2. Automatic closure of MOV-1 (if not closed); 3. Operator opens standby MOV and associated pump starts
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: 1. Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity; 2. Assume adequate time is available for the operator to initiate operation of the standby pump before turbine trip on low vacuum

Failure Mode: Spurious closure
Failure Mechanisms: 1. Faulty switchgear interlocks; 2. MOV-1 limit switch LS5 fails closed
Effect on System: 1. False indication of Pump-1 OFF; 2. MOV-1 closes; 3. Pump-1 deadheads against closed MOV-1; 4. Overload protection trips Pump-1 breaker; 5. Operator opens standby MOV and associated pump starts
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Same as above

Failure Mode: Switchgear contact T2 fails closed (with breaker open)
Failure Mechanisms: 1. Debris; 2. Contact short
Effect on System: 1. False indication of Pump-1 OFF; 2. MOV-1 closes; 3. Pump-1 deadheads against closed MOV-1; 4. Overload protection trips Pump-1 breaker; 5. Operator opens standby MOV and associated pump starts
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Same as above

Failure Mode: Switchgear contact T2 fails open (with breaker closed)
Failure Mechanisms: 1. Contact failure
Effect on System: 1. False indication of Pump-1 ON
Method of Detection: Conflicting indications
Remarks: Develop alarm logic for "pump on" AND "MOV-1 closed"

Table 4-8 (continued)
CWS I/O Cabinet A FMEA Worksheets

Sheet: 2 of 3
System: CWS (see Figure 4-7 and Figure 4-8)
Subsystem: I/O Cabinet A
Functional Level Diagram: see Figures 4-5 and 4-6
Design Phase: Detailed
Rev: 0a

Component Identification: Digital Input DI1, I/O Cabinet A
Function(s):
1. Sense the state of contact T2.
2. Interface with modules COMM1 and COMM2 in I/O Cabinet A.

Failure Mode: False indication of contact T2 open (when actually closed)
Failure Mechanisms: Internal failure mechanism
Effect on System: 1. False indication of Pump-1 ON
Method of Detection: Conflicting indications
Remarks: Develop alarm logic for "pump on" AND "MOV-1 closed"

Failure Mode: False indication of contact T2 closed (when actually open)
Failure Mechanisms: 1. Shorted input; 2. Internal failure mechanism
Effect on System: 1. False indication of Pump-1 OFF; 2. MOV-1 closes; 3. Pump-1 deadheads against closed MOV-1; 4. Overload protection trips Pump-1 breaker; 5. Operator opens standby MOV, associated pump starts
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Component Identification: Digital Output DO1, I/O Cabinet A
Function(s):
1. Interface with modules COMM1 and COMM2 in I/O Cabinet A.
2. Open or close its output contact in response to COMM1 and COMM2 data.
3. Upon loss of communication, fail to shelf state (open).

Failure Mode: Fail closed (with MOV-1 closed and no demand to open it)
Failure Mechanisms: 1. Internal failure mechanism; 2. Shorted output
Effect on System: 1. MOV-1 opens; 2. Pump-1 starts
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Failure Mode: Fail open (with MOV-1 open and no demand to close it)
Failure Mechanisms: Internal failure mechanism
Effect on System: 1. False indication of Pump-1 OFF; 2. MOV-1 closes; 3. Pump-1 deadheads against closed MOV-1; 4. Overload protection trips Pump-1 breaker; 5. Operator opens standby MOV, associated pump starts
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: 1. Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity; 2. Assume adequate time is available for the operator to initiate operation of the standby pump before turbine trip on low vacuum

Table 4-8 (continued)
CWS I/O Cabinet A FMEA Worksheets

Sheet: 3 of 3
System: CWS (see Figure 4-7 and Figure 4-8)
Subsystem: I/O Cabinet A
Functional Level Diagram: see Figures 4-5 and 4-6
Design Phase: Detailed
Rev: 0a

Component Identification: COMM1, I/O Cabinet A
Function(s):
1. Acquire inbound messages that are addressed to the Master Controller.
2. Deliver data to the Master Controller.
3. Acquire data from the Master Controller.
4. Deliver outbound messages to various addresses on the COMM1 bus.

Failure Mode: Loss of communication
Failure Mechanisms: Internal failure mechanism
Effect on System: 1. COMM1 bus splits into two loops; 2. Loss of COMM1 connectivity between I/O Cabinet A and Logic Cabinet A; 3. COMM2 bus remains intact; 4. No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Component Identification: COMM2, I/O Cabinet A
Function(s):
1. Acquire inbound messages that are addressed to the Master Controller.
2. Deliver data to the Master Controller.
3. Acquire data from the Master Controller.
4. Deliver outbound messages to various addresses on the COMM2 bus.

Failure Mode: Loss of communication
Failure Mechanisms: Internal failure mechanism
Effect on System: 1. COMM2 bus splits into two loops; 2. Loss of COMM2 connectivity between I/O Cabinet A and Logic Cabinet A; 3. COMM1 bus remains intact; 4. No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Component Identification: Instrument AC
Function(s): Provide clean, filtered, redundant 120 VAC power to the PCS cabinets.

Failure Mode: Loss of one bus
Failure Mechanisms: 1. 120 VAC breaker trips on bus fault; 2. Inadvertent trip of 120 VAC breaker
Effect on System: No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Component Identification: Control Power
Function(s): Provide control power to the coil of relay R1 when DO1 is closed.

Failure Mode: Loss of control power (with MOV-1 open)
Failure Mechanisms: 1. Breaker opens on fault; 2. Inadvertent breaker trip
Effect on System: 1. False indication of Pump-1 OFF; 2. MOV-1 closes; 3. Pump-1 deadheads against closed MOV-1; 4. Overload protection trips Pump-1 breaker; 5. Operator opens standby MOV, associated pump starts
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: 1. Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity; 2. Assume adequate time is available for the operator to initiate operation of the standby pump before turbine trip on low vacuum

Table 4-9
CWS Logic Cabinet A FMEA Worksheets

Sheet: 1 of 2
System: CWS (see Figure 4-7 and Figure 4-8)
Subsystem: Logic Cabinet A
Functional Level Diagram: see Figures 4-5 and 4-6
Design Phase: Detailed
Rev: 0a

Component Identification: Master Controller
Function(s): Execute the application software logic.

Failure Mode: Failure to Boot or Reset
Failure Mechanisms: 1. CPU data corruption; 2. CPU logic error; 3. Lost or corrupted ROM data
Effect on System: 1. Slave controller in service; 2. No effect on system
Method of Detection: 1. Quality program during design & mfg.; 2. Successful operating history
Remarks: 1. Assuming quality methods are effective against design and mfg. flaws; 2. Assuming redundant cooling fans

Failure Mode: Controller Lockup
Failure Mechanisms: 1. CPU halt; 2. CPU crash; 3. Stopped internal clock
Effect on System: 1. Watchdog timer times out; 2. Slave controller takes over; 3. No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Failure Mode: Loss of Data Processing
Failure Mechanisms: 1. CPU data corruption; 2. CPU logic error; 3. Lost or corrupted RAM data; 4. Failed backplane interface
Effect on System: 1. Watchdog timer times out; 2. Slave controller takes over; 3. No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Failure Mode: Dead Controller
Failure Mechanisms: 1. Failed internal power supply; 2. Line voltage below spec
Effect on System: 1. Watchdog timer times out; 2. Slave controller takes over; 3. No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Table 4-9 (continued)
CWS Logic Cabinet A FMEA Worksheets

Sheet: 2 of 2
System: CWS (see Figure 4-7 and Figure 4-8)
Subsystem: Logic Cabinet A
Functional Level Diagram: see Figures 4-5 and 4-6
Design Phase: Detailed
Rev: 0a

Component Identification: COMM1, Logic Cabinet A
Function(s):
1. Acquire inbound messages that are addressed to the Master Controller.
2. Deliver data to the Master Controller.
3. Acquire data from the Master Controller.
4. Deliver outbound messages to various addresses on the COMM1 bus.

Failure Mode: Loss of communication
Failure Mechanisms: Internal failure mechanism
Effect on System: 1. COMM1 bus splits into two loops; 2. Loss of COMM1 connectivity between I/O Cabinet A and Logic Cabinet A; 3. COMM2 bus remains intact; 4. No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Component Identification: COMM2, Logic Cabinet A
Function(s):
1. Acquire inbound messages that are addressed to the Master Controller.
2. Deliver data to the Master Controller.
3. Acquire data from the Master Controller.
4. Deliver outbound messages to various addresses on the COMM2 bus.

Failure Mode: Loss of communication
Failure Mechanisms: Internal failure mechanism
Effect on System: 1. COMM2 bus splits into two loops; 2. Loss of COMM2 connectivity between I/O Cabinet A and Logic Cabinet A; 3. COMM1 bus remains intact; 4. No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Component Identification: Instrument AC
Function(s): Provide clean, filtered, redundant 120 VAC power to the PCS cabinets.

Failure Mode: Loss of one bus
Failure Mechanisms: 1. 120 VAC breaker trips on bus fault; 2. Inadvertent trip of 120 VAC breaker
Effect on System: No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Table 4-10
CWS HSI Workstation FMEA Worksheets

Sheet: 1 of 2
System: CWS (see Figure 4-7 and Figure 4-8)
Subsystem: HSI Workstations
Functional Level Diagram: see Figures 4-5 and 4-6
Design Phase: Detailed
Rev: 0a

Component Identification: HSI1
Function(s):
1. Provide indications and an operator interface for manually controlling plant components connected to the plant control system.
2. Send and receive data to/from PCS controllers via COMM1 or COMM2.

Failure Mode: Workstation locks up (display freeze)
Failure Mechanisms: 1. Design flaw; 2. Mfg. defect; 3. Bit error; 4. Overheating
Effect on System: 1. HSI2 still available; 2. Controllers not affected; 3. No effect on system
Method of Detection: 1. Loss of heartbeat signal; 2. Alarm; 3. Conflicting indications
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Failure Mode: Workstation shutdown
Failure Mechanisms: 1. Design flaw; 2. Mfg. defect; 3. Bit error; 4. Failed connection(s); 5. Overheating; 6. Power supply failure; 7. Power supply dip
Effect on System: 1. HSI2 still available; 2. Controllers not affected; 3. No effect on system
Method of Detection: 1. Blank display

Failure Mode: Erroneous MOV-1 open command (when closed)
Failure Mechanisms: 1. Operator error; 2. Logic error; 3. Bit error
Effect on System: 1. MOV-1 opens; 2. Pump-1 starts
Method of Detection: Indications on HSI1 and HSI2
Remarks: Assuming HSI displays are programmed to display pump status as sensed by DI1

Failure Mode: Erroneous MOV-1 close command (when open)
Failure Mechanisms: 1. Operator error; 2. Logic error; 3. Bit error
Effect on System: 1. MOV-1 closes; 2. Pump-1 deadheads against closed MOV-1; 3. Overload protection trips Pump-1 breaker; 4. Operator opens standby MOV, associated pump starts
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: 1. Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity; 2. Assume adequate time is available for the operator to initiate operation of the standby pump before turbine trip on low vacuum

Table 4-10 (continued)
CWS HSI Workstation FMEA Worksheets

Sheet: 2 of 2
System: CWS (see Figure 4-7 and Figure 4-8)
Subsystem: HSI Workstations
Functional Level Diagram: see Figures 4-5 and 4-6
Design Phase: Detailed
Rev: 0a

Component Identification: HSI2
Function(s):
1. Provide indications and an operator interface for manually controlling plant components connected to the plant control system.
2. Send and receive data to/from PCS controllers via COMM1 or COMM2.

Failure Modes, Failure Mechanisms, Effects on System, Methods of Detection, and Remarks: Same as for HSI1 above (except substitute HSI1 for HSI2).

Component Identification: COMM1
Function(s):
1. Acquire inbound messages that are addressed to HSI1.
2. Deliver HSI1 data to the Master and Slave Controllers.

Failure Mode: Loss of communication
Failure Mechanisms: Internal failure mechanism
Effect on System: 1. COMM1 bus splits into two loops; 2. Loss of COMM1 connectivity between HSI1 and PCS; 3. COMM2 bus remains intact; 4. HSI2 still available; 5. No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

Component Identification: COMM2
Function(s):
1. Acquire inbound messages that are addressed to HSI2.
2. Deliver HSI2 data to the Master and Slave Controllers.

Failure Mode: Loss of communication
Failure Mechanisms: Internal failure mechanism
Effect on System: 1. COMM2 bus splits into two loops; 2. Loss of COMM2 connectivity between HSI2 and PCS; 3. COMM1 bus remains intact; 4. HSI1 still available; 5. No effect on system
Method of Detection: 1. Indications on HSI1 and HSI2; 2. Alarms
Remarks: Typical alarms and associated logic not shown in Figures 4-7 and 4-8 for simplicity

4.6 Applying the FMEA Results

The FFMEA and DFMEA processes and results can be used in support of the
following activities:

Platform Development

Digital I&C platform (or component) development activities are performed by the equipment vendor. The FMEA process should be applied to each
component that makes up the platform. The level of detail in the component
FMEAs should be driven down to the individual devices that make up each
component. Appendix B includes taxonomy sheets for typical devices found in
digital platforms or components.

The FMEA results can then be used by the equipment vendor to improve
component designs through the platform or component development lifecycle
process, and ultimately support calculations that demonstrate equipment
reliability claims.

Plant owner/operators can use this guideline, and Appendix B in particular, to determine if an equipment vendor is addressing device-level failure mechanisms
in their component designs, and if they are applying appropriate defensive
measures. This guideline can also be used in conjunction with guidance on
Critical Digital Reviews provided in Reference 17.

Application Development

System or component development at the application level is generally considered an integration activity. The integrator role may be performed by the
equipment vendor, a third-party systems integrator (e.g., NSSS vendor), an
architect/engineer, or the plant owner/operator.

The DFMEA process should be applied on the digital system. The level of detail
in the FMEA should be driven down to the individual components that make up
the system. Appendix B includes taxonomy sheets for typical components found
in digital I&C systems.

The FMEA results can then be used by the integrator to improve system designs
through the application development lifecycle process. The conceptual design
phase of the lifecycle process should include a preliminary hazards analysis, which
can take the form of a preliminary FMEA, such as the one described in Example
4-1. A preliminary FMEA should be used to identify and reduce or eliminate
potential vulnerabilities in the system as the design activities progress. Some
vulnerabilities may be eliminated or mitigated to a reasonable extent through one
or more defensive measures that are realized through design requirements and/or
plant programs and processes. For guidance on applying defensive measures in
digital I&C systems, see References 20 and 21.

 4-54 
The FMEA should be updated through the design process, or when the design is
complete, to reflect the finished design at an appropriate application baseline. For
guidance on determining baselines, see EPRI 1022991 (Reference 18). Note that
the FMEA should reflect the design details (e.g., all interfaces), but it should still
reflect postulated failure modes and mechanisms, even if the detailed design can
demonstrate that the likelihood of some failure modes is reasonably low.

Methods of detecting each failure mode or failure mechanism should be carefully considered and evaluated during system design. Alarms, indications, event logs,
and other sources of information that can reveal failure mechanisms and failure
modes through automatic diagnostic tests or surveillance tests should be applied
to the extent they don’t adversely impact safety or mission-critical functions. The
system design should reduce or eliminate undetectable failure modes or failure
mechanisms as much as possible.
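
To make this screening concrete, the sketch below is one way to hold FMEA rows as structured records and flag any entry with no recorded method of detection. It is illustrative only; the field names and the relay entry are hypothetical and are not taken from the worksheets above.

from dataclasses import dataclass, field
from typing import List

@dataclass
class FmeaEntry:
    """One row of a Design FMEA worksheet (columns mirror Tables 4-8 through 4-10)."""
    component: str
    failure_mode: str
    failure_mechanisms: List[str]
    effects_on_system: List[str]
    methods_of_detection: List[str] = field(default_factory=list)
    remarks: str = ""

def undetectable(entries: List[FmeaEntry]) -> List[FmeaEntry]:
    """Return entries with no automatic or procedural method of detection recorded."""
    return [e for e in entries if not e.methods_of_detection]

worksheet = [
    FmeaEntry("I/O Cabinet A DO1", "Fail open (MOV-1 open, no demand to close)",
              ["Internal failure mechanism"],
              ["MOV-1 closes", "Pump-1 deadheads against closed MOV-1"],
              ["Indications on HSI1 and HSI2", "Alarms"]),
    FmeaEntry("Relay R1", "Coil fails open",   # hypothetical entry for illustration
              ["Internal failure mechanism"],
              ["MOV-1 closes"]),               # no detection method recorded yet
]

for entry in undetectable(worksheet):
    print(f"Review needed: {entry.component} - {entry.failure_mode}")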

The finished FMEA should be validated, at least to the extent that failure
mechanisms can be tested without extraordinary conditions or destructive
methods, in the test phase of the application development lifecycle. FMEA
validation test cases can be executed at the Factory Acceptance Test (FAT), Site
Acceptance Test (SAT) or during post-installation testing. Additional guidance
on testing is provided in Reference 32.
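
As a companion illustration (again only a sketch, with hypothetical record and test-case fields), FAT or SAT validation cases can be generated directly from the worksheet rows so that every failure mode with a claimed method of detection gets at least one non-destructive test.

# Sketch only: derive FAT/SAT validation test cases from FMEA rows.
# Dictionary keys below are illustrative, not a prescribed worksheet schema.
fmea_rows = [
    {"component": "Instrument AC", "failure_mode": "Loss of one bus",
     "detection": ["Indications on HSI1 and HSI2", "Alarms"], "destructive": False},
    {"component": "Master Controller", "failure_mode": "Dead Controller",
     "detection": ["Indications on HSI1 and HSI2", "Alarms"], "destructive": False},
    {"component": "4 Kv Switchgear", "failure_mode": "Spurious closure",
     "detection": ["Indications on HSI1 and HSI2"], "destructive": True},
]

test_cases = []
for row in fmea_rows:
    if row["destructive"]:
        continue  # destructive or extraordinary conditions: document the exclusion instead
    for expected in row["detection"]:
        test_cases.append({
            "title": f'Validate detection of "{row["failure_mode"]}" on {row["component"]}',
            "action": "Simulate or inject the failure mode at the component interface",
            "expected_result": expected,
        })

for case in test_cases:
    print(case["title"], "->", case["expected_result"])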

 4-55 
Case Study 4-1. Inadequate Configuration Control and Testing
An operating plant purchased a digital rod control system from a third-party
integrator. The system was equipped with control rod drive motor power supplies
that each provided a feedback signal, proportional to the power supply output
voltage, to a digital controller. The controller application program included a function block to monitor the feedback signals, raise an alarm when any power supply output voltage approached its limits and, if a voltage limit was exceeded, disable (turn off) the power supply.
A system-level FMEA was developed by the system integrator. It included
component-level failure modes, and in some cases, went down to the device level
failure mechanisms in some of the components. The internal chip that provides
the feedback signal from each power supply was evaluated in the FMEA, which
concluded that the signal could drift out of tolerance, and if it did, would be
automatically detected via the warning alarm prior to disabling the associated
power supply.
During the FAT, local power quality issues in the FAT environment caused control
rod power supply voltage variations, which in turn caused the controller to raise
numerous alarms. Because the alarms were considered a nuisance, the integrator and the plant owner/operator agreed to disable them temporarily and continued with the FAT.
When the system was installed at the plant, the alarms were still disabled, and
the system was placed into service after testing. Later, when power supply
voltage variations occurred, the digital controller disabled two power supplies
that signaled they were out of tolerance, but there were no alarms, and the result
was a dropped control rod without any warning when the plant was at 100%
power.
While the primary lesson learned from this Case Study is arguably about
inadequate configuration control, because the disabled alarm functions were not
properly re-enabled before placing the system in service, another lesson learned is about
validation of the system FMEA. The system FMEA clearly described an alarm
feature that would indicate feedback signal drift or a malfunction in a control rod
power supply, but there were no test cases executed at the SAT (in a pre-
installation environment) or during post-installation testing to validate these
alarms.
Test cases should be developed and executed to validate FMEA results, at least to
the extent that they can be executed without requiring extraordinary conditions or
destructive methods.

Dealing with Multiple FMEAs

Often, multiple FMEAs are involved or may be available in the course of a digital I&C project:
 Generic platform or component Design FMEA by the equipment vendor
In order to support reliability claims, some equipment vendors perform
Design FMEAs on their equipment, typically down to the failure modes of
the individual devices (e.g., CPUs, Analog to Digital Converters, resistors,
capacitors, etc.) that make up each platform component (e.g., controllers, I/O modules, power supplies, etc.). Such FMEAs should be performed on a
component-by-component basis, and systematically analyze the failure
modes, failure mechanisms and methods of detection for each internal device
in a given component.
Additionally, defensive measures in terms of hardware design features,
software features, or limits and precautions in the use of a given module can
be assessed. The Taxonomy of failure modes, failure mechanisms and
defensive measures provided in Appendix B of this guideline can be used as
an aid to assess the adequacy of a Design FMEA provided by or available
from an equipment vendor.
When a piece of vendor equipment is applied in a solution, and a generic
equipment-level Design FMEA is available, the Design FMEA should be
assessed for internal failure modes that can propagate to the equipment
interfaces, to assure that the system-level Design FMEA adequately
assesses those failure modes.
 Functional FMEA by Owner/Operator
As described in Section 4.2, the Functional FMEA method is helpful for
identifying failure modes at the basic function/process level, before
equipment-specific functions are allocated or assessed. The Functional
FMEA should bound the functions and processes that would be provided by,
affected by, or interfaced with equipment-based solutions (i.e., systems or
components) to be analyzed later under a Design FMEA or other suitable
hazard analysis method.
The Functional FMEA is typically expected to be performed by I&C
Engineers or their designees (e.g., an Architect/Engineer firm), with support
from individuals knowledgeable in the basic design and operation of the
affected plant systems, including system engineers, operators, and reliability
or PRA engineers.
 Equipment-Level Design FMEA by System Integrator

The Design FMEA method described in Section 4.4 is typically expected to be performed by the system integrator or solution provider, and the extent or
scope of a Design FMEA performed by a system integrator or solution
provider is typically limited to the physical or functional boundaries of the
equipment to be provided. Sometimes the Owner/Operator performs the
system integration role, but more often it is performed by a firm (e.g., an
NSSS provider) that demonstrates detailed knowledge and control of system
development lifecycle processes; the equipment to be purchased, developed
and integrated into a solution; and the affected plant functions and processes.
Owner/operator support is typically required for assuring adequate coverage
of affected plant functions, processes, and interfaces in the Design FMEA;
for this reason, the Functional FMEA is a useful input to a Design FMEA
performed at the equipment level.

A Design FMEA performed by a system integrator or solution provider should be subject to the following conditions:

 4-57 
a. The physical, functional, and data interfaces assessed in the Design
FMEA should be systematically compared to the actual interfaces
provided in the finished solution to verify that all interfaces and their
failure modes are fully and completely addressed, including unused
interfaces or interfaces that may be used infrequently.
b. Factory Acceptance Test (FAT) and/or Site Acceptance Test (SAT) test
cases should be developed to systematically validate the results of the
Design FMEA, to the extent that test cases are non-destructive. Some
failure modes and the expected effects and related methods of detection
are simple to test, such as a failed low analog signal at an appropriate
interface, or turning off a power supply to validate that an expected
indication or alarm is raised.
 Plant System-Level Design FMEA by Owner/Operator
In some cases, a system-level Design FMEA may be useful or required in
order to demonstrate how the results of an equipment-level Design FMEA
interact with interfacing systems or components that are not assessed at the
equipment level. The Owner/Operator or designee is typically responsible for
a system-level Design FMEA, and may choose to revise or append the
equipment-level Design FMEA to account for system-level failure modes
and effects, or a separate, stand-alone Design FMEA may be preferred. In
either case, the finished FMEA product(s) should cover the interactions
between the new or modified equipment and the plant systems or
components that are not modified. Such interactions are typically assessed at
the equipment interfaces.
System-level Design FMEAs should also be validated during system testing,
via SAT and/or Post-Modification Testing (PMT) activities, to the extent
that test cases are practical or non-destructive. Operating experience has
shown that failure modes of plant interfaces with new or modified equipment
were not tested at the FAT or SAT due to limitations, and not tested during
PMT due to an oversight in the Mod Test Plan, leading to surprising and
unexpected behaviors when such interfaces fail.
 Linking results
When multiple FMEAs are developed or provided, they should be assessed
for adequate coverage of equipment interfaces, adequate overlap between
digital I&C equipment and interfacing plant systems and components, and
adequate methods of detection for translation into operations and
maintenance procedures.

Methods of Detection

Digital components and systems are likely to enable extensive coverage of automatic detection and alarming of internal and external malfunctions. System
designers should take advantage of these features in order to reduce the set of
malfunctions that would otherwise be detected by intervening test methods that
require portions of the system to be out-of-service, thus reducing its availability.
However, the extent to which automatic test or diagnostic features are used should also be balanced against their potential to interfere with safety or mission-
critical functions.

Methods of detection include the following:


 Diagnostic features that are provided with a digital component or platform,
designed to provide an indication, alarm or both in the event of an internal
malfunction
 Application-level functions that are designed to provide a system indication,
alarm, or both in the event of a device or component malfunction
 Surveillance tests that are designed to reveal degraded or failed sub-systems
or components that are not necessarily self-announcing
 System behaviors that are only evident via their effects on the plant. This
method of detection should be discouraged on safety and mission-critical
systems.

Note that in some cases, detection of a device or component malfunction is automatic, but only results in raising a “system alarm” that requires temporary
connection of an engineering or maintenance tool to the system to retrieve
detailed information on the specific condition, in the form of a log or event
database.

Some digital systems are capable of alarm management approaches that enable
detection and logging of degraded conditions that don’t cross the threshold of an
alarm condition. In these cases, an alarm management philosophy should be
established that balances the need for automatic system alarms against the
likelihood of creating nuisance alarms.

If the preferred or only available method of detecting a malfunction is by periodic testing, then the interval between tests should be carefully evaluated to determine
the likelihood of an undetected malfunction that would occur during system
operation, and its potential impact on mission-critical performance or system
operability.

Senior Management Acceptance

Equipment reliability is a strong measure of plant performance. Senior managers in a typical owner/operator organization make this measure highly visible, and
consider any activities that can adversely impact equipment reliability as
unacceptable.

If FMEA results show that a system is vulnerable to failure modes that can
significantly affect equipment reliability, then the results should be
communicated up to senior management for review and decision-making before
proceeding any further through the development lifecycle.

 4-59 
Case Study 4-2. Unresolved Single Point Vulnerabilities
A digital upgrade project included an objective to eliminate single point
vulnerabilities in one of the mission-critical plant control systems. At the end of the
detailed design phase, the project FMEA was updated to reflect design details,
but it showed some remaining single point vulnerabilities that could not be
removed without significant rework. Because the senior management team had
communicated that removing single point vulnerabilities is a high priority for the
station, the project team communicated their finding to the senior management
team, with a recommendation that the project schedule and budget be adjusted
to enable the rework.
Lesson Learned
While the senior management team was disappointed with this finding, they
appreciated the opportunity to review its implications and provide direction to the
project team. Their direction was to defer installation until the system design
could be reworked in order to remove the vulnerabilities, thus preventing
installation of a system that would not meet station objectives.

Licensing

An FMEA is one of the failure analysis methods that can be used to support
licensing activities under 10CFR50 (for operating plants) or under 10CFR52 (for
new plants).

For operating plants, the EPRI guideline on licensing of digital upgrades, TR-
102348 Rev 1 (Reference 4), describes a lifecycle approach to system
development activities that is joined to failure analysis and licensing activities,
such as preparation of 10CFR50.59 evaluations or License Amendment
Requests. Figure 3-1 in TR-102348, sometimes referred to as “the bus-bar
diagram,” illustrates four steps under the heading of “Failure Analysis” that
support and work in parallel with design and licensing activities:
1. Identify system-level failures and their effects on the plant. System-level
failures can occur in the form of single failures or common-cause failures, and
they can be forced by misbehaviors in interfacing systems, or by abnormal
conditions and events. System-level failures would be identified in an FMEA
under the heading “Effect on System.”
2. Identify potential causes of system failures. In an FMEA, potential causes of
system failures would be identified under the headings “Failure Modes” or
“Failure Mechanisms.” Such potential causes of system failures should be
considered as technical causes or direct causes of system failures, and should
not be confused with root or apparent causes of system failure events, which
typically include programmatic or human performance characteristics.
3. Assess significance and risk of failures. This failure analysis step (among
other activities) helps determine the likelihood and consequences of
malfunctions and accidents, which are the key concepts in the 50.59 rule.
4. Identify resolution. This failure analysis step involves the activities described
herein, in terms of how to use the FMEA results.

 4-60 
Case Study 4-3. Inadequate Licensing Evaluation
An operating plant installed a digital rod control system under the 50.59 rule
(without prior NRC review and approval). The application included a function
that would allow ganged rod movement, but it was disabled, pending regulatory
review of a License Amendment Request to allow use of this specific function.
While reviewing the license amendment request (LAR), the NRC raised some
questions about the installed digital system, and performed an inspection at the
facility.
The inspectors determined that “…the licensee had not properly evaluated
questions associated with software common cause failure and the potential for
spurious, uncontrolled withdrawal of four control rods." The inspection report
adds: "The inspectors were concerned that the Rod Control System, as a highly
safety significant system, should have been evaluated, under 10 CFR 50.59,
assuming software common cause failures, because under certain software
failures the plant could potentially be placed in a condition outside its design
bases by causing unanalyzed abnormal operating occurrences." Later, the NRC
issued Information Notice 2010-10 in response to this inspection.
The owner/operator determined that the root cause of the inspection finding was
an unsupported determination that a software common cause failure of the Rod
Control System was not credible. EPRI TR-102348 Rev 1 suggests that "with
respect to failures due to software, including common cause failures, the key to
addressing these failure modes in licensing is having performed appropriate
design, analysis and evaluation activities to provide reasonable assurance that
such failures have a very low likelihood."
Lesson Learned
The owner/operator had two opportunities regarding failure analysis activities
that may have helped to prevent the inspection finding:
1. Performing a preliminary FMEA in the conceptual design phase can help the
detailed design by quickly identifying critical functions and key failure modes
to avoid. A preliminary FMEA can also help to identify the safety analysis
events that could be adversely impacted by the change.
2. In the detailed design phase, not only evaluate and credit software quality
process measures, but also evaluate design features and defensive measures
that protect against CCF to determine if there was adequate protection. If so,
then the evaluation may have provided the technical basis for asserting that
the likelihood of a malfunction due to software is sufficiently low so that it
need not be considered further in the 50.59 evaluation.
In the context of software common cause failures, TR-102348 Rev 1 defines
"sufficiently low" as "...much lower than the likelihood of failures that are
considered in the UFSAR (for example, single failures) and comparable to other
common cause failures that are not considered in the UFSAR (such as design
flaws, maintenance errors, and calibration errors).”

 4-61 
Periodic Testing

The FMEA can be used to identify failure modes and effects that can only be
detected through periodic testing. For digital I&C systems that function on
demand, or change modes of operation as plant conditions change, periodic
testing should be considered for detecting such failure modes before they can
adversely impact system operation. Periodic testing may be as simple as logging
into an engineering or maintenance workstation and retrieving diagnostic
information for review, or walking down the system and inspecting local
indications (e.g., status LEDs or power supply lamps). Periodic testing may
require taking the system or part of the system out-of-service for the purpose of
injecting signals or simulating plant conditions and observing the system
response.

For digital I&C systems that function in a closed-loop, continuous manner, where there are no significant changes in functional response as plant conditions change, periodic testing for the purpose of detecting failure modes and effects may not reveal any persistent failure mechanisms if the test cases simply
mimic the actual conditions that were present prior to the test. However, if the
digital equipment is capable of logging diagnostic information that is otherwise
not indicated or alarmed, then a periodic test to retrieve and assess this
information may be beneficial.

The FMEA may be used as an input to fault trees and the plant Probabilistic
Risk Assessment, especially if digital system and component failure modes are
different from the original analog system. Often, digital system software is (or
can be) designed to force a specific response to component-level failure modes,
such as “fail open” or “fail as-is,” which should be accounted for in the PRA, at
least for those systems that are modeled to that extent in the PRA, using the
FMEA as an input for changes to the PRA.

For Tech Spec system failure modes and effects that cannot be automatically
indicated and/or alarmed, thus leaving surveillance testing as the only viable
method of detection, the PRA can be used to determine an acceptable
surveillance interval. If the PRA is used to determine the surveillance interval,
then the proposed design should be modeled in a technically adequate PRA, and
the results in terms of change in Core Damage Frequency (CDF) and Large
Early Release Frequency (LERF) should be assessed. The PRA can assess the
change in risk similar to a Maintenance Rule a(4) assessment; the failure
probability of I&C components can be assumed to be proportional to the
surveillance interval, and acceptance guidance can be based on Regulatory Guide
1.174 (Reference 29). If several surveillance intervals are to be assessed, the
collective change in CDF and LERF also should be examined. NEI 04-10
(Reference 28) provides additional guidance.
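
For example, under the common constant-failure-rate approximation (a sketch with made-up numbers, not plant data), the average unavailability of a periodically tested component is roughly proportional to the test interval, which is what makes the interval a useful input to the PRA sensitivity cases described above.

# Illustrative only: average unavailability of a periodically tested component,
# q_avg ~= lambda * T / 2 for a constant failure rate lambda and test interval T
# (valid when lambda * T << 1). The failure rate and intervals below are made up.
failure_rate_per_hr = 1.0e-6
candidate_intervals_hr = {"monthly": 730, "quarterly": 2190, "semi-annual": 4380}

for name, interval in candidate_intervals_hr.items():
    q_avg = failure_rate_per_hr * interval / 2.0
    print(f"{name:12s} T = {interval:5d} h -> average unavailability ~ {q_avg:.1e}")

# Each q_avg would be substituted into the plant PRA model to estimate the change
# in CDF and LERF, and the results compared against Regulatory Guide 1.174 guidance.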

Commercial Grade Dedication

Commercial grade dedication (CGD) is a process used to provide reasonable assurance that a commercial grade item to be used as a basic component will perform its intended safety function. In this respect, the item is deemed
equivalent to an item designed and manufactured under a 10 CFR 50, Appendix
B quality assurance program. The CGD process is accomplished by identifying
the critical characteristics of the item and subsequently verifying their
acceptability by inspections, tests, or analysis supplemented as necessary by
commercial grade surveys, product inspections or hold point witnessing at the
manufacturer’s facility, or analysis of historical records for acceptable
performance.

The FMEA method provides a formal and systematic approach to identifying potential system failure modes, their causes, and the effects of the failure mode
occurrence on the system operation. In the CGD process, an FMEA can be used
to help determine the safety function and critical characteristics of each item.
Each item (component or device) is examined to determine its function and
identify each failure mode and the mechanism causing the failure. Assuming that
it is the only failure (single point failure), each failure mode is evaluated to
determine the effect of failure on the parent component and system. The FMEA
results can then be used to develop the appropriate inspections, tests, or analyses
to verify the acceptability of the item.

For additional guidance on applying the FMEA method and results in CGD
activities, see References 23, 24 and 25.

System Monitoring Plans

For conditions that do not warrant an automatic system alarm, a system monitoring plan should be established that includes procedures for periodic
walkdowns, inspections, and retrieval of system event logs or databases when
system conditions permit.
Case Study 4-4. Inadequate System Monitoring
The system event described in Case Study 4-1 is also the subject of this Case
Study, which is about developing and using a System Monitoring Plan. While the
rod power supply feedback signal alarm was disabled, the digital controller kept
an event log that included time-stamped feedback signal values stored over a
period of time. After the dropped rod event occurred, the system engineer logged
into the system and found the feedback signal values stored in the system event
log, which clearly showed an adverse trend for the control rod power supplies
that were automatically turned off by the system controller. The System Monitoring
Plan was updated to require a weekly download and review of the system event
log to determine if any adverse trends were occurring prior to approaching any
programmed limits.
Lesson Learned
System FMEAs should be used as an input when developing System Monitoring
Plans. Methods of detection that show an automatic indication, alarm, or trip, or
result in disabling a piece of equipment, should be compared with system event
logs or other forms of diagnostic information that could be reviewed on a periodic
basis, to determine if system monitoring activities could be used to identify
degraded conditions before exceeding limits.
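
A minimal sketch of that kind of periodic review is shown below; the log format, channel names, and limits are hypothetical, and the point is simply to flag values trending toward a programmed limit before the limit is reached.

# Sketch only: weekly review of retrieved controller event-log values against a
# programmed low limit. Channel names, readings, and setpoints are hypothetical.
event_log = {
    "rod_ps_07_feedback_volts": [118.0, 117.1, 115.9, 114.6],  # drifting down
    "rod_ps_12_feedback_volts": [118.1, 118.0, 118.2, 118.1],  # stable
}
low_limit_volts = 112.0     # assumed disable setpoint in the controller
review_margin_volts = 4.0   # flag anything within this margin of the limit

for channel, readings in event_log.items():
    latest = readings[-1]
    trending_down = latest < readings[0]
    if trending_down and (latest - low_limit_volts) <= review_margin_volts:
        print(f"Adverse trend on {channel}: latest {latest} V, limit {low_limit_volts} V")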

 4-63 
Troubleshooting & Cause Analysis

FMEA (as well as Top Down) results can be used as an input to troubleshooting
and cause analysis activities when digital I&C equipment fails or misbehaves.
One method developed by Exelon involves the use of a “Failure Mode Tree”
(FMT). The FMT method systematically postulates possible failure modes that
may have caused a system problem or event, then compares available evidence to
support or refute each possible cause. Each failure mode is listed in a tree format,
under a “Problem Statement”, and available evidence is listed under each failure
mode. Evidence is gathered from system logs, diagnostic information,
measurements, tests, inspections and other sources using simple or complex
troubleshooting plans.

Each failure mode in the FMT is transposed into a table that includes columns
for validation or action steps, expected results, and actual results. The end
product is a package of information that systematically supports troubleshooting
and immediate (or technical) cause determinations, which is especially helpful for
complex systems. It should be noted that the Exelon FMT method is not used
for Root Cause or Apparent Cause Analysis activities, which are beyond the
scope of this guideline.
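
The bookkeeping behind such a tree can be kept very simple. The sketch below is not Exelon's actual tooling; it uses generic placeholder failure modes and evidence to show the support/refute tally for each postulated failure mode under a problem statement.

# Sketch only: support/refute bookkeeping for a Failure Mode Tree.
# Failure modes and evidence strings are generic placeholders.
problem_statement = "System fails to perform required function during test"
failure_modes = {
    "Postulated failure mode A": [("system log entry", "refutes")],
    "Postulated failure mode B": [("inspection result", "supports"),
                                  ("measurement", "supports")],
}

print(problem_statement)
for mode, evidence in failure_modes.items():
    refuted = any(kind == "refutes" for _, kind in evidence)
    status = "refuted" if refuted else "supported (still a credible cause)"
    print(f"  {mode}: {status}")
    for item, kind in evidence:
        print(f"    - {item} ({kind})")
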
Case Study 4-5. Troubleshooting and Cause Analysis
The system described in Examples 4-1 and 4-2 is the subject of this Case Study,
which is about using Design FMEA results to inform a complex system
troubleshooting and cause analysis activity. The method begins with defining a
Problem Statement, which in this case is shown at the top of the Failure Mode Tree
on the left side of Figure 4-9 as “HPCI System Fails to Reach Required Flow During
Surveillance Test.”
The Failure Mode Tree is further developed by postulating potential failure modes
that could cause the defined problem. In this case, the HPCI Flow Control System
FMEAs developed in Examples 4-1 and 4-2 are used as an input. Many potential
causes would be considered, but only two are shown in Figure 4-9 due to space
limitations.
Digital system logs are obtained, equipment inspections are performed, and other
data is acquired as evidence for performing a support/refute analysis of each
potential cause. In this Case Study, physical evidence obtained by inspection
during an equipment walkdown shows that the Limit Switch on the HPCI turbine
steam admission valve is significantly misaligned. On the other hand, a HPCI Flow
Control System log shows that there were no failures of the demand signal to the
governor valve positioner.
The HPCI Flow Control System Design FMEA, of which an excerpt is shown on the
right side of Figure 4-9, indicates that a misaligned Limit Switch is a failure
mechanism that can lead to a failed open system enable signal, ultimately leading
to closure of the HPCI turbine governor valve (or failure to open on demand). This
evidence supports the misaligned limit switch as the cause of the defined problem,
and other evidence refutes all other potential causes.
Lesson Learned
System FMEAs can be used as an effective input to troubleshooting and cause

 4-64 
Case Study 4-5. Troubleshooting and Cause Analysis (continued)
analysis activities, especially when complex digital I&C equipment is involved. The
Failure Mode Tree and Support/Refute methods developed by Exelon can use
system FMEAs, among other sources of information, to systematically identify
causes of system failures.
System FMEAs should be maintained and available as controlled documents after a
digital upgrade project is completed in order to support troubleshooting and cause
analysis activities. They can be controlled as their own uniquely identified
document type, or they can be stored and retrieved as a form of “system
calculation” or inserted into the vendor technical manual. The equipment database
should indicate a link to the appropriate controlled document that contains the
FMEA so that it can be readily retrieved from the document control system.

 4-65 
Figure 4-9
Failure Mode Tree Using FMEA Results as an Input
[Figure: a Failure Mode Tree with the Problem Statement "HPCI Fails to Reach Required Flow During Surveillance Test" at the top. Postulated failure modes branch beneath it, including "Open Input Signal to Positioner" (refuted: log shows no signal failures) and "Misaligned LS on Steam Admission Valve" (supported: inspection shows LS is misaligned), along with other failure modes and their associated evidence.]

 4-66 
4.7 FMEA Strengths

Focus on Single Failures

The FMEA method is focused on single failure mechanisms and/or single failure
modes. This focal point is a strength in terms of its ability to identify single
failure modes for demonstrating compliance with the single failure criterion and
for identifying single point vulnerabilities.

Simplicity

The FMEA method is relatively simple compared to other methods, especially if the objective is to systematically identify single failures and their causes.
Additional steps would need to be added to go beyond single failures. Therefore,
the FMEA method is likely to meet the objective of systematically identifying
single failures and their causes with less effort and cost than other methods.

Ability to Leverage a Failure Taxonomy

Users of this guideline can take advantage of the generic failure taxonomy
provided in Appendix B for identifying likely failure modes, failure mechanisms,
and defensive measures. Users can also derive or expand their own taxonomy for
use in repeated applications of the FMEA method on various projects.

Accessible to I&C Design Engineers and I&C Equipment Designers

The FMEA methods described in this guideline are widely used in multiple
industries, and FMEA standards, guides, procedures and training programs are
readily available. Nuclear power plant I&C design engineers and I&C equipment
designers are usually trained and experienced in mechanical, electrical,
electronics, nuclear and other discipline-specific engineering fundamentals. The
idea of postulating component and device failures is consistent with their training
and experience with design and support of I&C equipment. Therefore, the
FMEA method is one of the most accessible and understood hazard analysis
methods available to the I&C engineering community, within nuclear power
utilities and among equipment vendors.

4.8 FMEA Limitations



Common Cause Failures

The focus on single failures is also a limitation because it is difficult to postulate and consider the effects of potential common-cause failures (CCF).

The focus on single failures also limits consideration of adverse interactions between systems or components, including human interactions, especially when an adverse interaction can result even if there are no system or component failures.

 4-67 
Software Hazards

The FMEA method typically considers only hardware failures, where it can be applied effectively. To date, however, methods for identifying software failures and determining their effects remain a research problem, especially since there is no clear industry and regulatory consensus on the meaning of “software failure.”

However, this same research problem is being studied heavily in several industrial
sectors, and the software failure taxonomy sheets provided in Appendix B are a
summary of currently available guidance. Users of this guideline can venture into
“software failures” using the FMEA methods described herein, but should be
cautioned that this approach has not gained wide acceptance in the nuclear power
industry. For additional insights on this research topic, see Reference 22.

An alternative point of view would be consideration of “inappropriate software behaviors” in lieu of “software failures,” where an analyst could systematically
assess a digital I&C system in terms of its potential behaviors, even when no
“faults” or “failures” are evident, and determine if any behaviors (including
unexpected system or component interactions) are hazardous. Recent
advancements in Hazard Analysis methods are designed to implement this idea,
and they are presented in Section 6 of this guideline.

Dependent on Analysis Boundary

The FMEA method is useful for analyzing failure modes and effects between
components of interest and between interfacing systems and components.
However, it may not assess the effects of all interfaces if the boundary is not
drawn correctly, or if the block diagram does not account for all interfaces that
actually cross the boundary in the implemented system.

Coverage of Other Hazards

Because the Design FMEA method is a bottom-up method that is focused on single failures of equipment, it does not systematically identify a wider range of
hazards that can lead to accidents or losses, such as requirements errors, human
errors, or adverse interactions between components that haven’t failed. Advanced
Hazard Analysis methods are required for addressing these problems, as
presented in Section 6 of this guideline.

 4-68 
Section 5: Top Down Method Using Fault
Tree Analysis (FTA) Techniques
Fault tree analysis is a technique that generally is used to identify combinations of
components and their failure modes leading to failure of systems to perform their
intended functions. Fault tree analysis has been applied as a method to study
system design for over fifty years (Reference 34). It has gained acceptance in
numerous industries, among them defense, aerospace, chemical, transportation,
automotive, robotics and nuclear power (both research and commercial reactors).

Fault trees are deductive logic models (Reference 35). They begin by defining the
occurrence of a top event at a facility or within its systems (e.g., core damage,
failure to provide generation above a selected capacity factor, or failure of a
system to perform a given function) which then is broken down into failures of
trains within the systems being modeled and ultimately components and their
failure modes that would contribute to the occurrence of the top event. As the
name implies, fault trees are constructed in failure space. The focus of fault trees is on failures because, for complex systems with built-in redundancy, the ways a system can fail are generally fewer and involve smaller sets of components than the ways it can succeed.

Fault trees can be used to quantify the failure probability of a system or collection
of systems (or, conversely, estimate their reliability). More importantly, however,
fault trees generate qualitative insights regarding the design of a plant and its
systems. In the guidance provided in this report, it is the qualitative or
deterministic aspects of fault tree analysis that are considered in the failure
analysis of digital systems. Among the qualitative information that may be
derived from fault trees in performing a review of the design of a digital system
are:
1. confirmation of the functions that are most useful for the digital system to
provide (including those that may be beyond the primary purpose of the
digital system)
2. identification of the important failure modes of the plant components that
are to be actuated or controlled by the digital system (as well as determining
the failure modes that are not important)
3. understanding the context of the digital system in the plant design as a whole
4. validation that the architecture of the digital system is consistent with success
criteria for the systems that it supports.
 5-1 
Note that some of these qualitative insights may not require a fault tree model to
be developed for the digital system itself and that fault trees providing much of
the above information may already be available in support of other plant
programs (e.g., as a part of the plant specific PRA). In that regard, the guidance
in this section is directed at taking advantage of existing fault trees from the PRA
as opposed to developing new fault trees for the purpose of performing the failure
analysis.

5.1 Top Down Method Overview and Objectives Using Fault Tree Techniques

Section 1.5 effectively defines the objectives of this report as providing guidance
to ensure that a digital system failure analysis is as complete as practical while
requiring a reasonable effort to perform. The following is an overview of a top
down methodology that is directed at those objectives with a focus on taking
advantage of fault tree techniques.

The proposed top down approach begins by recognizing that I&C systems are a
part of a larger integrated plant design. By themselves, they cannot accomplish
the functions needed to ensure safe and efficient operation of the plant without
the equipment they actuate, monitor or control. For that reason, the top down
approach begins by defining high level safety and generation related functions
and works its way down to where the interface of the I&C systems with plant
mechanical and electrical equipment that perform these functions occurs. The
primary objective of this top down review, therefore, is to focus the scope of
digital system failure modes that should be investigated in the failure analysis of
the system by identifying the potentially important failure modes of the
mechanical and electrical components controlled or actuated by the digital
system.

As part of the top down review of safety and generation functions, consideration
of what is modeled in the plant specific PRA is encouraged. The PRA contains
fault tree logic for many of the plant systems that may be influenced by the
digital systems under review, including some that are generation-related. Taking
advantage of models already developed for the PRA can limit the effort required
to define the failure modes of interest for the digital systems.

If consideration is given to developing new fault tree logic to assist in performing the failure analysis, such as for the I&C system itself, detailed logic modeling to
the level that may occur in the PRA may not be necessary. Rather, developing
high level logic down to the major components in the digital system may be all
that is needed, particularly given that probabilistic quantification of the system is
not one of the objectives of the failure analysis.

 5-2 
5.2 Procedure for Top Down Method Using Fault Tree
Techniques

Prerequisite

The first three steps described below for the fault tree analysis approach
effectively accomplish the Function Analysis described in Section 3.6. To aid in
the Function Analysis as an input to the Top Down method, example frontline
and support system safety and generation functions are listed in this Section.

The following steps describe one approach for performing a digital system hazard
analysis using fault trees as input. These steps are not the only way to implement
the method; variations are likely, and can be blended with or replaced by steps
described for other methods in this guideline. The analyst is encouraged to
review and modify the fault trees presented in this guideline as needed to reflect
their plant specific design.

The fourth step in the fault tree analysis approach converts the results of the fault
tree based Failure Analysis into the failure modes of interest for the digital system
under review. This step effectively represents the PHA described in Section 3.7.

Top Down Step 1: Define the I&C System(s) to be Analyzed

An obvious initial step in the failure analysis of a digital system is to identify the
scope of the I&C system under review. For the purpose of performing a top down
analysis of the identified system(s), it is not necessary that the design of the system
be complete or that details of the design be available. In fact, the first few steps of
the top down analysis are sufficiently general that they would apply whether the
I&C in question consists of a small set of individual I&C components within a
specific plant system or involves a plant wide digital I&C review including balance
of plant as well as safety systems. As noted in Section 5.1, the key information that
eventually will be needed in implementing the top down analysis is the identification of the non-I&C mechanical and electrical components, and their failure modes, that the I&C system under review actuates or controls.

Top Down Step 2: Define Plant Level Functions & Develop System Level Fault
Tree Logic

Activities at a nuclear facility are directed toward the primary goals of nuclear
safety and efficient plant operation. The following are suggested for defining
high level safety and generation functions in performing a top down digital
system failure analysis.

Safety Functions

The three key safety functions listed in 10CFR50.2 are a reasonable starting
point for defining high level safety functions in that they encompass the most
important considerations regarding protecting the health and safety of the public
including events that go beyond the design basis. They are consistent with lower

 5-3 
level functions considered in the plant's safety analysis, the emergency operating
procedures (EOPs) and the plant specific PRA.
1. Ensure primary coolant system integrity
2. Shutdown the reactor and maintain safe shutdown
3. Prevent significant releases (e.g., those in excess of 10CFR100)

Generation Functions

Three key functions can be defined that are each necessary for the production of
energy for delivery to the grid.
1. Energy conversion to steam and inventory control
2. Steam flow and condensation
3. Conversion of energy to electricity and delivery to the grid

The three safety functions are required for defense-in-depth purposes with
respect to ensuring the health and safety of the public such that no single
function is relied upon to the exclusion of the others (e.g., containment cannot be
credited by itself in preventing significant releases without also having a means of
providing adequate core cooling and vice versa). Loss of any one of the
generation functions will result in a plant shutdown or load reduction with the
loss of electrical power production. The failure of these key safety and generation
functions can be used to define the top events of fault trees intended to model
safe and efficient plant operation.

After identifying the top events, it may be useful to identify intermediate plant level functions that support the key safety and generation functions. The plant
level safety functions are similar to those defined in the EOP functional
guidelines and are listed in Table 5-1. Figure 5-1 and Figure 5-2 each provide
possible top level safety logic for BWRs and PWRs respectively.

Note that Table 5-1, Figure 5-1 and Figure 5-2 develop the top down logic to
the plant frontline system level along with the failure modes of those systems. A
frontline system is a plant system that directly provides the function specified in
the first column of the table. It is recognized that there are also supporting
systems that are necessary for the front line systems to accomplish their
functions. Consideration of support systems is discussed in subsequent steps. In
Step 2 of this procedure, it is suggested that top logic need only be developed to
the extent that it identifies plant systems which directly support plant functions.

Table 5-2 lists the frontline functions and systems necessary for generation at the
plant level.

Figure 5-3 and Figure 5-4 provide possible top level generation related logic
for BWRs and PWRs, respectively. As with the safety functions, in this step
development of the top logic is needed only down to the point that the frontline
systems that perform the generation functions are identified.
 5-4 
Table 5-1
Frontline Functions/Systems for Nuclear Safety at the Plant Level

Function BWR PWR


Primary Coolant System (PCS) integrity
PCS Piping  Piping  Piping
(pressure boundary integrity failure or valve failure to  SRVs  SG tubes
remain isolated)  Head vent  SRVs
 PORVs
 RCP Seal Failure
 Head vent
PCS Interfaces with Other Systems  Main Steam  LPCI  HPSI  LPSI
(isolation valve failure to close or remain closed)  Feedwater  Core spray  Letdown  SDC suction
 HPCI  SDC suction
 RCIC
 RWCU
PCS Overpressure Protection  SRVs  Pressurizer SRVs
(valves fail to open)  TBPVs  PORVs
Shutdown Reactor and Maintain Safe Shutdown
Reactivity Control  CRDs  CRDs
(rods fail to insert or failure to inject boron)  SLC  CVCS
Reactor Pressure Control  SRVs (depress)  Feedwater  PORVs
(valves fail to open to reduce pressure or failure to  AFW
remove heat through heat exchanger)  SDC

 5-5 
Table 5-1 (continued)
Frontline Functions/Systems for Nuclear Safety at the Plant Level

Function BWR PWR


Reactor Inventory Control  Feedwater  Condensate  CVCS
(failure to makeup to reactor)  HPCI  LPCI  HPSI
 RCIC  Core spray  LPSI
 CRD  SW
 FPS
Containment Control
Containment Isolation  CIS  CIS
Containment Pressure control  Main Condenser  Fan Coolers
Containment Temperature Control  RHR (SPC, SDC or drywell spray  Containment spray
modes)
 Containment Vents

 5-6 

Figure 5-1
BWR Safety Functions (Top Down)
[Multi-page fault tree figure: the basic safety functions are developed into plant level safety functions and the frontline systems that support them for a BWR.]

Figure 5-2
PWR Safety Functions (Top Down)
[Multi-page fault tree figure: the basic safety functions are developed into plant level safety functions and the frontline systems that support them for a PWR.]

Table 5-2
Frontline Functions/Systems for Generation at the Plant Level

Primary Functions

  Reactivity Control
    BWR: RR (Reactor Recirculation), RRFC (Reactor Recirculation Flow Control),
         CRD (Control Rod Drive), NBI (Nuclear Boiler Instrumentation)
    PWR: CVCS (Charging/Letdown), CRD (Control Rod Drive), NBI (Nuclear Boiler Instrumentation)

  Reactor Inventory Makeup / Heat Removal
    BWR: RF (Reactor Feedwater), RFC (Reactor Feed Control), MC (Main Condensate),
         CM (Condensate Makeup)
    PWR: PCP (Reactor Recirculation), CVCS (Charging/Letdown), RF (Reactor Feedwater),
         RFC (Reactor Feed Control), MC (Main Condensate), CM (Condensate Makeup)

  Flow of Steam to Turbine
    BWR: TGC (Turbine Electro-Hydraulic Controls), MS (Main Steam), AR (Air Removal),
         OG (Offgas), AOG (Augmented Offgas)
    PWR: TGC (Turbine Electro-Hydraulic Controls), MS (Main Steam), AR (Air Removal), OG (Offgas)

  Condenser Operation
    BWR: CW (Circulating Water), CD (Condensate Drains), ES (Extraction Steam)
    PWR: CW (Circulating Water), CD (Condensate Drains), ES (Extraction Steam)

  Conversion of Steam Energy to Power
    BWR: TG (Turbine Generator), TGI (Turbine Generator Supervisory Instrumentation)
    PWR: TG (Turbine Generator), TGI (Turbine Generator Supervisory Instrumentation)

Figure 5-3
BWR Generation Functions (Top Down)
[Two-page fault tree figure: Loss of Generation Functions (GENERATION_FUNC) is developed through the three basic generation functions (STEAM_GEN, STEAM_FLOW&COND, ENERGY_CONV) into the plant level generation functions (reactivity control, reactor inventory control/heat removal, main steam, steam condensation, turbine generator, generator supervisory instrumentation, turbine generator system) and the BWR frontline systems that support them.]

Figure 5-4
PWR Generation Functions (Top Down)
[Two-page fault tree figure: the same top logic as Figure 5-3, with the PWR plant level generation functions adding a heat removal branch (HT_REM) and charging/letdown (CVCS), developed down to the PWR frontline systems that support them.]

Top Down Step 3: Identify Actuated/Controlled Components and their Failure
Modes

Logically, the next step in the top down process would be to develop fault trees
for each of the plant systems identified in the preceding Tables and Figures. The
objective of these fault trees would be to identify the mechanical and electrical
components that are to be actuated or controlled by the digital system under
review and understand their failure modes. Having knowledge of the failure
modes of these components may help to focus the review of the digital system by
eliminating the need to consider digital failure modes that would not contribute
to loss of safety and generation functions important to plant operation or its
response to transients and accidents.

This procedure suggests that further development of fault trees may not be
necessary, however. Rather, at this stage of the evaluation, advantage can be
taken of the fault trees that already have been developed for a given plant, in
support of the plant specific PRA.

Safety Functions

Table 5-3 provides a suggested format for obtaining relevant component and
failure mode information from the PRA for safety functions:

Table 5-3
Format for Capturing Component Failure Mode Information from the PRA

System | Tag ID | Failure Mode | Basic Events | Description/Comments          | Safety Function

Each column in Table 5-3 is explained as follows:


 System – a frontline system from the bottom of the fault trees illustrated in
Figure 5-1 or Figure 5-2.
 Tag ID – unique identification given to the component represented by the
basic event.
 Failure Mode – manner in which the component fails such that it contributes
to loss of the system's ability to perform its function (e.g., fail to open, fail to
remain open, fail to start, fail to run, etc.).
 Basic Events – list of events included in the fault tree for the system as found
in the plant specific PRA (a basic event usually represents a component and
its failure mode).
 Description – definition of the component and its failure mode in terms of
the function it performs within the system in which it is located, plus any
other information that may be useful in performing the failure analysis.
 Safety Function – a list of the plant level functions to which the component
and its failure mode would contribute were it to fail. Notice that the gap
between the first five columns of Table 5-3 and the Safety Function column
is intentional. It is likely that reports extracted from the PRA will easily
contain the first five columns, but it may be necessary for the analyst to
manually identify the related or affected safety functions.

For any given system modeled in the PRA (or for the entire PRA), a listing of
the Basic Events and a Description is simple to generate from the PRA. A
database relating the Basic Events to the Tag IDs for components may also be
available (the Tag ID may make up a part of the Basic Event name in many
cases). Often, the description of the Basic Event included in the PRA may
simply reflect information already a part of the Basic Event name (e.g., the Tag
ID and its Failure Mode). In this case, it may be useful to expand the definition
to be more descriptive of the component and its function (e.g., ‘charging pump
fails to provide flow to the reactor’ as opposed to ‘P-101 FTR’).
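
One way to capture the Table 5-3 information is a simple record per basic event, with the terse PRA description expanded as suggested above. The sketch below (Python) uses a hypothetical basic event naming convention and illustrative values; the Safety Function entry is left to be filled in manually by the analyst.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class BasicEventRecord:
        """One row in the Table 5-3 format (values below are illustrative)."""
        system: str            # frontline system from Figure 5-1 or Figure 5-2
        tag_id: str            # unique component identifier
        failure_mode: str      # e.g., "FTR" (fail to run), "FTO" (fail to open)
        basic_event: str       # basic event name as it appears in the PRA
        description: str       # expanded, human-readable description
        safety_functions: List[str] = field(default_factory=list)

    # Expanding a terse PRA description into something more useful to the analyst
    row = BasicEventRecord(
        system="CVCS",
        tag_id="P-101",
        failure_mode="FTR",
        basic_event="CVC-PMM-FR-P-101",   # hypothetical naming convention
        description="Charging pump fails to provide flow to the reactor",
    )
    row.safety_functions.append("Reactor inventory control")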

In addition to the components and their Failure Modes, information regarding
systems supporting the operation of each frontline system can be provided from
the PRA by obtaining dependency matrices developed during the creation of the
PRA fault trees or by providing a list of support system transfers into each system
fault tree. On obtaining a listing of the supporting systems for each frontline
system, a list of Basic Events for the support systems also can be developed along
with associated Tag IDs, Failure Modes, etc.
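
A dependency matrix of this kind reduces to a simple lookup from frontline system to support systems. The sketch below uses hypothetical pairings purely for illustration.

    # Hypothetical frontline-to-support-system dependency matrix of the kind
    # typically produced during PRA fault tree development.
    dependency_matrix = {
        "HPCI": {"DC Power", "Lube Oil", "Room Cooling"},
        "RCIC": {"DC Power", "Lube Oil"},
        "Feedwater": {"Instrument Air", "Electrical Distribution", "Service Water"},
    }

    def support_systems_for(frontline_system):
        """Support systems whose basic events should also be listed for this frontline system."""
        return sorted(dependency_matrix.get(frontline_system, set()))

    print(support_systems_for("HPCI"))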

Generation Functions

As the plant specific PRA is not directed at quantification of generation losses, it
is likely that only a subset of systems identified in Figure 5-3 or Figure 5-4 will
have fault trees developed as a part of the PRA. However, two types of fault trees
from the PRA may be useful in supporting a failure analysis of Generation related
digital systems.
 Initiating Event Fault Trees – Some PRAs use fault trees for the
quantification of the frequency of trips due to selected systems. Initiating
Event Fault Trees typically are developed for supporting systems that can
have an impact on multiple frontline systems modeled in the PRA (e.g.,
service water, component cooling water, instrument air, electrical
distribution, etc.). A few Initiating Event Fault Trees also may be available
for frontline systems that can lead to a plant trip but have a sufficiently low
frequency that they have not resulted in a trip over the life of the plant
(e.g., CRD makeup or charging). Development of initiating event fault trees
is described in Reference 36. Note that Initiating Event Fault Trees
developed for use in the PRA may not have support systems developed as
they would be developed for the mitigating systems. Support system
modeling for generation purposes is discussed further below.
 Safety Function Fault Trees – Several systems are modeled explicitly in the
PRA for their safety functions that also happen to perform un-modeled
generation functions. These systems often are associated with the balance of
plant (e.g., feedwater, condensate, circulating water, etc.). While the success
criteria for these systems differ between the safety and generation functions,
many of the components and their failure modes are the same regardless of
what system functions are being considered.

Generation related systems that have Initiating Event Fault Trees or Safety
Function Fault Trees modeling the same components as needed to support
power operation can be reviewed in a manner similar to that described for the
Safety System Fault Trees. For each system, a listing of Basic Events along with
its associated Tag ID and Failure Mode can be provided from the PRA. The
description of the components and their failure modes can be modified to discuss
how failure of the component would contribute to the potential for generation
loss. Finally, the plant level generation functions supported by these components
would be identified.

As noted above, Initiating Event Fault Trees may not contain logic for
supporting systems. However, for those PRAs utilizing Initiating Event Fault
Trees, supporting systems also may be modeled as initiators. A review of the
supporting systems for each of the Initiating Event Fault Trees should be
performed and, for any supporting system that is not also modeled, a substitute
method of identifying its components and failure modes should be selected,
which could include relying on the Safety System Fault Trees for the supporting
systems.

Table 5-4 provides an initial starting point for the review of support systems as
initiators of plant trips or load reductions. The objective for all generation related
systems is to list the Basic Events, Tag IDs, Failure Modes and a Description of
how the component supports each generation related function.

 5-22 
Table 5-4
Supporting Functions/Systems for Generation at the Plant Level

BWR PWR
Type of Function System System
Description Description
Designator Designator
SWY Switchyard SWY Switchyard
Motive Power
EE Electrical Equipment SPS Station Power
EE Instrument AC IAC Instrument AC
DC DC Power DC DC Power
Instrument Air
IA IAS Instrument Air
(Pneumatic Supply)
Control
SA Service Air CAS Service Air
Power
Turbine Electro- Turbine electro
TGF EHC
Hydraulic Control Fluid hydraulic control
Reactor Recirculation
RRMG --- ---
Motor/Generator Set
Service Water / Non-
SWS Service Water SWS / NSW
Supporting critical SW
Functions Turbine Building Component Cooling
TBCCW CCW
Equipment Cooling Water
Equipment Cooling
Reactor Bldg.
RBCCW --- ---
Equipment Cooling
Diesel Generator
DGJW --- ---
Jacket Water
Turbine Lube Oil
LOGT LO Turbine Lube Oil
(instrumentation)
Turbine Lube Oil
LO --- ---
Lubrication (mechanical)
RFLO Reactor Feed Lube Oil LO Feedwater Lube Oil
Reactor Recirculation
RRLO --- ---
Lube Oil

 5-23 
Table 5-4 (continued)
Supporting Functions/Systems for Generation at the Plant Level

BWR PWR
Type of Function System System
Description Description
Designator Designator
Reactor Bldg. Heating, Aux Bldg. Heating,
Supporting
HVAC HV Ventilation, Air HVAC Ventilation, Air
Functions
Conditioning Conditioning
PCP Seal Containment
Seals
Cooling Component Cooling
RCS SRV Safety Relief Valves SRV / PORVs Safety Relief Valves
Integrity Primary Coolant
RPV Reactor Pressure Vessel PCS
System
Auxiliary
NB Nuclear Boiler --- ---
Functions
Reactor Water
RWCU / CVCS Cleanup / Charging / CVC Charging / Letdown
Reactor
Letdown
Water Chemistry
Condensate
CF CND Condensate System
Demineralizers

 5-24 
Table 5-4 (continued)
Supporting Functions/Systems for Generation at the Plant Level

BWR PWR
Type of Function System System
Description Description
Designator Designator
--- --- EFW Emergency Feedwater
High Pressure Safety
HPCI High Pressure Injection HPSI
Injection
Reactor Core Isolation
RCIC --- ---
Cooling
Low Pressure Coolant Low Pressure Safety
LPCI LPSI
Injection Injection
CS Core Spray --- ---
Regulatory Residual Heat
RHR Residual Heat Removal RHR
Functions Removal
DG Diesel Generators EDG Diesel Generators
Diesel Generator Fuel
DGFO FO DG Fuel Oil
Oil
PC Primary Containment PC Primary Containment
Primary Containment Primary Containment
PCIS CIS
Isolation System Isolation System
Engineered Safety
--- --- ESFAS
Feature Actuation

 5-25 
There will be some systems that support generation that will not be modeled in
the PRA either as an initiating event or in support of a safety function (e.g.,
turbine and generator systems, feedwater heating, and reactor recirculation
systems). For these systems, the top down approach would require a method
other than use of existing fault trees. Alternatives include developing a list of Tag
IDs for each system beginning with Piping & Instrumentation Diagrams
(P&ID), equipment lists and the assistance of system engineers. Such a process
may already have been undertaken as a part of implementation of AP-913
(Reference 37).

For the purpose of the failure analysis, identifying more than just the critical
components (coming out of AP-913) would be necessary in supporting
subsequent steps of the digital system failure analysis, because a somewhat
simplified consideration of single point vulnerabilities may not provide a
complete list of components that would be used in defining digital system failure
modes. Again, the purpose of a list of generation-related components for each
support system, whether developed from a fault tree or simply by producing a
table, is to identify Tag IDs, Failure Modes, and a Description of how the
components might contribute to failure of the supporting systems (Table 5-4) to
perform their functions, and the effects of those failures on the ability of the
frontline systems (Table 5-2) to perform their functions.

Top Down Step 4: Relate Actuated Component Failure Modes to Digital System Failure Modes

This step of the top down process examines the interface between the digital
system and the mechanical and electrical components that it controls or actuates.
The preceding steps define the failure modes for these mechanical and electrical
components in support of plant operation and their response to transients or
accidents. The following is a relatively simple approach to translating these
failure modes to digital I&C failure modes at the system level.

At the system level, a few digital I&C failure modes may be all that is necessary
to identify when implementing a top down, focused failure analysis. For example:
 No signal when one is needed
 A delayed signal subsequent to when it is needed
 A signal when one is not needed
 A protective trip signal at an inappropriate time
 Control signal too high
 Control signal too low
 Rate of change of control signal inappropriate, given plant process rate of
change

For the purpose of documenting the basis for selection of a given digital system
failure mode, a simple table should suffice, as provided in Table 5-5:

 5-26 
Table 5-5
Formatting the Basis for Selection of Digital System Failure Modes

Tag Failure Digital System Failure Safety


System
ID Mode Mode(s) Function

Each column in Table 5-5 is explained as follows:


 System/Tag ID/Failure mode – obtained from the list developed in Step 3
for those components directly actuated or controlled by the I&C.
 Digital system failure mode – One or more of the above suggested digital
system level failure modes that could cause the failure mode for the specific
Tag ID
 Safety function – the plant level safety function identified in Step 2 for the
specific Tag ID and failure mode. Notice that the gap between the first four
columns of Table 5-5 and the Safety Function column is intentional. It is
likely that information gathered from the PRA or other means will readily
support filling in the first four columns, but it may be necessary for the
analyst to manually identify the related or affected safety functions. Also, the
analyst should not be surprised if some generation-related functions can
adversely affect one or more safety functions.
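
A sketch of how Table 5-5 rows might be assembled from the Step 3 results is shown below (Python). The mapping entries are illustrative examples only; the analyst supplies the actual mapping between component failure modes and the candidate digital system failure modes listed above.

    # Illustrative mapping from actuated-component failure modes to candidate
    # digital system failure modes. Entries are examples, not a complete mapping.
    DIGITAL_FAILURE_MODES = {
        "fail to open":        ["No signal when one is needed",
                                "A delayed signal subsequent to when it is needed"],
        "fail to remain open": ["A signal when one is not needed"],
        "fail to throttle":    ["Control signal too high", "Control signal too low"],
    }

    def table_5_5_row(system, tag_id, failure_mode, safety_functions):
        return {
            "System": system,
            "Tag ID": tag_id,
            "Failure Mode": failure_mode,
            "Digital System Failure Mode(s)":
                DIGITAL_FAILURE_MODES.get(failure_mode, ["(analyst to determine)"]),
            "Safety Function": safety_functions,   # identified manually, as noted above
        }

    print(table_5_5_row("HPCI", "HO-008", "fail to throttle",
                        ["Reactor Inventory Control"]))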

Top Down Step 5: Make a Decision (Continue or Transition to another Method)

After completing the first 4 Steps of this procedure to identify relevant failure
modes at the digital system level, a decision now needs to be made as to how
much further to continue the Top Down method. Two options are available:

Option 1: Stop Here and Transition to One of the Other Hazard Analysis
Methods

This transition would be accomplished by comparing the results of Steps 1
through 4 of the Top Down method (the list of digital system failure modes and
the basis for their selection) to the results of another method described in this
guideline. Such a transition would be most effective at this point if the
results of other failure analysis methods are available. For example, hazards
identified down to the digital system level in a completed FMEA (Section 4),
HAZOP (Section 6), STPA (Section 7), or PGA (Section 8) could simply be
compared to the system failure modes identified via the Top Down method
described in this Section.

Those results of the other methods described in this guideline that did not
contribute to any of the digital system failure modes identified by the Top Down
method could be set aside, leaving only the digital system failure modes that are
relevant to the overall plant design. Even if the analysis of a digital I&C system
using one of the other methods is in progress, as the results become available they
can be compared to the failure modes coming out of the Top Down method.
 5-27 
For those failure modes that are relevant, the designer can investigate additional
methods for preventing, reducing the potential or being able to cope with those
failure modes. For those failure modes which are not relevant to the mechanical
and electrical equipment being controlled by the digital system, further effort to
address those failure modes can be reduced or eliminated altogether.
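
The comparison itself can be as simple as a set operation on the two lists of digital system failure modes, as sketched below with hypothetical entries.

    # Digital system failure modes identified by the Top Down method (Steps 1-4)
    top_down = {"control signal too high", "control signal too low",
                "no signal when needed"}

    # Hazards carried to the digital system level by another method (e.g., a DFMEA)
    other_method = {"control signal too high", "signal when not needed",
                    "delayed signal"}

    relevant  = other_method & top_down   # pursue prevention or coping measures
    set_aside = other_method - top_down   # not linked to plant level functions
    unmatched = top_down - other_method   # may indicate gaps in the other analysis

    print(relevant, set_aside, unmatched, sep="\n")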

Option 2: Continue Pursuing the Top Down Method into the Digital System
Itself

This Option may be useful if the results from other methods are not yet available
for the digital system, or if an investigation of the impact of combinations of
failures within the digital system is of interest. The circulating water system
illustrated in Figure 4-7 and Figure 4-8 is examined in Example 4-5 using a
fault tree to identify vulnerabilities in a digital system design that involve more
than single failures. Step 6 summarizes this approach and provides
recommendations regarding an appropriate level of detail if this Option is
selected.

Top Down Step 6: Extend the Top Down Method to the Digital I&C System

If Option 2 is selected in Step 5, it is likely that the other hazard analysis
methods have not yet been performed on the system or do not take the analysis
to the system level. The need for such a Top Down analysis, extended into the
digital system of interest using the Fault Tree Analysis technique, is dependent
on the definition of the problem that is being investigated. For example, a single
point vulnerability analysis will not yield any information regarding multiple
failures, particularly those associated with common cause events or location
dependencies. If the effects of combinations of failures are of interest, then a
simple fault tree model can yield insights into the reliability of the digital
system that may not be available from other methods.

Recalling the likely objectives of a hazard analysis described in Section 3.1, it
should not be necessary to develop fault tree logic for a digital system to a great
level of detail. EPRI 1025278 (Reference 38) provides guidance in this regard
and emphasizes consideration of context in the development of digital I&C fault
trees for use in PRA. Context refers to the role that the digital system plays
within the overall plant design, in particular with respect to the plant-level
functions that were identified in Step 2. Engineering judgment can be used to
decide whether the digital system being investigated plays a significant role (or
not) in supporting any of these plant-level functions.

Even if the digital system being analyzed does play a significant role in a plant-
level function, EPRI 1025278 suggests that the detail in the fault tree logic for a
digital system should be developed no lower than the computing unit level within
the system. The computing unit level would consist of major components of the
system such as sensors, function controllers, communication processors and
voting logic. Having developed fault tree logic to this level, the remainder of the
failure analysis from a top down perspective could be completed using a glass box
approach (i.e., going deeper into the digital system).
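
As an illustration of this level of abstraction, the sketch below (Python) represents a digital system at the computing unit level as redundant sensors, controllers and communication processors feeding a single voting unit, and enumerates the minimal cut sets by brute force. The unit names and failure logic are assumptions made for illustration, not any particular product architecture.

    from itertools import combinations

    # Computing-unit-level basic events (illustrative names)
    EVENTS = ["SENSOR_A", "SENSOR_B", "CTRL_1", "CTRL_2", "COMM_1", "COMM_2", "VOTER"]

    def system_fails(failed):
        """Assumed failure logic: both sensors, both controllers, both communication
        processors, or the single (non-redundant) voting unit."""
        f = set(failed)
        return ({"SENSOR_A", "SENSOR_B"} <= f or
                {"CTRL_1", "CTRL_2"} <= f or
                {"COMM_1", "COMM_2"} <= f or
                "VOTER" in f)

    def minimal_cut_sets(events, fails, max_order=3):
        cuts = []
        for k in range(1, max_order + 1):
            for combo in combinations(events, k):
                if fails(combo) and not any(c <= set(combo) for c in cuts):
                    cuts.append(set(combo))
        return cuts

    for cut in minimal_cut_sets(EVENTS, system_fails):
        print(sorted(cut))   # one single and three doubles for this assumed logic

Even at this coarse level, the cut sets show where a single computing unit (here, the voting unit) or a pair of redundant units can defeat the function, which is the kind of insight Step 6 is intended to provide.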

 5-28 
It should be kept in mind that at any time during the top down development of a
fault tree for a digital system, results or information produced by other hazard
analysis methods (e.g., FMEA, HAZOP, STPA or PGA) may become
available, thus lessening any interest in further development of the fault tree
model. It is only necessary to develop the fault tree to the point that it provides a
link to the safety or generation functions that the digital system or its
components support within the overall integrated plant design. At any point in
the development of the fault tree, insights regarding the impact of failure of the
parts of the digital system that have been modeled thus far can be summarized
and provided for integration with information that is available from other hazard
analysis methods, particularly those described in this guideline.

For more information and general guidance on the FTA method, see EPRI
1025278 (Reference 38).

5.3 Applying the Top Down Results

The results of the Top Down approach to hazard analysis of digital I&C systems
can be applied to the following activities.

Defense-in-Depth and Diversity (D3) Analyses

The Top Down method using the fault tree analysis technique is not limited to
design basis events. It can also be used to identify system functions beyond those
for which a system was originally intended, and to confirm the system’s ability to
support these functions. In addition, fault tree analysis systematically evaluates
the effect of multiple concurrent failures, assisting in the identification of
potential common-cause effects, including those that may involve dependencies
on plant conditions and/or locations. An effective fault tree analysis can identify
where diversity is of value or where it is not of value, and provide an engineering
rationale for these decisions based on the overall plant design and its operating
characteristics.
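
One common way to represent such common-cause effects in fault tree quantification is a parametric treatment such as the beta-factor model, in which an assumed fraction of a component's failure rate is treated as a shared failure of all redundant units. The sketch below uses illustrative numbers only; it is not plant data.

    # Beta-factor treatment of common cause failure for two redundant controllers.
    lam  = 1.0e-5   # assumed total failure rate per hour of one controller
    beta = 0.05     # assumed fraction of failures that are common cause
    t    = 24.0     # mission time in hours

    q_total = lam * t               # probability one controller fails (rare-event approx.)
    q_indep = (1 - beta) * q_total  # independent portion
    q_ccf   = beta * q_total        # common cause portion (fails both controllers)

    # The redundant pair fails if both fail independently or the common cause event occurs
    q_pair = q_indep ** 2 + q_ccf
    print(f"independent-only estimate: {q_total ** 2:.2e}")
    print(f"with common cause (beta model): {q_pair:.2e}")

As the sketch suggests, the common cause term typically dominates the pair failure probability, which is one reason deciding where diversity actually adds value requires the plant-level context the Top Down method provides.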

Validate the Success Criteria of the Digital System

Fault tree analysis allows for the propagation of the effects of postulated failures
throughout the logic model, including those due to multiple failures. This
capability permits testing the design of a digital system to confirm that the plant
can continue to operate, or that safety systems function satisfactorily, in the
presence of the failures with which the system was intended to cope. Solving the
fault tree for its cut sets also can provide an indication of the extent to which the
digital system itself contributes to power reductions and safety system failures,
and identifies where potential vulnerabilities may lie in that regard.

Input to Periodic Testing Plans

A common use of fault trees is to optimize testing and surveillance intervals
(Reference 39). This is particularly true for periodic testing of hardware
components within a digital system that may be subject to the effects of
age-related degradation. However, it should be kept in mind that this is not
true of software, which does not fail randomly and has no aging-related failure
mechanisms.
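
For periodically tested hardware, a commonly used approximation for the time-average unavailability is q ≈ λτ/2, where λ is the failure rate and τ is the test interval (neglecting test downtime and repair). The sketch below, with an assumed failure rate, shows how this scales with the interval; the numbers are placeholders, not data from Reference 39.

    # Average unavailability of a periodically tested component: q ~= lambda * tau / 2
    lam = 2.0e-6   # assumed failure rate per hour for a digital I/O module

    for tau_months in (1, 3, 6, 12, 18):
        tau_hours = tau_months * 730.0
        q = lam * tau_hours / 2.0
        print(f"test interval {tau_months:>2} months -> average unavailability {q:.2e}")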

5.4 Top Down Examples

The following examples were originally developed for EPRI 1022985 (Reference
15). Here they are presented again, this time adjusting the analysis to illustrate
the Top Down method described in this Section. The first example takes
advantage of an existing plant-specific PRA in its application of fault trees (Step
4 of Section 5.2). In the second example, development of fault trees into the
digital system itself is performed (Step 6 of Section 5.2).
Example 5-1. Top Down Analysis of HPCI/RCIC Turbine Controls Using
Fault Trees
Top Down Step 1: Define the I&C Systems to be Analyzed
This example illustrates the use of fault trees to perform a Top Down analysis of the
same HPCI/RCIC turbine control system that was defined in Section 4 and presented
in Examples 4-1 and 4-2. The identified “system” is shown within the “Analysis
Boundary” box of Figure 4-6.
The HPCI/RCIC control system is designed to maintain flow at the required setpoint
when the in-service flow controller is in automatic mode. The flow control system
consists of a flow element, a flow transmitter, the flow controllers, a digital turbine
speed governor, a digital valve positioner, and feedback loops from sensors. The
flow controller applies a Proportional-Integral-Derivative (PID) control algorithm that
adjusts the speed demand output of the controller to compensate for any errors
between the flow setpoint and the actual flow signal provided by a flow transmitter
downstream of the HPCI/RCIC pump. The digital turbine speed governor, working
with the digital valve positioner, automatically adjusts the position of the governor
valve to match the actual speed of the turbine to the speed demanded by the flow
controller.
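
A minimal sketch of this control arrangement is shown below. The gains, limits, and one-line models of the governor and pump response are illustrative assumptions, not the vendor's algorithm; the sketch is only meant to make concrete how the PID flow controller produces the speed demand whose "too high" and "too low" failure modes are examined in Step 4.

    # Sketch of the flow control loop: a PID controller converts flow error into a
    # clamped speed demand for the digital turbine speed governor.
    def make_pid(kp, ki, kd, out_min, out_max):
        state = {"integral": 0.0, "prev_err": 0.0}
        def step(setpoint, measured, dt):
            err = setpoint - measured
            state["integral"] += err * dt
            deriv = (err - state["prev_err"]) / dt
            state["prev_err"] = err
            out = kp * err + ki * state["integral"] + kd * deriv
            return min(max(out, out_min), out_max)   # clamp the speed demand
        return step

    flow_to_speed = make_pid(kp=2.0, ki=0.5, kd=0.0, out_min=0.0, out_max=100.0)

    flow, speed = 0.0, 0.0
    for _ in range(50):                        # crude closed-loop simulation
        demand = flow_to_speed(setpoint=600.0, measured=flow, dt=1.0)
        speed += 0.2 * (demand - speed)        # assumed governor/valve response
        flow = 6.0 * speed                     # assumed pump flow vs. turbine speed
    print(round(flow, 1))                      # settles near the assumed setpoint of 600
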
Top Down Step 2: Define Plant Level Functions & Develop System Level Fault Tree
Logic
Figure 5-1, which meets the prerequisite for a Function Analysis (FA), provides a top
down view of basic high level safety functions for a BWR 4 broken down into plant
level safety functions and eventually identifying the systems which provide support
for these plant level functions.
On the first page of Figure 5-1, three high level basic safety functions are
considered:
 Primary coolant system integrity
 Shutdown the reactor and maintain safe shutdown
 Limit releases to the environment
The three basic safety functions can be broken down further into what will be
described as plant level safety functions. The first page of Figure 5-1 also identifies
what may be considered as plant level safety functions for a BWR. Plant level safety
functions can be related to those functions accomplished by the plant emergency
operating procedures (EOP) and/or modeled in the plant specific probabilistic risk
assessment (PRA):
 Primary coolant system integrity
− Primary coolant piping
− Primary coolant overpressure protection
− Primary coolant loss through interfacing systems
o Systems inside containment
o Systems outside containment
 Shutdown the reactor and maintain safe shutdown
− Reactivity control (subcriticality)
− Reactor coolant inventory control
o High pressure inventory control
o Low pressure inventory control
 Limit releases to the environment
− Primary containment control
o Containment isolation
o Containment Pressure control
o Containment temperature control
− Secondary containment control
Beneath each of these plant level functions in Figure 5-1 are listed the plant systems
that support these functions for a typical BWR (a BWR 4). Considering that the focus
of this top down review is on HPCI and RCIC, it should be noted that HPCI and
RCIC components play a role in all three basic safety functions. In addition to their
obvious reactor inventory control function at high reactor pressure, HPCI and RCIC
steam line isolation valves play a primary coolant system integrity function and a
containment isolation function. A review of the PRA reveals fault tree logic for HPCI
and RCIC that support all three of these functions.
The first section of Figure 5-3 provides a listing of three basic generation functions
which, in turn, are broken down into plant level generation functions:
 Reactor
− Reactivity control (maintain reactor power level)
− Reactor inventory control
 Turbine
− Flow of steam to turbine
− Condenser operation
 Generator
− Conversion of steam energy to power

 5-31 
Example 5-1. Top Down Analysis of HPCI/RCIC Turbine Controls Using
Fault Trees (continued)
Functions that support the systems that provide plant level generation functions are
also summarized in Table 5-4:
 Control Power/Pneumatic supply
 Equipment cooling
 Lubrication
 HVAC
Auxiliary functions are also shown in Table 5-4 that, if lost, may not directly affect
any of the primary or supporting generation related functions but eventually could
lead to a manual shutdown. These auxiliary functions generally are related to
maintaining reactor and fuel conditions.
Finally, a regulatory function is shown that is related to the operability of plant safety
systems. Again, these do not affect the ability to generate power directly, but reflect
limiting conditions for operation as found in the Technical Specifications.
A review of the plant specific PRA did not identify any explicit contribution to reactor
trip resulting from HPCI or RCIC components. If there is a contribution, it likely is
rolled up in the data that supports selection of initiating events and their frequencies.
For this reason, a more qualitative assessment is performed to document any
potential impact of HPCI and RCIC on reactor power operation. This qualitative
assessment is shown in Table 5-8 (note that the shaded row identifies those
component functions that may be affected by the digital upgrade defined in Step 1
and Figure 4-6). From Table 5-8, it can be seen that HPCI and RCIC systems may
impact reactor power control and reactor inventory makeup functions through the
spurious operation of either system. Also, as they are systems that are governed by
Technical Specification requirements, HPCI and RCIC can have an impact on
generation for regulatory reasons.
Top Down Step 3: Identify Actuated/Controlled Components and their Failure
Modes
At this point, development of detailed system level fault tree logic would highlight
what components within the system would support each function and what failure
modes are associated with each of these components. However, detailed fault tree
models already may be available in the plant specific PRA. It may be possible to
take advantage of these existing fault tree models to identify key components and
their failure modes that are controlled by the digital I&C.
From a safety perspective, Table 5-6 lists the major components (by Tag ID) within
the HPCI and RCIC systems supporting the primary coolant makeup function that
also are actuated or controlled by I&C equipment. In addition, the table includes the
failure modes associated with these components in terms of their potential adverse
effects on the ability of the systems to makeup to the reactor. Note that the shaded
cells in Table 5-6 are those components that are affected by the digital upgrade
defined in Step 1 (see Figure 4-6).
The shaded cells in Table 5-8 identify the impact that the HPCI/RCIC systems can
have on generation. Note that there are several HPCI system failure modes that
could lead to a plant trip or shutdown, but RCIC impact on generation is limited to
regulatory availability. Table 5-7 lists the Tag ID and Failure Modes for HPCI and
RCIC components that could lead to the loss of generating functions identified in
Table 5-8. As noted earlier, the PRA does not model the contribution of HPCI and
RCIC to reactor trips or shutdowns explicitly and, therefore, there are no basic
events in the fault trees that represent component failure modes that lead to
generation losses.
Top Down Step 4: Relate Actuated Component Failure Modes to Digital System
Failure Modes
The shaded row in Table 5-6 shows that the only HPCI/RCIC components affected
by the upgrade of the I&C to digital systems are the governor valves themselves.
Important governor valve failure modes are as follows:
 Failure to open the governor valve sufficiently to provide inventory makeup to
the reactor at rates necessary to maintain reactor level.
− For HPCI, during events involving stuck open safety valve or LOCAs (small
or medium), this would be a flow rate roughly equivalent to the rate at
which reactor inventory is leaving the primary coolant system through the
breach in the primary coolant system. For RCIC, during events involving a
stuck open SRV, this would mean a flow rate less than the design basis for
the system (several hundred gpm).
− For HPCI and RCIC, during non-LOCA transients, this would be a flow rate
roughly equivalent to decay heat inventory losses.
 Failure to throttle the governor valve sufficiently to prevent a turbine trip due to
overspeed.
− For RCIC, an inadvertent overspeed trip would likely disable the system for
all of its functions.
− For HPCI, an overspeed trip may affect its capability to provide adequate
makeup for the largest LOCAs (e.g., medium LOCA or the upper end of the
small LOCA range). Due to its relatively large flow rate, inadvertent
overspeed trip of HPCI is not likely to inhibit its ability to provide adequate
makeup for decay heat.
Given these governor valve failure modes, the failure modes of the digital control
system for the governor valve are listed in Table 5-7:
 Control signal too low (which would result in too much throttling and insufficient
flow)
 Control signal too high (which could result in a possible overspeed trip of the
turbine and ultimately insufficient flow)
Table 5-9 lists the results of assessing HPCI/RCIC components for their impact on
generation. While there are several components whose failure modes could lead to
a plant trip or shutdown, the only component impacted by the digital I&C under
review in this example is the governor valve, so this table indicates that there are
no effects on generation associated with the HPCI/RCIC turbine control system.
Top Down Step 5: Make a Decision (Continue or Transition to another Method)
At this point in the analysis, a decision should be made as to whether detailed fault
tree modeling of the turbine control system would be of value. Given the limited
scope of the functions associated with the HPCI/RCIC turbine control system, an
appropriate approach would be to simply transfer the results of the Top Down
analysis to those responsible for the design of the turbine control system and suggest
that a Design FMEA (per Section 4.4 of this guidance) focus on identifying and
addressing digital speed control system failure modes that could lead to a control
signal that is too high or too low. Therefore, the Top Down method applied to this
example ends at Step 5.

 5-34 
Table 5-6
HPCI & RCIC Components Controlled by I&C Equipment (Safety Functions)

HPCI/RCIC Failure Normal Accident


PRA Basic Event(s) Auto Comment
Tag ID Modes Config. Config.
Steam Supply
Isolation  Fail to  HPI-MOV-OC-MO- Open Open Close on Gr5 Not required to change position to
Valve remain open 014 Isol. provide steam supply function
(inboard)  Spurious  RCI-MOV-OC-MO-
 MO-014 close 055
 MO-055
Isolation  Fail to  HPI-MOV-OC-MO- Open Open Close on Gr5 Not required to change position to
Valve remain open 015 Isol. provide steam supply function
(outboard)  Spurious  RCI-MOV-OC-MO-
 MO-015 close 056
 MO-056
Actuation  Fail to open  HPI-MOV-OC-MO- Closed Open Open on low- The HPCI actuation valve also opens on
Valve  Fail to 016 low Rx level high drywell pressure
 MO-016 remain open  HPI-MOV-CC-MO-
 MO-058 016
 RCI-MOV-OC-MO-
058
 RCI-MOV-CC-MO-
058
Trip/Throttle  Fail to  HPI-HOV-OC-HO-007 Open Open Close on: Not required to change position to
Valve remain open  RCI-MOV-OC-MO-  Over-speed provide steam supply function
 HO-007  Spurious 060  Lo suction
 MO-060 close  Hi Ex-haust
 Gr5 Isol.

 5-35 
Table 5-6 (continued)
HPCI & RCIC Components Controlled by I&C Equipment (Safety Functions)

HPCI/RCIC Failure Normal Accident


PRA Basic Event(s) Auto Comment
Tag ID Modes Config. Config.
Governor Fail to throttle  HPI-HOV-OC-HO-008 Open Throttle Throttle Too much throttling may result in
Valve Fail to remain  RCI-HOV-OC-HO-009 insufficient flow to the reactor.
 HO-008 open Too little throttling may result in turbine
 HO-009 trip on overspeed.
Suction Supply
CST  Fail to  HPI-MOV-OC-MO- Open Open Close on low EOPs instruct bypassing high torus level
 MO-043 remain open 043 CST or high trip in preference to CST suction.
 MO-081  Spurious  RCI-MOV-OC-MO- torus Loss of closing function has no impact
closure 081 on ability to take suction from torus.
Torus  Fail to open  HPI-MOV-OC-MO- Closed Open Open on low EOPs instruct bypassing high torus level
 MO-041  Fail to 041 CDT or high trip in preference to CST suction.
 MO-042 remain open  HPI-MOV-CC-MO- torus
 MO-080 041
 MO-081  HPI-MOV-OC-MO-
042
 HPI-MOV-CC-MO-
042
 RCI-MOV-OC-MO-
080
 RCI-MOV-CC-MO-
080
 RCI-MOV-OC-MO-
081
 RCI-MOV-CC-MO-
081

 5-36 
Table 5-6 (continued)
HPCI & RCIC Components Controlled by I&C Equipment (Safety Functions)

HPCI/RCIC Failure Normal Accident


PRA Basic Event(s) Auto Comment
Tag ID Modes Config. Config.
Injection path
Injection  Fail to open  HPI-MOV-OC-MO- Closed Open Open on low- Not required to change position to
Valve  Fail to 047 low Rx level provide injection function.
(outboard) remain open  HPI-MOV-CC-MO-
 MO-047 047
 MO-086  RCI-MOV-OC-MO-
086
 RCI-MOV-CC-MO-
086
Injection  Fail to open  HPI-MOV-OC-MO- Open Open Open on low- Not required to change position to
Valve  Fail to 048 low Rx level provide injection function.
(inboard) remain open  HPI-MOV-CC-MO-
 MO-048 048
 MO-087  RCI-MOV-OC-MO-
087
 RCI-MOV-CC-MO-
087
Min. Flow  Fail to  RCI-MOV-CO-MO- Closed Closed Open during Only RCIC min flow line is large
Valves remain 090 testing enough to divert sufficient flow to
 MO-090 closed  RCI-MOV-CO-MO- threaten makeup function. HPCI pump is
 MO-492 492 sufficient to provide adequate makeup
whether or not min flow line is isolated.

 5-37 
Table 5-6 (continued)
HPCI & RCIC Components Controlled by I&C Equipment (Safety Functions)

HPCI/RCIC Failure Normal Accident


PRA Basic Event(s) Auto Comment
Tag ID Modes Config. Config.
Support System
Lube Oil  Fail to start  HPI-PMM-FS-P-201 Idle Run Start on low-
 P-201  Fail to run  HPI-PMM-FR-P-201 low Rx level
 MO-076  Fail to open  RCI-MOV-OC-MO- Closed Open Open on low-
 Fail to 076 low Rx level
remain open  RCI-MOV-CC-MO-
076
Primary System/Containment Isolation
Steam Line  Fail to close  HPI-MOV-OO-MO- Open Closed Close on
Isolation 014 Group 5
(inboard)  RCI-MOV-OC-MO- isolation
 MO-014 055
 MO-055
Steam Line  Fail to close  HPI-MOV-OO-MO- Open Closed Close on
Isolation 015 Group 5
(inboard)  RCI-MOV-OC-MO- isolation
 MO-015 056
 MO-056

Table 5-7
HPCI and RCIC Digital System Failure Modes

  System: HPCI, RCIC
  Tag ID: HO-008, HO-009
  Failure Mode: Fail to remain open and throttle
  Digital System Failure Mode(s):
    - Control signal too low (which would result in too much throttling and insufficient flow)
    - Control signal too high (which could result in a possible overspeed trip of the turbine and ultimately insufficient flow)
  Safety Function(s): Reactor Inventory Control

Table 5-8
HPCI/RCIC Generation Functions
(BWR system designators and descriptions; parenthetical notes indicate where HPCI or RCIC controls can affect the function)

Primary Functions

  Reactivity Control
    RR   - Reactor Recirculation
    RRFC - Reactor Recirculation Flow Control
    CRD  - Control Rod Drive
    NBI  - Nuclear Boiler Instrumentation
           (Spurious operation of HPCI results in sufficient cold water addition that a high flux trip may occur.)

  Reactor Inventory Makeup / Heat Removal
    RF   - Reactor Feedwater
    RFC  - Reactor Feed Control
           (Spurious operation of HPCI requires runback of feedwater flow to prevent a high reactor level trip of the feedwater pumps.)
    MC   - Main Condensate
    CM   - Condensate Makeup

  Flow of Steam to Turbine
    TGC  - Turbine Electro-Hydraulic Controls
    MS   - Main Steam
    AR   - Air Removal
    OG   - Offgas
    AOG  - Augmented Offgas

  Condenser Operation
    CW   - Circulating Water
    CD   - Condensate Drains
    ES   - Extraction Steam

  Conversion of Steam Energy to Power
    TG   - Turbine Generator
    TGI  - Turbine Generator Supervisory Instrumentation

Supporting Functions

  Motive Power
    EE   - Electrical Equipment
    EE   - Instrument AC
    DC   - DC Power

  Control Power
    IA   - Instrument Air
    SA   - Service Air
    TGF  - Turbine EHC Fluid
    RRMG - Reactor Recirc. M/G Set

  Equipment Cooling
    SW   - Service Water
    TEC  - Turbine Equipment Cooling
    REC  - Reactor Equipment Cooling
    DGJW - D/G Jacket Water

  Lubrication
    LOGT - Turbine Lube Oil (I&C)
    LO   - Turbine Lube Oil (Mech.)
    RFLO - Reactor Feed Lube Oil
    RRLO - Reactor Recirculation Lube Oil

  HVAC
    HV   - Reactor Bldg. HVAC

  Seals
    (none listed)

Auxiliary Functions

  RCS Integrity
    SRV  - Safety Relief Valves
    RPV  - Reactor Pressure Vessel

  Reactor / Reactor Water Chemistry
    NB   - Nuclear Boiler
    RWCU/CVCS - Reactor Water Cleanup / Charging / Letdown
    CF   - Condensate Demineralizers

Regulatory Functions
    RHR  - Residual Heat Removal
    HPCI - High Pressure Injection
           (To the extent that HPCI controls are inoperable for an extended period, a plant shutdown could result. HPCI LCO is 14 days.)
    RCIC - Reactor Core Isolation Cooling
           (To the extent that RCIC controls are inoperable for an extended period, a plant shutdown could result. RCIC LCO is 14 days.)
    CS   - Core Spray
    DG   - Diesel Generators
    DGFO - Diesel Generator Fuel Oil
    PC   - Primary Containment
    PCIS - Primary Containment Isolation System

Example 5-2. Top Down Analysis of CWS Controls Using Fault Trees
Top Down Step 1: Define the I&C Systems to be Analyzed
This example illustrates the use of fault trees to perform a Top Down analysis of the
same control system for circulating water that was defined in Example 4-3 of Section
4 in the application of the DFMEA method. The circulating water system consists of
six 25% capacity pumps distributed in two divisions. During normal operations at
100% power, two pumps are running in each division, with one pump on standby in
each division; four running pumps are necessary for operation of the plant at full
power. The CWS controls are shown within the “Analysis Boundary” box of Figure 4-
7.
The basic design of the circulating water control system includes two sets of logic
cabinets, two sets of I/O cabinets and a set of HSI workstations. All of the cabinets
and workstations are connected to redundant data communication busses (Comm 1
and Comm 2).
I/O Cabinet A contains digital input modules that monitor the position of the 4KV
breakers that provide power to the motors for three of the circulating water pumps
and digital output modules that position their associated discharge valves (open or
closed). Likewise, I/O cabinet B provides the same functions for the remaining three
pumps and discharge valves.
Top Down Step 2: Define Plant Level Functions & Develop System Level Fault Tree
Logic
Figure 5-2, which meets the prerequisite for a Function Analysis (per Section 3.6),
provides a top down view of basic high level safety functions for a PWR broken
down into plant level safety functions and eventually identifying the systems which
provide support for these plant level functions.
On the first page of Figure 5-2, three high level basic safety functions are
considered:
 Primary coolant system integrity
 Shutdown the reactor and maintain safe shutdown
 Limit releases to the environment
The three basic safety functions can be broken down further into what will be
described as plant level safety functions. The first page of Figure 5-2 identifies what
may be considered plant level safety functions for a typical PWR. Plant level safety
functions can be related to those functions accomplished by the plant emergency
operating procedures (EOP) and/or modeled in the plant-specific probabilistic risk
assessment (PRA).
 Primary coolant system integrity
− Primary coolant piping
− Primary coolant overpressure protection
− Primary coolant loss through interfacing systems
o Systems inside containment
o Systems outside containment
 Shutdown the reactor and maintain safe shutdown
− Reactivity control (subcriticality)
− Secondary heat removal

 5-43 
Example 5-2. Top Down Analysis of CWS Controls Using Fault Trees
(continued)
− Reactor coolant inventory control
o High pressure inventory control
o Low pressure inventory control
 Limit releases to the environment
− Primary containment control
o Containment isolation
o Containment pressure control
o Containment temperature control
− Secondary containment control
Beneath each of the plant level functions in Figure 5-2, plant systems that support
these functions for a typical PWR are listed. The focus of this Top Down analysis is on
circulating water, but it is not considered to be a frontline system in the PRA and does
not appear in Figure 5-2. However, review of the fault tree logic and dependency
matrices for the frontline systems shown in Figure 5-2 show that the main condenser,
which is supported by circulating water, ultimately provides support to two plant level
safety functions:
 Reactor inventory control – through the operation of turbine driven feedwater
pumps which require a condenser vacuum
 Secondary heat removal – through the maintenance of CST inventory (e.g.,
avoiding the need to makeup to the CSTs from systems such as demineralized
water or fire protection in order to maintain an adequate long term AFW pump
suction source)
The first section of Figure 5-4 provides a listing of three basic generation functions
which, in turn, are broken down into plant level generation functions:
 Reactor
− Reactivity control (maintain reactor power level)
− Reactor inventory control
 Turbine
− Flow of steam to turbine
− Condenser operation
− Steam generator inventory control
 Generator
− Conversion of steam energy to power
Functions that support the systems that provide plant level generation functions are
also summarized in Table 5-4:
 Control power/Pneumatic supply
 Equipment cooling
 Lubrication
 HVAC
Auxiliary functions are also shown that, if lost, may not directly affect any of the
primary or supporting generation related functions but eventually could lead to a

 5-44 
Example 5-2. Top Down Analysis of CWS Controls Using Fault Trees
(continued)
manual shutdown. These auxiliary functions generally are related to maintaining
reactor and fuel conditions.
Finally, a regulatory function is shown that is related to the operability of plant safety
systems. Again, these do not affect the ability to generate power directly, but reflect
limiting conditions for operation as found in the Technical Specifications.
From Figure 5-4, as expected, it can be seen that circulating water impacts the
condenser operation as a frontline system. While the PRA for this plant does not have
initiating event fault trees, a review of the fault trees used to perform accident
sequence quantification also identifies the main condenser and, hence circulating
water, as support systems for operation of the turbine driven feedwater pumps. The
success criteria for the circulating water system differ between its support of power
generation and post-trip decay heat removal (fewer trains are needed post-trip).
However, individual components needed for circulating water to support these plant
level functions and their failure modes are the same for either function.
Top Down Step 3: Identify Actuated/Controlled Components and their Failure
Modes
Given that the circulating water system is modeled in the PRA for this plant, simply
listing the major components that are controlled by I&C and their failure modes as
modeled in the PRA may be all that is necessary to complete this step.
Table 5-10 lists the CWS Tag IDs and Failure Modes for components in the
circulating water system that are actuated by the I&C equipment described in this
example. Table 5-10 also lists the PRA Basic Events representing these Components
and their Failure Modes, the normal state of these Components, and the state
required for each Component to support its required function.
The circulating water system is not modeled in all PRAs. In this situation, the top down
approach would require development of a simple fault tree for this system. Figures D-
1 through D-3 (see Appendix D) provide such a fault tree using the success criteria for
the circulating water system in support of full power operation.
Development of two fault trees is considered. Both fault tree models assume four of
the six circulating water pumps must be in service to support full power operation (or,
conversely, failure of three of the six pumps is assumed to result in low condenser
vacuum):
1. System response to tripping of an operating CWS pump
2. Operation of CWS Components when not called upon to operate (e.g., spurious
closure of a pump discharge valve)
The important components and their failure modes include the Tag IDs and Failure
Modes identified in Table 5-10.
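
Both fault trees rest on the success criterion stated above: four of the six pump trains must remain in service, so the loss of any three trains defeats the heat sink function at full power. A minimal sketch of that criterion, with hypothetical train names, is shown below; at the pump level it simply reproduces the twenty possible triples of train losses.

    from itertools import combinations

    PUMP_TRAINS = ["A1", "A2", "A3", "B1", "B2", "B3"]   # two divisions of three trains

    def cws_heat_sink_fails(unavailable_trains):
        """Assumed success criterion: four of six trains needed for full power."""
        return len(set(unavailable_trains) & set(PUMP_TRAINS)) >= 3

    # At the pump level, the minimal cut sets are every combination of three trains.
    pump_level_cut_sets = list(combinations(PUMP_TRAINS, 3))
    assert all(cws_heat_sink_fails(cs) for cs in pump_level_cut_sets)
    print(len(pump_level_cut_sets), "pump-level cut sets, e.g.", pump_level_cut_sets[0])

The more interesting results come from adding the control system to this logic, which is what Step 6 does below.
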
Top Down Step 4: Relate Actuated Component Failure Modes to Digital System
Failure Modes
Given the CWS Components and their Failure Modes identified in Step 3, it is
relatively easy to develop a list of failure modes for the digital I&C equipment at the
system level. Table 5-11 provides the results, which are summarized below:
 No control signal to isolate a valve (on loss of a pump)

 5-45 
Example 5-2. Top Down Analysis of CWS Controls Using Fault Trees
(continued)
 No control signal to open a valve and start a pump (on operator action to initiate
this signal)
 A control signal when one is not needed (spurious closure of a pump discharge
valve)
Top Down Step 5: Make a Decision (Continue or Transition to another Method)
Having identified the key digital system failure modes in Step 4, the results could be
turned over to the designer for use as input to a Design FMEA at this point. However,
it is assumed for this example that there is a need to confirm that the success criteria
for the digital control system are consistent with the overall design of the circulating
water system. Further development of the fault tree logic (to include portions of the
digital control system itself) is provided in Step 6.
Top Down Step 6: Extend the Top Down Method to the Digital I&C System
Trip and/or isolation of a single circulating water pump is assumed to leave the
system with insufficient capacity to support full power operation. However, the
degradation of condenser vacuum in response to a reduction in circulating water flow
is gradual, allowing time for the operators to open the discharge isolation valve and
start one or both of the idle circulating water pumps. The time available for the
operators to initiate this action and avoid a plant trip is several minutes.
Given this context, the plant control system impacts the circulating water system in
one of three ways:
 Support normal operation of the system by allowing operators to monitor system
performance and realign the system for the purpose of rotating equipment, etc.
 Response to the trip of a circulating water pump by automatically isolating the
affected pump (this prevents reverse flow through the tripped pump and an even
greater reduction in flow through the condenser than from just the loss of the pump)
and support for operator action to start and un-isolate one of the idle circulating
water pumps.
 Spurious actuation of circulating water equipment when not called upon to operate
(e.g., spurious closure of the circulating water pump discharge isolation valve).
The focus of the top down analysis is on the last two of these three functions. The top
down analysis takes the form of fault trees, similar to those used in a nuclear power
plant PRA, but not developed to the same level of detail and not requiring failure rates
for quantification.
Attachment D contains the circulating water system related fault trees used for the top
down evaluation of the plant control system shown in Figure 4-7. Figures D-1a
through D-1c define the system in support of maintaining plant operation should a
circulating water pump trip occur. Figures D-2a and D-2b present a top down review
of the system with respect to the potential for the system to lead to a spurious plant
trip.
Results
The top logic in Appendix D was used to identify the combinations of failures (i.e.,
cut sets) that must occur to lead to the inability of the circulating water system to
support full power operation. Table D-1 presents the dominant contributors to failure
of the system to perform its function. As expected, analysis using this top down logic
confirms that no single component failure leads to the loss of the ability of the
circulating water system to provide an adequate heat sink in support of full power
operation. The bulk of the combinations of failures that must occur to lose adequate
circulating water flow consist of three or more components and their failure modes
(i.e., pumps fail to run, breakers fail to remain closed, and discharge MOVs fail to
remain open, in combinations of three).
The number of failures required for the circulating water system not to be able to
perform its heat sink function is not unexpected given that four pumps are required to
support plant operation while there are two standby spare pump trains available.
However, there are approximately twenty cut sets that consist of only pairs of
components and their failure modes that can lead to failure of the circulating water
system. Many of these twenty pairs include components from the plant control system.
These combinations of failures can be found in Table D-1 and are highlighted in
Figure D-4.
Eight combinations of failures consist entirely of pairs of communication module
failures. These pairs of failures come from the spurious actuation top logic. Total loss
of communications for an entire division of circulating water can occur if all (two)
communication modules in the redundant communication loops in that division were
to fail. This leads to no input to the digital output modules for that division. Under
these conditions, the discharge isolation valves for all three pumps in the affected
division close leaving only the three pumps in the unaffected division. As the plant
requires four circulating water pumps to support full power operation, loss of the
pairs of communications modules results in insufficient circulating water pump flow.
Four of the remaining cut sets consisting of pairs of failures include a digital output
module failure combined with failure of the operators to initiate the standby trains of
circulating water in time to avoid a low condenser vacuum trip. These failures also
come from the spurious actuation top logic. Loss of a single digital output module
results in a false isolation signal to the discharge isolation MOV in the affected pump
train. As only three pumps are now providing circulating water flow, starting of one
of the standby trains is required. Failure of the operators to initiate one of the standby
trains in time results in the circulating water flow not being able to support full power
operation.
Other plant control system components (digital input modules, the master controller,
slave controller and operator workstations) appear with hardware and I&C failures in
combinations of three or more. That these components require multiple additional
failures before they can lead to conditions in which the plant cannot operate at full
power reflects the fact that there are two spare circulating water pump trains and the
operators can initiate the standby trains to mitigate loss of these components.
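To make the cut set discussion concrete, the following minimal sketch (illustrative only; it is not drawn from the Attachment D fault tree models, and the communication module names and dependency assumptions are ours) enumerates the small cut sets for a simplified version of the system described above: six pumps in two divisions, four pumps required for full power, and two redundant communication modules per division whose combined loss spuriously isolates all three pumps in that division.

```python
# Illustrative sketch only; component names and the 4-of-6 success criterion follow the
# simplified CWS description above, not the Attachment D fault tree models.
from itertools import combinations

PUMPS = [f"P{i}" for i in range(1, 7)]                    # P1-P3: division A, P4-P6: division B
DIVISION = {p: ("A" if i < 3 else "B") for i, p in enumerate(PUMPS)}
COMMS = ["COMM-A1", "COMM-A2", "COMM-B1", "COMM-B2"]      # hypothetical redundant comm modules

def pumps_available(failed):
    """Pumps still delivering flow, given a set of failed components."""
    lost = {p for p in PUMPS if p in failed}
    for div in ("A", "B"):
        # Loss of both comm modules in a division closes all three discharge valves there.
        if {f"COMM-{div}1", f"COMM-{div}2"} <= failed:
            lost.update(p for p in PUMPS if DIVISION[p] == div)
    return set(PUMPS) - lost

def system_fails(failed):
    return len(pumps_available(failed)) < 4               # fewer than 4 pumps -> top event

minimal_cut_sets = []
for size in (1, 2, 3):
    for combo in combinations(PUMPS + COMMS, size):
        candidate = set(combo)
        if system_fails(candidate) and not any(cs < candidate for cs in minimal_cut_sets):
            minimal_cut_sets.append(candidate)

for cs in minimal_cut_sets:
    print(sorted(cs))
# With this simplified model there are no single-failure cut sets; the only pairs are
# {COMM-A1, COMM-A2} and {COMM-B1, COMM-B2}, and the remaining minimal cut sets are
# triples of pump failures.
```

Even this toy model reproduces the qualitative insight above: the communication module pairs are the only two-failure combinations of interest, while every other contributor requires three or more failures.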
What it Means
It may seem unusual that a circulating water system which, from a hydraulic and
mechanical standpoint, is essentially designed to accommodate multiple failures is
nevertheless potentially vulnerable to pairs of failures in the control system. The reasons lie
in several places:
 Circulating water success criterion
While there are two divisions of circulating water, each apparently with a standby
spare pump train, it is necessary to have pumps from both trains in service in order
to support full power operation (four of six pumps). Combinations of component
failure that lead to loss of a single division of circulating water result in insufficient
flow to avoid high condenser pressure. There are pairs of control system
components (communication units, in particular) that can lead to loss of an entire
division of circulating water.
 Control system component failure modes
Failure modes of selected individual components in the control system result in the
loss of individual pump trains. For example, the digital output modules revert
to their shelf state when an input signal is not available. This, in turn, generates an
isolation signal to the discharge valve in the affected pump train.
A final insight coming out of the top down approach takes the form of a qualitative
ranking of the importance of various control system components, particularly relative
to one another. While no one component is critical to generation, a subset of control
system components is relatively important in supporting adequate circulating water
flow. These components include communications modules and digital output devices.
Absent design changes, these components would be those for which it would be
desirable to ensure their dependability from a design perspective and provide a high
degree of reliability from a maintenance perspective. Other control system
components (master and slave controllers, input modules, workstations) do not have
as great an impact on system operation: multiple and diverse component failures
must occur in addition to failures of these components before the system cannot
perform its function, and they are not likely to trigger a plant transient were they to fail.
Table 5-10
CWS Components Controlled by I&C Equipment (Safety & Generation)

Circuit Breaker (CB-01 through CB-06)

Failure mode: Fail to remain closed
PRA basic events: CWS-CBCO-CB-01 through CWS-CBCO-CB-06
Normal config.: Closed; Config. req'd to support function: Closed
Auto function: Pump protection
Comment: Trip on overcurrent, low voltage, manual

Failure mode: Fail to close
PRA basic events: CWS-CBOO-CB-01 through CWS-CBOO-CB-06
Normal config.: Open; Config. req'd to support function: Closed
Auto function: Discharge valve position starts pump
Comment: Partial opening of valve required before breaker closes to prevent deadhead of pump

Pump Discharge Valve (MO-01 through MO-06)

Failure mode: Fail to remain open
PRA basic events: CWS-MVOC-MO-01 through CWS-MVOC-MO-06
Normal config.: Open; Config. req'd to support function: Open

Failure mode: Fail to close
PRA basic events: CWS-MVOO-MO-01 through CWS-MVOO-MO-06
Normal config.: Open; Config. req'd to support function: Closed
Auto function: Close on opening of pump breaker
Comment: Prevent flow diversion through idle pump

Failure mode: Fail to open
PRA basic events: CWS-MVCC-MO-01 through CWS-MVCC-MO-06
Normal config.: Closed; Config. req'd to support function: Open
Auto function: Manually open
Comment: Sufficient time available for manual start on loss of another train before function is lost

Circ Water Pump (P1 through P6)

Failure mode: Fail to run
PRA basic events: CWS-PMFR-P1 through CWS-PMFR-P6
Normal config.: Run; Config. req'd to support function: Run
Auto function: Discharge valve position closes breaker
Comment: Four trains needed to support power operation, one train needed post trip

Failure mode: Fail to start
PRA basic events: CWS-PMFS-P1 through CWS-PMFS-P6
Normal config.: Idle; Config. req'd to support function: Start
Table 5-11
CWS Component vs. Digital System Failure Modes

System: Circulating Water
Safety/Generation functions: Condenser operation; SG inventory control

Tag IDs: MO-01 through MO-06
Component failure mode: Fail to remain open
Digital system failure mode(s): A signal when one is not needed (spurious closure of a pump discharge valve)
Component failure mode: Fail to close
Digital system failure mode(s): No signal to isolate valve (on loss of a pump)
Component failure mode: Failure to open
Digital system failure mode(s): No signal to open a valve and start a pump (on operator action to initiate this signal)

Tag IDs: CB-01 through CB-06
Component failure mode: Fail to close
Digital system failure mode(s): No signal to open a valve and start a pump (on operator action to initiate this signal)
5.5 Top Down Strengths

Integrated view of plant design

Fault tree analysis provides a view of the role a system plays within the overall
integrated plant design. This integrated perspective even includes events going
beyond the design basis and considers the effects of failures not only within the
digital system but also at the level of the systems in which the digital system is
installed and at the plant function level, for both safety and generation functions.

Not limited to single failures

Fault tree analysis systematically assesses and evaluates the effects of combinations
of failures, including those due to common cause.

Existing logic

Top down analysis methods may be able to take advantage of existing fault tree
logic that has been developed in support of the plant specific PRA. Components
and failure modes that are included in the PRA also may be appropriate for
consideration in evaluating generation related functions.

5.6 Top Down Limitations

Focus on failures

The focus of fault tree analysis on failure modes limits the ability of the method
to consider interactions between systems or components that can lead to adverse
behaviors under plant states in which no failures are present. Existing fault tree
logic may be incomplete for evaluating plant conditions in which everything
performed as designed but an unacceptable outcome still occurred.

Complexity of models

Fault tree logic models can be large, difficult to display on a few pages or screens
and require specialized software to present and review. Should development of
new fault trees be needed, the effort can be burdensome if not managed
effectively.

Addressing the Limitations

The limitations listed above apply to traditional approaches to the development of
fault trees. It may be possible to borrow techniques from some of the other methods
to address these limitations.

For example, the HAZOP method described in Section 6 uses guide words to
assess the state of a system or component under review. These guide words are
applied without regard to whether the system or components in question have
succeeded or failed. They then lead to subsequent questions regarding what plant
conditions can lead to the state defined by the guide word, whether or not it
involves successful operation of the component or is a result of its failure. A
similar approach can be taken once Tag IDs and their failure modes have been
identified from the plant specific PRA. That is, ask an additional question as to
what legitimate plant conditions can lead to the system and component being in
the so called ‘failure mode’ modeled in the PRA. If those legitimate conditions
have not been considered explicitly in the accident sequences of the PRA, the
failure analysis can be extended beyond what is included in the fault trees to review
those conditions in a format similar to that used in HAZOPs or, alternately, the
fault tree could be expanded to model the events that lead to those plant
conditions.

The PGA method described in Section 8 has, as one of its steps, the
development of tables that contain Goals and Processes. The objective of this
step is to identify where conflicting or incompatible goals may exist. The
conflicts that may be identified are irrespective of the success or failure of the
systems under review. A similar approach can be taken beginning with the Tag
IDs and Failure Modes coming from the plant specific PRA. Understanding the
function(s) that are being supported by the Tag ID for specific failure modes and
asking whether there are any functional successes that are directly incompatible
may lead to the identification of plant conditions which are not considered in the
PRA but could lead to adverse outcomes even though the systems and
components under review perform as designed.
Section 6: Hazard & Operability Analysis
(HAZOP) Method
Per Reference 12, a HAZOP, or HAZard and OPerability analysis, is a
systematic review of a process (e.g., system design), using “guide words,” to
visualize the ways in which a system can malfunction. The HAZOP analysis
searches for possible deviations from the design intent that can occur in
components, operator or maintenance technician actions, or material elements
(e.g., air, water, steam), and whether the consequences of such deviations can
result in a hazard.

Reference 31 adds:

Safety and reliability in the design of a plant initially relies upon the
application of various codes of practice, or design codes and standards.
These represent the accumulation of knowledge and experience of both
individual experts and the industry as a whole. Such application is
usually backed up by the experience of the engineers involved, who
might well have been previously concerned with the design,
commissioning or operation of similar plant. However, it is considered
that although codes of practice are extremely valuable, it is important to
supplement them with an imaginative anticipation of deviations that
might occur because of, for example, equipment malfunction or
operator error. In addition, most companies will admit to the fact that
for a new plant, design personnel are under pressure to keep the project
on schedule. This pressure always results in errors and oversights. The
Hazop Study is an opportunity to correct these before such changes
become too expensive, or 'impossible' to accomplish.

6.1 HAZOP Overview and Objectives

Reference 33 states:

HAZOP is a structured and systematic technique for examining a defined system,
with the objective of:

 identifying potential hazards in the system. The hazards involved may include
both those essentially relevant only to the immediate area of the system and those
with a much wider sphere of influence, e.g. some environmental hazards;
 identifying potential operability problems with the system and in
particular identifying causes of operational disturbances and
production deviations likely to lead to nonconforming products.

An important benefit of HAZOP studies is that the resulting knowledge, obtained
by identifying potential hazards and operability
problems in a structured and systematic manner, is of great assistance in
determining appropriate remedial measures. A characteristic feature of
a HAZOP study is the “examination session” during which a
multidisciplinary team under the guidance of a study leader
systematically examines all relevant parts of a design or system. It
identifies deviations from the system design intent utilizing a core set of
guide words. The technique aims to stimulate the imagination of
participants in a systematic way to identify hazards and operability
problems. HAZOP should be seen as an enhancement to sound design
using experience-based approaches such as codes of practice rather than
a substitute for such approaches.

The HAZOP analysis results are described in a worksheet; a sample is provided in
Table 6-1.
Table 6-1
Sample HAZOP Worksheet

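As a minimal sketch of the record that such a worksheet captures (the field names below are illustrative and follow the procedure steps described in Section 6.2; they are not the template shown in Table 6-1), one worksheet row can be represented as:

```python
# Illustrative sketch; field names are assumptions, not the worksheet template in Table 6-1.
from dataclasses import dataclass, field
from typing import List

@dataclass
class HazopRow:
    process_part: str          # Step 2: the plant element under review
    design_intention: str      # Step 3: what the part is supposed to do
    element: str               # Step 4: attribute being examined
    guide_word: str            # Step 5: e.g. "NO or NOT", "MORE", "LESS", ...
    deviation: str             # Step 5: the postulated departure from the design intention
    possible_causes: List[str] = field(default_factory=list)      # Step 6
    consequences: List[str] = field(default_factory=list)         # Step 7
    existing_safeguards: List[str] = field(default_factory=list)  # Step 8
    action_items: List[str] = field(default_factory=list)         # Step 9

row = HazopRow(
    process_part="Main condenser",
    design_intention="Condense up to 95% of main steam with the turbine bypass valve open",
    element="Hotwell water inventory",
    guide_word="NO or NOT",
    deviation="Hotwell level does not remain within operating limits",
)
```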
6.2 HAZOP Procedure

Prerequisite

The results of a Function Analysis, as described in Section 3.6, are a useful input
to the HAZOP analysis because they provide a well-organized set of functions that
can feed into Step 3 of the HAZOP procedure (identify design intentions and
success criteria).

The following HAZOP procedure is based on the guidance provided in Reference 12.

HAZOP Step 1: Form an Assessment Team

The HAZOP method involves the judgment and experience of a multidisciplined
team. Expertise may be needed from a variety of disciplines,
such as I&C, electrical, mechanical, thermal-hydraulic, reactor engineering,
human factors, operations, maintenance, PRA, etc.

A facilitator trained and experienced in the HAZOP method should be a member of
the team, available to facilitate assessment meetings.

The HAZOP procedure works best when the assessment team is gathered
together in one or more meetings with the purpose of executing the HAZOP
procedure steps described below. When the right people are together, who query
each other on potential process deviations and their likely causes in cross-
disciplined manner, a more complete assessment will emerge and provide more
opportunities for identifying unwanted and potentially hazardous system
behaviors.

HAZOP Step 2: Select a Process Part

A “process part” in the context of the HAZOP method means that portion of the
plant system or process that is of interest to the analyst. A process part can be a
section of a passive element in a system or process, such as a main steam or
feedwater line, or a tank or vessel, such as a steam generator or main condenser.
A process part can also be an active process element such as a pump or valve. . A
process part can also be a high level function in a plant that encompasses multiple
systems or trains.

While this guidance is about hazard analysis of digital I&C systems, it is important
to view the process parts that are analyzed with the HAZOP method
as the active and passive plant process elements that are affected by the digital
I&C system or component that “acts upon” those elements in terms of the
performance of the process itself (i.e., flow, level, temperature, reactivity, heat
rate, etc.). A piping and instrument diagram (P&ID) is likely to be the best
starting point for identifying the process parts that are acted upon or affected by
the digital I&C system or component of interest.
Figure 6-1 illustrates an example, simplified view of the Balance of Plant (BOP)
systems in a BWR. The piping sections and major components represent “parts”
of the process that can be expressed in terms of process conditions, such as
temperature, flow and level. This example, which is used here to demonstrate the
HAZOP procedure, is based on the EPRI Utility Requirements Document
(Reference 43), Volume II Section 3.4.5, in which advanced reactors are required
to have load rejection capability (to have some capacity to continue operating as
an island on loss of offsite power without a reactor trip). This is a design feature
that was available for some of the first generation of US nuclear power plants.
The example examines an event early in the life of a BWR-1 facility with such a
load rejection capability, where the reactor tripped after experiencing a transient
condition in the BOP systems that were thought to be designed for such
transients (that were not supposed to result in a reactor trip).

This particular BWR is designed to remain in Mode 1 at 100% reactor power and
5% electric power following a Loss Of Offsite Power (LOOP) in order to
provide power for house loads. The process “part” to be evaluated in this
demonstration is the main condenser, which is sized for 95% turbine bypass flow
conditions to accommodate full reactor power while the main generator
continues to supply power to the plant systems at 5% MWe following a LOOP.
The plant conditions represented in Figure 6-1 are Mode 1, 100% power, with
BOP systems running normally.

Figure 6-1
BWR Balance of Plant
(Simplified diagram: the reactor at 100% rated thermal power supplies 100% steam flow through the control valve (CV) to the high and low pressure turbines and the generator at 100% MWe, with 0% flow through the turbine bypass valve (TBV) to the condenser. The condensate and feedwater pumps return hotwell inventory to the reactor through the low pressure and high pressure feedwater heaters, with makeup and reject lines connecting the condensate system to the condensate storage tank.)
HAZOP Step 3: Determine Design Intention and Success Criteria

This step requires a clear statement of the design intention of the process part
under consideration, and the success criteria (or acceptance criteria) that are used
to demonstrate that the design intention is met. Per Reference 33, the term
“design intent” is defined as:

Designer’s desired, or specified range of behavior for elements and characteristics.

Continuing with the BWR example, the “design intentions” of the main
condenser (i.e., the part) are to:
a. Condense the exhaust from the low pressure turbines when the reactor and
the main turbine/generator are at 100% power, or
b. Condense up to 95% of the main steam supplied by the reactor when it is at
100% power and the turbine bypass valve is open.

The “success criteria” for this design intention are for condenser pressure to remain
below its high-pressure setpoint and for hotwell level to remain between the upper
and lower operational limits.

If condenser vacuum falls below 22.5 inches Hg, then a reactor and turbine trip
would be initiated. To avoid this trip, the circulating water system is sized to
accommodate more than 100% rated thermal power.

If the hotwell level increases from the normal operating band to the upper limit,
the reject valve shown in Figure 6-1 will open, and the condensate pump will
dump the excess inventory in the hotwell to the condensate storage tank in order
to protect the condenser and main turbine from an overfill condition. The line
between the condensate pump discharge and the condensate storage tank is sized
to accommodate full condensate flow.

For example, if the makeup valve between the condensate storage tank and the
main condenser were to malfunction and open, when it should be closed, then
hotwell level will increase to the point where the reject valve will open to
compensate for the inadvertent addition of condensate inventory.

Likewise, if the hotwell level decreases from the normal operating band to the
lower limit, the makeup valve will open, thus increasing the hotwell level and
protecting the condensate pump from inadequate suction pressure.
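As a minimal illustration of how these success criteria can be stated unambiguously for later screening, the checks for the condenser part might be written as follows (the hotwell level limits are placeholder values; only the 22.5 inches Hg vacuum figure comes from the example above):

```python
# Illustrative sketch; the hotwell level limits are placeholder values, not plant data.
CONDENSER_VACUUM_TRIP_IN_HG = 22.5                   # from the example: trip if vacuum degrades past this
HOTWELL_LEVEL_LOW, HOTWELL_LEVEL_HIGH = 20.0, 40.0   # assumed operating band, inches

def condenser_success(vacuum_in_hg: float, hotwell_level_in: float) -> bool:
    """True if the design intention's success criteria are met for the sampled plant state."""
    vacuum_ok = vacuum_in_hg > CONDENSER_VACUUM_TRIP_IN_HG
    level_ok = HOTWELL_LEVEL_LOW <= hotwell_level_in <= HOTWELL_LEVEL_HIGH
    return vacuum_ok and level_ok

print(condenser_success(vacuum_in_hg=27.0, hotwell_level_in=30.0))   # True: normal operation
print(condenser_success(vacuum_in_hg=27.0, hotwell_level_in=45.0))   # False: high hotwell level
```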

HAZOP Step 4: Identify Elements/Attributes

The next step is to identify the elements or attributes that characterize the
selected process part(s). An “element” is defined, per Reference 33, as:

Constituent of a part which serves to identify the part’s essential features. Note:
The choice of elements may depend upon the particular
application, but elements can include features such as the material
involved, the activity being carried out, the equipment employed, etc.
Material should be considered in a general sense and includes data,
software, etc.

In the BWR example, the element involved is the water in the hotwell basin of
the condenser.

HAZOP Step 5: Apply Guide Words to Develop Possible Deviations

Table 6-2 provides the “guide words” that are used in the HAZOP procedure to
assess postulated conditions that could be “deviations” from the design intention
identified in Step 3. The underlying idea is to propose each of the guide words in
the context of the design intention and see if the affected process part deviates
from its design intention.

In the BWR example, starting with the “Not” guide word, the following
statement is proposed:

When the turbine bypass valve is open, the reactor is at 100% power,
and the condenser is condensing 95% of the main steam, hotwell level
does not remain within operating limits.

Notice how this statement includes the design intention (condensing 95% of main
steam) and a proposed deviation from the success criteria (within limits) using a
guide word (not).

Table 6-2
HAZOP Guide Words

Guide Word: NO or NOT
Meaning: The complete negation of the design intentions
Comments: No part of the design intention is achieved, but nothing else happens

Guide Words: MORE / LESS
Meaning: Quantitative increases or decreases
Comments: These refer to quantities and properties such as flow rates and temperatures as well as activities like “HEAT” or “REACT”

Guide Word: AS WELL AS
Meaning: A qualitative increase
Comments: All of the design and operating intentions are achieved together with some additional activity

Guide Word: PART OF
Meaning: A qualitative decrease
Comments: Only some of the intentions are achieved; some are not

Guide Word: REVERSE
Meaning: The logical opposite of the intention
Comments: This is mostly applicable to activities, for example reverse flow or chemical reaction. It can also be applied to substances, e.g. “POISON” instead of “ANTIDOTE” or “D” instead of “L” optical isomers.

Guide Word: OTHER THAN
Meaning: Complete substitution
Comments: No part of the original design intention is achieved. Something quite different happens.

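Because Step 5 is essentially a systematic pairing of guide words with the elements or attributes of the design intention, the bookkeeping can be sketched as follows (the attribute list is illustrative only):

```python
# Illustrative sketch of Step 5 bookkeeping; the attributes listed are examples only.
GUIDE_WORDS = ["NO or NOT", "MORE", "LESS", "AS WELL AS", "PART OF", "REVERSE", "OTHER THAN"]

part = "Main condenser"
attributes = ["hotwell level", "condenser vacuum", "condensate flow to the hotwell"]

candidate_deviations = [
    (part, attribute, guide_word)
    for attribute in attributes
    for guide_word in GUIDE_WORDS
]
# 3 attributes x 7 guide words = 21 candidates for the team to screen; most are
# discarded as not meaningful, and the remainder become worksheet deviations.
print(len(candidate_deviations))
for part_name, attribute, guide_word in candidate_deviations[:3]:
    print(f"{guide_word}: {attribute} ({part_name})")
```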
HAZOP Step 6: List Possible Causes of Deviations

The next step is to identify and list the possible causes of the deviations identified
in Step 5.

In the BWR example, possible causes of the stated deviation (hotwell level
outside of normal operating limits) could be as follows:
 95% turbine bypass flow + inadvertent opening of hotwell makeup valve leads
to high hotwell level
 Greater than 95% turbine bypass flow leads to high hotwell level
 95% turbine bypass flow + inadvertent opening of hotwell reject valve leads
to low hotwell level
 Less than 95% turbine bypass flow leads to low hotwell level
 Two-phase conditions in the hotwell basin lead to high hotwell level (i.e.,
swell)

HAZOP Step 7: Evaluate Consequences of Deviations

The next step is to evaluate the consequences of the deviations identified in Step 5.

In this example, a high level condition in the hotwell will cause the reject valves
to open, diverting condensate flow to the condensate storage tank. The resulting
effect on the feedwater system is a reduction in feedpump suction pressure,
leading to a feedpump trip, which then causes reactor water level to decrease to
the point of reaching an automatic reactor trip.

In fact, this condition was experienced by the BWR facility that provided the
background for this example. Figure 6-2 illustrates the scenario by the following
sequence of events (using the labels provided in the figure):
A. A Loss of Offsite Power (LOOP) event occurs. By design, the main turbine
control valve (CV) closes to 5% flow, and the turbine bypass valve (TBV)
opens to 95%. The reactor remains in Mode 1 at hot full power, and the
main generator remains connected to house loads, running at 5% power
(MWe).
B. When the turbine bypass valve opens, the condenser experiences pressure and
temperature fluctuations that reach a “flashing” condition resulting in a two-
phase mix in the hotwell basin. At first, condenser pressure increases, but
stays below the turbine exhaust pressure trip setpoint. When the pressure
decreases back to the normal vacuum condition, a phase change begins to
occur in the hotwell, from the liquid to the vapor phase, resulting in the two-
phase mixture.
C. The two-phase mix results in a sensed (i.e., indicated) increase in hotwell
level.
D. A “high level” signal is transmitted to the reject valve, which promptly opens
as designed.
E. Full condensate flow is diverted to the condensate tank, as designed.
F. The feedwater pump trips on low suction pressure, as designed.
G. The reactor trips on low water level, as designed.

Notice that all of the components involved in this scenario behaved exactly as
designed, although it may be true that the BOP system design criteria never
considered the possibility of a two-phase condition in the hotwell due to a
temperature/pressure transient in the main condenser.

This is an example of an unintended behavior of a system design that can be
revealed using the HAZOP method, which also forces consideration of potential
causes of such behaviors.

Figure 6-2
BWR Trip Sequence of Events after LOOP
(Annotated version of Figure 6-1 showing the sequence described above: (A) a LOOP occurs; the control valve closes to 5% flow, the generator carries 5% MWe house loads, and the turbine bypass valve passes 95% flow to the condenser; (B) the pressure transient produces a two-phase mixture in the hotwell; (C) indicated hotwell level increases; (D) the reject valve opens on high hotwell level; (E) full condensate flow is diverted to the condensate storage tank; (F) the feedwater pump trips on low suction pressure; (G) the reactor trips on low water level.)

HAZOP Step 8: Identify Existing Safeguards to Prevent Deviations

The next step is to identify any safeguards (i.e., features, functions, administrative
controls, etc.) that exist that can prevent the deviations from occurring in the first
place.

In the BWR example, there are no existing safeguards to prevent the high
hotwell level deviation; otherwise, the event would not have occurred. A review
of existing safeguards could have led to at least some recognition of the possibility
of the deviation that was experienced.

HAZOP Step 9: Develop Action Items

The HAZOP procedure concludes with a list of action items associated with
each identified deviation. In practice, a HAZOP worksheet like the sample
provided in Table 6-1 captures the results of all 9 steps of the procedure. If any
particular action item meets the criteria for entry into the facility corrective action
program, then one or more condition reports should be initiated and cross-
referenced to the HAZOP worksheet. Section 6.4 provides a worked example
using a suggested HAZOP worksheet format.

In this BWR example, and in the actual BWR facility that experienced the event,
one of the resulting action items was to modify the plant response to a load
rejection as follows:
 Retain the existing reject valve “open” permissive on high hotwell level
 Provide automatic trip of a single recirculating water pump on signals that
result in full opening of the turbine bypass valve. Thermal hydraulic analysis
of the reduction in flow to the reactor on loss of a single pump confirmed
that the void increase in the core would cause a temporary rise in reactor
level. The high reactor level would result in a relatively early throttling of
feedwater flow by the flow control valves. This reduction in feedwater flow
allowed feedwater pump suction to remain above the low suction pressure
setpoint even if the reject valves were open due to a false high hotwell level
signal. Once conditions in the condenser and hotwell stabilized, the resulting
steam flow through the bypass valve, given a tripped recirculating water pump,
was significantly less than rated flow, while reactor power was still more than
sufficient to support house loads using the main generator and avoid a plant trip.

If the reader is questioning why this BWR example is included in a guideline on
hazard analysis methods for I&C systems, the results of Step 9 provide the
answer. The purpose of most I&C systems is to provide sensing, command and
control of plant processes. Often, a traditional hazard analysis method such as
FMEA will systematically postulate I&C system or component failure modes
and effects, and overlook the possibility of unintended behaviors when process
deviations occur. This is what makes HAZOP particularly effective for
identifying unintended behaviors in the process parts and formulating corrective
actions that can be implemented in the control systems.

HAZOP Step 10: Repeat

For each process part, or for each element associated with a given part, the
HAZOP procedure is repeated until the hazard analysis scope is satisfied. For
guidance on developing hazard analysis scope and objectives, refer to Section 3.1.
6.3 Applying the HAZOP Results

As with other hazard analysis methods described in this guideline, the results of a
HAZOP analysis can be used in support of the following activities:

Application Development

The HAZOP results can be used by the integrator to improve system designs
through the application development lifecycle process. The conceptual design
phase of the lifecycle process should include a preliminary hazards analysis, using
one of the approaches described in Section 3.7. A preliminary HAZOP analysis
can be used to identify and reduce or eliminate potential vulnerabilities in the
system as the design activities progress. Some vulnerabilities may be prevented or
mitigated to a reasonable extent through one or more defensive measures that are
realized through design requirements and/or plant programs and processes. For
guidance on applying defensive measures in digital I&C systems, see References
20 and 21.

The HAZOP analysis should be updated through the design process, or when
the design is complete, to reflect the finished design at an appropriate application
baseline. For guidance on determining baselines, see EPRI 1022991 (Reference
18).

The finished HAZOP analysis should be validated, at least to the extent that the
behaviors or corrective actions identified in the analysis can be tested without
extraordinary conditions or destructive methods, in the test phase of the
application development lifecycle. HAZOP validation test cases can be executed
at the Factory Acceptance Test (FAT), Site Acceptance Test (SAT) or during
post-installation testing. Additional guidance on testing is provided in EPRI
1025282 (Reference 32).

Licensing

HAZOP results can be useful when considering the likelihood of malfunctions and
accidents under regulatory rules. For guidance on licensing of digital
upgrades, see Reference 4.

In the US, protection system upgrades subject to 10CFR50.55a(h) require a
Software Safety Analysis (SSA). Because SSA is a hazards-driven process, the
HAZOP method may be suitable for meeting this requirement, as long as the
SSA steps described in the licensee or applicant Software Safety Plan are fulfilled.

The HAZOP method is useful when a licensing activity requires a demonstration
that hazards have been properly identified and eliminated,
reduced or mitigated. However, the definition of “hazard” is very important when
adapting the HAZOP method to a licensing activity (emphasis added):
IEEE Definition of Hazard: A condition that is a prerequisite to an
accident. Hazards include external events as well as conditions internal
to computer hardware or software. (Reference 9)

The IEEE definition, accepted by the NRC, considers internal and external
events and conditions. The HAZOP method can be useful because it considers
plant process deviations that can be caused by, or mitigated by control system
actions.

When properly applied, the HAZOP method should be well-suited for a licensing
activity that requires a demonstration that hazards have been
systematically and properly identified and addressed.

6.4 HAZOP Example

For brevity, one worked example of the HAZOP method is provided, using the
same Circ Water System (CWS) controls described in previous examples. Figure
4-7 and Figure 4-8 provide diagrams of the CWS controls that are evaluated in
this example.
Example 6-1. Circ Water System Controls HAZOP
HAZOP Step 1: Form an Assessment Team
A multidisciplined team was formed, made up of expertise from digital I&C design,
mechanical systems design, systems engineering, digital control system product
design and PRA knowledge domains. The team met twice, first to review the CWS
control system design and initiate the HAZOP worksheet, and again to review the
results and confirm the recommended corrective actions. A HAZOP method facilitator
was on the phone with the team for both meetings, and offered valuable guidance on
the selection of the process parts to be assessed, and how to effectively use the guide
words to postulate deviations.
The team members’ initials were recorded at the top of the HAZOP worksheet that
was initiated in the first team meeting, provided in Table 6-3.
HAZOP Step 2: Select a Process Part
For this example, the HAZOP analysis team selected the COMM 1 “part” of the I/O
cabinet in CWS control system process illustrated in Figure 4-7. The process part was
identified on the HAZOP worksheet.
HAZOP Step 3: Determine Design Intention and Success Criteria
The design intention of the COMM 1 module is a function that would be listed or
described in a prerequisite Function Analysis (FA). The design intention of COMM 1
in a given I/O cabinet is to pass, or communicate data that is addressed to or from
the I/O modules in that cabinet. The success criterion is to communicate the data
without any errors or losses of the COMM 1 data link that connects the I/O cabinet
to other cabinets. The design intention and success criteria were recorded on the top
rows of the HAZOP worksheet.
HAZOP Step 4: Identify Elements/Attributes
For this example, the HAZOP team identified one of the elements/attributes of the
design intention (data communication to/from I/O modules) as the “signaling voltage
on the physical interface (indicating the presence of a modulated carrier).” In other
words, the electrical characteristics at the physical layer described by the Open
Systems Interconnect (OSI) 7-layer model. This element was recorded in the second
column of the HAZOP worksheet.
HAZOP Step 5: Apply Guide Words to Develop Possible Deviations
The “guide words” provided in Table 6-2 were used to assess postulated conditions
that could be “deviations” from the design intention identified in Step 3. Each guide
word was listed in its own row in the HAZOP worksheet, and the resulting deviations
were recorded in the “Deviation” column. For example, the guide word “No” could
result in the deviation “no carrier signal.”
HAZOP Step 6: List Possible Causes of Deviations
The HAZOP team discussed and debated possible causes of each deviation listed in
the HAZOP worksheet (Table 6-3).
For example, continuing with the “No carrier signal” deviation, three possible causes
are as follows:
 A broken wire
 A dead COMM 1 module
 A failed backplane
The results of this step are recorded in the “possible causes” column of the HAZOP
worksheet.
HAZOP Step 7: Evaluate Consequences of Deviations
The team carefully examined Figure 4-7 and Figure 4-8 to determine and evaluate the
consequences of the deviations listed in the HAZOP worksheet.
The consequences associated with the “No carrier signal” deviation listed in Table 6-
3 are as follows:
 No consequence, other than loss of one COMM module. The redundant COMM
module maintains communication (i.e., the data link) with other cabinets.
 A possible “failed backplane” cause of the “No carrier signal” deviation will
result in loss of both COMM modules, a complete failure to communicate data to
other cabinets in the CWS control system, and due to the basic design and
architecture of the controls, will result in loss of the circulating water system
pumps.
HAZOP Step 8: Identify Existing Safeguards to Prevent Deviations
For each of the deviations and their possible causes, existing safeguards were
identified and evaluated for their potential to prevent or mitigate the deviation. Upon
review of the completed worksheet in Table 6-3, it is apparent that existing
safeguards are available for all deviations and their causes except for one; that
being a failed backplane.
HAZOP Step 9: Develop Action Items
Action items were developed and assigned to the appropriate team members. In most
cases, the action items were written to confirm the applicability and effectiveness of
existing safeguards such as wiring standards, periodic test procedures, or internal
control system diagnostic features.
One action item (highlighted in yellow) stood out in this example, requesting a
design review of the CWS control system architecture and a proposal for a design
change to prevent the loss of all 3 CWS pumps due to a failed backplane. The plant
described by this example requires 4 out of 6 CWS pumps to be operating to avoid
ultimate heat sink issues that would lead to an inadequate condenser vacuum,
causing a turbine trip.
What is interesting about this example is that it readily identified a failure of a single
passive element (the backplane) that leads to an unacceptable result (turbine trip).
The FMEA and FTA methods, applied to the same example, did not reveal this
vulnerability.
The FTA method has the potential for identifying such vulnerabilities if modeling of
common cause failures is considered, although specific root causes of these failure
modes may not be identified explicitly without further effort.
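A minimal sketch of why the backplane stands out is shown below; the dependency model (two redundant COMM modules mounted on a single shared backplane in a division's I/O cabinet) is an assumption drawn from the narrative above, not from the Figure 4-7 design details:

```python
# Illustrative sketch; the dependency relationships are assumptions based on the
# narrative above, not the actual Figure 4-7 design.
REQUIRES = {
    "COMM-A1": {"BACKPLANE-A"},
    "COMM-A2": {"BACKPLANE-A"},
}

def division_a_data_link_ok(failed):
    """The data link survives if at least one COMM module and everything it requires is healthy."""
    def healthy(component):
        return component not in failed and all(healthy(dep) for dep in REQUIRES.get(component, ()))
    return healthy("COMM-A1") or healthy("COMM-A2")

for single_failure in ["COMM-A1", "COMM-A2", "BACKPLANE-A"]:
    ok = division_a_data_link_ok({single_failure})
    # Losing the data link closes all three division discharge valves, dropping the plant
    # below the 4-of-6 pump requirement described above.
    print(f"{single_failure} failed -> division A data link {'OK' if ok else 'LOST'}")
# COMM-A1 or COMM-A2 alone leaves the link OK; BACKPLANE-A alone loses it, which is
# the single passive failure the HAZOP surfaced.
```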
Table 6-3
CWS Controls HAZOP Worksheet

6.5 HAZOP Strengths

Accessible to I&C Design Engineers and I&C Equipment Designers

The HAZOP method as described in this guideline is widely used in multiple
industries. Nuclear power plant I&C design engineers and I&C equipment
designers are usually trained and experienced in mechanical, electrical,
electronics, nuclear and other discipline-specific engineering fundamentals. The
idea of postulating deviations in plant processes is consistent with their training
and experience with design and support of the facility. Therefore, the HAZOP
method is accessible and available to the staffs responsible for design, technical
support, and operations and maintenance activities at a nuclear power plant.

Systems View

The HAZOP method takes a system view. The results are useful for input to the
requirements definition phase of a digital I&C project because they support a
goal-driven design from the beginning. Goals include safety, reliability, power
generation, etc.

The HAZOP method can provide insights into system behaviors beyond what is
typically revealed by FMEA and Top Down, because it considers the behaviors
of active and passive plant elements without necessarily postulating specific
failures.

Unexpected Behaviors

The HAZOP method can help identify unexpected and strange system behaviors
that may not otherwise be thought credible or possible. For example, it can
identify adverse interactions between components and systems that would on the
surface appear to have no potential interactions at all.

Elegant Final Results

When the data is reduced to the final list of corrective actions, the results can
typically be readily used to inform requirements, identify and apply defensive
measures, and demonstrate system acceptability.

The final results can also be used as an input to another method to help avoid
searches for faults and failures that don’t lead to hazards.

6.6 HAZOP Limitations

Interactions

Reference 33 says the following regarding the effectiveness of the HAZOP method
for identifying interactions between systems or parts of a system:
HAZOP is a hazard identification technique which considers system
parts individually and methodically examines the effects of deviations
on each part. Sometimes a serious hazard will involve the interaction
between a number of parts of the system. In these cases the hazard may
need to be studied in more detail using techniques such as event tree
and fault tree analyses.

Many systems are highly inter-linked, and a deviation at one of them may have a
cause elsewhere. Adequate local mitigating action may not
address the real cause and still result in a subsequent accident. Many
accidents have occurred because small local modifications had
unforeseen knock-on effects elsewhere. Whilst this problem can be
overcome by carrying forward the implications of deviations from one
part to another, in practice this is frequently not done.

For guidance on methods that are better suited to identifying potentially hazardous
interactions between systems or parts of a system, please see Sections
3.3 and 3.4.

Trained Facilitator

It helps to have a facilitator trained in the use of HAZOP, because the method takes
on a broader view of the system(s) that can be affected by a digital I&C activity and
the hazards that it may cause. Most users of this guidance are likely to be trained
and competent in specific engineering disciplines or tasks, and may find it
difficult to navigate the HAZOP process the first time or two without a
facilitator. This method requires the ability to isolate a process part and
characterize its elements, consider deviations from multiple points of view, and
identify the causes of such deviations and related safeguards or corrective actions.

The principal investigators of this guideline researched the HAZOP method and
developed the worked examples provided in Section 6.4. As the examples were
developed using the team approach described in the HAZOP procedure (Section
6.2), it became apparent that the team’s experience with other methods such as
FMEA and Top Down drove an overly narrow consideration of active plant
components and their failure modes. This narrow focus missed the point
that the HAZOP method provides the most benefit by considering deviations
from the design intentions of plant process parts, which can be active or passive
elements in the plant. A trained facilitator helped the team recognize the error
traps created by their own mindsets and get back on the right track.
Section 7: Systems Theoretic Process
Analysis (STPA) Method
Systems Theoretic Process Analysis (STPA), a hazard analysis method, is one
part of a set of new or refined system safety engineering methods developed by
Dr. Nancy Leveson and her team at the Massachusetts Institute of Technology
(MIT), under the heading of Systems-Theoretic Accident Model and Processes
(STAMP). This work has been published in Dr. Leveson’s book, Engineering a
Safer World – Systems Thinking Applied to Safety (Reference 19).

The STAMP model addresses challenges that are introduced by complex systems.
It is beyond the scope of this guideline to address the full set of STAMP
methods and guidance available in Reference 19. However, the STPA method
brings new and powerful insights to the specific topic of hazard analysis, which is
the subject of this guideline.

STPA, in a similar manner to FTA, starts with a focus on identified accidents or
losses. It then systematically uncovers hazardous control actions (including
failures) that can lead to the identified losses under normal, abnormal and faulted
operating conditions. It does not limit the analysis to consideration of failures in
the way that FTA and FMEA do. It also considers undesired behaviors that
don’t involve component failures. This is particularly important for complex
digital systems, because a significant percentage of mishaps involve undesired
behaviors that occur under unanticipated or untested operating conditions.

The following guidance is not intended to alter the STPA method described in
Reference 19. This guidance is adapted to the extent that it demonstrates the
usefulness of STPA in performing hazard analysis of digital I&C systems in
commercial nuclear power plants.

7.1 STPA Overview and Objectives

Per Reference 19:

The primary reason for developing STPA was to include the new causal
factors identified in STAMP that are not handled by the older
techniques [FMEA, FTA, HAZOP, and others]. More specifically,
the hazard analysis technique should include design errors, including
software flaws; component interaction accidents; cognitively complex
human decision-making errors; and social, organizational, and
management factors contributing to accidents. In short, the goal is to
identify accident scenarios that encompass the entire accident process,
not just the electromechanical components.

Key Terms in the STPA Method

The term “accident,” as it is used in the STPA method, is not necessarily
synonymous with the term “nuclear accident” that is used in the commercial
nuclear power industry. Reference 19 defines an accident as “an undesired and
unplanned event that results in a loss (including loss of human life or injury, property
damage, environment pollution, and so on).” For the purposes of applying STPA on
digital I&C systems in nuclear plants, an accident can be considered personal
injury or death; core damage; offsite releases exceeding limits; damaged or
degraded equipment; lost or reduced plant system availability; lost or reduced
generation; or any other loss that is of concern to the plant owner/operator.

The term “hazard,” as it is used in the STPA method, is by definition “a system
state or set of conditions that, together with a particular set of worst case environment
conditions, will lead to an accident (loss).” Users of this guideline should be
cautioned that it is easy to confuse system states, conditions, and events when
identifying hazards. Reference 19 further explains (emphasis added):

This definition [of hazard] requires some explanation. First, hazards may be
defined in terms of conditions, as here, or in terms of events as
long as one of these choices is used consistently. While there have been
arguments about whether hazards are events or conditions, the
distinction is irrelevant and either can be used.

The notion of “worst case environment conditions” also needs some explanation. As
used in this guideline, it is meant to convey the idea that STPA is meant to
consider the states of the environment around the system in their abnormal
conditions. This was the fundamental approach proposed in the EPRI “ACES
Report” (Reference 14), where the digital system design functions were intended
to be analyzed in the context of abnormal conditions and events (ACES). The
STPA method is a natural extension of this idea, and is more systematic than the
methods proposed in the ACES Report. The key in the STPA method is to
avoid assuming that environmental conditions around a digital system are in their
normal states; it leads the analyst down the path of considering abnormal
conditions by using guide words that force consideration of control actions in
various contexts (using process model variables and their various states).

The hazard convention used in this guideline is the same convention used in the
definition of hazard provided in Reference 19 (i.e., hazards are system states or
conditions, not events).

The term “controller,” as it is used in the STPA method, may be a human or a
machine. In the context of applying the STPA method on digital I&C systems, a
controller can be considered a human operator, a maintenance technician, an
engineer, or a manager (including project managers); or a digital component or
system that performs an automatic or semi-automatic control or protective
function.

The term “control action,” as it is used in the STPA method, describes the effect
that a controller (human, machine, or both) has on an actuator and ultimately the
controlled process. Control actions can be safe, or unsafe, and may depend on
their context. In one context, a control action can be considered safe, while in
another context it may be unsafe. For example, an unplanned, automatic main
turbine trip may be considered safe in the context of protecting the main
turbine/generator set, but it may also be considered unsafe in the context of
nuclear safety because it is an initiating event that can challenge safety systems.

Therefore, the term “safety,” as it is used in the STPA method, is not necessarily
synonymous with the term “nuclear safety” that is used in the commercial nuclear
power industry. Reference 19 defines safety as “freedom from accidents (loss
events),” and it is the definition used here.

A clear understanding and consistent use of these terms is necessary for successfully
applying the STPA method. This would be a good time to review the simple
“running with scissors” example in Section 1.5 and then reread the definitions above.

Causal Factors

The “causal factors identified in STAMP,” mentioned in the leading paragraph, are
built around the concept of a “control structure,” illustrated in Figure 7-1.

Per Reference 19, inadequate control is characterized as follows [emphasis added]:

… if there is an accident, one or more of the following must have occurred:

1. Safety constraints were not enforced by the controller.

a. The control actions necessary to enforce the associated safety constraint at each
level of the sociotechnical control structure for the system were not provided.

b. The necessary control actions were provided but at the wrong time (too early or
too late) or stopped too soon.

c. Unsafe control actions were provided that caused a violation of the safety
constraints.

2. Appropriate control actions were provided but not followed…”

…the causal factors in accidents can be divided into three general categories:
(1) the controller operation, (2) the behavior of actuators and controlled processes,
and (3) communication and coordination among controllers and decision makers.

Although these ideas are introduced at an abstract level, they can be applied
systematically on complex systems, and decomposed to any level of detail that
serves the objectives of the analysis. The worked examples provided in Section
7.3 demonstrate how various levels of abstraction can be systematically applied on
real systems.

Figure 7-1
A Classification of Control Flaws Leading to Hazards
(Control loop diagram: a controller containing a process model and control algorithm commands an actuator acting on the controlled process, with feedback returned through a sensor; a second controller may issue conflicting control actions. Each element is annotated with its potential control flaws, including inappropriate, ineffective, or missing control actions; inadequate control algorithm; inconsistent, incomplete, or incorrect process model; wrong or missing control input or external information; inadequate or delayed actuator operation; inadequate sensor operation, measurement inaccuracies, and feedback delays; inadequate or missing feedback; component failures or changes over time; missing or wrong process inputs; unidentified or out-of-range disturbances; and process outputs that contribute to system hazards.)
Credit: Dr. Nancy G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety,
published by The MIT Press

Control Flaws

The idea of causal factors is transformed into a set of control flaws that can be
superimposed on the control structure. The control flaws are shown in Figure 7-1
in red text. MIT researchers have not yet found any evidence from their
investigations of accidents (losses) or complex systems that the set of control
flaws illustrated in Figure 7-1 is incomplete. Each of the control flaws (e.g.,
delayed actuator operation, measurement inaccuracies, inadequate or missing
feedback, inadequate control algorithm, etc.) are fully described in Reference 19.
Control Flaws vs. Causal Factors

The STPA methodology described in Reference 19 uses the terms “Control Flaws”
and “Causal Factors” interchangeably. At first glance, if a Causal Factor is
the same thing as a Control Flaw, as shown in Figure 7-1, then it appears that
Causal Factors don’t include programmatic, engineering, human performance,
and organizational factors that sometimes form the Root Cause or Contributing
Causes of an event in the nuclear power industry. However, this is not really the
case, because the basic Control Structure in Figure 7-1, upon which Control
Flaws are overlaid, can also be used to model the humans and their organizations
that may introduce flaws in the broader context of interactions between social
and technical systems.

It is important to remember that the control structure illustrated in Figure 7-1 can
include humans, machines, or both, thus making it a useful model for
analyzing a wide range of systems, from functionally simple plant control systems
to complex technical and organizational systems that interact with each other.
Therefore, control flaws can include errors in decision making or design errors
introduced from an incorrect or incomplete process model.

Reference 19 describes several examples that apply the STPA method on complex
sociotechnical systems, which is a topic beyond the scope of this
guideline.

A Hierarchical View

As described in Section 1.4, hazards can lead to losses, and the purpose of a
hazard analysis is to identify hazards so they can be eliminated, reduced or
mitigated. STPA extends this hierarchy to include control flaws (causal factors),
with the underlying principle that if the analyst can find and eliminate control
flaws, then resulting potential hazards may be eliminated, and accidents
prevented.

This is a powerful idea, but applying it effectively depends on multiple factors,
including the breadth and depth of the analysis, and the extent to which the
results are applied during system design and operation. Good planning, as
described in Section 1, goes a long way in assuring effective application of the
STPA method.
Figure 7-2
Accidents, Hazards, Unsafe Control Actions & Control Flaws
(Hierarchy, from top to bottom:)
Accident(s) or Loss(es): An undesired and unplanned event that results in a loss (including loss of human life or injury, property damage, environment pollution, and so on). (Reference 19)
Hazard(s): A system state or set of conditions that, together with a particular set of worst-case environment conditions, will lead to an accident (loss). (Reference 19)
Unsafe Control Action: A controller command that violates a safety constraint.
Control Flaw: See Figure 7-1.
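A minimal sketch of how this hierarchy can be kept traceable during an analysis is shown below; the record structure is ours, and the example entries loosely paraphrase the circulating water discussions elsewhere in this guideline:

```python
# Illustrative sketch of traceability records for the hierarchy in Figure 7-2;
# class names and example entries are assumptions, not terms beyond those defined above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ControlFlaw:
    description: str                      # one of the Figure 7-1 flaw classes, instantiated

@dataclass
class UnsafeControlAction:
    description: str                      # controller command that violates a safety constraint
    causal_flaws: List[ControlFlaw] = field(default_factory=list)

@dataclass
class Hazard:
    description: str                      # system state/conditions that can lead to a loss
    unsafe_control_actions: List[UnsafeControlAction] = field(default_factory=list)

@dataclass
class Accident:
    description: str                      # the loss of concern to the owner/operator
    hazards: List[Hazard] = field(default_factory=list)

loss = Accident(
    description="Plant trip and lost generation",
    hazards=[Hazard(
        description="Insufficient circulating water flow at full power",
        unsafe_control_actions=[UnsafeControlAction(
            description="Spurious close command to a pump discharge isolation valve",
            causal_flaws=[ControlFlaw("Inadequate or missing feedback to the valve controller")],
        )],
    )],
)
```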

The Role of Context

The term context as it is used in the STPA method means the system or
environmental state, or combination of states, in which a control action is
provided. Different contexts can lead to different conclusions regarding hazards.
For example, a control action that increases pump speed can be beneficial in a
context where system flow is too low, but hazardous in a context where pump speed
is already high and approaching an equipment limit.

7.2 STPA Procedure

Prerequisite

The results of a Function Analysis, as described in Section 3.6, are a useful input
to the STPA analysis because they provide a well-organized set of functions that
can feed into the steps of the STPA procedure that identify the control structure
and process models.

Reference 19 provides a high-level description of the STPA process, organized in
two basic steps:

Basic Step 1: Identify Unsafe Control Actions

An “Unsafe Control Action” is, by definition, a command from a controller that
violates a safety constraint. Identifying an Unsafe Control Action (UCA) first
requires identifying a controller and the control actions it is expected to provide
to an actuator and the controlled process. If a control action is unsafe in the sense
that it violates a safety constraint, then it is also a control flaw.

 7-6 
Figure 7-1 shows one controller, and one control action (the down arrow
between the controller and the actuator). Therefore, there is one control action
that would be evaluated further under the STPA method, which classifies control
action behaviors as follows:
 Control Action is Provided
 Control Action is Not Provided
 Control Action is Provided Too Early
 Control Action is Provided Too Late
 Control Action is Stopped Too Soon

Notice that these behavior phrases (provided, not provided, too early, too late,
stopped too soon) bear some resemblance to the guide words used in the HAZOP method.

The STPA method postulates these control action behaviors in various contexts to
determine if they are hazardous. If a control action is hazardous, then it is an
Unsafe Control Action.
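
To make this enumeration explicit, the short Python sketch below (illustrative only;
the list, function, and example control action names are assumptions, not part of the
STPA literature) pairs each control action with the five postulated behaviors so that
none is overlooked.

    CA_BEHAVIORS = [
        "Provided",
        "Not Provided",
        "Provided Too Early",
        "Provided Too Late",
        "Stopped Too Soon",
    ]

    def postulate_behaviors(control_actions):
        """Yield (control action, postulated behavior) pairs for analyst review."""
        for ca in control_actions:
            for behavior in CA_BEHAVIORS:
                yield ca, behavior

    # Example usage with a single, hypothetical control action:
    for ca, behavior in postulate_behaviors(["Open valve command"]):
        print(f"Evaluate: '{ca}' is {behavior}")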

The focus on control actions and contexts is particularly useful when analyzing
digital I&C systems for the presence of hazards that may be introduced by
software.

Figure 7-1 shows several more control flaws in other parts of the control loop
that could lead to hazards, which in turn could lead to losses. Identifying the
presence of these other control flaws is the object of STPA Basic Step 2.

Basic Step 2: Identify Causes of Unsafe Control Actions

Basic Step 2 requires an analysis of the potential causes of the Unsafe Control
Actions (UCA) identified in Basic Step 1. In essence, for each UCA, the analyst
will “go around the loop” in the control structure and consider if any of the
potential control flaws in other parts of the loop could cause the controller to
“command” the UCA. It is important to remember that a UCA can be active in
the sense that it is a control action that may be provided (or provided too early)
and lead to a hazard, or passive in the sense that it is not provided (or provided
too late or stopped too soon) and lead to a hazard.

One of the strengths of STPA is that it limits the evaluation to only the control
flaws that can lead to hazards.

Expanded STPA Procedure

This guideline expands the basic STPA procedure described in Reference 19 into
more discrete steps, as follows:

STPA Step 1: Identify System Boundary

 7-7 
The analysis begins with determining a system boundary, which requires
identification of the plant system (or systems), and their interfaces, that can affect
or be affected by an activity.

For a digital upgrade activity, the system boundary would encompass the digital
equipment and the plant systems or components that can influence or be
influenced by the digital equipment. The output of the Function Analysis
method described in Section 3.6 should be used as an input to the STPA analysis.

One method for identifying the system boundary on a digital upgrade project
would be to:
1. Identify the digital equipment
2. Identify the process elements that the digital equipment is expected to
protect or control
3. Identify the equipment that interfaces between the digital equipment and the
process elements
4. Identify remaining digital equipment interfaces, and the equipment that
might be connected to them
5. Identify any other equipment or processes that can affect the environment
around the equipment and processes identified in Steps 1 through 4.

The output from Step 1 is a drawing that represents the equipment, process
elements, and their interfaces and interconnection. The drawing should include
physical and functional representations. Appendix C provides a generic list of
equipment types and process elements, as well as physical and functional
representations that can be used in a system drawing.

Note that STPA results may be sensitive to where the system boundary is placed.
One of the strengths of this method is its ability to identify interactions between
components that would otherwise not appear to interact, such as components
that appear to be physically and functionally independent. Another strength of
STPA is its ability to identify adverse component interactions, even if none of the
components have failed or malfunctioned. Therefore, care should be taken when
identifying the boundary to avoid missing components and interfaces that may
interact with the system.

STPA Step 2: Identify Accidents (Losses)

In order to identify losses to be considered in the STPA method, it is important
to define losses. Losses can be selected from the following list:
 Personal injury or death
 Accidents or Malfunctions considered in the plant safety analysis (FSAR)
and Probabilistic Risk Assessment (PRA)
 Damaged or degraded equipment
 Lost or reduced plant system availability

 7-8 
 Lost or reduced generation
 Any other loss that is of concern to the owner/operator

The output from this step is a short and simple list of losses. For more detailed
guidance, see Reference 19.

STPA Step 3: Identify System-Level Hazards

Using the results from Step 2, list the possible system-level hazards that could
lead to each loss. The system-level hazards are a function of the controlled
process elements and their ability to cause a loss. As described in Section 1.4, it is
important to use a clear definition of hazard, and apply it consistently. As in Step
2, the list of hazards should be short and simple.

One method for preparing the list is to assemble a team of individuals
knowledgeable in the controlled process elements, interfacing equipment, and the
system environment, and discuss the ability of these elements to cause one or
more of the losses listed in Step 2.

Step 3 is one form of a Preliminary Hazards Analysis (PHA), which is described
in Section 3.7.

STPA Step 4: Draw the Control Structure

Using the results of the Function Analysis described in Section 3.6, the next step
is to draw the control structure. Start with a basic structure, consistent with the
control structure illustrated in Figure 7-3.

[Figure: a Controller (containing a Process Model) sends Control Actions to, and
receives Feedback Signals from, the Controlled Process.]

Figure 7-3
Basic Control Structure

 7-9 
[Figure: a Human Controller, informed by Training & Procedures and Environmental
Conditions and holding a Model of the Automation and a Model of the Controlled
Process, generates control actions through a Human-System Interface. An Automated
Controller, with a Control Algorithm and its own Model of the Controlled Process,
drives Actuators and receives Sensor feedback from the Controlled Process (with its
process inputs and outputs).]

Figure 7-4
Basic Control Structure with Human Operator
Credit: Dr. Nancy G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety,
published by The MIT Press

The control structure should have at least one controller and a representation of
the control actions and feedback signals between the controller and the controlled
process. Figure 7-3 meets the minimum criteria for a control structure, and may
be adequate for a variety of situations.

Figure 7-4 provides a more resolved control structure that separates the human
and automated controllers, and shows how control actions are applied directly to
actuators by the automated controller (solid down arrows) and indirectly as
intended by the human operator (dashed down arrow). In both figures, a Process
Model is represented by a box in each controller. Creating the Process Model
details comes in Step 5.

A more detailed or resolved control structure can be prepared to break down the
basic control structure into more discrete components if desired. However, it is
useful to complete the STPA analysis at the basic system-level before proceeding
with a more detailed analysis because it can provide significant insights before
expending more effort at a detailed level.
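
For readers who capture the control structure in software, the following Python sketch
shows one possible, minimal data representation of a controller with its control
actions, feedback signals, and a placeholder process model. The class and field names
are illustrative assumptions, not a prescribed format.

    from dataclasses import dataclass, field

    @dataclass
    class Controller:
        """One node in the control structure (human or automated)."""
        name: str
        control_actions: list      # down arrows, e.g. "Increase valve position"
        feedback_signals: list     # up arrows, e.g. "Turbine speed"
        process_model: dict = field(default_factory=dict)  # filled in during Step 5

    # The minimum structure of Figure 7-3: one controller acting on one controlled process.
    flow_controller = Controller(
        name="Flow Control System",
        control_actions=["Increase valve position", "Decrease valve position"],
        feedback_signals=["System flow", "Turbine speed", "Valve position", "System enable"],
    )
    print(flow_controller.name, flow_controller.control_actions)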

 7-10 
STPA Step 5: Create Process Model(s)

Each controller in the control structure (including humans if they are
represented) has its own process model, which is simply expressed by first listing
the input signals and other inputs that are used by the controller to determine the
control actions, then listing the possible states of each signal or input. Note that
inputs may be sensor signals or information (e.g., pressure, temperature, etc.), or
plant conditions derived from a variety of sources (e.g., Mode 1, LOCA, etc.).

A human controller, perhaps as shown in Figure 7-4, will have some
understanding of the controlled process, along with sensor feedback and other
indications of operating conditions, and using control interfaces, will be able to
take action when needed. The key is in the Human System Interface (HSI),
which is why a well-designed HSI provides indications of process values, output
values, and controller states (e.g., on, off, auto, manual, normal, failed, etc.), as
well as the ability to influence the controller states (increase or decrease setpoints,
acknowledge and silence alarms, change modes, increase or decrease outputs,
query event logs and databases, etc.).

Mismatched or conflicting process models arise when one of the process models
is incorrect or incomplete, which amounts to a control flaw. Step 7 is designed to
identify this flaw (among others).

Process model variables (PMVs) are essentially the up arrows and sideways
arrows in the control structure created in Step 4. Step 6 describes the PMVs and
their states in greater detail.

The output of this step is a table that lists Process Model Variables (PMV) and
their possible States. PMVs are readily identified from the control structure as the
feedback signals and other inputs to a given controller. Possible PMV states
include open, closed, on, off, increasing, decreasing, as-needed, or other
characteristics that simply describe PMV behaviors.

The following process model format is suggested:

Table 7-1
Suggested Process Model Format

Controller Name
  Process Model Variables                       PMV States
  PMV1 (Controller Feedback Signal or Input)    State 1, State 2, ... State n
  PMV2 (Controller Feedback Signal or Input)    State 1, State 2, ... State n
  PMVn (Controller Feedback Signal or Input)    State 1, State 2, ... State n
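
A process model in the Table 7-1 format can also be captured as a simple mapping from
PMV names to their possible states. The Python sketch below is a minimal illustration
that uses the Flow Control System PMVs and states from Example 7-1 (Figure 7-9); the
variable names are assumptions.

    # Process model in the Table 7-1 format: PMV name -> possible states.
    # Values are taken from the Flow Control System of Example 7-1 (Figure 7-9).
    flow_control_process_model = {
        "System Flow":    ["Too Low", "At Desired Flow", "Too High"],
        "Turbine Speed":  ["Too Low", "At Desired Speed", "Too High"],
        "System Enable":  ["Yes", "No"],
        "Valve Position": ["Too Closed", "At Desired Position", "Too Open"],
    }

    for pmv, states in flow_control_process_model.items():
        print(f"{pmv}: {', '.join(states)}")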

 7-11 
STPA Step 6: Identify Hazardous Control Actions

Step 6, which culminates in a list of Hazardous Control Actions, is broken down
into sub-steps as follows:

a. Identify Control Actions (CA)

Examine the control structure, and for each controller, identify the control
actions (down arrows) and their basic characteristics in terms of their effects
or influences on the next controller or the controlled process that they act upon.

Figure 7-5 illustrates some key terms used in the STPA process:
 When a control action from a given controller acts upon another
controller, it is expressed by the manner in which its action is expected to
influence the state of one or more of the Process Model Variables in that
controller (e.g., increase desired flow, decrease desired flow, etc.).
 When a control action from a given controller acts upon the controlled
process, it is expressed by the manner in which its action is expected to
influence the state of one or more of the controlled process elements (e.g.,
increase valve position, decrease valve position, start pump, stop pump,
etc.).

Figure 7-5
Control Actions, Process Model Variables (PMVs) and PMV States

[Figure: a Controller (with its Process Model) issues Control Actions (e.g., Increase,
Decrease, Open, Close, Hold, Switch, Others) to the Controlled Process and receives
Feedback Signals on Process Model Variables (e.g., Pressure, Flow, Temperature,
Voltage, Current, Others), together with Other Inputs or Conditions (e.g., Plant
Condition, Plant Mode, Others). Each PMV takes on PMV States (e.g., Normal, Accident,
Increasing, Decreasing, As Needed, On, Off, Mode 1, Automatic, Manual, Others).]

The result is a list of CAs for each controller and how they can influence
the state of a process model variable in the next controller or controlled
process element. At this point, it is helpful to begin building a worksheet
or table that combines the CAs from each controller with the Process
Model table for the next controller or controlled process elements that
they influence or act upon. Table 7-2 provides a suggested format for a
worksheet or table:

 7-12 
Table 7-2
Combining Control Actions with Affected Process Models

Controller N          Next Controller or Controlled Process Element
                      PMVs                                           PMV States
CA1 (Influence 1)     PMV1 (Controller Feedback Signal or Input)     State 1 ... State n
                      PMVn (Controller Feedback Signal or Input)     State 1 ... State n
CAn (Influence n)     PMV1 (Controller Feedback Signal or Input)     State 1 ... State n
                      PMVn (Controller Feedback Signal or Input)     State 1 ... State n
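
The pairing suggested by Table 7-2 can likewise be expressed as a mapping from each
control action to the process model of the element it acts upon. The Python sketch
below is a hypothetical illustration using the CA3/CA4 governor valve example that
appears later in this section; the names are assumptions.

    # Hypothetical pairing in the spirit of Table 7-2: each control action is listed
    # with the process model of the next controller or controlled process element it
    # acts upon (here, the governor valve position).
    governor_valve_process_model = {
        "Valve Position": ["Too Closed", "At Desired Position", "Too Open"],
    }

    control_action_targets = {
        "CA3: Increase valve position": governor_valve_process_model,
        "CA4: Decrease valve position": governor_valve_process_model,
    }

    for ca, model in control_action_targets.items():
        for pmv, states in model.items():
            print(f"{ca} -> {pmv} ({', '.join(states)})")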

b. Identify Hazardous Control Actions

The key to this step is to determine the contexts in which each control action
can be hazardous. Contexts are a function of process model variables and
their states. A context can be simple, comprising one PMV with two possible
states (e.g., valve is open, or valve is closed), or it can be more complex,
comprising two or more PMVs, each with two or more states (e.g., valve is
open and turbine speed is increasing and tank level is decreasing).

Using the results from step (a), postulate the following Behaviors for each
Control Action, and determine if it is hazardous in each context:
1. Control Action Is Provided
2. Control Action Is Not Provided
3. Control Action Is Provided Too Early
4. Control Action Is Provided Too Late
5. Control Action Is Stopped Too Soon

Thus, the structure of a hazardous control action is expressed in terms of its
source, its behavior, the control action, and its context, as shown in Figure 7-6:

Controller 1    Provides     Close Valve command     when Tank Level is Decreasing
(Source)        (Behavior)   (Control Action)        (Context)

Figure 7-6
Structure of a Hazardous Control Action
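
A hazardous control action statement can be assembled mechanically from these four
parts. The short sketch below reproduces the Figure 7-6 example; the helper function
name is an illustrative assumption.

    def hazardous_control_action(source, behavior, control_action, context):
        """Compose the four-part statement shown in Figure 7-6."""
        return f"{source} {behavior} {control_action} when {context}"

    print(hazardous_control_action(
        "Controller 1", "Provides", "Close Valve command", "Tank Level is Decreasing"))
    # Controller 1 Provides Close Valve command when Tank Level is Decreasing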

 7-13 
It is helpful to organize a team of knowledgeable individuals such as
system engineers, operators, and design engineers, and hold one or more
team meetings to consider each context and determine if it is or could be
hazardous (hazards having been identified in Step 3).

The result of step (b) is an expanded version of the table created in step (a).
A sample of a suggested worksheet format is provided in Table 7-3. In this
sample, the worksheet would be produced five times for each CA; once for
each of the five postulated CA behaviors. This example shows five hazards
and three PMVs; the first two each have two possible states, the third
PMV has three possible states. Note that the STPA worksheet can have
any number of PMVs and PMV states.
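
Because a context is simply a combination of PMV states, the worksheet rows can be
generated mechanically before the team meeting. The Python sketch below is a minimal
illustration, assuming the PMV counts described for the Table 7-3 example (two PMVs
with two states each and one with three states, giving 2 × 2 × 3 = 12 contexts per
postulated behavior); the dictionary keys and field names are assumptions.

    import itertools

    process_model = {
        "PMV1": ["State 1", "State 2"],
        "PMV2": ["State 1", "State 2"],
        "PMV3": ["State 1", "State 2", "State 3"],
    }

    BEHAVIORS = ["Provided", "Not Provided", "Provided Too Early",
                 "Provided Too Late", "Stopped Too Soon"]

    def worksheet_rows(control_action, pmvs, behaviors=BEHAVIORS):
        """Yield one worksheet row per (behavior, context) combination."""
        names = list(pmvs)
        for behavior in behaviors:
            for states in itertools.product(*(pmvs[name] for name in names)):
                yield {
                    "control_action": control_action,
                    "behavior": behavior,
                    "context": dict(zip(names, states)),
                    "is_situation_already_hazardous": None,  # filled in by the team
                    "is_ca_behavior_hazardous": None,        # filled in by the team
                }

    rows = list(worksheet_rows("CA1", process_model))
    print(len(rows))  # 5 behaviors x (2 x 2 x 3) contexts = 60 rows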

Table 7-3
Sample STPA Worksheet

Worksheet header (one worksheet is produced for each postulated CA behavior):
  Controller:           Controller Name
  Control Action:       CA# Control Action Command
  Postulated Behavior:  One of the Five CA Behaviors (Is CA Behavior Hazardous?)
  Hazard key:           H1 Hazard 1; H2 Hazard 2; H3 Hazard 3; H4 Hazard 4; H5 Hazard 5

Worksheet columns:
  Row; PMV1 (Name); PMV2 (Name); PMV3 (Name); Is Situation Already Hazardous?;
  Is CA Behavior Hazardous?; Related Hazard; Comments (Situational Context)

Worksheet body: rows 1 through 36 enumerate the combinations of PMV states
(State 1, State 2, State 3, etc.); the analysis result columns are left blank in this
sample, to be completed by the analysis team.

 7-14 
A few observations can be made about Table 7-3:
 Each row denotes a specific context, which is a requirement for
determining if a postulated control action behavior is hazardous.
 Sometimes a context (i.e., combination of PMV states) is inherently
hazardous, which is the purpose of the column labeled “Is situation
already hazardous?” The control action may mitigate the hazard, or make
it worse, or have no effect at all; this should be noted in the comments
column.
 When there are more PMVs or more possible PMV states, the number
of contexts to be evaluated can grow significantly (the number of contexts is
the product of the number of possible states of each PMV), thus requiring
more effort.

The analyst or the team performing the analysis should consider each context
and attempt to answer the question “Is CA Behavior Hazardous” as “Yes” or
“No.” Sometimes it is difficult to answer definitively because the context may
have conflicting PMV constraints (e.g., a CA that ultimately increases the
speed of a pump is beneficial if system flow is too low, but hazardous if, at
the same time, the pump speed is too high and approaching equipment
limits). In these cases, it is acceptable to put “Maybe” or some other notation
that indicates a question for further analysis in Step 7.

Caution

It is easy to fall into the trap of thinking that some contexts are absurd or can’t
exist. For example, if a process model for a steam turbine-driven pump in a fluid
system includes turbine speed as one PMV and system flow as another PMV,
then one might be tempted to dismiss the contextual combination of “turbine
speed too high” and “system flow too low.” However, these types of strange
behaviors might occur due to problems or malfunctions in the controlled process,
such as debris in the line or equipment degradation (e.g., a damaged pump
impeller), and it may be that the postulated CA for such a context is precisely the
right thing to do.

It is important to not throw out any contexts, no matter how strange, because
experience shows that strange behaviors are the ones we least expect and fail to account
for in system design and operation, yet they still manifest themselves and lead to
accidents or losses.

Results

When the STPA worksheet is completed, the results should be reduced to a list
of Hazardous Control Actions by transposing each row of the worksheet that
indicates when a postulated Control Action is hazardous. The format of each
Hazardous Control Action should follow the structure presented in Figure 7-6.
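
If the worksheet is kept in a machine-readable form such as the one sketched earlier,
this reduction amounts to filtering the rows marked hazardous. The Python sketch below
assumes a hypothetical row format and is for illustration only.

    def hazardous_control_actions(completed_rows):
        """Keep only the worksheet rows the team marked as hazardous."""
        return [row for row in completed_rows
                if row.get("is_ca_behavior_hazardous") == "Yes"]

    # Example usage with two hypothetical completed rows:
    completed = [
        {"control_action": "CA3", "behavior": "Provided",
         "context": {"Turbine Speed": "Too High", "System Enable": "Yes"},
         "is_ca_behavior_hazardous": "Yes"},
        {"control_action": "CA3", "behavior": "Provided",
         "context": {"Turbine Speed": "Too Low", "System Enable": "Yes"},
         "is_ca_behavior_hazardous": "No"},
    ]
    print(hazardous_control_actions(completed))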

As described at the beginning of this section, Step 6 delivers the result of Basic
Step 1 of the STPA method.

 7-15 
STPA Step 7: Identify Potential Causes of Hazardous Control Actions

The analysis team that performed Step 6 should remain intact, and perform this
step by considering each of the control flaws presented in Figure 7-1 in the
context of each Hazardous Control Action identified in Step 6.
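
If the causal analysis is tracked in software, the pairing of hazardous control actions
with candidate control flaws can be generated mechanically so that nothing is dismissed
prematurely. The sketch below uses flaw category names paraphrased from the headings of
Table 7-8; the full set of control flaws is the one shown in Figure 7-1 and described in
Reference 19, and the function name is an illustrative assumption.

    CONTROL_FLAW_CATEGORIES = [
        "Flawed control algorithm",
        "Flawed process model",
        "Flawed feedback interpretation",
        "Inadequate goal",
        "Inadequate feedback",
        "Inadequate execution of control action",
        "Inadequate process inputs / physical system",
    ]

    def candidate_causes(hazardous_control_actions):
        """Yield every (HCA, flaw category) pair for the team to evaluate."""
        for hca in hazardous_control_actions:
            for flaw in CONTROL_FLAW_CATEGORIES:
                yield hca, flaw

    # Example usage with one hypothetical hazardous control action:
    for hca, flaw in candidate_causes(["HCA1: CA5 provided when turbine speed is too high"]):
        print(f"Consider whether '{flaw}' could cause: {hca}")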

The team should be careful to not discount or dismiss any potential causes, even
if the team is aware of adequate defensive measures that would reasonably reduce
the likelihood of such causes of the hazardous control action to an acceptable
level. The purpose of the STPA method is to systematically identify the potential
causes of hazardous control actions first, without prejudice, so that later steps in
the system design lifecycle can:
 eliminate, reduce, or mitigate such hazards to an acceptable level, or
 confirm that the proposed (or existing) design and administrative controls are
adequate as-is.

As described at the beginning of this section, Step 7 delivers the result of Basic
Step 2 of the STPA method.

STPA Step 8: Apply the Results

See Section 7.3 for guidance on applying STPA results, which should be used to
identify design changes, administrative controls, or a combination of both in
order to eliminate, reduce, or mitigate such hazards to an acceptable level.

7.3 Applying the STPA Results

By making a positive determination of reasonable causes of a potentially
hazardous control action, the STPA results can be used to derive or modify
system requirements in order to prevent or mitigate some hazards; and to
leverage existing programs and processes that can prevent or mitigate other
hazards.

Application Development

Each of the potential causes of each hazardous control action should be evaluated
by a team of knowledgeable individuals responsible for system design, test,
operations, and maintenance. For each potential cause, the team should decide if
it can be eliminated, prevented, or mitigated to a reasonable extent through one
or more defensive measures that are realized through design requirements and/or
plant programs and processes. For guidance on applying defensive measures in
digital I&C systems, see References 20 and 21.

Ideally, this evaluation is performed early, at the conceptual design phase, so that
safety-driven requirements are inserted before detailed design begins. The results
of the STPA analysis should be reviewed again as the detailed design emerges, to
determine if any of the design details substantially altered the control structure
used in the analysis. Of particular concern would be new interfaces or functions
that were not accounted for in the STPA analysis.
 7-16 
If the STPA method is applied late in a project for some reason, the
owner/operator should be prepared to stop the project and rework the design if
the STPA results clearly indicate potential hazards that are not effectively
eliminated, prevented, or mitigated to a reasonable extent.

Provide Input to Other Hazard Analysis Methods

Because STPA results are focused on hazardous control actions, the results can
be used to provide a more focused approach when other methods may be applied.

For example, the FMEA method requires a bottom-up analysis of all devices in a
component, or all components in a system, which can become very large, time
consuming, and costly if there is a large number of devices or components. If the
STPA method is applied first, then an FMEA can focus exclusively on the
devices or components that could cause a hazardous control action, and
determine the failure modes or failure mechanisms that could lead to such
hazardous actions.

Provide Input to HFE Evaluations

If unsafe control actions by a human controller can be readily modeled, then the
results should be evaluated within the context of a Human Factors Engineering
(HFE) Evaluation to determine if tasks, training and other elements of human
performance are effectively applied so as to eliminate, prevent, or mitigate such
hazards to a reasonable extent.

Licensing

STPA results should be useful when considering the likelihood of malfunctions
and accidents for 10CFR50.59 evaluations. For guidance on licensing of digital
upgrades, see Reference 4.

Protection system upgrades subject to 10CFR50.55a(h) require a Software Safety
Analysis (SSA). Because SSA is a hazards-driven process, the STPA method
may be suitable for meeting this requirement, as long as the SSA steps described
in the licensee or applicant Software Safety Plan are fulfilled.

The STPA method may be useful when a licensing activity requires a
demonstration that hazards have been properly identified and eliminated,
reduced or mitigated. However, the definition of “hazard” is very important when
adapting the STPA method to a licensing activity (emphasis added):

IEEE Definition of Hazard: A condition that is a prerequisite to an
accident. Hazards include external events as well as conditions internal
to computer hardware or software. (Reference 9)

STPA Definition of Hazard: A system state or set of conditions that,
together with a particular set of worst-case environment conditions, will
lead to an accident (loss). (Reference 19)

 7-17 
The IEEE definition, accepted by the NRC, considers internal and external
events and conditions, while the STPA definition considers system-level and
environmental conditions. This does not mean the STPA method is not suitable
for a licensing activity; in fact it is quite useful because it considers internal
conditions in the form of causes of Hazardous Control Actions (STPA Basic Step
2). In other words, identification of Hazardous Control Actions, and their causes,
appears compatible with the IEEE definition of “Hazard” and therefore may be
suitable for licensing activities.

When properly applied, the STPA method should be well-suited to support a
licensing activity that requires a demonstration that hazards have been
systematically and properly identified and addressed.

7.4 STPA Examples


Example 7-1. System-Level HPCI-RCIC Turbine Controls STPA
STPA Step 1: Identify System Boundary
The hypothetical digital upgrade project examined in Example 4-1 is also examined
here, this time using the STPA method. The system boundary for this example is
essentially marked by the perimeter of Figure 7-7. A broader boundary could be
considered; for example, the boundary could be expanded to show the reactor, the
torus, the condensate storage tank, and the pipes and valves that connect these
elements to the steam and water lines shown in Figure 7-7. The same boundary is
used in both examples to facilitate comparison of hazard analysis methods.
STPA Step 2: Identify Accidents (Losses)
A meeting was held, attended by individuals knowledgeable in the areas of HPCI-
RCIC system functional and performance requirements and the STPA method. Using
the definition provided by Reference 19, the team was able to quickly identify the
following list of losses to be part of the analysis:
A1: People exposed to radioactivity
A2: Environment contaminated
A3: Equipment damage, up to and including core meltdown
(economic loss)
A4: Personnel injury or death
A5: Loss of generation
Notice the list is short and simple. By design, it bounds a wide range of loss
conditions. It is not necessary to specify the characteristics of these conditions.
STPA Step 3: Identify System-Level Hazards
Using the definition provided in Reference 19, the team identified the following list
of system-level hazards that could lead to one or more of the losses identified in Step
2 (this also demonstrates the results of a Preliminary Hazard Analysis (PHA) per
Section 3.7):
H1: Reactor exceeds limits
(e.g., fuel degradation, core damage, hydrogen production, overfill)

 7-18 
Example 7-1. System-Level HPCI-RCIC Turbine Controls STPA
(continued)
H2: Radioactive materials released
(e.g., pressure too high, coolant leak, air leak)
H3: Equipment operated beyond physical limits
(e.g., turbine overspeed)
H4: Inadvertent equipment operation during maintenance
(e.g., unexpected actuator or valve movement/pinch points, false or misleading
indications)
H5: Reactor shutdown
This list of system-level hazards is as short and simple as the list of losses. Table 7-4
provides a simple cross-reference that shows how any given hazard can lead to one
or more losses.
However, different contexts are implied in Table 7-4. As this example unfolds, it will
become apparent that some HPCI-RCIC flow control actions are hazardous when
they are provided when there is no demand from the plant protection system (one
context), and hazardous when they are not provided when there is a demand from
the plant protection system (another context). At this point in the analysis, the entries
in Table 7-4 implicitly reflect these different contexts.
STPA Step 4: Draw the Control Structure
As described in the procedure provided in Section 7.2, using the results of a
Function Analysis (per Section 3.6), this step starts with a system-level control
structure, shown in Figure 7-8. At this point, the control structure can be verified as
complete and correct within the system boundary identified in Step 1, and it can be
used to complete the analysis at the system level.
In Figure 7-8, the Control Actions that will be evaluated later are represented by the
down arrows. By inspection, the Control Actions include two possible actions by the
operator and two possible actions by the flow control system. The process model
variables (PMVs) are the up arrows in Figure 7-8, with the addition of the plant
conditions as a sideways arrow, for a total of five PMVs. The control structure used
in this example is relatively simple. In practice it can be much more complicated,
depending on the scope and boundaries of the problem.
STPA Step 5: Create Process Model
At the system level shown in Figure 7-8, there are two basic “controllers;” one a
human operator and the other represented as a Flow Control System.
The controllers and their process models are represented in Figure 7-9. Notice this
figure is just a variation of the control structure created in Step 4 in order to make
room for the process models. This variation of the control structure shows control
actions down the left side, a truncated view of the controlled process at the bottom,
and feedback signals and other inputs up the right side. The process models are
shown in the tables located inside each controller box.
At this point in the STPA method, the process models are captured in tables or
spreadsheets and carried forward to the next steps. When working with bigger
tables and spreadsheets in later steps, it is helpful to refer to Figure 7-9 because it
readily shows the relationships between the control actions, the process model
variables, and the process model states.

 7-19 
Example 7-1. System-Level HPCI-RCIC Turbine Controls STPA
(continued)
STPA Step 6: Identify Hazardous Control Actions
In this example, it is assumed that administrative procedures are in place that
require the operator to leave the flow indicating controllers in automatic mode, at a
fixed flow setpoint, at all times. Of course in real life, procedures can be wrong and
operators can make mistakes, so the application of this method on an actual project
should avoid such assumptions. They are used here only to reduce the number of
system contexts to be analyzed, for the sake of brevity.
In this example control action CA3, shown in Figure 7-9, is analyzed against five
process model variables and their states, as shown in Table 7-5.
For brevity, this example is limited to the analysis of CA3 as Providing its control
action and Not Providing its control action in the contexts of the five PMVs. A full
analysis would postulate each of the following behaviors:
 Provided
 Not Provided
 Too Early
 Too Late
 Stopped Too Soon
The intermediate results of the “CA3 is Provided” analysis are shown in Table 7-6,
where the following observations can be made:
 Almost all of the PMV combinations, or contexts, are already hazardous. For
these situations, the question becomes “does the control action make the hazard
worse, or does it mitigate the hazard, or does it make any difference?” In some
cases, the answer is “Maybe” because the control action would increase system
flow, which is the correct response when system flow is too low in terms of
reactor limits, but at the same time increasing the valve position might worsen
the equipment damage hazard.
 Half of the contexts result in “No Response” when there is an accident and there
is no system enable signal.
 The bottom two rows are reduced, for brevity, to the case where there is not an
accident, the valve position doesn’t matter, the turbine speed is too high, and
the system flow doesn’t matter.
− If a system enable signal is received under these conditions, then increasing
the governor valve position is hazardous because it would worsen the effect
of a spurious actuation, thus worsening the effects of unwanted system flow,
possibly causing the reactor to reach a limit (e.g., power transient or
overfill).
− If a system enable is not received under these conditions, then increasing
the governor valve position might be hazardous, causing an unwanted
turbine speed transient that could result in an equipment damage hazard,
perhaps if there is a leaky steam admission valve. In this case, it is assumed
that downstream process valves remain closed, thus eliminating any hazard
to the reactor.

 7-20 
Example 7-1. System-Level HPCI-RCIC Turbine Controls STPA
(continued)
In this example, the intermediate results of the “CA3 is Not Provided” analysis are
omitted for brevity.
The final results of Step 6 (Table 7-6) are reduced into the list of Hazardous Control
Actions shown in Table 7-7. Rows in Table 7-6 that don’t show hazardous control
actions are not included in Table 7-7. Additionally, rows from Table 7-6 that have
identical hazardous control actions for multiple combinations of process model
variable states have been consolidated into single rows. The notes that accompany
this Table provide some insights as to why these actions are hazardous.
STPA Step 7: Identify Potential Causes of Hazardous Control Actions
For the purposes of this example, Hazardous Control Action No. 7 was selected for
further evaluation in Step 7. The analysis team performed this step by considering
each of the control flaws presented in Figure 7-1 in the context of Hazardous
Control Action No. 7. The team was careful to not discount or dismiss any potential
causes, even if the team was aware of adequate defensive measures that would
reasonably reduce the likelihood of such causes of the hazardous control action.
Table 7-8 provides the results of this team assessment, where the following
observations can be made:
 All potential causes listed could be analyzed further and in more detail given
more information about the system
 This higher-level control loop should take into account additional aspects like:
− HPCI/RCIC is achieving desired flow rate at the pump, but downstream
leaks or blockage is causing insufficient flow rate at the reactor
− Upstream problems like water supply depleted, steam pressure inadequate,
leaks/blockage
− HPCI/RCIC system unable to achieve necessary flow rate, or the max flow
rate is achieved but not sufficient to cool reactor
STPA Step 8: Apply the Results
In Step 7, none of the potential causes of Hazardous Control Action No. 7 were
dismissed or eliminated because the purpose of the STPA method is to identify
hazardous control actions. It is up to the users of this method to decide what to do
about the results.
In this example, the project team evaluated Table 7-8 and determined one or more
defensive measures against each potential cause, some of which become design
requirements (e.g., signal validation), and some of which become defensive
measures during the operations and maintenance phase of the system lifecycle (e.g.,
sensor calibration). In many cases, plant programs and processes can be credited
as defensive measures.
By making a positive determination of all reasonable causes of a potentially
hazardous control action, the STPA results can systematically demonstrate how
system requirements will prevent or mitigate some hazards, and how existing
programs and processes will prevent or mitigate other hazards.

 7-21 
Table 7-4
HPCI-RCIC Turbine Controls: System-Level Hazards vs. Accidents or Losses

Accidents or Losses
A1 A2 A3 A4 A5

Hazards Radiation Contaminated Equipment Injury or Lost


Exposure Environment Damage Death Generation

H1
Reactor Exceeds X X
Limits

H2
Radioactive X X
Material Release

H3
Equipment Operated X X
Beyond Limits

Inadvertent Equip.
H4 Operation During X
Maintenance

H5 Reactor Trip or X
Shutdown

 7-22 
[Figure: simplified diagram of the HPCI-RCIC flow control system, showing the
Operator Interaction, the HPCI/RCIC Flow Control System, the Initiation Signal,
the Main Steam and Main Feedwater connections, the Condensate Storage Tank, and
the Governor Valve, Trip/Throttle Valve, and Steam Admission Valve.]

System Initiation Signals (Open Steam Admission Valve & Process Valves):
1. Low Reactor Level (-48")
2. High Drywell Pressure (HPCI only; +2 psig)

System Isolation Signals (Trip Turbine & Close Process Valves):
1. High Steam Line Flow
2. High Area Temperature
3. Low Steam Line Pressure (HPCI only)
4. Low Reactor Pressure (RCIC only)
5. Manual

Turbine Trip Signals (Close Trip/Throttle Valve):
1. Any system isolation signal
2. High Steam Exhaust Pressure (150 psi)
3. High Reactor Level (+46")
4. Low pump suction pressure (15" Hg)
5. Turbine overspeed
6. Manual (local or remote)

Figure 7-7
HPCI-RCIC Flow Control System (System Level)

 7-23 
[Figure: system-level control structure. The Operator (with a Process Model and
knowledge of Plant Conditions) selects the controller location (MCR/RSP), selects
Auto or Manual, sets the desired flow rate (Auto), or adjusts flow (Manual). The
Flow Control System (with its own Process Model) receives the System Initiation
Signal and feedback on System Flow Rate, Turbine Speed, Valve Position, and System
Enable, and issues valve open/close commands to the actuator in the Controlled
Process (Governor Valve, Trip/Throttle Valve, and Steam Admission Valve in the
steam line from Main Steam; water supplied from the Torus or Condensate Storage
Tank to the Reactor).]

Figure 7-8
System-Level HPCI-RCIC Flow Control Structure

 7-24 
[Figure: system-level process models.
Operator process model variables (states): Plant Conditions (Normal, Accident);
Selected Controller Location (Main Control Room, Remote Shutdown Panel); Flow
Indicating Controller Mode (Manual, Automatic); System Flow (Too Low, At Desired
Flow, Too High). Operator control actions: CA1 Increase Desired Flow; CA2 Decrease
Desired Flow.
Flow Control System process model variables (states): System Flow from the flow
transmitter (Too Low, At Desired Flow, Too High); Turbine Speed from the magnetic
pickup (Too Low, At Desired Speed, Too High); System Enable from the limit switch
(Yes, No); Valve Position from the resolvers (Too Closed, At Desired Position, Too
Open). Flow Control System control actions: CA3 Increase Actual Position; CA4
Decrease Actual Position, acting on the governor valve actuator and governor valve.]

Figure 7-9
System-Level HPCI-RCIC Process Models

 7-25 
Table 7-5
Select HPCI-RCIC Flow Control Actions

CA3: Increase Valve Position
  PMV1 Plant Conditions: Normal, Accident
  PMV2 Valve Position: Too Closed, As Needed, Too Open
  PMV3 Turbine Speed: Too Low, As Needed, Too High
  PMV4 System Flow: Too Low, As Needed, Too High
  PMV5 System Enable: Yes, No

CA4: Decrease Valve Position
  PMV1 Plant Conditions: Normal, Accident
  PMV2 Valve Position: Too Closed, As Needed, Too Open
  PMV3 Turbine Speed: Too Low, As Needed, Too High
  PMV4 System Flow: Too Low, As Needed, Too High
  PMV5 System Enable: Yes, No
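
Before any consolidation or simplifying assumptions are applied, the context space
implied by Table 7-5 can be counted directly, as in the following Python sketch; the
excerpt in Table 7-6 lists fewer rows because of the consolidations and assumptions
described in the example text. The dictionary name is an illustrative assumption.

    import math

    # PMVs and state counts for CA3, taken from Table 7-5.
    pmv_state_counts = {
        "Plant Conditions": 2,  # Normal, Accident
        "Valve Position": 3,    # Too Closed, As Needed, Too Open
        "Turbine Speed": 3,     # Too Low, As Needed, Too High
        "System Flow": 3,       # Too Low, As Needed, Too High
        "System Enable": 2,     # Yes, No
    }

    contexts_per_behavior = math.prod(pmv_state_counts.values())
    print(contexts_per_behavior)  # 2 x 3 x 3 x 3 x 2 = 108 contexts per postulated behavior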

 7-26 
Table 7-6
Excerpt of STPA Results for Control Action 3

Controller:           HPCI-RCIC Flow Control System
Control Action:       CA3 Increase Governor Valve Position
Postulated Behavior:  Providing (the increase valve position command) (Is CA Behavior Hazardous?)
Hazard key:           H1 Reactor Exceeds Limits; H2 Radioactive Release; H3 Equipment Damage;
                      H4 Personnel Injury or Death; H5 Reactor Shutdown

Columns: Row; PMV1 Plant Conditions; PMV2 Valve Position; PMV3 Turbine Speed;
PMV4 System Flow; PMV5 System Enable; Is Situation Already Hazardous?;
Is CA Behavior Hazardous?; Related Hazards; Comments (Situational Context)
1 Yes Yes Yes H3 Leads to Rx overfill
Too high
2 No Yes No Response H1, H2 Accident and no enable
3 Yes Yes Maybe H3 Increase flow, but overspeed?
Too high Too low
4 No Yes No Response H1, H2 Accident and no enable
5 As Yes No Yes H3 Leads to Rx overfill
6 needed No Yes No Response H1, H2 Accident and no enable
7 Yes Yes Yes H3 Leads to Rx overfill
Too high
8 No Yes No Response H1, H2 Accident and no enable
9 Too Yes Yes Maybe H3 Increase flow, but valve damage?
Too low Too low
10 open No Yes No Response H1, H2 Accident and no enable
11 As Yes No Yes H3 Leads to Rx overfill
12 needed No Yes No Response H1, H2 Accident and no enable
13 Yes Yes Yes H3 Leads to Rx overfill
Too high
14 No Yes No Response H1, H2 Accident and no enable
15 As Yes Yes Maybe H3 Increase flow, but valve damage?
Too low
16 needed No Yes No Response H1, H2 Accident and no enable
17 As Yes No Yes H3 Leads to Rx overfill
18 needed No Yes No Response H1, H2 Accident and no enable
19 Yes Yes Yes H3 Rx overfill? Turb. overspeed?
Too high
20 No Yes No Response H1, H2 Accident and no enable
21 Yes Yes Maybe H3 Increase flow, but overspeed?
Too high Too low
22 No Yes No Response H1, H2 Accident and no enable
23 As Yes No Yes H3 Rx overfill? Turb. overspeed?
24 needed No Yes No Response H1, H2 Accident and no enable
25 Yes Yes Yes H3 Leads to Rx overfill
Too high
26 No Yes No Response H1, H2 Accident and no enable
27 Too Yes Yes No --- Tries to increase flow
Accident Too low Too low
28 closed No Yes No Response H1, H2 Accident and no enable
29 As Yes No Yes H3 Leads to Rx overfill
30 needed No Yes No Response H1, H2 Accident and no enable
31 Yes Yes Yes H3 Leads to Rx overfill
Too high
32 No Yes No Response H1, H2 Accident and no enable
33 As Yes Yes Maybe H3 Increase flow, but turb. Damage?
Too low
34 needed No Yes No Response H1, H2 Accident and no enable
35 As Yes No Yes H3 Leads to Rx overfill
36 needed No Yes No Response H1, H2 Accident and no enable
37 Yes Yes Yes H3 Rx overfill? Turb. overspeed?
Too high
38 No Yes No Response H1, H2 Accident and no enable
39 Yes Yes Maybe H3 Increase flow, but overspeed?
Too high Too low
40 No Yes No Response H1, H2 Accident and no enable
41 As Yes No Yes H3 Rx overfill? Turb. overspeed?
42 needed No Yes No Response H1, H2 Accident and no enable
43 Yes Yes Yes H3 Leads to Rx overfill
Too high
44 No Yes No Response H1, H2 Accident and no enable
45 As Yes Yes Maybe H3 Increase flow, but valve damage?
Too low Too low
46 needed No Yes No Response H1, H2 Accident and no enable
47 As Yes No Yes H3 Leads to Rx overfill
48 needed No Yes No Response H1, H2 Accident and no enable
49 Yes Yes Yes H3 Leads to Rx overfill
Too high
50 No Yes No Response H1, H2 Accident and no enable
51 As Yes Yes Maybe H3 Increase flow, but valve damage?
Too low
52 needed No Yes No Response H1, H2 Accident and no enable
53 As Yes No Yes H3 Leads to Rx overfill
54 needed No Yes No Response H1, H2 Accident and no enable
55 Yes Yes Yes H1 Spurious actuation
Normal * Too High *
56 No Yes Maybe H3 Leaky steam admit valve?

 7-27 
Table 7-7
Excerpt from List of HPCI-RCIC Hazardous Control Actions

Flow control system provides increase governor valve position (CA3) when:

1. there is an accident, and valve position is *, and turbine speed is *, and system
   flow is *, and system enable is No (Note 1). Hazard: H1, H2
2. there is an accident, and valve position is too open or as needed, and turbine
   speed is *, and system flow is *, and system enable is Yes. Hazard: H3
3. there is an accident, and valve position is too closed, and turbine speed is too
   high or as needed, and system flow is *, and system enable is Yes. Hazard: H3
4. there is an accident, and valve position is too closed, and turbine speed is too
   low, and system flow is too high or as needed, and system enable is Yes. Hazard: H3
5. there is not an accident, and valve position is *, and turbine speed is too high,
   and system flow is too high, and system enable is Yes (Note 2). Hazard: H1
6. there is not an accident, and valve position is *, and turbine speed is too high,
   and system flow is *, and system enable is No (Note 3). Hazard: H3

Flow control system does not provide increase governor valve position (CA3) when:

7. there is an accident, and valve position is *, and turbine speed is *, and system
   flow is too low, and system enable is * (Note 4). Hazard: H1, H2

Notes
1. A Hazardous Control Action because the flow control system does not respond at all when there is an
   accident and no system enable
2. A Hazardous Control Action because increasing the governor valve position (CA3) worsens the effect
   of a spurious system actuation
3. Might be a Hazardous Control Action if it causes turbine speed to reach a limit when turbine speed is
   already too high and there is no system enable (possible due to a leaky steam admission valve?)
4. A Hazardous Control Action because system flow is too low during an accident, regardless of the
   states of the other process model variables, including the system enable signal

 7-28 
Table 7-8
Potential Causes of Hazardous Control Action No. 7

Hazard: Equipment Operated Beyond Limits (H3)


Controller: HPCI Turbine Governor

Hazardous Control Action No. 1: “Increase governor valve position” command (CA5) is
provided when: there is an accident, turbine speed is too high, and system enable is present

Flawed Control Algorithm


Control algorithm design commands increased valve position when the controller believes: there is demand on
HPCI/RCIC to operate and turbine speed is too high
missing rules: situation as presented is unknown to the controller. May default to a known but inappropriate
behavior (e.g., turbine speed is said to be beyond (here: greater than) the maximum level for which a rule is
provided (e.g., no rule provided for speed values 100% and above))
wrong rules: situation is known to the controller, but it is designed to implement rules that are wrong (e.g.,
send a negative voltage rather than a positive one; …)
wrong clock: when timing is of the essence, asynchrony of the controller with the other elements in the loop can
be a source of hazard. e.g., controller is made to "think too much" (e.g. average readings from several flow
measurements over "too long" a period of time before emitting command, without taking into account the
fluctuating nature of flow measurements) before emitting a command
Flawed Process Model
Controller does not believe there is demand on HPCI/RCIC when in fact it is needed
Controller believes governor valve is open wide enough or open too wide, when in fact it is not open enough
Controller believes HPCI/RCIC pump flow rate is too high or as needed, when in fact it is too little
Flawed Feedback Interpretation
Controller receives enable signal, but does not interpret this as a demand on HPCI/RCIC to operate (bypass modes?
Other situations to ignore the enable signal?)
Controller receives accurate governor valve position signal (voltage?) but does not interpret it as valve position = not
enough
the interpretation algorithm is flawed
the position feedback conflicts with some other feedback signal
the feedback conflicts with the controller’s process model
Controller receives accurate HPCI/RCIC pump flow rate signal, but does not interpret it as flow rate = too little
the flow signal is out of range low
controller is in speed mode so flow rate signal is ignored
feedback or PM conflicts
Inadequate Goal
Controller's desired HPCI/RCIC pump flow rate is too low [note: too high is also hazardous (H3), but it does not
appear to be a cause of this UCA]
Controller's desired governor valve position is too low [note: too high is also hazardous (H3), but it does not
appear to be a cause of this UCA]
Controller is in wrong mode
Controller believes it is in speed mode – the goal is a desired speed and the desired flow rate is ignored or never
Controller is in bypass mode (maintenance?)

 7-29 
Table 7-8 (continued)
Potential Causes of Hazardous Control Action No. 7

Hazard: Equipment Operated Beyond Limits (H3)


Controller: HPCI Turbine Governor

Hazardous Control Action No. 1: “Increase governor valve position” command (CA5) is
provided when: there is an accident, turbine speed is too high, and system enable is present

Inadequate Feedback
Enable signal sent to controller before there is a valid demand on HPCI/RCIC
enable provided when steam admission valve is not open (broken or misaligned LS)
steam admission valve commanded open when there is no demand on HPCI/RCIC (spurious ESFAS signal)
Enable signal sent to controller when there is a demand on HPCI/RCIC, but delayed
enable provided when steam admission valve is opened, but too late (misaligned LS or LS setpoint too high)
steam admission valve opens too slowly when commanded by ESFAS Initiation Signal (excessive stem thrust)
steam admission valve commanded open too late when there is a demand on HPCI/RCIC (ESFAS delay)
HPCI/RCIC pump flow rate signal to controller is missing, delayed, incorrect, too infrequent, or has inadequate
resolution
Signal corrupted during transmission
sensor failure
sensor design flaw
sensor operates correctly but actual flow rate is outside sensor’s operating range
fluid type is not as expected (water vs. steam?)
Governor valve position signal to controller is missing, delayed, incorrect, too infrequent, or has inadequate
resolution
Problems with communication path
actual position is beyond sensor’s range
sensor reports actuator position and it doesn’t match valve position
sensor correctly reports valve position but position doesn’t match assumed area/shape
Inadequate Execution of Control Action
Increase governor valve position command is provided in this context, but the command does not produce an
increase in governor valve position
Command sent but does not reach governor valve actuator
command received but governor valve is already fully open
command parameter is outside actuator’s operating range
command conflicts with another command
actuator failure
valve stuck
not powered
Increase governor valve position command is provided and the governor valve position increases, but the amount of
increase is not as commanded
Valve response time too slow (design flaw, physical failure, valve worn, etc.)
Inadequate Process Inputs, Physical System
Power missing or inadequate
Hardware failure (e.g. memory bit errors, etc.)

 7-30 
Example 7-2. Component-Level HPCI-RCIC Turbine Controls STPA
STPA Step 1: Identify System Boundary
The system boundary identified in this example is the same boundary identified in
Example 7-1 and Figure 4-6.
STPA Step 2: Identify Accidents (Losses)
The accidents (losses) identified in this example are the same accidents identified in
Example 7-1:
A1: People exposed to radioactivity
A2: Environment contaminated
A3: Equipment damage, up to and including core meltdown
(economic loss)
A4: Personnel injury or death
A5: Loss of generation
STPA Step 3: Identify System-Level Hazards
Even when applying the STPA method at the component level, the hazards are still
identified at the system-level because the ultimate purpose of the analysis is to
eliminate, reduce or mitigate hazards that can lead to accidents or other losses.
Therefore, the hazards identified in this example are the same hazards identified in
Example 7-1:
H1: Reactor exceeds limits
H2: Radioactive materials released
H3: Equipment operated beyond physical limits
H4: Inadvertent equipment operation during maintenance
H5: Reactor shutdown
Table 7-4 still provides a simple cross-reference that shows how any given hazard
can lead to one or more losses.
STPA Step 4: Draw the Control Structure
In order to develop requirements or analyze for the presence of hazards created at
the component level, a more resolved control structure is required. The hypothetical
digital upgrade in this example involves a digital governor and a digital positioner,
and for a number of reasons, described at the end, it is useful to deepen the
analysis to the component level. Therefore, a more refined control structure is
provided in Figure 7-11, where the flow control system is resolved into the flow
indicating controllers, the handswitch, the governor, and the positioner.
In Figure 7-11, the Control Actions that will be evaluated later are represented by
the down arrows. By inspection, the Control Actions include various actions by the
operator, the desired speed output from the flow controllers, the desired position
output from the governor, and ultimately the position demand output from the
positioner.
STPA Step 5: Create Process Model
At the component level shown in Figure 7-11 there are four “controllers” and
therefore four process models. The controllers are as follows:

 7-31 
Example 7-2. Component-Level HPCI-RCIC Turbine Controls STPA
(continued)
1. Human Operator
2. In-service Flow Indicating Controller (1 of 2 identical FICs)
3. Governor
4. Positioner
Each controller and its process model is represented in Figure 7-12. Notice this
figure is just a variation of the control structure created in Step 4 in order to make
room for the process models. This variation of the control structure shows control
actions down the left side, a truncated view of the controlled process at the bottom,
and feedback signals and other inputs up the right side. The process models are
shown in the tables located inside each controller box.
At this point in the STPA method, the process models are captured in tables or
spreadsheets and carried forward. Figure 7-12 is helpful in later steps because it
illustrates the relationships between the control actions, the process model variables,
and the process model states.
STPA Step 6: Identify Hazardous Control Actions
In this example control action CA5 (Increase Desired Position), shown in Figure 7-
12, is analyzed against two process model variables (Turbine Speed and System
Enable). For brevity, this example is limited to the analysis of CA5 as Providing its
control action in the contexts of the two PMVs. A full analysis would postulate each
of the following behaviors:
 Provided
 Not Provided
 Too Early
 Too Late
 Stopped Too Soon
The intermediate results of the “CA5 is Provided” analysis are shown in Table 7-9,
where the following observations can be made:
 First, the number of rows in Table 7-9 is only 6, a dramatic reduction from the
size of the results table (Table 7-6) when the analysis was done on one control
action for the whole system in Example 7-1. This data reduction is achieved
because in this example, only one controller is analyzed. This approach points
to the benefit of avoiding very large combinatorial sets by isolating one
controller at a time, but the downside is that other “contexts” can be missed if
other Process Model Variables associated with other controllers are not factored
into the analysis.
 If one assumes the analysis is done in the context of a demand on the HPCI
system due to plant conditions that indicate an accident, then there are three
immediately recognizable hazardous conditions in Table 7-9 when there is no
“System Enable” signal, regardless of turbine speed. If the System Enable signal
is not present, the governor will not provide any output to the positioner, and the
HPCI system will be inoperable.
 The only definitive hazardous control action identified in Table 7-9 is one in
which the governor provides an increasing valve position demand signal (CA5)
when turbine speed is already too high and the system is enabled. Unless other
protective actions are provided, this control action will lead to a turbine
overspeed issue and/or an overfill condition in the reactor.
 A preexisting hazardous state is indicated if turbine speed is already too low
before CA5 acts upon the system. Otherwise, CA5 is not hazardous.
STPA Step 7: Identify Potential Causes of Hazardous Control Actions
Row 1 in Table 7-9 was identified as a Hazardous Control Action. In this example, it
is labeled HCA1, or Hazardous Control Action 1.
The possible causes of HCA1 are listed in Table 7-10.
STPA Step 8: Apply the Results
The results of this analysis would be applied during the conceptual design or
requirements definition phase of the digital upgrade project, to assure the following:
 A reliable means of providing a system enable signal. These STPA results
indicate this may be the Achilles Heel of the whole control system. Methods
could include redundant and/or diverse contact closure input schemes, or
avoiding use of the valve stem-mounted limit switch as a signal source, and
using a direct input from the system initiation source (ESFAS in this case).
 A reliable means of avoiding a false or inaccurate turbine speed signal.
Methods could include a redundant and/or diverse speed sensor scheme.
 Performing software V&V activities, with particular emphasis on development of
test cases and validation testing to demonstrate that hazardous control action 1
(provide CA5 when turbine speed is too high) is prevented.

 7-33 
FIC: Flow Indicating Controller
MCR: Main Control Room
RSP: Remote Shutdown Panel
PID: Proportional/Integral/Derivative
HS: Handswitch
MPU: Magnetic Pickup

[Figure: component-level diagram of the HPCI-RCIC flow control system. The MCR and
RSP flow indicating controllers (PID, with a flow setpoint of 500 gpm for RCIC or
5000 gpm for HPCI) produce a speed demand that is routed through the handswitch to
the governor (PID), which produces a position demand to the positioner (PID). The
positioner drives the governor valve actuator (24 VDC, with resolver feedback).
Turbine speed is sensed by a magnetic pickup, and the system enable comes from a
limit switch (LS) actuated when the steam admission valve opens on the system
initiation signal. The governor valve, trip/throttle valve, and steam admission
valve are in the steam line from main steam; water is supplied from the torus or
condensate storage tank to the reactor.]

System Initiation Signals (Open Steam Admission Valve & Process Valves):
1. Low Reactor Level (-48")
2. High Drywell Pressure (HPCI only; +2 psig)

System Isolation Signals (Trip Turbine & Close Process Valves):
1. High Steam Line Flow
2. High Area Temperature
3. Low Steam Line Pressure (HPCI only)
4. Low Reactor Pressure (RCIC only)
5. Manual

Turbine Trip Signals (Close Trip/Throttle Valve):
1. Any system isolation signal
2. High Steam Exhaust Pressure (150 psi)
3. High Reactor Level (+46")
4. Low pump suction pressure (15" Hg)
5. Turbine overspeed
6. Manual (local or remote)

Figure 7-10
HPCI-RCIC Flow Control System (Component Level)

 7-34 
[Figure: component-level control structure. The Operator (with a Process Model and
knowledge of Plant Conditions) can adjust flow (Manual), set flow (Auto), select
Auto or Manual, and select the controller location through the handswitch (MCR or
RSP). Each Flow Indicating Controller (MCR and RSP, each with its own Process Model)
receives system flow rate feedback and outputs a desired speed. The Governor (with
its Process Model) receives turbine speed feedback and outputs a desired position;
the Positioner (with its Process Model) receives valve position feedback and outputs
a position demand to the governor valve actuator. The system initiation signal and
the system enable (limit switch) feed these controllers. The Controlled Process
comprises the governor valve, trip/throttle valve, and steam admission valve in the
steam line from main steam, with water supplied from the torus or condensate storage
tank to the reactor.]

Figure 7-11
Component-Level HPCI-RCIC Flow Control Structure

 7-35 
Figure 7-12
Component-Level HPCI-RCIC Process Models

[Process model diagram, not reproduced here. It lists the Process Model Variables and States for each controller, and the control actions exchanged between them:
- Operator: Plant Conditions (Normal, Accident); Selected Controller Location (Main Control Room, Remote Shutdown Panel); Flow Indicating Controller Mode (Manual, Automatic); System Flow (Too Low, At Desired Flow, Too High), informed by indicated flow. Control actions CA1 (Increase Desired Flow) and CA2 (Decrease Desired Flow) to the Flow Indicating Controller.
- Flow Indicating Controller: System Flow (Too Low, At Desired Flow, Too High), from the system flow transmitter (FT). Control actions CA3 (Increase Desired Speed) and CA4 (Decrease Desired Speed) to the Governor.
- Governor: Turbine Speed (Too Low, At Desired Speed, Too High), from the magnetic pickup (MPU); System Enable (Yes, No), from the limit switch (LS). Control actions CA5 (Increase Desired Position) and CA6 (Decrease Desired Position) to the Positioner.
- Positioner: Valve Position (Too Closed, At Desired Position, Too Open), from the resolvers; System Enable (Yes, No). Control actions CA7 (Increase Actual Position) and CA8 (Decrease Actual Position) to the governor valve actuator, which drives the governor valve.]
Table 7-9
Excerpt of STPA Results for Control Action 5

Controller: HPCI-RCIC Flow Control System
Control Action: CA5, Increase Governor Valve Position
Postulated Behavior: Providing (the increase valve position command)
Hazards: H1 Reactor Exceeds Limits; H2 Radioactive Release; H3 Equipment Damage; H4 Personnel Injury or Death; H5 Reactor Shutdown

Analysis Results

Row | Turbine Speed (PMV1) | System Enable (PMV2) | Situation Already Hazardous? | CA Behavior Hazardous? | Related Hazards | Comments (Situational Context)
1 | Too high | Yes | Yes | Yes | H3 | Leads to Turb overspeed or Rx overfill
2 | Too high | No | Yes | No response | H1, H2 | No enable = complete loss of function
3 | Too low | Yes | Maybe | No | H3 | Speed may be too low to recover
4 | Too low | No | Yes | No response | H1, H2 | No enable = complete loss of function
5 | As needed | Yes | No | No | None | Leads to Rx overfill
6 | As needed | No | Yes | No response | H1, H2 | No enable = complete loss of function
Table 7-10
Potential Causes of HCA 1

Hazard: Equipment Operated Beyond Limits (H3)
Controller: HPCI Turbine Governor
Hazardous Control Action No. 1: "Increase governor valve position" command (CA5) is provided when: there is an accident, turbine speed is too high, and system enable is present

Flawed Control Algorithm
- Control algorithm design commands increased valve position when the controller believes there is demand on HPCI/RCIC to operate and turbine speed is too high
- Missing rules: the situation as presented is unknown to the controller, which may default to a known but inappropriate behavior (e.g., turbine speed is beyond (here: greater than) the maximum level for which a rule is provided, such as no rule provided for speed values of 100% and above)
- Wrong rules: the situation is known to the controller, but it is designed to implement rules that are wrong (e.g., send a negative voltage rather than a positive one)
- Wrong clock: when timing is of the essence, asynchrony of the controller with the other elements in the loop can be a source of hazard (e.g., the controller is made to "think too much," such as averaging readings from several flow measurements over too long a period of time, without taking into account the fluctuating nature of flow measurements, before emitting a command)

Flawed Process Model
- Controller does not believe there is demand on HPCI/RCIC when in fact it is needed
- Controller believes the governor valve is open wide enough or open too wide, when in fact it is not open enough
- Controller believes the HPCI/RCIC pump flow rate is too high or as needed, when in fact it is too little

Flawed Feedback Interpretation
- Controller receives the enable signal, but does not interpret this as a demand on HPCI/RCIC to operate (bypass modes? other situations in which the enable signal is ignored?)
- Controller receives an accurate governor valve position signal (voltage?) but does not interpret it as valve position = not enough
  - the interpretation algorithm is flawed
  - the position feedback conflicts with some other feedback signal
  - the feedback conflicts with the controller's process model
- Controller receives an accurate HPCI/RCIC pump flow rate signal, but does not interpret it as flow rate = too little
  - the flow signal is out of range low
  - the controller is in speed mode, so the flow rate signal is ignored
  - feedback or process model conflicts

Inadequate Goal
- Controller's desired HPCI/RCIC pump flow rate is too low (note: a flow rate that is too high is also hazardous (H3), but it is not a cause of this HCA)
- Controller's desired governor valve position is too low (note: a position that is too high is also hazardous (H3), but it is not a cause of this HCA)
- Controller is in the wrong mode
  - controller believes it is in speed mode; the goal is a desired speed, and the desired flow rate is ignored or never sent
  - controller is in bypass mode (maintenance?)

Inadequate Feedback
- Enable signal sent to the controller before there is a valid demand on HPCI/RCIC
  - enable provided when the steam admission valve is not open (broken or misaligned LS)
  - steam admission valve commanded open when there is no demand on HPCI/RCIC (spurious ESFAS signal)
- Enable signal sent to the controller when there is a demand on HPCI/RCIC, but delayed
  - enable provided when the steam admission valve is opened, but too late (misaligned LS or LS setpoint too high)
  - steam admission valve opens too slowly when commanded by the ESFAS initiation signal (excessive stem thrust)
  - steam admission valve commanded open too late when there is a demand on HPCI/RCIC (ESFAS delay)
- HPCI/RCIC pump flow rate signal to the controller is missing, delayed, incorrect, too infrequent, or has inadequate resolution
  - signal corrupted during transmission
  - sensor failure
  - sensor design flaw
  - sensor operates correctly but the actual flow rate is outside the sensor's operating range
  - fluid type is not as expected (water vs. steam?)
- Governor valve position signal to the controller is missing, delayed, incorrect, too infrequent, or has inadequate resolution
  - problems with the communication path
  - actual position is beyond the sensor's range
  - sensor reports actuator position, and it does not match valve position
  - sensor correctly reports valve position, but position does not match the assumed area/shape

Inadequate Execution of Control Action
- Increase governor valve position command is provided in this context, but the command does not produce an increase in governor valve position
  - command sent but does not reach the governor valve actuator
  - command received but the governor valve is already fully open
  - command parameter is outside the actuator's operating range
  - command conflicts with another command
  - actuator failure
  - valve stuck
  - not powered
- Increase governor valve position command is provided and the governor valve position increases, but the amount of increase is not as commanded
  - valve response time too slow (design flaw, physical failure, valve worn, etc.)

Inadequate Process Inputs, Physical System
- Power missing or inadequate
- Hardware failure (e.g., memory bit errors)
7.5 STPA Strengths

High Coverage

The STPA method is designed to provide high coverage of potential hazards. This coverage is useful because the results can be used to eliminate, reduce, or mitigate the majority of hazards during system requirements generation and design activities. High coverage is also helpful when using the STPA method to support a licensing activity, as described in Section 7.3.

Systems View

The STPA method is essentially a top-down method that takes a system view.
The results are useful as input to the requirements definition phase of a digital I&C project because they support a safety-driven design from the beginning.

Unexpected Behaviors

The STPA method can identify unexpected and strange system behaviors that
may not otherwise be thought credible or possible. For example, it can identify
adverse interactions between components and systems that would on the surface
appear to have no potential interactions at all.

Simplified Final Results

When the data is reduced to the final list of Hazardous Control Actions and
their potential causes, the results can typically be readily used to inform
requirements, identify and apply defensive measures, and demonstrate system
acceptability.

The final results can also be used as an input to another method to help avoid
searches for faults and failures that don’t lead to hazards.

7.6 STPA Limitations

Single Failures

The STPA method does not readily identify the effects of postulated single
failures unless each Process Model Variable is considered in isolation, which goes
against its purpose. Therefore, STPA results are not well suited as an input to a
single failure analysis or identifying single point vulnerabilities.

Tedious Intermediate Results

Some of the intermediate tables that can result from the STPA method can
become very large and tedious to manage and evaluate if there are more than a
few Process Model Variables. Section 7.7 describes likely developments in the
future of the STPA method to address this problem.
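
To illustrate why these intermediate tables grow so quickly, and how their generation can be automated, the following minimal Python sketch (illustrative only; the variable names are hypothetical and not part of the STPA method) enumerates every combination of Process Model Variable values for a single control action. The number of rows is the product of the number of states of each variable, so each added variable multiplies the size of the table the analyst must review.

    from itertools import product

    # Hypothetical Process Model Variables for one controller (patterned on the
    # governor in Example 7-2). Each added variable multiplies the row count.
    process_model_variables = {
        "Turbine Speed": ["Too low", "As needed", "Too high"],
        "System Enable": ["Yes", "No"],
        # Adding "Plant Conditions": ["Normal", "Accident"] would double the table.
    }

    def enumerate_contexts(pmvs):
        """Yield one context (dict of variable values) per combination."""
        names = list(pmvs)
        for values in product(*(pmvs[name] for name in names)):
            yield dict(zip(names, values))

    contexts = list(enumerate_contexts(process_model_variables))
    print(len(contexts), "contexts to review for this control action")
    for row, context in enumerate(contexts, start=1):
        # The hazard judgment recorded in each row is supplied by the analyst,
        # not computed here.
        print(row, context)

A tool built along these lines only produces the situational contexts; the engineering judgment about which contexts are hazardous remains with the analysis team.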

 7-40 
Trained Facilitator

It helps to have a facilitator trained in the use of STPA, because the method takes a broader view of the system(s) that can be affected by a digital I&C activity and of the hazards that may result. Most users of this guidance are likely to be trained and competent in specific engineering disciplines or tasks, and may find it difficult to navigate the STPA process the first time or two without a facilitator. The method also requires keeping an open mind when evaluating each context, without throwing out or dismissing any of them; it considers all possible conditions, even if a user is convinced that existing programmatic processes (e.g., sensor calibration procedures) will prevent them.

7.7 Future Developments in STPA

At the time of publication, MIT researchers were developing new STPA tools for creating control structures and visualizing results; checking for interactions and influences; and automating and reducing data to more readily usable sets. This work holds promise for making the STPA method more accessible to a wider range of users.

 7-41 
Section 8: Purpose Graph Analysis (PGA)
Method
A Purpose Graph is a figure that illustrates the Observable, State, Goal and
Process features of a system. Purpose Graphs are used in Systems Engineering
design and analysis activities. The Purpose Graph is composed of a State Graph
placed side-by-side with a Process Graph.

Purpose Graph Analysis (PGA) can be used as a form of Hazard Analysis. The
PGA method is particularly useful for identifying potential digital systems
hazards that can arise from unexpected component or system behaviors by
providing insights into the following issues:
 Redundancy of success paths in the system
 Diversity of success paths in the system
 Direct and indirect consequences of failures to meet designed performance
levels, even when no faults are present
 Desired and undesired interactions between aspects of normal system state
changes
 Incompatible goals. Large systems with many active components can easily
develop conflicts between the goals of different parts of the system. Most
complex systems are designed to have goal conflict resolution approaches
within them, but often suffer from a lack of completeness of these
approaches. Hazards occur when conflicting goals are not detected and
resolved in a timely way during operations.
 Incompatible processes. Even when a large system is free of goal conflicts,
there may be hazardous interactions between the processes that are being
used to achieve the goals. These hazardous interactions may occur during
normal operations, even in the absence of faults and failures. Because the
design of system components is often distributed across many organizations,
these potential adverse process interactions may not be identified using
standard design practices.

Once identified, potentially hazardous interactions are candidates for further analysis and assignment of defensive measures.

 8-1 
8.1 PGA Overview and Objectives

This section describes the basic steps for constructing the Purpose Graph and
analyzing it for potential hazards. To illustrate the method in the context of a
detailed procedure, the top-level analysis of a portion of a typical Boiling Water
Reactor (BWR) is provided in Section 8.2, with worked examples of specific
systems and subsystems of a BWR digital safety system provided in Section4.5.

Three Fundamental Features

The PGA method has the following fundamental features:


a. PGA separates System Functions into Goals and Processes. A Goal is
something that the function is intended to achieve, and a Process is
something that can be used to achieve the Goal. Processes are in turn
decomposed into the sub-Goals that make up the steps of the Process. By
making this separation, a Process can be related to the achievement of more
than one Goal, and a sub-Goal can be used as a step in more than one
Process.
b. PGA places emphasis on defining the States of the system as well as its
functions. The relationship between system States and functions is important
to capture explicitly. Because the PGA method separates a function into its
separate Goals and Processes, this form of analysis can address what values of
system States are needed to satisfy the Goal of a function. In addition, PGA
uses the system State information to capture the context for the performance
of Processes. For example, the PGA method can determine under what range of State values a Process can be used, and when it will fail to achieve its related Goal.
c. PGA allows for multiple alternative Processes for a Goal. The Goal part of a
function can have as many alternative Processes defined for it as desired, each
with its own set of sub-Goals, some of which may be shared with other
Processes.

Basic Step 1: Purpose Graph Construction

In this basic step, two graphs are developed and juxtaposed. These graphs are
represented as both drawings and tables that describe them:

a. State Graph: The State Graph is a hierarchical graph (but not a tree!) of the
States of a system and its relevant subsystems and components. The
following terms are used in the expression of State Graphs:

Observable: A value that can be directly sensed or observed within the system
or its environment. Observables form the basis for the attributes and values
of a Sub-State, and are placed at the lowest level in the State Graph.

Attribute: A representation of a State variable value. Attributes are 1) Observables, 2) calculated from Observables, or 3) calculated from other Attributes. A collection of Attributes is used to define a Sub-State and its expected range of values. Attributes are similar to the Process Model Variables used in the STPA method described in Section 7.

Sub-State: A collection of Observables and/or Attributes (from other Sub-States) that define a condition of interest to the system designers or operators. Sub-States are characterized by Attributes (State variables) and their values. The collection of all Sub-States defines the entire State space of the system and its environment.

State Graph Link: A dependency relationship between Sub-States in the State Graph. A higher level Sub-State depends on Attributes in the lower level Sub-States that are linked to it. The State Graph is hierarchical, with complex Sub-States being composed of two or more simpler Sub-States.

Event: A particular combination of Sub-State Attribute values that is an indicator of the success or failure of a system Process or the need to change one or more of the system Processes. Events are used in many places in a Purpose Graph Analysis. Typical Events include Goal satisfaction, Process initiation or halt conditions, and conditions resulting in activation of a new Goal. Events bear some similarity to the Control Action Behaviors used in the STPA method described in Section 7.

b. Process Graph: The Process Graph is a hierarchical graph (also not a tree!) of
the higher-level Goals, Processes and sub-Goals (and sub-processes) of the
system and its relevant subsystems and components. The following terms are
used in the expression of Process Graphs:

Goal: A desired set of values for one or more Sub-States. A Goal can be
compared to the actual values of the Sub-States and evaluated as either
Satisfied or Unsatisfied. Activation of a bistable function at a fixed setpoint
(e.g., reactor trip) is an example of a Goal, but other Goals in a system may
be more abstract, such as Safety or Availability.

Process: A system behavior that can reasonably be expected to satisfy a Goal under at least some combination of system Sub-States (the "Goal achieving" ability of the Process). Processes use resources and take place over time. Processes can be decomposed into Sub-Goals. Processes have operational envelopes defined by the values of Sub-States. Processes operating outside their envelope have reduced likelihood of satisfying their intended Goal, and may fail completely.

Process Graph Link: A relationship between a Goal and a Process, or between a Process and its Sub-Goals. A "Process child" that is linked to a Goal is a possible means to satisfy the Goal. A Sub-Goal of a Process is a step that must be achieved for the Process to be performed. A Process may be a child of more than one Goal, and may share Sub-Goals with other Processes. A Goal may have links to more than one alternative Process child.

Constraint: A combination of system Sub-State values that defines the
operating envelope of the “Goal-achieving” ability of a Process. Constraints
are used in many places within a Purpose Graph Analysis, but always to
indicate a Sub-State Attribute value relationship that must be satisfied in
order for a relationship (link) to hold true.

The construction and analysis of system States in the State Graph and Goals and
Processes in the Process Graph use composition and decomposition techniques
driven by Purpose:

Decomposition separates an entity with broad scope into a group of related entities, each with less scope. The term "parent" will be used to refer to the original entity; its decomposed group is collectively called "children," and the children are "siblings" of one another.

Composition is the joining together of a number of entities with smaller scope into an entity with larger scope. On the surface, Composition would appear to be the inverse of Decomposition, but there are limits. For example, composing an apple and a banana to make fruit meets a certain rule (i.e., the definition of fruit), but adding a cucumber violates this rule.

Composition and Decomposition techniques are used extensively in the PGA method, and appear in the detailed procedure provided in Section 8.2 and the worked examples provided in Section 8.4.

Basic Step 2: Purpose Graph Analysis

When juxtaposed, a State Graph and a Process Graph form a Purpose Graph that
can reveal the Purpose of system Goals and Processes in the context of its States,
Attributes, and Observables, thus leading to the name “Purpose Graph Analysis.”

In Basic Step 2, the graphs and tables prepared in Basic Step 1 are analyzed
against a set of ten Characteristics that reveal important strengths, weaknesses and
interactions in the digital system. Three areas of analysis can be performed:
States, Goals and Processes. This basic step probes deeply into State, Goal and
Process characteristics in order to identify system behaviors, both desired and
undesired (e.g., hazardous). The ten Characteristics evaluated in Basic Step 2 are
provided in Table 8-1, organized under their analysis headings. Note that some
Characteristics would be expected and desired, and some Characteristics would
not be expected or desired.

The objective of the PGA method is to systematically identify these Characteristics in the context of the constructed Purpose Graph and determine whether they are hazardous or not. As in other Hazard Analysis methods, it is important not to discard or delete any potentially hazardous characteristics because they are considered non-credible or automatically prevented or mitigated through existing programs or processes. System safety is demonstrated much more effectively by 1) systematically identifying all potentially hazardous characteristics, and then 2) assessing available or proposed measures that can eliminate, prevent, or mitigate such characteristics.

Table 8-1
Ten Characteristics Evaluated in PGA Basic Step 2

Three State Characteristics to be Evaluated
1. State Redundancy: Are there multiple means to determine the Attribute values that make up a given Sub-State, so that loss of an information source does not prevent determining the characteristics of the Sub-State?
2. State Interdependence: What Attributes of a given Sub-State depend on other Sub-States, and what Attributes are directly measurable?
3. Attribute Diversity: Are there diverse means to determine the Attribute values that make up a given Sub-State?

Two Goal Characteristics to be Evaluated
1. Direct Goal Interaction: A direct Goal interaction occurs when two Goals are incompatible and both cannot be achieved in the same time period.
2. Indirect Goal Interaction: An indirect Goal interaction occurs when two Goals are not directly incompatible, but there is no feasible Process that can start in the State required by one Goal and arrive at the State required by the other Goal.

Five Process Characteristics to be Evaluated
1. Process Redundancy: Are there multiple means to achieve each Goal?
2. Process Interdependence: In the case where there appear to be multiple means to achieve a Goal, to what extent are the Processes interdependent?
Process Interactions: Considering normally functioning Processes, can the following interactions occur across separate Processes that produce undesirable behaviors?
3. Sub-Goal Interactions: The Sub-Goals of two Processes have either Direct or Indirect Goal Interactions that result in the Processes being incompatible.
4. Resource Interaction: Two Processes have a mutual dependence on a limited Resource.
5. Side-Effect Interaction: Although two Processes have no Sub-Goal interactions, a side-effect change in State results when one Process interferes with the ability of the other Process to achieve its Goal.
8.2 PGA Procedure

The following steps are recommended for performing the PGA method. This
procedure is not the only way to implement the method; variations may be
suitable for different projects.

Prerequisite

The results of a Function Analysis, as described in Section 3.6, are a useful input to the PGA analysis because they provide a well-organized set of functions that can feed into the steps of the PGA procedure that identify the system states, goals and processes.

PGA Step 1: Construct the State Graph

The first step of the PGA method is the construction of the preliminary State
Graph for the system being evaluated. Design and licensing basis information
and conceptual or detailed design information (e.g., specifications, drawings,
system descriptions) are used as inputs for this Step. The State Graph is constructed as a drawing and then augmented with a data table that lists Sub-States
and their attributes. In the early iterations of the State Graph, Sub-States are
composed from observables or lower-level Sub-States. Construction of the State
Graph is broken down to the following five sub-steps:

Step 1.a) Identify or Define the Observables

Make a Preliminary Observables Table, as in Table 8-2, that lists the Observables that appear to be relevant to the functions that will be modeled in
the Purpose Graph Analysis. Typically, this step identifies all measurements
made by all sensors in the subsystems of possible interest. It is better to include
observables that may not be important than to exclude data prematurely. As the
State Graph is assembled, the Observables Table will be updated with additional
Observables and with the mapping from Observables to Sub-States.

Figure 8-1 and Table 8-2 show sample information from a portion of a typical
Boiling Water Reactor (BWR). Figure 8-1 illustrates the arrangement of Main
Steam pressure switches, connected to a common instrument line manifold, that
are used to sense steam line breaks and initiate closure of all MSIVs using one-
out-of-two-taken-twice logic. The information in Table 8-2 will be expanded
and illustrated within the context of the PGA method as each step of the PGA
procedure unfolds.
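
For readers unfamiliar with the voting scheme, the following short Python sketch shows one common way one-out-of-two-taken-twice closure logic can be expressed. The grouping of PS1/PS2 and PS3/PS4 into two trip systems is an assumption made for illustration, not a detail taken from Figure 8-1.

    def one_out_of_two_taken_twice(ps1, ps2, ps3, ps4):
        """Return True (close MSIVs) when each trip system has at least one
        tripped channel. Channel-to-trip-system grouping is assumed."""
        trip_system_a = ps1 or ps2        # one-out-of-two in trip system A
        trip_system_b = ps3 or ps4        # one-out-of-two in trip system B
        return trip_system_a and trip_system_b   # "taken twice"

    # A single tripped switch does not close the MSIVs...
    assert one_out_of_two_taken_twice(True, False, False, False) is False
    # ...but one tripped channel in each trip system does.
    assert one_out_of_two_taken_twice(True, False, False, True) is True

Note that the common instrument line manifold upstream of all four switches is not visible in this logic; that shared dependency is the kind of State Interdependence weakness the analysis in PGA Step 3 is designed to surface.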

 8-6 
Figure 8-1
BWR Main Steam Pressure Switches and MSIV Closure Logic

[Diagram, not reproduced here. Four pressure switches (PS1 through PS4) are tapped off a common instrument line from the reactor; their outputs feed one-out-of-two-taken-twice logic that generates the "Close MSIVs" signal to the main steam isolation valves.]

Table 8-2
Sample PGA Preliminary Observables Table

Observable Description Links to Sub-States


Main Steam Pressure 1 Sensed from Pressure Switch 1 TBD
Main Steam Pressure 2 Sensed from Pressure Switch 2 TBD
Main Steam Pressure 3 Sensed from Pressure Switch 3 TBD
Main Steam Pressure 4 Sensed from Pressure Switch 4 TBD

Step 1.b) Define Low Level Sub-States

A low level Sub-State represents the composition of one or more observables into
a named Sub-State, as in the State Graph per Figure 8-2 below. An observable
can be composed into many separate Sub-States; there is no exclusive
relationship between an observable (or any Sub-State) and its parent Sub-State.
The links in the State Graph show “parent-child” relationships that reflect
dependency, where the parent Sub-State depends on the values of the children
Sub-States. The Sub-State in Figure 8-2 is typical of a Sub-State that aggregates
sensor data, such as a voting arrangement like the one-out-of-two-taken-twice
logic shown on the right side of Figure 8-1. Note that the Main Steam Pressure
Sub-State is not represented in Figure 8-2 as high, low, true, false or any other
State value; consideration and evaluation of State values is performed in a later

 8-7 
step. By itself, the State Graph is a representation of relevant system States and
related Observables, regardless of their possible values.

Figure 8-2
State Graph with a Low Level Sub-State

[Diagram, not reproduced here. The Main Steam Pressure Sub-State (a State node) is composed from four Observables: Steam Pressure PS1, PS2, PS3, and PS4.]

Step 1.c) Define Higher Level Sub-States

Higher level Sub-States create abstractions and aggregations through the composition of lower level Sub-States. Higher-level Sub-States are parents to
lower-level Sub-States, and the links in the State Graph represent dependencies
between parents and children. In Figure 8-3, the Sub-State for Main Steam State
was added as a way to compose together all of the dependencies that influence
the state of the Main Steam system. In this Figure, Main Steam State depends
on the Main Steam Pressure Sub-State as defined earlier. But it also depends on
the state of the Main Steam Isolation Valves and the Reactor Power State.

Figure 8-3
Main Steam Sub-State

[Diagram, not reproduced here. The Main Steam State Sub-State is composed from the Main Steam Pressure, Main Steam Isolation Valves, and Reactor Power Sub-States. Reactor Power in turn depends on the Reactor Control and Reactor Coolant Sub-States, and Main Steam Pressure depends on the Steam Pressure PS1 through PS4 Observables.]

A higher-level Sub-State should be defined and added to the State Graph when
it is useful to collect up values from several lower level states into a larger Sub-
State (composition). To guide this step, the following questions are considered:

 8-8 
 Is it useful to group (compose) two or more lower level Sub-States into a
higher level Sub-State?
 Can all of the relevant operating events be associated with Sub-States? If an
event (alarm, process trigger, etc.) can’t be associated with a particular Sub-
State, it is an indication that Sub-States are missing.
 After a higher level Sub-State is added to the graph as a new parent, are
there any other Sub-States that should influence the newly created higher-
level Sub-State?
- If they exist, link them to the new Sub-State as children
- If the influence is not already represented as a Sub-State, add the needed
new Sub-State to the State Graph and link it to the parent as a child.

In Figure 8-3, the lower level Sub-States that are children of Main Steam
Isolation Valves and Reactor Power State are not yet defined. They were
composed directly in Step 1.c) as other Sub-States that influence the State of the
Main Steam system. Step 1.c) can be used to define the Sub-State children for
these additional States; or Step 1.a) and Step 1.b) can be used to compose them
from their related Observables.
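
The bottom-up composition described in Steps 1.a) through 1.c) can also be pictured as building a simple dependency structure. The sketch below (hypothetical Python structures; the report itself records this information as drawings and tables) encodes the Figure 8-2 and Figure 8-3 fragments and walks the dependencies of a Sub-State:

    # Observables identified in Step 1.a) (compare Table 8-2).
    observables = ["Steam Pressure PS1", "Steam Pressure PS2",
                   "Steam Pressure PS3", "Steam Pressure PS4"]

    # State Graph as a mapping from each Sub-State to the children it depends on.
    # Children may be Observables or other Sub-States; the graph is hierarchical
    # but not a tree, because a child may appear under more than one parent.
    state_graph = {
        "Main Steam Pressure": list(observables),              # Step 1.b)
        "Main Steam State": ["Main Steam Pressure",            # Step 1.c)
                             "Main Steam Isolation Valves",
                             "Reactor Power"],
        "Reactor Power": ["Reactor Control", "Reactor Coolant"],
    }

    def dependencies(sub_state, graph):
        """List every Observable or Sub-State that 'sub_state' depends on."""
        found = []
        for child in graph.get(sub_state, []):
            found.append(child)
            found.extend(dependencies(child, graph))
        return found

    print(dependencies("Main Steam State", state_graph))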

Step 1.d) Assess the Completeness of the State Graph

After adding Sub-States and links to the State Graph to capture the preliminary
dependency relationships between the Sub-States, the construction of the Process
Graph should begin. The steps for building the Process Graph are described in
PGA Step 2.

As the Process Graph is built, the State Graph should be repeatedly assessed to
determine if it is complete with respect to the Goals defined in the Process
Graph. A State Graph is complete when it is possible to associate all of the Goals
in the Process Graph and all of the operating constraints of the Processes in the
Process graph to specific Sub-States in the State Graph.

As missing Sub-States are discovered or found to be relevant, they are added to the State Graph using Steps 1.a) through 1.c) until the State Graph is complete.
A notional top-level State Graph for a BWR is shown in Figure 8-4, where the
Sub-States shown in Figure 8-3 are now embedded in a much wider view of the
state of a BWR Plant. For the purposes of illustration, this State Graph does not
show the Sub-States all the way down to the observables, but this top-level State
Graph will serve as the starting point for the two worked examples provided in
Section 8.4.

Step 1.e) Construct the States and Events Table

As the State Graph approaches completeness, the States and Events Table is
constructed. All of the Sub-States in the State Graph are listed in tabular form as
shown in Table 8-3; for expediency, this one shows only three of the Sub-States
from Figure 8-4. Each entry in the State table captures the definition of the Sub-

 8-9 
State, its attributes, and the operational events that are associated with the Sub-
State and its values.

The State and Event Table is used in the analysis phase of the PGA method
described in PGA Steps 3, 4 and 5.
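
Because each Event in the States and Events Table is a particular combination of Sub-State Attribute values, an Event can be treated as a simple predicate over those values. The sketch below is illustrative only; the attribute values and the -48 inch setpoint (borrowed from the HPCI/RCIC initiation signal in Figure 7-10) are assumptions used for the example, not design values.

    # Sample Attribute values for the Reactor Coolant Sub-State (hypothetical).
    reactor_coolant = {"Rx Coolant Level": -50.0,   # inches, illustrative only
                       "Main FW Flow": 0.0,
                       "HPCI Flow": 0.0}

    # An Event is a named condition on Sub-State Attribute values.
    def lo_lo_rx_coolant_level(attributes, setpoint=-48.0):
        return attributes["Rx Coolant Level"] <= setpoint

    if lo_lo_rx_coolant_level(reactor_coolant):
        print("Event: Lo-Lo Rx Coolant Level")   # would activate a protective Goal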

Figure 8-4
Notional Top Level State Graph for a BWR

[Diagram, not reproduced here. The Overall Plant State is composed from the Plant Electric Load Balance, Electric Power Production State, Heat State, Safety State, and Plant System Readiness Sub-States. Lower-level Sub-States include Total Electric Power Demand, External Electric Load, House Power Load, Main Generators State, Turbine State, Turbine Control Valves, Turbine Bypass Valves, Turbine Main Steam Supply, LP and HP Condenser State, Circulating Water Flow, Bypass Steam State, Pressure Relief State, Safety Relief Valve State, safety and non-safety systems readiness, reactor safety state, Containment, and the Main Steam Sub-States from Figure 8-3 (Main Steam Pressure, MSIVs, Reactor Power, Reactor Control, and Reactor Coolant).]
Table 8-3
Top-Level BWR State and Event Table (Partial)

Sub-State: Reactor Power
  Description: The thermal power and reactivity state of the reactor
  Attributes: Reactivity; Coolant Temperature; Coolant Pressure; Coolant Flow; Void Fraction
  Events: High Drywell Pressure; High Rx Flux; High Rx Temp

Sub-State: Main Steam Pressure
  Description: The pressure of the steam within the main steam lines
  Attributes: Steam Pressure; Safety Relief Valve Positions; MSIV Positions
  Events: Stuck SRVs; MSIV Closure

Sub-State: Reactor Coolant
  Description: The state of the coolant flowing through the core
  Attributes: Rx Coolant Level; Main FW Flow; Rx Coolant Temp.; Rx Recirc Flow; HPCI Flow; RCIC Flow
  Events: Lo-Lo Rx Coolant Level; High Rx Coolant Level; LOCA

PGA Step 2: Construct the Process Graph

Once the construction of the State Graph has reached Step 1.c), construction of
the Process Graph should be started. It is important to remember that
throughout Process Graph Steps 2.a) through 2.c), the State Graph will remain at Step 1.c) and may be revisited several times.

As with the State Graph, the Process Graph will be built first as a drawing, then
as two tables that include the details about the Goals and Processes that make up
the drawing. The Process Graph will be constructed mostly by decomposing Processes into Sub-Goals and identifying all of the possible Processes that can satisfy Goals.

Step 2.a) Define the Top-Level Process and its Sub-Goals

Even when analyzing a subsystem of a much larger entity, it is advisable to start at the top-most Process of the entity and identify its major Sub-Goals. In Figure
8-5, the top level Process is the operation of the entire BWR Plant. The Process
is shown as a rectangle in the figure, and its Sub-Goals are the ovals.

The links in Figure 8-5 indicate that the Sub-Goals are necessary for the correct
performance of the parent Process. Note that the Process “Nuclear Power Plant
Operations” is very abstract, and has very broad scope. Similarly, its Sub-Goals
are abstract, with broad scope. The succeeding steps of the Process Graph will
add more and more detail and specificity by decomposing plant operations into
finer and finer Goals and Processes from the top down, as opposed to the State
Graph which composes observables and lower level Sub-States into higher level
 8-11 
states, from the bottom up. At the lowest level of the Process Graph, the
Processes can be directly performed by operators and machines.

Figure 8-5
Top Level Process Graph for a BWR

[Diagram, not reproduced here. The top-level Process, BWR Plant Operations, has three Sub-Goals: Electric Power Production, Plant Safety, and Plant System Readiness.]

Step 2.b) Define Processes that Satisfy Goals

The construction of the Process Graph continues by defining Processes for each
Sub-Goal. At the high levels of the Process Graph, these Processes will still be
abstract. More than one Process can be defined as a child of a Goal. These
sibling Processes are alternative ways to achieve the Goal, and represent the
presence of design characteristics like diversity and redundancy. To guide the
identification of the Process children of a Goal, consider these questions:
 Are there diverse ways to accomplish the Goal? Typical examples are
multiple diverse subsystems, or the independent manual actions of the
human operators.
 Are there redundant resources for accomplishing the Goal?

If these questions are evaluated as true, then alternative Processes can be defined
as children of the Goal. Sibling Processes that are defined as alternatives should
be distinct in some way from each other, a feature normally apparent in Step 2.c).

Figure 8-6 shows the Goal of “Plant System Readiness” (a Sub-Goal from Figure
8-5). Note that while the illustrated Processes are not necessarily mutually
exclusive, they are distinct. It would be considered proper for all of these
Processes to be going on in parallel with each other and concurrently with many
other activities within the plant. In some cases, particularly at the lower levels of
the Process Graph, sibling Processes may in fact be mutually exclusive. While
exclusivity of alternative Processes has no influence on their status as siblings, the
property of exclusivity in alternate Processes will be evaluated in the analysis
phase.

 8-12 
Figure 8-6
Alternative Processes in a Process Graph

[Diagram, not reproduced here. The Goal "Plant System Readiness" has three alternative child Processes: Scheduled Maintenance Plan, On Condition Maintenance, and System Replacement/Upgrades.]

Constraints are implied in the connections between alternative Processes that can
each satisfy a connected Goal. Constraints provide a sense of context that limits
the applicability of an identified Process for achieving its parent Goal.
Constraints are expressed in terms of the related States and Sub-States in the
State Graph. Constraints are useful when analyzing Process Interactions, and can
be listed in the Process Table (see Table 8-5).

Step 2.c) Define Sub-Goals for Each Process

For each Process that is defined, its Sub-Goals are defined and linked to it as children. It is not allowed to link Processes directly to other Processes, or Goals
directly to other Goals. In the Process Graph, the layers of Processes and Goals
are interleaved (i.e., layered), so that Processes have only Goal children and Goals
have only Process children.
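
The interleaving rule (Processes have only Goal children, and Goals have only Process children) can be checked mechanically once the Process Graph is captured in table form. The following sketch (hypothetical representation; in the report this check is performed by inspection of the drawing) flags any link that violates the rule:

    # Node kinds and parent-to-children links for a small Process Graph fragment.
    kinds = {
        "Electric Power Production": "Goal",
        "Steam Turbine Electric Generation": "Process",
        "Meet External Electric Load": "Goal",
        "Remove Excess Heat": "Goal",
    }
    children = {
        "Electric Power Production": ["Steam Turbine Electric Generation"],
        "Steam Turbine Electric Generation": ["Meet External Electric Load",
                                              "Remove Excess Heat"],
    }

    def layering_violations(kinds, children):
        """Return (parent, child) pairs that link Goal to Goal or Process to Process."""
        return [(parent, child)
                for parent, kids in children.items()
                for child in kids
                if kinds[parent] == kinds[child]]

    assert layering_violations(kinds, children) == []   # fragment is properly layered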

Figure 8-7
Layered Goals and Processes in a Process Graph

[Diagram, not reproduced here. The Goal "Electric Power Production" has three alternative child Processes: Turbine Start Up, Steam Turbine Electric Generation, and Turbine Shutdown. Their Sub-Goals include Meet House Electric Load, Meet External Electric Load, Remove Excess Heat, and Provide Main Steam Supply, which are in turn satisfied by the lower-level Processes Full Turbine Use, Reactor Steam Production, and Normal Condenser Operations.]

In Figure 8-7, the three alternative Processes for the Goal of "Electric Power Production" are shown, along with their decomposed Sub-Goals; in this case, the three alternative Processes are in fact mutually exclusive. Sub-Goals represent steps or activities necessary to carry out the parent Processes, and the more abstract the parent Processes, the more abstract their children Sub-Goals. Notice in Figure 8-7 that the Sub-Goals "Meet House Electric Load," "Provide Main Steam Supply," "Meet External Electric Load," and "Remove Excess Heat" are partially shared among the children of the three alternatives higher up in the Process Graph, but the alternative Processes "Turbine Start Up," "Steam Turbine Electric Generation," and "Turbine Shutdown" each have a different set of Sub-Goals. For example, the "Steam Turbine Electric Generation" Process has the Sub-Goal of "Meet External Electric Load," while the other Processes do not share that Sub-Goal.

When new Sub-Goals are defined, it is necessary to return to Step 1.d) of the
State Graph construction Process and verify that the system Sub-States that
correspond with the newly defined Sub-Goals are defined.

Figure 8-8 provides a partial Purpose Graph by juxtaposing a portion of Figure 8-4 (State Graph) with Figure 8-7 (Process Graph). As indicated by the arrows, the
Sub-Goal “Meet House Electric Load” is associated with the Sub-State “House
Power Load” and the Sub-Goal “Provide Main Steam Supply” is associated with
the Sub-State “Turbine Main Steam Supply”. In this case, no new Sub-States are
needed to be defined. If a new Sub-Goal is defined that does not have an
association with an existing Sub-State, a new Sub-State should be defined in the
State Graph in accordance with the procedure for constructing the State Graph
(PGA Step 1).

Figure 8-8
Checking for State and Goal Associations in the Purpose Graph

[Diagram, not reproduced here. A portion of the Figure 8-4 State Graph (Plant Electric Load Balance and its children) is juxtaposed with the Figure 8-7 Process Graph (Electric Power Production and its children). Arrows associate the Sub-Goal "Meet House Electric Load" with the Sub-State "House Power Load" and the Sub-Goal "Provide Main Steam Supply" with the Sub-State "Turbine Main Steam Supply."]
Figure 8-9 provides a portion of the final Process Graph that results from
extending the initial Process Graph in Figure 8-7 by iterating Step 2.b) and Step
2.c) until it includes the lowest level Sub-Goals and their children Processes.
Note that at each Process layer, the Processes are increasingly explicit (i.e., with
decreasing abstraction). Table 8-4 provides a list of Goals and related Sub-States,
and Table 8-5 provides a list of Processes and related Goals. The Goal “Protect
Core” and its related Process, “Respond to All Reactor Events,” are highlighted
to show their places and associations in these tables. The procedure for preparing
the Goal and Process tables is provided in Step 2.d) below.

The complete Purpose Graph for the notional BWR in this procedure is
provided in Figure 8-10.

Step 2.d) Construct the Goal and Process Tables

As in the State Graph, the Process Graph drawing is supported by two tables,
one for identifying the attributes and state relationships for Goals and the other
for identifying the related Processes. These tables are built by entering each Goal
in the Goal Table and each Process in the Process Table. These tables will be
used directly in the analysis phase of the PGA method.

Figure 8-9
Notional Top Level Process Graph for a BWR

[Diagram, not reproduced here. The top-level Process, BWR Plant Operations, decomposes through the Goals Electric Power Production, Plant Safety, and Plant System Readiness into layered Processes and Sub-Goals. The lower levels include Turbine Start Up, Steam Turbine Electric Generation, Turbine Shutdown, Integrated Safety Operations, Scheduled Maintenance Plan, On Condition Maintenance, System Replacement/Upgrades, Meet House Electric Load, Meet External Electric Load, Remove Excess Heat, Provide Main Steam Supply, Protect Core, Protect Containment, Protect Equipment, Maintain Subsystem, Replace Subsystem, Repair Subsystem, Detect System Variances, Periodic Inspection, Surveillance Testing, Respond to All Reactor Events, Respond to All Containment Events, Respond to LOCA, Reduced Operational Demand, Shutdown Equipment, Full Turbine Use, Rx Steam Production, and Normal Condenser Operations.]
Table 8-4
Top-Level BWR Goal Table (Partial)

Goal: Protect Core
  Description: Prevent fuel degradation, up to and including core damage
  Attributes: Fuel temperature; Core geometry
  Related Sub-States (from Table 8-3): Reactor Power; Reactor Coolant

Goal: Provide Main Steam Supply
  Description: Produce main steam with the required pressure, temperature, and flow
  Attributes: Main Steam pressure; Main Steam temperature; Main Steam flow
  Related Sub-States (from Table 8-3): Main Steam Pressure
Table 8-5
Top-Level BWR Process Table (Partial)

Process: Respond to All Reactor Events
  Description: Detect reactor events and initiate protective action
  Attributes: Rx Flux; Rx Temp
  Related Goals (from Table 8-4): Protect Core
  Constraints: Coolable core geometry

Process: Rx Steam Production
  Description: Use reactor heat generation and feedwater supply to make steam
  Attributes: Main steam temperature; Main steam pressure; Main steam flow
  Related Goals (from Table 8-4): Provide Main Steam Supply
  Constraints: Core is within expected operating range; Steam production matches main steam demand (i.e., pressure, flow, quality)
Figure 8-10
Notional Top-Level BWR Purpose Graph

[Diagram, not reproduced here. The complete top-level Purpose Graph for the notional BWR is formed by juxtaposing the State Graph of Figure 8-4 with the Process Graph of Figure 8-9, so that the associations between Sub-States and Goals can be inspected.]
PGA Step 3: Analyze States and Events

Once the Purpose Graph representations have been assembled, their links and nodes can be analyzed to reveal system behavior issues. The analysis phase of the
PGA method requires system knowledge and engineering judgment, and is more
effective when performed by a team made up of knowledgeable design
engineering, system engineering, operations, and vendor engineering personnel.
This step in the PGA procedure analyzes the State Graph for the State
Characteristics listed in Table 8-1:
1. State Redundancy: Are there multiple means to determine the Attribute
values that make up a Sub-State, so that loss of an information source does
not prevent determining the Characteristics of the Sub-State?
2. State Interdependence: What attributes of a given Sub-State depend on
other Sub-States, and what Attributes are directly measurable?
3. Attribute Diversity: Are there diverse means to determine the Attribute
values that make up a given Sub-State?
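
These three State Characteristics lend themselves to a simple structural screening before engineering judgment is applied. The sketch below (hypothetical data structures, sensor tags, and thresholds; not part of the PGA method as described here) counts information sources and sensor types per Sub-State as a first cut at Redundancy and Attribute Diversity, leaving Interdependence and the final hazard judgment to the review team:

    # For each Sub-State: the information sources that feed it and their sensor types.
    # Entries are illustrative, patterned loosely on Table 8-6.
    sub_state_sources = {
        "Main Steam Pressure": [("PS1", "pressure switch"), ("PS2", "pressure switch"),
                                ("PS3", "pressure switch"), ("PS4", "pressure switch")],
        "Reactor Coolant": [("LT-1", "level transmitter"), ("FT-FW", "flow transmitter"),
                            ("TE-1", "temperature element"), ("FT-HPCI", "flow transmitter")],
    }

    def screen_state(sources):
        """Flag weak Redundancy (fewer than two sources) and reduced Attribute
        Diversity (only one sensor type), per the Table 8-1 characteristics."""
        redundancy = "ok" if len(sources) >= 2 else "weak"
        diversity = "ok" if len({kind for _, kind in sources}) >= 2 else "reduced"
        return {"Redundancy": redundancy, "Attribute Diversity": diversity}

    for name, sources in sub_state_sources.items():
        print(name, screen_state(sources))
    # Main Steam Pressure screens as redundant but with reduced diversity, consistent
    # with the identical pressure switches noted in Table 8-6.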

Step 3.a) Construct the State Analysis Table

The State Analysis Table is constructed by listing each Sub-State and its related
Attributes, as identified in PGA Step 1, with additional columns for the three
State Characteristics listed above. Each State Characteristic is then analyzed for
its presence and strength (or depth) by examining the State Graph. Continuing
with the three top-level BWR States provided in Table 8-3, a sample State
Analysis Table is provided via Table 8-6:

 8-19 
Table 8-6
Top-Level BWR State Analysis Table

[Note: in the original table each Characteristic is graded with a three-tiered color code; the color coding is not reproduced in this text version.]

State: Reactor Power
  Attributes: Reactivity; Temperature; Pressure; Coolant Flow; Void Fraction
  Redundancy: There are multiple similar paths of information
  Interdependence: Partially measurable, with dependence on other Sub-States
  Attribute Diversity: There are diverse means for determining or estimating this state

State: Main Steam Pressure
  Attributes: Steam pressure; Safety relief valve positions; MSIV positions
  Redundancy: Multiple sensors
  Interdependence: Directly measurable, but pressure sensors are on a common line vulnerable to a single passive failure
  Attribute Diversity: Pressure sensors are identical, causing reduced diversity

State: Reactor Coolant
  Attributes: Rx Coolant Level; Main FW Flow; Rx Pressure; Rx Coolant Temp.; Rx Recirc Flow; HPCI Flow; RCIC Flow
  Redundancy: Multiple flow paths with multiple sensors
  Interdependence: Directly measurable, with some dependence on other Sub-States
  Attribute Diversity: Multiple flow paths and diverse sensors contribute to this state

Step 3.b) Identify Potentially Hazardous State Characteristics

Potentially hazardous characteristics that can be found in the State Analysis Table are as follows:
 Those States where Redundancy or Attribute Diversity is weak or nonexistent, and critical Goals and Processes are associated with such States. Weak or nonexistent Redundancy and Attribute Diversity characteristics are suggestive of potential hazards due to the opportunity for incorrect information to reach a Process without the ability to detect that it is erroneous.
 Those States where Interdependence is high and critical Goals and Processes are associated with such States. When this Characteristic is evaluated as high in a given State, it is suggestive of a potential hazard due to the possibility of another State having a value that prevents achieving the critical Goal(s) associated with the given State.

Table 8-6 uses a three-tiered color-coding scheme to help identify potentially hazardous State Characteristics. Users of this guidance can classify the strength
or depth of State Characteristics using this scheme or any other schemes that are
suitable for a given project. In the BWR example that has been developed up to
this point in the PGA procedure, all three States show some weaknesses,
highlighted in yellow to indicate they are potentially hazardous for further
review.

The State Interdependence Characteristic associated with the Main Steam Pressure State is high because the pressure switches used to sense and command
closure of the Main Steam Isolation Valves are connected to a common pressure
sensing instrument line, as shown in Figure 8-1. This State characteristic is
associated with the Goal “Provide Main Steam Supply” and may be potentially
hazardous if this Goal is considered to be critical. The criticality of a Goal can be
determined by the associated States and Events that are listed in the State Table.
If an Event listed next to a State is considered to be unacceptable, then the
associated Goal is critical.

This same Interdependence Characteristic (pressure sensors connected to a common instrument line vulnerable to a single passive failure) also propagates up
to higher States in the State Graph, such as “Reactor Safety” and “Containment.”
In this case, the related Goals “Protect Reactor” and “Protect Containment” are
definitely critical because they are associated with High Reactor Flux and High
Reactor Temperature Events which are both unacceptable in the context of
nuclear safety. Therefore this Characteristic is highlighted in red to indicate that
it is considered potentially hazardous and should be evaluated for preventive or
mitigative defensive measures in the system design or its operating and
maintenance procedures.

Notice the paradox in the Interdependence Characteristic in Table 8-6. When evaluated against the Goal "Provide Main Steam Supply," it might be a potential
hazard, but when it is evaluated against the Goals “Protect Reactor” and “Protect
Containment” there is no doubt. It is not unusual to encounter these paradoxes
in State Characteristics, and in these cases it is important to assess the criticality
of related Goals and take the most conservative position for identifying potential
hazards.

PGA Step 4: Analyze Goals

Goal analysis is focused on Goal interactions that can lead to potentially hazardous Process execution issues. This step in the PGA procedure analyzes the
Process Graph for the Goal Characteristics listed in Table 8-1:
1. Direct Goal Interaction: A direct Goal interaction occurs when the Goals
are incompatible and cannot both be achieved in the same time period.

 8-21 
2. Indirect Goal Interaction: An indirect Goal interaction occurs when the two
Goals are not directly incompatible, but there is no feasible Process that
could start in the State defined by one of the Goals and arrive at the State
required by the other Goal.

Step 4.a) Construct the Goal Analysis Table

The Goal Analysis Table is constructed by listing each Goal identified in PGA Step 2, with additional columns for the two Goal Characteristics listed above. Each Goal Characteristic is then analyzed for its possible presence by examining the Process Graph and Goal Table prepared in PGA Step 2. Each
Goal is compared pair-wise to all other Goals listed in the Goal Table. A sample
Goal Analysis Table is provided in Table 8-7, based on selected Goals from the
Top Level BWR Process Graph provided in Figure 8-5:

Table 8-7
Top Level BWR Goal Analysis Table

Goals | Direct Goal Interactions | Indirect Goal Interactions
Electric Power Production | None noted | Plant Safety
Plant Safety | None noted | Electric Power Production
Plant System Readiness | None noted | Electric Power Production
Provide Main Steam Supply | None noted | Plant Safety
Remove Excess Heat | None noted | Provide Main Steam Supply
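
The Direct Goal Interaction column of such a table can be partially screened mechanically by expressing each Goal as the Sub-State values it requires: two Goals that demand different values of the same Sub-State in the same time period are directly incompatible. The sketch below uses hypothetical Goal definitions and State values purely for illustration; Indirect Goal Interactions still require reasoning about feasible Processes and are not screened here.

    from itertools import combinations

    # Each Goal expressed as the Sub-State values it requires (illustrative values).
    goal_requirements = {
        "Full Power Generation": {"Reactor Power": "at rated power",
                                  "Turbine State": "on line"},
        "Reduced Operational Demand": {"Reactor Power": "reduced"},
        "Plant System Readiness": {"Turbine State": "on line"},
    }

    def direct_interactions(goals):
        """Return Goal pairs that require different values of the same Sub-State."""
        conflicts = []
        for (g1, req1), (g2, req2) in combinations(goals.items(), 2):
            shared = set(req1) & set(req2)
            if any(req1[s] != req2[s] for s in shared):
                conflicts.append((g1, g2))
        return conflicts

    print(direct_interactions(goal_requirements))
    # [('Full Power Generation', 'Reduced Operational Demand')]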

Step 4.b) Identify Potentially Hazardous Goal Interactions

In most cases when Direct Goal Interactions are identified (in which two Goals
are incompatible by definition), they are expected on the basis of the system
design. As a result, few Direct Goal Interactions are potentially hazardous. In
Table 8-7 there are no identified Direct Goal Interactions because the listed
Goals are high in the Process Graph; however, several Direct Goal Interactions
are found and described in the worked examples provided in Section 8.4.

Indirect Goal Interactions are much more difficult to design around, and many
Indirect Goal Interactions are potentially hazardous. Two potentially hazardous
Indirect Goal Interactions are highlighted in red in Table 8-7 because for the
Goals of “Electric Power Production” and “Provide Main Steam Supply” there is
no feasible Process that could start in the State defined by either one of these
Goals and arrive at the State required by the “Plant Safety” Goal.

For example, by inspecting Figure 8-10, one can see that the "Electric Power Production" Goal is associated with the "Electric Power Production" State and the "Plant Safety" Goal with the "Safety" State, and there is no Process that can support both Goal/State pairs at the same time under all Event conditions (e.g., LOCA). Of course, this Indirect Goal Interaction is already understood and recognized in the facility design basis, which requires a turbine trip when there is a reactor trip in response to design basis events. However, it serves to illustrate the systematic manner in which the PGA method can reveal potential hazards.

One approach for preventing or mitigating Indirect Goal Interactions is to attempt to define the design basis of a plant so that the circumstances that might
result in the interaction are excluded or at least remote. For example, the Goals
“Provide Main Steam Supply” and “Remove Excess Heat” have an Indirect Goal
Interaction because there can be circumstances in which the means for removing
the excess heat (i.e., the “Normal Condenser Operations” Process shown in
Figure 8-5) cannot accomplish the Goal. This result may limit the amount of
main steam that can be produced or result in added means for removing heat.

In past analyses of operational events in other industries, most notably the aerospace industry, Indirect Goal Interactions that were unexpected in the
context of system design and operation have been found to play important roles
in influencing accidents. As a result, all Indirect Goal Interactions should be
considered as potentially hazardous and closely examined for elimination,
prevention or mitigation through design alternatives and other defensive
measures.

PGA Step 5: Analyze Processes

Process analysis is focused on Process characteristics and interactions that can lead to potentially hazardous Process execution issues. This step in the PGA
procedure analyzes the Process Graph for the Process Characteristics listed in
Table 8-1:
1. Process Redundancy: Are there multiple means to achieve each Goal?
2. Process Interdependence: In the case where there appear to be multiple
means to achieve a Goal, to what extent are the Processes interdependent?
3. Process Interactions: Considering normally functioning Processes, which of
the following Process interactions can occur across separate Processes that
produce undesirable behaviors?
a. Sub-Goal Interaction: The Sub-Goals of two Processes have either
Direct or Indirect Goal Interactions that result in the Processes being
incompatible.
b. Resource Interaction: Two Processes have a mutual dependence on a
limited resource.
c. Side-Effect Interaction: Although two Processes have no Sub-Goal interactions, a side-effect change in State that results from one Process interferes with the ability of the other Process to achieve its Goal.

 8-23 
Step 5.a) Identify Potentially Hazardous Process Redundancy Characteristics

For Process redundancy analysis, the Process Graph is inspected for Singletons,
which are Goals that have only a single Process defined as a means to meet the
Goal. Not all singletons are potential hazards, since in some cases the Process is an abstraction with broad scope that can be performed in many ways. For mid to
low level Goals, however, singletons are a sign of lack of redundancy, which may
be hazardous in some contexts.

In the Top Level BWR Process Graph provided in Figure 8-5, there are several
singletons. As examples, the following are noteworthy:
 The Plant Safety Goal has a singleton Process, Integrated Safety Operations.
This is an example of a broad scope Process that is not a redundancy concern.
 The Provide Main Steam Supply Goal has the singleton BWR Steam
Production. While other forms of steam production are possible, no
redundancy or diversity is expected for this Process.
 The Remove Excess Heat Goal has a singleton child Process, Normal
Condenser Operations. This is more problematic, and could be potentially
hazardous; the Process Graph should be examined carefully to ensure that
lower level child Goals of this Process offer strong redundancy or diversity.
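
Singleton screening is a purely structural query on the Process Graph once the Goal and Process tables exist. A short sketch (node names taken from the figures above; which singletons are actually hazardous still requires the judgment just described) is shown below:

    # Goal -> alternative child Processes, for a fragment of the top-level Process Graph.
    goal_children = {
        "Plant Safety": ["Integrated Safety Operations"],
        "Provide Main Steam Supply": ["BWR Steam Production"],
        "Remove Excess Heat": ["Normal Condenser Operations"],
        "Plant System Readiness": ["Scheduled Maintenance Plan",
                                   "On Condition Maintenance",
                                   "System Replacement/Upgrades"],
    }

    def singletons(goal_children):
        """Return Goals that have exactly one Process defined to satisfy them."""
        return [goal for goal, processes in goal_children.items() if len(processes) == 1]

    print(singletons(goal_children))
    # ['Plant Safety', 'Provide Main Steam Supply', 'Remove Excess Heat']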

Step 5.b) Identify Potentially Hazardous Process Interdependencies

When a Goal is not a singleton, the Processes that are identified as being able to
satisfy the Goal (Process siblings) are inspected for Process interdependence. To
be fully independent, the sibling Processes should not have Sub-Goal instances in
common. The highest level of independence is to be mutually exclusive. If two
siblings are not mutually exclusive and they have a Sub-Goal instance in
common, then under circumstances in which the common Sub-Goal cannot be
satisfied, both Processes will fail; this shared failure may be potentially
hazardous. In the Top Level BWR Process Graph provided in
Figure 8-5, the following example of a potentially hazardous Process
Interdependency is seen:
 All of the Process children of the Goal “Electric Power Production” have a
common child, “Remove Excess Heat.” If this Goal fails, all Processes
involved with Electric Power Production will fail, including the Turbine
Shutdown Process. This observation makes the three Processes
interdependent, and elevates the significance of the Remove Excess Heat
Sub-Goal.
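
The same data structure supports a simple interdependence check: the sibling Processes of a Goal can be screened for shared Sub-Goal instances, which mark potential common-cause failure paths such as the Remove Excess Heat example above. A minimal sketch (Python; the Process contents are illustrative assumptions, not the full graph):

from collections import Counter
from typing import Dict, Set

# Process -> its Sub-Goal instances (illustrative subset)
process_subgoals: Dict[str, Set[str]] = {
    "BWR Steam Production": {"Remove Excess Heat", "Reactor Coolant Inventory"},
    "Turbine Shutdown": {"Remove Excess Heat"},
    "Turbine Start Up": {"Remove Excess Heat", "Main Steam Supply Provided"},
}

def shared_subgoals(siblings, subgoal_map):
    """Return Sub-Goals that appear in more than one sibling Process (interdependence)."""
    counts = Counter(sg for process in siblings for sg in subgoal_map[process])
    return {sg for sg, n in counts.items() if n > 1}

print(shared_subgoals(list(process_subgoals), process_subgoals))
# -> {'Remove Excess Heat'}: a single unsatisfied Sub-Goal fails all three Processes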

Step 5.c) Construct the Process Interaction Table

In this Step of the PGA procedure, a Process Interaction Table is prepared by


listing each Process from the Process Table prepared in Step 2.d, with additional
columns for the three types of Process interactions listed in Table 8-1. A sample
Process Interaction Table is provided via Table 8-8, using selected Processes
from Figure 8-9:

 8-24 
Table 8-8
Top Level BWR Process Interaction Table (Partial)

Sub-Goal Resource Side-Effect


Processes
interactions Interactions Interactions
Nuclear power
None noted None noted None noted
Plant Operations
 Scheduled Maintenance
Steam Turbine  Integrated  On Condition
 Turbine Start Up
Electric Safety Maintenance
 Turbine Shutdown
Generation Operations  System Upgrades/
Replacements
BWR Steam Not analyzed  Periodic Inspection
None noted
Production in this example  Surveillance Testing
 BWR Steam
Production
Shutdown Not analyzed  Normal Condenser
 Normal
Equipment in this example Operations
Condenser
Operations
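
Because Step 5.c is a pair-wise exercise, it can help to generate the empty Process Interaction Table programmatically so that no pairing is skipped. A minimal sketch (Python; the Process names come from Table 8-8 and the top-level graph, and the recorded value is a placeholder for the analyst's finding):

from itertools import combinations

processes = ["Nuclear Power Plant Operations", "Steam Turbine Electric Generation",
             "BWR Steam Production", "Shutdown Equipment", "Normal Condenser Operations"]
interaction_types = ("Sub-Goal", "Resource", "Side-Effect")

# One record per Process pair and interaction type; the analyst fills in the findings.
interaction_table = {frozenset(pair): dict.fromkeys(interaction_types, "Not yet assessed")
                     for pair in combinations(processes, 2)}

# Example of recording one finding noted in Table 8-8
pair = frozenset({"Shutdown Equipment", "Normal Condenser Operations"})
interaction_table[pair]["Side-Effect"] = "Potentially hazardous; see Table 8-8"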

Step 5.d) Identify Potentially Hazardous Process Interactions

Not all Process interactions identified in the Process Interaction table will be
potentially hazardous. This is particularly true for higher-level Processes that
have broad scope. For Processes with Sub-Goal interactions and resource
interactions, the hazard potential is the loss of ability to perform the Process
under circumstances where it may be needed.

Side-Effect interactions are among the most difficult to detect and design for,
but the ability of Purpose Graph Analysis to detect these interactions is one of its
major benefits. In many cases, detection and avoidance of side-effect Process
interactions is left to the operations crew and to training and procedures, rather
than explicit system design measures. This approach is marginally successful in
practice; thus side-effect interactions should be considered potentially hazardous
and given careful attention.

Three Processes listed in Table 8-8 show Side-Effect Interactions and are considered
to be potentially hazardous. Operating experience in
multiple industries has shown that the operations and maintenance staff is not always
well prepared to recognize side-effect interactions in actual operational performance
until equipment damage or personnel injury is imminent.

 8-25 
8.3 Applying the PGA Results

By providing a list of potentially hazardous system design issues, the PGA results
can be used to derive or modify system requirements in order to prevent or
mitigate some hazards, and to leverage existing programs and processes that can
prevent or mitigate other hazards.
Application Development
Each of the potential hazards identified by the PGA method should be evaluated
by a team of knowledgeable individuals responsible for system design, test,
operations, and maintenance activities. The team should decide if each identified
hazard can be eliminated, prevented, or mitigated to a reasonable extent through
one or more defensive measures that are realized through design requirements
and/or plant programs and processes. For guidance on applying defensive
measures in digital I&C systems, see References 20 and 21.
Ideally, this evaluation is performed early, at the conceptual design phase, so that
safety-driven requirements are inserted before detailed design begins. The results
of the PGA analysis should be reviewed again as the detailed design emerges, to
determine if any of the design details substantially altered the control structure
used in the analysis. Of particular concern would be new interfaces or functions
that were not accounted for in the preliminary PGA.

If the PGA method is applied late in a project for some reason, the
owner/operator should be prepared to stop the project and rework the design if
the PGA results clearly indicate potential hazards that are not effectively
eliminated, prevented, or mitigated to a reasonable extent.
Mitigation of Information Degradation
Digital control systems can experience information degradation as a result of
several fundamental issues. While some of the issues are very familiar as faults
and failures, other issues act to potentially degrade information even in the
absence of failures:
 Loss of information. Although most often associated with a faulty sensor or
communications device, this type of degradation can also occur if a software
process halts. Loss of information may be intermittent or persistent. There
are different approaches to dealing with suspected loss of information, and
these approaches can produce very different effects.
 Incorrect or unexpectedly noisy information. This familiar form of
degradation is also often associated with a faulty sensor or a degraded
communications channel. The incorrect or noisy information may be
intermittent or persistent. In addition, software processes with non-fatal
errors may create incorrect or noisy information.
 Sampling. Digital systems require that continuous-time, real-world
information be sampled to produce discrete digital values. Both the sampling
frequency and the sampling precision can result in information degradation
(a brief numeric illustration of this effect follows this list). While the
speed and precision of today's digital systems reduce this issue, sampling is
still a source of information degradation for rapidly fluctuating information.
 Information incompleteness. For even modest systems, it is impractical to
attempt to sense all of the important information about the system.
Assumptions and design choices must be made about how many sensors will
be used and where they will be located in an effort to capture system state
information. Similarly, software algorithm design must also select the specific
input parameters that will be used in its calculations, nearly always a subset of
the information that is available.
 Synchronicity. The information in a digital control system is distributed
across time as a result of sampling, communications and calculation times.
Because the state of the system cannot be determined simultaneously across
all of its components, information degradation results, even when the system
is operating perfectly. The greater the mismatch between the fundamental
dynamics of the controlled system and its digital controls, the greater the
information degradation.
 Estimating, smoothing and filtering. To offset the effects of sampling,
incompleteness and the lack of synchronicity, digital control systems
commonly use software algorithms for estimating, smoothing and filtering of
information. These algorithms can be helpful under normal conditions, but
may mask important changes at the extremes or edges of their performance.
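
As a brief numeric illustration of the sampling issue noted above (the frequencies are illustrative assumptions, not plant values), a 60 Hz disturbance sampled at 50 Hz produces exactly the same sample values as a 10 Hz signal, so the digital system cannot distinguish the two:

import math

signal_hz, sample_hz = 60.0, 50.0        # disturbance faster than half the sample rate
alias_hz = abs(signal_hz - sample_hz)    # apparent frequency after sampling (10 Hz)

samples = [math.sin(2 * math.pi * signal_hz * k / sample_hz) for k in range(10)]
aliased = [math.sin(2 * math.pi * alias_hz * k / sample_hz) for k in range(10)]

# The sampled sequences are indistinguishable, even though the underlying signals differ
assert all(abs(a - b) < 1e-9 for a, b in zip(samples, aliased))
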
Unrecognized and unmitigated information degradation can result in an incorrect
understanding of the situation, and subsequently, in inappropriate selection of goals and
activation of processes leading to an accident or loss, even in the absence of a failure
condition. Three main approaches have been used to mitigate information degradation:
 Redundancy. By having multiple identical sensors, communications channels,
or software processes, the likelihood of some forms of information
degradation can be reduced. Redundancy is particularly effective against loss
of information or incorrect and noisy information that results from a random
disturbance or failure mechanism, whether intermittent or persistent (a
simple voting sketch follows this list).
 Diversity. By having multiple, diverse sensors or communications channels or
software processes, the likelihood of information degradation can be reduced.
Diversity has a broader effect than redundancy, but a much higher cost in
terms of system complexity.
 Independence. By isolating a flow of information from other sources of
information, the effects of degradations in the other sources cannot
propagate to the isolated, independent flow. While independence can reduce
the effects of sensor, communications and software failures, it does little to
offset the other sources of information degradation in the independent flow.
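
As a simple illustration of the redundancy approach (a sketch only; the channel values, the use of None for a lost channel, and median selection are illustrative assumptions rather than a recommended design), mid-value selection over redundant channels masks a single lost or noisy input but offers no protection against a systematic error common to all channels:

from statistics import median

def select_value(channels):
    """Median of the healthy channels; None if no healthy channel remains."""
    healthy = [value for value in channels if value is not None]
    return median(healthy) if healthy else None

print(select_value([501.2, 499.8, None]))    # one channel lost      -> 500.5
print(select_value([501.2, 499.8, 623.0]))   # one channel off-scale -> 501.2
print(select_value([510.0, 510.0, 510.0]))   # common systematic bias passes through
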
These three mitigation approaches are not independent of each other and have
limitations to their effectiveness. The hazards due to information degradation in a
system can be assessed from the PGA results by considering the extent to which
these three mitigations are present in the State Graph for the DCS. In Table 8-9
below, some of the combinations of redundancy, diversity and independence are
discussed from a mitigation effectiveness view and a cost and complexity view.

 8-27 
Table 8-9
Alternatives for Mitigating Information Degradation

Redundancy Diversity Independence Effectiveness Cost and Complexity


Most effective at mitigating Difficult to achieve in practice, very
High High High
information hazards high system cost and complexity
Effective at mitigating information
High High Low High cost and complexity
hazards
Susceptible to systematic
High Low High Moderate cost and complexity
information errors or failures
Marginally effective at mitigating
High Low Low Moderate cost and complexity
information hazards
Marginally effective at mitigating
Low High High High cost and complexity
information hazards
Effective at mitigating information
Low High Low Moderate cost and complexity
hazards
Susceptible to both random and
Lower cost, moderately complex
Low Low High systematic information errors or
isolated system
failures
Susceptible to both random and
Low Low Low systematic information errors or Least complex, coupled system
failures

 8-28 
Provide Input to Other Hazard Analysis Methods

Because PGA results are focused on the identification of systematic hazards, the
results can be used to provide a more focused approach when other methods may
be applied.

For example, the FMEA method requires a bottom-up analysis of all devices in a
component, or all components in a system, which can become very large, time-consuming,
and costly when there is a large number of devices or components. If the
PGA method is applied first, then an FMEA can focus exclusively on the devices
or components that could cause or contribute to hazards, and then determine the
failure modes or failure mechanisms that could lead to such hazards.
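
In practice, this scoping step can be as simple as intersecting the PGA hazard list with the component inventory. A minimal sketch (Python; the hazard-to-component mapping and component names are illustrative assumptions, not results from this guideline):

pga_hazard_contributors = {
    "Loss of rated HPCI flow": {"Governor valve actuator", "Flow transmitter", "Flow controller"},
    "Spurious HPCI initiation": {"Steam admission valve logic", "Flow controller"},
}
all_components = {"Governor valve actuator", "Flow transmitter", "Flow controller",
                  "Steam admission valve logic", "Lube oil cooler", "Gland seal system"}

# FMEA effort is focused on components that can contribute to an identified hazard
fmea_scope = set().union(*pga_hazard_contributors.values())
deferred = all_components - fmea_scope

print("Detailed FMEA:", sorted(fmea_scope))
print("Deferred:", sorted(deferred))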

8.4 PGA Examples

The following examples of the PGA method are provided, using the same
example systems used throughout this guideline.
Example 8-1. HPCI Turbine Controls PGA
The hypothetical turbine control system digital upgrade project examined in
Example 4-2 (Figure 4-6) is also examined here, this time using the PGA method.
This example limits the analysis to the HPCI system for expediency. Table 4-4 from
Example 4-2 satisfies the prerequisite for a Function Analysis in this example.
PGA Step 1: Construct the State Graph
In Section 8.2, the Purpose Graph Analysis procedure was illustrated with a top-
level State Graph and Process Graph for a notional Boiling Water Reactor system.
This example extends the top-level BWR State Graph and Process Graph to the
HPCI and RCIC systems.
The safety functions of the HPCI and RCIC systems are identified in Example 4-4
(Fault Tree Analysis) as 1) maintain reactor coolant inventory, 2) maintain primary
coolant system integrity, and 3) containment isolation. In addition, spurious
activation of the HPCI system could result in low reactor water temperature, leading
to a reactor trip on high flux.
A preliminary HPCI State Graph is provided in Figure 8-11. It is an extension of the
top-level BWR State Graph provided in Figure 8-4, which identifies sub-states that
are associated with the reactor state, main steam state, reactor coolant inventory
state, feedwater state and reactor control state. The Reactor Coolant Inventory sub-
state can be seen to depend on the sub-states for the Feedwater Pumps and on the
HPCI Performance State. The Reactor Coolant Inventory State is, in turn, directly
related to the production of electric power, but is also part of the safety state of the
plant. In addition, the HPCI system state is a part of the safety system readiness sub-
state of the plant.
The HPCI Observables Table is provided via Table 8-10; note that the Observables
are identified simply by inspecting Figure 4-6 for equipment that provides state
indications and other measured values.

 8-29 
Example 8-1. HPCI Turbine Controls PGA (continued)
The HPCI States and Events Table is provided in Table 8-11. Each sub-state in the
State Graph is defined in terms of its attributes and values. Associated with each
sub-state are the events of interest that can be detected and reported from the sub-
state. For some sub-states, not all known events are shown in this table due to the
focus on the HPCI system.
PGA Step 2: Construct the Process Graph
The preliminary Process Graph is provided in Figure 8-12 as an extension of the
top-level BWR Process Graph provided in Figure 8-9. While the primary purpose of
the HPCI system is to perform safety functions, it also interacts with Plant Readiness
goals and processes. Furthermore, in the event of spurious operations, the HPCI
system has the potential to cause low reactor coolant temperature, leading to a
high flux reactor trip.
The HPCI Operation is a Process that is connected to several Goals. It is composed
of sub-goals and lower level processes that describe the manner of operation of the
HPCI system. Because of the relationships between the HPCI system and the
feedwater system, for the purpose of providing Reactor Coolant Inventory, the
Processes for the feedwater system are included in this example.
In Figure 8-12, Normal HPCI Operation can be used to satisfy the Reactor Coolant
Inventory goal during the process BWR Steam Production as a non-exclusive
alternative to Feedwater Pump Operation. Normal RCIC Operation is also a non-
exclusive alternative to Feedwater Pump Operation and Normal HPCI Operation.
Also, note that the CST Inventory process is used by both the Feedwater Pump
Operation process (via the Condensate Pump Operation process and the Hotwell
Level) and the Normal HPCI Operation process as one of its potential sources for
coolant.
In addition to the Normal HPCI Operation process, there is a second process HPCI
Surveillance Test that shares many of the sub-goals of the Normal HPCI Operation
process, except that the coolant is re-circulated rather than directed to the reactor
coolant inventory. This second way to operate the HPCI is connected to the goal of
testing the HPCI as a part of the process of Surveillance Testing of safety critical
subsystems.
Table 8-12 provides the Goal Table for the HPCI Process Graph, and Table 8-13
provides the Process Table. The HPCI Goal Table and the HPCI Process Table
include, respectively, higher level goals or higher level processes from the top-level
BWR Process Graph provided in Figure 8-9.
The finished HPCI Purpose Graph (State and Process Graphs side-by-side) is
provided in Figure 8-13.
PGA Step 3: Analyze States and Events
The analysis of the HPCI State Graph considers State Redundancy, State
Interdependence, and State Diversity. This example is focused on the “HPCI
Operational” and “HPCI Performance” States.
Potentially Hazardous State Characteristics
The results are listed in Table 8-14, and summarized below:
1. The HPCI governor and positioner and the HPCI system in general show low or
non-existent levels of redundancy. This result is not surprising because the HPCI

 8-30 
Example 8-1. HPCI Turbine Controls PGA (continued)
system is a single train in the larger set of Emergency Core Cooling Systems
(ECCS) where redundancy is demonstrated among multiple systems.
2. The HPCI design also shows low levels of diversity and higher levels of
interdependency for states and processes.
PGA Step 4: Analyze Goals
The analysis of the HPCI Process Graph considers Direct Goal Interactions and
Indirect Goal Interactions. In this example, the Goals listed in Table 8-15 are
compared pair-wise to the goals listed in Table 8-12 (HPCI Goal Table).
Potentially Hazardous Goal Interactions
Table 8-15 lists the resulting Direct and Indirect Goal Interactions. The direct goal
interactions noted for HPCI were easily recognized. The indirect goal interactions
are of greater interest, revealing four such interactions that are potentially
hazardous and would be assessed for design alternatives or defensive measures:
 The success of the HPCI Rated Flow Achieved goal may interfere with the
Feedwater Temperature goal, an issue that was noted in the Top Down
analysis (Section 1).
 Under some conditions, the HPCI Off-line goal could result in reduced ability to
satisfy the Reactor Coolant Inventory goal.
 Under some conditions, the HPCI Water Supply goal may interfere with the use
of CST Inventory to meet the Hotwell level goal.
 If there was reduced Main Steam Supply provided by the reactor, the HPCI
Steam Supply goal may not be met.
PGA Step 5: Analyze Processes
The Processes in the HPCI Process Graph (Figure 8-12) are analyzed for Sub-Goal,
Resource and Side-Effect interaction issues through an analysis of pair-wise
combinations of the Processes listed in Table 8-13. The results are listed in Table 8-
16.
Potentially Hazardous Process Redundancy Issues
For process redundancy analysis, the Process Graph is inspected for Singletons,
which are goals that have only a single process identified as a means to meet the goal.
Not all singletons are a cause for concern, since in some cases the process is an
abstraction that has broad scope to be performed in many ways. For mid- to low-
level goals, however, singletons are a sign of a lack of redundancy. For the HPCI
Process Graph, there are two singletons related to the HPCI:
 The HPCI Steam Supply goal has only a single process, Main Steam Supply,
for satisfying the goal.
 The HPCI Rated Flow Achieved goal has only the governor-positioner process
as its means to satisfy the goal.
Potentially Hazardous Process Interdependency Characteristics
When a goal is not a singleton, the processes that are identified as being able to
satisfy the goal (process siblings) are inspected for process interdependence. To be
fully independent, the sibling processes should not have sub-goal instances in
common. If a sub-goal instance is in common and circumstances exist under which
the common sub-goal cannot be satisfied, the processes with the common sub-goal
instance will both fail.

 8-31 
Example 8-1. HPCI Turbine Controls PGA (continued)
For the HPCI, the processes HPCI Operation and HPCI Surveillance Test, although
not siblings, have 3 common sub-goals. In this case, the process interdependence is
desirable, since a failure of the HPCI Surveillance Test is intended to reveal
problems with the HPCI Operation process. These two processes are not siblings of
the same goal, and their lack of independence does not reduce any redundancies.
Potentially Hazardous Process Interaction Characteristics
The Processes in the HPCI Process Graph are analyzed for Sub-Goal Interactions,
Resource Interactions, and Side-Effect Interactions. In this example, the HPCI
processes listed in Table 8-16 are compared pair-wise to the processes listed in the
HPCI Process Table (Table 8-13).
Most of the process interactions found for the HPCI were easily understood as
being either incompatible processes by design, or as processes with known shared
resources. The following side-effect interactions were found to be potentially
hazardous:
 HPCI Operation could have a side effect on Reactivity Management as a result
of the lack of pre-heating for the HPCI coolant flow, leading to a high-flux trip.
 HPCI Operation could have a side effect on Condensate Feed Pressure as a
result of low hotwell levels if the CST is used to supply the HPCI instead of
supplying the hotwell makeup.
None of these process interactions were the result of the proposed digital upgrade
for the HPCI turbine controls illustrated in Figure 4-6.

Figure 8-11
HPCI State Graph
(State Graph diagram; the sub-states and Observables shown are listed in Table 8-10 and Table 8-11.)

Table 8-10
HPCI Observables

Observable: HPCI Turbine Steam Admission Valve Position
  Description: Sensed from a limit switch on the Steam Admission Valve. HPCI does not directly sense the System Initiation Signal.
  Links to Sub-States: HPCI Operational
Observable: HPCI Measured Flow
  Description: Sensed from flowmeter on pump output
  Links to Sub-States: HPCI Performance
Observable: HPCI Flow Setpoint
  Description: Provided by Operator at FIC
  Links to Sub-States: HPCI Performance
Observable: Turbine Speed Demand
  Description: Output of FIC (Auto or Manual)
  Links to Sub-States: HPCI Performance
Observable: Measured Turbine Speed
  Description: Sensed by mag. pickup on turbine shaft
  Links to Sub-States: HPCI Performance
Observable: HPCI Trip/Throttle Valve Position
  Description: Manual valve
  Links to Sub-States: HPCI Operational
Observable: HPCI Governor Valve Position
  Description: Sensed from actuator resolver
  Links to Sub-States: HPCI Performance

Table 8-11
HPCI States & Events

State: Main Steam
  Description: The state of the main steam being generated by the Rx
  Attributes: Steam flow; Steam temperature
  Events: Not analyzed in this example
State: Reactor Safety
  Description: The overall state of reactor safety
  Attributes: Not analyzed in this example
  Events: Not analyzed in this example
State: Safety System Readiness
  Description: The overall state of readiness of the safety systems
  Attributes: Not analyzed in this example
  Events: Not analyzed in this example
State: Reactor Power
  Description: The thermal power and reactivity state of the reactor
  Attributes: Reactivity; Temperature; Pressure; Coolant flow; Void fraction
  Events: High drywell pressure; High flux state; High reactor temperature
State: Main Steam Pressure
  Description: The pressure of the steam within the main steam lines
  Attributes: Main steam pressure; Safety relief valve positions
  Events: Stuck Safety Relief Valve
State: Reactor Control
  Description: The overall state of the reactor controls
  Attributes: Not analyzed in this example
  Events: Not analyzed in this example
State: Reactor Coolant
  Description: The state of the coolant flowing through the reactor core
  Attributes: Reactor coolant level; Main FW Temp; Main FW Flow; Rx Recirculation Flow; HPCI flow; RCIC flow
  Events: Low-Low Rx Water Level; High Rx Water Level; LOCA
State: Main Feedwater
  Description: The state of the feedwater at the reactor vessel
  Attributes: Main FW Temp; Main FW Pressure; Main FW Flow
  Events: Low feedwater pressure
State: High Pressure Feedwater Heater
  Description: The state of the HP Feedwater heating process
  Attributes: Main FW Temp; Main FW Pressure; Main FW Flow; Extraction Steam Flow
  Events: Not analyzed in this example
State: Low Pressure Feedwater Heater
  Description: The state of the LP Feedwater heating process
  Attributes: Main FW Temp; Main FW Pressure; Main FW Flow; Extraction Steam Flow
  Events: Not analyzed in this example
State: Turbine Extraction Steam
  Description: The state of the turbine extraction steam to the feedwater heaters
  Attributes: Extraction steam temp; Extraction steam pressure
  Events: Not analyzed in this example
State: Feedwater Pump
  Description: The operational state of the feedwater pumps
  Attributes: Feedwater pump 1; Feedwater pump 2; Feedwater supply pressure; Recirculation valves
  Events: Feedwater pump not operational; Low supply pressure to feedpump
State: Condensate Pump
  Description: The operational state of the condensate pumps
  Attributes: Condensate Pump 1; Condensate Pump 2; Recirculation valves
  Events: Condensate pump not operational
State: Hotwell
  Description: The state of the hotwell of condensate that feeds the condensate pumps
  Attributes: Hotwell level; Hotwell temperature
  Events: Hotwell low level; Hotwell high level; Excessive hotwell temperature
State: Condensate Storage Tank
  Description: The state of the condensate storage tank (CST)
  Attributes: CST Level; CST Temperature
  Events: Low CST Level
State: HPCI Operational
  Description: The sub-state that describes the operation of HPCI
  Attributes: Demand (ON, OFF); Operating State (Tagged-out, Ready, Under Test, Operating, Tripped); Exception (overspeed, low suction, unexpected operation, DC power out); Coolant source (CST, Suppression Pool); Output (reactor feed, recirculation)
  Events: HPCI Trip; HPCI not operational
State: HPCI Performance
  Description: The sub-state that describes the performance of the HPCI
  Attributes: Main steam valve position (Open, Closed); HPCI flow setpoint (value); HPCI measured flow (value); HPCI turbine speed demand (value); HPCI measured turbine speed (value); HPCI governor valve position (value)
  Events: Turbine Overspeed; Low flow; Unexpected turbine operation; Failed governor valve actuator; High turbine outlet pressure

 8-36 
Main Steam Remove Excess
Supply Provided Heat

BWR Steam Emergency Normal Shutdown Surveillance


Production Core Cooling Condenser Equipment Testing
Operations

Reactivity

Reactivity
Management

Feedwater
Temperature
Reactor Coolant Detect HPCI
HPCI Off-line
Apply High Inventory Variances
Pressure
Heating

Low Pressure Feedwater HPCI HPCI HPCI


Feedwater Preheat Pressure Operation Shutdown Surveillance
Test

Apply Low Feedwater Feedwater


Pressure Heating Recirculation Pump Operation
Pumps
HPCI Water
supply
Turbine Extract Feedwater Feedwater Rated Flow
Steam Heat Supply Pressure Pumps On Steam Supply Achieved

Condensate Steam Supply Water


Feed Pressure Stopped Recirculated
Main Steam
Supply
Govern Steam
Condensate
Hotwell Level to HPCI
Pumps On
Turbine
Suppression Close Steam
Goal Condensed Admission
Pool Close Trip
Steam Valve
Inventory Throttle Valve
Process
CST Inventory
Close Governor
Valve

Figure 8-12
HPCI Process Graph

 8-37 
Table 8-12
HPCI Goals

Goal: Main Steam Supply
  Description: Produce main steam with the pressure, temperature and flow specified
  Attributes: Main Steam Press; Main Steam Temp; Main Steam Flow
  Related Sub-States (Table 8-11): Main Steam
Goal: Remove Excess Heat
  Description: Prevent the heat from the reactor and steam from reaching dangerous levels
  Attributes: None identified in this analysis
  Related Sub-States (Table 8-11): None identified in this analysis
Goal: Reactivity
  Description: Meet the specified reactivity parameters
  Attributes: Reactivity
  Related Sub-States (Table 8-11): Reactor Power State
Goal: Main Feedwater Temperature
  Description: Provide Feedwater at the desired temperature from the high pressure feedwater heating process
  Attributes: Main FW Temp
  Related Sub-States (Table 8-11): Main Feedwater
Goal: Reactor Coolant Inventory
  Description: Maintain desired coolant levels in the reactor
  Attributes: Reactor Coolant Level
  Related Sub-States (Table 8-11): Reactor Coolant
Goal: HPCI Off Line
  Description: Return HPCI to an off-line condition
  Attributes: None identified in this example
  Related Sub-States (Table 8-11): HPCI Operational
Goal: Detect HPCI Variances
  Description: Develop evidence that the HPCI system is completely functional
  Attributes: None identified in this example
  Related Sub-States (Table 8-11): HPCI Operational
Goal: Low Pressure Feedwater Preheat
  Description: Provide the desired feedwater preheat at the low pressure heating stage
  Attributes: Feedwater temp; Feedwater press; Feedwater flow
  Related Sub-States (Table 8-11): Low Pressure Feedwater Heater
Goal: Feedwater Pressure
  Description: Provide desired feedwater pressure and flow into the reactor and into the high pressure feedwater heater
  Attributes: Feedwater press; Feedwater flow
  Related Sub-States (Table 8-11): Feedwater
Goal: Turbine Extract Steam State
  Description: Provide desired steam pressure, flow & temperature to the FW heaters
  Attributes: Steam pressure; Steam temperature
  Related Sub-States (Table 8-11): Turbine extraction steam state
Goal: Feedwater Supply Pressure
  Description: Sufficient supply pressure to the Feedwater pumps for safe operation
  Attributes: Feedwater supply pressure
  Related Sub-States (Table 8-11): Feedwater Pump
Goal: Feedwater Pumps
  Description: Having the Feedwater pumps operating correctly in the desired state
  Attributes: Feedwater pumps status
  Related Sub-States (Table 8-11): Feedwater Pump
Goal: Condensate Pumps
  Description: Having the Condensate pumps operating correctly in the desired state
  Attributes: Condensate pumps status
  Related Sub-States (Table 8-11): Condensate Pump
Goal: Hotwell Level
  Description: Maintain the desired level of coolant in the hotwell
  Attributes: Hotwell condensate level; Hotwell condensate temperature
  Related Sub-States (Table 8-11): Hotwell
Goal: HPCI Water Supply
  Description: HPCI pump has adequate water supply and suction
  Attributes: Source; Suction
  Related Sub-States (Table 8-11): HPCI Operational
Goal: HPCI Turbine Steam Supply
  Description: HPCI turbine has sufficient steam supply
  Attributes: None identified in this example
  Related Sub-States (Table 8-11): HPCI Operational
Goal: HPCI Steam Supply Stopped
  Description: Stop flow of steam to HPCI turbine mechanisms
  Attributes: None identified in this example
  Related Sub-States (Table 8-11): HPCI Operational
Goal: Rated HPCI Flow Achieved
  Description: Flow produced by the HPCI meets rated flow desired
  Attributes: HPCI measured flow; HPCI flow demand
  Related Sub-States (Table 8-11): HPCI Performance
Goal: HPCI Water Recirculated
  Description: The desired destination of HPCI pump output is recirculated to source
  Attributes: Destination
  Related Sub-States (Table 8-11): HPCI Operational

Table 8-13
HPCI Processes

Process: BWR Steam Production
  Description: Use reactor heat generation and feedwater supply to make steam
  Attributes: Main steam temperature; Main steam pressure; Main steam flow
  Related Goals (Table 8-12): Main Steam Supply
Process: Emergency Core Cooling
  Description: Keep reactor at safe temperature during transients
  Attributes: Reactor core temperature
  Related Goals (Table 8-12): Remove Excess Heat
Process: Normal Condenser Operations
  Description: Use main high and low pressure condensers to condense remaining steam
  Attributes: None identified in this analysis
  Related Goals (Table 8-12): Remove Excess Heat
Process: Shutdown Equipment
  Description: Stop the operation of an item of equipment
  Attributes: Equipment Item
  Related Goals (Table 8-12): Protect Equipment
Process: Surveillance Testing
  Description: Conduct tests of safety systems to determine that the systems are fully functional
  Attributes: Completion date
  Related Goals (Table 8-12): Detect System variances
Process: Reactivity Management
  Description: Use feedwater flow and temperature to control reactor heat generation
  Attributes: Feedwater temperature; Feedwater flow; Void fraction
  Related Goals (Table 8-12): Reactivity
Process: High Pressure Feedwater Heater
  Description: Use turbine extraction steam to pre-heat the feedwater to the desired temperature
  Attributes: Feedwater temperature
  Related Goals (Table 8-12): Feedwater Temperature
Process: Feedwater Recirculation Pumps
  Description: Use the pumps to re-circulate feedwater in the reactor core
  Attributes: Feedwater flow
  Related Goals (Table 8-12): Feedwater Pressure
Process: Feedwater Pump Operation
  Description: Use the Feedwater pumps to provide feedwater level, pressure and flow to the reactor
  Attributes: Feedwater level; Feedwater pressure; Feedwater flow
  Related Goals (Table 8-12): Feedwater Pressure; Reactor Coolant Inventory
Process: Condensate Pump Feed Pressure
  Description: Use the Condensate pumps to provide supply pressure to the Feedwater pumps
  Attributes: Feedwater supply pressure
  Related Goals (Table 8-12): Feedwater Supply Pressure
Process: HPCI Operation
  Description: Operate HPCI to supply water to reactor
  Attributes: HPCI flow setpoint; HPCI measured flow
  Related Goals (Table 8-12): Reactor Coolant Inventory
Process: HPCI Surveillance Test
  Description: Operate HPCI with pump output re-circulated
  Attributes: HPCI flow setpoint; HPCI measured flow
  Related Goals (Table 8-12): Detect HPCI Variances
Process: Govern Steam to HPCI Turbine
  Description: Use the HPCI governor and positioning controllers to position the turbine governor valve to control steam to turbine
  Attributes: HPCI flow setpoint; HPCI measured flow; HPCI Turbine speed demand; HPCI measured Turbine speed
  Related Goals (Table 8-12): HPCI Rated Flow Achieved
Process: HPCI Main Steam Supply
  Description: Use the Steam Admission Valve to allow main steam pressure to the HPCI turbine
  Attributes: Steam Admission valve position
  Related Goals (Table 8-12): HPCI Steam Supply
Process: CST Inventory
  Description: Draw coolant for the HPCI pump from the CST
  Attributes: HPCI source
  Related Goals (Table 8-12): HPCI Water Supply
Process: Suppression Pool Inventory
  Description: Draw coolant for the HPCI pump from the suppression pool
  Attributes: HPCI source
  Related Goals (Table 8-12): HPCI Water Supply
Process: Close Trip Throttle Valve
  Description: Close the trip throttle valve or governor valve to stop HPCI turbine operation
  Attributes: HPCI Turbine speed
  Related Goals (Table 8-12): Steam Supply Stopped
Process: Close Governor Valve
  Description: Close the governor valve to stop HPCI turbine operation
  Attributes: HPCI Turbine speed
  Related Goals (Table 8-12): Steam Supply Stopped
Process: Close Steam Admission Valve
  Description: Close the Main Steam Admission valve to stop HPCI turbine operation
  Attributes: HPCI Turbine speed
  Related Goals (Table 8-12): Steam Supply Stopped

Figure 8-13
HPCI Purpose Graph
(Purpose Graph diagram: the HPCI State Graph of Figure 8-11 and the HPCI Process Graph of Figure 8-12 shown side-by-side.)

Table 8-14
HPCI State & Events Analysis Results

State: HPCI Operational
  Attributes: Demand (ON, OFF); Operating State (Tagged-out, Ready, Under Test, Operating, Tripped); Exception (overspeed, low suction, unexpected operation, DC power out); Coolant source (CST, Suppression Pool); Output (reactor feed, recirculation)
  Redundancy: There is an apparent low amount of redundancy in the state information.
  Interdependence: The State (position) of the Main Steam Admission Valve limit switch represents the State of the System Initiation signal (On or Off), resulting in a very high interdependence.
  Diversity: There are many ways to influence the "HPCI Operational" State (five different Observables).
State: HPCI Performance
  Attributes: Steam admission valve position (Open, Closed); HPCI flow setpoint (value); HPCI measured flow (value); HPCI turbine speed demand (value); HPCI measured turbine speed (value); HPCI governor valve position (value)
  Redundancy: There is an apparent low amount of redundancy in the state information.
  Interdependence: The performance of the HPCI system depends on a few highly related information sources.
  Diversity: There is only one way to influence the "HPCI Performance" State (via the "HPCI Operational" State).

Table 8-15
HPCI Goal Interactions

Goal: Detect HPCI Variances
  Direct Goal Interactions: HPCI Off-line
  Indirect Goal Interactions: None
Goal: HPCI Off-line
  Direct Goal Interactions: Detect HPCI Variances
  Indirect Goal Interactions: Reactor Coolant Inventory
Goal: HPCI Rated Flow Achieved
  Direct Goal Interactions: HPCI Water Recirculated; Steam Supply Stopped
  Indirect Goal Interactions: Feedwater Temperature
Goal: HPCI Steam Supply
  Direct Goal Interactions: HPCI Steam Supply Stopped
  Indirect Goal Interactions: Main Steam Supply Provided
Goal: HPCI Water Supply
  Direct Goal Interactions: None
  Indirect Goal Interactions: Hotwell Level
Goal: HPCI Water Recirculated
  Direct Goal Interactions: HPCI Rated Flow Achieved; Steam Supply Stopped
  Indirect Goal Interactions: None
Goal: HPCI Steam Supply Stopped
  Direct Goal Interactions: HPCI Steam Supply; HPCI Rated Flow Achieved
  Indirect Goal Interactions: None

Figure 8-14
One of the Indirect Goal Interactions in the HPCI System

 8-44 
Table 8-16
HPCI Process Interactions

Sub-Goal Resource Side Effect


Processes
Interactions Interactions Interactions
HPCI  HPCI Surveillance  BWR Steam  Reactivity
Operation Test Production Management
 HPCI Shutdown  Condensate Feed
Pressure
HPCI  HPCI Operation  BWR Steam None
Surveillance Test  HPCI Shutdown Production
HPCI  HPCI Surveillance None None
Shutdown Test
 HPCI Operation
Govern Steam to None None None
HPCI Turbine
HPCI Main None  BWR Steam None
Steam Supply Production
CST Inventory None None None
Suppression None None None
Pool Inventory
Close Governor None None None
Valve
Close Steam None None None
Admission Valve
Close Trip/ None None None
Throttle Valve

 8-45 
Example 8-2. CWS Control System PGA
The hypothetical Circ Water System control system examined in Example 4-3 (Figure
4-7 and Figure 4-8) is also examined here, this time using the PGA method. Table 4-
7 from Example 4-3 satisfies the prerequisite for a Function Analysis in this example.
PGA Step 1: Construct the State Graph
In Section 8.2, the Purpose Graph Analysis procedure was illustrated with a top-
level State Graph and Process Graph for a notional Boiling Water Reactor system.
This example extends the top-level BWR State Graph and Process Graph to the
CWS system. The function of the CWS system is to remove excess heat from the
plant and exchange it with the ultimate heat sink.
The State Graph for this example is provided in Figure 8-15, which omits some of
the state information in order to keep the state graph drawing from becoming
cluttered with repeated detail. The sub-states for the High Pressure Condensers and
the Low Pressure Condensers depend upon the Circulating Water Flow sub-state, as
well as the Turbine State and Bypass Steam State. Similarly, the CWS Division A
sub-state can be seen to depend on the sub-states for the CWS Pump Train A1 sub-
state, the other 2 Division A pump train states (not shown in the Figure) and on the
Comm Channel State.
As noted, the CWS has two divisions, A and B, each with three pump trains for a
total of 6 pump trains. The sub-states of the pump trains are shown for only a single
pump, Pump A1. The other 5 pump trains (2 additional in Division A, and 3 in
Division B) are identical in their sub-state structure to that shown for Pump A1.
Also for simplicity, the sub-state for Comm Channel State for each division includes
the state of both Channel 1 and Channel 2. The sub-state for Controller State also
includes both Controller A and Controller B state, as well as the current assignment
of the Master for the two controllers.
The CWS Observables table is provided in Table 8-17, and the State table is
provided in Table 8-18.
PGA Step 2: Construct the Process Graph
The top-level BWR Process Graph provided in Figure 8-5 illustrates the three main
goals of Electric Generation, Plant Safety and Plant Readiness. These goals are
supported by processes with sub-goals and sub-processes as described in Section
8.2. The CWS supports electric power production and influences safety functions,
represented by the Goal to Remove Excess Heat. However, the CWS also interacts
with Plant Readiness goals and processes.
The preliminary Process Graph is shown in Figure 8-16. The CWS operation is a
process that is connected to goals for Low Pressure Condenser and High Pressure
Condenser operating conditions. It in turn, is composed of sub-goals and lower level
processes that describe the manner of operation of the CWS as described in this
report. Again, because of the relationships between the CWS purpose of providing
Removing Excess Heat and the process of creating condensate that in turn is the
source for the feedwater, the processes for the feedwater system are included in this
analysis.

 8-46 
Example 8-2. CWS Control System PGA (continued)
In Figure 8-16, CWS Operation is used to satisfy the LP Condenser Conditions goal
and the HP Condenser Conditions goal that are part of the processes for LP
condenser Operations and HP Condenser Operations, and ultimately the goal to
Remove Excess Heat. Also, note that the Hotwell Management process is influenced
by the performance of the condensers, which are influenced in turn by the
performance of the CWS.
In addition to the CWS Operation process, there is a second process “Shutdown
CWS Component” that is connected to the readiness goals for the plant’s non-safety
systems via sub-goals for Repair Subsystem and Service Subsystem.
To facilitate the discussion of the Process Graph, it is helpful to use two tables, one
for Goals and one for Processes. Table 8-19 provides the Goal Table for the CWS
Process Graph, and Table 8-20 provides the Process Table. The CWS Goal Table
and the CWS Process Table include, respectively, higher level goals or higher level
processes from the top-level BWR Process Graph provided in Figure 8-5.
The finished Purpose Graph, which is a juxtaposition of the State Graph and the
Process Graph, is provided in Figure 8-17.
PGA Step 3: Analyze States and Events
The analysis of the CWS State Graph considers State Redundancy, State
Interdependence, and State Diversity. As with the STPA method, potential hazards
can include any losses that are considered unacceptable, including lost generation.
Potentially Hazardous State Characteristics
The results are listed in Table 8-21. The CWS is a moderately complex subsystem,
with many sources of data and many options for configuring its subsystem
components. For the higher-level states, there are generally multiple sources for data,
with moderate degrees of direct measurement and dependence on other sub-state
values. In most cases, there are diverse means of determining state values. At lower
levels of state, there is less redundancy and diversity, indicating potential hazards,
but less scope of influence of the state values. The following are selected as
representative of this observation:
 Pump train MOV state. Because there are multiple limit switches to sense MOV
position, there is redundancy. Since all switches operate in the same manner,
there is no diversity.
 Digital Input (DI) State and Digital Output (DO) State. Determining the state of
these components is partially measurable and partially dependent on the state
of other components. In some cases, the DI or DO may be able to report their
state over the Comm channels, but in other cases, another component (such as
the Controller acting as master) may need to query the component to infer its
state. All understanding of the state of DI or DO must be sent over the Comm
Channels.
PGA Step 4: Analyze Goals
The analysis of the CWS Process Graph considers Direct Goal Interactions and
Indirect Goal Interactions. In this example, the Goals listed in Table 8-22 are
compared pair-wise to the Goals listed in Table 8-19 (CWS Goal Table).
Potentially Hazardous Goal Interactions
Table 8-22 lists the resulting Direct and Indirect Goal Interactions. The direct goal

 8-47 
Example 8-2. CWS Control System PGA (continued)
interactions noted for CWS were easily recognized. The indirect goal interactions
were of greater interest, and revealed 3 such interactions that could be assessed as
potential hazards:
 Repair and servicing of CWS components. Because of the interactions between
the goals for CWS configurations involving both Division A and B pump trains,
opportunities to service or repair more than one CWS component at a time,
including those components in the digital control subsystem, must be carefully
considered.
 Heat removal and condenser operations. Sub-goal changes within the CWS can
affect heat removal and the balance of condenser operating conditions across
the HP and LP condensers. As an example, a change in the number of CWS
pumps that are on-line may cause condenser vacuum and temperature
transients.
 The CWS Flow goal can influence the amount and temperature of condensate
that is collected in the Hotwell, and in turn, the supply of condensate to the
Feedwater system.
The digital design for the CWS provides opportunities to trigger some of the goal
interactions because of its effects on the number of pumps that are online at one
time. In particular, if a pump train digital output component loses its
communications, its shelf state is to close the MOV and trip the associated pump.
This may produce condenser vacuum and temperature transients resulting in a plant
trip, particularly if the entire I/O cabinet for a Division has lost communications and
all of its pumps are tripped.
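
The lost-communications behavior can be pictured as a watchdog on the pump train digital output. The following minimal sketch (Python; the class, timeout value, and shelf-state convention are illustrative assumptions, not the actual CWS design) shows how a comm timeout silently substitutes the shelf state for the last commanded output, which is the mechanism behind the goal interaction noted above:

import time

class PumpTrainOutput:
    """Illustrative digital output channel with a lost-comms shelf state."""

    def __init__(self, shelf_state: str, comm_timeout_s: float = 2.0):
        self.shelf_state = shelf_state          # output applied when comms are lost
        self.comm_timeout_s = comm_timeout_s
        self.output = shelf_state
        self._last_message = time.monotonic()

    def on_command(self, commanded_output: str) -> None:
        # Normal path: apply the command and refresh the communications watchdog
        self._last_message = time.monotonic()
        self.output = commanded_output

    def poll(self) -> str:
        # Periodic scan: revert to the shelf state if the watchdog has expired
        if time.monotonic() - self._last_message > self.comm_timeout_s:
            self.output = self.shelf_state      # unannounced transition of interest to the PGA
        return self.output
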
PGA Step 5: Analyze Processes
The Processes in the CWS Process Graph (Figure 8-16) are analyzed for Sub-Goal,
Resource and Side-Effect interaction issues through an analysis of pair-wise
combinations of the Processes listed in Table 8-20 (CWS Process Table). The results are listed in Table 8-
23.
Potential Process Hazards
While many of the process interactions found for the CWS were easily understood
as being either incompatible processes by design, or as processes with known
shared resources, there were several notable process interactions that could be
assessed further as potential hazards:
 Communication channels as a resource. Because of the central role of the
Communication Channels in setting and maintaining the CWS division pump
train configurations, there are strong potential resource interactions with other
users of the Communication Channels. As an example, a malfunctioning process
in another subsystem that also uses the Comm Channels may saturate the
channel bandwidth, blocking the delivery of signals in the CWS.
 Lost communications behavior of the digital control system components. Two
process issues can arise: an unintended pump start as a result of a lost
communications resulting in a DO shelf state process; and the loss of clear
master/slave relationships between the controllers.

 8-48 
Electric Power Heat State
Production State

HP Condenser
State

LP Condenser
State
Non-safety
systems
Turbine State Circulating readiness
Water Flow
Turbine Main Bypass Steam
Steam Supply State
State CWS Supply CWS System
State State
Turbine
Reactor Main Turbine
Control Valves
Steam State Bypass Valves
CWS Division CWS Division
Reactor Power A State B State
State

Reactor 2 other identical CWS Pump


Train A1 State CWS Pump
Coolant State Pump Trains, A2, A3 Division B Train B1 State
Comm
Feedwater Channel State
State Pump A1
MOV State 2 other identical
State
Pump Trains, B2, B3
Feedwater Reactor
Pump State Control State
Digital Input Digital Output
4KV State State
Condensate Switchgear
Pump State State
Division A MOV
Comm Command
Channel State Message
Hotwell State

Controller
LS6 Cont. LS2
Pos. State
CST State T2 Pos.
LS5
Pos. LS1 LS3 HSI
Pos. Pos. Command
LS4 T2
Pump Trip Pos. Closed
State
State Obervable

Figure 8-15
CWS State Graph

 8-49 
Table 8-17
CWS Observables

Observable: Limit Switch 1 (LS1)
  Description: Detects position of the MOV as fully open
  Links to Sub-States: MOV State
Observable: Limit Switch 2 (LS2)
  Description: Detects position of the MOV as fully open
  Links to Sub-States: MOV State
Observable: Limit Switch 3 (LS3)
  Description: Detects position of the MOV as fully closed
  Links to Sub-States: MOV State
Observable: Limit Switch 4 (LS4)
  Description: Detects position of the MOV as fully open
  Links to Sub-States: MOV State
Observable: Limit Switch 5 (LS5)
  Description: Detects position of the MOV as 20% open for the 4KV Switchgear
  Links to Sub-States: 4KV Switchgear State
Observable: Limit Switch 6 (LS6)
  Description: Detects position of the MOV as fully open for the 4KV Switchgear
  Links to Sub-States: 4KV Switchgear State
Observable: Contact T2
  Description: Energizes/de-energizes the Pump
  Links to Sub-States: 4KV Switchgear State
Observable: Contact T2 Closed Message
  Description: This message is sent by the Digital Input board of a pump train to report that T2 is closed and the pump is stopped.
  Links to Sub-States: MOV Command Message
Observable: HSI MOV Command Message
  Description: This message is sent by the HSI to the Controller and then on to the Digital Output board of a pump train to command the MOV to open or close
  Links to Sub-States: MOV Command Message
Observable: Switch HS1
  Description: This switch energizes the close and trip coils of the 4KV Switchgear for a pump train
  Links to Sub-States: 4KV Switchgear State

Table 8-18
CWS States & Events

State: Electric Power Production State
  Description: The state of electric power being produced by the plant
  Attributes: Current; Voltage; Quality
  Events: Not analyzed in this example
State: Heat State
  Description: The state of the overall heat balance of the plant from operations
  Attributes: Total residual heat
  Events: Not analyzed in this example
State: Reactor Main Steam State
  Description: The state of the main steam being generated by the reactor (from the BWR Top Level State Graph)
  Attributes: Steam pressure; Steam flow; Steam temperature
  Events: Not analyzed in this example
State: Reactor Coolant State
  Description: The state of the coolant flowing through the reactor core
  Attributes: Reactor coolant level; Main feedwater temperature; Main feedwater flow; Feedwater recirculation; HPCI flow; RCIC flow
  Events: Low-Low reactor coolant level; High reactor coolant level; LOCA
State: Condenser State
  Description: The state of operations and conditions of the condenser
  Attributes: Steam Flow; Vacuum; Pipe side temperature; Shell side temperature
  Events: Condenser vacuum trip; Condenser temperature trip
State: Feedwater Pump State
  Description: The operational state of the feedwater pumps
  Attributes: Feedwater pump 1; Feedwater pump 2; Feedwater supply pressure; Recirculation valves
  Events: Feedwater pump not operational; Low supply pressure to feedpump
State: Circulating Water Flow
  Description: The state of the water flow to the condenser
  Attributes: Water pressure; Water flow; Water inlet temperature; Water outlet temperature
State: CWS Supply State
  Description: The state of the water and conditions in the CWS supply, such as the cooling basins
  Attributes: Water temperature; Water level
State: CWS State
  Description: The overall condition and operating status of the CWS
  Attributes: Percent capacity; Percent readiness
  Events: Capacity alarm
State: Condensate Pump State
  Description: The operational state of the condensate pumps
  Attributes: Condensate Pump 1; Condensate Pump 2; Recirculation valves
  Events: Condensate pump not operational
State: Hotwell State
  Description: The state of the hotwell of condensate that feeds the condensate pumps
  Attributes: Hotwell level; Hotwell temperature
  Events: Hotwell low level; Hotwell high level; Excessive hotwell temperature
State: CWS Division A State
  Description: The operating state of the Division A equipment of the CWS
  Attributes: Pump Status; Valve Status; Controller status; Communications channel status
  Events: Controller fail; Logic cabinet A comms fail; I/O Cabinet A comms fail
State: CWS Division B State
  Description: The operating state of the Division B of the CWS
  Attributes: Pump Status; Valve Status; Controller status; Communications channel status
  Events: Controller fail
State: CWS Pump Train A1 State
  Description: The operating state of the equipment in Pump Train A1
  Attributes: Pump status; Valve status; DO status; DI status
State: Pump A1 State
  Description: The operating state of the A1 pump
  Attributes: Pump status
  Events: Pump trip
State: A1 MOV State
  Description: The position and operating condition of the A1 MOV
  Attributes: Valve position; Limit Sw 1&2; Limit Sw 3&4; Limit Sw 5
  Events: Valve fails to move
State: 4KV Switchgear State
  Description: The operating condition and state of the pump switchgear
  Attributes: Limit Sw 5; Limit Sw 6; Contact T2
State: Digital Input State
  Description: The status of the DI board for the pump train
  Attributes: DI Status
State: Digital Output State
  Description: The status of the DO board for the pump train
  Attributes: DO status
State: Division A Comm Channel State
  Description: The status of the comm channels in the Division A Logic cabinet and I/O cabinet
  Attributes: Logic A channel 1; Logic A channel 2; I/O A channel 1; I/O A channel 2
State: Division B Comm Channel State
  Description: The status of the comm channels in the Division B Logic cabinet and I/O cabinet
  Attributes: Logic B channel 1; Logic B channel 2; I/O B channel 1; I/O B channel 2
State: Controller State
  Description: The status of the master and slave controllers in Logic cabinets A and B
  Attributes: Assigned Master; Logic A controller status; Logic B controller status

Figure 8-16
CWS Process Graph
(Process Graph diagram; the Goals and Processes shown are listed in Table 8-19 and Table 8-20.)

Table 8-19
CWS Goals

Goal: Main Steam Supply Provided
  Description: Produce main steam with the pressure, temperature and flow specified
  Attributes: Main Steam Pressure; Main Steam Temperature; Main Steam Flow
  Related Sub-States (Table 8-18): Reactor Main Steam State
Goal: Remove Excess Heat
  Description: Prevent the heat from the reactor and steam from reaching dangerous levels
  Attributes: None identified in this analysis
  Related Sub-States (Table 8-18): None identified in this analysis
Goal: Reactor Coolant Inventory
  Description: Maintain reactor coolant at desired levels and temperature
  Attributes: Coolant level; Coolant temperature
  Related Sub-States (Table 8-18): Reactor Coolant State
Goal: Repair Subsystem
  Description: Correct subsystem conditions that resulted in the subsystem not meeting readiness goals
  Attributes: Subsystem ID
  Related Sub-States (Table 8-18): Non-safety systems readiness; CWS Division A State; CWS Division B State
Goal: Service Subsystem
  Description: Complete planned on-condition servicing for a subsystem
  Attributes: Subsystem ID
  Related Sub-States (Table 8-18): Non-safety systems readiness; CWS Division A State; CWS Division B State
Goal: LP Heat Removed
  Description: The desired amount of heat is being removed by the LP condensers
  Attributes: LP condenser temperature
  Related Sub-States (Table 8-18): Heat State
Goal: HP Heat Removed
  Description: The desired amount of heat is being removed by the HP condensers
  Attributes: HP condenser temperature
  Related Sub-States (Table 8-18): Heat State
Goal: Condenser Steam In
  Description: The desired amount of steam is flowing into the LP condensers
  Related Sub-States (Table 8-18): LP Condenser State
Goal: Condensate Out
  Description: The desired amount of condensate is leaving the LP condensers
  Related Sub-States (Table 8-18): LP Condenser State
Goal: Condenser Conditions
  Description: The desired internal conditions are met within the LP condenser
  Attributes: LP condenser vacuum; LP condenser temperature
  Related Sub-States (Table 8-18): LP Condenser State
Goal: Feedwater Supply Pressure
  Description: Sufficient supply pressure to the Feedwater pumps for safe operation
  Attributes: Feedwater supply pressure
  Related Sub-States (Table 8-18): Feedwater Pump State
Goal: Feedwater Pumps
  Description: Having the Feedwater pumps operating correctly in the desired state
  Attributes: Feedwater pumps status
  Related Sub-States (Table 8-18): Feedwater Pump State
Goal: Cooling Basin
  Description: The desired conditions are met in the Cooling Basin
  Attributes: Basin Temperature; Basin Level
  Related Sub-States (Table 8-18): CWS Supply State
Goal: CWS Flow
  Description: The desired flow and temperature of the circulating water
  Attributes: Circ water flow; Circ water temperature
  Related Sub-States (Table 8-18): Circulating Water Flow
Goal: Cooling Tower A Conditions
  Description: The desired status and state of the cooling tower in CWS Division A
  Attributes: Temperature drop; Flow
  Related Sub-States (Table 8-18): CWS Division A State
Goal: Condensate Pumps
  Description: Having the Condensate pumps operating correctly in the desired state
  Attributes: Condensate pumps status
  Related Sub-States (Table 8-18): Condensate Pump State
Goal: Both CWS Divisions Controlled
  Description: Have a functioning master controller for the A and B CWS Divisions
  Attributes: Assigned Master; Controller A status; Controller B status
  Related Sub-States (Table 8-18): Controller State
Goal: Division A 2 Pumps Online
  Description: Have 2 pumps from Division A online
  Attributes: Pumps online
  Related Sub-States (Table 8-18): CWS Division A State
Goal: Division B 2 Pumps Online
  Description: Have 2 pumps from Division B online
  Attributes: Pumps online
  Related Sub-States (Table 8-18): CWS Division B State
Goal: Pump A1 Online
  Description: Have Pump Train A1 online
  Attributes: Pump status
  Related Sub-States (Table 8-18): Pump A1 State
Goal: Pump A1 Offline
  Description: Have Pump A1 offline
  Attributes: Pump status
  Related Sub-States (Table 8-18): Pump A1 State
Goal: Communicate Between Controllers
  Description: Have at least one comms channel operating between the Division A and B controllers
  Attributes: Logic A channel 1; Logic A channel 2; Logic B channel 1; Logic B channel 2
  Related Sub-States (Table 8-18): Both on channel 1; Both on channel 2
Goal: Communicate with Division B Pumps
  Description: Have at least one comms channel between the Assigned master and the Div B I/O cabinet
  Attributes: Assigned master; Channel 1; Channel 2
  Related Sub-States (Table 8-18): Division B Comm Channel State
Goal: Communicate with Division A Pumps
  Description: Have at least one comms channel between the Assigned master and the Div A I/O cabinet
  Attributes: Assigned master; Channel 1; Channel 2
  Related Sub-States (Table 8-18): Division A Comm Channel State
Goal: Signal MOV Open
  Description: Deliver signal to pump train DO to open MOV
  Attributes: MOV state
  Related Sub-States (Table 8-18): MOV State
Goal: MOV Open
  Description: Achieve full open of the MOV
  Attributes: MOV State
  Related Sub-States (Table 8-18): MOV State
Goal: Pump A1 Started
  Description: Pump is energized and moving circ water
  Attributes: 4KV Switchgear status
  Related Sub-States (Table 8-18): 4KV Switchgear State

Table 8-20
CWS Processes

Process: BWR Steam Production
  Description: Use reactor heat generation and feedwater supply to make steam
  Attributes: Main steam temperature; Main steam pressure; Main steam flow
  Related Goals (Table 8-19): Main Steam Supply
Process: Normal Condenser Operations
  Description: Use main high and low pressure condensers to condense remaining steam
  Attributes: None identified in this analysis
  Related Goals (Table 8-19): Remove Excess Heat
Process: Shutdown CWS Component
  Description: Stop the operation of an item of CWS equipment
  Attributes: Equipment Item
  Related Goals (Table 8-19): Repair Equipment (from Top Level BWR goals)
Process: Condenser Operation
  Description: Operate the condenser within limits to remove heat
  Attributes: Temperature; Vacuum
  Related Goals (Table 8-19): Heat Removed
Process: Hotwell Management
  Description: Maintain the Hotwell level and temperature to supply condensate to the feedwater system
  Attributes: Hotwell Level; Hotwell Temperature
  Related Goals (Table 8-19): HP Condensate Out; LP Condensate Out
Process: Circulating Water Operation
  Description: Use circulating water to cool the HP and LP condensers
  Related Goals (Table 8-19): CWS Flow
Process: CWS 3A + 1B
  Description: Use a pump train configuration with 3 pumps from Division A and 1 from Division B to supply the circ water
  Related Goals (Table 8-19): CWS Flow
Process: CWS 1A + 3B
  Description: Use a pump train configuration with 1 pump from Division A and 3 from Division B to supply the circ water
  Related Goals (Table 8-19): CWS Flow
Process: CWS 2A + 2B
  Description: Use a pump train configuration with 2 pumps from Division A and 2 from Division B to supply the circ water
  Related Goals (Table 8-19): CWS Flow
Process: Condensed Steam
  Description: Use condensate from the HP and LP condensers to provide condensate for use in the feedwater system
  Related Goals (Table 8-19): Hotwell Level
Process: CST Inventory
  Description: Draw coolant for the Feedwater system to the Hotwell from the CST
  Attributes: Hotwell makeup
  Related Goals (Table 8-19): Hotwell level
Process: Pumps A1 and A2 Online
  Description: Bring Pumps A1 and A2 to operating status using the pump train controls
  Attributes: Pump status
  Related Goals (Table 8-19): Div A 2 Pumps Online
Process: Use Controller A as Master
  Description: Set the CWS master controller to Logic cabinet A
  Attributes: Controller A status; Assigned Master
  Related Goals (Table 8-19): Both Divisions Controlled
Process: Use Controller B as Master
  Description: Set the CWS master controller to Logic cabinet B
  Attributes: Controller B status; Assigned Master
  Related Goals (Table 8-19): Both Divisions Controlled
Process: Normal Start Pump A1
  Description: Use the normal controlled start to bring pump A1 online
  Attributes: Pump status
  Related Goals (Table 8-19): Pump A1 Online
Process: Normal Stop Pump A1
  Description: Use the normal controlled stop to bring pump A1 offline
  Attributes: Pump status
  Related Goals (Table 8-19): Pump A1 Offline
Process: Both on channel 1
  Description: Operate both master and slave controller on channel 1
  Attributes: Assigned master
  Related Goals (Table 8-19): Communicate Between Controllers
Process: Both on channel 2
  Description: Operate both master and slave controller on Channel 2
  Attributes: Assigned master
  Related Goals (Table 8-19): Communicate Between Controllers
Process: I/O Cabinet A Comm channel 1
  Description: Use comm Channel 1 to communicate with the Div A I/O cabinet pump trains
  Attributes: Assigned master
  Related Goals (Table 8-19): Communicate with Division A Pumps; Signal MOV Open
Process: Lost comms A1 Shelf State
  Description: Revert to the shelf state of the A1 pump train if comms are lost, which is ON
  Related Goals (Table 8-19): Signal MOV Open

 8-60 
[Figure 8-17 is a graphical purpose graph for the CWS. It links the states of Table 8-18, the goals of Table 8-19, and the processes of Table 8-20, from top-level plant states (for example, Electric Power Production State and Heat State) through the CWS system, division, and pump train states, down to device-level states such as the MOV, 4KV switchgear, digital inputs and outputs, communication channels, and limit switch positions.]
Figure 8-17
CWS Purpose Graph
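
The purpose graph also lends itself to a simple machine-readable representation. The following is a minimal, illustrative Python sketch (not part of the PGA method as documented here) showing how a small slice of the CWS states, goals, and processes from Tables 8-18 through 8-20 might be captured so that the relationships in Figure 8-17 can be traversed programmatically; all class and attribute names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical structures for capturing a PGA purpose graph.
# Names and attributes are illustrative, not prescribed by the method.

@dataclass
class State:
    name: str
    attributes: List[str] = field(default_factory=list)

@dataclass
class Goal:
    name: str
    description: str
    related_states: List[State] = field(default_factory=list)

@dataclass
class Process:
    name: str
    description: str
    related_goals: List[Goal] = field(default_factory=list)

# A small slice of the CWS purpose graph (Tables 8-18 through 8-20).
pump_a1_state = State("Pump A1 State", ["Pump status", "Pump current"])
pump_a1_online = Goal("Pump A1 Online", "Have Pump Train A1 online", [pump_a1_state])
normal_start_a1 = Process("Normal Start Pump A1",
                          "Use the normal controlled start to bring pump A1 online",
                          [pump_a1_online])

# Traverse process -> goals -> states, for example to list the states an
# analyst must examine when reviewing this process for hazards.
for goal in normal_start_a1.related_goals:
    for state in goal.related_states:
        print(normal_start_a1.name, "->", goal.name, "->", state.name)
```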

 8-61 
Table 8-21
CWS State & Events Analysis Results

States | Attributes | Redundancy | Interdependence | Diversity
Circulating Water Flow | Water pressure; Water flow; Water inlet temperature; Water outlet temperature | Multiple sources of data are available | Both directly measurable and determinable from other state data | Diverse means of determination exist.
CWS Supply State | Water temperature; Water level | Multiple sources of data are available | Directly measurable | Diverse means of determination exist.
CWS State | Percent capacity; Percent readiness | Multiple sources of data are available | Dependent on other state data | Diverse means of determination exist.
CWS Division A State | Pump status; Valve status; Controller status; Communications channel status | Multiple sources of data are available | Dependent on other state data | Diverse means of determination exist.
CWS Division B State | Pump status; Valve status; Controller status; Communications channel status | Multiple sources of data are available | Dependent on other state data | Diverse means of determination exist.
CWS Pump Train A1 State | Pump status; Valve status; DO status; DI status | Multiple sources of data are available | Dependent on other state data | Diverse means of determination exist.
Pump A1 State | Pump status; Pump current | Multiple sources of data are available | Partially measurable and dependent on other state data | Diverse means of determination exist.
A1 MOV State | Valve position; Limit Sw 1&2; Limit Sw 3&4; Limit Sw 5 | Multiple sources of data are available | Directly measurable | No diversity is provided.
4KV Switchgear State | Limit Sw 5; Limit Sw 6; Contact T2 | Only single sources of data are available | Directly measurable | No diversity is provided.
Digital Input State | DI status | Only single sources of data are available | Partially measurable and dependent on other state data | No diversity is provided.
Digital Output State | DO status | Only single sources of data are available | Partially measurable and dependent on other state data | No diversity is provided.
Division A Comm Channel State | Logic A channel 1; Logic A channel 2; I/O A channel 1; I/O A channel 2 | Only single sources of data are available | Directly measurable | No diversity is provided.
Division B Comm Channel State | Logic B channel 1; Logic B channel 2; I/O B channel 1; I/O B channel 2 | Only single sources of data are available | Directly measurable | No diversity is provided.
Controller State | Assigned Master; Logic A controller status; Logic B controller status | Multiple sources of data are available | Directly measurable | Diverse means of determination exist.

Table 8-22
CWS Goal Interactions

Goals  Direct goal interactions  Indirect goal interactions
Repair Subsystem  None found  Repair Subsystem
 Service Subsystem
Service Subsystem  None found  Repair Subsystem
 Service Subsystem
Heat Removed  None found  Repair Subsystem
 Service Subsystem
Condensate Out  None found  None found
Condenser Conditions  None found  Hotwell Level
 LP Condenser Conditions
Condensate Out  None found  None found
Cooling Basin  None found  None found
CWS Flow  None found  Pump A1 Offline
Cooling Tower A  None found  None found
Conditions
Hotwell Level  None found  HP Condenser Conditions
 LP condenser Conditions
Both CWS Divisions  None found  None found
Controlled
Division A 2 Pumps  None found  Pump A1 Offline
Online
Division B 2 Pumps  None found  None found
Online
Pump A1 Online  Pump A1 Offline  None found
Pump A1 Offline  Pump A1 Online  CWS Flow
Communicate Between  None found  None found
Controllers
Communicate with  None found  None found
Division B Pumps
Communicate with  None found  None found
Division A Pumps
Signal MOV Open  Signal MOV Closed  None found
MOV Open  MOV Closed  None found
Pump A1 Started  Pump A1 Stopped  None found

 8-64 
Table 8-23
CWS Process Interactions

Processes | Sub-Goal Interactions | Resource Interactions | Side Effect Interactions
BWR Steam Production | Normal Condenser Operations | Not analyzed in this example. | Not analyzed in this example.
Normal Condenser Operations | BWR Steam Production | Not analyzed in this example. | Not analyzed in this example.
Shutdown CWS Component | Circulating Water Operation | None found | HP Condenser Operation; LP condenser Operation; Hotwell Management
Condenser Operation | LP condenser Operation | Circulating Water Operation | Not analyzed in this example.
Hotwell Management | HP Condenser Operation; LP condenser Operation | Not analyzed in this example. | Circulating Water Operation
Circulating Water Operation | Shutdown CWS Component | Condenser Operation | Hotwell Management
CWS 3A + 1B | CWS 1A + 3B; CWS 2A + 2B; Normal Stop Pump A1 | None found | None found
CWS 1A + 3B | CWS 3A + 1B; CWS 2A + 2B; Normal Stop Pump B1 | None found | None found
CWS 2A + 2B | CWS 1A + 3B; CWS 3A + 1B | None found | None found
Pumps A1 and A2 Online | Other combinations of Div A pumps; similarly with Division B pump combination processes | None found | None found
Use Controller A as Master | Use Controller B as Master | None found | None found
Use Controller B as Master | Use Controller A as Master | None found | None found
Normal Start Pump A1 | Normal Stop Pump A1 | None found | None found
Normal Stop Pump A1 | Normal Start Pump A1; CWS 3A + 1B | None found | None found
Both on channel 1 | None found | Other traffic on Channel 1 may limit channel 1 as a resource | None found
Both on channel 2 | None found | Other traffic on Channel 2 may limit channel 2 as a resource | None found
I/O Cabinet A Comm channel 1 | Lost comms A1 Shelf State | Other traffic on Channel 1 may limit channel 1 as a resource | None found
I/O Cabinet A Comm channel 2 | Lost comms A1 Shelf State | Other traffic on Channel 2 may limit channel 2 as a resource | None found
Lost communications DO A1 Shelf State | I/O Cabinet A Comm channel 1; I/O Cabinet A Comm channel 2 | None found | CWS 1A + 3B; CWS 2A + 2B
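
Building on the purpose-graph sketch given after Figure 8-17, the interaction screening recorded in Tables 8-22 and 8-23 can also be captured in a simple structure so that no goal or process is silently skipped. The sketch below is illustrative only; the interaction categories mirror the table columns, and the helper function is a hypothetical convenience, not part of the documented PGA procedure.

```python
# Hypothetical record of PGA interaction screening results.
# Keys mirror the columns of Tables 8-22 and 8-23; entries are abbreviated.

process_interactions = {
    "Both on channel 1": {
        "sub_goal": [],
        "resource": ["Other traffic on Channel 1 may limit channel 1 as a resource"],
        "side_effect": [],
    },
    "Lost communications DO A1 Shelf State": {
        "sub_goal": ["I/O Cabinet A Comm channel 1", "I/O Cabinet A Comm channel 2"],
        "resource": [],
        "side_effect": ["CWS 1A + 3B", "CWS 2A + 2B"],
    },
}

def unreviewed(processes, interactions):
    """Flag processes that have no recorded screening entry at all."""
    return [p for p in processes if p not in interactions]

# A missing entry indicates a screening gap for the assessment team to close.
print(unreviewed(["Both on channel 1", "Both on channel 2"], process_interactions))
```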

8.5 PGA Strengths

High Coverage

The PGA method is designed to provide very high coverage of potential hazards.
This coverage is valuable because the results can be used to eliminate, reduce, or
mitigate hazards during system requirements generation and design activities.

 8-66 
Systems View

The PGA method is essentially a top-down method that takes a system view.
The results are useful as input to the requirements definition phase of a digital
I&C project because they support a safety-driven or hazard-avoidance design from
the beginning.

Unexpected Behaviors

The PGA method can identify unexpected and strange system behaviors that
may not otherwise be thought credible or possible. For example, it can identify
adverse interactions between components and systems that would on the surface
appear to have no potential interactions at all.

Simplified Results

When the data is reduced to the final list of potential hazards to be addressed,
the results can typically be readily used to inform requirements, identify and
apply defensive measures, and demonstrate system acceptability.

The final results can also be used as an input to another method to help avoid
searches for faults and failures that don’t necessarily lead to hazards.

8.6 PGA Limitations

Single Failures

The PGA method does not readily identify the effects of postulated single
failures. Therefore, PGA results are not well suited as an input to a single failure
analysis or to the identification of single-point vulnerabilities.

Trained Facilitator

It helps to have a facilitator trained in the use of PGA, because the method takes a
broader view of the system(s) that can be affected by a digital I&C activity and
the hazards that it may cause. Most users of this guidance are likely to be trained
and competent in specific engineering disciplines or tasks, and may find it
difficult to navigate the PGA method the first time or two without a facilitator.

This method requires the ability to evaluate various abstractions presented by the
graphs and tables for potential interactions. It is possible to overlook or dismiss
possibly hazardous State, Goal or Process interactions without a trained
facilitator on the assessment team.

 8-67 
Section 9: Conclusions &
Recommendations
The hazard analysis guidance in this document covers a wide range of methods
and practices, some mature and well-proven, others emergent and still evolving
in terms of their immediate application in the nuclear power industry.

Proven methods, such as FMEA and use of Fault Trees, are well established in
the commercial nuclear power industry, and have their place. This guidance
provides step-by-step procedures and worked examples for these proven methods
so that users can immediately apply them on digital I&C projects and achieve
effective results.

Methods that show promise and emergence in the nuclear power industry,
including HAZOP, STPA and PGA, are also described in this guideline, with
step-by-step procedures and worked examples that can be compared to similar
examples that were developed for the FMEA and Top Down methods.
However, for these emergent methods, the conclusions and recommendations
reported here should be considered qualitative and preliminary. It is clear that
these emergent methods have the potential to immediately and significantly
improve on current industry practices. However, they are new to most utility
engineers, who will likely need training and the help of facilitators to gain
proficiency in them. Therefore, future work on technical transfer mechanisms will
be important in deploying these methods and particularly in getting to the point
where utility engineers can confidently and efficiently apply them to real plant
problems. Technical transfer mechanisms may include formal training,
workshops, and other approaches for bringing these emergent methods to the
same levels of maturity and competence as the more proven methods (FMEAs
and use of Fault Trees in the Top Down Method).

9.1 Conclusions

Table 9-1 compares strengths and limitations of the hazard analysis methods,
based on results of the investigations and examples used in the current study. The
following observations can be made:
1. The FMEA methods are well suited for postulating single failures and their
effects on other systems, sub-systems or components, and they can make use
of the proposed failure taxonomy provided in Appendix B. However, these
methods are not well suited for use in identifying misbehaviors or hazards
beyond single failures, such as multiple hardware failures or unintended
interactions of hardware and software components.
2. The Top Down method can evaluate the effects of single and multiple
failures, and takes an integrated view of the plant design. It focuses on
functional faults and failures, as opposed to unintended behaviors that do not
involve component failures. It can, however, lead to complex fault tree models.
3. The HAZOP, STPA and PGA methods offer the following strengths:
- Cover hazards beyond faults and failures
- Integrated view of plant design
- Identify unexpected behaviors and interactions
4. However, HAZOP, STPA and PGA share the following limitations:
- Need a trained facilitator
5. STPA and PGA have the following additional limitations:
- Do not pinpoint single failures for easy identification
- Can produce tedious intermediate results (large tables)
6. At the conceptual design stage, the Design FMEA method can identify
“application notes,” such as insights derived from the FMEA regarding
failure mechanisms and potential mitigation methods that can be used to
influence the detailed design in order to produce a more robust solution.
- In the simple system example (HPCI/RCIC turbine control system)
described in Sections 4 and 1, the overlap between FMEA and Top
Down analysis results was nearly complete, suggesting that one method
or the other is sufficient to demonstrate a robust solution. Based on this
example, performing both methods on such simple systems or
components may be a wasted effort, because one method appears unlikely
to reveal vulnerabilities that are not also revealed by the other.
- For the more complex example (CWS DCS described in Sections 4 and
1), the FMEA approach was found to be focused on single failures and
did not identify vulnerabilities inherent in the system architecture.
However, the FMEA approach was useful in identifying specific
vulnerabilities within the components identified by the Top Down
analysis as being the most critical in terms of the system success criteria.
7. For more complex systems, it appears that a top down failure analysis can be
useful in influencing the system architectural design to avoid vulnerabilities
that can lead to undue safety or generation risks.
8. When designing or reviewing the design of I&C systems, it is important to
develop an understanding of the top level success criteria for the process
systems or components being actuated or controlled. Without a good
understanding of these success criteria, the potential exists to weaken or
eliminate the effectiveness of apparent redundancies that may be designed
into the I&C system. The cut-sets produced by a top down analysis can
reveal vulnerabilities inherent in the architecture of the digital I&C system
itself, and when combined with a clear understanding of the analytical
success criteria, can be used to help produce a more robust design.
 9-2 
9. The taxonomy described in Appendix B was a useful aid in preparing the
FMEA worksheets for the two example problems. The taxonomy can be
applied to hazard analysis activities, and can be used to assess the availability
of defensive measures within systems, sub-systems, components or devices of
interest.
10. Some of the methods can be used effectively in a blended approach. For
example:
- The Functional FMEA (FFMEA) method can be used to identify
hazardous functions at the plant system or process level that can be
further scrutinized using the Design FMEA (DFMEA) method. If there
is no need to systematically identify and evaluate all digital I&C system
failure modes, then the FFMEA results can be used to limit the scope of
the DFMEA analysis.
- The Top Down method includes a step for transitioning to the Design
FMEA method, thus limiting the scope of the Design FMEA to the
digital I&C system failure modes that can adversely affect actuated
components, which in turn adversely affect plant systems.

 9-3 
Table 9-1
Comparative Strengths & Limitations of Each Method

Method | Functional FMEA | Design FMEA | Top Down | HAZOP | STPA | PGA
Section # 4 5 6 7 8
Focus on Single Failures X X
Simplicity X X
Leverage a Failure Taxonomy X X
Familiarity for I&C Engineers X X X X
Familiarity for Equipment Designers X X X
Strengths Covers Hazards Beyond Failures X X X
Not Limited to Single Failures X X X X
Makes Use of Existing Fault Trees X
Integrated View of Plant Design X X X X X
Identifies Unexpected Behaviors X X X
Simplified Final Results X X

Need a Trained Facilitator X X X


Limited to Faults or Failures X X X
Does Not Pinpoint Single Failures X X
Difficult to Evaluate CCF X X
Limitations
Inadequate for Software Hazards X X
Dependent on Analysis Boundary X X X X X X
Complexity of Models X
Tedious Intermediate Results X X

 9-4 
9.2 Recommendations
1. Further work is needed to develop and apply tools for dealing with the
large sets of data that can be produced by the STPA method.
2. Technical transfer mechanisms such as industry training, a computer-based
training (CBT) module, and industry workshops should be developed to
enable use of this guidance by owner/operator engineers, system integrators
and equipment vendors, especially for the advanced methods (STPA and
PGA).
3. Additional demonstrations of the various hazard analysis methods, including
combinations of the methods on real plant systems and proposed
modifications, over a range of scales and complexity are needed to improve
the current knowledge base and help refine the deployment of the methods.

 9-5 
Section 10: References
1. IEEE Std. 352-1987, “IEEE Guide for General Principles of Reliability
Analysis of Nuclear Power Generating Station Safety Systems”
2. IEEE Std. 610.12-1990, “IEEE Standard Glossary of Software Engineering
Terminology”
3. IEEE Std. 100-2000, “The Authoritative Dictionary of IEEE Standards
Terms”
4. EPRI TR-102348, Revision 1 (NEI 01-01), “Guideline on Licensing Digital
Upgrades”
5. NEI 96-07, Rev 1, “Guidelines for 10 CFR 50.59 Implementation”
6. NUREG 0800, “Standard Review Plan”
7. EPRI 1022684, “Elements of Pre-Operational and Operational
Configuration Management for a New Nuclear Facility”
8. IEEE Std. 603-1998, “IEEE Standard Criteria for Safety Systems for
Nuclear Power Generating Stations”
9. IEEE Std. 7-4.3.2-2003, “IEEE Standard Criteria for Digital Computers in
Safety Systems of Nuclear Power Generating Stations”
10. EPRI 1016722, “Digital Instrumentation & Control Operating Experience
Lessons Learned”
11. EPRI 1022247, “Digital Instrumentation & Control Operating Experience
Lessons Learned Volume II – Case Studies 6-10”
12. “An Introduction to Hazard and Operability Studies – The Guide Word
Approach,” by R. Ellis Knowlton, Seventh Printing
13. EPRI 1023010, “Combinatorial Testing for Digital I&C Systems,” 2011
14. EPRI TR-104595, “Abnormal Conditions and Events Analysis for
Instrumentation and Control Systems, Vol. 1: Methodology for Nuclear
Power Plant Digital Upgrades; Vol. 2: Survey and Evaluation of Industry
Practices” (1995)
15. EPRI 1022985, “Failure Analysis of Digital Instrumentation & Control
Equipment and Systems – Demonstration of Concept”
16. EPRI 1016731, “Operating Experience Insights on Common-Cause Failures
in Digital Instrumentation and Control Systems”

 10-1 
17. EPRI 1011710, “Handbook for Evaluating Critical Digital Equipment and
Systems”
18. EPRI 1022991, “Guideline on Configuration Management for Digital
Instrumentation & Control Equipment and Systems”
19. “Engineering a Safer World – Systems Thinking Applied to Safety,” by Dr.
Nancy G. Leveson; MIT Press, Cambridge MA; ISBN 978-0-262-01662-9
20. EPRI 1021077, “Estimating Failure Rates in Highly Reliable Digital
Systems”
21. EPRI 1019182, “Protecting Against Digital Common-Cause Failure:
Combining Defensive Measures and Diversity Attributes”
22. NUREG/IA-254, “Suitability of Fault Modes and Effects Analysis for
Regulatory Assurance of Complex Logic in Digital Instrumentation and
Control Systems” June, 2011
23. EPRI NP-5652, “Guideline for the Utilization of Commercial Grade Items
in Nuclear Safety Related Applications (NCIG-07)”
24. EPRI TR-102260, “Supplemental Guidance for the Application of EPRI
Report NP-5652 on the Utilization of Commercial Grade Items”
25. EPRI TR-106439, "Guideline on Evaluation and Acceptance of
Commercial Grade Digital Equipment for Nuclear Safety Applications”
26. “Potential Failure Modes and Effects Analysis (FMEA) Reference Manual,
Fourth Edition,” June 2008, by the Automotive Industry Action Group;
ISBN 978-1-60534-136-1.
27. MIL-STD-1629A, “Military Standard: Procedures for Performing A Failure
Mode, Effects, and Criticality Analysis (24 Nov 1980)”
28. NEI 04-10, Rev. 1 “Risk Informed Technical Specification Initiative 5b,
Risk Informed Method for Control of Surveillance Frequencies”
29. RG 1.174, Rev. 1 “An Approach for Using Probabilistic Risk Assessment in
Risk-Informed Decisions on Plant-Specific Changes to the Licensing Basis”
(November 2002)
30. “Instrument Engineer’s Handbook” (three volume set), 4th Edition, by Bela
G. Liptak; CRC Press; ISBN 9781466571716.
31. http://www.lihoutech.com; website for Lihou Technical & Software
Services; 150 Shenley Fields Rd., Selly Oak, Birmingham, B29 5BT United
Kingdom.
32. EPRI Report 1025282, “Guideline on Testing Digital Instrumentation and
Control Systems”
33. IEC 61882-2001, “Hazard and Operability Studies (HAZOP Studies) –
Application Guide”
34. “Launch Control Safety Study,” Watson, H. A., Bell Labs, 1961
35. NUREG-0492, “Fault Tree Handbook,” USNRC, 1981

 10-2 
36. EPRI 1013490, “Support System Initiating Events: Identification and
Quantification Guideline,” Electric Power Research Institute, 2006.
37. AP-913 Rev.1, “Equipment Reliability Process Description,” Institute of
Nuclear Power Operations, 2001.
38. EPRI 1025278, “Modeling Digital I&C in Nuclear Power Plant
Probabilistic Risk Assessments,” Electric Power Research Institute, 2012
39. Regulatory Guide 1.177 Rev. 1, “An Approach for Plant Specific Risk
Informed Decisionmaking: Technical Specifications,” USNRC, 2011
40. IEEE Std. 1228-1994, “IEEE Standard for Software Safety Plans”
41. EPRI 1016722, “Digital Instrumentation & Control Operating Experience
Lessons Learned – Case Studies,” 2008
42. EPRI 1022247, “Digital Instrumentation & Control Operating Experience
Lessons Learned – Volume II,” 2010
43. EPRI TR-016780, “Advanced Light Water Reactor Requirements
Document,” Volume II, Rev 8, 1999

 10-3 
Appendix A: Overview of Available
Guidance
Purpose

An assessment of the industry standards and guidance related to failure analysis
methods was performed to identify and summarize currently available failure
analysis guidance, with an emphasis on digital systems. This assessment will help
the EPRI Digital Failure Analysis Guideline project leverage the strengths
available in the current guidance and help direct attention to areas that need
additional guidance. This assessment has been completed on a number of
currently available guidance documents, with an emphasis on EPRI reports, as
listed in Table A-1.

Assessment Summary

This assessment determined that the currently available guidance listed in Table
A-1 is provided at various levels that describe the basis for performing failure
analyses, and collectively provide an outline of the basic methods and formats for
producing the expected deliverables. These guidance documents point to the fact
that failure analysis can be performed from a top down approach (fault tree
analysis) as well as a bottom up approach (failure mode and effect analysis).
Other methods such as software hazard analysis, software integrated critical path,
system modeling, walkthroughs (code reviews) and software sneak circuit analysis
are discussed in the documents. The different methods have their advantages but
can result in exhaustive efforts to complete the failure analysis on a complex
system upgrade.

In general, the available guidance can be enhanced to provide the following:


1. Methods for determining the most efficient and effective approach for
performing failure analyses
2. Detailed steps or tools for performing any single failure analysis method
3. Detailed examples of successful failure analysis activities, deliverables, and
lessons learned

Detailed recommendations to address the guidance enhancements to be factored
into a digital failure analysis are outlined in the following section.

Recommendations

As part of the assessment, recommendations were included in the detailed review
of each guide or standard listed in Table A-1. The detailed review is documented in
EPRI 1022985 (Reference 15). The following list highlights the
recommendations contained in Reference 15, and the standard/document which
led to the recommendation:
1. The EPRI Digital Failure Analysis guidance should be designed to include
detailed guidance, procedure steps and examples within the digital failure
analysis document. Thus, the failure analysis process will not completely be
dependent on the expertise of the failure analysts and can result in analyses
which achieve consistent results.
2. The EPRI Digital Failure Analysis guidance should include general failure
modes for digital systems, components, and software. The EPRI Digital
Failure Analysis document should incorporate information to provide
sufficient guidance that the analyst can use and the NRC can evaluate for
endorsement.
3. The questions and checklists of items to consider during an ACES analysis
that are included in EPRI TR-104595 Volume 1 should be expanded to
provide experience from previous digital I&C upgrades within the nuclear
industry and to ensure that the level of detail in the analysis is driven to the
appropriate level. This will allow for additional detailed guidance so the
reliance on the expertise of the analysts will be reduced.
4. A challenge with any failure analysis is determining the point to terminate
the analysis. IEEE 352-1987 and NUREG/CR-6962 point to the use of
good engineering judgment for making the terminate determination. The
EPRI Digital Failure Analysis guidance should provide guidance to the
failure analyst on factors to consider in making the termination decision.
5. Appendix H of EPRI TR-104595 Volume 1 provides guidance on the
software fault tree analysis technique. The EPRI Digital Failure Analysis
guidance will include an example software fault tree with the associated
software code to benefit the ACES analysis performer. The example should
include guidance on how relationships are built in the fault tree and what is
considered a critical code statement.
6. As part of the EPRI Digital Failure Analysis guidance, the methods to
identify hazards throughout the lifecycle of the project should be included in
the guidance to expand upon the guidance already contained in IEEE 7-
4.3.2-2003.
7. IEEE 7-4.3.2-2003 identified the need to perform hazard analysis using
more than a single technique. The EPRI Digital Failure Analysis guidance
should include details about performing more than one hazard analysis
methodology for an example system/component.
8. For all of the failure analysis techniques that are addressed in the EPRI
Digital Failure Analysis guidance document, the following areas should be
factored into the guidance.

a. Trials of the failure analysis guidance on real plant upgrades
b. Assessment of the failure analysis guidance against actual initiated events
c. Development of failure analysis guidance for reusable and COTS
software
9. System dependencies on communications are an area that should be included
in the EPRI Digital Failure Analysis guidance document.
10. For FMEA guidance, the guidance from the NASA failure analysis
procedure to identify mitigation corrective actions, owners, and resolutions as
part of the failure analysis efforts should be included in the EPRI Digital
Failure Analysis guidance.
11. NUREG-0492 points out that analysis of complex systems may need to be
performed by a team approach. The NASA failure analysis procedure also
provides guidance on using a team approach for the failure analysis. The
EPRI Digital Failure Analysis guidance should provide instructions for use
of an analysis team.
12. Several of the reports and papers that were reviewed for this technical report
outlined limitations with software failure analysis and reliability modeling.
This report should consider additional reviews of the benefits and limitations
for software analysis to determine the need to continue efforts to perform
software failure analysis or to develop methods to bound software failures.

Table A-1
Guidance Documents Assessed

Number | Title | Date
EPRI TR-107980 | I&C Upgrades for Nuclear Plants Desk Reference | Dec 1997
EPRI TR-102348 Rev 1 | Guideline on Licensing Digital Upgrades | Mar 2002
EPRI TR-104595 Volumes 1 and 2 | Abnormal Conditions and Events Analysis for Instrumentation and Control Systems | Dec 1995
EPRI TR-108831 | Requirements Engineering for Digital Upgrades – Specification, Analysis, and Tracking | Dec 1997
EPRI Report 1002835 | Guideline for Performing Defense-In-Depth and Diversity Assessments for Digital Upgrades – Applying Risk-Informed and Deterministic Methods | Dec 2004
Draft EPRI Report (Accession No. ML072350195) | Modeling Digital I&C in Nuclear Power Plant Probabilistic Risk Assessment | Jul 2007
IEEE 7-4.3.2-2003 | IEEE Standard Criteria for Digital Computers in Safety Systems of Nuclear Power Generating Stations | Dec 2003
IEEE 352-1987 | IEEE Guide for General Principles of Reliability Analysis of Nuclear Power Generating Station Safety Systems | Nov 1985
IEEE 379-2000 | IEEE Standard Application of the Single-Failure Criterion to Nuclear Power Generating Station Safety Systems | Sep 2000
IEEE 603-1998 | IEEE Standard Criteria for Safety Systems for Nuclear Power Generating Stations | Jul 1998
NUREG 0492 | Fault Tree Handbook | Jan 1981
NUREG 0800 | Standard Review Plan – Chapter 7 – Instrumentation and Controls – Overview of Review Process | May 2010
NUREG 0800 BTP 7-14 | Standard Review Plan – Branch Technical Position 7-14 – Guidance on Software Reviews for Digital Computer-Based Instrumentation and Control Systems | Mar 2007
NUREG 0800 BTP 7-19 | Standard Review Plan – Branch Technical Position 7-19 – Guidance for Evaluation of Defense-in-Depth and Diversity in Digital Computer-Based Instrumentation and Control Systems | Mar 2007
NUREG/CR-6303 | Method for Performing Diversity and Defense-in-Depth Analyses of Reactor Protection Systems | Dec 1994
NUREG/CR-6942 | Dynamic Reliability Modeling of Digital Instrumentation and Control Systems for Nuclear Reactor Probabilistic Risk Assessments | May 2006
NUREG/CR-6962 | Traditional Probabilistic Risk Assessment Methods for Digital Systems | Oct 2008
Draft NUREG/IA | Identifying and Analyzing Fault Modes Attributable to Complex Logic in Digital I&C Systems | Dec 2010
IAEA TECDOC-1016 | Modernization of Instrumentation and Control in Nuclear Power Plants | May 1998
IAEA TECDOC-1389 | Managing Modernization of Nuclear Power Plant Instrumentation and Control Systems | Feb 2004
IAEA Report No 384 | Verification and Validation of Software Related to Nuclear Power Plant Instrumentation and Control | May 1999
IAEA Report No NP-T-1.5 | Protecting Against Common Cause Failures in Digital I&C Systems of Nuclear Power Plants | Nov 2009
MIL-STD-1629A | Procedures for Performing a Failure Mode, Effects and Criticality Analysis | Nov 1980
MIL-STD-882B | System Safety Program Requirements | Mar 1984
FAA System Safety Handbook | Chapter 9 – Analysis Techniques | Dec 2000
NASA Flight Assurance Procedure (FAP) – 322-209 (DRAFT) | Standard for Performing a Failure Mode and Effects Analysis (FMEA) and Establishing a Critical Items List (CIL) | None
Technical Paper | Experience with the application of HAZOP to computer-based systems | 1995
Technical Paper | Architecture-based approach to reliability assessment of software systems | Feb 2001

Appendix B: Taxonomy of Failure Modes,
Failure Mechanisms, Faults,
and Defensive Measures
Purpose

The purpose of this digital failure analysis taxonomy is to provide the following
information for use in digital failure analysis activities:
 Descriptions of typical digital devices and components
 Describe a hierarchy of typical digital devices, components, and systems, and
how failure mechanisms, failure modes and effects can propagate up through
the hierarchy
 List typical failure mechanisms that can affect typical digital devices and
components
 List the typical device or component failure modes that result from typical
failure mechanisms
 List the possible defensive measures that could be implemented (or validated)
for preventing or mitigating typical failure mechanisms associated with a
device or component
 Describe how to use this Taxonomy in digital failure analysis activities

Typical Digital Devices and Components

Table B-1 lists the devices and components described with this taxonomy. Note
that only a handful of devices are described for this guideline, in order to
demonstrate the taxonomy concept.

Table B-1
Taxonomy Devices and Components

Device or Component Taxonomy Sheets


CPU Device B-1a and B-1b
RAM Device B-2a and B-2b
ROM Device B-3a and B-3b
A/D Converter Device B-4a and B-4b
D/A Converter Device B-5a and B-5b
Type 1 Controller B-6a and B-6b
Type 2 Controller B-7a and B-7b
Communication Module B-8a and B-8b
Level 1 (Binaries) Software Interactions & Faults B-9a and B-9b
Level 2 (Tools) Software Interactions & Faults B-10a and B-10b
Level 3 (Application & OS) Software Interactions & Faults B-11a and B-11b
Level 4 (System Architecture) Software Interactions & Faults B-12a and B-12b

Hierarchy of Failure Mechanisms, Modes, and Effects

[Figure B-1 is a block diagram: plant functions at the top are supported by plant systems and plant components, which are in turn supported by digital systems, digital components, and devices. Failure mechanisms at each level produce failure modes, which propagate upward as failure effects on the next level.]

Figure B-1
A Hierarchy of Failure Mechanisms, Modes and Effects

Figure B-1 illustrates a basic hierarchy of failure mechanisms, modes, and effects
that can be applied to digital devices, components, sub-systems and systems. The
analyst responsible for evaluating the potential misbehaviors of the device,
component, sub-system, or system of interest can perform the analysis at any
level in this hierarchy.

For example, an analyst performing a top-down analysis, such as the analyses
described in Sections 1 or 1, may start with the top plant-level functions and
break down the analysis through this hierarchy until the top-down analysis is
sufficiently complete. Likewise, a bottom-up analysis, such as the ones described
in Section 4, can be performed on the devices or components of interest.

It is important that failure mechanisms, modes and effects be described and
understood within the context of the devices or components of interest. For
example, a digital device such as a CPU may have certain failure modes, but
within the context of a single loop controller, the CPU contributes failure
mechanisms that cause failure modes of the controller, which in turn cause
sub-system or system level effects. This distinction is important to an analyst,
such as a project engineer at a plant, who is interested in evaluating the failure
modes of a controller within a given sub-system or system.

On the other hand, the same CPU device may also be susceptible to lower-level
failure mechanisms, such as manufacturing defects or age related degradation
that lead to its own failure modes. This distinction might be important for an
analyst, such as a product engineer at a DCS vendor, who is interested in
evaluating these failure mechanisms to determine the controller failure modes,
and measures that can be used to prevent or mitigate such failure modes.
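
As an illustration only, the hierarchy of Figure B-1 can be captured in a small data structure so that device-level failure mechanisms are recorded separately from the component-level failure modes they produce and the system-level effects that follow. The Python sketch below uses hypothetical class and field names, and a single controller example loosely drawn from the worksheet excerpt in Figure B-2; it is not part of the taxonomy itself.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative only: one level of the Figure B-1 hierarchy, with device
# failure mechanisms rolled up into component failure modes and effects.

@dataclass
class FailureMode:
    name: str                 # component-level failure mode
    mechanisms: List[str]     # device-level causes (from a taxonomy sheet)
    effect: str               # effect at the next level up

@dataclass
class Component:
    name: str
    failure_modes: List[FailureMode] = field(default_factory=list)

controller = Component("Single loop controller", [
    FailureMode(
        name="Output fails offscale low",
        mechanisms=["CPU data corruption", "D/A device error",
                    "Lost or corrupted RAM data"],
        effect="Controlled process drifts to minimum demand",  # hypothetical effect
    ),
])

# Roll up: trace device-level mechanisms to the effects they can produce.
for mode in controller.failure_modes:
    for mech in mode.mechanisms:
        print(f"{mech} -> {mode.name} -> {mode.effect}")
```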

How to Read the Taxonomy Sheets

Each device or component taxonomy sheet is split into two sheets:


1. A sheet that briefly illustrates the device or component and its typical failure
modes, associated failure mechanisms, and possible defensive measures that
could be employed to help prevent or mitigate the failure mechanisms.
2. A sheet that describes the basic characteristics of the device or component

The failure mode table in the first sheet is color coded to represent basic types of
defensive measures as shown in Table B-2:

Table B-2
Basic Types of Defensive Measures
Color Code | Key | Hardware Defensive Measure | Software Defensive Measure
Blue | Measure applied during operation | Run-time diagnostics implemented in software/firmware | External diagnostic comparison by user or diverse software means
Orange | Measure applied during specification and development | Pre-installation, start-up or boot tests (e.g., POST) implemented in hardware/software/firmware | Design, implementation and compilation standards and checks
Green | Measure applied by Qualification Testing | Qualification tests in the target environment | Qualification testing on target platform environment
Black | Measure applied by Administrative Controls | |
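
As a small illustration (not part of the report's color-coding scheme itself), the four categories in Table B-2 could be represented as an enumeration so that each defensive measure recorded during an analysis carries the lifecycle phase in which it is applied. The names below are hypothetical, and the example assignment is illustrative only, since the original color coding does not survive in a text rendering of the taxonomy sheets.

```python
from enum import Enum

class DefensiveMeasureCategory(Enum):
    """Lifecycle phase in which a defensive measure is applied (Table B-2)."""
    OPERATION = "Measure applied during operation"                              # Blue
    SPECIFICATION_AND_DEVELOPMENT = "Measure applied during specification and development"  # Orange
    QUALIFICATION_TESTING = "Measure applied by Qualification Testing"          # Green
    ADMINISTRATIVE_CONTROLS = "Measure applied by Administrative Controls"      # Black

# Example tagging of a measure from Taxonomy Sheet B-1a (assignment assumed).
measure = ("Temperature monitors on or near the processor",
           DefensiveMeasureCategory.OPERATION)
print(measure[0], "-", measure[1].value)
```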

How to Use the Taxonomy


1. When performing a failure analysis activity, the first step is to draw a
functional diagram of the system, such as the ones provided in Sections 0,
5.4, 6.4, or 7.4. An analyst interested in evaluating the failure mechanisms
and failure modes of a single component or device would draw a diagram that
illustrates similar information at the level of interest.
2. Using the diagram(s) from step 1, make a list of the components or devices of
interest, and their functions. Table 4-4 provides an example of this list.
3. Using the list from step 2, identify the taxonomy sheets that correspond with
the devices or components of interest. The controller described in the
HPCI/RCIC Turbine Control system Design FMEA example (Section 4.5)
is similar to the Type 1 controller described in Taxonomy Sheets B-6a and
B-6b. The controller described in the CWS DCS Design FMEA example
(Section 4.5) is similar to the Type 2 controller in Taxonomy Sheets B-7a
and B-7b. Therefore, these Taxonomy Sheets were useful in performing the
failure analyses for each example.
4. When performing a failure analysis activity (such as a Design FMEA using
the procedure described in Section 4.4), use the Taxonomy Sheets identified
in step 3, to identify Failure Modes and Failure Mechanisms. For example,
Taxonomy Sheet B-6a was used for identifying potential failure modes and
failure mechanisms of the governor controller listed in the HPCI/RCIC
Design FMEA worksheets (Section 4.5). Figure B-2 illustrates the link
between a Design FMEA worksheet and a Taxonomy Sheet.
5. Evaluate available defensive measures described at the device and/or
component level to aid the analysis and inform the design for a more robust
solution. Figure B-3 provides an example of the linkage between taxonomy
sheets and how they can inform this step.
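
A minimal sketch, assuming hypothetical data structures, of how steps 2 through 4 above might be supported in practice: a small dictionary maps each component of interest to its taxonomy sheets, and a lookup returns candidate failure modes to seed an FMEA worksheet. The sheet mapping reflects Table B-1; the failure-mode list is an abbreviated example from Sheet B-1a, and nothing here is prescribed by this guideline.

```python
# Hypothetical helper for steps 2-4 of "How to Use the Taxonomy".
# Sheet mappings follow Table B-1; failure-mode lists are abbreviated examples.

taxonomy_sheets = {
    "Type 1 Controller": ["B-6a", "B-6b"],
    "Type 2 Controller": ["B-7a", "B-7b"],
    "CPU Device": ["B-1a", "B-1b"],
}

failure_modes_by_sheet = {
    "B-1a": ["CPU Halt", "CPU Logic Error", "CPU Data Corruption",
             "CPU Crash", "Permanent CPU Damage"],
}

def candidate_failure_modes(component: str) -> list:
    """Collect failure modes from every taxonomy sheet mapped to a component."""
    modes = []
    for sheet in taxonomy_sheets.get(component, []):
        modes.extend(failure_modes_by_sheet.get(sheet, []))
    return modes

# Seed an FMEA worksheet row for the device of interest.
print(candidate_failure_modes("CPU Device"))
```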

[Figure B-2 reproduces an excerpt from the Table 5-6 Design FMEA worksheet for the HPCI/RCIC governor controller (conceptual design phase). The worksheet columns are Component Identification, Function(s), Failure Modes, Failure Mechanisms, Effect on System, Method of Detection, and Remarks; the excerpt covers failure modes such as Output Fails Offscale High, Output Fails Offscale Low, Output High Rate of Change, Controller Lockup, Failure to Boot or Reset, and Dead Controller, and illustrates how entries from a taxonomy sheet feed the worksheet columns.]

Figure B-2
Linking a Taxonomy Sheet to an FMEA Worksheet

Figure B-3
Linkage between Taxonomy Sheets

Sheet B-1a: Central Processor Device Failure Modes

The number of input and output connectors and bits can vary
greatly by the processor design. Also, some low power
processors don’t have the Heat Sink contact in the center
because heat generation is not as much of a problem.
Commonly the processor clock speed runs at a multiple of the
input clock signal. Data I/O is commonly handled using
various protocols, implemented by on-chip peripherals.
Common bit widths* of processors: 16, 32, and 64-bit

Failure
Failure Mechanism Defensive Measures
Mode
CPU Halt 1. Power Supply Off Do not turn off power Supply.
2. Power Supply Dip Do not let power dip.
CPU Logic 1. Power Supply Dip Do not let power dip.
Error 2. Bit Errors (radiation, Quality testing.
EMI) Ensure proper shielding around
3. Design Flaws controller to protect against radiation
4. Manufacturing Defect and EMI.
5. Failed Connections Integrity tests before and while
(including internal bond- running.**
wire and lead free solder Use of diverse microprocessor
interconnect failure) Architectural diversity/redundancy
6. Overheating Integrity tests before and while running
7. Part Wear Out (e.g., **
due to various age-related Quality testing of processor cooling
degradation mechanisms, systems.
exacerbated by small Temperature monitors on or near the
feature size) processor.
Ensure cooling systems are properly
mounted
CPU Data 1. Bit Errors (radiation, Quality testing.
Corruption EMI) Ensure proper shielding around
2. Design Flaws controller to protect against radiation
3. Manufacturing Defect and EMI.
4. Failed Connections Integrity tests before and while
(including internal bond- running..**
wire and lead free solder Use of diverse microprocessor
interconnect failure) Architectural diversity/redundancy
5. Overheating Integrity tests before and while running.
6. Part Wear Out (e.g., **

due to various age-related Quality testing of processor cooling
degradation mechanisms, systems.
exacerbated by small Temperature monitors on or near the
feature size) processor.
Ensure cooling systems are properly
mounted
Use devices with feature size >= 350nm
Specify devices with ceramic packaging
Specify components using leaded solder
CPU Crash 1. Manufacturing Defect Integrity tests before and while running.
2. Failed Connections **
(including internal bond- Quality testing of processor cooling
wire and lead free solder systems.
interconnect failure) Temperature monitors on or near the
3. Overheating processor.
Ensure cooling systems are properly
mounted
Permanent 1. Overheating Use devices with feature size >= 350nm
CPU Damage 2. Part Wear Out (e.g., Architectural diversity/redundancy
due to various age-related Specify devices with ceramic packaging
degradation mechanisms, Specify components using leaded solder
exacerbated by small
feature size)
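
Several of the defensive measures above refer to integrity tests performed before and while running. The following is a minimal sketch, assuming a hypothetical memory image and a CRC32 checksum scheme, of what such a periodic integrity check might look like; real implementations are platform-specific and typically reside in firmware rather than Python.

```python
import zlib

# Illustrative only: detect corruption of a protected memory region by
# comparing a CRC32 computed at startup against periodic recomputations.

def crc_of(region: bytes) -> int:
    return zlib.crc32(region) & 0xFFFFFFFF

protected_region = bytes(range(256))          # stand-in for ROM/parameter memory
reference_crc = crc_of(protected_region)      # captured at power-up self test

def integrity_ok(region: bytes) -> bool:
    """Return False if the region no longer matches its reference checksum."""
    return crc_of(region) == reference_crc

# Periodic run-time check (a real system would alarm or fail safe on mismatch).
assert integrity_ok(protected_region)
corrupted = bytes([protected_region[0] ^ 0x01]) + protected_region[1:]
print("corruption detected:", not integrity_ok(corrupted))
```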

Sheet B-1b: Central Processor Device Description

Central processors (general-purpose microprocessors) perform most of the


programmable functions of a controller. There are two main categories of
processors which are RISC (Reduced Instruction Set Computing) and CISC
(Complex Instruction Set Computing) processors. The differences in the type of
processors can affect the speeds at which certain actions are performed. Also,
different processors can sometimes be more suited to certain tasks. The general
principle of a processor is that it takes in code and signals from the outside world
and performs actions based on them according to its code. Because of the
limitations of using simple logic gates, it means that many instructions require
several cycles in order to complete. It is common for different instructions to take
a different number of cycles, so pure clock speed does not always perfectly
describe how fast a processor operates. Also, processors tend to generate a lot of
heat, so methods of cooling a processor must be taken into consideration.

There are a few characteristics that relate to all processors, regardless of type:
power requirements, bit-width, and clock speed. The power requirements can
vary significantly; more power generally means more heat generated. Also, the
higher the clock speed, generally the higher the power requirements. Processors
designed for use in embedded systems tend to be manufactured to require less
power or produce less heat, but this is not always the case. Clock speed
determines the number of instruction cycles per second. Generally, faster clock
speeds mean faster processing times; however, this is not a perfect measure,
because many different processor instructions take different numbers of cycles
to complete.

Bit width determines the maximum size integer the processor can handle. This is
an important factor because it determines the accuracy of integer and float
operations, and it also generally determines the maximum amount of RAM a
processor can address. Some processors support higher internal bit widths for
floating point operations. Bit-widths for most processors are in powers of 2,
starting at 8; 16, 32, and 64-bit processors are the most common.
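
As a simple worked illustration of these limits (assuming unsigned integers and byte-addressable memory), the ranges follow directly from the bit width:

```python
# Maximum unsigned integer value for common bit widths: 2**n - 1
for n in (8, 16, 32, 64):
    print(f"{n}-bit: max unsigned value = {2**n - 1}")

# A 32-bit address bus distinguishes 2**32 byte addresses (4 GiB), which is
# why 32-bit processors are commonly limited to about 4 GB of directly
# addressable RAM.
print("32-bit addressable bytes:", 2**32)
```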

Typically, microprocessors have a number of dedicated inputs known as
interrupts; when external circuitry asserts a signal on such an input the processor
is notified and, if programmed correctly, can respond to external events before
resuming the original task. Interrupts may need to be suppressed in real-time or
safety applications.

Most microprocessors include additional components to assist in their
functionality, regardless of whether they are a CISC or RISC architecture. These
components include Cache Memory, Multiple Processor Cores on the same chip,
Microprocessor Cores for specific functions, and On-Chip Peripherals. The
Package Style of a processor is also important, in that it can affect the longevity
and reliability of a chip, as is a process called Die Shrink that is common across
many different types of integrated circuits.

Since the late 20th Century the development of microprocessors has been driven
by the requirements of high-volume consumer electronics, mobile
communications and home/business computing in a free market. These vertical
markets have many requirements that differ from those of industrial/safety/nuclear
applications. Because of the tremendous cost of developing and manufacturing,
microprocessors produced today are optimized for these high-volume markets;
this is typically manifested in increased design-complexity and reduction in the
semiconductor feature-size – simplistically speaking both techniques improve
performance. Unfortunately, increased design complexity presents real challenges
in the safety-justification of devices using microprocessors and, after a certain
point long-since passed in commercial processes, reducing feature-sizes reduces
the usable life of components. When short component lifetimes are combined
with the short production-runs associated with the high-volume consumer
electronics market, the Utility user/OEM may be left with a looming
obsolescence problem.

RISC – (Reduced Instruction Set Computing) – These processors use a smaller


set of processor instructions. This can often lead to the processor being capable of
processing each instruction more quickly, however more complex tasks can
require multiple instructions. These processors are more commonly found smaller
dedicated devices, instead of general computing devices. Common types of RISC
processor families are the ARC, ARM, PowerPC, SPARC, and many more. Each
family has its own Instruction Set which determines how the code for it must be
compiled. Also, it is more sometimes possible to order custom versions of these
types of chips which can be useful when designing specific devices. This type of
processor comes in the largest range of bit-widths of processors. The smallest can
be a few as a n 8-bit and the largest commonly seen are 32 and 64-bit sizes. The
highest currently readily available is 128-bit. RISC processors also tend to be
more power efficient and produce less heat for similar clock frequencies.

CISC – (Complex Instruction Set Computing) – These processors use a larger


set of processor instructions. Each instruction can take more cycles than a
comparable RISC processor, however it is possible for more complex tasks to be
performed within the processor which allows some of the higher level
programming constructs to be expressed directly in machine code. Because of the
inherent complexity of this instruction set, there are fewer families of this type of
processor. The most well known is the x86 instruction set, which include the Intel
Pentium and the AMD Athlon processors.

The most common bit sizes of these processors are 32 and 64-bit processors.
These processors are most commonly found in general purpose computers and
servers. CISC processors commonly incorporate microcode; this is essentially
embedded software (typically immutable) which allows complex instructions to
be decomposed and executed with multiple steps of more simple instructions.
Where present, the correctness of this microcode should be considered as part of
the safety justification process.

Other Components:

Cache Memory

Modern Microprocessors use Cache Memory as a means of increasing


performance of the executed software, in the general case. The management of
Cache Memory requires extra complex hardware to implement the various
complex strategies employed. This extra complexity may present further
challenges in the safety-justification of equipment using such microprocessors.
Some cache memory may be implemented as a separate ‘chip’ within the same
overall package as a microprocessor; this has the potential to increase the
number of failure mechanisms to which a chip is vulnerable.

Microprocessor Cores

Most microprocessors have a core set of functionality. This typically includes


functional blocks such as an Arithmetic and Logic Unit, a Floating-Point
Arithmetic Unit, registers (essentially short term working memory) and
mechanisms to load from and store to memory or Cache Memory.

On-Chip Peripherals

Frequently microprocessors include on-chip peripherals, for example hardware


interfaces such as serial ports (synchronous and asynchronous). Modern complex
processors are also likely to include a Memory Management Unit (MMU) of
considerable complexity.

Multi-core devices

In a drive to further increase performance (as demanded by users of consumer
electronics), manufacturers are implementing multiple microprocessor ‘cores’ onto
single devices, often linked by shared Cache Memory. This practice is becoming
so pervasive that some predict that it may soon become difficult to buy ‘single-
core’ devices. Multi-core devices show increased complexity in the hardware and
in the software required to run them (even if the ‘other’ cores are ignored in
software); this presents further challenges in the safety-justification of equipment
using such microprocessors.

Related devices are Digital Signal Processors (DSP).

DSPs are essentially specialized microprocessors, frequently optimized to perform


certain (often integer-based) algorithms. The use of DSPs is of particular benefit
in areas such as vibration monitoring, signal analysis, and motor control (e.g. in
motor-driven valve-actuators, or variable speed drives). For the purposes of this
guideline, a DSP may be considered as just one type of microprocessor.

Die Shrink

Within the semiconductor industry, there is an ongoing trend to reduce the feature size of the semiconductors produced. This trend is marked by key production milestones known as process 'nodes'. These process nodes are commonly out of phase with microprocessor device lifecycles. As a result, it is common for a 'die shrink' to occur, where partway through a production run the production method is changed and the feature size is reduced. This fundamentally changes the nature of the device being produced and may introduce faults not present in earlier generations, which the manufacturer may or may not fix or disclose. The manufacturer may or may not disclose the die shrink itself; if the user is fortunate, the part number or batch number will change, but this is not always the case. The potential for the fundamental performance, lifetime, and operation of a 'notionally identical' part to change during its production life presents further challenges in the safety-justification of equipment using such microprocessors.

Package Style

As devices have become more complex, the number of interconnections has typically grown. There has been a trend away from 'through-hole' pin devices toward surface-mount leads of ever-increasing density. Presently, 'Ball Grid Arrays' are common in consumer devices; however, these present challenges for the initial and ongoing inspection of equipment using such devices, because the connections cannot be seen and are so closely spaced that any tin-whisker growth can cause significant problems. Future trends toward other mechanical interconnects might present problems in high-reliability equipment.

The materials used in the construction and interconnection of modern mass-market microprocessors (e.g., lead-free solder) can introduce age-related degradation mechanisms that can present problems for their use.

Sheet B-2a: RAM Device Failure Modes

This is the basic configuration of a RAM chip/component. Newer chips tend to run at lower voltages (in order to run cooler, faster, and more efficiently). The use of lower supply voltages is also a consequence of the smaller feature sizes used to permit increased memory density. The number of bits for each input and output can also vary depending on the type of RAM. Common types of RAM include: SRAM, DRAM, SDRAM, DDR (SDRAM), RLDRAM, 1T (or 1T/1C) DRAM, PSRAM, PSDRAM.

Failure Mode: All Data is Lost
Failure Mechanisms: 1. Power Supply Off; 2. Bit Errors (radiation, EMI, age-based degradation, heat); 3. Manufacturing Defect; 4. Failed Connections
Defensive Measures: Provide uninterruptible power. Qualification testing. Ensure proper environmental control (e.g., temperature, humidity). Ensure proper shielding around the controller to protect against radiation and EMI. Determine expected lifespans of components to determine the probability of failure in a given time frame. Memory integrity tests before and while running. Hardware and software data verification, using methods such as parity bits and checksums.

Failure Mode: Some Data is Corrupted or Lost
Failure Mechanisms: 1. Power Supply Dip; 2. Bit Errors (radiation, EMI, age-based degradation, heat); 3. Manufacturing Defect; 4. Failed Connections; 5. Failed Refresh*
Defensive Measures: Ensure high power quality. Qualification testing. Ensure proper environmental control (e.g., temperature, humidity). Ensure proper shielding around the controller to protect against radiation and EMI. Determine expected lifespans of components to determine the probability of failure in a given time frame. Memory integrity tests before and while running. Hardware and software data verification, using methods such as parity bits and checksums.

*Note: some types of RAM require their data to be periodically refreshed, or the data will be lost. A refresh operation is almost always managed by some sort of memory controller; increasingly this is incorporated into the same integrated circuit. In some cases, however, the refresh must be done in software (usually through the CPU as part of a timing interrupt).
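
The defensive measures above cite memory integrity tests and checksum-based data verification. The following is a minimal sketch, in C, of one way such a software integrity check over a RAM region could look; the region size, the names, and the simple additive checksum are illustrative assumptions, not features of any particular platform.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical protected RAM region; the size and names are illustrative. */
    #define REGION_WORDS 256u
    static volatile uint32_t protected_region[REGION_WORDS];
    static uint32_t stored_checksum;

    /* Compute a simple 32-bit additive checksum over the region. */
    static uint32_t region_checksum(void)
    {
        uint32_t sum = 0u;
        for (size_t i = 0u; i < REGION_WORDS; i++) {
            sum += protected_region[i];
        }
        return sum;
    }

    /* Record the reference checksum after the region is initialized or
     * legitimately updated. */
    void integrity_seal(void)
    {
        stored_checksum = region_checksum();
    }

    /* Periodic integrity test: returns 0 while the region matches its seal,
     * -1 if corruption (for example, a bit flip) is detected. */
    int integrity_check(void)
    {
        return (region_checksum() == stored_checksum) ? 0 : -1;
    }

A cyclic redundancy check (CRC) would detect more error patterns than a simple sum; the structure of the check would be the same.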

Sheet B-2b: RAM Device Description

RAM devices usually provide most of the short-term or active memory functionality of modern computer systems. Though the term 'RAM' is sometimes applied to several other devices such as flash drives, ROMs, or FRAM, this document only covers the highly volatile types of RAM such as SRAM, DRAM, SDRAM, PSRAM, and others. These types of RAM are considered volatile because all of them lose some or all of their data after even short periods of power loss. Some types (all forms of DRAM, DDR SDRAM, etc.) have to be periodically refreshed in order to maintain their data. Also, many of these chips have versions intended for applications where small bit errors can cause problems; these versions have extra components that allow for detection and correction of such bit errors.

While most modern computer architectures support byte addressing, most modern RAMs are wider than one byte; they are often arranged in four- or eight-byte-wide 'lines'. Thus writing a single byte (or any number of bytes less than an entire line) is no longer trivial. Often a memory controller will be employed to handle writing to the lines of RAM and to interact with any cache memory present. Some controllers permit direct byte-wise access to data; some only permit access on four- or eight-byte boundaries and may raise an exception for any other mode of access. In any case, the behavior of such a controller should be well understood in any application important to safety.
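
As a concrete illustration of the point above, the sketch below shows a byte write performed as a read-modify-write of the 32-bit word that contains it, which is essentially what a memory controller (or low-level software) must do when only word-aligned accesses are permitted. The memory array, the little-endian lane ordering, and the function name are assumptions for illustration only.

    #include <stdint.h>

    /* Hypothetical word-addressable memory window (illustrative only). */
    #define MEM_WORDS 1024u
    static uint32_t memory[MEM_WORDS];

    /* Write one byte by read-modify-write of the 32-bit word that holds it,
     * assuming little-endian byte lanes within each word. */
    void write_byte(uint32_t byte_addr, uint8_t value)
    {
        uint32_t word_index = byte_addr / 4u;   /* word containing the byte   */
        uint32_t lane       = byte_addr % 4u;   /* byte lane within the word  */
        uint32_t shift      = lane * 8u;

        uint32_t word = memory[word_index];     /* read the whole word        */
        word &= ~(0xFFu << shift);              /* clear the target byte lane */
        word |= ((uint32_t)value << shift);     /* merge in the new byte      */
        memory[word_index] = word;              /* write the whole word back  */
    }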

Also, several different relative speed comparisons are made in the descriptions below. These are just general trends, as bandwidths and response times can vary even between sub-categories of these chips.

SRAM – (Static Random Access Memory) – This type of RAM uses six or more transistors per bit in order to maintain semi-permanent memory storage. This type of RAM requires no refresh, though it will still lose its data if it loses power. It also tends to respond faster than DRAM and uses much less power when idle. This type of RAM is commonly used as cache memory and microprocessor registers, and is integrated into other chips. The primary downside to this type of RAM is that it is expensive compared to DRAM and has much lower storage densities. Also, power usage can vary widely depending on how much data is being stored. This type of RAM tends to be less susceptible to random bit flips from radiation because of the way the several transistors are arranged to store each bit.

DRAM – (Dynamic Random Access Memory) – This category is most commonly what is meant by the term RAM. This type of RAM usually uses only one transistor per bit. Due to the inherent imperfections in real transistors, the stored charge is lost over time. This requires that the memory be refreshed in order to maintain its data. DRAM memory controllers often handle this refresh function automatically. Different grades of DRAM (commercial, industrial, automotive) may also have different refresh rates. Higher refresh rates can make reading and writing slower, but can increase reliability in unfavorable conditions. Data densities are much higher, and this type of RAM is much cheaper. Though this type of RAM is slightly more susceptible than SRAM to heat- or radiation-induced bit flips, modern manufacturing methods have reduced this effect significantly, to the point where such occurrences are rather rare. Also, there are several methods that can be used to detect and recover from such bit errors (sometimes without any loss of data).

SDRAM – (Synchronous Dynamic Random Access Memory) – This type of dynamic RAM is synchronized with the system bus, allowing for faster response times than plain DRAM.

DDR SDRAM – (Double Data Rate SDRAM) – Doubles the number of data words transferred in a single clock cycle compared with SDRAM. Later versions of DDR (such as DDR2, DDR3, etc.) follow this trend, further increasing the number of reads and writes per clock cycle.

RLDRAM – (Reduced Latency DRAM) – A higher-performance form of DDR SDRAM, specifically designed with networking and caching applications in mind.

1T (or 1T/1C) DRAM – A different design for DRAM that doesn't use
capacitors to store individual bits. Otherwise behaves the same as regular
DRAM.

PSRAM (or PSDRAM) – (Pseudo-Static (Dynamic) Random Access Memory) – This is a special form of DRAM component that has a built-in memory controller to completely handle the memory refresh as well as memory addressing. This allows the memory to behave like SRAM while maintaining the storage densities of DRAM. Unfortunately, because of the additional complexity and the inherent hidden functionality, this type of RAM can be difficult to justify from a
security or reliability perspective. From the security standpoint,

Sheet B-3a: ROM Device Failure Modes

This is a basic layout of a ROM chip. Some forms of ROM have fewer components than those labeled above; for example, only EPROM has the Erase Window, and Mask ROM does not have the additional power input for writes, the Data In line, or the Erase Window. Some modern forms of EEPROM and EAROM have built-in charge pumps to provide the higher voltage required for writes, and as such do not have the Write Power input.
Common types of ROM: Mask ROM, PROM, EPROM/UVPROM, EEPROM, EAROM, Flash Memory

Failure Mode: Chip Fails to Respond
Failure Mechanisms: 1. Power Supply Off; 2. Power Supply Dip
Defensive Measures: Do not turn off the power supply. Do not let power dip.

Failure Mode: Some Data Corrupted
Failure Mechanisms: 1. Power Supply Dip; 2. Bit Errors (radiation, EMI, age-based degradation, heat, use-based degradation); 3. Manufacturing Defect; 4. Failed Write; 5. Inadvertent Exposure to UV Rays (UVPROM or EPROM only)
Defensive Measures: Do not let power dip. Quality testing. Ensure proper cooling in the controller. Ensure proper shielding around the controller to protect against radiation and EMI. Determine expected life spans of components to determine the probability of failure in a given time frame. Integrity tests before and while running. Hardware and software data verification, using methods such as parity bits and checksums. Software techniques to distribute use across the chip to increase the life span of the component. Shield equipment from accidental UV exposure; cover the Erase Window with an opaque sticker.

Failure Mode: Some Data Lost
Failure Mechanisms: 1. Bit Errors (radiation, EMI, age-based degradation, heat, use-based degradation); 2. Manufacturing Defect; 3. Inadvertent Exposure to UV Rays (UVPROM or EPROM only)
Defensive Measures: Quality testing. Ensure proper cooling in the controller. Ensure proper shielding around the controller to protect against radiation and EMI. Determine expected life spans of components to determine the probability of failure in a given time frame. Integrity tests before and while running. Hardware and software data verification, using methods such as parity bits and checksums. Software techniques to distribute use across the chip to increase the life span of the component. Shield equipment from accidental UV exposure; cover the Erase Window with an opaque sticker.

Sheet B-3b: ROM Device Description

ROM devices are generally used in order to store data while the device or
component is turned off. All forms of ROM are considered highly stable, and
can go without power for years without losing their data. In many applications,
the main downside to ROM devices is that many of them are Read-Only, and
cannot actually be written to, or can be written to only once. These devices have
to be carefully programmed, and replaced if their data ever gets damaged. Some
forms of ROM can be re-written, but usually in large data blocks, and can only
be done fairly slowly.

Compared to most RAM devices, the read speed of most ROM devices is too slow to use effectively during the standard operation of a component. For this reason, many components will copy the data out of their ROM devices into RAM devices and access the data from there. This can introduce vulnerabilities, since RAM is generally less stable than ROM.

The error detection and/or correction strategies are similar to those used for RAM devices. The most common methods include using parity bits to check for single bit flips, and checksums in general to check for larger amounts of corrupted data. Some data correction strategies call for using a combination of checksums and redundancy in order to recover lost data when possible. In most cases, however, if a ROM device has any data errors, it needs to be either reprogrammed or replaced entirely. Fortunately, ROM devices tend to have significantly longer life expectancies than other devices.
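
A minimal sketch of the kind of start-up integrity check described above is shown below; the ROM image contents and the stored checksum value are illustrative assumptions, and a CRC could be substituted for the simple additive sum.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical ROM image with a checksum recorded at programming time. */
    static const uint8_t  rom_image[] = { 0x12, 0x34, 0x56, 0x78 };
    static const uint32_t rom_stored_checksum = 0x114u;  /* sum of the bytes above */

    /* Simple byte-wise additive checksum over the image. */
    static uint32_t rom_checksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0u;
        for (size_t i = 0u; i < len; i++) {
            sum += data[i];
        }
        return sum;
    }

    /* Returns 0 if the ROM contents still match the value stored when the
     * device was programmed; otherwise the chip should be reprogrammed or
     * replaced, as noted above. */
    int rom_integrity_check(void)
    {
        return (rom_checksum(rom_image, sizeof rom_image) == rom_stored_checksum)
                   ? 0 : -1;
    }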

Mask ROM – (Mask Read Only Memory) – This is the oldest form of ROM; the term 'mask' refers to the most common method of manufacturing. Each chip is manufactured with the data encoded on the chip, and the data cannot be changed. This is generally the cheapest and most dense form of ROM, and it is commonly used in applications where the values never change, such as math lookup tables. A common development cycle might call for using other forms of ROM chips first, and then finalizing to Mask ROM to cut costs in the production run. However, this practice is becoming less and less common as other forms of ROM become cheaper and grant greater flexibility.

PROM – (Programmable Read Only Memory) – This is a write-once form of ROM. Every bit in the chip is initially set to a 1, and there is a fuse at each bit location. When programming, a higher voltage is applied and is used to permanently burn out the fuse at each bit location that needs to be set to zero. This form of ROM is fairly cheap, and large quantities of blanks can be kept on hand in case of errors when burning one of these ROM chips. This type of ROM can also be highly susceptible to static discharges and minor power surges, because these events can sometimes burn out additional bits; since these bits cannot be re-written, the whole chip then has to be replaced.

EPROM/UV-ROM – (Erasable Programmable Read Only Memory, or Ultraviolet-erasable Programmable Read Only Memory) – This form of ROM can actually be rewritten several times; however, the whole chip has to be re-written every time, even if only a single correction has to be made. Erasing is achieved by exposing the internal circuitry of the chip to UV light through the Erase Window, which clears the internal logic gates of their charge, allowing the chip's contents to be re-written. Sunlight can start erasing an EPROM chip in a matter of weeks, and fluorescent lighting can erase an EPROM chip in months to years. Extended overexposure to UV can permanently damage the chip so that it can no longer store any data. Also, dust around and inside the chip can sometimes shield parts of the circuitry from the UV light and cause incomplete erasures. Versions of EPROM have been manufactured as OTP (One Time Programmable) EPROM, with the window covered by an opaque substance to prevent erasure of the chip. Such a device behaves like PROM but is still EPROM because of the way it is manufactured. Accidental exposure to UV can inadvertently clear the memory from UV-erasable ROM chips.

EEPROM – (Electrically Erasable Programmable Read Only Memory) – This form of ROM can be re-written using a higher voltage than is required for reading. Some forms of EEPROM have charge pumps built into their circuitry to provide the required higher voltage. This type of ROM is the only one that can be re-written without having to be removed from the component it is integrated with. Modern EEPROM usually has to be re-written in blocks of several bytes or more. All EEPROM has a limited number of write cycles (usually in the hundreds of thousands or more) before new writes and erasures fail due to imperfections in the insulating layer around each bit. Also, due to these imperfections in the insulating layers, individual bits will eventually lose their charge and the data stored will be lost. The time frame for this failure is almost always 10 years or more, and the data can be re-written at any time before this in order to restore the initial charge in the bit cell. Erasures and writes are much slower than reads for all forms of EEPROM, which is why EEPROM tends to be used in applications where the data does not change frequently or quickly and long-term, low-power data storage is required.

EAROM – (Electrically Alterable Read Only Memory) – A type of EEPROM that allows data to be re-written one bit at a time. This is usually a much slower process than standard EEPROM re-writes.

Flash Memory (F-ROM) – This is a type of EEPROM that can be erased and re-written much faster than standard EEPROM. Newer versions can have greater durability and endurance (the number of erase-write cycles before chip failure) than standard EEPROM. Some forms of Flash Memory have write-protection modes that disable writes except under special conditions, making the device behave more like standard ROM.

Sheet B-4a: A/D Converter Device Failure Modes

(Figure: basic A/D converter layout — supply voltage, reference voltage, sensor signal input(s) (can be more than one if a MUX is on-board), and ADC clock in; status and serial data out to the CPU.)

Analog to Digital converters are devices that are used to convert an analog signal to a digital signal. To do this, the input analog voltage is compared to a 'standard' voltage in order to assign a discrete value to the signal. The different methods of converting analog to digital include: flash, successive-approximation, ramp-compare, integrating, delta-encoded, and pipeline.

Failure Mode: Lack of Output Signal
Failure Mechanisms: 1. Power Supply Off; 2. Power Supply Dip; 3. Failed Connections; 4. Manufacturing Defects
Defensive Measures: Do not turn off the power supply. Do not let power dip. Quality testing. Ensure the A/D converter is properly installed. A/D initialization and calibration tests.*

Failure Mode: Inaccurate Fluctuation of Output Signal
Failure Mechanisms: 1. Power Supply Dip; 2. Design Flaws; 3. Manufacturing Defect; 4. Failed Connections; 5. Noisy Input Signal; 6. Fluctuating Comparison Signal*
Defensive Measures: Do not let power dip. Quality testing. Ensure proper shielding around the device to protect against radiation and EMI. Integrity tests before installation.*

Failure Mode: Output Signal High
Failure Mechanisms: 1. Power Supply Off; 2. Manufacturing Defect; 3. Design Flaws; 4. Failed Connections
Defensive Measures: Quality testing. Ensure proper shielding around the device to protect against radiation and EMI. Integrity tests before and while running. Ensure proper calibration.*

Failure Mode: Output Signal Low
Failure Mechanisms: 1. Power Supply Off; 2. Power Supply Dip; 3. Design Flaws; 4. Manufacturing Defect; 5. Failed Connections; 6. Incorrect Calibration
Defensive Measures: Quality testing. Ensure proper shielding around the device to protect against radiation and EMI. Integrity tests before and while running. Ensure proper calibration.*

Failure Mode: Inaccurate Output Signal
Failure Mechanisms: 1. Power Supply Off; 2. Power Supply Dip; 3. Design Flaws; 4. Manufacturing Defect; 5. Failed Connections; 6. Incorrect Calibration
Defensive Measures: Quality testing. Ensure proper shielding around the device to protect against radiation and EMI. Integrity tests before and while running. Ensure proper calibration throughout the signal range.**

* Many A/D converters require accurate calibration in order to convert an analog signal to a digital signal. This is often done by providing a known input. This can be done internally using power converters and capacitors to provide a steady comparison signal. If this signal fluctuates or is inaccurate, it can result in inaccurate readings of the real signal.
** Many analog to digital converters have different natural fluctuations in the output as the signal moves up and down the available range. This means the device must be calibrated at several different levels along the full range of the possible input signals.

Sheet B-4b: A/D Converter Device Description

Analog to Digital Converters are devices that convert a continuous analog signal to a discrete value. There are many different methods of doing this, and there are several key factors to consider when using an A/D converter. The most common methods are flash, successive-approximation, ramp-compare, integrating, delta-encoded, pipeline, and sigma-delta converters. Important features common to all analog to digital converters include sampling speed, sampling frequency, resolution, accuracy, and non-linearity.

The sampling speed is the speed at which the A/D converter can convert the input analog signal to a digital signal. Depending on the type of converter, this can be anywhere from nanoseconds to milliseconds. Generally, the more accurate the converter, the slower the conversion, so the required sampling speed must be considered when the converter is used in an application where many samples must be taken in a short period of time. The sampling frequency is how many times per second a sample of the analog signal is taken. In the case of an analog signal that is known to oscillate, the converter should have a sampling frequency of at least double the frequency of the analog signal.

Resolution describes the precision of the output digital signal and the number of bits used to describe a discrete analog value. In situations where the analog signal may have a wide range yet minor fluctuations are important, a converter with greater resolution is needed. The accuracy of a converter describes its ability to reliably convert an analog value to the digital value that describes the signal. Non-linearity describes the behavior of some A/D converters where an even increment of the analog signal does not convert to an even increment of the digital signal. These types of converters need to be calibrated at several different analog signal levels.
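
One common way to handle such non-linearity is a piecewise-linear correction based on calibration points taken at several input levels, as sketched below; the table values and the 12-bit code range are illustrative assumptions only.

    #include <stdint.h>

    /* Hypothetical calibration points for a 12-bit converter: the raw code
     * actually read while a known reference input was applied, and the code
     * an ideal converter would have produced (values are illustrative). */
    typedef struct {
        uint16_t raw;
        uint16_t ideal;
    } cal_point_t;

    static const cal_point_t cal_table[] = {
        {    0,    0 },
        { 1020, 1024 },
        { 2050, 2048 },
        { 3068, 3072 },
        { 4095, 4095 },
    };
    #define CAL_POINTS (sizeof cal_table / sizeof cal_table[0])

    /* Correct a raw reading by linear interpolation between the two
     * calibration points that bracket it. */
    uint16_t adc_correct(uint16_t raw)
    {
        if (raw <= cal_table[0].raw) {
            return cal_table[0].ideal;
        }
        for (unsigned i = 1u; i < CAL_POINTS; i++) {
            if (raw <= cal_table[i].raw) {
                uint32_t span_raw   = cal_table[i].raw   - cal_table[i - 1].raw;
                uint32_t span_ideal = cal_table[i].ideal - cal_table[i - 1].ideal;
                uint32_t offset     = raw - cal_table[i - 1].raw;
                return (uint16_t)(cal_table[i - 1].ideal +
                                  (offset * span_ideal) / span_raw);
            }
        }
        return cal_table[CAL_POINTS - 1].ideal;
    }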

Flash Converter – These are the fastest analog to digital converters. They run the signal through a bank of capacitors and resistors in order to convert the signal in a single pass. Because the resolution is determined by the number of comparators that can fit on a single chip, these types of chips tend to have fairly low resolution. They can also be error prone, since a failure of any single comparator can create a lot of noise in the output signal. These are most commonly used in high-bandwidth applications such as video conversion, because of the speed needed to transfer that much data.

Successive-Approximation Register (SAR) Converter – These are the most common converters used in industrial applications. Though they are significantly slower than flash converters, they use a series of approximations of the signal to generate whatever resolution is desired. The flexibility and reliability of these converters therefore makes them very useful for monitoring sensors where the signal is not expected to change quickly.

Ramp-Compare Converter – These converters work by ramping up a comparison voltage until it is the same as the signal voltage. The amount of time spent ramping up the voltage is measured and can be used to calculate the voltage of the signal line. However, because the timing is done with a simple oscillating circuit, temperature changes can affect the accuracy of these converters. Though this effect can be minor, these converters need to be calibrated at the intended operating temperature in order to be accurate, and even then additional measures must be taken to ensure output signal accuracy.

Integrating Converter – These are similar to ramp-compare converters in that an unknown signal is input into an integration circuit for a known amount of time, and then an opposite-polarity charge is used to dissipate the accumulated charge. The time it takes to dissipate the charge is then used to calculate the voltage of the input signal. The speed of the run-down segment can be adjusted to give higher resolution with fewer errors. This type of converter is most commonly used in handheld measurement tools, due to its versatility.

Delta-Encoded Converter – These converters work by using a simple comparator, a counter to store the calculated signal value, and a DAC (Digital to Analog Converter). The input signal is put through the comparator, which continuously adjusts the counter until the DAC produces an output that is close enough to the input signal. This means that the desired resolution can be obtained just by adjusting the number of bits in the counter and the maximum delta allowed. The time it takes to convert the signal, however, varies depending on the value of the input signal.

Pipeline Converter - These converters combine some of the features of Flash and
successive approximation converters in order to have a converter that works faster
than a successive approximation converter, and more accurately than a flash
converter while keeping the overall size of the converter much smaller.

Sigma-Delta (or Delta-Sigma) – These devices work by taking many 1-bit samples of a single input signal and averaging the results to determine the input voltage. These samples are passed through the delta-sigma component, which uses the ratio of 0s to 1s to determine the input voltage. The error calculated in this step can be used to improve the accuracy of the first step by setting threshold values for the 1-bit signal sampler. During the process of conversion, the sigma-delta component uses an integrator circuit, which shifts the low-frequency noise on the input signal into higher frequencies, generally out of the band of interest. These devices are therefore primarily used for low-bandwidth (or low-frequency) signals where high accuracy is required.

Sheet B-5a: D/A Converter Device Failure Modes

(Figure: basic D/A converter layout — supply voltage, reference voltage, read/write and load register controls, DAC reset, and data bits 1 through n in (8, 12, or 16 bits typical); voltage or current signal out.)

Digital to Analog converters are devices that convert a discrete digital signal to an analog signal, commonly by sending rectangular electrical pulses at extremely high frequencies. This analog signal can then be passed through a filter that smooths out the signal. The most common types of DACs are pulse-width modulators, successive-approximation DACs, and thermometer-coded DACs.

Failure Mode: Lack of Output Signal
Failure Mechanisms: 1. Power Supply Off; 2. Power Supply Dip; 3. Failed Connections; 4. Manufacturing Defects
Defensive Measures: Do not turn off the power supply. Do not let power dip. Quality testing. Ensure the D/A converter is properly installed. D/A initialization and calibration tests.*

Failure Mode: Noisy Output Analog Signal
Failure Mechanisms: 1. Power Supply Dip; 2. Design Flaws; 3. Manufacturing Defect; 4. Failed Connections
Defensive Measures: Do not let power dip. Quality testing. Ensure proper shielding around the device to protect against radiation and EMI. Integrity tests before installation.*

Failure Mode: Output Signal High
Failure Mechanisms: 1. Manufacturing Defect; 2. Design Flaws; 3. Failed Connections; 4. Incorrect Calibration
Defensive Measures: Quality testing. Ensure proper shielding around the device to protect against radiation and EMI. Integrity tests before and while running. Ensure proper calibration.*

Failure Mode: Output Signal Low
Failure Mechanisms: 1. Power Supply Off; 2. Power Supply Dip; 3. Design Flaws; 4. Manufacturing Defect; 5. Failed Connections; 6. Incorrect Calibration
Defensive Measures: Quality testing. Ensure proper shielding around the device to protect against radiation and EMI. Integrity tests before and while running. Ensure proper calibration.*

* Some digital to analog converters require calibration against a known signal. If this calibration is not done, or is done incorrectly, the output analog signal could be inaccurate.

Sheet B-5b: D/A Converter Device Description

Digital to Analog converters are designed to convert a digital value to an analog value within a specified range. This is most commonly done by applying a known voltage to the output line for some percentage of the time, so that the average voltage on the line is the desired amount. The frequency of the pulses is usually so high that most analog systems cannot even detect the fluctuations. In cases where a smooth signal needs to be produced, or where noise on the output line can have adverse effects on other equipment, the signal is often passed through an analog filter that smooths out the peaks in the output. The most common types of DACs are Pulse-Width Modulators, Successive Approximation, Thermometer-Coded, R2R, and Power Steering converters.

Though these are the basic types, it is not uncommon for DACs to be built using
a combination of two or more different basic types. This is usually done in order
to combine the strengths of different types. For example, some use thermometer-
coded circuits for the most significant bits, but an R2R ladder for the least
significant bits. This still allows for the extremely fast conversion while reducing
the overall manufacturing complexity.

Pulse Width Modulators – These types of DACs are the simplest and are most commonly used for electric motor control. A stable current is applied for variable amounts of time to the output line. The amount of time per cycle is determined by the digital value passed into the converter. The output is an analog signal that oscillates at a known frequency, where the width of the peak varies and the average voltage on the line is determined by the input value. This output can then be passed through a filter that smooths out the peaks or shifts the noise into a range that will not interfere with the analog equipment on the circuit.
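
Since the average output voltage of a pulse-width-modulated DAC is set by the duty cycle, the on-time for a desired output can be computed as sketched below. The period, supply voltage, and function name are illustrative assumptions rather than parameters of any specific device.

    #include <stdint.h>

    /* Hypothetical PWM parameters (illustrative only). */
    #define PWM_PERIOD_COUNTS 1000u   /* timer counts in one PWM cycle          */
    #define PWM_SUPPLY_MV     5000u   /* voltage applied during the on-time, mV */

    /* Return the on-time count that yields the requested average output:
     * duty = Vout / Vsupply, on_counts = duty * period. */
    uint32_t pwm_counts_for_mv(uint32_t desired_mv)
    {
        if (desired_mv > PWM_SUPPLY_MV) {
            desired_mv = PWM_SUPPLY_MV;           /* clamp to achievable range */
        }
        return (desired_mv * PWM_PERIOD_COUNTS) / PWM_SUPPLY_MV;
    }

For example, a requested 1250 mV output with a 5000 mV supply gives a 25% duty cycle, or 250 of the 1000 counts in each cycle.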

Successive-Approximation – These converters work by converting each bit of the input signal to a fixed voltage and adjusting the output voltage at each step. Once each bit has been converted, the desired output voltage has been achieved. The number of bits can be adjusted to improve the overall precision of the output signal.

Thermometer-Coded – These converters work by having an equal current source or resistor for every possible value of the output. The desired output voltage is then achieved by selecting the desired segment of the converter. This is by far the fastest type of DAC, but it has the downside of being expensive, and achieving greater precision requires significantly more circuitry on the chip.

R2R – R2R is a resistor-network style digital to analog converter. Passing the digital signal through a ladder of resistors allows for extremely quick conversion of a digital signal to an analog signal. However, in order to get accurate conversions, the resistors for the higher-order bits must be significantly more accurate than those for the lower-order bits. The difficulty of making such highly accurate resistors means that these types of converters are generally used for converting digital signals of 8 bits or less.
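
The ideal transfer function of an N-bit R2R ladder is Vout = Vref x (code / 2^N); the sketch below simply evaluates that relationship for a hypothetical 8-bit converter with a 5 V reference (both values are illustrative assumptions).

    #include <stdint.h>

    /* Illustrative parameters for an ideal R2R ladder DAC. */
    #define DAC_BITS    8u
    #define DAC_VREF_MV 5000u

    /* Ideal output in millivolts: Vout = Vref * code / 2^N. */
    uint32_t r2r_ideal_output_mv(uint32_t code)
    {
        uint32_t full_scale = 1u << DAC_BITS;    /* 2^N possible codes       */
        if (code >= full_scale) {
            code = full_scale - 1u;              /* clamp to the valid range */
        }
        return (code * DAC_VREF_MV) / full_scale;
    }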

Power Steering – This is a type of converter that uses a set of parallel resistors to convert the digital values to an analog value. The way these resistors are configured allows for significantly fewer resistors on the same circuit in order to achieve the output voltage. The largest downside to this type of DAC is that it tends to be fairly inaccurate, so it is often used in situations where speed of conversion and lowest cost are the highest priorities.

Sheet B-6a: Type 1 Controller Component Failure Modes
(Figure: basic layout of a standalone Type 1 controller — CPU, RAM, ROM, A/D, D/A, and HSI devices on an internal bus, with a clock, a watchdog timer with alarm output, an internal power supply fed from line voltage, input and output signals, and operator inputs and display.)

This is a basic layout of a standalone controller, labeled "Type 1" in this guideline. A Type 1 controller is capable of performing typical I&C loop functions without the need for any other modules. A Type 1 controller typically contains CPU, RAM, ROM, A/D Converter, D/A Converter, HSI, Clock, Watchdog Timer, and internal Power Supply devices (see related Taxonomy sheets).

Failure Mode: Controller Lockup
Failure Mechanisms: 1. CPU Halt; 2. CPU Crash; 3. Stopped Internal Clock
Defensive Measures: See CPU Device Taxonomy Sheet B-1a. Configure the W/D Timer to detect, alarm, and force outputs to the preferred state.

Failure Mode: Dead Controller
Failure Mechanisms: 1. Failed Internal Power Supply; 2. Line Voltage Below Spec
Defensive Measures: Implement redundant, uninterruptable line power.

Failure Modes: Outputs Fail High; Outputs Fail Low; Output High Rate of Change
Failure Mechanisms: 1. CPU Data Corruption; 2. CPU Logic Error; 3. D/A Device Error; 4. Lost or Corrupted RAM Data
Defensive Measures: See CPU Device Taxonomy Sheet B-1a. See RAM Device Taxonomy Sheet B-2a. See D/A Device Taxonomy Sheet B-5a. Implement a "loopback" signal by connecting outputs to spare inputs and check for deviations via SW logic. Implement a redundant controller, validate output from the primary controller, and take over if needed.

Failure Mode: Loss of Input Signal Processing
Failure Mechanisms: 1. CPU Data Corruption; 2. CPU Logic Error; 3. A/D Device Error; 4. Lost or Corrupted RAM Data
Defensive Measures: See CPU Device Taxonomy Sheet B-1a. See RAM Device Taxonomy Sheet B-2a. See A/D Device Taxonomy Sheet B-4a. Implement a redundant controller to take over if needed.

Failure Mode: Loss of Operator Interface
Failure Mechanisms: 1. CPU Data Corruption; 2. CPU Logic Error; 3. HSI Device Error; 4. Lost or Corrupted RAM Data
Defensive Measures: See CPU Device Taxonomy Sheet B-1a. See RAM Device Taxonomy Sheet B-2a. Implement a redundant controller to take over if needed.

Failure Mode: Failure to Boot or Reset
Failure Mechanisms: 1. CPU Data Corruption; 2. CPU Logic Error; 3. Lost or Corrupted ROM Data
Defensive Measures: See CPU Device Taxonomy Sheet B-1a. See RAM Device Taxonomy Sheet B-2a. Implement a redundant controller to take over if needed.

Sheet B-6b: Type 1 Controller Component Description

A controller in the context of this taxonomy sheet is a digital component that monitors one or more plant processes and controls one or more final control elements such as valves and pumps.

A Type 1 controller is a controller that is entirely self-contained and relies on no other modules (e.g., I/O modules) to perform its control functions. That said, a Type 1 controller still depends on connections to and from plant sensors, controlled elements, and line power or loop power supplies to perform its function.

A Type 1 controller is usually supplied in a surface-mounted or panel-mounted form factor, where the controller is one integrated enclosure with suitable interfaces such as terminal strips, keypads, and displays.

A Type 1 controller will have an integrated operator display that may be in a dot-
matrix, LED, or LCD form factor, or some combination thereof. Operator
inputs may be through the use of pushbuttons or touchscreen elements.

It is usually possible to connect a wide variety of analog sensor signals directly to a Type 1 controller, including 4-20 mADC analog current loops, 1-5 VDC or 1-10 VDC analog signals, thermocouples, RTDs, and other sensor inputs. Likewise, digital inputs can usually be directly connected, using dry or wetted contact closure signal mechanisms.

Type 1 controllers will have one or more analog outputs (current loops or voltage
outputs) and one or more digital outputs.

Type 1 controllers will have an embedded operating system that may include on-
board “function blocks” that can be called via an application-specific
configuration table, or it may require a separate application program to be
generated through the use of an application engineering tool.

Type 1 controllers will come with some form of CPU, RAM, ROM, Internal
Clock, Watchdog Timer, A/D Converter, D/A Converter, Internal Power
Supply and HSI devices.

Watchdog timers should be implemented in hardware, and should be designed so that the controller can be configured to send an alarm and force controller outputs to a preferred state.
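
The sketch below illustrates the application-side pattern implied by this guidance: the control loop services a hardware watchdog only after a complete, successful pass, so that a hung loop lets the timer expire and the hardware (not software) raises the alarm and drives outputs to the preferred state. The register-access functions here are placeholders; real hardware would use vendor-defined registers.

    #include <stdint.h>
    #include <stdbool.h>

    /* Placeholder hooks for a hypothetical hardware watchdog; on real hardware
     * these would write vendor-defined device registers. */
    static void watchdog_configure(uint32_t timeout_ms) { (void)timeout_ms; }
    static void watchdog_service(void) { /* restart the hardware timer */ }

    /* Placeholder I&C loop steps. */
    static bool read_inputs(void)     { return true; }
    static void compute_outputs(void) {}
    static void write_outputs(void)   {}

    int main(void)
    {
        /* Timeout chosen to exceed the worst-case loop time (illustrative). */
        watchdog_configure(100u);

        for (;;) {
            if (read_inputs()) {
                compute_outputs();
                write_outputs();
            }
            /* Serviced once per healthy pass; if the loop locks up, the
             * hardware watchdog expires, raises the alarm, and forces the
             * outputs to their preferred state. */
            watchdog_service();
        }
    }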

Some Type 1 controllers may have a dedicated data communication port (Serial,
Ethernet or fieldbus).

Sheet B-7a: Type 2 Controller Component Failure Modes
(Figure: basic layout of a Type 2 controller — CPU, RAM, and ROM devices on an internal bus, with a clock, a watchdog timer with alarm output, an internal power supply fed from line voltage, and a backplane interface carrying data to and from other modules.)

This is a basic layout of a "Type 2" controller. A Type 2 controller requires an interface with other modules in order to perform typical I&C loop functions. A Type 2 controller typically contains CPU, RAM, ROM, Clock, Watchdog Timer, and internal Power Supply devices (see related Taxonomy sheets).

Failure Mode: Controller Lockup
Failure Mechanisms: 1. CPU Halt; 2. CPU Crash; 3. Stopped Internal Clock
Defensive Measures: See CPU Device Taxonomy Sheet B-1a. Configure the W/D Timer to detect, alarm, and force outputs to the preferred state.

Failure Mode: Dead Controller
Failure Mechanisms: 1. Failed Internal Power Supply; 2. Line Voltage Below Spec
Defensive Measures: Implement redundant, uninterruptable line power.

Failure Mode: Loss of Data Processing
Failure Mechanisms: 1. CPU Data Corruption; 2. CPU Logic Error; 3. Lost or Corrupted RAM Data; 4. Failed Backplane Interface
Defensive Measures: See CPU Device Taxonomy Sheet B-1a. See RAM Device Taxonomy Sheet B-2a. Implement a redundant controller to take over if needed.

Failure Mode: Failure to Boot or Reset
Failure Mechanisms: 1. CPU Data Corruption; 2. CPU Logic Error; 3. Lost or Corrupted ROM Data
Defensive Measures: See CPU Device Taxonomy Sheet B-1a. See RAM Device Taxonomy Sheet B-2a. Implement a redundant controller to take over if needed.

Sheet B-7b: Type 2 Controller Component Description

A controller in the context of this taxonomy sheet is a digital component that monitors one or more plant processes and controls one or more final control elements such as valves and pumps.

A Type 2 controller is a controller that is not entirely self-contained and relies on other modules (e.g., I/O modules) to perform its control functions. A Type 2 controller is usually supplied in a "card" form factor, where the controller is a module designed to be inserted into a rack or enclosure with other components. A Type 2 controller will usually have only primitive interfaces, such as LEDs for gross indication of status and communication ports for connecting engineering or diagnostic tools.

A Type 2 controller is designed to interface with other components through a backplane or similar bus configuration. A Type 2 controller may reside in a standalone PLC rack or in a subrack in a DCS architecture.

Type 2 controllers will have an embedded operating system that may include on-
board “function blocks” that can be called via an application-specific
configuration table, or it may require a separate application program to be
generated through the use of an application engineering tool.

Type 2 controllers will come with some form of CPU, RAM, ROM, Internal
Clock, Watchdog Timer, Internal Power Supply and Backplane Interface
devices.

Watchdog timers should be implemented in hardware, and should be designed so that the controller can be configured to send an alarm and force controller outputs to a preferred state.

Some Type 2 controllers may have a dedicated data communication port (Serial,
Ethernet or fieldbus).

Sheet B-8a: Data Communication Component Failure Modes

There are many variations on Communication Modules, depending on the networking protocols that the component supports, as well as variations in available bandwidth. Many network modules have a small amount of dedicated memory in order to handle requests more quickly. Network modules may use a variety of different communication protocols and cabling types to transmit data to other modules. The data into and out of the controller is usually transmitted through a backplane or other bus-type connection.

Failure Mode: Failure to Send/Receive Network Signals
Failure Mechanisms: 1. Power Supply Off; 2. Power Supply Dip; 3. Failed Connections; 4. Manufacturing Defects
Defensive Measures: Do not turn off the power supply. Do not let power dip. Quality testing. Ensure the communication module is properly installed. Network initialization and integrity tests.* Network integrity tests during operation.*

Failure Mode: Corruption of Data
Failure Mechanisms: 1. Power Supply Dip; 2. Design Flaws; 3. Manufacturing Defect; 4. Failed Connections; 5. Memory Errors; 6. Processor Logic Errors
Defensive Measures: Do not let power dip. Quality testing. Ensure proper shielding around the module to protect against radiation and EMI. Integrity tests during operation.** Memory tests during operation.** Memory redundancy. Architectural diversity/redundancy. Integrity tests before installation.**

Failure Mode: Loss of Data
Failure Mechanisms: 1. Power Supply Off; 2. Manufacturing Defect; 3. Failed Connections; 4. Processor Crashes; 5. Memory Failures
Defensive Measures: Quality testing. Ensure proper shielding around the controller to protect against radiation and EMI. Memory redundancy. Architectural diversity/redundancy. Integrity tests before and while running.**

* The most common method of ensuring proper and continuous connectivity for any sort of communication device is to require periodic signals to and from other network components. If no signals are received within a fixed period of time (generally significantly less than 0.1 second for real-time systems), some sort of network connection failure is assumed. Testing for a direct connection to a controller (through the backplane using some bus architecture) can be done more quickly and reliably using a similar method, but usually with much shorter timeout periods.
** The integrity of data transmitted over the network lines can be tested by using packets with checksums, parity bits, and other such integrity tests. Integrity tests for data sent to/from the controller can be done similarly, as well as by having redundant memory storage in this device.
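
A minimal sketch of the periodic-signal timeout check described in the first note is shown below; the 50 ms window and the millisecond tick source are illustrative assumptions chosen to stay well under the 0.1-second figure mentioned above.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative timeout, well under the 0.1 s bound noted above. */
    #define HEARTBEAT_TIMEOUT_MS 50u

    static uint32_t last_message_ms;   /* time the last valid message arrived */

    /* Call whenever any valid message is received from the peer device. */
    void heartbeat_note_received(uint32_t now_ms)
    {
        last_message_ms = now_ms;
    }

    /* Call periodically; returns true when the connection should be declared
     * failed because nothing arrived within the timeout window. */
    bool heartbeat_timed_out(uint32_t now_ms)
    {
        return (now_ms - last_message_ms) > HEARTBEAT_TIMEOUT_MS;
    }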

Sheet B-8b: Data Communication Component Description

Communication Modules are components that transform data so that it can be sent and received over a network. The main devices included in most communication modules are some sort of processor to handle the data manipulation required, and some internal RAM for faster data processing. The communication module is often connected to some sort of controller using a high-speed bus system, such as a serial bus. The module is either told to send some data, or pulls the data to transmit from some sort of controller memory. The module then packetizes the data, including any redundancy/checksum information as well as any destination and source information required. Afterwards, the module proceeds to send this data over long-distance cables to a receiving communication component. The module then waits for a confirmation or a resend request. If too much time passes without receiving any response, the module will often resend the packet several times before deciding there is a network failure.

When receiving data, the device converts the incoming signals to digital form and saves the data to memory. It then checks the integrity of the data against whatever integrity information is included in the packet. Once it has determined whether the information has been corrupted or not, it sends a confirmation of receipt or a resend request back across the line. If the quantity of data to be received is greater than the maximum size of a packet, the communication module can receive several packets and reassemble the data before sending it to the controller.
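
The sketch below illustrates the packetize-and-verify pattern described in the last two paragraphs: the sender wraps the payload with a checksum, and the receiver recomputes it to decide between a confirmation and a resend request. The frame layout, field sizes, and additive checksum are illustrative assumptions, not any particular protocol.

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    /* Illustrative frame: sequence number, length, payload, checksum. */
    #define MAX_PAYLOAD 64u

    typedef struct {
        uint16_t seq;
        uint16_t len;
        uint8_t  payload[MAX_PAYLOAD];
        uint16_t checksum;
    } frame_t;

    /* Simple 16-bit additive checksum over header fields and payload. */
    static uint16_t frame_checksum(const frame_t *f)
    {
        uint32_t sum = (uint32_t)f->seq + f->len;
        for (uint16_t i = 0u; i < f->len && i < MAX_PAYLOAD; i++) {
            sum += f->payload[i];
        }
        return (uint16_t)(sum & 0xFFFFu);
    }

    /* Sender side: packetize the data and attach the checksum. */
    void frame_build(frame_t *f, uint16_t seq, const uint8_t *data, uint16_t len)
    {
        f->seq = seq;
        f->len = (len > MAX_PAYLOAD) ? MAX_PAYLOAD : len;
        memcpy(f->payload, data, f->len);
        f->checksum = frame_checksum(f);
    }

    /* Receiver side: accept only if the recomputed checksum matches; otherwise
     * a resend request would be returned, as described above. */
    bool frame_is_valid(const frame_t *f)
    {
        return (f->len <= MAX_PAYLOAD) && (frame_checksum(f) == f->checksum);
    }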

There are many different protocols that communication modules can use. The different protocols determine how the module is supposed to packetize data, and when and how often to send packets across the network. Some modules support a few different protocols and must be configured to use the desired protocol, or, in a few cases, can auto-detect the protocol being used by other devices on the network.

Considerations when choosing a communication module should include what protocol(s) the module uses, whether the protocol is intended to be real-time or not, and the available data bandwidth of the module.

Software Faults

The assessment of system hazards that result from software faults has remained a
controversial issue across many industries where software plays mission critical or
safety critical roles. Although software behaviors remain inextricably connected
to the hardware environment in which the software executes, the nature of
software faults and failures is fundamentally different from that of electronic
hardware. As noted earlier in this Appendix, hardware faults and failures can be
traced to changes in the properties of the electronic materials that in turn can be
predicted and assessed using conventional concepts such as wear and other age-
related degradation mechanisms.

Software, of course, does not wear out in any normal sense of the term. It might
be expected that once a unit of software was known to be complete and correct, it
would remain fault-free forever. There are two important reasons why this is
neither an accurate view of industry experience with software systems nor a useful
view from a systems theory point of view.

The first challenge to the completeness and correctness of software is its fundamental complexity. Over the history of digital computers, software has
grown in complexity at a faster pace than hardware. This is no surprise—software
exists to produce the desired system behaviors that can’t be created in hardware
alone. But the pace of complexity in software has also frequently exceeded the
methods and tools for producing and testing it.

To address this first challenge, both the application software and its entire tool
chain need to be assessed for hazards due to behavioral complexity. While some
aspects of this issue can be approached from the perspective of standards for the
software development process, extensive testing of the integrated
hardware/software system in its native environment (or in a high fidelity replica
of the native environment) remains the most effective way to discover software
faults in complex software.

The second challenge to software completeness and correctness is the result of the need to make behavior changes in the system and its environment over time.
This is the “soft” in software—it can be easily changed. Change, however, has
proven to be a two-edged sword. System changes can be highly desirable to
improve accuracy, efficiency, security and safety. Yet every change opens the
opportunity for new faults to appear in the software. The impact of changes can
reach across the system, resulting in unchanged software units that no longer
work properly, as well as inadvertently introducing faults into the changed code
that may not be revealed in testing. While disciplined change management
processes are important, extensive testing of the integrated hardware/software
system in its native environment (or in a high fidelity replica of the native
environment) still remains the most useful approach to reducing software hazards
due to system changes.

An important aspect of software hazards that warrants emphasis is the role of software interactions (as opposed to software faults) in creating anomalous system
behaviors. Just as there is a hierarchical structure for hardware systems,
subsystems, components and devices, there is a hierarchy of software structure
that reaches from the system-level software architecture down through layers of
software (applications, operating systems, tools) to the lowest level of a single
binary unit of execution, the software component instance. Figure B-4 illustrates
the four levels discussed in this Appendix in Taxonomy Sheets B-9 through B-11.
At the lowest level, Level 1 Binaries, software can fail to execute properly due to
latent faults that were introduced at other levels through tool errors or application
errors. At the higher levels of software structure, software may also fail to behave
properly as a result of interactions between software component instances that
separately appear to have no fault within them at all. Consequently, software
hazards are best assessed from the perspective of the potential of the software to
produce anomalous behaviors (“failures”) resulting both from software faults and
from software interactions.

Software failures, whether due to software faults or software interactions, may result in a wide variety of system effects depending on the intended function of
the software involved in the failure. The effects of software failures can be
grouped into the following types and sub-types:
 Fatal effect. In this case, the software is no longer able to execute. While a
software fault at the Binary level is one possible root cause of a software
failure with fatal effect, it is also possible for a software process that fails with
fatal effect to have no fault in it at all, but whose failure was caused by a fault
in another software process.
 Non-fatal effect. In this case, the failed software process continues to execute
with incorrect results. Non-fatal effects have two sub-types to consider:
- Non-plausible incorrect outcomes. If the incorrect result is no longer
within the range of expected outcomes, the existence of the failure can be
detected by monitoring logic.
- Plausible incorrect outcomes. The incorrect result may be completely
within the range of expected outcomes, but still be incorrect. Detection
and defense against software failures that can produce this class of effect
should be given emphasis in software hazard analysis.

Figure B-4 illustrates a hierarchy of software interaction and faults, which are
broken down and described in Taxonomy Sheets B-9 through B-11. In using
these Taxonomy Sheets, software failure modes and failure mechanisms should
be included with the hardware failure modes and mechanisms of components,
sub-systems and systems as additional sources of hazard. At the component level,
the lowest 3 levels of software failure taxonomy should be considered. At higher
levels (sub-system, system), the Level 1 Binaries issues may be deferred. The Level 4 Architecture issues, as well as the Level 2 and Level 3 issues, should be evaluated for
all subsystems and systems that interact with the proposed modifications, even if
no changes are proposed in the other interacting portions of the system.

Software-Specific Terminology in this Taxonomy

The following terms are used in this section of Appendix B, specific to the
software interactions and faults described in Taxonomy Sheets B-9 through B-
11. Their definitions are provided in Section 1, and are repeated here for
convenience:

Anomaly: Anything observed in the operation of software that deviates from expectations based on previously verified software behaviors. (Reference 2)

Behavior: The evolution of the input, processing and output states of a digital
computing system over time. By decomposition, the evolution of the states of a
subsystem or component over time. Some of the meaning of this term is similar
to the use of the term Function, as in functional requirements or function
decomposition.

Error. (1) The difference between a computed, observed, or measured value or
condition and the true, specified, or theoretically correct value or condition. For
example, a difference of 30 meters between a computed result and the correct
result. (2) An incorrect step, process, or data definition. For example, an incorrect
instruction in a computer program. (3) An incorrect result. For example, a
computed result of 12 when the correct result is 10. (4) A human action that
produces an incorrect result. For example, an incorrect action on the part of a
programmer or operator. (Reference 2)

Failure: The inability of a system or component to perform its required functions within specified performance requirements. Note: The fault tolerance discipline
distinguishes between a human action (a mistake), its manifestation (a hardware
or software fault), the result of the fault (a failure), and the amount by which the
result is incorrect (the error). (Reference 2)

Fault: (1) A defect in a hardware device or component; for example, a short circuit or broken wire. (2) An incorrect step, process, or data definition in a
computer program. Note: This definition is used primarily by the fault tolerance
discipline. In common usage, the terms “error” and “bug” are used to express this
meaning. (Reference 2)

Fatal Error: An error that results in the complete inability of a system or component to function. (Reference 2)

Insertion Mechanism: For faults, the pathway of processes and conditions that
resulted in the presence of the fault, but not its discovery. Insertion mechanisms
are often linked to the stages of the development and production process (e.g.,
design, tool behavior, etc.)

Non-Fatal Fault: A software fault that allows program execution to continue, but
with incorrect behavior.

Non Plausible Outcome Failure: A non-fatal fault with output errors that do not
satisfy output expectations or specifications (i.e., a form of soft failure).

Plausible Outcome Failure: A non-fatal fault with output that appears to satisfy
output expectations but contains errors (i.e., a form of soft failure).

Software Hazard: A process or resulting outcome that has the potential under at
least some conditions to result in an unplanned event or series of events causing
damage to equipment or the environment and/or death, injury or illness to
personnel. Hazards may be graded by the extent of the damage and injury
potential.

(Figure B-4 is a block diagram showing four levels: Level 4 – System Architecture (multiple computing platforms, each with CPU, memory, bus, and input/output, connected together); Level 3 – Application and Operating System Software (requirements, design, and implementation producing application and operating system source code, supported by development tools); Level 2 – Tools (compiler and loader); and Level 1 – Binaries (object code executing on a CPU, memory, bus, and input/output platform).)

Figure B-4
Hierarchy of Software Interactions & Faults

Sheet B-9a: Level 1 (Binaries) Interactions & Faults

(Figure: a Level 1 binary executing on the host platform — CPU, memory, bus, and input/output devices.)

Level 1 (Binary) interactions occur while the binary "object code" of a unit of software is being executed on the host platform (CPU, memory, bus, and I/O devices). This is the most primitive level of software failure or behavior anomaly. The effects of faults at this level are bounded by effects on a single process, which may halt (fatal effect) or continue to execute incorrectly (non-fatal effect). There are many pathways for software faults to occur at this level in the absence of a hardware failure.

Failure Mode: Process Halt with Exception (Fatal)
Failure Mechanisms: Hardware fault; Compiler/Loader error; Application software fault; Architecture error
Defensive Measures: Hardware defensive measures. Compiler/Loader validation procedures. Compiler/Loader testing on target hardware. Application validation procedures. Application testing on target hardware. Diversity of applications. Architecture testing on target hardware. Ensure user visibility into application outcomes. Ensure user visibility into architecture outcomes. Ensure a safe user restart procedure.

Failure Mode: Process Halt without an Exception (Fatal)
Failure Mechanisms: Hardware fault
Defensive Measures: Hardware defensive measures.

Failure Mode: Process Indefinite Loop (Non-fatal)
Failure Mechanisms: Compiler/Loader error; Application software error
Defensive Measures: Compiler/Loader validation procedures. Compiler/Loader testing on target hardware. Application validation procedures. Application testing on target hardware. Diversity of applications. Ensure user visibility into application outcomes. Ensure a safe user restart procedure. Ensure a safe user process termination procedure.

Failure Mode: Arithmetic Logic Unit Error (Non-fatal)
Failure Mechanisms: Hardware fault
Defensive Measures: Hardware defensive measures. Application testing on target hardware. Diversity of applications. Ensure user visibility into application outcomes. Ensure user visibility into architecture outcomes. Ensure a safe user process termination procedure.

Failure Mode: Digital Input Error (Non-fatal)
Failure Mechanisms: Hardware fault; Device driver fault
Defensive Measures: Hardware defensive measures. Application testing on target hardware. Diversity of applications. Architecture testing on target hardware. Ensure user visibility into application outcomes. Ensure user visibility into architecture outcomes.

Failure Mode: Digital Output Error (Non-fatal)
Failure Mechanisms: Hardware fault; Device driver fault
Defensive Measures: Hardware defensive measures. Application testing on target hardware. Diversity of applications. Architecture testing on target hardware. Ensure user visibility into application outcomes. Ensure user visibility into architecture outcomes.

Sheet B-9b: Level 1 (Binaries) Description

All software, whether developed as an application or as an element of supporting software (for example, operating systems and device drivers), is only executable in
binary form. At the start of the computer age, software was written in binary
form directly to the program memory of the computer. As computers matured,
the creation of software has become more and more removed from the binary
executable form. The fact remains however that all software faults are manifested
at the binary execution level.

The earlier discussions of the CPU (Sheet B-1) and Memory (Sheets B-2 and B-
3) provide a hardware view of digital system faults. To frame the faults and
failure modes of software, it is important to understand how software can execute
incorrectly at the binary level even with no hardware failure. To illustrate this
idea, a Von Neumann CPU architecture is described, and although this is a very
common architecture, it is not the only type of computing hardware.

A software component instance is composed of binary information stored in one


or more blocks of memory in hardware that includes the machine-level program
instructions of the software component, its symbolic reference maps of its data
and the machine level representations of the data being manipulated by the
instructions. Taxonomy sheets B-10 through B-12 will discuss how the software
component instance was created and loaded into memory, and the potential
software faults that might result from errors at higher levels.

Normally, the software component instance is accompanied by a small amount of


memory that defines the memory address ranges for the program and its data
storage allocation—its manifest. This information is used by the CPU to step
through the binary instructions of the program. Within the CPU is a special
register called the Program Counter, which stores the memory address of the
next instruction to be fetched. To start execution, the Program Counter is set to
the address of the first instruction, and the instruction is fetched from memory to
the registers of the CPU. The CPU logic executes the instruction, which may
fetch and change data and set a new value for the Program Counter. This allows
the program to change the values of data and to change its execution sequence
(logical branches and loops).
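To make the fetch-and-execute cycle concrete, the following C sketch shows a highly simplified processing loop driven by a Program Counter, including a monitor-style check that the counter stays within the legal program area. The instruction encoding, opcodes and memory size are illustrative assumptions, not those of any particular platform.

    #include <stdint.h>

    #define MEM_WORDS 1024u

    /* Illustrative instruction encoding: 8-bit opcode, 24-bit operand (assumed for this sketch). */
    enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3, OP_JUMP = 4 };

    static uint32_t memory[MEM_WORDS];               /* program and data share one address space */

    /* Simplified fetch-decode-execute loop driven by a Program Counter (pc). */
    void run(uint32_t pc, uint32_t prog_end)
    {
        uint32_t acc = 0;                            /* accumulator register */
        for (;;) {
            if (pc >= prog_end)                      /* monitor: PC left the legal program area */
                return;                              /* treat as an exception and halt this instance */
            uint32_t word = memory[pc++];            /* fetch the next instruction and advance the PC */
            uint32_t op  = word >> 24;
            uint32_t arg = word & 0x00FFFFFFu;
            switch (op) {                            /* decode and execute */
            case OP_HALT:  return;
            case OP_LOAD:  acc = memory[arg % MEM_WORDS];  break;
            case OP_ADD:   acc += memory[arg % MEM_WORDS]; break;
            case OP_STORE: memory[arg % MEM_WORDS] = acc;  break;
            case OP_JUMP:  pc = arg;                 break;  /* branches simply rewrite the PC */
            default:       return;                   /* illegal opcode: halt (exception) */
            }
        }
    }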

Most CPUs have built-in “monitor” logic that uses the manifest to check the
Program Counter and any requests to fetch data from memory to verify that the
Program Counter is getting its next instruction from a legal program area of
memory and that a data request is fetching from a legal data area of memory. If
the monitor detects an incorrect value for either of these, it can generate an
exception and halt processing for that software component instance. Another
example of execution monitoring is a watchdog timer. This logic looks for the
last instruction of the software component instance as defined by its manifest,
and if it does not see this endpoint instruction within a defined interval of time on
the CPU clock, it raises an exception.
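The watchdog concept can be expressed as a small sketch of the monitoring pattern; the tick counter, interval and exception hook below are assumed placeholders for platform-specific services, not an actual product interface.

    #include <stdint.h>

    static uint64_t wdt_deadline;                    /* time by which the endpoint must be reached */
    static const uint64_t WDT_INTERVAL = 50000u;     /* allowed ticks between kicks (assumed value) */

    extern uint64_t clock_ticks(void);               /* platform-specific tick counter (assumed) */
    extern void raise_exception(int code);           /* platform-specific exception hook (assumed) */

    /* Called when the software component instance reaches its defined endpoint instruction. */
    void wdt_kick(void)
    {
        wdt_deadline = clock_ticks() + WDT_INTERVAL;
    }

    /* Called periodically by the monitor logic; raises an exception if the deadline has passed. */
    void wdt_poll(void)
    {
        if (clock_ticks() > wdt_deadline)
            raise_exception(1);                      /* endpoint not reached in time */
    }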

Exceptions found during execution can be handled in several ways. Two of the
most common actions are (1) to restart the software component instance by
resetting the Program Counter to its first instruction; or (2) to allow another
running software component instance (e.g., the operating system) to handle the
exception and determine what action is appropriate.

A software component instance can receive data from input devices and send data
to output devices and storage devices through its data memory. In some CPU
instruction sets, any area of data memory can be used as an input or output. In
other CPUs, special regions of memory are designated for input and output.
Errors in input values can cause faulty program execution, and similarly, errors in
program execution may cause errors in output values.
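As an illustration of designated input and output regions, the following C fragment reads a digital input through a memory-mapped register; the register address, valid bit and error handling are assumptions made for this example only.

    #include <stdint.h>

    #define INPUT_REG_ADDR  0x40001000u              /* hypothetical address in the designated I/O region */
    #define INPUT_VALID_BIT (1u << 31)               /* hypothetical "data valid" flag */

    /* volatile tells the compiler the value can change outside program control,
       so every read really goes to the device rather than to a cached copy. */
    static volatile uint32_t * const input_reg = (volatile uint32_t *)INPUT_REG_ADDR;

    int read_digital_input(uint32_t *value_out)
    {
        uint32_t raw = *input_reg;                   /* fetch from the I/O region of memory */
        if ((raw & INPUT_VALID_BIT) == 0)
            return -1;                               /* input error: do not pass stale data downstream */
        *value_out = raw & ~INPUT_VALID_BIT;
        return 0;
    }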

Sheet B-10a: Level 2 (Tools) Interactions & Faults

[Diagram: Compiler and Loader tools feeding a processing unit with CPU, Memory and Input/Output devices connected by a Bus]

Level 2 (Tool) interactions occur during the creation of the binary “object code” of a unit of software from its source code, or during its transfer (loading) into the memory of the hardware platform. The effects of faults at this level are not bounded by effects on a single process, and may create fatal or non-fatal effects on unrelated units of application software.

There are many types of software that can be affected by errors at this level, including device drivers, operating systems and applications.

Failure Mode                    Failure Mechanism               Defensive Measures
Compiler fatal Compiler/Loader Compiler/Loader validation procedures
translation error error Compiler/Loader testing on target
hardware
Application testing on target hardware
Diversity of applications
Compiler non-fatal Compiler/Loader Compiler/Loader validation procedures
translation error error Compiler/Loader testing on target
hardware
Application testing on target hardware
Diversity of applications
Loader program Compiler/Loader Compiler/Loader validation procedures
error (fatal) error Compiler/Loader testing on target
hardware
Application testing on target hardware
Diversity of applications
Loader data error Compiler/Loader Compiler/Loader validation procedures
(non-fatal) error Compiler/Loader testing on target
hardware
Application testing on target hardware
Diversity of applications

Sheet B-10b: Level 2 (Tools) Description

Early efforts to program computers directly in binary quickly revealed how time
consuming and error prone this approach was. The development of assemblers,
compilers and loaders allowed the creation of binary executables from higher-
level computer languages. Assemblers and compilers are used to generate a
software component object code file. A loader is used to place the instructions in
the object code file into their correct memory locations on the host hardware
platform to create the software component instance. These tools are computer
programs themselves, and therefore have all of the hazards at the execution level
as well as new mechanisms for introducing faults into the binary executable code
of the software component instance.

The most primitive tool is an Assembler, which supports the implementation of


the object code for a specific CPU instruction set. It represents the binary
instructions as hexadecimal expressions, and performs some error checking as
well as supporting in-line design notes (“comments”) that do not get translated
into machine instructions. The main advantage of an Assembler is that the
memory addresses for program and data do not have to be expressed directly as
fixed memory addresses on a specific machine, but instead as relative addresses.
Assembled object code is normally much more efficient than object code created
by a compiler, but developing large bodies of software this way is extremely time
consuming and can be error prone. Small software components with very
demanding execution time requirements are often built using an Assembler.
Device drivers and the kernels of operating systems are common examples.

Compilers were developed to enable the use of higher order programming


languages that were no longer directly coupled to a specific CPU instruction set.
A compiler reads in the higher order language expressions (“source code”) and
translates them into sections of object code for a specific instruction set. The
same source code can be compiled for different CPU instruction sets using the
appropriate compiler for each. In modern use, a compiler may also use specific
services expected to be provided by an operating system. As a result, most
compilers in use today generate object code that is both specific to an instruction
set and to an operating system. If the operating system changes, the source code
may need to be re-compiled with an updated compiler in order to generate a
correct object code file.

Compilers normally contain many error detection features to trap incorrect


source code expressions. However, the complexity of the particular higher order
computer language and the complexity of the logic within the source code itself
can be challenging to a compiler. Although many higher order languages have
standard formal specifications, not all compilers are completely successful at
implementing the standards. Specific versions of compilers frequently have
design limitations that are addressed through training and documentation in
their user communities.

Some widely used modern languages like C and C++ provide relatively weak
compile-time checking: just because the source code compiles into a set of object code
without errors or warnings does not mean that the code is error free, or will even
execute at all. To attack this problem and achieve trusted object code generation,
the Department of Defense developed the Ada language and compiler. The
verification and validation of this language and its family of compilers took over
10 years and many hundreds of millions of dollars. In the end, the DoD
dropped its Ada mandate because it proved difficult to get
software developers to learn and use the language, and the approach became
economically unviable.

Once a software component object code file is created by an assembler or


compiler, it can be loaded into the memory of a specific computing platform by a
Loader. The loader may be used to create a static load at install time, or it may be
used as a part of the operating system to dynamically load object code files into
memory throughout the execution process. The loader determines the starting
addresses for the program instructions, determines the allocations of memory for
data, resolves all of the external references to other software components and
operating system services (“linking”), creates the manifest and copies the machine
instructions and manifest to the correct memory locations on the platform. Note
that one copy of the software component object code in mass persistent memory
can be used by the Loader to make many instances of the same component in
different blocks of memory.
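The manifest can be pictured as a small load record such as the following C sketch; the field names and layout are illustrative assumptions rather than the format of any particular loader.

    #include <stdint.h>

    /* Illustrative "manifest" written by a loader for one software component instance. */
    typedef struct {
        uint32_t text_start;    /* first address of the relocated program instructions */
        uint32_t text_end;      /* last legal instruction address (the endpoint checked by monitors) */
        uint32_t data_start;    /* start of the data allocation for this instance */
        uint32_t data_end;      /* end of the data allocation */
        uint32_t entry_point;   /* initial value for the Program Counter */
    } manifest_t;

    /* Monitor-style check: is this Program Counter value inside the legal program area? */
    static inline int pc_is_legal(const manifest_t *m, uint32_t pc)
    {
        return pc >= m->text_start && pc <= m->text_end;
    }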

Sheet B-11a: Level 3 (Application & OS Source Codes)
Interactions & Faults

[Diagram: software development flow from Requirements to Design to Implementation, producing Application and Operating System software source codes]

Level 3 (Software Source Code) faults are created during the specification, design and implementation of a unit of software as source code. Historically, most software behavior anomalies can be traced to errors at the source code level. Factors that influence the introduction of source code faults are the complexity of the software, the process maturity of the developing organization and the skills of the individual personnel.

Even perfectly written source code may have faults introduced at the lower levels.

Failure Mode Failure Mechanism Defensive Measures


Incorrect Application software Application validation procedures
requirements error (non-fatal) Application testing on target
statement hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user process
termination procedure
Unspecified Application software Application validation procedures
behavior error (non-fatal) Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user process
termination procedure

Flawed design Application software Application validation procedures
allocation error (non-fatal) Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Flawed interface Application software Application validation procedures
definition error (non-fatal) Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Flawed logic or Application software Application validation procedures
algorithm design error (non-fatal) Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Flawed interface Application software Application validation procedures
implementation error (fatal or non-fatal) Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Severely flawed Application software Application validation procedures
logic or algorithm error (fatal or non-fatal) Application testing on target
implementation hardware
Diversity of applications
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes

Mildly flawed logic Application software Application validation procedures
or algorithm error (non-fatal) Application testing on target
implementation hardware
Diversity of applications
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user process
termination procedure
Inappropriate but Application software Application validation procedures
allowed use of error (non-fatal) Application testing on target
language constructs hardware
Diversity of applications
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user process
termination procedure
Configuration User error (fatal or non- Application testing on target
parameters out of fatal) hardware
bounds Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user process
termination procedure
Oversight and review of
parameter settings

Sheet B-11b: Level 3 (Application & OS Source Codes)
Description

The development of application software and operating systems involves logic


design and use of higher order computer language primitives to write the source
code, which is then compiled and loaded into computer systems. This activity is
the largest use of human effort in the software industry. To improve the success
rates for the development of application software and operating systems, strong
emphasis has been placed on procedures for managing the software development
life cycle. There are differences in the sequencing and timing of the stages of
software development across the many popular models for the software
development life cycle. However, most descriptions of the software development
life cycle include at least the stages for requirements definition, design,
implementation and test.

Requirements definition and analysis remains a persistent weakness of the


software industry. The dominant approach is to write natural language statements,
often of the form “The system shall perform this behavior subject to these conditions”.
A large number of such statements can be written for a complex system, and it is
typical that the resulting body of requirements is not consistent and not
complete, despite the best professional efforts of all concerned. First, natural
language is inherently ambiguous, and this can only be partly offset by careful
use of standard dictionaries. Natural language works in speech because the
context of the current speech act is generally known to both the speaker and the
listeners (which is why gossip and second-hand accounts are so easily distorted). But in
writing, the contextual meaning is easily lost when small statements are taken out
of their full original context. Second, the set of requirements statements can’t be
tested for logical closure. Logical closure means the following:

Let B be the set of all possible system behaviors. Let B* be the set of
desired behaviors that are required by a set of requirements statements,
R, and let B^ be the set of behaviors that are prohibited by R. R is said
to be logically closed if it can be shown that B* ∪ B^ = B.
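Expressed in set notation, the closure condition and the gap it leaves can be written as follows (a restatement of the definition above, not an additional requirement):

    \[
    R \text{ is logically closed} \iff B^{*} \cup B^{\wedge} = B ,
    \qquad
    G = B \setminus \left( B^{*} \cup B^{\wedge} \right) \text{ is the set of unstated behaviors.}
    \]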

In most cases, the set R leaves a very big gap between what R explicitly states (B*
and B^) and the entire set B. The behaviors in B that fall outside B* ∪ B^ amount to
“unstated” requirements: behaviors that are desired but never specified, and
behaviors that are prohibited but never specified. This situation exists not just for the
application source code requirements, but also for the language implementation
(i.e., compilers, interpreters) and for the Operating System itself. The current
practice of reviewing requirements statements frequently with the stakeholders
during development has been marginally successful in reducing requirements
errors, which can be errors of omission (unspecified behaviors that should be
specified) and errors of commission (specified behaviors that are expressed
incorrectly).

Software design is the process of translating behavior requirements into specific


algorithms and data structures that are intended to produce the required
behaviors during execution of the software. This frequently involves estimating

the timing, accuracy and memory usage of alternative approaches. Common
errors introduced in design are underestimating the computing time and
maximum memory footprint of an algorithm and its related data structures,
failing to guard against out-of-range data values, and failing to ensure that the
software follows the expected execution paths. Over the past 20 years, the use of
standard design representations such as Unified Modeling Language (UML) and
the documentation of standard “design patterns” have helped reduce the variance
in the ways that the software is designed.
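As a simple illustration of guarding against out-of-range data values, a validation routine such as the following C sketch can reject an input before it reaches the control algorithm; the flow limits shown are assumed example values, not plant parameters.

    #include <math.h>

    #define FLOW_MIN_GPM      0.0                    /* assumed lower validity limit */
    #define FLOW_MAX_GPM 200000.0                    /* assumed upper validity limit */

    /* Returns 0 and writes a validated value, or -1 if the input must not be used. */
    int validate_flow_input(double raw_gpm, double *validated_gpm)
    {
        if (isnan(raw_gpm) || raw_gpm < FLOW_MIN_GPM || raw_gpm > FLOW_MAX_GPM)
            return -1;                               /* out of range: caller must take a safe fallback action */
        *validated_gpm = raw_gpm;
        return 0;
    }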

Implementation involves translating the design description into the expressions


of a specific computer language to create the source code. Most software
developers can only manage competency in a few languages, although many in
the profession are familiar with a larger set of languages and are aware of their
differences. Modern computer languages are rich in function, allowing many
ways to write source code to perform the same (or apparently the same)
behaviors. Each language has both an accepted style within its user community
and a programmer-specific set of language styles that are the result of individual
personal experience and biases. However, even a small change in the way the
source code is written can make a significant change in the way in which the
compiler creates the object code.
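A one-character example of this sensitivity is the legal but unintended use of assignment in a condition; the function below is an illustrative sketch with hypothetical names, showing two nearly identical source lines for which the compiler generates very different object code.

    #include <stdio.h>

    static void open_breaker(void) { printf("breaker opened\n"); }

    void respond(int channel_tripped)
    {
        /* Intended logic: act only when the channel has tripped. */
        if (channel_tripped == 1)
            open_breaker();

        /* One-character slip the language accepts: the condition is now an assignment,
           so it always evaluates true, open_breaker() runs unconditionally, and the
           flag itself is overwritten. */
        if (channel_tripped = 1)
            open_breaker();
    }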

Because inspections remain a weak means of detecting errors
introduced in requirements, design and implementation, software development
success can depend strongly on extensive testing, diagnostics and repair of the
source code. A limitation in testing complex software is the difficulty of
generating the complete set of test cases that exercise all of the logic paths in the
software. During test and “debugging” of the software, disciplined change
management and configuration control are important in preventing the
introduction of new errors during the repair of others.

Experience and case studies of software application development show that the
larger and more complex the software, the larger the number of latent faults it
likely contains. While some languages are simply more verbose than others, the
relationship between functional size, complexity and fault rates is well
established.

Sheet B-12a: Level 4 (System Architecture) Interactions &
Faults

[Diagram: two interconnected processing units, each with CPU, Memory and Input/Output devices on a common Bus]

Level 4 (Software Architecture) faults arise as a result of interactions between software units. While some Level 4 faults result from the propagation of data between faulty software units (Level 3 or lower faults), it is also possible to have Level 4 faults when all software units are operating normally (an absence of Level 3 or lower faults).

The types of fault at Level 4 depend strongly on the type of architecture, from strongly isolated to strongly coupled.

Failure Mode Failure Mechanism Defensive Measures


Faulty message Hardware fault Hardware defensive measures
syntax Compiler/Loader error Compiler/Loader validation
Application software procedures
fault in Sender Compiler/Loader testing on target
hardware
Application validation procedures
Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
architecture outcomes
Message saturation Hardware fault Hardware defensive measures
Application software Compiler/Loader validation
fault in Sender procedures
Compiler/Loader testing on target
hardware
Application validation procedures
Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user restart procedure

Ensure safe user process
termination procedure
Channel fatal Hardware fault Hardware defensive measures
outcome Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
architecture outcomes
Ensure safe user process
termination procedure
Input data out of Application software Application validation procedures
range fault in Sender Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user process
termination procedure
Input data in range Application software Application validation procedures
but incorrect fault in Sender Application testing on target
hardware
Diversity of applications
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user process
termination procedure
Required input data Hardware fault Hardware defensive measures
not received Application software Application validation procedures
fault in Sender Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes

Ensure user visibility into
architecture outcomes
Ensure safe user restart procedure
Ensure safe user process
termination procedure
Expected input data Hardware fault Hardware defensive measures
not received Application software Application validation procedures
fault in Sender Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user restart procedure
Ensure safe user process
termination procedure
Process Starvation Hardware fault Hardware defensive measures
Operating system error Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user restart procedure
Ensure safe user process
termination procedure
Process degradation Hardware fault Hardware defensive measures
Operating system error Application testing on target
hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes

Ensure safe user restart procedure
Ensure safe user process
termination procedure
Process resource Hardware fault Application testing on target
contention Operating system error hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user restart procedure
Ensure safe user process
termination procedure
Process incorrect Operating system error Application testing on target
termination hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user restart procedure
Faulty process Operating system error Application testing on target
execution hardware
Diversity of applications
Architecture validation procedures
Architecture testing on target
hardware
Ensure user visibility into
application outcomes
Ensure user visibility into
architecture outcomes
Ensure safe user process
termination procedure

Sheet B-12b: Level 4 (System Architecture) Description

Architectures and Inter-process Interactions - Digital instrumentation and


control systems (DCS) are made from many hardware and software components.
The arrangement of these components, both physically and functionally, and
their means of interacting defines the architecture of a DCS. The architecture of
a DCS can act both to reduce the effects of software hazards and failures and to
magnify and propagate failures to otherwise healthy functional components. There
are several ways that a group of software components can be arranged to form a
System. These different architectures result in different kinds of interactions
between separate lexical software components.

Federated System with Embedded Software - This system architecture connects


a set of separate processing units together, each processing unit having within it a
single specific software component that is permanently loaded into its address
space. While the processing units may be identical, the function of each
processing unit is determined at software load time and the load cannot be
changed during operations. Thus each processing unit is functionally distinct. It
is often the case in this type of architecture that the processing units can be quite
dissimilar, with different analog interfaces, processors, memory, speed, and mass
storage, with the only common aspect of any of the processing units being the
communications bus interface that connects the units together.

The embedded software component for a particular processing unit is normally


developed and tested on that processing unit only, and often has specific
dependency on the processor memory and data interfaces. While the embedded
software component is often developed using a higher order language and a
compiler, it is common that smaller software components may be hand coded in
machine language or built using an assembler for the target processor hardware.
The loader for an embedded software component is an external program that
statically defines the memory locations for the program and is used to load the
program into non-volatile memory, often Programmable Read Only Memory
(PROM).

The requirements definition, design, analysis and testing of a subsystem


processing unit of the federated architecture are normally done at the processing
unit level, followed by separate system integration testing in the full federated
environment. A System Integrator is responsible for the requirements for the
entire federated system. It is very rarely the case that the developer of the
processing unit is aware of the full set of requirements for the federated system
itself, and how those requirements have been allocated by the System Integrator
to other processing units within the architecture. The System Integrator in turn
may recognize that the set of requirements is incomplete and inconsistent, but is
unable to discover those gaps in the allocation of behaviors to the processing
units.

Core Processing Architecture - In this system architecture, a single processing


unit hosts a large set of application software components, each able to be loaded
and executed on demand to produce a very large set of behaviors on a single

processing unit. The core processing architecture uses a separate software
component, the Operating System (OS), to control what application software
components are loaded into memory and which of the loaded components are
executing at any given time. The core processing unit usually has many
hardware resources, such as analog in/out converters, video processing, network
communications devices, and mass storage devices that the OS can make
available to the application software processes by software interactions (“calls”)
with the OS software behaviors. Many modern Operating Systems support calls
directly between separately executing application software components, allowing
the separate applications to share data and to make calls on each other’s
functional behaviors. In addition to its loader functions, the OS normally will
have a Scheduler function that shifts execution dynamically between multiple
“running” applications so that the processing and peripheral resources of the core
processing unit are shared across the application software components over time
(“multi-processing”).

Developing application software components for a core processing architecture


requires careful definition of the underlying hardware of the core processing unit,
the Operating System and the language compiler that can create the object code
for the OS associated with the processing hardware. The applications
development community exerts very strong pressure to keep the number of
unique Operating Systems and core processing architectures small in order to
have a larger addressable marketplace. Operating system developers have been driven
toward ever tighter standards for the interface calls that the OS supports (e.g., the POSIX
standards). Similarly, there is great effort to support standards for data
communications and remote procedure calls between application software
components as well.
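As an illustration of this call-based access, the following C sketch obtains data from a device strictly through standard POSIX calls (open, read, close); the device path is hypothetical, and the application never touches the hardware directly.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int log_sample(const char *device_path)          /* e.g., a hypothetical "/dev/sensor0" */
    {
        char buf[64];

        int fd = open(device_path, O_RDONLY);        /* OS call: acquire the resource */
        if (fd < 0) {
            perror("open");                          /* the OS reports the failure to the caller */
            return -1;
        }

        ssize_t n = read(fd, buf, sizeof buf);       /* OS call: transfer data from the device */
        if (n > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(fd);                                   /* OS call: release the resource */
        return (n >= 0) ? 0 : -1;
    }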

Despite the evolution of standards and the economic incentives to increase the
performance and reliability of OS and applications, there are very few trusted OS
today. In some industry verticals, the development of a trusted OS and trusted
applications is so important that these systems are highly proprietary and not
available to other developers. The Operating Systems in wide use today are
known to have many weaknesses, and the pace of OS configuration changes and
patches is quite fast, leading to great pressure on the applications developers to
update their software for each OS change or be left out of the market.

The requirements for application software components in the core processing


environment are coupled to the OS and its ability to manage the interactions
with other software components running in the same OS. In many cases, the
application software requirements ignore completely the other processes that may
or may not be running, and focus strictly on a local view of behavior. This leads
to incompleteness since the requirements assume behaviors from the OS that
may not be validated.

Hierarchical Architectures - The Federated architecture can also be used to


connect segments of core processing, where groups of related software processes
are run in each separate core processing environment. The system functionality is
distributed across the federated segments of core processing. This architecture

limits the interactions across the segments of core processing, but allows rich
interactions within the segments.

Appendix C: Physical and Functional
Representations
A physical representation would typically include several of the following
elements:
 Digital System Components
- Controllers
- Input or Output modules
- Data Communication Modules
- Network Components
- Media Converters
- Power Supplies
- Workstations
- Servers
 Interfacing Components
- Other Controllers or Hand/Auto Stations
- Handswitches
- Limit or Position Switches
- Sensors
- Indicators
- Alarms
- Relays
- Firewalls or Data Diodes
 Connections
- Analog signals
- Digital signals
- Data communications
- Clock signals
- Power
- Grounds
- Maintenance ports (e.g., laptop connection point)
- Factory ports (i.e., used by the vendor only)
 Controlled Components
- Pumps
- Valves
- Breakers
- Switchgear
- Motor Control Centers
 Process Elements
- Reactor
- Heat Exchangers
- Tanks
- Steam Generators
- Pipes
- Flow Elements

A functional representation would typically include several of the following


elements:
 Possible Function Blocks
- Proportional, Integral, Derivative (PID)
- Lead, Lag, Lead/Lag
- Bistable
- Compare
- Sum
- Multiply or Divide
- Boolean Logic
- Switch
- Average
- Square Root Extractor
- Characterizer
 Possible System or Component States
- Normal, Abnormal or Emergency
- Plant Mode (as defined in Technical Specifications)
- Operable, Available
- Inoperable, Out of Service, Maintenance Mode
- On or Off
- Open or Closed
- Running or Stopped
- Energized or De-energized

Symbols used in a block diagram should follow the same conventions used for
other plant drawings, such as piping and instrumentation diagrams or electrical
elementary diagrams. For some components, state representation should be
drawn into the symbol that is used to represent the physical component. For
example, a relay contact can be represented in an FMEA block diagram as
normally closed, using the following symbol and a note:

Note 1: Relay contact shown as normally closed (de-energized)

Figure C-1
Relay Contact Symbol

Appendix D: Circulating Water System
(CWS) Top Down Analysis
Section 4, Figure 4-7 and Figure 4-8 in the body of this guideline introduce and
describe a distributed control system for a circulating water system at a nuclear
power plant. In this Appendix, the top down logic for relevant portions of the
circulating water system and the control system are developed along with a
discussion of the results.
The distributed control system has a direct impact on the availability of the
circulating water system in two ways:
 Response to the trip of a circulating water pump by automatically isolating the
affected pump (this prevents reverse flow through the tripped pump and an
even greater reduction in flow through the condenser than from just the loss
of the pump) and support for operator action to start and un-isolate one of the
idle circulating water pumps.
 Spurious actuation of circulating water equipment when not called upon to operate
(e.g., spurious closure of the circulating water pump discharge isolation valve).

A Top Down analysis is performed herein, focusing on these two functions.


The Top Down analysis takes advantage of fault trees, similar to those used in a
nuclear power plant PRA, but not to the same level of detail, nor requiring failure
rates for quantification.

Response to the trip of a circulating water pump

The top event in Figure D-1a represents insufficient circulating water flow
initiated by the trip of an operating circulating water pump. This figure is a partial
fault tree focusing on one train of circulating water (Train 1). Shown under gate
CWS-TR1-01 are the trip of Pump 1 or the spurious opening of the breaker for
the pump. System response to tripping of the pump would include automatic
isolation of the discharge valve, the logic for which is shown under gate G006.
Failure to isolate the failed pump could be a result of the discharge MOV failing
to close or failure of the plant control system to initiate a closure signal (Gate
G010 – developed further in Figure D-1b).
If a tripped pump were not to be isolated, the pump would coast down and
reverse flow through the pump would begin. The loss of the pump plus
additional diversion of flow roughly would be equivalent to the loss of two
circulating water pumps. Given the need for four pump flow to keep the plant at
full power, loss of only one additional circulating water pump is required before
inadequate circulating water flow would be expected to lead to a trip on low
condenser vacuum. Top down logic for loss of the additional pump is developed
under Gate G004, which considers loss of any of the remaining five pumps
(Pumps 2 through 6).

Recall that two of the circulating water pumps are in standby. Therefore, they
must be started by the operators in order to have sufficient circulating water flow
to avoid a plant trip on high condenser vacuum. Figure D-1a also presents this
logic for one of the standby pumps.

The logic for starting of a standby pump (Train 3) is presented under Gate CWS-
3. The loss of this pump train may occur due to failure of the pump to start,
failure of the breaker to close (both under Gate G002) or failure of the plant
control system to initiate the pump train (Gate G013 – developed further in
Figure D-1c). Similar logic is also developed for Train 4, the other standby train.

A pump, once operating, may become unavailable if it fails to run, the breaker fails
to remain closed or the discharge isolation valve fails to remain open. This logic is
presented under Gates CWS-TR3-01 and CWS-TR3-02 for Train 3. Logic
similar to this is developed for all five pumps, reflecting the possibility of any of
these trains failing given that initially they are running successfully.

In Figure D-1b, the top down logic for the plant control system is developed in
support of the system function to isolate a pump that has tripped. Failure to produce a signal to close
the discharge MOV for the tripped pump may be due to the digital input module
failing to sense that the pump breaker has opened (DI1), the communications
network failing to transmit this information from the I/O cabinet to the logic
cabinet and back (two network loops – Gate G015-A-FF), the master logic
controller failing to interpret the information correctly and provide an output
signal to close the valve or due to the digital output module failing to provide a
signal to close the MOV. With respect to the master logic controller (Gate
G039), its failure is backed up by a slave controller. Loss of this backup source of a
closure signal to the discharge MOV could be due to failure of the watchdog timer
which monitors the status of the master logic controller, failure of the slave
controller itself or loss of the two communications networks. Note that the master
controller and the slave controller are in separate divisions of the plant control
system. For a controller to transmit information to the MOVs in the opposite
division through a given communication loop, loss of any of the four
communications modules in that loop fails the communications loop as shown
under Gate G037. For a controller to transmit information to the MOVs in the
same division through a given communication loop, only the communications
units in that division within that loop can contribute to loss of communication to
the MOVs from the controller (as shown under Gate G015-A-FF).

In Figure D-1c, the top down logic for opening an MOV on a standby pump and
starting the pump is shown. Failure to initiate a standby pump in the event that an
operating pump trips leaves the plant with insufficient circulating water flow to
maintain full power operation. The action to start a standby pump is modeled as

an action that the operators take in response to the tripped pump. Failure to start
the standby pump can occur if the operators do not take action in time (event
CWS-PMOA-OPENMOV in Figure D-1c), the workstations and
communication network loops do not transmit the operator’s signal to the I/O
cabinets or the digital output device does not pass the signal on to the MOV
circuitry.

The top down logic presented in Figure D-1a represents the different means of
failing to isolate a single train of circulating water (Train 1) should the pump in
that train trip during plant operation. The logic focuses on the mechanical
equipment that make up the pump train. Similar top down logic exists for all six
circulating water pump trains.

The top down logic presented in Figure D-1b represents the plant control system
as it is required to produce an automatic isolation signal for the discharge MOV
for a pump that has tripped. The pump train represented in the logic is again
pump train 1. Similar logic has been developed for all six circulating water pump
trains. Figure D-1c presents top logic for the plant control system as it is needed
to start a standby pump manually. There are two standby trains associated with
the example circulating water system and similar top down logic has been
developed for both.

Spurious actuation of circulating water equipment when not called upon to


operate

Isolation of an operating circulating water pump train can occur for several
reasons: in response to a trip of the circulating water pump in that train, spurious
opening of the breaker to the pump, spurious closure of the discharge valve for the
pump or initiation of a spurious signal to close the discharge valve. Figure D-2a
presents the top down logic for the first three of these failures, while the logic for
the spurious signal is shown in Figure D-2b. Figure D-2a also shows logic for
starting a standby pump (Gate G013). This is the same logic that was developed
above under Figure D-1c.

The distributed control system contribution to spurious isolation of a circulating


water train is shown in Figure D-2b. To generate a spurious signal, the output
module may generate an inadvertent isolation signal. In addition, given that the
shelf state of the output module is to open, then loss of communication input to
the module also will lead to an inadvertent closure signal to the discharge MOV.
Loss of both communications loops is required to lead to this condition.

While the attached logic is for pump train 3, similar logic is developed for all six
pumps. The logic for starting a pump is applicable only to the two standby pumps
(Trains 3 and 4).

Results

On development of the top logic for each of the six circulating water pump trains,
the logic is combined in a manner that reflects the success criterion for the
circulating water system. Figure D-3 presents this logic.

As noted earlier, flow through the condenser equivalent to that for four circulating
water pumps is assumed to be required to support full power operation. If an
operating circulating water pump were to trip and not be isolated successfully, the
loss of flow from the pump plus the reverse flow through the affected pump train
is equivalent to loss of flow from two pumps. This means that loss of one
additional pump is all that is necessary to reduce circulating water flow to the
point it can no longer support full power operation. The logic under Gate CWS-
TOP-FF reflects this criterion. Note that the gate has as input the logic for only
the four operating trains. As the two standby trains are not in service, they cannot
contribute to the loss of circulating water flow other than to fail to start and run
in response to loss of one of the other operating trains.

If an operating circulating water pump were to isolate inadvertently, then two spare


trains of circulating water are available to make up for the reduction in flow. In
this situation three of the six trains must fail to reduce circulating water flow to
the point that full power operation cannot be supported. The logic under Gate
CWS-TOP-SS reflects this logic.

The two sets of top logic are combined and the logic reduced to identify the
combinations of failures (cut sets) that will result in the circulating water system
not being able to support full power operation.

As flow from the equivalent of four trains of circulating water is needed, it is to be
expected that the bulk of the combinations consist of three failures (i.e., pumps fail
to run, breakers fail to remain closed, or discharge MOVs fail to remain open, in
combinations of three). A number of these cut sets are shown in Table D-1.
However, it can be seen that there are approximately twenty cut sets that consist of
only pairs of failures. Many of these twenty pairs include components from the
plant control system.

The first eight pairs of failures in Table D-1 contain only communication module
failures. These pairs of failures come from the spurious actuation top logic (Gate
CWS-TOP-SS). Total loss of communications for an entire division of
circulating water can occur if a communication module in each of the two
communications loops in that division were to fail. This leads to no input to the
digital output modules for that division. Under these conditions, the discharge
isolation valves for all three pumps in the affected division close leaving only the
three pumps in the unaffected division. As the plant is assumed to require four
circulating water pumps to support full power operation, loss of the pairs of
communications modules results in insufficient circulating water pump flow.
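The gate structure just described can be illustrated with a small C sketch that evaluates loss of Division A communications from basic-event states; this is a simplified rendering of the OR gates G016-A and G017-A feeding the division-level AND gate, not the fault tree quantification itself.

    #include <stdbool.h>
    #include <stdio.h>

    /* Basic events for Division A, mirroring the identifiers in Table D-1. */
    struct div_a_events {
        bool io_a_comm1;   /* CWS-CMFF-I/OA-COMM1 */
        bool lc_a_comm1;   /* CWS-CMFF-LCA-COMM1  */
        bool io_a_comm2;   /* CWS-CMFF-I/OA-COMM2 */
        bool lc_a_comm2;   /* CWS-CMFF-LCA-COMM2  */
    };

    /* OR gates: failure of either module fails its communication network (G016-A, G017-A). */
    static bool network1_lost(const struct div_a_events *e) { return e->io_a_comm1 || e->lc_a_comm1; }
    static bool network2_lost(const struct div_a_events *e) { return e->io_a_comm2 || e->lc_a_comm2; }

    /* AND gate: both Division A networks must be lost to lose all division communication. */
    static bool division_a_comms_lost(const struct div_a_events *e)
    {
        return network1_lost(e) && network2_lost(e);
    }

    int main(void)
    {
        /* One of the two-event cut sets from Table D-1:
           CWS-CMFF-I/OA-COMM2 together with CWS-CMFF-LCA-COMM1. */
        struct div_a_events e = { .io_a_comm1 = false, .lc_a_comm1 = true,
                                  .io_a_comm2 = true,  .lc_a_comm2 = false };
        printf("Division A communications lost: %s\n",
               division_a_comms_lost(&e) ? "yes" : "no");
        return 0;
    }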

Four of the remaining cut sets consisting of pairs of failures include a digital
output module failure combined with failure of the operators to initiate the

standby trains of circulating water in time to avoid a low condenser vacuum trip.
These failures also come from the spurious actuation top logic. Loss of a single
digital output module results in a false isolation signal to the discharge isolation
MOV in the affected pump train. As only three pumps are now providing
circulating water flow, starting of one of the standby trains is required. Failure of
the operators to initiate one of the standby trains in time results in the circulating
water flow not being able to support full power operation.

Other plant control system components appear with hardware and I&C failures in
combinations of three or more. These components include digital input modules,
the master controller, slave controller and operator workstations. That these
components require multiple additional failures before they can lead to conditions
in which the plant cannot operate at full power reflects the fact that there are two
spare circulating water pump trains and the operators can initiate the standby
trains to mitigate loss of these components.

Table D-1
Combinations of Failures (Cut Sets) Leading to Loss of Circulating Water

CWS-CMFF-I/OA-COMM2 CWS-CMFF-I/OA-COMM1
CWS-CMFF-I/OA-COMM2 CWS-CMFF-LCA-COMM1
CWS-CMFF-I/OB-COMM2 CWS-CMFF-I/OB-COMM1
CWS-CMFF-I/OB-COMM2 CWS-CMFF-LCB-COMM1 Failure of pairs of
CWS-CMFF-LCA-COMM2 CWS-CMFF-I/OA-COMM1 communication units
CWS-CMFF-LCA-COMM2 CWS-CMFF-LCA-COMM1
CWS-CMFF-LCB-COMM2 CWS-CMFF-I/OB-COMM1
CWS-CMFF-LCB-COMM2 CWS-CMFF-LCB-COMM1
CWS-PMOA-OPENMOV CWS-CBCO-CB-01
CWS-PMOA-OPENMOV CWS-CBCO-CB-02
CWS-PMOA-OPENMOV CWS-CBCO-CB-05
CWS-PMOA-OPENMOV CWS-CBCO-CB-06
CWS-PMOA-OPENMOV CWS-IOSS-I/OA-D01 Digital output device failure in
CWS-PMOA-OPENMOV CWS-IOSS-I/OA-D02 combination with failure of
CWS-PMOA-OPENMOV CWS-IOSS-I/OB-D02 operator action to start a
standby pump
CWS-PMOA-OPENMOV CWS-IOSS-I/OB-D03
CWS-PMOA-OPENMOV CWS-MVOC-MO-01
CWS-PMOA-OPENMOV CWS-MVOC-MO-02
CWS-PMOA-OPENMOV CWS-MVOC-MO-05
CWS-PMOA-OPENMOV CWS-MVOC-MO-06
CWS-PMOA-OPENMOV CWS-PMFR-P1
CWS-PMOA-OPENMOV CWS-PMFR-P2
CWS-PMOA-OPENMOV CWS-PMFR-P5
CWS-PMOA-OPENMOV CWS-PMFR-P6
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-CBCO-CB-01
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-CBCO-CB-06
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-IOSS-I/OB-D03
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-MVOC-MO-06
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-PMFR-P1
CWS-CBCO-CB-02 CWS-CBCO-CB-04 CWS-PMFR-P6
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-CBCO-CB-01
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-CBCO-CB-04
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-CBCO-CB-06
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-CBOC-CB-04
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-IOFF-I/OA-D04
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-IOSS-I/OB-D01
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-IOSS-I/OB-D03
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-MVOC-MO-04
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-MVOC-MO-06
CWS-CBCO-CB-02 CWS-CBCO-CB-05 CWS-PMFR-P1

[Fault tree diagram: top event CWS-TR1-FLDIV, loss of CWS flow due to flow diversion through pump train 1, developed through Gates G003/G004 (flow diversion and loss of an additional train), G006/G010 (failure to isolate CWS Pump 1) and G001/G002/G011/G013 (failure to start and align a standby pump train).]

Figure D-1
a) Top down logic for response to trip of an operating circulating water pump

[Fault tree diagram: Gate G010, PCS does not automatically isolate CWS Pump 1, developed through digital input/output device failures, loss of the Division A communication networks (Gates G015-A-FF, G016-A, G017-A) and logic controller failures (Gates G039, G035, G037, watchdog timer).]

Figure D-1 (continued)
b) Top down logic for response to trip of an operating circulating water pump

[Fault tree diagram: Gate G013, failure to start a standby circulating water pump, developed through output module failure (CWS-IOFF-I/OA-D03), operator failure to start the standby pump (CWS-PMOA-OPENMOV) and loss of the Division A communication loops and workstations (Gates G015-A-FF-MAN, G087, G090).]

Figure D-1 (continued)
c) Top down logic for response to trip of an operating circulating water pump

[Fault tree diagram: Gate CWS-3, failure of circulating water pump train 3, developed through failure to run, spurious closure of the discharge valve (Gate G009-3) and failure to open the discharge valve and start the standby pump (Gates G001, G002, G011, G013).]

Figure D-2
a) Top down logic for loss of circulating water system due to spurious trips

[Fault tree diagram: Gate G009-3, spurious signal to close the pump train 3 discharge valve, developed through a spurious output module signal (CWS-IOSS-I/OA-D03), breaker failure and loss of both Division A communication networks (Gates G015-A-SS, G016-A, G017-A).]

Figure D-2 (continued)
b) Top down logic for loss of circulating water system due to spurious trips

[Fault tree diagram: top event CWS-TOP, loss of the circulating water system, combining loss due to failure to isolate a tripped pump (Gate CWS-TOP-FF, flow diversion through trains 1, 2, 5 and 6) and loss due to spurious actuations (Gate CWS-TOP-SS, failure of pump trains 1 through 6).]

Figure D-3
Top down logic for loss of circulating water system

[System diagram within the analysis boundary: Logic Cabinets A and B (master and slave controllers, each with COMM 1 and COMM 2 modules and programmed to control all six valves), I/O Cabinets A and B with digital input/output devices, 4 kV supplies, six pump trains with discharge MOVs, condensers and cooling towers (normal operation: two valves open in each basin). Annotations identify pairs of communications modules, and digital output devices in combination with the operator action to start a standby pump, as potential dominant contributors.]

Figure D-4
Potential dominant contributors to circulating water system failure

Export Control Restrictions

Access to and use of EPRI Intellectual Property is granted with the specific understanding and requirement that responsibility for ensuring full compliance with all applicable U.S. and foreign export laws and regulations is being undertaken by you and your company. This includes an obligation to ensure that any individual receiving access hereunder who is not a U.S. citizen or permanent U.S. resident is permitted access under applicable U.S. and foreign export laws and regulations. In the event you are uncertain whether you or your company may lawfully obtain access to this EPRI Intellectual Property, you acknowledge that it is your obligation to consult with your company’s legal counsel to determine whether this access is lawful. Although EPRI may make available on a case-by-case basis an informal assessment of the applicable U.S. export classification for specific EPRI Intellectual Property, you and your company acknowledge that this assessment is solely for informational purposes and not for reliance purposes. You and your company acknowledge that it is still the obligation of you and your company to make your own assessment of the applicable U.S. export classification and ensure compliance accordingly. You and your company understand and acknowledge your obligations to make a prompt report to EPRI and the appropriate authorities regarding any access to or use of EPRI Intellectual Property hereunder that may be in violation of applicable U.S. or foreign export laws or regulations.

The Electric Power Research Institute, Inc. (EPRI, www.epri.com) conducts research and development relating to the generation, delivery and use of electricity for the benefit of the public. An independent, nonprofit organization, EPRI brings together its scientists and engineers as well as experts from academia and industry to help address challenges in electricity, including reliability, efficiency, affordability, health, safety and the environment. EPRI also provides technology, policy and economic analyses to drive long-range research and development planning, and supports research in emerging technologies. EPRI’s members represent approximately 90 percent of the electricity generated and delivered in the United States, and international participation extends to more than 30 countries. EPRI’s principal offices and laboratories are located in Palo Alto, Calif.; Charlotte, N.C.; Knoxville, Tenn.; and Lenox, Mass.

Together...Shaping the Future of Electricity

Program:
Instrumentation and Control

© 2013 Electric Power Research Institute (EPRI), Inc. All rights reserved. Electric Power
Research Institute, EPRI, and TOGETHER...SHAPING THE FUTURE OF ELECTRICITY are
registered service marks of the Electric Power Research Institute, Inc.

3002000509

Electric Power Research Institute


3420 Hillview Avenue, Palo Alto, California 94304-1338 • PO Box 10412, Palo Alto, California 94303-0813 USA
800.313.3774 • 650.855.2121 • askepri@epri.com • www.epri.com
