A variety of system architectures are available for use in process safety applications. The selected architecture should satisfy the safety requirements of the process under control. If frequent internal failures of the safety system force the process through an excessive number of shutdown/start-up cycles, that process is operating in its most hazardous state.
Architectures Used in Safety Systems

By: Dr. Lawrence Beckman - HIMA-Americas, Inc.

There are several system architectures available for use in process safety applications. These range from single channel systems to triplicated or higher redundancy configurations. The selected architecture should first satisfy the safety requirements of the process under control, but should likewise address production and operational issues which have a definite impact on safety, e.g., false tripping of the process. As such, high availability is also an important consideration for safety systems. If frequent internal failures of the safety system force the process through an excessive number of shutdown/start-up cycles, that process is operating in its most hazardous state for unnecessarily long periods of time. Operation under these conditions should be minimized in the interest of safety.

There is considerable work in progress to establish standards and implementation guidelines, both in the USA and internationally, which match the risk inherent in a given situation to the required integrity level of the safety system. Regrettably, they are not specific to a particular type of process, but deal only with a qualitative level of risk. This paper is intended to provide background and further economic insight into this subject, and to explore some architectural alternatives available to the control or safety engineer, given a state-of-the-art approach to safety system design.

Safety System Architecture

Most safety systems in the process environment are designed to shut the process down upon detecting a hazardous state or condition. These systems are called Emergency Shutdown (ESD) Systems, and many operate in a fail-safe mode. In this mode of operation, an internal failure of the safety system will result in a shutdown of the process, and all repairs to this system are performed in the non-operating state.
Fail-safe systems can be redundant, but may lack sufficient redundancy to be considered fault tolerant. As such, they are subject to false trips and are not used in applications requiring a high level of availability. Availability is defined as the percentage of time over which a system is capable of performing its intended function during a given time period.

Efforts have been made to increase the availability of dual systems by operating them in a mode where both channels must fail for the system to shut down the process. This mode of operation is called the 2-out-of-2 (2oo2) mode, as opposed to the normal 1-out-of-2 (1oo2) fail-safe mode. In the 2oo2 mode, the system continues to operate on a single channel after sustaining the first internal failure. While this configuration is certainly more available, its integrity depends heavily on comprehensive internal diagnostics.

Fault tolerant (dual or higher) architectures are capable of sustaining reliable operation in the presence of a fault, by providing additional levels of redundancy. The 1oo2D configuration is an example of a dual implementation of fault tolerant architecture. It normally operates in the 2oo2 mode, but reverts to the 1oo2 mode upon the unlikely occurrence of an unresolved fault. Diagnostic watchdogs are provided in both channels as a secondary means of de-energizing outputs. Either channel is capable of switching off its outputs, as well as those of the other channel if required. As such, it is both very safe and available. Details of this Programmable Electronic System (PES) configuration are given in the literature (2,3).

After sustaining a fault, the system will continue to operate properly on its remaining channel, thus avoiding a process shutdown and allowing repairs to be performed on-line. The safety integrity of the system in the presence of a fault is not compromised for the sake of increased availability, as all internal diagnostics are fully functional on the remaining channel.
As such, fault tolerant systems found in industrial applications are at minimum dual redundant, providing two independent channels of redundancy. Another inherent advantage of some redundant architectures is the ability to communicate data between channels in order to decide which of the channels of the system is malfunctioning. This capability significantly improves the system's ability to diagnose faults and subsequently increases the safety integrity of the system.

A system's ability to diagnose faults is often referred to as its diagnostic coverage, where the Coverage Factor (C) is defined as the probability of detecting a fault, given that one has occurred. Perfect coverage would imply 100% effective self-diagnostics. This is of course impossible.

After sustaining a fault, a triplicated 2-out-of-3 (2oo3) system can operate in either of two modes. If a second fault occurs before repairs can be effected on the first malfunctioning channel, the system can shut itself down immediately upon the occurrence of the second fault, or it can revert to single channel operation. For safety applications of a 2oo3 system, two channel operation is restricted to a short time interval, and single channel operation is never allowed, due to the lack of comprehensive internal diagnostics. As such, the integrity of a triplicated (TMR) system depends heavily on its ability to vote, and consequently diagnostic coverage degrades in step with the number of operating channels.

The Coverage Factor of the various architectures will vary based on the quality of the system's internal diagnostics. Triplicated architectures rely heavily on their voting capacity to implement diagnostic coverage, and as such an operational third channel is critical. After sustaining a fault, diagnostic capability, and consequently coverage, is substantially diminished and in some instances may be nonexistent.
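The voting behavior behind the 1oo1, 1oo2, 2oo2 and 2oo3 designations can be illustrated with a minimal sketch. It assumes each channel reports a simple boolean trip request; the function name `vote` is hypothetical and does not come from any particular PES product. In a fail-safe system, a detected internal channel failure would typically also appear as a trip request, which is an assumption of this illustration rather than something modeled here.

```python
from typing import Sequence


def vote(trip_requests: Sequence[bool], m: int) -> bool:
    """Return True (trip the process) when at least m channels request a trip.

    This is a generic m-out-of-n vote, where n is len(trip_requests).
    """
    if not 1 <= m <= len(trip_requests):
        raise ValueError("m must be between 1 and the number of channels")
    return sum(bool(request) for request in trip_requests) >= m


# A single channel requests a trip; the remaining channels do not.
print("1oo1:", vote([True], 1))                # True
print("1oo2:", vote([True, False], 1))         # True  (either channel can trip)
print("2oo2:", vote([True, False], 2))         # False (both channels must agree)
print("2oo3:", vote([True, False, False], 2))  # False (a majority is required)
```

The same single request trips a 1oo2 system but not a 2oo2 or 2oo3 system, which is the trade-off between false trips and failures to trip that the following sections quantify.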
Dual architectures typically offer superior internal diagnostics, which are capable of diagnosing the operational state of the entire system every scan cycle. This differs dramatically from other architectures, which may require 30 seconds or longer to diagnose a problem, e.g., a memory failure. Diagnostic coverage is an important consideration in evaluating covert system unavailability ($U_C$).

Levels of redundancy beyond triplicated systems are rare in the industrial environment, and are very difficult to justify economically considering cost versus incremental improvement in safety integrity. Single channel systems are definitely not recommended for critical safety applications. Please refer to the following table of PES Architectures for further clarification.

Configuration          Operating Mode   Channels Needed to Operate   Channels Needed to Trip
1oo1                   1-0              1                            1
1oo2                   2-0              2                            1
2oo2                   2-1-0            1                            2
2oo3                   3-2-0            2                            2
1oo2D (2oo2 → 1oo2)    2-1-0            1                            1

Safety vs. Availability

Having discussed the redundant configuration options available, let us quantify their relative performance for both safety and availability. The criteria are necessarily different, and will be characterized as follows.

The Safety criterion will be the Hazard Rate (H), which is calculated as

$H = D \times U_C$

where
$D$ = Demand Rate (demands/yr)
$U_C$ = Covert (safety) Unavailability

The Covert (safety) Unavailability (also referred to as fractional dead time or probability of dangerous failure) is the probability that the system is in a failed or non-functioning state because of a covert failure. It is this condition which represents the true hazard. Not all covert failures are dangerous failures, but all are potentially dangerous. Thus, a more conservative approach would require that covert and dangerous be considered synonymous. As such, the Covert Unavailability ($U_C$) is a function of the unrevealed system failure rate ($\lambda_C$) and the proof test interval ($T_P$).
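As a small worked illustration of the Hazard Rate definition, the sketch below multiplies a demand rate by an unavailability. The function name `hazard_rate` is hypothetical; the numbers are the demand rate (0.5 demands/yr) and the simplex unavailability (1.38x10^-2) quoted in the paper's later worked example.

```python
def hazard_rate(demand_rate_per_year: float, unavailability: float) -> float:
    """Hazard Rate H = D * U: expected hazardous events per year."""
    if demand_rate_per_year < 0.0:
        raise ValueError("demand rate must be non-negative")
    if not 0.0 <= unavailability <= 1.0:
        raise ValueError("unavailability is a probability between 0 and 1")
    return demand_rate_per_year * unavailability


# Demand rate and unavailability taken from the paper's simplex example.
h = hazard_rate(demand_rate_per_year=0.5, unavailability=1.38e-2)
print(f"H = {h:.2e} hazardous events per year")  # H = 6.90e-03 ...
```

Because H scales linearly with both factors, halving the unavailability has the same effect on the hazard rate as halving the number of demands placed on the safety system.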
The Availability criterion will be the False Trip Rate (F). It is a function of the revealed failure rate ($\lambda_R$) and the repair time ($T_R$).

Whether a given failure is revealed or unrevealed (covert) depends upon the level of coverage provided by the system's diagnostics. The repair process likewise is heavily dependent upon the system's ability to detect a fault, as the repair time is the sum of the time to detect the fault and the time to make the repair. In a system with a low level of diagnostic coverage, the repair time will be extended to equal the proof test interval in most instances. Programmable Electronic Systems designed for high integrity/safety applications have comprehensive diagnostics and consequently high coverage factors, ranging from 97% to 99+%. For a given coverage factor (C), the failure rates can be computed as follows:

$\lambda_R = C\,\lambda$ ;  $\lambda_C = (1 - C)\,\lambda$

where $\lambda$ is the total failure rate of the unit or module.

For a system consisting of "l" input modules and "m" output modules, with a main processor module, the revealed failure rate of the system can be calculated as follows:

$\lambda_{R,SYSTEM} = l\,C_i\,\lambda_i + C_P\,\lambda_P + m\,C_o\,\lambda_o$

where
$l$ = number of input modules
$m$ = number of output modules
$C$ = module coverage factor
$\lambda$ = total module failure rate
(i = input, P = processor, o = output)

The covert failure rate of this system can be calculated by substituting $(1 - C_i)$ for $C_i$, $(1 - C_P)$ for $C_P$, etc. above.

For the system configurations of interest, we can develop a list of Covert Unavailabilities ($U_C$) and False Trip Rates (F) as follows. A discussion of these equations can be found in Beckman (1).

System Configuration   Operating Mode   Failures Allowed   Covert Unavailability ($U_C$)   False Trip Rate ($F$)
1-out-of-1             1-0              0                  $\lambda_C T_P / 2$             $\lambda_R$
1-out-of-2             2-0              1                  $\lambda_C^2 T_P^2 / 3$         $2 \lambda_R$
2-out-of-2             2-1-0            0                  $\lambda_C T_P$                 $2 \lambda_R^2 T_R$
2-out-of-3             3-2-0            1                  $\lambda_C^2 T_P^2$             $6 \lambda_R^2 T_R$
1-out-of-2D            2-1-0            1                  $\lambda_C^2 T_P^2 / 3$         $2 \lambda_R^2 T_R$

The basic system used in this analysis is simplex, in that it has a single set of inputs and outputs. For the redundant configurations, all channels operate independently, the inputs are in parallel and the outputs are arranged to provide the level of operational redundancy required; e.g., for the 1-out-of-2 configuration the two outputs are in series, while for the 2-out-of-2 configuration the two outputs are in parallel. In addition, the above results were obtained assuming that coverage factors for all configurations are equal, and that all units are tested simultaneously. Only normal mode failures were included in this analysis.

Common Mode Failures

Common Mode Failure occurs when a single cause affects multiple channels of a redundant system, usually resulting in complete system failure. Sources of common mode failure are environmental conditions, design errors, manufacturing errors, and operational or maintenance failures. The higher the level of redundancy, the more likely the occurrence of this type of failure. For example, a dual system (consisting of channels A and B) has only a single common mode failure possibility, while a triple redundant system (channels A, B and C) has multiple common mode failure possibilities (AB, AC, BC and ABC). This situation is exacerbated further when multiple channels share a common hardware platform, e.g., a common I/O module.

Common Mode Failure is typically modeled using the "beta factor" method, where beta ($\beta$) represents the fraction of total failures attributable to common mode failure; i.e.,

$\beta = \lambda_{CM} / (\lambda_{CM} + \lambda_{NM})$

where total failures include both common mode (CM) and normal mode (NM) failures. Usually this fraction is in the range of 5-15%, but it can be smaller based on operational experience. Necessarily, it is only an estimate.
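The channel-combination argument and the beta factor split can be sketched in a few lines. The helper names `common_mode_groups` and `split_failure_rate` are hypothetical, and the total failure rate of 6 failures/yr and beta of 10% are illustrative values only, chosen inside the 5-15% range quoted above.

```python
from itertools import combinations


def common_mode_groups(channels: str) -> list[str]:
    """List every group of two or more channels that one cause could disable."""
    return [
        "".join(group)
        for size in range(2, len(channels) + 1)
        for group in combinations(channels, size)
    ]


def split_failure_rate(total_rate: float, beta: float) -> tuple[float, float]:
    """Split a total failure rate into (common mode, normal mode) parts."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be a fraction between 0 and 1")
    return beta * total_rate, (1.0 - beta) * total_rate


print(common_mode_groups("AB"))    # ['AB']
print(common_mode_groups("ABC"))   # ['AB', 'AC', 'BC', 'ABC']

# Illustrative values: 6 failures/yr in total, 10% attributed to common mode.
lam_cm, lam_nm = split_failure_rate(total_rate=6.0, beta=0.10)
print(f"common mode: {lam_cm:.2f}/yr, normal mode: {lam_nm:.2f}/yr")
```

By construction, the two returned rates recover the chosen beta when substituted back into the definition of the beta factor.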
However, depending upon the importance placed on this type of failure, the resulting system reliability will be significantly altered. Considering two of the redundant architectures discussed, the occurrence of a covert common mode failure in either the 1-out-of-2/1-out-of-2D or the 2-out-of-3 system configuration results in a fail-to-function situation. This result is the same irrespective of the architecture. However, the susceptibility is lower by a factor of three for the dual architecture. Incorporating covert common mode failure into their corresponding reliability models yields the following modified equations for Covert Unavailability ($U_C$):

System Configuration        Covert Unavailability ($U_C$)
                            Common Mode            Normal Mode
1-out-of-2 or 1-out-of-2D   $\lambda_{CC} T_P / 3$  +  $\lambda_{CN}^2 T_P^2 / 3$
2-out-of-3                  $\lambda_{CC} T_P$      +  $\lambda_{CN}^2 T_P^2$

where
$U_C$ = $U_C$ (Common Mode Failure) + $U_C$ (Normal Mode Failure)
$\lambda_{CC} = \beta\,\lambda / 2$ (assuming 50% are covert)
$\lambda_{CN} = (1 - \beta)(1 - C)\,\lambda$

The net effect is equivalent to placing a simplex (non-redundant) element in series with the redundant architecture for both the 1-out-of-2 and 2-out-of-3 system configurations considered. Depending upon a reasonable estimate of the beta factor and the resulting common mode dangerous failure rate, the Common Mode term can completely dominate the computation, rendering Normal Mode failures insignificant for higher levels of redundancy. In practice, this is typically not the case, and care should be taken to keep common mode failure in perspective. However, it certainly should not be ignored in critical safety evaluations. In the economic analysis that follows, common mode failures have not been included, as they are outside the scope of this paper. A comprehensive discussion is provided in the literature (2,4,5).

System Integrity

In a hazardous process environment there are typically two types of systems in operation: the control system and the safety or protective system. The two systems should be totally independent of each other. The purpose of the safety system is to protect against the process hazard, while preventing plant shutdowns due to false trips. The safety system is typically dormant for extended periods of time and susceptible to functional failures, which are generally unrevealed failures.

Given less than perfect diagnostic coverage, the internal diagnostics of today's programmable safety systems will not detect 100% of all possible failure conditions. As such, it is necessary to conduct periodic proof testing to detect such undiagnosed failures. Proof testing is, however, not a substitute for comprehensive internal diagnostics. This testing should be made as quick and simple as possible in order to give the maximum system availability, while reducing the possibility of human error. It should be possible to perform repairs while the system is operating.
No advantage is gained if the safety system or the process has to be stopped to rectify any faults found.

The time interval between proof tests is of great concern, as the potential for human error while conducting the proof test is significant. Considerations which affect the choice of proof test interval are as follows:

1) System redundancy and the coverage factor of the internal diagnostics.
2) Potential for human error due to complexity of the test/repair process.
3) The time required to perform the necessary testing and repair.

During the test period, the system (or some portion thereof) is under test and unavailable to perform its intended safety function. As such, proof testing too frequently increases the unavailability of the safety system and the probability of human error. On the other hand, infrequent testing increases the risk of developing undiagnosed faults, particularly in systems with a low level of diagnostic coverage. As stated earlier, the purpose of the proof test is to improve the reliability of the safety system.

The objective is to minimize the safety unavailability of the system while conducting the required periodic testing to maintain system integrity. The selection of the optimum proof test interval based on minimizing safety unavailability is critical. Consider the following equation for total system Unavailability ($U_{TOTAL}$), including field devices:

$U_{TOTAL} = U_C + U_T + U_{FD} + U_E$

where
$U_C$ = covert (safety) unavailability due to unrevealed system failure
$U_T$ = unavailability resulting from proof testing
$U_{FD}$ = covert (safety) unavailability due to unrevealed field device failures
$U_E$ = unavailability resulting from human error (system isolation, i.e., bypass not restored)

It is desired to minimize $U_{TOTAL}$ with respect to $T_P$ for a given configuration of the system and field devices. The resulting optimum proof test interval is $T_{P,MIN}$. A derivation of this methodology can be found in Beckman (1).
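The minimization can be sketched under simplifying assumptions that are not taken from the paper: the proof testing term is modeled as $U_T = T_D / T_P$ (test time divided by test interval), and the field device and human error terms are treated as independent of $T_P$, so they drop out of the derivative. With the Covert Unavailability written as $U_C = k\,T_P^n$, setting the derivative to zero gives $T_{P,MIN} = (T_D / (n k))^{1/(n+1)}$. The names below are hypothetical, and the coefficients are those of the Covert Unavailability expressions tabulated earlier.

```python
HOURS_PER_YEAR = 8760.0
WEEKS_PER_YEAR = 52.0

# U_C = k(lambda_C) * T_P**n for each configuration, following the
# Covert Unavailability column of the paper's table: (k as a function, n).
COVERT_TERMS = {
    "1oo1": (lambda lam_c: lam_c / 2.0, 1),
    "1oo2": (lambda lam_c: lam_c**2 / 3.0, 2),
    "2oo2": (lambda lam_c: lam_c, 1),
    "2oo3": (lambda lam_c: lam_c**2, 2),
    "1oo2D": (lambda lam_c: lam_c**2 / 3.0, 2),
}


def optimum_proof_test_interval(config: str, lam_c: float, test_time_years: float) -> float:
    """Return the T_P (years) that minimizes k*T_P**n + T_D/T_P.

    The T_D/T_P testing term is an assumption of this sketch, not a formula
    quoted from the paper.
    """
    coefficient, n = COVERT_TERMS[config]
    k = coefficient(lam_c)
    return (test_time_years / (n * k)) ** (1.0 / (n + 1))


lam_c = 0.18                   # covert failures per year (paper's example value)
t_d = 4.0 / HOURS_PER_YEAR     # 4-hour test time, expressed in years
for config in COVERT_TERMS:
    weeks = optimum_proof_test_interval(config, lam_c, t_d) * WEEKS_PER_YEAR
    print(f"{config:6s} T_P,MIN is about {weeks:.1f} weeks")
```

With the example parameters used later in the paper, this sketch gives intervals of approximately 3.7, 14.4, 2.6, 10.0 and 14.4 weeks, close to the values in the paper's results table, which suggests the assumed testing term is at least consistent with the published figures.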
Testing and Repair

The safety system design should facilitate maintenance of both the safety system and associated field devices. The system itself should give the maintenance technician a clear, visual indication of the fault, so that repair can proceed with certainty, thereby reducing the possibility of human error. Repair procedures should be simple and straightforward to allow fast, easy repair and keep the repair time as short as possible (low MTTR). In addition, provisions should be made to simplify the bypassing of field devices for purposes of proof testing, calibration and maintenance.

Many redundant systems based on traditional PLCs are not fully integrated, and are consequently difficult to test and repair. Redundant implementations which require the maintenance technician to diagnose complex problems, perform difficult repair procedures, or reload the application program as part of the repair process are prone to human error, and will at the least contribute to false or nuisance trips of the process. At worst, incomplete or inadequate repair could result in a catastrophic failure of the safety system. Steps should be taken to minimize the occurrence of human error during testing and maintenance of the safety system.

In determining the optimum proof test interval, consideration should also be given to the potential for human error. Under ideal conditions, the human error rate is estimated to be 1 in 100. However, most process testing conditions are far from ideal, and as such this rate will be substantially higher. Measures can be taken in the safety system design to mitigate this situation, but the potential for human error, both while conducting the testing and during any required repair, must be considered. The "Human Error" failure rate far exceeds that of other safety system components such as sensors, actuators, etc. As such, it represents the largest potential cause of operational failure of the safety system.
Economic Model

Given the above, it is now possible to construct an economic model for the system configurations of interest. This analysis will focus on the safety system itself, and as such will not include the associated field devices.

The link between Safety and Availability is becoming significantly stronger, as industry recognizes that cycling processes up and down inherently has safety implications, in addition to the cost associated with lost production. Hence, availability is now considered a key factor in safety system design and operation. Considering the above, the model includes three terms as follows:

1) Hazardous failures
2) False or Nuisance Trips
3) Periodic Proof Testing

The focus of the model is on Total Safety Cost, and consequently it does not include the initial cost of system hardware, integration, programming or maintenance. One could safely assume that these costs would be in proportion to the selected level of redundancy, with triplication being the most expensive. These costs, even when amortized over the life cycle of the system, differ from one configuration to another, but are mostly fixed compared to the Total Safety Cost.

The model utilizes an operating period of one (1) year, and a proof testing interval that is optimized for each configuration considered. Given these conditions, the Safety Cost model is

$C_{TOTAL}(\$) = H\,C_H + F\,C_F + C_T / T_{P,MIN}$

where
$C_{TOTAL}(\$)$ = Total Annual Safety Cost
$H$ = Hazard Rate
$F$ = False Trip Rate
$C_H$ = Hazard Cost ($)
$C_F$ = Nuisance Trip Cost ($)
$C_T$ = Proof Testing Cost ($)
$T_{P,MIN}$ = Optimum Proof Test Interval

Please note that $T_{P,MIN}$ is also used in computing the Safety Unavailability and consequently the Hazard Rate.

Using the following values for Coverage Factor ($C$), Covert Failure Rate ($\lambda_C$), Demand Rate ($D$), Mean Time to Repair ($T_R$), and Test Time ($T_D$), we compute the following for the configurations of interest, using the equations for Covert Unavailability and False Trip Rate given in the earlier table:

$C$ = 0.97
$\lambda_C$ = 0.18 failures/yr
$D$ = 0.5 demands/yr
$T_R$ = 8 hrs
$T_D$ = 4 hrs

Configuration   $T_{P,MIN}$ (weeks)   $U_{TOTAL}$   $H$ (per year)   $F$ (per year)   $C_{TOTAL}$ ($)
1-out-of-1      3.7                   1.38x10^-2    6.91x10^-3       5.82             322,534
1-out-of-2      14.4                  3.48x10^-3    1.74x10^-3       11.64            590,103
2-out-of-2      2.6                   1.91x10^-2    9.56x10^-3       0.062            47,585
2-out-of-3      10.0                  4.57x10^-3    2.28x10^-3       0.186            20,855
1-out-of-2D     14.4                  3.48x10^-3    1.74x10^-3       0.062            11,196

$C_{TOTAL}(\$)$ was calculated using the following associated costs (per occurrence):

Hazard Cost ($C_H$) = $500,000
Nuisance Trip Cost ($C_F$) = $50,000
Proof Testing Cost ($C_T$) = $2,000

The lowest cost was achieved by the 1-out-of-2D configuration, where both Hazard failures and Nuisance trips were virtually eliminated. The 2-out-of-3 configuration finished second, due to increased costs associated with proof testing and nuisance trips. Note that the 1-out-of-2/1-out-of-2D configuration also had the longest proof test interval, and that the 2-out-of-2 configuration was the least safe, actually less safe than the 1-out-of-1 (simplex) configuration.

In addition, no attempt was made to account for the following in the model:

1) Increase in Covert Unavailability ($U_C$) due to human error associated with more frequent proof testing.
2) Increase in Demand Rate ($D$) due to more frequent false trips and process start-ups.

Including these effects would have further biased the results in favor of the higher integrity configurations.

Effects of Coverage Factors

It would now be of interest to investigate the effect of the Coverage Factor ($C$) on the economic model for the configurations of interest. We will use the same model for $C_{TOTAL}(\$)$, keeping all parameters constant with the exception of the Coverage Factor and the resulting covert and revealed system failure rates. Based on this analysis, we compute the following dollar values for $C_{TOTAL}(\$)$ on an annual basis.

$C_{TOTAL}(\$)$

                C = 0.98              C = 0.90             C = 0.75
Configuration   $\lambda_C$ = 0.12    $\lambda_C$ = 0.6    $\lambda_C$ = 1.5
1-out-of-1      319,793               327,366              315,559
1-out-of-2      594,243               557,772              482,527
2-out-of-2      39,531                83,678               129,815
2-out-of-3      18,365                33,510               52,349
1-out-of-2D     9,400                 20,435               34,376

The results are interesting in that two distinct effects are observed. $C_{TOTAL}(\$)$ decreases overall as $C$ falls from 0.98 to 0.75 for the two configurations (1-out-of-X) which are prone to false trips. Correspondingly, $C_{TOTAL}(\$)$ increases as the coverage factor decreases for the 2-out-of-X configurations and the 1-out-of-2D configuration (which are less prone to false trips), because of a dramatic increase in the Hazard Rate. As these are the most likely configurations to be utilized, a decrease in the coverage factor represents a significant increase in the probability of a hazard, and its inherent financial consequences. Please refer to Figure 1 for a summary of these results.

It is also interesting to note that a small increase in the coverage factor (which implies a corresponding decrease in the covert failure rate) resulting from more comprehensive internal diagnostics, voting, etc. will substantially reduce the Hazard Rate in all cases, and consequently the Total Safety Cost for those configurations least affected by false trips. The optimum Proof Test Interval ($T_{P,MIN}$) likewise increases as the overall integrity of the system improves. This effect could also be achieved by reducing the total failure rates of the individual modules which comprise the overall system configuration.

Conclusions

An economic analysis of the safety system must account for both process safety and availability. A model was constructed which included Hazardous Failures, False or Nuisance Trips, and the Periodic Proof Testing required to maintain system integrity for the system configurations of interest. Hazardous failures were computed based on the Total Safety Unavailability of the system.
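The three-term cost model summarized above can be checked with a short numerical sketch. It takes the Hazard Rate and optimum proof test interval for C = 0.97 directly from the paper's results table, recomputes the False Trip Rate from the tabulated expressions, and assumes 52 weeks per year when converting the test interval. The function names are hypothetical. Small differences from the published totals are expected because the tabulated inputs are rounded.

```python
WEEKS_PER_YEAR = 52.0
HOURS_PER_YEAR = 8760.0


def false_trip_rate(config: str, lam_r: float, repair_time_years: float) -> float:
    """False Trip Rate F (per year), following the paper's table of expressions."""
    if config == "1oo1":
        return lam_r
    if config == "1oo2":
        return 2.0 * lam_r
    if config in ("2oo2", "1oo2D"):
        return 2.0 * lam_r**2 * repair_time_years
    if config == "2oo3":
        return 6.0 * lam_r**2 * repair_time_years
    raise ValueError(f"unknown configuration: {config}")


def total_annual_safety_cost(
    hazard_rate: float,
    trip_rate: float,
    proof_test_interval_weeks: float,
    hazard_cost: float = 500_000.0,
    trip_cost: float = 50_000.0,
    test_cost: float = 2_000.0,
) -> float:
    """C_TOTAL = H*C_H + F*C_F + C_T / T_P,MIN, with T_P,MIN converted to years."""
    tests_per_year = WEEKS_PER_YEAR / proof_test_interval_weeks
    return hazard_rate * hazard_cost + trip_rate * trip_cost + tests_per_year * test_cost


# (H per year, T_P,MIN in weeks) as printed in the results table for C = 0.97.
RESULTS = {
    "1oo1": (6.91e-3, 3.7),
    "1oo2": (1.74e-3, 14.4),
    "2oo2": (9.56e-3, 2.6),
    "2oo3": (2.28e-3, 10.0),
    "1oo2D": (1.74e-3, 14.4),
}

lam_r = 5.82                 # revealed failure rate per year (F for 1-out-of-1)
t_r = 8.0 / HOURS_PER_YEAR   # 8-hour repair time, expressed in years

for config, (h, t_p_weeks) in RESULTS.items():
    f = false_trip_rate(config, lam_r, t_r)
    cost = total_annual_safety_cost(h, f, t_p_weeks)
    print(f"{config:6s} F = {f:8.3f}/yr  C_TOTAL is about ${cost:,.0f}")
```

The sketch makes the structure of the result visible: for the 1-out-of-X configurations the nuisance trip term dominates the total, while for the remaining configurations the proof testing term is the largest contributor.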
Human error is a significant contributor to Safety Unavailability, and steps to minimize the probability of its occurrence should be employed in the integration, testing, and repair of the system.

The analysis indicates that the use of either the 1-out-of-1 or the 1-out-of-2 configuration is not economically feasible, although the 1-out-of-2 configuration is quite safe. It is best suited for fail-safe applications where loss of production is not a consideration. The 2-out-of-2 configuration is the least safe, and should be utilized only where safety is not the primary consideration. The 2-out-of-3 and 1-out-of-2D configurations are economically advantaged, in that they satisfy both safety and availability requirements, thus minimizing the Total Safety Cost on an annual basis. However, the 1-out-of-2D configuration is superior in both safety performance and cost. Including common mode failure in the economic model would further reinforce this result.

The importance of having comprehensive diagnostics, and consequently a high coverage factor, in the safety system cannot be overemphasized. Improving coverage has an exponential effect on increasing reliability and safety system integrity, and on reducing Total Safety Cost. Proof testing should be used to complement a system's internal diagnostics, and not as a substitute for inadequate diagnostics. Frequent proof testing and complex repair procedures increase the probability of human error, and should always be avoided.

Given the above, a safety analysis should be performed both prior to design and again after installation to determine if the system achieves the Safety Integrity Level (SIL) required by the Process Hazard Analysis (PHA). This analysis can likewise establish the proper selection of the system architecture to satisfy economic criteria, and the tangible performance of the system as regards the mitigation of hazards which can lead to significant economic, safety, and environmental consequences.
References

(1) L.V. Beckman, "Optimum Proof Testing of Programmable Safety Systems," Hydrocarbon Processing, November 1992.
(2) Bukowski, J.V. and W.M. Goble, Comparing Control Systems Reliability: Architecture, Diagnostics, and Common Cause; Proceedings of the ISA/94 Conference and Exhibit. ISA, 1994.
(3) IEC 1508 (Draft) Part 2: Requirements for Electrical/Electronic/Programmable Electronic Systems. IEC 1508 (Draft) Part 6: Guidelines on Application of Parts 2 and 3.
(4) A.J. Bourne, et al., Defenses against common-mode failures in redundant systems; Safety and Reliability Directorate UK AEA, January 1981.
(5) Freeman, Raymond A., Reliability of Interlocking Systems; Process Safety Progress (Vol. 13, No. 3), July 1994, Pg. 146.

Figure 1. Total Annual Safety Cost (dollars) by configuration (1oo1, 1oo2, 2oo2, 2oo3, 1oo2D) for C = 0.98, C = 0.97, C = 0.90 and C = 0.75.