
SIS - Safety Instrumented Systems - A practical view - Part 1

César Cassiolato

Marketing, Quality, Project and Services Engineering Director


SMAR Industrial Automation
cesarcass@smar.com.br

Introduction
Safety Instrumented Systems (SIS) are used to monitor the condition of values and
parameters of a plant within its operational limits and, when risk conditions occur, they
must trigger alarms and place the plant in a safe condition, or even in a shutdown
condition.
Safety conditions should always be followed and adopted by plants, and best operating
and installation practices are a duty of employers and employees. It is important to
remember that the first concept in safety legislation is to ensure that all systems are
installed and operated in a safe way, and the second is that the instruments and alarms
involved in safety operate with reliability and efficiency.
Safety Instrumented Systems (SIS) are the systems responsible for operational safety
and for ensuring an emergency stop within limits considered safe whenever the
operation exceeds those limits. The main objective is to avoid accidents inside and
outside plants, such as fires, explosions and equipment damage, to protect production
and property and, above all, to avoid risk to life, harm to personal health and
catastrophic impacts on the community. It should be clear that no system is completely
immune to failures and that, even in case of failure, it should provide a safe condition.
For several years, safety systems were designed according to the German standards
DIN V VDE 0801 and DIN V 19250, which were well accepted by the global safety
community and which motivated the effort to create a global standard, IEC 61508. This
standard now serves as the basis for all functional safety involving electrical, electronic
and programmable electronic systems in any kind of industry, and it covers all safety
systems of an electronic nature.
Products certified according to IEC 61508 must basically cover 3 types of failures:

• Random hardware failures
• Systematic failures
• Common cause failures

IEC 61508 is divided into 7 parts, where the first 4 are mandatory and the other 3 act as
guidelines:
 Part 1: General requirements
 Part 2: Requirements for E/E/PE safety-related systems
 Part 3: Software requirements
 Part 4: Definitions and abbreviations
 Part 5: Examples of methods for the determination of safety integrity levels
 Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3
 Part 7: Overview of techniques and measures

This standard systematically covers all activities of the SIS (Safety Instrumented System)
life cycle and focuses on the performance required from the system; that is, once the
desired SIL (Safety Integrity Level) is defined, the redundancy level and the test interval
are at the discretion of whoever specifies the system.
IEC 61508 aims to consolidate the improvements brought by PES (Programmable
Electronic Systems, which include PLCs, microprocessor-based systems, distributed
control systems, sensors, intelligent actuators, etc.) so as to standardize the concepts
involved.
Recently, several standards on SIS development, design and maintenance have been
prepared, such as the already mentioned IEC 61508 (for industry in general); it is also
important to mention IEC 61511, focused on the process industries (continuous, liquid
and gas processes).
In practice, in several applications, equipment with SIL certification has been specified
for use in control systems that have no safety function. There is also a certain amount of
misinformation in the market, leading to the purchase of more expensive equipment
developed for safety functions that, in practice, will be used in process control functions,
where SIL certification does not bring the expected benefits and can even make the use
and operation of the equipment more difficult.
In addition, such misinformation leads users to believe that they have a certified safe
control system, when what they actually have is a controller with certified safety functions.
With the increasing use of digital equipment and instruments, it is extremely important
that professionals involved in projects or in day-to-day instrumentation are qualified,
know how to determine the performance required from safety systems, and master the
calculation tools and risk rates within acceptable limits.
In addition, it is necessary to:

• Understand common mode failures; know which types of safe and unsafe failures are
possible in a specific system, how to prevent them, and when, how, where and which
redundancy level is most appropriate for each case.
• Define the preventive maintenance level appropriate for each application.

The simple use of modern, sophisticated or even certified equipment does not in itself
ensure any improvement in reliability and operational safety compared with traditional
technologies, unless the system is deployed with criteria and knowledge of the
advantages and limitations inherent to each type of technology available. In addition,
the entire SIS life cycle should be kept in mind.
We commonly see accidents related to safety devices bypassed by operations or during
maintenance. It is certainly very difficult to prevent, at the design stage, one of these
devices from being bypassed in the future, but a solid design that satisfies the
operational needs of the safety system user can considerably eliminate or reduce the
number of unauthorized bypasses.
By using techniques with fixed or programmable logic circuits, fault-tolerant and/or
fail-safe designs, microcomputers and software concepts, it is possible today to design
efficient and safe systems at costs suitable for this function.
The SIS complexity level depends greatly on the process considered. Heaters, reactors,
cracking columns, boilers and furnaces are typical examples of equipment requiring
carefully designed and implemented safety interlock systems.
The appropriate operation of a SIS requires better performance and diagnostic
conditions than conventional systems. Safe operation in a SIS is provided by sensors,
logic solvers, processors and final elements designed with the purpose of causing a
shutdown whenever safe limits are exceeded (for example, when process variables such
as pressure and temperature go above their very high alarm limits), or even preventing
operation under conditions unfavorable to safe operation.
Typical examples of safety systems:

• Emergency Shutdown System
• Safety Shutdown System
• Safety Interlock System
• Fire and Gas System

We will start with the Safety Life Cycle and Risk Analysis.

Safety Life Cycle


Definition: “It is an engineering process with the specific purpose of ensuring that a SIS
is effective and that it reduces risk levels at an effective cost during the entire life of the
system.”
In other words, the cycle is intended to be a risk evaluation guide during the entire
system life, from project conception to day-to-day maintenance.
Why the Safety Life Cycle?

• Accidents may happen and, therefore, it is necessary to minimize their frequency and
severity.
• Safety Instrumented Systems and Safety Life Cycles are designed to minimize risks.

Figure 1 - Typical example of a Safety Life Cycle

The Safety Life Cycle involves probabilistic analysis so as to ensure the integrity of the
safety design. In addition, these calculations allow the risk to be reduced at an effective
cost.
Keeping the integrity of a SIS throughout the plant life cycle is extremely important for
safety management. An effective management program should include strict controls
and procedures ensuring that:

• The identification of critical points and the choice of sensors, technology, logic solver
and final elements, as well as the need for redundancy, comply with the required safety
levels and the calculated risk reduction. Once the technology and the architecture are
chosen, there is a plan for their periodic analysis and review, reassessing the overall
safety.
• The tests in each phase (design, installation, operation, modification/maintenance) are
conducted in compliance with the safety requirements, safety procedures and
standards.
• The SIS returns to its normal operation after maintenance.
• The system integrity is not compromised by unauthorized access to set points, trip
points or bypasses.
• Management-of-change procedures are always followed for any system change.
• The quality of the changes is verified and the system is revalidated before returning to
operation.

The Safety Life Cycle should be part of a PSM (Process Safety Management) system. In
this way, it will be adopted and applied consciously, involving employees at all its stages
and at all company levels.

Risk Analysis
The more risk a system carries, the more difficult it is to meet the requirements of a safe
system. Basically, risk is the combination of the probability of something undesirable
happening and the consequence of that occurrence.
The risk of a process may be defined as the product of the frequency of occurrence of a
specific event (F) and the consequence resulting from the event (C):

Risk = F x C

Figure 2 - Risk considerations according to IEC 61508.


In safety systems, the goal is to reduce risks to acceptable levels, and the SIL level
required may be determined by analyzing and identifying the process risks. Verification
of the SIL level may be conducted through the probability of failure on demand (PFD).
IEC 61508 defines requirements for system operation and integrity. The requirements
for operation are based on the process, and those for integrity are based on reliability,
which is expressed as the Safety Integrity Level (SIL). There are 4 discrete levels, which
have 3 important properties:

1. They apply to the safety function as a whole;
2. The higher the SIL level, the stricter the requirements;
3. They apply to technical and non-technical requirements.

Table 1 - SIL Levels

How should the SIL levels be interpreted? As we have seen, the SIL level is a measure of
the integrity of a SIS, and we can interpret it in two ways:

1) Taking into consideration the risk reduction factor and table 1:

• SIL 1: risk reduction >= 10 and <= 100
• SIL 2: risk reduction >= 100 and <= 1,000
• SIL 3: risk reduction >= 1,000 and <= 10,000
• SIL 4: risk reduction >= 10,000 and <= 100,000

2) By interpreting table 2, where, for example, SIL 1 means that the risk of an accident or
of something undesirable is low and that the SIS has 90% availability, that is, up to a
10% probability of failure on demand.
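As an illustration of how these bands are used in practice, the sketch below (a minimal Python example, not part of the original article) maps a calculated PFDavg to its SIL band using the standard low-demand ranges, which are the reciprocals of the risk reduction factors listed above.

```python
# Minimal sketch: mapping average probability of failure on demand (PFDavg)
# to a SIL band, using the IEC 61508 low-demand ranges (reciprocals of the
# risk reduction factors above). Risk Reduction Factor (RRF) = 1 / PFDavg.

def sil_from_pfd_avg(pfd_avg):
    """Return the SIL level (1..4) for a given PFDavg, or None if out of range."""
    bands = [
        (1, 1e-2, 1e-1),   # SIL 1: 10^-2 <= PFDavg < 10^-1  (RRF 10..100)
        (2, 1e-3, 1e-2),   # SIL 2: 10^-3 <= PFDavg < 10^-2  (RRF 100..1,000)
        (3, 1e-4, 1e-3),   # SIL 3: 10^-4 <= PFDavg < 10^-3  (RRF 1,000..10,000)
        (4, 1e-5, 1e-4),   # SIL 4: 10^-5 <= PFDavg < 10^-4  (RRF 10,000..100,000)
    ]
    for sil, low, high in bands:
        if low <= pfd_avg < high:
            return sil
    return None

if __name__ == "__main__":
    pfd = 0.0082                 # value calculated for the valve example later in this series
    rrf = 1.0 / pfd
    print(f"PFDavg={pfd}, RRF={rrf:.0f}, SIL={sil_from_pfd_avg(pfd)}")  # -> SIL 2
```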

Table 2 - Levels of SIL and SFF according to the tolerance to hardware failure

SIL evaluation has grown in the last few years, mainly in chemical and petrochemical
applications. The need for a given SIL can even be expressed in terms of the likely
impact on the plant and community:
“4” - Catastrophic impact on the community.
“3” - Protection of employees and the community.
“2” - Protection of production and property. Possible damage to employees.
“1” - Slight impact on property and protection of production.
Figure 3 - SIL due to the likely impact on the plant and community

Such an analysis is not fully satisfactory, as it is hard to classify what is a slight impact
and what is a big one. There are several methods of risk identification:

• HAZOP technique (Hazard and Operability Study), in which risks are identified and the
cases requiring higher SIL levels are pointed out;
• Check list technique;
• FMEA technique (Failure Modes and Effects Analysis), in which the failure of each
piece of equipment and each component in the control loop is analyzed.

In terms of SIL levels, the higher the required level, the higher the cost, due to more
complex and stricter specifications for hardware and software. Usually, the SIL choice for
each safety function is based on the experience of the staff, but one may also choose the
HAZOP matrix analysis or the Layers of Protection Analysis (LOPA), which take policies,
procedures, safety strategies and instrumentation into account.
Some stages and details of risk analysis are listed below:
1. Identification of potential risks
a. Starts with a HAZOP (Hazard and Operability Study)
b. The company should have a group of experts in the process
and in its risks
c. Several methodologies may be applied, such as PHA
(Process Hazard Analysis), HAZOP for risk identification,
modified HAZOP, accident consequence analysis, risk matrix,
risk diagram or quantitative analysis for identification of the
safety level to be reached
d. The standards suggest methodologies for SIL determination
e. The available methods are qualitative, quantitative or
semi-quantitative
f. Determine the SIL appropriate for the SIS, where the risk
inherent to the process should be equal to or lower than the
acceptable risk, ensuring the necessary safety for plant
operation
2. Evaluate the probability of a potential risk related to
a. Equipment failure
b. Human error
3. Evaluate the potential risks and the consequences of the event impacts

Table 3 - Example of Risk Matrix


Table 4 - Frequency Range - Qualitative Criteria
Table 5 - Consequence Range - Qualitative Criteria

Some terms and concepts involved in safety systems

• Demand: any condition or event that requires the safety system to operate.
• PFD (Probability of Failure on Demand): the reliability indicator appropriate for safety
systems.
• MTBF: a basic measure of reliability for repairable items of equipment. It may be
expressed in hours or years and is commonly used in system reliability and
maintainability analysis. It can be calculated by the following formula:

MTBF = MTTR + MTTF

Where:

• MTTR = Mean Time To Repair
• MTTF = Mean Time To Failure = the inverse of the sum of all failure rates

• SFF (Safe Failure Fraction): the fraction of the overall failure rate of a device that
results in either a safe failure or a dangerous failure that is detected by diagnostics.

• Types of failures analyzed in a FMEDA (Failure Modes, Effects, and Diagnostic
Analysis):

1) Dangerous Detected (DD): a detectable failure that may lead to an output error
greater than 2%.
2) Dangerous Undetected (DU): an undetectable failure that may lead to an output
error greater than 2%.
3) Safe Detected (SD): a detectable failure that does not affect the measured
variable, but drives the output current to a safe value and notifies the user.
4) Safe Undetected (SU): there is a problem with the equipment that cannot be
detected, but the output still operates within a 2% safety tolerance. If this safety
tolerance is used as a design parameter, this type of failure may be ignored.
5) Annunciation Undetected (AU): a diagnostic annunciation failure with no
immediate impact, but whose second occurrence may place the equipment in a
risk condition.
6) The following failures may also be characterized:

• Random failures: a spontaneous failure of a (hardware) component. Random failures
may be permanent (they exist until they are eliminated) or intermittent (they occur
under some circumstances and disappear a moment later).
• Systematic failures: failures hidden in a design or in assembly (hardware or, typically,
software), or failures due to errors (including mistakes and omissions) in the safety
life-cycle activities, which cause the SIS to fail under certain circumstances, under
specific combinations of inputs or under a specific environmental condition.
• Common mode failure: the result of a common mode defect.
• Common mode defect: a single cause that may produce failures in several elements of
the system. It may be internal or external to the system.
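To make the SFF definition above concrete, here is a minimal Python sketch that combines the FMEDA failure-rate split (SD, SU, DD, DU) into the safe failure fraction; the lambda values are hypothetical and only illustrate the arithmetic.

```python
# Minimal sketch of the Safe Failure Fraction (SFF) calculation, using the
# failure-rate split from a FMEDA: safe detected (SD), safe undetected (SU),
# dangerous detected (DD) and dangerous undetected (DU).
# The lambda values below are hypothetical, in failures per hour.

def safe_failure_fraction(l_sd, l_su, l_dd, l_du):
    """SFF = (all failures except dangerous undetected) / (all failures)."""
    total = l_sd + l_su + l_dd + l_du
    return (l_sd + l_su + l_dd) / total

# Hypothetical example: only the dangerous undetected portion lowers the SFF.
print(safe_failure_fraction(l_sd=2e-7, l_su=1e-7, l_dd=5e-7, l_du=1e-7))  # ~0.889
```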

Curiosity

Figure 4 - Study on the causes of accidents involving control systems

Conclusion
In practical terms, the aim is to reduce failures and, consequently, to reduce shutdowns
and operational risks. The purpose is to increase operational availability and also, in
terms of the process, to minimize variability, with a direct impact on profitability.

References

• IEC 61508, Functional safety of electrical/electronic/programmable electronic
safety-related systems.
• IEC 61511-1, clause 11, "Functional safety - Safety instrumented systems for the
process industry sector - Part 1: Framework, definitions, system, hardware and
software requirements", 2003-01.
• ESTEVES, Marcello; RODRIGUEZ, João Aurélio V.; MACIEL, Marcos. Sistema de
intertravamento de segurança, 2003.
• William M. Goble, Harry Cheddie, "Safety Instrumented Systems Verification: Practical
Probabilistic Calculation".
• Sistemas Instrumentados de Segurança - César Cassiolato.
• "Confiabilidade nos Sistemas de Medições e Sistemas Instrumentados de Segurança"
- César Cassiolato.
• Manual LD400-SIS.
• Sistemas Instrumentados de Segurança – Uma visão prática – Parte 1, César
Cassiolato.
• Internet research.

SIS - Safety Instrumented Systems - A practical view - Part 2


César Cassiolato

Marketing, Quality, Project and Services Engineering Director


SMAR Industrial Automation
cesarcass@smar.com.br
Introduction
Safety Instrumented Systems (SIS) are used to monitor the condition of values and
parameters of a plant within its operational limits and, when risk conditions occur, they
must trigger alarms and place the plant in a safe condition, or even in a shutdown
condition.
Safety conditions should always be followed and adopted by plants, and best operating
and installation practices are a duty of employers and employees. It is important to
remember that the first concept in safety legislation is to ensure that all systems are
installed and operated in a safe way, and the second is that the instruments and alarms
involved in safety operate with reliability and efficiency.
Safety Instrumented Systems (SIS) are the systems responsible for operational safety
and for ensuring an emergency stop within limits considered safe whenever the
operation exceeds those limits. The main objective is to avoid accidents inside and
outside plants, such as fires, explosions and equipment damage, to protect production
and property and, above all, to avoid risk to life, harm to personal health and
catastrophic impacts on the community. It should be clear that no system is completely
immune to failures and that, even in case of failure, it should provide a safe condition.
For several years, safety systems were designed according to the German standards
DIN V VDE 0801 and DIN V 19250, which were well accepted by the global safety
community and which motivated the effort to create a global standard, IEC 61508. This
standard now serves as the basis for all functional safety involving electrical, electronic
and programmable electronic systems in any kind of industry, and it covers all safety
systems of an electronic nature.
Products certified according to IEC 61508 must basically cover 3 types of failures:

• Random hardware failures
• Systematic failures
• Common cause failures

IEC 61508 is divided into 7 parts, where the first 4 are mandatory and the other 3 act as
guidelines:

• Part 1: General requirements
• Part 2: Requirements for E/E/PE safety-related systems
• Part 3: Software requirements
• Part 4: Definitions and abbreviations
• Part 5: Examples of methods for the determination of safety integrity levels
• Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3
• Part 7: Overview of techniques and measures

This standard systematically covers all activities of the SIS (Safety Instrumented System)
life cycle and focuses on the performance required from the system; that is, once the
desired SIL (Safety Integrity Level) is defined, the redundancy level and the test interval
are at the discretion of whoever specifies the system.
IEC 61508 aims to consolidate the improvements brought by PES (Programmable
Electronic Systems, which include PLCs, microprocessor-based systems, distributed
control systems, sensors, intelligent actuators, etc.) so as to standardize the concepts
involved.
Recently, several standards on SIS development, design and maintenance have been
prepared, such as the already mentioned IEC 61508 (for industry in general); it is also
important to mention IEC 61511, focused on the process industries (continuous, liquid
and gas processes).
In practice, in several applications, equipment with SIL certification has been specified
for use in control systems that have no safety function. There is also a certain amount of
misinformation in the market, leading to the purchase of more expensive equipment
developed for safety functions that, in practice, will be used in process control functions,
where SIL certification does not bring the expected benefits and can even make the use
and operation of the equipment more difficult.
In addition, such misinformation leads users to believe that they have a certified safe
control system, when what they actually have is a controller with certified safety functions.
With the increasing use of digital equipment and instruments, it is extremely important
that professionals involved in projects or in day-to-day instrumentation are qualified,
know how to determine the performance required from safety systems, and master the
calculation tools and risk rates within acceptable limits.
In addition, it is necessary to:

• Understand common mode failures; know which types of safe and unsafe failures are
possible in a specific system, how to prevent them, and when, how, where and which
redundancy level is most appropriate for each case.
• Define the preventive maintenance level appropriate for each application.

The simple use of modern, sophisticated or even certified equipment does not in itself
ensure any improvement in reliability and operational safety compared with traditional
technologies, unless the system is deployed with criteria and knowledge of the
advantages and limitations inherent to each type of technology available. In addition,
the entire SIS life cycle should be kept in mind.
We commonly see accidents related to safety devices bypassed by operations or during
maintenance. It is certainly very difficult to prevent, at the design stage, one of these
devices from being bypassed in the future, but a solid design that satisfies the
operational needs of the safety system user can considerably eliminate or reduce the
number of unauthorized bypasses.
By using techniques with fixed or programmable logic circuits, fault-tolerant and/or
fail-safe designs, microcomputers and software concepts, it is possible today to design
efficient and safe systems at costs suitable for this function.
The SIS complexity level depends greatly on the process considered. Heaters, reactors,
cracking columns, boilers and furnaces are typical examples of equipment requiring
carefully designed and implemented safety interlock systems.
The appropriate operation of a SIS requires better performance and diagnostic
conditions than conventional systems. Safe operation in a SIS is provided by sensors,
logic solvers, processors and final elements designed with the purpose of causing a
shutdown whenever safe limits are exceeded (for example, when process variables such
as pressure and temperature go above their very high alarm limits), or even preventing
operation under conditions unfavorable to safe operation.

Typical examples of safety systems:

• Emergency Shutdown System
• Safety Shutdown System
• Safety Interlock System
• Fire and Gas System

In the previous article, the first part of this series, we saw some details of the Safety Life
Cycle and Risk Analysis. In this second part we will look at Reliability Engineering.

Reliability of Measurement Systems

The reliability of measurement systems may be quantified as the mean time between
failures occurring in the system. In this context, a failure means the occurrence of an
unexpected condition that causes an incorrect value at the output.

Reliability Principles
The reliability of a measurement system is defined as the ability of the system to execute
its function within the specified operating limits and conditions during a defined period of
time. Unfortunately, factors such as manufacturing tolerances and variations in operating
conditions sometimes make this determination difficult and, in practice, the best we can
do is to express reliability statistically, as the probability of failure occurring within a
period of time.
In practice, we face a great difficulty: determining what constitutes a failure. An incorrect
output from a system can be hard to interpret as a failure, compared with the total loss of
the measurement output.

Quantification of Reliability in quasi-absolute terms

As seen before, reliability is essentially probabilistic in nature and can be quantified in
quasi-absolute terms by the mean time between failures (MTBF) and the mean time to
failure (MTTF). Importantly, these two times are usually mean values calculated over a
number of identical instruments and, therefore, for any particular instrument the values
may differ from the average.
The MTBF is a parameter expressing the mean time between failures occurring in an
instrument, calculated over a specific period of time. When the equipment has high
reliability, it will be difficult in practice to count enough failure occurrences, and an
imprecise MTBF figure may result; in that case, using the manufacturer's value is
recommended.
The MTTF is an alternative way to quantify reliability. It is normally used for devices
such as thermocouples, which are discarded when they fail. MTTF expresses the mean
time before a failure occurs, calculated over a number of identical devices.
The final reliability-related quantity of importance to a measurement system is the mean
time to repair (MTTR), that is, the mean time to repair an instrument or even the mean
time to replace a piece of equipment.
The combination of MTBF and MTTR gives the availability:

Availability = MTBF / (MTBF + MTTR)

Availability measures the proportion of time in which the instrument works without
failures.
The objective with measurement systems is to maximize the MTBF and minimize the
MTTR and, consequently, maximize the availability.
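A minimal sketch of this availability relationship, with hypothetical MTBF and MTTR values in hours, is shown below.

```python
# Minimal sketch of the availability relationship given in the text,
# Availability = MTBF / (MTBF + MTTR), with hypothetical numbers in hours.

def availability(mtbf_h, mttr_h):
    return mtbf_h / (mtbf_h + mttr_h)

mtbf = 87600.0   # hypothetical: about 10 years between failures
mttr = 8.0       # hypothetical: 8 hours to repair
print(f"Availability = {availability(mtbf, mttr):.6f}")  # ~0.999909
```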

Failure Models
The failure rate of a device may change throughout its life cycle. It may remain
unchanged, decrease or even increase.
In electronic devices, it is common to observe the behavior shown in figure 1, also
known as the bathtub curve.

Figure 1 - Typical Curve of reliability variation of an electronic component

Manufacturers usually apply burn-in tests to eliminate the phase up to T1 before
products are placed on the market.
Mechanical components, on the other hand, show a higher failure rate at the end of their
life cycle, as per figure 2.

Figure 2 - Typical Curve of reliability variation of a mechanical component

In practice, where systems are combinations of electronic and mechanical parts, the
failure models are complex. The more components there are, the higher the incidence
and probability of failures.

Reliability Laws
In practice, we usually have several components and the measurement system is
complex. Components may be arranged in series or in parallel.
The reliability of components in series should take into consideration the probability of
individual failures within a period of time. For a measurement system with n components
in series, the reliability Rs is the product of the individual reliabilities: Rs = R1 x R2 ... Rn.
Imagine a measurement system composed of a sensor, a conversion element and a
signal processing circuit, with reliabilities of 0.9, 0.95 and 0.99, respectively. In this case,
the system reliability will be:
0.9 x 0.95 x 0.99 = 0.85 (approximately).

Reliability can be increased by placing components in parallel, which means the system
only fails if all the components fail. In this case, the reliability Rs is given by:

Rs = 1 – Fs, where Fs is the unreliability of the system.

The unreliability is Fs = F1 x F2 ... Fn.


For example, in a safe measurement system, there are three identical instruments in
parallel. The reliability of each one is 0.95 and that of the system is:

Rs = 1 – [ (1-0.95)x(1-0.95)x(1-0.95)] = 0.999875
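The two reliability laws above can be checked with a short Python sketch; it reproduces the series example (0.9 x 0.95 x 0.99) and the three-instrument parallel example (0.95 each).

```python
# Minimal sketch of the series and parallel reliability laws from the text.
# Series:   Rs = R1 * R2 * ... * Rn
# Parallel: Rs = 1 - (1 - R1) * (1 - R2) * ... * (1 - Rn)

from math import prod

def reliability_series(reliabilities):
    return prod(reliabilities)

def reliability_parallel(reliabilities):
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Worked examples from the text:
print(reliability_series([0.9, 0.95, 0.99]))     # ~0.846 (sensor + converter + signal processing)
print(reliability_parallel([0.95, 0.95, 0.95]))  # 0.999875 (three identical instruments in parallel)
```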

Improving the reliability of a measurement system

What we seek, in practice, is to minimize the failure level. An important requirement is to
know and act before T2 (see figures 1 and 2), when the statistical frequency of failures
increases. The ideal is to make T (the operating period or life cycle) equal to T2 and
thereby maximize the failure-free period.

There are several ways to increase the reliability of a measurement system:

• Choice of instruments: one should always pay attention to the instruments specified
and to the influence of the process, materials, environment, etc. on them.
• Protection of instruments: protecting the instruments appropriately may help to
ensure a higher level of reliability. For example, thermocouples should be protected
under adverse operating conditions.
• Regular calibration: many failures are caused by drifts that may generate incorrect
outputs. Therefore, in line with good instrumentation practice, we recommend that
instruments be periodically checked and calibrated.
• Redundancy: in this case, more than one piece of equipment works in parallel, with
switchover between them, sometimes automatic. Here the reliability is significantly
improved.

Safety and Reliability Systems

Safety systems are used to monitor the condition of values and parameters of a plant
within its operational limits and, when risk conditions occur, they must trigger alarms and
place the plant in a safe condition, or even in a shutdown condition.
Note that safety conditions should be followed and adopted by plants, where best
operating and installation practices are a duty of employers and employees. It is
important to remember that the first concept in safety legislation is to ensure that all
systems are installed and operated in a safe way, and the second is that the instruments
and alarms involved in safety operate with reliability and efficiency.
Safety Instrumented Systems (SIS) are the systems responsible for operational safety
and for ensuring an emergency stop within limits considered safe whenever the
operation exceeds those limits. The main objective is to avoid accidents inside and
outside plants, such as fires, explosions and equipment damage, to protect production
and property and, above all, to avoid risk to life, harm to personal health and
catastrophic impacts on the community. It should be clear that no system is completely
immune to failures and that, even in case of failure, it should provide a safe condition.

Metrics used in the Reliability Engineering field involving SIS

1. Reliability R(t)
Reliability is a metric developed to determine the probability of success of an operation
over a specified period of time.
When the failure rate λ is very small, the unreliability function F(t), or probability of
failure (PF), can be approximated by: PF(t) ≈ λt

Figure 3 - Reliability R(t)
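For a constant failure rate, R(t) follows the exponential law R(t) = e^(-λt), and F(t) = 1 - R(t) ≈ λt when λt is small, which is the approximation quoted above. The short sketch below, with a hypothetical failure rate, illustrates this.

```python
# Minimal sketch of the constant-failure-rate reliability model behind R(t):
# R(t) = exp(-lambda*t) and F(t) = 1 - R(t), which for small lambda*t is
# approximately lambda*t, as stated above. Numbers are hypothetical.

from math import exp

lam = 0.002          # hypothetical failure rate, failures per year
t = 1.0              # one year
r_t = exp(-lam * t)  # reliability: probability of surviving one year
f_t = 1.0 - r_t      # unreliability (probability of failure)
print(f"R(t)={r_t:.6f}  F(t)={f_t:.6f}  lambda*t approximation={lam * t:.6f}")
```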

2. MTTR - Mean Time To Repair

Measuring reliability requires that a system operate successfully during a time interval.
In this context the MTTR metric appears: it is the time between the detection of a failure
and its repair (or the re-establishment of successful operation).
The rate of re-establishment of successful operation is given by: µ = 1/MTTR
In practice, it is not simple to estimate this rate, mainly when periodic inspection
activities occur, as a failure may occur just after an inspection.

3. MTBF – Mean Time Between Failures

The MTBF is a basic measure of reliability for repairable items of equipment. It may be
expressed in hours or years. It is commonly used in system reliability and maintainability
analysis and can be calculated by the following formula:

MTBF = MTTR + MTTF

Where:

• MTTR = Mean Time To Repair
• MTTF = Mean Time To Failure = the inverse of the sum of all failure rates

As the MTTR is very small in practice, it is common to assume MTBF = MTTF.

4. Availability A(t) and Unavailability U(t)

Another very useful metric is availability. It is defined as the probability of a device being
available (without failures) at a time t when it is required to operate, within the operating
conditions for which it was designed.
Unavailability is given by: U(t) = 1 – A(t)
Availability is not only a function of reliability; it is also a function of maintainability.
Table 1 below shows the relationship between reliability, maintainability and availability.
Note in this table that an increase in maintainability implies a decrease in the time
necessary to carry out maintenance actions.

Table 1 - Relationship between Reliability, Maintenance and Availability


Figure 4 - Reliability, Availability and Costs

5. Probability of Failure on Demand (PFDavg), Periodic Testing and Inspection

PFDavg is the average probability that a system (designed for failure prevention) fails
when a demand occurs. The SIL level is related to this probability of failure on demand
and to the risk reduction factor (how much protection is needed to ensure an acceptable
risk when a failure event occurs).
PFD is the reliability indicator appropriate for safety systems.
If the system is not tested, the probability of failure tends to 1.0 over time. Periodic tests
keep the probability of failure within the desired limit.
Figure 5 - Voting, PFD and Architecture

Figure 5 shows the architecture details versus voting and PFD, and figure 6 shows the
correlation between PFD and the risk reduction factor. We will discuss this in more detail
in the articles that complement this series.

Figure 6 - Correlation between PFDavg and Factor of Risk Reduction

The average probability of failure may be calculated using the following equation:

PFDavg = (Cpt x λ x TI/2) + ((1 - Cpt) x λ x LT/2), where:

• λ: failure rate
• Cpt: percentage of failures detected by the proof test
• TI: test interval
• LT: lifetime of the process unit

Let's see an example. Suppose that a valve used in a safety instrumented system has an
annual failure rate of 0.002. Every year a verification and inspection test is conducted,
and it is estimated that 70% of the failures are detected by these tests. The valve will be
used for 25 years and its demand rate is estimated at once every 100 years. What is the
average probability of failure?

Using the previous equation we have:

• λ = 0.002
• Cpt = 0.7
• TI = 1 year
• LT = 25 years

PFDavg = 0.7 x 0.002 x 1/2 + (1 - 0.7) x 0.002 x 25/2 = 0.0082
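The same calculation can be written as a small Python function, which reproduces the 0.0082 result for the valve example; the parameter names are only illustrative.

```python
# Minimal sketch of the PFDavg formula used above:
# PFDavg = Cpt * lambda * TI/2 + (1 - Cpt) * lambda * LT/2
# reproducing the valve example (lambda = 0.002 /yr, Cpt = 0.7, TI = 1 yr, LT = 25 yr).

def pfd_avg(lam, cpt, ti_years, lt_years):
    covered = cpt * lam * ti_years / 2.0            # failures found by the proof test
    uncovered = (1.0 - cpt) * lam * lt_years / 2.0  # failures only revealed over the unit's life
    return covered + uncovered

print(pfd_avg(lam=0.002, cpt=0.7, ti_years=1.0, lt_years=25.0))  # 0.0082
```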

Conclusion
In practical terms, the aim is to reduce failures and, consequently, to reduce shutdowns
and operational risks. The purpose is to increase operational availability and also, in
terms of the process, to minimize variability, with a direct impact on profitability.

References

• IEC 61508, Functional safety of electrical/electronic/programmable electronic
safety-related systems.
• IEC 61511-1, clause 11, "Functional safety - Safety instrumented systems for the
process industry sector - Part 1: Framework, definitions, system, hardware and
software requirements", 2003-01.
• ESTEVES, Marcello; RODRIGUEZ, João Aurélio V.; MACIEL, Marcos. Sistema de
intertravamento de segurança, 2003.
• William M. Goble, Harry Cheddie, "Safety Instrumented Systems Verification: Practical
Probabilistic Calculation".
• Sistemas Instrumentados de Segurança - César Cassiolato.
• "Confiabilidade nos Sistemas de Medições e Sistemas Instrumentados de Segurança"
- César Cassiolato.
• Manual LD400-SIS.
• Sistemas Instrumentados de Segurança – Uma visão prática – Parte 1, César
Cassiolato.
• Internet research.
SIS - Safety Instrumented Systems - A practical view - Part 3


César Cassiolato
Marketing, Quality, Project and Services Engineering Director
SMAR Industrial Automation
cesarcass@smar.com.br

Introduction
Safety Instrumented Systems (SIS) are used to monitor the condition of values and
parameters of a plant within its operational limits and, when risk conditions occur, they
must trigger alarms and place the plant in a safe condition, or even in a shutdown
condition.
Safety conditions should always be followed and adopted by plants, and best operating
and installation practices are a duty of employers and employees. It is important to
remember that the first concept in safety legislation is to ensure that all systems are
installed and operated in a safe way, and the second is that the instruments and alarms
involved in safety operate with reliability and efficiency.
Safety Instrumented Systems (SIS) are the systems responsible for operational safety
and for ensuring an emergency stop within limits considered safe whenever the
operation exceeds those limits. The main objective is to avoid accidents inside and
outside plants, such as fires, explosions and equipment damage, to protect production
and property and, above all, to avoid risk to life, harm to personal health and
catastrophic impacts on the community. It should be clear that no system is completely
immune to failures and that, even in case of failure, it should provide a safe condition.
For several years, safety systems were designed according to the German standards
DIN V VDE 0801 and DIN V 19250, which were well accepted by the global safety
community and which motivated the effort to create a global standard, IEC 61508. This
standard now serves as the basis for all functional safety involving electrical, electronic
and programmable electronic systems in any kind of industry, and it covers all safety
systems of an electronic nature.
Products certified according to IEC 61508 must basically cover 3 types of failures:

• Random hardware failures
• Systematic failures
• Common cause failures

IEC 61508 is divided into 7 parts, where the first 4 are mandatory and the other 3 act as
guidelines:

 Part 1: General requirements


 Part 2: Requirements for E/E/PE safety-related systems
 Part 3: Software requirements
 Part 4: Definitions and abbreviations
 Part 5: Examples of methods for the determination of safety integrity levels
 Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3
 Part 7: Overview of techniques and measures

This standard systematically covers all activities of the SIS (Safety Instrumented System)
life cycle and focuses on the performance required from the system; that is, once the
desired SIL (Safety Integrity Level) is defined, the redundancy level and the test interval
are at the discretion of whoever specifies the system.
IEC 61508 aims to consolidate the improvements brought by PES (Programmable
Electronic Systems, which include PLCs, microprocessor-based systems, distributed
control systems, sensors, intelligent actuators, etc.) so as to standardize the concepts
involved.
Recently, several standards on SIS development, design and maintenance have been
prepared, such as the already mentioned IEC 61508 (for industry in general); it is also
important to mention IEC 61511, focused on the process industries (continuous, liquid
and gas processes).
In practice, in several applications, equipment with SIL certification has been specified
for use in control systems that have no safety function. There is also a certain amount of
misinformation in the market, leading to the purchase of more expensive equipment
developed for safety functions that, in practice, will be used in process control functions,
where SIL certification does not bring the expected benefits and can even make the use
and operation of the equipment more difficult.
In addition, such misinformation leads users to believe that they have a certified safe
control system, when what they actually have is a controller with certified safety functions.
With the increasing use of digital equipment and instruments, it is extremely important
that professionals involved in projects or in day-to-day instrumentation are qualified,
know how to determine the performance required from safety systems, and master the
calculation tools and risk rates within acceptable limits.
In addition, it is necessary to:

• Understand common mode failures; know which types of safe and unsafe failures are
possible in a specific system, how to prevent them, and when, how, where and which
redundancy level is most appropriate for each case.
• Define the preventive maintenance level appropriate for each application.

The simple use of modern, sophisticated or even certified equipment does not in itself
ensure any improvement in reliability and operational safety compared with traditional
technologies, unless the system is deployed with criteria and knowledge of the
advantages and limitations inherent to each type of technology available. In addition,
the entire SIS life cycle should be kept in mind.
We commonly see accidents related to safety devices bypassed by operations or during
maintenance. It is certainly very difficult to prevent, at the design stage, one of these
devices from being bypassed in the future, but a solid design that satisfies the
operational needs of the safety system user can considerably eliminate or reduce the
number of unauthorized bypasses.
By using techniques with fixed or programmable logic circuits, fault-tolerant and/or
fail-safe designs, microcomputers and software concepts, it is possible today to design
efficient and safe systems at costs suitable for this function.
The SIS complexity level depends greatly on the process considered. Heaters, reactors,
cracking columns, boilers and furnaces are typical examples of equipment requiring
carefully designed and implemented safety interlock systems.
The appropriate operation of a SIS requires better performance and diagnostic
conditions than conventional systems. Safe operation in a SIS is provided by sensors,
logic solvers, processors and final elements designed with the purpose of causing a
shutdown whenever safe limits are exceeded (for example, when process variables such
as pressure and temperature go above their very high alarm limits), or even preventing
operation under conditions unfavorable to safe operation.
Typical examples of safety systems:

• Emergency Shutdown System
• Safety Shutdown System
• Safety Interlock System
• Fire and Gas System

In the previous article, the second part of this series, we saw some details of Reliability
Engineering. Now we will look at models using series and parallel systems, fault trees,
Markov models and some calculations.
Failure Analysis - Fault Trees
There are several methodologies for failure analysis. One of the most widely used is
fault tree analysis (FTA), which aims to improve the reliability of products and processes
through a systematic analysis of possible failures and their consequences, guiding the
adoption of corrective or preventive measures.
The fault tree diagram shows the hierarchical relationship between the identified failure
modes. The tree construction process begins with the perception or anticipation of a
failure, which is then decomposed and detailed into simpler events. Fault tree analysis is
therefore a top-down technique, starting from general events that are broken down into
more specific events.
Below, an example of an FTA diagram applied to a failure in an electric motor is shown.
The initial event, which may be an observed or anticipated failure, is called the top event
and is indicated by the blue arrow. From that event, other failures are detailed until
reaching the basic events that make up the resolution limit of the diagram. The failures
shown in yellow form the resolution limit of this diagram.

Figure 1 - FTA Example


It is possible to add logic elements to the diagram, such as “AND” and “OR” gates, to
better describe the relationship between failures. In that way, the diagram can be used
to estimate the probability of a failure from the probabilities of more specific events. The
following example shows a tree applied to an overheating problem in an electric motor
using logic elements.

Figure 2 - FTA Example using logic elements
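Assuming the basic events are independent, the gate probabilities combine as products (AND) and complements of products (OR). The sketch below is a minimal illustration with hypothetical event probabilities; it does not correspond to the figures.

```python
# Minimal sketch of how AND/OR gates in a fault tree combine the probabilities
# of independent basic events; the event names and probabilities below are
# hypothetical and do not correspond to the figures.

from math import prod

def gate_and(probs):
    """Output event occurs only if all input events occur."""
    return prod(probs)

def gate_or(probs):
    """Output event occurs if at least one input event occurs."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical top event: overheating = (fan failure AND blocked vent) OR thermostat failure
p_top = gate_or([gate_and([0.01, 0.05]), 0.002])
print(f"P(top event) = {p_top:.6f}")  # ~0.002499
```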

Fault tree analysis was developed in the early 1960s by engineers at Bell Telephone
Laboratories.
Logic Symbols used in the FTA
The FTA is a graphic representation of the interrelation between equipment or
operational failures that may result in a specific accident. The symbols shown below are
used in the construction of the tree to represent these interrelations.
“OR” Gate: indicates that the output event occurs when any of the input events occurs.

“AND” Gate: indicates that the output event occurs only when all of the input events
occur simultaneously.

Inhibit Gate: indicates that the output event occurs when the input event occurs and the
inhibiting condition is met.

Restriction (delay) Gate: indicates that the output event occurs when the input event
occurs and the specified delay or restriction time has expired.

BASIC EVENT: represents a basic equipment or system failure that does not require any
other failures or additional defects.

INTERMEDIATE EVENT: represents a failure event resulting from the interaction of other
failures, combined through the logic gates described above.

UNDEVELOPED EVENT: represents a failure that is not examined further, either because
information is not available or because its consequences are not significant.

EXTERNAL EVENT: represents a condition or event assumed to exist as a boundary
condition for the analysis.

TRANSFER: indicates that the fault tree is developed further on other sheets. Transfer
symbols are identified by numbers or letters.

Figure 3 - Logic Symbols used in the FTA

Markov Models
A Markov model is a state diagram in which the several failure states of a system are
identified. The states are connected by arcs labeled with the failure rates or repair rates
that take the system from one state to another (see figures 4 and 5). Markov models are
also known as state space diagrams or state diagrams. The state space is defined as
the set of all states in which the system can be found.

Figure 4 - Example of Markov model


For a specific system, a Markov model consists of a list of all the possible states of that
system, the possible transition paths between those states, and the rates of those
transitions. In a reliability analysis the transitions usually consist of failures and repairs.
When a Markov model is represented graphically, each state is drawn as a circle, with
arrows indicating the transition paths between states, as shown in figure 4.
The Markov method is a useful technique for modeling reliability in systems where
failures are statistically independent and failure and repair rates are constant.
The state of a component is understood as the set of possible values its parameters can
assume. These parameters are called state variables and describe the condition of the
component. The state space is the set of all states that a component may present.
The Markov model of a real system usually includes a “full-up” state (that is, the state
with all elements operating) and a set of intermediate states representing partial failure
conditions, leading to the completely failed state, that is, the state in which the system is
incapable of performing its design function. The model can include repair transition
paths as well as failure transition paths. In general, each transition path between two
states reduces the probability of the state it leaves and increases the probability of the
state it enters, at a rate equal to the transition parameter multiplied by the current
probability of the state of origin.
The overall probability flow into a specific state is the sum of all transition rates into that
state, each one multiplied by the probability of the state at the origin of that transition.
The probability flow out of a specific state is the sum of all transition rates leaving that
state, multiplied by the probability of that state. To exemplify this, the typical input and
output flows of a state and of the surrounding states are represented in figure 4.
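As a minimal illustration of this probability flow, the sketch below integrates a hypothetical two-state (Up/Failed) Markov model with constant failure and repair rates and compares the result with the analytical steady-state availability µ/(λ+µ).

```python
# Minimal sketch of a two-state Markov model (Up <-> Failed) with constant
# failure rate lam and repair rate mu, illustrating the "probability flow"
# described above. Steady-state availability is mu / (lam + mu).
# All numbers are hypothetical.

lam = 1e-4   # failures per hour
mu = 0.125   # repairs per hour (MTTR = 8 h)

# Numerically integrate dP/dt with a small time step (explicit Euler).
p_up, p_failed = 1.0, 0.0
dt = 0.1  # hours
for _ in range(200_000):                 # simulate 20,000 hours
    flow_fail = lam * p_up * dt          # probability flowing Up -> Failed
    flow_repair = mu * p_failed * dt     # probability flowing Failed -> Up
    p_up += flow_repair - flow_fail
    p_failed += flow_fail - flow_repair

print(f"Simulated availability  ~ {p_up:.6f}")
print(f"Analytical mu/(lam+mu)  = {mu / (lam + mu):.6f}")  # ~0.999201
```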

In such a model, all failures are classified as dangerous failures or as safe failures. A
dangerous failure is one that places the safety system in a state in which it will not be
available to stop the process if that becomes necessary. A safe failure is one that leads
the system to stop the process in a situation with no danger; a safe failure is usually
called a false or spurious trip.
Markov models include diagnostic coverage factors for all components, as well as repair
rates. The models consider that undetected failures will be diagnosed and repaired
through periodic proof tests.
Markov models also include failure rates associated with performance failures and
common cause hardware failures.
The system modeling should include all possible types of failures, and they can be
grouped into two categories:
1) Physical failures
2) Performance failures
Physical failures are those that occur when the function performed by a module, a
component, etc. deviates from the specified function due to physical degradation.
Physical failures may be failures due to natural aging or failures caused by the
environment.
To use physical failures in Markov models, the cause of the failures and their effects on
the modules should be determined. Physical failures should be categorized as
dependent or independent failures.
Independent failures are those that never affect more than one module, while
dependent failures may cause the failure of several modules.
Performance failures are those that occur when the physical equipment is in operation
but does not perform the specified function, due to a performance deficiency or human
error. Examples of performance failures are: safety system design errors, software
errors, hardware connection errors, human interaction errors and hardware design
errors.
In Markov models, performance failures are separated into safe and dangerous failures.
A safe performance failure is assumed to result in a spurious trip. Similarly, a dangerous
performance failure results in a fail-to-function state, that is, a failure in which the system
is no longer available to stop the process. The evaluation of the performance failure rate
should take into consideration several possible causes, such as:
1) Safety system design errors
These include logic specification errors in the safety system, inappropriate choice of
architecture for the system, incorrect selection of sensors and actuators, and errors in
the design of the interface between the PLCs and the sensors and actuators.
2) Hardware implementation errors
These include errors in connecting the sensors and actuators to the PLCs. The
probability of an error increases with I/O redundancy if the user has to connect each
sensor and each actuator to several I/O terminals. The use of redundant sensors and
actuators also leads to a higher probability of connection errors.
3) Software errors
These include errors in software developed both by the supplier and by the user. The
supplier's software usually includes an operating system, the I/O routines, application
functions and the programming language. Supplier software errors may be minimized
by ensuring a good software design and compliance with coding procedures and tests.
Independent tests conducted by other companies may also be very useful.
Errors in software developed by the user include application program errors and errors
in the user interface, diagnostics and routines (displays, etc.). Engineers specialized in
safety system software may help minimize user software errors. In addition, exhaustive
software tests should be conducted.
4) Human interaction errors
These include design and operation errors in the man-machine interface of the safety
system, errors made during periodic safety system tests, and errors made during
maintenance of defective modules of the safety system. Maintenance errors may be
reduced by good safety system diagnostics that identify the defective module and by
including failure indicators in the defective modules. It is important to keep in mind that,
on this point, there is no perfect or failure-proof diagnostic.
5) Hardware design errors
These include manufacturing and design errors in the PLCs, sensors and actuators, as
well as user errors in the interface between the safety system and the process.
In redundant configurations of PLCs, sensors and actuators, some performance failures
may be reduced by using diverse hardware and/or software.
Dependent failures should be modeled in a different way, as it is possible for multiple
failures to occur simultaneously. From the modeling point of view, the dominant
dependent failures are common cause failures. Common cause failures are the direct
result of a common root cause. An example is radio-frequency interference causing the
simultaneous failure of multiple modules. The analysis of this kind of failure is very
complex and requires deep knowledge of the system, both at the hardware and software
level and of the environment.
Figure 5 - Example of Markov model in redundant system

Certainly, with equipment and tools certified according to IEC 61508, the failure rates of
the products are known, which makes safety calculations and architecture choices easier.

Conclusion
In practical terms, the aim is to reduce failures and, consequently, to reduce shutdowns
and operational risks. The purpose is to increase operational availability and also, in
terms of the process, to minimize variability, with a direct impact on profitability.
References

• IEC 61508, Functional safety of electrical/electronic/programmable electronic
safety-related systems.
• IEC 61511-1, clause 11, "Functional safety - Safety instrumented systems for the
process industry sector - Part 1: Framework, definitions, system, hardware and
software requirements", 2003-01.
• ESTEVES, Marcello; RODRIGUEZ, João Aurélio V.; MACIEL, Marcos. Sistema de
intertravamento de segurança, 2003.
• William M. Goble, Harry Cheddie, "Safety Instrumented Systems Verification: Practical
Probabilistic Calculation".
• Sistemas Instrumentados de Segurança - César Cassiolato.
• "Confiabilidade nos Sistemas de Medições e Sistemas Instrumentados de Segurança"
- César Cassiolato.
• Manual LD400-SIS.
• Sistemas Instrumentados de Segurança – Uma visão prática – Parte 1, César
Cassiolato.
• Internet research.

SIS - Safety Instrumented Systems - A practical view - Part 4


César Cassiolato
Marketing, Quality, Project and Services Engineering Director
SMAR Industrial Automation
cesarcass@smar.com.br
Introduction
Safety Instrumented Systems (SIS) are used to monitor the condition of values and
parameters of a plant within its operational limits and, when risk conditions occur, they
must trigger alarms and place the plant in a safe condition, or even in a shutdown
condition.
Safety conditions should always be followed and adopted by plants, and best operating
and installation practices are a duty of employers and employees. It is important to
remember that the first concept in safety legislation is to ensure that all systems are
installed and operated in a safe way, and the second is that the instruments and alarms
involved in safety operate with reliability and efficiency.
Safety Instrumented Systems (SIS) are the systems responsible for operational safety
and for ensuring an emergency stop within limits considered safe whenever the
operation exceeds those limits. The main objective is to avoid accidents inside and
outside plants, such as fires, explosions and equipment damage, to protect production
and property and, above all, to avoid risk to life, harm to personal health and
catastrophic impacts on the community. It should be clear that no system is completely
immune to failures and that, even in case of failure, it should provide a safe condition.
For several years, safety systems were designed according to the German standards
DIN V VDE 0801 and DIN V 19250, which were well accepted by the global safety
community and which motivated the effort to create a global standard, IEC 61508. This
standard now serves as the basis for all functional safety involving electrical, electronic
and programmable electronic systems in any kind of industry, and it covers all safety
systems of an electronic nature.
Products certified according to IEC 61508 should basically cover 3 types of failures:

 Random hardware failures
 Systematic failures
 Common cause failures

IEC 61508 is divided in 7 parts, where the first 4 are mandatory and the other 3 act as
guidelines:

 Part 1: General requirements
 Part 2: Requirements for E/E/PE safety-related systems
 Part 3: Software requirements
 Part 4: Definitions and abbreviations
 Part 5: Examples of methods for the determination of safety integrity levels
 Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3
 Part 7: Overview of techniques and measures

Such standard systematically covers all activities of a SIS (Safety Instrumented System)
life cycle and is focused on the performance required from a system, that is, once the
desired SIL level (safety integrity level) is reached, the redundancy level and the test
interval are at the discretion of who specified the system.
IEC61508 aims at potentializing the improvements of PES (Programmable Electronic
Safety, where the PLCs, microprocessed systems, distributed control systems, sensors,
and intelligent actuators, etc. are included) so as to standardize the concepts involved.
Recently, several standards on the SIS development, project and maintenance were
prepared, as IEC 61508 (overall industries) already mentioned, and is also important to
mention IEC 61511, focused on industries with ongoing, liquid, and gas process.
In practice, in several applications it has been seen the specification of equipment with
SIL certification to be used in control systems, without safety function. It is also believed
that there is a market of misinformation, leading to the purchase of more expensive
equipment developed for safety functions that, in practice, will be used in process
control functions, where SIL certification does not bring the expected benefits and can
even make the use and operation of the equipment more difficult.
In addition, such misinformation leads users to believe that they have a certified safe
control system, when what they actually have is a controller with certified safety functions.
With the increasing use of digital equipment and instruments, it is extremely important
that the professionals involved in projects or in day-to-day instrumentation work are
qualified, know how to determine the performance required from the safety systems,
and master the calculation tools and the risk rates within acceptable limits.
In addition, it is necessary to:

 Understand common mode failures and know which types of safe and unsafe
failures are possible in a specific system, how to prevent them and, also, when, how,
where and which redundancy level is more appropriate for each case.
 Define the preventive maintenance level appropriate for each application.

The simple use of modern, sophisticated or even certified equipment does not, by itself,
ensure any improvement in reliability and operational safety compared with traditional
technologies, unless the system is deployed with judgment and with knowledge of the
advantages and limitations inherent to each available technology. In addition, the entire
SIS life cycle should be kept in mind.
We commonly see accidents related to safety devices bypassed by operations or during
maintenance. It is certainly very difficult to prevent, at the design stage, one of these
devices from being bypassed in the future, but a solid design that better satisfies the
operational needs of the safety system user can considerably reduce or eliminate the
number of unauthorized bypasses.
By applying techniques based on fixed or programmable logic circuits, fault-tolerant
and/or fail-safe concepts, microcomputers and software, it is possible today to design
efficient and safe systems at costs suitable for this function.
The SIS complexity level depends a lot on the process considered. Heaters, reactors,
cracking columns, boilers, and stoves are typical examples of equipment requiring a
carefully designed and implemented safety interlock system.
The appropriate operation of a SIS requires better performance and diagnostic
conditions than conventional systems. Safe operation of a SIS relies on sensors, logic
solvers, processors and final elements designed to cause a shutdown whenever safe
limits are exceeded (for example, process variables such as pressure and temperature
above the very-high alarm limits) or even to prevent operation under conditions
unfavorable to safe operation.
Typical examples of safety systems:

 Emergency Shutdown System
 Safety Shutdown System
 Safety Interlock System
 Fire and Gas System

We saw in the previous article, the third part, some details on fault tree analysis
models, the Markov model and some calculations.
In this fourth part, we will look at some points about the SIF verification process.

SIF Verification Process (Safety Instrumented Function)


A Safety Instrumented System (SIS) is a critical layer for accident prevention. A SIS
performs several SIFs (Safety Instrumented Functions) and is typically composed of
sensors, logic solvers and final control elements. The acceptable probability of failure
on demand, and hence the SIL (Safety Integrity Level), of each SIF needs to be
determined for the design and subsequently verified.
The safety analysis is performed on the risk levels of the SIFs.
A pressure transmitter and a positioner, for example, may be part of a SIF.
There are several methods to identify the SIL required for each SIF. One of them is the
Layer of Protection Analysis (LOPA), a risk analysis technique applied after a qualitative
hazard identification technique such as HAZOP (Hazard and Operability Study). Derived
from a quantitative risk analysis tool, event tree frequency analysis, LOPA may be
described as a semi-quantitative technique, as it provides a risk estimate.
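To make the idea concrete, here is a minimal, hypothetical sketch (not from the article) of a LOPA-style calculation: the initiating-event frequency, the credits taken for the independent protection layers (IPLs) and the tolerable frequency are all assumed values, and the SIL mapping follows the usual low-demand PFD bands.

def required_pfd(initiating_freq, ipl_pfds, tolerable_freq):
    # Mitigated frequency = initiating frequency x product of the IPL PFDs;
    # the SIF must close the remaining gap down to the tolerable frequency.
    mitigated = initiating_freq
    for pfd in ipl_pfds:
        mitigated *= pfd
    return tolerable_freq / mitigated

def sil_from_pfd(pfd):
    # Usual low-demand PFD bands: SIL 1 covers 1e-2 to 1e-1, SIL 2 covers 1e-3 to 1e-2, etc.
    if pfd < 1e-4: return "SIL 4"
    if pfd < 1e-3: return "SIL 3"
    if pfd < 1e-2: return "SIL 2"
    if pfd < 1e-1: return "SIL 1"
    return "no SIL credit required"

f0 = 0.1            # initiating event frequency, per year (hypothetical)
ipls = [0.1, 0.1]   # e.g. BPCS alarm + relief device, PFD = 0.1 each (hypothetical)
ft = 1.0e-5         # tolerable event frequency, per year (hypothetical)
pfd = required_pfd(f0, ipls, ft)
print(f"required SIF PFD <= {pfd:.1e} -> {sil_from_pfd(pfd)}")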
Control systems are designed to keep the process within the specific process
parameters considered acceptable for the normal and safe operation of the plant.
When the process exceeds the normal operating limits, it may present potential risk to
human life, the environment and assets. In the evaluation stage, the risks are identified
together with their consequences, and the ways to prevent their occurrence are defined.
The probability of the identified risk is reduced according to the capacity of the system
to provide protection layers. The risk reduction establishes three criteria:

 The equipment should be approved for the environmental conditions where it will be
installed;
 The subsystems should have the fault tolerance required by the dangerous failures
presented by the process;
 The Probability of Failure on Demand (PFD) of the SIF should be appropriate to the
risks acceptable to the company.

The user should have full information about the equipment, so that a good analysis of
SIF performance can be conducted. Constructive techniques oriented toward tolerance
of component failures prevent a single failure from causing a device failure. Finally, the
performance calculation determines whether the SIS meets the design expectations
regarding the desired integrity level. SIS reliability is defined by a few parameters (see
the sketch after the list):

 Mean Time Between Failures (MTBF)
 Voting architecture
 Diagnostic coverage (DC)
 Test interval (TI)
 Mean Time to Repair (MTTR)
 Common mode failures
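As an illustration of how these parameters interact (a sketch under assumed values, not the article's own calculation), the fragment below uses a common simplified low-demand approximation for a single channel; the 500-year failure figure and 8-hour MTTR are hypothetical.

HOURS_PER_YEAR = 8760.0

def pfd_avg_1oo1(lambda_d, dc, ti_hours, mttr_hours):
    # Simplified low-demand approximation for one channel:
    #   PFDavg ~= lambda_DU * TI / 2 + lambda_DD * MTTR
    # where the diagnostic coverage DC splits the dangerous failure rate into
    # detected failures (repaired within MTTR) and undetected failures.
    lambda_du = (1.0 - dc) * lambda_d
    lambda_dd = dc * lambda_d
    return lambda_du * ti_hours / 2.0 + lambda_dd * mttr_hours

lambda_d = 1.0 / (500.0 * HOURS_PER_YEAR)  # dangerous rate from a 500-year MTBF-like figure
print(f"DC = 0%,  TI = 1 year: PFDavg = {pfd_avg_1oo1(lambda_d, 0.0, HOURS_PER_YEAR, 8.0):.1e}")
print(f"DC = 90%, TI = 1 year: PFDavg = {pfd_avg_1oo1(lambda_d, 0.9, HOURS_PER_YEAR, 8.0):.1e}")

The same sketch shows why diagnostic coverage and test interval appear side by side in the list: raising DC or shortening TI both pull the average PFD down.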

For each SIF, at least the following information should be analyzed (see the sketch after this list):

 Hazard and its consequences
 The hazard frequency
 Definition of the process safe state
 SIF description
 Description of the process measurements and their trip points
 Relation between inputs and outputs, including logics, mathematical functions,
operation modes, etc.
 Required SIL
 Proof test period
 Maximum trip rate allowed
 Maximum response time for SIF
 Requirements for SIF activation
 Requirements for SIF reset
 SIF response in case of diagnosis failure
 Human interface requirement, that is, what needs to be shown in Displays,
supervision, etc.
 Maintenance requirement
 MTTR estimation after trip
 Expected environmental conditions in several situations: normal and emergency
operation.
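One practical way to keep this information together is a simple record per SIF. The sketch below is only illustrative; the field names paraphrase the list above and all values are hypothetical.

from dataclasses import dataclass, field

@dataclass
class SifRequirement:
    tag: str
    hazard: str
    hazard_frequency_per_year: float
    safe_state: str
    description: str
    trip_points: dict                    # measurement -> trip point
    required_sil: int
    proof_test_interval_months: int
    max_spurious_trip_rate_per_year: float
    max_response_time_s: float
    reset_mode: str = "manual"           # requirement for SIF reset
    mttr_hours_after_trip: float = 8.0   # MTTR estimation after trip
    notes: list = field(default_factory=list)

# Hypothetical example:
sif_101 = SifRequirement(
    tag="SIF-101",
    hazard="Vessel overpressure leading to relief to atmosphere",
    hazard_frequency_per_year=0.2,
    safe_state="Feed supply cut",
    description="Cut the feed on very high vessel pressure",
    trip_points={"PT-100 pressure": "80% of the relief valve set point"},
    required_sil=1,
    proof_test_interval_months=12,
    max_spurious_trip_rate_per_year=0.1,
    max_response_time_s=5.0,
)
print(sif_101.tag, "->", sif_101.safe_state)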

Equipment Selection
Care must be taken in the choice of equipment used in safety systems. Equipment
should be specified as certified according to IEC 61508 or as complying with the
“prior use” criteria of IEC 61511.
Proven in Use (PIU) is a characteristic defined by IEC 61511 (clause 11.4.4): if a piece of
equipment has already been successfully used in safety applications and meets certain
requirements (see below), then the HFT (hardware fault tolerance) requirement can be
reduced, allowing the equipment to be used in safety applications at lower cost:

 The supplier's quality system should be considered
 Equipment hardware and software version
 Documentation of performance and application in safety systems
 The equipment cannot be programmable and should only allow, for example, setting
the operating range
 The equipment needs write protection or a jumper command
 In such cases, the SIF is SIL 3 or lower.

The major advantage is that it becomes possible to standardize the same equipment for
control use and for safety use at a much lower cost.
Through a hardware analysis called FMEDA (Failure Modes, Effects and Diagnostics
Analysis), it is also possible to determine the failure rates and failure modes of an
instrument. This type of analysis is an extension of the well-known FMEA (Failure Mode
and Effect Analysis) methodology. The FMEDA identifies and calculates the failure
rates in the following categories: safe detected, safe undetected, dangerous detected
and dangerous undetected. These failure rates are used to calculate the diagnostic
coverage and the safe failure fraction.
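A minimal sketch (not from the article) of how the four FMEDA categories are commonly combined into the safe failure fraction (SFF) and the diagnostic coverage of dangerous failures; the rates used are hypothetical.

def fmeda_summary(sd, su, dd, du):
    # sd/su: safe detected/undetected; dd/du: dangerous detected/undetected
    total = sd + su + dd + du
    sff = (sd + su + dd) / total   # safe failure fraction
    dc = dd / (dd + du)            # diagnostic coverage of dangerous failures
    return sff, dc

# Hypothetical rates in FIT (failures per 10^9 hours):
sff, dc = fmeda_summary(sd=200.0, su=100.0, dd=500.0, du=50.0)
print(f"SFF = {sff:.1%}, DC (dangerous) = {dc:.1%}")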
Once the safety integrity level and its requirements have been determined, the
equipment, redundancy levels and tests are chosen according to the SIF demand. After
that, using the information for each piece of equipment and device, it is verified by
equations, fault tree analysis, Markov models and other techniques whether the chosen
equipment meets the safety requirements.
How to determine the architecture?
 A SIF architecture is determined by the fault tolerance of its components.
 A higher SIL level may be reached by using redundancy.
 The number of pieces of equipment will depend on the reliability of each component,
defined in its FMEDA (Failure Modes, Effects and Diagnostic Analysis).
 The three most common architectures are:
 Simplex or voting 1oo1 (1 out of 1)
 Duplex or voting 1oo2 or 2oo2
 Triplex or voting 2oo3

Figure 1 shows the most common examples of architecture for safety systems, where
several techniques are used according to the voting system and desired SIL:

Figure 1 - Typical examples of architecture for safety systems


For SIFs, the failure probability may be interpreted as the transition of a device from the
operating state to a state in which it no longer performs the function for which it was
specified.
When the device is proof-tested, PFD(t) is reduced to its initial value. That involves two
implicit assumptions:

 Every device failure is detected by the inspection and by the proof test.
 The device is repaired and returned to service as good as new.
The effect of the proof test is the saw-tooth shape shown in Figure 2.

As a result, the test interval is a decisive factor in determining the SIL classification
achieved.

Figure 2 - Transition and PFD states

Establishing the Functional Test Interval

 The test interval is a parameter that significantly affects the PFD and, therefore, the SIL.
 It is common to increase the test frequency and thereby decrease the probability of
failure (for example, valve tests and partial stroke testing).
 A SIF designed to support SIL 2 may only support SIL 1 if the test interval is too long.
 In the same way, with two SIL 2 pieces of equipment in a voting arrangement and a
short test interval, SIL 3 can be supported (see the sketch after this list).
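The sketch below illustrates the point numerically for a single channel, assuming a hypothetical dangerous undetected failure rate and the simplified PFDavg ~ lambda_DU x TI / 2 approximation; the SIL mapping uses the usual low-demand bands.

HOURS_PER_YEAR = 8760.0
LAMBDA_DU = 5.0e-7   # dangerous undetected failures per hour (hypothetical)

def sil_band(pfd_avg):
    # Usual low-demand PFD bands
    if pfd_avg < 1e-4: return "SIL 4"
    if pfd_avg < 1e-3: return "SIL 3"
    if pfd_avg < 1e-2: return "SIL 2"
    if pfd_avg < 1e-1: return "SIL 1"
    return "below SIL 1"

for ti_months in (60, 12, 1):
    ti_hours = ti_months / 12.0 * HOURS_PER_YEAR
    pfd = LAMBDA_DU * ti_hours / 2.0   # simplified single-channel PFDavg
    print(f"test every {ti_months:2d} month(s): PFDavg = {pfd:.1e} -> {sil_band(pfd)}")

With these assumed numbers, the same device moves from the SIL 1 band to the SIL 3 band simply by shortening the proof-test interval from five years to one month.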

Conclusion
In practical terms, the aim is to reduce failures and, consequently, shutdowns and
operational risks. The purpose is to increase operational availability and, in terms of the
process, to minimize variability, with a direct impact on profitability.

References
 IEC 61508 – Functional safety of electrical/electronic/programmable electronic
safety-related systems.
 IEC 61511-1, clause 11, "Functional safety - Safety instrumented systems for the
process industry sector - Part 1: Framework, definitions, system, hardware and
software requirements", 2003-01
 ESTEVES, Marcello; RODRIGUEZ, João Aurélio V.; MACIEL, Marcos. Sistema de
intertravamento de segurança, 2003.
 William M. Goble, Harry Cheddie, "Safety Instrumented Systems Verification:
Practical Probabilistic Calculation"
 Sistemas Instrumentados de Segurança - César Cassiolato
 “Confiabilidade nos Sistemas de Medições e Sistemas Instrumentados de
Segurança” - César Cassiolato
 Manual LD400-SIS
 Sistemas Instrumentados de Segurança – Uma visão prática – Parte 1, César
Cassiolato
 Internet research

SIS - Safety Instrumented Systems - A practical view - Part 5

César Cassiolato
Marketing, Quality, Project and Services Engineering Director
SMAR Industrial Automation
cesarcass@smar.com.br

Introduction
The Safety Instrumented Systems are used to monitor the condition of values and
parameters of a plant within the operational limits and, when risk conditions occur, they
must trigger alarms and place the plant in a safe condition or even at the shutdown
condition.
The safety conditions should be always followed and adopted by plants and the best
operating and installation practices are a duty of employers and employees. It is
important to remember that the first concept regarding the safety law is to ensure that all
systems are installed and operated in a safe way and the second one is that
instruments and alarms involved with safety are operated with reliability and efficiency.
The Safety Instrumented Systems (SIS) are the systems responsible for the operating
safety and ensuring the emergency stop within the limits considered as safe, whenever
the operation exceeds such limits. The main objective is to avoid accidents inside and
outside plants, such as fires, explosions, equipment damages, protection of production
and property and, more than that, avoiding life risk or personal health damages and
catastrophic impacts to community. It should be clear that no system is completely
immune to failures and, even in case of failure; it should provide a safe condition.
For several years, the safety systems were designed according to the German
standards (DIN V VDE 0801 and DIN V 19250), which were well accepted for years by
the global safety community and which caused the efforts to create a global standard,
IEC 61508, which now works as a basis for all operational safety regarding electric,
electronic systems and programmable devices for any kind of industry. Such standard
covers all safety systems with electronic nature.
Products certified according to IEC 61508 should basically cover 3 types of failures:

 Random hardware failures
 Systematic failures
 Common cause failures

IEC 61508 is divided in 7 parts, where the first 4 are mandatory and the other 3 act as
guidelines:

 Part 1: General requirements
 Part 2: Requirements for E/E/PE safety-related systems
 Part 3: Software requirements
 Part 4: Definitions and abbreviations
 Part 5: Examples of methods for the determination of safety integrity levels
 Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3
 Part 7: Overview of techniques and measures

Such standard systematically covers all activities of a SIS (Safety Instrumented System)
life cycle and is focused on the performance required from a system, that is, once the
desired SIL level (safety integrity level) is reached, the redundancy level and the test
interval are at the discretion of who specified the system.
IEC61508 aims at potentializing the improvements of PES (Programmable Electronic
Safety, where the PLCs, microprocessed systems, distributed control systems, sensors,
and intelligent actuators, etc. are included) so as to standardize the concepts involved.
Recently, several standards on the SIS development, project and maintenance were
prepared, as IEC 61508 (overall industries) already mentioned, and is also important to
mention IEC 61511, focused on industries with ongoing, liquid, and gas process.
In practice, in several applications it has been seen the specification of equipment with
SIL certification to be used in control systems, without safety function. It is also believed
that there is a market of misinformation, leading to the purchase of more expensive
equipment developed for safety functions that, in practice, will be used in process
control functions, where SIL certification does not bring the expected benefits and can
even make the use and operation of the equipment more difficult.
In addition, such misinformation leads users to believe that they have a certified safe
control system, when what they actually have is a controller with certified safety functions.
With the increasing use of digital equipment and instruments, it is extremely important
that the professionals involved in projects or in day-to-day instrumentation work are
qualified, know how to determine the performance required from the safety systems,
and master the calculation tools and the risk rates within acceptable limits.
In addition, it is necessary to:
 Understand common mode failures and know which types of safe and unsafe
failures are possible in a specific system, how to prevent them and, also, when, how,
where and which redundancy level is more appropriate for each case.
 Define the preventive maintenance level appropriate for each application.

The simple use of modern, sophisticated or even certified equipment does not, by itself,
ensure any improvement in reliability and operational safety compared with traditional
technologies, unless the system is deployed with judgment and with knowledge of the
advantages and limitations inherent to each available technology. In addition, the entire
SIS life cycle should be kept in mind.
We commonly see accidents related to safety devices bypassed by operations or during
maintenance. It is certainly very difficult to prevent, at the design stage, one of these
devices from being bypassed in the future, but a solid design that better satisfies the
operational needs of the safety system user can considerably reduce or eliminate the
number of unauthorized bypasses.
By applying techniques based on fixed or programmable logic circuits, fault-tolerant
and/or fail-safe concepts, microcomputers and software, it is possible today to design
efficient and safe systems at costs suitable for this function.
The SIS complexity level depends a lot on the process considered. Heaters, reactors,
cracking columns, boilers, and stoves are typical examples of equipment requiring a
carefully designed and implemented safety interlock system.
The appropriate operation of a SIS requires better performance and diagnostic
conditions than conventional systems. Safe operation of a SIS relies on sensors, logic
solvers, processors and final elements designed to cause a shutdown whenever safe
limits are exceeded (for example, process variables such as pressure and temperature
above the very-high alarm limits) or even to prevent operation under conditions
unfavorable to safe operation.
Typical examples of safety systems:

 Emergency Shutdown System
 Safety Shutdown System
 Safety Interlock System
 Fire and Gas System

We saw in the previous article, the fourth part, some details on the SIF verification
process.
In this fifth and last part, we will look at typical SIF solutions and an application
example.
SIF Typical Solutions (Safety Instrumented Function)
How to determine the architecture?
 A SIF architecture is determined by the fault tolerance of its components.
 A higher SIL level may be reached by using redundancy.
 The number of pieces of equipment will depend on the reliability of each component,
defined in its FMEDA (Failure Modes, Effects and Diagnostic Analysis).
 The three most common architectures are:
 Simplex or voting 1oo1 (1 out of 1)
 Duplex or voting 1oo2 or 2oo2
 Triplex or voting 2oo3

Simplex or voting 1oo1 (1 out of 1)


The 1oo1 voting principle involves a single-channel system and is normally intended for
lower-level safety applications. A single failure immediately results in the loss of the
safety function or in a process shutdown.

Duplex or voting 1oo2 or 2oo2


The 1oo2 voting principle was developed to improve the safety integrity performance of
systems based on 1oo1. If a failure occurs in one channel, the other is still capable of
performing the safety function. Unfortunately, this concept does not improve the false
trip rate; even worse, the probability of a false trip is almost doubled.
2oo2: The main disadvantage of a single (that is, non-redundant) safety system is that a
single failure immediately leads to a trip. The duplication of channels in a 2oo2
arrangement significantly reduces the probability of a false trip, as both channels have
to fail for the system to be placed in shutdown. On the other hand, the system has the
disadvantage that its probability of failure on demand is twice that of a single channel.

Triplex or voting 2oo3


2oo3: In this voting scheme there are three channels, two of which must be healthy for
the system to operate and perform the safety function. The 2oo3 voting principle is best
applied when there is complete physical separation of the microprocessors, which
requires that they be located in three different modules. Although the most recent
systems have a higher level of diagnostics, safety systems based on 2oo3 voting still
have the disadvantage of a probability of failure on demand approximately three times
higher than that of 1oo2-based systems.
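The trade-off described above can be seen in the commonly used simplified low-demand approximations. The sketch below compares the voting schemes for a hypothetical failure rate and test interval, ignoring common cause, diagnostics and repair times.

LAMBDA_DU = 5.0e-7        # dangerous undetected failures per hour (hypothetical)
TI = 8760.0               # proof-test interval: one year, in hours
x = LAMBDA_DU * TI        # unavailability accumulated over one test interval

pfd_avg = {
    "1oo1": x / 2.0,      # single channel
    "1oo2": x * x / 3.0,  # best PFD, but roughly doubles the false trip rate
    "2oo2": x,            # fewer false trips, but about twice the 1oo1 PFD
    "2oo3": x * x,        # ~3x the 1oo2 PFD, while tolerating one false channel
}
for arch, value in pfd_avg.items():
    print(f"{arch}: PFDavg ~= {value:.1e}")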
Architecture Examples
1. SIL 1

Figure 1 – SIF – SIL 1

2. SIL 2
Figure 2 – SIF – SIL 2

3. SIL 3
Figure 3 – SIF – SIL 3

Application Example

Figure 5 - Process with basic control


Figure 5 shows a simple process where a fluid is added in a continuous and automatic
way to a process vessel. If the control system fails in the direction of a very high
pressure condition, a safety relief occurs, producing an undesirable smell outside the
plant. An acceptable risk rate for such an event is 0.01/year or less (once every one
hundred years, or a 1-in-100 chance per year). Let us specify a Safety Instrumented
System (SIS) that reaches such safety requirements.
In order to define the safety integrity requirements, the demand rate on the SIS should
be estimated. In this example, the SIS demand rate is the dangerous failure rate of the
control loop.
The overall failure rate for the control loop may be estimated from the failure rates of its
components, which, in this example, we will assume to be:

Component               Failures/year
Pressure transmitter    0.6
Controller              0.3
I/P converter           0.5
Control valve           0.2
Overall                 1.6

The control loop in this example may fail in either direction, and both directions are
assumed to be equally probable. Since the active control loop is under operator
supervision, it is assumed that only 1 failure in 4 would be sudden enough to create a
demand for a shutdown without prior intervention by the operator. That gives an overall
result of (1 in 2) x (1 in 4), or 1/8 of the overall failure rate, to be used as the demand
rate for a trip. Different assumptions may be made based on specific knowledge of the
equipment and conditions.
Therefore, the demand rate = 1.6/8 = 0.2/year
The acceptable unavailability = acceptable event rate / demand rate = 0.01 / 0.2 = 0.05

The required availability = 1 - 0.05 = 0.95


A SIS with a simple and direct connection is proposed to cut the supply when the
system pressure reaches 80% of the safety valve set point.
Compliance should be evaluated by estimating the loop unavailability. The failure rates
below are given as an example; actual values should be obtained from each
manufacturer:

The loop is designed to fail in the safe direction; therefore, it is assumed that only 1 out
of 3 failures would be in the unsafe direction. None of these failures of the passive
system would be diagnosed.
Therefore, the undiagnosed dangerous failure rate = 0.6/3 = 0.2/year
With an annual test frequency,

FDT = ½ f T = ½ x 0.2/year x 1 year = 0.1

That gives an availability of 0.9, which still does not comply with the safety
requirement. However, the availability may be increased with a higher test frequency.
With monthly tests we have

FDT = ½ x 0.2/year x (1/12) year = 0.0083

reaching an availability > 0.99. The required test frequency should be specified as part
of the design documents.
the project documents.
According to Table 1, a SIL 1 system with frequent tests should provide an availability of
0.99, complying with the 95% availability required.
Table 1 - Architecture according to SIL level - IEC 61508
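For reference, the sketch below reproduces the worked example with the same numbers used in the text (the component failure rates, the 1-in-2 and 1-in-4 assumptions, and the annual versus monthly test intervals); it is only a restatement of the calculation above, not an additional method.

control_loop_failures = {        # failures/year, from the table above
    "pressure transmitter": 0.6,
    "controller": 0.3,
    "I/P converter": 0.5,
    "control valve": 0.2,
}
overall = sum(control_loop_failures.values())               # 1.6 /year

# 1 failure in 2 is in the dangerous direction, and only 1 in 4 of those is
# too sudden for the operator to intervene -> 1/8 of the overall rate.
demand_rate = overall / 8.0                                  # 0.2 /year

acceptable_event_rate = 0.01                                 # /year
acceptable_unavailability = acceptable_event_rate / demand_rate   # 0.05
required_availability = 1.0 - acceptable_unavailability           # 0.95

trip_loop_failure_rate = 0.6                                 # /year (example value)
lambda_du = trip_loop_failure_rate / 3.0                     # 1 in 3 failures is unsafe

for label, test_interval_years in (("annual tests", 1.0), ("monthly tests", 1.0 / 12.0)):
    fdt = 0.5 * lambda_du * test_interval_years              # fractional dead time
    availability = 1.0 - fdt
    verdict = "meets" if availability >= required_availability else "does not meet"
    print(f"{label}: FDT = {fdt:.4f}, availability = {availability:.4f} "
          f"({verdict} the {required_availability:.0%} target)")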

Some details
There is a very common misconception that products or components are by themselves
classified as SIL. Products and components are suitable for use at a given SIL level,
but they do not have a SIL individually. SIL levels apply to the safety instrumented
functions (SIFs). The equipment or system should be used to support the risk reduction
design. A piece of equipment certified for use in SIL 2 or SIL 3 applications does not
necessarily ensure that the system will meet SIL 2 or 3; all the SIF components should
be analyzed.
An important design parameter calculated during SIL verification is the MTTFsp, the
mean time to failure due to disturbances or spurious (false) trips. This variable indicates
how often the SIS can be expected to produce a false trip, taking the process to a
shutdown condition unnecessarily. Table 2 below shows estimates of the cost of false
trips in different process industries:
Table 2 - Costs with False Trips.

Conclusion
In practical terms, the aim is to reduce failures and, consequently, shutdowns and
operational risks. The purpose is to increase operational availability and, in terms of the
process, to minimize variability, with a direct impact on profitability.

References
 IEC 61508 – Functional safety of electrical/electronic/programmable electronic
safety-related systems.
 IEC 61511-1, clause 11, "Functional safety - Safety instrumented systems for the
process industry sector - Part 1: Framework, definitions, system, hardware and
software requirements", 2003-01
 ESTEVES, Marcello; RODRIGUEZ, João Aurélio V.; MACIEL, Marcos. Sistema de
intertravamento de segurança, 2003.
 http://www.exida.com/images/uploads/CCPS_LA_2010_SIS_EsparzaHochleitner.pdf
 William M. Goble, Harry Cheddie, "Safety Instrumented Systems Verification:
Practical Probabilistic Calculation"
 Sistemas Instrumentados de Segurança - César Cassiolato
 “Confiabilidade nos Sistemas de Medições e Sistemas Instrumentados de
Segurança” - César Cassiolato
 Manual LD400-SIS
 Sistemas Instrumentados de Segurança – Uma visão prática – Parte 1, César
Cassiolato
 Internet research
