Article info

Article history: Received 7 May 2012; received in revised form 14 October 2012; accepted 26 December 2012; available online 11 February 2013.

Keywords: Alarm systems; Alarm management; Western electric rules; Fault detection; Markov processes; Delay-timers

Abstract

In process plants, alarms are configured to notify operators of any abnormalities or faults. However, in practice a majority of raised alarms are false or nuisance and create problems for operators as they face an increasing number of alarms to handle. Adding delay-timers is a simple technique that can reduce this problem and is widely exercised in industry. In this work we propose a generalized delay-timer framework where instead of consecutive n samples in the conventional case, n1 out of n consecutive samples (n1 ≤ n) are considered to raise an alarm. For the generalized delay-timer, three important performance indices, namely, the false alarm rate (FAR), the missed alarm rate (MAR) and the expected detection delay (EDD), are calculated using Markov processes. Moreover, the performance and sensitivity of generalized delay-timers are compared with conventional delay-timers.

© 2013 Elsevier Ltd. All rights reserved.
N.A. Adnan et al. / Journal of Process Control 23 (2013) 382–395
http://dx.doi.org/10.1016/j.jprocont.2012.12.013
0959-1524/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
∗ Corresponding author. E-mail addresses: naseeb@ualberta.ca (N.A. Adnan), cheng5@ualberta.ca (Y. Cheng), iman.izadi@ec.iut.ac.ir (I. Izadi), tchen@ualberta.ca (T. Chen).

1. Introduction

The complexity and degree of automation of technical processes are continuously increasing with growing demands for plant reliability, higher efficiency, better performance and product quality. Due to significant advancement in communications and computer technologies, nowadays a large number of process variables are monitored during plant operations. A downside of this advancement, however, is the large increase in the number of configured alarms. An alarm is an announcement to the operator (in visual or auditory form) indicating an equipment malfunction, process deviation, or any other abnormality in the plant; and an alarm system is the system to generate and process alarms and to present them to the operator.

Too many configured alarms make the alarm system complex and operators often face difficulties during abnormal events. For example, the Three-Mile Island nuclear power plant was designed with the ability to provide alarms for every small problem [1]; but during the accident, operators were flooded with information, much of it irrelevant and misleading. Therefore an efficient alarm system is of paramount importance. Alarms indicating faults should be unmistakable and for each cause there should be no more than one alarm [2,3]. According to the ISA 18.2 and EEMUA 191 standards, an operator should not receive more than six alarms per hour during normal operation of the plant [4,5]. But this is hardly the case in practice and operators normally receive far more alarms than these standards suggest [3]. A single fault may give rise to a complex alarm situation due to consequences of the fault triggering other alarms [6]. There is no single solution to improve alarm systems; however, a few improving techniques can be applied in different areas, e.g., multivariate process monitoring, state-based priority setting, model-based process monitoring, threshold design and data processing [2,3,7]. We concentrate on the latter and focus in this work on a specific technique, namely the delay-timer.

Fault detection and isolation (FDI) forms a very active area of research, mainly on the basis of mathematical or knowledge based models. Several model-based FDI methods were developed in the last decades [8,9]. The main focus in the model-based approach is to obtain a proper mathematical model, which is relatively difficult to do for complex plants. An alternative to model-based methods of fault detection is the signal-based methods. In signal-based fault detection, process variables are directly monitored and compared with some thresholds (or alarm limits) [8,9]. Signal-based methods of process monitoring are the most commonly used techniques in industry and are readily implementable on almost all modern distributed control systems (DCS).

Generally, alarm systems can be designed in two frameworks: univariate and multivariate. In univariate design, which is the most commonly practiced method of alarming, the alarm threshold and processing technique (deadband, delay-timer, filter, etc.) are individually designed for each process variable. In multivariate design, alarms are designed for some latent variables which are typically linear combinations of several process variables. In this work, we concentrate on signal-based univariate alarm design.

Signal-based methods based on simple limit checking are comparatively simple and easy to implement. However, incorrect settings of alarm limits or thresholds may result in problems of false, missed and nuisance alarms [2,3,10]. An alarm that is raised in the fault-free operation of a plant is known as a false alarm. An alarm that is true but redundant because the operator has already been notified of the fault by another alarm is a nuisance alarm. On the other hand, the situation where an alarm is not raised in the presence of a fault is known as a missed alarm. Another associated problem in the alarm system is delay in raising the alarm. When a fault occurs, alarms may not be raised instantly because of numerous delays in the system, including delays caused by the alarm configuration (deadband, delay-timer, etc.). The difference in time between the actual moment of fault occurrence and the moment an alarm is activated is defined as the detection delay.

It is widely accepted that proper implementation of alarm configurations (e.g., delay-timer, deadband, and filtering) can improve the performance of alarm systems. However, very limited research has been conducted so far to calculate performance specifications for these commonly used alarm configuration techniques. The exact quantitative relationship is not well known yet. This work aims at obtaining quantitative expressions that relate alarm configuration techniques to performance measures: false alarm rates (FAR), missed alarm rates (MAR) and expected detection delays (EDD). Later such relationships can be used to design alarm systems. The mentioned performance indices are calculated for one of the alarm attributes, the delay-timer, described in the ISA 18.2 [4] or EEMUA 191 [5] standards for basic alarm design. The knowledge of FAR, MAR and EDD is important to include preventive measures in the system design.

To reduce the number of unnecessary redundant alarms, a conventional approach is to review the top ten worst alarms. In this method, from a generated list, alarms are reviewed one by one starting from the most frequent one and the root cause is identified [11]. Many of these generated alarms can simply be removed by properly adjusting alarm thresholds [2]. Alarm thresholds are normally set considering a certain confidence range, e.g., μ ± 3σ (where μ is the mean and σ is the standard deviation of the process variable) [10]. In [12], a method of alarm threshold design is described; adding a margin to the upper and lower boundaries of the normal process fluctuation range is proposed in this method. But it does not discuss how to justify the optimality of selected margins. If a margin is set too high, it poses the risk of high missed alarms; and a too tight margin may lead to higher false alarms. In [13], considering the regulatory standard of EEMUA 191 [5], alarm systems management and optimization are discussed. But very little is written about the trade-offs among false alarm rates, missed alarm rates and detection delays except [3,10,14,15]. In [10], trade-offs between the detection delay and the false alarm, and between the false alarm rate and the missed alarm rate are discussed. In [3], a threshold design process is discussed based on the ROC curves balancing the false alarm rate and the missed alarm rate for the filter, time delay and deadband. In [15,16], detection delays caused by the simple limit checking, deadbands and delay-timers are calculated; also a design method is proposed to balance the FAR, MAR and EDD.

A large increase in the number of monitored variables in recent days has subsequently increased the number of alarms to be processed by an operator. These monitored variables are not all independent. There have been some studies on alarm systems based on correlation analysis [11,14,17,18]. In [14], a method of suggesting alarm limits is discussed considering correlations between the process data and the alarm data. The idea is to study the similarity of correlation maps of physical process variables and their alarm history through causal maps. Then alarm limits should be set to reduce discrepancy of correlations. This method requires the process connectivity information, and in the presence of time lag, it cannot provide any solution. Refs. [11,17] discussed an alarm reduction method based on finding the event correlations from statistical similarities of alarms. In this method, alarms were divided into a limited number of groups considering several factors, e.g., sequentiality in alarm generation. From these groups unnecessary alarms were identified. The idea of event correlation analysis was first proposed in [18] to improve the performance of independent protection layers (IPL) 2 and 3, and its effectiveness was evaluated in a chemical plant. Another important aspect of alarm systems is the human factors or operators' actions due to their direct and indirect participation in handling of abnormal situations. Under a set of assumed malfunctions, an operator model to be used as a virtual subject for evaluating an alarm system is discussed in [19]. Correlation of alarms and virtual subject modeling will be addressed in future work.

In this work, the focus of study is signal processing based univariate alarm systems. For the proposed generalized delay-timers, alarm systems are modeled using Markov chains [3,15,20]. Also false alarm rates (FAR), missed alarm rates (MAR) and expected detection delays (EDD) for such systems are calculated. Rationalization of alarm systems is a key concern in process industries. Nowadays use of DCS has made it possible to store large amounts of data in a historian, which can be used as a great resource for effective alarm rationalization. Normally process engineers are well aware of the normal behavior of the plant and are expected to be able to discriminate between fault-free and faulty process data. The calculated equations can be applied to historical process data to get an accurate measure of performance specifications, namely, FAR, MAR and EDD. Later, by changing different tuning parameters (namely, delay-timer settings and alarm limits), desired alarm settings can be obtained with the help of these equations. This work is a continuation of our previously published work [15,16]. Previously the performance indices were calculated for conventional delay-timers and deadbands. In this work, a generalized framework of the delay-timer is proposed; moreover, analytical expressions of performance indices for special cases and an algorithm with a flowchart for the general case are given. The conventional delay-timer framework can be treated as a special case. However, the generalized setup certainly has the convenience of a large number of combination provisions that the conventional setup cannot provide. In the analysis section some preliminary discussions on advantages and disadvantages of the generalized and conventional methods with respect to accuracy, latency and sensitivity are given.

The rest of the paper is organized as follows: performance indices of alarm systems are discussed in Section 2. Markov chains are discussed in Section 3. Section 4 discusses the conventional delay-timers. The generalized delay-timer is described in Section 5. Section 6 is devoted to the algorithm to formulate the Markov chain for the generalized case. Comparison of performance and sensitivity analysis are provided in Section 7. Finally, some concluding remarks are given in Section 8.

2. Performance measurement of alarm systems

In alarm systems, raising of an alarm can be discussed from the perspective of a two-class classification problem, where each instance is mapped to a class. If P and N are the numbers of positive and negative instances with outcomes p (positive) and n (negative) respectively, then there are four possible outcomes from the given classifier problem. In such a case, the outcomes can be described by a 2 × 2 confusion matrix (also known as a contingency table) (Fig. 1). If both prediction and actual outcomes are true (p), then it is called a true positive. But if the actual one is false (n), then it is a
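[Fig. 1: confusion matrix. Columns: actual classes p and n (column totals P and N). The surviving row is the predicted class n′, with cells False Negative and True Negative and row total N′.]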
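As a small illustration of the two-class view, the following sketch counts false and missed alarms for a simple high-alarm threshold applied to labelled samples. The function name `alarm_rates`, the sample values and the threshold are hypothetical. FAR is taken here as false positives divided by the number of fault-free samples (N) and MAR as false negatives divided by the number of faulty samples (P), which is an assumption in line with the totals of the confusion matrix.

```python
def alarm_rates(values, faulty, threshold):
    """Return (FAR, MAR) for a high alarm raised whenever a value exceeds threshold.

    values:  process measurements
    faulty:  True where the sample was taken under a fault, False otherwise
    """
    # False positive: alarm raised although the process is fault-free.
    false_pos = sum(1 for x, f in zip(values, faulty) if x > threshold and not f)
    # False negative: no alarm although the process is faulty.
    false_neg = sum(1 for x, f in zip(values, faulty) if x <= threshold and f)
    n_negative = sum(1 for f in faulty if not f)   # N: fault-free samples
    n_positive = len(faulty) - n_negative          # P: faulty samples
    return false_pos / n_negative, false_neg / n_positive


if __name__ == "__main__":
    # Hypothetical data: four fault-free samples followed by four faulty ones.
    values = [0.1, 0.5, 1.2, 0.3, 1.5, 0.8, 2.0, 1.1]
    faulty = [False, False, False, False, True, True, True, True]
    far, mar = alarm_rates(values, faulty, threshold=1.0)
    print(f"FAR = {far:.2f}, MAR = {mar:.2f}")
```

Raising the threshold in this sketch trades false alarms for missed alarms, which is the balance the performance indices are meant to quantify.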
[Fig. 4. Combined on and off-delay for n = m = 3. States: NA, NA1, NA2, A, A1, A2; transition probabilities p1, 1 − p1, p2 and 1 − p2.]

consider the case of high alarm for simplicity). Therefore the probability of one sample falling below the threshold is 1 − p1 (Fig. 2). In cases where deadbands are applied with or without delay-timers, p1 + p2 ≠ 1 [16].

Assume initially the process is in the fault-free region of operation and no alarm is raised, which is the state NA in the Markov process. It will require 3 consecutive samples with probability p1 to move the Markov process to state A and raise an alarm. Any intermediate sample with probability 1 − p1 will bring the chain back to the original no alarm (NA) state. In this case, the sample temporarily exceeding the threshold is treated as an outlier sample. A similar process is followed in off-delay for clearing a raised alarm. Similar to the fault-free region, there will be another Markov process in the faulty region with probabilities q2 and q1 (see Fig. 2 for definitions).
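The chain described above can be checked numerically. The sketch below, in plain Python, builds the transition matrix for an n-sample on-delay and m-sample off-delay (state order NA, NA1, ..., NAn−1, A, A1, ..., Am−1), iterates it to the invariant vector and sums the alarm-state probabilities. The function names and the values of p1 and p2 are illustrative assumptions, power iteration is used only for brevity, and the closed form used for comparison is a reading of Eq. (4).

```python
def conventional_chain(p1, p2, n, m):
    """Transition matrix of the n-sample on-delay / m-sample off-delay chain.

    p1: probability that a sample moves the chain toward raising the alarm
    p2: probability that a sample moves the chain toward clearing the alarm
    """
    size = n + m
    P = [[0.0] * size for _ in range(size)]
    for i in range(n):                        # no-alarm states NA, NA1, ...
        P[i][0] += 1.0 - p1                   # an outlier-free sample resets to NA
        P[i][i + 1] += p1                     # NA(n-1) moves to A (index n)
    for j in range(m):                        # alarm states A, A1, ...
        row = n + j
        P[row][n] += 1.0 - p2                 # a sample above the limit resets to A
        target = n + j + 1 if j < m - 1 else 0
        P[row][target] += p2                  # A(m-1) moves back to NA
    return P


def invariant_vector(P, iterations=5000):
    """Approximate the steady-state distribution by repeated multiplication."""
    size = len(P)
    pi = [1.0 / size] * size
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(size)) for j in range(size)]
    return pi


def far_from_chain(p1, p2, n, m):
    """Steady-state probability of being in any alarm state."""
    pi = invariant_vector(conventional_chain(p1, p2, n, m))
    return sum(pi[n:])


def far_closed_form(p1, p2, n, m):
    """Closed form as read from Eq. (4)."""
    num = p1 ** n * sum(p2 ** i for i in range(m))
    den = num + p2 ** m * sum(p1 ** j for j in range(n))
    return num / den


if __name__ == "__main__":
    p1, p2, n, m = 0.2, 0.8, 3, 3             # hypothetical settings
    print(far_from_chain(p1, p2, n, m), far_closed_form(p1, p2, n, m))
```

Replacing p1, p2 by q2, q1 in the same functions gives the alarm probability in the faulty region, so one minus that value is the corresponding missed alarm probability.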
\[
P_{n21} = \begin{pmatrix}
0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0 \\
p_2 & 0 & \cdots & 0
\end{pmatrix}, \qquad
P_{n22} = \begin{pmatrix}
1-p_2 & p_2 & 0 & \cdots & 0 \\
1-p_2 & 0 & p_2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1-p_2 & 0 & 0 & \cdots & p_2 \\
1-p_2 & 0 & 0 & \cdots & 0
\end{pmatrix},
\]

Here, the horizontal line in Pn indicates that the numbers of columns in Pn11 and Pn21, or in 0n×(m−1) and Pn22, are not equal. Pn is of dimension (n + m) × (n + m). After a transient time in the fault-free region, the Markov chain reaches its steady state and converges to the invariant vector. For the system operating in the fault-free region, the probability of false alarm is the summation of all alarm state probabilities in the steady state and can be calculated according to the theory of Markov chains [23]. It can be shown that the probability of false alarm (the steady-state probability of all alarm states) is

\[
P(\text{False Alarm}) = P(A + A_1 + \cdots + A_{m-1}) = \frac{p_1^{n} \sum_{i=0}^{m-1} p_2^{i}}{p_1^{n} \sum_{i=0}^{m-1} p_2^{i} + p_2^{m} \sum_{j=0}^{n-1} p_1^{j}} \tag{4}
\]

4.2. Missed alarm rate

Missed alarms (also known as type II error or false negative) occur when the system is in the faulty state of operation but an alarm is not raised. From the process data in Fig. 2, it can be seen that the missed alarm rate can be decreased by lowering the threshold. However, it will increase the false alarm rate. Therefore, it leads to the problem of balancing between the false alarm rate and the missed alarm rate [3,16]. Calculation of missed alarm rates for different cases of delay-timers is similar to that of false alarm rates. To calculate the missed alarm rate, the probability density function of the faulty data should be used (Fig. 2, likelihood function L1). Here q2 is the probability of one sample above the threshold and q1 (or 1 − q2) is the probability of one sample below the threshold in the abnormal region. From Fig. 2, the probability of missed alarm for a threshold t is

\[
P(\text{Missed Alarm}) = \int_{-\infty}^{t} L_1(x)\,dx \tag{5}
\]

When delay-timers are applied, the missed alarm rate is the sum of all probabilities of no alarm states in the Markov chain under the faulty region of operation. For calculation of MAR, the probability distribution of faulty data and the transition probability matrix, Pf, should be used, where Pf can be obtained by replacing probabilities p1, p2 by q2, q1, respectively, in equation (3). For the n-sample on-delay and m-sample off-delay, the probability of missed alarm is

\[
P(\text{Missed Alarm}) = P(NA + NA_1 + \cdots + NA_{n-1}) = \frac{q_1^{m} \sum_{i=0}^{n-1} q_2^{i}}{q_1^{m} \sum_{i=0}^{n-1} q_2^{i} + q_2^{n} \sum_{j=0}^{m-1} q_1^{j}} \tag{6}
\]

[Fig. 5. Generalized on-delay with n1 = 3, n = 4. States: NA, NA1, NA2, NA3, NA4, NA5, A; transition probabilities p1 and 1 − p1, with p2 and 1 − p2 at the alarm state.]

The expected value of the detection delay (EDD or averaged detection delay) indicates swiftness of the system to raise an alarm. EDD is the average time it takes to raise an alarm when the system enters into abnormal states of operation. EDD is defined as EDD = E(DD).

A smaller EDD is more desirable as it allows more time for an operator to address an alarm. In [15,16], it has been discussed how changes in the limit and settings of delay-timers affect the detection delay. A Markov process is used to model the alarm system. Assume the system is in the normal state of operation. Once a fault occurs, the time required to move from no alarm (NAi) states to alarm (Aj) states is the detection delay. To estimate the EDD, the concept of hitting time is used. If the Markov model is divided into two sub-spaces, D for all alarm states and E for all no alarm states, the hitting time is the time required to move to D from E assuming the system initially started in sub-space E [23]. The expected delay is given by [15]

\[
\mathrm{EDD} = \frac{p_2^{m-1} \left[ p_1^{n} q_1 \sum_{i=0}^{n-1} q_2^{i} + p_2 \left( \sum_{j=0}^{n-1} p_1^{j} \sum_{k=0}^{n-j-1} q_2^{k} - q_2^{n} \sum_{i=0}^{n-1} p_1^{i} \right) \right]}{q_2^{n} \left( p_2^{m} \sum_{i=0}^{n-1} p_1^{i} + p_1^{n} \sum_{i=0}^{m-1} p_2^{i} \right)} \tag{7}
\]

Eq. (7) can be further modified for different values of m and n to get special cases of simple threshold checking (no delay-timer, m = n = 1), only on-delay (m = 1), and only off-delay (n = 1).

5. Generalized delay-timer

Sometimes conditions to raise an alarm in the conventional delay-timer are difficult to satisfy. For example, consider a system with a 10-sample on-delay. When the system enters from the normal state to the faulty state of operation, 10 consecutive samples must cross the threshold to raise the alarm. It is very likely to have some delay in raising the alarm during this transition, as the first 10 samples may not satisfy the strict requirement. However, if the condition is relaxed to, for instance, 8 out of 10 samples crossing the threshold, intuitively the probability of early detection is higher. Accordingly we propose the following two definitions:
[Fig. 6. (a) 2 out of n generalized on-delay; (b) 2 out of m generalized off-delay. States in (a): NA, NA1, NA2, ..., NAn−1, A. States in (b): NA, A, A1, A2, ..., Am−1.]
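A Monte Carlo check of the 2 out of n on-delay can be sketched as follows. The simulation walks through the states of Fig. 6(a) as they are understood here: from NA an exceedance moves to NA1, a second exceedance within the next n − 1 samples raises the alarm, and otherwise the chain returns to NA. No off-delay is used, so the alarm clears on the first sample below the threshold. This state logic, the function names, the seed and the parameter values are assumptions for illustration, and the closed form is a reading of Eq. (10) with p2 = 1 − p1.

```python
import random


def simulate_far_2_of_n(p1, n, steps, seed=0):
    """Fraction of samples spent in alarm for a 2 out of n on-delay (no off-delay)."""
    rng = random.Random(seed)
    state = 0                  # 0 = NA, k = NAk for 1 <= k <= n-1, n = A
    alarm_samples = 0
    for _ in range(steps):
        above = rng.random() < p1          # sample exceeds the threshold
        if state == n:                     # alarm active
            state = n if above else 0
        elif state == 0:                   # no exceedance in the window yet
            state = 1 if above else 0
        elif above:                        # second exceedance inside the window
            state = n
        else:                              # window keeps sliding, or expires
            state = state + 1 if state < n - 1 else 0
        if state == n:
            alarm_samples += 1
    return alarm_samples / steps


def far_2_of_n_closed_form(p1, n):
    """Closed form as read from Eq. (10), assuming p2 = 1 - p1."""
    p2 = 1.0 - p1
    num = p1 * (1.0 - p2 ** (n - 1))
    return num / (num + p2 * (2.0 - p2 ** (n - 1)))


if __name__ == "__main__":
    p1, n = 0.3, 4                         # hypothetical settings
    print(simulate_far_2_of_n(p1, n, steps=200_000), far_2_of_n_closed_form(p1, n))
```

For n = 2 the rule reduces to the conventional 2-sample on-delay, which gives a quick sanity check of both functions.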
chain reaches state A. Since the number of paths is increased in the generalized on-delay compared to the conventional on-delay, the number of intermediate states has also been increased.

It is easy to see that, for n1 out of n on-delay, there can be a maximum of (2^(n−1) − 1) intermediate states. However, some of these states are unreachable and not shown in Fig. 5. Due to the possibility of several paths, generalized delay-timer calculation is more complex than the conventional case. Therefore with larger n, computational complexity increases significantly and it is difficult to provide a general representation of the Markov diagram for all combinations. Below some special cases of generalized delay-timers (2 out of n samples on-delay and 2 out of m samples off-delay cases) are discussed. In Section 6, a numerical algorithm to calculate the FAR, MAR and EDD in the general case is given.

\[
\pi = \begin{bmatrix} P(NA) & P(NA_1) & P(NA_2) & \cdots & P(NA_{n-1}) & P(A) \end{bmatrix}
= \frac{1}{p_1 (1 - p_2^{n-1}) + p_2 (2 - p_2^{n-1})} \times
\begin{bmatrix} p_2 & p_1 p_2 & p_1 p_2^{2} & \cdots & p_1 p_2^{n-1} & p_1 (1 - p_2^{n-1}) \end{bmatrix} \tag{9}
\]

It can be shown that the probability of false alarm in the steady state is

\[
P(\text{False Alarm}) = P(A) = \frac{p_1 (1 - p_2^{n-1})}{p_1 (1 - p_2^{n-1}) + p_2 (2 - p_2^{n-1})} \tag{10}
\]

5.1.2. 2 out of m samples off-delay

Similar to the generalized on-delay, generalized off-delay will require m1 out of m samples below the threshold to clear an alarm. As a special case, the Markov process for 2 out of m samples is considered (Fig. 6(b)). In the fault-free region of operation, any 2 samples out of m will clear an already raised alarm. The transitional probability matrix is

\[
P_n = \begin{pmatrix}
1-p_1 & p_1 & 0 & 0 & \cdots & \cdots & 0 \\
0 & 1-p_2 & p_2 & 0 & \cdots & \cdots & 0 \\
p_2 & 0 & 0 & 1-p_2 & \cdots & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & & \vdots \\
\vdots & \vdots & \vdots & \vdots & & \ddots & \vdots \\
p_2 & 0 & 0 & \cdots & \cdots & \cdots & 1-p_2 \\
p_2 & 1-p_2 & 0 & \cdots & \cdots & \cdots & 0
\end{pmatrix}, \tag{11}
\]

The steady-state invariant vector is

\[
\pi = \begin{bmatrix} P(NA) & P(A) & P(A_1) & \cdots & P(A_{m-2}) & P(A_{m-1}) \end{bmatrix}
= \frac{1}{p_1 (2 - p_1^{m-1}) + p_2 (1 - p_1^{m-1})} \times
\begin{bmatrix} p_2 (1 - p_1^{m-1}) & p_1 & p_1 p_2 & p_1^{2} p_2 & \cdots & p_1^{m-1} p_2 \end{bmatrix} \tag{12}
\]

[Fig. 7. (a) Verification of FAR formula by Monte-Carlo simulations; (b) verification of EDD formula by Monte-Carlo simulations. Each part has three panels, with (ii) and (iii) labelled in the extracted text; x-axis: threshold; curves: Monte Carlo simulation and FAR equation in (a), Monte Carlo simulation and EDD equation in (b).]

from 0 to 2 with 0.1 increment. In Fig. 7(a), three different cases of only on-delay, only off-delay and both on- and off-delays applied together are shown. For each case, 2000 Monte Carlo simulations are performed and averaged to get a single estimate. It can be seen that the theoretical and simulated results are consistent.

5.2. Missed alarm rate

Calculation of the missed alarm rate is similar to that of the false alarm rate. But the probability density function of the abnormal data should be used. The transition probability matrix (Pn in Section 5.1) of the fault-free operation would be replaced by the matrix for the faulty operation (Pf), where probability entries p1, p2 would be replaced by q2, q1 respectively. The probability of missed alarm for 2 out of n-sample on-delay and 2 out of m-sample off-delay can be expressed as

\[
P(\text{Missed Alarm}) = \frac{q_1 (2 - q_1^{n-1})(1 - q_2^{m-1})}{q_2 (1 - q_1^{n-1})(2 - q_2^{m-1}) + q_1 (2 - q_1^{n-1})(1 - q_2^{m-1})} \tag{15}
\]

5.3. Detection delay
[Figure: example sample sequence with an "Alarm" marker above it. Sample number: 1, 2, 3, 4, 5, 6, 7, 8, 9, …; probability of each sample: p1, 1 − p1, p1, p1, 1 − p1, p1, 1 − p1, p1, 1 − p1, …]
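The sample sequence above can be used to compare a generalized and a conventional on-delay. In the sketch below a sample listed with probability p1 is taken to lie above the threshold (1) and a sample with probability 1 − p1 below it (0). The function name `first_alarm`, this encoding and the choice of a 3 out of 4 rule (borrowed from the setting of Fig. 5) are assumptions for illustration only.

```python
def first_alarm(samples, n1, n):
    """Return the 1-based sample number at which an n1 out of n on-delay first
    raises the alarm, or None if the alarm is never raised.

    samples: sequence of 1 (above threshold) and 0 (below threshold)
    """
    for k in range(len(samples)):
        window = samples[max(0, k - n + 1): k + 1]   # the latest n samples
        if sum(window) >= n1:
            return k + 1
    return None


if __name__ == "__main__":
    # p1, 1-p1, p1, p1, 1-p1, p1, 1-p1, p1, 1-p1 encoded as 1/0.
    sequence = [1, 0, 1, 1, 0, 1, 0, 1, 0]
    print("3 out of 4:", first_alarm(sequence, 3, 4))
    print("3 out of 3 (conventional):", first_alarm(sequence, 3, 3))
```

On this sequence the relaxed rule raises the alarm at the fourth sample, while three consecutive exceedances never occur, which is the kind of delayed detection the generalized framework is meant to avoid.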
One requirement we add on the alarm rule is that the Markov 6.3. Lumping phase
chain represented by the full size transition matrix should only have
one ergodic set which is regular, and all the transient sets include The goal of lumping is to group the states, i.e., make a partition,
only one state. This requirement is reasonable since the rule should so that all the states in the same group have the probability of p1 /
guarantee that the probability that the Markov chain moves to the (1 − p1 ) to move to states in a single group. The lumping phase can
ergodic set in finite steps is 1. further decrease the size of the Markov chain.
The Markov chain applied in our work is in a special case: each
6.2. Discarding phase
state has only two possible successor states, and there is a unique
probability p1 for all states. Because of this, the Markov chain is
In this phase transient states are discussed. Based on the require-
equivalent to a deterministic finite automaton (DFA) with 2l states
ment we mentioned above, the discarding procedure is straight
and 2 inputs that correspond to p1 and 1 − p1 .
forward:
The DFA minimization problem is a classical one in the field of
Step 1: Search for such columns ik in the transition matrix that all computer science that has been extensively studied. Since the goal
the elements in the column are zero. Assume that K = / 0 of lumping the states in our special Markov chain is exactly the
columns, {i1 , i2 , . . ., iK }, are found, initialize the discarding same as the purpose of DFA minimization, we can directly adopt
states list to {i1 , i2 , . . ., iK }. the algorithm for the lumping phase.
Step 2: For an index ik that appears in the discarding states list, The essence of the DFA algorithm is to first partition the states
set the two non-zero entries in the ik th row to zero, and according to their outputs (in our case alarm or no-alarm). Then the
check those two columns containing these two non-zero groups are repeatedly refined based on the successor states on a
entries. If any of them is all-zero column, add its index into given input for the states in the same group. If the successor states
the discarding states list. Repeat step 2 for the new added are in a single group, no further partition is required; otherwise
indices in the list until no more such state can be found. partition the group to subgroups whose successor states are in a
Step 3: Remove all the columns and rows in the discarding list from single group. When no further partition is needed, all states in the
the transition matrix. same group can be lumped to one state. By introducing the “process
N.A. Adnan et al. / Journal of Process Control 23 (2013) 382–395 391
the smaller half” strategy, reference [26] speeded up the algorithm rows 8, 9, 11, 13 to zero. Then one more zero column, column 10,
to an O(n log(n)) time bound, where n is the number of states of the is found. The discarded states list becomes {7, 8, 10, l2, 9,}. After
automaton. set row 10 to a zero row, no more zero column can be found. Rows
The reason why we need to reduce the size of the transition 8, 9, 10, 11, 13 and columns 8, 9, 10, 11, 13 are removed, and the
matrix is that as the length of the delay-timer increases the size transition matrix shrinks to the following:
of the transition matrix increases exponentially. Large transition ⎡ ⎤
1 − p1 p1 0 0 0 0 0 0 0 0 0
matrices cause great computational burden for invariant vectors,
⎢ ⎥
necessary for FAR, MAR and EDD calculation. We conjecture that ⎢ 0 0 1 − p1 p1 0 0 0 0 0 0 0 ⎥
the algorithm can reduce the size of the transition from 2l to ⎢ ⎥
⎢ 0 ⎥
Cnn −1 + Cmm ; where C represents combination. It is easy to find ⎢ 0 0 0 1 − p1 p1 0 0 0 0 0 ⎥
1 1 −1 ⎢ ⎥
out the upper bound of the ratio of reduced size to full size by using ⎢ 0 p1 ⎥
⎢ 0 0 0 0 0 1 − p1 0 0 0 ⎥
Sterling’s formula. The ratio converges to zero in the rate of O(l−0.5 ). ⎢ ⎥
⎢ 1 − p1 p1 0 0 0 0 0 0 0 0 0 ⎥
Although it still means a large computational burden, the total com- ⎢ ⎥
⎢ ⎥
putational complexity of the reduction algorithm and the eigenvec- ⎢ 0 0 1 − p1 0 0 0 0 0 0 0 p1 ⎥
tor computation of the reduced transition matrix is greatly reduced ⎢ ⎥
⎢ 0 ⎥
compared with directly computing the eigenvector of the full size ⎢ 0 0 0 1 − p1 0 0 0 0 0 p1 ⎥
⎢ ⎥
transition matrix. A flow chart of the algorithm is given in Fig. 9. ⎢ 0 p1 ⎥
⎢ 0 0 0 0 0 0 0 0 1 − p1 ⎥
Example: 3-out-of-4 on-delay timer, and 2-out-of-3 off-delay time ⎢ ⎥
Generating Phase: Since l = max(4, 3) = 4, the length of description ⎢ 1 − p1 0 0 0 0 0 0 p1 0 0 0 ⎥
⎢ ⎥
sequence should be 4. So there will be 24 = 16 states, which means ⎢ ⎥
⎢ 1 − p1 0 0 0 0 0 0 0 p1 0 0 ⎥
that the size of the initial transition matrix should be 16 × 16. The ⎢ ⎥
⎢ 0 ⎥
non-zero entries in the matrix are designed as follows: The first ⎣ 0 0 0 0 0 0 0 0 1 − p1 p1 ⎦
state, state 0, has a description sequence ‘0000’. The first ‘0’ means
no-alarm situation, the following three ‘0’ show that all the latest
three samples are under the threshold. The probability that next Lumping phase: After the discarding phase, only 11 states are
sample exceeds the threshold is p1 . Then the latest four samples left. Among them the first 7 states, states {0, 1, 2, 3, 4, 5, 6}, are
become ‘0001’. Because of the 3-out-of-4 on-delay timer, the no- no-alarm states, and states {7, 8, 9, 10} are alarm states. For the no-
alarm situation can be kept. So there is a probability of p1 that state alarm states group, the 1 − p1 successor states set is {0, 2, 4, 6}, and
0 moves to state 1, namely, ‘0001’ (the first ‘0’ means on-alarm, the the p1 successor states set is {1, 3, 5, 10}. The 1 − p1 successor states
following sequence ‘001’ describe the latest three samples). Simi- set is a subset of the no-alarm group. However, one element in p1
larly, there is a probability of 1 − p1 that the state is kept. So in the successor states set belongs to the alarm group while others belong
first row, the two non-zero entries are the first and second entries. to the no-alarm group. So the no-alarm group should be partitioned
The values are of them are 1 − p1 and p1 , respectively. according to its elements’ p1 successor states. As a result, the states
Operate on the following 15 states in the same way, finally the are partitioned into three groups: states {3, 5, 6}; states {0, 1, 2, 4};
initial transition matrix is obtained as follows: states {7, 8, 9, 10}. Repeat this procedure until no more partition
⎡1−p p1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
⎤
1
⎢ 0 0 1 − p1 p1 ⎥
⎢ 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
⎢ 0 0 0 0 1 − p1 p1 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
⎢ ⎥
⎢ 0 0 0 0 0 0 1 − p1 0 0 0 0 0 0 0 0 p1 ⎥
⎢ ⎥
⎢ ⎥
⎢ 1 − p1 p1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
⎢ 0 0 1 − p1 0 p1 ⎥
⎢ 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
⎢ 0 0 0 0 1 − p1 0 0 0 0 0 0 0 0 0 0 p1 ⎥
⎢ ⎥
⎢ ⎥
⎢ 0 0 0 0 0 0 1 − p1 0 0 0 0 0 0 0 0 p1 ⎥
⎢ ⎥
⎢ ⎥
⎢ 1 − p1 0 0 0 0 0 0 0 0 p1 0 0 0 0 0 0 ⎥
⎢ ⎥
⎢1−p 0 p1 0 ⎥
⎢ 1 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
⎢ 1 − p1 0 0 0 0 0 0 0 0 0 0 0 0 p1 0 0 ⎥
⎢ ⎥
⎢ ⎥
⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 − p1 p1 ⎥
⎢ ⎥
⎢ ⎥
⎢ 1 − p1 0 0 0 0 0 0 0 0 p1 0 0 0 0 0 0 ⎥
⎢ ⎥
⎢1−p 0 p1 0 ⎥
⎢ 1 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
⎢ 1 − p1 0 0 0 0 0 0 0 0 0 0 0 0 p1 0 0 ⎥
⎢ ⎥
⎢ ⎥
⎣ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 − p1 p1 ⎦
(a) 100
2 out of 2 (b) 20
2 out of 2
2 out of 3
2 out of 3
2 out of 4
2 out of 4
2 out of 5
2 out of 5
Probability of Missed Alarm (%)
50 10
0 0
0 50 100 0 1 2 3 3.5
Probability of False Alarm (%) Threshold
Fig. 10. (a) Changes in ROC curves for corresponding changes in the threshold for n = 2–5, n1 = 2; (b) corresponding changes in EDD curves.
EDD
50 10
0 0
0 50 100 −3 −2 −1 0 1 2 3
FAR Threshold
Fig. 11. (a) Changes in ROC curves for corresponding changes in the threshold; (b) Corresponding changes in EDD curves.
is needed. Finally, states {0, 4} are in one group and states {7, 10} are in another. All the other groups have a single state. To lump states {0, 4}, pick a representative state for this group, say state 0; add column 5 to column 1, and then remove column 5 and row 5. To lump states {7, 10}, pick state 10 as the representative state; add column 7 to column 10, and then remove column 7 and row 7. Eventually, the reduced-size transition matrix is obtained as follows:

\[
\begin{bmatrix}
1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 \\
1-p_1 & 0 & 0 & 0 & p_1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1-p_1 & 0 & 0 & p_1 \\
0 & 0 & 1-p_1 & 0 & 0 & 0 & 0 & 0 & p_1 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & p_1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-p_1 & p_1
\end{bmatrix}
\]

7. Comparative analysis

To discuss the advantages and disadvantages of the proposed generalized delay-timers and the conventional delay-timers, their performance is compared in this section in terms of three criteria, namely, accuracy (ROC curve with false and missed alarm rates), latency (detection delay with respect to change in settings) and sensitivity (change in performance with any change in design settings).

7.1. Accuracy and latency

The receiver operating characteristic (ROC) curve is the plot of two alternatives when the decision variable is changed [27]. In our context, the ROC curve is the plot of FAR vs MAR when the threshold changes. It is widely used in signal detection theory for graphical sensitivity analysis. For simplicity of analysis, only on-delay timers are considered in this section.
Generally, different costs are associated with false and missed alarm rates. Selection of the appropriate point on the ROC curve depends on these costs. Usually a weighted combination of false and missed alarm rates is used to include these associated costs in the design. For simplicity, if $(\mathrm{FAR})^2 + (\mathrm{MAR})^2$ is used as the
N.A. Adnan et al. / Journal of Process Control 23 (2013) 382–395 393
performance index, then the optimum point on the ROC curve will be the point closest to the origin, and the threshold corresponding to that point will be the optimum operating threshold. In this case, if the threshold is set higher than the optimum point, false alarms will decrease at the cost of increased missed alarms. On the other hand, setting the threshold lower than the optimum point will result in a smaller number of missed alarms but more false alarms.
A process variable is considered with Gaussian distribution where fault-free data has mean 0, variance 2 and faulty data has mean 2, variance 2. The data length is 1000 for both the fault-free and faulty sections; 1 s uniform sampling is used. The threshold is varied from the minimum of the process data to the maximum with an increment of 0.05. In Fig. 10(a), the corresponding ROC curves are obtained for different delay-timers (n1 = 2, n = 2–5).

Example 3. It is normally desirable to minimize both FAR and MAR, which account for the accuracy of the alarm design; better accuracy indicates lower false and missed alarm rates. It can be seen from Fig. 10(a) that with use of the conventional 2 out of 2 samples delay-timer, the ROC curve is closer to the origin. But when the generalized n1 out of n idea is used, the curves move away from the origin. This indicates that increasing n with fixed n1 decreases the accuracy of alarm systems. On the other hand, in addition to better accuracy, a lower detection delay (latency) is also a desired feature. Fig. 10(b) indicates that increasing n with fixed n1 results in lower detection delays. Here conventional delay-timers cause lower FAR and MAR compared to the generalized cases, which is expected; however, they also cause higher detection delays.

Example 4. In the previous example, only n was varied while keeping n1 fixed. Now a few cases of different n and n1 in close range are considered for comparison. Here generalized 3 out of 5, 4 out of 5 and conventional 5 samples on-delay cases are considered. Then the corresponding ROC and EDD curves in these three cases are compared with conventional 3 and 4 samples on-delay timers. Like the previous example, from Fig. 11 it can also be seen that a conventional delay-timer has higher accuracy compared to a generalized one. A 5 out of 5 on-delay is more accurate than the 4 out of 4 case. Even a 4 out of 4 on-delay yields a more accurate design than a 4 out of 5 on-delay. The downside is the higher delay in detection. In this case a conventional 5 samples on-delay has the highest and a generalized 3 out of 5 on-delay has the lowest detection delay.

7.1.1. Case study
From an oil-sand processing facility, an actual flow measurement is considered for the case study (Fig. 12). Data is sampled at a 1 s sampling period. Three different combinations of delay-timers are applied on the measured variable to compare their results of FAR, MAR and EDD. Performance of a conventional 5 samples on-delay is compared with 3 out of 5 and 4 out of 5 samples generalized setups. A comparison of FAR, MAR and EDD can be seen from Table 2. Here the mean of fault-free data is 32.8 and variance is 1.04. The chart is obtained following a 3-sigma limit of the Shewhart chart and setting thresholds at one sigma distance.
From Fig. 13(a) and (b), it can be seen that the conventional setup is more accurate if equal weight is given to false and missed alarms, as its curve is closest to the origin. However, the generalized configuration causes lower detection delay, as was also observed in previous simulation examples. This kind of chart can be followed to determine operating settings of the delay-timer. Depending on design requirements of FAR, MAR and EDD, operating thresholds and on-delay configuration can be fixed. For example, if the requirement is ≤5% MAR and ≤10 samples EDD irrespective of FAR, then any of the generalized 3 or 4 out of 5 samples and conventional 5 samples on-delay timers can be used with the threshold set at 32.8. But if the requirement is changed to ≤5% FAR and <5 samples EDD, then the conventional setup cannot satisfy the conditions. A 3 out of 5 samples generalized setup is required with the threshold set at 33.82. In practice, any change in alarm configuration setting requires significant process knowledge, and detailed safety analysis is a must before any change is made.

7.2. Sensitivity

During practical process operations, alarm design parameters are very likely to be adjusted for many reasons. Therefore, it is necessary to investigate the degree to which changes in alarm design parameters affect performance indices. Here, the parameter that changes most frequently is the alarm limit or threshold for a fixed delay-timer. We define sensitivity as the ratio of the infinitesimal change in the function (FAR, MAR or EDD) to the infinitesimal change in the alarm limit (t) for a fixed delay-timer:

\[
S_{F:t} = \lim_{\Delta t \to 0} \frac{\text{Fractional change in the function, } F}{\text{Fractional change in the threshold, } t} = \lim_{\Delta t \to 0} \frac{\Delta F / F}{\Delta t / t} = \frac{t}{F} \frac{\partial F}{\partial t}
\]

Here F can be any of the FAR, MAR, and EDD quantities derived in earlier sections. As a specific example, Eq. (10) for the false alarm rate of the 2 out of n on-delay is considered. Replacing p2 = 1 − p1 and denoting p1 = p, it can be shown that the sensitivity is

\[
S = t\, f(t)\, \frac{(1-p)^n \left(np^2 - p^2 - np - 2p + 3\right) - (1-p)^{2n} - 2(1-p)^2}{p \left[(1-p)^n - (1-p)\right] \left[p^2 - 3p - (1-p)^n + 2\right]}
\]

where f(t) represents the value of the likelihood function in Fig. 2.
Consider the sensitivity curves in Fig. 14(a), which are obtained for the same set of data as in Example 3 with the threshold changing from 0 to 2. It can be seen that for the 2 out of n samples on-delay, a higher value of n is less sensitive to the threshold change than a lower one. This indicates that the conventional delay-timer is more sensitive to changes in threshold than the generalized one.
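As a rough numerical illustration of the closed-form sensitivity of the 2 out of n on-delay, the sketch below evaluates that expression for the fault-free distribution of Example 3 (mean 0, variance 2). It assumes that p is the Gaussian probability of a sample exceeding the threshold t and that f(t) is the fault-free Gaussian density at t; the function names are hypothetical and are not taken from the paper.

```python
import math

def gaussian_pdf(t, mean=0.0, var=2.0):
    # Assumed likelihood f(t): fault-free Gaussian density evaluated at the threshold.
    return math.exp(-(t - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def exceed_prob(t, mean=0.0, var=2.0):
    # Assumed p: probability that a fault-free sample lies above the threshold.
    return 0.5 * math.erfc((t - mean) / math.sqrt(2.0 * var))

def sensitivity_2_out_of_n(t, n, mean=0.0, var=2.0):
    # Closed-form sensitivity of the FAR for a 2 out of n on-delay timer.
    p = exceed_prob(t, mean, var)
    q = 1.0 - p
    num = q ** n * (n * p ** 2 - p ** 2 - n * p - 2 * p + 3) - q ** (2 * n) - 2 * q ** 2
    den = p * (q ** n - q) * (p ** 2 - 3 * p - q ** n + 2)
    return t * gaussian_pdf(t, mean, var) * num / den

if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        values = [sensitivity_2_out_of_n(0.5 * k, n) for k in range(1, 5)]
        print(f"2 out of {n}:", " ".join(f"{v:.3f}" for v in values))
```

For n = 2 the expression collapses to 2 t f(t)/p, which gives a quick sanity check, and under these assumptions the printed values shrink as n grows, in line with the conventional 2 out of 2 timer being the most sensitive.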
Fig. 13. (a) ROC curves for case study; (b) EDD curves for case study changing thresholds. [Plot omitted; curves: 3/5, 4/5 and 5/5 on-delay; x-axes: (a) probability of false alarm (%), (b) threshold; y-axis of (b): detection delay.]
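The threshold sweep behind ROC curves of this kind can be sketched in a few lines. The example below uses the simulated setting of Example 3 (fault-free mean 0, faulty mean 2, variance 2, 1000 samples each, threshold step 0.05) and picks the threshold minimizing (FAR)^2 + (MAR)^2. It is a simplified sketch, not the Markov-chain computation of the paper: the alarm flag is assumed to be on whenever at least n1 of the last n samples exceed the threshold, FAR and MAR are taken as plain sample fractions, and all function names are hypothetical.

```python
import random

def on_delay_flags(samples, threshold, n1, n):
    # Assumed rule: alarm is on when at least n1 of the last n samples exceed the threshold.
    window, flags = [], []
    for x in samples:
        window.append(1 if x > threshold else 0)
        if len(window) > n:
            window.pop(0)
        flags.append(1 if sum(window) >= n1 else 0)
    return flags

def far_mar(normal, faulty, threshold, n1, n):
    # FAR: fraction of fault-free samples in alarm; MAR: fraction of faulty samples not in alarm.
    far = sum(on_delay_flags(normal, threshold, n1, n)) / len(normal)
    mar = 1.0 - sum(on_delay_flags(faulty, threshold, n1, n)) / len(faulty)
    return far, mar

def sweep(normal, faulty, n1, n, step=0.05):
    # Vary the threshold from the minimum to the maximum of the data.
    lo = min(min(normal), min(faulty))
    hi = max(max(normal), max(faulty))
    rows = []
    for k in range(int((hi - lo) / step) + 1):
        t = lo + k * step
        rows.append((t,) + far_mar(normal, faulty, t, n1, n))
    return rows

def best_threshold(rows):
    # ROC point closest to the origin, i.e. minimum of FAR^2 + MAR^2.
    return min(rows, key=lambda r: r[1] ** 2 + r[2] ** 2)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
sigma = 2.0 ** 0.5
normal = [rng.gauss(0.0, sigma) for _ in range(1000)]
faulty = [rng.gauss(2.0, sigma) for _ in range(1000)]

if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        t, far, mar = best_threshold(sweep(normal, faulty, 2, n))
        print(f"2 out of {n}: threshold={t:.2f}, FAR={far:.3f}, MAR={mar:.3f}")
```

Under this windowed rule, widening the window from 2 to 5 samples at a fixed threshold can only add alarm flags, so the FAR cannot decrease and the MAR cannot increase; the trade-off between them is what moves the ROC curve.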
Fig. 14. (a) Sensitivity curves for Example 3; (b) sensitivity curves for Example 4. [Plot omitted; curves: (a) 2/2, 2/3, 2/4, 2/5; (b) 3/5, 4/5, 3/3, 4/4, 5/5; axes: sensitivity vs. threshold.]
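Returning to the latency comparison of Section 7.1, the observation that a larger n with fixed n1 lowers the detection delay can be checked with a small Monte Carlo sketch. The code below is only an illustration under stated assumptions: the fault starts at sample 0 with an empty window, the delay is counted as the index of the first sample at which n1 of the last n samples exceed the threshold, and the faulty data follow the Gaussian of Example 3 (mean 2, variance 2). The function names are hypothetical, and the paper's own EDD is obtained from Markov chains rather than simulation.

```python
import random

def detection_delay(samples, threshold, n1, n):
    # Index of the first sample at which n1 of the last n samples exceed the threshold.
    window = []
    for k, x in enumerate(samples):
        window.append(1 if x > threshold else 0)
        if len(window) > n:
            window.pop(0)
        if sum(window) >= n1:
            return k
    return None  # no alarm raised within the record

def mean_detection_delay(threshold, n1, n, runs=500, length=200, seed=0):
    # Average delay over simulated faulty records; the same seed gives the same
    # records for every (n1, n), so configurations are compared on identical data.
    rng = random.Random(seed)
    sigma = 2.0 ** 0.5
    total = 0
    for _ in range(runs):
        faulty = [rng.gauss(2.0, sigma) for _ in range(length)]
        d = detection_delay(faulty, threshold, n1, n)
        total += length if d is None else d
    return total / runs

if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"2 out of {n}: mean delay = {mean_detection_delay(1.0, 2, n):.2f} samples")
```

Because every record in which two consecutive samples exceed the threshold also has two exceedances among its last five samples, the 2 out of 5 timer can never fire later than the 2 out of 2 timer on the same data.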
[9] S.X. Ding, Model-Based Fault Diagnosis Techniques: Design Schemes, Algorithms, and Tools, Springer-Verlag, Berlin, 2008.
[10] I. Izadi, S.L. Shah, D.S. Shook, T. Chen, An introduction to alarm analysis and design, in: 7th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, June 30–July 3, 2009, Barcelona, Spain, 2009, pp. 645–650.
[11] F. Higuchi, I. Yamamoto, T. Takai, M. Noda, H. Nishitani, Use of event correlation analysis to reduce number of alarms, in: 10th International Symposium on Process Systems Engineering, August 16–20, 2009, pp. 1521–1526.
[12] Y. Luo, X. Liu, M. Noda, H. Nishitani, Systematic design approach for plant alarm systems, Journal of Chemical Engineering of Japan 40 (9) (2007) 765–772.
[13] B.G. Liptak, Instrument Engineers’ Handbook: Process Control and Optimization, vol. 2, 4th ed., CRC Press, Florida, 2006, pp. 59–63.
[14] F. Yang, S.L. Shah, D. Xiao, Correlation analysis of alarm data and alarm limit design for industrial processes, in: American Control Conference (ACC 2010), June 30–July 2, 2010, Baltimore, USA, 2010, pp. 5850–5855.
[15] N.A. Adnan, I. Izadi, T. Chen, On expected detection delays for alarm systems with deadbands and delay-timers, Journal of Process Control 21 (9) (2011) 1318–1331.
[16] N.A. Adnan, I. Izadi, T. Chen, Computing detection delays in industrial alarm systems, in: American Control Conference (ACC 2011), June 29–July 1, 2011, San Francisco, CA, USA, 2011, pp. 786–791.
[17] F. Higuchi, I. Yamamoto, T. Takai, M. Noda, H. Nishitani, Use of event correlation analysis to rationalize plant alarm system, in: 5th International Symposium on Design, Operation and Control of Chemical Processes, July 25–28, 2010, Singapore, 2010, pp. 1–7.
[18] J. Nishiguchi, T. Takai, IPL2 and 3 performance improvement method for process safety using event correlation analysis, Computers & Chemical Engineering 34 (12) (2010) 2007–2013.
[19] X. Liu, M. Noda, H. Nishitani, Evaluation of plant alarm systems by behavior simulation using virtual subject, Computers & Chemical Engineering 34 (3) (2010) 374–386.
[20] E. Naghoosi, I. Izadi, T. Chen, Estimation of alarm chattering, Journal of Process Control 21 (9) (2011) 1243–1249.
[21] J.A. Swets, Measuring the accuracy of diagnostic systems, Science 240 (4857) (1988) 1285–1293.
[22] A. Papoulis, S.U. Pillai, Probability, Random Variables and Stochastic Processes, 4th ed., Tata McGraw-Hill Publishing Company Ltd., New Delhi, 2002.
[23] G.F. Lawler, Introduction to Stochastic Processes, 2nd ed., Chapman & Hall/CRC, Boca Raton, 2006.
[24] M. Bauer, J.W. Cox, M.H. Caveness, J.J. Downs, N.F. Thornhill, Finding the direction of disturbance propagation in a chemical process using transfer entropy, IEEE Transactions on Control Systems Technology 15 (1) (2007) 12–21.
[25] Statistical Quality Control Handbook, 1st ed., Western Electric Company, Indianapolis, IN, 1956.
[26] J.E. Hopcroft, An n log n algorithm for minimizing states in a finite automaton, Theory of Machines and Computations (1971) 189–196.
[27] B.C. Levy, Principles of Signal Detection and Parameter Estimation, Springer, New York, 2008.