
Journal of Process Control 23 (2013) 382–395

Contents lists available at SciVerse ScienceDirect

Journal of Process Control


journal homepage: www.elsevier.com/locate/jprocont

Study of generalized delay-timers in alarm configuration


Naseeb Ahmed Adnan a,∗, Yue Cheng a, Iman Izadi b, Tongwen Chen a
a Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, T6G 2V4, Canada
b Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, 84156-83111, Iran

Article info

Article history:
Received 7 May 2012
Received in revised form 14 October 2012
Accepted 26 December 2012
Available online 11 February 2013

Keywords:
Alarm systems
Alarm management
Western electric rules
Fault detection
Markov processes
Delay-timers

Abstract

In process plants, alarms are configured to notify operators of any abnormalities or faults. However, in practice a majority of raised alarms are false or nuisance and create problems for operators as they face an increasing number of alarms to handle. Adding delay-timers is a simple technique that can reduce this problem and is widely exercised in industry. In this work we propose a generalized delay-timer framework where instead of consecutive n samples in the conventional case, n1 out of n consecutive samples (n1 ≤ n) are considered to raise an alarm. For the generalized delay-timer, three important performance indices, namely, the false alarm rate (FAR), the missed alarm rate (MAR) and the expected detection delay (EDD), are calculated using Markov processes. Moreover, the performance and sensitivity of generalized delay-timers are compared with conventional delay-timers.
© 2013 Elsevier Ltd. All rights reserved.

∗ Corresponding author.
E-mail addresses: naseeb@ualberta.ca (N.A. Adnan), cheng5@ualberta.ca (Y. Cheng), iman.izadi@ec.iut.ac.ir (I. Izadi), tchen@ualberta.ca (T. Chen).
0959-1524/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.jprocont.2012.12.013

1. Introduction

The complexity and degree of automation of technical processes are continuously increasing with growing demands for plant reliability, higher efficiency, better performance and product quality. Owing to significant advances in communication and computer technologies, a large number of process variables are now monitored during plant operations. A downside of this advancement, however, is the large increase in the number of configured alarms. An alarm is an announcement to the operator (in visual or auditory form) indicating an equipment malfunction, process deviation, or any other abnormality in the plant; and an alarm system is the system to generate and process alarms and to present them to the operator.

Too many configured alarms make the alarm system complex, and operators often face difficulties during abnormal events. For example, the Three-Mile Island nuclear power plant was designed with the ability to provide alarms for every small problem [1]; but during the accident, operators were flooded with information, much of it irrelevant and misleading. Therefore an efficient alarm system is of paramount importance. Alarms indicating faults should be unmistakable and for each cause there should be no more than one alarm [2,3]. According to the ISA 18.2 and EEMUA 191 standards, an operator should not receive more than six alarms per hour during normal operation of the plant [4,5]. But this is hardly the case in practice, and operators normally receive far more alarms than these standards suggest [3]. A single fault may give rise to a complex alarm situation due to consequences of the fault triggering other alarms [6]. There is no single solution to improve alarm systems; however, a few improvement techniques can be applied in different areas, e.g., multivariate process monitoring, state-based priority setting, model-based process monitoring, threshold design and data processing [2,3,7]. We concentrate on the last of these and focus in this work on a specific technique, namely the delay-timer.

Fault detection and isolation (FDI) forms a very active area of research, mainly on the basis of mathematical or knowledge based models. Several model-based FDI methods were developed in the last decades [8,9]. The main focus in the model-based approach is to obtain a proper mathematical model, which is relatively difficult for complex plants. An alternative to model-based fault detection is the family of signal-based methods. In signal-based fault detection, process variables are directly monitored and compared with some thresholds (or alarm limits) [8,9]. Signal-based methods of process monitoring are the most commonly used techniques in industry and are readily implementable on almost all modern distributed control systems (DCS).

Generally, alarm systems can be designed in two frameworks: univariate and multivariate. In univariate design, which is the most commonly practiced method of alarming, the alarm threshold and processing technique (deadband, delay-timer, filter, etc.) are individually designed for each process variable. In multivariate design, alarms are designed for some latent variables which are typically
linear combinations of several process variables. In this work, we concentrate on signal-based univariate alarm design.

Signal-based methods based on simple limit checking are comparatively simple and easy to implement. However, incorrect settings of alarm limits or thresholds may result in problems of false, missed and nuisance alarms [2,3,10]. An alarm that is raised in the fault-free operation of a plant is known as a false alarm. An alarm that is true but redundant, because the operator has already been notified of the fault by another alarm, is a nuisance alarm. On the other hand, the situation where an alarm is not raised in the presence of a fault is known as a missed alarm. Another associated problem in the alarm system is delay in raising the alarm. When a fault occurs, alarms may not be raised instantly because of numerous delays in the system, including delays caused by the alarm configuration (deadband, delay-timer, etc.). The difference in time between the actual moment of fault occurrence and the moment an alarm is activated is defined as the detection delay.

It is widely accepted that proper implementation of alarm configurations (e.g., delay-timer, deadband, and filtering) can improve the performance of alarm systems. However, very limited research has been conducted so far to calculate performance specifications for these commonly used alarm configuration techniques. The exact quantitative relationship is not well known yet. This work aims at obtaining quantitative expressions that relate alarm configuration techniques to performance measures: false alarm rates (FAR), missed alarm rates (MAR) and expected detection delays (EDD). Later such relationships can be used to design alarm systems. The mentioned performance indices are calculated for one of the alarm attributes, the delay-timer, described in the ISA 18.2 [4] or EEMUA 191 [5] standards for basic alarm design. The knowledge of FAR, MAR and EDD is important to include preventive measures in the system design.

To reduce the number of unnecessary redundant alarms, a conventional approach is to review the top ten worst alarms. In this method, from a generated list, alarms are reviewed one by one starting from the most frequent one and the root cause is identified [11]. Many of these generated alarms can simply be removed by properly adjusting alarm thresholds [2]. Alarm thresholds are normally set considering a certain confidence range, e.g., μ ± 3σ (where μ is the mean and σ is the standard deviation of the process variable) [10]. In [12], a method of alarm threshold design is described; adding a margin to the upper and lower boundaries of the normal process fluctuation range is proposed in this method. But it does not discuss how to justify the optimality of selected margins. If a margin is set too high, it poses the risk of high missed alarms; and a too tight margin may lead to higher false alarms. In [13], considering the regulatory standard of EEMUA 191 [5], alarm systems management and optimization are discussed. But very little is written about the trade-offs among false alarm rates, missed alarm rates and detection delays except [3,10,14,15]. In [10], trade-offs between the detection delay and the false alarm, and between the false alarm rate and the missed alarm rate are discussed. In [3], a threshold design process is discussed based on the ROC curves balancing the false alarm rate and the missed alarm rate for the filter, time delay and deadband. In [15,16], detection delays caused by the simple limit checking, deadbands and delay-timers are calculated; also a design method is proposed to balance the FAR, MAR and EDD.

A large increase in the number of monitored variables in recent days has subsequently increased the number of alarms to be processed by an operator. These monitored variables are not all independent. There have been some studies on alarm systems based on correlation analysis [11,14,17,18]. In [14], a method of suggesting alarm limits is discussed considering correlations between the process data and the alarm data. The idea is to study the similarity of correlation maps of physical process variables and their alarm history through causal maps. Then alarm limits should be set to reduce discrepancy of correlations. This method requires the process connectivity information, and in the presence of time lag, it cannot provide any solution. Refs. [11,17] discussed an alarm reduction method based on finding the event correlations from statistical similarities of alarms. In this method, alarms were divided into a limited number of groups considering several factors, e.g., sequentiality in alarm generation. From these groups unnecessary alarms were identified. The idea of event correlation analysis was first proposed in [18] to improve the performance of independent protection layers (IPL) 2 and 3, and its effectiveness was evaluated in a chemical plant. Another important aspect of alarm systems is the human factors or operators' actions, due to their direct and indirect participation in handling of abnormal situations. Under a set of assumed malfunctions, an operator model to be used as a virtual subject for evaluating an alarm system is discussed in [19]. Correlation of alarms and virtual subject modeling will be addressed in future work.

In this work, the focus of study is signal processing based univariate alarm systems. For the proposed generalized delay-timers, alarm systems are modeled using Markov chains [3,15,20]. Also false alarm rates (FAR), missed alarm rates (MAR) and expected detection delays (EDD) for such systems are calculated. Rationalization of alarm systems is a key concern in process industries. Nowadays the use of DCS has made it possible to store large amounts of data in a historian, which can be used as a great resource for effective alarm rationalization. Normally process engineers are well aware of the normal behavior of the plant and are expected to be able to discriminate between fault-free and faulty process data. The calculated equations can be applied to historical process data to get an accurate measure of performance specifications, namely, FAR, MAR and EDD. Later, by changing different tuning parameters (namely, delay-timer settings and alarm limits), desired alarm settings can be obtained with the help of these equations. This work is a continuation of our previously published work [15,16]. Previously the performance indices were calculated for conventional delay-timers and deadbands. In this work, a generalized framework of the delay-timer is proposed; moreover, analytical expressions of performance indices for special cases and an algorithm with flowchart for the general case are given. The conventional delay-timer framework can be treated as a special case. However, the generalized setup certainly has the convenience of a large number of combination provisions that the conventional setup cannot provide. In the analysis section some preliminary discussions on advantages and disadvantages of the generalized and conventional methods with respect to accuracy, latency and sensitivity are given.

The rest of the paper is organized as follows: performance indices of alarm systems are discussed in Section 2. Markov chain is discussed in Section 3. Section 4 discusses the conventional delay-timers. The generalized delay-timer is described in Section 5. Section 6 is devoted to the algorithm to formulate the Markov chain for the generalized case. Comparison of performance and sensitivity analysis are provided in Section 7. Finally, some concluding remarks are given in Section 8.

2. Performance measurement of alarm systems

In alarm systems, raising of an alarm can be discussed from the perspective of a two-class classification problem, where each instance is mapped to a class. If P and N are the numbers of positive and negative instances with outcomes p (positive) and n (negative) respectively, then there are four possible outcomes from the given classifier problem. In such a case, the outcomes can be described by a 2 × 2 confusion matrix (also known as contingency table) (Fig. 1).

If both prediction and actual outcomes are true (p), then it is called a true positive. But if the actual one is false (n), then it is a
false positive. In the alarm systems literature, the false positive corresponds to the false alarm rate (FAR). On the other hand, a prediction outcome n′ with an actual value p is known as a false negative, which is the missed alarm rate (MAR) in alarm systems [2,3,15].

                          Actual value
                          p                 n                 Total
Predicted     p′          True positive     False positive    P′
outcome       n′          False negative    True negative     N′
Total                     P                 N

Fig. 1. Confusion matrix of two-class classification problem.

The performance measurement of alarm systems is an important concern in industries. For the univariate alarm systems, one way to assess the performance is to compute accuracy of the design. The false alarm rate (FAR) and the missed alarm rate (MAR) provide a measure of alarm systems accuracy. A good design of higher accuracy indicates lower false and missed alarm rates. For the process data shown in Fig. 2, for a high alarm limit some part of the fault-free data falls above the limit causing false alarms and some part of the faulty data falls below the limit causing missed alarms.

The accuracy of an alarm system can be presented by the receiver operating characteristics (ROC) curve [3]. An ROC curve is a technique to visualize and select classifiers based on their performance. In signal detection theory, ROC curves are used to describe trade-offs between hit rates and false alarm rates [21]. However, in alarm systems ROC curves were first introduced in [3], where instead of the hit rate the missed alarm rate (1-hit rate or false negative) was used for convenience. If the objective of the alarm system design is to minimize both the false alarm rate and missed alarm rate, then the optimum point in the ROC curve is the point closest to the origin (Fig. 3).

Fig. 3. The receiver operating characteristic (ROC) curve. [Axes: probability of false alarm (horizontal) versus probability of missed alarm (vertical), each from 0 to 100.]

Another important performance index in alarm systems is the swiftness or latency to raise an alarm. Once an abnormality occurs, estimation of the average time required to raise an alarm is a critical concern for safety reasons. The average required time to activate a true positive outcome is known as the expected detection delay (EDD), which provides a measure of alarm systems latency.

In Fig. 2, the fault-free data is Gaussian distributed with mean 0, variance 1, and faulty data is also Gaussian distributed with mean 2, variance 2. If the high alarm limit is set at 2, the alarm is activated as soon as the fault occurs. However, setting the limit at 4 in this example delays alarm activation by 7 s (assuming 1 s sampling). It is therefore important to know how alarm settings cause delay in raising alarms.

A quantitative relationship among alarm configuration and performance indices is necessary for efficient alarm systems design. In the following sections, calculation of these three indices (FAR, MAR, and EDD) is discussed. Calculation is mainly based on modeling the alarm systems by Markov chains.

3. Markov chain

A Markov process is a stochastic process in which the outcome at any time instant depends only on the outcome that immediately precedes it and on none before that [22]. The stochastic process $(X_n)_{n=0}^{\infty}$ is called a Markov chain if, for any n and any collection of states $i_0, i_1, \ldots, i_{n+1}$, we have

\[ P(X_{n+1} = i_{n+1} \mid X_n = i_n, \ldots, X_1 = i_1, X_0 = i_0) = P(X_{n+1} = i_{n+1} \mid X_n = i_n) \tag{1} \]

In this work the Markov chains are used to calculate the FAR, MAR and EDD for different design techniques. For a Markov chain

Fig. 2. Process data and corresponding PDFs.
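
As a numerical illustration of Section 2, the following minimal Python sketch evaluates the single-sample false and missed alarm probabilities for the Gaussian example of Fig. 2 (fault-free data with mean 0 and variance 1, faulty data with mean 2 and variance 2) and then picks the threshold whose ROC point lies closest to the origin. The function names, the threshold grid from 0 to 4 and the plain grid search are illustrative assumptions and are not taken from the original study.

```python
import math

def norm_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

# Distributions quoted for Fig. 2, stored as (mean, standard deviation).
FAULT_FREE = (0.0, 1.0)
FAULTY = (2.0, math.sqrt(2.0))   # variance 2

def far(threshold):
    """Probability that one fault-free sample exceeds a high alarm limit."""
    return 1.0 - norm_cdf(threshold, *FAULT_FREE)

def mar(threshold):
    """Probability that one faulty sample stays below a high alarm limit."""
    return norm_cdf(threshold, *FAULTY)

def closest_to_origin(thresholds):
    """Threshold whose (FAR, MAR) point is nearest to the ROC origin."""
    return min(thresholds, key=lambda t: math.hypot(far(t), mar(t)))

# Hypothetical search grid: thresholds 0.00, 0.01, ..., 4.00.
grid = [i / 100 for i in range(401)]
best = closest_to_origin(grid)

if __name__ == "__main__":
    for t in (2.0, 4.0, best):
        print(f"threshold={t:.2f}  FAR={far(t):.4f}  MAR={mar(t):.4f}")
```

Raising the limit from 2 to 4 lowers the false alarm probability but raises the missed alarm probability, which is the trade-off that the ROC curve summarizes. The distance-to-origin criterion weights both error types equally; a different weighting would need a different objective.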


with a limited number of states, the transition probability distribution can be represented by the transition matrix P, where the (i, j)th element of P is

\[ p_{ij} = P(X_{n+1} = j \mid X_n = i) \]

A probability vector π is called invariant for the Markov chain if π = πP [23]. The invariant vector π exists if the Markov chain satisfies the following two conditions [15]:

1. In each row of P, the sum of all entries is 1 ($\sum_j p_{ij} = 1$).
2. All the elements of P are nonnegative ($p_{ij} \geq 0$).

In this work, it is assumed that the probability density functions (PDFs) of the fault-free and faulty data are known. It is not necessary to assume any particular distribution type. But enough relevant data points must be available to estimate these distributions, which can be obtained from historical process data. Moreover, the time series must be stationary and the process data must be independent and identically distributed (i.i.d.). Although these two assumptions are restrictive to some extent, they are not impractical [3,16,24]. Other cases of non-stationary distributions will be discussed elsewhere.

4. Conventional delay-timer

Generally, alarms are raised and cleared by comparing a single sample with the threshold. For high alarms, if the sample is above the threshold, an alarm is raised; and when it goes below the threshold the alarm is cleared. A similar idea is applied in the case of low alarms. However, this simple limit or threshold checking is often not considered a good design due to the problem of nuisance and false alarms [3]. To reduce nuisance and false alarms, techniques like delay-timers, deadbands, and filters are widely used in industry. The idea of delay-timers is very simple yet effective. In alarm systems two types of delay-timers are used: on-delay and off-delay [3,16]. For on-delay timers, an alarm is only raised if n consecutive samples cross the threshold. In other words, the process measurement remains in the alarm state for n units of time before activating the alarm [4]. With off-delay configured, a raised alarm will only be cleared if a number of consecutive samples fall below the threshold.

It is essential to conduct a careful safety analysis in tuning the delay-timer parameters (time and alarm limit). From a safety point of view, compared to off-delay, on-delay timers are more sophisticated. Since on-delay is connected with alarm activation, the setting should allow operators sufficient time before the abnormal situation becomes an incident. Later in this section detailed equations are provided that can be used to calculate this activation time for any settings of delay-timers.

Fig. 4. Combined on and off-delay for n = m = 3. [State diagram: no-alarm states NA, NA1, NA2 and alarm states A, A1, A2; transitions labeled p1, 1 − p1, p2 and 1 − p2.]

The Markov process for 3-sample on-delay and 3-sample off-delay is shown in Fig. 4. Here p1 is the probability of one sample above the threshold in the fault-free region of operation (we only consider the case of high alarm for simplicity). Therefore the probability of one sample falling below the threshold is p2 = 1 − p1 (Fig. 2). In cases where deadbands are applied with or without delay-timers, p1 + p2 ≠ 1 [16].

Assume initially the process is in the fault-free region of operation and no alarm is raised, which is the state NA in the Markov process. It will require 3 consecutive samples with probability p1 to move the Markov process to state A and raise an alarm. Any intermediate sample with probability 1 − p1 will bring the chain back to the original no alarm (NA) state. In this case, the sample temporarily exceeding the threshold is treated as an outlier sample. A similar process is followed in off-delay for clearing a raised alarm. Similar to the fault-free region, there will be another Markov process in the faulty region with probabilities q2 and q1 (see Fig. 2 for definitions).

4.1. False alarm rate

The most common practice in process monitoring is to compare a variable with thresholds (also known as alarm limits or trip points) to detect out of range conditions. A general situation is presented in Fig. 2, where L0 and L1 are the likelihoods for fault-free and faulty classes. For the process data shown in Fig. 2, some part of the probability distribution of the fault-free data falls over the threshold and will cause false alarms (also known as Type I error or false positive). The probability of false alarms from the likelihoods or distributions in Fig. 2 can be computed as

\[ P(\text{False Alarm}) = \int_{t}^{\infty} L_0(x)\,dx \tag{2} \]

where t is the decision threshold. Eq. (2) is the simplest case, when a single sample is compared with a threshold to raise an alarm. But in practical cases many other design constraints are added to the simplest threshold comparison. For example, delay-timers are widely exercised in industries, where a few consecutive samples are checked before raising or clearing an alarm. The calculation of the probability of false alarms for different cases of delay-timers is discussed below.

For a conventional or regular delay-timer (Fig. 4), Markov processes are used to calculate the probability of false alarms. There are three possible cases: n-sample on-delay, m-sample off-delay and combined on- and off-delays. For combined n-sample on-delay and m-sample off-delay the transition probabilities in the fault-free region from one state to another are given by

\[ P_n = \left(\begin{array}{cc} P_{n11} & 0_{n\times(m-1)} \\ \hline P_{n21} & P_{n22} \end{array}\right), \tag{3} \]

where

\[ P_{n11} = \begin{pmatrix} 1-p_1 & p_1 & 0 & \cdots & 0 \\ 1-p_1 & 0 & p_1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1-p_1 & 0 & 0 & \cdots & p_1 \end{pmatrix}, \]
\[ P_{n21} = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \\ p_2 & 0 & \cdots & 0 \end{pmatrix}, \qquad P_{n22} = \begin{pmatrix} 1-p_2 & p_2 & 0 & \cdots & 0 \\ 1-p_2 & 0 & p_2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1-p_2 & 0 & 0 & \cdots & p_2 \\ 1-p_2 & 0 & 0 & \cdots & 0 \end{pmatrix}. \]

Here, the horizontal line in Pn indicates that the numbers of columns in Pn11 and Pn21, or in 0n×(m−1) and Pn22, are not equal. Pn is of dimension (n + m) × (n + m). After a transient time in the fault-free region, the Markov chain reaches its steady state and converges to the invariant vector. For the system operating in the fault-free region, the probability of false alarm is the summation of all alarm state probabilities in the steady state and can be calculated according to the theory of Markov chains [23]. It can be shown that the probability of false alarm (the steady-state probability of all alarm states) is

\[ P(\text{False Alarm}) = P(A + A_1 + \cdots + A_{m-1}) = \frac{p_1^{n} \sum_{i=0}^{m-1} p_2^{i}}{p_1^{n} \sum_{i=0}^{m-1} p_2^{i} + p_2^{m} \sum_{j=0}^{n-1} p_1^{j}} \tag{4} \]

4.2. Missed alarm rate

Missed alarms (also known as type II error or false negative) occur when the system is in the faulty state of operation but an alarm is not raised. From the process data in Fig. 2, it can be seen that the missed alarm rate can be decreased by lowering the threshold. However, this will increase the false alarm rate. Therefore, it leads to the problem of balancing between the false alarm rate and the missed alarm rate [3,16]. Calculation of missed alarm rates for different cases of delay-timers is similar to that of false alarm rates. To calculate the missed alarm rate the probability density function of the faulty data should be used (Fig. 2, likelihood function L1). Here q2 is the probability of one sample above the threshold and q1 (or 1 − q2) is the probability of one sample below the threshold in the abnormal region. From Fig. 2, the probability of missed alarm is

\[ P(\text{Missed Alarm}) = \int_{-\infty}^{t} L_1(x)\,dx \tag{5} \]

When delay-timers are applied, the missed alarm rate is the sum of all probabilities of no alarm states in the Markov chain under the faulty region of operation. For calculation of MAR, the probability distribution of faulty data and the transition probability matrix, Pf, should be used, where Pf can be obtained by replacing probabilities p1, p2 by q2, q1, respectively, in Eq. (3). For the n-sample on-delay and m-sample off-delay, the probability of missed alarm is

\[ P(\text{Missed Alarm}) = P(NA + NA_1 + \cdots + NA_{n-1}) = \frac{q_1^{m} \sum_{i=0}^{n-1} q_2^{i}}{q_1^{m} \sum_{i=0}^{n-1} q_2^{i} + q_2^{n} \sum_{j=0}^{m-1} q_1^{j}} \tag{6} \]

4.3. Detection delay

Besides the false alarm and missed alarm rates, detection delay is another important performance index in alarm systems. Detection delay is the difference in time between the actual instant of fault occurrence and the time when an alarm is raised. If a process variable moves from the fault-free to the faulty state of operation at time tf and the alarm is raised at ta, the detection delay is DD = ta − tf. Though the detection delay is a measure of time, in this work it is considered as discrete samples. However, the calculations will still be valid, as in practical cases the sampling time is constant and these values can be converted to actual time measurements.

Fig. 5. Generalized on-delay with n1 = 3, n = 4. [State diagram: no-alarm states NA, NA1 to NA5 and alarm state A; transitions labeled p1, 1 − p1 and 1 − p2.]

The expected value of the detection delay (EDD or averaged detection delay) indicates swiftness of the system to raise an alarm. EDD is the average time it takes to raise an alarm when the system enters into abnormal states of operation. EDD is defined as EDD = E(DD).

A smaller EDD is more desirable as it allows more time for an operator to address an alarm. In [15,16], it is discussed how changes in the alarm limit and the delay-timer settings affect the detection delay. A Markov process is used to model the alarm system. Assume the system is in the normal state of operation. Once a fault occurs, the time required to move from no alarm (NAi) states to alarm (Aj) states is the detection delay. To estimate the EDD, the concept of hitting time is used. If the Markov model is divided into two sub-spaces, D for all alarm states and E for all no alarm states, the hitting time is the time required to move to D from E assuming the system initially started in sub-space E [23]. The expected delay is given by [15]

\[ \mathrm{EDD} = \frac{p_2^{m-1}\left[ p_1^{n} q_1 \sum_{i=0}^{n-1} q_2^{i} + p_2 \left( \sum_{j=0}^{n-1} p_1^{j} \sum_{k=0}^{n-j-1} q_2^{k} - q_2^{n} \sum_{i=0}^{n-1} p_1^{i} \right) \right]}{q_2^{n}\left( p_2^{m} \sum_{i=0}^{n-1} p_1^{i} + p_1^{n} \sum_{i=0}^{m-1} p_2^{i} \right)} \tag{7} \]

Eq. (7) can be further modified for different values of m and n to get special cases of simple threshold checking (no delay-timer, m = n = 1), only on-delay (m = 1), and only off-delay (n = 1).

5. Generalized delay-timer

Sometimes conditions to raise an alarm in the conventional delay-timer are difficult to satisfy. For example, consider a system with 10 samples on-delay. When the system enters from the normal state to the faulty state of operation, 10 consecutive samples must cross the threshold to raise the alarm. It is very likely to have some delay in raising the alarm during this transition, as the first 10 samples may not satisfy the strict requirement. However, if the condition is relaxed to, for instance, 8 out of 10 samples crossing the threshold, intuitively the probability of early detection is higher. Accordingly we propose the following two definitions:

n1 out of n samples generalized on-delay: an alarm will be raised if n1 out of n consecutive samples cross the threshold.
m1 out of m samples generalized off-delay: an alarm will be cleared if m1 out of m consecutive samples fall below the threshold.

A method called the tests for instability was proposed earlier by the Western Electric Co. [25]. However, only two specific combinations were considered there, in contrast to the general combinations proposed here.

The Markov process for n1 = 3 and n = 4 is shown in Fig. 5 where 3 out of 4 samples are required to raise an alarm and no off-delay
Fig. 6. (a) 2 out of n generalized on-delay; (b) 2 out of m generalized off-delay. [State diagrams: (a) states NA, NA1, NA2, ..., NAn−1 and A; (b) states NA, A, A1, A2, ..., Am−1; transitions labeled p1, 1 − p1, p2 and 1 − p2.]
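
The 2 out of n on-delay chain of Fig. 6(a) can be checked numerically. The sketch below is a minimal Python illustration under stated assumptions: no deadband is used, so p2 = 1 − p1; the states are ordered NA, NA1, ..., NAn−1, A; and the invariant vector is approximated by plain power iteration rather than by any method from the paper. The function names are hypothetical. It builds the transition matrix, iterates π ← πP, and compares the resulting P(A) with the closed-form expression of Eq. (10).

```python
def two_out_of_n_matrix(p1, n):
    """Transition matrix for the 2-out-of-n on-delay chain (no off-delay).

    State order: index 0 = NA, 1..n-1 = NA1..NA(n-1), n = A.
    Assumption: no deadband, so p2 = 1 - p1.
    """
    p2 = 1.0 - p1
    size = n + 1
    P = [[0.0] * size for _ in range(size)]
    P[0][0], P[0][1] = 1.0 - p1, p1          # NA waits for a first sample above the limit
    for k in range(1, n):
        P[k][n] = p1                          # second sample above the limit raises the alarm
        nxt = k + 1 if k < n - 1 else 0       # window still open, or expired (back to NA)
        P[k][nxt] = 1.0 - p1
    P[n][0], P[n][n] = p2, 1.0 - p2           # one sample below the limit clears the alarm
    return P

def invariant_vector(P, iterations=5000):
    """Approximate the invariant vector (pi = pi P) by power iteration."""
    size = len(P)
    pi = [1.0 / size] * size
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(size)) for j in range(size)]
    return pi

def far_eq10(p1, n):
    """Closed-form false alarm probability of Eq. (10), with p2 = 1 - p1."""
    p2 = 1.0 - p1
    alarm = p1 * (1.0 - p2 ** (n - 1))
    return alarm / (alarm + p2 * (2.0 - p2 ** (n - 1)))

if __name__ == "__main__":
    for p1, n in [(0.2, 2), (0.1, 5), (0.3, 4)]:
        pi = invariant_vector(two_out_of_n_matrix(p1, n))
        print(f"p1={p1}, n={n}: P(A) numeric={pi[-1]:.6f}, Eq. (10)={far_eq10(p1, n):.6f}")
```

Power iteration is used only because it keeps the sketch dependency-free; solving the linear system π(P − I) = 0 with a normalization constraint would serve equally well.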

Table 1
Possible paths to reach from state NA to A. (∗) represents not applicable.

Sample 1    Sample 2    Sample 3    Sample 4
p1          p1          p1          ∗
p1          1 − p1      p1          p1
p1          p1          1 − p1      p1

is configured. It can be seen that in the generalized on-delay, the total number of intermediate no alarm states (NAi) is 5. Compared to the conventional 3 samples on-delay case (NA1, NA2 in Fig. 4), this requires more intermediate states. The reason for this requirement can be explained by Table 1. Assume initially the alarm is cleared and the Markov process is in the state NA. To raise an alarm the Markov process has to reach state A with the condition that 3 samples out of 4 have probabilities p1. Unlike conventional on-delay, where there is a single path to reach state A from state NA (Fig. 4), the Markov process here can achieve this in three different paths. Each row of Table 1 represents a path. The first element of each row is p1, which is required to move to the first intermediate state NA1. From NA1, depending on the next sample, it can either move to state NA2 (with probability p1) or to state NA3 (with probability 1 − p1). This process will carry on until the Markov chain reaches state A. Since the number of paths is increased in the generalized on-delay compared to the conventional on-delay, the number of intermediate states has also been increased.

It is easy to see that, for n1 out of n on-delay, there can be a maximum of $(2^{n-1} - 1)$ intermediate states. However, some of these states are unreachable and not shown in Fig. 5. Due to the possibility of several paths, generalized delay-timer calculation is more complex than the conventional case. Therefore with larger n, computational complexity increases significantly and it is difficult to provide a general representation of the Markov diagram for all combinations. Below some special cases of generalized delay-timers (2 out of n samples on-delay and 2 out of m samples off-delay cases) are discussed. In Section 6, a numerical algorithm to calculate the FAR, MAR and EDD in the general case is given.

5.1. False alarm rate

5.1.1. 2 out of n samples on-delay

Consider a 2 out of n on-delay, where n ≥ 2. To calculate the probabilities of false alarms, a Markov chain is used. The general Markov chain for n1 = 2 and any n is shown in Fig. 6(a). The corresponding transition probability matrix is

\[ P_n = \begin{pmatrix} 1-p_1 & p_1 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1-p_1 & \cdots & 0 & p_1 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1-p_1 & p_1 \\ 1-p_1 & 0 & 0 & \cdots & 0 & p_1 \end{pmatrix}, \tag{8} \]

Here Pn is of dimension (n + 1) × (n + 1). The corresponding steady-state invariant vector is

\[ \pi = \begin{pmatrix} P(NA) & P(NA_1) & P(NA_2) & \cdots & P(NA_{n-1}) & P(A) \end{pmatrix} = \frac{1}{p_1(1-p_2^{n-1}) + p_2(2-p_2^{n-1})} \begin{pmatrix} p_2 & p_1 p_2 & p_1 p_2^{2} & \cdots & p_1 p_2^{n-1} & p_1(1-p_2^{n-1}) \end{pmatrix} \tag{9} \]

It can be shown that the probability of false alarm in the steady state is

\[ P(\text{False Alarm}) = P(A) = \frac{p_1(1-p_2^{n-1})}{p_1(1-p_2^{n-1}) + p_2(2-p_2^{n-1})} \tag{10} \]

5.1.2. 2 out of m samples off-delay

Similar to the generalized on-delay, generalized off-delay will require m1 out of m samples below the threshold to clear an alarm. As a special case, the Markov process for 2 out of m samples is considered (Fig. 6(b)). In the fault-free region of operation, any 2

[Figure 7 plots: column (a) shows the false alarm rate against the threshold and column (b) the expected detection delay against the threshold, each in three panels (i), (ii) and (iii) that overlay the Monte Carlo simulation on the FAR equation and the EDD equation, respectively.]

Fig. 7. [a] Verification of FAR formula by Monte-Carlo simulations; [b] Verification of EDD formula by Monte-Carlo simulations.

samples out of m will clear an already raised alarm. The transition probability matrix is

$$P_n = \begin{pmatrix}
1 - p_1 & p_1 & 0 & 0 & \cdots & \cdots & 0 \\
0 & 1 - p_2 & p_2 & 0 & \cdots & \cdots & 0 \\
p_2 & 0 & 0 & 1 - p_2 & \cdots & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & & \vdots \\
\vdots & \vdots & \vdots & \vdots & & \ddots & \vdots \\
p_2 & 0 & 0 & \cdots & \cdots & \cdots & 1 - p_2 \\
p_2 & 1 - p_2 & 0 & \cdots & \cdots & \cdots & 0
\end{pmatrix} \tag{11}$$

The steady-state invariant vector is

$$\pi = \begin{bmatrix} P(NA) & P(A) & P(A_1) & \cdots & P(A_{m-2}) & P(A_{m-1}) \end{bmatrix} = \frac{1}{p_1(2 - p_1^{m-1}) + p_2(1 - p_1^{m-1})} \begin{bmatrix} p_2(1 - p_1^{m-1}) & p_1 & p_1 p_2 & p_1^2 p_2 & \cdots & p_1^{m-1} p_2 \end{bmatrix} \tag{12}$$

The probability of false alarm in steady state can be calculated as

$$P(\text{False Alarm}) = P(A + A_1 + \cdots + A_{m-1}) = \frac{p_1(2 - p_1^{m-1})}{p_1(2 - p_1^{m-1}) + p_2(1 - p_1^{m-1})} \tag{13}$$

5.1.3. Combined on and off-delay

Similarly, if both 2 out of n samples on-delay and 2 out of m samples off-delay are applied together, the probability of false alarm is

$$P(\text{False Alarm}) = \frac{p_1(2 - p_1^{m-1})(1 - p_2^{n-1})}{p_1(2 - p_1^{m-1})(1 - p_2^{n-1}) + p_2(1 - p_1^{m-1})(2 - p_2^{n-1})} \tag{14}$$

Example 1. This example is to validate the expression of FAR derived in Eqs. (10)–(14). The process variable is Gaussian distributed. Fault-free data has mean 0, variance 1 and faulty data has mean 2, variance 1. The data length is 1000 for both fault-free and faulty data. 1 s uniform sampling is used. The threshold is varied from 0 to 2 with 0.1 increment. In Fig. 7(a), three different cases of only on-delay, only off-delay and both on- and off-delays applied together are shown. For each case, 2000 Monte Carlo simulations are performed and averaged to get a single estimate. It can be seen that the theoretical and simulated results are consistent.

5.2. Missed alarm rate

Calculation of the missed alarm rate is similar to that of the false alarm rate, but the probability density function of the abnormal data should be used. The transition probability matrix ($P_n$ in Section 5.1) of the fault-free operation would be replaced by the matrix for the faulty operation ($P_f$), where the probability entries $p_1$, $p_2$ would be replaced by $q_2$, $q_1$ respectively. The probability of missed alarm for 2 out of n-sample on-delay and 2 out of m-sample off-delay can be expressed as

$$P(\text{Missed Alarm}) = \frac{q_1(2 - q_1^{n-1})(1 - q_2^{m-1})}{q_2(1 - q_1^{n-1})(2 - q_2^{m-1}) + q_1(2 - q_1^{n-1})(1 - q_2^{m-1})} \tag{15}$$

5.3. Detection delay

5.3.1. 2 out of n samples on-delay

The probability of z samples detection delay can be calculated using the method described in (15). Here the transition matrix is

$$P_f = \begin{pmatrix}
1 - q_2 & q_2 & 0 & \cdots & \cdots & 0 & 0 \\
0 & 0 & 1 - q_2 & \cdots & \cdots & 0 & q_2 \\
\vdots & \vdots & \vdots & \ddots & & \vdots & \vdots \\
\vdots & \vdots & \vdots & & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \cdots & 1 - q_2 & q_2 \\
q_1 & 0 & 0 & \cdots & \cdots & 0 & 1 - q_1
\end{pmatrix} \tag{16}$$

The Q matrix can be obtained from $P_f$ by replacing all the alarm state rows by zeros, and the steady-state vector is given in Eq. (9).
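The steady-state vector of Eq. (9), which the detection-delay calculation reuses, can be cross-checked numerically. The following is a minimal sketch, assuming the 2 out of n on-delay chain has the state order NA, NA_1, ..., NA_{n-1}, A and the transitions read off the matrix rows shown in this section; the function names `on_delay_matrix`, `stationary` and `far_eq10` are illustrative and not from the paper. It builds the fault-free matrix, finds the invariant vector by power iteration, and compares the alarm-state probability with Eq. (10).

```python
def on_delay_matrix(p1, n):
    """Fault-free transition matrix of the 2 out of n on-delay chain.

    Assumed state order: NA, NA_1, ..., NA_{n-1}, A (n + 1 states).
    """
    p2 = 1.0 - p1
    size = n + 1
    P = [[0.0] * size for _ in range(size)]
    P[0][0], P[0][1] = p2, p1            # NA: stay, or first sample above threshold
    for k in range(1, n):                # intermediate states NA_k
        P[k][n] += p1                    # second sample above threshold raises the alarm
        nxt = k + 1 if k < n - 1 else 0  # the window expires after NA_{n-1}
        P[k][nxt] += p2
    P[n][n], P[n][0] = p1, p2            # A: no off-delay, one sample below clears
    return P


def stationary(P, iters=5000):
    """Invariant vector by power iteration (pi <- pi P)."""
    size = len(P)
    pi = [1.0 / size] * size
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(size)) for j in range(size)]
    return pi


def far_eq10(p1, n):
    """Closed-form false alarm probability of Eq. (10)."""
    p2 = 1.0 - p1
    num = p1 * (1.0 - p2 ** (n - 1))
    return num / (num + p2 * (2.0 - p2 ** (n - 1)))


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, stationary(on_delay_matrix(0.3, n))[-1], far_eq10(0.3, n))
```

Power iteration is used here only to keep the sketch free of external libraries; an eigenvector routine would serve equally well.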

[Figure 8 timeline:

Sample number   1    2        3    4    5        6    7        8    9
Probability     p1   1 − p1   p1   p1   1 − p1   p1   1 − p1   p1   1 − p1   …

Labels in the figure: "Alarm !!", "Alarm raised", "Alarm cleared" and "Reset window".]

Fig. 8. Resetting of Markov process.

It can be shown that the expected detection delay for 2 out of n samples on-delay is

$$\mathrm{EDD} = \frac{p_2(1 - q_1^{n} + q_1) + p_1 q_1 (1 - p_2^{n-1})(2 - q_1^{n-1}) + p_1 p_2 q_1 \sum_{i=0}^{n-2} p_2^{i}\,(1 - q_1^{n-1} + q_1^{n-2-i})}{\{p_1(1 - p_2^{n-1}) + p_2(2 - p_2^{n-1})\}\, q_2 (1 - q_1^{n-1})} \tag{17}$$
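Eq. (17) can be reproduced numerically from expected hitting times. The sketch below rests on an assumption about how the delay is counted, which is an interpretation rather than a statement from the paper: the chain starts in the fault-free steady state of Eq. (9), then evolves under the faulty probabilities, and the delay is the number of faulty samples needed to be in state A, minus one (so an alarm that is present and survives the first faulty sample has zero delay). The names `edd_numeric` and `edd_eq17` are illustrative.

```python
def edd_numeric(p1, q2, n, sweeps=5000):
    """EDD of the 2 out of n on-delay from expected hitting times.

    p1: P(sample above threshold) in fault-free operation.
    q2: P(sample above threshold) in faulty operation.
    """
    p2, q1 = 1.0 - p1, 1.0 - q2
    # Fault-free steady state of Eq. (9); state order NA, NA_1..NA_{n-1}, A.
    norm = p1 * (1 - p2 ** (n - 1)) + p2 * (2 - p2 ** (n - 1))
    pi = [p2] + [p1 * p2 ** k for k in range(1, n)] + [p1 * (1 - p2 ** (n - 1))]
    pi = [x / norm for x in pi]
    A = n
    # (state after a sample below threshold, state after a sample above threshold)
    succ = [(0, 1)]
    succ += [((k + 1) if k < n - 1 else 0, A) for k in range(1, n)]
    succ += [(0, A)]
    # t[s]: expected number of faulty samples until A is entered, with t[A] = 0.
    t = [0.0] * (n + 1)
    for _ in range(sweeps):
        t = [0.0 if s == A else 1.0 + q1 * t[succ[s][0]] + q2 * t[succ[s][1]]
             for s in range(n + 1)]
    # One faulty sample is always taken; the delay is the total step count minus one.
    delay = [q1 * t[succ[s][0]] + q2 * t[succ[s][1]] for s in range(n + 1)]
    return sum(w * d for w, d in zip(pi, delay))


def edd_eq17(p1, q2, n):
    """Closed-form expected detection delay of Eq. (17)."""
    p2, q1 = 1.0 - p1, 1.0 - q2
    s = sum(p2 ** i * (1 - q1 ** (n - 1) + q1 ** (n - 2 - i)) for i in range(n - 1))
    num = (p2 * (1 - q1 ** n + q1)
           + p1 * q1 * (1 - p2 ** (n - 1)) * (2 - q1 ** (n - 1))
           + p1 * p2 * q1 * s)
    den = (p1 * (1 - p2 ** (n - 1)) + p2 * (2 - p2 ** (n - 1))) * q2 * (1 - q1 ** (n - 1))
    return num / den


if __name__ == "__main__":
    print(edd_numeric(0.2, 0.7, 3), edd_eq17(0.2, 0.7, 3))
```

The fixed-point sweep replaces a linear solve of (I - Q) t = 1 so that the sketch needs no external libraries; it converges because every non-alarm state reaches A with positive probability.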

5.3.2. 2 out of m samples off-delay

Similar to the previous 2 out of n samples on-delay, the expected detection delay for 2 out of m samples off-delay is

$$\mathrm{EDD} = \frac{(p_2 + p_1 q_1 - p_2 q_2)(1 - p_1^{m-1})}{q_2\,[\,p_2(1 - p_1^{m-1}) + p_1(2 - p_1^{m-1})\,]} \tag{18}$$

5.3.3. Combined on and off-delay

For the special case of combined 2 out of n on-delay and 2 out of m off-delay, the expected detection delay is

$$\mathrm{EDD} = \frac{X_1 + X_2 + X_3}{Y_1 Y_2}$$

where

$$\begin{aligned}
X_1 &= p_2(1 - p_1^{m-1})(1 - q_1^{n} + q_1), \\
X_2 &= p_1 p_2 q_1 (1 - p_2^{n-1})(2 - q_1^{n-1}) \sum_{j=0}^{m-2} p_1^{j}, \\
X_3 &= p_1 p_2 q_1 (1 - p_1^{m-1}) \sum_{i=0}^{n-2} p_2^{i}\,(1 - q_1^{n-1} + q_1^{n-2-i}), \\
Y_1 &= p_2(1 - p_1^{m-1})(2 - p_2^{n-1}) + p_1(1 - p_2^{n-1})(2 - p_1^{m-1}), \\
Y_2 &= q_2(1 - q_1^{n-1}).
\end{aligned} \tag{19}$$

Example 2. This example is to validate the expression of EDD derived in Eqs. (17)–(19). The same process variable as in Example 1 is used. The threshold is varied from 0 to 1.5 with 0.1 increment. In Fig. 7(b), three different cases of only on-delay, only off-delay and both on- and off-delays applied together are shown. The EDD averaged over 2000 Monte Carlo simulations is compared with the theoretical values. It can be seen that the theoretical and simulated results are consistent.

5.4. Resetting Markov processes

For the generalized delay-timer, in certain cases it may appear that there is a conflict of logic. Consider a 3 out of 4 samples on-delay timer for the series of samples shown in Fig. 8. The Markov process for this example is shown in Fig. 5. Assume initially there is no alarm. The condition to raise an alarm is satisfied at sample 4 and the Markov process moves to state A. Then the alarm is cleared at sample 5, since it has probability $1 - p_1$ to fall back within the threshold, and the Markov process moves back to the state NA. Once it goes back to state NA, the next window to check for conditions of alarm raising will start from sample 6; we call this window shifting resetting of the Markov process. In this work resetting is important as otherwise it may seem that conditions to raise and clear alarms are conflicting. For example, if we consider the red rectangle in Fig. 8 it may appear that there should be an alarm at sample 6; however, this is not true as resetting is done.

6. Numerical calculation of FAR, MAR and EDD

The calculations of FAR, MAR and EDD are based on the information of the transition matrix (P). Except for some special cases discussed in earlier sections, there is no closed form solution so far for the general $n_1$ out of $n$ samples on-delay and $m_1$ out of $m$ samples off-delay. In this section we propose an algorithm that can numerically generate the transition matrix, which then can be used to calculate FAR, MAR and EDD. It can generate and reduce the transition matrix for an $n_1$ out of $n$ on-delay and $m_1$ out of $m$ off-delay timer whose one sample above threshold probability is $p_1$.

The algorithm comprises three phases: the generating phase, the discarding phase, and the lumping phase. In the generating phase, a full size transition matrix is generated; in the discarding phase, the transient states are discarded; in the lumping phase, states are lumped to the coarsest partition and thus we obtain the reduced size transition matrix.

6.1. Generating phase

To generate the full size transition matrix, we first define the description sequence of a state.

Definition: An $l = \max(n, m)$ bits binary sequence is called a description sequence for the $n_1$ out of $n$ on-delay and $m_1$ out of $m$ off-delay timer. The first bit of the sequence shows whether currently it is an alarm or no-alarm situation, while the remaining bits provide the one sample above threshold information of the current sample and $l - 2$ past samples.

There are $2^l$ states in total. Thus, we can obtain a Markov chain with $2^l$ states. For each state, it is easy to determine the next state to move to in the cases that the next sample is greater and smaller than the threshold, respectively. In other words, each state has two successor states with probability $p_1$ and $1 - p_1$, respectively. After this procedure, a $2^l \times 2^l$ full size transition matrix is established.

To make it clearer, we provide a simple example: A segment of the values of a process variable is [1.3, 3.5, 5.7, 2.6, 10.2, 11.3]. The threshold is 8, and the delay timer rule is 2 out of 3 on-delay. In this case, we need a total of $2^3 = 8$ states. At instant 4, the current and previous values, namely 2.6 and 5.7, are both less than the threshold, and it is in the no-alarm situation. As a result, the description sequence is ‘000’. At instant 5, although the new value 10.2 exceeds the threshold, there is only 1 out of 3 values greater than the threshold. Hence it is still in the no-alarm situation. So the description sequence is ‘001’. When the last value 11.3 comes, 2 out of 3 values are greater than the threshold, which raises the alarm. Therefore, the description sequence changes to ‘111’.
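The 2 out of 3 example can be traced in a few lines. This is a minimal sketch under stated assumptions: only the on-delay rule is modelled (the alarm stays latched once raised, and neither the off-delay nor the resetting of Section 5.4 is implemented), samples before the start of the data are treated as below the threshold, and n is at least 2. The function name `description_sequences` is illustrative.

```python
def description_sequences(values, threshold, n1, n):
    """Description sequence after each sample for an n1 out of n on-delay.

    Returns strings of n bits: the alarm bit followed by the
    above-threshold bits of the current sample and the n - 2 past samples.
    """
    bits = [0] * n          # last n samples, oldest first; earlier history assumed below threshold
    alarm = 0
    out = []
    for v in values:
        bits = bits[1:] + [1 if v > threshold else 0]
        if sum(bits) >= n1:
            alarm = 1       # latched: clearing is not modelled in this sketch
        out.append(str(alarm) + "".join(str(b) for b in bits[-(n - 1):]))
    return out


if __name__ == "__main__":
    data = [1.3, 3.5, 5.7, 2.6, 10.2, 11.3]
    print(description_sequences(data, threshold=8, n1=2, n=3))
    # → ['000', '000', '000', '000', '001', '111']
```

The last three entries correspond to instants 4, 5 and 6 of the example.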

Fig. 9. Flow chart for generation of transition probability matrices.
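The discarding and lumping phases of Sections 6.2 and 6.3 can be sketched on the 16-state chain of the 3-out-of-4 on-delay, 2-out-of-3 off-delay example in this section. The two successor lists below are an assumption: they are transcribed from the non-zero entries of that example's 16 × 16 matrix, with the state index equal to the description sequence read as a binary number. The lumping step uses plain partition refinement (Moore's algorithm) rather than the faster algorithm of [26], and the function names are illustrative.

```python
# Successor of each state when the next sample is below (probability 1 - p1)
# or above (probability p1) the threshold. Transcribed, as an assumption,
# from the 16 x 16 example matrix.
BELOW = [0, 2, 4, 6, 0, 2, 4, 6, 0, 0, 0, 14, 0, 0, 0, 14]
ABOVE = [1, 3, 5, 15, 1, 15, 15, 15, 9, 11, 13, 15, 9, 11, 13, 15]


def discard(below, above):
    """Repeatedly drop states that no remaining state can move to."""
    keep = set(range(len(below)))
    while True:
        reached = {below[s] for s in keep} | {above[s] for s in keep}
        dropped = keep - reached
        if not dropped:
            return keep
        keep -= dropped


def lump(keep, below, above, is_alarm):
    """Coarsest partition in which states of a block share their output
    (alarm or no-alarm) and move to the same blocks on each input."""
    block = {s: int(is_alarm(s)) for s in keep}
    while True:
        ids = {}
        new_block = {}
        for s in sorted(keep):
            signature = (block[s], block[below[s]], block[above[s]])
            new_block[s] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            return new_block      # no block was split: the partition is stable
        block = new_block


if __name__ == "__main__":
    kept = discard(BELOW, ABOVE)
    blocks = lump(kept, BELOW, ABOVE, is_alarm=lambda s: s >= 8)
    print(sorted(set(range(16)) - kept))   # discarded states
    print(len(set(blocks.values())))       # size of the reduced chain
```

Under these assumptions the sketch discards states 7, 8, 9, 10 and 12 and ends with 9 blocks, in which original states 0 and 4 share a block, as do original states 11 and 15 (states 7 and 10 after renumbering).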

One requirement we add on the alarm rule is that the Markov chain represented by the full size transition matrix should only have one ergodic set which is regular, and all the transient sets include only one state. This requirement is reasonable since the rule should guarantee that the probability that the Markov chain moves to the ergodic set in finite steps is 1.

6.2. Discarding phase

In this phase transient states are discarded. Based on the requirement mentioned above, the discarding procedure is straightforward:

Step 1: Search for columns $i_k$ of the transition matrix in which all the elements are zero. Assume that $K \neq 0$ such columns, $\{i_1, i_2, \ldots, i_K\}$, are found; initialize the discarding states list to $\{i_1, i_2, \ldots, i_K\}$.
Step 2: For an index $i_k$ that appears in the discarding states list, set the two non-zero entries in the $i_k$th row to zero, and check the two columns that contained these entries. If either of them is now an all-zero column, add its index to the discarding states list. Repeat Step 2 for the newly added indices in the list until no more such states can be found.
Step 3: Remove all the columns and rows in the discarding list from the transition matrix.

6.3. Lumping phase

The goal of lumping is to group the states, i.e., make a partition, so that all the states in the same group move with probability $p_1$ (respectively $1 - p_1$) to states in a single group. The lumping phase can further decrease the size of the Markov chain.

The Markov chain applied in our work is a special case: each state has only two possible successor states, and there is a unique probability $p_1$ for all states. Because of this, the Markov chain is equivalent to a deterministic finite automaton (DFA) with $2^l$ states and 2 inputs that correspond to $p_1$ and $1 - p_1$.

The DFA minimization problem is a classical one in the field of computer science that has been extensively studied. Since the goal of lumping the states in our special Markov chain is exactly the same as the purpose of DFA minimization, we can directly adopt the algorithm for the lumping phase.

The essence of the DFA algorithm is to first partition the states according to their outputs (in our case alarm or no-alarm). Then the groups are repeatedly refined based on the successor states on a given input for the states in the same group. If the successor states are in a single group, no further partition is required; otherwise partition the group into subgroups whose successor states are in a single group. When no further partition is needed, all states in the same group can be lumped to one state. By introducing the “process

the smaller half” strategy, reference [26] sped up the algorithm to an $O(n \log n)$ time bound, where $n$ is the number of states of the automaton.

The reason why we need to reduce the size of the transition matrix is that as the length of the delay-timer increases the size of the transition matrix increases exponentially. Large transition matrices cause great computational burden for invariant vectors, necessary for FAR, MAR and EDD calculation. We conjecture that the algorithm can reduce the size of the transition matrix from $2^l$ to $C_n^{n_1 - 1} + C_m^{m_1 - 1}$, where $C$ represents combination. It is easy to find the upper bound of the ratio of reduced size to full size by using Stirling’s formula. The ratio converges to zero at the rate of $O(l^{-0.5})$. Although it still means a large computational burden, the total computational complexity of the reduction algorithm and the eigenvector computation of the reduced transition matrix is greatly reduced compared with directly computing the eigenvector of the full size transition matrix. A flow chart of the algorithm is given in Fig. 9.

Example: 3-out-of-4 on-delay timer, and 2-out-of-3 off-delay timer

Generating phase: Since $l = \max(4, 3) = 4$, the length of the description sequence should be 4. So there will be $2^4 = 16$ states, which means that the size of the initial transition matrix should be 16 × 16. The non-zero entries in the matrix are designed as follows: The first state, state 0, has a description sequence ‘0000’. The first ‘0’ means no-alarm situation, the following three ‘0’ show that all the latest three samples are under the threshold. The probability that the next sample exceeds the threshold is $p_1$. Then the latest four samples become ‘0001’. Because of the 3-out-of-4 on-delay timer, the no-alarm situation is kept. So there is a probability of $p_1$ that state 0 moves to state 1, namely, ‘0001’ (the first ‘0’ means no-alarm, the following sequence ‘001’ describes the latest three samples). Similarly, there is a probability of $1 - p_1$ that the state is kept. So in the first row, the two non-zero entries are the first and second entries. Their values are $1 - p_1$ and $p_1$, respectively.

Operating on the following 15 states in the same way, the initial transition matrix is finally obtained as follows:

$$\left[\begin{array}{*{16}{c}}
1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 \\
1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 \\
0 & 0 & 0 & 0 & 1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 & 0 & 0 & 0 & 0 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-p_1 & p_1 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 & 0 & 0 & 0 & 0 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-p_1 & p_1
\end{array}\right]$$

Discarding phase: Columns 8, 9, 11, 13 are zero columns. So these indices are recorded as the initial discarding states list: {7, 8, 10, 12} (since states begin from state 0). Then set the non-zero entries in rows 8, 9, 11, 13 to zero. Then one more zero column, column 10, is found. The discarded states list becomes {7, 8, 10, 12, 9}. After row 10 is set to a zero row, no more zero columns can be found. Rows 8, 9, 10, 11, 13 and columns 8, 9, 10, 11, 13 are removed, and the transition matrix shrinks to the following:

$$\left[\begin{array}{*{11}{c}}
1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1-p_1 & 0 & 0 & 0 & p_1 \\
1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 \\
0 & 0 & 0 & 0 & 1-p_1 & 0 & 0 & 0 & 0 & 0 & p_1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-p_1 & p_1 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 & 0 & 0 & 0 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-p_1 & p_1
\end{array}\right]$$

Lumping phase: After the discarding phase, only 11 states are left. Among them the first 7 states, states {0, 1, 2, 3, 4, 5, 6}, are no-alarm states, and states {7, 8, 9, 10} are alarm states. For the no-alarm states group, the $1 - p_1$ successor states set is {0, 2, 4, 6}, and the $p_1$ successor states set is {1, 3, 5, 10}. The $1 - p_1$ successor states set is a subset of the no-alarm group. However, one element in the $p_1$ successor states set belongs to the alarm group while the others belong to the no-alarm group. So the no-alarm group should be partitioned according to its elements’ $p_1$ successor states. As a result, the states are partitioned into three groups: states {3, 5, 6}; states {0, 1, 2, 4}; states {7, 8, 9, 10}. Repeat this procedure until no more partition

[Figure 10 plots: (a) probability of missed alarm (%) against probability of false alarm (%); (b) expected detection delay against threshold. Curves: 2 out of 2, 2 out of 3, 2 out of 4 and 2 out of 5.]

Fig. 10. (a) Changes in ROC curves for corresponding changes in the threshold for n = 2–5, n1 = 2; (b) corresponding changes in EDD curves.

[Figure 11 plots: (a) MAR against FAR; (b) EDD against threshold. Curves: 3/3, 4/4, 5/5, 3/5 and 4/5.]

Fig. 11. (a) Changes in ROC curves for corresponding changes in the threshold; (b) Corresponding changes in EDD curves.

is needed. Finally, states {0, 4} are in one group; states {7, 10} are in one group. All the other groups have a single state. To lump states {0, 4}, pick a representative state for this group, say state 0; add column 5 to column 1, and then remove column 5 and row 5. To lump states {7, 10}, pick state 10 as the representative state; add column 7 to column 10, and then remove column 7 and row 7. Eventually, the reduced size transition matrix is obtained as follows:

$$\left[\begin{array}{*{9}{c}}
1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1-p_1 & p_1 & 0 & 0 & 0 & 0 & 0 \\
1-p_1 & 0 & 0 & 0 & p_1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1-p_1 & 0 & 0 & p_1 \\
0 & 0 & 1-p_1 & 0 & 0 & 0 & 0 & 0 & p_1 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p_1 \\
1-p_1 & 0 & 0 & 0 & 0 & 0 & p_1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-p_1 & p_1
\end{array}\right]$$

7. Comparative analysis

For a discussion of advantages and disadvantages of the proposed generalized delay-timers and the conventional delay-timers, their performance is compared in this section in terms of three criteria, namely, accuracy (ROC curve with false and missed alarm rates), latency (detection delay with respect to change in settings) and sensitivity (change in performance with any change in design settings).

7.1. Accuracy and latency

The receiver operating characteristic (ROC) curve is the plot of two alternatives when the decision variable is changed [27]. In our context, the ROC curve is the plot of FAR vs MAR when the threshold changes. It is widely used in signal detection theory for graphical sensitivity analysis. For simplicity of analysis only on-delay timers are considered in this section.

Generally different costs are associated with false and missed alarm rates. Selection of the appropriate point on the ROC curve depends on these costs. Usually a weighted combination of false and missed alarm rates is used to include these associated costs in the design. For simplicity, if $(\mathrm{FAR})^2 + (\mathrm{MAR})^2$ is used as the

Table 2
Comparison of FAR, MAR and EDD changing thresholds.

On-delay timer   Threshold   FAR (%)   MAR (%)   EDD (samples)
3 out of 5       32.8        30        1.65      3
                 33.82       4.65      15.6      4
                 34.84       0.1       57.01     5
                 35.85       0         90.34     5
4 out of 5       32.8        23.42     2.13      1
                 33.82       2.52      19.13     5
                 34.84       0.04      64.60     6
                 35.85       0         92.71     6
5 out of 5       32.8        15.44     2.73      5
                 33.82       1.23      24.08     6
                 34.84       0.01      73.81     7
                 35.85       0         94.13     7

[Figure 12 plot: process data with a "Fault occurrence" marker; vertical axis ticks at 30, 34 and 38; time axis ticks at 25-Mar-2009 06:00:00 PM, 25-Mar-2009 08:47:36 PM and 26-Mar-2009 12:00:00 AM.]

Fig. 12. Flow variable from oil-sand process facility.

performance index, then the optimum point on the ROC curve will be the point closest to the origin; and the corresponding threshold of the optimum point in the ROC curve will be the optimum operating threshold. In this case, if the threshold is set higher than the optimum point, false alarms will decrease at the cost of increased missed alarms. On the other hand, setting the threshold lower than the optimum point will result in a smaller number of missed alarms but more false alarms.

A process variable is considered with Gaussian distribution where fault-free data has mean 0, variance 2 and faulty data has mean 2, variance 2. The data length is 1000 for both fault-free and faulty section; 1 s uniform sampling is used. The threshold is varied from the minimum of the process data to the maximum with 0.05 increment. In Fig. 10(a), the corresponding ROC curves are obtained for different delay-timers ($n_1 = 2$, $n = 2$–$5$).

Example 3. It is normally desirable to minimize both FAR and MAR, which account for accuracy of the alarm design; better accuracy indicates lower false and missed alarm rates. It can be seen from Fig. 10(a) that with use of the conventional 2 out of 2 samples delay-timer, the ROC curve is closer to the origin. But when the generalized $n_1$ out of $n$ idea is used, the curves move away from the origin. This indicates that increasing $n$ with fixed $n_1$ decreases the accuracy of alarm systems. On the other hand, in addition to better accuracy a lower detection delay (latency) is also a desired feature. Fig. 10(b) indicates that increasing $n$ with fixed $n_1$ results in lower detection delays. Here conventional delay-timers cause lower FAR and MAR compared to the generalized cases, which is expected; however, they also cause higher detection delays.

Example 4. In the previous example, only $n$ was varied while keeping $n_1$ fixed. Now a few cases of different $n$ and $n_1$ in close range are considered for comparison. Here generalized 3 out of 5, 4 out of 5 and conventional 5 samples on-delay cases are considered. Then the corresponding ROC and EDD curves in these three cases are compared with conventional 3 and 4 samples on-delay timers.

Like the previous example, from Fig. 11 it can also be seen that a conventional delay-timer has higher accuracy compared to a generalized one. A 5 out of 5 on-delay is more accurate than the 4 out of 4 case. Even a 4 out of 4 on-delay yields a more accurate design than a 4 out of 5 on-delay. The downside is the higher delay in detection. In this case a conventional 5 samples on-delay has the highest and a generalized 3 out of 5 on-delay has the lowest detection delay.

7.1.1. Case study

From an oil-sand processing facility, an actual flow measurement is considered for case study (Fig. 12). Data is sampled at 1 s sampling period. Three different combinations of delay-timers are applied on the measured variable to compare their results of FAR, MAR and EDD. Performance of a conventional 5 samples on-delay is compared with 3 out of 5 and 4 out of 5 samples generalized setup. A comparison of FAR, MAR and EDD can be seen from Table 2. Here the mean of fault-free data is 32.8 and variance is 1.04. The chart is obtained following a 3-sigma limit of Shewhart chart and setting thresholds at one sigma distance.

From Fig. 13(a) and (b), it can be seen that the conventional setup is more accurate if equal weight is given to false and missed alarms, as the curve is closest to the origin. However, the generalized configuration causes lower detection delay, as was also observed in previous simulation examples. This kind of chart can be followed to determine operating settings of the delay-timer. Depending on design requirements of FAR, MAR and EDD, operating thresholds and on-delay configuration can be fixed. For example, if the requirement is ≤5% MAR and ≤10 samples EDD irrespective of FAR, then either of generalized 3 or 4 out of 5 samples and conventional 5 samples on-delay timers can be used in setting the threshold at 32.8. But if the requirement is changed to ≤5% FAR and <5 samples EDD, then the conventional setup cannot satisfy the conditions. A 3 out of 5 samples generalized setup is required in setting the threshold at 33.82. In practice, any change in alarm configuration setting requires significant process knowledge, and detailed safety analysis is a must before any change is made.

7.2. Sensitivity

During practical process operations, alarm design parameters are very likely to be adjusted for many reasons. Therefore, it is necessary to investigate the degree to which changes in alarm design parameters affect performance indices. Here, the parameter that changes most frequently is the alarm limit or threshold for a fixed delay-timer. We define sensitivity as the ratio of the infinitesimal change in the function (FAR, MAR or EDD) to the infinitesimal change in the alarm limit ($t$) for a fixed delay-timer:

$$S_{F:t} = \lim_{\Delta t \to 0} \frac{\text{Fractional change in the function, } F}{\text{Fractional change in the threshold, } t} = \lim_{\Delta t \to 0} \frac{\Delta F / F}{\Delta t / t} = \frac{t}{F} \frac{\partial F}{\partial t}$$

Here $F$ can be any of the FAR, MAR, and EDD quantities derived in earlier sections. As a specific example, Eq. (10) for the false alarm rate of 2 out of n on-delay is considered. Replacing $p_2 = 1 - p_1$ and denoting $p_1 = p$, it can be shown that the sensitivity is

$$S = t\, \frac{(1 - p)^n (np^2 - p^2 - np - 2p + 3) - (1 - p)^{2n} - 2(1 - p)^2}{p\,\{(1 - p)^n - (1 - p)\}\,\{p^2 - 3p - (1 - p)^n + 2\}}\, f(t)$$

where $f(t)$ represents the value of the likelihood function in Fig. 2.

Consider the sensitivity curves in Fig. 14(a), which are obtained for the same set of data as in Example 3 with the threshold changing from 0 to 2. It can be seen that for the 2 out of n samples on-delay, a higher value of $n$ is less sensitive to the threshold change than a lower one. This indicates that the conventional delay-timer is more sensitive to changes in threshold than the generalized one.
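The closed-form sensitivity of the 2 out of n on-delay FAR can be compared with a finite-difference evaluation of $(t/F)\,\partial F/\partial t$. The sketch below is hedged on two assumptions that are not stated in the paper: $p$ is taken as the upper-tail probability $P(x > t)$ of zero-mean Gaussian fault-free data with variance 2 (the setting of Example 3), so that $dp/dt = -f(t)$, and the closed form is read as the magnitude of the ratio, which is negative for $t > 0$ under that convention. The function names are illustrative.

```python
import math


def far_2_of_n(p, n):
    """False alarm probability of Eq. (10) with p1 = p and p2 = 1 - p."""
    q = 1.0 - p
    num = p * (1.0 - q ** (n - 1))
    return num / (num + q * (2.0 - q ** (n - 1)))


def tail_prob(t, var):
    """P(x > t) for zero-mean Gaussian data with the given variance."""
    return 0.5 * math.erfc(t / math.sqrt(2.0 * var))


def pdf(t, var):
    """Gaussian density f(t) of the fault-free data."""
    return math.exp(-t * t / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def sensitivity_closed_form(t, n, var=2.0):
    p = tail_prob(t, var)
    q = 1.0 - p
    num = q ** n * (n * p * p - p * p - n * p - 2 * p + 3) - q ** (2 * n) - 2 * q * q
    den = p * (q ** n - q) * (p * p - 3 * p - q ** n + 2)
    return t * pdf(t, var) * num / den


def sensitivity_numeric(t, n, var=2.0, h=1e-5):
    """Central-difference estimate of (t / F) dF/dt."""
    F = lambda x: far_2_of_n(tail_prob(x, var), n)
    return t / F(t) * (F(t + h) - F(t - h)) / (2.0 * h)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, sensitivity_closed_form(1.0, n), abs(sensitivity_numeric(1.0, n)))
```

For n = 2 the closed form reduces to $2\,t\,f(t)/p$, which gives a quick sanity check of the bracketing in the denominator.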

[Figure 13 plots: (a) probability of missed alarm (%) against probability of false alarm (%); (b) detection delay against threshold. Curves: 3/5, 4/5 and 5/5.]

Fig. 13. (a) ROC curves for case study; (b) EDD curves for case study changing thresholds.

[Figure 14 plots: sensitivity against threshold. Panel (a) curves: 2/2, 2/3, 2/4 and 2/5; panel (b) curves: 3/5, 4/5, 3/3, 4/4 and 5/5.]

Fig. 14. (a) Sensitivity curves for example 3; (b) Sensitivity curves for example 4.

The following observation also shows that a generalized $n_1$ out of $n$ on-delay timer gives a less sensitive design compared to the conventional delay-timer. For Example 4, sensitivity curves are shown in Fig. 14(b). Here the 3 out of 5 case is the least sensitive and 5 out of 5 is the most sensitive to any change in the threshold. 4 samples and 4 out of 5 samples on-delays have very similar profiles; similarly for 3 samples and 3 out of 5 samples on-delays. This indicates that for sensitivity, $n_1$ is more important than $n$ in these cases.

Sensitivity analysis is very important as small changes in the distributions of normal and abnormal data, which may be regarded equivalently as small changes in the threshold, can cause significant deviations in FAR and MAR; such sensitivity is not desirable in practice.

8. Conclusion and future work

Generalized delay-timers provide more flexibility in alarm design. In this work, three performance specifications of alarm delay-timers, namely, the FAR, MAR and EDD, are calculated for generalized delay-timers. Some preliminary sensitivity analysis and comparison of performance of the generalized and conventional frameworks are also discussed. As a future extension, a systematic study of the sensitivity advantage of generalized delay-timers is required.

Acknowledgements

This work was supported by NSERC and Alberta Innovates – Technology Futures.

References

[1] C. Perrow, Normal Accidents: Living with High-Risk Technologies, Princeton University Press, New Jersey, 1999.
[2] J. Ahnlund, T. Bergquist, L. Spaanenburg, Rule-based reduction of alarm signals in industrial control, Journal of Intelligent and Fuzzy Systems 14 (2) (2003) 73–84.
[3] I. Izadi, S.L. Shah, D.S. Shook, S.R. Kondaveeti, T. Chen, A framework for optimal design of alarm systems, in: 7th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, June 30–July 3, 2009, Barcelona, Spain, 2009, pp. 651–656.
[4] International Society of Automation, Management of Alarm Systems for the Process Industries, ANSI/ISA 18.2 2009.
[5] Engineering Equipment and Materials Users Association (EEMUA), Alarm Systems – A Guide to Design, Management and Procurement, EEMUA Publication 191, 2007.
[6] F. Dahlstrand, Consequence analysis theory for alarm analysis, Knowledge-Based Systems 15 (1–2) (2002) 27–36.
[7] I. Izadi, S.L. Shah, T. Chen, Effective resource utilization for alarm management, in: 49th IEEE Conference on Decision and Control, December 15–17, Atlanta, GA, 2010, pp. 6803–6808.
[8] R. Isermann, Fault-Diagnosis Systems: An Introduction from Fault Detection to Fault Tolerance, Springer-Verlag, Berlin, 2006.

[9] S.X. Ding, Model-Based Fault Diagnosis Techniques: Design Schemes, Algorithms, and Tools, Springer-Verlag, Berlin, 2008.
[10] I. Izadi, S.L. Shah, D.S. Shook, T. Chen, An introduction to alarm analysis and design, in: 7th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, June 30–July 3, 2009, Barcelona, Spain, 2009, pp. 645–650.
[11] F. Higuchi, I. Yamamoto, T. Takai, M. Noda, H. Nishitani, Use of event correlation analysis to reduce number of alarms, in: 10th International Symposium on Process Systems Engineering, August 16–20, 2009, pp. 1521–1526.
[12] Y. Luo, X. Liu, M. Noda, H. Nishitani, Systematic design approach for plant alarm systems, Journal of Chemical Engineering of Japan 40 (9) (2007) 765–772.
[13] B.G. Liptak, Instrument Engineers’ Handbook: Process Control and Optimization, vol. 2, 4th ed., CRC Press, Florida, 2006, pp. 59–63.
[14] F. Yang, S.L. Shah, D. Xiao, Correlation analysis of alarm data and alarm limit design for industrial processes, in: American Control Conference (ACC 2010), June 30–July 2, 2010, Baltimore, USA, 2010, pp. 5850–5855.
[15] N.A. Adnan, I. Izadi, T. Chen, On expected detection delays for alarm systems with deadbands and delay-timers, Journal of Process Control 21 (9) (2011) 1318–1331.
[16] N.A. Adnan, I. Izadi, T. Chen, Computing detection delays in industrial alarm systems, in: American Control Conference (ACC 2011), June 29–July 1, 2011, San Francisco, CA, USA, 2011, pp. 786–791.
[17] F. Higuchi, I. Yamamoto, T. Takai, M. Noda, H. Nishitani, Use of event correlation analysis to rationalize plant alarm system, in: 5th International Symposium on Design, Operation and Control of Chemical Processes, July 25–28, 2010, Singapore, 2010, pp. 1–7.
[18] J. Nishiguchi, T. Takai, IPL2 and 3 performance improvement method for process safety using event correlation analysis, Computers & Chemical Engineering 34 (12) (2010) 2007–2013.
[19] X. Liu, M. Noda, H. Nishitani, Evaluation of plant alarm systems by behavior simulation using virtual subject, Computers & Chemical Engineering 34 (3) (2010) 374–386.
[20] E. Naghoosi, I. Izadi, T. Chen, Estimation of alarm chattering, Journal of Process Control 21 (9) (2011) 1243–1249.
[21] J.A. Swets, Measuring the accuracy of diagnostic systems, Science 240 (4857) (1988) 1285–1293.
[22] A. Papoulis, S.U. Pillai, Probability, Random Variables and Stochastic Processes, 4th ed., Tata McGraw-Hill Publishing Company Ltd., New Delhi, 2002.
[23] G.F. Lawler, Introduction to Stochastic Processes, 2nd ed., Chapman & Hall/CRC, Boca Raton, 2006.
[24] M. Bauer, J.W. Cox, M.H. Caveness, J.J. Downs, N.F. Thornhill, Finding the direction of disturbance propagation in a chemical process using transfer entropy, IEEE Transactions on Control Systems Technology 15 (1) (2007) 12–21.
[25] Statistical Quality Control Handbook, 1st ed., Western Electric Company, Indianapolis, IN, 1956.
[26] J.E. Hopcroft, An n log n algorithm for minimizing states in a finite automaton, Theory of Machines and Computations (1971) 189–196.
[27] B.C. Levy, Principles of Signal Detection and Parameter Estimation, Springer, New York, 2008.
