
IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 42, NO. 2/3/4, FEBRUARY/MARCH/APRIL 1994    523

Alarm Correlation and Fault Identification in Communication Networks
A. T. Bouloutas, S. Gyalo, and A. Finkel

Abstract- We present an approach for modeling and solving the problem of fault identification and alarm correlation in large communication networks. A single fault in a large network may result in a large number of alarms, and it is often very difficult to isolate the true cause of the fault. This appears to be one of the most important difficulties in managing faults in today's networks. The problem may become worse in the case of multiple faults. In this paper we present a general methodology for solving the alarm correlation and fault identification problem. We propose a new alarm structure, propose a general model for representing the network, and give two algorithms which can solve the alarm correlation and fault identification problem in the presence of multiple faults. These algorithms differ in the degree of accuracy achieved in identifying the fault, and in the degree of complexity required for implementation.

Keywords- network management, alarm correlation, fault identification, fault management, fault localization

I. INTRODUCTION

Communication networks have increased dramatically in size and complexity in the last few years. A typical network may consist of hundreds of nodes from various manufacturers with different traffic and bandwidth requirements. The increasing complexity poses serious questions of network management and control. One aspect of network management is fault management, and a central component of fault management is fault identification. Even though failures in large communication networks are unavoidable, quick detection and identification of the cause of failure can make communication systems more robust, and their operation more reliable, ultimately increasing the level of confidence in the services they provide.

The process of fault identification in large communication systems can be divided into three steps:

• The first step is fault detection. Fault detection can be thought of as an on-line indication that some component of the network is malfunctioning. Usually, communication network devices provide indications (in the form of alarms) when they are malfunctioning, so fault detection can be accomplished using the alarms provided by the network devices.
• The second step is fault localization, i.e., the analysis of the alarms from the devices of the network in order to propose possible hypotheses of faults. This step is necessary since, in most cases, alarms do not provide explicit or detailed identification of the malfunctioning device.
• The third step is fault identification. The actual fault is isolated given a number of possible hypotheses of faults. Testing may be the most appropriate way of isolating the fault at this stage.

This paper concentrates on the second step of the above mentioned process. A framework is developed for analyzing the information within the alarms emitted by communication network devices in order to propose possible hypotheses of faults. The proposed framework can also be used to perform alarm correlation, i.e., to reduce the number of alarms and messages presented to the network operators. This appears to be an important problem in the operation of today's networks; the network operators are overwhelmed with messages and the fault localization task is made very difficult. Too much information has the same effect as too little information: fault identification is made more complex.

In section 2 we propose a representation for communication systems for the purpose of fault identification, in section 3 we present the nature of the problem, in section 4 we study the information carried by the alarms, in section 5 we propose algorithms and methods for solving the alarm correlation problem, in section 6 we give a general framework for identifying multiple faults, and in section 7 we present two examples.

II. SYSTEM REPRESENTATION

Since we are interested in identifying faults in communication systems, we need to represent faults. Before representing faults we need to define what we mean by the term fault. Here, fault identification means identifying the device that is at fault and the nature of the fault.

If we assume that each device has a single fault mode (can only fail in one way), then fault identification means the identification of the device at fault. If a device can fail in more than one way, then fault identification means the identification of the device at fault and the nature of the fault.

From this discussion it is clear that the representation of the faults of any system is closely related to the representation of the structure of the system. In the rest of this section we focus on the representation of the structure of communication systems, i.e., the representation of the faults for a single fault mode per device. This representation is then extended to include the nature of the fault, i.e., the same framework can be used to represent the faults for the case of multiple fault modes per device.

The authors are with the IBM Research Division, T.J. Watson Research Center, Yorktown Heights, N.Y. 10598.
IEEE Log Number 9400935.
0090-6778/94$04.00 © 1994 IEEE

Communication systems consist of processes, which consist of subprocesses, down to the level of hardware or software components that can be considered indivisible (the appropriate level of division is system and user dependent). A natural candidate (but not the only candidate) for representing communication systems is a Phrase Structured Grammar.¹ This is so because a phrase structured grammar allows sentences to be formed from expressions which are formed from sub-expressions, etc., thus giving a natural way to represent hierarchically organized complex structures.

¹ Phrase Structured Grammars are no different than Context Free Grammars [2], but here we follow the terminology in [3].

A Phrase Structured Grammar (PSG) can be defined as the 4-tuple $G = (V, T, P, S)$ where $V$ and $T$ are finite sets of variables and terminals respectively. It is assumed that $V$ and $T$ are disjoint. $P$ is a finite set of productions and $S$ a special variable which is designated as the root symbol. Each of the productions in $P$ is of the form $A \to \alpha$ where $A \in V$ and $\alpha \in (V \cup T)^*$. The formula $(V \cup T)^*$ denotes the Kleene closure [2] of the set $V \cup T$. Thus, productions consist of a left hand side which is a variable symbol and a right hand side which consists of any number of variable or terminal symbols. For example the production $A \to B \cdot C \cdot D$ means that $A$ is formed by conjoining clauses of types $B$, $C$, and $D$ in that order. While $A$ has to be a variable symbol, symbols $B$, $C$, and $D$ can be either variable or terminal.

We introduce two new production forms that enable us to abbreviate the set of productions $P$. If the clauses $A \to B$, $A \to C$ and $A \to D$ belong in the set of productions, then we can replace them with the Alternative Forms Clause $A \to B | C | D$. In this clause $A$ has to be a variable symbol and $B$, $C$ and $D$ have to be either variable or terminal symbols. This indicates that $A$ may consist of any of the clauses in the right hand side. Another abbreviation is the use of the superstar notation. The clause $A \to B^*$ means that $A$ can be substituted by any number of $B$ (including zero). Thus, this clause is equivalent to $A \to \lambda | B | B \cdot B | B \cdot B \cdot B | \ldots$, with $\lambda$ the empty string.

In representing the possible faults of a communication system, we can use the representation of its structure. The terminal symbols of this representation could be the faults we need to identify. Since communication networks are hierarchically organized, we can use this organization to represent our system. The basic logical unit in representing communication networks is the communication connection. The network consists of communication connections. The communication connection provides service to two terminal points, and the combination of the terminal points and the communication connection may constitute the communication connection for two higher level terminal points. Communication fails because either of the terminal points has failed, or the communication connection has failed. For example, assume that a particular network consists of four nodes and three links, link-AB, link-BC, and link-CD. We can represent the network as a collection of links:

NETWORK → LINK-AB · LINK-BC · LINK-CD

Each link can be further expanded into the components that it consists of:

LINK-AB → NODE-A · CHANNEL-AB · NODE-B
LINK-BC → NODE-B · CHANNEL-BC · NODE-C
LINK-CD → NODE-C · CHANNEL-CD · NODE-D

If there exists a logical link utilizing the services of link-AB and link-BC we can write:

LOGICAL-LINK-AC → LINK-AB · LINK-BC

This representation of the system is equivalent to a representation of the possible faults, assuming a single fault mode for each network component. Multiple fault modes can be included by expanding any symbol representing a network component. For example if Node-A has two fault modes we can expand it as follows:

NODE-A → NODE-A-F1 | NODE-A-F2

The two fault modes of Node-A are considered terminal symbols, and thus cannot be expanded any further.

In a similar fashion we can expand the channels. For example Channel-AB may consist of a connector to Node-A, a line, and a connector to Node-B. The line may consist of an Encryptor, a T-1-Channel,² and another Encryptor. The representation could be as follows:

CHANNEL-AB → CONNECTOR-A · LINE-AB · CONNECTOR-B
LINE-AB → ENCRYPTOR · T1-CHANNEL · ENCRYPTOR
CONNECTOR → FAULTY-CONNECTOR
ENCRYPTOR → FAULTY-ENCRYPTOR
T1-CHANNEL → FAULTY-T1-CHANNEL

² Transmission line.

This technique provides a representation of any component in the network which may fail. Furthermore, this approach may be used to represent both the structure and the faults of a communication network. With this approach, a communication network is represented as an acyclic graph whose terminal symbols describe the devices which could be at fault: the dependence graph. Each node of this graph (representing a device or a communication connection) is expanded into the devices upon which the correct operation of the node depends, or into the fault modes that may appear.

A. PSGs, Dependence Graphs, and Independent Faults

Our motivation for the introduction of PSGs for the representation of the system has been the attempt to introduce a general approach for representing alarms, faults and system structure. This approach can help the conceptual organization of the work and can guide the design of algorithms. However, PSGs should not be considered as the universal representation of all systems but only as a tool which can give guidance in the design of algorithms.

An alternative representation of the system is the use of a dependence graph. The operation of the system depends on the operation of subsystems which depend on the operation of basic devices and components.

Fig. 1. (a) A representation of a three node network. S is connected to R1 which is connected to R2. S has two broadcasting sessions to R1 and R2. (b) A representation of the dependence graph of the three node broadcasting network. B represents the broadcasting mode and P1 and P2 the two broadcasting sessions.

Fig. 2. Faults in dependent devices projected to independent sources of fault.
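As an illustration of the grammar-based representation introduced in Section II, the productions of the four-node example can be written down as plain data and expanded mechanically. The following is a minimal Python sketch and not part of the original paper: the dictionary layout and the function name `terminal_faults` are assumptions made for illustration, channels are left unexpanded for brevity, and the grammar is assumed to be acyclic. Each variable maps to a list of alternatives, each alternative is a conjunction of symbols, and any symbol without a production is treated as a terminal, i.e., a fault that can be identified.

```python
from typing import Dict, List, Set

# Productions of the four-node example. Each variable maps to a list of
# alternatives ("|"); each alternative is a sequence of symbols joined by
# the conjunction operator. This data layout is an illustrative assumption.
GRAMMAR: Dict[str, List[List[str]]] = {
    "NETWORK": [["LINK-AB", "LINK-BC", "LINK-CD"]],
    "LINK-AB": [["NODE-A", "CHANNEL-AB", "NODE-B"]],
    "LINK-BC": [["NODE-B", "CHANNEL-BC", "NODE-C"]],
    "LINK-CD": [["NODE-C", "CHANNEL-CD", "NODE-D"]],
    "LOGICAL-LINK-AC": [["LINK-AB", "LINK-BC"]],
    # Node-A has two fault modes, written as two alternatives.
    "NODE-A": [["NODE-A-F1"], ["NODE-A-F2"]],
}


def terminal_faults(symbol: str, grammar: Dict[str, List[List[str]]]) -> Set[str]:
    """Return every terminal symbol reachable from `symbol`.

    A symbol with no production is a terminal. The grammar is assumed to be
    acyclic, so the recursion always terminates.
    """
    if symbol not in grammar:
        return {symbol}
    found: Set[str] = set()
    for alternative in grammar[symbol]:
        for child in alternative:
            found |= terminal_faults(child, grammar)
    return found


if __name__ == "__main__":
    # Faults that could explain a failure of the logical link between A and C.
    for fault in sorted(terminal_faults("LOGICAL-LINK-AC", GRAMMAR)):
        print(fault)
```

Expanding a higher-level symbol in this way lists the candidate faults behind it, which is the sense in which the structural representation doubles as a representation of the possible faults.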
We can represent systems and devices as nodes in a graph. If the operation of a device depends on the operation of some other device then there is an arc connecting the two devices. If a device fails then by traversing the graph one can find all the devices which could have possibly caused the fault. All the independent³ devices which could be at fault and could have caused the faults in the system are leaves of the graph. Note that generally the dependence graph is a directed graph, and if one traverses it starting at any node (the node represents a system), one reaches a set of nodes which represent the independent devices that could be at fault. It is exactly these devices we are interested in identifying at fault. This model would easily address fault propagation because once an independent device fails, faults propagate to all the devices that depend on it.

³ In the probabilistic sense: The fault of an independent device can not in any way cause the fault of another independent device.

Example⁴

In Fig. 1(a) we present an example of a simple three node network. Nodes $S$, $R_1$, $R_2$ are linked by links $S - R_1$ and $R_1 - R_2$. Suppose that $S$ sets up a one-to-two broadcasting $B$ with $R_1$ and $R_2$ by two paths $P_1(S, R_1)$ and $P_2(S, R_2)$. Then the representation of the dependence relations of the system can be seen in Fig. 1(b). In terms of a PSG the system can be described as follows:

$B \to P_1 \cdot P_2$ with
$P_1 \to L(S - R_1)$, $P_2 \to L(S - R_1) \cdot L(R_1 - R_2)$.

Obviously, the failure probabilities of the paths $P_1$ and $P_2$ are not independent because $\Pr(P_2 \text{ fails} \mid P_1 \text{ fails}) = 1$, and this is reflected in the dependence graph in Fig. 1(b). Both nodes have at least one common descendant. However the leaf nodes $S$, $S - R_1$, $R_1$, $R_1 - R_2$, $R_2$ are considered to be independent and this is where we would like to identify the fault. If nodes $S$ and $S - R_1$ were not independent then this would mean that there is another object (node) both would depend on. This would then be the terminal object and the object we would be interested in identifying at fault.

⁴ This example is courtesy of an anonymous referee.

It is our intention in this paper to only consider independent sources of fault. If devices and systems which are not independent are associated in any way with a possible fault then this fault is projected to the independent devices by following down the dependence graph. An example of this is shown in Fig. 2. A dependence graph and two alarms A1 and A2 affecting devices D4, D5, D6, D7 are presented. Clearly devices D5 and D6 are not independent since they have a common descendant. However if we project the alarms down the dependence graph we note that alarm A1 is equivalent to alarm A5 affecting devices D14, D15, D16, D17, D18. In a similar fashion we can project alarm A2 to alarm A3. (Note that here we have not used any negative information, meaning that we have not used the fact that since D21 has not indicated any problem then devices D14 and D15 can not be at fault. This would limit alarm A5 to A4, shown in dashed lines. This is generally examined in section 6.) This way the dependence graph can help us project the possible faults to the independent devices. It is the dependence graph we propose to represent as phrase structure grammars, and clearly this can be easily done.

It may seem that PSGs are more than we need. However, sometimes a dependence graph may not be enough. Consider the case where a channel consists of two subchannels. The channel is operational if any of the subchannels is operational. This is difficult to model using a dependence graph but it is easy to model using a PSG. That is why we chose PSGs as a general representation model which could represent structure, faults, and alarms. In the rest of the paper we assume that structure is described either by a PSG or a dependence graph which in turn is described by a PSG.
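The projection of a fault (or of an alarm) onto the independent devices can be sketched as a simple graph traversal. The Python fragment below is illustrative only: the adjacency-list encoding, the function names `project` and `project_alarm`, and the expansion of each link into its two end nodes and the channel between them are assumptions, chosen to match the leaf nodes named in the three node broadcasting example. The graph is assumed to be acyclic.

```python
from typing import Dict, Iterable, List, Set

# Arcs point from a system to the things its operation depends on.
# The link expansions (node, channel, node) are an assumption consistent
# with the leaf nodes S, S-R1, R1, R1-R2, R2 named in the example.
DEPENDS_ON: Dict[str, List[str]] = {
    "B": ["P1", "P2"],
    "P1": ["L(S-R1)"],
    "P2": ["L(S-R1)", "L(R1-R2)"],
    "L(S-R1)": ["S", "S-R1", "R1"],
    "L(R1-R2)": ["R1", "R1-R2", "R2"],
}


def project(device: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Follow the dependence graph down from `device` to its leaves.

    Leaves (nodes with no outgoing arcs) are the independent devices.
    """
    children = graph.get(device, [])
    if not children:
        return {device}
    leaves: Set[str] = set()
    for child in children:
        leaves |= project(child, graph)
    return leaves


def project_alarm(devices: Iterable[str], graph: Dict[str, List[str]]) -> Set[str]:
    """Project an alarm that names several devices onto independent devices."""
    leaves: Set[str] = set()
    for device in devices:
        leaves |= project(device, graph)
    return leaves


if __name__ == "__main__":
    p1 = project("P1", DEPENDS_ON)
    p2 = project("P2", DEPENDS_ON)
    # The two paths share leaves, which is why their failures are not independent.
    print(sorted(p1 & p2))
```

Because every leaf under P1 is also a leaf under P2 in this encoding, any independent fault that breaks P1 also breaks P2, which matches the conditional probability stated in the example.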

III. NATURE OF THE PROBLEM

In this section we focus on the nature of the problem, and attempt to explain why fault identification and alarm correlation in communication systems are difficult problems. Communication networks consist of devices independently manufactured by various vendors. While the internal implementation of many devices varies from vendor to vendor, the interface of each device with the rest of the network has been standardized to conform to widely accepted standards (e.g., SNA, OSI, etc.).

Each of these devices has been independently designed. The designer of a communication system device has to ensure that both his device and its perceived interface, i.e., the rest of the network projected into the device's observation space, are working correctly. Naturally, the design process includes designing alarms for the various fault conditions that the device may encounter when in operation. Thus, the designer needs to provide two kinds of alarms:

• Alarms for faults that exist within the device; and,
• Alarms for faults that appear in the interface with which the device has to conform.

A particular fault in a device may disrupt its operation as well as its behavior towards other devices. This fault may cause many devices to emit alarms indicating problems with their interfaces. In many cases, a network operator may be overwhelmed with different alarms all due to the same network problem. Even though it may appear that more information helps to diagnose the problem, in fact the opposite is too often true. Usually alarm messages do not explicitly carry the information needed to diagnose a fault. Alarms describe in detail the faulty condition, i.e., the symptom of the fault, but do not, in most cases, describe the cause of the fault.

IV. INFORMATION CARRIED BY THE ALARMS

In order to design a fault identification system we need to study the information carried by the alarms. The ideal alarm is presented in [5]. There, the SNA alert structure is introduced. SNA alerts try to answer the following questions about any fault: Who, What, Where, When, and Why. However, because of lack of knowledge sometimes Where and Why are not provided.

Who: The name of the system reporting, or if different the name of the system experiencing the fault.
What: The condition of the fault, i.e., the symptom of the fault.
Where: A description of the position in the network where the problem occurred.
When: A description of the time at which the problem was detected.
Why: The cause of the problem, i.e., the nature of the fault.

If this information were available, then fault identification and alarm correlation would be very easy. In order to identify the fault that triggered an alarm, we would have to check the Where and Why fields of the alarm. In order to correlate alarms, we could simply notice that alarms with the same Why, Where, and similar When fields indicate the same problem, and thus should be correlated.

However, the information described above is usually not provided by the alarms of communication system devices, because each device of a communication network has only limited information about the rest of the system. Most alarms emitted by network devices report the device that is experiencing the fault, the symptoms of the fault, and the time of the detection of the fault, i.e., answer the questions Who, What, and When. However, for the purpose of fault identification and alarm correlation we need more. We need answers to the questions Where and Why, which are usually not provided.

The information added by the Where and Why fields gives us a better ability to distinguish faults and to correlate the alarms attributed to those faults. The trade-off between the amount of the available information and the fault distinguishing ability (alarm correlation ability) is made clear in the following two cases:

• If we make the simplifying assumption that two faults can not happen at the same location at the same time, then, for the purpose of fault identification, we only need the answer to the question Where.
• If we make the simplifying assumption that the whole network (or the portion of the network we are monitoring) experiences only a single fault at a time and alarms appear immediately after the fault, then we could identify faults and correlate alarms using only timing information.

V. PROPOSED APPROACH

It is clear, from the discussion above, that alarms that attribute the cause of the fault to the same network device and have similar When fields should be correlated. The problem is that most alarms do not give explicit fault-localization⁵ information. Thus, we propose to associate explicit fault-localization information with each alarm.

⁵ The term fault-localization is used here to indicate the location and the nature of the fault. As was shown in section 2, the location and the nature of the fault can be represented in the same framework, thus we do not need to distinguish between them.

Most alarms do not include fault-localization information because the location of the fault is not known precisely by the device emitting the alarm. However, almost every alarm contains implicit fault-localization information. For example, consider the alarm LOS-DTE emitted by a Channel Service Unit (CSU), a device used on T-1 lines for signal regeneration and alarm monitoring. This alarm indicates a loss of the signal from the data-terminal-equipment side of the CSU, and not from the side connected to the T-1 transmission line. This alarm, even if it does not indicate a specific location for the fault, restricts the possible locations of the fault to a smaller part of the network. In similar fashion, bipolar violation, or BPV (an alarm emitted by a CSU), is an alarm that indicates a problem with the transmission line and not with the CSU. This alarm restricts the possible


locations of the fault to locations outside the CSU.

Thus, each alarm can be associated with a set of possible locations representing the locations of the fault. Note that we propose to associate each alarm with all the possible locations of the fault and not only with the most probable ones. In the case that alarms are reliable and there is only a single fault in the network, then fault localization is straightforward: The fault lies in the intersection of the sets of locations indicated by each alarm. Thus, intuitively, alarms that share a common intersection should be correlated.

In order to formally define the algorithm for alarm correlation (and fault localization) we introduce the term incident. An incident is a set of correlated alarms. Ideally, an incident should contain all the alarms attributed to the same fault.⁷ Any incoming new alarm can cause the creation of a new incident, the association of this alarm with an existing incident, or a totally new reorganization of incidents. This is so because a new alarm may make a new hypothesis of faults more probable.

⁷ There is an implicit assumption here that the alarms are received at approximately the same time.

Incidents may be opened and closed. While the creation of an incident is triggered by the arrival of an alarm, the deletion of an incident is not straightforward. An incident should be deleted (closed) when the faulty condition which caused the generation of the alarms correlated by the incident ceases to exist. This raises an interesting question: How can we verify that a fault that has triggered a set of alarms is resolved? We believe that, in an integrated fault management environment, testing is the most appropriate way to establish that a problem is resolved. Since testing is beyond the scope of this paper, we can assume that closing an incident is beyond the scope of this work. For simplicity, we may assign the responsibility of closing an incident to the network operator. Sometimes a simple timer may be used to close an incident. This is possible if the delay of all the alarms generated due to a failure of a component is much less than the time between failures of that component.

We have to note that we are only considering independent faults. Given the dependence graph that represents the structure of the system (possibly described by a PSG), we assume that the only devices that may fail are the ones that are terminal symbols in the PSG, or leaf nodes in the dependence graph. Alarms appear at any level in the dependence graph but these alarms can be projected to the independent devices by following down the dependence graph. Note that usually we do not have to reach the leaves of the graph in order to reach devices that are mutually independent, and this can speed up the algorithm proposed later in the paper. However, for simplicity and consistency we assume that all alarms are projected to the leaves of the graph as shown in figure 2.

The algorithm for alarm correlation can be formally defined as follows:

A. Positive Information Algorithm (PIA, Part I)

Input: One new incoming alarm, a set of incidents and their associated alarms, and a description of the network topology (the dependence graph defined by a PSG).

Output: A new set of incidents indicating the probable fault locations.

Method:

Step 1 Given a new alarm and a description of the network topology, identify the possible locations of the fault indicated by this alarm by localizing the alarm and projecting the alarm down the dependence graph.
Step 2 Construct the minimum number of incidents such that all the alarms associated with an incident share a common intersection, and all alarms are assigned to some incident.

Step 2 of the algorithm is important since incidents indicate faults. Correct association of alarms with incidents would mean correct fault localization. In step 2 we seek the minimum number of incidents (thus faults) that can explain the appearance of a set of alarms. There is an implicit assumption here: It is more likely to have few faults than many. If there is a single fault to be identified, then the number of incidents is one. All alarms should have a common intersection since they are attributed to the same fault. Step 2 of the algorithm will localize the fault to the part of the network defined by the intersection of all the alarms.

A problem arises when there is more than one fault that needs to be identified. In this case, step 2 of the algorithm may have more than one solution. Given a set of alarms, there may be more than one way of constructing the incidents, i.e., more than one way of proposing possible hypotheses of faults. One example highlighting this case is when, given three alarms, we get three intersections of their fault localization areas. Any combination of two intersections would be enough to define the incidents. The choice of the two best is undetermined at this point.

One way of resolving this problem is to associate a-priori probabilities of failure with each network component. We would then choose the incidents that contain components that are more likely to fail.

Instead of associating a probability of failure with each component, we can associate an information cost, the negative of the logarithm of the probability of failure. Even though information costs are equivalent to probabilities for independent faults,⁸ working with information costs instead of probabilities has certain advantages. One of these advantages is the intuitively justified additivity property: the probability that two independent faults $(f_1, f_2)$ appear in the network is $p_f = p_{f_1} \cdot p_{f_2}$, while the corresponding information cost is $I(p_f) = I(p_{f_1}) + I(p_{f_2})$.

⁸ As described elsewhere in the paper we only consider independent sources of fault.

Using the above method we can refine the second step of the algorithm as follows:

• Associate three pieces of information with each incident:
  - The fault, i.e., a single component of the network that may have caused the alarms;
  - The information cost associated with the component that may have caused the alarms (corresponding to the a-priori knowledge about the fault); and,
  - All the alarms that contain this fault in their fault localization field.
• Find a set of incidents such that the sum of the information costs of the incidents contained within this set is minimized, and all alarms are associated with some incident.

This method produces the most probable faults, that is, the faults that best explain the observed alarms.

It can be proven (by transforming this problem to the problem of Hitting Set [4]) that the problem of finding the set of incidents that minimizes the information cost is NP-Hard. Below, we formally restate the problem and give a heuristic algorithm that approximates the optimal solution (and in many cases finds it).

B. Positive Information Algorithm (PIA, Part II)

Input: A finite set of possible fault locations $F$,⁹ an information cost $I(f) \in Z^+$ for each $f \in F$, and a set $A$ of subsets of the set $F$. $A$ contains elements $A_j \subseteq F$, $j = 1, 2, \ldots, n$, where $n$ is the number of observed alarms. Each $A_j$ is a set of possible fault locations as described by the fault localization field (see below) of the $j$th alarm.

⁹ The terminal symbols of the grammar representing the faults.

Output: A subset $\hat{F} \subseteq F$ such that $I(\hat{F}) = \min_{G \in \mathcal{G}} \sum_{g \in G} I(g)$. $G$ is in $\mathcal{G}$ if and only if $G \subseteq F$ and $G \cap A_j \neq \emptyset$ for all $A_j \in A$. That is, $\hat{F}$ is the set $G$ which minimizes the information cost and explains all the alarms.

Method:

Step 1 $\hat{F} \leftarrow \emptyset$.
Step 2 Assign in $\hat{F}$ the element $f$ of $F$ that is contained in the most sets $A_j$, where $A_j \in A$. If there is more than one such element $f$, choose the one with the least cost.
Step 3 Delete all elements of $A$ that contain $f$. $A$ is thus reduced. If $A$ is empty, output $\hat{F}$, otherwise go to step 2.

As can be verified, the heuristic method described above can be implemented in polynomial time. As long as the information costs associated with the elements of the set $F$ do not have large variations, the method should closely approximate the desired minimum.

Even though the approach proposed by PIA is natural, simple, and intuitive, there are various issues that remain to be resolved. One important issue concerns methodologies for associating fault localization information with each alarm. Such a methodology must take into account the diverse topologies of communications networks, as well as the widely differing fault localization information provided by the texts of different alarms.

• Some alarms are sophisticated enough to pinpoint the location of the fault, e.g., faults in the cards of an IDNX¹⁰ can be detected by its peer unit.
• Some alarms can distinguish the direction of the fault, e.g., the alarm LOSS DTE in a CSU refers to the loss of the signal from the data terminal equipment side of the CSU and not from the T-1 side.
• Some alarms can only distinguish whether a fault that a device is experiencing is external or internal to that device.

¹⁰ A high performance intelligent multiplexor.

One approach is to introduce a set of primitives which describe the possible locations of the fault (internal, external, layer above, layer below, channel, peer unit, etc.), and associate each alarm with one member of this set. In this case, we must construct a function that maps each alarm to a primitive in this set. Ideally, this function should be independent of the network topology, so that changes to the communications network do not require a change in the function. An additional function may be constructed that maps the alarm-primitive pair to specific network devices.

In this paper we assume that the set of possible fault locations indicated by each alarm is contained within the alarm. For clarity, we assume that each alarm has a fault localization field that contains this information. We do not go into implementation details on how this is done, but this is one of the implementation issues that must be considered in any real system.

VI. GENERAL FRAMEWORK USING ALARM VOLATILITY

Even though PIA performs alarm correlation, it may occasionally fail to identify the correct fault. The reason is that the algorithm does not take into account the absence of alarms, and does not provide for unreliable alarms. The absence of alarms may sometimes be a valuable piece of information. An example may help in demonstrating this point.

Assume that we have five pieces of equipment connected in series: A is connected to B, which is connected to a T-1 line, which is connected to C, which is connected to D. Thus, the topology is as follows: A - B - (T-1) - C - D. Assume that we get an alarm from A stating that either B, or (T-1), or C, or D, or the line between A and B, or the line between C and D is at fault. We then get an alarm from D stating that either A, or B, or (T-1), or C, or the line between A and B, or the line between C and D is at fault. These alarms share a common intersection, thus should be correlated. Line T-1 will be pointed to by PIA as the most probable cause of these alarms. Communication lines are presumed more likely to be at fault than any other device. While these alarms are correlated correctly, the identification of the fault may not be correct. Absence of alarms from B and C indicates that the problem is not with the T-1 line. It is, most probably, associated with a fault in the line that connects A to B,
or a fault in the line that connects C to D. This example highlights the fact that the absence of alarms is important information that should be used in the fault identification process.

The other case ignored by PIA is the existence of unreliable alarms. Sometimes alarms are emitted because a threshold was exceeded, or some other transient condition happened, and the alarm is not an indication of a permanent fault. Thus, alarms can be unreliable from the fault identification point of view, since an investigation of a nonexistent fault may be triggered.

In this section, we introduce a general framework that accounts for both conditions ignored by the Positive Information Algorithm, i.e., the information carried by the absence of alarms, and the existence of unreliable alarms.

It would be useful at this point to contrast our work with the approach presented in [6,7]. Even though the authors present there a probabilistic theory of diagnosis given a set of malfunctions (manifestations), they do not address the problem of unreliable manifestations (alarms in our case). One of their assumptions is the Mandatory Causation Assumption, which states that no effect can occur without being caused by some disorder. Our approach is more general and more suited to fault identification in our particular area of application, namely communication networks.

Since alarms are now considered unreliable, the problem becomes a different one. We no longer want to find only the faults, but we want to jointly estimate the faults and the actual alarms. Thus, we would like to estimate the sentence $S(A, F)$; here $F$ represents the faults and $A$ the alarms. This is the sentence that describes the faults and the actual alarms. The sentence $S(A, F)$ can be expanded to $S(F)$, a sentence that describes the faults, and to $S(A \mid F)$, a sentence that describes the alarms given the faults:

$$S(A, F) \rightarrow S(A \mid F) * S(F) \qquad (1)$$

This rule actually represents a family of rules that differ only in $F$. This gives rise to an estimation problem where the order of the model, or structural complexity (number of faults in our case), is weighed against the fit to the data (the observed alarms in our case). An excellent study on this topic can be found in [3]. We will modify the framework proposed in [3] to the particular needs of our problem.

The sentence that describes the faults can be expanded to any number of faults that are chosen from a finite set of faults:

$$S(F) \rightarrow \mathrm{fault}^{*} \qquad (2)$$

$$\mathrm{fault} \rightarrow f_1 \mid f_2 \mid \cdots \mid f_n \qquad (3)$$

Here $f_1, f_2, \ldots, f_n$ are the terminal symbols of the grammar that represents the structure and the faults. A fault should appear at most once in the sentence $S(F)$.

Next we need to expand the sentence that represents the alarms observed given the faults:

$$S(A \mid F) \rightarrow S(A_e \mid F) * S(\mathrm{DELETE} \mid A_e) * S(\mathrm{ADD}) \qquad (4)$$

$$A = \left( A_e - S(\mathrm{DELETE} \mid A_e) \right) \cup S(\mathrm{ADD}) \qquad (5)$$

This sentence is expanded to a sentence that represents the a-priori knowledge of the alarms given the faults (the expected alarms $A_e$), a sentence that corresponds to addition of alarms, and a sentence that corresponds to deletion of alarms. This sentence essentially represents the correct alarms and the noise. The last condition (Equation (5)) needs to be added for correctness. It states that the observed alarms should be the ones expected minus the ones deleted plus the ones added.

It may seem that the sentence $S(A_e \mid F)$ carries a-priori knowledge that is difficult to get in a real network. Indeed, it may be difficult to find all the alarms a set of faults can produce. These alarms may depend on the network topology, the time the fault occurred, even the state of the network at the time of the fault. In order to avoid this problem we can use information abstraction. We can represent the alarms given a fault to any degree of detail we want. We may only represent the source of the alarm (the device that emitted the alarm), or we may represent both the source and the information contained within the alarm, or we may even choose to expand $S(A_e \mid F)$ to any set of alarms, indicating no a-priori knowledge about the alarms given the faults.

Variable degrees of information abstraction can be included in the same framework. We do not have to use the same degree of detail of the alarms given the faults for all faults. For some faults it may be easy to find the alarms associated with them, and for others it may be very difficult. The framework proposed so far can include both of these cases.

The sentence $S(\mathrm{ADD})$ can be expanded to any number of alarms among the alarms that can be generated, and the sentence $S(\mathrm{DELETE} \mid A_e)$ can be expanded to any number of alarms among the alarms that are included in $A_e$. Thus we can write:

$$S(\mathrm{ADD}) \rightarrow \mathrm{Alarm}^{*} \qquad (6)$$

$$\mathrm{Alarm} \rightarrow A_1 \mid A_2 \mid \cdots \mid A_n \qquad (7)$$

$$S(\mathrm{DELETE} \mid A_e) \rightarrow \mathrm{Alarm}^{*} \qquad (8)$$

$$\mathrm{Alarm} \rightarrow A_e^{1} \mid A_e^{2} \mid \cdots \mid A_e^{m} \qquad (9)$$

Here, $A_e^{1}, A_e^{2}, \ldots, A_e^{m}$ are alarms that belong to set $A_e$.

One important characteristic of the class of problems we are considering is the multiplicity of solutions. Usually, alarms do not describe in detail the fault that caused their generation. A single alarm may be due to many faults; a set of alarms could be due to a number of possible faults. The multiplicity of solutions also becomes apparent when we examine the case of multiple faults. Usually, there are many combinations of faults that could have caused the observed alarms; finding all of them could be very difficult and non-informative. It is clear that we need a way to choose the best solution among a set of possible ones.

One way to do this is to use weights. Following the proposed framework, solutions (or faults) should be weighted using two kinds of weights:
• A weight based on the a-priori knowledge about the probability that a specific set of faults appeared; and,
• A weight based on the knowledge about the probability that the set of faults proposed has caused the set of observed symptoms (alarms).

Thus, in estimating the sentence that jointly describes the alarms and the faults we need a way of weighting faults and alarms. The minimum information framework introduced by Hart [3] is applicable here. In [3], information costs are associated with the sentences of phrase structured grammars. The sentence which describes the observation and the system, and is associated with the minimum information, is proposed as the best estimate of the system.

In [3], two expansion rules are presented for associating information costs to phrase structured grammars:

• If $U$ is expanded as $U \rightarrow U_1 * U_2 * \cdots * U_n$, then $I(U) = I(U_1) + I(U_2) + \cdots + I(U_n)$.
• If $U$ is expanded as $U \rightarrow U_1 \mid U_2 \mid \cdots \mid U_n$, then $I(U) = I(U_j) + I_j$ when $U$ is rewritten as $U_j$. This is equal to the information cost associated with $U_j$, plus $I_j$. $I_j$ is the information associated with the choice of $U_j$ among the various possibilities. For our purposes $I_j = -\log p_j$, where $p_j$ is the probability of choosing $U_j$ among the various other possibilities. In case we cannot associate a probability measure to the various choices, then $I_j = \log n$. This way we give equal probability to any choice. A thorough investigation of the various ways of assigning information measures to phrase structured grammars is given in [3].

Now consider a specific sentence that describes a number of faults, where $F$ is given by:

$$S(F) \Rightarrow f_1 * f_2 * \cdots * f_k \qquad (10)$$

where the symbol $\Rightarrow$ is used to indicate an actual derivation using the production rules. The expansion rules described above lead to the equation:

$$I(S(F)) = \sum_{i=1}^{k} \left( I_{f_i} + I(f_i) \right) \qquad (11)$$

Here $I_{f_i} = -\log p_{f_i}$, and $p_{f_i}$ is the probability that the fault $f_i$ appears. The information cost associated with $f_i$ is $I(f_i)$. Following [3], we will henceforth assume, without loss of generality, that $I(f_i)$ is zero.

In order to associate information with $S(A \mid F)$, the first expansion rule may be used to expand $I(S(A \mid F))$ as follows:

$$I(S(A \mid F)) = I(S(A_e \mid F)) + I(S(\mathrm{ADD})) + I(S(\mathrm{DELETE} \mid A_e)) \qquad (12)$$

Given $F$, the information cost $I(S(A_e \mid F))$ is constant, since we can uniquely describe the alarms given the faults (a-priori knowledge). The information associated with the random addition of alarms is similar in structure to the information associated with the faults. If:

$$S(\mathrm{ADD}) \Rightarrow A_1 * A_2 * \cdots * A_m \qquad (13)$$

then the expansion rule leads to:

$$I(S(\mathrm{ADD})) = \sum_{i=1}^{m} \left( I_{A_i} + I(A_i) \right) \qquad (14)$$

Here $I_{A_i} = -\log p_{A_i}$, and $p_{A_i}$ is the probability that alarm $A_i$ was accidentally emitted. $I(A_i)$ is the information associated with alarm $A_i$. Since alarm $A_i$ is not further expanded, this information cost is zero [3]. The sentence $S(\mathrm{DELETE} \mid A_e)$, which describes the accidental deletion of alarms, can be expanded in similar fashion.

Once we have defined the information costs for the various parts of equation (1), the problem of fault identification can be defined as an optimization problem. The objective is to find a set of faults, as a function of the observed alarms, such that the information cost associated with the joint description of the alarms and the faults is minimized, i.e.,

$$\min_{F \subseteq \mathcal{F}} I(S(A, F)) = \min_{F \subseteq \mathcal{F}} \left( I(S(A \mid F)) + I(S(F)) \right) \qquad (15)$$

where $\mathcal{F}$ is the set of all potential faults in the network. Equation (15) is a general description of the family of problems presented so far. For example, appropriate cost assignments could transform the equation into the problem addressed by the Positive Information Algorithm described in Section V, provided:

• Given a fault $f$, $S(A_e \mid f)$ contains only alarms which have $f$ in their fault localization field. We may assume $I(S(A_e \mid f))$ is zero.
• The alarms are reliable. Therefore the probability of alarms being accidentally deleted or emitted is zero.
• The information associated with the term $S(\mathrm{ADD})$ and the term $S(\mathrm{DELETE} \mid A_e)$ is infinite. Consequently, the minimization problem in (15) is restricted to fault sets $F$ where the expected set of alarms from the faults in $F$ is given by $A$, the observed set of alarms.

Equation (15), in general, presents a hard discrete optimization problem. Only heuristic algorithms can be considered, since the problem can be proven to be NP-Hard. The general idea of any heuristic algorithm that is trying to find a solution for equation (15) is that the algorithm has to search the space of the candidate faults and the space of the possible alarms in order to locate the ones that minimize equation (15). The best and the fastest heuristic algorithms can be proposed only if we take into account the particular cost function assigned to the faults and the alarms of the network we are managing.

We present here a general algorithm that should be considered more as a guideline for designing an algorithm to suit a particular network.
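As a rough illustration of how Equations (5), (12), and (15) fit together, the following Python sketch scores a candidate fault set by adding the fault costs to the costs of the alarms that must be added or deleted to turn the expected alarms into the observed ones. It is a sketch under stated assumptions, not the authors' implementation: the function names, the base-2 logarithm, the exhaustive search, and all numeric costs for the chain A - B - (T-1) - C - D are assumed for illustration, and $I(S(A_e \mid F))$ is taken to be zero.

```python
import math
from itertools import combinations


def info(p):
    """Information cost -log2(p) of an event with probability p (base 2 is an assumption)."""
    return -math.log2(p)


def joint_cost(faults, observed, fault_cost, expected, add_cost, delete_cost):
    """Cost of jointly describing faults and alarms, with I(S(A_e|F)) assumed zero."""
    expected_alarms = set().union(*(expected[f] for f in faults))
    added = observed - expected_alarms      # observed but not expected (accidental emission)
    deleted = expected_alarms - observed    # expected but not observed (accidental deletion)
    return (sum(fault_cost[f] for f in faults)
            + sum(add_cost[a] for a in added)
            + sum(delete_cost[a] for a in deleted))


def best_fault_set(observed, fault_cost, expected, add_cost, delete_cost):
    """Exhaustive minimisation over all fault subsets; exponential, tiny inputs only."""
    best = None
    names = sorted(fault_cost)
    for k in range(len(names) + 1):
        for combo in combinations(names, k):
            cost = joint_cost(set(combo), observed, fault_cost,
                              expected, add_cost, delete_cost)
            if best is None or cost < best[0]:
                best = (cost, set(combo))
    return best


# Assumed illustrative data for the chain A - B - (T-1) - C - D.
fault_cost = {"A": 4, "B": 4, "C": 4, "D": 4, "T1": 1, "lAB": 2, "lCD": 3}
expected = {
    "A": {"aA"}, "B": {"aB"}, "C": {"aC"}, "D": {"aD"},
    "T1": {"aA", "aB", "aC", "aD"},
    "lAB": {"aA", "aD"},
    "lCD": {"aA", "aD"},
}
alarms = {"aA", "aB", "aC", "aD"}
add_cost = {a: 3 for a in alarms}       # assumed cost of an accidental alarm
delete_cost = {a: 3 for a in alarms}    # assumed cost of a missing alarm
observed = {"aA", "aD"}

t1_cost = joint_cost({"T1"}, observed, fault_cost, expected, add_cost, delete_cost)
best_cost, best_set = best_fault_set(observed, fault_cost, expected, add_cost, delete_cost)
```

With these assumed values, a T-1 fault alone scores 7, because two expected alarms must be deleted, while a fault in the A-B line scores 2, so the exhaustive search returns the latter. The search enumerates every subset of faults, which is why heuristics are needed for networks of realistic size.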
A. Positive and Negative Information Algorithm (PNIA, Part I)

Input: A communication network and a set of alarms. The communication network has the following a-priori information associated with it:

• The set of possible faults in the network (a representation of the network),
• The set of possible alarms,
• Fault localization information associated with each of the alarms,
• Information about the possible alarms associated with each fault,
• Information cost associated with each fault, and
• Information cost associated with accidental addition and deletion of any alarm.

Output: The set of incidents indicating the most probable faults and the most probable alarms in the network.

Method: This method constructs a tree; each node of the tree represents a possible solution. Specifically, each node in the tree is represented by a pair, $(F, A)$. Here $F$ represents some set of faults and $A \subseteq A_o$, where $A_o$ is the set of observed alarms. The space of all such possible points can be rather large. This method finds the best solution among the nodes examined.

Assume that any observable alarm can be explained by at least one fault. The root node of the tree is given by $(\emptyset, A_o)$. A leaf node in the tree is given by $(F_l, \emptyset)$. Let $(F_p, A_p)$ represent a node on the tree, where $A_p \neq \emptyset$. Child nodes of this node are constructed as follows:

Step 1 Find the faults (i.e., terminal symbols of the grammar representing the structure) that are contained in the fault localization field of the largest number of alarms in the set $A_p$.
Step 2 For each fault $f$ identified in step 1, create a child node of $(F_p, A_p)$. The child node consists of the point $(F_c, A_c)$ where $F_c = F_p \cup \{f\}$ and $A_c$ is obtained by deleting from $A_p$ all the alarms that contain $f$ in their fault localization fields. Naturally, $A_c \subset A_p$ and $F_p \subset F_c$.

Each node in the tree represents a possible combination of faults which could have produced the observed alarms $A_o$. A node $(F_p, A_p)$ in this tree has an associated information cost given by:

$$I(F_p, A_p) = I(S(F_p)) + I(S(A_e \mid F_p)) + I(S(\mathrm{ADD})) + I(S(\mathrm{DELETE} \mid A_e)) \qquad (16)$$

Here $A_e$ represents the set of expected alarms given the fault set $F_p$. The most probable set of faults corresponds to the node with the minimum information cost associated with it. In practice this node is typically a leaf node. Note that at a leaf node $S(\mathrm{ADD}) = \emptyset$.

Although equation (15) is NP-Hard, the tree constructed in PNIA Part I is usually quite manageable. Suppose $(F_p, A_p)$ is a node in the tree and $f \in F_c$. The condition that $f$ be contained in the fault localization field of some alarm in $A_p$ markedly reduces the search space. An alarm typically has only a few faults in its fault localization field. Furthermore, the depth of the tree is no greater than the cardinality of $A_o$. In practice it is often not necessary to expand the entire tree, as low cost paths often glaringly suggest themselves. We formalize our approach of looking for minimal cost nodes with the:

B. Positive and Negative Information Algorithm (PNIA, Part II)

Method:

Step 1: Given a communication network and a set of alarms, apply PNIA Part I, and construct the tree representing the possible combinations of faults.
Step 2: Find the node (does not have to be a leaf) with the minimum information cost.
Step 3: Output the incidents that correspond to faults in the path from the root to the node with the minimum information cost.

One of the steps of this algorithm requires the information associated with the observed alarms given the faults. This is not difficult to estimate since we know $S(A_e \mid F)$ and the observed alarms. We have to find the additions and deletions of alarms with the minimum information such that the expected alarms (given by $S(A_e \mid F)$) are transformed to observed alarms. In order to do that we need to make the assumption that the sequence of the observed alarms is not important. This is a natural assumption since there are random delays involved in observing the alarms. If this assumption holds, then, in order to get the set of observed alarms by transforming the set of expected alarms, we need to add and delete alarms. The set of alarms that should be added or deleted can be found by taking the difference of the two sets: the set of observed alarms minus the set of expected alarms is the set of alarms that should be added; and the set of expected alarms minus the set of observed alarms is the set that should be deleted.

VII. EXAMPLES

A. Example 1

We now introduce a simple example in order to clarify the proposed algorithm. Assume that we have a communication system that consists of six components that may fail: $f_1, \ldots, f_6$. We have assumed a single fault mode per component. The information costs associated with these faults are: $I_{f_1} = 3$, $I_{f_2} = 5$, $I_{f_3} = 7$, $I_{f_4} = 3$, $I_{f_5} = 5$, $I_{f_6} = 1$. Assume that we observe three alarms: $A_1$, $A_2$, $A_3$. The fault localization fields of those alarms are as follows: $A_1 = \{f_1, f_2, f_3\}$, $A_2 = \{f_3, f_4, f_5\}$, and $A_3 = \{f_5, f_6, f_1\}$. For simplicity we will assume that the alarms of this system are reliable. Thus, the sentence that describes the alarms given the faults is a sentence that describes the observed alarms. We also assume that the information cost associated with an alarm given a fault is infinite if the fault is not in the alarm's fault localization field, and zero otherwise.
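The tree construction of PNIA Part I and the node selection of Part II can be sketched in Python on the data of Example 1. This is a minimal sketch, not the authors' implementation: the names `children`, `node_cost`, and `pnia` are hypothetical, the whole tree is expanded depth-first rather than pruned, and the reliable-alarm assumption of this example is modeled by giving infinite cost to any node that leaves an alarm unexplained.

```python
import math

# Example 1 data: per-fault information costs and fault localization fields.
FAULT_COST = {"f1": 3, "f2": 5, "f3": 7, "f4": 3, "f5": 5, "f6": 1}
ALARMS = {
    "A1": {"f1", "f2", "f3"},
    "A2": {"f3", "f4", "f5"},
    "A3": {"f5", "f6", "f1"},
}


def children(faults, remaining):
    """PNIA Part I, Steps 1-2: expand node (F_p, A_p) into its child nodes."""
    counts = {}
    for alarm in remaining:
        for f in ALARMS[alarm]:
            counts[f] = counts.get(f, 0) + 1
    top = max(counts.values())
    # One child per fault named by the largest number of remaining alarms.
    for f in sorted(f for f, c in counts.items() if c == top):
        kept = frozenset(a for a in remaining if f not in ALARMS[a])
        yield faults | {f}, kept


def node_cost(faults, remaining):
    """Reliable alarms (assumed): a node with unexplained alarms costs infinity."""
    if remaining:
        return math.inf
    return sum(FAULT_COST[f] for f in faults)


def pnia(observed):
    """Expand the full tree and return (cost, faults) of the cheapest node (Part II)."""
    best = (math.inf, frozenset())
    stack = [(frozenset(), frozenset(observed))]
    while stack:
        faults, remaining = stack.pop()
        cost = node_cost(faults, remaining)
        if cost < best[0]:
            best = (cost, faults)
        if remaining:
            stack.extend(children(faults, remaining))
    return best


best_cost, best_faults = pnia(ALARMS)
```

Under these assumptions the search expands three internal nodes and nine leaves and returns the fault set {f1, f4} with cost 6. With unreliable alarms, `node_cost` would instead add the costs of the alarms to be added or deleted, so an internal node could also be the minimum.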
Under these assumptions, given a fault, a particular alarm will either always appear or never appear.

In applying PNIA Part I, in step 1, we find the faults that are contained in most of the alarms. These are faults $f_1$, $f_3$, and $f_5$. Each of these faults is contained in two alarms. We can expand each of the faults.

If we choose $f_1$ as the fault associated with the first potential incident, the cost of $f_1$ is 3 and alarms $A_1$ and $A_3$ are associated with this incident.

In expanding $f_1$ we delete $A_1$ and $A_3$ and repeat the first step. Now we are left only with alarm $A_2$, so we have three potential faults to expand: $f_3$, $f_4$, and $f_5$. Among them, the one with the least cost (most probable) that can explain the existence of alarm $A_2$ is $f_4$. Thus this solution gives two possible faults, $f_1$ and $f_4$.

After expanding fault $f_1$ we can expand faults $f_3$ and $f_5$. These give us two more hypotheses of possible faults. If fault $f_3$ happened, then, most probably, fault $f_6$ also happened, with total cost 8. If fault $f_5$ happened, then, most probably, fault $f_1$ happened, with total cost 8. Among the examined hypotheses of faults, the one with cost 6 (faults $f_1$ and $f_4$) is the most likely one.

From the algorithm described above we can see the way that the tree of the possible solutions is generated. The root is connected to three nodes, $f_1$, $f_3$, and $f_5$. Each of these nodes can be expanded to a set of leaves. Node $f_1$ is expanded into three leaves, $f_3$, $f_4$, and $f_5$. Each of these leaves represents two faults: fault $f_1$ and one of the faults $f_3$, $f_4$, or $f_5$. In a similar fashion node $f_3$ can be expanded into leaves $f_1$, $f_6$, and $f_5$. Finally, node $f_5$ can be expanded into leaves $f_3$, $f_2$, and $f_1$.

Since we assume that the alarms received are reliable, the information associated with the intermediate internal nodes $f_1$, $f_3$, and $f_5$ is infinite. Take the case of $f_1$; the rest of the cases are similar. Suppose $f_1$ is the only fault in the network. Alarms $A_1$ and $A_3$ contain fault $f_1$ in their fault localization fields. Alarm $A_2$ is not explained by fault $f_1$, and must be a random event. Thus, alarm $A_2$ must be deleted. But, according to our assumptions, the information cost associated with the deletion of an alarm is infinite. Thus the information cost associated with the internal node $f_1$ is infinite, and $f_1$ alone cannot explain the observed alarms.

B. Example 2

Consider a simple communications network consisting of five pieces of equipment connected in series: A is connected to B, which is connected to a T-1 line, which is connected to C, which is connected to D. Symbolically the network topology is as follows: A - B - (T-1) - C - D. Assume the T-1 line and each device in the network have only a single fault mode and that each device can emit only a single alarm. Assume also that the lines $l_{AB}$ between A and B, and $l_{CD}$ between C and D, also have a single fault mode.

The expected alarm for fault $f_A$ is $a_A$, and $I_{f_A} = 4$. The expected alarm for fault $f_B$ is $a_B$, and $I_{f_B} = 4$. The expected alarm for fault $f_C$ is $a_C$, and $I_{f_C} = 4$. The expected alarm for fault $f_D$ is $a_D$, and $I_{f_D} = 4$. The expected alarms for fault $f_{(T-1)}$ are $a_A$, $a_B$, $a_C$, and $a_D$, and $I_{f_{(T-1)}} = 1$. The expected alarms for $f_{l_{AB}}$ are $a_A$ and $a_D$. The information cost of this fault is 2. The expected alarms for $f_{l_{CD}}$ are $a_A$ and $a_D$. The information cost of this fault is 3. Assume that the cost of addition or deletion of any alarm is 3.

Finally, assume $I(S(A_e \mid F)) = 0$ for any fault set $F$. The cost of any particular fault $f$ is given by:

$$I(S(A, f)) = I_f + I(S(\mathrm{ADD})) + I(S(\mathrm{DELETE} \mid A_e))$$

Suppose that two alarms are observed: alarm $a_A$, whose fault localization field indicates that either B, or (T-1), or C, or D, or $l_{AB}$, or $l_{CD}$ is at fault; and alarm $a_D$, whose fault localization field indicates that either A, or B, or (T-1), or C, or $l_{AB}$, or $l_{CD}$ is at fault.

The T-1 line has the lowest cost associated with it and may at first glance appear to be responsible for the two alarms. However, the total cost given by (15) associated with a fault in the T-1 line is 7, since alarms $a_B$ and $a_C$ are expected but not observed. The cost of deletion of each of these alarms is 3. The cost of a fault in the line $l_{AB}$, however, is 2. It can easily be verified that this possibility is in fact the solution of minimum cost. The observed alarms are due to a fault in the connection between A and B.

Although the T-1 device is the most likely device to fail in the network, it is not always at fault. In this case the absence of alarms from B and C is valuable information that points us towards a different fault. This example points out the importance of comparing observed alarms with expected alarms. As can be verified, the low cost solution given by PIA Part II is T-1.

VIII. CONCLUSIONS

The advent of large scale heterogeneous wide area networks and the recent proliferation of local area networks make alarm correlation an important consideration for network management systems. We have presented a general framework for modeling and solving the problem of alarm correlation and fault identification in communication networks.

Instead of undertaking a rule based approach in which patterns of alarms are indications of specific faults, we associate fault localization information with each alarm. Given that each alarm contains such information, general algorithmic schemes have been developed in order to localize a fault and correlate alarms in a more efficient and extendable fashion.

The framework developed for fault localization has been extended to cover multiple faults in a more realistic environment where even the observed alarms can be unreliable. The problem then becomes a discrete optimization problem where the objective is to find the set of faults and the set of alarms that minimize a certain cost function. Since this problem is NP-Hard, heuristic algorithms have been proposed for its solution.
Although the proposed approach is a general framework for designing fault identification systems, more research is needed in many areas. The algorithms presented here rely heavily on a-priori information which is either guessed or can be experimentally gained. One key research direction could be the way to adaptively acquire the complete information by starting with an empty set, an incomplete subset, or even an incorrect set of information.

REFERENCES

[1] A. Bouloutas, Modeling Fault Management in Communication Networks, Ph.D. dissertation, Columbia University, May 1990.
[2] J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.
[3] George W. Hart, Minimum Information Estimation of Structure, Ph.D. dissertation, MIT, Laboratory for Information and Decision Systems, LIDS-TH-1664, April 1987.
[4] Michael R. Garey and David S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Co., 1979.
[5] Robert L. Nielsen, SNA/Management Services Alert Implementation Guide, Document Number GC31-6809-00, September 1988, IBM Communications Products Division, Research Triangle Park, NC.
[6] Yun Peng and James A. Reggia, A Probabilistic Causal Model for Diagnostic Problem Solving-Part I: Integrating Symbolic Causal Inference with Numeric Probabilistic Inference, IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-17, No. 2, March/April 1987.
[7] Yun Peng and James A. Reggia, A Probabilistic Causal Model for Diagnostic Problem Solving-Part II: Diagnostic Strategy, IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-17, No. 3, May/June 1987.

Anastasios Bouloutas was born in Lamia, Greece in 1959. He received his Ph.D. in 1989 from the Department of Electrical Engineering of Columbia University. For the year 1990 he worked as a Postdoctoral Fellow at IBM Research. From 1991 to Sept. 1992 he worked as a research scientist at the National Technical University of Athens, Greece, and he participated in RACE and ESPRIT projects. Since Sept. 1992 he has been a Research Staff Member at IBM Research. His research interests include systems and network management.

Dr. Seraphin B. Calo received his Ph.D. degree in Electrical Engineering from Princeton University in 1976. Since 1977 he has been a Research Staff Member in Computer Sciences at the IBM T. J. Watson Research Center, Yorktown Heights, New York, and has worked and published in the areas of queueing theory, data communication networks, multi-access protocols, expert systems, and complex systems management. He has managed research projects in the communications and systems performance areas and has served on the staff of the IBM Research VP, Systems. He joined the Systems Analysis Department in 1987, and is currently Manager of Systems Applications. This research group is involved in studies of architectural issues in the design of complex software systems, and the application of advanced technologies to systems management problems. As part of his interest in systems management research, Dr. Calo was instrumental in establishing the IEEE International Workshop on Systems Management, and served as its first Chairman. Dr. Calo holds two United States patents.

Dr. Allan Finkel is a Research Staff Member and project leader of the Systems Management project of the IBM Research Division. Dr. Finkel received a bachelor's degree from the State University of New York at Binghamton in 1977 and a Ph.D. in mathematics in 1982 from New York University, where he studied at the Courant Institute of Mathematical Sciences. During 1982-1983 he was a Member of the Institute for Advanced Study at Princeton, New Jersey. He joined the Mathematical Sciences Department of IBM Research in the fall of 1983 as a postdoctoral fellow and moved to the Computer Sciences Department in 1985. His research interests include expert systems, systems and network management, and object-oriented databases.
