Abstract- We present an approach for modeling and solving the problem of fault identification and alarm correlation in large communication networks. A single fault in a large network may result in a large number of alarms, and it is often very difficult to isolate the true cause of the fault. This appears to be one of the most important difficulties in managing faults in today's networks. The problem may become worse in the case of multiple faults. In this paper we present a general methodology for solving the alarm correlation and fault identification problem. We propose a new alarm structure, propose a general model for representing the network, and give two algorithms which can solve the alarm correlation and fault identification problem in the presence of multiple faults. These algorithms differ in the degree of accuracy achieved in identifying the fault, and in the degree of complexity required for implementation.

Keywords- network management, alarm correlation, fault identification, fault management, fault localization

I. INTRODUCTION

Communication networks have increased dramatically in size and complexity in the last few years. A typical network may consist of hundreds of nodes from various manufacturers with different traffic and bandwidth requirements. The increasing complexity poses serious questions of network management and control. One aspect of network management is fault management, and a central component of fault management is fault identification. Even though failures in large communication networks are unavoidable, quick detection and identification of the cause of failure can make communication systems more robust, and their operation more reliable, ultimately increasing the level of confidence in the services they provide.

The process of fault identification in large communication systems can be divided into three steps:

- The first step is fault detection. Fault detection can be thought of as an on-line indication that some component of the network is malfunctioning. Usually, communication network devices provide indications (in the form of alarms) when they detect a malfunction. Thus, fault detection can be accomplished using the alarms provided by the network devices.
- The second step is fault localization, i.e., the analysis of the alarms from the devices of the network in order to propose possible hypotheses of faults. This step is necessary since, in most cases, alarms do not provide explicit or detailed identification of the malfunctioning device.
- The third step is fault identification. The actual fault is isolated given a number of possible hypotheses of faults. Testing may be the most appropriate way of isolating the fault at this stage.

This paper concentrates on the second step of the above mentioned process. A framework is developed for analyzing the information within the alarms emitted by communication network devices in order to propose possible hypotheses of faults. The proposed framework can also be used to perform alarm correlation, i.e., to reduce the number of alarms and messages presented to the network operators. This appears to be an important problem in the operation of today's networks; the network operators are overwhelmed with messages and the fault localization task is made very difficult. Too much information has the same effect as too little information: fault identification is made more complex.

In section 2 we propose a representation for communication systems for the purpose of fault identification, in section 3 we present the nature of the problem, in section 4 we study the information carried by the alarms, in section 5 we propose algorithms and methods for solving the alarm correlation problem, in section 6 we give a general framework for identifying multiple faults, and in section 7 we present two examples.

II. SYSTEM REPRESENTATION

Since we are interested in identifying faults in communication systems, we need to represent faults. Before representing faults we need to define what we mean by the term fault. Here, fault identification means identifying the device that is at fault and the nature of the fault.

If we assume that each device has a single fault mode (can only fail in one way), then fault identification means the identification of the device at fault. If a device can fail in more than one way, then fault identification means the identification of the device at fault and the nature of the fault.

From this discussion it is clear that the representation of the faults of any system is closely related to the representation of the structure of the system. In the rest of this section we focus on the representation of the structure of communication systems, i.e., the representation of the faults for a single fault mode per device. This representation is then extended to include the nature of the fault, i.e., the same framework can be used to represent the faults for the case of multiple fault modes per device.

The authors are with the IBM Research Division, T.J. Watson Research Center, Yorktown Heights, N.Y. 10598.
IEEE Log Number 9400935.
0090-6778/94$04.00 © 1994 IEEE
IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 42, NO. 2/3/4, FEBRUARY/MARCH/APRIL 1994
BOULOUTAS et al.: ALARM CORRELATION AND FAULT IDENTIFICATION IN COMMUNICATION NETWORKS

III. NATURE OF THE PROBLEM

In this section we focus on the nature of the problem, and attempt to explain why fault identification and alarm correlation in communication systems are difficult problems. Communication networks consist of devices independently manufactured by various vendors. While the internal implementation of many devices varies from vendor to vendor, the interface of each device with the rest of the network has been standardized to conform to widely accepted standards (e.g., SNA, OSI, etc.).

Each of these devices has been independently designed. The designer of a communication system device has to ensure that both the device and its perceived interface, i.e., the rest of the network projected into the device's observation space, are working correctly. Naturally, the design process includes designing alarms for the various fault conditions that the device may encounter when in operation. Thus, the designer needs to provide two kinds of alarms:

- Alarms for faults that exist within the device; and,
- Alarms for faults that appear in the interface with which the device has to conform.

A particular fault in a device may disrupt its operation as well as its behavior towards other devices. This fault may cause many devices to emit alarms indicating problems with their interfaces. In many cases, a network operator may be overwhelmed with different alarms all due to the same network problem. Even though it may appear that more information helps to diagnose the problem, in fact the opposite is too often true. Usually alarm messages do not explicitly carry the information needed to diagnose a fault. Alarms describe in detail the faulty condition, i.e., the symptom of the fault, but do not, in most cases, describe the cause of the fault.

IV. INFORMATION CARRIED BY THE ALARMS

In order to design a fault identification system we need to study the information carried by the alarms. The ideal alarm is presented in [5]. There, the SNA alert structure is introduced. SNA alerts try to answer the following questions about any fault: Who, What, Where, When, and Why. However, because of lack of knowledge, sometimes Where and Why are not provided.

Who: The name of the system reporting, or if different the name of the system experiencing the fault.
What: The condition of the fault, i.e., the symptom of the fault.
Where: A description of the position in the network where the problem occurred.
When: A description of the time at which the problem was detected.
Why: The cause of the problem, i.e., the nature of the fault.

If this information were available, then fault identification and alarm correlation would be very easy. In order to identify the fault that triggered an alarm, we would have to check the Where and Why fields of the alarm. In order to correlate alarms, we could simply notice that alarms with the same Why, Where, and similar When fields indicate the same problem, and thus should be correlated.

However, the information described above is usually not provided by the alarms of communication systems devices, because each device of a communication network has only limited information about the rest of the system. Most alarms emitted by network devices report the device that is experiencing the fault, the symptoms of the fault, and the time of the detection of the fault, i.e., they answer the questions Who, What, and When. However, for the purpose of fault identification and alarm correlation we need more. We need answers to the questions Where and Why, which are usually not provided.

The information added by the Where and Why fields gives us a better ability to distinguish faults and to correlate the alarms attributed to those faults. The trade-off between the amount of the available information and the fault distinguishing ability (alarm correlation ability) is made clear in the following two cases:

- If we make the simplifying assumption that two faults cannot happen at the same location at the same time, then, for the purpose of fault identification, we only need the answer to the question Where.
- If we make the simplifying assumption that the whole network (or the portion of the network we are monitoring) experiences only a single fault at a time and alarms appear immediately after the fault, then we could identify faults and correlate alarms using only timing information.

V. PROPOSED APPROACH

It is clear, from the discussion above, that alarms that attribute the cause of the fault to the same network device and have similar When fields should be correlated. The problem is that most alarms do not give explicit fault-localization information. Thus, we propose to associate explicit fault-localization information with each alarm.

Footnote: The term fault-localization is used here to indicate the location and the nature of the fault. As was shown in section 2, the location and the nature of the fault can be represented in the same framework, thus we do not need to distinguish between them.

Most alarms do not include fault-localization information because the location of the fault is not known precisely by the device emitting the alarm. However, almost every alarm contains implicit fault-localization information. For example, consider the alarm LOS-DTE emitted by a Channel Service Unit (CSU). This alarm indicates a loss of the signal from the data-terminal-equipment side of the CSU, and not from the side connected to the T-1 transmission line. This alarm, even if it does not indicate a specific location for the fault, restricts the possible locations of the fault to a smaller part of the network. In similar fashion, bipolar violation, or BPV (an alarm emitted by a CSU), is an alarm that indicates a problem with the transmission line and not with the CSU. This alarm restricts the possible locations of the fault to locations outside the CSU.

Footnote: A CSU is used on T-1 lines for signal regeneration and alarm monitoring.

Thus, each alarm can be associated with a set of possible locations representing the locations of the fault. Note that we propose to associate each alarm with all the possible locations of the fault and not only with the most probable ones. In the case that alarms are reliable and there is only a single fault in the network, fault localization is straightforward: the fault lies in the intersection of the sets of locations indicated by each alarm. Thus, intuitively, alarms that share a common intersection should be correlated.

In order to formally define the algorithm for alarm correlation (and fault localization) we introduce the term incident. An incident is a set of correlated alarms. Ideally, an incident should contain all the alarms attributed to the same fault. Any incoming new alarm can cause the creation of a new incident, the association of this alarm with an existing incident, or a totally new reorganization of incidents. This is so because a new alarm may make a new hypothesis of faults more probable.

Footnote: There is an implicit assumption here that the alarms are received at approximately the same time.

Incidents may be opened and closed. While the creation of an incident is triggered by the arrival of an alarm, the deletion of an incident is not straightforward. An incident should be deleted (closed) when the faulty condition which caused the generation of the alarms correlated by the incident ceases to exist. This raises an interesting question: How can we verify that a fault that has triggered a set of alarms is resolved? We believe that, in an integrated fault management environment, testing is the most appropriate way to establish that a problem is resolved. Since testing is beyond the scope of this paper, closing an incident is also beyond the scope of this work. For simplicity, we may assign the responsibility of closing an incident to the network operator. Sometimes a simple timer may be used to close an incident. This is possible if the delay of all the alarms generated due to a failure of a component is much less than the time between failures of that component.

We have to note that we are only considering independent faults. Given the dependence graph that represents the structure of the system (possibly described by a PSG), we assume that the only devices that may fail are the ones that are terminal symbols in the PSG, or leaf nodes in the dependence graph. Alarms appear at any level in the dependence graph, but these alarms can be projected to the independent devices by following the dependence graph downwards. Note that usually we do not have to reach the leaves of the graph in order to reach devices that are mutually independent, and this can speed up the algorithm proposed later in the paper. However, for simplicity and consistency we assume that all alarms are projected to the leaves of the graph as shown in figure 2.

The algorithm for alarm correlation can be formally defined as follows:

A. Positive Information Algorithm (PIA, Part I)

Input: One new incoming alarm, a set of incidents and their associated alarms, and a description of the network topology (the dependence graph defined by a PSG).

Output: A new set of incidents indicating the probable fault locations.

Method:
Step 1 Given a new alarm and a description of the network topology, identify the possible locations of the fault indicated by this alarm by localizing the alarm and projecting the alarm down the dependence graph.
Step 2 Construct the minimum number of incidents such that all the alarms associated with an incident share a common intersection, and all alarms are assigned to some incident.

Step 2 of the algorithm is important since incidents indicate faults. Correct association of alarms with incidents would mean correct fault localization. In step 2 we seek the minimum number of incidents (thus faults) that can explain the appearance of a set of alarms. There is an implicit assumption here: it is more likely to have few faults than many. If there is a single fault to be identified, then the number of incidents is one. All alarms should have a common intersection since they are attributed to the same fault. Step 2 of the algorithm will localize the fault to the part of the network defined by the intersection of all the alarms.

A problem arises when there is more than one fault that needs to be identified. In this case, step 2 of the algorithm may have more than one solution. Given a set of alarms, there may be more than one way of constructing the incidents, i.e., more than one way of proposing possible hypotheses of faults. One example highlighting this case is when, given three alarms, we get three intersections of their fault localization areas. Any combination of two intersections would be enough to define the incidents. The choice of the two best is undetermined at this point.

One way of resolving this problem is to associate a-priori probabilities of failure with each network component. We would then choose the incidents that contain components that are more likely to fail.

Instead of associating a probability of failure with each component, we can associate an information cost, the negative of the logarithm of the probability of failure. Even though information costs are equivalent to probabilities for independent faults, working with information costs instead of probabilities has certain advantages. One of these advantages is the intuitively justified additivity property: the probability that two independent faults $(f_1, f_2)$ appear in the network is $p_f = p_{f_1} \cdot p_{f_2}$, while the corresponding information cost is $I(p_f) = I(p_{f_1}) + I(p_{f_2})$.

Footnote: As described elsewhere in the paper we only consider independent sources of fault.

Using the above method we can refine the second step of the algorithm as follows:

Associate three pieces of information with each inci-
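Step 2 of PIA, Part I (grouping alarms into the fewest incidents that each share a common intersection) can be sketched as a small brute-force search. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `correlate`, the alarm names and the location names are hypothetical, fault-localization fields are assumed to be already projected to independent locations, and ties between equally small solutions are broken arbitrarily (the ambiguity the text points out for multiple faults).

```python
from itertools import combinations


def correlate(alarms):
    """Split alarms into the fewest incidents with a common intersection.

    `alarms` maps an alarm name to its set of possible fault locations.
    Brute force over candidate location sets of increasing size, so this
    is only suitable for small inputs.
    """
    locations = sorted(set().union(*alarms.values()))
    for k in range(1, len(alarms) + 1):
        for chosen in combinations(locations, k):
            # Every alarm must contain at least one chosen location.
            if all(field & set(chosen) for field in alarms.values()):
                groups = {loc: [] for loc in chosen}
                for name in sorted(alarms):
                    first = next(loc for loc in chosen if loc in alarms[name])
                    groups[first].append(name)
                # Report each incident with the intersection of its alarms.
                return [
                    (names, set.intersection(*(alarms[n] for n in names)))
                    for names in groups.values()
                ]
    return []


# Hypothetical alarms and locations, for illustration only.
alarms = {
    "a1": {"csu", "line"},
    "a2": {"line", "mux"},
    "a3": {"router"},
}
incidents = correlate(alarms)
```

With these hypothetical inputs, alarms a1 and a2 end up in one incident localized to "line", and a3 forms its own incident. The exhaustive search is exponential in the number of locations; it is chosen here only because it makes the "minimum number of incidents" requirement explicit.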
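The additivity property of information costs described above can be checked numerically. The sketch below assumes the natural logarithm (the text does not fix a base) and uses hypothetical failure probabilities; `information_cost` is an illustrative name, not one taken from the paper.

```python
import math


def information_cost(p):
    """Information cost of a fault: negative logarithm of its probability."""
    return -math.log(p)


# Hypothetical a-priori failure probabilities of two independent faults.
p_f1, p_f2 = 0.01, 0.002

# For independent faults the joint probability is the product ...
joint = p_f1 * p_f2
# ... and the information costs simply add up.
total = information_cost(p_f1) + information_cost(p_f2)
```

Because costs add while probabilities multiply, the most probable set of independent faults is the one with the smallest summed cost.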
or a fault in the line that connects C to D. This example highlights the fact that the absence of alarms is important information that should be used in the fault identification process.

The other case ignored by PIA is the existence of unreliable alarms. Sometimes alarms are emitted because a threshold was exceeded, or some other transient condition happened, and the alarm is not an indication of a permanent fault. Thus, alarms can be unreliable from the fault identification point of view, since an investigation of a non-existent fault may be triggered.

In this section, we introduce a general framework that accounts for both conditions ignored by the Positive Information Algorithm, i.e., the information carried by the absence of alarms, and the existence of unreliable alarms.

It would be useful at this point to contrast our work with the approach presented in [6,7]. Even though the authors present there a probabilistic theory of diagnosis given a set of malfunctions (manifestations), they do not address the problem of unreliable manifestations (alarms in our case). One of their assumptions is the Mandatory Causation Assumption, which states that no effect can occur without being caused by some disorder. Our approach is more general and more suited to fault identification in our particular area of application, namely communication networks.

Since alarms are now considered unreliable, the problem becomes a different one. We no longer want to find only the faults, but we want to jointly estimate the faults and the actual alarms. Thus, we would like to estimate the sentence $S(A, F)$; here $F$ represents the faults and $A$ the alarms. This is the sentence that describes the faults and the actual alarms. The sentence $S(A, F)$ can be expanded to $S(F)$, a sentence that describes the faults, and to $S(A|F)$, a sentence that describes the alarms given the faults:

$S(A, F) \rightarrow S(A|F) * S(F)$ (1)

This rule actually represents a family of rules that differ only in $F$. This gives rise to an estimation problem where the order of the model, or structural complexity (number of faults in our case), is traded off against the fit to the data (the observed alarms in our case). An excellent study on this topic can be found in [3]. We will modify the framework proposed in [3] to the particular needs of our problem.

The sentence that describes the faults can be expanded to any number of faults that are chosen from a finite set of faults:

$S(F) \rightarrow fault^*$ (2)
$fault \rightarrow f_1 \mid f_2 \mid \ldots \mid f_n$ (3)

Here $f_1, f_2, \ldots, f_n$ are the terminal symbols of the grammar that represents the structure and the faults. A fault should appear at most once in the sentence $S(F)$.

Next we need to expand the sentence that represents the alarms observed given the faults:

$S(A|F) \rightarrow S(A_e|F) * S(DELETE|A_e) * S(ADD)$ (4)
$A = (A_e - S(DELETE|A_e)) \cup S(ADD)$ (5)

The sentence $S(A|F)$ is expanded to a sentence that represents the a-priori knowledge of the alarms given the faults (the expected alarms $A_e$), a sentence that corresponds to addition of alarms, and a sentence that corresponds to deletion of alarms. This sentence essentially represents the correct alarms and the noise. The last condition (Equation (5)) needs to be added for correctness. It states that the observed alarms should be the ones expected minus the ones deleted plus the ones added.

It may seem that the sentence $S(A_e|F)$ carries a-priori knowledge that is difficult to get in a real network. Indeed it may be difficult to find all the alarms a set of faults can produce. These alarms may depend on the network topology, the time the fault occurred, even the state of the network at the time of the fault. In order to avoid this problem we can use information abstraction. We can represent the alarms given a fault to any degree of detail we want. We may only represent the source of the alarm (the device that emitted the alarm), or we may represent both the source and the information contained within the alarm, or we may even choose to expand $S(A_e|F)$ to any set of alarms, indicating no a-priori knowledge about the alarms given the faults.

Variable degrees of information abstraction can be included in the same framework. We do not have to use the same degree of detail of the alarms given the faults for all faults. For some faults it may be easy to find the alarms associated with them, and for others it may be very difficult. The framework proposed so far can include both of these cases.

The sentence $S(ADD)$ can be expanded to any number of alarms among the alarms that can be generated, and the sentence $S(DELETE|A_e)$ can be expanded to any number of alarms among the alarms that are included in $A_e$. Thus we can write:

$S(ADD) \rightarrow Alarm^*$ (6)
$Alarm \rightarrow A_1 \mid A_2 \mid \ldots \mid A_m$ (7)
$S(DELETE|A_e) \rightarrow Alarm^*$ (8)
$Alarm \rightarrow A_1^e \mid A_2^e \mid \ldots \mid A_k^e$ (9)

Here, $A_1^e, A_2^e, \ldots, A_k^e$ are alarms that belong to set $A_e$.

One important characteristic of the class of problems we are considering is the multiplicity of solutions. Usually, alarms do not describe in detail the fault that caused their generation. A single alarm may be due to many faults; a set of alarms could be due to a number of possible faults. The multiplicity of solutions also becomes apparent when we examine the case of multiple faults. Usually, there are many combinations of faults that could have caused the observed alarms; finding all of them could be very difficult and non-informative. It is clear that we need a way to choose the best solution among a set of possible ones.

One way to do this is to use weights. Following the proposed framework, solutions (or faults) should be weighted using two kinds of weights:
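Equation (5) can be illustrated with ordinary set operations: given the expected alarms for a fault hypothesis and the alarms actually observed, the added and deleted alarms follow from two set differences. The sketch below is an illustration only; the function names, the alarm names, and the assumption of one uniform information cost per added or deleted alarm are hypothetical and are not taken from the paper.

```python
def additions_and_deletions(expected, observed):
    """Return (added, deleted) alarm sets relating expected to observed."""
    added = observed - expected      # observed but not expected
    deleted = expected - observed    # expected but not observed
    # Equation (5): observed = (expected - deleted) U added
    assert observed == (expected - deleted) | added
    return added, deleted


def noise_cost(expected, observed, cost_per_change):
    """Cost of the noise, assuming a uniform cost per added/deleted alarm."""
    added, deleted = additions_and_deletions(expected, observed)
    return cost_per_change * (len(added) + len(deleted))


# Hypothetical alarms: four expected, only two of them observed.
expected = {"a1", "a2", "a3", "a4"}
observed = {"a1", "a2"}
added, deleted = additions_and_deletions(expected, observed)
cost = noise_cost(expected, observed, cost_per_change=3)
```

Here two expected alarms are missing and nothing unexpected was seen, so under the assumed uniform cost of 3 the noise contributes a cost of 6 to this hypothesis.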
A. Positive and Negative Information Algorithm (PNIA, Part I)

Input: A communication network and a set of alarms. The communication network has the following a-priori information associated with it:

- The set of possible faults in the network (a representation of the network),
- The set of possible alarms,
- Fault localization information associated with each of the alarms,
- Information about the possible alarms associated with each fault,
- Information cost associated with each fault, and
- Information cost associated with accidental addition and deletion of any alarm.

Output: The set of incidents indicating the most probable faults and the most probable alarms in the network.

Method: This method constructs a tree; each node of the tree represents a possible solution. Specifically, each node in the tree is represented by a pair $(F, A)$. Here $F$ represents some set of faults and $A \subseteq A_o$, where $A_o$ is the set of observed alarms. The space of all such possible points can be rather large. This method finds the best solution among the nodes examined.

Assume that any observable alarm can be explained by at least one fault. The root node of the tree is given by $(\emptyset, A_o)$. A leaf node in the tree is given by $(F_l, \emptyset)$. Let $(F_p, A_p)$ represent a node on the tree, where $A_p \neq \emptyset$. Child nodes of this node are constructed as follows:

Step 1 Find the faults (i.e., terminal symbols of the grammar representing the structure) that are contained in the fault localization field of the largest number of alarms in the set $A_p$.
Step 2 For each fault $f$ identified in step 1 create a child node of $(F_p, A_p)$. The child node consists of the point $(F_c, A_c)$ where $F_c = F_p \cup \{f\}$ and $A_c$ is obtained by deleting from $A_p$ all the alarms that contain $f$ in their fault localization fields. Naturally, $A_c \subset A_p$, and $F_p \subset F_c$.

Each node in the tree represents a possible combination of faults which could have produced the observed alarms $A_o$. A node $(F_p, A_p)$ in this tree has an associated information cost given by:

...

Here $A_e$ represents the set of expected alarms given the fault set $F_p$. The most probable set of faults corresponds to the node with the minimum information cost associated with it. In practice this node is typically a leaf node. Note that at a leaf node $S(ADD) = \emptyset$.

Although solving equation (15) is NP-hard, the tree constructed in PNIA Part I is usually quite manageable. Suppose $(F_p, A_p)$ is a node in the tree and $f \in F_p$. The condition that $f$ be contained in the fault localization field of some alarm in $A_o$ markedly reduces the search space. An alarm typically has only a few faults in its fault localization field. Furthermore, the depth of the tree is no greater than the cardinality of $A_o$. In practice it is often not necessary to expand the entire tree, as low cost paths often glaringly suggest themselves. We formalize our approach of looking for minimal cost nodes with the:

B. Positive and Negative Information Algorithm (PNIA, Part II)

Method:

Step 1: Given a communication network and a set of alarms, apply PNIA Part I, and construct the tree representing the possible combinations of faults.
Step 2: Find the node (does not have to be a leaf) with the minimum information cost.
Step 3: Output the incidents that correspond to faults in the path from the root to the node with the minimum information cost.

One of the steps of this algorithm requires the information associated with the observed alarms given the faults. This is not difficult to estimate since we know $S(A_e|F)$ and the observed alarms. We have to find the additions and deletions of alarms with the minimum information such that the expected alarms (given by $S(A_e|F)$) are transformed to the observed alarms. In order to do that we need to make the assumption that the sequence of the observed alarms is not important. This is a natural assumption since there are random delays involved in observing the alarms. If this assumption holds, then, in order to get the set of observed alarms by transforming the set of expected alarms, we need to add and delete alarms. The set of alarms that should be added or deleted can be found by taking the difference of the two sets: the set of observed alarms minus the set of expected alarms is the set of alarms that should be added; and the set of expected alarms minus the set of observed alarms is the set that should be deleted.

VII. EXAMPLES

A. Example 1

We now introduce a simple example in order to clarify the proposed algorithm. Assume that we have a communication system that consists of six components that may fail: $f_1, \ldots, f_6$. We have assumed a single fault mode per component. The information costs associated with these faults are: $I_{f_1} = 3$, $I_{f_2} = 5$, $I_{f_3} = 7$, $I_{f_4} = 3$, $I_{f_5} = 5$, $I_{f_6} = 1$. Assume that we observe three alarms: $A_1, A_2, A_3$. The fault localization fields of those alarms are as follows: $A_1 = \{f_1, f_2, f_3\}$, $A_2 = \{f_3, f_4, f_5\}$, and $A_3 = \{f_5, f_6, f_1\}$. For simplicity we will assume that the alarms of this system are reliable. Thus, the sentence that describes the alarms given the faults is a sentence that describes the observed alarms. We also assume that the information cost associated with an alarm given a fault is infinite if the fault is not in the alarm's fault localization field, and zero otherwise.
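The child-node construction of PNIA Part I and the minimum-cost search of Part II can be sketched on the data of Example 1 (six faults, three alarms). This is a hedged sketch, not the authors' code: the names `children`, `node_cost` and `search` are hypothetical, the whole tree is expanded rather than pruned, and the node cost encodes only the reliable-alarm assumption of this example (a node that leaves any alarm unexplained costs infinity; otherwise the cost is the sum of the fault costs).

```python
import math

# Data of Example 1: information cost per fault, localization field per alarm.
COST = {"f1": 3, "f2": 5, "f3": 7, "f4": 3, "f5": 5, "f6": 1}
ALARMS = {
    "A1": {"f1", "f2", "f3"},
    "A2": {"f3", "f4", "f5"},
    "A3": {"f5", "f6", "f1"},
}


def children(faults, remaining):
    """PNIA Part I: expand the node (faults, remaining alarms)."""
    counts = {}
    for alarm in remaining:
        for f in ALARMS[alarm]:
            counts[f] = counts.get(f, 0) + 1
    most = max(counts.values())
    # Step 1: faults contained in the largest number of remaining alarms.
    for f in sorted(f for f, c in counts.items() if c == most):
        # Step 2: add f and drop every alarm whose field contains f.
        yield faults | {f}, {a for a in remaining if f not in ALARMS[a]}


def node_cost(faults, remaining):
    """Reliable alarms assumed: unexplained alarms make the cost infinite."""
    if remaining:
        return math.inf
    return sum(COST[f] for f in faults)


def search(faults=frozenset(), remaining=frozenset(ALARMS)):
    """PNIA Part II: return (cost, faults) of the cheapest node in the tree."""
    best = (node_cost(faults, remaining), sorted(faults))
    if remaining:
        for child in children(faults, remaining):
            best = min(best, search(*child))
    return best


best_cost, best_faults = search()
# best_cost == 6 and best_faults == ['f1', 'f4']
```

Expanding the full tree keeps the sketch short; on larger alarm sets one would stop expanding branches whose partial fault cost already exceeds the best leaf found so far.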
That is, given a fault, a particular alarm will either always appear or it will never appear.

In applying PNIA Part I, in step 1, we find the faults that are contained in most of the alarms. These are faults $f_1$, $f_3$, and $f_5$. Each of these faults is contained in two alarms. We can expand each of the faults.

If we choose $f_1$ as the fault associated with the first potential incident, the cost of $f_1$ is 3 and alarms $A_1$ and $A_3$ are associated with this incident.

In expanding $f_1$ we delete $A_1$ and $A_3$ and repeat the first step. If we repeat the first step (now we are left only with alarm $A_2$) we have three potential faults to expand: $f_3$, $f_4$ and $f_5$. Among them the one with the least cost (most probable) that can explain the existence of alarm $A_2$ is $f_4$. Thus this solution gives two possible faults, $f_1$ and $f_4$.

After expanding fault $f_1$ we can expand faults $f_3$ and $f_5$. These give us two more hypotheses of possible faults. If fault $f_3$ happened, then, most probably, fault $f_6$ also happened, with total cost 8. If fault $f_5$ happened, then, most probably, fault $f_1$ happened, with total cost 8. Among the examined hypotheses of faults, the one with cost 6 (faults $f_1$ and $f_4$) is the most likely one.

From the algorithm described above we can see the way that the tree of the possible solutions is generated. The root is connected to three nodes, $f_1$, $f_3$, and $f_5$. Each of these nodes can be expanded to a set of leaves. Node $f_1$ is expanded into three leaves, $f_3$, $f_4$, and $f_5$. Each of these leaves represents two faults: fault $f_1$ and one of the faults $f_3$, $f_4$, or $f_5$. In a similar fashion node $f_3$ can be expanded into leaves $f_1$, $f_6$, and $f_5$. Finally, node $f_5$ can be expanded into leaves $f_3$, $f_2$ and $f_1$.

Since we assume that the alarms received are reliable, the information associated with the intermediate internal nodes $f_1$, $f_3$, and $f_5$ is infinite. Take the case of $f_1$; the rest of the cases are similar. Suppose $f_1$ is the only fault in the network. Alarms $A_1$ and $A_3$ contain fault $f_1$ in their fault localization fields. Alarm $A_2$ is not explained by fault $f_1$, and must be a random event. Thus, alarm $A_2$ must be deleted. But, according to our assumptions, the information cost associated with the deletion of an alarm is infinite. Thus the information cost associated with the internal node $f_1$ is infinite, and $f_1$ alone cannot explain the observed alarms.

B. Example 2

Consider a simple communications network consisting of five pieces of equipment connected in series: A is connected to B, which is connected to a T-1 line, which is connected to C, which is connected to D. Symbolically the network topology is as follows: A - B - (T-1) - C - D. Assume the T-1 line and each device in the network have only a single fault mode and that each device can emit only a single alarm. Assume also that the lines $l_{AB}$ between A and B, and $l_{CD}$ between C and D, also have a single fault mode.

The expected alarm for fault $f_A$ is $a_A$, and $I_{f_A} = 4$. The alarm for fault $f_D$ is $a_D$, and $I_{f_D} = 4$. The expected alarms for fault $f_{(T-1)}$ are $a_A$, $a_B$, $a_C$ and $a_D$, and $I_{f_{(T-1)}} = 1$. The expected alarms for $f_{l_{AB}}$ are $a_A$ and $a_B$. The information cost of this fault is 2. The expected alarms for $f_{l_{CD}}$ are $a_A$ and $a_D$. The information cost of this fault is 3. Assume that the cost of addition or deletion of any alarm is 3.

Finally, assume $I(S(A_e|F)) = 0$ for any fault set $F$. The cost of any particular fault $f$ is given by:

...

Suppose that two alarms are observed: alarm $a_A$, whose fault localization field indicates that either B, or (T-1), or C, or D, or $l_{AB}$ or $l_{CD}$ is at fault; and alarm $a_B$, whose fault localization field indicates that either A, or B, or (T-1), or C, or $l_{AB}$ or $l_{CD}$ is at fault.

The T-1 line has the lowest cost associated with it and may at first glance appear to be responsible for the two alarms. However, the total cost given by (5) associated with a fault in the T-1 line is 7, since alarms $a_C$ and $a_D$ are expected but not observed. The cost of deletion of each of these alarms is 3. The cost of a fault in the line $l_{AB}$, however, is 2. It can easily be verified that this possibility is in fact the solution of minimum cost. The observed alarms are due to a fault in the connection between A and B.

Although the T-1 device is the most likely device to fail in the network, it is not always at fault. In this case the absence of alarms from C and D is valuable information that points us towards a different fault. This example points out the importance of comparing observed alarms with expected alarms. As can be verified, the low cost solution given by PIA Part II is T-1.

VIII. CONCLUSIONS

The advent of large scale heterogeneous wide area networks and the recent proliferation of local area networks make alarm correlation an important consideration for network management systems. We have presented a general framework for modeling and solving the problem of alarm correlation and fault identification in communication networks.

Instead of undertaking a rule based approach in which patterns of alarms are indications of specific faults, we associate fault localization information with each alarm. Given that each alarm contains such information, general algorithmic schemes have been developed in order to localize a fault and correlate alarms in a more efficient and extendable fashion.

The framework developed for fault localization has been extended to cover multiple faults in a more realistic environment where even the observed alarms can be unreliable. The problem then becomes a discrete optimization problem where the objective is to find the set of faults and the set of alarms that minimize a certain cost function. Since
expected alarm for fault f~ is a B , and Ifs = 4. The ex- this problem is NP-Hard, heuristic algorithms have been
pected alarm for fault fc is aC, and I f a = 4. The expected proposed for its solution.
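As a concrete illustration of the minimum-cost formulation, the following Python sketch recomputes the two totals quoted in Example 2 (7 for the T-1 line, 2 for the line between A and B). It is only a sketch under stated assumptions: the additive form of the cost is assumed rather than copied from (5); the fault cost of 1 for the T-1 line is inferred from the quoted total of 7 minus two deletions of cost 3; the infinite penalty for an unexplained observed alarm is borrowed from the reliable-alarm assumption of Example 1; and the function name and the placeholder alarm names are hypothetical.

```python
import math


def hypothesis_cost(fault_cost, expected, observed, deletion_cost=3):
    """Cost of explaining the observed alarms by a single fault (assumed additive)."""
    # Assumption: received alarms are reliable, so an observed alarm that
    # the fault cannot explain makes the hypothesis infinitely costly.
    if observed - expected:
        return math.inf
    # Every alarm that was expected but never observed adds one deletion cost.
    missing = expected - observed
    return fault_cost + deletion_cost * len(missing)


# Placeholder names for the two observed alarms (hypothetical labels).
observed = {"alarm_1", "alarm_2"}

# name: (fault cost, expected alarms)
hypotheses = {
    # Fault cost 1 is inferred (7 - 2 * 3); a_B and a_C are expected but missing.
    "T-1": (1, observed | {"a_B", "a_C"}),
    # Fault cost 2 is quoted in the text; assumed to expect exactly the observed alarms.
    "l_AB": (2, set(observed)),
}

costs = {
    name: hypothesis_cost(cost, expected, observed)
    for name, (cost, expected) in hypotheses.items()
}
best = min(costs, key=costs.get)
print(costs, best)
```

Under these assumptions the cheapest fault taken in isolation (the T-1 line) loses to the line between A and B once the missing expected alarms are charged for, which is the point the example makes.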
BOULOUTAS et al.: ALARM CORRELATION AND FAULT IDENTIFICATION IN COMMUNICATION NETWORKS 533
Although the proposed approach is a general framework for designing fault identification systems, more research is needed in many areas. The algorithms presented here rely heavily on a priori information, which is either guessed or can be gained experimentally. One key research direction could be the way to adaptively acquire the complete information by starting with an empty set, an incomplete subset, or even an incorrect set of information.

REFERENCES

A. Bouloutas, Modeling Fault Management in Communication Networks, Ph.D. dissertation, Columbia University, May 1990.

J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.

George W. Hart, Minimum Information Estimation of Structure, Ph.D. dissertation, MIT, Laboratory for Information and Decision Systems, LIDS-TH-1664, April 1987.

Michael R. Garey and David S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Co., 1979.

Robert L. Nielsen, SNA/Management Services Alert Implementation Guide, Document Number GC31-6809-00, September 1988, IBM Communications Products Division, Research Triangle Park, NC.

Yun Peng and James A. Reggia, A Probabilistic Causal Model for Diagnostic Problem Solving-Part I: Integrating Symbolic Causal Inference with Numeric Probabilistic Inference, IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-17, No. 2, March/April 1987.

Yun Peng and James A. Reggia, A Probabilistic Causal Model for Diagnostic Problem Solving-Part II: Diagnostic Strategy, IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-17, No. 3, May/June 1987.

Applications. This research group is involved in studies of architectural issues in the design of complex software systems, and the application of advanced technologies to systems management problems. As part of his interest in systems management research, Dr. Calo was instrumental in establishing the IEEE International Workshop on Systems Management, and served as its first Chairman. Dr. Calo holds two United States patents.

Dr. Allan Finkel is a Research Staff Member and project leader of the Systems Management project of the IBM Research Division. Dr. Finkel received a bachelor's degree from the State University of New York at Binghamton in 1977 and a Ph.D. in mathematics in 1982 from New York University, where he studied at the Courant Institute of Mathematical Sciences. During 1982-1983 he was a Member of the Institute for Advanced Study at Princeton, New Jersey. He joined the Mathematical Sciences Department of IBM Research in the fall of 1983 as a postdoctoral fellow and moved to the Computer Sciences Department in 1985. His research interests include expert systems, systems and network management, and object-oriented databases.