Modern telecommunication networks may produce thousands of alarms per day, making the task of real-time network surveillance and fault management difficult. Due to the large volume of alarms, network operators frequently overlook or misinterpret them. To reduce the number of alarms displayed on operators' terminals, current network management systems apply alarm filtering procedures or, in the case of bursts of alarms, send them directly to a printer or database.

In this article, we will consider a relatively new process of real-time network management, alarm correlation. Alarm correlation is a conceptual interpretation of multiple alarms such that a new meaning is assigned to these alarms. It is a generic process that underlies different network management tasks such as context-dependent alarm filtering, alarm generalization, network fault diagnosis, generation of corrective actions, proactive maintenance, and network behavior trend analysis.

The goal of this article is twofold: first, to introduce an alarm correlation model and second, to describe the intelligent management platform for alarm correlation tasks (IMPACT), which implements the proposed model. Our approach to alarm correlation is based on the principles of model-based reasoning (MBR) [1]. As in MBR, we will define two basic components of the overall alarm correlation model: the structural component, which describes the network elements (NEs) and their connectivity and containment relations; and the behavioral component, which describes alarms and correlation.

The prototype of the IMPACT system has been developed at GTE Laboratories. It provides an intelligent environment for developing alarm correlation applications, and for real-time alarm monitoring. IMPACT has been used at GTE business units to build two network alarm correlation applications: AMES, for a land-based telecommunication network; and CORAL, for a cellular network.

Alarm correlation, as a subject of research and system development, has been discussed in several works. The aspects of time and space correlation of network events in the network troubleshooting domain were discussed in [2], where a knowledge-based approach was developed that described NEs and network events as knowledge-base entities. The conceptual approach to alarm correlation was discussed in [3]. A structural-phrase grammar-based approach to describe network connectivity and alarm correlation conditions was introduced in [4]. An alarm correlation model was proposed in [5], where alarms caused by a single common fault were considered. Interpretation and correlation of events has been analyzed in other areas, such as electric power systems [6], nuclear-power-plant alarm management [7], and patient-care monitoring.

In the network management area, several vendors have incorporated expert systems into their platforms to support alarm correlation capabilities. NMS/Core™ from Teknekron Communications Systems [8] includes programs to perform alarm filtering and correlation functions. The Sinergia system from CSELT, Italy [9], first uses expert system rules to recognize alarm correlation patterns and instantiate network fault hypotheses, and then applies heuristic search to determine the best solution among the hypotheses. ALLINK™ Operations Coordinator from NYNEX [10] uses an expert system to filter network alarms.

The rest of the article is organized as follows. The following section describes the basic notions associated with alarm correlation, and the section after that discusses the conceptual framework of alarm correlation. Next, we describe the structural component of the alarm correlation model, and then the behavioral component. An overview of the IMPACT system is given, and conclusions and future work are discussed.

GABRIEL JAKOBSON is a principal member of technical staff at GTE Laboratories.

NetAlert™ is a trademark of GTE Telecommunication Services.
ALLINK™ is a trademark of NYNEX Corporation.
ART-IM™ is a trademark of Inference Corporation.
NMS/Core™ is a trademark of Teknekron Communications Systems.

Basic Notions of the Alarm Correlation Domain
In this section, we will give a short informal review of basic notions that we will use to explain the alarm correlation domain and its applications.
Faults and Alarms
Figure 1. Facility disconnect.

A fault is a disorder occurring in the hardware or software of the managed network. Faults happen within the managed network or its components, while alarms are external manifestations of faults. Alarms defined by vendors and generated by network equipment are observable by network operators. We are considering only alarms mediated by alarm messages. Similar alarm messages with different time stamps are separate alarms. Faults can be causally related, thus forming an acyclic fault propagation graph, or independent (causally unrelated). External observation of alarms may instill an impression that one alarm causes another. However, the causality is not between alarms, but rather between faults.
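
The point that causality holds between faults rather than between alarms can be illustrated with a minimal Python sketch. The fault names, the alarm names, and both lookup tables below are hypothetical and chosen only for illustration; the article does not prescribe any data structure for the fault propagation graph.

```python
# Hypothetical fault propagation graph: an edge f -> g means fault f causes fault g.
# The graph is assumed to be acyclic, as stated in the text.
FAULT_CAUSES = {
    "circuit-disconnected": ["trunk-down"],
    "trunk-down": ["call-setup-failing"],
    "call-setup-failing": [],
}

# Hypothetical mapping from each alarm to the fault it manifests.
ALARM_FAULT = {
    "trunk-alarm": "trunk-down",
    "call-failure-alarm": "call-setup-failing",
}


def downstream(fault, graph):
    """Return all faults reachable from `fault` in the propagation graph."""
    seen, stack = set(), list(graph[fault])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


def appears_to_cause(alarm_a, alarm_b):
    """True when the fault behind alarm_a causally precedes the fault behind alarm_b.

    The alarms themselves are never linked; the relation is derived entirely
    from the faults they manifest.
    """
    return ALARM_FAULT[alarm_b] in downstream(ALARM_FAULT[alarm_a], FAULT_CAUSES)
```

In this sketch an operator might perceive "trunk-alarm" as causing "call-failure-alarm", but the only causal edges stored are those between faults.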
Alarm Correlation
Alarm correlation is a conceptual interpretation of multiple alarms such that new meanings are assigned to these alarms. It is a generic process that underlies different network management tasks:
Compression: the reduction of multiple occurrences of an alarm into a single alarm.
Count: the substitution of a specified number of occurrences of alarms with a new alarm.
Suppression: inhibiting a low-priority alarm in the presence of a higher-priority alarm.
Boolean: substitution of a set of alarms satisfying a Boolean pattern with a new alarm.
Generalization: reference to an alarm by its superclass.
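
The five operations listed above can be sketched in a few lines of Python. This is an illustrative reading, not the IMPACT implementation: the `Alarm` record, the numeric priority convention, the per-source scope of suppression, and the `SUPERCLASS` table are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    name: str          # alarm type, e.g. "RED-CGA"
    source: str        # network element that raised the alarm
    priority: int = 0  # assumed convention: larger means more important


# Hypothetical superclass table used for generalization.
SUPERCLASS = {
    "RED-CGA": "CARRIER-GROUP-ALARM",
    "YELLOW-CGA": "CARRIER-GROUP-ALARM",
}


def compress(alarms):
    """Compression: collapse repeated occurrences of an alarm into one."""
    return list(dict.fromkeys(alarms))  # keeps the first occurrence, in order


def count(alarms, name, threshold, new_name):
    """Count: replace `threshold` or more occurrences of `name` with a new alarm."""
    matching = [a for a in alarms if a.name == name]
    if len(matching) < threshold:
        return list(alarms)
    rest = [a for a in alarms if a.name != name]
    return rest + [Alarm(new_name, matching[0].source, matching[0].priority)]


def suppress(alarms):
    """Suppression: per source, keep only the alarms with the highest priority."""
    top = {}
    for a in alarms:
        top[a.source] = max(top.get(a.source, a.priority), a.priority)
    return [a for a in alarms if a.priority == top[a.source]]


def boolean(alarms, pattern, new_alarm):
    """Boolean: if `pattern` holds for the set of alarm names, substitute new_alarm."""
    names = {a.name for a in alarms}
    return [new_alarm] if pattern(names) else list(alarms)


def generalize(alarm):
    """Generalization: refer to an alarm by its superclass, if one is known."""
    return Alarm(SUPERCLASS.get(alarm.name, alarm.name), alarm.source, alarm.priority)
```

Each function takes a list of alarms and returns a new list (or a new alarm), so the operations can be chained; for example, `suppress(compress(alarms))` first removes duplicates and then drops lower-priority alarms from each source.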
Alarm correlation may be used for network fault isolation and diagnosis, selecting corrective actions, proactive maintenance, and trend analysis.

Figure 2. (a) Correlation of causally dependent alarms; (b) and (c) correlation of causally independent alarms.

To illustrate the use of alarm correlation, we will give an example based on actual events that happened on a private telecommunication network. Because of an administrative error at a primary network control center, a circuit disconnect order was incorrectly sent to a common carrier, but soon after withdrawn. An additional error by the common carrier led to the disconnect order being carried out despite the cancellation. This meant that a live circuit was disconnected, causing a catastrophic failure on a major DS3 link between city A and city B (Fig. 1). A normal facility disconnect, when performed by network operations personnel, invokes automatic loopback conditions on digital cross-connect systems (DCSs) at both ends of the circuit. Since this is a normal DCS behavior, the loopback conditions are not reported. The packet and voice switches having logical trunks over the disconnected circuit sent large volumes of call processing failure messages to the primary network control center. The operators puzzled for an hour before they realized what had happened. The task at hand was to correlate the call-processing alarms from the switches with the absence of alarms from the DCSs, and recognize that the trunk was actually disconnected. This was complicated by the incorrect record in the database showing that the circuit was live.

Subjects for correlation could be any events affecting the network. These may be environmental-state parameters, the network management context, or events invoked by the user or external systems. Correlations are defined over a time interval or window. When a situation is recognized and a correlation asserted, it remains active until it expires or is externally cleared. Correlations may be subsumed by higher-level correlations.

The alarm correlation model introduced in this article distinguishes between correlations and correlation rules [11]. A correlation is a statement about events happening on the network; for example, Bad-Card-Correlation states that some port contains a faulty port card. A correlation rule defines the conditions under which correlations are asserted. For example, if there is a red carrier group alarm (CGA) from one DCS, and a Yellow-CGA from another, and these DCSs are connected, then Bad-Card-Correlation will be asserted. The conditional part of the rule may contain a complex Boolean pattern recognizing alarms, NEs, and correlations, as well as structural, temporal and other relations.

Fault Diagnosis
One of the major applications of alarm correlation is network fault diagnosis. Not all faults exhibit alarms. These faults can be recognized indirectly by correlating available alarms. Figure 2a illustrates this, showing that correlation c1 detects the fault f1, and correlation c2 detects the fault f2. Correlating c1 and c2 into the correlation c0 allows diagnosis of the fault f0. Correlation between alarms due to a common fault is a transitive, reflexive, and symmetric relation (i.e., an equivalence relation, as noted in [5]). If a single alarm is a manifestation of multiple faults, this relation may not hold. For example, if alarm a (Fig. 2b) is caused by fault f1 or fault f2, but not both (an exclusive OR condition), then correlations c1 and c2 are formed
Figure 5. Message class CARRIER-GROUP-ALARM and a sample message class hierarchy.

The behavioral component contains three major components: the message class hierarchy, the correlation class hierarchy, and correlation rules. The message class hierarchy describes the messages generated by NEs. The message class hierarchy is used to control the alarm message-parsing process. This process is described in more detail in [12]. The correlation classes and correlation rules will be described later.

The NE classes, message classes, correlation classes, and correlation rules are organized into hierarchies. These hierarchies are related by "producer/consumer" dependencies. NEs are "producers" of alarm messages, messages "produce" correlations, and rules are "consumers" of all the above. The "producer/consumer" dependencies are used by IMPACT during the application development process. These dependencies, along with other domain-oriented constraints, are used to support correctness, completeness, and consistency of the knowledge base, and to guide the user through the application development process. The "producer/consumer" dependency restricts the user from deleting an NE class from the knowledge base while message classes still refer to it.

The Structural Component
Network Element Class Hierarchy
NE classes describe network equipment types, such as switches, digital cross-connects and multiplexers. NE classes are organized into a hierarchy using class/subclass relations. The root of the hierarchy is a GENERIC-NE-CLASS, which contains the most general information common to all NEs. The next level of the hierarchy describes the basic NE classes, such as trunk-class, transmission-interface-class, switch-class, building-class, and others. Each of these classes refers to its own subhierarchy; for example, the trunk-class refers to the logical-trunk-class and physical-trunk-class, and the physical-trunk-class to the super-link-class, T1-trunk-class, and T3-trunk-class. Each subclass inherits parameters, values, attributes, and constraints from its superclasses. IMPACT permits multiple inheritance; that is, a class might have more than one superclass.

Network Class Editor, in Fig. 4, describes ROCKWELL-DEXCS, which is a subclass of the generic digital cross-connect class DEXCS-CLASS. Message Class refers to BASIC-DEXCS-MESSAGE, which is the root node of the associated message class hierarchy. The Connected Filter specifies that ROCKWELL-DEXCS may only be connected to a digital crossconnect or a switch. Within Filter is used to specify that ROCKWELL-DEXCS can be placed within a building or a network operations center, while Contains Filter specifies that only physical and logical ports may be contained within.

The NE class hierarchy is an abstraction of physical NEs. The terminal nodes describe particular NE types produced by manufacturers. Specific digital crossconnect products, such as AT&T's DACS II or Rockwell's RDX-370, are terminal nodes of the superclass digital-cross-connect-class. The NE class hierarchy is specific to an application. It may be modified by adding, deleting, or editing existing classes. The upper levels of the hierarchy are general and are therefore reusable across applications.

Network Configuration Model
The network configuration model is constructed from the instances of individual NEs. NE instances describe the actual physical or logical components of the managed network. The instances are specified by instantiating terminal NE classes and connecting them according to the network configuration. This process may be performed by the network operating staff using the IMPACT Network Element Editor. Constraints defined in the class specification will be enforced. The user cannot make connections that violate the physical behavior of the connected elements, or leave required values unspecified. Network Element Editor in Fig. 4 describes LOS-ANGELES-DEXCS, which is an instance of ROCKWELL-DEXCS. It is installed at a Los Angeles network operations center, connected to a DCS in Sacramento, and contains four physical ports.

The Behavioral Component
Message Class Hierarchy
All alarm messages produced by a specific NE are organized into a message class hierarchy using the class/subclass relation. Introduction of message classes simplifies the decision-making process of network management. Let us suppose
Figure 6. BAD-CARD-CORRELATION and BAD-CARD-CORRELATION-RULE-I.
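
The Bad-Card-Correlation rule described in the Alarm Correlation section (a red CGA from one DCS, a yellow CGA from another, and the two DCSs connected) might be sketched in Python as follows. This is an illustration under stated assumptions, not the rule syntax used by IMPACT: the `Alarm` record, the `CONNECTED` table, the name SACRAMENTO-DEXCS, and the 60-second default window are hypothetical. Only LOS-ANGELES-DEXCS and its connection to a DCS in Sacramento come from the text.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Alarm:
    kind: str    # e.g. "RED-CGA" or "YELLOW-CGA"
    source: str  # name of the DCS that reported the alarm
    time: float  # time stamp in seconds


# Hypothetical connectivity taken from the network configuration model.
CONNECTED = {frozenset({"LOS-ANGELES-DEXCS", "SACRAMENTO-DEXCS"})}


def connected(a, b):
    """Structural condition: are the two NEs connected to each other?"""
    return frozenset({a, b}) in CONNECTED


def bad_card_rule(alarms, window=60.0):
    """Return (red_source, yellow_source) pairs for which the correlation is asserted.

    The condition combines an alarm pattern (red CGA and yellow CGA), a
    structural relation (the sources are connected), and a temporal relation
    (both alarms fall within `window` seconds of each other).
    """
    asserted = []
    for red, yellow in permutations(alarms, 2):
        if (red.kind == "RED-CGA" and yellow.kind == "YELLOW-CGA"
                and connected(red.source, yellow.source)
                and abs(red.time - yellow.time) <= window):
            asserted.append((red.source, yellow.source))
    return asserted
```

Keeping the structural check in a separate `connected` function mirrors the article's split between the structural component (NEs and their connectivity) and the behavioral component (alarms and correlation rules).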