This document is intended for use by Nokia customers (“You”) only, and it may not be used except for the purposes defined in the
agreement between You and Nokia (“Agreement”) under which this document is distributed. No part of this document may be
copied, reproduced, modified or transmitted in any form
or means without the prior written permission of Nokia. If you have not entered into an Agreement applicable
to the Product, or if that Agreement has expired or has been terminated, You may not use this document in
any manner and You are obliged to return it to Nokia and destroy or delete any copies thereof.
The document has been prepared to be used by professional and properly trained personnel, and You
assume full responsibility when using it. Nokia welcomes Your comments as part of the process of continuous development and
improvement of the documentation.
This document and its contents are provided as a convenience to You. Any information or statements
concerning the suitability, capacity, fitness for purpose or performance of the Product are given solely on
an “as is” and “as available” basis in this document, and Nokia reserves the right to change any such
information and statements without notice. Nokia has made all reasonable efforts to ensure that the content
of this document is adequate and free of material errors and omissions, and Nokia will correct errors that You identify in this
document. But, Nokia's total liability for any errors in the document is strictly limited to the
correction of such error(s). Nokia does not warrant that the use of the software in the Product will be uninterrupted or error-free.
NO WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
ANY WARRANTY OF AVAILABILITY, ACCURACY, RELIABILITY, TITLE, NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, IS MADE IN RELATION TO THE
CONTENT OF THIS DOCUMENT. IN NO EVENT WILL NOKIA BE LIABLE FOR ANY DAMAGES, INCLUDING BUT NOT
LIMITED TO SPECIAL, DIRECT, INDIRECT, INCIDENTAL OR CONSEQUENTIAL OR ANY LOSSES, SUCH AS BUT NOT LIMITED
TO LOSS OF PROFIT, REVENUE, BUSINESS INTERRUPTION, BUSINESS OPPORTUNITY OR DATA THAT MAY ARISE FROM
THE USE OF THIS DOCUMENT OR THE INFORMATION IN IT, EVEN IN THE CASE OF ERRORS IN OR OMISSIONS FROM
THIS DOCUMENT OR ITS CONTENT.
This document is Nokia proprietary and confidential information, which may not be distributed or disclosed to any third parties
without the prior written consent of Nokia.
Nokia is a registered trademark of Nokia Corporation. Other product names mentioned in this document
may be trademarks of their respective owners, and they are mentioned for identification purposes only.
Copyright © 2016 Nokia. All rights reserved.
Only trained and qualified personnel may install, operate, maintain or otherwise handle this
product and only after having carefully read the safety information applicable to this product.
The safety information is provided in the Safety Information section in the “Legal, Safety and
Environmental Information” part of this document or documentation set.
If you should have questions regarding our Environmental Policy or any of the environmental services we
offer, please contact us at Nokia for any additional information.
17A, Operating
1. Introduction
The purpose of this document is to help users find alarms, their meanings and effects, and instructions on how to avoid
them. The alarms are listed by number in ascending order.
The document provides the following change information:
1. An indication of whether the alarm is new, changed, or removed. This is provided in the following columns:
- Changes between issues ... which shows the changes since the previous issue of the document.
- Changes between releases ... which shows the changes since the latest issue of the document in the previous product
release.
If the cell is empty, the alarm is neither changed, nor new, nor removed.
Note that removed alarms are listed at the very bottom and are marked in red font.
2. Detailed change information showing the current and previous values of the alarm attributes. This information is provided
in the following columns:
- <alarm attribute> in issue ... which shows the value in the previous issue of the document.
- <alarm attribute> in release ... which shows the value in the latest issue of the document in the previous product release.
The columns are grouped into attribute-specific sections for a structured and convenient view. Use the unfold (+) and fold (-)
buttons in the top bar to browse the attribute-specific change details. Use the unfold all (2) and fold all (1) buttons on the left-hand
side of the top bar to show and hide, respectively, the change details for the whole report.
Note that for all alarms except new and removed ones, the field values for the previous issue and the previous release are always
provided to show the full change history. Additionally, the changed attributes of alarms are highlighted in grey.
The highlighting indicates whether the change is between issues or between releases.
Major
Major
Major
Minor
Minor
Warning
Minor
Major
Major
Processing error Major
Major
Minor
Warning
Major
Warning
Warning
Warning
Major
Warning
Critical
Minor
Major
Major
Warning
Major
Equipment Minor
Minor
Equipment Minor
Major
Minor
Major
Warning
Critical
Communications Warning
Major
Communications Minor
Critical
Warning
Minor
Processing error Minor
Warning
Minor
Equipment Critical
Communications Major
Major
Minor
Communications Minor
Communications Minor
communicationsAlarm Minor
Communications Major
Processing error Major
Major
Equipment Major
Equipment Major
Equipment Major
Equipment Major
Equipment Warning
Equipment Minor
Equipment Major
Equipment Major
Environmental Critical
processingErrorAlarm Major
Equipment Major
Equipment Critical
Equipment Critical
Equipment Critical
Equipment Critical
Equipment Critical
Equipment Critical
Environmental Major
Equipment Critical
Equipment Major
Environmental Critical
Equipment Critical
Equipment Critical
Equipment Critical
Equipment Critical
Communications Major
Communications Major
Communications Major
Communications Major
Communications Major
Communications Major
Equipment Critical
Equipment Major
Equipment Minor
Communications Minor
Equipment Major
Equipment Warning
Equipment Warning
qualityOfServiceAlarm Warning
Communications Warning
processingErrorAlarm Minor
Environmental Warning
Communications Minor
Communications Critical
Communications Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Environmental Major
Communications Critical
Communications Major
Equipment Critical
Equipment Major
Equipment Minor
Equipment Minor
Meaning Effect
Meaning: The dynamic configuration activation has failed for at least one target node.
Probable causes:
1. A node joined the cluster but its node type is unknown, and the dynamic configuration activation for this node has failed.
2. A node joined the cluster, the dynamic configuration activation started and, immediately afterwards, the node disjoined. As a result, the configuration is not fully activated.
Effect: Failure of the dynamic configuration activation for at least one target node is the result of incorrect setting of the relevant parameters in the affected nodes.

Meaning: This alarm indicates that the initial copying of the user/group account files (passwd, group, shadow, gshadow) from the /etc directory to the shared storage (/mnt/mstate/_global/etc) has failed, or that the synchronization of the user/group account files (passwd, group, shadow, gshadow) between the shared storage (/mnt/mstate/_global/etc) and /etc has failed on the specific node reported.
Effect: The fault indicates that the OS user/group account files (passwd, group, shadow, gshadow) cannot be synchronized to all nodes and cannot be stored to the shared storage. The fault by itself is not critical; it means that other nodes have not been updated with the latest OS user administration activity. In particular, a password change of an OS user has not been synchronized with the other nodes.

Meaning: If the logging node does not respond, it is considered unavailable and the alarm is raised since the logging procedure cannot continue.
Effect: The central logging node, which receives all logs from all nodes, is not available. The services which use the logging node are affected.

Probable cause: Underlying Resources Unavailable
Event type: Processing error
Default severity: Major

Meaning: The configuration of the simple network management protocol (SNMP) mediator contains values that are unacceptable.
Effect: The invalid part of the configuration is ignored. This causes partial loss of functionality. SNMP traps may be lost.
Meaning: SNMP Mediator has sent an SNMP request to an SNMP agent but it has not received a response.
Example:
1. A filter condition has been added for the authentication failure (.1.3.6.1.6.3.1.1.5.5) trap. Thus, the following entry can be viewed by using the SCLI command:
> show config fsClusterId=ClusterRoot fsFragmentId=SNMP fssnmpMediatorName=1 fssnmpAttributeType=V2traps fssnmpV2TrapId=.1.3.6.1.6.3.1.1.5.5
fssnmpAttributeType=V2traps, fssnmpMediatorName=1, fsFragmentId=SNMP, fsClusterId=ClusterRoot
The filter condition is defined by the attribute fssnmpFilterCondition. fssnmpFilterCondition may have, for example, the value (.1.3.6.1.2.1.1.1.0=*Linux*). See RFC 2254 for more information about the filter syntax.
2. SNMP Mediator receives an authenticationFailure trap that does not contain the value of variable .1.3.6.1.2.1.1.1.0.
3. SNMP Mediator queries the value of .1.3.6.1.2.1.1.1.0 from the SNMP agent, but does not receive a response.
Effect: SNMP Mediator is not able to handle the trap correctly, because it is not able to query or modify the variables in the SNMP agent.

Meaning: SNMP Mediator has received an SNMP trap that it is not aware of. The trap is unknown to the SNMP Mediator if:
1) the IP address of the SNMP agent that sends the trap is missing from the SNMP Mediator's configuration, or
2) the OID (object identifier) of the trap is unknown to the SNMP Mediator.
Effect:
1) Unknown traps may contain information that could be useful.
2) Unnecessary traps waste network capacity.

Meaning: The alarm is triggered by the authenticationFailure SNMP trap. The SNMP agents running in the Ethernet switches (switch blades or modules) generate this trap if they receive SNMP requests that are not properly authenticated.
SNMPv1 and SNMPv2c use community names to provide security, authentication, and access control. Community names have the following properties:
1. Each community name has an associated access mode (either read-only or read-write).
2. Each community name has an associated IPv4 subnet/mask.
3. Each community name may be enabled or disabled.
The SNMP agent rejects and discards any SNMP request that does not contain a community name that matches one of the configured community names, or which does not contain an IPv4 source address that is allowed by the IPv4 address and IPv4 mask of the corresponding community.
More specifically, the following authentication failures are detected for SNMP requests:
1. Each getRequest, getNextRequest, or getBulkRequest that does not include one of the enabled, read-only or read-write community names.
2. Each setRequest that does not include one of the enabled, read-write community names.
3. Each SNMP message that is sent from an invalid IPv4 subnet.
Effect: The alarm may indicate that an SNMP tool which is used for sending SNMP requests to the SNMP agent is configured incorrectly. The alarm may also indicate malicious activity, which means that an unauthorized user is trying to obtain information by sending SNMP requests.

Meaning: The alarm is triggered by a coldStart Simple Network Management Protocol (SNMP) trap sent by an Ethernet switch. The coldStart trap signifies that the sending protocol entity is reinitializing itself and that the management agent's configuration or the protocol entity implementation may have been altered. The warmStart trap signifies that the sending protocol entity is reinitializing itself and that neither the management agent configuration nor the protocol entity implementation is altered.
Effect: The alarm indicates that the switch has restarted and is now reinitializing itself. This is not necessarily an error condition but might be caused by maintenance operations such as powering up a new switch or intentional restarting of a switch after a software or configuration update. In these cases, this alarm can be safely ignored.
However, a spontaneous restart should be considered as an indication of a serious problem even though the system with its redundant Ethernet infrastructure will tolerate it.

Meaning: A linkDown simple network management protocol (SNMP) trap triggers this alarm. It is an indication that an Ethernet switch port changes from up state to down state.
Effect: Once a port (or link) is in down state, it cannot transport any traffic. This is not necessarily an error condition, but can follow from a maintenance operation such as replacing a cable between two switches, or closing a switch port via a management interface. A link may also go to down state if, for example, the host computer or switch at the other end of the link is shut down, restarted, or removed.
However, a spontaneous state change may indicate a serious failure, although the system will typically tolerate these failures to some extent because of redundant networking infrastructure.
Note that a linkDown SNMP trap is paired with a linkUp SNMP trap that will trigger a cancelling of the alarm. For example, when replacing a computer node, one would first see the raising of this alarm when the replaced unit is shut down and the automatic cancelling of this alarm when the new blade is taken into use.

Meaning: The backup of any of the following entities has failed because of a fatal error:
- Delivery
- Configuration snapshot
- State Volume
- Database
- Plugin
- File System
Effect: As an effect of this alarm during backup, a backup-iso is not created, or is rendered unusable.

Meaning: A broadcast storm control condition has started within the last 250 milliseconds.
Effect: The switch will drop broadcast frames for a fixed period of 250 milliseconds.
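The three authentication-failure rules listed for the authenticationFailure alarm can be sketched in code. This is a minimal illustration only, assuming a hypothetical `Community` record and helper name; it is not the switch agent's actual implementation, and the example community names and subnets are made up.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Request types named in the alarm description.
READ_REQUESTS = {"getRequest", "getNextRequest", "getBulkRequest"}
WRITE_REQUESTS = {"setRequest"}


@dataclass(frozen=True)
class Community:
    """Hypothetical record for one configured community name."""
    name: str
    access: str          # "read-only" or "read-write"
    subnet: str          # associated IPv4 subnet/mask, e.g. "10.0.0.0/24"
    enabled: bool = True


def is_authenticated(request_type, community_name, source_ip, communities):
    """Return True if the request passes the community checks.

    A False result corresponds to the conditions under which the
    agent would discard the request and send authenticationFailure.
    """
    for community in communities:
        # The community name must match and be enabled.
        if community.name != community_name or not community.enabled:
            continue
        # Rule 3: the source address must lie in the community's subnet.
        if ip_address(source_ip) not in ip_network(community.subnet):
            continue
        # Rule 1: read requests accept read-only or read-write communities.
        if request_type in READ_REQUESTS:
            return True
        # Rule 2: setRequest needs a read-write community.
        if request_type in WRITE_REQUESTS and community.access == "read-write":
            return True
    return False


# Made-up configuration for illustration.
EXAMPLE_COMMUNITIES = [
    Community("public", "read-only", "10.0.0.0/24"),
    Community("admin", "read-write", "10.0.0.0/24"),
    Community("old", "read-write", "10.0.0.0/24", enabled=False),
]
```

With this example configuration, a `setRequest` carrying the read-only community `public` fails the check, as does any request arriving from outside `10.0.0.0/24`.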
Meaning: The platform high availability services (HAS) subsystem cannot reset a faulty node using HWM functionality with Intelligent Platform Management Interface (IPMI).
The operational state of the node is not known, as the node still holds and updates the shared resources.
This is a severe platform error that may have the following results:
In ATCA Hardware:
- a double hardware fault
- an IPMI configuration error
- a network partitioning problem
- manual power-off of a complete chassis
In BCN Hardware:
- LMP not responding
As the HAS is unable to determine the state of the node, the active or standby recovery groups (RG) are delaying the switchover until the node is operational again or the node is manually set to isolation state.
Effect: The possible active/standby RGs, which have an active recovery unit (RU) instance running on the failed node, cannot recover from the situation by applying a switchover to another node. The services provided by these RGs are currently down.

Meaning: A physical disk partition of a Distributed Replicated Block Device (DRBD) is broken or is reporting errors.
DRBD is used to replicate data of an application partition between two nodes. The nodes form an active/standby redundancy pair where a standby node can take over in case the active node or application fails.
The identified DRBD partition or logical volume is currently unavailable or functioning poorly.
Effect: The service that the application provides is not impacted if the other node and the DRBD device are still functioning. In this case, however, the application is no longer redundant, and recovery from possible forthcoming failures may take longer or may not be possible at all.
The service provided by the application is down if the other node or partition is also not functioning.

Meaning: A secondary Distributed Replicated Block Device (DRBD) does not synchronise or synchronises very slowly with the primary DRBD device.
DRBD is used to replicate data of an application partition between two nodes. The nodes form an active/standby redundancy pair where a standby node can take over in case the active node or application fails. When the two disk images are not identical (for example, following a node reboot) they are synchronised by copying the changed data from the primary DRBD to the secondary DRBD.
Currently, synchronisation to the identified secondary DRBD partition or logical volume is not proceeding or proceeds extremely slowly.
Effect: If the node running the primary DRBD (and the application) is functioning, then there is no immediate impact on the service that the application provides. The identified DRBD partition or logical volume will, however, not be currently available as a backup resource. Any failure in the node that currently runs the application causes a long or permanent service interruption. The service is down if the node with the primary DRBD is not functioning.
This alarm is just for information if the severity is "Minor". However, actions are to be taken if the severity is raised to "Major", as this indicates that the situation has lasted longer than an hour.
Notice also that a "Major" alarm situation is expected if some maintenance operation (for example, a disk replacement procedure) requires a complete or a large disk re-synchronization. Depending on the disk size, a full disk re-synchronization can take several hours to complete.

Meaning: New configuration on Switch not successfully applied when initiated by SCLI or DHCP lease time expiry.
Effect: Failure in applying new configuration on Switch. Switch may not function properly or Switch may be running with old configuration.

Meaning: The internal temperature of the CPU has passed the programmed threshold.
Effect: There is a severe temperature-related problem in the referred component, and the unit may behave unexpectedly.

Meaning: The central processing unit (CPU) utilization has passed the programmed threshold.
The alarm is raised for the following reasons:
1. High amount of traffic
2. Possible looping in switch in multi-chassis environment because of wrong cabling/wrong configuration.
Effect: Unit may become unstable.

Meaning: The image loaded via TFTP (Trivial File Transfer Protocol) has not passed the CRC (cyclic redundancy check) check and has been discarded.
The alarm is raised for the following reasons:
1. The loaded binary image is corrupted and can't be used.
2. The corruption may have happened during the transfer or the original image on the server was already corrupted.
Effect: The switch is accessible only with default configurations. The application-related configurations will not be applied.

Meaning: The system memory utilization has passed the programmed threshold.
The alarm is raised because of the following reasons:
1. High amount of traffic
2. Possible looping in switch in multi-chassis environment because of wrong cabling/wrong configuration
Effect: System is running out of memory, which may cause the system to behave erratically.

Meaning: One of the physical ports in the switch has a problem, which may severely affect system performance.
The switch port error alarm is raised for the following reasons on a (physical) port of the switch:
portErrorsExceeded: the level of errors on the port has passed the programmed threshold. Compared (as a percentage) to the total amount of packets over a period of time.
portsBroadcastExceeded: the level of broadcast-limit has passed the programmed threshold.
portsCRCErrExceeded: the level of CRC (cyclic redundancy check) errors has passed the programmed threshold. Compared (as a percentage) to the total amount of packets over a period of time.
portsRuntsExceeded: the level of runts (=broken (too short) packets) has passed the programmed threshold. Compared (as a percentage) to the total amount of packets over a period of time.
Effect: It is expected behavior of the application to raise this alarm when a switch blade or server is plugged in, unplugged, or restarted, or on an invalid switch login.

Meaning: The reported field replaceable unit (FRU) is not present in the system or is in inactive state. Typical FRUs include cards, power supply units, and chassis components.
The alarm may be raised in one of the following scenarios:
1. FRU missing from its expected position.
2. FRU is deactivated.
3. Shelf Manager switchover or restart.
4. Cluster restart.
5. Dynamic configuration scenarios like adding or replacing the blades.
6. Embedded software upgrade or downgrade.
7. Software upgrade or downgrade.
8. Management node restart or switchover.
9. FRU is in M7 state (Communication Lost).
Note: Alarm is raised with minor severity when the FRU is deactivated and in other scenarios the alarm is raised with major severity.
Effect: Refer to the scenarios under Meaning.

Meaning: A Bus error has taken place. The possible error types are:
Front Panel NMI (non-maskable interrupt request) / Diagnostic Interrupt
Bus Timeout
I/O (input/output) channel check NMI
Software NMI
PCI PERR (peripheral component interconnect parity error)
PCI SERR (peripheral component interconnect system error)
EISA (Enhanced Industry Standard Architecture bus) Fail Safe Timeout
Bus Correctable Error
Bus Uncorrectable Error
Fatal NMI
Effect: Unit may exhibit erratic behaviour or decreased functionality.

Meaning: A problem in CPU (central processing unit) functionality has been detected. Possible error types are:
IERR (internal error)
Thermal Trip
FRB1/BIST (fault resilient boot/built-in self-test) failure
FRB2/Hang in POST (power-on self-test) failure (believed to be due or related to a processor failure)
FRB3/Processor Startup/Initialization failure (CPU didn't start)
Configuration Error
SM BIOS (system management basic
Effect: Unit may be out of service.
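The percentage comparison described for the switch port error alarm (portErrorsExceeded, portsCRCErrExceeded, portsRuntsExceeded) can be illustrated as follows. This is a rough sketch under stated assumptions: the counter names, the threshold values, and the function name are hypothetical, and the switch's real measurement period and counters are not specified in this document. portsBroadcastExceeded is left out because the text does not describe it as a percentage of total packets.

```python
# Maps each alarm condition to a hypothetical per-port counter name.
CONDITION_COUNTERS = {
    "portErrorsExceeded": "errors",
    "portsCRCErrExceeded": "crc_errors",
    "portsRuntsExceeded": "runts",
}


def exceeded_port_conditions(counters, thresholds_percent):
    """Return the conditions whose share of total packets is over threshold.

    counters: packet counts collected over one measurement period.
    thresholds_percent: programmed threshold per condition, in percent.
    """
    total = counters.get("total_packets", 0)
    if total == 0:
        # No traffic in the period, so no percentage can be computed.
        return []
    raised = []
    for condition, counter_key in CONDITION_COUNTERS.items():
        share = 100.0 * counters.get(counter_key, 0) / total
        if share > thresholds_percent[condition]:
            raised.append(condition)
    return raised


# Illustrative values only: 3 % errors, 0.5 % CRC errors, no runts.
EXAMPLE_COUNTERS = {"total_packets": 1000, "errors": 30, "crc_errors": 5, "runts": 0}
EXAMPLE_THRESHOLDS = {
    "portErrorsExceeded": 1.0,
    "portsCRCErrExceeded": 1.0,
    "portsRuntsExceeded": 1.0,
}
```

Expressing each threshold as a share of total packets keeps a busy port from raising the alarm purely because its absolute error count is high.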
Meaning: The unit's current has exceeded the programmed threshold.
The supported threshold levels are:
- lnr (Lower Non Recoverable)
- lc (Lower Critical)
- lnc (Lower Non Critical)
- unc (Upper Non Critical)
- uc (Upper Critical)
- unr (Upper Non Recoverable)
Based on the threshold level crossed, the alarm will be raised with a different severity as mapped below:
lnr, unr: The alarm will be raised with severity "Critical".
lc, uc: The alarm will be raised with severity "Major".
lnc, unc: The alarm will be raised with severity "Warning".
Note: For the below mentioned FRUs and their respective IPMC firmware version onwards, this alarm is not applicable and will not be raised:
AMPP2-A 2.0.0
AHUB4-A 03.00.001-009-001-003
ACPI5-A 2.1.31
Effect: The field-replaceable unit (FRU) may be damaged or may stop working. The risk of FRU damage or misbehavior can be judged based on the severity of the alarm. Severity level warning signifies that the FRU is still working within its specifications despite crossing the detected threshold limit. The higher severity signifies that the operation is out of specifications.

Meaning: This alarm indicates a booting failure. The possible causes or error types for the failure are:
0. No bootable media.
1. N/A - This specific sensor offset value is not applicable for BS2AM-A unit.
2. Boot/configuration server not found.
3. Invalid boot sector.
4. Timeout waiting for user action.
5. Primary bank boot failed.
6. Secondary bank boot failed.
7. Network boot/configuration failed.
8. Boot retry limit exceeded.
Effect: Unit boot-up may be failing.

Meaning: A system firmware error (POST - power-on self-test error) has been detected. The possible causes are:
1. No system memory is physically installed in the system.
2. No usable system memory, because all installed memory has experienced an unrecoverable failure.
3. Unrecoverable hard-disk/ATAPI (Advanced Technology Attachment Packet Interface)/IDE (Integrated Drive Electronics) device failure
4. Unrecoverable system-board failure
5. Unrecoverable hard-disk controller failure
6. Removable boot media not found
7. Firmware (BIOS (basic input/output system)) ROM (read-only memory) corruption detected
8. CPU (central processing unit) voltage mismatch (processors that share the same supply have different voltage requirements)
9. CPU speed matching failure
10. System Firmware Hang
Effect: Unit boot-up may be failing or the unit may be failing in some other respect.

Meaning: The power unit reading has exceeded the programmed threshold.
The supported threshold levels are:
- lnr (Lower Non Recoverable)
- lc (Lower Critical)
- lnc (Lower Non Critical)
- unc (Upper Non Critical)
- uc (Upper Critical)
- unr (Upper Non Recoverable)
Based on the threshold level crossed, the alarm will be raised with a different severity as mapped below:
lnr, unr: The alarm will be raised with severity "Critical".
lc, uc: The alarm will be raised with severity "Major".
lnc, unc: The alarm will be raised with severity "Warning".
The possible causes are:
- power off/power down
- power cycle
- soft power control failure (the unit did not respond to a request to turn on)
- power unit failure detected (other)
Effect: The system is running on the backup power supply unit, which is highly risky. In case both the primary and secondary power supplies go down, the system will come to a stop.

Meaning: One of the following is taking place:
a. Pre-boot Password Violation - user password
b. Pre-boot Password Violation attempt - setup password
c. Pre-boot Password Violation - network boot password
d. Out-of-band Access Password Violation
e. Other pre-boot Password Violation
Effect: Platform security may have been compromised.

Meaning: The unit's temperature has exceeded the programmed threshold.
The supported threshold levels are:
- lnr (Lower Non Recoverable)
- lc (Lower Critical)
- lnc (Lower Non Critical)
- unc (Upper Non Critical)
- uc (Upper Critical)
- unr (Upper Non Recoverable)
Based on the threshold level crossed, the alarm will be raised with a different severity as mapped below:
lnr, unr: The alarm will be raised with severity "Critical".
lc, uc: The alarm will be raised with severity "Major".
lnc, unc: The alarm will be raised with severity "Warning".
Inlet: This typically means the ambient temperature is out of the limit.
Outlet: This means the chassis cooling field-replaceable unit was not able to cool the chassis. The reason can be broken fans, or some field-replaceable units (FRUs) consuming more power than expected.
Effect: The chassis which has the faulty cooling field-replaceable unit (FRU) is at risk of overheating. The risk of FRU overheating or FRU malfunction can be judged by the severity of the alarm. Severity level warning signifies that the FRU is still working within its specifications despite crossing the detected threshold limit. The higher severity signifies that the operation is out of specifications.

Meaning: This alarm reports an error in the system memory, discovered through a sensor present in the system. The possible causes are:
a. Correctable ECC (error-correcting code) or other correctable memory error;
b. Uncorrectable ECC or other uncorrectable memory error;
c. Parity error;
d. Memory scrub failed (stuck bit);
e. Memory device disabled; and
f. Correctable ECC or other correctable memory error logging limit reached.
Effect: The FRU may have erratic behavior or decreased functionality.

Meaning: The battery reading has exceeded the programmed threshold.
The supported threshold levels are:
- lnr (Lower Non Recoverable)
- lc (Lower Critical)
- lnc (Lower Non Critical)
- unc (Upper Non Critical)
- uc (Upper Critical)
- unr (Upper Non Recoverable)
Based on the threshold level crossed, the alarm will be raised with a different severity as mapped below:
lnr, unr: The alarm will be raised with severity "Critical".
lc, uc: The alarm will be raised with severity "Major".
lnc, unc: The alarm will be raised with severity "Warning".
This alarm may indicate:
- battery low
- battery missing
- battery failed (other reason)
Effect: Unit may boot up with wrong configuration or date and time.

Meaning: The fan speed has exceeded the programmed threshold.
The supported threshold levels are:
- lnr (Lower Non Recoverable)
- lc (Lower Critical)
- lnc (Lower Non Critical)
- unc (Upper Non Critical)
- uc (Upper Critical)
- unr (Upper Non Recoverable)
Based on the threshold level crossed, the alarm will be raised with a different severity as mapped below:
lnr, unr: The alarm will be raised with severity "Critical".
lc, uc: The alarm will be raised with severity "Major".
lnc, unc: The alarm will be raised with severity "Warning".
Effect: Fan speed out of limit may indicate a mechanical or an electrical problem with the fan, which can affect the cooling performance. The risk associated can be judged based on the severity of the alarm.

Meaning: This alarm indicates that one of the management nodes has detected that the contents of the disk are out of sync with the other node of the same type.
Effect: This alarm is raised during the boot process if the system notices that the disks of the management nodes are not identical. The process automatically puts the booting management node into inert mode and powers it off. The user has to take steps manually to get the disks in sync again.

Meaning: This alarm is triggered when the system notices that the shelf manager is unavailable: the shelf manager may be missing, or is not running in a healthy state. It is also possible that the shelf manager is experiencing a connection problem.
If the system is not able to contact the shelf manager after several retries (currently programmed for 35 retries, 1 retry in 1 second), it queries the existence of the shelf manager by pinging the main IP address, and an alarm is raised. The system then tries to switch over from the main IP address to one of the secondary IP addresses. If this also fails, the alarm is raised again.
Effect: System functionality and performance may degrade, or the system may not work at all.

Meaning: The inserted FRU (field-replaceable unit) does not match the expected unit based on the target hardware configuration. In case of Lynx build, the alarm is also raised if a unit is inserted into a slot which is expected to be empty.
FRU is a hardware component that can be removed and replaced on-site. Typical field-replaceable units include cards, power supply units, and chassis components.
Effect: System functionality and performance may be degraded, or the system may not be working at all.
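The threshold-to-severity mapping that the current, power unit, temperature, battery, and fan speed alarms share (lnr/unr to "Critical", lc/uc to "Major", lnc/unc to "Warning") can be sketched as a small lookup. The mapping itself comes from the alarm descriptions; the function name, the example threshold values, and the choice to treat a reading equal to a threshold as crossed are assumptions made for illustration.

```python
# Severity per threshold level, as given in the alarm descriptions.
SEVERITY_BY_LEVEL = {
    "lnr": "Critical", "unr": "Critical",
    "lc": "Major", "uc": "Major",
    "lnc": "Warning", "unc": "Warning",
}

# Checked from most to least severe so the worst crossed level wins.
CHECK_ORDER = ("lnr", "unr", "lc", "uc", "lnc", "unc")


def severity_for_reading(reading, thresholds):
    """Return the alarm severity for a sensor reading, or None if in range.

    thresholds maps level names (lnr, lc, lnc, unc, uc, unr) to values;
    levels that a sensor does not support may be omitted.
    Assumption: a reading equal to a threshold counts as crossed.
    """
    for level in CHECK_ORDER:
        limit = thresholds.get(level)
        if limit is None:
            continue
        is_lower = level.startswith("l")
        if (is_lower and reading <= limit) or (not is_lower and reading >= limit):
            return SEVERITY_BY_LEVEL[level]
    return None


# Made-up temperature thresholds in degrees Celsius.
EXAMPLE_THRESHOLDS = {"lnr": 0, "lc": 5, "lnc": 10, "unc": 70, "uc": 80, "unr": 90}
```

With these example thresholds, a reading of 75 maps to "Warning", 85 to "Major", and 95 to "Critical", while a reading of 40 raises no alarm.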
This alarm indicates that an Internet Control Message Protocol (ICMP) monitoring session was switched from its UP to DOWN state.

This alarm is raised because two-way connectivity between the local node and remote node is not functional.

The peer network element that is under ICMP monitoring is unreachable. If the alarm is not cleared automatically, it might require operator intervention to bring up the network element. The application(s) dependent on the ICMP monitored link might be affected.

The alarm indicates that in order to start the cluster, one or more distributed replicated block device (DRBD) disk devices have to be forced active. The alarm is raised when a cluster manager node which previously had a working peer node starts up, and the peer cluster manager node does not come up within the defined maximum wait time.

The reason for the peer cluster manager node failure can, for example, be a disk or some other HW failure, an explicit operator action, or a software bug.

Some data may have been lost from the DRBD partitions. During normal cluster start up, both cluster manager nodes come up, and the DRBD drivers can reliably determine which cluster manager node has the most recent updates. DRBD devices are then synchronized from the most recent version, ensuring that, for example, no log records are lost.

In the situation where this alarm is raised, the DRBD device(s) are forced to start using the current data copy. This does not present a real problem if the failed cluster manager node went down about the same time as the remaining operational cluster manager node.

If, however, the failed peer cluster manager node had earlier been running alone, it is possible that disk updates have been lost. For example, parts of logs may be missing, the known list of active alarms may be different, and application storage may have lost transactions.

During upgrade, the activated build rebooted the system several times. Due to the autoreturn feature, the system is booted back to the old build.

The newly activated delivery is not able to boot properly. Hence, autoreturn to the old working delivery has been automatically done.

This alarm indicates that the ClusterTraceManager's CPU usage has crossed the configured threshold limit (WARNING or MINOR) for a pre-configured period of time.

Exceeding the threshold limit indicates that tracing is causing a significant load on the system.

This alarm reports an error in the system memory, discovered through a software mechanism. There are two classes of errors reported:
1. Correctable error rate over limit
2. Uncorrectable error detected

Class 1 is an early warning, which usually indicates a more serious condition developing. It is raised with severity "warning", "minor", or "major" depending on the number of correctable errors that occurred during the measurement period.

Class 2 indicates that an error already caused malfunction of the affected node, and is raised with severity "major".

The kernel has the capability to repair the correctable errors in ECC-enabled memory, and it will seamlessly do so.

If the rate of correctable errors exceeds the predefined threshold, then this alarm will alert the user about the detected abnormal error rate.

In case of an uncorrectable error, the kernel cannot repair it and such errors will affect the program running on the node. As a recovery attempt, the affected node will be restarted automatically. If the memory error was due to a transient cause, this may solve the problem.

This alarm indicates that a field-replaceable unit (FRU), or hardware module associated with the FRU is missing, or the hardware module associated with the FRU is not responding.

The present system configuration does not correspond to the deployment configuration. It may lack a critical resource, functionality, or redundancy, and may operate in degraded mode.

Physical disk space of the cluster is getting full. There might be no space available to install new software (SW) deliveries.

SW delivery installation might fail.

Physical disk space of the cluster is almost full. There will be no space available to install new software (SW) deliveries.

SW delivery installation will fail.
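The two disk space alarms for the cluster ("getting full" and "almost full") suggest a two-level usage check. The Python sketch below shows one way such a check could look. The two fractions are hypothetical placeholders, because the text does not give the actual limits, and the function names are illustrative only.

```python
import shutil
from typing import Optional

# Hypothetical limits: the document does not state the real thresholds.
GETTING_FULL_FRACTION = 0.80
ALMOST_FULL_FRACTION = 0.95


def classify_disk_usage(used_bytes: int, total_bytes: int) -> Optional[str]:
    """Map disk usage to one of the two alarm conditions, or None."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = used_bytes / total_bytes
    if fraction >= ALMOST_FULL_FRACTION:
        return "almost full"   # SW delivery installation will fail
    if fraction >= GETTING_FULL_FRACTION:
        return "getting full"  # SW delivery installation might fail
    return None


def check_path(path: str = "/") -> Optional[str]:
    """Classify the file system that contains the given path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify_disk_usage(usage.used, usage.total)
```

The stricter limit is tested first so that a nearly full disk is reported only under the more severe condition.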
This alarm indicates that the total available system memory on the Local Management Processor (LMP) node is low.

This alarm is an indication that the memory on the unit is running low. If the memory runs out completely, the unit is restarted automatically as a recovery action.
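As an illustration of a low-memory check of this kind, the following Python sketch parses text in the Linux `/proc/meminfo` format and compares `MemAvailable` with a fraction of `MemTotal`. It assumes a Linux-based node, and the 10% limit is a hypothetical value. The document does not describe how the product measures available memory or what threshold it uses.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse '/proc/meminfo'-style text into a {field: value in kB} mapping."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_is_low(meminfo_text: str,
                  min_available_fraction: float = 0.10) -> bool:
    """Return True if available memory is below the given fraction of total.

    The default fraction is a hypothetical limit chosen for illustration.
    """
    info = parse_meminfo(meminfo_text)
    return info["MemAvailable"] < min_available_fraction * info["MemTotal"]
```

On a Linux host the input can be obtained with `open("/proc/meminfo").read()`.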
This alarm indicates that an SCCP instance's utilization of one of the signaling resources is nearing the configured maximum capacity.

The situation where the signaling resources' utilization is nearing the configured maximum capacity may arise due to:
a. A software fault that erroneously does not release the resources when they are not needed and thus leads to a leak of the resource.
b. An overload situation.
c. An insufficient dimensioning of the system either at the deployment phase or object administration phase.

The SCCP instance may utilize the signaling resource "CONNECTION_CONTROL_BLOCK" wherein the signaling connection control block is needed for successfully establishing the signaling

This is a warning alarm indicating that the signaling resource utilization by the SCCP instance is nearing the maximum capacity. This is to prevent a situation where the resource utilization exceeds maximum capacity, as this will lead to failures in the establishment of signaling connections and thus to a drop in the KPI.

This alarm indicates that the utilization of one of the signaling resources by the SCCP instance has reached the configured maximum capacity and there has been an attempt to reserve more resources than what is allowed by the system dimensions.

The exhaustion of the signaling resources may arise due to:
a. A software fault that erroneously does not release the resources when they are not needed and thus leads to a leak of the resource;
b. An overload situation or due to an insufficient

The alarm indicates that there is an attempt to consume more signaling resources than what is allowed or dimensioned according to the product deployment. This will typically cause failures in the establishment of signaling connections and might result in KPI drop.
This alarm indicates that there are disturbances in the SCCP stack. The following reasons may be the cause of the disturbances:

1. RLC_FAILURE:
The releasing of the signaling connection failed as no acknowledgment was received. When the Released (RLSD) message was sent in the connection section and if the Release Complete (RLC) was not received before the expiry of the T(int) timer, the connection would be released.

2. IAR_EXPIRED:
The releasing of the signaling connection due to Inactivity timer expiry. When the Connection Confirm message (CC) was received by the network node, T(iar) timer is started and if there are no SCCP messages exchanged in this connection section until the T(iar) expires, then the connection would be released.

3. INVALID_IT_MSG:
The releasing of the signaling connection due to invalid Inactivity (IT) messages received. Invalid Inactivity (IT) message could be any of the following:
a. Incoming Source local reference number (SLRN) in the received Inactivity (IT) message is not matching with the locally stored DLRN in the corresponding connection control block (CCB) at the SCCP stack.
b. Incoming Destination local reference number (DLRN) in the received Inactivity (IT) message is not matching with the locally stored SLRN in the corresponding connection control block (CCB) at the SCCP stack.
c. The received Inactivity (IT) message data is not matching with the data stored locally in the connection control block (CCB) at the SCCP stack, as the connection control block (CCB) data is either corrupted or inconsistent.
d. The proto_class in the received Inactivity (IT) message is either class-0 or class-1 and not matching the class-2.

4. ERR_MSG:
The releasing of the signaling connection due to protocol data unit error (ERR) message received from the peer network element.

5. ROUTE_FAILURE:
The releasing of the signaling connection due to a remote signaling point code that transitioned to "INACCESSIBLE" status.

6. UNEXPECTED_CONN_MSG:
The releasing of the signaling connection due to receipt of the Connection Confirm (CC) message, Connection Refused (CREF) message, or Release Complete (RLC) message for a signaling connection which is in the established state. This is applicable for ANSI only.

The alarm indicates failure of the signaling protocol procedures and thus leads to releasing of abnormal signaling resources, for example, signaling connections.

This alarm indicates that the SCTP association is congested. A congestion is determined when the outbound messages towards SCTP failed with error code EAGAIN. Due to this, the outgoing messages on the specific SCTP association are dropped until the failure is recovered. However, the SCTP association will not change the connection status.

A congestion occurs when SCTP buffers are exhausted and unable to accept further outbound messages due to an overload situation.

This alarm indicates congestion on the SCTP association. Due to this, the outgoing signaling message is dropped, which means that the signaling traffic is disturbed.

This alarm indicates that there is a drop in the success rate of signaling connection establishment. The success rate of the signaling connections within a signaling instance is checked with a sampling interval of 30 seconds. The success rate includes the overall incoming and outgoing signaling connections (preconfigured to a minimum of 100 connections) in the network. This alarm will be reported if the success rate drops below the limit defined in the product deployment (preconfigured to 90%) or the limit defined in the SCCP configuration during commissioning of the system. The operator must consider that this alarm might also indicate paging response failures and IMSI detach scenarios which may not affect the KPIs, unless these failures are part of monitored KPIs.

The alarm indicates failure of some or all signaling connection establishment attempts in the network, thus leading to a drop in the KPI.

The signaling configuration has been successfully validated and also activated, or added, to the signaling service. After activating the signaling object, the runtime configuration in the service has been modified dynamically due to the mis-configurations at the network, for example, the remote network element. This will result in inconsistency between the configuration database and the signaling services.

This leads to unintended activation of the signaling object and/or signaling traffic as per the configuration database. This also misleads the inquiry reports.

This alarm is raised whenever the power-throttling mechanism is activated on a node. Power-throttling is performed on the node when the temperature of either the processing resources, or the air inlet of the component, has exceeded the predefined threshold levels. The power-throttling reduces the maximum power level which can be drawn by the affected node. The method and the amount of throttling depend on the type of the component. In addition, power-throttling is performed (a fallback action to bring the processing resource to a safe reduced mode) when the thermal controller process exits, possibly due to manual intervention.

In addition to this power-capping for the node, there are three levels of power throttling employed:

1. Throttled-Light: When the temperature of the processing resource has exceeded the minor threshold, the percentage of throttling applied depends on the component type. An alarm with severity "warning" is raised in this state.

2. Throttled-Mid: When the temperature of the processing resource has exceeded the major threshold, the percentage of throttling applied depends on the component type. An alarm with severity "minor" is raised in this state.

3. Throttled-Deep: When the temperature of either the processing resource or the air inlet of the component has exceeded the critical threshold, the percentage of throttling applied depends on the component type. An alarm with severity "major" is raised in this state.

Throttling reduces the performance of the target node depending on the level applied, and hence, may also reduce system performance depending on the target role. Leaving the condition unattended for a long period of time may also affect the long term reliability of the system.

SCTP Multi-homed association is created to provide path resiliency for network failures. Generally, it is possible to create two paths to each SCTP association. If either of the paths fails, this alarm is set. If the primary path fails, then the traffic is automatically transferred to a secondary path so that the traffic is not affected. Furthermore, if the primary path recovers, then the traffic is automatically transferred back to the primary path from the secondary path. If the secondary path fails, then the traffic continues normally via the primary path and only the redundant connection is lost. This SCTP association is used for M3UA or IUA protocol, as specified in the first additional information field ("Protocol").

This alarm can be set only if the association is active at the SCTP layer, that is, the status of the association is other than "connection_down". This is because the path status is supervised inside an active association by SCTP. If the SCTP association takes the connection status as "connection_down" due to either a graceful termination or an abnormal termination, this alarm is cancelled. The alarm "SCTP ASSOCIATION FAILURE" is raised only when the association is terminated abnormally. When the SCTP association becomes active again later on, this alarm is set again if there is a path failure for a path with the new association.

The alarm indicates a path failure in the working SCTP multi-homed association. The system takes into use an alternative path.

All SCTP associations in the association set are unavailable. There are no connections between the local network element and the remote network element. Here, network element refers to "Application Server". There is something wrong with the data transmission connections of the associations of the association set, and/or the associations have been blocked. The network element automatically attempts to re-establish the associations.

There is total failure in the reachability of the remote network element resulting in signaling traffic downtime.

This alarm indicates that there has been an internal failure in the signaling components which may affect the provisioning of signaling in the signaling services, which may affect the signaling services' functionality. These errors are mostly generated because of abnormal functioning of the network element due to environmental defects. In case of such failures, the signaling services will try to recover automatically from failure and if the failure persists then recovery is done only after restarting the faulty components. In some cases, restarting the failed signaling services is needed explicitly. In some other cases, the system might automatically restart the faulty signaling services.

This alarm is raised due to one of the following failures:

LM_CONNECTION_FAILURE:
A connection failure has occurred between the SGWNetMgr (Signaling services central part) and the stack layer management component. Prior to raising the alarm, the stack level layer manager will exhaust the attempts to reconnect to the SGWNetMgr. If the operator triggers a change in the signaling configuration when the alarm is active, it will create inconsistency between the configuration database and the signaling services, that is, the configuration will be added to the configuration database after a successful validation but it will not be reflected in the signaling services.

These are critical errors which will cause the signaling components to malfunction. They indicate shutdown or restart of the entity in question depending upon the failure described in the "Meaning of the alarm" section.

If the failure type is "CONFIG_DB_FAILURE", the signaling services will not start up successfully. And if the alarm exists when the signaling services have already started, then the configuration validation and activation will fail.

If the failure type is "LM_CONNECTION_FAILURE", the activation of the configuration on the signaling stack which raised the alarm will fail.

If the failure type is "SCCP_ACTIVE_STANDBY_SYNCUP_FAILURE", then the active SCCP connections are not getting synchronized between the active and standby SCCP, compromising the high availability of those SCCP connections. If the SCCP switchover is triggered when this alarm is active, all the SCCP connections managed by the recovery group which issued the alarm will be dropped.

If the failure type is "DISTRIBUTED_STACK_SYNCUP_FAILURE", the recovery unit which loses the TCP connection will not participate in the distribution and redundancy. This will result in decreased

The signaling configuration activation on the signaling stack failed. The configuration input was successfully validated to be activated on the stack. Failure during the activation is rare and it is mostly due to invalid configurations for which validation has erroneously succeeded, or due to runtime status of the system which made the activation of the configuration fail. This will result in inconsistency between the configuration database and the signaling services.

Failure during activation will result in inconsistency between the configuration database and the signaling stack.

The alarm indicates that one of the services in the NE has failed to start up and is not available for use. The "Managed object" field in the alarm specifies the service name that has failed.

The effect depends on the service that has failed.
1. netconsole: kernel crash logs of the affected node will not be available in the system master and local syslogs. The alarm for netconsole does not exist in:
a. single node deployments;
b. for the management node in a deployment that has only one management node.

The SCTP association has terminated abnormally, that is, not by normal abort or shutdown procedure of the SCTP. The SCTP has terminated this association due to the lack of SACK chunks or HEARTBEAT ACK chunks from the peer endpoint, so the reason for the termination is either in the underlying IP network or in the peer endpoint (for example, reset).

If this SCTP association is used in M3UA protocol, then the M3UA connection establishment has not succeeded due to either inconsistencies of the ASP state machine between the local and remote network elements, or incompatible M3UA configurations between the local and remote network elements.

In override mode, the state machine of the standby association is moved from "ASSOC_STATE_INACTIVE" to

If this association was the only available association in the association set, there is total failure in the reachability of the remote network element resulting in signaling traffic downtime.

On the other hand, if there are remaining available associations in the association set, the signaling traffic to the network element, reachable via the association set, is distributed to the remaining associations in the association set. In this case, the signal transmission capacity may be decreased.
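The effect of an SCTP association failure on an association set, as described for the SCTP association failure alarm, can be sketched in Python as follows. The function name and the returned dictionary are hypothetical. The even split of traffic across the remaining associations is an assumption made for illustration, since the text only says that the traffic is distributed to the remaining associations.

```python
from typing import Dict


def association_set_effect(available: Dict[str, bool]) -> Dict[str, object]:
    """Summarise reachability and traffic shares for an association set.

    `available` maps an association name to True if it is available.
    """
    up = [name for name, is_up in available.items() if is_up]
    if not up:
        # No association left: the remote network element is unreachable.
        return {"reachable": False, "shares": {}}
    # Assumption: traffic is split evenly over the remaining associations.
    share = 1.0 / len(up)
    return {"reachable": True, "shares": {name: share for name in up}}
```

With three associations of which one has failed, each remaining association carries half of the traffic in this sketch, which reflects the note that the transmission capacity may be decreased.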
This is an operator-specified customizable alarm, which reports events external to the host system. The alarm is raised when the external alarm input assumes its assigned alarm state (defined by input polarity setting). The description of the external events is given in separate operator-provided instructions.

The significance of the external event that raised this alarm is described in separate operator-provided instructions.

This alarm indicates that the signaling object "SigObjectID" as specified in the "Identifying application additional information" field has changed status a number of times based on the "ObjectFluctuationThreshold" parameter, and within the span based on the "CriticalAlarmTimer" parameter. The parameters "ObjectFluctuationThreshold" and "CriticalAlarmTimer" are defined as part of product deployment and appear in the "Application Additional Information" field of the alarm.

This indicates transmission break. Due to this, signaling traffic is disturbed and as a consequence there is a significant KPI drop or even a possibility of network outage.

This alarm indicates that the D-channel has terminated abnormally, that is, not "disabled" administratively. The failure may be due to a fault in the primary rate access terminal used by the D-channel, or a fault in the D-channel connections, or transmission failure within the SGW (that is, between LAPD and IUA), or non-operation of the remote end.

The signaling on the D-channel in question is broken.

This alarm, when raised with a "Critical" severity, indicates that a flood of hardware events has been detected, which could be caused by a faulty hardware component. The system takes

As long as the faulty hardware component has not been isolated, the system will drop the excess events, which may lead to inconsistent hardware alarms state.
automatic
This action by limiting the ratenotifications
of hardware
This alarm
alarm suppresses other
is raised where alarm
either NE-wide eSW Either eSW installation or activation has failed
alarm
relatednotifications
to the tostatus
object a fixedchange
value, to
setavoid
in
installation, or NE-wide eSW activation has and may cause the system to malfunction, or
deployment,
flooding the and dropping
syslog with the notifications.
alarm rest.
failed. This failure is either full or partial. The may endanger the reliability and failover in the
failure can be one firmware component in a blade later phase if the issue is not solved and fixed.
or multiple firmware components in a blade, or The user is advised not to power off any blades
The alarm indicates that running configuration of Failure in applying correct configuration on the
failure would have occurred in multiple blades or nodes at this stage.
the switch is not in sync with the configuration switch may lead to improper functioning of the
with multiple firmware components. User
files present on the management node. switch.
attention is required to investigate the issue, fix
the problem, and retry the operation.
The alarm is raised only if the difference is found.
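
The fluctuation condition described for the signaling object alarm above (a number of status changes within a time span) can be illustrated with a sliding-window counter. This is only a sketch under stated assumptions: the class name, the use of seconds, and the "greater than or equal" comparison are hypothetical, because the text does not say how the product compares the count against "ObjectFluctuationThreshold" or how "CriticalAlarmTimer" is expressed.

```python
from collections import deque


class FluctuationDetector:
    """Counts status changes inside a sliding time window (hypothetical helper)."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        # threshold stands in for ObjectFluctuationThreshold,
        # window_seconds stands in for CriticalAlarmTimer (units assumed).
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._changes = deque()

    def record_change(self, timestamp: float) -> bool:
        """Record one status change; return True if the alarm condition is met."""
        self._changes.append(timestamp)
        # Drop changes that fell out of the window ending at this timestamp.
        while self._changes and timestamp - self._changes[0] > self.window_seconds:
            self._changes.popleft()
        return len(self._changes) >= self.threshold
```

With a threshold of 3 and a 10-second window, three changes at 0, 4 and 9 seconds meet the condition, while changes at 0, 4 and 20 seconds do not.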
1. Identifies the managed object (MO) name of the new active RU.
1061 26600 glusterfs".
The attribute ID for which the alarm is raised is a combination of the omes ID and the attribute number; for example, m2002c0001 is the attribute ID in which m2002 is the omes ID and c0001 is the attribute number.
Possible values are:
1. Minimum threshold exceeded
2. Minimum threshold no longer exceeded
3. Monitoring stopped

1. reason, possible values: 1 - file cannot be opened, 2 - permanent file read error
2. additional information about the problem (for example, text of the corresponding system exception)
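
The attribute ID format described above (for example, m2002c0001) can be split into its two parts with a small helper. This is a minimal sketch: the pattern "m" plus digits followed by "c" plus digits is an assumption inferred from the single example in the text, and the function name is hypothetical.

```python
import re
from typing import Tuple

# Assumed layout, inferred only from the example m2002c0001:
# "m" + digits (omes ID) followed by "c" + digits (attribute number).
_ATTRIBUTE_ID = re.compile(r"^(m\d+)(c\d+)$")


def split_attribute_id(attribute_id: str) -> Tuple[str, str]:
    """Return (omes_id, attribute_number) for an attribute ID string."""
    match = _ATTRIBUTE_ID.match(attribute_id.strip())
    if match is None:
        raise ValueError("unexpected attribute ID format: %r" % attribute_id)
    return match.group(1), match.group(2)
```

For the example from the text, the helper returns m2002 as the omes ID and c0001 as the attribute number.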
Possible values:
1. Invalid attribute's value or empty string if the attribute or
its value is missing.
2. "LDAP server unavailable, using default configuration
parameters" if connection to LDAP has failed.
1. Invalid record (please note that the field can hold no more than ~390 symbols, so the original invalid record can be cut).
2. Error code, possible values:
1 - missing mandatory field;
2 - duplicated field;
3 - empty record;
4 - non-alarm data record.
3. Field name (for missing or duplicated field).

Data from the original alarm:
1. MOId
2. Specific problem
3. Identifying application additional info
4. Application additional info from the original alarm.
(The application ID is present in the MOId field of the alarm)

1. Heartbeat interval in seconds.
1. VrfId
2. Source address of the session
3. Destination address of the session
4. Hop type - The possible values are:
a. M - For multihop
b. S - For singlehop
5. Reference ID - This is an optional parameter used in multihop solution.
6. Diagnostic code indicating why the session was previously DOWN. The possible values for the code are:
0 - No_Diagnostic
1 - Control_Detection_Time_Expired
2 - Echo_Function_Failed
3 - Neighbor_Signaled_Session_Down
4 - Forwarding_Plane_Reset
5 - Path_Down
6 - Concatenated_Path_Down
7 - Administratively_Down
8 - Reverse_Concatenated_Path_Down
7. The time when the session failure was observed.

4. Limit configured in Configuration Directory fsdbConnectionsAlarmLimit (default is n = 10)
5. Limit configured in Configuration Directory fsdbConnectionsCheckFreq (default is 10 seconds)

1. Feature code
2. License state (possible values are "OFF" and "ON")
3. Feature Admin State (possible values are "OFF" and "ON")

1. Error Code. The possible values are:
SCLI_DAEMON_NOT_AVAILABLE - The SCLI daemon was not available at the time of execution of the user/subsystem configuration script.
FSCONFIGURE_SAVE_ERROR - The fsconfigure -- save command did not execute successfully, hence the applied configuration was not saved.
USER_CONFIG_SCRIPT_ERROR - The script mentioned in the second field failed to execute.
2. Script name (optional)
This is an optional parameter which indicates the name of the script that failed.
Note that this parameter is only applicable in case of the USER_CONFIG_SCRIPT_ERROR code.
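
The session diagnostic codes listed above (0 to 8) lend themselves to a simple lookup when alarm fields are post-processed. The sketch below only restates the values from the list; the function name and the fallback text for unknown codes are hypothetical.

```python
# Diagnostic code values exactly as listed in the supplementary field description.
BFD_DIAGNOSTIC_CODES = {
    0: "No_Diagnostic",
    1: "Control_Detection_Time_Expired",
    2: "Echo_Function_Failed",
    3: "Neighbor_Signaled_Session_Down",
    4: "Forwarding_Plane_Reset",
    5: "Path_Down",
    6: "Concatenated_Path_Down",
    7: "Administratively_Down",
    8: "Reverse_Concatenated_Path_Down",
}


def describe_diagnostic_code(code: int) -> str:
    """Return the symbolic name for a diagnostic code, or a fallback text."""
    return BFD_DIAGNOSTIC_CODES.get(code, "Unknown diagnostic code %d" % code)
```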
1. Error code
Example: CERT_NO_RW_PERMISSION
Possible values for error codes:
- CERT_NO_RW_PERMISSION
This error code indicates that the directory is in a read-only mode or the parent directory resides on a read-only file system.
- CERT_NO_RESOURCES
This error code indicates that there is no free disk space on the device for creating a file, or the resources available in the system are insufficient to perform a write operation.
- CERT_MAX_FILE_COUNT
This error code indicates that the maximum allowed number of files are currently open in the system.
- CERT_CREATION_FAILED
This error code indicates that a component of the path (prefix specified by the path) does not name an existing directory, the path is an empty string, or the path argument specifies the slave side of a pseudo-terminal device that is locked.

1. Error code:
Error code is the error code of LDAP server or RUIM providing detailed information for the problem type. Possible error codes and their explanation are given below. For each problem type, corresponding LDAP Server or RUIM error codes are given below in order to provide more detailed information.
Example: LDAP_CONNECT_ERROR
The following values are possible for the error codes:
LDAP_CONNECT_ERROR
LDAP_PROTOCOL_ERROR
LDAP_TIMELIMIT_EXCEEDED
LDAP_UNAVAILABLE_CRITICAL_EXTENSION
LDAP_CONFIDENTIALITY_REQUIRED
LDAP_UNWILLING_TO_PERFORM
PMG_RESULTS_NOT_GET_YET
RUIM_TIMEOUT

1. WARNING_FORWARDING_TABLE_LIMIT_EXCEEDED
2. ERROR_FORWARDING_TABLE_ADDITION_FAILED
WARNING_FORWARDING_TABLE_LIMIT_EXCEEDED:
This message indicates that the number of routes added in the local forwarding table of a node has caused an excess of the maximum number of routes supported in the node.
ERROR_FORWARDING_TABLE_ADDITION_FAILED:
This message indicates that one or more routes could not be added in the local forwarding table of a node, because the capacity of the forwarding table has been exceeded.

Route limit exceeded: The soft limit of the number of supported routes is exceeded and the user should proceed cautiously when adding more routes.
Route4 addition failed: The hard limit of the number of supported routes is reached and the addition of IPv4 route has failed.
Route6 addition failed: The hard limit of the number of supported routes is reached and the addition of IPv6 route has failed.

Test-license state is enabled.

1. Reachable peers
2. Unreachable peers

1. General Information
Possible values:
A) sysPeer not chosen
NTP is unable to select a server.
B) The sync time difference is beyond the allowed offset
The time difference between NTP server and NTP client is greater than the allowed offset.
C) ntpdc polling is not successful
Monitoring of NTP time sync has failed.
D) NTP is syncing
NTP is syncing to this server.
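
The forwarding table messages above distinguish a soft limit (a warning, routes can still be added) from a hard limit (the addition fails). The following sketch shows one way such a two-level check could look. The function name, its parameters, and the exact comparison points are assumptions for illustration; the text does not give the real limit values or how the node evaluates them.

```python
from typing import Optional

WARNING = "WARNING_FORWARDING_TABLE_LIMIT_EXCEEDED"
ERROR = "ERROR_FORWARDING_TABLE_ADDITION_FAILED"


def check_route_addition(current_routes: int, soft_limit: int, hard_limit: int) -> Optional[str]:
    """Classify an attempt to add one route (hypothetical two-level limit check)."""
    if soft_limit > hard_limit:
        raise ValueError("soft limit must not exceed hard limit")
    if current_routes >= hard_limit:
        # Table is already at capacity: the addition fails.
        return ERROR
    if current_routes + 1 > soft_limit:
        # The route is added, but the soft limit is now exceeded.
        return WARNING
    return None
```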
;SN:<serial-no> ISSID:<issuer-id>
1. SN: Serial number of the certificate
2. ISSID: Issuer ID of the certificate
Example:
"SN: A08038 ISSID: /C=IN/O=nokia/CN=FlexiRootCA" means the certificate has serial number as A08038 and the issuer of the certificate is "/C=IN/O=nokia/CN=FlexiRootCA".

C:<EE> D:<domain> DTE:<days> CET:<end-time>
C:<CA> D:<domain> DTE:<days> CET:<end-time> CT:<cert-type> CAID:<ca-id>
C:<EE> D:<domain> Expired
C:<CA> D:<domain> Expired CT:<cert-type> CAID:<ca-id>
1. C: Certificate (possible values are "EE" and "CA")
2. D: Domain
3. DTE: Days To Expire
4. CET: Certificate End Time
5. CT: CA Certificate Type (possible values are "new-with-new", "old-with-old", "old-with-new" and "new-with-old")
6. CAID: CA identifier of the CA certificate
Examples:
1. "C:EE D:default Expired" means the EE certificate present under the default domain has already expired.
2. "C:EE D:default DTE:5 CET:2013-03-15 09:19:24-10:00" means the EE certificate present under the default domain is going to expire in 5 days, and the certificate will not be usable after 09H 19M 24S of 2013-03-15.
3. "C:CA D:default Expired CT:new-with-new CAID:common" means the CA certificate of type new-with-new present under the default

1. Reason for failure. The possible reasons for failure are:
a. UNABLE_TO_FETCH_ROOTCA: The root certification authority (CA) certificate or trust anchor was missing for the corresponding NE service domain. The root CA certificate or trust anchor is mandatory for KUR operation initiation.
b. CMP_REQUEST_FAILED: The CMP parameters of "default" domain were either missing or improperly configured. The CMP parameters will always be fetched from the "default" domain while performing KUR operation for any of the NE service domains.
c. UNABLE_TO_INSTALL_EE_CANDKEY: The EE candidate private key required for performing KUR operation is not auto-generated successfully and is not configured/installed for the corresponding NE service domain.

1. Name of the current start-up snapshot.
2. Name of the previous start-up snapshot.
3. LDAP configuration changes apart from KUR changes saved. The possible values are Yes/No.
Yes - Current start-up snapshot includes unsaved configurations before KUR.
No - Current start-up snapshot includes only KUR changes, no other unsaved configurations before autoKUR.
4. Comma separated list of NE service domains for which automatic KUR update succeeded.
Example:
config-R_FPT_170.3.WR.64.r.1602220650.395744-INITIAL config-R_FPT_170.3.WR.64.r.1602220650.395867-INITIAL Yes swmgmt,ruim

1. Failed node name
2. Expected minimum RAM amount
3. Actual RAM amount

1. Failed node name
2. Expected CPU count
3. CPUs found
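
The certificate expiry strings shown earlier (for example "C:EE D:default DTE:5 CET:2013-03-15 09:19:24-10:00") are key and value pairs with an optional bare "Expired" marker, so they can be split mechanically. The parser below is a sketch based only on the four formats and three examples in the text; the function name and the returned dictionary layout are hypothetical.

```python
import re
from typing import Dict, Union

# Keys taken from the field list: C, D, DTE, CET, CT, CAID.
_KEY = re.compile(r"(?<!\S)(CAID|CET|DTE|CT|C|D):")
_EXPIRED = re.compile(r"(?<!\S)Expired(?!\S)")


def parse_certificate_info(text: str) -> Dict[str, Union[str, bool]]:
    """Split a certificate info string into its fields plus an 'expired' flag."""
    expired = _EXPIRED.search(text) is not None
    cleaned = _EXPIRED.sub(" ", text)
    matches = list(_KEY.finditer(cleaned))
    fields: Dict[str, Union[str, bool]] = {}
    for index, match in enumerate(matches):
        # A value runs until the next key, so CET keeps its embedded space.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(cleaned)
        fields[match.group(1)] = cleaned[match.end():end].strip()
    fields["expired"] = expired
    return fields
```

Keeping the value up to the next key, rather than splitting on spaces, is what preserves the date and time in the CET field as one value.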
1. Configuration entry
The name and value of the attribute that is out of order
under the fssnmpMediatorName=1, fsFragmentId=SNMP,
fsClusterId=ClusterRoot branch.
1. IP address of the SNMP agent that does not
respond.
1. IP address of the SNMP agent that had sent the trap
2. Object identifier of the received trap

1. Version of the used SNMP, possible values are:
SNMPv1
SNMPv2c

1. IP address
The trap was generated because this IP address entity had an incorrect community string.
Faulty delivery <delivery name> Autoreturn delivery <delivery_name>
Bootcount limit <bootcount_limit> Autoreturn reason <autoreturn_reason>

5. Hop type - The possible value is M - MultiHop.
6. Reference ID
1. Class of the memory error occurred (correctable/uncorrectable)
2. Type and ID of the affected memory module (DIMM-ID/cache)
Example1:
Class= correctable
Affected memory= DIMM-2
Example2:
Class= uncorrectable
Affected memory= cache (L2D)

1. Error rate for correctable type
2. Affected memory location for uncorrectable and correctable type
Example1:
Error rate= 20/ 24h
Affected location= DIMM ID=2 , Channel ID= 3
Example2:
Affected location= core: 5 address: 477e5 syndrome0: 94 syndrome1: 0

1. Name of the affected unit
2. Position of the affected unit
3. Sensor name
4. Sensor Type

1. Volume group name
This field shows the threshold value which has been surpassed, and thus the alarm has been raised.

Sample Output:
ErrorCode:DCHANNEL_STARTUP_FAILED
1. Type of the affected unit.
2. Error type.
3. Position of the affected unit.
4. Sensor number.
For example: Unit={BS2AM-A} ErrorType={01} Position=/chassis-1/AMC-2 Sensor={number=186}
5. Error description
For example: ToP Lock Fail

1. Type of the affected unit.
2. Error type.
3. Position of the affected unit.
4. Sensor number.
For example: Unit={BS2AM-A} ErrorType={02} Position=/chassis-1/AMC-2 Sensor={number=174}
5. Error Description
For example: OS initiated hard reset

1. Type of the affected unit.
2. Error type (= asserted sensor offset).
3. Position of the affected unit.
4. Sensor number.
For example: Unit={BS2AM-A} ErrorType={02} Position=/chassis-1/AMC-2 Sensor={number=186}
5. Error Description
For example: HWM process failure

1. Type of the affected unit.
2. Error type (= asserted sensor offset).
3. Position of the affected unit.
4. Sensor number.
For example: Unit={BS2AM-A} ErrorType={06} Position=/chassis-1/AMC-2 Sensor={number=186}
5. Error Description
For example: SYNC default config loading error

1. Fibre channel switch module address
2. Switch Status
The table lists the values of the Switch Status. For detailed information please see the Switch user guide.
Value Meaning
1 Unknown
2 Unused
3 OK
4 Warning
5 Failed
3. Switch State
The table lists the values of the Switch State. For detailed information please see the Switch user guide.
Value Meaning
1 Unknown
2 Online
3 Offline

1. Fibre channel switch module address
2. Fibre channel port ID
1. Port Status
The table lists the values of the Port Status.
Value Meaning
1 Unknown
2 Unused
3 OK
4 Warning
5 Failure
6 Notparticipating
7 Initializing
8 Bypass
2. Port State
The table lists the values of the Port State.
Value Meaning
1 Unknown
2 Online
3 Offline
4 Bypassed
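
The sensor alarm examples above use strings of the form Unit={BS2AM-A} ErrorType={01} Position=/chassis-1/AMC-2 Sensor={number=186}. A small parser can turn such a string into named fields. This is a sketch inferred from those example strings only; the function name is hypothetical, and values are kept as text because the meaning of ErrorType depends on the alarm.

```python
import re
from typing import Dict

# A field is NAME=value, where the value is either wrapped in braces
# or a run of non-space characters (as in the Position field).
_FIELD = re.compile(r"(\w+)=(\{[^}]*\}|\S+)")


def parse_unit_fields(text: str) -> Dict[str, str]:
    """Parse 'Unit={...} ErrorType={...} Position=... Sensor={...}' strings."""
    result: Dict[str, str] = {}
    for key, raw in _FIELD.findall(text):
        # Strip the surrounding braces when present.
        if raw.startswith("{") and raw.endswith("}"):
            raw = raw[1:-1]
        result[key] = raw
    return result
```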
Instructions
Fill in a problem report, and then send it to your local customer support.
Check the active alarms that are overflowing in the Network Element with
an alarm management application and correct
them according to their instructions. If this is not possible or does not clear
the alarm raised, fill in a Problem Report and
send it to your local customer support.
Perform the following steps to verify the state of the node:
1. Log into the controller node.
2. Check the state of the failing node VM from the compute node. For example, for node IB7-0:
> nova list --name IB7-0
3. The output in the previous step shows whether the IB7-0 node is available. If the node is powered off, then after 30 minutes the high availability services (HAS) of the system attempt to restart the failed node by issuing a restart. You can issue a restart manually if time is an issue. Use the following SCLI commands for powering on and off.
For example: > nova reboot <nodeid> where <nodeid> is the id of the IB7-0 node
4. If this operation also fails to bring the node up, contact your local customer support.

The alarm does not require any particular corrective actions if it is preceded by either deliberate management action(s) or node failures. In the latter case separate alarms indicating the node failures would have been raised.
If, however, the reason is that the recovery units have failed and the system is not able to restart the software, the problem is in the applications forming the load sharing group. Basically, there could be several reasons why an application cannot be restarted and therefore it is difficult to give exact or detailed instructions on how to deal with the situation.
The first thing to be checked is the availability statuses of the failed recovery units. If the RU's availability status has the value failed, the system has already attempted to restart it with no success. Note that it is also possible to try and restart the recovery unit manually, by using the following SCLI command:
set has restart managed-object /AS-0/TestApplServer
If manual restart also fails, the log entries should be examined: a typical cause for a failing restart might be a missing configuration file, incorrect configuration file content or even lack of some critical resources, e.g. amount of available physical memory.
Clearing:
Do not clear the alarm. The system clears the alarm automatically when the number of operational RUs goes above the defined threshold.

Security log data must be checked. Notably investigate the login attempts that were made just before the alarm was raised.

To avoid filling up database (DB), perform database-specific actions. Contact your local customer support immediately, and provide them with the information you received from the alarm-notification fields. This information depends on how the DB application uses the file-system.
PostgreSQL server does not store any other data or files apart from DB data files. If the application stores some other data or files, the application documentation informs which files can be removed to free the disk space.
If the file-system is only used for the database, the actions must still be defined to free disk space.
Typical example is when applications insert new data, but never delete it. In such case, disk-space full alarm will be raised at a certain time, and the applications will define a clean-up procedure (for example, which tables can be deleted to free up table space). In customer documentation, there should be a hint such as, which database instances are used from which application.
Additionally, each application must provide a description on what actions need to be done if a certain alarm occurs. The application needs to state what the database is used for and what Recovery Groups (RGs) are used.
An attribute entry in the configuration directory for the DB-database should describe the vendor instructions of the application. This information should be stored in a text field in the configuration directory.
Please verify that no other data than the directories db_data, db_socket, .wdstat and lost+found are created at the working directory (for example /mnt/db/<dbname>, depends on deployment). Other data will reduce the available space for the database.

To get detailed information on the measurement(s) that caused this alarm, enter the following command:
show stats t-job all
show stats t-job id <id-value> where id-value is the threshold job id of the measurement.
Note that this is a user configured threshold rule.

Check if a new license is required, and install the new license if needed.

Fill in a problem report and send it to your local customer support.

1. To check the status of the license, execute the following SCLI command:
show license details unique-id <unique ID>
where, <unique ID> is the eight-character unique ID of the license to be checked.
2. To delete the expired license from the network element (NE), execute one of the following SCLI commands:
delete license unique-id <unique ID>
where, <unique ID> is the eight-character unique ID of the expired license.
delete license expired
The execution of the above command deletes all the expired licenses.
3. To install a new license into the NE, execute the following SCLI command:
add license file <file>
where, <file> is the license file name along with its absolute path.

1. Check from the Identifying application additional information field if this alarm report is for overall CPU overload, or one core overload.
2. Execute the "top" Linux command on the node that reported the alarm. This command provides a repetitive update of the processor activity in real time. It gives a list of the most CPU-intensive tasks of the system.
For overall CPU load, follow the steps below to check the CPU usage at the process level:
a. Check the Application additional information field which processes use more CPU load.
b. Press "P" to sort the processes as per CPU utilization.
For one core overload, press "1" to show the separate state of all cores.
3. Collect the syslog.
4. If the problem persists, contact your local customer support and provide the information gathered in the previous steps.

Disclaimer: The instructions below use either unsupported SCLI commands, or commands from the unsupported full bash shell. Please carefully read the disclaimer that is shown when either entering the unsupported SCLI vendor mode or the full bash shell. Do not use the commands in any other context. Please check from the product documentation or from your local customer support for more information.
1. Log in instructions:
a. Log in to the NE.
b. Switch to the root account (root privilege required).
USER@NODENAME [NE]> set user username root Password
2. Check dynamically the file system's fullness on the node. The alarm is raised when the upper threshold value is reached or exceeded. Use the instructions below to get the threshold value for the node:

The following diagnosis command must be invoked by the operator, in order
1. Log into the cluster, and check that the named managed object has been successfully restarted.
2. Verify also that the MO did not raise any new alarms that would explain the failure. You can check the status of a managed object with the following SCLI command: "show has state". An operational MO has the value ENABLED in the operational state attribute, and has no value in the procedural status attribute.
For example, the state of the process NodeDNS in the recovery unit FSNodeDNSServer of the node AS-5 can be seen as follows:
show has state managed-object /AS-5/FSNodeDNSServer/NodeDNS
OBJECT ADMINISTRATIVE OPERATIONAL USAGE ROLE PROCEDURAL DYNAMIC
/AS-5/FSNodeDNSServer/NodeDNS UNLOCKED ENABLED ACTIVE ACTIVE - -
If the MO is not operational, perform the following steps:
1. With a node MO, you can wait for a node to restart. The system will raise another alarm (70011 NODE NOT RESPONDING) if the node does not come up within a given time. Verify that an alarm for the situation has been raised with the following SCLI command:
show alarm active filter-by specific-problem 70011
2. Check the journal (journalctl command on the active management node) for error(s) that have occurred, by searching for the MO's name and/or by looking at events that occurred before this alarm was raised.
3. You can also initiate an immediate restart attempt of the failed MO using the following SCLI command:
"set has restart managed-object"
For example: set has restart managed-object /AS-5/FSNodeDNSServer
The restart operation is mostly useful after a problem has been corrected. Verify the result from the journal and by checking the status of the MO using the following SCLI command:
"show has state".
4. Alarm for a recovery group implies a multiple error situation (for example, multiple node failures) or a persistent configuration or corruption problem.

Execute systemctl | grep osmon to verify that OSMON is running.
The output, in case OSMON is indeed running, should be of the form:
# systemctl | grep osmon
osmonitor.service loaded active running osmonitor service
...
If the alarm is not cleared automatically, contact your local customer support.

1. If the alarm is raised for an external Ethernet interface, check that the cable is properly connected in the front panel of the GW node.
2. Check the status of the interface with the following SCLI command:
show networking interface runtime node <node> iface <interface>
For example: show networking interface runtime node IB-0 iface management
3. If the above steps cannot resolve the situation, contact your local customer support.
Note: In a deployment where not all external interfaces are used, a number of alarms equal to the number of unused Ethernet links will be raised. To avoid getting those deceptive alarms (since they are actually the result of unused links and not of failing links), the operator is advised to set the admin state of all unused links to DOWN using the following SCLI command:
set networking interface <node> iface <iface-name> admin down

This is an informative alarm and does not require any actions.

This alarm is an informative alarm indicating that the whole cluster has been (re)started. As this operation is critical, check carefully the alarm status in the cluster after the restart.

This is an informative alarm which requires no user actions.

Disclaimer: The instructions below use either unsupported SCLI commands, or commands from the unsupported full bash shell. Please carefully read the disclaimer that is shown when either entering the unsupported SCLI vendor mode or the full bash shell. Do not use the commands in any other context. Please check from the product documentation or from your local customer support for more information.
1. Login instructions:
a. Log into the NE.
b. Switch to the root account (root privilege required).
USER@NODENAME [NE] > set user username root Password:
2. Check the current system memory usage on the node. The alarm is raised when the upper threshold value is reached or exceeded. The threshold value for the node can be read as below:
a. Get the node name where alarm is raised using the following command:
# fsclish -c "show alarm active filter-by specific-problem 70160" | grep "Managed object" | awk -F"=" '{print "node name = " $2}' | cut -d "," -f1
Sample output: "node name = CLA-0"
b. Log into the node:
# ssh root@CLA-0
c. Read the threshold value from the configuration file:
# cat /etc/opt/nokia/osmon-template.conf | grep -v ^# | grep "MEM LIMITS"
Sample output: MEM LIMITS 75.0 82.0
If the output of the above command is empty, then the default thresholds of 80% lower threshold (clearing) and 90% upper threshold (raising) are used to decide whether to raise or clear the alarm.
d. Exit from node:
# exit
e. Get system memory usage on the node:
# ssh root@CLA-0 "cat /proc/meminfo" > /mnt/export/mem.txt
# MemTotal=$(cat /mnt/export/mem.txt |grep "MemTotal" | awk -F ' ' '{print $2}')
# MemFree=$(cat /mnt/export/mem.txt |grep "MemFree" | awk -F ' ' '{print $2}')
# Cached=$(cat /mnt/export/mem.txt |grep "^Cached" | awk -F ' ' '{print $2}')
# TotalFree=$(expr $MemFree + $Cached)
# PercentUsage=$(expr \( $MemTotal - $TotalFree \) \* 100 / $MemTotal)
# echo "Memory usage = $PercentUsage%"
# echo "Total available System memory: $TotalFree" > /mnt/export/output.txt
# echo "Memory usage = $PercentUsage%" >> /mnt/export/output.txt
3. Collect the top 5 memory users at process level (as listed in the alarm) on the NE using the following commands:
# echo " Top 5 memory consuming process (check Appl. addl. info field for details): " >> /mnt/export/output.txt
# fsclish -c "show alarm active filter-by specific-problem 70160" >> /mnt/export/output.txt
4. The memory is often consumed by files located in the tmpfs filesystem. This information is not part of the report collected above. The instructions below can be used to collect such information:
a. Check the memory used by the tmpfs file systems on the node found on step (2.a). Check the tmpfs mount points (ignore entries for the none and /dev filesystems):
# echo "tmpfs mount points: " >> /mnt/export/output.txt
# ssh root@CLA-0 "mount -t tmpfs" >> /mnt/export/output.txt
b. Check the space used by the tmpfs file systems:
# echo "Space used by tmpfs: " >> /mnt/export/output.txt

Verify that the switchover operation is successful. The alarm is automatically cleared if the switchover is successful. However, depending on the type of the application, the time for starting (or activating) a standby RU can vary from a few seconds to tens of minutes. The state of the new active RU can be checked using the structured command line interface:
1. Log into the cluster.
2. Use the "show has state" SCLI command to see the state of the new active RU. The MO name of the new active RU can be found in the application additional information field 1. For example, execute the following command to check the state of the /AS-10/ApplServer-0 recovery unit:
> show has state managed-object /AS-10/ApplServer-0
An operational RU has an UNLOCKED administrative state, ENABLED operational state, an empty procedural status, and "ACTIVE" role. The procedural status of INITIALIZING means that the RU is still starting up.
If the switchover fails (operational state of the new active RU is DISABLED), check the syslog for a possible explanation for the failure and if required, contact your local customer support.
Note that if both Recovery Units in the active standby RG fail repeatedly, this alarm may be raised by both Recovery Units. In that case the situation has to be corrected immediately.

The system clears the alarm automatically when the measurement result goes up and is continuously held at the minimum threshold clearing level or above.

Disclaimer: The instructions below make use of either unsupported SCLI commands or commands from the unsupported full bash shell. Please carefully read the disclaimer that is shown when either entering the SCLI unsupported vendor mode or the full bash shell. Do not use the commands in any other context. Please check from the product documentation or from local customer support for more information.
1. Check that the name of the alarm log file that is defined by the parameter fsLogFileName in the alarm processor configuration in Configuration Directory, is the same as /var/log/master-alarms . Use the following SCLI command:
show config fsClusterId=ClusterRoot fsFragmentId=AlarmMgmt fsFragmentId=AlarmProcessors fsAlarmProcessorId=AlarmProcessor1 fsAlarmProcessorConfigurationId=Default
2. If the value in Configuration Directory is different, then modify the value
Disclaimer: The instructions below make use of either unsupported SCLI
commands or commands from the unsupported full bash shell.
Please carefully read the disclaimer that is shown when either entering the
SCLI unsupported vendor mode or the full bash shell.
Do not use the commands in any other context. Please check from the
1. Fill in a problem report with the alarm data and send it to your local
product documentation or from local customer support for more
customer support.
information.
1. Find the attribute with the invalid value. The name of the attribute can
be found in the "Managed Object Id" field of the alarm.
fsParameterId=<name of attribute>,
fsAlarmProcessorConfigurationId=Default,
fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors,
fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot
2. Modify the attribute's value in the Configuration Directory using the
following SCLI command:
set config attribute fsClusterId=ClusterRoot fsFragmentId=AlarmMgmt
fsFragmentId=AlarmProcessors fsAlarmProcessorId=AlarmProcessor1
fsAlarmProcessorConfigurationId=Default attribute-list <name of attribute>
<correct value>
Then restart the alarm processor with the following SCLI command:
set has restart managed-object /AlarmSystemLight
3. The default values of the alarm processor attributes used when
correcting the situation are listed below:
Attribute Default value
fsAlarmProcessorConfigurationId: Default
fsLogFileName: /var/log/master-alarms
fsLightParserThreadSleepTime: 1
fsLightAlarmHistorySize: 25000
fsLightAlarmListSize: 10000
fsLightSnapShotTimeInterval: 60
fsLightSnapShotMinNumRecords: 1000
fsLightManualAlarmClearingEnabled: false
fsTimersForRNWAlarmsEnabled: false
fsRaise70280insteadOf70005forUnknownSP: true
fsHeartbeatAlarm70246Enabled: true
fsMultiSpBlockingRuleEnabled: false
fsHeartbeatAlarm70246TimeInterval: 300
fsAutoAcknowledgeWhenCleared: follow-alarm-definition
fsAlarm70247raise: true
fsLightEventsProcessed: 12
fsLightProcessingInterval: 200
fsLightProcessorSleepInterval: 800
fsLightClearWarningAlarmsEnabled: false
fsLightClearAlarmsOnNEReset: none
fsLightExcludeRangeFromNEResetRule: none
fsLightExternalFlowControlValid: none
objectClass: FSAlarmProcessorConfiguration
objectClass: extensibleObject
fsAlarmCompareAAIforWildcardIAAI: false
4. If Application Additional Information field contains "LDAP server
unavailable, using default configuration parameters" then please contact
your local customer support.
Fill in a problem report with the alarm data, and then send it to your local
customer support.
1. If automatic alert is not supported in situations where the alarm system
heartbeating is not functioning, check occasionally that the heartbeating
functions properly. The time of the alarm and the value of the heartbeat
interval (specified in the 'Application Additional Info' field) should be
used in analyzing the situation.
2. Perform such checks also when the system does not generate any
alarm events for a long time.
3. If these occasional checks reveal that the heartbeat alarm events are
not continuously generated at each heartbeat interval, restart the alarm
processor (this also forces the restart of the alarm system database)
using the following SCLI command:
set has restart managed-object /AlarmSystemLight
The alarm can be shown using the following SCLI command:
show alarm active filter-by specific-problem 70243
Disclaimer: The instructions below make use of either unsupported SCLI
commands or commands from the unsupported full bash shell.
Please carefully read the disclaimer that is shown when either entering the
SCLI unsupported vendor mode or the full bash shell.
Do not use the commands in any other context. Please check from the
product documentation or from local customer support for more
information.
To find out why the other CLA node is unavailable, perform the following
steps:
1. Log in to the controller node.
2. Check the state of the VM using the command:
>nova list
3. If the VM is stopped then start it using the command:
>nova start <VMid >
This alarm is raised when the fsHeartbeatAlarm70246Enabled is set to
false and the fsAlarm70247raise attribute is set to true.
This problem can indicate one or both of the following:
- A failure situation that is, for example, caused by node reboots or node
failures. In these cases, the alarm severity is MINOR and the problem is
likely to disappear quickly. The severity of this alarm will be raised to
MAJOR if the node(s) do not restart within a few minutes.
- An application problem caused, for example, by a program error, a
configuration error, or data corruption. In this case, the alarm severity is
MAJOR and manual intervention may be needed.
If the severity of this alarm is MINOR, you may choose to wait a few
minutes to see if the alarm is cancelled. In node reboot and transient
failure situations, the system will cancel the alarm as soon as the node
reboots have been completed and the service instance(s) has been
successfully reassigned and the recovery unit restarted.
If the severity of this alarm is MAJOR, perform the following steps:
1. Log into the active CLA as root user.
2. Check the system syslog (/var/log/master-syslog) for possible failure
reasons and contact your local customer support if you need assistance.
When 70247 alarm is raised do either one of the following (but not both)
- Switch ON Alarm System heartbeating.
If the heartbeating is required please set the attribute to true using the
following SCLI commands and the alarms will be cleared automatically:
set config attribute fsClusterId=ClusterRoot fsFragmentId=AlarmMgmt
fsFragmentId=AlarmProcessors fsAlarmProcessorId=AlarmProcessor1
fsAlarmProcessorConfigurationId=Default attribute-list
fsHeartbeatAlarm70246Enabled true
When the clearing is complete, restart the Alarm Light Processor by using
the following SCLI command:
set has restart managed-object /AlarmSystemLight
- Switch OFF 70247 Raising
When the alarm system's heartbeating is desired to be switched OFF,
along with setting fsHeartbeatAlarm70246Enabled as false set FALSE as
the value for the fsAlarm70247raise attribute of the AlarmLight Processor
configuration in the Configuration Directory. Do this by using the following
SCLI command:
set config attribute fsClusterId=ClusterRoot fsFragmentId=AlarmMgmt
fsFragmentId=AlarmProcessors fsAlarmProcessorId=AlarmProcessor1
fsAlarmProcessorConfigurationId=Default attribute-list fsAlarm70247raise
false
Once done, restart the Alarm Light Processor by using the following SCLI
command:
set has restart managed-object /AlarmSystemLight
Find out why an FSDirectoryServer recovery unit has been locked. The
system can be restored to a safer state by performing the following steps:
1. Log into the system.
2. Start the structured command line interface:
$ fsclish
3. Check the status of the FSDirectoryServer recovery units using the
following SCLI command:
show has state managed-object /*/FSDirectoryServer
OBJECT ADMINISTRATIVE OPERATIONAL USAGE ROLE PROCEDURAL DYNAMIC
/CLA-0/FSDirectoryServer UNLOCKED ENABLED ACTIVE ACTIVE - -
/CLA-1/FSDirectoryServer LOCKED ENABLED IDLE COLDSTANDBY NOTINITIALIZED -
4. If one of the recovery units is LOCKED, unlock it by using the SCLI
command "set has unlock". For example:
set has unlock managed-object /CLA-1/FSDirectoryServer
Following message is displayed:
/CLA-1/FSDirectoryServer unlocked successfully.
5. Quit the fsclish session using the following command:
quit
1. Login to the network element as root user to check the situation.
2. Check the state of all the recovery units within the recovery group (the
name of the recovery group is in the Application Additional Information
field). If the recovery group is providing service, each of its UNLOCKED
recovery units that have the ACTIVE role, has the ENABLED operational
state, and an empty procedural status. For example, the state of recovery
units in the /Directory recovery groups can be checked using the
"show has" SCLI command:
> show has regex filter ru state managed-object *Directory*
OBJECT ADMINISTRATIVE OPERATIONAL USAGE ROLE PROCEDURAL DYNAMIC
fshaRecoveryUnitName=FSDirectoryServer,fsipHostName=CLA-0,fsFragmentId=Nodes,fsFragmentId=HA,fsClusterId=ClusterRoot
UNLOCKED ENABLED ACTIVE ACTIVE - -
fshaRecoveryUnitName=FSDirectoryServer,fsipHostName=CLA-1,fsFragmentId=Nodes,fsFragmentId=HA,fsClusterId=ClusterRoot
UNLOCKED ENABLED IDLE COLDSTANDBY NOTINITIALIZED -
In the case above, the recovery unit of CLA-1 node is acting as a cold
standby backup, and the recovery unit on CLA-0 is running the service
normally. Note that the grep command in the example is used to filter out
information regarding individual processes in each recovery unit.
Since this is a situation that may be caused by various faults, contact your
local customer support to analyze the root-cause.
1. To ensure the proper functionality of the system, switch off the inert
mode after the problem analysis is done.
2. You can switch off the inert mode from all nodes of the cluster by
issuing the following SCLI command:
set has inert off managed-object /
Note that this should be done by the supplier's field engineer that is
currently analyzing the system.
When the inert mode is switched off, pending recovery actions take place.
For example, if an important process in a cold active/standby recovery
group has failed in node that was in the inert mode, switching the inert
mode off for the node causes a switchover of the recovery group.
Extract the error type from the application additional information field from
alarm output. Below is the list of actions corresponding to each error type
number.
USER_NAME_DUPLICATE_ERROR - A username cannot be the same
as one of the reserved names from the list: root, wheel, daemon, adm,
sync, shutdown, halt, lp, mail, uucp, operator, games, nobody, gopher,
nfs, nfsnobody, named, ntp, ldap, mysql, postgres, rpm, apache, sshd,
dbus, vcsa, nscd.
USER_NAME_RESERVED_ERROR - A username cannot start with one
of the prefixes reserved for network elements: "_nok", "_nsn".
USER_NAME_TOO_LONG_ERROR - A username cannot exceed more
than 32 characters.
USER_NAME_CONTAINS_INVALID_CHARS_ERROR - User name must
start with a letter, a digit, an underscore or a full stop. The last character
must be a letter, a digit, an underscore, a hyphen, a dollar sign or a full
Depending on the problem type (see the Application Additional
Information field 2), the cause of the problem can be:
LDAP_DOWN - Network configuration problems.
Check that the primary and secondary NetAct LDAP server addresses
defined in the network element (NE) internal LDAP server are reachable.
INVALID_CREDENTIALS - NE account credentials to connect to the
NetAct LDAP server are invalid (wrong account name, wrong password
and so on).
Check that the NE account stored in the NE internal LDAP server to
connect to the NetAct LDAP server exists in NetAct, has not expired, has
the correct password and so on.
BAD_DATA - NetAct LDAP server is overloaded or shut down.
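The USER_NAME_* rules listed above can be pre-checked before an account is created. The following Python snippet is a minimal sketch only, not the product's own validation logic: the function name `username_errors` is hypothetical, the rule for the last character assumes the truncated source text ends with "full stop", and characters in the middle of the name are not checked because the source does not state a rule for them.

```python
import string

# Reserved names and prefixes as listed in the alarm instructions.
RESERVED_NAMES = {
    "root", "wheel", "daemon", "adm", "sync", "shutdown", "halt", "lp",
    "mail", "uucp", "operator", "games", "nobody", "gopher", "nfs",
    "nfsnobody", "named", "ntp", "ldap", "mysql", "postgres", "rpm",
    "apache", "sshd", "dbus", "vcsa", "nscd",
}
RESERVED_PREFIXES = ("_nok", "_nsn")
MAX_LENGTH = 32

# First character: letter, digit, underscore or full stop.
FIRST_CHARS = set(string.ascii_letters + string.digits + "_.")
# Last character: letter, digit, underscore, hyphen, dollar sign or
# (assumption, the source text is cut off here) full stop.
LAST_CHARS = set(string.ascii_letters + string.digits + "_-$.")


def username_errors(name):
    """Return the list of error types that the given username would trigger."""
    errors = []
    if name in RESERVED_NAMES:
        errors.append("USER_NAME_DUPLICATE_ERROR")
    if name.startswith(RESERVED_PREFIXES):
        errors.append("USER_NAME_RESERVED_ERROR")
    if len(name) > MAX_LENGTH:
        errors.append("USER_NAME_TOO_LONG_ERROR")
    if not name or name[0] not in FIRST_CHARS or name[-1] not in LAST_CHARS:
        errors.append("USER_NAME_CONTAINS_INVALID_CHARS_ERROR")
    return errors
```

An empty list means that none of the four rules is violated; the network element itself remains the authority on whether a name is accepted.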
Disclaimer: The instructions below use either unsupported SCLI
commands, or commands from the unsupported full bash shell. Please,
read carefully the disclaimer that is shown when either entering the
unsupported SCLI vendor mode or the full bash shell. Do not use the
commands in any other context. For more information, see the product
documentation or your local technical support.
All currently active SSH sessions opened with the indicated username
(see the 1st additional information field of the alarm) must be closed and
reopened if needed. After reopening a session, the correct permissions
are taken into use if the account is still in use for the NE.
Initiate SSH sessions:
1. Log in to the active cluster management node. For example the MMN-0.
2. Use the following command to check the open SSH sessions:
> show user-management login-history
Note: The above command shows all the "still logged in" users and
already logged-out users.
For example, the result of invoking the above command might look
as follows:
show user-management login-history
User name Login time Logout time Host name
------------------------------------------------------------------------------------
extuser Mon Jan 16 08:22:09 2017 still logged in 10.157.3.230
extaccount Fri Jan 13 12:49:06 2017 Fri Jan 13 15:11:15 2017 10.157.3.230
3. The preferred way of closing a session is a graceful exit. It is, however,
possible to close it forcefully. The following example illustrates forceful
cleanup of a session for user "extuser".
4. First, enter the full bash shell and check the sshd process id of the child
process "01256":
# ps -ef | grep 1256
root 1256 7701 0 13:05 ? 00:00:00 sshd: extuser [priv]
10009 1276 1256 0 13:05 ? 00:00:00 sshd: extuser@pts/5
root 17382 2504 0 13:06 pts/4 00:00:00 grep 1256
Terminate it:
# kill -9 1276
The SSH session for user "extuser" is terminated.
1. Observe the application in question by checking the "Application ID"
field of the alarm.
2. Lock the application in case of a CRITICAL severity alarm. To prevent
its infinite restart, use the following SCLI command:
> set has lock managed-object <mo-name>
3. Observe the invalid or missing attribute by checking the "Identifying
Additional Info" field of the alarm.
4. Observe the configuration location by checking the "Managed Object"
field. This contains, for example, a branch of the
Configuration Directory-based configuration or a path for file-based
one.
5. Add or correct the invalid attribute mentioned in the "Identifying
Additional Info" field. Follow the guidelines provided in
customer documentation for the application using the appropriate tool
(for example, a text editor for the file-based configuration).
6. Unlock (if the second step was used) or restart the application using the
following SCLI commands:
- unlock: set has unlock managed-object <mo-name>
- restart set has restart managed-object <mo-name>
This alarm implies an error situation which prevents the process from
starting.
Please contact your local customer support for assistance.
The alarm announces the attempt to raise an alarm with an unknown
specific problem.
1. Check the Identifying Application Additional Information field of the
70280 alarm that is raised. It contains the specific problem which the
alarm system was unable to recognize.
show alarm active filter-by specific-problem 70280
2. Check whether the specific problem is present in the alarm system
repository using the following SCLI command:
show alarm type specific-problem <unknown-specific-problem>
3. Check whether the specific problem is present in the list of known
alarms in the customer documentation.
Fill in a problem report and send it to your local customer support.
1. Check if the target ID field is displayed as "<Empty>" in the output of
the SCLI command "show license target-id" to confirm that the alarm is
really valid.
2. Check if the license management server fetches the target ID of the NE
successfully upon its restart by executing the following SCLI command:
> set has restart managed-object /CLM.
3. Check that the target id (e.g. NE ID) has been properly configured via
CAM. If the license management server continues to fail in fetching the
target ID of the NE, please contact your local customer support for further
information.
The following diagnosis command must be invoked by the operator, in
order to gather some diagnostics data for subsequent investigation on the
reason of the alarm.
/opt/nokia/SS_RCPDBHAMgmt/tools/fsdbdiag.sh
The operator should inform the local customer support, and report this as
a possible problem caused by maximum connection limitation of
database. The configuration may have to be changed, if the database has
been configured with too few connections. There are two possibilities to
avoid such situation. First, increase the maximum number of possible
connections to the DB and, second, reduce the number of applications
that are simultaneously accessing the database.
The Operator should provide the information about which application uses
which IP-address. Each database application has to describe which
database connections are used, and is responsible for
connecting/disconnecting to/from the database. Additionally, each
application must provide a description what actions are to be performed if
this alarm occurs. This information should be stored in a text field in the
Configuration Directory.
The limit can be increased by changing the value of the parameter
"max_connections" in the database configuration file
/mnt/db/<dbname>/db_data/postgresql.conf (Note: actual DB working
directory depends on used deployment) of all the nodes. The database
has to be then restarted by restarting the relevant Recovery Group.
Note: There are currently no equivalent SCLI commands.
The following severity level is supported:
Warning: fsdbConnectionsAlarmLimit (value from Configuration Directory)
reached, minimum number of connections which must be free with check
frequency fsdbConnectionsCheckFreq (value from Configuration
Directory)
Identify the application raising the alarm using the Application ID field in
the alarm. From the SCLI shell, enter the bash shell
to gain access to the system.
exampleaccount@CLA-0 [test] > shell
[exampleaccount@CLA-0(test) /home/exampleaccount] #
Now, if the subsystem (for example the PM9 server) is unable to write
result files into the result directory then the following
shall be done at the management node where the Subsystem (here PM9
server) is active:
1. If the reason in the identified application additional info field of the alarm
indicates issue "No space left on device", then
check whether the disk is full by executing the following command at
bash shell prompt:
df -h <TARGET DIRECTORY>
For example: df -h /var/opt/nokia/SS_PM9/storage/
If the disk is full, get the list of the files by executing the following
command at the bash shell:
ls -lrt <TARGET DIRECTORY>
For example: ls -lrt /var/opt/nokia/SS_PM9/storage/results
Create free space on the disk by removing some of the old files.
2. If the reason in the identified application additional info field of the alarm
indicates issue "Permission denied", then check
whether the required permissions are given by executing the following
command at bash shell:
ls -lrd <TARGET DIRECTORY>
For example: ls -lrd /var/opt/nokia/SS_PM9/storage/results
If the output does not indicate rwx permission for user _nokfssyspm9,
contact the local customer Support. A correct output will be
displayed as shown below:
drwxr-xr-x 8 _nokfssyspm9 _nokfssyspm9 1024 Feb 9 20:17
/var/opt/nokia/SS_PM9/storage
1. Resolve the VRF name based on the VrfId reported in the alarm
additional information by the SCLI command:
> show networking vrf id <VrfId>
2. Check the network connectivity to the peer network element using ping
and traceroute commands. Get the source node information from alarm
Managed object field fsipHostName=<node> section and execute the
following SCLI commands:
a. Ping
if BFD corresponding address family is IPv4:
start networking instance <VRF Name> diagnostics ping node <node>
destination <Destination IP> source <Source IP>
if BFD corresponding address family is IPv6:
start networking instance <VRF Name> diagnostics ping6 node
<node> destination <Destination IP> source <Source IP>
b. Traceroute
if BFD corresponding address family is IPv4:
start networking instance <VRF Name> diagnostics traceroute node
<node> destination <Destination IP> source <Source IP>
if BFD corresponding address family is IPv6:
start networking instance <VRF Name> diagnostics traceroute6 node
<node> destination <Destination IP> source <Source IP>
3. Check the log files (/var/log/master-syslog) for related faults.
4. Check the state of the peer network element.
Network leakage could be caused by either an incorrect configuration of
virtual switches, or a virtual switch malfunction. In case
of an incorrect configuration, the virtual switch configuration should be
fixed immediately.
Check if a new license for the feature is required, and install it if required.
If the feature itself is not required, set the
feature admin state to OFF.
1. To check the status of a feature, execute the following SCLI command:
>show license feature all
The required feature would not be displayed in the output.
2. To install a new license, execute the following SCLI command:
>add license file <file>
where, <file> is the license file name along with its absolute path.
3. To set the feature admin state to OFF, execute the following SCLI
command:
> set license feature-mgmt code <code> feature-admin-state off
where, <code> specifies the feature code whose admin state has
to be set to OFF.
"Disclaimer: The instructions below make use of either unsupported SCLI
commands or commands from the unsupported full bash shell.
Please carefully read the disclaimer that is shown when either entering the
SCLI unsupported vendor mode or the full bash shell.
Do not use the commands in any other context. Please check from the
product documentation or from local technical support for more
information."
Follow the instructions given below to clear this alarm:
1) Verify if the SCLI daemon is up. If you can access the fsclish shell, it
indicates that the SCLI daemon is up.
2) Correct the errors, if any, in the script provided by the user/subsystem.
The name of the erroneous script is indicated in the alarm as the second
Identifying Application Additional Information field. This script can be
located at either one of the following locations:
/opt/nokia/configure/sh/
The following diagnosis command must be invoked by the operator, in
order to gather some diagnostics data for subsequent investigation on the
reason of the alarm.
/opt/nokia/SS_RCPDBHAMgmt/tools/fsdbdiag.sh
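The disk-space and permission checks described above for the result directory (df -h and ls -lrd) can also be scripted. The snippet below is a sketch under stated assumptions rather than a supported tool: the helper names `owner_has_rwx` and `percent_used` are hypothetical, and the parser assumes the usual `ls -l` column order (mode, link count, owner, group, ...) as in the sample output shown in the instructions.

```python
import shutil


def owner_has_rwx(ls_line, user):
    """Check one line of `ls -lrd` output for rwx permission of the owner.

    Returns True only if the directory is owned by `user` and the owner
    permission triplet (characters 2-4 of the mode string) is "rwx".
    """
    fields = ls_line.split()
    if len(fields) < 3:
        return False
    mode, owner = fields[0], fields[2]
    return owner == user and mode[1:4] == "rwx"


def percent_used(path):
    """Return the used share of the file system holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# Sample line taken from the instructions (joined into a single line).
sample = ("drwxr-xr-x 8 _nokfssyspm9 _nokfssyspm9 1024 Feb 9 20:17 "
          "/var/opt/nokia/SS_PM9/storage")
print(owner_has_rwx(sample, "_nokfssyspm9"))
```

A value of `percent_used` close to 100 corresponds to the "No space left on device" case, while a False result from `owner_has_rwx` corresponds to the "Permission denied" case.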
Try to create the subsystem certificate again and check why the certificate
cannot be created.
Disable the subsystem which raised the alarm, since it could be potentially
dangerous to run it in a non-secure mode.
Note: The procedure to properly disable a subsystem must be obtained
from the Certificate Management Guide.
Check the following options depending on the error code displayed (for
more information, see the Identifying Application Additional Information
field 1):
- Check if the /etc/certs/<CertMan domain> directory is in a read-only
mount.
- Check if there is enough free disk space available for creating a
certificate file.
Note: CertMan domain is the domain name where the Certificate is stored
under the Certificate Management DN in LDAP.
Check why TLS connections cannot be made and potentially disable
RUIM for the time of investigation.
To disable RUIM, enter the following SCLI command:
> set user-management ruim disable
Depending on the Error code (see Application Additional Information field
1) the cause for the problem can be:
1. Certificates are not present in the "default" or "ruim" domain.
Verify whether Certificates are present in the "default" or "ruim" domain.
Use the following SCLI commands to check whether certificates are
installed:
> show security cert ruim ca-cert all
> show security cert default ca-cert all
2. Wrong server certificate is used.
3. The external LDAP server does not support TLS protocol.
Please contact your local customer support to resolve the issue.
If needed, save the active configuration to clear the alarm.
This is an informative alarm and does not require any actions.
1. In case of warning alarm (with warning severity):
a. Reduce the number of routes in the forwarding table so that the route
count is below the supported limit.
b. Ensure that the number of routes present in the node local forwarding
table is less than the maximum number of routes supported. Execute the
following SCLI show commands, to view the routes in the forwarding table
of the node for which the alarm is raised:
Ex: show networking forwarding-table runtime node <node name>
show networking forwarding-table runtime ipv6 node <node name>
2. In case of major alarm (with major severity) raised after warning alarm:
a. Reduce the number of routes in the node local forwarding table so that
the route count is below the supported limit.
b. Restart or reboot the node for which the alarm is raised by executing
the following SCLI command:
set has restart managed-object <node name>
Disclaimer: The instructions below use either unsupported SCLI
commands,
or commands from the unsupported full bash shell. Please carefully read
the disclaimer that is shown when either entering the unsupported SCLI
vendor mode or the full bash shell. Do not use the commands in any other
context. Please check from the product documentation or from your local
technical support for more information.
1. Execute the following SCLI command to start an external bash shell
session:
shell bash full
2. Execute the following command to check whether /ClusterNTP is in
sync with the server NTP:
/opt/nokia/SS_RCPNTP/bin/ntpdc -c peer -n
In the output, there should be an asterisk mark (*) on the NTP server IP,
which indicates that it is synchronizing with the NTP server and the alarm
has been cleared automatically.
If there is a crosscheck mark (x) present on the NTP server IP, refer to
step 7.
3. Use the "exit" command to exit the bash shell.
4. Execute the following SCLI command to find the NTP server IP:
show networking-service ntp
5. If the asterisk (*) is not present on the NTP server IP (in output of step
2), then wait for approximately 20 minutes and check whether alarm has
been cleared using the following SCLI command:
show alarm active filter-by specific-problem 70377
6. If alarm is not cleared, use the following SCLI command to sync the
system
Disclaimer: The instructions below make use of either unsupported SCLI
Commands or commands from the unsupported full bash shell. Please
carefully read the disclaimer that is shown when either entering the SCLI
unsupported vendor mode or the full bash shell. Do not use the
commands in any other context. Please check from the product
documentation or from local technical support for more information.
1. Execute the following SCLI command to start an external bash shell
session:
shell bash full
2. Check the routes to the NTP server (depends on the configuration
made)
3. Check UDP port 123 using the steps below:
ps -ef | grep -i ntp
root 9760 3638 0 15:12 ? 00:00:00
/opt/nokia/SS_RCPNTP/bin/NTPMonitor
ntp 9765 9759 0 15:12 ? 00:00:00
/opt/nokia/SS_RCPNTP/bin/ntpd -u ntp:ntp -g -n -c /etc/ntp_master.conf
Check that the source and forwarder addresses listed above are bind to
port 123 using the command below:
netstat -alpn | grep ":123"
4. Execute the following command to exit the bash shell:
# exit
5. Check the status of /ClusterNTP or /NodeNTP using the below SCLI
command on the cluster. If the administrative state is locked, unlock
them
> show has state managed-object /ClusterNTP
OBJECT ADMINISTRATIVE OPERATIONAL USAGE ROLE
Check the environment in which the NE is installed (NOKIA laboratories or
customer environment). If this alarm is seen in a customer
environment, contact the local customer support for further information on
how to change the NE's configuration to accept commercial
licenses. If this alarm is seen in NOKIA laboratories, then this can be
disabled if there is no need to allow the installation of
NOKIA internal test licenses into the NE. To disable this capability in
NOKIA laboratories, execute the SCLI command
"set license test-license state disabled" from the unsupported vendor
mode.
The corrective action for this alarm is to install a new valid certificate. If the
certificate is not needed anymore, it can be
deleted as well. The alarm will get automatically cleared when it is
replaced by a new valid certificate or when the certificate
is deleted, but this automatic clearance could take up to 24 hours to
happen. So it is recommended to manually clear this alarm
after the certificate is replaced or deleted.
Note: For more detailed description of the below mentioned SCLI
commands as well as the parameters being used in the commands,
please refer either to the SCLI command online help or to the product
documentation.
As the first step, the "C" value in the "Appl. addl. info" of the
alarm to find out if the certificate is an EE
certificate or a CA certificate.
For example, if "Appl. addl. info" of the alarm has "C:EE", then the
certificate is an EE certificate. And if "Appl. addl. info" of the alarm has
"C:CA", then it is a CA certificate.
A. If the certificate is an EE certificate, then follow these instructions:
A.1) To verify the validity of the certificate, execute the following SCLI
command and check the "Validity" field:
show security cert <domain> ee-cert
For example, if "Appl. addl. info" of the alarm is "C:EE D:default DTE:2
CET:2013-03-15 09:19:24-10:00", execute command
"show security cert default ee-cert".
A.2) If the certificate has expired or is about to expire, obtain and install a
new one by following the following instructions.
Note that the steps to follow differ, based on whether the certificates
are manually installed or automatically fetched via
The corrective action(s) for this alarm depend on the reason for failure of
automatic KUR operation, as described below:
1. UNABLE_TO_FETCH_ROOTCA:
a. Check whether the NE service domain contains the root CA certificate
(trust anchor) by executing the SCLI command "show security cert
<domain> ca-cert trust-anchor { [cert-type <cert-type>]?}".
b. If it is missing then, install the corresponding root CA certificate by
executing "set security cert <domain> ca-cert" SCLI command.
Here domain name can be fetched from the "Application Additional
Information fields" field of the alarm.
c. For more details, see the "Centralized certificate management" chapter
in the platform administration guide.
2. CMP_REQUEST_FAILED:
a. Check whether the "default" domain contains the CMP configuration
parameters by executing the "show security cert default cmp" SCLI
command.
b. If they are not configured, then configure them suitably by executing
the "set security cert default cmp" SCLI command.
c. For more details, see the "Centralized certificate management" chapter
in the platform administration guide.
Check new certificates installed or not:
a. Run the below SCLI command
> show security cert <domain> ee-cert
In the output verify Not After field should contain new expiration date.
Check any alarms raised against the services whose certificates are
renewed using the following command:
> show alarm active
Verify that the NE service has been using the updated certificate(s) and is
functioning normally by referring to the corresponding sections of product
application documentation guides.
Add more physical or virtual RAM to the node which raised the alarm.
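The certificate instructions above derive the verification command from the "Appl. addl. info" string of the alarm, for example "C:EE D:default DTE:2 CET:2013-03-15 09:19:24-10:00". The following Python sketch shows one way to split such a string into its fields. It is an illustration only: the function names are hypothetical, the reading of "D" as the certificate domain is inferred from the single example in the text, and the "DTE" and "CET" fields are passed through without interpretation.

```python
import re

# A field is an upper-case key, a colon, and everything up to the next
# " KEY:" or the end of the string. Requiring upper-case keys keeps the
# colons inside the timestamp (09:19:24-10:00) from starting a new field.
_FIELD = re.compile(r"([A-Z]+):(.*?)(?=\s+[A-Z]+:|$)")


def parse_addl_info(text):
    """Split an 'Appl. addl. info' string into a {key: value} dictionary."""
    return {key: value.strip() for key, value in _FIELD.findall(text)}


def ee_validity_command(info):
    """Return the SCLI command for checking an EE certificate's validity.

    Returns None for anything other than an EE certificate (C:EE), because
    the instructions quoted above only give the command for that case.
    """
    if info.get("C") == "EE" and info.get("D"):
        return "show security cert {} ee-cert".format(info["D"])
    return None


info = parse_addl_info("C:EE D:default DTE:2 CET:2013-03-15 09:19:24-10:00")
print(ee_validity_command(info))
```

For a CA certificate (C:CA) the sketch deliberately returns nothing, so the operator falls back to the written instructions.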
Install more physical or virtual CPUs/cores
Check
Disclaimer: the status of the nodebelow
The instructions by using usethe following
either procedure:
unsupported SCLI
commands or commands from the unsupported full bash shell. Please
1. Log in to the cluster.
carefully
2.
read Execute
the disclaimerthe following that is command
shown when to verifyeither that the Administrative
entering the unsupported state is
UNLOCKED and Operational state
SCLI vendor mode or the full bash shell. Do not use the commands in any is ENABLED for the node:
1.
show Check hasthat statethe air flows freely through the cabinet and the chassis.
other context. Please check from the product documentation or from
technical
2.
For If example:
the alarm>isshow persistent,
has state replace the faulty plug-in
managed-object /TCU-0 unit:
support for more information.
- Refer to the hardware maintenance documentation for detailed replacing
instructions.
Expected output:
1. LoginExecute instructions
the following for switches:
SCLI command to check the availability of the
-OBJECT
The details of the faulty plug-in unit (cabinet, chassis and slot) are found
DHCP A. Login ADMINISTRATIVE
instruction for AHUB2 OPERATIONAL
& AHUB4-A: USAGE ROLE
in the ApplicationDYNAMIC
PROCEDURAL Additional Info field of the alarm.
server a. bash_prompt#
and the associated telnet <switch_name>
TFTP server for (You alarms maywithneed
IAAItoTrivial
change Filethe
switch
Transfer name, replace this with the switch name from the alarm)
3.
/TCU-0 If thereUNLOCKED are numerous alarms ENABLED of this kind ACTIVEfrom several
- -onplug-in - units,
Enter password
Protocol (TFTP) get or putthe failure or server. those falling back default fabric
1.
check Reload the the image
air conditioning from andTFTP temperature in the network element (NE)
>
configuration:
equipment
3. If room.
show b.theUse
has state
the
state of managed-object
the node
"enable" is displayed
command to turn ason mentioned
the privileged above within 10
mode.
2. If the
minutes, problem remains, reload/DHCPD the image from the original source to the
TFTP
4.
then Checktheserver,if there
following andisreload any high
instructions the power
same
to reset image
surge the from the
because TFTP
of which server.
thebealarm
2.B.If Login
the instruction
recovery group for isAHUB3-A:
not providing thenode
service manually can
i.e. if the status skipped.
is
could
Else,
1. Login have
the HAS been
instructions raised.
recovery operations
for output,
switches: will be pending and waiting to verify
not a.matching
bash_prompt# the below telnet <switch_name>
then execute (Youthe may
next need
step. to change the
3.
the Ifnode
the problemis isolated remains,is, compare the md5sum of the imageorfile on the
switch name, replace(that this with it cannot
the switch accessname anyfrom databases
the alarm) other
TFTP
5. If the server
problem to that remainsofthis
the case,
original source.
shared
OBJECT A.
Enter Loginresources).
instruction
password
ADMINISTRATIVE In for after
AHUB2 applying
the userthe
& AHUB4-A:
OPERATIONAL must instructions,
manually
USAGE verify
pleasethat
ROLE the
contact
your
node local
is down:customer support.
a. bash_prompt# telnet <switch_name> (You may need to change
AHUB3#
PROCEDURAL
4. If the md5sumsDYNAMIC are the same, the original image file is also corrupted,
theb.switch For BI, name,
start basereplace this with
ethernet CLIthe switch name from the alarm)
1.
in Login instructions.
which
Procedure case for contact
ATCA your local customer support in order to get the valid
Hardware:
/DHCPD EnterUNLOCKED
AHUB3# password
base-ethernet ENABLED ACTIVE - - -
image.
- PressA. Login thestart instruction
hot fabric
swap etherenetfor AHUB2
button of the &blade.
AHUB4-A:
For> FI,
a. bash_prompt# telnet CLI
<switch_name> (You may need to change
-3.Remove b. Use
AHUB3#
Execute the the blade
"enable"
fabric-ethernet
the following and wait
command for a while.
to turnto on the privileged mode.
-the Ifswitch
5.Re-insert
the md5sums name,
the blade. differ, SCLI
replace something commands
this with the switchunlock
is continually name the DHCPD
from
corruptingthe the
alarm)server:
image
set
during hasEnter unlock
the password
transfer managed-object
from the original /DHCPD source to the TFTP server. If possible,
1.
2. Check
B. Login
Use the scenariosfor
instruction listed under
AHUB3-A: thethatblocktheMeaning of the alarm. Ifisthe
replace
Procedure > the the
instructions
suspected
for BCN
below
component,
Hardware:
to check
for example,
CPU usage
cable, switch
threshold
unit etc.
not
alarm
set toa.
4. Execute isbash_prompt#
raised,
abnormally the it is
low.
following cleared
telnet
SCLI whencommandsthe scenario
<switch_name> to is over.
(You
restart may
the need to change
DHCPD server:
-the
Log b.
inUse to name,
thethecluster.
"enable" to turn on privileged mode.
set switch
has restart replace this with
managed-object the switch name from the alarm)
/DHCPD
6.
2. If the
- Restart
Execute problem
the the faulty remains
followingnode command
byafter
using applying
the the instructions,
following
to AHUB2:
check thatSCLI please
command:
the reported FRU contact
is
A. Enter
CPU password
usage threshold check for
your
> B.
set Login
local
hardware instruction
customer restart fornode
support. AHUB3-A:
<node-name>
plugged AHUB3#
device-name#
5. Check
Fora.alarms in and active:
show monitor cpu-usage
1. that with
bash_prompt# IAAItelnet
the affected Switch
unit Issue,
is healthy
<switch_name> checkand/var/log/master-syslog
plugged
(You mayinneed
properly.
to changefor
show
Switch b.tools
Forthe system-status
BI,unitstartif base brief list CLI:
ethernet
2.
the
4. Restart
switch
Once name,
theusage node needed.
replace
has on been this with
reset, the switch
logAHUB3-A:
in towait name
the forclusterfrom the alarm)
B. AHUB3#
related
3. CPU
Replace
Enter errors. thebase-ethernet
Based
password unitthreshold thischeck
if needed. for
information, next and
DHCP setlease
the node timeto
isolate
For example:
Forproblem
FI, start
AHUB3(blade_mgmt)#
expiry > show fabric tools system-status
etherenet
show CLI: these
snmp-traps brief list
4. If
state the
AHUB3#
by using theremains
following after following
procedure: instructions, please contact
alarm
your AHUB3#
to
local clear.
customer fabric-ethernet
support.
>
1. set
Checkb.
has For BI,
isolate start base
managed-object ethernet CLI:
<node-name>
Expected CPUthat
C. AHUB3# output:
usage the threshold
affected unit
base-ethernet check is healthy
for AHUB4-A:and plugged in properly.
2.
6. Restart
Please
Use the
wait...
the
device-name(enable)#
If the alarm unit
instructions
does if not
necessary.
belowshow
clear, to
the check
cpuload
execute thatthethefollowing
memory SCLI usagecommand
thresholdto is
5.
3. AfterForthe
Replace FI,the start
node unitfabric
is ifset etherenet
to isolationCLI:
necessary. state, HAS performs recovery actions
not
restartset to abnormally low.
that
4. AHUB3# fabric-ethernet
theIf the problem remains below after following
to checkthese
---------------------------------------------------------------------------------
3. Follow
Switch: the instructions CPU instructions,
usage: please contact
were
your
Node/
set pending.
local
A.hardware customer
Memory restart usage node support.
threshold <node> HW
check for AHUB2: HW Node RU Active
2. Use the instructions below for the switches to check that the port error
HW Unit
device-name#
A. For Location
AHUB2: show monitor ram-usage Type State State State Alarms
monitoring
Note: If threshold
thevalueisolation is setofcorrectly.
state a node is set without actually verifying the
---------------------------------------------------------------------------------
device-name#show
Note: The of the cpu
<node> utilization
can be taken from the fsipHostName="
node is down,
B. Memoryofusage
" parameter then serious
the Managed threshold data corruption
checkfield
Object may
for AHUB3-A: occur.
in the alarm information.
A. Foryour
Contact AHUB2: local customer support to find out the reason
ADPE2-A
Not supported.
B. Command
For /chassis-1/power-slot-1
AHUB3-A: ADPE2-A ON forN/A failure.N/A 0
not supported
AHUB3-<Base/Fabric># show process cpu

7. If the above steps cannot resolve the situation, contact your local
customer support.

C. Memory usage threshold check for AHUB4-A:
Not supported

B. For AHUB3-A:
AHUB3 (blade-mgmt)# show snmp-traps
C. For AHUB4-A:
device-name(enable)# show cpuload

The HW State ON indicates the FRU displayed under HW Unit section is
active.
The HW State OFF indicates the FRU displayed under HW Unit section is
inactive.
3. Follow the instructions below to check memory usage:
C. For AHUB4:
If the HW Unit is not displayed in the output but it is configured (present in
show has state managed-object /<value>
Provide an appropriate value for the managed object.

5. If the problem is permanent, check the status of the power units in
accordance with the hardware documentation and replace them if necessary.
6. If the problem exists even after following these instructions, contact your
local customer support with your observations on the sensor values.
Acknowledge automatic: Once the alarm is cleared, it will be automatically
acknowledged.

1. Check that the affected FRU is healthy and plugged in properly.
2. Restart the FRU if necessary.
3. Replace the FRU if necessary.
4. If the problem remains after following these instructions, please contact
your local customer support.

5. If the problem persists after following the instructions, please contact
your local customer support.

1. Replace the battery of the affected unit.
2. If the problem remains after following these instructions, please contact
your local customer support.

1. Determine the sensor number from the "Sensor" field of Identifying
Application Additional Information section of the alarm.
2. Determine the affected field-replaceable unit (FRU) from the "Position"
field of Identifying Application Additional Information section of the alarm.
3. Determine the severity of the alarm from the "Severity" field of the
alarm.
4. Please follow the steps given in the hardware documentation to check
the sensor value using the sensor number and FRU name.
5. Find another corresponding or related alarm in the system to resolve
the issue scope (For example: FRU, Shelf, Cabinet, System, Site).
6. Correct the temperature issue by checking whether any fan is broken, the
air filter is dirty, the air flow is blocked, or the ambient temperature is out of range.
7. Correct the temperature issue by replacing the FRU as mentioned in
the hardware documentation.
8. If the problem exists even after following the instructions, contact your
local customer support with your observations on the sensor values.
Acknowledge automatic: Once the alarm is cleared, it will be automatically
acknowledged.

1. Determine the sensor number from the "Sensor" field of Identifying
Application Additional Information section of the alarm.
2. Determine the affected Field Replacement Unit (FRU) from the
"Position" field of Identifying Application Additional Information section of
the alarm.
3. Determine the severity of the alarm from the "Severity" field of the
alarm.
4. Please follow the steps given in the hardware documentation to check
the sensor value using the sensor number and FRU name.
5. Correct the sensor value by replacing the FRU as mentioned in the
hardware documentation.
6. If the problem exists even after following these instructions, contact
your local customer support with your observations on the sensor values.
Acknowledge automatic: Once the alarm is cleared, it will be automatically
acknowledged.

1. Check that the shelf managers are appropriately plugged in, and are
running in a healthy state.
2. Check the configuration of shelf managers (whether username exists,
network configuration etc.).
3. If the problem persists even after following the instructions, please
contact your local customer support.

1. Check that all the FRUs are plugged in into their correct places based
on the intended hardware configuration.
2. If the problem remains after applying the instructions, please contact
your local customer support.

The following instructions are only applicable for the cause "Local
boot error while executing from flash":
1. Restart the affected unit. The position of the affected unit in the
example below is "Position=/chassis-1/slot-7":
a) Check the node name of the affected unit. For example:
show hardware state list
<...>
CSPU-1 : node available /cabinet-1/chassis-1/piu-1/addin-7/CPU-1/core-0,1,2,3,4,5,6,7,8,9,10,11
<...>
b) Restart the affected unit. For example:
set hardware restart node CSPU-1
Resetting CSPU-1 [ok]
2. If this does not fix the issue, try to reflash the embedded software
of the affected unit by following the instructions on product documentation.
3. If the problem remains after following these instructions, please contact
your local customer support.
The following instructions are only applicable for the cause "Network
boot error":
This error may happen if the cluster manager node is not yet available
to provide the PXE service, for example, during system restart.

1. Determine the sensor number from the "Sensor" field of Identifying
Application Additional Information section of the alarm.
2. Check if the blade self test status of the hardware unit has passed.
3. If the blade self test status of the hardware unit has failed, or is
pending, please contact your local customer support.

Fill in a problem report, and then send it to your local customer support.

THE RECOVERY PROCEDURE VARIES BETWEEN HARDWARE
ENVIRONMENTS, PAY ATTENTION IN BELOW PROCEDURES!
Disclaimer: The instructions below make use of either unsupported SCLI
commands or commands from the unsupported full bash shell. Please
carefully read the disclaimer that is shown when either entering the SCLI
unsupported vendor mode or the full bash shell. Do not use the
commands in any other context. Please check from the product
documentation or from local customer support for more information.
Note: The terms "CLA" and "CFPU" refer to a node that contains
centralized O&M and cluster management functionalities. The node
names used in the examples may not be valid for certain products. Actual
node names vary across different products. Exact node names for each
product can be found in the product-specific documentation.
Instructions for ATCA HW:
To recover the CLA node from Disk Out Of Sync (DOOS) on the ATCA
HW, perform the following steps:
1. Connect to the active CLA node via Secure Shell (SSH).
2. Enable the Preboot Execution Environment (PXE) boot.
Use the following SCLI command to enable PXE boot:

Execute the following instruction to check the connectivity:
1. Check that the peer network element is connected by using traceroute,
ping, or other similar utilities. The following commands could be executed:
a. Ping: ping <SESSION_DST_ADDRESS> -I <SESSION_SRC_ADDRESS>
b. Traceroute: traceroute <SESSION_DST_ADDRESS>
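
The two connectivity commands above can be assembled programmatically before they are run by hand. The following is a minimal sketch, assuming a Linux-style ping and traceroute; the function name and the example addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical and are not part of the product.

```python
import shlex
from typing import List


def build_connectivity_checks(dst_address: str, src_address: str) -> List[List[str]]:
    """Return argv lists for the ping and traceroute checks named in the instructions.

    dst_address stands for <SESSION_DST_ADDRESS>, src_address for <SESSION_SRC_ADDRESS>.
    """
    ping = ["ping", dst_address, "-I", src_address]
    traceroute = ["traceroute", dst_address]
    return [ping, traceroute]


if __name__ == "__main__":
    # Hypothetical addresses, used only to show the resulting command lines.
    for argv in build_connectivity_checks("192.0.2.10", "192.0.2.1"):
        print(shlex.join(argv))
```

The sketch only prints the command lines; whether and where to execute them is left to the operator.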
Use the following SCLI command to see the level at which the context
(application) is writing into the buffer:
show tracing config src-context all-nodes
Use the following SCLI command to change the level:
set tracing src-level off/fine/finer/finest/all all-nodes process <process-name> context <context-name>
2. If the problem still persists, fill in a problem report and contact your local
customer support.

1. Determine the sensor name from the "Sensor" field of the Identifying
Application Additional Information fields section of the alarm.
2. Determine the affected field-replaceable unit (FRU) from the "Position"
field of the Identifying Application Additional Information fields section of
the alarm.
3. Figure out other related alarms in the system to resolve the issue.
4. Please follow the steps given in the hardware documentation to find the
FRU, or other hardware module associated with the FRU, which is missing
with the help of the sensor name and the FRU name.
5. Try to correct the issue by inserting the FRU properly, or by replacing
the FRU or hardware module associated with the FRU as mentioned in
the hardware documentation.
6. If the problem remains after following these instructions, please contact
your local customer support.
NOTE: This alarm will also be raised in case a hardware entity is present
but not responding. To identify this, examine all the alarms raised for the
same plug-in unit. In case this hardware entity is left out intentionally, this
alarm can be ignored.
Acknowledge automatic: Once the alarm is cleared, it will be automatically
acknowledged.

1. If the application specific SW delivery pre-check command does not
provide automatic cleaning up operations to free the disk space, the
following manual steps can be tried.
Check for any unused SW deliveries which could be removed to free the
disk space (in case a new SW delivery needs to be installed to the
system). All deliveries can be listed with the following SCLI command:
> show sw-manage list
A currently active delivery can be listed with the following SCLI command
(this one should not be removed):
> show sw-manage current all
A delivery can be removed with the following SCLI command:
> delete sw-manage delivery <delivery label>
2. Check for any unused configuration snapshots created by the user. The
snapshots can be listed with the following SCLI command:
> show snapshot listall
Notice that the snapshots created automatically during delivery installation
are automatically removed when the delivery is removed (so no need to
remove them manually).
Snapshots can be removed with the following SCLI command:
> delete snapshot config-name <name of snapshot>

1. If the application specific SW delivery pre-check command does not
provide automatic cleaning up operations to free the disk space, the
following manual steps can be tried.
Check for any unused SW deliveries which could be removed to free the
disk space (in case a new SW delivery needs to be installed to the
system). All deliveries can be listed with the following SCLI command:
> show sw-manage list
A currently active delivery can be listed with the following SCLI command
(this one should not be removed):
> show sw-manage current all
A delivery can be removed with the following SCLI command:
> delete sw-manage delivery <delivery label>
2. Check for any unused configuration snapshots created by the user. The
snapshots can be listed with the following SCLI command:
> show snapshot listall
Notice that the snapshots created automatically during delivery installation
are automatically removed when the delivery is removed (so no need to
remove them manually).

Disclaimer: The instructions below use either unsupported SCLI
commands, or commands from the unsupported full bash shell. Please
carefully read the disclaimer that is shown when either entering the
unsupported SCLI vendor mode or the full bash shell. Do not use the
commands in any other context. Please check from the product
documentation or from your local customer support for more information.

1. This alarm indicates that the signaling connection control block
resource has reached the signaling connection congestion limit defined in
the system by the parameter "sccp-connections-congestion-threshold"
which is configured as part of product deployment data. In such a
situation, the number of idle signaling connections as specified by the
"Idle" sub-field in the Application Additional Information field will be less
than or equal to the difference between the parameters "max-sccp-
connections" and "sccp-connections-congestion-threshold". The
listed parameters "max-sccp-connections" and "sccp-connections-congestion-
threshold" are defined as part of deployment data.

For CONNECTION_CONTROL_BLOCK case:
1. This alarm indicates the signaling connection control block resource
has reached the maximum limit defined in the system by the parameter
"max-sccp-connections". In such a situation, the number of idle signaling
connections as specified by the "Idle" sub-field in the Application
Additional Information field is 0. It means that all the signaling connection
control block resources are utilized and no more connections will be
2. The signaling connections that are triggered to be closed as specified in

Out of memory situations as detected by this alarm type are rare. If this alarm
is raised, then the basic recovery mechanism is to restart the plug-in unit.
Before doing so, in order for your local customer support to investigate the
problem further, please collect the data as per the instructions below:

Disclaimer: The instructions below use either unsupported SCLI
commands, or commands from the unsupported full bash shell. Please
carefully read the disclaimer that is shown when either entering the
unsupported SCLI vendor mode or the full bash shell. Do not use the
commands in any other context. Please check from the product
documentation or from your local customer support for more information.

Disclaimer: The instructions below use either unsupported SCLI
commands, or commands from the unsupported full bash shell. Please
carefully read the disclaimer that is shown when either entering the
unsupported SCLI vendor mode or the full bash shell. Do not use the
commands in any other context. Please check from the product
documentation or from your local customer support for more information.

Note: The output displayed by the show hardware inventory list brief
command in the instructions below refer to the node and board names of
a cluster. The node and board names vary between different products.

1. Determine the affected Field Replacement Unit (FRU) from the
"Position" field of the Identifying Application Additional Information fields
section of

A single instance of the alarm may be due to some transient cause, and
does not require specific actions. If the alarm from the same node however
does not clear automatically, or keeps repeating, replacing the board or
the reported memory module (DIMM-X) should be considered.
1. Use the following command to get the alarm details:
show alarm active filter-by specific-problem 70370
Alarm ID : 417
Specific problem : 70370 - MEMORY ERRORS
Managed object : fsipHostName=CLA-0,fsFragmentId=Nodes,fsFragmentId=HA,fsClusterId=ClusterRoot
Severity : 5 (warning)
Cleared : no
Clearing : manual
Acknowledged : no
Ack. user ID : N/A
Ack. time : N/A
Alarm time : 2015-03-31 12:40:40:576 EEST
Event type : x5 (equipment)
Application : fsClusterId=ClusterRoot
Identif appl. addl. info : Class: correctable, Affected Memory: DIMM-2
Appl. addl. info : Error rate: 10/24h, Affected location: DIMM ID=2, Channel ID=3
--------------------------------------------------------------------------------
2. Figure out the node that has memory errors from the MO field of the
alarm. For example, from step # 1, affected node is CLA-0.
3. Figure out the error class and memory from the Identifying Application
Additional Information field of the alarm. For example, from step # 1, the
error class is "correctable" and the memory is "DIMM".
4. Use the following SCLI command to clear the alarm and to verify if and
how soon the alarm reappears.
set alarm clear alarm-id <alarm-id>
If the alarm is still not cleared, contact customer support services.
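
Steps 2 and 3 of the memory-error instructions read the affected node from the managed object and the error class from the identifying application additional info. A minimal sketch of that reading is shown below, assuming the alarm record is available as text with "label : value" lines like the sample in this document. The function names are hypothetical, and the exact spacing of real SCLI output is an assumption.

```python
from typing import Dict, Optional

# Excerpt of the sample alarm record from this document (spacing is assumed).
SAMPLE = """\
Alarm ID : 417
Specific problem : 70370 - MEMORY ERRORS
Managed object : fsipHostName=CLA-0,fsFragmentId=Nodes,fsFragmentId=HA,fsClusterId=ClusterRoot
Severity : 5 (warning)
Identif appl. addl. info : Class: correctable, Affected Memory: DIMM-2
"""


def parse_alarm_record(text: str) -> Dict[str, str]:
    """Split each 'label : value' line at the first ' : ' separator."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            record[key.strip()] = value.strip()
    return record


def node_from_managed_object(managed_object: str) -> Optional[str]:
    """Return the fsipHostName value, which names the affected node."""
    for part in managed_object.split(","):
        name, _, value = part.partition("=")
        if name == "fsipHostName":
            return value
    return None


def parse_iaai(iaai: str) -> Dict[str, str]:
    """Turn 'Class: correctable, Affected Memory: DIMM-2' into a dictionary."""
    fields = {}
    for part in iaai.split(","):
        name, sep, value = part.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    record = parse_alarm_record(SAMPLE)
    print(node_from_managed_object(record["Managed object"]))
    print(parse_iaai(record["Identif appl. addl. info"]))
```

Keeping the three parsing steps separate mirrors the numbered steps in the instructions: first obtain the record, then the node, then the error class and memory module.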
1. This alarm indicates that there are signaling connections dropped. Use
the product documentation to check the signaling connection drop.
This alarm is raised with the intent to avoid flooding of alarms in cases
where the signaling objects are changing status frequently. The user must
refer to the "Instructions" section of those specific alarms that were raised
due to object status change.
1. Use the following SCLI command to check whether other alarms (for
example: 70399, 70397) affecting the D-channel in question are active:

2. Use the following SCLI command to get the node name for the faulty
FRU:
> show hardware inventory list brief
The column "Node/Host" will give the node name for the corresponding
entity as shown in the following example:
Actual Type Unused Node/Host Expected Type Admin-ignore Entity
HDSAM-A N/A HDSAM-A no /chassis-1/AMC-1
BS2AM-A PTU-1-1-1 BS2AM-A no /chassis-1/AMC-2
BMFU-B N/A BMFU-B no /chassis-1/ft-1
BMFU-B N/A BMFU-B no /chassis-1/ft-2
BAFU-A N/A BAFU-A no /chassis-1/ft-3
BCNMB-B LMP-1-1-1 BCNMB-B no /chassis-1/motherboard-1
BCNAP-B N/A BCNAP-B no /chassis-1/power-supply-1
BCNAP-B N/A BCNAP-B no /chassis-1/power-supply-2
BMPP2-B CFPU-0,SE-0 BMPP2-B no /chassis-1/slot-1
From the example, if the position of the faulty FRU is "/chassis-1/AMC-2",
then the node name would be "PTU-1-1-1".
3. Check the status of the troubled reference using the following SCLI
command. Check also whether the reference has valid parameter values
(priority and SSM).
> show clock-sync server PTU-1-1-1 hwclock dpll-ref-status
HW Clock Zl30310 DPLL All Reference Status:
ref0 - SFP0 status = invalid
ref1 - SFP1 status = invalid
ref2 - GNSS status = invalid
ref3 - NC status = invalid
ref4 - EXT FREQ INPUT status = invalid
ref5 - NC status = invalid
ref6 - TCLKA status = valid
ref7 - NC status = invalid
If all reference input type status are invalid state, that means no
reference clock type is enabled. In this case, the operator should choose
appropriate reference clock input type using the following SCLI command,
to see if the reference status is valid.
> set clock-sync server PTU-1-1-1 hwclock dpll-input ref-input
BACKPLANE activation-option immediate
4. If the problem persists after following the instructions, contact your local
customer support. Operator can check in advance the reference source
and the transmission path of the reference clock for failure.

This alarm is an informative alarm indicating that a node has been
(re)started. Check the restart caused by the reported sensor offset and
related detailed meaning. If the reported cause is internal to the node or
system and node restart occurs again and again, the node may be
defective and needs to be replaced. For hints on what is causing restarts,
check the alarm status in the cluster after the restart.

Note: To check if the alarm is repeating follow the below steps:
1. Clear the alarm manually using below SCLI command and wait for few
minutes:
set alarm clear alarm-id <alarm_id>
where <alarm_id> can be found from the active alarm record.
2. If the alarm is raised again repeat the above step 1.
3. Repeat the same for at least 5 times.
If the problem persists after following the instructions, please contact your
local customer support.

1. Determine the affected Field Replacement Unit (FRU) from the
identifying application additional information (IAAI) field in alarm record.
The "Position" field in IAAI indicates the location of FRU.
2. Use the following SCLI command to get the node name for the faulty
FRU:
> show hardware inventory list brief
The column "Node/Host" will give the node name for the corresponding
entity or FRU, as shown in the following example:
Actual Type Unused Node/Host Expected Type Admin-ignore Entity
HDSAM-A N/A HDSAM-A no /chassis-1/AMC-1
BS2AM-A PTU-1-1-1 BS2AM-A no /chassis-1/AMC-2
BMFU-B N/A BMFU-B no /chassis-1/ft-1
BMFU-B N/A BMFU-B no /chassis-1/ft-2
BAFU-A N/A BAFU-A no /chassis-1/ft-3
BCNMB-B LMP-1-1-1 BCNMB-B no /chassis-1/motherboard-1
BCNAP-B N/A BCNAP-B no /chassis-1/power-supply-1
BCNAP-B N/A BCNAP-B no /chassis-1/power-supply-2
BMPP2-B CFPU-0,SE-0 BMPP2-B no /chassis-1/slot-1
From the example, if the position of the faulty FRU is "/chassis-1/AMC-2",
then the node name would be "PTU-1-1-1".
3. Following are the different recovery actions to be taken, based on the
error type.
[6] Use the default configuration file to load working settings and recover
the needed settings, so that a new startup configuration can be created.
Use the following SCLI command to load the default configuration
settings:
> set clock-sync server PTU-1-1-1 runtime default-config
[7] Use the startup configuration file to load working settings if the error 6
is not raised. If error 6 is raised, the startup configuration file needs to be
manually edited and recovered according to the content available in the
documentation.
If there are no problems in loading default configuration file, then restart
the sync-mgmt-app using the following SCLI command, so it reads the
current start-up configuration file to load the working settings.
> set clock-sync server PTU-1-1-1 restart
4. If the previous steps have not resolved the situation, contact your local

1. If the value of Switch Status is other than OK, check that all fibre
channel cables at the back of the chassis are properly connected to their
corresponding fibre channel switch modules.
2. If the problem still persists, replace the affected fibre channel switch
module in the chassis.
3. If the previous steps have not solved the situation, contact your local
customer support.

1. If the value of Port Status is other than OK, the reason for triggering this
alarm has to be studied properly. If the cause of the alarm appears to be
in the application software or configuration, it has to be corrected.
2. If the problem still persists, replace the affected fibre channel switch
module in the chassis.
3. If the previous steps have not solved the situation, contact your local
customer support.

Following are the different recovery actions to be taken, based on the
error type.
ErrorType = 0, 1, 2:
The Packet Timing Unit internal monitoring process (monitd) will attempt
to recover the GNSS process[0], HWM process[1] and SYNC process[2].
Therefore, the recovery will be automatic. If the problem persists, consider
restarting the unit.
To restart the unit do the following steps:
a. Determine the affected Field Replacement Unit (FRU) from the
"Position" field of the Identifying Application Additional Information fields
section of the alarm.
b. To get the node name for the faulty FRU, enter the following SCLI
command:
> show hardware inventory list brief
The column "Node/Host" displays the node name for the corresponding
entity as shown in the following example:
Actual Type Unused Node/Host Expected Type Admin-ignore Entity
HDSAM-A N/A HDSAM-A no /chassis-1/AMC-1
BS2AM-A PTU-1-1-1 BS2AM-A no /chassis-1/AMC-2
BMFU-B N/A BMFU-B no /chassis-1/ft-1
BMFU-B N/A BMFU-B no /chassis-1/ft-2
BAFU-A N/A BAFU-A no /chassis-1/ft-3
BCNMB-B LMP-1-1-1 BCNMB-B no /chassis-1/motherboard-1
BCNAP-B N/A BCNAP-B no /chassis-1/power-supply-1
BCNAP-B N/A BCNAP-B no /chassis-1/power-supply-2
BMPP2-B CFPU-0,SE-0 BMPP2-B no /chassis-1/slot-1
From this example, if the position of the faulty FRU is "/chassis-1/AMC-2",
then the node name would be "PTU-1-1-1".
c. Enter the following SCLI command to restart the unit:
> set hardware restart node PTU-1-1-1
ErrorType = 3:
The Packet Timing Unit internal monitoring process (monitd) attempts to
recover the NTP process. Therefore, the recovery is automatic. If the
problem persists, consider restarting the ntpd process or the unit.
Disclaimer: The instructions below use either unsupported SCLI
commands or commands from the unsupported full bash shell. Please
carefully read the disclaimer that is shown when either entering the
unsupported SCLI vendor mode or the full bash shell. Do not use the
commands in any other context. Please check the product documentation
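
The FRU-to-node lookup used in the instructions above (match the "Position" of the faulty FRU against the Entity column, then read the Node/Host value) can be sketched as follows. This is a minimal sketch, assuming each data row carries five whitespace-separated fields in the order shown in the sample output; the function names are hypothetical and the parsing has not been checked against real SCLI output.

```python
from typing import Optional

# Rows copied from the sample "show hardware inventory list brief" output.
SAMPLE_INVENTORY = """\
HDSAM-A N/A HDSAM-A no /chassis-1/AMC-1
BS2AM-A PTU-1-1-1 BS2AM-A no /chassis-1/AMC-2
BCNMB-B LMP-1-1-1 BCNMB-B no /chassis-1/motherboard-1
BMPP2-B CFPU-0,SE-0 BMPP2-B no /chassis-1/slot-1
"""


def node_for_entity(inventory_text: str, entity: str) -> Optional[str]:
    """Return the Node/Host value of the row whose Entity equals `entity`."""
    for line in inventory_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # Skip headers, blank lines, and rows in another layout.
        _actual_type, node_host, _expected_type, _admin_ignore, row_entity = fields
        if row_entity == entity:
            return None if node_host == "N/A" else node_host
    return None


def restart_command(node_name: str) -> str:
    """Build the restart command quoted in the instructions for a given node."""
    return f"set hardware restart node {node_name}"


if __name__ == "__main__":
    node = node_for_entity(SAMPLE_INVENTORY, "/chassis-1/AMC-2")
    if node is not None:
        print(restart_command(node))
```

Returning None for rows whose Node/Host is "N/A" keeps the caller from building a restart command for an entity that has no node.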
Clearing Time to Live
Do not clear the alarm. This alarm is cleared automatically
by the fault detector of the operating system when the file
system problem is fixed.
1. show alarm active filter-by specific-problem 70462 0
This command displays the alarm-id
2. set alarm clear forced yes alarm-id <alarm id of the
alarm>
Do not cancel the alarm. The system clears the alarm 0
automatically when the fault has been corrected.
Do not clear the alarm. The system clears the alarm 0
automatically when the fault
is corrected.
Steps: 0
1. Rollback the in-service upgrade procedure according to
the customer documentation instructions.
2. Clear the alarm manually.
The system clears the alarm automatically when the fault 0
has been corrected
The system clears the alarm automatically when the fault 0
has been corrected.
The alarm is cleared automatically. 0
The alarm is cleared automatically. 0
After correcting the fault, according to the Instructions
section, clear the alarm by using the following SCLI
command:
set alarm clear alarm-id <alarm id>
Example:
set alarm clear-matching-alarms filter-by specific-problem
70370 managed-object
<managed object of the alarm> application-id <application
id of the alarm> identifying-application-additional-info
<identifying application
additional info of the alarm>
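
The two manual clearing commands above differ only in how the alarm is selected: by alarm ID, or by matching fields of the alarm record. A minimal sketch of building both command strings is given below. The function names are hypothetical, the example values are placeholders, and any quoting that the SCLI may require for values containing spaces is not handled here.

```python
def clear_by_id(alarm_id: str) -> str:
    """Build the command that clears a single alarm by its alarm ID."""
    return f"set alarm clear alarm-id {alarm_id}"


def clear_matching(specific_problem: str, managed_object: str,
                   application_id: str, iaai: str) -> str:
    """Build the command that clears alarms matching the given record fields."""
    return (
        "set alarm clear-matching-alarms filter-by "
        f"specific-problem {specific_problem} "
        f"managed-object {managed_object} "
        f"application-id {application_id} "
        f"identifying-application-additional-info {iaai}"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(clear_by_id("417"))
    print(clear_matching("70370", "<managed object>", "<application id>", "<iaai>"))
```

Generating the string rather than typing it reduces the chance of omitting one of the four filter fields.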
7657
Equipment Critical
Equipment Major
Equipment Minor
Communications Critical
Communications Major
Communications Minor
Meaning Effect
Meaning: A critical fault (or faults) has occurred in the base station.
Check the reason for the fault from the supplementary text field of the alarm.
Effect: The effect of the fault on the functioning of the network element depends on the fault
description. For more information, see base station fault descriptions in LTE System Libraries.

Meaning: A major fault (or faults) has occurred in the base station.
Check the reason for the fault from the supplementary text field of the alarm.
Effect: The effect of the fault on the functioning of the network element depends on the fault
description. For more information, see base station fault descriptions in LTE System Libraries.

Meaning: A minor fault (or faults) has occurred in the base station.
Check the reason for the fault from the supplementary text field of the alarm.
Effect: The effect of the fault on the functioning of the network element depends on the fault
description. For more information, see base station fault descriptions in LTE System Libraries.

Meaning: A critical fault (or faults) has occurred in a unit (or units) that belong to the sector
indicated in the alarm.
Check the reason for the fault from the supplementary text field of the alarm.
Effect: The effect of the fault on the functioning of the network element depends on the fault
description. For more information, see base station fault descriptions in LTE System Libraries.

Meaning: A major fault (or faults) has occurred in a unit (or units) that belong to the sector
indicated in the alarm.
Check the reason for the fault from the supplementary text field of the alarm.
Effect: The effect of the fault on the functioning of the network element depends on the fault
description. For more information, see base station fault descriptions in LTE System Libraries.

Meaning: A minor fault (or faults) has occurred in a unit (or units) that belong to the sector
indicated in the alarm.
Check the reason for the fault from the supplementary text field of the alarm.
Effect: The effect of the fault on the functioning of the network element depends on the fault
description. For more information, see base station fault descriptions in LTE System Libraries.

Meaning: A critical fault (or faults) has occurred in the base station interface.
Check the reason for the fault from the supplementary text field of the alarm.

Meaning: A major fault (or faults) has occurred in the base station interface. Check the reason
for the fault from the supplementary text field of the alarm.
Effect: The effect of the fault on the functioning of the network element depends on the fault
description. For more information, see base station fault descriptions in LTE System Libraries.

Meaning: This is a spare alarm for handling late churn in the release. Please see (1) additional
information fields in the alarm to determine the BTS fault that raised this alarm and the
(2) reported dynamic severity in the alarm to determine the urgency of the issue.
This alarm is applicable independent of the value of actCategoryAlarms.
Effect: The effect of the fault on the functioning of the network element depends on
the fault description.

Meaning: A transmission fault (or faults) has occurred in the BTS. This alarm is an encapsulation
alarm that is used to transfer the Flexi Transport Submodule (FTM) alarm data over the
BTS O&M connection through iOMS to NetAct. In NetAct this alarm is shown in opened format.
This means that the alarm number, alarm text, and supplementary information are shown in
the original FTM format.
Check the reason for the fault from the supplementary information fields and
supplementary text field of the alarm.
Effect: The effect of the fault on the functioning of the network element depends on the fault
description. For more information, see base station fault descriptions in LTE System Libraries.
Identifying Additional Information Fields
1. rack (cabinet) number
2. shelf number
3. slot
4. type of unit
5. unit number
6. subunit number
7. path (for alarms where field "type of unit" contains one of the values: FSM, FT, FSP, FBB, FR, FAN, AntennaLine, MHA, RET, FYG, SFP, ASIA, ABIA)

1. rack (cabinet) number
2. shelf number
3. slot
4. type of unit
5. unit number
6. subunit number
7. path (for alarms where field "type of unit" contains one of the values: FSM, FT, FSP, FBB, FR, FAN, AntennaLine, MHA, RET, FYG, SFP)

1. rack (cabinet) number
2. shelf number
3. slot
4. type of unit
5. unit number
6. subunit number
7. path (for alarms where field "type of unit" contains one of the values: FSM, FT, FSP, FBB, FR, FAN, AntennaLine, MHA, RET, FYG, SFP, ASIA, ABIA)

1. rack (cabinet) number
2. shelf number
3. slot
4. type of unit
5. unit number
6. subunit number
7. path (for alarms where field "type of unit" contains one of the values: FSM, FT, FSP, FBB, FR, FAN, AntennaLine, MHA, RET, FYG, SFP, ASIA, ABIA)

1. rack (cabinet) number
2. shelf number
3. slot
4. type of unit
5. unit number
6. subunit number
7. path (for alarms where field "type of unit" contains one of the values: FSM, FT, FSP, FBB, FR, FAN, AntennaLine, MHA, RET, FYG, SFP, ASIA, ABIA)

1. rack (cabinet) number
2. shelf number
3. slot
4. type of unit
5. unit number
6. subunit number
7. path (for alarms where field "type of unit" contains one of the values: FSM, FT, FSP, FBB, FR, FAN, AntennaLine, MHA, RET, FYG, SFP, ASIA, ABIA)

1. rack (cabinet) number
2. shelf number
3. slot
4. type of unit
5. unit number
6. subunit number
Additional Information Fields: Destination IP address

1. rack (cabinet) number
2. shelf number
3. slot
4. type of unit
5. unit number
6. subunit number
Additional Information Fields: Destination IP address

1. rack (cabinet) number
2. shelf number
3. slot
4. type of unit
5. unit number
6. subunit number
7. path
8. supplAlarmInfo (contains the real Alarming Object and Alarm Name)
Additional Information Fields: Serial Number (when applicable)

1. probable cause reported by FTM
2. the managed object reported by FTM
3. alarm number reported by FTM. The FTM has reserved alarm numbers from space 61000-61999

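
The positional field order listed above (rack, shelf, slot, type of unit, unit number, subunit number, path) can be illustrated with a minimal Python sketch. It is not a NetAct or BTS interface: the helper names, the `UnitLocation` record, and the input format (a plain sequence of strings) are assumptions made for illustration, and the 61000-61999 range is assumed to be inclusive at both ends.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Unit types for which field 7 ("path") is defined. This is the longest
# variant of the list given above; one alarm lists the values only up to SFP.
PATH_UNIT_TYPES = {"FSM", "FT", "FSP", "FBB", "FR", "FAN", "AntennaLine",
                   "MHA", "RET", "FYG", "SFP", "ASIA", "ABIA"}

# Alarm numbers reserved for the FTM (assumed inclusive: 61000..61999).
FTM_ALARM_RANGE = range(61000, 62000)


@dataclass
class UnitLocation:
    """Hypothetical record for identifying fields 1-7."""
    rack: str
    shelf: str
    slot: str
    unit_type: str
    unit_number: str
    subunit_number: str
    path: Optional[str] = None


def parse_unit_location(fields: Sequence[str]) -> UnitLocation:
    """Map positional identifying fields to named attributes."""
    if len(fields) < 6:
        raise ValueError("expected at least 6 identifying fields")
    rack, shelf, slot, unit_type, unit_number, subunit = fields[:6]
    path = None
    # Field 7 is only kept when the unit type is one that defines a path.
    if len(fields) > 6 and unit_type in PATH_UNIT_TYPES:
        path = fields[6]
    return UnitLocation(rack, shelf, slot, unit_type, unit_number, subunit, path)


def is_ftm_alarm_number(number: int) -> bool:
    """True if the number falls in the range reserved for FTM alarms."""
    return number in FTM_ALARM_RANGE
```

For example, `parse_unit_location(["1", "1", "2", "FSM", "1", "0", "some-path"])` keeps the seventh field as `path`, while the same call with an unlisted unit type leaves `path` as `None`.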
Instructions