
ElasticNet UME R18

Unified Management Expert System


Alarm Handling Reference

Version: V16.19.40

ZTE CORPORATION
No. 55, Hi-tech Road South, ShenZhen, P.R.China
Postcode: 518057
Tel: +86-755-26771900
URL: http://support.zte.com.cn
E-mail: support@zte.com.cn
LEGAL INFORMATION
Copyright 2020 ZTE CORPORATION.

The contents of this document are protected by copyright laws and international treaties. Any reproduction
or distribution of this document or any portion of this document, in any form by any means, without the
prior written consent of ZTE CORPORATION is prohibited. Additionally, the contents of this document
are protected by contractual confidentiality obligations.

All company, brand and product names are trade or service marks, or registered trade or service marks,
of ZTE CORPORATION or of their respective owners.

This document is provided as is, and all express, implied, or statutory warranties, representations or
conditions are disclaimed, including without limitation any implied warranty of merchantability, fitness for
a particular purpose, title or non-infringement. ZTE CORPORATION and its licensors shall not be liable
for damages resulting from the use of or reliance on the information contained herein.

ZTE CORPORATION or its licensors may have current or pending intellectual property rights or
applications covering the subject matter of this document. Except as expressly provided in any written
license between ZTE CORPORATION and its licensee, the user of this document shall not acquire any
license to the subject matter herein.

ZTE CORPORATION reserves the right to upgrade or make technical change to this product without
further notice.

Users may visit the ZTE technical support website http://support.zte.com.cn to inquire for related
information.

The ultimate right to interpret this product resides in ZTE CORPORATION.

Revision History

Revision No. Revision Date Revision Reason


R1.0 2020-01-10 First edition.

Serial Number: SJ-20191220164305-036

Publishing Date: 2020-01-10 (R1.0)


Contents
1 Equipment Alarm.......................................................................................1-1
1.1 0001 Container Startup Failed..............................................................................1-1
1.2 0011 Node Heartbeat lost..................................................................................... 1-2
1.3 0051 Hardware performance threshold exceeded................................................1-3
1.4 0069 Hardware Alarm........................................................................................... 1-4
1.5 3003 Component Instance State Exception......................................................... 1-4
1.6 7004 NFS Shared Volume Multi-Mounted In The Cluster.....................................1-5
1.7 7005 Read-Only NFS Shared Volume..................................................................1-5
1.8 9031 One Port of the Bond Group Fault.............................................................. 1-5
1.9 9032 All Ports of the Bond Group Fault............................................................... 1-6
1.10 9033 OVS Service Fault..................................................................................... 1-6
1.11 9321 Platform pg database is unusable............................................................. 1-7
1.12 9322 Plateform pg node instance is abnormal................................................... 1-7
1.13 9323 pacemaker cluster heartbeat is abnormal..................................................1-8
2 QoS Alarm..................................................................................................2-1
2.1 0002 Node CPU usage too high.......................................................................... 2-2
2.2 0003 Too High Memory Usage of the Node......................................................... 2-3
2.3 0004 Too High CPU Usage of the Component.................................................... 2-4
2.4 0005 Too High Memory Usage of the Component............................................... 2-5
2.5 0009 Node disk usage too high............................................................................2-5
2.6 0015 Minion Cpu Allocation Ratio Too High......................................................... 2-6
2.7 0017 Minion Memory Allocation Ratio Too High...................................................2-7
2.8 0018 Minion Filesystem Usage Rate Too High.................................................... 2-7
2.9 0019 Node disk partition usage too high..............................................................2-8
2.10 0030 The time offset from the NTP server is too large.......................................2-9
2.11 0031 Too High CPU Iowait Usage of the Node................................................ 2-10
2.12 0032 Too High CPU Usage of Steal.................................................................2-11
2.13 0034 Too High PID Usage of the Node............................................................ 2-12
2.14 0035 Too High System Load per Minute.......................................................... 2-13
2.15 0036 Node can not synchronize time with NTP server.....................................2-14
2.16 0037 Node Network Rx Rate Too High............................................................ 2-15
2.17 0038 Node Network Tx Rate Too High.............................................................2-17
2.18 0039 NTP service exit...................................................................................... 2-18

2.19 0050 Business performance threshold exceeded.............................................2-18
2.20 0052 Middleware performance threshold exceeded......................................... 2-19
2.21 0053 Framework performance threshold exceeded......................................... 2-19
2.22 0054 self-maintain performance threshold exceeded....................................... 2-20
2.23 1513 Performance threshold alarm...................................................................2-20
2.24 3001 Abnormal Minion Status...........................................................................2-21
2.25 3002 Cluster status abnormal...........................................................................2-22
2.26 4002 Insufficient Tenant Quota......................................................................... 2-23
2.27 5001 Certificate Will Expire Soon..................................................................... 2-24
2.28 5002 Certificate Expired....................................................................................2-24
2.29 5101 Create Project Failed............................................................................... 2-24
2.30 5102 Delete Project Failed............................................................................... 2-25
2.31 7001 Storage cluster status abnormal.............................................................. 2-26
2.32 7002 Cluster Capacity Usage Exceeded the Threshold................................... 2-27
2.33 7003 Volume Capacity Usage Exceeded the Threshold...................................2-27
2.34 8501 NBM Initialization Failed.......................................................................... 2-28
2.35 9101 PostgreSQL database cluster unavailable...............................................2-28
2.36 9102 PostgreSQL database cluster contains unavailable nodes...................... 2-29
2.37 9103 PostgreSQL database cluster replication interrupts or produces brain-split..... 2-30
2.38 9104 PostgreSQL database master and standby cluster replication interruption.....2-31
2.39 9105 PostgreSQL database failed to archive log file........................................2-32
2.40 9141 Index is damaged in common service Elasticsearch............................... 2-33
3 OMC Alarm.................................................................................................3-1
3.1 1000 User locked.................................................................................................. 3-2
3.2 1001 Hard disk usage of database server overload............................................. 3-2
3.3 1002 CPU usage of application server overload.................................................. 3-3
3.4 1003 RAM usage of application server overload..................................................3-3
3.5 1004 Application server disk-overload.................................................................. 3-3
3.6 1008 Database instance space usage too large.................................................. 3-4
3.7 1012 License is expired........................................................................................3-4
3.8 1013 License is about to expire........................................................................... 3-5
3.9 1015 The link between the server and the ME agent is broken........................... 3-5
3.10 1017 The time in which the designated alarm remains active has expired......... 3-6
3.11 1018 The time in which the designated alarm remains unacknowledged has expired..... 3-6

3.12 1022 Merge rule root alarm................................................................................ 3-7
3.13 1023 Suppress plan task.................................................................................... 3-7
3.14 1025 Automatic backup failure........................................................................... 3-8
3.15 1028 Alarm forwarding failure.............................................................................3-8
3.16 1034 License consumption exceeds the alarm threshold................................... 3-9
3.17 1035 License consumption exceeds the total authorization............................... 3-9
3.18 1050 Wrong login password............................................................................... 3-9
3.19 1060 The number of users assigned the specific type exceeds the limit.......... 3-10
3.20 1061 The number of users assigned the specific type is about to exceed the limit..... 3-10
3.21 1300 Password has expired............................................................................. 3-11
3.22 1301 Password will expire................................................................................ 3-11
3.23 1310 The number of login users exceeds the limit...........................................3-11
3.24 1311 SNMP authentication failure.....................................................................3-12
4 Communication Alarm.............................................................................. 4-1
4.1 1014 The link between the server and the ME is broken..................................... 4-1
4.2 1040 ME or agent backend start failure............................................................... 4-1
4.3 200204012 S1 link is broken................................................................................ 4-2
4.4 200204013 Power supply failure.......................................................................... 4-2
4.5 200204014 Transport failure................................................................................. 4-3
5 Processing Error Alarm............................................................................5-1
5.1 0502 K8s schedule failed..................................................................................... 5-1
5.2 0503 K8s create pod failed...................................................................................5-3
5.3 0504 Failed to Delete a Pod................................................................................ 5-3
5.4 1014 Abnormal Service Operational Status..........................................................5-4
5.5 1015 Abnormal Mircroservice Operational Status................................................ 5-5
5.6 2001 Add network for Pod error........................................................................... 5-6
5.7 2002 IaaS account authentication failed...............................................................5-7
5.8 8001 Commonservice deployed failed..................................................................5-7
5.9 9302 Failed to synchronize data to slave zone.................................................... 5-8
6 Environment Alarm................................................................................... 6-1
6.1 9121 FTP disk space is insufficient...................................................................... 6-1
6.2 9122 FTP disk read and write exception..............................................................6-2
6.3 9201 Common Service Kafka node is offline....................................................... 6-2
6.4 9301 The connection for geographical disaster recovery is broken......................6-3
7 Integrity Violation Alarm...........................................................................7-1
7.1 15010001 Alarm for Missing of PM Data............................................................. 7-1

7.2 15010002 Alarm for Missing of NAF PM Data..................................................... 7-1
Glossary............................................................................................................. I

About This Manual
Purpose

The ElasticNet UME R18 (hereinafter referred to as the UME) is a RAN element
management system.
This manual provides a reference for alarms related to the UME system. For alarms
related to a specific NE, refer to the corresponding user manual of the NE.

Intended Audience

This manual is intended for:


 Maintenance engineers
 Software debugging engineers

What Is in This Manual

This manual contains the following chapters.

Chapter 1, Equipment Alarm: Provides a reference for equipment alarms related to the UME system.

Chapter 2, QoS Alarm: Provides a reference for QoS alarms related to the UME system.

Chapter 3, OMC Alarm: Provides a reference for network management alarms related to the UME system.

Chapter 4, Communication Alarm: Provides a reference for communication alarms related to the UME system.

Chapter 5, Processing Error Alarm: Provides a reference for processing error alarms related to the UME system.

Chapter 6, Environment Alarm: Provides a reference for environment alarms related to the UME system.

Chapter 7, Integrity Violation Alarm: Provides a reference for integrity violation alarms related to the UME system.

Related Documentation

The following documentation is related to this manual:

ElasticNet UME R18 Unified Management Expert System Alarm Management Operation
Guide

Conventions

This manual uses the following conventions.

Note: provides additional information about a topic.

Chapter 1
Equipment Alarm
Table of Contents
0001 Container Startup Failed.......................................................................................1-1
0011 Node Heartbeat lost..............................................................................................1-2
0051 Hardware performance threshold exceeded........................................................ 1-3
0069 Hardware Alarm....................................................................................................1-4
3003 Component Instance State Exception.................................................................. 1-4
7004 NFS Shared Volume Multi-Mounted In The Cluster............................................. 1-5
7005 Read-Only NFS Shared Volume.......................................................................... 1-5
9031 One Port of the Bond Group Fault....................................................................... 1-5
9032 All Ports of the Bond Group Fault........................................................................ 1-6
9033 OVS Service Fault................................................................................................ 1-6
9321 Platform pg database is unusable........................................................................1-7
9322 Plateform pg node instance is abnormal..............................................................1-7
9323 pacemaker cluster heartbeat is abnormal............................................................ 1-8

1.1 0001 Container Startup Failed


Alarm Information

 Alarm code: 0001


 Alarm description: Container Startup Failed
 Alarm level: Major
 Alarm type: Equipment Alarm

Alarm Cause

Blueprint container image error


Network plug-in failure
Container runtime failure
Application error

SJ-20191220164305-036 | 2020-01-10 (R1.0) 1-1


ElasticNet UME R18 Alarm Handling Reference

Action

1. On the Application Manager page, click an application to enter the details page.
Click the Alarm tab. View the current alarms and check whether there is the “Pod
network configuration failure” alarm.
a. Yes -> Handle the fault based on related handling suggestions.
b. No -> Step 2.
2. If this application is not deployed through a blueprint, go to Step 5.
3. If this application is deployed through a blueprint, check whether the blueprint
container image is correct.
4. Select Software Repository > Blueprint, and click the blueprint used by the
application. Click the blueprint version in use. In the Action column, select Edit to
enter the blueprint editing page. Check whether the Pod container image name and
version number exist in the image repository.
a. No -> Modify the Pod container image and save it. In the Action column, click
Deploy to re-deploy the application and delete the original application.
b. Yes -> Step 5.
5. Check whether the application itself is abnormal.
6. On the Application Manager page, click Application Name to enter the microservice
page. Click Microservice Name to enter the Pod page. Click Pod Name to enter the
container page. Click the Container tab. Click Container Name to enter the container
details page. Select the Log tab. Check whether the application is abnormal in
accordance with the container logs. If there is no log or you cannot determine the
cause, contact ZTE technical support.
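
The image check in the blueprint step can be sketched in Python. This is a minimal illustration, assuming the content of the image repository is already available as a mapping from image name to its set of versions; the manual does not describe how to export that list, and the function name `image_exists` and the sample image names are hypothetical.

```python
from typing import Dict, Set


def image_exists(name: str, version: str, repository: Dict[str, Set[str]]) -> bool:
    """Return True if the repository lists this image name with this version."""
    # A missing image name yields an empty set, so the lookup returns False.
    return version in repository.get(name, set())


if __name__ == "__main__":
    # Hypothetical repository content, for illustration only.
    repo = {"demo-app": {"1.0.0", "1.0.1"}, "demo-db": {"2.3"}}
    print(image_exists("demo-app", "1.0.1", repo))
    print(image_exists("demo-app", "9.9.9", repo))
```

A False result corresponds to the "No" branch, where the Pod container image in the blueprint has to be modified.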

1.2 0011 Node Heartbeat lost


Alarm Information

 Alarm code: 0011


 Alarm description: Node Heartbeat lost
 Alarm level: Major
 Alarm type: Equipment Alarm

Alarm Cause

The node is faulty or the link is abnormal.



1 Equipment Alarm

Action

1. Log in to the PaaS Controller node by executing ssh ubuntu@<IP address>, where
<IP address> is the IP address of the control node.
a. Successful login -> Step 2.
b. Login failure -> Step 4.
2. Log in to the abnormal node. Assume that the node name is
default-np-5-192.173.0.57.
a. You can log in to the node by executing ssh ubuntu@192.173.0.57.
b. Successful login -> Step 3.
c. Login failure -> Step 4.
3. Restart the heartbeat handshaking component.
a. Switch to the root user: sudo su.
b. Execute the service heartbeat restart command. Wait for 5 minutes, and check
whether the alarm is cleared.
c. Yes -> End.
d. No -> Step 4.
4. Restart the abnormal node.
a. In a non-preset scenario, select Resources > Compute > Nodes. In the search
box, enter the IP address from the alarm object name (192.173.0.57 in this
example) to find the abnormal node. Click the restart button of the node.
b. In a preset scenario, switch to the root user (sudo su) and execute reboot
to restart the abnormal node.
5. Check whether the alarm is cleared.
a. Yes -> End.
b. No -> Contact ZTE technical support.
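
Step 2 relies on the node name ending with the IP address of the node (default-np-5-192.173.0.57 in the example). The following Python sketch shows that derivation. It assumes that every node name ends with a hyphen followed by an IPv4 address, which is inferred only from the single example above; the helper names `node_ip_from_name` and `ssh_command` are hypothetical.

```python
import ipaddress


def node_ip_from_name(node_name: str) -> str:
    """Return the IPv4 address that follows the last hyphen of a node name.

    Assumes names look like 'default-np-5-192.173.0.57'. Raises ValueError
    if the trailing part is not a valid IPv4 address.
    """
    candidate = node_name.rsplit("-", 1)[-1]
    # IPv4Address validates the dotted-quad form and raises a ValueError subclass otherwise.
    return str(ipaddress.IPv4Address(candidate))


def ssh_command(node_name: str, user: str = "ubuntu") -> str:
    """Build the ssh command used in Step 2 for the given node name."""
    return f"ssh {user}@{node_ip_from_name(node_name)}"


if __name__ == "__main__":
    print(ssh_command("default-np-5-192.173.0.57"))
```

Validating the address before building the command avoids attempting a login against a malformed host when the node name does not follow the assumed pattern.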

1.3 0051 Hardware performance threshold exceeded


Alarm Information

 Alarm code: 0051


 Alarm description: Hardware performance threshold exceeded
 Alarm level: Warn
 Alarm type: Equipment Alarm


Alarm Cause

The hardware performance indicator exceeds the threshold within the specified
inspection period.

Action

1. Determine the affected hardware performance indicator from the alarm details.
2. Open the hardware operation and maintenance web interface related to this
hardware performance indicator and handle the problem according to the prompt information.
3. If the problem persists, contact the administrator. For other problems, contact
technical support.

1.4 0069 Hardware Alarm


Alarm Information

 Alarm code: 0069


 Alarm description: Hardware Alarm
 Alarm level: Undefined
 Alarm type: Equipment Alarm

Alarm Cause

A hardware alarm occurred. Check the Detail Information field.

Action

Check the running status of the hardware equipment according to the alarm location
information. If there is no log or the cause cannot be determined, contact ZTE technical
support.

1.5 3003 Component Instance State Exception


Alarm Information

 Alarm code: 3003


 Alarm description: Component Instance State Exception
 Alarm level: Major
 Alarm type: Equipment Alarm

Alarm Cause

Component instance is running abnormally


Component instance is not running


Unable to get component instance running status

Action

If the alarm is not cleared after more than 10 minutes, contact ZTE technical support.

1.6 7004 NFS Shared Volume Multi-Mounted In The Cluster


Alarm Information

 Alarm code: 7004


 Alarm description: NFS Shared Volume Multi-Mounted In The Cluster
 Alarm level: Major
 Alarm type: Equipment Alarm

Alarm Cause

The NFS shared volume is mounted on multiple computers concurrently.

Action

Contact ZTE technical support.

1.7 7005 Read-Only NFS Shared Volume


Alarm Information

 Alarm code: 7005


 Alarm description: Read-Only NFS Shared Volume
 Alarm level: Major
 Alarm type: Equipment Alarm

Alarm Cause

The NFS shared volume is mounted in read-only mode.

Action

Contact ZTE technical support.

1.8 9031 One Port of the Bond Group Fault


Alarm Information

 Alarm code: 9031


 Alarm description: One Port of the Bond Group Fault


 Alarm level: Major
 Alarm type: Equipment Alarm

Alarm Cause

The Ethernet interface of the bond group is faulty.

Action

Check the specific cause of the 9031 alarm, and then repair the faulty Ethernet
interface manually.

1.9 9032 All Ports of the Bond Group Fault


Alarm Information

 Alarm code: 9032


 Alarm description: All Ports of the Bond Group Fault
 Alarm level: Serious
 Alarm type: Equipment Alarm

Alarm Cause

All of the Ethernet interfaces of the bond group are faulty.

Action

Check the specific cause of the 9032 alarm, and then repair the faulty Ethernet
interfaces manually.

1.10 9033 OVS Service Fault


Alarm Information

 Alarm code: 9033


 Alarm description: OVS Service Fault
 Alarm level: Major
 Alarm type: Equipment Alarm

Alarm Cause

The openvswitch service is faulty, or the ovsdb-server and ovs-vswitchd processes are
faulty.


Action

Contact ZTE technical support.

1.11 9321 Platform pg database is unusable


Alarm Information

 Alarm code: 9321


 Alarm description: Platform pg database is unusable
 Alarm level: Serious
 Alarm type: Equipment Alarm

Alarm Cause

The Postgresql_ip resource is lost. Possible causes:


 The Postgresql_ip resource fails.
 Pg has no Master.
 The Pg Master/Slave roles have changed.

Action

If the alarm is not cleared after more than 10 minutes, contact ZTE technical support to
repair the pg database failure.

1.12 9322 Plateform pg node instance is abnormal


Alarm Information

 Alarm code: 9322


 Alarm description: Plateform pg node instance is abnormal
 Alarm level: Minor
 Alarm type: Equipment Alarm

Alarm Cause

The platform pg node instance is abnormal. Possible causes:


 The Pg instance has errors.
 The Pg instance was stopped manually.

Action

If the alarm is not cleared after more than 20 minutes, contact ZTE technical support to
repair the pg database failure.


1.13 9323 pacemaker cluster heartbeat is abnormal


Alarm Information

 Alarm code: 9323


 Alarm description: pacemaker cluster heartbeat is abnormal
 Alarm level: Major
 Alarm type: Equipment Alarm

Alarm Cause

A pacemaker cluster node is lost. Possible causes:


 The corosync process was killed.
 The IP address that the corosync process listens on is inconsistent with the configured corosync IP address.
 The pacemaker cluster node IP address is unreachable.
 The pacemaker cluster node interface MTU is less than 1500.
 The pacemaker cluster nodes have network instability problems.

Action

If the alarm is not cleared after more than 20 minutes, contact ZTE technical support to
repair the pacemaker cluster node failure.
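
One of the listed causes, an interface MTU below 1500, can be checked before contacting support. The following Python sketch assumes a Linux node, where the kernel exposes the MTU of each interface in /sys/class/net/<interface>/mtu. The function names and the sample interface names are hypothetical and not part of the UME product.

```python
from pathlib import Path
from typing import Dict, List

MIN_MTU = 1500  # the alarm cause lists an MTU below 1500 as a problem


def read_mtu(interface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the MTU of a network interface from Linux sysfs."""
    return int(Path(sysfs_root, interface, "mtu").read_text().strip())


def interfaces_below_min_mtu(mtus: Dict[str, int], minimum: int = MIN_MTU) -> List[str]:
    """Return the sorted names of interfaces whose MTU is below the minimum."""
    return sorted(name for name, mtu in mtus.items() if mtu < minimum)


if __name__ == "__main__":
    # Sample values for illustration; on a real node, fill the mapping with read_mtu().
    sample = {"eth0": 1500, "eth1": 1400, "bond0": 9000}
    print(interfaces_below_min_mtu(sample))
```

Separating the sysfs read from the comparison keeps the comparison usable with MTU values collected by any other means.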



Chapter 2
QoS Alarm
Table of Contents
0002 Node CPU usage too high................................................................................... 2-2
0003 Too High Memory Usage of the Node..................................................................2-3
0004 Too High CPU Usage of the Component............................................................. 2-4
0005 Too High Memory Usage of the Component........................................................ 2-5
0009 Node disk usage too high.................................................................................... 2-5
0015 Minion Cpu Allocation Ratio Too High..................................................................2-6
0017 Minion Memory Allocation Ratio Too High........................................................... 2-7
0018 Minion Filesystem Usage Rate Too High............................................................. 2-7
0019 Node disk partition usage too high.......................................................................2-8
0030 The time offset from the NTP server is too large................................................. 2-9
0031 Too High CPU Iowait Usage of the Node...........................................................2-10
0032 Too High CPU Usage of Steal........................................................................... 2-11
0034 Too High PID Usage of the Node.......................................................................2-12
0035 Too High System Load per Minute.....................................................................2-13
0036 Node can not synchronize time with NTP server............................................... 2-14
0037 Node Network Rx Rate Too High....................................................................... 2-15
0038 Node Network Tx Rate Too High....................................................................... 2-17
0039 NTP service exit................................................................................................. 2-18
0050 Business performance threshold exceeded....................................................... 2-18
0052 Middleware performance threshold exceeded....................................................2-19
0053 Framework performance threshold exceeded.................................................... 2-19
0054 self-maintain performance threshold exceeded..................................................2-20
1513 Performance threshold alarm............................................................................. 2-20
3001 Abnormal Minion Status..................................................................................... 2-21
3002 Cluster status abnormal......................................................................................2-22
4002 Insufficient Tenant Quota.................................................................................... 2-23
5001 Certificate Will Expire Soon................................................................................2-24
5002 Certificate Expired.............................................................................................. 2-24
5101 Create Project Failed..........................................................................................2-24


5102 Delete Project Failed.......................................................................................... 2-25


7001 Storage cluster status abnormal.........................................................................2-26
7002 Cluster Capacity Usage Exceeded the Threshold..............................................2-27
7003 Volume Capacity Usage Exceeded the Threshold............................................. 2-27
8501 NBM Initialization Failed..................................................................................... 2-28
9101 PostgreSQL database cluster unavailable......................................................... 2-28
9102 PostgreSQL database cluster contains unavailable nodes................................ 2-29
9103 PostgreSQL database cluster replication interrupts or produces brain-split....... 2-30
9104 PostgreSQL database master and standby cluster replication interruption........ 2-31
9105 PostgreSQL database failed to archive log file.................................................. 2-32
9141 Index is damaged in common service Elasticsearch..........................................2-33

2.1 0002 Node CPU usage too high


Alarm Information

 Alarm code: 0002


 Alarm description:
This alarm is raised when the CPU of the node is overloaded.
As the CPU usage increases, the alarm level is dynamically adjusted:
→ When the usage reaches 75%, a warning alarm is raised.
→ When the usage reaches 80%, a minor alarm is raised.
→ When the usage reaches 85%, a major alarm is raised.
→ When the usage reaches 90%, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The CPU usage of the node exceeds the QoS threshold.

Action

1. Check the CPU usage trend of the node.


a. Select Monitor > Alarm > Current Alarm. Click the alarm name to enter the alarm
details page.
b. In the Detail Information box, click Go to check on the right of Performance Data.
c. Click the History Performance tab. Select a time period and view the trend graph
of CPU Usage Rate.



2 QoS Alarm

2. Check whether the CPU usage trend of the node is consistent with the service status
based on the analysis of the on-site services.
 Yes -> Step 3.
 No -> Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
 Yes -> Step 4.
 No -> Solve the problems with the service.
4. Determine whether to increase the CPU usage threshold in accordance with the
service status.
 If the traffic increases sharply for a short time, modification is not recommended.
When the traffic decreases to the normal level, the alarm is cleared automatically.
 If the traffic increases over a long period, it is suggested to adjust the CPU usage QoS
threshold and go to Step 5.
5. Select Settings > Alarm management. Click the QoS Manage tab and expand the
node. Click the Modify button in the CPU Usage Rate row and then modify the
thresholds at all levels.
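
The relationship between CPU usage and alarm level given in the alarm description can be sketched in Python as follows. The default values 75, 80, 85 and 90 are the ones stated in the description; because the thresholds can be modified on the QoS Manage tab, they are passed as a parameter. The function name `alarm_level` is hypothetical, and returning None below the lowest threshold is an assumption about how "no alarm" is represented.

```python
from typing import Optional, Sequence, Tuple

# Default thresholds from the alarm description, highest level first.
DEFAULT_THRESHOLDS: Tuple[Tuple[float, str], ...] = (
    (90.0, "critical"),
    (85.0, "major"),
    (80.0, "minor"),
    (75.0, "warning"),
)


def alarm_level(usage_percent: float,
                thresholds: Sequence[Tuple[float, str]] = DEFAULT_THRESHOLDS) -> Optional[str]:
    """Return the alarm level for a usage percentage, or None if no threshold is reached."""
    # Check the highest threshold first so that the most severe matching level wins.
    for limit, level in sorted(thresholds, reverse=True):
        if usage_percent >= limit:
            return level
    return None


if __name__ == "__main__":
    for usage in (60, 75, 82.5, 90):
        print(usage, alarm_level(usage))
```

Raising the thresholds in Step 5 corresponds to passing a different `thresholds` sequence, for example one in which the warning level starts at 80.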

2.2 0003 Too High Memory Usage of the Node


Alarm Information

 Alarm code: 0003


 Alarm description: The memory usage of the node is too high, and alarms at different
levels are raised dynamically.
→ When the usage reaches 75%, a warning alarm is raised.
→ When the usage reaches 80%, a minor alarm is raised.
→ When the usage reaches 85%, a major alarm is raised.
→ When the usage reaches 90%, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The memory usage of the node exceeds the QoS threshold.

Action

1. Check the memory usage trend of the node.

a. Log in to the Portaladmin web and select Monitor > Alarm > Current Alarm . Click the
alarm name to enter the alarm details page.
b. Click Go to check to enter the node information page.
c. Click the History Performance tab. Select a time period and view the trend graph
of the memory usage.
2. Check whether the memory usage trend of the node is consistent with the service
status based on the analysis of the on-site services.
 Yes → Go to step 3.
 No → Contact ZTE technical support.
3. Check whether the service status is normal.
 Yes → Go to step 4.
 No → Solve the problems with the service.
4. Determine whether to increase the memory usage threshold in accordance with the
service status.
 If the traffic increases sharply for a short time, do not modify the threshold.
When the traffic returns to its normal level, the alarm is cleared automatically.
 If the traffic has been increasing over a long period, adjust the memory usage
QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node . Click the Modify button in the Memory Usage Rate line and then modify the
thresholds at all levels.
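
Step 4 distinguishes a short traffic spike from long-term growth. One hypothetical way to make that judgement from history samples is to check how much of the observation window was spent at or above the warning threshold. The names `is_sustained` and `min_fraction`, and the 80% cut-off, are assumptions chosen for illustration; this document does not define a numeric criterion.

```python
def is_sustained(samples, threshold=75.0, min_fraction=0.8):
    """Return True if at least min_fraction of the samples reach the threshold.

    samples: usage percentages taken at regular intervals over the period
    selected on the History Performance tab (hypothetical export).
    """
    if not samples:
        return False
    above = sum(1 for value in samples if value >= threshold)
    return above / len(samples) >= min_fraction


spike = [40, 42, 41, 93, 95, 43, 40, 41]     # brief burst, then back to normal
growth = [76, 78, 79, 81, 80, 82, 84, 83]    # consistently high
print(is_sustained(spike), is_sustained(growth))    # False True
```

A sustained result would suggest the second branch of Step 4 (adjust the threshold); a spike suggests waiting for the alarm to clear.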

2.3 0004 Too High CPU Usage of the Component


Alarm Information

 Alarm code: 0004


 Alarm description:
The CPU usage of the component is too high, and alarms at different levels are
raised dynamically.
→ When the usage reaches 75%, a warning alarm is raised.
→ When the usage reaches 80%, a minor alarm is raised.
→ When the usage reaches 85%, a major alarm is raised.
→ When the usage reaches 90%, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The CPU usage of the component exceeds the QoS threshold.

Action

Contact ZTE technical support.

2.4 0005 Too High Memory Usage of the Component


Alarm Information

 Alarm code: 0005


 Alarm description:
The memory usage of the component is too high, and alarms at different levels are
raised dynamically.
→ When the usage reaches 75%, a warning alarm is raised.
→ When the usage reaches 80%, a minor alarm is raised.
→ When the usage reaches 85%, a major alarm is raised.
→ When the usage reaches 90%, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The memory usage of the component exceeds the QoS threshold.

Action

Contact ZTE technical support.

2.5 0009 Node disk usage too high


Alarm Information

 Alarm code: 0009


 Alarm description:
The disk usage of the node is too high, and alarms at different levels are raised
dynamically.
→ When the usage reaches 75%, a warning alarm is raised.
→ When the usage reaches 80%, a minor alarm is raised.
→ When the usage reaches 85%, a major alarm is raised.
→ When the usage reaches 90%, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The disk usage of the node exceeds the QoS threshold.

Action

1. Check the disk usage trend of the node.


a. Select Monitor > Alarm > Current Alarm . Click the alarm name to enter the alarm
details page.
b. In the Detail Information box, click Go to check on the right of Performance Data .
c. Click the History Performance tab. Select a time period and view the trend graph
of Disk Usage Rate .
2. Check whether the disk usage trend of the node is consistent with the service status
based on the analysis of the on-site services.
 Yes → Go to step 3.
 No → Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
 Yes → Go to step 4.
 No → Solve the problems with the service.
4. Determine whether to increase the disk usage threshold in accordance with the
service status.
 If the traffic increases sharply for a short time, do not modify the threshold.
When the traffic returns to its normal level, the alarm is cleared automatically.
 If the traffic has been increasing over a long period, adjust the disk usage QoS
threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node . Click the Modify button in the Disk Usage Rate line and then modify the
thresholds at all levels.

2.6 0015 Minion Cpu Allocation Ratio Too High


Alarm Information

 Alarm code: 0015


 Alarm description:
This alarm is raised when the CPU of the node is overloaded.
As the CPU usage increases, the alarm level is dynamically adjusted:
→ When the usage reaches 75%, a warning alarm is raised.
→ When the usage reaches 80%, a minor alarm is raised.
→ When the usage reaches 85%, a major alarm is raised.
→ When the usage reaches 90%, a critical alarm is raised.
 Alarm level: Warning
 Alarm type: QoS Alarm

Alarm Cause

 The CPU Limit values of the applications are set too high.
 The Kubernetes cluster does not have enough minion nodes for the deployed
applications.

Action

1. Adjust the CPU Limit value of the application downward.
2. Scale out the Kubernetes cluster by increasing the number of minion nodes.
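
Both actions reduce the same quantity. The sketch below assumes that the allocation ratio is the sum of the application CPU Limit values divided by the total CPU capacity of the minion nodes; this definition, the function name `allocation_ratio`, and the example values are assumptions for illustration, since this document does not state the formula.

```python
def allocation_ratio(cpu_limits, node_capacities):
    """Percentage of total minion CPU capacity claimed by application CPU limits."""
    total_capacity = sum(node_capacities)
    if total_capacity <= 0:
        raise ValueError("total capacity must be positive")
    return 100.0 * sum(cpu_limits) / total_capacity


limits = [4, 4, 3, 3]    # CPU Limit per application, in cores (example values)
print(allocation_ratio(limits, [8, 8]))           # 87.5 with two minion nodes
print(allocation_ratio([3, 3, 3, 3], [8, 8]))     # 75.0 after lowering the limits
print(allocation_ratio(limits, [8, 8, 8]))        # lower after adding a minion node
```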

2.7 0017 Minion Memory Allocation Ratio Too High


Alarm Information

 Alarm code: 0017


 Alarm description: Minion Memory Allocation Ratio Too High
 Alarm level: Warning
 Alarm type: QoS Alarm

Alarm Cause

 The Memory Limit value of the application is configured unreasonably.


 There are too many deployed applications.

Action

1. Adjust the Memory Limit value of the application downward.
2. Scale out the Kubernetes cluster by increasing the number of minion nodes.

2.8 0018 Minion Filesystem Usage Rate Too High


Alarm Information

 Alarm code: 0018


 Alarm description: Minion Filesystem Usage Rate Too High
 Alarm level: Warning
 Alarm type: QoS Alarm

Alarm Cause

The file system usage of the node exceeds the QoS threshold.

Action

Clean up the disk until the file system usage falls below 90%.
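
To confirm the result of a cleanup on the node itself, the usage can be computed from the file system statistics. This is a minimal sketch using the Python standard library; the helper names are illustrative, and the value may differ slightly from what the product or `df` reports because reserved blocks can be counted differently.

```python
import shutil


def usage_percent(total, used):
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def filesystem_usage(path="/"):
    """Usage percentage of the file system that contains path."""
    stats = shutil.disk_usage(path)    # named tuple: total, used, free (bytes)
    return usage_percent(stats.total, stats.used)


if __name__ == "__main__":
    pct = filesystem_usage("/")
    print(f"/ is {pct:.1f}% full; below 90%: {pct < 90}")
```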

2.9 0019 Node disk partition usage too high


Alarm Information

 Alarm code: 0019


 Alarm description:
The disk partition usage of the node is too high, and alarms at different levels are
raised dynamically.
→ When the usage reaches 75%, a warning alarm is raised.
→ When the usage reaches 80%, a minor alarm is raised.
→ When the usage reaches 85%, a major alarm is raised.
→ When the usage reaches 90%, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The disk partition usage of the node exceeds the QoS threshold.

Action

1. Check the disk partition usage trend of the node.


a. Select "Monitor > Alarm > Current Alarm". Click the alarm name to enter the
alarm details page.
b. In the Detail Information box, click "Go to check" on the right of "Performance
Data".
2. Check whether the disk partition usage trend of the node is consistent with the
service status based on the analysis of the on-site services.
Yes → Go to step 3.
No → Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
Yes → Go to step 4.
No → Solve the problems with the service.

4. Determine whether to increase the disk partition usage threshold in accordance with
the service status.
 If the traffic increases sharply for a short time, do not modify the threshold.
When the traffic returns to its normal level, the alarm is cleared automatically.
 If the traffic has been increasing over a long period, adjust the disk partition
usage QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
disk partition . Click the Modify button in the Disk Partition Usage Rate line and then
modify the thresholds at all levels.

2.10 0030 The time offset from the NTP server is too large


Alarm Information

 Alarm code: 0030


 Alarm description: The time offset between the node and the NTP server exceeds the
threshold (600 seconds by default; configurable).
 Alarm level: Warning
 Alarm type: QoS Alarm

Alarm Cause

 The NTP server is abnormal.


 The NTP service of the node is abnormal.

Action

1. Contact the administrator to confirm that the NTP server is normal.


a. View the NTP server address of the node.
Log in to the alarm node through SSH and use the following command to view the NTP
server of the node.
[root@10-62-49-161:/home/ssf/cloudframe-projects/monitor-master]$ cat /etc/ntp.conf |grep "^server" |grep -v "127.127.1.0"
server 10.30.1.105 minpoll 3 maxpoll 4
# In this example, 10.30.1.105 is the NTP server address.
b. Contact the administrator to confirm that the NTP server service is normal.
2. If the NTP server is abnormal:
a. Ask the administrator or related technical support personnel to solve the problem
of the NTP server.
b. Start the NTP service of the node with the command “systemctl start ntpd.service”.
c. Use the command “systemctl status ntpd.service” to check whether the NTP
service of the node is started successfully (the status is active).
3. If the NTP server is normal, the NTP service of the node may be abnormal. Contact
technical support.
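
When several nodes have to be checked, the server lookup from Step 1 can be scripted. The following Python sketch mirrors the `grep "^server" | grep -v "127.127.1.0"` filter on the text of an ntp.conf file; the function name `ntp_servers` and the sample text are illustrative assumptions.

```python
def ntp_servers(conf_text):
    """Return server addresses from ntp.conf text, skipping the local clock.

    Mirrors: grep "^server" | grep -v "127.127.1.0"
    """
    servers = []
    for line in conf_text.splitlines():
        if not line.startswith("server"):
            continue                      # comments and other directives
        if "127.127.1.0" in line:
            continue                      # local clock pseudo-address
        fields = line.split()
        if len(fields) >= 2:
            servers.append(fields[1])
    return servers


sample = """# sample configuration
server 127.127.1.0
server 10.30.1.105 minpoll 3 maxpoll 4
"""
print(ntp_servers(sample))    # ['10.30.1.105']
```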

2.11 0031 Too High CPU Iowait Usage of the Node


Alarm Information

 Alarm code: 0031


 Alarm description:
The CPU Iowait usage of the node is too high, and alarms at different levels are
raised dynamically.
→ When the usage reaches 30%, a warning alarm is raised.
→ When the usage reaches 40%, a minor alarm is raised.
→ When the usage reaches 50%, a major alarm is raised.
→ When the usage reaches 60%, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The CPU Iowait usage of the node exceeds the QoS threshold.

Action

1. Check the CPU Iowait usage trend.


a. Select Monitor > Alarm > Current Alarm . Click the alarm name to enter the alarm
details page.
b. In the Detail Information box, click Go to check on the right of Performance
Data .
c. Click the History Performance tab. Select a time period and view the trend graph
of CPU Iowait Usage Rate .
2. Check whether the CPU Iowait Usage Rate of node is consistent with the service
status based on the analysis of the on-site services.
 Yes -> Step 3.
 No -> Contact ZTE technical support.

3. Ask the project administrator of the service to check whether the service status is
normal.
 Yes -> Step 4.
 No -> Solve the problems with the service.
4. Determine whether to increase the CPU Iowait Usage threshold in accordance with
the service status.
 If the traffic increases sharply for a short time, do not modify the threshold.
When the traffic returns to its normal level, the alarm is cleared automatically.
 If the traffic has been increasing over a long period, adjust the CPU Iowait usage
QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node . Click the Modify button in the CPU Iowait Usage Rate line and then modify the
thresholds at all levels.
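
For a cross-check on a Linux node, the iowait share can be derived from two readings of the aggregate `cpu` line in /proc/stat. The sketch below is illustrative: the function names and sample counters are assumptions, and the product may sample or average the counter differently.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def share_percent(prev, curr, index):
    """Percentage of CPU time spent in one state between two samples.

    Column order in /proc/stat: user, nice, system, idle, iowait, irq,
    softirq, steal, ... so index 4 is iowait and index 7 is steal.
    """
    deltas = [c - p for p, c in zip(prev, curr)]
    # Guest columns (8 and 9) are already included in user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[index] / total


before = parse_cpu_line("cpu 1000 0 500 8000 400 0 100 0 0 0")
after = parse_cpu_line("cpu 1100 0 550 8500 750 0 100 0 0 0")
print(share_percent(before, after, 4))    # 35.0
```

With `index=7` the same calculation gives the steal share.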

2.12 0032 Too High CPU Usage of Steal


Alarm Information

 Alarm code: 0032


 Alarm description:
The CPU usage of the Steal process of the node is too high, and alarms at different
levels are raised dynamically.
→ When the usage reaches 10%, a warning alarm is raised.
→ When the usage reaches 20%, a minor alarm is raised.
→ When the usage reaches 30%, a major alarm is raised.
→ When the usage reaches 40%, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The CPU usage of the Steal process of the node exceeds the QoS threshold.

Action

1. Check the CPU usage of the Steal process.


a. Select Monitor > Alarm > Current Alarm . Click the alarm name to enter the alarm
details page.
b. In the Detail Information box, click Go to check on the right of Performance Data .

c. Click the History Performance tab. Select a time period and view the trend graph
of CPU Steal Usage Rate .
2. Check whether the CPU usage trend of the Steal process is consistent with the
service status based on the analysis of the on-site services.
 Yes -> Step 3.
 No -> Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
 Yes -> Step 4.
 No -> Solve the problems with the service.
4. Determine whether to increase the CPU Steal Usage threshold in accordance with
the service status.
 If the traffic increases sharply for a short time, do not modify the threshold.
When the traffic returns to its normal level, the alarm is cleared automatically.
 If the traffic has been increasing over a long period, adjust the CPU Steal usage
QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node . Click the Modify button in the CPU Steal Usage Rate line and then modify the
thresholds at all levels.

2.13 0034 Too High PID Usage of the Node


Alarm Information

 Alarm code: 0034


 Alarm description:
The total number of PIDs used by the system currently is too large, and alarms at
different levels are raised dynamically.
→ When the usage reaches 75%, a warning alarm is raised.
→ When the usage reaches 80%, a minor alarm is raised.
→ When the usage reaches 85%, a major alarm is raised.
→ When the usage reaches 90%, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The system PID usage of the node exceeds the QoS threshold.

Action

1. Check the PID usage trend of the Node.


a. Select Monitor > Alarm > Current Alarm . Click the alarm name to enter the alarm
details page.
b. In the Detail Information box, click Go to check on the right of Performance Data .
c. Click the History Performance tab. Select a time period and view the trend graph
of System Pid .
2. Check whether the PID usage of the nodes is consistent with the service status
based on the analysis of the on-site services.
 Yes -> Step 3.
 No -> Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
 Yes -> Step 4.
 No -> Solve the problems with the service.
4. Determine whether to increase the system PID usage threshold in accordance with
the service status.
 If the traffic increases sharply for a short time, do not modify the threshold.
When the traffic returns to its normal level, the alarm is cleared automatically.
 If the traffic has been increasing over a long period, adjust the system PID usage
QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node . Click the Modify button in the system_pid_usage_rate line and then modify
the thresholds at all levels.
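
On a Linux node, PID usage can be estimated by comparing the number of PIDs in use with the kernel limit in /proc/sys/kernel/pid_max. The sketch below is an assumption about how such a ratio could be computed and not a description of the product's counter; it counts process directories in /proc and therefore ignores threads, which also consume PIDs.

```python
import os


def pid_usage_percent(pids_in_use, pid_max):
    """Return PIDs in use as a percentage of the kernel PID limit."""
    if pid_max <= 0:
        raise ValueError("pid_max must be positive")
    return 100.0 * pids_in_use / pid_max


def read_pid_usage(proc="/proc"):
    """Estimate PID usage on a Linux node (counts processes, not threads)."""
    with open(os.path.join(proc, "sys/kernel/pid_max")) as handle:
        pid_max = int(handle.read())
    in_use = sum(1 for name in os.listdir(proc) if name.isdigit())
    return pid_usage_percent(in_use, pid_max)


print(pid_usage_percent(24576, 32768))    # 75.0
```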

2.14 0035 Too High System Load per Minute


Alarm Information

 Alarm code: 0035


 Alarm description:
The value of counter system_load1m for node object is too high, and alarms at
different levels are raised dynamically.
→ When the load reaches 1, a warning alarm is raised.
→ When the load reaches 2, a minor alarm is raised.
→ When the load reaches 3, a major alarm is raised.
→ When the load reaches 4, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

 Existing processes hang or execute slowly.


 The node is running too many processes or applications.
 Insufficient node resources make the system unable to schedule in time.

Action

1. Check the trend of load per minute of the node.


a. Select Monitor > Alarm > Current Alarm . Click the alarm name to enter the alarm
details page.
b. In the Detail Information box, click Go to check on the right of Performance Data .
c. Click the History Performance tab. Select a time period and view the trend graph
of System_Load 1m .
2. Check whether the trend of load per minute of the node is consistent with the service
status based on the analysis of the on-site services.
 Yes -> Step 3.
 No -> Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
 Yes -> Step 4.
 No -> Solve the problems with the service.
4. Determine whether to increase the threshold of load per minute in accordance with
the service status.
 If the traffic increases sharply for a short time, do not modify the threshold.
When the traffic returns to its normal level, the alarm is cleared automatically.
 If the traffic has been increasing over a long period, adjust the QoS threshold of
load per minute and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node . Click the Modify button in the system_load1m line and then modify the
thresholds at all levels.
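
When deciding on new thresholds in Step 4, it can help to relate the 1-minute load to the number of CPU cores of the node. This is a common rule of thumb rather than guidance from this document, and the helper name `load_per_core` is an illustrative assumption.

```python
import os


def load_per_core(load_1m, cores):
    """Return the 1-minute load average divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_1m / cores


if __name__ == "__main__":
    load_1m = os.getloadavg()[0]          # 1-, 5- and 15-minute averages
    cores = os.cpu_count() or 1
    print(f"load1m={load_1m:.2f} cores={cores} per core={load_per_core(load_1m, cores):.2f}")
```

For example, a load of 4 on a 16-core node corresponds to 0.25 per core, so the default critical threshold of 4 may be low for such a node.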

2.15 0036 Node can not synchronize time with NTP server


Alarm Information

 Alarm code: 0036

 Alarm description: The node cannot synchronize time with the NTP server.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

 The local NTP service exits or is abnormal.


 The network between the node and the NTP server is unreachable.
 The NTP server is abnormal.

Action

1. On the platform O&M portal, select Monitor > Alarm > Current Alarm . If the
“NTP daemon exit” or “NTP offset high” alarm is displayed, refer to the handling
suggestions for that alarm. Otherwise, go to Step 2.
2. Check whether the NTP service of this node is normal.
a. Log in to the alarm node through SSH and switch to the root user.
b. Check whether the NTP service is running normally.
systemctl status ntpd.service
Check whether the service is active. If it is not active, contact the administrator.
3. Check whether the network between the node and the NTP server is connected.
a. Log in to the alarm node through SSH and switch to the root user.
b. View the NTP servers configured on the node.
cat /etc/ntp.conf |grep "^server" |grep -v "127.127.1.0"
server 10.30.1.105 minpoll 3 maxpoll 4
# In this example, 10.30.1.105 is the NTP server.
c. Use the ping command to check whether the network between the node and the
NTP server is connected. If there are multiple servers, ping them one by one. If
a ping fails, solve the network problem first.
ping 10.30.1.105
4. Contact the administrator to confirm that the NTP server is normal. If it is not normal,
solve the problem of the NTP server first, and then check whether the alarm is
cleared.
5. Contact ZTE technical support.
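
The order of the checks above can be summarized as a small decision function. This Python sketch only restates Steps 1 to 4; the function name `next_step`, its parameters, and the returned strings are illustrative assumptions, and the inputs are expected to come from the manual checks (portal, systemctl, ping).

```python
def next_step(current_alarms, ntpd_active, ping_ok):
    """Return the next action for alarm 0036, following Steps 1 to 4.

    current_alarms: alarm names shown under Monitor > Alarm > Current Alarm
    ntpd_active:    True if "systemctl status ntpd.service" reports active
    ping_ok:        dict mapping each NTP server address to its ping result
    """
    related = {"NTP daemon exit", "NTP offset high"}
    if related & set(current_alarms):
        return "handle the related NTP alarm first"
    if not ntpd_active:
        return "contact the administrator (NTP service is not active)"
    unreachable = sorted(server for server, ok in ping_ok.items() if not ok)
    if unreachable:
        return "solve the network problem for: " + ", ".join(unreachable)
    return "ask the administrator to check the NTP server"


print(next_step([], True, {"10.30.1.105": False}))
```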

2.16 0037 Node Network Rx Rate Too High


Alarm Information

 Alarm code: 0037


 Alarm description:

The value of rx rate for node network is too high, and alarms at different levels are
raised dynamically:
→ When the rate reaches 300000000Bps, a warning alarm is raised.
→ When the rate reaches 500000000Bps, a minor alarm is raised.
→ When the rate reaches 750000000Bps, a major alarm is raised.
→ When the rate reaches 900000000Bps, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The network rx rate of the node exceeds the QoS threshold.

Action

1. Check the trend of network rx rate of the node.


a. Select Monitor > Alarm > Current Alarm . Click the alarm name to enter the alarm
details page.
b. In the Detail Information box, click Go to check on the right of Performance Data .
c. Click the History Performance tab. Select a time period and view the trend graph
of Network Rx Rate .
2. Check whether the trend of network rx rate of the node is consistent with the service
status based on the analysis of the on-site services.
 Yes -> Step 3.
 No -> Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
 Yes -> Step 4.
 No -> Solve the problems with the service.
4. Determine whether to increase the threshold of network rx rate in accordance with
the service status.
 If the traffic increases sharply for a short time, do not modify the threshold.
When the traffic returns to its normal level, the alarm is cleared automatically.
 If the traffic has been increasing over a long period, adjust the QoS threshold of
network rx rate and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node . Click the Modify button in the Network Rx Rate line and then modify the
thresholds at all levels.
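
The thresholds are given in Bps. Assuming that Bps here means bytes per second, the sketch below converts a threshold to Gbit/s for comparison with the link speed and derives an average rate from two readings of an interface byte counter. The function names and sample values are illustrative assumptions; counter wrap-around is not handled.

```python
def rate_bps(bytes_before, bytes_after, interval_seconds):
    """Average byte rate between two readings of an interface byte counter."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return (bytes_after - bytes_before) / interval_seconds


def to_gbit_per_s(bytes_per_second):
    """Convert bytes per second to gigabits per second."""
    return bytes_per_second * 8 / 1e9


print(to_gbit_per_s(300_000_000))                    # 2.4 (warning threshold)
print(rate_bps(1_000_000_000, 4_000_000_000, 10))    # 300000000.0
```

Under this reading, the default critical threshold of 900000000 Bps corresponds to 7.2 Gbit/s, so the defaults may need adjusting to the actual link speed of the node.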

2.17 0038 Node Network Tx Rate Too High


Alarm Information

 Alarm code: 0038


 Alarm description:
The value of tx rate for node network is too high, and alarms at different levels are
raised dynamically:
→ When the rate reaches 300000000Bps, a warning alarm is raised.
→ When the rate reaches 500000000Bps, a minor alarm is raised.
→ When the rate reaches 750000000Bps, a major alarm is raised.
→ When the rate reaches 900000000Bps, a critical alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

The network tx rate of the node exceeds the QoS threshold.

Action

1. Check the trend of network tx rate of the node.


a. Select Monitor > Alarm > Current Alarm . Click the alarm name to enter the alarm
details page.
b. In the Detail Information box, click Go to check on the right of Performance Data .
c. Click the History Performance tab. Select a time period and view the trend graph
of Network Tx Rate .
2. Check whether the trend of network tx rate of the node is consistent with the service
status based on the analysis of the on-site services.
 Yes -> Step 3.
 No -> Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
 Yes -> Step 4.
 No -> Solve the problems with the service.
4. Determine whether to increase the threshold of network tx rate in accordance with
the service status.
 If the traffic increases sharply for a short time, do not modify the threshold.
When the traffic returns to its normal level, the alarm is cleared automatically.
 If the traffic has been increasing over a long period, adjust the QoS threshold of
network tx rate and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node . Click the Modify button in the Network Tx Rate line and then modify the
thresholds at all levels.

2.18 0039 NTP service exit


Alarm Information

 Alarm code: 0039


 Alarm description: The NTP service has exited.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

 The time difference between the node and the NTP server is too large.
 The NTP service is abnormal.

Action

1. Check the time difference between the node and the clock source to determine
whether it is because the time difference is too large. For the check method and
processing method, refer to the handling suggestions of the alarm “The time offset
from the NTP server is too large”.
2. If the time difference is too large, contact the administrator. If it is for other reasons,
contact technical support.

2.19 0050 Business performance threshold exceeded


Alarm Information

 Alarm code: 0050


 Alarm description: Business performance threshold exceeded
 Alarm level: Warning
 Alarm type: QoS Alarm

Alarm Cause

The business performance indicator exceeds the threshold within the specified
inspection period.

Action

1. Identify the corresponding business performance indicator from the alarm details.
2. Analyze the trend of this business performance indicator in OTCP Self Operation and
Maintenance System.
3. Handle the problem according to the prompt information in OTCP Self Operation and
Maintenance System.
4. If the problem persists, contact the administrator. For other problems, contact
technical support.

2.20 0052 Middleware performance threshold exceeded


Alarm Information

 Alarm code: 0052


 Alarm description: Middleware performance threshold exceeded
 Alarm level: Warning
 Alarm type: QoS Alarm

Alarm Cause

The middleware performance indicator exceeds the threshold within the specified
inspection period.

Action

1. Identify the corresponding middleware performance indicator from the alarm details.
2. Enter the hardware operation and maintenance web interface related to
this middleware performance indicator and handle the problem according to the prompt
information.
3. If the problem persists, contact the administrator. For other problems, contact
technical support.

2.21 0053 Framework performance threshold exceeded


Alarm Information

 Alarm code: 0053


 Alarm description: Framework performance threshold exceeded
 Alarm level: Warning
 Alarm type: QoS Alarm

Alarm Cause

The framework performance indicator exceeds the threshold within the specified
inspection period.

Action

1. Identify the corresponding framework performance indicator from the alarm details.
2. Enter the hardware operation and maintenance web interface related to
this framework performance indicator and handle the problem according to the prompt
information.
3. If the problem persists, contact the administrator. For other problems, contact
technical support.

2.22 0054 self-maintain performance threshold exceeded


Alarm Information

 Alarm code: 0054


 Alarm description: Self-maintain performance threshold exceeded
 Alarm level: Warning
 Alarm type: QoS Alarm

Alarm Cause

The self-maintain performance indicator exceeds the threshold within the specified
inspection period.

Action

1. Identify the corresponding self-maintain performance indicator from the alarm details.
2. Enter the hardware operation and maintenance web interface related to this
self-maintain performance indicator and handle the problem according to the prompt
information.
3. If the problem persists, contact the administrator. For other problems, contact
technical support.

2.23 1513 Performance threshold alarm


Alarm Information

 Alarm code: 1513


 Alarm description: Performance threshold alarm

 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

None.

Action

Handle the alarm according to the business meaning of the monitoring task.

2.24 3001 Abnormal Minion Status


Alarm Information

 Alarm code: 3001


 Alarm description: The “Minion is not ready!” alarm is raised when Kubernetes
detects that the node status is not ready. The “Minion is absent!” alarm is raised when
internal data is inconsistent.
 Alarm level: Warning
 Alarm type: QoS Alarm

Alarm Cause

 The minion node is not ready.


→ The node network is abnormal, or the network between the minion and the master is
abnormal.
→ The docker service is abnormal.
→ The application network is abnormal.
 The minion node is absent. An unknown cause has made the internal data
inconsistent.

Action

 If the alarm is “Minion is not ready”:
1. Check whether the docker process is running properly.
a. Log in to the node in SSH mode and switch to the root user.
ssh ubuntu@<IP address of the node>
sudo su
b. Run the systemctl status docker command and check whether the docker
status is active. Yes -> go to Step 2. No -> go to Step 3.
2. Check whether there is any alarm about the application network.
a. Check whether the alarm 2001 exists. If yes, solve that problem first.
b. After the alarm 2001 is cleared, check whether this alarm is cleared. Yes ->
End. No -> go to Step 3.
3. Contact the administrator.
 If the alarm is “Minion is absent”:
1. Delete the node from the cluster.
a. Obtain the UUID of the node. It is the value of the object ID in the alarm
information.
b. View the UUID of the home cluster in “Extra Params” in the “Detail
Information” box.
c. Log in to the control node, switch to the root user, and enter the command to
delete the node.
sudo su
cluster delete <cluster_uuid> node <node_uuid>
d. After the node is deleted, check whether the alarm is cleared. Yes -> End.
No -> go to Step 2.
2. Contact the administrator.

2.25 3002 Cluster status abnormal


Alarm Information

 Alarm code: 3002


 Alarm description: This alarm is raised when the cluster is unavailable.
 Alarm level: Warning
 Alarm type: QoS Alarm

Alarm Cause

 More than half of the control nodes in the cluster cannot provide services.
 There is no available working node in the cluster.
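
The two causes above can be expressed as a single condition. The following Python sketch restates them; the function name `cluster_unavailable` and its parameters are illustrative assumptions and not a product interface.

```python
def cluster_unavailable(control_nodes, failed_control_nodes, available_workers):
    """Return True if either documented cause of alarm 3002 applies.

    control_nodes:        total number of control nodes in the cluster
    failed_control_nodes: control nodes that cannot provide services
    available_workers:    working (minion) nodes that are available
    """
    more_than_half_failed = failed_control_nodes * 2 > control_nodes
    return more_than_half_failed or available_workers == 0


print(cluster_unavailable(3, 1, 2))    # False: one of three control nodes failed
print(cluster_unavailable(3, 2, 2))    # True: more than half failed
print(cluster_unavailable(3, 0, 0))    # True: no available working node
```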

Action

1. On the platform O&M portal, select Environment > Business Cluster . The Detail page
is displayed. Click the Node tab and view the nodes and roles under the cluster.
 If the minion node does not exist in the cluster, go to Step 2.
 If the minion node exists in the cluster, go to Step 3.
2. Expand the capacity of the cluster and add the minion node.
a. Click the Scale out button on the page described in Step 1, and add the minion
node.

b. Wait for the minion node to be deployed, and check whether this alarm is cleared.
Yes -> End.
3. On the platform O&M portal, select Monitor > Alarm > Current Alarm . Check whether
the “Abnormal cluster node status” alarm exists.
 Yes -> go to Step 4.
 No -> Contact ZTE technical support.
4. View the additional information of the “Abnormal cluster node status” alarm and
check whether the home cluster of the node is the cluster that raises the alarm.
 Yes -> Step 5.
 No -> Contact ZTE technical support.
5. For each node in the cluster that raises the “Abnormal cluster node status” alarm,
follow the alarm handling suggestions. Wait until the “Abnormal cluster node status”
alarm is cleared, and check whether this alarm is cleared.
 Yes -> End.
 No -> Contact ZTE technical support.

2.26 4002 Insufficient Tenant Quota


Alarm Information

 Alarm code: 4002


 Alarm description: When the tenant uses more than 90% of the space quota, this
alarm is raised.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

The remaining disk quota of the tenant is less than 10% of the total quota.

Action

1. Remove useless versions:


Select Software Repository > Image / Software Package / Component Package , and
delete useless versions in the version list.
2. If there is no useless version, ask the platform administrator to expand the disk quota
of the tenant.
On the platform O&M portal, select Project Management > Project . Select
Modify Quota in the Action column of the corresponding project. The Modify
Quota dialog box is displayed. Modify the Image storage space .

2.27 5001 Certificate Will Expire Soon


Alarm Information

 Alarm code: 5001
 Alarm description: This alarm is raised when the user’s certificate file is about to
expire.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

The certificate will expire soon.

Action

Select Settings > Cert Manager and locate the certificate file indicated in the alarm
information. Click the Update button and update the certificate according to the
prompts on the page.
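
To see how much time is left on a certificate, its notAfter time can be compared with the current time. The sketch below uses the Python standard library; the 30-day window and the function names are assumptions for illustration, because this document does not state how long before expiry the alarm is raised.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time.

    not_after uses the format printed by OpenSSL,
    for example "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def expires_soon(not_after, warn_days=30, now=None):
    """True if the certificate expires within warn_days (hypothetical window)."""
    return days_until_expiry(not_after, now) <= warn_days
```

The notAfter string can be obtained, for example, with `openssl x509 -noout -enddate -in <certificate file>`, which prints it after `notAfter=`.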

2.28 5002 Certificate Expired


Alarm Information

 Alarm code: 5002


 Alarm description: This alarm is raised when the user’s certificate file has expired.
 Alarm level: Critical
 Alarm type: QoS Alarm

Alarm Cause

certificate expired.

Action

1. Select Settings -> Cert Manager, and locate the expired certificate file indicated in
the alarm information. Click Update, and update the certificate as prompted on the page.
2. If you cannot resolve the problem, contact ZTE technical support.

2.29 5101 Create Project Failed


Alarm Information

 Alarm code: 5101
 Alarm description: This alarm is raised when a project fails to be created.
 Alarm level: Critical


 Alarm type: QoS Alarm

Alarm Cause

The resource quota application failed during project creation.

Action

1. Select Projects & Users -> Projects, check the cause of the project creation failure,
and click Retry to create the project again.
2. If you cannot resolve the problem, contact ZTE technical support.

2.30 5102 Delete Project Failed


Alarm Information

 Alarm code: 5102


 Alarm description: This alarm is raised when a project fails to be deleted.
 Alarm level: Critical
 Alarm type: QoS Alarm

Alarm Cause

 One or more nodes of the storage cluster are down.


 The storage volume is abnormal.
 The used amount of the cluster capacity is abnormal.
 The cluster network is abnormal.

Action

1. On the platform O&M portal, select Resources > Storage . Click Built-in storage
cluster , and check whether the Status column is healthy .
 Yes -> Contact ZTE technical support.
 No -> Step 2.
2. Wait five minutes and check whether the storage cluster heals itself. After five
minutes, check whether the Status column is healthy .
 Yes -> End.
 No -> Step 3.
3. If the Status column of a cluster is unhealthy , click the cluster name to enter the
storage cluster details page. Observe the Node information list.
 If the Status column of each node is normal, contact ZTE technical support.


 If the Status column of a node is abnormal, go to Step 4.


4. If the Status column of a node is abnormal, click the Restart button in the Action
column of the node to manually complete the forced recovery of the storage cluster.
Wait five minutes and observe the Status column of the node. If it is normal and the
Status column of the corresponding storage cluster is healthy , the alarm handling is
completed. Otherwise, contact ZTE technical support.

2.31 7001 Storage cluster status abnormal


Alarm Information

 Alarm code: 7001


 Alarm description: When the system detects that the cluster status is not healthy, this
alarm is raised.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

 One or more GlusterFS nodes break down.


 Volume status is abnormal.
 Usage of volume is abnormal.
 Cluster network is abnormal.

Action

1. On the platform O&M portal, select Resources > Storage . Click Built-in storage
cluster , and check whether the Status column is healthy .
 Yes -> Contact ZTE technical support.
 No -> Step 2.
2. Wait five minutes and check whether the storage cluster heals itself. After five
minutes, check whether the Status column is healthy .
 Yes -> End.
 No -> Step 3.
3. If the Status column of a cluster is unhealthy , click the cluster name to enter the
storage cluster details page. Observe the Node information list.
 If the Status column of each node is normal, contact ZTE technical support.
 If the Status column of a node is abnormal, go to Step 4.
4. If the Status column of a node is abnormal, click the Restart button in the Action
column of the node to manually complete the forced recovery of the storage cluster.


Wait five minutes and observe the Status column of the node. If it is normal and the
Status column of the corresponding storage cluster is healthy , the alarm handling is
completed. Otherwise, contact ZTE technical support.
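The decision flow in the steps above (after the five-minute wait) can be sketched as a small function. The status values "healthy" and "normal" are taken from the steps. The function name, the input dictionary, and the returned strings are illustrative assumptions, not a portal API.

```python
from typing import Dict


def storage_cluster_next_step(cluster_status: str,
                              node_status: Dict[str, str]) -> str:
    """Suggest the next action for a storage cluster alarm.

    node_status maps a node name to the value of its Status column.
    """
    if cluster_status == "healthy":
        # The alarm is active although the cluster looks healthy.
        return "contact support"
    abnormal = sorted(name for name, status in node_status.items()
                      if status != "normal")
    if not abnormal:
        # Cluster is unhealthy but every node reports normal.
        return "contact support"
    return "restart " + ", ".join(abnormal)
```

After a restart, the same check can be repeated once the five-minute wait has passed. If the result is still not healthy, the procedure ends with contacting technical support.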

2.32 7002 Cluster Capacity Usage Exceeded the Threshold


Alarm Information

 Alarm code: 7002


 Alarm description: When the used capacity of the cluster exceeds 80% of the total
capacity, this alarm is raised.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

The used capacity of the cluster exceeds 80% of the total capacity.

Action

Check whether there is useless data in the storage cluster.


 If yes, delete the useless data to save the cluster space.
 If no, ask the platform administrator to add a storage device to expand the storage
cluster capacity.
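To judge whether deleting useless data is enough, it can help to estimate how much space must be released to fall back to the 80% threshold. The following is a minimal sketch of that arithmetic. The function name is an assumption, and both inputs only need to share the same unit.

```python
def space_to_free(used: float, total: float,
                  threshold_percent: float = 80.0) -> float:
    """Return how much data must be removed to reach the threshold.

    The 80% default comes from the alarm description. The result is in
    the same unit as the inputs and is 0.0 when usage is already at or
    below the threshold.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    allowed = total * threshold_percent / 100.0
    return max(0.0, used - allowed)
```

For example, a 10 TB cluster with 9 TB in use needs 1 TB released to return to 80%. If less than that can be deleted, capacity expansion is the remaining option.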

2.33 7003 Volume Capacity Usage Exceeded the Threshold


Alarm Information

 Alarm code: 7003


 Alarm description: When the used capacity of the volume exceeds 80% of the total
capacity, this alarm is raised.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

The used capacity of the volume exceeds 80% of the total capacity.

Action

Check whether there is useless data in the storage volume.


 If yes, delete the useless data to save the volume space.


 If no, ask the platform administrator to add a storage device to expand the storage
volume capacity.

2.34 8501 NBM Initialization Failed


Alarm Information

 Alarm code: 8501


 Alarm description: The rabbitmq service is abnormal.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

The NBM service fails.

Action

Contact ZTE technical support to check whether the rabbitmq service is normal.

2.35 9101 PostgreSQL database cluster unavailable


Alarm Information

 Alarm code: 9101


 Alarm description: This alarm is raised when the PostgreSQL database cluster is
unavailable because no master node can be elected.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

The nodes whose PG status is LATEST or SYNC fail to start. The PGs in other states can
start normally but have no right to be promoted, so the cluster cannot elect a master
node.

Action

First, check the startup log of the PG whose status is LATEST or SYNC, or start the PG
manually through the PSQL client, to find the cause of the startup failure.
If that PG cannot start at all, the only option is to start a PG whose status is not
LATEST or SYNC. Such a PG may hold less data than the PG whose status is LATEST or
SYNC, so forcing it to start may result in a small amount of data loss.
For other information, contact ZTE technical support.

2.36 9102 PostgreSQL database cluster contains unavailable nodes


Alarm Information

 Alarm code: 9102


 Alarm description: This alarm is raised when the PostgreSQL database cluster
detector detects DISCONNECT nodes in the cluster.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

Nodes in the database cluster can not be used normally.

Action

1. Run the crm status command to check the status of the cluster. If a master node exists
and the standby node also starts normally, but stream replication is abnormal:
Select Self-Management Entry -> pg-mng to enter the PG Manager page, select the
faulty node, and pull the data from the master node manually.
2. If a master node exists but the standby node fails to start:
a. Check the log of the failed PG, or start the PG manually through the PSQL
client, to find the cause of the failure.
b. If the standby node cannot start at all, select Self-Management Entry -> pg-mng
to enter the PG Manager page, select the faulty node, and pull the data manually.
3. If no master node exists and the PG whose status is LATEST or SYNC fails to start:
a. Check the startup log of the PG whose status is LATEST or SYNC, or start the PG
manually through the PSQL client, to find the cause of the startup failure.
b. If the PG cannot start at all, restore the database if a backup is available. If
restoration is not an option, the only choice is to start a PG whose status is not
LATEST or SYNC.
c. A PG whose status is not LATEST or SYNC may hold less data than the PG whose
status is LATEST or SYNC, so forcing it to start may result in a small amount of
data loss.
Contact ZTE technical support to check whether the service is normal.

2.37 9103 PostgreSQL database cluster replication interrupts or produces brain-split


Alarm Information

 Alarm code: 9103


 Alarm description: This alarm is raised when the PostgreSQL database cluster
detector detects a cluster replication interruption or brain-split.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

The master node failed to be promoted, and there was no master node providing
external services.

Action

On the PaaS operations and maintenance interface, select Monitoring -> Alarm -> Details.
In the Details item on the page, check the specific reason for the 9103 alarm.
 If the reason is shown as "Need People Repair, Streaming is break, You can use
portal to pull full data from Master.", stream replication is broken and the data needs
to be pulled manually through the management interface.
Select Self-management Entry -> pg-mng to enter the PG Manager page, select the
faulty node, and pull the data manually.
 If the reason is shown as "it is possible to have a split brain, keep ban status.", the
timelines of the original master node and the new master node are the same. This may
be a brain-split case, which requires manual comparison and merging of data.
 If the reason is shown as "It is possible to have a split brain, keep ban status. Marks
Maybe data loss or database abnormality, need Repair, keep ban status.", the
difference between the master and standby node data is greater than the preset value,
and the data must be compared and merged manually.
For other information, contact ZTE technical support.

2.38 9104 PostgreSQL database master and standby cluster replication interruption


Alarm Information

 Alarm code: 9104


 Alarm description: In a disaster recovery scenario, this alarm is raised when
the stream replication between the master and standby nodes of the PostgreSQL
database is interrupted.
 Alarm level: Major
 Alarm type: QoS Alarm

Alarm Cause

The stream replication between the disaster recovery cluster and the master cluster is
interrupted, and disaster recovery fails.

Action

 Enter any PostgreSQL server container in the master cluster.
1. Check whether the status of the master cluster is primary:
a. Run crm_attribute -n cluster_mode -G -q.
b. If the status is not primary, there is no primary cluster. Confirm whether the
master cluster needs to be promoted.
2. Check whether the master cluster has a master node:
a. Run crm status.
b. If there is no master node in the master cluster, see the 9101 alarm.
 Enter any PostgreSQL server container in the slave cluster.
1. Check whether stream replication is paused:
a. Run crm_attribute -n cluster_rep_status -G -q.
b. If stream replication is paused, turn it on.
2. Check whether the standby cluster is configured with the IP address and port of the
primary cluster:
a. Run crm_attribute -n master_cluster_ip -G -q.
b. Run crm_attribute -n master_cluster_port -G -q.
c. If not, perform additional configuration to reset the IP address and port of
the primary cluster.
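The attribute queries above all follow the same crm_attribute pattern, so they can be collected with a small helper when they need to be run repeatedly. This is a sketch only. It assumes that it runs inside a PostgreSQL server container where crm_attribute is available, and the helper names are hypothetical.

```python
import subprocess
from typing import List

# Attribute names taken from the steps above.
ATTRIBUTES = ("cluster_mode", "cluster_rep_status",
              "master_cluster_ip", "master_cluster_port")


def query_command(name: str) -> List[str]:
    """Build the crm_attribute query for one attribute."""
    return ["crm_attribute", "-n", name, "-G", "-q"]


def read_attribute(name: str) -> str:
    """Run the query and return the trimmed output.

    Raises subprocess.CalledProcessError if the command fails, for
    example when the attribute is not set.
    """
    result = subprocess.run(query_command(name), capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()
```

Calling read_attribute for each name in ATTRIBUTES gives the values that the steps compare, such as whether cluster_mode is primary.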


2.39 9105 PostgreSQL database failed to archive log file


Alarm Information

 Alarm code: 9105


 Alarm description: This alarm is raised when the number of log files that failed to be
archived in the PostgreSQL database log archive directory exceeds 5.
 Alarm level: Critical
 Alarm type: QoS Alarm

Alarm Cause

 FTP network interruption.


 Backup command configuration error.

Action

First, run the following command on the control node to enter the database container:
exec -it pod_name -n opcs bash
Second, search the CSV log file for the keyword archive command failed to find the
cause of the archive failure.
Third, resolve the problem according to the following table.

Table 2-1 Error List for Log Archive

Error Code    Cause                           Solution
20            ftp transform error             Test the network.
40            ftp config error                Check the FTP configuration.
111           FTPPORT is not found            Configure the FTP port.
112           FTPIP is not found              Configure the FTP IP address.
113           FTPUSER is not found            Configure the FTP user.
114           FTPPASSWORD is not found        Configure the FTP password.
115           FTP WALDATAPATH is not found    Configure the FTP backup path.
116           WALDATAPATH is not found        Configure the WAL log name.
117           full walname is not found       Configure the WAL full path and name.
118           walname is not found            Configure the WAL log name.

For other information, please contact ZTE technical support staff.
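When archive failures are handled often, the error list above can be kept as a lookup table. The following sketch only restates the table in paraphrased form. The names are hypothetical, and how the error code is extracted from the CSV log is not covered because the log layout is not described here.

```python
# Error codes and suggested solutions, paraphrased from the error list.
ARCHIVE_ERROR_SOLUTIONS = {
    20: "Test the network",
    40: "Check the FTP configuration",
    111: "Configure the FTP port",
    112: "Configure the FTP IP address",
    113: "Configure the FTP user",
    114: "Configure the FTP password",
    115: "Configure the FTP backup path",
    116: "Configure the WAL log name",
    117: "Configure the WAL full path and name",
    118: "Configure the WAL log name",
}


def archive_solution(error_code: int) -> str:
    """Return the suggested solution for a log archive error code."""
    return ARCHIVE_ERROR_SOLUTIONS.get(error_code,
                                       "Contact ZTE technical support")
```

Codes that are not in the list fall through to the technical support suggestion, which matches the closing sentence of this section.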


2.40 9141 Index is damaged in common service Elasticsearch


Alarm Information

 Alarm code: 9141
 Alarm description: When the index status is red in the common service Elasticsearch,
the index is damaged and this alarm is raised.
 Alarm level: Undefined
 Alarm type: QoS Alarm

Alarm Cause

 The disk of the node where the Elasticsearch is deployed is abnormal.


 The network of the node where the Elasticsearch is deployed is abnormal.

Action

1. Check whether there are related alarms on the application management page.
a. On the PaaS operations and maintenance interface, select Monitoring -> Alarm ->
Details, and obtain the Object ID from the Details item on the page.
b. On the Application Manager page of the opcs project, find the application named
“commsrves-<Object ID>”. Enter the Alarm page of the application, and
check whether there is any current alarm.
 Yes -> Handle the fault based on the related handling suggestions.
 No -> Contact ZTE technical support.
2. For other information, contact ZTE technical support.



Chapter 3
OMC Alarm
Table of Contents
1000 User locked
1001 Hard disk usage of database server overload
1002 CPU usage of application server overload
1003 RAM usage of application server overload
1004 Application server disk-overload
1008 Database instance space usage too large
1012 License is expired
1013 License is about to expire
1015 The link between the server and the ME agent is broken
1017 The time in which the designated alarm remains active has expired
1018 The time in which the designated alarm remains unacknowledged has expired
1022 Merge rule root alarm
1023 Suppress plan task
1025 Automatic backup failure
1028 Alarm forwarding failure
1034 License consumption exceeds the alarm threshold
1035 License consumption exceeds the total authorization
1050 Wrong login password
1060 The number of users assigned the specific type exceeds the limit
1061 The number of users assigned the specific type is about to exceed the limit
1300 Password has expired
1301 Password will expire
1310 The number of login users exceeds the limit
1311 SNMP authentication failure


3.1 1000 User locked


Alarm Information

 Alarm code: 1000


 Alarm description: User locked
 Alarm level: Warning
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Check and analyze the login log to determine whether the lock was caused by a password
guessing attack. If not, contact the system administrator to unlock the user account.

3.2 1001 Hard disk usage of database server overload


Alarm Information

 Alarm code: 1001


 Alarm description: Hard disk usage of database server overload
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

 The disk space occupied by audit logs exceeds the threshold "Lower Clean Percent".
 The disk space occupied by program logs exceeds the threshold "Lower Clean
Percent".

Action

Perform the following operations as required:


 In Backup and Recovery, manually back up the big data of the over-limit module, and
select Delete or Backup and delete.
 In Big Data Automatic Backup, modify Task Setting of the over-limit module, and
adjust the module's Clean by Time, Space Capacity, Lower Clean Percent, or Upper
Clean Percent.


3.3 1002 CPU usage of application server overload


Alarm Information

 Alarm code: 1002


 Alarm description: CPU usage of application server overload
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

1. Check that the load of the UME is within the allowable range.
2. Check whether any unnecessary applications are running on the UME server. If yes,
exit those unnecessary applications.

3.4 1003 RAM usage of application server overload


Alarm Information

 Alarm code: 1003


 Alarm description: RAM usage of application server overload
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

1. Check that the load of the UME is within the allowed range.
2. Check whether any unnecessary applications are running on the UME server. If yes,
exit those applications to release some RAM.
3. Expand the RAM of the application server.

3.5 1004 Application server disk-overload


Alarm Information

 Alarm code: 1004


 Alarm description: Application server disk-overload


 Alarm level: Undefined


 Alarm type: OMC Alarm

Alarm Cause

None.

Action

The system administrator is recommended to handle this alarm as follows:


1. Check that the space of the hard disk in the application server has been properly
allocated.
2. Expand the hard disk.

3.6 1008 Database instance space usage too large


Alarm Information

 Alarm code: 1008


 Alarm description: Database instance space usage too large
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Do the following to remove the probable problems causing this alarm:


1. Back up and delete historical data.
2. Clean the database periodically.
3. Allocate more space to the database instance.

3.7 1012 License is expired


Alarm Information

 Alarm code: 1012


 Alarm description: License is expired
 Alarm level: Major
 Alarm type: OMC Alarm


Alarm Cause

None.

Action

Contact your local vendor office for a new license.

3.8 1013 License is about to expire


Alarm Information

 Alarm code: 1013


 Alarm description: License is about to expire
 Alarm level: Major
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Contact your local vendor office for a new license.

3.9 1015 The link between the server and the ME agent is broken


Alarm Information

 Alarm code: 1015


 Alarm description: The link between the server and the ME agent is broken
 Alarm level: Critical
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Check the link between the server and the agent as follows:
1. On the Alarm Monitor interface, view the details of the alarm to identify the agent
whose link is broken. Go to the proxy access UI, and check the proxy address
information in the details of the corresponding proxy.
2. Check whether the network is faulty. If the network is faulty, configure the firewall.
3. Otherwise, restart the agent.

3.10 1017 The time in which the designated alarm remains active has expired


Alarm Information

 Alarm code: 1017


 Alarm description: The time in which the designated alarm remains active has
expired
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Do the following to handle this alarm:


1. On the Alarm Monitor view of the UME client GUI, view the details of the alarm to find
the original alarm that persists for a long time, which causes this alarm.
2. Find the handling suggestion of the original alarm by its alarm code, and then handle
the original alarm according to the suggestion.

3.11 1018 The time in which the designated alarm remains unacknowledged has expired


Alarm Information

 Alarm code: 1018


 Alarm description: The time in which the designated alarm remains unacknowledged
has expired
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.


Action

Do the following to handle this alarm:


1. Open the details of the alarm to find the original alarm that causes this alarm.
2. Find the handling suggestion of the original alarm by its alarm code, and then handle
the original alarm according to the suggestion.

3.12 1022 Merge rule root alarm


Alarm Information

 Alarm code: 1022


 Alarm description: Merge rule root alarm
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

1. View this alarm in the current alarm table to show all merged alarms.
2. Find the handling suggestion of each merged alarm by its alarm code, and then
handle the corresponding alarm according to the suggestion.

3.13 1023 Suppress plan task


Alarm Information

 Alarm code: 1023


 Alarm description: Suppress plan task
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

During engineering cutover and switchover, this alarm is used to suppress alarms
reported by the equipment, and users do not need to handle it. After engineering cutover
and switchover, users should clear this alarm. If the equipment alarms suppressed by
this alarm are already cleared, they do not need to be handled again. If some equipment
alarms are not cleared yet, users need to check and handle these equipment alarms.

3.14 1025 Automatic backup failure


Alarm Information

 Alarm code: 1025


 Alarm description: Automatic backup failure
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Do the following to handle this alarm:


1. On the Alarm Monitor view of the UME client GUI, view the details of the alarm.
2. Handle backup failure based on information provided by additional alarm
parameters.
3. Clear the alarm manually after the fault is handled.

3.15 1028 Alarm forwarding failure


Alarm Information

 Alarm code: 1028


 Alarm description: Alarm forwarding failure
 Alarm level: Major
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

1. If SMS forwarding fails, check the phone number.
2. If email forwarding fails, check the email address or the settings in the email
configuration center.


3.16 1034 License consumption exceeds the alarm threshold


Alarm Information

 Alarm code: 1034


 Alarm description: License consumption exceeds the alarm threshold
 Alarm level: Major
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Update the license file in a timely manner.

3.17 1035 License consumption exceeds the total authorization


Alarm Information

 Alarm code: 1035


 Alarm description: License consumption exceeds the total authorization
 Alarm level: Major
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Update the license file in a timely manner.

3.18 1050 Wrong login password


Alarm Information

 Alarm code: 1050


 Alarm description: Wrong login password
 Alarm level: Warning
 Alarm type: OMC Alarm


Alarm Cause

None.

Action

Please input the correct password.

3.19 1060 The number of users assigned the specific type exceeds the limit


Alarm Information

 Alarm code: 1060


 Alarm description: The number of users assigned the specific type exceeds the limit
 Alarm level: Major
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Contact your local vendor office for a new license.

3.20 1061 The number of users assigned the specific type is about to exceed the limit


Alarm Information

 Alarm code: 1061


 Alarm description: The number of users assigned the specific type is about to exceed
the limit
 Alarm level: Minor
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Contact your local vendor office for a new license.


3.21 1300 Password has expired


Alarm Information

 Alarm code: 1300


 Alarm description: Password has expired
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Please modify the password of the commonResouceservice instance.

3.22 1301 Password will expire


Alarm Information

 Alarm code: 1301


 Alarm description: Password will expire
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Please modify the password of the commonResouceservice instance.

3.23 1310 The number of login users exceeds the limit


Alarm Information

 Alarm code: 1310


 Alarm description: The number of login users exceeds the limit
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.


Action

The number of login users exceeds the threshold. Reduce the number of login users to
prevent system exceptions.

3.24 1311 SNMP authentication failure


Alarm Information

 Alarm code: 1311


 Alarm description: SNMP authentication failure
 Alarm level: Undefined
 Alarm type: OMC Alarm

Alarm Cause

None.

Action

Please modify the SNMP parameters.



Chapter 4
Communication Alarm
Table of Contents
1014 The link between the server and the ME is broken
1040 ME or agent backend start failure
200204012 S1 link is broken
200204013 Power supply failure
200204014 Transport failure

4.1 1014 The link between the server and the ME is broken


Alarm Information

 Alarm code: 1014


 Alarm description: The link between the server and the ME is broken
 Alarm level: Critical
 Alarm type: Communication Alarm

Alarm Cause

None.

Action

Do the following to check whether the link between the server and the ME is normal:
1. Find the IP address of the ME on the Topo Management page.
2. Ping the IP address of the ME from the server.
3. If the second step fails, then you need to solve the network or the ME problems.
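The ping check in step 2 can be scripted when many MEs need to be verified. The following is a minimal sketch that assumes a Linux server, where ping accepts the -c option for the packet count. The function names and the default count are illustrative.

```python
import ipaddress
import subprocess
from typing import List


def ping_command(me_ip: str, count: int = 4) -> List[str]:
    """Build the ping command for an ME address.

    Raises ValueError if me_ip is not a valid IP address.
    """
    ipaddress.ip_address(me_ip)  # validation only
    return ["ping", "-c", str(count), me_ip]


def me_reachable(me_ip: str) -> bool:
    """Return True if the ME answers the ping from this server."""
    result = subprocess.run(ping_command(me_ip),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

A False result corresponds to step 3: the network or the ME itself needs to be checked. A True result only shows IP reachability and does not prove that the management link is restored.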

4.2 1040 ME or agent backend start failure


Alarm Information

 Alarm code: 1040


 Alarm description: ME or agent backend start failure


 Alarm level: Critical


 Alarm type: Communication Alarm

Alarm Cause

None.

Action

Manually start the ME or agent on the topology management page.

4.3 200204012 S1 link is broken


Alarm Information

 Alarm code: 200204012


 Alarm description: S1 link is broken
 Alarm level: Critical
 Alarm type: Communication Alarm

Alarm Cause

S1 link is broken

Action

Based on the network planning, check whether the settings of the IP address and the
route of each node (such as the BSC, RNC, and switching devices) over the transport
path are correct.

4.4 200204013 Power supply failure


Alarm Information

 Alarm code: 200204013


 Alarm description: Power supply failure
 Alarm level: Critical
 Alarm type: Communication Alarm

Alarm Cause

Power supply failure

Action

Check whether the power supply equipment in the equipment room is normal or not.


4.5 200204014 Transport failure


Alarm Information

 Alarm code: 200204014


 Alarm description: Transport failure
 Alarm level: Critical
 Alarm type: Communication Alarm

Alarm Cause

Transport failure

Action

Check whether the gateway is normal or not.



Chapter 5
Processing Error Alarm
Table of Contents
0502 K8s schedule failed
0503 K8s create pod failed
0504 Failed to Delete a Pod
1014 Abnormal Service Operational Status
1015 Abnormal Microservice Operational Status
2001 Add network for Pod error
2002 IaaS account authentication failed
8001 Commonservice deployed failed
9302 Failed to synchronize data to slave zone

5.1 0502 K8s schedule failed


Alarm Information

 Alarm code: 0502


 Alarm description: K8s schedule failed
 Alarm level: Major
 Alarm type: Processing Error Alarm

Alarm Cause

 The node has insufficient CPU or disk space.


 The resources available to the node (CPU/memory) do not meet the requirement by
the application.
 The application has set the node affinity, which does not match the node label.
 The application has set the application affinity, which does not match the existing
application on the node.
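The resource and node affinity causes above can be illustrated with a simplified check. This sketch is not the Kubernetes scheduler and does not call any Kubernetes API. It only compares plain dictionaries, and the function name, keys, and units are assumptions made for illustration.

```python
from typing import Dict, List


def scheduling_blockers(node_free: Dict[str, float],
                        pod_request: Dict[str, float],
                        node_labels: Dict[str, str],
                        required_labels: Dict[str, str]) -> List[str]:
    """List the reasons why a pod would not fit on a node.

    node_free and pod_request map "cpu" and "memory" to amounts in the
    same unit. required_labels stands in for the node affinity rule.
    """
    blockers = []
    for resource in ("cpu", "memory"):
        if pod_request.get(resource, 0) > node_free.get(resource, 0):
            blockers.append("insufficient " + resource)
    for key, value in required_labels.items():
        if node_labels.get(key) != value:
            blockers.append("node label mismatch: " + key)
    return blockers
```

An empty list means that neither of these two causes applies to the node, so the remaining causes, such as application affinity, are worth checking next.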

Action

1. Log in at "http://[ip address]/portaladmin", and open the "Monitor" -> "Alarm" ->
"Current Alarm" tab.
 If there is a "K8s Report Node Has Insufficient Memory" or "K8s Report Node Has
Disk Pressure" alarm, handle that alarm according to its handling suggestions.
 If neither alarm exists, go to step 2.
2. Open "Resource" -> "Compute" -> "Nodes", click the Kubernetes node name to enter the
detail page, and open the "Resources Monitor" page to check whether the available
CPU/memory satisfies the Pod's request.
• Yes → Go to step 6.
• No → Go to step 3.
3. Check whether the CPU/memory resources requested by the application can be adjusted.
• Yes → Go to step 4.
• No → Go to step 5.
4. Adjust the amount of CPU/memory resources requested by the application and
redeploy the application. Log in to the portal page, enter "AppManager", select the App
name, and click "Delete".
Open the "Software Repository" -> "Image" page, find the corresponding image, and click
"Deploy" to redeploy it with the adjusted CPU and memory values. Alternatively, open the
"Software Repository" -> "Blueprint" page, click the corresponding blueprint name, click
"Edit" -> "Container" icon -> "Advanced setting" -> "Configure Resources", modify the
CPU/Memory parameters, and redeploy the blueprint.
5. Add node resources to the cluster. Log in to the Portaladmin page, open
"Environment" -> "Business Cluster", click the cluster name -> "Nodes" -> "Scale out",
fill in all necessary parameters, and click the "Scale out" button.
6. Check whether the node affinity set by the application matches the node label. Log in
to the Portaladmin page and open "Environment" -> "Business Cluster" -> cluster name ->
"Node".
• Yes → Go to step 8.
• No → Go to step 7.
7. Modify the node affinity configuration of the application and redeploy the application.
Open the "AppManager" page and click the "Delete" button to delete this App. Open the
"Software Repository" -> "Image" page, find the image used by the application, click
"deploy" -> "Show Advanced Settings" -> "affinity config", fill in the necessary
parameters, and then deploy the image. Alternatively, open the "Software Repository" ->
"Blueprint" page, find the blueprint used by the application, click "deploy" -> "Show
Advanced Settings", fill in the necessary parameters, and click "deploy".
8. Open the "AppManager" page and check whether all applications that have an
affinity/anti-affinity relationship with this application are correct.
• Yes → Please contact ZTE technical support.
• No → Go to step 9.
9. Modify the node affinity configuration of the application and redeploy the application.
Open the "AppManager" page and click the "Delete" button to delete this App. Open the
"Software Repository" -> "Image" page, find the image used by the application, click
"deploy" -> "Show Advanced Settings" -> "affinity config", fill in the necessary
parameters, and then deploy the image. Alternatively, open the "Software Repository" ->
"Blueprint" page, find the blueprint used by the application, click "deploy" -> "Show
Advanced Settings", fill in the necessary parameters, and click "deploy".

5.2 0503 K8s create pod failed


Alarm Information

• Alarm code: 0503
• Alarm description: K8s create pod failed
• Alarm level: Major
• Alarm type: Processing Error Alarm

Alarm Cause

• The K8s kube-apiserver process is abnormal.
• The etcd server is abnormal.

Action

Log in to the Portaladmin page, open "Monitor" -> "Alarm" -> "Current Alarm", and check
whether there is a "cluster status abnormal" alarm.
• Yes → Handle the fault according to the suggestion given for the "cluster status
abnormal" alarm.
• No → Please contact ZTE technical support.

5.3 0504 Failed to Delete a Pod


Alarm Information

• Alarm code: 0504
• Alarm description: Failed to Delete a Pod
• Alarm level: Major
• Alarm type: Processing Error Alarm

Alarm Cause

Kubernetes API access error.

Action

On the platform O&M portal, select Monitor > Alarm > Current Alarm. Check whether
the “Abnormal cluster status” alarm exists.
• Yes -> Handle the fault based on related handling suggestions.
• No -> Contact ZTE technical support.

5.4 1014 Abnormal Service Operational Status


Alarm Information

• Alarm code: 1014
• Alarm description: Abnormal Service Operational Status
• Alarm level: Major
• Alarm type: Processing Error Alarm

Alarm Cause

• The application fails to be deployed or upgraded.
• The working instance of the application does not operate or the operational status is
not healthy.

Action

1. On the “Event” tab of the application details page, filter the events by
“AppRunAbnormally”.
• If the event description displays “could not run normally in appointed time”, go
to Step 2.
• If the event description displays “select cluster fail”, go to Step 3.
• If the event description displays “resource of tenant is not enough”, the resource
quota of the tenant is insufficient. Contact the platform administrator.
• For other displayed information, go to Step 4.
2. On the “Current Alarm” tab of the application details page, view the alarms.
• If the “Kubernetes Failed to Dispatch Pod” alarm exists, handle the fault based on
related handling suggestions.
• If the “Pod Network Configuration Failure” alarm exists, handle the fault based on
related handling suggestions.
• If the “Failed to Create a Pod” alarm exists, handle the fault based on related
handling suggestions.
• If the “Failed to Mount a Volume to the Pod” alarm exists, handle the fault based
on related handling suggestions.
• If none of the above alarms exists, go to Step 4.
3. On the platform O&M portal, select Environment > Business Cluster. View the
information in the “Available status” column.
• It displays “Yes” -> Go to Step 4.
• It displays “No” -> On the platform O&M portal, select Monitor > Alarm >
Current Alarm. If the “Abnormal cluster status” alarm exists, handle the fault
based on related handling suggestions.
4. Attempt to analyze the cause of the failure according to the details of the
AppRunAbnormally event.
• If the cause of the failure is clear, describe it to the platform administrator and
ask the administrator to fix the failure.
• If the cause of the failure is unclear, contact ZTE technical support.
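
The branching in step 1 amounts to a lookup on fragments of the event description. The sketch below illustrates that lookup only; the function name, the wording of the returned actions, and the sample event texts are assumptions, and it does not call any portal API.

```python
# Event-description fragments quoted in step 1, mapped to the follow-up action.
NEXT_STEP = {
    "could not run normally in appointed time": "step 2: view the application's current alarms",
    "select cluster fail": "step 3: check the cluster's available status",
    "resource of tenant is not enough": "contact the platform administrator (tenant quota)",
}
DEFAULT_STEP = "step 4: analyze the event details"


def next_action(event_description: str) -> str:
    """Choose the follow-up for an AppRunAbnormally event description."""
    text = event_description.lower()
    for fragment, action in NEXT_STEP.items():
        if fragment in text:
            return action
    return DEFAULT_STEP


# Hypothetical event descriptions, for illustration only.
print(next_action("App demo could not run normally in appointed time"))
print(next_action("unexpected error"))
```

Any description that matches none of the quoted fragments falls through to step 4, mirroring the last bullet of step 1.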

5.5 1015 Abnormal Microservice Operational Status


Alarm Information

• Alarm code: 1015
• Alarm description: Abnormal Microservice Operational Status
• Alarm level: Major
• Alarm type: Processing Error Alarm

Alarm Cause

• The application fails to be deployed or upgraded.
• The working instance of the application does not operate or the operational status is
not healthy.

Action

1. On the “Event” tab of the application details page, filter the events by
“AppRunAbnormally”.
• If the event description displays “could not run normally in appointed time”, go
to Step 2.
• If the event description displays “select cluster fail”, go to Step 3.
• If the event description displays “resource of tenant is not enough”, the resource
quota of the tenant is insufficient. Contact the platform administrator.
• For other displayed information, go to Step 4.
2. On the “Current Alarm” tab of the application details page, view the alarms.
• If the “Kubernetes Failed to Dispatch Pod” alarm exists, handle the fault based on
related handling suggestions.
• If the “Pod Network Configuration Failure” alarm exists, handle the fault based on
related handling suggestions.
• If the “Failed to Create a Pod” alarm exists, handle the fault based on related
handling suggestions.
• If the “Failed to Mount a Volume to the Pod” alarm exists, handle the fault based
on related handling suggestions.
• If none of the above alarms exists, go to Step 4.
3. On the platform O&M portal, select Environment > Business Cluster. View the
information in the “Available status” column.
• It displays “Yes” -> Go to Step 4.
• It displays “No” -> On the platform O&M portal, select Monitor > Alarm >
Current Alarm. If the “Abnormal cluster status” alarm exists, handle the fault
based on related handling suggestions.
4. Attempt to analyze the cause of the failure according to the details of the
AppRunAbnormally event.
• If the cause of the failure is clear, describe it to the platform administrator and
ask the administrator to fix the failure.
• If the cause of the failure is unclear, contact ZTE technical support.

5.6 2001 Add network for Pod error


Alarm Information

• Alarm code: 2001
• Alarm description: Add network for Pod error
• Alarm level: Major
• Alarm type: Processing Error Alarm

Alarm Cause

• In the underlay scenario, the PaaS network components fail to create network ports
because the port resource quota of the IaaS is insufficient.
• The network specified in the Pod blueprint does not exist in the PaaS network.

Action

1. Identify the IaaS tenant used by PaaS, and contact the IaaS administrator to modify
the resource quota configuration of that tenant in the IaaS environment.
2. Check whether the network planned for use in the Pod blueprint has been created in
the PaaS network. In the Portaladmin system, open the "Resources" -> "Network" page and
check whether the network exists. If not, click "Create Network" to add it.

5.7 2002 IaaS account authentication failed


Alarm Information

• Alarm code: 2002
• Alarm description: IaaS account authentication failed
• Alarm level: Major
• Alarm type: Processing Error Alarm

Alarm Cause

The IaaS user name, password, or IaaS address is incorrect.

Action

1. Obtain the correct IaaS user name, password, and IP address.
2. Open "Resources" -> "Compute", click "Add Region", and fill in the correct IaaS
user name, password, and IP address.

5.8 8001 Commonservice deployed failed


Alarm Information

• Alarm code: 8001
• Alarm description: Commonservice deployed failed
• Alarm level: Major
• Alarm type: Processing Error Alarm

Alarm Cause

One of the following operations failed: downloading the blueprint, creating the PVC,
creating the IPGroup, deploying the pdm/vnpm server, or deploying the broker.

Action

1. If the detail information of the alarm is "download BluePrint failed", open "Software
Repository" -> "Blueprint" and check whether a commonservice blueprint exists that
matches the name and version number of the deployed commonservice.
No → Please contact the administrator to upload the blueprint version.
Yes → Please contact the administrator to confirm whether the software
repository is normal.
2. If the detail information of the alarm is "create PVC failed", check the shared storage
node. Please contact the administrator to confirm whether the environment has storage
clusters and whether the volume capacity resources exceed the limit.
3. If the detail information of the alarm is "NW create ipgroup failed", check the network.
Please contact the administrator to confirm the network.
4. If the detail information of the alarm is "vnpm deploy server failed", check events from
VNPM. Open the "Monitor" -> "Alarm" -> "Current Alarm" page to see whether there is a
"cluster status abnormal" alarm.
Yes → Click "Alarm Name" to see the "detail information" -> "Suggestion", and handle
the alarm accordingly.
No → Go to step 6.
5. If the detail information of the alarm is "vnpm deploy broker failed", check events from
VNPM. Perform the same operations as in step 4.
6. Please contact ZTE technical support.

5.9 9302 Failed to synchronize data to slave zone


Alarm Information

• Alarm code: 9302
• Alarm description: Failed to synchronize data to slave zone
• Alarm level: Major
• Alarm type: Processing Error Alarm

Alarm Cause

The master zone failed to send data to the slave zone, probably because of one of the
following factors:
• A network failure.
• Network congestion.
• The peer entity in the slave zone is offline.

Action

Contact ZTE technical support to check the component status and the network
configurations for disaster recovery, and to repair the fault.



Chapter 6
Environment Alarm
Table of Contents
9121 FTP disk space is insufficient
9122 FTP disk read and write exception
9201 Common Service Kafka node is offline
9301 The connection for geographical disaster recovery is broken

6.1 9121 FTP disk space is insufficient


Alarm Information

• Alarm code: 9121
• Alarm description:
This alarm is raised when FTP disk space usage is too high.
The alarm level is adjusted dynamically as FTP disk space usage increases.
The default thresholds are as follows:
→ When usage reaches 70%, a major alarm is reported. The FTP function is not
actually affected at this point; the alarm reminds users to clean up the disk space
in time. The alarm is cleared when usage drops below 60%.
→ When usage reaches 90%, a critical alarm is reported and the FTP server is set to
read-only: files can be viewed, downloaded, and deleted, but not uploaded.
Clean up the FTP disk space immediately. When usage drops below 80%, the alarm
level changes to major and the FTP server becomes writable again.
Users can customize the above thresholds on the FTP administration page.
• Alarm level: critical
• Alarm type: QoS Alarm

Alarm Cause

FTP disk space is insufficient.
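
The default thresholds in the alarm description form a two-level hysteresis: an alarm is raised at one usage value and released at a lower one. The sketch below models that behavior under stated assumptions: the function and parameter names are invented for illustration, "reaches" is read as greater than or equal to, and "drops below" as strictly less than.

```python
def next_level(current: str, usage: float,
               major_on: float = 70.0, major_off: float = 60.0,
               critical_on: float = 90.0, critical_off: float = 80.0) -> str:
    """Return the alarm level ("none", "major", "critical") after a usage reading in percent."""
    if usage >= critical_on:
        return "critical"
    if current == "critical" and usage >= critical_off:
        return "critical"          # stays critical until usage drops below 80%
    if usage >= major_on:
        return "major"
    if current in ("major", "critical") and usage >= major_off:
        return "major"             # stays major until usage drops below 60%
    return "none"


def is_read_only(level: str) -> bool:
    """The FTP server accepts no uploads while the alarm is critical."""
    return level == "critical"


level = "none"
history = []
for usage in [50, 72, 65, 91, 85, 79, 59]:   # hypothetical usage readings
    level = next_level(level, usage)
    history.append(level)
print(history)
# ['none', 'major', 'major', 'critical', 'critical', 'major', 'none']
```

The gap between each raise and release threshold keeps the alarm from flapping when usage hovers around a single value.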

Action

Contact ZTE technical support to clean up unused data on the FTP server.


6.2 9122 FTP disk read and write exception


Alarm Information

• Alarm code: 9122
• Alarm description: This alarm is reported when a shared volume or local disk attached
to the FTP service cannot be read or written properly.
• Alarm level: critical
• Alarm type: QoS Alarm

Alarm Cause

FTP disk is not available.

Action

Contact ZTE technical support to repair FTP disk failure.

6.3 9201 Common Service Kafka node is offline


Alarm Information

• Alarm code: 9201
• Alarm description: When create project failed.
• Alarm level: Major
• Alarm type: QoS Alarm

Alarm Cause

• Common Service Kafka memory overflow.
• Zookeeper Session Timeout.

Action

1. Log in to the node where Kafka is deployed and go to the directory
/paasdata/op-comsrv/log/Apache-Kafka/entity_name/broker/logs. Check the
kafkaserver-gc.log.*.current log file and observe whether GC errors occur, such as
Full GC Failure.
• Yes → On the "CommonService" -> "Kafka" -> entity_name page in PortalAdmin, click
the "upgrade" button at the top of the page, select the Kafka version, and
adjust the deployment value of the Kafka memory size.
• No → Go to step 2.
2. On the "CommonService" -> "Kafka" -> entity_name page in PortalAdmin:
a. In the installation information section, check the value of zookeeper session
timeout ms.
b. Click "self-management entry" below to enter Kafka Manager, and click the
"Cluster" option. In the "Zookeeper node list", check whether the maximum delay
of Zookeeper is greater than the value of zookeeper session timeout ms.
• Yes → Click the "upgrade" button at the top of the page, select a Kafka
version, and adjust the deployment value of zookeeper session timeout ms.
Generally, 30s is recommended, and the value should not exceed 60s.
• No → Go to step 3.
3. On the "CommonService" -> "Kafka" -> entity_name page in PortalAdmin:
a. Click "Kafka Manager" below to enter Kafka Manager.
b. On the "cluster" page, click the health detection button to test whether the Kafka
cluster messaging function is abnormal.
• Yes → Click the "upgrade" button at the top of the page and slightly adjust the
deployment parameters of Kafka and Zookeeper. For the same version of
Kafka, the upgrade does not take effect if the parameter values remain
unchanged.
• No → Go to step 4.
4. For other problems, please contact ZTE technical support.
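
The log check in step 1 and the comparison in step 2 can be scripted on the Kafka node. The following is a minimal sketch, not a ZTE tool: the function names are invented, the file pattern is copied from step 1 and may differ in letter case on a real node, and the assumption that a GC problem appears as a line containing "Full GC" depends on the JVM version and GC settings.

```python
import glob
import os
import re

# Assumption: suspicious collections are logged on lines containing "Full GC".
GC_PATTERN = re.compile(r"Full GC", re.IGNORECASE)


def find_gc_problems(log_dir: str, pattern: str = "kafkaserver-gc.log.*") -> list:
    """Return (file name, line number, text) for each suspicious GC log line in log_dir."""
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if GC_PATTERN.search(line):
                    hits.append((os.path.basename(path), number, line.strip()))
    return hits


def session_timeout_too_low(max_latency_ms: float, session_timeout_ms: float) -> bool:
    """Step 2b: True when the maximum Zookeeper delay exceeds the session timeout."""
    return max_latency_ms > session_timeout_ms
```

For example, `find_gc_problems("/paasdata/op-comsrv/log/Apache-Kafka/entity_name/broker/logs")`, with entity_name replaced by the actual entity, lists the matching lines; an empty list corresponds to the "No" branch of step 1.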

6.4 9301 The connection for geographical disaster recovery is broken


Alarm Information

• Alarm code: 9301
• Alarm description: This alarm is reported when the connection for geographical
disaster recovery is broken.
• Alarm level: critical
• Alarm type: QoS Alarm

Alarm Cause

The connection is broken, probably because of one of the following factors:
• A network failure.
• Network congestion.
• An incorrect configuration of an IP address or port.

Action

Contact ZTE technical support to check the network configurations for disaster recovery
and to repair the fault.



Chapter 7
Integrity Violation Alarm
Table of Contents
15010001 Alarm for Missing of PM Data
15010002 Alarm for Missing of NAF PM Data

7.1 15010001 Alarm for Missing of PM Data


Alarm Information

• Alarm code: 15010001
• Alarm description: Alarm for Missing of PM Data
• Alarm level: Major
• Alarm type: Integrity Violation Alarm

Alarm Cause

See the alarm details.

Action

1. Job Inconsistent: The system recovers automatically. No further operation is required.
2. Link Break: See the details of the link break.
3. FtpServer Error: Connect to the FtpServer for further troubleshooting.
4. No Data in ME: Connect to the ME.
5. Lack of DB Space: Add a disk.

7.2 15010002 Alarm for Missing of NAF PM Data


Alarm Information

• Alarm code: 15010002
• Alarm description: Alarm for Missing of NAF PM Data
• Alarm level: Major
• Alarm type: Integrity Violation Alarm

Alarm Cause

See the alarm details.

Action

1. Job Inconsistent: The system recovers automatically. No further operation is required.
2. Link Break: See the details of the link break.
3. FtpServer Error: Connect to the FtpServer for further troubleshooting.
4. No Data in ME: Connect to the ME.
5. The FTPServer for NMS is error: Connect to the FtpServer for further troubleshooting.



Glossary

BSC

- Base Station Controller

CPU

- Central Processing Unit

RNC

- Radio Network Controller

SNMP

- Simple Network Management Protocol

UME

- Unified Management Expert
