
NCE V100R020C10

Troubleshooting

www.huawei.com

Copyright © Huawei Technologies Co., Ltd. All rights reserved.


Contents

1. NCE Maintenance

2. NCE Troubleshooting



1.1. Check the status of NCE
Access https://10.53.209.66:31945 and choose Maintenance > Panoramic Monitoring. In System Monitoring, click the Nodes tab
and check the node, service, and database status.



1.2. Switch the primary and secondary servers
On the NCE management plane, choose HA > Remote High Availability System > Manage DR System from the main menu,
and then choose Switch Over.



1.3. Check the DR status of NCE
On the management plane, choose HA > Manage DR System from the main menu. On the page that is displayed, click
Evaluate Health.



1.4. Health check NCE
On the management plane, choose Product > O&M Management > Health Check from the main menu.
Choose Daily and click Check.

Wait until the check progress is 100% and view the check results.

Click Check Result to view the check results online. Click Download Report to download the check results in batches.



1.5. Backup and restore NCE Product data
Access https://10.53.209.66:31945 and choose Backup and Restore > Backup Data > Backup Product Data. Select NCE and
click Backup to back up the product data.

Click Task List to check the task progress and whether the task has completed.



2. NCE Troubleshooting



2.1. Browser Login Failures

Possible cause: The network is faulty.
Check method: Ping the client or northbound IP address of NCE from the current PC.
Solution:
1. Check whether the network cable of NCE is loose. If the network cable is loose, fasten it.
2. Check whether the external network is normal. If the external network is abnormal, contact the customer to check the external network and perform troubleshooting.
3. Check whether IP addresses conflict. If IP addresses conflict, modify them.
4. Import the network_NCE.json file if it was not imported during DR system installation.
5. Check the process on the System Monitoring page of the management plane. If it is not running, start it and try again.

Possible cause: The firewall between the current PC and the server blocks the port used for logging in to the server.
Check method: On the current PC, telnet to the service server through port 31943 and to the OMP server through port 31945:
telnet X.X.X.X portvalue
Test ports 31943 and 31945. If the port between the current PC and the server is blocked, the output of the following command indicates that the connection failed:
telnet x.x.x.x 31943
In normal situations, no output is displayed.
Solution:
If the server is not listening on the port, the ER service may be stopped. Start the ER service on the management plane.
If the server is listening on the port, the service is running properly. You are advised to check the network between the browser and the server for port blocking rules.
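The telnet port test above can also be scripted. The following is a minimal sketch, not an NCE tool: it only attempts a TCP connection to each login port from the current PC. The function names are illustrative assumptions; ports 31943 and 31945 come from the table above.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_nce_ports(host: str, ports=(31943, 31945)) -> dict:
    """Map each login port to its reachability from this PC."""
    return {port: port_is_reachable(host, port) for port in ports}
```

For example, check_nce_ports("x.x.x.x") returns a dictionary with one True or False value per port. A False value corresponds to a failed telnet connection and points to a stopped service or a blocking rule on the path.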



Possible cause: The IP address of the current PC is outside the client IP address policy, the password is incorrect, the user is locked, or the number of login attempts exceeds the maximum limit.
Check method:
1. Open the Security Management app and choose Security Policies from the main menu. Choose Client IP Address Policies from the navigation pane. Check whether the IP address of the current PC is included in the policy.
2. Open the Security Management app and choose User Management from the main menu. On the Users page, click the current user. Click the Access Policies tab and check whether the current user is locked in the Client IP Address Policies area.
Solution:
1. Contact the administrator to add the IP address of the current PC to the client IP address policy.
2. Contact the administrator to unlock the current user.
3. If possible, use another computer to log in to NCE.

Possible cause: The server is powered off.
Check method: Log in to the control port of the server and check whether the server is started. If the remote login fails, go to the equipment room and check whether the server is powered on.
Solution:
1. Check that the power supply in the equipment room is normal.
2. Log in to the control port of the server and check that the server hardware is running properly. Otherwise, contact Huawei technical support.
3. Power on the server through the control port of the server.



2.2 Troubleshooting Alarm Monitoring Failures in the Access Domain

Possible cause: The faulty process is not started or is started abnormally. As a result, no alarms can be reported.
Check method: Check the status of the ifms_agent process on the management plane. If the status is not Running, the process is not started.
Solution: Start the faulty process manually.

Possible cause: The disk space is full. As a result, network-wide alarms or alarms of some NEs cannot be displayed.
Check method: On EulerOS, run the df -k command to check whether the /opt directory is full.
Solution: If the directory usage reaches 100%, contact Huawei technical support for troubleshooting. In emergency situations, copy the .log files to another PC and delete these files from the server.

Possible cause: An alarm masking rule is configured. NCE discards the alarms although they are reported by NEs.
Check method: Choose Monitor > Alarm > Alarm Settings from the main menu. Choose Masking Rules from the navigation pane to check whether the corresponding alarm masking rule is configured.
Solution: If the alarms that cannot be reported are displayed in the masked alarm list, disable the alarm masking rule.

Possible cause: The trap parameter is not configured or is incorrectly configured on the device.
Check method:
MA5800T(config)#display snmp-agent trap enable
MA5800T(config)#display snmp-agent trap-source
Solution: Configure the trap parameter correctly.

Possible cause: NCE does not support the time format reported by the device.
Check method:
MA5800T(config)#display time date-format
The current date display format is: YYYY-MM-DD
Solution: Configure the time format for the device correctly.

Possible cause: An alarm masking rule is configured on the device.
Check method:
MA5800T(config)#display trap filter
Solution: Disable the alarm masking rule.

Possible cause: The network is faulty.
Check method: Check whether a fault such as network instability, packet loss, or firewall screening has occurred.
Solution: Rectify the network fault.
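The disk space check in the table above can be approximated in a script. This is a rough sketch using Python's standard library rather than df. The function names and the default threshold are assumptions for illustration, and the percentage may differ slightly from the df output because df excludes blocks reserved for root.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return the used space of the file system containing path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_full(percent: float, threshold: float = 100.0) -> bool:
    """Return True if the usage has reached the given threshold."""
    return percent >= threshold
```

On the NCE server, is_full(usage_percent("/opt")) would flag the situation in which the table recommends contacting Huawei technical support. A lower threshold, such as 90.0, could be passed to get an earlier warning.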
2.3 Recovery from a Large Number of Stacked Tasks

Symptoms
1. A large number of task orders are in the pending list.
2. The message "too many tasks may cause process delay" appears many times in logs.

General Fault Recovery

Step 1 Log in to the NCE management plane and choose Product > System Monitoring from the main menu.
Step 2 Restart the BmsAccess_xxx and BmsCommon-x-x processes (x indicates the instance ID).
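To judge how often the stacking message appears, the log lines can be counted with a short script. This is a minimal sketch: the source does not name the log file, so the function takes any iterable of lines, and the function name is an assumption.

```python
from typing import Iterable

# Message text quoted in the symptoms above.
STACK_MESSAGE = "too many tasks may cause process delay"


def count_stack_warnings(lines: Iterable[str]) -> int:
    """Count the log lines that contain the task stacking message."""
    return sum(1 for line in lines if STACK_MESSAGE in line.lower())
```

An open file object can be passed directly, for example count_stack_warnings(open(log_path, encoding="utf-8", errors="replace")), where log_path is whichever log file is being inspected.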



2.4 Performance Collection Tasks Run Too Frequently, Leaving Service
Provisioning Requests Unanswered
Step 1 Log in to the NCE O&M plane and open the Network Management app. Choose Monitor >
Performance > Performance Instance from the main menu.

Step 2 View performance monitoring instances of the desired devices.

Step 3 If the task frequency is too high, suspend related monitoring tasks.



2.5 The device is offline and the device icon on NCE is gray.
Step 1 Check whether the network between the devices and NCE is normal. Check the connection by running commands
such as ping and trace. For example, ping the devices from the NCE server or ping the NCE server from the
devices.
Step 2 Test whether SNMP communication between NCE and the devices is normal. If the test is successful but the devices
are still offline, contact Huawei to locate the fault. If the test fails, the SNMP configuration between NCE and the devices
is incorrect. In this case, perform Step 3.



Step 3 Check whether the SNMP configuration on NCE is consistent with the SNMP configuration on the devices.
[Figures: SNMP configuration on NCE; SNMP configuration on the devices]

If the configurations are inconsistent, modify them on NCE or on the devices so that they match.
For example, double-click the record to modify the configuration on NCE.
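The consistency check in Step 3 amounts to comparing two sets of parameters. The sketch below shows one way to list the differences once the values have been copied from both sides by hand. The parameter names in the example (version, read_community, port) are hypothetical placeholders, not fields taken from NCE or from the device.

```python
def snmp_mismatches(nce: dict, device: dict) -> dict:
    """Return {parameter: (value on NCE, value on device)} for each difference.

    A parameter missing on one side is reported with None for that side.
    """
    keys = sorted(set(nce) | set(device))
    return {
        key: (nce.get(key), device.get(key))
        for key in keys
        if nce.get(key) != device.get(key)
    }
```

An empty result means that the two configurations match for every parameter that was entered.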



2.6 Most NEs are offline.
Step 1 Check whether the network between the devices and NCE is normal. Check the connection by running commands such as ping and trace. For
example, ping the devices from the NCE server or ping the NCE server from the devices. You can also use the shortcut menus on NCE to perform the ping
operation.

Step 2 Check whether the interface IP address, the mask, and the configuration on the NCE server are normal. The following uses operations on EulerOS
as an example.
    [ossadm@NMS_Server ~]$ ifconfig -a
    bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
            inet 10.185.215.156 netmask 255.255.254.0 broadcast 10.185.215.255
            inet6 fe80::2e97:b1ff:fe5a:51e4 prefixlen 64 scopeid 0x20<link>
            ether 2c:97:b1:5a:51:e4 txqueuelen 1000 (Ethernet)
            RX packets 26095140 bytes 1814230816 (1.6 GiB)
            RX errors 0 dropped 394013 overruns 0 frame 0
            TX packets 10857723 bytes 3347765107 (3.1 GiB)
            TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
If the IP address and the mask are incorrect, reconfigure them on the NCE server.
If the interface status is DOWN, check whether the switch directly connected to the NCE server is normal. If the switch is normal, run the following command to
change the interface status to UP (bge0 is an example interface name; use the name of the interface that is down):
    ifconfig bge0 up
If the interface status is normal but the ping operation fails, go to the next step.
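As a worked check of the address, mask, and broadcast values in the ifconfig output above, the following sketch uses Python's standard ipaddress module. The function names are illustrative assumptions.

```python
import ipaddress


def describe_interface(address: str, netmask: str):
    """Return (network, broadcast address) for an interface address and mask."""
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    return iface.network, iface.network.broadcast_address


def broadcast_matches(address: str, netmask: str, broadcast: str) -> bool:
    """Check that the configured broadcast address fits the address and mask."""
    _, expected = describe_interface(address, netmask)
    return expected == ipaddress.ip_address(broadcast)
```

For the bond0 values shown above (10.185.215.156 with mask 255.255.254.0), the network is 10.185.214.0/23 and the broadcast address is 10.185.215.255, so the three values are consistent with one another.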

Step 3 Check the routing table of the NCE server.

Run the netstat -rn command to check whether the routing table contains routes to the devices:
    [ossadm@NMS_Server ~]$ netstat -rn
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
    0.0.0.0         10.185.214.1    0.0.0.0         UG        0 0          0 bond0
    10.185.214.0    0.0.0.0         255.255.254.0   U         0 0          0 bond0
    192.168.10.1    0.0.0.0         255.255.0.0     U         0 0          0 eth2
    192.168.10.1    0.0.0.0         255.255.0.0     U         0 0          0 eth3
    192.168.10.1    0.0.0.0         255.255.0.0     U         0 0          0 eth6
    192.168.10.1    0.0.0.0         255.255.0.0     U         0 0          0 eth7
    192.168.10.1    0.0.0.0         255.255.0.0     U         0 0          0 eth8
    192.168.10.1    0.0.0.0         255.255.0.0     U         0 0          0 eth9
    192.168.10.1    0.0.0.0         255.255.0.0     U         0 0          0 eth10
    192.168.10.1    0.0.0.0         255.255.0.0     U         0 0          0 eth11
    192.168.10.1    0.0.0.0         255.255.0.0     U         0 0          0 bond0
    192.168.10.1    0.0.0.0         255.255.0.0     U         0 0          0 bond1
    192.168.5.0     0.0.0.0         255.255.255.0   U         0 0          0 bond1
If the route is missing, run the following command as the root user to add a static route (Linux net-tools syntax):
    # route add -net <target network IP address> netmask <target network subnet mask> gw <gateway IP address>
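To decide whether a route to a given device exists, the Destination and Genmask columns of the netstat -rn output can be matched against the device IP address. The sketch below does this with Python's ipaddress module. It assumes the routes have already been copied into (destination, genmask, interface) tuples, and the function name is illustrative.

```python
import ipaddress


def matching_routes(device_ip: str, routes):
    """Return the routes that cover device_ip, most specific first.

    routes is an iterable of (destination, genmask, iface) tuples taken
    from the netstat -rn output.
    """
    ip = ipaddress.ip_address(device_ip)
    hits = []
    for destination, genmask, iface in routes:
        # strict=False tolerates destinations that have host bits set.
        network = ipaddress.ip_network(f"{destination}/{genmask}", strict=False)
        if ip in network:
            hits.append((network, iface))
    hits.sort(key=lambda hit: hit[0].prefixlen, reverse=True)
    return hits
```

If the only match for a device is the default route 0.0.0.0/0, traffic to that device leaves through the gateway. If there is no match at all, the route is missing and a static route is needed.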
Step 4 Check the system logs for entries about address conflicts.
In the /var/log/ directory, view the log files whose names start with messages and check whether a conflict entry such as the
following exists:
10:58:15 NCEPrimary ip: [ID 876157 kern.warning] WARNING: node 00:0d:60:8c:a4:0e is using our IP address 172.029.220.006 on e1000g0
If such a conflict exists, correct the IP address of the server that causes the conflict.
Step 5 Run the arp -a command to check the MAC address entries. Check whether the MAC addresses of the egress
gateways match those of the gateways on the live network. (The MAC address check is used to locate IP address conflicts.)
[ossadm@NMS_Server ~]$ arp -a
_gateway (10.185.214.1) at 84:5b:12:76:ae:0f [ether] on bond0
? (10.185.215.114) at 18:35:d4:01:77:4f [ether] on bond0
? (10.185.214.154) at 78:1d:ba:db:96:99 [ether] on bond0
? (10.185.215.74) at 00:18:82:2f:fc:46 [ether] on bond0
? (10.185.215.221) at 00:18:82:a0:aa:a2 [ether] on bond0
? (10.185.215.213) at 70:c7:f2:78:66:b2 [ether] on bond0
? (10.185.215.45) at 00:50:56:8e:2d:18 [ether] on bond0
? (10.185.214.250) at e4:c2:d1:e7:21:41 [ether] on bond0
If the MAC addresses are not the same, run the arp -d command to delete the incorrect entry, and then run the ping
command to trigger MAC address learning. Run the arp -a command again to check whether the learned MAC address is
correct. For example, for the IP address 10.112.166.139:
arp -d 10.112.166.139
ping 10.112.166.139
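The gateway MAC comparison in Step 5 can be sketched as a small parser for the arp -a output format shown above. The expected MAC address has to come from the live network records. The function names are assumptions for illustration.

```python
import re

# Matches entries such as "(10.185.214.1) at 84:5b:12:76:ae:0f".
ARP_ENTRY = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3})\) at ([0-9a-fA-F:]{17})")


def parse_arp(output: str) -> dict:
    """Return {IP address: MAC address in lower case} from arp -a output."""
    return {ip: mac.lower() for ip, mac in ARP_ENTRY.findall(output)}


def gateway_mac_ok(output: str, gateway_ip: str, expected_mac: str) -> bool:
    """Check that the learned MAC of the gateway equals the expected MAC."""
    return parse_arp(output).get(gateway_ip) == expected_mac.lower()
```

A False result means that the entry is missing or that another device answers for the gateway IP address, which is the case in which the entry is deleted and relearned.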



2.7 GaussDB T V3 Database Instance Is Abnormal Due to Incorrect File Permission
Symptoms
1. Start the browser and enter the following URL in the address box: https://IP address of the management plane:31945.
2. Enter the username admin and its password to log in to the management plane. Choose Product > System Monitoring from the
main menu. On the System Monitoring page, click the Nodes tab.
3. Check the DB node status. If a certain instance is faulty, the database status is Partially Running.
The following figure shows the abnormal state.

On the Relational Databases tab page, view the instance with a replication exception (using cloudsopdbsvr-0-1000 as an example).

Confirmation
Log in to the database node where the instance is abnormal as the sopuser user. Switch to the dbuser user and run the
ps -ef|grep zengine command to check whether the process of the abnormal database instance exists.

In this example, the command output shows that the process for the instance cloudsopdbsvr-0-1000 does not exist, which indicates that the
instance is not running properly and that the process failed to start. View the instance logs in the
/opt/zenith/data/cloudsopdbsvr-0-1000/log directory.
Check the startup logs starting with zctl-xxx. If the following error information is displayed, the database fails to read the port
number and the database instance fails to be started.

The port number is configured in the database configuration file. Go to the /opt/zenith/data/cloudsopdbsvr-0-1000/cfg directory,
and check the content and permission of the zengine.ini configuration file. In this example, the content of the zengine.ini file cannot be viewed
and the file permission differs from the expected one.

Solution
Step 1 Log in to the node where the instance is abnormal as the sopuser user. Switch to the root user, and run the
chown dbuser:dbgroup zengine.ini command to correct the owner and group of the configuration file.

Step 2 Wait for three to five minutes, and check the database node status on the management plane. If the fault is
rectified, as shown in the following figure, it was caused by the incorrect file permission.
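Before and after running chown, the ownership of zengine.ini can be inspected programmatically. The following is a minimal sketch using Python's standard library on a Linux node. The function names are assumptions, and the expected numeric IDs have to be looked up on the node itself.

```python
import os


def owner_ids(path: str):
    """Return the (uid, gid) that own the file at path."""
    info = os.stat(path)
    return info.st_uid, info.st_gid


def ownership_matches(path: str, expected_uid: int, expected_gid: int) -> bool:
    """Check whether the file is owned by the expected user and group IDs."""
    return owner_ids(path) == (expected_uid, expected_gid)
```

On the database node, the expected IDs could be obtained with pwd.getpwnam("dbuser").pw_uid and grp.getgrnam("dbgroup").gr_gid, assuming that both names exist there as the chown command in Step 1 implies.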



2.8 GaussDB T V3 Database Instance Is Abnormal Due to File Missing
Symptoms
1. Start the browser and enter the following URL in the address box: https://IP address of the management plane:31945.
2. Enter the username admin and its password to log in to the management plane. Choose Product > System Monitoring from
the main menu. On the System Monitoring page, click the Relational Databases tab.
3. Check the database node status. The abnormal status is as follows (using nmsdbsvr-2-50 as an example):

Confirmation
Log in to the database node where the instance is abnormal as the sopuser user. Switch to the dbuser user and run the
ps -ef|grep zenith command to check whether the process of the abnormal database instance exists.
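The process check can be illustrated with a filter over ps -ef output. This sketch rests on assumptions: the source does not show the exact layout of the ps lines, so it only presumes that the instance name appears as a whitespace-separated or slash-separated token on the process line. The function name is hypothetical.

```python
import re


def instance_process_lines(ps_output: str, instance: str, keyword: str = "zenith"):
    """Return the ps -ef lines that mention both the keyword and the instance."""
    matches = []
    for line in ps_output.splitlines():
        if " grep " in line:
            continue  # Skip the grep command itself.
        tokens = re.split(r"[\s/]+", line)
        if keyword in line and instance in tokens:
            matches.append(line)
    return matches
```

An empty list for the instance name corresponds to the finding described in this section: the instance process does not exist on the node.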


The command output shows that the process of the instance nmsdbsvr-2-50 does not exist, indicating that the instance on the
node is abnormal. View the logs of the instance. The log directory is /opt/zenith/data/nmsdbsvr-2-50/log.
Check the startup logs starting with zctl-xxx. If the following error information is displayed, the configuration file cannot be
found, causing the instance startup failure.

Check whether the zengine.ini file exists in the /opt/zenith/data/nmsdbsvr-2-50/cfg directory. In this example, the zengine.ini file does not exist.

Solution
Copy the zengine.ini file of the instance in the slave database to the corresponding instance in the master database
(/opt/zenith/data/instance name/cfg). Then modify the copied file as follows:
1. Set LSNR_ADDR and the LOCAL_HOST field of LOG_ARCHIVE_DEST_n to the IP address in use.
2. Set the SERVICE field of LOG_ARCHIVE_DEST_n to the IP address of the node where the slave database resides.
3. Set the instance name in CONTROL_FILES to the corresponding instance name.



Thank You!
www.huawei.com
