
ECC storm troubleshooting and preventive measures

Created: Jul 28, 2021 15:50:15
Latest reply: Oct 17, 2021 05:43:35

Hello, everyone. 

This post will analyze the ECC storm phenomenon, its causes and its solutions. Then, based
on the characteristics of the ECC protocol, it will provide reasonable suggestions from the
perspective of network planning in order to reduce the probability of ECC storm problems
and improve the efficiency of solving such problems.

Problem Description
Symptom 1: A large number of NEs go offline.

After an ECC storm occurs, a large number of NEs go offline. In most cases, no NE except
the gateway NE can be logged in to. Querying the routing table shows some random routes
with very large distances.

Symptom 2: Network flapping.

The ECC network flaps and a large number of ECC route update packets are continuously
transmitted on the network. As a result, the ECC routes of some NEs change continuously and
the ECC communication is intermittent and unstable.

ECC storms occur on both SDH and WDM networks because both use the HWECC protocol to
transmit network management information.

Cause analysis
ECC storms are caused by the following reasons:

1. The ECC network scale is too large

Like IP RIP, HWECC is a distance-vector protocol and cannot prevent route loops. After a
route change, routing information must be broadcast in bulk. HWECC is therefore designed
as an IGP routing protocol for small networks. In RIP, a route spans at most 15 hops (a
distance of 16 means unreachable). In HWECC, however, a route spans up to 64 hops by
default. After a route change, invalid routing information circulating in a loop must travel
64 hops before it is discarded. On a large network, this places a disastrous load on the
network bandwidth and leaves few bandwidth resources for ordinary routing data. In addition,
the other NEs need bandwidth to set up new MAC connections and routes. A vicious circle
forms: the routes fail to converge for a long time, and an ECC storm finally occurs. This
frequently happens when a fiber cut occurs, or when the main control board or a line board
is being replaced.
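
To get a feel for why the maximum distance matters, here is a minimal Python sketch of a stale route circulating in a loop. It is not HWECC's actual implementation: the function name is made up, and the model simply assumes that each re-advertisement adds 1 to the distance and that the route is discarded once the distance exceeds the maximum.

```python
def updates_before_discard(start_distance: int, max_distance: int) -> int:
    """Count how often a stale route is re-advertised around a loop
    before its distance exceeds max_distance and it is discarded.

    Simplified model (an assumption, not the HWECC specification):
    every re-advertisement adds 1 to the route distance.
    """
    updates = 0
    distance = start_distance
    while distance <= max_distance:
        distance += 1  # the next NE in the loop re-advertises the route
        updates += 1
    return updates


# Compare the HWECC default (64) with the reduced limit of 5
# that this post suggests during a storm.
for limit in (64, 5):
    print(limit, updates_before_discard(3, limit))
```

In this toy model the number of useless updates grows linearly with the maximum distance, and every stale route on the network pays that cost separately.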

2. On a complex or ring network, ECC storms may occur when fibers are severely
degraded

When the optical line has many bit errors or severe packet loss, the ECC protocol generates
disordered or outdated routes when broadcasting packets. On a ring network, these routes are
sent on cyclically in the other direction and take a long time to disappear. This is why many
routes with very large distances appear. The more bit errors there are, the more disordered
routes are generated, and the ring network amplifies them. As a result, the routes of NEs
switch frequently. The ECC route task has a higher priority than the login task. When routes
on an NE switch frequently, the routing task consumes so much CPU that the lower-priority
login task cannot run for a long time. As a result, the NE is unreachable from the NMS.

3. NE ID conflict

ECC communication between NEs relies on NE IDs for addressing. Therefore, each NE must
have a unique ID. If IDs are duplicated, ECC route calculation fails and the NMS cannot
find the destination when delivering connection requests. As a result, a large number of
connection request packets are generated on the network, and an ECC storm occurs.

4. Communication with a large amount of data

When the configuration data of multiple NEs is uploaded or the version of multiple NEs is
upgraded at the same time, a large amount of data is transmitted on the network, causing
network congestion. In this case, ECC storms may occur.

Of the four causes above, the first and second are the most likely to trigger an ECC storm.
The third and fourth are relatively unlikely to do so.

Solution
The solution consists of two steps:

Step 1: Restore the NMS to manage NEs.

Method 1: Set the maximum distance of ECC.

The maximum distance of ECC is 64 by default. In the actual network, such a large distance is
not required. In addition, this maximum distance affects the search scope of ECC route.

Reducing the maximum distance of ECC narrows the range over which ECC routes are refreshed
on the network, which reduces the probability of an ECC storm. When an ECC storm occurs, set
the maximum distance of ECC to 5. Then, once the network becomes stable, increase the
maximum distance gradually.

Run the following command to set the maximum route distance: :cm-set-maxdist

For example:

:cm-set-maxdist:5      # The maximum route distance cannot be set to 0.
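
If you want to prepare the commands for the gradual increase in advance, a small Python sketch like the one below can generate them. The command syntax is taken from the example above. The step values, the function name, and the assumption that the default of 64 is also the upper limit are mine, not from the product documentation.

```python
def maxdist_commands(steps):
    """Build one ':cm-set-maxdist:<n>' command string per step.

    The step values are chosen by the operator; nothing here is
    prescribed by the post. The 1..64 range is an assumption based on
    "cannot be set to 0" and the default of 64.
    """
    commands = []
    for distance in steps:
        if not 1 <= distance <= 64:
            raise ValueError(f"distance {distance} is outside 1..64")
        commands.append(f":cm-set-maxdist:{distance}")
    return commands


# Hypothetical schedule: 5 during the storm, then widen step by step.
for command in maxdist_commands([5, 10, 20]):
    print(command)
```

The script only prints the command strings. Apply each step manually and wait until the network is stable before moving on to the next one.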

Method 2: Disable the ECC links around the backbone node.

Before disabling ECC links, make sure you are familiar with the fiber connections on the
network. First, break the loops at the access layer, and then isolate certain devices from
the current ECC network. After the ECC network is stable, gradually bring the isolated
devices back.

When disabling ECC on remote optical ports, do not cut the route to the NMS. Make sure that
the NEs on which ECC is disabled can still be accessed from the NMS.

Step 2: Eliminate the root cause of the fault.

1. Troubleshoot bit errors on the line.

Locate the optical port with a large number of bit errors and disable its DCC channel. Then
check the fiber or replace the optical board.

2. Check for NE ID conflicts. 

Change any duplicate NE ID so that the IDs of all NEs managed by the same NMS are unique.
If the NMS manages NEs of the transmission domain, access domain, and IP domain at the same
time, the NE IDs must also be unique across these domains.

3. Suspend the operations that require large amounts of data to be transferred across the DCC
network.
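
The NE ID check in item 2 is easy to script if you have a list of NE names and IDs. Below is a minimal Python sketch. The inventory format (name, ID pairs) and all names in it are hypothetical; export the real list from your NMS in whatever way it supports.

```python
from collections import defaultdict


def find_duplicate_ne_ids(inventory):
    """Return {ne_id: [ne_name, ...]} for every NE ID used more than once.

    `inventory` is a list of (ne_name, ne_id) pairs. This format is an
    assumption made for the sketch.
    """
    names_by_id = defaultdict(list)
    for ne_name, ne_id in inventory:
        names_by_id[ne_id].append(ne_name)
    return {ne_id: names for ne_id, names in names_by_id.items()
            if len(names) > 1}


# Hypothetical inventory spanning several domains.
inventory = [
    ("SiteA-SDH", "9-101"),
    ("SiteB-WDM", "9-102"),
    ("SiteC-Access", "9-101"),  # same ID as SiteA-SDH
]
print(find_duplicate_ne_ids(inventory))
```

Run the check over the NEs of all domains together, since the IDs must be unique across domains as well.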

Preventive measures
In actual application scenarios, there is no way to prevent bit errors from occurring on fibers
or ring networks. However, the following precautions can be taken to reduce the probability
of ECC storms and improve the efficiency of handling ECC storms:

1. Proper subnetting

Each GNE can connect to a maximum of 50 non-GNEs. If there are more than 50 non-GNEs
for one GNE, another GNE must be added. 
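
As a quick planning aid, the rule of 50 non-GNEs per GNE can be turned into a one-line calculation. The Python sketch below uses a made-up function name and only encodes the limit stated above. It says nothing about where the GNEs should be placed.

```python
import math

MAX_NON_GNES_PER_GNE = 50  # limit stated in this post


def gnes_required(non_gne_count: int) -> int:
    """Minimum number of GNEs for the given number of non-GNEs.

    At least one GNE is always assumed, even for a very small network.
    """
    if non_gne_count < 0:
        raise ValueError("non_gne_count must not be negative")
    return max(1, math.ceil(non_gne_count / MAX_NON_GNES_PER_GNE))


# Hypothetical network sizes.
for count in (50, 51, 120):
    print(count, gnes_required(count))
```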

2. Do not use automatic extended ECC.

When three or more NEs are connected at a site, do not use automatic ECC extension. If
devices are connected only through network cables, manual ECC extension is recommended,
because automatically extended ECC forms a very complex ring network.

3. Properly arrange the gateway location.

When an ECC storm occurs, only the gateway can be logged in. Therefore, the location of the
gateway on the network is very important for fault prevention and recovery. When an ECC
storm occurs, you can set the DCC on the gateway or the maximum route distance to cut the
loop. This way, disordered routes caused by bit errors will not flap on the loop. Therefore, on
a ring network, ensure that each ring has a GNE. For example, in the following network, NE
(9-2335) is the most suitable gateway while NE (9-1213) is the least suitable gateway.
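
The rule that each ring should have a GNE can be checked against a planning sheet with a few lines of Python. The sketch below is only an illustration: the ring names, NE IDs, and data layout are hypothetical and are not taken from the network in this post.

```python
def rings_without_gne(rings, gnes):
    """Return the names of rings that contain no gateway NE.

    `rings` maps a ring name to the NE IDs on that ring, and `gnes`
    lists the NE IDs configured as gateways. Both layouts are
    assumptions made for this sketch.
    """
    gne_set = set(gnes)
    return [ring_name for ring_name, members in rings.items()
            if gne_set.isdisjoint(members)]


# Hypothetical plan: two rings, one gateway.
rings = {
    "ring-1": ["9-101", "9-102", "9-103"],
    "ring-2": ["9-201", "9-202"],
}
print(rings_without_gne(rings, gnes=["9-101"]))
```

Any ring reported here has no local point from which the loop can be cut during a storm.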

4. Properly plan the network to avoid duplicate NE IDs.


Thanks!
