Use of Correlation and Automation Engines to Improve Availability in Communications Systems
Hank Rausch
Owner, Hank Rausch Designs
Harpers Ferry, West Virginia, USA
HankRausch@stanfordalumni.org

Abstract

An approach to achieving a high availability communications network by using a
commercially available network management system is shown. It is configured
to sense a failure and automatically bring spare equipment on line. This approach
shows promise for circuit switched networks where vendor-based redundancy
systems are not available or practical.
Keywords: high availability communications, redundancy, correlation,
automation.

1. Introduction
High availability (HA) is a primary design goal in any communications
infrastructure. Definitions of HA vary, but a typical contractual requirement is
“4 nines”, or 99.99%, which corresponds to roughly 52 minutes of downtime per
year. A central feature of high availability designs is the ability to maintain a
signal path in the event of component failure. Often this is done with patented or
proprietary applications [1,2]. This paper presents an extensible and
easy-to-implement approach to an HA design using commercially available software.
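
The downtime budget implied by an availability target is simple arithmetic. The following is a minimal sketch, assuming a 365-day year; the function and constant names are illustrative and do not come from any tool discussed in this paper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Compare a few common contractual targets.
    for label, target in [("3 nines", 0.999), ("4 nines", 0.9999), ("5 nines", 0.99999)]:
        print(f"{label}: {allowed_downtime_minutes(target):.1f} minutes per year")
```

For the “4 nines” target this gives about 52.6 minutes per year, which is the figure quoted above.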

High availability is often achieved by employing redundant components that
automatically come on line when the primary component fails. Examples include
Cisco’s Hot Standby Routing Protocol (HSRP) and any number of vendor-
specific switching elements—Comtech, Newtec, and Avenue for example. A
limitation of these solutions is that they tend to achieve high availability only
within their own vendor’s “ecosystem”. One can buy a redundancy switch to
swap out a modem if it fails, but not if the signal train to the modem fails (which
is going through equipment provided by a different vendor). Management
software usually does a good job of alerting operators that a critical component
has failed, but often manual intervention is needed to put the standby component
online.

This paper documents an experiment with a management tool configured as an
abstraction layer that senses faults and re-routes a circuit switched signal through
terrestrial and satellite equipment. The abstraction layer operates autonomously,
without human interaction. Its advantage over alternative approaches is that it
can be configured to communicate with any arbitrary assemblage of
communication equipment, using that equipment’s native command syntax.
Hence, the availability of a “legacy” communication system (one that previously
had to be re-configured manually in the event of a component failure) can
theoretically be improved.

2. Discussed problem

This approach was originally investigated for a maritime customer operating
Single Channel Per Carrier (SCPC) satellite links. The signal is downlinked at a
teleport and then transmitted to customer premises via a terrestrial network. Fig.
1 shows a simple diagram of the signal train. The terrestrial path has to be
re-routed if a modem fails, as the signal must land at a designated location at the
customer site. The SCPC modems are old and prone to failure, and when a failure
occurs, the terrestrial signal train must be manually reconstituted to the
replacement modem. Operationally, modem failures occur frequently, resulting in
significant downtime: the failure has to be recognized, and then the replacement
modem has to be configured and the terrestrial circuit re-routed, all by hand.

Figure 1: Simple Signal Diagram Showing the Problem Addressed by Correlation
and Automation Engine

SCPC satellite links and circuit switched networks are harder to protect with
redundant equipment than an IP based signal train. IP networks can have
redundant equipment positioned in line in the signal train because they allow for
multiple simultaneous connections. Switching protocols such as Spanning Tree
allow for loop-free multiple switch connections, and routing protocols redirect
traffic in the event of link failure. None of these mechanisms is available to this
maritime customer. The signal must route to a single defined port at the customer
premises. This means that if a component fails, the original circuit has to be torn
down and a new path created to the customer demarc through manual
intervention. The goal is to reduce the time it takes to accomplish this.
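
Why restoration time matters can be seen from the standard steady-state relation, availability = MTBF / (MTBF + MTTR). The sketch below applies it to two restoral times. The failure interval and repair times are hypothetical figures chosen for illustration, not measurements from the customer network.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), with both values in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical numbers: one modem failure per 1000 operating hours.
    mtbf = 1000.0
    manual = steady_state_availability(mtbf, mttr_hours=2.0)           # 2 h manual restoral
    automated = steady_state_availability(mtbf, mttr_hours=1.0 / 60)   # 1 min automated restoral
    print(f"manual restoral:    {manual:.5f}")
    print(f"automated restoral: {automated:.5f}")
```

With the failure rate held fixed, shortening the restoral time is the only lever left, which is why the approach targets it.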

3. Test Setup
A test network simulating this customer’s network was set up as shown in Fig. 2.
The abstraction layer controls the entire terrestrial circuit-switched network and
the modem RF chain. When a modem was put in failure mode, the correlation engine
sensed this, put a new modem online, configured it for operation, and re-routed
the signal through the terrestrial network by issuing commands to the
equipment. This approach shows promise for reducing downtime and increasing
availability in similar communications networks.

A set of Comtech CDM-570L modems was provisioned on the “teleport” side
of the test network. To simulate an RF connection, transmit RF was connected
to receive RF, and the modem was configured so that it would lock onto its own
transmitted signal.

Figure 2. Test Setup Used to Simulate the Maritime Customer Network

The terrestrial network was simulated by Juniper Circuit to Packet (CTP)
devices. These convert a synchronous serial signal from the modem baseband
port to IP packets. The packets travel through an IP network and are then
reconstituted to a serial signal at the customer’s CTP device. Juniper CTPs are
used by the maritime customer that was the focus of this study. However, this
approach should work for any point-to-point communication infrastructure.

The customer end of the network was simulated with a Fireberd serial tester. A
test pattern from the Fireberd traverses the network to the modem and back to the
tester via the modem RF loop. If the modem or terrestrial path fails, the path
must be torn down and rebuilt, because the signal must arrive at a specified
customer demarc (the Fireberd in this case).

In order to accomplish this, the abstraction layer must perform the following
actions:

1. sense the failure
2. issue a command to the failed modem to stop transmitting
3. issue a command to each CTP to disable the existing terrestrial path
4. issue a command to the CTPs to establish a new path to the spare modem
5. configure the spare modem to begin transmitting
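
The five actions above can be sketched as a single ordered routine. This is a minimal illustration only: the `RecordingDevice` class, the `fail_over` function, and the command strings are hypothetical placeholders, not the native syntax of the modems or CTPs, and not how the management system itself is scripted.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class RecordingDevice:
    """Stand-in for a managed element; records commands instead of sending them."""
    name: str
    sent: List[str] = field(default_factory=list)

    def send(self, command: str) -> None:
        # A real driver would use the device's native syntax (e.g. SNMP or CLI).
        self.sent.append(command)


def fail_over(failed_modem: RecordingDevice,
              spare_modem: RecordingDevice,
              ctps: Sequence[RecordingDevice],
              alarm_is_critical: bool) -> bool:
    """Run steps 1-5 in order; return True if a failover was performed."""
    # Step 1: sense the failure (reduced here to a flag supplied by the caller).
    if not alarm_is_critical:
        return False
    # Step 2: stop the failed modem from transmitting.
    failed_modem.send("TX_OFF")  # placeholder command
    # Step 3: disable the existing terrestrial path on every CTP.
    for ctp in ctps:
        ctp.send("DISABLE_PRIMARY_PATH")  # placeholder command
    # Step 4: establish the new path to the spare modem on every CTP.
    for ctp in ctps:
        ctp.send("ENABLE_BACKUP_PATH")  # placeholder command
    # Step 5: bring the spare modem on the air.
    spare_modem.send("TX_ON")  # placeholder command
    return True
```

Keeping the sequence in one routine makes the ordering explicit: the old path is fully disabled on all CTPs before the new one is enabled, so the customer demarc never sees two sources at once.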

Skyline's DataMiner was chosen as the abstraction layer. This is a network
management system that, in addition to standard monitoring and alarm functions,
provides a well-developed control, automation, and correlation engine. The
control layer uses the native syntax of the controlled device, e.g. SNMP or
command line. It presents these controls in a GUI and makes them available for
both manual and automatic configuration. Command sets can be grouped and
called up by the correlation layer. The scripting of correlation and automation
sequences is straightforward: essentially, if one can diagram the sequence of
actions in a flowchart, one can create a script.

The following figures show how this is done.

Fig. 3 shows the beginning of the correlation script. The script is activated when
any critical alarm appears on the modem. The definition of what constitutes a
critical alarm can be specially tailored so the script is only activated when
desired.
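
The idea of tailoring which alarms start the script can be illustrated with a small filter. This is a sketch under stated assumptions: the `Alarm` and `TriggerRule` types and the parameter names are hypothetical and do not represent the management system's own data model or API.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record: which element, which parameter, how severe."""
    element: str
    parameter: str
    severity: str


@dataclass(frozen=True)
class TriggerRule:
    """Fire only for the given element and severity, skipping ignored parameters."""
    element: str
    severity: str = "critical"
    ignored_parameters: FrozenSet[str] = frozenset()

    def matches(self, alarm: Alarm) -> bool:
        return (
            alarm.element == self.element
            and alarm.severity == self.severity
            and alarm.parameter not in self.ignored_parameters
        )


if __name__ == "__main__":
    # Hypothetical rule: any critical alarm on modem1 except a fan alarm.
    rule = TriggerRule(element="modem1", ignored_parameters=frozenset({"fan_speed"}))
    print(rule.matches(Alarm("modem1", "tx_scrambler", "critical")))
    print(rule.matches(Alarm("modem1", "fan_speed", "critical")))
```

Narrowing the rule in this way keeps alarms that do not affect the signal path from starting a failover.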

Figure 3: The correlation script is initiated when a critical alarm occurs on
modem1

Figure 4: Restoral Actions Triggered by the Correlation Script

The Correlation script executes pre-defined automation scripts that are created
in the Automation portion of DataMiner. Fig. 4 shows these scripts. In
sequence, they (1) re-route the path, (2) turn the failed modem off, and (3) turn
the spare modem on.

Fig. 5 shows the first of these automation scripts, which deactivates the primary
terrestrial path and activates the backup path.

Figure 5: Automation script which deactivates the primary terrestrial path and
activates the secondary

The two other scripts, which turn off the primary modem transmitter and turn on
the backup modem transmitter, are not shown here in the interest of space.

4. Results

The test setup was configured as described above. A modem fault was simulated
on modem 1 by turning the transmit scrambler off, which generated a critical
alarm. The script worked as configured when the alarm was generated. The
terrestrial path was re-routed, modem 1 stopped transmitting, and modem 2
started transmitting.

5. Advantages and Disadvantages


This approach provides a simple and easy method to introduce automatic failover
for a communications path composed of disparate equipment. In addition to the
obvious advantage of quick reconstitution, it presents a subtler one:
occasionally, network operators fail to follow approved protocol in a failure
condition. For example, for the maritime customer presented here, standing
protocol was first to restore operation via the quickest means possible, and only
then to repair the condition that caused the outage. Often, “human nature” got in
the way, and operators for this network would waste time trying to
troubleshoot the outage instead of immediately restoring the service using
alternative equipment. Automating the recovery actions therefore has a way of
enforcing discipline in operations.

The chief limitation of this approach is that it requires a spare terrestrial
path and spare modem for each active circuit. The reason is that the
correlation and automation script, as written, is not “smart” enough to select a
single spare modem and configure it for any generic circuit. In this test, the spare
modem's demodulation and modulation settings were pre-configured so that all
that was required was to turn transmit to “on”. The general case of completely
configuring a modem from scratch would require more complex coding and was
not attempted for this test. However, in this case the simple solution works well,
as the maritime customer has only 50 circuits active at one time and has
sufficient spare modems to configure a 1:1 redundancy.

6. Conclusions and Way Ahead

A simple and easy method of automatically restoring a communications path was
demonstrated. It is useful in cases that meet the following conditions:

1. The signal path must route to a specific customer interface, and multiple
simultaneous paths are not possible.
2. The signal train components cannot be configured to fail over to an
alternate path by themselves.
3. There are sufficient spares to support a 1:1 sparing ratio.

Two future investigations suggest themselves:

1. Extend the correlation and automation script so that 1:N sparing can be
accommodated—i.e., make the script complex enough to configure an
element from scratch and put it in the signal train.
2. Operationally test this approach in a live network, to determine if it has
a measurable effect on circuit availability.
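
As a starting point for the first investigation, the bookkeeping behind 1:N sparing might look like the sketch below. It was not implemented or tested in this work. The `SparePool` class, the circuit names, and the profile fields are hypothetical; a real script would also have to push the returned settings to the spare using its native command syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SparePool:
    """Shared pool of spare modems plus the stored settings of each circuit."""
    free_spares: List[str]
    circuit_profiles: Dict[str, Dict[str, str]]
    assignments: Dict[str, str] = field(default_factory=dict)

    def assign(self, circuit: str) -> Optional[Dict[str, str]]:
        """Reserve a spare for a failed circuit and return the settings to load.

        Returns None if the circuit is unknown, already protected, or no
        spare is left in the pool.
        """
        profile = self.circuit_profiles.get(circuit)
        if profile is None or circuit in self.assignments or not self.free_spares:
            return None
        spare = self.free_spares.pop(0)
        self.assignments[circuit] = spare
        # The caller would configure `spare` from scratch with these settings.
        return {"modem": spare, **profile}


if __name__ == "__main__":
    # Hypothetical pool: one spare shared by two circuits.
    pool = SparePool(
        free_spares=["spare-1"],
        circuit_profiles={
            "circuit-A": {"profile": "settings-A"},
            "circuit-B": {"profile": "settings-B"},
        },
    )
    print(pool.assign("circuit-A"))
    print(pool.assign("circuit-B"))
```

The second call returns None because the single spare is already in use, which is the trade-off 1:N sparing accepts in exchange for fewer idle modems.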

References

[1] Espy, J. et al., United States Patent Number 6,128,750. FAIL-OVER
SWITCHING SYSTEM, October 3, 2000.
[2] Chow, P. et al., United States Patent Number 6,687,217. METHOD OF AND
SYSTEM FOR ONE PLUS ONE PROTECTION FOR RADIO
EQUIPMENT, February 3, 2004.
