Availability and Reachability in eHealth

A discussion of eHealth availability and reachability: concepts, definitions, calculations and approximations

Prepared by:

Dan Seligman Concord Engineering

Copyright © 2004 Concord Communications, Inc. eHealth, the Concord Logo, Live Health, Live Status, SystemEDGE, AdvantEDGE and/or other Concord marks or products referenced herein are either registered trademarks or trademarks of Concord Communications, Inc. Other trademarks are the property of their respective owners.

Table of Contents

I. Introduction

II. Availability

III. Reachability

IV. References

I. Introduction

This paper addresses the related concepts of availability and reachability in eHealth. It describes the underlying ideas, relevant performance variables, and the details of associated calculations and approximations.

Conceptually, availability refers to the ability of a managed eHealth element to perform its assigned functions. Availability is a binary property. At any point in time an element is either available or unavailable: eHealth does not recognize partial availability. In cases where an element might be unable to perform all of its functions without being totally disabled, we adjust our models to resolve it into subcomponents, each of which can be assigned a binary availability state. Where this type of resolution is impossible, eHealth considers the partially disabled element unavailable. This latter case occurs infrequently. Formally, we define availability as the percentage of some time interval during which the element was available, i.e., capable of performing its assigned functions.

Reachability refers to the ability of eHealth to communicate with a polled element. Formally, reachability is the percentage of some time interval during which the eHealth system could communicate with the element.

Availability is an intrinsic property of the element. Reachability, on the other hand, reflects the ability of eHealth to poll the element. In the following sections we discuss, in detail, how availability, reachability and associated performance variables are calculated and used in eHealth.

II. Availability

As described above, the availability of an element is the percentage of time that the element was available over some period, and available time is the actual time, usually denominated in seconds, that the element was available over the same period. Availability is usually calculated over a time period referred to as total time. Mathematically,


availability = 100 * (available time)/(total time)

where total time is the number of seconds since the last good poll. A good poll is an SNMP poll of the associated element that was correctly formatted, finished successfully and detected no counter wraps in any polled variable. Total time includes time gaps in polling due to missed polls, bad polls or events such as system reboots. A missed poll is one that did not result in a response from the element. A bad poll is one that resulted in an erroneous response or detected a counter wrap.
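As a small illustration of the formula above, the following Python sketch computes availability from available time and total time. The function name and the guard against a non-positive total time are assumptions made for the example; they are not taken from eHealth itself.

```python
def availability_pct(available_seconds: float, total_seconds: float) -> float:
    """Availability as a percentage: 100 * (available time) / (total time)."""
    if total_seconds <= 0:
        # Assumption for this sketch: a non-positive total time is an error.
        raise ValueError("total time must be positive")
    return 100.0 * available_seconds / total_seconds


# An element that was available for 270 of the 300 seconds since the last good poll.
print(availability_pct(270, 300))  # 90.0
```

Both times must be expressed in the same unit, here seconds.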

A related concept is delta time, the number of seconds between two successive good polls. Delta time does not embrace time gaps in polling due to events such as system reboots or bad polls, but it does include gaps due to missed polls. Delta time is usually equal to total time; however, if a bad poll or a system reboot occurs, delta time is less than total time.

A system reboot is defined as a reset of the (nearly universal) MIB variable sysUpTime.

Consistent with our definition of available time, unavailable time is the time that the element was unavailable over some period.

In principle an element can exist in one of only two availability states, available and unavailable. However, when we take the perspective of the eHealth server and consider that the information it obtains is, in general, imperfect, we wind up with three perceived availability states: available, unavailable and unknown. For the most part, we consider time when the availability state was unknown as unavailable, avoiding the unknown state and causing our availability calculations to err on the side of unavailability. However, when the time associated with a report is a superset of the time over which we have calculated availability, e.g., where eHealth discovered the device in the middle of the report period, we consider the time over which availability was not calculated as unknown.

There are two additional qualifications to our definitions of availability. When a device reboots, we assume the element state was unavailable for the time between the last good poll and the time the device came back up. In the case of a hiccup in the eHealth server, we assume the element state was available for the time between the last good poll and the restart time, although in the case of a server restart coincident with a device reboot this rule is overridden by the reboot rule.

There are four categories of elements for which we calculate availability:

LAN/WAN interfaces,

Routers, Remote Access Servers, and Applications,

Servers, and

Response Paths.

We treat each case a little differently. In the following sections we address each in turn.


LAN/WAN Interfaces

We define LAN/WAN interface availability as the percentage of time an interface has an operational status that renders it available, i.e., capable of sending and receiving network traffic. We base our definition of availability on two relevant MIB variables in ifTable (or their equivalents):

ifOperStatus, the operational status of the interface, and

ifLastChange, the value of sysUpTime at the time of the last status change.

We currently use the statuses described in RFC 1573 [3]. According to that standard, the variable ifOperStatus can take on five possible values. A sixth value (not in the standard) indicates that no status was returned.

In the discussion below we distinguish between availability states, which we are trying to calculate, and operational statuses, which we obtain by polling and from which we are attempting to determine availability states. We identify those statuses which map to the available state, and those which map to the unavailable state.

We define as available the following statuses:

noSuchName(0): no status returned

up(1): ready to pass packets

dormant(5): waiting for some external event in order to pass packets

and as unavailable the following:

down(2)

testing(3): in some test mode

unknown(4): status cannot be determined for some reason
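The mapping from operational status to availability state listed above can be written as a simple lookup. This is a minimal Python sketch; the names STATUS_TO_STATE and availability_state are hypothetical and chosen only for illustration.

```python
AVAILABLE = "available"
UNAVAILABLE = "unavailable"

# ifOperStatus values and the availability state each one maps to.
# The value 0 is the non-standard "no status returned" case.
STATUS_TO_STATE = {
    0: AVAILABLE,    # noSuchName: no status returned
    1: AVAILABLE,    # up: ready to pass packets
    5: AVAILABLE,    # dormant: waiting for some external event
    2: UNAVAILABLE,  # down
    3: UNAVAILABLE,  # testing: in some test mode
    4: UNAVAILABLE,  # unknown: status cannot be determined
}


def availability_state(if_oper_status: int) -> str:
    """Map a polled ifOperStatus value to an availability state."""
    return STATUS_TO_STATE[if_oper_status]


print(availability_state(5))  # available
print(availability_state(3))  # unavailable
```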

We know there was a status change during a polling interval if ifLastChange is greater than the last poll time (and, trivially, less than the current poll time). This status change may or may not be "operationally significant," i.e., represent a change from the available state to the unavailable state or the reverse.

There are three cases:

1. The status at the time of the current poll is the same as the status at the time of the last poll and ifLastChange is less than the time of the last poll. Here the status has not changed from one poll to the next and the value of the availability state for the current poll is constant (available or unavailable) for the entire poll period.


[Figure: timeline for case 1. Along the time axis, ifLastChange precedes the last poll, which precedes the current poll; status A is observed at both polls.]

2. The status at the time of the current poll is different from the status at the time of the last poll. In this case ifLastChange is greater than the time of the last poll.

[Figure: timeline for case 2. ifLastChange falls between the last poll and the current poll; status A is observed at the last poll and status B at the current poll.]

There are two possibilities here:

(a) If the status at the last poll and the status at the current poll represent a change from available to unavailable or the reverse, we assign the time between the last poll and ifLastChange to the state at the last poll and the time between ifLastChange and the current poll to the state at the current poll. This effectively assumes a single change in status, although in principle there could have been more than one in the time period between the last poll and ifLastChange.

(b) If, on the other hand, the statuses before and after ifLastChange are both among the available statuses above or, alternatively, both statuses are unavailable, there might have been one, two or more status changes between the last poll and ifLastChange. For this period we invoke the environment variable NH_UNKNOWN_AVAIL_PCT, which represents the percentage of an unknown time interval that we assign to the “other” state, i.e., unavailable if the statuses both map to available, or available if they both map to unavailable. The default value of this variable is zero. For the period between ifLastChange and the current poll, we identify the availability state as that indicated by the status of the current poll.


3. The status at the time of the current poll is the same as the status at the time of the last poll but ifLastChange is greater than the time of the last poll. Here there were at least two status changes between the last poll time and ifLastChange (one to leave the current status and one to get back to it), and possibly more. Again, we have no way of knowing the availability of the interface between the time of the last poll and ifLastChange, so we invoke NH_UNKNOWN_AVAIL_PCT, in the same manner as described above. Once again, for the period between ifLastChange and the current poll, we identify the availability state as that indicated by the status of the current poll.

[Figure: timeline for case 3. ifLastChange falls between the last poll and the current poll; status A is observed at both polls.]

In some cases we have operational status values for each poll, but no measurement of ifLastChange. In this case, we assume the availability state for the entire poll period to be the same as the state at the end of the poll period, although we have no way of knowing that this was really the case.

One final consideration is how we assign status in the event of a device reboot during the poll period, where ifLastChange may not, in general, be coincident with the reboot time. In this case, we assume that the state of the interface was unavailable from the time of the last poll to the time of ifLastChange.
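The three cases above, together with the fallback used when ifLastChange is not measured, can be sketched in Python as follows. This is an illustration under stated assumptions, not eHealth's implementation: the function and parameter names are hypothetical, all times are assumed to have been converted to seconds on a common clock, the availability states are assumed to have been derived from ifOperStatus beforehand, and the device reboot rule is not modeled.

```python
from typing import Optional


def interface_available_seconds(
    last_poll: float,
    current_poll: float,
    last_available: bool,
    current_available: bool,
    if_last_change: Optional[float] = None,
    unknown_avail_pct: float = 0.0,  # stands in for NH_UNKNOWN_AVAIL_PCT (default zero)
) -> float:
    """Available time, in seconds, for one poll period of an interface."""
    period = current_poll - last_poll

    # Case 1, or no ifLastChange measurement: the state at the current
    # poll is assumed to hold for the entire poll period.
    if if_last_change is None or if_last_change <= last_poll:
        return period if current_available else 0.0

    before = if_last_change - last_poll    # last poll to ifLastChange
    after = current_poll - if_last_change  # ifLastChange to current poll

    # After ifLastChange, the state is that of the current poll.
    available = after if current_available else 0.0

    if last_available != current_available:
        # Case 2(a): assume a single change; "before" takes the old state.
        available += before if last_available else 0.0
    else:
        # Cases 2(b) and 3: a configurable share of the unknown interval
        # is assigned to the "other" state.
        other = before * unknown_avail_pct / 100.0
        available += (before - other) if current_available else other
    return available


# Case 2(a): available at the last poll, unavailable now, change at t = 100 s.
print(interface_available_seconds(0, 300, True, False, if_last_change=100))  # 100.0
```

With the default of zero, the period before ifLastChange in cases 2(b) and 3 is assigned entirely to the state observed at the polls.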

Routers, Remote Access Servers and Applications

Here we obtain a metric for available time directly from the element, divide by total time, and convert to a percent:

availability = 100 * (available time)/(total time)

This formula works fine when the available time counter survives a period of unavailability, but not as well when it resets. In the latter case, the available time counter is reset when the element comes back up, so the resultant availability is potentially an underestimate. We have no way of knowing how many restarts there were, so we assume there was a single restart and use the same calculation. A special case is device element availabilities that use the MIB variable sysUpTime for available time.

In some cases available time is not provided directly but there is a status variable that identifies the status at any moment and which is obtained by eHealth at the end of each poll period. Each status value can be mapped to an availability state, and can be used in the same way ifOperStatus is used to obtain the availability of interfaces in the case where there is no value for ifLastChange. We make the same approximation here, assuming the availability state for the entire poll period is the same as the availability state at the end of the poll period.

For example, consider the following polled variable in Reference [1]:

ccmStatus OBJECT-TYPE
    SYNTAX INTEGER {
        unknown(1),
        up(2),
        down(3)
    }
    MAX-ACCESS read-only
    STATUS current
    DESCRIPTION
        "The current status of the CallManager. A CallManager is up if the SNMP Agent received a system up event from the local CCM system
        unknown: Current status of the CallManager is Unknown
        up: CallManager is running & is able to communicate with other CallManagers
        down: CallManager is down or the Agent is unable to communicate with the local CallManager."
    ::= { ccmEntry 5 }

This status variable tells us the availability state of the Cisco CallManager at a point in time at the end of each poll period but says nothing about the time between polls. Here we assume that available time is equal to delta time if the Cisco CallManager element is up (available) and zero otherwise. Thus for each poll period availability is either 100% or 0%.
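The status-based approximation just described can be illustrated with a short Python sketch. The function name and the sample poll data are invented for the example; only the ccmStatus value up(2) comes from the MIB excerpt above.

```python
CCM_UP = 2  # ccmStatus value up(2)


def status_available_seconds(ccm_status: int, delta_seconds: float) -> float:
    """Available time for one poll period: delta time if up, zero otherwise."""
    return float(delta_seconds) if ccm_status == CCM_UP else 0.0


# Hypothetical sequence of (ccmStatus, delta time in seconds) pairs.
polls = [(2, 300), (2, 300), (3, 300), (1, 300)]
available = sum(status_available_seconds(s, d) for s, d in polls)
total = sum(d for _, d in polls)
print(100.0 * available / total)  # 50.0
```

Each individual poll period contributes either all or none of its delta time, so intermediate percentages appear only when several poll periods are aggregated.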


Servers

We use the same approach and the same formula for Servers as for Routers, Remote Access Servers and Applications. However, there is an additional wrinkle with servers concerning precisely which variable to use for available time.

Virtually all devices that support an SNMP agent, including servers, support sysUpTime [4]:

sysUpTime OBJECT-TYPE
    SYNTAX TimeTicks
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION
        "The time (in hundredths of a second) since the
        network management portion of the system was last
        re-initialized."
    ::= { system 3 }

For this reason, a reset of this variable effectively defines a device reboot and this variable is used as a default for available time. Since the variable is denominated in centiseconds, we need to divide by 100 to denominate available time in seconds.
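A minimal Python sketch of these two points follows: converting sysUpTime from centiseconds to seconds, and detecting a reboot as a reset of the variable. The function names are hypothetical, and treating any decrease between successive polls as a reset is an assumption of this sketch; it does not distinguish a genuine reset from a wrap of the 32-bit TimeTicks counter.

```python
def ticks_to_seconds(time_ticks: int) -> float:
    """Convert a TimeTicks value (hundredths of a second) to seconds."""
    return time_ticks / 100.0


def rebooted(previous_ticks: int, current_ticks: int) -> bool:
    """Assumption: sysUpTime going backwards between polls means a reset."""
    return current_ticks < previous_ticks


print(ticks_to_seconds(30000))  # 300.0
print(rebooted(500000, 1200))   # True
print(rebooted(1200, 31200))    # False
```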

A close reading of the DESCRIPTION reveals that this variable actually represents the time since the network management portion of the system or server, i.e., the agent, was last initialized. So use of sysUpTime is, in fact, misleading, as the server or system might be hale and whole while only the agent is unavailable.

There is an alternative variable, present in any agent that supports the Host Resources MIB [2], that represents true server uptime:

hrSystemUptime OBJECT-TYPE
    SYNTAX TimeTicks
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION
        "The amount of time since this host was last initialized. Note that this is different from sysUpTime in MIB-II [3] because sysUpTime is the uptime of the network management portion of the system."
    ::= { hrSystem 1 }

When this variable is available, we improve our availability calculations by equating the available time to hrSystemUptime, so that system availability is calculated from the uptime of the host rather than that of the agent. It should be pointed out that while this approach offers more accurate server availability figures, it suffers from an unfortunate side-effect. Interface availability is calculated from the values of ifOperStatus and ifLastChange for each individual interface, as discussed above. ifLastChange is likely to be coincident with the reboot time (reset of sysUpTime) and hence aligned with server availability when the latter is calculated using sysUpTime. When we use hrSystemUptime, this is no longer true, and eHealth might report interfaces going down owing merely to an agent reset while the server remains available.

Response Paths

A Response Path element represents transactions between a response source and a response destination and applies to client-server communication. There are two types of availability associated with a Response Path element: path availability and service availability.

Path availability, or availability without a qualifier, refers to the availability of the Response Path itself. We define path availability as the percentage of attempted transactions during a poll period that were successful. Since, in general, there will be a mix of successful and unsuccessful transactions during a poll period, we divide the poll period between available time and unavailable time in proportion to the number of transactions that succeeded and failed respectively.

If no transactions are attempted during a poll period, path availability is unknown for that poll period and no value is reported.

Service availability attempts to measure, or approximate, not the availability of the Response Path itself but rather the availability of the associated service to the client. We make the approximation that if a single transaction is completed successfully during the poll period, the service is assumed to be available for the entire poll period. If, on the other hand, there are attempted transactions but no completed transactions during the poll period, the service is unavailable for that poll period. For service availability, we make no distinction between one or more transactions succeeding with none failing and a mix of transaction successes and failures. If there are no attempted transactions during the poll period, eHealth does not report any value for service availability.
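The difference between the two Response Path measures can be seen in a short Python sketch. The function names are hypothetical, and returning None to represent "no value reported" is a choice made for the example.

```python
from typing import Optional


def path_availability(successes: int, attempts: int) -> Optional[float]:
    """Percentage of attempted transactions in a poll period that succeeded."""
    if attempts == 0:
        return None  # no transactions attempted: no value is reported
    return 100.0 * successes / attempts


def service_availability(successes: int, attempts: int) -> Optional[float]:
    """100% if any transaction succeeded in the poll period, else 0%."""
    if attempts == 0:
        return None  # no transactions attempted: no value is reported
    return 100.0 if successes > 0 else 0.0


# Three of four attempted transactions succeeded during the poll period.
print(path_availability(3, 4))     # 75.0
print(service_availability(3, 4))  # 100.0
```

For the same poll period, service availability is therefore never lower than path availability.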


Non-Polling Environments

In some cases eHealth is called upon to derive availability from non-polled sources, such as other management stations, where the data is imported into eHealth via protocols other than SNMP. Here the concept of polling, central to many eHealth definitions, does not exist. Under these circumstances, eHealth only reports availability if available time can be obtained directly or calculated from parameters explicitly provided by the data source. If there is no way of calculating availability, eHealth offers a somewhat inconsistent default. In some reports it causes availability to display as 100%; in others as "No Data Available" or "Variable not Supported". It is important to note that even in reports that actually display availability it is not a bona fide measurement: its value will always be 100% and never change.

III. Reachability

eHealth precedes each SNMP poll with a series of pings that determine whether the element is accessible. If the series is unsuccessful, the poll is flagged as a missed poll and the element is considered to be unreachable for the duration of the poll period ending in that poll. If the ping series is successful, the poll is considered a good poll, and the element is considered reachable for the duration of the poll period. Reachable time is the time, generally denominated in seconds, that the eHealth system could communicate with, i.e., ping, the element. An element is considered reachable for the duration of any poll period ending in a successful ping test and unreachable for the duration of any poll period ending in an unsuccessful ping test. Thus for any poll period (delta time), reachability is either 100% or 0%. There are two exceptions to this rule:

If a reboot (reset of sysUpTime) occurred during the poll period, the element is assumed unreachable from the start of the poll period to the reboot time and reachable from the reboot time to the end of the poll period, assuming the reachability test was successful. This is another way of saying that reachability is set equal to availability in the case of a system reboot.

If eHealth is not polling, reachability is also set equal to availability.

Unreachable time is the number of seconds that the eHealth system could not communicate with the element.

It is possible for the user to disable the ping measurements. If this is done, eHealth uses the success or failure of the SNMP request to determine reachability.

Reachability is undefined in non-polling environments.
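The per-poll reachability rule and the reboot exception described in this section can be sketched in Python as follows. The function and parameter names are hypothetical; reboot_offset is assumed to be the number of seconds from the start of the poll period to the reboot time, and the case where eHealth is not polling is not modeled.

```python
from typing import Optional


def reachable_seconds(
    ping_ok: bool,
    delta_seconds: float,
    reboot_offset: Optional[float] = None,
) -> float:
    """Reachable time, in seconds, for one poll period."""
    if not ping_ok:
        return 0.0  # unsuccessful ping series: unreachable for the whole period
    if reboot_offset is not None:
        # Reboot during the period: unreachable up to the reboot time.
        return float(delta_seconds - reboot_offset)
    return float(delta_seconds)


print(reachable_seconds(True, 300))                     # 300.0
print(reachable_seconds(False, 300))                    # 0.0
print(reachable_seconds(True, 300, reboot_offset=120))  # 180.0
```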


IV. References

[1] Cisco Systems, Inc., CISCO-CCM-MIB.my LastUpdated 200012010000Z

[2] P. Grillo, S. Waldbusser, Host Resources MIB, RFC 1514, September 1993

[3] K. McCloghrie, F. Kastenholz, Evolution of the Interfaces Group of MIB-II, RFC 1573, January 1994

[4] K. McCloghrie, M. Rose, Management Information Base for Network Management of TCP/IP-based internets: MIB-II, RFC 1213, March 1991
