
AVAILABILITY MEASUREMENT

SESSION NMS-2201

NMS-2201
9627_05_2004_c2

2004 Cisco Systems, Inc. All rights reserved.

Agenda
Introduction
Availability Measurement Methodologies
Trouble Ticketing
Device Reachability: ICMP (Ping), SA Agent, COOL
SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent
Application

Developing an Availability Culture


Associated Sessions
NMS-1N01: Intro to Network Management
NMS-1N02: Intro to SNMP and MIBs
NMS-1N04: Intro to Service Assurance Agent
NMS-1N41: Introduction to Performance Management
NMS-2042: Performance Measurement with Cisco IOS
ACC-2010: Deploying Mobility in HA Wireless LANs
NMS-2202: How Cisco Achieved HA in Its LAN
RST-2514: HA in Campus Network Deployments
NMS-4043: Advanced Service Assurance Agent
RST-4312: High Availability in Routing

INTRODUCTION
WHY MEASURE AVAILABILITY?


Why Measure Availability?


1. Baseline the network
2. Identify areas for network improvement
3. Measure the impact of improvement projects


Why Should We Care About Network Availability?
Where are we now? (baseline)
Where are we going? (business objectives)
How best do we get from where we are now to where we are going? (improvements)
What if we can't get there from here?


Why Should We Care About Network Availability?
Recent studies by Sage Research determined that US-based service providers encountered:
- 44% of downtime is unscheduled
- 18% of customers experience over 100 hours of unscheduled downtime, an availability of 98.5%
- Average cost of network downtime per year: $21.6 million, or $2,169 per minute!

Downtime costs too much!

SOURCE: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001; prepared for Cisco SPLOB

Cause of Network Outages

- User error and process (40%): change management, process consistency
- Technology (20%): hardware, links, design, environmental issues, natural disasters
- Software and application (40%): software issues, performance and load, scaling

Source: Gartner Group
Top Three Causes of Network Outages

- Congestive degradation
- Network design
- Capacity (unanticipated peaks)
- WAN failure (e.g., major fiber cut or carrier failure)
- Solutions validation
- Power
- Software quality
- Critical services failure (e.g., DNS/DHCP)
- Inadvertent configuration change
- Change management
- Protocol implementations and misbehavior
- Hardware fault


Method for Attaining a Highly Available Network
Or: A Road to Five Nines
- Establish a standard measurement method
- Define business goals as related to metrics
- Categorize failures, root causes, and improvements
- Take action for root-cause resolution and improvement implementation


Where Are We Going?
Or: What Are Your Business Goals?
- Financial: ROI, Economic Value Added, revenue per employee
- Productivity: time to market, organizational mission
- Customer perspective: satisfaction, retention, market share

Define your end state: what is your goal?

Why Availability for Business Requirements?
- Availability as a basis for productivity data
  Measurement of total-factor productivity
  Benchmarking the organization
  Overall organizational performance metric
- Availability as a basis for organizational competency
  Availability as a core competency
  Availability improvement as an innovation metric
- Resource allocation information
  Identify defects
  Identify root cause
  Measure MTTR (tied to process)

It Takes a Design Effort to Achieve HA

- Hardware and software design
- Process design
- Network and physical plant design

INTRODUCTION
WHAT IS NETWORK
AVAILABILITY?


What Is High Availability?

High availability means an average end user will experience less than five minutes of downtime per year.

Availability    Downtime per Year (24x7x365)
99.000%         3 days 15 hours 36 minutes
99.500%         1 day 19 hours 48 minutes
99.900%         8 hours 46 minutes
99.950%         4 hours 23 minutes
99.990%         53 minutes
99.999%         5 minutes
99.9999%        30 seconds
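The downtime column follows directly from the availability figure; as a quick illustrative sketch (Python, not part of the original deck):

```python
# Sketch: reproducing the downtime-per-year table from the availability figure.

def downtime_per_year_minutes(availability_pct: float) -> float:
    """Minutes of downtime in a 24x7x365 year at the given availability."""
    return (1 - availability_pct / 100) * 365 * 24 * 60

for a in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999, 99.9999):
    print(f"{a}% -> {downtime_per_year_minutes(a):.1f} min/yr")
```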

Availability Definition
The availability definition is based on business objectives:
- Is it the user experience you are interested in measuring?
- Are some users more important than others?
- Availability groups? Definitions of the different groups
- Exceptions to the availability definition (e.g., the CEO should never experience a network problem)


How You Define Availability

- Define availability perspective (customer, business, etc.)
- Define availability groups and levels of redundancy
- Define an outage
  Define impact to the network
  Ensure SLAs are compatible with the outage definition
  Understand how maintenance windows affect the outage definition
  Identify how to handle DNS and DHCP within the definition of a Layer 3 outage
  Examine component-level sparing strategy
- Define what to measure
- Define measurement accuracy requirements

Network Design
What Is Reliability?
Reliability is often used as a general term that refers to the quality of a product:
- Failure rate
- MTBF (Mean Time Between Failures) or MTTF (Mean Time To Failure)
- Engineered availability

Reliability is defined as the probability of survival (no failure) for a stated length of time.


MTBF Defined
- MTBF stands for Mean Time Between Failures; MTTF stands for Mean Time To Failure
- This is the average length of time between failures (MTBF) or to a failure (MTTF)
- More technically, it is the mean time to go from an OPERATIONAL STATE to a NON-OPERATIONAL STATE
- MTBF is usually used for repairable systems; MTTF is used for non-repairable systems
- MTTR stands for Mean Time To Repair


One Method of Calculating Availability

Availability = MTBF / (MTBF + MTTR)

What is the availability of a computer with MTBF = 10,000 hrs and MTTR = 12 hrs?
A = 10,000 / (10,000 + 12) = 99.88%

Annual uptime:
8,760 hrs/year x 0.9988 = 8,749.5 hrs

Conversely, annual downtime is:
8,760 hrs/year x (1 - 0.9988) = 10.5 hrs
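The same arithmetic as a small Python sketch (illustrative, not from the deck):

```python
# Sketch: the MTBF/MTTR availability formula above, applied to the
# slide's example (MTBF = 10,000 hrs, MTTR = 12 hrs).

HOURS_PER_YEAR = 24 * 365  # 8,760

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_hours(avail: float) -> float:
    """Hours of downtime per year at a given steady-state availability."""
    return HOURS_PER_YEAR * (1.0 - avail)

a = availability(10_000, 12)
print(f"{a:.4%}")                                # ~99.88%
print(f"{annual_downtime_hours(a):.1f} hrs/yr")  # ~10.5
```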

Networks Consist of Series-Parallel Combinations

Combinations of in-series and redundant components.

[RBD: component A in series with a 1-of-2 parallel pair (B1, B2), in series with a 2-of-3 group (D1, D2, D3)]
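One way to evaluate such an RBD numerically; a hedged sketch in Python, where the per-component availabilities (0.999 for A, 0.99 for the B and D units) are assumptions for illustration only:

```python
# Sketch: availability of series / parallel / k-of-n blocks, mirroring the
# RBD above (A in series with a 1-of-2 pair and a 2-of-3 group).
from itertools import combinations

def series(*avails):
    # In-series components: all must be up
    p = 1.0
    for a in avails:
        p *= a
    return p

def parallel(*avails):
    # 1-of-n active parallel: fails only if every unit fails
    q = 1.0
    for a in avails:
        q *= (1.0 - a)
    return 1.0 - q

def k_of_n(k, avails):
    # Probability that at least k of the n units are up
    n = len(avails)
    total = 0.0
    for up in range(k, n + 1):
        for combo in combinations(range(n), up):
            p = 1.0
            for i in range(n):
                p *= avails[i] if i in combo else (1.0 - avails[i])
            total += p
    return total

a_sys = series(0.999,                          # A (assumed availability)
               parallel(0.99, 0.99),           # B1/B2, 1-of-2
               k_of_n(2, [0.99, 0.99, 0.99]))  # D1..D3, 2-of-3
print(round(a_sys, 6))
```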


More Complex Redundancy

- Pure active parallel: all components are on
- Standby redundant: backup components are not operating
- Perfect switching: switch-over is immediate and without fail
- Switch-over reliability: the probability of switch-over when it is not perfect
- Load sharing: all units are on and the workload is distributed

MEASURING THE
PRODUCTION NETWORK


Reliability (Engineered Availability) vs. Measured Availability
The calculations are similar: both are based on MTBF and MTTR.
1. Reliability is an engineered probability of the network being available
2. Measured availability is the actual outcome produced by physically measuring the engineered system over time


Availability Choice Based on Business Goals
- Passive availability measurement: without sending additional traffic on the production network, using data from problem management, fault management, or another system
- Active availability measurement: with traffic sent specifically for availability measurement, using ICMP echo, SNMP, SA Agent, etc. to generate data


Types of Availability
Device/interface
Path
Users
Application


Some Types of Availability Metrics

- Mean Time To Repair (MTTR)
- Impacted User Minutes (IUM)
- Defects Per Million (DPM)
- Mean Time Between Failures (MTBF)
- Performance (e.g., latency, drops)


Back to: How Is Availability Calculated?

Availability (%) is calculated by tabulating end-user outage time, typically on a monthly basis.
Some customers prefer to use DPM (Defects Per Million) to represent network availability.

Availability (%) = (Total User Time - Total User Outage Time) / Total User Time x 100
DPM = Total User Outage Time / Total User Time x 10^6

Total User Time = Total # of End Users x Time in Reporting Period
Total User Outage Time = sum of (# of End Users Impacted x Outage Time) over all incidents in the reporting period
Ports or connections may be substituted for end users.
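The two formulas above can be sketched in Python; the incident list below is hypothetical:

```python
# Sketch: availability (%) and DPM from per-incident user impact,
# using the formulas above. Incident data is hypothetical.

def user_outage_minutes(incidents):
    """incidents: list of (users_impacted, outage_minutes) tuples."""
    return sum(users * minutes for users, minutes in incidents)

def availability_pct(total_users, period_minutes, incidents):
    total = total_users * period_minutes
    return (total - user_outage_minutes(incidents)) / total * 100

def dpm(total_users, period_minutes, incidents):
    total = total_users * period_minutes
    return user_outage_minutes(incidents) / total * 1_000_000

month = 30 * 24 * 60                 # minutes in a 30-day reporting period
incidents = [(50, 120), (10, 45)]    # hypothetical: (users down, minutes down)
print(availability_pct(1000, month, incidents))
print(dpm(1000, month, incidents))
```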


Defects Per Million

DPM started with mass-produced items like toasters.

For PVCs:
DPM = sum(#conns x outage minutes) / sum(#conns x total minutes) x 10^6

For SVCs or phone calls:
DPM = (#existing calls lost + #new calls blocked) / total calls attempted x 10^6

For connectionless traffic (application dependent):
DPM = sum(#end users x outage minutes) / sum(#end users x total minutes) x 10^6

NETWORK AVAILABILITY
COLLECTION METHODS
TROUBLE TICKETING METHODS


Availability Improvement Process

Step I
- Validate data collection/calculation methodology
- Establish a network availability baseline
- Set high availability goals

Step II
- Measure uptime on an ongoing basis
- Track defects per million (DPM), IUM, or availability (%)

Step III
- Track customer impact for each ticket/MTTR
- Categorize DPM by reason code and begin trending
- Identify initiatives/areas of focus to eliminate defects

Data Collection/Analysis Process

Understand the current data collection methodology:
- Customer internal ticket database
- Manual

Monthly collection of network performance data; export the following fields to a spreadsheet or database system:
- Outage start time (date/time)
- Service restore time (date/time)
- Problem description
- Root cause
- Resolution
- Number of customers impacted
- Equipment model
- Component/part
- Planned maintenance activity/unplanned activity
- Total customers/ports on network
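One row of that monthly export might be modeled as below; the field names and sample values are assumptions mirroring the bullet list, not part of the deck:

```python
# Sketch: one record of the monthly outage export described above.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class OutageTicket:
    outage_start: datetime
    service_restored: datetime
    description: str
    root_cause: str
    resolution: str
    customers_impacted: int
    equipment_model: str
    component: str
    planned: bool  # planned maintenance vs. unplanned activity

    @property
    def duration_minutes(self) -> float:
        return (self.service_restored - self.outage_start).total_seconds() / 60

# Hypothetical ticket: 90-minute unplanned outage
t = OutageTicket(datetime(2004, 5, 1, 2, 0), datetime(2004, 5, 1, 3, 30),
                 "Core link flap", "Fiber cut", "Splice repaired",
                 200, "GSR 12000", "OC-48 line card", planned=False)
print(t.duration_minutes)  # 90.0
```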

Network Availability Results

Methodology and assumptions must be documented.
Network availability reporting should include:
- Overall % network availability (baseline/trending)
- Conversion of downtime to DPM by: planned vs. unplanned, root cause, resolution, equipment type
- Overall MTTR
- MTTR by: root cause, resolution, equipment type

Results are not necessarily limited to the above, but should be customized based on your network and requirements.

Availability Metrics: Reviewed

- Network has 100 customers
- Time in reporting period is one year, or 24 hours x 365 days
- 8 customers have 24 hours of downtime per year

DPM = (8 x 24) / (100 x 24 x 365) x 10^6 = 219.2 failures for every 1 million user-hours

Availability = 1 - (8 x 24) / (100 x 24 x 365) = 0.999781

MTBF = (100 x 24 x 365 - 8 x 24) / 8 = 109,476 user-hours of uptime per failure

MTTR = (8 x 24) / 8 = 24 hours

(Check: MTBF / (MTBF + MTTR) = 109,476 / 109,500 = 0.999781, matching the availability above.)
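A quick recomputation of this example in Python (illustrative), keeping DPM, availability, MTBF, and MTTR mutually consistent:

```python
# Sketch: the reviewed example (100 users; 8 users each down 24 hours
# over a 365-day year), with all four metrics derived from one data set.

users, failures = 100, 8
hours_down_each = 24
period_hours = 24 * 365

total_user_hours = users * period_hours            # 876,000
outage_user_hours = failures * hours_down_each     # 192

dpm = outage_user_hours / total_user_hours * 1_000_000
availability = 1 - outage_user_hours / total_user_hours
mtbf = (total_user_hours - outage_user_hours) / failures  # uptime per failure
mttr = outage_user_hours / failures                       # hours per failure

print(round(dpm, 1), round(availability, 6), mtbf, mttr)
# Consistency check: A == MTBF / (MTBF + MTTR)
assert abs(availability - mtbf / (mtbf + mttr)) < 1e-12
```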


TROUBLE TICKETING METHOD


SAMPLE OUTPUT


Overall Network Availability (Planned/Unplanned)

[Chart: network availability by month, July through June; y-axis 99.50% to 100.00%; data illustrative]

Key takeaways


Platform-Related DPM Comparison

- Platform-related DPM contributed 13% of total DPM in September
- Platform DPM includes events from: Backbone, NAS, PG, POP, Radius Server, VPN Radius Server
- All other events are included in the Other category
- Network Access Server (NAS) accounts for 50% of the total platform-related DPM in September
- Private Access Gateway (PG) shows a significant decrease over the past 3 months

[Chart: DPM by month, June through December, against a 99.99% target of 100 DPM; data illustrative]

                        June    July    Aug     Sept
Other                   339.5   424.9   394.7   362.2
Platform Related         49.2    82.5   104      52.6
Total DPM               388.7   507.4   498.7   414.8
99.99% Target           100     100     100     100

Breakdown of Platform-Related DPM (June through September):
- Backbone: 1.5, .8, 15.7, 2.3
- NAS: 21.7, 19.4, 27, 26.1
- PG: 26, 59.6, 56.8, 18.9
- POP: 3.9, .5, 1.6
- Radius Server: 1.2, .3
- VPN Radius: 8.8, 2.8, 3.4
- Total platform related: 49.2, 82.5, 104, 52.6

DPM by Cause

[Chart and table: DPM by cause, December through May; data illustrative. Causes tracked: Unknown, Human Error, Environmental, Power, Other, HW, Config/SW. December values include Human Error 18.2, Environmental 36.1, and Power 566.1; monthly totals: Dec 3789.3, Jan 1202.2, Feb 1226, Mar 1293.1, Apr 1641.9, May 1964.8]

MTTR Analysis: Hardware Faults (Router HW)

Produce for each fault type:
- Number of faults increased slightly in September; however, MTTR decreased. 49% of faults were resolved in < 1 hour in September
- 11% of faults were resolved in > 24 hours, with an additional 3% > 100 hours

[Charts: monthly MTTR in hours, June through December (sampled values: 15.1, 12.42, 8.49, 7.19), and fault counts bucketed by time to repair (<1 hr, 1-4 hr, 4-12 hr, 12-24 hr, >24 hr, >100 hr); data illustrative]

Unplanned DPM

          Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Other      70   100    35    79    80    80   165   110    40    10
Process    90    80    55   100   100    90   210   180    75    10
HW         90   200    80   104   180   115   385   325   245   110
SW         60   140    50    67    80    65   200   145   100    40    10
TOTAL     310   520   220   350   440   350   960   760   460   170    40

(data illustrative)

Key takeaways

Action plans: identify areas of focus to enable reduction of DPM and achieve the network availability goal


Trouble Ticketing Method

Pros
- Easy to get started
- No network overhead
- Outages can be categorized based on event

Cons
- Some internal subjectivity/consistency process issues
- Outages may occur that are not included in the trouble ticketing system
- Resources needed to scrub data and create reports
- May not work with the existing trouble ticketing system/process

Network Availability Collection Methods

AUTOMATED FAULT
MANAGEMENT EVENTS METHOD


Availability Improvement Process


Step I
Determine availability goals
Validate fault management data collection
Determine a calculation methodology
Build software package to use customer event log

Step II
Establish network availability baseline
Measure uptime on an ongoing basis

Step III
Track root cause and customer impact
Begin trending of availability issues
Identify initiatives and areas of focus
to eliminate defects

Event Log
Analysis of events
received from the
network devices
Analysis of accuracy
of the data


Event Log Example

Fri Jun 15 11:05:31 2001 Debug: Looking for message header ...
Fri Jun 15 11:05:33 2001 Debug: Message header is okay
Fri Jun 15 11:05:33 2001 Debug: $(LDT)      -> "06152001110532"
Fri Jun 15 11:05:33 2001 Debug: $(MesgID)   -> "100013"
Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"
Fri Jun 15 11:05:33 2001 Debug: $(IPAddr)   -> "10.25.0.235"
Fri Jun 15 11:05:33 2001 Debug: $(ROCom)    -> "xlr8ed!"
Fri Jun 15 11:05:33 2001 Debug: $(RWCom)    -> "s39o!d%"
Fri Jun 15 11:05:33 2001 Debug: $(NPG)      -> "CISCO-Large-special"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmDN)   -> "aSnmpStatus"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmProp) -> "system"
Fri Jun 15 11:05:33 2001 Debug: $(OSN)      -> "Testing"
Fri Jun 15 11:05:33 2001 Debug: $(OSS)      -> "Normal"
Fri Jun 15 11:05:33 2001 Debug: $(DSN)      -> "SNMP_Down"
Fri Jun 15 11:05:33 2001 Debug: $(DSS)      -> "Agent_Down"
Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"
Fri Jun 15 11:05:33 2001 Debug: $(BON)      -> "nl-ping"
Fri Jun 15 11:05:33 2001 Debug: $(TrapGN)   -> "-2"
Fri Jun 15 11:05:33 2001 Debug: $(TrapSN)   -> "-2"
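The $(Variable) -> "value" pairs in a log like this can be extracted mechanically; a minimal parser sketch (Python, not part of the deck):

```python
# Sketch: pulling the $(...) variables out of an event log like the one above.
import re

VAR = re.compile(r'\$\((?P<name>\w+)\)\s*->\s*"(?P<value>[^"]*)"')

def parse_event_vars(log_text: str) -> dict:
    """Map each $(Var) name to its quoted value."""
    return {m.group("name"): m.group("value")
            for m in VAR.finditer(log_text)}

sample = '''
Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"
Fri Jun 15 11:05:33 2001 Debug: $(IPAddr)   -> "10.25.0.235"
Fri Jun 15 11:05:33 2001 Debug: $(DSN)      -> "SNMP_Down"
'''
fields = parse_event_vars(sample)
print(fields["NodeName"], fields["DSN"])  # ixc00asm SNMP_Down
```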


Calculation Methodology: Example

- Primary events are device down/up
- Downtime is calculated based on device-type outage duration
- Availability is calculated based on the total number of device types, the total time, and the total downtime
- MTTR numbers are calculated from the average duration of downtime
- Along with MTTR, the shortest and longest outages provide a simplified curve

Automated Fault Management Methodology

Pros
- Outage duration and scope can be fairly accurate
- Can be implemented within an NMS fault management system
- No additional network overhead

Cons
- Requires an excellent change management/provisioning process
- Requires an efficient and effective fault management system
- Requires custom development
- Does not account for routing problems
- Not a true end-to-end measure

NETWORK AVAILABILITY
DATA COLLECTION
SAMPLE OUTPUT


Automated Fault Management: Example Reports

Device    # of     Count of   Total Down Time  % Down   % Up      Shortest  Mean Time  Longest  Events per
Type      Devices  Incidents  (hhh:mm:ss)                         Outage    to Repair  Outage   Device
Host      2389      801       202:27:27        .0673%   99.9327%  0:00:19   0:20:47    7:48:46  24.42
Network   4732     1673       430:02:03        .1309%   99.8691%  0:00:24   0:22:36    9:49:35  14.90
Other      897      173       212:29:46        .0509%   99.9491%  0:00:17   0:26:07    2:16:10  16.84
GRAND
TOTAL     8018     2647       844:59:16        .0830%   99.9170%  0:00:20   0:23:10    6:38:11  18.72


Automated Fault Management: Example Reports (2)

[Pie charts: Host Totals / Network Totals / Other Totals]
- Number of Managed Devices: Host 30%, Network 59%, Other 11%
- Count of Incidents: Host 30%, Network 63%, Other 7%
- Total Down Time: Host 24%, Network 51%, Other 25%

Network Availability Collection Methods

ICMP ECHO (PING) AND SNMP AS
DATA GATHERING TECHNIQUES


Data Gathering Techniques


ICMP ping
Link and device polling (SNMP)
Embedded RMON
Embedded event management
Syslog messages
COOL


Data Gathering Techniques: ICMP Reachability

Method definition:
A central workstation or computer is configured to send ping packets to the network edges (devices or ports) to determine reachability

How:
Edge interfaces and/or devices are defined and pinged on a determined interval

Unavailability:
Pre-defined; non-response from the interface
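A minimal sketch of the polling loop described above; the probe is injected so the bookkeeping is testable (a real implementation would send an ICMP echo, e.g. via a raw socket or the system ping), and the device names are hypothetical:

```python
# Sketch: edge-reachability polling and the resulting availability figure.

def poll_once(devices, probe):
    """probe(host) -> True if the host answered; returns (up, down) lists."""
    up, down = [], []
    for host in devices:
        (up if probe(host) else down).append(host)
    return up, down

def availability_from_polls(responses: int, total_polls: int) -> float:
    """Fraction of polls answered over the measurement window."""
    return responses / total_polls

edges = ["edge-rtr-1", "edge-rtr-2", "edge-sw-3"]   # hypothetical inventory
fake_probe = lambda host: host != "edge-sw-3"       # simulate one dead edge
up, down = poll_once(edges, fake_probe)
print(up, down)
```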


Availability Measurement Through ICMP

[Diagram: periodic ICMP tests; periodic pings to network devices and periodic pings to network leaf nodes]

Data Gathering Techniques: ICMP Reachability

Pros
- Fairly accurate network availability
- Accounts for routing problems
- Can be implemented with fairly low network overhead

Cons
- Point-to-multipoint implies not a true end-to-end measure
- Availability granularity is limited by ping frequency
- Maintenance of the device database; must have a solid change management and provisioning process

Data Gathering Techniques: Link and Device Status

Method definition:
SNMP polling and trapping on links, edge ports, or edge devices

How:
An agent is configured to SNMP poll and tabulate outage times for defined devices or links; a database maintains outage times and total service time; sometimes trap information is used to augment this method by providing more accurate information on outages

Unavailability:
Pre-defined, non-redundant links, ports, or devices that are down

Polling Interval vs. Sample Size

Polling interval is the rate at which data is collected from the network:

Polling interval = 1 / Sampling rate

The smaller the polling interval, the more detailed (granular) the data collected.
Example: polling data once every 15 minutes provides 4 times the detail (granularity) of polling once an hour.

A smaller polling interval does not necessarily provide a better margin of error.
Example: polling once every 15 minutes for one hour has the same margin of error as polling once an hour for 4 hours.
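Both points above can be sketched in a few lines (illustrative, not from the deck):

```python
# Sketch: polling interval determines granularity; margin of error
# depends only on the number of samples collected.

def samples_collected(window_minutes: float, polling_interval_minutes: float) -> int:
    return int(window_minutes // polling_interval_minutes)

# 15-minute polling gives 4x the granularity of hourly polling over one hour:
print(samples_collected(60, 15), samples_collected(60, 60))

# ...but 15-minute polling for one hour yields the same sample count
# (hence the same margin of error) as hourly polling for four hours:
assert samples_collected(60, 15) == samples_collected(4 * 60, 60)
```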


Link and Device Status Method

Method definition:
SNMP polling and trapping on links, edge ports, or edge devices

How:
- Utilize existing NMS systems that are currently SNMP polling to tabulate outage times for defined devices or links
- A database maintains outage times and total service time
- SNMP trap information is also used to augment this method by providing more accurate information on outages


Link and Device Status Method

Pros
- Outage duration and scope can be fairly accurate
- Utilizes existing NMS systems
- Low network overhead

Cons
- No canned software to do this; requires custom development
- Maintaining the element/device database is challenging
- Requires an excellent change management and provisioning process
- Does not account for routing problems
- Not a true end-to-end measure

CISCO SERVICE ASSURANCE AGENT (SA AGENT)


Service Assurance Agent

Method definition:
SA Agent is an embedded feature of Cisco IOS Software and requires configuration of the feature on routers within the customer network; use of the SA Agent can provide for a rapid, cost-effective deployment without additional hardware probes

How:
A data collector creates SA Agent operations on the routers to monitor certain network/service performance; the data collector then collects this data from the routers, aggregates it, and makes it available

Unavailability:
Pre-defined paths, with reporting on non-redundant links, ports, or devices that are down within a path

Case Study: Financial Institution (Collection)

[Diagram: SA Agent collectors measuring paths to Internet web sites, DNS, and remote sites]


Availability Using Network-Based Probes

The DPM equations are used with network-based probes as input data.
Probes can be a simple ICMP ping probe, a modified ping to test specific applications, or Cisco IOS SA Agent.

DPM will be for connectivity between two points on the network: the source and destination of the probe.
The source of the probe is usually a management system, and the destinations are the devices managed; DPM can be calculated for every device managed.

DPM = Probes with No Response / Total Probes Sent x 10^6
Availability = 1 - Probes with No Response / Total Probes Sent


Availability Using Network-Based Probes: Example

- Network probe is a ping
- 10,000 probes are sent between the management system and a managed device
- 1 probe failed to respond

DPM = 1 / 10,000 x 10^6 = 100 (100 probes out of 1 million will fail)
Availability = 1 - 1 / 10,000 = 0.9999
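The same example as a short Python sketch (illustrative):

```python
# Sketch: probe-based DPM and availability, applied to the example
# above (10,000 probes sent, 1 lost).

def probe_dpm(failed: int, sent: int) -> float:
    return failed / sent * 1_000_000

def probe_availability(failed: int, sent: int) -> float:
    return 1 - failed / sent

print(round(probe_dpm(1, 10_000)))   # defects per million probes
print(probe_availability(1, 10_000))
```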


Sample Size

Sample size is the number of samples that have been collected.
The more samples collected, the higher the confidence that the data accurately represents the network.

Confidence (margin of error) is defined by:

m = 1 / sqrt(sample size)

Example: data is collected from the network every hour.

After one day:   m = 1 / sqrt(24) = 0.2041
After one month: m = 1 / sqrt(24 x 31) = 0.0367
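The margin-of-error figures above can be reproduced directly:

```python
# Sketch: margin of error m = 1/sqrt(sample size), for hourly collection.
import math

def margin_of_error(sample_size: int) -> float:
    return 1 / math.sqrt(sample_size)

print(round(margin_of_error(24), 4))       # one day of hourly samples
print(round(margin_of_error(24 * 31), 4))  # one month of hourly samples
```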

Service Assurance Agent

Pros
- Accurate network availability for defined paths
- Accounts for routing problems
- Implementation with very low network overhead

Cons
- Requires a system to collect the SAA data
- Requires implementation in the router configurations
- Availability granularity is limited by polling frequency
- Requires definition of the critical network paths to be measured


COMPONENT OUTAGE ONLINE MEASUREMENT (COOL)


COOL Objectives
- Automate the measurement to increase operational efficiency and reduce operational cost
- Measure the outage as close to the source of outage events as possible, to pinpoint the cause of the outages
- Cope with a large number of network elements without causing system and network performance degradation
- Maintain measurement data reliably in the presence of element failure or network partition
- Support simplicity in deployment, configuration, and data collection (autonomous measurement)

COOL Features

[Diagram: COOL embedded in an access router, exposing the Outage Monitor MIB to NMS and 3rd-party tools (NetTools, C-NOTE, PNL) with event notification filtering; customer equipment attached]

- Open access via the Outage Monitor MIB
- Embedded in the router
- Automated real-time measurement
- Autonomous measurement
- Outage data stored in the router

COOL Features (Cont.)

[Diagram: two-tier framework; COOL on access and core routers exposes the Outage Monitor MIB, while the NMS performs outage correlation and calculation]

Two-tier framework:
- Reduces performance impact on the router
- Provides scalability to the NMS
- Makes deployment easy

Outage monitoring and measurement provides flexibility in availability calculation.

Supports NMS or tools for such applications as:
- Calculation of software or hardware MTBF, MTTR, and availability per object, device, or network
- Verification of customers' SLAs
- Troubleshooting in real time

Outage Model

[Diagram: access router (RP, physical and logical interfaces, power, fan) linked to a MUX/hub/switch, customer equipment, and a peer router, monitored by a network management system]

Type of object monitored and failure modes:
- Physical entity objects: component hardware or software failure, including failure of line cards, power supplies, fans, switch fabric, and so on
- Interface objects: interface hardware or software failure, loss of signal
- Remote objects: failure of a remote device (customer equipment or peer networking device) or the link in between
- Software objects: failure of software processes running on the RPs and line cards


Outage Characterization
Data Definition
Defect threshold: a value across which the object is considered to be
defective (service degradation or complete outage)
Duration threshold: the minimum period beyond which an outage needs
to be reported (given SLA)
Start time: when the object outage starts
End time: when the outage ends
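The interplay of the two thresholds can be sketched in code. This is an illustrative helper with hypothetical names, not the COOL implementation: a value crossing the defect threshold marks the object defective, and the outage is reported only if it lasts at least the duration threshold.

```python
def report_outage(samples, defect_threshold, duration_threshold):
    """Return (start, end) pairs of reportable outages from (time, value)
    samples. A value at or above defect_threshold marks the object
    defective; an outage is reported only if its duration is at least
    duration_threshold. Illustrative sketch only, not the COOL algorithm."""
    outages, start = [], None
    for t, value in samples:
        defective = value >= defect_threshold
        if defective and start is None:
            start = t                       # down event: outage starts
        elif not defective and start is not None:
            if t - start >= duration_threshold:
                outages.append((start, t))  # up event: long enough to report
            start = None
    return outages
```

With a defect threshold of 1 and a duration threshold of 2, a two-unit outage is reported while a one-unit blip is filtered out.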

[Diagram: a down event occurs when the measured value crosses the defect threshold; the outage runs from start time to end time, and is reported only if the outage duration exceeds the duration threshold]

Architecture
[Diagram: customer interfaces (SNMP polling, SNMP notification, CLI) expose the Outage Monitor MIB; the outage manager maintains the data table structure (outage component table, event history table, event map table, process map table, remote component map table) with HA and a persistent data store in NVRAM and ATA Flash (time stamp, temporary event data, crash reason, outage data); internal and remote component outage detectors compute the measurement metrics from Cisco IOS fault manager event-source callbacks, CPU usage, and syslog, with optional ping/SAA detection of customer equipment and customer authentication]

Outage Data: AOT and NAF


Requirements of measurement metrics:
Enable calculation of MTTR, MTBF, availability, and SLA assessment
Ensure measurement efficiency in terms of resources (CPU, memory, and network bandwidth)

Measurement metrics per object:


AOT: Accumulated Outage Time since measurement started
NAF: Number of Accumulated Failures since measurement started
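A minimal sketch of how the two metrics accumulate (hypothetical function name; not the IOS implementation): AOT sums the outage durations, NAF counts the outages.

```python
def aot_naf(outage_minutes, duration_threshold=0):
    """Accumulate AOT (total outage minutes) and NAF (number of failures)
    from a list of outage durations since measurement started.
    Illustrative sketch only."""
    reportable = [d for d in outage_minutes if d >= duration_threshold]
    return sum(reportable), len(reportable)

# Router 1 from the slide: two system crashes of 10 minutes each
aot, naf = aot_naf([10, 10])
```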

[Diagram: Router 1 is down twice for 10 minutes each (two system crashes), giving AOT = 20 and NAF = 2]

Outage Data: AOT and NAF


Object containment model: router device, line card, physical interface, logical interface
Containment-independent property
[Diagram: Router 1 crashes twice for 10 minutes each, giving router device AOT = 20, NAF = 2; Interface 1 additionally fails for 7 minutes, giving interface AOT = 7, NAF = 1; the service-affecting totals are AOT = 27, NAF = 3]

Example: MTTR
Find MTTR for Object i
MTTRi = AOTi/NAFi
= 14/2
= 7 min

[Diagram: over the measurement interval (T2 - T1), Object i fails twice, with times to repair (TTR) of 10 min and 4 min]

Example: MTBF and MTTF


Find MTBF and MTTF for Object i
MTBFi = (T2 - T1)/NAFi
MTTFi = MTBFi - MTTRi = (T2 - T1 - AOTi)/NAFi
MTBF = 700,000 = 1,400,000/2
MTTF = 699,993 = 700,000 - 7
[Diagram: over the measurement interval (T2 - T1) = 1,400,000 min, Object i fails twice (TTRs of 10 min and 4 min); TBF is the time between failures and TTF the time to failure]

Example: Availability and DPM


Find availability and DPM for Object i
Availability (%) = [MTBF / (MTBF + MTTR)] * 100
Availability = 99.999% = (700,000/700,007) * 100
DPMi = [AOTi/(T2 - T1)] x 10^6 = 10 DPM

Measurement Interval = 1,400,000 min.
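Plugging the slide's numbers in, as a quick check of the formulas (not COOL code):

```python
AOT, NAF = 14, 2                  # minutes down, number of failures
T = 1_400_000                     # measurement interval T2 - T1, in minutes

mttr = AOT / NAF                            # 7 min
mtbf = T / NAF                              # 700,000 min
mttf = mtbf - mttr                          # 699,993 min
availability = mtbf / (mtbf + mttr) * 100   # 99.999 %
dpm = AOT / T * 1e6                         # 10 defects per million
```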


[Diagram: Object i is down 10 min and 4 min during the interval from T1 to T2]

Planned Outage Measurement


Capture operational CLI commands: both reload and forced switchover
A simple rule derives an upper bound of the planned outage:
If there is no NVRAM soft-crash file, check the reboot reason or switchover reason
If it is reload or forced switchover, the outage can be counted toward an upper
bound of the planned outage
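The rule above can be sketched as a predicate (hypothetical names; an illustration of the slide's heuristic, not IOS code):

```python
def planned_outage_upper_bound(has_nvram_crash_file, reason):
    """Upper-bound heuristic from the slide: with no NVRAM soft-crash
    file, a reboot/switchover reason of 'reload' or 'forced switchover'
    suggests the outage was planned. Illustrative sketch only."""
    return (not has_nvram_crash_file) and reason in ("reload", "forced switchover")
```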

[Diagram: operation-caused outages (send break, reload, forced switchover) roll up to an upper bound of the planned outage]

Event Filtering
Flapping interface detection and filtering:
A faulty interface can keep changing state between up and down
May cause virtual network disconnection
May cause an event storm, with hundreds of messages for each flapping event
May make the object MTBF unreasonably low due to frequent short failures
This unstable condition needs the operator's attention
COOL detects the flapping status by:
Catching very short outage events (less than the duration threshold)
Incrementing the event counter
Flapping status: if the counter exceeds the flapping threshold (3 events)
within a short period (1 sec), COOL sends a notification
Stable status: if it drops back below the threshold, COOL sends another
notification
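The flap filter can be sketched as follows. The window (1 second) and threshold (3 events) follow the slide; the sliding-window logic itself is an assumption for illustration, not the IOS implementation.

```python
def flap_status(event_times, window=1.0, threshold=3):
    """Return True ("flapping") if `threshold` or more short outage
    events fall within any `window`-second span. Illustrative sketch
    of the COOL flap filter, not the actual IOS code."""
    for start in event_times:
        in_window = [t for t in event_times if start <= t < start + window]
        if len(in_window) >= threshold:
            return True
    return False
```

Three short events within one second report flapping; the same events spread seconds apart do not.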

Data Persistency and Redundancy


[Diagram: on each RP, COOL keeps outage data in RAM and copies it, both event-driven and periodically, to persistent outage data in NVRAM and Flash; the active RP's persistent data is copied to the standby RP]

Data persistency
To avoid data loss due to a link outage or a crash of the router itself
Data redundancy
To continue the outage measurement after a switchover
To retain the outage data even if the RP is physically replaced

Outage Monitor MIB


CISCO-OUTAGE-MONITOR-MIB
iso.org.dod.internet.private.enterprise.cisco.ciscoMgmt.ciscoOutageMIB
1.3.6.1.4.1.9.9.280
cOutageHistoryTable: Object-Type; Object-Index; Event-Reason-Index; Event-Time; Event-Interval
cOutageObjectTable: Object-Type; Object-Index; Object-Status; Object-AOT; Object-NAF
Cross-references: the Event Reason Map Table (event description), the IF-MIB ifTable (interface object description), the ENTITY-MIB entPhysicalTable (physical entity object description), the CISCO-PROCESS-MIB cpmProcessTable (process object description), and the Remote Object Map Table (remote object description)

Configuration
[Diagram: the config CLI (cool run; add; removal; filtering-enable) updates the Cisco IOS configuration and the COOL customer equipment detection function; the show CLI (show event-table, show object-table) displays the event table and object table, which the MIB also exposes]

Enabling COOL
ari#dir
Directory of disk0:/
1  -rw-  19014056  Oct 29 2003 16:09:28 +00:00  gsr-k4p-mz.120-26.S.bin
128057344 bytes total (109051904 bytes free)

Obtain authorization file:
ari#copy tftp disk0:
Address or name of remote host []? 88.1.88.9
Source filename []? auth_file
Destination filename [auth_file]?
Accessing tftp://88.1.88.9/auth_file...
Loading auth_file from 88.1.88.9 (via FastEthernet1/2): !
[OK - 705 bytes]
705 bytes copied in 0.532 secs (1325 bytes/sec)
ari#clear cool persist-files

Enable COOL:
ari#conf t
Enter configuration commands, one per line. End with CNTL/Z.
ari(config)#cool run
ari(config)#^Z
ari#wr mem
Building configuration...
[OK][OK][OK]

COOL
Pros
Accurate network availability for devices, components,
and software
Accounts for routing problems
Implementation with low network overhead.
Enables correlation between active and passive availability
methodologies

Cons
Only a few systems currently have the COOL feature
Requires implementation in the router configurations of
production devices
Availability granularity limited by polling frequency
New Cisco IOS Feature

Network Availability Collection Methods

APPLICATION LAYER
MEASUREMENT


Application Reachability

Similar to ICMP Reachability


Method definition:
Central workstation or computer configured to send packets that
mimic application packets
How:
Agents on client and server computers collect data (Fire Runner,
Ganymede Chariot, Gyra Research, Response Networks, Vital Signs
Software, NetScout, custom application queries on customer systems)
Installing special probes on user and server subnets to send,
receive, and collect data (NikSun, NetScout)
Unavailability:
Pre-defined QoS definition

Application Reachability
Pros
Actual application availability can be understood
QoS, by application, can be factored into the availability
measurement

Cons
Depending on scale, potential high overhead and cost can
be expected


DATA COLLECTION FOR ROOT CAUSE ANALYSIS (RCA)
OF NETWORK OR DEVICE DOWNTIME


Data Gathering Techniques


Cisco IOS Embedded RMON
Alarm and event
History and statistics
Set thresholds in router configuration
Configure SNMP trap to be sent when MIB variable
rises above and/or falls below a given threshold
Alleviates need for frequent polling
Not an availability methodology by itself but can
add valuable information and customization to the
data collection method

Data Gathering Techniques


Syslog Messages
Provide information on what the router is doing
Categorized by feature and severity level
User can configure Syslog logging levels
User can configure Syslog messages to be sent as
SNMP traps
Not an availability methodology by itself but can
add valuable information and customization to the
data collection method


Expression and Event MIB


Expression MIB
Allows you to create new SNMP objects based upon formulas
MIB persistence is supported: a MIB's SNMP data persists across
reloads
Delta and wildcard support allows you to:
Calculate utilization for all interfaces with one expression
Calculate errors as a percentage of traffic

Event MIB
Allows you to create custom notifications and log them and/or send
them as SNMP traps or informs
MIB persistence is supported: a MIB's SNMP data persists across
reloads
Can be used to test objects on other devices
More flexible than RMON events/alarms
RMON is tailored for use with counter objects

Data Gathering Techniques


Embedded Event Manager
Underlying philosophy:
Embed intelligence in routers and switches to enable a
scalable and distributed solution, with OPEN interfaces for
NMS/EMS leverage of the features

Mission statement:
Provide robust, scalable, powerful, and easy-to-use
embedded managers to solve problems such as syslog and
event management within Cisco routers and switches


Embedded Event Manager (Cont.)


Development goal: predictable, consistent, scalable
management
Distributed
Independent of central management system

Control is in the customer's hands


Customization

Local programmable actions:


Triggered by specific events


Cisco IOS Embedded Event Manager:


Basic Architecture (v1)
[Diagram: event detector feeds (syslog event detector, SNMP event detector, other event detectors) publish events into the Embedded Event Manager, where EEM policies apply network knowledge and trigger actions: notify, switchover, reload]

EEM Versions
EEM Version 1
Allows policies to be defined using the Cisco IOS CLI applet
The following policy actions can be established:
Generate prioritized syslog messages
Generate a CNS event for upstream processing by
Cisco CNS devices
Reload the Cisco IOS software
Switch to a secondary processor in a fully redundant hardware
configuration

EEM Version 2
EEM Version 2 adds programmable actions using the Tcl
subsystem within Cisco IOS
Includes more event detectors and capabilities

EEM Version 2 Architecture


[Diagram: event publishers (syslog daemon, system manager, watchdog sysmon, timer services, POSIX process manager, HA redundancy facility, interface counters, IOS processes) feed event detectors (syslog, SNMP, watchdog, redundancy facility, interface counters and stats, application-specific) into the Embedded Event Manager server; Tcl-shell EEM policies subscribe to events and implement policy actions, and IOS subsystems publish application events through the application-specific event detector]
More event detectors!
Define policies (programmable local actions) using Tcl
Register the policy with the EEM server
Events trigger policy execution
Tcl extensions for CLI control and defined actions

What Does This Mean to the Business?


Better problem determination
Widely applicable scripts from Cisco engineering and TAC
Automated local action triggered by events
Automated data collection

Faster problem resolution


Reduces the "next time it happens, please collect..." cycle
Better diagnostic data to Cisco engineering
Faster identification and repair

Less downtime
Reduce susceptibility and Mean Time to Repair (MTTR)

Better service
Responsiveness
Prevent recurrence
Higher availability

Not an availability methodology by itself but can add valuable


information and customization to the data collection method

INSTILLING AN
AVAILABILITY CULTURE


Putting an Availability Program into Practice
Track network availability
Identify defects
Identify root cause and
implement fix
Reduce operating expense by eliminating
non-value-added work
How much does an outage
cost today?
How much can I save through
process and product
enhancements?

How Do I Start?
1. What are you using now?
a. Add or modify trouble ticketing analysis
b. Add or improve active monitoring method

2. Process: analyze the data!


a. What caused an outage?
b. Can a root cause be identified and
addressed?

3. Implement improvements or fixes


4. Measure the results
5. Back to step 1: are other metrics
needed?


If You Have a Network Availability Method
Use the current method and metric for improvement
Don't try to change completely
Use incremental improvements
Develop additional methods to gather data as identified
Concentrate on understanding unavailability causes.
All unavailability causes should be classified at a minimum under:
Change, SW, HW, power/facility, or link
Identify the actions to correct unavailability causes
e.g., network design, customer process change, HW MTBF
improvement, etc.

Multilayer Network Design
[Diagram: access, distribution, and core/backbone layers with server farm, WAN, Internet, and PSTN building blocks; SA Agent measures between access and distribution]

Multilayer Network Design (Cont.)
[Same diagram: SA Agent measures between servers and WAN users]

Multilayer Network Design (Cont.)
[Same diagram: COOL on high-end core devices]

Multilayer Network Design (Cont.)
[Same diagram: the trouble ticketing methodology applies across the whole network]

AVAILABILITY MEASUREMENT
SUMMARY


Summary
The availability metric is governed by your business
objectives
Availability measurement's primary goals are:
To provide an availability baseline (maintain)
To help identify where to improve the network
To monitor and control improvement projects
Can you identify "where you are now" for your network?
Do you know "where you are going" as network-oriented
business objectives?
Do you have a plan to take you there?

Complete Your Online Session Evaluation!
WHAT: Complete an online session evaluation and your name will be entered into a daily drawing
WHY: Win fabulous prizes! Give us your feedback!
WHERE: Go to the Internet stations located throughout the Convention Center
HOW: Winners will be posted on the onsite Networkers Website; four winners per day


Recommended Reading
Performance and Fault Management (ISBN: 1-57870-180-5)
High Availability Network Fundamentals (ISBN: 1-58713-017-3)
Network Performance Baselining (ISBN: 1-57870-240-2)
The Practical Performance Analyst (ISBN: 0-07-912946-3)


Recommended Reading (Cont.)
The Visual Display of Quantitative Information, Edward Tufte (ISBN: 0-9613921-0)
Practical Planning for Network Growth, John Blommers (ISBN: 0-13-206111-2)
The Art of Computer Systems Performance Analysis, Raj Jain (ISBN: 0-421-50336-3)
Implementing Global Networked Systems Management: Strategies and Solutions, Raj Ananthanpillai (ISBN: 0-07-001601-1)
Information Systems in Organizations: Improving Business Processes, Richard Maddison and Geoffrey Darnton (ISBN: 0-412-62530-X)
Integrated Management of Networked Systems: Concepts, Architectures, and Their Operational Application, Hegering, Abeck, Neumair (ISBN: 1558605711)

Appendix A: Acronyms
AVG: Average
ATM: Asynchronous Transfer Mode
DPM: Defects Per Million
FCAPS: Fault, Config, Acct, Perf, Security
GE: Gigabit Ethernet
HA: High Availability
HDLC: High-Level Data Link Control
HSRP: Hot Standby Routing Protocol
IPM: Internet Performance Monitor
IUM: Impacted User Minutes
MIB: Management Information Base
MTBF: Mean Time Between Failure
MTTR: Mean Time to Repair
RME: Resource Manager Essentials
RMON: Remote Monitor
SA Agent: Service Assurance Agent
SNMP: Simple Network Management Protocol
SPF: Single Point of Failure; Shortest Path First (routing protocol)
TCP: Transmission Control Protocol

BACKUP SLIDES


ADDITIONAL
RELIABILITY SLIDES


Network Design
What Is Reliability?
Reliability is often used as a general term that
refers to the quality of a product
Failure Rate
MTBF (Mean Time Between Failures) or
MTTF (Mean Time to Failure)
Availability


Reliability Defined
Reliability:
1. The probability of survival (or no failure) for a
stated length of time
2. Or, the fraction of units that will not fail in the
stated length of time
A mission time must be stated
Annual reliability is the probability of
survival for one year


Availability Defined
Availability:
1. The probability that an item (or network, etc.) is
operational, and ready-to-go, at any point in time
2. Or, the expected fraction of time it is operational;
annual uptime is the amount of time (in days, hrs., min.,
etc.) the item is operational in a year
Example: For 98% availability, the annual uptime is
0.98 * 365 days = 357.7 days


MTBF Defined
MTBF stands for Mean Time Between Failure
MTTF stands for Mean Time to Failure
This is the average length of time between failures (MTBF)
or, to a failure (MTTF)
More technically, it is the mean time to go from an
operational state to a non-operational state
MTBF is usually used for repairable systems, and MTTF is
used for non-repairable systems


How Reliable Is It?


MTBF Reliability:
R = e^(-MTBF/MTBF) = e^(-1) = 36.8%
MTBF reliability is only about 37%; that is, roughly 63% of your
HARDWARE fails before the MTBF!
But remember, failures are still random!


MTTR Defined
MTTR stands for Mean Time to Repair
or

MRT (Mean Restore Time)


This is the average length of time it takes to repair an item
More technically, it is the mean time to go from a non-operational state to an operational state


One Method of Calculating Availability
Availability = MTBF / (MTBF + MTTR)
What is the availability of a computer with
MTBF = 10,000 hrs. and MTTR = 12 hrs?
A = 10000 / (10000 + 12) = 99.88%


Uptime
Annual uptime
8,760 hrs/year x 0.9988 = 8,749.5 hrs
Conversely, annual DOWNtime is
8,760 hrs/year x (1 - 0.9988) = 10.5 hrs
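The arithmetic from the two slides, as a quick check:

```python
MTBF, MTTR = 10_000, 12           # hours

availability = MTBF / (MTBF + MTTR)          # ~0.9988
annual_uptime = 8760 * availability          # ~8,749.5 hrs
annual_downtime = 8760 * (1 - availability)  # ~10.5 hrs
```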


Systems
Components in-series: Component 1 followed by Component 2 in the reliability block diagram (RBD)
Components in-parallel (redundant): Component 1 and Component 2 side by side in the RBD

In-Series
[Diagram: the up/down timelines of Part 1 and Part 2 combine so that the in-series system is down whenever either part is down]

In-Parallel
[Diagram: the up/down timelines of Part 1 and Part 2 combine so that the in-parallel system is down only when both parts are down at the same time]

In-Series MTBF
Component 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component failure rate = 1/2500 = 0.0004
System failure rate = 0.0004 + 0.0004 = 0.0008
System MTBF = 1/0.0008 = 1,250 hrs.

In-Series Reliability
Component 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component ANNUAL reliability: R = e^(-8760/2500) = 0.03
System ANNUAL reliability: R = 0.03 x 0.03 = 0.0009

In-Series Availability
Component 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component availability: A = 2500 / (2500 + 10) = 0.996
System availability: A = 0.996 x 0.996 = 0.992
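In-series availability is the product of the component availabilities; a quick check of the slide's numbers:

```python
def series_availability(availabilities):
    """Availability of components in series: the product."""
    a = 1.0
    for x in availabilities:
        a *= x
    return a

component = 2500 / (2500 + 10)                         # ~0.996
system = series_availability([component, component])   # ~0.992
```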

In-Parallel MTBF
Component 1: MTBF = 2,500 hrs.
Component 2: MTBF = 2,500 hrs.
System MTBF* = 2500 + 2500/2 = 3,750 hrs.
In general*, system MTBF = MTBF/1 + MTBF/2 + ... + MTBF/n
*For 1-of-n redundancy of n identical components
with NO repair or replacement of failed components

1-of-4 Example
System MTBF = 2500/1 + 2500/2 + 2500/3 + 2500/4 = 5,208 hrs.
In general*, system MTBF = MTBF/1 + MTBF/2 + ... + MTBF/n
*For 1-of-n redundancy of n identical components
with NO repair or replacement of failed components
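The harmonic sum from the two slides, as code (hypothetical function name):

```python
def one_of_n_mtbf(mtbf, n):
    """System MTBF for 1-of-n redundancy of n identical components
    with no repair or replacement: sum of MTBF/i for i = 1..n."""
    return sum(mtbf / i for i in range(1, n + 1))
```

For MTBF = 2,500 hrs this gives 3,750 hrs with two components and about 5,208 hrs with four.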

In-Parallel Reliability
Component 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component ANNUAL reliability: R = e^(-8760/2500) = 0.03
System ANNUAL reliability (from the component unreliabilities):
R = 1 - [(1 - 0.03) x (1 - 0.03)] = 1 - 0.94 = 0.06



In-Parallel Availability
Component 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component availability: A = 2500 / (2500 + 10) = 0.996
System availability (from the component unavailabilities):
A = 1 - [(1 - 0.996) x (1 - 0.996)] = 1 - 0.000016 = 0.999984
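In-parallel availability is one minus the product of the unavailabilities; checking the slide's numbers:

```python
def parallel_availability(availabilities):
    """Availability of redundant (pure active parallel) components:
    1 minus the product of the unavailabilities."""
    u = 1.0
    for a in availabilities:
        u *= (1 - a)
    return 1 - u

component = 2500 / (2500 + 10)                           # ~0.996
system = parallel_availability([component, component])   # ~0.999984
```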



Complex Redundancy
Examples: 1-of-2, 2-of-3, 2-of-4, 8-of-10, and in general m-of-n (pure active parallel)



More Complex Redundancy


Pure active parallel
All components are on

Standby redundant
Backup components are not operating

Perfect switching
Switch-over is immediate and without fail

Switchover reliability
The probability of switchover when it is not perfect

Load sharing
All units are on and workload is distributed

Networks Consist of Series-Parallel
Combinations of in-series and redundant components
[Diagram: component A in series with a 1-of-2 redundant pair (B1, B2), in series with a 2-of-3 redundant group (D1, D2, D3)]

Failure Rate
The number of failures per time:
Failures/hour
Failures/day
Failures/week
Failures/10^6 hours
Failures/10^9 hours, called FITs (Failures in Time)


Approximating MTBF
13 units are tested in a lab for 1,000 hours with 2
failures occurring
Another 4 units were tested for 6,000 hours with 1
failure occurring
The failed units are repaired (or replaced)
What is the approximate MTBF?


Approximating MTBF (Cont.)
MTBF = (13*1000 + 4*6000) / (2 + 1)
= 37,000 / 3
= 12,333 hours
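The pooled estimate (total unit-hours over total failures) as code, with hypothetical names:

```python
def pooled_mtbf(tests):
    """Approximate MTBF from (units, hours, failures) test runs:
    total unit-hours divided by total failures."""
    unit_hours = sum(u * h for u, h, _ in tests)
    failures = sum(f for _, _, f in tests)
    return unit_hours / failures

# 13 units for 1,000 hrs with 2 failures; 4 units for 6,000 hrs with 1 failure
mtbf = pooled_mtbf([(13, 1000, 2), (4, 6000, 1)])   # 37,000 / 3
```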


Modeling Time-to-Failure Distributions
[Diagram: frequency vs. time-to-failure curves for the Normal, Log-Normal, Weibull, and Exponential distributions, each marked with its MTBF]

Constant Failure Rate
The Exponential Distribution
The exponential density function: f(t) = λe^(-λt), t > 0
The failure rate λ IS CONSTANT: λ = 1/MTBF
If MTBF = 2,500 hrs., what is the failure rate?
λ = 1/2500 = 0.0004 failures/hr.


The Bathtub Curve
[Diagram: failure rate vs. time, showing a DECREASING failure rate during infant mortality, a CONSTANT failure rate during the useful-life period, and an INCREASING failure rate during wear-out]

The Exponential Reliability Formula
Commonly used for electronic equipment
The exponential reliability formula: R(t) = e^(-λt) or R(t) = e^(-t/MTBF)


Calculating Reliability
A certain Cisco router has an MTBF of 100,000 hrs;
what is the annual reliability?
Annual reliability is the reliability for one year or 8,760 hrs

R = e^(-8760/100000) = 91.6%
This says that the probability of no failure in one
year is 91.6%; or, 91.6% of all units will survive
one year
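The exponential model from the slide, as a quick check (hypothetical function name):

```python
import math

def annual_reliability(mtbf_hours):
    """Exponential model: probability of surviving one year (8,760 hrs),
    R = e^(-t/MTBF)."""
    return math.exp(-8760 / mtbf_hours)

r = annual_reliability(100_000)   # ~0.916 for the router example
```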


ADDITIONAL TROUBLE
TICKETING SLIDES


Essential Data Elements

Parameter | Format | Description
Date | dd/mmm/yy | Date ticket issued
Ticket | Alphanumeric | Trouble ticket number
Start Date | dd/mmm/yy | Date of fault
Start Time | hh:mm | Time of fault
Resolution Date | dd/mmm/yy | Date of resolution
Resolution Time | hh:mm | Time of resolution
Customers Impacted | Integer | Number of customers that lost service; number impacted or names of customers impacted
Problem Description | String | Outline of the problem
Root Cause | String | HW, SW, process, environmental, etc.
Component/Part/SW Version | Alphanumeric | For HW problems include the product ID; for SW include the release version
Type | Planned/Unplanned | Identify whether the event was due to planned maintenance activity or an unplanned outage
Resolution | String | Description of the action taken to fix the problem

Note: the above is the minimum data set; however, if other information is captured it should be provided

HA Metrics/NAIS Synergy

(Flow diagram) Trouble tickets, categorized by definitions, planned/unplanned, root cause, resolution, and equipment, are referred for analysis along two paths:

Data analysis
Baseline availability
Determine DPM (Defects Per Million)
MTTR
Network reliability improvement analysis
Data accuracy and collection processes

Operational process and procedures analysis
Problem management
Fault management
Resiliency assessment
Change management
Performance management
Availability management

Analyzed trouble ticket data is then referred back for process/procedural improvement.
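As an illustration of the "Determine DPM" step, one common convention computes defects per million as customer-weighted outage minutes per million customer service minutes. This formula and the sample numbers are assumptions, not definitions from the session:

```python
# Defects Per Million (DPM), computed as impacted customer-minutes of outage
# per million customer-minutes of scheduled service. Sample numbers are invented.
customers_total = 50_000
period_minutes = 30 * 24 * 60            # one 30-day month

# (customers impacted, outage minutes) per trouble ticket
outages = [(120, 150), (3_000, 12), (40, 600)]

defect_minutes = sum(cust * mins for cust, mins in outages)
service_minutes = customers_total * period_minutes

dpm = defect_minutes / service_minutes * 1_000_000
print(f"DPM: {dpm:.1f}")  # -> 36.1
```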


ADDITIONAL SA AGENT SLIDES


SA Agent: How It Works

1. User configures Collectors through the Mgmt Application GUI
2. Mgmt Application provisions Source routers with Collectors (via SNMP)
3. Source router measures and stores performance data, e.g.:
   Response time
   Availability
4. Source router evaluates SLAs, sends SNMP Traps
5. Source router stores the latest data point and 2 hours of aggregated points
6. Application retrieves data from Source routers once an hour
7. Data is written to a database
8. Reports are generated
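The retention policy in steps 3 and 5 (latest data point plus two hours of aggregated points) can be sketched as a rolling buffer. This is an illustration of the idea, not the actual SA Agent implementation:

```python
from collections import deque

class CollectorBuffer:
    """Keeps the most recent sample plus the last two hourly aggregates."""

    def __init__(self, hours_kept=2):
        self.latest = None                      # most recent data point
        self.hourly = deque(maxlen=hours_kept)  # older aggregates fall off
        self._current_hour = None
        self._samples = []

    def add(self, hour, rtt_ms):
        if self._current_hour is not None and hour != self._current_hour:
            # Hour rolled over: fold the finished hour into one aggregate
            self.hourly.append(sum(self._samples) / len(self._samples))
            self._samples = []
        self._current_hour = hour
        self._samples.append(rtt_ms)
        self.latest = rtt_ms

buf = CollectorBuffer()
for hour, rtt in [(0, 10), (0, 20), (1, 30), (1, 50), (2, 35), (3, 45)]:
    buf.add(hour, rtt)

print(buf.latest)        # 45 (latest data point)
print(list(buf.hourly))  # [40.0, 35.0] (hour-1 and hour-2 means; hour 0 aged out)
```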



SAA Monitoring IP Core

(Diagram: edge routers R1, R2, and R3 with SA Agent probe points P1, P2, and P3 measuring paths across the IP core; results are collected by a management system)

Monitoring Customer IP Reachability

(Diagram: probes P1-Pn on customer networks Nw1-NwN, with test points TP1-TPx in the IP core)

P1-Pn: Service Assurance Agent ICMP polls to a test point in the IP core


Service Assurance Agent Features

Measures Service Level Agreement (SLA) metrics:
Packet loss
Response time
Throughput
Availability
Jitter

Evaluates SLAs
Proactively sends notification of SLA violations
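For context, a minimal ICMP echo Collector of this era could be configured with the 12.x `rtr` CLI along these lines (the probe number and target address are placeholders, and exact options vary by IOS release):

```
rtr 10
 type echo protocol ipIcmpEcho 10.1.1.1
 frequency 60
rtr schedule 10 life forever start-time now
```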


SA Agent Impact on Devices

Low impact on CPU utilization
Approximately 18 KB of memory per Collector
Memory protection via the rtr low-memory command


Monitored Network Availability Calculation


Not calculated:
Already have availability baseline
Fault type, frequency, and downtime may be more useful
Faults directly measured from management system(s)


Monitored Network Availability


Assumptions
All connections below IP are fixed
Management systems can be notified of all fixed
connection state changes
All (L2) events impact the IP (L3) service


ADDITIONAL COOL SLIDES


CLIs

Configuration CLI commands
[no] cool run <cr>
[no] cool interface interface-name(idb) <cr>
[no] cool physical-FRU-entity entity-index(int) <cr>
[no] cool group-interface group-objectID(string) <cr>
[no] cool add-cpu objectID threshold duration <cr>
[no] cool remote-device dest-IP(paddr) obj-descr(string) rate(int) repeat(int) [local-ip(paddr) mode(int)] <cr>
[no] cool if-filter group-objectID(string) <cr>

Display CLI commands
Router#show cool event-table [<number of entries>]  (displays all entries if not specified)
Router#show cool object-table [<object-type(int)>]  (displays all object types if not specified)
Router#show cool fru-entity

Exec CLI commands
Router#clear cool event-table
Router#clear cool persistent-files


Measurement Example:
Router Device Outage

Outage causes: reload (operational), power outage, or device H/W failure

Object table fields:
Type: interface(1), physicalEntity(2), process(3), or remoteObject(4)
Index: the corresponding MIB table index; for physicalEntity(2), the index in the ENTITY-MIB
Status: up(1) or down(2)
Last-change: time of the last object status change
AOT: Accumulated Outage Time (seconds)
NAF: Number of Accumulated Failures
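Availability and MTTR follow directly from AOT and NAF once a measurement window is chosen; a small sketch (the one-week window is an assumed input, not part of COOL):

```python
def availability_from_aot(aot_seconds, window_seconds):
    """Fraction of the measurement window the object was up."""
    return (window_seconds - aot_seconds) / window_seconds

def mttr_from_aot(aot_seconds, naf):
    """Mean time to repair: total outage time divided by failure count."""
    return aot_seconds / naf if naf else 0.0

week = 7 * 24 * 3600                    # one-week measurement window (assumed)
print(availability_from_aot(42, week))  # 42 s of accumulated outage in a week
print(mttr_from_aot(42, 1))             # single failure -> MTTR = 42 s
```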

Measurement Example:
Cisco IOS S/W Outage

Standby RP in slot 0 crash using address error (4) test crash; AdEL exception: caused purely by Cisco IOS S/W
Standby RP crash using jump to zero (5) test crash; Bp exception: can be caused by S/W, H/W, or operation


Measurement Example: Linecard Outage

Add a Linecard
Reset the Linecard

Down Event Captured


Up Event Captured

AOT and NAF Updated



Measurement Example: Interface Outage

Configure COOL to monitor all interfaces whose names include the string "ATM2/0.", except ATM2/0.3:

12406-R1202(config)#cool group-interface ATM2/0.
12406-R1202(config)#no cool group-interface ATM2/0.3

Object table (no outages recorded yet):

sh cool object 1 | include ATM2/0.
33 1 1054859087 0  0 0 ATM2/0.1
35 1 1054859088 0  0 0 ATM2/0.2
39 1 1054859090 0  0 0 ATM2/0.4
41 1 1054859090 0  0 0 ATM2/0.5

Shut the interface; down events are captured:

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp  interval hist_id object-name
1    33    1     1054859105  18       1       ATM2/0.1
1    35    1     1054859106  18       2       ATM2/0.2
1    39    1     1054859107  17       3       ATM2/0.4
1    41    1     1054859108  18       4       ATM2/0.5

No shut the interface; up events are captured:

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp  interval hist_id object-name
1    33    0     1054859146  41       1       ATM2/0.1
1    35    0     1054859147  41       2       ATM2/0.2
1    39    0     1054859149  42       3       ATM2/0.4
1    41    0     1054859150  42       4       ATM2/0.5

Object table now shows AOT and NAF:

sh cool object 1 | include ATM2/0.
33 1 1054859087 0  41 1 ATM2/0.1
35 1 1054859088 0  41 1 ATM2/0.2
39 1 1054859090 0  42 1 ATM2/0.4
41 1 1054859090 0  42 1 ATM2/0.5
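The AOT and NAF bookkeeping above can be reproduced from the event table: each up event closes the outage interval opened by the matching down event. A hedged parsing sketch, with the row layout taken from the slide's `type index event time-stamp interval hist_id object-name` header:

```python
# Down (event=1) and up (event=0) rows for ATM2/0.1 from the slide:
events = [
    # type, index, event, time-stamp, interval, hist_id, object-name
    (1, 33, 1, 1054859105, 18, 1, "ATM2/0.1"),   # down event
    (1, 33, 0, 1054859146, 41, 1, "ATM2/0.1"),   # up event
]

aot = {}      # object-name -> accumulated outage time (seconds)
naf = {}      # object-name -> number of accumulated failures
down_at = {}  # object-name -> timestamp of the open down event

for _type, _idx, event, ts, _interval, _hist, name in events:
    if event == 1:                 # down: remember when the outage started
        down_at[name] = ts
    elif name in down_at:          # up: close the outage interval
        aot[name] = aot.get(name, 0) + ts - down_at.pop(name)
        naf[name] = naf.get(name, 0) + 1

print(aot, naf)  # {'ATM2/0.1': 41} {'ATM2/0.1': 1} -- matches the object table
```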

Measurement Example:
Remote Device Outage

Add 3 remote devices:

12406-R1202(config)#cool remote-device 1 50.1.1.2 remobj.1 30 2 50.1.1.1 1
12406-R1202(config)#cool remote-device 2 50.1.2.2 remobj.2 30 2 50.1.2.1 1
12406-R1202(config)#cool remote-device 3 50.1.3.2 remobj.3 30 2 50.1.3.1 1

Object table (no outages recorded yet):

sh cool object-table 4 | include remobj
1 1 1054867061 0  0 remobj.1
2 1 1054867063 0  0 remobj.2
3 1 1054867065 0  0 remobj.3

Shut down the interface link between the remote devices and the router; down events are captured:

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
4 2 5 1054867105 42 2  remobj.2
4 1 5 1054867108 47 3  remobj.1
4 3 5 1054867130 65 10 remobj.3

No shut the interface link; up events are captured:

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
4 1 4 1054867171 63 1  remobj.1
4 3 4 1054867193 63 8  remobj.3
4 2 4 1054867200 95 10 remobj.2

Object table now shows AOT and NAF:

sh cool object-table 4 | include remobj
1 1 1054867061 63 1 remobj.1
2 1 1054867063 63 1 remobj.2
3 1 1054867065 95 1 remobj.3