
Effective Datacenter

Troubleshooting
Methodology
Alejandra Sanchez Garcia - Systems Engineer
BRKDCT-2408
Agenda
• Data Center Solution Overview
• Troubleshooting Basics
• Case Studies
• The Dos and the Don'ts
Data Center Solution
Overview
Next Generation DC
• NETWORK: DC Switching, MDS, SDN / Network Virtualization, ACI
• COMPUTE: UCS, Storage, Integrated Stacks
• CLOUD: Private Cloud, Hybrid Cloud, Intercloud
• Applications
• Security & L4-7 Services
• Management & Orchestration (M&O)
• Software Development (DevOps / PaaS)

Ponemon Institute - September 2013
When a Data Center Goes Down…
• Loss of business continuance
• Loss of revenue
• Loss of business opportunities
• Loss of customer confidence
• Damaged corporate reputation
• Lower customer satisfaction
• Loss of employee productivity
• Loss of employee morale
Quickly identify the problem and get to a
solution in the minimum amount of time!
Data Center Domains

Compute Networking Storage Programmability


Troubleshooting Basics -
Methodology
How We Troubleshoot
• Knowledge based: 7 * 9 = 63 - the answer is recalled directly from experience
• Strategy based: 24 * 41 = 984 - the answer is worked out step by step
  (24 * 1 = 24, 24 * 40 = 960, 24 + 960 = 984)
Either way, it starts with understanding the problem.
The TAC Secret Ingredients
Understanding the problem → Break down the issue → Identify possible causes → Test the most probable cause → Confirm the root cause
Apply knowledge and troubleshoot at every step.

"A problem well stated is a problem half solved."
- Charles Kettering, inventor and head of research for GM
Step 1 - Understanding the Problem
The 5 Ws
• Who is experiencing the problem?
• Why is it important?
• What are the effects?
→ Situation assessment

• When did the problem start?
• Where does the problem occur?
• What is NOT the problem?
→ Problem definition

The H
• How did the problem start? What has changed?
Problem 1 - Performance Issue
Initial description: network slowness and packet drops, P1

1st Pass
• Who: users experiencing slowness and latency
• What: mission-critical applications; timeouts seen with some applications; multiple VLANs
• When: no known changes; working until this morning
• Where: N7Ks at the core and 3rd-party switches at the access layer

2nd Pass
• How: pings from a test laptop to the server are working
• What's not: no drops seen on the core switches (N7Ks)
• What: latency of 180-240 ms

3rd Pass
• How: test laptop on the same access switch as the application server

- A clear and specific problem to troubleshoot
- The core may not be the problem
Problem 2 - OTV Config Assistance
Initial description: connectivity issue when SVIs are brought up at Site B

1st Pass
• What: VLAN 224 (vCenter) and VLAN 130 (hosts), newly added to Site B and OTV; cannot add a host to vCenter
• When: after bringing up Site B; Site A was working fine
• What's not: if the SVIs are down, it works fine
• Why: the deadline is approaching

2nd Pass
• How: ping between the sites is okay - sometimes

3rd Pass
• What are the differences: a ping with packet size 1430 goes through, but not with 1431 bytes and above

Narrowed down to an MTU issue
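A ping sweep with the DF bit set is a quick way to find such an MTU boundary; a sketch with illustrative hostname and address (NX-OS syntax):

OTV-ED-B# ping 10.1.130.10 packet-size 1430 count 5 df-bit   <- succeeds
OTV-ED-B# ping 10.1.130.10 packet-size 1431 count 5 df-bit   <- fails

The largest size that passes with DF set reveals the effective path MTU across the overlay.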


Problem 3 - MAC Flapping and Drops
Initial description: network down due to connectivity issues with various application servers

1st Pass
• What: connectivity issue; FabricPath; MAC flapping
• When: it had been working fine for over a year; no changes
• Where: connected to a pair of remote FabricPath N5K switches; servers connected to the remote N5Ks via eVPC

2nd Pass
• Where exactly: MAC addresses flapping between eVPC member ports

Narrowed down to the port-channel rather than FabricPath
Step 2 - Troubleshoot
Break down the problem - simplify
• Network topology
• Technology

Identify the possible causes
• Changes (known vs. unknown)
• Rule out

Apply knowledge & experience
• How things should have worked
• What are the changes

Test the most probable cause
• Explain the symptoms (Is and Is Not)
• Satisfy the conditions (What, When, Where)
• Use the most approachable tests

What do we need: some logic, some tools


Scenario 1 – packet drop
Scenario 1 - Topology
Both Filer 1 and Filer 2 are dual-linked to a pair of N5Ks (N5k01, N5k02) in vPC.
[Diagram annotation: 21 requests sent, only 17 responses received]
Scenario 1 - Narrowing Down
Four different forwarding paths between the filers:
- Narrow down to 1 path
- Narrow down to 1 direction
- Narrow down to 1 link or device

Tools: PACL, sniffer (a PACL sketch follows)
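As an illustration of the PACL approach (ACL name, interface, and addresses are hypothetical), a permit entry with per-entry statistics counts the flow on one ingress port at a time:

ip access-list NARROW-DOWN
  statistics per-entry
  permit icmp host 10.1.1.10 host 10.1.1.20
  permit ip any any
interface Ethernet1/1
  ip port access-group NARROW-DOWN in

'show access-lists NARROW-DOWN' then shows per-entry hit counts; comparing counters link by link isolates the device or link dropping the traffic.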
Scenario 2 - Narrowing Down
First break down: simplify the topology
Step 3 - Verifying the Root Cause
Test against the conditions:
• Does the probable cause match the problem description?
• Does the probable cause satisfy all of the conditions?

Test against the cause:
• Eliminate the probable cause: does the problem get eliminated?
• Reproduce the same condition: does the problem get reproduced?
The TAC Secret Ingredients - Applied
Example problem: multiple servers on various VLANs experience slowness during file transfers.

• Understanding the problem / break down the issue:
  • Forwarding path of the traffic
  • L2 vs. L3
  • Difference between sites
  • Topology
• Identify possible causes (apply DC domain knowledge):
  • Software processing
  • L2 instability
  • Unicast flooding
  • Faulty hardware
• Test the probable cause:
  • Ping / traceroute
  • Working vs. non-working comparison
  • Tools: Ethanalyzer, SPAN, etc.
• Confirm the root cause
Troubleshooting Basics –
Toolkit
Toolkits
• NX-OS Tools
• Sniffer Capture
• Tech-support
• Event Manager
• Programmability
Tools
Info Collection
• Granular show commands and CLI filtering
• Granular show tech-support
• Logging capabilities

Hardware
• GOLD (Generic Online Diagnostics)
• OBFL (On-Board Failure Logging)

Troubleshoot
• Debugs (with filters & redirection) and debug plugins
• Ethanalyzer (built-in "CPU sniffer")
• ELAM
• EEM (Embedded Event Manager)
• SPAN
• Programmability
Debugs
• When event-history is not sufficient
• Use 'debug logfile <file>' to redirect debug output to a log file

N7K# debug logfile ospf1


N7K# dir log:
1261623 Jan 28 16:24:16 2015 messages
18628 Jan 24 19:12:22 2015 ospf1
N7K# show debug logfile ospf1
2015 Jan 24 19:10:12.801652 ospf: 1 [12275] (default) Nbr 10.10.10.4 FSM
start: old state INIT, event HELLORCVD
Debugs
• When event-history is not sufficient
• Use 'debug logfile <file>' to redirect debug output to a log file

• Debug-filter
• More granular debugs
• Can apply multiple filters simultaneously

N7K# debug-filter pktmgr interface e4/1


N7K# debug-filter pktmgr dest-mac 0100.5e00.000d
N7K# debug pktmgr frame

N7K# show debug-filter all


debug-filter pktmgr dest-mac 0100.5E00.000D
debug-filter pktmgr interface Ethernet4/1

2013 Jan 27 20:12:19.447962 netstack: In 0x0800 72 7 001f.27c8.8c00 -> 0100.5e00.000d Eth4/1


2013 Jan 27 20:12:24.996482 netstack: Out 0x0800 64 6 64a0.e744.3942 -> 0100.5e00.000d Eth4/1
2013 Jan 27 20:12:48.607940 netstack: In 0x0800 72 7 001f.27c8.8c00 -> 0100.5e00.000d Eth4/1
SPAN (Switched Port ANalyzer)
• A tool that captures traffic from a source and directs it to a destination interface
• Source: Ethernet port, port channel, inband interface to CPU, VLANs, fabric port, HIF
• Destination: Ethernet port, port channel

monitor session 2 type local


source interface port-channel1995 both
source interface Ethernet 4/1 both
source vlan 500-510 both
destination interface Ethernet2/23

• Need-to-knows
• Requires external sniffer
• Understand the traffic flow(s) being captured
• Very useful for data plane issues, packet drops, intermittent problems

• Other variation
• ERSPAN (Encapsulated Remote SPAN) - carries the mirrored traffic over IP to a remote destination (sketch below)
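A minimal ERSPAN source sketch on NX-OS (session ID, ERSPAN ID, and addresses are illustrative; the destination must be a device that can terminate ERSPAN):

monitor erspan origin ip-address 10.0.0.1 global
monitor session 10 type erspan-source
  erspan-id 901
  vrf default
  destination ip 10.0.0.100
  source interface Ethernet4/1 both
  no shut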
Ethanalyzer
• Built-in sniffer for CPU bound traffic
• ‘capture-filter’ (tcpdump/BPF syntax) vs. ‘display-filter’ (Wireshark field syntax)
• ‘decode-internal’
• Other options

• Ethanalyzer does not


• Capture data plane traffic forwarded in hardware
• Support interface specific capture

N7K# ethanalyzer local interface inband capture-filter "tcp port 5000" limit-captured-frames 50 write bootflash:test.cap
N7K# dir test.cap
26224 Jan 12 18:40:08 2015 test.cap
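By contrast, a display filter can match on decoded protocol fields that a capture filter cannot; an illustrative example (address hypothetical):

N7K# ethanalyzer local interface inband display-filter "icmp && ip.addr==10.1.1.1" limit-captured-frames 20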
ELAM (Embedded Logic Analyzer Module)
• A tool to capture a single packet and determine its forwarding path within the switch
• Powerful and flexible triggering capability
• Module specific
• Available on Nexus 7000, Nexus 6000, and Nexus 9000

• Need-to-knows
• L2-4 data plane forwarding issues
• Consistent problem
• No external sniffer required
• Not a replacement for capture utilities like Ethanalyzer or SPAN

Workflow: identify the ingress forwarding engine → configure the trigger for the specific packet (CLI available) → start the capture → ELAM triggers if the packet is received → view the packet data
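A rough sketch of what this looks like on a Nexus 7000 F2 module (syntax is module/ASIC specific, and the module number and addresses here are hypothetical; check the platform ELAM guide for your linecard):

N7K# attach module 4
module-4# elam asic flanker instance 1
module-4(fln-elam)# layer2
module-4(fln-l2-elam)# trigger dbus ipv4 ingress if source-ipv4-address 10.1.1.1 destination-ipv4-address 10.1.1.2
module-4(fln-l2-elam)# trigger rbus ingress if trig
module-4(fln-l2-elam)# start
module-4(fln-l2-elam)# status
module-4(fln-l2-elam)# show dbus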
EEM (Embedded Event Manager)
• A subsystem to automate tasks and customize the device behavior
• Event → notification → action

event manager applet HIGH-CPU


event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.6.1 get-type exact entry-op ge entry-val 60 poll-interval 1
action 1.0 syslog msg High CPU hit $_event_pub_time
action 2.0 cli enable
action 3.0 cli show clock >> bootflash:high-cpu.txt
action 4.0 cli show proc cpu sort >> bootflash:high-cpu.txt

• Many built-in system policies: ‘show event manager system-policy’


• Helpful in data gathering when the occurrence of the issue is unpredictable
• Can be scheduled at a specific time or intervals
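For time-based collection, the NX-OS scheduler can run a job at fixed times (job and schedule names below are illustrative):

feature scheduler
scheduler job name CPU-SNAPSHOT
  show processes cpu sort >> bootflash:cpu-history.txt
  exit
scheduler schedule name DAILY-0200
  job name CPU-SNAPSHOT
  time daily 02:00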
Programmability
Use cases along the automation-complexity axis, from passive to real time:
• Network monitoring and data visibility
• Automated troubleshooting
• Pre-provisioning and scripted use of existing management tools
• Automated provisioning and custom integration
• Event-triggered scripts and DevOps workflows
Programmability
• Scripting – Python, Tcl
• Scripting + EEM
• Scripting + scheduler
bootflash:Python.py

import re
import cisco
cisco.cli ("show interface eth 1/1-32 transceiver detail >> bootflash:link_flap.txt")

switch(config)# event manager applet PYTHON


switch(config-applet)# description "Triggers PYTHON Script"
switch(config-applet)# event cli match "shutdown"
switch(config-applet)# action 1.0 cli local python bootflash:Python.py
Programmability
• Scripting – Python, Tcl
• NX-API – REST API for NX-OS
  • JSON-RPC / JSON / XML request and response formats
  • Requests and responses are carried over HTTP/HTTPS to the NX-API web server on the switch

Switch# conf t
Switch(config)# feature nxapi
Switch(config)# exit
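Once enabled, any HTTP client can send CLI commands as JSON-RPC; a minimal Python sketch (switch address and credentials are placeholders):

import requests

payload = [{
    "jsonrpc": "2.0",
    "method": "cli",
    "params": {"cmd": "show version", "version": 1},
    "id": 1
}]
# POST the command to the NX-API endpoint on the switch
resp = requests.post("http://switch/ins",
                     json=payload,
                     headers={"content-type": "application/json-rpc"},
                     auth=("admin", "password"))
print(resp.json())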
Programmability
• Scripting – Python, Tcl
• NX-API – REST API for NX-OS
• Configuration Management Software
• GitHub
  • Cisco Open Source Projects for the Data Center - https://github.com/datacenter
    (e.g., MTU configuration and cable misconfiguration checks)
  • Other projects: https://github.com/jedelman8/nxos-ansible
ACI - Troubleshooting and Operation Tools
Case Studies
Case Study 1

“We recently migrated from Catalyst 6500 to a pair of Nexus switches in vPC.
The setup has all of our access switches connected to the pair. As soon as we cut over
to the Nexus, we see performance issues across the data center. Phones stop working.
Everything was working fine before the cut over”
Case Study 1

[Diagram: N7K-1, N7K-2, Cat4500; servers DHCP-1, DHCP-2, FTP-1, DHCP-3]


Case Study 1

Multiple problems
1a) Poor performance across Data Center
1b) Phones are not working
Case Study 1a

Asking the right questions

• Poor performance across the Data Center
Q) What application(s) specifically do users have problems accessing? - FTP
Q) Is the user able to start the FTP process? - YES
Q) Define poor performance - transfers take 30 minutes instead of 3-5
Q) Are all users affected? - Not sure; all users are on VLAN 50
Q) What do you observe when a user is moved to a different VLAN? - Works fine
Q) Are all users in VLAN 50 connected to the same access layer switch? - NO; it does not
matter where the user is connected - if the machine is on VLAN 50, FTP is slow
Case Study 1a

What to look for

Look for a common point in the network that the traffic traverses
=> the Nexus pair
We have narrowed down the issue to the Nexus pair.
Let's dig deeper:

• Interface drops
• Spanning tree
• Forwarding path
Case Study 1a

Ethanalyzer
• Ping between Client and FTP Server
• Ethanalyzer is a tool to monitor traffic to and from the CPU
N7k#ethanalyzer local interface inband capture-filter tcp
2014-01-28 17:19:55.730066 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#20] microsoft-ds > venus
[ACK] Seq=1 Ack=1 Win=17520 Len=0

2014-01-28 17:19:55.730193 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#21] microsoft-ds > venus
[ACK] Seq=1 Ack=1 Win=17520 Len=0

2014-01-28 17:19:55.730340 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#22] microsoft-ds > venus
[ACK] Seq=1 Ack=1 Win=17520 Len=0

Aha! Traffic is getting software switched


We have narrowed down the slowness to be a result of packets getting software
switched.
Case Study 1a

What can cause traffic to go to the CPU

• Packets with IP options
• Fragmented packets
• IP redirects - when the next hop is reachable on the same VLAN as the one the
packet comes in on - check the IP route for the source and destination IP ('no ip
redirects' under the VLAN interface; a sketch follows this list)
• Hardware misprogramming - call TAC
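For the redirects case, the fix is applied on the SVI (VLAN number illustrative):

N7K(config)# interface Vlan50
N7K(config-if)# no ip redirects

With redirects disabled, the switch forwards these packets in hardware instead of punting them to the CPU to generate ICMP redirect messages.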
Case Study 1b

Some phones not working


[Diagram: N7K-1, N7K-2, Cat4500; servers DHCP-1, DHCP-2, FTP-1, DHCP-3]


Case Study 1b

Asking the right questions


Q) Are all phones down? - No, only phones in VLAN 10
Q) Are they rebooting? What phase of bootup is failing? - Unable to get an IP address
Q) Packet capture on the DHCP server? - No discovers from these specific phones
are seen
Q) Are all phones in VLAN 10 connected to one switch? - No, they are across the
network

Start looking from the CORE – Nexus pair


Case Study 1b

Asking the right questions


Q) What is different about VLAN 10? Nothing different
Q) What VLAN is the DHCP server in? 3 servers, one in VLAN 10 and the others in
VLAN 20 and 30, but the one in VLAN 10 is the only active DHCP server at
this point.
DHCP server in the same VLAN as the phones - good data point
Q) No discovers from these specific phones are seen on the server; do they reach
the Nexus switches? Let's use SPAN/Ethanalyzer to confirm
Case Study 1b

Ethanalyzer
sw1(config-if)# ethanalyzer local interface inband capture-filter "port 67" limit-captured-frames 0
Capturing on inband
2014-05-01 06:20:41.793378 0.0.0.0 -> 255.255.255.255 DHCP DHCP Discover - Transaction ID 0x3e96b16d
2014-05-01 06:20:41.793763 10.7.1.2 -> 10.5.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d
2014-05-01 06:20:41.793763 10.7.1.2 -> 10.6.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d
• The Nexus switch is relaying the packets only to the servers in VLAN 20 and 30, which are not active.
• We assumed the DISCOVER, being a broadcast packet, would make it to the server on the same VLAN - NOT TRUE
• On Nexus, when the relay is used, we need to specify the DHCP server even when it is on the same VLAN (see the sketch below)
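A minimal sketch of the resulting relay configuration (the VLAN 20/30 server addresses are taken from the capture above; the VLAN 10 server address is illustrative):

feature dhcp
interface Vlan10
  ip dhcp relay address 10.4.1.220  <- the server in VLAN 10 must be listed too
  ip dhcp relay address 10.5.1.220
  ip dhcp relay address 10.6.1.220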
Case Study 1

Conclusion
How did we solve these issues?

1) Taking them one issue at a time
2) Asking the right questions to narrow down the scope of the problem
3) Using logical reasoning and the right tools to root-cause the problem
Case Study 2

“vMotion is failing”
Case Study 2

[Topology: NetApp1 and NetApp2 (storage) connect to N5K-1 and N5K-2 (networking), which connect to fabric interconnects FI-A1/FI-B1 and FI-A2/FI-B2 and the hosts (compute)]
Case Study 2

Asking the right questions


Q) Was it working before? - New deployment
Q) What type of vMotion are we doing? - Host vMotion
Q) Is vMotion working for any of the hosts? - Some of them work, some don't
Q) Is there IP connectivity between the failing hosts? - Yes
Q) What is the error message? - ESX hosts were not able to connect to the vMotion
network
Case Study 2

What is next?
• Open a support ticket?
  • With VMware?
  • With Cisco?
• Is there anything else we can do?

Apply knowledge - what do we know about vMotion?
• A VMkernel interface in each host, with an IP address
• Same L2 domain

Tools?
• vmkping (example below)
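vmkping sources the ping from the host's VMkernel interface - the same path vMotion uses. A quick don't-fragment test from the ESXi shell (address illustrative; 8972 bytes = 9000 minus the IP/ICMP headers):

~ # vmkping -d -s 8972 10.10.10.2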
Case Study 2 - Logical Topology

[Diagram: Host-1 through Host-5, each with a vDS, a VMkernel (VMK) interface, and VMs;
the hosts connect through FI-A1/FI-B1 and FI-A2/FI-B2 to N5K-1 and N5K-2]
Case Study 2

Troubleshooting
• VMkernel ping -> works
• Narrow down the issue:
  • vMotion between host-1 and host-2
  • vMotion between host-1 and host-4

=> The simplest scenario that presents the problem
Case Study 2

Troubleshooting
• Possible causes:
  • Interface or link related issues
    • Shut down interfaces?
  • MTU issues
    • Ping with larger packets
    • SVI on the Nexus switches

=> MTU on N5K-2 was not configured for jumbo frames (config sketch below)
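On the Nexus 5000, MTU is driven by the network-QoS policy rather than set per interface; a minimal jumbo-frame sketch (policy name illustrative):

policy-map type network-qos JUMBO
  class type network-qos class-default
    mtu 9216
system qos
  service-policy type network-qos JUMBO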


Case Study 2

Conclusion
• How did we solve this issue?
• Consider the different DC domains
• Break down the issue with knowledge, logic, and tools
• Find a simple scenario that presents the problem
Case Study 3

Issue 3 - ACI
• ACI leaf switch is not discovered
Case Study 3

Troubleshooting
• APIC interface up? - Yes
• show lldp neighbors? - leaf102 is there
• DHCP address assigned? - Yes
• Can we ping the leaf switch? - Yes
• Default firmware? - Wrong version!

=> Changing the default firmware version resolved the problem
Case Study 3

Conclusion
• Traditional networking concepts are still present even in new SDN technologies
• The same logic applies:
  - Understand the problem
  - Use your knowledge
  - Narrow down the issue
  - Test possible causes
Case Study 4

=== Customer Symptoms ===
ESX installation fails; when it succeeds, the upgrade fails (not able to transfer
patches to the storage), and hosts often PSOD
Case Study 4

[The topology spans the compute, storage, and networking domains]
Case Study 4

Understanding the problem

• Local or remote storage? - Remote
• FCoE
• Different versions of ESX, with the same result
• A local install (USB) works
• Boot from SAN
Case Study 4

Initial checks
• Tried different ESX installation images
• Checked the compatibility matrix for VMware, UCS, drivers, and NetApp
• Checked the configuration on NetApp, Nexus, and UCS
• Verified that the LUN is presented to the host
• NetApp confirmed clustered Data ONTAP was configured properly and no errors were
reported in the logs
• Low number of CRC errors on some of the physical interfaces
Case Study 4

Simplify the problem


[Simplified topology: NetApp1/NetApp2 -> N5K-1/N5K-2 -> FI-A/FI-B]
Case Study 4

Troubleshooting
• Problem isolated to N5K-2, showing CRC errors on several ports
• Possible causes for CRC errors:
  • Faulty cable
  • Faulty transceiver
  • Faulty ASIC
• The problem was not related to cabling or transceivers, and multiple ASICs were affected, but
intermittently
Potential HW failure??

• After rebooting N5K-2, the problem appeared to move to N5K-1. Still a very low number
of CRC errors (~100)
• HW failure unlikely; potential SW defect??
Case Study 4

Troubleshooting cont.
• Recreate attempts in the TAC lab were unsuccessful -> the only difference was the cabling:
  • Fiber instead of twinax

• It was a SW defect affecting twinax cables, with a funny workaround:
  • shut / no shut on the port fixed the issue temporarily
Case Study 4

Conclusion
• DC environments can be complex
• Narrowing down the issue is extremely important:
  • Simplify the topology as much as possible
  • Storage? Networking? Compute?
  • Isolate the issue to a device
  • Troubleshoot the device

• It's rarely a bug


The Dos and The Don'ts
The Dos – Troubleshooting
• Understand how things should work
• Identify the broken scenario
• Use solid troubleshooting techniques, start with basics
• Capture valuable information
• Bring all parties to the table
• Ask the right questions
The Dos – Operational
• Stay calm
• Know your network
• Backup
• Documentation
• Network Management
The DON'Ts -- Troubleshooting
• Jump to conclusions
• Take drastic measures
• 'let's bounce the datacenter'
• 'we are reloading the switches one at a time'
• Lump all issues together
The DON'Ts -- Operational
• Make multiple changes at once
• Status update and technical call in one
The TAC Secret Ingredients
Understanding the problem → Break down the issue → Identify possible causes → Test the most probable cause → Confirm the root cause
Apply knowledge and troubleshoot at every step.
Q&A
Complete Your Online Session Evaluation
• Give us your feedback and receive a Cisco 2016 T-Shirt by completing the Overall Event
Survey and 5 Session Evaluations:
  - Directly from your mobile device on the Cisco Live Mobile App
  - By visiting the Cisco Live Mobile Site (URL)
  - At any Cisco Live Internet Station located throughout the venue
• T-Shirts can be collected in the World of Solutions on Friday from 12:00pm - 2:00pm

Don't forget: Cisco Live sessions will be available for viewing on-demand after the event at CiscoLiveAPAC.com
Continue Your Education
• Demos in the Cisco Campus
• Walk-in Self-Paced Labs
• Table Topics
• Meet the Engineer 1:1 meetings
Thank you
