Troubleshooting
Methodology
Alejandra Sanchez Garcia - Systems Engineer
BRKDCT-2408
Agenda
• Data Center Solution Overview
• Troubleshooting Basics
• Case Studies
• The Dos and the Don'ts
Data Center Solution Overview
Next Generation DC
• Network: SDN / Network Virtualization, ACI
• Compute: Storage, Integrated Stacks
• Cloud: Hybrid Cloud, Intercloud
• Applications
7 * 9 = 63
Understanding the problem
• Strategy based
24 * 41 = 984
2 * 4 = 8 (hundreds), 2 * 1 + 4 * 4 = 18 (tens), 4 * 1 = 4 (units)
800 + 180 + 4 = 984
The TAC Secret Ingredients
• Apply knowledge
• Test the most probable cause
Charles Kettering, inventor and head of research for GM
Step 1 - Understanding the Problem
The 5 Ws
• Who is experiencing the problem
• Why is it important
Situation assessment
1st Pass
• What:
• Mission critical applications
• Timeouts are seen with some applications
• Multiple VLANs
• Who: Users experiencing slowness and latency
• When: No known changes, working till this morning
• Where: N7Ks at the core and 3rd party at the access layer
2nd Pass
• How: Pings from a test laptop to the server are working
• What's not: No drops seen on the core switches (N7Ks).
3rd Pass
• How: Test laptop on the same access switch as the application server
- Clear & specific problem to troubleshoot
- Core may not be the problem
Problem 2 – OTV config assistance
Initial description: Connectivity issue when SVIs are brought up at site B
1st Pass
• What:
• VLAN 224 (vCenter) and VLAN 130 (hosts)
• Site A was working fine
• Newly added to Site B and OTV
• Cannot add host to vCenter
• When: After bringing up Site B
• What's not: If SVIs are down it works fine.
• Why: Deadline is approaching
2nd Pass
• How: Ping between the sites is okay, sometimes
3rd Pass
• What are the differences: Ping with packet size 1430 goes through, but not with 1431 bytes and above
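A sharp pass/fail boundary at a single byte usually points at an MTU limit on the path. The sketch below shows one set of assumptions under which the boundary falls exactly between 1430 and 1431: the ping size is the ICMP payload, the transport between the sites has a 1500-byte MTU, and OTV adds its commonly cited 42 bytes of encapsulation without fragmenting. These values are assumptions for illustration, not data from the case.

```python
# Assumed header sizes in bytes; adjust to match the actual environment.
IP_HEADER = 20
ICMP_HEADER = 8
OTV_OVERHEAD = 42      # assumption: typical OTV encapsulation overhead
TRANSPORT_MTU = 1500   # assumption: MTU of the links between the sites


def encapsulated_size(ping_payload: int) -> int:
    """IP packet size on the transport after OTV encapsulation."""
    return ping_payload + ICMP_HEADER + IP_HEADER + OTV_OVERHEAD


def fits(ping_payload: int, mtu: int = TRANSPORT_MTU) -> bool:
    """True if the encapsulated ping fits in the transport MTU unfragmented."""
    return encapsulated_size(ping_payload) <= mtu


for size in (1430, 1431):
    print(size, encapsulated_size(size), "passes" if fits(size) else "dropped")
```

Under these assumptions a 1430-byte ping becomes exactly 1500 bytes on the transport, and one more byte no longer fits.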
2nd Pass
• Where exactly: Flapping MAC address between eVPC
Scenario 1 – Narrowing down
Topology: Filer 1 and Filer 2 behind N5k01 and N5k02 (21 requests, 17 responses)
4 different forwarding paths
- Narrow down on 1 path
- Narrow down on 1 direction
- Narrow down on 1 link or device
Scenario 2 – Narrowing down
Topology: Filer 1 behind N5k01 and N5k02
First break down
Simplify the topology
Step 3 - Verifying the Root Cause
Test against the conditions:
• Does the probable cause match the problem description
• Does the probable cause satisfy all of the conditions
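The two checks above can be sketched as a simple filter over candidate causes. The observations paraphrase Problem 1; the candidate causes and their predictions are hypothetical, chosen only to illustrate the test.

```python
# Observations gathered in the three passes of Problem 1 (paraphrased).
observations = {
    "users_see_timeouts": True,
    "drops_on_core": False,
    "ping_same_access_switch_ok": True,
}

# Hypothetical candidate causes and what each one would predict.
candidates = {
    "core congestion": {"users_see_timeouts": True, "drops_on_core": True},
    "access layer issue": {"users_see_timeouts": True, "drops_on_core": False},
}


def satisfies_all(predicted: dict, observed: dict) -> bool:
    """A cause survives only if every prediction matches a recorded observation."""
    return all(observed.get(key) == value for key, value in predicted.items())


surviving = [name for name, pred in candidates.items() if satisfies_all(pred, observations)]
print(surviving)
```

A cause that contradicts even one condition is dropped, which is why the "what's not" answers from the earlier passes are worth recording.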
• Debug-filter
• More granular debugs
• Can apply multiple filters simultaneously
• Need-to-knows
• Requires external sniffer
• Understand the traffic flow(s) being captured
• Very useful for data plane issues, packet drops, intermittent problems
• Other Variation
• ERSPAN (Encapsulated Remote Switched Port Analyzer)
Ethanalyzer
• Built-in sniffer for CPU bound traffic
• ‘capture-filter’ vs. ‘display filter’
• ‘decode-internal’
• Other options
N7K# ethanalyzer local interface inband capture-filter "tcp port 5000" limit-captured-frames 50 write bootflash:test.cap
N7K# dir test.cap
26224 Jan 12 18:40:08 2015 test.cap
ELAM (Embedded Logic Analyzer Module)
• A tool to capture a single packet and determine its forwarding path within the switch
• Powerful and flexible triggering capability
• Module specific
• Available on Nexus 7000, Nexus 6000 and Nexus 9000
• Need-to-knows
• L2-4 data plane forwarding issues
• Consistent problem
• No external sniffer required
• Not a replacement for capture utilities like Ethanalyzer or SPAN
Automation Complexity (figure labels): Network Monitoring and Data Visibility; Passive Automated Troubleshooting; Scripted Pre-Provisioning; Provisioning; Subset of Existing Management Tools; Custom Automated Integration
Programmability
• Scripting – Python, Tcl
• Scripting + EEM
• Scripting + scheduler
bootflash:Python.py
import re
import cisco
# Append the transceiver details for Eth1/1-32 to a file on bootflash
cisco.cli("show interface eth 1/1-32 transceiver detail >> bootflash:link_flap.txt")
• request/response over HTTP/S
Case Study 1
“We recently migrated from Catalyst 6500 to a pair of Nexus switches in vPC.
The setup has all of our access switches connected to the pair. As soon as we cut over
to the Nexus, we see performance issues across the data center. Phones stop working.
Everything was working fine before the cutover.”
Case Study 1
Multiple problems
1a) Poor performance across Data Center
1b) Phones are not working
Case Study 1a
• Interface Drops
• Spanning Tree
• Forwarding path
Case Study 1a
Ethanalyzer
• Ping between Client and FTP Server
• Ethanalyzer is a tool to monitor traffic to and from the CPU
N7k# ethanalyzer local interface inband capture-filter tcp
2014-01-28 17:19:55.730066 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#20] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
2014-01-28 17:19:55.730193 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#21] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
2014-01-28 17:19:55.730340 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#22] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
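Since Ethanalyzer only sees traffic to and from the CPU, a stream of duplicate ACKs between two hosts in this capture is worth quantifying. A minimal sketch, assuming the capture text has been saved off-box with wrapped lines joined; the function and variable names are illustrative, not part of any Cisco tooling.

```python
import re
from collections import Counter

# Sample lines in the format of the capture above (wrapped lines joined).
CAPTURE = """\
2014-01-28 17:19:55.730066 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#20] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
2014-01-28 17:19:55.730193 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#21] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
2014-01-28 17:19:55.730340 10.127.76.137 -> 10.8.51.4 TCP [TCP Dup ACK 617#22] microsoft-ds > venus [ACK] Seq=1 Ack=1 Win=17520 Len=0
"""

# Source IP, destination IP, then the duplicate-ACK marker.
DUP_ACK = re.compile(
    r"(\d+\.\d+\.\d+\.\d+) -> (\d+\.\d+\.\d+\.\d+) TCP \[TCP Dup ACK (\d+)#(\d+)\]"
)


def count_dup_acks(text: str) -> Counter:
    """Count duplicate ACK lines per (source, destination) pair."""
    counts = Counter()
    for line in text.splitlines():
        match = DUP_ACK.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


dup_counts = count_dup_acks(CAPTURE)
print(dup_counts)
```

Grouping by address pair makes it easy to see whether the duplicate ACKs come from one flow or from many.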
Case Study 1b
Ethanalyzer
sw1(config-if)# ethanalyzer local interface inband capture-filter "port 67" limit-captured-frames 0
Capturing on inband
2014-05-01 06:20:41.793378 0.0.0.0 -> 255.255.255.255 DHCP DHCP Discover - Transaction ID 0x3e96b16d
2014-05-01 06:20:41.793763 10.7.1.2 -> 10.5.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d
2014-05-01 06:20:41.793763 10.7.1.2 -> 10.6.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d
• The Nexus switch is sending the relayed packets to servers in VLANs 20 and 30 that are not active.
• We are assuming that the DISCOVER, being a broadcast packet, would also reach the server on the same VLAN: NOT TRUE
• On Nexus, when relay is used, the DHCP server must be specified even when it is on the same VLAN
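To confirm where the relay actually sends each DISCOVER, the capture can be grouped by transaction ID. A minimal sketch, assuming the capture has been saved as text with wrapped lines joined; the helper name `relay_targets` is hypothetical.

```python
import re
from collections import defaultdict

# Sample lines in the format of the capture above (wrapped lines joined).
CAPTURE = """\
2014-05-01 06:20:41.793378 0.0.0.0 -> 255.255.255.255 DHCP DHCP Discover - Transaction ID 0x3e96b16d
2014-05-01 06:20:41.793763 10.7.1.2 -> 10.5.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d
2014-05-01 06:20:41.793763 10.7.1.2 -> 10.6.1.220 DHCP DHCP Discover - Transaction ID 0x3e96b16d
"""

# Source IP, destination IP, DHCP message type, transaction ID.
LINE = re.compile(
    r"(\d+\.\d+\.\d+\.\d+) -> (\d+\.\d+\.\d+\.\d+) DHCP DHCP (\w+) - Transaction ID (0x[0-9a-f]+)"
)


def relay_targets(text: str) -> dict:
    """Map each transaction ID to the unicast destinations it was relayed to."""
    targets = defaultdict(list)
    for line in text.splitlines():
        match = LINE.search(line)
        if match and match.group(2) != "255.255.255.255":
            targets[match.group(4)].append(match.group(2))
    return dict(targets)


targets = relay_targets(CAPTURE)
print(targets)
```

If the server on the client's own VLAN is expected to answer, its address should show up in this list; here only the two remote servers do.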
Case Study 1
Conclusion
How did we solve these issues
Case Study 2
“vMotion is failing”
Case Study 2
Topology: Compute, Networking (N5K-1, N5K-2), Storage (Netapp1, Netapp2)
Case Study 2
What is next?
• Tools?
• vmkping
Logical topology: N5K-1 and N5K-2 (diagram)
Troubleshooting
=> The simplest scenario that presents the problem
Case Study 2
Troubleshooting
• Possible causes:
• Interface or link related issues
• Shut down interfaces?
• MTU issues
• Ping with larger packets
• SVI on Nexus switches
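For the MTU check, "ping with larger packets" can be turned into a search for the largest size that still passes. A minimal sketch, assuming a `probe` callable that returns True when a ping of the given payload size succeeds with fragmentation disabled (in practice it might wrap something like `vmkping -d -s <size>`). The probe used here is a stand-in with a made-up threshold.

```python
from typing import Callable


def largest_passing_size(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary-search the largest size in [low, high] for which probe(size) succeeds.

    Assumes probe(low) succeeds and that success is monotonic: if one size
    fails, every larger size fails too.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Stand-in probe for illustration only: pretend anything above 1472 bytes is dropped.
def fake_probe(size: int) -> bool:
    return size <= 1472


print(largest_passing_size(fake_probe, 64, 8972))
```

Adding the 28 bytes of IP and ICMP headers to the result gives the effective path MTU, which can then be compared against what each hop is configured for.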
Conclusion
• How did we solve these issues
• Consider the different DC domains
• Break down the issue with knowledge, logic and tools
• Find a simple scenario which presents the problem
Case Study – Issue 3
Case Study 3
Issue 3 ACI
• ACI Leaf switch is not discovered
Troubleshooting
• APIC Interface up? - Yes
• Show LLDP Neighbour? - leaf102 is there
• DHCP Address assigned? - Yes
• Can we ping the leaf switch? - Yes
• Default firmware? - Wrong version!
Conclusion
• Traditional networking concepts are still present even in new SDN Technologies
• Same logic applies
– Understand problem
– Use your knowledge
– Narrow down the issue
– Test possible causes
Case Study – Issue 4
Case Study 4
=== Customer Symptoms ===
ESX installation fails; when it succeeds, the upgrade fails (patches cannot be transferred
to the storage), and hosts often hit a PSOD
Case Study 4
Compute
Storage
Networking
Case Study 4
Initial checks
• Tried different ESX installation images.
• Checked the compatibility matrix for VMware, UCS, drivers and NetApp.
• Checked the configuration on NetApp, Nexus and UCS.
• Verified that the LUN is presented to the host.
• NetApp confirmed that clustered Data ONTAP was configured properly and that no errors were reported in the logs.
• Low number of CRC errors on some of the physical interfaces.
Case Study 4
N5K-1 N5K-2
FI-A FI-B
Case Study 4
Troubleshooting
• Problem isolated to N5K-2, which showed CRC errors on several ports
• Possible causes for CRC errors:
• Faulty cable
• Faulty transceiver
• Faulty ASIC
• The problem was not related to cabling or transceivers, and multiple ASICs were affected, but only intermittently
Potential HW failure??
• After rebooting N5K-2, the problem appeared to move to N5K-1. Still a very low number of CRC errors (~100)
• HW failure unlikely, potential SW defect??
Case Study 4
Troubleshooting cont.
• Recreation in the TAC lab was unsuccessful -> the only difference was the cabling:
• Fiber instead of twinax
Conclusion
• DC environments can be complex
• Narrowing down the issue is extremely important:
• Simplify the topology as much as possible
• Storage? Networking? Compute?
• Isolate the issue to a device
• Troubleshoot the device
Q&A
Complete Your Online Session Evaluation
• Give us your feedback and receive a Cisco 2016 T-Shirt by completing the Overall Event Survey and 5 Session Evaluations.
– Directly from your mobile device on the Cisco Live Mobile App
– By visiting the Cisco Live Mobile Site (URL)
– Visit any Cisco Live Internet Station located throughout the venue
– T-Shirts can be collected in the World of Solutions on Friday from 12:00pm - 2:00pm
Don't forget: Cisco Live sessions will be available for viewing on-demand after the event at CiscoLiveAPAC.com
Continue Your Education
• Demos in the Cisco Campus
• Walk-in Self-Paced Labs
• Table Topics
• Meet the Engineer 1:1 meetings
Thank you