Strategies and Tactics: Application Troubleshooting Simplified

How to speed time-to-root cause with network traffic recording and application-centric analysis
Fluke Networks Is
Global
  A $300+ million company
  Profitable since its inception as a separate operating entity
  Over 800 employees worldwide serving customers in more than 120 countries
  Approximately 30% of revenue from outside the U.S.
  Worldwide Headquarters: Everett, WA
  Major Facilities: Colorado Springs, CO; Duluth, GA; Bridgewater, NJ; Rockville, MD
  Sales Offices & Associates: Worldwide
  Technical Assistance Centers: Everett, WA; Eindhoven, NL
Fluke Networks Is
Thriving
  Backed by a $12B corporate parent, Danaher Corporation
  Fluke Networks and Fluke are both part of Danaher (NYSE: DHR)
Trusted
  Trusted by 98 of the Fortune 100, who use Fluke Networks solutions to deploy, solve, manage and optimize their networks
Agenda
Today's Challenges
  Complexity: Applications and the infrastructure that delivers them
  Change: You think you know your network? Wanna bet?
  Triage: Determining just who owns this problem
  Root cause analysis (RCA): What is the specific cause of latency?
Best Practices
  Getting in the Path of the Packets
  Capturing all the Packets
  Discovering Problems before the Customer Discovers Them
  Resolving Problems in a Timely Manner
Challenges
Complexity? You got it!
And this is just the view of the app inside a data center. What is happening in that user's network?
Change: You think you know your network?

The network is in a constant state of change
Making assumptions about
  the path packets take
  utilization levels on ports
  traffic distributions
can lead to increased problem resolution time
Without a clear view of the current state of the network, it is very difficult to quickly resolve network- and application-related problems
Triage: Determining just who owns the problem

"It's the network!"
Whether it is a network problem, server problem, or application problem, the network always gets blamed first
The faster we can determine the fault domain of the problem, the faster we can get the right resources working on it
Root Cause Analysis (RCA)

Without a history of normal network operation, it is difficult to determine what is not normal
Keeping a history of:
  Utilization levels
  Round-trip latencies
  Protocol distributions
  Packet captures of working applications
allows us to get to the root of the problem without chasing symptoms that are not really part of the problem
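
As a rough illustration of how a recorded history can separate "normal" from "not normal", the sketch below flags a new measurement that sits far from the baseline mean. The function name, the three-sigma rule and the sample latencies are assumptions made for this example; they are not taken from any Fluke Networks tool.

```python
from statistics import mean, stdev


def is_abnormal(history, current, n_sigma=3.0):
    """Return True when `current` lies more than n_sigma sample standard
    deviations away from the mean of the recorded history."""
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) > n_sigma * sigma


# Hypothetical round-trip latencies (ms) recorded while the application was healthy.
baseline_rtt_ms = [20.0, 22.0, 21.0, 19.0, 20.0, 23.0, 21.0]

print(is_abnormal(baseline_rtt_ms, 22.5))  # within the usual spread
print(is_abnormal(baseline_rtt_ms, 80.0))  # far outside the usual spread
```

The same check could be applied to utilization levels or protocol distributions, provided the baseline was collected during periods known to be healthy.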
What best practices address the challenges?

Challenges
  Complexity
  Change
  Triage
  Root Cause Analysis
Best Practices
  Getting in the path of the packets
  Capturing all the packets
  Discovering problems before the customer does
  Resolving problems in a timely manner
Best Practices
Getting in the Path of the Packets
  No Network Documentation
  Understanding Application Dependencies
  Tapping Technologies
  Virtual Machines
Capturing all the Packets
  High Bandwidth Utilization
  What Happened Yesterday at 3pm?
Discovering Problems before the Customer Does
  Network is the backbone for everything
  Automatically picking out problems from gigabytes' worth of data
Resolving Problems in a Timely Manner
  Need better understanding of how applications work
  Remote offices

Why is this important?
In order to analyze the application traffic and troubleshoot the problem, we must have the application packets in the capture buffer
The only way to get these packets in the buffer is to get in the path of the packets between the client and server

Knowing the flow of the packets
Network administrators often think they know the path packets are taking through the network
In many cases, however, the packets are taking a different path, making the troubleshooting process much more difficult
Without knowing the exact path, we cannot guarantee that we are in the path of the packets
Not only must we know the Layer 3 flow of the packets, but we also need to know the Layer 2 flow of the packets
Demonstration of Layer 2 and Layer 3 Traceroute
Knowing Application Dependencies

Unless we know the devices on which a client, server, service or application depends, we do not know which paths to monitor
Once the conversations are documented, we can isolate the path between the two endpoints and connect the monitoring equipment
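
One way to picture "documenting the conversations" is to fold observed flows into a table of host pairs and then ask which peers a given host talks to. The sketch below is only an illustration: the record layout (source, destination, bytes), the function names and the documentation-range addresses are assumptions, not the OptiView data model.

```python
from collections import defaultdict


def host_conversations(flows):
    """Aggregate (src, dst, bytes) records into conversations keyed by the
    unordered host pair, so A->B and B->A count as one conversation."""
    table = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, dst, nbytes in flows:
        pair = tuple(sorted((src, dst)))
        table[pair]["packets"] += 1
        table[pair]["bytes"] += nbytes
    return dict(table)


def dependencies_of(host, conversations):
    """List every peer that `host` was seen talking to."""
    peers = set()
    for a, b in conversations:
        if a == host:
            peers.add(b)
        elif b == host:
            peers.add(a)
    return sorted(peers)


# Hypothetical observations: client .10, web server .20, database .30, DNS .53
flows = [
    ("192.0.2.10", "192.0.2.20", 1500),
    ("192.0.2.20", "192.0.2.10", 400),
    ("192.0.2.20", "192.0.2.30", 900),
    ("192.0.2.10", "192.0.2.53", 120),
]

conversations = host_conversations(flows)
print(dependencies_of("192.0.2.20", conversations))
```

Each pair in the result is a candidate path to monitor: here the web server depends on both the client-facing link and the link to the database server.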
Host Conversations
[Diagram: data center with DHCP server, single sign-on server, web servers, application servers and database servers]
OptiView Host Conversations Demonstration
Getting in the path

Once we know the exact path of the packets, it is time to get into that path
There are three common methods of getting in the path and capturing the packets
  Hub
  Span
  Tap
Each of these methods has its own pros and cons
Hubs
Pros
  Cheap
  Available
  Easy to install
Cons
  Reduce link to half duplex
  May not be a true hub
  Not practical on servers or switch uplinks
  If power drops, link drops
  10/100 Mbps speeds
Span/Mirror Ports
Pros
  Free
  Available
  Does not require link to be dropped
  Great for one-time link monitoring
Cons
  Requires switch access
  Configuration mistakes can result in network outages
  Can quickly become over-provisioned
  Requires a free switch port
[Figure: Catalyst 3550 switch front panel]
Taps
Pros
  Truly monitors full-duplex traffic
  If power is lost, link stays active
  Can monitor gigabit links without packet loss
  Once installed, can stay
Cons
  Most expensive option
  Have to break the link to install
  Can over-provision the monitor port and drop packets
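
The last con deserves a quick worked example. A full-duplex link carries traffic in both directions at once, so if both directions are merged onto a single monitor port of the same speed, the offered load can reach twice what that port can carry. The sketch below assumes an aggregating tap (or a span session) with one monitor port; the function name and the utilization figures are illustrative only.

```python
def monitor_port_load(link_mbps, tx_pct, rx_pct, monitor_mbps):
    """Return (offered_mbps, excess_mbps) when both directions of a
    full-duplex link are merged onto a single monitor port.

    tx_pct and rx_pct are the utilization of each direction in percent.
    """
    offered = link_mbps * (tx_pct + rx_pct) / 100
    excess = max(0.0, offered - monitor_mbps)
    return offered, excess


# Gigabit link at 70% transmit and 60% receive, feeding a gigabit monitor port:
print(monitor_port_load(1000, 70, 60, 1000))

# The same link at 30% and 20% fits comfortably:
print(monitor_port_load(1000, 30, 20, 1000))
```

Any non-zero excess is traffic the monitor port has to drop, which the analyzer will then misreport as loss on the network.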
Tap Deployment
Analysis equipment can be quickly connected to the network, without the need for configuration changes
Aggregators can be used to merge the traffic from multiple taps into a single stream
This allows a single analyzer to monitor traffic at multiple locations as well as redundant paths
Tap Deployment
Having taps deployed at key locations provides easy access for the analysis equipment
These points include:
  In front of server farms
  At the Internet connection
  Switch uplinks
  Demarcation points between responsibilities
Tap Deployment
[Diagram: data center with application servers and database servers]
Capturing in a Virtual Machine Environment

The use of virtual servers creates unique challenges when it comes to capturing packets
There is no place to attach a physical tap
The analyzer must be installed on the same virtual server as the virtual machines that are to be monitored
Using the vSwitch within the virtual server, the traffic can be spanned from the virtual machines to the virtual analyzer machine
Capturing in a Virtual Machine Environment
NTM Connecting to the vSwitch
Capturing all the Packets

Without all of the packets, we are not able to analyze the application traffic. Some examples are:
  VoIP traffic: If we do not capture all of the traffic, the analyzer will report a lower Mean Opinion Score (MOS) due to packet loss
  TCP traffic: The analyzer will report missing segments, which will give the appearance that packets are being lost on the network, when in fact they are not
If the capture buffer is not big enough, the packets will roll out of the buffer before anyone knows the problem even occurred
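
How big is "big enough" comes down to simple arithmetic: the time a packet survives in a rolling buffer is the buffer size divided by the average capture rate. The sketch below works that out; the function name and the buffer sizes and rates are made-up figures for illustration, not product specifications.

```python
def retention_seconds(buffer_bytes, avg_rate_bps):
    """Seconds of traffic a rolling capture buffer holds before the
    oldest packets are overwritten, at a given average rate in bits/s."""
    return buffer_bytes * 8 / avg_rate_bps


# Hypothetical 4 TB store fed at an average of 1 Gbps:
seconds = retention_seconds(4 * 10**12, 10**9)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")

# Hypothetical 2 GB in-memory buffer on a saturated 10 Gbps link:
print(f"{retention_seconds(2 * 10**9, 10**10):.1f} s")
```

If problems are typically reported hours or days after they occur, the retention time has to be at least that long, or the packets will already be gone.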
High Performance Packet Capture

In order to capture all of the packets, we must
  Have a hardware capture card that can keep up with the data rate of the network. This was easy in the 10/100 days; with the deployment of 10 Gigabit networks, it has become much more difficult
  Apply capture filters to the captured packets and discard those that do not match the filter
  Transfer the filtered packets to the storage system at a rate equal to the data rate of the network
  Index the captured traffic in such a way that it can be retrieved quickly by the protocol analyzer
High Performance Packet Capture

Ethernet traffic is captured from multiple ports at full line rates by an FPGA-based capture card; hardware filters are supported
Entire frames are sent to the Packet Store repository for storage and post analysis
Entire frames are also sent to the various analytical and real-time monitoring engines that process, classify and index data; this information is stored in the metadata database
Atlas is the software interface that provides access to the rich network metadata information collected and created
For troubleshooting and in-depth network analysis, a packet view engine facilitates fundamental protocol and multi-segment flow analysis
10 Gig Packet Capture

H/W filter & frame de-duplication
Full Line Rate Capture with 2Gbps buffer
Fast PCI-e bus
10Gbps Adapter Card (2*10G XFP)
1Gbps Adapter Card (4*1G SFP)
How Stream to Disk Works

All NTMs use a RAID controller for high-performance stream-to-disk
All NTMs carry multiple disks to support multi-threaded storage
Large storage capacity models support RAID5 for redundancy
All NTMs are specified with true capacity:
  True packet storage space
  Additional storage available for OS and metadata
Deployment of Stream to Disk

Typically deployed in the data center, in front of application servers, database servers, VoIP servers
For troubleshooting purposes, can be deployed in a portable fashion to capture traffic over long periods of time
Going Back in Time

Often a problem occurs, but no one reports it until several hours or days later
The ability to go back in time allows the network analyst to search through the captured packets quickly to extract those packets related to the problem
The Network Time Machine provides the ability to select traffic by interface, time range, and device address
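
The three selection criteria named above (interface, time range, device address) can be sketched as a filter over stored packet records. Everything in this example is hypothetical: the `PacketRecord` layout, the `select_packets` function and the sample data are stand-ins for illustration and do not describe how the Network Time Machine stores or indexes traffic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PacketRecord:
    timestamp: datetime
    interface: str
    src: str
    dst: str


def select_packets(records, interface=None, start=None, end=None, address=None):
    """Keep only records matching every criterion that was supplied."""
    selected = []
    for r in records:
        if interface is not None and r.interface != interface:
            continue
        if start is not None and r.timestamp < start:
            continue
        if end is not None and r.timestamp > end:
            continue
        if address is not None and address not in (r.src, r.dst):
            continue
        selected.append(r)
    return selected


base = datetime(2024, 1, 1, 15, 0, 0)  # a hypothetical "yesterday at 3pm"
records = [
    PacketRecord(base, "port1", "192.0.2.10", "192.0.2.20"),
    PacketRecord(base + timedelta(minutes=5), "port1", "192.0.2.30", "192.0.2.20"),
    PacketRecord(base + timedelta(minutes=5), "port2", "192.0.2.10", "192.0.2.40"),
    PacketRecord(base + timedelta(hours=2), "port1", "192.0.2.10", "192.0.2.20"),
]

hits = select_packets(
    records,
    interface="port1",
    start=base,
    end=base + timedelta(minutes=30),
    address="192.0.2.10",
)
print(len(hits))
```

A linear scan like this is only workable for small captures; at stream-to-disk volumes the records would need to be indexed by time and address at capture time.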
Going Back in Time Demonstration
Add Filter to narrow down scope
Deployment of Stream to Disk

Fixed Location: Data Center
  Server farms
  Database servers
  Load balancers
Portable Solution
  Remote offices
Capture to Disk Deployment
[Diagram: data center with application servers and database servers]
Discovering Problems before your Customer Does

The network has become the backbone for most if not all of the communications within an organization
These include
  E-mail
  Phone traffic (VoIP)
  Video (video conferencing and surveillance)
  Business-critical applications
  Non-business-critical applications that are still important to the end user (Facebook)
When these services are not performing well, the customer wants them fixed and fixed now!

Not so easy
There are terabytes of information going across the network every day
Most of this traffic is working properly; the trick is pulling out the traffic that is not working properly

Monitoring!
The customer should not be used as a monitoring device
It is important to be able to discover network-related problems before the customer discovers them
To accomplish this, we need to be able to:
  Perform real-time analysis of the network traffic
  Take advantage of information available to us through SNMP, RMON, NetFlow
  Set monitoring thresholds and alarms to notify us when things are not performing as they should
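
The threshold-and-alarm idea in the last point can be sketched in a few lines. The metric names, limits and message format below are assumptions chosen for the example; a real monitoring system would also handle notification delivery and alarm clearing.

```python
def check_thresholds(sample, thresholds):
    """Compare one set of measurements against upper limits and return a
    human-readable alarm for every metric that exceeds its limit."""
    alarms = []
    for metric, limit in thresholds.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            alarms.append(f"{metric}={value} exceeds threshold {limit}")
    return alarms


# Hypothetical limits and one polling sample.
thresholds = {"utilization_pct": 80, "rtt_ms": 100, "fcs_errors": 0}
sample = {"utilization_pct": 92, "rtt_ms": 40, "fcs_errors": 3}

for alarm in check_thresholds(sample, thresholds):
    print(alarm)
```

Sensible limits come from the recorded baseline of normal operation rather than from guesswork.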
Real Time Analysis

It is important that the protocol analyzer be able to analyze packets not only after they have been captured, but as they are being captured
This allows you to detect problems as they are occurring, instead of waiting until the customer reports a problem
Detection can be combined with alerting, so that notifications are sent out when problems occur
Interface Utilization and Errors

Packet loss and link congestion contribute to slow applications
Eliminating these problems from the network will positively impact all of the applications running across the network
Monitoring routers and switches using SNMP will allow you to quickly isolate those links experiencing high utilization and interface errors
The OptiView Portable Network Analyzer provides the capability to collect these values and graph them in a useful fashion
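
For readers who want to see where a utilization figure comes from, the sketch below derives it from two polls of an SNMP octet counter such as `ifInOctets` from the standard interfaces MIB. It covers only the arithmetic, not the polling itself. The function names and sample values are assumptions, and it presumes the counter wrapped at most once between polls.

```python
def counter_delta(previous, current, bits=32):
    """Difference between two readings of a wrapping SNMP counter.
    Correct as long as the counter wrapped at most once between polls."""
    return (current - previous) % (1 << bits)


def utilization_pct(prev_octets, curr_octets, interval_s, if_speed_bps, bits=32):
    """One-direction link utilization in percent over a polling interval."""
    delta_octets = counter_delta(prev_octets, curr_octets, bits)
    return delta_octets * 8 * 100 / (interval_s * if_speed_bps)


# Hypothetical 100 Mbps interface polled 60 seconds apart:
print(utilization_pct(1_000_000, 376_000_000, 60, 100_000_000))

# A 32-bit counter that wrapped between polls still yields the true delta:
print(counter_delta(4_294_967_000, 704))
```

On fast links a 32-bit counter can wrap more than once within a polling interval, so the 64-bit counters (`bits=64` here) or a shorter interval are the safer choice.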

FCS/CRC errors are a common problem on many networks
These errors result in packet loss, which in turn results in the retransmission of packets
Retransmission delays cause application delays
A typical cause of FCS/CRC errors is a duplex mismatch
The OptiView Portable Network Analyzer displays the number of errors seen on each port, thereby reducing the time it takes to isolate packet loss
Utilization and Interface Monitoring
[Diagram: data center with application servers and database servers]
Application Response Time

If the infrastructure supporting the application is running slowly, then the application will run slowly
By monitoring the time it takes to traverse the network and connect to the server, we are able to either implicate or eliminate the network as the cause of application slowdowns
Monitoring these application ports over time will give us a baseline of the typical response time, which can then be compared with time periods when the application appears to be slow
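
A simple stand-in for "the time it takes to traverse the network and connect to the server" is the duration of a TCP connection setup to the application port. The sketch below measures that with the standard library only. The function name is an assumption, and the demo connects to a throwaway listener on the local machine purely so the example is self-contained.

```python
import socket
import time


def tcp_connect_time_ms(host, port, timeout=3.0):
    """Time, in milliseconds, taken to complete a TCP connection to host:port."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0


if __name__ == "__main__":
    # Local listener used only to make the demo runnable without a real server.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    host, port = listener.getsockname()
    print(f"connect time: {tcp_connect_time_ms(host, port):.3f} ms")
    listener.close()
```

Recording this value at regular intervals gives the baseline; a connect time that stays flat while users report slowness points away from the network and toward the server or the application.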
Application Response Time Monitoring
[Diagram: data center with application servers and database servers]
Utilizing SNMP Data

Virtually every device on the network has an SNMP agent
These agents can provide information about the performance, utilization, and faults on the device
This information includes:
  Host Resource Tables
  Route Tables
  ARP Caches
Host Resource Table

SNMP-enabled servers can be accessed with the OptiView Portable Network Analyzer
From these servers we can pull information about:
  Memory and CPU utilization
  Running processes
  Disk utilization
  Number of users
Host Resource Table Demonstration
Resolving Problems in a Timely Manner

To minimize the impact of application problems on the client, it is important to resolve the problems in a timely manner
Factors that reduce the amount of time necessary to resolve problems are:
  Understanding the application's dependencies, data flows, and response times
  Capturing in multiple locations and merging the packet captures to isolate packet loss and latency
  Playing back multimedia traffic to view the end-user experience
Understanding Applications
While the network analyst does not need to understand applications down to the code level, it is important to understand the network traffic related to applications
This understanding will help reduce the amount of time it takes to troubleshoot the application
A good practice is to capture the application traffic when the application is running well
This good capture can be compared with the problem trace to reduce the amount of time it takes to isolate the problem
Application Centric Analysis

Application Centric Analysis is the process of taking a top-down approach to application troubleshooting as opposed to a bottom-up approach
If it can be shown that the network is transporting traffic as it should, we can begin troubleshooting the application by looking at data flows instead of packets
This gives us a better picture of where the application may be failing, instead of digging through thousands of packets
What is a Transaction?
Business Transaction
  Purchase 100 shares of Danaher stock
User Action
  Go to Trade Page
  Look up Danaher Symbol
  Enter Symbol and Qty
  Submit Order
Application Transaction
  GET /tradepage.aspx
  GET /border.gif
  GET /dnarrow.gif
  GET /displayDHR.gif
  GET /stylesheet.css
  GET /javascript.js
  POST /submit_order.asp
Packets
  Packet #1 through Packet #16
Demonstration of Application Centric Analysis
Multi-Segment Analysis
In order to get a complete picture of the problem, we may need to see both sides of the conversation at the same time
By capturing on both sides and merging that traffic together, we are able to quickly identify the source of packet loss and delays
To perform this multi-segment analysis, we must be able to synchronize the traces based on time stamp
ClearSight merges trace files from both analyzers
[Diagram: OptiView analyzers capturing between the client network and the web server, showing firewall latency, router latency and core latency]
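
The core of multi-segment analysis can be sketched as matching the same packet in two captures and comparing timestamps. The example below is an illustration only, not how ClearSight works internally. It assumes each capture has been reduced to a mapping from a packet key (here a hypothetical pair of IP ID and TCP sequence number) to a timestamp in microseconds, and that the two capture clocks are synchronized.

```python
def compare_segments(capture_a, capture_b):
    """Match packets seen at point A against point B.

    Returns (latencies, lost): the A-to-B transit time for every packet
    seen at both points, and the keys seen at A that never reached B.
    """
    latencies = {}
    lost = []
    for key, t_a in capture_a.items():
        t_b = capture_b.get(key)
        if t_b is None:
            lost.append(key)
        else:
            latencies[key] = t_b - t_a
    return latencies, lost


# Hypothetical timestamps in microseconds, keyed by (IP ID, TCP sequence number).
client_side = {(4001, 1000): 1_000_000, (4002, 2460): 1_000_400, (4003, 3920): 1_000_800}
server_side = {(4001, 1000): 1_002_500, (4003, 3920): 1_003_900}

latencies, lost = compare_segments(client_side, server_side)
print(latencies)
print(lost)
```

Repeating the comparison across each pair of adjacent capture points attributes the delay or loss to the firewall, router or core segment in between. Any clock offset between the analyzers shows up directly as latency error, which is why time-stamp synchronization is a precondition.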
Multimedia Playback
In some cases it takes more than just looking at packets to resolve an application problem
When troubleshooting VoIP and video problems, it is helpful to be able to play the media stream back to view the quality
Problems such as echo with VoIP cannot be determined by looking at the statistics or packets
The only way to detect echo is to listen to the audio stream
Keys to Troubleshooting Multimedia

Need to have the appropriate codecs available on the analysis equipment to play back media
Measurement of metrics
  MOS
  R-Factor
  V-Factor
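
MOS and R-Factor are related: the ITU-T G.107 E-model defines a standard conversion from an R-Factor to an estimated MOS. The sketch below implements that published mapping; the function name is an assumption, and a given analyzer may compute or round its reported scores differently.

```python
def r_factor_to_mos(r):
    """Estimated MOS for an E-model R-Factor, per the ITU-T G.107 mapping."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


for r in (50, 70, 80, 93.2):
    print(f"R = {r}: MOS is about {r_factor_to_mos(r):.2f}")
```

Because capture loss lowers the measured R-Factor, packets dropped by the analyzer itself translate directly into a lower reported MOS than the user actually experienced.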
Where to deploy the Equipment

The placement of the analysis equipment has a significant impact on the analysis
An analyzer placed close to the source of the multimedia traffic may not see the same problems as one placed near the destination
Portable Solution
Having a portable analysis solution allows the analyst to move and connect to various locations to isolate the problem
In cases of remote offices, the analysis solution can be shipped to the office to capture the end-user experience
Use of Taps
Having taps installed ahead of time provides a quick and easy way to connect the analyzer
The use of taps ensures that the timing of the multimedia packets is not changed, which could adversely impact the metrics
VoIP Analyzer Deployment
[Diagram: data center with VoIP server, single sign-on server, web servers, application servers and database servers]
Demonstration of Multimedia Playback
Summary of Best Practices and Challenges

Best Practice and Method
Getting in the Path of the Packets
  Flow of the Packets
  Application Dependencies
  Span/Tap
Capturing All the Packets
  High Performance Packet Capture
  Capture to Disk
  Back in Time
Discovering Problems before the Customer Does
  Real Time Analysis
  Using SNMP Data
Diagnosing Problems in a Timely Manner
  Understanding Applications
  Multimedia Analysis

Fluke Networks Tools
  OptiView Traceswitch Route
  OptiView ICMP Traceroute
  OptiView Host Conversations
  Fluke Networks Taps
  Network Time Machine
  Atlas Metrics
  OptiView Interface Utilization and Errors
  OptiView Application Response Time
  OptiView Host Resource Table
  ClearSight Analyzer Application Centric Analysis
  ClearSight Analyzer Multi-segment Analysis
  Network Time Machine High Performance Capture
  ClearSight Analyzer Multimedia Playback
Resources
90-Day ClearSight Trial: requires the unique Proof of Purchase (POP) Code found on the ClearSight flyer handed out at the seminar
14-Day ClearSight Trial: if you misplaced your POP Code, you can download the 14-day trial at www.flukenetworks.com/csatrial
Application-Centric Resource Center: www.flukenetworks.com/app-centric
Network Forensics Resource Center: www.flukenetworks.com/ntmresources
Portable Network Analysis: www.flukenetworks.com/optiview
Request OptiView 5-Day Evaluation: www.flukenetworks.com/optivieweval
For additional information: Email: info@flukenetworks.com. Phone 800-283-5853 (US/Canada) or 425-446-4519 (other locations).
