
Strategies and Tactics:

Application Troubleshooting Simplified


How to speed time-to-root cause with network traffic recording and application-centric analysis

Fluke Networks Is
Global
A $300+ million company
Profitable since its inception as a separate operating entity
Over 800 employees worldwide service customers in more than 120 countries
Approximately 30% of revenue from outside the U.S.
Worldwide Headquarters: Everett, WA
Major Facilities: Colorado Springs, CO; Duluth, GA; Bridgewater, NJ; Rockville, MD
Sales Offices & Associates: Worldwide
Technical Assistance Centers: Everett, WA; Eindhoven, NL

Fluke Networks Is
Thriving
Backed by a $12B corporate parent, Danaher Corporation
Fluke Networks and Fluke are both part of Danaher (NYSE: DHR)

Trusted
Trusted by 98 of the Fortune 100 who use Fluke Networks solutions to deploy, solve, manage and optimize their networks.

Agenda
Today's Challenges
  Complexity: applications and the infrastructure that delivers them
  Change: you think you know your network? Wanna bet?
  Triage: determining just who owns this problem
  Root cause analysis (RCA): what is the specific cause of latency?
Best Practices
  Getting in the Path of the Packets
  Capturing all the Packets
  Discovering Problems before the Customer Discovers Them
  Resolving Problems in a Timely Manner

Challenges

Complexity? You got it!

...and this is just the view of the app inside of a data center. What is happening in that user's network?

Change You think you know your network?


The network is in a constant state of change.
Making assumptions about
  the path packets take
  utilization levels on ports
  traffic distributions
can lead to increased problem resolution time.
Without a clear view of the current state of the network, it is very difficult to quickly resolve network and application related problems.

Triage Determining just who owns the problem


"It's the network!"
Whether it is a network problem, server problem, or application problem, the network always gets blamed first.
The faster we can determine the fault domain of the problem, the faster we can get the right resources working on it.

Root Cause Analysis (RCA)


Without a history of normal network operation, it is difficult to determine what is not normal.
Keeping a history of
  utilization levels
  round-trip latencies
  protocol distributions
  packet captures of working applications
allows us to get to the root of the problem without chasing symptoms that are not really part of it.
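A baseline comparison of this kind can be sketched in a few lines. The example below is only an illustration, assuming that a history of round-trip latencies is available as a plain list; the sample values, the function name `is_abnormal`, and the three-sigma rule are assumptions, not something prescribed by any Fluke Networks tool.

```python
from statistics import mean, stdev


def is_abnormal(history, current, sigmas=3.0):
    """Return True when `current` is more than `sigmas` standard
    deviations away from the mean of the recorded history."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is abnormal.
        return current != baseline
    return abs(current - baseline) > sigmas * spread


# Hypothetical round-trip latencies (ms) recorded while the application worked well.
normal_rtt_ms = [21.0, 19.5, 20.2, 22.1, 20.8, 19.9, 21.4]

print(is_abnormal(normal_rtt_ms, 20.5))  # a reading close to the baseline
print(is_abnormal(normal_rtt_ms, 48.0))  # a reading far above the baseline
```

The same check applies to utilization levels or protocol distributions, as long as a history of normal operation has been kept.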

What best practices address the challenges?


Challenges
  Complexity
  Change
  Triage
  Root Cause Analysis

Best Practices
  Getting in the path of the packets
  Capturing all the packets
  Discovering problems before the customer does
  Resolving problems in a timely manner

Best Practices
Getting in the Path of the Packets
  No network documentation
  Understanding application dependencies
  Tapping technologies
  Virtual machines

Capturing all the Packets
  High bandwidth utilization
  What happened yesterday at 3pm?

Discovering Problems before the Customer Does
  Network is the backbone for everything
  Automatically picking out problems from gigabytes' worth of data

Resolving Problems in a Timely Manner
  Need better understanding of how applications work
  Remote offices

Getting in the Path of the Packets


Why is this important?
In order to analyze the application traffic and troubleshoot the problem, we must have the application packets in the capture buffer.
The only way to get these packets into the buffer is to get in the path of the packets between the client and the server.

Getting in the Path of the Packets


Knowing the flow of the packets
Network administrators often think they know the path packets are taking through the network.
In many cases, however, the packets are taking a different path, which makes the troubleshooting process much more difficult.
Without knowing the exact path, we cannot guarantee that we are in the path of the packets.
Not only must we know the Layer 3 flow of the packets, we also need to know the Layer 2 flow.

Demonstration of Layer 2 and Layer 3 Traceroute

Knowing Application Dependencies


Why is this important?
Unless we know the devices on which a client, server, service or application is dependent, we do not know which paths to monitor.
Once the conversations are documented, we can isolate the path between the two endpoints and connect the monitoring equipment.
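Documenting conversations amounts to building a map of who talks to whom. The sketch below is a minimal illustration, assuming conversation records are available as (client, server, port) tuples; the addresses, ports, and function names are hypothetical and are not the output format of OptiView or any other tool.

```python
from collections import defaultdict


def build_dependencies(flows):
    """Map each host to the set of (server, port) pairs it initiates conversations with."""
    deps = defaultdict(set)
    for client, server, port in flows:
        deps[client].add((server, port))
    return dict(deps)


def dependency_chain(deps, start):
    """List every host reachable from `start` by following conversations."""
    seen, stack = [], [start]
    while stack:
        host = stack.pop()
        for server, _port in sorted(deps.get(host, ())):
            if server not in seen:
                seen.append(server)
                stack.append(server)
    return seen


# Hypothetical conversation records: client -> web -> application -> database, plus DNS.
flows = [
    ("10.0.0.15", "10.1.1.10", 80),
    ("10.1.1.10", "10.1.2.20", 8080),
    ("10.1.2.20", "10.1.3.30", 1433),
    ("10.0.0.15", "10.1.1.5", 53),
]

deps = build_dependencies(flows)
print(dependency_chain(deps, "10.0.0.15"))
```

Every host in the resulting chain marks a path segment that may need monitoring when the client reports a problem.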

Host Conversations

[Diagram: Data Center with DHCP Server, Single Sign-on Server, Web Servers, Application Servers, and Database Servers]

OptiView Host Conversations Demonstration

Getting in the path


Once we know the exact path of the packets, it is time to get into that path.
There are three common methods of getting in the path and capturing the packets:
  Hub
  Span
  Tap
Each of these methods has its own pros and cons.

Hubs
Pros
  Cheap
  Available
  Easy to install
Cons
  Reduce link to half duplex
  May not be a true hub
  Not practical on servers or switch uplinks
  If power drops, link drops
  10/100 Mbps speeds

Span/Mirror Ports
Pros
  Free
  Available
  Does not require link to be dropped
  Great for one-time link monitoring
Cons
  Requires switch access
  Configuration mistakes can result in network outages
  Can quickly become over-provisioned
  Requires a free switch port
[Figure: Catalyst 3550 switch front panel]

Taps
Pros
  Truly monitors full-duplex traffic
  If power is lost, link stays active
  Can monitor gigabit links without packet loss
  Once installed, can stay
Cons
  Most expensive option
  Have to break the link to install
  Can over-provision the monitor port and drop packets

Tap Deployment
Analysis equipment can be quickly connected to the network, without the need for configuration changes.
Aggregators can be used to merge the traffic from multiple taps into a single stream.
This allows a single analyzer to monitor traffic at multiple locations as well as redundant paths.
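Merging several taps into one stream is where the monitor port can be over-provisioned: a full-duplex link carries traffic in both directions, and all of it lands on a single monitor port. The arithmetic can be sketched as follows; the link loads and the 1 Gbps monitor port are made-up figures for illustration.

```python
def monitor_port_load(links, monitor_gbps):
    """links: list of (rx_gbps, tx_gbps) for each tapped full-duplex link.

    Returns the total offered load in Gbps and whether it exceeds
    the capacity of the monitor port."""
    offered = sum(rx + tx for rx, tx in links)
    return offered, offered > monitor_gbps


# Two hypothetical gigabit links, each well below line rate in either direction.
tapped_links = [(0.40, 0.30), (0.25, 0.20)]

offered, oversubscribed = monitor_port_load(tapped_links, monitor_gbps=1.0)
print(f"offered load: {offered:.2f} Gbps, oversubscribed: {oversubscribed}")
```

Even though neither link is busy on its own, the merged stream exceeds what a 1 Gbps monitor port can carry, so the aggregator would have to buffer or drop packets.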

Tap Deployment
Having taps deployed at key locations provides easy access for the analysis equipment.
These points include:
  In front of server farms
  At the Internet connection
  Switch uplinks
  Demarcation points between responsibilities

Tap Deployment

[Diagram: Data Center with DHCP Server, Single Sign-on Server, Web Servers, Application Servers, and Database Servers]

Capturing in a Virtual Machine Environment


The use of virtual servers creates unique challenges when it comes to capturing packets.
There is no place to attach a physical tap.
The analyzer must be installed on the same virtual server as the virtual machines that are to be monitored.
Using the vSwitch within the virtual server, the traffic can be spanned from the virtual machines to the virtual analyzer machine.

Capturing in a Virtual Machine Environment

NTM Connecting to the vSwitch

Capturing all the Packets


Why is this important?
Without all of the packets, we are not able to analyze the application traffic. Some examples are:
  VoIP traffic: if we do not capture all of the traffic, the analyzer will report a lower Mean Opinion Score (MOS) due to apparent packet loss.
  TCP traffic: the analyzer will report missing segments, which gives the appearance that packets are being lost on the network when in fact they are not.

If the capture buffer is not big enough, the packets will roll out of the buffer before anyone knows the problem even occurred.
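How quickly packets roll out of the buffer is simple arithmetic: storage size divided by the rate at which traffic arrives. The sketch below assumes a steady average utilization and ignores per-packet storage overhead; the 4 TB store and 30% utilization are illustrative numbers, not product specifications.

```python
def retention_seconds(buffer_bytes, link_bps, avg_utilization):
    """Seconds of traffic a capture store can hold before the oldest packets are overwritten."""
    bytes_per_second = link_bps * avg_utilization / 8
    return buffer_bytes / bytes_per_second


STORE_BYTES = 4e12  # hypothetical 4 TB packet store

for name, bps in (("1 Gbps", 1e9), ("10 Gbps", 10e9)):
    hours = retention_seconds(STORE_BYTES, bps, avg_utilization=0.30) / 3600
    print(f"{name} link at 30% utilization: about {hours:.1f} hours of history")
```

Under these assumptions the same store holds over a day of traffic from a gigabit link but only a few hours from a 10 Gigabit link, which is why buffer sizing matters when a problem is reported long after it occurred.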

High Performance Packet Capture


In order to capture all of the packets, we must:
  Have a hardware capture card that can keep up with the data rate of the network. This was easy in the 10/100 days; with the deployment of 10 Gigabit networks, it has become much more difficult.
  Apply capture filters to the captured packets and discard those that do not match the filter.
  Transfer the filtered packets to the storage system at a rate equal to the data rate of the network.
  Index the captured traffic in such a way that it can be retrieved quickly by the protocol analyzer.

High Performance Packet Capture


Ethernet traffic is captured from multiple ports at full line rates by an FPGA-based capture card; hardware filters are supported.
Entire frames are sent to the Packet Store repository for storage and post analysis.
Entire frames are also sent to the various analytical and real-time monitoring engines that process, classify and index data; this information is stored in the metadata database.
Atlas is the software interface that provides access to the rich network metadata information collected and created.
For troubleshooting and in-depth network analysis, a packet view engine facilitates fundamental protocol and multi-segment flow analysis.

10 Gig Packet Capture


H/W filter & frame de-duplication
Full Line Rate Capture with 2Gbps buffer
Fast PCI-e bus

10Gbps Adapter Card (2*10G XFP)

1Gbps Adapter Card (4*1G SFP)

How Stream to Disk Works


All NTMs use a RAID controller for high performance stream-to-disk.
All NTMs carry multiple disks to support multi-thread storage.
Large storage capacity models support RAID 5 for redundancy.
All NTMs are specified with true capacity:
  True packet storage space
  Additional storage available for the OS and metadata

Deployment of Stream to Disk


Typically deployed in the data center, in front of application servers, database servers, and VoIP servers.
For troubleshooting purposes, can be deployed in a portable fashion to capture traffic over long periods of time.

Going Back in Time


Often a problem occurs, but no one reports it until several hours or days later.
The ability to go back in time allows the network analyst to search through the captured packets quickly and extract those related to the problem.
The Network Time Machine provides the ability to select traffic by interface, time range, and device address.
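Selecting by interface, time range, and device address is essentially a three-part filter over indexed packet records. The sketch below illustrates the idea only; the `PacketRecord` type, the port names, and the sample data are hypothetical and do not reflect how the Network Time Machine stores or indexes traffic.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PacketRecord:
    timestamp: datetime
    interface: str
    src: str
    dst: str


def select_packets(records, interface, start, end, address):
    """Keep records seen on `interface` between `start` and `end`
    that have `address` as either source or destination."""
    return [
        r for r in records
        if r.interface == interface
        and start <= r.timestamp <= end
        and address in (r.src, r.dst)
    ]


# Hypothetical index entries around "yesterday at 3pm".
records = [
    PacketRecord(datetime(2024, 1, 10, 14, 58, 0), "port1", "10.0.0.15", "10.1.2.20"),
    PacketRecord(datetime(2024, 1, 10, 15, 0, 5), "port1", "10.0.0.15", "10.1.2.20"),
    PacketRecord(datetime(2024, 1, 10, 15, 1, 0), "port2", "10.0.0.15", "10.1.2.20"),
    PacketRecord(datetime(2024, 1, 10, 15, 2, 0), "port1", "10.0.0.99", "10.1.3.30"),
    PacketRecord(datetime(2024, 1, 10, 15, 3, 0), "port1", "10.1.2.20", "10.0.0.15"),
]

matches = select_packets(
    records,
    interface="port1",
    start=datetime(2024, 1, 10, 15, 0, 0),
    end=datetime(2024, 1, 10, 15, 5, 0),
    address="10.1.2.20",
)
print(len(matches), "of", len(records), "records match")
```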

Going Back in Time Demonstration

Add Filter to narrow down scope

Deployment of Stream to Disk


Fixed Location: Data Center
  Server farms
  Database servers
  Load balancers

Portable Solution
  Remote offices

Capture to Disk Deployment

[Diagram: Data Center with DHCP Server, Single Sign-on Server, Web Servers, Application Servers, and Database Servers]

Discovering Problems before your Customer Does


The network has become the backbone for most, if not all, of the communications within an organization. These include:
  E-mail
  Phone traffic (VoIP)
  Video (video conferencing and surveillance)
  Business-critical applications
  Applications that are not business-critical but still important to the end user (Facebook)

When these services are not performing well, the customer wants them fixed, and fixed now!

Discovering Problems before your Customer Does


Not so easy.
There are terabytes of information going across the network every day.
Most of this traffic is working properly; the trick is pulling out the traffic that is not.

Discovering Problems before your Customer Does


Monitoring!
The customer should not be used as a monitoring device.
It is important to be able to discover network related problems before the customer discovers them.
To accomplish this, we need to be able to:
  Perform real-time analysis of the network traffic
  Take advantage of information available to us through SNMP, RMON, and NetFlow
  Set monitoring thresholds and alarms to notify us when things are not performing as they should

Real Time Analysis


It is important that the protocol analyzer be able to analyze packets not only after they have been captured, but as they are being captured.
This allows you to detect problems as they are occurring, instead of waiting until the customer reports a problem.
Detection can be combined with alerting, so that notifications are sent out when problems occur.
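Combining detection with alerting usually means evaluating a threshold as each sample arrives rather than after the fact. The sketch below is one simple way to do that, assuming a stream of utilization samples; the threshold, the three-in-a-row rule, and the sample values are illustrative choices rather than defaults of any analyzer.

```python
def watch(samples, threshold, consecutive=3):
    """Yield the index of the sample at which the metric has been above
    `threshold` for `consecutive` samples in a row.

    Requiring several samples in a row avoids alarming on a single spike."""
    run = 0
    for i, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run == consecutive:
            yield i


# Hypothetical link utilization samples (%), one per polling interval.
utilization = [40, 85, 90, 60, 88, 91, 95, 97, 50]

alarms = list(watch(utilization, threshold=80))
for index in alarms:
    print(f"ALARM: utilization above 80% for 3 samples in a row (sample {index})")
```

Because `watch` is a generator, it can consume samples as they are produced, which matches the idea of analyzing traffic while it is being captured.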

Interface Utilization and Errors


Packet loss and link congestion contribute to slow applications.
Eliminating these problems from the network will positively impact all of the applications running across the network.
Monitoring routers and switches using SNMP will allow you to quickly isolate those links experiencing high utilization and interface errors.
The OptiView Portable Network Analyzer provides the capability to collect these values and graph them in a useful fashion.
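Link utilization is commonly derived from two SNMP polls of an interface octet counter (for example IF-MIB `ifInOctets`) together with `ifSpeed`. The sketch below shows that calculation, including the wrap of a 32-bit counter; the counter readings are made up, and this is a generic calculation rather than a description of how OptiView computes its graphs.

```python
COUNTER32_MAX = 2 ** 32  # a 32-bit SNMP counter wraps back to zero at this value


def utilization_percent(octets_start, octets_end, interval_s, if_speed_bps,
                        counter_max=COUNTER32_MAX):
    """Percent utilization in one direction between two counter readings.

    The modulo handles a single counter wrap between the two polls."""
    delta_octets = (octets_end - octets_start) % counter_max
    return delta_octets * 8 * 100 / (interval_s * if_speed_bps)


# Hypothetical readings taken 60 seconds apart on a 100 Mbps interface.
print(utilization_percent(1_000_000, 376_000_000, 60, 100_000_000))

# The counter wrapped between polls; the delta is still computed correctly.
print(utilization_percent(4_294_000_000, 1_000_000, 60, 100_000_000))
```

On fast links a 32-bit counter can wrap more than once between polls, in which case the 64-bit counters (`ifHCInOctets`) or a shorter polling interval are the usual remedy.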

Interface Utilization and Errors


FCS/CRC errors are a common problem on many networks.
These errors result in packet loss, which in turn results in the retransmission of packets.
Retransmission delays cause application delays.
A typical cause of FCS/CRC errors is a duplex mismatch.
The OptiView Portable Network Analyzer displays the number of errors seen on each port, thereby reducing the time it takes to isolate packet loss.

Utilization and Interface Monitoring

[Diagram: Data Center with DHCP Server, Single Sign-on Server, Web Servers, Application Servers, and Database Servers]

Application Response Time


If the infrastructure supporting the application is running slowly, then the application will run slowly.
By monitoring the time it takes to traverse the network and connect to the server, we are able to either implicate or eliminate the network as the cause of application slowdowns.
Monitoring these application ports over time will give us a baseline of the typical response time, which can then be compared with time periods when the application appears to be slow.
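The time to traverse the network and connect to the server can be approximated by timing a TCP connection to the application port. The sketch below is a minimal illustration using Python's standard `socket` module; the helper name is hypothetical, and a local listener stands in for a real application server so the example runs anywhere.

```python
import socket
import time


def tcp_connect_time_ms(host, port, timeout=2.0):
    """Time, in milliseconds, taken to complete a TCP handshake with host:port."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0


# A listener on the loopback interface stands in for the application server.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_host, demo_port = listener.getsockname()

connect_ms = tcp_connect_time_ms(demo_host, demo_port)
listener.close()
print(f"TCP connect time: {connect_ms:.3f} ms")
```

Recording this value at regular intervals against the real server and port builds the baseline that a slow period can later be compared with.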

Application Response Time Monitoring

[Diagram: Data Center with DHCP Server, Single Sign-on Server, Web Servers, Application Servers, and Database Servers]

Utilizing SNMP Data


Virtually every device on the network has an SNMP agent.
These agents can provide information about the performance, utilization, and faults on the device.
This information includes:
  Host Resource Tables
  Route Tables
  ARP Caches

Host Resource Table


SNMP-enabled servers can be accessed with the OptiView Portable Network Analyzer.
From these servers we can pull information about:
  Memory and CPU utilization
  Running processes
  Disk utilization
  Number of users

Host Resource Table Demonstration

Resolving Problems in a Timely Manner


To minimize the impact of application problems on the client, it is important to resolve the problems in a timely manner.
Factors that reduce the amount of time necessary to resolve problems are:
  Understanding the application in terms of its dependencies, data flows, and response times
  Capturing in multiple locations and merging the packet captures to isolate packet loss and latency
  Playing back multimedia traffic to view the end user experience

Understanding Applications
While the network analyst does not need to understand applications down to the code level, it is important to understand the network traffic related to applications.
This understanding will help reduce the amount of time it takes to troubleshoot the application.
A good practice is to capture the application traffic when the application is running well. This good capture can be compared with the problem trace to reduce the amount of time it takes to isolate the problem.

Application Centric Analysis


Application Centric Analysis is the process of taking a top-down approach to application troubleshooting, as opposed to a bottom-up approach.
If it can be shown that the network is transporting traffic as it should, we can begin troubleshooting the application by looking at data flows instead of packets.
This gives us a better picture of where the application may be failing, instead of digging through thousands of packets.

What is a Transaction?
Business Transaction
  Purchase 100 shares of Danaher stock
User Action
  Go to Trade Page
  Look up Danaher Symbol
  Enter Symbol and Qty
  Submit Order
Application Transaction
  GET /tradepage.aspx
  GET /border.gif
  GET /dnarrow.gif
  GET /displayDHR.gif
  GET /stylesheet.css
  GET /javascript.js
  POST /submit_order.asp
Packets
  Packet #1 through Packet #16

Demonstration of Application Centric Analysis

Multi-Segment Analysis
In order to get a complete picture of the problem, we may need to see both sides of the conversation at the same time.
By capturing on both sides and merging that traffic together, we are able to quickly identify the source of packet loss and delays.
To perform this multi-segment analysis, we must be able to synchronize the traces based on time stamp.
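The idea behind merging two captures can be sketched as matching the same packet in both traces and comparing timestamps. The example below assumes the analyzer clocks are already synchronized and that each packet can be matched by some identifier (for instance an IP ID or TCP sequence number); the dictionaries and values are hypothetical and do not represent the ClearSight trace format.

```python
def correlate(near_side, far_side):
    """Each capture maps a packet identifier to its timestamp in seconds.

    Returns the transit delay of every packet seen in both captures and
    the identifiers seen on the near side that never reached the far side."""
    delays = {
        pid: far_side[pid] - ts
        for pid, ts in near_side.items()
        if pid in far_side
    }
    lost = sorted(pid for pid in near_side if pid not in far_side)
    return delays, lost


# Hypothetical captures taken on the client side and the server side of a firewall.
client_capture = {1: 0.000, 2: 0.010, 3: 0.020, 4: 0.030}
server_capture = {1: 0.002, 2: 0.012, 4: 0.095}

delays, lost = correlate(client_capture, server_capture)
for pid, delay in delays.items():
    print(f"packet {pid}: {delay * 1000:.1f} ms across the segment")
print("lost between capture points:", lost)
```

A packet with an unusually long transit time, or one missing from the far-side trace, points at the segment between the two capture points as the place to look.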

Multi-Segment Analysis
ClearSight merges trace files from both analyzers
[Diagram: Client, Network, Web Server, OptiView]

Multi-Segment Analysis
Firewall Latency
Router Latency
Core Latency

Multimedia Playback
In some cases it takes more than just looking at packets to resolve an application problem.
When troubleshooting VoIP and video problems, it is helpful to be able to play the media stream back to view the quality.
Problems such as echo with VoIP cannot be determined by looking at the statistics or packets. The only way to detect echo is to listen to the audio stream.

Keys to Troubleshooting Multimedia


Need to have the appropriate codecs available on the analysis equipment to play back media
Measurement of metrics:
  MOS
  R-Factor
  V-Factor
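MOS and R-Factor are related: the ITU-T G.107 E-model defines a standard conversion from an R-Factor to an estimated MOS. The sketch below implements that published formula; whether any particular analyzer uses exactly this conversion is an assumption, and the sample R values are illustrative.

```python
def r_factor_to_mos(r):
    """Estimated MOS for a transmission rating factor R (ITU-T G.107 conversion)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


for r in (93.2, 80.0, 70.0, 50.0):
    print(f"R = {r:5.1f} -> MOS = {r_factor_to_mos(r):.2f}")
```

Packet loss and delay lower the R-Factor, so a capture that misses packets will drag the computed MOS down even when the call itself was fine.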

Where to deploy the Equipment


The placement of the analysis equipment has a significant impact on the analysis.
An analyzer placed close to the source of the multimedia traffic may not see the same problems as one placed near the destination.

Portable Solution
Having a portable analysis solution allows the analyst to move around and connect at various locations to isolate the problem.
In the case of remote offices, the analysis solution can be shipped to the office to capture the end user experience.

Use of Taps
Having taps installed ahead of time provides a quick and easy way to connect the analyzer.
The use of taps ensures that the timing of the multimedia packets is not changed, which could adversely impact the metrics.

VoIP Analyzer Deployment

[Diagram: Data Center with VoIP Server, Single Sign-on Server, Web Servers, Application Servers, and Database Servers]

Demonstration of Multimedia Playback

Summary of Best Practices and Challenges

Best Practice: Getting in the Path of the Packets
  Method: Flow of the packets; Application Dependencies; Span/Tap
  Fluke Networks Tools: OptiView Traceswitch Route; OptiView ICMP Traceroute; OptiView Host Conversations; Fluke Networks Taps

Best Practice: Capturing All the Packets
  Method: High Performance Packet Capture; Capture to Disk; Back in Time
  Fluke Networks Tools: Network Time Machine

Best Practice: Discovering Problems before the Customer Does
  Method: Real Time Analysis; Using SNMP Data
  Fluke Networks Tools: Atlas Metrics; OptiView Interface Utilization and Errors; OptiView Application Response Time; OptiView Host Resource Table

Best Practice: Diagnosing Problems in a Timely Manner
  Method: Understanding Applications; Multimedia Analysis
  Fluke Networks Tools: ClearSight Analyzer Application Centric Analysis; ClearSight Analyzer Multi-segment Analysis; Network Time Machine High Performance Capture; ClearSight Analyzer Multimedia Playback

Resources
90-Day ClearSight Trial: requires the unique Proof of Purchase (POP) Code found on the ClearSight flyer handed out at the seminar
14-Day ClearSight Trial: if you misplaced your POP Code, you can download the 14-day trial at www.flukenetworks.com/csatrial
Application-Centric Resource Center: www.flukenetworks.com/app-centric
Network Forensics Resource Center: www.flukenetworks.com/ntmresources
Portable Network Analysis: www.flukenetworks.com/optiview
Request OptiView 5-Day Evaluation: www.flukenetworks.com/optivieweval

For additional information: Email: info@flukenetworks.com. Phone 800-283-5853 (US/Canada) or 425-446-4519 (other locations).
