SAN Optimization Workshop
This workshop provides a high-level walkthrough of the most commonly experienced detrimental device behaviors
and explains how to use Brocade® products and features to protect the data center.
Faulty or improperly configured devices, misbehaving hosts, and faulty or substandard Fibre Channel (FC) media can
significantly affect the performance of FC fabrics and the applications they support. In most real-world scenarios,
these issues cannot be corrected or completely mitigated within the fabric itself; the behavior must be addressed
directly. However, with the proper knowledge and capabilities, the fabric can often identify and, in some cases,
mitigate or protect against the effects of these misbehaving components to provide better fabric resiliency.
Brocade offers many tools and features to assist with managing a robust Storage Area Network (SAN). Guidance on
best practices for using these capabilities is provided throughout this document. Many of the conditions and events
that adversely affect fabric operation are external to the switches themselves, but various mechanisms are available
in Brocade products to pinpoint the source of such problems.
Although there are certain aspects of today’s data centers that are common in most environments, no two data
centers are exactly alike, and no single set of configuration parameters applies universally to all environments. Brocade
works directly with customers in every type of SAN deployment to develop recommendations and guidelines that
apply to most environments. However, you should always validate all recommendations for your particular needs and
consult with your vendor or Brocade representative to ensure that you have the most robust design and configuration
available to fit your situation.
NOTE: Before proceeding, you should be familiar with, and have on hand, these documents:
2. Discover faulty media and identify the affected FRU, as shown in the example, using the dashboard,
event log, and GUI icons:
a. Take note of the highlighted “yellow” (marginal) status for switches SW6510_1
and SW6510_2 and the topology information displayed on the CMCNE home
screen.
3. In the dashboard view, double-click the Product Status and Traffic default dashboard.
4. In the 1-SAN Inventory pane, click the yellow portion of the top bar graph of the SAN
Inventory display.
5. A SAN Products Marginal box will open and display the switch(es) with faulty or marginal
hardware. Scroll horizontally to view all the information available.
6. To view the status of a switch, right-click it and select Properties.
7. As shown in the switch properties sheet above, there is an issue with a power supply on
the switch. This needs to be investigated further.
(In addition, you can identify faulty hardware by viewing logs from the main screen.
Logs are located under the SAN and IP tabs. Use the Monitor > Logs > Product Status
dropdown menu.)
Note that Brocade Element Manager and Web Tools GUIs can be launched and used for
problem verification at any time.
The example used here shows only a faulty power supply. However, the same process can
be followed to help diagnose any other failing or already faulty hardware component.
Other indications of faulty media are CRC errors, invalid words, and state changes
(“bouncing”). Search for these proactively in the Product Event Log.
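The three indicators named above can be checked together once the counters have been copied out of the event log. The following is a minimal Python sketch of that check. The PortCounters structure, the port names, and the limit values are assumptions made for illustration only; real thresholds come from the active MAPS policy for your FOS version.

```python
from dataclasses import dataclass


@dataclass
class PortCounters:
    """Counters for one port, entered by hand from the event log."""
    port: str
    crc_errors: int
    invalid_words: int
    state_changes: int


# Hypothetical limits, NOT Brocade defaults. Take real values from the
# MAPS Administrator's Guide for the FOS version in use.
LIMITS = {"crc_errors": 10, "invalid_words": 25, "state_changes": 5}


def suspect_media(ports, limits=LIMITS):
    """Return {port: [indicators over their limit]} for suspect ports only."""
    findings = {}
    for counters in ports:
        exceeded = [
            name for name, limit in limits.items()
            if getattr(counters, name) > limit
        ]
        if exceeded:
            findings[counters.port] = exceeded
    return findings


ports = [
    PortCounters("port-0", crc_errors=0, invalid_words=0, state_changes=0),
    PortCounters("port-1", crc_errors=50, invalid_words=0, state_changes=9),
]
print(suspect_media(ports))
```

A port that exceeds more than one limit at the same time is a stronger candidate for a cable or SFP check than a port with a single elevated counter.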
8. To view event logs for errors in the fabric, go to Monitor > Logs > Product Event Log.
9. Note that these events are based on the active MAPS default policy.
Optional Exercise: Please take the time now to review and familiarize yourself with the
Monitoring and Alerting Policy Suite Administrator’s Guide.
When using the Monitoring and Alerting Policy Suite Administrator’s Guide during
an actual customer engagement, be certain that it is the guide for the FOS version you are
using when you look up specific thresholds for faulty media and other triggers.
Misbehaving Devices
A common problem is abnormal behavior from high-latency end devices, with hosts and storage
typically being the most common offenders. These devices can cause unacceptable application
performance issues. Always keep in mind that the symptoms of misbehaving devices and faulty media
are similar. Additional RASlog messages were added to identify lost buffer credits.
Some examples of these messages in the RASlog are: CX-5679, 1012, 1014, 1015. Refer to
Brocade’s message reference guide for additional information. These messages can be viewed in
the event log and with the FOS errshow command.
Here are some examples of RASlog messages to look for. These can be viewed on the switch
using the errshow or errdump commands:
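When a saved copy of the command output is long, it can help to count only the message IDs of interest. The Python sketch below shows one way to do that. It assumes that each log line carries its message ID in square brackets; the sample lines and the MOD-nnnn IDs are placeholders invented for illustration, not real switch output, so adjust the pattern and the watched IDs to match what your switches actually print.

```python
import re

# Assumption: the message ID appears in square brackets, e.g. "[MOD-1001]".
MESSAGE_ID = re.compile(r"\[([A-Z0-9]+-\d+)\]")


def count_message_ids(lines, watched_ids):
    """Count how often each watched message ID appears in the log lines."""
    counts = {}
    for line in lines:
        match = MESSAGE_ID.search(line)
        if match and match.group(1) in watched_ids:
            message_id = match.group(1)
            counts[message_id] = counts.get(message_id, 0) + 1
    return counts


# Placeholder lines; the layout and IDs are hypothetical.
sample = [
    "2015/01/01-10:00:00, [MOD-1001], 10, WARNING, sw1, placeholder text",
    "2015/01/01-10:00:05, [MOD-2002], 11, INFO, sw1, placeholder text",
    "2015/01/01-10:00:09, [MOD-1001], 12, WARNING, sw1, placeholder text",
]
print(count_message_ids(sample, {"MOD-1001"}))
```

Repeated occurrences of the same ID on the same port over a short period are usually more informative than a single event.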
Congestion
Link congestion in a Fibre Channel network is caused by either latency or insufficient bandwidth.
Insufficient bandwidth is less common in today’s FC networks because of increases in
link bandwidth and proper sizing of ISLs; when it does occur, it is normally the result of poor design. That
leaves latency. Most latency is caused by slowly responding devices, which force
the fabric to spend resources holding frames longer than usual. Most latencies are found in hosts
rather than storage arrays, because hosts may become resource constrained, whereas storage arrays are
purpose-built and designed to handle I/O from many different hosts.
Latencies are categorized as “moderate” or “severe”. A severe latency is one that results in
frame loss, also known as C3 discards or timeouts. Device latency can become severe enough
to back up an ISL, especially if multiple devices are experiencing latency at the same
time. When that happens, healthy flows using the same ISL are also impacted by the
latency. This is known as a bottleneck. Bottlenecks are detected in Brocade fabrics through
appropriate alert settings such as MAPS C3 Discard alerts and Bottleneck Detection alerts.
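The relationship between a slow device, a shared ISL, and the healthy flows behind it can be sketched in a few lines of Python. This is an illustrative model only: the Flow structure, the device names, and the ISL names are hypothetical, and the classification rule simply restates the definition above (any frame loss or timeout makes a latency severe).

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """One host-to-target flow and the ISL it crosses (names are made up)."""
    host: str
    target: str
    isl: str


def classify_latency(c3_discards, timeouts):
    """Classify an existing device latency per the definition above."""
    return "severe" if (c3_discards > 0 or timeouts > 0) else "moderate"


def flows_at_risk(flows, slow_devices):
    """Return healthy flows that share an ISL with a slow device's flow."""
    def is_slow(flow):
        return flow.host in slow_devices or flow.target in slow_devices

    congested_isls = {flow.isl for flow in flows if is_slow(flow)}
    return [
        flow for flow in flows
        if flow.isl in congested_isls and not is_slow(flow)
    ]


flows = [
    Flow("hostA", "array1", "isl1"),
    Flow("hostB", "array1", "isl1"),
    Flow("hostC", "array2", "isl2"),
]
print(classify_latency(c3_discards=3, timeouts=0))
print(flows_at_risk(flows, slow_devices={"hostA"}))
```

In this model hostB is healthy but still at risk, because its flow crosses the same ISL as the slow hostA; hostC is unaffected because it uses a different ISL.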
In addition, Brocade provides the FOS Frame Viewer to determine which flows contained dropped
frames. In this exercise, however, we will use the MAPS features, along with Bottleneck
Detection and FPI, to detect and mitigate congestion. The first thing to keep in mind is to always
follow Brocade/EMC Best Practices when establishing new fabrics or managing existing ones:
Note: The Brocade recommended default Edge Hold Time (EHT) setting is 220 ms. Verify this
on each switch in the fabric.
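In a fabric with many switches, the per-switch check can be tracked with a small script once the configured values have been noted. The Python sketch below assumes the EHT value of each switch has been collected by hand into a dictionary; the function name and the sample values are hypothetical, and the script does not query any switch.

```python
RECOMMENDED_EHT_MS = 220  # recommended default stated in the note above


def switches_needing_review(eht_by_switch, expected=RECOMMENDED_EHT_MS):
    """Return the sorted names of switches whose EHT differs from expected."""
    return sorted(
        name for name, value in eht_by_switch.items() if value != expected
    )


# Values entered by hand for illustration; they are not real readings.
observed = {"SW6510_1": 220, "SW6510_2": 500}
print(switches_needing_review(observed))
```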
As shown in the figure below, when MAPS is properly configured, congestion due to both
bandwidth and device latency is detected. (This display is generated by clicking on a violation
count in the Out of Range Violations widget in the dashboard):
In this example, both high utilization and device latency are cited as possible sources of the
congestion. A report can also be generated from the dashboard by clicking the appropriate
category in the Events widget. Clicking the Critical category provides the following:
As you can see, the impact and the exact frame delay in milliseconds are provided for each
threshold-exceeded condition or event. Whenever the excess latency approaches 100 milliseconds, there is
the potential for severe latency, and hence frame loss.
2. See what out-of-range violations you have on your dashboard. Drill down to determine the
Recommended Actions for your situation.
2. The MAPS policy rules details will be loaded and the MAPS Configuration screen appears.
Select the desired switch.
3. Select Options and verify that FPI (Fabric Performance Impact) Monitoring is checked and
running on each switch in the fabric.
Look for the SDDQ column and view the entries.
Once you have the XML file, you are ready to configure MAPS for your customer.
3. If applicable, you can also compare policies with users by using the Compare Policies tool.
To use the tool, select two policies for comparison. In this example, we have selected the
‘SW6510-2’ switch and two of its policies. Click Compare and review the information.
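The kind of difference report such a comparison produces can be illustrated with a short Python sketch. This is not how the Compare Policies tool works internally; it assumes each policy has been reduced to a dictionary of rule name to threshold, and the rule names and values shown are hypothetical.

```python
def compare_policies(first, second):
    """Report rules unique to each policy and rules whose values differ."""
    shared = set(first) & set(second)
    return {
        "only_in_first": sorted(set(first) - set(second)),
        "only_in_second": sorted(set(second) - set(first)),
        "different": {
            rule: (first[rule], second[rule])
            for rule in sorted(shared)
            if first[rule] != second[rule]
        },
    }


# Hypothetical rule names and thresholds, for illustration only.
policy_a = {"crc_per_min": 10, "state_changes_per_min": 5, "c3_timeouts": 2}
policy_b = {"crc_per_min": 20, "state_changes_per_min": 5, "link_resets": 4}
print(compare_policies(policy_a, policy_b))
```

Rules reported under "different" are typically the first place to look when two switches raise different alerts for the same condition.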
4. IMPORTANT: When you are done viewing the MAPS comparison tool, you must restore
MAPS to the original state of the lab. To do this, activate the default
aggressive policy for each switch in the lab fabric and verify that Performance Monitor is
enabled. Follow these steps to accomplish the reset:
The screenshots below show the beginning state of the fabric. Verify that the
actual MAPS Configuration display matches what is shown here after you have completed
the switch resets.
Verify Performance Monitor is enabled by selecting the ‘Monitor’ icon on the toolbar:
1. Faulty Media: Faulty media can cause frame loss due to excessive CRC errors, invalid
transmission words, and other conditions. This may result in I/O failure and application
performance degradation.
2. Misbehaving devices, links, and switches: Very occasionally a condition arises where a
device (server or storage array) or link (ISL) starts behaving erratically and causes
disruptions in the fabric. This may result in severe stress on the fabric.
3. Congestion: Congestion is caused by latencies or insufficient link bandwidth. End devices
that do not respond as quickly as expected cause the fabric to hold frames for excessive
periods of time. This can result in application performance degradation or, in extreme
cases, I/O failure.
4. Permanent or temporary loss of buffer credits: This is caused by the other end of a link
failing to acknowledge a request to transfer a frame because no buffers are available to
receive the frame.
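The buffer credit behavior in item 4 can be made concrete with a toy model of the transmit side of a link. The Python sketch below is a simplification for illustration, not a description of switch internals: the CreditedLink class and its method names are hypothetical. The sender spends one credit per frame and regains one each time the receiver signals that a buffer has been freed. If one of those signals is lost, the sender never regains that credit.

```python
class CreditedLink:
    """Toy model of the transmit side of a credit-controlled link."""

    def __init__(self, credits):
        self.negotiated = credits  # credits agreed when the link came up
        self.available = credits   # credits the sender may still spend

    def send_frame(self):
        """Spend one credit; return False if the sender must wait."""
        if self.available == 0:
            return False
        self.available -= 1
        return True

    def receive_ready(self):
        """Receiver freed a buffer and returned one credit."""
        if self.available < self.negotiated:
            self.available += 1


link = CreditedLink(credits=2)
print([link.send_frame() for _ in range(3)])  # third frame has to wait

# Only one of the two credit-return signals arrives; the other is lost.
link.receive_ready()
print(link.available, "of", link.negotiated, "credits usable")
```

After the lost signal, the link in this model can never have more than one frame outstanding, even though two credits were negotiated. That is the permanent credit loss case; a temporary loss looks the same until the receiver frees its buffers and returns the credits.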
The following are recommended features and capabilities to improve the overall resiliency of
Brocade FOS-based FC fabric environments:
Enable Brocade Fabric Vision to detect frame timeouts (MAPS “C3TX_TO” area).
Enable Port Fencing, Decommission, Toggle, and SDDQ for transmit timeouts on F_Ports.
Enable the Edge Hold Time feature.
Enable Brocade Fabric Watch to monitor (alert) for CRC errors, Invalid Words, and State
Changes, and to fence on extreme behavior.
Enable Bottleneck Detection for congestion conditions.
IMPORTANT: Please be sure to close all Brocade GUI windows and leave the Windows desktop
showing only the desktop icons before terminating your lab session.