
SAN Optimization Workshop

EMC SAN Optimization


Professional Service Workshop


Fabric Resiliency Workshop Intro

This workshop provides a high-level walkthrough of the most commonly experienced detrimental device behaviors
and explains how to use Brocade® products and features to protect the data center.
Faulty or improperly configured devices, misbehaving hosts, and faulty or substandard Fibre Channel (FC) media can
significantly affect the performance of FC fabrics and the applications they support. In most real-world scenarios,
these issues cannot be corrected or completely mitigated within the fabric itself; the behavior must be addressed
directly. However, with the proper knowledge and capabilities, the fabric can often identify and, in some cases,
mitigate or protect against the effects of these misbehaving components to provide better fabric resiliency.
Brocade offers many tools and features to assist with managing a robust Storage Area Network (SAN). Guidance on
best practices for using these capabilities is provided throughout this document. Many of the conditions and events
that adversely affect fabric operation are external to the switches themselves, but various mechanisms are available
in Brocade products to pinpoint the source of such problems.

Although there are certain aspects of today’s data centers that are common in most environments, no two data
centers are exactly alike, and no single set of configuration parameters applies universally to all environments. Brocade
works directly with customers in every type of SAN deployment to develop recommendations and guidelines that
apply to most environments. However, you should always validate all recommendations for your particular needs and
consult with your vendor or Brocade representative to ensure that you have the most robust design and configuration
available to fit your situation.

This document is divided into the following main sections:


o Factors affecting fabric resiliency
o Faulty media (description, detection, and mitigation)
o “Misbehaving” devices (description, detection, and mitigation)
o Congestion (description, detection, and mitigation)
o Loss of buffer credits (description, detection, and mitigation)
o Introduction of available tools to aid the fabric administrator (CMCNE/MAPS)

NOTE: Before proceeding, you should be familiar with the following documents and have them on hand:

Fabric OS Administrator’s Guide Supporting Fabric OS v7.4.0 (53-1003509-04)


Monitoring and Alerting Policy Suite Supporting Fabric OS v7.4.0 (53-1003521-03)
SAN Fabric Resiliency Best Practices v2.0 (Download the PDF by clicking the document title)


Fabric Resiliency Workshop

Faulty FRU and Media


1. Connect to host SANOPT1 and launch CMCNE 12.4.2 by double-clicking the icon on the
desktop. Log in to CMCNE as ‘administrator’ (the account password is ‘password’).

2. Identify faulty media and FRUs as shown in the example, using the dashboard, event log,
and GUI icons:
a. Take note of the highlighted “yellow” (marginal) status for switches SW6510_1
and SW6510_2 and the topology information displayed on the CMCNE home
screen.

If necessary, scroll until this error and its description appear.


b. Select the Dashboard – Product Status and Traffic tab.

3. In the dashboard view, double-click the Product Status and Traffic default dashboard.

4. In the 1-SAN Inventory pane, click the yellow portion of the top bar graph of the SAN
Inventory display.


5. A SAN Products Marginal box will open and display the switch(es) with faulty or marginal
hardware. Scroll horizontally to view all the information available.

6. To view the status of a switch, right-click it and select Properties.


7. As shown in the switch properties sheet above, there is an issue with a power supply on
the switch. This needs to be investigated further.
(In addition, you can identify faulty hardware by viewing logs from the main screen.
Logs are located under the SAN and IP tabs. Use the Monitor → Logs → Product Status
dropdown menu.)


Note that Brocade Element Manager and Web Tools GUIs can be launched and used for
problem verification at any time.

The example used here shows only a faulty power supply. However, the same process can
be followed to help diagnose any other failing or already faulty hardware component.

Other indications of faulty media are CRC errors, invalid words, and state changes
(“bouncing”). Search for these proactively in the Product Event Log.
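The three indicators named above can be checked against per-port counters once they have been collected. The following is a minimal sketch only: the counter names, the port labels, and the limit values are hypothetical placeholders and are not MAPS defaults. Take real thresholds from the MAPS guide for your FOS version.

```python
# Placeholder limits for illustration only; they are NOT MAPS defaults.
EXAMPLE_LIMITS = {"crc_errors": 0, "invalid_words": 0, "state_changes": 0}


def suspect_media_ports(port_counters, limits=EXAMPLE_LIMITS):
    """Return {port: [indicator, ...]} for ports whose counters exceed a limit."""
    suspects = {}
    for port, counters in port_counters.items():
        # Collect every indicator whose counter is above its limit.
        exceeded = [name for name, limit in limits.items()
                    if counters.get(name, 0) > limit]
        if exceeded:
            suspects[port] = exceeded
    return suspects


# Hypothetical counters keyed by a hypothetical port label.
counters = {
    "port-1": {"crc_errors": 12, "invalid_words": 0, "state_changes": 3},
    "port-2": {"crc_errors": 0, "invalid_words": 0, "state_changes": 0},
}
print(suspect_media_ports(counters))
```

A port that shows more than one indicator at once is usually a stronger candidate for a media check than a port with a single counter above its limit.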

8. To view event logs for errors in the fabric, go to Monitor → Logs → Product Event Log.

9. Note that these events are based on the active MAPS default policy.

Optional Exercise: Please take the time now to review and familiarize yourself with the
Monitoring and Alerting Policy Suite Administrator’s Guide.

When using the Monitoring and Alerting Policy Suite Administrator’s Guide to assist during
an actual customer engagement, be certain that it is the guide for the FOS version you are
using, so that the specific thresholds for faulty media and other triggers are correct.


Misbehaving Devices
A common problem is abnormal behavior from high-latency end devices, with hosts and storage
typically being the common offenders. These devices can cause unacceptable application
performance issues. Always keep in mind that the symptoms of misbehaving devices and faulty
media are similar. Additional RASlog messages were added to identify lost buffer credits.

Some examples of these messages in the RASlog are: CX-5679, 1012, 1014, 1015. Refer to
Brocade’s message reference guide for additional information. These messages can be viewed in
the event log and with the FOS errshow command.

Here are some examples of RASlog messages to look for. These can be viewed on the switch
using errshow or errdump commands:

CDR-5021 or CX-5021 RASlog messages


CDR-1012 or CX-1012 RASlog messages
CDR-1015 or CX-1014 RASlog messages
CDR-1015 or CX-1015 RASlog messages
CDR-1015 or CX-1016 RASlog messages
CDR-1015 or CX-1017 RASlog messages
CDR-5079 or CX-5079 RASlog messages
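When errshow or errdump output has been saved to a text file, the search for these message IDs can be scripted. The sketch below is one possible approach, not a Brocade tool. It assumes that message IDs appear in square brackets in the saved text and that "CX" stands for the prefixes C2 and C3. Both points are assumptions of this example, and the sample lines are illustrative placeholders rather than real switch output.

```python
import re

# Numeric codes taken from the list above.
WATCHED_CODES = {"5021", "1012", "1014", "1015", "1016", "1017", "5079"}
# Assumption: "CX" in the list above stands for the prefixes C2 and C3.
WATCHED_PREFIXES = {"CDR", "C2", "C3"}

# Matches a bracketed message ID such as [C2-1012] anywhere in a line.
MSG_ID = re.compile(r"\[([A-Z0-9]+)-(\d+)\]")


def find_watched_messages(log_text):
    """Return (message_id, line) pairs for lines carrying a watched ID."""
    hits = []
    for line in log_text.splitlines():
        match = MSG_ID.search(line)
        if not match:
            continue
        prefix, code = match.group(1), match.group(2)
        if prefix in WATCHED_PREFIXES and code in WATCHED_CODES:
            hits.append((f"{prefix}-{code}", line.strip()))
    return hits


# Illustrative placeholder lines, not captured from a real switch.
sample = """\
2015/06/01-10:15:02, [C2-1012], 101, CHASSIS, WARNING, SW6510_1, placeholder text
2015/06/01-10:16:40, [SEC-1203], 102, FID 128, INFO, SW6510_1, placeholder text
"""
for message_id, line in find_watched_messages(sample):
    print(message_id, "->", line)
```

Keeping the prefixes and codes in separate sets makes it easy to adjust the watch list after checking the message reference guide for the FOS version in use.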

Congestion
Link congestion in a Fibre Channel network is caused by either latency or insufficient bandwidth.
Insufficient bandwidth is less common in today’s FC networks because of increases in
bandwidth and proper sizing of ISLs; when it does occur, it is normally due to poor design. This
leaves latency. Most latency is caused by slowly responding devices, which force
the fabric to spend resources holding frames longer than usual. Most latencies are found in hosts
rather than storage arrays, because hosts may become resource constrained, whereas storage arrays are
purpose-built and designed to handle I/O from many different hosts.

Latencies are categorized as “moderate” or “severe”. Severe latency is latency that results in
frame loss, also known as C3 Discards, or timeouts. Device latency can become severe enough
to back up an ISL, especially if multiple devices are experiencing latency at the same
time. When that happens, healthy flows using the same ISL are also impacted by the
latency. This is known as a bottleneck. Bottlenecks are detected in Brocade fabrics through
appropriate alert settings such as MAPS C3 Discard alerts and Bottleneck Detection alerts.
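The moderate/severe distinction and the bottleneck effect can be expressed as a small model. This is a sketch under stated assumptions: the function names, the flow tuples, and the device names are hypothetical, and the model works on counters you supply, not on anything read from a switch.

```python
def classify_latency(latency_detected, c3_discards=0, timeouts=0):
    """Label latency 'severe' once it has led to frame loss or timeouts."""
    if c3_discards > 0 or timeouts > 0:
        return "severe"
    return "moderate" if latency_detected else "none"


def healthy_flows_at_risk(isl_flows, severe_devices):
    """Return flows that share an ISL with a flow to or from a severe device.

    isl_flows is a list of (source, destination) pairs crossing one ISL.
    """
    slow = [f for f in isl_flows
            if f[0] in severe_devices or f[1] in severe_devices]
    if not slow:
        return []  # no severe device on this ISL, so nothing is backed up
    return [f for f in isl_flows if f not in slow]


# Hypothetical flows crossing one ISL.
flows = [("hostA", "array1"), ("hostB", "array1")]
print(classify_latency(True, c3_discards=4))
print(healthy_flows_at_risk(flows, {"hostA"}))
```

The second function illustrates why one slow device matters to the whole fabric: hostB is healthy, yet its flow is listed because it shares the ISL with hostA.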


In addition, Brocade provides the FOS Frame Viewer to determine which flows contained dropped
frames. In this exercise, however, we will use the MAPS features, along with Bottleneck
Detection and FPI, to detect and mitigate congestion. The first thing to keep in mind is to always
follow Brocade/EMC Best Practices when establishing new fabrics or managing existing ones:

1. Always configure Edge Hold Time (EHT) = 220ms (Brocade Best Practice).
2. Configure EHT < 400ms to obviate unnecessary host FLOGIs.
3. Enable Bottleneck Detection/FPI on all F_Ports.
4. Enable C3 Discard/Timeout monitoring via MAPS.
5. Configure EHT only on switches with hosts attached.

Note: The Brocade EHT recommended default setting is 220ms. This should be verified on
each switch in the fabric.
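Verifying the hold-time rules on each switch can be turned into a simple audit once the values have been recorded. The sketch below is illustrative only. The inventory layout, the field names `has_hosts` and `eht_ms`, and the second and third switch names are hypothetical, and `eht_ms = None` is used here to mean that EHT was not explicitly configured on that switch.

```python
RECOMMENDED_EHT_MS = 220  # recommended value from the list above
MAX_EHT_MS = 400          # EHT should stay below this value


def audit_eht(switches):
    """Return a list of findings for switches that break the EHT rules."""
    findings = []
    for name, info in sorted(switches.items()):
        eht = info.get("eht_ms")
        if not info["has_hosts"]:
            # Rule 5: configure EHT only on switches with hosts attached.
            if eht is not None:
                findings.append(f"{name}: EHT set to {eht} ms but no hosts attached")
            continue
        if eht is None:
            findings.append(f"{name}: hosts attached but EHT not configured")
        elif eht >= MAX_EHT_MS:
            findings.append(f"{name}: EHT {eht} ms is not below {MAX_EHT_MS} ms")
        elif eht != RECOMMENDED_EHT_MS:
            findings.append(
                f"{name}: EHT {eht} ms differs from recommended {RECOMMENDED_EHT_MS} ms")
    return findings


# Hypothetical inventory recorded by hand from each switch.
inventory = {
    "SW6510_1": {"has_hosts": True, "eht_ms": 220},
    "edge_2": {"has_hosts": True, "eht_ms": 500},
    "core_1": {"has_hosts": False, "eht_ms": 220},
}
for finding in audit_eht(inventory):
    print(finding)
```

An empty result means every recorded switch follows the rules as modeled here. It does not replace checking the live setting on each switch.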

As shown in the figure below, when MAPS is properly configured, congestion due to both
bandwidth and device latency is detected. (This display is generated by clicking on a violation
count in the Out of Range Violations widget in the dashboard):

In this example, both high utilization and device latency are cited as possible sources of the
congestion. A report can also be generated from the dashboard by clicking the appropriate
category in the Events widget. Clicking on the Critical category provides us with the following:

As you can see, the impact and exact frame delay in milliseconds are provided for each
threshold-exceeded condition or event. Whenever the excess latency approaches 100 milliseconds,
there is the potential for severe latency, and hence frame loss.


1. See if you can generate your own Events Report now.

2. See what out-of-range violations you have on your dashboard. Drill down to determine the
Recommended Actions for your situation.

Viewing FPI SDDQ triggers and actions

1. Back on the main CMCNE screen, click the MAPS icon.

2. The MAPS policy rules details will be loaded and the MAPS Configuration screen appears.
Select the desired switch.


3. Select Options and verify that FPI (Fabric Performance Impact) Monitoring is checked and
running on each switch in the fabric.

To view SDDQ (quarantined) ports, select the switch and then select Violations.


Look for the SDDQ column and view the entries.


Options for Configuring MAPS


The first step in configuring the Monitoring and Alerting Policy Suite (MAPS) is to obtain the
XML file that contains the customized Moderate MAPS policy. This policy ensures
that the customer’s production SAN environment is properly protected and that no port fencing
will take place. The XML file is available in the EMC Tools area.

Once you have the XML file, you are ready to configure MAPS for your customer.

1. Import the XML file

2. Activate the new policy


3. You can also compare policies with users, if applicable, by using the Compare Policies tool.
To use the tool, select two policies for comparison. In this example we’ve selected the
‘SW6510-2’ switch and two of its policies. Click Compare and review the information.


4. IMPORTANT: When you are done viewing the MAPS comparison tool you must configure
MAPS back to the original state of the lab. To do this, please activate the default
aggressive policy for each switch in the lab fabric and verify performance monitor is
enabled. Please follow these steps to accomplish the reset:

1. Select the first switch/policy and then click ‘Activate’.


2. Continue the same procedure until you have activated the default aggressive policy for
each switch in the Products column.
3. Click the ‘Close’ button at the bottom-right of the MAPS Configuration screen (not
shown).

The screenshots below show the beginning state of the fabric. Please verify that the
actual MAPS Configuration display matches what is shown here after you have completed
the switch resets.


Verify Performance Monitor is enabled by selecting the ‘Monitor’ icon on the toolbar:


Summary and Recommendations – Using Best Practices

1. Faulty Media: Faulty media can cause frame loss due to excessive CRC errors, invalid
transmission words, and other conditions. This may result in I/O failure and application
performance degradation.
2. Misbehaving devices, links, and switches: Very occasionally a condition arises where a
device (server or storage array) or link (ISL) starts behaving erratically and causes
disruptions in the fabric. This may result in severe stress on the fabric.
3. Congestion: Congestion is caused by latencies or insufficient link bandwidth. End devices
that do not respond as quickly as expected cause the fabric to hold frames for excessive
periods of time. This can result in application performance degradation or, in extreme
cases, I/O failure.
4. Permanent or temporary loss of buffer credits: This is caused by the other end of a link
failing to acknowledge a request to transfer a frame because no buffers are available to
receive the frame.
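The credit mechanism behind item 4 can be illustrated with a toy model of buffer-to-buffer flow control on a single link. This is a simplified sketch, not a description of switch internals: the class and method names are hypothetical, and the `lost` flag simply models a credit-return signal that never reaches the sender.

```python
class CreditedLink:
    """Toy model of buffer-to-buffer credit flow control on one link."""

    def __init__(self, credits):
        self.credits = credits  # frames the sender may still transmit
        self.sent = 0

    def send_frame(self):
        """Transmit one frame if a credit is available; return success."""
        if self.credits == 0:
            return False  # sender must wait: no free receive buffer is known
        self.credits -= 1
        self.sent += 1
        return True

    def receive_ready(self, lost=False):
        """Receiver frees a buffer; the credit returns unless the signal is lost."""
        if not lost:
            self.credits += 1


link = CreditedLink(credits=2)
print(link.send_frame(), link.send_frame(), link.send_frame())
link.receive_ready(lost=True)   # the credit is never returned to the sender
print(link.send_frame())
link.receive_ready()            # a normal credit return
print(link.send_frame())
```

In this model every lost credit return permanently lowers the number of frames that can be in flight, which is why repeated losses eventually stall the link even though both ends are otherwise healthy.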

The following are recommended features and capabilities to improve the overall resiliency of
Brocade FOS-based FC fabric environments:
o Enable Brocade Fabric Vision to detect frame timeouts (MAPS “C3TX_TO” area).
o Enable Port Fencing, Decommission, Toggle, and SDDQ for transmit timeouts on F_Ports.
o Enable the Edge Hold Time feature.
o Enable Brocade Fabric Watch to monitor (alert) for CRC errors, Invalid Words, and State
Changes, and to fence on extreme behavior.
o Enable Bottleneck Detection for congestion conditions.

You have completed all of the workshop activities.

IMPORTANT: Please be sure to close all Brocade GUI windows and leave the Windows desktop
showing only the desktop icons before terminating your lab session.
