
DATA CENTER

SAN Fabric Resiliency Best Practices


v2.0
July 2013

This tech brief provides best practices for implementation of advanced Brocade Fabric OS features that identify, monitor, and protect Fibre Channel SANs from problematic device and media behavior.


TECHNICAL BRIEF

CONTENTS
Introduction
Feature Availability
Fabric Resiliency
Factors Affecting Fabric Resiliency
Faulty Media
    Description
    Detection
    Mitigation
Misbehaving Devices
    Description
    Detection
        CX-5021 Messages
        CX-1012 Messages
        CX-1014 Messages
        CX-1015 Messages
        CX-5679 Messages (CX-5678 in Brocade FOS v7.0.0x and v7.0.1x)
    Mitigation
Congestion
    Description
        Device-Based Latencies
        Moderate Device Latencies
        Severe Device Latencies
        Latencies on ISLs
    Detection
    Mitigation
        Initiators vs. Targets
Loss of Buffer Credits
    Description
        Condor 3 ASIC Enhancements
    Detection
    Mitigation
        Credit Recovery on Back-End Ports
Tools
    Bottleneck Detection
        Enhanced Bottleneck Detection
    Brocade Fabric Watch
    Port Fencing
    Edge Hold Time
    Frame Viewer
Designing Resiliency into the Fabric
    Factors Affecting Congestion
    Resiliency
    Redundancy
Summary and Recommendations
Appendix A: Bottleneck Detection
    Evolution of Bottleneck Detection Field Experience
    Brocade FOS v6.3
        Brocade FOS v6.3 Bottleneckmon Parameters
        Specifying Ports and Port Ranges
        CLI Examples
        Display Commands
    Brocade FOS v6.4
        Brocade FOS v6.4 Bottleneckmon Parameters
        CLI Examples
    Brocade FOS v7.0
        Brocade FOS v7.0.x Bottleneckmon Parameters
        CLI Example
    Command Parameter Summary
    Suggested Parameter Settings
Appendix B: Configuring Brocade Fabric Watch Port Fencing
    Port Fencing Threshold Recommendations
Appendix C: Sample Frame Viewer Session
    framelog --show -n 1200
Appendix D: Edge Hold Time (EHT)
    Introduction to EHT
    Supported Releases and Licensing Requirements
    Behavior
        8G Platforms and the Brocade 48000
        Gen 5 Platforms
    Default EHT Settings
    Recommended Settings


LIST OF FIGURES
Figure 1. Device latency example
Figure 2. Latency on a switch can propagate through the fabric
Figure 3. Frame Viewer capture capability for a director or switch with multiple ASICs

LIST OF TABLES
Table 1. Brocade FOS Response to Front-End Link Credit Loss
Table 2. Brocade FOS Response to Back-End Link Credit Loss
Table 3. Command Parameter Summary
Table 4. Edge Hold Time Configuration Values
Table 5. Recommended Port Fencing Thresholds
Table 7. Factory Default EHT Settings
Table 8. Suggested EHT Settings for Various FOS Releases


INTRODUCTION
This document provides a high-level description of the most commonly experienced detrimental device behaviors and explains how to use Brocade products and features to protect your data center. Faulty or improperly configured devices, misbehaving hosts, and faulty or substandard Fibre Channel (FC) media can significantly affect the performance of FC fabrics and the applications they support. In most real-world scenarios, these issues cannot be corrected or completely mitigated within the fabric itself; the behavior must be addressed directly. However, with the proper knowledge and capabilities, the fabric can often identify and, in some cases, mitigate or protect against the effects of these misbehaving components to provide better fabric resiliency. Brocade offers many tools and features to assist with managing a robust Storage Area Network (SAN), and guidance on best practices for using these capabilities is provided throughout this document. Many of the conditions and events that adversely affect fabric operation are external to the switches themselves, but various mechanisms are available in Brocade products to pinpoint the source of such problems.

Although certain aspects of today's data centers are common to most environments, no two data centers are exactly alike, and no single set of configuration parameters applies universally to all environments. Brocade works directly with customers in every type of SAN deployment to develop recommendations and guidelines that apply to most environments. However, you should always validate all recommendations for your particular needs and consult with your vendor or Brocade representative to ensure that you have the most robust design and configuration available to fit your situation. Brocade also offers extensive Professional Services to assist with tuning and optimizing the features discussed in this document for customized deployment in your data center. For details, visit: www.brocade.com/services-support/professional-services/index.page

This document is divided into the following main sections:
- Factors affecting fabric resiliency
  - Faulty media (description, detection, and mitigation)
  - Misbehaving devices (description, detection, and mitigation)
  - Congestion (description, detection, and mitigation)
  - Loss of buffer credits (description, detection, and mitigation)
- Introduction of available tools to aid the fabric administrator
- Designing resiliency into the fabric
- Summary and recommendations
- Appendices detailing selected topics

The topics covered in this document are complex. Additional details on individual feature settings and configuration are included in the appendices to allow the main body of this document to flow more easily. You are encouraged to review the main body of the document first and then refer to the appendices after the general concepts and information are understood. Every effort was taken to ensure the accuracy of the information in this document. However, it is essential to consult the appropriate version of the Fabric OS Command Reference Manual (covering the CLI) to confirm the correct syntax for all commands used.


Some features discussed in this document are covered in more detail than others. The Brocade Fabric Watch feature, for example, is too comprehensive to cover in detail here. Edge Hold Time and Bottleneck Detection functionality are covered in more detail due to their importance to fabric resiliency. Further documentation is available that provides additional detail on most of the features discussed in this document. The following is a partial list of material supplementing the information provided here (all documents are available to registered users at my.brocade.com):
- SAN Design and Best Practices
- SAN Fabric Administration Best Practices Guide: Support Perspective
- Fabric OS Administrator's Guide (produced for each major Brocade Fabric OS [FOS] release)
- Fabric OS Command Reference Manual (produced for each major Brocade FOS release)

This document covers Brocade FOS releases from version 6.3 to version 7.0. New revisions are produced covering subsequent Brocade FOS releases.

FEATURE AVAILABILITY
While there are many features available in Brocade FOS to assist with monitoring, protecting, and troubleshooting fabrics, Brocade has implemented several recent enhancements that focus exclusively on these areas. This document concentrates specifically on those newer features (and related capabilities) that help provide optimum fabric resiliency. These features are available and supported on the majority of 4 Gbps (gigabit-per-second) and 8 Gbps platforms, in addition to the latest 16 Gbps Gen 5 FC platforms, when running the most recent Brocade FOS releases. (Visit my.brocade.com or consult with your vendor for the latest supported Brocade FOS releases.) Throughout this document, special requirements, including required licenses, minimum release levels, and platform limitations, are noted. Brocade strongly recommends that you review the additional documentation noted above to understand all of the tools that are available for maintaining an FC SAN environment. Also, be sure to read the Brocade FOS Release Notes for important information related to the specific version of Brocade FOS you are using.

FABRIC RESILIENCY
Fabric resiliency is the ability of an FC fabric to tolerate unusual conditions that might place the fabric's operation at risk. Examples of such risks include server or storage ports that behave abnormally or long-distance links that are at a greater risk of experiencing degradation in signal quality. Unless mitigated, these risks can lead to instability and unexpected behaviors in the fabric. In some cases, such behaviors can affect the Input/Output (I/O) for the devices connected to the fabric.

However, implementing a resilient fabric requires more than just an awareness of the possible conditions that threaten fabric stability. A combination of capabilities is required that allows for problem detection, mitigation, and, where necessary, isolation, to enable the fabric administrator to efficiently correct the condition. In the past, the process of problem detection and mitigation relied on the diligence, skill, and experience of the fabric administrator; the regimen was usually a manual task of searching for and tracing error conditions. Brocade has developed a suite of capabilities and tools that provide detailed monitoring and alerting, as well as response and mitigation, that vastly improve the fabric administrator's insight and response time. This document discusses these capabilities and their application to the factors affecting fabric resiliency by detailing the processes of detection and mitigation for each factor. The fabric-specific tools and design recommendations that follow provide you with a robust approach to fabric resiliency.


FACTORS AFFECTING FABRIC RESILIENCY


There are several common types of abnormal behavior originating from fabric components or attached devices:
- Faulty media (fiber-optic cables and Small Form-Factor Pluggables [SFPs]/optics): Faulty media can cause frame loss due to excessive Cyclic Redundancy Check (CRC) errors, invalid transmission words, and other conditions. This may result in I/O failure and application performance degradation.
- Misbehaving devices, links, or switches: Occasionally, a condition arises where a device (server or storage array) or link (Inter-Switch Link, or ISL) behaves erratically and causes disruptions in the fabric. If not immediately addressed, this may result in severe stress on the fabric.
- Congestion: This is caused by latencies or insufficient link bandwidth. End devices that do not respond as quickly as expected can cause the fabric to hold frames for excessive periods of time. This can result in application performance degradation or, in extreme cases, I/O failure.
- Permanent or temporary loss of buffer credits: This happens when the receiving end of a link fails to acknowledge a request to transfer a frame because no buffers are available to receive the frame.

This paper examines each of these types of behavior in depth.

FAULTY MEDIA
Description
Faulty media is one of the most common sources of fabric problems and eventual data center disruption. Faulty media can include damaged or substandard cables, SFPs, patch panels or receptacles, improper connections, malfunctioning extension equipment, and other types of external issues. Media can fault and fail on any port type, E_Port or F_Port, often intermittently and unpredictably, which makes diagnosis even harder.

Faulty media involving F_Ports affects the end device attached to the F_Port and the devices communicating with that device. This can lead to broader issues as the effects of the media fault propagate through the fabric. Failures on E_Ports have the potential for even greater impact: multiple flows (host/target pairs) simultaneously traverse a single E_Port, as does inter-switch control traffic. In large fabrics, the number of flows passing through an E_Port can be very high, and a media failure involving one of these links can disrupt some or all of the flows using that link.

A severe media fault or complete media failure can disrupt the port or even take the port offline. When an F_Port fails completely, the condition is usually detected by the connected device (storage or host), which is usually configured to continue working through an alternative connection to the fabric; the impact is specific to flows involving that F_Port. An E_Port going offline causes the fabric to drop all routes using the failed E_Port. This is usually easy to detect and identify. E_Ports are typically redundant, such that a severe failure results in only a minor drop in bandwidth as the fabric automatically utilizes available alternate paths. The error reporting built into Brocade FOS readily identifies the failed link and port, allowing for simple corrective action and repair.

Moderate levels of media fault cause failures to occur intermittently. The port may remain online or, in some cases, may repeatedly transition between online and offline states. This can cause repeated errors which, if left unresolved, can recur indefinitely or until the media fails completely. Intermittent problems are difficult to assess. The fabric does its best to determine whether a problem is critical enough to disable the port. Disabling a port causes the host failover software to stop using the port attached to the faulty media and to re-establish connectivity through its alternative path. However, problems can occur when the severity of the fault remains undetermined, leaving the host and the applications running on it to cope with the results of the intermittently failing media.


When this type of failure occurs on E_Ports, the result might be devastating. Repeated errors can affect many flows. This can result in a significant impact to applications, which can last for a prolonged period of time. Note that FC switches cannot correct for most problems caused by faulty media; the switches can attempt only to detect, alert, and compensate for the problems. Ultimately, the problems must be addressed in the host or target devices, cable plant, or media where the fault actually occurs.

Detection
Brocade FOS has evolved through multiple generations and has built a tool chest to detect and mitigate fabric resiliency issues, including faulty media conditions. You can use Brocade Fabric Watch to monitor and detect faulty media. The presence of faulty media can manifest with the following symptoms:
- CRC errors on frames
- Invalid words (includes encoder out errors)
- State changes (ports going offline or online repeatedly)
- Credit loss: Complete loss of credit on a Virtual Channel (VC) of an E_Port prevents traffic from flowing on that VC, which results in frame loss and I/O failures for devices utilizing the VC. Partial credit loss, while a concern, usually does not significantly affect traffic flow, due to the high link speeds of ISLs today. Trunked ISLs (a best-practice recommendation) are usually not affected.
- Switch-issued link resets (issued when a device or link fails to respond within two seconds)

You can enable Brocade Fabric Watch monitors to automatically detect these types of faulty media conditions. Brocade Fabric Watch generates alerts based on user-defined thresholds.
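To make the threshold concept concrete, here is a minimal, illustrative sketch (not Brocade code) of how per-window counter thresholds can trigger alerts. The counter names and threshold values here are hypothetical placeholders, not actual Fabric Watch settings; real thresholds should come from Appendix B and your environment's baseline.

```python
# Illustrative sketch of threshold-based media-fault detection.
# Counter names and limits are hypothetical, not Fabric Watch defaults.

THRESHOLDS = {
    "crc_errors": 5,       # CRC errors allowed per polling window
    "invalid_words": 25,   # invalid transmission words per window
    "state_changes": 7,    # offline/online transitions per window
}

def check_port(port_name, window_counters):
    """Return alert strings for counters exceeding their thresholds."""
    alerts = []
    for counter, limit in THRESHOLDS.items():
        value = window_counters.get(counter, 0)
        if value > limit:
            alerts.append(f"{port_name}: {counter}={value} exceeds threshold {limit}")
    return alerts

# Example: a port showing CRC symptoms of faulty media.
for alert in check_port("port 2/14", {"crc_errors": 9, "invalid_words": 3}):
    print(alert)
```

The key design point, mirrored by Fabric Watch, is that alerts are driven by counter deltas within a time window rather than lifetime totals, so a burst of errors on otherwise quiet media stands out.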

Mitigation
You must identify and correct faulty media issues as soon as possible; otherwise, they can lead to severe fabric problems, such as dropped frames, performance impact, and permanent credit loss. At the very least, you must isolate the ports with faulty media. The Brocade Fabric Watch feature provides a mechanism that quarantines the misbehaving component through the optional action of Port Fencing. Port Fencing is available for each of the previously noted conditions and is recommended to automatically protect the fabric from these error conditions. The recommended thresholds (specified in Appendix B: Configuring Brocade Fabric Watch Port Fencing) have been tested and tuned to quarantine misbehaving components that are likely to cause a fabric-wide impact; they are very unlikely to falsely trigger on normally behaving components. You can configure Brocade Fabric Watch thresholds to activate Port Fencing for various symptoms, disabling a port that exhibits signs of faulty media. The repair of media faults themselves is beyond the scope of this document.
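As an illustration of the fencing concept, the following sketch (not Brocade code) disables a port once error events within a sliding time window exceed a threshold. The threshold and timestamps are hypothetical; recommended production thresholds are those in Appendix B.

```python
# Illustrative sketch of Port Fencing: quarantine (disable) a port once
# errors within a sliding window exceed a threshold. Values are hypothetical.
from collections import deque

class PortFence:
    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()   # timestamps of recent error events
        self.fenced = False

    def record_error(self, timestamp):
        """Record one error; fence the port if the windowed count exceeds the threshold."""
        if self.fenced:
            return
        self.events.append(timestamp)
        # Drop events older than the sliding window.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) > self.threshold:
            self.fenced = True  # port stays disabled until the media is repaired

# Example: four errors within 30 seconds against a threshold of 3 fences the port.
fence = PortFence(threshold=3, window_seconds=60)
for t in (0, 10, 20, 30):
    fence.record_error(t)
print(fence.fenced)
```

The same error rate spread over many minutes would not trip the fence, which is how well-tuned thresholds avoid false triggers on normally behaving components.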

MISBEHAVING DEVICES
Description
Another common class of abnormal behavior originates from high-latency end devices (host or storage). A high-latency end device is one that does not respond as quickly as expected and thus causes the fabric to hold frames for excessive periods of time. This can result in application performance degradation or, in extreme cases, I/O failure. Common examples of moderate device latency include disk arrays that are overloaded and hosts that cannot process data as fast as requested. Misbehaving hosts, for example, become more common as hardware ages. Bad host behavior is usually caused by defective Host Bus Adapter (HBA) hardware, bugs in the HBA firmware,


and problems with HBA drivers. Storage ports can produce the same symptoms due to defective interface hardware or firmware issues. Some arrays deliberately reset their fabric ports if they do not receive host responses within their specified timeout periods. Severe latencies are caused by badly misbehaving devices that stop receiving, accepting, or acknowledging frames for excessive periods of time. However, with the proper knowledge and capabilities, the fabric can often identify and, in some cases, mitigate or protect against the effects of these misbehaving components to provide better fabric resiliency.

Detection
Prompt detection of misbehaving devices is critical to quick and effective mitigation of these disorders. The symptoms of misbehaving devices include:
- CRC errors on frames
- Invalid words (includes encoder out errors)
- State changes (ports going offline/online repeatedly or seemingly at random)
- Devices holding buffer credits for long periods, resulting in the switch issuing a Link Reset (LR) on the port connected to the device
- Loss of synchronization on device links

Brocade products and Brocade FOS-integrated capabilities provide mechanisms for detection and recovery of lost credits, though this is not intended as a permanent method of resolving chronic issues with misbehaving devices. Once identified, the root cause of the problem should be addressed directly to reduce the likelihood of further impact to the fabric. You can use Brocade Fabric Watch to monitor for all of the above conditions, except for non-return of buffer credits. The symptoms of misbehaving devices and faulty media are very similar.

In addition to monitoring and isolation, Brocade FOS also provides the following RASlog messages for symptoms such as state changes, devices not returning buffer credits, and loss of sync on device links.

NOTE: The XX-5XXX messages referenced below are not documented in the Fabric OS Message Reference Guide; XX-1XXX messages are documented there. The message code CX is displayed as either C2 or C3, depending on whether the port is on an 8 Gbps Condor 2 (C2) or Gen 5 16 Gbps Condor 3 (C3) Application-Specific Integrated Circuit (ASIC).

CX-5021 Messages
Brocade FOS issues an LR and logs a CX-5021 message when it detects a port that is not returning credit and has halted traffic for at least two seconds. No credit return in two seconds is considered a serious problem, and the action to take is dictated by Fibre Channel standards: the Brocade switch issues an LR to reset the link and recover the lost credit. The CX-5021 message indicates that the switch has followed FC procedure and issued an LR in response to detected credit loss. The CX-5021 message appears only for front-end (FE) ports when no credits are returned over a period of two seconds; it is not used for back-end (BE) or internal links, where a CX-1012 message is used instead. The CX-5021 message was replaced with CX-1012 in later releases of Brocade FOS (v6.3.2a/v6.4.0 and later).


CX-1012 Messages
These messages serve the same purpose for internal (BE) ports as the CX-5021 messages serve for external (FE) ports. This message indicates that an LR was performed after no credit was returned for two seconds.

CX-1014 Messages
These messages indicate that one or more credits were lost and that the link was reset.

CX-1015 Messages
Like the CX-1012 messages, these serve the same purpose for internal (BE) ports as the CX-5021 messages serve for external (FE) ports. This message indicates that the link was reinitialized.

CX-5679 Messages (CX-5678 in Brocade FOS v7.0.0x and v7.0.1x)


Brocade FOS reports a CX-5679 message whenever a loss of sync is detected. A loss of sync can occur when more than three corrupted words are received in a row. The receiving ASIC immediately attempts to regain sync. If sync is regained within 100 milliseconds (ms), as prescribed by Fibre Channel standards, no further action takes place; if sync cannot be regained, the blade is faulted. The CX-5679 message indicates a loss-of-sync condition on a BE port, during which frames may be lost and credit loss may occur, although frame or credit loss is not certain in every case. When this occurs, the port called out in the message has likely lost one or more credits on one or more VCs.
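As a summary of the message codes above, the following illustrative sketch (not Brocade code) maps a CX-NNNN message ID to its ASIC generation and the meaning described in this section. The input format is a simplified stand-in for an actual RASlog entry, not a verbatim Brocade log format.

```python
# Illustrative classifier for the RASlog message codes discussed above.
# Meanings paraphrase this section; the ID format is a simplified stand-in.
import re

MEANINGS = {
    "5021": "front-end port returned no credit for two seconds; Link Reset issued",
    "1012": "back-end port returned no credit for two seconds; Link Reset performed",
    "1014": "one or more credits lost; link was reset",
    "1015": "back-end port link reinitialized",
    "5679": "loss of sync on a back-end port (CX-5678 in FOS v7.0.0x/v7.0.1x)",
}

def classify(message_id):
    """Split a CX-NNNN message ID into ASIC generation and meaning."""
    m = re.fullmatch(r"(C[23])-(\d{4})", message_id)
    if not m:
        return None
    asic = "Condor 2 (8 Gbps)" if m.group(1) == "C2" else "Condor 3 (Gen 5, 16 Gbps)"
    return asic, MEANINGS.get(m.group(2), "unknown message code")

print(classify("C3-5021"))
```

A helper like this is only a memory aid for the code-to-symptom mapping; always confirm meanings against the Fabric OS Message Reference Guide where the code is documented.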

Mitigation
Misbehaving devices can have profound effects on a fabric. Bad device behavior can cause back pressure to build up in the fabric until no traffic passes. This is especially true of routed fabrics, where many flows may be traversing relatively few links. You can use Brocade Fabric Watch to remove the suspect device from the fabric by monitoring TX timeouts and blocking a port with Port Fencing if set thresholds are exceeded. Guidance on implementing Port Fencing is provided in Appendix B: Configuring Brocade Fabric Watch Port Fencing.

You can use the optional Edge Hold Time (EHT) configuration setting to decrease the delay imposed before a Class 3 (C3) transmit timeout (C3 TX_TO: er_tx_c3_timeout) frame discard occurs when frames are blocked for some reason. This allows traffic to resume, with one buffer credit freed for each frame dropped. By allowing traffic to continue in this way, you avoid dropping frames from other flows on ISLs in the fabric due to back pressure from the high-latency device. If Brocade Fabric Watch is used to fence ports on TX_TO frame drops, the port can also be disabled before it causes widespread impact on the fabric. EHT is described in detail in the Tools section and the appendices of this document.
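The EHT effect can be sketched as follows. This illustrative model (not Brocade code) shows a transmit queue in which frames held longer than the hold time are discarded as C3 TX_TO drops, each discard freeing one buffer credit. The timing values are hypothetical and chosen only to show that a shorter hold time frees credits sooner.

```python
# Illustrative model of hold-time-driven frame discard (C3 TX_TO).
# Each discarded frame frees one buffer credit. Times are hypothetical.

def drain_expired(queue, now_ms, hold_time_ms):
    """Discard frames held longer than hold_time_ms.
    Returns (remaining_queue, credits_freed)."""
    remaining, freed = [], 0
    for enqueued_at in queue:
        if now_ms - enqueued_at > hold_time_ms:
            freed += 1          # discard frees one buffer credit
        else:
            remaining.append(enqueued_at)
    return remaining, freed

# Frames enqueued at these times (ms); check the queue at t = 600 ms.
queue = [0, 100, 450, 550]
# A 500 ms hold time discards only the oldest frame...
print(drain_expired(queue, 600, 500))
# ...while a shorter 220 ms hold time discards the two oldest, freeing credits sooner.
print(drain_expired(queue, 600, 220))
```

This is why lowering EHT on edge ports confines the damage from a high-latency device: stuck frames are dropped at the edge before back pressure starves flows sharing ISLs deeper in the fabric.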

CONGESTION
Description
Congestion occurs when the traffic offered to a link exceeds its capacity. It is typically due either to fabric latencies or to insufficient link bandwidth, and the sources can be links, hosts, or storage responding more slowly than expected. As Fibre Channel link bandwidth has increased from 1 to 16 Gbps, instances of insufficient link bandwidth have decreased radically. Latencies, particularly device latencies, are the major source of congestion in today's fabrics, caused by devices failing to promptly return buffer credits to the switch.

Device-Based Latencies
A device experiencing latency responds more slowly than expected. The device does not return buffer credits (through R_RDY primitives) to the transmitting switch fast enough to support the offered load, even though the offered load is less than the maximum physical capacity of the link connected to the device.
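This credit mechanism can be sketched in a few lines of illustrative Python (all class and function names here are hypothetical, not Brocade code): the transmitter may only send while it holds buffer credits, and a device that is slow to return R_RDY primitives stalls the link even though link capacity is ample.

```python
from collections import deque

class CreditedLink:
    """Toy model of Fibre Channel buffer-to-buffer flow control."""
    def __init__(self, credits):
        self.credits = credits          # credits held by the transmitter
        self.in_flight = deque()        # frames awaiting an R_RDY from the receiver

    def try_send(self, frame):
        """Transmit only if a buffer credit is available."""
        if self.credits == 0:
            return False                # transmitter must hold the frame (back pressure)
        self.credits -= 1
        self.in_flight.append(frame)
        return True

    def receiver_returns_rdy(self):
        """Receiver processed a frame and returns an R_RDY primitive."""
        self.in_flight.popleft()
        self.credits += 1

link = CreditedLink(credits=2)
print([link.try_send(f) for f in ("f1", "f2", "f3")])  # third send stalls: no credit left
link.receiver_returns_rdy()                            # one R_RDY comes back...
print(link.try_send("f3"))                             # ...and the held frame can go
```

The same stall is what a latency-affected device inflicts on its switch port: the offered load is below link capacity, but sends block on credit availability.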

SAN Fabric Resiliency Best Practices v2.0


[Figure 1 graphic omitted: switches S1, S2, and B1 with ports P1-P6; each link has a capacity of 8 Gbps but achieves a throughput of only 1 Gbps.]
Figure 1. Device latency example

Figure 1 illustrates the condition where a buffer backup on ingress port 6 on B1 causes congestion upstream on S1, port 3. Once all available credits are exhausted, the switch port connected to the device must hold additional outbound frames until the device returns a buffer credit. When a device does not respond in a timely fashion, the transmitting switch is forced to hold frames for longer periods, resulting in high buffer occupancy. This in turn causes the switch to lower the rate at which it returns buffer credits to other transmitting switches. The effect propagates through switches (potentially multiple switches, when devices attempt to send frames to devices attached to the switch with the high-latency device) and ultimately affects the fabric.
[Figure 2 graphic omitted: hosts and storage arrays attached to two switches (A and B) joined by an ISL. Callouts: 1. Buffer credits exhausted. 2. Backflow of credit exhaustion depletes credits on the ISL port on Switch A. 3. The backflow of credit exhaustion continues to the ISL port on Switch B. 4. All servers using the ISL are affected. 5. The connection to a second storage device is now at risk.]
Figure 2. Latency on a switch can propagate through the fabric

NOTE: The impact to the fabric (and other traffic flows) varies with the severity of the latency exhibited by the device. The longer the delay the device causes in returning credits to the switch, the more severe the problem.

Moderate Device Latencies


Moderate device latencies, from the fabric perspective, are defined as those not severe enough to cause frame loss. If the time between successive credit returns by the device is between a few hundred microseconds and tens of milliseconds, the device exhibits mild to moderate latency, since such delays are typically not enough to cause frame loss. Application performance drops, but frame drops and I/O failures typically do not occur. The effect of moderate device latencies on host applications may still be profound, depending on the average disk service times the application expects. Mission-critical applications that expect average disk service times


of, for instance, 10 ms, are severely affected by storage latencies in excess of the expected service times.

Moderate device latencies have traditionally been very difficult to detect in the fabric. Advanced monitoring capabilities implemented in Brocade ASICs and Brocade FOS have made these moderate device latencies much easier to detect by providing the following information and alerts:

- Bottleneck Detection Alerts generated by switches in the fabric, if Bottleneck Detection is activated on the affected ports
- Elevated tim_txcrd_z (see below) counts on the affected F_Port; that is, the F_Port where the affected device is connected
- Potentially elevated tim_txcrd_z counts on all E_Ports carrying the flows to and from the affected F_Port/device

NOTE: tim_txcrd_z is defined as the number of times the port was unable to transmit frames because the transmit Buffer-to-Buffer Credit (BBC) count was zero. The purpose of this statistic is to detect congestion or a device affected by latency. The counter is sampled at intervals of 2.5 microseconds and is incremented if the condition is true, so each sample represents 2.5 microseconds of time with zero Tx BBC. tim_txcrd_z counts are not an absolute indication of significant congestion or latency; they are just one of the factors in determining whether real latencies or fabric congestion are present. Some level of congestion is to be expected in a large production fabric and is reflected in tim_txcrd_z counts. The Brocade FOS Bottleneck Detection capability was introduced to remove the uncertainty around identifying congestion in a fabric.
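As a rough illustration of how the sampled counter translates into a congestion percentage (a hypothetical helper, not a Brocade FOS tool): the increase in tim_txcrd_z between two polls, multiplied by the 2.5-microsecond sample period, approximates the time the port spent at zero transmit credit.

```python
SAMPLE_PERIOD_US = 2.5   # tim_txcrd_z is sampled every 2.5 microseconds

def zero_credit_percentage(delta_tim_txcrd_z, window_seconds):
    """Approximate share of the polling window spent with zero Tx credit."""
    zero_credit_seconds = delta_tim_txcrd_z * SAMPLE_PERIOD_US / 1_000_000
    return 100.0 * zero_credit_seconds / window_seconds

# A port whose tim_txcrd_z counter grew by 2,400,000 over a 60-second poll
# spent roughly 6 seconds, or 10% of the interval, unable to transmit:
pct = zero_credit_percentage(2_400_000, 60)
print(f"{pct:.1f}% of the interval at zero Tx credit")
```

As the note above stresses, such a percentage on its own is not proof of a problem; it is one input alongside Bottleneck Detection alerts and discard counters.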

Severe Device Latencies


Severe device latencies result in frame loss, which triggers the host Small Computer Systems Interface (SCSI) stack to detect failures and retry I/Os. This process can take tens of seconds (possibly as long as 30 to 60 seconds), causing very noticeable application delays and potentially application errors. If the time between successive credit returns by the device exceeds 100 ms, the device is exhibiting severe latency.

When a device exhibits severe latency, the switch is forced to hold frames for excessively long periods of time (on the order of hundreds of milliseconds). When this time exceeds the established timeout threshold, the switch drops the frame (per Fibre Channel standards). Frame loss in switches is also known as C3 discards or timeouts. Since the effect of device latencies often spreads through the fabric, frames can be dropped due to timeouts not just on the F_Port to which the misbehaving device is connected, but also on E_Ports carrying traffic to that F_Port. Dropped frames typically cause I/O errors that result in host retries, which can significantly decrease application performance. These effects are compounded by the fact that frame drops on the affected F_Port (device) result not only in I/O failures to the misbehaving device (which are expected), but also on E_Ports, where they may cause I/O failures for unrelated traffic flows involving other hosts (which typically are not expected).
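The hold-time discard behavior can be sketched as follows (illustrative Python; the function and data shapes are hypothetical, not switch firmware): frames queue with a timestamp, and any frame held past the hold time is discarded, which is what the er_tx_c3_timeout counter records.

```python
HOLD_TIME_MS = 500  # Brocade default Hold Time

def expire_held_frames(queue, now_ms, hold_time_ms=HOLD_TIME_MS):
    """Drop frames that have been held longer than the hold time.

    `queue` is a list of (frame_id, enqueued_at_ms); returns (kept, discarded).
    Each discard would increment the er_tx_c3_timeout counter on the port.
    """
    kept, discarded = [], []
    for frame_id, enqueued_at in queue:
        if now_ms - enqueued_at > hold_time_ms:
            discarded.append(frame_id)      # C3 transmit timeout (TX_TO)
        else:
            kept.append((frame_id, enqueued_at))
    return kept, discarded

queue = [("f1", 0), ("f2", 400), ("f3", 590)]
kept, dropped = expire_held_frames(queue, now_ms=600)
print(dropped)   # only f1 exceeded the 500 ms hold time
```

Lowering the hold time on edge switches (via EHT, discussed later) makes this expiry fire sooner, freeing credits before back pressure spreads to ISLs.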

Latencies on ISLs
Latencies on ISLs are usually the result of back pressure from latencies elsewhere in the fabric. The cumulative effect of many individual device latencies can slow the link. The link itself might also produce latencies, if it is a long-distance link with distance delays or if too many flows use the same ISL. Whereas each device on its own may not appear to be a problem, too many flows with some level of latency crossing a single ISL or trunked ISL may become one. Latency on an ISL can ripple through other switches in the fabric and affect unrelated flows. Brocade FOS can provide alerts and information indicating possible ISL latencies in the fabric, through one or more of the following:

- Bottleneck Detection Alerts generated by switches in the fabric, if Bottleneck Detection is activated on the affected ports
- C3 transmit discards (er_tx_c3_timeout) on the E_Port or EX_Port carrying the flows to and from the affected F_Port or device


- Brocade Fabric Watch alerts, if they are configured for C3 timeouts
- Elevated tim_txcrd_z counts on the affected E_Port, which may also indicate congestion
- C3 receive discards (er_rx_c3_timeout) on E_Ports in the fabric carrying flows of a high-latency F_Port

Detection
Latencies on ISLs can affect all the flows, and multiple switches, connected through the congested ISLs. Hence, it is critical to detect and mitigate the cause of the latency as quickly as possible. Often this requires investigating the places where the symptoms are detected and then working backwards to identify the actual source of the problem. Some of the visible symptoms of a congested ISL include:

- Bottleneck Detection Alerts (if Bottleneck Detection is activated on the affected ports)
- C3 transmit discards on the device F_Port
- Brocade Fabric Watch alerts, if they are configured for C3 timeouts
- Elevated tim_txcrd_z counts on the affected F_Port, which may indicate congestion
- C3 receive discards on E_Ports in the fabric carrying flows to the affected F_Port
- Elevated tim_txcrd_z counts on all E_Ports carrying flows to the affected F_Port

Brocade FOS Frame Viewer can be used to detect severely congested flows from C3 discard data. The Source ID and Destination ID information about a flow can point in the right direction.

Mitigation
Once you detect the source of ISL congestion, you can identify mitigation steps. There is little that can be done in the fabric itself to mitigate the effects of device latencies or traffic congestion. The source of congestion could be:

- Slowly responding hosts
- Arrays with long latencies
- Long-distance ISLs (increasing the BBCs on a long-distance ISL can help)

Dynamic Path Selection (default switch routing) tries to distribute frames across as many equal-cost paths as possible from the host to the storage. Assuming that not all paths are affected by latencies, providing more paths through the fabric reduces the likelihood that a congested device or link will affect overall traffic flow.

Initiators vs. Targets


Field experience has shown that severe latencies are most likely to occur on hosts rather than on storage ports. In architectures where hosts and targets are deployed on separate switches, Brocade recommends configuring Edge Hold Time to a value that is less than Hold Time. Some storage arrays, for example, reset a port if they have not received a credit in 400 ms. Every flow connected to the reset port then has to log back into that port, causing an interruption on each attached host. Setting EHT to less than 400 ms on the switches where the hosts are located avoids these resets.

Best practice recommendations:

- Enable Bottleneck Detection on all F_Ports wherever possible, to detect device latencies. Bottleneck Detection can also help you determine inter-switch latencies.
- Enable C3 timeout monitoring via Brocade Fabric Watch.
- Configure EHT on switches with hosts only.
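The reasoning behind the 400 ms example can be expressed as a small, hypothetical helper (illustrative only; actual EHT values are constrained by platform and Brocade FOS version, so consult the release documentation for legal settings):

```python
DEFAULT_HOLD_TIME_MS = 500

def choose_edge_hold_time(array_reset_ms, margin_ms=50):
    """Pick an EHT comfortably below the array's credit-starvation reset window.

    Illustrative policy only: drop frames (freeing credits) before the attached
    array gives up and resets its port, forcing every flow to log back in.
    """
    eht = min(DEFAULT_HOLD_TIME_MS, array_reset_ms) - margin_ms
    if eht <= 0:
        raise ValueError("reset window too small to protect with EHT")
    return eht

# For an array that resets its port after 400 ms without a credit:
print(choose_edge_hold_time(400))   # frames drop well before the array resets
```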


LOSS OF BUFFER CREDITS


Description
BBCs form the basis of flow control in a Fibre Channel SAN and determine how many frames a port has buffers to store. Credit loss occurs when an R_RDY or VC_RDY primitive is not returned (or a malformed primitive is returned) after a frame is received and processed at one end of a link; corrupt primitives are dropped by the receiver as malformed. If credit loss happens too often, all the available buffer credits on a link may eventually become depleted, and traffic between the two end points ceases. This condition is cleared by issuing a link reset (LR) on the port in question.

Permanent credit loss is very infrequent and is usually caused by some external condition, such as electrical noise, faulty media, poorly seated blades that corrupt the primitive, or misbehaving devices not returning R_RDYs. Permanent credit loss can occur on a port (such as an F_Port) or on a VC on ISLs, BE ports, or other links where VCs are supported. Note that traffic still flows as long as there are available credits on a VC or port.

Temporary credit starvation caused by congestion is often mistaken for permanent credit loss, as the initial symptoms of the two conditions are exactly the same. The difference is that congestion usually abates, and the buffer credits are eventually freed. All credit starvation produces effects that appear elsewhere in the fabric, due to back pressure from the source of the starvation. Some examples of the results of back pressure are:

- I/O timeouts on servers
- SCSI transport errors in server logs that appear to be random and unrelated
- Transmit C3 timeouts on the affected port (and, possibly, receive C3 timeouts on ISLs in other switches)

Brocade FOS has evolved through five generations of Fibre Channel speed transitions and provides progressively better capabilities to monitor traffic flow through the switches.
On FE device ports, link credit loss is checked whenever traffic stops flowing for more than two seconds. This check is the means of determining whether credit starvation is due to congestion or to actual permanent credit loss: when starvation is due to congestion, the number of available credits returns to the maximum assigned to the port or VC; permanent loss of one or more credits shows a value lower than the assigned maximum.

Monitoring and detecting buffer credit loss on BE links presents additional complications, although Brocade FOS has been enhanced over time to provide resiliency mechanisms for these connections as well. On these internal connections, credit loss is typically associated with mechanical or component issues, often something as simple as a poorly seated blade in a chassis. The following tables detail the actions and alerts generated across Brocade FOS versions in response to detected buffer credit loss.

Table 1. Brocade FOS Response to Front-End Link Credit Loss
Brocade FOS Version | Action | Alert
Prior to v6.3.1a | Issue link reset after 2 seconds | CDR-5021 or C2-5021 message logged
v6.3.1a, v6.3.2, and v6.4.0 (and later releases) | Port Fencing, Edge Hold Time, and Bottleneck Detection for setting credit recovery | CDR-5021 or C2-5021 message logged
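The congestion-versus-permanent-loss test described above can be sketched as follows (hypothetical Python helper, not the Brocade FOS implementation): after traffic quiesces, compare the credits that return against the assigned maximum.

```python
def classify_credit_starvation(assigned_max, credits_after_quiesce):
    """Distinguish congestion from permanent credit loss, per the text above.

    After traffic stops (e.g. more than 2 seconds with no flow), a congested
    port recovers all assigned credits; a port that has permanently lost
    credits settles below its assigned maximum. Illustrative logic only.
    """
    if credits_after_quiesce == assigned_max:
        return "congestion (credits recovered)"
    lost = assigned_max - credits_after_quiesce
    return f"permanent loss of {lost} credit(s); link reset (LR) needed"

print(classify_credit_starvation(8, 8))   # starvation was only congestion
print(classify_credit_starvation(8, 6))   # two credits never came back
```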


Table 2. Brocade FOS Response to Back-End Link Credit Loss

Brocade FOS Version | Action | Alert
Prior to v6.3.1a | Issues LR after 2 seconds for full stoppage in traffic | CDR-5021 or C2-5021 message logged
v6.3.1a, v6.3.2, and v6.4.0 (and later releases) | Improved detection for lost credits (multiple lost credits, single VC) | CDR-5021 or C2-5021 messages enhanced
v6.3.2b, v6.4.2, and v7.0.0 (and later releases) | Automatic recovery of lost credits on BE ports via an LR; permanent blade errors result in the blade being faulted (activated via the bottleneckmon command) | CDR-1012 or C2-1012 messages logged
v6.4.3a and v7.0.2 (and later releases) | Enhanced automatic recovery; the check for congestion vs. permanently lost credit is invoked automatically if transmit timeouts are detected on the BE (no need for 2 seconds without traffic); any detected lost credits result in a link reset on the BE port | C2-1027 for credit loss, C2-1014 if an LR is performed

Condor 3 ASIC Enhancements


In addition to Brocade FOS capabilities, the fifth generation of the Brocade FC switching ASIC (Condor 3) can automatically detect and recover a single permanently lost credit at the VC level, provided that both ends of the link are Condor 3-based ports. Multiple lost credits are recovered automatically via an LR, with appropriate user notifications and alerts.

Within the boundary of a switch there is a potential (although an extremely low probability) of credit loss on the BE links (the internal links between two ASICs on the same switch). This can be due to improperly seated blades or other conditions. Brocade FOS provides mechanisms to detect and mitigate the loss of credits on the BE links as well.

The Condor 3 ASIC has built-in features that further minimize credit loss on the BE ports. As soon as a port blade is plugged into a chassis and a BE link is brought online, Condor 3 ASIC ports automatically tune the links for optimal performance. This standards-based automatic tuning mitigates the effect of electrical noise due to incorrectly seated blades or other environmental factors. The capability is automatically enabled on Condor 3-based BE links, and it recovers from any loss of buffer credits.

In addition to BE link auto tuning, Condor 3 ASIC ports are capable of Forward Error Correction (FEC) to enhance transmission reliability and performance. FEC can correct up to 11 bit errors in every 2112-bit transmission in a 10 Gbps/16 Gbps data stream, in both frames and primitives. FEC is enabled by default on the BE links of Condor 3 ASIC-based switches and blades, minimizing the loss of credits on BE links. FEC is also enabled by default on FE links when connected to another FEC-capable device.

Best practice recommendation: Always have at least two-member trunks (using Brocade Trunking) on FE links, where possible. This prevents traffic from stopping unless all credits on all trunk members for the VC or port are lost (a very rare situation).
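The FEC error budget (up to 11 correctable bit errors per 2112-bit block) can be illustrated with a small sketch. This models only the budget arithmetic, not the Condor 3 coding scheme, and all names are hypothetical:

```python
BLOCK_BITS = 2112       # FEC operates on 2112-bit transmission blocks
MAX_CORRECTABLE = 11    # up to 11 bit errors corrected per block

def fec_block_report(error_bit_positions):
    """Group bit-error positions into 2112-bit blocks and flag uncorrectable ones.

    Illustrative model of the error budget only, not the ASIC implementation.
    """
    per_block = {}
    for pos in error_bit_positions:
        block = pos // BLOCK_BITS
        per_block[block] = per_block.get(block, 0) + 1
    return {blk: ("corrected" if n <= MAX_CORRECTABLE else "uncorrectable")
            for blk, n in sorted(per_block.items())}

# 5 scattered errors in block 0 are corrected; 12 clustered errors in block 1 are not.
errors = list(range(0, 50, 10)) + list(range(2112, 2124))
print(fec_block_report(errors))
```

The point of the sketch is that FEC absorbs scattered bit errors transparently, which is why it reduces credit loss from corrupted R_RDY/VC_RDY primitives.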

Detection
Lost credit situations display a variety of symptoms, depending on the Brocade FOS release, the FC generation of the switch, and whether BE credit recovery is activated. The RASlog messages that may be generated in the event of lost credits are as follows:


- CDR-5021 or CX-5021 RASlog messages
- CDR-1012 or CX-1012 RASlog messages
- CDR-1014 or CX-1014 RASlog messages
- CDR-1015 or CX-1015 RASlog messages
- CDR-1016 or CX-1016 RASlog messages
- CDR-1017 or CX-1017 RASlog messages
- CDR-5079 or CX-5079 RASlog messages

Total credit loss situations result in flows being blocked across the port or VC in question. Hosts and/or targets are affected, and C3 receive timeout frame drops are observed on other switches in the fabric.

Mitigation
Brocade FOS cannot control the conditions leading to permanent loss of credit on FE ports. Permanent loss of all credits on a port can be handled through either a manual or an automatic LR on the port. Enable Bottleneck Detection for credit loss and recovery on all ports; see Appendix A: Bottleneck Detection for more detail on how Bottleneck Detection enables Brocade FOS to detect lost credits.

Credit Recovery on Back-End Ports


Use the --cfgcredittools commands to enable or disable credit recovery of back-end ports, and use the --showcredittools parameter to display the configuration. When this feature is enabled, credit is recovered on back-end ports (ports connected to the core blade, or core blade back-end ports) when credit loss is detected on these ports. If complete loss of credit on a Condor 2 back-end port causes frame timeouts, an LR is performed on that port regardless of the configured setting, even if that setting is -recover off.

When used with the -recover onLrOnly option, the recovery mechanism takes the following escalating actions:

1. When the mechanism detects credit loss, it performs an LR and logs a RASlog message (CX-1014).
2. If the LR fails to recover the port, the port reinitializes, and a RASlog message is generated (CX-1015). Note that the port reinitialization does not fault the blade.
3. If the port fails to reinitialize, the port is faulted, and a RASlog message (CX-1016) is generated.
4. If a port is faulted and there are no more online back-end ports in the trunk, the port blade is faulted, and a RASlog message (CX-1017) is generated.

When used with the -recover onLrThresh option, recovery is attempted through repeated LRs, and a count of the LRs is kept. If the threshold of more than two LRs per hour is reached, the blade is faulted (CX-1018). Note that regardless of whether the LR occurs on the port blade or on the core blade, it is always the port blade that is faulted.

If complete credit loss on a particular VC on a particular back-end port is suspected, use the -check option to examine that back-end port and VC for credit loss. If the command detects complete credit loss, the information is reported. In addition, if the option to enable link resets on back-end ports is configured, the command performs an LR on the link in an attempt to recover from the problem. The user must explicitly initiate this check, and it is a one-time operation.
In other words, the command does not continuously monitor for credit loss in the background. Detection of credit loss takes 2 to 7 seconds, after which the results of the operation are displayed. An LR also generates a RASlog message.

Best practice recommendation: Enable Bottleneck Detection for lost credits.
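The onLrOnly escalation ladder can be sketched as a simple sequence (illustrative Python with hypothetical names; the real mechanism runs inside Brocade FOS):

```python
def escalate_credit_recovery(step_succeeded):
    """Walk the onLrOnly escalation ladder described above.

    `step_succeeded` maps each action to whether it recovered the port.
    Returns the (action, RASlog) pairs taken, mirroring the documented order:
    link reset -> port reinit -> port fault -> blade fault. Illustrative only.
    """
    ladder = [
        ("link_reset", "CX-1014"),
        ("port_reinit", "CX-1015"),
        ("fault_port", "CX-1016"),
        ("fault_blade", "CX-1017"),   # only if no other BE port in the trunk is online
    ]
    taken = []
    for action, raslog in ladder:
        taken.append((action, raslog))
        if step_succeeded.get(action, False):
            break
    return taken

# Link reset fails, but reinitialization recovers the port:
print(escalate_credit_recovery({"link_reset": False, "port_reinit": True}))
```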


TOOLS
Bottleneck Detection
Bottleneck Detection was introduced in Brocade FOS v6.3.0 with monitoring for device latency conditions, and then enhanced in Brocade FOS v6.4.0 with added support for congestion detection on both E_Ports and F_Ports. Brocade FOS v6.4 also added improved reporting options and simplified configuration capabilities. The Brocade FOS v6.3.1b release introduced enhancements to improve the accuracy of detecting device latency. Bottleneck Detection does not require a license and is supported on 4 Gbps, 8 Gbps, and 16 Gbps platforms.

Enhanced Bottleneck Detection


In Brocade FOS v6.4.3 and v7.0.1b, additional enhancements were made to distinguish between congestion and latency conditions. Alerts for congestion and latency were de-coupled to help reduce the potential for alert storms, in which a sudden spike in congestion conditions masks latency events. Refer to Appendix A: Bottleneck Detection for more details on the evolution of Bottleneck Detection capabilities, as well as recommended best practice settings.

Brocade Fabric Watch


Brocade Fabric Watch is an optional (licensed) feature that monitors various Brocade FOS metrics, statistics, and switch component states. You can set thresholds on most counter values, rates of counter value change, and component states, and alerts can be generated when thresholds are exceeded. Brocade Fabric Watch usability was improved in Brocade FOS v6.4.0 through changes to the CLI, and the Brocade Fabric Watch documentation also received a major facelift as part of that release. It is now much easier to specify threshold values and activate special features such as Port Fencing.

Brocade Fabric Watch can notify the user through any of the following mechanisms:

- Send an SNMP trap
- Log a RASlog message
- Send an e-mail alert
- Log a syslog message

Refer to the Fabric OS Fabric Watch Administrator's Guide for complete details on the use of Brocade Fabric Watch.
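A Fabric Watch-style high-threshold check can be sketched as follows (a hypothetical helper illustrating the threshold-and-action idea, not the Fabric Watch API or CLI):

```python
def check_threshold(counter_deltas, high_threshold, actions=("raslog", "snmp")):
    """Minimal sketch of a high-threshold check over one polling interval.

    `counter_deltas` maps port -> counter increase over the interval
    (e.g. CRC errors per minute). Ports above the threshold get the
    configured notification actions. Hypothetical helper only.
    """
    alerts = {}
    for port, delta in counter_deltas.items():
        if delta > high_threshold:
            alerts[port] = {"value": delta, "actions": list(actions)}
    return alerts

# Two ports polled for CRC errors; only port 7 breaches the threshold of 5/min.
print(check_threshold({3: 1, 7: 22}, high_threshold=5))
```

Port Fencing, described next, is conceptually the same check with a blocking action attached instead of (or in addition to) a notification.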

Port Fencing
You can use Brocade Fabric Watch thresholds to protect a switch by automatically blocking a port when specified thresholds are reached. This feature is called Port Fencing, and it was a Brocade Fabric Watch enhancement in Brocade FOS v6.1.0. The recommended thresholds for Port Fencing that are specified in Appendix B: Configuring Brocade Fabric Watch Port Fencing have been tested and tuned to quarantine components that are misbehaving at the point at which they are likely to cause broader fabric-wide issues.

Edge Hold Time


EHT provides an overriding value for Hold Time (HT) that is applied to individual F_Ports on Gen 5 Fibre Channel platforms, or to all ports on an individual ASIC on 8 Gbps platforms if any port on that ASIC is operating as an F_Port. The default HT setting is 500 ms. Field experience shows that most very high latencies are produced by initiators, not by targets as is frequently believed. When a host is not responding, back pressure can build up in the fabric and affect many flows between other hosts and their storage ports. Edge Hold Time can relieve this pressure by dropping frames sooner than the default hold time setting would, thus allowing frames to flow again. For complete details and recommendations on setting EHT, refer to Appendix D: Edge Hold Time (EHT).


Frame Viewer
Frame Viewer was introduced in Brocade FOS v7.0 to give the fabric administrator more visibility into C3 frames dropped due to timeouts. When frame drops are observed on a switch, the user can utilize this feature to find out which flows the dropped frames belong to, and potentially determine the affected applications by identifying the endpoints of the dropped frames. Frames discarded due to timeout are sent to the CPU for processing, and Brocade FOS captures and logs information about each frame, such as the Source ID (SID), Destination ID (DID), and transmit port number. This information is maintained for a limited number of frames, and the user can retrieve and display it through the CLI.

NOTE: Frame Viewer captures only FC frames that are dropped from a Receive (Rx) buffer due to a timeout on an Edge ASIC (an ASIC with FE ports). If a frame is dropped for any other reason, or is dropped due to a timeout on an Rx buffer on a Core ASIC, it is not captured by Frame Viewer. A timeout here means that a frame sits in an Rx buffer for longer than the default Hold Time of 500 ms or the custom Edge Hold Time setting.

See Appendix C: Sample Frame Viewer session for an example of a Frame Viewer session.
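How Frame Viewer data can point at the worst-affected flows can be sketched like this (illustrative Python over hypothetical log records shaped like the SID/DID/port fields described above):

```python
from collections import Counter

def top_discard_flows(discard_log, n=3):
    """Summarize timed-out C3 discards by (SID, DID) flow.

    `discard_log` holds dicts with 'sid', 'did', and 'tx_port' per dropped
    frame; the busiest flows identify the endpoints (and thus applications)
    most affected. Illustrative data shapes only, not Frame Viewer output.
    """
    flows = Counter((e["sid"], e["did"]) for e in discard_log)
    return flows.most_common(n)

log = [
    {"sid": "0x010200", "did": "0x020300", "tx_port": 4},
    {"sid": "0x010200", "did": "0x020300", "tx_port": 4},
    {"sid": "0x010500", "did": "0x020300", "tx_port": 9},
]
print(top_discard_flows(log))   # the repeated SID/DID pair surfaces first
```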

DESIGNING RESILIENCY INTO THE FABRIC


This topic is handled in detail in a companion document entitled SAN Design and Best Practices. What follows is a brief introduction to some of the issues affecting congestion in SAN fabrics. Modern environments impose very different requirements on fabrics than those that existed when Fibre Channel first appeared. These differences include:

- Special node behaviors, such as storage and workload virtualization. These applications can blur the difference between initiator and target behavior and can produce large volumes of very short frames.
- Heartbeat and other control devices may use in-band Fibre Channel and potentially disrupt normal traffic flow.
- Fabrics are typically much larger, providing more potential for congestion, misbehaving devices, media faults, and lost buffer credits.
- Higher-capacity storage arrays allow for much higher initiator fan-in to storage ports than previously seen.
- Many more fabrics are routed (using Layer 3 routing).

Factors Affecting Congestion


Some node behaviors merit particular attention when introduced into a fabric:

- Storage virtualization: Virtualized storage presents virtualized Logical Unit Numbers (LUNs) to initiators. Usually these devices are composed of some sort of FE controller and BE storage. Virtualized storage tends to increase traffic across ISLs, because frames must first be transferred to and from the controller and its BE storage, as well as to and from the initiator. The proper configuration of the controllers and placement of BE storage is critical to ensuring that fabrics are not congested. The controllers are usually deployed in some form of cluster and may communicate via in-band Fibre Channel to maintain storage integrity, which can result in high volumes of short frames that consume precious buffer credits.
- Workload virtualization: A virtualized workload can present its own challenges to fabric traffic throughput. Migrating virtual machines can increase fabric and storage array overheads by issuing high volumes of SCSI reserves, which can create latencies that are difficult to trace to root cause. Many of these products make extensive use of N_Port ID Virtualization (NPIV), which can obscure initiator flow and latency information.


- Clustering: Clustering solutions often impose I/O and synchronization requirements on a fabric beyond the volumes typically seen with standalone platforms. This can result in more short frames traversing the fabric when storage status is being continuously queried.

Congestion from applications is best addressed at the source itself; the fabric cannot completely compensate for node behavior.

Resiliency
Fabric resiliency is usually threatened by misbehaving devices or other factors external to the fabric. Many features have been added to Brocade FOS to improve resiliency, but not all failing node conditions can be detected or handled transparently in the fabric. Redundancy is a very effective means of increasing resiliency in any SAN. The addition of fabric components must, of course, be weighed against the cost of doing so. Those costs should also be weighed against the opportunity cost of losing access to mission-critical applications.

Redundancy
- Fabrics: Redundancy provides a complete failover to a redundant fabric. This requires that all multipathing software on all hosts is in perfect working order and detects the failure.
- Cores: Redundancy allows for higher resiliency, because only the hosts attached to failing edge switches are required to fail over to the redundant fabric.
- Edges: Redundancy can be important in routed environments.
- Backbones: Redundancy is particularly important when distance is involved.
- ISLs: More paths allow more alternatives for frames to traverse the fabric.

Best practice recommendations:

- Duplicate the core switches in core-edge and edge-core-edge designs. This vastly improves resilience in the fabric and avoids host failovers to alternate fabrics if a core platform ever fails. Experience has shown that host failover software sometimes does not function as expected, causing outages for applications that were expected to fail over to a second fabric.
- Duplicate the Fibre Channel Router (FCR) backbone switches to protect against host failover failures. Often the costs associated with a failover failure greatly exceed the cost of the second FCR platform.
- Provide as many different paths through the fabric as possible. This is particularly true for routed fabrics, as these are prime points of congestion.


SUMMARY AND RECOMMENDATIONS


There are several common classes of abnormal behavior originating from fabric components or attached devices:

- Faulty media (fiber-optic cables and SFPs/optics): Faulty media can cause frame loss due to excessive CRC errors, invalid transmission words, and other conditions. This may result in I/O failure and application performance degradation.
- Misbehaving devices, links, or switches: Very occasionally a condition arises where a device (server or storage array) or link (ISL) starts behaving erratically and causes disruptions in the fabric. This may result in severe stress on the fabric.
- Congestion: Congestion is caused by latencies or insufficient link bandwidth. End devices that do not respond as quickly as expected cause the fabric to hold frames for excessive periods of time. This can result in application performance degradation or, in extreme cases, I/O failure.
- Permanent or temporary loss of buffer credits: This is caused by the other end of a link failing to acknowledge a request to transfer a frame because no buffers are available to receive the frame.

The following features and capabilities are recommended to improve the overall resiliency of Brocade FOS-based FC fabric environments:

- Enable Brocade Fabric Watch to detect frame timeouts (the Brocade Fabric Watch C3TX_TO area).
- Enable Port Fencing for transmit timeouts on F_Ports.
- Enable the Edge Hold Time feature, particularly in core-edge configurations.
- Enable Brocade Fabric Watch to monitor (alert) for CRC errors, Invalid Words, and State Changes, and to fence on extreme behavior.
- Enable Bottleneck Detection for congestion conditions.

Best practice recommendations:

- Enable Bottleneck Detection on all F_Ports wherever possible to detect device latencies and lost credits. Bottleneck Detection is also helpful in determining inter-switch latencies.
- Monitor C3 timeouts with Brocade Fabric Watch.
- Deploy EHT on switches with hosts only.
- Always have at least two-member trunks on FE links where possible. This prevents traffic from stopping unless all credits on all trunk members for the VC or port are lost (a very rare situation).
- Duplicate the core switches in core-edge and edge-core-edge designs. This vastly improves resilience in the fabric and avoids host failovers to alternate fabrics if a core platform ever fails. Experience has shown that host failover software sometimes does not function as expected, causing outages for applications that were expected to fail over to a second fabric.
- Duplicate the FCR backbone switches to protect against host failover failures. Often the costs associated with a failover failure greatly exceed the cost of the second FCR platform.
- Provide as many different paths through the fabric as possible. This is particularly true for routed fabrics, as these are prime points of congestion.

SAN Fabric Resiliency Best Practices v2.0

20 of 35


APPENDIX A: BOTTLENECK DETECTION


Evolution of Bottleneck Detection
What follows is a synopsis of the Bottleneck Detection CLI command parameters. See the Fabric OS Command Reference Manual for the release you are running for definitive usage information.

Brocade FOS v6.3


Brocade FOS v6.3 marked the initial release of Bottleneck Detection. Starting with Brocade FOS v6.3.1b, Bottleneck Detection was made available for F_Port latencies. Alerting is supplied through RASlog only: Bottleneck Detection in Brocade FOS v6.3 produces a RASlog message (AN-1003), with a severity level of WARNING, when a latency threshold is exceeded.

Brocade FOS v6.3 Bottleneckmon Parameters

- -time: The measurement interval (default 300 seconds). An integer value between 1 and 10800 inclusive. The time value is used to calculate the percentage of affected seconds that is compared to the threshold percentage to determine whether an alert can be generated.
- -qtime: The reporting interval (default 300 seconds). An integer value between 1 and 10800 inclusive. The quiet time value is the minimum amount of time in seconds between alerts for a port; alerts are suppressed until the quiet time is satisfied.
- -thresh: The latency threshold, as the minimum fraction of -time during which latency is detected (default 0.1, or 10%). A decimal value with 3 digits of precision between 0 and 1; multiplied by 100, it gives the latency threshold percentage. When the percentage of affected seconds over the time value is greater than the latency threshold percentage, an alert can be produced, depending on the quiet time setting.
- -alert: Specifies that an alert is generated when a threshold is exceeded.
- --show: Displays a history of the bottleneck severity on the specified port. The output shows the percentage of one-second intervals affected by the bottleneck condition within the specified time interval. This command succeeds only on online F_Ports.
- -interval interval_size: Specifies the time window in seconds over which the bottlenecking percentage is displayed in each line of the --show output. The maximum interval is 10800 seconds; the default is 10 seconds.
- -span span_size: Specifies the total duration in seconds covered in the --show output. History data is maintained for a maximum of three hours per port, so the span can be at most 10800 seconds.
- --status: Lists the ports for which Bottleneck Detection is enabled in the current logical switch, along with alert configuration settings. Ports that have been moved to a different logical switch are still shown if their configuration is retained.

Specifying Ports and Port Ranges


Ports may be specified by port number, by port index number, by port ranges per slot (for example, 2/0-5), or with the wildcard * to specify all ports. NOTE: On 48000 directors only, there is a constraint that no more than 100 ports can be monitored at a time. Port numbers and ranges may be supplied in a list as the last parameters on the command line. Parameter values are activated only when the monitors are enabled; you must disable monitors before a parameter value can be changed. All parameter values that differ from the defaults must be specified when using the --config option; all unspecified parameter values revert to their defaults.


CLI Examples

bottleneckmon --enable -alert 2/1 2/5-15 2/18-21
Enable Bottleneck Detection using defaults on ports 1, 5 to 15, and 18 to 21 on blade 2.

bottleneckmon --enable -alert -thresh 0.2 -time 30 1/0-31
Enable Bottleneck Detection on blade 1, ports 0 to 31, with a threshold of 20% and a time interval of 30 seconds.

bottleneckmon --disable *
Disable Bottleneck Detection on all ports.

bottleneckmon --disable 2/1 2/12-15
Disable Bottleneck Detection on ports 1 and 12 to 15 on blade 2.

Display Commands
To display bottleneck statistics on a specified port:

switch:admin> bottleneckmon --show -interval 5 -span 30 2/24
=============================================================
Mon Jun 15 18:54:35 UTC 2009
=============================================================
                                       Percentage of
From             To                    affected secs
=============================================================
Jun 15 18:54:30  Jun 15 18:54:35       80.00%
Jun 15 18:54:25  Jun 15 18:54:30       40.00%
Jun 15 18:54:20  Jun 15 18:54:25        0.00%
Jun 15 18:54:15  Jun 15 18:54:20        0.00%
Jun 15 18:54:10  Jun 15 18:54:15       20.00%
Jun 15 18:54:05  Jun 15 18:54:10       80.00%

To display the ports that are monitored for devices affected by latency bottlenecks:

switch:admin> bottleneckmon --status
Slot  Port  Alerts?  Threshold  Time (s)  Quiet Time (s)
==========================================================
2     0     N        --         --        --
2     1     Y        0.200      250       500
2     24    N        --         --        --

Brocade FOS v6.4


Bottleneck Detection was enhanced significantly in Brocade FOS v6.4. Support was added for congestion and latencies on E_Ports, and Congestion Bottleneck Detection was added for E_Ports, EX_Ports, and F_Ports. A new parameter, -cthresh, was added to monitor port bandwidth utilization; to avoid confusion, the latency threshold parameter, -thresh, was renamed -lthresh. An important change to note is that from Brocade FOS v6.4 onwards, when Bottleneck Detection is enabled, all online ports are monitored by default. This is the equivalent of bottleneckmon --enable *, and the intent is to simplify enabling the feature, on the assumption that most ports are to be monitored. To facilitate port selection, two new operations were added: --exclude and --include. Note that --exclude and --include cannot be combined with other operations; they must be issued as separate commands on their own.
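The default-on monitoring with --exclude and --include can be modeled as simple set operations. The sketch below is illustrative Python (assumed class and method names), not a representation of the actual FOS implementation.

```python
# Illustrative model of FOS v6.4 monitored-port semantics: enabling Bottleneck
# Detection monitors all online ports by default; --exclude and --include
# adjust that set as standalone operations. Port names are hypothetical.

class BottleneckMonitor:
    def __init__(self, all_ports):
        self.all_ports = set(all_ports)
        self.excluded = set()
        self.enabled = False

    def enable(self):
        # FOS v6.4+: equivalent of "bottleneckmon --enable *"
        self.enabled = True

    def exclude(self, ports):
        # like the CLI --exclude, issued as its own command
        self.excluded |= set(ports)

    def include(self, ports):
        self.excluded -= set(ports)

    @property
    def monitored(self):
        return (self.all_ports - self.excluded) if self.enabled else set()

mon = BottleneckMonitor(["2/0", "2/1", "2/9", "2/10", "2/11"])
mon.enable()
mon.exclude(["2/9", "2/10", "2/11"])   # like: bottleneckmon --exclude 2/9-11
print(sorted(mon.monitored))           # ['2/0', '2/1']
```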


Alerting was enhanced to include a special Bottleneck Detection SNMP MIB, called the BD MIB. Additional enhancements were back-ported to several Brocade FOS v6.4.x maintenance releases from capabilities introduced in Brocade FOS v7.x:

- v6.4.2: Added back-end (BE) credit recovery.
- v6.4.3: Decoupled alerts for latency and congestion.

In addition to changes in Brocade FOS, Bottleneck Detection support was added to Brocade Network Advisor. Refer to the Brocade Network Advisor documentation for more detail.

NOTE: On 48000 directors only, there is a constraint that no more than 100 ports can be monitored at a time. Port numbers and ranges may be supplied in a list as the last parameters on the command line. All parameter values that differ from the defaults must be specified when using the --config option; all unspecified parameter values revert to their defaults.

Brocade FOS v6.4 Bottleneckmon Parameters

- -time, -qtime, and -alert remain unchanged.
- -lthresh: Was -thresh in v6.3.
- -cthresh: Congestion threshold, as a fraction of -time (default 0.8, or 80%). A decimal value with 3 digits of precision between 0 and 1; multiplied by 100, it gives the congestion threshold percentage. When the percentage of affected seconds over the time value is greater than the congestion threshold percentage, an alert can be produced, depending on the quiet time setting. Note that this threshold actually refers to the percentage of the time interval (-time) during which link utilization exceeds 95%.
- --config: Change a parameter threshold value without disabling monitoring. NOTE: You must explicitly provide values for any parameters that you do not want to revert to their default values.
- --configclear: Clear the current values and revert to any switch-wide settings.
- --exclude: Specify a port range to be excluded from monitoring.
- --include: Specify a port range to be included for monitoring.
- -noalert: Disable alerts.
- --show: Enhanced to refresh latency or congestion displays.
- --cfgcredittools: Configure BE port credit recovery (added in v6.4.2).
- --showcredittools: Show BE port credit recovery values (added in v6.4.2).
- -alert=latency: --config parameter to alert only on latency bottlenecks (added in v6.4.3).
- -alert=congestion: --config parameter to alert only on congestion bottlenecks (added in v6.4.3).

CLI Examples

bottleneckmon --enable -alert -lthresh 0.2 -cthresh .7 -time 30 -qtime 30 1/0-31
Enable Bottleneck Detection on blade 1, ports 0 to 31, with a latency threshold of 20%, a congestion threshold of 70%, a time interval of 30 seconds, and a quiet time of 30 seconds.

bottleneckmon --config -cthresh .7 -lthresh .1 -time 60 -qtime 120 1/0-15
Change the congestion and latency thresholds on ports 0 to 15 on blade 1. Note that --config requires you to specify all the parameter values that you do not want to revert to the default values.

bottleneckmon --configclear 2/0-7
Clear the configuration on ports 0 to 7 on blade 2 and revert to the switch-wide configuration.
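The -cthresh semantics (a second counts as congestion-affected when link utilization exceeds 95%) can be made concrete with a small sketch. This is illustrative Python with assumed names, not firmware behavior.

```python
# Hypothetical illustration of -cthresh: a second is congestion-affected when
# link utilization exceeds 95%; an alert may fire when the fraction of
# affected seconds over the -time window exceeds the -cthresh fraction.

def congestion_alert(utilization_samples, cthresh, busy_level=0.95):
    """utilization_samples: one link-utilization value (0..1) per second."""
    affected = sum(1 for u in utilization_samples if u > busy_level)
    return affected / len(utilization_samples) > cthresh

# 25 of 30 seconds above 95% utilization is 83%, over a -cthresh of 0.7:
samples = [0.99] * 25 + [0.40] * 5
print(congestion_alert(samples, cthresh=0.7))  # True
```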


bottleneckmon --exclude 2/9-11
Exclude ports 9 to 11 on blade 2.

bottleneckmon --cfgcredittools -intport -recover onLrOnly
Activate the back-end credit recovery mechanism via the bottleneckmon CLI command. This instructs the firmware to issue a Link Reset (LR) whenever a loss-of-credit condition is detected on a back-end link. The firmware continuously scans the links, and during any 2-second window of inactivity, credit levels are confirmed.

Brocade FOS v7.0


Increased Resolution

Brocade FOS v7.x added the ability to use Bottleneck Detection to monitor for latency at a resolution below one second. Details on the new parameters enabling this are included below.

Decoupled Alerting

Brocade FOS v7.0.2 introduced the option to decouple latency and congestion alerts. In releases prior to FOS v7.0.2, enabling bottleneck alerts enabled alerting for both congestion and latency bottleneck conditions. Starting with Brocade FOS v7.0.2, users can choose to enable alerts only for latency bottlenecks, only for congestion bottlenecks, or for both.

Brocade FOS v7.0.x Bottleneckmon Parameters

- --cfgcredittools: Configure BE port credit recovery.
- --showcredittools: Show BE port credit recovery values.
- -lsubsectimethresh: Set the threshold for latency bottlenecks at the sub-second level.
- -lsubsecsevthresh: Set the severity (bandwidth effect) for latency bottlenecks at the sub-second level.

The sub-second parameters allow much finer tuning of bottleneck sampling.

-lsubsectimethresh time_threshold
Sets the threshold for latency bottlenecks at the sub-second level. The time_threshold specifies the minimum fraction of a second that must be affected by latency in order for that second to be considered affected by a latency bottleneck. For example, a value of 0.75 means that at least 75% of a second must have had latency bottleneck conditions in order for that second to be counted as an affected second. The time threshold value must be greater than 0 and no greater than 1; the default value is 0.8. Note that the application of the sub-second numerical limits is approximate. This command erases the statistics history and restarts alert calculations (if alerting is enabled) on the specified ports. When used with the --config option, you must specify a port.

-lsubsecsevthresh severity_threshold
Specifies the threshold on the severity of latency in terms of the throughput loss on the port at the sub-second level. The severity threshold is a floating-point value no less than 1 and no greater than 1000. This value specifies the factor by which throughput must drop in a second in order for that second to be considered affected by latency bottlenecking. For example, a value of 20 means that the observed throughput in a second must be no more than 1/20 the capacity of the port in order for that second to be counted as an affected second. The default value is 50. This command erases the statistics history and restarts alert calculations (if alerting is enabled) on the specified ports. When used with the --config option, you must specify a port.
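The two sub-second criteria combine as described above: a second counts as latency-affected only if both the time fraction and the throughput-drop factor are met. The sketch below is an illustrative Python model with assumed names, not firmware code.

```python
# Sketch of the sub-second criteria: a second is latency-affected only if the
# fraction of that second spent bottlenecked meets -lsubsectimethresh AND
# throughput dropped by at least the -lsubsecsevthresh factor.

def second_is_affected(bottlenecked_fraction, observed_throughput,
                       port_capacity, time_threshold=0.8,
                       severity_threshold=50.0):
    time_ok = bottlenecked_fraction >= time_threshold
    severity_ok = observed_throughput <= port_capacity / severity_threshold
    return time_ok and severity_ok

# Example: an 800 MB/s port, 90% of the second bottlenecked, throughput down
# to 10 MB/s. 10 <= 800/50 (= 16), so with defaults this second is affected.
print(second_is_affected(0.9, 10.0, 800.0))   # True
print(second_is_affected(0.9, 100.0, 800.0))  # False (throughput not low enough)
```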


CLI Example

switch:admin> bottleneckmon --status
Bottleneck Detection - Enabled
==============================
Switch-wide sub-second latency bottleneck criterion:
====================================================
Time threshold - 0.800
Severity threshold - 50.000
Switch-wide alerting parameters:
================================
Alerts - Yes
Congestion threshold for alert - 0.800
Latency threshold for alert - 0.100
Averaging time for alert - 300 seconds
Quiet time for alert - 300 seconds
Per-port overrides for sub-second latency bottleneck criterion:
===============================================================
Slot  Port  TimeThresh  SevThresh
=========================================
0     3     0.500       100.000
0     4     0.600       50.000
0     5     0.700       20.000
Per-port overrides for alert parameters:
========================================
Slot  Port  Alerts?  LatencyThresh  CongestionThresh  Time(s)  QTime(s)
=================================================================
0     1     Y        0.990          0.900             3000     600
0     2     Y        0.990          0.900             4000     600
0     3     Y        0.990          0.900             4000     600
Excluded ports:
===============
Slot  Port
============
0     2
0     3
0     4

Back-end port credit recovery examples

To enable back-end port credit recovery with the link reset only option and to display the configuration:

switch:admin> bottleneckmon --cfgcredittools -intport -recover onLrOnly
switch:admin> bottleneckmon --showcredittools
Internal port credit recovery is Enabled with LrOnly

To enable back-end port credit recovery with the link reset threshold option and to display the configuration:

switch:admin> bottleneckmon --cfgcredittools -intport -recover onLrThresh
switch:admin> bottleneckmon --showcredittools
Internal port credit recovery is Enabled with LrOnThresh

To disable back-end port credit recovery and to display the configuration:

switch:admin> bottleneckmon --cfgcredittools -intport -recover off


switch:admin> bottleneckmon --showcredittools Internal port credit recovery is Disabled

Command Parameter Summary


Table 3. Command Parameter Summary
Parameters and Subparameters --enable --disable --config -alert -alert=latency -alert=congestion -noalert -thresh -lthresh -cthresh -time -qtime -lsubsectimethresh -lsubsecsevthresh --include --exclude --configclear --show -refresh -latency -congestion -interval -span --status --help --cfgcredittools -intport -recover Off onLRonly onLrThresh --showcredittools 300 sec 300 sec Y 10 sec 300 sec Y Y 6.3.1 Y Y Y Y Y 0.1 6.4.0 Y Y Y Y Y * 0.1 0.8 300 sec 300 sec Y Y Y Y Y Y Y 10 sec 300 sec Y Y 6.4.2 Y Y Y Y Y 0.1 0.8 300 sec 300 sec Y Y Y Y Y Y Y 10 sec 300 sec Y Y Y Y Y Y Y Y Y 6.4.3 Y Y Y Y Y Y Y 0.1 0.8 300 sec 300 sec Y Y Y Y Y Y Y 10 sec 300 sec Y Y Y Y Y Y Y Y Y 7.0.0 Y Y Y Y Y 0.1 0.8 300 sec 300 sec 0.8 sec 50 Y Y Y Y Y Y Y 10 sec 300 sec Y Y Y Y Y Y Y Y Y 7.0.1 Y Y Y Y Y Y Y 0.1 0.8 300 sec 300 sec 0.8 sec 50 Y Y Y Y Y Y Y 10 sec 300 sec Y Y Y Y Y Y Y Y Y 7.0.2 Y Y Y Y Y Y Y 0.1 0.8 300 sec 300 sec 0.8 sec 50 Y Y Y Y Y Y Y 10 sec 300 sec Y Y Y Y Y Y Y Y Y


Notes on release descriptions:
(blank) - Not supported
Y - Supported
* - -lthresh is backwards compatible with -thresh for this release
Anything else - Default value

Suggested Parameter Settings


Field experience shows that the original strategy of enabling Bottleneck Detection with conservative values for latency thresholds almost always yields no results. There was a concern that aggressive values would result in Bottleneck Detection alert storms, but this has not been the case; even the most aggressive values result in relatively few alerts being generated. As a result, it is now recommended that the most aggressive settings be tried first and then backed off gradually if too many alerts are seen.

Table 4. Suggested Bottleneck Detection Parameter Settings

Brocade FOS v6.3
Parameter            Conservative  Normal  Aggressive
-time                300           60      5
-qtime               300           60      1
-thresh              0.3           0.2     0.1

Brocade FOS v6.4
Parameter            Conservative  Normal  Aggressive
-time                300           60      5
-qtime               300           60      1
-lthresh             0.3           0.1     0.2
-cthresh             0.8           0.5     0.1

Brocade FOS v7.0.x
Parameter            Conservative  Normal  Aggressive
-time                300           60      5
-qtime               300           60      1
-lthresh             0.3           0.1     0.2
-cthresh             0.8           0.5     0.1
-lsubsectimethresh   0.8           0.5     0.5 (no less)
-lsubsecsevthresh    75            50      1
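A small helper can turn a chosen profile from Table 4 into a command line. The function and the PROFILES mapping below are illustrative Python (the parameter values mirror the FOS v7.0.x common parameters in Table 4; the helper itself is an assumption, not a Brocade tool).

```python
# Hypothetical helper mapping Table 4 profiles to a bottleneckmon command line.

PROFILES = {
    "conservative": {"-time": 300, "-qtime": 300, "-lthresh": 0.3, "-cthresh": 0.8},
    "normal":       {"-time": 60,  "-qtime": 60,  "-lthresh": 0.1, "-cthresh": 0.5},
    "aggressive":   {"-time": 5,   "-qtime": 1,   "-lthresh": 0.2, "-cthresh": 0.1},
}

def bottleneckmon_cmd(profile, ports="*"):
    params = " ".join(f"{k} {v}" for k, v in PROFILES[profile].items())
    return f"bottleneckmon --enable -alert {params} {ports}"

print(bottleneckmon_cmd("aggressive", "1/0-31"))
# bottleneckmon --enable -alert -time 5 -qtime 1 -lthresh 0.2 -cthresh 0.1 1/0-31
```

Per the recommendation above, one would start with the aggressive profile and move toward normal or conservative if alert volume is too high.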


APPENDIX B: CONFIGURING BROCADE FABRIC WATCH PORT FENCING


The portFencing CLI command is used to enable error reporting for the Brocade Fabric Watch Port Fencing feature. Once enabled, all ports of a specified type may be configured to report errors for one or more areas. Supported port types include E_Ports, F_Ports, and physical ports. Port Fencing monitors ports for erratic behavior and disables a port if specified error conditions are met. The portFencing CLI command enables or disables the Port Fencing feature for an area of a class. You can customize or tune the threshold of an area using the portThConfig CLI command.

Use portFencing to configure Port Fencing for C3 transmit timeout events. For example:

portfencing --enable fop-port -area C3TX_TO

You can use the same command to configure Port Fencing on link reset. For example:

portfencing --enable fop-port -area LR

Use the portThConfig command to customize Port Fencing thresholds:

switch:admin> portthconfig --set port -area crc -highthreshold -value 2 -trigger above -action email
switch:admin> portthconfig --set port -area crc -highthreshold -trigger below -action email
switch:admin> portthconfig --set port -ar crc -lowthreshold -value 1 -trigger above -action email
switch:admin> portthconfig --set port -ar crc -lowthreshold -trigger below -action email

To apply the new custom settings so they become effective:

switch:admin> portthconfig --apply port -area crc -action cust -thresh_level custom

To display the port threshold configuration for all port types and areas:

switch:admin> portthconfig --show

Port Fencing Threshold Recommendations


Port Fencing threshold recommendations for Link Resets, State Changes, and transmit timeouts are shown in Table 5.

Table 5. Recommended Port Fencing Thresholds

Area          Recommended Threshold Value for Port Fencing
Link Reset    5
State Change  7
TX_TO         5

CRC errors and Invalid Words can occur on normal links. These errors have also been known to occur during certain transitions such as server reboots. When these errors occur more frequently, they can cause a severe impact. While most systems can tolerate infrequent CRC errors or Invalid Words, other environments may be sensitive to even infrequent instances. The overall quality of the fabric interconnects is also a factor. When establishing thresholds for CRC errors and Invalid Words, consider that, in general, cleaner interconnects can have lower thresholds, as they should be less likely to introduce errors on the links. Moderate (recommended), conservative, and aggressive threshold recommendations are provided in Table 6. After selecting the type of thresholds for an environment, set the low threshold with an action of Alert (RASlog, e-mail, SNMP trap). The alert is triggered whenever the low threshold is exceeded. Set the high threshold with an action of Fence. The port is fenced (disabled) whenever the high threshold is detected. Aggressive threshold suggestions do not include settings for low; instead, they have only the high values to trigger fencing action.
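The low-alert/high-fence strategy described above can be sketched as a simple two-tier check. This is illustrative Python with assumed names; the default numbers follow the moderate/recommended CRC thresholds in Table 6.

```python
# Illustrative sketch of the two-tier Port Fencing strategy: the low threshold
# only alerts (RASlog / e-mail / SNMP trap); the high threshold fences
# (disables) the port.

def evaluate_port(crc_errors_per_min, low=5, high=20):
    """Return the action for a port given its CRC error rate.
    Defaults follow the moderate/recommended CRC thresholds in Table 6."""
    if crc_errors_per_min >= high:
        return "fence"   # disable the port
    if crc_errors_per_min >= low:
        return "alert"   # RASlog / e-mail / SNMP trap
    return "none"

print(evaluate_port(3))   # none
print(evaluate_port(8))   # alert
print(evaluate_port(25))  # fence
```

For an aggressive posture, per the text, the low tier would be omitted and only the fencing threshold configured.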


Table 6. Recommended, Aggressive, and Conservative Port Fencing Thresholds

Area          Moderate/Recommended   Aggressive   Conservative
CRC           Low 5 / High 20        High 20      Low 5 / High 40
Invalid Word  Low 25 / High 40       High 25      Low 25 / High 80


APPENDIX C: SAMPLE FRAME VIEWER SESSION


Frames discarded due to a Hold Time timeout are sent to the CPU for processing. During subsequent CPU processing, information about the frame, such as SID, DID, and transmit port number, is retrieved and logged. This information is maintained for a certain fixed number of frames. Frame Viewer captures only FC frames that are dropped due to a timeout on an Edge ASIC (an ASIC with front-end ports). If a frame is dropped for any other reason, or is dropped due to a timeout in an Rx buffer on a Core ASIC, it is not captured by Frame Viewer. A timeout is defined as a frame that lives in an Rx buffer for longer than the default Hold Time of 500 ms or the custom Edge Hold Time setting. The framelog CLI command is provided to retrieve and display this information. Figure 3 provides an overview of the collection process for a bladed director.
[Figure: A bladed director with a Core ASIC (back-end ports only) and Edge ASICs (with front-end ports). A frame enters front-end port 1/23 and is destined out of front-end port 10/29. A frame dropped due to timeout in a Core ASIC Rx buffer (reported as -1/-1) is NOT captured by Frame Viewer; a frame dropped due to timeout in an Edge ASIC Rx buffer is captured.]
Figure 3. Frame Viewer capture capability for a director or switch with multiple ASICs

NOTE: If the switch is a single-ASIC switch, such as an embedded switch or a Brocade 300 Switch, Brocade 5100 Switch, Brocade 6505 Switch, Brocade 6510 Switch, and so on, there are no Core ASICs or back-end ports, and Frame Viewer captures frames dropped due to timeout. The number of frames captured depends on available switch resources. A Core ASIC has only back-end ports and UltraScale Inter-Chassis Link (ICL) ports. If a frame is dropped and captured by Frame Viewer, it displays the frame (FC header and payload) with a timestamp of when the frame was dropped.


framelog --show -n 1200:

=============================================================================================
Wed Dec 28 08:51:02 EST 2012
=============================================================================================
Log              TX     RX
timestamp        port   port   SID       DID       SFID  DFID  Type     Count
=============================================================================================
Dec 19 11:37:00  -1/-1  10/29  0x01dd40  0x018758  129   129   timeout  8
Dec 19 11:37:00  -1/-1  1/29   0x018d40  0x01874b  129   129   timeout  8
Dec 19 11:37:00  -1/-1  12/5   0x017500  0x018758  129   129   timeout  8
Dec 19 11:37:00  -1/-1  10/5   0x015500  0x018758  129   129   timeout  8
Dec 19 11:37:00  -1/-1  3/5    0x012500  0x01874b  129   129   timeout  6
Dec 19 11:37:00  -1/-1  3/5    0x012500  0x018541  129   129   timeout  4
Dec 19 11:37:00  -1/-1  1/5    0x010500  0x01874b  129   129   timeout  12
Dec 19 11:37:00  1/23   -1/-1  0x01dd40  0x018758  128   128   timeout  4
Dec 19 11:37:00  1/23   -1/-1  0x015500  0x01874b  128   128   timeout  2
Dec 19 11:37:00  1/23   -1/-1  0x012500  0x018758  128   128   timeout  4
Dec 19 11:37:00  1/23   -1/-1  0x010500  0x018758  128   128   timeout  10
Dec 19 11:36:59  -1/-1  3/5    0x012500  0x01874b  129   129   timeout  8
Dec 19 11:30:51  -1/-1  10/29  0x01dd40  0x01874b  129   129   timeout  8
Dec 19 11:30:51  -1/-1  1/29   0x018d40  0x01874b  129   129   timeout  8
Dec 19 11:30:51  -1/-1  10/5   0x015500  0x018756  129   129   timeout  8
Dec 19 11:30:51  -1/-1  3/5    0x012500  0x01874b  129   129   timeout  8
Dec 19 11:30:51  -1/-1  1/5    0x010500  0x01874b  129   129   timeout  8
Dec 19 11:30:50  1/23   -1/-1  0x01dd40  0x018756  128   128   timeout  6
Dec 19 11:30:50  1/23   -1/-1  0x018d40  0x018756  128   128   timeout  8
Dec 19 11:30:50  1/23   -1/-1  0x012500  0x018756  128   128   timeout  6

Notes:
1. TX Port is the port that discarded the frame.
2. SID is the source Port ID (PID).
3. DID is the destination PID.
4. -1/-1 in a port column refers to a back-end (BE) port.
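When triaging output like the above, it is often useful to aggregate discard counts per flow. The sketch below is a hypothetical Python parser; the whitespace-delimited line format is assumed from the sample output, not from a formal specification.

```python
# Hypothetical parser for framelog --show data lines, aggregating timeout
# discard counts per (TX port, SID, DID) flow.

from collections import Counter

def summarize_framelog(lines):
    totals = Counter()
    for line in lines:
        fields = line.split()
        # layout: Mon DD HH:MM:SS TXport RXport SID DID SFID DFID Type Count
        tx_port, sid, did = fields[3], fields[5], fields[6]
        totals[(tx_port, sid, did)] += int(fields[-1])
    return totals

sample = [
    "Dec 19 11:37:00 -1/-1 10/29 0x01dd40 0x018758 129 129 timeout 8",
    "Dec 19 11:37:00 1/23 -1/-1 0x01dd40 0x018758 128 128 timeout 4",
    "Dec 19 11:30:50 1/23 -1/-1 0x01dd40 0x018756 128 128 timeout 6",
]
worst = summarize_framelog(sample).most_common(1)[0]
print(worst)  # (('-1/-1', '0x01dd40', '0x018758'), 8)
```

Sorting flows by discard count quickly points at the SID/DID pairs most affected by a slow-drain condition.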


APPENDIX D: EDGE HOLD TIME (EHT)


Introduction to EHT
Edge Hold Time (EHT) is a FOS capability that allows an overriding value for Hold Time (HT). Hold Time is the amount of time a Class 3 frame may remain in a queue, while waiting for credit to be given for transmission, before being dropped. The default HT is calculated from the RA_TOV, ED_TOV, and maximum hop count values configured on a switch. When using the standard 10 seconds for RA_TOV, 2 seconds for ED_TOV, and a maximum hop count of 7, a Hold Time value of 500 ms is calculated.

Extensive field experience has shown that when high latencies occur on even a single initiator or device in a fabric, not only does the F_Port attached to that device see Class 3 frame discards, but the resulting back pressure due to the lack of credit can build up in the fabric and cause frames of other flows, not directly related to the high-latency device, to be discarded at ISLs. Edge Hold Time can be used to reduce the likelihood of this back pressure into the fabric by assigning a lower Hold Time value only for edge ports (initiators or devices). The lower EHT value ensures that frames are dropped at the F_Port where the credit is lacking before the higher default Hold Time used at the ISLs expires, allowing those frames to begin moving again. This localizes the impact of a high-latency F_Port to the single edge where it resides and prevents it from spreading into the fabric and impacting other, unrelated flows.

Like Hold Time, Edge Hold Time is configured for the entire switch and is not configurable on individual ports or ASICs. Whether the EHT or HT value is used on a port depends on the particular platform and ASIC, the type of port, and the other ports that reside on the same ASIC. This behavior is described in further detail in the following sections.
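The 500 ms figure above can be checked numerically. The exact formula is an assumption here (verify against the Fabric OS Administrator's Guide); the form below is one that reproduces the quoted value from the standard timeouts and hop count.

```python
# Illustrative Hold Time calculation (assumed formula, chosen to match the
# 500 ms value quoted in the text for the standard RA_TOV/ED_TOV/hop values).

def hold_time_ms(ra_tov_ms, ed_tov_ms, max_hops):
    return (ra_tov_ms - ed_tov_ms) / (2 * (max_hops + 1))

# Standard values: RA_TOV 10 s, ED_TOV 2 s, maximum hop count 7 -> 500 ms.
print(hold_time_ms(10_000, 2_000, 7))  # 500.0
```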

Supported Releases and Licensing Requirements


EHT was introduced in FOS v6.3.1b and is supported in FOS v6.3.2x, v6.4.0x, v6.4.1x, v6.4.2x, v6.4.3x, and all v7.x releases. Some behaviors have changed in later releases, as noted in the following sections. No license is required to configure the Edge Hold Time setting. Edge Hold Time must be explicitly enabled in all supporting FOS v6.x releases; in FOS v7.0 and later, EHT is enabled by default.

Behavior
8G Platforms and the Brocade 48000
On the 48000 and all 8G platforms, including the DCX/DCX-4S, Hold Time is an ASIC-level setting that is applied to all ports on the same ASIC chip. If any single port on the ASIC chip is an F_Port, the alternate EHT value is programmed into the ASIC, and all ports (E_Ports and F_Ports) use this one value. If all ports on the ASIC chip are E_Ports, the entire ASIC is programmed with the default Hold Time value (500 ms). When Virtual Fabrics is enabled on an 8G switch, the programming of the ASIC remains at the ASIC level: if any single port on the ASIC is an F_Port, regardless of which Logical Switch it resides in, the alternate EHT value is programmed into the ASIC for all ports in all Logical Switches, regardless of port type.


For example: If one ASIC has five ports assigned to Logical Switch 1 comprised of four F-Ports and one E-Port, and this same ASIC has five ports assigned to Logical Switch 2 comprised of all E-Ports, the EHT value will be programmed into all five ports in Logical Switch 1 and also all five ports in Logical Switch 2. The programming of EHT is at the ASIC level and is applied across Logical Switch boundaries. When using Virtual Fabrics, the EHT value configured into the Base Switch is the value that will be used for all Logical Switches.

Gen 5 Platforms
All Brocade Gen 5 (16G) platforms are capable of setting the Hold Time value on a port-by-port basis for ports resident on Gen 5 ASICs:

- All F_Ports are programmed with the alternate Edge Hold Time.
- All E_Ports are programmed with the default Hold Time value (500 ms).

The same EHT value set for the switch is programmed into all F_Ports on that switch; different EHT values cannot be programmed on an individual port basis. If 8G blades are installed in a Gen 5 platform (for example, an FC8-64 blade in a DCX 8510), the behavior of EHT on the 8G blades is the same as described for 8G platforms above: the same EHT value is programmed into all ports on the ASIC. If any single port on an ASIC is an F_Port, the alternate EHT value is programmed into the ASIC, and all ports (E_Ports and F_Ports) use this one value. If all ports on an ASIC are E_Ports, the entire ASIC is programmed with the default Hold Time value (500 ms).

When deploying Virtual Fabrics with FOS versions 7.0.0x, 7.0.1x, or 7.0.2x, the EHT value configured in the Default Switch is the value used for all Logical Switches. Starting with FOS v7.1.0, a unique EHT value can be independently configured for each Logical Switch on Gen 5 platforms. 8G blades installed in a Gen 5 platform continue to use the value configured for the Default Logical Switch for all ports on those blades, regardless of which Logical Switches those ports are assigned to.
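The per-port (Gen 5) versus per-ASIC (8G) programming rules described above can be contrasted in a short model. This is illustrative Python with assumed names, not actual firmware behavior.

```python
# Illustrative model of EHT programming: Gen 5 ASICs set Hold Time per port;
# 8G ASICs apply one value to every port on the chip if any port on that
# chip is an F_Port.

DEFAULT_HT_MS = 500

def program_hold_times(port_types, eht_ms, gen5=True):
    """port_types: list of 'F' or 'E' for the ports on one ASIC."""
    if gen5:
        # per-port: F_Ports get EHT, E_Ports keep the default 500 ms
        return [eht_ms if t == "F" else DEFAULT_HT_MS for t in port_types]
    # 8G: one ASIC-wide value; EHT wins if any port on the chip is an F_Port
    asic_value = eht_ms if "F" in port_types else DEFAULT_HT_MS
    return [asic_value] * len(port_types)

ports = ["F", "F", "E", "E"]
print(program_hold_times(ports, 220, gen5=True))   # [220, 220, 500, 500]
print(program_hold_times(ports, 220, gen5=False))  # [220, 220, 220, 220]
```

The contrast explains why mixing E_Ports and F_Ports on an 8G ASIC causes ISLs on that chip to run with the lower EHT value.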

Default EHT Settings


The default setting used for Edge Hold Time (EHT) is pre-loaded into the switch at the factory, based on the version of FOS installed:

Table 7. Factory Default EHT Settings

Factory-Installed Version of FOS    Default EHT Value
Any version of FOS 7.x              220 ms
FOS 6.4.3x                          500 ms
FOS 6.4.2x                          500 ms
FOS 6.4.1x                          220 ms
FOS 6.4.0x                          500 ms
Any version prior to FOS 6.4.0      500 ms

The default setting can be changed using the configure command. The EHT can be changed without having to disable the switch and will take effect immediately after being set.


When using the configure command to set EHT, a suggested EHT value will be provided. If the user accepts this suggested setting by pressing <enter>, then this suggested value will become the new value for EHT on the switch. The suggested value will be the value that was set during the previous time the configure command was run, even if the user just pressed the <enter> key when encountering this configuration parameter. If the configure command has never been run before, and thus the default value is what is currently set in the system, then the suggested value shown will be as follows: Table 8. Suggested EHT Settings for Various FOS Releases
FOS Version Currently on Switch     Suggested EHT Value When configure Has Not Been Run Previously
Any version of FOS 7.x              220 ms
FOS 6.4.3x                          500 ms
FOS 6.4.2x                          500 ms
FOS 6.4.1x                          220 ms
FOS 6.4.0x                          500 ms
Any version prior to FOS 6.4.0      500 ms

Note that the suggested value shown when running the configure command may not be the same as the default value currently running in the system. This is because the default EHT value is set based on the FOS version that was installed at the factory, while the suggested EHT value is based on the FOS version currently running in the system and on whether or not the configure command has ever been run in the past.

Once set by the configure command, the EHT value is maintained across firmware upgrades, power cycles, and HA fail-over operations. This is true for all versions of FOS.

The behavior of EHT has evolved over several FOS releases, as shown in the three examples below.

Example (FOS 6.X):

    sw0:FID128:admin> configure
    Not all options will be available on an enabled switch.
    To disable the switch, use the switchDisable command.
    Configure...
      Fabric parameters (yes, y, no, n): [no] y
        Configure edge hold time (yes, y, no, n): [no] y
          Edge hold time: (100..500) [500]
      System services (yes, y, no, n): [no]

Example (FOS 7.0.X):

    sw0:FID128:admin> configure
    Not all options will be available on an enabled switch.
    To disable the switch, use the switchDisable command.
    Configure...
      Fabric parameters (yes, y, no, n): [no] y
        Edge Hold Time (0 = Low(80ms), 1 = Medium(220ms), 2 = High(500ms)): (0..2) [1]
      System services (yes, y, no, n): [no]
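The distinction between the running default and the suggested value can be summarized in a small sketch. This is an illustrative model only; the function name, the version keys, and the `last_configured_ms` parameter are invented for this example and are not FOS interfaces.

```python
# Illustrative model of the configure dialog's suggested-EHT behavior.
# Version-to-default mapping taken from Tables 7 and 8 of this brief.
VERSION_DEFAULT_MS = {
    "7.x": 220,
    "6.4.3x": 500,
    "6.4.2x": 500,
    "6.4.1x": 220,
    "6.4.0x": 500,
    "pre-6.4.0": 500,
}


def suggested_eht(running_version: str, last_configured_ms=None) -> int:
    """The suggestion is whatever was set the last time configure ran.
    If configure has never been run, it falls back to the default for
    the FOS version currently running (not the factory-installed one)."""
    if last_configured_ms is not None:
        return last_configured_ms
    return VERSION_DEFAULT_MS[running_version]


# A switch shipped with pre-6.4.0 firmware (factory default 500ms) that
# was later upgraded to FOS 7.x suggests 220ms, even though 500ms may
# still be the value running in the system.
print(suggested_eht("7.x"))        # configure never run: suggests 220
print(suggested_eht("7.x", 500))   # configure run previously: suggests 500
```

This captures why the suggested value and the active default can disagree after a firmware upgrade on a switch where configure was never run.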


Example (FOS 7.0.2 and higher):

    sw0:FID128:admin> configure
    Not all options will be available on an enabled switch.
    To disable the switch, use the switchDisable command.
    Configure...
      Fabric parameters (yes, y, no, n): [no] y
        Edge Hold Time in ms (80(Low), 220(Medium), 500(High), 80-500(UserDefined)): (80..500) [220]
      System services (yes, y, no, n): [no]

Recommended Settings
Edge Hold Time does not need to be set on core switches that carry only ISLs; these will simply use the standard Hold Time setting of 500ms.

Recommended values for platforms containing initiators and targets depend on the deployment strategy: end users typically either place initiators and targets on separate switches or mix initiators and targets on the same switch.

A frame drop has more significance for a target than for an initiator, because target ports typically communicate with multiple initiators. Frame drops on target ports usually result in SCSI transport error messages being generated in server logs, and multiple frame drops from the same target port can affect multiple servers in what appears to be a random fabric or storage problem. Since the source of the error is not obvious, time can be wasted determining the cause of the problem. Extra care should therefore be taken when applying EHT to switches where targets are deployed.

The most common recommended value for EHT is 220ms. The lowest EHT value of 80ms should only be configured on edge switches comprised entirely of initiators; this lowest value is recommended for fabrics that are well maintained and where a more aggressive monitoring and protection strategy is being deployed.
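The guidance above reduces to a simple decision by switch role. The helper below is a hypothetical sketch encoding these recommendations; its name and parameters are illustrative and not part of any Brocade tooling.

```python
# Illustrative decision helper for the EHT recommendations in this brief.
def recommended_eht_ms(isls_only: bool, has_targets: bool,
                       aggressive_edge: bool = False) -> int:
    """Map a switch's role to the recommended Edge Hold Time."""
    if isls_only:
        # Core switch with only ISLs: keep the standard 500ms Hold Time.
        return 500
    if not has_targets and aggressive_edge:
        # Initiator-only edge switch in a well-maintained fabric with an
        # aggressive monitoring/protection strategy.
        return 80
    # Most common recommendation, including any switch hosting targets.
    return 220


print(recommended_eht_ms(isls_only=True, has_targets=False))    # core: 500
print(recommended_eht_ms(isls_only=False, has_targets=True))    # targets: 220
print(recommended_eht_ms(False, False, aggressive_edge=True))   # initiators: 80
```

The key design point mirrored here is the asymmetry between initiators and targets: any switch hosting target ports defaults to the conservative 220ms rather than 80ms.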

2013 Brocade Communications Systems, Inc. All Rights Reserved. 07/13 GA-BP-477-00 ADX, AnyIO, Brocade, Brocade Assurance, the B-wing symbol, DCX, Fabric OS, ICX, MLX, MyBrocade, OpenScript, VCS, VDX, and Vyatta are registered trademarks, and HyperEdge, The Effortless Network, and The On-Demand Data Center are trademarks of Brocade Communications Systems, Inc., in the United States and/or in other countries. Other brands, products, or service names mentioned may be trademarks of their respective owners. Notice: This document is for informational purposes only and does not set forth any warranty, expressed or implied, concerning any equipment, equipment feature, or service offered or to be offered by Brocade. Brocade reserves the right to make changes to this document at any time, without notice, and assumes no responsibility for its use. This informational document describes features that may not be currently available. Contact a Brocade sales office for information on feature and product availability. Export of technical data contained in this document may require an export license from the United States government.
