TABLE OF CONTENTS
OVERVIEW
THE ANATOMY OF A SCSI OPERATION
WALKING DOWN THE PROTOCOL STACK
WHEN IOD MATTERS, AND WHEN IT DOESN’T
EXCHANGE-BASED LOAD BALANCING (DPS)
COMPARING SCSI RESULTS TO FC TRAFFIC GENERATORS
CONCLUSION: BUYER BEWARE
OVERVIEW
The Fibre Channel (FC) standards require that fabrics deliver frames in the same order that
they entered the fabric. That is, a network of FC switches should act like a FIFO buffer, from
the point of view of attached devices. This requirement is called “in-order delivery”, or IOD
for short. However, the language in the standards describing IOD is ambiguous, and so there
is confusion about what IOD means. There are situations in which out-of-order delivery (OOD)
is allowable, and even desirable. In other cases, OOD may have a strongly negative impact
on applications attached to an FC Storage Area Network (SAN).
To analyze fabric IOD requirements, it is necessary to consider which traffic patterns require
delivery order to be maintained, and why this is required in the first place. In most
environments, FC networks are used to carry SCSI traffic.[1] In this context, IOD means “all FC
frames within a particular SCSI operation must be delivered in order with respect to each
other.” FC frames in different, unrelated SCSI operations do not need to be delivered in
order relative to each other. In fact, it may be desirable to prioritize one SCSI operation over
another: that is the purpose of QoS and similar features. In that case, the entire point is to
make sure that higher priority frames are delivered ahead of lower priority traffic, regardless
of which entered the fabric first, which by its nature is OOD.
This paper will clarify what IOD means in an FC fabric, so readers can identify situations in
which it is meaningful and required vs. optional or even undesirable.
[1] The examples in this paper do not apply directly to FC networks used for server-server connectivity, e.g. using IP over FC (IPoFC), though analogous concepts apply. Also, this applies to open systems SANs running SCSI as the upper layer protocol (FCP) rather than FICON environments.
THE ANATOMY OF A SCSI OPERATION
A SCSI write “conversation” consists of an initiator (e.g. a server) “talking” to a target (e.g. a
storage array). Conceptually, the conversation goes something like this:
1. Initiator: “Can I send you x bytes of data for block y?” (Command.)
2. Target: “Yeah, OK.” (Transfer ready.)
3. Initiator: [Data Data Data Data…] (Payload.)
4. Target: “Got it! Thanks!” (Status.)
– repeat for next operation; continue as needed –
This is a simplification, as there can be additional elements to the conversation, but this
does reflect the general structure of a SCSI write. Step #1 is a write request. The initiator
asks permission of the target to send data. Next, the target gives permission. The third step
is the transfer of data. This step varies in size, depending on factors such as filesystem
block size and SCSI “logical block” size. Finally, once the target has received all the data and
verified its integrity, it will send an acknowledgement to the initiator that the operation
completed successfully. This is illustrated in Figure 1.
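The four-step conversation above can be sketched in code. This is a minimal illustration of the message flow only, not real FCP: the message names (WRITE_CMD, XFER_RDY, STATUS) and the Target class are invented for the example.

```python
# Minimal sketch of the SCSI write "conversation" above. Message names
# are illustrative, not actual FCP frame formats.

class Target:
    """Toy storage target that grants writes and checks payload length."""
    def __init__(self):
        self.blocks = {}     # block id -> stored payload
        self.pending = {}    # block id -> promised byte count

    def command(self, cmd, block_id, length=0):
        if cmd == "WRITE_CMD":               # step 1: "Can I send you x bytes?"
            self.pending[block_id] = length
            return "XFER_RDY"                # step 2: "Yeah, OK."
        if cmd == "STATUS":                  # step 4: did everything arrive?
            ok = len(self.blocks.get(block_id, b"")) == self.pending.get(block_id)
            return "GOOD" if ok else "ERROR"

    def data(self, block_id, payload):       # step 3: the payload itself
        self.blocks[block_id] = payload

def scsi_write(target, block_id, payload):
    """Run one write operation; returns True on success."""
    if target.command("WRITE_CMD", block_id, len(payload)) != "XFER_RDY":
        return False
    target.data(block_id, payload)
    return target.command("STATUS", block_id) == "GOOD"

t = Target()
print(scsi_write(t, block_id=7, payload=b"x" * 4096))   # True
```

Note that the initiator does not send the payload until the target has granted permission, and does not consider the operation complete until the final status arrives; this handshaking is what makes block-to-block ordering easy to guarantee, as discussed below.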
WALKING DOWN THE PROTOCOL STACK
Figure 2 - SCSI-to-FC Mapping
First, an application generates data, perhaps in the form of one or more files, and sends it
“down” to a filesystem.[2] The filesystem then maps that data (e.g. files) onto blocks,[3] e.g.
using an inode table, File Allocation Table (FAT), or some analogous mechanism. Next, the
filesystem sends data down to the SCSI layer. At this point, objects meaningful to the
application (such as one or more files) have been converted into objects meaningful to
storage subsystems (such as one or more SCSI blocks). The “SCSI write conversation”
described in the previous section now takes place, usually one time per SCSI block. The
SCSI layer handles the “commands” previously described.
The number of blocks per file is determined by file size and filesystem block size. That is,
“number of blocks” = “file size” / “filesystem block size”, rounded up to the nearest whole
number. In the case of unstructured data, a common block size is 4 kilobytes; at that size, a 1
megabyte file would have 256 SCSI blocks. For tuned high performance large file
applications, block sizes may be four times that or larger. In backup and asynchronous
replication applications, block sizes may exceed 100 megabytes.
[2] In the case of a database accessing a “raw” device, the filesystem functions are still there; they are just performed by the database application rather than an OS driver. Similarly, analogous functions are performed by backup software when writing to tape drives. One way or another, something at this layer has to keep track of where each SCSI block is located, and what upper-level object (e.g. file or database record) is being stored there.
[3] The term “block” is sometimes ambiguous. Block storage may refer to the blocks carried by the disk drive itself. The “LBA” field in a SCSI command refers to this kind of block, which is variable length but most typically 512 bytes. Unfortunately, the same word is also used to define objects containing many SCSI blocks which are transferred as a unit. In the context of networking, the latter is the relevant meaning. The SAN doesn’t really care how the data is organized once it “lands” on a disk; the important concept is how the data is grouped for transfer across the FC fabric. Thus, examples in this paper don’t deal with LBA-style “blocks”.
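To make the arithmetic concrete, the blocks-per-file calculation can be sketched as follows. The function name is ours, not part of any standard; the sizes match the examples in the text.

```python
import math

def blocks_per_file(file_size, fs_block_size):
    """Number of SCSI blocks = file size / filesystem block size, rounded up."""
    return math.ceil(file_size / fs_block_size)

KB, MB = 1024, 1024 * 1024
print(blocks_per_file(1 * MB, 4 * KB))    # unstructured data, 4k blocks: 256
print(blocks_per_file(1 * MB, 16 * KB))   # tuned large-file app, 16k blocks: 64
print(blocks_per_file(4097, 4096))        # rounding up: a partial block counts: 2
```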
If the application in this example were using directly-attached storage, the blocks would be
written onto disk or tape at this point. However, in a SAN, the blocks have to be mapped
onto FC for transportation to the storage subsystem. In this case, the SCSI layer hands off
each block, in order, to an FC Host Bus Adaptor (HBA). If the blocks are related to each other,
the SCSI layer will generally not hand off “block #2” until it gets an acknowledgement that
“block #1” is complete, so ordering from one block to the next is guaranteed. In addition, the
protocol headers contain enough information to uniquely identify the blocks in question, so
if one block did show up out of order, the SCSI layer would handle it. Once a block is handed
off to an HBA, it is the HBA’s job to convert the block into one or more FC frames. See Figure
3 and Figure 4.[4]
[4] Theoretically, the payload portion of a frame could be smaller. However, there isn’t any utility for this, and as a practical matter the smallest frames are on the order of about 64 bytes total, with about half being payload and half overhead. Similarly, FC standards allow frames up to 2112 bytes, but most fabrics use 2048 (2k).
The number of FC frames per block in the data transfer phase follows a similar formula to
the blocks-per-file calculation. Generally, “number of FC frames” = “block size” / “frame
size”, rounded up. However, in the real world, this calculation can be simplified a bit. The FC
frame size for the payload portion of a SCSI write is pretty much always 2 kilobytes. Smaller
FC frame sizes are only used for commands, not for the payload. Therefore the number of FC
payload frames is actually “block size in kilobytes” / 2. For filesystems with unstructured
data, this usually means two FC frames per block, since 4k is the most common block size in
that arena. For larger block sizes, such as used in asynchronous replication, it could mean
thousands of payload frames per block. It’s difficult to know the industry-wide average, but it
would be somewhere between those two extremes.
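The frames-per-block formula simplifies accordingly. The sketch below uses the 2-kilobyte payload frame size described in the text; the function name is ours.

```python
import math

FC_PAYLOAD_FRAME = 2 * 1024   # payload frames of a SCSI write: essentially always 2k

def payload_frames_per_block(block_size):
    """Number of FC payload frames = block size / frame size, rounded up."""
    return math.ceil(block_size / FC_PAYLOAD_FRAME)

KB, MB = 1024, 1024 * 1024
print(payload_frames_per_block(4 * KB))     # common filesystem block: 2 frames
print(payload_frames_per_block(100 * MB))   # async replication block: 51200 frames
```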
[5] This worst case is generally only a theoretical concern, since disk subsystems detect OOD and reject the entire SCSI operation. In the real world, the application generally would get a SCSI retry, rather than corrupted data. That said, the potential for corruption is there, and even a retry isn’t desirable.
WHEN IOD MATTERS, AND WHEN IT DOESN’T
Figure 5 - IOD for Unrelated Initiator/Target Flows
The first, most intuitive way for the ISL to interleave the streams is to send an A frame, then
a B frame, then the second A frame, and so on. This is shown on the first line of the lower
portion of the figure. This way, all A frames are delivered through the fabric in the order in
which they were sent, as are all B frames. In addition, the A-2 and B-2 frames are delivered
after both A-1 and B-1, which means that the streams maintain IOD relative to each other.
They don’t maintain simultaneity, as both B frames arrive slightly later than the
corresponding A frames. However, the difference is too small to worry about, and in any case
isn’t avoidable when sending parallel streams over a serial link.
Another option would be to transmit all A frames, then all B frames. This is shown on the
second line of the figure. In this case, all A frames are delivered in the order they were sent,
as are the B frames. But in this case, B-1 is delivered after A-2: the two streams do not
maintain their IOD relationship relative to each other.
Since FC fabrics must guarantee in-order delivery, this probably sounds like a problem. It
isn’t, because the out of order condition exists between unrelated streams. In fact, this is
exactly what QoS is supposed to accomplish. If one application (say, Host A) is more
important than another (Host B) then the SAN administrator might put them into different
QoS priority groups. In that case, fabric links are supposed to send A frames before sending
B frames, regardless of the order in which they entered the fabric. OOD is not only allowable;
it is desirable. FC switches can tell that OOD is allowable because the flows have different
endpoints. That is, the SID and DID fields in the frame headers are different.
The last line in the figure illustrates an unallowable OOD scenario. In this case, frame A-2 is
delivered ahead of A-1, and B-2 is delivered ahead of B-1. Remember, in this example A-1
and A-2 are part of the same SCSI operation. This means that the various “parts” of the SCSI
block are arriving out of order. At minimum, this can result in a SCSI retry.
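The three interleavings just described can be distinguished by a simple per-stream ordering check: a delivery order is acceptable as long as each stream’s own frames stay in sequence, regardless of how the streams are interleaved with one another. This is a sketch of that check, not any switch’s actual logic.

```python
def stream_iod_ok(delivery):
    """delivery: list of (stream, seq) tuples in arrival order.

    Returns True if every stream's frames arrive in increasing sequence
    order, i.e. IOD holds within each stream (interleaving is irrelevant).
    """
    last = {}
    for stream, seq in delivery:
        if seq <= last.get(stream, 0):
            return False          # a frame of this stream arrived out of order
        last[stream] = seq
    return True

interleaved  = [("A", 1), ("B", 1), ("A", 2), ("B", 2)]  # first line of the figure
back_to_back = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]  # second line (QoS-style)
reordered    = [("A", 2), ("A", 1), ("B", 2), ("B", 1)]  # last line (unallowable)

print(stream_iod_ok(interleaved))    # True
print(stream_iod_ok(back_to_back))   # True: only cross-stream order changed
print(stream_iod_ok(reordered))      # False: A-2 arrived before A-1
```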
What if A-1 and A-2 had not been part of the same SCSI operation? If frames come into the
fabric from the same host and are destined for the same storage port, but have a different
exchange ID (OXID) in the frame header, switches will “know” that they are parts of different
SCSI operations and thus IOD is not applicable. If ordering had mattered to the application,
it wouldn’t have even generated the second SCSI operation until after the first had
completed. In that case, all three examples are allowable.
The bottom line is that, in a typical-case open-systems SAN, FC in-order delivery is only
relevant between frames which have identical values in the SID, DID, and OXID fields.
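This bottom line can be illustrated as a path-selection function: if a switch hashes the (SID, DID, OXID) triple to choose among equal-cost paths, then frames that share all three fields always take the same path, preserving order exactly where IOD is required, while unrelated exchanges may be spread across paths. This is a hypothetical sketch; real switch ASICs use their own hash functions, and the CRC32 here is purely illustrative.

```python
from zlib import crc32

def select_path(sid, did, oxid, num_paths):
    """Pick one of num_paths equal-cost paths from the (SID, DID, OXID) triple.

    Frames with identical SID/DID/OXID always hash to the same path, so
    IOD holds within each exchange; different exchanges may diverge.
    """
    # SID and DID are 24-bit FC addresses; OXID is a 16-bit exchange ID.
    key = sid.to_bytes(3, "big") + did.to_bytes(3, "big") + oxid.to_bytes(2, "big")
    return crc32(key) % num_paths

# All frames of one exchange follow one path...
p1 = select_path(0x010203, 0x040506, 0x1234, 4)
p2 = select_path(0x010203, 0x040506, 0x1234, 4)
print(p1 == p2)   # True

# ...while other exchanges between the same host and storage port spread out.
paths = {select_path(0x010203, 0x040506, oxid, 4) for oxid in range(64)}
print(sorted(paths))
```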
[6] “Equal cost paths” generally means the same number of hops as defined by the Fabric Shortest Path First (FSPF) protocol, and/or by manual settings such as Traffic Isolation Zones.
EXCHANGE-BASED LOAD BALANCING (DPS)
Figure 7 - Path for All Frames of the First Block
It makes no difference to a node whether OOD comes about from an ISL balancing algorithm
or a backplane trace balancing method. From an application perspective, there is no
meaningful distinction between switch-level behaviors and fabric link behaviors: IOD vs.
OOD is a fabric-level concept that does not vary based on the type of link. Therefore, if the
OXID balancing method is “correct” at a network level, then it is also correct – or at least
allowable – internally at a switch or director level.
Since Brocade link-balancing methods such as DPS and frame-level trunking are allowable
and work in networks, they are allowable and work inside switches and directors. But does
that mean they are desirable? Is a network-oriented algorithm such as DPS the “right”
way to balance the interior of a switch? That comes down to implementation quality.
Not all vendors are adept at building high performance, highly reliable, vastly scalable
networks. Brocade, however, does have a long history of doing this, so the methods Brocade
developed for balancing IO in large fabrics had to be robust. As a result, the network link-
balancing implementation in enterprise-class Brocade platforms is capable of balancing a
half-terabit per equal-cost path. (One terabit full-duplex.) In the context of director
architecture, the slot connectivity to the backplane is a “path” which needs to be balanced.
Currently, the highest performing director in the industry – the Brocade DCX – has a quarter
terabit of backplane connectivity per slot. Thus, the current implementation of Brocade
network link balancing is capable of running a balanced backplane at 2x the most stringent
requirement in the world for backplane balancing.
Brocade competitors, on the other hand, have drastically lower capabilities for network link
balancing. The nearest competitor can demonstrate at most 25% of Brocade’s performance.
So for Brocade, it is allowable and desirable to use DPS as part of the backplane balancing
method, because Brocade’s implementation of DPS is up to the task. For vendors who aren’t
good at balancing networks, it would still be allowable but not desirable, because it would
result in “hot spots” (unbalanced load) on their backplane.
Using network-oriented methods inside a network element is a good idea if those methods
are robust. If the network-oriented methods are weak, then one would need to create
proprietary methods for backplane balancing. When a vendor claims that DPS is correct for
balancing a network, but not correct for balancing a backplane, what they are really saying is
that their implementation is not good enough to balance a backplane.
COMPARING SCSI RESULTS TO FC TRAFFIC GENERATORS
In a typical test, each port on a frame shooter (FC traffic analyzer) generates frames
destined for another port on the same analyzer. Because of this, these tools are able to
collect frames at the other end of the fabric, and analyze the results in terms of
high-precision delivery time (latency), pipeline capacity (throughput), error and loss
rates, and more. In the right hands, they are optimal for conducting stress testing on fabrics,
and graphing the results.
However, frame shooters also have limits. They do not run “real” operating systems,
applications, or driver stacks. They are powerful tools when used correctly, but because they
lack upper-level behaviors, frame shooter results can also be misleading when used
incorrectly. Incorrect configurations can happen if the engineer conducting the test did not
know how to operate the analyzer, constructed the test plan to intentionally mislead an
audience, or did not understand the disconnect between the tester settings and the intended
production load. The next sections will clarify such cases.
GENERAL NETWORK TEST METHODOLOGY
Nobody deploys frame shooters in a production SAN. Frame shooter test results are
therefore not meaningful in and of themselves. The goal of using frame shooters is to
predict something about how a network will likely behave in a production context. The goal
of setting up a frame shooter test plan is therefore to make it close to the intended
production deployment; to make a strong connection between the lab test and the real
application. If there is no connection, the results will be meaningless at best, and may cause
SAN designers to make bad decisions about which network elements to use and how to
interconnect them.
To ensure that results are on point, when Brocade conducts network testing, a team of
personnel is involved in creating the plan. Engineers familiar with the equipment in question
make sure it is operated properly. Customer-facing personnel are involved to ensure that the
results are portrayed honestly, since disingenuous representations harm customer
relationships. Finally, senior architects review the plan to verify that the traffic pattern and
other settings reflect anticipated real-world scenarios.
Unfortunately, most companies are not as rigorous as Brocade. It is common for testing to
be conducted by marketing personnel and low-level engineers, without the supervision of senior
architects or customer-facing ombudsmen. Teams lacking the latter two components often
make errors, through misunderstanding or intent to deceive. As a result, outright falsified or
at least materially misleading test data can often be found quoted in marketing literature,
internet web sites, or even magazine articles. SAN designers are therefore required to have
a detailed understanding of test methods and results before they can predict anything about
their production environments from reading such literature.
One common mistake is that engineers may use a fundamentally unrealistic “flow pattern”
in the test plan. Basically, the flow pattern defines which ports “talk” to which other ports. It
also encompasses variables such as frame size and data rate. Using an incorrectly-
conceived flow pattern will guarantee that test results fail to predict real-world behaviors.
Say that an engineer had access to a ten-port FC frame shooter, and wanted to simulate the
load that was going to occur in a small SAN. Figure 9 illustrates three different ways the
engineer could configure the flow pattern, with alternate visualizations for two methods. On
the left, the “multipoint” pattern causes eight analyzer ports to send data to two other ports.
In the middle, the “paired port” pattern causes pairs of tester ports to talk to each other, but
not to any ports outside of the paired relationship. (E.g. port 1 will talk to port 2, but not to
ports 3 through 10. In the same way, port 3 will talk to 4, 5 to 6, and so on.) On the far right,
the “full mesh” pattern causes every tester port to talk to every other port.
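The three flow patterns can be generated programmatically, which makes their differences easy to see. This is a sketch based on the figure’s description; the exact multipoint shape (eight senders, two sinks) and the 1-based port numbering are assumptions from the text.

```python
from itertools import permutations

def paired_port(n):
    """Pairs of ports talk only to each other: 1<->2, 3<->4, and so on."""
    pairs = [(i, i + 1) for i in range(1, n, 2)]
    return pairs + [(b, a) for a, b in pairs]

def full_mesh(n):
    """Every tester port talks to every other port."""
    return list(permutations(range(1, n + 1), 2))

def multipoint(n, sinks=2):
    """The first n - sinks ports all send to the last few ports."""
    senders = range(1, n - sinks + 1)
    targets = range(n - sinks + 1, n + 1)
    return [(s, d) for s in senders for d in targets]

print(len(paired_port(10)))   # 10 flows: 5 pairs, both directions
print(len(full_mesh(10)))     # 90 flows: 10 * 9 ordered pairs
print(len(multipoint(10)))    # 16 flows: 8 senders * 2 sinks
```

The counts alone show why the patterns stress a fabric so differently: the paired port pattern never creates contention between flows, whereas the multipoint and full mesh patterns do, which is closer to how real SANs behave.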
The most typical manipulation as related to IOD is to configure a frame shooter for a paired
port flow pattern, and either disable DPS, or misconfigure the frame shooters’ OXID rotation
and sequence mapping behaviors. If DPS is off, Brocade still balances IO based on SID and
DID. In multi-point and full mesh tests, this is still up to 99% efficient. If a vendor wants to
concoct a deliberately misleading test result, the paired port test will be used in combination
with disabling DPS and/or misconfiguring the frame shooter because doing both at the
same time can result in a reduction in efficiency.
The fallacious reasoning in such manipulated testing goes like this: the misconfigured tester
incorrectly shows out-of-order delivery with DPS enabled. (Discussed in the next section.)
Therefore DPS must be turned off. Performance in paired port tests with DPS disabled is
inferior. Therefore Brocade performance is inferior. The flaws in this line of reasoning should
be obvious, but only to those who understand the flow pattern manipulation as well as the
manipulated IOD testing.
TESTING FOR IN-ORDER DELIVERY
Frame shooters do not have the upper layers of the stack, so the engineer setting up a test
must configure the analyzer to simulate these behaviors. Engineers unfamiliar with SANs, or
who intend to manipulate the results to deceive an audience, may configure the analyzer
incorrectly in this respect. Figure 10 shows stack elements found in a frame shooter, and
illustrates the parts which are different or missing vs. real SAN devices.
Compare this to Figure 2, and notice how the SCSI-block-to-FC-frame-to-OXID
relationship differs. (That’s the middle portion of the two figures.) Figure 2 shows how the
mapping is actually performed by real-world devices in a SAN. Figure 10 shows one possible
analyzer mapping, but analyzers can be configured for a variety of different methods of
generating OXID / sequence / frame relationships. Figure 10 shows an incorrect mapping,
which is why it differs from Figure 2.
In Figure 10, every frame gets a unique exchange ID. In the real world, even in the rare case
when the block size is set to 2k, there are at least four frames in the exchange: three control
frames and one payload frame. Figure 1 can be used to visualize this. Thus, in and of
itself, changing the OXID every frame causes the analyzer to disconnect from real-world
SANs, which limits the validity of the results.
When the analyzer is misconfigured in this way and DPS is enabled, each frame can go
down a different equal-cost FSPF path. This is shown in Figure 11. However, this
misconfiguration does not necessarily, by itself, cause the analyzer to report misleading
results. To accomplish that, the engineer setting up the test has to go one step further.
When the word “sequence” is used in the context of FC standards (as in “frames within a
sequence must be delivered in order, and failure to do so is an error”), the sequence referred
to is required to be contained within an exchange. To get an analyzer to report “out of
sequence” errors in a DPS-balanced system, the word “sequence” has to be redefined as
something that contains multiple exchanges: exactly the opposite of the FC standards’
meaning of the term. Figure 12 and Figure 13 illustrate this another way.
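The contrast between the two definitions can be sketched as a pair of checkers: one that follows the standards (order is checked within each OXID) and one that mimics the misconfigured analyzer (order is checked across all frames as if they formed one giant “sequence”). The frame tuples and function names are illustrative only.

```python
# Frames are (oxid, seq_cnt) tuples, listed in arrival order.

def order_errors_per_exchange(frames):
    """Standards-correct check: SEQ_CNT must increase within each OXID."""
    last = {}
    errors = 0
    for oxid, seq_cnt in frames:
        if oxid in last and seq_cnt <= last[oxid]:
            errors += 1
        last[oxid] = seq_cnt
    return errors

def order_errors_across_exchanges(frames):
    """Misconfigured check: treats all frames as one giant 'sequence'."""
    errors = 0
    prev = None
    for _oxid, seq_cnt in frames:
        if prev is not None and seq_cnt <= prev:
            errors += 1
        prev = seq_cnt
    return errors

# DPS may interleave exchanges 0x10 and 0x11, keeping each internally ordered.
arrivals = [(0x10, 1), (0x11, 1), (0x11, 2), (0x10, 2)]
print(order_errors_per_exchange(arrivals))     # 0: no real IOD violation
print(order_errors_across_exchanges(arrivals)) # 2: phantom "sequence errors"
```

The same arrival order that is perfectly legal under the standards-correct check produces multiple “errors” under the redefined check, which is precisely the effect the manipulated tests rely on.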
There are legitimate uses for such configurations: for example, debugging an FPGA chipset
prior to “burning” it into an ASIC, checking how the fabric might react to hypothetical
non-SCSI upper-layer workloads, or testing security features against hypothetical hacker
attack strategies. If the test equipment is being used for such a purpose, any arbitrary
settings are “allowable”. But when the equipment is used to make claims about how various
SAN devices will perform in production environments, configuring it in this way is either a
mistake or a deception.
It is also worth noting that all major vendors’ FC solutions will exhibit OOD in this case. That
is, any heavily-loaded FC network, designed as in Figure 11, will report “sequence errors” if
the analyzer is misconfigured. This is a function of how exchange-based load balancing
works; not of any particular vendors’ implementation of that method. As a consequence,
vendors trying to use this technique to deceive an audience will also limit the test
configuration to a single director or switch, rather than a network. That is, they draw a
distinction between IOD requirements for a fabric as a whole, vs. requirements for an
element within that fabric. As shown earlier, this distinction is illusory: the IOD requirement
is a fabric-wide concept, and applies identically to inter-switch links and backplane traces.
CONCLUSION: BUYER BEWARE
The manipulated tests described in this paper combine the following misconfigurations:
- The analysis logic is configured to define a sequence as spanning multiple OXIDs, rather than being contained within an OXID. (I.e. the definition used for the word “sequence” does not follow standards or reflect real-world node behaviors.)
- The test equipment is configured with a paired port flow pattern, instead of a multi-point pattern typical of real SANs.
Vendors desiring to demonstrate superior performance vs. Brocade will select all of these
unrealistic variables. The first set of misconfigured parameters allows them to justify
disabling the DPS balancing feature, and prevent the audience from noticing that their own
equipment will exhibit the same “sequence errors” when used in a network. The final
parameter allows them to create an unrealistic traffic pattern which congests when DPS is
turned off, because otherwise Brocade will still outperform their equipment even with OXID
backplane balancing disabled.
In fact, changing any of these points will result in two things: (1) the test will more accurately
reflect real-world SANs, and (2) it will show a vastly superior performance result for Brocade
switches. It is possible that vendors disseminating such test results simply don’t understand
the mapping of SCSI to FC, or that they intend to deceive their customers. Either way, SAN
engineers reading such results must now take a “buyer beware” approach to interpreting
them, before making decisions on that basis. Caveat emptor indeed.