
FIBRE CHANNEL IN-ORDER DELIVERY:

WHAT IOD REALLY MEANS IN THE CONTEXT OF FC STORAGE AREA NETWORKS

TABLE OF CONTENTS
OVERVIEW
THE ANATOMY OF A SCSI OPERATION
WALKING DOWN THE PROTOCOL STACK
WHEN IOD MATTERS, AND WHEN IT DOESN’T
EXCHANGE-BASED LOAD BALANCING (DPS)
COMPARING SCSI RESULTS TO FC TRAFFIC GENERATORS
CONCLUSION: BUYER BEWARE

OVERVIEW
The Fibre Channel (FC) standards require that fabrics deliver frames in the same order that
they entered the fabric. That is, a network of FC switches should act like a FIFO buffer, from
the point of view of attached devices. This requirement is called “in-order delivery”, or IOD
for short. However, the language in the standards describing IOD is ambiguous, and so there
is confusion about what IOD means. There are situations in which out-of-order delivery (OOD)
is allowable, and even desirable. In other cases, OOD may have a strongly negative impact
on applications attached to an FC Storage Area Network (SAN).
To analyze fabric IOD requirements, it is necessary to consider which traffic patterns require
delivery order to be maintained, and why this is required in the first place. In most
environments, FC networks are used to carry SCSI traffic.[1] In this context, IOD means “all FC
frames within a particular SCSI operation must be delivered in order with respect to each
other.” FC frames in different, unrelated SCSI operations do not need to be delivered in
order relative to each other. In fact, it may be desirable to prioritize one SCSI operation over
another: that is the purpose of QoS and similar features. In that case, the entire point is to
make sure that higher priority frames are delivered ahead of lower priority traffic, regardless
of which entered the fabric first, which by its nature is OOD.
This paper will clarify what IOD means in an FC fabric, so readers can identify situations in
which it is meaningful and required vs. optional or even undesirable.

THE ANATOMY OF A SCSI OPERATION


When an FC fabric is acting as a SAN transport, the upper-layer protocol being carried across
the network is SCSI. Applications in this context don’t interact with FC directly. To the
application, what happens below the SCSI layer is irrelevant unless it impacts the
performance or reliability of the SCSI interface itself. To understand how a SAN-attached
application reacts to out-of-order delivery, it is necessary to understand how SCSI works.

[1] The examples in this paper do not apply directly to FC networks used for server-to-server connectivity, e.g. using IP over FC (IPoFC), though analogous concepts apply. Also, this paper applies to open systems SANs running SCSI as the upper-layer protocol (FCP), rather than to FICON environments.
A SCSI write “conversation” consists of an initiator (e.g. a server) “talking” to a target (e.g. a
storage array). Conceptually, the conversation goes something like this:
1. Initiator: “Can I send you x bytes of data for block y?” (Command.)
2. Target: “Yeah, OK.” (Command.)
3. Initiator: [Data Data Data Data…] (Payload.)
4. Target: “Got it! Thanks!” (Command.)
– repeat for next operation; continue as needed –
This is a simplification, as there can be additional elements to the conversation, but this
does reflect the general structure of a SCSI write. Step #1 is a write request. The initiator
asks permission of the target to send data. Next, the target gives permission. The third step
is the transfer of data. This step varies in size, depending on factors such as filesystem
block size and SCSI “logical block” size. Finally, once the target has received all the data and
verified its integrity, it will send an acknowledgement to the initiator that the operation
completed successfully. This is illustrated in Figure 1.

Figure 1 - Diagram of SCSI Write Operation


During the time between the first and last commands, the application will wait for SCSI to
provide the acknowledgement. Applications which “care” about the ordering of sequential
SCSI operations will not send a subsequent write until they get the acknowledgement that
the first completed. In the interim, they might send other unrelated writes, or different
applications on the same host might send unrelated writes, but additional writes in the
same application context will not be started until the previous write completes. SCSI will
repeat the above steps as many times as needed to transfer the data.
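This wait-for-acknowledgement behavior can be sketched in a few lines of code. The following is a minimal illustration, not a real FCP stack; the Target class and its method names are invented for this example:

```python
# Minimal sketch of the write conversation above. Illustrative only:
# a real initiator/target exchange involves HBAs, queuing, and error
# handling that this toy model omits.

class Target:
    def __init__(self):
        self.blocks = {}

    def request_write(self, block_id, length):
        # Steps 1 and 2: initiator asks permission; target grants it.
        return True

    def receive_data(self, block_id, payload):
        # Step 3: payload arrives. Step 4: target acknowledges.
        self.blocks[block_id] = payload
        return "GOOD"

def write_in_order(target, blocks):
    # An order-sensitive application blocks here: write n+1 is not
    # issued until write n has been acknowledged.
    for block_id, payload in blocks:
        assert target.request_write(block_id, len(payload))
        assert target.receive_data(block_id, payload) == "GOOD"

write_in_order(Target(), [(0, b"x" * 4096), (1, b"y" * 4096)])
```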

WALKING DOWN THE PROTOCOL STACK


Figure 2 illustrates the relationships between each layer, starting from the application, going
all the way down to the Logical Unit Number (LUN) on the storage device upon which the
data is ultimately to be written. To understand the impact of delivery order on applications, it
may help to walk through this diagram. Here is how a SCSI write takes place in the context of
the protocol layering.

Figure 2 - SCSI-to-FC Mapping
First, an application generates data, perhaps in the form of one or more files, and sends it
“down” to a filesystem.[2] The filesystem then maps that data (e.g. files) onto blocks,[3] e.g.
using an inode table, File Allocation Table (FAT), or some analogous mechanism. Next, the
filesystem sends data down to the SCSI layer. At this point, objects meaningful to the
application (such as one or more files) have been converted into objects meaningful to
storage subsystems (such as one or more SCSI blocks). The “SCSI write conversation”
described in the previous section now takes place, usually one time per SCSI block. The
SCSI layer handles the “commands” previously described.
The number of blocks per file is determined by file size and filesystem block size. That is,
“number of blocks” = “file size” / “filesystem block size”, rounded up to the nearest whole
number. In the case of unstructured data, a common block size is 4 kilobytes. In this case, a
1 megabyte file would have 256 SCSI blocks. For tuned high-performance large-file
applications, block sizes may be four times that or larger. In backup and asynchronous
replication applications, block sizes may exceed 100 megabytes.

[2] In the case of a database accessing a “raw” device, the filesystem functions are still there; they are just performed by the database application rather than by an OS driver. Similarly, analogous functions are performed by backup software when writing to tape drives. One way or another, something at this layer has to keep track of where each SCSI block is located, and what upper-level object (e.g. file or database record) is being stored there.
[3] The term “block” is sometimes ambiguous. Block storage may refer to the blocks carried by the disk drive itself. The “LBA” field in a SCSI command refers to this kind of block, which is variable length but most typically 512 bytes. Unfortunately, the same word is also used to describe objects containing many SCSI blocks which are transferred as a unit. In the context of networking, the latter is the relevant meaning. The SAN doesn’t really care how the data is organized once it “lands” on a disk; the important concept is how the data is grouped for transfer across the FC fabric. Thus, examples in this paper don’t deal with LBA-style “blocks”.

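As a sanity check on the arithmetic, the blocks-per-file formula can be written out directly. This is a trivial sketch; the 4-kilobyte and 1-megabyte figures are the ones from the example above:

```python
import math

def blocks_per_file(file_size_bytes, fs_block_size_bytes):
    # "number of blocks" = "file size" / "filesystem block size", rounded up
    return math.ceil(file_size_bytes / fs_block_size_bytes)

print(blocks_per_file(1 * 1024**2, 4 * 1024))  # 1 MB file, 4 KB blocks -> 256
```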
If the application in this example were using directly-attached storage, the blocks would be
written onto disk or tape at this point. However, in a SAN, the blocks have to be mapped
onto FC for transportation to the storage subsystem. In this case, the SCSI layer hands off
each block, in order, to an FC Host Bus Adapter (HBA). If the blocks are related to each other,
the SCSI layer will generally not hand off “block #2” until it gets an acknowledgement that
“block #1” is complete, so ordering from one block to the next is guaranteed. In addition, the
protocol headers contain enough information to uniquely identify the blocks in question, so
if one block did show up out of order, the SCSI layer would handle it. Once a block is handed
off to an HBA, it is the HBA’s job to convert the block into one or more FC frames. See Figure
3 and Figure 4.[4]

Figure 3 - Variable Length of Fibre Channel Frames

Figure 4 - Fibre Channel Frame Header

[4] Theoretically, the payload portion of a frame could be smaller. However, there isn’t any utility in this, and as a practical matter the smallest frames are on the order of 64 bytes total, with about half being payload and half overhead. Similarly, FC standards allow frame payloads up to 2112 bytes, but most fabrics use 2048 (2k).

The number of FC frames per block in the data transfer phase follows a similar formula to
the blocks-per-file calculation. Generally, “number of FC frames” = “block size” / “frame
size”, rounded up. However, in the real world, this calculation can be simplified a bit. The FC
frame size for the payload portion of a SCSI write is pretty much always 2 kilobytes. Smaller
FC frame sizes are only used for commands, not for the payload. Therefore the number of FC
payload frames is actually “block size in kilobytes” / 2. For filesystems with unstructured
data, this usually means two FC frames per block, since 4k is the most common block size in
that arena. For larger block sizes, such as used in asynchronous replication, it could mean
thousands of payload frames per block. It’s difficult to know the industry-wide average, but it
would be somewhere between those two extremes.
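The same rounded-up division gives the payload frame count. A minimal sketch, assuming the 2-kilobyte payload frame size discussed above:

```python
import math

def payload_frames_per_block(block_size_bytes, frame_payload_bytes=2048):
    # "number of FC frames" = "block size" / "frame size", rounded up
    return math.ceil(block_size_bytes / frame_payload_bytes)

print(payload_frames_per_block(4 * 1024))       # common 4 KB block -> 2 frames
print(payload_frames_per_block(100 * 1024**2))  # 100 MB block -> 51200 frames
```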

WHEN IOD MATTERS, AND WHEN IT DOESN’T


This is where FC in-order delivery comes into play. In the context of SAN applications, the
reason FC needs to deliver frames in order is to make sure that each piece of any given
block “lands” in the correct position within that block. Ordering from one block to the next is
either irrelevant or handled by SCSI, but ordering of frames within a block is the
responsibility of the fabric. If the HBA divides a block into two frames, and the fabric delivers
them in reverse order, then the two halves of the block could be written to the disk in
reverse order.[5]
Look back at Figure 2. Notice the vertical relationship between “block1” at the SCSI
layer, the associated two FC payload frames, and “block1” at the LUN layer. The goal of FC
in-order delivery is to ensure that this relationship is maintained, so that the blocks
“landing” on the LUN have the same structure and data as they had inside the SCSI layer of
the initiator. To accomplish this, FC devices map each block to a unique exchange identifier.
This is the “OXID” noted in Figure 2 and Figure 4.
Figure 4 also shows other relevant FC header fields. When an FC fabric switch examines a
frame header, it can tell who sent the frame (SID), where it’s supposed to go (DID), as well
as the exchange to which it belongs (OXID). If any one of these fields differs between two
frames, then they are “unrelated” to each other from an IOD requirement viewpoint.
If the frames came from different hosts and/or are bound for different storage devices, IOD
is obviously not required. In that case, the devices aren’t coordinating their activities at the
FC layer, or – much more often – aren’t coordinating storage-related activities at all.
Figure 5 illustrates how IOD applies to flows between unrelated initiators and targets. In this
example, Host A is “talking” to Storage A, and Host B is talking to Storage B. Both of these
conversations use the same ISL. Because the ISL is a serialized link, and the hosts are
transmitting frames into the fabric in parallel, there will need to be a mechanism for
interleaving their IO streams. During any given time slice, both hosts can send a frame
simultaneously, but the ISL can only send one of those frames, so it is logically impossible
for the ISL to transmit frames in the exact same order in which they entered the fabric.
Let’s say that Host A sends two frames, both of which are part of the same SCSI operation:
A-1 and then A-2. At the same time, Host B sends B-1 and then B-2.

[5] This worst case is generally only a theoretical concern, since disk subsystems detect OOD and reject the entire SCSI operation. In the real world, the application would generally get a SCSI retry rather than corrupted data. That said, the potential for corruption is there, and even a retry isn’t desirable.

Figure 5 - IOD for Unrelated Initiator/Target Flows
The first, most intuitive way for the ISL to interleave the streams is to send an A frame, then
a B frame, then the second A frame, and so on. This is shown on the first line of the lower
portion of the figure. This way, all A frames are delivered through the fabric in the order in
which they were sent, as are all B frames. In addition, the A-2 and B-2 frames are delivered
after both A-1 and B-1, which means that the streams maintain IOD relative to each other.
They don’t maintain simultaneity, as both B frames arrive slightly later than the
corresponding A frames. However, the difference is too small to worry about, and in any case
isn’t avoidable when sending parallel streams over a serial link.
Another option would be to transmit all A frames, then all B frames. This is shown on the
second line of the figure. In this case, all A frames are delivered in the order they were sent,
as are the B frames. But in this case, B-1 is delivered after A-2: the two streams do not
maintain their IOD relationship relative to each other.
Since FC fabrics must guarantee in-order delivery, this probably sounds like a problem. It
isn’t, because the out of order condition exists between unrelated streams. In fact, this is
exactly what QoS is supposed to accomplish. If one application (say, Host A) is more
important than another (Host B) then the SAN administrator might put them into different
QoS priority groups. In that case, fabric links are supposed to send A frames before sending
B frames, regardless of the order in which they entered the fabric. OOD is not only allowable;
it is desirable. FC switches can tell that OOD is allowable because the flows have different
endpoints. That is, the SID and DID fields in the frame headers are different.
The last line in the figure illustrates an unallowable OOD scenario. In this case, frame A-2 is
delivered ahead of A-1, and B-2 is delivered ahead of B-1. Remember, in this example A-1
and A-2 are part of the same SCSI operation. This means that the various “parts” of the SCSI
block are arriving out of order. At minimum, this can result in a SCSI retry.
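The three interleavings in Figure 5 can be checked mechanically. Below is a hedged sketch of the property the fabric must preserve: within each flow (each SCSI operation, in this example), frames keep their transmit order, while ordering across flows is unconstrained. The flow labels and indices mirror the figure:

```python
def preserves_per_flow_order(delivered):
    # delivered: (flow, transmit_index) tuples in arrival order.
    last_seen = {}
    for flow, tx_index in delivered:
        if tx_index < last_seen.get(flow, -1):
            return False  # a later frame of this flow arrived first
        last_seen[flow] = tx_index
    return True

interleaved  = [("A", 1), ("B", 1), ("A", 2), ("B", 2)]  # first line: allowed
back_to_back = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]  # second line: allowed
reordered    = [("A", 2), ("A", 1), ("B", 2), ("B", 1)]  # last line: violates IOD

for order in (interleaved, back_to_back, reordered):
    print(preserves_per_flow_order(order))  # True, True, False
```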

What if A-1 and A-2 had not been part of the same SCSI operation? If frames come into the
fabric from the same host and are destined for the same storage port, but have a different
exchange ID (OXID) in the frame header, switches will “know” that they are parts of different
SCSI operations and thus IOD is not applicable. If ordering had mattered to the application,
it wouldn’t have even generated the second SCSI operation until after the first had
completed. In that case, all three examples are allowable.
The bottom line is that, in a typical-case open-systems SAN, FC in-order delivery is only
relevant between frames which have identical values in the SID, DID, and OXID fields.
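That rule is simple enough to state in code. A minimal sketch, with illustrative field values; real port IDs and exchange IDs are assigned by the fabric and the HBA:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FrameHeader:
    sid: int   # source port ID
    did: int   # destination port ID
    oxid: int  # originator exchange ID

def iod_required_between(a, b):
    # IOD applies only when all three fields match; otherwise the frames
    # belong to unrelated SCSI operations and OOD is allowable.
    return (a.sid, a.did, a.oxid) == (b.sid, b.did, b.oxid)

f1 = FrameHeader(sid=0x010100, did=0x020200, oxid=0x0001)
f2 = FrameHeader(sid=0x010100, did=0x020200, oxid=0x0002)
print(iod_required_between(f1, f2))  # False: different exchanges, OOD allowed
```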

EXCHANGE-BASED LOAD BALANCING (DPS)


One reason why SAN designers care about in-order delivery in such detail relates to fabric
path balancing. It is common for designers to create fabrics with multiple equal-cost paths[6]
between a given host and storage device. Figure 6 is an example.

Figure 6 - FC Network with Multiple Equal-Cost Paths


Dynamic Path Selection (DPS) is a link-balancing method available on all Brocade B-Series
4Gbit and later platforms. It works by striping FC exchanges across all equal-cost paths.
When a DPS-enabled switch receives a frame, it takes all equal-cost routes and calculates
the egress port from that set based on a formula using the sender’s PID (SID), target’s PID
(DID), and the exchange ID (OXID). The hashing formula will always select the same path for
a given [ SID, DID, OXID ] set.
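A sketch of this style of path selection follows. The XOR-and-modulo hash is an illustrative stand-in, not Brocade’s actual formula; the point is that the function is deterministic over the [ SID, DID, OXID ] triple, so every frame of an exchange maps to the same path:

```python
def select_egress(sid, did, oxid, equal_cost_paths):
    # Deterministic: the same (SID, DID, OXID) always selects the same
    # path, so all frames of one exchange stay in order on one link.
    return equal_cost_paths[(sid ^ did ^ oxid) % len(equal_cost_paths)]

paths = ["ISL-1", "ISL-2"]
print(select_egress(0x010100, 0x020200, 0x0001, paths))  # all of exchange 1
print(select_egress(0x010100, 0x020200, 0x0002, paths))  # exchange 2 may differ
```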
For a given “conversation” between a host and storage port, one SCSI operation would go
down the first path, and the next would go down a different path. Effectively, DPS stripes IO
at the SCSI level. All frames within a given exchange would be delivered in order by virtue of
going down the same serialized network path. This guarantees IOD within the SCSI block,
due to the mapping of FC exchanges to SCSI operations. IOD between different blocks is
either irrelevant or handled by SCSI level behaviors.
Figure 7 and Figure 8 illustrate how all frames for “OXID 1” (the first block) will follow one
path, and all frames for “OXID 2” (the next block) will follow the other path. This maintains
IOD while balancing IO between paths. The potential exists for OOD between unrelated SCSI
operations, but all SCSI-compliant devices tested are capable of handling this.

[6] “Equal-cost paths” generally means paths with the same number of hops as defined by the Fabric Shortest Path First (FSPF) protocol, and/or by manual settings such as Traffic Isolation Zones.

Figure 7 - Path for All Frames of the First Block

Figure 8 - Path for All Frames of the Second Block


The bottom line is that DPS works, and does not violate IOD. This isn’t just a theory. A
majority of fabrics in production worldwide use this method. This includes Brocade
customer environments, and competitors’ as well, since every major fabric switch vendor
supports a version of this method. DPS has passed OEM qualification testing for every SCSI-
compliant, open-systems application, and would not have passed if it corrupted data.[7]
In fact, many Brocade B-Series switches, routers, and directors can use the same algorithm
to balance their backplanes.[8] Remember, the requirement for IOD comes from the fact that
hosts and storage attached to a fabric need each part of each block to end up in the correct
position. To ensure this, any frame inserted into the fabric must come out the other end of
the entire fabric in the same relative position to all other related frames. There is no
difference to a node if OOD comes about from an ISL balancing algorithm or a backplane
trace balancing method. From an application perspective, there is no meaningful distinction
between switch-level behaviors and fabric link behaviors: IOD vs. OOD is a fabric-level
concept that does not vary based on the type of link. Therefore, if the OXID balancing
method is “correct” at a network level, then it is also correct – or at least allowable –
internally at a switch or director level.
[7] There are systems which use FC fabrics for something other than SCSI, e.g. FICON and some proprietary distance extension solutions. For those environments, Brocade has an optional fabric setting called “APTpolicy” to tell the fabric not to consider OXID in the context of frame ordering. Other switch vendors have analogous methods.
[8] This does not mean that the inside of a director actually is a fabric in the sense of using ISLs to interconnect blades. It simply means that the device can use the same balancing algorithms as are used on network links. There are similarities between designing a SAN and designing the interior of a director, as one might expect. After all, a director and a network are both trying to “solve” the same “problem” vis-à-vis host-to-storage connectivity. Therefore it is understandable that some algorithms are used both outside and inside a switch. But it is important not to take the analogy too far, because there are also substantial differences. E.g. switch backplanes do not use FSPF, negotiate speeds on traces, have failure-prone SFPs, or “reconfigure” when blades fail. If the backplane traces were actually ISLs, all of these fabric-oriented behaviors would occur.

Since Brocade link-balancing methods such as DPS and frame-level trunking are allowable
and work in networks, they are allowable and work inside switches and directors. But does
that mean they are desirable? Is a network-oriented algorithm such as DPS the “right”
way to balance the interior of a switch? That comes down to implementation quality.
Not all vendors are adept at building high performance, highly reliable, vastly scalable
networks. Brocade, however, does have a long history of doing this, so the methods Brocade
developed for balancing IO in large fabrics had to be robust. As a result, the network link-
balancing implementation in enterprise-class Brocade platforms is capable of balancing a
half-terabit per equal-cost path. (One terabit full-duplex.) In the context of director
architecture, the slot connectivity to the backplane is a “path” which needs to be balanced.
Currently, the highest-performing director in the industry, the Brocade DCX, has a quarter
terabit of backplane connectivity per slot. Thus, the current implementation of Brocade
network link balancing is capable of running a balanced backplane at 2x the most stringent
requirement in the world for backplane balancing.
Brocade competitors, on the other hand, have drastically lower capabilities for network link
balancing. The nearest competitor can demonstrate at most 25% of Brocade’s performance.
So for Brocade, it is allowable and desirable to use DPS as part of the backplane balancing
method, because Brocade’s implementation of DPS is up to the task. For vendors who aren’t
good at balancing networks, it would still be allowable but not desirable, because it would
result in “hot spots” (unbalanced load) on their backplane.
Using network-oriented methods inside a network element is a good idea if those methods
are robust. If the network-oriented methods are weak, then one would need to create
proprietary methods for backplane balancing. When a vendor claims that DPS is correct for
balancing a network, but not correct for balancing a backplane, what they are really saying is
that their implementation is not good enough to balance a backplane.

COMPARING SCSI RESULTS TO FC TRAFFIC GENERATORS


DEFINITION AND LIMITATIONS OF AN FC TRAFFIC GENERATOR
In addition to understanding how the FC IOD requirement applies to real-world environments,
it is necessary to understand how it applies to testing. These are very different things,
because lab testing for IOD is conducted with specialized testing equipment instead of real
hosts and storage. Testing devices are usually called “traffic generators”, “load generators”,
or “frame shooters”. Many frame shooters contain protocol analysis logic, and may be
referred to as “analyzers”.
FC frame shooters are able to create frames in bulk and simulate different IO patterns. That
is, an engineer creates an algorithm for defining how the analyzer will generate frame
headers and payloads, and rules for which ports will send/receive data from each other.
Analyzer ports talk to each other rather than to hosts or storage devices, so one port on an
analyzer generates a frame destined for another port on the same analyzer. Because of this,
they are able to collect frames at the other end of the fabric, and analyze the results in
terms of high-precision delivery time (latency), pipeline capacity (throughput), error and loss
rates, and more. In the right hands, they are optimal for conducting stress testing on fabrics,
and graphing the results.
However, frame shooters also have limits. They do not run “real” operating systems,
applications, or driver stacks. They are powerful tools when used correctly, but because they
lack upper-level behaviors, frame shooter results can also be misleading when used
incorrectly. Incorrect configurations can happen if the engineer conducting the test did not
know how to operate the analyzer, or constructed the test plan to intentionally mislead an
audience, or did not understand the disconnect between the tester settings and the intended
production load. The next sections will clarify such cases.
GENERAL NETWORK TEST METHODOLOGY
Nobody deploys frame shooters in a production SAN. Frame shooter test results are
therefore not meaningful in and of themselves. The goal of using frame shooters is to
predict something about how a network will likely behave in a production context. The goal
of setting up a frame shooter test plan is therefore to make it close to the intended
production deployment; to make a strong connection between the lab test and the real
application. If there is no connection, the results will be meaningless at best, and may cause
SAN designers to make bad decisions about which network elements to use and how to
interconnect them.
To ensure that results are on point, when Brocade conducts network testing, a team of
personnel is involved in creating the plan. Engineers familiar with the equipment in question
make sure it is operated properly. Customer-facing personnel are involved to ensure that the
results are portrayed honestly, since disingenuous representations harm customer
relationships. Finally, senior architects review the plan to verify that the traffic pattern and
other settings reflect anticipated real-world scenarios.
Unfortunately, most companies are not as rigorous as Brocade. It is common for testing to
be conducted by marketing personnel and low-level engineers, without supervision of senior
architects or customer-facing ombudsmen. Teams lacking the latter two roles often
make errors, through misunderstanding or intent to deceive. As a result, outright falsified or
at least materially misleading test data can often be found quoted in marketing literature,
internet web sites, or even magazine articles. SAN designers are therefore required to have
a detailed understanding of test methods and results before they can predict anything about
their production environments from reading such literature.
One common mistake is that engineers may use a fundamentally unrealistic “flow pattern”
in the test plan. Basically, the flow pattern defines which ports “talk” to which other ports. It
also encompasses variables such as frame size and data rate. Using an incorrectly
conceived flow pattern will guarantee that test results fail to predict real-world behaviors.
Say that an engineer had access to a ten-port FC frame shooter, and wanted to simulate the
load that was going to occur in a small SAN. Figure 9 illustrates three different ways the
engineer could configure the flow pattern, with alternate visualizations for two methods. On
the left, the “multi-point” pattern causes eight analyzer ports to send data to two other ports.
In the middle, the “paired port” pattern causes pairs of tester ports to talk to each other, but
not to any ports outside of the paired relationship. (E.g. port 1 will talk to port 2, but not to
ports 3 through 10. In the same way, port 3 will talk to 4, 5 to 6, and so on.) On the far right,
the “full mesh” pattern causes every tester port to talk to every other port.

Figure 9 - Network Analyzer Flow Patterns
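To make the three patterns concrete, here is a sketch that enumerates the (sender, receiver) pairs for a ten-port tester. The port numbering and the 8:2 host-to-storage split are assumptions taken from the description above:

```python
PORTS = list(range(1, 11))  # a ten-port frame shooter

def multi_point(ports, storage_count=2):
    # Eight "host" ports send to two oversubscribed "storage" ports.
    hosts, storage = ports[:-storage_count], ports[-storage_count:]
    return [(h, storage[i % storage_count]) for i, h in enumerate(hosts)]

def paired_port(ports):
    # Port 1 talks to port 2, 3 to 4, and so on, and to nobody else.
    return [(ports[i], ports[i + 1]) for i in range(0, len(ports), 2)]

def full_mesh(ports):
    # Every port talks to every other port.
    return [(a, b) for a in ports for b in ports if a != b]

print(multi_point(PORTS))     # SAN-like: many hosts per storage port
print(paired_port(PORTS))     # 1:1 pairs; rarely realistic
print(len(full_mesh(PORTS)))  # 90 flows; any-to-any never occurs in a SAN
```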


These are all useful patterns, when used and interpreted correctly: it is not fundamentally
“wrong” to use an analyzer in each of these ways. For example, the full mesh test is useful
to check for bottlenecks in a frame forwarding engine, or to saturate the switch backplane
with evenly distributed load to see how much electromagnetic interference escapes from the
platform. If the goal of the test is to debug switch hardware as part of an internal
engineering effort, the full mesh test is a “must”.
However, if the goal of the test is to predict likely behavior in a SAN, the multi-point test is the
most useful. SANs are almost always used for host-to-storage connectivity, with server-to-
server IO (e.g. IPoFC) being a small fraction of overall traffic. In addition, storage ports are
almost always over-subscribed, i.e. there is more than one host accessing each storage
controller. Full mesh patterns are only accurate if every host talks to every other host in
addition to talking to every storage device, and every storage device talks to every other
storage device. Since true any-to-any traffic patterns simply do not come up in a SAN context,
full mesh testing is not an indicator of real world performance. Similarly, paired port tests
are only accurate when there is a 1:1 ratio of hosts to storage controllers, and each
controller is only used by one host. These scenarios are just not typical.
The multi-point pattern is therefore a more accurate representation of the loads that occur
in real SANs. When evaluating vendors’ testing, look at the flow pattern. If the pattern was a
paired port or full mesh, the result is not reflective of real world SAN environments.
This might seem off point in a discussion of IOD. The point is that, when an engineer uses
test results to make decisions about SAN design, it is necessary to evaluate the entire test
plan as well as the test result. If vendors desire to create misleading results from a frame
shooter test, they will typically manipulate the flow pattern as well as other test parameters.
Because the flow pattern is the most overt variable, it is the easiest manipulation for an
engineer to detect. Any test plan which provides results for only paired port or full mesh
patterns but does not provide and break out multi-point results is not intended to honestly
reflect real world SAN environments.

The most typical manipulation as related to IOD is to configure a frame shooter for a paired
port flow pattern, and either disable DPS, or misconfigure the frame shooters’ OXID rotation
and sequence mapping behaviors. If DPS is off, Brocade still balances IO based on SID and
DID. In multi-point and full mesh tests, this is still up to 99% efficient. If a vendor wants to
concoct a deliberately misleading test result, the paired port test will be used in combination
with disabling DPS and/or misconfiguring the frame shooter because doing both at the
same time can result in a reduction in efficiency.
The fallacious reasoning in such manipulated testing goes like this: the misconfigured tester
incorrectly shows out-of-order delivery with DPS enabled. (Discussed in the next section.)
Therefore DPS must be turned off. Performance in paired port tests with DPS disabled is
inferior. Therefore Brocade performance is inferior. The flaws in this line of reasoning should
be obvious, but only to those who understand the flow pattern manipulation as well as the
manipulated IOD testing.
TESTING FOR IN-ORDER DELIVERY
Frame shooters do not have the upper layers of the stack, so the engineer setting up a test
must configure the analyzer to simulate these behaviors. Engineers unfamiliar with SANs, or
who intend to manipulate the results to deceive an audience, may configure the analyzer
incorrectly in this respect. Figure 10 shows stack elements found in a frame shooter, and
illustrates the parts which are different or missing vs. real SAN devices.

Figure 10 - Protocol mapping with Incorrectly Configured Frame Shooter

Compare this to Figure 2, and notice how the SCSI-block-to-FC-frame-to-OXID
relationship differs. (That’s the middle portion of the two figures.) Figure 2 shows how the
mapping is actually performed by real-world devices in a SAN. Figure 10 shows one possible
analyzer mapping, but analyzers can be configured for a variety of different methods of
generating OXID / sequence / frame relationships. Figure 10 shows an incorrect mapping,
which is why it differs from Figure 2.
In Figure 10, every frame gets a unique exchange ID. In the real world, even in the rare case
when the block size is set to 2k, there are at least four frames in the exchange: three control
frames and one payload frame. Figure 1 can be used to visualize this. Thus, in and of
itself, changing the OXID every frame causes the analyzer to disconnect from real-world
SANs, which limits the validity of the results.
When the analyzer is misconfigured in this way and DPS is enabled, each frame can go
down a different equal-cost FSPF path. This is shown in Figure 11. However, this
misconfiguration does not necessarily, by itself, cause the analyzer to report misleading
results. To accomplish that, the engineer setting up the test has to go one step further.

Figure 11 - Traffic Flow in a Fabric with Incorrectly Configured Frame Shooter


Most of the time, the frames will still arrive in the same order they were transmitted, but
sometimes, if traffic load is high enough, they will not. Because the frames have different
OXIDs, this is not actually an error, as demonstrated earlier in this paper. In a real open-
systems SAN fabric, frames with different OXIDs are guaranteed to be unrelated to each
other, and thus IOD does not apply.
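The difference between a standards-aligned check and the misconfigured one can be sketched as follows. This is a simplification: frames are reduced to a per-flow transmit counter, and the scope argument decides which frames are compared against each other:

```python
def count_sequence_errors(frames, scope):
    # frames: (sid, did, oxid, tx_counter) tuples in arrival order.
    # scope "exchange" keys on (SID, DID, OXID), as the standards intend;
    # scope "multi_exchange" ignores OXID: the misconfigured definition.
    last = {}
    errors = 0
    for sid, did, oxid, tx in frames:
        key = (sid, did, oxid) if scope == "exchange" else (sid, did)
        if tx < last.get(key, -1):
            errors += 1
        last[key] = tx
    return errors

# One frame per exchange (the misconfiguration), two frames swapped in flight:
arrived = [(1, 2, 0x10, 0), (1, 2, 0x12, 2), (1, 2, 0x11, 1)]
print(count_sequence_errors(arrived, "exchange"))        # 0: nothing related reordered
print(count_sequence_errors(arrived, "multi_exchange"))  # 1: bogus "sequence error"
```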
Some frame shooters can be set up to generate the “sequence error” statistic under a
variety of different rules. Properly configured, an analyzer will not report the above scenario
as an error, but when incorrectly configured, it will do so. When the word “sequence” is used
in the context of FC standards (as in “frames within a sequence must be delivered in order,
and failure to do so is an error”) then the sequence referred to is required to be contained
within an exchange. To get an analyzer to report “out of sequence” errors in a DPS balanced
system, the word “sequence” has to be redefined as something that contains multiple
exchanges: exactly the opposite of the FC standards’ meaning of the term. Figure 12 and
Figure 13 illustrate this another way.

Figure 12 - Real SCSI Operation Maps FC Sequences Inside FC Exchanges

Figure 13 - Incorrectly Configured Frame Shooter Maps FC Exchanges Inside “Sequences”


Figure 12 shows a normal, real world SCSI write. In this example, the block size is set to 16k,
so there are eight payload frames in the transfer of each block. The figure shows two such
transfers, which might happen when writing a 32k file. In addition to the eight payload
frames, each operation has three command frames, and some latency-related pauses.
(Figure 1 and the surrounding text explain why this happens.) The frames in the second
operation are not transmitted until the final command frame from the first operation is
received by the initiator, guaranteeing SCSI-level IOD.
The important points in Figure 12 relate to the relationship between SCSI operations, OXIDs,
sequences, and frames. Each SCSI operation has a 1:1 relationship with a unique OXID.
That is, all frames in that operation have the same OXID, and all frames in different
operations have a different OXID. Depending on the size of the exchange and the
architecture of the drivers, there could be more than one FC sequence within each exchange.
The key takeaway is that the exchange ID should – and in the real world, always would –
“contain” all related sequences.
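That containment can be written down directly. A minimal sketch, with illustrative field names, of the structure Figure 12 depicts:

```python
from dataclasses import dataclass, field

@dataclass
class Sequence:
    seq_id: int
    frames: list = field(default_factory=list)      # in-order frames

@dataclass
class Exchange:
    oxid: int                                       # one SCSI operation per OXID
    sequences: list = field(default_factory=list)   # an exchange CONTAINS sequences

# Correct model: sequences nest inside an exchange, never the other way around.
op = Exchange(oxid=0x0001, sequences=[Sequence(seq_id=0, frames=["f0", "f1"])])
print(len(op.sequences[0].frames))  # 2
```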
Figure 13 is the same style of diagram, but reflects a misconfigured frame shooter. In this
case, there is no SCSI operation concept, there are no SCSI control frames, and no IO wait
states: there is just a continual series of 2k frames. While this is a bit unrealistic, it isn’t a
problem so far. But the frame shooter is also configured to change OXIDs every single frame,
and to define a “sequence” as containing multiple exchanges rather than the reverse. This
series of configuration problems will cause an analyzer to report OOD between frames with
different OXIDs as being a “sequence error”.
Now, this is not a “bug” in the frame shooter itself. Frame shooters are extremely powerful
and flexible because they are primarily intended to be used by engineers developing
products, rather than by SAN designers evaluating network performance problems. An ASIC
engineer might want to use a frame shooter to test unrealistic workloads as part of
debugging an FPGA chipset prior to “burning” it into an ASIC, to check how the fabric
might react to hypothetical non-SCSI upper-layer workloads, or to test security features
against hypothetical hacker attack strategies. If the test equipment is being used for such a
purpose, any arbitrary settings are “allowable”. But when the equipment is used to make
claims about how various SAN devices will perform in production environments, configuring
it in this way is either a mistake or a deception.
It is also worth noting that all major vendors’ FC solutions will exhibit OOD in this case. That
is, any heavily-loaded FC network, designed as in Figure 11, will report “sequence errors” if
the analyzer is misconfigured. This is a function of how exchange-based load balancing
works; not of any particular vendors’ implementation of that method. As a consequence,
vendors trying to use this technique to deceive an audience will also limit the test
configuration to a single director or switch, rather than a network. That is, they draw a
distinction between IOD requirements for a fabric as a whole, vs. requirements for an
element within that fabric. As shown earlier, this distinction is illusory: the IOD requirement
is a fabric-wide concept, and applies identically to inter-switch links and backplane traces.

CONCLUSION: BUYER BEWARE


Much of the content in this paper is highly technical. This level of detail is needed to fully
understand what IOD means in the context of a real-world FC SAN. Historically, the only
people who needed this level of technical depth were engineers tasked with designing FC
fabric switches, HBAs, or storage array controllers. Recently, it became relevant to a broader
audience because it is also needed to interpret the results of testing performed with frame
shooters, and some vendors have started putting out misleading test results.
Truly understanding this requires considerable study, but it can be summarized this way:
• For open-systems SANs, IOD means “every frame within a given SCSI operation must arrive in the same order in which it was transmitted”.
• FC HBAs map SCSI operations to exchange IDs (OXID), and fabrics distinguish endpoints using port IDs (DID and SID).
• It is therefore possible to tell whether or not FC frames are part of the same operation by looking at the SID, DID, and OXID.
• If any of them differ, then the SCSI operation is different and IOD does not apply. Thus, OOD is allowed between those unrelated frames.
• FC switches from all major vendors support load balancing based on this fact, and this method has passed qualification from every major storage OEM.
This really should be the end of the discussion, and until recently, it was. Now, some
misleading test results have been circulated, which is confusing the issue. Typically, the
testing in question is conducted in the following manner:
• FC frame shooters without SCSI, operating system, filesystem, or application layers are used instead of real hosts or storage devices.
• The test is limited to a single network element instead of a multi-switch fabric.
• The frame shooter is configured to change OXIDs every frame, instead of sending a series of related frames with the same OXID.
• The analysis logic is configured to define a sequence as spanning multiple OXIDs, rather than being contained within an OXID. (I.e. the definition used for the word “sequence” does not follow standards or reflect real-world node behaviors.)
• The test equipment is configured with a paired port flow pattern, instead of a multi-point pattern typical of real SANs.
Vendors desiring to demonstrate superior performance vs. Brocade will select all of these
unrealistic variables. The first set of misconfigured parameters allows them to justify
disabling the DPS balancing feature, and prevent the audience from noticing that their own
equipment will exhibit the same “sequence errors” when used in a network. The final
parameter allows them to create an unrealistic traffic pattern which congests when DPS is
turned off, because otherwise Brocade will still outperform their equipment even with OXID
backplane balancing disabled.
In fact, changing any of these points will result in two things: (1) the test will more accurately
reflect real-world SANs, and (2) it will show a vastly superior performance result for Brocade
switches. It is possible that vendors disseminating such test results simply don’t understand
the mapping of SCSI to FC, or that they intend to deceive their customers. Either way, SAN
engineers reading such results must now take a “buyer beware” approach to interpreting
them, before making decisions on that basis. Caveat emptor indeed.

Copyright © 2009 Brocade Communications Systems, Inc. All Rights Reserved.


All or some of the products detailed in this document may still be under development and certain
specifications, including but not limited to, release dates, prices, and product features, may change.
The products may not function as intended and a production version of the products may never be
released. Even if a production version is released, it may be materially different from the pre-release
version discussed in this document.
NOTHING IN THIS DOCUMENT SHALL BE DEEMED TO CREATE A WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, STATUTORY OR OTHERWISE, INCLUDING BUT NOT LIMITED TO, ANY IMPLIED
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR
NONINFRINGEMENT OF THIRD PARTY RIGHTS WITH RESPECT TO ANY PRODUCTS AND SERVICES
REFERENCED HEREIN.
Brocade, the Brocade B-weave logo, Fabric OS, File Lifecycle Manager, MyView, SilkWorm, and
StorageX are registered trademarks and the Brocade B-wing symbol, SAN Health, and Tapestry are
trademarks of Brocade Communications Systems, Inc., in the United States and/or in other
countries. FICON is a registered trademark of IBM Corporation in the U.S. and other countries. All
other brands, products, or service names are or may be trademarks or service marks of, and are used
to identify, products or services of their respective owners.
