
Link Aggregation and Failover

Thom Bean 9/21/2011

Contents
Introduction
Terminology
References
Link Aggregation Types
Topologies
    Direct Connect
    Private Network
    Local Network
    Remote Network
Data Domain Link Aggregation and Failover
Bond Functions Available in Linux Distribution
Hash Methods Used
Link Failures
Other Link Aggregation
    Cisco
    Sun
    Windows
    AIX
    HPUX
Data Domain Link Aggregation and Failover in the Customer's Environment
Normal Link Aggregation
Failover of NICs
Failover Associated with Link Aggregation
Recommended Link Aggregation
Switch Information

Introduction
This document describes the use of link aggregation and failover techniques to maximize throughput and keep an interface up on networks with Data Domain systems installed. The basic topologies are described with notes on the usefulness of different aggregation methods, so the right method can be chosen for a specific site.

Link aggregation has two purposes:
1. Evenly split the network traffic across all the links or ports that are in the aggregation group
2. Continue to transfer data over connections even though a link fails

With link aggregation, failover of a link is provided with degradation. For example, if two 1 Gb links are aggregated to obtain a throughput of 1.8 Gb/s and one link goes down, the data transfer will continue but only at up to 0.9 Gb/s over the single link that is still available. With failover, only one link, referred to as the active link, is used at a time. The other links in the group are idle. The maximum throughput is the speed of a single link. If the active link fails, the traffic will continue on another link without losing the connection. In both cases, link aggregation and failover, the failure would be due to loss of carrier, but when using link aggregation with LACP the failure can also be the loss of the heartbeat communication. Normally the aggregation is between the local system (i.e. DDR) and a network device (e.g. switch or router) or another system (e.g. media server) to which it is directly connected.

The goal of purpose number 1 is to achieve the maximum throughput across all the links that are aggregated. There are a few things that can impact how well the aggregation actually performs:
1. Speed of the switch
2. How much the DDR can process
3. Network overhead
4. Acknowledging and coalescing to recover out of order packets
5. Aggregation method may not effectively distribute the data evenly across all the links
6. Number of clients
7. Number of streams (connections) per client

For impact 1, normally the switch can handle the speed of each link that is connected to it, but it may lose some packets coming from several connected ports when sending to one uplink, depending on the uplink speed. Note: this implies that only one switch can be used for port aggregation coming out of a system. For most implementations this is true, but there are some network topologies that allow for link aggregation across multiple switches, especially in conjunction with virtual switches or virtual COM ports.

Impact 2 addresses the DDR systems. The processing rate of DDR systems and programs is limited. As the hardware gets faster and the use of parallel processing improves, DDR systems will support a higher network throughput, but as the processing speed increases the network link speed will also increase. For example, with the current systems it makes sense to aggregate 1 GbE links but not 10 GbE links, because one 10 GbE link can provide enough data to saturate the processing power of some of the current and older DDR systems. As the system speed improves it will make sense to aggregate 10 GbE links, but the link speeds will also increase.

Impact 3 addresses the inherent overhead of the network programs. This overhead guarantees that the transfer speed will never reach 100% of the line speed. The throughput will always be reduced by the overhead it takes to queue and send a packet of data through the system until it is put onto the wire. There is an inherent delay separating the sending of packets on Ethernet. The overhead is expected to cause a 5 - 10% reduction in line speed per link.

Impact 4 deals with the case where packets get out of order. This happens if the link aggregation mode allows the packets to be sent or received out of order and the protocol requires that they be put back into the original order. The network program will then need to coalesce the out of order packets into the original order. An out of order packet may also look like a lost packet, triggering recovery techniques.
This added overhead can reduce the throughput to the point where the specific mode of link aggregation that causes out of order packets should not be used.

Impact 5 has the biggest effect on how effectively the data is split across the network links, and therefore on whether the full throughput of the multiple links can be reached.

Impact 6: Can a single client drive data fast enough to fully utilize multiple aggregated links? In some older systems, either the physical or OS resources cannot drive data at multiple Gb/s. Also, due to hashing limitations, multiple clients may be required to push data at those speeds. For example, if the MAC address is used for the hash then more than one client will be needed to distribute the network traffic across the aggregated interfaces.

Impact 7: The number of streams, which translates to separate connections, can play a significant role in link utilization depending on the hashing that is used. For example, if the hashing is done on TCP port numbers, more connections will allow better distribution of the packets.

A final impact deals with the effectiveness of the aggregation method used. If two systems are connected together by direct connect cables, the use of Layer 2 (MAC) hashing would not provide any aggregation at all. All the packets would go over the same link. If Layer 3 (IP) hashing is used the same thing would result unless IP aliasing or VLAN tagging is used to have multiple IPs for one interface. In that case Layer 3 hashing could be used, and in fact the IP addresses could be selected to guarantee the distribution of packets across the interfaces. In general the number of systems that will be communicating with the Data Domain system will be small, so the aggregation method used will need to work for a limited number of client systems.

The number of links that are aggregated will depend on the switch performance, the DDR system and application performance, and the link aggregation mode used. There is no absolute limit in the DDR software, except for the actual number of physical links, as long as the switch or whatever it is connected to can handle the number of ports trunked together.
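The degradation example in this Introduction (two 1 Gb links giving 1.8 Gb/s, dropping to 0.9 Gb/s on one link) can be reproduced with simple arithmetic. The following is only a rough sketch: the function names are hypothetical, and the 10% overhead is an assumed value taken from the upper end of the 5 - 10% range quoted above.

```python
def aggregate_throughput_gbps(link_speed_gbps, links_up, overhead=0.10):
    """Estimate usable throughput of an aggregation group.

    overhead is an ASSUMED per-link loss; the text expects 5 - 10%.
    """
    if links_up < 0:
        raise ValueError("links_up must not be negative")
    return links_up * link_speed_gbps * (1.0 - overhead)


def failover_throughput_gbps(link_speed_gbps, links_up, overhead=0.10):
    """Failover uses one active link, so standby links add no throughput."""
    return aggregate_throughput_gbps(link_speed_gbps, min(links_up, 1), overhead)
```

With these assumptions, an aggregation group degrades when a link is lost, while a failover group keeps the same single-link throughput as long as at least one link is up.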

Terminology
The following are terms used in this document.

DDR
Data Domain appliance, a Linux system used to perform only Data Domain operations.

Bond, Bonding
This is a term used by the Linux community to describe the grouping of interfaces together to act as one interface to the outside world. The DDR uses a subset of the bonding available from Linux.

EtherChannel
This is a term used by Cisco to define the bundling of network links as described under Ethernet Channel. With Cisco there are three ways to form an EtherChannel: manually, automatically using PAgP, and automatically using LACP. If it is done manually, both sides have to be set up by the administrator. If one of the protocols is used, packets of that protocol are sent to the other side, where the EtherChannel is set up based on the information in the packets.

Ethernet Channel
This is a set of multiple individual Ethernet links that are bundled into a single logical link between systems. This provides a higher throughput than a single link does. The term used by Cisco to identify this is EtherChannel. The actual throughput is dependent on the number of links bundled together, the speed of the individual links, and the switch or router that is actually being used. If a link within the Ethernet Channel fails, the traffic that would normally go over the failed link is sent over the remaining links within the bundle.

LACP
Link Aggregation Control Protocol (LACP) provides dynamic link aggregation as defined in the IEEE 802.3ad standard (IEEE 802.3 clause 43). LACP is not available in DDOS 4.9 and earlier. In DDOS 5.0 LACP is only available for the 1 Gb interfaces. It is not available for 10 Gb until 5.1. It is not available for Chelsio NICs.

Link Aggregation
Using multiple Ethernet network cables or ports in parallel, Link Aggregation increases the link speed beyond the limits of any one single cable or port. Link aggregation is usually limited to being connected to the same switch. Other terms used are EtherChannel (from Cisco), Trunking, Port Trunking, Port aggregation, NIC bonding, and Load balancing. There are proprietary methods that are used, but the main standard method is IEEE 802.3ad. Link aggregation can be used for a type of failover too.

Load Balancing
Aggregation methods used to try to distribute loads across all available links or ports. This term is applied to switches and routers, and with Cisco it is generally a global parameter applied across all EtherChannels.

Port Aggregation Protocol (PAgP)


This is Cisco's proprietary networking protocol providing logical aggregation of Ethernet ports. It is used in Cisco's EtherChannel. This is the older method used by Cisco. Later releases of their software use the standard LACP to provide the same type of functions. Note: PAgP EtherChannels do not interoperate with LACP EtherChannels. PAgP is not supported by DDRs.

Round Robin
In Linux, round robin sends packets in sequence to each available slave. This provides the best distribution across the bonded interfaces. Normally this would be the best aggregation to use, but the throughput can suffer because of packet ordering.
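As a rough illustration of the rotation described above, the sketch below assigns packets to links in strict sequence. It is not the Linux bonding driver's code; the function name, the link names, and the packet sizes are made-up values for the example. It also shows that an even packet count per link does not guarantee an even byte count per link.

```python
from itertools import cycle


def round_robin(packets, links):
    """Assign packets to links in strict rotation.

    packets: list of (kind, size_in_bytes) tuples (illustrative format).
    Returns a dict mapping each link name to the packets it carried.
    """
    assignment = {link: [] for link in links}
    for packet, link in zip(packets, cycle(links)):
        assignment[link].append(packet)
    return assignment


def bytes_per_link(assignment):
    """Sum the packet sizes carried by each link."""
    return {link: sum(size for _kind, size in pkts)
            for link, pkts in assignment.items()}


# Assumed traffic pattern: every other packet is a small ACK.
packets = [("data", 1500), ("ack", 64)] * 3
result = round_robin(packets, ["eth2", "eth3"])
load = bytes_per_link(result)
```

In this contrived pattern both links carry three packets, but one link carries all the data packets and the other carries only ACKs.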

RSTP
Rapid Spanning Tree Protocol, IEEE 802.1w, allows a network topology with bridges to provide redundant paths. This allows for failover of network traffic among systems. This is an extension to the spanning tree protocol (STP). The two names are used interchangeably.

TOE
TCP Offload Engine. Network interface cards (NICs) that have the full TCP/IP stack on the card.

Trunking
Trunking is the use of multiple communication links to provide an aggregated data transfer among systems. For computers this may be referred to as port trunking to distinguish it from other types of trunking such as frequency sharing. Note: Cisco uses the term trunking to refer to VLAN tagging, not link aggregation, whereas other vendors use this term in reference to link aggregation.

References:
Catalyst 4500 Series Switch Cisco IOS Software Configuration Guide (also used for the 4900 Series Switch), Release 12.2(44)SG, available from the Cisco Documentation site.
Cisco Documentation, http://www.cisco.com/univercd/home/home.htm
IEEE 802.3 Standard, http://standards.ieee.org/getieee802/802.3.html
    Also available under: http://iweb.datadomain.com/eweb/technical_library/Vendor/Cisco/
    The IEEE 802.3ad Standard is Clause 43 under IEEE802_3-sec3.pdf of the standards documents listed.
Linux distribution documentation, http://www.kernel.org/
Linux Ethernet Bonding Driver HOWTO, http://www.cyberciti.biz/howto/question/static/linux-ethernet-bonding-driverhowto.php, http://www.cyberciti.biz/tips/linux-bond-or-team-multiple-network-interfaces-nic-into-single-interface.html
Wikipedia, http://en.wikipedia.org/wiki/Main_Page
Various links on the web as noted within the document by hotlinks

Link Aggregation Types


Link aggregation needs to balance the number of packets across all the links within the aggregation group with minimum impact from the splitting, assembling, and reordering of packets. Currently IEEE 802.3ad is the accepted standard. This can be used by most systems that can support link aggregation, but there is no one size fits all. There are other aggregation types that may work better in some situations, such as round robin, which is not part of the IEEE 802.3ad standard.

The IEEE 802.3ad standard is contained in clause 43 of the IEEE 802.3 standard that is freely available on the web. In the IEEE standards the term clause 43 can be thought of as chapter 43. Clause 43 is part of the IEEE 802.3-2005 Section Three pdf file on the IEEE web site. A large part of the IEEE 802.3ad standard is the LACP. This is a protocol that is used to coordinate the aggregation between the two systems that are directly connected. Note: This standard does not identify how the actual link is selected to send a packet, but it does emphatically mention two things: packets within a conversation should always be kept in order, and packets should not be duplicated. For the purposes of this document, a conversation is the same as data traffic sent over a single connection.

The aggregation process is defined internally on the sending system, but the LACP operation coordinates with the connected system on the available ports to be used in the aggregation. The LACP provides the "heartbeat" over each interface to be used and therefore can tell that an interface can no longer be used when the control packets can no longer be sent or received. It also uses the carrier to help determine when an interface can no longer be used, but the heartbeat gives it a slight advantage over the normal failover. Note: The LACP protocol is not supported in releases 4.9 and before. Also, LACP is not supported on 10 Gb until the 5.1 release.
If the IEEE standard is not used there are two other link aggregation options on the DDR. One is round robin and the other is the Linux bonding module's balance-XOR type. As implied by the name, the aggregation is done by applying an XOR function to one or more of the addresses and/or port numbers within the packet headers. This aggregation has to be set up on both sides. Even though it may be desired, the aggregation used on both sides does not need to match. For example, if Layer 3+4 is used on the DDR, the system connected to the DDR could use Layer 2 hashing. In the case of the DDR being connected to a Cisco switch, only one hash can exactly match the transmit hash on the DDR, and that is Layer 2. The hashing only impacts the transmitted packets and does not impact the received packets.

An important consideration is the network topology. Important things to consider with the network topology are:
- The equipment directly connected to the DDR
  o It may be the media server or another DDR
  o If it is a switch or a router, the make and model number should also be noted.
- Whether the target system is local or remote; there may be a gateway involved
- The DDR may be on a private network or shared with the rest of the customer's network
- The number of target computer systems that will be connected needs to be taken into account
- Single DDR or multiple DDRs

Each part of this information will have an impact on the type of link aggregation, the transmit hash, and the load balancing that is used. Consider what systems will be doing the link aggregation. Normally link aggregation configuration requires coordination between the DDR system and the switch. There is at least one network topology where a switch may not be part of the configuration, i.e. direct connect. This will need the link aggregation to be configured between the DDR and the media servers.
If the DDR is on the local network and is communicating with many systems then using Layer 2 (MAC address) could be acceptable. If the connection path goes through a router/gateway then Layer 3 (IP address only) or Layer 3+4 (IP address and the port number) may be desired.

The DDR link aggregation only supports links with the same speed on a single bonded interface. The MTU will be set the same on all the interfaces in one bonding. With 1 Gb the media type can be either fiber or copper, but on the 10 Gb interfaces the media types must match. Link aggregation is not supported on the interfaces that have TOE enabled. These are the cards with CX4 interfaces and the single port optical card. The dual port 10 GbE TOE cards can have failover between the ports on the card but do not support failover to interfaces off the card.
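The restrictions above (same speed on every link, matching media types on 10 Gb, no TOE-enabled interfaces) can be expressed as a simple pre-check. This is a hypothetical sketch and not part of DDOS; the `Link` class, its field names, and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    speed_gb: int              # 1 or 10
    media: str                 # "copper" or "fiber"
    toe_enabled: bool = False


def aggregation_problems(links):
    """Return a list of reasons the links cannot share one bonded interface."""
    problems = []
    speeds = {link.speed_gb for link in links}
    if len(speeds) > 1:
        problems.append("links have different speeds")
    # Media may be mixed on 1 Gb, but must match on 10 Gb.
    if speeds == {10} and len({link.media for link in links}) > 1:
        problems.append("10 Gb links have different media types")
    for link in links:
        if link.toe_enabled:
            problems.append(f"{link.name} has TOE enabled")
    return problems
```

An empty list means none of the stated restrictions is violated; it does not prove that the switch or media server side is configured correctly.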

There is also the question of when to use failover, since aggregated interfaces also handle failures. The link aggregation modes include a failover component by allowing data transfer to continue in a degraded state on fewer of the bonded interfaces. For example, if one of the links goes down, the link aggregation can recognize this, drop that link from the aggregation list, and continue with one less link. The customer may feel full failover is more important than link aggregation and would rather have no degradation. Instead of aggregating over multiple links, these links can be configured in full failover mode, where idle spares that carry no data are set up until the active link fails. This way there would be no degradation of throughput if the one link fails and data is sent over the other. One or more links would be kept in standby mode until needed.

A strong reason to use failover instead of link aggregation is the setup of parallel network paths. In this configuration there can be two switches, and if a component in one path goes down the data traffic is moved to the other port and switch. Failover can cross switches, while link aggregation, except for special cases with virtual switches, must be set up with interfaces connected to the same switch. A caveat is that if a port fails and not the switch, then the new switch will need to be able to route the data to the target destination and the destination will need to send the packets to the new switch.

An administration network interface is also needed with DDRs. For direct connections and one to one server connections there is a separate Ethernet interface for this, but this could also be part of the link aggregation unless there is a physical separation needed between the links.

Topologies
The basic types of network topologies are described below, along with their differing suitability for various types of aggregation methods.

Direct Connect
The Data Domain system is directly connected to one or more backup servers. Providing link aggregation within this topology requires multiple links between each backup server and the Data Domain system. Usually link aggregation is not done with this topology, especially with multiple backup servers, because of the limited number of links available on the Data Domain system and because one system does not drive data faster than one link, especially if the link is 10 Gb. In the picture shown, there are two interfaces being aggregated (the blue lines). For backups the important hashing is done on the media server, because the majority of data traffic flows from the media server to the DDR. In this case it may be useful to check if round robin can be used and evaluate whether the performance is adequate. Otherwise consider multiple connections using IP aliases.
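For the IP alias suggestion above, alias addresses can be picked so that an IP-based hash spreads the connections over the links. The sketch below assumes an IP-only hash of the form ((source IP XOR destination IP) AND 0xffff) modulo the number of links; the function names and the example addresses are hypothetical, and the hash actually used by the sending system should be confirmed before relying on this.

```python
import ipaddress


def ip_hash_link(src_ip, dst_ip, link_count):
    """ASSUMED IP-only hash: ((src XOR dst) AND 0xffff) mod link_count."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return ((src ^ dst) & 0xFFFF) % link_count


def pick_aliases(peer_ip, candidate_ips, link_count):
    """Return {link_index: first candidate IP that hashes to that link}."""
    chosen = {}
    for ip in candidate_ips:
        chosen.setdefault(ip_hash_link(ip, peer_ip, link_count), ip)
    return chosen


# Example with made-up private addresses and two links.
aliases = pick_aliases(
    "192.168.1.10",
    ["192.168.1.20", "192.168.1.21", "192.168.1.22"],
    2,
)
```

Because XOR is symmetric, the same selection applies whichever side is treated as source or destination under this assumed hash.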

[Figure: direct connect topology showing the Data Domain system, network switch, backup/media server, business servers, and tape library]

Private Network
This topology is the same as the direct connect except that the connections are through a switch rather than being directly connected. This would normally be used to connect multiple media servers to multiple DDRs. The link aggregation would be between a DDR and the switch or between a media server and the switch. The aggregation would be to get the data to and from the switch. In this case the aggregation between the DDR and the switch would be independent of the aggregation used between the media server and the switch. Note: there is a possible special case where the switch would be only a pass through and would be transparent to the aggregation. That would not be the norm and is discussed in further detail later. In this topology, if there is one media server, the use of multiple TCP connections and a load balancing of src-dst-port on the switch may be the best choice.

[Figure: private network topology showing the Data Domain system, network switch, private network switch, backup/media server, business servers, and tape library]

Local Network
The Data Domain system is connected to the backup server through a common switch/router which is shared by many other systems. In the previous network topologies shown, the Data Domain system may have a connection through the common switch to handle administration and maintenance tasks, which need not be part of the aggregation. In this example the data is also being sent through the shared network. The setup of the aggregation is the same as with the private network, but it opens the possibility of many more media servers being available to back up to the DDR. This may make load balancing on Layer 2 (MAC address) more feasible.

[Figure: local network topology showing the Data Domain system, network switch, backup/media servers, tape library, and business servers]

Remote Network
This is similar to the local network except that the connection is through a router before it gets to the media server, or to other DDRs in the case of replication. There will normally be a switch between the DDR and the router unless the router also provides switch functionality. What is important to note in this diagram is that there is a gateway function involved in the network data flow. It is important to maximize the data throughput between the DDR and the media servers, but if there is a WAN involved one 1 Gb interface is normally enough because of bandwidth limitations.

Normally for performance reasons the DDR will be located on the same LAN and use the same switch as the media server. There may be cases where some of the media servers are on separate LANs. The DDR would need to go through at least one gateway to get to them. It is not expected that the media servers will go across a WAN to get to the DDR, although there are some high speed WANs over which the backup could go.

The WAN topology is likely to be the case for a DDR with replication. Normally the data flow in replication is low enough that it does not need aggregation, but MANs and some WANs are getting fast enough that this configuration may be needed. Otherwise the WAN would tend to make aggregation ineffective. Yet there are customers that have asked about it. One reason is that it provides redundancy (failover).

[Figure: remote network topology showing the Data Domain system, two network routers, tape library, business servers, and backup/media servers]

Complex bonding
Starting with the DDOS 5.1 release it is possible to create failover on an aggregation group. In this case the failover slaves are virtual interfaces containing aggregations. The primary use of this is to allow the use of aggregated interfaces while also having a full switch failover. In this diagram there are two parallel paths. If the path being used goes through the switch on the left, the failover on the Data Domain system will switch to using the switch on the right. In this case the carrier controls the failure, and the carrier on both of the interfaces to the switch on the left has to go down. Otherwise traffic will run over one of the interfaces on the left in a degraded mode.

[Figure: complex bonding topology showing the Data Domain system connected by aggregated interfaces to two network switches, which connect by aggregated interfaces to the business servers, tape library, and backup/media servers]

It is important to set realistic expectations for this setup. Does the media server support this? If not, then it would only have failover across single links, and without multiple media servers it does not make sense to have aggregated links on the DDR. Another point: if one of the aggregated links goes down, the failover will not happen; rather, the other link(s) will handle all the traffic in a degraded mode. If both links go down then the failover will kick in, but since the switch itself did not fail the media server side will still use the first switch. So the packets will need to be routed between the switches. Otherwise the connections will fail. Also, if LACP is being used and the carrier stays up but the heartbeat fails on both the interfaces, the failover (which is based on carrier) will not change to the second switch and communication will stop. It turns out that this is only really effective if the whole switch goes down instead of part of it.
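The carrier-based behaviour described above can be summarized in a small decision function. This is a sketch of the stated rules only, not the DDOS implementation; the function name and the data layout are assumptions.

```python
def group_in_use(carrier, active, standby):
    """Decide which aggregation group carries traffic.

    carrier: dict mapping group name -> list of per-link carrier flags.
    Returns (group_name_or_None, number_of_links_with_carrier).
    """
    active_up = sum(carrier[active])
    if active_up > 0:
        # Any surviving link keeps the active group in use (degraded).
        return active, active_up
    standby_up = sum(carrier[standby])
    if standby_up > 0:
        # Failover happens only when every active link has lost carrier.
        return standby, standby_up
    return None, 0
```

Note that the decision looks only at carrier. An LACP heartbeat failure with carrier still up would not change the result, which matches the limitation described above.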

Data Domain Link Aggregation and Failover


There are three link aggregation methods supported by Data Domain:
- Round Robin
- Balanced (set up manually on both sides)
- LACP (starting in 5.0 for 1 Gb and 5.1 for 10 Gb)

Balanced and LACP aggregation also require one of the supported transmit hashes to be specified:
- Layer 2
- Layer 3+4
- Layer 2+3 (starting in 5.0)

Virtual interfaces can be created to define the aggregation or failover:
- The virtual interface names start with a v, as in veth0 and veth34.
- The number of virtual interfaces can be no more than the number of physical interfaces. This is normally not a problem except in testing, because each virtual interface will usually have two or more physical interfaces associated with it.
- Any of the physical links that are available on the system can be included: eth0, eth1, eth2, eth3, etc. using the legacy port naming, or eth0a, eth0b, etc. using the slot based port naming.

To specify aggregation of eth2 and eth3 (or eth4a and eth4b for later releases with slot based naming) in the virtual interface veth0, one of the commands described in this section would be used. For round robin:

    net aggregate add veth0 mode roundrobin interfaces eth2 eth3

The first network transmit packet given to veth0 will be forwarded to one of the interfaces and the next packet will be forwarded to the other interface. Sending of packets will continue to alternate between the interfaces until there are no more packets or a link fails. If eth3 loses its physical connection, all packets are sent through eth2 until the eth3 link is brought back up. Round robin should be considered for the remote side too. For direct connect (the only topology that is recommended for round robin) the media server will have to be able to set up and support round robin. Note that even though round robin provides the best distribution of packets, it does not necessarily provide the best utilization of the lines. For example, suppose that every other transmitted packet is an ACK packet and there are two interfaces.
Then the small ACK packets will go over one interface and all the large data packets will go over the other interface. This normally will not happen, but it illustrates the potential problem.

The command used for adding balanced mode aggregation changes at release 5.0.

For DDOS versions before 5.0 with slot based port naming:

    net aggregate add veth0 mode xor-L2 interfaces eth4a eth4b

For DDOS version 5.0 and later with slot based port naming:

    net aggregate add veth0 mode balanced hash xor-L2 interfaces eth4a eth4b

For this command the aggregation used is balanced-xor. The transmitted packets are distributed across eth2 and eth3 (or eth4a and eth4b for later releases with slot based naming) based on the XOR of the source and destination MAC addresses. Because there are only 2 links to be aggregated, the lowest bit is used to determine the interface to use for the packet. If the result is 0 one interface will be chosen. If the result is 1 the other interface will be used. Spreading the packets across the two links requires that data is sent to more than one system and that the MAC addresses of the destination and/or source differ in such a way that the XOR results give different numbers. This means that one address needs to be odd and the other needs to be even. If there are three links that are aggregated, the XOR result is split 3 ways. Starting with 2 bonded interfaces there have to be at least two media servers, and their MAC addresses must be different enough to cause the packets to spread across the interfaces. In general, this aggregation should not be used with fewer than 4 media servers.

For DDOS versions before 5.0 with legacy port naming:

    net aggregate add veth0 mode xor-L3L4 interfaces eth2 eth3

For DDOS version 5.0 and later with slot based port naming:

    net aggregate add veth0 mode balanced hash xor-L3L4 interfaces eth4a eth4b

The aggregation used with this command will also be balanced-xor. The packets are distributed across eth2 and eth3 (or eth4a and eth4b for later releases with slot based naming) based on the XOR of the source IP address, destination IP address, source port number, and destination port number. The result gives a number in which the lowest bits are used to determine which link to use to send the packet. For this example an even result will go over one link and an odd result will go over the other. With three links the result is divided by 3, with the remainder determining which interface to use. This aggregation would be used when there are a lot of connections (there is one connection per stream) or a lot of media servers or both. This is the mode of choice for Data Domain, but some switches do not support this type of hashing.

For DDOS version 5.0 and later with slot based port naming:

    net aggregate add veth0 mode lacp hash xor-L3L4 interfaces eth4a eth4b

The aggregation used with this command will be lacp-xor. The packets are distributed across eth2 and eth3 (or eth4a and eth4b for later releases with slot based naming) based on the XOR of the source IP address, destination IP address, source port number, and destination port number. The data flow control follows the same mechanism used by balanced mode, except that it adds a control protocol to monitor the interfaces. This provides a minimal amount of automated administration of the interfaces, including better sensing of an interface failure. The sensing goes beyond the sensing of carrier loss to the sensing of the ability to send and receive data. The heartbeat can be sent out every second or every 30 seconds. The default is every 30 seconds. The interval determines how fast the bonding will sense that the link is no longer communicating and will stop using the interface.
Once every 30 seconds is less invasive, but it will take longer to mark the link as down and there may be connection timeouts while it is waiting.

    net failover add veth0 interfaces eth2 eth3

This is not aggregation, but the command will group together interfaces eth2 and eth3 (or eth4a and eth4b for later releases with slot based naming) for failover. There is only one failover type supported. If the active physical link goes away (determined by loss of carrier) the data is sent to the second physical link. The active interface is determined by which link comes up first when it is set up. This is nondeterministic. It is dependent on several factors such as switch activity, network activity, and which interface is brought up first when they are enabled. The active one can be fixed by specifying one of the links as primary. The primary interface will always be set as active if it can be UP and RUNNING.

A down time and an up time can also be specified. The time is set to the multiple of 0.9 seconds that is nearest to but not greater than the value specified. For example, if 1.5 seconds is specified the actual value used is 0.9 seconds. For the command, the value is given in milliseconds, so 1.5 seconds would be 1500 and the value used for this is 900. The up and down timer values are also used in the net aggregate commands.

The new L2L3 hash, which was released in DDOS 5.0, uses a combination of the source and destination MAC and IP addresses to determine the interface on which to send the packets. This gives more flexibility in getting the data to be better aggregated.
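The timer rounding and the active link selection described above can be sketched as follows. The function names are hypothetical and this is not DDOS code. The text does not say what happens for a requested value below 900 ms; this sketch simply returns 0 in that case, which is an assumption.

```python
STEP_MS = 900  # timers are multiples of 0.9 seconds, per the text


def effective_delay_ms(requested_ms):
    """Round down to the nearest multiple of 900 ms not above the request."""
    if requested_ms < 0:
        raise ValueError("delay must not be negative")
    return (requested_ms // STEP_MS) * STEP_MS


def choose_active(links_up_in_order, primary=None):
    """Pick the active failover link.

    links_up_in_order: names of links that are UP and RUNNING, in the
    order they came up. A usable primary always wins; otherwise the
    first link to come up is used.
    """
    if primary is not None and primary in links_up_in_order:
        return primary
    return links_up_in_order[0] if links_up_in_order else None
```

For example, a requested value of 1500 ms yields 900 ms, matching the example in the text, and a primary of eth3 is chosen even if eth2 came up first.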

Functions available in Linux distribution


The following is a summary of the aggregation and failover modes and the hashing used in Linux. If the client/Media Server is a Linux system then these are the options you will encounter when setting up the client. A more complete description can be found in Documentation/networking/bonding.txt in the Linux distribution:

Mode Options
1. balance-rr or BOND_MODE_ROUNDROBIN (0) - known as the roundrobin mode on the DDR
   Aggregation using round robin
   Failover with degradation
   Normally a good type to use with direct connect or something equivalent
   To get fully matching aggregation both ends of the link need to be set up to use round robin
2. active-backup or BOND_MODE_ACTIVEBACKUP (1) - known as failover on the DDR
   Failover method used by Data Domain
   Works only when one or more standby links are in the group
   There is one active link and all others in the group are standby
   The active link is non-deterministic unless a primary is specified
3. balance-xor or BOND_MODE_XOR (2) - known as the balanced mode on the DDR
   Sends each transmit to a specific NIC based on the specified hash method
   Default: (source MAC address XOR destination MAC address) modulo size of aggregation group
   Note: this only aggregates transmissions. The receive needs to be aggregated on the other end
   This mode is referred to as static because of the manual setup that is needed.
4. 802.3ad or BOND_MODE_LACP (4) - known as the lacp mode on the DDR
   Sends each transmit to a specific NIC based on the specified hash method
   Default: (source MAC address XOR destination MAC address) modulo size of aggregation group
   Note: this aggregates transmissions and actively checks the aggregated interfaces. The receive needs to be aggregated on the other end
   This configuration is initially set manually, but is maintained automatically.

Hash method used:


1. Layer 2
   Uses (source MAC XOR destination MAC) modulo count of links in aggregation group
   This works best when there are many hosts and they are connected to the same switch
   All packets to a specific MAC address go through the same link
2. Layer 2+3
   Uses (((source MAC XOR dest MAC) AND 0xffff) XOR ((source IP XOR dest IP) AND 0xffff)) modulo count of links in aggregation group
   This works best with many IP addresses and/or many media servers
   This can work with as few as one media server if multiple addresses are used
   This hash works best with multiple connections with different IP addresses. If the number of clients is limited then IP aliases or VLANs can be used to allow multiple addresses to generate multiple connections over the same two interfaces.
3. Layer 3+4
   Uses ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo count of links in aggregation group
   This works best with many connections and/or many media servers
   This can work with as few as one media server
   For packets that do not include the port number, such as IP fragments and non-TCP, non-UDP packets, this method uses the IP addresses only. For non-IP packets the Layer 2 method is used. It is because of these exceptions that this hash is not IEEE compliant. Note that the Data Domain network configuration is set up so that packets are not fragmented, and it almost never uses UDP.

The aggregation method used is very important to getting the desired performance. In general the aggregation of choice is mode lacp hash xor-L3L4 (src-dst-port for Cisco load balance) if the DDOS version supports it, or mode xor-L3L4 otherwise, along with many streams. The case for lacp is strengthened by its improved failover ability. If the DDR and media servers are directly connected and there are enough links to do aggregation then mode roundrobin may work best. There are some switches that do not support port number hashing. In this case src-dst-port on the switch will not work.
Consider also that the best aggregation may be to have each media server use a different link instead of grouping the links together. Consider the following example: there are four media servers, each media server is sending data at the same time, and there are 4 links available on the DDR.

Assign a different IP address to each link and set up each media server to send data to one unique IP address on the DDR. That way the throughput will approach 4 times a single link speed, versus around 2.5 times if aggregation is used. This is very dependent on the expected traffic pattern from the media servers.
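The three hash methods listed in this section can be sketched in a few lines of Python to show how a packet's addresses and ports select a link. This is only an illustration of the formulas as written above, not the bonding driver's actual code; the function names, the addresses, and the port numbers are made up for the example.

```python
import ipaddress


def mac_to_int(mac):
    """Convert a MAC address such as 00:11:22:33:44:55 to an integer."""
    return int(mac.replace(":", ""), 16)


def ip_to_int(ip):
    """Convert a dotted IPv4 address to an integer."""
    return int(ipaddress.IPv4Address(ip))


def hash_l2(src_mac, dst_mac, links):
    """Layer 2: (source MAC XOR destination MAC) modulo link count."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % links


def hash_l2l3(src_mac, dst_mac, src_ip, dst_ip, links):
    """Layer 2+3: low 16 bits of the MAC XOR combined with the IP XOR."""
    mac_part = (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) & 0xFFFF
    ip_part = (ip_to_int(src_ip) ^ ip_to_int(dst_ip)) & 0xFFFF
    return (mac_part ^ ip_part) % links


def hash_l3l4(src_ip, dst_ip, src_port, dst_port, links):
    """Layer 3+4: port XOR combined with the low 16 bits of the IP XOR."""
    ip_part = (ip_to_int(src_ip) ^ ip_to_int(dst_ip)) & 0xFFFF
    return ((src_port ^ dst_port) ^ ip_part) % links


# Two streams between the same pair of hosts, differing only in source port.
for port in (40000, 40001):
    print(port, hash_l3l4("10.0.0.5", "10.0.0.9", port, 2049, links=2))
```

With Layer 3+4 the two streams in the usage example land on different links even though the hosts are the same, while the Layer 2 hash would send every packet between one pair of MAC addresses over a single link.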

Link failures
A link can fail at several places: in the driver, the wire, the switch, the router, or the remote system. For failover to work the program (the bonding module in the Data Domain case) must be able to determine that a link to the other side is down. This information is normally provided by the hardware driver. As a simple case, consider a direct connect where the wire is disconnected. The driver senses that the carrier is down and reports this to the bonding module. The bonding module marks the link as down after the "down" time has expired and switches to a different link. The bonding module continues to monitor the link, and when the link has been back up for the "up" time it marks the link as up. If the restored link is marked as the primary, the data is switched back to that link. Otherwise the data flow stays on the current link.

Note: the failover method that is currently supported is for directly attached hardware. The driver can sense when the directly attached link is no longer functioning, but beyond that it gets harder. Consider the case where the failover is between two different switches with no routing. Can the driver determine that the connection to the remote system has failed and that it therefore needs to switch to the backup link going through the other switch? This is possible if the switch provides link fault signaling similar to what is defined in IEEE 802.3ae. This is supported by the Fujitsu 10GbE switch, and something similar is supported by Cisco. It applies only to a rather limited network topology in which the systems are directly connected via switches and no other routes are available. This would be an extension of the direct connect to the media server. Currently the driver and the bonding module do not support link fault signaling because it is not widely available and it applies to too limited a network topology.

There are two types of failover. One is failover to a standby interface. The standby interface is not used until a failure happens and the traffic is redirected to the standby link. This is a waste of resources if there is never a failover. This is the method used by Data Domain when the bonding method failover is specified:
net add failover veth1 interfaces eth3 eth5

Another type of failover is failover with degradation. In this method there is no standby. All the links in the group are in use. If there is a failure, the failed link is removed and the network traffic from that link is redirected to the other links in the group. This is the failover associated with link aggregation, but it can become complex if the bonding driver has to determine that a path to the target system no longer exists so that it does not send data to that link. There is also the question of how bad the failure is. The carrier may stay up while the data still fails to get transferred, or the data may have too many CRC or checksum errors to be effective. The lacp mode (available in DDOS 5.0 for 1 Gb and DDOS 5.1 for 10 Gb) is the only mode that can determine this and mark the interface as down.

Note that it is not up to the DDR to detect and adjust the network flow for a failure beyond the switch or router. Coming up with an alternative path is a network/switch/router problem. The DDR only senses the network between its local interface and the switch or system it is connected to.
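The standby failover behavior described in this section (carrier loss held for the "down" time, recovery held for the "up" time, and preference for a primary) can be sketched as a small state model. This is a simplified illustration under stated assumptions, not the bonding module's implementation: the class names, the polling approach, and the default of 0.9 seconds for both timers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    carrier: bool = True      # last carrier state reported by the driver
    usable: bool = True       # state as seen by the failover logic
    changed_at: float = 0.0   # time of the last carrier change


class FailoverGroup:
    """Hypothetical model of standby failover with up/down timers."""

    def __init__(self, links, primary=None, down_s=0.9, up_s=0.9):
        self.links = {link.name: link for link in links}
        self.primary = primary
        self.down_s = down_s
        self.up_s = up_s
        self.active = primary if primary else links[0].name

    def report_carrier(self, name, carrier, now):
        """Record a carrier change reported by the driver."""
        link = self.links[name]
        if link.carrier != carrier:
            link.carrier = carrier
            link.changed_at = now

    def poll(self, now):
        """Apply the timers, then pick the active link."""
        for link in self.links.values():
            held = now - link.changed_at
            if link.usable and not link.carrier and held >= self.down_s:
                link.usable = False
            elif not link.usable and link.carrier and held >= self.up_s:
                link.usable = True
        if self.primary and self.links[self.primary].usable:
            self.active = self.primary        # primary wins whenever it is up
        elif not self.links[self.active].usable:
            for link in self.links.values():  # fail over to any usable link
                if link.usable:
                    self.active = link.name
                    break
        return self.active


group = FailoverGroup([Link("eth2"), Link("eth3")], primary="eth2")
group.report_carrier("eth2", False, now=1.0)
print(group.poll(now=1.5), group.poll(now=2.0))
```

In the usage example the active link stays on eth2 until the carrier loss has lasted for the down time, and only then moves to eth3. Without a primary, the model leaves traffic on the link it failed over to after the original link recovers, as the text describes.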

Other Link Aggregation


The link aggregation used depends on the network equipment the DDR is connected to and on the network topology. The equipment connected to the DDR could be a switch, a router, or the target system, so it is important to understand what aggregation is provided by other systems. Most switches and routers support LACP link aggregation (IEEE 802.3ad standard). Some offer proprietary aggregation types. If they offer aggregation they support the XOR of Layer 2 to define which packet goes to which port.

Cisco
Some of the older Cisco switches and routers only support the older proprietary protocol, PAgP. The Data Domain system does not support this type of aggregation. Fortunately, the newer switches and routers support the IEEE 802.3ad standard. When using Cisco switches and routers, IEEE 802.3ad should be used with Layer 3 and 4 hashing. It may be possible in some cases to set the aggregation with PAgP to round robin, but that is not currently supported for the DDR when connected to a switch or a router because of throughput delays from potential packet ordering issues. At high speeds with fast retransmissions, out-of-order packets can generate many more packets, which decreases the overall performance.

Nortel
Nortel supports an aggregation called Split Multi-Link Trunking, which uses LACP_AUTO mode link aggregation.

Sun
The initial release of Solaris 10 and earlier releases supported Sun Trunking. Later releases of Solaris 10 and beyond support the IEEE 802.3ad standard in communicating with switches. Back-to-back link aggregation is supported, in which two systems are directly connected over multiple ports. The balancing of the load can be done with L2 (MAC address), L3 (IP address), L4 (TCP port number), or any combination of these. Note that the DDR currently only supports L2 or L3+L4. Link aggregation can run in either passive mode or active mode. At least one side must be in active mode. The DDR always uses active mode. Sun Trunking supports a round robin type of aggregation. This type of aggregation could be used if the DDR is connected directly to a Sun system.
For more information on Sun Aggregation refer to the following:
http://docs.sun.com/app/docs/doc/816-4554/fpjvl?l=en&q=%22link+aggregation%22&a=view
For more information on Sun Trunking refer to the following:
http://docs.sun.com/source/817-3374-11/preface.html

Windows
Microsoft's view of link aggregation is that it is a switch problem or a hardware problem, so it should be handled by the switch/router and the NIC. There is nothing in the OS that directly supports it. If the customer wants it they should get NICs that support it and either use a special driver to initiate it or use the switch to drive it. In the current documentation for their Server 2008 they refer to the support of PAgP, an old proprietary Cisco aggregation protocol:
http://blogs.technet.com/winserverperformance/
They also refer to Receive-Side Scaling (RSS):
http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
This refers to a way to allocate a program to handle packets across NICs, which are normally tied to specific CPUs. There are drivers from outside of Microsoft that provide at least passive IEEE 802.3ad support, if not active. Passive support means that the Windows system will respond to IEEE 802.3ad protocol packets but will not generate them. For direct connect this may be the only way to have a directly connected aggregated link. The following link provides Microsoft's view of servers for 2008:
http://technet2.microsoft.com/windowsserver2008/en/library/59e1e955-3159-41a1-b8fd-047defcbd3f41033.mspx?mfr=true
If the Windows server is not directly connected then it is not important to the DDR system if or how link aggregation is provided by Windows. That would be between the Windows server and the switch/router. More specific information on which NICs support link aggregation is still TBD.

AIX
AIX supports the EtherChannel and IEEE 802.3ad types of link aggregation, as mentioned in the RSCT administration guide for AIX and Linux:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.rsct.doc/rsct_aix5l53/bl5adm05/bl5adm0559.html

With the DDR, the round robin available through EtherChannel can be used when directly connected. IEEE 802.3ad can be used if Layer 4 hashing is included. If the system is not directly connected then the aggregation depends on the switch or router being used. AIX uses a variant of EtherChannel for backup, referred to as EtherChannel backup. This is similar to the active-backup mode supported by the Linux bonding driver and does not need any handshake from the equipment connected to the links, except to have multiple links available.

HPUX
The link aggregation product is referred to as HP Auto Port Aggregation (APA). As with Linux bonding, this product provides either a full standby failover or a degradation failover by overloading other links within an aggregation group. The aggregation can use Layer 2, Layer 3, and/or Layer 4 hashing for aggregating across the links. It also supports the IEEE 802.3ad standard. A summary of the product is given here:
http://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=J4240AA
The administration guide can be found here:
http://docs.hp.com/en/J4240-90039/index.html
According to the administration guide, direct connect server to server is supported, but the round robin type of aggregation does not seem to be. This is further brought out in figure 3-4 of that document, where for direct connect it is recommended to have many connections for load balancing to be effective. With round robin, multiple connections are not required for effective aggregation. With this understanding, HPUX systems would not support round robin with a directly connected system.

Data Domain Link Aggregation and Failover in the Customers Environment


The history of the use of failover and link aggregation in the Data Domain products is as follows:
1. Failover was added in release 4.3
2. Link aggregation was added in release 4.4
3. Source routing was set as the default in release 4.5 to allow separate NICs to reside on the same network with the response packets sent to the correct route. This means the following settings are the default:
   net option set net.ipv4.route.src_ip_routing 1
   net option set net.ipv4.route.flush 2
4. Since July 2008 a primary can be specified for failover. If the primary is up it will always be the active link.
5. Since the July 2008 releases of 4.4.4, 4.5.2, and beyond the on-board NICs can be included in link aggregation.
6. Since release 5.0 the 1 Gb interfaces can be added to lacp bonding mode along with rate setting
7. Since release 5.0 there are new command formats for net failover and net aggregate
8. Since release 5.0 the hash value of xor-L2L3 can be used
9. Since release 5.0 the user can specify the down and up time for sensing failover and the recovery of a failed interface
10. Since release 5.1 the 10 Gb interfaces can be added to lacp bonding mode
11. Since release 5.1 aggregated virtual interfaces can be added as slaves to failover

Special Link Aggregation


Normally when the DDR has aggregated links they are connected to one switch. There is a case where the aggregated links go to different switches. This is done using Cisco's virtual switch technology. Normally the aggregation used is lacp and the switches are interconnected to share information. Cisco uses this to provide good data flow along with a failover ability. The packet routing depends on the best path, the least used path, and the concern with packet ordering. If one of the switches goes down or becomes unavailable the packets are directed to the other.

Another case is where the switches operate at a physical level and data is passed directly through to the remote system. This acts like a direct connect even though the systems can go over a WAN. The following diagram illustrates this.

[Diagram labels: Data Domain Appliance, eth2, eth3, Network switch B, Backup/media servers, en4, en5]

In this case the aggregation is passed through to the remote system, and it provides not only increased throughput but also failover. Normally lacp is used in this case to provide a heartbeat failover. The challenge with this is which transmit hash to use. Multiple connections are used along with multiple IP addresses, either from VLANs or IP aliases. If one of the WAN links becomes unavailable both sides become aware of this and send everything over the other link. The main concern here is handling the latency of the WAN using the heartbeat.

Failover of NICs
With the special aggregation cases the dependency on failover is reduced. Failover is still simpler to set up and easier to maintain, but as networks get more complex the use of lacp offers better automation in handling the conditions.

Recommended Link Aggregation


The following should be considered when deciding on aggregation. If no aggregation is to be done then failover should be considered. Therefore the last choice given in each list is failover as an alternative to aggregation. Some important things to consider when choosing aggregation:

The direction of the data flow that is being aggregated. For backup the flow is from the Media Server to the DDR, so the important hash is on the switch or, if it is a pass-through, on the Media Server. For restore the main data direction is from the DDR to the Media Server, so the important hash is on the DDR.
How many Media Servers are actively doing backups simultaneously? If the number is less than 4, xor-L2 will not be very effective. As the number of aggregated links increases, the number of active clients also needs to increase if xor-L2 is to be used. With three aggregated links the number of active clients should be above 5.
What is the network topology? If the route goes through gateways, xor-L2 won't work because the destination MAC will be that of the gateway router.

Direct Connect
1. mode roundrobin (if it is supported by the media servers or the directly connected system)
2. separate NIC per media server (if there are enough NICs)
3. mode lacp if supported (balanced if not), using hash xor-L3L4
4. failover (if aggregation cannot be used)

Private Network (less than 4 active clients or route has gateways in the path)
1. separate NIC per media server (if there are enough NICs)
2. mode lacp if supported (balanced if not), using hash xor-L3L4
3. mode lacp if supported (balanced if not), using hash xor-L2 (if there are a suitable number of clients)
4. failover (if aggregation cannot be used)

Private Network (more than 4 active clients and the network path has no gateways)
1. mode lacp if supported (balanced if not), using hash xor-L2
2. mode lacp if supported (balanced if not), using hash xor-L3L4
3. separate NIC per media server (if there are enough NICs)
4. failover (if aggregation cannot be used)

Local Network (less than 4 active clients or route has gateways in the path)
1. separate NIC per media server (if there are enough NICs)
2. mode lacp if supported (balanced if not), using hash xor-L3L4
3. mode lacp if supported (balanced if not), using hash xor-L2 (if there are a suitable number of active clients)
4. failover (if aggregation cannot be used)

Local Network (more than 4 active clients and the network path has no gateways)
1. mode lacp if supported (balanced if not), using hash xor-L2
2. mode lacp if supported (balanced if not), using hash xor-L3L4
3. separate NIC per media server (if there are enough NICs)
4. failover (if aggregation cannot be used)

Remote Network (normally through gateways and routers)
1. separate NIC per media server (if there are enough NICs)
2. mode lacp if supported (balanced if not), using hash xor-L3L4
3. failover (if aggregation cannot be used)

Note: hash xor-L2L3 can be substituted for xor-L2.
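The ordered choices above can be condensed into a small lookup, shown here as a sketch. The function and constant names are hypothetical. The lists do not say what to do with exactly 4 active clients, so treating that as the "few clients" case is an assumption made for this example.

```python
ROUND_ROBIN = "mode roundrobin (if supported by the directly connected system)"
SEPARATE_NIC = "separate NIC per media server (if there are enough NICs)"
LACP_L3L4 = "mode lacp if supported (balanced if not), hash xor-L3L4"
LACP_L2 = "mode lacp if supported (balanced if not), hash xor-L2"
FAILOVER = "failover (if aggregation cannot be used)"


def recommend(topology, active_clients, gateways_in_path=False):
    """Return the ordered choices for a topology, most preferred first."""
    if topology == "direct":
        return [ROUND_ROBIN, SEPARATE_NIC, LACP_L3L4, FAILOVER]
    if topology == "remote":
        return [SEPARATE_NIC, LACP_L3L4, FAILOVER]
    if topology in ("private", "local"):
        # Assumption: exactly 4 clients is treated as the "few clients" case.
        if active_clients > 4 and not gateways_in_path:
            return [LACP_L2, LACP_L3L4, SEPARATE_NIC, FAILOVER]
        return [SEPARATE_NIC, LACP_L3L4, LACP_L2, FAILOVER]
    raise ValueError("unknown topology: " + topology)


# Example: a local network with eight active clients and no gateways.
for choice in recommend("local", active_clients=8):
    print(choice)
```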

Switch information
Link aggregation is set up on both sides of a link. The aggregation does not necessarily have to match on both sides of the link. For example, the DDR may be set to xor-L3L4 while the switch is set to src-ip. A good rule of thumb is to keep the aggregations close, such as xor-L3L4 on the DDR and src-dst-port on the switch. The reason for this is that if an aggregation is good enough for one direction it is good enough for the other direction. Aggregation on the switch is used to distribute the traffic being received by the DDR. If the main operation being done is backup, the switch aggregation is very important, because backup network traffic is mostly data being received by the DDR. Because of the limited number of clients communicating with the DDR the recommended aggregation method is balance-xor with Layer 3+4 hashing. To support this, the device directly connected to the DDR, e.g. a switch or router (see the Normal Link Aggregation section), needs to support src-dst-port or at least src-port load balancing.

This section uses the vendors' documentation to list switches that may work with Layer 3+4 hashing and also some that may not. There are no plans to validate or certify these. The final authority on whether a switch supports the desired aggregation is to physically try it. For example, there is at least one case where round robin was desired and tried, and it worked satisfactorily even though it is listed as not supported. Note again that even though round robin may be supported by a switch, the aggregation performance is poor or even worse than not having it. This is mostly due to out-of-order packets.

Note: There are few switches that support layer 3 + 4 aggregation. The supported aggregation may be for layer 3 only or layer 4 only. Matching layer 4 (port aggregation) with layer 3 + 4 (IP address and port aggregation) is not a problem. Be aware that it may cause data to be sent on one link and received on a different link, but out-of-order packets should not occur. Which link the data is sent on is not important as long as all the data associated with a connection is sent on the same link.

Definitions:
Dest := Destination

IP := IP address
L4 := Layer 4 of the network stack, i.e. TCP
MAC := MAC or hardware address
Port := TCP port number
Src := Source
SW := software

Switch brand & model           SW Release  Src MAC  Dest MAC  SrcDest MAC  Src IP  Dest IP  SrcDest IP  Src L4 Port  Dest L4 Port  SrcDest L4 Port  Round Robin
Cisco Catalyst 6500 CatOS      8.6         Yes      Yes       Yes          Yes     Yes      Yes         Yes          Yes           Yes              No
Cisco Catalyst 6500 IOS        12.2SXF     Yes      Yes       Yes          Yes     Yes      Yes         Yes          Yes           Yes              No
Cisco Catalyst 3560            12.2(44)SE  Yes      Yes       Yes          Yes     Yes      Yes         No           No            No               No
Cisco Catalyst 2960            12.2(44)SE  Yes      Yes       Yes          Yes     Yes      Yes         No           No            No               No
Cisco Catalyst 3750            12.2(44)SE  Yes      Yes       Yes          Yes     Yes      Yes         No           No            No               No
Cisco Catalyst 4500/4948/4924  12.2(37)SG  Yes      Yes       Yes          Yes     Yes      Yes         Yes          Yes           Yes              No

For directly connected systems the support for round robin is as follows:
Sun - yes
AIX - yes, it can
HPUX - no
Windows - maybe; it depends on the NIC software, but don't count on it.

Cisco Configuration
Set the etherchannel mode to on: manually set the ports to participate in the channel group.

DDR Configuration    Cisco Load Balance Configuration
xor-l3l4             src-dst-port
xor-l2               src-dst-mac