
DARD: Distributed Adaptive Routing for Datacenter Networks

Xin Wu Xiaowei Yang
Dept. of Computer Science, Duke University
Duke-CS-TR-2011-01

ABSTRACT
Datacenter networks typically have many paths connecting each host pair to achieve high bisection bandwidth for arbitrary communication patterns. Fully utilizing the bisection bandwidth may require flows between the same source and destination pair to take different paths to avoid hot spots. However, existing routing protocols have little support for load-sensitive adaptive routing. We propose DARD, a Distributed Adaptive Routing architecture for Datacenter networks. DARD allows each end host to move traffic from overloaded paths to underloaded paths without central coordination. We use an OpenFlow implementation and simulations to show that DARD can effectively use a datacenter network's bisection bandwidth under both static and dynamic traffic patterns. It outperforms previous solutions based on random path selection by 10%, and performs similarly to previous work that assigns flows to paths using a centralized controller. We use competitive game theory to show that DARD's path selection algorithm makes progress in every step and converges to a Nash equilibrium in finite steps. Our evaluation results suggest that DARD can achieve a close-to-optimal solution in practice.

1. INTRODUCTION

Datacenter network applications, e.g., MapReduce and network storage, often demand high intra-cluster bandwidth [11] to transfer data among distributed components. This is because the components of an application cannot always be placed on machines close to each other (e.g., within a rack) for two main reasons. First, applications may share common services provided by the datacenter network, e.g., DNS, search, and storage. These services are not necessarily placed in nearby machines. Second, the auto-scaling feature offered by a datacenter network [1, 5] allows an application to create dynamic instances when its workload increases. Where those instances will be placed depends on machine availability, and is not guaranteed to be close to the application’s other instances. Therefore, it is important for a datacenter network to have high bisection bandwidth to avoid hot spots between any pair of hosts. To achieve this goal, today’s datacenter networks often use commodity Ethernet switches to form multi-rooted tree topologies [21] (e.g., fat-tree [10] or Clos topology [16]) that have multiple equal-cost paths connecting any host pair. A flow (a flow refers to a TCP connection in this paper) can

use an alternative path if one path is overloaded. However, legacy transport protocols such as TCP lack the ability to dynamically select paths based on traffic load. To overcome this limitation, researchers have advocated a variety of dynamic path selection mechanisms to take advantage of the multiple paths connecting any host pair. At a high level, these mechanisms fall into two categories: centralized dynamic path selection, and distributed traffic-oblivious load balancing. A representative example of centralized path selection is Hedera [11], which uses a central controller to compute an optimal flow-to-path assignment based on dynamic traffic load. Equal-Cost-Multi-Path forwarding (ECMP) [19] and VL2 [16] are examples of traffic-oblivious load balancing. With ECMP, routers hash flows based on flow identifiers to multiple equal-cost next hops. VL2 [16] uses edge switches to forward a flow to a randomly selected core switch to achieve valiant load balancing [23].

Each of these two design paradigms has merit and improves the available bisection bandwidth between a host pair in a datacenter network. Yet each has its limitations. A centralized path selection approach introduces a potential scaling bottleneck and a centralized point of failure. When a datacenter scales to a large size, the control traffic sent to and from the controller may congest the link that connects the controller to the rest of the datacenter network. Distributed traffic-oblivious load balancing scales well to large datacenter networks, but may create hot spots, as its flow assignment algorithms do not consider dynamic traffic load on each path.

In this paper, we aim to explore the design space that uses end-to-end distributed load-sensitive path selection to fully use a datacenter's bisection bandwidth. This design paradigm has a number of advantages. First, placing the path selection logic at an end system rather than inside a switch facilitates deployment, as it does not require special hardware or replacing commodity switches. One can also upgrade or extend the path selection logic later by applying software patches rather than upgrading switching hardware. Second, a distributed design can be more robust and scale better than a centralized approach.

This paper presents DARD, a lightweight, distributed, end-system-based path selection system for datacenter networks. DARD's design goal is to fully utilize bisection bandwidth and dynamically balance the traffic among the multiple paths between any host pair. A key design challenge DARD faces is how to achieve dynamic distributed load balancing.

Unlike in a centralized approach, where the controller calculates a path arrangement and periodically updates switches' routing tables, with DARD no end system or router has a global view of the network. Each end system can only select a path based on its local knowledge, thereby making close-to-optimal load balancing a challenging problem. To address this challenge, DARD uses hierarchical addressing to represent an end-to-end path with a pair of source and destination addresses, as in [27]. Thus, an end system can switch paths by switching addresses. To facilitate path selection, DARD uses a selfish path selection algorithm that provably converges to a Nash equilibrium in finite steps (Appendix B). Our experimental evaluation shows that the equilibrium's gap to the optimal solution is small.

We have implemented a DARD prototype on DeterLab [6] and a ns-2 simulator. We use static traffic patterns to show that DARD converges to a stable state in two to three control intervals. We use dynamic traffic patterns to show that DARD outperforms ECMP, VL2 and TeXCP, and that its performance gap to the centralized scheduling is small. Under dynamic traffic patterns, DARD maintains stable link utilization, and about 90% of the flows change their paths fewer than 4 times in their life cycles. Evaluation results also show that the bandwidth taken by DARD's control traffic is bounded by the size of the topology. DARD is a scalable and stable end-host-based approach to load-balance datacenter traffic. We make every effort to leverage existing infrastructures and to make DARD practically deployable.

The rest of this paper is organized as follows. Section 2 introduces background knowledge and discusses related work. Section 3 describes DARD's design goals and system components. In Section 4, we introduce the system implementation details. We evaluate DARD in Section 5. Section 6 concludes our work.

2. BACKGROUND AND RELATED WORK

In this section, we first briefly introduce what a datacenter network looks like and then discuss related work.

2.1 Datacenter Topologies

Recent proposals [10, 16, 24] suggest using multi-rooted tree topologies to build datacenter networks. Figure 1 shows a 3-stage multi-rooted tree topology. The topology has three vertical layers: Top-of-Rack (ToR), aggregation, and core. A pod is a management unit. It represents a replicable building block consisting of a number of servers and switches sharing the same power and management infrastructure.

Figure 1: A multi-rooted tree topology for a datacenter network.

An important design parameter of a datacenter network is the oversubscription ratio at each layer of the hierarchy, which is computed as a layer's downstream bandwidth divided by its upstream bandwidth; e.g., the aggregation layer's oversubscription ratio is defined as BW_down / BW_up, as shown in Figure 1. The oversubscription ratio is usually designed to be larger than one, assuming that not all downstream devices will be active concurrently.

Figure 2 shows a fat-tree topology example. A p-pod fat-tree topology (in Figure 2, p = 4) has p pods in the horizontal direction. It uses 5p^2/4 p-port switches and supports non-blocking communication among p^3/4 end hosts. A pair of end hosts in different pods have p^2/4 equal-cost paths connecting them. Once the two end hosts choose a core switch as the intermediate node, the path between them is uniquely determined.

Figure 2: A 4-pod fat-tree topology.

We design DARD to work for arbitrary multi-rooted tree topologies, but for ease of exposition, we mostly use the fat-tree topology to illustrate DARD's design, unless otherwise noted. In this paper, we use the term elephant flow to refer to a continuous TCP connection longer than a threshold defined in the number of transferred bytes. We discuss how to choose this threshold in § 3.
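The fat-tree sizing above reduces to simple arithmetic. The following short sketch is an illustration we added (not part of DARD); it computes the switch, host, and inter-pod path counts for a p-pod fat-tree and reproduces the p = 4 example from Figure 2.

    def fattree_size(p):
        """Counts for a p-pod fat-tree, following the formulas above:
        5p^2/4 p-port switches, p^3/4 hosts, and p^2/4 equal-cost
        paths between two hosts in different pods."""
        assert p % 2 == 0, "a fat-tree pod count is even"
        return {
            "switches": 5 * p * p // 4,     # p^2/4 core + p^2/2 aggregation + p^2/2 ToR
            "hosts": p ** 3 // 4,           # p/2 hosts per ToR, p/2 ToRs per pod, p pods
            "inter_pod_paths": p * p // 4,  # one path per core switch
        }

    # For the 4-pod example in Figure 2:
    print(fattree_size(4))  # {'switches': 20, 'hosts': 16, 'inter_pod_paths': 4}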

2.2 Related Work

Related work falls into three broad categories: adaptive path selection mechanisms, end-host-based multipath transmission, and traffic engineering protocols.

Adaptive path selection. Adaptive path selection mechanisms [11, 16, 19] can be further divided into centralized and distributed approaches. Hedera [11] adopts a centralized approach at the granularity of a flow. In Hedera, edge switches detect and report elephant flows to a centralized controller, which computes a path arrangement and periodically updates switches' routing tables. Hedera can almost fully utilize a network's bisection bandwidth, but a recent data center traffic measurement suggests that this centralized approach needs parallelism and fast route computation to support dynamic traffic patterns [12].

Equal-Cost-Multi-Path forwarding (ECMP) [19] is a distributed flow-level path selection approach. An ECMP-enabled switch is configured with multiple next hops for a given destination and forwards a packet according to a hash of selected fields of the packet header. It can split traffic to each destination across multiple paths. Since packets of the same flow share the same hash value, they take the same path from the source to the destination and maintain the packet order. However, multiple large flows can collide on their hash values and congest an output port [11].

VL2 [16] is another distributed path selection mechanism. Different from ECMP, in VL2 an edge switch first forwards a flow to a randomly selected core switch, which then forwards the flow to the destination. As a result, multiple elephant flows can still collide on the same output port, as in ECMP.

DARD belongs to the distributed adaptive path selection family. It differs from ECMP and VL2 in two key aspects. First, its path selection algorithm is load sensitive: if multiple elephant flows collide on the same path, the algorithm will shift flows from the collided path to more lightly loaded paths. Second, it places the path selection logic at end systems rather than at switches to facilitate deployment.

Multi-path Transport Protocol. A different design paradigm, multipath TCP (MPTCP) [26], enables an end system to simultaneously use multiple paths to improve TCP throughput. However, different from DARD, it requires applications to use MPTCP rather than legacy TCP to take advantage of underutilized paths. In contrast, DARD is transparent to applications as it is implemented as a path selection module under the transport layer (§ 3). Therefore, legacy applications need not upgrade to take advantage of multiple paths in a datacenter network.

Traffic Engineering Protocols. Traffic engineering protocols such as TeXCP [20] are originally designed to balance traffic in an ISP network, but can be adopted by datacenter networks. However, because these protocols are not end-to-end solutions, they also place the path selection logic at switches and therefore require upgrading switches. In addition, they forward traffic along different paths at the granularity of a packet rather than a TCP flow. Given the large bisection bandwidth in datacenter networks, this can cause TCP packet reordering, harming a TCP flow's performance.

3. DARD DESIGN

In this section, we describe DARD's design. We first highlight the system design goals. Then we present an overview of the system. We present more design details in the following sub-sections.

3.1 Design Goals

DARD's essential goal is to effectively utilize a datacenter's bisection bandwidth with practically deployable mechanisms and limited control overhead. We elaborate the design goals in more detail.

1. Efficiently utilizing the bisection bandwidth. We aim to take advantage of the multiple paths connecting each host pair and fully utilize the available bandwidth. Meanwhile, we desire to prevent any systematic design risk that may cause packet reordering and decrease the system goodput.

2. Fairness among elephant flows. We aim to provide fairness among elephant flows so that concurrent elephant flows can evenly share the available bisection bandwidth. We focus our work on elephant flows for two reasons. First, existing work shows ECMP and VL2 already perform well on scheduling a large number of short flows [16]. Second, elephant flows occupy a significant fraction of the total bandwidth (more than 90% of bytes are in the 1% of the largest flows [16]).

3. Lightweight and scalable. We aim to design a lightweight and scalable system. We desire to avoid a centralized scaling bottleneck and to minimize the amount of control traffic and computation needed to fully utilize bisection bandwidth.

4. Practically deployable. We aim to make DARD compatible with existing datacenter infrastructures so that it can be deployed without significant modifications or upgrades of existing infrastructures.

3.2 Overview

In this section, we present an overview of the DARD design. DARD uses three key mechanisms to meet the above system design goals. First, DARD places the path selection logic at an end system to facilitate practical deployment; a datacenter network can deploy DARD by upgrading its end systems' software stack rather than updating switches. Second, it uses hierarchical addressing to facilitate efficient path selection (§ 3.3): each end system can use a pair of source and destination addresses to represent an end-to-end path, and vary paths by varying addresses. Third, it uses a lightweight, distributed, end-system-based path selection algorithm to move flows from overloaded paths to underloaded paths to improve efficiency and prevent hot spots (§ 3.5); a path selection module running at an end system monitors path state and switches paths according to path load.

Figure 3 shows DARD's system components and how it works. Since we choose to place the path selection logic at an end system, a switch in DARD has only two functions: (1) it forwards packets to the next hop according to a pre-configured routing table; (2) it keeps track of the Switch State (SS, defined in § 3.4) and replies to end systems' Switch State Requests (SSR). Our design implements these functions using the OpenFlow protocol. It only requires that a switch support the OpenFlow protocol, and such switches are commercially available [?].
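To illustrate the switch side of this exchange, here is a minimal Python sketch of a switch assembling its Switch State, i.e., the per-output-link (capacity, elephant count, fair share) triples defined in § 3.4, in response to a Switch State Request. The data layout and function names are our own illustration, not DARD's actual OpenFlow message format.

    def switch_state(port_capacity, port_elephants):
        """Build SS_r: one (C_l, N_l, S_l) triple per output port.
        port_capacity: {port: C_l in Mbps}; port_elephants: {port: N_l}."""
        ss = {}
        for port, capacity in port_capacity.items():
            n = port_elephants.get(port, 0)
            share = 0.0 if n == 0 else capacity / n   # S_l = C_l / N_l
            ss[port] = (capacity, n, share)
        return ss

    def handle_ssr(port_capacity, port_elephants):
        """Reply to an end host's Switch State Request with the current switch state."""
        return {"switch_state": switch_state(port_capacity, port_elephants)}

    # A 4-port switch with two elephants on port 1 and one elephant on port 3:
    print(handle_ssr({1: 1000, 2: 1000, 3: 1000, 4: 1000}, {1: 2, 3: 1}))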

Figure 3: DARD's system overview.

DARD is a distributed system running on every end host. As shown in Figure 3, an end host has three components: the Elephant Flow Detector, the Path State Monitor, and the Path Selector. The Elephant Flow Detector detects elephant flows: it monitors all the output flows and treats a flow as an elephant once its size grows beyond a threshold. We use 100KB as the threshold in our implementation, because, according to a recent study, more than 85% of flows in a datacenter are less than 100KB [16]. The Path State Monitor monitors traffic load on each path by periodically querying the switches: it sends SSR to the switches on all the paths and assembles the SS replies into the Path State (PS, defined in § 3.4), which indicates the load on each path. Based on both the path state and the detected elephant flows, the Path Selector periodically moves flows from overloaded paths to underloaded paths.

The rest of this section presents more design details of DARD, including how to use hierarchical addressing to select paths at an end system (§ 3.3), how to actively monitor all paths' state in a scalable fashion (§ 3.4), and how to assign flows from overloaded paths to underloaded paths to improve efficiency and prevent hot spots (§ 3.5).

3.3 Addressing and Routing

To fully utilize the bisection bandwidth and, at the same time, to prevent retransmissions caused by packet reordering (Goal 1), we allow a flow to take different paths in its life cycle to reach the destination. However, one flow can use only one path at any given time. To move a flow to a different path, we can simply use different source and destination address combinations without dynamically reconfiguring the routing tables. Since we are exploring the design space of putting as much control logic as possible at the end hosts, we decided to leverage the datacenter's hierarchical structure to enable an end host to actively select paths for a flow.

A datacenter network is usually constructed as a multi-rooted tree, and there are multiple paths connecting each source and destination pair. This strictly hierarchical structure facilitates adaptive routing through some customized addressing rules [10]. We borrow the idea from NIRA [27] to split an end-to-end path into uphill and downhill segments and to encode a path in the source and destination addresses. In DARD, each of the core switches obtains a unique prefix and then allocates nonoverlapping subdivisions of the prefix to each of its sub-trees. The sub-trees recursively allocate nonoverlapping subdivisions of their prefixes to lower hierarchies. By this hierarchical prefix allocation, each network device receives multiple IP addresses, each of which represents the device's position in one of the trees.

Take Figure 4 as an example. We use core_i to refer to the ith core and aggr_ij to refer to the jth aggregation switch in the ith pod. We follow the same rule to interpret ToR_ij for the top-of-rack switches and E_ij for the end hosts. We use the device names prefixed with the letter P and delimited by periods to illustrate how prefixes are allocated along the hierarchies. As shown in Figure 4, all the switches and end hosts highlighted by the solid circles form a tree with its root core1. The first core is allocated the prefix Pcore1. It then allocates the nonoverlapping prefixes Pcore1.Paggr11 and Pcore1.Paggr21 to two of its sub-trees. The sub-tree rooted at aggr11 further allocates four prefixes to lower hierarchies. Three other similar trees exist in the same topology. For a general multi-rooted tree topology, the datacenter operators can generate a similar address assignment schema and allocate the prefixes along the topology hierarchies.

Figure 4: DARD's addressing and routing.

One nice property of this hierarchical addressing is that one host address uniquely encodes the sequence of upper-level switches that allocate that address; e.g., E11's address Pcore1.Paggr11.PToR11.PE11 uniquely encodes the address allocation sequence core1 → aggr11 → ToR11. A source and destination address pair can further uniquely identify a path; e.g., in Figure 4, we can use the source and destination pair highlighted by the dotted circles to uniquely encode the dotted path from E11 to E21 through core1. We call the partial path encoded by the source address the uphill path and the partial path encoded by the destination address the downhill path.

In case more IP addresses than network cards are assigned to each end host, we propose to use IP aliases to configure multiple IP addresses on one network interface. The latest operating systems support a large number of IP aliases per network interface; e.g., Linux kernel 2.6 sets the limit to 256K IP aliases per interface [3], and Windows NT 4.0 has no limitation on the number of IP addresses that can be bound to a network interface [4].
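As an illustration of how a source and destination address pair encodes an uphill and a downhill segment, the sketch below rebuilds the dotted-circle example from Figure 4. The prefix tuples and the decode helper are our own hypothetical notation, not DARD's wire format.

    # Hierarchical addresses written as allocation sequences, as in Figure 4.
    src = ("Pcore1", "Paggr11", "PToR11", "PE11")   # uphill: ToR11 -> aggr11 -> core1
    dst = ("Pcore1", "Paggr21", "PToR21", "PE21")   # downhill: core1 -> aggr21 -> ToR21

    def decode_path(src_addr, dst_addr):
        """Recover the switch-level path encoded by an address pair.
        The uphill segment is the reverse of the source's allocation sequence;
        the downhill segment is the destination's allocation sequence.
        Both addresses must share the same root (core) prefix."""
        assert src_addr[0] == dst_addr[0], "source and destination must share a core prefix"
        uphill = list(reversed(src_addr[:-1]))   # PToR11 -> Paggr11 -> Pcore1
        downhill = list(dst_addr[1:-1])          # Paggr21 -> PToR21
        return uphill + downhill

    print(decode_path(src, dst))
    # ['PToR11', 'Paggr11', 'Pcore1', 'Paggr21', 'PToR21']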

Each network component is also assigned a location-independent IP address, ID, which uniquely identifies the component and is used for making TCP connections. The mapping from IDs to underlying IP addresses is maintained by a DNS-like system and cached locally.

To forward a packet, each switch stores a downhill table and an uphill table, as described in [27]. The uphill table keeps the entries for the prefixes allocated from upstream switches, and the downhill table keeps the entries for the prefixes allocated to the downstream switches. Table 1 shows switch aggr11's downhill and uphill tables; the port indexes are marked in Figure 4. When a packet arrives, a switch first looks up the destination address in the downhill table using the longest prefix matching algorithm. If a match is found, the packet is forwarded to the corresponding downstream switch. Otherwise, the switch looks up the source address in the uphill table to forward the packet to the corresponding upstream switch. A core switch has only the downhill table.

Table 1: aggr11's downhill and uphill routing tables.

    Downhill table                     Uphill table
    Prefix                     Port    Prefix    Port
    Pcore1.Paggr11.PToR11      1       Pcore1    3
    Pcore1.Paggr11.PToR12      2       Pcore2    4
    Pcore2.Paggr11.PToR11      1
    Pcore2.Paggr11.PToR12      2

To deliver a packet from a source to a destination, the source encapsulates the packet with a proper source and destination address pair. Switches in the middle forward the packet according to the encapsulated packet header. When the packet arrives at the destination, it is decapsulated and passed to the upper layer protocols. All switches' uphill and downhill tables are automatically configured during their initialization. These configurations are static unless the topology changes.

The downhill-uphill lookup modifies the current switch forwarding algorithm. However, an increasing number of switches support highly customized forwarding policies. In our implementation we use OpenFlow enabled switches to support this forwarding algorithm. In fact, the downhill-uphill lookup is not necessary for a fat-tree topology, since a core switch in a fat-tree uniquely determines the entire path. However, not all multi-rooted trees share the same property, e.g., a Clos network.

3.4 Path Monitoring

To achieve load-sensitive path selection at an end host, DARD informs every end host of the traffic load in the network, and each end host accordingly selects paths for its outbound elephant flows. This function is performed by the Path State Monitor shown in Figure 3. From a high-level perspective, there are two options to inform an end host of the network traffic load. The first is a mechanism similar to NetFlow [15], in which traffic is logged at the switches and stored at a centralized server; the log can then be either pulled or pushed to the end hosts. The second is an active measurement mechanism, in which each end host actively probes the network and collects replies. TeXCP [20] chooses the second option. Since we desire that DARD not rely on any conceptually centralized component, to prevent any potential scalability issue, an end host in DARD also uses the active probe method to monitor the traffic load in the network. This section first describes a straw man design of the Path State Monitor which enables an end host to monitor the traffic load in the network. Then we improve the straw man design by decreasing the control traffic.

We first define some terms before describing the straw man design. We use C_l to denote output link l's capacity. N_l denotes the number of elephant flows on output link l. We define link l's fair share S_l = C_l / N_l as the bandwidth each elephant flow would get if they fairly shared that link (S_l = 0 if N_l = 0). Output link l's Link State (LS_l) is defined as a triple [C_l, N_l, S_l]. A switch r's Switch State (SS_r) is defined as {LS_l | l is r's output link}. In DARD, each switch keeps track of its switch state (SS) locally. A path p refers to the set of links that connect a source and destination ToR switch pair. If link l has the smallest S_l among all the links on path p, we use LS_l to represent p's Path State (PS_p). The Path State Monitor is a DARD component running on an end host and is responsible for monitoring the states of all the paths to other hosts.

In the straw man design, every end host periodically sends SSR to all the switches in the topology, and the switches reply to the requests. The path state monitor of an end host assembles the switch state replies into the path states (PS), which indicate the traffic load on each path. This straw man design requires every switch to have a customized flow counter and the capability of replying to SSR. These two functions are already supported by OpenFlow enabled switches [17]. The control traffic in every control interval (we discuss this control interval in § 5) can be estimated using formula (1), where pkt_size refers to the sum of the request and response packet sizes:

    num_of_servers × num_of_switches × pkt_size    (1)

Even though the above control traffic is bounded by the size of the topology, we can still improve the straw man design by decreasing the control traffic. There are two intuitions for the optimizations. First, if an end host is not sending any elephant flow, it is not necessary for the end host to monitor the traffic load, since DARD is designed to select paths only for elephant flows. Second, we limit the number of switches each source sends SSR to. For any multi-rooted tree topology with three hierarchies, this set of switches includes (1) the source ToR switch, (2) the aggregation switches directly connected to the source ToR switch, (3) all the core switches, and (4) the aggregation switches directly connected to the destination ToR switch, as shown in Figure 5. The remaining switches are not on any path from the source to the destination. We do not highlight the destination ToR switch (ToR31), because its output link (the one connected to E31) is shared by all four paths from the source to the destination, so DARD cannot move flows away from that link anyway.

Figure 5: The set of switches a source sends switch state requests to. E21 is sending an elephant flow to E31. The switches highlighted by the dotted circles are the ones E21 needs to send SSR to.

3.5 Path Selection

As shown in Figure 3, a Path Selector running on an end host takes the detected elephant flows and the path state as its input and periodically moves flows from overloaded paths to underloaded paths. This section first introduces the intuition behind DARD's path selection and then describes the algorithm in detail.

In § 3.4 we define a link's fair share to be the link's bandwidth divided by the number of elephant flows on it. A path's fair share is defined as the smallest link fair share along the path. Given an elephant flow's elastic traffic demand and the small delays in datacenters, elephant flows tend to fully and fairly utilize their bottlenecks. As a result, moving one flow from a path with a small fair share to a path with a large fair share pushes both the small and the large fair share toward the middle and thus improves fairness to some extent. Based on these observations, we propose DARD's path selection algorithm, whose high-level idea is to enable every end host to selfishly increase the minimum fair share it can observe. We desire to design a stable and distributed algorithm to improve the system's efficiency.

To facilitate path selection, every source and destination pair maintains two vectors: the path state vector (PV), whose ith item is the state of the ith path, and the flow vector (FV), whose ith item is the number of elephant flows the source is sending along the ith path. The algorithm selfish path selection illustrates one round of the path selection process.

Algorithm selfish path selection
     1: for each src-dst ToR switch pair P do
     2:   max_index = 0, max_S = 0
     3:   min_index = 0, min_S = ∞
     4:   for each i in [1, P.PV.length] do
     5:     if max_S < P.PV[i].S then
     6:       max_S = P.PV[i].S
     7:       max_index = i
     8:     else if P.FV[i] > 0 and min_S > P.PV[i].S then
     9:       min_S = P.PV[i].S
    10:       min_index = i
    11:     end if
    12:   end for
    13: end for
    14: if max_index ≠ min_index then
    15:   estimation = P.PV[max_index].bandwidth / (P.PV[max_index].flow_numbers + 1)
    16:   if estimation − min_S > δ then
    17:     move one elephant flow from path min_index to path max_index
    18:   end if
    19: end if

Line 15 estimates the fair share of the max_indexth path if another elephant flow is moved to it. The δ in line 16 is a positive threshold to decide whether to move a flow; line 16 makes sure this algorithm will not decrease the global minimum fair share. If we set δ to 0, the algorithm moves a flow whenever the estimated fair share exceeds the current minimum; if we set δ to be larger than 0, the algorithm converges as soon as the estimation is close enough to the current minimum fair share. In general, a small δ evenly splits elephant flows among different paths, and a large δ accelerates the algorithm's convergence.
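Below is a minimal, runnable Python rendering of one selfish path selection round for a single source-destination ToR pair. The data layout and helper names are our own simplification of DARD's implementation; in particular, we make explicit that the path a flow is moved away from must be one on which this host actually has a flow (FV[i] > 0).

    DELTA = 10.0  # Mbps; the delta threshold (the test bed implementation uses 10 Mbps)

    def selfish_round(paths, flows, delta=DELTA):
        """One round of selfish path selection.
        paths: list of dicts with 'bandwidth' (bottleneck capacity, Mbps) and
               'elephants' (elephant flows on the bottleneck link) per path.
        flows: FV -- flows[i] is how many elephants *this* host sends on path i.
        Returns (from_path, to_path) if one flow should move, else None."""
        share = [0.0 if p['elephants'] == 0 else p['bandwidth'] / p['elephants']
                 for p in paths]                      # S = C / N per path bottleneck

        max_index = max(range(len(paths)), key=lambda i: share[i])   # best path overall
        used = [i for i in range(len(paths)) if flows[i] > 0]
        if not used:
            return None                               # nothing to move
        min_index = min(used, key=lambda i: share[i]) # worst path among those we use

        if max_index == min_index:
            return None
        target = paths[max_index]
        estimation = target['bandwidth'] / (target['elephants'] + 1)  # line 15
        if estimation - share[min_index] > delta:                     # line 16
            return (min_index, max_index)             # move one elephant flow
        return None

    # Example: two 1000 Mbps paths; three elephants crowd path 0, one uses path 1.
    paths = [{'bandwidth': 1000, 'elephants': 3},
             {'bandwidth': 1000, 'elephants': 1}]
    print(selfish_round(paths, flows=[3, 0]))         # (0, 1): shift one flow to path 1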

Existing work shows that load-sensitive adaptive routing protocols can lead to oscillations and instability [22]. Figure 6 shows an example of how an oscillation might happen with the selfish path selection algorithm. There are three source and destination pairs, (E1, E2), (E3, E4) and (E5, E6). Each of the pairs has two paths and two elephant flows. In the beginning, the shared path (the one through link switch1-switch2) has no elephant flows on it. According to the selfish path selection algorithm, the three sources move flows to it and increase the number of elephant flows on the shared path to three. This in turn causes the three sources to move flows away from the shared path. The three sources repeat this process and cause permanent oscillation and bandwidth underutilization.

Figure 6: Path oscillation example.

The reason for the path oscillation is that different sources move flows to under-utilized paths in a synchronized manner: the source in each pair runs the path state monitor and the path selector independently, without knowing the other two's behaviors. To prevent this oscillation, DARD adds a random span of time to each end host's control interval, so that the interval between two adjacent flow movements of the same end host consists of a fixed span of time and a random span of time. According to the evaluation in § 5, simply adding this randomness to the control interval can prevent path oscillation.

4. IMPLEMENTATION

To test DARD's performance in real datacenter networks, we implemented a prototype and deployed it in a 4-pod fat-tree topology on DeterLab [6]. We also implemented a simulator on ns-2 to evaluate DARD's performance on different types of topologies.

4.1 Test Bed

We set up a fat-tree topology using 4-port PCs acting as the switches and configure IP addresses according to the hierarchical addressing scheme described in § 3.3. Each link's bandwidth is configured as 100Mbps. All PCs run the Ubuntu 10.04 LTS standard image. All switches run OpenFlow 1.0. An OpenFlow enabled switch allows us to access and customize the forwarding table; it also maintains per-flow and per-port statistics. Different vendors, e.g., Cisco and HP, have implemented OpenFlow in their products. We implement our prototype on the existing OpenFlow platform to show that DARD is practical and readily deployable.

We implement a NOX [18] component to configure all switches' flow tables during their initialization. NOX is often used as a centralized controller for OpenFlow enabled networks. However, DARD does not rely on it; we use NOX only once to initialize the static flow tables. This component allocates the downhill table to OpenFlow's flow table 0 and the uphill table to OpenFlow's flow table 1 to enforce a higher priority for the downhill table. All entries are set to be permanent. All the mappings from IDs to underlying IP addresses are kept at every end host. We use Linux IP-in-IP tunneling as the encapsulation/decapsulation module.

A daemon program runs on every end host. It has the three components shown in Figure 3. The Elephant Flow Detector leverages TCPTrack [9] at each end host to monitor TCP connections and detects an elephant flow if a TCP connection grows beyond 100KB [16]. The Path State Monitor tracks the fair share of all the equal-cost paths connecting the source and destination ToR switches, as described in § 3.4. It queries switches for their states using the aggregate flow statistics interfaces provided by the OpenFlow infrastructure [8]. The query interval is set to 1 second. This interval causes an acceptable amount of control traffic, as shown in § 5. We leave exploring the impact of varying this query interval to our future work. The Path Selector moves elephant flows from overloaded paths to underloaded paths according to the selfish path selection algorithm, where we set δ to 10Mbps; this number is a tradeoff between maximizing the minimum flow rate and fast convergence. The flow movement interval is 5 seconds plus a random time from [0s, 5s]. Because a significant amount of elephant flows last for more than 10s [21], this interval setting prevents an elephant flow from completing without even a chance of being moved to a less congested path, while the conservative interval also limits the frequency of flow movement.

4.2 Simulator

To evaluate DARD's performance in larger topologies, we build a DARD simulator on ns-2, which captures the system's packet-level behavior. The simulator supports fat-tree, Clos network [16] and the 3-tier topologies whose oversubscription is larger than 1 [2]. A link's bandwidth is 1Gbps and its delay is 0.01ms. The queue size is set to the delay-bandwidth product. TCP New Reno is used as the transport protocol. We use the same settings as the test bed for the rest of the parameters. The topology and traffic patterns are passed in as tcl configuration files.
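To make the control-loop timing concrete, here is a skeleton of the per-host daemon implied by the parameters above (1 s switch-state queries, flow movements every 5 s plus a uniform random delay in [0 s, 5 s]). The function names are placeholders of our own, not DARD's actual module API.

    import random
    import time

    QUERY_INTERVAL = 1.0        # seconds between Switch State Requests
    BASE_MOVE_INTERVAL = 5.0    # fixed part of the flow-movement interval
    RANDOM_MOVE_SPAN = 5.0      # random part, drawn uniformly from [0, 5] s

    def dard_daemon(query_switch_states, detect_elephants, select_paths):
        """Event loop of a DARD end host (sketch). The three callbacks stand in
        for the Path State Monitor, Elephant Flow Detector, and Path Selector."""
        next_move = time.monotonic() + BASE_MOVE_INTERVAL + random.uniform(0, RANDOM_MOVE_SPAN)
        while True:
            path_states = query_switch_states()     # assemble SS replies into PS
            elephants = detect_elephants()          # TCP connections > 100 KB
            now = time.monotonic()
            if now >= next_move:
                select_paths(path_states, elephants)  # one selfish path selection round
                # De-synchronize hosts: a fixed span plus a random span, which the
                # evaluation shows is enough to prevent path oscillation.
                next_move = now + BASE_MOVE_INTERVAL + random.uniform(0, RANDOM_MOVE_SPAN)
            time.sleep(QUERY_INTERVAL)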

5. EVALUATION

This section describes the evaluation of DARD using the DeterLab test bed and ns-2 simulation. We focus this evaluation on four aspects. (1) Can DARD fully utilize the bisection bandwidth and prevent elephant flows from colliding at hot spots? (2) How fast can DARD converge to a stable state given different static traffic patterns? (3) Will DARD's distributed algorithm cause any path oscillation? (4) How much control overhead does DARD introduce to a datacenter?

5.1 Traffic Patterns

Due to the absence of commercial datacenter network traces, we use the three traffic patterns introduced in [10] for both our test bed and simulation evaluations. (1) Stride, where an end host with index E_ij sends elephant flows to the end host with index E_kj, where k = ((i + 1) mod num_pods) + 1. This traffic pattern emulates the worst case, where a source and a destination are in different pods; as a result, the traffic stresses the links between the core and the aggregation layers. (2) Staggered(ToRP, PodP), where an end host sends elephant flows to another end host connected to the same ToR switch with probability ToRP, to any other end host in the same pod with probability PodP, and to end hosts in different pods with probability 1 − ToRP − PodP. In our evaluation ToRP is 0.5 and PodP is 0.3. This traffic pattern emulates the case where an application's instances are close to each other and most intra-cluster traffic stays in the same pod or even the same ToR switch. (3) Random, where an end host sends elephant flows to any other end host in the topology with a uniform probability. This traffic pattern emulates an average case where applications are randomly placed in datacenters.

The above three traffic patterns can be either static or dynamic. Static traffic refers to a number of permanent elephant flows from the source to the destination. Dynamic traffic means the elephant flows between a source and a destination start at different times. Two key parameters for dynamic traffic are the flow inter-arrival time and the file size. According to [21], the distribution of inter-arrival times between two flows at an end host has periodic modes spaced by 15ms. Given that 20% of the flows are elephants [21], we set the inter-arrival time between two elephant flows to 75ms. Because 99% of the flows are smaller than 100MB and 90% of the bytes are in flows between 100MB and 1GB [21], we set an elephant flow's size to be uniformly distributed between 100MB and 1GB. We do not include any short-term TCP flows in the evaluation because elephant flows occupy a significant fraction of the total bandwidth (more than 90% of bytes are in the 1% of the largest flows [16, 21]). We leave the short flows' impact on DARD's performance to our future work.

5.2 Test Bed Results

To evaluate whether DARD can fully utilize the bisection bandwidth, we use the static traffic patterns and, for each source and destination pair, a TCP connection transfers an infinite file. We constantly measure the incoming bandwidth at every end host and use the results from the middle 40 seconds to calculate the average bisection bandwidth. The experiment lasts for one minute.

We also implement a static hash-based ECMP and a modified version of flow-level VLB in the test bed. In the ECMP implementation, a flow is forwarded according to a hash of the source and destination's IP addresses and TCP ports. Because a flow-level VLB randomly chooses a core switch to forward a flow, it can also introduce collisions at output ports, as ECMP does. In the hope of smoothing out the collisions by randomness, our flow-level VLB implementation randomly picks a new core every 10 seconds for an elephant flow. We note this implementation as periodical VLB (pVLB). This 10s control interval is set roughly the same as DARD's control interval. We do not implement the other approaches, e.g., Hedera, TeXCP and MPTCP, in the test bed; we compare DARD with these approaches in the simulation.

Figure 7 shows the comparison of DARD, ECMP and pVLB's bisection bandwidths under different static traffic patterns. DARD achieves higher bisection bandwidth than both ECMP and pVLB under all three traffic patterns. DARD outperforms ECMP because DARD moves flows from overloaded paths to underloaded ones and increases the minimum flow throughput in every step. Compared with ECMP and pVLB, DARD's strategic path selection can converge to a better flow allocation than simply relying on randomness. One observation is that the bisection bandwidth gap between DARD and the other two approaches increases in the order of staggered, random and stride. This is because flows through the core have more path diversity than the flows inside a pod.

Figure 7: DARD, ECMP and pVLB's bisection bandwidths under different static traffic patterns. Measured on the 4-pod fat-tree test bed.

We also measure DARD's large file transfer times under dynamic traffic patterns. Each elephant flow is a TCP connection transferring a 128MB file. We use a fixed file length for all the flows instead of lengths uniformly distributed between 100MB and 1GB, because we need to differentiate whether finishing a flow earlier is the result of a better path selection or of a smaller file; we do not explore other choices. We vary each source-destination pair's flow generating rate from 1 to 10 flows per second. The experiment lasts for five minutes. We track the start and the end time of every elephant flow and calculate the average file transfer time. We run the same experiment on ECMP and calculate the improvement of DARD over ECMP using formula (2), where avg_T_ECMP is the average file transfer time using ECMP and avg_T_DARD is the average file transfer time using DARD:

    improvement = (avg_T_ECMP − avg_T_DARD) / avg_T_ECMP    (2)

Figure 8 shows the improvement vs. the flow generating rate under different traffic patterns.

When the flow generating rate is low, ECMP and DARD have almost the same performance because the bandwidth is over-provisioned. As the flow generating rate increases, cross-pod flows congest the switch-to-switch links, in which case DARD reallocates the flows sharing the same bottleneck and improves the average file transfer time. When the flow generating rate becomes even higher, the bottleneck is most likely located at the host-switch links, in which case DARD helps little. Similarly, when intra-pod traffic is dominant, the host-switch links are occupied by flows within the same pod and thus become the bottlenecks.

Figure 8: File transfer improvement (improvement of average file transfer time vs. flow generating rate for each src-dst pair, for staggered, random and stride traffic). Measured on the test bed.

5.3 Simulation Results

To fully understand DARD's advantages and disadvantages, besides comparing DARD with ECMP and pVLB, we also compare DARD with Hedera, TeXCP and MPTCP in our simulation.

In the TeXCP implementation, each ToR switch pair maintains the utilizations of all the available paths connecting the two of them by periodic probing (the default probe interval is 200ms; since the RTT in a datacenter is on the order of 1ms or even smaller, we decrease this probe interval to 10ms). The control interval is five times the probe interval [20]. Each ToR switch schedules traffic at the packet level. We do not implement the flowlet [25] mechanisms in the simulator. We implement both the demand estimation and the simulated annealing algorithms described in Hedera and set its scheduling interval to 5 seconds [11]. We use MPTCP's ns-2 implementation [7] to compare with DARD; each MPTCP connection uses all the simple paths connecting the source and the destination. Each experiment lasts for 120s in ns-2.

5.3.1 Performance Improvement

We use the static traffic patterns on a fat-tree with 1024 hosts (p = 16) to evaluate whether DARD can fully utilize the bisection bandwidth in a larger topology. Figure 9 shows the result. TeXCP and DARD achieve similar bisection bandwidth. Hedera outperforms DARD under both stride and random traffic patterns. However, given the staggered traffic pattern, Hedera achieves less bisection bandwidth than DARD. This is because the current Hedera only schedules the flows going through the core; when most of the flows are within the same pod or even the same ToR switch, few path diversities exist for this core-level scheduling, and as a centralized method Hedera degrades to ECMP. We believe this issue can be addressed by introducing new neighbor generating functions in Hedera. MPTCP outperforms DARD by completely exploring the path diversities. However, it achieves less bisection bandwidth than Hedera. We suspect this is because the current MPTCP ns-2 implementation does not support MPTCP-level retransmission: lost packets are always retransmitted on the same path regardless of how congested the path is. We leave a further comparison between DARD and MPTCP as our future work.

Figure 9: Bisection bandwidth under different static traffic patterns. Simulated on p = 16 fat-tree.

We also measure the large file transfer time to compare DARD with the other approaches on a fat-tree topology with 1024 hosts (p = 16). We assign each elephant flow from a dynamic random traffic pattern a unique index and compare its transmission times under different traffic scheduling approaches. We define T_i^m to be flow i's transmission time when the underlying traffic scheduling approach is m, e.g., DARD or ECMP. We use every T_i^ECMP as the reference and calculate the improvement of file transfer time for every traffic scheduling approach according to formula (3); the improvement_i^ECMP is 0 for all flows:

    improvement_i^m = (T_i^ECMP − T_i^m) / T_i^ECMP    (3)

Figure 10 shows the CDF of the above improvement. We find that random traffic and staggered traffic share an interesting pattern: ECMP and pVLB essentially have the same performance, since ECMP transfers half of the flows faster than pVLB and transfers the other half slower. Hedera outperforms MPTCP, because Hedera achieves higher bisection bandwidth. DARD outperforms both ECMP and pVLB, because it monitors all the paths connecting a source and destination pair and moves flows to underloaded paths to smooth out the collisions caused by random assignment.

Figure 10: Improvement of large file transfer time under dynamic random traffic pattern (CDF over pVLB, TeXCP, DARD, MPTCP and Hedera). Simulated on p = 16 fat-tree.
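A compact way to read formula (3): for each flow i and scheduling approach m, the improvement is the relative reduction of that flow's transfer time against its ECMP transfer time. The sketch below (our illustration, with hypothetical numbers) computes these per-flow improvements, which is what the CDFs in Figure 10 are built from.

    def per_flow_improvement(t_ecmp, t_m):
        """improvement_i^m = (T_i^ECMP - T_i^m) / T_i^ECMP for every flow i.
        t_ecmp and t_m map a flow index to its transfer time under ECMP and
        under scheduling approach m, respectively."""
        return {i: (t_ecmp[i] - t_m[i]) / t_ecmp[i] for i in t_ecmp}

    # Hypothetical transfer times (seconds) for three flows:
    ecmp = {0: 40.0, 1: 50.0, 2: 60.0}
    dard = {0: 36.0, 1: 44.0, 2: 57.0}
    print(per_flow_improvement(ecmp, dard))   # {0: 0.1, 1: 0.12, 2: 0.05}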

Even though DARD and TeXCP achieve the same bisection bandwidth under the dynamic random traffic pattern (Figure 9), DARD still outperforms TeXCP in file transfer time. We further measure every elephant flow's retransmission rate, which is defined as the number of retransmitted packets over the number of unique packets. Figure 11 shows that TeXCP has a higher retransmission rate than DARD. This is because TeXCP schedules traffic at the packet level: even though it can achieve a high bisection bandwidth, some of its packets are retransmitted because of reordering, and thus its goodput is not as high as DARD's.

Figure 11: DARD and TeXCP's TCP retransmission rate. Simulated on p = 16 fat-tree.

5.3.2 Convergence Speed

Being a distributed path selection algorithm, DARD is provable to converge to a Nash equilibrium (Appendix B). However, if the convergence takes a significant amount of time, the network will be underutilized during the process. As a result, we measure how fast DARD converges to a stable state, in which every flow stops changing paths. We use the static traffic patterns on a fat-tree with 1024 hosts (p = 16). For each source and destination pair, we vary the number of elephant flows from 1 to 64. We start these elephant flows simultaneously and track the time when all the flows stop changing paths. Figure 12 shows the CDF of DARD's convergence time, from which we can see that DARD converges in less than 25s for more than 80% of the cases. Given that DARD's control interval at each end host is roughly 10s, the entire system converges in less than three control intervals.

Figure 12: DARD converges to a stable state in 2 or 3 control intervals given static traffic patterns (CDF of convergence time for staggered, random and stride traffic). Simulated on p = 16 fat-tree.

5.3.3 Stability

Load-sensitive adaptive routing can lead to oscillation. To prevent this oscillation, DARD adds a random span of time to each end host's control interval. This section evaluates the effects of this simple mechanism. We use the dynamic random traffic pattern on a fat-tree with 128 end hosts (p = 8) and track the output link utilizations at the core, since a core's output link is usually the bottleneck for an inter-pod elephant flow [11]. Figure 13 shows the link utilizations on the 8 output ports at the first core switch. We can see that after the initial oscillation the link utilizations stabilize.

Figure 13: The first core switch's output port utilizations under dynamic random traffic pattern (link utilization over time). Simulated on p = 8 fat-tree.

However, we cannot simply conclude that DARD does not cause path oscillations, because the link utilization is an aggregated metric and misses every single flow's behavior. We first disable the random time span added to the control interval and log every single flow's path selection history. We find that even though the link utilizations are stable, certain flows are constantly moved between two paths; e.g., one 512MB elephant flow is moved between two paths 23 times in its life cycle. This indicates that path oscillation does exist in load-sensitive adaptive routing. The main reason is that different sources move flows to underloaded paths in a highly synchronized manner. After many attempts, we find that adding a random span of time to the control interval addresses the above problem.

Figure 14 shows the CDF of how many times flows change their paths in their life cycles with the randomness enabled. For the staggered traffic, around 90% of the flows stick to their original path assignment. For the stride traffic, where all flows are inter-pod, around 50% of the flows do not change their paths, and the other 50% change their paths fewer than 4 times. This small number of path changes indicates that DARD is stable and no flow changes its paths back and forth.

Figure 14: CDF of the times that flows change their paths under dynamic traffic pattern. Simulated on p = 8 fat-tree.

5.3.4 Control Overhead

To evaluate DARD's communication overhead, we trace the control messages for both DARD and Hedera on a fat-tree with 128 hosts (p = 8) under the static random traffic pattern.

Figure 15 shows how much bandwidth is taken by the control messages given different numbers of elephant flows. DARD's communication overhead is mainly introduced by the periodic probes, including both the queries from hosts and the replies from switches; this probe traffic is bounded by the topology size. With the increase of the number of elephant flows, DARD's control messages take more bandwidth, because the sources are probing for the states of all the paths to their destinations. On the other hand, for Hedera's centralized scheduling the communication overhead is bounded by the number of flows, because in the worst case the ToR switches report every elephant flow to the centralized controller and the controller further updates some switches' flow tables.

As a result, there are three stages in Figure 15. In the first stage (between 0K and 1.5K flows in this example), DARD's control messages take less bandwidth than Hedera's. The reason is mainly that Hedera's control message size is larger than DARD's (in Hedera, the payload of a message from a ToR switch to the controller is 80 bytes and the payload of a message from the controller to a switch is 72 bytes; these two numbers are 48 bytes and 32 bytes for DARD). In the second stage (between 1.5K and 3K flows), DARD's control messages take more bandwidth, since every source probes all the switches on the paths to its destinations. In the third stage (more than 3K flows), Hedera's communication overhead does not increase proportionally to the number of elephant flows. That is mainly because when the traffic pattern is dense enough, even the centralized scheduling cannot easily find an improved flow allocation, and thus few messages are sent from the controller to the switches.

Figure 15: DARD and Hedera's (simulated annealing) communication overhead, in control overhead (MB/s) vs. peak number of elephant flows. Simulated on p = 8 fat-tree.

6. CONCLUSION

This paper proposes DARD, a readily deployable, lightweight distributed adaptive routing system for datacenter networks. DARD allows each end host to selfishly move elephant flows from overloaded paths to underloaded paths. Our analysis shows that this algorithm converges to a Nash equilibrium in finite steps. Test bed emulation and ns-2 simulation show that DARD outperforms random flow-level scheduling when the bottlenecks are not at the edge, and outperforms centralized scheduling when intra-pod traffic is dominant.

7. ACKNOWLEDGEMENT

This material is based upon work supported by the National Science Foundation under Grant No. 1040043.

8. REFERENCES

[1] Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2.
[2] Cisco Data Center Infrastructure 2.5 Design Guide. http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DC_Infra2_5/DCI_SRND_2_5_book.html.
[3] IP alias limitation in Linux kernel 2.6. http://lxr.free-electrons.com/source/net/core/dev.c#L935.
[4] IP alias limitation in Windows NT 4.0. http://support.microsoft.com/kb/149426.
[5] Microsoft Windows Azure. http://www.microsoft.com/windowsazure.
[6] DeterLab. http://www.deterlab.net/.
[7] ns-2 implementation of MPTCP.
[8] OpenFlow switch specification, version 1.0. http://www.openflowswitch.org/documents/openflow-spec-v1.0.pdf.
[9] TCPTrack. http://www.rhythm.cx/~steve/devel/tcptrack/.
[10] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. SIGCOMM Comput. Commun. Rev., 38(4):63-74, 2008.
[11] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In Proceedings of the 7th ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), San Jose, CA, 2010.
[12] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th Annual Conference on Internet Measurement (IMC '10), New York, NY, USA, 2010. ACM.
[13] J.-Y. Le Boudec. Rate adaptation, congestion control and fairness: A tutorial, 2000.
[14] C. Busch and M. Magdon-Ismail. Atomic routing games on maximum congestion. Theor. Comput. Sci., 410(36):3337-3347, 2009.
[15] B. Claise. Cisco Systems NetFlow Services Export Version 9. RFC 3954, 2004.
[16] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: a scalable and flexible data center network. SIGCOMM Comput. Commun. Rev., 39(4):51-62, 2009.
[17] Researchers show off advanced network control technology. http://www.networkworld.com/news/2008/102908-openflow.html.
[18] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker. NOX: towards an operating system for networks. SIGCOMM Comput. Commun. Rev., 38(3):105-110, 2008.
[19] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992, 2000.
[20] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the tightrope: Responsive yet stable traffic engineering. In ACM SIGCOMM, 2005.
[21] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic: measurements & analysis. In IMC '09: Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, New York, NY, USA, 2009. ACM.
[22] A. Khanna and J. Zinky. The revised ARPANET routing metric. In Symposium Proceedings on Communications Architectures & Protocols (SIGCOMM '89), pages 45-56, New York, NY, USA, 1989. ACM.
[23] M. Kodialam, T. V. Lakshman, and S. Sengupta. Efficient and robust routing of highly variable traffic. In Proceedings of the Third Workshop on Hot Topics in Networks (HotNets-III), 2004.
[24] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: a scalable fault-tolerant layer 2 data center network fabric. SIGCOMM Comput. Commun. Rev., 39(4):39-50, 2009.
[25] S. Sinha, S. Kandula, and D. Katabi. Harnessing TCP's burstiness using flowlet switching. In 3rd ACM SIGCOMM Workshop on Hot Topics in Networks (HotNets-III), November 2004.
[26] D. Wischik, C. Raiciu, A. Greenhalgh, and M. Handley. Design, implementation and evaluation of congestion control for multipath TCP. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), Berkeley, CA, USA, 2011. USENIX Association.
[27] X. Yang. NIRA: a new internet routing architecture. In FDNA '03: Proceedings of the ACM SIGCOMM Workshop on Future Directions in Network Architecture, pages 301-312, New York, NY, USA, 2003. ACM.

A. EXPLANATION OF THE OBJECTIVE

We assume TCP is the dominant transport protocol in datacenters, which tries to achieve max-min fairness if combined with fair queuing. This section explains that, given max-min fair bandwidth allocation, the global minimum fair share is the lower bound of the global minimum flow rate. In DARD, every end host tries to increase its observed minimum fair share in each round; thus increasing the minimum fair share actually increases the global minimum flow rate.

First we define a bottleneck link according to [13]. A link l is a bottleneck for a flow f if and only if (a) link l is fully utilized, and (b) flow f has the maximum rate among all the flows using link l.

Theorem 1. Given max-min fair bandwidth allocation for any network topology and any traffic pattern, the minimum fair share is the lower bound of the global minimum flow rate.

We prove this theorem using contradiction. Suppose link l0 has the minimum fair share S0, and flow f has the minimum flow rate, min_rate. Suppose min_rate < S0. Let link l_f be flow f's bottleneck, with capacity C_f and N_f elephant flows. According to the bottleneck definition, min_rate is the maximum flow rate on link l_f, and thus C_f / N_f ≤ min_rate. Combining this with the assumption, we get

    C_f / N_f < S0    (A1)

(A1) conflicts with S0 being the minimum fair share. Therefore Theorem 1 holds: min_rate ≥ S0. In other words, given max-min fair bandwidth allocation, the global minimum fair share is the lower bound of the global minimum flow rate; as the minimum fair share increases, so does the global minimum flow rate.

B. CONVERGENCE PROOF

We now formalize DARD's flow scheduling algorithm and prove that this algorithm converges to a Nash equilibrium in finite steps. Each end host moves flows from overloaded paths to underloaded ones to increase its observed minimum fair share (link l's fair share is defined as the link capacity, C_l, divided by the number of elephant flows, N_l). We model this process as a congestion game [14], defined as (F, G, {p^f}_{f in F}), where F is the set of all the flows, G = (V, E) is a directed graph, and p^f is the set of paths that can be used by flow f. A strategy s = [p^{f1}_{i1}, p^{f2}_{i2}, ..., p^{f|F|}_{i|F|}] is a collection of paths, in which the i_k-th path in p^{fk}, written p^{fk}_{ik}, is used by flow f_k. Notation s_{-k} refers to the strategy s without p^{fk}_{ik}, i.e., [p^{f1}_{i1}, ..., p^{f(k-1)}_{i(k-1)}, p^{f(k+1)}_{i(k+1)}, ..., p^{f|F|}_{i|F|}], and (s_{-k}, p^{fk}_{ik'}) refers to the strategy [p^{f1}_{i1}, ..., p^{f(k-1)}_{i(k-1)}, p^{fk}_{ik'}, p^{f(k+1)}_{i(k+1)}, ..., p^{f|F|}_{i|F|}].

For a strategy s and a link j, the link state LS_j(s) is a triple (C_j, N_j, S_j), as defined in Section 3.4. For a path p, the path state PS_p(s) is the link state with the smallest fair share over all links in p. A flow state FS_f(s) is the corresponding path state, FS_f(s) = PS_p(s), where flow f is using path p. The system state SysS(s) is the link state with the smallest fair share over all links in E.

Flow f_k is locally optimal in strategy s if FS_{fk}(s).S ≥ FS_{fk}(s_{-k}, p^{fk}_{ik'}).S for all p^{fk}_{ik'} in p^{fk}. A Nash equilibrium is a state where all flows are locally optimal. A strategy s* is globally optimal if, for any strategy s, SysS(s*).S ≥ SysS(s).S.

For a strategy s, the state vector is SV(s) = [v0(s), v1(s), v2(s), ...], where v_k(s) stands for the number of links whose fair share lies in [k*delta, (k+1)*delta), and delta is a positive parameter, e.g., 10Mbps, used to cluster links into groups. A small delta groups the links at a fine granularity and increases the minimum fair share; a large delta improves the convergence speed. By construction, sum_k v_k(s) = |E|. We define s = s' when v_k(s) = v_k(s') for all k ≥ 0, and s < s' when there exists some K such that v_K(s) < v_K(s') and, for all k < K, v_k(s) ≤ v_k(s'). It is easy to show that, given three strategies s, s' and s'', if s ≤ s' and s' ≤ s'', then s ≤ s''. What is more, for any strategy s, s ≤ s.

Theorem 2. Given a congestion game (F, G, {p^f}_{f in F}) and delta, the selfish flow scheduling algorithm increases the minimum fair share round by round and converges to a Nash equilibrium in finite steps.

The proof is a special case of a congestion game [14]. Suppose there is no synchronized flow scheduling, i.e., only one flow f selfishly changes its route at a time to improve its fair share, making the strategy change from s to s'. This action decreases the number of links with small fair shares and increases the number of links with larger fair shares, i.e., s' < s, and the global minimum fair share does not decrease. Since there are only a finite number of state vectors, and every selfish movement produces a strictly smaller strategy, the number of steps needed to converge to a Nash equilibrium is finite. This indicates that asynchronous and selfish flow movements actually increase the global minimum fair share round by round until every flow is in its locally optimal state, i.e., no flow can make a further movement to decrease s.

Since the number of state vectors is limited, we can find at least one strategy s that is the smallest. It is easy to see that this s has the largest minimum fair share, or the least number of links that have the minimum fair share, and thus is the global optimum. Because s is the smallest strategy, no flow can make a further movement to decrease s. Hence this global optimal strategy s is also a Nash equilibrium strategy.
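To make the ordering used in the proof concrete, here is a small Python sketch (our illustration) that buckets link fair shares into the state vector SV(s) with bucket width delta and compares two strategies with the order defined above: a strategy is smaller when, at the first bucket where they differ, it has fewer links with that low fair share.

    DELTA = 10.0  # bucket width delta, e.g. 10 Mbps

    def state_vector(fair_shares, delta=DELTA):
        """SV(s): v_k counts the links whose fair share falls in [k*delta, (k+1)*delta)."""
        buckets = {}
        for s in fair_shares:
            k = int(s // delta)
            buckets[k] = buckets.get(k, 0) + 1
        size = max(buckets) + 1 if buckets else 0
        return [buckets.get(k, 0) for k in range(size)]

    def smaller(sv_a, sv_b):
        """True if strategy a < strategy b under the ordering above."""
        length = max(len(sv_a), len(sv_b))
        a = sv_a + [0] * (length - len(sv_a))
        b = sv_b + [0] * (length - len(sv_b))
        for va, vb in zip(a, b):
            if va != vb:
                return va < vb
        return False

    # Moving a flow off a congested link empties the lowest bucket:
    before = state_vector([5, 25, 95])    # one link's fair share is below 10 Mbps
    after  = state_vector([15, 25, 95])   # that link's fair share rises above 10 Mbps
    print(smaller(after, before))          # True: the move yields a smaller strategy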