

Dragon: Scalable, Flexible and Efficient Traffic Engineering in Software Defined ISP Networks

Mehrdad Moradi∗, Ying Zhang†, Z. Morley Mao∗, Ravi Manghirmalani‡
∗University of Michigan, †Facebook, ‡Ericsson Research

Abstract—To optimize network cost, performance, and reliability, SDN advocates for centralized traffic engineering (TE) that enables more efficient path and egress point selection as well as bandwidth allocation. In this work, we argue that SDN-based TE for ISP networks can be very challenging. First, ISP networks are often very large, imposing significant scalability challenges on centralized TE. Second, ISP networks usually have diverse types of links, switches, and cost models, leading to a complex combination of optimizations. Third, unlike cloud/enterprise networks, ISP networks not only have many choices of internal paths but also include rich selections of egress points and interdomain routes. To overcome these challenges, we present a novel TE application framework, called Dragon, for existing SDN control planes. To address the scalability challenge, Dragon consists of hierarchical and recursive TE algorithms and mechanisms that divide flow optimization problems into subtasks and execute them in parallel. Further, Dragon allows ISPs to express diverse objectives for different parts of their network. Finally, we extend Dragon to jointly optimize the selection of intradomain and interdomain paths. Using extensive evaluation on real topologies and prototyping with an SDN controller and switches, we demonstrate that Dragon outperforms existing TE methods both in speed and in optimality.

Index Terms—Traffic Engineering, Software Defined Networking, Hierarchical, Recursive, ISP, SDN, TE, Framework

I. INTRODUCTION

To optimize network performance and throughput, operators are always looking for new ways to engineer their traffic. Software-defined networking (SDN) provides more flexibility and control over networks by separating their control and data planes. SDN also enhances traffic engineering (TE) by having a centralized TE application with a global view constantly reconfigure the network in response to new traffic demand. Recently, cloud providers (e.g., Microsoft) leveraged SDN to demonstrate the potential of centralized intradomain TE in improving link utilization and fault-tolerance [1], [2]. In addition, there is an increasing trend of centralized platforms for interdomain TE [3], [4].

At first sight, it seems that existing SDN-based TE systems and platforms can be fully adapted to manage traffic in ISP networks. However, a close inspection reveals that they fall short of completely satisfying key requirements of ISP networks. First, most of the existing proposals scale poorly due to the large size and rapid growth of ISPs, as well as their formulation of TE as a linear flow optimization that can consist of millions of constraints and variables for an ISP network. Second, ISPs have different types of links, switches, traffic and failure patterns in different parts of their network. Thus, they can have various optimization goals for each part, e.g., maximizing throughput on peering links and balancing the load on internal links. However, most existing TE solutions are limited to applying the same set of objective functions to the entire network. Third, existing systems lack interactions between intradomain resource allocation and interdomain routing, causing sub-optimal end-to-end performance [5], [6].

Dragon TE application framework. Rather than centralizing the TE application in a single process or thread on the SDN controller, we argue that a scalable, flexible, and efficient TE application for ISP networks should consist of distributed logic. To overcome the scalability challenge, it must break down large-scale TE problems into smaller pieces, solve the subproblems in parallel as much as possible, and combine the partial solutions into a TE result sufficiently close to that of a global one-shot TE approach. To support fine-grained customization, it must enable ISPs to express their desired optimization goal for each subproblem. To ensure efficiency of TE decisions, it must jointly optimize interdomain paths and intradomain routes within each subproblem. To realize this vision, we present a novel TE application framework, called Dragon, having these properties. Dragon builds up a hierarchical structure of TE workers inside the TE application on the SDN controller (Fig. 1). Our framework partitions the TE problem by recursively building an abstract network region for each TE worker. Each TE worker is responsible for solving a small flow optimization to handle its regional traffic while minimally coordinating with its parent and children. Further, Dragon enables operators to mix and match diverse optimization functions, as each of its TE workers can enforce a different optimization goal on its traffic. Moreover, Dragon simultaneously computes both intradomain and interdomain TE decisions by properly disseminating BGP routing information throughout the hierarchy. Our Dragon framework is capable of running on both hierarchical and flat SDN controllers [7], [8].

Contributions. In more detail, we make four contributions in this paper. First, our framework targets TE in large ISP networks. It creates unified network topologies and formulates the interdomain and intradomain TE as a joint optimization problem. Also, it is extensible to incorporate diverse optimization objectives, e.g., service chaining, maximizing throughput, and guaranteeing fault tolerance. Second, Dragon offers a novel hierarchical TE algorithm to address the scalability challenge, especially for handling a large number of IP prefixes in large ISP networks. While solving each sub-optimization problem locally can reduce the computation overhead, it certainly will affect the optimality of the global solution. Thus, our main challenge is to design the right boundary between hierarchical regions in terms of the information handled locally versus that exposed to upper levels.


Third, Dragon maps local TE results from the hierarchical regions to the physical data plane with a minimal number of data plane states. It minimizes the packet header size and the number of flow entries in switches. Fourth, Dragon enables ISPs to combine different TE objective functions in a hierarchical manner for their different regions, e.g., maximize the throughput on the peering links while minimizing the maximum link utilization (MLU) internally to avoid congestion.

Result summary. We have implemented Dragon as an application on top of the RYU controller, while the data plane comprises OVS switches. Through experimentation with prototypes on eleven real-world topologies, we show that Dragon is computationally more efficient than the global optimal approach, reducing the TE overhead by 98% while still achieving 88-96% of the solution optimality.¹ Our joint optimization of interdomain and intradomain TE reduces the end-to-end latency by 52%. In the experiment of maximizing the inter-region throughput and minimizing intra-region MLU, Dragon achieves 15.5% lower MLU while only sacrificing 3% throughput compared to the best multi-objective TE approaches [9]. Finally, Dragon has a more scalable data plane and requires 70% fewer flow entries than advanced SDN solutions [7].

¹Today's ISP networks are highly over-provisioned, so sacrificing a negligible amount of solution optimality for scaling TE is practical, while also enabling applications such as real-time TE-based security in response to DDoS attacks.

II. DRAGON FRAMEWORK OVERVIEW

This section provides a background on SDN-based TE and presents an overview of the Dragon TE framework.

A. Traditional SDN-based Traffic Engineering

In SDN, the data plane consists of programmable switches that are managed and configured by a logically centralized SDN controller (Fig. 1a). TE is realized as an application on the controller that discovers topology changes and collects new traffic demand from agents running on access/edge switches (e.g., [10]). The controller feeds them to the TE application, which often performs three operations: (i) computes paths for flows, (ii) allocates bandwidth to flows, and (iii) generates rules for switches to enforce the computed paths and bandwidths. The difference between TE solutions comes from their approach to realizing these operations. We focus on a popular multipath TE model that is facilitated by the SDN paradigm. This model separates the optimization of path computation from bandwidth allocation and runs iteratively based on how fast the traffic matrix changes in the network. It computes multiple paths (e.g., k shortest paths) for each flow demand, solves a linear flow optimization program to allocate bandwidth to each flow on its paths, and generates a set of switch rules (e.g., OpenFlow flow, group, and meter rules) to enforce the TE results.
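To make this iterative multipath TE model concrete, the following is a minimal sketch of one TE epoch in this style. It is an illustration only, not Dragon's actual code: the graph is a networkx graph, and the LP solver and rule generator are passed in as callables because their details depend on the chosen objective and data plane.

```python
# Illustrative sketch of the multipath TE model described above (not Dragon's code).
from itertools import islice
import networkx as nx

def te_epoch(graph, demands, solve_bandwidth_lp, to_switch_rules, k=4):
    """One epoch: (i) path computation, (ii) LP-based bandwidth allocation, (iii) rule generation.
    demands is a list of (src, dst, volume) tuples; the last two stages are supplied by the caller."""
    # (i) Compute up to k candidate paths (tunnels) per flow demand.
    tunnels = {}
    for src, dst, volume in demands:
        tunnels[(src, dst)] = list(islice(nx.shortest_simple_paths(graph, src, dst), k))

    # (ii) Allocate bandwidth to each flow on its tunnels by solving a linear program
    # (e.g., maximize total allocated bandwidth subject to link capacities).
    allocation = solve_bandwidth_lp(graph, demands, tunnels)

    # (iii) Translate the allocation into switch rules (e.g., OpenFlow flow, group,
    # and meter entries) to be pushed to the data plane.
    return to_switch_rules(allocation)
```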
B. Requirements for Software-defined TE in ISP Networks

Before delving into our solution overview, we discuss three requirements that existing SDN-based TE designs [11], [1], [12], [2] fail to simultaneously satisfy for ISP networks:

Requirement 1: TE scalability. First, today's SDN-based TE methods do not scale well with respect to ISP network size, as most of them are originally designed for optimizing aggregate inter-DC traffic in cloud networks with at most tens of nodes or DCs. A large ISP network can consist of tens of thousands of forwarding devices, leading to millions of optimization constraints and variables in the TE application. Such flow optimization problems are intractable since modern algorithms solving even simple linear programs (LPs) have exponential worst-case complexities [13]. In fact, prior studies (e.g., DEFO [14] in § VII) report that TE for a 100-node ISP network can take one night on high-end SDN controllers.

Requirement 2: TE flexibility. Second, existing SDN-based TE solutions lack flexibility in optimization. ISPs have different types of links, switches, traffic and failure patterns in different parts of their large network. Therefore, they are interested in having different optimization goals for each part or region (e.g., maximizing throughput on peering links and balancing the load on internal links). Most recent solutions apply the same objective functions to the entire network, and extending them to multi-objective optimizations with best-practice techniques [15], [9] can still lead to very suboptimal results, specifically when the objectives conflict with each other.

Requirement 3: TE performance. Third, today's SDN TE solutions are suboptimal because they optimize intradomain and interdomain TE decisions separately, while the ultimate ISP network efficiency is affected by both [16]. They select interdomain routes in the BGP route selection process by comparing all routes according to the local policies and AS path length. In this process, the intradomain TE decisions are considered in the form of IGP cost at a much later stage (step 8 in [17]). After routes are fixed, they optimize bandwidth and path allocation within the network. The limited interaction between interdomain and intradomain decisions can lead to path inflation and poor network utilization [5], [18], [19].

C. Dragon TE Framework: Software Abstractions

To address these challenges, our key idea is a hierarchical, elastic TE framework that breaks large-scale ISP TE into smaller pieces, solves them in parallel, and incrementally aggregates partial solutions (Fig. 1). Dragon creates a hierarchical group of TE workers, each of which solves a small part of the TE problem. The root TE worker (e.g., TW7) is at the top and leaf TE workers (e.g., TW1-4) are at the bottom. Dragon is equipped with a manager node that runs on existing SDN control planes and is responsible for bootstrapping and managing TE workers. In single-controller networks, the manager node runs each TE worker in a separate thread. In distributed SDN control planes [7], [20], TE workers are placed on different controllers to improve robustness.
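As a concrete (and deliberately simplified) picture of this arrangement, the sketch below shows how a manager could bootstrap a two-level hierarchy of per-region TE workers as threads on a single controller. The TEWorker internals are placeholders for illustration; this is not Dragon's actual API.

```python
# A minimal sketch of bootstrapping a two-level TE worker hierarchy (one thread per worker).
import threading

class TEWorker(threading.Thread):
    def __init__(self, region, parent=None):
        super().__init__(daemon=True)
        self.region = region      # the (abstract) topology this worker optimizes
        self.parent = parent      # None for the root worker
        self.children = []

    def run(self):
        pass                      # per-epoch loop (Shadow TE, Final TE, rule installation) goes here

def bootstrap(leaf_regions):
    """leaf_regions: one topology per leaf region, produced by the ISP-defined partitioning."""
    leaves = [TEWorker(region) for region in leaf_regions]
    # The root sees only the abstract D-switches its children expose, not the full topology.
    root = TEWorker(region=[leaf.region for leaf in leaves])
    root.children = leaves
    for leaf in leaves:
        leaf.parent = root
    for worker in leaves + [root]:
        worker.start()
    return root
```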


Each TE worker handles the TE task for a logical region, which is formed in the following way. The manager node first partitions the physical data plane (level 1) into a set of logical TE regions (e.g., regions 1-4) based on an algorithm defined by the ISP. Dragon does not impose any limitation on the number of worker levels or the number of children per worker. Then, each leaf region is assigned to a leaf TE worker that immediately constructs an abstract view of its region, in the form of logical switches (explained below), and presents this abstraction to its parent. Each level-2 TE worker (e.g., TW5-6) then learns its region topology by fetching the abstracted topologies built by its children and creates its own logical abstraction for its parent at level 3. Recursively, the process creates increasingly simplified topologies until the root TE worker (e.g., TW7) gathers its very abstract region. In the recursion, each TE worker hides the detailed topology of its region and abstracts it as a single switch for its parent, called a Dragon Switch (D-switch). A D-switch is a new abstraction over a physical/logical switch topology. It is more than a big-switch abstraction (e.g., [7], [8]), because it encodes accurate and effective details of the constituent topology so that the higher-level TE worker has sufficient information to compute near-optimal TE results (§ III-B).

Fig. 1: Dragon TE framework accelerates TE computation through parallelism, enables region-specific flow optimization, and jointly executes optimization of intradomain and interdomain TE decisions. (a) SDN-driven ISP network. (b) Dragon's recursive and hierarchical TE worker and region abstractions.

D. Dragon TE Framework: Properties & Functionalities

Leveraging linear programming (LP), Dragon iteratively learns and engineers all the flows in the network, similar to conventional SDN-based TE solutions (discussed in § II-A). In each TE epoch, it takes flow requests (i.e., a set of <source, sink, volume> tuples), constraints, and objective functions as inputs. It then determines multiple efficient paths for each flow and globally optimized bandwidth allocations to flow paths, followed by generating rules for configuring the SDN data plane. We now discuss new design opportunities that the Dragon framework provides and that are not available in existing SDN-TE systems.

Property 1. Dragon accelerates TE computation through parallelism. To scale the global linear optimization, Dragon recursively distributes flows among regions from the leaf regions to the root region and then runs the TE computation recursively from the root region back to the leaf regions (Fig. 1). In the top-down processing, at a given level, TE workers, each with a very small search space, can perform their TE in parallel, followed by TE optimization at the child TE workers in the level below. Due to our recursive abstractions and parallelism, Dragon potentially reduces the TE computation time by multiple orders of magnitude.

Property 2. Dragon enables ISPs to compose region-specific optimizations. ISPs can configure Dragon to support hierarchical combinations of optimizations. Recursively, Dragon can group links and (Dragon and physical) switches with the same objective function at any level into a logical region and then offload its control onto a TE worker; e.g., in Fig. 1, maximize the network throughput for traffic traversing the expensive link in TW7's region for cost concerns, while having TW5 and TW6 load balance traffic within their regions.

Property 3. Dragon jointly optimizes interdomain and intradomain traffic. It distributes interdomain information across hierarchical regions by having each TE worker establish a BGP session with its parent, giving non-leaf regions the illusion of physical peering points. Dragon simultaneously and hierarchically takes into account the status of internal resources and interdomain routes in its logical regions to run end-to-end TE on egress traffic from ISP networks at different levels.

E. Overview of Design and Optimization Challenges

The above design seemingly meets the TE requirements. However, a closer inspection reveals several design and algorithmic challenges of varying complexity, which are not addressed by legacy hierarchical routing systems (e.g., [21], [22], [23]) or by modern hierarchical SDN control planes [7], [8]. This is because Dragon is a centralized, multipath TE scheme while they are distributed, single-path routing protocols, and thus are inefficient from a TE perspective: poor network utilization, tunnel reservation failures, and congestion are inherent to them.

Challenge 1. Estimating flow demand without visibility. One key input to the Dragon framework is traffic demand. In traditional SDN-based TE (§ II-A), the TE application performs TE by periodically collecting flow requests from agents on access switches [24] or broker nodes in access networks [2]. In Dragon, no TE worker has a global view of and control over the network due to our recursive D-switch abstraction. Thus, it is challenging for TE workers to know the traffic demand in their region.

Solution: Recursive flow assignment. Within the limits of the D-switch abstraction, Dragon distributes traffic demands to hierarchical regions through a recursive procedure, making each TE worker responsible for a subset of flows. In particular, before each TE epoch, starting from the leaf level, each TE worker indirectly communicates with its parent through its D-switch and exports some of the local flow requests collected from its region to the parent worker (§ III).
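To illustrate the flavor of this recursive assignment, here is a minimal sketch of the per-worker decision, run bottom-up before each epoch. The worker attributes (region, border_port_for, parent) are illustrative assumptions rather than Dragon's actual data structures.

```python
# A sketch of the recursive flow assignment described above (illustrative field names).
def assign_flows(worker, flows):
    """Keep flows whose sink lies inside this worker's region; export the rest to the parent,
    re-expressed as requests between the ports of the D-switch exposed to the parent."""
    local, exported = [], []
    for src, dst, demand in flows:
        if dst in worker.region:
            local.append((src, dst, demand))          # handled entirely inside this region
        else:
            exported.append((worker.border_port_for(src),   # hypothetical helper: maps to a
                             worker.border_port_for(dst),   # port on the exposed D-switch
                             demand))
    worker.local_flows = local
    if worker.parent is not None:
        worker.parent_flow_requests = exported        # fetched by the parent worker
    return local, exported
```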

Challenge 2. Generating efficient and feasible TE decisions. The Dragon framework must be able to generate efficient TE results for any operator-selected optimization objective. In traditional SDN-based TE, achieving this goal is straightforward since the TE application runs TE with global network state (e.g., full topology, traffic demands) for the entire network in one shot. In contrast, Dragon offloads the TE logic from the controller onto a set of workers. It is possible that non-leaf workers' TE decisions become infeasible or inefficient after they get aggregated level by level from the root to the physical data plane. While optimizing Dragon for basic single-path routing is easy [22], [21], [23], [7], doing so for our desired multipath TE is challenging. Thus, our main focus is to improve the quality of TE solutions without sacrificing Dragon's other benefits (e.g., low TE computation time, support for diverse objectives).

Solution: Recursive flow optimization with greedy refinements. In Dragon, we introduce the Dragon Fabric (D-fabric) concept, associating each logical port pair of an abstract D-switch with performance and bandwidth constraints. Using these constraints, any non-leaf TE worker can participate in the global flow optimization initiated by the root without full TE state. Computing accurate D-fabrics is challenging, particularly when the network is saturated, due to hidden shared bottleneck physical links and circular dependencies between the TE decisions made by a TE worker and its children. Our recursive flow optimization programs and greedy algorithms at each TE worker address this issue (§ III).

Challenge 3. Considering intra- and interdomain TE together. In Dragon, jointly and hierarchically optimizing intra- and interdomain TE decisions can be challenging in terms of scalability and efficiency, particularly because a typical ISP network can have millions of egress flows destined to hundreds of thousands of external IP prefixes. On the one hand, it is not scalable to offload the entire joint TE load onto the root TE worker for global efficiency. On the other hand, it can lead to low network utilization and performance if we perform it completely at non-root TE workers. Finally, Dragon must handle unplanned BGP routing messages that change flow egress points between two epochs, without inefficiently recomputing the entire TE from scratch.

Solution: QoS-based joint TE with IP prefix clustering and incremental computation. To reduce the joint TE problem size, Dragon clusters destination IP prefixes with similar QoS metrics from all egress points. To balance scalability and efficiency, Dragon dynamically decides which ancestor TE worker should perform the joint optimization for an egress flow. To handle interdomain route changes, Dragon selects the affected regions and computes a local TE in them, while keeping the old TE results in their ancestor regions unchanged to achieve BGP routing stability (§ IV).

Challenge 4. Minimizing data plane rules and overheads. Physical SDN switches often have a very limited capacity for forwarding rules. In traditional SDN-TE, the controller, after running TE, can globally compress and aggregate the rules that must be installed in switches. The Dragon framework consists of a set of workers in the hierarchy, each having partial visibility of the network and generating a set of switch rules in its region that are fetched by its child workers. As each worker makes more fine-grained flow decisions level by level from the root, the number of rules in physical switches and the number of headers in packets quickly increase, putting significant pressure on data plane scalability.

Solution: Scalable recursive tunnels. We design workers in the hierarchy to efficiently interact with each other to generate minimal TE state for the physical data plane by performing a recursive optimization; we instruct each worker to efficiently aggregate its local rules with those of its parent using a tag-based tunneling protocol (e.g., MPLS) to greedily and locally minimize the number of switch rules that it transfers to the lower level and the number of states that it needs to encode in packet headers. Our optimization makes sure that Dragon does not create more data plane state than a traditional SDN-based TE solution.

III. EFFICIENT RECURSIVE INTRADOMAIN TE

This section describes how the Dragon framework efficiently handles intradomain traffic between fixed source and sink points (access switches) in ISP networks. In § IV, we extend it to handle interdomain (outbound) traffic. We propose recursive linear programs and greedy algorithms that allow TE workers to efficiently participate in network-wide flow optimizations such that Dragon sacrifices a very small amount of network throughput compared to the global, optimal TE while accelerating each TE iteration by several orders of magnitude. Fig. 2 shows the high-level steps that we take in this section, assuming TE workers can have diverse objective functions for their regions.

Fig. 2: Efficient and scalable recursive TE in Dragon. Dragon sacrifices a very small amount of the network throughput compared to the global, optimal TE while it accelerates each TE iteration by several orders of magnitude. The figure summarizes the steps:
  Step 1: each TE worker exposes its external traffic and its D-switch's D-fabric to its parent. (1.1) Each worker collects the internal demand (traffic matrix M = <flow f: in_port, out_port>) and D-fabric (G = <port a, port b, metrics>) in its local region. (1.2) It estimates the parent's demand per port pair on the exposed D-switch. (1.3) It runs Shadow TE on the estimated inter-region and exact intra-region flows to compute the D-fabric for the parent. (1.4) It exposes the updated D-fabric and external demand to the parent.
  Step 2: the root worker computes its Final TE (bandwidth allocation <flow f: (path1: bw1); (path2: bw2)>) and sends the output to its children.
  Step 3: after receiving outputs from the parent, the TE workers at each level compute their Final TE in parallel.
  Step 4: each TE worker programs tunnels on its D-switches to enforce the Final TE results.
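The overall control flow of Fig. 2 can be summarized as two recursive passes over the worker tree. The sketch below shows this skeleton; the method names (estimate_parent_demand, run_lp, build_dfabric, install_rules, flows_for) are hypothetical stand-ins for the operations described in the text, not Dragon's actual interfaces.

```python
# A compact sketch of one TE epoch: bottom-up Shadow TE, then top-down Final TE.
def shadow_te_pass(worker):
    """Bottom-up: children first, then this worker computes the D-fabric it exposes."""
    for child in worker.children:
        shadow_te_pass(child)
    est_inter = worker.estimate_parent_demand()          # greedy estimate (Sec. III-B)
    worker.shadow_alloc = worker.run_lp(worker.local_flows + est_inter, mode="shadow")
    worker.dfabric = worker.build_dfabric(worker.shadow_alloc)

def final_te_pass(worker, inter_region_flows):
    """Top-down: the parent's allocation becomes exact inter-region demand here."""
    alloc = worker.run_lp(worker.local_flows + inter_region_flows, mode="final")
    worker.install_rules(alloc)
    for child in worker.children:                        # children at one level can run in parallel
        final_te_pass(child, worker.flows_for(child, alloc))

def te_epoch(root):
    shadow_te_pass(root)       # Step 1: leaves to root
    final_te_pass(root, [])    # Steps 2-4: root to leaves
```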

A. Distributing Network Flows on TE Workers

One key input to Dragon is the set of flows (the traffic between two access switches) to be scheduled. Our manager node collects the flows and their bandwidth demands in the next time interval from the controller services [1]. Clearly, since a global flow optimization by the root worker is not scalable, our goal is to distribute the flows among the hierarchical regions. Dragon greedily assigns more flows to the lower-level regions, where the hierarchy has more branches. This allows Dragon to achieve better parallelism in the TE computation. We instruct workers to recursively collaborate with each other to place flows in the hierarchical regions. Dragon's recursive traffic demand distribution is simple and runs in each TE epoch. Starting from the leaf level, each leaf worker first collects flow demands from the manager node running on the SDN controller. Then, each worker looks up the destination access switch in its region. If the switch or the flow's egress point is not in its region, the worker offloads the flow to its parent through the abstract D-switches. In particular, the worker gives the parent the illusion of flow requests in the upper-level parent region. If the sink location falls inside its region, the leaf worker greedily hides the flow from its parent and takes responsibility for the flow. Recursively upward, at each level, the other workers perform exactly the same operation on their flows. Intuitively, this lightweight procedure assigns each flow request to one region in the hierarchy while also putting more flows in the lower levels. Fig. 3 shows this in a two-level Dragon, where a leaf TE worker exported flow F2 to the root region and kept flow F1 locally.

B. Succinct Dragon Fabric (D-fabric) Abstraction and Recursive Schemes for Maximizing TE Results Efficiency

After the flow assignment process, the root TE worker can begin the recursive TE from its region (§ II). However, the recursive construction of D-switches can cause a TE worker to overestimate the resources and performance of its child regions. Given that existing compact topology representation techniques [25], [26] are not effective for centralized TE (§ VII), to avoid this challenge we propose to extend the D-switch abstraction with the Dragon Fabric (D-fabric) concept, which associates each pair of D-switch ports with effective hop count, latency, and bandwidth information. Each worker constructs a new D-fabric for its immediate parent in each TE epoch.² To better expose the true capacity of a region, rather than relying on single-path algorithms, the TE worker computes multiple paths between the border ports in its region. For each port pair, it approximates the D-fabric constraints to be the maximum of hop counts, the maximum of latencies, and the sum of available bandwidths over the paths. While the D-fabric idea is simple, computing accurate bandwidth constraints is challenging because of shared bottleneck links and the dependency between a TE worker and its child workers. To pinpoint the problem, we first characterize the traffic interaction between a parent and its child regions.

²If the border ports of a TE worker's region map to s physical switches, the number of constraints of the abstract D-switch can be reduced to 3s(s − 1)/2, which is reasonably small and decreases as we move up in the hierarchy.

Challenge of computing the bandwidth in D-fabrics. In each TE epoch, there can be two types of traffic in a region: (1) intra-region flow requests that are assigned to each TE worker in our recursive flow assignment process (see § III-A), and (2) inter-region flows that are assigned to a TE worker by its parent in the top-down TE computation. The problem is that a TE worker cannot easily report the remaining bandwidth to be used by the parent's inter-region flows, because it does not know how its intra-region and inter-region flows, which can go through shared bottleneck links, will be distributed in its region after its local TE. Moreover, unlike intra-region flows, whose near-exact bandwidth demands are known because they are fetched directly from the controller, the bandwidth request of each inter-region flow is unknown. This challenge is shown in Fig. 3. Our key idea to resolve this problem is that if a TE worker knows the inter-region flow demand from the parent a priori for each port pair on the exposed D-switch, it can run TE to compute the D-fabric for the parent.

Dragon's efficient and scalable two-step recursive TE computation. Based on the above intuition, we propose a recursive two-step TE computation for Dragon. Step 1: recursive bottom-up TE for D-fabric computation (Shadow TE). In a TE epoch, we compute effective D-fabrics from the leaf level to the root level. Each TE worker learns its exact intra-region flows using the flow assignment procedure (§ III-A) and estimates the bandwidth request of each inter-region flow (using the algorithm discussed shortly). As shown in Fig. 2, with these two flow types as inputs, it runs a flow optimization based on its region objective and constraints, called Shadow TE. The Shadow TE results in an approximate bandwidth allocation to the estimated inter-region flows in the coexistence of the exact intra-region flows. Using the approximate allocation, the TE worker easily computes the D-fabric for the parent. After the above step, the TE worker goes into a sleep mode. Recursively, this freezes the control tree until the root fetches its local D-fabrics. Step 2: recursive top-down TE for network-wide flow scheduling (Final TE). When the root TE worker has learned the D-fabrics and flows in its region, it starts the top-down TE computation in the current TE epoch. First, the root runs its TE by computing multiple paths for each of its flow requests, running a flow optimization using LP to allocate bandwidth to flows along their paths, and programming its region. The root worker takes the D-fabric constraints into account in its path computation and bandwidth allocation so that it can make efficient and feasible TE decisions. When a child TE worker receives the parent's forwarding rules on the D-switch, it views them as exact inter-region flow requests in its region. Recursively, each child TE worker runs its TE on its exact intra-region and inter-region flows until reaching the leaf level. A TE worker can flexibly distribute the parent traffic in its region as long as it meets the bandwidth and performance constraints promised to its parent through the D-switch's D-fabric. We refer to the TE that workers run in this phase as Final TE.

Fig. 3: Flow distribution procedure and the bandwidth computation challenge in D-fabrics. In the example, the root keeps flow F2: (R1→R2: 10) while the leaf worker TW1 keeps flow F1: (A→D: 15) and sees the parent flow FP: (B→C: ?). TW1 does not know how its intra-region flow (F1, with a known bandwidth request) and the parent's inter-region flow (FP, with an unknown bandwidth request) will be distributed in its region, and thus cannot easily compute the D-fabric for its parent.
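To make the D-fabric construction rule above concrete, the sketch below builds one D-fabric entry for a border port pair from k paths, using the stated aggregation (maximum hop count, maximum latency, summed available bandwidth). It is an illustration under assumed edge attributes ("latency", "avail_bw") on a networkx graph, not Dragon's implementation.

```python
# Illustrative sketch of computing one D-fabric entry (Sec. III-B aggregation rule).
from itertools import islice
import networkx as nx

def dfabric_entry(region_graph, port_a, port_b, k=3):
    """Approximate the (hops, latency, bandwidth) constraints exposed for one port pair."""
    paths = list(islice(nx.shortest_simple_paths(region_graph, port_a, port_b, weight="latency"), k))
    hops, latencies, bandwidths = [], [], []
    for path in paths:
        edges = list(zip(path, path[1:]))
        hops.append(len(edges))
        latencies.append(sum(region_graph[u][v]["latency"] for u, v in edges))
        # A path's available bandwidth is limited by its tightest link.
        bandwidths.append(min(region_graph[u][v]["avail_bw"] for u, v in edges))
    return {
        "hops": max(hops),             # effective hop count
        "latency": max(latencies),     # effective latency
        "bandwidth": sum(bandwidths),  # optimistic when paths share bottleneck links (Sec. III-B)
    }
```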


Dragon's greedy estimation of parent flows in Shadow TE. In our bottom-up recursive process, each TE worker must estimate the bandwidth demand of each inter-region flow from the parent in the next TE epoch. While there are many techniques for predicting traffic matrices in general, most of them (e.g., [24], [27]) compute a convex combination of previous matrices and report it as the next matrix. Although they are effective for predicting local/transit user traffic between ingress/egress points in an ISP network, they cannot be borrowed in Dragon. First, the next parent traffic matrix on the D-switch is confined by the current D-fabric exposed by the child and may not fall in the convex hull of the previous matrices. Second, the next traffic matrix will be the product of the Final TE of multiple ancestor workers in their regions and may not follow predictable temporal patterns. As these two factors make designing accurate predictors hard, we have found that the following greedy heuristic is more efficient in practice.

In the bottom-up process at TE epoch k, consider a TE worker that wants to expose the D-switch s with the port pairs P_s to its parent. For each port pair p ∈ P_s, the worker estimates the bandwidth demand of the corresponding inter-region flow f to be d̂_f^k, runs its Shadow TE, builds the latest D-fabric, and exposes it to the parent. In the top-down process, it receives the exact bandwidth demand of the inter-region flows from the parent, finishes its Final TE, and then greedily refines the bandwidth estimate of the inter-region flow f from d̂_f^k to d̂_f^{k+1}. If the fraction of bandwidth used by the parent on port pair p is more/less than c_1/c_2, the worker increases/reduces the estimated demand on p by γ_1/γ_2. If the fraction falls in (c_2, c_1), the estimate does not change for the next epoch. In epoch k+1, the worker uses the updated estimates in its Shadow TE to compute a new D-fabric. The greediness parameters are chosen experimentally but can be tuned over time.

C. Recursive Linear Programs for Intradomain TE

Based on the above design, we present the recursive linear programs used by TE workers in their Shadow and Final TE. These programs are designed to be general, allowing each TE worker the flexibility of having its own objective function (e.g., fault tolerance, throughput maximization).

Optimization programs. Assume a TE worker stores its region as a graph G = (S, E), where S and E are the sets of D-switches and links. Each aggregate flow f ∈ F has the format (src_f, dst_f, d_f), including the source and sink D-switch ports and its exact bandwidth demand. For each flow f, the worker computes multiple paths (tunnels), T_f, and allocates x_f^t of bandwidth to each tunnel t ∈ T_f. As shown in Fig. 4b, in both the Shadow and Final programs, the worker maximizes its region-specific objective function R(X, W) over two sets of variables: (1) X, which is common to all TE workers and consists of the bandwidth allocated to each flow f on each of its tunnels, i.e., x_f^t, totaling b_f; and (2) W, which consists of variables specific to the worker's objective function.

(a) Parameters:
  P_s         set of port pairs on D-switch s in the region
  g_s^p       available bandwidth of port pair p in the D-fabric of D-switch s
  c_e         bandwidth capacity of link e in E
  F           set of intra-region and inter-region flows
  W           set of region-specific variables in the objective function
  d_f         exact bandwidth demand of flow f
  b_f         allocated bandwidth to flow f
  d̂_f         estimated bandwidth demand of inter-region flow f
  b̂_f         exposed bandwidth for inter-region flow f
  T_f         tunnel set for flow f
  x_f^t       allocated bandwidth to tunnel t of flow f
  ψ(t, e)     1 if tunnel t contains link e, otherwise 0
  ξ(t, s, p)  1 if tunnel t uses port pair p on D-switch s, otherwise 0
  λ(f)        1 if flow f is inter-region (i.e., pushed by the parent), otherwise 0

(b) Shadow and Final TE programs:
  max R(X, W)                                                        (region-specific goal)
  s.t.  region-specific constraints                                                  (C0)
        Σ_{f∈F, t∈T_f} ψ(t, e) · x_f^t ≤ c_e,       ∀e ∈ E                           (C1)
        Σ_{f∈F, t∈T_f} ξ(t, s, p) · x_f^t ≤ g_s^p,  ∀s ∈ S, p ∈ P_s                  (C2)
        Σ_{t∈T_f} x_f^t = b_f,                      ∀f ∈ F                           (C3)
        x_f^t ≥ 0,                                  ∀f ∈ F, t ∈ T_f                  (C4)
        λ(f) · min(b̂_f, d_f) ≤ b_f ≤ d_f,           ∀f ∈ F              (C5-Final TE)
        0 ≤ b_f ≤ (1 − λ(f)) · d_f + λ(f) · d̂_f,    ∀f ∈ F             (C5-Shadow TE)

Fig. 4: Dragon TE's recursive flow optimization programs.

Constraints. The worker's objective is subject to the constraint sets (C0-C4), common to the Shadow and Final TE (parameters in Fig. 4a): (C0) constraints that are specific to the worker's region and its objective; (C1) the total bandwidth allocated to tunnels traversing a link must be at most the link's bandwidth capacity; (C2) the total bandwidth allocated on each D-switch port pair must be at most its corresponding bandwidth constraint in the D-fabric; (C3) the bandwidth allocated to a flow equals the sum of the bandwidths allocated to its tunnels; (C4) the bandwidth allocated to each tunnel is non-negative; and (C5) constraints that differ between the Shadow and Final TE programs. Shadow TE must ensure that the bandwidth allocated to the parent flows on the D-switch is at most what our greedy heuristic estimates and that the bandwidth allocated to intra-region flows is at most their exact demand. In turn, Final TE must ensure that the bandwidth granted to each flow (regardless of its type) is at most its demand and that the bandwidth of each inter-region flow is at least what the worker promised to the parent through the D-fabric.

Objective functions. Although our programs are general enough to accommodate various objectives, we focus on combining two popular functions to demonstrate the multi-objective efficiency of Dragon TE (see § VI): 1) maximizing the network throughput, max(Σ_{f∈F} b_f); and 2) load balancing (minimizing the maximum link utilization), min(u_max − ε·Σ_{f∈F} b_f) subject to u_max ≥ u_e, ∀e ∈ E, where u_max is the maximum link utilization, u_e is the utilization of link e, and ε is a small positive weight.
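The sketch below instantiates the Final TE program of Fig. 4b with the max-throughput objective, written with the PuLP modeling library as a stand-in for the paper's CVXOPT-based solver. The data layout (flows, tunnels, capacities, promised bandwidths) is an assumption chosen only to keep the example self-contained.

```python
# Illustrative Final TE LP (Fig. 4b, max-throughput objective), not Dragon's solver code.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

def final_te(flows, tunnels, link_cap, dfabric_cap, promised_bw):
    """flows: {f: (demand, is_inter_region)}; tunnels: {f: {t: (links, port_pairs)}};
    link_cap: {e: c_e}; dfabric_cap: {(s, p): g_s^p}; promised_bw: {f: b_hat_f}."""
    prob = LpProblem("dragon_final_te", LpMaximize)
    x = {(f, t): LpVariable(f"x_{f}_{t}", lowBound=0)               # (C4)
         for f in tunnels for t in tunnels[f]}
    b = {f: lpSum(x[f, t] for t in tunnels[f]) for f in flows}      # (C3) b_f = sum over tunnels
    prob += lpSum(b.values())                                       # objective: max total throughput

    for e, cap in link_cap.items():                                 # (C1) link capacities
        prob += lpSum(x[f, t] for f in tunnels
                      for t, (links, _) in tunnels[f].items() if e in links) <= cap
    for sp, cap in dfabric_cap.items():                             # (C2) D-fabric port-pair limits
        prob += lpSum(x[f, t] for f in tunnels
                      for t, (_, pairs) in tunnels[f].items() if sp in pairs) <= cap
    for f, (demand, is_inter) in flows.items():                     # (C5-Final)
        prob += b[f] <= demand
        if is_inter:
            prob += b[f] >= min(promised_bw[f], demand)
    prob.solve()
    return {key: var.value() for key, var in x.items()}
```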


IV. JOINT RECURSIVE INTER-INTRADOMAIN TE

The previous section described our scalable and efficient TE for intradomain traffic (flow requests with fixed source and sink points inside the network). However, most of the flows in an ISP network are destined to IP prefixes outside the network, on the Internet. Usually, there exist multiple options of egress points and interdomain routes to be chosen for these flows. Recent centralized TE methods (e.g., [3], [1]) make interdomain and intradomain routing decisions in a two-step and suboptimal process (§ II). Meanwhile, traditional non-SDN systems that formulate a joint optimization have serious performance and scalability limitations (§ VII). In this section, we describe how Dragon hierarchically and jointly optimizes these decisions based on QoS metrics (§ IV-A). To guarantee both performance and scalability, we distribute the joint TE load among hierarchical regions and workers (§ IV-B). Finally, we explain how we adapt our recursive linear programs of § III to the joint TE (§ IV-C) and handle routing changes (§ IV-D).

A. Joint Selection of Intradomain and Interdomain Paths

To handle egress traffic, each TE worker in Dragon runs a BGP process that establishes BGP sessions with local BGP speakers in neighbor ISPs and with its parent through the abstract D-switch. Dragon uses these recursive sessions to disseminate interdomain routes, without global visibility, to the proper regions, giving them the illusion of real egress/peering points. In Dragon, when a leaf TE worker sends a BGP message to its parent, it extends the message with a few performance metrics of the route (e.g., latency and bandwidth to the destination) obtained from periodic measurements. Dragon uses this approach to realize a new end-to-end routing logic that jointly optimizes intradomain and interdomain paths. We first explain how Dragon as a whole (without the TE worker abstraction) selects a single best route for each aggregate flow destined to an IP prefix on the Internet.

We propose to first compare candidate interdomain routes by the operator's policy for routing stability [28] and then break ties by ranking them based on the end-to-end (intradomain+interdomain) performance experienced by the flow. Our end-to-end QoS-driven routing does not disturb global BGP convergence, although it might choose and announce new routes for better performance faster in each epoch. Moreover, we have Dragon collect sufficient measurement samples for each BGP route to form stable metrics. Finally, we extend Dragon to select multiple end-to-end paths to efficiently utilize the network resources; we use the BGP add-path mechanism to select more than one interdomain (BGP) route for a flow [29].

B. Scaling Joint Routing to a Large Number of IP Prefixes

Dragon's routing logic improves the TE end-to-end performance (§ VI). But to achieve scalability, the key question is where this route selection should take place in the hierarchy. One naive way is to send all the routes and flows to the root TE worker (e.g., RCP [3]), which then selects the best route based on the end-to-end metric for all flows. The selected routes are then recursively pushed down through the BGP sessions to be advertised to neighbor ISPs. We call this approach global route selection, which clearly is not scalable given the hundreds of thousands of address prefixes and millions of aggregate flows in an ISP network. To improve both scalability and performance, we leverage Dragon's hierarchical structure to intelligently distribute our end-to-end route selection and LP loads to different levels, while satisfying performance requirements. In our approach, the ISP defines a decision threshold in terms of a performance requirement (e.g., maximum latency) for each aggregate egress flow. Recursively upward from the leaf level, for each flow in a leaf region, each TE worker first locally runs our end-to-end route selection. If the selected route satisfies the performance requirement of the flow, it decides to exit and engineer the flow from its region. Otherwise, it exposes the flow to its parent, which performs the same operation. We explain our design in Fig. 5 with two aggregate flows destined to external prefixes P1 and P2 entering region 2 from neighboring ISP N1. In this case, TE worker TW2 needs to select one best BGP route for each. For prefix P1, since a local end-to-end path via neighboring ISP N2 satisfies the latency constraint, TW2 locally selects the constituent interdomain route, announces it to N1 and N2, and also exposes it to the root from (DS2, A). For prefix P2, it cannot meet the requirement and thus exports the flow request from (DS2, B) to the root region. The root worker is aware of an efficient end-to-end path to prefix P2 through N3, selects the constituent BGP route, and advertises it to N1 through D-switch DS2.

Fig. 5: Dragon's scalable QoS-based end-to-end (joint interdomain and intradomain) routing before bandwidth allocation.
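A minimal sketch of this per-worker decision is shown below; a worker keeps an egress flow if some locally known end-to-end path meets the flow's latency requirement and otherwise exports it to its parent. The flow and route field names are illustrative assumptions.

```python
# Sketch of hierarchical egress route selection with an ISP-defined latency threshold.
def select_egress_route(worker, flow):
    """flow.prefix is the external destination; flow.max_latency is the decision threshold."""
    candidates = worker.bgp_routes(flow.prefix)          # routes learned from children/neighbors
    # Rank by operator policy first, then by end-to-end (intradomain + interdomain) latency.
    candidates.sort(key=lambda r: (r.policy_rank, r.intra_latency + r.inter_latency))
    for route in candidates:
        if route.intra_latency + route.inter_latency <= flow.max_latency:
            worker.engineer_locally(flow, route)         # announce to neighbors, expose upward
            return route
    if worker.parent is not None:
        worker.parent.receive_flow(flow)                 # let an ancestor region decide instead
    return None
```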
C. Putting It Together: Unified Topology Abstraction for Joint TE

To run our scalable end-to-end routing and then allocate bandwidth to both the ISP's egress and local flows, we need to create unified topologies containing both interdomain and intradomain nodes in each region. Basically, we extend the topology of each region with IP prefixes as sink points of egress flows. Since the number of prefixes is huge, we cluster them into External Virtual (EV) nodes and then connect the EV nodes to egress switches. Clustering can be based on various factors (e.g., reachability: each set contains prefixes with the same BGP path from all egress points). We propose to group prefixes based on QoS metrics, for compatibility with our end-to-end routing logic. Therefore, we have each EV node contain prefixes with similar QoS metrics from all egress points. For example, EV1 in Fig. 6 contains prefixes P1 and P2 since both have similar external latencies (i.e., 100 and 150 ms) from peer switches A and B in the region. After forming unified topologies in all regions, Dragon first runs the scalable end-to-end routing in each epoch, followed by the two-step linear programs to allocate bandwidth to both egress and local flows.
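The sketch below illustrates one simple way to realize the EV-node grouping just described, clustering prefixes whose per-egress latencies are close. The similarity threshold and data layout are assumptions for illustration only; the paper does not prescribe a specific clustering algorithm.

```python
# Illustrative QoS-based clustering of external prefixes into EV nodes.
def cluster_prefixes(prefix_latencies, threshold_ms=75):
    """prefix_latencies: {prefix: {egress_switch: latency_ms}} -> list of prefix sets (EV nodes)."""
    ev_nodes = []   # each entry: (representative latency vector, set of member prefixes)
    for prefix, lat in prefix_latencies.items():
        placed = False
        for rep, members in ev_nodes:
            # Join an EV node if the latency from every egress is close to the representative's.
            if all(abs(lat[e] - rep[e]) <= threshold_ms for e in rep):
                members.add(prefix)
                placed = True
                break
        if not placed:
            ev_nodes.append((dict(lat), {prefix}))
    return [members for _, members in ev_nodes]

# Example: P1 and P2 end up in one EV node (cf. Fig. 6), P3 in another.
ev = cluster_prefixes({"P1": {"A": 100, "B": 150},
                       "P2": {"A": 120, "B": 140},
                       "P3": {"A": 300, "B": 500}})
```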


D. Handling Routing Changes Between Epochs

Dragon engineers interdomain and intradomain flows iteratively. In the joint TE context, unplanned BGP route updates might arrive between two epochs, forcing Dragon to change flow egress points. In this case, recomputing the entire TE from scratch is inefficient, so Dragon provides a hierarchical greedy packing approach to find temporary solutions between two epochs. In this method, from the leaf level upward, when a TE worker learns of an interdomain route change for an egress flow in its region, it checks whether it has an alternative end-to-end (interdomain+intradomain) path locally. If so, the TE worker selects it and greedily establishes paths until it can recover the previously granted bandwidth for the flow. Otherwise, it exposes the affected egress flow to its parent. Recursively, this procedure continues until a TE worker with a satisfying egress route in its region is discovered. That TE worker performs the same greedy approach to recover the bandwidth. This method is effective because it minimizes changes to paths and bandwidth allocations.

Fig. 6: Constructing a unified topology in a logical region by compressing address prefixes (e.g., EV1 = {P1, P2}, EV2 = {P3, P4}) to systematically adapt our recursive intra-region TE techniques to joint TE.

V. TE ENFORCEMENT: RECURSIVE SDN TUNNELING

Given the TE program output, each TE worker configures a set of label-switched paths within its region and carries traffic through them. Each local ingress D-switch maps incoming traffic to a set of tunnels according to the TE output and pushes the right labels onto the partial traffic going through each partial tunnel. In this section, we optimize the amount of state generated by TE workers for the physical SDN data plane. We first minimize the packet header size and then the number of forwarding rules in SDN switches (simultaneous optimization is not possible without global visibility).

A. Optimized Hierarchical Label Swapping

Minimizing packet overhead. Following the hierarchical structure at the control plane, tunnel labels in the data plane are also designed recursively. Instead of stacking the labels from all the levels [30], each in a different packet header, which would significantly increase the packet size, we use a label swapping method. Each TE worker determines the set of labels to use for its tunnels independently. When a flow enters a non-leaf region, the ingress D-switch of that local region pops the parent's label and pushes a local label for each of its internal tunnels toward the egress. The intermediate D-switches forward traffic based on the local label. The egress D-switch pops the local label, pushes back the parent's label, and forwards to the next region. This action applies to the regions, at different levels, that the traffic (a flow) traverses. We need to push the parent's label back because a label specified by an ancestor TE worker needs to be visible across inter-D-switch links in its region. Fig. 7 shows this process, where the root worker TW3 instructs its intermediate D-switch to forward traffic having label R, and this tunnel is recursively translated in the lower-level regions using our label swapping approach. Meanwhile, each packet always carries exactly one label on each link, determined by the ancestor TE worker that has visibility of the link.

Fig. 7: Minimizing the packet header size and the number of rules in the physical SDN data plane (local label R, global label F; push/pop label actions at region boundaries).

Minimizing rules with global labels. In the above basic design, each local tunnel is associated with a unique local label in each of our hierarchical regions. Thus, the number of rules needed to pop the local tunnel label at an egress switch (physical/Dragon) in a region equals the number of local tunnels, which increases multiplicatively when moving down from the root to the leaf level. To reduce the rules needed at an egress switch, we introduce global labels, each of which uniquely identifies a flow between its ingress and egress points in all the hierarchical regions it traverses. Using the global label, an egress switch performs the action "match on the global label, pop any local labels, and push the parent label". In this case, we only need one rule instead of n rules for each flow in any region, where n is the total number of local tunnels. Because the traffic is bidirectional, the ingress switch also benefits from the same optimization. We need to match on an inner label and then pop the outer label, which can be implemented in P4-enabled SDN switches supporting user-defined switch actions.


switches). At the end, we present a summary of results in our is highly scalable since it reduces the total time by 98% (from
experimentation with other ISP topologies in the RocketFuel 30 min to 30 sec.) for the Ebone and by 68% (from 889 ms
dataset. We perform TE every 5 minutes for 80 hours. Unlike to 283 ms) for Internet 2 with the optimal k.
Internet-2, RocketFuel does not come with real traffic matrices TABLE I: Linear program size (O(mn)) on Ebone w.r.t the number
and link capacities. Therefore, we generate traffic using the of constraints (m) and variables (n). Dragon efficiently distributes
widely-used cyclostationary model [32]. To estimate link the TE load among its TE workers.
capacities, we leverage a popular centrality model [33]. To Dragon
SD-TE
Root L1 L2 L3 L4 L5 L6 L7
model egress points and interaction with other ISPs based on
m 4538 441 2708 370 1830 1535 688 1440 22556
BGP, we extract the interdomain routes from Planetlab nodes to
n 1848 338 2274 294 1541 1281 557 1218 19384
all destination IP prefix based on iPlane [34]. Each PlanetLab O(mn) 8E6 1E5 6E6 1E5 3E6 2E6 4E5 2E6 4E8
node serves as a neighbor ISP of a Dragon ISP. Similar to
prior work [7], we obtain the latency of BGP paths from each TE complexity. To understand Dragon’s scalability nature,
neighbor to all destinations by aggregating iPlane snapshots. we show the average size of LPs solved over all epochs by
Metrics and baseline. We use four metrics: (1) TE scalability in terms of computation time and state size in SDN switches, (2) TE overhead in terms of solution optimality, (3) performance in terms of RTT delay in inter-intra domain TE, and (4) solution efficiency in multi-objective region-specific TE. Our baseline is today's best software-defined TE system (referred to as SD-TE), which (1) performs optimal interdomain TE based on RCP [3], (2) executes optimal multi-path intradomain TE for a desired objective function by leveraging the core LP repeatedly used in previous DC works [2], [1], (3) enforces TE results leveraging SoftMoW's [7] advanced SDN-based tunneling protocol, and (4) supports a state-of-the-art hierarchical multi-objective TE [9] by iteratively scheduling inter-region flows with a desired function, computing the remaining link capacities, and then optimizing intra-region flows with the next objective function. SD-TE is implemented on top of the RYU controller.

B. Scalability of the Dragon Framework

We quantify Dragon's scalability benefits by focusing on the intradomain TE, where flow requests are between fixed source and sink points, and by choosing the network throughput maximization objective for all TE workers.
Fig. 8: Average TE computation time versus the maximum number of tunnels (k) for (a) Ebone (in seconds) and (b) Internet 2 (in ms). Dragon reduces the TE time by several orders of magnitude.
TE computation time. Fig. 8 shows the computation time for the SD-TE and Dragon TE. At most k tunnels are established between access switch pairs [35]. The optimal k is 4 and 20 for Internet 2 and Ebone respectively, indicating that a larger k does not increase the network throughput noticeably. We make the following observations. First, SD-TE has several orders of magnitude higher computation time for Ebone compared to Internet-2. Second, increasing k substantially intensifies the computation time of SD-TE. These two trends confirm that SD-TE cannot scale to very large networks. Third, Dragon is highly scalable since it reduces the total time by 98% (from 30 min to 30 sec.) for Ebone and by 68% (from 889 ms to 283 ms) for Internet 2 with the optimal k.
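To make this setup concrete, the sketch below (illustrative only, not taken from the Dragon prototype) enumerates up to k candidate tunnels per access-switch pair with Yen's k-shortest-paths algorithm [35], here via networkx's shortest_simple_paths; the toy topology, the latency edge weight, and the function name are assumptions made for the example.

# Sketch: enumerate at most k candidate tunnels per access-switch pair,
# assuming the ISP topology is available as a weighted networkx graph.
from itertools import islice
import networkx as nx

def candidate_tunnels(topo: nx.Graph, access_pairs, k: int):
    """Return up to k loopless paths (Yen's algorithm) for each pair."""
    tunnels = {}
    for src, dst in access_pairs:
        # shortest_simple_paths yields simple paths in increasing weight order.
        paths = nx.shortest_simple_paths(topo, src, dst, weight="latency")
        tunnels[(src, dst)] = list(islice(paths, k))
    return tunnels

# Example with a toy 4-switch topology and the optimal k=4 used for Internet 2.
topo = nx.Graph()
topo.add_weighted_edges_from(
    [("s1", "s2", 1), ("s2", "s4", 1), ("s1", "s3", 2), ("s3", "s4", 2)],
    weight="latency")
print(candidate_tunnels(topo, [("s1", "s4")], k=4))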
TABLE I: Linear program size (O(mn)) on Ebone w.r.t. the number of constraints (m) and variables (n). Dragon efficiently distributes the TE load among its TE workers.

         Dragon                                                        SD-TE
         Root    L1     L2     L3     L4     L5     L6     L7
m        4538    441    2708   370    1830   1535   688    1440       22556
n        1848    338    2274   294    1541   1281   557    1218       19384
O(mn)    8E6     1E5    6E6    1E5    3E6    2E6    4E5    2E6        4E8

TE complexity. To understand Dragon's scalability, we show the average size of the LPs solved over all epochs by Dragon and SD-TE on Ebone (Internet-2 is similar). Given that LPs have exponential worst-case complexity, we observe from Table I that Dragon is more agile and scalable. This is because it distributes small LP problems among TE workers, which are solved mostly in parallel. In particular, each TE worker is assigned problems that are 50-1000× smaller than those solved by SD-TE.

TABLE II: Data plane scalability. Dragon minimizes the number of forwarding states in physical switches. N/A in SoftMoW is because it does not enforce bandwidth allocations.

                     Dragon                            SD-TE
                Internet-2       Ebone           Internet-2        Ebone
                Avg.   Max.   Avg.    Max.     Avg.    Max.    Avg.     Max.
#Flow Rules     21.82  38     467.04  3048     63.27   124     1577.85  9144
#Group Rules    7.82   14     4.72    34.0     N/A     N/A     N/A      N/A
#Meter Rules    9.64   16     68.35   510.0    N/A     N/A     N/A      N/A

Data plane scalability. To show Dragon's data plane scalability, we measure the average number of flow, group, and meter rules installed in the physical (level-1) SDN switches. As shown in Table II, we observe that the maximum numbers of flow entries used by Dragon are 69.3% and 66.6% smaller than SD-TE on Internet-2 and Ebone respectively. This is because Dragon uses global labels for minimizing switch rules and local labels to implement partial tunnels with small packet header overhead (§ V). Unlike SD-TE, Dragon with more levels needs the same number of rules due to its scalable rule management.

Fig. 9: Network throughput per TE epoch on Internet-2. Our greedy and Shadow TE approaches enable Dragon to compute near-optimal results.

Fig. 10: Scalability-optimality trade-off (TE computation time and network throughput loss, normalized w.r.t. the two-level Dragon). Dragon can scale to large ISP networks by incurring a small amount of overhead.

C. Enhanced Optimality via D-fabric Reconfiguration

Given Dragon's scalability, we evaluate the amount of optimality (throughput) loss caused by our recursive abstraction and greedy D-fabric reconfiguration under different loads.

D-fabric reconfiguration. To iteratively compute D-fabrics for the root, each leaf TE worker estimates the bandwidth demands of the root on the abstract D-switch port pairs based on our greedy method and feeds them into its Shadow TE. We choose the greediness parameters experimentally: if the bandwidth of a port pair is utilized more than 95%, each leaf TE worker increases the estimate by 2 Gbps; if it is below 85%, it decreases the estimate at a lower speed, 500 kbps. Otherwise, the estimate does not change. We also implement a basic Dragon in which each leaf TE worker estimates an equal bandwidth demand for all of its border port pairs. Fig. 9 depicts Internet-2's network throughput for greedy Dragon, basic Dragon, and SD-TE; after the first 40 epochs, greedy Dragon approaches the optimal solutions. However, the basic Dragon does not consider the root worker's traffic patterns and incurs up to 4% throughput loss.
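Summarized in code, the greedy rule above is a simple threshold-based update applied per abstract port pair each epoch; the sketch below only restates the parameters reported here (95% and 85% utilization thresholds, +2 Gbps and -500 kbps steps), and the function and constant names are hypothetical.

# Sketch of the greedy per-port-pair bandwidth estimate a leaf TE worker
# could use when reporting the root's demands on abstract D-switch port pairs.
GBPS = 10**9
KBPS = 10**3

def update_estimate(estimate_bps: float, utilization: float) -> float:
    """utilization = used bandwidth / current estimate for one port pair."""
    if utilization > 0.95:
        return estimate_bps + 2 * GBPS               # grow aggressively when nearly full
    if utilization < 0.85:
        return max(0.0, estimate_bps - 500 * KBPS)   # shrink slowly
    return estimate_bps                              # leave the estimate unchanged

# e.g., a 10 Gbps estimate that is 97% utilized grows to 12 Gbps for the next epoch.
print(update_estimate(10 * GBPS, 0.97) / GBPS)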
Fig. 11: Average network throughput versus traffic scale factor in saturated networks for (a) Ebone and (b) Internet-2. Dragon generates near-optimal results even when the network is highly saturated.

Saturated network. Next, we artificially saturate the data plane to show how much throughput loss Dragon incurs in such scenarios. We linearly scale the traffic matrices such that the demand hits the maximum achievable throughput. Fig. 11 illustrates the average network throughput of Internet-2 and Ebone over 83 hours for various scale factors. It shows that Dragon (greedy) can achieve high throughput even when the network is completely saturated, with at most 1-2% and 3.5-9.8% throughput loss for Internet-2 and Ebone respectively. Given the substantial scalability of Dragon and the current over-provisioning ratios in ISP networks, we believe these loss values are tolerable.

D. Dragon's Scalability-Optimality Trade-off

To further characterize Dragon, we show the trade-off between the TE time and the throughput loss for a varying number of levels. Since our prototype is limited to two levels, for the Ebone topology we compute the expected value of these two metrics in the two-level Dragon with a different number of abstract D-switches at the root region. Then, we carefully and recursively propagate the expected values to estimate the throughput loss and TE time for three- and four-level Dragons. In our estimation, we take into account the actual flow demands in the logical regions of the three- and four-level Dragon. Fig. 10 presents the normalized TE time and throughput loss for various numbers of levels relative to the two-level Dragon. We observe that Dragon can provide large ISPs with tremendous TE agility at a small over-provisioning cost. In particular, the three-level Dragon has 30% smaller TE time with only 8% higher throughput loss compared to the two-level one.

E. Joint Inter-Intra Domain TE

Next, we demonstrate the performance and scalability benefits of Dragon's joint inter-intradomain TE compared to SD-TE, which first makes interdomain TE decisions followed by intradomain optimization. Using the same setup, we generate flows destined to external IP prefixes (outside the ISP) on the Internet and feed the interdomain routing data into our framework.

Fig. 12: End-to-end flow-level delay (CDFs of flow RTT for different numbers of peering points). Dragon is more efficient than SDN-based TE solutions due to jointly optimizing its interdomain and intradomain TE decisions in the network.

Benefits of QoS-driven end-to-end routing. Unlike SD-TE, which follows the standard BGP process (described in § IV), Dragon simultaneously selects interdomain routes and intradomain paths for each external ingress flow based on end-to-end latency (data collection explained in § VI-A). To have a fair comparison with SD-TE, we configure Dragon to conduct the route selection at the root TE worker, leveraging a network-wide view similar to SD-TE. Fig. 12 shows the distribution of flow RTT delays destined to 100K destination prefixes on the Internet during 200 TE epochs for different numbers of peering points. Compared to SD-TE, Dragon lowers the 80th-percentile and maximum delays by 33%-43% and 42%-52% respectively by jointly optimizing interdomain and intradomain paths, minimizing the path inflation that today's two-step TE incurs.
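As a rough illustration of this joint selection (not the paper's actual optimization), the sketch below picks, for each external prefix, the egress point that minimizes the sum of the intradomain path latency and the interdomain BGP path latency; all names and data structures are hypothetical.

# Sketch: joint egress-point and intradomain-path selection by end-to-end latency.
# intra_latency[(ingress, egress)]: latency of the best internal path (ms)
# bgp_latency[(egress, prefix)]:    latency of the BGP path from egress to prefix (ms)

def pick_route(ingress, prefix, egress_points, intra_latency, bgp_latency):
    best = None
    for egress in egress_points:
        if (egress, prefix) not in bgp_latency:
            continue  # this egress has no interdomain route to the prefix
        total = intra_latency[(ingress, egress)] + bgp_latency[(egress, prefix)]
        if best is None or total < best[0]:
            best = (total, egress)
    return best  # (end-to-end latency, chosen egress) or None

intra = {("in1", "eg1"): 10, ("in1", "eg2"): 25}
bgp = {("eg1", "203.0.113.0/24"): 80, ("eg2", "203.0.113.0/24"): 40}
print(pick_route("in1", "203.0.113.0/24", ["eg1", "eg2"], intra, bgp))
# -> (65, 'eg2'); a two-step TE that fixed eg1 first would have paid 90 ms.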
Fig. 13: Trade-off among (a) end-to-end latency, (b) throughput, and (c) load distribution on workers in Dragon's joint TE, as a function of the offloading threshold (ms).

Characteristics of the hierarchical joint TE. Next, we evaluate Dragon's hierarchical joint TE described in § IV and characterize its trade-offs between end-to-end latency, network throughput, and TE workers' load. In our design, Dragon's leaf TE workers perform the joint TE for a flow locally unless the ISP egress points to the destination are not in their region or do not satisfy its latency requirement. In Fig. 13, when the entire load is handled by the root TE worker (threshold of 0), the end-to-end latency in the joint TE is optimal due to the global view. As we offload less to the root (a larger threshold), the TE load gets more balanced among all TE workers because more TE tasks are done locally and in parallel. In this case, the maximum latency becomes larger because a better egress point (end-to-end path) might exist in a neighboring leaf region while a leaf TE worker selects its local routes instead. The last metric is shown in Fig. 13c: a larger threshold increases the throughput as more traffic is sent through local egress points, whereas the global routing results in more flows competing for those egress points with smaller end-to-end latencies. This effect disappears if we select more than one interdomain route for each egress flow.
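A minimal sketch of this leaf-level decision, assuming a per-flow latency threshold in milliseconds; the function and data structures are illustrative and not Dragon's actual interface.

# Sketch: a leaf TE worker serves a flow locally only if one of its own egress
# points reaches the destination prefix within the latency threshold; otherwise
# it escalates the flow to the root TE worker (threshold = 0 sends everything
# to the root, reproducing the fully centralized case in Fig. 13).
def place_flow(prefix, local_egress_latency_ms, threshold_ms):
    # local_egress_latency_ms: {egress_name: latency to `prefix` via that egress}
    feasible = {e: l for e, l in local_egress_latency_ms.items() if l <= threshold_ms}
    if feasible:
        egress = min(feasible, key=feasible.get)
        return ("leaf", egress)
    return ("root", None)

print(place_flow("198.51.100.0/24", {"eg-a": 120, "eg-b": 95}, threshold_ms=100))
print(place_flow("198.51.100.0/24", {"eg-a": 120, "eg-b": 95}, threshold_ms=0))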


Fig. 14: (a) Inter-region throughput and (b) intra-region maximum link utilization for Dragon and SD-TE on Ebone. The root TE worker maximizes network throughput and the leaf workers balance traffic load on links; Dragon efficiently supports region-specific objective functions.
F. Region-Specific Objective Composition
Finally, we compare the efficiency of Dragon and SD-TE in supporting (hierarchical) region-specific multi-objective TE. In our two-level setting, we configure the root TE worker to maximize the network throughput on expensive inter-region links and the leaf TE workers to balance the traffic load on their intra-region links by minimizing the maximum link utilization (MLU). SD-TE performs the same multi-objective TE but with a different implementation (§ VI-A). For 30 epochs on Ebone, Fig. 14 shows that Dragon on average has 3.43% lower inter-region network throughput and 15.46% better intra-region load balancing. Overall, Dragon is more efficient than SD-TE because it provides feedback from the leaf level to the root through the D-fabrics associated with D-switches. However, SD-TE achieves the throughput maximization without taking the load balancing objective into account, as it performs each of these optimizations one at a time. The former can exploit scarce resources within each leaf region and thus limits the choices of the latter.
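To illustrate the kind of per-region objective a leaf TE worker can be configured with, the sketch below formulates a toy MLU-minimization LP with the PuLP package over a three-link region carrying one splittable demand; the topology, demand, and variable names are invented for the example and do not reflect Dragon's actual formulation.

# Sketch: a leaf worker minimizing the maximum link utilization (MLU) of its
# region, while the root separately maximizes throughput on inter-region links.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

links = {"a-b": 10.0, "b-c": 10.0, "a-c": 5.0}      # capacities (Gbps)
paths = {"p1": ["a-b", "b-c"], "p2": ["a-c"]}       # candidate paths for an 8 Gbps demand a->c

prob = LpProblem("leaf_mlu", LpMinimize)
mlu = LpVariable("mlu", lowBound=0)
x = {p: LpVariable(f"x_{p}", lowBound=0) for p in paths}

prob += mlu                                          # objective: minimize MLU
prob += lpSum(x.values()) == 8.0                     # route the full demand
for link, cap in links.items():
    load = lpSum(x[p] for p, ls in paths.items() if link in ls)
    prob += load <= cap * mlu                        # per-link utilization bound

prob.solve()
print({p: v.value() for p, v in x.items()}, mlu.value())

Under the same framework, the root worker would instead solve an analogous LP whose objective maximizes admitted inter-region traffic, which is what allows each region to keep its own objective.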
G. Summary of Experiments with Other Larger Topologies

We repeat all the above experiments for 9 other ISP topologies in the RocketFuel dataset, consisting of around 600-800 switches and 1000-5000 links, and summarize the results for three of them (due to lack of space) in Table III. Our evaluation shows that the significant benefits and negligible overheads of Dragon remain valid for different ISP topologies and traffic patterns.
TABLE III: Summary of Dragon's benefits and overhead compared to SD-TE for other larger ISP topologies.

ISP       TE time ↓ (×)   SDN rules ↓ (%)   Throughput ↓ (%)   Flow RTT ↓ (%)   Multi-objective efficiency ↑ (%)
Verio     722-821         40-62             7-12               55               10-14
Level3    542-600         72-80             5-10               67               20-32
AT&T      384-450         75-79             2-11               48               12-18
VII. RELATED WORK

We discuss related work that has inspired Dragon, especially works not previously considered.

Hierarchical SDN architectures. A body of recent works has already advocated for a hierarchical SDN controller architecture. SoftMoW [7] is the most comprehensive solution for cellular wide-area networks and particularly improves mobility management. RSDN [8] is entirely focused on routing and connectivity problems in networks managed by a hierarchical SDN control plane. Similar to these works, Dragon leverages the hierarchy technique but with some key differences. First, Dragon uses the hierarchy technique for scaling the TE function, does not need to change the SDN controller design, and can deliver the same benefits with any controller architecture in place. Second, the RSDN and SoftMoW works do not study the TE problem (except for a basic design on how the RSDN routing scheme can support load balancing [8]).

Centralized TE. DEFO [14] addresses the TE scalability issue by departing from LP and using constraint programming (CP). While there are some similarities between Dragon and DEFO, the differences are significant. First, DEFO is complementary to our work: although our implementation is based on multipath LP-based TE, our framework is generic and can thus have its hierarchical workers execute DEFO (or CP) programs for further TE acceleration. Second, DEFO does not jointly optimize interdomain and intradomain traffic and does not support hierarchical multi-objective optimization.

Hierarchical routing architectures. In the past, notable efforts have leveraged similar hierarchy and recursive topology aggregation techniques to scale legacy distributed routing protocols (e.g., PNNI [21], Nimrod [22]). Due to their distributed nature, these systems are inefficient from a TE perspective, and poor network utilization, reservation failures, and congestion are inherent to them. Some lossy representations of abstracted topologies have been proposed for these legacy systems (e.g., [25], [36], [26]). Although Dragon could use these techniques to further reduce the size of D-fabrics, they create substantial throughput loss in our centralized TE.

Joint inter-intradomain TE. RCP [3] is a flat centralized routing design that selects the best BGP route on behalf of all routers in an ISP. Similarly, SDX [4] manages routes of multiple ISPs at exchange points. Unlike them, Dragon jointly optimizes both intradomain and interdomain decisions, motivated by multiple studies [19], [18] on hot-potato routing. Our approach also differs from prior optimizations that solve the joint TE problem with bi-criteria techniques [16], [37], which are intractable for large networks, limited to single-path TE, and lack efficient support for handling interdomain routing updates. Using SDN programmability, Dragon offers an end-to-end approach based on QoS metrics, which is both scalable and efficient.

VIII. CONCLUSION

This paper proposes a new method for SDN-based TE in large ISP networks. Because of its hierarchical design, it can support TE for large-sized networks with thousands of constraints. Because of its scalability advantage, we can solve the intradomain and interdomain resource allocation jointly. Because of its hierarchical and region-based optimization structure, it allows the operator to flexibly define the optimization objective for different parts of the network. As part of our future work, we will investigate how other SDN features, such as on-demand measurement, can be used to improve TE effectiveness in practice.

ACKNOWLEDGMENTS

This work is supported by Ericsson Research and NSF grants CNS-1629894 and CNS-1345226, and was conducted while Mehrdad, Ying, and Ravi were at Ericsson Research.


REFERENCES

[1] C.-Y. Hong et al., "Achieving high utilization with software-driven WAN," in Proc. ACM SIGCOMM, 2013.
[2] H. H. Liu et al., "Traffic engineering with forward fault correction," in Proc. ACM SIGCOMM, 2014.
[3] M. Caesar et al., "Design and implementation of a routing control platform," in Proc. USENIX NSDI, 2005.
[4] A. Gupta et al., "SDX: A software defined internet exchange," in Proc. ACM SIGCOMM, 2014.
[5] N. Spring et al., "The causes of path inflation," in Proc. ACM SIGCOMM, 2003.
[6] R. Teixeira et al., "Dynamics of hot-potato routing in IP networks," ACM SIGMETRICS, 2004.
[7] M. Moradi et al., "SoftMoW: Recursive and reconfigurable cellular WAN architecture," in Proc. ACM CoNEXT, 2014.
[8] J. McCauley et al., "Recursive SDN for carrier networks," arXiv preprint arXiv:1605.07734, 2016.
[9] A. S. Manne, "Linear programming and sequential decisions," Management Science, 1960.
[10] M. Moradi et al., "SoftBox: A customizable, low-latency, and scalable 5G core network architecture," 2018.
[11] S. Jain et al., "B4: Experience with a globally-deployed software defined WAN," in Proc. ACM SIGCOMM, 2013.
[12] A. Kumar et al., "BwE: Flexible, hierarchical bandwidth allocation for WAN distributed computing," in Proc. ACM SIGCOMM, 2015.
[13] N. Megiddo, On the complexity of linear programming. IBM Thomas J. Watson Research Division, 1986.
[14] R. Hartert et al., "A declarative and expressive approach to control forwarding paths in carrier-grade networks," in Proc. ACM SIGCOMM, 2015.
[15] R. T. Marler et al., "Survey of multi-objective optimization methods for engineering," Structural and Multidisciplinary Optimization, 2004.
[16] K.-H. Ho et al., "Joint optimization of intra- and inter-autonomous system traffic engineering," IEEE Trans. Network and Service Management, 2009.
[17] "BGP best path," http://goo.gl/pdzJsr, 2014.
[18] S. Agarwal et al., "The impact of BGP dynamics on intra-domain traffic," in Proc. ACM SIGMETRICS, 2004.
[19] S. Agarwal, A. Nucci, and S. Bhattacharyya, "Measuring the shared fate of IGP engineering and interdomain traffic," in Proc. IEEE ICNP, 2005.
[20] P. Berde et al., "ONOS: Towards an open, distributed SDN OS," in Proc. ACM SIGCOMM Workshop on HotSDN, 2014.
[21] The ATM Forum, "Private network-network interface specification version 1.0 (PNNI 1.0)," 1996.
[22] I. Castineyra et al., "The Nimrod routing architecture," IETF RFC 1992.
[23] A. Farrel, J.-P. Vasseur, and J. Ash, "A path computation element (PCE)-based architecture," RFC 4655, Aug. 2006.
[24] H. Wang et al., "COPE: Traffic engineering in dynamic networks," ACM CCR, 2006.
[25] B. Awerbuch et al., "Routing through networks with hierarchical topology aggregation," in Proc. IEEE ISCC, 1998.
[26] W. C. Lee, "Topology aggregation for hierarchical routing in ATM networks," ACM CCR, 1995.
[27] M. Herbster and M. K. Warmuth, "Tracking the best expert," Machine Learning, 1998.
[28] L. Gao and J. Rexford, "Stable internet routing without global coordination," in Proc. ACM SIGMETRICS, 2000.
[29] V. Van den Schrieck et al., "BGP add-paths: The scaling/performance tradeoffs," IEEE JSAC, 2010.
[30] D. O. Awduche and J. Agogbua, "Requirements for traffic engineering over MPLS," 1999.
[31] N. Spring, R. Mahajan, and D. Wetherall, "Measuring ISP topologies with Rocketfuel," ACM CCR, 2002.
[32] A. Nucci et al., "The problem of synthetically generating IP traffic matrices: Initial recommendations," ACM CCR, 2005.
[33] M. E. Newman, "A measure of betweenness centrality based on random walks," Social Networks, 2005.
[34] "iPlane Dataset," http://goo.gl/GCCg2O.
[35] J. Y. Yen, "Finding the k shortest loopless paths in a network," Management Science, 1971.
[36] B. Awerbuch et al., "The effect of network hierarchy structure on performance of ATM PNNI hierarchical routing," IEEE CC Journal, 2000.
[37] Y. Zhang and M. Moradi, "SDN based interdomain and intradomain traffic engineering," US Patent 9,699,116, Jul. 2017.

Mehrdad Moradi holds a PhD degree in computer science from the EECS department at the University of Michigan, Ann Arbor, USA. He received his B.S. in Software Engineering from Sharif University of Technology, Iran, in 2012. His recent research focus is on 5G, SDN, and NFV.

Ying Zhang is a Software Engineer at Facebook. Previously, she was a senior researcher at HP Labs and in the Ericsson Research Silicon Valley Lab. She earned her Ph.D. degree in 2009 in the EECS department at the University of Michigan. She has been granted 30+ patents and was recognized by Swedish media as one of the Mobile Network 10 Brightest Researchers.

Z. Morley Mao is a professor in the Department of Electrical Engineering and Computer Science (EECS) at the University of Michigan, Ann Arbor, USA. Her research interests include networking, distributed systems, and security. She has a Ph.D. in computer science from the University of California at Berkeley, CA, USA, has published 150+ peer-reviewed research articles, and holds 8 awarded patents.

Ravi Manghirmalani is a former senior researcher at Ericsson Research Silicon Valley. Ravi received his M.Sc. degree in computer science and engineering from Case Western Reserve University, OH, USA. His main areas of research are Software Defined Networks and Cloud Computing.
