This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSAC.2018.2871312, IEEE Journal on Selected Areas in Communications
Abstract—To optimize network cost, performance, and reliability, SDN advocates for centralized traffic engineering (TE) that enables more efficient path and egress point selection as well as bandwidth allocation. In this work, we argue SDN-based TE for ISP networks can be very challenging. First, ISP networks often are very large in size, imposing significant scalability challenges on the centralized TE. Second, ISP networks usually have diverse types of links, switches, and cost models, leading to a complex combination of optimizations. Third, ISP networks not only have many choices of internal paths but also include rich selections of egress points and interdomain routes, unlike cloud/enterprise networks. To overcome these challenges, we present a novel TE application framework, called Dragon, for existing SDN control planes. To address the scalability challenge, Dragon consists of hierarchical and recursive TE algorithms and mechanisms that divide flow optimization problems into subtasks and execute them in parallel. Further, Dragon allows ISPs to express diverse objectives for different parts of their network. Finally, we extend Dragon to jointly optimize the selection of intradomain and interdomain paths. Using extensive evaluation on real topologies and prototyping with an SDN controller and switches, we demonstrate that Dragon outperforms existing TE methods both in speed and in optimality.

Index Terms—Traffic Engineering, Software Defined Networking, Hierarchical, Recursive, ISP, SDN, TE, Framework

I. INTRODUCTION

To optimize network performance and throughput, operators are always looking for new ways to engineer their traffic. Software-defined networking (SDN) provides more flexibility and control over networks by separating their control and data planes. SDN also enhances traffic engineering (TE) by having a centralized TE application with a global view constantly reconfigure the network in response to new traffic demand. Recently, cloud providers (e.g., Microsoft) leveraged SDN to demonstrate the potential of centralized intradomain TE in improving link utilization and fault tolerance [1], [2]. In addition, there is an increasing trend of centralized TE platforms for interdomain TE [3], [4].

At first sight, it seems that existing SDN-based TE systems and platforms can be fully adapted to manage traffic in ISP networks. However, a close inspection reveals that they fall short in completely satisfying key requirements of ISP networks. First, most of the existing proposals scale poorly due to the large size and rapid growth of ISPs as well as their formulation of TE as a linear flow optimization that can consist of millions of constraints and variables for an ISP network. Second, ISPs have different types of links, switches, traffic, and failure patterns in different parts of their network. Thus, they can have various optimization goals for each part, e.g., maximizing throughput on peering links and balancing the load on internal links. However, most existing TE solutions are limited to applying the same set of objective functions to the entire network. Third, existing systems lack interactions between intradomain resource allocation and interdomain routing, causing sub-optimal end-to-end performance [5], [6].

Dragon TE application framework. Rather than centralizing the TE application in a single process or thread on the SDN controller, we argue a scalable, flexible, and efficient TE application for ISP networks should consist of a distributed logic. To overcome the scalability challenge, it must break down large-scale TE problems into smaller pieces, solve the subproblems in parallel as much as possible, and combine partial solutions to generate a TE result sufficiently close to that of a global one-shot TE approach. To support fine-grained customization, it must enable ISPs to express their desired optimization goal for each subproblem. To ensure efficiency of TE decisions, it must jointly optimize interdomain paths and intradomain routes within each subproblem. To realize this vision, we present a novel TE application framework, called Dragon, having these properties. Dragon builds up a hierarchical structure of TE workers inside the TE application on the SDN controller (Fig. 1). Our framework partitions the TE problem by recursively building an abstract network region for each TE worker. Each TE worker is responsible for solving a small flow optimization to handle its regional traffic while minimally coordinating with its parent and children. Further, Dragon enables operators to mix and match diverse optimization functions, as each of its TE workers can enforce a different optimization goal on its traffic. Moreover, Dragon simultaneously computes both intradomain and interdomain TE decisions by properly disseminating the BGP routing information throughout the hierarchy. Our Dragon framework is capable of running on both hierarchical and flat SDN controllers [7], [8].

Contributions. In more detail, we make four contributions in this paper. First, our framework targets TE in large ISP networks. It creates unified network topologies and formulates the interdomain and intradomain TE as a joint optimization problem. Also, it is extensible to incorporate diverse optimization objectives, e.g., service chaining, maximizing throughput, and guaranteeing fault tolerance. Second, Dragon offers a novel hierarchical TE algorithm to address the scalability challenge, especially for handling a large number of IP prefixes in large ISP networks. While solving each sub-optimization problem locally can reduce the computation overhead, it certainly will affect the optimality of the global solution. Thus, our main challenge is to design the right boundary between hierarchical regions in terms of the information handled locally versus that exposed to upper levels. Third, Dragon maps local TE results from the hierarchical regions to the physical data plane
0733-8716 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
with a minimal number of data plane states. It minimizes the packet header size and the number of flow entries in switches. Fourth, Dragon enables ISPs to combine different TE objective functions in a hierarchical manner for their different regions, e.g., maximize the throughput on the peering links while minimizing the maximum link utilization (MLU) internally to avoid congestion.

Result summary. We have implemented Dragon as an application on top of an RYU controller while the data plane comprises OVS switches. Through experimentation with prototypes on eleven real-world topologies, we show that Dragon is computationally more efficient than the global optimal approach by reducing 98% of the TE overhead, while still achieving 88-96% of the solution optimality.¹ Our joint optimization of inter and intradomain TE reduces the end-to-end latency by 52%. In the experiment of maximizing the inter-region throughput and minimizing intra-region MLU, Dragon achieves 15.5% lower MLU while only sacrificing 3% throughput compared to the best multi-objective TE approaches [9]. Finally, Dragon has a more scalable data plane and requires 70% fewer flow entries than advanced SDN solutions [7].

¹ Today's ISP networks are highly over-provisioned, so sacrificing a negligible amount of solution optimality for scaling TE is practical, while also enabling applications such as real-time TE-based security in response to DDoS attacks.

II. DRAGON FRAMEWORK OVERVIEW

This section provides a background on SDN-based TE and presents an overview of the Dragon TE framework.

A. Traditional SDN-based Traffic Engineering

In SDN, the data plane consists of programmable switches that are managed and configured by a logically centralized SDN controller (Fig. 1a). TE is realized as an application on the controller that discovers topology changes and collects new traffic demand from agents running on access/edge switches (e.g., [10]). Then, it feeds them to the TE application that often performs three operations: (i) computes paths for flows, (ii) allocates bandwidth to flows, and (iii) generates rules for switches to enforce the computed paths and bandwidths. The difference between TE solutions comes from their approach to realizing these operations. We focus on a popular multipath TE model that is facilitated by the SDN paradigm. This model separates the optimization of path computation from bandwidth allocation and runs iteratively based on how fast the traffic matrix changes in the network. It computes multiple paths (e.g., k shortest paths) for each flow demand, solves a linear flow optimization program to allocate bandwidth to each flow on its paths, and generates a set of switch rules (e.g., OpenFlow flow, group, and meter rules) to enforce the TE results.

B. Requirements for Software-defined TE in ISP networks

Before delving into our solution overview, we discuss three requirements that existing SDN-based TE designs [11], [1], [12], [2] fail to simultaneously satisfy for ISP networks:

Requirement 1: TE scalability. First, today's SDN-based TE methods do not scale well with respect to ISP network size, as most of them originally are designed for optimizing aggregate inter-DC traffic in cloud networks with at most tens of nodes or DCs. A large ISP network can consist of tens of thousands of forwarding devices, leading to millions of optimization constraints and variables in the TE application. Such flow optimization problems are intractable since modern algorithms solving even simple linear programs (LPs) have exponential worst-case complexities [13]. As a matter of fact, prior studies (e.g., DEFO [14] in § VII) report TE for a 100-node ISP network can take one night on high-end SDN controllers.

Requirement 2: TE flexibility. Second, existing SDN-based TE solutions lack flexibility in optimization. ISPs have different types of links, switches, traffic, and failure patterns in different parts of their large network. Therefore, they are interested in having different optimization goals for each part or region (e.g., maximizing throughput on peering links and balancing the load on internal links). Most recent solutions all have the same objective functions for the entire network, and extending them to multi-objective optimizations with best-practice techniques [15], [9] still can lead to very suboptimal results, specifically when the objectives conflict with each other.

Requirement 3: TE performance. Third, today's SDN TE solutions are suboptimal due to optimizing intradomain and interdomain TE decisions separately, while the ultimate ISP network efficiency is affected by both of them [16]. They select interdomain routes in the BGP route selection process by comparing all routes according to the local policies and AS path length. In this process, the intradomain TE decisions are considered in the form of IGP cost at a much later stage (step 8 in [17]). After routes are fixed, they optimize bandwidth and path allocation within the network. The limited interactions between interdomain and intradomain decisions can lead to path inflation and poor network utilization [5], [18], [19].

C. Dragon TE Framework: Software Abstractions

To address these challenges, our key idea is a hierarchical elastic TE framework that breaks large-scale ISP TE into smaller pieces, solves them in parallel, and incrementally aggregates partial solutions (Fig. 1). Dragon creates a hierarchical group of TE workers, each of which solves a small part of the TE problem. The root TE worker (e.g., TW7) is at the top and leaf TE workers (e.g., TW1-4) are at the bottom. Dragon is equipped with a manager node that runs on existing SDN control planes and is responsible for bootstrapping and managing TE workers. In single-controller networks, the manager node runs each TE worker in a separate thread. In distributed SDN control planes [7], [20], TE workers are placed on different controllers to improve robustness.

Each TE worker handles the TE task for a logical region, which is formed in the following way. The manager node first partitions the physical data plane (level 1) into a set of logical TE regions (e.g., regions 1-4) based on an algorithm defined by the ISP. Dragon does not impose any limitation on the number of worker levels and the number of children per worker. Then, each leaf region is assigned to a leaf TE worker that immediately constructs an abstract view of its region, in the form of logical switches (explained below), and presents this abstraction
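The multipath TE model of § II-A can be sketched end to end on a toy topology. This is an illustrative stand-in rather than Dragon's implementation: it finds up to two link-disjoint shortest paths per flow with Dijkstra (in place of a general k-shortest-path routine) and fills them greedily within residual link capacities (in place of the linear program); all names are ours.

```python
import heapq

def shortest_path(adj, src, dst, banned=frozenset()):
    """Hop-count shortest path; `banned` excludes links to force diversity."""
    pq, seen = [(0, src, [src])], set()
    while pq:
        hops, node, path = heapq.heappop(pq)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt in adj.get(node, ()):
            if (node, nxt) not in banned and nxt not in seen:
                heapq.heappush(pq, (hops + 1, nxt, path + [nxt]))
    return None

def run_te(adj, capacity, flows):
    """For each flow (src, dst, demand): compute two diverse paths and
    allocate bandwidth greedily within residual link capacities."""
    residual, alloc = dict(capacity), {}
    for fid, (src, dst, demand) in flows.items():
        p1 = shortest_path(adj, src, dst)
        banned = frozenset(zip(p1, p1[1:])) if p1 else frozenset()
        p2 = shortest_path(adj, src, dst, banned)
        alloc[fid] = []
        for path in (p1, p2):
            if path is None or demand <= 0:
                continue
            links = list(zip(path, path[1:]))
            bw = min([demand] + [residual[l] for l in links])
            if bw <= 0:
                continue
            for l in links:
                residual[l] -= bw
            demand -= bw
            alloc[fid].append((path, bw))
    return alloc

# Diamond topology: a 15-unit flow A->D splits 10/5 over two paths.
adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
capacity = {("A", "B"): 10, ("B", "D"): 10, ("A", "C"): 10, ("C", "D"): 10}
alloc = run_te(adj, capacity, {"f1": ("A", "D", 15)})
```

In a real deployment, the per-path allocations would then be translated into OpenFlow flow, group, and meter rules, the third operation the section lists.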
Fig. 1: Dragon's hierarchical structure: level-1 physical switches are grouped into regions handled by leaf TE workers TW1-TW4, abstracted as D-switches DS1-DS4 inside level-2 regions 5 and 6, which are in turn abstracted as D-switches DS5 and DS6 under the root region at level 3; TE results propagate down the hierarchy.

… Fig. 1, maximize the network throughput for traffic traversing … intradomain traffic. It distributes interdomain information in hierarchical regions by having each TE worker establish a BGP session with its parent and giving the illusion of physical peering points to non-leaf regions. Dragon simultaneously and …
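The worker tree of Fig. 1 (TW1-TW4 as leaves under a root TW7) is easy to mirror in code. A minimal sketch of how a manager node could bootstrap such a hierarchy; the class and function names are our own choices, not identifiers from the Dragon prototype:

```python
class TEWorker:
    """One node in the TE worker hierarchy: leaves own physical
    switches; non-leaf workers own regions built from their children."""
    def __init__(self, name, children=(), switches=()):
        self.name, self.parent = name, None
        self.children, self.switches = list(children), list(switches)
        for child in self.children:
            child.parent = self

def build_hierarchy(leaf_regions, fanout=2):
    """Assign each leaf region to a leaf worker, then repeatedly group
    `fanout` workers under a new parent until a single root remains."""
    level = [TEWorker(f"TW{i + 1}", switches=region)
             for i, region in enumerate(leaf_regions)]
    count = len(level)
    while len(level) > 1:
        grouped = []
        for i in range(0, len(level), fanout):
            count += 1
            grouped.append(TEWorker(f"TW{count}",
                                    children=level[i:i + fanout]))
        level = grouped
    return level[0]

# Four leaf regions reproduce Fig. 1's shape: TW5=(TW1,TW2),
# TW6=(TW3,TW4), root TW7=(TW5,TW6).
root = build_hierarchy([["s1", "s2"], ["s3"], ["s4"], ["s5", "s6"]])
```

In Dragon itself, the leaf partitioning is ISP-defined and the tree need not be binary; the sketch only fixes the parent/child wiring that the recursive TE algorithms walk.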
… workers' TE decisions become infeasible or inefficient after they get aggregated level by level from the root to the physical data plane. While optimizing Dragon for basic single-path routing is easy [22], [21], [23], [7], it is challenging for our desired multipath TE. Thus, our main focus is to improve the quality of TE solutions without sacrificing Dragon's other benefits (e.g., low TE computation time, support for diverse objectives).

Solution: Recursive flow optimization with greedy refinements. In Dragon, we introduce the Dragon Fabric (D-fabric) concept, associating each logical port pair of an abstract D-switch with performance and bandwidth constraints. Using these constraints, any non-leaf TE worker can participate in the global flow optimization initiated by the root without full TE states. Computing accurate D-fabrics is challenging, particularly when the network is saturated, due to hidden shared bottleneck physical links and circular dependencies between TE decisions made by a TE worker and its children. Our recursive flow optimization programs and greedy algorithms at each TE worker address this issue (§ III).

Challenge 3. Considering intra and interdomain TE together. In Dragon, jointly and hierarchically optimizing intra and interdomain TE decisions can be challenging in terms of scalability and efficiency, particularly because a typical ISP network can have millions of egress flows destined to hundreds of thousands of external IP prefixes. On the one hand, it will not be scalable if we offload the entire joint TE load onto the root TE worker for global efficiency. On the other hand, it can lead to low network utilization and performance if we completely perform it at non-root TE workers. Finally, Dragon must handle unplanned BGP routing messages changing flow egress points between two epochs, without inefficiently recomputing the entire TE from scratch.

Solution: QoS-based joint TE equipped with IP prefix clustering and incremental computation. To reduce the joint TE problem size, Dragon clusters destination IP prefixes with similar QoS metrics from all egress points. To balance its scalability and efficiency, Dragon dynamically decides which ancestor TE worker should perform the joint optimization on an egress flow. To handle interdomain route changes, Dragon selects affected regions and computes a local TE in them, while keeping the old TE results in their ancestor regions unchanged to achieve BGP routing stability (§ IV).

Challenge 4. Minimizing data plane rules and overheads. Physical SDN switches often have a very limited capacity for forwarding rules. In traditional SDN-TE, the controller after running TE can globally compress and aggregate the number of rules that must be installed in switches. The Dragon framework consists of a set of workers in the hierarchy, each having a partial visibility of the network and generating a set of switch rules in its region that are fetched by its child workers. As each worker makes more fine-grained flow decisions level by level from the root, the number of rules in physical switches and headers in packets quickly increases, causing significant pressure on data plane scalability.

Solution: scalable recursive tunnels. We design workers in the hierarchy to efficiently interact with each other to generate minimal TE state for the physical data plane by performing a recursive optimization; we instruct each worker to efficiently aggregate its local rules with those of its parent using a tag-based tunneling protocol (e.g., MPLS) to greedily and locally minimize the number of switch rules that it transfers to the lower level and the number of states that it needs to encode in packet headers. Our optimization makes sure that Dragon does not create more data plane state than a traditional SDN-based TE solution.

III. EFFICIENT RECURSIVE INTRADOMAIN TE

This section describes how the Dragon framework efficiently handles intradomain traffic between fixed source and sink points (access switches) in ISP networks. In § IV, we extend it to handle interdomain (outbound) traffic. We propose recursive linear programs and greedy algorithms that allow TE workers to efficiently participate in network-wide flow optimizations such that Dragon sacrifices a very small amount of the network throughput compared to the global, optimal TE while it accelerates each TE iteration by several orders of magnitude. Fig. 2 shows the high-level steps that we take in this section, assuming TE workers can have diverse objective functions for their region.

Fig. 2: Efficient and scalable recursive TE in Dragon. Dragon sacrifices a very small amount of the network throughput compared to the global, optimal TE while it accelerates each TE iteration by several orders of magnitude.
Step 1: Each TE worker exposes external traffic and its D-switch's D-fabric to its parent.
Step 1.1: Each worker collects internal demand and the D-fabric in its local region.
Step 1.2: It estimates the parent's demand per port pair on the exposed D-switch.
Step 1.3: It runs Shadow TE on estimated inter-region and exact intra-region flows to compute the D-fabric for the parent.
Step 1.4: It exposes the updated D-fabric and external demand to the parent.
Step 2: The root worker computes Final TE and sends the output to its children.
Step 3: After receiving outputs from the parent, the TE worker at each level computes Final TE in parallel.
Step 4: Each TE worker programs tunnels on its D-switches to enforce the Final TE results.
Data structures: traffic matrix M = <flow f: in_port, out_port>; Dragon fabric (D-fabric) G = <port a, port b, metrics>; bandwidth allocation <flow f: (path1: bw1); (path2: bw2)>.

A. Distributing Network Flows on TE workers

One key input to Dragon is the set of flows (the traffic between two access switches) to be scheduled. Our manager node collects flows and their bandwidth demand in the next time interval from the controller services [1]. Clearly, since global flow optimization by the root worker is not scalable, our goal is to distribute them among hierarchical regions. Dragon greedily assigns more flows to the lower-level regions, where the hierarchy has more branches. This allows Dragon to achieve better parallelism in the TE computation. We instruct workers to recursively collaborate with each other to place flows in the hierarchical regions. Dragon's recursive traffic demand distribution is simple and runs in each TE epoch. Starting from the leaf level, each leaf worker first collects flow demands from the manager node running on the SDN controller.
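The recursive flow-assignment walk of § III-A (a flow sinks to the lowest region that contains both of its access switches; anything else is exposed upward) can be sketched over a worker tree. This is a simplified illustration with names of our own choosing; Dragon's actual procedure also rewrites offloaded flows into port requests on the abstract D-switches:

```python
def owns(worker, switch):
    """True if `switch` lies anywhere inside this worker's region."""
    return (switch in worker["switches"]
            or any(owns(c, switch) for c in worker["children"]))

def assign(worker, src, dst):
    """Lowest worker whose region contains both endpoints, or None."""
    for child in worker["children"]:
        if owns(child, src) and owns(child, dst):
            return assign(child, src, dst)
    if owns(worker, src) and owns(worker, dst):
        return worker["name"]
    return None

def region(name, children=(), switches=()):
    return {"name": name, "children": list(children),
            "switches": set(switches)}

# Two-level hierarchy: TW5=(TW1,TW2) and TW6=(TW3,TW4) under root TW7.
tw7 = region("TW7", children=[
    region("TW5", children=[region("TW1", switches={"s1", "s2"}),
                            region("TW2", switches={"s3", "s4"})]),
    region("TW6", children=[region("TW3", switches={"s5"}),
                            region("TW4", switches={"s6"})]),
])
# assign(tw7, "s1", "s2") -> "TW1" (kept at the leaf)
# assign(tw7, "s1", "s3") -> "TW5" (offloaded one level)
# assign(tw7, "s1", "s6") -> "TW7" (only the root spans both endpoints)
```

Each flow lands in exactly one region, and most flows land low in the tree, which is what lets the per-region optimizations run in parallel.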
Then, each worker looks up the destination access switch in its region. If the switch or the flow egress point is not in its region, the TE worker offloads the flow to its parent through the abstract D-switches. In particular, the worker gives the parent the illusion of flow requests in the upper-level parent region. If the sink location falls inside its region, the leaf worker greedily hides the flow from its parent and takes responsibility for the flow. Recursively upward, at each level, other workers perform exactly the same operation on their flows. Intuitively, this lightweight procedure assigns each flow request to one region in the hierarchy while also putting more flows in the lower levels. Fig. 3 shows this in a two-level Dragon, where a leaf TE worker exported flow F2 to the root region and kept flow F1 locally.

B. Succinct Dragon Fabric (D-fabric) Abstraction and Recursive Schemes for Maximizing TE Results Efficiency

After the flow assignment process, the root TE worker can begin the recursive TE from its region (§ II). However, the recursive construction of D-switches can cause a TE worker to overestimate the resources and performance of its child regions. Given existing compact topology representation techniques [25], [26] are not effective for centralized TE (§ VII), to avoid this challenge, we propose to extend the D-switch abstraction with the Dragon Fabric (D-fabric) concept that associates each pair of D-switch ports with effective hop count, latency, and bandwidth information. Each worker constructs a new D-fabric for its immediate parent in each TE epoch.² To better expose the true capacity of a region, rather than relying on single-path algorithms, the TE worker computes multiple paths between border ports in its region. For each port pair, it approximates the D-fabric constraints to be the maximum of hop counts, the maximum of latencies, and the sum of available bandwidths over the paths. While the D-fabric idea is simple, computing accurate bandwidth constraints is challenging because of shared bottleneck links and the dependency between a TE worker and its child workers. To pinpoint the problem, we first characterize traffic interaction between a parent and its child regions.

² If border ports of a TE worker's region map to s physical switches, the number of constraints of the abstract D-switch can be reduced to 3s(s − 1)/2, which is reasonably small and decreases as we move up in the hierarchy.

Challenge of computing the bandwidth in D-fabrics. Each region carries (1) intra-region flows whose source and sink fall inside the region, and (2) inter-region flows that are assigned to a TE worker by its parent in the top-down TE computation. The problem is that a TE worker cannot easily report the remaining bandwidth to be used by the parent's inter-region flows because it does not know how its intra-region and inter-region flows, which can go through shared bottleneck links, will be distributed in its region after its local TE. Moreover, unlike intra-region flows whose near-exact bandwidth demands are known as they are fetched directly from the controller, the bandwidth request of each inter-region flow is unknown. This challenge is shown in Fig. 3. Our key idea to resolve this problem is that if a TE worker knows the inter-region flow demand from the parent a priori for each port pair on the exposed D-switch, it can run TE to compute the D-fabric for the parent.

Dragon's efficient and scalable two-step recursive TE computation. Based on the above intuition, we propose a recursive two-step TE computation for Dragon: Step 1. Recursive bottom-up TE for D-fabric computation (Shadow TE). In a TE epoch, we compute effective D-fabrics from the leaf to the root level. We have each TE worker learn its exact intra-region flows using the flow assignment procedure (§ III-A) and estimate the bandwidth request of each inter-region flow (using the algorithm that will be discussed shortly). As shown in Fig. 2, with these two flow types as inputs, it runs a flow optimization based on its region objective and constraints, called Shadow TE. The Shadow TE results in an approximate bandwidth allocation to estimated inter-region flows in the coexistence of exact intra-region flows. Using the approximate allocation, the TE worker easily computes the D-fabric for the parent. After the above step, the TE worker goes into a sleep mode. Recursively, this freezes the control tree until the root fetches its local D-fabrics. Step 2. Recursive top-down TE for network-wide flow scheduling (Final TE). When the root TE worker learns the D-fabrics and flows in its region, it starts the top-down TE computation in the current TE epoch. First, the root runs its TE by computing multiple paths for each of its flow requests, running a flow optimization using an LP to allocate bandwidth to flows along their paths, and programming its region. The root worker takes into account the D-fabric constraints in its path computation and bandwidth allocation so it can make efficient and feasible TE decisions. When a child TE worker receives the parent's forwarding rules on the D-switch, it views them as exact inter-region flow requests in its region. Recursively, each child TE worker runs its TE on its exact intra-region and inter-region flows until reaching the leaf level. A TE worker can flexibly distribute the parent traffic in its region as long as it meets the bandwidth and performance constraints promised to its parent through the D-switch's D-fabric. We refer to the TE that TE workers run in this phase as Final TE.

Fig. 3: Flow distribution procedure and the bandwidth computation challenge in D-fabrics: TW1 does not know how its intra-region flow (F1 with a known bandwidth request) and the parent's inter-region flow (FP with an unknown bandwidth request) will be distributed in its region, and thus cannot easily compute the D-fabric for its parent. The root's flows include F1 (A→D: 15) and FP (B→C: ?).

Dragon's greedy estimation of parent flows in Shadow TE. In our bottom-up recursive process, each TE worker must estimate the bandwidth demand of each inter-region flow from the parent in the next TE epoch. While there are many techniques for predicting traffic matrices in general, most of them (e.g., [24], [27]) compute a convex combination of previous matrices and report it as the next matrix. Although they are effective for predicting local/transit user traffic between …
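The per-port-pair aggregation that § III-B describes (maximum of hop counts, maximum of latencies, and sum of available bandwidths over the multiple computed paths) is mechanical once candidate paths are annotated. A sketch with field names of our own choosing; note the caveat the section raises: summing bandwidths over-reports capacity when paths share a bottleneck link, which is exactly what Shadow TE corrects for.

```python
def dfabric_entry(paths):
    """Fold the candidate paths between one border-port pair into a
    D-fabric constraint (max hops, max latency, total available bw)."""
    return (max(p["hops"] for p in paths),
            max(p["latency_ms"] for p in paths),
            sum(p["avail_bw"] for p in paths))

def dfabric(candidate_paths):
    """candidate_paths: {(port_a, port_b): [annotated paths]} -> the
    D-fabric G the worker would expose on its abstract D-switch."""
    return {pair: dfabric_entry(paths)
            for pair, paths in candidate_paths.items()}

# Two internal paths between border ports (1, 2) of a region:
g = dfabric({(1, 2): [{"hops": 3, "latency_ms": 10, "avail_bw": 40},
                      {"hops": 5, "latency_ms": 14, "avail_bw": 20}]})
# g[(1, 2)] == (5, 14, 60)
```

The parent then treats (5, 14, 60) as the hop-count, latency, and bandwidth constraint of that logical port pair without ever seeing the region's internal topology.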
… systems that formulate a joint optimization have serious performance and scalability limitations (§ VII). In this section, we describe how Dragon hierarchically and jointly optimizes these decisions based on QoS metrics (§ IV-A). To guarantee both the performance and scalability, we distribute the joint TE load among hierarchical regions and workers (§ IV-B). Finally, we explain how we adapt our recursive linear programs in § III to the joint TE (§ IV-C) and handle routing changes (§ IV-D).

A. Joint Selection of Intradomain and Interdomain Paths

To handle egress traffic, in Dragon, each TE worker runs a BGP process establishing BGP sessions with local BGP speakers in neighbor ISPs and with its parent through the abstract D-switch. Dragon uses these recursive sessions to disseminate interdomain routes, without global visibility, to proper regions, giving them the illusion of real egress/peering points. In Dragon, when a leaf TE worker sends a BGP message to its parent, it extends the message by associating a few performance metrics of the route (e.g., latency, bandwidth to the destination) from periodic measurements. Dragon uses this approach to realize a new end-to-end routing logic jointly optimizing intradomain and interdomain paths. We first explain how Dragon as a whole (without the TE worker abstraction) selects a single best route for each aggregate flow destined to an IP prefix on the Internet. We propose to first compare candidate interdomain routes by the operator's policy for routing stability [28] and then break the tie by ranking them based on the end-to-end (intradomain+interdomain) performance experienced by the flow. Our end-to-end QoS-driven routing does not disturb the global BGP convergence, although it might choose and announce new routes for better performance faster in each epoch. Moreover, we have Dragon collect sufficient measurement samples for each BGP route to form stable metrics. Finally, we extend Dragon to select multiple end-to-end paths to efficiently utilize the network resources; we use the BGP add-path capability to select more than one interdomain (BGP) route for a flow [29].

B. Scaling Joint Routing to a Large Number of IP Prefixes

Dragon's routing logic improves the TE end-to-end performance (§ VI). But to achieve scalability, the key question is where this route selection should take place in the hierarchy. One naive way is to send all the routes and flows to the root TE worker (e.g., RCP [3]), which then selects the best route based on the end-to-end metric for all flows. The selected routes are then recursively pushed down through the BGP sessions to be advertised to neighbor ISPs. We call this approach global route selection, which clearly is not scalable given the hundreds of thousands of address prefixes and millions of aggregate flows in an ISP network. To improve both the scalability and performance, we leverage Dragon's hierarchical structure to intelligently distribute our end-to-end route selection and LP loads to different levels, while satisfying performance requirements. In our approach, the … route selection. If the selected route satisfies the performance requirement of the flow, it decides to exit and engineer the flow from its region. Otherwise, it exposes the flow to its parent, which performs the same operation. We explain our design in Fig. 5 with two aggregate flows destined to external prefixes P1 and P2 entering region 2 from neighboring ISP N1. In this case, TE worker TW2 needs to select one best BGP route for each. For prefix P1, since a local end-to-end path via neighboring ISP N2 satisfies the latency constraint, TW2 locally selects the constituent interdomain route, announces it to N1 and N2, and also exposes it to the root from (DS2, A). For prefix P2, it cannot meet the requirement and thus exports the flow request from (DS2, B) to the root region. The root worker is aware of an efficient end-to-end path to prefix P2 through N3, selects the constituent BGP route, and advertises it to N1 through D-switch DS2.

Fig. 5: Dragon's scalable QoS-based end-to-end (joint interdomain and intradomain) routing before bandwidth allocation. Regions 1 and 2 (TE workers TW1 and TW2) at level 1 are abstracted as D-switches DS1 and DS2 with border ports A, B, and C in the root region at level 2; N1-N3 are neighbor ISPs.

C. Put together: Unified Topology Abstraction for Joint TE

To run our scalable end-to-end routing and then allocate bandwidth to both the ISP egress and local flows, we need to create unified topologies containing both interdomain and intradomain nodes in each region. Basically, we extend the topology of each region with IP prefixes as the sink points of egress flows. Since the number of prefixes is huge, we cluster them into External Virtual (EV) nodes and then connect EV nodes to egress switches. Clustering can be based on various factors (e.g., reachability: each set contains prefixes with the same BGP path from all egress points). We propose to group prefixes based on QoS metrics due to compatibility with our end-to-end routing logic. Therefore, we have each EV node contain prefixes with similar QoS metrics from all egress points. For example, EV1 in Fig. 6 contains prefixes P1 and P2 since both have similar external latencies (i.e., 100 and 150 ms) from peer switches A and B in the region. After forming unified topologies in all regions, Dragon first runs the scalable end-to-end routing in each epoch, followed by the two-step linear programs to allocate bandwidth to both egress and local flows.

D. Handling Routing Changes Between Epochs

Dragon engineers interdomain and intradomain flows iteratively. In the joint TE context, unplanned BGP route updates might arrive between two epochs, forcing Dragon to change the flow egress points. In this case, recomputing the …
ISP defines a decision threshold in terms of the performance entire TE from is inefficient so Dragon provides a hierarchical
requirement (e.g., maximum latency) for each aggregate egress greedy packing approach to find temporary solutions between
flow. Recursively upward from the leaf level, for each flow in two epochs. In this method, from the leaf level upward, when
a leaf region, each TE worker first locally runs our end-to-end a TE worker learns an interdomain route change for an egress
0733-8716 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSAC.2018.2871312, IEEE Journal
on Selected Areas in Communications
Fig. 6: Constructing a unified topology in a logical region for compressing address prefixes to systematically adapt our recursive intra-region TE techniques to joint TE.

flow in its region, it checks if it has an alternative end-to-end
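The prefix clustering behind Fig. 6 can be sketched as follows. This is a minimal illustration, not Dragon's implementation: prefixes whose per-egress latency vectors are within a tolerance of each other are merged into one EV node. All names, the vector representation, and the 50 ms tolerance are illustrative assumptions.

```python
# Hypothetical sketch of EV-node construction: group prefixes whose
# external-latency vectors (one entry per peer/egress switch) are similar.
def cluster_prefixes(latency_by_prefix, tolerance_ms=50):
    """latency_by_prefix: {prefix: (lat_from_A, lat_from_B, ...)} -> list of EV nodes."""
    ev_nodes = []  # each EV node is (representative_vector, [member prefixes])
    for prefix, vec in sorted(latency_by_prefix.items()):
        for rep, members in ev_nodes:
            # Join an existing EV node if every per-egress latency is close enough.
            if all(abs(a - b) <= tolerance_ms for a, b in zip(rep, vec)):
                members.append(prefix)
                break
        else:
            ev_nodes.append((vec, [prefix]))
    return [members for _, members in ev_nodes]

# P1 and P2 see similar latencies from peer switches A and B, so they share one EV node.
prefixes = {"P1": (100, 150), "P2": (120, 160), "P3": (400, 90)}
print(cluster_prefixes(prefixes))  # [['P1', 'P2'], ['P3']]
```

Each resulting EV node would then be attached to the egress switches as a single sink, so the LPs see one node instead of many prefixes.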
switches). At the end, we present a summary of results in our experimentation with other ISP topologies in the RocketFuel dataset. We perform TE every 5 minutes for 80 hours. Unlike Internet-2, RocketFuel does not come with real traffic matrices and link capacities. Therefore, we generate traffic using the widely-used cyclostationary model [32]. To estimate link capacities, we leverage a popular centrality model [33]. To model egress points and interaction with other ISPs based on BGP, we extract the interdomain routes from PlanetLab nodes to all destination IP prefixes based on iPlane [34]. Each PlanetLab node serves as a neighbor ISP of a Dragon ISP. Similar to prior work [7], we obtain the latency of BGP paths from each neighbor to all destinations by aggregating iPlane snapshots.

Metrics and baseline. We have four metrics: (1) TE scalability in terms of computation time and the state size in SDN switches, (2) TE overhead in terms of solution optimality, (3) performance in terms of RTT delay in inter-intradomain TE, and (4) solution efficiency in multi-objective region-specific TE. Our baseline is today's best software-defined TE system (referred to as SD-TE) that (1) performs optimal interdomain TE based on RCP [3], (2) executes optimal multi-path intradomain TE for a desired objective function by leveraging the core LP repeatedly used in previous DC works [2], [1], (3) enforces TE results leveraging SoftMoW's [7] advanced SDN-based tunneling protocol, and (4) supports a state-of-the-art hierarchical multi-objective TE [9] by iteratively scheduling inter-region flows with a desired function, computing the remaining link capacities, and then optimizing intra-region flows with the next objective function. SD-TE is implemented on top of the RYU controller.

B. Scalability of the Dragon Framework

We quantify Dragon's scalability benefits by focusing on the intradomain TE where flow requests are between fixed source and sink points, and choosing the network throughput maximization objective for all TE workers.

Fig. 8: TE computation time. Dragon reduces the TE time by …

TE computation time. Fig. 8 shows the computation time in seconds for SD-TE and Dragon. At most k tunnels are established between access switch pairs [35]. The optimal k is 4 and 20 for Internet-2 and Ebone respectively, indicating a larger k does not increase the network throughput noticeably. We make the following observations. First, SD-TE has several orders of magnitude higher computation time for Ebone compared to Internet-2. Second, increasing k substantially intensifies the computation time of SD-TE. These two trends confirm that SD-TE cannot scale to very large networks. Third, Dragon is highly scalable since it reduces the total time by 98% (from 30 min to 30 sec.) for Ebone and by 68% (from 889 ms to 283 ms) for Internet-2 with the optimal k.

TABLE I: Linear program size (O(mn)) on Ebone w.r.t. the number of constraints (m) and variables (n). Dragon efficiently distributes the TE load among its TE workers.

        |                    Dragon                       | SD-TE
        | Root   L1    L2    L3    L4    L5    L6    L7   |
m       | 4538   441   2708  370   1830  1535  688   1440 | 22556
n       | 1848   338   2274  294   1541  1281  557   1218 | 19384
O(mn)   | 8E6    1E5   6E6   1E5   3E6   2E6   4E5   2E6  | 4E8

TE complexity. To understand Dragon's scalability, we show the average size of LPs solved over all epochs by Dragon and SD-TE on Ebone (Internet-2 is similar). Given that LPs have exponential worst-case complexities, from Table I, we observe that Dragon is more agile and scalable. This is because it distributes small LP problems among TE workers, which are solved mostly in parallel. In particular, each TE worker has been assigned problems that are 50-1000× smaller than those solved by SD-TE.

TABLE II: Data plane scalability. Dragon minimizes the number of forwarding states in physical switches. N/A in SoftMoW is because it does not enforce bandwidth allocations.

              |          Dragon            |          SD-TE
              | Internet-2  |    Ebone     | Internet-2  |    Ebone
              | Avg.  Max.  | Avg.   Max.  | Avg.  Max.  | Avg.     Max.
#Flow Rules   | 21.82 38    | 467.04 3048  | 63.27 124   | 1577.85  9144
#Group Rules  | 7.82  14    | 4.72   34.0  | N/A   N/A   | N/A      N/A
#Meter Rules  | 9.64  16    | 68.35  510.0 | N/A   N/A   | N/A      N/A

Data plane scalability. To show Dragon's data plane scalability, we measure the average number of flow, group, and meter rules installed in the physical (level-1) SDN switches. As shown in Table II, we observe the maximum numbers of flow entries used by Dragon are 69.3% and 66.6% smaller than SD-TE on Internet-2 and Ebone respectively. This is because Dragon uses global labels for minimizing switch rules and local labels to implement partial tunnels with small packet header overhead (§ V). Unlike SD-TE, Dragon with more levels needs the same number of rules due to its scalable rule management.

Fig. 9: Our greedy and Shadow TE approaches enable Dragon to compute near-optimal results.

Fig. 10: Scalability-optimality trade-off. Dragon can scale to large ISP networks by incurring a small amount of overhead.

C. Enhanced Optimality via D-fabric Reconfiguration

Given Dragon's scalability, we evaluate the amount of optimality (throughput) loss caused by our recursive abstraction and greedy D-fabric reconfiguration under different loads.

D-fabric reconfiguration. To iteratively compute D-fabrics for the root, each leaf TE worker estimates the bandwidth demands of the root on the abstract D-switch port pairs based on our greedy method and feeds them into
its Shadow TE. We choose the greediness parameters experimentally: if the bandwidth of a port pair is utilized more than 95%, each leaf TE worker increases the estimate by 2 Gbps; if it is below 85%, it decreases it at a lower speed, 500 kbps. Otherwise, the estimate does not change. We also implement a basic Dragon in which each leaf TE worker estimates an equal bandwidth demand for all of its border port pairs. Fig. 9 depicts Internet-2's network throughput for greedy Dragon … basic Dragon does not consider the root worker's traffic patterns and incurs up to 4% throughput loss.

Fig. 11: Throughput in saturated networks. Dragon generates near-…

…pletely saturated with at most 1-2% and 3.5-9.8% throughput loss for Internet-2 and Ebone respectively. Given the substantial … in ISP networks, we believe these loss values are tolerable.

D. Dragon's Scalability-Optimality Trade-off

To further characterize Dragon, we show the trade-off between the TE time and the throughput loss for a varying number of levels. Since our prototype is limited to two levels, for the Ebone topology, we compute the expected value of these two metrics in the two-level Dragon with a different number of abstract D-switches at the root region. Then, we carefully and recursively propagate the expected values to estimate the throughput loss and TE time for three- and four-level ones. In our estimation, we take into account actual flow demands in logical regions of three- and four-level Dragon. Fig. 10 presents the normalized TE time and throughput loss for various numbers of levels relative to the two-level Dragon. We observe that Dragon can provide large ISPs with tremendous TE agility at a small over-provisioning cost. Particularly, three-level Dragon has 30% smaller TE time with only 8% higher throughput loss compared to the two-level one.

E. Joint Inter-Intra Domain TE

Next, we demonstrate the performance and scalability benefits of Dragon's joint inter-intradomain TE compared to SD-TE, which first makes interdomain TE decisions followed by intradomain optimization. Using the same setup, we generate flows destined to external IP prefixes (outside the ISP) on the Internet and feed the interdomain routing data into our framework.

… to conduct the route selection at the root TE worker to leverage a network-wide view similar to SD-TE. Fig. 12 shows the …

Fig. 12: End-to-end flow-level delay. Dragon is more efficient than SDN-based TE solutions due to jointly optimizing its interdomain and intradomain TE decisions in the network.

Fig. 13: Trade-off among (a) end-to-end latency, (b) throughput, and (c) load distribution on workers in Dragon's joint TE.

Characteristics of the hierarchical joint TE. Next, we evaluate Dragon's hierarchical joint TE described in § IV and characterize its trade-offs between end-to-end latency, network throughput, and TE workers' load. In our design, Dragon's leaf TE workers perform the joint TE for a flow locally unless the ISP egress points to the destination are not in their region or do not satisfy its latency requirement. In Fig. 13, when the entire load is handled by the root TE worker (threshold is 0), the end-to-end latency in the joint TE is optimal due to the global view. As we offload less (larger threshold), the TE load gets more balanced among all TE workers as more TE tasks are done locally and in parallel. In this case, the maximum latency becomes larger because a better egress point (end-to-end path) might exist in a neighboring leaf region but a leaf TE worker selects its local routes instead. The last metric is in Fig. 13c, showing that a large threshold increases the throughput as more traffic is sent through local egress points, whereas the global routing results in more flows competing for those egress points with smaller end-to-end latencies. This disappears if we select more than one interdomain route for each egress flow.
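The latency-threshold rule that drives this trade-off can be sketched as follows. This is an illustrative model, not Dragon's actual API: a leaf TE worker keeps a flow if some local end-to-end route meets the flow's latency requirement, and otherwise escalates it toward the root, which sees the routes of all regions.

```python
# Hypothetical sketch of the decision-threshold rule in Dragon's joint TE:
# keep a flow at the leaf when a local route satisfies its latency requirement,
# otherwise defer it to the parent/root worker with the wider view.
def place_flow(flow_latency_req_ms, local_route_latencies, parent_route_latencies):
    """Return ('local', latency) or ('parent', latency) for one aggregate egress flow."""
    best_local = min(local_route_latencies, default=float("inf"))
    if best_local <= flow_latency_req_ms:
        return ("local", best_local)   # handled at the leaf, in parallel with siblings
    # Escalate: the parent also sees its other children's routes (global view at root).
    best_overall = min(parent_route_latencies + local_route_latencies)
    return ("parent", best_overall)

print(place_flow(80, [70, 120], [60]))  # ('local', 70)  -- threshold met locally
print(place_flow(50, [70, 120], [60]))  # ('parent', 60) -- escalated for a better egress
```

A threshold of 0 reproduces the all-at-root extreme in Fig. 13; a very large threshold keeps every flow local.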
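The greediness parameters chosen above amount to a simple hysteresis controller on each abstract port pair. A minimal sketch: the 95%/85% thresholds and the 2 Gbps / 500 kbps steps come from the text, while the function shape and names are illustrative assumptions.

```python
# Sketch of the greedy D-fabric demand estimator: each leaf TE worker adjusts
# its bandwidth estimate for an abstract D-switch port pair per epoch, ramping
# up quickly under pressure and backing off slowly when underutilized.
INCREASE_STEP = 2_000_000_000  # 2 Gbps, applied when utilization > 95%
DECREASE_STEP = 500_000        # 500 kbps, applied when utilization < 85%

def update_estimate(estimate_bps, utilization):
    """Return the next-epoch bandwidth estimate for one port pair."""
    if utilization > 0.95:
        return estimate_bps + INCREASE_STEP          # ramp up quickly
    if utilization < 0.85:
        return max(0, estimate_bps - DECREASE_STEP)  # back off slowly
    return estimate_bps                              # inside the band: hold steady

print(update_estimate(10_000_000_000, 0.97))  # 12000000000
print(update_estimate(10_000_000_000, 0.50))  # 9999500000
print(update_estimate(10_000_000_000, 0.90))  # 10000000000
```

The asymmetric step sizes keep the estimate from oscillating: congestion is relieved in one epoch, while unused headroom is reclaimed gradually.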
Fig. 14: Root TE worker maximizes network throughput & leaf workers balance traffic load on links. Dragon efficiently supports region-specific objective functions. (a) Inter-region throughput. (b) Intra-region MLU.

F. Region-Specific Objective Composition

Finally, we compare the efficiency of Dragon and SD-TE in supporting (hierarchical) region-specific multi-objective TE. In our two-level setting, we configure the root TE worker to maximize the network throughput on expensive inter-region links and the leaf TE workers to balance the traffic load on their intra-region links by minimizing the maximum link utilization (MLU). SD-TE performs the same multi-objective TE but using a different implementation (§ VI-A). For 30 epochs on Ebone, Fig. 14 shows that Dragon on average has 3.43% lower inter-region network throughput and 15.46% better intra-region load balancing. Overall, Dragon is more efficient than SD-TE because it provides feedback from the leaf level to the root through the D-fabrics associated with D-switches. However, SD-TE achieves the throughput maximization without taking into account the load-balancing objective, as it performs each of these optimizations one at a time. The former can exploit scarce resources within each leaf region and thus limits the choices of the latter.

G. Summary of Experiments with Other Larger Topologies

We repeat all the above experiments for 9 other ISP topologies in the RocketFuel dataset, consisting of around 600-800 switches and 1000-5000 links, and summarize the results for three of them (due to lack of space) in Table III. Our evaluation shows that the significant benefits and negligible overheads of Dragon stay valid for different ISP topologies and traffic patterns.

TABLE III: Summary of Dragon's benefits and overhead compared to SD-TE for other larger ISP topologies.

ISP     | TE time ↓ (×) | SDN rules ↓ (%) | Throughput ↓ (%) | Flow RTT ↓ (%) | Multi-objective efficiency ↑ (%)
Verio   | 722-821       | 40-62           | 7-12             | 55             | 10-14
Level3  | 542-600       | 72-80           | 5-10             | 67             | 20-32
AT&T    | 384-450       | 75-79           | 2-11             | 48             | 12-18

VII. RELATED WORK

We discuss related work that has inspired Dragon, especially works not previously considered.

Hierarchical SDN architectures. A body of recent works has already advocated for a hierarchical SDN controller architecture. SoftMoW [7] is the most comprehensive solution for cellular wide-area networks that particularly improves mobility management. RSDN [8] is entirely focused on routing and connectivity problems in networks managed by a hierarchical SDN control plane. Similar to these works, Dragon leverages the hierarchy technique but with some key differences. … issue by departing from LP and using constraint programming (CP). While there are some similarities between Dragon and DEFO, the differences are significant. First, DEFO is complementary to our work. Although our implementation is based on multipath LP-based TE, our framework is generic and thus can have its hierarchical workers execute DEFO (or CP) programs for further TE acceleration. Second, DEFO does not jointly optimize interdomain and intradomain traffic or support hierarchical multi-objective optimization.

Hierarchical routing architectures. In the past, notable efforts have leveraged similar hierarchy and recursive topology-aggregation techniques to scale legacy distributed routing protocols (e.g., PNNI [21], Nimrod [22]). Due to their distributed nature, these systems are inefficient from the TE perspective, and poor network utilization, reservation failures, and congestion are inherent to them. Some lossy representations of abstracted topologies have been proposed for these legacy systems (e.g., [25], [36], [26]). Although Dragon could use these techniques to further reduce the size of D-fabrics, they create substantial throughput loss in our centralized TE.

Joint inter-intradomain TE. RCP [3] is a flat centralized routing design that selects the best BGP route on behalf of all routers in an ISP. Similarly, SDX [4] manages routes of multiple ISPs at exchange points. Unlike them, Dragon jointly optimizes both intradomain and interdomain decisions, motivated by multiple studies [19], [18] on hot-potato routing. Our approach is different from prior optimizations solving the joint TE problem with bi-criteria techniques [16], [37], which are intractable for large networks, limited to single-path TE, and lack efficient support for handling interdomain routing updates. Using SDN programmability, Dragon offers an end-to-end approach based on QoS metrics, which is both scalable and efficient.

VIII. CONCLUSION

This paper proposes a new method for SDN-based TE in large ISP networks. Because of its hierarchical design, it can support TE for large-sized networks with thousands of constraints. Because of its scalability advantage, we can solve the intradomain and interdomain resource allocation jointly. Because of the hierarchical and region-based optimization structure, it allows the operator to flexibly define the optimization demand for different parts of the network. As a part of our future work, we will investigate how other SDN features, such as on-demand measurement, can be used to improve the TE effectiveness in practice.

ACKNOWLEDGMENTS

This work is supported by Ericsson Research and NSF grants CNS-1629894 and CNS-1345226 and was conducted while Mehrdad, Ying, and Ravi were at Ericsson Research.