HUG: Multi-Resource Fairness for Correlated and Elastic Demands

Mosharaf Chowdhury^1, Zhenhua Liu^2, Ali Ghodsi^3, Ion Stoica^3
^1 University of Michigan   ^2 Stony Brook University   ^3 UC Berkeley, Databricks Inc.
Abstract

In this paper, we study how to optimally provide isolation guarantees in multi-resource environments, such as public clouds, where a tenant's demands on different resources (links) are correlated. Unlike prior work such as Dominant Resource Fairness (DRF) that assumes static and fixed demands, we consider elastic demands. Our approach generalizes canonical max-min fairness to the multi-resource setting with correlated demands, and extends DRF to elastic demands. We consider two natural optimization objectives: isolation guarantee from a tenant's viewpoint and system utilization (work conservation) from an operator's perspective. We prove that in non-cooperative environments like public cloud networks, there is a strong tradeoff between optimal isolation guarantee and work conservation when demands are elastic. Even worse, work conservation can even decrease network utilization instead of improving it when demands are inelastic. We identify the root cause behind the tradeoff and present a provably optimal allocation algorithm, High Utilization with Guarantees (HUG), to achieve maximum attainable network utilization without sacrificing the optimal isolation guarantee, strategyproofness, and other useful properties of DRF. In cooperative environments like private datacenter networks, HUG achieves both the optimal isolation guarantee and work conservation. Analyses, simulations, and experiments show that HUG provides better isolation guarantees, higher system utilization, and better tenant-level performance than its counterparts.

[Figure 1: Bandwidth allocations in two independent links, shown as (a) demands, (b) max-min fair, and (c) DRF [33], for tenant-A (orange) with correlation vector d_A = ⟨1/2, 1⟩ and tenant-B (dark blue) with d_B = ⟨1, 1/6⟩. Shaded portions are not allocated to any tenant.]

1 Introduction

In shared, multi-tenant environments such as public clouds [2, 5, 6, 8, 40], the need for predictability and the means to achieve it remain a constant source of discussion [15, 44, 45, 50, 51, 58, 59, 61, 62]. The general consensus – recently summarized by Mogul and Popa [53] – is that tenants expect guaranteed minimum bandwidth (i.e., isolation guarantee) for performance predictability, while network operators strive for work conservation to achieve high utilization and strategy-proofness to ensure isolation.

Max-min fairness [43] – a widely-used [16, 25, 34, 35, 55, 63, 64] allocation policy – achieves all three in the context of a single link. It provides the optimal isolation guarantee by maximizing the minimum amount of bandwidth allocated to each flow. The bandwidth allocation of a user (tenant) determines her progress – i.e., how fast she can complete her data transfer. It is work-conserving, because, given enough demand, it allocates the entire bandwidth of the link. Finally, it is strategyproof, because tenants cannot get more bandwidth by lying about their demands (e.g., by sending more traffic).

However, a datacenter network involves many links, and tenants' demands on different links are often correlated. Informally, we say that the demands of a tenant on two links i and j are correlated if, for every bit the tenant sends on link-i, she sends at least α bits on link-j. More formally, with every tenant-k, we associate a correlation vector d_k = ⟨d_k^1, d_k^2, …, d_k^n⟩, where d_k^i ≤ 1, which captures the fact that for every d_k^i bits tenant-k sends on link-i, it should send at least d_k^j bits on link-j. Examples of applications with correlated demands include optimal shuffle schedules [22, 23], long-running services [19, 52], multi-tiered enterprise applications [39], and realtime streaming applications [10, 69].

Consider the example in Figure 1a with two independent links and two tenants. The correlation vector d_A = ⟨1/2, 1⟩ means that (i) link-2 is tenant-A's bottleneck; (ii) for every M_A rate tenant-A is allocated on the bottleneck link, she requires at least M_A/2 rate on link-1, resulting in a progress of M_A; and (iii) except for the bottleneck link, tenants' demands are elastic, meaning tenant-A can use more than M_A/2 rate on link-1.^1 Similarly, tenant-B requires at least M_B/6 on link-2 for M_B on link-1. If we denote the rate allocated to tenant-k on link-i by a_k^i, then M_k = min_i {a_k^i / d_k^i}, the minimum demand-normalized rate allocation over all links, captures her progress.

In this paper, we want to generalize max-min fairness to tenants with correlated and elastic demands while maintaining its desirable properties: optimal isolation guarantee, high utilization, and strategy-proofness.

^1 While it does not improve the instantaneous progress of tenant-A, it increases network utilization, which is desired by the operators.
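The progress metric just defined is easy to make concrete. The following Python sketch (illustrative only; the `progress` helper is our own naming, not the paper's code) recomputes the Figure 1 example, comparing per-link max-min fairness (Figure 1b) with DRF (Figure 1c):

```python
from fractions import Fraction as F

def progress(alloc, demand):
    """Progress M_k = min_i alloc_i / demand_i over links where demand_i > 0."""
    return min(a / d for a, d in zip(alloc, demand) if d > 0)

# Correlation vectors from Figure 1.
d_A = [F(1, 2), F(1)]
d_B = [F(1), F(1, 6)]

# Per-link max-min fairness (Figure 1b): each tenant gets 1/2 on each link.
mm_A = progress([F(1, 2), F(1, 2)], d_A)   # = 1/2
mm_B = progress([F(1, 2), F(1, 2)], d_B)   # = 1/2

# DRF (Figure 1c): equalize dominant shares across tenants.
drf_A = progress([F(1, 3), F(2, 3)], d_A)  # = 2/3
drf_B = progress([F(2, 3), F(1, 9)], d_B)  # = 2/3

print(min(mm_A, mm_B), min(drf_A, drf_B))  # isolation guarantee: 1/2 vs. 2/3
```

The isolation guarantee improves from 1/2 under per-link max-min fairness to 2/3 under DRF, matching the numbers discussed around Figures 1b and 1c.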
Intuitively, we want to maximize the minimum progress over all tenants, i.e., maximize min_k M_k, where min_k M_k corresponds to the isolation guarantee of an allocation algorithm. We make three observations. First, when there is a single link in the system, this model trivially reduces to max-min fairness. Second, getting more aggregate bandwidth is not always better. For tenant-A in the example, ⟨50 Mbps, 100 Mbps⟩ is better than ⟨90 Mbps, 90 Mbps⟩ or ⟨25 Mbps, 200 Mbps⟩, even though the latter ones have more bandwidth in total. Third, simply applying max-min fairness to individual links is not enough. In our example, max-min fairness allocates equal resources to both tenants on both links, resulting in allocations ⟨1/2, 1/2⟩ on both links (Figure 1b). The corresponding progress (M_A = M_B = 1/2) results in a suboptimal isolation guarantee (min{M_A, M_B} = 1/2).

Dominant Resource Fairness (DRF) [33] extends max-min fairness to multiple resources and prevents such suboptimality. It equalizes the shares of dominant resources – link-2 (link-1) for tenant-A (tenant-B) – across all tenants with correlated demands and maximizes the isolation guarantee in a strategyproof manner. As shown in Figure 1c, using DRF, both tenants have the same progress – M_A = M_B = 2/3, 50% higher than using max-min fairness on individual links. Moreover, DRF's isolation guarantee (min{M_A, M_B} = 2/3) is optimal across all possible allocations and is strategyproof.

However, DRF assumes inelastic demands [40], and it is not work-conserving. For example, the shaded bandwidth on link-2 is not allocated to either tenant. In fact, we show that DRF can result in arbitrarily low utilization (Lemma 6). This is wasteful, because unused bandwidth cannot be recovered.

We start by showing that strategy-proofness is a necessary condition for providing the optimal isolation guarantee – i.e., to maximize min_k M_k – in non-cooperative environments (§2). Next, we prove that work conservation – i.e., when tenants are allowed to use unallocated resources, such as the shaded area in Figure 1c, without constraints – spurs a race to the bottom. It incentivizes each tenant to continuously lie about her demand correlations, and in the process, it decreases the amount of useful work done by all tenants! Meaning, simply making DRF work-conserving can do more harm than good.

We propose a two-stage algorithm, High Utilization with Guarantees (HUG), to achieve our goals (§3). Figure 2 surveys the design space for cloud network sharing and places HUG in context by following the thick lines. At the highest level, unlike many alternatives [13, 14, 37, 44], HUG is a dynamic allocation algorithm. Next, HUG enforces its allocations at the tenant-/network-level, because flow- or (virtual) machine-level allocations [61, 62] do not provide isolation guarantee. Due to the hard tradeoff between optimal isolation guarantee and work conservation in non-cooperative environments, HUG ensures the highest utilization possible while maintaining the optimal isolation guarantee. It incentivizes tenants to expose their true demands, ensuring that they actually consume their allocations instead of causing collateral damage. In cooperative environments, where strategy-proofness might be a non-requirement, HUG simultaneously ensures both work conservation and the optimal isolation guarantee. In contrast, existing solutions [33, 45, 51, 58, 59] are suboptimal in both environments. Overall, HUG generalizes single- [25, 43, 55] and multi-resource max-min fairness [27, 33, 38, 56] and multi-tenant network sharing solutions [45, 51, 58, 59, 61, 62] under a unifying framework.

[Figure 2: Design space for cloud network sharing.
  Cloud Network Sharing
  ├─ Reservation (SecondNet, Oktopus, Pulsar, Silo): uses admission control
  └─ Dynamic Sharing
     ├─ Flow-Level (Per-Flow Fairness): no isolation guarantee
     ├─ VM-Level (Seawall, GateKeeper): no isolation guarantee
     └─ Tenant-/Network-Level
        ├─ Non-Cooperative Environments (require strategy-proofness)
        │  ├─ Low Utilization (DRF): optimal isolation guarantee
        │  └─ Highest Utilization for Optimal Isolation Guarantee (HUG)
        └─ Cooperative Environments (do not require strategy-proofness)
           ├─ Suboptimal Isolation Guarantee (PS-P, EyeQ, NetShare): work-conserving
           └─ Work-Conserving Optimal Isolation Guarantee (HUG)]

HUG is easy to implement and scales well. Even with 100,000 machines, new allocations can be centrally calculated and distributed throughout the network in less than a second – faster than that suggested in the literature [13]. Moreover, each machine can locally enforce HUG-calculated allocations using existing traffic control tools without any changes to the network (§4).

We demonstrate the effectiveness of our proposal using EC2 experiments and trace-driven simulations (§5). In non-cooperative environments, HUG provides the optimal isolation guarantee, which is 7.4× higher than existing network sharing solutions like PS-P [45, 58, 59] and 7000× higher than traditional per-flow fairness, and 1.4× better utilization than DRF for production traces. In cooperative environments, HUG outperforms PS-P and per-flow fairness by 1.48× and 17.35× in terms of the 95th percentile slowdown of job communication stages, and 70% of jobs experience lower slowdown w.r.t. DRF.

We discuss current limitations and future research in Section 6 and compare HUG to related work in Section 7.
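The second observation above – that more aggregate bandwidth does not imply more progress – can be checked numerically. A small illustrative Python sketch (our own helper names, not the paper's code), using tenant-A's correlation vector d_A = ⟨1/2, 1⟩ and measuring progress in absolute Mbps:

```python
def progress(alloc_mbps, demand):
    """Progress = min_i alloc_i / demand_i (here in Mbps), per the paper's metric."""
    return min(a / d for a, d in zip(alloc_mbps, demand) if d > 0)

d_A = [0.5, 1.0]  # tenant-A's correlation vector from Figure 1

for alloc in ([50, 100], [90, 90], [25, 200]):
    print(alloc, progress(alloc, d_A))
# <50, 100> yields progress 100, beating <90, 90> (progress 90) and
# <25, 200> (progress 50), despite having less total bandwidth than either.
```

Bandwidth beyond the correlated proportions is wasted on the non-bottleneck link, which is why the balanced ⟨50, 100⟩ allocation wins.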
  d_k                           Correlation vector of tenant-k's demand
  a_k                           Guaranteed allocation to tenant-k
  M_k                           Progress of tenant-k; M_k := min_{1≤i≤2P} a_k^i / d_k^i, where i stands for link-i
  c_k                           Actual resource consumption of tenant-k
  Isolation Guarantee           min_k M_k
  Optimal Isolation Guarantee   max {min_k M_k}
  (Network) Utilization         Σ_i Σ_k c_k^i

Table 1: Important notations and definitions.

[Figure 3: Two VMs from tenant-A (orange) and three from tenant-B (dark blue) and their communication patterns over a 3×3 datacenter fabric. The network fabric has three uplinks (L1–L3) and three downlinks (L4–L6) corresponding to the three physical machines.]

2 Motivation

In this section, we elaborate on the assumptions and notations used in this paper and summarize the three desirable requirements – optimal isolation guarantee, high utilization, and proportionality – for bandwidth allocation across multiple tenants. Later, we show the tradeoff between optimal isolation guarantee and high utilization, identifying work conservation as the root cause.

2.1 Background

We consider Infrastructure-as-a-Service (e.g., EC2 [2], Azure [8], and Google Compute [5]) and Container-as-a-Service (e.g., Mesos [40] and Kubernetes [6]) models where tenants pay per-hour flat rates for virtual machines (VMs) and containers.^2

We abstract out the datacenter network as a non-blocking switch (i.e., the fabric/hose model [11, 12, 14, 23, 28, 45, 49]) with P physical machines connected to it. Each machine has full-duplex links (i.e., 2P independent links) and can host one or more VMs from different tenants. Figure 3 shows an example. We assume that VM placement and routing are implemented independently. Not only does this model provide analytical simplicity, it is mostly a reality today: recent EC2 and Google datacenters have full bisection bandwidth networks [4, 7].

We denote the correlation vector of the k-th tenant (k ∈ {1, …, M}) as d_k = ⟨d_k^1, d_k^2, …, d_k^{2P}⟩, where d_k^i and d_k^{P+i} (1 ≤ i ≤ P) respectively denote the uplink and downlink demands normalized^3 by link capacities (C^i), and Σ_{i=1}^{P} d_k^i = Σ_{i=1}^{P} d_k^{P+i}.

For the example in Figure 3, consider tenant correlation vectors:

    d_A = ⟨1/2, 1, 0, 1, 1/2, 0⟩
    d_B = ⟨1, 1/6, 0, 0, 1, 1/6⟩

where d_k^i = 0 indicates the absence of a VM and d_k^i = 1 indicates the bottleneck link(s) of a tenant.

Correlation vectors depend on tenant applications, which can range from elastic-demand batch jobs [3, 24, 42, 68] to long-running services [19, 52], multi-tiered enterprise applications [39], and realtime streaming applications [9, 69] with inelastic demands. We focus on scenarios where a tenant's demand changes at the timescale of seconds or longer [13, 18, 58], and she can use provider-allocated resources in any way for her own workloads.

2.2 Inter-Tenant Network Sharing Requirements

Given the correlation vectors of M tenants, a cloud provider must use an allocation algorithm A to determine the allocations of each tenant:

    A({d_1, d_2, …, d_M}) = {a_1, a_2, …, a_M}

where a_k = ⟨a_k^1, a_k^2, …, a_k^{2P}⟩ and a_k^i is the fraction of link-i guaranteed to the k-th tenant.

As identified in previous work [15, 58], any allocation policy A must meet three requirements – (optimal) isolation guarantee, high utilization, and proportionality – to fairly share the cloud network:

1. Isolation Guarantee: VMs should receive minimum bandwidth guarantees proportional to their correlation vectors so that tenants can estimate worst-case performance. Formally, the progress of tenant-k (M_k) is defined as her minimum demand satisfaction ratio across the entire fabric:

    M_k = min_{1≤i≤2P} a_k^i / d_k^i

^2 We use the terms VM and container interchangeably in this paper because they are similar from the network's perspective.
^3 Normalization helps us consider heterogeneous capacities. By default, we normalize the correlation vector such that the largest component equals 1 unless otherwise specified.
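The normalization in footnote 3 is mechanical. In the illustrative Python sketch below, the raw per-link rates are hypothetical numbers chosen only to reproduce tenant-A's vector from Figure 3, and `correlation_vector` is our own name:

```python
from fractions import Fraction as F

def correlation_vector(raw_demands):
    """Normalize per-link demands so the largest component equals 1 (footnote 3).
    A 0 marks a link where the tenant has no VM; a 1 marks her bottleneck link(s)."""
    peak = max(raw_demands)
    return [d / peak for d in raw_demands]

# Hypothetical raw rates (e.g., in Gbps) on links L1..L6 for tenant-A.
raw_A = [F(1), F(2), F(0), F(2), F(1), F(0)]
print(correlation_vector(raw_A))  # components: 1/2, 1, 0, 1, 1/2, 0 (Figure 3)
```

Because only ratios matter, any uniform scaling of a tenant's raw rates yields the same correlation vector.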
[Figure 6: Payoff matrix for the example in Section 2.3. Each cell shows the progress of ⟨tenant-A, tenant-B⟩:

                         Tenant-B doesn't lie    Tenant-B lies
  Tenant-A doesn't lie   ⟨2/3, 2/3⟩              ⟨1/2, 3/4⟩
  Tenant-A lies          ⟨11/12, 1/2⟩            ⟨1/2, 1/2⟩ ]

[Figure 7: Bandwidth consumptions of tenant-A (orange) and tenant-B (dark blue) with correlation vectors d_A = ⟨1/2, 1⟩ and d_B = ⟨1, 1/6⟩, when neither runs elastic-demand applications. (a) The optimal isolation guarantee allocation is not work-conserving. With work conservation, utilization can (b) increase or (c) decrease, based on which tenant lies. (d) However, ultimately it lowers utilization. Shaded means unallocated.]

2.3.2 Optimal Isolation Guarantee but Low Utilization

In contrast, optimal isolation guarantee does not necessarily mean full utilization. In general, optimal isolation guarantees can be calculated using DRF [33], which generalizes max-min fairness to multiple resources. In the example of Figure 5a, each uplink and downlink of the fabric is an independent resource – 2P in total.

Given this premise, it seems promising and straightforward to keep the DRF component for optimal isolation guarantee and strategy-proofness and try to ensure full utilization by allocating all remaining resources. In the following two subsections, we show that work conservation may render the isolation guarantee no longer optimal and, even worse, may reduce useful network utilization.

2.3.3 Naive Work Conservation Reduces the Optimal Isolation Guarantee

We first illustrate that even the optimal isolation guarantee allocation degenerates into the classic prisoner's dilemma [30] in the presence of work conservation. In particular, we show that reporting a false correlation vector ⟨1, 1⟩ is the dominant strategy for each tenant, i.e., her best option no matter whether the other tenants tell the truth or not. As a consequence, optimal isolation guarantees decrease (Figure 6).

If tenant-A can use the spare bandwidth in link-2, she can increase her progress at the expense of tenant-B by changing her correlation vector to d'_A = ⟨1, 1⟩. With an unmodified d_B = ⟨1, 1/6⟩, the new allocation would be a_A = ⟨1/2, 1/2⟩ and a_B = ⟨1/2, 1/12⟩. However, work conservation would increase it to a_A = ⟨1/2, 11/12⟩ (Figure 5b). Overall, the progress of tenant-A would increase to 11/12, while decreasing to 1/2 for tenant-B. As a result, the isolation guarantee decreases from 2/3 to 1/2.

The same is true for tenant-B as well. Consider again that only tenant-B reports a falsified correlation vector d'_B = ⟨1, 1⟩ to receive a favorable allocation: a_A = ⟨1/4, 1/2⟩ and a_B = ⟨1/2, 1/2⟩. Work conservation would increase it to a_B = ⟨3/4, 1/2⟩ (Figure 5c). Overall, the progress of tenant-B would increase to 3/4, while decreasing to 1/2 for tenant-A, resulting in the same suboptimal isolation guarantee of 1/2.

Since both tenants gain by lying, they would both simultaneously lie: d'_A = d'_B = ⟨1, 1⟩, resulting in a lower isolation guarantee of 1/2 (Figure 5d). Both are worse off!

In this example, the inefficiency arises due to allocating all spare resources to the tenant who demands more. We show in Appendix B that intuitive policies for allocating all spare resources – e.g., allocating all to whoever demands the least, allocating equally to all tenants with non-zero demands, and allocating proportionally to tenants' demands – do not work either.

2.3.4 Naive Work Conservation Can Even Decrease Utilization

Now consider that neither tenant has elastic-demand applications; i.e., they can only consume bandwidth proportional to their correlation vectors. A similar prisoner's dilemma unfolds (Figure 6), but this time, network utilization decreases as well.

Given the optimal isolation guarantee allocation, a_A = c_A = ⟨1/3, 2/3⟩ and a_B = c_B = ⟨2/3, 1/9⟩, both tenants have the same optimal isolation guarantee, 2/3, and 2/9 of link-2 remains unused (Figure 7a). One would expect work conservation to utilize this spare capacity.

Same as before, if tenant-A changes her correlation vector to d'_A = ⟨1, 1⟩, she can receive an allocation a_A = ⟨1/2, 11/12⟩ and consume c_A = ⟨11/24, 11/12⟩. This increases her isolation guarantee to 11/12, and total network utilization increases (Figure 7b).

Similarly, tenant-B can receive an allocation a_B = ⟨3/4, 1/2⟩ and consume c_B = ⟨3/4, 1/8⟩ to increase her isola-
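The four outcomes in Figure 6 can be reproduced by evaluating each tenant's progress against her true correlation vector under the work-conserving allocations described above. The Python sketch below is illustrative; the allocation table is transcribed from Figures 5a–5d, and the helper names are ours:

```python
from fractions import Fraction as F

def progress(alloc, true_demand):
    # Progress is always measured against the *true* correlation vector.
    return min(a / d for a, d in zip(alloc, true_demand) if d > 0)

d_A, d_B = [F(1, 2), F(1)], [F(1), F(1, 6)]

# Work-conserving allocations (a_A, a_B) for each (A lies?, B lies?) case.
allocations = {
    (False, False): ([F(1, 3), F(2, 3)],   [F(2, 3), F(1, 9)]),
    (True,  False): ([F(1, 2), F(11, 12)], [F(1, 2), F(1, 12)]),
    (False, True):  ([F(1, 4), F(1, 2)],   [F(3, 4), F(1, 2)]),
    (True,  True):  ([F(1, 2), F(1, 2)],   [F(1, 2), F(1, 2)]),
}
payoff = {case: (progress(a, d_A), progress(b, d_B))
          for case, (a, b) in allocations.items()}
print(payoff[(False, False)])  # both truthful: progress (2/3, 2/3)
print(payoff[(True, True)])    # both lie: progress (1/2, 1/2), both worse off
```

Lying is the dominant strategy in each row and column, yet the (lie, lie) equilibrium leaves both tenants below the truthful outcome, which is exactly the prisoner's dilemma structure of Figure 6.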
these allocations, HUG ensures isolation. Moreover, because DRF is strategy-proof, tenants are guaranteed to use these allocations (i.e., c_k^i ≥ a_k^i).

While DRF maximizes the isolation guarantees (a.k.a. dominant shares), it results in low network utilization. In some cases, DRF may even have utilization arbitrarily close to zero, and HUG can increase that to 100% (Lemma 6).

To achieve this, the second stage of HUG maximizes utilization while still keeping the allocation strategyproof. In this stage, we calculate upper bounds to restrict how much of the spare capacity a tenant can use in each link, with the constraint that the largest share across all links cannot increase (Lemma 1). As a result, Algorithm 1 remains strategyproof across both stages. Because spare usage restrictions can be applied locally, HUG can be enforced in individual machines.

As illustrated in Figure 8, the bound is set at 2/3 for both tenants: tenant-B can use her elastic demand on link-2's spare resource, while tenant-A cannot, as she has already reached her bound on link-2.

3.3 HUG Properties

We list the main properties of HUG in the following.

1. In non-cooperative cloud environments, HUG is strategyproof (Theorem 3), maximizes isolation guarantees (Corollary 4), and ensures the highest utilization possible for an optimal isolation guarantee allocation (Theorem 5). In particular, Lemma 6 shows that in some cases DRF may have utilization arbitrarily close to 0, and HUG improves it to 100%. We defer the proofs of the properties in this section to Appendix A.

2. In cooperative environments like private datacenters, HUG maximizes isolation guarantees and is work-conserving. Work conservation is achievable because strategy-proofness is a non-requirement in this case.

3. Because HUG provides the optimal isolation guarantee, it provides min-cut proportionality (§3.3.1) in both non-cooperative and cooperative environments.

Regardless of resource types, the identified tradeoff exists in general multi-resource allocation problems, and HUG can directly be applied.

3.3.1 Min-Cut Proportionality

Prior work promoted the notion of proportionality [58], where tenants would expect to receive total allocations proportional to their number of VMs regardless of communication patterns. Meaning, two tenants, each with N VMs, should receive equal bandwidth even if tenant-X has an all-to-all communication pattern (i.e., d_X = ⟨1, 1, …, 1⟩) and tenant-Y has an N-to-1 pattern (i.e., exactly one 1 in d_Y and the rest are zeros). Figure 9 shows an example. Clearly, tenant-Y will be bottlenecked at her only receiver; trying to equalize them will only result in low utilization. As expected, FairCloud proved that such proportionality is not achievable, as it decreases both isolation guarantee and utilization [58]. None of the existing algorithms provide proportionality.

[Figure 9: Communication patterns of (a) tenant-X (N-to-N), with two minimum cuts of size P, where P is the number of fabric ports, and (b) tenant-Y (N-to-1), with one minimum cut of size 1. The size of the minimum cut of a communication pattern determines its effective bandwidth even if it were placed alone.]

Instead, we consider a relaxed notion of proportionality, called min-cut proportionality, that depends on communication patterns and ties proportionality to a tenant's progress. Specifically, each tenant receives minimum bandwidth proportional to the size of the minimum cut [31] of her communication pattern. Meaning, in the earlier example, tenant-X would receive P times more total bandwidth than tenant-Y, but they would have the optimal isolation guarantee (M_X = M_Y = 1/2).

Min-cut proportionality and optimal isolation guarantee can coexist, but they both have tradeoffs with work conservation.

4 Design Details

This section discusses how a cloud operator can implement, enforce, and expose HUG to the tenants (§4.1), how to exploit placement to further improve HUG's performance (§4.2), and how HUG can handle weighted, heterogeneous scenarios (§4.3).

4.1 Architectural Overview

HUG can easily be implemented atop the existing monitoring infrastructure of cloud operators (e.g., Amazon CloudWatch [1]). Tenants would periodically update their correlation vectors through a public API, and the operator would compute new allocations and update enforcing agents within milliseconds.

HUG API: The tenant-facing API simply transfers a tenant's correlation vector (d_k) to the operator. d_k = ⟨1, 1, …, 1⟩ is used as the default correlation vector. By design, HUG incentivizes tenants to report and maintain accurate correlation vectors. This is because the more accurate it is – instead of the default d_k = ⟨1, 1, …, 1⟩ –
the higher are her progress and performance. constraints like fault-tolerance and collocation [19], and
Many applications already know their long-term pro- we consider its detailed study an important future work.
files (e.g., multi-tier online services [19, 52]) and others It is worth noting that with any VM placement, HUG pro-
can calculate on the fly (e.g., bulk communication data- vides the highest attainable utilization without sacrificing
intensive applications [22, 23]). Moreover, existing tech- optimal isolation guarantee and strategy-proofness.
niques in traffic engineering can provide good accuracy
in estimating and predicting demand matrices for coarse
4.3 Additional Constraints
time granularities [17, 18, 20, 47, 48]. Weighted Tenants Giving preferential treatment to
−
→ −
→
Centralized Computation For any update, the oper- tenants is simple. Just using wk dk instead of dk in Equa-
ator must run Algorithm 1. Although Stage-1 requires tion (2) would account for tenant weights (wk for tenant-
solving a linear program to determine the optimal isola- k) in calculating M∗ .
tion guarantee (i.e., the DRF allocation) [33], it can also Heterogeneous Capacities Because allocations are
be rewritten as a closed-form equation [56] when tenants calculated independently in each machine based on M∗
can scale up and down following their normalized corre- and local capacities (Ci ), HUG supports heterogeneous
lation vectors. The progress of all tenants after Stage-1 link capacities.
of Algorithm 1 – the optimal isolation guarantee – is:
Bounded Demands So far we have considered only
1 elastic tenants. If tenant-k has bounded demands, i.e.,
M∗ = M (2) dik < 1 for all i ∈ [1, 2P ], calculating a common M∗
di and corresponding − → in one round using Equation (2)
P
max a
1≤i≤2P k=1 k k
will be inefficient. This is because tenant-k might require
Equation (2) is computationally inexpensive. For our less than the calculated allocation, and being bounded,
100-machine cluster, calculating M∗ takes about 5 mi- she cannot elastically scale up to use it. Instead, we must
croseconds. Communicating the decision to all 100 ma- use the multi-round DRF algorithm [56, Algorithm 1] in
chines takes just 8 milliseconds and to 100, 000 (emu- Stage-1 of HUG; Stage-2 will remain the same. Note that
lated) machines takes less than 1 second (§5.1.2). this is similar to max-min fairness in a single link when
The guaranteed minimum allocations of tenant-k can a flow has a smaller demand than its n1 -th share.
then be calculated as aik = M∗ dik for all 1 ≤ i ≤ 2P .
Local Enforcement Enforcement in Stage-2 of Algo-
5 Evaluation
rithm 1 is simple as well. After reserving the minimum We evaluated HUG using trace-driven simulations and
uplink and downlink allocations for each tenant, each EC2 deployments. Our results show the following:
machine needs to ensure that no tenant can consume • HUG isolates multiple tenants across the entire net-
more than M∗ fraction of the machine’s up or down link work, and it can scale up to 100, 000 machines with
capacities (Ci ) to the network; i.e., aik ≤ cik ≤ M∗ . less than one second overhead (§5.1).
The spare is allocated among tenants using local max-
• HUG ensures the optimal isolation guarantee – al-
min fairness subject to tenant-specific upper-bounds.
most 7000× more than per-flow fairness and about
Because we only care about inter-tenant behavior – not
7.4× more than PS-P in production traces – while
how a tenant performs internal sharing – stock Linux
providing 1.4× higher utilization than DRF (§5.2).
tc is sufficient (§5). A tenant has the flexibility to
choose from traditional per-flow fairness, shortest-first • HUG outperforms per-flow fairness (PS-P) by
flow scheduling [12, 41], or explicit rate-based flow con- 17.35× (1.48×) in terms of the 95th percentile slow-
trol [29]. down and by 1.49× (1.14×) in minimizing the aver-
age shuffle completion time (§5.3).
4.2 VM Placement and • HUG outperforms Varys [23] in terms of the maxi-
Re-Placement/Migration mum shuffle completion time by 1.77×, even though
While M∗ is optimal for a given placement, it can be im- Varys is 1.45× better in minimizing the average shuf-
proved by changing the placement of tenant VMs based fle completion time and 1.33× better in terms of the
on their correlation vectors. One must perform load bal- 95th percentile slowdown (§5.3).
ancing across all machines to minimize the denominator We present our results in three parts. First, we mi-
of Equation (2). Cloud operators can employ optimiza- crobenchmark HUG on 100-machine EC2 clusters to
tion frameworks like [19] to perform initial VM place- evaluate HUG’s guarantees and overheads (§5.1). Sec-
ment and periodic migrations with an additional load ond, we leverage traces collected from a 3200-machine
balancing constraint. However, VM placement is a noto- Facebook cluster by Popa et al. [58] to compare HUG’s
riously difficult problem because of often-incompatible instantaneous allocation characteristics with that of per-
8
100! 100! Tenant A!
Total Alloc (Gbps)!
Figure 10: [EC2] Bandwidth consumptions of three tenants arriving over time in a 100-machine EC2 cluster. Each tenant has 100
VMs, but each uses a different communication pattern (§5.1.1). We observe that (a) using TCP, tenant-B dominates the network by
creating more flows; (b) HUG isolates tenants A and C from tenant B.
flow fairness, PS-P [58], and DRF [33] (§5.2). Finally, we evaluate HUG's long-term impact on application performance using a 3000-machine Facebook cluster trace used by Chowdhury et al. [23] and compare against per-flow fairness, PS-P, and DRF, as well as Varys, which focuses only on improving performance (§5.3).

5.1 Testbed Experiments

Methodology We performed our experiments on 100 m2.4xlarge Amazon EC2 [2] instances running Linux kernel 3.4.37 and used the default htb and tc implementations. While there exist proposals for more accurate qdisc implementations [45, 57], the default htb worked sufficiently well for our purposes. Each of the machines had 1 Gbps NICs, and we could use close to the full 100 Gbps of aggregate bandwidth simultaneously.

5.1.1 Network-Wide Isolation

We consider a cluster with 100 EC2 machines, divided between three tenants A, B, and C that arrive over time. Each tenant has 100 VMs; i.e., VMs Ai, Bi, and Ci are collocated on the i-th physical machine. However, the tenants have different communication patterns: tenants A and C have pairwise one-to-one patterns (100 VM-VM flows each), whereas tenant-B follows an all-to-all pattern using 10,000 flows. Specifically, Ai communicates with A(i+50)%100, Cj communicates with C(j+25)%100, and any Bk communicates with all Bl, where i, j, k, l ∈ {1, ..., 100}. Each tenant demands the entire capacity at each machine; hence, the entire capacity of the cluster should be equally divided among the active tenants to maximize isolation guarantees.

Figure 10a shows that, in the absence of isolation guarantees, tenant-B takes up the entire capacity as soon as she arrives. Tenant-C receives only a marginal share because she arrives after tenant-B and leaves before her. Note that tenant-A (when alone) uses only about 80% of the available capacity; this is simply because a single TCP flow per VM-VM pair often cannot saturate the link.

Figure 10b presents the allocation using HUG. As tenants arrive and depart, allocations are dynamically calculated, propagated, and enforced in each machine of the cluster. As before, tenants A and C use marginally less than their allocations because they create only one flow between each VM-VM pair.

5.1.2 Scalability

The key challenge in scaling HUG is its centralized resource allocator, which must recalculate tenant shares and redistribute them across the entire cluster whenever any tenant changes her correlation vector.

We found that calculating new allocations using HUG takes less than 5 microseconds in our 100-machine cluster. Furthermore, a recomputation due to a tenant's arrival, departure, or change of correlation vector would take about 8.6 milliseconds on average for a 100,000-machine datacenter.

Communicating a new allocation takes less than 10 milliseconds for 100 machines and around 1 second for 100,000 emulated machines (i.e., sending the same message 1000 times to each of the 100 machines).

5.2 Instantaneous Fairness

While Section 5.1 evaluated HUG in controlled, synthetic scenarios, this section focuses on HUG's instantaneous allocation characteristics in the context of a large-scale cluster.

Methodology We use a one-hour snapshot with 100 concurrent jobs from a production MapReduce trace, which was extracted from a 3200-machine Facebook cluster by Popa et al. [58, Section 5.3]. Machines are connected to the network using 1 Gbps NICs. In the trace, a job with M mappers and R reducers – hence, the corresponding M × R shuffle – is described as a matrix with the amount of data to transfer between each M-R pair. We calculated the correlation vectors of individual shuffles from their communication matrices using the optimal rate allocation algorithm for a single shuffle [22, 23], which ensures that all the flows of each shuffle finish simultaneously.

Given the workload, we calculate the progress of each job/shuffle under different allocation mechanisms and
(a) Distribution of progress (Gbps) (b) Total allocation (c) Distribution of aggregate bandwidth (Gbps), comparing Per-Flow Fairness, PS-P, HUG, and DRF
Figure 11: [Simulation] HUG ensures higher isolation guarantee than high-utilization schemes like per-flow fairness and PS-P, and
provides higher utilization than multi-resource fairness schemes like DRF.
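The progress metric underlying Figure 11 is the minimum allocation-to-demand ratio across a tenant's links. A minimal sketch of that definition follows; the helper is a toy restatement of the paper's metric, not code from HUG itself.

```python
# Progress of a tenant/shuffle: the bandwidth it effectively receives at its
# bottleneck, i.e., the minimum allocation-to-demand ratio across its links.
# Toy restatement of the paper's definition; not the HUG implementation.
def progress(demand, alloc):
    """demand: correlation vector (per-link demands, largest entry = 1.0);
    alloc: per-link allocated link fractions."""
    return min(a / d for d, a in zip(demand, alloc) if d > 0)

# Tenant-A of Figure 1: demand <1/2, 1>, allocated <1/3, 2/3> of two links.
print(progress([0.5, 1.0], [1/3, 2/3]))  # 0.666... (two-thirds)
```

With 1 Gbps links, a progress of 2/3 corresponds to 2/3 Gbps at the bottleneck, which is why progress in Figure 11a is capped at 1 Gbps.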
cross-examine characteristics like isolation guarantee, utilization, and proportionality.

5.2.1 Impact on Progress

Figure 11a presents the distribution of progress of each shuffle. Recall that the progress of a shuffle – we consider each shuffle an individual tenant in this section – is the amount of bandwidth it receives on its bottleneck uplink or downlink (i.e., progress can be at most 1 Gbps). Both HUG and DRF (overlapping vertical lines in Figure 11a) ensure the same progress (0.74 Gbps) for all shuffles. Note that despite the same progress, shuffles will finish at different times based on how much data each one has to send (§5.3). Per-flow fairness and PS-P provide very wide ranges: 112 Kbps to 1 Gbps for the former and 0.1 Gbps to 1 Gbps for the latter. Shuffles with many flows crowd out the ones with fewer flows under per-flow fairness, and PS-P suffers by ignoring correlation vectors and through indiscriminate work conservation.

5.2.2 Impact on Utilization

By favoring heavy tenants, per-flow fairness and PS-P do succeed in their goal of increasing network utilization (Figure 11b). Given the communication patterns of the workload, the former utilizes 69% of the 3.2 Tbps total capacity across all machines, and the latter utilizes 68.6%. In contrast, DRF utilizes only 45%. HUG provides a common ground by extending utilization to 62.4% without breaking strategy-proofness and while providing the optimal isolation guarantee.

Figure 11c breaks down the total allocation of each shuffle and demonstrates two high-level points:

1. HUG ensures overall higher utilization (1.4× on average) than DRF by ensuring equal progress for smaller shuffles and by using up additional bandwidth for larger shuffles. It does so while ensuring the same optimal isolation guarantee as DRF.

2. Per-flow fairness crosses HUG at the 90-th percentile; i.e., the top 10% of shuffles receive more bandwidth than they do under HUG, while the other 90% receive less than they do using HUG. PS-P crosses over at the 76-th percentile.

Bin           | 1 (SN) | 2 (LN) | 3 (SW) | 4 (LW)
% of Shuffles |  52%   |  16%   |  15%   |  17%

Table 2: Shuffles binned by their lengths (Short and Long) and widths (Narrow and Wide).

5.2.3 Impact on Proportionality

A collateral benefit of HUG is that tenants receive allocations proportional to their bottleneck demands. Consequently, despite the same progress across all shuffles (Figure 11a), their total allocations vary (Figure 11c) based on the size of the minimum cuts in their communication patterns.

5.3 Long-Term Characteristics

We have shown in Section 5.2 that HUG provides the optimal isolation guarantee in the instantaneous case. However, similar to all instantaneous solutions [33, 43, 58], HUG does not provide any long-term isolation or fairness guarantees. Consequently, in this section, we evaluate HUG's long-term impact on performance using a production trace through simulations.

Methodology For these simulations, we use a MapReduce/Hive trace from a 3000-machine production Facebook cluster. The trace includes the arrival times, communication matrices, and task placements of over 10,000 shuffles during one day. Shuffles in this trace have diverse length (i.e., size of the longest flow) and width (i.e., number of flows) characteristics and roughly follow the same distribution as the original trace (Table 2). We consider a shuffle to be short if its longest flow is less than 5 MB and narrow if it has at most 50 flows; we use the same categorization. We calculated the correlation vector of each shuffle as we did before (§5.2).
(a) CDFs: fraction of coflows vs. slowdown w.r.t. minimum required completion time (X-axis: 1–1000, log scale)

(b) Summaries:

Scheme            | Min | 95th | AVG   | STDEV
Per-Flow Fairness | 1   | 69.4 | 15.52 | 65.54
PS-P              | 1   |  5.9 |  2.22 |  2.97
HUG               | 1   |  4.0 |  1.86 |  2.25
DRF               | 1   |  4.4 |  2.11 |  2.66
Varys             | 1   |  3.0 |  1.43 |  0.99

Figure 12: [Simulation] Slowdowns of shuffles using different mechanisms w.r.t. the minimum completion time of each shuffle. The X-axis of (a) is in log scale.

5.3.1 Improvements Over Per-Flow Fairness

HUG improves over per-flow fairness both in terms of slowdown and performance. The 95th percentile slowdown using HUG is 17.35× better than that of per-flow fairness (Figure 12b). Overall, HUG provides better slowdown across the board (Figure 12a) – 61% of shuffles are better off using HUG, and the rest remain the same.

HUG improves the average completion time of shuffles by 1.49× (Figure 13). The biggest wins come from bin-1 (4.8×) and bin-2 (6.15×), which include the so-called narrow shuffles with less than 50 flows. This reinforces the fact that HUG isolates tenants with fewer flows from those with many flows. Overall, HUG performs well across all bins.

5.3.2 Improvements Over PS-P

HUG improves over PS-P in terms of the 95th percentile slowdown by 1.48×, and 45% of shuffles are better off using HUG. HUG also provides better average shuffle completion times than PS-P, for an overall improvement of 1.14×. Large improvements again come in bin-1 (1.19×) and bin-2 (1.27×) because PS-P also favors tenants with more flows.
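The comparisons in §5.3 reduce to simple summary statistics over per-shuffle slowdowns and completion times. The following is a hedged sketch of those computations; the helper names and the sample durations are ours, for illustration only.

```python
import statistics

# Slowdown of a shuffle: completion time under a scheme divided by its
# minimum (run-alone) completion time; the minimum possible value is 1.
def slowdown(duration, min_duration):
    return duration / min_duration

# Normalized completion time (cf. Figure 13): a scheme's average completion
# time divided by HUG's average; values > 1 mean HUG is faster.
def normalized_comp_time(scheme_durations, hug_durations):
    return statistics.mean(scheme_durations) / statistics.mean(hug_durations)

# Made-up per-shuffle durations in seconds, for illustration only.
hug = [10.0, 20.0, 30.0]
other = [12.0, 26.0, 40.0]
print(slowdown(12.0, 10.0))              # 1.2
print(normalized_comp_time(other, hug))  # 1.3
```

The 95th percentile of the per-shuffle slowdowns, computed the same way over the whole trace, yields the "95th" column of Figure 12b.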
The higher utilization of per-flow fairness and PS-P (§5.2) does not help in the long run due to their lower isolation guarantee.

Figure 13: [Simulation] Average shuffle completion times of Per-Flow Fairness, DRF, and Varys, normalized by HUG's average completion time, across shuffle types (Bin 1–Bin 4 and ALL). 95-th percentile plots are similar.

5.3.3 Improvements Over DRF

While HUG and DRF have the same worst-case slowdown, 70% of shuffles are better off using HUG. HUG also provides better average shuffle completion times than DRF, for an overall improvement of 1.14×.

5.3.4 Comparison to Varys

Varys outperforms HUG by 1.33× in terms of the 95th percentile slowdown and by 1.45× in terms of average shuffle completion time. However, because Varys attempts to improve the average completion time by prioritization, it takes a risk in terms of the maximum completion time. More precisely, HUG outperforms Varys by 1.77× in terms of the maximum shuffle completion time (not shown).

Metrics We consider two metrics: the 95th percentile slowdown and the average shuffle completion time, which respectively measure long-term progress and performance characteristics.

We define the slowdown of a shuffle as its completion time under a scheme normalized by its minimum completion time if it were running alone; i.e.,

Slowdown = Compared Duration / Minimum Duration

The minimum value of slowdown is one.

We measure performance as the shuffle completion time under a scheme normalized by that using HUG; i.e.,

Normalized Comp. Time = Compared Duration / HUG's Duration

If the normalized completion time of a scheme is greater (smaller) than one, HUG is faster (slower).

6 Discussion

Payment Model Similar to many existing proposals [32, 33, 45, 46, 58, 59, 61, 62], we assume that tenants pay per-hour flat rates for individual VMs, but there is no pricing model associated with their network usage. This is also the prevalent model of resource pricing in cloud computing [2, 5, 8]. Exploring whether and how a network pricing model would change our solution, and what that model would look like, requires further attention.

Determining Correlation Vectors Unlike long-term correlation vectors, e.g., over the course of an hour or for an entire shuffle, accurately capturing short-term changes can be difficult. How fast tenants should update their vectors, and whether that is faster than centralized HUG can react to, requires additional analysis.

Decentralized HUG HUG's centralized design makes it easier to analyze its properties and simplifies its implementation. We believe that designing a decentralized version of HUG is an important piece of future work, which will be especially relevant for sharing wide-area networks in the context of geo-distributed analytics [60, 66].

7 Related Work

Single-Resource Fairness Max-min fairness was first proposed by Jaffe [43] to ensure at least 1/n-th of a link's capacity to each flow. Thereafter, many mechanisms have been proposed to achieve it, including weighted fair queueing (WFQ) [25, 55] and those similar to or extending WFQ [16, 34, 35, 63, 64]. We generalize max-min fairness to the parallel communication observed in scale-out applications, showing that unlike in the single-link scenario, optimal isolation guarantee, strategy-proofness, and work conservation cannot coexist.

Multi-Resource Fairness Dominant Resource Fairness (DRF) [33] maximizes the dominant share of each user in a strategyproof manner. Solutions that have attempted to improve the system-level efficiency of multi-resource allocation – both before [54, 65] and after [27, 38, 56] DRF – sacrifice strategy-proofness. We have proven that work-conserving allocation without strategy-proofness can hurt utilization instead of improving it.

Dominant Resource Fair Queueing (DRFQ) [32] approximates DRF over time in individual middleboxes. In contrast, HUG generalizes DRF to environments with elastic demands to increase utilization across the entire network, and focuses only on instantaneous fairness.

Joe-Wong et al. [46] have presented a unifying framework to capture fairness-efficiency tradeoffs in multi-resource environments. They assume a cooperative environment, where tenants never lie; HUG falls under their FDS family of mechanisms. In non-cooperative environments, however, we have shown that the interplay between work conservation and strategy-proofness is critical, and our work complements the framework of [46].

Network-Wide / Tenant-Level Fairness Proposals for sharing cloud networks range from static allocation [13, 14, 44] and VM-level guarantees [61, 62] to variations of network-wide sharing mechanisms [45, 51, 58, 59, 67]. We refer the reader to the survey by Mogul and Popa [53] for an overview. FairCloud [58] stands out by systematically discussing the tradeoffs and addressing several limitations of other approaches. Our work generalizes FairCloud [58] and many proposals similar to FairCloud's PS-P policy [45, 59, 61]. When all tenants have elastic demands, i.e., all correlation vectors have all elements equal to 1, we give the same allocation; in all other cases, we provide higher isolation guarantee and utilization.

Efficient Schedulers Researchers have also focused on efficient scheduling and/or packing of datacenter resources to minimize job and communication completion times [12, 21–23, 26, 36, 41]. Our work is orthogonal and complementary to these works, which focus on application-level efficiency within each tenant. We guarantee isolation across tenants, so that each tenant can internally perform whatever efficiency or fairness optimizations she wants among her own applications.

8 Conclusion

In this paper, we have proved that there is a strong tradeoff between optimal isolation guarantees and high utilization in non-cooperative public clouds. We have also proved that work conservation can decrease utilization instead of improving it, because no network sharing algorithm remains strategyproof in its presence.

To this end, we have proposed HUG, which restricts the bandwidth utilization of each tenant to ensure the highest utilization with the optimal isolation guarantee across multiple tenants in non-cooperative environments. In cooperative environments, where strategy-proofness might be a non-requirement, HUG simultaneously ensures both work conservation and the optimal isolation guarantee.

HUG generalizes single-resource max-min fairness to multi-resource environments where a tenant's demands on different resources are correlated and elastic. In particular, it provides the optimal isolation guarantee, which is significantly higher than that provided by existing multi-tenant network sharing algorithms. HUG also complements DRF with provably highest utilization without sacrificing DRF's other useful properties. Regardless of resource types, the identified tradeoff exists in general multi-resource allocation problems, and all those scenarios can take advantage of HUG.

Acknowledgments

We thank our shepherd Mohammad Alizadeh and the anonymous reviewers of SIGCOMM'15 and NSDI'16 for useful feedback. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, NSF Award CNS-1464388, DOE Award SN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc., Blue Goji, Bosch, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Fujitsu, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata, and VMware.
References

[1] Amazon CloudWatch. http://aws.amazon.com/cloudwatch.

[2] Amazon EC2. http://aws.amazon.com/ec2.

[3] Apache Hadoop. http://hadoop.apache.org.

[4] AWS Innovation at Scale. http://goo.gl/Py2Ueo.

[5] Google Compute Engine. https://cloud.google.com/compute.

[6] Google Container Engine. http://kubernetes.io.

[7] A look inside Google's data center networks. http://goo.gl/u0vZCY.

[8] Microsoft Azure. http://azure.microsoft.com.

[9] Storm: Distributed and fault-tolerant realtime computation. http://storm-project.net.

[10] Trident: Stateful stream processing on Storm. http://goo.gl/cKsvbj.

[11] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, F. Matus, R. Pan, N. Yadav, and G. Varghese. CONGA: Distributed congestion-aware load balancing for datacenters. In SIGCOMM, 2014.

[12] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker. pFabric: Minimal near-optimal datacenter transport. In SIGCOMM, 2013.

[13] S. Angel, H. Ballani, T. Karagiannis, G. O'Shea, and E. Thereska. End-to-end performance isolation through virtual datacenters. In OSDI, 2014.

[14] H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron. Towards predictable datacenter networks. In SIGCOMM, 2011.

[15] H. Ballani, K. Jang, T. Karagiannis, C. Kim, D. Gunawardena, and G. O'Shea. Chatty tenants and the cloud network sharing problem. In NSDI, 2013.

[16] J. Bennett and H. Zhang. WF²Q: Worst-case fair weighted fair queueing. In INFOCOM, 1996.

[17] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In IMC, 2010.

[18] T. Benson, A. Anand, A. Akella, and M. Zhang. MicroTE: Fine grained traffic engineering for data centers. In CoNEXT, 2011.

[19] P. Bodik, I. Menache, M. Chowdhury, P. Mani, D. Maltz, and I. Stoica. Surviving failures in bandwidth-constrained datacenters. In SIGCOMM, 2012.

[20] M. Chowdhury, S. Kandula, and I. Stoica. Leveraging endpoint flexibility in data-intensive clusters. In SIGCOMM, 2013.

[21] M. Chowdhury and I. Stoica. Efficient coflow scheduling without prior knowledge. In SIGCOMM, 2015.

[22] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica. Managing data transfers in computer clusters with Orchestra. In SIGCOMM, 2011.

[23] M. Chowdhury, Y. Zhong, and I. Stoica. Efficient coflow scheduling with Varys. In SIGCOMM, 2014.

[24] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.

[25] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm. In SIGCOMM, 1989.

[26] F. Dogar, T. Karagiannis, H. Ballani, and A. Rowstron. Decentralized task-aware scheduling for data center networks. In SIGCOMM, 2014.

[27] D. Dolev, D. G. Feitelson, J. Y. Halpern, R. Kupferman, and N. Linial. No justified complaints: On fair sharing of multiple resources. In ITCS, 2012.

[28] N. G. Duffield, P. Goyal, A. Greenberg, P. Mishra, K. K. Ramakrishnan, and J. E. van der Merive. A flexible model for resource management in virtual private networks. In SIGCOMM, 1999.

[29] N. Dukkipati. Rate Control Protocol (RCP): Congestion control to make flows complete quickly. PhD thesis, Stanford University, 2007.

[30] M. M. Flood. Some experimental games. Management Science, 5(1):5–26, 1958.

[31] L. R. Ford and D. R. Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics, 8(3):399–404, 1956.

[32] A. Ghodsi, V. Sekar, M. Zaharia, and I. Stoica. Multi-resource fair queueing for packet processing. In SIGCOMM, 2012.
[33] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In NSDI, 2011.

[34] S. J. Golestani. Network delay analysis of a class of fair queueing algorithms. IEEE JSAC, 13(6):1057–1070, 1995.

[35] P. Goyal, H. M. Vin, and H. Chen. Start-time fair queueing: A scheduling algorithm for integrated services packet switching networks. In SIGCOMM, 1996.

[36] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. In SIGCOMM, 2014.

[37] C. Guo, G. Lu, H. J. Wang, S. Yang, C. Kong, P. Sun, W. Wu, and Y. Zhang. SecondNet: A data center network virtualization architecture with bandwidth guarantees. In CoNEXT, 2010.

[38] A. Gutman and N. Nisan. Fair allocation without trade. In AAMAS, 2012.

[39] M. Hajjat, X. Sun, Y.-W. E. Sung, D. Maltz, S. Rao, K. Sripanidkulchai, and M. Tawarmalani. Cloudward bound: Planning for beneficial migration of enterprise applications to the cloud. In SIGCOMM, 2010.

[40] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In NSDI, 2011.

[41] C.-Y. Hong, M. Caesar, and P. B. Godfrey. Finishing flows quickly with preemptive scheduling. In SIGCOMM, 2012.

[42] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.

[43] J. M. Jaffe. Bottleneck flow control. IEEE Transactions on Communications, 29(7):954–962, 1981.

[44] K. Jang, J. Sherry, H. Ballani, and T. Moncaster. Silo: Predictable message completion time in the cloud. In SIGCOMM, 2015.

[46] C. Joe-Wong, S. Sen, T. Lan, and M. Chiang. Multiresource allocation: Fairness-efficiency tradeoffs in a unifying framework. In INFOCOM, 2012.

[47] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the tightrope: Responsive yet stable traffic engineering. In SIGCOMM, 2005.

[48] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of datacenter traffic: Measurements and analysis. In IMC, 2009.

[49] N. Kang, Z. Liu, J. Rexford, and D. Walker. Optimizing the "One Big Switch" abstraction in Software-Defined Networks. In CoNEXT, 2013.

[50] K. LaCurts, J. Mogul, H. Balakrishnan, and Y. Turner. Cicada: Introducing predictive guarantees for cloud networks. In HotCloud, 2014.

[51] V. T. Lam, S. Radhakrishnan, A. Vahdat, G. Varghese, and R. Pan. NetShare and stochastic NetShare: Predictable bandwidth allocation for data centers. SIGCOMM CCR, 42(3):5–11, 2012.

[52] J. Lee, Y. Turner, M. Lee, L. Popa, S. Banerjee, J.-M. Kang, and P. Sharma. Application-driven bandwidth guarantees in datacenters. In SIGCOMM, 2014.

[53] J. C. Mogul and L. Popa. What we talk about when we talk about cloud network performance. SIGCOMM CCR, 42(5):44–48, 2012.

[54] J. F. Nash Jr. The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155–162, 1950.

[55] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: The single-node case. IEEE/ACM ToN, 1(3):344–357, 1993.

[56] D. C. Parkes, A. D. Procaccia, and N. Shah. Beyond dominant resource fairness: Extensions, limitations, and indivisibilities. In EC, 2012.

[57] J. Perry, A. Ousterhout, H. Balakrishnan, D. Shah, and H. Fugal. Fastpass: A centralized "zero-queue" datacenter network. In SIGCOMM, 2014.
[58] L. Popa, G. Kumar, M. Chowdhury, A. Krishnamurthy, S. Ratnasamy, and I. Stoica. FairCloud: Sharing the network in cloud computing. In SIGCOMM, 2012.

[59] L. Popa, P. Yalagandula, S. Banerjee, J. C. Mogul, Y. Turner, and J. R. Santos. ElasticSwitch: Practical work-conserving bandwidth guarantees for cloud computing. In SIGCOMM, 2013.

[60] Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, V. Bahl, and I. Stoica. Low latency geo-distributed data analytics. In SIGCOMM, 2015.

[61] H. Rodrigues, J. R. Santos, Y. Turner, P. Soares, and D. Guedes. Gatekeeper: Supporting bandwidth guarantees for multi-tenant datacenter networks. In USENIX WIOV, 2011.

[62] A. Shieh, S. Kandula, A. Greenberg, and C. Kim. Sharing the data center network. In NSDI, 2011.

[63] M. Shreedhar and G. Varghese. Efficient fair queueing using deficit round robin. In SIGCOMM, 1995.

[64] I. Stoica, H. Zhang, and T. Ng. A hierarchical fair service curve algorithm for link-sharing, real-time, and priority services. In SIGCOMM, 1997.

[65] H. R. Varian. Equity, envy, and efficiency. Journal of Economic Theory, 9(1):63–91, 1974.

[66] A. Vulimiri, C. Curino, B. Godfrey, J. Padhye, and G. Varghese. Global analytics in the face of bandwidth and regulatory constraints. In NSDI, 2015.

[67] D. Xie, N. Ding, Y. C. Hu, and R. Kompella. The only constant is change: Incorporating time-varying network reservations in data centers. In SIGCOMM, 2012.

[68] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.

[69] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant stream computation at scale. In SOSP, 2013.

…in link-1, (1/2+ε)/(3/2+ε), is already larger than before (1/3). If the work conservation policy allocates the spare resource in link-2 by δ (δ may be small but is a positive value), her progress will change to M'_A = min(a'^1_A/d^1_A, a'^2_A/d^2_A) = min((1+2ε)/(3/2+ε), 1/(3/2+ε) + δ). As long as ε < (3δ/2)/(2/3−δ) (if δ ≥ 2/3, we have no constraint on ε), her progress will be better than when she was telling the truth, which makes the policy not strategyproof. The operator cannot prevent this because she knows neither a tenant's true correlation vector nor ε, the extent of the tenant's lie.

Theorem 3 Algorithm 1 is strategyproof.

Proof Sketch (of Theorem 3) Because DRF is strategyproof, the first stage of Algorithm 1 is strategyproof as well. We show that adding the second stage does not violate the strategy-proofness of the combination.

Assume that link-b is a system bottleneck – the link DRF saturated to maximize the isolation guarantee in the first stage; i.e., b = arg max_i Σ_{k=1..M} d^i_k. We use D_b = Σ_{k=1..M} d^b_k to denote the total demand in link-b (D_b ≥ 1), and M_k = 1/D_b for the corresponding progress of each tenant-k (k ∈ {1, ..., M}) when link-b is the system bottleneck. In Figure 5, b = 1. The following arguments hold even for multiple bottlenecks.

Any tenant-k can attempt to increase her progress (M_k) only by lying about her correlation vector (d_k). Formally, her action space consists of all possible correlation vectors: she can increase and/or decrease the demands of individual resources to report a different vector, d'_k, and obtain a new progress, M'_k (> M_k). Tenant-k can attempt one of two alternatives when reporting d'_k: either keep link-b the system bottleneck or change it. We show that Algorithm 1 is strategyproof in both cases; i.e., M'_k ≤ M_k.

Case 1: link-b is still the system bottleneck. Her progress cannot improve because

• if d'^b_k ≤ d^b_k, her share of the system bottleneck will decrease in the first stage, and so will her progress. There is no spare resource to allocate in link-b. For example, if tenant-A reports d'^1_A = 1/4 instead of d^1_A = 1/2 in Figure 5, her allocation will decrease to 1/5-
[Figures 14 and 15: example allocations under DRF/MinAll and OPT, with per-link shares shown as fractions.]
timal isolation guarantee allocation shown in Figure 14a (M_A = M_B = M_C = 2/5; therefore, the isolation guarantee is 2/5). PS-P [45, 58, 59] falls in this category.

Worse, when tenants do not have elastic-demand applications, demand-agnostic policies are not even work-conserving (similar to the example in §2.3.4).

Lemma 7 When tenants do not have elastic demands, per-resource equal sharing is not work-conserving.

Proof Sketch Only 11/12-th of link-1 and 5/9-th of link-2 will be consumed; i.e., none of the links will be saturated!

To make it work-conserving, PS-P suggests dividing spare resources based on whoever wants them.

Lemma 8 When tenants do not have elastic demands, PS-P is not work-conserving.

Proof Sketch If tenant-B gives up her spare allocation in link-2, tenant-A can increase her progress to M_A = 2/3 and saturate link-1; however, tenant-B and tenant-C will remain at M_B = M_C = 1/3. If tenant-A gives up her spare allocation in link-1, tenant-B and tenant-C can increase their progress to M_B = M_C = 3/8 and saturate link-1, but tenant-A will remain at M_A = 1/2. Because both tenant-A and tenant-B have chances of increasing their progress, both will hold on to their allocations even with useless traffic – another instance of the Prisoner's Dilemma.

B.2 Unfair Policies

Instead of demand-agnostic policies, one can also consider simpler, unfair policies; e.g., allocating all the resources …

Corollary 11 (of Lemma 10) In the presence of work conservation, tenants can lie both by increasing and by decreasing their demands, or by a combination of both.

Lemma 12 Allocating the spare resource to the tenant with the highest demand is not strategyproof.

Proof Sketch If tenant-A changes her correlation vector to d'_A = <1, 1>, the eventual allocation (Figure 15b) will again result in lower progress (M_B = M_C = 1/3). Because tenant-B is still receiving more than 1/6-th of her link-1 allocation in link-2, she does not need to lie.

Corollary 13 (of Lemmas 10, 12) Allocating the spare resource randomly to tenants is not strategyproof.

B.3 Locally Fair Policies

Finally, one can also consider equally or proportionally dividing the spare resource on link-2 between tenant-A and tenant-B. Unfortunately, these strategies are not strategyproof either.

Lemma 14 Allocating the spare resource equally to tenants is not strategyproof.

Proof Sketch If the remaining 8/15-th of link-2 is equally divided, the share of tenant-A will increase to 2/3-rd and incentivize her to lie. Again, the isolation guarantee will be smaller (Figure 15c).

Lemma 15 Allocating the spare resource proportionally to tenants' demands is not strategyproof.

Proof Sketch If one divides the spare resource in proportion to tenant demands, the allocation (Figure 15d) differs from that of equal division. However, tenant-A can again increase her progress at the expense of others.
C Dual Objectives of Network Sharing

The two conflicting requirements of the network sharing problem can be defined as follows:

1. Utilization: Σ_{i ∈ [1,2P]} Σ_{k ∈ [1,M]} c^i_k

2. Isolation guarantee: min_{k ∈ [1,M]} M_k

Given the tradeoff between the two, one can consider one of the two possible optimizations:⁸

O1 Ensure the highest utilization, then maximize the isolation guarantee with best effort;

O2 Ensure the optimal isolation guarantee, then maximize utilization with best effort.

O1: Utilization-First In this case, the optimization attempts to maximize the isolation guarantee across all tenants while keeping the highest network utilization:

Maximize   min_{k ∈ [1,M]} M_k
subject to Σ_{i ∈ [1,2P]} Σ_{k ∈ [1,M]} c^i_k = U*,        (3)

where U* = max Σ_{i ∈ [1,2P]} Σ_{k ∈ [1,M]} c^i_k is the highest utilization possible. Although this ensures maximum network utilization, the isolation guarantee to individual tenants can be arbitrarily low. This formulation can still be useful in private datacenters [36].

To ensure some isolation guarantee, existing cloud network sharing approaches [14, 45, 51, 58, 59, 61, 62, 67] use a similar formulation:

Maximize   Σ_{i ∈ [1,2P]} Σ_{k ∈ [1,M]} c^i_k
subject to M_k ≥ 1/M,  k ∈ [1,M]        (4)

The objective here is to maximize utilization while ensuring at least 1/M-th of each link to tenant-k. However, this approach has two primary drawbacks (§2.3):

1. suboptimal isolation guarantee, and

2. lower utilization.

O2: Isolation-Guarantee-First Instead, in this paper, we have formulated the network sharing problem as follows:

Maximize   Σ_{i ∈ [1,2P]} Σ_{k ∈ [1,M]} c^i_k

Here, we maximize resource consumption while keeping the optimal isolation guarantee, denoted by M*_k, across all tenants. Meanwhile, the constraint that consumption be at least the guaranteed minimum allocation ensures strategy-proofness, thus guaranteeing that the guaranteed allocated resources will be utilized.

Because c^i_k values have no upper bounds except for physical capacity constraints, optimization O2 may result in a suboptimal isolation guarantee in non-cooperative environments (§2.3.3). HUG introduces the following additional constraint to avoid this issue only in non-cooperative environments:

c^i_k ≤ M*,  i ∈ [1,2P],  k ∈ [1,M]

This constraint is not necessary when strategy-proofness is a non-requirement – e.g., in private datacenters.

⁸ …ture direction.
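As a numerical illustration of the isolation-guarantee-first idea, the optimal isolation guarantee for the two-tenant example used throughout the paper (d_A = <1/2, 1>, d_B = <1, 1/6>, unit-capacity links) can be computed directly. This is a toy sketch of the formulation, with our own variable names; it is not the HUG algorithm itself.

```python
# Toy sketch of the isolation-guarantee-first objective (Appendix C) on the
# running two-link, two-tenant example with unit link capacities.
demands = {
    "A": [0.5, 1.0],    # correlation vector of tenant-A
    "B": [1.0, 1 / 6],  # correlation vector of tenant-B
}
num_links = 2

# The optimal isolation guarantee M* is 1 / (largest total demand on a link).
link_totals = [sum(d[i] for d in demands.values()) for i in range(num_links)]
m_star = 1.0 / max(link_totals)

# Guaranteed allocation of each tenant on each link: M* times her demand.
guaranteed = {t: [m_star * di for di in d] for t, d in demands.items()}

print(round(m_star, 4))                        # 0.6667 (= 2/3, as in Figure 1)
print([round(x, 4) for x in guaranteed["A"]])  # [0.3333, 0.6667]
```

Link-1 carries the largest total demand (3/2), so it is the system bottleneck and every tenant's progress equals 1/D_b = 2/3, matching the allocation in Figure 1.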