HUG: Multi-Resource Fairness for Correlated and Elastic Demands

Mosharaf Chowdhury^1, Zhenhua Liu^2, Ali Ghodsi^3, Ion Stoica^3
^1 University of Michigan   ^2 Stony Brook University   ^3 UC Berkeley, Databricks Inc.
Abstract

In this paper, we study how to optimally provide isolation guarantees in multi-resource environments, such as public clouds, where a tenant's demands on different resources (links) are correlated. Unlike prior work such as Dominant Resource Fairness (DRF) that assumes static and fixed demands, we consider elastic demands. Our approach generalizes canonical max-min fairness to the multi-resource setting with correlated demands, and extends DRF to elastic demands. We consider two natural optimization objectives: isolation guarantee from a tenant's viewpoint and system utilization (work conservation) from an operator's perspective. We prove that in non-cooperative environments like public cloud networks, there is a strong tradeoff between optimal isolation guarantee and work conservation when demands are elastic. Even worse, work conservation can even decrease network utilization instead of improving it when demands are inelastic. We identify the root cause behind the tradeoff and present a provably optimal allocation algorithm, High Utilization with Guarantees (HUG), to achieve maximum attainable network utilization without sacrificing the optimal isolation guarantee, strategyproofness, and other useful properties of DRF. In cooperative environments like private datacenter networks, HUG achieves both the optimal isolation guarantee and work conservation. Analyses, simulations, and experiments show that HUG provides better isolation guarantees, higher system utilization, and better tenant-level performance than its counterparts.

[Figure 1: Bandwidth allocations in two independent links, shown as (a) demands, (b) max-min fair, and (c) DRF [33], for tenant-A (orange) with correlation vector d_A = ⟨1/2, 1⟩ and tenant-B (dark blue) with d_B = ⟨1, 1/6⟩. Shaded portions are not allocated to any tenant.]

1 Introduction

In shared, multi-tenant environments such as public clouds [2, 5, 6, 8, 40], the need for predictability and the means to achieve it remain a constant source of discussion [15, 44, 45, 50, 51, 58, 59, 61, 62]. The general consensus – recently summarized by Mogul and Popa [53] – is that tenants expect guaranteed minimum bandwidth (i.e., isolation guarantee) for performance predictability, while network operators strive for work conservation to achieve high utilization and strategy-proofness to ensure isolation.

Max-min fairness [43] – a widely-used [16, 25, 34, 35, 55, 63, 64] allocation policy – achieves all three in the context of a single link. It provides the optimal isolation guarantee by maximizing the minimum amount of bandwidth allocated to each flow. The bandwidth allocation of a user (tenant) determines her progress – i.e., how fast she can complete her data transfer. It is work-conserving, because, given enough demand, it allocates the entire bandwidth of the link. Finally, it is strategyproof, because tenants cannot get more bandwidth by lying about their demands (e.g., by sending more traffic).

However, a datacenter network involves many links, and tenants' demands on different links are often correlated. Informally, we say that the demands of a tenant on two links i and j are correlated if, for every bit the tenant sends on link-i, she sends at least α bits on link-j. More formally, with every tenant-k, we associate a correlation vector d_k = ⟨d_k^1, d_k^2, …, d_k^n⟩, where d_k^i ≤ 1, which captures the fact that for every d_k^i bits tenant-k sends on link-i, it should send at least d_k^j bits on link-j. Examples of applications with correlated demands include optimal shuffle schedules [22, 23], long-running services [19, 52], multi-tiered enterprise applications [39], and realtime streaming applications [10, 69].

Consider the example in Figure 1a with two independent links and two tenants. The correlation vector d_A = ⟨1/2, 1⟩ means that (i) link-2 is tenant-A's bottleneck; (ii) for every M_A rate tenant-A is allocated on the bottleneck link, she requires at least M_A/2 rate on link-1, resulting in a progress of M_A; and (iii) except for the bottleneck link, tenants' demands are elastic, meaning tenant-A can use more than M_A/2 rate on link-1.^1 Similarly, tenant-B requires at least M_B/6 on link-2 for M_B on link-1. If we denote the rate allocated to tenant-k on link-i by a_k^i, then M_k = min_i {a_k^i / d_k^i}, the minimum demand-normalized rate allocation over all links, captures her progress.

In this paper, we want to generalize max-min fairness to tenants with correlated and elastic demands while maintaining its desirable properties: optimal isolation guarantee, high utilization, and strategy-proofness.

^1 While it does not improve the instantaneous progress of tenant-A, it increases network utilization, which is desired by the operators.
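The progress metric just defined is easy to make concrete. The following Python sketch (illustrative only; the `progress` helper is our own naming, not the paper's code) recomputes the Figure 1 example, comparing per-link max-min fairness (Figure 1b) with DRF (Figure 1c):

```python
from fractions import Fraction as F

def progress(alloc, demand):
    """Progress M_k = min_i alloc_i / demand_i over links where demand_i > 0."""
    return min(a / d for a, d in zip(alloc, demand) if d > 0)

# Correlation vectors from Figure 1.
d_A = [F(1, 2), F(1)]
d_B = [F(1), F(1, 6)]

# Per-link max-min fairness (Figure 1b): each tenant gets 1/2 on each link.
mm_A = progress([F(1, 2), F(1, 2)], d_A)   # = 1/2
mm_B = progress([F(1, 2), F(1, 2)], d_B)   # = 1/2

# DRF (Figure 1c): equalize dominant shares across tenants.
drf_A = progress([F(1, 3), F(2, 3)], d_A)  # = 2/3
drf_B = progress([F(2, 3), F(1, 9)], d_B)  # = 2/3

print(min(mm_A, mm_B), min(drf_A, drf_B))  # isolation guarantee: 1/2 vs. 2/3
```

The isolation guarantee improves from 1/2 under per-link max-min fairness to 2/3 under DRF, matching the numbers discussed around Figures 1b and 1c.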
Intuitively, we want to maximize the minimum progress over all tenants, i.e., maximize min_k M_k, where min_k M_k corresponds to the isolation guarantee of an allocation algorithm. We make three observations. First, when there is a single link in the system, this model trivially reduces to max-min fairness. Second, getting more aggregate bandwidth is not always better. For tenant-A in the example, ⟨50 Mbps, 100 Mbps⟩ is better than ⟨90 Mbps, 90 Mbps⟩ or ⟨25 Mbps, 200 Mbps⟩, even though the latter ones have more bandwidth in total. Third, simply applying max-min fairness to individual links is not enough. In our example, max-min fairness allocates equal resources to both tenants on both links, resulting in allocations ⟨1/2, 1/2⟩ on both links (Figure 1b). The corresponding progress (M_A = M_B = 1/2) results in a suboptimal isolation guarantee (min{M_A, M_B} = 1/2).

Dominant Resource Fairness (DRF) [33] extends max-min fairness to multiple resources and prevents such suboptimality. It equalizes the shares of dominant resources – link-2 (link-1) for tenant-A (tenant-B) – across all tenants with correlated demands and maximizes the isolation guarantee in a strategyproof manner. As shown in Figure 1c, using DRF, both tenants have the same progress – M_A = M_B = 2/3, 50% higher than using max-min fairness on individual links. Moreover, DRF's isolation guarantee (min{M_A, M_B} = 2/3) is optimal across all possible allocations and is strategyproof.

However, DRF assumes inelastic demands [40], and it is not work-conserving. For example, the shaded bandwidth on link-2 is not allocated to either tenant. In fact, we show that DRF can result in arbitrarily low utilization (Lemma 6). This is wasteful, because unused bandwidth cannot be recovered.

We start by showing that strategy-proofness is a necessary condition for providing the optimal isolation guarantee – i.e., to maximize min_k M_k – in non-cooperative environments (§2). Next, we prove that work conservation – i.e., when tenants are allowed to use unallocated resources, such as the shaded area in Figure 1c, without constraints – spurs a race to the bottom. It incentivizes each tenant to continuously lie about her demand correlations, and in the process, it decreases the amount of useful work done by all tenants! Meaning, simply making DRF work-conserving can do more harm than good.

We propose a two-stage algorithm, High Utilization with Guarantees (HUG), to achieve our goals (§3). Figure 2 surveys the design space for cloud network sharing and places HUG in context by following the thick lines. At the highest level, unlike many alternatives [13, 14, 37, 44], HUG is a dynamic allocation algorithm. Next, HUG enforces its allocations at the tenant-/network-level, because flow- or (virtual) machine-level allocations [61, 62] do not provide isolation guarantee. Due to the hard tradeoff between optimal isolation guarantee and work conservation in non-cooperative environments, HUG ensures the highest utilization possible while maintaining the optimal isolation guarantee. It incentivizes tenants to expose their true demands, ensuring that they actually consume their allocations instead of causing collateral damage. In cooperative environments, where strategy-proofness might be a non-requirement, HUG simultaneously ensures both work conservation and the optimal isolation guarantee. In contrast, existing solutions [33, 45, 51, 58, 59] are suboptimal in both environments. Overall, HUG generalizes single- [25, 43, 55] and multi-resource max-min fairness [27, 33, 38, 56] and multi-tenant network sharing solutions [45, 51, 58, 59, 61, 62] under a unifying framework.

[Figure 2: Design space for cloud network sharing.
  Cloud Network Sharing
  ├─ Reservation (SecondNet, Oktopus, Pulsar, Silo): uses admission control
  └─ Dynamic Sharing
     ├─ Flow-Level (Per-Flow Fairness): no isolation guarantee
     ├─ VM-Level (Seawall, GateKeeper): no isolation guarantee
     └─ Tenant-/Network-Level
        ├─ Non-Cooperative Environments (require strategy-proofness)
        │  ├─ Low Utilization (DRF): optimal isolation guarantee
        │  └─ Highest Utilization for Optimal Isolation Guarantee (HUG)
        └─ Cooperative Environments (do not require strategy-proofness)
           ├─ Suboptimal Isolation Guarantee (PS-P, EyeQ, NetShare): work-conserving
           └─ Work-Conserving Optimal Isolation Guarantee (HUG)]

HUG is easy to implement and scales well. Even with 100,000 machines, new allocations can be centrally calculated and distributed throughout the network in less than a second – faster than that suggested in the literature [13]. Moreover, each machine can locally enforce HUG-calculated allocations using existing traffic control tools without any changes to the network (§4).

We demonstrate the effectiveness of our proposal using EC2 experiments and trace-driven simulations (§5). In non-cooperative environments, HUG provides the optimal isolation guarantee, which is 7.4× higher than existing network sharing solutions like PS-P [45, 58, 59] and 7000× higher than traditional per-flow fairness, and 1.4× better utilization than DRF for production traces. In cooperative environments, HUG outperforms PS-P and per-flow fairness by 1.48× and 17.35× in terms of the 95th percentile slowdown of job communication stages, and 70% of jobs experience lower slowdown w.r.t. DRF.

We discuss current limitations and future research in Section 6 and compare HUG to related work in Section 7.
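The second observation above – that more aggregate bandwidth does not imply more progress – can be checked numerically. A small illustrative Python sketch (our own helper names, not the paper's code), using tenant-A's correlation vector d_A = ⟨1/2, 1⟩ and measuring progress in absolute Mbps:

```python
def progress(alloc_mbps, demand):
    """Progress = min_i alloc_i / demand_i (here in Mbps), per the paper's metric."""
    return min(a / d for a, d in zip(alloc_mbps, demand) if d > 0)

d_A = [0.5, 1.0]  # tenant-A's correlation vector from Figure 1

for alloc in ([50, 100], [90, 90], [25, 200]):
    print(alloc, progress(alloc, d_A))
# <50, 100> yields progress 100, beating <90, 90> (progress 90) and
# <25, 200> (progress 50), despite having less total bandwidth than either.
```

Bandwidth beyond the correlated proportions is wasted on the non-bottleneck link, which is why the balanced ⟨50, 100⟩ allocation wins.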
  d_k                           Correlation vector of tenant-k's demand
  a_k                           Guaranteed allocation to tenant-k
  M_k                           Progress of tenant-k; M_k := min_{1≤i≤2P} a_k^i / d_k^i, where i stands for link-i
  c_k                           Actual resource consumption of tenant-k
  Isolation Guarantee           min_k M_k
  Optimal Isolation Guarantee   max {min_k M_k}
  (Network) Utilization         Σ_i Σ_k c_k^i

Table 1: Important notations and definitions.

[Figure 3: Two VMs from tenant-A (orange) and three from tenant-B (dark blue) and their communication patterns over a 3×3 datacenter fabric. The network fabric has three uplinks (L1–L3) and three downlinks (L4–L6) corresponding to the three physical machines.]

2 Motivation

In this section, we elaborate on the assumptions and notations used in this paper and summarize the three desirable requirements – optimal isolation guarantee, high utilization, and proportionality – for bandwidth allocation across multiple tenants. Later, we show the tradeoff between optimal isolation guarantee and high utilization, identifying work conservation as the root cause.

2.1 Background

We consider Infrastructure-as-a-Service (e.g., EC2 [2], Azure [8], and Google Compute [5]) and Container-as-a-Service (e.g., Mesos [40] and Kubernetes [6]) models where tenants pay per-hour flat rates for virtual machines (VMs) and containers.^2

We abstract out the datacenter network as a non-blocking switch (i.e., the fabric/hose model [11, 12, 14, 23, 28, 45, 49]) with P physical machines connected to it. Each machine has full-duplex links (i.e., 2P independent links) and can host one or more VMs from different tenants. Figure 3 shows an example. We assume that VM placement and routing are implemented independently. Not only does this model provide analytical simplicity, it is mostly a reality today: recent EC2 and Google datacenters have full bisection bandwidth networks [4, 7].

We denote the correlation vector of the k-th tenant (k ∈ {1, …, M}) as d_k = ⟨d_k^1, d_k^2, …, d_k^{2P}⟩, where d_k^i and d_k^{P+i} (1 ≤ i ≤ P) respectively denote the uplink and downlink demands normalized^3 by link capacities (C^i), and Σ_{i=1}^{P} d_k^i = Σ_{i=1}^{P} d_k^{P+i}.

For the example in Figure 3, consider tenant correlation vectors:

    d_A = ⟨1/2, 1, 0, 1, 1/2, 0⟩
    d_B = ⟨1, 1/6, 0, 0, 1, 1/6⟩

where d_k^i = 0 indicates the absence of a VM and d_k^i = 1 indicates the bottleneck link(s) of a tenant.

Correlation vectors depend on tenant applications, which can range from elastic-demand batch jobs [3, 24, 42, 68] to long-running services [19, 52], multi-tiered enterprise applications [39], and realtime streaming applications [9, 69] with inelastic demands. We focus on scenarios where a tenant's demand changes at the timescale of seconds or longer [13, 18, 58], and she can use provider-allocated resources in any way for her own workloads.

2.2 Inter-Tenant Network Sharing Requirements

Given the correlation vectors of M tenants, a cloud provider must use an allocation algorithm A to determine the allocations of each tenant:

    A({d_1, d_2, …, d_M}) = {a_1, a_2, …, a_M}

where a_k = ⟨a_k^1, a_k^2, …, a_k^{2P}⟩ and a_k^i is the fraction of link-i guaranteed to the k-th tenant.

As identified in previous work [15, 58], any allocation policy A must meet three requirements – (optimal) isolation guarantee, high utilization, and proportionality – to fairly share the cloud network:

1. Isolation Guarantee: VMs should receive minimum bandwidth guarantees proportional to their correlation vectors so that tenants can estimate worst-case performance. Formally, the progress of tenant-k (M_k) is defined as her minimum demand satisfaction ratio across the entire fabric:

    M_k = min_{1≤i≤2P} a_k^i / d_k^i

^2 We use the terms VM and container interchangeably in this paper because they are similar from the network's perspective.
^3 Normalization helps us consider heterogeneous capacities. By default, we normalize the correlation vector such that the largest component equals 1 unless otherwise specified.
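The normalization in footnote 3 is mechanical. In the illustrative Python sketch below, the raw per-link rates are hypothetical numbers chosen only to reproduce tenant-A's vector from Figure 3, and `correlation_vector` is our own name:

```python
from fractions import Fraction as F

def correlation_vector(raw_demands):
    """Normalize per-link demands so the largest component equals 1 (footnote 3).
    A 0 marks a link where the tenant has no VM; a 1 marks her bottleneck link(s)."""
    peak = max(raw_demands)
    return [d / peak for d in raw_demands]

# Hypothetical raw rates (e.g., in Gbps) on links L1..L6 for tenant-A.
raw_A = [F(1), F(2), F(0), F(2), F(1), F(0)]
print(correlation_vector(raw_A))  # components: 1/2, 1, 0, 1, 1/2, 0 (Figure 3)
```

Because only ratios matter, any uniform scaling of a tenant's raw rates yields the same correlation vector.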
[Figure 6: Payoff matrix for the example in Section 2.3. Each cell shows the progress of ⟨tenant-A, tenant-B⟩:

                         Tenant-B doesn't lie    Tenant-B lies
  Tenant-A doesn't lie   ⟨2/3, 2/3⟩              ⟨1/2, 3/4⟩
  Tenant-A lies          ⟨11/12, 1/2⟩            ⟨1/2, 1/2⟩ ]

[Figure 7: Bandwidth consumptions of tenant-A (orange) and tenant-B (dark blue) with correlation vectors d_A = ⟨1/2, 1⟩ and d_B = ⟨1, 1/6⟩, when neither runs elastic-demand applications. (a) The optimal isolation guarantee allocation is not work-conserving. With work conservation, utilization can (b) increase or (c) decrease, based on which tenant lies. (d) However, ultimately it lowers utilization. Shaded means unallocated.]

2.3.2 Optimal Isolation Guarantee but Low Utilization

In contrast, optimal isolation guarantee does not necessarily mean full utilization. In general, optimal isolation guarantees can be calculated using DRF [33], which generalizes max-min fairness to multiple resources. In the example of Figure 5a, each uplink and downlink of the fabric is an independent resource – 2P in total.

Given this premise, it seems promising and straightforward to keep the DRF component for optimal isolation guarantee and strategy-proofness and try to ensure full utilization by allocating all remaining resources. In the following two subsections, we show that work conservation may render the isolation guarantee no longer optimal and, even worse, may reduce useful network utilization.

2.3.3 Naive Work Conservation Reduces the Optimal Isolation Guarantee

We first illustrate that even the optimal isolation guarantee allocation degenerates into the classic prisoner's dilemma [30] in the presence of work conservation. In particular, we show that reporting a false correlation vector ⟨1, 1⟩ is the dominant strategy for each tenant, i.e., her best option no matter whether the other tenants tell the truth or not. As a consequence, optimal isolation guarantees decrease (Figure 6).

If tenant-A can use the spare bandwidth in link-2, she can increase her progress at the expense of tenant-B by changing her correlation vector to d'_A = ⟨1, 1⟩. With an unmodified d_B = ⟨1, 1/6⟩, the new allocation would be a_A = ⟨1/2, 1/2⟩ and a_B = ⟨1/2, 1/12⟩. However, work conservation would increase it to a_A = ⟨1/2, 11/12⟩ (Figure 5b). Overall, the progress of tenant-A would increase to 11/12, while decreasing to 1/2 for tenant-B. As a result, the isolation guarantee decreases from 2/3 to 1/2.

The same is true for tenant-B as well. Consider again that only tenant-B reports a falsified correlation vector d'_B = ⟨1, 1⟩ to receive a favorable allocation: a_A = ⟨1/4, 1/2⟩ and a_B = ⟨1/2, 1/2⟩. Work conservation would increase it to a_B = ⟨3/4, 1/2⟩ (Figure 5c). Overall, the progress of tenant-B would increase to 3/4, while decreasing to 1/2 for tenant-A, resulting in the same suboptimal isolation guarantee of 1/2.

Since both tenants gain by lying, they would both simultaneously lie: d'_A = d'_B = ⟨1, 1⟩, resulting in a lower isolation guarantee of 1/2 (Figure 5d). Both are worse off!

In this example, the inefficiency arises due to allocating all spare resources to the tenant who demands more. We show in Appendix B that intuitive policies for allocating all spare resources – e.g., allocating all to whoever demands the least, allocating equally to all tenants with non-zero demands, and allocating proportionally to tenants' demands – do not work either.

2.3.4 Naive Work Conservation Can Even Decrease Utilization

Now consider that neither tenant has elastic-demand applications; i.e., they can only consume bandwidth proportional to their correlation vectors. A similar prisoner's dilemma unfolds (Figure 6), but this time, network utilization decreases as well.

Given the optimal isolation guarantee allocation, a_A = c_A = ⟨1/3, 2/3⟩ and a_B = c_B = ⟨2/3, 1/9⟩, both tenants have the same optimal isolation guarantee, 2/3, and 2/9 of link-2 remains unused (Figure 7a). One would expect work conservation to utilize this spare capacity.

Same as before, if tenant-A changes her correlation vector to d'_A = ⟨1, 1⟩, she can receive an allocation a_A = ⟨1/2, 11/12⟩ and consume c_A = ⟨11/24, 11/12⟩. This increases her isolation guarantee to 11/12, and total network utilization increases (Figure 7b).

Similarly, tenant-B can receive an allocation a_B = ⟨3/4, 1/2⟩ and consume c_B = ⟨3/4, 1/8⟩ to increase her isola-
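The four outcomes in Figure 6 can be reproduced by evaluating each tenant's progress against her true correlation vector under the work-conserving allocations described above. The Python sketch below is illustrative; the allocation table is transcribed from Figures 5a–5d, and the helper names are ours:

```python
from fractions import Fraction as F

def progress(alloc, true_demand):
    # Progress is always measured against the *true* correlation vector.
    return min(a / d for a, d in zip(alloc, true_demand) if d > 0)

d_A, d_B = [F(1, 2), F(1)], [F(1), F(1, 6)]

# Work-conserving allocations (a_A, a_B) for each (A lies?, B lies?) case.
allocations = {
    (False, False): ([F(1, 3), F(2, 3)],   [F(2, 3), F(1, 9)]),
    (True,  False): ([F(1, 2), F(11, 12)], [F(1, 2), F(1, 12)]),
    (False, True):  ([F(1, 4), F(1, 2)],   [F(3, 4), F(1, 2)]),
    (True,  True):  ([F(1, 2), F(1, 2)],   [F(1, 2), F(1, 2)]),
}
payoff = {case: (progress(a, d_A), progress(b, d_B))
          for case, (a, b) in allocations.items()}
print(payoff[(False, False)])  # both truthful: progress (2/3, 2/3)
print(payoff[(True, True)])    # both lie: progress (1/2, 1/2), both worse off
```

Lying is the dominant strategy in each row and column, yet the (lie, lie) equilibrium leaves both tenants below the truthful outcome, which is exactly the prisoner's dilemma structure of Figure 6.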
these allocations, HUG ensures isolation. Moreover, because DRF is strategy-proof, tenants are guaranteed to use these allocations (i.e., c_k^i ≥ a_k^i).

While DRF maximizes the isolation guarantees (a.k.a. dominant shares), it results in low network utilization. In some cases, DRF may even have utilization arbitrarily close to zero, and HUG can increase that to 100% (Lemma 6).

To achieve this, the second stage of HUG maximizes utilization while still keeping the allocation strategyproof. In this stage, we calculate upper bounds to restrict how much of the spare capacity a tenant can use in each link, with the constraint that the largest share across all links cannot increase (Lemma 1). As a result, Algorithm 1 remains strategyproof across both stages. Because spare usage restrictions can be applied locally, HUG can be enforced in individual machines.

As illustrated in Figure 8, the bound is set at 2/3 for both tenants: tenant-B can use her elastic demand on link-2's spare resource, while tenant-A cannot, as she has already reached her bound on link-2.

3.3 HUG Properties

We list the main properties of HUG in the following.

1. In non-cooperative cloud environments, HUG is strategyproof (Theorem 3), maximizes isolation guarantees (Corollary 4), and ensures the highest utilization possible for an optimal isolation guarantee allocation (Theorem 5). In particular, Lemma 6 shows that in some cases DRF may have utilization arbitrarily close to 0, and HUG improves it to 100%. We defer the proofs of the properties in this section to Appendix A.

2. In cooperative environments like private datacenters, HUG maximizes isolation guarantees and is work-conserving. Work conservation is achievable because strategy-proofness is a non-requirement in this case.

3. Because HUG provides the optimal isolation guarantee, it provides min-cut proportionality (§3.3.1) in both non-cooperative and cooperative environments.

Regardless of resource types, the identified tradeoff exists in general multi-resource allocation problems, and HUG can directly be applied.

3.3.1 Min-Cut Proportionality

Prior work promoted the notion of proportionality [58], where tenants would expect to receive total allocations proportional to their number of VMs regardless of communication patterns. Meaning, two tenants, each with N VMs, should receive equal bandwidth even if tenant-X has an all-to-all communication pattern (i.e., d_X = ⟨1, 1, …, 1⟩) and tenant-Y has an N-to-1 pattern (i.e., exactly one 1 in d_Y and the rest are zeros). Figure 9 shows an example. Clearly, tenant-Y will be bottlenecked at her only receiver; trying to equalize them will only result in low utilization. As expected, FairCloud proved that such proportionality is not achievable, as it decreases both isolation guarantee and utilization [58]. None of the existing algorithms provide proportionality.

[Figure 9: Communication patterns of (a) tenant-X (N-to-N), with two minimum cuts of size P, where P is the number of fabric ports, and (b) tenant-Y (N-to-1), with one minimum cut of size 1. The size of the minimum cut of a communication pattern determines its effective bandwidth even if it were placed alone.]

Instead, we consider a relaxed notion of proportionality, called min-cut proportionality, that depends on communication patterns and ties proportionality to a tenant's progress. Specifically, each tenant receives minimum bandwidth proportional to the size of the minimum cut [31] of her communication pattern. Meaning, in the earlier example, tenant-X would receive P times more total bandwidth than tenant-Y, but they would have the optimal isolation guarantee (M_X = M_Y = 1/2).

Min-cut proportionality and optimal isolation guarantee can coexist, but they both have tradeoffs with work conservation.

4 Design Details

This section discusses how a cloud operator can implement, enforce, and expose HUG to the tenants (§4.1), how to exploit placement to further improve HUG's performance (§4.2), and how HUG can handle weighted, heterogeneous scenarios (§4.3).

4.1 Architectural Overview

HUG can easily be implemented atop the existing monitoring infrastructure of cloud operators (e.g., Amazon CloudWatch [1]). Tenants would periodically update their correlation vectors through a public API, and the operator would compute new allocations and update enforcing agents within milliseconds.

HUG API: The tenant-facing API simply transfers a tenant's correlation vector (d_k) to the operator. d_k = ⟨1, 1, …, 1⟩ is used as the default correlation vector. By design, HUG incentivizes tenants to report and maintain accurate correlation vectors. This is because the more accurate it is – instead of the default d_k = ⟨1, 1, …, 1⟩ –
the higher are her progress and performance. constraints like fault-tolerance and collocation [19], and
Many applications already know their long-term pro- we consider its detailed study an important future work.
files (e.g., multi-tier online services [19, 52]) and others It is worth noting that with any VM placement, HUG pro-
can calculate on the fly (e.g., bulk communication data- vides the highest attainable utilization without sacrificing
intensive applications [22, 23]). Moreover, existing tech- optimal isolation guarantee and strategy-proofness.
niques in traffic engineering can provide good accuracy
in estimating and predicting demand matrices for coarse
4.3 Additional Constraints
time granularities [17, 18, 20, 47, 48]. Weighted Tenants Giving preferential treatment to
−
→ −
→
Centralized Computation For any update, the oper- tenants is simple. Just using wk dk instead of dk in Equa-
ator must run Algorithm 1. Although Stage-1 requires tion (2) would account for tenant weights (wk for tenant-
solving a linear program to determine the optimal isola- k) in calculating M∗ .
tion guarantee (i.e., the DRF allocation) [33], it can also Heterogeneous Capacities Because allocations are
be rewritten as a closed-form equation [56] when tenants calculated independently in each machine based on M∗
can scale up and down following their normalized corre- and local capacities (Ci ), HUG supports heterogeneous
lation vectors. The progress of all tenants after Stage-1 link capacities.
of Algorithm 1 – the optimal isolation guarantee – is:
Bounded Demands So far we have considered only
1 elastic tenants. If tenant-k has bounded demands, i.e.,
M∗ = M (2) dik < 1 for all i ∈ [1, 2P ], calculating a common M∗
di and corresponding − → in one round using Equation (2)
P
max a
1≤i≤2P k=1 k k
will be inefficient. This is because tenant-k might require
Equation (2) is computationally inexpensive. For our less than the calculated allocation, and being bounded,
100-machine cluster, calculating M∗ takes about 5 mi- she cannot elastically scale up to use it. Instead, we must
croseconds. Communicating the decision to all 100 ma- use the multi-round DRF algorithm [56, Algorithm 1] in
chines takes just 8 milliseconds and to 100, 000 (emu- Stage-1 of HUG; Stage-2 will remain the same. Note that
lated) machines takes less than 1 second (§5.1.2). this is similar to max-min fairness in a single link when
The guaranteed minimum allocations of tenant-k can a flow has a smaller demand than its n1 -th share.
then be calculated as aik = M∗ dik for all 1 ≤ i ≤ 2P .
Local Enforcement Enforcement in Stage-2 of Algo-
5 Evaluation
rithm 1 is simple as well. After reserving the minimum We evaluated HUG using trace-driven simulations and
uplink and downlink allocations for each tenant, each EC2 deployments. Our results show the following:
machine needs to ensure that no tenant can consume • HUG isolates multiple tenants across the entire net-
more than M∗ fraction of the machine’s up or down link work, and it can scale up to 100, 000 machines with
capacities (Ci ) to the network; i.e., aik ≤ cik ≤ M∗ . less than one second overhead (§5.1).
The spare is allocated among tenants using local max-
• HUG ensures the optimal isolation guarantee – al-
min fairness subject to tenant-specific upper-bounds.
most 7000× more than per-flow fairness and about
Because we only care about inter-tenant behavior – not
7.4× more than PS-P in production traces – while
how a tenant performs internal sharing – stock Linux
providing 1.4× higher utilization than DRF (§5.2).
tc is sufficient (§5). A tenant has the flexibility to
choose from traditional per-flow fairness, shortest-first • HUG outperforms per-flow fairness (PS-P) by
flow scheduling [12, 41], or explicit rate-based flow con- 17.35× (1.48×) in terms of the 95th percentile slow-
trol [29]. down and by 1.49× (1.14×) in minimizing the aver-
age shuffle completion time (§5.3).
4.2 VM Placement and • HUG outperforms Varys [23] in terms of the maxi-
Re-Placement/Migration mum shuffle completion time by 1.77×, even though
While M∗ is optimal for a given placement, it can be im- Varys is 1.45× better in minimizing the average shuf-
proved by changing the placement of tenant VMs based fle completion time and 1.33× better in terms of the
on their correlation vectors. One must perform load bal- 95th percentile slowdown (§5.3).
ancing across all machines to minimize the denominator We present our results in three parts. First, we mi-
of Equation (2). Cloud operators can employ optimiza- crobenchmark HUG on 100-machine EC2 clusters to
tion frameworks like [19] to perform initial VM place- evaluate HUG’s guarantees and overheads (§5.1). Sec-
ment and periodic migrations with an additional load ond, we leverage traces collected from a 3200-machine
balancing constraint. However, VM placement is a noto- Facebook cluster by Popa et al. [58] to compare HUG’s
riously difficult problem because of often-incompatible instantaneous allocation characteristics with that of per-
8
100! 100! Tenant A!
Total Alloc (Gbps)!
Figure 10: [EC2] Bandwidth consumptions of three tenants arriving over time in a 100-machine EC2 cluster. Each tenant has 100
VMs, but each uses a different communication pattern (§5.1.1). We observe that (a) using TCP, tenant-B dominates the network by
creating more flows; (b) HUG isolates tenants A and C from tenant B.
flow fairness, PS-P [58], and DRF [33] (§5.2). Finally, we evaluate HUG's long-term impact on application performance using a 3000-machine Facebook cluster trace used by Chowdhury et al. [23] and compare against per-flow fairness, PS-P, and DRF, as well as Varys, which focuses only on improving performance (§5.3).

5.1 Testbed Experiments

Methodology We performed our experiments on 100 m2.4xlarge Amazon EC2 [2] instances running Linux kernel 3.4.37 and used the default htb and tc implementations. While there exist proposals for more accurate qdisc implementations [45, 57], the default htb worked sufficiently well for our purposes. Each of the machines had 1 Gbps NICs, and we could use close to the full 100 Gbps of aggregate bandwidth simultaneously.

5.1.1 Network-Wide Isolation

We consider a cluster with 100 EC2 machines, divided between three tenants A, B, and C that arrive over time. Each tenant has 100 VMs; i.e., VMs Ai, Bi, and Ci are collocated on the i-th physical machine. However, the tenants have different communication patterns: tenants A and C have pairwise one-to-one patterns (100 VM-VM flows each), whereas tenant-B follows an all-to-all pattern using 10,000 flows. Specifically, Ai communicates with A(i+50)%100, Cj communicates with C(j+25)%100, and any Bk communicates with all Bl, where i, j, k, l ∈ {1, ..., 100}. Each tenant demands the entire capacity at each machine; hence, the entire capacity of the cluster should be equally divided among the active tenants to maximize isolation guarantees.

Figure 10a shows that, in the absence of isolation guarantees, tenant-B takes up the entire capacity as soon as she arrives. Tenant-C receives only a marginal share because she arrives after tenant-B and leaves before her. Note that tenant-A (when alone) uses only about 80% of the available capacity; this is simply because a single TCP flow per VM-VM pair often cannot saturate the link.

Figure 10b presents the allocation using HUG. As tenants arrive and depart, allocations are dynamically calculated, propagated, and enforced in each machine of the cluster. As before, tenants A and C use marginally less than their allocations because they create only one flow between each VM-VM pair.

5.1.2 Scalability

The key challenge in scaling HUG is its centralized resource allocator, which must recalculate tenant shares and redistribute them across the entire cluster whenever any tenant changes her correlation vector.

We found that calculating new allocations using HUG takes less than 5 microseconds in our 100-machine cluster. Furthermore, a recomputation due to a tenant's arrival, departure, or change of correlation vector would take about 8.6 milliseconds on average for a 100,000-machine datacenter.

Communicating a new allocation takes less than 10 milliseconds for 100 machines and around 1 second for 100,000 emulated machines (i.e., sending the same message 1000 times to each of the 100 machines).

5.2 Instantaneous Fairness

While Section 5.1 evaluated HUG in controlled, synthetic scenarios, this section focuses on HUG's instantaneous allocation characteristics in the context of a large-scale cluster.

Methodology We use a one-hour snapshot with 100 concurrent jobs from a production MapReduce trace, which was extracted from a 3200-machine Facebook cluster by Popa et al. [58, Section 5.3]. Machines are connected to the network using 1 Gbps NICs. In the trace, a job with M mappers and R reducers – hence, the corresponding M × R shuffle – is described as a matrix with the amount of data to transfer between each M-R pair. We calculated the correlation vectors of individual shuffles from their communication matrices using the optimal rate allocation algorithm for a single shuffle [22, 23], which ensures that all the flows of each shuffle finish simultaneously.

Given the workload, we calculate the progress of each job/shuffle under different allocation mechanisms and
(a) Distribution of progress (Gbps) (b) Total allocation (c) Distribution of aggregate bandwidth (Gbps), comparing Per-Flow Fairness, PS-P, HUG, and DRF
Figure 11: [Simulation] HUG ensures higher isolation guarantee than high-utilization schemes like per-flow fairness and PS-P, and
provides higher utilization than multi-resource fairness schemes like DRF.
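The progress metric underlying Figure 11 is the minimum allocation-to-demand ratio across a tenant's links. A minimal sketch of that definition follows; the helper is a toy restatement of the paper's metric, not code from HUG itself.

```python
# Progress of a tenant/shuffle: the bandwidth it effectively receives at its
# bottleneck, i.e., the minimum allocation-to-demand ratio across its links.
# Toy restatement of the paper's definition; not the HUG implementation.
def progress(demand, alloc):
    """demand: correlation vector (per-link demands, largest entry = 1.0);
    alloc: per-link allocated link fractions."""
    return min(a / d for d, a in zip(demand, alloc) if d > 0)

# Tenant-A of Figure 1: demand <1/2, 1>, allocated <1/3, 2/3> of two links.
print(progress([0.5, 1.0], [1/3, 2/3]))  # 0.666... (two-thirds)
```

With 1 Gbps links, a progress of 2/3 corresponds to 2/3 Gbps at the bottleneck, which is why progress in Figure 11a is capped at 1 Gbps.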
cross-examine characteristics like isolation guarantee, utilization, and proportionality.

5.2.1 Impact on Progress

Figure 11a presents the distribution of progress of each shuffle. Recall that the progress of a shuffle – we consider each shuffle an individual tenant in this section – is the amount of bandwidth it receives on its bottleneck uplink or downlink (i.e., progress can be at most 1 Gbps). Both HUG and DRF (overlapping vertical lines in Figure 11a) ensure the same progress (0.74 Gbps) for all shuffles. Note that despite the same progress, shuffles will finish at different times based on how much data each one has to send (§5.3). Per-flow fairness and PS-P provide very wide ranges: 112 Kbps to 1 Gbps for the former and 0.1 Gbps to 1 Gbps for the latter. Shuffles with many flows crowd out the ones with fewer flows under per-flow fairness, and PS-P suffers by ignoring correlation vectors and through indiscriminate work conservation.

5.2.2 Impact on Utilization

By favoring heavy tenants, per-flow fairness and PS-P do succeed in their goal of increasing network utilization (Figure 11b). Given the communication patterns of the workload, the former utilizes 69% of the 3.2 Tbps total capacity across all machines, and the latter utilizes 68.6%. In contrast, DRF utilizes only 45%. HUG provides a common ground by extending utilization to 62.4% without breaking strategy-proofness and while providing the optimal isolation guarantee.

Figure 11c breaks down the total allocation of each shuffle and demonstrates two high-level points:

1. HUG ensures overall higher utilization (1.4× on average) than DRF by ensuring equal progress for smaller shuffles and by using up additional bandwidth for larger shuffles. It does so while ensuring the same optimal isolation guarantee as DRF.

2. Per-flow fairness crosses HUG at the 90-th percentile; i.e., the top 10% of shuffles receive more bandwidth than they do under HUG, while the other 90% receive less than they do using HUG. PS-P crosses over at the 76-th percentile.

Bin           | 1 (SN) | 2 (LN) | 3 (SW) | 4 (LW)
% of Shuffles |  52%   |  16%   |  15%   |  17%

Table 2: Shuffles binned by their lengths (Short and Long) and widths (Narrow and Wide).

5.2.3 Impact on Proportionality

A collateral benefit of HUG is that tenants receive allocations proportional to their bottleneck demands. Consequently, despite the same progress across all shuffles (Figure 11a), their total allocations vary (Figure 11c) based on the size of the minimum cuts in their communication patterns.

5.3 Long-Term Characteristics

We have shown in Section 5.2 that HUG provides the optimal isolation guarantee in the instantaneous case. However, similar to all instantaneous solutions [33, 43, 58], HUG does not provide any long-term isolation or fairness guarantees. Consequently, in this section, we evaluate HUG's long-term impact on performance using a production trace through simulations.

Methodology For these simulations, we use a MapReduce/Hive trace from a 3000-machine production Facebook cluster. The trace includes the arrival times, communication matrices, and task placements of over 10,000 shuffles during one day. Shuffles in this trace have diverse length (i.e., size of the longest flow) and width (i.e., number of flows) characteristics and roughly follow the same distribution as the original trace (Table 2). We consider a shuffle to be short if its longest flow is less than 5 MB and narrow if it has at most 50 flows; we use the same categorization. We calculated the correlation vector of each shuffle as we did before (§5.2).
(a) CDFs: fraction of coflows vs. slowdown w.r.t. minimum required completion time (X-axis: 1–1000, log scale)

(b) Summaries:

Scheme            | Min | 95th | AVG   | STDEV
Per-Flow Fairness | 1   | 69.4 | 15.52 | 65.54
PS-P              | 1   |  5.9 |  2.22 |  2.97
HUG               | 1   |  4.0 |  1.86 |  2.25
DRF               | 1   |  4.4 |  2.11 |  2.66
Varys             | 1   |  3.0 |  1.43 |  0.99

Figure 12: [Simulation] Slowdowns of shuffles using different mechanisms w.r.t. the minimum completion time of each shuffle. The X-axis of (a) is in log scale.

5.3.1 Improvements Over Per-Flow Fairness

HUG improves over per-flow fairness both in terms of slowdown and performance. The 95th percentile slowdown using HUG is 17.35× better than that of per-flow fairness (Figure 12b). Overall, HUG provides better slowdown across the board (Figure 12a) – 61% of shuffles are better off using HUG, and the rest remain the same.

HUG improves the average completion time of shuffles by 1.49× (Figure 13). The biggest wins come from bin-1 (4.8×) and bin-2 (6.15×), which include the so-called narrow shuffles with less than 50 flows. This reinforces the fact that HUG isolates tenants with fewer flows from those with many flows. Overall, HUG performs well across all bins.

5.3.2 Improvements Over PS-P

HUG improves over PS-P in terms of the 95th percentile slowdown by 1.48×, and 45% of shuffles are better off using HUG. HUG also provides better average shuffle completion times than PS-P, for an overall improvement of 1.14×. Large improvements again come in bin-1 (1.19×) and bin-2 (1.27×) because PS-P also favors tenants with more flows.
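The comparisons in §5.3 reduce to simple summary statistics over per-shuffle slowdowns and completion times. The following is a hedged sketch of those computations; the helper names and the sample durations are ours, for illustration only.

```python
import statistics

# Slowdown of a shuffle: completion time under a scheme divided by its
# minimum (run-alone) completion time; the minimum possible value is 1.
def slowdown(duration, min_duration):
    return duration / min_duration

# Normalized completion time (cf. Figure 13): a scheme's average completion
# time divided by HUG's average; values > 1 mean HUG is faster.
def normalized_comp_time(scheme_durations, hug_durations):
    return statistics.mean(scheme_durations) / statistics.mean(hug_durations)

# Made-up per-shuffle durations in seconds, for illustration only.
hug = [10.0, 20.0, 30.0]
other = [12.0, 26.0, 40.0]
print(slowdown(12.0, 10.0))              # 1.2
print(normalized_comp_time(other, hug))  # 1.3
```

The 95th percentile of the per-shuffle slowdowns, computed the same way over the whole trace, yields the "95th" column of Figure 12b.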
The higher utilization of per-flow fairness and PS-P (§5.2) does not help in the long run due to their lower isolation guarantee.

Figure 13: [Simulation] Average shuffle completion times of Per-Flow Fairness, DRF, and Varys, normalized by HUG's average completion time, across shuffle types (Bin 1–Bin 4 and ALL). 95-th percentile plots are similar.

5.3.3 Improvements Over DRF

While HUG and DRF have the same worst-case slowdown, 70% of shuffles are better off using HUG. HUG also provides better average shuffle completion times than DRF, for an overall improvement of 1.14×.

5.3.4 Comparison to Varys

Varys outperforms HUG by 1.33× in terms of the 95th percentile slowdown and by 1.45× in terms of average shuffle completion time. However, because Varys attempts to improve the average completion time by prioritization, it takes a risk in terms of the maximum completion time. More precisely, HUG outperforms Varys by 1.77× in terms of the maximum shuffle completion time (not shown).

Metrics We consider two metrics: the 95th percentile slowdown and the average shuffle completion time, which respectively measure long-term progress and performance characteristics.

We define the slowdown of a shuffle as its completion time under a scheme normalized by its minimum completion time if it were running alone; i.e.,

Slowdown = Compared Duration / Minimum Duration

The minimum value of slowdown is one.

We measure performance as the shuffle completion time under a scheme normalized by that using HUG; i.e.,

Normalized Comp. Time = Compared Duration / HUG's Duration

If the normalized completion time of a scheme is greater (smaller) than one, HUG is faster (slower).

6 Discussion

Payment Model Similar to many existing proposals [32, 33, 45, 46, 58, 59, 61, 62], we assume that tenants pay per-hour flat rates for individual VMs, but there is no pricing model associated with their network usage. This is also the prevalent model of resource pricing in cloud computing [2, 5, 8]. Exploring whether and how a network pricing model would change our solution, and what that model would look like, requires further attention.

Determining Correlation Vectors Unlike long-term correlation vectors, e.g., over the course of an hour or for an entire shuffle, accurately capturing short-term changes can be difficult. How fast tenants should update their vectors, and whether that is faster than centralized HUG can react to, requires additional analysis.

Decentralized HUG HUG's centralized design makes it easier to analyze its properties and simplifies its implementation. We believe that designing a decentralized version of HUG is an important piece of future work, which will be especially relevant for sharing wide-area networks in the context of geo-distributed analytics [60, 66].

7 Related Work

Single-Resource Fairness Max-min fairness was first proposed by Jaffe [43] to ensure at least 1/n-th of a link's capacity to each flow. Thereafter, many mechanisms have been proposed to achieve it, including weighted fair queueing (WFQ) [25, 55] and those similar to or extending WFQ [16, 34, 35, 63, 64]. We generalize max-min fairness to the parallel communication observed in scale-out applications, showing that unlike in the single-link scenario, optimal isolation guarantee, strategy-proofness, and work conservation cannot coexist.

Multi-Resource Fairness Dominant Resource Fairness (DRF) [33] maximizes the dominant share of each user in a strategyproof manner. Solutions that have attempted to improve the system-level efficiency of multi-resource allocation – both before [54, 65] and after [27, 38, 56] DRF – sacrifice strategy-proofness. We have proven that work-conserving allocation without strategy-proofness can hurt utilization instead of improving it.

Dominant Resource Fair Queueing (DRFQ) [32] approximates DRF over time in individual middleboxes. In contrast, HUG generalizes DRF to environments with elastic demands to increase utilization across the entire network, and focuses only on instantaneous fairness.

Joe-Wong et al. [46] have presented a unifying framework to capture fairness-efficiency tradeoffs in multi-resource environments. They assume a cooperative environment, where tenants never lie; HUG falls under their FDS family of mechanisms. In non-cooperative environments, however, we have shown that the interplay between work conservation and strategy-proofness is critical, and our work complements the framework of [46].

Network-Wide / Tenant-Level Fairness Proposals for sharing cloud networks range from static allocation [13, 14, 44] and VM-level guarantees [61, 62] to variations of network-wide sharing mechanisms [45, 51, 58, 59, 67]. We refer the reader to the survey by Mogul and Popa [53] for an overview. FairCloud [58] stands out by systematically discussing the tradeoffs and addressing several limitations of other approaches. Our work generalizes FairCloud [58] and many proposals similar to FairCloud's PS-P policy [45, 59, 61]. When all tenants have elastic demands, i.e., all correlation vectors have all elements equal to 1, we give the same allocation; in all other cases, we provide higher isolation guarantee and utilization.

Efficient Schedulers Researchers have also focused on efficient scheduling and/or packing of datacenter resources to minimize job and communication completion times [12, 21–23, 26, 36, 41]. Our work is orthogonal and complementary to these works, which focus on application-level efficiency within each tenant. We guarantee isolation across tenants, so that each tenant can internally perform whatever efficiency or fairness optimizations she wants among her own applications.

8 Conclusion

In this paper, we have proved that there is a strong tradeoff between optimal isolation guarantees and high utilization in non-cooperative public clouds. We have also proved that work conservation can decrease utilization instead of improving it, because no network sharing algorithm remains strategyproof in its presence.

To this end, we have proposed HUG, which restricts the bandwidth utilization of each tenant to ensure the highest utilization with the optimal isolation guarantee across multiple tenants in non-cooperative environments. In cooperative environments, where strategy-proofness might be a non-requirement, HUG simultaneously ensures both work conservation and the optimal isolation guarantee.

HUG generalizes single-resource max-min fairness to multi-resource environments where a tenant's demands on different resources are correlated and elastic. In particular, it provides the optimal isolation guarantee, which is significantly higher than that provided by existing multi-tenant network sharing algorithms. HUG also complements DRF with provably highest utilization without sacrificing DRF's other useful properties. Regardless of resource types, the identified tradeoff exists in general multi-resource allocation problems, and all those scenarios can take advantage of HUG.

Acknowledgments

We thank our shepherd Mohammad Alizadeh and the anonymous reviewers of SIGCOMM'15 and NSDI'16 for useful feedback. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, NSF Award CNS-1464388, DOE Award SN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc., Blue Goji, Bosch, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Fujitsu, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata, and VMware.
References

[1] Amazon CloudWatch. http://aws.amazon.com/cloudwatch.

[2] Amazon EC2. http://aws.amazon.com/ec2.

[3] Apache Hadoop. http://hadoop.apache.org.

[4] AWS Innovation at Scale. http://goo.gl/Py2Ueo.

[5] Google Compute Engine. https://cloud.google.com/compute.

[6] Google Container Engine. http://kubernetes.io.

[7] A look inside Google's data center networks. http://goo.gl/u0vZCY.

[8] Microsoft Azure. http://azure.microsoft.com.

[9] Storm: Distributed and fault-tolerant realtime computation. http://storm-project.net.

[10] Trident: Stateful stream processing on Storm. http://goo.gl/cKsvbj.

[11] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, F. Matus, R. Pan, N. Yadav, and G. Varghese. CONGA: Distributed congestion-aware load balancing for datacenters. In SIGCOMM, 2014.

[12] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker. pFabric: Minimal near-optimal datacenter transport. In SIGCOMM, 2013.

[13] S. Angel, H. Ballani, T. Karagiannis, G. O'Shea, and E. Thereska. End-to-end performance isolation through virtual datacenters. In OSDI, 2014.

[14] H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron. Towards predictable datacenter networks. In SIGCOMM, 2011.

[15] H. Ballani, K. Jang, T. Karagiannis, C. Kim, D. Gunawardena, and G. O'Shea. Chatty tenants and the cloud network sharing problem. In NSDI, 2013.

[16] J. Bennett and H. Zhang. WF²Q: Worst-case fair weighted fair queueing. In INFOCOM, 1996.

[17] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In IMC, 2010.

[18] T. Benson, A. Anand, A. Akella, and M. Zhang. MicroTE: Fine grained traffic engineering for data centers. In CoNEXT, 2011.

[19] P. Bodik, I. Menache, M. Chowdhury, P. Mani, D. Maltz, and I. Stoica. Surviving failures in bandwidth-constrained datacenters. In SIGCOMM, 2012.

[20] M. Chowdhury, S. Kandula, and I. Stoica. Leveraging endpoint flexibility in data-intensive clusters. In SIGCOMM, 2013.

[21] M. Chowdhury and I. Stoica. Efficient coflow scheduling without prior knowledge. In SIGCOMM, 2015.

[22] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica. Managing data transfers in computer clusters with Orchestra. In SIGCOMM, 2011.

[23] M. Chowdhury, Y. Zhong, and I. Stoica. Efficient coflow scheduling with Varys. In SIGCOMM, 2014.

[24] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.

[25] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm. In SIGCOMM, 1989.

[26] F. Dogar, T. Karagiannis, H. Ballani, and A. Rowstron. Decentralized task-aware scheduling for data center networks. In SIGCOMM, 2014.

[27] D. Dolev, D. G. Feitelson, J. Y. Halpern, R. Kupferman, and N. Linial. No justified complaints: On fair sharing of multiple resources. In ITCS, 2012.

[28] N. G. Duffield, P. Goyal, A. Greenberg, P. Mishra, K. K. Ramakrishnan, and J. E. van der Merive. A flexible model for resource management in virtual private networks. In SIGCOMM, 1999.

[29] N. Dukkipati. Rate Control Protocol (RCP): Congestion control to make flows complete quickly. PhD thesis, Stanford University, 2007.

[30] M. M. Flood. Some experimental games. Management Science, 5(1):5–26, 1958.

[31] L. R. Ford and D. R. Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics, 8(3):399–404, 1956.

[32] A. Ghodsi, V. Sekar, M. Zaharia, and I. Stoica. Multi-resource fair queueing for packet processing. In SIGCOMM, 2012.
[33] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In NSDI, 2011.

[34] S. J. Golestani. Network delay analysis of a class of fair queueing algorithms. IEEE JSAC, 13(6):1057–1070, 1995.

[35] P. Goyal, H. M. Vin, and H. Chen. Start-time fair queueing: A scheduling algorithm for integrated services packet switching networks. In SIGCOMM, 1996.

[36] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. In SIGCOMM, 2014.

[37] C. Guo, G. Lu, H. J. Wang, S. Yang, C. Kong, P. Sun, W. Wu, and Y. Zhang. SecondNet: A data center network virtualization architecture with bandwidth guarantees. In CoNEXT, 2010.

[38] A. Gutman and N. Nisan. Fair allocation without trade. In AAMAS, 2012.

[39] M. Hajjat, X. Sun, Y.-W. E. Sung, D. Maltz, S. Rao, K. Sripanidkulchai, and M. Tawarmalani. Cloudward bound: Planning for beneficial migration of enterprise applications to the cloud. In SIGCOMM, 2010.

[40] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In NSDI, 2011.

[41] C.-Y. Hong, M. Caesar, and P. B. Godfrey. Finishing flows quickly with preemptive scheduling. In SIGCOMM, 2012.

[42] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.

[43] J. M. Jaffe. Bottleneck flow control. IEEE Transactions on Communications, 29(7):954–962, 1981.

[44] K. Jang, J. Sherry, H. Ballani, and T. Moncaster. Silo: Predictable message completion time in the cloud. In SIGCOMM, 2015.

[46] C. Joe-Wong, S. Sen, T. Lan, and M. Chiang. Multiresource allocation: Fairness-efficiency tradeoffs in a unifying framework. In INFOCOM, 2012.

[47] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the tightrope: Responsive yet stable traffic engineering. In SIGCOMM, 2005.

[48] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of datacenter traffic: Measurements and analysis. In IMC, 2009.

[49] N. Kang, Z. Liu, J. Rexford, and D. Walker. Optimizing the "One Big Switch" abstraction in Software-Defined Networks. In CoNEXT, 2013.

[50] K. LaCurts, J. Mogul, H. Balakrishnan, and Y. Turner. Cicada: Introducing predictive guarantees for cloud networks. In HotCloud, 2014.

[51] V. T. Lam, S. Radhakrishnan, A. Vahdat, G. Varghese, and R. Pan. NetShare and stochastic NetShare: Predictable bandwidth allocation for data centers. SIGCOMM CCR, 42(3):5–11, 2012.

[52] J. Lee, Y. Turner, M. Lee, L. Popa, S. Banerjee, J.-M. Kang, and P. Sharma. Application-driven bandwidth guarantees in datacenters. In SIGCOMM, 2014.

[53] J. C. Mogul and L. Popa. What we talk about when we talk about cloud network performance. SIGCOMM CCR, 42(5):44–48, 2012.

[54] J. F. Nash Jr. The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155–162, 1950.

[55] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: The single-node case. IEEE/ACM ToN, 1(3):344–357, 1993.

[56] D. C. Parkes, A. D. Procaccia, and N. Shah. Beyond dominant resource fairness: Extensions, limitations, and indivisibilities. In EC, 2012.

[57] J. Perry, A. Ousterhout, H. Balakrishnan, D. Shah, and H. Fugal. Fastpass: A centralized "zero-queue" datacenter network. In SIGCOMM, 2014.
[58] L. Popa, G. Kumar, M. Chowdhury, A. Krishnamurthy, S. Ratnasamy, and I. Stoica. FairCloud: Sharing the network in cloud computing. In SIGCOMM, 2012.

[59] L. Popa, P. Yalagandula, S. Banerjee, J. C. Mogul, Y. Turner, and J. R. Santos. ElasticSwitch: Practical work-conserving bandwidth guarantees for cloud computing. In SIGCOMM, 2013.

[60] Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, V. Bahl, and I. Stoica. Low latency geo-distributed data analytics. In SIGCOMM, 2015.

[61] H. Rodrigues, J. R. Santos, Y. Turner, P. Soares, and D. Guedes. Gatekeeper: Supporting bandwidth guarantees for multi-tenant datacenter networks. In USENIX WIOV, 2011.

[62] A. Shieh, S. Kandula, A. Greenberg, and C. Kim. Sharing the data center network. In NSDI, 2011.

[63] M. Shreedhar and G. Varghese. Efficient fair queueing using deficit round robin. In SIGCOMM, 1995.

[64] I. Stoica, H. Zhang, and T. Ng. A hierarchical fair service curve algorithm for link-sharing, real-time, and priority services. In SIGCOMM, 1997.

[65] H. R. Varian. Equity, envy, and efficiency. Journal of Economic Theory, 9(1):63–91, 1974.

[66] A. Vulimiri, C. Curino, B. Godfrey, J. Padhye, and G. Varghese. Global analytics in the face of bandwidth and regulatory constraints. In NSDI, 2015.

[67] D. Xie, N. Ding, Y. C. Hu, and R. Kompella. The only constant is change: Incorporating time-varying network reservations in data centers. In SIGCOMM, 2012.

[68] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.

[69] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant stream computation at scale. In SOSP, 2013.

…in link-1, (1/2+ε)/(3/2+ε), is already larger than before (1/3). If the work conservation policy allocates the spare resource in link-2 by δ (δ may be small but is a positive value), her progress will change to M'_A = min(a'^1_A/d^1_A, a'^2_A/d^2_A) = min((1+2ε)/(3/2+ε), 1/(3/2+ε) + δ). As long as ε < (3δ/2)/(2/3−δ) (if δ ≥ 2/3, we have no constraint on ε), her progress will be better than when she was telling the truth, which makes the policy not strategyproof. The operator cannot prevent this because she knows neither a tenant's true correlation vector nor ε, the extent of the tenant's lie.

Theorem 3 Algorithm 1 is strategyproof.

Proof Sketch (of Theorem 3) Because DRF is strategyproof, the first stage of Algorithm 1 is strategyproof as well. We show that adding the second stage does not violate the strategy-proofness of the combination.

Assume that link-b is a system bottleneck – the link DRF saturated to maximize the isolation guarantee in the first stage; i.e., b = arg max_i Σ_{k=1..M} d^i_k. We use D_b = Σ_{k=1..M} d^b_k to denote the total demand in link-b (D_b ≥ 1), and M_k = 1/D_b for the corresponding progress of each tenant-k (k ∈ {1, ..., M}) when link-b is the system bottleneck. In Figure 5, b = 1. The following arguments hold even for multiple bottlenecks.

Any tenant-k can attempt to increase her progress (M_k) only by lying about her correlation vector (d_k). Formally, her action space consists of all possible correlation vectors: she can increase and/or decrease the demands of individual resources to report a different vector, d'_k, and obtain a new progress, M'_k (> M_k). Tenant-k can attempt one of two alternatives when reporting d'_k: either keep link-b the system bottleneck or change it. We show that Algorithm 1 is strategyproof in both cases; i.e., M'_k ≤ M_k.

Case 1: link-b is still the system bottleneck. Her progress cannot improve because

• if d'^b_k ≤ d^b_k, her share of the system bottleneck will decrease in the first stage, and so will her progress. There is no spare resource to allocate in link-b. For example, if tenant-A reports d'^1_A = 1/4 instead of d^1_A = 1/2 in Figure 5, her allocation will decrease to 1/5-
[Figures 14 and 15: example allocations under DRF/MinAll and OPT, with per-link shares shown as fractions.]
timal isolation guarantee allocation shown in Figure 14a (M_A = M_B = M_C = 2/5; therefore, the isolation guarantee is 2/5). PS-P [45, 58, 59] falls in this category.

Worse, when tenants do not have elastic-demand applications, demand-agnostic policies are not even work-conserving (similar to the example in §2.3.4).

Lemma 7 When tenants do not have elastic demands, per-resource equal sharing is not work-conserving.

Proof Sketch Only 11/12-th of link-1 and 5/9-th of link-2 will be consumed; i.e., none of the links will be saturated!

To make it work-conserving, PS-P suggests dividing spare resources based on whoever wants them.

Lemma 8 When tenants do not have elastic demands, PS-P is not work-conserving.

Proof Sketch If tenant-B gives up her spare allocation in link-2, tenant-A can increase her progress to M_A = 2/3 and saturate link-1; however, tenant-B and tenant-C will remain at M_B = M_C = 1/3. If tenant-A gives up her spare allocation in link-1, tenant-B and tenant-C can increase their progress to M_B = M_C = 3/8 and saturate link-1, but tenant-A will remain at M_A = 1/2. Because both tenant-A and tenant-B have chances of increasing their progress, both will hold on to their allocations even with useless traffic – another instance of the Prisoner's Dilemma.

B.2 Unfair Policies

Instead of demand-agnostic policies, one can also consider simpler, unfair policies; e.g., allocating all the resources …

Corollary 11 (of Lemma 10) In the presence of work conservation, tenants can lie both by increasing and by decreasing their demands, or by a combination of both.

Lemma 12 Allocating the spare resource to the tenant with the highest demand is not strategyproof.

Proof Sketch If tenant-A changes her correlation vector to d'_A = <1, 1>, the eventual allocation (Figure 15b) will again result in lower progress (M_B = M_C = 1/3). Because tenant-B is still receiving more than 1/6-th of her link-1 allocation in link-2, she does not need to lie.

Corollary 13 (of Lemmas 10, 12) Allocating the spare resource randomly to tenants is not strategyproof.

B.3 Locally Fair Policies

Finally, one can also consider equally or proportionally dividing the spare resource on link-2 between tenant-A and tenant-B. Unfortunately, these strategies are not strategyproof either.

Lemma 14 Allocating the spare resource equally to tenants is not strategyproof.

Proof Sketch If the remaining 8/15-th of link-2 is equally divided, the share of tenant-A will increase to 2/3-rd and incentivize her to lie. Again, the isolation guarantee will be smaller (Figure 15c).

Lemma 15 Allocating the spare resource proportionally to tenants' demands is not strategyproof.

Proof Sketch If one divides the spare resource in proportion to tenant demands, the allocation (Figure 15d) differs from that of equal division. However, tenant-A can again increase her progress at the expense of others.
C Dual Objectives of Network Sharing

The two conflicting requirements of the network sharing problem can be defined as follows:

1. Utilization: Σ_{i ∈ [1,2P]} Σ_{k ∈ [1,M]} c^i_k

2. Isolation guarantee: min_{k ∈ [1,M]} M_k

Given the tradeoff between the two, one can consider one of the two possible optimizations:⁸

O1 Ensure the highest utilization, then maximize the isolation guarantee with best effort;

O2 Ensure the optimal isolation guarantee, then maximize utilization with best effort.

O1: Utilization-First In this case, the optimization attempts to maximize the isolation guarantee across all tenants while keeping the highest network utilization:

Maximize   min_{k ∈ [1,M]} M_k
subject to Σ_{i ∈ [1,2P]} Σ_{k ∈ [1,M]} c^i_k = U*,        (3)

where U* = max Σ_{i ∈ [1,2P]} Σ_{k ∈ [1,M]} c^i_k is the highest utilization possible. Although this ensures maximum network utilization, the isolation guarantee to individual tenants can be arbitrarily low. This formulation can still be useful in private datacenters [36].

To ensure some isolation guarantee, existing cloud network sharing approaches [14, 45, 51, 58, 59, 61, 62, 67] use a similar formulation:

Maximize   Σ_{i ∈ [1,2P]} Σ_{k ∈ [1,M]} c^i_k
subject to M_k ≥ 1/M,  k ∈ [1,M]        (4)

The objective here is to maximize utilization while ensuring at least 1/M-th of each link to tenant-k. However, this approach has two primary drawbacks (§2.3):

1. suboptimal isolation guarantee, and

2. lower utilization.

O2: Isolation-Guarantee-First Instead, in this paper, we have formulated the network sharing problem as follows:

Maximize   Σ_{i ∈ [1,2P]} Σ_{k ∈ [1,M]} c^i_k

Here, we maximize resource consumption while keeping the optimal isolation guarantee, denoted by M*_k, across all tenants. Meanwhile, the constraint that consumption be at least the guaranteed minimum allocation ensures strategy-proofness, thus guaranteeing that the guaranteed allocated resources will be utilized.

Because c^i_k values have no upper bounds except for physical capacity constraints, optimization O2 may result in a suboptimal isolation guarantee in non-cooperative environments (§2.3.3). HUG introduces the following additional constraint to avoid this issue only in non-cooperative environments:

c^i_k ≤ M*,  i ∈ [1,2P],  k ∈ [1,M]

This constraint is not necessary when strategy-proofness is a non-requirement – e.g., in private datacenters.

⁸ …ture direction.
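As a numerical illustration of the isolation-guarantee-first idea, the optimal isolation guarantee for the two-tenant example used throughout the paper (d_A = <1/2, 1>, d_B = <1, 1/6>, unit-capacity links) can be computed directly. This is a toy sketch of the formulation, with our own variable names; it is not the HUG algorithm itself.

```python
# Toy sketch of the isolation-guarantee-first objective (Appendix C) on the
# running two-link, two-tenant example with unit link capacities.
demands = {
    "A": [0.5, 1.0],    # correlation vector of tenant-A
    "B": [1.0, 1 / 6],  # correlation vector of tenant-B
}
num_links = 2

# The optimal isolation guarantee M* is 1 / (largest total demand on a link).
link_totals = [sum(d[i] for d in demands.values()) for i in range(num_links)]
m_star = 1.0 / max(link_totals)

# Guaranteed allocation of each tenant on each link: M* times her demand.
guaranteed = {t: [m_star * di for di in d] for t, d in demands.items()}

print(round(m_star, 4))                        # 0.6667 (= 2/3, as in Figure 1)
print([round(x, 4) for x in guaranteed["A"]])  # [0.3333, 0.6667]
```

Link-1 carries the largest total demand (3/2), so it is the system bottleneck and every tenant's progress equals 1/D_b = 2/3, matching the allocation in Figure 1.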