
High Speed Networks Need

Proactive Congestion Control


The Congestion Control Problem

[Topology: Link 0 (100 G), Link 1 (60 G), Link 2 (30 G), Link 3 (10 G), Link 4 (100 G).
Flow A crosses Links 0-1, Flow B Links 1-2, Flow C Links 2-3, Flow D Links 3-4.]

Ask an oracle.

Link  Capacity     Flow  Links Used  Rate
0     100 G        A     0, 1        35 G
1     60 G         B     1, 2        25 G
2     30 G         C     2, 3        5 G
3     10 G         D     3, 4        5 G
4     100 G

Link 0 (100 G), Link 1 (60 G), Link 2 (30 G), Link 3 (10 G), Link 4 (100 G)
Flow A = 35G, Flow B = 25G, Flow C = 5G, Flow D = 5G


Traditional Congestion Control
• No explicit information about the traffic matrix
• Measure congestion signals, then react by adjusting rate after a measurement delay
• Gradual: can’t jump to the right rates, only know the direction
• “Reactive Algorithms”

[Plot: Transmission Rate (Gbps) vs. Time (# of RTTs, 1 RTT = 24 us)]

[Plot: Transmission Rate (Gbps) vs. Time (# of RTTs, 1 RTT = 24 us); ideal rates shown dotted]
Link 0 (100 G), Link 1 (60 G), Link 2 (30 G), Link 3 (10 G), Link 4 (100 G)
Flow A = 35G, Flow B = 25G, Flow C = 5G, Flow D = 5G

[Plot: same axes; RCP’s rates (dashed) gradually approach the ideal rates (dotted)]


[Plot: RCP (dashed) takes 30 RTTs to converge to the ideal rates (dotted)]
Link 0 (100 G), Link 1 (60 G), Link 2 (30 G), Link 3 (10 G), Link 4 (100 G)
Flow A = 35G, Flow B = 25G, Flow C = 5G, Flow D = 5G
Convergence Times Are Long
• If flows only last a few RTTs, then we can’t wait 30 RTTs to converge.
• At 100G, a typical flow in a search workload is < 7 RTTs long.

[Pie chart: Fraction of Total Flows in Bing Workload. Small (1-10KB): 56%, Medium (10KB-1MB): 30%, Large (1MB-100MB): 14%. Note: 1MB / 100 Gb/s = 80 us.]
Why “Reactive” Schemes Take Long
1. No explicit information
2. Therefore measure congestion signals, react
3. Can’t leap to correct values, but know direction
4. Reaction is fed back into network
5. Take cautious steps

[Diagram: feedback loop between “Adjust Flow Rate” and “Measure Congestion”, next to the rate-vs-time plot]
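The loop in steps 1-5 can be sketched as a toy controller. This is a hypothetical AIMD-style rule on a single made-up 10 Gbps link, not RCP’s actual update equation:

```python
# Two flows share a hypothetical 10 Gbps link. Each RTT they measure
# one bit of congestion feedback (over- or under-subscribed) and take
# a cautious step -- an illustrative AIMD-style rule, not RCP itself.
CAPACITY = 10.0
rates = [1.0, 1.0]                 # Gbps
for rtt in range(40):
    load = sum(rates)              # congestion signal, one RTT stale
    for i in range(len(rates)):
        if load < CAPACITY:
            rates[i] += 0.5        # additive increase: right direction...
        else:
            rates[i] *= 0.7        # ...multiplicative back-off when over
# After 40 RTTs the rates are still oscillating around the 5 Gbps fair
# share: the senders never learn the target, only the direction.
```

The point of the sketch is that the controller has no way to jump: it never learns how many flows it competes with, so it must creep toward the fair share one cautious step per RTT.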

Reactive algorithms trade off explicit flow information for long convergence times.

Can we use explicit flow information and get shorter convergence times?

Back to the oracle: how did she use the traffic matrix to compute rates?

Link 0 (100 G), Link 1 (60 G), Link 2 (30 G), Link 3 (10 G), Link 4 (100 G)
Flow A = 35G, Flow B = 25G, Flow C = 5G, Flow D = 5G

Waterfilling Algorithm

Start:
Link 0 (0/100 G), Link 1 (0/60 G), Link 2 (0/30 G), Link 3 (0/10 G), Link 4 (0/100 G)
Flow A (0 G), Flow B (0 G), Flow C (0 G), Flow D (0 G)

Waterfilling: 10 G link is fully used
Link 0 (5/100 G), Link 1 (10/60 G), Link 2 (10/30 G), Link 3 (10/10 G), Link 4 (5/100 G)
Flow A (5 G), Flow B (5 G), Flow C (5 G), Flow D (5 G)

Waterfilling: 30 G link is fully used
Link 0 (25/100 G), Link 1 (50/60 G), Link 2 (30/30 G), Link 3 (10/10 G), Link 4 (5/100 G)
Flow A (25 G), Flow B (25 G), Flow C (5 G), Flow D (5 G)

Waterfilling: 60 G link is fully used
Link 0 (35/100 G), Link 1 (60/60 G), Link 2 (30/30 G), Link 3 (10/10 G), Link 4 (5/100 G)
Flow A (35 G), Flow B (25 G), Flow C (5 G), Flow D (5 G)
Fair Share of Bottlenecked Links
Link 1 (60 G): Fair Share 35 G
Link 2 (30 G): Fair Share 25 G
Link 3 (10 G): Fair Share 5 G
(Links 0 and 4, at 35/100 G and 5/100 G, are not bottlenecked.)
Flow A (35 G), Flow B (25 G), Flow C (5 G), Flow D (5 G)
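The steps above amount to the classic centralized water-filling computation. A minimal sketch, with link capacities and flow paths taken from the example:

```python
def water_fill(capacity, paths):
    """Max-min fair rates via water-filling.

    capacity: {link: Gbps}; paths: {flow: set of links it crosses}.
    Repeatedly find the most constrained link, freeze its flows at
    that link's fair share, and subtract their rates everywhere.
    """
    rate = {}
    remaining = dict(capacity)
    active = {l: {f for f in paths if l in paths[f]} for l in capacity}
    while len(rate) < len(paths):
        # Fair share each link could still offer its unfrozen flows.
        offer = {l: remaining[l] / len(active[l])
                 for l in capacity if active[l]}
        bottleneck = min(offer, key=offer.get)
        fs = offer[bottleneck]
        for f in list(active[bottleneck]):   # freeze these flows
            rate[f] = fs
            for l in paths[f]:
                remaining[l] -= fs
                active[l].discard(f)
    return rate

links = {0: 100, 1: 60, 2: 30, 3: 10, 4: 100}
paths = {"A": {0, 1}, "B": {1, 2}, "C": {2, 3}, "D": {3, 4}}
rates = water_fill(links, paths)
# Bottlenecks fill in order: Link 3 (fs 5), Link 2 (fs 25),
# Link 1 (fs 35) -> A = 35, B = 25, C = 5, D = 5 (G), as in the slides.
```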
A centralized water-filling scheme
may not scale.
Can we let the network figure out
rates in a distributed fashion?
Fair Share for a Single Link

Flow  Demand        Capacity at Link 1: 30G
A     ∞             So Fair Share Rate: 30G/2 = 15G
B     ∞

[Diagram: Flow A (demand ∞) and Flow B (demand ∞) share Link 1 (30 G); each gets 15 G]
A second link introduces a dependency

Flow  Demand
A     ∞
B     10 G

[Diagram: Flow A crosses Link 1 (30 G); Flow B crosses Link 1 (30 G) and Link 2 (10 G)]
Dependency Graph

[Graph: Flow A connects to Link 1 (30 G); Flow B connects to Link 1 (30 G) and Link 2 (10 G). Link 2 limits Flow B to 10 G, and Link 1’s allocation depends on that.]
Proactive Explicit Rate Control (PERC) Overview

Round 1 (Flows → Links)
• Flows and links alternately exchange messages.
• A flow sends a “demand”
  – ∞ when no other fair share is known
  – otherwise, the min. fair share of its other links
• A link sends a “fair share”
  – C/N when demands are ∞
  – otherwise computed via water-filling
[Diagram: Flows A and B exchange messages with Link 1 (30 G) and Link 2 (10 G)]

Round 2 (Flows → Links)
• Flows report updated demands based on the fair shares heard so far.
[Diagram: updated demand messages]

Round 2 (Links → Flows)
• Links reply with updated fair shares.
• Messages are approximate; they jump to the right values quickly with more rounds.
[Diagram: updated fair-share messages]
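One way to make these rounds concrete is a small simulation on the five-link example. The message schedule and the link computation below are my own simplification of the description above (demand = min fair share of the flow’s other links; fair share = one water-filling pass over the other flows’ demands):

```python
import math

def link_fair_share(cap, demands, f):
    """Fair share a link offers flow f: other flows whose demand sits
    below the even split keep their demand (water-filling); f and the
    remaining flows split whatever is left."""
    others = sorted(d for g, d in demands.items() if g != f)
    remaining, n = cap, len(demands)
    for d in others:
        if d < remaining / n:
            remaining -= d          # bounded flow keeps its demand
            n -= 1
        else:
            break
    return remaining / n

# Example from the slides.
links = {0: 100, 1: 60, 2: 30, 3: 10, 4: 100}
flows = {"A": [0, 1], "B": [1, 2], "C": [2, 3], "D": [3, 4]}

# share[(l, f)]: fair share link l last advertised to flow f.
share = {(l, f): math.inf for f in flows for l in flows[f]}

for rnd in range(4):                        # a few message rounds
    for f in flows:                         # one control round trip per flow
        for l in flows[f]:
            # Demand of each flow g at link l = min fair share of g's
            # *other* links (infinite if g has no other link).
            demands = {
                g: min((share[(k, g)] for k in flows[g] if k != l),
                       default=math.inf)
                for g in flows if l in flows[g]
            }
            share[(l, f)] = link_fair_share(links[l], demands, f)

rates = {f: min(share[(l, f)] for l in flows[f]) for f in flows}
# Converges to the water-filling rates: A=35, B=25, C=5, D=5 G.
```

In this sketch the exchange settles on the max-min rates within a few rounds, matching the slides’ claim that approximate messages jump to the right values quickly.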
Message Passing Algorithms
Decoding error correcting codes (LDPC: Gallager, 1963)
Flow counts using shared counters (Counter Braids: Lu et al., 2008)

[Diagram: bits x1 = 0, x2 = 1, x3 = 1 connected to Parity Check 1 (x1 + x3 = 0) and Parity Check 2 (x2 + x3 = 0); Flows A and B connected to shared Counter 1 (36) and Counter 2 (32)]
Making PERC concrete
PERC Implementation
A control packet for Flow B carries one (demand, fair share) slot per link on its path (Link 1 at 30 G, Link 2 at 10 G; Flow A also uses Link 1).

Start:            d | ∞  | ∞      f | ?  | ?
After Link 1:     d | ∞  | ∞      f | 15 | ?
After Link 2:     d | ∞  | ∞      f | 15 | 10
Reverse pass:     d | 10 | 15     f | 15 | 10

Flow B: send at 15G!
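The packet walk above can be sketched end to end. The slot layout follows the slides; the per-link state (just a registered-flow count, enough for a first round in which all demands are ∞) is a simplifying assumption:

```python
import math

# First round of Flow B's control packet over Link 1 (30 G) and
# Link 2 (10 G). Assumed per-link state: capacity and the number of
# registered flows (Flows A and B on Link 1, only B on Link 2).
capacity = {1: 30.0, 2: 10.0}
n_flows = {1: 2, 2: 1}

# One (demand, fair-share) slot per hop, as in the slides: d | inf | inf
packet = [{"link": 1, "d": math.inf, "f": None},
          {"link": 2, "d": math.inf, "f": None}]

# Forward pass: with all demands infinite, each link stamps C/N.
for slot in packet:
    slot["f"] = capacity[slot["link"]] / n_flows[slot["link"]]
# -> f | 15 | 10

# Reverse pass: the demand reported back to each link is the minimum
# fair share stamped by the flow's *other* links.
for i, slot in enumerate(packet):
    slot["d"] = min(s["f"] for j, s in enumerate(packet) if j != i)
# -> d | 10 | 15
```

Subsequent round trips repeat the same two passes with the links recomputing fair shares from the reported demands instead of C/N.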
PERC converges fast
[Plot: PERC (solid) converges in < 5 RTTs, where RCP took 30 RTTs; ideal shown dotted. Transmission Rate (Gbps) vs. Time (# of RTTs, 1 RTT = 24 us).]
Link 0 (100 G), Link 1 (60 G), Link 2 (30 G), Link 3 (10 G), Link 4 (100 G)
Flow A = 35G, Flow B = 25G, Flow C = 5G, Flow D = 5G
PERC Converges Fast
[CDF of RTTs to converge for PERC vs. RCP (topology: 1 ToR, edge links, 4 hosts):
4 vs. 14 RTTs at the median, 10 vs. 71 RTTs at the tail (99th percentile)]
Some unanswered questions
• How to calculate fair shares in PERC switches?
• How to bound convergence times in theory?
• What about other policies?
Takeaways
• Reactive schemes are slow for short flows (the majority) at 100G.
• Proactive schemes like PERC are fundamentally different and can converge quickly because they calculate explicit rates from out-of-band information about the set of active flows.
• Message passing is a promising proactive approach: it could be practical, but further analysis is needed to understand convergence times in practice.
Thanks!
Shorter FCTs For Flows That Last A Few RTTs (“Medium”)
[Bar chart, 100G, 12us: mean and tail FCT (normalized by IDEAL) for RCP, DCTCP, and PERC across Small, Medium, and Large flows]
XCP
RCP
ATM/ Charny etc.
[CDF of RTTs to converge: PERC vs. CHARNY vs. RCP]
Discussion
• Is there fundamentally any limit on how fast we can get max-min rates, whether explicit or implicit?
