Professional Documents
Culture Documents
Chiesa Purr Shortened More
Chiesa Purr Shortened More
Marco Chiesa
KTH Royal Institute of Technology
Code: bitbucket.org/marchiesa/purr
2024-03-02 2
Network resilience is a key yet challenging property
High resilience
2024-03-02 Image credits: route by Philipp Petzka from the Noun Project 3
Routing reconvergence takes time!
routing
restoration failure event
control-plane
time
detection & route
recomputation
data-plane update
failure occurs
control-plane
time
detection & route
recomputation
2024-03-02 Image credits: route by Philipp Petzka from the Noun Project 7
Hard to recover within 50ms!
failure occurs
control-plane
time
detection & route
recomputation
[1] On low-latency-capable topologies, and their impact on the design of intra-domain routing. In SIGCOMM 2018
2024-03-02 Image credits: route by Philipp Petzka from the Noun Project 8
Fast Reroute (FRR):
pre-computing failover paths
match/actions (e.g., at
X) C
1 dst = A >> FRR1
2nd
backup
FRR actions (e.g., at X B
X)FRR >> fwd A B C primary backup
control plane 1
10
Fast Reroute (FRR):
pre-computing failover paths
match/actions (e.g., at
X) C
1 dst = A >> FRR1
dst = B >> FRR2 backup
deparser
pre-computing failover paths
parser
match/actions (e.g., at
X) C
1 dst = A >> FRR1
dst = B >> FRR2
2 packet pipeline
backup
deparser
that minimizes pipeline resource consumption
parser
dst = A >> FRR1 FRR1 >> fwd A B C
dst = B >> FRR2 FRR2 >> fwd C B A
Input
FRR actions (e.g., at Output
X)FRR >> fwd A B C
1
FRR2 >> fwd C B A
Packet recirculation
simplified pipeline
Packets in
Packets out
Parser
stage …
stage N
stage 1
ingress
buffer
Advanced SDN Input
routing High availability
headers & High flexibility
(e.g., intent-based (e.g., within(e.g.,
50 reconfigurable
metadata
logically centralized milliseconds”[1])P4 dataplanes)
Resources: SDN)
- SRAM (exact matches), e.g., 2M entries Runtime P4
- TCAM (wildcard matches), e.g., 100K entries (Control plane)
- ALU
[1] On low-latency-capable topologies, and their impact on the design of intra-domain routing. In SIGCOMM 2018
2024-03-02
Image credits: route by Philipp Petzka from the Noun Project 14
PURR applies to P4 programmable switches
Packet recirculation
simplified pipeline
Packets in
Packets out
Parser
stage …
stage N
stage 1
ingress
buffer
Advanced SDN Input
routing High availability
headers & High flexibility
(e.g., intent-based (e.g., within(e.g.,
50 reconfigurable
metadata
logically centralized milliseconds”[1])P4 dataplanes)
SDN)
Resources: How do we realize FRR in P4?
- SRAM (exact matches), e.g., 2M entries
No P4 built-in FRR primitive
- TCAM (wildcard matches), e.g., 100K entries but…
Runtime P4
(Control plane)
a programmable pipeline: many approaches are
- ALU possible!
[1] On low-latency-capable topologies, and their impact on the design of intra-domain routing. In SIGCOMM 2018
2024-03-02
Image credits: route by Philipp Petzka from the Noun Project 15
First approach: packet recirculation
action
match
write &
tag fwd recirculate
Input 1 1 tag := 2 port 1 fails
FRR1 = 1 2 3 4
2 2 tag := 3 port 2 fails
3 3 tag := 4
4 4 -
throughput reduction
latency increase
2024-03-02 16
FRR recirculation has high memory occupancy
action
match
write &
tag fwd recirculate
Input 1 1 tag := 2
FRR1 = 1 2 3 4
2 2 tag := 3
FRR2 = 4 3 2 1
3 3 tag := 4
4 4 -
FRR = 1 port
port1 2isis port status fwd
active
active 1*** 1
*1** 2
port status
**1* 3
1111
***1 4
P4 register
TCAM wildcard
match memory
2024-03-02 20
PURR: Encoding FRR in the packet metadata
Input
FRR1 = 1 2 3 4
TCAM wildcard
match memory
2024-03-02 21
PURR: Encoding FRR in the packet metadata
Input
FRR1 = 1 2 3 4
TCAM wildcard
match memory
2024-03-02 22
PURR: Encoding FRR in the packet metadata
Input
FRR1 = 1 2 3 4
TCAM wildcard
match memory
2024-03-02 23
PURR: One tempting option: “Duplication” TCAM
Input
FRR1 = 1 2 3 4 FRR2 = 2 3 4 1
FRR = 4 0001111
2024-03-02 25
PURR: Encoding FRR in the packet metadata
Input
FRR1 = 1 2 3 4 FRR2 = 2 3 4 1
bit-to-port mapping
12341
2024-03-02 26
PURR: Encoding FRR in the packet metadata
Input
FRR1 = 1 2 3 4 FRR2 = 2 3 4 1
bit-to-port mapping
12341
2024-03-02 27
PURR: Encoding FRR in the packet metadata
Input
FRR1 = 1 2 3 4 FRR2 = 2 3 4 1
bit-to-port mapping
12341
2024-03-02 28
PURR: re-cycling TCAM entries
Input
FRR1 = 1 2 3 4 FRR2 = 2 3 4 1
bit-to-port mapping
12341
2024-03-02 29
The key problem: How to compute the bit-to-port
mapping that minimizes memory occupancy?
FRR1 = 2 3 1 0 FRR2 = 0 2 1 3 FRR3 = 3 0 2 1 FRR4 = 1 0 2 3
2024-03-02 [2 ]T. Jiang, M. Li. On the approximation of shortest common supersequences and longest common subsequences. In Journal on Computing. 30
Fast-Greedy: a heuristic for solving this specific SCS
idea: remove the most F1=2 3 1 0 F1=3 1 0 F1=1 0 F1=1 0
frequent left-most F2=2 0 1 3 F2=0 1 3 F2=0 1 3 F2=1 3
element among the F3=2 3 0 1 F3=3 0 1 F3=0 1 F3=1
longest sequences F4=3 1 2 0 F4=3 1 2 0 F4=1 2 0 F4=1 2 0
remove 2 remove 3 remove 0 remove 1
See the paper for:
• multi-table
optimization F1=0 F1=0 F1= ---
F2=3 F2=3 F2=3
F3= --- F3= --- F3= ---
F4=2 0 F4=0 F4= ---
remove 2 remove 0 remove 3
SCS = 2 3 0 1 2 0 3
2024-03-02 31
Implementation feasibility
P4-based implementations:
• implemented different FRR mechanisms in P4 using
PURR (e.g., F10 [1], arborescences [2], BFS, DFS,
rotor router [3])
• compiled on Tofino
FPGA-based implementation:
• implemented PURR on the NetFPGA-SUME platform
30 106
2024-03-02 36
How much memory does 32 PURR save?
factorial possible FRR sequences
Fast-greedy scales to large numberTCAM
The “duplication” of sequences
or recirculation FRR
approaches would not scale
107
#TCAMentries
[bit]
entries
bits
106
sequence size=16
cost
105 sequence size=16
TCAM
2
10 sequence size=8
avg. #TCAM
104
Memory
sequence size=8
103
avg.
1
10 102
101 102 103 104 105 101 102 103 104 105
number of sequences number of sequences
2024-03-02 37
How does FCT vary depending on the FRR primitive?
PURR improves both FCT and throughput
2.0
Small flows FCT [ms]
Throughput [Gbps]
1.5 101
FRR recirculation immediat
e reconve
1.0 rgence
2.4x 2.7x
0.5 FRR recirc
immediate reconvergence 100 ulation
0.0
10 20 30 40 50 60 70 10 20 30 40 50 60 70
Load [%] Load [%]
NS-3 simulations
Topology: 32-server Clos network, 10Gbps links
Workload: data-mining Transport: DCTCP One link failure at 0.5s
2024-03-02 38
How does FCT vary depending on the FRR primitive?
PURR improves both FCT and throughput
2.0
Small flows FCT [ms]
Throughput [Gbps]
1.5 101
FRR recirculation immediat
e reconve
1.0 purr
rgence
2x
1.4x purr
0.5 FRR recirc
immediate reconvergence 100 ulation
0.0
10 20 30 40 50 60 70 10 20 30 40 50 60 70
Load [%] Load [%]
NS-3 simulations
Topology: 32-server Clos network, 10Gbps links
Workload: data-mining Transport: DCTCP One link failure at 0.5s
2024-03-02 39
Conclusions: Keep calm and enjoy programmability
Fast Reroute is a critical functionality in today’s network
• requires high throughput, low latency, fast reactiveness, small forwarding tables
P4 does not define an FRR built-in primitive
• pipeline compilers and control-plane must program the P4 pipeline
PURR: We propose a lightweight TCAM-based FRR primitive
• an intriguing connection to algorithmic string theory
• no FRR-tailored hardware support
• improve performance by a factor of ~2x w.r.t. FRR recirculation
Marco Chiesa
Thank you! KTH Royal Institute of Technology
Code: bitbucket.org/marchiesa/purr
reusable
2024-03-02 Cat by dDara from the Noun Project 40
Backup slides
Smaller sequences on switches with
high-density ports
Input: FRR recirculation:
• a 64-port programmable switch • #TCAM entries = 64*63*62 = 250K
• all possible three ports FRR sequences • TCAM memory = 250K * (17+3) = 5Mb
2024-03-02 42
Three naïve approaches to implementing FRR
FRR FRR FRR parallel
recirculation sequential duplication
throughput low high high
2024-03-02 43
Second approach: sequential search
Input
FRR1 = 1 2 3 4 FRR2 = 2 3 4 1 FRR3 = 3 4 1 2 FRR4 = 4 1 2 3
2024-03-02 44
Second approach: sequential search
Input
FRR1 = 1 2 3 4 FRR2 = 2 3 4 1 FRR3 = 3 4 1 2 FRR4 = 4 1 2 3
2024-03-02 45
Second approach: sequential search
Input
FRR1 = 1 2 3 4 FRR2 = 2 3 4 1 FRR3 = 3 4 1 2 FRR4 = 4 1 2 3
2024-03-02 46
Second approach: sequential search
Input
FRR1 = 1 2 3 4 FRR2 = 2 3 4 1 FRR3 = 3 4 1 2 FRR4 = 4 1 2 3
2024-03-02 47
A sequential search wastes hardware resources
Input
FRR1 = 1 2 3 4 FRR2 = 2 3 4 1 FRR3 = 3 4 1 2 FRR4 = 4 1 2 3
Throughput [Gbps]
recirculation purr recirculation purr
1.5 imm. reconvergence 10 1
imm. reconvergence
1.0
1.7x
0.5 1.8x
100
0.0
10 20 30 40 50 60 70 10 20 30 40 50 60 70
Load [%] Load [%]
NS-3 simulations
Topology: 32-server Clos network, 10Gbps links, 10𝜇𝑠 link latency
Workload: data-mining Transport: DCTCP
2024-03-02 49
Recirculating packets degrades Flow Completion
Time (FCT) under two link failures
2.0
Small flows FCT [ms]
Throughput [Gbps]
1.5 101
FRR recirculation immedia
te reconv
1.0 ergence
3.7x
3.3x
0.5 FRR re
immediate reconvergence 100 circula
tion
0.0
10 20 30 40 50 60 70 10 20 30 40 50 60 70
Load [%] Load [%]
NS-3 simulations
Topology: 32-server Clos network, 10Gbps links
Workload: data-mining Transport: DCTCP
2024-03-02 50
NS-3 datacenter topology
k
spine ai led lin
switches S1 S2 S3 2 f
nd
1 S4
2 3 4
1st failed link
1 2 3 4x10Gbps
leaf
switches L1 L2 4 L3 L4
8x10Gbps
… … … …
2024-03-02 51
Evaluation: Random vs tree-based sequences
40000
30000 random
tree
20000 2
10 103 104 105
number of sequences
2024-03-02 52
The web search workload
1.6 1.6
Norm. Throughput
3.0 1.4
Norm. Throughput
recirc imm recirc imm
1.2 1.2
2.5 1.0 2.5 1.0
2.0 0.8 2.0 0.8
0.6 0.6
1.5 0.4 1.5 0.4
0.2 0.2
1.0 0.0 1.0 0.0
10 20 30 40 50 60 70 10 20 30 40 50 60 70
Load [%] Load [%]
2024-03-02 53
2024-03-02 54