
High Performance Switches and Routers:
Theory and Practice

Sigcomm 99
August 30, 1999
Harvard University

Nick McKeown Balaji Prabhakar

Departments of Electrical Engineering and Computer Science

nickm@stanford.edu balaji@isl.stanford.edu
Tutorial Outline
• Introduction:
What is a Packet Switch?
• Packet Lookup and Classification:
Where does a packet go next?
• Switching Fabrics:
How does the packet get there?
• Output Scheduling:
When should the packet leave?

Copyright 1999. All Rights Reserved 2


Introduction
What is a Packet Switch?

• Basic Architectural Components


• Some Example Packet Switches
• The Evolution of IP Routers

Copyright 1999. All Rights Reserved 3


Basic Architectural Components

Control: Admission Control, Congestion Control, Reservation Control, Routing
Datapath (per-packet processing): Policing, Switching, Output Scheduling

Copyright 1999. All Rights Reserved 4


Basic Architectural Components
Datapath: per-packet processing
1. Forwarding decision (forwarding table lookup at each input line card)
2. Interconnect
3. Output scheduling
Copyright 1999. All Rights Reserved 5


Where high performance packet
switches are used
- Carrier Class Core Router
- ATM Switch
- Frame Relay Switch

The Internet Core

Edge Router

Enterprise WAN access & Enterprise Campus Switch

Copyright 1999. All Rights Reserved 6


Introduction
What is a Packet Switch?

• Basic Architectural Components


• Some Example Packet Switches
• The Evolution of IP Routers

Copyright 1999. All Rights Reserved 7


ATM Switch
• Lookup cell VCI/VPI in VC table.
• Replace old VCI/VPI with new.
• Forward cell to outgoing interface.
• Transmit cell onto link.
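
A minimal sketch of these per-cell steps in Python; the VC table contents and field names below are invented for illustration:

# Hypothetical VC table: (in_port, in_vci) -> (out_port, out_vci)
vc_table = {(1, 42): (3, 17), (2, 7): (1, 99)}

def switch_cell(in_port, cell):
    out_port, out_vci = vc_table[(in_port, cell["vci"])]  # lookup VCI/VPI in VC table
    cell["vci"] = out_vci                                 # replace old VCI with new
    return out_port, cell                                 # forward to outgoing interface

print(switch_cell(1, {"vci": 42, "payload": b"..."}))     # -> (3, {'vci': 17, ...})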

Copyright 1999. All Rights Reserved 8


Ethernet Switch
• Lookup frame DA in forwarding table.
– If known, forward to correct port.
– If unknown, broadcast to all ports.
• Learn SA of incoming frame.
• Forward frame to outgoing interface.
• Transmit frame onto link.
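
A small sketch of the learning and forwarding steps above; the port numbers and frame representation are assumptions made for illustration:

class LearningBridge:
    def __init__(self, ports):
        self.ports = ports
        self.table = {}                                   # forwarding table: address -> port

    def receive(self, sa, da, in_port):
        self.table[sa] = in_port                          # learn SA of incoming frame
        out = self.table.get(da)
        if out is not None:                               # known DA: forward to correct port
            return [] if out == in_port else [out]
        return [p for p in self.ports if p != in_port]    # unknown DA: broadcast to all ports

bridge = LearningBridge(ports=[1, 2, 3, 4])
print(bridge.receive(sa="A", da="B", in_port=1))          # B unknown: flood to 2, 3, 4
print(bridge.receive(sa="B", da="A", in_port=2))          # A was learned on port 1: [1]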

Copyright 1999. All Rights Reserved 9


IP Router
• Lookup packet DA in forwarding table.
– If known, forward to correct port.
– If unknown, drop packet.
• Decrement TTL, update header Cksum.
• Forward packet to outgoing interface.
• Transmit packet onto link.
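
The TTL decrement and checksum update can be sketched as below, using RFC 1624's incremental update rule HC' = ~(~HC + ~m + m'); a real datapath would do this in hardware:

def decrement_ttl(ttl_proto_word, checksum):
    """ttl_proto_word: 16-bit header word holding TTL (high byte) and protocol;
    checksum: current 16-bit IPv4 header checksum."""
    m = ttl_proto_word
    m_new = m - 0x0100                     # TTL is the high-order byte
    s = (~checksum & 0xFFFF) + (~m & 0xFFFF) + (m_new & 0xFFFF)
    s = (s & 0xFFFF) + (s >> 16)           # fold carries (one's-complement sum)
    s = (s & 0xFFFF) + (s >> 16)
    return m_new, (~s & 0xFFFF)            # new TTL/protocol word, updated checksum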

Copyright 1999. All Rights Reserved 10


Introduction
What is a Packet Switch?

• Basic Architectural Components


• Some Example Packet Switches
• The Evolution of IP Routers

Copyright 1999. All Rights Reserved 11


First-Generation IP Routers

Shared Backplane

(Figure: CPU and buffer memory on a shared backplane; line interfaces with MACs transfer packets to and from the central memory by DMA.)

Copyright 1999. All Rights Reserved 12


Second-Generation IP Routers

(Figure: CPU and central buffer memory on the bus; each line card has its own MAC, DMA engine, and local buffer memory.)

Copyright 1999. All Rights Reserved 13


Third-Generation Switches/Routers

Switched Backplane

(Figure: line cards, each with a MAC and local buffer memory, and a CPU card, all connected through a switched backplane rather than a shared bus.)
Copyright 1999. All Rights Reserved 14


Fourth-Generation Switches/Routers
Clustering and Multistage

(Figure: multiple switch elements arranged in stages/clusters to interconnect 32 ports.)

Copyright 1999. All Rights Reserved 15


Packet Switches
References
• J. Giacopelli, M. Littlewood, W.D. Sincoskie “Sunshine: A
high performance self-routing broadband packet switch
architecture”, ISS ‘90.
• J. S. Turner “Design of a Broadcast packet switching
network”, IEEE Trans Comm, June 1988, pp. 734-743.
• C. Partridge et al. “A Fifty Gigabit per second IP Router”,
IEEE Trans Networking, 1998.
• N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M.
Horowitz, “The Tiny Tera: A Packet Switch Core”, IEEE
Micro Magazine, Jan-Feb 1997.

Copyright 1999. All Rights Reserved 16


Tutorial Outline
• Introduction:
What is a Packet Switch?
• Packet Lookup and Classification:
Where does a packet go next?
• Switching Fabrics:
How does the packet get there?
• Output Scheduling:
When should the packet leave?

Copyright 1999. All Rights Reserved 17


Basic Architectural Components
Datapath: per-packet processing
1. Forwarding decision (forwarding table lookup at each input line card)
2. Interconnect
3. Output scheduling
Copyright 1999. All Rights Reserved 18


Forwarding Decisions
• ATM and MPLS switches
– Direct Lookup
• Bridges and Ethernet switches
– Associative Lookup
– Hashing
– Trees and tries
• IP Routers
– Caching
– CIDR
– Patricia trees/tries
– Other methods
• Packet Classification
Copyright 1999. All Rights Reserved 19
ATM and MPLS Switches
Direct Lookup

(Figure: the incoming (Port, VCI) is used directly as the memory address; the data read out gives the new VCI.)

Copyright 1999. All Rights Reserved 20


Forwarding Decisions
• ATM and MPLS switches
– Direct Lookup
• Bridges and Ethernet switches
– Associative Lookup
– Hashing
– Trees and tries
• IP Routers
– Caching
– CIDR
– Patricia trees/tries
– Other methods
• Packet Classification
Copyright 1999. All Rights Reserved 21
Bridges and Ethernet Switches
Associative Lookups
(Figure: the 48-bit network address is presented to an associative memory (CAM); a hit returns the associated data via a log2N-bit address.)

Advantages:
• Simple

Disadvantages:
• Slow
• High Power
• Small
• Expensive
Copyright 1999. All Rights Reserved 22


Bridges and Ethernet Switches
Hashing

(Figure: a hashing function maps the 48-bit search key to a 16-bit memory address; the memory returns the associated data and a hit signal.)
Copyright 1999. All Rights Reserved 23


Lookups Using Hashing
An example
(Figure: a CRC-16 of the 48-bit search key selects a memory location; keys that collide at the same location are chained on linked lists (lists of length 1, 2, 3, and 4 are shown).)
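
A toy model of this lookup, using CRC-32 in place of the CRC-16 on the slide and Python lists in place of linked lists:

import zlib

class HashedTable:
    def __init__(self, n_buckets=4096):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, addr48):
        return zlib.crc32(addr48.to_bytes(6, "big")) % len(self.buckets)

    def insert(self, addr48, data):
        self.buckets[self._index(addr48)].append((addr48, data))

    def lookup(self, addr48):
        for a, data in self.buckets[self._index(addr48)]:   # walk the collision list
            if a == addr48:
                return data                                  # hit
        return None                                          # miss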

Copyright 1999. All Rights Reserved 24


Lookups Using Hashing
Performance of simple example

 
E[R] = (1/2) * [ 1 + ρ / (1 - (1 - 1/N)^M) ]

Where:
E[R] = Expected number of memory references
M = Number of memory addresses in table
N = Number of linked lists
ρ = M/N

Copyright 1999. All Rights Reserved 25


Lookups Using Hashing
Advantages:
• Simple
• Expected lookup time can be small

Disadvantages
• Non-deterministic lookup time
• Inefficient use of memory

Copyright 1999. All Rights Reserved 26


Trees and Tries
Binary Search Tree Binary Search Trie

(Figure: a binary search tree over N entries has depth log2N; a binary search trie branches on successive address bits 0/1, e.g. leaves 010 and 111.)
Copyright 1999. All Rights Reserved 27


Trees and Tries
Multiway tries

16-ary Search Trie


(Figure: each node holds 16 (key, pointer) entries; following the address 4 bits at a time, e.g. 0000 → 1111 → 0000, reaches the leaf for 000011110000.)

Copyright 1999. All Rights Reserved 28


Trees and Tries
Multiway tries
Where:
D = Degree of tree
L = Number of layers/references
N = Number of entries in table
E[n] = Expected number of nodes
E[w] = Expected amount of wasted memory

(E[n] and E[w] are obtained by summing, over the trie levels i = 1 … L-1, the expected number of nodes occupied by at least one of the N random entries.)

Degree of  # Mem        # Nodes   Total Memory   Fraction
Tree       References   (x10^6)   (MBytes)       Wasted (%)
2          48           1.09      4.3            49
4          24           0.53      4.3            73
8          16           0.35      5.6            86
16         12           0.25      8.3            93
64         8            0.17      21             98
256        6            0.12      64             99.5

Table produced from 2^15 randomly generated 48-bit addresses
Copyright 1999. All Rights Reserved 29
Forwarding Decisions
• ATM and MPLS switches
– Direct Lookup
• Bridges and Ethernet switches
– Associative Lookup
– Hashing
– Trees and tries
• IP Routers
– Caching
– CIDR
– Patricia trees/tries
– Other methods
• Packet Classification
Copyright 1999. All Rights Reserved 30
Caching Addresses
Slow Path: through the central CPU and buffer memory
Fast Path: handled on the line cards, using locally cached addresses

(Figure: second-generation router; each line card has a MAC, DMA, and local buffer memory.)

Copyright 1999. All Rights Reserved 31


Caching Addresses
LAN:
Average flow < 40 packets

WAN: Huge Number of flows


(Figure: cache hit rate, 0-100%.)
Cache = 10% of Full Table
Copyright 1999. All Rights Reserved 32
IP Routers
Class-based addresses

(Figure: the IP address space divided into Class A, Class B, Class C, and Class D regions.)

Routing Table: exact match on the network address,
e.g. destination 212.17.9.4 matches the Class C entry 212.17.9.0 → Port 4
Copyright 1999. All Rights Reserved 33


IP Routers
CIDR
Class-based:
(Figure: the address line from 0 to 2^32-1 divided into fixed A, B, C, D regions.)

Classless (CIDR):
(Figure: prefixes such as 65/8, 128.9/16 (2^16 addresses starting at 128.9.0.0), and 142.12/19 cover power-of-two ranges anywhere on the address line; the address 128.9.16.14 lies inside 128.9/16.)
Copyright 1999. All Rights Reserved 34
IP Routers
CIDR
(Figure: nested prefixes on the address line from 0 to 2^32-1: 128.9/16 contains 128.9.16/20 and 128.9.176/20, and 128.9.16/20 in turn contains 128.9.19/24 and 128.9.25/24. The address 128.9.16.14 lies inside 128.9.16/20.)

Most specific route = "longest matching prefix"

Copyright 1999. All Rights Reserved 35


IP Routers
Metrics for Lookups

Prefix        Port
65/8          3
128.9/16      5
128.9.16/20   2
128.9.19/24   7
128.9.25/24   10
128.9.176/20  1
142.12/19     3

Metrics:
• Lookup time
• Storage space
• Update time
• Preprocessing time

Example: destination 128.9.16.14 matches both 128.9/16 and 128.9.16/20; the longest match, 128.9.16/20, gives Port 2.
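
As a sanity check, a linear-scan longest-prefix match over the table above can be sketched with Python's ipaddress module (a real router would use one of the data structures that follow):

import ipaddress

table = [("65.0.0.0/8", 3), ("128.9.0.0/16", 5), ("128.9.16.0/20", 2),
         ("128.9.19.0/24", 7), ("128.9.25.0/24", 10),
         ("128.9.176.0/20", 1), ("142.12.0.0/19", 3)]
table = [(ipaddress.ip_network(p), port) for p, port in table]

def lookup(dst):
    """Return the port of the longest matching prefix (linear scan for clarity)."""
    addr = ipaddress.ip_address(dst)
    best = None
    for net, port in table:
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)
    return best[1] if best else None

print(lookup("128.9.16.14"))   # -> 2 (128.9.16/20 is the most specific match)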

Copyright 1999. All Rights Reserved 36


IP Router
Lookup
(Figure: the forwarding engine extracts the destination address from the packet header and computes the next hop from the forwarding table, which maps destinations to next hops.)

IPv4 unicast destination address based lookup

Copyright 1999. All Rights Reserved 37


Need more than IPv4 unicast
lookups
• Multicast
• PIM-SM
– Longest Prefix Matching on the source and group address
– Try (S,G) followed by (*,G) followed by (*,*,RP)
– Check Incoming Interface
• DVMRP:
– Incoming Interface Check followed by (S,G) lookup

• IPv6
• 128-bit destination address field
• Exact address architecture not yet known

Copyright 1999. All Rights Reserved 38


Lookup Performance Required
Line Line Rate Pkt­size=40B Pkt­size=240B
T1 1.5Mbps 4.68Kpps 0.78Kpps
OC3 155Mbps 480Kpps 80 Kpps
OC12 622Mbps 1.94Mpps 323Kpps
OC48 2.5Gbps 7.81Mpps 1.3Mpps
OC192 10 Gbps 31.25Mpps 5.21Mpps

Gigabit Ethernet (84B packets): 1.49 Mpps


Copyright 1999. All Rights Reserved 39
Size of the Routing Table

Source: http://www.telstra.net/ops/bgptable.html
Copyright 1999. All Rights Reserved 40
Ternary CAMs
Associative Memory
Value Mask
10.0.0.0 255.0.0.0 R1
10.1.0.0 255.255.0.0 R2 Next Hop
10.1.1.0 255.255.255.0 R3
10.1.3.0 255.255.255.0 R4
10.1.3.1 255.255.255.255 R4
Priority Encoder
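
A software model of this TCAM behaviour, using the slide's entries; storing the rows longest-mask-first plays the role of the priority encoder:

# Each entry is (value, mask, next_hop), most specific masks first.
entries = [
    (0x0A010301, 0xFFFFFFFF, "R4"),   # 10.1.3.1/32
    (0x0A010100, 0xFFFFFF00, "R3"),   # 10.1.1.0/24
    (0x0A010300, 0xFFFFFF00, "R4"),   # 10.1.3.0/24
    (0x0A010000, 0xFFFF0000, "R2"),   # 10.1.0.0/16
    (0x0A000000, 0xFF000000, "R1"),   # 10.0.0.0/8
]

def tcam_lookup(addr):
    for value, mask, next_hop in entries:   # compared in parallel in a real TCAM
        if addr & mask == value:
            return next_hop
    return None

print(tcam_lookup(0x0A010305))   # 10.1.3.5 -> "R4"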

Copyright 1999. All Rights Reserved 41


Binary Tries
Example Prefixes
a) 00001
b) 00010
c) 00011
d) 001
e) 0101
f) 011
g) 100
h) 1010
i) 1100
j) 11110000

(Figure: binary trie; each prefix is stored at the node reached by following its bits from the root.)
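
A sketch of longest-prefix matching on the binary trie, using the example prefixes above:

class TrieNode:
    def __init__(self):
        self.child = {}      # '0' / '1' branches
        self.prefix = None   # prefix label stored at this node, if any

def insert(root, bits, name):
    node = root
    for b in bits:
        node = node.child.setdefault(b, TrieNode())
    node.prefix = name

def longest_prefix_match(root, bits):
    node, best = root, None
    for b in bits:
        if b not in node.child:
            break
        node = node.child[b]
        if node.prefix:
            best = node.prefix          # remember the deepest prefix seen so far
    return best

root = TrieNode()
for bits, name in [("00001", "a"), ("00010", "b"), ("00011", "c"), ("001", "d"),
                   ("0101", "e"), ("011", "f"), ("100", "g"), ("1010", "h"),
                   ("1100", "i"), ("11110000", "j")]:
    insert(root, bits, name)

print(longest_prefix_match(root, "0010110"))   # -> 'd' (001 is the longest match)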

Copyright 1999. All Rights Reserved 42
Patricia Tree
Example Prefixes
a) 00001
b) 00010
c) 00011
d) 001
e) 0101
f) 011
g) 100
h) 1010
i) 1100
j) 11110000

(Figure: Patricia trie for the same prefixes; chains of single-branch nodes are collapsed into one edge with a skip count, e.g. Skip=5.)

Copyright 1999. All Rights Reserved 43


Patricia Tree
Advantages
• General solution
• Extensible to wider fields

Disadvantages
• Many memory accesses
• May need backtracking
• Pointers take up a lot of space

Avoid backtracking by storing the intermediate best-matched prefix
(Dynamic Prefix Tries).

40K entries: 2MB data structure with 0.3-0.5 Mpps [O(W)]

Copyright 1999. All Rights Reserved 44


Binary search on trie levels
Level 0

Level 8

Level 29

Copyright 1999. All Rights Reserved 45


Binary search on trie levels
Store a hash table for each prefix length to aid search at a particular trie level.

Example Prefixes: 10.0.0.0/8, 10.1.0.0/16, 10.1.1.0/24, 10.1.2.0/24, 10.2.3.0/24

Length   Hash table contents
8        10
16       10.1, 10.2
24       10.1.1, 10.1.2, 10.2.3

Example Addresses: 10.1.1.4, 10.4.4.3, 10.2.3.9, 10.2.4.8
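
A simplified sketch of the scheme with the example prefixes above: one hash table per prefix length, marker entries for longer prefixes (e.g. 10.2 at length 16), and a precomputed best-matching prefix (bmp) per entry so the binary search never backtracks:

import ipaddress

PREFIXES = ["10.0.0.0/8", "10.1.0.0/16", "10.1.1.0/24", "10.1.2.0/24", "10.2.3.0/24"]
nets = [ipaddress.ip_network(p) for p in PREFIXES]
lengths = sorted({n.prefixlen for n in nets})
tables = {l: {} for l in lengths}

def key(addr_int, length):
    return addr_int >> (32 - length)

# Insert each prefix, plus markers at every shorter length on its path.
for n in nets:
    a = int(n.network_address)
    for l in lengths:
        if l > n.prefixlen:
            break
        entry = tables[l].setdefault(key(a, l), {"bmp": None})
        if l == n.prefixlen:
            entry["bmp"] = n              # a real prefix ends at this entry

# For pure markers, record the longest real prefix covering the marker string.
for l in lengths:
    for k, entry in tables[l].items():
        if entry["bmp"] is None:
            for n in sorted(nets, key=lambda n: -n.prefixlen):
                if n.prefixlen < l and key(int(n.network_address), n.prefixlen) == k >> (l - n.prefixlen):
                    entry["bmp"] = n
                    break

def lookup(dst):
    addr = int(ipaddress.ip_address(dst))
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        entry = tables[lengths[mid]].get(key(addr, lengths[mid]))
        if entry:                         # hit (prefix or marker): remember bmp, try longer
            best = entry["bmp"] or best
            lo = mid + 1
        else:                             # miss: only shorter lengths can match
            hi = mid - 1
    return best

print(lookup("10.1.1.4"), lookup("10.4.4.3"), lookup("10.2.4.8"))
# -> 10.1.1.0/24  10.0.0.0/8  10.0.0.0/8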

Copyright 1999. All Rights Reserved 46


Binary search on trie levels
Advantages
• Scalable to IPv6.

Disadvantages
• Multiple hashed memory accesses.
• Updates are complex.

33K entries: 1.4MB data structure with 1.2-2.2 Mpps [O(log W)]

Copyright 1999. All Rights Reserved 47


Compacting Forwarding Tables

1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1

Copyright 1999. All Rights Reserved 48


Compacting Forwarding Tables
Codeword array
10001010 11100010 10000010 10110100 11000000

R1, 0 R2, 3 R3, 7 R4, 9 R5, 0


0 1 2 3 4

Base index array

0 13
0 1
Copyright 1999. All Rights Reserved 49
Compacting Forwarding Tables
Advantages
• Extremely small data structure - can fit in cache.

Disadvantages
• Scalability to larger tables?
• Updates are complex.

33K entries: 160KB data structure with average 2Mpps [O(W/k)]

Copyright 1999. All Rights Reserved 50


Multi-bit Tries

16-ary Search Trie


(Figure: the same 16-ary trie as before; each node holds 16 (key, pointer) entries and the address is consumed 4 bits at a time.)

Copyright 1999. All Rights Reserved 51


Compressed Tries
Only 3 memory accesses

L8

L16

L24

Copyright 1999. All Rights Reserved 52


Routing Lookups in Hardware
(Figure: histogram of the number of prefixes vs. prefix length.)
Most prefixes are 24-bits or shorter
Copyright 1999. All Rights Reserved 53
Routing Lookups in Hardware
Prefixes up to 24-bits
2^24 = 16M entries

(Figure: the first 24 bits of the destination address, e.g. 142.19.6 from 142.19.6.14, directly index a 2^24-entry table; a flag of 1 means the entry holds the next hop itself.)
Copyright 1999. All Rights Reserved 54


Routing Lookups in Hardware
Prefixes up to 24-bits

(Figure: for 128.3.72.44, the first 24 bits (128.3.72) index the first table. A flag of 1 means the entry holds the next hop directly; a flag of 0 means it holds a pointer to a block in a second table for prefixes longer than 24 bits, and the last 8 bits (offset 44) select the next hop within that block.)
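
A rough software model of this two-table scheme; it assumes the tables are built off-line from a static prefix list, processed shortest-prefix-first so that longer prefixes simply overwrite shorter ones:

def build(prefixes):
    """prefixes: list of (network_int, prefix_len, next_hop) for IPv4."""
    TBL24, TBLlong = {}, []               # dict stands in for the 2^24-entry memory
    for net, plen, nh in sorted(prefixes, key=lambda p: p[1]):     # shortest first
        if plen <= 24:
            base = net >> 8
            for i in range(base, base + (1 << (24 - plen))):
                TBL24[i] = (1, nh)                                  # flag 1: next hop
        else:
            idx = net >> 8
            flag, val = TBL24.get(idx, (1, None))
            if flag == 1:                                           # allocate a block,
                TBLlong.append([val] * 256)                         # seed with covering hop
                val = len(TBLlong) - 1
                TBL24[idx] = (0, val)                               # flag 0: pointer
            lo = net & 0xFF
            for i in range(lo, lo + (1 << (32 - plen))):
                TBLlong[val][i] = nh
    return TBL24, TBLlong

def lookup(TBL24, TBLlong, addr):
    flag, val = TBL24.get(addr >> 8, (1, None))
    return val if flag == 1 else TBLlong[val][addr & 0xFF]          # 1 or 2 accesses

t24, tlong = build([(0x80034800, 24, "A"), (0x8003482C, 32, "B")])  # 128.3.72/24, 128.3.72.44/32
print(lookup(t24, tlong, 0x8003482C), lookup(t24, tlong, 0x80034801))   # -> B A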

Copyright 1999. All Rights Reserved 55


Routing Lookups in Hardware
Prefixes up to n-bits
(Figure: generalization: a first table of 2^N entries covers the first N bits; entries for longer prefixes point to blocks of 2^M entries indexed by the next M bits, and prefixes longer than N+M bits would require further levels.)
Copyright 1999. All Rights Reserved 56


Routing Lookups in Hardware
Advantages
• 20Mpps with 50ns DRAM
• Easy to implement in hardware

Disadvantages
• Large memory required (9-33MB)
• Depends on prefix-length distribution.

Various compression schemes can be employed to decrease the storage
requirements: e.g. employ carefully chosen variable-length strides,
bitmap compression, etc.

Copyright 1999. All Rights Reserved 57


IP Router Lookups
References
• A. Brodnik, S. Carlsson, M. Degermark, S. Pink. “Small Forwarding Tables
for Fast Routing Lookups”, Sigcomm 1997, pp 3-14.
• B. Lampson, V. Srinivasan, G. Varghese. “ IP lookups using multiway and
multicolumn search”, Infocom 1998, pp 1248-56, vol. 3.
• M. Waldvogel, G. Varghese, J. Turner, B. Plattner. “Scalable high speed IP
routing lookups”, Sigcomm 1997, pp 25-36.
• P. Gupta, S. Lin, N.McKeown. “Routing lookups in hardware at memory
access speeds”, Infocom 1998, pp 1241-1248, vol. 3.
• S. Nilsson, G. Karlsson. “Fast address lookup for Internet routers”, IFIP Intl
Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.
• V. Srinivasan, G.Varghese. “Fast IP lookups using controlled prefix
expansion”, Sigmetrics, June 1998.

Copyright 1999. All Rights Reserved 58


Forwarding Decisions
• ATM and MPLS switches
– Direct Lookup
• Bridges and Ethernet switches
– Associative Lookup
– Hashing
– Trees and tries
• IP Routers
– Caching
– CIDR
– Patricia trees/tries
– Other methods
• Packet Classification
Copyright 1999. All Rights Reserved 59
Providing Value-Added Services
Some examples
• Differentiated services
– Regard traffic from Autonomous System #33 as `platinum-grade'
• Access Control Lists
– Deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp
• Committed Access Rate
– Rate limit WWW traffic from sub-interface#739 to 10Mbps
• Policy-based Routing
– Route all voice traffic through the ATM network

Copyright 1999. All Rights Reserved 60


Packet Classification
(Figure: the forwarding engine parses the packet header and classifies the packet against a classifier (policy database) of predicate → action rules; the action of the highest priority matching rule is applied.)

Copyright 1999. All Rights Reserved 61


Multi-field Packet Classification
Field 1 Field 2 … Field k Action

Rule 1 152.163.190.69/21 152.163.80.11/32 … UDP A1

Rule 2 152.168.3.0/24 152.163.0.0/16 … TCP A2

… … … … … …

Rule N 152.168.0.0/16 152.0.0.0/8 … ANY An

Given a classifier with N rules, find the action associated


with the highest priority rule matching an incoming
packet.
Copyright 1999. All Rights Reserved 62
Geometric Interpretation in 2D
(Figure: rules R1-R7 drawn as rectangles in the (Field #1, Field #2) plane; a rule such as (144.24/16, 64/24) is a rectangle, a rule such as (128.16.46.23, *) is a line, and packets P1, P2 are points that fall inside the rules they match.)
Copyright 1999. All Rights Reserved 63
Proposed Schemes

Scheme: Sequential Evaluation
Pros: Small storage, scales well with number of fields
Cons: Slow classification rates

Scheme: Ternary CAMs
Pros: Single cycle classification
Cons: Cost, density, power consumption

Scheme: Grid of Tries (Srinivasan et al [Sigcomm 98])
Pros: Small storage requirements and fast lookup rates for two fields. Suitable for big classifiers
Cons: Not easily extendible to more than two fields.

Copyright 1999. All Rights Reserved 64


Proposed Schemes (Contd.)
Scheme: Crossproducting (Srinivasan et al [Sigcomm 98])
Pros: Fast accesses. Suitable for multiple fields.
Cons: Large memory requirements. Suitable without caching only for classifiers with fewer than 50 rules.

Scheme: Bit-level Parallelism (Lakshman and Stiliadis [Sigcomm 98])
Pros: Suitable for multiple fields.
Cons: Large memory bandwidth required. Comparatively slow lookup rate. Hardware only.

Copyright 1999. All Rights Reserved 65


Proposed Schemes (Contd.)
Scheme: Hierarchical Intelligent Cuttings (Gupta and McKeown [HotI 99])
Pros: Suitable for multiple fields. Small memory requirements. Good update time.
Cons: Large preprocessing time.

Scheme: Tuple Space Search (Srinivasan et al [Sigcomm 99])
Pros: Suitable for multiple fields. The basic scheme has good update times and memory requirements.
Cons: Classification rate can be low. Requires perfect hashing for determinism.

Scheme: Recursive Flow Classification (Gupta and McKeown [Sigcomm 99])
Pros: Fast accesses. Suitable for multiple fields. Reasonable memory requirements for real-life classifiers.
Cons: Large preprocessing time and memory requirements for large classifiers.
Copyright 1999. All Rights Reserved 66
Grid of Tries

(Figure: a trie on Dimension 1; each valid prefix node points to a trie on Dimension 2 that stores the rules R1-R7 whose first field matches that prefix.)
Copyright 1999. All Rights Reserved 67
Grid of Tries
Advantages
• Good solution for two dimensions

Disadvantages
• Static solution
• Not easy to extend to higher dimensions

20K entries: 2MB data structure with 9 memory accesses [at most 2W]

Copyright 1999. All Rights Reserved 68


Classification using Bit Parallelism
(Figure: each field lookup returns a bit vector marking which of the rules R1-R4 match that field; ANDing the vectors gives the set of rules matching the whole packet.)
Copyright 1999. All Rights Reserved 69


Classification using Bit Parallelism
Advantages
• Good solution for multiple dimensions for small classifiers

Disadvantages
• Large memory bandwidth
• Hardware optimized

512 rules: 1Mpps with single FPGA and 5 128KB SRAM chips.

Copyright 1999. All Rights Reserved 70


Classification Using Multiple Fields
Recursive Flow Classification
Packet Header
(Figure: the 2^S = 2^128 possible packet headers, built from fields F1 … Fn, are reduced in successive memory lookups, e.g. 2^128 → 2^64 → 2^24 → 2^T = 2^12, until a final small table yields the action.)
Copyright 1999. All Rights Reserved 71


Packet Classification
References
• T.V. Lakshman. D. Stiliadis. “High speed policy based packet
forwarding using efficient multi-dimensional range matching”,
Sigcomm 1998, pp 191-202.
• V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel. “Fast and
scalable layer 4 switching”, Sigcomm 1998, pp 203-214.
• V. Srinivasan, G. Varghese, S. Suri. “Fast packet classification using
tuple space search”, to be presented at Sigcomm 1999.
• P. Gupta, N. McKeown, “Packet classification using hierarchical
intelligent cuttings”, Hot Interconnects VII, 1999.
• P. Gupta, N. McKeown, “Packet classification on multiple fields”,
Sigcomm 1999.

Copyright 1999. All Rights Reserved 72


Tutorial Outline
• Introduction:
What is a Packet Switch?
• Packet Lookup and Classification:
Where does a packet go next?
• Switching Fabrics:
How does the packet get there?
• Output Scheduling:
When should the packet leave?

Copyright 1999. All Rights Reserved 73


Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
– Scheduling algorithms
– Combining input and output queues
– Other non-blocking fabrics
– Multicast traffic

Copyright 1999. All Rights Reserved 74


Basic Architectural Components
Datapath: per-packet processing
1. Forwarding decision (forwarding table lookup at each input line card)
2. Interconnect
3. Output scheduling
Copyright 1999. All Rights Reserved 75


Interconnects
Two basic techniques
Input Queueing: usually a non-blocking switch fabric (e.g. crossbar)
Output Queueing: usually a fast bus
Copyright 1999. All Rights Reserved 76
Interconnects
Output Queueing
Individual Output Queues: memory b/w = (N+1).R
Centralized Shared Memory: memory b/w = 2N.R

Copyright 1999. All Rights Reserved 77


Output Queueing
The “ideal”
(Figure: cells labeled with their output port flow straight through the fabric into the corresponding output queue.)
Copyright 1999. All Rights Reserved 78


Output Queueing
How fast can we make centralized shared memory?

5ns SRAM shared memory, 200-byte wide bus

• 5ns per memory operation
• Two memory operations per packet
• Therefore, up to 160Gb/s
• In practice, closer to 80Gb/s
Copyright 1999. All Rights Reserved 79


Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
– Scheduling algorithms
– Other non-blocking fabrics
– Combining input and output queues
– Multicast traffic

Copyright 1999. All Rights Reserved 80


Interconnects
Input Queueing with Crossbar
Memory b/w = 2R
Scheduler
Data In

configuration Data Out

Copyright 1999. All Rights Reserved 81


Input Queueing
Head of Line Blocking

(Figure: average delay vs. offered load; with FIFO input queues the delay blows up at 58.6% load, well short of 100%.)

Copyright 1999. All Rights Reserved 82


Head of Line Blocking

Copyright 1999. All Rights Reserved 83


Copyright 1999. All Rights Reserved 84
Copyright 1999. All Rights Reserved 85
Input Queueing
Virtual output queues

Copyright 1999. All Rights Reserved 86


Input Queues
Virtual Output Queues

(Figure: average delay vs. offered load; with virtual output queues the load can approach 100%.)

Copyright 1999. All Rights Reserved 87


Input Queueing
Memory b/w = 2R

Scheduler
Can be quite
complex!

Copyright 1999. All Rights Reserved 88


Input Queueing
Scheduling
(Figure: input i keeps a VOQ Q(i,j) for each output j; arrivals A_i(t) are placed in the VOQs and each cell time the scheduler picks a matching M between inputs and outputs, producing departures D_j(t).)
Copyright 1999. All Rights Reserved 89


Input Queueing
Scheduling
(Figure: a request graph between inputs 1-4 and outputs 1-4 with queue occupancies as edge weights, and a corresponding bipartite matching of weight 18.)
Question: Maximum weight or maximum size?


Copyright 1999. All Rights Reserved 90
Input Queueing
Scheduling
• Maximum Size
– Maximizes instantaneous throughput
– Does it maximize long-term throughput?
• Maximum Weight
– Can clear most backlogged queues
– But does it sacrifice long-term throughput?

Copyright 1999. All Rights Reserved 91


Input Queueing
Scheduling

(Figure: example request graphs and matchings.)
Copyright 1999. All Rights Reserved 92
Input Queueing
Longest Queue First or
Oldest Cell First

Weight = { Queue Length, Waiting Time }  →  100% throughput

(Figure: example with some VOQs of length 10 and others of length 1; the maximum weight matching serves the long queues first.)
Copyright 1999. All Rights Reserved 93
Input Queueing
Why is serving long/old queues better than
serving maximum number of queues?
• When traffic is uniformly distributed, servicing the
maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become
longer than others.
• A good algorithm keeps the queue lengths matched, and
services a large number of queues.
(Figure: average occupancy per VOQ, under uniform traffic and under non-uniform traffic.)

Copyright 1999. All Rights Reserved 94
Input Queueing
Practical Algorithms
• Maximal Size Algorithms
– Wave Front Arbiter (WFA)
– Parallel Iterative Matching (PIM)
– iSLIP
• Maximal Weight Algorithms
– Fair Access Round Robin (FARR)
– Longest Port First (LPF)

Copyright 1999. All Rights Reserved 95


Wave Front Arbiter

Requests Match
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
Copyright 1999. All Rights Reserved 96
Wave Front Arbiter

Requests Match

Copyright 1999. All Rights Reserved 97


Wave Front Arbiter
Implementation

Combinational
1,1 1,2 1,3 1,4 Logic Blocks
2,1 2,2 2,3 2,4

3,1 3,2 3,3 3,4

4,1 4,2 4,3 4,4

Copyright 1999. All Rights Reserved 98


Wave Front Arbiter
Wrapped WFA (WWFA)
N steps instead of
2N-1

Requests Match

Copyright 1999. All Rights Reserved 99


Input Queueing
Practical Algorithms
• Maximal Size Algorithms
– Wave Front Arbiter (WFA)
– Parallel Iterative Matching (PIM)
– iSLIP
• Maximal Weight Algorithms
– Fair Access Round Robin (FARR)
– Longest Port First (LPF)

Copyright 1999. All Rights Reserved 100


Parallel Iterative Matching
Random Selection

(Figure: iteration #1: each unmatched input requests all outputs it has cells for; each output randomly grants one request; each input randomly accepts one grant. Iteration #2 repeats among the ports left unmatched.)
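
A toy simulation of these two phases; the request pattern and port count below are made up:

import random

def pim(requests, iterations=3):
    """requests[i] = set of outputs input i has cells for; returns input -> output."""
    n = len(requests)
    match_in, match_out = {}, {}
    for _ in range(iterations):
        grants = {}                      # input -> list of outputs that granted it
        for out in range(n):             # grant phase: each unmatched output grants randomly
            if out in match_out:
                continue
            asking = [i for i in range(n) if i not in match_in and out in requests[i]]
            if asking:
                grants.setdefault(random.choice(asking), []).append(out)
        for i, outs in grants.items():   # accept phase: each input accepts one grant randomly
            out = random.choice(outs)
            match_in[i], match_out[out] = out, i
    return match_in

print(pim([{0, 1}, {0}, {2, 3}, {3}]))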
Copyright 1999. All Rights Reserved 101
Parallel Iterative Matching
Maximal is not Maximum
(Figure: an example where the maximal match found by PIM is smaller than the maximum possible match.)
Copyright 1999. All Rights Reserved 102
Parallel Iterative Matching
Analytical Results

Number of iterations to converge:


E[U_i] <= N / 4^i
E[C] <= log2(N)

Where:
C = # of iterations required to resolve connections
N = # of ports
U_i = # of unresolved connections after iteration i

Copyright 1999. All Rights Reserved 103


Parallel Iterative Matching

Copyright 1999. All Rights Reserved 104


Parallel Iterative Matching

Copyright 1999. All Rights Reserved 105


Parallel Iterative Matching

Copyright 1999. All Rights Reserved 106


Input Queueing
Practical Algorithms
• Maximal Size Algorithms
– Wave Front Arbiter (WFA)
– Parallel Iterative Matching (PIM)
– iSLIP
• Maximal Weight Algorithms
– Fair Access Round Robin (FARR)
– Longest Port First (LPF)

Copyright 1999. All Rights Reserved 107


iSLIP
Round-Robin Selection

(Figure: the same two-iteration grant/accept example as PIM, but each output grants and each input accepts according to a round-robin pointer rather than at random.)
Copyright 1999. All Rights Reserved 108
iSLIP
Properties
• Random under low load
• TDM under high load
• Lowest priority to MRU
• 1 iteration: fair to outputs
• Converges in at most N iterations. On average <=
log2N
• Implementation: N priority encoders
• Up to 100% throughput for uniform traffic
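
A sketch of one way to code the grant/accept phases, with pointers advanced only when a grant is accepted in the first iteration, as described above; the request pattern is made up:

def islip(requests, n, iterations=2):
    grant_ptr = [0] * n                  # one round-robin pointer per output
    accept_ptr = [0] * n                 # one per input
    match_in, match_out = {}, {}
    for it in range(iterations):
        grants = {}
        for out in range(n):             # grant: pick requesting input at/after pointer
            if out in match_out:
                continue
            asking = [i for i in range(n) if i not in match_in and out in requests[i]]
            if asking:
                i = min(asking, key=lambda i: (i - grant_ptr[out]) % n)
                grants.setdefault(i, []).append(out)
        for i, outs in grants.items():   # accept: pick granting output at/after pointer
            out = min(outs, key=lambda o: (o - accept_ptr[i]) % n)
            match_in[i], match_out[out] = out, i
            if it == 0:                  # pointers move only for first-iteration matches
                grant_ptr[out] = (i + 1) % n
                accept_ptr[i] = (out + 1) % n
    return match_in

print(islip([{0, 1}, {0}, {2, 3}, {3}], n=4))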
Copyright 1999. All Rights Reserved 109
iSLIP

Copyright 1999. All Rights Reserved 110


iSLIP

Copyright 1999. All Rights Reserved 111


iSLIP
Implementation: Programmable Priority Encoders

(Figure: N grant arbiters and N accept arbiters, each a programmable priority encoder taking N request bits plus a log2N-bit pointer (state) and producing a log2N-bit decision.)
Copyright 1999. All Rights Reserved 112


Input Queueing References
References
• M. Karol et al. “Input vs Output Queueing on a Space-Division Packet
Switch”, IEEE Trans Comm., Dec 1987, pp. 1347-1356.
• Y. Tamir, “Symmetric Crossbar arbiters for VLSI communication
switches”, IEEE Trans Parallel and Dist Sys., Jan 1993, pp.13-27.
• T. Anderson et al. “High-Speed Switch Scheduling for Local Area
Networks”, ACM Trans Comp Sys., Nov 1993, pp. 319-352.
• N. McKeown, “The iSLIP scheduling algorithm for Input-Queued
Switches”, IEEE Trans Networking, April 1999, pp. 188-201.
• C. Lund et al. “Fair prioritized scheduling in an input-buffered
switch”, Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69.
• A. Mekkittikul et al. “A Practical Scheduling Algorithm to Achieve
100% Throughput in Input-Queued Switches”, IEEE Infocom 98,
April 1998.
Copyright 1999. All Rights Reserved 113
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
– Scheduling algorithms
– Other non-blocking fabrics
– Combining input and output queues
– Multicast traffic

Copyright 1999. All Rights Reserved 114


Other Non-Blocking Fabrics
Clos Network

Copyright 1999. All Rights Reserved 115


Other Non-Blocking Fabrics
Clos Network
Expansion factor required = 2-1/N (but still blocking for multicast)

Copyright 1999. All Rights Reserved 116


Other Non-Blocking Fabrics
Self-Routing Networks

000 000
001 001
010 010
011 011
100 100
101 101
110 110
111 111

Copyright 1999. All Rights Reserved 117


Other Non-Blocking Fabrics
Self-Routing Networks
The Non-blocking Batcher Banyan Network
Batcher Sorter Self-Routing Network
3 7 7 7 7 7 7 000
7 2 5 0 4 6 6 001
5 3 2 5 5 4 5
010
2 5 3 1 6 5 4 011
6 6 1 3 0 3 3
100
0 1 0 4 3 2 2
101
1 0 6 2 1 0 1
110
4 4 4 6 2 2 0
111

• Fabric can be used as scheduler.


•Batcher-Banyan network is blocking for multicast.
Copyright 1999. All Rights Reserved 118
Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
– Scheduling algorithms
– Other non-blocking fabrics
– Combining input and output queues
– Multicast traffic

Copyright 1999. All Rights Reserved 119


Speedup

• Context
– input-queued switches
– output-queued switches
– the speedup problem
• Early approaches
• Algorithms
• Implementation considerations

Copyright 1999. All Rights Reserved 120


Speedup: Context
(Figure: a generic switch with memory at the inputs and memory at the outputs of the fabric.)

A generic switch

The placement of memory gives


- Output-queued switches
- Input-queued switches
- Combined input- and output-queued switches

Copyright 1999. All Rights Reserved 121


Output-queued switches

Best delay and throughput performance


- Possible to erect “bandwidth firewalls” between sessions

Main problem
- Requires high fabric speedup (S = N)

Unsuitable for high-speed switching

Copyright 1999. All Rights Reserved 122


Input-queued switches

Big advantage
- Speedup of one is sufficient
Main problem
- Can’t guarantee delay due to input contention

Overcoming input contention: use higher speedup

Copyright 1999. All Rights Reserved 123


A Comparison
Memory speeds for 32x32 switch
             Output-queued               Input-queued
Line Rate    Memory BW    Access Time    Memory BW    Access Time (per cell)
100 Mb/s     3.3 Gb/s     128 ns         200 Mb/s     2.12 us
1 Gb/s       33 Gb/s      12.8 ns        2 Gb/s       212 ns
2.5 Gb/s     82.5 Gb/s    5.12 ns        5 Gb/s       84.8 ns
10 Gb/s      330 Gb/s     1.28 ns        20 Gb/s      21.2 ns

Copyright 1999. All Rights Reserved 124


The Speedup Problem
Find a compromise: 1 < Speedup << N
- to get the performance of an OQ switch
- close to the cost of an IQ switch

Essential for high speed QoS switching

Copyright 1999. All Rights Reserved 125


Some Early Approaches
Probabilistic Analyses
- assume traffic models (Bernoulli, Markov-modulated,
non-uniform loading, “friendly correlated”)
- obtain mean throughput and delays, bounds on tails
- analyze different fabrics (crossbar, multistage, etc)

Numerical Methods
- use actual and simulated traffic traces
- run different algorithms
- set the “speedup dial” at various values

Copyright 1999. All Rights Reserved 126


The findings

Very tantalizing ...


- under different settings (traffic, loading, algorithm, etc)
- and even for varying switch sizes

A speedup of between 2 and 5 was sufficient!

Copyright 1999. All Rights Reserved 127


Using Speedup

1
2

1
2

Copyright 1999. All Rights Reserved 128


Intuition
Bernoulli IID inputs
Speedup = 1
Fabric throughput = .58

Bernoulli IID inputs

Speedup = 2 Fabric throughput = 1.16


Input efficiency = 1/1.16

Ave I/p queue = 6.25

Copyright 1999. All Rights Reserved 129


Intuition (continued)
Bernoulli IID inputs

Fabric throughput = 1.74


Speedup = 3
Input efficiency = 1/1.74
Ave I/p queue = 1.35

Bernoulli IID inputs

Speedup = 4 Fabric throughput = 2.32


Input efficiency = 1/2.32
Ave I/p queue = 0.75

Copyright 1999. All Rights Reserved 130


Issues

Need hard guarantees


- exact, not average

Robustness
- realistic, even adversarial, traffic
not friendly Bernoulli IID

Copyright 1999. All Rights Reserved 131


The Ideal Solution
Inputs Outputs
Speedup = N

?
Speedup << N

Question: Can we find


- a simple and good algorithm
- that exactly mimics output-queueing
- regardless of switch sizes and traffic patterns?

Copyright 1999. All Rights Reserved 132


What is exact mimicking?

Apply same inputs to an OQ and a CIOQ switch


- packet by packet

Obtain same outputs


- packet by packet

Copyright 1999. All Rights Reserved 133


Algorithm - MUCF

Key concept: urgency value


- urgency = departure time - present time

Copyright 1999. All Rights Reserved 134


MUCF

The algorithm

- Outputs try to get their most urgent packets


- Inputs grant to output whose packet is most
urgent, ties broken by port number
- Loser outputs for next most urgent packet
- Algorithm terminates when no more matchings
are possible

Copyright 1999. All Rights Reserved 135


Stable Marriage Problem

Men = Outputs

Bill John Pedro

Women = Inputs

Hillary Monica Maria

Copyright 1999. All Rights Reserved 136


An example

Observation: Only two reasons a packet doesn’t get to its output


- Input contention, Output contention
- This is why speedup of 2 works!!
Copyright 1999. All Rights Reserved 137
What does this get us?
Speedup of 4 is sufficient for exact emulation of FIFO
OQ switches, with MUCF
What about non-FIFO OQ switches?
E.g. WFQ, Strict priority

Copyright 1999. All Rights Reserved 138


Other results
To exactly emulate an NxN OQ switch

- Speedup of 2 - 1/N is necessary and sufficient


(Hence a speedup of 2 is sufficient for all N)

- Input traffic patterns can be absolutely arbitrary

- Emulated OQ switch may use a “monotone”


scheduling policies

- E.g.: FIFO, LIFO, strict priority, WFQ, etc

Copyright 1999. All Rights Reserved 139


What gives?
Complexity of the algorithms
- Extra hardware for processing
- Extra run time (time complexity)

What is the benefit?


- Reduced memory bandwidth requirements

Tradeoff: Memory for processing


- Moore’s Law supports this tradeoff

Copyright 1999. All Rights Reserved 140


Implementation - a closer look
Main sources of difficulty
- Estimating urgency, etc - info is distributed
(and communicating this info among I/ps and O/ps)
- Matching process - too many iterations?

Estimating urgency depends on what is being emulated


- Like taking a ticket to hold a place in a queue
- FIFO, Strict priorities - no problem
- WFQ, etc - problems

Copyright 1999. All Rights Reserved 141


Implementation (contd)

Matching process
- A variant of the stable marriage problem
- Worst-case number of iterations for SMP = N2
- Worst-case number of iterations in switching =
N
- High probability and average approximately log(N)

Copyright 1999. All Rights Reserved 142


Other Work
Relax stringent requirement of exact emulation
- Least Occupied O/p First Algorithm (LOOFA)
Keeps outputs always busy if there are packets
By time-stamping packets, it can also exactly mimic an OQ switch

- Disallow arbitrary inputs


E.g. leaky bucket constrained
Obtain worst-case delay bounds

Copyright 1999. All Rights Reserved 143


References for speedup
- Y. Oie et al, “Effect of speedup in nonblocking packet switch’’, ICC 89.

- A.L. Gupta, N.D. Georganas, “Analysis of a packet switch with input and
output buffers and speed constraints”, Infocom 91.
- S-T. Chuang et al, “Matching output queueing with a combined input and
output queued switch”, IEEE JSAC, vol 17, no 6, 1999.
- B. Prabhakar, N. McKeown, “On the speedup required for combined input
and output queued switching”, Automatica, vol 35, 1999.
- P. Krishna et al, “On the speedup required for work-conserving crossbar
switches”, IEEE JSAC, vol 17, no 6, 1999.
- A. Charny, “Providing QoS guarantees in input buffered crossbar switches
with speedup”, PhD Thesis, MIT, 1998.

Copyright 1999. All Rights Reserved 144


Switching Fabrics
• Output and Input Queueing
• Output Queueing
• Input Queueing
– Scheduling algorithms
– Other non-blocking fabrics
– Combining input and output queues
– Multicast traffic

Copyright 1999. All Rights Reserved 145


Multicast Switching

• The problem
• Switching with crossbar fabrics
• Switching with other fabrics

Copyright 1999. All Rights Reserved 146


Multicasting

1 3 5

4 6

Copyright 1999. All Rights Reserved 147


Crossbar fabrics: Method 1

Copy network + unicast switching

Copy networks

Increased hardware, increased input contention

Copyright 1999. All Rights Reserved 148


Method 2
Use copying properties of crossbar fabric

No fanout-splitting: Easy, but low


throughput

Fanout-splitting: higher
throughput, but not as simple.
Leaves “residue”.

Copyright 1999. All Rights Reserved 149


The effect of fanout-splitting

Performance of an 8x8 switch with and without fanout-splitting


under uniform IID traffic
Copyright 1999. All Rights Reserved 150
Placement of residue
Key question: How should outputs grant requests?
(and hence decide placement of residue)

Copyright 1999. All Rights Reserved 151


Residue and throughput
Result: Concentrating residue brings more new work
forward. Hence leads to higher throughput.

But, there are fairness problems to deal with.

This and other problems can be looked at in a unified


way by mapping the multicasting problem onto a
variation of Tetris.

Copyright 1999. All Rights Reserved 152


Multicasting and Tetris
Input ports
1 2 3 4 5

Residue

1 2 3 4 5
Output ports

Copyright 1999. All Rights Reserved 153


Multicasting and Tetris
Input ports
1 2 3 4 5

Residue
Concentrated

1 2 3 4 5
Output ports

Copyright 1999. All Rights Reserved 154


Replication by recycling
Main idea: Make two copies at a time using a binary tree
with input at root and all possible destination outputs at
the leaves.

(Figure: a multicast cell for destinations {a, b, c, d, e} is duplicated two at a time in a binary tree; intermediate copies x and y are recycled through the switch until every leaf destination has its own copy.)

Copyright 1999. All Rights Reserved 155


Replication by recycling (cont’d)
(Figure: copies recycle from the output side of the switch back to the inputs; cells are resequenced at the outputs before transmission.)

Scaleable to large fanouts. Needs resequencing at outputs and


introduces variable delays.

Copyright 1999. All Rights Reserved 156


References for Multicasting
• J. Hayes et al. “Performance analysis of a multicast
switch”, IEEE Trans. on Communications, vol 39, April
1991.
• B. Prabhakar et al. “Tetris models for multicast switches”,
Proc. of the 30th Annual Conference on Information
Sciences and Systems, 1996
• B. Prabhakar et al. “Multicast scheduling for input-queued
switches”, IEEE JSAC, 1997
• J. Turner, “An optimal nonblocking multicast virtual
circuit switch”, INFOCOM, 1994

Copyright 1999. All Rights Reserved 157


Tutorial Outline
• Introduction:
What is a Packet Switch?
• Packet Lookup and Classification:
Where does a packet go next?
• Switching Fabrics:
How does the packet get there?
• Output Scheduling:
When should the packet leave?

Copyright 1999. All Rights Reserved 158


Output Scheduling

• What is output scheduling?


• How is it done?
• Practical Considerations

Copyright 1999. All Rights Reserved 159


Output Scheduling

Allocating output bandwidth


Controlling packet delay

scheduler

Copyright 1999. All Rights Reserved 160


Output Scheduling

FIFO

Fair Queueing

Copyright 1999. All Rights Reserved 161


Motivation
• FIFO is natural but gives poor QoS
– bursty flows increase delays for others
– hence cannot guarantee delays

Need round robin scheduling of packets


– Fair Queueing
– Weighted Fair Queueing, Generalized Processor Sharing

Copyright 1999. All Rights Reserved 162


Fair queueing: Main issues
• Level of granularity
– packet-by-packet? (favors long packets)
– bit-by-bit? (ideal, but very complicated)

• Packet Generalized Processor Sharing (PGPS)


– serves packet-by-packet
– and imitates bit-by-bit schedule within a tolerance

Copyright 1999. All Rights Reserved 163


How does WFQ work?

WR = 1
WG = 5
WP = 2
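
A simplified sketch of how these weights translate into service order, assuming all three flows stay backlogged (so the virtual-time machinery of full WFQ isn't needed) and equal-sized packets:

def wfq_order(queues, weights):
    finish, out = {q: 0.0 for q in queues}, []
    for q, pkts in queues.items():
        for length in pkts:
            finish[q] += length / weights[q]     # finish number F = F_prev + L/w
            out.append((finish[q], q))
    out.sort()
    return [q for _, q in out]

# Flows from the slide: weights R = 1, G = 5, P = 2, equal-sized packets.
order = wfq_order({"R": [1] * 3, "G": [1] * 8, "P": [1] * 4}, {"R": 1, "G": 5, "P": 2})
print("".join(order))    # roughly five G's and two P's for every R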

Copyright 1999. All Rights Reserved 164


Delay guarantees

• Theorem

If flows are leaky bucket constrained and all nodes


employ GPS (WFQ), then the network can
guarantee worst-case delay bounds to sessions.

Copyright 1999. All Rights Reserved 165


Practical considerations
• For every packet, the scheduler needs to
– classify it into the right flow queue and maintain a linked-list
for each flow
– schedule it for departure

• Complexities of both are o(log [# of flows])


– first is hard to overcome
– second can be overcome by DRR

Copyright 1999. All Rights Reserved 166


Deficit Round Robin

(Figure: three flow queues holding packets of 200/600/100/400, 400/600/500, and 50/700/250/500 bytes; each queue has a deficit counter that is increased by the quantum on every round-robin visit and decreased by the bytes it sends.)

Quantum size = 500 bytes

Good approximation of FQ
Much simpler to implement
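
A sketch of the deficit counters in action, with a 500-byte quantum and packet sizes loosely taken from the figure:

from collections import deque

def drr(queues, quantum=500, rounds=4):
    """Each non-empty queue gets `quantum` bytes of credit per round and sends
    head-of-line packets as long as they fit within its deficit counter."""
    deficit = {q: 0 for q in queues}
    sent = []
    for _ in range(rounds):
        for q, pkts in queues.items():
            if not pkts:
                continue
            deficit[q] += quantum
            while pkts and pkts[0] <= deficit[q]:
                deficit[q] -= pkts[0]
                sent.append((q, pkts.popleft()))
            if not pkts:
                deficit[q] = 0            # an emptied queue keeps no credit
    return sent

qs = {"A": deque([200, 600, 100, 400]),
      "B": deque([400, 600, 500]),
      "C": deque([50, 700, 250, 500])}
print(drr(qs))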

Copyright 1999. All Rights Reserved 167


But...

• WFQ is still very hard to implement


– classification is a problem
– needs to maintain too much state information
– doesn’t scale well

Copyright 1999. All Rights Reserved 168


Strict Priorities and Diff Serv
• Classify flows into priority classes
– maintain only per-class queues
– perform FIFO within each class
– avoid “curse of dimensionality”

Copyright 1999. All Rights Reserved 169


Diff Serv
• A framework for providing differentiated QoS
– set Type of Service (ToS) bits in packet headers
– this classifies packets into classes
– routers maintain per-class queues
– condition traffic at network edges to conform to

class requirements
May still need queue management inside the network

Copyright 1999. All Rights Reserved 170


References for O/p Scheduling
- A. Demers et al, “Analysis and simulation of a fair queueing algorithm”,
ACM SIGCOMM 1989.
- A. Parekh, R. Gallager, “A generalized processor sharing approach to
flow control in integrated services networks: the single node
case”, IEEE Trans. on Networking, June 1993.
- A. Parekh, R. Gallager, “A generalized processor sharing approach to
flow control in integrated services networks: the multiple node
case”, IEEE Trans. on Networking, August 1993.
- M. Shreedhar, G. Varghese, “Efficient Fair Queueing using Deficit Round
Robin”, ACM SIGCOMM, 1995.
- K. Nichols, S. Blake (eds), “Differentiated Services: Operational Model
and Definitions”, Internet Draft, 1998.

Copyright 1999. All Rights Reserved 171


Active Queue Management
• Problems with traditional queue management
– tail drop
• Active Queue Management
– goals
– an example
– effectiveness

Copyright 1999. All Rights Reserved 172


Tail Drop Queue Management
Lock-Out

Max Queue Length

Copyright 1999. All Rights Reserved 173


Tail Drop Queue Management

• Drop packets only when queue is full


– long steady-state delay
– global synchronization
– bias against bursty traffic

Copyright 1999. All Rights Reserved 174


Global Synchronization

Max Queue Length

Copyright 1999. All Rights Reserved 175


Bias Against Bursty Traffic

Max Queue Length

Copyright 1999. All Rights Reserved 176


Alternative Queue Management
Schemes
• Drop from front on full queue

• Drop at random on full queue

⇒ both solve the lock-out problem
⇒ both have the full-queues problem

Copyright 1999. All Rights Reserved 177


Active Queue Management
Goals
• Solve lock-out and full-queue problems
– no lock-out behavior
– no global synchronization
– no bias against bursty flow
• Provide better QoS at a router
– low steady-state delay
– lower packet dropping

Copyright 1999. All Rights Reserved 178


Active Queue Management
• Problems with traditional queue management
– tail drop
• Active Queue Management
– goals
– an example
– effectiveness

Copyright 1999. All Rights Reserved 179


Random Early Detection (RED)
(Figure: a FIFO queue holding packets P1 … Pk, with thresholds minth and maxth defined on the average queue size qavg.)



if qavg < minth: admit every packet

else if qavg <= maxth: drop an incoming packet
with p = (qavg - minth)/(maxth - minth)

else if qavg > maxth: drop every incoming packet
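
The drop decision above, sketched in Python; the qavg update shown uses an exponentially weighted moving average, with a weight wq that is a typical value rather than one taken from the slide:

import random

class Red:
    def __init__(self, minth, maxth, wq=0.002):
        self.minth, self.maxth, self.wq = minth, maxth, wq
        self.qavg = 0.0

    def arrival(self, queue_len):
        # Maintain the average queue size as an EWMA of the instantaneous length.
        self.qavg = (1 - self.wq) * self.qavg + self.wq * queue_len
        if self.qavg < self.minth:
            return "admit"
        if self.qavg <= self.maxth:
            p = (self.qavg - self.minth) / (self.maxth - self.minth)
            return "drop" if random.random() < p else "admit"
        return "drop"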

Copyright 1999. All Rights Reserved 180


Effectiveness of RED: Lock-Out

• Packets are randomly dropped


• Each flow has the same probability of being discarded

Copyright 1999. All Rights Reserved 181


Effectiveness of RED: Full-Queue
• Drop packets probabilistically in anticipation of congestion (not when queue is full)

• Use qavg to decide packet dropping probability: allow instantaneous bursts

• Randomness avoids global synchronization

Copyright 1999. All Rights Reserved 182


What QoS does RED Provide?
• Lower buffer delay: good interactive service
– qavg is controlled to be small
• Given responsive flows: packet dropping is reduced
– early congestion indication allows traffic to throttle back before congestion
• Given responsive flows: fair bandwidth allocation

Copyright 1999. All Rights Reserved 183


Unresponsive or aggressive flows

• Don’t properly back off during congestion


• Take away bandwidth from TCP
compatible flows
• Monopolize buffer space

Copyright 1999. All Rights Reserved 184


Control Unresponsive Flows
• Some active queue management schemes

– RED with penalty box


– Flow RED (FRED)
– Stabilized RED (SRED)

identify and penalize unresponsive flows with a bit of extra work

Copyright 1999. All Rights Reserved 185


Active Queue Management
References
• B. Braden et al. “Recommendations on queue management
and congestion avoidance in the internet”, RFC2309, 1998.
• S. Floyd, V. Jacobson, “Random early detection gateways
for congestion avoidance”, IEEE/ACM Trans. on
Networking, 1(4), Aug. 1993.
• D. Lin, R. Morris, “Dynamics on random early detection”,
ACM SIGCOMM, 1997
• T. Ott et al. “SRED: Stabilized RED”, INFOCOM 1999
• S. Floyd, K. Fall, “Router mechanisms to support end-to-
end congestion control”, LBL technical report, 1997

Copyright 1999. All Rights Reserved 186


Tutorial Outline
• Introduction:
What is a Packet Switch?
• Packet Lookup and Classification:
Where does a packet go next?
• Switching Fabrics:
How does the packet get there?
• Output Scheduling:
When should the packet leave?

Copyright 1999. All Rights Reserved 187


Basic Architectural Components

Control: Admission Control, Congestion Control, Reservation Control, Routing
Datapath (per-packet processing): Policing, Switching, Output Scheduling

Copyright 1999. All Rights Reserved 188


Basic Architectural Components
Datapath: per-packet processing
1. Forwarding decision (forwarding table lookup at each input line card)
2. Interconnect
3. Output scheduling
Copyright 1999. All Rights Reserved 189
