
Per-Hop Packet Processing

High Speed Router Design

Routers with CQS Architecture

1
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement

2
Where IP routers sit in the network

Core
router

The Internet Core

Edge
Router

3
Basic Architectural Components

Two key router functions:


run routing algorithms/protocol (RIP, OSPF, BGP)
switching datagrams from incoming to outgoing link

[Figure: control plane: routing protocols (RIP, OSPF, BGP) maintain the
routing table; datapath: the forwarding table drives per-packet
switching.]

4
Per-packet processing in an IP Router
1. Accept packet arriving on an incoming link.
2. Lookup packet destination address in the forwarding table => to
identify outgoing port(s).
3. Manipulate packet header: e.g., decrement TTL, update header
checksum.
4. Send packet to the outgoing port(s).
5. Buffer packet in the queue.
6. Transmit packet onto outgoing link.
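Step 3 (decrement TTL, update the header checksum) does not require recomputing the checksum over the whole header: it can be updated incrementally, RFC 1624 style (HC' = ~(~HC + ~m + m')). A minimal Python sketch; the header words below are an arbitrary illustrative IPv4 header, not taken from the slides:

```python
def ones_sum(a, b):
    s = a + b
    return (s & 0xFFFF) + (s >> 16)          # end-around carry

def checksum(words):
    """Full 16-bit one's-complement checksum over header words."""
    s = 0
    for w in words:
        s = ones_sum(s, w)
    return ~s & 0xFFFF

def update_checksum(old_cksum, old_word, new_word):
    """Incremental update when one 16-bit header word changes (RFC 1624)."""
    s = ones_sum(~old_cksum & 0xFFFF, ~old_word & 0xFFFF)
    s = ones_sum(s, new_word)
    return ~s & 0xFFFF

header = [0x4500, 0x0073, 0x0000, 0x4000, 0x4011,   # TTL=0x40 in word 4
          0xC0A8, 0x0001, 0xC0A8, 0x00C7]           # checksum word omitted
old_ck = checksum(header)
new_header = list(header)
new_header[4] -= 0x0100                             # TTL: 0x40 -> 0x3F
print(hex(update_checksum(old_ck, header[4], new_header[4])))  # -> 0xb961
```

The incremental result must agree with a full recomputation over the modified header; real routers do this in hardware on the fast path.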

5
Input Port Functions

Physical layer:
bit-level reception
Data link layer:
e.g., Ethernet
Decentralized switching:
given datagram dest., lookup output port using
routing table in input port memory
goal: complete input port processing at line speed
queueing: if datagrams arrive faster than
forwarding rate into switch fabric
6
Output Ports

Buffering required when datagrams arrive from fabric


faster than the transmission rate
Scheduling discipline chooses among queued datagrams for
transmission (to be discussed later)

7
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement

8
First Generation Routers

[Figure: a shared backplane connecting a CPU, route table, and buffer
memory to multiple line interfaces (MAC).]

Typically <0.5Gb/s aggregate capacity


9
First Generation Routers
Switching Via Memory
[Figure: input port -> memory/CPU -> output port, over the system bus.]

Follow conventional computer architecture


A shared central bus, with a central CPU, memory and peripheral
Line Cards.
Each Line Card connecting the system to each of the external links.
Packets arriving from a link are transferred across the shared bus
to the CPU, where a forwarding decision is made.
The packet is then transferred across the bus again to its outgoing
Line Card, and onto the external link.
speed limited by memory bandwidth (2 bus crossings per packet)
10
First Generation Routers
Queueing Structure: Shared Memory

[Figure: N inputs and N outputs sharing a large, single, dynamically
allocated memory buffer.]

Numerous works have proven and made possible:
Fairness
Delay Guarantees
Delay Variation Control
Loss Guarantees
Statistical Guarantees

Large, single dynamically allocated memory buffer:
N writes per pkt time
N reads per pkt time
Limited by memory bandwidth.
11
First Generation Routers
How fast can we make centralized shared memory?

[Figure: shared memory built from 5ns SRAM, accessed over a
200-byte-wide bus.]

5ns per memory operation
Two memory operations per packet (1 write + 1 read)
Therefore, up to 160Gb/s (200 bytes every 10ns)
In practice, closer to 80Gb/s

12
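The 160Gb/s figure above follows directly from the memory parameters; a back-of-the-envelope check:

```python
# Each packet needs one write and one read, each taking one 5ns SRAM
# access, over a 200-byte-wide bus (numbers from the slide).
access_ns = 5
bus_bytes = 200
ops_per_packet = 2                                   # 1 write + 1 read

time_per_packet_ns = access_ns * ops_per_packet      # 10 ns per packet
gbps = bus_bytes * 8 / time_per_packet_ns            # bits per ns = Gb/s
print(gbps)  # -> 160.0
```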
Second Generation Routers

[Figure: central CPU, route table, and buffer memory on a shared bus
(the "slow path"); each line card has its own buffer memory and
forwarding cache, with a drop policy or backpressure at the buffers,
MAC at each card, and scheduling at the output link.]

Typically <5Gb/s aggregate capacity


13
Second Generation Routers
Operations

Placing a separate CPU at each interface.


A local forwarding decision is made in a dedicated CPU, based
on its local forwarding table cache.
The packet is immediately forwarded to its outgoing interface.
(i.e. port mapping intelligence in Line Cards)
This has the additional benefit that each packet need only
traverse the bus once, thus increasing the system throughput.
The central CPU is needed to maintain the forwarding tables in
each of the other CPUs, and for centralized system
management functions.

14
Second Generation Routers
Queueing Structure: Combined Input and Output Queueing

[Figure: input queues and output queues connected by a shared bus;
1 write per packet time at the inputs and 1 read per packet time at
the outputs, with the rate of writes/reads determined by bus speed.]

15
E.g. Cisco 7507 router

Front view
Rear view

Backplane
16
Third Generation Routers

[Figure: line cards with local buffer memory and forwarding tables,
plus a CPU card holding the routing table, interconnected by a
switched backplane (MAC at each line card).]

Typically <50Gb/s aggregate capacity


17
Third Generation Routers

BUT forwarding decisions are made in software, and so are


limited by the speed of a general purpose CPU.
Carefully designed, special purpose ASICs can readily
outperform a CPU when making forwarding decisions, managing
queues, and arbitrating access to the bus.
A shared bus allows only one packet to traverse at a time between
two Line Cards.
Replacing the shared bus by a crossbar switch
=> multiple Line Cards can communicate with each other
simultaneously greatly increasing the system throughput.

18
Third Generation Routers
Queueing Structure

[Figure: input and output queues connected by a switch fabric;
1 write per pkt time at the inputs and 1 read per pkt time at the
outputs, with the rate of writes/reads determined by switch fabric
speedup.]

19
E.g. Cisco 12000 series routers

http://www.cisco.com/warp/public/cc/pd/rt/12000/12416/
The Cisco 12000 series offers industry leading scalability, high
performance, and guaranteed priority packet delivery through an
innovative distributed architecture design that enables service providers
to accelerate the evolution of the Internet through delivery of profitable,
next generation services.
The Cisco 12416 Internet router is a 10 Gigabit, 16-slot chassis member
of the Cisco 12000 series that provides a total switching capacity of 320
Gigabits per second (Gbps), with 20 Gbps (10 Gbps full duplex) capacity
per slot.

20
Routers vs. Gateways

Researchers who invented TCP/IP defined the term IP


Gateway to refer to the systems that interconnected
networks and forwarded IP datagrams among them.
By the early 1990s, vendors had hired marketing people to help
them sell products. One vendor thought IP router sounded
better than IP gateway, and others quickly followed the lead.
When Microsoft incorporated TCP/IP software into their
Windows system, they chose to make the configuration screen
ask the user to enter a Gateway address.
So, the terms IP Gateway and IP router are synonymous.

21
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement

22
Forwarding Engine

[Figure: a packet (payload + header) arrives at the router; its
destination address is looked up in the routing lookup data
structure to determine the outgoing port.]

Forwarding Table
Dest-network Port
65.0.0.0/8 3
128.9.0.0/16 1

149.12.0.0/19 7
23
What makes table lookup difficult?

The Search Operation is not a Direct Lookup


32 bits address => 4 G entries

The Search Operation is also not an Exact Match Search
Exact match search: search for a key in a collection of keys
of the same length.

Metrics for lookup:


Lookup time
Storage space
Update time
Preprocessing time

24
An example

Destination IP Prefix    Outgoing Port
65.0.0.0/8               3
128.9.0.0/16             1
142.12.0.0/19            7

IP prefix: 0-32 bits (the prefix length)

[Figure: the address line from 0 to 2^32-1, with the ranges covered by
65.0.0.0/8 (65.0.0.0 - 65.255.255.255), 128.9.0.0/16, and
142.12.0.0/19 marked; the address 128.9.16.14 falls inside
128.9.0.0/16.]
25
Prefixes can overlap!

Dest. IP Prefix    Outgoing Port
65.0.0.0/8         3
128.9.0.0/16       1
142.12.0.0/19      7
128.9.16.0/21      2
128.9.172.0/21     4
128.9.176.0/24     5

[Figure: prefix length (8 to 32) plotted against the address line from
0 to 2^32-1; 128.9.16.0/21, 128.9.172.0/21, and 128.9.176.0/24 nest
inside 128.9.0.0/16. For the address 128.9.16.14, the longest
matching prefix is 128.9.16.0/21.]

Routing lookup: Find the longest matching prefix (i.e.
the most specific route) among all prefixes that match
the destination address.
26
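Longest-prefix match over the overlapping table above can be illustrated with a linear scan (illustration only; real routers use tries or TCAMs). Note `strict=False` because 128.9.172.0/21 is not aligned to a /21 boundary:

```python
import ipaddress

table = {
    "65.0.0.0/8": 3, "128.9.0.0/16": 1, "142.12.0.0/19": 7,
    "128.9.16.0/21": 2, "128.9.172.0/21": 4, "128.9.176.0/24": 5,
}

def lookup(dst):
    addr = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(p, strict=False), port)
               for p, port in table.items()
               if addr in ipaddress.ip_network(p, strict=False)]
    if not matches:
        return None
    # the most specific route = the largest prefix length
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(lookup("128.9.16.14"))   # /16 and /21 both match; /21 wins -> 2
print(lookup("128.9.200.1"))   # only 128.9.0.0/16 matches -> 1
```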
Lookup rate required

Year      Line     Line-rate (Gbps)   40B packets (Mpps)
1998-99   OC12c    0.622              1.94
1999-00   OC48c    2.5                7.81
2000-01   OC192c   10.0               31.25
2002-03   OC768c   40.0               125

31.25 Mpps => one lookup every 32 ns

DRAM: 50-80 ns, SRAM: 5-10 ns


27
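The 32ns budget in the table is just the transmission time of a minimum-size packet at line rate; a quick sketch:

```python
# Time budget per lookup for minimum-size (40B) packets at line rate.
def ns_per_packet(line_rate_gbps, pkt_bytes=40):
    return pkt_bytes * 8 / line_rate_gbps    # 1 Gb/s = 1 bit per ns

for name, rate in [("OC48c", 2.5), ("OC192c", 10.0), ("OC768c", 40.0)]:
    print(name, ns_per_packet(rate))  # OC192c -> 32.0 ns: SRAM-class only
```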
SONET/SDH http://www.sonet.com/edu/edu.htm

Optical  Electrical  Line Rate  Payload Rate  Overhead Rate  SDH
Level    Level       (Mbps)     (Mbps)        (Mbps)         Equivalent
OC-1     STS-1       51.840     50.112        1.728          -
OC-3     STS-3       155.520    150.336       5.184          STM-1
OC-9     STS-9       466.560    451.008       15.552         STM-3
OC-12    STS-12      622.080    601.344       20.736         STM-4
OC-18    STS-18      933.120    902.016       31.104         STM-6
OC-24    STS-24      1244.160   1202.688      41.472         STM-8
OC-36    STS-36      1866.240   1804.032      62.208         STM-12
OC-48    STS-48      2488.320   2405.376      82.944         STM-16
OC-96    STS-96      4976.640   4810.752      165.888        STM-32
OC-192   STS-192     9953.280   9621.504      331.776        STM-64

28
Table growth of a typical backbone router

29
Prefix length distribution

[Figure: prefix length distribution, with the multicast address range
marked.]

30
A standard solution: trie

e.g. packet with address 128.32.1.20


entries in the routing table:
...
(128.32.*, 3),
(128.32.1.*, 4),
...

31
Another example

Dest. IP Prefix    Outgoing Port
65.0.0.0/8         3
128.9.0.0/16       1
142.12.0.0/19      7
128.9.16.0/21      2
128.9.172.0/21     4
128.9.176.0/24     5

[Figure: a trie whose Root branches to 65 (65.*), 128, and 142; 142
branches to 12 (142.12.*); 128 branches to 9, which branches to 16
(128.9.16.*), 172 (128.9.172.*), and 176 (128.9.176.*).]
32
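The trie above can be sketched as a minimal binary (one bit per level) trie; production tries use multi-bit strides and path compression:

```python
class TrieNode:
    def __init__(self):
        self.child = [None, None]
        self.port = None                    # set if a prefix ends here

def ip_bits(dotted):
    a, b, c, d = (int(x) for x in dotted.split("."))
    v = (a << 24) | (b << 16) | (c << 8) | d
    return [(v >> (31 - i)) & 1 for i in range(32)]

def insert(root, prefix, port):
    dotted, plen = prefix.split("/")
    node = root
    for bit in ip_bits(dotted)[: int(plen)]:
        if node.child[bit] is None:
            node.child[bit] = TrieNode()
        node = node.child[bit]
    node.port = port

def longest_prefix_match(root, dst):
    node, best = root, None
    for bit in ip_bits(dst):
        node = node.child[bit]
        if node is None:
            break
        if node.port is not None:
            best = node.port                # remember deepest matching prefix
    return best

root = TrieNode()
for p, port in [("65.0.0.0/8", 3), ("128.9.0.0/16", 1), ("142.12.0.0/19", 7),
                ("128.9.16.0/21", 2), ("128.9.172.0/21", 4),
                ("128.9.176.0/24", 5)]:
    insert(root, p, port)
print(longest_prefix_match(root, "128.9.16.14"))   # -> 2
```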
Need more than IPv4 unicast lookups

Multicast
Longest Prefix Matching on the source and group address

IPv6
128bit destination address field
Exact address architecture not yet known

Packet classification

33
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement

34
Basic Architectural Components
Datapath: per-packet processing

35
Input Queueing vs Output Queueing

Input Queueing: usually a non-blocking switch fabric, e.g. crossbar
(or, single path switch)
Output Queueing: usually a fast bus
(or, multiple path switch)
36
Switch fabrics

Switch fabrics transfer data from input to output, ignoring
scheduling and buffering.
There are many types of fabric architectures.
Choosing one usually depends on where the switch will exist in the
network and the amount of traffic it will have to carry.

Switch Fabrics:
Time domain
  Shared media
  Shared memory
Space domain
  Single path
    Crossbar
    Broadcast
    Banyan
    Batcher-banyan
  Multiple path
    Replicated Banyan
    Dilated Banyan
    Tandem Banyan
37
Three types of switching fabrics

[Figure: memory, bus, and interconnection networks (e.g. crossbar).]

38
Switching Via Memory
packet copied by system's (single) CPU
speed limited by memory bandwidth

39
Switching Via Bus

datagram from input port memory


to output port memory via a shared bus
bus contention: switching speed limited by
bus bandwidth
1 Gbps bus, Cisco 1900: sufficient speed
for access and enterprise routers (not
regional or backbone)

40
Switching Via Interconnection Networks
An interconnection network is usually constructed using 2 x 2
switching elements, e.g. crossbar switch

Cross state Bar state

The way those switching elements are interconnected determines
the resulting switch architecture.
Overcome bus bandwidth limitations

Major types of Interconnection Networks:


Crossbar
Banyan based
Note: For an interconnection network, the number of 2x2
switching elements required is considered as a good measure
for complexity
41
Fundamental Properties of Interconnection
Networks
[Figure: an N x N interconnection network built from 2x2 switching
elements, with inputs 1..N and outputs 1..N.]

An interconnection network is usually constructed using 2 x 2
switching elements. The way those switching elements are
interconnected determines the resulting switch architecture.
We can find that there should be N^N input-output mappings
for a nonblocking packet switch, since there can be up to N
packets destined for a same output port simultaneously.

Output blocking/contention occurs at an output port if it cannot


accept the amount of packets destined to it in a packet transmission
time, or a time slot.
42
How to deal with output blocking?
[Figure: two N x N fabrics; in both cases the group size, i.e. the
maximum # of packets that can be received by an output per time
slot, increases.]

Speed-up
mainly adopted by time-domain switch fabrics
If in a packet transmission time, m packets can be forwarded to a
same output port, the switch fabric is said to have a speed-up
factor of m.

Multiple paths
mainly adopted by space-domain switch fabrics
Different packets can follow different paths to arrive at a same
output port simultaneously.
=> no speed-up is required, but the switch fabric is more
complicated in order to provide multiple paths
43
How to deal with excess packets?

The switch can either drop excess packets that cannot be


switched (loss systems), or it can buffer them for output
access in the next time slot (waiting systems).
In a loss system, one can increase the group size to reduce the
packet loss probability.
But the switch complexity increases.
Besides, packets can be dropped as a result of internal blocking in
the switch, as that for banyan networks
In a waiting system, excess packets can be buffered at inputs,
internally in the switch, or at outputs. A large group size =>
higher throughput.

Note: if group size > 1, output buffer is always required.

44
Crossbar Switch

[Figure: a 4 x 4 crossbar with data in on input buses 1-4 and data
out on output buses 1-4; each crosspoint can be configured in the
cross state or the bar state.]

2N buses in parallel
Complexity = N^2 crosspoints
45
Banyan Network

[Figure: 4 x 4, 8 x 8, and 16 x 16 banyan networks.]

An N x N banyan network has two properties (where N = 2^n):
There is a unique path from any input to any output.
There are log2 N = n columns/stages, each with N/2 switching
elements.
Banyan is self-routed
=> distributed control
47
Interconnecting pattern
Label switching stages from 0 to n-1, where N = 2^n.
Divide switching elements in stage k into 2^k groups.
Connecting pattern between stage k and stage k+1:
  Outputs from group i of stage k are connected to inputs of
  group 2i and group 2i+1 of stage k+1.
  Divide the outputs of group i into an upper half and a lower half:
  upper outputs of the switching elements are connected to group 2i
  inputs; lower outputs of the switching elements are connected to
  group 2i+1 inputs.
Connecting to network inputs: upper inputs first.
Connecting network outputs: in parallel.

48
Routing in banyan network

Each switching element in the i-th stage examines the i-th bit
of the destination address (most-significant-bit first) to make
the decision:
if the bit = 1, route to the lower output
if the bit = 0, route to the upper output

[Figure: an 8 x 8 banyan network with outputs labeled 000-111; a
packet with destination 011 is routed upper-lower-lower, and a
packet with destination 101 is routed lower-upper-lower.]

49
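The per-stage decisions above depend only on the destination bits, which is why no global control is needed; a trivial sketch:

```python
# Self-routing: stage i looks only at bit i of the destination
# address (MSB first).
def route(dest_bits):
    return ["lower" if b == "1" else "upper" for b in dest_bits]

print(route("011"))   # -> ['upper', 'lower', 'lower']
print(route("101"))   # -> ['lower', 'upper', 'lower']
```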
Blocking in banyan network

Internal blocking: if two packets, with different outputs,
contend for the same outgoing link at an intermediate
switching element.

[Figure: an 8 x 8 banyan network in which two packets destined for
011 and 101 collide at an intermediate switching element (internal
blocking!), while two packets both destined for 101 collide at the
output (output blocking!).]

50
Other banyan-typed networks

Shuffle-exchange (Omega) Network Reverse Shuffle-exchange network

Banyan Network
51
Simple switches based on banyan network
[Figure: a 2x2 switching element with load Pm on its input links
(stage m+1) and load Pm+1 on its output links.]

Consider a banyan network when it is operated as a loss
system. Assume a uniform traffic situation and let
Pm = Pr [there is a packet at an input link of stage m+1].
We can express
  P(m+1) = 1 - (1 - Pm/2)^2 = Pm - Pm^2/4
Therefore, Ploss = (P0 - Pn) / P0.
Using Taylor series expansion, we can have
  Ploss ~= n P0 / (n P0 + 4).
When P0 = 1, Ploss = 0.25, 0.39, 0.48, 0.5, 0.56, 0.6 for switches
with size 2, 4, 8, 16, 32, 64
=> blocking increases rapidly with switch size.
52
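The recurrence is easy to check numerically (a sketch with P0 = 1). Note that the exact iteration gives the first few quoted values, while the Taylor approximation n·P0/(n·P0 + 4) accounts for the figures quoted at the larger sizes:

```python
# Iterate P(m+1) = Pm - Pm^2/4 for an N = 2^n banyan and compare with
# the Taylor approximation n*P0/(n*P0 + 4).
def banyan_ploss(n_stages, p0=1.0):
    p = p0
    for _ in range(n_stages):          # one application per stage
        p = p - p * p / 4
    return (p0 - p) / p0

for n in range(1, 7):                  # switch sizes 2, 4, 8, 16, 32, 64
    print(2 ** n, round(banyan_ploss(n), 2), round(n / (n + 4), 2))
```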
A simple internally nonblocking switch
[Figure: an N x N internally nonblocking switch operated as a loss
system, with input load P0 and output load Pn.]

Performance of an internally nonblocking switch functioning as a
loss system:
Consider a particular output i. In any time slot, the prob. that none
of the N inputs has a packet destined for it is (1 - P0/N)^N -> e^(-P0)
as N increases.
Thus Pn = Pr [there is a packet at output i] = 1 - e^(-P0) for large N.
Since Pn increases with P0, the maximum throughput is obtained
when P0 = 1, i.e. rho* = 1 - e^(-1) = 0.632; and the corresponding
packet loss prob. Ploss = (P0 - Pn) / P0 = (1 - 0.632)/1 = 0.368.
Similarly, when P0 = 1, Ploss = 0.25, 0.32, 0.34, 0.356, 0.362, 0.365
for switches with size 2, 4, 8, 16, 32, 64
(compare the banyan: Ploss = 0.25, 0.39, 0.48, 0.5, 0.56, 0.6)

53
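The same calculation for the internally nonblocking case, checked numerically (a sketch):

```python
import math

# An output receives nothing w.p. (1 - P0/N)^N, so
# PN = 1 - (1 - P0/N)^N and Ploss = (P0 - PN)/P0.
def nonblocking_ploss(N, p0=1.0):
    pn = 1 - (1 - p0 / N) ** N         # prob. the output carries a packet
    return (p0 - pn) / p0

for N in (2, 4, 8, 16, 32, 64):
    print(N, round(nonblocking_ploss(N), 3))
print(round(math.exp(-1), 3))          # large-N limit of Ploss -> 0.368
```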
Combinatoric properties of banyan
networks
For a nonblocking (both internally and at output) N x
N packet switch, there are N^N possible input-output
mappings.
The total number of possible states (input-output
mappings) that can be realized by a banyan network is N^(N/2)
(= 2^((N/2) log N), one state per switching element).
The fraction of realizable input-output mappings is
N^(N/2) / N^N = 1 / N^(N/2)
=> approaches 0 as N increases

[Figure: the identity mapping 1-1, 2-2, 3-3, 4-4 on a 4 x 4 banyan.]
54
Analytical model for input-buffered Banyan
[Figure: an 8 x 8 input-buffered banyan network, inputs 1-8 to
outputs 1-8, with switching stages 1, 2, and 3.]

55
State transition diagram

1-qek(t) A1,n
A 1,n A2,n AB-1,n
A 1,n AB,n
E1,n E2,n EB,n
2,n
0, e 1, n 2, n B-1,n B, n
qek(t) A2,n AB-1,n
A1,n 1-rnk(t)
C2,b CB,b
(1-q1,bk(t))rbk(t) D1,n DB-1,n
B1,b B2,b BB-1,b BB,b
1, b 2, b B-1,b B, b
F1,b FB-1,b

B1,b B1,b BB-1,b 1-rbk(t)

Ai ,n qik,n (t )rnk (t ) Ci ,b (1 q ik,b (t )) rbk ( t )


A i ,n (1 qik,n (t ))(1 rnk (t )) Di ,n q ik,n ( t )(1 rnk ( t ))
Bi ,b (1 qik,b (t ))(1 rbk (t )) E i ,n (1 q ik,n ( t )) rnk (t )
B i ,b qik,b (t )rbk (t ) Fi ,b q ik,b ( t )(1 rbk (t ))

56
Non-blocking Conditions for Banyan Networks
Theorem. The banyan network is nonblocking if the active inputs x1, ..., xm
( xj > xi if j > i) and their corresponding output destinations y1, ..., ym
satisfy the following:
1. (Distinct & monotonic outputs): y1 < y2 < ... < ym or y1 > y2 > ... > ym
2. (Concentrated inputs): Any input between two active inputs is also
active. That is, xi < w < xj implies input w is active.

[Figure: a 16 x 16 banyan routing the concentrated, monotonic
requests x1 = 0000 -> y1 = 0010, x2 = 0001 -> y2 = 1011,
x3 = 0010 -> y3 = 1100 without conflicts.]
57
Labeling switching elements in a banyan
Each node in stage k can be uniquely represented by two
binary numbers (a(n-k)...a1, b1...b(k-1)): the low n-k bits of the
input address followed by the first k-1 bits of the output address.

[Figure: a 16 x 16 banyan with stage-1 nodes labeled (000,-) through
(111,-), stage-2 nodes (00,0) through (11,1), stage-3 nodes (0,00)
through (1,11), and stage-4 nodes (-,000) through (-,111); the path
from input 0001 to output 1001 visits (001,-), (01,1), (1,10), (-,100).]

In general, the path from input an...a1 to output b1...bn is
  an...a1 -> (a(n-1)...a1, -) -> (a(n-2)...a1, b1) -> ... -> (-, b1...b(n-1)) -> b1...bn

58
Proof:
Suppose two packets, one from x = an...a1 to output y = b1...bn, the
other from x' = a'n...a'1 to output y' = b'1...b'n, collide in stage k.
That is, the two paths
  an...a1 -> (a(n-1)...a1, -) -> (a(n-2)...a1, b1) -> ... -> (-, b1...b(n-1)) -> b1...bn
  a'n...a'1 -> (a'(n-1)...a'1, -) -> (a'(n-2)...a'1, b'1) -> ... -> (-, b'1...b'(n-1)) -> b'1...b'n
merge at the same node of stage k and share the same outgoing link:
  (a(n-k)...a1, b1...b(k-1)) = (a'(n-k)...a'1, b'1...b'(k-1)) and bk = b'k
Thus, we have
  a(n-k)...a1 = a'(n-k)...a'1 and b1...bk = b'1...b'k    (A)
59
Since input packets are concentrated, the total # of packets
between x and x', inclusively, is |x - x'| + 1.
Since all packets are destined for different outputs, there
must be |x - x'| + 1 distinct output addresses.
Since the outputs are monotonic, the largest and the smallest output
addresses must be y and y', or y' and y. Hence we must have
  |x' - x| + 1 <= |y' - y| + 1, i.e. |x' - x| <= |y' - y|    (B)
From (A), we have
  |x' - x| = |a'n...a'1 - an...a1|
           = |a'n...a'(n-k+1)0...0 - an...a(n-k+1)0...0| >= 2^(n-k)
  |y' - y| = |b'1...b'n - b1...bn|
           = |b'(k+1)...b'n - b(k+1)...bn| <= 2^(n-k) - 1
This contradicts (B). Thus the theorem is proved.

Proof End
60
E.g.

Violating the concentrated input condition but still
non-blocking:

[Figure: an 8 x 8 banyan in which two non-adjacent active inputs
carrying packets for outputs 011 and 101 are still routed without
conflict.]

61
Theorem 2. Let the input-output pair of packet i be denoted by (xi,
yi). If the packets can be routed through the banyan network without
conflicts, so can the set of packets ((xi + z ) mod N, yi).
Proof: (try it yourself)

e.g. z = 5

[Figure: the conflict-free set x1 = 0000 -> y1 = 0010,
x2 = 0001 -> y2 = 1011, x3 = 0010 -> y3 = 1100 remains
conflict-free after shifting the inputs by z = 5:
x1 = 0101, x2 = 0110, x3 = 0111.]
62
Exercise:
[Figure: an 8 x 8 banyan network with outputs labeled 000-111.]

Consider an 8 x 8 banyan network. Suppose with probability


0.75 a packet is destined for outputs 000, 001, 010, or 011,
and with probability 0.25 it is destined for the other four
outputs. Within each group of outputs, the packet is equally
likely to be destined for any of the four outputs. Is the loss
probability higher in this case than when a packet is equally
likely to be destined for any of the eight outputs?

63
Solution:

[Figure: the 8 x 8 banyan with per-stage loads P1U, P2U, P3U marked
on links leading toward the upper four outputs and P1L, P2L, P3L on
links leading toward the lower four outputs.]

Let PiU and PiL be defined as shown in the figure. P0 is the input
load and let P0 = 1.
At stage 1,
  P1U = 1 - (1 - 0.75 P0)^2 = 0.9375
  P1L = 1 - (1 - 0.25 P0)^2 = 0.4375
At stage 2,
  P2U = 1 - (1 - 0.5 P1U)^2 = 0.7178
  P2L = 1 - (1 - 0.5 P1L)^2 = 0.3896
At stage 3,
  P3U = 1 - (1 - 0.5 P2U)^2 = 0.5890
  P3L = 1 - (1 - 0.5 P2L)^2 = 0.3516
The overall loss prob. is
  Ploss = (P0 - (P3U + P3L)/2) / P0 = 1 - (0.5890 + 0.3516)/2 = 0.5297

When a packet is equally likely to be destined for any of the 8
outputs,
  P1 = P0 - P0^2/4 = 0.75
  P2 = P1 - P1^2/4 = 0.6094
  P3 = P2 - P2^2/4 = 0.5166
  P'loss = 1 - P3 = 0.4834
Therefore, the loss probability in this case is higher than the case
where a packet is equally likely to be destined for any output.
64
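The stage-by-stage calculation above can be reproduced with one helper (`load` is a hypothetical name: the output load of a 2x2 element when each input carries load p and routes to that output with probability f):

```python
def load(p, f):
    return 1 - (1 - f * p) ** 2

p1u, p1l = load(1.0, 0.75), load(1.0, 0.25)    # stage 1: 75%/25% split
p2u, p2l = load(p1u, 0.5), load(p1l, 0.5)      # stages 2-3: uniform split
p3u, p3l = load(p2u, 0.5), load(p2l, 0.5)
ploss = 1 - (p3u + p3l) / 2
print(round(ploss, 4))                          # -> 0.5297
```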
Sorting Networks
[Figure: a 4 x 4 sorting network of comparators sorting four 2-bit
values in three stages; the intermediate labels include min{a1,a2},
max{a3,a4}, min{a1,a2,a3,a4}, max{min{a1,a2},min{a3,a4}},
min{max{a1,a2},max{a3,a4}}, and max{a1,a2,a3,a4}.]

Comparator: it takes two input numbers and places the
larger number on the output pointed to by the arrow
and the smaller number on the other output.
65
Order-preserving Property: Suppose a sorting network sorts the
input sequence a = a1,a2,...,aN into the output sequence b =
b1,b2,...,bN. Then for any monotonically increasing function f, the
network sorts the input sequence f(a) = f(a1),f(a2),...,f(aN) into
the output sequence f(b) = f(b1),f(b2),...,f(bN).

[Figure: the input sequence 5,4,7,1 is sorted by the network into
1,4,5,7.]

E.g. f(x) = x + 2: the input 7=f(5), 6=f(4), 9=f(7), 3=f(1) is
sorted into f(1)=3, f(4)=6, f(5)=7, f(7)=9.

What if f(x) = x + 2 for x < 2 and f(x) = x - 3 for x >= 2? Then f
is not monotonically increasing, and the input 2=f(5), 1=f(4),
4=f(7), 3=f(1) is sorted into 1,2,3,4, which is NOT the sequence
f(1)=3, f(4)=1, f(5)=2, f(7)=4.
66
Theorem 3 (Zero-One Principle). If a sorting network with N inputs
sorts all the 2^N possible sequences of 0s and 1s correctly, then it sorts all
sequences of arbitrary input numbers correctly.
Proof:
Consider a sorting network that can sort all sequences of 0s and 1s correctly. By
contradiction, suppose it does not sort input sequences of arbitrary numbers correctly.
That is, there is an input sequence a1,a2,...,aN containing two elements ai and aj such
that ai < aj, but the network places aj before/above ai.
Define a monotonically increasing function f(x) such that f(x) = 0 if x <= ai, and
f(x) = 1 if x > ai.
According to the order-preserving property, since the network places aj before/above
ai when the input sequence is a1,a2,...,aN, it places f(aj) = 1 before/above f(ai) = 0
when the input sequence is f(a1),f(a2),...,f(aN). But this input sequence consists of
only 0s and 1s, and yet the network does not sort it correctly, leading to a
contradiction.

[Figure: the arbitrary input a1,...,aN and its zero-one image
f(a1),...,f(aN) fed through the same sorting network.]

67

Sorting networks based on bitonic sort
Merging is a divide-and-conquer technique for sorting.
A k-merger takes two sorted input sequences and merge them
into one sorted sequence of k elements.
Intuitively merging is simpler than sorting in general.
Suppose we have mergers of different sizes; they can be
interconnected (as shown below) to sort an arbitrary input
sequence.
One way to construct the mergers is to use the bitonic sorting
algorithm invented by Batcher.

[Figure: a recursive N-merger: pairs of inputs feed 2-mergers, pairs
of 2-mergers feed 4-mergers, and so on up to two N/2-mergers whose
outputs feed the final N-merger.]
68
Some properties of bitonic sequence
A bitonic sequence is a sequence that either increases monotonically
and then decreases monotonically, or decreases monotonically and
then increases monotonically.
E.g. 1,3,5,7,6,4,2,0;  7,5,3,1,0,2,4,6;  1,2,3,3,2,1
A bitonic sorter is a merger that takes a bitonic sequence and sorts
it into a monotonic sequence.

[Figure: an ascending k-bitonic sorter turns a k-element bitonic
input sequence into an ascending sequence; a descending k-bitonic
sorter turns it into a descending sequence. Concatenating an
ascending sorted sequence and a descending sorted sequence yields a
bitonic input.]
69
Some properties of bitonic sequence
We focus on bitonic sequences with only 0s and 1s (why? the
zero-one principle)
Two general forms: 1^i 0^j 1^k or 0^i 1^j 0^k
A bitonic sequence a is said to be no less than another bitonic
sequence b if none of the elements in a is less than any of the
elements in b.
  e.g. 01110 >= 00000, 11111 >= 01110
Two sequences do not necessarily have an ordering relationship.
  e.g. 00010 and 01110
Using the following theorem, a bitonic sequence can be
decomposed into two bitonic subsequences a' and a'' using only
one stage of comparators.

70
Some properties of bitonic sequence
Theorem 4. If a zero-one sequence of 2n elements a = a1,a2,...,a2n is
bitonic, then the two n-element sequences
  a'  = min(a1,an+1), min(a2,an+2), ..., min(an,a2n) and
  a'' = max(a1,an+1), max(a2,an+2), ..., max(an,a2n)
have two properties:
1. They are both bitonic.
2. a' <= a''.
Proof:

[Figure: a half-cleaner: a comparator between each pair ai, an+i
places min(ai,an+i) in the upper half a' and max(ai,an+i) in the
lower half a''.]
71
Recursive construction of a k-bitonic sorter
[Figure: a k-half-cleaner compares ai with a(k/2+i), producing
min(ai, a(k/2+i)) on the upper k/2 lines and max(ai, a(k/2+i)) on
the lower k/2 lines.]

A k-bitonic sorter is built recursively: a k-half-cleaner followed
by two k/2-bitonic sorters (one on the upper half, one on the lower
half), and so on down to 2-half-cleaners, yielding an ascending
sequence.
72
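The recursion above (half-cleaner, then two k/2-bitonic sorters) can be rendered directly in software; a sketch assuming power-of-two input lengths:

```python
def bitonic_merge(seq, ascending=True):
    """Sort a bitonic sequence: half-clean, then recurse on each half."""
    if len(seq) == 1:
        return seq
    mid = len(seq) // 2
    seq = list(seq)
    for i in range(mid):                      # the half-cleaner stage
        if (seq[i] > seq[i + mid]) == ascending:
            seq[i], seq[i + mid] = seq[i + mid], seq[i]
    return (bitonic_merge(seq[:mid], ascending)
            + bitonic_merge(seq[mid:], ascending))

def batcher_sort(seq, ascending=True):
    """Sort arbitrary input: opposite-direction halves form a bitonic input."""
    if len(seq) == 1:
        return seq
    mid = len(seq) // 2
    left = batcher_sort(seq[:mid], True)      # ascending half
    right = batcher_sort(seq[mid:], False)    # descending half
    return bitonic_merge(left + right, ascending)

print(batcher_sort([3, 2, 8, 7, 1, 6, 5, 4]))   # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

Each swap in `bitonic_merge` corresponds to one comparator of the hardware network.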
An 8 x 8 Sorting Network using bitonic sorters

[Figure: an 8 x 8 sorting network built from 2-mergers, 4-mergers,
and an 8-merger, each merger implemented as a bitonic sorter; the
input sequence 3,2,8,7,1,6,5,4 is sorted to 1,2,3,4,5,6,7,8.]

Total # of stages = 1 + 2 + 3 + ... + log N = (1 + log N) log N / 2

The total # of comparators in an N x N Batcher Network is:
  N log N (log N + 1) / 4
73
Batcher-Banyan network

[Figure: inputs enter a Batcher sorting network whose outputs feed a
banyan network, which delivers packets to the outputs.]

Q1: how packets are switched in Batcher network?


Q2: how contentions are resolved such that banyan
is nonblocking?

74
Switching in Batcher-banyan network
Only headers are compared; bits arrive serially, header first.
If both header bits of the two packets are 0s or 1s, the
comparator remains in its original state (in this example, the bar
state) and the bits are forwarded to the outputs.
For the first pair of bits that differ, set the comparator state
accordingly; it then remains unchanged for the rest of the packet.
If the two packets have the same output address, the comparator
remains in its original state for the whole packet duration.

[Figure: two packets with headers 0100 and 1000 entering a
comparator assumed to start in the bar state. It remains in bar
after the 1st and 2nd header bits (both pairs equal); at the 3rd
bit the headers differ and the upper input is larger, so the
comparator is set to cross, and it remains in cross for the whole
packet duration.]
75
Contention resolution in Batcher-banyan network
[Figure: a Batcher sorter followed by a banyan. Input packets with
output addresses 001, 100, idle, 100, 000, idle, idle, 101 are
sorted so that the active packets (000, 001, 100, 100, 101) appear
concentrated and in monotonic order at the banyan inputs.]

What if some inputs are idle?


Add an extra bit in front of the MSB (most significant bit) of
the output port address of each packet, called the activity bit.
If the input is active, set this bit to 0
Otherwise, construct a dummy packet and set this bit to 1
All dummy packets will be pushed to the lower end of the outputs
=> all active packets are concentrated at the inputs to the banyan switch

76
Three-phase algorithm
How to solve the output contention problem?
Three phase algorithm for resolving output contention:
Probe phase: only the header of packets enter the sorting
network. Packets with the same output address will be
adjacent to each other at the outputs. Output j+1 checks with
output j to see if their addresses are the same. If yes, let the
packet at output j+1 be the loser and the packet at output j be
the winner.
Acknowledgment phase: acknowledgements are back-
propagated along the same path as the forward path in the
probe phase.
Send phase: send the winning packets; inputs that have lost
contention can buffer their packets for later attempt (=>
waiting system approach)
The first two phases are overheads.

77
An example
[Figure: probe phase. The request headers 001, 100, idle, 100, 000,
idle, idle, 101 are sorted by the Batcher network; the two requests
for output 100 end up adjacent at the sorter outputs, and one of
them loses (marked X).]

[Figure: send phase. The winning packets (000, 001, 100, 101)
re-enter the Batcher-banyan, concentrated by the activity bit, and
are delivered to their outputs without conflict; the losing packet
for output 100 is buffered for a later attempt.]

78
Multiple-Path Banyan Switch Designs

Complexity of Batcher-banyan:
  N log N (log N + 1) / 4 comparators
  (N log N) / 2 switching elements

Can we get rid of the complexity of the Batcher network
while keeping the performance of Batcher-banyan
and the advantages of banyan?
Yes, using multiple-path banyan networks:
Dilated banyan
Replicated banyan
Tandem banyan

79
Multiple-Path Banyan Switch Designs
Dilated Banyan
The internal link bandwidth is expanded to reduce the likelihood of
a packet being dropped.
For a banyan with dilation degree d, the switching elements are of
size 2d x 2d: each outgoing address has d associated outgoing links.

Replicated Banyan
K parallel banyan planes are fed through a random router or
broadcaster.
Suppose each packet is randomly routed to one of the banyans for
switching. The load on each banyan is reduced by a factor of K,
thus
  Ploss ~= n P0 / (n P0 + 4K)
Instead of random routing, we can broadcast a packet to all K
banyan planes. Since a packet is lost only if all copies are lost,
  Ploss ~= (n P0 / (n P0 + 4))^K
80
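The two replicated-banyan operating modes are easy to compare with the formulas above (a sketch with P0 = 1):

```python
# Loss estimates for an N = 2^n banyan replicated over K planes.
def ploss_random(n, K):
    return n / (n + 4 * K)             # random routing: load split K ways

def ploss_broadcast(n, K):
    return (n / (n + 4)) ** K          # broadcast: lost only if all copies lost

n, K = 6, 4                            # 64 x 64 banyan, 4 planes
print(round(ploss_random(n, K), 3))    # -> 0.273
print(round(ploss_broadcast(n, K), 3)) # -> 0.13
```

For these parameters broadcasting wins, at the cost of K times the internal traffic.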
Tandem Banyan

[Figure: K banyan networks in series; a packet filter for marked
packets sits between consecutive banyans, and correctly routed
packets leave through delay elements to the concentrator at each of
the N outputs (Output 1 ... Output N).]

Deflection routing:

Whenever there is a conflict at a 2x2 switch element, one packet would be


routed correctly while the other would be marked and routed in the wrong
direction.
To optimize the number of correctly routed packets, the marked packet would
have a lower priority than an unmarked one for the rest of the journey within
the banyan network (why?)
If a packet remains unmarked when it reaches the output of the banyan, it has
reached the correct destination. It is removed and forwarded to the
concentrator associated with the output destination
On the other hand, a marked packet will be unmarked and forwarded to the next
banyan (by packet filter), and a new attempt to route the packet to its desired
output is initiated
A packet is considered lost if it still fails to reach the desired output after
passing through all the K banyan networks. 81
Let D be the delay suffered by a packet as it travels through a single
banyan network.
A packet that reaches its correct destination at a later banyan
experiences a larger delay
To compensate for the delay differences (why?), one can insert delay
elements with varying delays at different places:
for the links that connect the N outputs of banyan i to the N
concentrators, one can introduce a delay of (K-i)D.

[Figure: the tandem banyan again, with delay elements of (K-i)D on
the links from banyan i to the concentrators.]
82
A practical issue

Today, it is the number


of chip I/Os, not the
number of crosspoints,
that limits the size of
a switch fabric.

The problem is how to


interconnect chips
with limited I/Os to
form a larger switch?

83
Three-stage switching network
Clos Network
Switch modules are arranged in three stages, and any module in the
first (second) stage is interconnected with any module in the second
(third) stage via a unique link.
Each switch module is itself a nonblocking switch.

[Figure: r1 first-stage modules of size n1 x r2, r2 middle-stage
modules of size r1 x r3, and r3 third-stage modules of size r2 x n3.]

For an N x N 3-stage switch:
  n1 x r1 = n3 x r3 = N
84
An example

n1 = r2 = r1 = r3 = n3 = 3

[Figure: a 9 x 9 three-stage network with three 3 x 3 modules in
each stage.]

3-stage switching network is not necessarily


nonblocking.

85
Theorem. A three-stage switching network is strictly
nonblocking if and only if
  r2 >= n1 + n3 - 1
Proof sketch:

[Figure: a first-stage module with n1-1 of its inputs busy (only one
idle) and a third-stage module with n3-1 of its outputs busy (only
one idle). In the worst case these (n1-1) + (n3-1) existing
connections occupy distinct middle-stage modules, so one more
middle-stage module is needed to connect the idle input to the idle
output: r2 >= (n1-1) + (n3-1) + 1.]
86
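The theorem reduces to a one-line check (for the symmetric case n1 = n3 = n it reads r2 >= 2n - 1):

```python
# Strict-sense nonblocking test for a three-stage Clos network.
def strictly_nonblocking(n1, n3, r2):
    return r2 >= n1 + n3 - 1

print(strictly_nonblocking(3, 3, 3))   # -> False (the 9 x 9 example above)
print(strictly_nonblocking(3, 3, 5))   # -> True  (r2 = 2n - 1)
```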
Summary: Switch Fabric Design using Interconnection Networks
Banyan networks
nice, but with a serious internal blocking problem
Non-blocking conditions for banyan networks
how to satisfy them?
Batcher-banyan networks
a sorting network (Batcher) placed before the banyan to satisfy the
non-blocking conditions
Multipath banyan networks
keeping the performance of Batcher-banyan and the advantages of banyan
Clos networks
tackling the issue of interconnecting chips with limited I/Os to form a
larger switch fabric
87
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement
  Input Port Queueing
  Output Port Queueing
  Combined Input Output Queueing

88
Input Port Queueing
Note: When we say a switch is input-queued, we imply that in each time slot
the switch fabric allows at most one packet to be sent by each input port,
and at most one packet to be received by each output port.
Fabric slower than the combined input rate -> queueing may occur at the
input queues
Head-of-the-Line (HOL) blocking: the queued datagram at the front of the
queue prevents the others in the queue from moving forward

89
An Analogy

90
Input Port Queueing
Performance

Throughput of an input-queueing switch
When N is large, the maximum throughput is 2 - sqrt(2) = 0.586, found
under the uniform traffic condition:
all inputs have the same loading
each packet is destined for any output port with the same probability
in case of contention, the winner is chosen randomly

[Figure: delay vs. load; the delay grows without bound as the load
approaches 58.6%, well short of 100%.]
91
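The 0.586 figure is the classical HOL-blocking saturation throughput, 2 - sqrt(2), for large N:

```python
import math

# Saturation throughput of a FIFO input-queued switch under HOL blocking,
# for large N and uniform random traffic: 2 - sqrt(2).
theta_star = 2 - math.sqrt(2)
print(round(theta_star, 3))  # -> 0.586
```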
92
Input Queueing -- Virtual output queues

[Figure: each input maintains N virtual output queues (VOQs), one per
output, so a packet can only be held up by packets destined for the same
output.]
93
Input Queues
Virtual Output Queues

Memory b/w = 2R

Scheduler

[Figure: delay vs. load; with VOQs and a good scheduler, the throughput
limit moves from 58.6% towards 100%.]

94
Input Queueing
Scheduling

[Figure: a 4 x 4 VOQ switch with per-VOQ queue lengths, and the
corresponding bipartite graph between inputs 1-4 and outputs 1-4. The
scheduling problem is to pick a matching in this graph.]

Bipartite Matching
Question: Maximum weight or maximum size? (Weight of the matching
shown = 18)
95
Input Queueing
Scheduling

Maximum Size matching
maximizes instantaneous throughput
does it maximize long-term throughput? Not necessarily.

Maximum Weight matching (weights = queue lengths)
clears the most backlogged queues
does it sacrifice long-term throughput? No.

Maximum Weight is better!

96
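Maximum-weight matching can be sketched by brute force on a small switch. The occupancy matrix below is illustrative, not the slide's exact figure, and real schedulers use much faster (often approximate) algorithms:

```python
from itertools import permutations

def max_weight_matching(w):
    """Brute-force maximum-weight input-output matching for an N x N
    VOQ switch, with w[i][j] = length of input i's VOQ for output j.
    Exponential in N -- a sketch only."""
    n = len(w)
    best, best_w = None, -1
    for perm in permutations(range(n)):   # perm[i] = output matched to input i
        total = sum(w[i][perm[i]] for i in range(n))
        if total > best_w:
            best, best_w = perm, total
    return best, best_w

# Illustrative 4 x 4 VOQ occupancy matrix:
w = [[7, 0, 1, 0],
     [0, 4, 0, 2],
     [0, 0, 3, 5],
     [2, 0, 0, 4]]
match, weight = max_weight_matching(w)
print(match, weight)  # the best matching here has weight 18
```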
Input Queueing
Why is serving long/old queues better than serving the maximum number of
queues?

When traffic is uniformly distributed, serving the maximum number of
queues already leads to 100% throughput.
When traffic is non-uniform, some queues become longer than others.
A good algorithm keeps the queue lengths matched, and still serves a
large number of queues.

[Figure: average occupancy per VOQ, under uniform traffic (roughly flat)
and non-uniform traffic (a few VOQs much longer than the rest).]
97
Points to Ponder:

Can we design an input-queued switch with k queues per input port, where
1 < k < N, such that it is scalable and has the high throughput of a VOQ
switch?

[Figure: an input port with k queues feeding a fabric with N outputs.]

98
Odd-even switch

[Figure: an 8 x 8 non-blocking switch. Each input keeps two FIFO queues:
Queue 1 holds packets destined for odd-numbered outputs, Queue 2 holds
packets destined for even-numbered outputs.]

In each time slot, scheduling is divided into 2 contention rounds, one
for each output-address group.
The 1st contention round considers the HOL packets of all even queues;
the 2nd round considers the HOL packets of all odd queues.
After the 2 contention rounds, the winning packets of both rounds are
switched together.
Note: each input/output can still send/receive at most one packet per
time slot.
99
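One time slot of the two-round scheme can be sketched as follows. The function name and data format are ours, and contention is resolved by a random choice per output, as in the earlier throughput analysis:

```python
import random

def odd_even_slot(hol, rng):
    """One time slot of the odd-even scheme sketched above (sketch; names
    and data format are ours). hol[i] maps 'even'/'odd' to the destination
    of input i's HOL packet in that output-address group, or None.
    Returns the (input, output) pairs switched together this slot."""
    granted, busy_in, busy_out = [], set(), set()
    for group in ('even', 'odd'):                 # round 1: even, round 2: odd
        contenders = {}
        for i, q in enumerate(hol):
            d = q[group]
            # an input that already won, or an output already taken, sits out
            if d is not None and i not in busy_in and d not in busy_out:
                contenders.setdefault(d, []).append(i)
        for d, inputs in contenders.items():      # random winner per output
            winner = rng.choice(inputs)
            granted.append((winner, d))
            busy_in.add(winner)
            busy_out.add(d)
    return granted

hol = [{'even': 2, 'odd': 3},
       {'even': 2, 'odd': 1},
       {'even': 4, 'odd': 3}]
print(odd_even_slot(hol, random.Random(0)))
```

Note the invariant enforced by `busy_in`/`busy_out`: each input sends, and each output receives, at most one packet per slot, as the slide requires.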
Performance Results

# of queues | Max. Throu. (ana.) | Max. Throu. (sim.) | Percentage Error
          1 | 0.586              | 0.586              | insignificant
          2 | 0.705              | 0.713              | 1.1
          5 | 0.827              | 0.849              | 2.6
         10 | 0.890              | 0.917              | 3.0
         20 | 0.933              | 0.956              | 2.4
         50 | 0.966              | 0.980              | 1.4
        100 | 0.980              | 0.993              | 1.3
Table 1. Maximum throughput

For details, please refer to the following paper:
Kwan L. Yeung and H. Shi, "Throughput Analysis for Input-buffered ATM
Switches with Multiple FIFO Queues per Input Port," Electronics Letters,
Vol. 33, No. 19, pp. 1604-1606, Sep. 1997.

100
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement
  Input Port Queueing
  Output Port Queueing
  Combined Input Output Queueing

101
Output port queueing

Note: When we say a switch is output-queued, we imply that in each time
slot the switch fabric can deliver up to N packets to an output port.
Output buffering is needed when the arrival rate via the switch exceeds
the output line speed.
queueing (delay) and loss due to output port buffer overflow
102
Output Queueing

Individual Output Queues: memory b/w = (N+1).R per output
(up to N writes from the fabric plus 1 read onto the line per slot)

Centralized Shared Memory: memory b/w = 2N.R
(N writes plus N reads per slot share one memory)

103
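The bandwidth formulas above compare as follows for a hypothetical port count and line rate (N and R are illustrative values):

```python
# Memory bandwidth required per buffer for an N-port switch with line
# rate R, per the formulas above. N and R are illustrative values.
N = 16          # ports
R = 10e9        # 10 Gb/s line rate

input_queue   = 2 * R         # per input buffer: 1 write + 1 read per slot
output_queue  = (N + 1) * R   # per output buffer: up to N writes + 1 read
shared_memory = 2 * N * R     # one memory absorbs all N inputs and N outputs

print(input_queue / 1e9, output_queue / 1e9, shared_memory / 1e9)
# -> 20.0 170.0 320.0 (Gb/s)
```

The gap between 2R and (N+1)R is why output queueing, despite its better delay behavior, scales poorly in memory speed as N and R grow.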
Comments:
Neither an output-buffered switch nor an input-buffered VOQ switch
suffers from HOL blocking.
Does this mean they have the same performance?
No.
Which one is better?
The output-buffered switch.
Why?
Intuitively, consider the case where an output line is idle. In an
output-buffered switch, a packet (if any) in the output buffer can be
sent immediately; in an input-buffered VOQ switch, even if there are
packets waiting at the input buffers for this output port, they cannot
be sent immediately because of the constraints imposed by the VOQ
scheduling algorithm.
e.g. under longest-queue-first, the queue destined for this output port
may not be the longest.

104
Combined Input-Output Queueing
Can we design an input-queued switch that behaves exactly like an
output-queued switch?

Yes. It has been proved that a combined input-output queueing (CIOQ)
switch with a switch-fabric speedup of (at least) 2, together with a
suitable scheduling algorithm (e.g. a stable-marriage matching
algorithm), can emulate an output-buffered switch exactly.
Why bother to do so?
To get the performance of an output-queued switch at the cost of an
input-queued switch.
Note: to provide Quality of Service (QoS) guarantees on, e.g., packet
delay, scheduling algorithms can be easily designed and efficiently
implemented for output-buffered switches (more on this later, under
Scheduling).

105
Using Speedup

[Figure: a CIOQ switch with buffers at both the inputs and the outputs;
with speedup S, the fabric can transfer up to S packets per port per
time slot.]

106
The Ideal Solution

[Figure: an N x N output-queued switch compared ("=?") with an N x N
combined input-output queued switch.]

107
The findings

For a switch with combined input and output queueing to exactly mimic an
output-queued switch, for all types of traffic, a speedup of 2 - 1/N is
necessary and sufficient.

But:
how do we make such a scheduling algorithm fast and efficient enough for
real implementation?

108
High-Speed Router Design

Summary:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement

109