
Per-Hop Packet Processing

High Speed Router Design

Routers with CQS Architecture

1
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement

2
Where IP routers sit in the network

Core
router

The Internet Core

Edge
Router

3
Basic Architectural Components

Two key router functions:


run routing algorithms/protocol (RIP, OSPF, BGP)
switching datagrams from incoming to outgoing link

[Figure: control plane: routing protocols (RIP, OSPF, BGP) maintain the
routing table; datapath: the forwarding table drives per-packet
switching.]

4
Per-packet processing in an IP Router
1. Accept packet arriving on an incoming link.
2. Lookup packet destination address in the forwarding table => to
identify outgoing port(s).
3. Manipulate packet header: e.g., decrement TTL, update header
checksum.
4. Send packet to the outgoing port(s).
5. Buffer packet in the queue.
6. Transmit packet onto outgoing link.
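Step 3 (decrement TTL, update the header checksum) does not require recomputing the checksum over the whole header: it can be updated incrementally, RFC 1624 style (HC' = ~(~HC + ~m + m')). A minimal Python sketch; the header words below are an arbitrary illustrative IPv4 header, not taken from the slides:

```python
def ones_sum(a, b):
    s = a + b
    return (s & 0xFFFF) + (s >> 16)          # end-around carry

def checksum(words):
    """Full 16-bit one's-complement checksum over header words."""
    s = 0
    for w in words:
        s = ones_sum(s, w)
    return ~s & 0xFFFF

def update_checksum(old_cksum, old_word, new_word):
    """Incremental update when one 16-bit header word changes (RFC 1624)."""
    s = ones_sum(~old_cksum & 0xFFFF, ~old_word & 0xFFFF)
    s = ones_sum(s, new_word)
    return ~s & 0xFFFF

header = [0x4500, 0x0073, 0x0000, 0x4000, 0x4011,   # TTL=0x40 in word 4
          0xC0A8, 0x0001, 0xC0A8, 0x00C7]           # checksum word omitted
old_ck = checksum(header)
new_header = list(header)
new_header[4] -= 0x0100                             # TTL: 0x40 -> 0x3F
print(hex(update_checksum(old_ck, header[4], new_header[4])))  # -> 0xb961
```

The incremental result must agree with a full recomputation over the modified header; real routers do this in hardware on the fast path.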

5
Input Port Functions

Physical layer:
bit-level reception
Data link layer:
e.g., Ethernet
Decentralized switching:
given datagram dest., lookup output port using
routing table in input port memory
goal: complete input port processing at line speed
queueing: if datagrams arrive faster than
forwarding rate into switch fabric
6
Output Ports

Buffering required when datagrams arrive from fabric


faster than the transmission rate
Scheduling discipline chooses among queued datagrams for
transmission (to be discussed later)

7
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement

8
First Generation Routers

[Figure: a shared backplane connecting a CPU, route table, and buffer
memory to multiple line interfaces (MAC).]

Typically <0.5Gb/s aggregate capacity


9
First Generation Routers
Switching Via Memory
[Figure: input port -> memory/CPU -> output port, over the system bus.]

Follow conventional computer architecture


A shared central bus, with a central CPU, memory and peripheral
Line Cards.
Each Line Card connecting the system to each of the external links.
Packets arriving from a link are transferred across the shared bus
to the CPU, where a forwarding decision is made.
The packet is then transferred across the bus again to its outgoing
Line Card, and onto the external link.
speed limited by memory bandwidth (2 bus crossings per packet)
10
First Generation Routers
Queueing Structure: Shared Memory

[Figure: N inputs and N outputs sharing a large, single, dynamically
allocated memory buffer.]

Numerous works have proven and made possible:
Fairness
Delay Guarantees
Delay Variation Control
Loss Guarantees
Statistical Guarantees

Large, single dynamically allocated memory buffer:
N writes per pkt time
N reads per pkt time
Limited by memory bandwidth.
11
First Generation Routers
How fast can we make centralized shared memory?

[Figure: shared memory built from 5ns SRAM, accessed over a
200-byte-wide bus.]

5ns per memory operation
Two memory operations per packet (1 write + 1 read)
Therefore, up to 160Gb/s (200 bytes every 10ns)
In practice, closer to 80Gb/s

12
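The 160Gb/s figure above follows directly from the memory parameters; a back-of-the-envelope check:

```python
# Each packet needs one write and one read, each taking one 5ns SRAM
# access, over a 200-byte-wide bus (numbers from the slide).
access_ns = 5
bus_bytes = 200
ops_per_packet = 2                                   # 1 write + 1 read

time_per_packet_ns = access_ns * ops_per_packet      # 10 ns per packet
gbps = bus_bytes * 8 / time_per_packet_ns            # bits per ns = Gb/s
print(gbps)  # -> 160.0
```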
Second Generation Routers

[Figure: central CPU, route table, and buffer memory on a shared bus
(the "slow path"); each line card has its own buffer memory and
forwarding cache, with a drop policy or backpressure at the buffers,
MAC at each card, and scheduling at the output link.]

Typically <5Gb/s aggregate capacity


13
Second Generation Routers
Operations

Placing a separate CPU at each interface.


A local forwarding decision is made in a dedicated CPU, based
on its local forwarding table cache.
The packet is immediately forwarded to its outgoing interface.
(i.e. port mapping intelligence in Line Cards)
This has the additional benefit that each packet need only
traverse the bus once, thus increasing the system throughput.
The central CPU is needed to maintain the forwarding tables in
each of the other CPUs, and for centralized system
management functions.

14
Second Generation Routers
Queueing Structure: Combined Input and Output Queueing

[Figure: input queues and output queues connected by a shared bus;
1 write per packet time at the inputs and 1 read per packet time at
the outputs, with the rate of writes/reads determined by bus speed.]

15
E.g. Cisco 7507 router

Front view
Rear view

Backplane
16
Third Generation Routers

[Figure: line cards with local buffer memory and forwarding tables,
plus a CPU card holding the routing table, interconnected by a
switched backplane (MAC at each line card).]

Typically <50Gb/s aggregate capacity


17
Third Generation Routers

BUT forwarding decisions are made in software, and so are


limited by the speed of a general purpose CPU.
Carefully designed, special purpose ASICs can readily
outperform a CPU when making forwarding decisions, managing
queues, and arbitrating access to the bus.
A shared bus allows only one packet to traverse at a time between
two Line Cards.
Replacing the shared bus by a crossbar switch
=> multiple Line Cards can communicate with each other
simultaneously greatly increasing the system throughput.

18
Third Generation Routers
Queueing Structure

[Figure: input and output queues connected by a switch fabric;
1 write per pkt time at the inputs and 1 read per pkt time at the
outputs, with the rate of writes/reads determined by switch fabric
speedup.]

19
E.g. Cisco 12000 series routers

http://www.cisco.com/warp/public/cc/pd/rt/12000/12416/
The Cisco 12000 series offers industry leading scalability, high
performance, and guaranteed priority packet delivery through an
innovative distributed architecture design that enables service providers
to accelerate the evolution of the Internet through delivery of profitable,
next generation services.
The Cisco 12416 Internet router is a 10 Gigabit, 16-slot chassis member
of the Cisco 12000 series that provides a total switching capacity of 320
Gigabits per second (Gbps), with 20 Gbps (10 Gbps full duplex) capacity
per slot.

20
Routers vs. Gateways

Researchers who invented TCP/IP defined the term IP


Gateway to refer to the systems that interconnected
networks and forwarded IP datagrams among them.
By the early 1990s, vendors had hired marketing people to help
them sell products. One vendor thought IP router sounded
better than IP gateway, and others quickly followed the lead.
When Microsoft incorporated TCP/IP software into their
Windows system, they chose to make the configuration screen
ask the user to enter a Gateway address.
So, the terms IP Gateway and IP router are synonymous.

21
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement

22
Forwarding Engine

[Figure: a packet (payload + header) arrives at the router; its
destination address is looked up in the routing lookup data
structure to determine the outgoing port.]

Forwarding Table
Dest-network Port
65.0.0.0/8 3
128.9.0.0/16 1

149.12.0.0/19 7
23
What makes table lookup difficult?

The Search Operation is not a Direct Lookup


32 bits address => 4 G entries

The Search Operation is also not an Exact Match Search
Exact match search: search for a key in a collection of keys
of the same length.

Metrics for lookup:


Lookup time
Storage space
Update time
Preprocessing time

24
An example

Destination IP Prefix    Outgoing Port
65.0.0.0/8               3
128.9.0.0/16             1
142.12.0.0/19            7

IP prefix: 0-32 bits (the prefix length)

[Figure: the address line from 0 to 2^32-1, with the ranges covered by
65.0.0.0/8 (65.0.0.0 - 65.255.255.255), 128.9.0.0/16, and
142.12.0.0/19 marked; the address 128.9.16.14 falls inside
128.9.0.0/16.]
25
Prefixes can overlap!

Dest. IP Prefix    Outgoing Port
65.0.0.0/8         3
128.9.0.0/16       1
142.12.0.0/19      7
128.9.16.0/21      2
128.9.172.0/21     4
128.9.176.0/24     5

[Figure: prefix length (8 to 32) plotted against the address line from
0 to 2^32-1; 128.9.16.0/21, 128.9.172.0/21, and 128.9.176.0/24 nest
inside 128.9.0.0/16. For the address 128.9.16.14, the longest
matching prefix is 128.9.16.0/21.]

Routing lookup: Find the longest matching prefix (i.e.
the most specific route) among all prefixes that match
the destination address.
26
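Longest-prefix match over the overlapping table above can be illustrated with a linear scan (illustration only; real routers use tries or TCAMs). Note `strict=False` because 128.9.172.0/21 is not aligned to a /21 boundary:

```python
import ipaddress

table = {
    "65.0.0.0/8": 3, "128.9.0.0/16": 1, "142.12.0.0/19": 7,
    "128.9.16.0/21": 2, "128.9.172.0/21": 4, "128.9.176.0/24": 5,
}

def lookup(dst):
    addr = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(p, strict=False), port)
               for p, port in table.items()
               if addr in ipaddress.ip_network(p, strict=False)]
    if not matches:
        return None
    # the most specific route = the largest prefix length
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(lookup("128.9.16.14"))   # /16 and /21 both match; /21 wins -> 2
print(lookup("128.9.200.1"))   # only 128.9.0.0/16 matches -> 1
```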
Lookup rate required

Year      Line     Line-rate (Gbps)   40B packets (Mpps)
1998-99   OC12c    0.622              1.94
1999-00   OC48c    2.5                7.81
2000-01   OC192c   10.0               31.25
2002-03   OC768c   40.0               125

31.25 Mpps => one lookup every 32 ns

DRAM: 50-80 ns, SRAM: 5-10 ns


27
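The 32ns budget in the table is just the transmission time of a minimum-size packet at line rate; a quick sketch:

```python
# Time budget per lookup for minimum-size (40B) packets at line rate.
def ns_per_packet(line_rate_gbps, pkt_bytes=40):
    return pkt_bytes * 8 / line_rate_gbps    # 1 Gb/s = 1 bit per ns

for name, rate in [("OC48c", 2.5), ("OC192c", 10.0), ("OC768c", 40.0)]:
    print(name, ns_per_packet(rate))  # OC192c -> 32.0 ns: SRAM-class only
```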
SONET/SDH http://www.sonet.com/edu/edu.htm

Optical  Electrical  Line Rate  Payload Rate  Overhead Rate  SDH
Level    Level       (Mbps)     (Mbps)        (Mbps)         Equivalent
OC-1     STS-1       51.840     50.112        1.728          -
OC-3     STS-3       155.520    150.336       5.184          STM-1
OC-9     STS-9       466.560    451.008       15.552         STM-3
OC-12    STS-12      622.080    601.344       20.736         STM-4
OC-18    STS-18      933.120    902.016       31.104         STM-6
OC-24    STS-24      1244.160   1202.688      41.472         STM-8
OC-36    STS-36      1866.240   1804.032      62.208         STM-12
OC-48    STS-48      2488.320   2405.376      82.944         STM-16
OC-96    STS-96      4976.640   4810.752      165.888        STM-32
OC-192   STS-192     9953.280   9621.504      331.776        STM-64

28
Table growth of a typical backbone router

29
Prefix length distribution

[Figure: prefix length distribution, with the multicast address range
marked.]

30
A standard solution: trie

e.g. packet with address 128.32.1.20


entries in the routing table:
...
(128.32.*, 3),
(128.32.1.*, 4),
...

31
Another example

Dest. IP Prefix    Outgoing Port
65.0.0.0/8         3
128.9.0.0/16       1
142.12.0.0/19      7
128.9.16.0/21      2
128.9.172.0/21     4
128.9.176.0/24     5

[Figure: a trie whose Root branches to 65 (65.*), 128, and 142; 142
branches to 12 (142.12.*); 128 branches to 9, which branches to 16
(128.9.16.*), 172 (128.9.172.*), and 176 (128.9.176.*).]
32
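The trie above can be sketched as a minimal binary (one bit per level) trie; production tries use multi-bit strides and path compression:

```python
class TrieNode:
    def __init__(self):
        self.child = [None, None]
        self.port = None                    # set if a prefix ends here

def ip_bits(dotted):
    a, b, c, d = (int(x) for x in dotted.split("."))
    v = (a << 24) | (b << 16) | (c << 8) | d
    return [(v >> (31 - i)) & 1 for i in range(32)]

def insert(root, prefix, port):
    dotted, plen = prefix.split("/")
    node = root
    for bit in ip_bits(dotted)[: int(plen)]:
        if node.child[bit] is None:
            node.child[bit] = TrieNode()
        node = node.child[bit]
    node.port = port

def longest_prefix_match(root, dst):
    node, best = root, None
    for bit in ip_bits(dst):
        node = node.child[bit]
        if node is None:
            break
        if node.port is not None:
            best = node.port                # remember deepest matching prefix
    return best

root = TrieNode()
for p, port in [("65.0.0.0/8", 3), ("128.9.0.0/16", 1), ("142.12.0.0/19", 7),
                ("128.9.16.0/21", 2), ("128.9.172.0/21", 4),
                ("128.9.176.0/24", 5)]:
    insert(root, p, port)
print(longest_prefix_match(root, "128.9.16.14"))   # -> 2
```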
Need more than IPv4 unicast lookups

Multicast
Longest Prefix Matching on the source and group address

IPv6
128bit destination address field
Exact address architecture not yet known

Packet classification

33
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement

34
Basic Architectural Components
Datapath: per-packet processing

35
Input Queueing vs Output Queueing

Input Queueing: usually a non-blocking switch fabric, e.g. crossbar
(or, single path switch)
Output Queueing: usually a fast bus
(or, multiple path switch)
36
Switch fabrics

Switch fabrics transfer data from input to output, ignoring
scheduling and buffering.
There are many types of fabric architectures.
Choosing one usually depends on where the switch will exist in the
network and the amount of traffic it will have to carry.

Switch Fabrics:
Time domain
  Shared media
  Shared memory
Space domain
  Single path
    Crossbar
    Broadcast
    Banyan
    Batcher-banyan
  Multiple path
    Replicated Banyan
    Dilated Banyan
    Tandem Banyan
37
Three types of switching fabrics

[Figure: memory, bus, and interconnection networks (e.g. crossbar).]

38
Switching Via Memory
packet copied by system's (single) CPU
speed limited by memory bandwidth

39
Switching Via Bus

datagram from input port memory


to output port memory via a shared bus
bus contention: switching speed limited by
bus bandwidth
1 Gbps bus, Cisco 1900: sufficient speed
for access and enterprise routers (not
regional or backbone)

40
Switching Via Interconnection Networks
An interconnection network is usually constructed using 2 x 2
switching elements, e.g. crossbar switch

Cross state Bar state

The way those switching elements are interconnected determines
the resulting switch architecture.
Overcome bus bandwidth limitations

Major types of Interconnection Networks:


Crossbar
Banyan based
Note: For an interconnection network, the number of 2x2
switching elements required is considered as a good measure
for complexity
41
Fundamental Properties of Interconnection
Networks
[Figure: an N x N interconnection network built from 2x2 switching
elements, with inputs 1..N and outputs 1..N.]

An interconnection network is usually constructed using 2 x 2
switching elements. The way those switching elements are
interconnected determines the resulting switch architecture.
We can find that there should be N^N input-output mappings
for a nonblocking packet switch, since there can be up to N
packets destined for a same output port simultaneously.

Output blocking/contention occurs at an output port if it cannot


accept the amount of packets destined to it in a packet transmission
time, or a time slot.
42
How to deal with output blocking?
[Figure: two N x N fabrics; in both cases the group size, i.e. the
maximum # of packets that can be received by an output per time
slot, increases.]

Speed-up
mainly adopted by time-domain switch fabrics
If in a packet transmission time, m packets can be forwarded to a
same output port, the switch fabric is said to have a speed-up
factor of m.

Multiple paths
mainly adopted by space-domain switch fabrics
Different packets can follow different paths to arrive at a same
output port simultaneously.
=> no speed-up is required, but the switch fabric is more
complicated in order to provide multiple paths
43
How to deal with excess packets?

The switch can either drop excess packets that cannot be


switched (loss systems), or it can buffer them for output
access in the next time slot (waiting systems).
In a loss system, one can increase the group size to reduce the
packet loss probability.
But the switch complexity increases.
Besides, packets can be dropped as a result of internal blocking in
the switch, as that for banyan networks
In a waiting system, excess packets can be buffered at inputs,
internally in the switch, or at outputs. A large group size =>
higher throughput.

Note: if group size > 1, output buffer is always required.

44
Crossbar Switch

[Figure: a 4 x 4 crossbar with data in on input buses 1-4 and data
out on output buses 1-4; each crosspoint can be configured in the
cross state or the bar state.]

2N buses in parallel
Complexity = N^2 crosspoints
45
Banyan Network

[Figure: 4 x 4, 8 x 8, and 16 x 16 banyan networks.]

An N x N banyan network has two properties (where N = 2^n):
There is a unique path from any input to any output.
There are log2 N = n columns/stages, each with N/2 switching
elements.
Banyan is self-routed
=> distributed control
47
Interconnecting pattern
Label switching stages from 0 to n-1, where N = 2^n.
Divide switching elements in stage k into 2^k groups.
Connecting pattern between stage k and stage k+1:
  Outputs from group i of stage k are connected to inputs of
  group 2i and group 2i+1 of stage k+1.
  Divide the outputs of group i into an upper half and a lower half:
  upper outputs of the switching elements are connected to group 2i
  inputs; lower outputs of the switching elements are connected to
  group 2i+1 inputs.
Connecting to network inputs: upper inputs first.
Connecting network outputs: in parallel.

48
Routing in banyan network

Each switching element in the i-th stage examines the i-th bit
of the destination address (most-significant-bit first) to make
the decision:
if the bit = 1, route to the lower output
if the bit = 0, route to the upper output

[Figure: an 8 x 8 banyan network with outputs labeled 000-111; a
packet with destination 011 is routed upper-lower-lower, and a
packet with destination 101 is routed lower-upper-lower.]

49
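The per-stage decisions above depend only on the destination bits, which is why no global control is needed; a trivial sketch:

```python
# Self-routing: stage i looks only at bit i of the destination
# address (MSB first).
def route(dest_bits):
    return ["lower" if b == "1" else "upper" for b in dest_bits]

print(route("011"))   # -> ['upper', 'lower', 'lower']
print(route("101"))   # -> ['lower', 'upper', 'lower']
```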
Blocking in banyan network

Internal blocking: if two packets, with different outputs,
contend for the same outgoing link at an intermediate
switching element.

[Figure: an 8 x 8 banyan network in which two packets destined for
011 and 101 collide at an intermediate switching element (internal
blocking!), while two packets both destined for 101 collide at the
output (output blocking!).]

50
Other banyan-typed networks

Shuffle-exchange (Omega) Network Reverse Shuffle-exchange network

Banyan Network
51
Simple switches based on banyan network
[Figure: a 2x2 switching element with load Pm on its input links
(stage m+1) and load Pm+1 on its output links.]

Consider a banyan network when it is operated as a loss
system. Assume a uniform traffic situation and let
Pm = Pr [there is a packet at an input link of stage m+1].
We can express
  P(m+1) = 1 - (1 - Pm/2)^2 = Pm - Pm^2/4
Therefore, Ploss = (P0 - Pn) / P0.
Using Taylor series expansion, we can have
  Ploss ~= n P0 / (n P0 + 4).
When P0 = 1, Ploss = 0.25, 0.39, 0.48, 0.5, 0.56, 0.6 for switches
with size 2, 4, 8, 16, 32, 64
=> blocking increases rapidly with switch size.
52
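The recurrence is easy to check numerically (a sketch with P0 = 1). Note that the exact iteration gives the first few quoted values, while the Taylor approximation n·P0/(n·P0 + 4) accounts for the figures quoted at the larger sizes:

```python
# Iterate P(m+1) = Pm - Pm^2/4 for an N = 2^n banyan and compare with
# the Taylor approximation n*P0/(n*P0 + 4).
def banyan_ploss(n_stages, p0=1.0):
    p = p0
    for _ in range(n_stages):          # one application per stage
        p = p - p * p / 4
    return (p0 - p) / p0

for n in range(1, 7):                  # switch sizes 2, 4, 8, 16, 32, 64
    print(2 ** n, round(banyan_ploss(n), 2), round(n / (n + 4), 2))
```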
A simple internally nonblocking switch
[Figure: an N x N internally nonblocking switch operated as a loss
system, with input load P0 and output load Pn.]

Performance of an internally nonblocking switch functioning as a
loss system:
Consider a particular output i. In any time slot, the prob. that none
of the N inputs has a packet destined for it is (1 - P0/N)^N -> e^(-P0)
as N increases.
Thus Pn = Pr [there is a packet at output i] = 1 - e^(-P0) for large N.
Since Pn increases with P0, the maximum throughput is obtained
when P0 = 1, i.e. rho* = 1 - e^(-1) = 0.632; and the corresponding
packet loss prob. Ploss = (P0 - Pn) / P0 = (1 - 0.632)/1 = 0.368.
Similarly, when P0 = 1, Ploss = 0.25, 0.32, 0.34, 0.356, 0.362, 0.365
for switches with size 2, 4, 8, 16, 32, 64
(compare the banyan: Ploss = 0.25, 0.39, 0.48, 0.5, 0.56, 0.6)

53
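The same calculation for the internally nonblocking case, checked numerically (a sketch):

```python
import math

# An output receives nothing w.p. (1 - P0/N)^N, so
# PN = 1 - (1 - P0/N)^N and Ploss = (P0 - PN)/P0.
def nonblocking_ploss(N, p0=1.0):
    pn = 1 - (1 - p0 / N) ** N         # prob. the output carries a packet
    return (p0 - pn) / p0

for N in (2, 4, 8, 16, 32, 64):
    print(N, round(nonblocking_ploss(N), 3))
print(round(math.exp(-1), 3))          # large-N limit of Ploss -> 0.368
```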
Combinatoric properties of banyan
networks
For a nonblocking (both internally and at output) N x
N packet switch, there are N^N possible input-output
mappings.
The total number of possible states (input-output
mappings) that can be realized by a banyan network is N^(N/2)
(= 2^((N/2) log N), one state per switching element).
The fraction of realizable input-output mappings is
N^(N/2) / N^N = 1 / N^(N/2)
=> approaches 0 as N increases

[Figure: the identity mapping 1-1, 2-2, 3-3, 4-4 on a 4 x 4 banyan.]
54
Analytical model for input-buffered Banyan
[Figure: an 8 x 8 input-buffered banyan network, inputs 1-8 to
outputs 1-8, with switching stages 1, 2, and 3.]

55
State transition diagram

1-qek(t) A1,n
A 1,n A2,n AB-1,n
A 1,n AB,n
E1,n E2,n EB,n
2,n
0, e 1, n 2, n B-1,n B, n
qek(t) A2,n AB-1,n
A1,n 1-rnk(t)
C2,b CB,b
(1-q1,bk(t))rbk(t) D1,n DB-1,n
B1,b B2,b BB-1,b BB,b
1, b 2, b B-1,b B, b
F1,b FB-1,b

B1,b B1,b BB-1,b 1-rbk(t)

Ai ,n qik,n (t )rnk (t ) Ci ,b (1 q ik,b (t )) rbk ( t )


A i ,n (1 qik,n (t ))(1 rnk (t )) Di ,n q ik,n ( t )(1 rnk ( t ))
Bi ,b (1 qik,b (t ))(1 rbk (t )) E i ,n (1 q ik,n ( t )) rnk (t )
B i ,b qik,b (t )rbk (t ) Fi ,b q ik,b ( t )(1 rbk (t ))

56
Non-blocking Conditions for Banyan Networks
Theorem. The banyan network is nonblocking if the active inputs x1, ..., xm
( xj > xi if j > i) and their corresponding output destinations y1, ..., ym
satisfy the following:
1. (Distinct & monotonic outputs): y1 < y2 < ... < ym or y1 > y2 > ... > ym
2. (Concentrated inputs): Any input between two active inputs is also
active. That is, xi < w < xj implies input w is active.

[Figure: a 16 x 16 banyan routing the concentrated, monotonic
requests x1 = 0000 -> y1 = 0010, x2 = 0001 -> y2 = 1011,
x3 = 0010 -> y3 = 1100 without conflicts.]
57
Labeling switching elements in a banyan
Each node in stage k can be uniquely represented by two
binary numbers (a(n-k)...a1, b1...b(k-1)): the low n-k bits of the
input address followed by the first k-1 bits of the output address.

[Figure: a 16 x 16 banyan with stage-1 nodes labeled (000,-) through
(111,-), stage-2 nodes (00,0) through (11,1), stage-3 nodes (0,00)
through (1,11), and stage-4 nodes (-,000) through (-,111); the path
from input 0001 to output 1001 visits (001,-), (01,1), (1,10), (-,100).]

In general, the path from input an...a1 to output b1...bn is
  an...a1 -> (a(n-1)...a1, -) -> (a(n-2)...a1, b1) -> ... -> (-, b1...b(n-1)) -> b1...bn

58
Proof:
Suppose two packets, one from x = an...a1 to output y = b1...bn, the
other from x' = a'n...a'1 to output y' = b'1...b'n, collide in stage k.
That is, the two paths
  an...a1 -> (a(n-1)...a1, -) -> (a(n-2)...a1, b1) -> ... -> (-, b1...b(n-1)) -> b1...bn
  a'n...a'1 -> (a'(n-1)...a'1, -) -> (a'(n-2)...a'1, b'1) -> ... -> (-, b'1...b'(n-1)) -> b'1...b'n
merge at the same node of stage k and share the same outgoing link:
  (a(n-k)...a1, b1...b(k-1)) = (a'(n-k)...a'1, b'1...b'(k-1)) and bk = b'k
Thus, we have
  a(n-k)...a1 = a'(n-k)...a'1 and b1...bk = b'1...b'k    (A)
59
Since input packets are concentrated, the total # of packets
between x and x', inclusively, is |x - x'| + 1.
Since all packets are destined for different outputs, there
must be |x - x'| + 1 distinct output addresses.
Since the outputs are monotonic, the largest and the smallest output
addresses must be y and y', or y' and y. Hence we must have
  |x' - x| + 1 <= |y' - y| + 1, i.e. |x' - x| <= |y' - y|    (B)
From (A), we have
  |x' - x| = |a'n...a'1 - an...a1|
           = |a'n...a'(n-k+1)0...0 - an...a(n-k+1)0...0| >= 2^(n-k)
  |y' - y| = |b'1...b'n - b1...bn|
           = |b'(k+1)...b'n - b(k+1)...bn| <= 2^(n-k) - 1
This contradicts (B). Thus the theorem is proved.

Proof End
60
E.g.

Violating the concentrated input condition but still
non-blocking:

[Figure: an 8 x 8 banyan in which two non-adjacent active inputs
carrying packets for outputs 011 and 101 are still routed without
conflict.]

61
Theorem 2. Let the input-output pair of packet i be denoted by (xi,
yi). If the packets can be routed through the banyan network without
conflicts, so can the set of packets ((xi + z ) mod N, yi).
Proof: (try it yourself)

e.g. z = 5

[Figure: the conflict-free set x1 = 0000 -> y1 = 0010,
x2 = 0001 -> y2 = 1011, x3 = 0010 -> y3 = 1100 remains
conflict-free after shifting the inputs by z = 5:
x1 = 0101, x2 = 0110, x3 = 0111.]
62
Exercise:
[Figure: an 8 x 8 banyan network with outputs labeled 000-111.]

Consider an 8 x 8 banyan network. Suppose with probability


0.75 a packet is destined for outputs 000, 001, 010, or 011,
and with probability 0.25 it is destined for the other four
outputs. Within each group of outputs, the packet is equally
likely to be destined for any of the four outputs. Is the loss
probability higher in this case than when a packet is equally
likely to be destined for any of the eight outputs?

63
Solution:

[Figure: the 8 x 8 banyan with per-stage loads P1U, P2U, P3U marked
on links leading toward the upper four outputs and P1L, P2L, P3L on
links leading toward the lower four outputs.]

Let PiU and PiL be defined as shown in the figure. P0 is the input
load and let P0 = 1.
At stage 1,
  P1U = 1 - (1 - 0.75 P0)^2 = 0.9375
  P1L = 1 - (1 - 0.25 P0)^2 = 0.4375
At stage 2,
  P2U = 1 - (1 - 0.5 P1U)^2 = 0.7178
  P2L = 1 - (1 - 0.5 P1L)^2 = 0.3896
At stage 3,
  P3U = 1 - (1 - 0.5 P2U)^2 = 0.5890
  P3L = 1 - (1 - 0.5 P2L)^2 = 0.3516
The overall loss prob. is
  Ploss = (P0 - (P3U + P3L)/2) / P0 = 1 - (0.5890 + 0.3516)/2 = 0.5297

When a packet is equally likely to be destined for any of the 8
outputs,
  P1 = P0 - P0^2/4 = 0.75
  P2 = P1 - P1^2/4 = 0.6094
  P3 = P2 - P2^2/4 = 0.5166
  P'loss = 1 - P3 = 0.4834
Therefore, the loss probability in this case is higher than the case
where a packet is equally likely to be destined for any output.
64
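The stage-by-stage calculation above can be reproduced with one helper (`load` is a hypothetical name: the output load of a 2x2 element when each input carries load p and routes to that output with probability f):

```python
def load(p, f):
    return 1 - (1 - f * p) ** 2

p1u, p1l = load(1.0, 0.75), load(1.0, 0.25)    # stage 1: 75%/25% split
p2u, p2l = load(p1u, 0.5), load(p1l, 0.5)      # stages 2-3: uniform split
p3u, p3l = load(p2u, 0.5), load(p2l, 0.5)
ploss = 1 - (p3u + p3l) / 2
print(round(ploss, 4))                          # -> 0.5297
```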
Sorting Networks
[Figure: a 4 x 4 sorting network of comparators sorting four 2-bit
values in three stages; the intermediate labels include min{a1,a2},
max{a3,a4}, min{a1,a2,a3,a4}, max{min{a1,a2},min{a3,a4}},
min{max{a1,a2},max{a3,a4}}, and max{a1,a2,a3,a4}.]

Comparator: it takes two input numbers and places the
larger number on the output pointed to by the arrow
and the smaller number on the other output.
65
Order-preserving Property: Suppose a sorting network sorts the
input sequence a = a1,a2,...,aN into the output sequence b =
b1,b2,...,bN. Then for any monotonically increasing function f, the
network sorts the input sequence f(a) = f(a1),f(a2),...,f(aN) into
the output sequence f(b) = f(b1),f(b2),...,f(bN).

[Figure: the input sequence 5,4,7,1 is sorted by the network into
1,4,5,7.]

E.g. f(x) = x + 2: the input 7=f(5), 6=f(4), 9=f(7), 3=f(1) is
sorted into f(1)=3, f(4)=6, f(5)=7, f(7)=9.

What if f(x) = x + 2 for x < 2 and f(x) = x - 3 for x >= 2? Then f
is not monotonically increasing, and the input 2=f(5), 1=f(4),
4=f(7), 3=f(1) is sorted into 1,2,3,4, which is NOT the sequence
f(1)=3, f(4)=1, f(5)=2, f(7)=4.
66
Theorem 3 (Zero-One Principle). If a sorting network with N inputs
sorts all the 2^N possible sequences of 0s and 1s correctly, then it sorts all
sequences of arbitrary input numbers correctly.
Proof:
Consider a sorting network that can sort all sequences of 0s and 1s correctly. By
contradiction, suppose it does not sort input sequences of arbitrary numbers correctly.
That is, there is an input sequence a1,a2,...,aN containing two elements ai and aj such
that ai < aj, but the network places aj before/above ai.
Define a monotonically increasing function f(x) such that f(x) = 0 if x <= ai, and
f(x) = 1 if x > ai.
According to the order-preserving property, since the network places aj before/above
ai when the input sequence is a1,a2,...,aN, it places f(aj) = 1 before/above f(ai) = 0
when the input sequence is f(a1),f(a2),...,f(aN). But this input sequence consists of
only 0s and 1s, and yet the network does not sort it correctly, leading to a
contradiction.

[Figure: the arbitrary input a1,...,aN and its zero-one image
f(a1),...,f(aN) fed through the same sorting network.]

67

Sorting networks based on bitonic sort
Merging is a divide-and-conquer technique for sorting.
A k-merger takes two sorted input sequences and merge them
into one sorted sequence of k elements.
Intuitively merging is simpler than sorting in general.
Suppose we have mergers of different sizes; they can be
interconnected (as shown below) to sort an arbitrary input
sequence.
One way to construct the mergers is to use the bitonic sorting
algorithm invented by Batcher.

[Figure: a recursive N-merger: pairs of inputs feed 2-mergers, pairs
of 2-mergers feed 4-mergers, and so on up to two N/2-mergers whose
outputs feed the final N-merger.]
68
Some properties of bitonic sequence
A bitonic sequence is a sequence that either increases monotonically
and then decreases monotonically, or decreases monotonically and
then increases monotonically.
E.g. 1,3,5,7,6,4,2,0;  7,5,3,1,0,2,4,6;  1,2,3,3,2,1
A bitonic sorter is a merger that takes a bitonic sequence and sorts
it into a monotonic sequence.

[Figure: an ascending k-bitonic sorter turns a k-element bitonic
input sequence into an ascending sequence; a descending k-bitonic
sorter turns it into a descending sequence. Concatenating an
ascending sorted sequence and a descending sorted sequence yields a
bitonic input.]
69
Some properties of bitonic sequence
We focus on bitonic sequences with only 0s and 1s (why? the
zero-one principle)
Two general forms: 1^i 0^j 1^k or 0^i 1^j 0^k
A bitonic sequence a is said to be no less than another bitonic
sequence b if none of the elements in a is less than any of the
elements in b.
  e.g. 01110 >= 00000, 11111 >= 01110
Two sequences do not necessarily have an ordering relationship.
  e.g. 00010 and 01110
Using the following theorem, a bitonic sequence can be
decomposed into two bitonic subsequences a' and a'' using only
one stage of comparators.

70
Some properties of bitonic sequence
Theorem 4. If a zero-one sequence of 2n elements a = a1,a2,...,a2n is
bitonic, then the two n-element sequences
  a'  = min(a1,an+1), min(a2,an+2), ..., min(an,a2n) and
  a'' = max(a1,an+1), max(a2,an+2), ..., max(an,a2n)
have two properties:
1. They are both bitonic.
2. a' <= a''.
Proof:

[Figure: a half-cleaner: a comparator between each pair ai, an+i
places min(ai,an+i) in the upper half a' and max(ai,an+i) in the
lower half a''.]
71
Recursive construction of a k-bitonic sorter
[Figure: a k-half-cleaner compares ai with a(k/2+i), producing
min(ai, a(k/2+i)) on the upper k/2 lines and max(ai, a(k/2+i)) on
the lower k/2 lines.]

A k-bitonic sorter is built recursively: a k-half-cleaner followed
by two k/2-bitonic sorters (one on the upper half, one on the lower
half), and so on down to 2-half-cleaners, yielding an ascending
sequence.
72
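The recursion above (half-cleaner, then two k/2-bitonic sorters) can be rendered directly in software; a sketch assuming power-of-two input lengths:

```python
def bitonic_merge(seq, ascending=True):
    """Sort a bitonic sequence: half-clean, then recurse on each half."""
    if len(seq) == 1:
        return seq
    mid = len(seq) // 2
    seq = list(seq)
    for i in range(mid):                      # the half-cleaner stage
        if (seq[i] > seq[i + mid]) == ascending:
            seq[i], seq[i + mid] = seq[i + mid], seq[i]
    return (bitonic_merge(seq[:mid], ascending)
            + bitonic_merge(seq[mid:], ascending))

def batcher_sort(seq, ascending=True):
    """Sort arbitrary input: opposite-direction halves form a bitonic input."""
    if len(seq) == 1:
        return seq
    mid = len(seq) // 2
    left = batcher_sort(seq[:mid], True)      # ascending half
    right = batcher_sort(seq[mid:], False)    # descending half
    return bitonic_merge(left + right, ascending)

print(batcher_sort([3, 2, 8, 7, 1, 6, 5, 4]))   # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

Each swap in `bitonic_merge` corresponds to one comparator of the hardware network.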
An 8 x 8 Sorting Network using bitonic sorters

[Figure: an 8 x 8 sorting network built from 2-mergers, 4-mergers,
and an 8-merger, each merger implemented as a bitonic sorter; the
input sequence 3,2,8,7,1,6,5,4 is sorted to 1,2,3,4,5,6,7,8.]

Total # of stages = 1 + 2 + 3 + ... + log N = (1 + log N) log N / 2

The total # of comparators in an N x N Batcher Network is:
  N log N (log N + 1) / 4
73
Batcher-Banyan network

[Figure: inputs enter a Batcher sorting network whose outputs feed a
banyan network, which delivers packets to the outputs.]

Q1: how packets are switched in Batcher network?


Q2: how contentions are resolved such that banyan
is nonblocking?

74
Switching in Batcher-banyan network
Only headers are compared; bits arrive serially, header first.
If both header bits of the two packets are 0s or 1s, the
comparator remains in its original state (in this example, the bar
state) and the bits are forwarded to the outputs.
For the first pair of bits that differ, set the comparator state
accordingly; it then remains unchanged for the rest of the packet.
If the two packets have the same output address, the comparator
remains in its original state for the whole packet duration.

[Figure: two packets with headers 0100 and 1000 entering a
comparator assumed to start in the bar state. It remains in bar
after the 1st and 2nd header bits (both pairs equal); at the 3rd
bit the headers differ and the upper input is larger, so the
comparator is set to cross, and it remains in cross for the whole
packet duration.]
75
Contention resolution in Batcher-banyan network
[Figure: a Batcher sorter followed by a banyan. Input packets with
output addresses 001, 100, idle, 100, 000, idle, idle, 101 are
sorted so that the active packets (000, 001, 100, 100, 101) appear
concentrated and in monotonic order at the banyan inputs.]

What if some inputs are idle?


Add an extra bit in front of the MSB (most significant bit) of
the output port address of each packet, called the activity bit.
If the input is active, set this bit to 0
Otherwise, construct a dummy packet and set this bit to 1
All dummy packets will be pushed to the lower end of the outputs
=> all active packets are concentrated at the inputs to the banyan switch

76
Three-phase algorithm
How to solve the output contention problem?
Three phase algorithm for resolving output contention:
Probe phase: only the header of packets enter the sorting
network. Packets with the same output address will be
adjacent to each other at the outputs. Output j+1 checks with
output j to see if their addresses are the same. If yes, let the
packet at output j+1 be the loser and the packet at output j be
the winner.
Acknowledgment phase: acknowledgements are back-
propagated along the same path as the forward path in the
probe phase.
Send phase: send the winning packets; inputs that have lost
contention can buffer their packets for later attempt (=>
waiting system approach)
The first two phases are overheads.

77
An example
[Figure: probe phase. The request headers 001, 100, idle, 100, 000,
idle, idle, 101 are sorted by the Batcher network; the two requests
for output 100 end up adjacent at the sorter outputs, and one of
them loses (marked X).]

[Figure: send phase. The winning packets (000, 001, 100, 101)
re-enter the Batcher-banyan, concentrated by the activity bit, and
are delivered to their outputs without conflict; the losing packet
for output 100 is buffered for a later attempt.]

78
Multiple-Path Banyan Switch Designs

Complexity of Batcher-banyan:
  N log N (log N + 1) / 4 comparators
  (N log N) / 2 switching elements

Can we get rid of the complexity of the Batcher network
while keeping the performance of Batcher-banyan
and the advantages of banyan?
Yes, using multiple-path banyan networks:
Dilated banyan
Replicated banyan
Tandem banyan

79
Multiple-Path Banyan Switch Designs
Dilated Banyan
The internal link bandwidth is expanded to reduce the likelihood of
a packet being dropped.
For a banyan with dilation degree d, the switching elements are of
size 2d x 2d: each outgoing address has d associated outgoing links.

Replicated Banyan
K parallel banyan planes are fed through a random router or
broadcaster.
Suppose each packet is randomly routed to one of the banyans for
switching. The load on each banyan is reduced by a factor of K,
thus
  Ploss ~= n P0 / (n P0 + 4K)
Instead of random routing, we can broadcast a packet to all K
banyan planes. Since a packet is lost only if all copies are lost,
  Ploss ~= (n P0 / (n P0 + 4))^K
80
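The two replicated-banyan operating modes are easy to compare with the formulas above (a sketch with P0 = 1):

```python
# Loss estimates for an N = 2^n banyan replicated over K planes.
def ploss_random(n, K):
    return n / (n + 4 * K)             # random routing: load split K ways

def ploss_broadcast(n, K):
    return (n / (n + 4)) ** K          # broadcast: lost only if all copies lost

n, K = 6, 4                            # 64 x 64 banyan, 4 planes
print(round(ploss_random(n, K), 3))    # -> 0.273
print(round(ploss_broadcast(n, K), 3)) # -> 0.13
```

For these parameters broadcasting wins, at the cost of K times the internal traffic.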
Tandem Banyan

[Figure: K banyan networks in series; a packet filter for marked
packets sits between consecutive banyans, and correctly routed
packets leave through delay elements to the concentrator at each of
the N outputs (Output 1 ... Output N).]

Deflection routing:

Whenever there is a conflict at a 2x2 switch element, one packet would be


routed correctly while the other would be marked and routed in the wrong
direction.
To optimize the number of correctly routed packets, the marked packet would
have a lower priority than an unmarked one for the rest of the journey within
the banyan network (why?)
If a packet remains unmarked when it reaches the output of the banyan, it has
reached the correct destination. It is removed and forwarded to the
concentrator associated with the output destination
On the other hand, a marked packet will be unmarked and forwarded to the next
banyan (by packet filter), and a new attempt to route the packet to its desired
output is initiated
A packet is considered lost if it still fails to reach the desired output after
passing through all the K banyan networks. 81
Let D be the delay suffered by a packet as it travels through a single
banyan network.
A packet that reaches its correct destination at a later banyan
experiences a larger delay
To compensate for the delay differences (why?), one can insert delay
elements with varying delays at different places:
for the links that connect the N outputs of banyan i to the N
concentrators, one can introduce a delay of (K-i)D.

[Figure: the tandem banyan again, with delay elements of (K-i)D on
the links from banyan i to the concentrators.]
82
A practical issue

Today, it is the number


of chip I/Os, not the
number of crosspoints,
that limits the size of
a switch fabric.

The problem is how to


interconnect chips
with limited I/Os to
form a larger switch?

83
Three-stage switching network
Clos Network
Switch modules are arranged in three stages, and any module in the
first (second) stage is interconnected with any module in the second
(third) stage via a unique link.
Each switch module is itself a nonblocking switch.

[Figure: r1 first-stage modules of size n1 x r2, r2 middle-stage
modules of size r1 x r3, and r3 third-stage modules of size r2 x n3.]

For an N x N 3-stage switch:
  n1 x r1 = n3 x r3 = N
84
An example

n1 = r2 = r1 = r3 = n3 = 3

[Figure: a 9 x 9 three-stage network with three 3 x 3 modules in
each stage.]

3-stage switching network is not necessarily


nonblocking.

85
Theorem. A three-stage switching network is strictly
nonblocking if and only if
  r2 >= n1 + n3 - 1
Proof sketch:

[Figure: a first-stage module with n1-1 of its inputs busy (only one
idle) and a third-stage module with n3-1 of its outputs busy (only
one idle). In the worst case these (n1-1) + (n3-1) existing
connections occupy distinct middle-stage modules, so one more
middle-stage module is needed to connect the idle input to the idle
output: r2 >= (n1-1) + (n3-1) + 1.]
86
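The theorem reduces to a one-line check (for the symmetric case n1 = n3 = n it reads r2 >= 2n - 1):

```python
# Strict-sense nonblocking test for a three-stage Clos network.
def strictly_nonblocking(n1, n3, r2):
    return r2 >= n1 + n3 - 1

print(strictly_nonblocking(3, 3, 3))   # -> False (the 9 x 9 example above)
print(strictly_nonblocking(3, 3, 5))   # -> True  (r2 = 2n - 1)
```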
Summary: Switch Fabric Design using Interconnection Networks
Banyan networks
nice, but with a serious internal blocking problem
Non-blocking conditions for banyan networks
how to satisfy them?
Batcher-banyan networks
a sorting network (Batcher) placed before the banyan to satisfy the
non-blocking conditions
Multipath banyan networks
keeping the performance of Batcher-banyan and the advantages of banyan
Clos networks
tackling the issue of interconnecting chips with limited I/Os to form a
larger switch fabric
87
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement
  Input Port Queueing
  Output Port Queueing
  Combined Input Output Queueing

88
Input Port Queueing
Note: When we say a switch is input-queued, we imply that in each time slot
the switch fabric allows at most one packet to be sent by each input port,
and at most one packet to be received by each output port.
Fabric slower than the combined input rate -> queueing may occur at the
input queues
Head-of-the-Line (HOL) blocking: the queued datagram at the front of the
queue prevents the others in the queue from moving forward

89
An Analogy

90
Input Port Queueing
Performance

Throughput of an input-queueing switch
When N is large, the maximum throughput is 2 - sqrt(2) = 0.586, found
under the uniform traffic condition:
all inputs have the same loading
each packet is destined for any output port with the same probability
in case of contention, the winner is chosen randomly

[Figure: delay vs. load; the delay grows without bound as the load
approaches 58.6%, well short of 100%.]
91
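The 0.586 figure is the classical HOL-blocking saturation throughput, 2 - sqrt(2), for large N:

```python
import math

# Saturation throughput of a FIFO input-queued switch under HOL blocking,
# for large N and uniform random traffic: 2 - sqrt(2).
theta_star = 2 - math.sqrt(2)
print(round(theta_star, 3))  # -> 0.586
```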
92
Input Queueing -- Virtual output queues

[Figure: each input maintains N virtual output queues (VOQs), one per
output, so a packet can only be held up by packets destined for the same
output.]
93
Input Queues
Virtual Output Queues

Memory b/w = 2R

Scheduler

[Figure: delay vs. load; with VOQs and a good scheduler, the throughput
limit moves from 58.6% towards 100%.]

94
Input Queueing
Scheduling

[Figure: a 4 x 4 VOQ switch with per-VOQ queue lengths, and the
corresponding bipartite graph between inputs 1-4 and outputs 1-4. The
scheduling problem is to pick a matching in this graph.]

Bipartite Matching
Question: Maximum weight or maximum size? (Weight of the matching
shown = 18)
95
Input Queueing
Scheduling

Maximum Size matching
maximizes instantaneous throughput
does it maximize long-term throughput? Not necessarily.

Maximum Weight matching (weights = queue lengths)
clears the most backlogged queues
does it sacrifice long-term throughput? No.

Maximum Weight is better!

96
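Maximum-weight matching can be sketched by brute force on a small switch. The occupancy matrix below is illustrative, not the slide's exact figure, and real schedulers use much faster (often approximate) algorithms:

```python
from itertools import permutations

def max_weight_matching(w):
    """Brute-force maximum-weight input-output matching for an N x N
    VOQ switch, with w[i][j] = length of input i's VOQ for output j.
    Exponential in N -- a sketch only."""
    n = len(w)
    best, best_w = None, -1
    for perm in permutations(range(n)):   # perm[i] = output matched to input i
        total = sum(w[i][perm[i]] for i in range(n))
        if total > best_w:
            best, best_w = perm, total
    return best, best_w

# Illustrative 4 x 4 VOQ occupancy matrix:
w = [[7, 0, 1, 0],
     [0, 4, 0, 2],
     [0, 0, 3, 5],
     [2, 0, 0, 4]]
match, weight = max_weight_matching(w)
print(match, weight)  # the best matching here has weight 18
```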
Input Queueing
Why is serving long/old queues better than serving the maximum number of
queues?

When traffic is uniformly distributed, serving the maximum number of
queues already leads to 100% throughput.
When traffic is non-uniform, some queues become longer than others.
A good algorithm keeps the queue lengths matched, and still serves a
large number of queues.

[Figure: average occupancy per VOQ, under uniform traffic (roughly flat)
and non-uniform traffic (a few VOQs much longer than the rest).]
97
Points to Ponder:

Can we design an input-queued switch with k queues per input port, where
1 < k < N, such that it is scalable and has the high throughput of a VOQ
switch?

[Figure: an input port with k queues feeding a fabric with N outputs.]

98
Odd-even switch

[Figure: an 8 x 8 non-blocking switch. Each input keeps two FIFO queues:
Queue 1 holds packets destined for odd-numbered outputs, Queue 2 holds
packets destined for even-numbered outputs.]

In each time slot, scheduling is divided into 2 contention rounds, one
for each output-address group.
The 1st contention round considers the HOL packets of all even queues;
the 2nd round considers the HOL packets of all odd queues.
After the 2 contention rounds, the winning packets of both rounds are
switched together.
Note: each input/output can still send/receive at most one packet per
time slot.
99
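One time slot of the two-round scheme can be sketched as follows. The function name and data format are ours, and contention is resolved by a random choice per output, as in the earlier throughput analysis:

```python
import random

def odd_even_slot(hol, rng):
    """One time slot of the odd-even scheme sketched above (sketch; names
    and data format are ours). hol[i] maps 'even'/'odd' to the destination
    of input i's HOL packet in that output-address group, or None.
    Returns the (input, output) pairs switched together this slot."""
    granted, busy_in, busy_out = [], set(), set()
    for group in ('even', 'odd'):                 # round 1: even, round 2: odd
        contenders = {}
        for i, q in enumerate(hol):
            d = q[group]
            # an input that already won, or an output already taken, sits out
            if d is not None and i not in busy_in and d not in busy_out:
                contenders.setdefault(d, []).append(i)
        for d, inputs in contenders.items():      # random winner per output
            winner = rng.choice(inputs)
            granted.append((winner, d))
            busy_in.add(winner)
            busy_out.add(d)
    return granted

hol = [{'even': 2, 'odd': 3},
       {'even': 2, 'odd': 1},
       {'even': 4, 'odd': 3}]
print(odd_even_slot(hol, random.Random(0)))
```

Note the invariant enforced by `busy_in`/`busy_out`: each input sends, and each output receives, at most one packet per slot, as the slide requires.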
Performance Results

# of queues | Max. Throu. (ana.) | Max. Throu. (sim.) | Percentage Error
          1 | 0.586              | 0.586              | insignificant
          2 | 0.705              | 0.713              | 1.1
          5 | 0.827              | 0.849              | 2.6
         10 | 0.890              | 0.917              | 3.0
         20 | 0.933              | 0.956              | 2.4
         50 | 0.966              | 0.980              | 1.4
        100 | 0.980              | 0.993              | 1.3
Table 1. Maximum throughput

For details, please refer to the following paper:
Kwan L. Yeung and H. Shi, "Throughput Analysis for Input-buffered ATM
Switches with Multiple FIFO Queues per Input Port," Electronics Letters,
Vol. 33, No. 19, pp. 1604-1606, Sep. 1997.

100
High-Speed Router Design

Outline:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement
  Input Port Queueing
  Output Port Queueing
  Combined Input Output Queueing

101
Output port queueing

Note: When we say a switch is output-queued, we imply that in each time
slot the switch fabric can deliver up to N packets to an output port.
Output buffering is needed when the arrival rate via the switch exceeds
the output line speed.
queueing (delay) and loss due to output port buffer overflow
102
Output Queueing

Individual Output Queues: memory b/w = (N+1).R per output
(up to N writes from the fabric plus 1 read onto the line per slot)

Centralized Shared Memory: memory b/w = 2N.R
(N writes plus N reads per slot share one memory)

103
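The bandwidth formulas above compare as follows for a hypothetical port count and line rate (N and R are illustrative values):

```python
# Memory bandwidth required per buffer for an N-port switch with line
# rate R, per the formulas above. N and R are illustrative values.
N = 16          # ports
R = 10e9        # 10 Gb/s line rate

input_queue   = 2 * R         # per input buffer: 1 write + 1 read per slot
output_queue  = (N + 1) * R   # per output buffer: up to N writes + 1 read
shared_memory = 2 * N * R     # one memory absorbs all N inputs and N outputs

print(input_queue / 1e9, output_queue / 1e9, shared_memory / 1e9)
# -> 20.0 170.0 320.0 (Gb/s)
```

The gap between 2R and (N+1)R is why output queueing, despite its better delay behavior, scales poorly in memory speed as N and R grow.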
Comments:
Neither an output-buffered switch nor an input-buffered VOQ switch
suffers from HOL blocking.
Does this mean they have the same performance?
No.
Which one is better?
The output-buffered switch.
Why?
Intuitively, consider the case where an output line is idle. In an
output-buffered switch, a packet (if any) in the output buffer can be
sent immediately; in an input-buffered VOQ switch, even if there are
packets waiting at the input buffers for this output port, they cannot
be sent immediately because of the constraints imposed by the VOQ
scheduling algorithm.
e.g. under longest-queue-first, the queue destined for this output port
may not be the longest.

104
Combined Input-Output Queueing
Can we design an input-queued switch that behaves exactly like an
output-queued switch?

Yes. It has been proved that a combined input-output queueing (CIOQ)
switch with a switch-fabric speedup of (at least) 2, together with a
suitable scheduling algorithm (e.g. a stable-marriage matching
algorithm), can emulate an output-buffered switch exactly.
Why bother to do so?
To get the performance of an output-queued switch at the cost of an
input-queued switch.
Note: to provide Quality of Service (QoS) guarantees on, e.g., packet
delay, scheduling algorithms can be easily designed and efficiently
implemented for output-buffered switches (more on this later, under
Scheduling).

105
Using Speedup

[Figure: a CIOQ switch with buffers at both the inputs and the outputs;
with speedup S, the fabric can transfer up to S packets per port per
time slot.]

106
The Ideal Solution

[Figure: an N x N output-queued switch compared ("=?") with an N x N
combined input-output queued switch.]

107
The findings

For a switch with combined input and output queueing to exactly mimic an
output-queued switch, for all types of traffic, a speedup of 2 - 1/N is
necessary and sufficient.

But:
how do we make such a scheduling algorithm fast and efficient enough for
real implementation?

108
High-Speed Router Design

Summary:
Introduction
Router Generations
Table Lookup
Switch Fabric Design
Buffer Placement

109