
David Tsiang, Cedrik Begin, Guglielmo Morandin

4/22/13

© 2010 Cisco and/or its affiliates. All rights reserved.

Cisco Confidential

- Goals and requirements of switch fabrics
- Buffering strategies (input, output, CIOQ)
- Transport (packet vs. cell)
- Topologies (single-stage, multi-stage)
- Congestion management (proactive, reactive)
- Multicast
- Service provider examples
- Enterprise and datacenter examples


Scale
- Bandwidth per fabric interface
- Number of fabric interfaces

Fairness
- Usually want non-blocking and fair (sometimes weighted fairness)
- Non-blocking: no cross-flow interference between src-dest flows, e.g. a congested flow doesn't unduly interfere with a non-congested flow

Latency
- Service provider: ~100 us (WAN distances dominate; jitter matters)
- Enterprise: 10s of us (campus distances)
- Datacenter: ~1 us (datacenter distances; compute performance is latency sensitive)

Cost
- SP vs. datacenter vs. enterprise

Redundancy
- 1:1, 1+1, 1:N


Central
Shared memory

Input
Deep buffers only on input

Output
Deep buffers only on output

Combined input/output (CIOQ)
Deep buffers on input and output


- Usually associated with central memory switch fabric designs
- Bandwidth scale limited by memory bandwidth (can be improved by distributing over several parallel memory slices)
- Limited queue scale (not practical for multi-chassis)
- Similar performance characteristics to an output buffered switch, without the need for a large speedup
- Examples: early Cisco routers (AGS+, 7000, 7500), smaller routers (ISR, ASR1K, Procket, early Juniper routers M40, M160)

(Diagram: ingress FIA writes packets into central memory; egress FIA reads them out. FIA = Fabric Interface Adaptor.)



- Buffers on input
- Requires Virtual Output Queues (VOQs) to be non-blocking
- Most common type of buffering (GSR, N7K, ASR9K, Panini)

(Diagram: send-side FIAs hold per-destination VOQs in memory; cells cross the switch fabric to the receive-side FIAs.)
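The point of VOQs can be shown in a few lines. This is a minimal sketch (class and field names are invented for illustration): with one queue per destination, a backpressured destination cannot head-of-line block traffic headed elsewhere.

```python
from collections import deque

class InputFIA:
    """Input-buffered fabric interface with one Virtual Output Queue per destination."""
    def __init__(self, num_dests):
        self.voq = [deque() for _ in range(num_dests)]  # one queue per egress

    def enqueue(self, dest, pkt):
        self.voq[dest].append(pkt)

    def dequeue_eligible(self, dest_ready):
        """Send one packet to any destination the fabric can currently accept."""
        for dest, q in enumerate(self.voq):
            if q and dest_ready[dest]:
                return dest, q.popleft()
        return None

fia = InputFIA(num_dests=3)
fia.enqueue(2, "to-congested")   # destination 2 is backpressured
fia.enqueue(0, "to-free")        # destination 0 is idle

# With a single input FIFO, "to-free" would wait behind "to-congested"
# (head-of-line blocking). With VOQs it can be sent immediately:
print(fia.dequeue_eligible(dest_ready=[True, True, False]))  # (0, 'to-free')
```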


- Only works if there is no congestion within the switch fabric!
- Can be achieved if the speedup is high (the path from Send to RCV is >> FIA input BW)
- A pure output buffered switch is not practical for large systems (needs a speedup of N)

(Diagram: the switch fabric feeds per-destination output queues (OQs) in memory on the receive-side FIAs.)


- A high speedup (e.g. 2-3x FIA BW) is enough most of the time
- Input queues for the cases where the speedup is insufficient - blocking
- CRS uses this (VOQ scale is impractical for an input-only approach)

(Diagram: send-side FIAs hold input queues (IQs) and receive-side FIAs hold output queues (OQs), on either side of the switch fabric.)


Two main methods of transporting data across a switch fabric:

Packet - send whole packets (or even multiple packets as a frame)
- Advantages:
  - Simpler: no reassembly of cells (but may have to reorder packets)
  - Higher efficiency: per-packet overhead vs. per-cell overhead
  - Lower average latency (can do cut-through on egress)
- Disadvantages:
  - Slightly higher complexity for buffered switch chips
  - Higher worst-case latency (small packets must wait behind larger packets)
  - Not as scalable: large scale switches require distribution, which requires cells to be efficient; packet transport requires bundling of links to achieve low latency for large packets, which does not allow for large scale distribution

Cell - segment packets into smaller, fixed-size cells
- Advantages:
  - Lower worst-case latency (important for TDM types of traffic)
  - Scalable (easy to evenly distribute cells)

(Cell, continued)
- Disadvantages:
  - Higher complexity: requires segmentation and reassembly of cells
  - Higher overhead: per-cell overhead, packet packing efficiency
  - Worse average latency: the reassembly and reordering cell buffer adds latency; can't do egress cut-through

Generally:
- Packet transport: pure packet fabrics, single chassis scale fabrics
- Cell transport: hybrid packet/TDM switches, most large scale multi-chassis fabrics
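The per-packet vs. per-cell overhead tradeoff is easy to quantify. A rough comparison, using illustrative header and payload sizes (not the numbers of any specific fabric): a packet one byte over a cell boundary pays for a whole padded cell, while packet transport pays a fixed header regardless of length.

```python
import math

def packet_efficiency(pkt_len, pkt_hdr=4):
    """Goodput fraction when the whole packet is sent with one header."""
    return pkt_len / (pkt_len + pkt_hdr)

def cell_efficiency(pkt_len, cell_payload=48, cell_hdr=16):
    """Goodput fraction when the packet is cut into fixed cells (last one padded)."""
    cells = math.ceil(pkt_len / cell_payload)
    return pkt_len / (cells * (cell_payload + cell_hdr))

# 49 bytes is the worst case here: two cells, the second nearly empty.
for length in (49, 64, 1500):
    print(length, round(packet_efficiency(length), 2), round(cell_efficiency(length), 2))
```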


Mesh (pt-pt, bus)
- Scale limited to the bus bandwidth or FIA bandwidth
- Used in smaller and/or older systems

(Diagram: FIAs attached to a shared bus / full mesh of point-to-point links.)


Single stage (crossbar, central memory)
- Scale limited by the number of serdes I/O on a single chip
  - e.g. 224x224 serdes on a chip (SM15) limits a system to 224 FIAs; Panini has 768 FIAs
- Use parallel crossbars for bandwidth scale and redundancy

(Diagram: FIAs connected to several parallel crossbars.)

3-stage symmetric CLOS
- #S1 = #S2 = #S3; traffic always takes two hops (S1->S2, S2->S3)
- For an NxN xbar chip, can connect up to N^2 FIAs
  - Scale is usually less, due to the common practice of combining S1 and S3 (folded: N^2/2) and the number of S1s and FIAs achievable in a LC chassis
- Provably non-blocking via rearrangement (connection oriented) or load balancing (requires some speedup to overcome imperfect randomized load balancing)

(Diagram: FIAs connect to S1 stages, S1s to S2s, and S2s to S3 stages feeding the destination FIAs.)
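The N^2 and N^2/2 scaling claims above follow from simple port counting: each S2 has N inputs, so there can be at most N S1 chips, each facing N FIAs (or N/2 FIAs when S1 and S3 are folded onto one chip). A quick sketch of that arithmetic:

```python
def max_fias(n, folded=False):
    """Upper bound on FIAs for a 3-stage Clos built from NxN crossbar chips.

    n: crossbar radix (NxN chip). In the folded case one chip acts as both
    S1 and S3, so half of its links face FIAs and half face the S2 stage.
    """
    if folded:
        return n * (n // 2)   # N chips x N/2 FIA-facing links = N^2 / 2
    return n * n              # N S1 chips x N FIA links each = N^2

# For a 224x224 chip (SM15-class radix):
print(max_fias(224))               # theoretical unfolded limit
print(max_fias(224, folded=True))  # folded limit, N^2 / 2
```

Real systems land well below these bounds, as the slide notes, because chassis packaging limits how many S1s and FIAs fit per linecard chassis.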


n-stage topologies (hyper-cube, torus, hyper-torus)
- Flows can take a variable number of hops from source to destination
- Lower cost
  - FIAs typically interconnect directly: fewer components, no fabric chassis needed
  - Less interconnection cost, but the interconnect can be complex for a less-than-fully-populated system, or a system with varying speed nodes
- Requires complex scheduling to be non-blocking
  - Typically flow based path selection
  - Must be able to dynamically change paths if flow bandwidths change (reordering?)
  - Slow to recover from failures (massive path recomputations)


Centralized timeslot scheduling
- Allows bufferless crossbars with no data loss (no collisions)
- Difficult to achieve a maximum match (O(n^2.5), O(n^3) complexity)
  - Approximate maximum match instead (e.g. iSLIP, PIM, WFA); needs speedup to overcome the imperfect match
- Not that scalable (O(n) complexity, but n can be large; scheduler speed can be an issue as well)

Distributed timeslot scheduling
- Scheduling done by each destination independently
- Imprecise: sources may receive multiple grants but can only act on one
  - Results in loss of bandwidth; can be overcome with speedup
- Scalable since it's distributed, but somewhat inefficient

Distributed bandwidth scheduling
- Distribute bandwidth (credits or MTU packets) on request; the source sends when ready
- Collisions can occur: needs buffering in the xbar and some speedup
- Scalable since it's distributed, and efficient
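To make "approximate maximum match" concrete, here is one request/grant/accept iteration in the style of PIM (Parallel Iterative Matching) — a sketch of the idea only, not any shipping scheduler; real implementations iterate several times and use rotating priorities (iSLIP) rather than pure randomness.

```python
import random

def pim_iteration(requests, rng):
    """One PIM-style iteration. requests[i] = set of outputs input i has cells for."""
    # Grant phase: each requested output picks one requesting input at random.
    grants = {}                       # output -> granted input
    for out in {o for req in requests for o in req}:
        requesters = [i for i, req in enumerate(requests) if out in req]
        grants[out] = rng.choice(requesters)
    # Accept phase: each input that got grants accepts one output at random.
    accepts = {}                      # input -> accepted output
    for inp in set(grants.values()):
        offered = [o for o, i in grants.items() if i == inp]
        accepts[inp] = rng.choice(offered)
    return accepts                    # a (partial) input->output match

rng = random.Random(1)
match = pim_iteration([{0, 1}, {0}, {2}], rng)
print(match)  # each input and each output appears at most once
```

One iteration may leave input/output pairs unmatched — that is exactly the "imperfect match" the slide says must be covered by speedup.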

i.e. no scheduling - just send. The switch fabric either:
- buffers and asserts flow control if buffers get too full, or
- just drops if buffers get too full (may require ack + retransmission)

- Requires a large speedup to get good performance in oversubscribed scenarios
- Is blocking for congestion > speedup, because flow control within the switch fabric is usually coarse (not down to the VOQ level)
- Can reduce blocking by adding secondary flow control from the destination back to the source - this can be at a VOQ level


Hybrid schemes
- Can combine proactive and reactive schemes
  - e.g. send speculatively and, if congested (reactive), request a re-send (proactive)
- Better latency when non-congested


Typically very challenging
- Scheduling per MC group is not practical
  - Combinatoric number of groups: 2^n - 1, where n is the number of FIAs
- Usually drop on congestion, or use reactive flow control
- Alternative: turn multicast into unicast - congestion can now be isolated
  - But this can be blocking for unicast (ingress replication) or expensive (server replication)


Fabric multicast (replication in the switch fabric)
- No impact to linecards; scalable to 100% multicast
- Drop on congestion

Ingress replication (the ingress LC sends one copy per destination)
- Can block ingress if there is not enough speedup to overcome the replication dilation
- Staggered delivery


Binary tree replication (LCs replicate onwards to other LCs across the fabric)
- Can block ingress if there is not enough speedup to overcome the replication dilation (but less chance of this, due to the distribution of the replication)
- Very staggered delivery

MC server replication (a dedicated multicast server card replicates)
- No ingress blocking
- Additional expense of MC server cards
- Staggered delivery


- Cell based (fixed at 64B)
- Centrally scheduled
- Non-buffered crossbar
- Input buffered
- Single chassis topology, up to 16 slots
- 3 generations: 2.5G -> 10G -> 20G per slot
  - Each generation supports the previous generations' LCs
- Multicast replication in the fabric
- 2 priorities per cast in the fabric (not in all generations)


- The fabric works in cell periods of 128 ns; the cell clock is distributed to the LCs and switch cards
- Packets are segmented into cells in the ingress path
- For each cell, a request is sent to the SCA (Scheduler Control ASIC)
- The SCA determines which input -> output connections to make; it sends grants to the IFIA and controls the XBARs
- Cells are sent across the serial links and XBARs to the EFIA
- Packets are reassembled in the egress path

- Ingress ToFab queues: per destination slot HPQ + LPQs, multicast HPQ + LPQs (MDRR)
- FIA (ToFab): H/L unicast queues per destination LC + H/L multicast queues
- The SCA algorithm is used to ensure fairness and maximize throughput over the fabric
  - It schedules between UC/MC requests (alternating priority between UC and MC); within a priority, input LCs get their fair share of traffic towards the output linecards
- FIA (FrFab): per-source UC/MC reassembly queues that can flow control the SCA if nearing full


- Multicast to different LCs is performed by the crossbar
- A given multicast cell is transmitted to N destinations across the crossbar
- Partial grants are supported
  - If a cell wants to go to destinations 1, 2 and 3, the fabric may first grant {1, 3} and then grant {2} in a subsequent cell time


Switch cards:
- A redundant switch card allows correction of errors in a single serial link stream
- The redundant stream carries the XOR of the 4 other streams - this provides 4+1 redundancy

CSC (Clock and Scheduler) cards:
- 2 of these in the system; one is operational and the other standby


Per-generation ASICs (generations: 622M / 2.5G / 10G / 20G):
- Fabric connection: 5x1.25G -> 20x1.25G -> 20x2.5G
- FIA: FIA -> FIA-48 -> Fusilli (also TFIA/FFIA, Superfish, EROS)
- Scheduler: SCA -> SCA192 -> HAD -> Hecate (priority)
- XBAR: XBAR -> XBAR192 -> IRIS

- Cell based (fixed 136B cells with cell packing)
- Unscheduled
- Single stage / 3-stage fabric
- Single chassis / multi-chassis capable
- Architecture scales up to 1536 EFIAs
  - 2/4 EFIAs per LC
  - VOQ not feasible (the system has 1M+ output queues); solved with fabric speedup plus flow control
- 3 generations: 40G -> 120G -> 400G per slot
  - New generations are required to support the previous generations
- Input buffered (for fabric congestion); output buffered (enabled by the fabric speedup)
- Multicast replication in the fabric; 2 priorities per UC/MC in the fabric


(Diagram: CRS multi-chassis fabric - up to 8 fabric planes; line card S1 stages feed S2 stages in the fabric chassis, which feed S3 stages back to the line cards.)
- 136-byte cells
- 40/120/400 Gbps per slot
- 2.5x speedup
- 2 levels of priority
- Multicast support: 1M multicast groups
- Buffered, non-blocking switch; multi-stage interconnect: 3-stage Clos topology


IFIA:
- Segments packets into fixed size cells
- Distributes cells evenly to the planes
S1 stage:
- Distributes all cells evenly to all S2s
S2 stage:
- UC: directs the cell to an S3 stage based on the destination address
- MC: replicates the cell to S3 stages based on the FGID
S3 stage:
- UC: directs the cell to an EFIA based on the destination address
- MC: replicates the cell to EFIAs based on the FGID
EFIA:
- Receives cells from all planes and reassembles packets per source/cast
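The spray/reassemble pair above can be sketched in a few lines. This is an illustrative model only (real cells carry a sequence number in the cell header; the structures here are invented): cells are distributed round-robin over planes, and the receiver merges the per-plane streams back into order.

```python
def spray(cells, num_planes):
    """Distribute a packet's cells evenly over planes, tagged with a sequence number."""
    planes = {p: [] for p in range(num_planes)}
    for seq, cell in enumerate(cells):
        planes[seq % num_planes].append((seq, cell))
    return planes

def reassemble(planes):
    """Merge the per-plane streams back into the original cell order."""
    merged = [sc for stream in planes.values() for sc in stream]
    return [cell for _, cell in sorted(merged)]

cells = [f"cell{i}" for i in range(7)]
planes = spray(cells, num_planes=3)
print(planes[0])                      # every 3rd cell landed on plane 0
assert reassemble(planes) == cells    # order restored despite distribution
```

Per-plane skews in arrival time are what make the resequencing buffer at the EFIA (and its latency cost, noted earlier) necessary.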


A single chassis can have:
- A 3-stage topology, if the switching element does not have enough links for single stage
- A single stage topology:
  - Full mesh between the IFIAs, EFIAs and fabric chips
  - Fabric chips in S123 mode, whereby incoming cells are routed directly to the EFIAs

Linecard chassis:
- Fabric cards contain the S1 and S3 stages (these may be combined into ASICs doing both stages)

Fabric chassis:
- Fabric cards contain the S2 stages; 1 or more fabric cards may implement the S2 stage of a plane

- Full mesh between the IFIAs, EFIAs and fabric chips
- Fabric chips in S13 mode:
  - Traffic local to a chassis does not go over the optical links
  - Traffic destined to other chassis goes over the optical links

CRS unicast queuing (IFIA -> S1 -> S2 -> S3 -> EFIA; packets from the NPU at ingress, to the NPU at egress):
- IFIA: 3072 high-priority + 3072 low-priority fabric destination (EFIA) queues, gated by fabric destination backpressure and a discard filter
- S1: a single data queue
- S2: queues per priority per S3 group
- S3: queues per priority per fabric destination (EFIA)
- EFIA: resequencing & reassembly; 4K raw queues; 8k shaped queues

Input buffering:
- Per destination (EFIA) H/L queue; the system scale (~1M OQs) was deemed too large for VOQ
Output buffering:
- Per faceplate port, a configurable number of queues

CRS multicast queuing (same IFIA -> S1 -> S2 -> S3 -> EFIA path):
- IFIA: 1 high-priority and 1 low-priority multicast queue
- S1: a single data queue for both UC and MC data cells
- S2: queues per S3 group per priority and cast (i.e. separate queues for MC)
- S3: queues per destination per priority and cast (i.e. separate queues for MC)
- EFIA: some number of MC raw queues

IFIAs send traffic into the fabric; 2 main flow controls regulate this, handling the cases where the fabric speedup is insufficient.

Destination backpressure:
- Used to minimize buffer occupancy in the fabric during short term congestion
- Operates at per-destination-EFIA granularity
- S3 queue congestion + S2 feed-forward counts contribute
- Ingress FIAs implement a slow start algorithm to minimize overshoot

Discard:
- Operates at a per faceplate port granularity
- Used to alleviate potential fabric congestion by keeping congested traffic from entering the fabric
  - That is, we do not want to send packets across the fabric which are going to be discarded at the EFIA anyway
- Also minimizes the queuing delay at the IFIA due to congestion at the destination

(Diagram: destination backpressure flows from S3/EFIA back to the IFIA; the discard filter sits at the IFIA; 8k shaped queues at the EFIA.)

Other flow controls exist in the fabric as well:
- Stages may backpressure upstream stages if they are out of resources (not expected in normal operation)

- Multicast is performed in the fabric at the S2 and S3 stages
- IFIAs have 2 multicast queues (H/L)
- The cell header contains an FGID field, which the S2 and S3 stages use as an index into a replication table
  - 1M fabric groups
- S2 and S3 replicate cells; there are no flow control mechanisms
  - Separate queues for H/L multicast; if there is congestion, multicast cells are dropped
- Scheduling between unicast and multicast cells is WRR at the S2 and S3 stages
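The FGID mechanism is just a table lookup per stage. A minimal sketch — the table contents, port numbering, and function names are invented for illustration; the real table is 1M entries indexed by the cell header's FGID field:

```python
# Hypothetical replication table for one fabric stage: FGID -> output ports.
REPLICATION_TABLE = {
    7: {0, 2, 5},    # FGID 7 -> replicate to outputs 0, 2 and 5
    9: {1},          # FGID 9 -> a single member, effectively unicast
}

def replicate(cell, fgid):
    """Return (output, cell) copies this stage must emit for one multicast cell."""
    return [(out, cell) for out in sorted(REPLICATION_TABLE.get(fgid, ()))]

copies = replicate("mc-cell", fgid=7)
print(copies)   # one copy per member output of FGID 7
```

Because each stage expands the cell only towards the members below it, the ingress sends a single copy per plane regardless of fan-out — the cost the "fabric multicast" approach avoids relative to ingress replication.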


4/8 planes in the system:
- FQ: 8 planes, 1 plane per fabric card
- HQ: 8 planes, 2 planes per fabric card
- QQ: 4 planes, 1 plane per fabric card

- There is enough speedup in the fabric to handle one plane down without adversely affecting performance
- Further planes down result in reduced fabric performance
- The fabric can operate with a minimum of 2 planes


Per-generation ASICs:
- 40G: fabric links 2.5G (8b10b); FIA: Sprayer, Sponge; XBAR: SEA (36x72)
- 120G: fabric links 5G (scrambler + 8b10b); FIA: Seal, Crab; XBAR: Superstar (128x144)
- 400G: fabric links 8.625G (scrambler); FIA: Inbar; XBAR: Sapir (128x128)


- Cell based (variable cell size)
- Distributed scheduling
- Single stage / 3-stage fabric topologies
- Single chassis / multi-chassis capable
- Architecture scales up to 768 FIAs
  - Up to 6 FIAs per LC
- Input buffered: 64K VOQs (4 COS per 10GE)
- Multicast replication in the fabric (512K groups)
- 2 independent pipes in the fabric (OTN; data UC/MC)


Panini multi-chassis fabric architecture
(Diagram: 6 fabric planes; nx200G line cards with S1 stages feed S2 stages in the fabric chassis over 240 Gbps CXP links, then S3 stages back to the line cards; 1x speedup.)
- 64~256B cells
- Single priority
- 2 paths in the fabric
- Multicast support: 512K multicast groups; replication at S2 and S3
- 3-stage CLOS topology

IFIA:
- Segments packets into variable sized cells
- Distributes cells evenly to the planes
S1 stage:
- Distributes all cells evenly to all S2s
S2 stage:
- UC: directs the cell to an S3 stage based on the destination address
- MC: replicates the cell to S3 stages based on the FGID
S3 stage:
- UC: directs the cell to an EFIA based on the destination address
- MC: replicates the cell to EFIAs based on the FGID
EFIA:
- Receives cells from all planes and reassembles packets per source/cast


Single stage topology:
- Full mesh between the FIAs and fabric chips
- FIAs spray cells to the fabric chips, which send them on to the correct FIA


3-stage topology (multi-chassis):
- In each linecard chassis, all of the FIAs are connected to each S1/S3
- Full mesh between the S1/S3s and the S2s on a plane
- Fabric chassis: fabric cards contain the S2 stages; 1 or multiple fabric cards may implement the S2 stage of a plane


3-stage topology (back-to-back):
- In each linecard chassis, all of the FIAs are connected to each S1/S3
- Full mesh between the S1/S3 and S2 stages
- The S2 stage is shared between the two chassis' fabric cards

Unicast: distributed scheduling
- VOQs in the IFIA indicate their occupancy state to the EFIAs they are associated with
- EFIAs issue credits fairly (WFQ) to IFIAs based on:
  - The VOQ state from the IFIAs
  - The number of links up between S3 and itself
  - Congestion indication from the fabric
- IFIAs send cells of packets into the fabric for VOQs that hold credit

Multicast: unscheduled
- Packets are sent into the fabric
- Congestion may result in drops or global flow control, depending on priority

Congestion in the fabric between UC/MC:
- With only UC traffic there should be no sustained congestion, as each EFIA controls the traffic towards it
- When MC is introduced, congestion may occur and the fabric will indicate this to the EFIA; the EFIA then adapts UC credits down to a configured value, to make room for the MC traffic
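The destination-driven credit scheme above can be sketched simply. This is an illustrative model, with round-robin standing in for the WFQ the slide mentions and all names invented: the EFIA hands out credits only to IFIAs that have advertised backlogged VOQs, and an IFIA may only send when it holds credit.

```python
from collections import deque

def issue_credits(voq_occupancy, credits_available):
    """EFIA-side sketch: grant credits one at a time, round-robin over backlogged IFIAs.

    voq_occupancy: {ifia_id: number of cells queued for this EFIA}
    Returns {ifia_id: credits granted}.
    """
    backlogged = deque(i for i, occ in voq_occupancy.items() if occ > 0)
    grants = {i: 0 for i in voq_occupancy}
    while credits_available and backlogged:
        ifia = backlogged.popleft()
        grants[ifia] += 1
        credits_available -= 1
        voq_occupancy[ifia] -= 1
        if voq_occupancy[ifia] > 0:
            backlogged.append(ifia)   # still backlogged: stay in rotation
    return grants

# IFIA 0 has 5 cells queued, IFIA 1 has 2, IFIA 2 is idle; 4 credits to give.
print(issue_credits({0: 5, 1: 2, 2: 0}, credits_available=4))  # {0: 2, 1: 2, 2: 0}
```

Because the destination only grants what its own drain rate can absorb, sustained unicast congestion inside the fabric is avoided without any central scheduler.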

Ingress unicast: 64K VOQs
- Enough for 4 COS queues per 10GE
Ingress multicast:
- 4 class queues towards the fabric
Fabric: queues per pipe per destination
- S2: a queue per S3 destination
- S3: a queue per EFIA
Egress: per cast/class queues towards the egress NPU


- Multicast is performed in the fabric at the S2 and S3 stages
- The cell header contains an FGID field, which the S2 and S3 stages use as an index into a replication table
  - 512K fabric groups
- S2 and S3 replicate cells to the wanted destinations


- 6 planes in the system
- There is enough speedup in the fabric to handle one plane down without adversely affecting performance
- Further planes down result in reduced fabric performance
- The fabric can operate with a minimum of 1 plane


GSR cell format:
- Packets are segmented into 64B cells: 48B of payload, 8B of cell header, 8B of CRC
- No cell packing: a given cell may only have data for one packet
- Each cell is split (in 16-bit chunks) over 4 serial links, one to each XBAR
- A fifth, redundant serial link carries information for error correction
- Links are 8b10b encoded
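The 4+1 link scheme (used here and for the switch-card redundancy described earlier) is plain XOR parity, and can be sketched directly — an illustrative model, not the actual serializer logic: a cell is split into 16-bit chunks across four links, a fifth link carries their XOR, and any single failed link can be rebuilt at the far end.

```python
def split_with_parity(cell: bytes):
    """Split a cell into 4 interleaved 2-byte-chunk streams plus an XOR parity stream."""
    chunks = [cell[i:i + 2] for i in range(0, len(cell), 2)]
    links = [b"".join(chunks[i::4]) for i in range(4)]   # 16 bytes/link for a 64B cell
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*links))
    return links, parity

def recover(links, parity, lost: int):
    """Rebuild the stream of one failed link from the 3 survivors plus parity."""
    survivors = [l for i, l in enumerate(links) if i != lost] + [parity]
    return bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*survivors))

cell = bytes(range(64))                  # one 64-byte cell
links, parity = split_with_parity(cell)
assert recover(links, parity, lost=2) == links[2]   # any single link is recoverable
```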

CRS cell format:
- Packets are segmented into 136B cells: 4B of RS code (for error correction), 12B of header, 120B of payload
- Unicast cells can be packed, so that a cell can contain data from 2 packets (e.g. a 30-byte tail of packet 1 plus the start of packet 2); a cell may also carry a single packet payload (120 bytes of packet 1)
- Multicast cells are not packed
- Control cells: Idle, Discard, SRCC
- Cells are: 8b10b encoded on 2.5G links; scrambled + 8b10b encoded on 5G links; scrambled on 8.625G links

Panini packet and cell format:
- Packet: 4-byte fabric header, 14-byte packet header, packet payload up to 9.6k bytes; a CRC-32 covers the packet
- Cell: 13-byte fabric header, 64-256 byte data payload, 1-byte CRC (CRC-8)
- Idle cells are sent if there is no data to send
- 11.5G serdes, 64/66 encoding
- FEC covers a group of cells (for the optical links); retransmit is used for electrical link error correction


CRS fabric flow control (IFIA -> S1 -> S2 -> S3 -> EFIA):
- EFIA raw queue state controls the discard filter
- S3 queue state generates destination backpressure; S2 queue state and feed-forward counts are incorporated into the destination backpressure
- S2 OOR (out of resources) state is used to control scheduling in S1; S3 OOR state is used to control scheduling in S2
- S1 hiccups control per-plane scheduling at the Sprayer


- First high performance enterprise switch
  - FCS in 1998
- The first implementation was a shared bus; it evolved to a single stage fabric
- Large set of features, supported also by special service cards
- Wire rate at minimum packet size


(Diagram: port ASICs share a data bus (DBUS) and results bus (RBUS) with the bus arbiter and the decision engine.)
- Port ASICs on the linecards; decision engine and DBUS arbiter on the supervisor card
- 16 Gb/s total system bandwidth
- Input and output buffers; no backpressure from output
- The bus arbiter prevented multiple port ASICs from writing on the bus simultaneously
- Two bus priorities to support VoIP; more queuing classes on the port ASICs


(Diagram: ioslices with per-priority input queues connect through fabric interfaces and high speed serial links to the crossbar; the decision engine steers packets; output queues on egress.)


Input queues:
- No VOQs, only per-priority input queues
- No congestion feedback from the egress ports to ingress; blocking is possible

Crossbar:
- Can drop packets when congested
- Initially a centralized decision engine, later distributed onto each linecard, to support line rate at minimum packet size as newer generations of linecards got faster


- Packet based
- Arbiters (conceptually one per output buffer) independently decide which input is writing to which output
- Each crossbar link is a bundle of 8 serdes
  - Lower port count required
- Input and output queues
  - Input queues cause blocking: 3x internal overspeed to compensate; requires store and forward
  - Input queues can drop
- Egress WRR
- Two priorities supported, as two separate datapaths and queue sets

- The crossbar does replication according to a Fabric Port Of Exit (FPOE) mask
  - The FPOE is set by the ingress port ASIC
  - Replication is done by writing to multiple output queues simultaneously
  - Multiple retries are possible to satisfy the replication mask; the 3x internal overspeed helps maintain rate
- The egress fabric interface and egress port ASIC use a Destination Index in the packet header to access a lookup table and perform further replications


- Support for no-drop protocols (Fibre Channel over Ethernet)
- Bandwidth optimized
- More than 16 slots per chassis
- High density, including oversubscribed linecards


- Centrally scheduled
- Packet based
- 3-stage fabric, buffered crossbars
- Input buffered
- Single chassis topology, up to 16 linecard slots + 2 supervisor slots
- Multicast replication in the fabric
- One priority per cast in the fabric; 8 priorities per cast in the ioslices


(Diagram: ingress VOQs send requests to the central scheduler and receive grants; a crossbar with input and output buffers (one of three stages shown); per port, per class output queues at egress return credits to the scheduler.)

- Packets are queued into VOQs
  - The VOQ system is shared among the ports on an ingress ioslice
- Multiple small packets are accumulated into a superframe
  - Up to a max size of about 3000 bytes
- Requests are made to the central scheduler
  - No size information (MTU assumed); destination port and priority only
- Grants are generated according to egress buffer availability
  - The central scheduler keeps track of buffer availability for every egress queue in the system
- Superframes are sent to the fabric upon grant reception
  - Smaller than max size if the grant arrives quickly, i.e. when little or no congestion is present
  - Split into fragments if a packet is bigger than the max superframe size
- No drops in the crossbars and outputs
  - Optionally no drops in the VOQs, by issuing pause frames
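The superframing behavior above can be sketched as follows — an illustrative model only, with the ~3000-byte cap taken from the text and the packet sizes invented: packets accumulate until the cap would be exceeded, but an early grant (little congestion) ships a smaller frame immediately.

```python
MAX_SF = 3000   # approximate max superframe size, per the text

def build_superframe(pending, grant_arrived):
    """Pop queued packets into one superframe; stop at the cap or on an early grant."""
    frame, size = [], 0
    while pending:
        pkt = pending[0]
        if size + len(pkt) > MAX_SF:
            break                      # next packet would overflow: ship what we have
        frame.append(pending.pop(0))
        size += len(pkt)
        if grant_arrived:
            break                      # grant already here: send a small frame now
    return frame, size

pending = [b"x" * 900] * 5
frame, size = build_superframe(pending, grant_arrived=False)
print(len(frame), size)   # 3 packets, 2700 bytes (a 4th would exceed 3000)
```

Superframing amortizes the per-grant scheduling overhead over many small packets when congested, while the early-grant path keeps latency low when the fabric is idle.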


- Up to 8 VOQs per destination port at the ingress ioslice
  - Shared across ingress ports
  - Ingress drops according to tail drop or some form of AQM (WRED, AFD)
- The ingress ioslice determines load balancing over the fabric planes
  - Round robin or pseudo-random
- One egress queue for each (egress port, priority)
  - No drops at egress
  - One credit loop per (egress port, priority); the buffer is hard-partitioned
  - Credits are returned to the central scheduler when a packet leaves the system, creating available buffer
  - The egress scheduler controls the egress queue drain rate, so it controls the credit return rate
  - The central scheduler distributes the aggregate grant rate across the requesting VOQs
2010 Cisco and/or its affiliates. All rights reserved.

Cisco Confidential

77

Three stage folded CLOS
(Diagram: linecards hold the FIAs/ioslices (stages 1 and 3) and local crossbars; spine cards hold the stage 2 crossbars; central schedulers (active/standby) with a credit aggregator path sit on the supervisors.)
- Links are bundles of 8 serdes
- An aggregator ASIC in the linecard reduces the number of arbitration links to the central schedulers
- Depending on the ioslice bandwidth, multiple links run between the ioslices and the crossbar(s) on the linecard
2010 Cisco and/or its affiliates. All rights reserved.

Cisco Confidential

78

- Not scheduled
  - No superframing possible; no VOQs, only input queues, independent of group membership
- Replication is performed at the S2 and S3 stages
  - Separate internal datapath in the xbar for multicast
  - The packet header contains an index into a replication table
  - Multiple retries to satisfy all required destination crossbar ports, limited by a timer; on timeout, drop
- Egress ioslices do further replication to individual ports
- Load balanced over the fabric planes using a flow hash
  - No reordering required; the max rate of a single flow is limited by the link bundle capacity
- The fabric can also be programmed to flow control multicast
  - More blocking, but preferable for financial applications where MC is low average bandwidth but very bursty
- Scheduling between unicast and multicast is DWRR



- Low latency
  - Sub-microsecond; cut-through operation
- Single chassis, multiple chips
- Support for no-drop classes (FCoE)


- Packet based, with full cut-through
- Speculative transmission
  - Similar to shared Ethernet: collision detection and retransmission
- Single crossbar stage with large overspeed
- Unbuffered crossbar; input and output buffered


Nexus 6k design
(Diagram: a 576 x 10G or 144 x 40G system; IFIAs and EFIAs, each handling 12 x 10G or 3 x 40G, around a 48-port crossbar; ack/nack and Xon/Xoff feedback paths run out of band.)
- Single stage: packet based, unbuffered, latency-optimized crossbar center stage
- Fabric latency-optimized for 10 Gbps or 40 Gbps; packet cut-through at the fabric operating rate
- Single links or bundles of 4; no local switching on line cards; fabric speedup of 3.6
- No scheduling: random path selection, with out-of-band ack/nack from the crossbar for each transmission
- Shared memory third stage with per port, per class unicast and multicast queues
- 576 x 8 all-unicast VOQs per ingress port; 8k multicast queues per ingress port
- Unicast per-class backpressure: out-of-band Xon/Xoff broadcast from each unicast output queue to all ingresses; superframing if congested
- Egress shared memory with per-port pruning for multicast
- 8 classes of service across the entire switch; no-drop classes supported by the Xon/Xoff feedback
- 15 MB ingress buffer, 9 MB egress buffer (optionally shared)

- Packets are queued into VOQs
  - A separate VOQ system for each input port on an ingress ioslice, to support ingress cut-through
  - 8 classes of service across the entire switch
  - Superframing when congested
- If the destination is not congested, a unicast packet is sent immediately
  - Random path selection
  - The crossbar nacks the packet if no downlink is available; the ingress stops and retries on a different path
  - Only one packet in flight per VOQ, so there is no need to reorder packets at egress
- Large speedup in the egress buffer and egress downlinks, to reduce the collision probability
- Egress queue per (priority, port)
  - Broadcasts Xoff before it gets too full, to avoid egress drops
  - After getting Xon, the ingress waits a random time before attempting to send
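The speculative send-and-retry loop above can be sketched as follows — an illustrative model, not the actual ASIC behavior: the ingress picks a random path, and on a nack (busy egress downlink) it retries on a different path until one accepts.

```python
import random

def send_speculative(paths_busy, rng):
    """Return the path that finally carried the packet, or None if all nacked.

    paths_busy[p] is True when path p's egress downlink is occupied, in which
    case the crossbar nacks and the ingress retries on another path.
    """
    paths = list(range(len(paths_busy)))
    while paths:
        path = rng.choice(paths)       # random path selection
        if not paths_busy[path]:
            return path                # crossbar acks: packet delivered
        paths.remove(path)             # nacked: retry on a different path

rng = random.Random(7)
chosen = send_speculative([True, True, False, True], rng)
assert chosen == 2                     # only path 2 has a free downlink
```

The 3.6x fabric speedup quoted above is what keeps the nack probability low enough that this scheme rarely retries in practice.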


- Sent immediately by the ingress
  - Subject only to uplink availability; no egress Xon/Xoff
- Replication is performed in the crossbar
  - One copy to each ioslice participating in the MC group
  - The crossbar nacks with a success mask; if some destinations did not get a copy, the ingress retries
  - No memory in the crossbar
- The egress chip has shared memory
  - One memory write, multiple reads according to group membership
  - Some destinations may be pruned based on queue lengths
