
David Tsiang, Cedrik Begin, Guglielmo Morandin

4/22/13

© 2010 Cisco and/or its affiliates. All rights reserved.

Cisco Confidential

- Goals and requirements of switch fabrics
- Buffering strategies (input, output, CIOQ)
- Transport (packet vs. cell)
- Topologies (single-stage, multi-stage)
- Congestion management (proactive, reactive)
- Multicast
- Service provider examples
- Enterprise and datacenter examples


Scale
- Bandwidth per fabric interface
- Number of fabric interfaces

Fairness
- Usually want non-blocking and fair (sometimes weighted fairness)
- Non-blocking: no cross-flow interference between src-dest flows, e.g. a congested flow doesn't unduly interfere with a non-congested flow

Latency
- Service provider: ~100 us (WAN distances dominate; jitter matters)
- Enterprise: 10s of us (campus distances)
- Datacenter: ~1 us (datacenter distances; compute performance is latency sensitive)

Cost
- SP vs. datacenter vs. enterprise

Redundancy
- 1:1, 1+1, 1:N


Central
Shared memory

Input
Deep buffers only on input

Output
Deep buffers only on output

Combined input/output (CIOQ)
Deep buffers on input and output


- Usually associated with central memory switch fabric designs
- Bandwidth scale limited by memory bandwidth (can be improved by distributing over several parallel memory slices)
- Limited queue scale (not practical for multi-chassis)
- Similar performance characteristics to an output buffered switch, without the need for a large speedup
- Examples: early Cisco routers (AGS+, 7000, 7500), smaller routers (ISR, ASR1K, Procket, early Juniper routers M40, M160)

(Diagram: ingress FIA writes packets into central memory; egress FIA reads them out. FIA = Fabric Interface Adaptor.)



- Buffers on input
- Requires Virtual Output Queues (VOQs) to be non-blocking
- Most common type of buffering (GSR, N7K, ASR9K, Panini)

(Diagram: send-side FIAs hold per-destination VOQs in memory; cells cross the switch fabric to the receive-side FIAs.)
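The point of VOQs can be shown in a few lines. This is a minimal sketch (class and field names are invented for illustration): with one queue per destination, a backpressured destination cannot head-of-line block traffic headed elsewhere.

```python
from collections import deque

class InputFIA:
    """Input-buffered fabric interface with one Virtual Output Queue per destination."""
    def __init__(self, num_dests):
        self.voq = [deque() for _ in range(num_dests)]  # one queue per egress

    def enqueue(self, dest, pkt):
        self.voq[dest].append(pkt)

    def dequeue_eligible(self, dest_ready):
        """Send one packet to any destination the fabric can currently accept."""
        for dest, q in enumerate(self.voq):
            if q and dest_ready[dest]:
                return dest, q.popleft()
        return None

fia = InputFIA(num_dests=3)
fia.enqueue(2, "to-congested")   # destination 2 is backpressured
fia.enqueue(0, "to-free")        # destination 0 is idle

# With a single input FIFO, "to-free" would wait behind "to-congested"
# (head-of-line blocking). With VOQs it can be sent immediately:
print(fia.dequeue_eligible(dest_ready=[True, True, False]))  # (0, 'to-free')
```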


- Only works if there is no congestion within the switch fabric!
- Can be achieved if the speedup is high (the path from Send to RCV is >> FIA input BW)
- A pure output buffered switch is not practical for large systems (needs a speedup of N)

(Diagram: the switch fabric feeds per-destination output queues (OQs) in memory on the receive-side FIAs.)


- A high speedup (e.g. 2-3x FIA BW) is enough most of the time
- Input queues for the cases where the speedup is insufficient - blocking
- CRS uses this (VOQ scale is impractical for an input-only approach)

(Diagram: send-side FIAs hold input queues (IQs) and receive-side FIAs hold output queues (OQs), on either side of the switch fabric.)


Two main methods of transporting data across a switch fabric:

Packet - send whole packets (or even multiple packets as a frame)
- Advantages:
  - Simpler: no reassembly of cells (but may have to reorder packets)
  - Higher efficiency: per-packet overhead vs. per-cell overhead
  - Lower average latency (can do cut-through on egress)
- Disadvantages:
  - Slightly higher complexity for buffered switch chips
  - Higher worst-case latency (small packets must wait behind larger packets)
  - Not as scalable: large scale switches require distribution, which requires cells to be efficient; packet transport requires bundling of links to achieve low latency for large packets, which does not allow for large scale distribution

Cell - segment packets into smaller, fixed-size cells
- Advantages:
  - Lower worst-case latency (important for TDM types of traffic)
  - Scalable (easy to evenly distribute cells)

(Cell, continued)
- Disadvantages:
  - Higher complexity: requires segmentation and reassembly of cells
  - Higher overhead: per-cell overhead, packet packing efficiency
  - Worse average latency: the reassembly and reordering cell buffer adds latency; can't do egress cut-through

Generally:
- Packet transport: pure packet fabrics, single chassis scale fabrics
- Cell transport: hybrid packet/TDM switches, most large scale multi-chassis fabrics
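The per-packet vs. per-cell overhead tradeoff is easy to quantify. A rough comparison, using illustrative header and payload sizes (not the numbers of any specific fabric): a packet one byte over a cell boundary pays for a whole padded cell, while packet transport pays a fixed header regardless of length.

```python
import math

def packet_efficiency(pkt_len, pkt_hdr=4):
    """Goodput fraction when the whole packet is sent with one header."""
    return pkt_len / (pkt_len + pkt_hdr)

def cell_efficiency(pkt_len, cell_payload=48, cell_hdr=16):
    """Goodput fraction when the packet is cut into fixed cells (last one padded)."""
    cells = math.ceil(pkt_len / cell_payload)
    return pkt_len / (cells * (cell_payload + cell_hdr))

# 49 bytes is the worst case here: two cells, the second nearly empty.
for length in (49, 64, 1500):
    print(length, round(packet_efficiency(length), 2), round(cell_efficiency(length), 2))
```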


Mesh (pt-pt, bus)
- Scale limited to the bus bandwidth or FIA bandwidth
- Used in smaller and/or older systems

(Diagram: FIAs attached to a shared bus / full mesh of point-to-point links.)


Single stage (crossbar, central memory)
- Scale limited by the number of serdes I/O on a single chip
  - e.g. 224x224 serdes on a chip (SM15) limits a system to 224 FIAs; Panini has 768 FIAs
- Use parallel crossbars for bandwidth scale and redundancy

(Diagram: FIAs connected to several parallel crossbars.)

3-stage symmetric CLOS
- #S1 = #S2 = #S3; traffic always takes two hops (S1->S2, S2->S3)
- For an NxN xbar chip, can connect up to N^2 FIAs
  - Scale is usually less, due to the common practice of combining S1 and S3 (folded: N^2/2) and the number of S1s and FIAs achievable in a LC chassis
- Provably non-blocking via rearrangement (connection oriented) or load balancing (requires some speedup to overcome imperfect randomized load balancing)

(Diagram: FIAs connect to S1 stages, S1s to S2s, and S2s to S3 stages feeding the destination FIAs.)
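The N^2 and N^2/2 scaling claims above follow from simple port counting: each S2 has N inputs, so there can be at most N S1 chips, each facing N FIAs (or N/2 FIAs when S1 and S3 are folded onto one chip). A quick sketch of that arithmetic:

```python
def max_fias(n, folded=False):
    """Upper bound on FIAs for a 3-stage Clos built from NxN crossbar chips.

    n: crossbar radix (NxN chip). In the folded case one chip acts as both
    S1 and S3, so half of its links face FIAs and half face the S2 stage.
    """
    if folded:
        return n * (n // 2)   # N chips x N/2 FIA-facing links = N^2 / 2
    return n * n              # N S1 chips x N FIA links each = N^2

# For a 224x224 chip (SM15-class radix):
print(max_fias(224))               # theoretical unfolded limit
print(max_fias(224, folded=True))  # folded limit, N^2 / 2
```

Real systems land well below these bounds, as the slide notes, because chassis packaging limits how many S1s and FIAs fit per linecard chassis.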


n-stage topologies (hyper-cube, torus, hyper-torus)
- Flows can take a variable number of hops from source to destination
- Lower cost
  - FIAs typically interconnect directly: fewer components, no fabric chassis needed
  - Less interconnection cost, but the interconnect can be complex for a less-than-fully-populated system, or a system with varying speed nodes
- Requires complex scheduling to be non-blocking
  - Typically flow based path selection
  - Must be able to dynamically change paths if flow bandwidths change (reordering?)
  - Slow to recover from failures (massive path recomputations)


Centralized timeslot scheduling
- Allows bufferless crossbars with no data loss (no collisions)
- Difficult to achieve a maximum match (O(n^2.5), O(n^3) complexity)
  - Approximate maximum match instead (e.g. iSLIP, PIM, WFA); needs speedup to overcome the imperfect match
- Not that scalable (O(n) complexity, but n can be large; scheduler speed can be an issue as well)

Distributed timeslot scheduling
- Scheduling done by each destination independently
- Imprecise: sources may receive multiple grants but can only act on one
  - Results in loss of bandwidth; can be overcome with speedup
- Scalable since it's distributed, but somewhat inefficient

Distributed bandwidth scheduling
- Distribute bandwidth (credits or MTU packets) on request; the source sends when ready
- Collisions can occur: needs buffering in the xbar and some speedup
- Scalable since it's distributed, and efficient
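To make "approximate maximum match" concrete, here is one request/grant/accept iteration in the style of PIM (Parallel Iterative Matching) — a sketch of the idea only, not any shipping scheduler; real implementations iterate several times and use rotating priorities (iSLIP) rather than pure randomness.

```python
import random

def pim_iteration(requests, rng):
    """One PIM-style iteration. requests[i] = set of outputs input i has cells for."""
    # Grant phase: each requested output picks one requesting input at random.
    grants = {}                       # output -> granted input
    for out in {o for req in requests for o in req}:
        requesters = [i for i, req in enumerate(requests) if out in req]
        grants[out] = rng.choice(requesters)
    # Accept phase: each input that got grants accepts one output at random.
    accepts = {}                      # input -> accepted output
    for inp in set(grants.values()):
        offered = [o for o, i in grants.items() if i == inp]
        accepts[inp] = rng.choice(offered)
    return accepts                    # a (partial) input->output match

rng = random.Random(1)
match = pim_iteration([{0, 1}, {0}, {2}], rng)
print(match)  # each input and each output appears at most once
```

One iteration may leave input/output pairs unmatched — that is exactly the "imperfect match" the slide says must be covered by speedup.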

i.e. no scheduling - just send. The switch fabric either:
- buffers and asserts flow control if buffers get too full, or
- just drops if buffers get too full (may require ack + retransmission)

- Requires a large speedup to get good performance in oversubscribed scenarios
- Is blocking for congestion > speedup, because flow control within the switch fabric is usually coarse (not down to the VOQ level)
- Can reduce blocking by adding secondary flow control from the destination back to the source - this can be at a VOQ level


Hybrid schemes
- Can combine proactive and reactive schemes
  - e.g. send speculatively and, if congested (reactive), request a re-send (proactive)
- Better latency when non-congested


Typically very challenging
- Scheduling per MC group is not practical
  - Combinatoric number of groups: 2^n - 1, where n is the number of FIAs
- Usually drop on congestion, or use reactive flow control
- Alternative: turn multicast into unicast - congestion can now be isolated
  - But this can be blocking for unicast (ingress replication) or expensive (server replication)


Fabric multicast (replication in the switch fabric)
- No impact to linecards; scalable to 100% multicast
- Drop on congestion

Ingress replication (the ingress LC sends one copy per destination)
- Can block ingress if there is not enough speedup to overcome the replication dilation
- Staggered delivery


Binary tree replication (LCs replicate onwards to other LCs across the fabric)
- Can block ingress if there is not enough speedup to overcome the replication dilation (but less chance of this, due to the distribution of the replication)
- Very staggered delivery

MC server replication (a dedicated multicast server card replicates)
- No ingress blocking
- Additional expense of MC server cards
- Staggered delivery


- Cell based (fixed at 64B)
- Centrally scheduled
- Non-buffered crossbar
- Input buffered
- Single chassis topology, up to 16 slots
- 3 generations: 2.5G -> 10G -> 20G per slot
  - Each generation supports the previous generations' LCs
- Multicast replication in the fabric
- 2 priorities per cast in the fabric (not in all generations)


- The fabric works in cell periods of 128 ns; the cell clock is distributed to the LCs and switch cards
- Packets are segmented into cells in the ingress path
- For each cell, a request is sent to the SCA (Scheduler Control ASIC)
- The SCA determines which input -> output connections to make; it sends grants to the IFIA and controls the XBARs
- Cells are sent across the serial links and XBARs to the EFIA
- Packets are reassembled in the egress path

- Ingress ToFab queues: per destination slot HPQ + LPQs, multicast HPQ + LPQs (MDRR)
- FIA (ToFab): H/L unicast queues per destination LC + H/L multicast queues
- The SCA algorithm is used to ensure fairness and maximize throughput over the fabric
  - It schedules between UC/MC requests (alternating priority between UC and MC); within a priority, input LCs get their fair share of traffic towards the output linecards
- FIA (FrFab): per-source UC/MC reassembly queues that can flow control the SCA if nearing full


- Multicast to different LCs is performed by the crossbar
- A given multicast cell is transmitted to N destinations across the crossbar
- Partial grants are supported
  - If a cell wants to go to destinations 1, 2 and 3, the fabric may first grant {1, 3} and then grant {2} in a subsequent cell time


Switch cards:
- A redundant switch card allows correction of errors in a single serial link stream
- The redundant stream carries the XOR of the 4 other streams - this provides 4+1 redundancy

CSC (Clock and Scheduler) cards:
- 2 of these in the system; one is operational and the other standby


Per-generation ASICs (generations: 622M / 2.5G / 10G / 20G):
- Fabric connection: 5x1.25G -> 20x1.25G -> 20x2.5G
- FIA: FIA -> FIA-48 -> Fusilli (also TFIA/FFIA, Superfish, EROS)
- Scheduler: SCA -> SCA192 -> HAD -> Hecate (priority)
- XBAR: XBAR -> XBAR192 -> IRIS

- Cell based (fixed 136B cells with cell packing)
- Unscheduled
- Single stage / 3-stage fabric
- Single chassis / multi-chassis capable
- Architecture scales up to 1536 EFIAs
  - 2/4 EFIAs per LC
  - VOQ not feasible (the system has 1M+ output queues); solved with fabric speedup plus flow control
- 3 generations: 40G -> 120G -> 400G per slot
  - New generations are required to support the previous generations
- Input buffered (for fabric congestion); output buffered (enabled by the fabric speedup)
- Multicast replication in the fabric; 2 priorities per UC/MC in the fabric


(Diagram: CRS multi-chassis fabric - up to 8 fabric planes; line card S1 stages feed S2 stages in the fabric chassis, which feed S3 stages back to the line cards.)
- 136-byte cells
- 40/120/400 Gbps per slot
- 2.5x speedup
- 2 levels of priority
- Multicast support: 1M multicast groups
- Buffered, non-blocking switch; multi-stage interconnect: 3-stage Clos topology


IFIA:
- Segments packets into fixed size cells
- Distributes cells evenly to the planes
S1 stage:
- Distributes all cells evenly to all S2s
S2 stage:
- UC: directs the cell to an S3 stage based on the destination address
- MC: replicates the cell to S3 stages based on the FGID
S3 stage:
- UC: directs the cell to an EFIA based on the destination address
- MC: replicates the cell to EFIAs based on the FGID
EFIA:
- Receives cells from all planes and reassembles packets per source/cast
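The spray/reassemble pair above can be sketched in a few lines. This is an illustrative model only (real cells carry a sequence number in the cell header; the structures here are invented): cells are distributed round-robin over planes, and the receiver merges the per-plane streams back into order.

```python
def spray(cells, num_planes):
    """Distribute a packet's cells evenly over planes, tagged with a sequence number."""
    planes = {p: [] for p in range(num_planes)}
    for seq, cell in enumerate(cells):
        planes[seq % num_planes].append((seq, cell))
    return planes

def reassemble(planes):
    """Merge the per-plane streams back into the original cell order."""
    merged = [sc for stream in planes.values() for sc in stream]
    return [cell for _, cell in sorted(merged)]

cells = [f"cell{i}" for i in range(7)]
planes = spray(cells, num_planes=3)
print(planes[0])                      # every 3rd cell landed on plane 0
assert reassemble(planes) == cells    # order restored despite distribution
```

Per-plane skews in arrival time are what make the resequencing buffer at the EFIA (and its latency cost, noted earlier) necessary.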


A single chassis can have:
- A 3-stage topology, if the switching element does not have enough links for single stage
- A single stage topology:
  - Full mesh between the IFIAs, EFIAs and fabric chips
  - Fabric chips in S123 mode, whereby incoming cells are routed directly to the EFIAs

Linecard chassis:
- Fabric cards contain the S1 and S3 stages (these may be combined into ASICs doing both stages)

Fabric chassis:
- Fabric cards contain the S2 stages; 1 or more fabric cards may implement the S2 stage of a plane

- Full mesh between the IFIAs, EFIAs and fabric chips
- Fabric chips in S13 mode:
  - Traffic local to a chassis does not go over the optical links
  - Traffic destined to other chassis goes over the optical links

CRS unicast queuing (IFIA -> S1 -> S2 -> S3 -> EFIA; packets from the NPU at ingress, to the NPU at egress):
- IFIA: 3072 high-priority + 3072 low-priority fabric destination (EFIA) queues, gated by fabric destination backpressure and a discard filter
- S1: a single data queue
- S2: queues per priority per S3 group
- S3: queues per priority per fabric destination (EFIA)
- EFIA: resequencing & reassembly; 4K raw queues; 8k shaped queues

Input buffering:
- Per destination (EFIA) H/L queue; the system scale (~1M OQs) was deemed too large for VOQ
Output buffering:
- Per faceplate port, a configurable number of queues

CRS multicast queuing (same IFIA -> S1 -> S2 -> S3 -> EFIA path):
- IFIA: 1 high-priority and 1 low-priority multicast queue
- S1: a single data queue for both UC and MC data cells
- S2: queues per S3 group per priority and cast (i.e. separate queues for MC)
- S3: queues per destination per priority and cast (i.e. separate queues for MC)
- EFIA: some number of MC raw queues

IFIAs send traffic into the fabric; 2 main flow controls regulate this, handling the cases where the fabric speedup is insufficient.

Destination backpressure:
- Used to minimize buffer occupancy in the fabric during short term congestion
- Operates at per-destination-EFIA granularity
- S3 queue congestion + S2 feed-forward counts contribute
- Ingress FIAs implement a slow start algorithm to minimize overshoot

Discard:
- Operates at a per faceplate port granularity
- Used to alleviate potential fabric congestion by keeping congested traffic from entering the fabric
  - That is, we do not want to send packets across the fabric which are going to be discarded at the EFIA anyway
- Also minimizes the queuing delay at the IFIA due to congestion at the destination

(Diagram: destination backpressure flows from S3/EFIA back to the IFIA; the discard filter sits at the IFIA; 8k shaped queues at the EFIA.)

Other flow controls exist in the fabric as well:
- Stages may backpressure upstream stages if they are out of resources (not expected in normal operation)

- Multicast is performed in the fabric at the S2 and S3 stages
- IFIAs have 2 multicast queues (H/L)
- The cell header contains an FGID field, which the S2 and S3 stages use as an index into a replication table
  - 1M fabric groups
- S2 and S3 replicate cells; there are no flow control mechanisms
  - Separate queues for H/L multicast; if there is congestion, multicast cells are dropped
- Scheduling between unicast and multicast cells is WRR at the S2 and S3 stages
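The FGID mechanism is just a table lookup per stage. A minimal sketch — the table contents, port numbering, and function names are invented for illustration; the real table is 1M entries indexed by the cell header's FGID field:

```python
# Hypothetical replication table for one fabric stage: FGID -> output ports.
REPLICATION_TABLE = {
    7: {0, 2, 5},    # FGID 7 -> replicate to outputs 0, 2 and 5
    9: {1},          # FGID 9 -> a single member, effectively unicast
}

def replicate(cell, fgid):
    """Return (output, cell) copies this stage must emit for one multicast cell."""
    return [(out, cell) for out in sorted(REPLICATION_TABLE.get(fgid, ()))]

copies = replicate("mc-cell", fgid=7)
print(copies)   # one copy per member output of FGID 7
```

Because each stage expands the cell only towards the members below it, the ingress sends a single copy per plane regardless of fan-out — the cost the "fabric multicast" approach avoids relative to ingress replication.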


4/8 planes in the system:
- FQ: 8 planes, 1 plane per fabric card
- HQ: 8 planes, 2 planes per fabric card
- QQ: 4 planes, 1 plane per fabric card

- There is enough speedup in the fabric to handle one plane down without adversely affecting performance
- Further planes down result in reduced fabric performance
- The fabric can operate with a minimum of 2 planes


Per-generation ASICs:
- 40G: fabric links 2.5G (8b10b); FIA: Sprayer, Sponge; XBAR: SEA (36x72)
- 120G: fabric links 5G (scrambler + 8b10b); FIA: Seal, Crab; XBAR: Superstar (128x144)
- 400G: fabric links 8.625G (scrambler); FIA: Inbar; XBAR: Sapir (128x128)


- Cell based (variable cell size)
- Distributed scheduling
- Single stage / 3-stage fabric topologies
- Single chassis / multi-chassis capable
- Architecture scales up to 768 FIAs
  - Up to 6 FIAs per LC
- Input buffered: 64K VOQs (4 COS per 10GE)
- Multicast replication in the fabric (512K groups)
- 2 independent pipes in the fabric (OTN; data UC/MC)


Panini multi-chassis fabric architecture
(Diagram: 6 fabric planes; nx200G line cards with S1 stages feed S2 stages in the fabric chassis over 240 Gbps CXP links, then S3 stages back to the line cards; 1x speedup.)
- 64~256B cells
- Single priority
- 2 paths in the fabric
- Multicast support: 512K multicast groups; replication at S2 and S3
- 3-stage CLOS topology

IFIA:
- Segments packets into variable sized cells
- Distributes cells evenly to the planes
S1 stage:
- Distributes all cells evenly to all S2s
S2 stage:
- UC: directs the cell to an S3 stage based on the destination address
- MC: replicates the cell to S3 stages based on the FGID
S3 stage:
- UC: directs the cell to an EFIA based on the destination address
- MC: replicates the cell to EFIAs based on the FGID
EFIA:
- Receives cells from all planes and reassembles packets per source/cast


Single stage topology:
- Full mesh between the FIAs and fabric chips
- FIAs spray cells to the fabric chips, which send them on to the correct FIA


3-stage topology (multi-chassis):
- In each linecard chassis, all of the FIAs are connected to each S1/S3
- Full mesh between the S1/S3s and the S2s on a plane
- Fabric chassis: fabric cards contain the S2 stages; 1 or multiple fabric cards may implement the S2 stage of a plane


3-stage topology (back-to-back):
- In each linecard chassis, all of the FIAs are connected to each S1/S3
- Full mesh between the S1/S3 and S2 stages
- The S2 stage is shared between the two chassis' fabric cards

Unicast: distributed scheduling
- VOQs in the IFIA indicate their occupancy state to the EFIAs they are associated with
- EFIAs issue credits fairly (WFQ) to IFIAs based on:
  - The VOQ state from the IFIAs
  - The number of links up between S3 and itself
  - Congestion indication from the fabric
- IFIAs send cells of packets into the fabric for VOQs that hold credit

Multicast: unscheduled
- Packets are sent into the fabric
- Congestion may result in drops or global flow control, depending on priority

Congestion in the fabric between UC/MC:
- With only UC traffic there should be no sustained congestion, as each EFIA controls the traffic towards it
- When MC is introduced, congestion may occur and the fabric will indicate this to the EFIA; the EFIA then adapts UC credits down to a configured value, to make room for the MC traffic
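The destination-driven credit scheme above can be sketched simply. This is an illustrative model, with round-robin standing in for the WFQ the slide mentions and all names invented: the EFIA hands out credits only to IFIAs that have advertised backlogged VOQs, and an IFIA may only send when it holds credit.

```python
from collections import deque

def issue_credits(voq_occupancy, credits_available):
    """EFIA-side sketch: grant credits one at a time, round-robin over backlogged IFIAs.

    voq_occupancy: {ifia_id: number of cells queued for this EFIA}
    Returns {ifia_id: credits granted}.
    """
    backlogged = deque(i for i, occ in voq_occupancy.items() if occ > 0)
    grants = {i: 0 for i in voq_occupancy}
    while credits_available and backlogged:
        ifia = backlogged.popleft()
        grants[ifia] += 1
        credits_available -= 1
        voq_occupancy[ifia] -= 1
        if voq_occupancy[ifia] > 0:
            backlogged.append(ifia)   # still backlogged: stay in rotation
    return grants

# IFIA 0 has 5 cells queued, IFIA 1 has 2, IFIA 2 is idle; 4 credits to give.
print(issue_credits({0: 5, 1: 2, 2: 0}, credits_available=4))  # {0: 2, 1: 2, 2: 0}
```

Because the destination only grants what its own drain rate can absorb, sustained unicast congestion inside the fabric is avoided without any central scheduler.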

Ingress unicast: 64K VOQs
- Enough for 4 COS queues per 10GE
Ingress multicast:
- 4 class queues towards the fabric
Fabric: queues per pipe per destination
- S2: a queue per S3 destination
- S3: a queue per EFIA
Egress: per cast/class queues towards the egress NPU


- Multicast is performed in the fabric at the S2 and S3 stages
- The cell header contains an FGID field, which the S2 and S3 stages use as an index into a replication table
  - 512K fabric groups
- S2 and S3 replicate cells to the wanted destinations


- 6 planes in the system
- There is enough speedup in the fabric to handle one plane down without adversely affecting performance
- Further planes down result in reduced fabric performance
- The fabric can operate with a minimum of 1 plane


GSR cell format:
- Packets are segmented into 64B cells: 48B of payload, 8B of cell header, 8B of CRC
- No cell packing: a given cell may only have data for one packet
- Each cell is split (in 16-bit chunks) over 4 serial links, one to each XBAR
- A fifth, redundant serial link carries information for error correction
- Links are 8b10b encoded
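The 4+1 link scheme (used here and for the switch-card redundancy described earlier) is plain XOR parity, and can be sketched directly — an illustrative model, not the actual serializer logic: a cell is split into 16-bit chunks across four links, a fifth link carries their XOR, and any single failed link can be rebuilt at the far end.

```python
def split_with_parity(cell: bytes):
    """Split a cell into 4 interleaved 2-byte-chunk streams plus an XOR parity stream."""
    chunks = [cell[i:i + 2] for i in range(0, len(cell), 2)]
    links = [b"".join(chunks[i::4]) for i in range(4)]   # 16 bytes/link for a 64B cell
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*links))
    return links, parity

def recover(links, parity, lost: int):
    """Rebuild the stream of one failed link from the 3 survivors plus parity."""
    survivors = [l for i, l in enumerate(links) if i != lost] + [parity]
    return bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*survivors))

cell = bytes(range(64))                  # one 64-byte cell
links, parity = split_with_parity(cell)
assert recover(links, parity, lost=2) == links[2]   # any single link is recoverable
```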

CRS cell format:
- Packets are segmented into 136B cells: 4B of RS code (for error correction), 12B of header, 120B of payload
- Unicast cells can be packed, so that a cell can contain data from 2 packets (e.g. a 30-byte tail of packet 1 plus the start of packet 2); a cell may also carry a single packet payload (120 bytes of packet 1)
- Multicast cells are not packed
- Control cells: Idle, Discard, SRCC
- Cells are: 8b10b encoded on 2.5G links; scrambled + 8b10b encoded on 5G links; scrambled on 8.625G links

Panini packet and cell format:
- Packet: 4-byte fabric header, 14-byte packet header, packet payload up to 9.6k bytes; a CRC-32 covers the packet
- Cell: 13-byte fabric header, 64-256 byte data payload, 1-byte CRC (CRC-8)
- Idle cells are sent if there is no data to send
- 11.5G serdes, 64/66 encoding
- FEC covers a group of cells (for the optical links); retransmit is used for electrical link error correction


CRS fabric flow control (IFIA -> S1 -> S2 -> S3 -> EFIA):
- EFIA raw queue state controls the discard filter
- S3 queue state generates destination backpressure; S2 queue state and feed-forward counts are incorporated into the destination backpressure
- S2 OOR (out of resources) state is used to control scheduling in S1; S3 OOR state is used to control scheduling in S2
- S1 hiccups control per-plane scheduling at the Sprayer


- First high performance enterprise switch
  - FCS in 1998
- The first implementation was a shared bus; it evolved to a single stage fabric
- Large set of features, supported also by special service cards
- Wire rate at minimum packet size


(Diagram: port ASICs share a data bus (DBUS) and results bus (RBUS) with the bus arbiter and the decision engine.)
- Port ASICs on the linecards; decision engine and DBUS arbiter on the supervisor card
- 16 Gb/s total system bandwidth
- Input and output buffers; no backpressure from output
- The bus arbiter prevented multiple port ASICs from writing on the bus simultaneously
- Two bus priorities to support VoIP; more queuing classes on the port ASICs


(Diagram: ioslices with per-priority input queues connect through fabric interfaces and high speed serial links to the crossbar; the decision engine steers packets; output queues on egress.)


Input queues:
- No VOQs, only per-priority input queues
- No congestion feedback from the egress ports to ingress; blocking is possible

Crossbar:
- Can drop packets when congested
- Initially a centralized decision engine, later distributed onto each linecard, to support line rate at minimum packet size as newer generations of linecards got faster


- Packet based
- Arbiters (conceptually one per output buffer) independently decide which input is writing to which output
- Each crossbar link is a bundle of 8 serdes
  - Lower port count required
- Input and output queues
  - Input queues cause blocking: 3x internal overspeed to compensate; requires store and forward
  - Input queues can drop
- Egress WRR
- Two priorities supported, as two separate datapaths and queue sets

- The crossbar does replication according to a Fabric Port Of Exit (FPOE) mask
  - The FPOE is set by the ingress port ASIC
  - Replication is done by writing to multiple output queues simultaneously
  - Multiple retries are possible to satisfy the replication mask; the 3x internal overspeed helps maintain rate
- The egress fabric interface and egress port ASIC use a Destination Index in the packet header to access a lookup table and perform further replications


- Support for no-drop protocols (Fibre Channel over Ethernet)
- Bandwidth optimized
- More than 16 slots per chassis
- High density, including oversubscribed linecards


- Centrally scheduled
- Packet based
- 3-stage fabric, buffered crossbars
- Input buffered
- Single chassis topology, up to 16 linecard slots + 2 supervisor slots
- Multicast replication in the fabric
- One priority per cast in the fabric; 8 priorities per cast in the ioslices


(Diagram: ingress VOQs send requests to the central scheduler and receive grants; a crossbar with input and output buffers (one of three stages shown); per port, per class output queues at egress return credits to the scheduler.)

- Packets are queued into VOQs
  - The VOQ system is shared among the ports on an ingress ioslice
- Multiple small packets are accumulated into a superframe
  - Up to a max size of about 3000 bytes
- Requests are made to the central scheduler
  - No size information (MTU assumed); destination port and priority only
- Grants are generated according to egress buffer availability
  - The central scheduler keeps track of buffer availability for every egress queue in the system
- Superframes are sent to the fabric upon grant reception
  - Smaller than max size if the grant arrives quickly, i.e. when little or no congestion is present
  - Split into fragments if a packet is bigger than the max superframe size
- No drops in the crossbars and outputs
  - Optionally no drops in the VOQs, by issuing pause frames
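The superframing behavior above can be sketched as follows — an illustrative model only, with the ~3000-byte cap taken from the text and the packet sizes invented: packets accumulate until the cap would be exceeded, but an early grant (little congestion) ships a smaller frame immediately.

```python
MAX_SF = 3000   # approximate max superframe size, per the text

def build_superframe(pending, grant_arrived):
    """Pop queued packets into one superframe; stop at the cap or on an early grant."""
    frame, size = [], 0
    while pending:
        pkt = pending[0]
        if size + len(pkt) > MAX_SF:
            break                      # next packet would overflow: ship what we have
        frame.append(pending.pop(0))
        size += len(pkt)
        if grant_arrived:
            break                      # grant already here: send a small frame now
    return frame, size

pending = [b"x" * 900] * 5
frame, size = build_superframe(pending, grant_arrived=False)
print(len(frame), size)   # 3 packets, 2700 bytes (a 4th would exceed 3000)
```

Superframing amortizes the per-grant scheduling overhead over many small packets when congested, while the early-grant path keeps latency low when the fabric is idle.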


- Up to 8 VOQs per destination port at the ingress ioslice
  - Shared across ingress ports
  - Ingress drops according to tail drop or some form of AQM (WRED, AFD)
- The ingress ioslice determines load balancing over the fabric planes
  - Round robin or pseudo-random
- One egress queue for each (egress port, priority)
  - No drops at egress
  - One credit loop per (egress port, priority); the buffer is hard-partitioned
  - Credits are returned to the central scheduler when a packet leaves the system, creating available buffer
  - The egress scheduler controls the egress queue drain rate, so it controls the credit return rate
  - The central scheduler distributes the aggregate grant rate across the requesting VOQs
2010 Cisco and/or its affiliates. All rights reserved.

Cisco Confidential

77

Three stage folded CLOS
(Diagram: linecards hold the FIAs/ioslices (stages 1 and 3) and local crossbars; spine cards hold the stage 2 crossbars; central schedulers (active/standby) with a credit aggregator path sit on the supervisors.)
- Links are bundles of 8 serdes
- An aggregator ASIC in the linecard reduces the number of arbitration links to the central schedulers
- Depending on the ioslice bandwidth, multiple links run between the ioslices and the crossbar(s) on the linecard
2010 Cisco and/or its affiliates. All rights reserved.

Cisco Confidential

78

- Not scheduled
  - No superframing possible; no VOQs, only input queues, independent of group membership
- Replication is performed at the S2 and S3 stages
  - Separate internal datapath in the xbar for multicast
  - The packet header contains an index into a replication table
  - Multiple retries to satisfy all required destination crossbar ports, limited by a timer; on timeout, drop
- Egress ioslices do further replication to individual ports
- Load balanced over the fabric planes using a flow hash
  - No reordering required; the max rate of a single flow is limited by the link bundle capacity
- The fabric can also be programmed to flow control multicast
  - More blocking, but preferable for financial applications where MC is low average bandwidth but very bursty
- Scheduling between unicast and multicast is DWRR



- Low latency
  - Sub-microsecond; cut-through operation
- Single chassis, multiple chips
- Support for no-drop classes (FCoE)


- Packet based, with full cut-through
- Speculative transmission
  - Similar to shared Ethernet: collision detection and retransmission
- Single crossbar stage with large overspeed
- Unbuffered crossbar; input and output buffered


Nexus 6k design
(Diagram: a 576 x 10G or 144 x 40G system; IFIAs and EFIAs, each handling 12 x 10G or 3 x 40G, around a 48-port crossbar; ack/nack and Xon/Xoff feedback paths run out of band.)
- Single stage: packet based, unbuffered, latency-optimized crossbar center stage
- Fabric latency-optimized for 10 Gbps or 40 Gbps; packet cut-through at the fabric operating rate
- Single links or bundles of 4; no local switching on line cards; fabric speedup of 3.6
- No scheduling: random path selection, with out-of-band ack/nack from the crossbar for each transmission
- Shared memory third stage with per port, per class unicast and multicast queues
- 576 x 8 all-unicast VOQs per ingress port; 8k multicast queues per ingress port
- Unicast per-class backpressure: out-of-band Xon/Xoff broadcast from each unicast output queue to all ingresses; superframing if congested
- Egress shared memory with per-port pruning for multicast
- 8 classes of service across the entire switch; no-drop classes supported by the Xon/Xoff feedback
- 15 MB ingress buffer, 9 MB egress buffer (optionally shared)

- Packets are queued into VOQs
  - A separate VOQ system for each input port on an ingress ioslice, to support ingress cut-through
  - 8 classes of service across the entire switch
  - Superframing when congested
- If the destination is not congested, a unicast packet is sent immediately
  - Random path selection
  - The crossbar nacks the packet if no downlink is available; the ingress stops and retries on a different path
  - Only one packet in flight per VOQ, so there is no need to reorder packets at egress
- Large speedup in the egress buffer and egress downlinks, to reduce the collision probability
- Egress queue per (priority, port)
  - Broadcasts Xoff before it gets too full, to avoid egress drops
  - After getting Xon, the ingress waits a random time before attempting to send
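The speculative send-and-retry loop above can be sketched as follows — an illustrative model, not the actual ASIC behavior: the ingress picks a random path, and on a nack (busy egress downlink) it retries on a different path until one accepts.

```python
import random

def send_speculative(paths_busy, rng):
    """Return the path that finally carried the packet, or None if all nacked.

    paths_busy[p] is True when path p's egress downlink is occupied, in which
    case the crossbar nacks and the ingress retries on another path.
    """
    paths = list(range(len(paths_busy)))
    while paths:
        path = rng.choice(paths)       # random path selection
        if not paths_busy[path]:
            return path                # crossbar acks: packet delivered
        paths.remove(path)             # nacked: retry on a different path

rng = random.Random(7)
chosen = send_speculative([True, True, False, True], rng)
assert chosen == 2                     # only path 2 has a free downlink
```

The 3.6x fabric speedup quoted above is what keeps the nack probability low enough that this scheme rarely retries in practice.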


- Sent immediately by the ingress
  - Subject only to uplink availability; no egress Xon/Xoff
- Replication is performed in the crossbar
  - One copy to each ioslice participating in the MC group
  - The crossbar nacks with a success mask; if some destinations did not get a copy, the ingress retries
  - No memory in the crossbar
- The egress chip has shared memory
  - One memory write, multiple reads according to group membership
  - Some destinations may be pruned based on queue lengths
