4/22/13
Cisco Confidential
Goals and requirements of switch fabrics
Buffering strategies (input, output, CIOQ)
Transport (packet vs cell)
Topologies (single stage, multi-stage)
Congestion management (proactive, reactive)
Multicast
Service provider examples
Enterprise and Datacenter examples
Scale
Bandwidth per fabric interface
Number of fabric interfaces
Fairness
Usually want non-blocking and fair (sometimes weighted fairness)
Non-blocking: no cross-flow interference between src-dest flows
e.g. a congested flow doesn't unduly interfere with a non-congested flow
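One common reading of "fair" on a congested output is max-min fairness. The sketch below is an illustration only, not something the slides specify: the function name `max_min_fair` and the flow labels are hypothetical. It shows that small flows get their full demand and only the heavy flow is limited.

```python
def max_min_fair(capacity, demands):
    """Water-filling allocation: no flow gets more than its demand, and
    flows that stay unsatisfied split what is left equally."""
    alloc = {flow: 0.0 for flow in demands}
    active = {flow for flow, demand in demands.items() if demand > 0}
    remaining = float(capacity)
    while active and remaining > 1e-12:
        share = remaining / len(active)
        # Flows whose outstanding demand fits inside an equal share.
        satisfied = {f for f in active if demands[f] - alloc[f] <= share}
        if not satisfied:
            # Everyone wants more than the equal share: hand it out and stop.
            for f in active:
                alloc[f] += share
            break
        for f in satisfied:
            remaining -= demands[f] - alloc[f]
            alloc[f] = float(demands[f])
        active -= satisfied
    return alloc


# Three sources target one 10G output; only the heavy flow "c" is cut back.
print(max_min_fair(10, {"a": 2, "b": 4, "c": 10}))
```

A weighted variant would scale each flow's share by its weight instead of splitting equally.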
Latency
Service provider: 100 us (WAN distances dominate, jitter)
Enterprise: 10s of us (campus distances)
Datacenter: 1 us (datacenter distances, compute performance is latency sensitive)
Cost
SP vs Datacenter vs Enterprise
Redundancy
1:1, 1+1, 1:N
2010 Cisco and/or its affiliates. All rights reserved. Cisco Confidential 4
Central
Shared memory
Input
Deep buffers only on input
Output
Deep buffers only on output
Combined input/output
Deep buffers on input and output
Usually associated with central memory switch fabric designs
Bandwidth scale limited by memory bandwidth (can be distributed over
Buffers on input
Requires Virtual Output Queues to be non-blocking
Most common type of buffering (GSR, N7K, ASR9K, Panini)
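The reason input buffering needs VOQs can be shown with a toy model of one ingress. This is a sketch under simplifying assumptions: the function `drain`, the one-packet-per-slot rule, and the idea of an output that accepts nothing for the whole run are all illustrative, not taken from any of the listed platforms.

```python
from collections import deque


def drain(packets, blocked_outputs, use_voq, slots):
    """Return the destinations delivered from one ingress in `slots` time slots.

    packets:         destination output of each packet, in arrival order
    blocked_outputs: outputs that accept nothing during the run (congested)
    use_voq:         True for one queue per output, False for a single FIFO
    """
    if use_voq:
        queues = {}
        for dst in packets:
            queues.setdefault(dst, deque()).append(dst)
    else:
        queues = {None: deque(packets)}
    delivered = []
    for _ in range(slots):
        for queue in queues.values():
            if queue and queue[0] not in blocked_outputs:
                delivered.append(queue.popleft())
                break  # at most one packet leaves the ingress per slot
    return delivered


arrivals = [0, 1, 2, 1]  # the first packet targets congested output 0
print(drain(arrivals, {0}, use_voq=False, slots=4))  # head-of-line blocked
print(drain(arrivals, {0}, use_voq=True, slots=4))   # other outputs still served
```

With a single FIFO the packet for the congested output stalls everything behind it; with VOQs only traffic to that output waits.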
Only works if there is no congestion within the switch fabric!
Can be achieved if speedup is high (path from SendRCV is >>
(need speed of N)
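The "speed of N" requirement is simple arithmetic: in the worst case all N inputs send to the same output in the same interval. A minimal sketch, with a hypothetical function name and example numbers that are not from the slides:

```python
def output_queue_write_bw_gbps(num_inputs, port_rate_gbps):
    """Worst-case write bandwidth into one output buffer when every input
    targets that output at full rate simultaneously."""
    return num_inputs * port_rate_gbps


# Example: 16 inputs at 10 Gbps each need 160 Gbps into a single output buffer.
print(output_queue_write_bw_gbps(16, 10))
```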
High speedup (e.g. 2-3X FIA BW) is enough most of the time
Input queues for cases where speedup is insufficient (blocking)
CRS uses this (VOQ scale impractical for input-only approach)
Two main methods of transporting data across a switch fabric
Packet: send whole packets (or even multiple pkts as a frame)
Advantages:
Simpler: no reassembly of cells (but may have to reorder packets)
Higher efficiency: per-packet overhead vs per-cell overhead
Lower average latency (can do cut-through on egress)
Disadvantages
Slightly higher complexity for buffered switch chips
Higher worst-case latency (small packets must wait behind larger packets)
Not as scalable: large-scale switches require distribution, which requires cells to be efficient. Packet transport requires bundling of links to achieve low latency for large packets, which does not allow for large-scale distribution.
Generally:
Packet transport: pure packet fabrics, single-chassis scale fabrics
Cell transport: hybrid packet/TDM switches, most large-scale multi-chassis fabrics
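The efficiency point (per-packet overhead vs per-cell overhead) can be made concrete with a small calculation. This is a sketch with assumed numbers: the 64-byte cell payload and 8-byte headers are placeholders chosen for illustration, not the header sizes of any specific fabric.

```python
import math


def cell_efficiency(packet_bytes, cell_payload_bytes, cell_header_bytes):
    """Fraction of fabric bytes that are packet data when the packet is cut
    into fixed-size cells and the last cell is padded."""
    cells = math.ceil(packet_bytes / cell_payload_bytes)
    wire_bytes = cells * (cell_payload_bytes + cell_header_bytes)
    return packet_bytes / wire_bytes


def packet_efficiency(packet_bytes, packet_header_bytes):
    """Fraction of fabric bytes that are packet data with one header per packet."""
    return packet_bytes / (packet_bytes + packet_header_bytes)


# A packet one byte larger than the cell payload is the bad case for fixed cells.
for size in (64, 65, 1500):
    print(size,
          round(cell_efficiency(size, 64, 8), 3),
          round(packet_efficiency(size, 8), 3))
```

Cell packing (placing the start of the next packet in the unused tail of a cell) and variable-size cells are two ways to reduce the padding loss this shows.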
Mesh (pt-pt, bus)
Scale limited to bus bandwidth or FIA bandwidth
Used in smaller and/or older systems
Single stage (crossbar, central memory)
Scale limited to number of serdes I/O on a single chip
e.g. 224x224 (SM15): the number of serdes on a chip limits a system to 224 FIAs
Panini has 768 FIAs
load balancing (requires some speedup to overcome imperfect randomized load balancing)
Figure: FIAs connected through S1, S2 and S3 stages
n-stage topologies (Hyper-cube, Torus, Hyper-torus)
Flows can take a variable number of hops from src to destination
Lower cost
Typically FIAs interconnect directly: fewer components, no fabric chassis needed
Less interconnection cost, but interconnection can be complex for:
less than a fully populated system
system with varying speed nodes
Collisions can occur: need buffering in the xbar and some speedup
Scalable since it's distributed and efficient
oversubscribed scenarios
Hybrid schemes
Can combine proactive and reactive schemes
e.g. send speculatively and if congested (reactive) request to re-send (proactive)
Better latency if non-congested
Usually drop on congestion or use reactive flow control
Alternative: turn multicast into unicast, so congestion can now be isolated
But this can block unicast (ingress replication) or be expensive (server replication).
Ingress Replication
Can block ingress if not enough speedup to overcome replication dilation
Staggered delivery
Binary Tree Replication
Can block ingress if not enough speedup to overcome replication dilation (but less chance of this due to distribution of replication)
Very staggered delivery
MC Server Replication
No ingress blocking
Additional expense of MC server cards
Staggered delivery
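Replication dilation can be estimated by counting how many copies the ingress linecard itself has to push into the fabric. The sketch below is a simplified model, not a description of any product: the function names and the example rates are hypothetical, and it assumes a binary tree means the ingress sends at most two copies while other linecards fan out the rest.

```python
def ingress_copies(fanout, mode):
    """Copies of one multicast packet that the ingress linecard sends itself."""
    if mode == "ingress":        # ingress makes every copy
        return fanout
    if mode == "binary_tree":    # ingress sends to at most two linecards
        return min(2, fanout)
    if mode == "server":         # a single copy goes to the MC server card
        return min(1, fanout)
    raise ValueError(f"unknown mode: {mode}")


def blocks_ingress(uc_gbps, mc_gbps, fanout, mode, line_rate_gbps, speedup):
    """True if unicast plus replicated multicast exceeds the ingress fabric link."""
    load = uc_gbps + mc_gbps * ingress_copies(fanout, mode)
    return load > line_rate_gbps * speedup


# 5G unicast + 2G multicast to 8 destinations on a 10G card with 2X speedup.
for mode in ("ingress", "binary_tree", "server"):
    print(mode, blocks_ingress(5, 2, 8, mode, 10, 2.0))
```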
Cell based (fixed at 64B)
Centrally scheduled
Non-buffered crossbar
Input buffered
Single chassis topology, up to 16 slots
3 generations: 2.5G -> 10G -> 20G per slot
Each generation supports the previous generations' LCs
Multicast replication in fabric
2 priorities per cast in fabric (not in all generations)
input -> output connections to make. Sends grants to the IFIA and controls the XBARs
Cells are sent across the
Ingress ToFab queues: per destination slot HPQ + LPQs, multicast HPQ + LPQs (MDRR)
FIA (ToFab): H/L unicast queues per destination LC + H/L multicast queue
SCA algorithm used to ensure fairness and maximize throughput over the fabric
Schedules between UC/MC requests (alternates priority between UC/MC). Within a priority, input LCs get their fair share of traffic towards output linecards
FIA (FrFab): per-source UC/MC reassembly queues that can flow control the SCA if nearing full
Switch cards:
Redundant switch card allows errors in a single serial link stream to be corrected.
Redundant stream carries XOR of 4 other streams
Provides 4+1 redundancy.
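The XOR scheme works because XOR-ing the parity stream with the three good streams yields the missing one. A minimal sketch, assuming the receiver already knows which stream is bad (for example from a per-link error check, which is an assumption here); the function names are illustrative.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(streams):
    """Redundant stream: XOR of all data streams."""
    return reduce(xor_bytes, streams)


def recover(streams, parity, lost_index):
    """Rebuild the stream at `lost_index` from the parity and the survivors."""
    survivors = [s for i, s in enumerate(streams) if i != lost_index]
    return reduce(xor_bytes, survivors, parity)


data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb", b"\xff\x00"]  # 4 data streams
parity = make_parity(data)                                    # the +1 stream
print(recover(data, parity, lost_index=2) == data[2])
```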
SCH SCA
XBAR XBAR
20x1.25G 20x2.5G
XBAR192 IRIS
Cell based (fixed 136B cells with cell packing)
Unscheduled
Single stage / 3 stage fabric
Single chassis / multi-chassis capable
Architecture scales up to 1536 EFIAs
2/4 EFIAs per LC
VOQ not feasible (system has 1M+ output queues); solved with fabric speedup with flow control
Input buffered (fabric congestion)
Output buffered (fabric speedup)
Multicast replication in fabric
2 priorities per UC/MC in fabric
Figure: S1, S2 and S3 stages across line card and fabric chassis
2.5X speedup
2 levels of priority
Multicast support: 1M multicast groups
IFIA:
Segments packets into fixed size cells
Distributes cells evenly to planes
S3 Stage:
UC: Directs cell to EFIA based on destination address
MC: Replicates cell to EFIAs based on FGID
EFIA:
Receives cells from all planes and reassembles packet per source/cast.
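The IFIA and EFIA roles above can be sketched as segment, spray, and reassemble. This is an illustration under assumptions: the cell dictionary layout, the sequence number used for resequencing, and the round-robin spray are hypothetical stand-ins, and cell packing is ignored.

```python
def segment(packet, cell_payload, src):
    """IFIA side: cut a packet into fixed-size payload chunks tagged with
    source and sequence number (the tail chunk may be shorter)."""
    cells = []
    for seq, offset in enumerate(range(0, len(packet), cell_payload)):
        cells.append({"src": src, "seq": seq,
                      "data": packet[offset:offset + cell_payload]})
    return cells


def spray(cells, num_planes):
    """IFIA side: distribute cells evenly over the fabric planes (round robin)."""
    planes = [[] for _ in range(num_planes)]
    for i, cell in enumerate(cells):
        planes[i % num_planes].append(cell)
    return planes


def reassemble(planes):
    """EFIA side: collect cells from all planes, restore order, rebuild packet."""
    cells = sorted((c for plane in planes for c in plane), key=lambda c: c["seq"])
    return b"".join(c["data"] for c in cells)


packet = bytes(range(200))
planes = spray(segment(packet, cell_payload=64, src=3), num_planes=3)
# Planes may deliver in any relative order; resequencing fixes that.
print(reassemble(list(reversed(planes))) == packet)
```

A real EFIA keeps separate reassembly state per source and cast, as the slide notes; the sketch handles a single source.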
Linecard Chassis:
Fabric cards contain S1 and S3 stages (these may be combined into ASICs doing both stages).
Fabric Chassis:
Fabric cards contain S2 stages.
1 or more fabric cards may implement the S2 stage of a plane.
Full mesh between IFIAs, EFIAs and Fabric chips.
Fabric chips in S13 mode:
Traffic local to a chassis does not go over optical links.
Traffic destined to other chassis goes over optical links.
Figure: IFIA (fabric destination BP, discard filter), S1, S2, S3, EFIA (reseq & reassembly)
Input buffering:
Per destination (EFIA) H/L queue.
System scale (~1M OQ) deemed too large for VOQ.
Output buffering:
Per faceplate port configurable number of queues.
8k shaped queues, packets to NPU
S2: queues per S3 group, per priority and cast (i.e. separate queues for MC)
S3: queues per destination, per priority and cast (i.e. separate queues for MC)
IFIAs send traffic into the fabric.
2 main flow controls to regulate.
Handles cases where fabric speedup is insufficient
Destination Backpressure:
Used to minimize buffer occupancy in the fabric for short term congestion.
Operates at per destination EFIA granularity.
S3 queue congestion + S2 feed forward counts contribute.
Ingress FIAs implement a slow start algorithm to minimize overshoot.
Discard:
Operates at per faceplate port granularity.
Used to alleviate potential fabric congestion by reducing the amount of congested traffic entering the fabric.
That is, we do not want to send packets across the fabric which are going to be discarded at the EFIA anyway.
Also minimizes the amount of queuing delay at the IFIA due to congestion at the destination.
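The slides do not describe the slow start algorithm itself. One plausible shape, shown purely as an assumption, is to fall back to a low rate while destination backpressure is asserted and then double the rate each interval once it clears. The function names, the doubling factor, and the rates are all hypothetical.

```python
def next_rate(rate, backpressured, min_rate, max_rate):
    """Per-destination send rate for the next interval (hypothetical policy)."""
    if backpressured:
        return min_rate               # back off while the EFIA is congested
    return min(max_rate, rate * 2)    # ramp up gradually to limit overshoot


def run(bp_signal, min_rate, max_rate):
    """Apply the policy to a sequence of backpressure samples."""
    rate = max_rate
    history = []
    for backpressured in bp_signal:
        rate = next_rate(rate, backpressured, min_rate, max_rate)
        history.append(rate)
    return history


# Backpressure is asserted in the second interval, then released.
print(run([False, True, False, False, False, False], min_rate=10, max_rate=100))
```

Resuming at full rate the moment backpressure clears would refill the S3 queues immediately, which is the overshoot the ramp is meant to avoid.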
Multicast is performed in the fabric at the S2 and S3 stages.
IFIAs have 2 queues (H/L)
Cell header contains an FGID field, which is used by the S2 and S3 stages.
400G
Cell based (variable cell size)
Distributed scheduling
Single stage / 3 stage fabric topologies
Single chassis / multi-chassis capable
Architecture scales up to 768 FIAs
Up to 6 FIAs per LC
Input buffered: 64K VOQs (4 COS per 10GE)
Multicast replication in fabric (512K groups)
2 independent pipes in fabric
OTN, Data UC, MC
Figure: S1, S2 and S3 stages across line card chassis and fabric chassis (labels: CXP, 240Gbps, 1 of 6 to 6 of 6)
Single priority
2 paths in fabric
Multicast support
512K multicast groups
Replication at S2 and S3
IFIA:
Segments packets into variable sized cells
Distributes cells evenly to planes
S3 Stage:
UC: Directs cell to EFIA based on destination address
MC: Replicates cell to EFIAs based on FGID
EFIA:
Receives cells from all planes and reassembles packet per source/cast.
Single stage topology
Full mesh between FIAs and Fabric chips.
FIAs spray cells to Fabric chips, which send them to the correct FIA.
3 stage topology.
In each Linecard Chassis of the FIA are connected to each S1S3.
Full mesh between S1S3s and S2s on a plane.
Fabric Chassis:
Fabric cards contain S2 stages.
1 or multiple Fabric cards may implement the S2 stage of a plane.
3 stage topology
In each Linecard Chassis of the FIAs are connected to each S1S3.
Full mesh between S1S3 and S2 stages
S2 stage shared between 2 chassis fabric cards.
IFIAs will send cells of packets into fabric for VOQs with credit.
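Credit-gated sending from a VOQ can be sketched as follows. This is a minimal illustration, assuming one credit permits one cell; the slides do not give the credit unit, and the class and method names are hypothetical.

```python
from collections import deque


class CreditedVoq:
    """A VOQ that only releases cells for which it holds credit."""

    def __init__(self):
        self.cells = deque()
        self.credits = 0

    def enqueue(self, cell):
        self.cells.append(cell)

    def grant(self, count=1):
        """Credit arriving from the egress scheduler."""
        self.credits += count

    def send(self):
        """Release as many queued cells as the current credit allows."""
        sent = []
        while self.cells and self.credits > 0:
            sent.append(self.cells.popleft())
            self.credits -= 1
        return sent


voq = CreditedVoq()
for cell in ("c0", "c1", "c2"):
    voq.enqueue(cell)
print(voq.send())   # nothing leaves without credit
voq.grant(2)
print(voq.send())   # two cells leave, one stays queued at the ingress
```

Because traffic waits at the ingress until the destination grants credit, congestion stays in the VOQs rather than inside the fabric.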
Multicast: Unscheduled
Packets sent into the fabric
Congestion may result in drops or global flow control depending on priority.
Ingress Multicast:
4 class queues towards the fabric
Multicast is performed in the fabric at the S2 and S3 stages.
Cell header contains an FGID field, which is used by the S2 and S3 stages.
6 planes in the system.
Enough speedup in fabric to handle one plane down and not
Further planes down result in reduced fabric performance
Fabric can operate with a minimum of 1 plane.
136B cells:
Figure: 136B cell layout with cell packing (labels: RS, Packet 2, 30 bytes, 4 bytes)
Cells are:
8b10b encoded for 2.5G links
Scrambled + 8b10b encoded for 5G links
Scrambled for 8.625G links
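The line coding determines how much of each link's raw rate carries cell bytes: 8b10b sends 10 line bits for every 8 data bits. The sketch below assumes that scrambling alone adds no bits and ignores any other framing overhead; the table and function names are illustrative.

```python
# Data bits carried per line bit. Scrambling alone is assumed to add no overhead.
ENCODING_EFFICIENCY = {
    "8b10b": 8 / 10,
    "scrambled": 1.0,
}


def payload_rate_gbps(line_rate_gbps, encoding):
    """Rate left for cell bytes after line coding."""
    return line_rate_gbps * ENCODING_EFFICIENCY[encoding]


for rate, encoding in ((2.5, "8b10b"), (5.0, "8b10b"), (8.625, "scrambled")):
    print(rate, encoding, payload_rate_gbps(rate, encoding))
```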
Packet
Fabric hdr 4 bytes
Cell
Fabric hdr 13 bytes Data payload 64-255 bytes crc8
13 byte header | 64-256 bytes payload | 1 byte CRC
Idle cells sent if no data to send
11.5G Serdes
64/66 encoding
FEC covers a group of cells (for optical links)
Retransmit used for electrical link error correction
CRC-32 covers the packet
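From the cell format above, the share of link bits that carry payload follows directly. This is a rough estimate only: it uses the 13-byte header, 1-byte CRC and 64/66 coding from the slide, and deliberately ignores FEC, idle cells and retransmits. The constant and function names are illustrative.

```python
HEADER_BYTES = 13
CRC_BYTES = 1
LINE_CODE = 64 / 66   # 64/66 encoding: 64 data bits per 66 line bits


def cell_efficiency(payload_bytes):
    """Fraction of link bits that are payload for one variable-size cell."""
    if not 64 <= payload_bytes <= 256:
        raise ValueError("payload must be 64-256 bytes")
    framing = payload_bytes / (HEADER_BYTES + payload_bytes + CRC_BYTES)
    return framing * LINE_CODE


for payload in (64, 128, 256):
    print(payload, round(cell_efficiency(payload), 3))
```

Larger cells amortize the fixed 14 bytes of header and CRC better, which is one motivation for variable-size cells over small fixed ones.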
Figure: IFIA (fabric destination BP, discard filter), S1, S2, S3, EFIA (reseq & reassembly, 8k shaped queues)
Packets from NPU
EFIA raw queue state controls the discard filter
S3 queue state generates destination BP
S2 queue state and feed forward incorporated into destination BP
fabric
Large set of features, supported also by special service cards
Wire rate at minimum packet size
Figure: port ASICs attached to DBUS and RBUS
Port ASICs on linecards, decision engine and Dbus arbiter on supervisor card
16 Gb/s total system bandwidth
Input and output buffers
No backpressure from output
Bus arbiter prevented multiple port ASICs from writing on the bus simultaneously
Two bus priorities to support VoIP
More queuing classes on port ASICs
Figure: ioslices connected through fabric interfaces to a crossbar, with decision engine and output queues
Input queues
No VOQ, only per-priority input queues
No congestion feedback from egress ports to ingress
Blocking possible
Crossbar
Can drop packets when congested
Initially centralized decision engine, later distributed on each linecard, to support line rate at min packet size as newer generations of linecards got faster
Packet based
Arbiters (conceptually one per output buffer) independently decide which input is writing to which output
8 serdes
Lower port count required
Prio2
Egress wrr
Destination Index in packet header to access lookup table and perform further replications
Support for no-drop protocols (Fibre Channel over Ethernet)
Bandwidth optimized
More than 16 slots per chassis
High density, including oversubscribed linecards
Centrally scheduled
Packet based
3 stage fabric
Buffered crossbar
Input buffered
Single chassis topology, up to 16 linecard slots + 2 supervisor slots
Multicast replication in fabric
One priority per cast in fabric
8 priorities per cast in ioslices
Figure: VOQs feeding a crossbar with input and output buffer (one of three stages shown), with credit return
Figure labels: central scheduler, Sup, network, spine card (stage 2 xbar), linecard (stage 1 and 3 xbar, FIA), credit aggregator
Credit aggregator: reduce number of arbitration links to central schedulers (active/standby)
Depending on ioslice bandwidth, multiple links between ioslices and crossbar(s) on linecard
Not scheduled
No superframing possible
No VOQs, only input queues, independent of group membership
Egress ioslices do further replication to individual ports
Load balanced to fabric planes using flow hash
No reordering required
Max rate of single flow limited by link bundle capacity
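Flow-hash load balancing pins every packet of a flow to one plane, which is why no reordering is needed and why a single flow cannot exceed one bundle's capacity. A minimal sketch: CRC32 from Python's `zlib` stands in for whatever hash the hardware uses, and the flow key fields are hypothetical.

```python
import zlib


def pick_plane(flow_key, num_planes):
    """Map a flow to a fabric plane. The same flow always maps to the same
    plane, so its packets stay in order."""
    digest = zlib.crc32(repr(flow_key).encode())
    return digest % num_planes


flow = ("10.0.0.1", "239.1.1.1", 5000, 6000)  # hypothetical flow key
print(pick_plane(flow, num_planes=5))
print(pick_plane(flow, num_planes=5))  # same plane again
```

The trade-off is that balance is only statistical: a few heavy flows hashing to the same plane can overload it while others sit idle.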
Low latency
Sub microsecond
Cut through operation
Single crossbar stage with large overspeed
Unbuffered crossbar
Input and output buffered
Nexus 6k Design
Figure: IFIA, XBAR and EFIA, 12 x 10G or 3 x 40G on each side, with out-of-band ack/nack and Xon/Xoff signalling
Shared memory third stage with per port per class unicast and multicast queues
Unicast per-class backpressure broadcast from each egress to all ingresses
Egress shared memory with per-port pruning for multicast
Fabric latency-optimized for 10 Gbps or 40 Gbps
Packet cut-through at fabric operating rate
Single links or bundles of 4
No local switching on line cards
Fabric speedup 3.6 across fabric
No scheduling
Random path selection with out-of-band ack or nack from crossbar for each transmission
576 x 8 unicast VoQs per ingress port
Out-of-band Xon/Xoff broadcast from each unicast output queue to all ingress
Superframing if congested
8k multicast queues per ingress port
8 classes of service across entire switch
No-drop classes supported by Xon/Xoff feedback
15 MB ingress buffer, 9 MB egress buffer, optionally shared
probability
Egress queue per (priority,port)
Broadcasts Xoff before it gets too full, to avoid egress drops
After getting Xon, ingress waits a random time before attempting to send
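The Xon/Xoff behaviour can be sketched as a queue with two thresholds plus a random ingress delay after Xon. Everything numeric here is an assumption: the thresholds, the uniform backoff distribution, and the class and function names are hypothetical, since the slides give only the behaviour.

```python
import random


class EgressQueue:
    """Egress queue per (priority, port) that signals Xoff early and Xon late."""

    def __init__(self, xoff_threshold, xon_threshold):
        self.xoff_threshold = xoff_threshold
        self.xon_threshold = xon_threshold
        self.paused = False

    def update(self, depth):
        """Return the message to broadcast for the new depth, or None."""
        if not self.paused and depth >= self.xoff_threshold:
            self.paused = True
            return "xoff"   # sent before the queue is full, to avoid drops
        if self.paused and depth <= self.xon_threshold:
            self.paused = False
            return "xon"
        return None


def backoff_after_xon(rng, max_wait):
    """Random wait so that all ingresses do not resume at the same instant."""
    return rng.uniform(0, max_wait)


queue = EgressQueue(xoff_threshold=80, xon_threshold=40)
for depth in (50, 85, 60, 30):
    print(depth, queue.update(depth))
print(backoff_after_xon(random.Random(1), max_wait=5.0) <= 5.0)
```

The gap between the two thresholds provides hysteresis, and the random wait spreads out the burst that would otherwise hit the queue right after Xon.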