Chapter 3: IP Fabric
Advanced Data Center Switching
We Will Discuss:
• Routing in an IP Fabric;
• Scaling of an IP Fabric; and
• Configuring an IP Fabric.
IP Fabric Overview
The slide lists the topics we will discuss. We discuss the highlighted topic first.
IP Fabric
An IP Fabric is one of the most flexible and scalable data center solutions available. Because an IP Fabric operates strictly at Layer 3, no proprietary features or protocols are required, so this solution works very well in data centers that must accommodate equipment from multiple vendors. One of the most complicated tasks in building an IP Fabric is assigning all of the implementation details: IP addresses, BGP AS numbers, routing policies, loopback address assignments, and so on.
IP Fabric Routing
The slide highlights the topic we discuss next.
Layer 3 Connectivity
Remember that your IP Fabric will forward IP data only; each node is an IP router. To forward IP packets between routers, the routers need to exchange IP routes, so you must choose a routing protocol. You want to ensure that your choice of routing protocol is scalable and future proof. As the chart shows, BGP is the natural choice.
IBGP: Part 1
IBGP is a valid choice as the routing protocol for your fabric. IBGP peers almost always peer using loopback addresses rather than physical interface addresses. To establish a BGP session (over a TCP session), a router must have a route to the loopback address of its neighbor, so an Interior Gateway Protocol (IGP) like OSPF must be enabled in the network. One purpose of enabling an IGP is simply to ensure that every router knows how to reach the loopback address of every other router. The IGP also determines all of the equal-cost paths to remote destinations. For example, router A will determine from OSPF that there are two equal-cost paths to reach router B. Router A can then load balance traffic destined for router B's loopback address (IBGP-learned routes, see the next few slides) across the two links towards router B.
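The design above can be sketched in Junos-style configuration. This is only an illustrative sketch: the interface names, addresses, AS number, and group name below are hypothetical, chosen to show loopback-to-loopback IBGP peering over an OSPF underlay.

```
routing-options {
    router-id 192.168.100.1;
    autonomous-system 65000;
}
protocols {
    ospf {
        area 0.0.0.0 {
            /* advertise the loopback; passive prevents adjacencies on lo0 */
            interface lo0.0 {
                passive;
            }
            interface xe-0/0/0.0;
            interface xe-0/0/1.0;
        }
    }
    bgp {
        group ibgp {
            type internal;
            /* peer loopback-to-loopback, not over the physical links */
            local-address 192.168.100.1;
            neighbor 192.168.100.2;
        }
    }
}
```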
IBGP: Part 2
IBGP requires that if one IBGP router advertises a route, every other IBGP router must receive a copy of that route (to prevent black holes). One way to ensure this happens is to have every IBGP router peer with every other IBGP router (a full mesh). This works fine, but it does not scale (add a new router to your IP Fabric and you must configure every router in the fabric with a new peer). There are two ways to scale around the full-mesh requirement: route reflection or confederations. Most often route reflection is chosen because it is easy to implement. It is also possible to have redundant route reflectors (shown on the slide). It is a best practice to configure one or more of the Spine nodes as route reflectors.
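On a Spine node acting as a route reflector, the only addition to the IBGP group is a cluster ID. The sketch below uses hypothetical addresses and a hypothetical group name:

```
protocols {
    bgp {
        group ibgp-rr {
            type internal;
            local-address 192.168.100.1;
            /* the cluster statement turns this router into a route reflector */
            cluster 192.168.100.1;
            neighbor 192.168.100.11;
            neighbor 192.168.100.12;
        }
    }
}
```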
IBGP: Part 3
Note: The next few slides will highlight the problem faced by a Spine node (router D) that is NOT a route reflector.
You must build your IP Fabric such that all routers load balance traffic over equal cost paths (when they exist) towards
remote networks. Each router should be configured for BGP multipath so that they will load balance when multiple BGP
routes exist. The slide shows that routers A and B advertise the 10.1/16 network to RR-A. RR-A will use both routes for
forwarding (multipath) but will choose only one of them (the one from router B, because router B has the lowest router ID) to
send to router C (a Leaf node) and router D (a Spine node). Router C and router D will each receive a route for 10.1/16, and both
copies will have a BGP next hop of router B's loopback address. This is the default behavior of route advertisement and
selection in the IBGP with route reflection scenario.
Did you notice the load balancing problem? (Hint: the problem is not on router C.) Since router C has two equal-cost paths to router B (learned from OSPF), router C will load balance traffic to 10.1/16 over its two uplinks towards the Spine routers. The problem lies on router D. Since router D received a single route with a BGP next hop of router B's loopback, it forwards all traffic destined to 10.1/16 towards router B. The equal-cost path via router A will never be used in this case. The next slide discusses the solution to this problem.
IBGP: Part 4
The problem on RR-A is that it sees the routes received from routers A and B for 10.1/16 as a single route that has been received twice. If an IBGP router receives different versions of the same route, it is supposed to choose between them and then advertise only the one chosen route to its appropriate neighbors. One solution to this problem is to make every Spine node a route reflector. This would be fine in a small fabric but probably would not make sense when there are tens of Spine nodes. Another option is to make each of the advertisements from routers A and B look like a unique route. How can we do that? An IETF draft (draft-ietf-idr-add-paths, since published as RFC 7911) defines the ADD-PATH capability, which does just that: it makes the advertisements look unique. All Spine routers in the IP Fabric should support this capability for it to work. Once enabled, routers advertise and evaluate routes based on a tuple of the network and a path ID. In the example, routers A and B again advertise the 10.1/16 route. This time, RR-A and router D support the ADD-PATH capability, so RR-A attaches a unique path ID to each route and is able to advertise both routes to router D. When the routes arrive, router D installs both in its routing table, allowing it to load balance towards routers A and B.
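As a sketch, ADD-PATH might be enabled on the IBGP sessions as follows. The addresses and group name are hypothetical; path-count controls how many routes per prefix are advertised:

```
protocols {
    bgp {
        group ibgp {
            type internal;
            local-address 192.168.100.1;
            family inet {
                unicast {
                    add-path {
                        receive;
                        send {
                            path-count 2;
                        }
                    }
                }
            }
            neighbor 192.168.100.4;
        }
    }
}
```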
EBGP: Part 1
EBGP is also a valid design choice for your IP Fabric. You will notice that the load balancing problem is much easier to fix in the EBGP scenario; for example, there is no need for the routers to support any draft RFCs. Generally, each router in an IP Fabric should be in its own unique AS. You can use AS numbers from the private or public range or, if you need thousands of AS numbers, you can use 32-bit AS numbers.
EBGP: Part 2
In an EBGP-based fabric, there is no need for route reflectors or an IGP. The BGP peering sessions parallel the physical wiring: every Leaf node has a BGP peering session with every Spine node, and there are no leaf-to-leaf or spine-to-spine BGP sessions, just as there is no leaf-to-leaf or spine-to-spine physical connectivity. EBGP peering is done using the physical interface IP addresses (not loopback interfaces). To enable proper load balancing, all routers must be configured with multipath multiple-as as well as a load balancing policy. Both of these configurations are covered later in this chapter.
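A minimal sketch of a Leaf node's EBGP configuration under this design follows. The AS numbers mirror the chapter's examples, but the interface addresses and group name are hypothetical:

```
routing-options {
    autonomous-system 64514;
}
protocols {
    bgp {
        group fabric {
            type external;
            /* accept equal-cost BGP routes even when the AS paths differ */
            multipath multiple-as;
            neighbor 172.16.1.0 {
                peer-as 64512;
            }
            neighbor 172.16.2.0 {
                peer-as 64513;
            }
        }
    }
}
```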
EBGP: Part 3
The slide shows that the routers in AS64516 and AS64517 are advertising 10.1/16 to their two EBGP peers. Because
multipath multiple-as is configured on all routers, the receiving routers in AS64512 and AS64513 will install both
routes in their routing tables and load balance traffic destined to 10.1/16.
EBGP: Part 4
The slide shows that the routers in AS64512 and AS64513 are advertising 10.1/16 to all of their EBGP peers (all Leaf
nodes). Since multipath multiple-as is configured on all routers, the receiving router in the slide, the router in
AS64514, will install both routes in its routing table and load balance traffic destined to 10.1/16.
Best Practices
When enabling an IP Fabric you should follow some best practices. Remember, two of the main goals of an IP Fabric design (or a Clos design) are to provide a non-blocking architecture and predictable load balancing behavior. Some of the best practices that should be followed include the following:
• All Spine nodes should be the exact same type of router. They should be the same model and they should also
have the same line cards installed. This helps the fabric to have a predictable load balancing behavior.
• All Leaf nodes should be the exact same type of router. Leaf nodes do not have to be the same router as the
Spine nodes. Each Leaf node should be the same model and they should also have the same line cards
installed. This helps the fabric to have a predictable load balancing behavior.
• Every Leaf node should have an uplink to every Spine node. This helps the fabric to have a predictable load
balancing behavior.
• All uplinks from Leaf nodes to Spine nodes should be the exact same speed. This helps the fabric to have predictable load balancing behavior and also helps with the non-blocking nature of the fabric. For example, assume that a Leaf has one 40GbE uplink and one 10GbE uplink to the Spine. With the combination of OSPF (for loopback interface advertisement and BGP next hop resolution) and IBGP, the bandwidth of the links is taken into consideration when calculating the shortest path to the BGP next hop, so OSPF will most likely always choose the 40GbE interface for forwarding towards remote BGP next hops. This essentially blocks the 10GbE interface from ever being used. In the EBGP scenario, the bandwidth is not taken into consideration, so traffic will be equally load balanced over the two different-speed interfaces. Imagine trying to equally load balance 60 Gbps of data over the two links; how will the 10GbE interface handle 30 Gbps of traffic? The answer is...it won't.
IP Fabric Scaling
The slide highlights the topic we discuss next.
Scaling
To increase the overall throughput of an IP Fabric, you simply need to increase the number of Spine devices (and the
appropriate uplinks from the Leaf nodes to those Spine nodes). If you add one more Spine node to the fabric, you will also
have to add one more uplink to each Leaf node. Assuming that each uplink is 40GbE, each Leaf node can now forward an
extra 40Gbps over the fabric.
Adding and removing both server-facing ports (downlinks from the Leaf nodes) and Spine nodes will affect the
oversubscription (OS) ratio of a fabric. When designing the IP fabric, you must understand OS requirements of your data
center. For example, does your data center need line rate forwarding over the fabric? Line rate forwarding would equate to
1-to-1 (1:1) OS. That means the aggregate server-facing bandwidth is equal to the aggregate uplink bandwidth. Or, maybe
your data center would work perfectly fine with a 3:1 OS of the fabric. That is, the aggregate server-facing bandwidth is 3
times that of the aggregate uplink bandwidth. Most data centers probably do not require a 1:1 OS design. Instead, you should choose an OS ratio that makes the most sense based on the data center's normal bandwidth usage.
The next few slides discuss how to calculate OS ratios of various IP fabric designs.
3:1 Topology
The slide shows a basic 3:1 OS IP Fabric. All Spine nodes, four in total, are qfx5100-24q routers that each have (32) 40GbE interfaces. All Leaf nodes, 32 in total, are qfx5100-48s routers that have (6) 40GbE uplink interfaces and (48) 10GbE server-facing interfaces. Each of the (48) 10GbE ports on all 32 Leaf nodes will be fully utilized (i.e., attached to downstream servers). That means that the total server-facing bandwidth is 48 x 32 x 10Gbps, which equals 15360 Gbps. Each of the 32 Leaf nodes uses (4) of its 40GbE uplinks as Spine-facing interfaces (one per Spine). That means that the total uplink bandwidth is 4 x 32 x 40Gbps, which equals 5120 Gbps. The OS ratio for this fabric is 15360:5120, or 3:1.
An interesting thing to note is that if you remove any number of Leaf nodes, the OS ratio does not change. For example, what would happen to the OS ratio if there were only 31 Leaf nodes? The server-facing bandwidth would be 48 x 31 x 10Gbps, which equals 14880 Gbps. The total uplink bandwidth would be 4 x 31 x 40Gbps, which equals 4960 Gbps. The OS ratio for this fabric is 14880:4960, still 3:1. This fact makes your design calculations very simple: once you decide on an OS ratio and determine the number of Spine nodes that will allow that ratio, you can simply add and remove Leaf nodes from the topology without affecting the original OS ratio of the fabric.
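The arithmetic above generalizes to a short calculation. The sketch below (the function and parameter names are my own, not from the courseware) shows why the Leaf count cancels out of the ratio:

```python
def oversubscription_ratio(leaf_count, access_ports, access_gbps,
                           uplinks_per_leaf, uplink_gbps):
    """Aggregate server-facing bandwidth divided by aggregate uplink bandwidth."""
    server_bw = leaf_count * access_ports * access_gbps
    uplink_bw = leaf_count * uplinks_per_leaf * uplink_gbps
    return server_bw / uplink_bw

# 32 Leaf nodes, (48) 10GbE server ports, (4) 40GbE uplinks: 15360:5120
print(oversubscription_ratio(32, 48, 10, 4, 40))  # 3.0
# leaf_count cancels out, so removing Leaf nodes leaves the ratio unchanged
print(oversubscription_ratio(31, 48, 10, 4, 40))  # 3.0
```

Because leaf_count appears in both the numerator and the denominator, only the per-Leaf port counts and speeds (and the number of Spine nodes, which fixes uplinks_per_leaf) determine the ratio.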
2:1 Topology
The slide shows a basic 2:1 OS IP Fabric in which two Spine nodes have been added to the topology from the last slide. All Spine nodes, six in total, are qfx5100-24q routers that each have (32) 40GbE interfaces. All Leaf nodes, 32 in total, are qfx5100-48s routers that have (6) 40GbE uplink interfaces and (48) 10GbE server-facing interfaces. Each of the (48) 10GbE ports on all 32 Leaf nodes will be fully utilized (i.e., attached to downstream servers). That means that the total server-facing bandwidth is still 48 x 32 x 10Gbps, which equals 15360 Gbps. Each of the 32 Leaf nodes has (6) 40GbE Spine-facing interfaces, so the total uplink bandwidth is 6 x 32 x 40Gbps, which equals 7680 Gbps. The OS ratio for this fabric is 15360:7680, or 2:1.
1:1 Topology
The slide shows a basic 1:1 OS IP Fabric. All Spine nodes, six in total, are qfx5100-24q routers that each have (32) 40GbE interfaces. All Leaf nodes, 32 in total, are qfx5100-48s routers that have (6) 40GbE uplink interfaces and (48) 10GbE server-facing interfaces. There are many ways that a 1:1 OS ratio can be attained. In this case, although the Leaf nodes each have (48) 10GbE server-facing interfaces, we are only going to allow 24 servers to be attached to each Leaf at any given moment. That means that the total server-facing bandwidth is 24 x 32 x 10Gbps, which equals 7680 Gbps. Each of the 32 Leaf nodes has (6) 40GbE Spine-facing interfaces, so the total uplink bandwidth is 6 x 32 x 40Gbps, which equals 7680 Gbps. The OS ratio for this fabric is 7680:7680, or 1:1.
Configure an IP Fabric
The slide highlights the topic we discuss next.
Example Topology
The slide shows the example topology that will be used in the subsequent slides. Notice that each router is the single member of a unique autonomous system. Each router will peer using EBGP with its directly attached neighbors using the physical interface addresses. Host A is singly homed to the router in AS 64514. Host B is multihomed to the routers in AS 64515 and AS 64516.
Verifying Neighbors
Once you configure BGP neighbors, you can check the status of the relationships using either the show bgp summary or
show bgp neighbor command.
Routing Policy
Once BGP neighbors are established in the IP Fabric, each router must be configured to advertise routes to its neighbors and into the fabric. For example, as you attach a server to a top-of-rack (TOR) switch/router (usually a Leaf node of the fabric), you must configure the TOR to advertise the server's IP subnet to the rest of the network. The first step in advertising a route is to write a policy that matches on the route and then accepts it. The slide shows the policy that must be configured on the routers in AS 64514 and AS 64515.
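A sketch of such a policy in Junos syntax follows (the policy and term names are hypothetical; the slide's actual policy may differ):

```
policy-options {
    policy-statement advertise-servers {
        term server-subnets {
            /* match routes for directly attached server subnets */
            from protocol direct;
            then accept;
        }
    }
}
```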
Applying Policy
After configuring a policy, the policy must be applied to the router's EBGP peers. The slide shows the direct policy being applied as an export policy to AS 64515's EBGP neighbors.
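Applying a policy as an export policy at the BGP group level might look like the following sketch (the group and policy names are hypothetical):

```
protocols {
    bgp {
        group fabric {
            type external;
            /* advertise routes matched by the policy to all group peers */
            export advertise-servers;
        }
    }
}
```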
Default Behavior
Assuming the routers in AS 64515 and AS 64516 are advertising Host B's subnet, the slide shows the default routing behavior on a Spine node. Notice that the Spine node has received two advertisements for the same subnet. However, because of the default behavior of BGP, the Spine node chooses a single route to select as the active route in the routing table (you can tell which route is active by the asterisk). Based on what is shown in the slide, the Spine node will send all traffic destined for 10.1.2/24 over the ge-0/0/2 link. By default, the Spine node will not load balance over the two possible nexthops.
Verify Multipath
View the routing table to see the results of the multipath statement. As you can see, the active BGP route now has two nexthops that can be used for forwarding. Do you think the router is using both nexthops for forwarding?
Results
The output shows that after applying the load balancing policy to the forwarding table, all nexthops associated with active
routes in the routing table have been copied into the forwarding table.
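The load balancing policy referenced above is commonly written as follows in Junos (the policy name is hypothetical; note that despite the keyword, load-balance per-packet results in per-flow hashing on modern platforms):

```
policy-options {
    policy-statement load-balance {
        then {
            load-balance per-packet;
        }
    }
}
routing-options {
    forwarding-table {
        /* copy all nexthops of active routes into the forwarding table */
        export load-balance;
    }
}
```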
AS 64514
The slide shows the BGP and policy configuration for the router in AS 64514.
AS 64515
The slide shows the BGP and policy configuration for the router in AS 64515.
AS 64512
The slide shows the BGP and policy configuration for the router in AS 64512.
We Discussed:
• Routing in an IP Fabric;
• Scaling of an IP Fabric; and
• Configuring an IP Fabric.
Review Questions
1.
2.
3.
Lab: IP Fabric
The slide provides the objective for this lab.