
Advanced Data Center Switching

Chapter 3: IP Fabric

We Will Discuss:
• Routing in an IP Fabric;
• Scaling of an IP Fabric; and
• Configuring an IP Fabric.


IP Fabric Overview
The slide lists the topics we will discuss. We discuss the highlighted topic first.


IP Fabric
An IP Fabric is one of the most flexible and scalable data center solutions available. Because an IP Fabric operates strictly at Layer 3, no proprietary features or protocols are required, so this solution works very well in data centers that must accommodate multiple vendors. One of the most complicated tasks in building an IP Fabric is assigning all of the implementation details: IP addresses, BGP AS numbers, routing policy, loopback address assignments, and so on.


A Three Stage Clos Network


In the 1950s, Charles Clos first wrote about his idea of a non-blocking, multistage telephone switching architecture that would allow calls to always be completed. The switches in his topology are called crossbar switches. A Clos network is based on a three-stage architecture: an ingress stage, a middle stage, and an egress stage. The theory is that there are multiple paths
for a call to be switched through the network such that calls will always be connected and not "blocked" by another call. The
term Clos “fabric” came about later as people began to notice that the pattern of links looked like threads in a woven piece
of cloth.
You should notice that the goal of the design is to provide connectivity from one ingress crossbar switch to an egress
crossbar switch. Notice that there is no need for connectivity between crossbar switches that belong to the same stage.


An IP Fabric Is Based on a Clos Fabric


The diagram shows an IP Clos Fabric using Juniper Networks switches. In an IP Fabric the Ingress and Egress stage crossbar
switches are called Leaf nodes. The middle stage crossbar switches are called Spine nodes. Most diagrams of an IP Fabric
do not present the topology with 3 distinct stages as shown on this slide. Most diagrams show an IP Fabric with the Ingress
and Egress stage combined as a single stage. It would be like taking the top of the diagram and folding it over onto itself with
all Spine nodes on top and all Leaf nodes on the bottom of the diagram (see the next slide).


Spine and Leaf Architecture: Part 1


To maximize the throughput of the fabric, each Leaf node should have a connection to each Spine node. This ensures that each server-facing interface is always two hops away from any other server-facing interface, and it creates a highly resilient fabric with multiple paths to all other devices. An important fact to keep in mind is that a member switch has no idea of its role (Spine or Leaf) in an IP Fabric; the Spine or Leaf function is simply a matter of a device's physical location in the fabric. In general, the choice of router to be used as a Spine node should be based partially on the interface speeds and number of ports that it supports. The slide shows an example where every Spine node is a QFX5100-24Q. The QFX5100-24Q supports (32) 40GbE interfaces and was designed by Juniper to be a Spine node.


Spine and Leaf Architecture: Part 2


The slide shows that there are four distinct paths (1 path per Spine node) between Host A and Host B across the fabric. In an
IP Fabric, the main goal of your design should be that traffic is automatically load balanced over those equal cost paths
using a hash algorithm (keeping frames from the same flow on the same path).


IP Fabric Design Options


IP Fabrics are generally structured in either a 3-stage topology or a 5-stage topology. A 3-stage topology is used in small to medium deployments; we cover the configuration of a 3-stage fabric in the upcoming slides. A 5-stage topology is used in medium to large deployments. Although we do not cover the configuration of a 5-stage fabric, you should know that its configuration is quite complicated.


Recommended Spine Nodes


The slide shows some of the recommended Juniper Networks products that can act as Spine nodes. As stated earlier, you
should consider port density and scaling limitations when choosing the product to place in the Spine location. Some of the
pertinent features for a Spine node include overlay networking support, Layer 2 and Layer 3 VXLAN Gateway support, and the number of VLANs supported.


Recommended Leaf Nodes


The slide shows some of the recommended Juniper Networks products that can act as Leaf nodes.


IP Fabric Routing
The slide highlights the topic we discuss next.


Routing Strategy: Part 1


The slide highlights the desired routing behavior of a Leaf node. Ideally, each Leaf node should have multiple next hops to use to load balance traffic over the IP fabric. Notice that router C can use two different paths to forward traffic to any remote destination.


Routing Strategy: Part 2


The slide highlights the desired routing behavior of a Spine node. Ideally, each Spine node should have multiple next hops to use to load balance traffic to remote destinations attached to the IP fabric. Notice that routers D and E have one path for singly homed hosts and two paths available for multihomed hosts. It just so happens that getting these routes and their associated next hops into the forwarding table of a Spine node can be tricky. The rest of the chapter discusses the challenges as well as the solutions to the problem.


Layer 3 Connectivity
Remember that your IP Fabric will be forwarding IP data only. Each node will be an IP router. In order to forward IP packets
between routers, they need to exchange IP routes. So, you have to make a choice between routing protocols. You want to
ensure that your choice of routing protocol is scalable and future-proof. As you can see from the chart, BGP is the natural
choice for a routing protocol.


IBGP: Part 1
IBGP is a valid choice as the routing protocol for your fabric. IBGP peers almost always peer using loopback addresses as opposed to physical interface addresses. In order to establish a BGP session (over a TCP session), a router must have a route to the loopback address of its neighbor. To learn the route to a neighbor, an Interior Gateway Protocol (IGP) like OSPF must be enabled in the network. One purpose of enabling an IGP is simply to ensure that every router knows how to reach the loopback address of every other router. Another problem that OSPF solves is determining all of the equal-cost paths to remote destinations. For example, router A will determine from OSPF that there are two equal-cost paths to reach router B. Router A can then load balance traffic destined for router B's loopback address (IBGP-learned routes, see the next few slides) across the two links towards router B.
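As a rough illustration of this design, a minimal configuration sketch for one Leaf node follows. The interface names, loopback addresses, and peer addresses are hypothetical and are not taken from the slide.

protocols {
    ospf {
        area 0.0.0.0 {
            /* advertise the local loopback and run OSPF on both uplinks */
            interface lo0.0 {
                passive;
            }
            interface et-0/0/48.0;
            interface et-0/0/49.0;
        }
    }
    bgp {
        /* IBGP sessions to the Spine loopbacks */
        group IBGP {
            type internal;
            local-address 192.168.100.1;
            neighbor 192.168.100.11;
            neighbor 192.168.100.12;
        }
    }
}

With this in place, the BGP next hop (a remote loopback) resolves through OSPF, which supplies the equal-cost paths used for load balancing.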


IBGP: Part 2
There is a requirement in an IBGP network that if one IBGP router advertises a route, every other IBGP router must receive a copy of that route (to prevent black holes). One way to ensure this happens is to have every IBGP router peer with every other IBGP router (a full mesh). This works fine, but it does not scale (add a new router to your IP fabric and you will have to configure every router in the fabric with a new peer). There are two ways to address the full-mesh scaling issue: route reflection or confederations. Most often, route reflection is chosen because it is easy to implement. It is also possible to have redundant route reflectors (shown on the slide). It is a best practice to configure one or more of the Spine nodes as route reflectors.
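A minimal sketch of the route reflector configuration on a Spine node follows; the loopback addresses and the use of the local loopback as the cluster ID are assumptions, not taken from the slide.

protocols {
    bgp {
        group IBGP-RR {
            type internal;
            local-address 192.168.100.11;
            /* the cluster statement turns this Spine into a route reflector for its Leaf clients */
            cluster 192.168.100.11;
            neighbor 192.168.100.1;
            neighbor 192.168.100.2;
            neighbor 192.168.100.3;
        }
    }
}

Configuring a second Spine node in the same way provides the redundant route reflector mentioned above.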


IBGP: Part 3
Note: The next few slides highlight the problem faced by a Spine node (router D) that is NOT a route reflector.
You must build your IP Fabric such that all routers load balance traffic over equal-cost paths (when they exist) towards remote networks. Each router should be configured for BGP multipath so that it will load balance when multiple BGP routes exist. The slide shows that routers A and B advertise the 10.1/16 network to RR-A. RR-A will use both routes for forwarding (multipath) but will choose only one of those routes (the one from router B, because router B has the lowest router ID) to send to router C (a Leaf node) and router D (a Spine node). Router C and router D will receive the route for 10.1/16, and both copies will have a BGP next hop of router B's loopback address. This is the default behavior of route advertisement and selection in the IBGP with route reflection scenario.
Did you notice the load balancing problem (hint: the problem is not on router C)? Since router C has two equal-cost paths to reach router B (learned from OSPF), router C will load balance traffic to 10.1/16 over the two uplinks towards the Spine routers. The load balancing problem lies on router D. Since router D received a single route with a BGP next hop of router B's loopback, it forwards all traffic destined to 10.1/16 towards router B. The path to router A (which is an equal-cost path to 10.1/16) will never be used in this case. The next slide discusses the solution to this problem.


IBGP: Part 4
The problem on RR-A is that it sees the routes received from routers A and B for 10.1/16 as a single route that has been received twice. If an IBGP router receives different versions of the same route, it is supposed to make a choice between them and then advertise the one chosen route to its appropriate neighbors. One solution to this problem is to make every Spine node a route reflector. This would be fine in a small fabric, but it probably would not make sense when there are tens of Spine nodes. Another option is to make each of the advertisements from routers A and B look like a unique route. How can we make the multiple advertisements of 10.1/16 from routers A and B appear to be unique routes? The ADD-PATH capability (defined in draft-ietf-idr-add-paths, later published as RFC 7911) does just that; it makes the advertisements look unique. All Spine routers in the IP Fabric should support this capability for it to work. Once enabled, routers advertise and evaluate routes based on a tuple of the network and its path ID. In the example, routers A and B advertise the 10.1/16 route. However, this time RR-A and router D support the ADD-PATH capability, so RR-A attaches a unique path ID to each route and is able to advertise both routes to router D. When the routes arrive at router D, router D installs both routes in its routing table, allowing it to load balance towards routers A and B.
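A minimal Junos sketch of the ADD-PATH configuration follows; the group names and the path count are hypothetical. On RR-A, the capability is enabled to send multiple paths to its clients and peers:

protocols {
    bgp {
        group IBGP-RR {
            family inet {
                unicast {
                    add-path {
                        receive;
                        send {
                            path-count 4;
                        }
                    }
                }
            }
        }
    }
}

On router D, which only needs to accept the additional paths, the receive statement alone is sufficient:

protocols {
    bgp {
        group IBGP {
            family inet {
                unicast {
                    add-path {
                        receive;
                    }
                }
            }
        }
    }
}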


EBGP: Part 1
EBGP is also a valid design to use in your IP Fabric. You will notice that the load balancing problem is much easier to fix in the
EBGP scenario. For example, there will be no need for the routers to support any draft RFCs! Generally, each router in an IP
Fabric should be in its own unique AS. You can use AS numbers from the private or public range or, if you will need thousands
of AS numbers, you can use 32-bit AS numbers.


EBGP: Part 2
In an EBGP-based fabric, there is no need for route reflectors or an IGP. The BGP peering sessions parallel the physical wiring. For example, every Leaf node has a BGP peering session with every Spine node. There are no leaf-to-leaf or spine-to-spine BGP sessions, just as there is no leaf-to-leaf or spine-to-spine physical connectivity. EBGP peering is done using the physical interface IP addresses (not loopback interfaces). To enable proper load balancing, all routers need to be configured with multipath multiple-as as well as a load balancing policy. Both of these configurations are covered later in this chapter.


EBGP: Part 3
The slide shows that the routers in AS 64516 and AS 64517 are advertising 10.1/16 to their two EBGP peers. Because multipath multiple-as is configured on all routers, the receiving routers in AS 64512 and AS 64513 will install both routes in their routing tables and load balance traffic destined to 10.1/16.


EBGP: Part 4
The slide shows that the routers in AS 64512 and AS 64513 are advertising 10.1/16 to all of their EBGP peers (all Leaf nodes). Since multipath multiple-as is configured on all routers, the receiving router shown on the slide, the router in AS 64514, will install both routes in its routing table and load balance traffic destined to 10.1/16.


Best Practices
When enabling an IP fabric, you should follow some best practices. Remember, two of the main goals of an IP fabric design (or a Clos design) are to provide a non-blocking architecture and predictable load-balancing behavior.
Some of the best practices that should be followed include the following:
• All Spine nodes should be the exact same type of router. They should be the same model and they should also
have the same line cards installed. This helps the fabric to have a predictable load balancing behavior.
• All Leaf nodes should be the exact same type of router. Leaf nodes do not have to be the same router as the
Spine nodes. Each Leaf node should be the same model and they should also have the same line cards
installed. This helps the fabric to have a predictable load balancing behavior.
• Every Leaf node should have an uplink to every Spine node. This helps the fabric to have a predictable load
balancing behavior.
• All uplinks from Leaf node to Spine node should be the exact same speed. This helps the fabric to have predictable load balancing behavior and also helps with the non-blocking nature of the fabric. For example, assume that a Leaf has one 40GbE uplink and one 10GbE uplink to the Spine. When using the combination of OSPF (for loopback interface advertisement and BGP next-hop resolution) and IBGP, the bandwidth of the links is taken into consideration when calculating the shortest path to the BGP next hop. OSPF will most likely always choose the 40GbE interface for forwarding towards remote BGP next hops, which essentially blocks the 10GbE interface from ever being used. In the EBGP scenario, the bandwidth is not taken into consideration, so traffic will be equally load balanced over the two different-speed interfaces. Imagine trying to equally load balance 60 Gbps of data over the two links; how will the 10GbE interface handle 30 Gbps of traffic? The answer is...it won't.


IP Fabric Scaling
The slide highlights the topic we discuss next.


Scaling
To increase the overall throughput of an IP Fabric, you simply need to increase the number of Spine devices (and the
appropriate uplinks from the Leaf nodes to those Spine nodes). If you add one more Spine node to the fabric, you will also
have to add one more uplink to each Leaf node. Assuming that each uplink is 40GbE, each Leaf node can now forward an
extra 40Gbps over the fabric.
Adding and removing both server-facing ports (downlinks from the Leaf nodes) and Spine nodes will affect the oversubscription (OS) ratio of a fabric. When designing the IP fabric, you must understand the OS requirements of your data center. For example, does your data center need line-rate forwarding over the fabric? Line-rate forwarding would equate to 1-to-1 (1:1) OS, meaning the aggregate server-facing bandwidth is equal to the aggregate uplink bandwidth. Or maybe your data center would work perfectly fine with a 3:1 OS ratio, where the aggregate server-facing bandwidth is three times the aggregate uplink bandwidth. Most data centers will probably not require a 1:1 OS design. Instead, you should decide on an OS ratio that makes the most sense based on the data center's normal bandwidth usage. The next few slides discuss how to calculate the OS ratios of various IP fabric designs.


3:1 Topology
The slide shows a basic 3:1 OS IP Fabric. All Spine nodes, four in total, are QFX5100-24Q routers that each have (32) 40GbE interfaces. All Leaf nodes, 32 in total, are QFX5100-48S routers that have (6) 40GbE uplink interfaces and (48) 10GbE server-facing interfaces. Each of the (48) 10GbE ports on all 32 Leaf nodes will be fully utilized (i.e., attached to downstream servers). That means that the total server-facing bandwidth is 48 x 32 x 10 Gbps, which equals 15360 Gbps. Each of the 32 Leaf nodes has (4) 40GbE Spine-facing interfaces in use, so the total uplink bandwidth is 4 x 32 x 40 Gbps, which equals 5120 Gbps. The OS ratio for this fabric is 15360:5120, or 3:1.
An interesting thing to note is that if you remove any number of Leaf nodes, the OS ratio does not change. For example, what would happen to the OS ratio if there were only 31 Leaf nodes? The server-facing bandwidth would be 48 x 31 x 10 Gbps, which equals 14880 Gbps. The total uplink bandwidth would be 4 x 31 x 40 Gbps, which equals 4960 Gbps. The OS ratio for this fabric is still 14880:4960, or 3:1. This fact makes your design calculations very simple. Once you decide on an OS ratio and determine the number of Spine nodes that will allow that ratio, you can simply add and remove Leaf nodes from the topology without affecting the original OS ratio of the fabric.


2:1 Topology
The slide shows a basic 2:1 OS IP Fabric in which two Spine nodes were added to the topology from the previous slide. All Spine nodes, six in total, are QFX5100-24Q routers that each have (32) 40GbE interfaces. All Leaf nodes, 32 in total, are QFX5100-48S routers that have (6) 40GbE uplink interfaces and (48) 10GbE server-facing interfaces. Each of the (48) 10GbE ports on all 32 Leaf nodes will be fully utilized (i.e., attached to downstream servers). That means that the total server-facing bandwidth is still 48 x 32 x 10 Gbps, which equals 15360 Gbps. Each of the 32 Leaf nodes has (6) 40GbE Spine-facing interfaces, so the total uplink bandwidth is 6 x 32 x 40 Gbps, which equals 7680 Gbps. The OS ratio for this fabric is 15360:7680, or 2:1.


1:1 Topology
The slide shows a basic 1:1 OS IP Fabric. All Spine nodes, six in total, are QFX5100-24Q routers that each have (32) 40GbE interfaces. All Leaf nodes, 32 in total, are QFX5100-48S routers that have (6) 40GbE uplink interfaces and (48) 10GbE server-facing interfaces. There are many ways that a 1:1 OS ratio can be attained. In this case, although the Leaf nodes each have (48) 10GbE server-facing interfaces, we are only going to allow 24 servers to be attached at any given moment. That means that the total server-facing bandwidth is 24 x 32 x 10 Gbps, which equals 7680 Gbps. Each of the 32 Leaf nodes has (6) 40GbE Spine-facing interfaces, so the total uplink bandwidth is 6 x 32 x 40 Gbps, which equals 7680 Gbps. The OS ratio for this fabric is 7680:7680, or 1:1.


Configure an IP Fabric
The slide highlights the topic we discuss next.


Example Topology
The slide shows the example topology that will be used in the subsequent slides. Notice that each router is the single member of a unique autonomous system. Each router peers using EBGP with its directly attached neighbors using the physical interface addresses. Host A is singly homed to the router in AS 64514. Host B is multihomed to the routers in AS 64515 and AS 64516.


BGP Configuration—Spine Node


The slide shows the configuration of the Spine node in AS 64512. It is configured to peer with each of the Leaf nodes using
EBGP.
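The slide's configuration is not reproduced in these notes; a representative sketch for the Spine node in AS 64512 might look like the following, where the group name and the point-to-point interface addresses are hypothetical.

routing-options {
    autonomous-system 64512;
}
protocols {
    bgp {
        group fabric {
            type external;
            /* one EBGP session per Leaf node, each Leaf in its own AS */
            neighbor 172.16.1.1 {
                peer-as 64514;
            }
            neighbor 172.16.1.3 {
                peer-as 64515;
            }
            neighbor 172.16.1.5 {
                peer-as 64516;
            }
            neighbor 172.16.1.7 {
                peer-as 64517;
            }
        }
    }
}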


BGP Configuration—Leaf Node


The slide shows the configuration of the Leaf node in AS 64515. It is configured to peer with each of the Spine nodes using
EBGP.
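Again, the slide itself is not reproduced here; a representative sketch for the Leaf node in AS 64515 might look like the following, with hypothetical interface addresses.

routing-options {
    autonomous-system 64515;
}
protocols {
    bgp {
        group fabric {
            type external;
            /* one EBGP session per Spine node */
            neighbor 172.16.1.2 {
                peer-as 64512;
            }
            neighbor 172.16.2.2 {
                peer-as 64513;
            }
        }
    }
}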


Verifying Neighbors
Once you configure BGP neighbors, you can check the status of the relationships using either the show bgp summary or
show bgp neighbor command.
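For example, on the Spine node you might enter the following commands (the hostname and neighbor address are hypothetical):

user@spine1> show bgp summary
user@spine1> show bgp neighbor 172.16.1.1

In the show bgp summary output, a session that has reached the Established state typically displays its active/received/accepted prefix counts in the state column rather than a state name such as Active or Connect.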


Routing Policy
Once BGP neighbors are established in the IP Fabric, each router must be configured to advertise routes to its neighbors and into the fabric. For example, as you attach a server to a top-of-rack (TOR) switch/router (which is usually a Leaf node of the fabric), you must configure the TOR to advertise the server's IP subnet to the rest of the network. The first step in advertising a route is to write a policy that matches the route and then accepts it. The slide shows the policy that must be configured on the routers in AS 64514 and AS 64515.
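The slide's policy is not reproduced in these notes. Because the next slide refers to it as the direct policy, a minimal sketch might simply match and accept the directly connected (server-facing) subnets:

policy-options {
    policy-statement direct {
        term server-subnets {
            /* accept locally connected subnets so BGP can advertise them */
            from protocol direct;
            then accept;
        }
    }
}

In a production fabric you would typically add a route-filter to limit the policy to the intended server subnets; the term shown here is a deliberately broad example.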


Applying Policy
After configuring a policy, the policy must be applied to the router EBGP peers. The slide shows the direct policy being
applied as an export policy as64515’s EBGP neighbors.
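A minimal sketch of applying the policy at the BGP group level follows; the group name matches the hypothetical configuration shown earlier.

protocols {
    bgp {
        group fabric {
            /* advertise routes accepted by the direct policy to every peer in this group */
            export direct;
        }
    }
}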


Verifying Advertised Routes


After applying the policy, the router should begin advertising any routes that were accepted by the policy. Use the show route advertising-protocol bgp command to see which routes are being advertised to a router's BGP neighbors.


Default Behavior
Assuming the routers in AS 64515 and AS 64516 are advertising Host B's subnet, the slide shows the default routing behavior on a Spine node. Notice that the Spine node has received two advertisements for the same subnet. However, because of the default behavior of BGP, the Spine node selects a single route as the active route in the routing table (you can tell which route is active by the asterisk). Based on what is shown on the slide, the Spine node will send all traffic destined for 10.1.2/24 over the ge-0/0/2 link. The Spine node will not load balance over the two possible next hops by default.


Override Default BGP Behavior


The multipath statement overrides the default BGP routing behavior and allows two or more next hops to be used for routing. By itself, the statement requires that the multiple routes be received from the same autonomous system. Use the multiple-as modifier to remove that matching-AS requirement.
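A minimal sketch of this configuration follows, again using the hypothetical group name from the earlier slides:

protocols {
    bgp {
        group fabric {
            multipath {
                /* also accept equal routes learned from peers in different ASs */
                multiple-as;
            }
        }
    }
}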


Verify Multipath
View the routing table to see the results of the multipath statement. As you can see, the active BGP route now has two next hops that can be used for forwarding. Do you think the router is using both next hops for forwarding?


Default Forwarding Table Behavior


The slide shows that because multipath was configured in the previous slides, two next hops are associated with the 10.1.2/24 route in the routing table. However, by default, only one next hop is pushed down to the forwarding table. So, at this point, the Spine node continues to forward traffic destined to 10.1.2/24 over a single link.


Load Balancing Policy


The final step in getting a router to load balance is to write and apply a policy that causes the multiple next hops of an active route to be exported from the routing table into the forwarding table. The slide shows the details of that process.
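A minimal sketch of that process follows; the policy name is arbitrary.

policy-options {
    policy-statement load-balance {
        then {
            load-balance per-packet;
        }
    }
}
routing-options {
    forwarding-table {
        /* export the policy so that all next hops of an active route are installed */
        export load-balance;
    }
}

Despite its name, the load-balance per-packet action results in per-flow (hash-based) load balancing on most platforms, which matches the flow-preserving behavior described earlier in this chapter.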


Results
The output shows that after applying the load balancing policy to the forwarding table, all next hops associated with active routes in the routing table have been copied into the forwarding table.


AS 64514
The slide shows the BGP and policy configuration for the router in AS 64514.


AS 64515
The slide shows the BGP and policy configuration for the router in AS 64515.


AS 64512
The slide shows the BGP and policy configuration for the router in AS 64512.


We Discussed:
• Routing in an IP Fabric;
• Scaling of an IP Fabric; and
• Configuring an IP Fabric.


Review Questions
1. Which Juniper Networks products can be used as Spine nodes in an IP Fabric?

2. How should routing be implemented when multiple equal-cost paths exist between two points in an IP Fabric?

3. What must be enabled to allow a BGP speaker to install more than one next hop for the same route?


Lab: IP Fabric
The slide provides the objective for this lab.

Answers to Review Questions
1.
Some of the Juniper Networks products that can be used in the Spine position of an IP Fabric are MX, QFX10k, and QFX5100 Series
routers.
2.
Routing should be implemented so that, when multiple equal-cost physical paths exist between two points, data traffic is load balanced over those paths.
3.
To allow a BGP speaker to install more than one next hop in the routing table when the same route is received from two or more
neighbors, multipath must be enabled.
