
Fail Open:

A Way to Handle CPN Failures


Leon Poutievski

10/16/2014
Introduction
With SDN, the controller runs remotely from the
switches. This brings many benefits:
● Complex routing moves to a powerful
server
● Simpler switch
● Simpler upgrade
● Easier to introduce new features

A separate out-of-band control plane
network (CPN) makes bootstrapping and
troubleshooting simpler.
Problem: Partial CPN Failure
The controller loses connectivity to a switch.

If the controller treats a CPN disconnect as
a switch being down:
● The controller reroutes traffic away from the node
● Massive churn on CPN disconnect
● Massive churn on CPN re-connect
This results in congestion and traffic loss.
Controller Signals
How can the controller distinguish a
disconnected node from a down node?
● Peer switches know about adjacent links
○ If neighboring switches report that their ports to
the disconnected node are down,
then the node is likely down
● The controller can inject probes to
determine whether paths through the
CPN-disconnected node are still up
● End-to-end data might be available, e.g.
hosts may see drops
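
The signals above can be combined into a single verdict. The following is a minimal sketch, not the controller's actual logic: the `Signals` fields, the `Verdict` names, and the rule ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    LIKELY_UP = "likely up (CPN disconnect only)"
    LIKELY_DOWN = "likely down"
    UNKNOWN = "unknown"


@dataclass
class Signals:
    """Hypothetical summary of what the controller knows about one node."""
    peer_links_up: int        # adjacent links that peers report as up
    peer_links_down: int      # adjacent links that peers report as down
    probes_sent: int = 0      # probes injected on paths through the node
    probes_received: int = 0  # probes that came out the other side


def classify(s: Signals) -> Verdict:
    total = s.peer_links_up + s.peer_links_down
    # Every neighbor sees its port to the node down: the node is likely down.
    if total > 0 and s.peer_links_down == total:
        return Verdict.LIKELY_DOWN
    # Assumption: probes were sent and none came back, so forwarding is broken.
    if s.probes_sent > 0 and s.probes_received == 0:
        return Verdict.LIKELY_DOWN
    # At least one neighbor still sees a live link to the node.
    if s.peer_links_up > 0:
        return Verdict.LIKELY_UP
    return Verdict.UNKNOWN
```

For example, `classify(Signals(peer_links_up=2, peer_links_down=1))` yields `Verdict.LIKELY_UP`, while a node whose peers all report their ports down is classified as `Verdict.LIKELY_DOWN`.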
Assumptions
Frequent switch updates are not required
● E.g. proactive approach: a switch is
updated based on topology changes,
route advertisements, traffic changes
● Example: WAN

Switches can automatically react to local
failures, e.g. prune down ports.
Fail-Open

Goal: Minimize the network disruption during and after CPN failures

Fail-Open - Optimistic Policy

Assume the data plane is not affected by control plane failures.

The controller knows the last programmed state and assumes that the node is
“frozen” at that state.
Fail-Open Reaction: At Transit

Keep using the confirmed flows on
failed-open nodes.

New routing solutions can use the
confirmed flows on failed-open nodes.

[Figure: tunnel through Encap, Transit, and Decap nodes, with the CPN]
Fail-Open Reaction: At Destination

Keep using the confirmed flows on
failed-open nodes.

New routing solutions can use the
confirmed flows on failed-open nodes.

[Figure: tunnel through Encap, Transit, and Decap nodes, with the CPN]
Fail-Open Reaction: At Source

Keep using the confirmed flows on
failed-open nodes.

Since the source cannot be updated, the
controller needs to maintain the existing
tunnels.

[Figure: tunnel through Encap, Transit, and Decap nodes, with the CPN]
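
The three reaction cases share one rule: a routing solution may cross a failed-open node only along a flow that the node has already confirmed. The sketch below illustrates that check under assumed data structures. The function `path_is_usable`, the `DECAP` marker, and the shape of `confirmed` are hypothetical and do not come from the talk.

```python
DECAP = "decap"  # hypothetical marker for "this node terminates the tunnel"


def path_is_usable(tunnel, path, fail_open, confirmed):
    """Return True if `path` for `tunnel` respects frozen state.

    path:      ordered list of node names, source (encap) first
    fail_open: set of nodes the controller cannot currently program
    confirmed: {node: {tunnel: next_hop}} flows acknowledged before the failure
    """
    for i, node in enumerate(path):
        if node not in fail_open:
            continue  # reachable nodes can still be reprogrammed
        # What the path needs this node to do with the tunnel.
        expected = path[i + 1] if i + 1 < len(path) else DECAP
        # A frozen node can only do what it was already confirmed to do.
        if confirmed.get(node, {}).get(tunnel) != expected:
            return False
    return True
```

With a failed-open source `A` whose confirmed flow sends tunnel `t1` to `B`, the path `["A", "B", "C"]` is usable but `["A", "D", "C"]` is not, which is why the existing tunnel has to be maintained in the source case.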
Detection and Recovery
Up → Fail-Open
● No control plane connection:
○ OpenFlow connection lost
○ No response to commands
● Peers report that data plane links to the
node are still up
Fail-Open → Up
● Control connectivity has been restored
Fail-Open → Down
● Fail-Open for a long period of time
● Negative data plane signals from peers

[Figure: state diagram with states Up, Fail-Open, and Down]
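
The transitions listed above can be written as a small state machine. This is a sketch under assumptions: the 600-second timeout is an arbitrary placeholder for "a long period of time", the direct Up → Down step is inferred rather than stated, and transitions out of Down are left unspecified because the slide does not describe them.

```python
from enum import Enum


class State(Enum):
    UP = "up"
    FAIL_OPEN = "fail-open"
    DOWN = "down"


def next_state(state, *, cpn_connected, peers_report_links_up,
               seconds_in_fail_open=0.0, max_fail_open_seconds=600.0):
    """Compute the controller's next view of a node.

    max_fail_open_seconds is a hypothetical threshold, not a value from the talk.
    """
    if state is State.UP:
        if cpn_connected:
            return State.UP
        # Control connection lost: fail open only if peers still see the links up.
        # Assumption: otherwise the node is treated as down right away.
        return State.FAIL_OPEN if peers_report_links_up else State.DOWN
    if state is State.FAIL_OPEN:
        if cpn_connected:
            return State.UP  # control connectivity has been restored
        if not peers_report_links_up:
            return State.DOWN  # negative data plane signals from peers
        if seconds_in_fail_open >= max_fail_open_seconds:
            return State.DOWN  # failed open for too long
        return State.FAIL_OPEN
    # The slide lists no transition out of Down, so the sketch leaves it as-is.
    return state
```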
Massive CPN Failures
● Quick state transitions can be harmful
● Example:
○ 25% of nodes considered down, then
○ 50%, then
○ 75%, then
○ 100%
○ The network will be left at 25% capacity

● Coalescing helps
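
One possible reading of coalescing is to buffer node state changes for a short window and hand the routing computation a single batch, so that a node that flaps within the window produces no churn. The sketch below illustrates that idea only. The `Coalescer` class, its window length, and the use of caller-supplied timestamps are assumptions, not details from the talk.

```python
class Coalescer:
    """Collect per-node state changes and release them as one batch."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.pending = {}           # node -> most recent reported state
        self.first_event_at = None  # time of the first change in this batch

    def report(self, node, state, now):
        if self.first_event_at is None:
            self.first_event_at = now
        # Later reports overwrite earlier ones, so flaps collapse to one entry.
        self.pending[node] = state

    def flush_if_due(self, now):
        """Return the batch once the window has elapsed, else None."""
        if self.first_event_at is None:
            return None
        if now - self.first_event_at < self.window:
            return None
        batch = self.pending
        self.pending = {}
        self.first_event_at = None
        return batch
```

With a 5-second window, reports at t=0, t=1, and t=2 are released together at t=5, and the controller recomputes routes once instead of three times.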
Thank You!
