
Fast Failure Recovery for In-Band Controlled Multi-Controller OpenFlow Networks

Kwan-Yee Chan, Chen-Hua Chen, Yi-Huei Chen, Yun-Ju Tsai, Steven S. W. Lee, and Cheng-Shong Wu
Department of Communications Engineering and Advanced Institute of Manufacturing with High-tech Innovations
National Chung Cheng University
Chiayi, Taiwan
{kwanyee86, k951874632, yihui0413, yunjulutsai, steven.sswlee}@gmail.com

Abstract—Benefiting from its centralized control paradigm, SDN provides flexible and agile control over networks. However, the central controller becomes the most vulnerable device: its failure results in the malfunction of the whole network. The OpenFlow specification allows a network to be equipped with multiple controllers to enhance the reliability of the control plane. How to perform fast failure recovery that takes advantage of a network with multiple controllers, especially for networks operating with in-band control, has not been well investigated. In this paper, we design a fast failure recovery scheme that takes the multiple-controller architecture into consideration. In our system, there is a main controller that is responsible for controlling the network in the normal state. The other controllers are standby controllers used to take over network control in the failure state. Failure recovery in the proposed scheme includes three phases: fast failure detection, fast failure location identification, and traffic reroute. Fast failure detection uses monitoring cycles to examine the status of the network. Once a failure is detected, the main controller and the standby controllers work together to perform failure location identification to pinpoint the locations of the failed devices. In the traffic reroute phase, the main controller and the standby controllers collaborate to complete the fast failover. A control path provisioning algorithm is proposed to resolve the disjoint path planning problem for the routing of the in-band control channels from the controllers to the switches. As a result, a failure can be recovered quickly even if it damages both working control paths and data paths. We have implemented our design in an experimental network. The experimental results show that the proposed scheme can protect against any single link or single node failure within 50 msec.

Keywords—Fast Failure Recovery; Survivable OpenFlow Network; Multi-controller; In-band Controlled OpenFlow Networks

I. INTRODUCTION

By separating the control plane and the data plane, Software Defined Networking (SDN) improves the controllability and manageability of a network. Through the centralized control paradigm, application programs running on an OpenFlow controller provide functions including admission control, routing, path provisioning, statistical data collection, flow analysis, and so on. Although SDN offers high flexibility in network control and management, the central controller becomes the most vulnerable device. The failure of the controller results in the malfunction of the whole network.

OpenFlow [1] is one of the most promising approaches to realizing SDN. To enhance scalability and reliability, OpenFlow specification 1.3 and its descendant versions allow more than one controller to control a network. OpenFlow controller platforms including ONOS [2] and OpenDaylight [3] already support the multi-controller architecture, in which the controllers automatically synchronize network states among themselves. In a single-controller network, a controller failure is fatal; it is clear that multiple controllers can improve network reliability. When a controller fails, another controller can help to control the network [4], [5]. Multiple controllers are also useful for load sharing; designs for load balancing on multi-controller architectures can be found in [6], [7], and [8].

For a switch, a controller can have one of three roles: Master, Slave, and Equal. A switch can have only one Master controller but may have many Equal and Slave controllers. A controller whose role is Master or Equal has full control over its switches, whereas a Slave controller can only listen to the responses from the switches; more specifically, a Slave controller is not allowed to issue commands to a switch. For a network with multiple controllers, one typical configuration is to assign one controller to be the Master and the others to be Slaves. Another feasible configuration is to assign multiple Equal controllers. For a switch, the roles of its controllers can be changed dynamically through a role change command; therefore, a Slave controller can become a Master (or Equal) controller when needed.

Survivable SDN network design is an important research topic, and many studies address fast failure detection and fast failure recovery in OpenFlow networks [9], [10], [11], [12], [13]. In [9], the Link Layer Discovery Protocol (LLDP) is used to examine link status; the controller sets up a path to reroute flows going through the damaged link. In [10], a path restoration scheme is investigated in which the OpenFlow controller reconfigures recovery paths after a failure is detected; the failover time ranges from 260 msec to 310 msec. In [11], high-density Bidirectional Forwarding Detection messages are used for failure detection; when a failure is detected, the switches at the two ends of a path perform automatic protection switching. In [12], both path restoration and protection schemes are considered; the failover time is about 240 msec for path restoration and 42-48 msec for path protection in a network with 16 nodes. In [13], segment protection in a network with

978-1-5386-5041-7/18/$31.00 ©2018 IEEE 396 ICTC 2018

Authorized licensed use limited to: Univ de Alcala. Downloaded on January 18,2023 at 11:16:22 UTC from IEEE Xplore. Restrictions apply.
pre-configured backup paths is used; the switch that detects a link failure performs automatic protection switching without the involvement of the controller. In [14], an effective control channel resilience protocol is able to recover multiple failures simultaneously, and the recovery does not require intervention by the controller.

Although many research works have been published to address fast failure recovery in SDN networks, none of them is developed for the architecture with multiple controllers. In this work, we present a scheme that fully takes advantage of multiple controllers for the design of survivable OpenFlow networks. Because in-band control can significantly reduce the consumption of network resources, our design focuses on in-band controlled networks. However, in-band control increases the difficulty of fast failure recovery: since a failure in an in-band controlled network can disrupt control channels, it can result in loss of control over multiple switches. The main idea of this work is to provision disjoint paths from multiple controllers to each switch. As the number of controllers increases, the number of simultaneous failures that the network can protect against also increases.

The failure recovery in the proposed scheme includes three phases: fast failure detection, fast failure location identification, and traffic reroute. In Section II, we present how to detect the occurrence of a failure and how to identify the failure location in a short time. In Section III, we present our design that performs fast failover through the collaboration of the main controller and the standby controllers. In Section IV, we present the experimental results. Finally, concluding remarks are included in Section V.

II. FAILURE DETECTION AND FAILURE LOCATION IDENTIFICATION

In this section, we present a software-based fast failure detection and fast failure location identification scheme for a multi-controller OpenFlow network. The proposed scheme can be applied in a network using either out-of-band or in-band control. We provision multiple monitoring cycles that cover all links in the network. By periodically probing the status of the monitoring cycles, the controller knows the status of the network. When a failure is detected on a monitoring cycle, the controller initiates the failure location identification process to identify the failure type and the failure location.

A. Control Paths Planning

In an in-band controlled network, multiple control paths and data paths might be broken at the same time when a failure occurs, and the damaged control paths prevent the controller from reconfiguring flow entries in the OpenFlow switches. In a single-controller network, we must therefore perform control path recovery before data path recovery; in other words, failure recovery for data paths can only be performed after the controller regains control of the switches [15]. Not being able to perform control path recovery and data path recovery at the same time delays failure recovery. In this work, we take advantage of multiple controllers to shorten the failure recovery time. Our design can protect against any single node or link failure even if the failure interrupts both control and data paths.

In our system, there are K controllers. One of them is a main controller (MC) that controls the whole network in the normal state. The others are standby controllers (SCs), which take over control of the network in the failure state. When a failure occurs, all switches should be controlled by at least one controller except the failed switch itself. In order to maintain maximum reachability against network failures, the collection of control paths from the K controllers to each switch should be as diverse as possible. The problem of finding K as-diverse-as-possible lowest-cost paths is called the K-best path (KBP) problem. In KBP, if the desired number of paths K is larger than the maximum number of mutually disjoint paths that the network can support, the KBP algorithm outputs the K paths with the lowest total cost such that the number of nodes shared among them is minimized. However, the conventional KBP algorithm deals with this problem starting from a single node, whereas in this multi-controller problem the control paths should start from K distinct controllers. A graph transformation technique is used to solve the control path routing problem. As shown in Figure 1, we first generate an artificial virtual node (AVN) that connects to all of the controllers, and the cost of each link between the AVN and a controller is set to zero. By applying the KBP algorithm [16] from the AVN, we obtain the routing for the control paths.

Fig. 1. Graph transformation for applying the K-best path algorithm to the control path planning problem

B. Failure Detection and Monitoring Cycle Planning

The time to detect a link failure depends on the design of the switches. In [17], several switches are examined; it indicates that a switch might take up to 1.5 sec to detect a failure. This long detection time prevents building fast failure recovery on the status reports from the switches themselves. For an in-band controlled network, the situation is even worse: a fault on a control channel prevents a switch from reporting the event to the controller. As a result, it is necessary for a controller to autonomously monitor the status of the network.

In this paper, we apply a monitoring-cycle-based approach for network status detection. This approach was proposed in our previous work [15]. A monitoring path is a cycle that starts and ends at the MC; by traversing the monitoring cycle, the controller can detect whether the nodes and links are working properly. Figure 2 shows an example of a monitoring cycle. A Failure Monitoring (FM) packet leaving the MC traverses the whole monitoring cycle S1 → S2 → S3 → S4 → S1 before it comes back to the MC. If there is a failure on any device of the monitoring cycle, the FM packet cannot return to the MC. To cover all of the links, a network might need multiple cycles. The monitoring cycle planning problem can be solved based on the solution of the min-max K-Chinese postman problem; please refer to [15] and [18] for the details.
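The graph transformation of Section II-A can be illustrated in code. The sketch below is not the KBP algorithm of [16]; it is a simplified greedy stand-in that attaches a hypothetical "AVN" node to every controller with zero-cost links and then repeatedly runs Dijkstra, penalizing already-used nodes so that successive paths start at distinct controllers and avoid sharing intermediate switches. All function names and the penalty heuristic are illustrative assumptions.

```python
import heapq

def shortest_path(adj, src, dst, cost):
    # Plain Dijkstra; `cost` maps node -> extra cost used to penalize reuse.
    dist, prev = {src: 0}, {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w + cost.get(v, 0)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if dst not in dist:
        return None
    path, n = [], dst
    while n != src:
        path.append(n)
        n = prev[n]
    path.append(src)
    return path[::-1]

def control_paths(links, controllers, switch, penalty=100):
    # Graph transformation (Section II-A): an artificial virtual node "AVN"
    # is attached to every controller with zero-cost links, so all K control
    # paths can be computed from a single source.
    adj = {}
    for u, v, w in links:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    for c in controllers:
        adj.setdefault("AVN", []).append((c, 0))
    paths, used = [], {}
    for _ in controllers:
        p = shortest_path(adj, "AVN", switch, used)
        if p is None:
            break
        paths.append(p[1:])  # drop the AVN from the reported path
        # Penalize the controller and intermediate nodes of this path, so the
        # next path prefers a different controller and disjoint switches.
        for n in p[1:-1]:
            used[n] = used.get(n, 0) + penalty
    return paths
```

On a toy topology with two controllers, the two returned paths start at distinct controllers and share no intermediate node, which is the diversity property the KBP planning aims for.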

The detection process includes one Failure Monitoring (FM) round and one Failure Location Identification (FLI) round. In the FM round, each monitoring cycle is traversed by a monitoring packet. The FM packet for each monitoring cycle should come back to the controller within time Δ. The system parameter Δ is the maximum propagation and queuing delay for an FM packet transmitted along its monitoring path. To guarantee the timing of cycle monitoring, FM packets are assigned the highest priority. If the MC does not receive the corresponding FM packet within Δ, the system enters the FLI round to pinpoint the failure location.

Fig. 2. FM round: network in normal state

C. Failure Location Identification

As shown in Figure 3, any link or node failure can prevent the FM packet from coming back to the controller within time Δ. We then initiate a Failure Location Identification (FLI) round to detect the failure type (node or link failure) and the failure location.

Fig. 3. FM round: network in failure state; (a) link failure, (b) node failure

1) Failure Location Detection

To detect the failure location, the MC sends a Failure Location Detection (FLD) packet along the monitoring path. Every switch makes two copies of the incoming FLD packet: one copy is forwarded to the next hop along the monitoring path, and the other is looped back along the monitoring cycle toward the MC.

By counting the number of FLD packets returned to the MC, the controller can identify the failure location. Receiving n FLD packets indicates that the error must be at the link or node at the (n+1)-th hop of the monitoring cycle. For example, in Figure 4(a), receiving 3 FLD packets at the MC means that there is no response from S4; thus, the fault must be caused by link (S3, S4) and/or node S4.

2) Failure Type Detection

The MC and SC collaborate to identify the failure type (node failure or link failure) by using barrier requests to probe the switches. Barrier requests are sent from both the MC and the SC to all of the switches belonging to the monitoring cycle that failed to pass the FM round. When a switch receives a barrier request, it responds to the querying controller with a barrier reply. Figure 4(b) shows the failure type detection in an FLI round. If the failure is a link failure, the union of the barrier replies received by the MC and the SC covers all of the switches. Otherwise, if there is a switch that responds to neither controller, the failure must be caused by a node fault. Note that we applied KBP to provision disjoint control paths for each switch; thus, any single device failure can prevent at most one controller from connecting to a switch. Although we depict only one MC and one SC in this example, as the number of SCs increases, the proposed scheme can be extended to protect against multiple simultaneous failures.

Note that failure location detection and failure type detection do not need to be performed sequentially; they can be performed at the same time to reduce the failure recovery time.

(a) Failure location detection
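The FLI-round logic just described admits a compact sketch. The two helpers below are illustrative (names and signatures are assumptions, not part of the paper's implementation): the first maps the FLD copy count to the suspected link/node, and the second applies the barrier-reply union rule to distinguish link from node failures.

```python
def locate_failure(cycle, n_fld):
    # cycle: switches along the monitoring path in order, e.g. ["S1","S2","S3","S4"].
    # n_fld: number of FLD copies that returned to the MC.
    # Receiving n copies means the (n+1)-th hop did not answer, so the fault is
    # the link into that hop and/or the hop itself (Section II-C.1).
    suspect = cycle[n_fld]  # (n+1)-th hop, 0-indexed
    upstream = cycle[n_fld - 1] if n_fld > 0 else "MC"
    return (upstream, suspect)

def failure_type(cycle_switches, replies_mc, replies_sc):
    # Barrier replies collected by the MC and the SC. With disjoint control
    # paths, a single *link* failure cuts at most one controller off from any
    # switch, so the union of replies covers every switch; a switch silent to
    # both controllers indicates a *node* failure (Section II-C.2).
    silent = set(cycle_switches) - (set(replies_mc) | set(replies_sc))
    return ("node", silent.pop()) if silent else ("link", None)
```

For the example of Figure 4(a), three returned FLD packets on the cycle S1 → S2 → S3 → S4 point at link (S3, S4) and/or node S4, which is exactly what `locate_failure` reports.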

(b) Failure type detection

Fig. 4. The FLI round

III. FAILURE RECOVERY PROCESSES AND SYSTEM TIMING DIAGRAMS

In our design, the MC is responsible for controlling the whole network in the normal state. The remaining controllers are standby controllers, which take over control of the set of switches that are no longer connected to the MC after a failure happens. In the OpenFlow protocol, a switch can be connected to a number of controllers, and the role of a controller for a switch is one of three types: Master, Slave, and Equal. A Master or an Equal controller has full control over its switches, i.e., it can send any OpenFlow command to them, whereas a Slave controller can only listen to the responses from its switches. In multi-controller networks, a switch can have at most one Master controller, and the other controllers must be Slave or Equal controllers. More specifically, the controller role configurations in a multi-controller architecture can be divided into Master-Slave, Master-Slave-Equal, and Equal-Equal.

In our network, all three architectures are applicable. Since a Slave controller can only listen but is not allowed to send commands to a switch, if the standby controller is a Slave, it must issue a role request message to the switch to change its role from Slave to Master (or Equal) in order to gain full control. If every controller is an Equal controller, the standby controller can take over control of a switch at any time. Therefore, for a network operating in the Equal-Equal architecture, the recovery time is slightly shorter than with the Master-Slave architecture. In the sequel, we assume all controllers operate in Equal-Equal mode, so each controller has full control over each switch. To simplify the presentation of our design, only one MC and one SC are considered, and both are Equal controllers for each switch. The idea can be extended to a network with more than two controllers.

The failure location is successfully identified after the FLI round described in the previous section. New flow entries are then configured on the corresponding switches to reroute the flows affected by the failure. Since the KBP algorithm provides as-diverse-as-possible routing for the in-band control paths, if the degree of the network is large enough, all of the control paths from the controllers to a switch are mutually disjoint. This feature makes failure recovery quite simple, because at least one controller can access the switches under any single device failure. Figure 5 shows the diagrams for failure recovery between two controllers with mutually disjoint control paths to a switch. If the network degree is not large enough, the control paths might not be fully disjoint, and both controllers might lose control of a switch. In that case, we can apply the scheme proposed in our previous work [15] to recover the control channel first, followed by the recovery of the data channel, at the cost of additional recovery time.

Fig. 5. Processing diagrams of failure recovery between two controllers

Figure 6 shows the timing diagram for the network operation in the normal and failure states. In the normal case shown in Figure 6(a), the MC periodically sends an FM packet in every FM round; the cycle time of an FM round is 20 msec. Figure 6(b) depicts the operation when the failure interrupts only the data channel; in other words, the MC can still access the switch in the failure state. If the FM packet does not return to the MC within 10 msec, the system enters the FLI round by triggering the MC and the SC to detect the failure type and failure location. To reduce the recovery time, both detections are performed at the same time. When a barrier reply from a particular switch is received by the SC, it notifies the MC to indicate the response of that switch. Combining the received FLD packets and barrier reply messages, the MC can determine the location of the failure and reconfigure the flow entries in the corresponding switches for failure recovery. Figure 6(c) shows the case in which the failure breaks the control path connecting to the main controller. The only difference between Figure 6(b) and Figure 6(c) is that in the latter case, the MC is not able to access the corresponding switches; thus, the MC asks the SC to modify the flow entries for rerouting the affected traffic.

Fig. 6. Timing diagram for the network operation: (a) normal case, (b) control channels are not damaged by a failure, (c) control channels are damaged by a failure

IV. EXPERIMENTAL RESULTS

We have performed experiments to evaluate the performance of the proposed fast failure recovery scheme. The experimental network includes five HP5900 switches and two host PCs. The controllers are two PCs running ONOS. The CPU of the MC is an i5-4460 running at 3.20 GHz, and the CPU of the SC is an i3-3240 running at 3.40 GHz. Figure 7 depicts the in-band control paths used in the network. For each switch, there are two disjoint paths connecting to the MC and the SC. Toward the MC, the control paths for switches S1, S2, S3, S4, and S5 are S1 → MC, S2 → S1 → MC, S3 → S1 → MC, S4 → S3 → S1 → MC, and S5 → S1 → MC, respectively. On the other hand, the control paths toward the SC are S1 → S2 → SC, S2 → SC, S3 → S2 → SC, S4 → S5 → S2 → SC, and S5 → S2 → SC. All of the monitoring cycles used in the network, S1 → S3 → S2 → S1, S1 → S5 → S2 → S1, S1 → S3 → S5 → S2 → S1, and S1 → S3 → S4 → S5 → S2 → S1, are shown in Figure 8.

Fig. 7. Control path provisioning for each switch

Fig. 8. Monitoring cycle setup

A. Link Failure Causing Only Data Path Disconnection

We first performed experiments to examine the case when the failure damages only data paths. Since the MC is still able to control all of the switches, it can reconfigure the flow entries for rerouting the affected traffic. An FM packet is sent out on each cycle every 20 msec, and the waiting time window Δ is set to 10 msec. Since FM packets have the highest priority, this value is long enough to guarantee that an FM packet can travel through the whole cycle. If the FM packet does not return to the MC within 10 msec, the system enters the FLI round by sending barrier requests and FLD packets to find the failure location.

Figure 9 demonstrates a link failure that breaks only the data path. The failure of link (S2, S5) does not cause the MC to lose control of any switch; it only interrupts the data path. Since all of the switches are under the control of the MC, the MC performs failure recovery by changing the data path from H1 → S3 → S1 → S2 → S5 → H2 to H1 → S3 → S4 → S5 → H2. This set of experiments was performed 30 times. The recovery time is defined as the duration from the time the failure occurs to the time the last flow entry modification command is sent from the controller to the switches. The average, minimum, and maximum recovery times are 34.2 msec, 26.7 msec, and 45.6 msec, respectively.

Fig. 9. Failure on a link carrying only a data channel

B. Link Failure Causing Both Data Path and Control Path Disconnections

We further examine the case shown in Figure 10, in which the damaged link breaks both data and control paths. In this experiment, the original data paths are the same as in the previous experiment, but the failed link is changed to (S1, S3) instead. The MC is not able to set up the recovery path by itself because it loses control of S3 and S4, so the SC has to participate in the recovery process. In the proposed scheme, the control paths to the switches from the MC and the SC are node-disjoint, so the SC remains in contact with S3 and S4 after the link failure. When the failure is detected, the MC asks the SC to modify the flow entries of S3 and S4 to reroute the traffic. Over 30 independent experiments, the average, minimum, and maximum recovery times are 33 msec, 23.7 msec, and 45.5 msec, respectively.
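The detection timing used in the experiments (an FM packet every 20 msec, waiting window Δ = 10 msec) bounds the detection latency. The back-of-the-envelope sketch below assumes, for simplicity, that a failure is noticed by the first FM packet launched at or after the moment the failure occurs (packets already in flight are ignored); the function name is illustrative.

```python
import math

FM_PERIOD_MS = 20  # an FM packet is launched on each cycle every 20 msec
DELTA_MS = 10      # waiting window for the FM packet to complete its cycle

def detection_latency_ms(failure_time_ms, fm_period=FM_PERIOD_MS, delta=DELTA_MS):
    # The first FM packet launched at or after the failure will fail to return;
    # the MC declares the failure `delta` msec after that launch, so the
    # worst-case detection latency is fm_period + delta (30 msec here).
    next_launch = math.ceil(failure_time_ms / fm_period) * fm_period
    return next_launch + delta - failure_time_ms
```

Under this simplification, a failure at an FM launch instant is detected after 10 msec, and the worst case (a failure just after a launch) approaches 30 msec, consistent with the sub-50-msec recovery times measured above once the FLI round and flow reconfiguration are added.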

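The node-disjointness claim for the Figure 7 control paths, which the SC-assisted recovery above relies on, can be checked mechanically. The sketch below encodes the paths listed in Section IV (the helper function and the dictionary encoding are illustrative assumptions):

```python
def node_disjoint(path_a, path_b):
    # Two control paths to the same switch are node-disjoint if they share no
    # intermediate node (the endpoints, i.e. the switch itself and the
    # controllers, are excluded).
    return set(path_a[1:-1]).isdisjoint(set(path_b[1:-1]))

# Control paths of Figure 7: each list starts at the switch and ends at a controller.
mc_paths = {
    "S1": ["S1", "MC"],
    "S2": ["S2", "S1", "MC"],
    "S3": ["S3", "S1", "MC"],
    "S4": ["S4", "S3", "S1", "MC"],
    "S5": ["S5", "S1", "MC"],
}
sc_paths = {
    "S1": ["S1", "S2", "SC"],
    "S2": ["S2", "SC"],
    "S3": ["S3", "S2", "SC"],
    "S4": ["S4", "S5", "S2", "SC"],
    "S5": ["S5", "S2", "SC"],
}
assert all(node_disjoint(mc_paths[s], sc_paths[s]) for s in mc_paths)
```

For example, S4 reaches the MC through {S3, S1} and the SC through {S5, S2}, so the failure of link (S1, S3) in the experiment above leaves the SC path to S4 intact.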
Fig. 10. Failure on a link carrying both data and control channels

C. Node Protection

Figure 11 depicts a case of node failure. The working path is H1 → S3 → S4 → S5 → H2 in this experiment. The failure of S4 results in the MC receiving only 2 FLD packets; it also results in no barrier reply from S4 being received by either the MC or the SC. Combining both syndromes, the MC knows the failed device is S4. Consequently, the MC sets up the new recovery path H1 → S3 → S1 → S2 → S5 → H2 for traffic rerouting. Since the operations for the node failure case are the same as those of the link failure case shown in Figure 9, they have similar failure recovery times.

Fig. 11. Failure on a node

V. CONCLUSIONS

In this paper, we propose a fast failure recovery scheme for in-band controlled multi-controller OpenFlow networks. The proposed scheme fully takes advantage of networks with multiple controllers to achieve fast failover against any single device failure. We propose to use the K-best path algorithm on a transformed graph for planning the control paths. Through the collaboration of multiple controllers, the failure type and failure location can be identified within a short time. The experimental results show that the average time to recover from any single device failure is less than 50 msec, even if the failure disconnects both working control channels and data channels at the same time.

ACKNOWLEDGEMENT

This work was partially supported by the Ministry of Science and Technology, Taiwan under Grant 106-2221-E-194-019. It was also supported in part by the Advanced Institute of Manufacturing with High-tech Innovations (AIM-HI) from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.

REFERENCES

[1] OpenFlow Switch Specification 1.3.2: https://www.opennetworking.org/images/stories/downloads/sdn-resources/onf-specifications/openflow/openflow-spec-v1.3.2.pdf
[2] ONOS, https://onosproject.org/
[3] OpenDaylight, https://www.opendaylight.org/
[4] Guang-Hong Yang and Si-Ying Zhang, "Design of reliable control systems by using multiple similar controllers," in Proc. IEEE Conference on Decision and Control, vol. 1, pp. 969-972, 1995.
[5] Ko-Chih Fang, Kuochen Wang, and Jian-Hong Wang, "A fast and load-aware controller failover mechanism for software-defined networks," in Proc. International Symposium on Communication Systems, Networks and Digital Signal Processing (CSNDSP), 2016.
[6] Yaning Zhou, Ying Wang, Jinke Yu, Junhua Ba, and Shilei Zhang, "Load balancing for multiple controllers in SDN based on switches group," in Proc. Asia-Pacific Network Operations and Management Symposium (APNOMS), 2017.
[7] Dharmendra Chourishi, Ali Miri, Mihailo Milić, and Salam Ismaeel, "Role-based multiple controllers for load balancing and security in SDN," in Proc. IEEE Canada International Humanitarian Technology Conference (IHTC), 2015.
[8] Mitsunobu Homma and Norihiko Shinomiya, "Cycle-based traffic load balancing method for clustered network with multiple SDN controllers," in Proc. TENCON, 2017.
[9] Sachin Sharma, Dimitri Staessens, Didier Colle, Mario Pickavet, and Piet Demeester, "Enabling fast failure recovery in OpenFlow networks," in Proc. International Workshop on the Design of Reliable Communication Networks (DRCN), 2011.
[10] Dimitri Staessens, Sachin Sharma, Didier Colle, Mario Pickavet, and Piet Demeester, "Software defined networking: Meeting carrier grade requirements," in Proc. 18th IEEE Workshop on Local & Metropolitan Area Networks (LANMAN), 2011.
[11] James Kempf, Elisa Bellagamba, András Kern, Dávid Jocha, Attila Takacs, and Pontus Sköldström, "Scalable fault management for OpenFlow," in Proc. IEEE International Conference on Communications (ICC), 2012.
[12] Sachin Sharma, Dimitri Staessens, Didier Colle, Mario Pickavet, and Piet Demeester, "OpenFlow: Meeting carrier-grade recovery requirements," Computer Communications, vol. 36, no. 6, pp. 656-665, 2013.
[13] Andrea Sgambelluri, Alessio Giorgetti, Filippo Cugini, Francesco Paolucci, and Piero Castoldi, "OpenFlow-based segment protection in Ethernet networks," IEEE/OSA Journal of Optical Communications and Networking, vol. 5, no. 9, pp. 1066-1075, 2013.
[14] A. S. M. Asadujjaman, Elisa Rojas, Mohammad Shah Alam, and Suryadipta Majumdar, "Fast control channel recovery for resilient in-band OpenFlow networks," in Proc. 4th IEEE International Conference on Network Softwarization (NetSoft), 2018.
[15] Steven S. W. Lee, Kuang-Yi Li, Kwan-Yee Chan, Guan-Hao Lai, and Yao-Chuan Chung, "Software-based fast failure recovery for resilient OpenFlow networks," in Proc. IEEE RNDM, 2015.
[16] S. W. Lee and C. S. Wu, "A K-best paths algorithm for highly reliable communication networks," IEICE Transactions on Communications, vol. E82-B, no. 4, pp. 586-590, 1999.
[17] Steven S. W. Lee, Kuang-Yi Li, Kwan-Yee Chan, Guan-Hao Lai, and Yao-Chuan Chung, "Path layout planning and software based fast failure detection in survivable OpenFlow networks," in Proc. IEEE DRCN, 2014.
[18] Dino Ahr and Gerhard Reinelt, "A tabu search algorithm for the min-max k-Chinese postman problem," Computers & Operations Research, vol. 33, no. 12, pp. 3403-3422, 2006.
