
RL-Routing: An SDN Routing Algorithm Based on Deep Reinforcement Learning

Yi-Ren Chen, Amir Rezapour, Wen-Guey Tzeng, and Shi-Chun Tsai, Senior Member, IEEE

Abstract—Communication networks are difficult to model and predict because they have become very sophisticated and dynamic. We develop a reinforcement learning routing algorithm (RL-Routing) to solve a traffic engineering (TE) problem of SDN in terms of throughput and delay. RL-Routing solves the TE problem via experience, instead of building an accurate mathematical model. We consider comprehensive network information for state representation and use one-to-many network configuration for routing choices. Our reward function, which uses network throughput and delay, is adjustable for optimizing either upward or downward network throughput. After appropriate training, the agent learns a policy that predicts future behavior of the underlying network and suggests better routing paths between switches. The simulation results show that RL-Routing obtains higher rewards and enables a host to transfer a large file faster than the Open Shortest Path First (OSPF) and Least Loaded (LL) routing algorithms on various network topologies. For example, on the NSFNet topology, the sum of rewards obtained by RL-Routing is 119.30, whereas those of OSPF and LL are 106.59 and 74.76, respectively. The average transmission time for a 40 GB file using RL-Routing is 25.2 s. Those of OSPF and LL are 63 s and 53.4 s, respectively.

Index Terms—Cognitive SDN, deep reinforcement learning, routing algorithm, software defined networks.

Manuscript received May 18, 2020; revised July 8, 2020 and August 4, 2020; accepted August 15, 2020. Date of publication August 19, 2020; date of current version December 30, 2020. This work was supported in part by the Ministry of Science and Technology (MOST) under Grants 108-2221-E-009-051-MY3 and 107-2221-E-009-024-MY2, in part by the Ministry of Economic Affairs (MOEA) under Grant 106-EC-17-A-24-0619, and in part by the Ministry of Education through the SPROUT Project Center for Open Intelligent Connectivity of National Chiao Tung University, Taiwan, R.O.C. Recommended for acceptance by Dr. Huimin Lu. (Corresponding author: Amir Rezapour.)

The authors are with the Department of Computer Science, National Chiao Tung University, 30010, Taiwan (e-mail: yiren@cs.nctu.edu.tw; rezapour@cs.nctu.edu.tw; wgtzeng@cs.nctu.edu.tw; sctsai@cs.nctu.edu.tw).

Digital Object Identifier 10.1109/TNSE.2020.3017751

I. INTRODUCTION

SOFTWARE Defined Network (SDN) [1] is an emerging network technology that separates the control plane of a networking device (e.g., a switch/router) from its data plane. The SDN controller has centralized control of the entire network and acts as a network operating system. The controller enables intelligent deployment for the following reasons. First, since the controller has a global network view, it can collect comprehensive network information for routing algorithms. Second, the controller can quickly configure and manage network resources by installing forwarding rules in the switches under its control [2].

Although SDN brings new opportunities for data collection and resource management, it is still challenging to efficiently solve a traffic engineering (TE) problem of SDN in terms of throughput and delay. This is because, with the rapid development of devices, such as smartphones and IoT, and network technologies, such as cloud computing, data traffic grows exponentially. As a result, managing the traffic of a large number of devices and network resources becomes more challenging than ever before. Traditional approaches [3], [4], which are mostly model-based, assume that network traffic and user demands can be well modeled. However, communication networks are difficult to model and predict because they have become very sophisticated and dynamic. Hence, deploying more intelligent agents in networks is necessary for optimizing network resources.

A traffic engineering problem is to find paths that efficiently forward data traffic from a source switch to all reachable destination switches. The goal is to maximize the source switch's throughput and minimize communication delay.1 Simple and widely used methods are the Open Shortest Path First (OSPF) [5] and Least Loaded (LL) [6] routing algorithms. However, they are both greedy and make commitments based on the current state of the network. Such a greedy approach fails to foresee network changes in the near future. Hence, these algorithms are unable to find optimal routing paths in networks with dynamic traffic distributions.

In this paper, we develop a reinforcement learning routing algorithm (RL-Routing) to solve a TE problem of SDN in terms of throughput and delay. RL-Routing solves the TE problem via experience, instead of building an accurate mathematical model for the underlying network. We consider comprehensive network information for state representation and use one-to-many network configuration for routing choices. Our reward function, which uses the network throughput and delay, is adjustable for optimizing either upward or downward network throughput. After appropriate training, the agent learns a policy that predicts the future behavior of the underlying network and suggests better routing paths between source and destination switches in the near future.

1 We will focus on designing a solution for forwarding data traffic of a source switch (a.k.a. designated switch) with the knowledge that the other switches will follow a similar approach.

Our solution also addresses the scalability problem in terms of the number of agents needed for data forwarding. We only need to deploy one agent per switch. Previous solutions need to deploy a large number of agents. For example, in [7], they need to run a dedicated agent for each source-destination switch pair. In [8], [9], the number of agents is even proportional to the number of flows in the network. Notice that there are other scalability issues, such as controller-switch communication delay, the explosion of flow entries in scenarios with a large number of traffic flows, etc. [10]. However, we do not consider them in this paper.

The main contributions of the paper are summarized in the following:

1) We introduce new features for state representation, such as link trust level, switch throughput rate, and link-to-switch rate. These features capture more comprehensive information on the underlying network and make future prediction possible. For example, the link trust level helps the agent find reliable paths between a source-destination switch pair. We say that a link is not reliable if packets in the link are often lost due to transmission errors, collisions, etc. The other two features guide the agent to avoid heavily loaded switches and links. The results show that our state features lead the agent to avoid congested links.

2) We address one of the scalability problems of the controller in terms of the number of agents needed for data forwarding. We only need to deploy one agent per switch. We achieve this by introducing a one-to-many network configuration for routing choices. That is, instead of connecting a single source switch to a single destination switch, we connect the source switch to all of the destination switches, and vice-versa.

3) Our reward function is adjustable for optimizing either upward or downward network throughput. Hence, depending on the duty of the hosts connected to the source switch, we can adjust upward and downward traffic efficiencies via the related parameters accordingly.

4) We test RL-Routing on the Fat-tree, NSFNet, and ARPANet network topologies. The simulation results show that RL-Routing obtains higher rewards and enables a host to transfer a large file faster than OSPF and LL on various network topologies. For example, on the NSFNet topology, the sum of rewards obtained by RL-Routing is 119.30, whereas those of OSPF and LL are 106.59 and 74.76, respectively. The average transmission time for a 40 GB file using RL-Routing is 25.2 s. Those of OSPF and LL are 63 s and 53.4 s, respectively. Overall, RL-Routing provides 2.5× and 2.11× speedups over OSPF and LL on average, respectively.

The rest of the paper is organized as follows. Section II presents some related work. Section III presents reinforcement learning background and the problem definition. In Section IV, we describe the components of the SDN architecture and the details of RL-Routing. We also discuss how to represent states, actions, and the reward function. Section V presents the performance evaluation. Section VI presents discussions and challenges associated with RL-Routing. Finally, Section VII presents the conclusions.

II. RELATED WORK

Routing optimization is a well-studied topic. There exists a wide range of solutions based on analytical optimization [11] and machine learning [12], [13]. There have been several routing optimization methods for SDN [14]-[19]. In this section, we review the relevant works on RL-based routing algorithms in the context of SDN. Unlike traditional routing algorithms that are model-based [3], [4], [11], we focus on RL-based methods because they are model-free. Model-based methods assume that traffic flows and user demands follow some distributions. In contrast, RL-based methods make no prior assumptions on the dynamics of the underlying network. Besides, unlike machine learning based approaches [12], [13] that require a labeled dataset for training, RL-based methods learn a policy via direct experience with the underlying network.

SDN allows flow-level and packet-level forwarding [20]. Furthermore, the traffic forwarding strategy can be classified into traffic splitting and destination-based forwarding. In the traffic splitting strategy, the traffic from a source to a destination switch is split via multiple paths by applying a hash function over a set of packets' header fields. Given the network state, the goal is to find the best splitting ratios, which are the portions of the traffic forwarded on each path. In the destination-based forwarding strategy, the traffic is forwarded on a single path. The goal here is to change paths between different network states to reach better network performance.

A. Flow-level Forwarding

Valadarsky et al. [21] used the reinforcement learning technique to solve a TE problem by the traffic splitting strategy. They represented the state as a demand matrix, where an entry is the traffic demand between a source-destination switch pair. The action space is a set of splitting ratios. The reward is the link utilization rate (see Section III-A for the formal definition of state, action, and reward). Stampa et al. [20] proposed a deep RL algorithm using the same state and action spaces. However, the reward is based on the mean of end-to-end delays. Xu et al. [8] proposed a similar approach. They defined a TE problem in a network with K flows. The state is defined as the throughput and delay of the K flows. The action space is a set of splitting ratios for each flow. The reward is a function of the flow throughput and delay. Since the input size of the neural network is a function of K, their construction cannot scale up in practice. Nevertheless, it is not clear how [21] and [8] split a flow through multiple paths in SDN. This is because all packets in a flow share the same header fields.

For the destination-based forwarding strategy, Francois et al. [9] proposed a cognitive routing engine for finding efficient paths for the current state of the network. For a given flow, every switch creates a dedicated recurrent neural network (RNN) to select the next hop for forwarding the flow's traffic. After constructing a path, they use the throughput, delay, and quality (based on packet loss and frame errors) of the links along the chosen path for the reward. Yet, the scalability issue remains unsolved. As the number of flows increases, the overhead of the RNNs increases as well.

Lin et al. [22] applied the SARSA algorithm to maximize QoS in a hierarchical SDN. For each incoming flow, the switch contacts the controller. The controller implicitly recognizes the QoS requirements of the flow and computes the optimum path based on those requirements. The next action is chosen on a next-hop basis, starting from the source to the destination switch. QoS requirements consist of some metrics, such as delay, loss, throughput, etc. The reward function is a weighted sum of the metrics. In [22], the network administrator needs to define various factors for each traffic type. For example, a real-time multimedia flow requires a routing algorithm that can provide a lower delay, higher throughput, and lower jittering likelihood.

Naeem et al. [23] used reinforcement learning to optimize TCP flow throughput. The state space is a set of N TCP flows. Each TCP flow is represented by some metrics, such as the congestion window size, the mean deviation of the Round Trip Time (RTT), etc. An action is a set of congestion window sizes for the N TCP flows. A congestion window size limits the amount of data that the TCP flow can send into the network before receiving an ACK. The reward is defined as the average of the flows' throughput.

B. Packet-level Forwarding

Boyan et al. [24] proposed a Q-learning algorithm with the traffic splitting strategy. In their construction, each switch chooses the next hop for forwarding packets. Then, the switch updates its policy using its Q-function and local information. However, packet-level traffic splitting is undesirable because it results in excessive packet reordering in the destination host and a higher jittering likelihood [25].

Yao et al. [7] proposed an intelligent network architecture using deep reinforcement learning with the destination-based forwarding strategy. In their construction, the state is represented by link delay and switch processing delay. The action space is a set of paths that connect a source switch to a destination switch. The reward is the delay from the source to the destination switch. Nevertheless, their solution has a scalability problem. Due to their action description, they need to run a dedicated agent for each source-destination switch pair.

We present RL-Routing, which aims to solve a TE problem at the packet level with the destination-based forwarding strategy.

III. BACKGROUND KNOWLEDGE AND PROBLEM DEFINITION

A. Reinforcement Learning Background

We consider a standard reinforcement learning setting in which an agent interacts with an environment E over discrete time intervals Δt, t ≥ 0. At each time interval Δt, the agent observes the current state, S_t ∈ S, and chooses an action, A_t ∈ A(S_t), where A(S_t) is the set of all actions available in state S_t. In return, the agent receives a reward, R_{t+1} = R(S_t, A_t), and enters the next state S_{t+1}. Let an interaction tuple <S_t, A_t, S_{t+1}, R_{t+1}> denote that the agent was in state S_t, performed action A_t, and ended up in state S_{t+1} with reward R_{t+1}. The process continues until the agent encounters a terminal state. The agent's goal is to learn a policy π : S → A that maximizes the expected future reward R = Σ_{t=0}^{∞} γ^t R_{t+1}, where γ ∈ [0, 1] is the discounting factor. Since we use paths for actions, finding the best policy is equivalent to finding the most suitable paths in different network states.

The primary step in a reinforcement learning problem is to define the states, actions, and a scalar reward for the final goal. We describe how to define such requirements for RL-Routing in Section IV-C1.

B. Problem Definition

We consider a network as a directed graph G(V, E). V = {sw_1, sw_2, ..., sw_n} is the set of switches and E ⊆ V × V is the set of links in the network, where |E| = m. We assume that network links are bidirectional (full duplex), i.e., e_{i,j} and e_{j,i} are the upward and downward links connected to sw_i, respectively. The neighborhood of the switch sw_i is N(sw_i) = {sw_j ∈ V | e_{i,j} ∈ E}. E(sw_i) = {e_{i,k} | e_{i,k} ∈ E} is the set of edges adjacent to sw_i. Ê(sw_i) = {e_{r,i} | e_{r,i} ∈ E} is the set of edges that connect the switches in N(sw_i) to sw_i. sw_src is the designated source switch. Let D_src ⊆ V − {sw_src} be the set of all destination switches reachable from sw_src. A path p_{src,des} is a walk in the graph G(V, E) that connects sw_src to sw_des via a sequence of switches <sw_src → sw_i → sw_j → sw_k → ... → sw_des>, where two consecutive switches in the sequence form an edge in E and each switch is visited at most once.

bw_t(e_{i,j}) is the bandwidth of the link e_{i,j} that connects sw_i to sw_j at time interval Δt. delay_t(e_{i,j}) and error_t(e_{i,j}) denote the link delay and the indicator for errors that occurred in e_{i,j} over time interval Δt, respectively. The path bandwidth bw_t(p_{src,des}) = min_{e_{i,j} ∈ p_{src,des}} bw_t(e_{i,j}) is the minimum bandwidth of its links at time interval Δt. The path delay delay_t(p_{src,des}) = Σ_{e_{i,j} ∈ p_{src,des}} delay_t(e_{i,j}) is the sum of the link delays in the path at time interval Δt. Table I summarizes the important notations used in the paper.

TABLE I
NOTATION DEFINITION

Formally, the TE problem is defined as follows. Given G(V, E), sw_src, and D_src, find a set of paths for forwarding sw_src's data to the switches in D_src. The goal is to maximize sw_src's throughput and minimize communication delay.
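To make these definitions concrete, the following Python sketch computes the path bandwidth (minimum link bandwidth along the path) and the path delay (sum of link delays) exactly as defined above. The dictionaries bw and delay are hypothetical stand-ins for the per-link measurements collected by the controller.

# Sketch of the path metrics from Section III-B. The dictionaries bw and delay
# are illustrative per-link measurements for the current time interval.

def path_bandwidth(path, bw):
    # bw_t(p) = minimum over the links (i, j) along the path
    return min(bw[(i, j)] for i, j in zip(path, path[1:]))

def path_delay(path, delay):
    # delay_t(p) = sum of the link delays along the path
    return sum(delay[(i, j)] for i, j in zip(path, path[1:]))

# Example: a path sw_1 -> sw_2 -> sw_4 described by switch ids.
bw = {(1, 2): 10e9, (2, 4): 4e9}
delay = {(1, 2): 0.002, (2, 4): 0.005}
p = [1, 2, 4]
print(path_bandwidth(p, bw))   # 4000000000.0, the bottleneck link
print(path_delay(p, delay))    # 0.007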

The TE problem shares some similarities with the vehicle routing problem. Consider a driver who commutes to work every day. The goal is to find the best route to minimize commuting time. During his driving, he observes some traffic conditions, such as the route condition (highway or narrow street), the traffic load (high, medium, low), etc., for each potential route. The driver tries many possible routes for a period of time and obtains the traveling time. Then, he can determine a fast route based on the traffic conditions encountered during driving. In this scenario, the driver's experience is the key to finding such a route practically.

IV. SYSTEM DESIGN

A. Overall Architecture

Figure 1 shows where RL-Routing is located in a typical SDN architecture and how it interacts with other components. The controller connects to the switches through a secured OpenFlow [26] (OF) channel for message exchange. Applications are pieces of software running on top of the controller for providing specific network functions. In our architecture, there are two applications:

- OpenFlow Network Discovery application: it discovers the data plane topology of the network using the Link Layer Discovery Protocol (LLDP) [27].

- RL-Routing application: it has two main modules:

  - Network Monitoring Module (NMM): it uses passive and active network measurements to get the necessary network information. Network information refers to the status of the network devices, including link delay, link throughput, port speed, etc. We use the information for representing states and computing rewards. Most of the network information is retrieved passively by the controller. However, the OF protocol [26] does not specify how a switch measures and stores link delay by itself. Our method of computing link delays is in Section IV-B.

  - Action Translator Module (ATM): it translates the action chosen by the agent into an appropriate set of OpenFlow messages for updating the flow tables of the switches. Upon a request to configure a new path, it sends the OF messages starting with the last switch of the path to the first switch. This ensures that the switches in the new path do not send Packet-In messages to the controller. Finally, the old rules in the switches of the old path are deleted. In order to maintain the chosen paths in the network until the agent decides to change them, we set hard_timeout = 0 (inactive) in every rule. This ensures that sw_src is always connected to the destination switches in D_src, and vice-versa.

Fig. 1. An SDN architecture with its components. Solid and dashed lines denote data plane and control plane links, respectively. In this example, RL-Routing finds a solution for forwarding sw_src's data traffic to the destination switches D_src = {sw_2, sw_4}.

B. Measure Link Delay

NMM uses the following steps to measure the link delay delay_t(e_{i,j}) between switches sw_i and sw_j at time interval Δt.

1) The controller sends a command to every switch under its control. The command installs a trap flow entry on every switch to redirect LLDP packets to the controller. In addition, the controller installs a dummy flow entry on every switch. The expiration time of the dummy flow entry is set to hard_time = |Δt|.

2) When the dummy flow entry in sw_i expires, the controller receives an OpenFlow Flow-Removed message from sw_i. It initializes a timer denoted by start_{i,j} ← time.now(). Then, for every port of sw_i, the controller packs an LLDP packet and sends it to sw_i using an OpenFlow Packet-Out message. Meanwhile, in order to periodically measure link delays, the controller re-installs the dummy flow entry on the switch sw_i.

3) For every OpenFlow Packet-Out message that sw_i receives from the controller, sw_i unpacks the LLDP packet and forwards it to the port specified in the OpenFlow Packet-Out message.

4) Once sw_j receives the LLDP packet from its adjacent switch sw_i, it packs it into an OpenFlow Packet-In message and sends it back to the controller as specified in the trap flow entry.

5) Upon receiving the OpenFlow Packet-In message, the controller unpacks the LLDP packet from the message and computes the link delay as follows.

delay_t(e_{i,j}) = (time.now() − start_{i,j}) / 3   if the LLDP packet arrives,
delay_t(e_{i,j}) = 100                              otherwise.    (1)

If the LLDP packet does not arrive at the controller, we assume that the link is congested and set its delay to a predefined high value. Notice that we assume equivalent transmission times in steps (2) and (4). Hence, we divide the measured time by 3 to roughly estimate the link delay.

NMM periodically estimates link delays as they may change during the network operation. The estimated link delay also includes a negligible delay in the controller and switches during the executions. We ignore it because such overhead occurs for every link's delay.

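The bookkeeping behind Eq. (1) can be sketched as follows. This is a minimal, framework-agnostic Python sketch, not the authors' implementation: on_probe_sent and on_probe_received stand in for the OpenFlow Flow-Removed/Packet-Out/Packet-In handlers described in steps 2) to 5).

import time

PENALTY_DELAY = 100.0  # predefined high value used when the LLDP probe is lost

class DelayEstimator:
    # Estimates delay_t(e_ij) per Eq. (1): one third of the controller-measured
    # round trip of an LLDP probe, or a penalty value if the probe never returns.

    def __init__(self):
        self.start = {}   # (i, j) -> timestamp recorded when the probe was sent
        self.delay = {}   # (i, j) -> latest delay estimate

    def on_probe_sent(self, i, j):
        # Step 2): the dummy flow entry expired, so a probe is sent on link (i, j).
        self.start[(i, j)] = time.time()

    def on_probe_received(self, i, j):
        # Step 5): the trap flow entry returned the LLDP packet to the controller.
        elapsed = time.time() - self.start[(i, j)]
        self.delay[(i, j)] = elapsed / 3.0  # controller->sw_i, sw_i->sw_j, sw_j->controller

    def on_interval_end(self, links):
        # Links whose probe never came back are treated as congested.
        for link in links:
            if link not in self.delay:
                self.delay[link] = PENALTY_DELAY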

C. Proposed RL-Routing

In this section, we first provide the description of RL-Routing. Then, we present a Q-learning algorithm for solving the TE problem.

1) Description of RL-Routing: In the following, we describe the details of RL-Routing.

Definition 1: The model for RL-Routing is represented by M = <S, A, R, T, γ>, where:

- S ⊆ R^z is the state space. A state is a feature vector of size z.
- A is the action space.
- R : S × A → R is a reward function.
- T is the transition probability map.
- γ ∈ [0, 1] is a discount rate.

The details are in the following.

1) State

The state space is

S ⊆ R^{6m+2n+11}.    (2)

A state s at time interval Δt is represented by

s = [f_1, f_2, ..., f_10].    (3)

It summarizes the network information at time interval Δt. The features are computed as follows.

- f_1 = {c_{i,j} : ∀e_{i,j} ∈ E} is the link capacity rate set (tuple) at time interval Δt, where the capacity rate for a link e_{i,j} is

  c_{i,j} = bw_t(e_{i,j}) / max_bw_t.    (4)

  max_bw_t = max_{e_{i,j} ∈ E} bw_t(e_{i,j}) is the maximum bandwidth of the links at time interval Δt.

- f_2 = {x_{i,j} : ∀e_{i,j} ∈ E} is the link throughput rate set (tuple) (similar to standard link utilization, see Section V-B) at time interval Δt, where the throughput rate for a link e_{i,j} is

  x_{i,j} = tx_t(e_{i,j}) / (bw_t(e_{i,j}) · |Δt|).    (5)

  tx_t(e_{i,j}) is the amount of data that sw_i transmitted to sw_j via e_{i,j} at time interval Δt. |Δt| is the duration of the time interval Δt.

- f_3 = {delay_t(e_{i,j}) : ∀e_{i,j} ∈ E} is the link delay set (tuple) at time interval Δt.

- f_4 = {status_{i,j} : ∀e_{i,j} ∈ E} is the link status set (tuple) at time interval Δt, where the status for a link e_{i,j} is

  status_{i,j} = 1 if e_{i,j} is up during Δt; 0 otherwise.    (6)

- f_5 = {T_{i,j} : ∀e_{i,j} ∈ E} is the link trust level set (tuple) at time interval Δt, where T_{i,j} is the trust level of edge e_{i,j}. It is similar to the packet loss probability. However, it evaluates a link's reliability, not the probability of individual packet loss. The goal here is to build a trust table and update it after every interaction. An interaction with a link e_{i,j} ∈ E occurs whenever data are transmitted via e_{i,j} during Δt. After each interaction, we obtain an interaction outcome IO^t_{i,j}, indicating the reliability of the link at Δt. We say that a link is not reliable if packets in the link are often lost due to transmission errors, collisions, or drops (misconfiguration, small buffer size, etc.). Upon receiving IO^t_{i,j}, we update T_{i,j} by incorporating this new interaction outcome. If the interaction outcome is 1, we increase the trust level of the link; otherwise, we reduce it. We use a beta trust management system [28] to update the trust levels. Initially, the trust level of each link is neutral. With every interaction at time interval Δt, we obtain an interaction outcome IO^t_{i,j} = 1 − error_t(e_{i,j}). In this event, the beta parameters of the link e_{i,j} are updated as follows.

  a^{t+1}_{i,j} = η · a^t_{i,j} + IO^t_{i,j}
  b^{t+1}_{i,j} = ζ · b^t_{i,j} + (1 − IO^t_{i,j})    (7)

  η ∈ [0, 1] and ζ ∈ [0, 1] control the rates at which good and bad interaction outcomes are discounted, respectively. After each interaction, the trust level of e_{i,j} is updated as follows.

  T_{i,j} = (a^{t+1}_{i,j} + 1) / (a^{t+1}_{i,j} + b^{t+1}_{i,j} + 2)    (8)

- f_6 and f_7 are the sets (tuples) of upward and downward switch throughput rates (similar to switch utilization) at time interval Δt, respectively. They are:

  f_6 = {x^u_{sw_i} ← AVG({x_{i,k} : ∀e_{i,k} ∈ E(sw_i)}) : ∀sw_i ∈ V}    (9)

  f_7 = {x^d_{sw_i} ← AVG({x_{r,i} : ∀e_{r,i} ∈ Ê(sw_i)}) : ∀sw_i ∈ V}    (10)

  where x^u_{sw_i} and x^d_{sw_i} are the averages of the upward and downward throughput rates of the switch sw_i, respectively. AVG(.) computes the average value of all entries within the set.

- f_8 = {cx_{i,j} : ∀e_{i,j} ∈ E} is the link-to-switch rate set (tuple) at time interval Δt, where the contribution rate of a link e_{i,j} to the switch sw_i is

  cx_{i,j} = x_{i,j} / Σ_{e_{i,k} ∈ E(sw_i)} x_{i,k}.    (11)

  It rates the percentage of a link to the switch load. This feature helps the agent avoid those links that potentially overload the switch.

- f_9 is a 7-dimensional indicator vector for the day of a week. For each day, the corresponding index is set to 1 and the rest of the entries are set to zero.

- f_10 is a 4-dimensional indicator vector for the part of a day. We partition a day into four non-overlapping time intervals as [6am, 12pm), [12pm, 6pm), [6pm, 12am), and [12am, 6am). Based on the time, the corresponding entry is set to 1 and the rest of the entries are set to zero.

Similar to the vehicle driving problem, these features enable the agent to sense the capacity, speed, status, traffic load, and reliability of each link. For example, features f_6 and f_7 show the loads of intersections. Feature f_8 rates the traffic over the roads of an intersection. Features f_9 and f_10 reflect that traffic conditions usually vary from day to day and from time to time. All of these features can be computed in an on-line fashion. Therefore, RL-Routing does not have to maintain the history of network information.

Notice that some features, such as f_6, f_7, and f_8, are correlated with the link throughput rate. However, the extra features provide more direct information to the agent and accelerate the training phase. A sketch of how these features can be computed is given below.
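The following Python sketch illustrates, under simplified assumptions, how the per-link throughput rate and the beta trust level of Eqs. (5), (7), and (8) can be computed from raw counters. The inputs tx_bytes, bw, and error_indicator are hypothetical per-link measurements gathered by NMM, and eta/zeta correspond to η and ζ.

# Sketch: link throughput rate (Eq. 5) and beta trust level (Eqs. 7-8)
# for one time interval. Inputs are illustrative placeholders.

def link_throughput_rate(tx_bytes, bw, interval):
    # x_ij = tx_t(e_ij) / (bw_t(e_ij) * |Dt|), with traffic expressed in bits
    return (tx_bytes * 8) / (bw * interval)

class BetaTrust:
    def __init__(self, eta=0.9, zeta=0.9):
        self.eta, self.zeta = eta, zeta
        self.a, self.b = 0.0, 0.0          # beta parameters a_ij, b_ij

    def update(self, error_indicator):
        io = 1 - error_indicator           # IO^t_ij = 1 - error_t(e_ij)
        self.a = self.eta * self.a + io          # Eq. (7)
        self.b = self.zeta * self.b + (1 - io)
        return (self.a + 1) / (self.a + self.b + 2)  # Eq. (8); 0.5 before any interaction

# Example for one link over an interval of |Dt| = 1 s on a 10 Gb/s port:
x = link_throughput_rate(tx_bytes=5e8, bw=10e9, interval=1.0)  # 0.4
trust = BetaTrust().update(error_indicator=0)                  # 2/3 after one good interaction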

2) Action

The action space is

A = {a_1, a_2, ..., a_h},    (12)

where an action, 1 ≤ i ≤ h,

a_i = {p_{src,d} | sw_d ∈ D_src} ∪ {p_{d,src} | sw_d ∈ D_src}    (13)

is a set of paths that connect sw_src to its destination switches in D_src and the destination switches to sw_src.

Notice that our action definition addresses the scalability problem of the controller in terms of the number of agents needed for data forwarding. It enables the agent to perform a one-to-many network configuration at once. Hence, we only need to deploy one agent per switch.

Path Discovery Algorithm (PDA): for input G(V, E), sw_src, D_src, and h, it outputs the action space A. We use Yen's algorithm [29] to find the k shortest loopless paths, for some k, between each source-destination switch pair. We sort the paths between two switches sw_src and sw_d ∈ D_src in ascending order by their lengths as p^1_{src,d}, p^2_{src,d}, ..., p^k_{src,d}. We let, for 1 ≤ i ≤ h,

a_i = {p^i_{src,d} | sw_d ∈ D_src} ∪ {p^i_{d,src} | sw_d ∈ D_src},    (14)

where p^i_{d,src} is the sequence of p^i_{src,d}, but with inverse direction.

This approach reduces the search space and computation time by pre-computing the action space and only searching for solutions within the action space. Note that when h is large enough, the action space consists of all paths. However, computation and training costs increase as h increases.

Searching for solutions only within the action space also avoids very long paths. In practice, network administrators might have their own concerns. For example, the management unit's network traffic should not pass through a switch in the engineering unit. Hence, they can revise the action space accordingly.
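A possible realization of PDA is sketched below in Python. It leans on networkx, whose shortest_simple_paths generator enumerates loopless paths in order of increasing length (a Yen-style procedure); this is an illustrative substitute for the authors' implementation, and the inputs G, sw_src, and D_src are assumed to be given.

from itertools import islice
import networkx as nx

def pda(G, sw_src, D_src, h):
    # Builds the action space A = {a_1, ..., a_h} of Eqs. (12)-(14).
    # k shortest loopless paths per destination, sorted by length.
    k_paths = {
        d: list(islice(nx.shortest_simple_paths(G, sw_src, d), h))
        for d in D_src
    }
    actions = []
    for i in range(h):
        a_i = []
        for d in D_src:
            paths = k_paths[d]
            p = paths[i] if i < len(paths) else paths[-1]  # fall back when fewer than h paths exist
            a_i.append(tuple(p))            # p^i_{src,d}
            a_i.append(tuple(reversed(p)))  # p^i_{d,src}: same sequence, inverse direction
        actions.append(a_i)
    return actions

# Example on a toy topology:
G = nx.Graph([(1, 2), (2, 3), (1, 4), (4, 3)])
A = pda(G, sw_src=1, D_src=[3], h=2)

In the paper's evaluation, h = 8 paths are pre-computed per source-destination pair, and on NSFNet and ARPANet the longer entries are replaced by combinations of the two shortest path sets (Section V-C).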

3) Reward

The reward function takes as input a state s and an action a and outputs the corresponding reward, indicating the quality of the chosen action a. We define the reward function as follows.

r = r_1 + r_2 ∈ [0, 2],    (15)

where r_1 and r_2 are the throughput rate and the delay of the chosen action, respectively. r_1 and r_2 are defined as follows.

- Throughput Rate: It is defined as follows.

  r_1 = φ · r^u_1 + (1 − φ) · r^d_1 ∈ [0, 1]    (16)

  where r^u_1 and r^d_1 are the upward and downward throughput rates for the chosen action a at the current time interval Δt, respectively. φ ∈ [0, 1] controls the first objective of the TE problem. When φ = 1, the agent's goal is to maximize upward throughput from sw_src to its destination switches sw_d ∈ D_src.

  r^u_1 is defined as

  r^u_1 = AVGIF({ tx_t(p_{src,d}) / (bw_t(p_{src,d}) · |Δt|) : p_{src,d} ∈ a })    (17)

  where p_{src,d} is the path in a that connects sw_src to sw_d ∈ D_src. tx_t(p_{src,d}) is the amount of data that sw_src transmitted to sw_d via p_{src,d} in Δt. Since it is possible that in some time intervals no data are transmitted to a destination switch, we compute the average with AVGIF(.), which excludes zero values. Similarly, we define r^d_1 as

  r^d_1 = AVGIF({ rx_t(p_{d,src}) / (bw_t(p_{d,src}) · |Δt|) : p_{d,src} ∈ a }),    (18)

  where p_{d,src} is the path in a that connects sw_d ∈ D_src to sw_src. rx_t(p_{d,src}) is the amount of data that sw_src received from sw_d via p_{d,src} in Δt. The more data rx_t(p_{d,src}) are received, the higher r^d_1 is. r_1 guides the agent to choose a path that maximizes upward or downward throughput rates.

- Delay: It is defined as follows.

  r_2 = e^{-(ρ · r^u_2 + (1 − ρ) · r^d_2)} ∈ [0, 1]    (19)

  where r^u_2 and r^d_2 are the upward and downward delays, respectively. We define the delays as:

  r^u_2 = Σ_{p_{src,d} ∈ a} delay_t(p_{src,d})
  r^d_2 = Σ_{p_{d,src} ∈ a} delay_t(p_{d,src}),    (20)

  where r^u_2 and r^d_2 are the sums of all upward and downward path delays for the chosen action a. ρ ∈ [0, 1] controls the second objective of the TE problem. When ρ = 1, the agent's goal is to minimize the communication delay from sw_src to its destination switches sw_d ∈ D_src.

  Notice that we invert the communication delay objective function. Hence, minimizing the communication delay is equivalent to maximizing r_2. r_2 guides the agent to choose a path that minimizes upward and downward delays.

Overall, there are two goals for the agent. These goals are embedded in the reward function Eq. 15. A higher reward means that more packets with less delay are transferred from sw_src to its destination switches in D_src, and vice-versa.

Notice that it is possible to define Eq. 15 as r = ω · r_1 + (1 − ω) · r_2, where ω ∈ [0, 1] adjusts the importance of r_1 and r_2.

2) Learning Algorithm: We use the dueling double deep Q-learning (Dueling DDQN) architecture [30] with prioritized experience replay [31] and the ε-greedy policy to solve the reinforcement learning problem. This architecture handles the problem of overestimated Q-values and increases learning stability.

Algorithm 1: RL-Routing Algorithm with Dueling DDQN with Prioritized Experience Replay.
1: G(V, E) ← OpenFlow Network Discovery
2: A = PDA(G(V, E), sw_src, D_src, h)
3: Initialize replay memory M to capacity N
4: Initialize current and target action-value functions with random weights w and w^- = w, respectively
5: Initialize the link trust level of each link neutrally, i.e., T_{i,j} ← 0.5, ∀e_{i,j} ∈ E
6: repeat (for each episode):
7:   Initialize S_1 = [f_1, f_2, ..., f_10] by invoking NMM
8:   for t = 1 to 100 do
9:     With probability ε, choose a random action; otherwise select A_t = argmax_{a ∈ A} q_π(S_t, a; w)
10:    Execute A_t by invoking ATM and let the network run for a duration of |Δt|
11:    Invoke NMM to obtain new network information for computing S_{t+1} and R_{t+1}
12:    Store interaction tuple <S_t, A_t, S_{t+1}, R_{t+1}> in M
13:    Sample a random mini-batch of interaction tuples <s_j, a_j, s_{j+1}, r_{j+1}> from M with prioritization
14:    Set target y_j = r_{j+1} + γ · q_π(s_{j+1}, argmax_{a' ∈ A} q_π(s_{j+1}, a'; w); w^-)   // All interaction tuples are non-terminal
15:    Perform a gradient descent step on (y_j − q_π(s_j, a_j; w))^2 with respect to w
16:    Every C steps, set w^- = w
17:  end for

Fig. 2. The components of RL-Routing.

As illustrated in Algorithm 1, the agent in step 1 receives the network topology G(V, E), discovered by the OpenFlow Network Discovery application. In step 2, it invokes the PDA to construct the action space A for the topology G(V, E) (for more details, see PDA under the definition of actions in Section IV-C1). In steps 3 to 4, the agent initializes the memory M. Then, it initializes the current and target action-value functions as q_π(S_t, A_t; w) and q_π(S_t, A_t; w^-), respectively. They are approximated by deep neural networks with parameters w and w^-. In step 5, it sets the trust level of each link in E to neutral.

At the beginning of every episode (step 7), the agent invokes NMM for computing the initial state using Eq. 3. After each time interval Δt (a step of an episode), the agent repeats the following operations (steps 9 to 16), as shown in Figure 2.

- It chooses an action A_t ∈ A that maximizes the action-value function q_π(S_t, A_t; w), but with ε-exploration.
- The agent executes A_t by invoking ATM to set up routing tables. It lets the network run for a duration of |Δt|.
- Then, it invokes NMM to obtain new network information. It uses the information for computing the reward R_{t+1} for the chosen action A_t using Eq. 15. It also computes the new state S_{t+1} using Eq. 3.
- In steps 12 to 15, the agent stores the current interaction tuple into the memory and samples a mini-batch from M with prioritization. It then updates the action-value function q_π(., .; w) accordingly.
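To make steps 9 and 14 of Algorithm 1 concrete, the sketch below shows a dueling Q-network head and the double-DQN target in PyTorch-style Python. It is a minimal illustration under assumed layer sizes (64 hidden units, as in Section VI-A), not the authors' code; q_net and target_net stand for q_π(.; w) and q_π(.; w^-).

import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    # Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.adv = nn.Linear(hidden, n_actions)

    def forward(self, s):
        z = self.body(s)
        a = self.adv(z)
        return self.value(z) + a - a.mean(dim=1, keepdim=True)

def ddqn_target(r, s_next, gamma, q_net, target_net):
    # y = r + gamma * q_target(s', argmax_a q_current(s', a)); step 14 of Algorithm 1
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)   # action chosen by the current network
        return r + gamma * target_net(s_next).gather(1, best_a).squeeze(1)

def select_action(q_net, s, epsilon, n_actions):
    # epsilon-greedy selection (step 9), assuming s is a single state vector
    if torch.rand(1).item() < epsilon:
        return torch.randint(n_actions, (1,)).item()
    return q_net(s.unsqueeze(0)).argmax(dim=1).item()

Using the current network to pick the argmax while the target network evaluates it is what distinguishes the double-DQN target from plain DQN and, together with the dueling decomposition, mitigates the Q-value overestimation mentioned above.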

Fig. 3. Test network topologies. (a) 10-switch and 40-link Fat-tree network. (b) 14-switch and 70-link NSFNet network. (c) 21-switch and 92-link ARPANet network.

V. EVALUATION

In this part, we focus on evaluating our approach's performance and comparing it with some well-known solutions. We first describe our simulation environment and evaluation metrics. Then, we train the agent and evaluate its performance using a file transfer test.

A. Simulation Environment

We use the Mininet platform [32] to set up our environment containing a set of virtual hosts, switches, and links. We use Ryu [33] as an OpenFlow controller for managing the network and use Iperf to generate flow traffic.

We evaluate the performance of RL-Routing on three well-known network topologies, Fat-tree, NSF Network (NSFNet), and Advanced Research Projects Agency Network (ARPANet), as shown in Figure 3. The shaded nodes are the designated source switches (sw_src), and the rest of the nodes D_src = V − {sw_src} are the set of destination switches. Two hosts are connected to each switch for generating network traffic. The default speed for each switch port is 10 Gb.

B. Evaluation Metrics

We evaluate RL-Routing in terms of the reward, file transmission time, and utilization rate metrics.

- Reward is a score computed using Eq. 15.
- File transmission time is the total time of a complete file transfer.
- Utilization rate is calculated in the destination switch as follows. Suppose host h_{src,1} in sw_src transfers a file to host h_{des,2} in sw_des. The destination switch sw_des computes the utilization rate as the ratio of its link bandwidth that is used for the data traffic between h_{src,1} and h_{des,2} to the link's maximum bandwidth. The higher the utilization rate is, the lower the file transmission time is. Notice that the utilization rate differs from standard link utilization, where the amount of traffic traversing the link is divided by the link capacity. A short sketch contrasting the two rates is given below.
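The following small Python sketch contrasts the two rates: the utilization rate above counts only the bytes belonging to the monitored host pair, whereas standard link utilization counts all traffic on the link. The byte counters in the example are hypothetical values that a destination switch could report.

def utilization_rate(pair_bytes, max_bw, interval):
    # Share of the destination link's bandwidth used by one host pair
    return (pair_bytes * 8) / (max_bw * interval)

def standard_link_utilization(total_bytes, capacity, interval):
    # All traffic traversing the link divided by the link capacity
    return (total_bytes * 8) / (capacity * interval)

# Example: in a 1 s interval, the monitored flow delivered 0.5 GB over a
# 10 Gb/s destination link that also carried 0.3 GB of background traffic.
pair = utilization_rate(pair_bytes=0.5e9, max_bw=10e9, interval=1.0)              # 0.40
link = standard_link_utilization(total_bytes=0.8e9, capacity=10e9, interval=1.0)  # 0.64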

C. Training RL-Routing

We train an agent for solving the TE problem for each network topology. We invoke PDA by setting h = 8 to construct the corresponding action space with eight paths for each network topology. For the NSFNet and ARPANet topologies, we observe that the paths in a_3 to a_h are very long. We revise their action spaces by letting a_i, 3 ≤ i ≤ h, be a combination of the shorter paths in a_1 and a_2. In our simulation, we set the 11 indicator features to f_9 = (1, 0, ..., 0) and f_10 = (1, 0, 0, 0) for Monday morning. That is, RL-Routing is trained over traffic patterns observed on Monday morning. For the duration of a time interval |Δt|, we empirically try different values. We observe that when |Δt| = 1 s, RL-Routing provides the best performance in terms of all metrics. We train the agent in an episodic fashion. An episode has 100 steps, where the duration of each step is |Δt| seconds. We set φ = 0.5 and ρ = 0.5 in Eq. 16 and Eq. 19, respectively. That is, the agent equally optimizes upward and downward network throughputs.

For training purposes, we try different traffic compositions, including randomly generated traffic. We found that a more effective way of training the agent is to orchestrate hosts to generate the same traffic in every episode. Such a strategy makes the agent revisit each state enough times to try different actions. We schedule each host pair to generate periodic traffic in some predefined steps. For each host pair, we let them generate random traffic with a duration obtained from a normal distribution with μ = 5 seconds and σ = 1. We let the generator determine the patterns in which data are transferred between each host pair. We do not specify the packet interval time and burst length.

Figure 4 shows the training process of the agent within RL-Routing on various network topologies. We observe that 25 k different states are generated on average; some are very similar, and some are entirely different. In Figures 4 a, 4 b, and 4 c, the y-axis shows the sum of the rewards during an episode, while the x-axis shows the number of training episodes. In the beginning, the agent does not have enough knowledge about the underlying network. Therefore, it mostly explores the environment and obtains lower rewards. After a few episodes, the rewards increase and finally converge to the maximum values.

Fig. 4. Performance of the learning agent on various network topologies in terms of total reward. They show the convergence of RL-Routing on various network topologies. The solid lines represent smoothed total rewards with a window size of 40 episodes. (a) Total reward trend on Fat-tree. (b) Total reward trend on NSFNet. (c) Total reward trend on ARPANet.

In Figures 5 a, 5 b, and 5 c, the numbers on the x-axis are the number of episode steps, while the values on the y-axis are the obtained rewards. They contain five training episodes, each of which comprises 100 steps. The reward curves show that, compared with the baselines, RL-Routing obtains higher rewards on all three network topologies. For example, on the NSFNet topology, the sum of rewards obtained by RL-Routing is 193.01, whereas those of OSPF and LL are 160.32 and 144.86, respectively. This means that the agent has gained routing knowledge. After the training process, RL-Routing is ready to be used for routing in the field.

D. Evaluation Results

For performance comparison, we compare RL-Routing with two widely used baseline solutions:2

- Open Shortest Path First (OSPF): It finds a suitable path with the smallest number of hops.
- Least Loaded routing algorithm (LL): It uses the link throughput rate Eq. 5 as the cost of each link. It finds a suitable path with the smallest cost using the Dijkstra algorithm, where the path cost is the sum of its links' costs.

Notice that both algorithms find their suitable paths for connecting a pair of switches, but with different goals. Another similar algorithm is the delay-constrained lowest-cost (DCLC) algorithm [34]. It finds a path that has the minimal cost subject to a delay constraint (Δ_delay). It is similar to LL, but it chooses a path p_{src,des} with delay_t(p_{src,des}) ≤ Δ_delay. Since LL and DCLC are both memoryless, DCLC is expected to have a similar result as LL.

Figure 6 shows RL-Routing, OSPF, and LL's performance on various network topologies in terms of rewards. We summarize the results in the third column of Table II. The numbers on the x-axis are the number of episode steps, while the values on the y-axis are the obtained rewards. It contains five testing episodes, each of which comprises 100 steps. In this test, the hosts are instructed to generate heavy traffic. They produce traffic compositions that have never been seen by the agent in the training phase.3 The reward curves show that, compared with the baselines, RL-Routing outperforms the baseline solutions and obtains significantly higher rewards in every step. For example, on the ARPANet topology, the sum of rewards obtained by RL-Routing is 89.30, whereas those of OSPF and LL are 50.02 and 35, respectively. These results confirm RL-Routing's effectiveness in satisfying its objective function as compared with the baseline solutions. In addition, they show that the agent obtains higher rewards even over unseen traffic compositions.

We investigate the reason behind the higher rewards for RL-Routing. LL is a memoryless algorithm. It does not record its experience. Suppose that in a state s ∈ S, after choosing an action a, the background traffic changes. Then LL needs to wait until the next time interval to choose another action a'. In this situation, the throughput of the network remains low until the next time interval arrives. Even though LL revisits the same state s in the near future, it will make the same mistake because it does not record its experience.

On the other hand, RL-Routing records its experience in the action-value function. Hence, in a state s ∈ S, it remembers that action a' contributes to a higher reward than a. Therefore, as mentioned in the problem definition in Section III-B, experience is important in finding an efficient solution when mathematical modeling cannot be optimal.

Next, we design an experiment to see whether maximizing rewards can actually improve user experience on the network. We design a file transfer test as follows. In every episode, h_{src,1} transfers a 40 GB file to h_{3,2} in switch sw_3. Other hosts are also instructed to generate some background traffic, which changes the network states from time to time. We repeat this experiment ten times.

2 A related paper [7] uses the switch processing delay feature in its state representation. However, neither that paper nor the OpenFlow specification specifies how to measure such a delay.

3 We use host pairs to generate heavy traffic in some steps other than those in the training phase. For example, h_{1,2} generates heavy traffic towards h_{5,1} with a duration obtained from a normal distribution with μ = 20 seconds and σ = 2.
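The background traffic described above can be orchestrated with a short script like the following. It is a hypothetical sketch built on the Mininet Python API and the iperf command line (both used in Section V-A); the host names, durations, and schedule are illustrative, not the authors' exact configuration.

import random

def schedule_background_traffic(net, pairs, steps, mu=5.0, sigma=1.0):
    # net   : a started Mininet network object
    # pairs : list of (client_name, server_name) host-name tuples, e.g. [("h1_2", "h5_1")]
    # steps : episode steps at which each pair should generate traffic
    for step in steps:
        for client_name, server_name in pairs:
            client = net.get(client_name)
            server = net.get(server_name)
            duration = max(1.0, random.normalvariate(mu, sigma))  # ~N(5 s, 1 s)
            # Start an iperf server in the background, then push traffic for `duration` seconds.
            server.cmd('iperf -s &')
            client.cmd('iperf -c %s -t %d &' % (server.IP(), int(duration)))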

Fig. 5. Performance of the learning agent on various network topologies in terms of rewards. They show the reward obtained by RL-Routing, OSPF, and LL in each step of an episode. The reward curves of all algorithms fluctuate, and they all drop towards the end of an episode. This is because, in the final steps of an episode, hosts are scheduled not to generate any traffic. (a) Step reward on Fat-tree. (b) Step reward on NSFNet. (c) Step reward on ARPANet.

Fig. 6. Comparison of the performance of RL-Routing, OSPF, and LL in terms of the reward on various network topologies. We apply Gaussian smoothing to the curves for better presentation and comparison. (a) Fat-tree network. (b) NSFNet network. (c) ARPANet network.

Notice that on the Fat-tree topology (Figure 3 a), all paths from the source switch to the destination switches have the same length. In order to fairly demonstrate the performance of OSPF, we consider two assignments (i.e., efficient and non-efficient) and average the results. In the efficient assignment, OSPF connects the switches via dedicated paths that do not share any link. For example, switches sw_src and sw_1 are connected to sw_2 and sw_3 via sw_4, sw_8, sw_6 and sw_5, sw_9, sw_7, respectively. On the other hand, in the non-efficient assignment, switches sw_src and sw_1 are connected to sw_2 and sw_3 via sw_4, sw_8, sw_6 and sw_4, sw_8, sw_7, respectively.

Figure 7 shows the results of the file transfer test. Figures 7 a, 7 d, and 7 g show ten file transmission times using different methods on the Fat-tree, NSFNet, and ARPANet topologies, respectively. Each bar shows the file transmission time in seconds. In Figures 7 b, 7 e, and 7 h, each bar shows the average utilization rate during a file transfer test. Figures 7 c, 7 f, and 7 i show the distribution of the actions chosen by RL-Routing and LL. We summarize the details of the file transfer test, such as the average file transmission times, average utilization rates, etc., in Table II.

TABLE II
COMPARING PERFORMANCE OF DIFFERENT METHODS FOR TESTING EPISODES AND FILE TRANSFER TESTS

Fig. 7. Performance of all methods for the file transfer test on various network topologies. The utilization rates are computed in sw_3 for the traffic from h_{src,1} to h_{3,2}. Action 1 denotes the set of shortest paths in every topology. Notice that OSPF's strategy is to stay with action 1. Therefore, we omit its action distribution. (a) Fat-tree file transmission time. (b) Fat-tree utilization rate. (c) Fat-tree action distribution. (d) NSFNet file transmission time. (e) NSFNet utilization rate. (f) NSFNet action distribution. (g) ARPANet file transmission time. (h) ARPANet utilization rate. (i) ARPANet action distribution.

We can make the following observations from these results.

1) From Figures 7 a, 7 d, and 7 g, we can see that RL-Routing significantly reduces the file transmission time on all topologies as compared with the baseline solutions. For example, on the NSFNet topology, the average file transmission time of RL-Routing is 25.2 s. Those of OSPF and LL are 63 s and 53.4 s, respectively. Overall, RL-Routing provides 2.5× and 2.11× speedups over OSPF and LL on average, respectively. We can see that OSPF and LL perform inadequately. They are both greedy approaches and do not respond to the network traffic changes accordingly.

2) Figures 7 b, 7 e, and 7 h unveil one of the reasons that RL-Routing delivers satisfactory performance. Compared with all baseline solutions, RL-Routing obtains consistently higher utilization rates in all tests and on all network topologies. For example, on the NSFNet, the average utilization rate of RL-Routing is 0.49, whereas those of OSPF and LL are 0.26 and 0.30, respectively. Notice that the utilization rates are different on all network topologies. We observe that the topological structure and background traffic affect the utilization rates.

3) We investigate the reason behind such significant improvements in terms of file transmission time and utilization rate. Figures 7 c, 7 f, and 7 i show the distribution of the chosen actions during ten file transfer tests. We can see that the distributions of the actions chosen by RL-Routing and LL are quite different, as they have different objective functions. We observe that on the Fat-tree topology, RL-Routing changes its actions 7.9 times during a file transfer test on average, whereas LL changes 35.7 times. This is because, unlike LL, RL-Routing's strategy is to maximize its expected reward in the future, not just for the next step. In comparison with LL, RL-Routing tends to find an efficient solution. Once it finds such a solution, it does not tend to change it. On the other hand, LL with its greedy approach changes its paths quite frequently.

We observe that the action distributions are different on the NSFNet and ARPANet topologies. This indicates that the agent understands the topological structures of the networks and adjusts its strategy accordingly. For example, RL-Routing changes its action on average 2.8 and 5.4 times during the file transfer test on the NSFNet and ARPANet topologies, respectively. Therefore, RL-Routing learns that there is no single efficient solution for these topologies. Hence, it changes its actions in some network states.

4) We investigate the reason that OSPF and LL deliver unsatisfactory results. We record the amount of traffic forwarded from h_{src,1} to h_{3,2} for transferring a 40 GB file. We observe that on the NSFNet, sw_src forwards 78.98 GB and 77.39 GB using OSPF and LL on average, respectively. However, when using RL-Routing, sw_src forwards 59.43 GB on average. On the ARPANet topology, sw_src forwards on average 55.27 GB, 54.97 GB, and 50.51 GB using OSPF, LL, and RL-Routing, respectively. This shows that OSPF and LL, with their greedy strategies, often congest the links and cause delayed or lost packets. Consequently, h_{src,1} needs to re-transmit more data, thereby adding more traffic and further increasing the file transmission time and congestion likelihood.

We observed that RL-Routing performs consistently better than the other baseline solutions. That is, when using RL-Routing, h_{src,1} transfers fewer gigabytes of data than with all the other methods across all network topologies. This is due to the switch throughput rates in Eq. 9 and Eq. 10, the link-to-switch rate in Eq. 11, and the link throughput rate in Eq. 5. These features provide more direct information for the agent to choose a path that does not include the highly loaded switches and links. Overall, these features help RL-Routing overcome the following issues.

- RL-Routing prevents congestion collapse, in which network throughput drops to a low level.
- RL-Routing reduces delay and packet loss, which decreases the likelihood of data re-transmission.

VI. DISCUSSION

RL-Routing significantly improves network throughput. However, it also meets some challenges that need to be addressed further.

A. Overhead

The computational complexity of RL-Routing is mainly due to the neural network computation. At testing time, the action calculation time is determined by the structure of the neural network. It mostly consists of a series of matrix multiplications, which is O(n_1·n_2 + n_2·n_3 + ... + n_{d−1}·n_d), where n_i is the number of neurons in each layer of the neural network, and d is the number of layers. In our construction, d = 5, n_1 = |S_t|, and n_d = |A|. The number of neurons in the remaining layers (i.e., n_2, ..., n_{d−1}) is 64.

In the following, we compare the CPU and memory overheads of RL-Routing, OSPF, and LL on the Fat-tree topology. We run the Ryu controller on a virtual machine (VM) with ten CPU cores and 8 GB of memory. Figure 8 shows the CPU usage for the different routing algorithms over the file transfer test in Section V. The numbers on the x-axis are episode steps, while the values on the y-axis are the CPU usage percentage. We observe that the VM occupies on average 12.12% of CPU time when running RL-Routing, whereas those of OSPF and LL are 7.43% and 9.54%, respectively. Compared with LL, the extra overhead is mostly due to the neural network computation, which is acceptable in this context.

We further measure the memory overhead of RL-Routing, OSPF, and LL on the Fat-tree topology. We observe that RL-Routing occupies 22.2 MB. This is mostly due to the neural network storing input data, weight parameters, and activations as an input propagates through the network. The memory overhead of LL is roughly 1 MB. It is mostly due to running the Dijkstra algorithm and computing the cost of each path. The memory overhead of OSPF is roughly 0.2 MB. It is due to running the Dijkstra algorithm for finding the shortest path between each source-destination switch pair.

The communication overhead for retrieving network information is another issue in the SDN architecture. The controller periodically invokes NMM and ATM to collect network information and configure switches. Among all operations in NMM and ATM, computing the link delay is the most expensive operation. That is, in every time interval Δt, 2n + 3m packets are transferred for computing link delays. For the 2n packets, the controller sends a packet for installing a dummy flow entry on each switch. Upon the expiration of the dummy flow entries, each switch sends a notification packet to the controller. For the other 3m packets, the controller packs m LLDP packets into OpenFlow messages and sends them to the corresponding switches. Each switch unpacks its OpenFlow messages and forwards the LLDP packets into its links (m packets). The switches on the other end of the links forward the LLDP packets to the controller via OpenFlow messages (m packets).
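The 2n + 3m figure can be checked with a two-line Python calculation for the three test topologies of Figure 3 (n switches, m links); the per-interval totals below follow directly from the topology sizes and the message count derived above.

# Per-interval delay-measurement messages: 2n + 3m (Section VI-A).
topologies = {"Fat-tree": (10, 40), "NSFNet": (14, 70), "ARPANet": (21, 92)}
for name, (n, m) in topologies.items():
    print(name, 2 * n + 3 * m)   # Fat-tree: 140, NSFNet: 238, ARPANet: 318 packets per |Dt|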

We observe that the communication overhead is acceptable in our experiments. For example, on the Fat-tree topology, the extra overhead for computing link delays is only 48.06 MB in 24 hours. The corresponding values for the NSFNet and ARPANet topologies are 81.71 MB and 109.17 MB, respectively. Hence, the communication overhead increases only slightly as the number of switches in the network grows.

Fig. 8. Comparison of the CPU usage of RL-Routing, OSPF, and LL on the Fat-tree topology. The curves represent smoothed CPU usage with a window size of 150 steps. RL-Routing(LDM) computes link delays using the LDM method suggested by Francois et al. [9].

Francois et al. [9] use a Link Delay Monitoring (LDM) mechanism that computes link delays in a slightly different way. We implement LDM and compare it with our method (see Section IV-B for the description). We observe that the link delays computed by the two methods are very close, with a Normalized Cross Correlation (NCC) of 0.45. However, the two methods have different computation and communication complexities. In Figure 8, the related curves show the CPU usage of RL-Routing when using our method and LDM for computing link delays, respectively. We observe that LDM causes RL-Routing to use on average 8.6% more CPU time than our method. Meanwhile, in every time interval Δt, LDM transfers 4n + 3m packets for computing link delays, whereas our method transfers 2n + 3m packets. The extra 2n packets are the OpenFlow Barrier-Request messages that the controller sends to every switch under its control. Therefore, our method for computing link delays is more satisfactory in this context.
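For reference, one common zero-lag definition of the normalized cross correlation between two delay series can be computed as in the sketch below. The sample values are hypothetical, and our measurement pipeline may use a slightly different NCC variant.

import numpy as np

def normalized_cross_correlation(x, y):
    # Zero-lag, zero-mean, unit-variance (Pearson-style) correlation of two
    # equally long delay series; values near 1 indicate closely matching signals.
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float(np.mean(x * y))

# Hypothetical per-interval link-delay samples (in ms) from the two monitors.
delays_ours = np.array([1.2, 1.4, 2.1, 1.9, 3.0, 2.2])
delays_ldm = np.array([1.1, 1.6, 2.0, 2.3, 2.7, 2.4])
print(normalized_cross_correlation(delays_ours, delays_ldm))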
B. Traffic Load Level

In our experiments, we observe that when the traffic load is low, the results obtained from RL-Routing, OSPF, and LL are close. However, when the traffic load is high, RL-Routing significantly outperforms the baseline solutions. When the traffic load increases, the likelihood of congestion increases as well, especially on the network's shortest paths. As discussed in Section V-D, RL-Routing performs much better than the baseline solutions because the agent has learned to choose non-congested paths for future traffic forwarding.

C. Optimality

The goal here is to determine what constitutes an efficient solution for the TE problem under the given state representation, action description, and reward function, and how to find it. Without loss of generality, every routing algorithm makes its routing decisions using a policy. A policy π defines how a routing algorithm behaves. It is a distribution over actions given states, i.e.,

    π(a|s) = P[a|s].                                                    (21)

For example, LL's policy π_LL chooses the least loaded path using link capacity rates. OSPF's policy π_OSPF chooses the shortest path with the smallest number of hops.

RL-Routing starts with a policy π_RL and improves it using policy iteration. In every step, the agent updates q_{π_RL} by computing states using Eq. 3, taking actions using its policy π_RL, and computing rewards using Eq. 15. By repeating these steps sufficiently many times, the agent finds an optimal q_{π*_RL} such that, for all π'_RL,

    π*_RL ≥ π'_RL  if  q_{π*_RL}(s, a) ≥ q_{π'_RL}(s, a)  ∀s ∈ S, a ∈ A(s).    (22)

This is because, by the convergence theorem of reinforcement learning, q_{π_RL} converges to an optimal action-value function [35].

We say a solution with policy π* is optimal for the TE problem if it is better than or equal to all policies [35], i.e., for all π',

    π* ≥ π'  if  q_{π*}(s, a) ≥ q_{π'}(s, a)  ∀s ∈ S, a ∈ A(s).              (23)

Upon achieving an optimal policy π*, the agent takes the best possible action in every state s ∈ S by

    A* = arg max_{a ∈ A(s)} q_{π*}(s, a).                                   (24)

Therefore, finding an optimal policy π* is equivalent to finding an optimal solution for the TE problem. However, we do not know the best action-value function q_{π*}.

We say that the policy π*_RL obtained by RL-Routing is efficient, rather than optimal, because there may be some states s where q_{π*_RL}(s, a) < q_{π*}(s, a) and the agent chooses a non-optimal action a' ≠ a*. Nevertheless, as shown in Figure 5 and Figure 6, RL-Routing obtains relatively higher rewards than OSPF and LL. Therefore, π*_RL is better than π_LL and π_OSPF over the observed states. This indicates that OSPF and LL, with their greedy policies, cannot achieve an efficient policy on the evaluated network topologies.

RL-Routing has the potential to find an even better policy. This is conditioned on the traffic compositions seen in the training phase and on the adopted exploration mechanism (e.g., ε-greedy), which controls the exploration-exploitation trade-off. As shown in Figure 4, at the beginning of the training phase, the agent mostly explores to gather information. After a while, it starts to exploit by making better decisions using the knowledge it has obtained about the underlying network.
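To make the exploration-exploitation mechanism concrete, the sketch below shows a generic ε-greedy selection over an action-value function, in the spirit of Eq. 24. The q_values callable and the candidate action names are placeholders for illustration, not our implementation.

import random

def select_action(q_values, actions, epsilon):
    # q_values: callable mapping an action to its estimated q(s, a) in the
    #           current state (a stand-in for the trained network's output).
    # actions:  candidate routing actions A(s) in the current state.
    # epsilon:  exploration probability; epsilon = 0 recovers the greedy rule of Eq. 24.
    if random.random() < epsilon:
        return random.choice(actions)      # explore
    return max(actions, key=q_values)      # exploit: arg max_a q(s, a)

# Illustrative use with made-up q-estimates for three candidate paths.
q = {"path_0": 0.42, "path_1": 0.57, "path_2": 0.31}
chosen = select_action(q.get, list(q), epsilon=0.1)

During training, ε is typically annealed from a value near 1 toward a small constant, which matches the explore-then-exploit behavior described above.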
In networks with divergent traffic distributions, it is important to prolong the training phase so that the agent observes various traffic compositions and improves its policy.

D. Performance Factors

The performance of RL-Routing depends on the topological structure of the network and on the length of the training phase. For example, RL-Routing obtains the highest speedup on the Fat-tree topology. Nevertheless, on the NSFNet and ARPANet topologies, it still obtains higher speedups compared with OSPF and LL. Therefore, RL-Routing does not obtain a constant improvement across all network topologies.

For real platform deployment, we suggest prolonging the training phase for networks with highly dynamic traffic distributions. This helps the agent observe the various traffic patterns that arise on different days and at different times of day, and improve its policy accordingly. The day-of-week and part-of-day features help RL-Routing distinguish traffic patterns over different times. For example, network traffic might exhibit different patterns on Friday nights and Monday mornings.

E. Deployment of RL-Routing

RL-Routing can be adopted on both virtual and real platforms immediately, because the network information used in NMM is entirely defined by the OpenFlow specification [26]. In addition, after training, RL-Routing can be deployed on another network with a similar topology and similar traffic patterns, because the agent can transfer its routing knowledge through its action-value functions.
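As one way to realize this kind of transfer, the sketch below saves and reloads the parameters of an action-value network. It assumes a PyTorch-based agent with a hypothetical QNetwork class; the layer sizes, dimensions, and file name are arbitrary and are not prescribed by this paper.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Hypothetical action-value network: maps a state vector to one q-value
    # per candidate routing action.
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# After training on the source network, persist the learned action-value function.
trained = QNetwork(state_dim=64, n_actions=10)
torch.save(trained.state_dict(), "rl_routing_q.pt")

# On a network with a similar topology (same state and action dimensions),
# start from the transferred weights instead of training from scratch.
agent = QNetwork(state_dim=64, n_actions=10)
agent.load_state_dict(torch.load("rl_routing_q.pt"))

A short fine-tuning phase on the target network then adapts the transferred policy to that network's specific traffic patterns.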
VII. CONCLUSIONS

In this paper, we develop a reinforcement learning routing algorithm to solve a TE problem of SDN in terms of throughput and delay. RL-Routing solves the TE problem via experience, instead of building an accurate mathematical model. We consider comprehensive network information for state representation and use one-to-many network configuration for routing choices. Our reward function, which uses the network throughput and delay, is adjustable for optimizing either upward or downward network throughput.

We implement RL-Routing and conduct comprehensive experiments on well-known network topologies, i.e., Fat-tree, NSFNet, and ARPANet. The experimental results show the advantage of experience-driven artificial intelligence over traditional algorithms for the TE problem. Our results show the following. First, compared with the baseline solutions, RL-Routing obtains higher rewards on all three network topologies. Second, RL-Routing significantly improves user experience on the network, as it minimizes the file transmission time on all three network topologies. Third, RL-Routing avoids congested paths; therefore, hosts re-transmit fewer packets than with the baseline solutions.

As part of future work, we aim to deploy RL-Routing in a real network environment. Moreover, we will evaluate RL-Routing on other operational network topologies.

REFERENCES

[1] N. McKeown et al., "OpenFlow: Enabling innovation in campus networks," SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69–74, Mar. 2008. [Online]. Available: http://doi.acm.org/10.1145/1355734.1355746
[2] S. Sezer et al., "Are we ready for SDN? Implementation challenges for software-defined networks," IEEE Commun. Mag., vol. 51, no. 7, pp. 36–43, 2013.
[3] S. H. Low and D. E. Lapsley, "Optimization flow control I: Basic algorithm and convergence," IEEE/ACM Trans. Netw., vol. 7, no. 6, pp. 861–874, Dec. 1999. [Online]. Available: https://doi.org/10.1109/90.811451
[4] D. P. Palomar and M. Chiang, "A tutorial on decomposition methods for network utility maximization," IEEE J. Sel. Areas Commun., vol. 24, no. 8, pp. 1439–1451, Aug. 2006. [Online]. Available: https://doi.org/10.1109/JSAC.2006.879350
[5] J. Moy, "OSPF version 2," RFC 2328, 1998.
[6] L. Li and A. K. Somani, "Dynamic wavelength routing using congestion and neighborhood information," IEEE/ACM Trans. Netw., vol. 7, no. 5, pp. 779–786, Oct. 1999. [Online]. Available: http://dx.doi.org/10.1109/90.803390
[7] H. Yao, T. Mai, X. Xu, P. Zhang, M. Li, and Y. Liu, "NetworkAI: An intelligent network architecture for self-learning control strategies in software defined networks," IEEE Internet Things J., vol. 5, no. 6, pp. 4319–4327, 2018.
[8] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C. H. Liu, and D. Yang, "Experience-driven networking: A deep reinforcement learning based approach," in Proc. IEEE INFOCOM, 2018, pp. 1871–1879.
[9] F. Francois and E. Gelenbe, "Towards a cognitive routing engine for software defined networks," in Proc. IEEE Int. Conf. Commun. (ICC), 2016, pp. 1–6.
[10] M. Karakus and A. Durresi, "A survey: Control plane scalability issues and approaches in software-defined networking (SDN)," Comput. Netw., vol. 112, pp. 279–293, 2017.
[11] N. Wang, K. Ho, G. Pavlou, and M. Howarth, "An overview of routing optimization for internet traffic engineering," IEEE Commun. Surv. Tuts., vol. 10, no. 1, pp. 36–56, Jan. 2008. [Online]. Available: https://doi.org/10.1109/COMST.2008.4483669
[12] S. T. V. Pasca, S. S. P. Kodali, and K. Kataoka, "AMPS: Application aware multipath flow routing using machine learning in SDN," in Proc. 23rd Nat. Conf. Commun. (NCC), Chennai, India, 2017, pp. 1–6.
[13] A. Mendiola, J. Astorga, E. Jacob, and M. Higuero, "A survey on the contributions of software-defined networking to traffic engineering," IEEE Commun. Surveys Tuts., vol. 19, no. 2, pp. 918–953, 2016.
[14] S. Jain et al., "B4: Experience with a globally-deployed software defined WAN," ACM SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 3–14, 2013.
[15] M. Caria, A. Jukan, and M. Hoffmann, "SDN partitioning: A centralized control plane for distributed routing protocols," IEEE Trans. Netw. Serv. Manag., vol. 13, no. 3, pp. 381–393, Sep. 2016. [Online]. Available: https://doi.org/10.1109/TNSM.2016.2585759
[16] H. Ghafoor and I. Koo, "CR-SDVN: A cognitive routing protocol for software-defined vehicular networks," IEEE Sensors J., vol. 18, no. 4, pp. 1761–1772, 2017.
[17] P. Amaral, L. Bernardo, and P. Pinto, "Achieving correct hop-by-hop forwarding on multiple policy-based routing paths," IEEE Trans. Netw. Sci. Eng., 2019.
[18] B. Wu, H. Shen, and K. Chen, "SPREAD: Exploiting fractal social community for efficient multi-copy routing in taxi VDTNs," IEEE Trans. Netw. Sci. Eng., vol. 6, no. 4, pp. 871–884, 2018.
[19] R. Touihri, S. Alwan, A. Dandoush, N. Aitsaadi, and C. Veillon, "CRP: Optimized SDN routing protocol in server-only CamCube data-center networks," in Proc. IEEE Int. Conf. Commun. (ICC), 2019, pp. 1–6.
[20] G. Stampa, M. Arias, D. Sanchez-Charles, V. Muntes-Mulero, and A. Cabellos, "A deep-reinforcement learning approach for software-defined networking routing optimization," 2017, arXiv:1709.07080.
[21] A. Valadarsky, M. Schapira, D. Shahaf, and A. Tamar, "Learning to route," in Proc. 16th ACM Workshop Hot Topics Netw. (HotNets-XVI), New York, NY, USA, 2017, pp. 185–191. [Online]. Available: https://doi.org/10.1145/3152434.3152441
[22] S.-C. Lin, I. F. Akyildiz, P. Wang, and M. Luo, "QoS-aware adaptive routing in multi-layer hierarchical software defined networks: A reinforcement learning approach," in Proc. IEEE Int. Conf. Services Comput. (SCC), 2016, pp. 25–33.
[23] F. Naeem, G. Srivastava, and M. Tariq, "A software defined network based fuzzy normalized neural adaptive multipath congestion control for internet of things," IEEE Trans. Netw. Sci. Eng., 2020.
[24] J. A. Boyan and M. L. Littman, "Packet routing in dynamically changing networks: A reinforcement learning approach," in Proc. 6th Int. Conf. Neural Inf. Process. Syst. (NIPS), 1993, pp. 671–678. [Online]. Available: http://dl.acm.org/citation.cfm?id=2987189.2987274
[25] K.-C. Leung, V. O. K. Li, and D. Yang, "An overview of packet reordering in transmission control protocol (TCP): Problems, solutions, and challenges," IEEE Trans. Parallel Distrib. Syst., vol. 18, no. 4, pp. 522–535, Apr. 2007. [Online]. Available: https://doi.org/10.1109/TPDS.2007.1011
[26] "OpenFlow switch specification version 1.3.5," Open Networking Foundation. [Online]. Available: https://www.opennetworking.org/wp-content/uploads/2014/10/openflow-switch-v1.3.5.pdf
[27] L. S. Committee et al., "IEEE standard for local and metropolitan area networks—Station and media access control connectivity discovery," 2009.
[28] A. Rezapour and W.-G. Tzeng, "A robust intrusion detection network using thresholdless trust management system with incentive design," in Proc. 14th Int. Conf. Security Privacy Commun. Netw. (SecureComm), Cham: Springer, 2018, pp. 139–154.
[29] J. Y. Yen, "Finding the k shortest loopless paths in a network," Manag. Sci., vol. 17, no. 11, pp. 712–716, 1971.
[30] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," in Proc. 33rd Int. Conf. Mach. Learn. (ICML), 2016, pp. 1995–2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045390.3045601
[31] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," 2015, arXiv:1511.05952.
[32] B. Lantz, B. Heller, and N. McKeown, "A network in a laptop: Rapid prototyping for software-defined networks," in Proc. 9th ACM SIGCOMM Workshop Hot Topics Netw. (Hotnets-IX), New York, NY, USA, 2010, pp. 19:1–19:6. [Online]. Available: http://doi.acm.org/10.1145/1868447.1868466
[33] "Ryu controller." [Online]. Available: http://osrg.github.com/ryu/
[34] R. Hassin, "Approximation schemes for the restricted shortest path problem," Math. Oper. Res., vol. 17, no. 1, pp. 36–42, Feb. 1992.
[35] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2011.

Yi-Ren Chen received the B.S. and M.S. degrees in 2006 and 2012, respectively, from National Chiao Tung University, Hsinchu, Taiwan, where she is currently working toward the Ph.D. degree with the Department of Computer Science. Her research interests are software defined networking and cloud infrastructure.

Amir Rezapour received the M.S. degree in computer science from National Tsing Hua University, Hsinchu, Taiwan, in 2013 and the Ph.D. degree in computer science from National Chiao Tung University, Hsinchu, Taiwan, in 2018. He is currently a Postdoctoral Research Fellow with National Chiao Tung University. His research interests are in the area of cryptography and network security.

Wen-Guey Tzeng received the B.S. degree in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 1985, and the M.S. and Ph.D. degrees in computer science from the State University of New York at Stony Brook, Stony Brook, NY, USA, in 1987 and 1991, respectively. His current research interests include security data analytics, cryptology, information security, and network security.

Shi-Chun Tsai (Senior Member, IEEE) received the B.S. and M.S. degrees from National Taiwan University, Taipei, Taiwan, in 1984 and 1988, respectively, and the Ph.D. degree from The University of Chicago, Chicago, IL, USA, in 1996, all in computer science. He is currently a Professor with the Department of Computer Science, National Chiao Tung University (NCTU), Hsinchu, Taiwan. His research interests include computational complexity, algorithms, cryptography, software defined networking, and applications. He is a member of ACM and SIAM.