This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2018.2871394, IEEE Internet of Things Journal
Abstract: With the development of communication technologies and smart manufacturing, the industrial Internet of Things (IIoT) has emerged. Software-defined networking (SDN), a promising paradigm shift, provides a viable way to manage IIoT dynamically; the combination is called the software-defined industrial Internet of Things (SDIIoT). In SDIIoT, industrial devices generate large volumes of data and flows, so a physically distributed but logically centralized control plane is necessary. However, one of the most intractable problems is how to reach consensus among multiple controllers in complex industrial environments. In this paper, we propose a blockchain-based consensus protocol for SDIIoT, along with detailed consensus steps and theoretical analysis, where the blockchain works as a trusted third party to collect and synchronize network-wide views between different SDN controllers. Specifically, it is a permissioned blockchain. In order to improve the throughput of this blockchain-based SDIIoT, we jointly consider the trust features of blockchain nodes and controllers, as well as the computational capability of the blockchain system. Accordingly, we formulate view change, access selection, and computational resource allocation as a joint optimization problem. We describe this problem as a Markov decision process by defining the state space, action space, and reward function. Because this joint problem is difficult to solve with traditional methods, we propose a novel dueling deep Q-learning approach. Simulation results are presented to show the effectiveness of the proposed scheme.

Index Terms: Industrial Internet of Things (IIoT), software-defined networking (SDN), multiple controllers, blockchain, dueling deep Q-learning.

I. INTRODUCTION
Recently, a growing number of applications use Internet of Things (IoT) technologies in several industries. The industrial Internet of Things (IIoT) has emerged and attracted considerable attention from industry and academia [1]. In order to meet the demands of high bandwidth, ubiquitous accessibility, and dynamic management, software-defined networking (SDN) [2] has been used in IIoT, which is called SDIIoT [3]. In addition, software-defined routing management, edge computing, flow scheduling, and energy harvesting have been studied in the literature [4]–[7]. With the increasing number of industrial devices, more than one controller is employed in SDIIoT, which is known as distributed SDIIoT. Reaching consensus among multiple controllers is challenging in distributed SDIIoT.

Although some traditional methods can reach consensus among multiple controller instances, numerous non-trivial issues in the current consensus methods prevent SDIIoT from being used as a generic platform for different services and applications, including 1) extra overheads, 2) poor safety and liveness properties, and 3) the limited network size that can be supported. These challenges need to be tackled by comprehensive research efforts.

Recently, blockchain (BC) [8] has emerged as a novel technique that can be used to address the above challenges. BC is a distributed ledger that records transactions and provides trustworthy services to a group of nodes without a central authority [9]. For distributed SDIIoT, BC can act as a trusted, ‘out-of-band’ third party to collect and synchronize network-wide views (e.g., network events, network topology, and OpenFlow commands) between different SDN controllers safely, dependably, and traceably. In general, there are two kinds of BC [10]: permissionless BC and permissioned BC. In permissionless BC, enrollment is open to anyone, and nodes can join and leave dynamically and frequently. Permissionless BC uses a Nakamoto-style consensus protocol coupled to a cryptocurrency, such as proof of work (PoW) in Bitcoin [8] and proof of stake (PoS) in Ethereum [11]. All BC participants (miners) contribute their CPU power to work on an extra hard task, and only the winner is able to propose a block and synchronize it with the others. Thus, permissionless BC consumes large amounts of resources and time. In contrast, permissioned BC uses a Byzantine fault tolerance (BFT) consensus protocol. It employs a state machine replication mechanism, such as practical Byzantine fault tolerance (PBFT) [13] and Paxos [14], to deal with Byzantine nodes that are subverted by adversaries and maliciously act against the common goal of reaching consensus [12]. Permissioned BCs operate in a partially trusted environment; Hyperledger Fabric [15] is one example. Thus, the advantages of permissioned BC are low cost, low latency, and low bandwidth consumption. Considering the partial trust, limited communication time, high loads, and narrow bandwidth in SDIIoT, as well as the advantages of permissioned BC, we use permissioned BC in this paper.
2327-4662 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
In this paper, we propose a BC-based consensus protocol in distributed SDIIoT, where BC works as a trusted third party to collect and synchronize network-wide views between different SDN controllers. Specifically, it is a permissioned BC. Because the throughput of permissioned BC is constrained by the performance of the BFT consensus protocol, we jointly consider the trust features of BC nodes and controllers, as well as the computational capability of the BC system, so as to further improve the throughput of BC. Accordingly, we formulate view change, access selection, and computational resource allocation as a joint optimization problem. We describe this problem as a Markov decision process by defining the state space, action space, and reward function. Inspired by works that achieve optimal decisions with improved learning methods [16]–[18], and by the fact that this joint problem is difficult to solve with traditional methods, we use a novel dueling deep Q-learning approach to solve it.

A preliminary version of our work appeared in [19]. The distinct contributions of this paper are as follows.
• We propose a BC-based consensus protocol to simplify and secure the collection and synchronization of network views between different SDN controllers. We give the consensus steps in detail, along with theoretical analysis.
• Because many network views need to be collected and synchronized by the BC system, we jointly consider the trust features of BC nodes and controllers, as well as the computational capability of the BC system, so as to improve the throughput. Accordingly, we formulate view change, access selection, and computational resource allocation as a joint optimization problem. We describe it as a Markov decision process by defining the state space, action space, and reward function.
• In order to address this joint problem, we propose a novel dueling deep Q-learning approach to learn the optimal strategy. Simulation results with different system parameters are presented to show the effectiveness of the proposed scheme.

The rest of this paper is organized as follows. In Section II, we present related work on SDIIoT and distributed SDN. Section III describes the system model, followed by the BC-based consensus protocol in Section IV. In Section V, we describe view change, access selection, and computational resource allocation as a Markov decision process. We then use a novel dueling deep Q-learning approach to solve this problem in Section VI. Simulation results are presented and discussed in Section VII. Finally, conclusions and future work are given in Section VIII.

II. RELATED WORK
In this section, we briefly present the recent advances in SDIIoT and distributed SDN. Some challenges are discussed as well.

A. Software-Defined Industrial Internet of Things
There have been a number of advances in IIoT, such as industrial wireless sensor networks, industrial big data, cloud computing, and fog computing. These emerging techniques have brought an explosion of intelligent devices and industrial robots that are supported by wired or wireless networks. In order to meet the demands of high bandwidth, ubiquitous accessibility, and dynamic management, many works have employed SDN in IIoT. For example, Wan et al. in [20] proposed a software-defined industrial network to manage manufacturing resources dynamically. Moness et al. in [21] employed a hybrid software-defined approach in IIoT. Silva et al. in [22] extended the OpenFlow protocol to industrial scenarios. Duan et al. in [23] described a software-defined, virtual multiple-network control framework in IIoT.

As we can see, there is an emerging trend to employ SDN in IIoT. Because there are large volumes of data and flows in SDIIoT, more than one controller is necessary, which is called distributed SDIIoT. One of the most intractable problems in distributed SDIIoT is how to reach consensus among multiple controllers. Whether traditional distributed SDN architectures are suitable for IIoT, and how to reach consensus among controllers, need further research. Therefore, we present some traditional distributed SDN architectures, along with the corresponding consensus methods and their challenges.

B. Distributed SDN Architectures
Many works have designed distributed SDN architectures [24]. Based on the configuration methods between controllers and switches, we classify them as statically configured control architectures and dynamically configured control architectures [25].

1) Statically Configured Control Architecture: The authors in [26] designed the first distributed SDN control plane for OpenFlow, called HyperFlow. It synchronizes controllers' network-wide views (i.e., local controller instances' events) through a publish/subscribe messaging paradigm. Koponen et al. in [27] proposed the Onix architecture, which uses a network information base (NIB) to aggregate and share network-wide views.

2) Dynamically Configured Control Architecture: Statically configured control architectures lead to uneven load distribution among controllers. Therefore, Berde et al. in [28] designed an elastic and experimental distributed SDN control plane, called ONOS. The works in [29], [30] presented a hierarchical SDN control architecture with joint consideration of load balancing and energy consumption of multiple controllers.

As we can see from the above research, one of the most important problems in distributed SDN architectures is how to reach consensus among multiple controller instances. No matter which consensus protocols (e.g., publish/subscribe,
slicing, NIB, and controller pools) are used, many challenges remain to be solved when they are employed in SDIIoT.
• The traditional consensus protocols are implemented ‘in-band’, which imposes extra overheads on each controller and weakens the inherent functions of controllers (e.g., routing decisions and network management). This problem is more severe given the limited capacities of industrial devices. Therefore, a third-party, ‘out-of-band’ consensus protocol is necessary in SDIIoT.
• Safety and liveness properties are ignored in traditional works. Since the control plane is the brain of SDN, its failures have incalculable consequences. In open industrial networks, adversaries and attacks can more easily affect safe communications. Thus, trustworthy and dependable consensus protocols are stringently required in SDIIoT.
• With the widespread deployment of industrial networks, the limited scope of these consensus protocols becomes a challenge. A large-scale consensus protocol should be employed in SDIIoT.

These challenges have fueled the need to explore new consensus protocols in SDIIoT. Inspired by the successful integration of BC with secure key synchronization [31] and data sharing [32] in intelligent transportation systems, we consider that BC could be a potential approach to address the challenges in distributed SDIIoT. BC is a distributed system that provides dependable services to a group of nodes that do not fully trust each other. It can be seen as a third-party system that achieves agreement among nodes without a central trust broker, offering dependability and support for large system scales. In addition, considering that distributed SDIIoT operates in a partially trusted environment, and the advantages of permissioned BC, we utilize permissioned BC in distributed SDIIoT.

III. SYSTEM MODEL
In this section, we introduce the system model that we use. We first present the network model, followed by the trust feature model and the computation model.

A. Network Model
We assume that there are $C$ controllers in distributed SDIIoT, which are represented by $\mathcal{C} = \{1, ..., C\}$. Each of them can communicate with the third-party BC system. This BC system consists of $N$ nodes, i.e., physical machines, denoted by $\mathcal{N} = \{1, ..., N\}$. Like other robust BFT protocols [33], these $N$ nodes reach consensus under the Byzantine failure model, where at most $f$ nodes are faulty [13], with $f = \frac{N-1}{3}$. Any finite number of controllers can behave arbitrarily and issue correct or incorrect transactions to the BC system. Some strong adversaries can collude with each other to compromise the replicated service. However, they cannot break cryptographic technologies, i.e., signatures, message authentication codes (MACs), and collision-resistant hashing. We denote messages protected by cryptographic technologies as follows [34].
• $\langle m \rangle_{\sigma_i}$ means that message $m$ is signed by node $i$.
• $\langle m \rangle_{\sigma_{i,j}}$ means that message $m$ is authenticated by node $i$ with a MAC for node $j$.
• $\langle m \rangle_{\sigma_{\vec{i}}}$ means that message $m$ is authenticated by node $i$ with an array of MACs, one for every replica.

Fig. 1 shows the different network structures of the traditional scheme and the BC-based scheme. It is worth mentioning that we use edge computing servers to perform some of the computations related to the above cryptography so as to improve the throughput of the BC system. There are $E$ edge computing servers, and the set of computing servers is represented by $\mathcal{E} = \{1, 2, ..., E\}$.

B. Trust Feature Model
We consider the trust features of nodes and controllers in the system. Due to the lack of centralized security services and prior security associations, all nodes and controllers have diverse trust features, such as safe or compromised. It is barely possible to know exactly what the trust feature of a node or a controller will be at the next time instant. Thus, the trust features of a node $n \in \{1, 2, ..., N\}$ and a controller $c \in \{1, 2, ..., C\}$ can be modelled as random variables $\delta^n$ and $\eta^c$. $\delta^n$ and $\eta^c$ can be divided into discrete levels, denoted by $\xi = \{\xi_0, \xi_1, ..., \xi_{L-1}\}$ and $\mathcal{D} = \{D_0, D_1, ..., D_{H-1}\}$, respectively, where $L$ and $H$ are the numbers of available trust features for nodes and controllers. We denote the realizations of $\delta^n$ and $\eta^c$ at time slot $t$ by $\delta^n(t)$ and $\eta^c(t)$, respectively. There are $T$ time slots in the period that starts when the controller issues an un-validated block and ends when the controller receives a validated block in reply. Let $t \in \{0, 1, 2, ..., T-1\}$ denote the time instant.

Considering the time correlation of real trust features in BC nodes and controllers, we use Markov chains to model the transitions of trust features in BC nodes and controllers, as follows:
• For node $n$, let the transition probability of $\delta^n(t)$ from one state $X_s$ to another state $Y_s$ be $\kappa_{X_s Y_s}(t)$. The $L \times L$ transition probability matrix $\mathbf{K}^n(t)$ of the trust feature in node $n$ is denoted as
$$\mathbf{K}^n(t) = [\kappa_{X_s Y_s}(t)]_{L \times L}, \tag{1}$$
where $\kappa_{X_s Y_s}(t) = \Pr(\delta^n(t+1) = Y_s \mid \delta^n(t) = X_s)$, and $X_s, Y_s \in \xi$.
• For controller $c$, the transition probability of $\eta^c(t)$ is $\gamma_{\theta_s \phi_s}(t)$. The $H \times H$ transition probability matrix $\Upsilon^c(t)$ of the trust feature in controller $c$ is denoted as
$$\Upsilon^c(t) = [\gamma_{\theta_s \phi_s}(t)]_{H \times H}, \tag{2}$$
Fig. 1: The different network structures between traditional scheme and BC-based scheme.
where $\gamma_{\theta_s \phi_s}(t) = \Pr(\eta^c(t+1) = \phi_s \mid \eta^c(t) = \theta_s)$, and $\theta_s, \phi_s \in \mathcal{D}$.

C. Computation Model
There are a number of computation tasks in the BC nodes, such as verifying signatures, generating MACs, and verifying MACs. Let $T_m = \{s_m, q_m\}$ denote a computation task related to message $m$, where $s_m$ is the size of message $m$, and $q_m$ is the number of CPU cycles required to complete this task.

In order to improve the throughput of BC, we use virtual computing resources from edge computing servers to perform computation tasks in the BC system. There are many edge computing servers, and other computation tasks also use the computation resources of the edge computing servers, so we do not know exactly which computational resources will be available to the BC system at the next time instant. Therefore, we model the computation resources of edge computing server $e$ for the BC system as a random variable $\zeta^e$. To discretize the values of computation capabilities, $\zeta^e$ can be partitioned into $Y$ discrete intervals as $\mathcal{Y} = \{Y_0, Y_1, ..., Y_{Y-1}\}$. The computation resources of edge computing server $e$ at time slot $t$ are denoted as $\zeta^e(t)$, $t \in \{0, 1, 2, ..., T-1\}$.

Considering the time correlation of the computation state in an edge computing server, we use a Markov chain to model the transition of the computation state. Based on a certain transition probability, $\zeta^e(t)$ changes from one state to another. Let $\vartheta_{a_s b_s}(t)$ denote the transition probability. The $Y \times Y$ computation state transition probability matrix is represented as:
$$\Pi^e(t) = [\vartheta_{a_s b_s}(t)]_{Y \times Y}, \tag{3}$$
where $\vartheta_{a_s b_s}(t) = \Pr(\zeta^e(t+1) = b_s \mid \zeta^e(t) = a_s)$, and $a_s, b_s \in \mathcal{Y}$.

The execution time of computation task $T_m$ can be denoted as
$$t_m = \frac{q_m}{\zeta^e(t)}. \tag{4}$$
Thus, the computation rate is
$$\mathrm{CompR}^e(t) = a^e(t)\frac{s_m}{t_m} = a^e(t)\frac{\zeta^e(t)\, s_m}{q_m}, \tag{5}$$
where $a^e(t)$ indicates whether or not edge computing server $e$ is allocated to the BC system at time slot $t$: $a^e(t) = 1$ denotes that edge computing server $e$ is allocated to the BC system; otherwise $a^e(t) = 0$. In each time slot, only one edge computing server is allocated to the BC system, thus $\sum_{e=1}^{E} a^e(t) = 1$.

IV. BLOCKCHAIN-BASED CONSENSUS PROTOCOL
In the previous sections, we have shown that the existing consensus protocols face challenges in SDIIoT, and that BC could be a potential approach to address these issues. In this section, we propose a novel BC-based consensus protocol in distributed SDIIoT. We begin with an overview of the BC-based consensus protocol. Then, we present the detailed steps of the consensus protocol, along with theoretical analysis.

A. Overview of BC-Based Consensus Protocol
Each controller collects its local events and OpenFlow commands as Transaction #1, Transaction #2, ..., Transaction #n, which is called the collection period. The format of a transaction is shown in Table I. The number of a transaction denotes its position. The signature and MAC ensure the integrity and authentication of the transaction. The payloads include local events and OpenFlow
and there is no cost at non-primary nodes.

2. The primary node multicasts a PRE-PREPARE message to the other replica nodes. After finishing the verification, primary $p$ sends a PRE-PREPARE message to all other replica nodes, as $\langle \text{PRE-PREPARE}, p, c, H(m) \rangle_{\sigma_{\vec{p}}}$, which is authenticated with MACs for each replica node. Here, $p$ and $H(m)$ are the primary node ID and the hash of the issued block, respectively. Each replica node verifies the MAC of primary $p$, as well as the signature and the MAC of each transaction. Then it enters the following steps.

Theoretical analysis. In this phase, primary $p$ generates $(N-1)$ MACs for all replicas. Each replica verifies one MAC from primary $p$, as well as $\frac{b}{g}$ signatures and MACs from transactions. Thus, the cost at the primary is
$$(N-1)\alpha, \tag{7}$$
and the cost at each replica node is
$$\alpha + \frac{b}{g}(\theta + \alpha). \tag{8}$$

3. The replicas send PREPARE messages to the others. After verifying the validity of the MACs and signatures, each replica responds to the PRE-PREPARE message by sending a PREPARE message to all nodes, as $\langle \text{PREPARE}, p, c, H(m), n \rangle_{\sigma_n}$, where $n$ denotes the replica node ID. When a replica node has collected $2f$ PREPARE messages that match its local PRE-PREPARE message, it enters the following steps.

Theoretical analysis. In this phase, primary $p$ needs to verify $2f$ MACs. Each replica node generates $(N-1)$ MACs and verifies $2f$ MACs. Therefore, the cost at the primary is

Theoretical analysis. In this phase, the primary and the replicas need to generate $\frac{b}{g}$ MACs for one controller. Therefore, the total costs at the primary and the replica are both
$$\frac{b}{g} C \alpha. \tag{12}$$
Thus, for one transaction, the cost at the primary is
$$\left(\frac{1}{g} + \frac{1}{b}\right)\theta + \left(\frac{1}{b} + \frac{C+1}{g}\right)\alpha + \frac{2N+4f-2}{b}\alpha. \tag{13}$$
For one transaction, the cost at the replica is
$$\frac{1}{g}\theta + \frac{2+C}{g}\alpha + \frac{2N+4f-2}{b}\alpha. \tag{14}$$

We assume that multi-core computation modules run in parallel on distinct cores. Each core has a computation speed of $\varphi$ Hz. In addition, we consider an uncivil execution where the primary is not fully trusted. The primary has trust $k \in [0, 1]$, which affects the system performance: a more trusted primary node has a larger $k$. Therefore, the throughput of the consensus protocol is at most
$$\min\left[\frac{k\varphi}{\left(\frac{1}{g} + \frac{1}{b}\right)\theta + \left(\frac{1}{b} + \frac{C+1}{g}\right)\alpha + \frac{2N+4f-2}{b}\alpha},\ \frac{k\varphi}{\frac{1}{g}\theta + \frac{2+C}{g}\alpha + \frac{2N+4f-2}{b}\alpha}\right] \text{trx/s}. \tag{15}$$

V. PROBLEM FORMULATION
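As a starting point for the formulation, the throughput bound in (15) can be made concrete with a short numerical sketch. The code below is only an illustration under stated assumptions: the function names are hypothetical, $\theta$ and $\alpha$ are read as the per-operation costs (in CPU cycles) of a signature and a MAC operation as suggested by (7) and (8), and the numerical values are arbitrary rather than taken from the paper's simulations.

```python
def per_transaction_costs(N, f, C, b, g, theta, alpha):
    """Per-transaction cost at the primary, as in (13), and at a replica, as in (14).

    theta, alpha: assumed per-signature and per-MAC costs in CPU cycles.
    b, g: the batching parameters that appear as b and g in (12)-(15).
    """
    shared = (2 * N + 4 * f - 2) / b * alpha  # term common to (13) and (14)
    primary = (1 / g + 1 / b) * theta + (1 / b + (C + 1) / g) * alpha + shared
    replica = (1 / g) * theta + ((2 + C) / g) * alpha + shared
    return primary, replica


def throughput_bound(N, f, C, b, g, theta, alpha, k, phi):
    """Upper bound on throughput in trx/s, following the min[...] form of (15).

    k: trust of the primary in [0, 1]; phi: core speed in Hz (cycles per second).
    """
    primary, replica = per_transaction_costs(N, f, C, b, g, theta, alpha)
    return min(k * phi / primary, k * phi / replica)


# Illustrative (made-up) parameters: 4 nodes tolerating 1 fault, 2 controllers.
example = throughput_bound(N=4, f=1, C=2, b=10, g=1,
                           theta=10_000, alpha=1_000, k=1.0, phi=3.02e6)
print(f"throughput bound: {example:.1f} trx/s")
```

Because $k$ multiplies both branches of the minimum, halving the trust of the primary halves the bound in this sketch, which is why view change (the choice of primary) enters the optimization.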
B. Action Space
The agent mainly needs to decide view changes (i.e., which node is the primary node), access selection (i.e., which controller can access the BC system), and computational resource allocation (i.e., which edge computing server should be allocated to the BC system). Thus, the action space is denoted by
$$A(t) = \{A^N(t), A^C(t), A^E(t)\}, \tag{17}$$
where $A^N(t)$, $A^C(t)$, and $A^E(t)$ represent:
• $A^N(t) = [a^1(t), a^2(t), ..., a^n(t), ..., a^N(t)]$, which indicates whether or not node $n$ is the primary. Here $a^n(t) \in \{0, 1\}$, where $a^n(t) = 1$ means that node $n$ is the primary node; otherwise it is a replica node. Note that the BC system has only one primary node, thus $\sum_{n=1}^{N} a^n(t) = 1$.
• $A^C(t) = [a^1(t), a^2(t), ..., a^c(t), ..., a^C(t)]$ decides which controller can access the BC system. Here $a^c(t) \in \{0, 1\}$, where $a^c(t) = 1$ means that controller $c$ can access the BC system; otherwise $a^c(t) = 0$. Note that in each time slot only one controller is able to access the BC system, thus $\sum_{c=1}^{C} a^c(t) = 1$.
• $A^E(t) = [a^1(t), a^2(t), ..., a^e(t), ..., a^E(t)]$ determines which edge computing server is allocated to the BC system. Here $a^e(t) \in \{0, 1\}$, where $a^e(t) = 1$ denotes that edge computing server $e$ is allocated; otherwise $a^e(t) = 0$. Similarly, $\sum_{e=1}^{E} a^e(t) = 1$.

C. Reward Function
In order to improve the throughput, we model the throughput of the BC system as the reward function. According to (15), the state space, and the action space, we define the reward function as:
$$r(t) = \min\left[\frac{k'\varphi'}{\left(\frac{1}{g'} + \frac{1}{b}\right)\theta + \left(\frac{1}{b} + \frac{C+1}{g'}\right)\alpha + \frac{2N+4f-2}{b}\alpha},\ \frac{k'\varphi'}{\frac{1}{g'}\theta + \frac{2+C}{g'}\alpha + \frac{2N+4f-2}{b}\alpha}\right] \text{trx/s}, \tag{18}$$
where $k' = \sum_{n=1}^{N} a^n(t)\delta^n(t)$, $\varphi' = \sum_{e=1}^{E} a^e(t)\mathrm{CompR}^e(t)$, and $g' = \sum_{c=1}^{C} a^c(t)\eta^c(t)$.

Based on the above problem formulation, the learning agent senses state $s(t)$ at time slot $t$ and outputs a policy $\pi$ that determines which action $a(t)$ should be taken. This action is then executed, i.e., one controller is allowed to access the BC system, one node becomes the primary node, and one edge computing server is allocated to the BC system. In order to let the learning agent remember the experience and act better next time, the immediate reward $r(t)$ is fed back to the learning agent. The trust features of nodes and controllers, as well as the computational capabilities of edge computing servers, change to the next state $s(t+1)$. The learning agent then senses them and outputs another new policy, and so on and so forth. The final goal is to achieve the maximum long-term reward. In summary, the interaction of the learning agent and the environment is shown in Fig. 4.

Fig. 4: The interaction of the learning agent and the environment.

Solving the above problem formulation poses the following challenges.
1) The target is to maximize the long-term reward through step-by-step control. However, the learning agent only senses the state at time slot $t$, and the action taken at time slot $t$ will affect the environment at time slot $t+1$. The state cannot be obtained ahead of time. Thus, a traditional optimization method that only considers the current state is not feasible.
2) Considering the trust features of nodes and controllers, as well as the computational capabilities of edge computing servers, the system is high-dimensional and highly dynamic. It is hard to make joint and optimal decisions with traditional methods.
3) In the BC system, the chosen action does not determine what happens in the next time slot. For example, when the learning agent selects one controller to access the BC system, the trust feature of that controller in the next time slot still changes according to its transition probability matrix, not according to the action. Therefore, a traditional optimization method that learns the relationship between states and actions is not suitable.

To address the above challenges, we propose a dueling deep Q-learning approach in the next section to achieve the maximum long-term reward.

VI. DUELING DEEP Q-LEARNING
After the problem formulation, we consider a dueling deep Q-learning approach to address this problem. In this section, we begin with an introduction to Q-learning and deep Q-learning. Then we present the dueling deep Q-learning approach that is used in this paper.
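Returning briefly to the action space in (17) and the quantities $k'$, $\varphi'$, and $g'$ used in (18), the following minimal sketch shows how the one-hot action vectors select a primary node, a controller, and an edge server. All function and variable names are hypothetical, and the trust and computation-rate values are invented for illustration; `comp_rate[e]` stands for $\zeta^e(t) s_m / q_m$, so multiplying by $a^e(t)$ reproduces $\mathrm{CompR}^e(t)$ from (5).

```python
from itertools import product


def one_hot(index, size):
    """Vector with a single 1 at `index`, so that its entries sum to 1."""
    return [1 if i == index else 0 for i in range(size)]


def joint_actions(N, C, E):
    """Enumerate every joint action (A^N, A^C, A^E) allowed by the sum-to-one constraints."""
    for n, c, e in product(range(N), range(C), range(E)):
        yield one_hot(n, N), one_hot(c, C), one_hot(e, E)


def selected_terms(action, delta, eta, comp_rate):
    """Compute k', phi', g' as the weighted sums defined after (18)."""
    a_n, a_c, a_e = action
    k_prime = sum(a * d for a, d in zip(a_n, delta))        # trust of the chosen primary
    phi_prime = sum(a * r for a, r in zip(a_e, comp_rate))  # rate of the chosen edge server
    g_prime = sum(a * h for a, h in zip(a_c, eta))          # trust level of the chosen controller
    return k_prime, phi_prime, g_prime


# Illustrative (made-up) state: 2 nodes, 3 controllers, 2 edge servers.
delta = [0.5, 0.9]
eta = [1, 2, 4]
comp_rate = [100.0, 200.0]

actions = list(joint_actions(N=2, C=3, E=2))
print(len(actions))  # number of joint actions, N * C * E
print(selected_terms(([0, 1], [0, 0, 1], [1, 0]), delta, eta, comp_rate))
```

Under these constraints the joint action space has $N \cdot C \cdot E$ elements, which grows quickly with the system size and is one reason a tabular method becomes impractical.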
A. Q-Learning
In the Q-learning model, the agent interacts with the envi-
ronment by perceptions and actions. In one interaction step,
the agent receives current state s(t) from the environment,
then selects an action a(t) as the output, and the value
of this action is measured by a scalar reward r(t). This
action generates next state s(t+1). The agent selects actions
to obtain the maximum long-term rewards. It learns to do
this over several interaction steps by systematic trials and
errors, guided by Q-learning [36]. Q-learning is a model-
free algorithm using delay rewards. It aims to find a policy
π, mapping states and actions, to maximize the long-term Fig. 5: The workflows of DQL.
rewards.
There are two popular approaches to denote the feedback
from each step in terms of long-term rewards, namely {state, action, reward, statenext } in a finite-sized cyclic
state-value function V π (s) and action-state value function buffer, and the agent randomly samples batches of them to
Qπ (s, a). V π (s) means the expected total reward in state s: train deep networks, instead of only the current ones. By
∞
$$V^{\pi}(s) = E^{\pi}\Big[\sum_{k=1}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big], \tag{19}$$

where $E^{\pi}[\cdot]$ denotes mathematical expectation, $r_{t+k+1}$ is the immediate reward at time slot $t + k + 1$, and $\gamma \in (0, 1)$ is the discount factor that balances immediate reward and future reward. Moreover, $Q^{\pi}(s, a)$ denotes the expected total reward for taking action a in state s:

$$Q^{\pi}(s, a) = E^{\pi}\Big[\sum_{k=1}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big]. \tag{20}$$

Q-learning evaluates Q(s, a) by a temporal difference method as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big), \tag{21}$$

where α is the learning rate, and α ∈ (0, 1]. The agent may choose the action with the maximum Q(s, a) at each step. In traditional Q-learning, each Q(s, a) is stored in a Q-table. However, as the dimension of the data grows rapidly, it becomes challenging to store every Q(s, a) in a Q-table.

B. Deep Q-Learning

The rise of deep learning has provided a new tool to overcome these challenges. The most important property of deep learning is that deep networks can find the low-dimensional features of high-dimensional data by adjusting their weights and biases. Therefore, many studies have advocated using deep networks instead of a Q-table to approximate Q(s, a), i.e., Q(s, a, ω) ≈ Q(s, a), where ω is the set of weights and biases in the deep networks [37]. This is the core idea of deep Q-learning (DQL).

To address the fundamental instability problem of approximating Q(s, a), DQL introduces two improvements, experience replay and fixed target networks: 1) Experience replay stores the transitions as a set of [...] this way, the temporal correlations that can adversely affect DQL are broken. 2) Fixed target networks have the same architecture as the evaluated ones, but are kept frozen for a period of time. The evaluated networks are trained in each step to minimize the loss function L(ω) so as to estimate the real Q(s, a), and L(ω) is represented as

$$L(\omega) = E\big[\big(r + \gamma \max_{a'} Q(s', a', \omega^{-}) - Q(s, a, \omega)\big)^{2}\big], \tag{22}$$

where $\omega^{-}$ is the set of weights and biases in the target networks, and ω is the set of weights and biases in the evaluated networks. During training, the weights and biases in the target networks are periodically updated from the evaluated networks. For a comprehensive perspective, we present the workflow of DQL in Fig. 5.

C. Dueling Deep Q-Learning Approach

However, for the majority of states in our system, the agent's choice of action has no repercussion on what happens, i.e., actions have no relationship with states. According to the work in [38], dueling DQL is more efficient than natural DQL. Based on this approach, some training processes have been carried out in real scenarios, such as driving games [38], and were simulated with better performance.

In dueling DQL, there is another value function, A(s, a), which represents the relative advantage of an action and is called the state-action value function here (it is more commonly known as the advantage function). Learning A(s, a) makes it easier to know which action has better consequences. Instead of one single stream following the output layer of the deep networks, there are two separate streams in dueling DQL: one computes the state-value function V(s), and the other computes the state-action value function A(s, a). This is called the dueling architecture, as shown in Fig. 6. Finally, these two streams are aggregated into one output Q(s, a). This combination module is denoted as follows:

$$Q(s, a; \omega, \varrho, \zeta) = V(s; \omega, \varrho) + A(s, a; \omega, \zeta), \tag{24}$$
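As a rough illustration of the update rules in this section, the following sketch evaluates Eqs. (21), (22), and (24) on plain Python lists. It is not the implementation used in the paper: the function names, the toy Q-table, and the hyperparameter values are hypothetical, and lists of numbers stand in for the outputs of the deep networks. The aggregation follows Eq. (24) exactly as printed, i.e., a plain sum of the two streams.

```python
"""Sketch of Eqs. (21), (22) and (24) using plain Python lists.

All names and numeric values here are illustrative assumptions; lists of
floats stand in for the outputs of the evaluated and target networks.
"""


def q_learning_update(q, s, a, r, s_next, alpha=0.5, gamma=0.5):
    """Eq. (21): one temporal-difference update of a tabular Q(s, a)."""
    td_target = r + gamma * max(q[s_next])      # r + gamma * max_a' Q(s', a')
    q[s][a] += alpha * (td_target - q[s][a])    # move Q(s, a) toward the target
    return q[s][a]


def dueling_q(v, advantages):
    """Eq. (24): aggregate the two streams, Q(s, a) = V(s) + A(s, a)."""
    return [v + adv for adv in advantages]


def dql_loss(batch, gamma=0.5):
    """Eq. (22): mean squared TD error over a batch of transitions.

    Each item is (r, q_next_target, q_sa), where q_next_target lists the
    frozen target-network values Q(s', a', w-) for every action a', and
    q_sa is the evaluated-network value Q(s, a, w).
    """
    errors = [(r + gamma * max(q_next) - q_sa) ** 2
              for r, q_next, q_sa in batch]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    q_table = [[0.0, 0.0], [1.0, 2.0]]  # hypothetical: 2 states x 2 actions
    print(q_learning_update(q_table, s=0, a=1, r=1.0, s_next=1))
    print(dueling_q(0.5, [0.25, -0.25]))
    print(dql_loss([(1.0, [0.0, 2.0], 1.0)]))
```

In an actual DQL agent, the values passed to `dql_loss` would come from the evaluated and target networks, and the loss would be minimized by gradient descent on ω rather than computed once.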
[Figure: throughput (trx/s) versus the number of controllers (10 to 30).]
Fig. 11 shows the learning loss of natural DQL and dueling DQL. The learning loss of dueling DQL decreases more quickly than that of natural DQL, which indicates that dueling DQL learns more effectively. The reason is that in our BC system, the choices of which node is the primary, which controller can access the BC system, and which edge computing server should execute the [...] natural DQL.

After the effective training of the deep networks, we use them in the following simulations. Fig. 12 shows the relationship between the number of controllers and the system throughput under different schemes. This figure also shows the performance of these schemes in a large real SDIIoT environment, where up to 30 controllers are considered in this simulation. As the number of controllers increases, the throughput of the BC system decreases. The reason is that more controllers require more computational operations for verifying signatures and MACs. However, by jointly considering the trust features of controllers and nodes and by using edge computing servers, our proposed scheme, shown as the blue curve, performs better than the other schemes. Thus, our proposed scheme still performs better in a large SDIIoT environment.

Fig. 13 shows the relationship between the number of BC nodes and the system throughput under different schemes. As we can see, more nodes lead to lower system throughput. The reason is that as the number of nodes increases, more signatures and MACs need to be verified and generated, which requires more CPU cycles and therefore decreases the system throughput. Nevertheless, the performance of our proposed scheme is still the best.

[Figure: throughput (trx/s) versus the number of consensus nodes (6 to 20). Legend: DuelingDQL-based scheme; DuelingDQL-based scheme without controller choice; DuelingDQL-based scheme without node choice; DuelingDQL-based scheme without computation offloading; Existing scheme.]
Fig. 13: The throughput versus the number of BC nodes under different schemes.

Fig. 14 shows the relationship between the batch size of a block and the system throughput. A bigger block can contain more transactions and thus synchronize more local network events among controllers, which increases the system throughput. As we can see, our scheme also performs better here.
[Figure: throughput (trx/s). Legend: DuelingDQL-based scheme; DuelingDQL-based scheme without controller choice; DuelingDQL-based scheme without node choice; DuelingDQL-based scheme without computation offloading.]

[2] L. Cui, F. R. Yu, and Q. Yan, “When big data meets software-defined networking: SDN for big data and big data for SDN,” IEEE Net., vol. 30, no. 1, pp. 58–65, 2016.
[3] J. Wan, S. Tang, Z. Shu, D. Li, S. Wang, M. Imran, and A. V. Vasilakos, “Software-defined industrial internet of things in the context of industry 4.0,” IEEE Sensors Journal, vol. 16, no. 20, pp. 7373–7380, 2016.
[4] H. R. Faragardi, H. Fotohi, T. Nolte, and R. Rahmani, “A cost efficient