Adaptive Resource Allocation in Future Wireless Networks With Blockchain and Mobile Edge Computing

IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 19, NO. 3, MARCH 2020
Fengxian Guo, F. Richard Yu, Fellow, IEEE, Heli Zhang, Hong Ji, Senior Member, IEEE, Mengting Liu, and Victor C. M. Leung, Fellow, IEEE
Abstract— In this paper, we present a blockchain-based mobile edge computing (B-MEC) framework for adaptive resource allocation and computation offloading in future wireless networks, where the blockchain works as an overlaid system to provide management and control functions. In this framework, how to reach a consensus between the nodes while simultaneously guaranteeing the performance of both MEC and blockchain systems is a major challenge. Meanwhile, resource allocation, block size, and the number of consecutive blocks produced by each producer are critical to the performance of B-MEC. Therefore, an adaptive resource allocation and block generation scheme is proposed. To improve the throughput of the overlaid blockchain system and the quality of service (QoS) of the users in the underlaid MEC system, spectrum allocation, block size, and the number of blocks produced by each producer are formulated as a joint optimization problem, where the time-varying wireless links and computation capacity of the MEC servers are considered. Since this problem is intractable using traditional methods, we resort to the deep reinforcement learning approach. Simulation results show the effectiveness of the proposed approach by comparing with other baseline methods.

Index Terms— Mobile edge computing, computation offloading, blockchain, deep reinforcement learning.

Manuscript received January 6, 2019; revised May 27, 2019, September 4, 2019, and November 11, 2019; accepted November 12, 2019. Date of publication December 9, 2019; date of current version March 10, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61671088 and Grant 61771070, in part by the Beijing University of Posts and Telecommunications (BUPT) Excellent Ph.D. Students Foundation under Grant CX2018201, and in part by the Canadian Natural Sciences and Engineering Research Council under Grant RGPIN-2019-06348. The associate editor coordinating the review of this article and approving it for publication was L. Le. (Corresponding author: Hong Ji.)

F. Guo, H. Zhang, and H. Ji are with the Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: fengxianguo@bupt.edu.cn; zhangheli@bupt.edu.cn; jihong@bupt.edu.cn).

F. R. Yu is with the Department of Systems and Computer Engineering, Carleton University, Ottawa, ON K1S 5B6, Canada (e-mail: richard.yu@carleton.ca).

M. Liu is with the Beijing Key Laboratory of Space-ground Interconnection and Convergence, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: liumengting@bupt.edu.cn).

V. C. M. Leung is with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China, and also with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: vleung@ieee.org).

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TWC.2019.2956519

1536-1276 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

THE progressive miniaturization of hardware is enabling the massive deployment of smart mobile devices [1]. Meanwhile, new applications are developing in the directions of the Internet of things (IoT), Internet of vehicles (IoV), e-healthcare, tactile Internet, and so on. However, the deployment of these applications is restricted by the energy, memory size, and computation resources of mobile devices [2]. These emerging applications, which are computation-intensive and latency-sensitive, can rely on advanced wireless technologies and computation offloading. Future wireless networks are required not only to support massive wireless access but also to provide computation offloading for mobile users.

To meet the demands of mobile users, future wireless networks will become more heterogeneous and dense. As wireless networks have grown more capable, the scarcity of spectrum has remained an impediment throughout the evolution of cellular networks from the first generation (1G) to the upcoming fifth generation (5G) [3]. One reason is the binary nature of the current spectrum access approach, i.e., licensed and unlicensed, which is an intentional set of policy choices. To improve spectrum efficiency, dynamic spectrum access is becoming the norm. However, with an unprecedented level of network densification in the future, spectrum management becomes highly complex. Thus, smarter and more decentralized dynamic spectrum access techniques are preferred.

In future communication networks, edge clouds will be deployed in the heterogeneous network and will be able to provide computation offloading services to users [4]. One of the promising paradigms is mobile edge computing (MEC) [5]. Many outstanding works have been done on computation offloading [6]–[10], in which resource allocation, collaboration, offloading strategies, and pricing algorithms are investigated. However, some problems in this distributed and untrusted environment have not been considered. First, because of the self-deployment nature and coexistence of multiple radio access service providers (SPs) and edge cloud vendors, it is impractical to jointly deploy or coordinate all the system resources (e.g., caching, computing, and networking), yet this is precisely what the logical system of traditional MEC requires. Second, there is no trusted entity in the system to audit the computation offloading process or to ensure proper and guaranteed payments to the SPs and edge cloud vendors. Third, privacy is often cited as one of the key concerns in cloud adoption, especially when sensitive or personal information is outsourced to the edge cloud vendors. Few cloud SPs can be fully trusted by end-users [11]. Hence, new expectations are set for a decentralized, self-organized, trusted computation offloading system.
As the core technology behind Bitcoin and Ethereum, blockchain, which is in essence a secure, shared, and distributed ledger, has gained popularity in academia and industry [12]. It allows two parties in a peer-to-peer network to communicate and exchange resources, where decisions are made in a distributed manner by the majority rather than by a single centralized entity [13]. Thanks to decentralization and the other appealing features of blockchain, it is considered a candidate technology to establish a secure and self-organized MEC ecosystem for future wireless networks.

To address these challenges, in this paper, we propose a novel blockchain-based MEC (B-MEC) framework for resource allocation in future wireless networks with MEC. A blockchain is a kind of distributed ledger without any centralized trusted auditors [14], which is public, immutable, and append-only [15]. Attracted by the characteristics of blockchains in terms of decentralization, anonymity, and trust, researchers have developed significant interest in blockchains, e.g., resource allocation [16], [17], secure data storage and sharing in vehicular edge networks [18], security in IoT [19], [20], and electric vehicle networks [21]. In this paper, we focus on the adaptive resource allocation issue in future wireless networks with blockchains and MEC. Our contributions are summarized as follows:

• We develop a B-MEC framework for adaptive resource allocation and computation offloading in future wireless networks, considering the issues arising from the properties of heterogeneous wireless networks and MEC. This framework deploys a consensus protocol based on practical Byzantine fault tolerance (PBFT) and delegated proof of stake (DPoS).
• The details and theoretical analysis of the consensus protocol are presented, where the computation task execution is self-organized by smart contracts. The performance of the blockchain system is given, i.e., throughput, time to finality, decentralization, and security.
• By jointly considering the computation task execution in smart contracts and blockchain consensus maintenance, we formulate the spectrum allocation, block size, and number of consecutive blocks produced by the block producers as an optimization problem, which is described as a Markov decision process (MDP) by defining the state space, action space, and reward function. The goal of the main problem is to optimize the performance of the joint MEC and blockchain system.
• To handle the high dynamics of this system, we propose to solve this problem with a novel deep reinforcement learning (DRL) approach based on a double-dueling deep Q network (DQN). Google TensorFlow is used to implement the double-dueling DQN.
• Simulation results show the effectiveness of the proposed approach with various parameters by comparing with other baseline algorithms.

The rest of this paper is organized as follows. Section II describes the framework of future wireless networks with blockchains and MEC. Section III introduces the system model, followed by the consensus protocol in Section IV. The performance of the MEC and blockchain systems is presented in Section V. In Section VI, the main problem is formulated. It is solved with a novel DRL approach in Section VII. Section VIII presents and discusses the simulation results. Finally, the conclusion and future work are given in Section IX.

Fig. 1. Blockchain-enabled MEC in the heterogeneous wireless networks.

II. SYSTEM DESCRIPTION

In this section, we first describe the framework of future wireless networks with blockchains and MEC, including some of its concepts. Then some challenges in this framework are discussed.

A. Future Wireless Networks With Blockchain and MEC

As shown in Fig. 1, there are three layers in this framework, i.e., users, edge networks (i.e., heterogeneous wireless networks with MEC), and blockchain. Before describing this framework, some concepts are introduced as follows.

• Clients: mobile users in this system, which submit offloading requests to the B-MEC system.
• Replicas or block producers: BSs with MEC servers, which are selected from the BSs and provide offloading services by using blockchain technology.
• Primary node: the one selected from all replicas, which is authorized to produce blocks at a certain time period.
• Backup nodes or validators: the block producers except the primary node, which play the role of validators.
• Transactions: the offloading requests generated by the users, which are handled by the block producers.

In this paper, the blockchain serves as an overlaid system to provide management and control functions to the underlaid MEC system. With the blockchain, data delivery and computation task execution are self-organized by smart
contracts [22], which can provide an incentive to ensure the interests of different parties in a trust-less computing market. The transactions record the computation offloading requests from the mobile users. These transaction records are jointly approved by the consensus nodes selected from the blockchain nodes, and then digitally stored in all nodes' local blockchain replicas. Since mobile devices usually have limited storage, the full ledger is stored only in the blockchain nodes, i.e., the edge servers in this paper. Fortunately, the blockchain provides the users with publicly accessible records of the transactions in this network, which are encrypted to guarantee privacy.

In this network, all the BSs listen for the offloading requests. When it is its turn to produce blocks, the primary node validates and processes the transactions with smart contracts. Then the computing results are sent back to the users and the transactions are packaged into a new block. After that comes a consensus procedure, which is described in detail in Section IV. After consensus is reached among the nodes, the block is appended to the blockchain, which means the block reaches finality.

B. Key Challenges of Computation Offloading in This Framework

Considering the time-varying wireless channels and spectrum efficiency, an adaptive spectrum allocation scheme is needed. Although the blockchain maintains non-repudiation and non-tampering properties, it is difficult to use it in wireless networks directly, since currently employed blockchain technology doesn't take the limitation of resources and the time-varying features of wireless networks into consideration. Hence, an adaptive consensus protocol is required for a wireless blockchain network. Furthermore, since the blockchain provides management and control functions to the MEC system, the performance of the blockchain is vital to ensure the quality of experience (QoE) of the mobile users. Hence, we need to guarantee the performance of the blockchain, i.e., throughput, decentralization, time to finality, and security, while concurrently ensuring the performance of MEC, e.g., the quality of service (QoS) of the users.

III. SYSTEM MODEL

In this section, we introduce the system model of this work. We first present the network model, and then introduce the communication model and the computation model. In this paper, a discrete-time slotted system is applied, where time is partitioned into discrete time periods $\mathcal{T} = \{1, \ldots, t, \ldots, T\}$, and each time period $t$ has a constant duration $\dot{T}$.

A. Network Model

As seen in Fig. 1, we assume that there are $I$ BSs and $N$ block producers in the heterogeneous wireless networks. In particular, the block producers are selected from the BSs and are indexed by $\mathcal{B} = \{B_1, \ldots, B_n, \ldots, B_N\}$. Each of them is equipped with an MEC server, the computation capacity of which is denoted by $F_{B_n}$ (in Hz). Also, there are $M$ users located around the BSs, each of which has a number of computational tasks (e.g., online games, navigation, VR, health monitoring, and so on) to be completed. The users are denoted by $\mathcal{U} = \{U_1, \ldots, U_m, \ldots, U_M\}$. To complete the tasks, the users are assumed to offload the tasks to the BSs. In this paper, we don't consider local execution on the mobile devices.

In the blockchain system, the consensus protocol adopts the ideas of both DPoS and PBFT. The selection of block producers is important in DPoS and has been well studied in existing works [23]; thus, we don't consider it in this paper. PBFT provides safety and liveness as long as there are at most $\lfloor (N - 1)/3 \rfloor$ faulty nodes. In the B-MEC system, assume that the $N$ block producers take turns to produce $K$ blocks within interval $\dot{T}$ (in seconds), each of size $S_B$ (in bits). $K$ varies across different time periods to account for the time-varying characteristics of wireless networks.

Computation offloading in this system includes four phases: 1) submitting the offloading requests to the blockchain system, 2) executing smart contracts by the block producers, 3) sending back the results to the users, and 4) reaching consensus among the block producers. It involves a communication model and a computation model, which are presented next.

B. Communication Model

The B-MEC system involves two types of wireless links: 1) data transmission from the users to the BSs, and 2) message delivery among the BSs.

To capture the characteristics of the time-varying wireless channels in this system, we resort to the finite state Markov channel (FSMC) model. In this paper, we define the channel state according to the received signal-to-noise ratio (SNR). The amplitude of the received SNR is partitioned into $L$ non-overlapping levels, and there are $L + 1$ corresponding thresholds $\{h_l, l = 0, 1, 2, \ldots, L\}$. Generally, $h_0$ and $h_L$ are known and denote the minimum and maximum measured values of the SNR. The available state set of the finite Markov chain is given by $\mathcal{H} = \{H_1, H_2, \ldots, H_l, \ldots, H_L\}$. Let $\gamma$ denote the channel state, the realization of which at time period $t$ is denoted by $\Gamma(t)$. In detail, we have $\Gamma(t) = H_l$ when $\gamma \in [h_{l-1}, h_l)$.

We assume the channel is block-fading, where the received SNR is constant during one time period, but evolves between different time periods according to a set of Markov transition probabilities. Let $p_{l,g} = \Pr\{\Gamma_{t+1} = H_l \mid \Gamma_t = H_g\}$ be the transition probability from state $H_g$ to $H_l$, where $l, g \in \{1, \ldots, L\}$. Then the $L \times L$ transition probability matrix can be defined as $P = [p_{l,g}]_{L \times L}$.

Based on the presented FSMC model, let $\Gamma_{U_m,B_n}$ be the SNR from user $U_m$ to BS $B_n$, and $\Gamma_{B_n,B_{n'}}$ be the SNR from BS $B_n$ to BS $B_{n'}$.

Multicast OFDMA [24] is considered in this paper, where a sub-channel is used by one transmitter to multicast the same message to several receivers. Assume that there exist $E$ sub-channels with the same bandwidth $W_0$ over the whole spectrum bandwidth $W$, where $E \ge M + N$. As noted,
each user and each BS has the chance to be the transmitter of one multicast group. Assume that user $U_m$ is allocated $W_{U_m}$ sub-channels, and BS $B_n$ is assigned $W_{B_n}$ sub-channels. Due to the limited wireless spectrum, the following constraint should be met:

$$\sum_{U_m \in \mathcal{U}} W_{U_m} W_0 + \sum_{B_n \in \mathcal{B}} W_{B_n} W_0 \le W, \tag{1}$$

which means the spectrum allocated to all users and BSs should not exceed the total bandwidth in this system.

Hence, the received data rate of user $U_m$ at BS $B_n$ is

$$R_{U_m,B_n} = W_{U_m} W_0 \log(1 + \Gamma_{U_m,B_n}). \tag{2}$$

The received data rate from BS $B_n$ to $B_{n'}$ is expressed by

$$R_{B_n,B_{n'}} = W_{B_n} W_0 \log(1 + \Gamma_{B_n,B_{n'}}). \tag{3}$$

Since each BS receives the broadcast messages from all the users and the other BSs, the backhaul data rate should not exceed the received data rate. Thus, we have

$$\sum_{U_m \in \mathcal{U}} R_{U_m,B_n} + \sum_{n' \in \mathcal{B}/\{n\}} R_{B_{n'},B_n} \ge R_{B_n,B_{n'}}. \tag{4}$$

C. Computation Model

In this system, the blocks are generated continuously without waiting for confirmation. As a result, several unconfirmed blocks, on which consensus still needs to be reached, may exist in the system at the same time. Hence, each BS needs to process different computation tasks from different blocks, i.e., executing smart contracts, generating and verifying signatures, and generating and verifying MACs. Each message that a BS receives contains certain computation tasks. Assume that the computation task of message $s$ is denoted by a 2-tuple $I_s = \{d_s, f_s\}$, where $d_s$ denotes the data size of message $s$, while $f_s$ is the total number of CPU cycles to complete this task. In particular, define $s \in \{p, v\}$, where $p$ denotes producing and $v$ denotes verifying.

Denote the computation capacity assigned to one message by $F_s$ (in Hz or CPU cycles per second). In this system, one BS may process different messages at the same time. For example, the primary node needs to produce blocks while verifying the other messages from the other BSs. As a result, we don't exactly know the computational resources for each message at the next time instant. Hence, we model the evolution of $F_s$ as a finite state Markov process. The computation capacity is partitioned into $Y$ non-overlapping levels, which are expressed as $\mathcal{C} = \{C_1, \ldots, C_Y\}$. The realization of $F_s$ at time period $t$ is $\Psi_s(t) = C_y$ when $F_s \in [c_{y-1}, c_y]$. Here $C_y \in \mathcal{C}$ and $c_0 = 0 < c_1 < \ldots < c_Y = F_{\max}$, where $F_{\max}$ denotes the maximum computation capacity of the processing BS.

For simplicity, we assume that the computation capacity allocated to message $s$ is constant in one specific time period, but it evolves to the next state according to the transition probability. Let $q_{x,y}$ denote the probability that $\Psi_s(t)$ moves from state $x$ to state $y$ at time period $t$. The $Y \times Y$ state transition probability matrix is defined as $Q = [q_{x,y}]_{Y \times Y}$, where $q_{x,y} = \Pr(\Psi_s(t + 1) = y \mid \Psi_s(t) = x)$ and $x, y \in \mathcal{C}$.

Based on the above model, let $\Psi_{B_n,s}(t)$ denote the computation resources assigned by BS $B_n$ to message $s$. The execution time for completing task $I_s$ at BS $B_n$ can be calculated by

$$T_{B_n,s} = \frac{f_s}{\Psi_{B_n,s}(t)}. \tag{5}$$

IV. ADAPTIVE CONSENSUS PROTOCOL

In this section, we present the adaptive consensus protocol proposed in this paper, which is based on PBFT.

A. Overview of the Consensus Protocol

After a new block is generated, it is broadcast to the validators. When the validators verify the new block, their signatures are added to the block. After consensus is reached, the new block is appended to the blockchain.

A signed block contains the block number, block size, block header, signed block summary, and transactions. The block header contains the version of the block, the hash of the previous block, a timestamp (i.e., the creation time of the new block), the Merkle root of all the transactions, and the IDs of the validators. The signed block summary contains a summary of the transactions, e.g., the number of included transactions and the structure of the transactions. Note that it contains only the summary of the transactions, while the complete transactions are attached at the end of the block. The transactions mainly contain the transaction number, ID, scope, smart contract, signature, and MAC. Here the smart contract is used to complete the computation tasks offloaded by the users, and the required transaction data represents what is required to complete the computation tasks, e.g., associated program codes or captured images.

B. Theoretical Analysis

In the following, we give detailed steps and a theoretical analysis of the consensus protocol. Compared with traditional systems without blockchain, this blockchain system needs to validate the signatures and MACs in addition to executing the offloaded tasks, which can be treated as a computation overhead of executing the transactions. Since smart contracts are in charge of the execution of offloaded tasks, we assume that the computation of the offloaded tasks is included in the execution of smart contracts. It is assumed that executing the smart contract for one transaction, generating or verifying one signature, and generating or verifying one MAC require $\alpha$, $\beta$, and $\theta$ CPU cycles, respectively. Based on PBFT, the consensus protocol consists of the following five steps.

1) Request: In one time period, the users submit offloading requests to the replicas, and the requests are stored as transactions in the pending pool. The primary node, which is one of the replicas, selects its preferred transactions. For each transaction, the primary node first verifies its signature. If it is valid, the primary node then verifies the transaction's MAC. If that is also valid, the primary node executes the smart contracts for this transaction. After verifying the preferred transactions, all transactions, computing results, and other important information will be
packaged into the new block. This procedure occurs within the block interval $\dot{T}/K$.

Assume that the average size of one transaction is denoted by $\chi$. In this phase, the transmission latency $T^{tr}_{req}$ can be expressed by

$$T^{tr}_{req} = \max_{U_m \in \mathcal{U}} \left\{ \frac{\chi}{R_{U_m,B_p}} \right\}, \tag{6}$$

where $R_{U_m,B_p}$ denotes the transmission rate from user $U_m$ to the primary node $B_p$.

Considering the size of one block, the maximum number of transactions that can be included in a block is $S_B/\chi$. Following the uncivil-execution assumption, a fraction $g$ of the transactions submitted by the clients are correct [25]. In this phase, the primary node needs to verify the signatures and MACs for $S_B/(g\chi)$ transactions, and it also needs to execute smart contracts for $S_B/\chi$ transactions. Hence, the computation cost at the primary node is $\Delta_{req,B_p} = \frac{S_B(\beta + \theta)}{g\chi} + \frac{S_B \alpha}{\chi}$. Thus, the computation delay is

$$T^{c}_{req} = \frac{\Delta_{req,B_p}}{F_{B_p,p}}. \tag{7}$$

As noted, there is no computation cost at the backup nodes.

2) Pre-Prepare: After producing the new block, the primary node multicasts the signed block along with a pre-prepare message to all the backup nodes for validation, where the pre-prepare message contains the ID and signature of the primary node and the hash of the new block. Since the smart contract is in charge of the execution of the offloaded computation tasks, the backup nodes need to make sure that the offloaded tasks are actually executed by the primary node, in addition to validating the identities and the economic parts. In this case, the intuitive method is to execute the smart contracts and compare the computation results.

Hence, after receiving the pre-prepare message and the new block, the backup nodes first verify the signature and MAC of the block, then the signatures and MACs of the transactions. Unlike the work in [20], the backup nodes then execute the smart contracts to validate the transactions. If the pre-prepare message is accepted by a backup node, it enters the next step.

In this phase, the transmission latency $T^{tr}_{prep}$ can be calculated by

$$T^{tr}_{prep} = \max_{B_n \in \mathcal{B}/\{B_p\}} \left\{ \frac{S_B}{R_{B_p,B_n}} \right\}. \tag{8}$$

As noted, the primary node needs to generate one signature and $N - 1$ MACs in this phase, the cost of which is given by $\Delta_{prep,B_p} = \beta + (N - 1)\theta$. The computation cost at the backup nodes is $\Delta_{prep,B_n} = \beta + \theta + \frac{S_B}{\chi}(\alpha + \beta + \theta)$, where $B_n \ne B_p$. Hence, the computation latency in this phase is

$$T^{c}_{prep} = \max_{B_n \in \mathcal{B}} \left\{ \frac{\Delta_{prep,B_n}}{F_{B_n,v}} \right\}. \tag{9}$$

3) Prepare: After verifying the new block, each backup node sends a prepare message, which contains the replica ID and the signature, to all the other replicas. Each replica checks the prepare message to make sure that it is consistent with the pre-prepare message. Once a replica receives $2f$ matching prepare messages from the other replicas, it enters the next step.

In this phase, the transmission cost is caused by sending the prepare message to all other replicas, which can be calculated by

$$T^{tr}_{pre} = \max_{B_n, B_{n'} \in \mathcal{B}/\{B_p\},\, B_n \ne B_{n'}} \left\{ \frac{S_B}{R_{B_n,B_{n'}}} \right\}. \tag{10}$$

For the computation cost, the primary node needs to verify $2f$ signatures and MACs from the other replicas, which can be expressed by $\Delta_{pre,B_p} = 2f(\beta + \theta)$. Each of the other backup nodes needs to generate a signature and $N - 1$ MACs for the prepare message. Then $2f$ signatures and MACs need to be validated. Hence, the computation cost at the backup nodes $B_n$ ($\ne B_p$) can be given by $\Delta_{pre,B_n} = \beta + (N - 1)\theta + 2f(\beta + \theta)$. Hence, the computation latency in this phase is

$$T^{c}_{pre} = \max_{B_n \in \mathcal{B}} \left\{ \frac{\Delta_{pre,B_n}}{F_{B_n,v}} \right\}. \tag{11}$$

4) Commit: After receiving $2f$ matching prepare messages from the other replicas that are consistent with the pre-prepare message, each replica sends a commit message, which includes the ID and signature of the replica, to all the others. Once a replica receives $2f$ matching commit messages, it enters the next step.

To deliver the commit messages, the transmission latency can be expressed by

$$T^{tr}_{c} = \max_{B_n, B_{n'} \in \mathcal{B},\, B_n \ne B_{n'}} \left\{ \frac{S_B}{R_{B_n,B_{n'}}} \right\}. \tag{12}$$

In this phase, each replica needs to generate one signature and $N - 1$ MACs to form the commit messages. After receiving the commit messages, each replica needs to verify $2f$ signatures and MACs. Hence, the computation cost at each replica is $\Delta_{c,B_n} = \beta + (N - 1)\theta + 2f(\beta + \theta)$. Hence, the computation latency in this phase is

$$T^{c}_{c} = \max_{B_n \in \mathcal{B}} \left\{ \frac{\Delta_{c,B_n}}{F_{B_n,v}} \right\}. \tag{13}$$

5) Reply: After collecting $2f$ matching commit messages, the new block becomes valid and is appended to the blockchain. A reply message is then delivered, which includes the signature, the ID, and the computation result for the offloading task. Different from the original BFT protocol [25], the reply message is delivered to the primary node, instead of the clients, due to the mobile devices' limited memory size.

In this phase, the transmission cost is

$$T^{tr}_{r} = \max_{B_n \in \mathcal{B}/\{B_p\}} \left\{ \frac{S_B}{R_{B_n,B_p}} \right\}. \tag{14}$$

For the computation cost, each backup node needs to generate $S_B/\chi$ signatures and $S_B/\chi$ MACs for the primary node, which can be given by $\Delta_{r,B_n} = \frac{S_B}{\chi}(\beta + \theta)$. The primary node needs to verify $2f$ signatures and MACs, the computation cost of which is given by $\Delta_{r,B_p} = 2f(\beta + \theta)$. Hence, the computation latency in this phase is

$$T^{c}_{r} = \max_{B_n \in \mathcal{B}} \left\{ \frac{\Delta_{r,B_n}}{F_{B_n,v}} \right\}. \tag{15}$$
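The per-phase computation costs of the five consensus steps can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `ConsensusParams`, `computation_costs`, and `phase_latency` are hypothetical, the transaction size field stands for the average transaction size used in the Request step, and the numeric values in the example are arbitrary rather than taken from the paper's simulations.

```python
from dataclasses import dataclass


@dataclass
class ConsensusParams:
    """Hypothetical container for the symbols used in Section IV.B."""
    n_replicas: int    # N, number of block producers
    f_faulty: int      # f, as used in the 2f matching-message thresholds
    block_bits: float  # S_B, block size in bits
    tx_bits: float     # average transaction size in bits
    g: float           # fraction of submitted transactions that are correct
    alpha: float       # CPU cycles to execute one smart contract
    beta: float        # CPU cycles to generate or verify one signature
    theta: float       # CPU cycles to generate or verify one MAC


def computation_costs(p):
    """Return per-phase CPU-cycle costs for the primary and for one backup node."""
    n_tx = p.block_bits / p.tx_bits                       # transactions per block
    verify_quorum = 2 * p.f_faulty * (p.beta + p.theta)   # verify 2f signatures and MACs
    sign_and_mac = p.beta + (p.n_replicas - 1) * p.theta  # one signature, N - 1 MACs
    primary = {
        "request": n_tx * ((p.beta + p.theta) / p.g + p.alpha),
        "pre_prepare": sign_and_mac,
        "prepare": verify_quorum,
        "commit": sign_and_mac + verify_quorum,
        "reply": verify_quorum,
    }
    backup = {
        "request": 0.0,  # no computation at the backup nodes in this phase
        "pre_prepare": p.beta + p.theta + n_tx * (p.alpha + p.beta + p.theta),
        "prepare": sign_and_mac + verify_quorum,
        "commit": sign_and_mac + verify_quorum,
        "reply": n_tx * (p.beta + p.theta),
    }
    return primary, backup


def phase_latency(cost_cycles, capacity_hz):
    """Computation latency in seconds: CPU cycles divided by cycles per second."""
    return cost_cycles / capacity_hz


# Arbitrary illustrative values (assumptions, not the paper's settings).
example = ConsensusParams(n_replicas=4, f_faulty=1, block_bits=8000.0,
                          tx_bits=200.0, g=0.5, alpha=1000.0, beta=100.0,
                          theta=10.0)
primary, backup = computation_costs(example)
```

Dividing any of these costs by the capacity assigned to producing or verifying, via `phase_latency`, gives the corresponding computation latency for a single node; the maximum over nodes in (9), (11), (13), and (15) would then be taken over the per-node results.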
V. PERFORMANCE ANALYSIS

In this section, we give details of the performance of the MEC system and the blockchain system. For the MEC system, the QoS of the users is given in terms of delay, which is the time from submitting the requests to receiving the results. In the blockchain system, the most important criteria to measure the system performance are throughput, time to finality, decentralization, and security. To address the four-way trade-off issue, the four properties will be presented in this section.

A. Performance of MEC

To measure the QoS of the users, the delay experienced by the users is introduced, which consists of three parts: submitting the requests to the BS, executing the offloading tasks (executing smart contracts), and sending back the results to the users. As analyzed in Section IV.B, the transmission latency $T^{tr}_{req}$ of submitting the requests is given by expression (6).

As analyzed in Section IV.B, the primary node first verifies the signature and MAC of the requests, then executes the offloading requests. The computation cost for one offloading request is $\alpha + \beta + \theta$. The processing latency consists of two parts, queuing delay and executing delay. For each transaction, the executing delay is

$$T^{e} = \frac{\alpha + \beta + \theta}{F_{B_p,p}}. \tag{16}$$

Hence, the average queuing delay can be expressed by

$$T^{q} = \frac{1}{2}\left(\frac{S_B}{\chi} - 1\right) T^{e}. \tag{17}$$

In this work, we don't consider the procedure of sending back the results, as in [9], since the size of the output may be much smaller than the input data, which corresponds to many practical scenarios, such as virus detection, face recognition, and video analysis. Hence, the average delay experienced by the users is

$$T_U = T^{tr}_{req} + T^{e} + T^{q}. \tag{18}$$

B. Performance of the Blockchain System

1) Throughput: The throughput of the blockchain system can be measured by the number of transactions that can be processed successfully in unit time, which is related to two procedures, i.e., block generation and consensus reaching.

Block production is limited by the block size and the processing capacity of the primary node. Considering the block size, the number of transactions that can be included into blocks per unit time is

$$\Xi(S_B, K) = \frac{S_B K}{\chi \dot{T}}. \tag{19}$$

The computation cost of producing one block is shown in Section IV.B. We assume that the computation resources of the primary node that are assigned to produce blocks are $F_{p,p}$ Hz. Considering the limited computation resources, the following constraint should be met:

$$\frac{\frac{S_B}{\chi}\left(\frac{\beta + \theta}{g} + \alpha\right)}{F_{p,p}} \le \frac{\dot{T}}{K}, \tag{20}$$

the left-hand side of which denotes the processing time to produce a block.

Since the primary node produces $K$ blocks continuously, the last several blocks may be ignored due to the propagation delay to the next primary node. We assume the transmission data rate from the current primary node to the next one is $R_{p,p+1}$. Hence, the number of ignored blocks can be calculated by

$$IB(S_B, K) = \frac{S_B / R_{p,p+1}}{\dot{T}/K} - 1. \tag{21}$$

Note that $IB \le K$, since the system cannot miss more blocks than it produces.

The throughput of the consensus protocol can be expressed by

$$\Upsilon(S_B, K, \mathbf{W}) = \frac{K - IB}{K}\,\Xi = \frac{S_B}{\chi \dot{T}}\left(K - \frac{S_B K}{R_{p,p+1} \dot{T}} + 1\right), \tag{22}$$

where $\mathbf{W} = \{W_{U_m}, W_{B_n}, U_m \in \mathcal{U}, B_n \in \mathcal{B}\}$ denotes the spectrum allocation profile. $\Upsilon$ denotes the number of transactions that can be included into the blocks and transmitted to the next primary node successfully.

2) Time to Finality/Confirmation Latency: To guarantee the security of the transactions, it is essential to prevent the transactions from being arbitrarily changed or reversed. Time to finality is the time after which transactions committed to the blockchain can no longer be revoked, which is important to some real-time applications. Longer delay frustrates users and makes applications built on a blockchain less competitive with existing non-blockchain alternatives.

Time to finality $T^{f}$ includes two parts, the time for propagation $T^{p}$ and the time for computation $T^{c}$:

$$T^{f} = T^{p} + T^{c}. \tag{23}$$

Assume that each transmission procedure should be done within a timeout $\tau_{tr}$. As discussed in Section IV.B, the propagation time can be calculated by

$$T^{p} = t^{tr}_{req} + t^{tr}_{prep} + t^{tr}_{pre} + t^{tr}_{c} + t^{tr}_{r} = \min\{T^{tr}_{req}, \tau_{tr}\} + \min\{T^{tr}_{prep}, \tau_{tr}\} + \min\{T^{tr}_{pre}, \tau_{tr}\} + \min\{T^{tr}_{c}, \tau_{tr}\} + \min\{T^{tr}_{r}, \tau_{tr}\}. \tag{24}$$

This consensus protocol involves five procedures. The computation cost and computation latency for each procedure are shown in Section IV.B. We assume that each message should be processed within a timeout $\tau_c$. Thus, for the computation latency $T^{c}$, we have

$$T^{c} = t^{c}_{req} + t^{c}_{prep} + t^{c}_{pre} + t^{c}_{c} + t^{c}_{r} = \min\{T^{c}_{req}, \tau_c\} + \min\{T^{c}_{prep}, \tau_c\} + \min\{T^{c}_{pre}, \tau_c\} + \min\{T^{c}_{c}, \tau_c\} + \min\{T^{c}_{r}, \tau_c\}. \tag{25}$$

3) Decentralization: To characterize the decentralization of the blockchain system, we resort to the Gini coefficient, which is often used as a gauge of economic inequality, measuring income distribution or wealth distribution among a population [26]. The definition to measure inequality is based on the Lorenz curve [27]. Focusing on the decentralization of the block producers, we consider the number of blocks that each
replica produces over time, the set of which is denoted by $\mathcal{K} = \{K(1), K(2), \ldots, K(T)\}$. Hence, the Gini coefficient of the distribution among $\mathcal{K}$ is expressed by

$$G(\mathcal{K}) = \frac{\sum_{t \in \mathcal{T}} \sum_{t' \in \mathcal{T}} |K(t) - K(t')|}{2 \sum_{t \in \mathcal{T}} \sum_{t' \in \mathcal{B}} K(t)} = \frac{\sum_{t \in \mathcal{T}} \sum_{t' \in \mathcal{T}} |K(t) - K(t')|}{2N \sum_{t \in \mathcal{T}} K(t)}. \tag{26}$$

Note that $G(\mathcal{K}) \in [0, 1]$. The smaller the value of the Gini coefficient is, the more decentralized the blockchain system is. A Gini coefficient of zero expresses perfect equality, where all values in $\mathcal{K}$ are the same. It means every replica produces the same number of blocks in a round. A Gini coefficient of 1 expresses maximal inequality among values. For example, only one replica produces several blocks, but the other replicas don't have a chance or can't produce one block. In this case, the blockchain system becomes totally centralized, which violates the idea of the blockchain, a distributed ledger. To ensure the decentralization of the blockchain system, we have the following constraint

$$G(\mathcal{K}) \le \eta, \tag{27}$$

where $\eta \in [0, 1]$ denotes the threshold of decentralization in terms of $\mathcal{K}$.

4) Security: To guarantee the security of the transactions, it is essential to prevent the transactions from being arbitrarily changed or reversed. As such, finality is vital when designing a blockchain consensus protocol. In a PBFT-based consensus protocol, absolute finality can be provided when a 2/3 fraction of nodes are honest. So the number of loyal nodes is essential to guarantee the security of the consensus protocol. To guarantee the security of the system, the following constraint should be met

$$f \le \frac{N - 1}{3}. \tag{28}$$

In other words, to prevent a transaction from being revoked or modified, the number of malicious nodes should not exceed $(N - 1)/3$. In this paper, we don't consider the security problem of the consensus protocol; that is, the above condition is assumed to be satisfied already.

VI. PROBLEM FORMULATION

In order to improve the throughput of this system, we need to jointly optimize the spectrum allocation, the block size, and the number of blocks produced by each replica. Since it is intractable to solve this problem with traditional methods, we resort to DRL, which will be introduced in the next section. To implement the approach, we formulate the joint optimization problem as an MDP, where the state, action, and reward function are defined as follows.

A. State

Let $\mathcal{S} = \{s(t), t \in \mathcal{T}\}$ be the system state space, where $s(t)$ denotes the state at time period $t$. Here $s(t)$ evolves across $\mathcal{T}$. The network state consists of the SNR between the users and the BSs, the SNR between different BSs, the computing resources assigned by the BSs to different messages, and the primary node ID. Hence, the network state $s(t)$ at time period $t$ is expressed by

$$\begin{aligned} s(t) = \{ &\Gamma_{U_1}(t), \ldots, \Gamma_{U_m}(t), \ldots, \Gamma_{U_M}(t); \\ &\Gamma_{B_1}(t), \ldots, \Gamma_{B_n}(t), \ldots, \Gamma_{B_N}(t); \\ &\Psi_{B_1}(t), \ldots, \Psi_{B_n}(t), \ldots, \Psi_{B_N}(t); B_p(t) \}, \end{aligned} \tag{29}$$

where $\Gamma_{U_m}(t) = \{\Gamma_{U_m,B_n}(t), B_n \in \mathcal{B}\}$, $\Gamma_{B_n}(t) = \{\Gamma_{B_n,B_{n'}}(t), B_{n'} \in \mathcal{B}, B_{n'} \neq B_n\}$, and $\Psi_{B_n}(t) = \{\Psi_{B_n,s}(t)\}$. In this paper, the primary node in each time period is known a priori.

B. Action

In this paper, we focus on the spectrum allocation, the block size, and the number of successive blocks produced by one block producer. Let $\mathcal{A} = \{A(t), t \in \mathcal{T}\}$ be the system action space. Here $A(t)$ denotes the action at time period $t$, which can be expressed by

$$\begin{aligned} A(t) = \{ &W_{U_1}(t), \ldots, W_{U_m}(t), \ldots, W_{U_M}(t); \\ &W_{B_1}(t), \ldots, W_{B_n}(t), \ldots, W_{B_N}(t); \\ &S_B(t); K(t) \}, \end{aligned} \tag{30}$$

where the first two rows denote the spectrum allocation indicators for the users and BSs, respectively. Particularly, we have $W_{U_m} \in \{1, \ldots, E\}$ and $W_{B_n} \in \{1, \ldots, E\}$. Considering the limited wireless resources, the capacity constraint should be met, which is shown in expression (1). Here, $S_B(t)$ denotes the block size at time period $t$, and $K(t)$ denotes the number of successive blocks at time period $t$. In particular, since the replicas take turns to produce blocks, there is only one primary node in a certain time period. The primary node $B_p(t)$ at time period $t$ produces $K(t)$ blocks. To simplify this problem, we discretize the action space, where $S_B$ and $K$ are selected from the sets $\mathcal{S}_B$ and $\mathcal{K}$, respectively.

C. Reward Function

In this paper, we aim to maximize the performance of the joint MEC and blockchain system by making decisions on the above action space. The reward function is designed to be

$$\begin{aligned} \max_{S_B, K, \mathbf{W}} \quad & R(S_B, K, \mathbf{W}) \\ \text{s.t.} \quad & C1: T^{f}_{k}(t) \le T_{max}, \ \forall k \in K(t) \\ & C2: G(\mathcal{K}) \le \eta \\ & C3: \sum_{U_m \in \mathcal{U}} W_{U_m} + \sum_{B_n \in \mathcal{B}} W_{B_n} \le E \\ & C4: \sum_{U_m \in \mathcal{U}} R_{U_m,B_n} + \sum_{n' \in \mathcal{B}/\{n\}} R_{B_{n'},B_n} \ge R_{B_n,B_{n'}} \end{aligned} \tag{31}$$

where $R(S_B, K, \mathbf{W}) = \sum_{t'=t}^{T} \gamma^{t'-t} r(t')$ denotes the long term reward over the time periods $\mathcal{T}$. Here $\gamma \in [0, 1)$ is the discount rate, which indicates the weight of the future reward. For fixed $t$, the bigger $\gamma$ is, the more influence the future reward $r(t')$ has. As noted, with fixed $\gamma$, $\gamma^{t'-t}$ approaches zero when $t' - t$ is large enough, which means that future rewards have less impact on the long term reward as time goes on.
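As a minimal sketch of two quantities used in problem (31), the snippet below computes the Gini coefficient of (26) for a list of per-producer block counts and the discounted long term reward. It is an illustration under stated assumptions rather than the authors' implementation: the function names `gini` and `discounted_return`, the argument `gamma` standing for the discount rate, and the example threshold of 0.3 for $\eta$ are all hypothetical, and the list is assumed to hold one entry per block producer in a round, so that its length plays the role of $N$ in (26).

```python
from itertools import product


def gini(block_counts):
    """Gini coefficient of the block counts, following the form of (26).

    Assumption: block_counts holds one entry per block producer in a round,
    so len(block_counts) plays the role of N in the denominator.
    """
    n = len(block_counts)
    total = sum(block_counts)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs (t, t').
    diff_sum = sum(abs(a - b) for a, b in product(block_counts, repeat=2))
    return diff_sum / (2 * n * total)


def discounted_return(rewards, gamma):
    """Long term reward: sum over k of gamma**k * rewards[k]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical decentralization threshold, used only for this illustration.
ETA = 0.3

balanced = [4, 4, 4, 4]      # every producer generates the same number of blocks
centralized = [8, 0, 0, 0]   # a single producer generates all blocks

satisfies_c2_balanced = gini(balanced) <= ETA        # constraint C2 holds
satisfies_c2_centralized = gini(centralized) <= ETA  # constraint C2 is violated
```

With a finite number of producers the centralized case yields $(N-1)/N$ rather than exactly 1, which is why the example compares against a threshold instead of testing for equality with 1.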
In the proposed problem, $T^{f}_{k}(t)$ in constraint C1 denotes the time to finality of the $k$-th block produced in time period $t$. Constraints C1 and C2 represent the limitations on time to finality and decentralization, respectively. C3 denotes that the allocated sub-channels should not exceed the total system wireless resources. C4 denotes the backhaul capacity constraint. As noted, constraints C1 ∼ C4 may not be met, which means the whole system may have a low system performance. Adopting the idea of a penalty function, we define the immediate reward $r(t)$ as

$$r(t) = \begin{cases} \vartheta_B \Upsilon(t) + \vartheta_M \dfrac{1}{T_U}, & \text{when C1} \sim \text{C4 are satisfied}, \\ 0, & \text{otherwise}, \end{cases} \tag{32}$$

where $\vartheta_B$ and $\vartheta_M \in [0, 1]$ are the weights corresponding to the blockchain system and the MEC system, and $\vartheta_B + \vartheta_M = 1$. Note that the weights can be dynamic, which indicates the dynamic preference on these two systems. For ease of modeling, assume that the weights remain stationary within one time period, while they can change over different periods.

VII. PROPOSED LEARNING APPROACH

In this section, we first introduce necessary background related to DRL, then present the approach to solve the considered problem.

A. DRL Background

1) RL: RL is a branch of machine learning, in which the agent learns the optimal policy by interacting with an unknown environment to maximize the expected long term reward [28]. An RL agent can be modeled as an MDP. The way that an agent acts in an MDP framework is as follows. Given the state $s(t) \in \mathcal{S}$ in environment $X$, the agent takes an action $a(t)$ from the legal set $\mathcal{A}$ at each time step. After taking the action from the given state, it enters the next state $s(t+1)$ according to the state transition probability $P(s(t+1) \mid s(t), a(t))$. At the same time, it receives an instant reward $r(t)$. In RL, the objective is defined as the expected long term reward, which is

$$R(t) = \sum_{t'=t}^{T} \gamma^{t'-t} r(t'), \tag{33}$$

where $\gamma \in [0, 1]$ is the discount factor on the future rewards.

2) DRL: Recently, many studies have shown that deep learning can be combined with RL to solve problems with high dimensional raw data input, which is referred to as DRL [29]. In the training process, DRL utilizes a deep neural network (DNN) called DQN to derive the relationship between the action-state pair and the Q function $Q(s, a; \theta)$, in which $\theta$ represents the weights of the neural network. DQN is trained by updating $\theta$ in each iteration to approximate the real Q values. This is achieved only after two improved techniques, experience replay and the target network, are applied in DQN. Furthermore, the main DQN is trained towards the target DQN by minimizing the loss function, which is defined as

$$Loss(\theta(t)) = E\left[\left(y(t) - Q(s(t), a(t); \theta(t))\right)^{2}\right], \tag{34}$$

where $y(t)$ is the target Q value, which can be estimated by

$$y(t) = r(t) + \gamma \max_{a(t+1)} Q(s(t+1), a(t+1); \theta^{-}(t+1)). \tag{35}$$

Here, the target DQN is updated every $G$ steps, i.e., $\theta^{-}(t) = \theta_{t-G}$.

3) Beyond DRL: To improve the performance of DRL, two important techniques, double DQN and dueling DQN, are applied in this work, which will be described next.

a) Double DQN: To handle the problem of overestimation of Q values, double DQN was proposed by Hado van Hasselt [30], the idea of which is to decouple the selection from the estimation for the actions. Mathematically, it can be expressed by

$$y^{DoubleDQN} = r + \gamma Q(s', \arg\max_{a} Q(s', a; \theta); \theta^{-}), \tag{36}$$

which selects the actions according to the online weights $\theta$, while the estimation is based on the target weights $\theta^{-}$. This simple trick can help yield more accurate estimations, thus improving the performance of DRL.

b) Dueling DQN: Motivated by the fact that not every action affects the state when in some state, dueling DQN is proposed [31], the idea of which is to decompose the Q value into two parts, the value of being in that state $V(s)$ and the advantage of taking that action at that state $A(s, a)$. The idea can be expressed by

$$Q(s, a) = A(s, a) + V(s). \tag{37}$$

In this case, dueling DQN can intuitively learn which state is more valuable without learning the effect of each action at that state. By doing so, it can help find more reliable Q values for each action and accelerate the training process.

B. Proposed Algorithm

To solve the proposed problem, an offline DRL-based approach is proposed. In this approach, a double-dueling DQN model is first trained to learn the optimal policies in an offline way. After the model is trained, it can be used by the B-MEC system to jointly allocate the wireless resources and decide the size of the blocks and the number of consecutively produced blocks for each replica in an online way. In this way, it avoids the long training time of the online learning approach.

In each training step, the state information is sent to the Q network, and the Q network sends back the optimal action $a^{*}(t)$ at each time step. Action selection follows the $\varepsilon$-greedy policy. The transitions, i.e., experiences, from all training runs are accumulated in the experience replay buffer in parallel. A mini-batch of samples is selected from the experience replay buffer to train the Q network parameter $\theta$. The target Q network parameter $\theta^{-}$ is updated every $G$ steps, i.e., copying from the main Q network. The training process is shown in Algorithm 1, in which there are two points to be specified. First, the Q values are divided into two parts, the value of being in the state $V(s)$ and the advantage of taking that action in that state $A(s, a)$. In the last layer of DQN, these two parts are combined into one Q value. Second, when updating the target Q network, a learning rate $\alpha$ is introduced, where $\alpha \in [0, 1]$ is
the weight to adjust the preference on the current and previous learning values.

In each training step, the system state transits into a new state according to the system transition probability after an action is performed, and the reward can be observed based on the reward function. After the states, actions, the reward function, transition probabilities, and constraints of the B-MEC system are identified, the optimal policy can be learned offline. In order to obtain the optimal solution, the states, actions, reward function, and constraints are identified in Section VI. As noted, the transition probabilities and reward need to be identified when conducting the simulation, while both of them are not needed when carrying out the Q networks in a real B-MEC system.

Algorithm 1 Offline DRL-Based Performance Optimization for B-MEC
 1: Input: Maximum training episode $E_{max}$, maximum steps $H_{max}$ in each episode, mini-batch size $U$, initial learning rate $\alpha$, exploration probability $\varepsilon$, discount rate $\gamma$.
 2: Initialization
 3: Initialize the state of the B-MEC system $s_1$, set $\varepsilon = 1$
 4: Initialize the experience replay buffer
 5: Initialize the main DQN with random weights $\theta$.
 6: Initialize the target DQN with weights $\theta^{-} = \theta$.
 7: for episode $= 1, \ldots, E_{max}$ do
 8:   for $t = 1, \ldots, H_{max}$ do
 9:     Choose a random probability $p$,
10:     if $p > \varepsilon$ then
11:       $a(t) = a^{*}(t) = \arg\max_{a} Q(s(t), a; \theta)$,
12:     else
13:       randomly select an action $a(t) \neq a^{*}(t)$.
14:     end if
15:     Decrease exploration probability $\varepsilon$
16:     Execute action $a(t)$ in the system, and observe the reward $r(t)$ and the next state $s(t+1)$.
17:     Store the experience $(s(t), a(t), r(t), s(t+1))$ into the experience replay buffer.
18:     Sample a mini-batch of size $U$
19:     Calculate the target Q-value through expression (36)
20:     Update the main DQN by minimizing the loss $L(\theta)$ in expression (34), and perform a gradient descent step on $L(\theta)$ with respect to $\theta$.
21:     Every $G$ steps, update the target DQN parameters with learning rate $\alpha$, $\theta^{-} = \alpha\theta + (1 - \alpha)\theta^{-}$
22:     Update the learning rate according to the optimizer (e.g., Adam, Adagrad)
23:   end for
24: end for

C. Complexity Analysis

Next, we analyze the computational complexity and space complexity of the proposed DRL-based algorithm for practical scenarios where there are tens and even hundreds of users and BSs in a small area.

Theorem 1: In practical scenarios, the computational complexity of the proposed training DRL algorithm is $O(E^{(N+M)})$ or $O((N+M)^{(E-(N+M))})$.

Proof of Theorem 1: To prove the above theorem and thus analyze the complexity of the proposed algorithm, one must consider the size of the state function of the system as well as the action space at each state vector [32]. As such, based on the action space definition, the system needs to update each user's and BS's spectrum allocation indicator, the block size, and the number of blocks produced by each block producer, and, thus, its action is also a function of the channel association vector, block size level, and block number level.

For each state, the action of the system is a function of the channel association vector, block size level, and block number level. Nevertheless, the number of possible channel associations of the users and BSs in the system is much larger than the number of possible block size levels and block number levels. Therefore, one can focus only on the number of possible channel associations of the users and BSs for analyzing the convergence complexity of the proposed training algorithm, by the law of large numbers. Consequently, the computational complexity of the proposed algorithm is $O(E^{(N+M)})$ when the system updates the channel allocation indicator of the $N$ block producers and $M$ users with $E$ sub-channels. In this paper, we assume that the number of sub-channels is more than the total number of users and BSs, i.e., $E > N + M$. Thus, from another perspective, the computational complexity can also be expressed by $O((N+M)^{(E-(N+M))})$. This completes the proof.

From Theorem 1, we can conclude that the convergence speed of the proposed training algorithm is strongly related to the state space dimension. It is important to note here that there exists a tradeoff between the computational complexity of the proposed DRL training algorithm and the resulting network performance [32]. It is worth noting that the complexity of the proposed algorithm can be ignored in this paper, because the training process, which is the most costly, is conducted in an offline way.

Theorem 2: The space complexity of the proposed DRL-based algorithm is $O(S A H_{max})$, where $S$ is the number of states, $A$ is the number of actions, and $H_{max}$ is the number of steps in one episode.

Proof of Theorem 2: According to [33], space complexity is measured by the amount of memory required to implement the algorithm. Inferred from the work in [34], the space complexity is related to the number of states, the number of actions, and the number of steps per episode. In this paper, the number of states can be expressed by $S = (N+M)^{(E-(N+M))} \times Y^{N} \times N$. The number of actions can be calculated by $A = (N+M) \times |\mathcal{S}_B| \times |\mathcal{K}|$. In this paper, the maximum number of steps in each episode is $H_{max}$, as defined in Algorithm 1. In this case, the space complexity to implement the proposed algorithm can be expressed by $O(S A H_{max}) = O((N+M)^{(E-(N+M)+1)} \times Y^{N} \times N \times |\mathcal{S}_B| \times |\mathcal{K}| \times H_{max})$. This completes the proof.
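The three update rules that Algorithm 1 relies on, namely the double DQN target in (36), the dueling combination in (37), and the weighted target-network update, can be illustrated with plain Python lists. This is a hedged sketch, not the TensorFlow code used in the paper: the function names, the list-based representation of Q values and weights, and the argument `gamma` standing for the discount rate are assumptions made for the example.

```python
def double_dqn_target(reward, q_online_next, q_target_next, gamma):
    """Target of (36): the online network selects the action, the target network scores it.

    q_online_next and q_target_next are the Q values of every action in the
    next state, as produced by the online and target networks respectively.
    """
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


def dueling_q(state_value, advantages):
    """Combination of (37): Q(s, a) = V(s) + A(s, a) for every action a."""
    return [state_value + adv for adv in advantages]


def update_target_weights(target_weights, online_weights, alpha):
    """Weighted target update: theta_minus <- alpha * theta + (1 - alpha) * theta_minus."""
    return [alpha * w + (1.0 - alpha) * w_minus
            for w, w_minus in zip(online_weights, target_weights)]


# Toy numbers chosen only for illustration.
q_online = [0.2, 0.9, 0.1]   # the online network prefers action 1
q_target = [0.5, 0.3, 0.8]   # the target network would prefer action 2
y_double = double_dqn_target(1.0, q_online, q_target, gamma=0.9)
y_vanilla = 1.0 + 0.9 * max(q_target)  # target of (35), shown for comparison
```

In this toy case the double DQN target is smaller than the target of (35), because the action is chosen by the online network instead of taking the maximum of the target network's own estimates, which is the overestimation effect that double DQN is meant to reduce.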
TABLE I
THE SIMULATION PARAMETERS

Fig. 2. Reward under different learning rates.

VIII. SIMULATION RESULTS AND DISCUSSION

In this section, we use computer simulation to demonstrate the effectiveness of the proposed scheme. First, the simulation settings are presented. Then the simulation results with various parameters follow.

A. Simulation Setting

We conduct our simulation on a GPU-based server, which has four NVIDIA GTX 1080 Ti GPUs, 128 GB of RAM, and an Intel Xeon CPU. The software environment is TensorFlow 1.8.0 with Python 3.6 on Ubuntu 18.04 LTS.

For the wireless network, there are 4 BSs, surrounded by 3 mobile users. Each BS is equipped with a MEC server. Since the block producer selection is not considered in this paper, all the BSs are assumed to be selected as block producers. The computation resources assigned to each message are from the set {200, 500, 1000} GHz, the transition probability of which is $Q = [0.7, 0.2, 0.1; 0.2, 0.1, 0.7; 0.1, 0.7, 0.2]$. The channel state also follows a Markov process. We assume three scenarios with different SNR state settings, labeled as $SNR_1$, $SNR_2$, and $SNR_3$. There are three states in the first scenario with $SNR_1$, low ($H_l = 1$), medium ($H_l = 7$), and high ($H_l = 15$), the transition probability of which is

$$P_1 = \begin{bmatrix} 0.6 & 0.3 & 0.1 \\ 0.3 & 0.1 & 0.6 \\ 0.1 & 0.6 & 0.3 \end{bmatrix}. \tag{38}$$

The second scenario with $SNR_2$ has the following setting, five states {1, 3, 7, 15, 31} and transition probabilities

$$P_2 = \begin{bmatrix} 0.6 & 0.4 & 0.2 & 0.1 & 0.05 \\ 0.05 & 0.6 & 0.4 & 0.2 & 0.1 \\ 0.1 & 0.05 & 0.6 & 0.4 & 0.2 \\ 0.2 & 0.1 & 0.05 & 0.6 & 0.4 \\ 0.4 & 0.2 & 0.1 & 0.05 & 0.6 \end{bmatrix}. \tag{39}$$

In the third scenario with $SNR_3$, there are ten states {1, 2, 3, 5, 7, 11, 15, 23, 31, 47}, the transition probability of which follows the settings in [35]. The other related settings are illustrated in Table I.

For double-dueling DQN, we use four fully-connected layers in the main network and the target network. The first 3 layers have 256, 256 and 128 neurons, respectively. The fourth layer is split into Advantage (action advantage) and Value (state value) functions, which comes from the idea of dueling DQN. In the last layer, the Advantage and Value functions are merged as the Q value. The other parameters of the proposed approach are presented in Table I.

To evaluate the effectiveness of the proposed approach, we first compare the performance achieved by DQN, double DQN, dueling DQN, and the utilized double-dueling DQN. Then we choose four comparison algorithms: 1) the proposed scheme with fixed spectrum allocation, in which the learning agent only needs to determine the optimal block size and the number of consecutively produced blocks for each replica; 2) the proposed scheme with fixed block size, in which the block size is set to be 1 MB; 3) the proposed scheme with fixed block number, in which each block producer can produce a fixed number of blocks at one time period, i.e., 4; 4) the existing static scheme, in which the decisions are made through maximizing the immediate reward. Last, we choose two existing algorithms in other works, the proportional fairness utility based algorithm (PFA) widely used for resource allocation [36] and the random selection algorithm (RSA) in [37], where an action is randomly selected to execute in each step.

B. Simulation Results and Discussion

The performance of the double-dueling DQN based scheme with different learning rates is shown in Fig. 2. From this figure, we can observe that the convergence is faster with the learning rate equal to 0.001 than with the learning rate equal to 0.0001. The learning rate adjusts the weights of the current and previous learning values. A larger learning rate means that the learning agent will place more emphasis on the current learning value, and vice versa. In other words, a bigger learning rate denotes a longer learning step. However, a big learning rate can result in a local optimum point and miss the global optimum point due to the big learning step. Hence, the learning rate should be carefully chosen, which is set to be 0.001 in this paper.
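The channel model of Section VIII-A draws the SNR state of each link from a Markov chain. As a minimal sketch of how such a trace could be generated for the first scenario, the snippet below samples a state sequence from the matrix $P_1$ in (38) with the three SNR levels 1, 7, and 15. The helper names `next_state` and `simulate`, the seed, and the trace length are assumptions made for the example; they do not come from the paper.

```python
import random

# SNR levels and transition matrix of the first scenario, taken from (38).
SNR_STATES = [1, 7, 15]
P1 = [
    [0.6, 0.3, 0.1],
    [0.3, 0.1, 0.6],
    [0.1, 0.6, 0.3],
]


def next_state(index, matrix, rng):
    """Draw the next state index using row `index` of the transition matrix."""
    return rng.choices(range(len(matrix)), weights=matrix[index], k=1)[0]


def simulate(start_index, steps, matrix, rng):
    """Return the visited state indices, starting state included."""
    path = [start_index]
    for _ in range(steps):
        path.append(next_state(path[-1], matrix, rng))
    return path


# A fixed seed (arbitrary choice) keeps the example reproducible.
rng = random.Random(0)
path = simulate(start_index=0, steps=10, matrix=P1, rng=rng)
snr_trace = [SNR_STATES[i] for i in path]
```

The same helper applies to the computation resource states, whose transition matrix $Q$ is also given in Section VIII-A, provided that every row of the matrix passed in sums to one.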
Fig. 3. Reward under different discount rates.

Fig. 4. Performance of the proposed approach with different mini-batch sizes.

Fig. 5. Comparison of DQN, double DQN, dueling DQN, double-dueling DQN.

Fig. 6. Convergence of different comparison algorithms.

The effects of different discount rates on the performance of the proposed approach are shown in Fig. 3. In DRL, the actions are chosen by optimizing the long term reward, where the future rewards are discounted by multiplying the discount rate, as defined in equation (33). The learning agent will choose the action maximizing the current reward with a small discount rate and vice versa. Since the current action would influence future rewards in this paper, the long term reward increases as the discount rate grows. However, it is meaningless to put too much weight on the future in an unstable system. Also, it would incur high computational complexity. To explore a tradeoff between the performance and the computational complexity, an appropriate discount rate should be chosen, which is set as 0.9 in the rest of the simulations.

Fig. 4 shows the effects of the mini-batch size of the proposed approach on the convergence performance. The x-axis denotes the training steps and the y-axis represents the value of the loss function. The mini-batch size indicates how many experience cases are used to train the Q network in each training step. We can observe from Fig. 4 that the convergence is faster as the mini-batch size grows, because more experiences are used to train the Q network with a bigger mini-batch size. Similar to the other parameters, an appropriate mini-batch size should be chosen, which is set to be 64 in the rest of the simulations.

Different DRL methods are compared in Fig. 5. For fairness, all methods adopt the same simulation parameters. The y-axis represents the value of the loss function, which represents the gap to approximate the Q function, while the x-axis represents the training steps. First, we can see that double-dueling DQN converges first. Second, double-dueling DQN performs a more accurate approximation of the Q function than the other three DRL methods. That is because dueling DQN divides the Q function into the state value function and the action advantage function, which allows a better approximation of the Q values and enables faster convergence. Furthermore, double DQN selects the actions according to the online weights, which mitigates the overestimation compared with traditional DQN. It also results in a more accurate approximation.

Fig. 6 shows the convergence of different schemes, where the y-axis denotes the long term reward. First, the existing static scheme converges first, but obtains the worst performance in terms of reward. That is because the decisions are made according to the current reward, which needs fewer training steps. As a result, it doesn't consider the effects of the current action on the future rewards, which obviously obtains the lowest reward. Second, the proposed scheme maintains a higher long term reward than the other three schemes. With the adaptive spectrum allocation policy, the latency can be reduced. With an adaptive block size and a properly chosen number of producing blocks, the throughput of blockchain
system can be improved. All these aspects contribute to better system performance. Hence, the proposed scheme, which jointly optimizes these aspects, obtains the best performance of all.

Fig. 7 depicts the relationship between the system performance and the average transaction size. This figure can be used to show the performance of the proposed approach with different types of transactions, which correspond to different offloading tasks. As the average transaction size increases, the system performance decreases. The reasons are 1) one block can contain fewer transactions when the transaction size rises; 2) the transmission latency from the users to the BSs increases, which lowers the performance of the MEC system. Focusing on the comparison of different schemes, the proposed scheme obtains the highest long term reward with the variation of average transaction size, followed by the proposed scheme with fixed spectrum allocation, fixed block size and fixed block number, and the existing static scheme has the worst performance. The reasons are described as before.

Fig. 7. Long term reward vs. average transaction size.

Fig. 8 shows the relationship between the system performance and the time to finality threshold. One observation is that the long term reward increases as the time to finality threshold increases and eventually levels off. With a flexible latency constraint, less punishment may be added to the reward, which naturally leads to a higher performance. However, when the time to finality threshold becomes big enough, it has little influence on the reward. Another observation is that the scheme with fixed block size acts better than that with fixed block number. The reason is that it can adjust the block number, which in turn adjusts the block interval to deal with the strict time to finality threshold and avoids missing blocks. Furthermore, we can find that the existing static scheme has the poorest performance, which reveals the superiority of double-dueling DQN based schemes. The proposed scheme maintains the best performance of all as well.

Fig. 8. Long term reward vs. time to finality threshold.

Fig. 9 plots the reward to show the scalability and the ability of the framework in dealing with dynamic changes of users in the system, where the number of BSs is fixed, i.e., 5, and the number of users varies from 2 to 20. In the simulation, we first train the algorithm offline in the scenario with 20 users and 5 BSs. When the offline-trained algorithm is applied online with a changing number of users, the extra parameters in the states and actions are set to be null to cope with the smaller number of users. In other words, the results in Fig. 9 are obtained online using the algorithm that is trained offline. From the simulations, we can first observe that the long-term reward decreases with an increasing number of users. The reasons are: 1) the average resources decrease, resulting in higher latency, and 2) more blocks may be ignored, inducing smaller throughput. Second, we can see that the proposed algorithm outperforms the other existing algorithms, which shows the superiority of the proposed scheme, even with a large number of users. Third, the results show that the offline-trained algorithm can work effectively online under changing conditions.

Fig. 9. Rewards with different numbers of BSs and users.

In Fig. 10, the performance of the proposed approach is verified by comparing with the other two baseline algorithms in three scenarios, where the SNR state settings vary in different scenarios. In the simulation, we train the proposed framework again when the dynamics of the SNR have changed. First, the scenario with $SNR_3$ achieves the best reward. That is because the system can obtain a better SNR state in this scenario, which introduces larger throughput and lower latencies, resulting in larger rewards. Second, it can be seen that the proposed scheme becomes less stable when the number of SNR states increases. The reason is the mismatch between the states and actions. Third, there is little difference
on the convergence speeds of these algorithms. That is because the number of states, actions, and steps in one episode is much larger than the state set's dimension, which can be ignored by the law of large numbers. Fourth, the proposed scheme obtains the best performance, due to the superiority of DRL, which also shows the ability of the proposed scheme to adapt to different network dynamics.

Fig. 10. Rewards under different SNR state settings.

Finally, we analyze the throughput and latency of the proposed algorithm, as shown in Fig. 11, where the number of BSs is fixed to 10 and the number of users ranges from 5 to 25. Similar to Fig. 9, the results in Fig. 11 are also obtained online using the offline-trained algorithm, and it also shows that the proposed approach of training offline and operating online can work well in practice. Obviously, the throughput decreases and the latency increases as the number of users increases. First, the resources are fixed, so the average amount of resources decreases as the number of users goes up. Second, the queuing delay grows when the number of users rises, which induces larger latency.

Fig. 11. Throughput and latency (seconds) versus the number of users.

IX. CONCLUSION AND FUTURE WORK

given, i.e., the latency of the users, throughput, time to finality, decentralization, security. To improve the performance of the joint MEC and blockchain system, the main problem was formulated, where the spectrum allocation, block size, and block number were optimized. Since this problem is intractable using traditional methods, we resorted to a novel double-dueling deep Q learning approach to solve this problem. Simulation results demonstrated the effectiveness of the proposed scheme by comparing with the other baseline schemes. Future work is in progress to consider caching in the proposed framework.

REFERENCES

[1] V. Sharma, I. You, F. Palmieri, D. N. K. Jayakody, and J. Li, “Secure and energy-efficient handover in fog networks using blockchain-based DMM,” IEEE Commun. Mag., vol. 56, no. 5, pp. 22–31, May 2018.
[2] X. Tao, K. Ota, M. Dong, H. Qi, and K. Li, “Performance guaranteed computation offloading for mobile-edge cloud computing,” IEEE Wireless Commun. Lett., vol. 6, no. 6, pp. 774–777, Dec. 2017.
[3] D. M. Kalathil and R. Jain, “Spectrum sharing through contracts for cognitive radios,” IEEE Trans. Mobile Comput., vol. 12, no. 10, pp. 1999–2011, Oct. 2013.
[4] J. Zheng, Y. Cai, Y. Wu, and X. Shen, “Dynamic computation offloading for mobile cloud computing: A stochastic game-theoretic approach,” IEEE Trans. Mobile Comput., vol. 18, no. 4, pp. 771–786, Apr. 2019.
[5] Y. C. Hu, M. Patel, D. Sabella, N. Sprecher, and V. Young, “Mobile edge computing—A key technology towards 5G,” ETSI White Paper, vol. 11, no. 11, pp. 1–16, 2015.
[6] J. Feng, Q. Pei, F. R. Yu, X. Chu, and B. Shang, “Computation offloading and resource allocation for wireless powered mobile edge computing with latency constraint,” IEEE Wireless Commun. Lett., vol. 8, no. 5, pp. 1320–1323, Oct. 2019.
[7] Y. Liu, F. R. Yu, X. Li, H. Ji, and V. C. M. Leung, “Distributed resource allocation and computation offloading in fog and cloud networks with non-orthogonal multiple access,” IEEE Trans. Veh. Technol., vol. 67, no. 12, pp. 12137–12151, Dec. 2018.
[8] H. Guo, J. Liu, and H. Qin, “Collaborative mobile edge computation offloading for IoT over fiber-wireless networks,” IEEE Netw., vol. 32, no. 1, pp. 66–71, Jan. 2018.
[9] C. Wang, F. R. Yu, C. Liang, Q. Chen, and L. Tang, “Joint computation offloading and interference management in wireless cellular networks with mobile edge computing,” IEEE Trans. Veh. Technol., vol. 66, no. 8, pp. 7432–7445, Aug. 2017.
[10] M. Liu and Y. Liu, “Price-based distributed offloading for mobile-edge computing with computation capacity constraints,” IEEE Wireless Commun. Lett., vol. 7, no. 3, pp. 420–423, Jun. 2018.
[11] K. Yang, X. Jia, and K. Ren, “Secure and verifiable policy update outsourcing for big data access control in the cloud,” IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 12, pp. 3461–3470, Dec. 2015.
[12] F. Tschorsch and B. Scheuermann, “Bitcoin and beyond: A technical survey on decentralized digital currencies,” IEEE Commun. Surveys Tuts., vol. 18, no. 3, pp. 2084–2123, 3rd Quart., 2016.
[13] T. Salman, M. Zolanvari, A. Erbad, R. Jain, and M. Samaka, “Security services using blockchains: A state of the art survey,” IEEE Commun. Surveys Tuts., vol. 21, no. 1, pp. 858–880, 1st Quart., 2018.
[14] F. R. Yu, J. Liu, Y. He, P. Si, and Y. Zhang, “Virtualization for distributed ledger technology (vDLT),” IEEE Access, vol. 6, pp. 25019–25028, 2018.
[15] T. N. Dinh and M. T. Thai, “AI and blockchain: A disruptive integration,” Computer, vol. 51, no. 9, pp. 48–53, Sep. 2018.
[16] Y. Liu, F. R. Yu, X. Li, H. Ji, and V. C. M. Leung, “Decentralized resource allocation for video transcoding and delivery in blockchain-based system with mobile edge computing,” IEEE Trans. Veh. Technol., vol. 68, no. 11, pp. 11169–11185, Nov. 2019.
[17] M. Liu, F. R. Yu, Y. Teng, V. C. M. Leung, and M. Song, “Distributed
In this paper, we developed a novel blockchain-based frame- resource allocation in blockchain-based video streaming systems with
work for resource allocation in future wireless networks with mobile edge computing,” IEEE Trans. Wireless Commun., vol. 18, no. 1,
MEC. With blockchain, the data delivery and computation pp. 695–708, Jan. 2019.
[18] J. Kang et al., “Blockchain for secure and efficient data sharing in
execution on the edge servers are self-organized by smart vehicular edge computing and networks,” IEEE Internet Things J., vol. 6,
contracts. A consensus protocol in this distributed wireless no. 3, pp. 4660–4670, Jun. 2019.
network was proposed, along with the details and theoretical [19] K. Fan, S. Wang, Y. Ren, K. Yang, Z. Yan, H. Li, and Y. Yang,
“Blockchain-based secure time protection scheme in IoT,” IEEE Internet
analysis. The performance of MEC and blockchain system was Things J., vol. 6, no. 3, pp. 4671–4679, Jun. 2019.
[20] C. Qiu, F. R. Yu, H. Yao, C. Jiang, F. Xu, and C. Zhao, “Blockchain-based software-defined industrial Internet of Things: A dueling deep Q-learning approach,” IEEE Internet Things J., vol. 6, no. 3, pp. 4627–4639, Jun. 2019.
[21] H. Liu et al., “Blockchain-enabled security in electric vehicles cloud and edge computing,” IEEE Netw., vol. 32, no. 3, pp. 78–83, May 2018.
[22] G. Wood et al., “Ethereum: A secure decentralised generalised transaction ledger (eip-150 revision),” Ethereum Project Yellow Paper, vol. 151, no. 2017, pp. 1–32, 2017.
[23] M. Liu, F. R. Yu, Y. Teng, V. C. M. Leung, and M. Song, “Performance optimization for blockchain-enabled industrial Internet of Things (IIoT) systems: A deep reinforcement learning approach,” IEEE Trans. Ind. Informat., vol. 15, no. 6, pp. 3559–3570, Jun. 2019.
[24] V. D. Papoutsis and S. A. Kotsopoulos, “Chunk-based resource allocation in multicast OFDMA systems with average BER constraint,” IEEE Commun. Lett., vol. 15, no. 5, pp. 551–553, May 2011.
[25] A. Clement, E. Wong, L. Alvisi, M. Dahlin, and M. Marchetti, “Making byzantine fault tolerant systems tolerate byzantine faults,” in Proc. 6th NSDI, 2009, pp. 153–168.
[26] F. Cowell, Measuring Inequality. Oxford, U.K.: Oxford Univ. Press, 2011.
[27] F. Wenli, H. Ping, and L. Zhigang, “Multi-attribute node importance evaluation method based on Gini-coefficient in complex power grids,” IET Gener., Transmiss. Distrib., vol. 10, no. 9, pp. 2027–2034, Jun. 2016.
[28] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” J. Artif. Intell. Res., vol. 4, no. 1, pp. 237–285, Jan. 1996.
[29] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[30] H. V. Hasselt, “Double Q-learning,” in Proc. Adv. Neural Inf. Process. Syst. 23, 2010, pp. 2613–2621. [Online]. Available: http://papers.nips.cc/paper/3964-double-q-learning.pdf
[31] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. de Freitas, “Dueling network architectures for deep reinforcement learning,” 2015, arXiv:1511.06581. [Online]. Available: https://arxiv.org/abs/1511.06581
[32] U. Challita, W. Saad, and C. Bettstetter, “Interference management for cellular-connected UAVs: A deep reinforcement learning approach,” IEEE Trans. Wireless Commun., vol. 18, no. 4, pp. 2125–2140, Apr. 2019.
[33] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman, “PAC model-free reinforcement learning,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 881–888.
[34] C. Jin, Z. A. Zhu, S. Bubeck, and M. I. Jordan, “Is Q-learning provably efficient?” in Proc. NIPS, 2018, pp. 4863–4873.
[35] Y. He et al., “Deep-reinforcement-learning-based optimization for cache-enabled opportunistic interference alignment wireless networks,” IEEE Trans. Veh. Technol., vol. 66, no. 11, pp. 10433–10445, Nov. 2017.
[36] U. Challita, L. Dong, and W. Saad, “Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective,” IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4674–4689, Jul. 2018.
[37] J. Zhu, Y. Song, D. Jiang, and H. Song, “A new deep-Q-learning-based transmission scheduling mechanism for the cognitive Internet of Things,” IEEE Internet Things J., vol. 5, no. 4, pp. 2375–2385, Aug. 2018.

Fengxian Guo received the B.E. degree in communications from Zhengzhou University (ZZU), China, in 2015. She is currently pursuing the Ph.D. degree with the School of Information and Communication Engineering, Beijing University of Posts and Telecommunications (BUPT), Beijing, China. She was with The University of British Columbia, Vancouver, Canada, and Carleton University, Ottawa, Canada, as a Visiting Ph.D. Student from September 2018 to September 2019. Her current research interests include future wireless networks, mobile edge computing, blockchain, and machine learning.

F. Richard Yu (S’00–M’04–SM’08–F’18) received the Ph.D. degree in electrical engineering from The University of British Columbia (UBC) in 2003. From 2002 to 2006, he was with Ericsson, Lund, Sweden, and a start-up in California, USA. He joined Carleton University in 2007, where he is currently a Professor. His research interests include connected/autonomous vehicles, security, artificial intelligence, distributed ledger technology, and wireless cyber-physical systems.
Dr. Yu is a registered Professional Engineer in the province of Ontario, Canada, and a fellow of the Institution of Engineering and Technology (IET). He is an elected member of the Board of Governors of the IEEE VTS. He received the IEEE TCGCC Best Journal Paper Award in 2019, the Distinguished Service Awards in 2019 and 2016, the Outstanding Leadership Award in 2013, the Carleton Research Achievement Award in 2012, the Ontario Early Researcher Award (formerly Premiers Research Excellence Award) in 2011, the Excellent Contribution Award at IEEE/IFIP TrustCom 2010, the Leadership Opportunity Fund Award from Canada Foundation of Innovation in 2009, and the Best Paper Awards at IEEE ICNC 2018, VTC 2017 Spring, ICC 2014, Globecom 2012, IEEE/IFIP TrustCom 2009, and International Conference on Networking 2005. He has served as the Technical Program Committee (TPC) Co-Chair of numerous conferences. He serves on the editorial boards of several journals, including as a Co-Editor-in-Chief for Ad Hoc and Sensor Wireless Networks, an Area Editor for the IEEE COMMUNICATIONS SURVEYS AND TUTORIALS, a Lead Series Editor for the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, and the IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING. He is an IEEE Distinguished Lecturer of Vehicular Technology Society (VTS) and Communications Society.

Heli Zhang received the B.S. degree in communication engineering from Central South University in 2009 and the Ph.D. degree in communication and information system from the Beijing University of Posts and Telecommunications (BUPT) in 2014. From 2014 to 2018, she was a Lecturer with the School of Information and Communication Engineering, BUPT, where she has been an Associate Professor since 2018. Her research interests include heterogeneous networks, long-term evolution/fifth generation, and the Internet of Things.
Dr. Zhang participated in many National projects funded by National Science and Technology Major Project, National 863 High-tech, and National Natural Science Foundation of China, and cooperated with many corporations in research. She has been a reviewer for Journals of IEEE Wireless Communications, IEEE Communication Magazine, the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, the IEEE COMMUNICATION LETTERS, and the IEEE TRANSACTIONS ON NETWORKING.

Hong Ji (SM’09) received the B.S. degree in communications engineering and the M.S. and Ph.D. degrees in information and communications engineering from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 1989, 1992, and 2002, respectively. In 2006, she was a Visiting Scholar with The University of British Columbia, Vancouver, BC, Canada. She is currently a Professor with BUPT. She has authored more than 300 journals/conference papers. Several of her articles had been selected for Best paper. Her research interests include wireless networks and mobile systems, including cloud computing, machine learning, intelligent networks, green communications, radio access, ICT applications, system architectures, management algorithms, and performance evaluations.
Dr. Ji is serving on the Editorial Boards of the IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING and International Journal of Communication Systems (Wiley). She has served as the Co-Chair for Chinacom’11 and a member of the Technical Program Committee of WCNC’19/15/14/12, Globecom’17/16/15/14/13/12/11/10, ISCIT’17, CITS’16/15/12, WCSP’15, ICC’20/13/12/11, ICCC’13/12, PIMRC’12/11, IEEE VTC’12S, and Mobi-World’11. She was a Guest Editor of International Journal of Communication Systems, (Wiley) Special Issue on Mobile Internet: Content, Security and Terminal.
Mengting Liu received the Ph.D. degree from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2019. From 2017 to 2018, she was a Visiting Ph.D. Student with The University of British Columbia, Vancouver, BC, Canada. Her current research interests include blockchain technology, deep reinforcement learning, resource allocation, mobile edge computing systems, and stochastic geometry theory.

Victor C. M. Leung (S’75–M’89–SM’97–F’03) is currently a Distinguished Professor of computer science and software engineering with Shenzhen University, Shenzhen, China, and a Professor Emeritus with The University of British Columbia (UBC), Vancouver, BC, Canada. Before he retired from UBC in 2018, he was a Professor of electrical and computer engineering and the holder of the TELUS Mobility Research Chair there. He has coauthored more than 1300 journals/conference papers and book chapters. His research is in the broad areas of wireless networks and mobile systems. He is serving on the Editorial Boards of the IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING, the IEEE TRANSACTIONS ON CLOUD COMPUTING, IEEE ACCESS, the IEEE NETWORK, and several other journals. He is a fellow of the Royal Society of Canada, Canadian Academy of Engineering, and Engineering Institute of Canada. He received the IEEE Vancouver Section Centennial Award, the 2011 UBC Killam Research Prize, the 2017 Canadian Award for Telecommunications Research, and the 2018 IEEE TCGCC Distinguished Technical Achievement Recognition Award. He has coauthored articles that received the 2017 IEEE ComSoc Fred W. Ellersick Prize, the 2017 IEEE Systems Journal Best Paper Award, the 2018 IEEE CSIM Best Journal Paper Award, and the 2019 IEEE TCGCC Best Journal Paper Award. He is named in the current Clarivate Analytics list of Highly Cited Researchers.
