This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2020.2978830, IEEE Internet of Things Journal
IEEE INTERNET OF THINGS JOURNAL, VOL. X, NO. X, MARCH 2020
Abstract—Vehicular edge computing (VEC) is a new computing paradigm that has great potential to enhance the capability of vehicle terminals (VTs) to support resource-hungry in-vehicle applications with low latency and high energy efficiency. In this paper, we investigate an important computation offloading scheduling problem in a typical VEC scenario, where a VT traveling along an expressway intends to schedule the tasks waiting in its queue so as to minimize the long-term cost, defined as a trade-off between task latency and energy consumption. Due to diverse task characteristics, a dynamic wireless environment, and frequent handover events caused by vehicle movement, an optimal solution should take into account both where to schedule each task (i.e., local computation or offloading) and when to schedule it (i.e., the order and time of execution). To solve this complicated stochastic optimization problem, we model it as a carefully designed Markov decision process (MDP) and resort to deep reinforcement learning (DRL) to deal with the enormous state space. Our DRL implementation is based on the state-of-the-art proximal policy optimization (PPO) algorithm. A parameter-shared network architecture combined with a convolutional neural network (CNN) is used to approximate both the policy and the value function, which can effectively extract representative features. A series of adjustments to the state and reward representations is made to further improve training efficiency. Extensive simulation experiments and comprehensive comparisons with six known baseline algorithms and their heuristic combinations clearly demonstrate the advantages of the proposed DRL-based offloading scheduling method.

Index Terms—Computation offloading, deep reinforcement learning, mobile edge computing, task scheduling, vehicular edge computing.

I. INTRODUCTION

… it is difficult for onboard software and hardware systems to meet their demanding resource requirements. Vehicular edge computing (VEC) [1]–[3], which is the application of mobile edge computing (MEC) [4], [5] in vehicular scenarios, has recently received significant attention as a promising solution to improve the situation. By offloading these resource-demanding tasks to the MEC servers attached to roadside units (RSUs), the execution latency and energy consumption of in-vehicle applications can be significantly reduced. At the same time, bandwidth to the core network can be saved, diminishing the risk of network congestion.

Task offloading is a key feature of MEC/VEC technologies. Due to the extra energy and time consumption induced by data transmission and remote task execution, offloading computation tasks to edge servers may not always bring benefits. A key technical challenge is to balance the overall costs of computation and communication when making offloading decisions. A number of efficient offloading algorithms have been reported in the recent literature (e.g., [3], [6], [7]). To simplify the decision-making problem, most existing solutions rely on the assumption of a static or quasi-static environment, e.g., that the state of the wireless channel is fixed during the complete offloading procedure. When user mobility and complex wireless signal propagation are taken into consideration, the assumption of a static environment no longer holds. The Markov decision process (MDP) is an effective mathematical tool to model the impact of user actions in a dynamic environment, and it allows seeking the optimal offloading decision for achieving a particular long-term goal [8]–[10]. To …
2327-4662 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Auckland University of Technology. Downloaded on May 27,2020 at 21:27:14 UTC from IEEE Xplore. Restrictions apply.
II. SYSTEM MODEL

In this section, we first introduce the system architecture. Then, the detailed mathematical models of the task queue, the communication and computation operations, and the overall design objective are elaborated.

Notation: Throughout the paper, we use N+ to denote the set of all positive integers. CN(0, σ²) denotes a complex Gaussian distribution with zero mean and variance σ². ⌈·⌉ denotes the ceiling operator. 1{Φ} is the indicator function that equals 1 if the condition Φ is satisfied and 0 otherwise.

A. System Architecture

In this paper, we consider a typical application scenario of VEC [2], as shown in Fig. 1. VTs are driven along an expressway, served by RSUs deployed along the roadside. The distance between adjacent RSUs is L meters, and the coverage regions of different RSUs do not overlap. Hence, the road is divided into segments based on the coverage range of the RSUs. A VT can only be served by one RSU through vehicle-to-infrastructure (V2I) communication. When it is driven across the boundary of two road segments, a handover occurs. Each RSU is equipped with an MEC server, which acts as a proximity cloud for VTs, i.e., a certain amount of computing resource can be reserved for each VT to conduct task computation. A backup server located at the network backend is connected to all the MEC servers through wired connections, and can enhance the capability of each MEC server when its computing resource is not adequate. MEC servers can communicate with each other through the core network. Nevertheless, to avoid degrading the condition of the core network, raw in-vehicle application data (e.g., input data for task offloading), which usually have large size, are not allowed to be transmitted between RSUs [2].

The focus of this work is to identify, from the perspective of VTs, an offloading scheduling strategy that can efficiently complete the execution of in-vehicle application computation tasks with minimum cost. Our analysis hereinafter concentrates on one single representative VT.¹ The interaction between the VT and its serving MEC server regarding computation offloading is shown at the bottom of Fig. 1. We consider an abstract computing architecture in the VT, which consists of a task queue, a task scheduler, a local processing unit (LPU), and a data transmission unit (DTU).² Independent computation tasks, possibly generated by multiple types of applications, randomly arrive in the task queue (sorted by their generation time). The task scheduler synthesizes all available system information (including the queue state, local execution state, transmission state, and remote execution state) and schedules the tasks based on an offloading scheduling policy. A task that is assigned for local execution is processed on the LPU, according to the scheduled execution time. For remote execution, a task is first transmitted to the serving RSU via the DTU. Afterward, the MEC server at the RSU executes the task using the reserved computation resource and then sends the computation result back to the VT.

¹In this paper, a fixed amount of the MEC computation resource and V2I transmission bandwidth is assumed to be reserved for each VT. Dynamically optimizing the allocation of limited computing and communication resources among multiple VTs is a more challenging issue, which is one of our future research topics.

²The focus of our work is to identify a proper offloading scheduling strategy to balance local and remote execution in a complicated vehicular communication and computation environment. A computing architecture with a single central unit is adopted.

The whole system is considered to operate in a slotted fashion: the smallest time interval for clock counting, software/hardware operation, and message coding/transmission is deemed a unit time slot. At the beginning of each time slot, the scheduler monitors the system state and, if necessary, performs offloading scheduling. Then, computation and/or communication operate during the entire time slot. The time instant at which the considered VT starts the scheduling process for the tasks in its queue is taken as the initial time slot 0. Let the speed of the VT be V (in meter/slot). At any time slot t (t ∈ N+), its location (coordinate) is given by x[t] = V t + x[0], where x[0] is the starting location.

B. Task Queue Model

We model the system's workload as a Poisson process with rate λ, indicating the expected number of computation tasks arriving in the VT's task queue in each time slot. The ith (i ∈ N+) arriving task J_i is described as a 3-tuple:

J_i ≜ (t_i^g, d_i, k_i).    (1)

In (1), t_i^g is the time at which J_i is generated, d_i (in bits) is the size of the task input data, and k_i (in CPU cycles/bit) is its CVR, which can be obtained by applying program profilers [19]. All tasks waiting in the queue are considered to be generated by computation-intensive in-vehicle applications (e.g., object recognition or virus scanning). In general, the computation results of such tasks (e.g., object tags or virus reports) have sufficiently smaller data sizes compared with those of the input (e.g., pictures or files) [3]. Therefore, the size and transmission time of the output data of all tasks are considered negligible throughout the paper.

We consider tasks without stringent latency demands or execution priority. They are sorted in the queue by their generation time. In other words, when a new task arrives, it is appended to the rear of the task queue. If any task in the queue is sent to the LPU or DTU by the scheduler, the tasks after it are shifted forward to fill the empty position. We use Q to denote the maximum number of tasks the queue can hold, and q[t] (q[t] ≤ Q) to denote the actual number of tasks in the queue at time slot t. If q[t] = Q, new incoming tasks have to be discarded and an overflow event occurs. Now, the state of the task queue at any time slot t can be represented by a Q × 3 matrix Q[t], in which the jth (j ∈ {1, 2, · · · , q[t]}) row is formed by the three defining elements of the jth waiting task (denoted by Q[t]⟨j⟩), i.e., the task generation time, input data size, and CVR, respectively. Each of the remaining Q − q[t] rows in Q[t] is a 1 × 3 all-zero vector, indicating the absence of a waiting task at that queue position.

Note that we can use two notations to refer to a task. The first reflects the natural task generation process. For example, J_i in (1) represents the ith task generated by an in-vehicle
application, and the index i can be any positive integer. The second type of notation reflects the real-time status of the task queue. For instance, Q[t]⟨j⟩ is the jth (sorted) task waiting in the queue at a particular time slot t, and the integer index j is upper-bounded by Q. The former notation uniquely distinguishes different tasks, but the latter better describes the dynamic nature of the considered system. To facilitate presentation, with a slight abuse of notation, we follow the same form as (1) to describe task Q[t]⟨j⟩: when time t is known, we have

Q[t]⟨j⟩ ≜ (t_j^g, d_j, k_j).    (2)

C. Communication Model

The wireless V2I communication (for task offloading) is conducted in a block-fading environment. The fading coefficient remains unchanged within each channel coherence time interval, which can span one or multiple time slots, but varies randomly (not necessarily independently) afterward. We assume that the channel fading coefficient between the VT and its serving RSU is statistically determined by the signal propagation environment between them. For instance, under Rayleigh fading, the complex fading coefficient h can be modeled by h = h̃ĥ, where ĥ ∼ CN(0, 1) represents small-scale fading and the large-scale fading coefficient h̃ reflects the impact of both path loss and shadowing. In general, h̃ is a function of E, a set of environment factors such as the position of the VT x, the distance to the serving RSU d, and possibly other factors. Since E is relatively simple to estimate, knowing E and thus h̃ statistically describes h as a random variable generated from CN(0, |h̃|²).

Taking the capability of the employed channel code into consideration, the reliable transmission data rate r from the VT to its serving RSU is a deterministic function of the channel fading coefficient. This means that the transmission rate is also statistically determined by the environment factors, i.e., it obeys a conditional probability density function (PDF) f(r|E). For demonstration purposes, in this paper we limit the set E = {x, d}: the environment is completely described by the location of the VT and the distance between the VT and the serving RSU. For fixed x and d, the PDF f(r|E) is a fixed function but is unknown to both the VT and the RSUs.

At any time slot t, the VT can obtain a certain level of knowledge regarding the instantaneous fading coefficient h[t], based on which it can infer the data rate r[t] (in bits/slot). If the VT intends to transmit data to the RSU at time slot t, then r[t] bits of data can be successfully delivered. Therefore, we denote by t^tx(v, t) the time consumed for transmitting data of size v (in bits) to the serving RSU, starting from time slot t. It satisfies

Σ_{s=t}^{t+t^tx(v,t)−1} r[s] < v ≤ Σ_{s=t}^{t+t^tx(v,t)} r[s].    (3)

In practice, channel knowledge is only causally available. The movement of the VT also contributes to a dynamic channel environment in which channel gains change rapidly. It is difficult to predict the exact achievable transmission rate at future time slots. Hence, the value of t^tx(v, t) can only be known when the data transmission actually completes. As a result, a good offloading scheduling strategy should make the offloading decisions sequentially, instead of computing a "one-shot" solution.

Another issue caused by the movement of the VT and the dynamic wireless channel environment is transmission failure due to handover. As mentioned earlier, to avoid degrading the condition of the core network, raw in-vehicle application data are not allowed to be transmitted between RSUs. When the VT leaves the coverage area of an RSU before the input data transmission of a task has finished, the previously transmitted data cannot be passed to the new serving RSU and have to be discarded. The task has to be re-transmitted by the VT, at the cost of wasted energy and large delay. In contrast, since the size of the computation output data is negligible, the computation result can be delivered from the previous serving RSU to the new one if the task input data transmission was completed before the handover.

D. Computation Model

The tasks in the queue are expected to be executed in a timely and efficient manner, i.e., with small latency and energy usage. Large delay reduces the usefulness of the in-vehicle application and degrades user experience. High energy consumption quickly drains the VT's battery. Both can be considered as the cost of task execution. The time and energy demanded for executing each task, either locally on the LPU or remotely on the MEC server, are presented as follows.

1) Local Execution Model: Assume that the scheduler determines to schedule task J_i for local execution at time slot t_i^a. Then, starting from t_i^a, all of J_i's CPU cycles, i.e., d_i k_i, will be executed on the LPU. The required number of time slots to complete J_i on the LPU is

t_i^l = d_i k_i / f^l,    (4)

where f^l (in cycles/slot) is the CPU frequency of the LPU. Following [20] and [21], the average power consumption p^l (in joule/slot) can be modeled by a super-linear function of the CPU frequency as p^l = ξ(f^l)^ν, where ξ and ν are both constants. Hence, the energy consumed for executing J_i locally is

e_i^l = p^l t_i^l.    (5)

2) Remote Execution Model: If task J_i is scheduled for offloading at time slot t_i^a, the DTU starts to deliver the task to the serving RSU at time slot t_i^a. Upon successfully receiving all the input data, the MEC server executes the task remotely for the VT. The total time consumption consists of two parts: the time for wireless data transmission and the time for task computation on the MEC server. The former is related to the scheduled starting time t_i^a, since the transmission data rate at each time slot depends on the location of the VT, as presented in Section II-C. The number of time slots for completing the delivery of the d_i bits of input data through V2I communication can be derived according to (3) as

t_i^tx(t_i^a) = t^tx(d_i, t_i^a).    (6)
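The transmission-time rule (3) and the local-execution costs (4)–(5) can be sketched numerically. The following is a minimal Python illustration, not the paper's implementation: function and variable names are ours, the constant-rate example is ours, and rounding local execution up to whole slots (using the ceiling operator from the paper's notation) is our assumption.

```python
import math

def t_tx(v_bits, t0, rate_of_slot):
    """Transmission slots per (3): accumulate the per-slot rates r[s]
    from slot t0 until the running total first reaches v_bits.
    `rate_of_slot` is a callable s -> r[s] (bits/slot); in the paper
    r[s] is random and only causally known, so this quantity can only
    be evaluated once the transmission actually completes."""
    sent, s = 0, t0
    while sent < v_bits:
        sent += rate_of_slot(s)
        s += 1
    return s - t0

def t_local(d, k, f_l):
    """Local execution slots per (4): d*k CPU cycles at f_l cycles/slot,
    rounded up to whole slots (our assumption)."""
    return math.ceil(d * k / f_l)

def e_local(d, k, f_l, xi, nu):
    """Local energy per (5), with the super-linear power model
    p_l = xi * (f_l ** nu) joules per slot."""
    return xi * (f_l ** nu) * t_local(d, k, f_l)
```

For example, with a constant rate of 4 bits/slot, delivering 10 bits takes `t_tx(10, 0, lambda s: 4)` = 3 slots; a handover mid-transfer would reset the accumulated `sent` to zero, which is why (3) cannot be evaluated ahead of time.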
In addition, the number of time slots required for executing J_i on the MEC server is

t_i^exe = d_i k_i / f^s,    (7)

where f^s is the CPU frequency reserved for the VT by the service provider. Hence, the total number of time slots consumed for offloading J_i is

t_i^o(t_i^a) = t_i^tx(t_i^a) + t_i^exe.    (8)

From the VT's perspective, remote execution of tasks does not consume its energy. The energy usage for offloading J_i is only for data transmission and is given by

e_i^o(t_i^a) = p^tx t_i^tx(t_i^a),    (9)

where p^tx (in joule/slot) is the transmission power of the VT.

where weighting coefficients α and β can be chosen to reflect the design preference towards smaller time or energy usage.

The overall objective of our system design is to find the optimal computation offloading scheduling strategy for the scheduler, in order to minimize the long-term cost, i.e., the average cost of making scheduling decisions for all the tasks generated by the VT:

minimize_{a, t^a}  lim_{n→∞} (1/n) Σ_{i=1}^{n} c_i(a_i, t_i^a),    (14)

where a = [a_1, a_2, · · · , a_n] and t^a = [t_1^a, t_2^a, · · · , t_n^a]. The term Σ_{i=1}^{n} c_i(a_i, t_i^a) is the weighted sum of the time and energy consumption of all the tasks.

Solving problem (14) is very challenging due to the impacts of the random task generation process and the dynamic wireless transmission environment. Conventional optimization techniques are hence hard to apply. In this paper, we propose to apply DRL to identify the optimal offloading scheduling policy. We will show that our method can efficiently and effectively estimate the unknown system dynamics, and then properly make the scheduling decisions to achieve the expected long-term objective.

DRL uses DNNs to approximate the policy and/or the value function. With the powerful representation capability of DNNs, a large state space can be supported. Current DRL methods can be classified into two categories: value-based methods and policy-based methods.
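The empirical counterpart of objective (14) is straightforward to compute once per-task costs are known. The exact per-task cost definition c_i(a_i, t_i^a) lies outside this excerpt, so the linear weighted form below, with coefficients alpha and beta, is an assumption based on the surrounding text ("a weighted sum of time and energy consumption"); the function names are ours.

```python
def task_cost(time_slots, energy, alpha=1.0, beta=1.0):
    """Per-task cost c_i as a weighted sum of latency and energy.
    NOTE: assumed linear form; the paper's exact definition is not
    reproduced in this excerpt."""
    return alpha * time_slots + beta * energy

def empirical_objective(costs):
    """Finite-horizon version of the long-term average cost in (14):
    the mean of the per-task costs observed so far."""
    return sum(costs) / len(costs)
```

With this form, raising alpha penalizes latency more heavily, while raising beta steers the scheduler toward energy-frugal decisions, matching the stated role of the weighting coefficients.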
Value-based DRL methods adopt DNNs to approximate the value function (called the value network), e.g., deep Q-network (DQN) [23] and double DQN [24]. Generally, the core idea of value-based DRL methods is to minimize the difference between the value network and the real value function. A natural objective function can be written as

L^V(θ) = E_n[(v_{π*}(s_n) − v(s_n; θ))²],    (17)

where v(·; θ) is the value network and θ is the set of its parameters. v_{π*}(·) represents the real value function, which is unknown but is estimated by different value-based RL methods. The expectation E_n[·] indicates the empirical average over a finite batch of samples in an algorithm that alternates between sampling and optimization.

Policy-based DRL methods use DNNs to approximate the parameterized policy (called the policy network), e.g., REINFORCE [25] and Actor-Critic [26]. Compared with value-based DRL methods, these methods have better convergence properties and can learn stochastic policies. Policy-based DRL methods work by computing an estimator of the policy gradient, and the most commonly used gradient estimator has the form

∇L^PG(θ) = E_n[∇_θ log π(a_n|s_n; θ) Â_n],    (18)

where π is a stochastic policy and Â_n is an estimator of the advantage function at time step n.

Traditional policy-based DRL methods, such as REINFORCE, have two main defects: 1) the Monte Carlo sampling (to obtain G_n) brings high variance, leading to slow learning; 2) the on-policy update (training and sampling using the same policy) can easily converge to a local optimum.

Recently, generalized advantage estimation (GAE) was proposed to make a compromise between variance and bias [27]. The GAE estimator is written as

Â_n^{GAE(γ,φ)} = Σ_{l=0}^{∞} (γφ)^l η_{n+l}^v,    (19)

where φ is used to adjust the bias-variance trade-off, and

η_n^v = r_n + γ v(s_{n+1}; ω) − v(s_n; ω).    (20)

To alleviate the local optimum problem, off-policy learning is introduced to increase the exploration ability of policy gradient methods. The PPO algorithm proposed by OpenAI is currently the state of the art [18]. Its objective function is

L^CLIP(θ) = E_n[min(r_n(θ)Â_n, clip(r_n(θ), 1 − ε, 1 + ε)Â_n)],    (21)

where r_n(θ) is the policy probability ratio defined as

r_n(θ) = π(a_n|s_n; θ) / π(a_n|s_n; θ_old).    (22)

The clip function clip(r_n(θ), 1 − ε, 1 + ε) constrains the value of r_n, removing the incentive for moving r_n outside the interval [1 − ε, 1 + ε], with ε being a hyper-parameter that controls the clip range. By taking the minimum of the clipped and unclipped objectives, the final objective becomes a lower bound on the unclipped objective. Because of these advantages, in this paper we design our DRL-based offloading scheduling method based on PPO.

IV. MDP FORMULATION

We propose to apply DRL to solve the considered offloading scheduling problem. To this end, we first formulate an MDP that adequately describes the offloading scheduling process, and then use DRL to find the optimal policy for the MDP. In what follows, each element of the MDP is presented.

A. State Space

At the beginning of each time slot, the scheduler monitors the system state, based on which the offloading scheduling decision is made. We define the state space of our MDP as

S ≜ {s | s = (Q, s^lpu, s^dtu, s^mec, x, d)},    (23)

where each state s uses 3Q + 5 parameters (dimensions) to represent the status of the VT's task queue (Q), the LPU (s^lpu), the DTU (s^dtu), the MEC server (s^mec), and finally the wireless environment ({x, d}). As stated in Section II, at each time slot t, Q is a Q × 3 matrix, the jth row of which gives the task generation time, input data size, and CVR of task Q[t]⟨j⟩. x and d are the current location of the VT and its distance to the serving RSU, which determine the statistics of the current V2I transmission data rate. The remaining state parameters s^lpu, s^dtu, and s^mec are elaborated as follows.

1) LPU state s^lpu: We use s^lpu to describe the number of remaining CPU cycles that the LPU requires to complete the task currently running on it. Specifically, assume that the scheduler decides to start operating task J_i on the LPU from a certain time slot t_1. Then, at the beginning of time slot t_1, s^lpu[t_1] is initialized to d_i k_i, the total number of CPU cycles demanded by J_i. Afterward, during each time slot t ≥ t_1, the LPU provides f^l CPU cycles (and consumes p^l joules of energy) to execute the task. The value of s^lpu is thus reduced by f^l, until the computation is finished and s^lpu is set to 0. Such an LPU state updating process (for a single task) can be described by

s^lpu[t] = max{s^lpu[t − 1] − f^l, 0}, t > t_1.    (24)

When s^lpu reaches 0, the LPU is said to be idle and is able to host a new task.

2) DTU state s^dtu: The parameter s^dtu describes the amount of remaining data of the task that needs to be transmitted to the MEC server by the DTU. Now, suppose that the scheduler decides to offload task J_i starting from time slot t_2. This leads to s^dtu[t_2] = d_i. Then, in the tth time slot (t ≥ t_2), up to r[t] bits of data can be delivered from the VT to the RSU (consuming p^tx joules of energy), so that the value of s^dtu is decreased by r[t]. Consequently, the DTU state updating process (for a single task) is given by

s^dtu[t] = max{s^dtu[t − 1] − r[t − 1], 0}, t > t_2,    (25)

until s^dtu reaches 0.

It is worth noting that the above process can be interrupted by a handover event. For instance, at a certain time slot t_3 (t_3 ≥ t_2), the VT enters the coverage region of a new RSU, but the transmission of J_i has not yet finished, i.e., s^dtu[t_3] ≠ 0. As mentioned earlier, in this case, the re-transmission of the whole task is automatically conducted. Therefore,
sdtu [t3 ] is re-initialized to di immediately, and afterward the 2) RE Action Type: When the DTU is idle and the queue is
data transmission continues until sdtu = 0. Since the re- not empty, an RE action can be taken to offload the specified
transmission causes waste of both time and energy, a good task for remote execution. We define all the actions of this
offloading scheduling policy should try to avoid it in its type as a set, i.e.,
decision-making process.
RE , {RE1 , RE2 , · · · REQ }. (28)
3) MEC server state smec : Similar to the LPU state, the
value of smec indicates the number of remaining CPU cycles By taking REj at time slot t2 , task Q[t2 ]hji is sent to the
that the MEC server requires to execute for the current DTU for uploading. The DTU state is set as sdtu [t2 ] = dj , and
offloaded task. Assume that at a specific time slot t4 , the trans- the task queue state Q[t2 ] is updated similar to the LE action
mission of Ji is found completed (i.e., sdtu [t4 ] reaches zero). type as introduced before. Then, the states of the DTU and
Then the MEC server state is initialized to smec [t4 ] = di ki im- MEC server changes with time according to Section IV-A2
mediately. In each time slot, the MEC server provides f s CPU and IV-A3, respectively, without taking any action.
cycles (no energy consumption from the VT’s perspective) for 3) HO Action Type: The HO actions are responsible for
task computation and thus reduces the value of smec by f s . postponing task scheduling. The set of actions that perform
This leads to the MEC server state satisfying such a function is defined as
At any time slot t, only when both sdtu [t] = 0 and smec [t] = where HOw (w ∈ N+ ) means that the scheduler determines
0, we say that the DTU is idle so that a V2I transmission to keep all waiting tasks to stay in the queue for w time slots,
process can be activated.3 Otherwise, if either sdtu [t] 6= 0 or even though the LPU and/or DTU are capable of accepting
smec[t] ≠ 0, tasks cannot be scheduled for offloading.

The (3Q + 5)-dimensional state space S defined in (23) completely describes the characteristics of the environment that our agent (i.e., the scheduler) interacts with. Because each dimension has a large value range (i.e., x ≥ 0, 0 ≤ d ≤ L/2, slpu ≥ 0, sdtu ≥ 0, and smec ≥ 0), our MDP has an enormous state space.

B. Action Space

The action space of our MDP includes three types of scheduling actions: local execution (LE), remote execution (RE), and holding-on (HO). LE and RE carry out the "where to schedule" operation, and HO is specially designed to carry out the "when to schedule" operation. Each type includes multiple individual actions. They are elaborated as follows.

1) LE Action Type: Actions of this type schedule tasks waiting in the VT's queue to the LPU. The complete set of such actions is defined as

LE ≜ {LE1, LE2, · · · , LEQ}, (27)

where LEi (i ∈ {1, 2, · · · , Q}) is dedicated to the ith task in the queue. For instance, at the beginning of a certain time slot t1, the scheduler decides to take action LEi (this is possible only when the LPU is currently idle and the ith task in the queue exists, i.e., slpu[t1] = 0 and i ≤ q[t1]). By taking LEi, task Q[t1]⟨i⟩ is sent to the LPU, which changes the state of the LPU (slpu[t1]) and the task queue (Q[t1]) immediately. Specifically, slpu[t1] is set to di ki, and in the VT's queue, the tasks after Q[t1]⟨i⟩ are shifted forward to fill the empty position. Afterward, the LPU state updates slot by slot as described in Section IV-A1.

…the computation or transmission of jobs. For example, at time slot t5, an action HOw is taken. Then, in the next w time slots (from time slot t5 to time slot t5 + w − 1), the scheduler does not carry out any offloading decisions (the tasks scheduled for local or remote operations before time slot t5 are not affected). Certainly, conducting an HO action increases the latency of all tasks waiting in the queue. However, if the wireless condition is poor and offloading would cause unnecessarily large delay and/or energy consumption, properly postponing the task scheduling process to wait for better transmission opportunities is worthwhile.

4) Action Space and Action Legitimacy: With the three types of actions defined above, the complete action space of our MDP, A, is defined as the union of the three sets LE, RE, and HO, i.e.,

A ≜ LE ∪ RE ∪ HO. (30)

However, for each system state s ∈ S, the set of actions that the scheduler can take, A(s), may be only a subset of A. Throughout our paper, we term the actions included in A(s) legal actions for state s, and the remaining actions in A (i.e., those in the complement of A(s)) illegal actions.

Specifically, at the beginning of each time slot t, the scheduler monitors the system state s and determines whether a scheduling action should be taken. Consider the case that a total of q[t] tasks are waiting in the queue to be scheduled. As mentioned earlier, if the current LPU is idle (slpu[t] = 0), the LE actions in the set {LE1, LE2, · · · , LEq[t]} and all HO actions are legal. But when the LPU is busy (slpu[t] > 0), all actions in LE are illegal. Similarly, if the current DTU is idle (sdtu[t] = 0 and smec[t] = 0), the RE actions in the set {RE1, RE2, · · · , REq[t]} and all HO actions are legal. When the DTU becomes busy (sdtu[t] > 0 or smec[t] > 0), all actions in RE are illegal. Finally, when there is no task in the queue (i.e., Q[t] is an all-zero matrix) or no executing resource is available (i.e., both the LPU and DTU are busy), no action can be taken (i.e., A(s) = ∅). Table I summarizes the legitimacy of each type of action.

³ In this paper, for simplicity, the situation in which an MEC server can still receive data while executing another offloaded task is not considered. Hence, offloading can be performed only when both sdtu[t] = 0 and smec[t] = 0.
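The legitimacy rules above can be sketched as a small helper function. This is an illustrative simplification, not the paper's implementation: the state is reduced to four scalar fields, and action labels are plain strings.

```python
# Sketch of the action-legitimacy rules (Table I); names and state
# encoding are illustrative assumptions, not the paper's code.

def legal_actions(q, s_lpu, s_dtu, s_mec, ho_actions=("HO1", "HO2")):
    """Return the legal action set A(s) for a simplified state.

    q      -- number of tasks waiting in the queue (q[t])
    s_lpu  -- remaining CPU cycles on the LPU (0 means idle)
    s_dtu  -- remaining data bits on the DTU (0 means idle)
    s_mec  -- remaining cycles on the MEC server (0 means idle)
    """
    legal = []
    if q > 0:
        if s_lpu == 0:                      # LPU idle: LE_1..LE_q are legal
            legal += [f"LE{i}" for i in range(1, q + 1)]
        if s_dtu == 0 and s_mec == 0:       # DTU and MEC idle: RE_1..RE_q legal
            legal += [f"RE{i}" for i in range(1, q + 1)]
        if legal:                           # HO is legal whenever LE or RE is
            legal += list(ho_actions)
    return legal                            # empty list means A(s) is empty
```

For example, with two queued tasks, an idle LPU, and a busy DTU, only the LE and HO actions are returned; with an empty queue or both units busy, the returned set is empty, matching situation (4) in Table I.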
2327-4662 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Auckland University of Technology. Downloaded on May 27,2020 at 21:27:14 UTC from IEEE Xplore. Restrictions apply.
[Fig. 2. An example task scheduling time line. (timeline graphic omitted)]

It is worth noting that, due to the dynamic wireless environment, the diversity of offloading tasks, and the potential of parallel task execution (both locally and remotely), the time interval between two consecutive actions in fact varies. This is different from most conventional MDP modeling for MEC, which assumes a fixed inter-decision time duration [15], [16]. Fig. 2 gives an example of a task scheduling time line in our MDP. Consider the case that the VT's task queue is not empty, and the LPU, DTU, and MEC states are all initialized to zero. At time slot 1, both the LPU and the DTU are idle, which is situation (1) in Table I. The scheduler chooses an LE action from the legal action space A(s), and the associated task is sent to the LPU. At time slot 2, since the LPU is busy, actions in LE are excluded from the legal action space (situation (3) in Table I). The scheduler takes an RE action to offload a task for remote execution, and the DTU becomes busy, too. Time slots 3-6 represent situation (4) in Table I. Due to the lack of computation and communication resources, no action can be taken. At time slot 7, the scheduler observes that the DTU is idle and immediately takes another RE action. At time slot 13, the LPU is found to be idle (situation (2) in Table I), but the scheduler considers it improper to locally compute any of the tasks in the queue (e.g., because all tasks require a large number of CPU cycles). An HO action postponing the task scheduling for 11 time slots is adopted. As a result, before the 25th time slot, no action can be taken even when the DTU becomes idle from time slot 19. After the holding-on operation completes, the decision-making process continues in a similar fashion.

Although the system has a different state in each time slot, the scheduler's actions actively affect the system state only in those time slots where an action is taken (e.g., time slots 1, 2, 7, and 14 in Fig. 2). The changes of system states in other time slots (termed intermediate states) are caused by environmental factors such as newly arriving tasks (changing the queue state Q), computation with the local CPU (changing the LPU state slpu), successful or failing (due to handover) V2I transmission (changing the DTU state sdtu), and computation with the remote MEC server (changing the MEC state smec).

C. Rewards

Suppose that at time slot t_{a_n}, the scheduler takes the nth action a_n ∈ A on state s_n (s_n ∈ S). Then, after t_{b_n} time slots, it comes to a new state s_{n+1} (s_{n+1} ∈ S), based on which a new action needs to be taken (at time slot t_{a_{n+1}}). We define the reward function R(s_n, a_n, s_{n+1}) based on the time and energy consumption of all the tasks in the VT within the time interval of this state transition. Note that the time interval of the state transition, i.e., t_{b_n}, varies with states and actions, but is easy to obtain (t_{b_n} = t_{a_{n+1}} − t_{a_n}).

At each time slot t, if an arriving task is not finished, its latency increases by one time slot. Hence the total time delay (in slots) of all the tasks in the VT at time slot t is

∆sl(t) = q[t] + 1{slpu[t]>0} + 1{sdtu[t]>0} + 1{smec[t]>0}. (31)

In (31), if a task is running on the LPU, then 1{slpu[t]>0} = 1, and if a task is in the process of offloading, either 1{sdtu[t]>0} or 1{smec[t]>0} is 1. Therefore, the total time delay of all the tasks in the VT from state s_n to s_{n+1} after taking action a_n can be calculated by

∆l(s_n, a_n, s_{n+1}) = Σ_{t=t_{a_n}}^{t_{a_{n+1}}−1} ∆sl(t). (32)

The energy consumption is associated only with local task computation and input data transmission. Hence, the total energy consumption of all the tasks in the VT at time slot t can be calculated by

∆se(t) = pl · 1{slpu[t]>0} + ptx · 1{sdtu[t]>0}. (33)

Therefore, the total energy consumption of all the tasks in the VT from state s_n to s_{n+1} after taking action a_n is given by

∆e(s_n, a_n, s_{n+1}) = Σ_{t=t_{a_n}}^{t_{a_{n+1}}−1} ∆se(t). (34)

Finally, since our system has a dynamic workload, if the task arrival rate λ is relatively large compared with the scheduling speed, overflow events may occur and new incoming tasks have to be discarded. This issue degrades the user experience of vehicular applications. Hence, we consider a cost induced by task queue overflow in our MDP, ∆o(s_n, a_n, s_{n+1}), which denotes the number of overflowed tasks from state s_n to s_{n+1} after taking action a_n.

Therefore, the overall cost brought by action a_n can be considered a weighted sum of ∆l(s_n, a_n, s_{n+1}), ∆e(s_n, a_n, s_{n+1}), and ∆o(s_n, a_n, s_{n+1}):

cost(s_n, a_n, s_{n+1}) ≜ α∆l(s_n, a_n, s_{n+1}) + β∆e(s_n, a_n, s_{n+1}) + ζ∆o(s_n, a_n, s_{n+1}). (35)
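The cost in (31)-(35) can be checked numerically with a small sketch. The snapshot encoding and the power values pl and ptx below are illustrative assumptions; only the arithmetic follows the equations.

```python
# Numerical sketch of the transition cost (31)-(35); the state snapshots
# and the power values p_l, p_tx are illustrative, not the paper's settings.

def slot_delay(q, s_lpu, s_dtu, s_mec):
    """Eq. (31): queued tasks plus one per busy processing/transmission unit."""
    return q + (s_lpu > 0) + (s_dtu > 0) + (s_mec > 0)

def slot_energy(s_lpu, s_dtu, p_l, p_tx):
    """Eq. (33): local computing power plus V2I transmission power."""
    return p_l * (s_lpu > 0) + p_tx * (s_dtu > 0)

def transition_cost(snapshots, overflow, alpha, beta, zeta, p_l=0.5, p_tx=1.0):
    """Eqs. (32), (34), (35): sum the per-slot terms over the slots of one
    state transition and combine them with the weights alpha, beta, zeta.
    Each snapshot is a tuple (q, s_lpu, s_dtu, s_mec) for one time slot."""
    delta_l = sum(slot_delay(q, lpu, dtu, mec) for (q, lpu, dtu, mec) in snapshots)
    delta_e = sum(slot_energy(lpu, dtu, p_l, p_tx) for (_, lpu, dtu, _) in snapshots)
    return alpha * delta_l + beta * delta_e + zeta * overflow
```

For instance, a two-slot transition with two queued tasks, one slot of local computation, and one slot of transmission yields ∆l = 6 and ∆e = 1.5 under the assumed powers.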
We can find that Gm is the weighted sum of the time and energy consumption of all the tasks if the discount factor γ is set to 1 and no task overflow happens. Therefore, the goal of finding the optimal offloading scheduling policy π* is consistent with the original objective introduced in Section II-E.

The enormous state and action spaces coupled with the highly dynamic environment make it difficult to obtain the environment dynamics, i.e., P(s_{n+1} | s_n, a_n). In the next section, we resort to model-free DRL to search for the optimal offloading scheduling policy.

TABLE II
The Architecture of the Shared DNN (Leaky-ReLU is adopted as the activation function for all neurons)

Layer    | Kernels   | Stride | Output                   | Remark
Input    |           |        | All states               |
Split    |           |        | 20 × 2 × 1, other states | Cache other states
Conv2D   | 2 × 2 × 8 | 1      | 19 × 1 × 8               | Resize, no padding
Conv1D   | 4 × 16    | 1      | 16 × 16                  | No padding
Conv1D   | 4 × 16    | 2      | 7 × 16                   | No padding
Conv1D   | 4 × 20    | 1      | 4 × 20                   | No padding
Concat   |           |        | 4 × 20 ⊕ other states    | Concat other states
FC layer |           |        | 512                      | Layer Norm. [28]
FC layer |           |        | 512                      | Layer Norm.
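The output sizes in Table II can be verified with the standard no-padding convolution formula, out = floor((in − kernel) / stride) + 1. The sketch below only checks this shape arithmetic; it assumes, as Table II suggests, that the 19 × 1 × 8 Conv2D output is resized into length-19 sequences before the Conv1D stack.

```python
# Shape check for the shared DNN in Table II, assuming valid (no-padding)
# convolutions: out = floor((in - kernel) / stride) + 1.

def conv_out(size, kernel, stride=1):
    """Output length of a valid convolution along one dimension."""
    return (size - kernel) // stride + 1

# Queue part of the state: a 20 x 2 map.
h, w = 20, 2
h, w = conv_out(h, 2), conv_out(w, 2)   # Conv2D 2x2x8, stride 1 -> 19 x 1
assert (h, w) == (19, 1)

n = h                                   # resized to a length-19 sequence
n = conv_out(n, 4)                      # Conv1D 4x16, stride 1 -> 16
n = conv_out(n, 4, stride=2)            # Conv1D 4x16, stride 2 -> 7
n = conv_out(n, 4)                      # Conv1D 4x20, stride 1 -> 4
assert n == 4                           # 4 x 20 feature map, as in Table II
```

The 4 × 20 feature map is then flattened, concatenated with the cached non-queue states, and fed into the two 512-unit fully connected layers.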
B. Training Algorithm

Since the parameter-shared DNN architecture is adopted, the overall objective is a combination of the error terms of both the policy network and the value network. We utilize GAE as the estimator function. The objective function of the policy network can be derived by substituting (19) and (22) into (21), and that of the value network can be obtained from (17). Therefore, the overall maximization objective function can be written as

L_PPO(θ) = E_n[L_n^CLIP(θ) − c L_n^V(θ)], (38)

where c is the loss coefficient. The subscript n of L_n^CLIP and L_n^V means that we obtain their values without taking the expectation in (21) and (17) (we take the empirical average over a finite batch of samples after their combination).

Algorithm 1 Training algorithm based on PPO
1: Initialize the DNN parameter θ randomly to obtain πθ
2: Initialize the sampling policy πθold with θold ← θ
3: for iteration = 1, 2, · · · do
4:   /* Sampling (exploring) with πθold */
5:   for i = 1, 2, · · · , N do
6:     Sample a whole episode in the environment with πθold, and store the trajectory in Ti
7:     Compute advantage estimates Â_n^GAE(γ,φ) according to (19) for each time step n in Ti
8:   end for
9:   Cache all sampled data in two sets: T and A
10:  /* Optimizing πθ (exploiting) */
11:  for epoch = 1, 2, · · · , K do
12:    Update the policy based on the objective function (38), i.e., θ ← arg max_θ L_PPO(θ), by Adam using the sampled data from T and A for one epoch
13:  end for
14:  Synchronize the sampling policy with θold ← θ
15:  Drop T and A
16: end for

The details of the proposed training algorithm based on PPO are presented in Algorithm 1. Two DNNs are initialized with the same parameters (θold ← θ), one for sampling (πθold) and the other for optimizing (πθ). The algorithm alternates between sampling (lines 4-8) and optimization (lines 10-13). In the sampling stage, N trajectories (e.g., (s0, a0, r0, s1, · · · , ster)) are sampled following the old policy πθold. For training efficiency, the generalized advantage estimates for each time step n in each trajectory Ti, i.e., Â_n^GAE(γ,φ), are computed in advance in this stage. The sampled data are cached for optimization (line 9). In the optimization stage, the parameter θ of the policy πθ is updated for K epochs. In each epoch, we improve the policy πθ by conducting stochastic gradient ascent on the cached sampled data based on the objective function (38). After the optimization stage, we update the sampling policy πθold with the current πθ and drop the cached data (lines 14-15). Then the next iteration begins.

It is worth noting that the random policy in the exploring stage cannot ensure the legitimacy of the chosen action according to the current state. Two types of illegal actions, as described in Section IV-B, can both be selected when sampling (line 6). We deal with this problem as follows. First, if an illegal action type is chosen (e.g., selecting the LE action type when the LPU is busy), the system neglects the illegal action, leaving the system state unchanged. Since an action is still needed at this time, this prompts the offloading scheduling policy to reschedule according to the same state. Thanks to the stochastic policy supported by policy-based DRL, there are always opportunities for other action types to be selected. Then, the offloading process continues. To force our learning algorithm to effectively reduce the attempts to explore illegal actions, we add a penalty term, ki, to the reward when an illegal action type is selected during the training process. Second, when the task specified by an LEj or REj action does not exist, i.e., j > q[t], for simplicity, we let the scheduler select the first task in the queue automatically.

C. Training Efficiency

So far, we have carefully designed our DNN architecture to improve the feature extraction process and applied the state-of-the-art PPO algorithm to enhance the training performance. However, the state space and action space of this MDP are both very large, leading to a huge exploration space. As a result, the training process is difficult to converge. To handle this issue, we adopt a series of methods to improve training efficiency.

First, we restrict the selection of HO actions. Two constant parameters are defined, i.e., phmax and pg. phmax limits the maximum waiting time of an HO action, and pg controls the granularity of HO actions. This changes the set HO defined in (29) to

HO ≜ {HOpg, HO2pg, · · · , HOnpg}, npg ≤ phmax. (39)

Each time the scheduler takes an HO action, the scheduling can be delayed by at most npg time slots. If necessary, more HO actions can be taken to further postpone the decision-making procedure. Limiting the maximum waiting time of an HO action does not change the original MDP, but significantly reduces the action space (an unlimited and discrete action space raises great challenges in implementing the DRL algorithm [29]). Further, the length of each time slot can be very small, which makes nearly no difference between consecutive HO actions. Larger granularity can help the algorithm learn the difference between HO actions. By setting the granularity pg = 1, we recover the original MDP for modeling the desired offloading scheduling process described in Section II. Now, the total number of HO actions in HO becomes ⌈phmax/pg⌉. To prevent the training process from exploring HO actions that hold the tasks for too long, which is clearly of little use, the continuous waiting time of the system is recorded. A penalty term, kh, is applied when the waiting time exceeds a certain threshold. In our experiments, this threshold is simply set to L/V, meaning that we do not encourage the VT to pass through the complete coverage area of an RSU without executing any task. In practice, such a value can certainly be much smaller.

Second, we can further reduce the size of the action space by setting a constant parameter psmax ≤ Q and limiting the number of
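The clipped surrogate combined in (38) can be sketched numerically as follows. The clipping range ε and the loss coefficient c below are illustrative values, not the paper's hyperparameters.

```python
# Minimal sketch of the per-sample clipped PPO surrogate used in (38);
# eps and c are illustrative assumptions, not the paper's settings.

def clipped_surrogate(ratio, advantage, eps=0.2):
    """L_CLIP for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r is the probability ratio pi_theta / pi_theta_old."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def ppo_objective(ratios, advantages, value_losses, c=0.5):
    """Empirical average of L_CLIP_n - c * L_V_n over a sample batch,
    mirroring the combined maximization objective in (38)."""
    total = sum(clipped_surrogate(r, a) - c * v
                for r, a, v in zip(ratios, advantages, value_losses))
    return total / len(ratios)
```

The min/clip pair caps the incentive to move the new policy far from the sampling policy: with ε = 0.2, a ratio of 1.5 and a positive advantage contribute only as much as a ratio of 1.2 would.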
TABLE V
Parameters of the GA Baseline

Parameter | Setting
[Fig. 5. Comparison of average cost in SQS.]
[Fig. 6. Comparison of average number of re-transmitted tasks in SQS.]
[Fig. 7. Comparison of average task latency in SQS.]
[Fig. 8. Comparison of average task energy consumption in SQS.]
(Each figure compares DRLOSM, RD, AL, AO, TG, EG, and GA over the user preference α ∈ [0.03, 0.10] with β = 1.)
[Fig. 9. Average task latency versus workload in DQS.]
[Fig. 10. Average task energy consumption versus workload in DQS.]
[Fig. 11. Average cost versus workload in DQS.]
[Fig. 12. Average number of re-transmitted tasks versus workload in DQS.]
(Each figure compares DRLOSM, RD, AL, AO, TG, and EG over the user workload λ ∈ [0.2, 1.0] no./sec.)
C. Performance in Dynamic Queue Scenario

In this section, we consider the more general situation in which the system workload λ > 0. In such a dynamic queue scenario (DQS), new tasks keep arriving in the VT's task queue, and a good offloading scheduling solution should also take them into consideration. As mentioned in the previous subsection, applying the "one-shot" GA would require extremely high computation complexity. Hence we compare our DRLOSM with only the remaining five baseline algorithms. We fix α = 0.06 and β = 1. For each algorithm, the simulations are conducted for 200 rounds, each of which lasts 150 seconds. Within each simulation round, the initial states of the different algorithms are set to be identical. We adjust λ from 0.1 (nearly no workload) to 1.0 (a huge workload under which overflow occurs in all six algorithms), and investigate the impact of workload on system performance.

Fig. 9 shows the average task latency comparison in DQS. As λ increases, the average task latency of all the algorithms grows. For EG, AL, AO, and RD, there is a sudden rise in the curve slope at certain values of λ. These methods do not adapt their scheduling behaviors according to the workload. When λ is sufficiently large, the speed of task execution starts to lag behind the task arrival rate, which causes tasks to stack up in the queue. When λ further increases, the task latency curves become flat because the task queues become fully occupied, and new tasks have to be discarded due to overflow. As expected, among the algorithms, TG has the lowest average task latency, and EG has the highest. DRLOSM always achieves a relatively small latency. Its performance curve is smooth, which implies that it can adjust its execution strategy according to the workload.

The average task energy consumption comparison is shown in Fig. 10. The performance curves of EG, AL, AO, and RD are almost independent of λ, which reflects the fact that they do not adjust their behaviors to handle different workloads. The energy consumption of TG decreases as λ rises. This is because the proportion of re-transmitted tasks decreases when more tasks are executed. The algorithm is efficient for very high-workload regimes. DRLOSM consumes more energy as the workload grows, because it is able to change its offloading scheduling policy, scheduling more tasks with higher energy usage to avoid a rapidly growing queue, so as to maintain a relatively small overall cost. This is the reason behind its good task latency performance in Fig. 9.

Fig. 11 illustrates the comparison of overall costs. When the workload is very light, EG achieves good performance. On the other hand, for very heavy workloads, TG outperforms the other baseline algorithms. Our DRLOSM is strictly better than both EG and TG for all possible values of λ. Finally, Fig. 12 displays the average number of re-transmitted tasks due to communication failures. We can observe that DRLOSM properly handles the handover issue, even though it has no knowledge regarding the dynamics of the wireless environment. Only under high workload do the chances of DRLOSM's
transmission failure rise. This is because it tries to offload more tasks to adapt to the system workload even in bad wireless conditions. The advantages of the proposed solution are thus clearly exhibited.

TABLE VI
Offloading Strategy of CDS in Different Situations

Queue Length   | Distance to RSU | Description                  | Strategy
q[t] > LQ      |                 | too many tasks in queue      | TG
q[t] < SQ      |                 | system workload is low       | EG
SQ < q[t] < LQ | d > LD          | offloading is not economical | AL
SQ < q[t] < LQ | d < SD          | offloading is economical     | AO
SQ < q[t] < LQ | SD < d < LD     | no dominant strategy         | RD

TABLE VII
24 CDS Algorithms with Different Thresholds

CDS    | LQ | SQ | LD | SD      CDS    | LQ | SQ | LD | SD
CDS 1  | 10 | 4  | 60 | 40      CDS 2  | 8  | 4  | 60 | 40
CDS 3  | 6  | 4  | 60 | 40      CDS 4  | 10 | 4  | 60 | 20
CDS 5  | 8  | 4  | 60 | 20      CDS 6  | 6  | 4  | 60 | 20
CDS 7  | 10 | 2  | 60 | 40      CDS 8  | 8  | 2  | 60 | 40
CDS 9  | 6  | 2  | 60 | 40      CDS 10 | 4  | 2  | 60 | 40
CDS 11 | 10 | 2  | 60 | 20      CDS 12 | 8  | 2  | 60 | 20
CDS 13 | 6  | 2  | 60 | 20      CDS 14 | 4  | 2  | 60 | 20
CDS 15 | 10 | 0  | 60 | 40      CDS 16 | 8  | 0  | 60 | 40
CDS 17 | 6  | 0  | 60 | 40      CDS 18 | 4  | 0  | 60 | 40
CDS 19 | 2  | 0  | 60 | 40      CDS 20 | 10 | 0  | 60 | 20
CDS 21 | 8  | 0  | 60 | 20      CDS 22 | 6  | 0  | 60 | 20
CDS 23 | 4  | 0  | 60 | 20      CDS 24 | 2  | 0  | 60 | 20

[Fig. 13. Comparison of average cost in SQS (CDS 1-24 versus DRLOSM over the user preference α, β = 1).]
[Fig. 14. Comparison of average cost in DQS (CDS versus DRLOSM over the user workload λ).]

D. Comparison with Combinative Strategy
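The rule-based combinative strategy of Table VI can be sketched as a small decision function. The threshold values default to those of CDS 1 in Table VII; the function name and encoding are illustrative.

```python
# Sketch of the rule-based CDS baseline from Table VI; default thresholds
# follow CDS 1 in Table VII (LQ=10, SQ=4, LD=60, SD=40) as an assumption.

def cds_strategy(q, d, LQ=10, SQ=4, LD=60, SD=40):
    """Pick a sub-strategy from the queue length q and the distance d to
    the RSU, following the five rows of Table VI."""
    if q > LQ:        # too many tasks in the queue
        return "TG"
    if q < SQ:        # system workload is low
        return "EG"
    if d > LD:        # far from the RSU: offloading is not economical
        return "AL"
    if d < SD:        # close to the RSU: offloading is economical
        return "AO"
    return "RD"       # middle zone: no dominant strategy
```

For example, a moderately loaded queue (SQ < q < LQ) maps to AL, AO, or RD purely based on the distance band, which is exactly the behavior the 24 threshold combinations in Table VII vary over.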
environment and the enormous state space. A novel DRL-based method has been proposed in this paper to address these issues. It is designed based on the state-of-the-art PPO algorithm. A parameter-shared DNN architecture, enhanced by a CNN, is utilized to approximate both the policy and the value function. A series of methods has been considered to deal with the large state and action spaces and improve training efficiency. Extensive simulation experiments have been conducted to demonstrate that our proposed method can efficiently learn the optimal offloading scheduling policy without requiring any prior knowledge of the environment dynamics, and that it significantly outperforms a number of known baseline algorithms in terms of the long-term cost.

In our paper, a fixed amount of MEC computation resource and V2I transmission bandwidth is assumed to be reserved for each VT. In more general conditions, the resource limitation on each RSU should be considered. VTs would compete or cooperate with each other to share the limited resources. The offloading scheduling problem, in this case, may be formulated using a multi-agent partially observable MDP (multi-agent POMDP) [11]. In addition, one may also consider scenarios in which each task has an individual priority demand (regarding, e.g., latency) and the VT's computing architecture is formed by multiple heterogeneous sub-systems with diverse characteristics. Applying DRL to solve the offloading scheduling problem in these systems may demand new approaches to formulate the optimization problem, define the equivalent MDP, and design the DNN architecture and RL training method. We treat these as meaningful future research directions.

REFERENCES

[1] J. Feng, Z. Liu, C. Wu, and Y. Ji, "Mobile edge computing for the internet of vehicles: Offloading framework and job scheduling," IEEE Veh. Technol. Mag., vol. 14, no. 1, pp. 28–36, 2019.
[2] K. Zhang, Y. Mao, S. Leng, Y. He, and Y. Zhang, "Mobile-edge computing for vehicular networks: A promising network paradigm with predictive off-loading," IEEE Veh. Technol. Mag., vol. 12, no. 2, pp. 36–44, 2017.
[3] Y. Dai, D. Xu, S. Maharjan, and Y. Zhang, "Joint load balancing and offloading in vehicular edge computing and networks," IEEE Internet Things J., vol. 6, no. 3, pp. 4377–4387, 2019.
[4] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, "Mobile edge computing: A survey," IEEE Internet Things J., vol. 5, no. 1, pp. 450–465, 2018.
[5] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, 2016.
[6] X. Lin, Y. Wang, Q. Xie, and M. Pedram, "Task scheduling with dynamic voltage and frequency scaling for energy minimization in the mobile cloud computing environment," IEEE Trans. Services Comput., vol. 8, no. 2, pp. 175–186, 2015.
[7] Y. Wang, M. Sheng, X. Wang, L. Wang, and J. Li, "Mobile-edge computing: Partial computation offloading using dynamic voltage scaling," IEEE Trans. Commun., vol. 64, no. 10, pp. 4268–4282, 2016.
[8] Y. Zhang, D. Niyato, and P. Wang, "Offloading in mobile cloudlet systems with intermittent connectivity," IEEE Trans. Mobile Comput., vol. 14, no. 12, pp. 2516–2529, 2015.
[9] J. Liu, Y. Mao, J. Zhang, and K. B. Letaief, "Delay-optimal computation task scheduling for mobile-edge computing systems," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2016, pp. 1451–1455.
[10] H. Ko, J. Lee, and S. Pack, "Spatial and temporal computation offloading decision algorithm in edge cloud-enabled heterogeneous networks," IEEE Access, vol. 6, pp. 18920–18932, 2018.
[11] Y. Li, "Deep reinforcement learning: An overview," CoRR, vol. abs/1701.07274, 2017.
[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2011.
[13] J. Wang, J. Hu, G. Min, W. Zhan, Q. Ni, and N. Georgalas, "Computation offloading in multi-access edge computing using a deep sequential model based on reinforcement learning," IEEE Commun. Mag., vol. 57, no. 5, pp. 64–69, 2019.
[14] J. Li, H. Gao, T. Lv, and Y. Lu, "Deep reinforcement learning based computation offloading and resource allocation for MEC," in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), 2018, pp. 1–6.
[15] M. Min, L. Xiao, Y. Chen, P. Cheng, D. Wu, and W. Zhuang, "Learning-based computation offloading for IoT devices with energy harvesting," IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1930–1941, 2019.
[16] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, "Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning," IEEE Internet Things J., vol. 6, no. 3, pp. 4005–4018, 2019.
[17] W. Zhan, C. Luo, J. Wang, G. Min, and H. Duan, "Deep reinforcement learning-based computation offloading in vehicular edge computing," in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–6.
[18] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017.
[19] A. R. Khan, M. Othman, S. A. Madani, and S. U. Khan, "A survey of mobile cloud computing application models," IEEE Commun. Surveys Tuts., vol. 16, no. 1, pp. 393–413, 2013.
[20] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. S. Quek, "Offloading in mobile edge computing: Task allocation and computational frequency scaling," IEEE Trans. Commun., vol. 65, no. 8, pp. 3571–3584, 2017.
[21] X. Chen, "Decentralized computation offloading game for mobile cloud computing," IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 4, pp. 974–983, 2015.
[22] X. Lyu, H. Tian, C. Sengul, and P. Zhang, "Multiuser joint task offloading and resource optimization in proximate clouds," IEEE Trans. Veh. Technol., vol. 66, no. 4, pp. 3435–3447, 2017.
[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[24] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," CoRR, vol. abs/1509.06461, 2015.
[25] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[26] T. Degris, M. White, and R. S. Sutton, "Off-policy actor-critic," CoRR, vol. abs/1205.4839, 2012.
[27] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," CoRR, vol. abs/1506.02438, 2015.
[28] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," CoRR, vol. abs/1607.06450, 2016.
[29] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin, "Reinforcement learning in large discrete action spaces," CoRR, vol. abs/1512.07679, 2015.
[30] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in Proc. USENIX 12th Symp. Operating Syst. Des. Implementation (OSDI), 2016, pp. 265–283.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–15.
[32] F.-A. Fortin, F.-M. De Rainville, M.-A. Gardner, M. Parizeau, and C. Gagné, "DEAP: Evolutionary algorithms made easy," Journal of Machine Learning Research, vol. 13, no. Jul, pp. 2171–2175, 2012.
2327-4662 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Auckland University of Technology. Downloaded on May 27,2020 at 21:27:14 UTC from IEEE Xplore. Restrictions apply.
Wenhan Zhan received his B.E. and M.Sc. degrees from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2010 and 2013, respectively. Since 2013, he has been an Experimentalist at UESTC. From 2018 to 2019, he worked as a Visiting Scholar in the Department of Computer Science at the University of Exeter, UK. He is currently also a Ph.D. student at UESTC. His research interests mainly lie in Distributed Systems, Cloud Computing, Edge Computing, and Artificial Intelligence.

Geyong Min is a Professor of High-Performance Computing and Networking in the Department of Computer Science within the College of Engineering, Mathematics and Physical Sciences at the University of Exeter, UK. He received the Ph.D. degree in Computing Science from the University of Glasgow, UK, in 2003, and the B.Sc. degree in Computer Science from Huazhong University of Science and Technology, China, in 1995. His research interests include Future Internet, Computer Networks, Wireless Communications, Multimedia Systems, Information Security, High-Performance Computing, Ubiquitous Computing, Modelling, and Performance Engineering.