
Distributed Multi-Objective Dynamic Offloading Scheduling for Air-Ground Cooperative MEC

Yang Huang, Miaomiao Dong, Yijie Mao, Wenqiang Liu, and Zhen Gao

arXiv:2403.10927v1 [cs.IT] 16 Mar 2024

Abstract—Utilizing unmanned aerial vehicles (UAVs) with edge servers to assist terrestrial mobile edge computing (MEC) has attracted tremendous attention. Nevertheless, state-of-the-art schemes based on deterministic optimizations or single-objective reinforcement learning (RL) cannot reduce the backlog of task bits and simultaneously improve energy efficiency in highly dynamic network environments, where the design problem amounts to a sequential decision-making problem. In order to address the aforementioned problems, as well as the curses of dimensionality introduced by the growing number of terrestrial users, this paper proposes a distributed multi-objective (MO) dynamic trajectory planning and offloading scheduling scheme, integrated with MORL and the kernel method. The design of n-step return is also applied to average fluctuations in the backlog. Numerical results reveal that the n-step return can benefit the proposed kernel-based approach, achieving significant improvement in the long-term average backlog performance compared to the conventional 1-step return design. Due to such a design and the kernel-based neural network, to which decision-making features can be continuously added, the kernel-based approach can outperform the approach based on a fully-connected deep neural network, yielding improvements in energy consumption and the backlog performance, as well as a significant reduction in decision-making and online learning time.

Index Terms—Unmanned aerial vehicle, mobile edge computing, trajectory planning, offloading scheduling, multi-objective reinforcement learning.

Copyright © 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
This work was partially supported by the National Natural Science Foundation of China under Grants U2001210, 62211540396, and 61901216, and by the Key R&D Plan of Jiangsu Province under Grant BE2021013-4. (Corresponding author: Yijie Mao)
Y. Huang, M. Dong, and W. Liu are with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China (e-mail: {yang.huang.ceie, dongmiaomiao, sx2304085}@nuaa.edu.cn).
Y. Mao is with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China (e-mail: maoyj@shanghaitech.edu.cn).
Z. Gao is with the State Key Laboratory of CNS/ATM, Beijing Institute of Technology, Beijing 100081, China; the MIIT Key Laboratory of Complex-Field Intelligent Sensing, Beijing Institute of Technology, Beijing 100081, China; the Yangtze Delta Region Academy of Beijing Institute of Technology (Jiaxing), Jiaxing 314019, China; the Advanced Technology Research Institute of Beijing Institute of Technology (Jinan), Jinan 250307, China; and the Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China (e-mail: gaozhen16@bit.edu.cn).

Fig. 1. Air-ground collaborative MEC with a UAV and a BS.
I. INTRODUCTION

Along with the widespread application of 5G technologies, various intelligent applications have emerged and gained widespread use in Internet of Things (IoT) devices [1], [2]. However, stringent constraints on computing capability and power supply make IoT devices unable to cater to scenarios of computation-intensive and latency-critical services. As a promising solution, mobile edge computing (MEC) enables IoT user equipment (UE) to offload computation tasks to edge servers. However, it is still difficult for MEC servers fixed at terrestrial base stations (BSs) to handle the scenario where existing infrastructures cannot satisfy unexpected increases in the demands of computation task processing.

Thanks to its high maneuverability, flexibility, and ability to exploit line-of-sight (LoS) channels, MEC assisted by unmanned aerial vehicles (UAVs) can be a potential countermeasure. The state of the art mainly focuses on scenarios where UEs can choose to perform computational tasks locally or offload them to UAVs [3]–[5]. Unfortunately, these offloading scheduling approaches are inapplicable to scenarios where UEs can also offload task bits to the BS. Besides, approaches derived from deterministic optimizations and assumptions are inapplicable to the scenario where the channel gains and the statistical characteristics of computational task production are unknown to the network [4]. In practice, due to the continuous task production within a whole timeslot and the non-negligible time of signaling and data preparation for transmission, decision-making on offloading scheduling in a timeslot has no knowledge of the number of task bits produced in that timeslot, and the decisions for processing/offloading these task bits can only be executed in the next timeslot [6], [7]. This boils down to a sequential decision-making problem.

In order to handle the aforementioned issues, we focus on the scenario of air-ground collaborative MEC [7] (as shown in Fig. 1), where a BS can be assisted by an edge server deployed at a UAV and reinforcement learning (RL) can be exploited to solve the sequential decision-making problem. In contrast to the conventional single-objective optimization [7], in order to balance energy consumption and task backlog minimization, a novel multi-objective RL (MORL) approach is proposed to jointly optimize the trajectory planning and offloading scheduling policies. In order to address the curses of dimensionality caused by the growing number of UEs, a distributed structure, where the overall network and UAV state is shared by all the agents while decision-making is performed at each agent, is integrated with the kernel method. Numerical results demonstrate that the kernel-based approach with n-step return, which averages fluctuations in the backlog of task bits, can achieve a lower long-term average backlog than that with 1-step return. Moreover, due to the n-step return and the kernel-based neural networks, where new features can be continuously learned and added, the kernel-based approach can significantly outperform the approach based on a deep neural network (DNN), in terms of the backlog performance and the average time consumption of decision-making and online learning. It is also shown that the air-ground MEC can benefit from the online trajectory planning, fully utilizing air-ground channels and the UAV-mounted edge server.
Organization: Section II describes the system model and formulates the problem. Section III proposes the distributed multi-objective dynamic trajectory planning and offloading scheduling scheme. Numerical results are analyzed in Section IV. Conclusions are drawn in Section V.

Notations: Matrices and vectors are in bold capital and bold lower case, respectively; (·)^T, ||·|| and |·| represent the transpose, the l2-norm and the absolute value, respectively; cat(a, b) concatenates b vertically to the end of a.

II. SYSTEM MODEL AND PROBLEM FORMULATION
A. Network & Communications

Without loss of generality, the studied system, as shown in Fig. 1, consists of a UAV flying at a constant altitude H, a BS and M fixed terrestrial UEs, where a UE can offload computational task bits to the edge server deployed at the BS or to the UAV-mounted edge server.

In terms of the terrestrial channel, due to the non-line-of-sight (NLoS) small-scale fading, we assume a block-fading channel model [8], where channel gains remain constant within a timeslot but vary across timeslots. Therefore, in timeslot t, the channel power gain with respect to (w.r.t.) the channel from UE m to the BS can be designated as |h_BS,m,t|^2 = |Γ_BS,m|^2 |h_BS,0|^2 d_BS,m^(−β). The variables Γ_BS,m, h_BS,0, d_BS,m and β represent the corresponding small-scale fading, the large-scale fading w.r.t. the terrestrial channel at a reference distance of 1 m, the distance between the BS and UE m, and the pathloss exponent, respectively.

The ground-to-air channel can be characterized by the probabilistic LoS channel model [9]. Therefore, given the position q_UAV,t = [x_t, y_t, H] of the UAV in a certain timeslot t, the LoS probability can be obtained as P_LoS = (1 + a exp(−b(arctan(H/r_m,UAV,t) − a)))^(−1), where a and b represent constant modeling parameters, and r_m,UAV,t represents the horizontal distance between UE m ∈ {1, . . . , M} and the UAV in timeslot t. Let h_UAV,0, d_UAV,m,t and Γ_UAV,m represent the large-scale fading w.r.t. the ground-to-air channel at a reference distance of 1 m, the distance between UE m and the UAV in timeslot t, and the small-scale fading w.r.t. the channel between UE m and the UAV, respectively. The channel power gain between UE m and the UAV can be obtained as |h_UAV,m,t|^2 = |h_UAV,0|^2 d_UAV,m,t^(−2) with probability P_LoS; otherwise, |h_UAV,m,t|^2 = |Γ_UAV,m|^2 |h_UAV,0|^2 d_UAV,m,t^(−β).

In each timeslot t, for each UE, the offloading scheduling options, including computing locally at the UE and offloading tasks to the UAV or the BS, are mutually exclusive. The duration of offloading and/or executing tasks is designated as τ. Assuming an adequate number of frequency-domain channels, the UEs' offloading transmissions do not interfere with each other, and the computation results can be returned to the UEs via dedicated frequency-domain channels. Therefore, the achievable rate at the BS or the UAV in timeslot t can be obtained as R_X,m,t = B log2(1 + |h_X,m,t|^2 P_m / σ_n^2), where B, P_m and σ_n^2 respectively represent the bandwidth, the transmit power at UE m and the average noise power. The value of the subscript X depends on whether the edge server is deployed at the UAV or at the BS: in the former case, X = UAV; otherwise, X = BS. Assuming that a packet consists of δ_b bits, the number L_m,X,t of task bits that can be delivered during the duration τ can be obtained as L_m,X,t = δ_b ⌊R_X,m,t · τ / δ_b⌋.
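As an illustration of the channel and rate model above, the following Python sketch (not part of the paper; the helper names, reference-gain value and packet size are hypothetical, and the elevation angle is taken in degrees as in the LoS model of [9]) draws the probabilistic-LoS ground-to-air gain and converts it into the number of deliverable task bits L_m,X,t.

```python
import numpy as np

def ground_to_air_gain(q_uav, q_ue, h0_sq, gamma_sq, a=9.61, b=0.16, beta=2.6, rng=np.random):
    """Illustrative probabilistic-LoS power gain |h_UAV,m,t|^2 for one UE in one timeslot."""
    H = q_uav[2]                                          # UAV altitude
    r = max(np.linalg.norm(q_uav[:2] - q_ue[:2]), 1e-9)   # horizontal UE-UAV distance
    d = np.hypot(r, H)                                    # 3-D UE-UAV distance d_UAV,m,t
    theta_deg = np.degrees(np.arctan(H / r))              # elevation angle (degrees, assumption per [9])
    p_los = 1.0 / (1.0 + a * np.exp(-b * (theta_deg - a)))
    if rng.random() < p_los:
        return h0_sq * d ** (-2)                          # LoS: reference gain with exponent 2
    return gamma_sq * h0_sq * d ** (-beta)                # NLoS: small-scale fading and exponent beta

def deliverable_bits(gain, P_m, sigma2_n, B, tau, delta_b):
    """Achievable rate R_X,m,t and packetized payload L_m,X,t = delta_b * floor(R*tau/delta_b)."""
    rate = B * np.log2(1.0 + gain * P_m / sigma2_n)       # bits/s
    return delta_b * np.floor(rate * tau / delta_b)

# Hypothetical usage: a UAV at (0, 0, 100) m serving a UE at (50, 20, 0) m.
# gain = ground_to_air_gain(np.array([0.0, 0.0, 100.0]), np.array([50.0, 20.0, 0.0]),
#                           h0_sq=10 ** (-3.9), gamma_sq=1.0)
# bits = deliverable_bits(gain, P_m=1.0, sigma2_n=1e-12, B=6e6, tau=2.0, delta_b=1024)
```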
B. Computational Task Production and Processing

Each UE continuously produces computational tasks over timeslots, and the statistical characteristics of the task production are unknown [4]. Due to the overhead of signaling and data preparation [10], the task bits produced at each UE in an arbitrary timeslot t can only be processed locally at the UE or offloaded in future timeslots.

The CPU cycle frequencies of a certain UE, the edge server deployed at the BS and the UAV-mounted server are designated as f_UE, f_BS and f_UAV, respectively. The effective switched capacitances [5] w.r.t. the UE, the BS and the UAV are denoted by κ_UE, κ_BS and κ_UAV, respectively. The processing density [11] is designated as c. Task bits to be processed are buffered in queues, and the bits are processed following the first-in-first-out (FIFO) rule. A variable α_m,t,P ∈ {0, 1} for P ∈ {UE, UAV, BS} is utilized to indicate the offloading action, i.e. local processing, offloading task bits to the UAV, or offloading task bits to the BS; e.g. if UE m performs local processing, α_m,t,UE = 1; otherwise, α_m,t,UE = 0.

In the presence of UE m performing local processing in timeslot t, the exact time t_cp,m,t (which may not be equal to τ) consumed for processing buffered task bits can be obtained by following the modeling in [7]. Specifically, prior to formulating t_cp,m,t (for each timeslot), we designate the number of task bits produced by UE m during timeslot t−1 as L_m,t−1. Besides, the number of unprocessed task bits, which are observed at the end of timeslot t−1 but produced before timeslot t−1, is designated as D_m,t−1. When UE m processes the buffered task bits in timeslot t, these D_m,t−1 bits take priority and are processed first, following the FIFO rule, while the L_m,t−1 task bits generated in timeslot t−1 are accumulated at the end of the queue. Then, we first evaluate the potential time t̂_cp,m,t (which may be higher than τ) for processing all the buffered bits, that is, t̂_cp,m,t = c(L_m,t−1 + D_m,t−1)/f_UE. If t̂_cp,m,t ≤ τ, all the L_m,t−1 + D_m,t−1 bits can be processed within τ, such that t_cp,m,t = c(L_m,t−1 + D_m,t−1)/f_UE and D_m,t = 0; otherwise, UE m cannot complete processing all the task bits buffered in the previous timeslots, such that t_cp,m,t = τ and D_m,t = (L_m,t−1 + D_m,t−1) − f_UE·τ/c. Then, the energy consumed by local processing can be evaluated as E^cp_UE,m,t = κ_UE f_UE^3 t_cp,m,t.

In the presence of UE m performing offloading in timeslot t, L_m,X,t task bits (encapsulated in packets) can be offloaded to the edge server deployed at the UAV or the BS during the duration τ. The energy and the time consumed for transmission are respectively given by E^trans_X,m,t = P_m L_m,X,t / R_X,m,t and t^trans_X,m = L_m,X,t / R_X,m,t. Besides receiving the offloaded bits, the edge server can simultaneously process tasks which are buffered in its task queue during previous timeslots. By designating the number of unprocessed task bits (observed at the end of timeslot t−1) as D_X,t−1, the potential time for processing the previously buffered task bits can be obtained as t_X,pre = D_X,t−1·c/f_X. If t_X,pre ≤ τ, all the D_X,t−1 bits can be processed before the end of the offloading transmission. It means that the residual time t_m,res = τ − t_X,pre can be exploited by the edge server to process the received task bits (which are offloaded by the UE). If the potential total processing time L_m,X,t·c/f_X > t_m,res, the exact time consumed for processing is t_cp,X,m,t = τ, and the number of unprocessed bits becomes D_X,t = D_X,t−1 + L_m,X,t − f_X·τ/c; otherwise, t_cp,X,m,t = t_X,pre + L_m,X,t·c/f_X and D_X,t = 0. On the other hand, in the case of t_X,pre ≥ τ, the edge server cannot complete processing the task bits buffered in the previous timeslots. Therefore, the exact time consumed for processing task bits is t_cp,X,m,t = τ, and the number of unprocessed bits reserved in the queue is updated as D_X,t = D_X,t−1 + L_m,X,t − f_X·τ/c. Meanwhile, the energy consumed for processing task bits can be obtained as E^cp_X,m,t = κ_X f_X^3 t_cp,X,m,t.

Given the above, the total energy consumed for transmission and processing can be obtained as E_t = Σ_{m=1}^{M} (E^trans_UAV,m,t·α_m,t,UAV + E^trans_BS,m,t·α_m,t,BS + E^cp_UE,m,t·α_m,t,UE + E^cp_X,m,t·(1 − α_m,t,UE)). The total backlog of task bits can be obtained as D_t = Σ_{m=1}^{M} D_m,t + D_UAV,t + D_BS,t.

C. Problem Formulation

Aiming at minimizing the expected long-term average energy consumption and backlog of task bits, this study addresses the problem of optimizing policies for joint offloading scheduling and UAV trajectory planning. Intuitively, the decision-making on trajectory planning and offloading scheduling turns out to be a Markov decision process (MDP) with two distinct objectives [7], [12], where the immediate reward can therefore be formulated as a vector r_t = [e_t, d_t]^T, with e_t = −E_t and d_t = −D_t. For such an MDP, the UAV/backlog state in timeslot t can be formulated as s_t = [q_UAV,t, d'_{t−1}]^T ∈ S, where d'_{t−1} = −log(D_{t−1}) and S represents the state space. The trajectory planning action w.r.t. the UAV, i.e. the direction the UAV decides to fly towards, is designated as a_t,0 ∈ A_0, where A_0 represents the UAV's action space that collects a_t,0. The offloading scheduling action that can be selected by a certain UE m ∈ {1, 2, . . . , M} in timeslot t is formulated as a_t,m = [α_m,t,UAV, α_m,t,BS, α_m,t,UE]^T ∈ A_m, where A_m represents the offloading scheduling action space w.r.t. UE m. By defining A = ∪_{m=0}^{M} A_m, the joint trajectory planning and offloading scheduling policy can be expressed as π : S → A. In order to improve the expected long-term average energy efficiency and backlog of task bits, the average rewards can be respectively obtained as Ē = lim sup_{T→∞} E{Σ_{t=0}^{T−1} e_t}/T and D̄ = lim sup_{T→∞} E{Σ_{t=0}^{T−1} d_t}/T. By collecting the rewards within r̄ ≜ [Ē, D̄]^T and defining w_r ≜ [w_e, w_d]^T, the optimization of π can be formulated as

    π = arg max_π {w_r^T r̄}.    (1)

III. DISTRIBUTED MULTI-OBJECTIVE DYNAMIC TRAJECTORY PLANNING & OFFLOADING SCHEDULING

The unknown statistics of computational task production and terrestrial channels make the dynamics P(s_{t+1} = s' | s_t = s, a_t,m = a_m), ∀s, s' ∈ S, ∀a_m ∈ A_m, unknown for solving problem (1). Intuitively, such a problem with multiple objectives can be addressed by MORL [12], [13].

A. Kernel-Based Approach with n-step Return

Centralized decision-making for all the UEs and the UAV (i.e. the agents) can suffer from the curses of dimensionality [14]. As a countermeasure, we propose a distributed MORL, where each agent can select the offloading action or flight direction within its own action space A_m.

The agents share the observation of s̃_t, which is a quantized version of s_t, i.e. s̃_t = [q̃_UAV,t, d̃'_{t−1}]^T ∈ S̃. Note that the elements of the set S̃ are not predefined but added into the set during online learning or offline training over timeslots. Specifically, given thresholds µ_q and µ_d, if a newly observed s̃_t in timeslot t satisfies ||q̃_UAV,t − q̃_UAV|| > µ_q or |d̃'_{t−1} − d̃| > µ_d for all s̃ = [q̃_UAV, d̃]^T ∈ S̃, then s̃_t is recognized as a new state and S̃ = S̃ ∪ s̃_t. Then, in order to reduce the number of optimization variables w.r.t. the policies π_m : S → A_m, ∀m (and the elapsed time for training/inference), the action-values are approximated with linearly combined Gaussian kernels. Therefore, given s̃ ∈ S̃, the action-values w.r.t. maximizing Ē and D̄ (i.e. minimizing the long-term average energy consumption and backlog of task bits) can be respectively expressed as Q_e,m(s̃, a_m; w_e,m) = w_e,m^T f_e,m,t(s̃, a_m) and Q_d,m(s̃, a_m; w_d,m) = w_d,m^T f_d,m,t(s̃, a_m), where w_e,m and w_d,m represent weight vectors to be optimized over iterations, i.e. the learning process; f_e,m,t and f_d,m,t are kernel vectors with N_e,m,t and N_d,m,t entries, respectively. Moreover, each entry of f_e,m,t can be written as [f_e,m,t]_n = f(x_e,m, x̂_e,m,n) = φ(x_e,m)^T φ(x̂_e,m,n) = exp(−||q̃_UAV − q̂_n||^2 / 2σ_s1^2) exp(−|d̃' − d̂_n|^2 / 2σ_s2^2) exp(−||a_m − â_m,n||^2 / 2σ_a^2) for n = 1, . . . , N_e,m,t, where x_e,m ≜ [s̃^T, a_m^T]^T, x̂_e,m,n ≜ [ŝ_n^T, â_m,n^T]^T and φ(·) respectively represent a certain sample, the n-th feature and the feature-space mapping w.r.t. decision-making; σ_s1, σ_s2 and σ_a respectively denote the characteristic length scales w.r.t. the feature vectors q̂_n, d̂_n and â_m,n. Each entry of f_d,m,t is formulated in a similar manner. All features are collected in the sets D_e,m,t ≜ {x̂_e,m,n}_{n=1}^{N_e,m,t} and D_d,m,t ≜ {x̂_d,m,n}_{n=1}^{N_d,m,t}.
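To make the kernel approximation and the weighted action selection concrete, the sketch below is a minimal illustration (not the authors' implementation; the feature dictionary, length scales and helper names are assumptions) of how one agent evaluates the kernel vector, the linear action-values and the scalarized choice used later in (2).

```python
import numpy as np

def kernel_vector(q_uav, d_prime, a_m, features, sig_s1=200.0, sig_s2=1.0, sig_a=1.0):
    """Gaussian kernel vector f(s~, a_m): one entry per stored feature (q_hat, d_hat, a_hat)."""
    f = np.empty(len(features))
    for n, (q_hat, d_hat, a_hat) in enumerate(features):
        f[n] = (np.exp(-np.sum((q_uav - q_hat) ** 2) / (2.0 * sig_s1 ** 2))
                * np.exp(-(d_prime - d_hat) ** 2 / (2.0 * sig_s2 ** 2))
                * np.exp(-np.sum((a_m - a_hat) ** 2) / (2.0 * sig_a ** 2)))
    return f

def select_action(q_uav, d_prime, actions, feats_e, feats_d, w_e, w_d, w_r):
    """Weighted action choice: maximize w_r^T [Q_e,m, Q_d,m] over the agent's own actions."""
    best_a, best_val = None, -np.inf
    for a_m in actions:
        q_e = float(w_e @ kernel_vector(q_uav, d_prime, a_m, feats_e))
        q_d = float(w_d @ kernel_vector(q_uav, d_prime, a_m, feats_d))
        val = float(w_r @ np.array([q_e, q_d]))
        if val > best_val:
            best_a, best_val = a_m, val
    return best_a
```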
In order to select a trajectory planning/offloading scheduling action maximizing the weighted objective shown in (1), a synthetic vector-valued function can be defined as q_m(s̃_t, a_t,m) = [Q_e,m(s̃_t, a_t,m), Q_d,m(s̃_t, a_t,m)]^T [12]. Therefore, given s̃_t, an optimized action can be obtained as

    a_m = arg max_{a_m ∈ A_m} {w_r^T q_m(s̃_t, a_t,m)}.    (2)

A conventional ε-greedy strategy for exploration and exploitation can yield a strictly suboptimal action a_t,m with a certain probability even if π_m converges. Thus, an improved ε-greedy strategy is developed, such that only actions that have not been visited before can be explored. To this end, we first define T_m for each agent m to indicate the visit state of each state-action pair: if a pair of state j and action k has been visited, [T_m]_{j,k} = 1; otherwise, [T_m]_{j,k} = 0. Note that when a certain s̃_t is recognized as a new state, T_m = cat(T_m, 0_{1×|A_m|}). Given T_m and a certain s̃_t = s̃_j, with probability ε, a_t,m is randomly selected from the set of actions with [T_m]_{j,k} = 0; otherwise, (2) is performed to obtain a*_t,m.
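A minimal sketch of this visit-restricted exploration rule is given below (illustrative only; the table layout and helper names are assumptions rather than the authors' code): with probability ε the agent draws uniformly among the still-unvisited actions of the current state row, and otherwise it exploits (2).

```python
import numpy as np

def explore_or_exploit(T_m, j, epsilon, greedy_action_idx, rng=np.random):
    """Improved eps-greedy: explore only (state, action) pairs that have not been visited.

    T_m : visit table, T_m[j, k] == 1 iff action k was already tried in quantized state j.
    j   : row index of the current quantized state in T_m.
    """
    unvisited = np.flatnonzero(T_m[j] == 0)
    if unvisited.size > 0 and rng.random() < epsilon:
        k = int(rng.choice(unvisited))     # explore an unvisited action of this state
    else:
        k = greedy_action_idx              # exploit the action obtained from (2)
    T_m[j, k] = 1                          # mark the chosen state-action pair as visited
    return k

def add_state_row(T_m):
    """A new quantized state extends the table with an all-zero row, cf. cat(T_m, 0_{1x|A_m|})."""
    return np.vstack([T_m, np.zeros((1, T_m.shape[1]), dtype=T_m.dtype)])
```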
In order to address such RL with an average reward, as shown in (1), and to optimize w_e,m and w_d,m over iterations, we integrate R-learning [15] with the semi-gradient method. Moreover, as the immediate reward d_t may periodically fluctuate even in the presence of a fixed π_m [7], the n-step return method [14] is exploited. To this end, we define r_m,t:t+n = [e_t:t+n, d_t:t+n]^T and r_m,t:t+n ← r_m,t+1 + γ_r r_m,t+2:t+n, where r_m,t+1 and γ_r respectively represent the reward in timeslot t+1 and the discount rate (also known as the discount factor). Therefore, letting α and k_r denote the learning rates, w_e,m, w_d,m and the estimated average reward r̄_m,t = [Ē_t, D̄_t]^T can be updated by performing

    w_e,m ← w_e,m + α(e_m,t:t+n + γ_r max_{a_m}{w_e,m^T f_e,m,t(s̃_{t+1}, a_m)} − Ē_t − w_e,m^T f_e,m,t(s̃_t, a_t,m)) f_e,m,t(s̃_t, a_t,m),    (3)

    w_d,m ← w_d,m + α(d_m,t:t+n + γ_r max_{a_m}{w_d,m^T f_d,m,t(s̃_{t+1}, a_m)} − D̄_t − w_d,m^T f_d,m,t(s̃_t, a_t,m)) f_d,m,t(s̃_t, a_t,m),    (4)

    r̄_m,t+1 = r̄_m,t(1 − k_r) + k_r(r_m,t:t+n + q_m(s̃_{t+1}, a*_m) − q_m(s̃_t, a_t,m)),    (5)

where a*_m can be obtained by performing (2). Moreover, (5) is performed only when a_t,m is not generated by exploration [15]. New decision-making features can be added into D_e,m,t and D_d,m,t so as to improve the approximation of Q_e,m(s̃_t, a_t,m) and Q_d,m(s̃_t, a_t,m), by performing the approximate linear dependence (ALD) test [16]. In terms of the update of D_e,m,t, given a threshold µ_0, if δ_0,t = min_{λ_n ∀n} ||Σ_{n=1}^{N_e,m,t} λ_n φ(x̂_e,m,n) − φ(x_e,m,t)||^2 ≤ µ_0, then φ(x_e,m,t) can be approximated by {φ(x̂_e,m,n)}_{n=1}^{N_e,m,t}; otherwise, D_e,m,t+1 = D_e,m,t ∪ x_e,m,t. The update of D_d,m,t is similar to that of D_e,m,t. The above proposed algorithm is then referred to as the kernel-based approach and is summarized in Algorithm 1.

Algorithm 1 Kernel-Based Approach with n-step Return
 1: Initialize: for m = 0, . . . , M, initialize s_0 and s̃_0; set S̃ = {s̃_0} and T_m = 0_{1×|A_m|} ∀m; set t = 1; initialize w_e,m and w_d,m.
 2: repeat
 3:   Each agent m observes s_t and quantizes s_t, yielding s̃_t;
 4:   if s̃_t ∉ S̃ then                    ▷ Check whether s̃_t is a new state
 5:     S̃ = S̃ ∪ s̃_t and T_m = cat(T_m, 0_{1×|A_m|});
 6:   end if
 7:   for m = 0, . . . , M do             ▷ Action configuration for agent m
 8:     Each agent m finds the row index j w.r.t. s̃_t in T_m and obtains A_m,j = {a_m | a_m ∈ A_m for which [T_m]_{j,k} = 0};
 9:     Generate a random number ε_x ∼ U(0, 1);
10:     if ε_x < ε then
11:       Given s̃_t, randomly select an action a_t,m ∈ A_m,j;
12:     else
13:       Given s̃_t, compute a_t,m by solving (2);
14:     end if
15:     Set the entry [T_m]_{j,k} corresponding to (s̃_t, a_t,m) to 1;
16:   end for
17:   In timeslot t, for m = 0, . . . , M, each agent m executes a_t,m and obtains the reward r_m,t+1;
18:   for m = 0, . . . , M do
19:     ρ ← t − n + 1;
20:     if ρ ≥ 0 then                     ▷ Obtain the n-step return
21:       r_m,t:t+n ← Σ_{i=ρ+1}^{min{ρ+n,T}} γ_r^{i−ρ−1} r_m,i;
22:     end if
23:     Update w_e,m and w_d,m by performing (3) and (4);
24:     if ε_x ≥ ε then
25:       Update r̄_m,t+1 by performing (5);
26:     end if
27:     Perform the ALD test to update D_d,m,t and D_e,m,t;
28:   end for
29:   t ← t + 1;
30: until the stopping criteria are met
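The per-agent update inside the loop of Algorithm 1 can be sketched as follows (a simplified illustration under stated assumptions: the kernel vectors are assumed to be precomputed by the caller and only one objective component is handled per call; this is not the authors' code). It accumulates the discounted n-step return and applies the semi-gradient R-learning updates of (3)–(5).

```python
import numpy as np

def n_step_return(reward_window, gamma_r):
    """Discounted accumulation of the buffered immediate rewards (cf. line 21 of Algorithm 1)."""
    discounts = gamma_r ** np.arange(len(reward_window))
    return float(discounts @ np.asarray(reward_window, dtype=float))

def r_learning_step(w, f_sa, f_next_all, g_n, r_bar, alpha, gamma_r):
    """Semi-gradient update of (3)/(4) for one objective (energy or backlog).

    w          : kernel weight vector of this objective
    f_sa       : kernel vector f(s~_t, a_t,m) of the executed action
    f_next_all : kernel vectors f(s~_{t+1}, a) for every action a of the agent
    g_n        : n-step return of this objective
    r_bar      : current average-reward estimate (E_bar_t or D_bar_t)
    """
    q_next = max(float(w @ f) for f in f_next_all)          # max_a w^T f(s~_{t+1}, a)
    td = g_n + gamma_r * q_next - r_bar - float(w @ f_sa)
    return w + alpha * td * f_sa

def average_reward_step(r_bar, g_n, q_next_star, q_sa, k_r):
    """Update (5); applied only when a_t,m was selected greedily rather than by exploration."""
    return r_bar * (1.0 - k_r) + k_r * (g_n + q_next_star - q_sa)
```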
B. DNN-Based Approach

In order to gain insight into the benefits of the above kernel-based approach, this subsection elaborates on a baseline, where state-of-the-art fully-connected DNNs [17] are employed to approximate the action-values Q_e,m(s_t, a_t,m; w_e,m) and Q_d,m(s_t, a_t,m; w_d,m). The Adam optimizer [18] and experience replay are exploited to optimize w_e,m and w_d,m. Thus, in each iteration of the proposed algorithm, a minibatch of N transition samples (s_t, a_t,m, r_m,t, s_{t+1}) is randomly taken from B_e,m as well as from B_d,m for optimizing w_e,m and w_d,m. The optimizations can be respectively formulated as

    w_e,m ← arg min_{w_e,m} (1/N) Σ_{k=1}^{N} |y_e,m,k − Q_e,m(s̃_k, a_m,k; w_e,m)|^2    (6)

and

    w_d,m ← arg min_{w_d,m} (1/N) Σ_{k=1}^{N} |y_d,m,k − Q_d,m(s̃_k, a_m,k; w_d,m)|^2,    (7)

where y_e,m,k = e_{k+1} + γ_r max_{a_m} Q_e,m(s̃_{k+1}, a_m; w_e,m) and y_d,m,k = d_{k+1} + γ_r max_{a_m} Q_d,m(s̃_{k+1}, a_m; w_d,m) represent the temporal-difference targets, and the target-network weights w_e,m^− and w_d,m^− can be iteratively updated with the optimized w_e,m and w_d,m [14].
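For concreteness, one minibatch step of this DNN baseline can be sketched as follows (illustrative; NumPy is used only to form the temporal-difference targets of (6)–(7), while the Q-networks, the replay buffers and the Adam step are left abstract rather than reproducing the authors' MATLAB implementation).

```python
import numpy as np

def td_targets(rewards_next, q_next_all, gamma_r):
    """Targets y_k = r_{k+1} + gamma_r * max_a Q(s~_{k+1}, a) for a sampled minibatch.

    rewards_next : shape (N,), immediate rewards e_{k+1} (or d_{k+1})
    q_next_all   : shape (N, |A_m|), target-network action-values at the next states
    """
    return np.asarray(rewards_next, dtype=float) + gamma_r * np.asarray(q_next_all, dtype=float).max(axis=1)

def minibatch_loss(q_pred, targets):
    """Empirical objective of (6)/(7): mean squared TD error over the N sampled transitions."""
    return float(np.mean((np.asarray(targets) - np.asarray(q_pred)) ** 2))

# Schematic iteration: sample N transitions from the replay buffer B_e,m (resp. B_d,m),
# evaluate q_next_all with the target weights, build td_targets, and take one Adam step
# on minibatch_loss w.r.t. w_e,m; the target weights are then refreshed periodically.
```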

IV. NUMERICAL RESULTS

In the simulations, A_0 contains the eight cardinal directions and H = 100 m. Regarding the channels, the small-scale fading is characterized by Rayleigh fading and the pathloss at a reference distance of 1 m is set to 39 dB. The LoS probability P_LoS is characterized by the constant modeling parameters a = 9.61 and b = 0.16 [19]. Additionally, the pathloss exponent is set to β = 2.6. For the UEs, M = 5, P_m = 30 dBm, σ_n^2 = −90 dBm and B = 6 MHz [7]. For task processing, τ = 2 s; c = 1e3 cycles/bit [5]; f_UAV = 1.6e9 cycles/s, f_BS = 1.8e9 cycles/s and f_UE = 8e8 cycles/s; κ_UE = 1e−28, κ_BS = 1e−28 and κ_UAV = 1e−27. The thresholds for recognizing a new state are set to µ_q = 2 and µ_d = 0.3. For both the kernel-based and the DNN-based approaches, w_e = 1, w_d = 1 and γ_r = 0.3, unless otherwise stated. In the kernel-based approach, σ_s1 = 200, σ_s2 = 1 and σ_a = 1; µ_0 = 0.82 and n = 5 (for the n-step return), unless otherwise stated. In the DNN-based approach, each action-value is approximated by a fully-connected DNN with 3 hidden layers, where each layer consists of 64 neurons and tan-sigmoid is selected as the activation function; the size of the minibatch is set to N = 64. The simulations are conducted in MATLAB R2020a on a single computer with an Intel Core i7 processor at 3.6 GHz, 16 GB of RAM and the Windows 10 operating system. In order to gain insights into the proposed approach, we consider that the number of task bits produced at each UE is periodic, as shown in Fig. 2.

Fig. 2. Number of task bits produced in each UE over timeslots.

Fig. 3 investigates the long-term average energy consumption and the long-term average backlog performance achieved by the kernel-based approaches with n-step return at different weights and with 1-step return. It can be seen that both the average energy consumption and the average backlog achieved by the kernel-based approaches finally converge. Since the algorithms perform online learning, the average backlog of task bits first increases with time. Then, as the proposed algorithms continuously optimize the trajectory planning and offloading scheduling policies, the average backlog of task bits decreases. Fig. 3 also illustrates that, since the n-step return can average the fluctuations of the immediate rewards, especially d_t, the average backlog of task bits achieved by the approach with n-step return can be significantly lower than that achieved by the approach with 1-step return. Moreover, Fig. 3 also indicates that, with w_e = 3 and w_d = 1, the kernel-based approach with n-step return yields lower average energy consumption but a higher average backlog of task bits than with w_e = 1 and w_d = 1. The reason lies in that a higher w_e induces the dominance of maximizing Q_e,m(s̃_t, a_t,m) in the objective function of problem (2), leading to a decrease in the average energy consumption. The formulation of E_t indicates that the actions a_t,m = [α_m,t,UAV, α_m,t,BS, α_m,t,UE]^T are related not only to energy consumption but also to offloading/computing. Therefore, a decrease in energy consumption can cause an increase in the average backlog of task bits.

Fig. 3. Performance achieved by kernel-based approaches with n-step return (for n = 30) at different weights and 1-step return. (a) Average energy consumption as a function of timeslots. (b) Average backlog of task bits as a function of timeslots.

TABLE I
AVERAGE DECISION-MAKING & ONLINE LEARNING TIME

    Algorithm        | DNN-based | Kernel-based
    Elapsed time (s) | 1.511     | 0.0045
Table I depicts the average elapsed time of decision-making and online learning in each timeslot achieved by the kernel-based and the DNN-based approaches. It is shown that the kernel-based approach consumes significantly less time than the DNN-based approach. Meanwhile, Fig. 4 shows the performance achieved by the kernel-based approach and the DNN-based approach between timeslot 7001 and timeslot 9000. By performing online learning for a duration of 7000 timeslots, it can be inferred from the numerical results shown in Fig. 3 that the long-term average rewards achieved by the algorithms converge. It can be observed from Fig. 4 that, although the network running the kernel-based approach achieves slightly lower energy consumption than that running the DNN-based approach, the latter suffers from dramatic fluctuations in the instantaneous backlog of task bits. This is because the kernel-based approach benefits from the design of n-step return and from the neural networks consisting of kernel functions. The sizes of such neural networks can be adapted to the environment by adding more appropriate decision-making features, yielding a more accurate approximation of the action-values. Furthermore, since the number of task bits produced by the cluster of UEs 3, 4 and 5 reaches peaks in timeslots 400, 800, 1200, etc. (as depicted in Fig. 2), the instantaneous backlog of task bits achieved by the kernel-based approach around timeslots 400, 800, 1200, 1600 and 2000 is slightly higher than that in other timeslots.

Fig. 4. Performance comparison of the kernel-based approach and the DNN-based approach after running for 7000 timeslots. (a) Instantaneous energy consumption as a function of timeslots. (b) Instantaneous backlog of task bits as a function of timeslots.

Fig. 5 illustrates the UAV's trajectory achieved by the kernel-based approach during the 2000 timeslots shown in Fig. 4. As the average number of task bits per timeslot produced by the cluster of UEs 3, 4 and 5 is greater than that produced by the cluster of UEs 1 and 2 (as shown in Fig. 2), the UAV always hovers at the right-hand side of the BS, such that the overall network can benefit from the stronger air-ground channel and the edge server at the UAV.

Fig. 5. The UAV's trajectory, where the notations (x, y) and t represent the horizontal position of the UAV and the index of a timeslot, respectively.

V. CONCLUSIONS

We have proposed a novel multi-objective trajectory planning and offloading scheduling scheme based on RL for dynamic air-ground collaborative MEC. In order to address the issues of the multi-objective MDP and the curses of dimensionality caused by multiple UEs, the scheme is developed based on a distributed structure, where MORL and the kernel method are integrated. Numerical results reveal that, benefiting from the design of n-step return, the proposed approach can outperform the design with 1-step return. Moreover, due to the n-step return and the kernel-based neural networks, the proposed kernel-based approach can significantly outperform the DNN-based approach in terms of the backlog of task bits and the average decision-making and online learning time.

REFERENCES

[1] M. Liu, J. Yang, and G. Gui, "DSF-NOMA: UAV-assisted emergency communication technology in a heterogeneous Internet of Things," IEEE Internet Things J., vol. 6, no. 3, pp. 5508–5519, 2019.
[2] M. Ke, Z. Gao, Y. Wu, X. Gao, and R. Schober, "Compressive sensing-based adaptive active user detection and channel estimation: Massive access meets massive MIMO," IEEE Trans. Signal Process., vol. 68, pp. 764–779, 2020.
[3] Z. Yu, Y. Gong, S. Gong, and Y. Guo, "Joint task offloading and resource allocation in UAV-enabled mobile edge computing," IEEE Internet Things J., vol. 7, no. 4, pp. 3147–3159, 2020.
[4] T. Zhang, Y. Xu, J. Loo, D. Yang, and L. Xiao, "Joint computation and communication design for UAV-assisted mobile edge computing in IoT," IEEE Trans. Ind. Informat., vol. 16, no. 8, pp. 5505–5516, 2020.
[5] X. Hu, K.-K. Wong, K. Yang, and Z. Zheng, "UAV-assisted relaying and edge computing: Scheduling and trajectory optimization," IEEE Trans. Wireless Commun., vol. 18, no. 10, pp. 4738–4752, 2019.
[6] E. Dahlman, S. Parkvall, and J. Skold, 4G, LTE-Advanced Pro and The Road to 5G, 3rd ed. USA: Academic Press, Inc., 2016.
[7] S. Wang, Y. Huang, and B. Clerckx, "Dynamic air-ground collaboration for multi-access edge computing," in Proc. IEEE Int. Conf. Commun. (ICC), 2022, pp. 5365–5371.
[8] S. Lyu, A. Campello, and C. Ling, "Ring compute-and-forward over block-fading channels," IEEE Trans. Inf. Theory, vol. 65, no. 11, pp. 6931–6949, 2019.
[9] A. Al-Hourani, S. Kandeepan, and S. Lardner, "Optimal LAP altitude for maximum coverage," IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, 2014.
[10] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and L. Hanzo, "Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing," IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 1, pp. 73–84, 2021.
[11] J. Zhang, L. Zhou, Q. Tang, E. C.-H. Ngai, X. Hu, H. Zhao, and J. Wei, "Stochastic computation offloading and trajectory scheduling for UAV-assisted mobile edge computing," IEEE Internet Things J., vol. 6, no. 2, pp. 3688–3699, 2018.
[12] Y. Huang, C. Hao, Y. Mao, and F. Zhou, "Dynamic resource configuration for low-power IoT networks: A multi-objective reinforcement learning method," IEEE Commun. Lett., vol. 25, no. 7, pp. 2285–2289, 2021.
[13] C. Liu, X. Xu, and D. Hu, "Multiobjective reinforcement learning: A comprehensive overview," IEEE Trans. Syst., vol. 45, no. 3, pp. 385–398, 2015.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[15] S. Mahadevan, "Average reward reinforcement learning: Foundations, algorithms, and empirical results," Machine Learning, vol. 22, no. 1-3, pp. 159–196, 1996.
[16] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275–2285, 2004.
[17] Z. Gao, S. Liu, Y. Su, Z. Li, and D. Zheng, "Hybrid knowledge-data driven channel semantic acquisition and beamforming for cell-free massive MIMO," IEEE J. Sel. Areas Commun., vol. 17, no. 5, pp. 964–979, 2023.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[19] X. Shi and N. Deng, "Modeling and analysis of mmWave UAV swarm networks: A stochastic geometry approach," IEEE Trans. Wireless Commun., vol. 21, no. 11, pp. 9447–9459, 2022.
You might also like