that $t_{cp,m,t} = c\,(L_{m,t-1} + D_{m,t-1})/f_{UE}$ and $D_{m,t} = 0$; otherwise, UE $m$ cannot complete processing all the task bits buffered in the previous timeslots, such that $t_{cp,m,t} = \tau$ and $D_{m,t} = (L_{m,t-1} + D_{m,t-1}) - f_{UE}\,\tau/c$. Then, the energy consumed by local processing can be evaluated as $E^{\mathrm{cp}}_{UE,m,t} = \kappa_{UE} f_{UE}^{3}\, t_{cp,m,t}$.

When UE $m$ performs offloading in timeslot $t$, $L_{m,X,t}$ task bits (encapsulated in packets) can be offloaded to the edge server deployed at the UAV or BS during the duration $\tau$. The energy and the time consumed for transmission are given by $E^{\mathrm{trans}}_{X,m,t} = P_m L_{m,X,t}/R_{X,m,t}$ and $t_{X,m,\mathrm{trans}} = L_{m,X,t}/R_{X,m,t}$, respectively. Besides receiving the offloaded bits, the edge server can simultaneously process tasks buffered in its task queue during previous timeslots. Designating the number of unprocessed task bits observed at the end of timeslot $t-1$ by $D_{X,t-1}$, the potential time for processing the previously buffered task bits is $t_{X,\mathrm{pre}} = D_{X,t-1}\, c/f_X$. If $t_{X,\mathrm{pre}} \le \tau$, all the $D_{X,t-1}$ bits can be processed before the end of the offloading transmission, so the residual time $t_{m,\mathrm{res}} = \tau - t_{X,\mathrm{pre}}$ can be exploited by the edge server to process the task bits offloaded by the UE. If the potential total processing time satisfies $L_{m,X,t}\, c/f_X > t_{m,\mathrm{res}}$, the exact time consumed for processing is $t_{cp,X,m,t} = \tau$, and the number of unprocessed bits becomes $D_{X,t} = D_{X,t-1} + L_{m,X,t} - f_X\,\tau/c$; otherwise, $t_{cp,X,m,t} = t_{X,\mathrm{pre}} + L_{m,X,t}\, c/f_X$ and $D_{X,t} = 0$. On the other hand, in the case of $t_{X,\mathrm{pre}} \ge \tau$, the edge server cannot complete processing the task bits buffered in the previous timeslots; hence the exact time consumed for processing is $t_{cp,X,m,t} = \tau$, and the number of unprocessed bits reserved in the queue is updated as $D_{X,t} = D_{X,t-1} + L_{m,X,t} - f_X\,\tau/c$. Meanwhile, the energy consumed for processing task bits is $E^{\mathrm{cp}}_{X,m,t} = \kappa_X f_X^{3}\, t_{cp,X,m,t}$.
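The branching above fully determines the edge server's per-slot processing time, residual backlog, and processing energy. The following is a minimal sketch of one timeslot of this queue dynamics; the function name, the scalar interface, and the default effective-capacitance coefficient are our own illustrative assumptions, not part of the system model.

```python
def edge_queue_step(D_prev, L_off, f_X, c, tau, kappa_X=1e-27):
    """One timeslot of the edge-server queue: returns the processing time
    t_cp (s), the updated backlog D_new (bits), and the processing energy E_cp (J).
    D_prev: D_{X,t-1}; L_off: newly offloaded bits L_{m,X,t};
    f_X: CPU frequency (cycles/s); c: CPU cycles per bit; tau: slot length (s)."""
    t_pre = D_prev * c / f_X              # time needed for previously buffered bits
    if t_pre <= tau:
        t_res = tau - t_pre               # residual time left for the new bits
        if L_off * c / f_X > t_res:       # the new bits cannot be finished this slot
            t_cp = tau
            D_new = D_prev + L_off - f_X * tau / c
        else:                             # everything is processed within the slot
            t_cp = t_pre + L_off * c / f_X
            D_new = 0.0
    else:                                 # old backlog alone already exceeds the slot
        t_cp = tau
        D_new = D_prev + L_off - f_X * tau / c
    E_cp = kappa_X * f_X ** 3 * t_cp      # E^{cp}_{X,m,t} = kappa_X * f_X^3 * t_cp
    return t_cp, D_new, E_cp
```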
Given the above, the total energy consumed for transmission and processing in timeslot $t$ is
$$E_t = \sum_{m=1}^{M}\left(E^{\mathrm{trans}}_{UAV,m,t}\,\alpha_{m,t,UAV} + E^{\mathrm{trans}}_{BS,m,t}\,\alpha_{m,t,BS} + E^{\mathrm{cp}}_{UE,m,t}\,\alpha_{m,t,UE} + E^{\mathrm{cp}}_{X,m,t}\,(1-\alpha_{m,t,UE})\right),$$
and the total backlog of task bits is $D_t = \sum_{m=1}^{M} D_{m,t} + D_{UAV,t} + D_{BS,t}$.
C. Problem Formulation

Aiming at minimizing the expected long-term average energy consumption and backlog of task bits, this study addresses the problem of optimizing policies for joint offloading scheduling and UAV trajectory planning. Intuitively, the decision-making on trajectory planning and offloading scheduling turns out to be a Markov decision process (MDP) with two distinct objectives [7], [12], where the immediate reward can therefore be formulated as a vector $r_t = [e_t, d_t]^T$ with $e_t = -E_t$ and $d_t = -D_t$. For such an MDP, the UAV/backlog state in timeslot $t$ can be formulated as $s_t = [q_{UAV,t}, d'_{t-1}]^T \in \mathcal{S}$, where $d'_{t-1} = -\log(D_{t-1})$ and $\mathcal{S}$ represents the state space. The trajectory planning action w.r.t. the UAV, i.e. the direction the UAV decides to fly towards, is designated as $a_{t,0} \in \mathcal{A}_0$, where $\mathcal{A}_0$ represents the UAV's action space collecting $a_{t,0}$. The offloading scheduling action that can be selected by a certain UE $m \in \{1, 2, \ldots, M\}$ in timeslot $t$ is formulated as $a_{t,m} = [\alpha_{m,t,UAV}, \alpha_{m,t,BS}, \alpha_{m,t,UE}]^T \in \mathcal{A}_m$, where $\mathcal{A}_m$ represents the offloading scheduling action space w.r.t. UE $m$. By defining $\mathcal{A} = \prod_{m=0}^{M} \mathcal{A}_m$, the joint trajectory planning and offloading scheduling policy can be expressed as $\pi : \mathcal{S} \to \mathcal{A}$. In order to improve the expected long-term average energy efficiency and backlog of task bits, the average rewards can be respectively obtained as $\bar{E} = \limsup_{T\to\infty} \mathbb{E}\{\sum_{t=0}^{T-1} e_t\}/T$ and $\bar{D} = \limsup_{T\to\infty} \mathbb{E}\{\sum_{t=0}^{T-1} d_t\}/T$. By collecting the average rewards within $\bar{r} \triangleq [\bar{E}, \bar{D}]^T$ and defining $w_r \triangleq [w_e, w_d]^T$, the optimization of $\pi$ can be formulated as

$$\pi = \arg\max_{\pi} \{w_r^T \bar{r}\}. \tag{1}$$

III. DISTRIBUTED MULTI-OBJECTIVE DYNAMIC TRAJECTORY PLANNING & OFFLOADING SCHEDULING

The unknown statistics of computational task production and of the terrestrial channels make the dynamics $P(s_{t+1} = s' \mid s_t = s, a_{t,m} = a_m)$, $\forall s, s' \in \mathcal{S}$, $\forall a_m \in \mathcal{A}_m$, unknown when solving problem (1). Intuitively, such a problem with multiple objectives can be addressed by MORL [12], [13].

A. Kernel-Based Approach with n-step Return

Centralized decision-making for all the UEs and the UAV (i.e. the agents) can suffer from the curse of dimensionality [14]. As a countermeasure, we propose a distributed MORL scheme, where each agent selects its offloading action or flight direction within its own action space $\mathcal{A}_m$.

The agents share the observation of $\tilde{s}_t$, which is a quantized version of $s_t$, i.e. $\tilde{s}_t = [\tilde{q}_{UAV,t}, \tilde{d}'_{t-1}]^T \in \tilde{\mathcal{S}}$. Note that the elements of $\tilde{\mathcal{S}}$ are not predefined but are added to the set during online learning or offline training over timeslots. Specifically, given thresholds $\mu_q$ and $\mu_d$, if a newly observed $\tilde{s}_t$ in timeslot $t$ satisfies $\|\tilde{q}_{UAV,t} - \tilde{q}_{UAV}\| > \mu_q$ or $|\tilde{d}'_{t-1} - \tilde{d}| > \mu_d$ for all $\tilde{s} = [\tilde{q}_{UAV}, \tilde{d}]^T \in \tilde{\mathcal{S}}$, then $\tilde{s}_t$ is recognized as a new state and $\tilde{\mathcal{S}} = \tilde{\mathcal{S}} \cup \{\tilde{s}_t\}$. Then, in order to reduce the number of optimization variables w.r.t. the policies $\pi_m : \mathcal{S} \to \mathcal{A}_m$ $\forall m$ (and the elapsed time for training/inference), the action-values are approximated with linearly combined Gaussian kernels. Therefore, given $\tilde{s} \in \tilde{\mathcal{S}}$, the action-values w.r.t. maximizing $\bar{E}$ and $\bar{D}$ (i.e. minimizing the long-term average energy consumption and backlog of task bits) can be respectively expressed as $Q_{e,m}(\tilde{s}, a_m; w_{e,m}) = w_{e,m}^T f_{e,m,t}(\tilde{s}, a_m)$ and $Q_{d,m}(\tilde{s}, a_m; w_{d,m}) = w_{d,m}^T f_{d,m,t}(\tilde{s}, a_m)$, where $w_{e,m}$ and $w_{d,m}$ represent weight vectors to be optimized over iterations, i.e. the learning process; $f_{e,m,t}$ and $f_{d,m,t}$ are kernel vectors with $N_{e,m,t}$ and $N_{d,m,t}$ entries, respectively. Moreover, each entry of $f_{e,m,t}$ can be written as $[f_{e,m,t}]_n = f(x_{e,m}, \hat{x}_{e,m,n}) = \phi(x_{e,m})^T \phi(\hat{x}_{e,m,n}) = \exp(-\|\tilde{q}_{UAV} - \hat{q}_n\|^2/2\sigma_{s_1}^2)\,\exp(-\|\tilde{d}' - \hat{d}_n\|^2/2\sigma_{s_2}^2)\,\exp(-\|a_m - \hat{a}_{m,n}\|^2/2\sigma_a^2)$ for $n = 1, \ldots, N_{e,m,t}$, where $x_{e,m} \triangleq [\tilde{s}^T, a_m^T]^T$ and $\hat{x}_{e,m,n} \triangleq [\hat{s}_n^T, \hat{a}_{m,n}^T]^T$ respectively represent a certain sample and the $n$-th stored feature, and $\phi(\cdot)$ is the feature-space mapping w.r.t. decision-making; $\sigma_{s_1}$, $\sigma_{s_2}$ and $\sigma_a$ respectively denote the characteristic length scales w.r.t. the feature vectors $\hat{q}_n$, $\hat{d}_n$ and $\hat{a}_{m,n}$. Each entry of $f_{d,m,t}$ is formulated in a similar manner. All features are collected in the sets $\mathcal{D}_{e,m,t} \triangleq \{\hat{x}_{e,m,n}\}_{n=1}^{N_{e,m,t}}$ and $\mathcal{D}_{d,m,t} \triangleq \{\hat{x}_{d,m,n}\}_{n=1}^{N_{d,m,t}}$.
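To make the kernelized action-value concrete, the sketch below evaluates $[f_{e,m,t}]_n$ as a product of Gaussians and forms $Q = w^T f$; the array shapes, the stored features, and the length scales are toy assumptions for illustration, not values from the paper.

```python
import numpy as np

def kernel_entry(q, d, a, q_hat, d_hat, a_hat, s1, s2, sa):
    """Product of Gaussians over UAV position, backlog state and action,
    i.e. one entry [f_{e,m,t}]_n of the kernel vector."""
    return (np.exp(-np.sum((q - q_hat) ** 2) / (2 * s1 ** 2))
            * np.exp(-(d - d_hat) ** 2 / (2 * s2 ** 2))
            * np.exp(-np.sum((a - a_hat) ** 2) / (2 * sa ** 2)))

def q_value(q, d, a, features, w, scales):
    """Q(s~, a; w) = w^T f(s~, a), one kernel entry per stored feature."""
    f = np.array([kernel_entry(q, d, a, qh, dh, ah, *scales)
                  for qh, dh, ah in features])
    return float(w @ f)

# Toy usage: two stored features and arbitrary length scales (5.0, 1.0, 1.0).
feats = [(np.array([0.0, 0.0]), 1.0, np.array([1.0, 0.0, 0.0])),
         (np.array([10.0, 5.0]), 2.0, np.array([0.0, 1.0, 0.0]))]
w = np.array([0.3, -0.1])
print(q_value(np.array([1.0, 1.0]), 1.2, np.array([1.0, 0.0, 0.0]),
              feats, w, (5.0, 1.0, 1.0)))
```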
In order to select a trajectory planning/offloading scheduling action maximizing the weighted objective in (1), a synthetic vector-valued function can be defined as $q_m(\tilde{s}_t, a_{t,m}) = [Q_{e,m}(\tilde{s}_t, a_{t,m}), Q_{d,m}(\tilde{s}_t, a_{t,m})]^T$ [12]. Therefore, given $\tilde{s}_t$, an optimized action can be obtained as

$$a_m^\star = \arg\max_{a_m \in \mathcal{A}_m} \left\{ w_r^T q_m(\tilde{s}_t, a_{t,m}) \right\}. \tag{2}$$

A conventional $\epsilon$-greedy strategy for exploration and exploitation can yield a strictly suboptimal action $a_{t,m}$ with a certain probability even if $\pi_m$ converges. Thus, an improved $\epsilon$-greedy strategy is developed, such that only actions that have not been visited before can be explored. To this end, we first define $T_m$ for each agent $m$ to indicate the visit state of each state-action pair: if a pair of state $j$ and action $k$ has been visited, $[T_m]_{j,k} = 1$; otherwise, $[T_m]_{j,k} = 0$. Note that whenever a certain $\tilde{s}_t$ is recognized as a new state, $T_m = \mathrm{cat}(T_m, 0_{1\times|\mathcal{A}_m|})$. Given $T_m$ and a certain $\tilde{s}_t = \tilde{s}_j$, with probability $\epsilon$, $a_{t,m}$ is randomly selected from the set of actions with $[T_m]_{j,k} = 0$; otherwise, (2) is performed to obtain $a_{t,m}^\star$.
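A compact sketch of this improved $\epsilon$-greedy rule follows; the list-based visit table and the `best_action` callback (standing in for solving (2)) are our own implementation choices, not prescribed by the paper.

```python
import random

def select_action(j, T, actions, eps, best_action):
    """Improved epsilon-greedy: explore only actions not yet visited in state j.
    T: visit table with T[j][k] == 1 iff action k was tried in state j;
    best_action(): returns the index of the greedy action, i.e. solves (2)."""
    unvisited = [k for k in range(len(actions)) if T[j][k] == 0]
    if unvisited and random.random() < eps:
        k = random.choice(unvisited)      # explore an unseen state-action pair
    else:
        k = best_action()                 # exploit via (2)
    T[j][k] = 1                           # mark the pair as visited
    return actions[k]
```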
In order to address such average-reward RL as formulated in (1) while optimizing $w_{e,m}$ and $w_{d,m}$ over iterations, we integrate R-learning [15] with the semi-gradient method. Moreover, as the immediate reward $d_t$ may periodically fluctuate even in the presence of a fixed $\pi_m$ [7], the n-step return method [14] is exploited. To this end, we define $r_{m,t:t+n} = [e_{m,t:t+n}, d_{m,t:t+n}]^T$ and $r_{m,t:t+n} \leftarrow r_{m,t+1} + \gamma_r r_{m,t+2:t+n}$, where $r_{m,t+1}$ and $\gamma_r$ respectively represent the reward in timeslot $t+1$ and the discount rate (also known as a discount factor). Therefore, letting $\alpha$ and $k_r$ denote the learning rates, $w_{e,m}$, $w_{d,m}$ and the estimated average reward $\bar{r}_{m,t} = [\bar{E}_t, \bar{D}_t]^T$ can be updated by performing

$$w_{e,m} \leftarrow w_{e,m} + \alpha\big(e_{m,t:t+n} + \gamma_r \max_{a_m}\{w_{e,m}^T f_{e,m,t}(\tilde{s}_{t+1}, a_m)\} - \bar{E}_t - w_{e,m}^T f_{e,m,t}(\tilde{s}_t, a_{t,m})\big)\, f_{e,m,t}(\tilde{s}_t, a_{t,m}), \tag{3}$$

$$w_{d,m} \leftarrow w_{d,m} + \alpha\big(d_{m,t:t+n} + \gamma_r \max_{a_m}\{w_{d,m}^T f_{d,m,t}(\tilde{s}_{t+1}, a_m)\} - \bar{D}_t - w_{d,m}^T f_{d,m,t}(\tilde{s}_t, a_{t,m})\big)\, f_{d,m,t}(\tilde{s}_t, a_{t,m}), \tag{4}$$

$$\bar{r}_{m,t+1} = \bar{r}_{m,t}(1 - k_r) + k_r\big(r_{m,t:t+n} + q_m(\tilde{s}_{t+1}, a_m^\star) - q_m(\tilde{s}_t, a_{t,m})\big), \tag{5}$$
where $a_m^\star$ can be obtained by performing (2). Moreover, (5) is performed only in the presence of $a_{t,m}$ not being generated by exploration [15].
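Unrolling the recursion gives the n-step reward as a discounted sum over $n$ slots, and (3)/(4) are then semi-gradient steps on the kernel weights. A sketch under our own vectorized conventions (scalar n-step rewards, precomputed kernel vectors) is given below; it is illustrative rather than the authors' implementation.

```python
import numpy as np

def n_step_return(rewards, rho, n, gamma_r, T):
    """Closed form of r_{m,t:t+n} <- r_{m,t+1} + gamma_r * r_{m,t+2:t+n}:
    sum_{i=rho+1}^{min(rho+n, T)} gamma_r^(i-rho-1) * rewards[i]."""
    hi = min(rho + n, T)
    return sum(gamma_r ** (i - rho - 1) * rewards[i]
               for i in range(rho + 1, hi + 1))

def semi_gradient_step(w, f_t, f_next_best, g, avg_r, alpha, gamma_r):
    """Semi-gradient R-learning update of the kernel weights, cf. (3)/(4).
    f_t: kernel vector f(s~_t, a_t); f_next_best: kernel vector of the greedy
    action in s~_{t+1}; g: n-step reward; avg_r: estimated average reward."""
    td_error = g + gamma_r * (w @ f_next_best) - avg_r - (w @ f_t)
    return w + alpha * td_error * f_t
```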
New decision-making features can be added into $\mathcal{D}_{e,m,t}$ and $\mathcal{D}_{d,m,t}$ so as to improve the approximation of $Q_{e,m}(\tilde{s}_t, a_{t,m})$ and $Q_{d,m}(\tilde{s}_t, a_{t,m})$, by performing the approximate linear dependence (ALD) test [16]. In terms of the update of $\mathcal{D}_{e,m,t}$, given a threshold $\mu_0$, if $\delta_{0,t} = \min_{\{\lambda_n\}} \big\|\sum_{n=1}^{N_{e,m,t}} \lambda_n \phi(\hat{x}_{e,m,n}) - \phi(x_{e,m,t})\big\|^2 \le \mu_0$, then $\phi(x_{e,m,t})$ can be approximated by $\{\phi(\hat{x}_{e,m,n})\}_{n=1}^{N_{e,m,t}}$; otherwise, $\mathcal{D}_{e,m,t+1} = \mathcal{D}_{e,m,t} \cup \{x_{e,m,t}\}$. The update of $\mathcal{D}_{d,m,t}$ is similar to that of $\mathcal{D}_{e,m,t}$. The above proposed algorithm is then referred to as the kernel-based approach and is summarized in Algorithm 1.

Algorithm 1 Kernel-Based Approach with n-step Return
1: Initialize: for $m = 0, \ldots, M$, initialize $s_0$ and $\tilde{s}_0$; set $\tilde{\mathcal{S}} = \{\tilde{s}_0\}$ and $T_m = 0_{1\times|\mathcal{A}_m|}$ $\forall m$; set $t = 1$; initialize $w_{e,m}$ and $w_{d,m}$.
2: repeat
3:   Each agent $m$ observes $s_t$ and quantizes it, yielding $\tilde{s}_t$;
4:   if $\tilde{s}_t \notin \tilde{\mathcal{S}}$ then  ⊲ check whether $\tilde{s}_t$ is a new state
5:     $\tilde{\mathcal{S}} = \tilde{\mathcal{S}} \cup \{\tilde{s}_t\}$ and $T_m = \mathrm{cat}(T_m, 0_{1\times|\mathcal{A}_m|})$;
6:   end if
7:   for $m = 0, \ldots, M$ do  ⊲ action configuration for each agent
8:     Agent $m$ finds the row index $j$ w.r.t. $\tilde{s}_t$ in $T_m$ and obtains $\mathcal{A}_{m,j} = \{a_m \in \mathcal{A}_m \mid [T_m]_{j,k} = 0\}$;
9:     Generate a random number $\epsilon_x \sim \mathcal{U}(0, 1)$;
10:    if $\epsilon_x < \epsilon$ then
11:      Given $\tilde{s}_t$, randomly select an action $a_{t,m} \in \mathcal{A}_{m,j}$;
12:    else
13:      Given $\tilde{s}_t$, compute $a_{t,m}$ by solving (2);
14:    end if
15:    Set the entry $[T_m]_{j,k}$ corresponding to $(\tilde{s}_t, a_{t,m})$ to 1;
16:  end for
17:  In timeslot $t$, each agent $m = 0, \ldots, M$ executes $a_{t,m}$ and obtains the reward $r_{m,t+1}$;
18:  for $m = 0, \ldots, M$ do
19:    $\rho \leftarrow t - n + 1$;
20:    if $\rho \ge 0$ then  ⊲ obtain the n-step reward
21:      $r_{m,t:t+n} \leftarrow \sum_{i=\rho+1}^{\min\{\rho+n, T\}} \gamma_r^{i-\rho-1} r_{m,i}$;
22:    end if
23:    Update $w_{e,m}$ and $w_{d,m}$ by performing (3) and (4);
24:    if $\epsilon_x \ge \epsilon$ then
25:      Update $\bar{r}_{m,t+1}$ by performing (5);
26:    end if
27:    Perform the ALD test to update $\mathcal{D}_{e,m,t}$ and $\mathcal{D}_{d,m,t}$;
28:  end for
29:  $t \leftarrow t + 1$;
30: until the stopping criteria are met
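By the kernel trick, the ALD statistic can be computed without explicit feature maps as $\delta_{0,t} = k(x_t, x_t) - k_t^T K^{-1} k_t$, where $K$ is the Gram matrix of the current dictionary and $k_t$ collects the kernel values between the dictionary and the new sample [16]. A minimal sketch of the dictionary update in step 27 of Algorithm 1 under that formulation follows; the interface and the pseudo-inverse safeguard are our own choices.

```python
import numpy as np

def ald_test(kernel, dictionary, x_new, mu0):
    """Approximate linear dependence test: returns True (dictionary unchanged)
    if phi(x_new) lies within mu0 of the span of the stored features;
    otherwise appends x_new as a new decision-making feature."""
    if not dictionary:                    # empty dictionary: always add
        dictionary.append(x_new)
        return False
    K = np.array([[kernel(xi, xj) for xj in dictionary] for xi in dictionary])
    k_t = np.array([kernel(xi, x_new) for xi in dictionary])
    delta = kernel(x_new, x_new) - k_t @ np.linalg.pinv(K) @ k_t
    if delta <= mu0:
        return True                       # approximately linearly dependent
    dictionary.append(x_new)
    return False
```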
B. DNN-Based Approach

In order to get an insight into the benefits of the above kernel-based approach, this subsection elaborates on a baseline where state-of-the-art fully-connected DNNs [17] are employed to approximate the action-values $Q_{e,m}(s_t, a_{t,m}; w_{e,m})$ and $Q_{d,m}(s_t, a_{t,m}; w_{d,m})$. The Adam optimizer [18] and experience replay are exploited to optimize $w_{e,m}$ and $w_{d,m}$. Thus, in each iteration of the proposed algorithm, a minibatch of $N$ transition samples $(s_t, a_{t,m}, r_{m,t}, s_{t+1})$ is randomly taken from $\mathcal{B}_{e,m}$ as well as $\mathcal{B}_{d,m}$ for optimizing $w_{e,m}$ and $w_{d,m}$. The optimizations can be respectively formulated as

$$w_{e,m} \leftarrow \arg\min_{w_{e,m}} \frac{1}{N}\sum_{k=1}^{N} \left| y_{e,m,k} - Q_{e,m}(\tilde{s}_k, a_{m,k}; w_{e,m}) \right|^2 \tag{6}$$

and

$$w_{d,m} \leftarrow \arg\min_{w_{d,m}} \frac{1}{N}\sum_{k=1}^{N} \left| y_{d,m,k} - Q_{d,m}(\tilde{s}_k, a_{m,k}; w_{d,m}) \right|^2. \tag{7}$$
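A sketch of one minibatch update for the energy branch is given below, assuming a small fully-connected network and a bootstrapped target $y_{e,m,k}$ built from the average-reward form of (3); the paper itself specifies only fully-connected DNNs, the Adam optimizer [18], and experience replay, so the layer sizes, the target construction, and the buffer layout here are our assumptions. Here `optimizer` would be constructed as `torch.optim.Adam(qnet.parameters())`.

```python
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Fully-connected approximator of Q_{e,m}(s, a); the sizes are illustrative."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def minibatch_step(qnet, optimizer, buffer, batch_size, gamma_r, avg_r):
    """One Adam step on the empirical loss in (6), with experience replay.
    buffer holds (s, a, r, s_next) tuples of tensors; the target uses an
    assumed average-reward bootstrap: y = r - avg_r + gamma_r * max_a Q(s', a)."""
    batch = random.sample(buffer, batch_size)
    s, a, r, s_next = (torch.stack(x) for x in zip(*batch))
    with torch.no_grad():
        y = r - avg_r + gamma_r * qnet(s_next).max(dim=1).values
    q = qnet(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```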
[Fig. 2. Number of task bits produced in each UE over timeslots. Curves: UE1, UE2; UE3, UE4; UE5.]

[Figure (caption not recovered): average energy consumption (J) and average backlog of task bits versus indices of timeslots; curves: n-step return with $w_e = 1$, $w_d = 1$; 1-step return with $w_e = 1$, $w_d = 1$; n-step return with $w_e = 3$, $w_d = 1$.]

[Fig. 4. Performance comparison of the kernel-based approach and the DNN-based approach after running for 7000 timeslots. (a) Instantaneous energy consumption as a function of timeslots. (b) Instantaneous backlog of task bits as a function of timeslots.]

[Fig. 5. The UAV's trajectory, where the notations (x, y) and t represent the horizontal position of the UAV and the index of a timeslot, respectively.]

... environments by adding more appropriate decision-making features, yielding a more accurate approximation of action-values. Furthermore, since the number of task bits produced by the cluster of UEs 3, 4 and 5 reaches peaks in timeslots 400, 800, 1200, etc. (as depicted in Fig. 2), the instantaneous backlog of task bits achieved by the kernel-based approach around timeslots 400, 800, 1200, 1600 and 2000 is slightly higher than that in other timeslots.

Fig. 5 illustrates the UAV's trajectory achieved by the kernel-based approach during the 2000 timeslots shown in Fig. 4. As the average number of task bits per timeslot produced by the cluster of UEs 3, 4 and 5 is greater than that produced by the cluster of UEs 1 and 2 (as shown in Fig. 2), the UAV always hovers on the right-hand side of the BS, such that the overall network can benefit from the stronger air-ground channel and the edge server at the UAV.

V. CONCLUSIONS

We have proposed a novel multi-objective trajectory planning and offloading scheduling scheme based on RL for dynamic air-ground collaborative MEC. In order to address the issues of the multi-objective MDP and the curse of dimensionality caused by multiple UEs, the scheme is developed based on a distributed structure, where MORL and the kernel method are integrated. Numerical results reveal that, benefiting from the design of the n-step return, the proposed approach can outperform the design with 1-step return. Moreover, due to the n-step return and the kernel-based neural networks, the proposed kernel-based approach can significantly outperform the DNN-based approach in terms of the backlog of task bits and the average decision-making and online learning time.

REFERENCES

[1] M. Liu, J. Yang, and G. Gui, "DSF-NOMA: UAV-assisted emergency communication technology in a heterogeneous Internet of Things," IEEE Internet Things J., vol. 6, no. 3, pp. 5508-5519, 2019.
[2] M. Ke, Z. Gao, Y. Wu, X. Gao, and R. Schober, "Compressive sensing-based adaptive active user detection and channel estimation: Massive access meets massive MIMO," IEEE Trans. Signal Process., vol. 68, pp. 764-779, 2020.
[3] Z. Yu, Y. Gong, S. Gong, and Y. Guo, "Joint task offloading and resource allocation in UAV-enabled mobile edge computing," IEEE Internet Things J., vol. 7, no. 4, pp. 3147-3159, 2020.
[4] T. Zhang, Y. Xu, J. Loo, D. Yang, and L. Xiao, "Joint computation and communication design for UAV-assisted mobile edge computing in IoT," IEEE Trans. Ind. Informat., vol. 16, no. 8, pp. 5505-5516, 2020.
[5] X. Hu, K.-K. Wong, K. Yang, and Z. Zheng, "UAV-assisted relaying and edge computing: Scheduling and trajectory optimization," IEEE Trans. Wireless Commun., vol. 18, no. 10, pp. 4738-4752, 2019.
[6] E. Dahlman, S. Parkvall, and J. Skold, 4G, LTE-Advanced Pro and The Road to 5G, 3rd ed. USA: Academic Press, 2016.
[7] S. Wang, Y. Huang, and B. Clerckx, "Dynamic air-ground collaboration for multi-access edge computing," in Proc. IEEE Int. Conf. Commun. (ICC), 2022, pp. 5365-5371.
[8] S. Lyu, A. Campello, and C. Ling, "Ring compute-and-forward over block-fading channels," IEEE Trans. Inf. Theory, vol. 65, no. 11, pp. 6931-6949, 2019.
[9] A. Al-Hourani, S. Kandeepan, and S. Lardner, "Optimal LAP altitude for maximum coverage," IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569-572, 2014.
[10] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and L. Hanzo, "Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing," IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 1, pp. 73-84, 2021.
[11] J. Zhang, L. Zhou, Q. Tang, E. C.-H. Ngai, X. Hu, H. Zhao, and J. Wei, "Stochastic computation offloading and trajectory scheduling for UAV-assisted mobile edge computing," IEEE Internet Things J., vol. 6, no. 2, pp. 3688-3699, 2018.
[12] Y. Huang, C. Hao, Y. Mao, and F. Zhou, "Dynamic resource configuration for low-power IoT networks: A multi-objective reinforcement learning method," IEEE Commun. Lett., vol. 25, no. 7, pp. 2285-2289, 2021.
[13] C. Liu, X. Xu, and D. Hu, "Multiobjective reinforcement learning: A comprehensive overview," IEEE Trans. Syst., Man, Cybern., Syst., vol. 45, no. 3, pp. 385-398, 2015.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[15] S. Mahadevan, "Average reward reinforcement learning: Foundations, algorithms, and empirical results," Machine Learning, vol. 22, no. 1-3, pp. 159-196, 1996.
[16] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275-2285, 2004.
[17] Z. Gao, S. Liu, Y. Su, Z. Li, and D. Zheng, "Hybrid knowledge-data driven channel semantic acquisition and beamforming for cell-free massive MIMO," IEEE J. Sel. Areas Commun., vol. 17, no. 5, pp. 964-979, 2023.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[19] X. Shi and N. Deng, "Modeling and analysis of mmWave UAV swarm networks: A stochastic geometry approach," IEEE Trans. Wireless Commun., vol. 21, no. 11, pp. 9447-9459, 2022.