Reinforcement-Learning For Management of A 5G Network Slice Extension With Uavs

2019 IEEE INFOCOM WKSHPS: SMILING 2019: Sustainable networking through MachIne Learning and Internet of thINGs
Reinforcement-Learning for Management of a 5G

Network Slice Extension with UAVs
Giuseppe Faraci Christian Grasso Giovanni Schembra
DIEEI, University of Catania, CNIT DIEEI, University of Catania, CNIT DIEEI, University of Catania, CNIT
Catania, Italy Catania, Italy Catania, Italy
giuseppe.faraci@dieei.unict.it christian.grasso@unict.it schembra@dieei.unict.it
Abstract— Some representative 5G application scenarios a 5G network slice with a fleet of UAVs, each providing not only
regard geographic areas very far from the structured core networking, but also computing and storage with MEC facilities,
network, but are characterized by the need for processing huge realized with a set of Computer Elements (CE) installed on
amount of data that cannot be transmitted to multi-access edge board. However, due to the power consumption of each
(MEC) facilities installed at the edge of that network. To this
computing element that, since comparable to the consumption
purpose, this paper proposes to extend a 5G network slice with a
fleet of UAVs, each providing computing facilities, and for this of engines, can compromise the flight mission duration [9], this
reason referred to as MEC UAVs. The paper proposes a paper proposes to change the number of active CEs at run-time,
cooperation between MEC UAVs belonging to the same fleet based according requests of computation coming from the ground in
on job offloading, aiming at minimizing power consumption due real time. An additional feature included in this proposal is
to active computer elements providing MEC, job loss probability cooperation among UAVs, already experimented for other
and queueing delay. A Reinforcement Learning (RL) approach is purposes (see for example [10-11]).
used to support the System Controller in its decisions. A numerical
analysis is presented to evaluate achieved performance. More specifically, in this paper we propose a framework in
which, when the zone monitored by an UAV enters a state of
Keywords — 5G, Network Slicing, UAV, Reinforcement high activity, the UAV can either switch on more CEs, or asking
Learning, Markov Decision Processes (MDP) for help to a near UAV, in such a way that some jobs can be
offloaded to it. The choice of the number of CEs to maintain
I. INTRODUCTION
active in each UAV and the amount of jobs to be offloaded to
The 5th generation wireless systems, or 5G, are not only an the helping UAV is in charge of a System Controller (SC)
evolution of the legacy 4G cellular networks, but a revolution launched by the UAV asking for help as a virtual network
for the introduction of new disruption service capabilities [1]. function (VNF). A Reinforcement Learning (RL) approach is
One of the main features that will be introduced is the concept used to support SC in each decision, with the target of
of network slices [2], which aims at addressing the diversified maximizing a medium-term reward, defined as a function of the
service requirements for different application scenarios. A 5G power saved by switching off some CE, and performance in
network slice is an end-to-end logical network provisioned with terms of loss probability and mean delay. A numerical analysis
a set of isolated virtual resources on the shared physical will evaluate performance achieved with the proposed platform.
infrastructure, so providing a network-as-a-service (NaaS)
model. The three main paradigms that will enable network The paper is structured as follows. Section II describes the
slicing in 5G systems are software defined networking (SDN) reference system. Section III provides some background
[3], network functions virtualization (NFV) [4] and multi-access regarding RL. The model of the whole system and the analytical
edge computing (MEC) [5]. definition of the reward function, necessary to apply RL, are
described in Section IV, while the main performance parameters
However, some 5G application scenarios, like for example are analytically derived in Section V. Some numerical results
smart agriculture, environment monitoring, video surveillance will be presented in Section VI, while Section VII will conclude
with drones, regard geographic areas very far from the the paper providing some insight for future work.
structured core network. Moreover, in most of them, connected
devices produce a lot of data that require processing in real-time, II. REFERENCE SYSTEM
and for them use of network slices with MEC facilities is not We consider a 5G network slice extension realized with a
possible because these facilities are too far to be reached with fleet of UAVs, each equipped with MEC facilities, for this
links at sufficient throughput. A first attempt in this direction has reason here referred to as MEC UAVs. Each MEC UAV is
been done introducing flying platforms realized with unmanned equipped by L CEs to process jobs received by ground devices,
aerial vehicles (UAVs), popularly known as drones. With their each consuming a given amount of power, PP . The goal of the
characteristics, applications of UAVs have been extended to
aerial base stations to enhance coverage, capacity, reliability, proposed platform is to provide a geographic area with this
and energy efficiency of wireless networks [6]. network extension, aimed at processing data coming from
devices installed on ground. The whole area is subdivided in
Starting from that paper and from previous works of the adjacent zones, each covered by a MEC UAV. Data generated
same Authors [7-8], the idea at the base of this paper is to extend by ground devices are organized in jobs to be processed, and the
XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE
978-1-7281-1878-9/19/$31.00 ©2019 IEEE 732

job composition task can be performed by either the same

SYSTEM CONTROLLER (SC)
ground device individually, or by sinks that collect raw data
generated by group of devices. We assume that the job
MEC UAV 1 MEC UAV 2
transmission rate for a given zone changes in time according to
the current state of the zone. We refer to these states to as Low P P P P
Activity and High Activity states. For example, in a scenario
Queue Q1
Queue Q2
where the network slice is used for video surveillance, the two
behaviors represent no-alarm and alarm situations, respectively.
Applications using UAV MEC facilities can be more or less  Queue T
sensitive to losses of job and job processing delay. Another
Flow 1  
T
important parameter to be accounted is power consumption of Switch
the CEs, that is comparable with the consumption of the UAV 1 2
engines for flight, and therefore can strongly influence the UAV
mission duration [11]. To this purpose, we consider the 1 2 RH 1 2 RL
possibility of a mutual help among UAVs. The basic idea is that, High Activity Low Activity
if an UAV is overloaded of jobs, and typically this occurs when

the zone assigned to it enters the High Activity state, it asks for Fig. 1. Functional Architecture of the MEC UAV platform
help to the closest UAV that is not stressed at that time.
III. REINFORCEMENT LEARNING (RL)
Referring to Fig. 1, the MEC UAV1 is monitoring a High-
Activity zone, and has obtained availability for help from MEC In this paper, we use RL to approach the decision making
UAV2. In this way, UAV1 and UAV2 cooperate together in process carried out by the SC. The SC behaves as an Agent that
order to optimize a common objective function weighting the performs actions. An action a is one of all the possible moves
three above elements, that is, energy saving related to CE the Agent can make. In our case, as described in the previous
consumption, overall delay and loss probability experienced by section, an action consists in the choice of the number of active
both offloaded and non-offloaded jobs. To this purpose, a CEs in each UAV among the L ones that are available per UAV,
System Controller (SC) is run as a virtual network function and in the choice of how many jobs the UAV1 will offload to
(VNF) in the UAV1, in order to orchestrate the whole system the UAV2. A policy  defines the action a for each system
made by the two UAVs. More specifically, it is in charge of the state s, that is, a   (s ) . A key element is the immediate
action of deciding how many CEs activating on each UAV, and Reward, R(n), a scalar feedback, positive or negative, which
which jobs generated by the High-Activity zone to be offloaded allows the agent to quantify the success or failure of its actions
to the UAV 2. The objective function to be optimized is: according to its specific goal. A Cumulative Reward, indicated
FOBJ  k1  p  k 2   k 3 (1) as G(n), is also defined to indicate the long-term reward, i.e. the
total discounted rewards from time step n:
where p is the power that is saved in respect to the case that

all the CEs are active,  is the job loss probability, and  is G(n)    k  R(n  k  1) (2)
the mean job process delay. Constants k1 , k 2 and k 3 represent k 0
the importance that the three above parameters have for the The term  is a discount factor, where   0,1 . It is an
considered application scenario. input parameter that informs the agent of how much it should
care about rewards now as compared to rewards in the future.
The SC uses reinforcement learning (RL) to decide the
actions to be taken. To this end, a discrete-time approach is For each state of the environment, a State-Value Function
adopted: actions are taken periodically at the beginning of each and an Action-Value Function are also defined to describe how
time slot of duration  . The following actions will be taken: good is to be in that state, and how good is to take a given action.
For a given policy  , the state-value function for a state s is
1. Decision of the number bi of CEs to be active during the
time slot for each UAV i. The others are put in low-power 
defined as v ( s)  E G (n) S (n)  s , where E   is the 
state to reduce power consumption. expected value given that the agent follows the policy  , and
2. Decision of the number  of jobs to be locally managed G(n) is the cumulative reward, as defined in (2). It represents the
by UAV1 among the 1 jobs arrived in the slot. The expected return when the system starts in s and follows the
policy  . We assume the Markov property for the environment
other 1   are offloaded to the UAV 2.
state, and define its model as a Decision Markov Process
Jobs to be locally processed by the UAV1 are enqueued in the (MDP). A MDP  , for a given policy  that specifies an
queue Q1 , where they wait for some CE availability. The other action a for each state s , is completely defined as a tuple
jobs are enqueued in the Transmission queue T to be ((   , ( A  , P
(  )
, 
(  )
,  ) , where (   is the system state
transmitted to UAV2, where will be enqueued together with the (  )
jobs arrived from the zone monitored by it. space, ( A  is the set of actions, P is the state transition
733
probability matrix, 
(  )
is the immediate reward matrix, and equal to 1, while the probability of transmitting one job in one
 is the discount factor. The matrix P
(  )
depends on the slot is pTX   tTX . If this is not the case, the model can easily
policy  . Its generic element, representing the transition be modified to the opposite case.
As specified so far, at the beginning of each slot n, the action
probability from the state s  to the state s provided that, performed by the SC is constituted by the following elements:
according to the policy  , the action a is performed on the
starting state s  , is: 1. it sets the number of processors, bi  1, L , to be used to
process the jobs in the queues Q1 and Q2 in the slot n;
( a)
 

P[ s , s ]  Pr S (  ) ( n )  s S (  ) ( n  1)  s  , A( n  1)  a  (3)
Likewise, the generic element of the reward matrix represents 2. it sets the number of arrivals,  , that will be enqueued
the immediate reward received performing the action a when in the queue Q1 . The others will be offloaded to the
the system transits from the state s to the state s , that is: UAV. Of course,  cannot be greater than the number
of arrivals occurred in the considered slot, and than the
( a )  S (   (n  1)  s , S (   (n)  s, number of rooms that are available in Q1 .
[ s , s ]  E  R(n)  (4)
 
 A(n  1)  a 
The number of jobs coming from the zone 1 that cannot find
As known, the optimal policy, whose state-value function is
better than or equal to the state-value function of all the other space in the queues Q1 and T , and the number of jobs from the
policies, can be derived by solving a set of Bellman optimality zone 2 that cannot find space in the queue Q2 , are lost.
equations, one for each state of the system [12]. Moreover, jobs generated in zone 1 that are not offloaded suffer
IV. SYSTEM MODEL a delay due to the queue Q1 , while offloaded jobs suffer a delay
Let us model the system described in Section II with a three- that is the sum of the delay in T and the delay in Q2 . The
dimensional discrete-time MDP whose state is defined as choice of the action for each state of the system is the optimal
()

S (n)  S (n), S (n), S (T ) ( n) , where:
(Z ) (Q)
 policy decided by RL, as explained in Section III.
A. Transition Probability Matrix
 S (n)  S ( Z ) (n), S ( Z ) (n)  is the state of the zones
(Z ) 1 2
Let us consider the following two generic states:
controlled by the two UAVs, being ( Z )  1, 2, ..., RH  and s  s Z , sQ , sT   S ( n  1) , i.e. the state at the slot n  1 ;
1 ()
 ( Z2 )
 1, 2, ..., RL  be the sets of states characterizing the  
s   s Z , s Q , sT  S ( n) , i.e. the system state at the slot n .
()
behavior of the zones 1 and 2, and S   the state of ( Zi ) ( Zi )

Let a  b1 , b2 ,   A(n) be the action chosen by the SC at the
the zone i at the slot n, with i 1, 2 . The job arrival
beginning of the slot n, associated to the starting state s
process from each zone i will be modeled as a Switched
according to the optimal policy. The generic element of the
Batch Bernoulli Process (SBBP) Vi (n) that counts the
transition probability matrix can be defined as follows:
number of jobs arriving in the slot n. This number follows
a probability density function (pdf) that is modulated by P[ s
( a )
 P[ s(Z ), s ]  P[( s
(Q, T a )
( s Z ) (5)
 , s ] Z Z Q , sT ), ( sQ , sT )]
the underlying state of the SBBP. Therefore, it is (Z )
characterized by the transition probability matrix of the where P is the transition probability matrix for the zones
underlying Markov chain, P ( Z ) , and the job emission
i covered by the two UAVs. Assuming that the behavior of each
probability matrix, B (V ) , whose rows contain the pdf for
i zone is statistically independent of the behavior of the other one,
each state of the Markov chain; its generic element can be calculated as follows:
(6)
(n)  S ( Q ) (n), S ( Q ) ( n)  is the state of the UAV CE
P[ s(ZZ ), sZ ]  P[ s(ZZ11,)sZ 1 ]  P[ s(ZZ22 ,)sZ 2 ]
 S
(Q) 1 2
queues, being S ( Q ) ( n)  0, ..., H  the number of jobs in ( sZ ) represents the behavior of the three
( Q ,T  )
i The matrix P
the queue of the UAV i, with i 1, 2 ; H is the maximum queues Q1 , Q2 , and T . In its definition, we have highlighted its
number of jobs that the queue can contain; dependence on the arrival state of the underlying Markov chain
of the zones, which determine the number of job arrivals in the
 S (T ) (n)  0, ..., M  is the state of the transmission queue for queues, and the applied policy  , which determines the action
offloading from UAV1 to UAV2. a  b1 , b2 ,  for each transition starting state. Its generic
Assuming that the time needed to process one job in one of element can be defined as follows:
the CEs is less than the time, tTX , needed to transmit one job
( s  Z ) 
(Q, T a )
P[( s
Q , sT ), ( sQ , sT )]
from UAV1 to UAV2, in the following we choose the average
time needed to process one job in a CE as the slot duration,  . S ( n)  s Q S ( Q ) ( n  1)  s  , S (T ) ( n  1)  sT 
(Q )
(7)
 Pr  (T ) Q

Consequently, the probability to process one job in one slot is S (n)  sT A( n  1)  a 
734
In order to evaluate this probability, let us apply the total f (T ) sT 1 , sT1 , 1 , d 1 ,   
probability theorem to the number of possible arrivals from the Prd 1  if sT1  maxminsT 1  1   , H   1, 0
monitored zones, 1 and  2 . We have:  (13)
 1  Prd 1  if sT1  minsT 1  1   , H 
0 otherwise

P[( sQ , sT ), ( sQ , sT )] ( s  Z ) 
(Q ,T a )
 
1 (V ) (V )
2
B[(sVZ11), 1 ]  B[(sVZ22), 2 ] 
 (Q ) S (n  1)  s  Q
(Q )
 B. Short-term Reward Matrix
S (n)  s  Q , (T )  (8)
 Pr  (T ) S (n  1)  s T , A(n  1)  a  Let us define the expected value of the immediate reward
 S (n)  sT V1 (n)  1 , V2 (n)   2  for a given transition from the slot n, when the system is in the
  generic state s  , to the slot n  1 , when the system is in the
The probability term in (8) can be evaluated by considering that,
generic state s , and for a given action a taken according to
according to the choice of the slot duration, kept equal to the
mean service time on the UAV CE, b1 jobs will be served in the s   , by weighing power consumption, delay, and loss
queue Q1 and b2 in the queue Q2 . Instead, the number of jobs probability. More in deep, we define the immediate reward as:
[ s , s ]  k1  p a   k2 s , sZ2 , a   k3 s , a 
( a)
that can be transmitted from the transmission queue T depends  
(14)
on the job size and the throughput of the connection link from
UAV1 to UAV2. Let us indicate the probability of transmitting The first term is the reward received for power saving in
respect to the case that all the CEs are active:
one job from the transmission queue as pTX . Now, applying
p a   2 L  b1  b2   PP (15)
again the theorem of total probability to the number of jobs that
are transmitted from the transmission queue, dT , the probability where 2 L PP represents the maximum power consumption in
term in (8) can be written as follows: the whole system, occurring when all the L CEs are active in
each of the two UAVs, PP being the power consumption of
 (Q ) S (n  1)  s Q
(Q )

S (n)  s Q , (T )  each CE, while b1 and b2 are the numbers of CEs that have
Pr  (T ) S (n  1)  s T , A(n  1)  a  
been decided to be active in the current slot.
 S (n)  sT V1 (n)  1 ,V2 (n)   2 
  (9) The second term,  s , s , a  , is the penalty (it becomes a
s , s , b , 
1
 f ( Q1) reward thanks to the minus sign) related to the job loss for
Q1 Q1 1
dT 0 queue overflows. Starting from the knowledge of the starting
f (Q 2)
s Q2
, sQ 2 , sT , b2 ,  2 , d T  f (T ) sT 1 , sT1 , 1 , d1 ,   and arrival states, we can calculate it as follows:
where f (Q1)    , f (Q 2 )    and f (T )    are functions providing  s   , s Z2 , a  
1
EV1  V2 

maxsQ 1    H , 0 
us the probabilities of one-slot evolution of the two UAV queues
  B[(sV ),  ]  maxsT 1  1     M , 0   B[(sV ),  ]  (16
1 2
and the transmission queue, respectively. The first function can Z1 1 Z2 2

  1
(V )
  2 )
(V )
be calculated as follows: 
  Prd T  maxsQ 2   2  mind T , sT   H , 0
1
f ( Q1) s Q 1 , s Q1 , b1 ,    d 0
T 
1 if s Q1  maxmins Q 1   , H  b1 , 0 (10)
 where EV1  V2  is the mean arrival rate to the system.
0 otherwise Finally, the third term regards the delay suffered in the
To calculate the second function, we need to consider the system queues. As described in Section II, the SC decides, for
number of departures from the transmission queue, which occur the jobs arrived to the UAV1, whether offloading them or not.
with the following probability: To this purpose, it estimates the two delays on the direct path
p if d T  1 through the local queue Q1 , and through the offloading path
Prd T    TX (11) given by the cascade of the transmission queue T and the
1  pTX if d T  0
queue Q2 of the UAV2. Assuming that conditions of those
Therefore, we have:
queues remain constant in the future, the SC compares the
f ( Q 2 ) sQ 2 , sQ 2 , sT , b2 ,  2 , d T   following two delays:
1 sT  0 1  sQ1  sT1 1  sT1  sQ2 
 noOL    and  OL  
 and sQ 2  maxminsQ 2   2 , H  b2 , 0   (17)
  P  b1  T  P  b2 
(12)
 Prd T  if sT  0 and Therefore, the penalty regarding the suffered delay is defined
 and sQ 2  maxminsQ 2   2  d T , H  b2 , 0 as follows:

0 otherwise 1  sQ1    P  sT1  sQ 2  
 s   , s   , a      sT1     (18)
Finally, the function f (T )
   can be derived as follows:  P  b1    T  b2  
735
where the operator x indicates the minimum integer N Q 2   sQ 2   (s )  (27)
 s
containing x.
1
V. SYSTEM PERFORMANCE Q 2      Prd T  B[(sZ 2)

Z 2 , 2 ]
 P[ s(Z ,)s
Z2
2
Z2]
 s (  ) sZ 2( Z ) 2( V ) dT 0
Let P be the transition probability matrix calculated as in

() (28)
IV.A with the optimal policy derived by solving the Bellman  min 2  mind T , sT , H  sQ 2   (s )  
optimality equation system. Let us calculate the steady-state

Finally, the loss probability can be calculated as follows:
probability array  by solving the following system:
( )
     s , sZ2 , a  P[ s( ,as)

 ] (29)
  P   () () () (19)  s (  ) sZ 2( Z )
 () T
  1  1
VI. NUMERICAL RESULTS
Let us now derive the three main performance parameters,
that is, the ones characterizing the objective function in (1) and In this section we consider a case study to apply the proposed
the reward function in (14). The mean power saving is: framework and evaluate some numerical results. Each UAV has
a job queue that can contain at most H  15 jobs, and a
  transmission queue where at most M  5 jobs can be enqueued
p  2 L   b1 s     b2 s     (s )   PP (20)
  s  

waiting for transmission. We assume that each UAV has L  3
The mean delays can be calculated by applying the Little law CEs on board, which can be activated by the SC according to the
to the queues, as the ratio between the mean number of jobs in applied policy. Let   300 ms be the mean job processing time,
the queue and the mean arrival rate. More specifically, the mean also chosen as the slot duration, while the mean time to transmit
delay experienced in the CE queueing system is the mean delay a job on the wireless link from UAV1 to UAV2, tTX , is varied in
in the queue plus 1, the last representing the service time in the the interval [0.3, 1.5] s. Let PP  80 W be the average
CE. According to the Little law, we have: consumption of each CE when it is active. Finally, assume that
N Q1 the SC uses   0.8 as discount factor. As concern the job
 Q1  1 (21) arrival processes, referring to a real video surveillance system in
 Q1
a rural area [13], the high-activity zone covered by UAV1 and
where N Q1 is the mean number of jobs in the queue Q1 , that is: the low-activity zone covered by UAV2 are characterized by the
following SBBP processes:
N Q1   sQ1   (s) (22)
 s 0.864 0.136 0.875 0.125
Q(Z )  
1
 Q(Z )   2
 (30)
while  Q1 is the mean arrival rate to Q1 , that is: 0.143 0.857 0.122 0.878
 1 1 1 1 
 Q1     min , 1 , H  sQ 1   
 s  (  ) sZ 1( Z ) 1( V )
(23)  1 2 3 4 
 B[(sZ ),  ]  P[ s(Z ,)s ]   (s ) 
1 1
B (Z )
1
  sZ1  1 0.07 0.19 0.74 0  (31)
Z1 1 Z1 Z1 
 
Likewise, the mean delay in the transmission queueing system  sZ1  2 0 0.21 0.25 0.54 
T can be calculated as the sum of the mean delays in the queue  
and in the queue service facility:
 2 2 2 
N  
 T 1  T 1  pTX (24)  0 1 2 
T B ( Z )   sZ 2  1
2
0.33 0.67 0  (32)
where N T is the mean number of jobs in the queue T , that is:  
 sZ 2  2 0.16 0.35 0.49 
 
N T   sT   (s )  (25)
 s In our analysis, we consider three different scenarios, each
while T is the mean arrival rate to the queue T , that is: characterized by different importance of power saving, loss
probability and delay. So we analyze the three cases of
T     min max 1   , 0 , M  sT  K  k1 , k 2 , k 3  : K 1  1,1,1 , K 2  1, 2, 2  , and K 3  1, 5, 2  .
 s(  ) sZ 1( Z ) 1( V )
(26) Results are shown in Figs. 2 and 3. More in deep, in Figs. 2a we
 B[(sZZ11), 1 ]  P[ s(ZZ11,)sZ 1 ]   (s) notice that the worst performance in terms of loss and delay is
The mean delay in the CE queue of UAV2 can be calculated as achieved in scenario 1, since the other two scenarios privilege
in (21), considering that the number of arrivals to this queue is these parameters, especially scenario 3 where the weight of loss
2 , that is: probability, k 2 , is the highest one.
736
The fact that curves related to K1 are non-monotonic can be

explained as follows. For increasing but low values of the
transmission time, the mean queue delay in T increases, and so
offload becomes not so much convenient, and therefore the
majority of jobs arriving from the zone Z 1 are enqueued in Q1 .
Thus, the mean delay in Q1 increases, while the one in Q2
decreases. Instead, when the mean delays on both the paths ( Q1
and T Q2 ) become too high (for tTX  0.6 ), the two UAVs (a) Mean delay in Q1 (b) Mean delay in Q2
result more independent from each other, since the cost of
offload is high, and the SC decides to switch on more CEs on
UAV1, as can be seen in Fig. 2a, and so the mean delay in Q1
strongly decreases. To save some power, the number of active
CEs on UAV2 decreases a little. The scenario 3 is completely
different. In fact, it is able to provide less loss probabilities, at
expense of power saving.
VII. CONCLUSIONS AND FUTURE WORK (c) Mean delay in T (d) Loss Probability
This paper has proposed an extension of a 5G network slice Fig. 2. Mean delay and loss probability.
with MEC UAVs equipped with computing facilities to provide
devices on ground with edge computing functions. Since power
consumption of CEs installed on board is comparable with the
one of the UAV engines, in this paper we have proposed to vary
the number of active CEs according to the run-time computation
load coming from ground, and a strategy for adaptive job offload
towards a near UAV. The action of varying the number of active
CEs and the choice of offloading some job is performed by a
System Controller by RL.
(a) Mean number of active CEs (b) Saved power
As a future work we plan to extend the proposed cooperative
Fig. 3. Mean number of active CEs and saved power.
framework to more than two UAVs. Another future direction for
our research consists in analytically evaluating the direct impact IEEE Commun. Surv. Tutor. 2017, 19, pp. 1657–1681.
on the flight mission duration, and including it in the reward [6] M. Mozaffari, et al., “A Tutorial on UAVs for Wireless Networks:
function used by the System Controller. Finally, in this paper we Applications, Challenges, and Open Problems,” arXiv: 1803.00680v1
[cs.IT], March 2018.
assumed a constant time tTX to transmit a job from UAV1 to
[7] C. Grasso, G. Schembra, “Design of a UAV-Based Videosurveillance
UAV2. Actually, it depends on the channel conditions between System with Tactile Internet Constraints in a 5G Ecosystem,” IEEE
the two UAVs, and this will be accounted as future work. NetSoft 2018, Montreal, QC, Canada, 2018.
[8] C. Grasso and G. Schembra, “A Fleet of MEC UAVs to Extend a 5G
ACKNOWLEDGMENT Network Slice for Video Monitoring with Low-Latency Constraints,”
Journal of Sensors and Actuator Networks, vol. 8, no. 1, 2019.
This paper was partially founded by “Ricerca di Ateneo –
[9] Y. Sun, D. Xu, D. W. K. Ng, L. Dai and R. Schober, "Optimal 3D-
Piano per la Ricerca 2016/2018. Trajectory Design and Resource Allocation for Solar-Powered UAV
Communication Systems," in IEEE Trans. on Communications, 2019.
REFERENCES [10] E. Price et al., "Deep Neural Network-Based Cooperative Visual
[1] 5G Vision, 5G PPP, Feb. 2015. Tracking Through Multiple Micro Aerial Vehicles," in IEEE Robotics
[2] “View on 5G Architecture, 5GPPP Architecture Working Group”, v. 1.0, and Automation Letters, vol. 3, no. 4, pp. 3193-3200, Oct. 2018.
July 2016, https://5g-ppp.eu/wp-content/uploads/2014/02/5G-PPP-5G- [11] S. S. Dias M. G. S. Bruno "Cooperative target tracking using
ArchitectureWP-July-2016.pdf decentralized particle filtering and RSS sensors" IEEE Trans. Signal
[3] D. Kreutz, F. M. V. Ramos, P. E. Veríssimo, C. E. Rothenberg, S. Process. vol. 61 no. 14 pp. 3632-3646 Jul. 2013.
Azodolmolky, S. Uhlig, “Software-Defined Networking: A [12] Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An
Comprehensive Survey,” Proc. IEEE 2015, 103, 14–76. Introduction,” The MIT Press Cambridge, Massachusetts London,
England, 2012.
[4] J.d.J.G. Herrera, J. F. B. Vega, “Network Functions Virtualization: A
Survey,” IEEE Lat. Am. Trans. 2016, 14, 983–997. [13] C. Rametta, G. Schembra, “Designing a softwarized network deployed on
a fleet of drones for rural zone monitoring,” in Future Internet, vol. 9, no.
[5] T. Taleb, et al., “On Multi-Access Edge Computing: A Survey of the
1, March 2017.
Emerging 5G Network Edge Cloud Architecture and Orchestration,”
737

Reinforcement-Learning For Management of A 5G Network Slice Extension With Uavs

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Reinforcement-Learning For Management of A 5G Network Slice Extension With Uavs

Uploaded by

Copyright:

Available Formats

2019 IEEE INFOCOM WKSHPS: SMILING 2019: Sustainable networking through MachIne Learning and Internet of thINGs

Reinforcement-Learning for Management of a 5G

XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE

978-1-7281-1878-9/19/$31.00 ©2019 IEEE 732

job composition task can be performed by either the same

if an UAV is overloaded of jobs, and typically this occurs when

behavior of the zones 1 and 2, and S   the state of ( Zi ) ( Zi )

and the transmission queue, respectively. The first function can Z1 1 Z2 2

V. SYSTEM PERFORMANCE Q 2      Prd T  B[(sZ 2)

Let P be the transition probability matrix calculated as in

optimality equation system. Let us calculate the steady-state

     s , sZ2 , a  P[ s( ,as)

The fact that curves related to K1 are non-monotonic can be

You might also like