Abstract— Some representative 5G application scenarios regard geographic areas very far from the structured core network, but are characterized by the need for processing huge amounts of data that cannot be transmitted to multi-access edge computing (MEC) facilities installed at the edge of that network. To this purpose, this paper proposes to extend a 5G network slice with a fleet of UAVs, each providing computing facilities, and for this reason referred to as MEC UAVs. The paper proposes a cooperation between MEC UAVs belonging to the same fleet, based on job offloading, aiming at minimizing the power consumption due to the active computer elements providing MEC, the job loss probability, and the queueing delay. A Reinforcement Learning (RL) approach is used to support the System Controller in its decisions. A numerical analysis is presented to evaluate the achieved performance.

Keywords — 5G, Network Slicing, UAV, Reinforcement Learning, Markov Decision Processes (MDP)

I. INTRODUCTION

The 5th generation wireless systems, or 5G, are not only an evolution of the legacy 4G cellular networks, but a revolution, thanks to the introduction of new disruptive service capabilities [1]. One of the main features that will be introduced is the concept of network slices [2], which aims at addressing the diversified service requirements of different application scenarios. A 5G network slice is an end-to-end logical network provisioned with a set of isolated virtual resources on the shared physical infrastructure, so providing a network-as-a-service (NaaS) model. The three main paradigms that will enable network slicing in 5G systems are software-defined networking (SDN) [3], network functions virtualization (NFV) [4], and multi-access edge computing (MEC) [5].

However, some 5G application scenarios, such as smart agriculture, environment monitoring, and video surveillance with drones, regard geographic areas very far from the structured core network. Moreover, in most of them, connected devices produce a lot of data that require real-time processing, and for them the use of network slices with MEC facilities is not possible, because these facilities are too far away to be reached with links of sufficient throughput. A first attempt in this direction has been made by introducing flying platforms realized with unmanned aerial vehicles (UAVs), popularly known as drones. Thanks to their characteristics, applications of UAVs have been extended to aerial base stations to enhance coverage, capacity, reliability, and energy efficiency of wireless networks [6].

Starting from that paper and from previous works of the same Authors [7-8], the idea at the base of this paper is to extend a 5G network slice with a fleet of UAVs, each providing not only networking, but also computing and storage with MEC facilities, realized with a set of Computer Elements (CEs) installed on board. However, since the power consumption of each computing element is comparable to the consumption of the engines, and can therefore compromise the flight mission duration [9], this paper proposes to change the number of active CEs at run-time, according to the computation requests coming from the ground in real time. An additional feature included in this proposal is cooperation among UAVs, already experimented with for other purposes (see for example [10-11]).

More specifically, in this paper we propose a framework in which, when the zone monitored by a UAV enters a state of high activity, the UAV can either switch on more CEs, or ask a nearby UAV for help, in such a way that some jobs can be offloaded to it. The choice of the number of CEs to keep active in each UAV, and of the amount of jobs to be offloaded to the helping UAV, is in charge of a System Controller (SC), launched by the UAV asking for help as a virtual network function (VNF). A Reinforcement Learning (RL) approach is used to support the SC in each decision, with the target of maximizing a medium-term reward, defined as a function of the power saved by switching off some CEs, and of the performance in terms of loss probability and mean delay. A numerical analysis will evaluate the performance achieved with the proposed platform.

The paper is structured as follows. Section II describes the reference system. Section III provides some background regarding RL. The model of the whole system and the analytical definition of the reward function, necessary to apply RL, are described in Section IV, while the main performance parameters are analytically derived in Section V. Some numerical results will be presented in Section VI, while Section VII will conclude the paper, providing some insights for future work.

II. REFERENCE SYSTEM

We consider a 5G network slice extension realized with a fleet of UAVs, each equipped with MEC facilities, and for this reason here referred to as MEC UAVs. Each MEC UAV is equipped with L CEs to process the jobs received from ground devices, each CE consuming a given amount of power, P_P. The goal of the proposed platform is to provide a geographic area with this network extension, aimed at processing data coming from devices installed on the ground. The whole area is subdivided into adjacent zones, each covered by a MEC UAV. Data generated by ground devices are organized in jobs to be processed, and the [...]
[Figure (reference system model): the job queues Q1 and Q2 of the two UAVs, the transmission queue T fed by Flow 1 through a switch, and the zone activity chains with states 1, ..., R_H (High Activity) and 1, ..., R_L (Low Activity).]

[...] where the network slice is used for video surveillance, the two behaviors represent no-alarm and alarm situations, respectively.

Applications using UAV MEC facilities can be more or less sensitive to job losses and to the job processing delay. Another important parameter to be accounted for is the power consumption of the CEs, which is comparable with the consumption of the UAV engines for flight, and can therefore strongly influence the UAV mission duration [11]. To this purpose, we consider the possibility of mutual help among UAVs. The basic idea is that [...] the importance that the three above parameters have for the considered application scenario.

The SC uses reinforcement learning (RL) to decide the actions to be taken. To this end, a discrete-time approach is adopted: actions are taken periodically at the beginning of each time slot of duration \tau. The following actions will be taken:

1. Decision of the number b_i of CEs to be active during the time slot for each UAV i. The others are put in a low-power state to reduce power consumption.

2. Decision of the number \gamma of jobs to be locally managed by UAV1 among the \nu_1 jobs arrived in the slot. The other \nu_1 - \gamma are offloaded to UAV2.

Jobs to be locally processed by UAV1 are enqueued in the queue Q1, where they wait for some CE availability. The other jobs are enqueued in the transmission queue T, to be transmitted to UAV2, where they will be enqueued together with the jobs arrived from the zone monitored by it.

[...]

The term \beta is a discount factor, with \beta \in [0, 1]. It is an input parameter that informs the agent of how much it should care about rewards now as compared to rewards in the future.

For each state of the environment, a State-Value Function and an Action-Value Function are also defined, to describe how good it is to be in that state, and how good it is to take a given action. For a given policy \pi, the state-value function for a state s is defined as v_\pi(s) = E_\pi\{G(n) \mid S(n) = s\}, where E_\pi is the expected value given that the agent follows the policy \pi, and G(n) is the cumulative reward, as defined in (2). It represents the expected return when the system starts in s and follows the policy \pi. We assume the Markov property for the environment state, and define its model as a Markov Decision Process (MDP). An MDP, for a given policy \pi that specifies an action a for each state s, is completely defined as a tuple (\Omega, \Omega_A, P^{(\pi)}, R^{(\pi)}, \beta), where \Omega is the system state space, \Omega_A is the set of actions, P^{(\pi)} is the state transition
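The Bellman optimality machinery recalled above can be illustrated with a small value-iteration sketch. This is a toy example on a hypothetical two-state, two-action MDP (the numbers are illustrative, not the paper's model); the arrays P and R play the roles of the transition and reward matrices in (3) and (4), and the discount factor matches the value the SC uses in the case study (0.8).

```python
import numpy as np

beta = 0.8  # discount factor

# Hypothetical MDP: P[a][s][s'] transition probabilities (each row sums to 1)
# and R[a][s][s'] immediate rewards, in the roles of (3) and (4).
P = np.array([[[0.9, 0.1], [0.4, 0.6]],   # action 0
              [[0.2, 0.8], [0.7, 0.3]]])  # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 1.5], [1.0, 0.0]]])

v = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: q[a][s] = sum_s' P[a][s][s'] (R[a][s][s'] + beta v(s'))
    q = (P * (R + beta * v)).sum(axis=2)
    v_new = q.max(axis=0)
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new

# Greedy policy: best action in each state under the converged value function
policy = (P * (R + beta * v)).sum(axis=2).argmax(axis=0)
```

At convergence, v satisfies one Bellman optimality equation per state, which is exactly the system of equations whose solution yields the optimal policy mentioned in [12].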
2019 IEEE INFOCOM WKSHPS: SMILING 2019: Sustainable networking through MachIne Learning and Internet of thINGs
probability matrix, R^{(\pi)} is the immediate reward matrix, and \beta is the discount factor. The matrix P^{(\pi)} depends on the policy \pi. Its generic element, representing the transition probability from the state s to the state s', provided that, according to the policy \pi, the action a is performed on the starting state s, is:

P^{(a)}[s, s'] = \Pr\{S^{(\pi)}(n) = s' \mid S^{(\pi)}(n-1) = s, A(n-1) = a\}    (3)

Likewise, the generic element of the reward matrix represents the immediate reward received by performing the action a when the system transits from the state s to the state s', that is:

R^{(a)}[s, s'] = E\{R(n) \mid S^{(\pi)}(n-1) = s, S^{(\pi)}(n) = s', A(n-1) = a\}    (4)

As known, the optimal policy, whose state-value function is better than or equal to the state-value function of all the other policies, can be derived by solving a set of Bellman optimality equations, one for each state of the system [12].

IV. SYSTEM MODEL

Let us model the system described in Section II with a three-dimensional discrete-time MDP whose state is defined as S(n) = (S^{(Z)}(n), S^{(Q)}(n), S^{(T)}(n)), where:

- S^{(Z)}(n) = (S^{(Z_1)}(n), S^{(Z_2)}(n)) is the state of the zones controlled by the two UAVs, being \{1, 2, ..., R_H\} and \{1, 2, ..., R_L\} the sets of states characterizing the activity of the two zones;

- S^{(Q)}(n) = (S^{(Q_1)}(n), S^{(Q_2)}(n)) is the state of the CE queues, being S^{(Q_i)}(n) \in \{0, ..., H\} the number of jobs in the queue of the UAV i, with i \in \{1, 2\}; H is the maximum number of jobs that each queue can contain;

- S^{(T)}(n) \in \{0, ..., M\} is the state of the transmission queue used for offloading from UAV1 to UAV2.

Assuming that the time needed to process one job in one of the CEs is less than the time, t_TX, needed to transmit one job from UAV1 to UAV2, in the following we choose the average time needed to process one job in a CE as the slot duration, \tau. Consequently, the probability of processing one job in one slot is equal to 1, while the probability of transmitting one job in one slot is p_TX = \tau / t_TX. If this is not the case, the model can easily be modified to the opposite case.

As specified so far, at the beginning of each slot n, the action performed by the SC is constituted by the following elements:

1. it sets the number of processors, b_i \in [1, L], to be used to process the jobs in the queues Q1 and Q2 in the slot n;

2. it sets the number of arrivals, \gamma, that will be enqueued in the queue Q1. The others will be offloaded to the other UAV. Of course, \gamma cannot be greater than the number of arrivals occurred in the considered slot, nor than the number of rooms that are available in Q1.

The number of jobs coming from zone 1 that cannot find space in the queues Q1 and T, and the number of jobs from zone 2 that cannot find space in the queue Q2, are lost. Moreover, jobs generated in zone 1 that are not offloaded suffer a delay due to the queue Q1, while offloaded jobs suffer a delay that is the sum of the delay in T and the delay in Q2. The choice of the action for each state of the system is given by the optimal policy decided by RL, as explained in Section III.

A. Transition Probability Matrix

Let us consider the following two generic states:

- s = (s_Z, s_Q, s_T) = S(n-1), i.e. the system state at the slot n-1;
- s' = (s'_Z, s'_Q, s'_T) = S(n), i.e. the system state at the slot n.

The matrix P^{(Q,T)}(s'_Z) represents the behavior of the three queues Q1, Q2, and T. In its definition, we have highlighted its dependence on the arrival state of the underlying Markov chain of the zones, which determines the number of job arrivals in the queues, and on the applied policy \pi, which determines the action a = (b_1, b_2, \gamma) for each transition starting state. Its generic element can be defined as follows:

P^{(Q,T|a)}[(s_Q, s_T), (s'_Q, s'_T)](s'_Z) = \Pr\{S^{(Q)}(n) = s'_Q, S^{(T)}(n) = s'_T \mid S^{(Q)}(n-1) = s_Q, S^{(T)}(n-1) = s_T, A(n-1) = a\}    (7)
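The per-slot bookkeeping that the action (b_1, b_2, \gamma) induces on the three queues can be sketched procedurally. The helper below is hypothetical and written under simplifying assumptions (admissions before services, at most one transmission per slot); the paper encodes the exact one-slot evolution probabilistically through the functions f^{(Q_1)}, f^{(Q_2)} and f^{(T)} of Section IV.

```python
# Sketch of one-slot queue evolution under action (b1, b2, gamma); hypothetical
# helper, assuming capacities H (job queues) and M (transmission queue) as in the text.
def step_queues(s_q1, s_q2, s_t, b1, b2, gamma, nu1, nu2, d_t, H=15, M=5):
    """gamma of the nu1 zone-1 arrivals join Q1, the rest join T; b_i jobs are
    served per CE queue; d_t (0 or 1) jobs leave T toward Q2. Returns the new
    queue states and the number of jobs lost to overflow."""
    gamma = min(gamma, nu1, H - s_q1)           # cannot exceed arrivals or free rooms
    q1 = max(min(s_q1 + gamma, H) - b1, 0)      # Q1: admit, then serve b1 jobs
    offload = nu1 - gamma
    lost_t = max(s_t + offload - M, 0)          # overflow of the transmission queue
    t = min(s_t + offload, M)
    moved = min(d_t, t)                         # jobs actually transmitted to UAV2
    t -= moved
    lost_q2 = max(s_q2 + nu2 + moved - H, 0)    # overflow of Q2 before service
    q2 = max(min(s_q2 + nu2 + moved, H) - b2, 0)
    return q1, q2, t, lost_t + lost_q2
```

For example, with empty queues, action (b1, b2, gamma) = (1, 1, 2) and arrivals (nu1, nu2) = (3, 1), one job is offloaded into T and nothing is lost.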
In order to evaluate this probability, let us apply the total probability theorem to the number of possible arrivals from the monitored zones, \nu_1 and \nu_2. We have:

P^{(Q,T|a)}[(s_Q, s_T), (s'_Q, s'_T)](s'_Z) = \sum_{\nu_1} \sum_{\nu_2} B^{(V_1)}[s'_{Z_1}, \nu_1] B^{(V_2)}[s'_{Z_2}, \nu_2] \Pr\{S^{(Q)}(n) = s'_Q, S^{(T)}(n) = s'_T \mid S^{(Q)}(n-1) = s_Q, S^{(T)}(n-1) = s_T, A(n-1) = a, V_1(n) = \nu_1, V_2(n) = \nu_2\}    (8)

The probability term in (8) can be evaluated by considering that, according to the choice of the slot duration, kept equal to the mean service time on the UAV CEs, b_1 jobs will be served in the queue Q1 and b_2 in the queue Q2. Instead, the number of jobs that can be transmitted from the transmission queue T depends on the job size and on the throughput of the connection link from UAV1 to UAV2. Let us indicate the probability of transmitting one job from the transmission queue as p_TX. Now, applying again the theorem of total probability to the number of jobs, d_T, that are transmitted from the transmission queue, the probability term in (8) can be written as follows:

\Pr\{S^{(Q)}(n) = s'_Q, S^{(T)}(n) = s'_T \mid S^{(Q)}(n-1) = s_Q, S^{(T)}(n-1) = s_T, A(n-1) = a, V_1(n) = \nu_1, V_2(n) = \nu_2\} = \sum_{d_T=0}^{1} f^{(Q_1)}(s_{Q_1}, s'_{Q_1}, b_1, \gamma) f^{(Q_2)}(s_{Q_2}, s'_{Q_2}, s_T, b_2, \nu_2, d_T) f^{(T)}(s_T, s'_T, \nu_1 - \gamma, d_T)    (9)

where f^{(Q_1)}, f^{(Q_2)} and f^{(T)} are functions providing us the probabilities of the one-slot evolution of the two UAV queues and of the transmission queue. The first one can be calculated as follows:

f^{(Q_1)}(s_{Q_1}, s'_{Q_1}, b_1, \gamma) = 1 if s'_{Q_1} = \max\{\min\{s_{Q_1} + \gamma, H\} - b_1, 0\}; 0 otherwise    (10)

To calculate the second function, we need to consider the number of departures from the transmission queue, which occur with the following probability:

\Pr\{d_T\} = p_TX if d_T = 1; 1 - p_TX if d_T = 0    (11)

Therefore, we have:

f^{(Q_2)}(s_{Q_2}, s'_{Q_2}, s_T, b_2, \nu_2, d_T) = 1 if s_T = 0 and s'_{Q_2} = \max\{\min\{s_{Q_2} + \nu_2, H\} - b_2, 0\}; \Pr\{d_T\} if s_T > 0 and s'_{Q_2} = \max\{\min\{s_{Q_2} + \nu_2 + d_T, H\} - b_2, 0\}; 0 otherwise    (12)

Finally, the function f^{(T)} can be derived as follows:

f^{(T)}(s_T, s'_T, \nu_1 - \gamma, d_T) = \Pr\{d_T = 1\} if s'_T = \max\{\min\{s_T + \nu_1 - \gamma, M\} - 1, 0\}; 1 - \Pr\{d_T = 1\} if s'_T = \min\{s_T + \nu_1 - \gamma, M\}; 0 otherwise    (13)

B. Short-term Reward Matrix

Let us define the expected value of the immediate reward for a given transition from the slot n, when the system is in the generic state s, to the slot n+1, when the system is in the generic state s', and for a given action a taken according to s, by weighing power consumption, delay, and loss probability. In more depth, we define the immediate reward as:

R^{(a)}[s, s'] = k_1 \Delta_p(a) - k_2 \Lambda(s, s'_Z, a) - k_3 \Delta(s, s', a)    (14)

The first term is the reward received for the power saving with respect to the case in which all the CEs are active:

\Delta_p(a) = (2L - b_1 - b_2) P_P    (15)

where 2L P_P represents the maximum power consumption in the whole system, occurring when all the L CEs are active in each of the two UAVs, P_P being the power consumption of each CE, while b_1 and b_2 are the numbers of CEs that have been decided to be active in the current slot.

The second term, \Lambda(s, s'_Z, a), is the penalty (it becomes a reward thanks to the minus sign) related to the job loss for queue overflows. Starting from the knowledge of the starting and arrival states, we can calculate it as follows:

\Lambda(s, s'_Z, a) = \frac{1}{E[V_1 + V_2]} \sum_{\nu_1} \sum_{\nu_2} B^{(V_1)}[s'_{Z_1}, \nu_1] B^{(V_2)}[s'_{Z_2}, \nu_2] (\max\{s_{Q_1} + \gamma - H, 0\} + \max\{s_T + \nu_1 - \gamma - M, 0\} + \sum_{d_T} \Pr\{d_T\} \max\{s_{Q_2} + \nu_2 + \min\{d_T, s_T\} - H, 0\})    (16)

where E[V_1 + V_2] is the mean arrival rate to the system.

Finally, the third term regards the delay suffered in the system queues. As described in Section II, the SC decides, for the jobs arrived at UAV1, whether to offload them or not. To this purpose, it estimates the two delays on the direct path through the local queue Q1, and on the offloading path given by the cascade of the transmission queue T and the queue Q2 of UAV2. Assuming that the conditions of those queues remain constant in the future, the SC compares the following two delays:

\delta_{noOL} = \lceil (1 + s_{Q_1}) / b_1 \rceil \tau_P  and  \delta_{OL} = (s_T + 1) \tau_T + \lceil (1 + s_{Q_2}) / b_2 \rceil \tau_P    (17)

where \tau_P and \tau_T denote the job processing and transmission times, respectively. Therefore, the penalty regarding the suffered delay is defined as follows:

\Delta(s, s', a) = \lceil (1 + s_{Q_1}) / b_1 \rceil \tau_P + (s_T + 1) \tau_T + \lceil (1 + s_{Q_2}) / b_2 \rceil \tau_P    (18)
where the operator \lceil x \rceil indicates the minimum integer containing x.

V. PERFORMANCE PARAMETERS

Let us now derive the three main performance parameters, that is, the ones characterizing the objective function in (1) and the reward function in (14). The mean power saving is:

\overline{\Delta}_p = \sum_s (2L - b_1(s) - b_2(s)) \pi(s) P_P    (20)

where \pi(s) is the steady-state probability of the system state s under the applied policy.

The mean delays can be calculated by applying the Little law to the queues, as the ratio between the mean number of jobs in the queue and the mean arrival rate. More specifically, the mean delay experienced in the CE queueing system is the mean delay in the queue plus 1, the latter representing the service time in the CE. According to the Little law, we have:

\delta_{Q_1} = N_{Q_1} / \lambda_{Q_1} + 1    (21)

where N_{Q_1} is the mean number of jobs in the queue Q1, that is:

N_{Q_1} = \sum_s s_{Q_1} \pi(s)    (22)

while \lambda_{Q_1} is the mean arrival rate to Q1, that is:

\lambda_{Q_1} = \sum_s \sum_{s'_{Z_1}} \sum_{\nu_1} \min\{\gamma, \nu_1, H - s_{Q_1}\} B^{(V_1)}[s'_{Z_1}, \nu_1] P^{(Z_1)}[s_{Z_1}, s'_{Z_1}] \pi(s)    (23)

Likewise, the mean delay in the transmission queueing system T can be calculated as the sum of the mean delays in the queue and in the queue service facility:

\delta_T = N_T / \lambda_T + 1 / p_TX    (24)

where N_T is the mean number of jobs in the queue T, that is:

N_T = \sum_s s_T \pi(s)    (25)

while \lambda_T is the mean arrival rate to the queue T, that is:

\lambda_T = \sum_s \sum_{s'_{Z_1}} \sum_{\nu_1} \min\{\max\{\nu_1 - \gamma, 0\}, M - s_T\} B^{(V_1)}[s'_{Z_1}, \nu_1] P^{(Z_1)}[s_{Z_1}, s'_{Z_1}] \pi(s)    (26)

The mean delay in the CE queue of UAV2 can be calculated as in (21), considering that the number of arrivals to this queue is \nu_2, and that:

N_{Q_2} = \sum_s s_{Q_2} \pi(s)    (27)

VI. NUMERICAL RESULTS

In this section we consider a case study to apply the proposed framework and evaluate some numerical results. Each UAV has a job queue that can contain at most H = 15 jobs, and a transmission queue where at most M = 5 jobs can be enqueued waiting for transmission. We assume that each UAV has L = 3 CEs on board, which can be activated by the SC according to the applied policy. Let \tau = 300 ms be the mean job processing time, also chosen as the slot duration, while the mean time to transmit a job on the wireless link from UAV1 to UAV2, t_TX, is varied in the interval [0.3, 1.5] s. Let P_P = 80 W be the average consumption of each CE when it is active. Finally, assume that the SC uses \beta = 0.8 as the discount factor. As concerns the job arrival processes, referring to a real video surveillance system in a rural area [13], the high-activity zone covered by UAV1 and the low-activity zone covered by UAV2 are characterized by the following SBBP processes:

Q^{(Z_1)} = [0.864  0.136 ; 0.143  0.857],   Q^{(Z_2)} = [0.875  0.125 ; 0.122  0.878]    (30)

B^{(Z_1)} = [0.07  0.19  0.74  0 ; 0  0.21  0.25  0.54], with columns corresponding to \nu_1 = 1, 2, 3, 4 and rows to s_{Z_1} = 1, 2    (31)

B^{(Z_2)} = [0.33  0.67  0 ; 0.16  0.35  0.49], with columns corresponding to \nu_2 = 0, 1, 2 and rows to s_{Z_2} = 1, 2    (32)

In our analysis, we consider three different scenarios, each characterized by a different importance of power saving, loss probability and delay. So we analyze the three cases of K = (k_1, k_2, k_3): K_1 = (1, 1, 1), K_2 = (1, 2, 2), and K_3 = (1, 5, 2). Results are shown in Figs. 2 and 3. More specifically, in Fig. 2a we notice that the worst performance in terms of loss and delay is achieved in scenario 1, since the other two scenarios privilege these parameters, especially scenario 3, where the weight of the loss probability, k_2, is the highest one.