Federated Learning for RAN Slicing in Beyond 5G Networks

Amine Abouaomar*, Member, IEEE, Afaf Taik*, Member, IEEE, Abderrahime Filali*, Member, IEEE, and Soumaya Cherkaoui, Senior Member, IEEE

* Authors contributed equally.
A. Abouaomar, A. Filali, and S. Cherkaoui are with Polytechnique Montreal, Montreal, QC, Canada.
A. Taik is with INTERLAB, Engineering Faculty, Université de Sherbrooke, QC, Canada.

Abstract—Radio access network (RAN) slicing allows the division of the network into several logical networks tailored to different and varying service requirements in a sustainable way. It is thereby considered a key enabler of 5G and next-generation networks. However, determining optimal strategies for RAN slicing remains a challenging issue. Using machine learning algorithms to address such a difficult problem is promising. However, due to the large differences imposed by RAN deployments and the disparity of their required services, it is difficult to utilize the same slicing model across all the covered areas. Moreover, the data collected by each mobile virtual network operator (MVNO) in different areas is mostly limited and rarely shared among operators. Federated learning presents new opportunities for MVNOs to benefit from distributed training. In this paper, we propose a federated deep reinforcement learning (FDRL) approach to train bandwidth allocation models among MVNOs based on their interactions with their users. We evaluate the proposed approach through extensive simulations to show the importance of such collaboration in building efficient network slicing models.

Index Terms—RAN Slicing, Federated Learning, Reinforcement Learning, B5G.

I. INTRODUCTION

Modern wireless networks have known an explosive growth of data traffic as the number of mobile devices increases every day. Mobile devices exchange data to acquire various services, with various qualities and requirements, from their mobile network operators (MNOs). To meet the ever-growing needs of these services, network operators are obligated to deploy new equipment and extend their coverage as the network generations evolve. However, extending the coverage of next-generation networks is expensive, which makes sharing the network infrastructure a valuable alternative for various service providers [1]. By sharing different network equipment and resources, such as spectrum, antennas, and radio interfaces, service providers can fulfill the requirements of highly scattered customers at a significantly reduced cost [2].

Network slicing (NS) is an advanced solution based on network virtualization that enables the transition from a static network infrastructure to a dynamic one. It allows the design of several logically independent networks, known as network slices, which operate on a common physical infrastructure [3]. In particular, radio access network (RAN) slicing consists in partitioning the RAN resources to create various RAN slices, each tailored and dedicated to meet the requirements of a specific 5G service [4], [5]. These services can be classified into enhanced mobile broadband (eMBB), ultra-reliable low-latency communication (URLLC), and massive machine-type communication (mMTC) services. In next-generation networks, MNOs consist of two main entities, namely the infrastructure provider (InP) and the mobile virtual network operators (MVNOs) [6]. On one hand, the InP owns the physical resources, including base stations, core network components and, importantly, the radio resources. On the other hand, MVNOs lease these physical resources from the InP to deploy the RAN slices required to provide their own services. In a RAN slicing scenario, the InP allocates the radio resources to the MVNOs according to the service level agreement (SLA) contracts. Then, each MVNO allocates the radio resources rented from the InP to its users [6].

The allocation of radio resources to users is an extremely intricate operation for MVNOs. This is mainly due to the scarcity of radio resources and the heterogeneous requirements of their users in terms of quality of service (QoS) [7]–[9]. To address these challenges, various approaches based on machine learning (ML) techniques have been proposed recently, specifically reinforcement learning (RL) algorithms [10]–[13]. Nevertheless, due to the dynamics of the RAN environment, in terms of user density, user requirements, and wireless channel transmission conditions, RAN slicing remains a significantly challenging problem for MVNOs. These stochastic RAN environment factors have a major impact on the accuracy of the RL models, which decreases the performance of radio resource allocation to the users [14]–[19]. Indeed, when an MVNO builds its resource allocation RL model using training datasets related only to its own users' behavior and its surrounding environment, the accuracy of the model may be limited. To benefit from a diversified dataset, MVNOs could collaborate by sharing their data with each other to obtain a diverse, high-quality dataset for training the RL models. However, MVNOs are often competing entities and are unlikely to be willing to share their data for privacy and data security reasons. To overcome this issue, the federated learning (FL) paradigm can be leveraged [20]–[22].

FL is a cooperative learning approach in which multiple collaborators, MVNOs in our case, train an ML model using their private datasets and then send their trained models to an aggregation entity to build a global model [22], [23]. The aggregation entity returns the global model to all collaborators for immediate utilization or further training. Thus, FL enables MVNOs to build a robust ML resource allocation model while maintaining data privacy, since only the trained models are shared. Indeed, the shared experience enables the RAN slicing model to learn from varying scenarios, which makes it more adaptive to environment changes. In fact, due to the unbalanced and non-independent and identically distributed (non-i.i.d.) users across MVNOs, alongside their varying numbers and requirements, FL becomes an attractive solution to build robust models.

To promote more programmability in the RAN, the Open RAN (O-RAN) architecture can be leveraged [24]–[26]. In fact, the hierarchical RAN intelligent controller (RIC), comprising the non-real-time RIC (non-RT RIC) and the near-real-time RIC (near-RT RIC), can be used to manage RAN slicing operations using ML. The former handles the heavier RAN tasks, such as running the training process, while the latter performs critical tasks, such as inference and the aggregation of ML models in FL.

In this paper, we propose an FL-based cooperative radio resource allocation mechanism for MVNOs. In this mechanism, each MVNO trains an RL radio resource allocation model according to its users' requirements and sends the trained model to the near-RT RIC for aggregation. Then, the near-RT RIC sends back the global RL model to each MVNO to update its local RL model. We consider two types of users, namely URLLC users and eMBB users. URLLC users require low latency, while eMBB users need a high data rate. To the best of our knowledge, this is the first work to propose cooperative radio resource allocation between MVNOs based on FDRL. The main contributions of this paper are summarized as follows:

• We model the radio resource allocation problem for URLLC and eMBB users as a continuous non-linear optimization problem.
• We model the radio resource allocation problem of an MVNO as a Markov decision process (MDP).
• We develop a deep RL (DRL) algorithm to allocate radio resources to the URLLC and eMBB users of each MVNO.
• We design a federated DRL (FDRL) mechanism on an O-RAN architecture to cooperatively improve the radio resource allocation operation of MVNOs.
• We evaluate the proposed mechanism through extensive simulations.

The remainder of this paper is organized as follows. Section II discusses related work on RAN slicing based on DRL and FL. Section III provides the system model and the problem formulation of radio resource allocation. Section IV presents the proposed FDRL mechanism. Section V discusses the evaluations and results of the proposed mechanism. The conclusion is provided in Section VI.

II. RELATED WORK

Many works have investigated RAN slicing in general. For instance, the authors of [16] proposed DeepSlice, a deep learning neural network driven approach to efficiently address load balancing and network availability challenges. In their work, they utilize the available KPIs to train a model for analyzing incoming traffic and predicting the network slice for any user type. Intelligent resource allocation allows for efficient utilization of the available resources on established network slices and provides load balancing. The authors of [27] proposed a genetic algorithm to allocate resources in multi-tenant and multi-tier heterogeneous networks. The proposed approach consists in relaxing the problem and solving it through hierarchical decomposition methods and Monte Carlo simulation. This work addressed in particular latency and bandwidth allocation as QoS metrics. From a deeper perspective, the RAN slicing resource allocation process takes place at several levels, and ML has been widely investigated in this regard [28]. The literature separates the allocation of InP resources to MVNOs from the allocation of MVNO resources to users. Many works investigate MVNO-level RAN slicing using RL [16]–[18], [29]. However, the radio resource allocation is considered only from the perspective of a single MVNO. For instance, the authors of [18] proposed a RAN slicing mechanism to enhance the performance of URLLC and eMBB services. The proposed approach considers RAN resource slicing on two time scales, a large time scale and a short time scale. On the large time scale, radio resource allocation depends on the requirements of URLLC and eMBB users. On the short time scale, gNodeBs allocate their resources to end users. This problem was modeled as a non-linear binary program solved using deep reinforcement learning, precisely a deep Q-learning model. Although the work mentions that resources can be allocated from adjacent nodes, it only considers resource allocation for a single operator. The work in [19] considered a strategic approach based on Stackelberg-type games to cope with frequency and energy provisioning for the InP. The authors provided an analysis of the equilibrium when MVNO users are uniformly distributed. They obtain a unique equilibrium policy at each layer in the special scenario where each MVNO manages only one category of users. For the broader scenario of MVNOs that serve multiple user types, the authors proposed an evolved two-layer differential algorithm along with a gradient-based method to achieve the equilibrium. The work in [30] introduces a dynamic RAN slicing approach for vehicular networks in order to handle various IoV services with different QoS requirements. The RL-based algorithm solves the problem in two phases, including workload distribution and resource allocation decisions. A DDPG actor-critic RL approach was adopted in particular.

Despite the significant efforts to provide solutions for dynamic and efficient management of RAN slicing, many aspects are missing from the literature. The aspect of privacy, which is crucial and may represent a threat to MVNOs as well as to users, is yet to be investigated. Moreover, by sharing each other's experiences, MVNOs can improve their resource allocation schemes by collaboratively training resource allocation models and sharing them in an FL fashion. Such a research direction has not been well investigated. To the best of our knowledge, this is the first work to investigate the use of FL for next-generation network management, specifically for multi-MVNO resource allocation. The authors of [17] investigated resource allocation for wireless network slices. This work proposed a two-tier slicing resource allocation scheme using DRL.

Fig. 1: An overview of the system model.

This paper also tackled the problem within a single BS, where users access the associated RAN resources through MVNOs. Hence, the resource allocation process is divided into two tiers. The first tier is dedicated to allocating InP resources to MVNOs using the DQN technique combined with bidding. The second tier considers the allocation of MVNO resources to users using the dueling DQN technique to converge to an optimal solution. However, the DQN technique takes a long time to converge to a stable reward, which makes it not suitable for all DRL-based solutions.

Previous literature on RAN slicing resource allocation provides a variety of solutions and techniques that cope with resource allocation, either at the upper tier (the InP allocating resources to MVNOs) or at the lower tier (MVNOs allocating resources to users). However, while DQN is well adapted to problems where the observation space has high dimensions, it is only capable of handling discrete action spaces of low dimension. Therefore, DQN is not well adapted to situations with continuous action spaces of significantly high dimension. Consequently, DQN cannot be directly applied to continuous domains, since it is founded on seeking the actions that maximize the action-value function, which in continuous cases would require an iterative optimization process at every step. In this paper, we adopt the deep deterministic policy gradient to deal with the continuous aspect of the action space and thereby escape the curse of dimensionality. Additionally, in the proposed approach, MVNOs can benefit from each other's experiences while preserving privacy.

III. FDRL-ENABLED MVNOS ARCHITECTURE

A. System Model

We consider a RIC-enabled RAN architecture with a single base station (BS) owned by an InP. The BS operates on a total bandwidth B. The InP is responsible for serving a set of MVNOs M = {m_i}, i ∈ {1, 2, ..., M}, by renting to each of them a fraction of the total bandwidth B based on an SLA. Each MVNO m_i has a set of users denoted by U_i. We consider two types of users, namely eMBB users and URLLC users. For a user j, let z_j^e ∈ {0, 1} and z_j^u ∈ {0, 1} denote the binary variables representing whether j is an eMBB user (z_j^e = 1) or a URLLC user (z_j^u = 1), respectively.

In this work, we consider that the bandwidth allocation to MVNOs has already been performed by the InP. We denote the fraction of the total bandwidth B leased to MVNO m_i by B_i. An MVNO allocates to each of its users a fraction f_{i,j} ∈ [0, 1] of the leased bandwidth B_i to satisfy its QoS requirements in terms of data rate and latency. Each user u_{(i,j)} uses the allocated bandwidth to transmit a packet of size ξ_{(i,j)}. We consider that the packet size depends on the type of user, so we denote the packet size of an eMBB user and of a URLLC user by ξ_{(i,j)}^e and ξ_{(i,j)}^u, respectively. We consider the orthogonal frequency division multiple access (OFDMA) uplink scenario to reduce interference between the users.

Fig. 2: An overview of the proposed solution (the DDPG loop: actor and critic networks, OU exploration noise, a replay buffer, target networks with soft updates, and a reward based on data rate and delay).

The achievable uplink data rate of a user u_{(i,j)} ∈ U_i over its allocated bandwidth is defined as follows:

    δ_{(i,j)} = f_{i,j} B_i log2(1 + ρ_{(i,j)}),    (1)

where ρ_{(i,j)} is the signal-to-noise ratio between the user u_{(i,j)} and the BS, given by

    ρ_{(i,j)} = (P_{i,j} g_{i,j}) / (f_{i,j} B_i σ²),    (2)

where σ² is the noise power, P_{i,j} is the transmission power of the user u_{(i,j)}, and g_{i,j} is the channel gain between the user u_{(i,j)} and the BS. The transmission delay to upload a packet can be calculated as follows:

    D_{(i,j)} = ξ_{(i,j)} / δ_{(i,j)}.    (3)

B. Problem Formulation

In order to achieve efficient resource allocation for MVNOs, the problem requires minimizing the sum of the delays D_{(i,j)} experienced by the URLLC users while achieving a high sum of data rates δ_{(i,j)} for the eMBB users. Therefore, we formulate both the minimization and the maximization problems of an MVNO m_i ∈ M as a joint problem as follows:

    maximize_f  ( Σ_{j∈U_i} z_j^e δ_{(i,j)} ,  − Σ_{j∈U_i} z_j^u D_{(i,j)} )    (4a)
    subject to
        0 ≤ f_{i,j} ≤ f_max,  ∀j ∈ U_i,    (4b)
        Σ_{j∈U_i} f_{i,j} ≤ 1,    (4c)
        δ_{(i,j)} ≥ δ_i^min,  ∀j ∈ U_i with z_j^e = 1,    (4d)
        D_{(i,j)} ≤ D_i^max,  ∀j ∈ U_i with z_j^u = 1.    (4e)

Finding these fractions is subject to constraints regarding the users' requirements and the maximum capacities of the resources. Constraint (4b) ensures that the allocated bandwidth fractions are between 0 and a maximum value f_max. Constraint (4c) guarantees that the bandwidth allocated to the users does not exceed the bandwidth B_i leased from the InP. Constraint (4d) ensures that the data rate achieved by an eMBB user is greater than a minimum threshold δ_i^min. Constraint (4e) states that the delay of a URLLC user to transmit its packet should not exceed a maximum threshold D_i^max.

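To make the per-user model concrete, the short Python sketch below is our own illustration of Eqs. (1)–(3) and of the SLA checks behind constraints (4d)–(4e); the power, gain, noise, packet size, and SLA thresholds are assumed values, not taken from the paper.

    import math

    def uplink_rate(f, B_i, P, g, noise_power):
        # Eq. (2): SNR on the allocated sub-band, then Eq. (1): achievable rate in bit/s.
        snr = (P * g) / (f * B_i * noise_power)
        return f * B_i * math.log2(1.0 + snr)

    def upload_delay(packet_bits, rate):
        # Eq. (3): transmission delay of one packet.
        return packet_bits / rate

    # Illustrative (assumed) values for a single user of MVNO i.
    B_i, f = 1e6, 0.2                 # leased bandwidth (Hz), allocated fraction (<= f_max = 0.3)
    P, g, sigma2 = 0.2, 1e-6, 1e-13   # Tx power (W), channel gain, noise power (W)
    xi = 4000                         # packet size (bits)
    delta_min, D_max = 1e5, 1e-2      # assumed SLA thresholds: min rate (bit/s), max delay (s)

    rate = uplink_rate(f, B_i, P, g, sigma2)
    delay = upload_delay(xi, rate)
    print(f"rate = {rate:.0f} bit/s, delay = {delay * 1e3:.2f} ms")
    print("eMBB rate constraint (4d) satisfied:", rate >= delta_min)
    print("URLLC delay constraint (4e) satisfied:", delay <= D_max)
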
IV. FDRL BANDWIDTH ALLOCATION

In this section, we present the proposed FDRL mechanism to solve the optimization problem in Eq. (4). First, we model the bandwidth allocation problem of an MVNO as a single-agent MDP. Then, we describe the proposed FDRL mechanism by explaining the DDPG algorithm and how the latter is trained in a federated fashion.

A. MDP formulation of the bandwidth allocation

To formulate the MDP problem, we define the state space, the action space, and the reward function.

1) State space: At each time step t, each agent (i.e., MVNO) observes the environment state. The observation of each MVNO includes the types of its active users and their channel gains. The user types are necessary as they define the SLA requirements. The estimation of the channel gain of each associated user on the communication channel is necessary to make adequate bandwidth allocation decisions. The channel gains are periodically collected by each MVNO: each MVNO broadcasts pilot signals to all its users, and each user then estimates the channel state information and sends it back to its MVNO through the return channel. We denote by S_i(t) the observed state of MVNO m_i at time-slot t:

    S_i(t) = ⟨G_i(t), U_i(t)⟩,    (5)

where G_i(t) represents the channel gains between the MVNO m_i and its users U_i at time-slot t, and U_i(t) represents the set of user types of MVNO m_i. The user types are encoded using two values, w_e and w_u, which represent the priority of each type. In general, since URLLC users have stringent delay requirements, they are assigned higher priority values.

2) Action space: At each time slot, the RIC provides the necessary bandwidth fraction B_i to each MVNO, and the MVNO assigns fractions of B_i to its users. The action space of each MVNO m_i at time-slot t is given as follows:

    A_i(t) = [0, f_max],    (6)

where each action a_i ∈ A_i(t) is represented by a row vector {f_{i,j}(t), ∀u_{(i,j)} ∈ U_i}.

3) Reward function: When an MVNO m_i chooses an action a_i ∈ A_i(t) at time-slot t, it receives a reward R_i(t) in return. The objective is to minimize the delay of URLLC users and to maximize the data rate of eMBB users; therefore, the reward is expressed in terms of delay for URLLC users and in terms of data rate for eMBB users. We define a reward related to each end user's satisfaction as

    r_{(i,j)}(t) = w_e δ_{(i,j)}    if z_j^e = 1,
    r_{(i,j)}(t) = w_u / D_{(i,j)}    if z_j^u = 1.    (7)

The overall reward can be expressed as follows:

    r_i(t) = Σ_{j∈U_i} r_{(i,j)}(t)    if a_i is valid,
    r_i(t) = −0.1    otherwise,    (8)

where an action a_i is considered valid if the sum of the fractions is less than 1 and if the allocated fractions result in delays and data rates that meet the SLA values. If the action is invalid, a negative reward is returned to prevent the agent from choosing similar actions in subsequent steps.

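As a concrete illustration of Eqs. (7)–(8), the Python sketch below computes the per-user rewards and applies the validity check. It is an assumption-laden toy example, not the authors' implementation: the dictionary keys, the weights (with w_u > w_e to reflect the higher URLLC priority), and the use of normalized rates are our choices.

    def per_user_reward(is_embb, rate_mbps, delay_s, w_e=1.0, w_u=2.0):
        # Eq. (7): rate-based reward for eMBB users, inverse-delay reward for URLLC users.
        # Rates are assumed normalized (e.g., in Mbit/s) so both terms have comparable scale.
        return w_e * rate_mbps if is_embb else w_u / delay_s

    def mvno_reward(users, f_max=0.3):
        # Eq. (8): sum of per-user rewards if the action is valid, -0.1 otherwise.
        # `users`: list of dicts with keys f, is_embb, rate_mbps, delay_s, rate_min_mbps, delay_max_s.
        fractions_ok = (sum(u["f"] for u in users) <= 1.0
                        and all(0.0 <= u["f"] <= f_max for u in users))
        sla_ok = all(
            u["rate_mbps"] >= u["rate_min_mbps"] if u["is_embb"] else u["delay_s"] <= u["delay_max_s"]
            for u in users
        )
        if not (fractions_ok and sla_ok):
            return -0.1
        return sum(per_user_reward(u["is_embb"], u["rate_mbps"], u["delay_s"]) for u in users)
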
B. Federated Deep Reinforcement Learning

Having formulated the problem as an MDP, an adequate solution is reinforcement learning. In this case, each MVNO is considered as an agent interacting with an environment composed of its users by observing a state S and choosing an action a. The agent's goal is to learn an optimal policy π that maximizes the reward r.

Deep reinforcement learning (DRL) combines the power of deep neural networks with reinforcement learning to create agents that learn from high-dimensional states. Accordingly, the policy π is represented as a deep neural network [13]. DRL was first introduced through deep Q-networks (DQN) and was quickly adopted by the research community to solve many practical decision-making problems [12]. Nonetheless, DQN is off-policy and may not perform well in environments with high uncertainty, such as wireless networks. While value-based RL algorithms like Q-learning optimize the value function first and then derive optimal policies, policy-based methods directly optimize an objective function based on the rewards, which makes them suitable for large or infinite action spaces. Yet, policy-based RL might have noisy and unstable gradients [31]. As a result, we propose to use an actor-critic based algorithm [32]. In fact, actor-critic approaches combine the strong points of both value-based and policy-based RL algorithms. Furthermore, since the fraction values are continuous, we use the deep deterministic policy gradient (DDPG) [33], which concurrently learns a Q-function and a policy and takes actions from a continuous space.

1) Deep Deterministic Policy Gradient (DDPG): DDPG is an off-policy algorithm that uses four neural networks, namely the actor network µ, the critic network v, the actor target network µ′, and the critic target network v′. For a given observed environment state, the actor chooses an action, and the critic evaluates this action using the following state-action Q-function:

    Q(S_i(t), A_i(t)) = r(t) + γ max_a Q(S_i(t+1), a).    (9)

In the training process, DDPG uses the experience replay memory technique. Accordingly, the agent stores its experiences in a finite-size buffer, where each experience is defined by the tuple (S(t), A(t), r(t), S(t+1)), and then randomly samples mini-batches from these experiences to perform the learning process. This technique reduces the correlation between the training samples, which stabilizes the behavior of the DDPG algorithm. In the DDPG algorithm, the exploration policy is implemented by adding noise to the actions during training, as in Eq. (10) below. The added noise enables the DDPG agent to efficiently explore its environment. We use the Ornstein–Uhlenbeck (OU) process to generate the noise values.

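For reference, a minimal Python sketch of the OU process is shown below. The process parameters (theta, sigma, dt) are common DDPG defaults and are assumptions here, not values reported in the paper; in the experiments described later, a scaled absolute value of this noise is added to the actions.

    import numpy as np

    class OUNoise:
        # Ornstein-Uhlenbeck process used to perturb the actions during training (see Eq. (10)).
        def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
            self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
            self.rng = np.random.default_rng(seed)
            self.x = np.full(size, mu, dtype=np.float64)

        def sample(self):
            # Mean-reverting drift plus Gaussian diffusion, giving temporally correlated noise.
            dx = (self.theta * (self.mu - self.x) * self.dt
                  + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
            self.x = self.x + dx
            return self.x
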
    A(t) = µ(S(t) | θ^µ) + N(t),    (10)

where θ^µ denotes the parameters of the actor network and N(t) is the exploration noise generated by the OU process.

The actor network updates its parameters according to the deterministic policy gradient. The target Q-value is calculated using the actor target network and the critic target network as follows:

    y(t) = r(t) + γ v′(S(t+1), µ′(S(t+1) | θ^{µ′}) | θ^{v′}),    (11)

where θ^{v′} and θ^{µ′} denote the parameters of the critic target network and the actor target network, respectively.

Q-learning in DDPG is performed by minimizing the following mean squared error loss:

    L = (1/N_b) Σ_{k=1}^{N_b} ( y(k) − v(S(k), A(k) | θ^v) )²,    (12)

where N_b represents the number of sampled experiences and θ^v denotes the parameters of the critic network.

The parameters of the actor target network and of the critic target network are softly updated as follows:

    µ′ ← τ µ + (1 − τ) µ′,
    v′ ← τ v + (1 − τ) v′,    (13)

where 0 ≤ τ ≤ 1.

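To connect Eqs. (11)–(13) to code, the following is a minimal PyTorch-style sketch of one DDPG update, not the authors' implementation. The modules actor, critic (taking state and action), actor_target, critic_target, and their optimizers are assumed to exist.

    import torch
    import torch.nn.functional as F

    def ddpg_update(batch, actor, critic, actor_target, critic_target,
                    actor_opt, critic_opt, gamma=0.99, tau=0.001):
        # One update on a sampled mini-batch: Eq. (11) target, Eq. (12) critic loss,
        # the deterministic policy gradient for the actor, and Eq. (13) soft target updates.
        s, a, r, s_next = batch  # tensors; r is assumed to have shape [N, 1] like the critic output

        with torch.no_grad():
            y = r + gamma * critic_target(s_next, actor_target(s_next))   # Eq. (11)

        critic_loss = F.mse_loss(critic(s, a), y)                         # Eq. (12)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        actor_loss = -critic(s, actor(s)).mean()                          # policy-gradient objective
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        for target, source in ((actor_target, actor), (critic_target, critic)):
            for p_t, p in zip(target.parameters(), source.parameters()):  # Eq. (13)
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
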
2) Federated Deep Reinforcement Learning (FDRL): The disparity among clients, for instance in terms of geography, makes using the same model across all the covered areas inadequate. Moreover, the amount of data collected by each MVNO in certain areas (e.g., rural areas) is fairly limited. Since it is beneficial for each MVNO to enhance its bandwidth allocation model, FL creates the opportunity for multiple MVNOs to leverage data from a broader set of clients while avoiding sharing it. Each MVNO trains the global RL model locally based on its own users' interactions and uploads its locally trained model for the current round to the RIC. The RIC performs the model aggregation using a weighted sum based on each MVNO's number of users. If the parameters of the local model of MVNO i are denoted by θ_i, the parameters of the global model are given as

    θ_G = (1/C) Σ_{i∈M} C_i θ_i,    (14)

where C = Σ_{i∈M} C_i is the total number of users and C_i is the cardinality of U_i, the set of users of MVNO i.

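One possible way to realize the aggregation of Eq. (14) on PyTorch state_dicts is sketched below; this is an illustrative helper under assumed names, not the authors' code.

    import torch

    def aggregate_state_dicts(local_state_dicts, user_counts):
        # Eq. (14): user-count-weighted average of the local model parameters.
        total = float(sum(user_counts))
        return {
            name: sum((c / total) * sd[name].float()
                      for sd, c in zip(local_state_dicts, user_counts))
            for name in local_state_dicts[0]
        }

    # Usage sketch (names are assumptions):
    # global_model.load_state_dict(
    #     aggregate_state_dicts([m.state_dict() for m in local_models], user_counts=[5, 4, 3]))
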
Algorithm 1 describes the proposed FDRL approach. First, the actor and critic networks and the target actor and target critic networks are initialized in a centralized manner. As each MVNO might be serving a different number of users at any given time, and since its observation is the concatenation of an array representing the channel gain values and an array representing the user types, we set the input size to 2 × C_max, where C_max is the maximum number of users that can be served by an MVNO at once, and we use zero-padding in case the observed number of users is less than C_max. To illustrate, consider an MVNO i with C_i = 3 users and C_max = 5. The observation is S_i(t) = [g_{i,1}(t), g_{i,2}(t), g_{i,3}(t), 0, 0, u_{i,1}(t), u_{i,2}(t), u_{i,3}(t), 0, 0]. This allows us both to adapt to the varying number of users of each MVNO and to unify the trained model. Similarly, the output size is C_max. Furthermore, in order to avoid the case where a fraction of the bandwidth is allocated to a user that does not exist, we associate such an action with a punishment equal to −0.1 that we add to the reward.

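The padded observation described above can be built, for example, as in the following sketch; the numerical values are illustrative and the layout follows the example given in the text.

    import numpy as np

    def build_observation(channel_gains, user_types, c_max=5):
        # Concatenate zero-padded channel gains and user-type priorities into a vector of length 2*c_max.
        gains = np.zeros(c_max, dtype=np.float32)
        types = np.zeros(c_max, dtype=np.float32)
        n = len(channel_gains)
        gains[:n] = channel_gains
        types[:n] = user_types
        return np.concatenate([gains, types])

    # Example matching the text: an MVNO with C_i = 3 users and C_max = 5 gives a vector of length 10.
    obs = build_observation([0.8, 0.5, 0.9], [2.0, 1.0, 2.0])
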
At each communication round, each MVNO then trains the DDPG model locally. To do so, each MVNO initializes its replay memory buffer and starts the learning process. Over a number of episodes, the MVNO resets its environment, obtains an initial observation, and initializes the action using the OU exploration noise. Then, for a number of steps, the MVNO selects an action a_i, evaluates it, computes the received reward r_i, and moves to the next observation. Each transition from a state s_t to s_{t+1} is stored in the replay buffer. Once a predefined number of transitions has been stored, the MVNO samples random mini-batches from the replay buffer. The critic network Q_i is updated by minimizing the loss, and the actor network µ_i is updated through the policy gradient. Subsequently, the target networks Q_i′ and µ_i′ are updated as well. At the end of its episodes, each MVNO sends its locally updated model to the RIC for aggregation. The RIC collects all the local updates from the MVNOs and generates the global model using the weighted sum θ_G defined in Eq. (14).

Algorithm 1: FDRL Algorithm
 1  Initialize actor and target actor networks;
 2  Initialize critic and target critic networks;
 3  Initialize the environment;
 4  for r ∈ rounds do
 5      for i ∈ M do
 6          Initialize the replay buffer;
 7          for e ∈ episodes do
 8              Reset the environment;
 9              Receive the initial observation;
10              Initialize the action according to the exploration noise;
11              for t ∈ steps do
12                  select action a_i according to the current policy;
13                  evaluate a_i;
14                  compute reward r_i;
15                  observe next state s_{t+1};
16                  store the transition in the replay buffer;
17                  sample random batches from the replay buffer;
18                  update critic Q_i by minimizing the loss;
19                  update actor µ_i using the policy gradient;
20                  update target networks Q_i′ and µ_i′;
21              end for
22          end for
23          send the local update to the RIC;
24      end for
25      aggregate the models using the weighted sum in Eq. (14);
26      send the updated global model to M;
27  end for

V. NUMERICAL RESULTS

This section investigates the performance of the proposed FDRL mechanism under different scenarios. We first introduce the experiments' parameters, then we present and discuss the results.

A. Experiment parameters and scenarios

We consider a RIC-enabled RAN architecture with a single base station. The simulated users are randomly scattered in an area of 500 m × 500 m around the BS and are served by 3 MVNOs. Table I summarizes the different wireless network parameters.

TABLE I: Simulation parameters
    Coverage area: 500 m × 500 m
    Number of MVNOs: 3
    Total number of users: [12, 15]
    Bandwidth: 3 MHz
    f_max: 0.3
    Maximum number of users per MVNO: 5

The MVNOs collectively train a DDPG model. We create and train the model using the PyTorch framework. The four networks of the model each have two hidden fully connected layers with 400 and 300 neurons, respectively. Since the maximum number of users is 5, the size of the input layer is 10 and that of the output layer is 5. We used the rectified linear unit (ReLU) as the activation function, since it helps avoid vanishing gradients in backpropagation, especially as the action space is limited to values smaller than f_max = 0.3. We used the Adam optimizer with two different learning rates for the actor and the critic. Exploration is ensured through the use of a fraction, equal to 1/10, of the absolute value of the OU noise. Table II summarizes the training hyperparameters of the DDPG.

TABLE II: DDPG parameters
    Random seed: 0
    Learning rate: 0.0001 (actor), 0.001 (critic)
    Batch size: 128
    Discount factor: 0.99
    Loss function: mean squared error
    Activation function: ReLU
    Optimizer: Adam

As for the FDRL mechanism, the training takes place over a total of 5 communication rounds. In each round, the model is trained by each MVNO for 500 episodes before being sent to the RIC for aggregation. Each episode is composed of 50 steps, where the channel gain values are reset at each step and the users' locations are reset every 25 episodes.

In order to generate non-i.i.d. distributions of user requirements, we set different probabilities of URLLC and eMBB users for each MVNO. The probabilities of URLLC users are 25%, 50%, and 75% for MVNOs 1, 2, and 3, respectively.

To further test our proposed solution, we also generated an unequal distribution of the users. Specifically, we considered a case where MVNOs 1, 2, and 3 had 5, 4, and 3 users, respectively. In this case, the fraction of the bandwidth allocated to each MVNO is proportional to its number of users.

In the following, we study two scenarios: non-i.i.d. with equal numbers of users, and non-i.i.d. with unequal numbers of users. To evaluate the performance of FDRL and its benefits, we compare it with the case where each MVNO trains and uses a local model without collaborating with the other MVNOs. These parameters are summarized in Table III.

TABLE III: FDRL parameters
    Random seed: 0
    Communication rounds: 5
    Local episodes per round: 500
    Steps per episode: 50
    Reset step: 25

B. FDRL training results

The first considered scenario is non-i.i.d. with equal numbers of users. The total number of users is 15, with 5 users served by each MVNO. Fig. 3 shows the evolution of the average reward of the local models and of the global model across 5 experiments. While the global model improves through the shared experience, even surpassing the local models' average in later rounds, the local models show degrading performance throughout the training. In fact, since the exploration induced by the OU noise is reduced in later rounds, the local models allocate less bandwidth to the users, which degrades the values of the received rewards. The global model, on the other hand, is slower to learn to generalize, but achieves more robust training overall by leveraging the shared experience.

The second considered scenario is non-i.i.d. with unequal numbers of users. The total number of users is 12, where 5, 4, and 3 users are served by the first, second, and third MVNO, respectively. Fig. 4 shows the evolution of the average reward of the local models and of the global model across 5 experiments. Our first observation is that the cumulative rewards of both models are lower than what was achieved in the case of equal numbers of users. This is mainly due to the punishment related to allocating bandwidth to non-existent users. Furthermore, similarly to the previous experiments, the global model improves slowly throughout the communication rounds, while the local models do not improve.

C. FDRL performance evaluation

To evaluate the performance of the proposed FDRL mechanism, we compared the number of invalid actions of the global model against those of each MVNO's local model. Note that an action is considered invalid if it does not meet the user's SLA requirements.

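A rough sketch of this evaluation protocol is given below; it is our own illustration under an assumed environment and model interface (env.reset, env.step returning per-user SLA flags, model.act), not the authors' evaluation code.

    def count_sla_violations(model, env, n_obs=20000):
        # Tally actions whose resulting rate/delay violate a user's SLA over n_obs test observations.
        violations = 0
        state = env.reset()
        for _ in range(n_obs):
            action = model.act(state)              # bandwidth fractions for the observed users
            state, per_user = env.step(action)     # per_user: list of dicts with an "sla_met" flag
            violations += sum(1 for u in per_user if not u["sla_met"])
        return violations
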
Fig. 3: Non-i.i.d. and equal user distributions (cumulative reward vs. episodes for the global and local models).

Fig. 4: Non-i.i.d. and unequal user distributions (cumulative reward vs. episodes for the global and local models).

We use the resulting local models and global model and test them in different environments, first by varying the underlying user type distributions of each MVNO, and then by varying the number of users served by each MVNO.

1) Varying user type distributions: The first considered scenario is non-i.i.d. with equal numbers of users. The models are first trained with a total of 15 users, where 5 users are served by each MVNO. In order to evaluate the robustness of the models under changing user requirements, we varied the underlying distributions of the users of each MVNO. The probabilities of URLLC users in the trained models are 25%, 50%, and 75% for the first, second, and third MVNO, respectively.

In a first experiment, we changed the URLLC probabilities in the test phase to 75%, 25%, and 50% for the first, second, and third MVNOs, respectively. In a second experiment, we changed these probabilities to 50%, 75%, and 25%. Fig. 5 shows the cumulative number of times the users' SLA requirements were not satisfied by the local models of the MVNOs and by the global model, while observing the same environments for a total of 20000 observations. We noticed that, overall, the global model's actions are less prone to violating the SLA requirements of both eMBB and URLLC users compared to the individually trained models. Additionally, as we attributed larger weights to the URLLC users, the global model prioritizes this type of users and is less likely to violate their required delay.

2) Varying number of users: The second considered scenario is non-i.i.d. with unequal numbers of users. The models are first trained with a total of 12 users, where 5, 4, and 3 users are served by the first, second, and third MVNO, respectively. In other words, we seek to evaluate the robustness of the models under a changing number of users. In a first experiment, we changed the numbers of users at test time to 4, 3, and 5 for the first, second, and third MVNOs, respectively. In a second experiment, we changed these numbers to 3, 5, and 4. Fig. 6 shows the number of times the users' SLAs were not satisfied by the local models of the MVNOs and by the global model, while observing the same environments for a total of 20000 observations.

Similarly to the previous experiments, the global model's actions are less likely to violate the SLA requirements of both eMBB and URLLC users compared to the individually trained models. Moreover, the third MVNO, having been trained with mostly URLLC users, has a high satisfaction rate for this type of user but performs poorly for the eMBB users. In general, the enhancement in QoS for both types of users provided by the global model makes it worthwhile for the MVNOs to collaborate.

VI. CONCLUSION

In this article, we investigated the resource allocation of RAN slicing in multi-MVNO scenarios. More specifically, we explored the usage of federated learning as a means to build robust slicing models in varying wireless communication environments. Accordingly, we proposed a federated deep reinforcement learning mechanism to collaboratively train a deep reinforcement learning model for bandwidth allocation. We considered a scenario with two different types of slices, namely URLLC and eMBB slices. We formulated the problem as a single mobile virtual network operator's MDP, where the agent aims to allocate radio resources to different types of users (URLLC and eMBB). We proposed an actor-critic algorithm, which combines the advantages of both value-based and policy-based reinforcement learning algorithms. Furthermore, as the values of the bandwidth fractions are continuous, we used the deep deterministic policy gradient, which learns a Q-function and a policy simultaneously and takes actions in a continuous space. As MVNOs are competing entities, sharing data to obtain diverse training datasets is not viable. Instead, we leveraged FL to overcome these challenges and designed an FDRL mechanism on an O-RAN architecture to collaboratively improve the radio resource allocation operation of different MVNOs. The efficiency of the proposed FDRL approach was demonstrated under different simulation scenarios with non-i.i.d. and unequal distributions of the users. The experiments have shown that the model trained using FDRL is more robust against environment changes compared to models trained separately by each MVNO.

Fig. 5: Evaluation under varying user distributions: cumulative number of unsatisfied SLAs for the local and global models. (a) URLLC (75%, 25%, 50%); (b) eMBB (75%, 25%, 50%); (c) URLLC (50%, 75%, 25%); (d) eMBB (50%, 75%, 25%).

REFERENCES

[1] O. Sallent, J. Perez-Romero, R. Ferrus, and R. Agusti, "On radio access network slicing from a radio resource management perspective," IEEE Wireless Communications, vol. 24, no. 5, pp. 166–174, 2017.
[2] E. J. Oughton and Z. Frias, "The cost, coverage and rollout implications of 5G infrastructure in Britain," Telecommunications Policy, vol. 42, no. 8, pp. 636–652, 2018.
[3] A. Filali, A. Abouaomar, S. Cherkaoui, A. Kobbane, and M. Guizani, "Multi-access edge computing: A survey," IEEE Access, vol. 8, pp. 197017–197046, 2020.
[4] Z. Mlika and S. Cherkaoui, "Network slicing with MEC and deep reinforcement learning for the Internet of Vehicles," IEEE Network, vol. 35, no. 3, pp. 132–138, 2021.
[5] X. Foukas, M. K. Marina, and K. Kontovasilis, "Orion: RAN slicing for a flexible and cost-effective multi-service mobile network architecture," in Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, pp. 127–140, 2017.
[6] C. Liang and F. R. Yu, "Wireless network virtualization: A survey, some research issues and challenges," IEEE Communications Surveys & Tutorials, vol. 17, no. 1, pp. 358–380, 2014.
[7] A. Rago, S. Martiradonna, G. Piro, A. Abrardo, and G. Boggia, "A tenant-driven slicing enforcement scheme based on pervasive intelligence in the radio access network," Available at SSRN 4022195, 2022.
[8] H. Song, J. Bai, Y. Yi, J. Wu, and L. Liu, "Artificial intelligence enabled Internet of Things: Network architecture and spectrum access," IEEE Computational Intelligence Magazine, vol. 15, no. 1, pp. 44–51, 2020.
[9] H. Song, L. Liu, J. Ashdown, and Y. Yi, "A deep reinforcement learning framework for spectrum management in dynamic spectrum access," IEEE Internet of Things Journal, vol. 8, no. 14, pp. 11208–11218, 2021.
[10] M. R. Raza, C. Natalino, P. Öhlen, L. Wosinska, and P. Monti, "Reinforcement learning for slicing in a 5G flexible RAN," Journal of Lightwave Technology, vol. 37, no. 20, pp. 5161–5169, 2019.
[11] C. Ssengonzi, O. P. Kogeda, and T. O. Olwal, "A survey of deep reinforcement learning application in 5G and beyond network slicing and virtualization," Array, p. 100142, 2022.
[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[13] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.
[14] A. Filali et al., "Communication and computation O-RAN resource slicing for URLLC services using deep reinforcement learning," arXiv preprint arXiv:2202.06439, 2022.
[15] A. Abouaomar et al., "Resource provisioning in edge computing for latency-sensitive applications," IEEE Internet of Things Journal, vol. 8, no. 14, pp. 11088–11099, 2021.
[16] A. Thantharate, R. Paropkari, V. Walunj, and C. Beard, "DeepSlice: A deep learning approach towards an efficient and reliable network slicing in 5G networks," in 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), pp. 0762–0767, IEEE, 2019.
[17] G. Chen, X. Zhang, F. Shen, and Q. Zeng, "Two tier slicing resource allocation algorithm based on deep reinforcement learning and joint bidding in wireless access networks," Sensors, vol. 22, no. 9, p. 3495, 2022.
[18] A. Filali, Z. Mlika, et al., "Dynamic SDN-based radio access network slicing with deep reinforcement learning for URLLC and eMBB services," IEEE Transactions on Network Science and Engineering, pp. 1–1, 2022.
[19] J. Hu, Z. Zheng, B. Di, and L. Song, "Multi-layer radio network slicing for heterogeneous communication systems," IEEE Transactions on Network Science and Engineering, vol. 7, no. 4, pp. 2378–2391, 2020.

Fig. 6: Evaluation under different numbers of users: number of unsatisfied SLAs for the local and global models. (a) URLLC (4, 3, 5); (b) eMBB (4, 3, 5); (c) URLLC (3, 5, 4); (d) eMBB (3, 5, 4).

[20] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," 2016.
[21] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
[22] A. Taïk et al., "Data-aware device scheduling for federated edge learning," IEEE Transactions on Cognitive Communications and Networking, vol. 8, no. 1, pp. 408–421, 2022.
[23] A. Abouaomar, S. Cherkaoui, Z. Mlika, and A. Kobbane, "Mean-field game and reinforcement learning MEC resource provisioning for SFC," in 2021 IEEE Global Communications Conference (GLOBECOM), pp. 1–6, 2021.
[24] O-RAN Alliance, "O-RAN: Towards an open and smart RAN," White Paper, Oct. 2018.
[25] I. Chih-Lin, S. Kukliński, and T. Chen, "A perspective of O-RAN integration with MEC, SON, and network slicing in the 5G era," IEEE Network, vol. 34, no. 6, pp. 3–4, 2020.
[26] D. Johnson, D. Maas, and J. Van Der Merwe, "NexRAN: Closed-loop RAN slicing in POWDER: a top-to-bottom open-source open-RAN use case," in Proceedings of the 15th ACM Workshop on Wireless Network Testbeds, Experimental Evaluation & Characterization, pp. 17–23, 2022.
[27] S. O. Oladejo and O. E. Falowo, "Latency-aware dynamic resource allocation scheme for multi-tier 5G network: A network slicing-multitenancy scenario," IEEE Access, vol. 8, pp. 74834–74852, 2020.
[28] B. Han and H. D. Schotten, "Machine learning for network slicing resource management: A comprehensive survey," arXiv preprint arXiv:2001.07974, 2020.
[29] A. Abouaomar, Z. Mlika, A. Filali, S. Cherkaoui, and A. Kobbane, "A deep reinforcement learning approach for service migration in MEC-enabled vehicular networks," in 2021 IEEE 46th Conference on Local Computer Networks (LCN), pp. 273–280, 2021.
[30] W. Wu, N. Chen, C. Zhou, M. Li, X. Shen, W. Zhuang, and X. Li, "Dynamic RAN slicing for service-oriented vehicular networks via constrained learning," IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp. 2076–2089, 2020.
[31] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, "Bridging the gap between value and policy based reinforcement learning," Advances in Neural Information Processing Systems, vol. 30, 2017.
[32] V. Konda and J. Tsitsiklis, "Actor-critic algorithms," Advances in Neural Information Processing Systems, vol. 12, 1999.
[33] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
