Scaling UPF Instances in 5G/6G Core With Deep Reinforcement Learning
Nguyen, Do and Rotter
IEEE Access, DOI 10.1109/ACCESS.2021.3135315
ABSTRACT In the 5G core and the upcoming 6G core, the User Plane Function (UPF) is responsible
for transporting data to and from subscribers in Protocol Data Unit (PDU) sessions. The UPF is generally
implemented in software and packed into either a virtual machine or a container that can be launched as a
UPF instance with a specific resource requirement in a cluster. To save the resources consumed by UPF
instances, the number of initiated UPF instances should depend on the number of PDU sessions required
by customers, and this number is usually controlled by a scaling algorithm.
In this paper, we investigate the application of Deep Reinforcement Learning (DRL) for scaling UPF
instances that are packed into containers of the Kubernetes container-orchestration framework. We propose
an approach that formulates a threshold-based reward function and adapts the proximal policy optimization
(PPO) algorithm. We also apply a support vector machine (SVM) classifier to cope with the problem of the
agent occasionally suggesting unwanted actions due to its stochastic policy. Extensive numerical results
show that our approach outperforms Kubernetes's built-in Horizontal Pod Autoscaler (HPA): DRL saves
2.7-3.8% of the average number of Pods, while the SVM-based approach achieves 0.7-4.5% savings
compared to the HPA.
INDEX TERMS 5G, 6G, core, PDU session, UPF, Deep Reinforcement Learning, Kubernetes, Proximal
Policy Optimization
FIGURE 2: An example cluster with 4 nodes, hosting a total of don = 13 UPF Pods and handling a total of lsess = 27 PDU sessions. Each Pod may handle multiple PDU sessions. It may also happen that a Pod does not handle any sessions and remains idle in its node.
a practical approach to limit the number of PDU session types. For each PDU session type, a UPF instance type is created with an identical resource requirement.

C. THE PROBLEM OF SCALING UPF PODS
The purpose of scaling UPF Pods is to reduce the resource consumption of the system. A scaling function changes the number of UPF Pods (starting new Pods or terminating existing ones) depending on the number of PDU sessions required by UEs. On the one hand, if the number of UPF Pods is too low, the QoS degrades since there are not enough UPF Pods to handle new incoming PDU sessions. On the other hand, if the number of UPF instances is too high and the load is low, the large amount of reserved resources increases the operation cost. Therefore, a trade-off between the QoS and the operation cost has to be achieved.

For each type of PDU session, we assume that at least Dmin Pods are initiated, at most Dmax Pods can be started, each Pod can simultaneously handle at most Lsess sessions, each Pod takes tpend time to boot, and termination is instantaneous. Let don(t) denote the number of running Pods in the system at time t. Therefore, Dmin ≤ don(t) ≤ Dmax holds and the limit on the number of sessions in the system is Dmax Lsess. Let lsess(t) denote the number of sessions in the system at time t. Then 0 ≤ lsess(t) ≤ Dmax Lsess. Additionally, let us define a free slot as the available capacity for one session and denote the number of free slots at time t by lfree(t). Obviously, lfree(t) + lsess(t) = Dmax Lsess.

A PDU session can only be created if there is free capacity in the cluster, that is, lfree(t) > 0. In this case new PDU sessions are assigned to the appropriate UPF Pods by a load balancer. If lfree = 0 and there is no capacity left, the session and the UE's request are blocked. We denote the blocking rate, the probability of blocking a request, by pb. The list of basic notations is summarized in Table 1.

D. K8S HPA
The Kubernetes autoscaler is responsible for the scaling functionality. Figure 3 shows the interactions between the autoscaler and other components. A metrics server monitors the resource usage of Pods and provides the autoscaling entity with statistics through the Metrics API. The autoscaler computes the necessary number of Pod replicas and may decide on a scaling action. The adjustment of the replica count is done through the control interface.

FIGURE 3: Autoscaler control loop. The metrics server collects statistics from the Pods, which are then sent to the autoscaler as an observation. The autoscaler may make a decision to scale and sends the number of replicas through the Scale interface.
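To make the session and free-slot bookkeeping above concrete, here is a minimal Python sketch (names are ours and purely illustrative, not part of the paper's simulator): it admits a PDU session only while a free slot exists and tracks the empirical blocking rate pb.

class ClusterModel:
    """Toy bookkeeping for UPF capacity: at most D_max Pods, L_sess sessions per Pod."""

    def __init__(self, d_max: int, l_sess_per_pod: int):
        self.capacity = d_max * l_sess_per_pod  # total session slots, D_max * L_sess
        self.l_sess = 0                         # sessions currently in the system
        self.requests = 0                       # PDU session requests seen so far
        self.blocked = 0                        # requests rejected because l_free == 0

    @property
    def l_free(self) -> int:
        # Free slots: l_free(t) + l_sess(t) = D_max * L_sess
        return self.capacity - self.l_sess

    def arrival(self) -> bool:
        """A UE requests a new PDU session; admit it only if a free slot exists."""
        self.requests += 1
        if self.l_free > 0:
            self.l_sess += 1
            return True
        self.blocked += 1                       # no capacity left: the request is blocked
        return False

    def departure(self) -> None:
        """A PDU session ends and releases its slot."""
        self.l_sess = max(0, self.l_sess - 1)

    def blocking_rate(self) -> float:
        """Empirical estimate of p_b, the probability of blocking a request."""
        return self.blocked / self.requests if self.requests else 0.0

The load balancer's choice of which running Pod receives an admitted session is omitted here; only the aggregate slot count matters for the scaling problem.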
The Horizontal Pod Autoscaler (HPA) is Kubernetes's default scaling algorithm. It uses the average CPU utilization, denoted by ρ̄, as an observation to compute the necessary number of Pods, denoted by ddesired(t) at time t. It has two configurable parameters: the target CPU utilization ρtarget and the tolerance ν. The equation used by the HPA is

ddesired(t) = don(t) · ρ̄(t) / ρtarget,   (1)

where don(t) is the number of Pods at time t. The HPA then checks whether ddesired(t)/don(t) ∈ [1 − ν, 1 + ν]. If it is not, the HPA issues a scaling action to bring the replica count closer to the desired value. The above described procedure is executed periodically with a ∆T interval. This time interval can be set through Kubernetes configurations.

IV. SCALING UPF PODS WITH DRL
The application of the built-in Kubernetes HPA needs appropriate values of ρtarget and ν. A system operator may have to go through an arduous process of trial and error to find the configuration that minimizes the Pod count while maintaining QoS levels. Instead, in this paper, we propose the application of Deep Reinforcement Learning (DRL) to set the Pod count dynamically depending on the traffic, without the assistance of an operator. The DRL agent observes the system and determines the correct action output through the continuous improvement of its policy. In what follows, we present our approach regarding the design of the DRL agent.

A. FORMULATION OF THE MARKOV DECISION PROBLEM
Before applying a reinforcement learning algorithm we need to formulate the problem as a Markov decision problem (MDP). This means we need to define the state space S, the action space A, and the reward function r : S × A × S → R. A complete definition of the MDP would also require the state transition probability p : S × A × S → [0, 1] and the discounting factor γ ∈ [0, 1]. Here p is the probability that the system enters a next state when an action happens in the current state, and such a transition results in a real-valued reward r. To avoid the specification of the p transition function, as in a model-based formulation (like in [22]), we decided to use a model-free RL method. Also, γ is implicitly contained in other hyperparameters, as we will see later.

In the MDP framework an agent interacts with the environment described by the MDP. At the decision time t it observes the state s(t) ∈ S and, following its policy π : S → A, it makes an action a(t) ∈ A. As a result, the agent receives a reward r(t) and at the next decision time it can observe the next state.

Let us denote the i-th decision time by Ti (i = 0, 1, . . .). In our case the time between two decisions is ∆T, that is, Ti+1 − Ti = ∆T (i = 0, 1, . . .). Furthermore, we will also denote the state, the action, and the reward at time Ti with a lower index i (e.g. s(Ti) = si).

The state si at time Ti should contain all the information necessary for an optimal scaling decision. In our case

si = (don(Ti), dboot(Ti), lsess(Ti), lfree(Ti), λ̂(Ti)),

where λ̂ is the measured arrival rate since the previous decision at time Ti−1.

The action space consists of three actions: start a new Pod; terminate an existing Pod; no action. The agent may only start new Pods if there is capacity for it in the cluster, that is, don(t) + dboot(t) < Dmax, where dboot(t) is the number of Pods still booting at time t. These booting Pods exist because when we start a new Pod, it enters a pending phase while it starts up its necessary containers. We assume this phase lasts tpend time. Also, the agent may only terminate Pods if don(t) > Dmin. We assume this termination is graceful, which means that the Pod waits for all of its PDU sessions to close before shutting down. Obviously, in this case the Pod is scheduled for termination and does not accept new PDU sessions.

The reward function is shown in (2):

ri = −κ p̂b,i     if p̂b,i > pb,th
ri = −don(Ti)    if p̂b,i ≤ pb,th     (2)

Here p̂b,i is the measured blocking rate since the previous decision, i.e. in the time interval [Ti−1, Ti), and pb,th is the blocking rate threshold set by the QoS level that we should not exceed in the long term. The coefficient κ is a scalar that scales the blocking rate to numerically put it in range with the don value. The intuition behind this reward function is that if the measured blocking rate exceeds the threshold, we need to minimize the blocking rate; and if it does not exceed the threshold, we want to minimize the number of Pods.

For the list of notations used to describe the MDP see Table 1.

B. REINFORCEMENT LEARNING
Reinforcement learning (RL) is a method that applies an agent to interact with an environment. The agent observes the system states and rewards as results of subsequent actions. To apply an RL-based agent in the control loop illustrated in Figure 3, we propose an approach where a specific state contains the number of active and booting UPF Pods, the number of PDU sessions in the system, and an approximation of the arrival rate. This information about the states can be obtained from monitoring the SMF and AMF functions of the 5G core.

The RL agent uses the observations gathered between two scaling actions to update and improve its policy. This means that learning happens online during the operation of the cluster. Also, the neural network in the RL agent can be pre-trained with the use of captured data and simulation as well.
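For illustration only (not the authors' implementation), the HPA rule (1) with its tolerance check and the threshold-based reward (2) can be written as the following Python sketch; the default values pb,th = 0.01 and κ = 13 are the ones reported later in Section IV-C, and the rounding of the desired replica count is our own choice.

import math

def hpa_decision(d_on: int, rho_mean: float, rho_target: float, nu: float) -> int:
    """Desired replica count according to (1). A scaling action is issued only if
    d_desired / d_on falls outside the tolerance band [1 - nu, 1 + nu]."""
    d_desired = d_on * rho_mean / rho_target
    if 1.0 - nu <= d_desired / d_on <= 1.0 + nu:
        return d_on                      # within tolerance: keep the current replica count
    return math.ceil(d_desired)          # rounding to an integer replica count (our assumption)

def reward(p_b_hat: float, d_on: int, p_b_th: float = 0.01, kappa: float = 13.0) -> float:
    """Threshold-based reward (2): penalize blocking above the threshold,
    otherwise penalize the number of running Pods."""
    if p_b_hat > p_b_th:
        return -kappa * p_b_hat
    return -float(d_on)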
The goal of RL is to find the policy π that maximizes the value function Vπ(s), the long-term expected cumulated reward (3) starting from the state s. Note that the optimal policy does not depend on the starting state.

Vπ(s) = Eπ[ ∑_{k=0}^{∞} γ^k r_{i+k} | s_i = s ]   (3)

In this paper we used proximal policy optimization (PPO) [38] as the RL algorithm with slight modifications, similar to our previous work [39]. The method is presented in Algorithms 1 and 2.

The PPO is an actor-critic algorithm [40]. It uses a parameterized policy π(s, θ) as an actor to select actions, where θ is the parameter vector. The algorithm also approximates the value function with V(s, ω), parameterized with the ω vector. This value function is used to calculate the advantage estimate ÂGAE using the generalized advantage estimator (GAE) [41].

The vector rt(θ) is the probability ratio and its j-th element can be computed by

rt(θ)_j = π(a_j | s_j, θ) / π_old,j,   (9)

where π(a_j | s_j, θ) and π_old,j are the probabilities of action a_j in state s_j. Note that the difference between the two probabilities is that the former depends on θ, which can change throughout the epochs during an update (as seen in Algorithm 1), whereas π_old,j, which is stored in the batch, represents the probability of action a_j when it was executed by the agent. This means that at the start of the update π(a_j | s_j, θ) = π_old,j, but after the first epoch θ is changed by (8) and the equality does not hold anymore. The operation applied here is the elementwise product. We also added entropy regularization

H(π(·|s, θ)) = − ∑_{a′∈A} π(a′|s, θ) log π(a′|s, θ)   (10)
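A rough PyTorch rendering of the update objective sketched above, combining the probability ratio (9), the PPO clipping, and the entropy bonus (10), could look as follows. This is a generic sketch under our own naming, not the authors' code; the clipping parameter ε and entropy coefficient ξ default to the values in Table 2.

import torch

def ppo_policy_loss(new_probs, old_probs, advantages, action_dists,
                    eps_clip: float = 0.1, xi: float = 0.01) -> torch.Tensor:
    """Clipped PPO surrogate with entropy regularization.

    new_probs:    pi(a_j | s_j, theta) for the batch under the current parameters
    old_probs:    pi_old,j stored when the actions were executed
    advantages:   GAE advantage estimates for the batch
    action_dists: full distributions pi(. | s_j, theta), shape (N, |A|)
    """
    ratio = new_probs / old_probs                          # r_t(theta), eq. (9)
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
    # elementwise products of ratios and advantages; take the pessimistic bound
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # entropy of the categorical policy, eq. (10); a small constant avoids log(0)
    entropy = -(action_dists * torch.log(action_dists + 1e-8)).sum(dim=1)
    # maximize surrogate + xi * entropy, i.e. minimize its negative
    return -(surrogate + xi * entropy).mean()

The loss would then be minimized with stochastic gradient descent for k epochs over each stored batch, as described in Algorithm 1.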
Algorithm 2 RL training loop
1: Initialize system, and get initial state s0.
2: Initialize learning parameters of AGENT.
3: i ← 0
4: for Ntrain steps do
5:    Get action from agent: ai ← Sample π(si, θ).
6:    Execute action ai to scale the cluster.
7:    Observe the new state si+1 and performance measures after ∆T time.
8:    Compute reward ri from the measurements using (2).
9:    AGENT.STORE(si, ai, π(·|si, θ), ri, si+1)
10:   AGENT.UPDATE()
11:   i ← i + 1
12: end for

In each iteration of the training loop, the agent samples an action based on the observed state si. We observe a new state and then store the observations using the Store procedure, then improve the agent's policy with the Update procedure.

C. NEURAL NETWORK APPROXIMATION
The 5-tuple that represents the state si of the system creates a 5-dimensional state space. Even though in practice the values in the state are directly or indirectly bounded by the maximum number of Pods Dmax, and the arrival rate is also bounded by the maximum arrival rate λ̂max, the state space can grow so large that it would be impossible to fit the policy or the value function in a computer's memory. Therefore we used neural networks with one hidden layer of 50 hidden nodes to approximate the policy π and the value function V. This means that θ and ω represent the parameter sets of these neural networks. Figure 4 shows a neural network that accepts the state as the input and outputs the probabilities (πNoOp, πScaleOut, πScaleIn) of the possible actions. In the hidden layer, the rectified linear unit (ReLU) function is applied. For the policy π, we used the softmax function in the output layer. The parameters θ and ω were initialized with the Xavier initialization. For the update steps, we used the stochastic gradient descent method.

FIGURE 4: The neural network for the policy πθ. The network receives the state as an input and outputs the probabilities of each action for that given state. Connections represent the weights in θ. Nodes in the intermediate hidden layer represent the application of a non-linear activation function.

For numerical stability we normalized most of the input values into the range [0, 1]. This means that we divided the don and dboot values by Dmax and also divided the lsess and lfree values by Dmax Lsess. As for λ̂, we do not have a maximum value for the arrival rate. Luckily, for this normalization process we do not need to know this exact number; we only need the order of magnitude of the normalized λ̂ to be close to the order of magnitude of the other input values. We chose to divide λ̂ by 500, assuming the maximum arrival rate is close to this value.

To find the best DRL agent we conducted a hyperparameter search during training. We identified the reward multiplier κ and the entropy regularization factor ξ as the hyperparameters the DRL agent was most sensitive to. We used grid search for κ ∈ {3, 5, 10, 13, 15, 20} and ξ ∈ {0.01, 0.05} to find adequate hyperparameter values. We found the other hyperparameters to have less influence on the overall performance of the DRL agent. In these cases we used values that are often used in the literature, such as [38]. For the entropy parameter we chose ξ = 0.01 and for the reward multiplier we chose κ = 13. Note that since we use GAE to estimate advantages, the discount factor γ is implicitly incorporated into the λGAE hyperparameter. Table 2 presents the list of hyperparameter values used for training the DRL agent.

TABLE 2: DRL hyperparameters.
neural network hidden layers: 1
hidden layer node count: 50
learning rate (αθ, αω): 0.0001
epochs (k): 5
batch size (Nbatch): 32
reward averaging factor (αR): 0.1
PPO clipping parameter (ε): 0.1
GAE parameter (λGAE): 0.9
reward multiplier (κ): 3, 5, 10, 13, 15, 20
entropy regularization factor (ξ): 0.01, 0.05
random initialization: Xavier uniform
activation function: ReLU

We used PyTorch 1.5.1 [42] to implement the DRL model and used an NVIDIA GeForce RTX 2070 (8GB) GPU for training.

D. DRL WITH CLASSIFICATION
It is possible to use RL with non-stochastic policies that enforce actions with probability equal to one for a specific observation. For example, the application of deep Q-learning, also known as deep Q-networks (DQNs) [40], may result in a greedy policy, which is demonstrated in Section V. Moreover, we find that DQN leads to the over-provisioning of resources in our numerical study.

The PPO method learns and finds a stochastic policy where the action space has a probability distribution for a given state. The DRL agent takes actions based on the learned distribution. In general, it is expected that the agent recommends the launch of new Pods when don is low and the arrival rate is high.
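The stochastic policy discussed here is the softmax output of the network in Figure 4. A direct PyTorch rendering (five normalized inputs, one hidden layer of 50 ReLU units, softmax over the three actions, Xavier-uniform initialization) could be sketched as follows; the class name is ours.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """pi_theta: state (d_on, d_boot, l_sess, l_free, lambda_hat) -> action probabilities."""

    def __init__(self, state_dim: int = 5, hidden: int = 50, n_actions: int = 3):
        super().__init__()
        self.hidden = nn.Linear(state_dim, hidden)
        self.out = nn.Linear(hidden, n_actions)
        nn.init.xavier_uniform_(self.hidden.weight)   # Xavier uniform init, as in Table 2
        nn.init.xavier_uniform_(self.out.weight)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.hidden(state))            # ReLU in the hidden layer
        return torch.softmax(self.out(x), dim=-1)     # (pi_NoOp, pi_ScaleOut, pi_ScaleIn)

An action is then sampled from the resulting categorical distribution, e.g. with torch.distributions.Categorical(probs).sample(); the value network V(s, ω) would analogously end in a single linear output (our assumption).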
If the algorithm could decrease the probability of the bad actions to 0 and increase the probability of the good action to 1 in every state, we would get a deterministic policy. However, this cannot happen, due to the entropy regularization which prevents the PPO algorithm from reducing the probability of an action to zero. This is a necessary measure to guarantee that all actions remain possible in all states so that the agent has the possibility to explore the whole state space during training. The noisy behavior of the DRL agent can also be seen in Figure 6, which plots the selected action versus λ̂ and don.

FIGURE: (a) At step 480 the arrival rate suddenly increases to 500 1/s. (b) The DRL agent follows the traffic increase by starting new Pods, but it sometimes terminates existing ones due to its stochastic policy.

It is worth emphasizing that there may be outlier points in the dataset, e.g. where the arrival rate is very low and the Pod count is very high. If these points are labeled correctly, they do not influence the separating line. However, in case of mislabeling, these points can shift the decision boundary in an unwanted direction. Therefore we need to clean the dataset for the classifier by removing these outlier points. We did this by considering every point an outlier for which

|don/Dmax − λ̂/λ̂max| > 0.4,   (11)

where λ̂max is the maximum of the measured arrival rate during the experiment. With this we removed every point that is not on the main diagonal strip of the scatter plot.

We apply the DRL agent to generate labels by taking a set of states and mapping an action to each state. The resulting dataset of size Ndata was then used to create a linear support vector machine (SVM) classifier that maps states to actions.
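As a hedged sketch of the dataset-cleaning step (11), with our own naming and with the state column order following the definition of si in Section IV-A:

import numpy as np

def remove_outliers(states: np.ndarray, actions: np.ndarray,
                    d_max: int, lam_max: float, threshold: float = 0.4):
    """Drop points with |d_on / D_max - lambda_hat / lambda_max| > threshold, as in (11).

    states  : array of shape (N, 5), columns (d_on, d_boot, l_sess, l_free, lambda_hat)
    actions : array of shape (N,), labels produced by the DRL policy
    """
    d_on = states[:, 0]
    lam_hat = states[:, 4]
    keep = np.abs(d_on / d_max - lam_hat / lam_max) <= threshold
    return states[keep], actions[keep]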
FIGURE 6: Selected actions at given λ̂ and don values (λ̂ in 1/s against the number of Pods; legend: no action, start, terminate). We see more terminate actions when the arrival rate is low and the number of Pods is high, and more start actions in the opposite case.
Two cases are distinguished.
• If the data is separable, that is, a hyperplane exists that can separate the actions labeled with +1 from the actions labeled with −1, the optimization problem is

min_{w, w0} ||w||
subject to ãi (w^T si + w0) ≥ 1,  i = 1, 2, . . . , Ndata.   (13)

• If the dataset contains overlaps and it is not separable, we need to find the separating hyperplane that allows the least amount of points in the training set to be classified incorrectly. This can be achieved by introducing the slack variables ζi and modifying the optimization problem into

min_{w, w0} ||w||
subject to ãi (w^T si + w0) ≥ 1 − ζi,
ζi ≥ 0,  ∑i ζi ≤ C,  i = 1, 2, . . . , Ndata,   (14)

where C, called the cost parameter, is a tunable hyperparameter of the SVM. The smaller C is, the more points are allowed to be misclassified, resulting in a higher margin.

Algorithm 3 presents the training procedure of the SVM. The algorithm requires the parameters and the hyperparameters of the simulation environment and the DRL method. It returns the SVM model parameters w and w0 and also returns the accuracy of the model on the test set, which is a performance measure of the SVM.

After the initialization of the agent and the environment, the algorithm starts training the agent for Ntrain steps. This training loop is almost identical to the one in Algorithm 2. The difference is that in the last Ndata steps the agent stores the states in a list Lstates for the dataset later.

When the training of the DRL agent is finished, its policy is used to evaluate the states in Lstates. The resulting actions are then saved in the list Lacts. These two lists together form the dataset we use to train the SVM. The dataset is cleaned by removing the outlier data points. Then it is split
into training (Lstates^train, Lacts^train) and test (Lstates^test, Lacts^test) sets. The training set is used to train the SVM model, whereas the test set is used to determine the accuracy of the trained model. At the end of the procedure we get the SVM model parameters and the accuracy on the test set.

Algorithm 3 Training an SVM classifier
Input: Environment and DRL parameters (see Tables 1 and 3).
Output: SVM model (w, w0) and its accuracy
1: Initialize Pod count to Dmin.
2: Initialize θ and ω of the AGENT with random values.
3: Lstates ← {∅}, Lacts ← {∅}
4: // Training agent and collecting states.
5: i ← 0
6: for Ntrain steps do
7:    ai ← Get action from agent in state si.
8:    ri, si+1 ← Execute action ai and get reward and the next state.
9:    Store history and Update agent using the AGENT.STORE and AGENT.UPDATE procedures in Algorithm 1.
10:   if i > Ntrain − Ndata then
11:      Append state to Lstates.
12:   end if
13:   i ← i + 1
14: end for
15: i ← 0
16: for Ndata steps do
17:   Evaluate DRL agent on state Lstates[i] to get action.
18:   Append action to Lacts.
19:   i ← i + 1
20: end for
21: Remove outlier points according to (11).
22: Separate lists into train and test sets: Lstates → Lstates^train, Lstates^test; Lacts → Lacts^train, Lacts^test.
23: w, w0 ← Train SVM using Lstates^train as features and Lacts^train as labels and run a grid search on hyperparameter C.
24: Get accuracy of the model using Lstates^test and Lacts^test.
25: return w, w0, accuracy

We run this algorithm multiple times to perform a grid search on the C hyperparameter of the SVM, which means each run uses a different C value. Finally, we pick the model with the highest accuracy on the test sets.

In order to assess the SVM classifier, we also experimented with another classification method, logistic regression, which describes the log-odds of each class with a linear function. For more on this classifier, see [43]. In this case, Algorithm 3 can be modified by replacing the SVM model with a logistic regression model.

We used the scikit-learn 0.24.2 [44] library to implement the SVM and the logistic regression models. For the logistic regression, we used the default hyperparameters. For the SVM hyperparameter values see Section V-B. For the list of notations used by the algorithms see Table 3.
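With scikit-learn, steps 22-24 of Algorithm 3 could be sketched roughly as follows; the split ratio and the C grid are illustrative assumptions, the paper's actual values are given in Section V-B. Note that GridSearchCV performs the search over C internally, whereas the paper reruns Algorithm 3 once per C value.

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

def train_svm_classifier(states, actions, c_grid=(0.1, 1.0, 10.0)):
    """Fit a linear SVM mapping states to scaling actions and report test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(states, actions, test_size=0.2)
    search = GridSearchCV(SVC(kernel="linear"), param_grid={"C": list(c_grid)})
    search.fit(X_train, y_train)                # grid search on the cost parameter C
    accuracy = search.score(X_test, y_test)     # accuracy on the held-out test set
    return search.best_estimator_, accuracy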
TABLE 3: Notations used by the algorithms.
Notations for the HPA
ρ̄: mean utilization (e.g. CPU)
ρtarget: target utilization of the HPA algorithm
ν: tolerance of the HPA algorithm
ddesired: desired Pod count provided by the HPA
Notations for the DRL
λ̂max: maximum measured arrival rate
λtrain, λeval: training and evaluation phase arrival rate functions
Ntrain, Neval: number of training and evaluation steps
θ, ω: policy and value network parameter vectors
Nbatch: batch size
r̃: mean reward
αR: mean reward soft update rate
k: number of epochs in PPO
TDtarget: temporal difference target
δ: TD-error
A, Â: advantage, estimate of the advantage
αθ, αω: learning rates of the policy and the value network
ε: PPO clipping parameter
λGAE: GAE parameter
H: entropy regularization function
ξ: entropy coefficient
Notations for the SVM
Ndata: size of the dataset used for training the SVM
Lstates/acts^train/test: lists containing the training/test sets of states/actions
w, w0: SVM parameters
ζi: SVM slack variable
C: margin parameter of the SVM

E. SYSTEM MODELING
We built a simulator program that emulates a multi-node cloud environment and implemented the DRL agent in Python with the help of PyTorch. The simulator program contains a procedure that generates the arrival of a UE as a Poisson process with arrival rate λ(t) at time t. Upon arrival a PDU session is initiated if there is available capacity among the Pods. Otherwise the UE's request is blocked. The UPF handling the PDU session and its traffic is chosen at random. We assume the length of a session is random and distributed exponentially with rate µ.

Note that in practice we do not know the exact arrival rate function in advance. To show how the DRL algorithm can cope with this, we divided the DRL experiments into two phases, a training phase and an evaluation phase. In each of these phases we used a different function for the arrival rate, λtrain and λeval, respectively. We can think of the training phase as a pre-training stage where we initialize the DRL agent and train it with a predefined arrival rate function, whereas in the evaluation phase we apply the pre-trained agent on an environment with a new traffic model. So learning also happens in the evaluation phase, but the agent does not need to go through a cold start.

We trained the DRL agent on a sinusoidally varying arrival rate

λtrain(t) = 250 + 250 sin(π t / 6)   (15)

for Ntrain simulation steps. With this function the agent can explore a wide range of traffic intensity. For evaluation we used an equation from [45] which was
of notations used by the algorithms see Table 3. For evaluation we used an equation from [45] which was
10 VOLUME x, xxxx
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2021.3135315, IEEE Access
Nguyen, Do and Rotter: Scaling UPF Instances in 5G/6G Core with Deep Reinforcement Learning
determined for mobile user traffic. We scaled it to our use case to get

λeval(t) = 330.07620 + 171.10476 sin(π t / 12 + 3.08) + 100.19048 sin(π t / 6 + 2.08) + 31.77143 sin(π t / 4 + 1.14)   (16)

and ran Neval simulation steps with it. This scaling makes the peak traffic 500 PDU requests/s. To visualize λtrain and λeval in (15) and (16), we plotted Figures 7a and 7b, which demonstrate the arrival rates measured (λ̂) during the training phase and the evaluation phase for a 36 hour time period.

FIGURE 7: Arrival rates measured during the training and the evaluation phase for a 36 hour period.

V. NUMERICAL RESULTS
A. SCENARIOS
For the numerical evaluations, we assumed that
• UPF instances run in physical servers [46] with the Intel Xeon 6238R 28-core 2.2 GHz processor and 4x64 GB RAM;
• each UPF session conveys video streaming data;
• eight cores on each server are allocated for the OS and the container management system;
• each UPF instance occupies one core and 2 GB RAM and serves a maximum of 8 simultaneous video streams;
• booting time is not negligible and is fixed and identical for each UPF Pod.

Parameter values for the cluster used during the simulations can be found in Table 4.

TABLE 4: Parameter values for the cluster used during the simulations.
max. number of sessions per Pod (B): 8
service rate (µ): 1/s
minimum number of Pods (Dmin): 2
maximum number of Pods (Dmax): 100
initialization time (tpend): 0.25, 5, 10 s
time between decisions (∆T): 1 s
blocking rate threshold (pb,th): 0.01

TABLE 5: Mean number of Pods and average blocking rate with the DRL and the HPA algorithms. Percentages show improvement compared to the HPA algorithm.
We set the blocking threshold pb,th = 0.01 and ran the DRL algorithm under various tpend values. For each tpend value we ran 8 simulations and took the average of don and p̂b for the evaluation phase.

For the HPA algorithm, we assumed that handling a PDU session requires 100% of a CPU core. This means that the utilization of a UPF Pod is proportional to the number of PDU sessions it handles.
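To tie the scenario parameters above to the traffic model of Section IV-E, here is a rough, self-contained Python sketch of the arrival-rate functions (15) and (16) and of one second of Poisson arrivals with exponential session lengths. Reading t in hours and the service rate as µ = 1 1/s is our interpretation of the setup, and all names are illustrative.

import math
import numpy as np

RNG = np.random.default_rng(0)

def lam_train(t_hours: float) -> float:
    """Training-phase arrival rate (15), in PDU session requests per second."""
    return 250.0 + 250.0 * math.sin(math.pi * t_hours / 6.0)

def lam_eval(t_hours: float) -> float:
    """Evaluation-phase arrival rate (16): scaled mobile-traffic profile from [45]."""
    return (330.07620
            + 171.10476 * math.sin(math.pi * t_hours / 12.0 + 3.08)
            + 100.19048 * math.sin(math.pi * t_hours / 6.0 + 2.08)
            + 31.77143 * math.sin(math.pi * t_hours / 4.0 + 1.14))

def traffic_step(l_sess: int, capacity: int, lam: float, mu: float = 1.0):
    """One-second slice of the model: Poisson(lam) arrivals; each active session
    ends within the second with probability 1 - exp(-mu). Returns the new
    session count and the number of blocked requests."""
    blocked = 0
    for _ in range(RNG.poisson(lam)):
        if l_sess < capacity:
            l_sess += 1              # a free slot exists, the session is admitted
        else:
            blocked += 1             # no free capacity, the UE's request is blocked
    l_sess -= int(RNG.binomial(l_sess, 1.0 - math.exp(-mu)))
    return l_sess, blocked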
FIGURE 9: Decision boundary learned by the SVM classifier (selected action as a function of the number of Pods and λ̂ in 1/s; legend: no action, start, terminate). The agent terminates a Pod if it is in a state below the boundary line and starts a new Pod if it is above.
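Once the hyperplane (w, w0) is learned, choosing between the two labeled actions for a state reduces to a single dot product; a minimal sketch (our naming, binary case of (13)-(14) only):

import numpy as np

def svm_action(state: np.ndarray, w: np.ndarray, w0: float) -> int:
    """Return the +1 or -1 action label depending on which side of the hyperplane
    w^T s + w0 = 0 the state lies; cf. Figure 9, where points above the boundary
    map to the start action and points below to the terminate action."""
    return 1 if float(np.dot(w, state) + w0) >= 0.0 else -1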