Professional Documents
Culture Documents
Optimized Power and Cell Individual Offset For Cellular Load Balancing Via Reinforcement Learning
Optimized Power and Cell Individual Offset For Cellular Load Balancing Via Reinforcement Learning
Abstract—We consider the problem of jointly optimizing the favor of maximizing overall throughput. Another technique of
transmission power and cell individual offsets (CIOs) in the throughput maximization is mobility load management. This
downlink of cellular networks using reinforcement learning. To can be done by freeing the physical resource blocks (PRBs)
that end, we reformulate the problem as a Markov decision
process (MDP). We abstract the cellular network as a state, of congested cells by forcing edges users to handover to less-
which comprises of carefully selected key performance indica- congested cells. A typical approach is controlling the cell
tors (KPIs). We present a novel reward function, namely, the individual offset (CIO) [2]. The CIO is an offset that artificially
penalized throughput, to reflect the tradeoff between the total alters the handover decision. This may enhance the throughput
throughput of the network and the number of covered users. We by allowing the users to enjoy more PRBs compared to their
employ the twin deep delayed deterministic policy gradient (TD3)
technique to learn how to maximize the proposed reward function share in the congested cells. Nevertheless, this may result in
through the interaction with the cellular network. We assess the decreasing the SINR of users handed over, mostly edge users,
proposed technique by simulating an actual cellular network, as the CIO decision only fakes the received signal quality.
whose parameters and base station placement are derived from This in turn suggests a modified throughput reward function
a 4G network operator, using NS-3 and SUMO simulators. Our to extend the coverage for cell edge users. The complex
results show the following: 1) Optimizing one of the controls is
significantly inferior to jointly optimizing both controls; 2) our interplay between the SINR, available PRBs, CIOs, and the
proposed technique achieves 18.4% throughput gain compared transmitted power is challenging to model using classical
with the baseline of fixed transmission power and zero CIOs; 3) optimization techniques. This motivates the use of machine
there is a tradeoff between the total throughput of the network learning techniques to dynamically and jointly optimize the
and the number of covered users. power and CIOs of the cells.
LTE and future networks are designed to support self-
I. I NTRODUCTION
optimization (SO) functionalities [3]. A significant body of lit-
Mobile data traffic is significantly growing due to the rise erature is concerned with load balancing via self-tuning of CIO
of smartphone subscriptions, the ubiquitous streaming and values, e.g., [2], [4]–[9]. Balancing the traffic load between
video services, and the surge of traffic from social media neighboring cells by controlling the coverage area of the cells
platforms. Global mobile data traffic is estimated to be 38 appears in [10]–[13]. Reshaping the coverage pattern is usually
exabytes per month in 2020 and is projected to reach 160 performed by adjusting the transmit power [10], the transmit
exabytes per month in 2025 [1]. To cope with this high traffic power and antenna gain [11], or the directional transmit power
volume, considerable network-level optimization should be [12] of the eNB. To the best of our knowledge, the joint
performed. Cellular network operators strive to enhance the optimization problem of power and CIO using reinforcement
users’ experience irrespective of the traffic demands. This learning has not been investigated in the literature.
entails maximizing the average throughput of the users and In this work, we propose a joint power levels and CIOs
minimizing the number of users that are out of coverage. optimization using reinforcement learning (RL). To that end,
The indicated optimization problem is challenging for its we recast this optimization-from-experience problem as a
often contradicting requirements and the absence of accurate Markov decision problem (MDP) to model the interaction
statistical models that can describe the tension(s). Specifically, between the cellular network (a.k.a., the environment) and the
increasing the transmitted power of some base station (an eNB decision-maker (a.k.a., the central agent). The MDP formula-
in LTE1 ) may enhance the SINR for its served users. How- tion requires a definition of a state space, an action space, and
ever, cell edge users in neighboring cells will have increased a reward function. The state is a compressed representation
interference. Hence, network-wide, the optimization based on of the cellular network. We define the state as a carefully-
transmit power alone may tend to sacrifice cell edge users in chosen subset of the key performance indicators (KPIs), which
are usually available at the operator side. This enables the
1 It is worth noting that our proposed reinforcement learning technique can
seamless application of our techniques to current cellular
be seamlessly applied to 5G networks as well. Our simulations are based on
LTE and we refer to LTE throughout the paper since our data comes from a networks. We propose a novel action space, where both power
4G operator. levels and CIOs are utilized to enhance the user’s quality of
978-1-7281-9505-6/21/$31.00
Authorized licensed©2021 IEEEto: Carleton University. Downloaded on June 15,2021 at 08:29:42 UTC from IEEE Xplore. Restrictions apply.
use limited
,(((:LUHOHVV&RPPXQLFDWLRQVDQG1HWZRUNLQJ&RQIHUHQFH :&1&
service. We argue that both controls have distinct advantages. The connected eNB assigns Bk PRBs to the kth UE as:
Furthermore, we propose a novel reward function, which we
ρk
call the penalized throughput, that takes into consideration the Bk = (1)
g(φk , Mn,k )Δ
total network throughput and the number of uncovered users.
The penalty discourages the agent from sacrificing edge-users where g(φk , Mn,k ) is the spectral efficiency achieved by the
to maximize the total system throughput. scheduler with a UE having a CQI of φk and an antenna
Based on the aforementioned MDP, we use actor-critic configuration of Mn,k , and Δ = 180 KHz, which is the
methods [14] to learn how to optimize the power and CIO bandwidth of a resource block in LTE. The total PRBs needed
levels of all cells from experience. More specifically, we to serve the UEs associated with the nth eNB at time t is given
Kn (t)
employ the twin deep delayed deterministic policy gradient by Tn (t) = k=1 Bk . Denote the total available PRBs at the
(TD3) [15] to maximize the proposed penalized throughput nth eNB by Σn .
function. The TD3 technique is a state-of-the-art RL technique Furthermore, we assume that there exists a central agent2
that deals with continuous action spaces. In TD3, the critics that can monitor all the network-level KPIs and aims at en-
are neural networks (NNs) for estimating state-action values hancing the user’s experience. Aside from controlling the ac-
(a.k.a., the Q-values). Meanwhile, the actor is a separate NN tual power Pn , the agent can control CIOs. The relative CIO of
that estimates the optimal action. The actor function is updated cell i with respect to cell j is denoted by θi→j ∈ [θmin , θmax ]
in such a way that maximizes the expected estimated Q-value. dB, and is defined as the offset power in dB that makes the
The critics are updated by fitting a batch of experiences that RSRP of the ith cell appears stronger than the RSRP of the
are learned from the interactions with the cellular network. jth cell. Controlling the power levels (Pn : n = 1, 2, · · · , N )
We gauge the efficacy of our proposed technique by con- and the CIOs (θi→j : i = j, i, j = 1, 2, · · · , N ) can trigger
structing an accurate simulation suite using the NS-3 and the handover procedure for the edge UEs. More specifically,
SUMO simulators. We simulated an actual 4G cellular network a UE which is served by the ith cell may handover to the jth
that is currently operational in the Fifth Settlement neigh- cell if the following condition holds [18]:
borhood in Cairo, Egypt. The site data including the base Zj + θj→i > H + Zi + θi→j , (2)
station placement is provided by a major 4G operator in Egypt.
Our numerical results show the validity of our claim that where Zi , Zj are the RSRP from cells i, j, respectively, and H
using one of the controls (transmitted power or CIO) but not is the hysteresis value that minimizes the ping-pong handover
both is strictly sub-optimal with respect to joint optimization scenarios due to small scale fading effects. By controlling the
in terms of the throughput. Thus, our proposed technique handover procedure, the traffic load of the network is balanced
outperforms its counterpart in [2] by 11%. Furthermore, our across the cells. The agent chooses a policy such that the total
technique results in significant gains in terms of the channel throughput and the coverage of the network are simultaneously
quality indicators (CQIs) and the network coverage. Finally, maximized in the long run according to some performance
our proposed penalized throughput effects a tradeoff between metric as we will formally describe next.
the overall throughput and the average number of covered
III. R EINFORCEMENT L EARNING F RAMEWORK
users that can be controlled.
In this section, we recast the aforementioned problem as
an MDP. MDPs [19] describe the interaction between an
II. S YSTEM M ODEL agent, which is in our case the power levels and the CIOs
controller, and an environment, which is the whole cellular
We consider the downlink (DL) of a cellular system with system including all eNBs and UEs. At time t = 0, 1, 2, · · · ,
N eNBs. The nth eNB sends its downlink transmission with the agent observes a representation of the environment in the
a power level Pn ∈ [Pmin , Pmax ] dBm. The cellular system form of a state S(t), which belongs to the state space S.
serves K mobile user equipment (UEs). Each UE measures The agent takes an action A(t), which belongs to the action
the reference signal received power (RSRP) on a regular basis space A. This causes the environment to transition from the
and connects to the cell that results in the best-received signal state S(t) to the state S(t + 1). The effect of the action A(t)
, there are Kn (t) UEs
quality [16]. Thus, at t = 0, 1, 2, · · · is measured through a reward function, R(t + 1) ∈ R. To
N
connected to the nth eNB such that n=1 Kn (t) ≤ K. The completely recast the problem as an MDP, we need to define
kth UE moves with a velocity vk along a mobility pattern S, A, and R(·) as we will show next.
that is unknown to any of the eNBs. The kth UE periodically
A. Selection of the State Space
reports the CQI, φk to the connected eNB. The CQI is a
discrete measure of the quality of the channel perceived by To construct an observable abstraction of the environment,
the UE that takes a number from the set {0, 1, · · · , 15}. When we use a subset of the network-level KPIs as in [2], [20].
φk = 0, the kth UE is out of coverage, while a higher value Since the KPIs are periodically reported by the eNBs, there is
of φk corresponds to higher channel quality and hence results 2 The 3GPP specifies an architecture for centralized self-optimizing func-
in a better modulation and coding scheme (MCS) assignment. tionality, which is intended for maintenance and optimization of network
The kth UE requests a minimum data rate of ρk bits/s. coverage and capacity by automating these functions [17].
Authorized licensed use limited to: Carleton University. Downloaded on June 15,2021 at 08:29:42 UTC from IEEE Xplore. Restrictions apply.
,(((:LUHOHVV&RPPXQLFDWLRQVDQG1HWZRUNLQJ&RQIHUHQFH :&1&
no added overhead at communicating the state to the agent. In the power level Pn controls all the edges of the nth cell
this work, the state comprises of the following components: simultaneously, while the CIO θn→m can be tailored such that
First, the resource block utilization (RBU) vector, U(t) = it controls only the cell edge that is common with the mth
[U1 (t) U2 (t) · · · UN (t)] ∈ [0, 1]N , where Un (t) = TΣ n (t)
n
cell only without affecting the remaining edges. Hence, both
is the RBU of the nth eNB at time t. The RBU reflects the load techniques have complementary advantages that ultimately
level of each cell. Second, the DL throughput vector, R(t) = result in the superior performance gain.
[R1 (t) R2 (t) · · · RN (t)] ∈ RN + , where Rn (t) is the total
DL throughput at the nth cell at time t. Third, the connectivity C. Selection of the Reward Function: Penalized Throughput
vector K(t) = [K1 (t) K2 (t) · · · KN (t)] ∈ NN , where To assess the performance of a proposed policy in an MDP,
Kn (t) is the number of active UEs that are connected to one should formulate the goal of the system in terms of a
the nth eNB. This KPI shows the effect of the handover reward function R(·). The central agent in MDP implements
procedure. Furthermore, the average user experience at the a stochastic policy5 , π, where π(a|s) is the probability that
nth cell is dependent on Kn (t) as the average throughput per the agent performs an action A(t) = a given it was in a state
Rn (t) S(t) = s. The central agent aims at maximizing the expected
user, R̄n (t) is given by R̄n (t) = K n (t)
. Finally, the low-rate
MCS penetration matrix M(t) ∈ [0, 1]N ×τ , where τ is the long-term sum discounted reward function [19], i.e.,
L
number of low-rate MCS combinations3 . The element (i, j)
t
of the matrix M(t) corresponds to the ratio of users in the max lim Eπ λ R(t) (5)
π L→∞
nth cell that employs the jth MCS. The matrix M(t) gives t=0
the distribution of relative channel qualities at each cell. where λ corresponds to the discount factor, which signifies
Now, we are ready to define our state S(t), which is simply how important future expected rewards are to the agent.
the concatenation of all four components as: The UE desires to be served consistently with the highest
S(t) = [U(t)T R(t)T K(t)T vec(M(t))T ]T (3) possible data rate. One possible reward function to reflect this
requirement is the total throughput of the cellular system, i.e.,
where vec(·) is the vectorization function. The connectivity Kn
N
and the throughput vectors are normalized to avoid having R(t) = Rkn (t) (6)
dominant features during the weight-learning process. n=1 kn =1
B. Selection of the Action Space: CIO and Transmitted Power where Rkn (t) is the actual measured throughput of the kn th
In this work, the central agent has two controls, namely, the user at time t. This also reflects the average user’s throughput
power levels and the CIOs. More specifically, the agent selects by normalizing by the total number of UEs K.
the action A(t) which is a vector of N (N2+1) dimensionality, Now, since the power levels and the relative CIOs are
controllable, the agent may opt to choose power levels that
A(t) = [PT θ T ]T (4) effectively shrink the cells’ radii or CIOs levels that connect
where P = [P1 P2 · · · PN ] ∈ [Pmin , Pmax ]N is the the edge UEs to a cell with poor channel quality. In this case,
eNB power level vector, and θ = [θi→j (t) : i = j, i, j = the agent maximizes the total throughput by only keeping the
1, · · · , N ] ∈ [θmin , θmax ]N (N −1)/2 is the relative CIO vector4 . non-edge UEs that enjoy high MCS and at the same time
Define αL = [Pmin · 1N θmin · 1N (N −1)/2 ] and αH = minimizing the inter-cell interference. Therefore, using total
[Pmax · 1N θmax · 1N (N −1)/2 ] to be the limits of A. throughput as the sole performance metric is not suitable for
We argue that both CIOs and power levels are strictly representing the user experience as edge users may be out of
beneficial as steering actions of the agent. To see that, we note coverage even if the total throughput is maximized.
that the power level control corresponds to an actual effect on In this work, we propose a novel reward function, namely,
the channel quality at non-edge UEs and the interference faced the penalized throughput. This entails maximizing the average
by the edge UEs. More specifically, as the power level of the user throughput while minimizing the number of uncovered
nth eNB Pn increases, the channel quality of the non-edge UEs. More specifically, our reward function is defined as:
UEs at the nth cell increases while the inter-cell interference Kn
N K
faced by edge UEs of the neighboring cell increases as well. R(t) = Rkn (t) − η R̄(t) 1(φk = 0) (7)
This is not the case if the CIO control is used as the CIO n=1 kn =1 k=1
artificially triggers the handover procedure without changing where 1(X) = 1 if the X is true and 0
the power level. Furthermore, the power level control is not 1
logical
N Kcondition
otherwise, R̄(t) = K n=1 kn =1 Rkn (t) is the average
n
sufficient to solve our problem. This is due the fact that user throughput at time t, and η is a hyperparameter that
3 Ideally, we would consider the total number of MCS combinations, which signifies how important is the coverage metric with respect
is 29 in LTE. However, to reduce the dimensionality of the state S(t) to the total throughput. Our reward function implies that the total
ensure stable convergence of the RL, we focus only on a number of low-rate
MCSs (e.g., τ = 10). This is because our controlling actions primarily affect 5 Generally, the MDP framework allows for the use of stochastic policies.
the edge users, who are naturally assigned a low-rate MCSs. Nevertheless, in this work, we focus only on the special case of deterministic
4 Without loss of genrality, we assume that θ
i→j = −θj→i policies, i.e., p(a∗ |s) = 1 for some a∗ ∈ A.
Authorized licensed use limited to: Carleton University. Downloaded on June 15,2021 at 08:29:42 UTC from IEEE Xplore. Restrictions apply.
,(((:LUHOHVV&RPPXQLFDWLRQVDQG1HWZRUNLQJ&RQIHUHQFH :&1&
based on the observed state. This distinction enables the actor- Batch સ࢝ࢇ࢚
Soft Update
۵ ܜܖ܍ܑ܌܉ܚસ࢝ࢉ
࢚ )
critic methods to deal with a continuous action space as they
࢚
ࡽሺࡿ ǡ ; ࢝ࢉ
Critic ഥ ࢚ࢇ)
Actor Target ࣆ(s; ࢝
Optimizer
do not require to tabulate all possible action values as in deep min
ࡽሺࡿ ǡ ; ࢚࢝ࢉ )
ࢉ ࢉ
࢚࢝ , ࢚࢝ Update
Q-learning (DQN). Since the number of actions is exponential ࢉ
Critic2 ࡽ(s,a;࢚࢝ ) ࢉ
Critic1 ࡽ(s,a;࢚࢝ )
࢚ )
ഥ ࢚ࢉ )
ഥ ࢉ
෩ ା ; ࢝
෩ ା ; ࢝
[2] becomes prohibitive for large cellular networks. The actor ഥ ࢚ࢇ )
ࣆ(ࡿା ; ࢝
function μ(s) outputs the best action for the state s, while the
Soft Update
ࡽሺࡿା ǡ
ࡽሺࡿା ǡ
Soft Update
ࢉ ࢉ Target
࢚࢝ )
Critic2 Target ࡽ(s,a;ഥ ࢚࢝ )
Critic1 Target ࡽ(s,a;ഥ
critic function Q(s, a) evaluates the quality of the (s, a) pair. Smoothing
෩ ା
In this work, we employ the TD3 technique [15]. Similar to
Authorized licensed use limited to: Carleton University. Downloaded on June 15,2021 at 08:29:42 UTC from IEEE Xplore. Restrictions apply.
,(((:LUHOHVV&RPPXQLFDWLRQVDQG1HWZRUNLQJ&RQIHUHQFH :&1&
V. N UMERICAL RESULTS
In this section, the performance of the proposed approach is
evaluated through simulations. The simulated cellular network
consists of N = 6 irregular cells distributed in the area of
900m×1800m extracted from the Fifth Settlement neighbor-
hood in Egypt. There are K = 40 UEs with realistic mobility
pattern created by the Simulation of Urban Mobility (SUMO)
according to the mobility characteristics in [23]. The UEs
are either vehicles or pedestrians. The walking speed of the
pedestrians ranges between 0 − 3m/s. All UEs are assumed
Fig. 2: Average overall throughput during the learning process
to have full buffer traffic model, i.e., all UEs are active all
the time. This cellular network, which represents the environ-
ment of the proposed RL framework, is implemented using The relative performance of the previously mentioned con-
LTE-EPC Network Simulator (LENA) [24] module which is trol schemes is shown in Fig. 2, Fig. 3, and Fig. 4. The
included in NS-3 Simulator. The agent is implemented using average overall system throughput, the average CQI, and the
Python, which is based on the Open AI Gym implementation average number of covered users are used as quality indicators
in [25]. The interface between the NS3-based environment of the different schemes. These indicators are averaged over
and the agent is implemented via the NS3gym interface [26]. 250 steps in each episode and observed for 250 episodes
This interface is responsible of applying the agent’s action of the learning process. All reported results are obtained by
to the environment at each time step. Then, the network is averaging over 10 independent runs to reduce the impact of
simulated having selected the action in effect. Afterwards, the the exploration randomness on the relative performance of
reward is estimated based on the expression in (7). Finally, the schemes. In Fig. 2, the proposed power and CIO control
the NS3gym interface returns the reward and the environment 6 Note that, the mean power level of 32 dBm is a typical operational
state back to the agent. Table I presents our simulators’ transmitted power value.
Authorized licensed use limited to: Carleton University. Downloaded on June 15,2021 at 08:29:42 UTC from IEEE Xplore. Restrictions apply.
,(((:LUHOHVV&RPPXQLFDWLRQVDQG1HWZRUNLQJ&RQIHUHQFH :&1&
Authorized licensed use limited to: Carleton University. Downloaded on June 15,2021 at 08:29:42 UTC from IEEE Xplore. Restrictions apply.
,(((:LUHOHVV&RPPXQLFDWLRQVDQG1HWZRUNLQJ&RQIHUHQFH :&1&
Authorized licensed use limited to: Carleton University. Downloaded on June 15,2021 at 08:29:42 UTC from IEEE Xplore. Restrictions apply.