
Power Allocation in Multi-User Cellular Networks: Deep Reinforcement Learning Approaches

Fan Meng, Student Member, IEEE, Peng Chen, Member, IEEE, Lenan Wu, and Julian Cheng, Senior Member, IEEE

Abstract—The model-based power allocation has been investigated for decades, but this approach requires mathematical models to be analytically tractable and it has high computational complexity. Recently, data-driven model-free approaches have been rapidly developed to achieve near-optimal performance with affordable computational complexity, and deep reinforcement learning (DRL) is regarded as one such approach having great potential for future intelligent networks. In this paper, a dynamic downlink power control problem is considered for maximizing the sum-rate in a multi-user wireless cellular network. Using cross-cell coordination, the proposed multi-agent DRL framework includes off-line and on-line centralized training and distributed execution, and a mathematical analysis is presented for the top-level design of the near-static problem. Policy-based REINFORCE, value-based deep Q-learning (DQL), and actor-critic deep deterministic policy gradient (DDPG) algorithms are proposed for this sum-rate problem. Simulation results show that the data-driven approaches outperform the state-of-the-art model-based methods in sum-rate performance. Furthermore, DDPG outperforms REINFORCE and DQL in terms of both sum-rate performance and robustness.

Index Terms—Deep reinforcement learning, deep deterministic policy gradient, policy-based, interfering broadcast channel, power control, resource allocation.

Manuscript received January 17, 2019; revised June 21, 2019, November 9, 2019, and March 17, 2020; accepted June 8, 2020. Date of publication June 18, 2020; date of current version October 9, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61801112 and Grant 61601281, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20180357, in part by the foundation of the Shaanxi Key Laboratory of Integrated and Intelligent Navigation under Grant SKLIIN-20190204, and in part by the Open Program of the State Key Laboratory of Millimeter Waves, Southeast University, under Grant K202029. The associate editor coordinating the review of this article and approving it for publication was D. So. (Corresponding author: Fan Meng.)

Fan Meng and Lenan Wu are with the School of Information Science and Engineering, Southeast University, Nanjing 210096, China (e-mail: mengxiaomaomao@outlook.com; wuln@seu.edu.cn).

Peng Chen is with the State Key Laboratory of Millimeter Waves, Southeast University, Nanjing 210096, China (e-mail: chenpengseu@seu.edu.cn).

Julian Cheng is with the School of Engineering, The University of British Columbia, Kelowna, BC V1V 1V7, Canada (e-mail: julian.cheng@ubc.ca).

Digital Object Identifier 10.1109/TWC.2020.3001736

I. INTRODUCTION

WIRELESS data transmission has experienced tremendous growth in past years and will continue to grow in the future. When a large number of terminals such as mobile phones and wearable devices are connected to the networks, the density of access points (APs) will have to be increased. Dense deployment of small cells, such as pico-cells and femto-cells, has become the most effective solution to accommodate the demand for spectrum [1]. With denser APs and smaller cells, the intra-cell and inter-cell interference problems can be severe [2]. Therefore, power allocation and interference management are both crucial and challenging [3], [4] for small-cell networks.

A number of model-oriented algorithms have been developed to manage interference [5]–[9], and existing studies mainly focus on sub-optimal or heuristic algorithms, whose performance gaps to the optimal solution are typically challenging to quantify. Besides, the mathematical models are usually assumed to be analytically tractable, but these models can be inaccurate because both hardware and channel imperfections can exist in practical communication environments. When considering specific hardware components and realistic transmission scenarios, such as low-resolution analog-to-digital converters, nonlinear amplifiers and user distributions, it is challenging to develop signal processing techniques using model-driven tools. Moreover, the computational complexity of these algorithms is typically high and thus implementation becomes impractical. Meanwhile, machine learning [10] algorithms are potentially useful techniques for future intelligent wireless communications. These methods are usually model-free and data-driven [11], [12], and the solutions are obtained through data learning instead of model-oriented analysis and design.

Two main branches of machine learning are supervised learning and reinforcement learning (RL). When training input and output pairs are available, the supervised learning method is simple and efficient, especially for classification tasks such as modulation recognition [13] and signal detection [14], [15]. However, the desired output data or optimal solutions are usually derived by assuming certain system models. In addition, the performance of the learned models with supervised learning is suboptimal. Meanwhile, RL [16] has been developed as a goal-oriented algorithm to learn a better policy through exploration of uncharted territory and exploitation of current knowledge. RL concerns how agents ought to take actions in an environment such that some notion of cumulative reward is maximized, and the environment is typically formulated as a Markov decision process (MDP) [17]. Therefore, many RL algorithms [16] have been developed using dynamic programming (DP) techniques. In classic RL, a value function or a policy is stored in a tabular form, which
introduces the dimensionality problem. To address this issue, function approximation is proposed to replace the table, and this approximation can be realized by a neural network (NN) or a deep NN (DNN) [18]. When RL is combined with a DNN, deep RL (DRL) is created and widely investigated, and it can achieve stunning performance in a number of noted projects [19], such as the game of Go [20] and Atari video games [21].

The DRL algorithms can be categorized into three groups [19]: value-based, policy-based, and actor-critic algorithms. The most widely used algorithms are the value-based deep Q-learning (DQL) and the policy-based REINFORCE. These two algorithms have the following merits and defects:
1) DQL: This algorithm is efficient when the action space is finite, discrete, and low-dimensional, and it is unsuitable for joint optimization problems where the possible actions increase exponentially. Besides, the actions must be discretized for tasks having a continuous action space; therefore, quantization error can be introduced.
2) REINFORCE: Based on estimated gradients, the agent learns to generate a stochastic policy in REINFORCE. In some games, the optimal policy is stochastic, which DQL has difficulty producing. However, balancing exploration and exploitation during learning is challenging, and REINFORCE usually converges to a suboptimal solution. Similar to DQL, the dimensionality problem on the action space still exists.

The actor-critic algorithm is developed as a hybrid of the value-based and policy-based algorithms. It consists of two components: an actor to generate a policy and a critic to assess the policy. A better solution is learned through solving a multi-objective optimization problem and updating the parameters of the actor and the critic alternately. As an example, deep deterministic policy gradient (DDPG) [22] generates a deterministic action and operates over high-dimensional continuous action spaces.

Consider a wireless cellular network having a single-input single-output (SISO) interfering broadcast channel (IBC), whereby multiple base stations transmit signals to a group of users within their own cells. We investigate the dynamic sum-rate maximization problem, i.e., choosing downlink transmit power in response to physical channel conditions under maximum power constraints. This problem is NP-hard [3], and two advanced model-based algorithms, namely fractional programming (FP) [5] and weighted minimum mean squared error (WMMSE) [6], are regarded as benchmarks in this study. Supervised learning was studied in [23], where a trained DNN can accelerate processing speed with acceptable performance loss. To further improve the performance of supervised learning, an ensemble of DNNs is also proposed [24]. To manage the interference using DRL approaches, the current research work mainly concentrates on value-based algorithms. QL or DQL is widely applied to various communication scenarios having different power allocation problems, such as general interference management in heterogeneous networks that are typically composed of a macro base station and several femto-cells [25]–[28]; sum-rate maximization or proportionally fair scheduling in cellular networks [29]; the problem with stringent quality-of-service (QoS) constraints on latency and reliability in vehicular-to-vehicular broadcast [30]; and power control in a cognitive radio system that consists of two users sharing a spectrum [31]. To the best of the authors' knowledge, the classic policy-based approach has seldom been considered for power allocation [32]. An actor-critic algorithm has been applied for power allocation [33], where a Gaussian probability distribution function was used to formulate a stochastic continuous policy. These studies concern multiple users sharing spectrum in a cooperative distributed manner, which agrees with our proposed multi-agent DRL design. However, unlike the previous work [33], we perform a theoretical analysis to prove that the investigated distributed optimization problem is near-static rather than highly dynamic. Our proposed DRL algorithms are simple and efficient compared with the typical DRL algorithms that solve an MDP or a partially observed (PO) MDP problem. Except for the well-known QL algorithm, we also consider the classic policy-based and the state-of-the-art actor-critic algorithms.

The sum-rate maximization is a static optimization problem, where the target is a multi-variate ordinary function. The resulting centralized algorithm is inherently unscalable for large applications, and therefore we turn to multi-agent DRL. Although inter-cell coordinations are used among the users, the distributed power allocation problem can be considered as near-static in fast fading scenarios. While the standard DRL tools are designed for the DP that can be solved recursively, a direct use of these tools for solving the static optimization problem will suffer performance degradation. In a previous work [34], we verified that the widely applied standard DQL algorithm suffers from sum-rate performance degradation. In this work, we perform a theoretical analysis of the general DRL approaches to address the static optimization problem and improve the DRL algorithms. Based on this theoretical basis, we propose three simple and efficient algorithms, namely policy-based REINFORCE, value-based DQL and actor-critic-based DDPG. Simulation results show that our DQL achieves the best sum-rate performance when compared with the algorithms using the standard DQL, and our DRL approaches also outperform the state-of-the-art model-based methods. This work makes the following contributions:
• We perform a mathematical analysis on general DRL algorithms to tackle static optimization problems, such as the centralized dynamic power allocation in multi-user cellular networks. Furthermore, we verify that the distributed sum-rate problem having coordinations is near-static.
• The training procedure of the proposed DRL algorithm is centralized and the learned model is executed distributively. After off-line learning in a simulated system, the agents are then trained on-line using transfer learning, namely sim2real. An environment tracking mechanism is also proposed to control the on-line learning dynamically.
• The logarithmic representation of channel gain and power is used to address the numerical problem in DNNs and improve training efficiency. Besides, a sorting pre-processing technique is proposed to approximate the interferences, reduce the computation load, and accommodate varying user densities.
• The general DRL algorithms are proposed for static optimization, and the concrete DRL design is further introduced. We propose three simple and efficient algorithms, namely REINFORCE, DQL and DDPG, which are respectively policy-based, value-based, and actor-critic-based. Simulations on sum-rate performance, generalization performance and computation complexity are also demonstrated.

The remainder of this paper is organized as follows. Section II outlines the power control problem in the wireless cellular network with IBC. In Section III, the top-level DRL design for a static optimization problem is introduced and analyzed. In Section IV the proposed DRL approaches are presented. Then, the DRL methods are compared with benchmark algorithms in different scenarios, and simulation results are demonstrated in Section V. Conclusions are presented in Section VI.

II. PROBLEM FORMULATION

A. System Model

A cellular network having SISO IBC is considered, and it is composed of N cells. A base station (BS) equipped with one transmit antenna is deployed at each cell center. Using shared spectrum, K users in each cell are simultaneously served by the center BS.

At time slot t, the independent channel gain between BS n and user k in cell j is denoted by g^t_{n,j,k}, and it can be presented as

g^t_{n,j,k} = |h^t_{n,j,k}|^2 \beta_{n,j,k}   (1)

where |·| is the modulus operation; h^t_{n,j,k} is a complex Gaussian random variable with Rayleigh distributed envelope; \beta_{n,j,k} is the large-scale fading component that takes both geometric attenuation and shadow fading into account, and it is assumed to be invariant over one time slot. According to the Jakes model [35], the small-scale flat fading can be modeled as a first-order complex Gauss-Markov process

h^t_{n,j,k} = \rho h^{t-1}_{n,j,k} + n^t_{n,j,k}   (2)

where n^t_{n,j,k} ~ CN(0, 1 - \rho^2) (h^1_{n,j,k} ~ CN(0, 1) when t = 1), and the correlation coefficient \rho is determined by

\rho = J_0(2\pi f_d T_s)   (3)

where J_0(·) is the zero-order Bessel function of the first kind; f_d is the maximum Doppler frequency; and T_s is the time interval between adjacent instants.

The downlink from the n-th BS to the k-th serving AP is denoted by dl_{n,k}. Assuming that the signals from different transmitters are independent of each other, the channels are fixed in each time slot. Then the signal-to-interference-plus-noise ratio (SINR) of dl_{n,k} in time slot t can be written as

\gamma^t_{n,k} = \frac{g^t_{n,n,k} p^t_{n,k}}{\sum_{\bar{k} \neq k} g^t_{n,n,k} p^t_{n,\bar{k}} + \sum_{n' \in D_n} g^t_{n',n,k} \sum_{j} p^t_{n',j} + \sigma^2}   (4)

where D_n is the set of interference cells around the n-th cell; p^t_{n,k} is the emitting power of transmitter n to its receiver k at slot t; and \sigma^2 denotes the additive noise power. The terms \sum_{\bar{k} \neq k} g^t_{n,n,k} p^t_{n,\bar{k}} and \sum_{n' \in D_n} g^t_{n',n,k} \sum_{j} p^t_{n',j} represent the intra-cell interference and the inter-cell interference, respectively. Assuming normalized bandwidth, we express the downlink rate of dl_{n,k} as

C^t_{n,k} = \log_2\left(1 + \gamma^t_{n,k}\right).   (5)

B. Centralized Optimization Problem

In a centralized approach, the optimization problem is to maximize the overall sum-rate with respect to the non-negative power set p^t where all elements satisfy the maximum power constraint, and it is given as

\max_{p^t} C(g^t, p^t)
s.t. 0 \le p^t_{n,k} \le P_{max}, \forall n, k   (6)

where P_{max} denotes the maximum emitting power; the power set p^t, channel gain set g^t, and sum-rate C(g^t, p^t) are, respectively, defined as

p^t := \{p^t_{n,k} \mid \forall n, k\},   (7)
g^t := \{g^t_{n',n,k} \mid \forall n', n, k\},   (8)
C(g^t, p^t) := \sum_{n,k} C^t_{n,k}.   (9)

The problem (6) is non-convex and NP-hard. For model-based methods, the performance gaps to the optimal solution are typically challenging to quantify, and a practical implementation is also prohibitive due to high computational complexity. More importantly, the model-oriented approaches cannot accommodate future heterogeneous service requirements and randomly evolving environments. In the problem (6), the current channel state information (CSI) is the sufficient statistic of the optimal solution. Therefore, there exists a function that maps the CSI to the solution, and we resort to data-driven DRL algorithms to realize this function.
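To make the system model concrete, the following is a minimal NumPy sketch of the quantities defined above: the Gauss-Markov small-scale fading update in (2)-(3) and the SINR, per-link rate, and sum-rate in (4), (5), and (9). The array layout (a gain tensor g[j, n, k] holding the gain from BS j to user (n, k)), the choice D_n = all other cells, and the parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.special import j0  # zero-order Bessel function of the first kind

rng = np.random.default_rng(0)

def rho(fd, Ts):
    """Correlation coefficient of Eq. (3): rho = J0(2*pi*fd*Ts)."""
    return j0(2.0 * np.pi * fd * Ts)

def fading_step(h_prev, rho_val):
    """First-order complex Gauss-Markov update of Eq. (2)."""
    innov = rng.standard_normal(h_prev.shape) + 1j * rng.standard_normal(h_prev.shape)
    innov *= np.sqrt((1.0 - rho_val ** 2) / 2.0)      # innovation ~ CN(0, 1 - rho^2)
    return rho_val * h_prev + innov

def sum_rate(g, p, noise_power):
    """Per-link SINR (4), rate (5) and network sum-rate (9).

    g: (N, N, K) array, g[j, n, k] = gain from BS j to user k of cell n.
    p: (N, K) array of transmit powers.
    Every other cell is treated as an interferer here (D_n = all j != n).
    """
    N, _, K = g.shape
    g_dir = g[np.arange(N), np.arange(N), :]                           # g_{n,n,k}
    direct = g_dir * p                                                 # g_{n,n,k} p_{n,k}
    intra = g_dir * (p.sum(axis=1, keepdims=True) - p)                 # intra-cell term
    tx_tot = p.sum(axis=1)                                             # sum_j p_{n',j}
    inter = np.einsum('jnk,j->nk', g, tx_tot) - g_dir * tx_tot[:, None]
    sinr = direct / (intra + inter + noise_power)
    rates = np.log2(1.0 + sinr)                                        # Eq. (5)
    return rates.sum(), rates                                          # Eq. (9)

# Toy usage: N = 3 cells, K = 2 users, Rayleigh fading with unit path loss.
N, K, fd, Ts = 3, 2, 10.0, 20e-3
h = (rng.standard_normal((N, N, K)) + 1j * rng.standard_normal((N, N, K))) / np.sqrt(2)
h = fading_step(h, rho(fd, Ts))
g = np.abs(h) ** 2                      # Eq. (1) with beta = 1
p = np.full((N, K), 0.5)                # powers within [0, P_max]
C, _ = sum_rate(g, p, noise_power=1e-3)
print(f"sum-rate: {C:.2f} bit/s/Hz")
```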


C. Distributed Optimization Problem

The optimization problem (6) is centralized. Using an RL algorithm, the current local CSI is first estimated and transmitted to the center agent for further processing. The decisions of allocated powers are then broadcast to the corresponding transmitters and executed. However, several shortcomings of the centralized framework having a number of cells must be observed:
1) Dimensionality problem: The cardinalities of the DNN input and output (I/O) are proportional to the cell number N, and training is difficult for such a DNN since the state-action space increases exponentially with the I/O dimensions. Additionally, exploration in a high-dimensional space is inefficient, and thus the learning can be impractical.
2) Transmission problem: The center agent requires the full CSI of the communication network at the current time. When the cell number N is large and low-latency service is required, both transmitting the CSI to the agent and broadcasting the allocation scheme to each transmitter become challenging.

Therefore, we propose to decentralize the power allocation scheme. The transmitter of each link is regarded as an agent, and all agents in the communication network operate synchronously and distributively. Meanwhile, the agent (n, k) only partially requires the channel information g^t_{n,k} to maximize its rate C^t_{n,k} with respect to its own power p^t_{n,k}, where g^t_{n,k} is defined as

g^t_{n,k} = \{g^t_{n',n,k} \mid n' \in \{n, D_n\}, \forall k\}.   (10)

We transform the problem (6) into a multi-objective programming, which is formally established as

\max_{p^t_{n,k}} \left\{ C^t_{n,k}(g^t_{n,k}, p^t_{n,k}) \mid \forall n, k \right\}
s.t. 0 \le p^t_{n,k} \le P_{max}, \forall n, k.   (11)

Multi-agent training is still difficult for solving problem (11), since it requires a large amount of learning data, training time, and DNN parameters. A framework of centralized training and distributed execution [36] was proposed to address these challenges. In this framework, all agents are regarded as the same agent; the same policy is shared and learned with collected data from all links. Owing to the fact that links in distinct areas are approximately identical (since their characteristics are location-invariant and the network is large), the policy can be shared in a paradigm of transfer learning. To train the network, we only train one policy and set the batch size equal to the number of users. The learned policy is executed by all the users, and each user is an agent. Therefore, the training is centralized and the execution is distributed. In multi-agent training, non-stationarity occurs when the policies of the other agents change during training. Meanwhile, non-stationarity does not occur in centralized training because all the agents share the same policy. In the centralized training, we still use independent Q-learning.

III. DEEP REINFORCEMENT LEARNING

A. Preliminary and Theoretical Analysis

A general MDP problem concerns a single agent or multiple agents interacting with an environment. In each interaction, the agent takes action a by policy π using the observed state s, and then receives a feedback reward r and a new state from the environment. The agent aims to find an optimal policy to maximize the cumulative reward over the continuous interactions, and the DRL algorithms are developed for such problems.

To facilitate the analysis, we consider the discrete-time model-based MDP where the action and state spaces are assumed to be finite. The four-tuple (S, A, P, R) is known, where the elements are
1) S, a finite set of states,
2) A, a finite set of actions,
3) P^a_{s \to s'} = Pr(s'|s, a), the probability that action a in state s will lead to state s',
4) R, a finite set of immediate rewards, where element r^a_{s \to s'} denotes the reward obtained after transitioning from state s to state s' due to action a.

Under a stochastic policy π, the T-step cumulative reward and the γ-discounted cumulative reward are considered as the state value function V. With an initial state s^1, the state value functions are defined as

V^T_\pi(s^1) := E_\pi\left[ \frac{1}{T} \sum_{t=1}^{T} r^t \,\Big|\, s^1 \right]   (12)

and

V^\gamma_\pi(s^1) := \lim_{T \to \infty} E_\pi\left[ \sum_{t=1}^{T} \gamma^{t-1} r^t \,\Big|\, s^1 \right]   (13)

where γ ∈ [0, 1) denotes a discount factor that balances the trade-off between immediate and future rewards, and E[·] is the expectation operation. For an initial state-action pair (s^1, a^1), the state-action value functions, namely the Q functions, are defined as

Q^T_\pi(s^1, a^1) := E_\pi\left[ \frac{1}{T} \sum_{t=1}^{T} r^t \,\Big|\, s^1, a^1 \right]   (14)

and

Q^\gamma_\pi(s^1, a^1) := \lim_{T \to \infty} E_\pi\left[ \sum_{t=1}^{T} \gamma^{t-1} r^t \,\Big|\, s^1, a^1 \right].   (15)
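As a small illustration of these definitions, the sketch below estimates the T-step average value V^T_π in (12) and the discounted value V^γ_π in (13) by Monte-Carlo rollouts on a randomly generated finite MDP under a uniform policy. The MDP sizes, the uniform policy, and the finite truncation horizon for the discounted sum are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 4, 3                                                  # toy |S| and |A|
P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)    # P^a_{s->s'}
R = rng.random((nS, nA, nS))                                   # r^a_{s->s'}
pi = np.full((nS, nA), 1.0 / nA)                               # uniform stochastic policy

def rollout(s, steps):
    """Sample one trajectory of per-step rewards under pi."""
    rewards = []
    for _ in range(steps):
        a = rng.choice(nA, p=pi[s])
        s_next = rng.choice(nS, p=P[s, a])
        rewards.append(R[s, a, s_next])
        s = s_next
    return np.array(rewards)

def V_T(s1, T, n_mc=2000):
    """Monte-Carlo estimate of Eq. (12): average of the first T rewards."""
    return np.mean([rollout(s1, T).mean() for _ in range(n_mc)])

def V_gamma(s1, gamma, horizon=200, n_mc=2000):
    """Monte-Carlo estimate of Eq. (13), truncated at a long finite horizon."""
    disc = gamma ** np.arange(horizon)
    return np.mean([disc @ rollout(s1, horizon) for _ in range(n_mc)])

print("V_T(s=0, T=5)     ~", round(V_T(0, T=5), 3))
print("V_gamma(s=0, 0.9) ~", round(V_gamma(0, 0.9), 3))
```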


From the perspective of MDP, we have the following conclusions when the environment satisfies certain conditions.

Theorem 1: When the environment transition is independent of the action, and the current action is only related to the reward function of this instant, the optimal policy for maximizing cumulative rewards is equivalent to a combination of single-step rewards.

Proof: First, we focus on (12) and expand it as

V^T_\pi(s^1) = \sum_{a^1 \in A} \pi(a^1|s^1) \sum_{s^2 \in S} P^{a^1}_{s^1 \to s^2} \left[ \frac{1}{T} r^{a^1}_{s^1 \to s^2} + \frac{T-1}{T} V^{T-1}_\pi(s^2) \right].   (16)

The description of the assumed conditions can be mathematically formulated as

P^a_{s \to s'} = P_{s \to s'},   (17)
r^a_{s \to s'} = r^a_s.   (18)

Without loss of generality, for the probability mass functions of the policy π and the state transition P, we have

\sum_{a \in A} \pi(a|s) = 1,   (19)
\sum_{s' \in S} P_{s \to s'} = 1.   (20)

From (17), (18), (19) and (20), the state value function (16) can be rewritten as

V^T_\pi(s^1) = \frac{1}{T} \sum_{a^1 \in A} \pi(a^1|s^1) r^{a^1}_{s^1} + \frac{T-1}{T} \sum_{s^2 \in S} P_{s^1 \to s^2} V^{T-1}_\pi(s^2).   (21)

The full expansion of (21) is expressed as

V^T_\pi(s^1) = \frac{1}{T} \sum_{a^1 \in A} \pi(a^1|s^1) r^{a^1}_{s^1} + \frac{1}{T} \sum_{t=2}^{T} \sum_{a^t \in A} \sum_{s^t \in S} \pi(a^t|s^t) \prod_{t'=1}^{t-1} P_{s^{t'} \to s^{t'+1}} \, r^{a^t}_{s^t}.   (22)

Since the state transfer is irrelevant to the action, the states can be independently sampled as s = ⟨s^1, · · · , s^{T+1}⟩, and the maximization of (22) with respect to a^t, ∀t can be rewritten as

\max_{a^t} V^T_\pi(s) = \max_{a^t} \frac{1}{T} \sum_{t'=1}^{T} \sum_{a^{t'} \in A} \pi(a^{t'}|s^{t'}) r^{a^{t'}}_{s^{t'}}
\iff \max_{a^t} \sum_{a^t \in A} \pi(a^t|s^t) r^{a^t}_{s^t}
= \max_{a^t} r^{a^t}_{s^t}.   (23)

Using (23), we can prove that the maximization of (22) with respect to {a^t | ∀t} can be decomposed into T subproblems:

\max_{\{a^t \mid \forall t\}} V^T_\pi(s) \iff \left\{ \max_{a^t} r^{a^t}_{s^t}, \; \forall t \right\}.   (24)

The equivalence proof for the γ-discounted cumulative reward is similar. ∎

Since the channel is modeled as a first-order Markov process, the environment satisfies the two conditions (17) and (18) in Theorem 1. Then we let r^{a^t}_{s^t} = C(g^t, p^t) and a^t = p^t with the power constraints. The centralized optimization problem (6) using the DRL approach is equivalent to (23), and this problem is unrelated to the value of T or γ. We will first make several observations before choosing an appropriate hyper-parameter value. We take the value-based method as an example, and the optimal Q function associated with the Bellman equation is given as

Q^*(s, a) = r^a_s + \gamma \max_{a'} Q(s', a').   (25)

The function in (25) must be estimated precisely to achieve the optimal action. Here we list two issues caused by γ > 0:
1) The Q value is overestimated, and the bias is γ max_{a'} Q(s', a'). This effect has little influence on the final performance, since this deviation is unrelated to the action a.
2) The variance of the Q value σ_q^2 becomes enlarged, and the noise σ_q^2 becomes larger as γ increases. During the training, the noise σ_q^2 on the data slows down the convergence speed and deteriorates the performance of the learned DNN.

B. Multi-Agent Setting

1) Without Coordinations: The conditions (17) and (18) are still true in multi-agent learning. When all the agents obtain the environment state s partially or fully, no messages are delivered between the agents. The value function (16) can be rewritten as

V^T_\pi(s) = \sum_{a} \prod_{i} \pi_i(a_i|f_i(s)) \sum_{s'} P_{s \to s'} \left[ \frac{1}{T} r^{a}_{s} + \frac{T-1}{T} V^{T-1}_\pi(s') \right]   (26)

where a is the action set of all agents, π(·) is the overall policy, s' is the next state, and a_i, π_i(·) and f_i(·) are respectively the action, policy and partially observed state of agent i. Apparently Theorem 1 is valid in multi-agent learning without coordinations.

2) With Coordinations: The performance of multi-agent RL can be greatly reduced without coordinations among the agents, especially in cooperative tasks. When we consider message delivery, the value function (26) can be rewritten as

V^T_\pi(s) = \sum_{a} \prod_{i} \pi_i(a_i|f_i(s), u_i) \sum_{s'} P_{s \to s'} \left[ \frac{1}{T} r^{a}_{s} + \frac{T-1}{T} V^{T-1}_\pi(s') \right]   (27)

where u_i is the received message of agent i, and it is transmitted by the other agents. Theorem 1 does not hold because u^t_i can contain information of the previous action a^{t-1} and state s^{t-1}, and these historical issues can influence the current agent's strategy.

Furthermore, we consider the impact of the changing environment. When the Doppler frequency f_d is large, the channel environment changes rapidly and the delivered message has little influence on the current policy making, so a small γ value is preferred, and vice versa. This implies that when the environment changes fast, Theorem 1 is still valid in the scenario which considers the multi-agent setting with coordinations.

Validation of Theorem 1 provides guidance for the setting of the γ value. To the authors' best knowledge, there exists no theoretically optimal setting of the hyper-parameter γ, and a proper γ value can be obtained through simulations. We have verified via simulations [34] that an increasing γ has a negative influence on the sum-rate performance of the deep Q network (DQN), as shown in Fig. 1. This implies that the influence of the agents' behaviours over time is negligible, and Theorem 1 is valid for problem (11) when we consider coordinations and a fast changing environment. Therefore, we suggest the hyper-parameter γ = 0 or T = 1 in this specific scenario, and thus Q(s, a) = r^a_s. In the remainder of this paper, we make an adjustment to the standard DRL algorithms, and claim that the Q function is equal to the reward function. The aforementioned analysis and discussion provide guidance for the DRL design.
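The claim of Theorem 1 can also be checked numerically on a toy finite MDP: when the transition probabilities ignore the action (17) and the reward depends only on the current state-action pair (18), the per-step greedy rule of (24) attains the same T-step value as the best policy found by exhaustive search. The brute-force sketch below performs such a check under these assumptions; it is illustrative only and not part of the paper's proof.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
nS, nA, T = 3, 3, 3
P = rng.random((nS, nS)); P /= P.sum(-1, keepdims=True)   # P_{s->s'}: action-independent, Eq. (17)
R = rng.random((nS, nA))                                    # r^a_s: depends on (s, a) only, Eq. (18)

def value_T(policy, s1):
    """Exact T-step average value of a deterministic stationary policy."""
    v, dist = 0.0, np.eye(nS)[s1]                 # state distribution at step t
    for _ in range(T):
        v += dist @ R[np.arange(nS), policy]      # expected one-step reward
        dist = dist @ P                           # environment evolves regardless of the action
    return v / T

greedy = R.argmax(axis=1)                         # per-step rule of Eq. (24): argmax_a r^a_s
best = max(itertools.product(range(nA), repeat=nS),
           key=lambda pol: value_T(np.array(pol), s1=0))

print("greedy value :", value_T(greedy, 0))
print("best value   :", value_T(np.array(best), 0))   # equals the greedy value
```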


C. On-Line Training

In the proposed two-step training framework [34], the DNN is first pre-trained off-line. Deriving a coarse learned policy in a simulated system can reduce the on-line training stress due to the large data requirement of the data-driven algorithm. However, off-line training will suffer from system model inaccuracy, including hardware imperfections, dynamic wireless channel environments and some unknown issues. Therefore, the agent must be trained on-line in the initial deployment to adapt to actual unknown issues that cannot be simulated. Second, the off-line learned DNN can be further fine-tuned in real networks. This two-step learning procedure is also called sim2real, which is a branch of transfer learning. To prevent prolonged performance degradation, a parameter update of the DNN according to the environment changes is also necessary.

One simple approach is to use continuous regular training, which leads to a waste of network performance and computation resources. On-line training is costly for several reasons. First, interaction with the real environment is required, and this exploration reduces the sum-rate performance of the communication system to some extent. Second, training requires high-performance computing to reduce the time cost, while the required hardware is expensive and power-hungry. On the one hand, training is unnecessary when the environment fluctuation is negligible, but on the other hand this method cannot timely respond to sudden changes.

Therefore, we propose an environment tracking mechanism as an efficient approach to control the agent training dynamically. For DRL algorithms, an environment shift indicates that the reward function R is changed, and thus the policy π or the Q function must be adjusted correspondingly to avoid performance degradation. Hence, the Q value needs to approximate the reward value r as accurately as possible. We define the normalized critic loss l^t_c as

l^t_c = \frac{1}{2 T_l} \sum_{t'=t-T_l+1}^{t} \left( 1 - \frac{Q(s^{t'}, a^{t'}; \theta)}{r^{t'}} \right)^2   (28)

where θ denotes the DNN parameter; T_l is the observation window; and l^t_c is an index to evaluate the accuracy of the Q function approximation to the actual environment. Once l^t_c exceeds a certain fixed threshold l_max, the training of the DNN is initiated to track the current environment; otherwise, the learning procedure is omitted. The introduction of the tracking mechanism achieves a balance between performance and efficiency. Combined with on-line training, the DRL is model-free and data-driven.

Fig. 1. The average sum-rate versus cellular network scalability for trained DQNs with different γ values (distributed power allocation scheme with coordinations, f_d = 10 Hz, ρ = 0.3).

IV. DRL ALGORITHM DESIGN

A. DRL Design

The designs of several DRL algorithms, namely REINFORCE, DQL and DDPG, will be introduced in this subsection. First, as an expansion of Section II-C, the designs of the state, reward and action are given.

1) State: The selection of environment information is significant, and obviously the current partial CSI g^t_{n,k} is the most critical feature. It is inappropriate to use g^t_{n,k} directly as the DNN input, and we propose a logarithmic normalized expression of g^t_{n,k} [34] as

\Gamma^t_{n,k} := \frac{1}{g^t_{n,k,k}} g^t_{n,k} \otimes 1_K   (29)

where ⊗ is the Kronecker product, and 1_K is a vector filled with K ones. The channel amplitude elements in g^t_{n,k} are normalized by that of the downlink dl_{n,k}, and the logarithmic representation is preferred since the amplitudes often vary by orders of magnitude. The cardinality of Γ^t_{n,k} is (|D_n| + 1)K, and it changes with varying AP densities. First, we define the sorting function

\tilde{x}, i := \mathrm{sort}(x, y)   (30)

where the set x is sorted in decreasing order, and the first y elements are selected as the new set x̃. The indices of the chosen components are denoted by i. To further reduce the input dimension and accommodate different AP densities, the new set Γ̃^t_{n,k} and its indices I^t_{n,k} are obtained by (30) with x = Γ^t_{n,k} and y = I_c, where I_c is a constant. This trick is based on the fact that the number of main interference signals is usually far less than the total, i.e., I_c ≪ (|D_n| + 1)K − 1, while the other interference signals are close to zero. Therefore, the sorting function is used to approximate the interference term of the denominator in (4), and simultaneously reduce the state space.

The channel is modeled as a Markov process which is correlated in the time domain. Therefore, the last solution p^{t-1}_{n,k} provides an improved initialization, and p̃^{t-1}_{n,k} \ p^{t-1}_{n,k} provides coordinations from the interference cells, where \ denotes the set difference operation. Corresponding to Γ̃^t_{n,k}, the last power set p̃^{t-1}_{n,k} is defined as

\tilde{p}^{t-1}_{n,k} := \{p^{t-1}_{n,k} \mid (n, k) \in I^t_{n,k}\}.   (31)

The irrelevant or weakly correlated input elements consume more computational resources and can even lead to performance degradation. However, some auxiliary information can improve the sum-rate performance of the DNN. Similar to (31), the assisted feature is given by

\tilde{C}^{t-1}_{n,k} := \{C^{t-1}_{n,k} \mid (n, k) \in I^t_{n,k}\}.   (32)

Two types of feature f are considered, and they are defined as

f_1 := \{\tilde{\Gamma}^t_{n,k}, \tilde{p}^{t-1}_{n,k}\},   (33)
f_2 := \{\tilde{\Gamma}^t_{n,k}, \tilde{p}^{t-1}_{n,k}, \tilde{C}^{t-1}_{n,k}\}.   (34)

The partially observed state s for the DRL algorithms can be f_1 or f_2, and their performance will be compared in Section V-B.2. Moreover, the state cardinalities |S| of f_1 and f_2 are 2I_c and 3I_c, respectively.
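The state construction in (29)-(34) (normalize the local gains by the direct-link gain, keep the I_c strongest entries, and append the corresponding previous powers and rates) can be sketched as follows. The flat array layout, the explicit dB conversion used to realize the logarithmic representation, and the helper names are assumptions for illustration rather than the authors' code.

```python
import numpy as np

def sort_top(x, y):
    """Eq. (30): keep the y largest elements of x and return them with their indices."""
    idx = np.argsort(x)[::-1][:y]
    return x[idx], idx

def build_state(g_local, g_direct, p_prev, c_prev, Ic, use_f2=True):
    """Assemble feature f1 or f2 of Eqs. (33)-(34) for one agent (n, k).

    g_local:  ((|D_n|+1)*K,) local gains g^t_{n,k} of Eq. (10), linear scale.
    g_direct: scalar direct-link gain used for the normalization in Eq. (29).
    p_prev:   previous powers aligned with g_local (same ordering).
    c_prev:   previous rates aligned with g_local.
    """
    gamma = 10.0 * np.log10(g_local / g_direct + 1e-20)   # normalized gains, log scale
    gamma_top, idx = sort_top(gamma, Ic)                   # Eq. (30): strongest Ic entries
    p_top = p_prev[idx]                                    # Eq. (31)
    if not use_f2:
        return np.concatenate([gamma_top, p_top])          # f1, Eq. (33): 2*Ic entries
    c_top = c_prev[idx]                                    # Eq. (32)
    return np.concatenate([gamma_top, p_top, c_top])       # f2, Eq. (34): 3*Ic entries

# Toy usage with |D_n| = 18 interfering cells, K = 4 users, Ic = 16 as in Section V.
rng = np.random.default_rng(3)
L = (18 + 1) * 4
s = build_state(rng.random(L), 0.5, rng.random(L), rng.random(L), Ic=16)
print(s.shape)   # (48,) for feature f2
```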


2) Reward: There is little work on the optimal design of the reward function due to the problem complexity. In general, the reward function is designed elaborately to improve the agent's transmitting rate and mitigate its interference to neighbouring links [26]. In a previous work [34], we used the averaged sum-rate (9) as the reward, so that the sum of all rewards is equal to the network sum-rate. However, rate information of remote cells is introduced, and it has little relationship with the decision of action p^t_{n,k}. These irrelevant elements enlarge the variance of the reward function, and thus the DNN becomes difficult to train when the network becomes large. In contrast, an agent does not consider its impact on adjacent links in problem (11). As a trade-off, we propose a localized reward function as

r^t_{n,k} := C^t_{n,k} + \alpha \left( \sum_{\bar{k} \neq k} C^t_{n,\bar{k}} + \sum_{n' \in D_n, j} C^t_{n',j} \right)   (35)

where α ∈ R^+ is a weight coefficient of the interference effect, and R^+ denotes the positive real scalars. The sum of the local rewards is proportional to the sum-rate

\sum_{n,k} r^t_{n,k} \propto C(g^t, p^t)   (36)

when the cell number N is sufficiently large.

3) Action: The downlink power is a non-negative continuous scalar, and it is limited by the maximum power P_max. Since the action space must be finite for certain algorithms such as DQL and REINFORCE, the possible emitting power is quantized to |A| levels. We choose the logarithmic normalization as the power quantization, since the emitting power also ranges over several orders of magnitude (usually described in decibels). In fact, a logarithmic quantizer in the linear space is also a linear quantizer in the exponential domain. The allowed power set is given as

A := \left\{ 0, \; P_{min} \left( \frac{P_{max}}{P_{min}} \right)^{\frac{i}{|A|-2}} \;\Big|\; i = 0, \cdots, |A|-2 \right\}   (37)

where P_min is the non-zero minimum emitting power. Discretization of a continuous variable results in quantization error. Meanwhile, the actor of DDPG directly outputs the deterministic action a = p^t_{n,k}, and this constrained continuous scalar is generated by a scaled sigmoid function:

a := P_{max} \cdot \frac{1}{1 + \exp(-x)}   (38)

where x denotes the pre-activation output, which is a real-valued scalar produced by x = f(W \bar{x} + b), where W, b, f(·) are respectively the weight, bias and activation function of the current layer, and \bar{x} is the output of the previous layer.

Except for eliminating the quantization error, DDPG has great potential for solving multi-action problems. We take a task with action number N_A as an example, and the output dimension of DDPG is |A| = N_A. While for both DQL and REINFORCE, we have |A| = \prod_{i}^{N_A} |A_i|. Since the action space increases exponentially, solving multi-action problems using such algorithms is challenging.

4) Experience Replay: In this work we do not recommend experience replay for three reasons. First, experience replay is suitable when the data are correlated and non-stationary in MDPs, while the training samples for the DNN are suggested to be independently and identically distributed (I.I.D.). When the Doppler frequency f_d is large, the data correlation in the time domain is weak. Second, in multi-agent learning, non-stationarity can occur when the other agents' policies change over time. However, in centralized training all the agents share the same policy, i.e., π_i = π_j, ∀i, j. This indicates that non-stationarity does not occur in multi-agent learning using centralized training. Third, experience replay requires memory to store past samples.
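The localized reward (35) and the two action parameterizations, the logarithmic power codebook (37) for DQL/REINFORCE and the scaled sigmoid (38) for DDPG, can be expressed in a few lines. The sketch below assumes dBm-valued power limits converted to watts and treats the neighbour rate terms as precomputed inputs; it is an illustration, not the authors' implementation.

```python
import numpy as np

def dbm_to_watt(p_dbm):
    return 10.0 ** ((p_dbm - 30.0) / 10.0)

def local_reward(c_own, c_intra_neighbors, c_inter_neighbors, alpha=1.0):
    """Eq. (35): own rate plus alpha times the rates of intra- and inter-cell neighbours."""
    return c_own + alpha * (np.sum(c_intra_neighbors) + np.sum(c_inter_neighbors))

def power_codebook(p_min, p_max, levels):
    """Eq. (37): zero power plus (levels - 1) logarithmically spaced power levels."""
    i = np.arange(levels - 1)
    return np.concatenate([[0.0], p_min * (p_max / p_min) ** (i / (levels - 2))])

def ddpg_action(x, p_max):
    """Eq. (38): scaled sigmoid mapping a real-valued actor output to [0, P_max]."""
    return p_max / (1.0 + np.exp(-x))

p_min, p_max = dbm_to_watt(5.0), dbm_to_watt(38.0)     # 5 dBm and 38 dBm as in Section V
A = power_codebook(p_min, p_max, levels=10)            # |A| = 10 discrete powers
print(np.round(10.0 * np.log10(A[1:] * 1e3), 1))       # evenly spaced in dBm: 5.0 ... 38.0
print(round(ddpg_action(0.0, p_max), 3))               # mid-range continuous action
```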


Fig. 2. An illustration of the data flow graph with REINFORCE and DQL (feature f_2).

B. Policy-Based: REINFORCE

REINFORCE is derived as a Monte-Carlo policy-gradient learning algorithm [37], [38]. Policy-based algorithms directly generate a stochastic policy π, which is a probability vector produced by a parameterized policy network π(a|s; θ_π) with parameter θ_π, as shown in Fig. 2. The overall strategy of stochastic gradient ascent requires samples such that the expectation of the sample gradient is proportional to the actual gradient of the performance measure as a function of the parameter. The goal of REINFORCE is to maximize the expected rewards under policy π:

\theta^*_\pi = \arg\max_{\theta_\pi} E_\pi\left[ \sum_{a} \pi(a|s; \theta_\pi) r^a_s \right]   (39)

where π(a|s; θ_π) denotes the policy network, and θ_π is its parameter. The gradient of (39) with Monte-Carlo sampling [16] is presented as

\nabla \theta_\pi = E_\pi\left[ \nabla_{\theta_\pi} \ln \pi(a|s; \theta_\pi) r^a_s \right] \big|_{s=s^t, a=a^t}   (40)

where ∇ is the gradient operation. Since the policy network π(a|s; θ_π) directly generates a stochastic policy, the random action is selected following π(a|s; θ_π) during learning, and the optimal action a* is selected following the learned policy π*:

a^* \sim \pi^*(a|s; \theta_\pi)   (41)

after learning. REINFORCE does not rely on the hyper-parameter ε of the ε-greedy strategy during exploration. When the optimal policy is deterministic, choosing a* is equivalent to selecting the one having the maximum probability.

In practical training, the algorithm is susceptible to reward scaling. We can reduce this dependency by whitening the rewards before computing the gradients, and the normalized reward r̃ is given as

\tilde{r} = \frac{r - \mu_r}{\sigma_r}   (42)

where μ_r and σ_r are the mean value and standard deviation of the reward r, respectively. The proposed REINFORCE algorithm is outlined in Algorithm 1.

Algorithm 1 REINFORCE Algorithm
1: Input: Episode times N_e, exploration times T, learning rate η_π.
2: Initialization: Initialize the policy network π(a|s; θ_π) with random parameter θ_π.
3: for k = 1 to N_e do
4:   Receive initial state s^1.
5:   for t = 1 to T do
6:     Select action a^t following π(a|s; θ_π).
7:     Execute action a^t, achieve reward r^t, and observe the new state s^{t+1}.
8:     Calculate r̃ by (42).
9:     Calculate the gradient ∇θ_π by (40), and update the parameter along the positive gradient direction: θ_π ← θ_π + η_π ∇θ_π.
10:    s^t ← s^{t+1}.
11:  end for
12: end for
13: Output: Learned policy network π(a|s; θ_π).

C. Value-Based: DQL

DQL is one of the most popular value-based off-policy DRL algorithms. As shown in Fig. 2, the topologies of DQL and REINFORCE are the same, and the values are estimated by a DQN Q(s, a; θ_q), where θ_q denotes the parameter. The selection of a good action relies on accurate estimation, and thus DQL aims to search for the optimal parameter θ*_q to minimize the ℓ_2 loss, i.e.,

\theta^*_q = \arg\min_{\theta_q} \frac{1}{2} \left( Q(s, a; \theta_q) - r^a_s \right)^2.   (43)

The gradient of θ_q is given as

\nabla \theta_q = \left( Q(s, a; \theta_q) - r^a_s \right) \nabla_{\theta_q} Q(s, a; \theta_q).   (44)

The optimal action a* is selected to maximize the Q value, and it is given by

a^* = \arg\max_{a} Q(s, a; \theta_q).   (45)

During the training, a dynamic ε-greedy policy is adopted to control the exploration probability, and ε_k is defined as

\varepsilon_k := \varepsilon_1 + \frac{k-1}{N_e - 1} (\varepsilon_{N_e} - \varepsilon_1), \quad k = 1, \cdots, N_e   (46)

where N_e denotes the episode times, and ε_1 and ε_{N_e} are the initial and final exploration probabilities, respectively. A description of the DQL algorithm is presented in Algorithm 2.

Algorithm 2 DQL Algorithm
1: Input: Episode times N_e, exploration times T, learning rate η_q, initial and final exploration probabilities ε_1, ε_{N_e}.
2: Initialization: Initialize the DQN Q(s, a; θ_q) with random parameter θ_q.
3: for k = 1 to N_e do
4:   Update ε_k by (46).
5:   Receive initial state s^1.
6:   for t = 1 to T do
7:     if rand() < ε_k then
8:       Randomly select action a^t ∈ A with uniform probability.
9:     else
10:      Select action a^t by (45).
11:    end if
12:    Execute action a^t, achieve reward r^t, and observe the new state s^{t+1}.
13:    Calculate the gradient ∇θ_q by (44), and update the parameter along the negative gradient direction: θ_q ← θ_q − η_q ∇θ_q.
14:    s^t ← s^{t+1}.
15:  end for
16: end for
17: Output: Learned DQN Q(s, a; θ_q).

Fig. 3. An illustration of the data flow graph with DDPG (feature f_2).

D. Actor-Critic: DDPG

DDPG is presented as an actor-critic model-free algorithm based on the deterministic policy gradient, operating over continuous action spaces. As shown in Fig. 3, an actor generates a deterministic action a from the observation s by a mapping network A(s; θ_a), where θ_a denotes the actor parameter. The critic predicts the Q value of an action-state pair through a critic network C(s_c, a; θ_c), where θ_c denotes the critic parameter and s_c is the critic state. The critic and the actor work cooperatively, and the optimal deterministic policy is achieved by solving the following joint optimization problem:

\theta^*_a = \arg\max_{\theta_a} C(s_c, a; \theta_c) \big|_{a=A(s;\theta_a)}   (47)

and

\theta^*_c = \arg\min_{\theta_c} \frac{1}{2} \left( C(s_c, a; \theta_c) \big|_{a=A(s;\theta_a)} - r^a_s \right)^2.   (48)

The actor strives to maximize the evaluation from the critic, and the critic aims to make a precise assessment. Both the actor and the critic are differentiable, and by the chain rule their gradients are given as

\nabla \theta_a = \nabla_a C(s_c, a; \theta_c) \big|_{a=A(s;\theta_a)} \nabla_{\theta_a} A(s; \theta_a),   (49)
\nabla \theta_c = \left( C(s_c, a; \theta_c) - r^a_s \right) \nabla_{\theta_c} C(s_c, a; \theta_c) \big|_{a=A(s;\theta_a)}.   (50)

The deterministic action is directly obtained by the actor

a^* = A(s; \theta_a).   (51)

Similar to the dynamic ε-greedy policy, the exploration action in episode k is defined as

a := \left[ A(s; \theta_a) + n_k \right]_0^{P_{max}}   (52)

where n_k is an added noise and it follows a uniform distribution:

n_k \sim U\left(-\frac{P_{max}}{k}, \frac{P_{max}}{k}\right)   (53)

and the action a is bounded by the interval [0, P_max].

The critic C(s_c, a; θ_c) can be regarded as an auxiliary network to transfer the gradient in learning, and it is not involved in action generation after learning. The critic C(s_c, a; θ_c) must be differentiable, but not necessarily trainable. The critic is model-based in this approach, since the evaluation rules are available with (4), (5) and (35) in off-line training.

However, the model-based critic is infeasible to accommodate the unknown issues in on-line training. Meanwhile, the complex reward function is difficult to approximate accurately with NN parameters. Therefore, a semi-model-free critic is suggested, and it uses both the prior knowledge and the flexibility of the NN. Similar to the preprocessing of s, the state for the critic s_c = C̃^t_{n,k} is obtained by (4), (5), (30) and (32) with x = {C^t_{n,k} | ∀n, k} and y = I_c. The DDPG algorithm is described in Algorithm 3.

Algorithm 3 DDPG Algorithm
1: Input: Episode times N_e, exploration times T, actor learning rate η_a, critic learning rate η_c.
2: Initialization: Initialize the actor A(s; θ_a) and the critic C(s, a; θ_c) with random parameters θ_a and θ_c.
3: for k = 1 to N_e do
4:   Receive initial state s^1, and obtain s^1_c with (4) and (5).
5:   for t = 1 to T do
6:     Obtain action a^t by (52).
7:     Execute action a^t, achieve reward r^t, observe the new state s^{t+1}, and obtain s^{t+1}_c with (4) and (5).
8:     Calculate the critic gradient ∇θ_c by (50), and update the parameter along the negative gradient direction: θ_c ← θ_c − η_c ∇θ_c.
9:     Calculate the actor gradient ∇θ_a by (49), and update the parameter along the positive gradient direction: θ_a ← θ_a + η_a ∇θ_a.
10:    s^t ← s^{t+1}.
11:    s^t_c ← s^{t+1}_c.
12:  end for
13: end for
14: Output: Learned actor A(s; θ_a) and critic C(s, a; θ_c).

The policy gradient algorithm is developed with a stochastic policy π(a|s), but it is inefficient to sample in a continuous or high-dimensional action space. The deterministic policy gradient is proposed to overcome this problem. On the other hand, in contrast with the value-based DQL, the critic C(s, a; θ_c) and the Q value estimator Q(s, a; θ_q) are similar in terms of value estimation. The difference is that the critic takes both a and s as input and then predicts the Q value, while Q(s, a; θ_q) estimates the corresponding Q values of all actions with the input s.

V. SIMULATION RESULTS

A. Simulation Configuration

In the training procedure, a cellular network with N = 25 cells is considered. In each cell, K = 4 APs are located uniformly and randomly within the range [R_min, R_max], where R_min = 0.01 km and R_max = 1 km are the inner space and the half cell-to-cell distance, respectively. The Doppler frequency f_d = 10 Hz and the time period T_s = 20 ms are adopted to simulate the fading effects. According to the LTE standard, the large-scale fading is modeled as

\beta = -120.9 - 37.6 \log_{10} d + 10 \log_{10} z   (54)

where the log-normal random variable z follows ln z ~ N(0, σ_z^2) with σ_z^2 = 8 dB, and d is the link distance. The additive white Gaussian noise (AWGN) power σ^2 is −114 dBm, and the emitting power constraints P_min and P_max are 5 dBm and 38 dBm, respectively. Besides, the maximum SINR is restricted to 30 dB.

The cardinality of adjacent cells is |D_n| = 18, ∀n, the first I_c = 16 interferers remain, and the power level number is |A| = 10. Therefore, the input state dimensions |S| with features f_1 and f_2 are 32 and 48, respectively. The weight coefficient is α = 1. In episode k, the large-scale fading is invariant. The number of episodes is N_e = 5000, and the number of time slots per episode is T = 10. A large N_e value and a small T value can reduce the data correlation over time, and thus the convergence is guaranteed. The Adam [39] algorithm is adopted as the optimizer for all DRLs. In Table I, we list the hyper-parameter settings, and the activation function f(·) and neuron number of each DNN layer. The activation functions are the rectified linear unit (ReLU): f(x) = max(0, x), linear: f(x) = x, and softmax: f(x)_i = e^{x_i} / Σ_j e^{x_j}, respectively. These default settings will be specified once they are changed in the following simulations. The simulation codes used in obtaining the numerical experiments are available at https://github.com/mengxiaomao/PA_TWC.

TABLE I. Hyper-Parameters Setup and DNN Setting.

B. DRL Algorithm Comparison

In this subsection, the sum-rate performance is studied for REINFORCE, DQL and DDPG, in terms of experience replay, feature selection and quantization error. The notations σ_c^2, C̄ and C̄* are defined as the variance of the sum-rate, the average sum-rate, and the average sum-rate of the top 20% over independent repetitive experiments, respectively. The metric C̄* is an indicator to measure the performance of well-trained algorithms.

1) Experience Replay: Since the parameter initialization and data generation are stochastic, the performance of the DRL algorithms can be influenced to varying degrees. In Table II,¹ REINFORCE and experience replay are abbreviated as RF and ER, respectively. Generally, the experience replay helps the DRLs reduce the variance of the sum-rate σ_c^2 and improve the average sum-rate C̄, but its influence on C̄* is negligible.

TABLE II. Comparison of Different DRL Designs.

Fig. 4. Average best sum-rate C̄* of different DRL algorithms, with or without the experience replay technique.

The variance σ_c^2 of REINFORCE is the largest, and we find it challenging to stabilize the training results even with experience replay and the normalization in (42). In contrast, the DQL is much more stable. Meanwhile, σ_c^2 of DDPG is the smallest, up to one or more orders of magnitude lower than that of REINFORCE. This indicates that DDPG has strong robustness to random issues. Moreover, DDPG achieves the highest C̄*. In general, both REINFORCE and DQL have almost the same C̄*, and REINFORCE performs slightly better than DQL but has weaker stabilization. The DDPG outperforms these two algorithms in terms of both sum-rate performance and robustness.

¹The proposed DDPG is not applicable to experience replay and thus the corresponding simulation result is omitted.

2) Feature Engineering: Next we compare the performance of C̄* with feature f_1 or f_2. As shown in Table II and Fig. 4, the assisted information C̃^{t-1}_{n,k} in f_2 generally improves the average sum-rate C̄. Besides, the improvement on C̄* is notable, especially for the DDPG algorithm. We speculate that the mapping function is difficult to approximate for a simple NN, due to the multiplication, division and exponentiation operations in (4) and (5). Meanwhile, the variance σ_c^2 is increased with the additional feature. Therefore, there is a trade-off between complexity and performance when we select between the two features. The simplified feature f_1 is preferable for mobile terminals when the data transmission is restricted and computational resources are costly. On the other hand, an improved performance can be achieved when we use feature f_2, at the expense of more data transmission and computational resources.

3) Quantization Error: Quantization error can be gradually reduced by increasing the number of quantization bits. Therefore, the number of power levels is |A| ∈ {3, 6, 10, 14, 20, 40} in this designed experiment, and C̄* is used as the measurement. As illustrated in Fig. 5, the values of C̄* for REINFORCE and DQL both rise slightly as |A| increases from 3 to 10. However, a further increase of the output dimension cannot improve the sum-rate performance. The value of C̄* for DQL drops slowly, while that of REINFORCE experiences a dramatic decline from 1.54 bps to 1.19 bps, as |A| increases from 14 to 40. This indicates that the huge action space can lead to difficulties in practical training, especially for REINFORCE, and also that full elimination of the quantization error is infeasible by simply enlarging the action space. In addition, DDPG does not require discretization of the space by nature, and it outperforms both the DQL and REINFORCE algorithms.

Fig. 5. Plot of C̄* versus power level number |A| (using feature f_1).
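The comparison metrics σ_c^2, C̄, and C̄* defined above are straightforward to compute from a set of repeated training runs; a minimal sketch, assuming the per-run average sum-rates are collected in a NumPy array with hypothetical values, is given below.

```python
import numpy as np

def drl_metrics(run_sum_rates, top_frac=0.2):
    """Variance, mean, and top-20% mean of per-run average sum-rates (Section V-B)."""
    x = np.asarray(run_sum_rates, dtype=float)
    k = max(1, int(round(top_frac * x.size)))
    top = np.sort(x)[::-1][:k]
    return x.var(), x.mean(), top.mean()      # sigma_c^2, C-bar, C-bar*

runs = np.array([1.42, 1.51, 1.48, 1.39, 1.55, 1.47, 1.50, 1.44, 1.53, 1.46])  # hypothetical
var_c, mean_c, best_c = drl_metrics(runs)
print(round(var_c, 4), round(mean_c, 3), round(best_c, 3))
```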


C. Generalization Performance

In the following simulations, we select feature f_2 and compare the C̄* performance of the learned DNNs for further study. In the previous subsection, we mainly focused on comparisons between different DRL algorithms, and the training set and testing set are I.I.D. However, the statistical characteristics in real scenarios vary over time, and it is impractical to track the environment with frequent on-line training. Therefore, a good generalization performance is significant for being robust against changing issues. The FP, WMMSE, maximum power and random power schemes are considered as benchmarks to evaluate our proposed DRL algorithms.

Fig. 6. Minimum cell range R_min.

Fig. 7. Half cell-to-cell range R_max.

1) Cell Range: Both the minimum cell range R_min and the half cell-to-cell range R_max are considered. As shown in Fig. 6, the expected SINR of adjacent users (near the BS) becomes smaller as the minimum cell range increases, and thus the average sum-rate decreases. In Fig. 7, both the intra-cell interference and the inter-cell interference are stronger as the cell range becomes smaller, and thus the average sum-rate decreases. The sum-rate performance of the random/maximum power schemes is the lowest, while FP [5] and WMMSE [6] achieve much higher spectral efficiency. The performances of these two algorithms are comparable, and WMMSE performs slightly better than FP. In contrast, all the data-driven algorithms outperform the model-driven methods, and the proposed actor-critic-based DDPG achieves the highest sum-rate value. Additionally, the learned models are obtained in the simulation environment having the fixed ranges R_min = 0.01 km and R_max = 1 km, but performance degradation in these unknown scenarios cannot be observed. Therefore, the learned data-driven models using the proposed algorithms show good generalization performance in terms of varying cell ranges.

The conventional optimization algorithms such as WMMSE and FP sacrifice optimality for tractability, and suboptimal solutions are often obtained. As a sub-discipline of heuristic optimization, the data-driven methods such as DRL interact with the environment and learn to improve the policy through exploration and exploitation. Therefore, it is possible for the DRL algorithms to attain solutions closer to the optimal performance than the conventional optimization algorithms.

Fig. 8. Average sum-rate per AP versus different user densities.

2) User Density: In a practical scenario, the user density can change over time and location, so it is considered in this simulation. The user density is changed by the number of APs per cell K, which ranges from 1 to 8. As plotted in Fig. 8, the average sum-rate drops as the users become denser, and all the algorithms have a similar trend. Apparently, the DRL approaches outperform the other schemes, and DDPG again achieves the best sum-rate performance. Hence, the simulation result shows that the learned data-driven models also exhibit good generalization performance over different user densities.

Fig. 9. Average sum-rate versus different Doppler frequencies f_d.

3) Doppler Frequency: The Doppler frequency f_d is a significant variable related to the small-scale fading. Since the information at the last instant is used for the current power allocation, fast fading can lead to performance degradation for our proposed data-driven models. Meanwhile, the model-driven algorithms are not influenced by f_d. The Doppler frequency f_d is sampled from 4 Hz to 50 Hz, and the corresponding correlation coefficient ρ ranges from 0.920 to 0.001. The simulation results in Fig. 9 show that the average sum-rates of the data-driven algorithms drop slowly in this f_d range. This indicates that the data-driven models are robust against the Doppler frequency f_d.


Fig. 10. Average sum-rate versus maximum emitting power Pmax.

4) Maximum Emitting Power: The DRL-based agents trained using a fixed Pmax = 38 dBm are tested under different maximum emitting powers Pmax, and the simulation results in Fig. 10 show that the average sum-rates of the data-driven algorithms remain high when Pmax ≤ 38 dBm. However, the sum-rate performance of the DRL-based agents is significantly reduced when Pmax > 38 dBm, and it becomes lower than that of WMMSE when Pmax ≥ 47 dBm. This implies that the data-driven models are robust in scenarios where Pmax does not exceed the value used during training.
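As a small illustration of this test setup, the sketch below converts a power budget from dBm to watts and maps a normalized policy output to a transmit power under a new budget; the "normalized action in [0, 1]" convention and the helper names are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch of re-using a policy trained at Pmax = 38 dBm under a
# different test-time budget. The normalized-action convention below is a
# hypothetical illustration; the paper does not prescribe this mapping.
def dbm_to_watt(p_dbm: float) -> float:
    """Convert a power level from dBm to watts."""
    return 10.0 ** ((p_dbm - 30.0) / 10.0)

def action_to_power(a: float, p_max_dbm: float) -> float:
    """Scale a normalized action a in [0, 1] to a transmit power in watts."""
    a = min(max(a, 0.0), 1.0)          # clip to the valid action range
    return a * dbm_to_watt(p_max_dbm)  # never exceeds the current budget

if __name__ == "__main__":
    for p_max in (30.0, 38.0, 47.0):   # example test budgets in dBm
        print(p_max, "dBm budget ->", round(action_to_power(0.7, p_max), 3), "W")
```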
D. Computation Complexity

TABLE III
AVERAGE TIME COST PER EXECUTION Tc (sec)

Low computational complexity is crucial for algorithm deployment, and it is considered here. The simulation platform consists of an Intel i7-6700 CPU and an Nvidia GTX-1070Ti GPU. There are 100 APs in the simulated cellular network, and the time cost per execution Tc of our proposed distributed algorithms and of the centralized model-based methods is listed in Table III. Interestingly, the computation time with the GPU is higher than that with the CPU; we attribute this to the GPU not being fully utilized by the small-scale DNNs under distributed execution.2 The time costs of the three DRL algorithms are almost the same owing to their similar DNN models, and in terms of CPU time alone they are about 15.5 and 61.0 times faster than FP and WMMSE, respectively. The fast execution speed with DNN tools can be explained as follows.

1) The execution of the proposed algorithms is distributed, and thus the time expense remains constant as the total number of users NK increases, provided multiple computing devices (equal in number to NK) are used.
2) Most of the operations in DNNs involve matrix multiplications and additions, which can be accelerated by parallel computation. Besides, the simple and efficient activation function ReLU: max(·, 0) is adopted.

In summary, the low computational time cost of the proposed DRL algorithms can be attributed to the distributed execution framework, the parallel computing architecture, and the simple, efficient activation function.

2 The common batch operation cannot be used under distributed execution in a real scenario.
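To make the cost argument concrete, the sketch below shows the kind of per-agent workload incurred at execution time: a small fully connected forward pass built only from matrix products, additions, and ReLU. The layer widths, state dimension, and random weights are assumed values for illustration and are not taken from the paper.

```python
# Minimal sketch of one agent's execution-time workload: a small MLP forward
# pass using only matrix multiplication, addition, and ReLU. Layer widths and
# the state dimension are assumed for illustration; each of the N*K agents
# would run this independently (no batching across agents).
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [50, 128, 64, 1]     # assumed: state dim -> hidden layers -> power output

# Randomly initialized parameters stand in for a trained policy network.
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(state: np.ndarray) -> np.ndarray:
    """One distributed execution step: local observation in, raw power decision out."""
    x = state
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ W + b, 0.0)     # ReLU: max(., 0)
    return x @ weights[-1] + biases[-1]    # linear output layer

state = rng.standard_normal(layer_sizes[0])  # one agent's local observation
print(forward(state))
```

Because each agent's network is this small, the matrix products dominate the cost and are cheap even on a CPU, which is consistent with the GPU offering no speed-up under distributed, unbatched execution.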
VI. CONCLUSION

The distributed power allocation in wireless cellular networks with SISO IBC was investigated using the proposed DRL algorithms. We presented a mathematical analysis of the design and application of DRL algorithms at a system level by considering inter-cell cooperation, off-line/on-line training, and distributed execution. The concrete algorithm design was further introduced to cope with a near-static problem. In theory, the sum-rate performances of the DQL and REINFORCE algorithms are the same, and DDPG outperforms these two methods by eliminating the quantization error. The simulation results agree with our expectation, and DDPG performs the best in terms of both sum-rate performance and robustness. Besides, all the data-driven approaches outperform the state-of-the-art model-based methods, and they also show improved generalization performance and lower computational time cost in a series of experiments.

The data-driven algorithms, especially DRL, are promising techniques for future intelligent networks. The proposed DDPG algorithm can handle general tasks with discrete or continuous state/action spaces as well as joint optimization over multiple variables, and it can therefore be applied to many problems, such as user scheduling, channel management, and power allocation, in various communication networks.

REFERENCES

[1] R. Q. Hu and Y. Qian, “An energy efficient and spectrum efficient wireless heterogeneous network framework for 5G systems,” IEEE Commun. Mag., vol. 52, no. 5, pp. 94–101, May 2014.
[2] H. Zhang, X. Chu, W. Guo, and S. Wang, “Coexistence of Wi-Fi and heterogeneous small cell networks sharing unlicensed spectrum,” IEEE Commun. Mag., vol. 53, no. 3, pp. 158–164, Mar. 2015.
[3] Z.-Q. Luo and S. Zhang, “Dynamic spectrum management: Complexity and duality,” IEEE J. Sel. Topics Signal Process., vol. 2, no. 1, pp. 57–73, Feb. 2008.
[4] F. Boccardi, R. W. Heath, A. Lozano, T. L. Marzetta, and P. Popovski, “Five disruptive technology directions for 5G,” IEEE Commun. Mag., vol. 52, no. 2, pp. 74–80, Feb. 2014.
[5] K. Shen and W. Yu, “Fractional programming for communication systems—Part I: Power control and beamforming,” IEEE Trans. Signal Process., vol. 66, no. 10, pp. 2616–2630, May 2018.
[6] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2011, pp. 4331–4340.
[7] M. Chiang, P. Hande, T. Lan, and C. W. Tan, “Power control in wireless cellular networks,” Found. Trends Netw., vol. 2, no. 4, pp. 381–533, Jan. 2007.
[8] H. Zhang, L. Venturino, N. Prasad, P. Li, S. Rangarajan, and X. Wang, “Weighted sum-rate maximization in multi-cell networks via coordinated scheduling and discrete power control,” IEEE J. Sel. Areas Commun., vol. 29, no. 6, pp. 1214–1224, Jun. 2011.
[9] W. Yu, T. Kwon, and C. Shin, “Multicell coordination via joint scheduling, beamforming, and power spectrum adaptation,” IEEE Trans. Wireless Commun., vol. 12, no. 7, pp. 1–14, Jul. 2013.
[10] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.
[11] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cognit. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.

[12] T. Wang, C.-K. Wen, H. Wang, F. Gao, T. Jiang, and S. Jin, “Deep learning for wireless physical layer: Opportunities and challenges,” China Commun., vol. 14, no. 11, pp. 92–111, Nov. 2017.
[13] F. Meng, P. Chen, L. Wu, and X. Wang, “Automatic modulation classification: A deep learning enabled approach,” IEEE Trans. Veh. Technol., vol. 67, no. 11, pp. 10760–10772, Nov. 2018.
[14] H. Ye, G. Y. Li, and B.-H. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018.
[15] F. Meng, P. Chen, and L. Wu, “NN-based IDF demodulator in band-limited communication system,” IET Commun., vol. 12, no. 2, pp. 198–204, Jan. 2018.
[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[17] L. R. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, FL, USA: CRC Press, 2010.
[18] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[19] Y. Li, “Deep reinforcement learning: An overview,” CoRR, vol. abs/1701.07274, pp. 1–85, Jan. 2017. [Online]. Available: http://arxiv.org/abs/1701.07274
[20] D. Silver et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017.
[21] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[22] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” Comput. Sci., vol. 8, no. 6, p. A187, 2015.
[23] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
[24] F. Liang, C. Shen, W. Yu, and F. Wu, “Towards optimal power control via ensembling deep neural networks,” CoRR, vol. abs/1807.10025, pp. 1–30, Jul. 2018.
[25] M. Bennis and D. Niyato, “A Q-learning based approach to interference avoidance in self-organized femtocell networks,” in Proc. IEEE Globecom Workshops, Dec. 2010, pp. 706–710.
[26] M. Simsek, A. Czylwik, A. Galindo-Serrano, and L. Giupponi, “Improved decentralized Q-learning algorithm for interference reduction in LTE-femtocells,” in Proc. Wireless Adv., Jun. 2011, pp. 138–143.
[27] M. Simsek, M. Bennis, and I. Güvenç, “Learning based frequency- and time-domain inter-cell interference coordination in HetNets,” IEEE Trans. Veh. Technol., vol. 64, no. 10, pp. 4589–4602, Oct. 2015.
[28] R. Amiri, H. Mehrpouyan, L. Fridman, R. K. Mallik, A. Nallanathan, and D. Matolak, “A machine learning approach for power allocation in HetNets considering QoS,” in Proc. IEEE Int. Conf. Commun. (ICC), May 2018, pp. 1–7.
[29] Y. S. Nasir and D. Guo, “Deep reinforcement learning for distributed dynamic power allocation in wireless networks,” CoRR, vol. abs/1808.00490, pp. 1–30, Aug. 2018. [Online]. Available: http://arxiv.org/abs/1808.00490
[30] H. Ye and G. Y. Li, “Deep reinforcement learning based distributed resource allocation for V2V broadcasting,” in Proc. 14th Int. Wireless Commun. Mobile Comput. Conf. (IWCMC), Jun. 2018, pp. 440–445.
[31] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, “Intelligent power control for spectrum sharing in cognitive radios: A deep reinforcement learning approach,” IEEE Access, vol. 6, pp. 25463–25473, Apr. 2018.
[32] N. H. Viet, N. A. Vien, and T. Chung, “Policy gradient SMDP for resource allocation and routing in integrated services networks,” in Proc. IEEE Int. Conf. Netw., Sens. Control, Apr. 2008, pp. 1541–1546.
[33] Y. Wei, F. R. Yu, M. Song, and Z. Han, “User scheduling and resource allocation in HetNets with hybrid energy supply: An actor-critic reinforcement learning approach,” IEEE Trans. Wireless Commun., vol. 17, no. 1, pp. 680–692, Jan. 2018.
[34] F. Meng, P. Chen, and L. Wu, “Power allocation in multi-user cellular networks: Deep reinforcement learning approaches,” CoRR, vol. abs/1812.02979, pp. 1–6, Dec. 2018. [Online]. Available: http://arxiv.org/abs/1812.02979
[35] P. Dent and G. E. Bottomley, “Jakes fading model revisited,” Electron. Lett., vol. 29, no. 13, pp. 1162–1163, Jun. 1993.
[36] F. D. Calabrese, L. Wang, E. Ghadimi, G. Peters, L. Hanzo, and P. Soldati, “Learning radio resource management in RANs: Framework, opportunities, and challenges,” IEEE Commun. Mag., vol. 56, no. 9, pp. 138–145, Sep. 2018.
[37] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 1057–1063.
[38] P. S. Thomas and E. Brunskill, “Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines,” CoRR, vol. abs/1706.06643, pp. 1–2, Jun. 2017. [Online]. Available: http://arxiv.org/abs/1706.06643
[39] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, pp. 1–15, Dec. 2014. [Online]. Available: http://arxiv.org/abs/1412.6980

Fan Meng (Student Member, IEEE) was born in Jiangsu, China, in 1992. He received the B.S. degree from the School of Electronic Engineering, University of Electronic Science and Technology of China (UESTC), in 2015. He is currently pursuing the Ph.D. degree with the School of Information Science and Engineering, Southeast University, China. His main research topic is applying machine learning techniques in wireless communication systems. His research interests include machine learning in general, joint demodulation and equalization, channel coding, resource allocation, and end-to-end communication system design with machine learning tools.

Peng Chen (Member, IEEE) was born in Jiangsu, China, in 1989. He received the B.E. and Ph.D. degrees from the School of Information Science and Engineering, Southeast University, China, in 2011 and 2017, respectively. From March 2015 to April 2016, he was a Visiting Scholar with the Electrical Engineering Department, Columbia University, New York City, NY, USA. He is currently an Associate Professor with the State Key Laboratory of Millimeter Waves, Southeast University. His research interests include radar signal processing and millimeter wave communication.

Lenan Wu received the M.S. degree in electronics communication system from the Nanjing University of Aeronautics and Astronautics, China, in 1987, and the Ph.D. degree in signal and information processing from Southeast University, China, in 1997. Since 1997, he has been with Southeast University, where he is currently a Professor and the Director of the Multimedia Technical Research Institute. He is the author or coauthor of over 400 technical articles and 11 textbooks. He holds 20 Chinese patents and one international patent. His research interests include multimedia information systems and communication signal processing.

Julian Cheng (Senior Member, IEEE) received the B.Eng. degree (Hons.) in electrical engineering from the University of Victoria, Victoria, BC, Canada, in 1995, the M.Sc. (Eng.) degree in mathematics and engineering from Queen’s University, Kingston, ON, Canada, in 1997, and the Ph.D. degree in electrical engineering from the University of Alberta, Edmonton, AB, Canada, in 2003. He was with Bell Northern Research and NORTEL Networks. He is currently a Full Professor with the Faculty of Applied Science, School of Engineering, The University of British Columbia, Kelowna, BC, Canada. His current research interests include digital communications over fading channels, statistical signal processing for wireless applications, optical wireless communications, and 5G wireless networks. He was the Co-Chair of the 12th Canadian Workshop on Information Theory in 2011, the 28th Biennial Symposium on Communications in 2016, and the 6th EAI International Conference on Game Theory for Networks (GameNets 2016). He serves as an Area Editor for the IEEE Transactions on Communications. He is a past Associate Editor of the IEEE Transactions on Communications, the IEEE Transactions on Wireless Communications, the IEEE Communications Letters, and IEEE Access. He served as a Guest Editor for a Special Issue of the IEEE Journal on Selected Areas in Communications on Optical Wireless Communications. He is also a Registered Professional Engineer with the Province of British Columbia, Canada. He serves as the President for the Canadian Society of Information Theory as well as the Secretary for the Radio Communications Technical Committee of the IEEE Communications Society.
