You are on page 1of 15

IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO.

12, DECEMBER 2022 10271

Deep Reinforcement Learning Based End-to-End


Multiuser Channel Prediction and Beamforming
Man Chu , Member, IEEE, An Liu , Senior Member, IEEE, Vincent K. N. Lau , Fellow, IEEE,
Chen Jiang , and Tingting Yang , Member, IEEE

Abstract— In this paper, reinforcement learning (RL) based I. I NTRODUCTION


end-to-end channel prediction (CP) and beamforming (BF) algo-
rithms are proposed for multi-user downlink system. Different A. Background and Motivation
from the previous methods which either require perfect channel
state information (CSI), or estimate outdated CSI and set con-
straints on pilot sequences, the proposed algorithms have no such
premised assumptions or constraints. Firstly, RL is considered
T HE approaching new wireless communication generation
needs more efficient methods to deal with the tremen-
dous growth of intelligent devices and applications, and the
in channel prediction and the actor-critic aided CP algorithm complexities of the system [1], [2]. Employing large numbers
is proposed at the base station (BS). With the received pilot of antennas is a significant solution to increase the capacity
signals and partial feedback information, the actor network at
BS directly outputs the predicted downlink CSI without channel of the network with the main advantage of acquiring the
reciprocity. After obtaining the CSI, BS generates the beam- capacity gain from spatial multiplexing or beamforming (BF)
forming matrix using zero-forcing (ZF). Secondly, we further by providing large degrees of freedom for the base station (BS)
develop a deep RL based two-layer architecture for joint CP and and user equipments (UEs) [3].
BF design. The first layer predicts the downlink CSI with the One of the key requirements for realizing such performance
similar actor network as in the CP algorithm. Then, by importing
the outputs of the first layer as inputs, the second layer is the gain is to accurately acquire the channel state information
actor-critic based beamforming layer, which can autonomously (CSI) [4]. Various techniques are developed to fully achieve
learn the beamforming policy with the objective of maximizing the potential gains by assuming perfect CSI or developing
the transmission sum rate. Since the learning state and action channel estimation (CE) techniques [5]. The most widely
spaces in the considered CP and BF problems are continuous, used channel estimation technique is pilot-aided, such as
we employ the actor-critic method to deal with the continuous
outputs. Empirical numerical simulations and the complexity least-squares (LS) and minimum mean-squared-error (MMSE)
analysis verify that the proposed end-to-end algorithms could channel estimation [3], [6]. With much more antennas and
always converge to stable states under different channel statistics users, the channel dimensionality is high. The realization and
and scenarios, and can beat the existing traditional and learning the inaccuracy of these traditional methods incur substantially
based benchmarks, in terms of transmission sum rate. large computational complexities and overheads, thus the
Index Terms— Deep reinforcement learning, channel predic- performance would be weakened. Another crucial approach
tion, beamforming, physical layer. to obtain the capacity gain is beamforming [1]. In future
communication generations, BS will equip with a large number
Manuscript received 22 December 2021; revised 21 April 2022; accepted of antennas (up to few hundreds) to provide better BF flexi-
6 June 2022. Date of publication 22 June 2022; date of current version bility [7]. Meanwhile, the complexities of the traditional BF
12 December 2022. This work was supported in part by the National Key
Research and Development Program of China under Grant 2021YFA1003300 strategies, i.e., zero-forcing (ZF) and wighted MMSE, will also
and in part by Research Grants Council under Grant 16213119. An earlier increase enormously, since more values need to be calculated
version of this paper was presented in part at the IEEE VTC2022-Spring, and the excessive CSI feedback overhead arises [8].
Helsinki, Finland, June 2022. The associate editor coordinating the review
of this article and approving it for publication was L. Yang. (Corresponding Considering the future massive networks and the nonlinear
authors: An Liu; Chen Jiang.) varying channels, deep learning (DL) is a promising tool to
Man Chu is with the Department of Engineering, Shenzhen MSU-BIT provide competitive performance to existing approaches, with
University, Shenzhen 518172, China (e-mail: chumancc@163.com).
An Liu is with the College of Information Science and Elec- affordable and reasonable computational costs in complicated
tronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: multi-user system [6]. There have been a few existing lit-
anliu@zju.edu.cn). eratures use learning based methods for channel estimation
Vincent K. N. Lau is with the Department of Electronics and Communi-
cation Engineering, The Hong Kong University of Science and Technology, [9], [10]. A data-aided channel estimator was designed in [4]
Hong Kong (e-mail: eeknlau@ust.hk). to reduce the channel estimation error of the conventional least
Chen Jiang is with the DJI Creative Studio LLC, Burbank, CA 91502 USA MMSE (LMMSE) method. Though performance advantages
(e-mail: cxjiang@ucdavis.edu).
Tingting Yang is with the Navigation College, Dalian Maritime University, of learning based channel estimation was demonstrated, the
Dalian 116026, China, and also with the Peng Cheng Laboratory, Shenzhen above existing methods were based on an impractical assump-
518000, China (e-mail: yangtingting820523@163.com). tion that perfect CSI or the distribution of channel model was
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TWC.2022.3183255. available for training the deep neural network (DNN). Also,
Digital Object Identifier 10.1109/TWC.2022.3183255 there are some works required large numbers of CSI data sets
1536-1276 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.
10272 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO. 12, DECEMBER 2022

as labels during the training process, as in [11]. Furthermore, between the unlink and downlink CSI caused by the delay or
some existing schemes estimated the uplink CSI and obtained imperfect channel reciprocity could be eliminated since the
the downlink CSI by assuming perfect channel reciprocity downlink CSI could be predicted directly at BS with RL.
[8], [12]. However, when in the time varying channel where Deep Q learning (i.e., deep-Q Network [22]) is one of the
the downlink CSI is not the conjugate transpose of the uplink most popular deep RL techniques. However, the Q-learning
CSI, or when imperfect channel reciprocity happens due to method may not be able to directly deal with case when the
the hardware limitation, the above methods may result in state and action spaces become continuous, such as the CE
significant estimation performance loss. and beamforming problems in physical layer [23]. Fortunately,
Recently, a number of studies have also developed learning an actor-critic method was proposed to address the problems
aided algorithms to solve the large amounts of calculation with continuous spaces [24], which includes an actor network
complexities in beamforming [13], [14]. An iterative algo- and a critic network, represented by two deep neural networks.
rithm induced deep-unfolding neural network based precoding The actor network directly outputs the actions (policy), and the
scheme was proposed [15], in which the high complexity critic network outputs the estimated value functions to measure
operations in traditional iterative WMMSE were replaced by the performance of action. Our proposed algorithms will be
multi-layer neural network. However, the above papers all built on this framework.
only focused on the beamforming scheme and assumed that
the transmission channel model information was known in
prior. Moreover, most of the existing beamforming schemes B. Main Contributions
were obtained based on the outdated estimated CSI due to the In order to tackle the issues mentioned above, in this paper,
rapid variations of wireless channels, and the long processing we consider a typical wireless communication scenario with
time in the optimization and feedback propagations [16], [17]. multiple UEs and one BS, where no statistical or perfect
Outdated CSI causes outdated beamforming strategies, which knowledge are assumed over the channel dynamics. In order
results in local optimal sum and could not ensure efficient data to fully obtain the capacity gain, RL based end-to-end CP and
transmission. beamforming algorithms are studied owing to the capability
The previous works have done some research on channel of RL to solve complicated nonlinear problems. We have
estimation and beamforming, respectively. Only a few lit- considered two end-to-end frameworks: Firstly, we propose an
erature considered the channel estimation and beamforming actor-critic based channel prediction network, which predicts
jointly to obtain the end-to-end performance gain. There are the downlink CSI directly without channel reciprocity. After
two non-DL based channel estimation and hybrid beamform- obtaining predicted CSI, the BS adopts ZF precoder for
ing designs for MIMO systems, where the authors firstly downlink transmission; Secondly, we develop a two-layer CP
estimated the uplink CSI using compressed sensing in [18] and beamforming joint algorithm, where the first layer is
and through exploiting the strongest angle-of-arrivals in [19]. the proposed CP network aiming at acquiring the predicted
Then, by employing channel reciprocity, the hybrid precoders downlink CSI, and the second layer is an actor-critic based
were obtained with iterative algorithms in [18] and ZF in [19]. beamforming policy generating network. In particular, the
As for the DL based algorithms, an uplink channel estimation two-layer network is tandem: the output of the first layer,
and hybrid beamforming network was proposed in [20], where i.e., predicted CSI, is fed into the second layer as inputs.
the BS firstly used the neural network to estimate the channel The parameters in the proposed neural networks (NNs) are
covariance matrix, and then fed back the covariance matrix optimized with the objective of maximizing the downlink
to users for channel reconstructing and beamforming. Heavy transmission sum rate.
feedback overheads and channel statistical information were The main contributions are summarized as follows:
required in [20]. Authors in [3] firstly proposed a deep learning • We model the multi-user multi-antenna channel predic-
based channel beam-wave amplitude estimation scheme and tion and beamforming design as Markov decision process
used the results to estimate the channel. With the estimated (MDP). The formulated problems aim to choose good
channel, a neural network based analog precoding scheme policies for real-time CSI prediction and beamforming,
was proposed. However, the above scheme in [3] required the with the objective of maximizing the long-term expected
statistical model information in premise and only the analog accumulated downlink sum rate, as well as implicitly
precoding matrix was optimized. minimizing the CSI prediction loss. In the proposed CP
The performance of beamforming strongly relies on the algorithm, the users only need to feed back the CSI
perfectness of the channel. Hence, it is very crucial to perform every once in a while. Due to this infrequent infor-
accurate channel prediction (CP) and efficient beamforming mation feedback, the proposed RL based methods have
jointly with less requirements [20], [21]. Different from the lower feedback overheads as compared to some existing
existing DL or non-DL based schemes, reinforcement learn- approaches [3], [20].
ing (RL) agent could learn certain statistical information • We propose two RL based CP and beamforming frame-
about an unknown channel or find the optimal policy by works without prior assumptions of perfect CSI or
interacting with the environment, without premise for perfect plenty of training CSI data sets, which are hardly to
CSI and statistical channel models, or large numbers of true obtain in practical applications. The RL algorithm could
CSI training data sets. Therefore, the constraints described directly predict current downlink CSI at BS by constantly
above can be greatly eased with RL. Moreover, the differences interacting with the environment, thus eliminating the

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.
CHU et al.: DEEP REINFORCEMENT LEARNING BASED END-TO-END MULTIUSER CHANNEL PREDICTION AND BEAMFORMING 10273

one complete transmission round, which includes both uplink


and downlink transmissions. The BS has no premised perfect
CSI and a pilot based channel prediction scheme is considered
in this paper. We can see from Fig. 1 that each user firstly
sends its pilot sequence to BS during uplink pilot transmission.
Then, before starting downlink data transmission, the BS
uses its received pilot sequences and other input information
to predict the unknown downlink CSI. With the predicted
CSI, BS generates the downlink beamforming strategies and
accomplishes the data transmissions. This paper focuses on
Fig. 1. Multi-user uplink and downlink transmissions.
the designs of channel prediction and beamforming modules
in Fig. 1.
prediction loss caused by non-ideal channel reciprocity. We denote the set of uplink channel from UE i to BS as
Furthermore, the proposed frameworks also have a wide Gi = [Gi1 , Gi2 , . . . , GiNA ], where Gij is the channel gain
application scenario since there are no restrictions on the from UE i to the jth antenna of BS with i ∈ [1, · · · , M ]
numbers of UEs and antennas in the proposed method. and j ∈ [1, . . . , NA ]. We denote the pilot at UE i as
• We implement a tandem two-layer deep RL architec-
si = [si1 , · · · , sik , · · · , siK ], where K is the length of pilot
ture to jointly solve the downlink CSI prediction and signals. In general, pilot signals are known to both BS and
beamforming problem from the perspective of end-to-end UEs [9]. Then, the set of all the UEs’ pilot sequences is
design. The first layer directly predicts the CSI and s = [s1 , · · · , si , · · · , sM ].
the outputs are imported into the second layer as input The received k-th pilot signal at jth antenna of BS can be
state. The second layer adopts the actor-critic method, denoted as
M
in which the actor network outputs the beamforming
polices (actions) directly, and the critic network evaluates Yjk = Gij sik + Zjk , (1)
i=1
the Q values corresponding to the current actions. Such
an end-to-end design can achieve better performance than where Zjk ∼ N (0, σ12 ) is the received additive white Gaussian
the existing works which design the channel estimation noise (AWGN) with zero mean and variance σ12 at antenna j
and beamforming separately. during k-th pilot duration. Then, the received k-th pilot signal
We verified the performance and the efficiency of the at BS is
proposed algorithms via extensive simulations and compu- M

tational complexity analysis. The results show that the pro- Yk = Gi sik + Zk , (2)
posed algorithms are robust and always can converge to i=1
stable states under various simulation settings. Also, the pro- where Yk = [Y1k , · · · , YNA k ] and Zk = [Z1k , · · · , ZNA k ].
posed algorithms achieve higher sum rates as compared to Thus, we can get the total received pilot matrix at BS
both traditional methods and learning based baselines, with M

affordable computational overheads. The remainder of this Y= GT
i si + Z, (3)
paper is organized as follows. In Section II, the multi-user i=1
downlink system, performance metrics and the proposed end- T
to-end frameworks are presented. Section III simply introduces where Y = [Y1 , · · · , YK ]T , Z = [Z1 , · · · , ZK ]T and [·]
some preliminaries on MDP, policy gradient and actor-critic denotes the vector transpose.
architecture. In Section IV, the problem formulation and the Most of the existing works use pilot signals to estimate the
RL based CP algorithm are given. Then, the joint CP and uplink CSI and then employ channel reciprocity to obtain the
beamforming problem is formulated, and the proposed two- approximation of downlink CSI. However, when in the fast
layer actor-critic based algorithm is presented in details in varying channel or imperfect channel reciprocity happens due
Section V. We provide simulation results and the computa- to hardware limitation, the existing schemes would introduce
tional complexity analysis in Section VI, and conclusions in high estimation error since the uplink and downlink channels
Section VII. are not constant within a transmission round. In the proposed
algorithms, we aim to use the known pilot signals, the received
II. S YSTEM M ODEL pilots Y, and some history information (will be introduced in
details in Section IV) to directly predict the real-time downlink
A. Multi-User System Model CSI without channel reciprocity.
In this paper, we consider a multi-user transmission system The set of predicted downlink CSI between BS and UEs is
with one base station (BS) and M user equipments (UEs), denoted as Ĥ = [Ĥ1 , · · · , ĤM ], where Ĥi is the predicted
as shown in Fig. 1. The BS is equipped with NA antennas and CSI from BS to UE i and Ĥi = [Ĥi1 , · · · , ĤiNA ]. The
all the UEs are equipped with an omnidirectional antenna. The true downlink CSI from BS to UEs is represented as H =
transmissions are under time division duplex (TDD) mode, and [H1 , · · · , HM ], where Hi = [Hi1 , · · · , HiNA ] is the true CSI
the system is assumed to be operated in a time slotted way with from BS to UE i. Here Hi is not always the conjugate trans-
normalized equal length time slots (TSs). Each TS contains pose of the uplink CSI Gi . The proposed learning algorithm

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.
10274 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO. 12, DECEMBER 2022

could learn the differences between Hi and Gi caused by the


delay in time varying channel or imperfect channel reciprocity.
The transmission data at BS is x = [x1 · · · , xM ]T , where
xi is the data for UE i, |xi |2 = 1, i ∈ [1, · · · , M ] and
|·|2 represents for the square of the absolute value. We denote
the precoding matrix generated with Ĥ as V ∈ C NA ×M , where
V = [V1 , · · · , VM ] and Vi ∈ C NA ×1 . Then, the received
downlink data at UE i is
M

yi = Hi Vi xi + Hi Vj xj + zi , (4)
j=1,j=i

where zi ∼ N (0, σ22 ) is the received AWGN at UE i and σ22 is


Fig. 2. The proposed end-to-end frameworks.
the variance. The UEs could get their signal-to-interference-
plus-noise ratios (SINRs) with received signals.
Thus, the received SINR of UE i can be denoted as module and the BF module is generated with traditional
2
Hi Vi  method, which is shown in Fig. 2 (a). This is intuitive since
Υi = M 2
. (5) the performance of BF strongly relies on the perfectness of the
m=1,m=i Hi Vm  + σ22
channel prediction [20]. The second framework is the proposed
where · denotes the Frobenius norm of a vector. joint two-layer RL network, where the RL is considered in
both CP and BF modules. We will introduce the proposed
B. Performance Metrics two frameworks in details in Sections IV and V, respectively.
In wireless communication system, one of the most impor-
tant performance metrics is data transmission sum rate. With III. P RELIMINARIES OF RL
Eq. (5), we can get the received rate at UE i, i.e., ri , via the Generally, RL based control problems can be regarded as
Shannon capacity formula [25], learning the agent’s act in a stochastic environment by sequen-
ri = B log (1 + Υi ) , (6) tially choosing actions over a sequence of TSs. Typically,
RL is developed based on a MDP formulation, which includes:
where B is the frequency bandwidth. The sum rate of all the a state space S, an action space A, an immediate reward
UEs can be described as function R : S × A → R and a transition probability set
M
 M
 P [24]. The general goal of an RL agent is to find a good
R= ri = B log (1 + Υi ) . (7) policy, which is a function mapping from state space to action
i=1 i=1 space, denoted by π : S → A. The RL agent interacts with the
Therefore, in the proposed algorithms, the optimization objec- environment by following its policy at each learning TS. The
tive is to maximize the sum rate by obtaining the accurate CSI optimization goal of RL is to maximize the total discounted
prediction and optimized BF matrixes. Furthermore, in order reward from TS t onwards, which is denoted as
to measure the accuracy of the proposed CP algorithms, ∞

we define the prediction loss, i.e., P loss , as the dissimilari- Rtγ = γ k−t Rk+1 , (9)
ties between the predicted CSI and the true CSI, which is k=t
given by where Rtγ and Rt are discounted reward and immediate reward

M at TS t, respectively, and γ ∈ (0, 1) is the discount factor [24].

=  H − Ĥ  .
2
P loss
i i (8) In this paper, the RL agent is the BS, whose goal is to
i=1 maximize the transmission sum rate in the long run.
Solving the RL optimization problem relays on two func-
C. Proposed End-to-End Frameworks tions, i.e., the state value function Vπ (s) = Eπ [Rtγ |St =
As described in Section II-A, we aim to design the channel s] and the action value function Qπ (s, a) = Eπ [Rtγ |St =
prediction and beamforming modules with RL. Note that due s, At = a], where Eπ [·] denotes the expected value given
to the devices’ limited computation abilities and the time that the agent follows policy π [24]. The optimal policy
sensitive application scenarios, directly developing joint RL π ∗ is the policy that can maximize Vπ (s) given any state
networks for both two modules (CP and BF) may significantly and the corresponding optimal action is maxQπ∗ (s, a). The
a
increase the system overhead and require extra time consump- corresponding action-value function for the optimal policy π ∗
tion, hence, may not always be feasible in practical networks. is denoted by Qπ∗ (s, a). To generate the optimal policy, policy
This motivates us to propose two RL based frameworks for gradient algorithms are widely used [26].
different application scenarios, namely, channel prediction In this paper, we adopt the actor-critic method, which is
with RL (CPRL) and joint channel prediction and BF with based on the simultaneous online estimation over the parame-
RL (CPBFRL), as described in Fig. 2. In the first framework, ters of two network structures: the actor and the critic. The
we only consider adopting RL in the channel prediction actor generates a selection policy and adjusts the parameters of

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.
CHU et al.: DEEP REINFORCEMENT LEARNING BASED END-TO-END MULTIUSER CHANNEL PREDICTION AND BEAMFORMING 10275

UEs receive the downlink data and feed back their SINRs
to the BS.
In order to improve the learning efficiency, fasten the con-
vergence speed and decrease the signaling overhead, we stip-
ulate that each UE feeds its local estimated CSI back to BS
every W learning TSs. The BS would receive the feedback
CSI from UEs every W learning TSs and use this limited and
partial time CSI in the next W −1 TSs. Therefore, the feedback
Fig. 3. Illustration of channel prediction system. signaling overhead could be restricted. There are many existing
research works and patents have proposed efficient methods
for SINR and channel feedback, such as in [29]–[32]. Note
the policy with stochastic gradient decent [26]. The critic esti- that in this paper we mainly focus on dealing with the RL
mates the true action-value function Qπ (s, a) with Qw (s, a), based channel prediction and downlink beamforming schemes,
where Qπ (s, a) is the true action value function in deep Q rather than feedback design or signal estimation and detection.
learning [27] under policy π with state s and action a, and Based on the above existing mature research, we believe that
Qw (s, a) is the estimated Q function [28]. We will introduce it is reasonable to assume feasible practical time feedback and
the details of the networks in the following. user signal estimation in this paper.
In our considered channel prediction and beamforming The sum rates at UEs are effected by the beamforming
problems, the action spaces are continuous. Therefore, the matrix, while the beamforming gain depends on the accu-
core learning algorithm for parameter updates used in our racy of the predicted CSI. Thus, the learning objective of
paper is called deep deterministic policy gradient (DDPG) the proposed scheme is to maximize the received downlink
algorithm [26]. However, in this paper, the employed neural sum rate, while minimizing the prediction loss is an implicit
network structure is built in particular to cope with the optimization goal. In this paper, the orthogonal pilot sequences
considered physical layer problems. Specifically, there are two are not strictly required since the deep RL based algorithm
main differences between the network we used and that in could predict the channel by interacting with the environment,
reference [26]. Firstly, though we use the similar actor-critic rather than relying on precise mathematical calculations over
framework with an actor network and critic network as in [26], uncorrelated signals in the traditional methods. At each learn-
the number of the layers, the size of each layer and the kind ing step, the learning agent selects the action to maximize the
of neural network in this paper are chosen specifically for the expected long term reward based on the approximated Q-value
considered optimization problems. Secondly, in the proposed via the neural networks. Thus, with the increase of training
joint algorithm, we presented a two-layer neural network time slots, the approximation of Q-value becomes more accu-
structure, which is also very different from the network in [26]. rate, and a better performance (better channel prediction and
sum rate reward) could be expected.
IV. C HANNEL P REDICTION W ITH RL We denote the sum rate at learning TS t as Rt , which is
defined in Eq. (7). The instantaneous reward at TS t is
In this section, RL is considered in channel prediction mod-
M

ule. The pilot based channel prediction scheme is employed,
therefore, it is assumed that the BS and UEs are cooperative Rt = Rt = B log (1 + Υit ) , (10)
such that the pilot signals are known to both the BS and i=1

UEs [9]. where Υit is the SINR of UE i at TS t. For RL, the optimiza-
tion objective, i.e., discounted long-term reward, is denoted
as
A. Channel Prediction Problem Formulation

 ∞
 M
  
In pilot based schemes, BS estimates the CSI after UEs Rtγ = γ l−t Rl+1 = γ l−t B log 1+Υi(l+1) . (11)
finishing transmitting the pilot symbols [3]. Similarly, in our l=t l=t i=1
proposed learning scheme, CSI is predicted when the BS The goal of proposed channel prediction scheme is to obtain
receives all the pilot sequences. Therefore, the learning time the optimal prediction policy π with the optimization objec-
slot (TS) contains K pilot-symbol durations in this paper. tive of maximizing cumulative discounted reward J1 (π) =
As shown in Fig. 3, the learning agent is BS and the proposed E(Rtγ | π), where E(·) denotes the expectation operator. The
channel prediction system operates as follows. At the begin- channel prediction problem can be formulated as
ning of each learning TS, UEs transmit pilot sequences to BS
through uplink channel. After receiving all the pilot signals, max J1 (π) = E(Rtγ | π). (12)
π,Ĥ
BS predicts the downlink CSI with the proposed actor-critic
based learning algorithm. Then, with the estimated downlink B. Actor-Critic Based Channel Prediction
CSI, BS generates the precoding matrix with zero-forcing (ZF)
method and performs downlink data transmission1 . Finally, In this subsection, the proposed learning framework and the
algorithm solving problem in Eq. (12) are presented. In order
1 In the proposed CP algorithm, ZF can be replaced by other beamforming to store the history information, it is assumed that a memory
methods. component with a sliding history window of W time slots is

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.
10276 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO. 12, DECEMBER 2022

Therefore, the state of the proposed channel prediction algo-


t
rithm at TS t is St = {YW , H̄tW , s, Htp }. We denote the
predicted complex downlink CSI at TS t as Ĥt ∈ C NA ×M ,
and represent the immediate output of the actor network as
H̄t = [ĤR I
t , Ĥt ], which contains both the real parts and
imaginary parts of the predicted CSI with ĤR t = Re(Ĥt ) and
ĤIt = Im(Ĥt ). Thus, we have H̄tW = [H̄t−W , · · · , H̄t−1 ].
After getting the output of the actor network, the BS firstly
preprocesses H̄t ∈ C 1×2 MNA to obtain the complex CSI
matrix Ĥt .
We denote the input of the action network as Sta = St ,
and represent the actor network as ΦA (·|θ a ), with output
At = H̄t = [ĤR I a a a
t , Ĥt ]. Here, θ a = {w1 , wh1 , wh2 , w2 }
a

is the set of network weights containing the input layer


Fig. 4. Actor-critic based channel prediction framework. parameters w1a , two-hidden-layer parameters {wh1 a a
, wh2 } and
a
the output layer parameters w2 . After obtaining the action with
ΦA (·|θ a ), the input of the critic contains the state information
equipped at BS, which means that at each time slot, the history St and the policy output (action) H̄t , which is fed into the
information is from the latest W learning TSs. As we have critic network at the second hidden layer [26]. The reasons
stated in Section IV-A, user’s CSI feedback time interval is as why we choose to feed actions into the hidden layer are as
the same length as the history window. follows: 1) Concatenating the actions with the states in the
We then introduce the proposed actor-critic based channel input layer would result in more parameters for the neural
prediction framework, as shown in Fig. 4. Since the action and network since more weights would be required; 2) From the
state spaces of the considered CSI prediction are continuous, perspective of exploration in RL, having states and actions
the actor-critic framework enables the actor network directly presented in different layers can reduce the neural network
outputs the predicted results, which is represented by two deep finding shortcuts due to “recognizing the policy” to predict
neural networks: the actor network, which specifies the current the values, and can avoid the representations of actions getting
policy by deterministically mapping the state to a specific attenuated at the output [34]. We denote the input of the
action and outputs the actions (predicted CSI) directly, and critic network as Stc = {St , H̄t }. With Stc , the critic network
the critic network, which outputs the approximated Q values provides the estimations of the action value function. The
Qw (s, a) via Q-learning. Both the actor and critic networks estimated Q-Value is back-propagated through the critic to
are composed with one input layer, two hidden layers and produce the gradients which will indicate how the actor should
one output layer [26]. The reasons why we use two hidden be updated in order to increase the Q-value. The critic network
layers here are as follows: Firstly, as presented in [33], when is represented as ΦC (·|θc ) with output Qw (St , At ) (estimated
simulating the minimum mean square error (MMSE) channel Q-value), where, θc = {w1c , wh1 c c
, wh2 , w2c } is the set of
estimator with neural network (NN), it has been proved that network weights containing the input layer parameters w1c ,
c c
two or three layers NN can construct the MMSE channel two-hidden-layer parameters {wh1 , wh2 } and output layer
c
estimation operations with a good performance. Secondly, parameters w2 .
due to the imperfect hardware computational ability, a high
number of NN layers may greatly increase the computation
overhead. C. Parameter Learning
At the beginning of learning TS t, the input of the actor Traditional policy improvement methods find a better policy
network contains: the history of received uplink pilot sig- by greedily choosing the optimal action under all possi-
t
nals YW = [Yt−W +1 , · · · , Yt−1 , Yt ]; the history informa- ble states according to the estimated action state value or
tion of channel prediction results H̄tW ; the pilot sequences Q-function, like the deep Q learning method [22]. However,
s = [s1 , · · · , si , · · · , sM ] and the feedback channel infor- when the action space is large or in a continuous space, such a
mation for current learning TS Htp = [H(t−mod(t,W )) ]. greedy searching strategy becomes computationally demand-
Here, H(t−mod(t,W )) is the UEs’ feedback CSI at TS (t − ing or even impossible. Therefore, in this paper, we exploit the
mod(t, W )), which is used for prediction in W TSs, where policy gradient method for updating the network [26], and the
mod(·) is the modulo operation. There are two main reasons basic idea is to adjust the parameter θ in the policy function
for using the feedback channel information as inputs in our πθ (s) to the direction of the objective gradient, i.e., the
proposed method: 1) We use the sum rate as the reward and derivative of the objective function E(Rtγ | πθ (s)).
our final optimization goal, therefore, it is necessary to use At learning time slot t, if Qw (St , a) can be perfectly
the feedback CSI as a boosting for channel prediction outputs estimated, the BS could select the action At that achieves
without introducing excessive computational complexity [2]; the maximum Q(St , a) and the current optimized policy can
2) Through experiments, it is proved that without such an be obtained. Meanwhile, the estimation of Q(St , a) is also
input, there will be performance losses and unstable training determined by action At . However, the actor policy ΦA (θa )
results, which will be described in details in Section VI. is not optimal before the estimation of Qw (St , a) is accurate

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.
CHU et al.: DEEP REINFORCEMENT LEARNING BASED END-TO-END MULTIUSER CHANNEL PREDICTION AND BEAMFORMING 10277

enough, which results in a major challenge of exploration in


the actor-critic approach. In order to improve such estimates,
the agent, i.e., BS, should balance the exploration of new
actions and the exploitation of the known actions. Authors
in [26] constructed an exploration policy by adding noises
from a sampled noise process Nn,t to the actor policy:

H̄t = Ht + Nn,t , (13)

where Ht is the action before adding noise, and the noise


Nn,t can be chosen to fit the environment. In this paper, it is
assumed that the agent has no premised perfect CSI or channel
statistical information. Therefore, we choose the noise process
as AWGN with zero mean and variance σn2 to mimic the effects
of many random processes that occur during the transmission. Fig. 5. Actor-critic gradients interlink [35].
This is because that AWGN is quite general and is expected
to broaden the application scope of proposed methods.
After obtaining the selected action, BS uses the predicted The gradients interaction between actor and critic networks
CSI to generate the precoding matrix and finishes the downlink is shown in Fig. 5. Subfigure (a) illustrates the interlinked
transmission. With the received data from BS, UEs get their architecture of actor-critic, which allows the actions to flow
SINRs and feed the values back to the BS. Afterwards, forwards from actor network to critic network and the
the BS could receive the immediate reward Rt . We keep gradients to flow backwards from the critic to the actor.
tracking BS’s previous experiences in a replay memory data The gradients coming from the critic indicate directions of
set D = {e1 , · · · , et }, where et = (St , At , Rt , St+1 ). In each improvement in the continuous action space and are used to
tuple et = (St , At , Rt , St+1 ), there contains the current state, train the actor network [35]. We denote ∇g f (·) as the gradient
action, reward and the next state after executing At at state vector of f with respect to g. In subfigure (b), backwards
St . At each training step, instead of updating parameters based pass generates critic gradients with respect to the action,
on transitions from current state, the DQN based algorithms i.e., ∇a Qw (s̃, a|θc ). These gradients are back-propagated
randomly sample a tuple (s̃, ã, r̃, s̃ ) from previous transi- through the actor, together with actor gradients with respect to
tions in D. Updating network parameters in this way could actor parameters ∇θa ΦA , resulting in gradients to update the
alleviate the problems of corrected data and non-stationary actor.
distributions, and thereby smooths the training distribution Finally, the actor gradients ∇θa ΦA is the product of critic
over many past behaviors [22], [23]. Therefore, when training gradients with respect to action, i.e., ∇a Qw (s̃, a|θc ), and
the parameters of the neural network, the agent randomly the gradients of action with respect to actor parameters,
sample a transition tuple from the replay memory. i.e., ∇θa ΦA (s̃, θa ). Thereby, the parameters of the actor
Generally, updating the critic network parameters is to network are updated as [28]:
minimize a loss function, which is defined as ∇θa ΦA = ∇θa ΦA (s̃, θa ) · ∇a Qw (s̃, a|θc ),
LQ (s̃, ã|θ c ) = Qw (s̃, ã|θc )− r+γ max Qw (s̃ , ã |θc )
2
. θa ← θa + α · ∇θa ΦA . (17)
 ã
(14) The detailed learning process of the proposed channel predic-
tion algorithm is shown in Algorithm 1.
However, the loss function minimization in continuous action
spaces is no longer trackable as it involves maximizing over V. J OINT C HANNEL P REDICTION AND BF W ITH RL
unknown next action ã for state s̃ . Instead, we use ΦA (s̃ |θa ), In multi-antenna multi-user system, the transmission perfor-
which is denoted as the next action and provided by the actor mance greatly depends on the precoding/beamforming strate-
network. gies, which can improve the spatial multiplexing gain and
Lt (s̃, ã|θ c ) = (Qw (s̃, ã|θ c )−(r̃+γQw (s̃ , ΦA (s̃ |θa )|θ c ))) ,
2 the spectral efficiency. At the same time, the accuracy of
the estimated CSI has significant effects on the beamforming
(15)
design. In this section, by jointly considering the end-to-end
where s̃ denotes the next state of s̃, ΦA (s̃ |θa ) is denoted channel prediction (CP) and beamforming (BF), the proposed
as the next action provided by the actor network, and joint CP and BF algorithm can not only relax the requirements
Qw (s̃ , ΦA (s̃ |θ a )|θc ) is the estimated Q values with next of some traditional schemes on the knowledge of channel
state s̃ and next output action ΦA (s̃ |θa ). states, but also adopts RL to optimize the BF strategy aiming
The gradients of loss function with respect to θc can be at maximizing the long-term sum rate.
obtained by differentiating the loss function with respect to In particular, we propose a two-layer based joint CP and
the weights. Therefore, the critic network is updated as beamforming algorithm, which operates as follows. As shown
in Fig. 6, at the beginning of each learning TS t, after receiving
θ c ← θc + α · ∇θc Lt (s̃, ã|θ c ). (16) all the pilots, the BS observes current complete environment

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.
10278 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO. 12, DECEMBER 2022

Algorithm 1 Algorithm for Proposed Channel Prediction


1: Initialize the experience memory D, total number of
episodes Ep .
2: Initialize the network with random weights θ a and θ c .
3: Initialize the environment and get initial observation.
4: for t = 1, · · · , Ep do
5: Select a random process Nn,t for action exploration.
6: Get the input state Sta = St .
Fig. 6. Illustration of joint CP and beamforming system.
7: Select action according to the current policy and explo-
ration noise, i.e., At = ΦA (Sta |θa ) + Nn,t .
8: Given At , BS gets the ZF based precoding matrix, and
finishes the downlink transmission.
9: UEs receive the data from BS and feed back their SINRs,
UEs feed back CSI if mod(t, W ) = 0, BS calculates
reward Rt .
10: BS gets new state St+1 , stores transition
(St , At , Rt , St+1 ) in D.
11: Sample random mini-batch of transitions (s̃, ã, r̃, s̃ ).
12: Perform the stochastic gradient descent step on the loss
function according to [28].
13: Update the critic network parameters.
14: Use the sampled policy gradient to update the actor
parameters following Eq. (17).
15: end for

state information. Then, as the centralized controller, BS feeds


the current states into the two-layer joint network. The first
layer is called prediction network, which outputs the predicted
CSI. The CSI is then imported into the second layer, which is
designed to generate the beamforming strategy. Next, BS exe- Fig. 7. Joint channel prediction and beamforming framework.
cutes the beamforming policy and transmits the data to all the
UEs. Afterwards, the UEs feed back their received SINRs to
the BS, along with their local estimated CSI every W learning
B. Actor-Critic Based Two-Layer Joint Network
TSs, which will be stored for future prediction usage. The BS
finally computes the rewards and finishes the neural network In this subsection, we present the proposed algorithm to
parameter updates. solve the optimization problem in (20). Similarly, we deploy
a memory component with a window of W at BS to store the
A. Joint Channel Prediction and BF Problem Formulation history information, like we did in Section IV-B. As shown
in Fig. 7, the proposed joint network is divided into two
We define the immediate reward in the proposed joint prob-
layers. The first layer is the prediction network and the
lem as transmission sum rate, which is the main performance
metric and is denoted as input at learning TS t is Stp = {YW t
, H̄tW , s, Htp } and
M
the definitions of these variables are as same as those in the
 Hit Vit 
2
CPRL algorithm. With Stp , the hidden layer 2 of prediction
Rt = B log 1 + M 2
, (18)
m=1,m=i Hit Vmt  + σ2
2 network outputs the predicted channel states in multi mini-
i=1
batch H̄minibatch ∈ Rbatchsize×N , where N = 2 × M × NA
where Rt is the immediate reward received at TS t. denotes the size of the predicted CSI matrix. Then a fully
The long-term system performance, i.e., total discounted connected network follows to adjust the output vector size of
reward, from learning TS t onwards, is given by the hidden layer 2 to be the expected H̄t ∈ R1×N . We use

 ΦP (·|θ p ) to denote the prediction generator with input Stp
Rtγ = γ k−t Rk+1 . (19) and output H̄t , which contains both the real parts and the
k=t
image parts of the complex predicted CSI Ĥt . Here, the set of
The objective of our joint learning algorithm is to obtain the network weights θ p contains {wIp , wh1 p
, wh2p
, wfp }, where wIp ,
optimal CP and beamforming policy π which can maximize p p p
wh1 , wh2 and wf are the parameters of input layer, hidden
the cumulative discounted reward E(Rtγ | π). Therefore, the layers and the fully connected layer, respectively.
joint problem can be formulated as The second layer is the BF network with actor-critic archi-
max J2 (π) = E(Rtγ | π). (20) tecture and the output of prediction network is imported into
π,Ĥ,V this layer, as in Fig. 7. In order to improve the performance of

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.
CHU et al.: DEEP REINFORCEMENT LEARNING BASED END-TO-END MULTIUSER CHANNEL PREDICTION AND BEAMFORMING 10279

the BF network, we use the feedback CSI H(t−mod(t,W )) and Since the joint CP and BF problem is more complicated,
the traditional ZF precoding method to generate the reference we employ target networks in this algorithm [36]. In this
BF matrixes, which is denoted as Vtzf . Therefore, the input of framework, we let the target networks be the replicas of the
actor network in the BF network contains the output values actor and critic networks. The outputs of the target actor
of ΦP (·|θ p ) and the reference precoding matrix, which is and target critic networks are denoted by H̄t and Qw (s, a),
denoted as Sta = {H̄t , Vtzf }. We represent the actor network respectively. Instead of using the outputs of the original actor
as ΦA (·|θ a ), with output BF policy V̄t = At ∈ C1×N , where and critic networks, H̄t and Qw (s, a) are adopted for obtaining
θa = {w1a , wh1 a a
, wh2 , w2a } is the set of network weights the target Q-value in the network parameter updating process.
containing the input layer parameters w1a , two-hidden-layer The weights of the target networks are denoted by θ a and θc ,
a a
parameters {wh1 , wh2 } and the output layer parameters w2a . respectively. The updates of the target networks are slower
Since the values in the BF matrixes are complex, the real parts than the original actor and critic networks, and in this way,
and the image parts of the complex beamforming matrix Vt the stability and the convergence of the learning process could
are exported separately in V̄t = [VtR , VtI ] with VtR = Re(Vt ) be guaranteed [26].  
and VtI = Im(Vt ). After getting the action, BS needs to The target Q-value is yt = r̃ + γQw ŝ, H̄t |θa , θc . The
preprocess V̄t and combines the real parts and image parts to loss function for the critic network to minimize is denoted as:
obtain complex Vt . 2
Lt (θc ) = (yt − Qw (s̃, ã|θc ))
The critic network in the BF network is used to measure the    2
performance of current action, thus, the action output At = V̄t = r̃ + γQw ŝ, H̄t |θa , θc − Qw (s̃, ã|θc ) . (22)
is fed into the critic network at its second hidden layer [26], Differentiating the loss function with respect to the weights,
as shown in Fig. 7. Therefore, the input state for critic network we arrive at
is Stc = {H̄t , Vtzf , V̄t }. With Stc , the critic network provides
the estimation of action value function and the output is the ∇θc Lt (θ c )
 
estimated Q-value Qw (s, a). The estimated Q-Value is back- = r̃+γQw (ŝ, H̄t |θa , θc )−Qw (s̃, ã|θc ) ∇θc Qw (s̃, ã|θ c ).
propagated to generate the gradients. We represent the critic (23)
network as ΦC (·|θ c ) with input Stc and output Qw (Stc ). The
set of network weights is θc = {w1c , wh1 c c
, wh2 , w2c } where where ∇θc f (·) denotes the gradient vector of f with respect
c c c c
w1 , {wh1 , wh2 } and w2 are parameters of input layer, two to θc , and α is the updating step size. Then, the critic network
hidden layers and output layer. is updated as
θ c ← θ c + α∇θc Lt (θ c ). (24)
C. Parameter Learning
As mentioned before, the update of actor network depends
We exploit a policy gradient method for policy improvement on the estimation of Q-values in the critic network. Therefore,
in the joint CP and beamforming algorithm, which overcomes the gradients coming from the critic recommend the directions
the limitations of the greedy searching strategy by explic- of improvement in the continuous action space and are used
itly optimizing a parameterized policy [34]. As discussed in to train the actor network. Given some randomly sampled
Section IV, an exploration policy by adding sampled noise transition (s̃, ã, r̃, ŝ), the critic gradients with respect to action
from Nn,t to the actor policy is employed, that is, V̄t = can be written as ∇a Qw (s̃, a|θc )a=ΦA (s̃,θa ) . As shown in
Vt + Nn,t . After executing the selected action, BS receives Fig. 7, the action is fed into the second hidden layer of critic
the immediate reward Rt and keeps the previous experience network and the critic gradients are back-propagated to the
in a replay memory data set D = {e1 , · · · , et }, with et = actor network. In the actor network, we can have the actor
(St , At , Rt , St+1 ). In order to reduce the strong correlations gradients ∇θa ΦA (s̃, θa ), which is the gradients of action with
between adjacent time slots, at each TS, we sample a random respect to actor parameters. Together with ∇a Qw (s̃, a|θc )),
transition (s̃, ã, r̃, ŝ) from D instead of performing updates the actor gradients to update parameters in θa are given as
using transitions from the current episode [22]. As stated in
Section IV, maximizing over the unknown next action â as in ∇θa ΦA = ∇θa ΦA (s̃, θa )∇a Qw (s̃, a|θc )a=ΦA (s̃,θa ) ,
Eq. (14) is replaced by ΦA (ŝ|θ a ) and the critic loss can be θa ← θa + α∇θa ΦA . (25)
changed to
Then, according to the updating approach in [28], the para-
LQ (s, a|θ c ) = (Qw (s, a|θc )−(r+γQw (ŝ, ΦA (ŝ|θa )|θ c )))2 . meters of the prediction network and the target networks are
(21) updated as:

As we know that the critic’s output Qw (s, a|θc ) influences ∇θp ΦP = ∇θp ΦP (s̃, θp )∇p Qw (s̃, a|θc ),
both the actor’s and the critic’s updates. Therefore, the estima- θ p ← θp + α∇θp ΦP ,
tion error of Qw (s, a|θ c ) would cause destructive feedbacks θ a ← τ θ a + (1 − τ )θ a ,
and divergence for actor and critic networks, which may lead
θ c ← τ θ c + (1 − τ )θ c , (26)
to an unstable learning process, especially in the proposed
joint two-layer network. To deal with such problems, the with 0 ≤ τ ≤ 1.
method based on a target Q-network is considered [26], [36]. The overall joint CP and beamforming algorithm is sum-
In particular, the target networks are not always necessary. marized in Algorithm 2. Similar to most of the reinforcement

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.
10280 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO. 12, DECEMBER 2022

Algorithm 2 Algorithm for Joint CP and Beamforming baselines for comparisons: 1) standard minimum mean square
1: Initialize the experience memory D and the total number error (MMSE) based CE scheme [9]; 2) sliding bi-directional
of episodes Ep . gated recurrent unit (SBGRU) based channel estimation with
2: Initialize prediction network with random weights θ p . known pilot density and channel statistics characteristics [11];
Initialize critic network, actor network, target networks 3) weighted MMSE (WMMSE) beamforming method with
with random weights θc , θa , θc and θa . perfect CSI [38]; 4) iterative algorithm induced deep unfolding
3: Initialize the environment and get initial observation state neural network (IAIDNN) for precoding design based on
S1p = {Y 1W , H̄1W , s, H1p }. WMMSE iterative method [15].
4: for t = 1, · · · , Ep do All the results are performed in a simulated physical layer
p communication scenario containing one BS with multiple
5: Get the predicted channel states H̄t = ΦP (St ).
6: Obtain the action input state Sta = {H̄t , Vtzf }. antennas and multi-user with single antenna per UE. In our
7: Select action according to At = ΦA (Sta |θ a ) + Nn,t . simulation, two kinds of wireless channels are considered. One
8: With At , BS gets BF matrixes and finishes downlink data is the complex frequency-selective clustered delay line (CDL)
transmission. channel model in 3GPP 5G new radio (NR) standard pro-
9: UEs feed back SINRa and partial time CSI; BS calculates tocol [39], [40]. CDL is used to model the channel when
reward Rt . the received signal consists of multiple delayed clusters,
10: Observe new state St+1 . Store transition where each cluster contains multi-path components with the
(St , At , Rt , St+1 ) in D. same delay but slight variations for angles of departure and
11: Sample random mini-batch of transitions (s̃, ã, r̃, ŝ) from arrival. The channel coefficients of CDL models used in the
D. simulations are generated with Matlab 5G Toolbox function
 
12: Set yt = r̃ + γQw (ŝ, H̄t |θ a , θ c ). nrCDLChannel, where the implementations are exactly fol-
13: Perform stochastic gradient descent according to (23). lowing the aspects of 3GPP 5G new radio (NR) standard
14: Update critic parameters: θ c ← θ c + α∇θ c Lt (θ c ). protocol TR 38.901 [39]. When generating channel coefficients
15: Update the actor policy following (25). with nrCDLChannel, the Max Doppler frequency is set to
16: Update the prediction network and the target networks be 12.5Hz, which indicates that the channel model is time-
following (26). varying. Another one is the standard Gaussian channel model,
17: end for and the channel coefficients are generated with MATLAB
block comm.AWGNChannel. The above two channels are both
frequency selective channels. The Zadoff–Chu (ZC) sequences
learning algorithms, the use of non-linear function approx- (also referred to as Chu sequence or Frank–Zadoff–Chu (FZC)
imators “nullifies any convergence guarantees” [26], [37]. sequence) are used as pilot signals, which are complex-valued
Thus, it is extremely hard to provide precise upper bound mathematical sequences. The ZC sequences derived from the
or convergence proof [36]. Instead, the following simulation same root sequence often have a constant autocorrelation
results could demonstrate that the learning results of proposed coefficient, and the correlation number is 0 [41]. One of
algorithms are stable and advantageous without the need for Zadoff–Chu sequence’s properties is that when the length L
any modifications or extra assumptions on the environment. is an odd number, the sequence is periodic. In our simulation,
Also, we have provided computation complexity analysis the pilot signals, i.e., Zadoff–Chu sequences, are generated
comparisons in Section VI-B to further certify the efficiency with MATLAB function zadoffChuSeq(). The length of the
and effectiveness of the proposed algorithm. In this paper, pilot sequences is prime number with L = 9. The system
we would like to explore the possible performance advantages bandwidth is 100MHz and the carrier frequency is 6GHz. The
of using deep RL in physical layer problems. Therefore, location of the BS is fixed and the transmit power of BS is
all the simulations in the following section are implemented 46 dBm.
in a single cell scenario. When exploiting the DRL based In our experiments, the actor and critic networks are both
algorithms in complex system with more users and antennas, composed of two hidden layers with 400 and 300 nodes,
the large computation burden at BSs could be eased with respectively. The learning rate for actor and critic networks
centralized server, which could use the local gradients from are 10−4 and 10−3 , respectively. The discount factor γ is set
BSs to update a set of global parameters and periodically to be 0.99. We train the deep RL network with a mini-batch
synchronize the parameters with BSs. With this distributed size of 64 and a replay buffer size of 105 . The results are
framework with centralized server, and the hybrid online and averaged over 100 independent runs with random initial states.
offline training manner, the computational cost at the local The training step indicates the learning episode in Algorithm 1.
BSs could be effectively decreased [34], which would be a However, in order to achieve smoother and more general
promising future research topic. performance comparisons, the presented sum rates and losses
in the figures are further averaged by taking the mean over
a moving window of 200 training steps. All the simulation
VI. N UMERICAL S IMULATIONS AND A NALYSIS
results are obtained with TensorFlow 1.14.0 and python 3.7 2 .
A. Simulation Results
In this section, we evaluate the performance of the proposed 2 The online project information is available at: https://github.com/mcccc4/
algorithms through numerical simulations. We choose four RL-based-PL-transmission.


Fig. 8. Average sum rate with proposed CP algorithm.
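The channel prediction loss reported in Fig. 9 (and later in Figs. 12 and 14) is the mean square error (MSE) between the predicted and the true downlink CSI. A minimal sketch of this metric for complex channel matrices is shown below; the plain entry-wise average, without normalization by the channel power, is an assumption, since the exact normalization is not spelled out in the text.

```python
import numpy as np

def csi_mse(H_pred: np.ndarray, H_true: np.ndarray) -> float:
    """Mean squared error between predicted and true complex CSI matrices (M x NA)."""
    return float(np.mean(np.abs(H_pred - H_true) ** 2))

# Toy example with M = 2 users and NA = 4 BS antennas.
rng = np.random.default_rng(0)
H_true = (rng.standard_normal((2, 4)) + 1j * rng.standard_normal((2, 4))) / np.sqrt(2)
H_pred = H_true + 0.05 * (rng.standard_normal((2, 4)) + 1j * rng.standard_normal((2, 4)))
print(csi_mse(H_pred, H_true))   # values in the 10^-3 to 10^-2 range indicate accurate prediction
```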

Fig. 9. Channel prediction loss of proposed CP scheme.

Firstly, we show the average sum rate of the proposed
channel prediction (CP) algorithm over different training steps
in Figs. 8(a) and 8(b) with CDL channel and Gaussian
channel, respectively. The benchmark is the standard minimum
mean square error (MMSE) based channel estimation (MMSE
Channel Estimation), and the ZF beamformer is also adopted in MMSE channel estimation for a fair comparison [9]. We can see from the plots that at first, when the training algorithm has not yet converged, the MMSE scheme performs better than the proposed channel prediction (CP). However, after
about 13000 training steps, the sum rate of the proposed
RL algorithm exceeds the MMSE and gradually converges
to a stable state. Finally, the average sum rate of the proposed scheme exceeds that of the MMSE scheme by 21.56% and 3.68% for the CDL channel and the Gaussian channel, respectively.

Fig. 10. Performance comparisons for CP w/ or w/o feedback CSI.
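For reference, the zero-forcing (ZF) beamformer applied on top of the estimated or predicted CSI, and the downlink sum rate used in the comparisons above, can be sketched as follows. The Frobenius-norm power normalization, equal symbol power, and the toy dimensions (M = 4 users, NA = 16 BS antennas) are illustrative assumptions rather than the authors' exact setup.

```python
import numpy as np

def zf_beamformer(H_hat: np.ndarray, p_tx: float) -> np.ndarray:
    """Zero-forcing precoder built from (estimated) CSI H_hat of shape (M, NA)."""
    W = H_hat.conj().T @ np.linalg.inv(H_hat @ H_hat.conj().T)   # NA x M, so that H_hat @ W ≈ I
    return W * np.sqrt(p_tx / np.linalg.norm(W, 'fro') ** 2)     # scale to total transmit power p_tx

def sum_rate(H_true: np.ndarray, W: np.ndarray, noise_power: float) -> float:
    """Sum rate when a precoder built from estimated CSI meets the true channel H_true."""
    G = np.abs(H_true @ W) ** 2            # G[k, j] = |h_k w_j|^2 with h_k the k-th row of H_true
    signal = np.diag(G)
    interference = G.sum(axis=1) - signal  # residual multi-user interference from CSI mismatch
    return float(np.sum(np.log2(1.0 + signal / (interference + noise_power))))

# Toy example: ZF from imperfect CSI evaluated on the true channel.
rng = np.random.default_rng(1)
H = (rng.standard_normal((4, 16)) + 1j * rng.standard_normal((4, 16))) / np.sqrt(2)
H_hat = H + 0.1 * (rng.standard_normal((4, 16)) + 1j * rng.standard_normal((4, 16)))
W = zf_beamformer(H_hat, p_tx=1.0)
print(sum_rate(H, W, noise_power=0.01))
```

With perfect CSI (H_hat equal to H) the interference term vanishes and the expression reduces to the familiar interference-free ZF sum rate.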
One reason for this performance advantage is that when using the MMSE estimator, the channel covariance matrix of the CDL channel is estimated by computing the average covariance over 500 channel matrices. This approximation of the channel
covariance matrix may introduce estimation error. However,
the proposed CP algorithm directly predicts the CSI by learn-
ing the channel dynamic model without covariance matrix
approximation. Another reason is that the MMSE method uses
channel reciprocity to get downlink CSI with estimated uplink
CSI. This could result in an estimation loss when the downlink CSI is not exactly the conjugate transpose of the uplink CSI. The third reason is that the proposed CP algorithm directly maximizes the sum rate, while the MMSE estimator minimizes the estimation error; thus the proposed CP algorithm provides better sum rate performance.

The mean square error (MSE) of the prediction loss between the proposed scheme and the true CSI is shown in Fig. 9, where the left subfigure is the loss variation over training steps and the right subfigure is the zoomed view after the algorithm converges. It is obvious from the left subfigure in Fig. 9 that when the learning algorithm has not converged, the prediction loss is relatively high. However, with more training steps, the prediction loss converges, and the zoomed subfigure shows that the stable prediction loss of the proposed scheme finally decreases to values between 10^-3 and 10^-2, which is an acceptable error range in practical applications.

In this paper, the proposed DRL based methods aim at capturing the real-time dynamics of the communication environment; therefore, the neural networks are trained at the BS in an online manner. We use the feedback CSI as the booster inputs for channel prediction, which makes the CP algorithm more stable and robust, as shown in Fig. 10. We can see in Figs. 8, 9 and 10 that once the algorithms converge, the performance is maintained at a stable level and the algorithms are robust. Even though there might be some fluctuations due to the dynamic variations of the environment, the algorithms can adjust quickly and restore the performance, as can be seen from the simulation results in Figs. 8, 9 and 10. Specifically, we can see that with the feedback information, the CP algorithm has a better prediction performance and a 9.10% higher sum rate than that without feedback information.

Fig. 11. Average sum rate with proposed joint CP and BF algorithm.

In the following, we show the performance evaluation of the proposed joint CP and BF algorithm. It can be seen from Fig. 11 that the proposed joint CP and BF algorithm obviously has a higher sum rate than the weighted MMSE beamforming method with minimum mean square error based channel estimation (WMMSE with MMSE CE)



and with the proposed channel prediction algorithm (WMMSE with Proposed CP) after convergence. This is because the proposed CP algorithm only focuses on the CP without learning based beamformers, while the joint channel prediction and beamforming algorithm optimizes the end-to-end transmission with the objective of maximizing the sum rate. Therefore, the channel prediction algorithm might fall into a local optimum of the proposed joint CP and BF objective. The weighted MMSE
beamforming with perfect CSI (WMMSE with perfect CSI)
is presented as the performance upper bound. As shown in
Fig. 11(a), at the beginning, the proposed joint CP and BF
algorithm is in an exploration state and may have relatively
low sum rate. After more training time, the proposed algorithm continuously adjusts its policy and converges to a performance close to the upper bound. The sum rate performance of the joint scheme achieves 99.70% and 98.76% of the upper bound for the CDL and Gaussian channels, respectively.

Fig. 12. Channel prediction loss of joint CP and BF scheme.

Fig. 12 demonstrates the channel prediction performance of the proposed joint algorithm. The left subfigure shows the prediction loss variation over training steps; we can see that the prediction loss of the proposed joint algorithm always converges to a stable state. The right subfigure shows the zoomed loss values after convergence, which clearly indicates that, as an intermediate prediction result, the channel prediction loss of the joint CP and BF algorithm can still achieve an acceptable level in both academic and industrial applications.

Fig. 13. Performance over different numbers of antennas and users, and UE transmit power.

Moreover, the sum rate performance comparisons with different numbers of UEs and antennas, and with different uplink pilot transmit power at the UEs, are shown in Figs. 13(a) and 13(b), respectively, which further verify the performance advantages of the proposed algorithms. As can be seen in both the left and right subfigures of Fig. 13(a), the proposed joint RL algorithm always has a higher sum rate than the other benchmarks under different numbers of UEs. Also, from the results in Fig. 13(b), the proposed joint CP and BF algorithm provides the optimal performance, followed by the proposed CP algorithm and the weighted MMSE beamforming method with minimum mean square error based channel estimation (WMMSE with MMSE CE), which shows the proposed joint RL algorithm's ability of jointly handling the channel variations and obtaining the beamforming gain.

Fig. 14. Performance comparisons.

Finally, we compare the sum rate and the channel estimation loss of the proposed algorithms with the learning based and non-learning based benchmarks. Fig. 14(a) presents the average sum rate performance comparisons. We can see that the sum rate of the proposed joint algorithm is the closest to the WMMSE scheme, and is also 10.693% higher than that of the iterative algorithm induced deep unfolding neural network (IAIDNN). This performance gap is mainly because the IAIDNN only uses a DNN to replace the intermediate computational processes of the traditional method and has more local optima and saddle points than the proposed algorithm. Another important advantage is that both proposed RL based methods need no channel information or training data sets during the policy updating process. However, IAIDNN assumes that the channel information is known as training and testing sets, which leads to the great challenge of obtaining real training data samples and the high cost of implementation from the perspective of realistic industrial application. Fig. 14 also shows the channel prediction loss performance. It is obvious that both proposed RL based algorithms achieve a prediction loss between 10^-3 and 10^-1. Even though the learning based sliding bi-directional gated recurrent unit (SBGRU) based channel estimation scheme also has a reasonable estimation loss, it not only requires the pilot sequences, but also needs to know the accurate

statistic distributions of the channel model, which leads to the challenge of obtaining training data sets and the high cost of implementing channel estimation in practical applications.

B. Computational Complexity Analysis

In this subsection, we provide approximate computational complexity analyses of the proposed algorithms. According to [37], for stochastic policy gradient based learning algorithms, the computational complexity of all the parameter updates is O(mn) per time step, where m and n denote the action output dimension and the number of policy parameters, respectively [42].

Firstly, we estimate the computational complexities of the proposed methods and the benchmarks. We denote the sizes of the input layer, the first hidden layer, the second hidden layer and the output layer in the actor network as I, h1, h2 and U, respectively. The number of items in a vector is denoted by |·|. The output sizes of the actor and the Q-values are the same and equal to N = 2 × M × NA. Thus, for the proposed channel prediction algorithm, the total number of parameters in the actor network is |θa| = I + h1 + h2 + U. Since the critic network has a similar neural network structure to the actor network, we have |θc| = I + h1 + h2 + U + N. Therefore, the total number of operations for the proposed channel prediction algorithm is approximately 4 M NA (I + h1 + h2 + U) + 2 M N NA, and the approximated computational complexity can be written as O(M NA (I + h1 + h2 + U + N)). The deep learning (DL) based channel estimation benchmark is the sliding bi-directional gated recurrent unit (SBGRU) in [11]. According to the analysis in [11], the corresponding computational complexity of SBGRU is O(hs^2 nl M), where hs and nl are the number and the size of the hidden layers, respectively. In general, the size of a hidden layer is larger than the size of the input, i.e., hs ≫ M [11]. Based on the above analysis, it is obvious that the complexity of the proposed RL based CP algorithm is smaller than that of the benchmark in [11].

Similarly, we show the complexity comparison of the proposed joint channel prediction and beamforming algorithm and the convolutional neural network (CNN) based iterative algorithm induced deep unfolding neural network (IAIDNN) method [15]. The first layer of the joint channel prediction and beamforming algorithm is the prediction network. Assuming that the size of the prediction network is hl, according to the above calculation, the approximate computational complexity of the first layer is O(M NA (I + U + hl)). The estimated complexity of the second layer with the actor-critic network is O(M NA (I + h1 + h2 + U + N)). Thus, the approximate complexity of the proposed joint algorithm is O(M NA (h1 + h2 + hl + 2I + 2U + N)). In comparison, the complexity of IAIDNN [15] is approximated as O((M NA)^2.37 + Σ_{l=1}^{L} (sl^2 cl^2)), where L is the total number of layers, and sl and cl represent the size of the convolution kernel and the number of channels, respectively. By comparing the results, it can be seen that the proposed actor-critic based joint algorithm is competitive compared with the deep unfolding neural network based method.

In order to show the computational complexity more clearly, we further analyze the number of floating point operations (FLOPs) for the parameter updates of the proposed learning algorithms. The number of FLOPs in the proposed channel prediction algorithm is mainly determined by the structure of the actor and critic networks [43]. In the proposed channel prediction algorithm, the FLOPs in the actor and critic networks can be computed as FLOP_a = I ∗ h1 + h1 ∗ h2 + h2 ∗ N and FLOP_c = (I + N) ∗ h1 + h1 ∗ h2 + h2, respectively. Thus, the number of FLOPs in the proposed channel prediction algorithm is FLOP_CP = FLOP_a + FLOP_c. As for the proposed joint algorithm, it contains a prediction recurrent neural network (RNN) layer, a fully connected layer and an actor-critic network. We denote the hidden size of the RNN layer, and the input size and the output size of the actor-critic network as hs, Ī and N̄, respectively. I and N still equal the input size and output size of channel prediction, the same as in the channel prediction algorithm. Then, the number of FLOPs is

FLOP_Joint = [(I + hs) ∗ hs ∗ 4 ∗ 2] + [2 ∗ I ∗ N] + [Ī ∗ h1 + h1 ∗ h2 + h2 ∗ N̄] + [(Ī + N̄) ∗ h1 + h1 ∗ h2 + h2].   (27)

Under the simulation scenario with M = 10, NA = 16, h1 = 300, h2 = 400 and hs = 260 ∗ 3, the approximated numbers of FLOPs of the proposed CP algorithm and the joint CP and BF algorithm are 0.6321G and 2.8719G, respectively.

VII. CONCLUSION

In this paper, we consider a multi-user multiple-antenna downlink scenario. To tackle the existing challenges in traditional methods, we propose two RL based end-to-end CP and
ing to the analysis in [11], the corresponding computational tional methods, we propose two RL based end-to-end CP and

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.

BF designs. Firstly, we only adopt RL in channel prediction [14] A. Alkhateeb, S. Alex, P. Varkey, Y. Li, Q. Qu, and D. Tujkovic, “Deep
and propose an actor-critic based channel prediction scheme learning coordinated beamforming for highly-mobile millimeter wave
systems,” IEEE Access, vol. 6, pp. 37328–37348, 2018.
without the premise of perfect CSI. The learning agent BS [15] Q. Hu, Y. Cai, Q. Shi, K. Xu, G. Yu, and Z. Ding, “Iterative algorithm
imports the received pilots into the prediction network and uses induced deep-unfolding neural networks: Precoding design for multiuser
the predicted CSI to generate downlink beamforming matrixes MIMO systems,” IEEE Trans. Wireless Commun., vol. 20, no. 2,
pp. 1394–1410, Feb. 2021.
with ZF. Secondly, we propose a joint channel prediction and [16] Y. Hu et al., “Optimal transmit antenna selection strategy for
beamforming learning architecture which includes two layers: MIMO wiretap channel based on deep reinforcement learning,” in
the first layer is the CSI prediction network as similar to the Proc. IEEE/CIC Int. Conf. Commun. China (ICCC), Beijing, China,
Aug. 2018, pp. 803–807.
CP algorithm; and by employing the outputs of the first layer [17] T. J. O’Shea, T. Erpek, and T. C. Clancy, “Physical layer deep learning of
as the inputs, the second layer is the actor-critic network for encodings for the MIMO fading channel,” in Proc. 55th Annu. Allerton
exporting beamforming policy and Q-value evaluation. All the Conf. Commun., Control, Comput. (Allerton), Monticello, IL, USA,
Oct. 2017, pp. 76–80.
network parameters are updated jointly with the objective of [18] J. P. González-Coma, J. Rodríguez-Fernández, N. González-Prelcic,
maximizing the sum rate reward using the deep policy gradient L. Castedo, and R. W. Heath, Jr., “Channel estimation and hybrid
method. The simulations verified that the proposed algorithms precoding for frequency selective multiuser mmWave MIMO systems,”
IEEE J. Sel. Topics Signal Process., vol. 12, no. 2, pp. 353–367,
could always get converged and stable after certain training May 2018.
steps. The results show that the learning scheme could achieve [19] L. Zhao, D. W. K. Ng, and J. Yuan, “Multi-user precoding and channel
a prediction loss of 10−2 under different simulation conditions. estimation for hybrid millimeter wave systems,” IEEE J. Sel. Areas
Commun., vol. 35, no. 7, pp. 1576–1590, Jul. 2017.
Compared with the MMSE channel estimator, the proposed
[20] A. M. Elbir, “A deep learning framework for hybrid beamforming
channel prediction scheme has an average sum rate gain as without instantaneous CSI feedback,” IEEE Trans. Veh. Technol., vol. 69,
much as 21.56%. And the proposed two-layer joint algorithm no. 10, pp. 11743–11755, Oct. 2020.
could achieve as much as 99.7% to 98.76% of WMMSE with [21] H. Wang, J. Fang, P. Wang, G. Yue, and H. Li, “Efficient beam-
forming training and channel estimation for millimeter wave OFDM
perfect CSI in terms of sum rate without introducing large systems,” IEEE Trans. Wireless Commun., vol. 20, no. 5, pp. 2805–2819,
computation overheads. May 2021.
[22] V. Mnih et al., “Human-level control through deep reinforcement learn-
ing,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
R EFERENCES [23] V. Mnih et al., “Playing atari with deep reinforcement learning,” 2013,
arXiv:1312.5602.
[1] I. Ahmed and H. Khammari, “Joint machine learning based resource
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction,
allocation and hybrid beamforming design for massive MIMO systems,”
2nd ed. Cambridge, MA, USA: MIT Press, 2014.
in Proc. IEEE Globecom Workshops (GC Wkshps), Abu Dhabi, UAE,
[25] S. Verdú, “Fifty years of Shannon theory,” IEEE Trans. Inf. Theory,
Dec. 2018, pp. 1–6.
vol. 44, no. 6, pp. 2057–2078, Oct. 1998.
[2] M. Chu, X. Liao, H. Li, and S. Cui, “Power control in energy harvesting
[26] T. P. Lillicrap et al., “Continuous control with deep reinforcement
multiple access system with reinforcement learning,” IEEE Internet
learning,” 2015, arXiv:1509.02971.
Things J., vol. 6, no. 5, pp. 9175–9186, Oct. 2019.
[3] M. Wenyan, Q. Chenhao, Z. Zhang, and J. Cheng, “Sparse channel [27] M. Chu, H. Li, X. Liao, and S. Cui, “Reinforcement learning-based
estimation and hybrid precoding using deep learning for millimeter wave multiaccess control and battery prediction with energy harvesting in
massive MIMO,” IEEE Trans. Commun., vol. 68, no. 5, pp. 2838–2849, IoT systems,” IEEE Internet Things J., vol. 6, no. 2, pp. 2009–2020,
Feb. 2020. Apr. 2019.
[4] Y.-S. Jeon, J. Li, N. Tavangaran, and H. V. Poor, “Data-aided channel [28] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Proc. Conf.
estimator for MIMO systems via reinforcement learning,” in Proc. IEEE Neural Inf. Process. Syst., Denver, CO, USA, Dec. 2000, pp. 1008–1014.
ICC, Dublin, Ireland, Jun. 2020, pp. 1–6. [29] J. Sun, Y. Ren, and Z. Yonghang, “A signal-to-noise ratio feedback
[5] S. Park, B. Shim, and J. W. Choi, “Iterative channel estimation using method and equipment,” Chin. Patent CN 102 546 124 A, 2012.
virtual pilot signals for MIMO-OFDM systems,” IEEE Trans. Signal [30] J. F. T. Cheng, S. Grant, L. Krasny, K. Molnar, and Y. P. E. Wang,
Process., vol. 63, no. 12, pp. 3032–3045, Jun. 2015. “Method and arrangement for SINR feedback in MIMO based wireless
[6] V. Raj and S. Kalyani, “Backpropagating through the air: Deep learning communication systems,” U.S. Patent 8 644 263, 2014.
at physical layer without channel models,” IEEE Commun. Lett., vol. 22, [31] M. Kurras, S. Jaeckel, L. Thiele, and V. Braun, “CSI compression and
no. 11, pp. 2278–2281, Nov. 2018. feedback for network MIMO,” in Proc. IEEE 81st Veh. Technol. Conf.
[7] Q. Mao, F. Hu, and Q. Hao, “Deep learning for intelligent wireless (VTC Spring), Boston, MA, USA, May 2015, pp. 1–5.
networks: A comprehensive survey,” IEEE Commun. Surveys Tuts., [32] J. Guo, L. Wang, F. Li, and J. Xue, “CSI feedback with model-driven
vol. 20, no. 4, pp. 2595–2621, Jun. 2018. deep learning of massive MIMO systems,” IEEE Commun. Lett., vol. 26,
[8] Z. Qin, H. Ye, G. Y. Li, and B. H. F. Juang, “Deep learning in no. 3, pp. 547–551, Mar. 2022.
physical layer communications,” IEEE Wireless Commun., vol. 26, no. 2, [33] D. Neumann, T. Wiese, and W. Utschick, “Learning the MMSE channel
pp. 93–99, Mar. 2019. estimator,” IEEE Trans. Signal Process., vol. 66, no. 11, pp. 2905–2917,
[9] Y. Liao, Y. Hua, and Y. Cai, “Deep learning based channel estimation Jun. 2018.
algorithm for fast time-varying MIMO-OFDM systems,” IEEE Commun. [34] Y. Xu, W. Xu, Z. Wang, J. Lin, and S. Cui, “Load balancing for
Lett., vol. 24, no. 3, pp. 572–576, Mar. 2020. ultradense networks: A deep reinforcement learning-based approach,”
[10] Y. Yang, F. Gao, X. Ma, and S. Zhang, “Deep learning-based channel IEEE Internet Things J., vol. 6, no. 6, pp. 9399–9412, Dec. 2019.
estimation for doubly selective fading channels,” IEEE Access, vol. 7, [35] M. Hausknecht and P. Stone, “Deep reinforcement learning in parame-
pp. 36579–36589, 2019. terized action space,” 2015, arXiv:1511.04143.
[11] Q. Bai, J. Wang, Y. Zhang, and J. Song, “Deep learning-based [36] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-
channel estimation algorithm over time selective fading channels,” policy maximum entropy deep reinforcement learning with a stochastic
IEEE Trans. Cognit. Commun. Netw., vol. 6, no. 1, pp. 125–134, actor,” in Proc. ICML, Stockholm, Sweden, Jul. 2018, pp. 1861–1870.
Mar. 2020. [37] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller,
[12] H. He, C.-K. Wen, S. Jin, and G. Y. Li, “Deep learning-based “Deterministic policy gradient algorithms,” in Proc. 31st Int. Conf.
channel estimation for beamspace mmWave massive MIMO sys- Mach. Learn., Beijing, China, Jun. 2014, pp. 387–395.
tems,” IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 852–855, [38] D. H. Nguyen and T. Le-Ngoc, “MMSE precoding for multiuser MISO
Oct. 2018. downlink transmission with non-homogeneous user SNR conditions,”
[13] B. Zhu, J. Wang, L. He, and J. Song, “Joint transceiver optimization EURASIP J. Adv. Signal Process., vol. 2014, no. 1, pp. 1–12, Dec. 2014.
for wireless communication PHY using neural network,” IEEE J. Sel. [39] 5G; NR; Overall Description; Stage-2 (3GPP TS 38.300 Version 15.3.1
Areas Commun., vol. 37, no. 6, pp. 1364–1373, Jun. 2019. Release 15), 3GPP, TSGR, document TS 138 300, Oct. 2018.


[40] X. Zhao, E. Lukashova, F. Kaltenberger, and S. Wagner, “Practical Vincent K. N. Lau (Fellow, IEEE) received the B.E.
hybrid beamforming schemes in massive MIMO 5G NR systems,” in degree (Hons.) from The University of Hong Kong
Proc. 23rd Int. ITG Workshop Smart Antennas. Vienna, Austria: VDE, in 1992 and the Ph.D. degree from Cambridge Uni-
Apr. 2019, pp. 1–8. versity in 1997. He was with Bell Labs from 1997 to
[41] Evolved Universal Terrestrial Radio Access (E-UTRA); Physical Chan- 2004 and the Department of ECE, The Hong Kong
nels and Modulation, document TS36.211, Jun. 2013. University of Science and Technology (HKUST) in
[42] A. Vaswani et al., “Attention is all you need,” in Proc. NIPS, 2004. He is currently a Chair Professor and the
Long Beach, CA, USA, Jun. 2017, pp. 1–11. Founding Director of Huawei-HKUST Joint Inno-
[43] W. Li, W. Ni, H. Tian, and M. Hua, “Deep reinforcement learning for vation Laboratory, HKUST. His current research
energy-efficient beamforming design in cell-free networks,” in Proc. focuses include stochastic optimization, massive
IEEE Wireless Commun. Netw. Conf. Workshops (WCNCW), Nanjing, MIMO, content-centric wireless networking, wire-
China, Mar. 2021, pp. 1–6. less networking for mission-critical control, and federated learning for 6G
wireless networks.

Man Chu (Member, IEEE) received the B.E.,


M.S., and Ph.D. degrees in information and com-
munication engineering from Xi’an Jiaotong Uni- Chen Jiang received the B.S. degree in electrical
versity, Xi’an, China, in 2011, 2013, and 2019, engineering and automation from the Huazhong Uni-
respectively. From 2015 to 2016, she was a visit- versity of Science and Technology, Wuhan, China,
ing student with the Department of ECE, HKUST. in 2008, the master’s degree in electrical and com-
From 2016 to 2018, she was a Visiting Scholar with puter engineering from Wichita State University,
the Department of ECE, University of California Wichita, KS, USA, in 2013, and the Ph.D. degree
at Davis, Davis, CA, USA. She is currently a in electrical and computer engineering from the
Senior Lecturer with Shenzhen MSU-BIT Univer- University of California at Davis, Davis, CA, USA,
sity, Shenzhen, China. Her research interests include in 2018. Currently, he is an Engineer with DJI
reinforcement learning and federated learning for communication, wireless Technology Inc., CA, USA. His research interests
resource management, access and power control, wireless energy harvesting, include machine learning for wireless communica-
and stochastic optimization. tion, wireless broadcast/multicast and trans-layer designs, and transmission
designs for UAV self-organizing networks.

An Liu (Senior Member, IEEE) received the


B.S. and Ph.D. degrees in electrical engineering
from Peking University, China, in 2004 and 2011,
respectively. From 2008 to 2010, he was a Visit- Tingting Yang (Member, IEEE) received the Ph.D.
ing Scholar with the Department of ECEE, Uni- degree from Dalian Maritime University, China,
versity of Colorado Boulder, CO, USA. He was a in 2010. She is currently a Professor with the
Post-Doctoral Research Fellow from 2011 to 2013, Pengcheng Laboratory, China. Since September
a Visiting Assistant Professor in 2014, and 2012, she has been a Visiting Scholar with the
a Research Assistant Professor from 2015 to Broadband Communications Research (BBCR) Lab-
2017 with the Department of ECE, HKUST. He is oratory, Department of Electrical and Computer
currently a Distinguished Research Fellow with the Engineering, University of Waterloo, Canada. Her
College of Information Science and Electronic Engineering, Zhejiang Uni- research interests are in the areas of maritime wide
versity. His research interests include wireless communications, stochastic band communication networks, DTN networks, and
optimization, compressive sensing, and machine/deep learning for communi- green wireless communication. She also serves as an
cations. He is serving as an Editor for IEEE T RANSACTIONS ON S IGNAL Associated Chair for IEEE ICC’20 and ICC’21. She serves as the Associate
P ROCESSING, IEEE T RANSACTIONS ON W IRELESS C OMMUNICATIONS , Editor-in-Chief for IET Communications, as well as the Advisory Editor for
and IEEE W IRELESS C OMMUNICATIONS L ETTERS . SpringerPlus.

