Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SANTA CATARINA. Downloaded on January 25,2024 at 20:19:02 UTC from IEEE Xplore. Restrictions apply.
10272 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO. 12, DECEMBER 2022
as labels during the training process, as in [11]. Furthermore, some existing schemes estimated the uplink CSI and obtained the downlink CSI by assuming perfect channel reciprocity [8], [12]. However, in time-varying channels where the downlink CSI is not the conjugate transpose of the uplink CSI, or when imperfect channel reciprocity occurs due to hardware limitations, the above methods may suffer significant estimation performance loss.

Recently, a number of studies have also developed learning-aided algorithms to reduce the heavy computational complexity of beamforming [13], [14]. An iterative-algorithm-induced deep-unfolding neural network based precoding scheme was proposed in [15], in which the high-complexity operations of the traditional iterative WMMSE algorithm were replaced by a multi-layer neural network. However, the above papers all focused only on the beamforming scheme and assumed that the transmission channel model information was known a priori. Moreover, most existing beamforming schemes were obtained from outdated estimated CSI, due to the rapid variation of wireless channels and the long processing time of the optimization and feedback procedures [16], [17]. Outdated CSI leads to outdated beamforming strategies, which yields only locally optimal sum rates and cannot ensure efficient data transmission.

Previous works have thus studied channel estimation and beamforming separately; only a little literature has considered channel estimation and beamforming jointly to obtain an end-to-end performance gain. There are two non-DL based channel estimation and hybrid beamforming designs for MIMO systems, where the authors first estimated the uplink CSI, using compressed sensing in [18] and by exploiting the strongest angles-of-arrival in [19]. Then, by employing channel reciprocity, the hybrid precoders were obtained with iterative algorithms in [18] and ZF in [19]. As for DL based algorithms, an uplink channel estimation and hybrid beamforming network was proposed in [20], where the BS first used a neural network to estimate the channel covariance matrix and then fed the covariance matrix back to the users for channel reconstruction and beamforming. Heavy feedback overheads and channel statistical information were required in [20]. The authors in [3] first proposed a deep learning based channel beam-wave amplitude estimation scheme and used the results to estimate the channel; with the estimated channel, a neural network based analog precoding scheme was then proposed. However, the scheme in [3] required statistical model information as a premise, and only the analog precoding matrix was optimized.

The performance of beamforming strongly relies on the quality of the channel knowledge. Hence, it is crucial to perform accurate channel prediction (CP) and efficient beamforming jointly, with fewer requirements [20], [21]. Different from the existing DL or non-DL based schemes, a reinforcement learning (RL) agent can learn statistical information about an unknown channel, or find the optimal policy, by interacting with the environment, without the premise of perfect CSI, statistical channel models, or large numbers of true CSI training data sets. Therefore, the constraints described above can be greatly eased with RL. Moreover, the differences between the uplink and downlink CSI caused by delay or imperfect channel reciprocity can be eliminated, since the downlink CSI is predicted directly at the BS with RL.

Deep Q-learning (i.e., the deep Q-network [22]) is one of the most popular deep RL techniques. However, the Q-learning method may not be able to deal directly with cases where the state and action spaces are continuous, such as the CE and beamforming problems in the physical layer [23]. Fortunately, an actor-critic method was proposed to address problems with continuous spaces [24]; it includes an actor network and a critic network, represented by two deep neural networks. The actor network directly outputs the actions (the policy), and the critic network outputs estimated value functions that measure the performance of the actions. Our proposed algorithms are built on this framework.

B. Main Contributions

In order to tackle the issues mentioned above, in this paper we consider a typical wireless communication scenario with multiple UEs and one BS, where no statistical or perfect knowledge is assumed about the channel dynamics. In order to fully obtain the capacity gain, RL based end-to-end CP and beamforming algorithms are studied, owing to the capability of RL to solve complicated nonlinear problems. We consider two end-to-end frameworks. First, we propose an actor-critic based channel prediction network, which predicts the downlink CSI directly without channel reciprocity; after obtaining the predicted CSI, the BS adopts a ZF precoder for downlink transmission. Second, we develop a two-layer joint CP and beamforming algorithm, where the first layer is the proposed CP network, aiming at acquiring the predicted downlink CSI, and the second layer is an actor-critic based beamforming policy generating network. In particular, the two-layer network is a tandem structure: the output of the first layer, i.e., the predicted CSI, is fed into the second layer as its input. The parameters of the proposed neural networks (NNs) are optimized with the objective of maximizing the downlink transmission sum rate.

The main contributions are summarized as follows:

• We model the multi-user multi-antenna channel prediction and beamforming design as a Markov decision process (MDP). The formulated problems aim to choose good policies for real-time CSI prediction and beamforming, with the objective of maximizing the long-term expected accumulated downlink sum rate, while implicitly minimizing the CSI prediction loss. In the proposed CP algorithm, the users only need to feed back the CSI occasionally. Due to this infrequent information feedback, the proposed RL based methods have lower feedback overheads than some existing approaches [3], [20].

• We propose two RL based CP and beamforming frameworks without prior assumptions of perfect CSI or plentiful training CSI data sets, which are hard to obtain in practical applications. The RL algorithm can directly predict the current downlink CSI at the BS by constantly interacting with the environment, thus eliminating the
CHU et al.: DEEP REINFORCEMENT LEARNING BASED END-TO-END MULTIUSER CHANNEL PREDICTION AND BEAMFORMING 10273
…tational complexity analysis. The results show that the proposed algorithms are robust and can always converge to stable states under various simulation settings. Also, the proposed algorithms achieve higher sum rates than both traditional methods and learning based baselines, with affordable computational overheads.

The remainder of this paper is organized as follows. In Section II, the multi-user downlink system, the performance metrics, and the proposed end-to-end frameworks are presented. Section III briefly introduces some preliminaries on MDPs, policy gradients, and the actor-critic architecture. In Section IV, the problem formulation and the RL based CP algorithm are given. Then, the joint CP and beamforming problem is formulated, and the proposed two-layer actor-critic based algorithm is presented in detail in Section V. We provide the simulation results and the computational complexity analysis in Section VI, and conclusions in Section VII.

II. SYSTEM MODEL

A. Multi-User System Model

In this paper, we consider a multi-user transmission system with one base station (BS) and $M$ user equipments (UEs), as shown in Fig. 1. The BS is equipped with $N_A$ antennas and each UE is equipped with a single omnidirectional antenna. The transmissions are in time division duplex (TDD) mode, and the system is assumed to operate in a time-slotted way with normalized equal-length time slots (TSs). Each TS contains

$$\mathbf{Y}_k = \sum_{i=1}^{M} \mathbf{G}_i s_{ik} + \mathbf{Z}_k, \tag{2}$$

where $\mathbf{Y}_k = [Y_{1k}, \cdots, Y_{N_A k}]$ and $\mathbf{Z}_k = [Z_{1k}, \cdots, Z_{N_A k}]$. Thus, we can obtain the total received pilot matrix at the BS,

$$\mathbf{Y} = \sum_{i=1}^{M} \mathbf{G}_i^{T} \mathbf{s}_i + \mathbf{Z}, \tag{3}$$

where $\mathbf{Y} = [\mathbf{Y}_1, \cdots, \mathbf{Y}_K]^T$, $\mathbf{Z} = [\mathbf{Z}_1, \cdots, \mathbf{Z}_K]^T$, and $[\cdot]^T$ denotes the vector transpose.

Most of the existing works use pilot signals to estimate the uplink CSI and then employ channel reciprocity to obtain an approximation of the downlink CSI. However, in fast-varying channels, or when imperfect channel reciprocity occurs due to hardware limitations, the existing schemes introduce high estimation error, since the uplink and downlink channels are not constant within a transmission round. In the proposed algorithms, we aim to use the known pilot signals, the received pilots $\mathbf{Y}$, and some history information (to be introduced in detail in Section IV) to directly predict the real-time downlink CSI without channel reciprocity.

The set of predicted downlink CSI between the BS and the UEs is denoted as $\hat{\mathbf{H}} = [\hat{\mathbf{H}}_1, \cdots, \hat{\mathbf{H}}_M]$, where $\hat{\mathbf{H}}_i$ is the predicted CSI from the BS to UE $i$ and $\hat{\mathbf{H}}_i = [\hat{H}_{i1}, \cdots, \hat{H}_{iN_A}]$. The true downlink CSI from the BS to the UEs is represented as $\mathbf{H} = [\mathbf{H}_1, \cdots, \mathbf{H}_M]$, where $\mathbf{H}_i = [H_{i1}, \cdots, H_{iN_A}]$ is the true CSI from the BS to UE $i$. Here $\mathbf{H}_i$ is not always the conjugate transpose of the uplink CSI $\mathbf{G}_i$. The proposed learning algorithm
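As a quick numerical sanity check of the pilot model in Eqs. (2) and (3), the following sketch stacks the per-symbol rows of Eq. (2) and compares them against the matrix form of Eq. (3). The dimensions are illustrative, and the Gaussian channels, pilots, and noise are placeholders, not the paper's channel models:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N_A, K = 4, 8, 9        # users, BS antennas, pilot length (illustrative)

G = rng.normal(size=(M, N_A)) + 1j * rng.normal(size=(M, N_A))   # uplink CSI rows G_i
S = rng.normal(size=(M, K)) + 1j * rng.normal(size=(M, K))       # pilot sequences s_i
Z = 0.01 * (rng.normal(size=(K, N_A)) + 1j * rng.normal(size=(K, N_A)))  # noise

# Eq. (2), row by row: Y_k = sum_i G_i * s_ik + Z_k
Y_rows = np.stack([sum(G[i] * S[i, k] for i in range(M)) + Z[k] for k in range(K)])

# Eq. (3) in matrix form: Y = sum_i outer(s_i, G_i) + Z = S^T G + Z
Y = S.T @ G + Z
print(np.allclose(Y, Y_rows))   # True
```

The check confirms that summing the per-user rank-one contributions is the same as the single matrix product, which is how the BS would assemble the total received pilot matrix.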
…the policy with stochastic gradient descent [26]. The critic estimates the true action-value function $Q^{\pi}(s,a)$ with $Q_w(s,a)$, where $Q^{\pi}(s,a)$ is the true action-value function in deep Q-learning [27] under policy $\pi$ with state $s$ and action $a$, and $Q_w(s,a)$ is the estimated Q function [28]. We introduce the details of the networks in the following.

In our channel prediction and beamforming problems, the action spaces are continuous. Therefore, the core learning algorithm for the parameter updates in this paper is the deep deterministic policy gradient (DDPG) algorithm [26]. However, the employed neural network structure is built specifically to cope with the considered physical layer problems. Specifically, there are two main differences between our network and that of reference [26]. Firstly, although we use a similar actor-critic framework with an actor network and a critic network as in [26], the number of layers, the size of each layer, and the kind of neural network are chosen specifically for the considered optimization problems. Secondly, in the proposed joint algorithm, we present a two-layer neural network structure, which is also very different from the network in [26].

IV. CHANNEL PREDICTION WITH RL

In this section, RL is applied to the channel prediction module. A pilot based channel prediction scheme is employed; therefore, it is assumed that the BS and the UEs are cooperative, such that the pilot signals are known to both the BS and the UEs [9].

A. Channel Prediction Problem Formulation

In pilot based schemes, the BS estimates the CSI after the UEs finish transmitting the pilot symbols [3]. Similarly, in our proposed learning scheme, the CSI is predicted once the BS has received all the pilot sequences. Therefore, the learning time slot (TS) contains $K$ pilot-symbol durations in this paper. As shown in Fig. 3, the learning agent is the BS, and the proposed channel prediction system operates as follows. At the beginning of each learning TS, the UEs transmit pilot sequences to the BS through the uplink channel. After receiving all the pilot signals, the BS predicts the downlink CSI with the proposed actor-critic based learning algorithm. Then, with the estimated downlink CSI, the BS generates the precoding matrix with the zero-forcing (ZF) method and performs the downlink data transmission.¹ Finally, the UEs receive the downlink data and feed back their SINRs to the BS.

In order to improve the learning efficiency, speed up the convergence, and decrease the signaling overhead, we stipulate that each UE feeds its locally estimated CSI back to the BS every $W$ learning TSs. The BS thus receives the feedback CSI from the UEs every $W$ learning TSs and uses this limited, partial-time CSI in the next $W-1$ TSs. Therefore, the feedback signaling overhead is restricted. Many existing research works and patents have proposed efficient methods for SINR and channel feedback, such as [29]–[32]. Note that in this paper we mainly focus on the RL based channel prediction and downlink beamforming schemes, rather than on feedback design or signal estimation and detection. Based on this existing, mature research, we believe it is reasonable to assume practical, timely feedback and user signal estimation in this paper.

The sum rates at the UEs are affected by the beamforming matrix, while the beamforming gain depends on the accuracy of the predicted CSI. Thus, the learning objective of the proposed scheme is to maximize the received downlink sum rate, while minimizing the prediction loss is an implicit optimization goal. In this paper, orthogonal pilot sequences are not strictly required, since the deep RL based algorithm can predict the channel by interacting with the environment, rather than relying on the precise mathematical calculations over uncorrelated signals used in traditional methods. At each learning step, the learning agent selects the action that maximizes the expected long-term reward based on the Q-value approximated by the neural networks. Thus, as the number of training time slots increases, the approximation of the Q-value becomes more accurate, and better performance (better channel prediction and sum rate reward) can be expected.

We denote the sum rate at learning TS $t$ as $R_t$, which is defined in Eq. (7). The instantaneous reward at TS $t$ is

$$r_t = R_t = \sum_{i=1}^{M} B \log\left(1 + \Upsilon_{it}\right), \tag{10}$$

where $\Upsilon_{it}$ is the SINR of UE $i$ at TS $t$. For RL, the optimization objective, i.e., the discounted long-term reward, is

$$R_t^{\gamma} = \sum_{l=t}^{\infty} \gamma^{l-t} R_{l+1} = \sum_{l=t}^{\infty} \gamma^{l-t} \sum_{i=1}^{M} B \log\left(1 + \Upsilon_{i(l+1)}\right). \tag{11}$$

The goal of the proposed channel prediction scheme is to obtain the optimal prediction policy $\pi$ with the optimization objective of maximizing the cumulative discounted reward $J_1(\pi) = \mathbb{E}(R_t^{\gamma} \mid \pi)$, where $\mathbb{E}(\cdot)$ denotes the expectation operator. The channel prediction problem can be formulated as

$$\max_{\pi, \hat{\mathbf{H}}} \; J_1(\pi) = \mathbb{E}(R_t^{\gamma} \mid \pi). \tag{12}$$

B. Actor-Critic Based Channel Prediction

In this subsection, the proposed learning framework and the algorithm solving the problem in Eq. (12) are presented. In order to store the history information, it is assumed that a memory component with a sliding history window of $W$ time slots is

Fig. 3. Illustration of channel prediction system.

¹In the proposed CP algorithm, ZF can be replaced by other beamforming methods.
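A minimal numerical sketch of one learning TS in this loop may help. Here the actor network's output is stood in for by a noisy copy of the true channel playing the role of the predicted CSI, the precoder is ZF as in the text, and the reward follows Eq. (10); the base-2 logarithm, the unit noise power, and all dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N_A, B = 3, 4, 1.0            # users, BS antennas, bandwidth term (normalized)

# True downlink CSI and a stand-in "predicted" CSI (small prediction error)
H = rng.normal(size=(M, N_A)) + 1j * rng.normal(size=(M, N_A))
H_hat = H + 0.05 * (rng.normal(size=(M, N_A)) + 1j * rng.normal(size=(M, N_A)))

# ZF precoder from the *predicted* CSI: V = H_hat^H (H_hat H_hat^H)^{-1}, columns normalized
V = H_hat.conj().T @ np.linalg.inv(H_hat @ H_hat.conj().T)
V /= np.linalg.norm(V, axis=0)

# Per-user SINR under the *true* channel, then the Eq. (10) sum-rate reward
gains = np.abs(H @ V) ** 2                   # M x M effective channel gains
signal = np.diag(gains)
interference = gains.sum(axis=1) - signal
sinr = signal / (interference + 1.0)         # unit noise power assumed
reward = (B * np.log2(1.0 + sinr)).sum()     # instantaneous reward R_t
print(reward > 0)                            # True
```

Because the precoder is built from the predicted rather than the true CSI, residual inter-user interference remains, which is exactly why the sum-rate reward implicitly penalizes prediction error.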
the BF network, we use the feedback CSI $\mathbf{H}_{(t-\mathrm{mod}(t,W))}$ and the traditional ZF precoding method to generate the reference BF matrices, denoted as $\mathbf{V}_t^{zf}$. Therefore, the input of the actor network in the BF network contains the output values of $\Phi_P(\cdot|\theta_p)$ and the reference precoding matrix, and is denoted as $S_t^a = \{\bar{\mathbf{H}}_t, \mathbf{V}_t^{zf}\}$. We represent the actor network as $\Phi_A(\cdot|\theta_a)$, with output BF policy $\bar{\mathbf{V}}_t = A_t \in \mathbb{C}^{1\times N}$, where $\theta_a = \{w_1^a, w_{h1}^a, w_{h2}^a, w_2^a\}$ is the set of network weights containing the input layer parameters $w_1^a$, the two hidden-layer parameters $\{w_{h1}^a, w_{h2}^a\}$, and the output layer parameters $w_2^a$. Since the values in the BF matrices are complex, the real parts and the imaginary parts of the complex beamforming matrix $\mathbf{V}_t$ are exported separately, as $\bar{\mathbf{V}}_t = [\mathbf{V}_t^R, \mathbf{V}_t^I]$ with $\mathbf{V}_t^R = \mathrm{Re}(\mathbf{V}_t)$ and $\mathbf{V}_t^I = \mathrm{Im}(\mathbf{V}_t)$. After getting the action, the BS needs to preprocess $\bar{\mathbf{V}}_t$ and combine the real and imaginary parts to obtain the complex $\mathbf{V}_t$.

The critic network in the BF network is used to measure the performance of the current action; thus, the action output $A_t = \bar{\mathbf{V}}_t$ is fed into the critic network at its second hidden layer [26], as shown in Fig. 7. Therefore, the input state for the critic network is $S_t^c = \{\bar{\mathbf{H}}_t, \mathbf{V}_t^{zf}, \bar{\mathbf{V}}_t\}$. With $S_t^c$, the critic network provides the estimate of the action-value function, and its output is the estimated Q-value $Q_w(s,a)$. The estimated Q-value is back-propagated to generate the gradients. We represent the critic network as $\Phi_C(\cdot|\theta_c)$, with input $S_t^c$ and output $Q_w(S_t^c)$. The set of network weights is $\theta_c = \{w_1^c, w_{h1}^c, w_{h2}^c, w_2^c\}$, where $w_1^c$, $\{w_{h1}^c, w_{h2}^c\}$, and $w_2^c$ are the parameters of the input layer, the two hidden layers, and the output layer, respectively.

C. Parameter Learning

We exploit a policy gradient method for policy improvement in the joint CP and beamforming algorithm, which overcomes the limitations of a greedy searching strategy by explicitly optimizing a parameterized policy [34]. As discussed in Section IV, an exploration policy formed by adding noise sampled from $\mathcal{N}_{n,t}$ to the actor policy is employed, that is, $\bar{\mathbf{V}}_t = \mathbf{V}_t + \mathcal{N}_{n,t}$. After executing the selected action, the BS receives the immediate reward $R_t$ and keeps the previous experience in a replay memory data set $D = \{e_1, \cdots, e_t\}$, with $e_t = (S_t, A_t, R_t, S_{t+1})$. In order to reduce the strong correlations between adjacent time slots, at each TS we sample a random transition $(\tilde{s}, \tilde{a}, \tilde{r}, \hat{s})$ from $D$, instead of performing updates using transitions from the current episode [22]. As stated in Section IV, maximizing over the unknown next action $\hat{a}$ as in Eq. (14) is replaced by $\Phi_A(\hat{s}|\theta_a)$, and the critic loss becomes

$$L_Q(s,a|\theta_c) = \big(Q_w(s,a|\theta_c) - (r + \gamma Q_w(\hat{s}, \Phi_A(\hat{s}|\theta_a)|\theta_c))\big)^2. \tag{21}$$

We know that the critic's output $Q_w(s,a|\theta_c)$ influences both the actor's and the critic's updates. Therefore, the estimation error of $Q_w(s,a|\theta_c)$ would cause destructive feedback and divergence of the actor and critic networks, which may lead to an unstable learning process, especially in the proposed joint two-layer network. To deal with such problems, a method based on a target Q-network is considered [26], [36]. In particular, the target networks are not always necessary.

Since the joint CP and BF problem is more complicated, we employ target networks in this algorithm [36]. In this framework, we let the target networks be replicas of the actor and critic networks. The outputs of the target actor and target critic networks are denoted by $\bar{\mathbf{H}}_t'$ and $Q_w'(s,a)$, respectively. Instead of using the outputs of the original actor and critic networks, $\bar{\mathbf{H}}_t'$ and $Q_w'(s,a)$ are adopted to obtain the target Q-value in the network parameter updating process. The weights of the target networks are denoted by $\theta_a'$ and $\theta_c'$, respectively. The updates of the target networks are slower than those of the original actor and critic networks, and in this way the stability and the convergence of the learning process can be guaranteed [26].

The target Q-value is $y_t = \tilde{r} + \gamma Q_w'(\hat{s}, \bar{\mathbf{H}}_t' \,|\, \theta_a', \theta_c')$. The loss function for the critic network to minimize is

$$L_t(\theta_c) = \big(y_t - Q_w(\tilde{s}, \tilde{a}|\theta_c)\big)^2 = \big(\tilde{r} + \gamma Q_w'(\hat{s}, \bar{\mathbf{H}}_t'|\theta_a', \theta_c') - Q_w(\tilde{s}, \tilde{a}|\theta_c)\big)^2. \tag{22}$$

Differentiating the loss function with respect to the weights, we arrive at

$$\nabla_{\theta_c} L_t(\theta_c) = \big(\tilde{r} + \gamma Q_w'(\hat{s}, \bar{\mathbf{H}}_t'|\theta_a', \theta_c') - Q_w(\tilde{s}, \tilde{a}|\theta_c)\big)\, \nabla_{\theta_c} Q_w(\tilde{s}, \tilde{a}|\theta_c), \tag{23}$$

where $\nabla_{\theta_c} f(\cdot)$ denotes the gradient vector of $f$ with respect to $\theta_c$, and $\alpha$ is the updating step size. Then, the critic network is updated as

$$\theta_c \leftarrow \theta_c + \alpha \nabla_{\theta_c} L_t(\theta_c). \tag{24}$$

As mentioned before, the update of the actor network depends on the estimation of the Q-values in the critic network. Therefore, the gradients coming from the critic indicate the directions of improvement in the continuous action space and are used to train the actor network. Given a randomly sampled transition $(\tilde{s}, \tilde{a}, \tilde{r}, \hat{s})$, the critic gradients with respect to the action can be written as $\nabla_a Q_w(\tilde{s}, a|\theta_c)\big|_{a=\Phi_A(\tilde{s},\theta_a)}$. As shown in Fig. 7, the action is fed into the second hidden layer of the critic network, and the critic gradients are back-propagated to the actor network. In the actor network, we then have the actor gradients $\nabla_{\theta_a}\Phi_A(\tilde{s}, \theta_a)$, i.e., the gradients of the action with respect to the actor parameters. Together with $\nabla_a Q_w(\tilde{s}, a|\theta_c)$, the actor gradients used to update the parameters in $\theta_a$ are given as

$$\nabla_{\theta_a}\Phi_A = \nabla_{\theta_a}\Phi_A(\tilde{s}, \theta_a)\, \nabla_a Q_w(\tilde{s}, a|\theta_c)\big|_{a=\Phi_A(\tilde{s},\theta_a)}, \qquad \theta_a \leftarrow \theta_a + \alpha \nabla_{\theta_a}\Phi_A. \tag{25}$$

Then, according to the updating approach in [28], the parameters of the prediction network and the target networks are updated as

$$\nabla_{\theta_p}\Phi_P = \nabla_{\theta_p}\Phi_P(\tilde{s}, \theta_p)\, \nabla_p Q_w(\tilde{s}, a|\theta_c), \quad \theta_p \leftarrow \theta_p + \alpha \nabla_{\theta_p}\Phi_P, \quad \theta_a' \leftarrow \tau\theta_a + (1-\tau)\theta_a', \quad \theta_c' \leftarrow \tau\theta_c + (1-\tau)\theta_c', \tag{26}$$

with $0 \leq \tau \leq 1$.

The overall joint CP and beamforming algorithm is summarized in Algorithm 2. Similar to most reinforcement
Algorithm 2 Algorithm for Joint CP and Beamforming
1: Initialize the experience memory $D$ and the total number of episodes $E_p$.
2: Initialize the prediction network with random weights $\theta_p$. Initialize the critic network, the actor network, and the target networks with random weights $\theta_c$, $\theta_a$, $\theta_c'$ and $\theta_a'$.
3: Initialize the environment and get the initial observation state $S_1^p = \{Y_{1W}, \bar{H}_{1W}, s, H_1^p\}$.
4: for $t = 1, \cdots, E_p$ do
5:   Get the predicted channel states $\bar{\mathbf{H}}_t = \Phi_P(S_t^p)$.
6:   Obtain the action input state $S_t^a = \{\bar{\mathbf{H}}_t, \mathbf{V}_t^{zf}\}$.
7:   Select the action according to $A_t = \Phi_A(S_t^a|\theta_a) + \mathcal{N}_{n,t}$.
8:   With $A_t$, the BS obtains the BF matrices and finishes the downlink data transmission.
9:   The UEs feed back their SINRs and the partial-time CSI; the BS calculates the reward $R_t$.
10:  Observe the new state $S_{t+1}$. Store the transition $(S_t, A_t, R_t, S_{t+1})$ in $D$.
11:  Sample a random mini-batch of transitions $(\tilde{s}, \tilde{a}, \tilde{r}, \hat{s})$ from $D$.
12:  Set $y_t = \tilde{r} + \gamma Q_w'(\hat{s}, \bar{\mathbf{H}}_t'|\theta_a', \theta_c')$.
13:  Perform stochastic gradient descent according to (23).
14:  Update the critic parameters: $\theta_c \leftarrow \theta_c + \alpha\nabla_{\theta_c} L_t(\theta_c)$.
15:  Update the actor policy following (25).
16:  Update the prediction network and the target networks following (26).
17: end for

learning algorithms, the use of non-linear function approximators "nullifies any convergence guarantees" [26], [37]. Thus, it is extremely hard to provide a precise upper bound or a convergence proof [36]. Instead, the following simulation results demonstrate that the learning results of the proposed algorithms are stable and advantageous, without the need for any modifications or extra assumptions on the environment. Also, we provide computational complexity comparisons in Section VI-B to further certify the efficiency and effectiveness of the proposed algorithm. In this paper, we would like to explore the possible performance advantages of using deep RL in physical layer problems. Therefore, all the simulations in the following section are implemented in a single-cell scenario. When exploiting the DRL based algorithms in complex systems with more users and antennas, the large computation burden at the BSs could be eased with a centralized server, which could use the local gradients from the BSs to update a set of global parameters and periodically synchronize these parameters with the BSs. With such a distributed framework with a centralized server, together with a hybrid online and offline training manner, the computational cost at the local BSs could be effectively decreased [34], which would be a promising future research topic.

VI. NUMERICAL SIMULATIONS AND ANALYSIS

A. Simulation Results

In this section, we evaluate the performance of the proposed algorithms through numerical simulations. We choose four baselines for comparison: 1) the standard minimum mean square error (MMSE) based CE scheme [9]; 2) sliding bi-directional gated recurrent unit (SBGRU) based channel estimation with known pilot density and channel statistics [11]; 3) the weighted MMSE (WMMSE) beamforming method with perfect CSI [38]; and 4) the iterative algorithm induced deep unfolding neural network (IAIDNN) for precoding design, based on the WMMSE iterative method [15].

All the results are obtained in a simulated physical layer communication scenario containing one BS with multiple antennas and multiple UEs with a single antenna each. In our simulations, two kinds of wireless channels are considered. The first is the complex frequency-selective clustered delay line (CDL) channel model in the 3GPP 5G new radio (NR) standard protocol [39], [40]. CDL is used to model the channel when the received signal consists of multiple delayed clusters, where each cluster contains multi-path components with the same delay but slight variations in the angles of departure and arrival. The channel coefficients of the CDL models used in the simulations are generated with the MATLAB 5G Toolbox function nrCDLChannel, whose implementation exactly follows the 3GPP 5G NR standard protocol TR 38.901 [39]. When generating channel coefficients with nrCDLChannel, the maximum Doppler frequency is set to 12.5 Hz, which makes the channel model time-varying. The second is the standard Gaussian channel model, whose channel coefficients are generated with the MATLAB block comm.AWGNChannel. Both channels are frequency selective. Zadoff–Chu (ZC) sequences (also referred to as Chu sequences or Frank–Zadoff–Chu (FZC) sequences), which are complex-valued mathematical sequences, are used as the pilot signals. ZC sequences derived from the same root sequence have a constant-magnitude autocorrelation, and the cyclic autocorrelation at any nonzero shift is 0 [41]. One property of the Zadoff–Chu sequence is that when the length $L$ is an odd number, the sequence is periodic. In our simulation, the pilot signals, i.e., the Zadoff–Chu sequences, are generated with the MATLAB function zadoffChuSeq(), and the length of the pilot sequences is $L = 9$. The system bandwidth is 100 MHz and the carrier frequency is 6 GHz. The location of the BS is fixed and the transmit power of the BS is 46 dBm.

In our experiments, the actor and critic networks are both composed of two hidden layers, with 400 and 300 nodes, respectively. The learning rates for the actor and critic networks are $10^{-4}$ and $10^{-3}$, respectively. The discount factor $\gamma$ is set to 0.99. We train the deep RL network with a mini-batch size of 64 and a replay buffer size of $10^5$. The results are averaged over 100 independent runs with random initial states. The training step indicates the learning episode in Algorithm 1. However, in order to achieve smoother and more general performance comparisons, the sum rates and losses presented in the figures are further averaged by taking the mean over a moving window of 200 training steps. All the simulation results are obtained with TensorFlow 1.14.0 and Python 3.7.²

²The online project information is available at: https://github.com/mcccc4/RL-based-PL-transmission.
Firstly, we show the average sum rate of the proposed Fig. 9. Channel prediction loss of proposed CP scheme.
channel prediction (CP) algorithm over different training steps
in Figs. 8(a) and 8(b) with CDL channel and Gaussian
channel, respectively. The benchmark is the standard minimum
mean square error (MMSE) based channel estimation (MMSE
Channel Estimation) and the ZF beam former is also adopted
in MMSE channel estimation for a fair comparison [9]. As we
can see from the plots that at first, when the training algorithm
has not converged, MMSE scheme has better performance
than the proposed channel prediction (CP). However, after
about 13000 training steps, the sum rate of the proposed
RL algorithm exceeds the MMSE and gradually converges
to a stable state. Finally, the average sum rates of proposed Fig. 10. Performance comparisons for CP w/ or w/o feedback CSI.
scheme outperforms 21.56% and 3.68% of that of the MMSE
scheme for CDL channel and Gaussian channel, respectively.
One reason for this performance advantage is that when
using MMSE estimator, the channel covariance matrix of
CDL is estimated by computing the average covariance over
500 channel matrixes. This approximation of the channel
covariance matrix may introduce estimation error. However, the proposed CP algorithm directly predicts the CSI by learning the channel dynamic model without covariance matrix approximation. Another reason is that the MMSE method uses channel reciprocity to obtain the downlink CSI from the estimated uplink CSI, which can result in estimation loss when the downlink CSI is not exactly the conjugate transpose of the uplink CSI. The third reason is that the proposed CP algorithm directly maximizes the sum rate, while the MMSE estimator minimizes the estimation error; thus the proposed CP algorithm provides better sum rate performance.

The mean square error (MSE) of the prediction loss between the proposed scheme and the true CSI is shown in Fig. 9, where the left subfigure shows the loss variation over the training steps and the right subfigure is a zoomed view after the algorithm has converged. It is obvious from the left subfigure in Fig. 9 that when the learning algorithm has not converged, the prediction loss is relatively high. However, with more training steps, the prediction loss converges, and the zoomed subfigure shows that the stable prediction loss of the proposed scheme finally decreases to values between 10^-3 and 10^-2, which is an acceptable error range in practical applications.

In this paper, the proposed DRL based methods aim at capturing the real-time communication environment dynamics; therefore, the neural networks are trained at the base station (BS) in an online manner. We use the feedback CSI as the booster inputs for channel prediction, which makes the CP algorithm more stable and robust, as shown in Fig. 10. We can see in Figs. 8, 9 and 10 that once the algorithms converge, the performance is maintained at a stable level and the algorithms are robust. Even though there might be some fluctuations due to the dynamic variations of the environment, the algorithms can adjust quickly and restore their performance, as can be seen from the simulation results in Figs. 8, 9 and 10. Specifically, with the feedback information, the CP algorithm achieves a better prediction performance and a 9.10% higher sum rate than without the feedback information.

Fig. 11. Average sum rate with proposed joint CP and BF algorithm.

In the following, we show the performance evaluation of the proposed joint CP and BF algorithm. It can be seen from Fig. 11 that the proposed joint CP and BF algorithm clearly achieves a higher sum rate than the weighted MMSE beamforming method with minimum mean square error based channel estimation (WMMSE with MMSE CE)
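The two evaluation metrics used throughout this section, the CSI prediction MSE and the downlink sum rate, can be sketched as follows. This is an illustrative sketch only: the array shapes, the Rayleigh channel draw, the noise power, and the column-normalized zero-forcing construction are assumptions, not the paper's exact simulation setup.

```python
import numpy as np

def prediction_mse(h_true, h_pred):
    """Mean square error between true and predicted complex CSI."""
    return float(np.mean(np.abs(h_true - h_pred) ** 2))

def zf_beamformer(h):
    """Zero-forcing beamformer: channel pseudo-inverse with
    unit-power columns (one column per user)."""
    w = np.linalg.pinv(h)                                  # (n_tx, n_users)
    return w / np.linalg.norm(w, axis=0, keepdims=True)

def sum_rate(h, w, noise_power=1.0):
    """Downlink sum rate sum_u log2(1 + SINR_u) for channel h
    (n_users x n_tx) and beamforming matrix w (n_tx x n_users)."""
    gains = np.abs(h @ w) ** 2                             # effective link gains
    signal = np.diag(gains)                                # intended-user gains
    interference = gains.sum(axis=1) - signal              # leakage to other users
    sinr = signal / (interference + noise_power)
    return float(np.sum(np.log2(1.0 + sinr)))

# Illustrative use: a random Rayleigh channel, a noisy "prediction" of it,
# and the sum rate achieved by beamforming on the predicted CSI.
rng = np.random.default_rng(0)
n_users, n_tx = 4, 16
h = (rng.standard_normal((n_users, n_tx))
     + 1j * rng.standard_normal((n_users, n_tx))) / np.sqrt(2)
h_hat = h + 0.03 * rng.standard_normal((n_users, n_tx))    # imperfect prediction
print(f"prediction MSE: {prediction_mse(h, h_hat):.2e}")
print(f"sum rate with predicted CSI: {sum_rate(h, zf_beamformer(h_hat)):.2f} bit/s/Hz")
```

With the small perturbation chosen here, the resulting MSE falls in the same 10^-3 to 10^-2 range discussed above, which is why a prediction loss at that level is considered acceptable in practice.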
CHU et al.: DEEP REINFORCEMENT LEARNING BASED END-TO-END MULTIUSER CHANNEL PREDICTION AND BEAMFORMING 10283
Fig. 14. Performance comparisons.

statistic distributions of the channel model, which leads to the challenge of obtaining training data sets and the high cost of implementing channel estimation in practical applications.

B. Computational Complexity Analysis

In this subsection, we provide approximate computational complexity analyses of the proposed algorithms. According to [37], for stochastic policy gradient based learning algorithms, the computational complexity of all the parameter updates is O(mn) per time step, where m and n denote the action output dimension and the number of policy parameters, respectively [42].

Firstly, we estimate the computational complexities of the proposed methods and the benchmarks. We denote the sizes of the input layer, the first hidden layer, the second hidden layer and the output layer in the actor network as I, h1, h2 and U, respectively, and the number of items in a vector by |·|. The output sizes of the actor and of the Q-values are the same and equal to N = 2 × M × NA. Thus, for the proposed channel prediction algorithm, the total number of parameters in the actor network is |θa| = I + h1 + h2 + U. Since the critic network has a similar neural network structure to the actor network, we have |θc| = I + h1 + h2 + U + N. Therefore, the total for the proposed channel prediction algorithm is 4 M NA (I + h1 + h2 + U) + 2 M N NA, and the approximate computational complexity can be written as O(M NA (I + h1 + h2 + U + N)). The deep learning (DL) based channel estimation benchmark is the bi-directional gated recurrent unit (SBGRU) in [11]. According to the analysis in [11], the corresponding computational complexity of SBGRU is O(hs^2 nl M), where hs and nl are the size and the number of the hidden layers, respectively. In general, the size of a hidden layer is larger than the size of the input, i.e., hs ≫ M [11]. Based on the above analysis, it is clear that the complexity of the proposed RL based CP algorithm is smaller than that of the benchmark in [11].

Similarly, we compare the complexity of the proposed joint channel prediction and beamforming algorithm with that of the convolutional neural network (CNN) based iterative algorithm induced deep-unfolding neural network (IAIDNN) method [15]. The first layer of the joint channel prediction and beamforming algorithm is the prediction network. Assuming that the size of the prediction network is hl, according to the above calculation, the approximate computational complexity of the first layer is O(M NA (I + U + hl)). The estimated complexity of the second layer with the actor-critic network is O(M NA (I + h1 + h2 + U + N)). Thus, the approximate complexity of the proposed joint algorithm is O(M NA (h1 + h2 + hl + 2I + 2U + N)). In comparison, the complexity of IAIDNN [15] is approximated as O((M NA)^2.37 + Σ_{l=1}^{L} sl^2 cl^2), where L is the total number of layers, and sl and cl represent the size of the convolution kernel and the number of channels, respectively. Comparing these results, it can be seen that the proposed actor-critic based joint algorithm is competitive with the deep-unfolding neural network based method.

To show the computational complexity more clearly, we further analyze the number of floating point operations (FLOPs) for the parameter updates of the proposed learning algorithms. The number of FLOPs in the proposed channel prediction algorithm is mainly determined by the structure of the actor and critic networks [43]. In the proposed channel prediction algorithm, the FLOPs of the actor and critic networks can be computed as FLOP_a = I·h1 + h1·h2 + h2·N and FLOP_c = (I + N)·h1 + h1·h2 + h2, respectively. Thus, the number of FLOPs in the proposed channel prediction algorithm is FLOP_CP = FLOP_a + FLOP_c. As for the proposed joint algorithm, it contains a prediction recurrent neural network (RNN) layer, a fully connected layer and the actor-critic network. We denote the hidden size of the RNN layer and the input and output sizes of the actor-critic network by hs, Ī and N̄, respectively; I and N still equal the input and output sizes of channel prediction, the same as in the channel prediction algorithm. Then, the number of FLOPs is

FLOP_Joint = [(I + hs)·hs·4·2] + [2·I·N] + [Ī·h1 + h1·h2 + h2·N̄] + [(Ī + N̄)·h1 + h1·h2 + h2].    (27)

Under the simulation scenario with M = 10, NA = 16, h1 = 300, h2 = 400 and hs = 260·3, the approximate numbers of FLOPs of the proposed CP algorithm and the joint CP and BF algorithm are 0.6321G and 2.8719G, respectively.
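The FLOP expressions above can be transcribed into a small calculator. The sketch below implements FLOP_CP and Eq. (27) literally as printed; the input size I is left as a free parameter with an illustrative value, since its numeric value is not restated in this excerpt.

```python
def flops_cp(i, h1, h2, n):
    """FLOPs of the channel-prediction actor and critic updates:
    FLOP_a = I*h1 + h1*h2 + h2*N and FLOP_c = (I+N)*h1 + h1*h2 + h2."""
    flop_a = i * h1 + h1 * h2 + h2 * n
    flop_c = (i + n) * h1 + h1 * h2 + h2
    return flop_a + flop_c

def flops_joint(i, n, h_s, i_bar, n_bar, h1, h2):
    """FLOPs of the joint algorithm per Eq. (27); the factors 4 and 2 in
    the RNN term are taken from the equation as printed."""
    rnn = (i + h_s) * h_s * 4 * 2      # prediction RNN layer
    fc = 2 * i * n                     # fully connected layer
    actor = i_bar * h1 + h1 * h2 + h2 * n_bar
    critic = (i_bar + n_bar) * h1 + h1 * h2 + h2
    return rnn + fc + actor + critic

# Parameters from the paper's simulation scenario; N = 2*M*NA.
M, NA = 10, 16
n = 2 * M * NA                         # 320
h1, h2, h_s = 300, 400, 260 * 3
i = 640                                # illustrative input size (assumption)
print(flops_cp(i, h1, h2, n))
```

Because both expressions are sums of products of layer sizes, the counts grow linearly in the input size and quadratically in the hidden-layer widths, which is what keeps the per-update cost modest at the scales reported above.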
VII. CONCLUSION

In this paper, we consider a multi-user multiple-antenna downlink scenario. To tackle the existing challenges in traditional methods, we propose two RL based end-to-end CP and BF designs. Firstly, we adopt RL only in channel prediction and propose an actor-critic based channel prediction scheme without the premise of perfect CSI. The learning agent BS imports the received pilots into the prediction network and uses the predicted CSI to generate the downlink beamforming matrices with ZF. Secondly, we propose a joint channel prediction and beamforming learning architecture which includes two layers: the first layer is a CSI prediction network similar to that of the CP algorithm, and the second layer, which takes the outputs of the first layer as its inputs, is the actor-critic network that outputs the beamforming policy and the Q-value evaluation. All the network parameters are updated jointly with the objective of maximizing the sum rate reward using the deep policy gradient method. The simulations verified that the proposed algorithms always converge and remain stable after a certain number of training steps. The results show that the learning scheme achieves a prediction loss of 10^-2 under different simulation conditions. Compared with the MMSE channel estimator, the proposed channel prediction scheme has an average sum rate gain of as much as 21.56%. Moreover, the proposed two-layer joint algorithm achieves 98.76% to 99.7% of the sum rate of WMMSE with perfect CSI, without introducing large computation overheads.

REFERENCES

[1] I. Ahmed and H. Khammari, "Joint machine learning based resource allocation and hybrid beamforming design for massive MIMO systems," in Proc. IEEE Globecom Workshops (GC Wkshps), Abu Dhabi, UAE, Dec. 2018, pp. 1-6.
[2] M. Chu, X. Liao, H. Li, and S. Cui, "Power control in energy harvesting multiple access system with reinforcement learning," IEEE Internet Things J., vol. 6, no. 5, pp. 9175-9186, Oct. 2019.
[3] W. Ma, C. Qi, Z. Zhang, and J. Cheng, "Sparse channel estimation and hybrid precoding using deep learning for millimeter wave massive MIMO," IEEE Trans. Commun., vol. 68, no. 5, pp. 2838-2849, Feb. 2020.
[4] Y.-S. Jeon, J. Li, N. Tavangaran, and H. V. Poor, "Data-aided channel estimator for MIMO systems via reinforcement learning," in Proc. IEEE ICC, Dublin, Ireland, Jun. 2020, pp. 1-6.
[5] S. Park, B. Shim, and J. W. Choi, "Iterative channel estimation using virtual pilot signals for MIMO-OFDM systems," IEEE Trans. Signal Process., vol. 63, no. 12, pp. 3032-3045, Jun. 2015.
[6] V. Raj and S. Kalyani, "Backpropagating through the air: Deep learning at physical layer without channel models," IEEE Commun. Lett., vol. 22, no. 11, pp. 2278-2281, Nov. 2018.
[7] Q. Mao, F. Hu, and Q. Hao, "Deep learning for intelligent wireless networks: A comprehensive survey," IEEE Commun. Surveys Tuts., vol. 20, no. 4, pp. 2595-2621, Jun. 2018.
[8] Z. Qin, H. Ye, G. Y. Li, and B. H. F. Juang, "Deep learning in physical layer communications," IEEE Wireless Commun., vol. 26, no. 2, pp. 93-99, Mar. 2019.
[9] Y. Liao, Y. Hua, and Y. Cai, "Deep learning based channel estimation algorithm for fast time-varying MIMO-OFDM systems," IEEE Commun. Lett., vol. 24, no. 3, pp. 572-576, Mar. 2020.
[10] Y. Yang, F. Gao, X. Ma, and S. Zhang, "Deep learning-based channel estimation for doubly selective fading channels," IEEE Access, vol. 7, pp. 36579-36589, 2019.
[11] Q. Bai, J. Wang, Y. Zhang, and J. Song, "Deep learning-based channel estimation algorithm over time selective fading channels," IEEE Trans. Cognit. Commun. Netw., vol. 6, no. 1, pp. 125-134, Mar. 2020.
[12] H. He, C.-K. Wen, S. Jin, and G. Y. Li, "Deep learning-based channel estimation for beamspace mmWave massive MIMO systems," IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 852-855, Oct. 2018.
[13] B. Zhu, J. Wang, L. He, and J. Song, "Joint transceiver optimization for wireless communication PHY using neural network," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1364-1373, Jun. 2019.
[14] A. Alkhateeb, S. Alex, P. Varkey, Y. Li, Q. Qu, and D. Tujkovic, "Deep learning coordinated beamforming for highly-mobile millimeter wave systems," IEEE Access, vol. 6, pp. 37328-37348, 2018.
[15] Q. Hu, Y. Cai, Q. Shi, K. Xu, G. Yu, and Z. Ding, "Iterative algorithm induced deep-unfolding neural networks: Precoding design for multiuser MIMO systems," IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1394-1410, Feb. 2021.
[16] Y. Hu et al., "Optimal transmit antenna selection strategy for MIMO wiretap channel based on deep reinforcement learning," in Proc. IEEE/CIC Int. Conf. Commun. China (ICCC), Beijing, China, Aug. 2018, pp. 803-807.
[17] T. J. O'Shea, T. Erpek, and T. C. Clancy, "Physical layer deep learning of encodings for the MIMO fading channel," in Proc. 55th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Monticello, IL, USA, Oct. 2017, pp. 76-80.
[18] J. P. González-Coma, J. Rodríguez-Fernández, N. González-Prelcic, L. Castedo, and R. W. Heath, Jr., "Channel estimation and hybrid precoding for frequency selective multiuser mmWave MIMO systems," IEEE J. Sel. Topics Signal Process., vol. 12, no. 2, pp. 353-367, May 2018.
[19] L. Zhao, D. W. K. Ng, and J. Yuan, "Multi-user precoding and channel estimation for hybrid millimeter wave systems," IEEE J. Sel. Areas Commun., vol. 35, no. 7, pp. 1576-1590, Jul. 2017.
[20] A. M. Elbir, "A deep learning framework for hybrid beamforming without instantaneous CSI feedback," IEEE Trans. Veh. Technol., vol. 69, no. 10, pp. 11743-11755, Oct. 2020.
[21] H. Wang, J. Fang, P. Wang, G. Yue, and H. Li, "Efficient beamforming training and channel estimation for millimeter wave OFDM systems," IEEE Trans. Wireless Commun., vol. 20, no. 5, pp. 2805-2819, May 2021.
[22] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.
[23] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press, 2014.
[25] S. Verdú, "Fifty years of Shannon theory," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2057-2078, Oct. 1998.
[26] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," 2015, arXiv:1509.02971.
[27] M. Chu, H. Li, X. Liao, and S. Cui, "Reinforcement learning-based multiaccess control and battery prediction with energy harvesting in IoT systems," IEEE Internet Things J., vol. 6, no. 2, pp. 2009-2020, Apr. 2019.
[28] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in Proc. Conf. Neural Inf. Process. Syst., Denver, CO, USA, Dec. 2000, pp. 1008-1014.
[29] J. Sun, Y. Ren, and Z. Yonghang, "A signal-to-noise ratio feedback method and equipment," Chin. Patent CN 102 546 124 A, 2012.
[30] J. F. T. Cheng, S. Grant, L. Krasny, K. Molnar, and Y. P. E. Wang, "Method and arrangement for SINR feedback in MIMO based wireless communication systems," U.S. Patent 8 644 263, 2014.
[31] M. Kurras, S. Jaeckel, L. Thiele, and V. Braun, "CSI compression and feedback for network MIMO," in Proc. IEEE 81st Veh. Technol. Conf. (VTC Spring), Boston, MA, USA, May 2015, pp. 1-5.
[32] J. Guo, L. Wang, F. Li, and J. Xue, "CSI feedback with model-driven deep learning of massive MIMO systems," IEEE Commun. Lett., vol. 26, no. 3, pp. 547-551, Mar. 2022.
[33] D. Neumann, T. Wiese, and W. Utschick, "Learning the MMSE channel estimator," IEEE Trans. Signal Process., vol. 66, no. 11, pp. 2905-2917, Jun. 2018.
[34] Y. Xu, W. Xu, Z. Wang, J. Lin, and S. Cui, "Load balancing for ultradense networks: A deep reinforcement learning-based approach," IEEE Internet Things J., vol. 6, no. 6, pp. 9399-9412, Dec. 2019.
[35] M. Hausknecht and P. Stone, "Deep reinforcement learning in parameterized action space," 2015, arXiv:1511.04143.
[36] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proc. ICML, Stockholm, Sweden, Jul. 2018, pp. 1861-1870.
[37] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proc. 31st Int. Conf. Mach. Learn., Beijing, China, Jun. 2014, pp. 387-395.
[38] D. H. Nguyen and T. Le-Ngoc, "MMSE precoding for multiuser MISO downlink transmission with non-homogeneous user SNR conditions," EURASIP J. Adv. Signal Process., vol. 2014, no. 1, pp. 1-12, Dec. 2014.
[39] 5G; NR; Overall Description; Stage-2 (3GPP TS 38.300 Version 15.3.1 Release 15), 3GPP, document TS 138 300, Oct. 2018.
[40] X. Zhao, E. Lukashova, F. Kaltenberger, and S. Wagner, "Practical hybrid beamforming schemes in massive MIMO 5G NR systems," in Proc. 23rd Int. ITG Workshop Smart Antennas, Vienna, Austria: VDE, Apr. 2019, pp. 1-8.
[41] Evolved Universal Terrestrial Radio Access (E-UTRA); Physical Channels and Modulation, document TS 36.211, Jun. 2013.
[42] A. Vaswani et al., "Attention is all you need," in Proc. NIPS, Long Beach, CA, USA, Jun. 2017, pp. 1-11.
[43] W. Li, W. Ni, H. Tian, and M. Hua, "Deep reinforcement learning for energy-efficient beamforming design in cell-free networks," in Proc. IEEE Wireless Commun. Netw. Conf. Workshops (WCNCW), Nanjing, China, Mar. 2021, pp. 1-6.

Vincent K. N. Lau (Fellow, IEEE) received the B.E. degree (Hons.) from The University of Hong Kong in 1992 and the Ph.D. degree from Cambridge University in 1997. He was with Bell Labs from 1997 to 2004 and joined the Department of ECE, The Hong Kong University of Science and Technology (HKUST), in 2004. He is currently a Chair Professor and the Founding Director of the Huawei-HKUST Joint Innovation Laboratory, HKUST. His current research interests include stochastic optimization, massive MIMO, content-centric wireless networking, wireless networking for mission-critical control, and federated learning for 6G wireless networks.