arXiv:2005.09969v3 [eess.SP] 22 Aug 2020

Abstract—Machine learning for hybrid beamforming has been extensively studied by using centralized machine learning (CML) techniques, which require the training of a global model with a large dataset collected from the users. However, the transmission of the whole dataset between the users and the base station (BS) is computationally prohibitive due to limited communication bandwidth and privacy concerns. In this work, we introduce a federated learning (FL) based framework for hybrid beamforming, where the model training is performed at the BS by collecting only the gradients from the users. We design a convolutional neural network, in which the input is the channel data, yielding the analog beamformers at the output. Via numerical simulations, FL is demonstrated to be more tolerant to imperfections and corruptions in the channel data, as well as having less transmission overhead, than CML.

Index Terms—Deep learning, federated learning, hybrid beamforming, massive MIMO.

Sinem Coleri acknowledges the support of the Scientific and Technological Research Council of Turkey for EU CHIST-ERA grant 119E350 and Ford Otosan. Ahmet M. Elbir is with the Department of Electrical and Electronics Engineering, Duzce University, Duzce, Turkey (e-mail: ahmetmelbir@gmail.com). Sinem Coleri is with the Department of Electrical and Electronics Engineering, Koc University, Istanbul, Turkey (e-mail: scoleri@ku.edu.tr).

I. INTRODUCTION

In millimeter-wave (mm-Wave) massive MIMO (multiple-input multiple-output) systems, the users estimate the downlink channel and the corresponding hybrid (analog and digital) beamformers, then send these estimated values to the base station (BS) via the uplink channel with limited feedback techniques [1]. Several hybrid beamforming approaches have been proposed in the literature [1], [2]. However, the performance of these approaches strongly relies on the accuracy of the channel state information (CSI). To provide robust beamforming performance, data-driven approaches, such as machine learning (ML), have been proposed [3]–[5]. ML is advantageous in extracting features from the channel data and exhibits robust performance against imperfections/corruptions at the input.

In the context of ML, usually centralized ML (CML) techniques have been used, where a neural network (NN) model at the BS is trained with the training data collected by the users (see Fig. 1). This process requires the transmission of each user's data to the BS; therefore, data collection and transmission introduce significant overhead. To alleviate the transmission overhead problem, federated learning (FL) strategies have recently been introduced, especially for edge computing applications [6]. In FL, instead of sending the whole dataset to the BS, the edge devices (mobile users) transmit only the gradient information of the NN model. Then, the NN is iteratively trained by using the stochastic gradient descent (SGD) algorithm [7]. Although this approach introduces the complexity of computing the gradients at the edge devices, transmitting only the gradient information significantly lowers the overhead of data collection and transmission.

Up to now, FL has mostly been considered for applications in wireless sensor networks, e.g., UAV (unmanned aerial vehicle) networks [8], [9] and vehicular networks [10], [11]. In [8], the authors apply FL to the trajectory planning of a UAV swarm, where the data collected by each UAV are processed for gradient computation and a global NN at the "leading" UAV is trained. In [9] and [10], client scheduling and power allocation problems are investigated for the FL framework, respectively. Furthermore, a review of FL for vehicular networks is presented in [11]. In addition, FL is considered in [12] for the classification of encrypted mobile traffic data. Different from [8]–[12], [13] considers a more realistic scenario, where the gradients are transmitted to the BS through a noisy wireless channel and a model is trained for image classification. Notably, to the best of our knowledge, the application of FL to massive MIMO hybrid beamforming has not been investigated in the literature. Furthermore, the above works mostly consider simple CML structures that accommodate shallow NN architectures, such as a single-layer NN [13]. The performance of these architectures can be improved by utilizing deeper NNs, such as convolutional neural networks (CNNs) [5], [14], which also motivates us to exploit CNN architectures for FL in this work.

Fig. 1. Data and model transmission schemes for multi-user massive MIMO.

In this letter, we propose a federated learning (FL) framework for hybrid beamforming. We design a CNN architecture employed at the BS. The CNN accepts the channel matrix as input and yields the RF beamformer at the output. The deep network is then trained by using the gradient data collected from the users. Each user computes the gradient information with its available training data (a pair of channel matrix and corresponding beamformer index), then sends it to the BS. The BS receives all the gradient data from the users and performs
parameter update for the CNN model. Via numerical simulations, the proposed FL approach is demonstrated to provide robust beamforming performance while exhibiting much less transmission overhead compared to CML works [3], [5], [15].

In the remainder of this work, we introduce the signal model in Sec. II and present the proposed FL approach for hybrid beamforming in Sec. III. After presenting the numerical results for performance evaluation in Sec. IV, we conclude the letter in Sec. V.

II. SIGNAL MODEL AND PROBLEM DESCRIPTION

We consider a multi-user MIMO scenario where the BS, having N_T antennas, communicates with K single-antenna users. In the downlink, the BS first precodes K data symbols s = [s_1, s_2, ..., s_K]^T ∈ C^K by applying the baseband precoder F_BB = [f_BB,1, ..., f_BB,K] ∈ C^{K×K}, and then employs K RF precoders F_RF = [f_RF,1, ..., f_RF,K] ∈ C^{N_T×K} to form the transmitted signal. Given that F_RF consists of analog phase shifters, we assume that the RF precoder has constant unit-modulus elements, i.e., |[F_RF]_{i,j}|^2 = 1; thus, ||F_RF F_BB||_F^2 = K. Then, the N_T × 1 transmit signal is x = F_RF F_BB s, and the received signal at the k-th user becomes y_k = h_k^H Σ_{n=1}^{K} F_RF f_BB,n s_n + n_k, where h_k ∈ C^{N_T} denotes the mm-Wave channel for the k-th user with ||h_k||_2^2 = N_T, and n_k ~ CN(0, σ^2) is the additive white Gaussian noise (AWGN) term.

We adopt a clustered channel model in which the mm-Wave channel h_k is represented as a cluster of L line-of-sight (LOS) received path rays [1], [2], i.e.,

h_k = β Σ_{l=1}^{L} α_{k,l} a(φ^{(k,l)}),   (1)

where β = sqrt(N_T / L), α_{k,l} is the complex channel gain, φ^{(k,l)} ∈ [−π/2, π/2] denotes the direction of the transmitted paths, and a(φ^{(k,l)}) is the N_T × 1 array steering vector whose m-th element is given by [a(φ^{(k,l)})]_m = exp{−j (2π/λ) d (m − 1) sin(φ^{(k,l)})}, where λ is the wavelength and d is the array element spacing.

Assuming Gaussian signaling, the achievable rate for the k-th user is

R_k = log_2( 1 + ( (1/K) |h_k^H F_RF f_BB,k|^2 ) / ( (1/K) Σ_{n≠k} |h_n^H F_RF f_BB,n|^2 + σ^2 ) ),

and the sum-rate is R̄ = Σ_{k∈K} R_k for K = {1, ..., K} [1].

Let us denote the parameter set of the global model at the BS by θ ∈ R^P, which has P real-valued learnable parameters. Then, the learning model can be represented by the nonlinear relationship between the input X and the output Y, given by f(θ, X) = Y.

In this work, we focus on the training stage of the global NN model. Therefore, we assume that each user has available training data pairs, e.g., the channel vector h_k (input) and the corresponding RF precoder f_RF,k (output label). It is worthwhile to note that channel and beamformer estimation can be performed by both ML [5], [14], [16] and non-ML [1], [2] algorithms. The aim in this work is to learn θ by training the global NN with the training data available at the users. Once the learning stage is completed, the users can predict their corresponding RF precoders by feeding the NN with the channel data, then feed them back to the BS.

III. FL FOR HYBRID BEAMFORMING

Let us denote the training dataset for the k-th user by D_k = {(X_k^(1), Y_k^(1)), ..., (X_k^(D_k), Y_k^(D_k))}, where D_k = |D_k| is the size of the dataset and X = ∪_{k∈K} X_k, Y = ∪_{k∈K} Y_k. In CML-based works [3]–[5], [14], the training of the global model is performed at the BS by collecting the datasets of all users (see Fig. 1). Once the BS collects D = ∪_{k=1}^{K} D_k, the global model is trained by minimizing the empirical loss

F(θ) = (1/D) Σ_{i=1}^{D} L( f(θ; X^(i)), Y^(i) ),   (2)

where D = |D| is the size of the training dataset and L(·) is the loss function defined by the learning model. The minimization of the empirical loss is achieved via gradient descent (GD) by updating the model parameters θ_t at iteration t as θ_{t+1} = θ_t − η_t ∇F(θ_t), where η_t is the learning rate at iteration t. However, for large datasets, the implementation of GD is computationally prohibitive. To reduce the complexity, the SGD technique is used, where θ_t is updated by

θ_{t+1} = θ_t − η_t g(θ_t),   (3)

which satisfies E{g(θ_t)} = ∇F(θ_t). Therefore, SGD allows us to minimize the empirical loss by partitioning the dataset into multiple portions. In CML, SGD is mainly used to accelerate the learning process by partitioning the training dataset into batches, which is known as mini-batch learning [7]. In FL, on the other hand, the training dataset is also partitioned into small portions; however, these portions are available only at the edge devices. Hence, θ_t is updated by collecting the local gradients {g_k(θ_t)}_{k∈K} computed at the users with their own datasets {D_k}_{k∈K}. Thus, the BS incorporates {g_k(θ_t)}_{k∈K} to update the global model parameters as

θ_{t+1} = θ_t − η_t (1/K) Σ_{k=1}^{K} g_k(θ_t),   (4)

where g_k(θ_t) = (1/D_k) Σ_{i=1}^{D_k} ∇L( f(θ_t; X_k^(i)), Y_k^(i) ) is the stochastic gradient computed at the k-th user with D_k.

Due to the limited number of users K, there will be deviations in the gradient average (1/K) Σ_{k=1}^{K} g_k(θ_t) from the stochastic average E{g(θ_t)}. To reduce the oscillations due to gradient averaging, the parameter update is performed by using a momentum parameter γ, which allows us to "moving-average" the gradients [7]. Finally, the parameter update with momentum is given by

θ_{t+1} = θ_t − η_t (1/K) Σ_{k=1}^{K} g_k(θ_t) + γ(θ_t − θ_{t−1}).   (5)

Next, we discuss the input-output data acquisition and network architecture for the proposed FL framework.

A. Data Acquisition

In this part, we discuss how the data acquisition is performed for the training process. The input of the NN is a real-valued "three-channel" tensor, i.e., X_k ∈ R^{√N_T × √N_T × 3}. The reason for using a three-channel input is to improve the input feature representation by using the feature augmentation method [4],
[Fig. 2: plots of validation accuracy (0–100%) for K = 2, 4, and 8 users; panels (a) and (b).]
Fig. 2. Validation accuracy for different numbers of users for (a) scenario 1 and (b) scenario 2, respectively.
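The aggregation rule of (4)–(5) can be sketched numerically. The following is a minimal NumPy illustration under stated assumptions — a hypothetical toy model with quadratic per-user losses standing in for the paper's CNN — showing each user computing its local gradient g_k, the BS averaging them, and the momentum correction γ(θ_t − θ_{t−1}) being applied:

```python
import numpy as np

def fl_momentum_update(theta, theta_prev, user_grads, eta, gamma):
    """One FL iteration at the BS: average the K local gradients (Eq. (4))
    and add the momentum term gamma * (theta_t - theta_{t-1}) (Eq. (5))."""
    g_avg = np.mean(user_grads, axis=0)  # (1/K) * sum_k g_k(theta_t)
    return theta - eta * g_avg + gamma * (theta - theta_prev)

# Toy setting (hypothetical): K = 4 users, P = 3 parameters, and quadratic
# per-user losses L_k(theta) = ||theta - c_k||^2, so g_k(theta) = 2*(theta - c_k).
rng = np.random.default_rng(0)
targets = rng.normal(size=(4, 3))  # c_k: stands in for each user's local dataset
theta = np.zeros(3)
theta_prev = theta.copy()
for t in range(100):
    # each user computes its local gradient with its own data; the BS then
    # aggregates the gradients and updates the global model parameters
    user_grads = [2.0 * (theta - c) for c in targets]
    theta, theta_prev = fl_momentum_update(theta, theta_prev, user_grads,
                                           eta=0.1, gamma=0.5), theta
# theta converges to the minimizer of the averaged loss, i.e. the mean of the c_k
```

For these quadratic losses, the averaged empirical loss is minimized at the mean of the c_k, so the global parameters converge there; the same aggregation step applies unchanged when g_k comes from a CNN's backward pass at each user.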