
Convergence Time Minimization of Federated Learning over Wireless Networks


Mingzhe Chen§,†, H. Vincent Poor†, Walid Saad‡, and Shuguang Cui§,¶,∗
§ The Future Network of Intelligence Institute, The Chinese University of Hong Kong, Shenzhen, China.
† Department of Electrical Engineering, Princeton University, Princeton, NJ, USA, Emails: {mingzhec,poor}@princeton.edu.
‡ Wireless@VT, Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, USA, Email: walids@vt.edu.
¶ Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China.
∗ Department of Electrical and Computer Engineering, University of California, Davis, CA, USA, Email: robert.cui@gmail.com.

Abstract—In this paper, the convergence time of federated learning (FL), when deployed over a realistic wireless network, is studied. In particular, in the considered model, wireless users transmit their local FL models (trained using their locally collected data) to a base station (BS). The BS, acting as a central controller, generates a global FL model using the received local FL models and broadcasts it back to all users. Due to the limited number of resource blocks (RBs) in a wireless network, only a subset of users can be selected to transmit their local FL model parameters to the BS at each learning step. Meanwhile, since each user has unique training data samples and the BS must wait to receive all users' local FL models to generate the global FL model, the FL performance and convergence time are significantly affected by the user selection scheme. Consequently, it is necessary to design an appropriate user selection scheme that enables all users to participate in FL and train it efficiently. This joint learning, wireless resource allocation, and user selection problem is formulated as an optimization problem whose goal is to minimize the FL convergence time while optimizing the FL performance. To address this problem, a probabilistic user selection scheme is proposed under which the BS connects, with high probability, to the users whose local FL models have large effects on the global FL model. Given the user selection policy, the uplink RB allocation can be determined. To further reduce the FL convergence time, artificial neural networks (ANNs) are used to estimate the local FL models of the users that are not allocated any RBs for local FL model transmission, which enables the BS to include more users' local FL models when generating the global FL model and thus improve the FL convergence speed and performance. Simulation results show that the proposed ANN-based FL scheme can reduce the FL convergence time by up to 53.8% compared to a standard FL algorithm.

This work was supported in part by grants No. 2018YFB1800800, NSFC-61629101, No. 2018B030338001, No. ZDSYS201707251409055, and No. 2017ZT07X152, in part by the U.S. National Science Foundation under Grants CCF-0939370 and CCF-1513915, and in part by the Office of Naval Research under MURI Grant N00014-19-1-2621.
I. INTRODUCTION

To train traditional machine learning algorithms for data analysis and inference, a central controller requires all users' training data samples. However, in wireless networks, it is impractical for wireless users to transmit their training data samples to such a central controller due to privacy concerns and limited wireless resources [1]. To enable the central controller to train a machine learning model without collecting all of the users' training data samples, federated learning (FL) has been proposed in [2]–[4]. FL is, in essence, a distributed machine learning algorithm that enables users to collaboratively learn a shared machine learning model while keeping their collected data on their devices. However, implementing FL over wireless networks faces several challenges, such as the selection of wireless users to execute FL, the energy efficiency of the FL implementation, computational resource allocation for training FL models at edge devices, spectrum resource allocation for FL parameter transmission, and reducing the data size of the FL model parameters that are transmitted over wireless networks.

Recently, a number of existing works such as [4]–[9] have studied important problems related to the implementation of FL over wireless networks. In [4], the authors provided a new set of design principles for wireless communication in edge learning; however, they did not provide any mathematically rigorous result for optimizing the performance of FL over wireless networks. The authors in [5] proposed a blockchain-based FL scheme that enables edge devices to train FL models without sending FL model parameters to a central controller. The works in [6]–[8] optimized the FL performance under communication constraints such as limited spectrum and computational resources. In [9], the authors investigated the tradeoff between the FL convergence time and the users' energy consumption. However, most of these existing works [4]–[9] that minimize the FL convergence time must sacrifice the performance of the FL algorithm. For example, in [9], the authors sacrificed the training accuracy of FL to improve the convergence time. Moreover, most of these existing works [4]–[9] used random or fixed user selection methods for FL training, which may significantly increase the FL convergence time and also decrease the FL performance. In particular, none of these existing works [4]–[9] considers the effect, on the FL convergence time and performance, of the local FL models of the users that cannot connect to the BS due to limited wireless resources.

The main contribution of this work is a novel framework that can minimize the FL convergence time while also optimizing the FL performance. Our key contributions include:
• We implement FL over a realistic wireless network in which the users train their local FL models using their own data and transmit the trained local FL models to a base station (BS). The BS aggregates the received local FL models to generate a global FL model and sends it back to the users. Since the number of resource blocks (RBs) that can be used for FL model transmission is limited, the BS must select an appropriate set of users to perform FL at each learning step. Meanwhile, the BS must allocate its RBs to its users so as to improve the convergence speed. To this end, we formulate this joint user selection and RB allocation problem as an optimization problem whose goal is to minimize the number of iterations needed for FL convergence as well as the time duration of each iteration, while optimizing the FL performance in terms of the accuracy of a handwritten digit identification task.
• To solve this problem, we first propose a probabilistic user selection scheme in which the users whose local FL models have large effects on the global FL model connect to the BS with high probability. Meanwhile, the proposed user selection scheme guarantees that every user has a non-zero chance to connect to the BS and, hence, enables the BS to learn from all of the users' training data samples so as to guarantee that the FL algorithm achieves the optimal performance. Given the user selection scheme, the optimal RB allocation scheme can be determined.
• To further reduce the FL convergence time, we propose the use of artificial neural networks (ANNs) to find a relationship among the users' local FL model parameters. Building this relationship enables the BS to estimate the local FL model parameters of the users that cannot transmit their local FL models to the BS due to the limited number of RBs. Hence, using ANNs, the BS can integrate more users' local FL model parameters into the global FL model and, in turn, improve the FL performance and convergence speed.

Simulation results assess the various performance metrics of the proposed approach and show that the proposed FL approach can reduce the convergence time by up to 53.8% compared to a standard FL algorithm. To the best of our knowledge, this is the first work that studies the use of ANNs to predict FL model parameters during the training process so as to improve the FL convergence speed and performance.
The rest of this paper is organized as follows. The system model and problem formulation are described in Section II. The probabilistic user selection scheme, the resource allocation scheme, and the use of ANNs for the prediction of users' local FL models are introduced in Section III. Simulation results are analyzed in Section IV. Conclusions are drawn in Section V.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Consider a wireless network in which a set U of U users and one BS jointly execute an FL algorithm for data analysis and inference. We assume that each user i collects K_i training data samples and that each training data sample k consists of an input x_{ik} ∈ R^{N_in×1} and its corresponding output y_{ik} ∈ R^{N_out×1}. We also assume that the data collected by the users follows the same distribution [10]. The FL training process is performed so as to solve:
$$\min_{\boldsymbol g}\ \frac{1}{K}\sum_{i=1}^{U}\sum_{k=1}^{K_i} f\left(\boldsymbol g,\boldsymbol x_{ik},\boldsymbol y_{ik}\right),\qquad(1)$$

where K = Σ_{i=1}^{U} K_i is the total number of training data samples of all users, g is a vector that captures the FL model trained by the K training data samples, and f(g, x_{ik}, y_{ik}) is a loss function that captures the accuracy of the considered FL algorithm by building a relationship between an input vector x_{ik} and an output vector y_{ik}. For different learning tasks, the definition of the loss function can be different [11]. For instance, for a classification learning task, the loss function is the cross-entropy $f\left(\boldsymbol g,\boldsymbol x_{ik},\boldsymbol y_{ik}\right)=-\left(\boldsymbol y_{ik}\log\left(\boldsymbol x_{ik}^{T}\boldsymbol g\right)+\left(1-\boldsymbol y_{ik}\right)\log\left(1-\boldsymbol x_{ik}^{T}\boldsymbol g\right)\right)$.
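To make the loss concrete, the following is a minimal sketch (ours, not the paper's code) of the binary cross-entropy loss above and its gradient with respect to g, which the local update in (3) below will need; the function names are our own:

```python
import numpy as np

def cross_entropy_loss(g, x, y, eps=1e-12):
    """Loss f(g, x, y) = -(y*log(x^T g) + (1-y)*log(1 - x^T g)).

    Assumes x^T g lies in (0, 1), as in the paper's classification
    example; eps guards the logarithms numerically."""
    p = np.clip(x @ g, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def cross_entropy_grad(g, x, y, eps=1e-12):
    """Gradient of f with respect to g: ((p - y) / (p * (1 - p))) * x."""
    p = np.clip(x @ g, eps, 1.0 - eps)
    return ((p - y) / (p * (1.0 - p))) * x
```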
To find the optimal vector g∗, conventional centralized learning algorithms require the users to share their collected data with other users, which is impractical due to communication overhead and data privacy issues. To address these issues, FL can be used. The training procedure of FL is given as follows [2]:
a. The BS broadcasts the information that is used to generate the learning model to each user. Then, each user generates its learning model.
b. Each user trains its generated machine learning model using its collected data and sends the trained learning model parameters to the BS.
c. The BS integrates the received learning model parameters and broadcasts them back to all users.
d. Steps b. and c. are repeated until the optimal vector g∗ is found.
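The following sketch (again ours, not the paper's) mirrors steps a.–d. as a plain training loop; `local_train` and `aggregate` are placeholders that are made concrete by (2) and (3) below:

```python
import numpy as np

def federated_training(users, init_model, num_iterations, local_train, aggregate):
    """Skeleton of the FL procedure in steps a.-d.

    users: list of per-user datasets; local_train(model, data) returns a
    trained local model; aggregate(local_models) returns the global model."""
    global_model = init_model  # step a: BS broadcasts the initial model
    for _ in range(num_iterations):  # steps b and c, repeated (step d)
        local_models = [local_train(global_model, data) for data in users]
        global_model = aggregate(local_models)  # BS integrates and re-broadcasts
    return global_model
```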
From this FL training procedure, we observe that the users do not have to share their collected data with other users or with the BS. Instead, they only need to share their trained learning model parameters with the BS. Note that the implementation of steps b. and c. constitutes one iteration of FL training. Hereinafter, the FL model that is trained by each user using its collected data is called the local FL model, while the FL model that is generated by the BS is called the global FL model. Since all of the FL model parameters are transmitted over wireless networks, we must consider the effect of wireless factors on the FL performance. Next, we first mathematically formulate the training procedure of FL. Then, the transmission of FL model parameters over wireless links is introduced. Finally, the problem of minimizing the FL convergence time while optimizing the FL performance is formulated.

A. Training Procedure of Federated Learning

The local FL model of each user i at each iteration μ is defined as a vector w_{i,μ} ∈ R^{W×1}, where W is the number of elements in w_{i,μ}. The update of the global FL model at iteration μ is given by [12]:

$$\boldsymbol g\left(\boldsymbol a_\mu\right)=\sum_{i=1}^{U}\boldsymbol w_{i,\mu}\left(\frac{a_{i,\mu}K_i}{\sum_{i=1}^{U}a_{i,\mu}K_i}\right),\qquad(2)$$

where g(a_μ) is the global FL model at iteration μ, a_{i,μ}K_i / Σ_{i=1}^{U} a_{i,μ}K_i is the scaling update weight of w_{i,μ}, and a_μ = [a_{1,μ}, ..., a_{U,μ}] is a user association vector with a_{i,μ} ∈ {0,1}; a_{i,μ} = 1 indicates that user i connects to the BS and sends its local FL model w_{i,μ} to the BS at iteration μ; otherwise, a_{i,μ} = 0. From (2), we see that the user association vector a_μ will change at each iteration. This is due to the fact that, in wireless networks, the number of RBs is limited and the BS must select a suitable set of users to transmit their local FL models in each iteration.
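A direct sketch of the weighted aggregation in (2), assuming the local models are stacked row-wise (variable names are ours):

```python
import numpy as np

def aggregate_global_model(local_models, a, K):
    """Global model update of (2): a K_i-weighted average over selected users.

    local_models: (U, W) array of local FL models w_{i,mu};
    a: (U,) 0/1 user association vector; K: (U,) sample counts."""
    weights = a * K  # a_{i,mu} * K_i
    return (weights[:, None] * local_models).sum(axis=0) / weights.sum()
```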
The update of the local FL model w_{i,μ} depends on the training method that is used. In our model, the gradient descent method [10] is used to update the local FL model and, hence, we have:

$$\boldsymbol w_{i,\mu+1}=\boldsymbol g\left(\boldsymbol a_\mu\right)-\frac{\lambda}{K_i}\sum_{k=1}^{K_i}\nabla f\left(\boldsymbol g\left(\boldsymbol a_\mu\right),\boldsymbol x_{ik},\boldsymbol y_{ik}\right),\qquad(3)$$

where λ is the learning rate and ∇f(g(a_μ), x_{ik}, y_{ik}) is the gradient of f(g(a_μ), x_{ik}, y_{ik}) with respect to g(a_μ). Based on (2) and (3), the BS and the users can update their FL models so as to find the optimal global FL model that solves problem (1).
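A minimal sketch of the local update (3); the per-sample gradient is passed in, so the earlier hypothetical `cross_entropy_grad` could serve for the classification example:

```python
import numpy as np

def local_update(global_model, X, y, grad_fn, lam=0.1):
    """Local FL model update of (3): one averaged gradient step from g(a_mu).

    X: (K_i, N_in) local inputs; y: (K_i,) local labels;
    grad_fn(g, x_k, y_k): per-sample gradient; lam: learning rate."""
    grads = np.stack([grad_fn(global_model, X[k], y[k])
                      for k in range(len(y))])
    return global_model - (lam / len(y)) * grads.sum(axis=0)
```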
B. Transmission Model

We assume that an orthogonal frequency-division multiple access (OFDMA) technique is adopted for local FL model transmission from the users to the BS. In our model, we assume that the total number of RBs that the BS can allocate to its users is R, and that each user can occupy only one RB. The uplink rate of user i when transmitting its local FL model parameters to the BS at iteration μ is given by:

$$c_i^{\mathrm U}\left(\boldsymbol r_{i,\mu}\right)=\sum_{n=1}^{R}r_{in,\mu}B\log_2\left(1+\frac{Ph_i}{I_n+BN_0}\right),\qquad(4)$$

where r_{i,μ} = [r_{i1,μ}, ..., r_{iR,μ}] is an RB allocation vector with r_{in,μ} ∈ {0,1}; r_{in,μ} = 1 indicates that RB n is allocated to user i at iteration μ; otherwise, r_{in,μ} = 0. P is the transmit power of user i, which is assumed to be equal for all users, h_i is the channel gain between user i and the BS, B is the bandwidth of each RB, N_0 is the noise power spectral density, and I_n is the interference on RB n caused by the users that connect to other BSs using the same RB. Since the interference caused by the users using the same RBs may significantly affect the transmission delay, and hence the time that FL needs to reach convergence, we must consider this interference in (4).

Similarly, the downlink data rate of the BS when transmitting the global FL model parameters to each user i is given by:

$$c_i^{\mathrm D}=B^{\mathrm D}\log_2\left(1+\frac{P_Bh_i}{B^{\mathrm D}N_0}\right),\qquad(5)$$

where B^D is the bandwidth that the BS uses to broadcast the global FL model to each user i and P_B is the transmit power of the BS.

Since the number of elements in the local FL model w_{i,μ} is the same as that of the global FL model g(a_μ), the data size of the local FL model w_{i,μ} is equal to the data size of the global FL model g(a_μ). Here, the data size of the local FL model is defined as the number of bits that the user requires to transmit the local FL model vector w_{i,μ} to the BS. Let Z be the data size of a global or local FL model. The transmission delays between user i and the BS over the uplink and the downlink at iteration μ are then:

$$l_i^{\mathrm U}\left(\boldsymbol r_{i,\mu}\right)=\frac{Z}{c_i^{\mathrm U}\left(\boldsymbol r_{i,\mu}\right)},\qquad l_i^{\mathrm D}=\frac{Z}{c_i^{\mathrm D}}.\qquad(6)$$

The time that the users and the BS require to jointly complete an update of their respective local and global FL models at iteration μ is given by:

$$t_\mu\left(\boldsymbol a_\mu,\boldsymbol R_\mu\right)=\max_i\ a_{i,\mu}\left(l_i^{\mathrm U}\left(\boldsymbol r_{i,\mu}\right)+l_i^{\mathrm D}\right),\qquad(7)$$

where R_μ = [r_{1,μ}, ..., r_{U,μ}]. When a_{i,μ} = 0, a_{i,μ}(l_i^U(r_{i,μ}) + l_i^D) = 0. Here, a_{i,μ} = 0 implies that user i will not send its local FL model to the BS at iteration μ and, hence, will not cause any delay at that iteration. When a_{i,μ} = 1, user i will transmit its local FL model to the BS and the transmission delay will be l_i^U(r_{i,μ}) + l_i^D. Hence, (7) is essentially the worst-case transmission delay among all selected users.
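The rate-delay chain in (4)–(7) is simple enough to compute directly. The sketch below is our illustration, with default parameter values borrowed from Table I; the choice of 32 bits per model parameter is our assumption, not the paper's:

```python
import numpy as np

def iteration_time(r, a, h, I, Z=5000 * 32, P=1.0, P_B=1.0,
                   B=1e6, B_D=20e6, N0=10 ** (-20.4)):
    """Per-iteration FL time t_mu of (7) via the rates (4)-(5) and delays (6).

    r: (U, R) 0/1 RB allocation; a: (U,) user selection; h: (U,) channel
    gains; I: (R,) interference per RB; N0 is -174 dBm/Hz in W/Hz;
    Z is the model size in bits (W = 5000 parameters assumed 32-bit)."""
    c_up = (r * B * np.log2(1.0 + P * h[:, None] / (I[None, :] + B * N0))).sum(axis=1)
    c_down = B_D * np.log2(1.0 + P_B * h / (B_D * N0))
    with np.errstate(divide="ignore"):  # unselected users have c_up = 0
        delay = np.where(a == 1, Z / c_up + Z / c_down, 0.0)
    return delay.max()  # worst-case delay over the selected users
```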
C. Problem Formulation

Having defined the system model, the next step is to introduce a joint RB allocation and user selection scheme to minimize the time that the users and the BS need in order to complete the FL training process. This optimization problem is formulated as follows:

$$\min_{\boldsymbol A,\boldsymbol R}\ \sum_{\mu=1}^{T}\mathbb 1_{\left\{\boldsymbol g\left(\boldsymbol a_\mu\right)\neq\boldsymbol g\left(\boldsymbol a_{\mu-1}\right)\right\}}\,t_\mu\left(\boldsymbol a_\mu,\boldsymbol R_\mu\right)\qquad(8)$$
$$\text{s.t.}\quad a_{i,\mu},\,r_{in,\mu}\in\{0,1\},\ \forall i\in\mathcal U,\ n=1,\ldots,R,\qquad(8\mathrm a)$$
$$\sum_{i\in\mathcal U}r_{in,\mu}\le 1,\ \forall n=1,\ldots,R,\qquad(8\mathrm b)$$
$$\sum_{n=1}^{R}r_{in,\mu}=a_{i,\mu},\ \forall i\in\mathcal U,\qquad(8\mathrm c)$$

where A = [a_1, ..., a_μ, ..., a_T] is a user selection matrix for all iterations, R = [R_1, ..., R_μ, ..., R_T] is an RB allocation matrix for all users at all iterations, and T is a constant that is large enough to guarantee the convergence of FL. In other words, the number of iterations that the FL algorithm requires to converge will not be larger than T. In (8), $\mathbb 1_{\{x\}}=1$ when x is true; otherwise, $\mathbb 1_{\{x\}}=0$. In particular, $\mathbb 1_{\{\boldsymbol g(\boldsymbol a_\mu)\neq\boldsymbol g(\boldsymbol a_{\mu-1})\}}=0$ implies that the FL algorithm has converged, so iterations after convergence add no time to the objective. Constraints (8a) and (8b) imply that each RB can be allocated to at most one user, and (8c) implies that each user that is associated with the BS is allocated exactly one RB. From (8), we see that the time used for the update of the local and global FL models, t_μ, depends on the user selection vector a_μ and the RB allocation matrix R_μ. Meanwhile, as shown in (3), the total number of iterations that the FL algorithm needs in order to converge also depends on the user selection vector a_μ. In consequence, the time that the users and the BS need to perform one FL training iteration and the number of iterations needed for the FL algorithm to converge are dependent. Moreover, given the global model g(a_μ) at iteration μ, the BS cannot calculate the number of iterations that the FL algorithm needs to converge, since all of the training data samples are located at the users' devices. Hence, problem (8) is challenging to solve.

III. MINIMIZATION OF CONVERGENCE TIME OF FEDERATED LEARNING

To solve problem (8), we first need to determine the user association at each iteration. Given the user selection vector, the optimal RB allocation scheme can be derived. To further improve the convergence speed of FL, ANNs are introduced to estimate the local FL models of the users that are not allocated any RBs for transmitting their local model parameters. Fig. 1 summarizes the training procedure of the proposed FL.
Fig. 1. The training procedure of the proposed FL. [figure omitted]

A. Gradient-Based User Association Scheme

To predict the local FL model of each user, the BS needs to use one user's local FL model as the input of an ANN, as will be explained in Subsection III-C. Hence, in the proposed user association scheme, one user must be selected to connect with the BS during all training iterations. To determine the user that connects to the BS during the entire training process, we first assume that the distance d_i between user i and the BS satisfies d_1 ≤ d_2 ≤ ... ≤ d_U. The user i∗ that always connects to the BS can then be found from:

$$i^*=\arg\max_i\ \left\|\sum_{k=1}^{K_i}\nabla f\left(\boldsymbol g\left(\boldsymbol a_{\mu-1}\right),\boldsymbol x_{ik},\boldsymbol y_{ik}\right)\right\|\qquad(9)$$
$$\text{s.t.}\quad d_1\le d_i\le d_{\gamma_R},\ \forall i\in\mathcal U,\qquad(9\mathrm a)$$

where 1 ≤ γ_R ≤ U is a constant that determines the number of users considered in (9). As γ_R increases, the number of users considered in (9) increases. Hence, the transmission delay of user i∗ may increase, thus increasing the time used to complete one FL iteration. However, as γ_R increases, the value of $\max_i\left\|\sum_{k=1}^{K_i}\nabla f\left(\boldsymbol g\left(\boldsymbol a_{\mu-1}\right),\boldsymbol x_{ik},\boldsymbol y_{ik}\right)\right\|$ may also increase and, thus, the number of iterations required for FL to converge decreases. Here, user i∗ is determined at the first iteration.
From (3), we see that, at each iteration μ, the global FL model g(a_{μ−1}) will change by λ Σ_{k=1}^{K_i} ∇f(g(a_{μ−1}), x_{ik}, y_{ik}) due to the local FL model of a given user i. We define the vector e_{i,μ} = λ Σ_{k=1}^{K_i} ∇f(g(a_{μ−1}), x_{ik}, y_{ik}) as the change in the global FL model due to user i's local FL model. To enable the BS to predict each user's local FL model, each user must have a chance to connect to the BS so as to provide its training data samples (local FL model parameters) to the BS for training the ANNs. Therefore, a probabilistic user association scheme is developed, which is given as follows:

$$p_{i,\mu}=\begin{cases}\dfrac{\left|\boldsymbol e_{i,\mu}\right|}{\sum_{j=1,j\neq i^*}^{U}\left|\boldsymbol e_{j,\mu}\right|},&\text{if }i\neq i^*,\\[1mm]1,&\text{if }i=i^*,\end{cases}\qquad(10)$$

where p_{i,μ} represents the probability that user i connects to the BS at iteration μ, and |e_{i,μ}| is the norm of the vector e_{i,μ}. From (10), we can see that, as |e_{i,μ}| increases, the probability of associating user i with the BS increases and, in consequence, the probability that the BS uses user i's local FL model to generate the global FL model increases. Hence, using the proposed user association scheme in (10), the BS has a high probability of connecting to the users whose local FL models significantly affect the global FL model, thus improving the FL convergence speed. From (10), we also see that user i∗ will always connect to the BS so as to provide information for the prediction of the other users' local FL models. Based on (10), the user association scheme at each iteration can be determined. To calculate p_{i,μ} in (10), the BS only needs to know |e_{i,μ}| of each user i, without requiring the exact training data information. In fact, |e_{i,μ}| can be directly calculated by user i, and each user i needs to transmit only the scalar |e_{i,μ}| to the BS.
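One way to read (9) and (10) in code: the BS samples which users connect, with user i∗ always included. The sketch below is our interpretation; in particular, how many users are drawn per iteration is left to the caller, since the paper fixes that number implicitly through the RB budget R:

```python
import numpy as np

def select_users(e_norms, i_star, num_slots, rng=np.random.default_rng()):
    """Probabilistic user selection following (10).

    e_norms: (U,) array of |e_{i,mu}| reported by the users;
    i_star: index of the user that always connects (found via (9));
    num_slots: total number of users to select (at most R RBs)."""
    U = len(e_norms)
    p = e_norms.astype(float).copy()
    p[i_star] = 0.0                      # i_star is handled separately
    p = p / p.sum()                      # proportional to |e_{i,mu}|
    candidates = [i for i in range(U) if i != i_star]
    others = rng.choice(candidates, size=num_slots - 1,
                        replace=False, p=p[candidates])
    a = np.zeros(U, dtype=int)
    a[i_star] = 1                        # user i* always connects
    a[others] = 1
    return a
```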
B. Optimal RB Allocation Scheme

Given the user association scheme at each iteration μ, problem (8) at iteration μ can be simplified as follows:

$$\min_{\boldsymbol R_\mu}\ t_\mu\left(\boldsymbol R_\mu\right)=\min_{\boldsymbol R_\mu}\ \max_i\ a_{i,\mu}\left(l_i^{\mathrm U}\left(\boldsymbol r_{i,\mu}\right)+l_i^{\mathrm D}\right)\qquad(11)$$
$$\text{s.t.}\quad r_{in,\mu}\in\{0,1\},\ \forall i\in\mathcal U,\ n=1,\ldots,R,\qquad(11\mathrm a)$$
$$\sum_{i\in\mathcal U}r_{in,\mu}\le 1,\ \forall n=1,\ldots,R,\qquad(11\mathrm b)$$
$$\sum_{n=1}^{R}r_{in,\mu}=a_{i,\mu},\ \forall i\in\mathcal U.\qquad(11\mathrm c)$$

We assume that there exists a variable m that satisfies $a_{i,\mu}\left(l_i^{\mathrm U}\left(\boldsymbol r_{i,\mu}\right)+l_i^{\mathrm D}\right)\le m$. Problem (11) can then be simplified as:

$$\min\ m\qquad(12)$$
$$\text{s.t.}\quad(11\mathrm a),(11\mathrm b),\text{ and }(11\mathrm c),\qquad(12\mathrm a)$$
$$m\ge a_{i,\mu}\left(l_i^{\mathrm U}\left(\boldsymbol r_{i,\mu}\right)+l_i^{\mathrm D}\right),\ \forall i\in\mathcal U.\qquad(12\mathrm b)$$

Since (12b) is nonlinear, we must transform it into a linear constraint. We first define $l_{in,\mu}^{\mathrm U}=\frac{Z}{B\log_2\left(1+\frac{Ph_i}{I_n+BN_0}\right)}$, which represents the delay of user i transmitting the local FL model over RB n at iteration μ. Then, we have $l_i^{\mathrm U}\left(\boldsymbol r_{i,\mu}\right)=\sum_{n=1}^{R}r_{in,\mu}l_{in,\mu}^{\mathrm U}$. Hence, problem (12) can be rewritten as follows:

$$\min\ m\qquad(13)$$
$$\text{s.t.}\quad(11\mathrm a),(11\mathrm b),\text{ and }(11\mathrm c),\qquad(13\mathrm a)$$
$$m\ge a_{i,\mu}\left(\sum_{n=1}^{R}r_{in,\mu}l_{in,\mu}^{\mathrm U}+l_i^{\mathrm D}\right),\ \forall i\in\mathcal U.\qquad(13\mathrm b)$$

Problem (13) is equivalent to (12) and is an integer linear programming problem, which can be solved by known optimization algorithms such as interior-point methods [13].
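As an illustration, problem (13) can be handed to an off-the-shelf mixed-integer solver; the paper itself only points to interior-point methods [13]. The sketch below (ours) uses SciPy's `milp`, stacking the U·R allocation variables r_{in,μ} with the auxiliary bound m into one decision vector:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def allocate_rbs(a, l_up, l_down):
    """Solve the min-max RB allocation (13) as an ILP.

    a: (U,) 0/1 user selection; l_up: (U, R) per-RB uplink delays l^U_{in,mu};
    l_down: (U,) downlink delays l^D_i. Returns a (U, R) 0/1 allocation."""
    U, R = l_up.shape
    n_var = U * R + 1                      # [r_11, ..., r_UR, m]
    c = np.zeros(n_var); c[-1] = 1.0       # minimize m
    rows, lb, ub = [], [], []
    for i in range(U):                     # (11c): sum_n r_in = a_i
        row = np.zeros(n_var); row[i*R:(i+1)*R] = 1.0
        rows.append(row); lb.append(a[i]); ub.append(a[i])
    for n in range(R):                     # (11b): sum_i r_in <= 1
        row = np.zeros(n_var); row[n::R][:U] = 1.0
        rows.append(row); lb.append(0.0); ub.append(1.0)
    for i in range(U):                     # (13b): sum_n r_in*l_in - m <= -a_i*l^D_i
        row = np.zeros(n_var); row[i*R:(i+1)*R] = l_up[i]; row[-1] = -1.0
        rows.append(row); lb.append(-np.inf); ub.append(-a[i] * l_down[i])
    integrality = np.ones(n_var); integrality[-1] = 0  # r binary, m continuous
    res = milp(c, constraints=LinearConstraint(np.array(rows), lb, ub),
               integrality=integrality,
               bounds=Bounds(np.zeros(n_var), np.r_[np.ones(n_var - 1), np.inf]))
    return np.round(res.x[:-1]).reshape(U, R)
```

At the scale considered in the simulations (U = 15 users, R = 5 RBs), such an ILP is tiny and should be solved essentially instantly.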
C. Prediction of the Local FL Models

The previous subsections determine the users that are associated with the BS at each iteration and minimize their transmission delay by optimizing the RB allocation. Next, we introduce an ANN-based algorithm to predict the local FL model parameters of the users that are not allocated any RBs for local FL model transmission. In particular, ANNs are used to build a relationship between the local FL models of different users. Since feedforward neural networks (FNNs) are good at function-fitting tasks, and finding a relationship among different users' local FL models is a function approximation task, we prefer to use FNNs over other types of neural networks. In the following, we first introduce the architecture of our FNN-based algorithm and then explain how to implement it to predict the local FL model parameters of the users.

Our FNN-based prediction algorithm consists of three components: a) the input, b) a single hidden layer, and c) the output, which are defined as follows:
• Input: The input of the FNN that is used for the prediction of user j's local FL model is the vector w_{i∗,μ}, which represents the local FL model of user i∗. As mentioned in Subsection III-A, user i∗ will always connect with the BS so as to provide the input information for the FNNs to predict the local FL models of the other users.
• Output: The output of the FNN for the prediction of user j's local FL model is the vector o = w_{i∗,μ} − w_{j,μ}, which represents the difference between user i∗'s local FL model and user j's local FL model. Based on the prediction output o and user i∗'s local FL model, we can obtain the local FL model of user j, i.e., ŵ_{j,μ} = w_{i∗,μ} − o, with ŵ_{j,μ} being the predicted local FL model of user j.
• A single hidden layer: The hidden layer of an FNN allows it to learn nonlinear relationships between the input vector w_{i∗,μ} and the output vector o. Mathematically, a single hidden layer consists of N neurons. The weight matrix that represents the connection strength between the input vector and the neurons in the hidden layer is v_in ∈ R^{N×W}. Meanwhile, the weight matrix that represents the connection strength between the neurons in the hidden layer and the output vector is v_out ∈ R^{W×N}.

Given the components of the FNN, we next introduce the use of FNNs to predict each user's local FL model. The states of the neurons in the hidden layer are given by:

$$\boldsymbol\vartheta=\sigma\left(\boldsymbol v_{\mathrm{in}}\boldsymbol w_{i^*,\mu}+\boldsymbol b_\vartheta\right),\qquad(14)$$

where $\sigma(x)=\frac{2}{1+\exp(-2x)}-1$ (i.e., the hyperbolic tangent) and b_ϑ ∈ R^{N×1} is a bias vector. Given the neuron states, we can obtain the output of the FNN as follows:

$$\boldsymbol o=\boldsymbol v_{\mathrm{out}}\boldsymbol\vartheta+\boldsymbol b_o,\qquad(15)$$

where b_o ∈ R^{W×1} is a vector of biases. Based on (15), we can calculate the predicted local FL model of each user j at each iteration μ, i.e., ŵ_{j,μ} = w_{i∗,μ} − o. To enable the FNN to predict each user's local FL model, the FNN must be trained by the online gradient descent method [14].
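The forward pass (14)–(15) and the prediction step are a few lines of linear algebra; the sketch below (ours) exploits the fact that σ above is exactly tanh:

```python
import numpy as np

def predict_local_model(w_istar, v_in, v_out, b_theta, b_o):
    """Predict user j's local FL model from user i*'s via (14)-(15).

    w_istar: (W,) local model of user i*; v_in: (N, W); v_out: (W, N);
    b_theta: (N,); b_o: (W,). Returns the estimate w_hat_j = w_i* - o."""
    theta = np.tanh(v_in @ w_istar + b_theta)   # hidden states, (14)
    o = v_out @ theta + b_o                     # predicted difference, (15)
    return w_istar - o
```

In the considered setup, the BS would maintain one such FNN per user j and fit it by online gradient descent [14] on pairs (w_{i∗,μ}, w_{i∗,μ} − w_{j,μ}) collected whenever user j actually transmits.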
Given the prediction of the users' FL models, the update of the global FL model can be rewritten as follows:

$$\boldsymbol g\left(\boldsymbol a_\mu\right)=\frac{\sum\limits_{i=1}^{U}K_ia_{i,\mu}\boldsymbol w_{i,\mu}+\sum\limits_{i=1}^{U}K_i\left(1-a_{i,\mu}\right)\hat{\boldsymbol w}_{i,\mu}\mathbb 1_{\{E_i\le\gamma\}}}{\sum\limits_{i=1}^{U}K_ia_{i,\mu}+\sum\limits_{i=1}^{U}K_i\left(1-a_{i,\mu}\right)\mathbb 1_{\{E_i\le\gamma\}}},\qquad(16)$$

where Σ_{i=1}^{U} K_i a_{i,μ} w_{i,μ} is the sum of the local FL models of the users that connect to the BS at iteration μ, Σ_{i=1}^{U} K_i (1 − a_{i,μ}) ŵ_{i,μ} 𝟙_{E_i≤γ} is the sum of the predicted local FL models of the users that are not associated with the BS at iteration μ, $E_i=\frac{1}{2}\left\|\hat{\boldsymbol w}_{i,\mu}-\boldsymbol w_{i,\mu}\right\|$ is the prediction error at iteration μ, and γ is the prediction accuracy requirement. In (16), when the prediction accuracy of the FNN cannot meet this requirement (i.e., E_i > γ), the BS does not use the prediction result for updating its global FL model. From (16), we can also observe that, by using FNNs, the BS can include additional local FL models when generating the global FL model, so as to improve the FL performance and convergence speed. Equation (16) is used to generate the global FL model in step c. of the FL training procedure specified in Section II.
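A sketch of the gated aggregation in (16), extending the earlier hypothetical `aggregate_global_model`; the error test uses whatever prediction-error estimates E_i the BS maintains:

```python
import numpy as np

def aggregate_with_predictions(w, w_hat, a, K, E, gamma=0.01):
    """Global FL model update of (16).

    w: (U, W) received local models (rows of unselected users are unused);
    w_hat: (U, W) FNN-predicted local models; a: (U,) selection vector;
    K: (U,) sample counts; E: (U,) prediction errors; gamma: threshold."""
    use_pred = (1 - a) * (E <= gamma)          # indicator 1{E_i <= gamma}
    num = (K * a)[:, None] * w + (K * use_pred)[:, None] * w_hat
    den = (K * a).sum() + (K * use_pred).sum()
    return num.sum(axis=0) / den
```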
The proposed FL algorithm, which minimizes the FL convergence time while optimizing the FL performance, is shown in Algorithm 1. From Algorithm 1, we can see that the user selection and the RB allocation are optimized at each FL iteration and, hence, problem (8) is solved at each FL iteration.

Algorithm 1 Proposed FL over a wireless network
Init: Local FL model of each user i, w_i; FNN models for the prediction of the local FL models, v_in, v_out, b_ϑ, b_o; user i∗ that always connects to the BS.
1: for each iteration μ do
2:   Each user i trains its local FL model to obtain w_{i,μ}.
3:   Each user i calculates the change of gradient of the local FL model, e_{i,μ}, and sends |e_{i,μ}| to the BS.
4:   The BS calculates p_{i,μ} using (10) and determines a_μ.
5:   The BS determines R_μ from (11).
6:   The selected users (a_{i,μ} = 1) transmit their local FL models to the BS based on R_μ and a_μ.
7:   The BS uses the FNNs to estimate the local FL models of the users that are not associated with the BS (a_{i,μ} = 0).
8:   The BS calculates the global FL model g using (16).
9:   The BS uses the collected local FL models to train the FNNs.
10: end for
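Pulling the pieces together, one iteration of Algorithm 1 might look like the following glue-code sketch. Every helper (`local_update`, `cross_entropy_grad`, `select_users`, `predict_local_model`, `aggregate_with_predictions`) refers to the illustrative functions defined earlier in this text, not to any released code:

```python
import numpy as np

def proposed_fl_iteration(g, users, i_star, fnns, R, gamma=0.01):
    """One iteration of Algorithm 1 (illustrative glue code only).

    g: (W,) current global model; users: list of (X_i, y_i) datasets;
    fnns: per-user FNN parameter tuples (v_in, v_out, b_theta, b_o);
    R: number of RBs, i.e., how many users can upload per iteration."""
    # Steps 2-3: local training; ||w_i - g|| serves as a proxy for |e_{i,mu}|.
    w = np.stack([local_update(g, X, y, cross_entropy_grad) for X, y in users])
    e_norms = np.linalg.norm(w - g, axis=1)
    # Steps 4-5: user selection via (10); the RB allocation (13) would follow,
    # feeding channel-dependent delays from (4)-(6) into allocate_rbs.
    a = select_users(e_norms, i_star, num_slots=R)
    # Steps 6-8: predict the unsent models and aggregate via (16); in a
    # simulation the true w is available to score the prediction error E.
    w_hat = np.stack([predict_local_model(w[i_star], *fnns[j])
                      for j in range(len(users))])
    E = 0.5 * np.linalg.norm(w_hat - w, axis=1)
    K = np.array([len(y) for _, y in users])
    return aggregate_with_predictions(w, w_hat, a, K, E, gamma)
```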
IV. SIMULATION RESULTS AND ANALYSIS

For our simulations, we consider a circular network area having a radius r = 500 m with one BS at its center servicing U = 15 uniformly distributed users. Each user trains an FNN using the MNIST dataset [15]. The sizes of the neuron weight matrices are 784 × 50 and 50 × 10. The FL algorithm is used to identify handwritten digits from 0 to 9. The BS also implements an FNN for each user to predict its local FL model parameters. The FNNs are generated based on the MATLAB machine learning toolbox. 1000 handwritten digits are used to test the trained FL algorithms. The other parameters used in the simulations are listed in Table I. For comparison purposes, we use the standard FL algorithm in [10], which randomly determines user selection and resource allocation without using FNNs to estimate the local FL model parameters of each user. The simulation results are averaged over a large number of independent runs.

TABLE I
SYSTEM PARAMETERS

Parameter   Value     Parameter   Value
α           2         N_0         −174 dBm/Hz
P           1 W       B           1 MHz
R           5         B^D         20 MHz
N           5         P_B         1 W
N_out       10        K_i         500
N_in        784       γ           0.01
W           5000      γ_R         5

Fig. 2. An example of implementing FL algorithms for the identification of handwritten digits. [figure omitted]

Fig. 2 shows an example of implementing the considered FL algorithms for the identification of handwritten digits from 0 to 9. In this figure, the blue digits displayed above the handwritten digits are the identification results of the proposed FL algorithm, while the black digits displayed below the handwritten digits are the identification results of the standard FL algorithm. Meanwhile, the red digits represent the wrong identification results of the considered FL algorithms. From Fig. 2, we can see that, for a total of 36 handwritten digits, the proposed FL algorithm can correctly identify 33 digits while the standard FL algorithm can identify only 30 digits. This is because the proposed FL algorithm can use FNNs to estimate the users' local FL model parameters. Hence, even though the number of RBs is limited, by employing our proposed FL algorithm, the BS can exploit all users' local FL models to generate the global FL model, thus improving the FL performance.
Fig. 3. FL convergence time as the number of users varies (y-axis: convergence time (s); x-axis: number of users, from 3 to 15; curves: proposed FL vs. standard FL). [figure omitted]

In Fig. 3, we show how the FL convergence time changes as the number of users varies. From this figure, we see that, as the number of users increases, the FL convergence time first increases. This is because the probability that some users are located far away from the BS increases; hence, the time that these users require to transmit their local FL models to the BS increases, which in turn increases the time needed for the users and the BS to complete one FL iteration. However, as the number of users continues to increase, the FL convergence time decreases. This is due to the fact that, as the number of users increases, the total number of training data samples increases. From Fig. 3, we can also see that the proposed FL algorithm achieves up to a 53.8% improvement in FL convergence time compared to the standard FL for a network with 15 users. This is because the proposed FL adopts a probabilistic user selection scheme to determine the users that will transmit their local FL model parameters to the BS and optimizes the RB allocation at each FL iteration. Meanwhile, the proposed FL algorithm uses FNNs to predict the local FL model parameters of the users that cannot connect to the BS, which improves the convergence speed.

V. CONCLUSION

In this paper, we have developed a novel framework that enables the implementation of FL over wireless networks. We have formulated an optimization problem that jointly considers user selection and resource allocation for the minimization of the FL convergence time while optimizing the FL performance. To solve this problem, we have proposed a probabilistic user selection scheme that allows the users whose local FL models have large effects on the global FL model to associate with the BS with high probability. Given the user selection policy, the uplink RB allocation is determined. To further improve the FL convergence speed, we have studied the use of FNNs to estimate the local FL models of the users that are not allocated any RBs. Simulation results have shown that the proposed FL algorithm yields significant improvements in convergence time compared to the standard FL algorithm.

REFERENCES

[1] W. Saad, M. Bennis, and M. Chen, "A vision of 6G wireless systems: Applications, trends, technologies, and open research problems," IEEE Network, to appear, 2019.
[2] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, and J. Roselander, "Towards federated learning at scale: System design," arXiv preprint arXiv:1902.01046, Mar. 2019.
[3] A. Ferdowsi and W. Saad, "Brainstorming generative adversarial networks (BGANs): Towards multi-agent generative models with distributed private datasets," arXiv preprint arXiv:2002.00306, 2020.
[4] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, "Towards an intelligent edge: Wireless communication meets machine learning," arXiv preprint arXiv:1809.00343, 2018.
[5] H. Kim, J. Park, M. Bennis, and S. Kim, "Blockchained on-device federated learning," IEEE Communications Letters, to appear, 2019.
[6] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, June 2019.
[7] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," arXiv preprint arXiv:1909.07972, 2019.
[8] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, "Energy efficient federated learning over wireless communication networks," arXiv preprint arXiv:1911.02417, 2019.
[9] N. H. Tran, W. Bao, A. Zomaya, N. Minh N.H., and C. S. Hong, "Federated learning over wireless networks: Optimization model design and analysis," in Proc. of IEEE Conference on Computer Communications, Paris, France, April 2019.
[10] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," arXiv preprint arXiv:1610.02527, 2016.
[11] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, "Artificial neural networks-based machine learning for wireless networks: A tutorial," IEEE Communications Surveys and Tutorials, vol. 21, no. 4, pp. 3039–3071, Fourth Quarter 2019.
[12] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. Theertha Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," in Proc. of NIPS Workshop on Private Multi-Party Machine Learning, Barcelona, Spain, Dec. 2016.
[13] Y. Zhang, "Solving large-scale linear programs by interior-point methods under the Matlab environment," Optimization Methods and Software, vol. 10, no. 1, pp. 1–31, 1998.
[14] W. Wu, J. Wang, M. Cheng, and Z. Li, "Convergence analysis of online gradient method for BP neural networks," Neural Networks, vol. 24, no. 1, pp. 91–98, Jan. 2011.
[15] Y. LeCun, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/.