Min Du
Postdoc, UC Berkeley
Outline
q Preliminary: deep learning and SGD
q Federated learning
q Open problems
The goal of deep learning
Learn a function that maps a given input to the desired output, where 𝑤 is the set of parameters contained by the function.
● Image classification: given an input image of a handwritten digit "8", output the label "8".
● Next-word prediction: given the input "Looking forward to your ?", output "reply".
Finding the function: model training
● Given a training dataset containing 𝑛 input-output pairs (𝑥_𝑖, 𝑦_𝑖), 𝑖 ∈ [1, 𝑛], the goal of deep learning model training is to find a set of parameters 𝑤 such that the average probability assigned to the correct outputs, 𝑝(𝑦_𝑖 | 𝑥_𝑖, 𝑤), is maximized.
● That is,
  maximize (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑝(𝑦_𝑖 | 𝑥_𝑖, 𝑤)
which is equivalent to
  minimize (1/𝑛) Σ_{𝑖=1}^{𝑛} −log 𝑝(𝑦_𝑖 | 𝑥_𝑖, 𝑤)
where 𝑙(𝑥_𝑖, 𝑦_𝑖, 𝑤) ≝ −log 𝑝(𝑦_𝑖 | 𝑥_𝑖, 𝑤) is a basic component of the loss function for sample (𝑥_𝑖, 𝑦_𝑖).
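To make the loss concrete, here is a minimal sketch (assuming a classifier that already outputs a probability for each class; the function name and the toy numbers are illustrative, not from the slides) of computing the average negative log-likelihood over a batch:

import numpy as np

def nll_loss(probs, y):
    """Average of the per-sample loss l(x_i, y_i, w) = -log p(y_i | x_i, w).

    probs: (n, num_classes) array of predicted probabilities p(. | x_i, w)
    y:     (n,) array of integer class labels y_i
    """
    eps = 1e-12                                  # avoid log(0)
    picked = probs[np.arange(len(y)), y]         # p(y_i | x_i, w) for each sample
    return -np.mean(np.log(picked + eps))

# Tiny example: 3 samples, 4 classes
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
labels = np.array([0, 1, 3])
print(nll_loss(probs, labels))   # average of -log(0.7), -log(0.5), -log(0.7)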
No closed-form solution: in a typical deep learning model, 𝑤 may contain millions of parameters.
Non-convex: multiple local minima exist.
[Figure: the loss surface 𝑓(𝑤) plotted against 𝑤, showing multiple local minima.]
Solution: Gradient Descent
● Repeatedly step in the direction of the negative gradient of the loss 𝑓(𝑤):
  𝑤_{𝑡+1} ← 𝑤_𝑡 − 𝜂∇𝑓(𝑤_𝑡)
● The learning rate 𝜂 controls the step size.
● How to stop? When the update is small enough, i.e. the iterates converge, or when ∥∇𝑓(𝑤_𝑡)∥ ≤ 𝜖.
● Problem: usually the number of training samples 𝑛 is large, so each full-gradient step is expensive and convergence is slow.
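A minimal sketch of this loop on a toy quadratic, using the ∥∇𝑓(𝑤_𝑡)∥ ≤ 𝜖 stopping test; the function and constants are illustrative assumptions, not from the slides:

import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, eps=1e-6, max_iters=10_000):
    """Plain gradient descent: w_{t+1} <- w_t - eta * grad_f(w_t),
    stopping when the gradient norm drops below eps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(w)
        if np.linalg.norm(g) <= eps:
            break
        w = w - eta * g
    return w

# Toy example: f(w) = ||w - 3||^2 has gradient 2 * (w - 3) and its minimum at w = 3.
print(gradient_descent(lambda w: 2 * (w - 3.0), w0=np.zeros(2)))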
Solution: Stochastic Gradient Descent (SGD)
● At each step, instead of computing the gradient over all training samples, randomly pick a small subset (mini-batch) of training samples (𝑥_𝑏, 𝑦_𝑏) and update:
  𝑤_{𝑡+1} ← 𝑤_𝑡 − 𝜂∇𝑓(𝑤_𝑡; 𝑥_𝑏, 𝑦_𝑏)
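A minimal sketch of the SGD update on a synthetic least-squares problem; the model, data, and hyperparameters are illustrative assumptions, not from the slides:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X @ w_true + noise
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def grad_on_batch(w, Xb, yb):
    """Gradient of the mean squared error (1/|b|) * ||Xb @ w - yb||^2 on a mini-batch."""
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

w = np.zeros(d)
eta, batch_size = 0.05, 32
for step in range(2000):
    idx = rng.choice(n, size=batch_size, replace=False)  # random mini-batch (x_b, y_b)
    w = w - eta * grad_on_batch(w, X[idx], y[idx])       # w_{t+1} <- w_t - eta * grad f(w_t; x_b, y_b)

print(np.linalg.norm(w - w_true))   # typically small: w approaches the true parameters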
The importance of data for ML
● Private data: all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc. Service providers such as Google and Apple would like to train ML models on this data.
[Diagram: each device trains an ML model on its local data; only the models are sent to a central server for model aggregation.]
● Addressing privacy: model parameters will never contain more information than the raw training data.
● Addressing network overhead: the size of the model is generally smaller than the size of the raw training data.
Challenges of learning in this decentralized setting:
○ Unbalanced
■ Some users produce significantly more data than others
○ Massively distributed
■ # mobile device owners >> avg # training samples on each device
○ Limited communication
■ Unstable mobile network connections
A new paradigm – Federated Learning
A synchronous update scheme that proceeds in rounds of communication.
McMahan, H. Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. "Communication-efficient learning of deep networks from decentralized data." AISTATS, 2017.
Federated learning – overview
[Diagram] In round number i: the central server sends the current model M(i) to participating clients, and each client updates M(i) using its own local data.
[Diagram] The central server aggregates the clients' updates into a new global model M(i+1), distributes it in round number i+1, and the process continues.
Federated learning – detail
For efficiency, at the beginning of each round, a random fraction C of clients is selected, and the server sends the current model parameters to each of these clients.
Federated learning – detail
● Recall traditional deep learning model training:
  ○ For a training dataset containing 𝑛 samples (𝑥_𝑖, 𝑦_𝑖), 1 ≤ 𝑖 ≤ 𝑛, the training objective is:
    min_{𝑤∈ℝ^𝑑} 𝑓(𝑤)   where   𝑓(𝑤) ≝ (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑓_𝑖(𝑤)
  ○ SGD update: 𝑤_{𝑡+1} ← 𝑤_𝑡 − 𝜂∇𝑓(𝑤_𝑡; 𝑥_𝑏, 𝑦_𝑏)
Federated learning – detail
● In federated learning:
  ○ Suppose the 𝑛 training samples are distributed across 𝐾 clients, where 𝑃_𝑘 is the set of indices of data points on client 𝑘, and 𝑛_𝑘 = |𝑃_𝑘|.
  ○ The training objective is min_{𝑤∈ℝ^𝑑} 𝑓(𝑤), with
    𝑓(𝑤) = Σ_{𝑘=1}^{𝐾} (𝑛_𝑘/𝑛) 𝐹_𝑘(𝑤)   where   𝐹_𝑘(𝑤) ≝ (1/𝑛_𝑘) Σ_{𝑖∈𝑃_𝑘} 𝑓_𝑖(𝑤)
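A quick numerical check (with made-up per-sample losses and an assumed 3-client partition) that the weighted federated objective equals the centralized average:

import numpy as np

rng = np.random.default_rng(1)

# Per-sample losses f_i(w) for n samples at some fixed w (values are placeholders).
n = 12
per_sample_loss = rng.random(n)

# Centralized objective: f(w) = (1/n) * sum_i f_i(w)
f_central = per_sample_loss.mean()

# Partition the sample indices across K clients (the sets P_k), possibly unbalanced.
partitions = [np.array([0, 1, 2, 3, 4, 5, 6]), np.array([7, 8]), np.array([9, 10, 11])]

# Federated objective: f(w) = sum_k (n_k / n) * F_k(w), F_k(w) = (1/n_k) * sum_{i in P_k} f_i(w)
f_federated = sum(len(P_k) / n * per_sample_loss[P_k].mean() for P_k in partitions)

print(np.isclose(f_central, f_federated))  # True: the two objectives coincide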
A baseline – FederatedSGD (FedSGD)
● A randomly selected client that has 𝑛_𝑘 training samples in federated learning ≈ a randomly selected sample in traditional deep learning.
● In each round, every selected client 𝑘 computes the gradient 𝑔_𝑘 = ∇𝐹_𝑘(𝑤_𝑡) on its local data. Two equivalent views:
  ■ Approach 1: the central server aggregates the gradients and takes one step: 𝑤_{𝑡+1} ← 𝑤_𝑡 − 𝜂 Σ_{𝑘=1}^{𝐾} (𝑛_𝑘/𝑛) 𝑔_𝑘
  ■ Approach 2: each client 𝑘 computes 𝑤^𝑘_{𝑡+1} ← 𝑤_𝑡 − 𝜂𝑔_𝑘; the central server performs aggregation: 𝑤_{𝑡+1} ← Σ_{𝑘=1}^{𝐾} (𝑛_𝑘/𝑛) 𝑤^𝑘_{𝑡+1}
● Letting each client repeat the Approach-2 local update multiple times before aggregation ⟹ FederatedAveraging (FedAvg).
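A small sketch verifying that the two views coincide for a single step: averaging the clients' one-step local models (Approach 2) gives the same result as one global step on the aggregated gradient (Approach 1). The gradients here are random placeholders, not real client data:

import numpy as np

rng = np.random.default_rng(0)

d, K = 4, 3
w_t = rng.normal(size=d)
eta = 0.1
n_k = np.array([50, 30, 20])                    # samples per client
grads = [rng.normal(size=d) for _ in range(K)]  # stand-ins for g_k = grad F_k(w_t)
weights = n_k / n_k.sum()

# Approach 1: the server aggregates gradients and takes one step.
w_next_1 = w_t - eta * sum(c * g for c, g in zip(weights, grads))

# Approach 2: each client takes one local step, the server averages the models.
local_models = [w_t - eta * g for g in grads]
w_next_2 = sum(c * wk for c, wk in zip(weights, local_models))

print(np.allclose(w_next_1, w_next_2))  # True: the two views of FedSGD coincide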
Federated learning – deal with limited communication
● Increase computation per communication round:
  ○ Select more clients to participate in each round
  ○ Increase the amount of computation performed on each client
Federated learning – FederatedAveraging (FedAvg)
Learning rate: 𝜂; total #samples: 𝑛; total #clients: 𝐾; #samples on client 𝑘: 𝑛_𝑘; fraction of clients selected per round: 𝐶.
● In round 𝑡:
  ○ The central server broadcasts the current model 𝑤_𝑡 to each selected client; each client 𝑘 computes the gradient 𝑔_𝑘 = ∇𝐹_𝑘(𝑤_𝑡) on its local data.
  ■ Approach 2:
    ● Each client 𝑘 computes, for 𝐸 local epochs: 𝑤^𝑘_{𝑡+1} ← 𝑤_𝑡 − 𝜂𝑔_𝑘
    ● The central server performs aggregation: 𝑤_{𝑡+1} ← Σ_{𝑘=1}^{𝐾} (𝑛_𝑘/𝑛) 𝑤^𝑘_{𝑡+1}
● With local mini-batch size 𝐵, the number of local updates on client 𝑘 in each round is 𝑢_𝑘 = 𝐸 · 𝑛_𝑘 / 𝐵.
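A self-contained simulation sketch of FedAvg on a synthetic linear-regression task; the data split, hyperparameters, and the choice to normalize weights over the selected clients only are illustrative assumptions, not prescribed by the slides:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data split across K clients with unbalanced sizes (illustrative setup).
d, K = 5, 10
w_true = rng.normal(size=d)
client_data = []
for _ in range(K):
    n_k = rng.integers(20, 200)                      # unbalanced client dataset sizes
    Xk = rng.normal(size=(n_k, d))
    yk = Xk @ w_true + 0.01 * rng.normal(size=n_k)
    client_data.append((Xk, yk))

def local_update(w, Xk, yk, eta, E, B):
    """Client update: E local epochs of mini-batch SGD with batch size B,
    starting from the broadcast global model w."""
    w = w.copy()
    for _ in range(E):
        order = rng.permutation(len(yk))
        for start in range(0, len(yk), B):
            idx = order[start:start + B]
            grad = 2.0 / len(idx) * Xk[idx].T @ (Xk[idx] @ w - yk[idx])
            w -= eta * grad
    return w

# FedAvg: in each round, sample a fraction C of clients, run local training,
# then average the returned models weighted by each selected client's sample count.
w = np.zeros(d)
eta, E, B, C, rounds = 0.05, 5, 32, 0.3, 30
m = max(1, int(C * K))
for t in range(rounds):
    selected = rng.choice(K, size=m, replace=False)
    local_models, weights = [], []
    for k in selected:
        Xk, yk = client_data[k]
        local_models.append(local_update(w, Xk, yk, eta, E, B))
        weights.append(len(yk))
    weights = np.array(weights, dtype=float) / sum(weights)
    w = sum(wt * wk for wt, wk in zip(weights, local_models))   # weighted model averaging

print(np.linalg.norm(w - w_true))   # typically small after a few rounds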
Federated learning – FederatedAveraging (FedAvg)
Model initialization
● Two choices:
  ○ On the central server (shared initialization)
  ○ On each client (independent initialization)
● Shared initialization works better in practice.
[Figure: the loss on the full MNIST training set for models generated by 𝜃𝑤 + (1 − 𝜃)𝑤′, comparing independent and shared initialization.]
Federated learning – FederatedAveraging (FedAvg)
Model averaging
● As shown in the figure: in practice, naïve parameter averaging works surprisingly well.
[Figure: the loss on the full MNIST training set for models generated by 𝜃𝑤 + (1 − 𝜃)𝑤′.]
Federated learning – FederatedAveraging (FedAvg)
In general, the more computation in each round, the faster the model trains.
FedAvg also converges to a higher test accuracy (B=10, E=20).
[Figure: language modeling results.]
Federated learning – Evaluation
In general, the more computation in each round, the faster the model trains.
FedAvg also converges to a higher test accuracy (B=10, E=5).
Federated learning – Evaluation
● What if we maximize the computation on each client? 𝐸 → ∞
Federated learning – related research
Google FL Workshop: https://sites.google.com/view/federated-learning-2019/home
HiveMind: Decentralized Federated Learning
[Diagram: federated learning rounds run on the Oasis Blockchain Platform, coordinated by a HiveMind smart contract instead of a central server.]
● In round number i: each client computes gradient updates for the current model M(i) on its local data and adds DP (differential privacy) noise.
● The noisy updates from the clients are combined via secure aggregation at the HiveMind smart contract.
● The aggregation yields a differentially private global model M(i+1), which is distributed in round number i+1, and the process continues.
● Building blocks: differential privacy (DP noise), secure aggregation, and model encryption.
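The slides do not spell out HiveMind's exact noise mechanism, so the sketch below only illustrates the generic clip-and-add-Gaussian-noise pattern (as in DP-SGD / DP-FedAvg style approaches) that a client might apply to its update before secure aggregation; all names and parameters are placeholders:

import numpy as np

rng = np.random.default_rng(0)

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.0):
    """Clip a client's update to a maximum L2 norm, then add Gaussian noise.
    The actual noise scale would be chosen to meet a target (epsilon, delta)."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

# Each client perturbs its update locally; after secure aggregation the server
# (or smart contract) only ever sees the combination of the noisy updates.
client_updates = [rng.normal(size=5) for _ in range(4)]
noisy = [privatize_update(u) for u in client_updates]
aggregated = np.mean(noisy, axis=0)
print(aggregated)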
Federated learning – open problems
● Detecting data poisoning attacks while secure aggregation is being used, since the server cannot inspect individual client updates.
● ……
Thank you!
Min Du
min.du@berkeley.edu