Background: DNN Training
[Figure: a forward pass through the model with weight parameters W produces activations and a prediction ŷ (e.g., "tiger"); loss(ŷ, y) against the true label y (e.g., "lion") is backpropagated to produce gradients]
• DNN training is time- and compute-intensive!
• W is optimized using standard iterative optimization procedures
• Weight update: W = W − η · ∇W
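To make the update rule concrete, here is a minimal sketch of one training iteration ending in the weight update W = W − η · ∇W from the slide. The toy linear model, squared-error loss, and learning rate are assumptions for illustration, not PipeDream code.

```python
import numpy as np

def sgd_step(W, grad_W, eta=0.01):
    """One iteration of the update on the slide: W = W - eta * grad_W."""
    return W - eta * grad_W

# Toy single-layer model trained on one (input, label) pair (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))            # weight parameters W
x = rng.normal(size=4)                 # input example
y = np.array([1.0, 0.0, 0.0])          # one-hot label, standing in for "lion"

for step in range(100):
    y_hat = x @ W                                    # forward pass: prediction
    loss = np.mean((y_hat - y) ** 2)                 # loss(y_hat, y)
    grad_W = np.outer(x, 2 * (y_hat - y) / y.size)   # backward pass: gradients
    W = sgd_step(W, grad_W)                          # weight update
```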
Parallelizing DNN Training: Data Parallelism
) copies of the Despite many performance optimizations,
same model communication overhead high!
… …
∇" $ ∇" (
Worker 1 Worker *
Worker 1 Worker !
All inputs
8
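The overhead comes from aggregating every worker's gradients each step. Below is a minimal simulation of the data-parallel pattern on the slide, with the network all-reduce replaced by an in-process average; the toy model, shard sizes, and learning rate are assumptions, not PipeDream or PyTorch code.

```python
import numpy as np

N_WORKERS, ETA, STEPS = 4, 0.01, 10
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                        # same model replicated on every worker
inputs = rng.normal(size=(32, 4))                  # "all inputs"
labels = rng.normal(size=(32, 3))
shards = np.array_split(np.arange(32), N_WORKERS)  # inputs partitioned across workers

def local_gradient(W, x, y):
    """Mean-squared-error gradient for a linear model on one worker's shard."""
    return x.T @ (2 * (x @ W - y)) / y.size

for step in range(STEPS):
    # Each worker k computes its gradient on its own shard (local backward pass).
    grads = [local_gradient(W, inputs[s], labels[s]) for s in shards]
    # Gradient aggregation: in a real system this is an all-reduce over the
    # network -- the source of the communication overhead noted on the slide.
    grad_W = np.mean(grads, axis=0)
    W = W - ETA * grad_W                           # every replica applies the same update
```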
Outline
• Background and Motivation
• Evaluation
How do we assign operators to pipeline stages?
[Figure: example partitioning; a stage with compute time = 2, replicated across 2 workers, achieves throughput (1/2) × 2 = 1]
• Resulting configuration: 2-3-2-1 (workers per stage)
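A short worked version of the throughput arithmetic above. The helper functions and the per-stage compute times are my own illustrative choices, not PipeDream's optimizer: a stage replicated over r workers processes roughly r / compute_time inputs per unit time, and the pipeline is limited by its slowest stage.

```python
def stage_throughput(compute_time, replicas):
    """Inputs per unit time for one stage replicated across `replicas` workers."""
    return replicas / compute_time

def pipeline_throughput(stage_times, replicas_per_stage):
    """The pipeline runs at the rate of its slowest stage."""
    return min(stage_throughput(t, r)
               for t, r in zip(stage_times, replicas_per_stage))

# Hypothetical 4-stage partition matching the 2-3-2-1 configuration on the slide:
# e.g., a stage with compute time 2 and 2 replicas has throughput (1/2) * 2 = 1.
stage_times = [2.0, 3.0, 2.0, 1.0]   # assumed per-stage compute times
config = [2, 3, 2, 1]                # workers per stage
print(pipeline_throughput(stage_times, config))  # 1.0: all stages are balanced
```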
1F1B Scheduling
Workers alternate between forward and backward passes
• Workers always utilized
• Gradients used to update model immediately
To support stage replication, need to modify this mechanism slightly – see paper for details!
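A minimal sketch of what a 1F1B schedule looks like for one stage; the warm-up rule and helper below are my own simplification (no stage replication), not PipeDream's scheduler. After admitting enough minibatches to fill the downstream stages, the worker strictly alternates one forward pass with one backward pass.

```python
def one_f_one_b(stage, num_stages, num_minibatches):
    """Yield the (op, minibatch) steps executed by `stage` under a simple 1F1B schedule."""
    # Warm-up: admit enough forward passes to keep the downstream stages busy.
    warmup = min(num_stages - stage - 1, num_minibatches)
    next_fwd = next_bwd = 0
    for _ in range(warmup):
        yield ("forward", next_fwd)
        next_fwd += 1
    # Steady state: alternate one forward and one backward pass per step.
    while next_bwd < num_minibatches:
        if next_fwd < num_minibatches:
            yield ("forward", next_fwd)
            next_fwd += 1
        yield ("backward", next_bwd)
        next_bwd += 1

# Example: the first stage of a 4-stage pipeline processing 8 minibatches.
print(list(one_f_one_b(stage=0, num_stages=4, num_minibatches=8)))
```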
Outline
• Background and Motivation
• Evaluation
Naïve pipelining leads to weight version mismatches
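A toy illustration of the mismatch (the in-flight count and version bookkeeping are assumed for the example, not taken from PipeDream): with several minibatches in flight, a minibatch's forward pass sees one weight version, but by the time its backward pass runs, updates from other minibatches have already produced newer versions.

```python
IN_FLIGHT = 4            # minibatches admitted before the first backward pass (assumed)
weight_version = 0
forward_version = {}     # minibatch -> weight version used in its forward pass

for minibatch in range(8):
    forward_version[minibatch] = weight_version        # naive forward pass
    if minibatch >= IN_FLIGHT - 1:
        done = minibatch - (IN_FLIGHT - 1)              # oldest in-flight minibatch
        print(f"minibatch {done}: forward on v{forward_version[done]}, "
              f"backward on v{weight_version}")
        weight_version += 1                             # its gradient is applied -> new version
```

Every minibatch after the first ends up computing its backward pass against a newer weight version than the one its forward pass used.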
Outline
• Evaluation
  • Setup
  • Comparison to Data Parallelism on Time-to-Accuracy
  • Communication Overhead of Pipeline Parallelism
  • Comparison to Model Parallelism and Hybrid Parallelism on Throughput
  • PipeDream’s Memory Footprint
Evaluation Setup
• Integrated PipeDream with PyTorch in ~3000 lines of Python code
PipeDream > Data Parallelism (DP) end-to-end
[Figure: end-to-end training time results; PipeDream is 2.46x and 5.28x faster than DP]
PipeDream vs. Data Parallelism on Time-to-Accuracy
PipeDream reduces communication overhead
Code available at https://github.com/msr-fiddle/pipedream
https://cs.stanford.edu/~deepakn/