
Parallel Computing for Machine Learning

(Part 1)

Shusen Wang
Why parallel computing for ML?

• Deep learning models are big: ResNet-50 has 25M parameters.
• Big models are trained on big data, e.g., ImageNet has 14M images.
• Big model + big data ⇒ big computation cost.
• Example: Training ResNet-50 on ImageNet (90 epochs) on a single NVIDIA M40 GPU takes 14 days.
• Parallel computing: using multiple processors to make the computation faster (in terms of wall-clock time).
Toy Example: Least Squares Regression
Linear Predictor
• Inputs: 𝐱 ∈ ℝ^d (e.g., the features of a house).
• Prediction: f(𝐱) = 𝐱^T 𝐰 (e.g., the housing price).

• f(𝐱) = w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d
  • w_1, w_2, ⋯, w_d: weights
  • x_1: # of bedrooms
  • x_2: # of bathrooms
  • x_3: square feet
  • x_4: age of house
  • …
[Figure: the features of a house, 𝐱 ∈ ℝ^d, are fed into the predictor f(𝐱) = 𝐱^T 𝐰, which outputs a predicted price, e.g., $0.5M.]
Least Squares Regression
• Inputs: 𝐱 ∈ ℝ^d (e.g., the features of a house).
• Prediction: f(𝐱) = 𝐱^T 𝐰 (e.g., the housing price).

Question: How to find 𝐰?

• Training inputs: 𝐱_1, ⋯, 𝐱_n ∈ ℝ^d (n houses in total).
• Training targets: y_1, ⋯, y_n ∈ ℝ.
• Loss function: L(𝐰) = (1/2) ∑_{i=1}^n (𝐱_i^T 𝐰 − y_i)^2.
• Least squares regression: 𝐰^⋆ = argmin_𝐰 L(𝐰).
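To make the toy problem concrete, here is a minimal NumPy sketch (not from the slides) of the loss L(𝐰) and a direct solver for the argmin; the data are random placeholders.

```python
# Minimal NumPy sketch of the least squares objective and its minimizer (illustrative).
import numpy as np

def loss(X, y, w):
    """L(w) = 1/2 * sum_i (x_i^T w - y_i)^2, where X has shape (n, d)."""
    residual = X @ w - y
    return 0.5 * np.sum(residual ** 2)

# Toy data: n = 1000 "houses", d = 4 features (random placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)

# w_star = argmin_w L(w); for a problem this small, a direct solver suffices.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(loss(X, y, w_star))
```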
Parallel Gradient Descent for Least Squares
Parallel Gradient Descent
Example: GD for the least squares regression model
• Loss function: L(𝐰) = (1/2) ∑_{i=1}^n (𝐱_i^T 𝐰 − y_i)^2.
• Gradient: g(𝐰) = ∑_{i=1}^n g_i(𝐰), where g_i(𝐰) = (𝐱_i^T 𝐰 − y_i) 𝐱_i.
  • Derivation: g(𝐰) = ∂L(𝐰)/∂𝐰 = ∑_{i=1}^n ∂[(1/2)(𝐱_i^T 𝐰 − y_i)^2]/∂𝐰 = ∑_{i=1}^n (𝐱_i^T 𝐰 − y_i) 𝐱_i.
• Gradient descent: 𝐰_{t+1} = 𝐰_t − α ⋅ g(𝐰_t).

• The bottleneck of GD is computing the gradient.
• It is expensive if #samples and #parameters are both big.
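A single-machine gradient descent loop for this loss might look like the sketch below (illustrative, assuming NumPy; the step size and iteration count are arbitrary). The X.T @ (X @ w - y) line is the O(nd) gradient computation that dominates the cost.

```python
# Plain (single-machine) gradient descent for least squares (illustrative sketch).
import numpy as np

def gradient(X, y, w):
    """g(w) = sum_i (x_i^T w - y_i) x_i, computed as one matrix product."""
    return X.T @ (X @ w - y)

def gradient_descent(X, y, alpha=1e-4, num_steps=100):
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        w = w - alpha * gradient(X, y, w)   # computing g(w_t) costs O(n*d) per step
    return w
```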
Parallel Gradient Descent
Example: GD for the least squares regression model
• Gradient: g(𝐰) = ∑_{i=1}^n g_i(𝐰), where g_i(𝐰) = (𝐱_i^T 𝐰 − y_i) 𝐱_i.

Parallel GD using 2 processors
• Split the sum: g(𝐰) = [g_1(𝐰) + g_2(𝐰) + ⋯ + g_{n/2}(𝐰)] + [g_{n/2+1}(𝐰) + ⋯ + g_{n−1}(𝐰) + g_n(𝐰)].
• Processor 1 computes the first half, ĝ_1 = ∑_{i=1}^{n/2} g_i(𝐰); Processor 2 computes the second half, ĝ_2 = ∑_{i=n/2+1}^{n} g_i(𝐰).
• Aggregate: g(𝐰) = ĝ_1 + ĝ_2.
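One way to realize this two-processor scheme on a single machine is to split the rows between two worker processes and sum their partial gradients. The sketch below uses Python's multiprocessing and is purely illustrative (function names are made up).

```python
# Splitting the gradient sum between two worker processes (illustrative sketch).
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    X_part, y_part, w = args
    # ghat_j: the sum of g_i(w) over this processor's share of the samples
    return X_part.T @ (X_part @ w - y_part)

def parallel_gradient(X, y, w, num_workers=2):
    X_parts = np.array_split(X, num_workers)
    y_parts = np.array_split(y, num_workers)
    tasks = [(Xp, yp, w) for Xp, yp in zip(X_parts, y_parts)]
    # On platforms that spawn processes, call this under `if __name__ == "__main__":`.
    with Pool(num_workers) as pool:
        partial_sums = pool.map(partial_gradient, tasks)   # ghat_1, ghat_2
    return sum(partial_sums)                               # aggregate: g(w) = ghat_1 + ghat_2
```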
Communication
Two Ways of Communication

• Shared memory: the processors within one node communicate through a common memory.
• Message passing: multiple nodes, each with its own processors and memory, communicate by passing messages over the network.
Client-Server Architecture

[Figure: a central server is connected to the worker nodes; each worker node has its own processors and memory.]
Peer-to-Peer Architecture

[Figure: there is no central server; the nodes, each with its own processors and memory, communicate directly with one another.]
Synchronous Parallel Gradient Descent Using MapReduce
MapReduce
• MapReduce is a programming model and software system developed by Google [1].
• Characteristics: client-server architecture, message-passing communication, and bulk synchronous parallelism.
• Apache Hadoop [2] is an open-source implementation of MapReduce.
• Apache Spark [3] is an improved open-source MapReduce system.

References
1. Dean and Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008.
2. https://hadoop.apache.org/
3. Zaharia et al. Apache Spark: a unified engine for big data processing. Communications of the ACM, 2016.
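As a rough analogy (not part of the slides), the programming model can be mimicked with plain Python's built-in map and reduce: a map function is applied to every record independently, and a reduce function aggregates the mapped results.

```python
# A toy, single-machine analogue of the MapReduce programming model.
from functools import reduce

records = [1.0, 2.0, 3.0, 4.0]                 # made-up input records
mapped = map(lambda r: r * r, records)         # Map: transform each record independently
total = reduce(lambda a, b: a + b, mapped)     # Reduce: aggregate the mapped results
print(total)                                   # 30.0
```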
MapReduce

[Figure: a server (the driver) is connected to several workers. One iteration has three phases: the server broadcasts to the workers, each worker runs the map function on its local data, and the map outputs are aggregated with a reduce operation.]
Data Parallelism
• Partition the data among the worker nodes. (Each node holds a subset of the data.)

[Figure: the training inputs 𝐱_i and targets y_i are split into m blocks; Node 1 through Node m each store one block.]
Parallel Gradient Descent Using MapReduce

• Broadcast: The server broadcasts the up-to-date parameters 𝐰_t to the workers.
• Map: The workers do the computation locally.
  • Map (𝐱_i, y_i, 𝐰_t) to 𝐠_i = (𝐱_i^T 𝐰_t − y_i) 𝐱_i.
  • This yields n vectors: 𝐠_1, 𝐠_2, 𝐠_3, ⋯, 𝐠_n.
• Reduce: Compute the sum 𝐠 = ∑_{i=1}^n 𝐠_i.
  • Each worker sums the 𝐠_i stored in its local memory to get one vector.
  • Then the server sums the resulting m vectors. (There are m workers.)
• The server updates the parameters: 𝐰_{t+1} = 𝐰_t − α ⋅ 𝐠.
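As a rough illustration (not the slides' code), one such gradient-descent step maps naturally onto Spark's RDD API. The sketch below assumes PySpark and NumPy are available and that the data are toy (𝐱_i, y_i) pairs; Spark's reduce combines values within each partition before merging the results at the driver, which matches the "workers sum locally, then the server sums m vectors" description.

```python
# One synchronous gradient-descent step per iteration, sketched with PySpark (illustrative).
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="parallel-gd-sketch")

n, d = 10_000, 20
data = [(np.random.randn(d), np.random.randn()) for _ in range(n)]  # toy (x_i, y_i) pairs
rdd = sc.parallelize(data, numSlices=8).cache()                     # partition across workers

w, alpha = np.zeros(d), 1e-4
for t in range(50):
    w_bc = sc.broadcast(w)                                           # Broadcast w_t
    g = (rdd.map(lambda xy: (xy[0] @ w_bc.value - xy[1]) * xy[0])    # Map: g_i = (x_i^T w - y_i) x_i
            .reduce(lambda a, b: a + b))                             # Reduce: g = sum_i g_i
    w = w - alpha * g                                                # Server update: w_{t+1}
```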
Parallel Gradient Descent Using MapReduce

[Figure: the server broadcasts 𝐰_t to the m workers; worker j computes its local sum ĝ_j; the reduce step aggregates 𝐠 = ∑_{j=1}^m ĝ_j; the server then updates 𝐰_{t+1} = 𝐰_t − α ⋅ 𝐠.]

• Every worker stores 1/m of the data.
• Every worker does 1/m of the computation.
• Is the runtime reduced to 1/m?
• No, because communication and synchronization must also be taken into account.
Speedup Ratio

speedup ratio = (wall-clock time using one node) / (wall-clock time using m nodes)

[Figure: speedup ratio plotted against the number of nodes m; the ideal curve is the straight line speedup = m, and the actual curve lies below it.]
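A back-of-the-envelope model shows why the actual curve falls below the ideal line. The numbers below are made up and assume a fixed per-iteration communication overhead that does not shrink as m grows.

```python
# Hypothetical per-iteration times illustrating speedup ratio < m.
t_one_node = 100.0                       # seconds per iteration on one node (made-up)
t_comm = 5.0                             # communication/synchronization overhead (made-up)
for m in (2, 4, 8, 16):
    t_m = t_one_node / m + t_comm        # computation shrinks with m, overhead does not
    print(m, round(t_one_node / t_m, 2)) # e.g., m=16 gives speedup ~8.89, not 16
```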
Communication Cost

• Communication complexity: how many words are transmitted between the server and the workers.
  • Proportional to the number of parameters.
  • Grows with the number of worker nodes.
• Latency: how much time it takes for a packet of data to get from one point to another. (Determined by the computer network.)
• Communication time: #bytes / bandwidth + latency.
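Plugging made-up numbers into this formula gives a feel for the scale (the parameter count echoes the ResNet-50 example; the bandwidth and latency values are assumptions).

```python
# Plugging made-up numbers into (communication time = #bytes / bandwidth + latency).
num_params = 25_000_000            # e.g., ResNet-50 scale
bytes_sent = num_params * 4        # 32-bit floats
bandwidth = 10e9 / 8               # assumed 10 Gbit/s network, in bytes per second
latency = 1e-3                     # assumed 1 ms per message
t_comm = bytes_sent / bandwidth + latency
print(f"{t_comm:.3f} s per transfer")   # ~0.081 s in this made-up setting
```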
Bulk Synchronous

[Figure: each iteration consists of a broadcast phase, a computation phase, and an aggregation (reduce) phase; all workers synchronize before the next iteration begins.]
Synchronization Cost

Question: What if a node fails and then restarts?

• This node will be much slower than all the others.
• Such a slow node is called a straggler.
• Straggler effect:
  • The wall-clock time is determined by the slowest node.
  • It is a consequence of synchronization.
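A tiny simulation (with made-up timings) shows the straggler effect: under bulk synchronization, the step time is the maximum over the workers' times, so a single slow node dominates.

```python
# The straggler effect: with synchronization, a step takes as long as the slowest worker.
import random

worker_times = [random.uniform(1.0, 1.2) for _ in range(7)]  # 7 normal workers (made-up)
worker_times.append(5.0)                                     # 1 straggler that restarted
step_time = max(worker_times)     # bulk-synchronous: everyone waits for the slowest node
print(step_time)                  # ~5.0 s, dominated by the straggler
```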
Recap

• Gradient descent can be implemented using MapReduce.


• Data parallelism: Data are partitioned among the workers.
• One gradient descent step requires a broadcast, a map, and a
reduce.
• Cost: computation, communication, and synchronization.
• Using 𝑚 workers, the speedup ratio is lower than 𝑚.
