
Parallel Computing for Machine Learning

(Part 1)

Shusen Wang
Why parallel computing for ML?

• Deep learning models are big: ResNet-50 has 25M parameters.
• Big models are trained on big data, e.g., ImageNet has 14M images.
• Big model + big data ⇒ big computation cost.
• Example: Training ResNet-50 on ImageNet (90 epochs) on a single NVIDIA M40 GPU takes 14 days.
• Parallel computing: using multiple processors to make the computation faster (in terms of wall-clock time).
Toy Example: Least Squares Regression
Linear Predictor
• Inputs: 𝐱 ∈ ℝ^d (e.g., the features of a house).
• Prediction: f(𝐱) = 𝐱^T 𝐰 (e.g., the housing price).

• f(𝐱) = w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d
  • w_1, w_2, ⋯, w_d: weights
  • x_1: # of bedrooms
  • x_2: # of bathrooms
  • x_3: square feet
  • x_4: age of house
  • …
[Figure: the features of a house, 𝐱 ∈ ℝ^d, are fed into the predictor f(𝐱) = 𝐱^T 𝐰, which outputs a predicted price, e.g., $0.5M.]
Least Squares Regression
• Inputs: 𝐱 ∈ ℝ^d (e.g., the features of a house).
• Prediction: f(𝐱) = 𝐱^T 𝐰 (e.g., the housing price).

Question: How to find 𝐰?

• Training inputs: 𝐱_1, ⋯, 𝐱_n ∈ ℝ^d (n houses in total).
• Training targets: y_1, ⋯, y_n ∈ ℝ.
• Loss function: L(𝐰) = (1/2) ∑_{i=1}^n (𝐱_i^T 𝐰 − y_i)^2.
• Least squares regression: 𝐰^⋆ = argmin_𝐰 L(𝐰).
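To make the toy problem concrete, here is a minimal NumPy sketch (not from the slides) of the loss L(𝐰) and a direct solver for the argmin; the data are random placeholders.

```python
# Minimal NumPy sketch of the least squares objective and its minimizer (illustrative).
import numpy as np

def loss(X, y, w):
    """L(w) = 1/2 * sum_i (x_i^T w - y_i)^2, where X has shape (n, d)."""
    residual = X @ w - y
    return 0.5 * np.sum(residual ** 2)

# Toy data: n = 1000 "houses", d = 4 features (random placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)

# w_star = argmin_w L(w); for a problem this small, a direct solver suffices.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(loss(X, y, w_star))
```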
Parallel Gradient Descent for Least Squares
Parallel Gradient Descent
Example: GD for the least squares regression model
• Loss function: L(𝐰) = (1/2) ∑_{i=1}^n (𝐱_i^T 𝐰 − y_i)^2.
• Gradient: g(𝐰) = ∑_{i=1}^n g_i(𝐰), where g_i(𝐰) = (𝐱_i^T 𝐰 − y_i) 𝐱_i.
  • Derivation: g(𝐰) = ∂L(𝐰)/∂𝐰 = ∑_{i=1}^n ∂[(1/2)(𝐱_i^T 𝐰 − y_i)^2]/∂𝐰 = ∑_{i=1}^n (𝐱_i^T 𝐰 − y_i) 𝐱_i.
• Gradient descent: 𝐰_{t+1} = 𝐰_t − α ⋅ g(𝐰_t).

• The bottleneck of GD is computing the gradient.
• It is expensive if #samples and #parameters are both big.
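A single-machine gradient descent loop for this loss might look like the sketch below (illustrative, assuming NumPy; the step size and iteration count are arbitrary). The X.T @ (X @ w - y) line is the O(nd) gradient computation that dominates the cost.

```python
# Plain (single-machine) gradient descent for least squares (illustrative sketch).
import numpy as np

def gradient(X, y, w):
    """g(w) = sum_i (x_i^T w - y_i) x_i, computed as one matrix product."""
    return X.T @ (X @ w - y)

def gradient_descent(X, y, alpha=1e-4, num_steps=100):
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        w = w - alpha * gradient(X, y, w)   # computing g(w_t) costs O(n*d) per step
    return w
```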
Parallel Gradient Descent
Example: GD for the least squares regression model
• Gradient: g(𝐰) = ∑_{i=1}^n g_i(𝐰), where g_i(𝐰) = (𝐱_i^T 𝐰 − y_i) 𝐱_i.

Parallel GD using 2 processors
• Split the sum: g(𝐰) = [g_1(𝐰) + g_2(𝐰) + ⋯ + g_{n/2}(𝐰)] + [g_{n/2+1}(𝐰) + ⋯ + g_{n−1}(𝐰) + g_n(𝐰)].
• Processor 1 computes the first half, ĝ_1 = ∑_{i=1}^{n/2} g_i(𝐰); Processor 2 computes the second half, ĝ_2 = ∑_{i=n/2+1}^{n} g_i(𝐰).
• Aggregate: g(𝐰) = ĝ_1 + ĝ_2.
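One way to realize this two-processor scheme on a single machine is to split the rows between two worker processes and sum their partial gradients. The sketch below uses Python's multiprocessing and is purely illustrative (function names are made up).

```python
# Splitting the gradient sum between two worker processes (illustrative sketch).
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    X_part, y_part, w = args
    # ghat_j: the sum of g_i(w) over this processor's share of the samples
    return X_part.T @ (X_part @ w - y_part)

def parallel_gradient(X, y, w, num_workers=2):
    X_parts = np.array_split(X, num_workers)
    y_parts = np.array_split(y, num_workers)
    tasks = [(Xp, yp, w) for Xp, yp in zip(X_parts, y_parts)]
    # On platforms that spawn processes, call this under `if __name__ == "__main__":`.
    with Pool(num_workers) as pool:
        partial_sums = pool.map(partial_gradient, tasks)   # ghat_1, ghat_2
    return sum(partial_sums)                               # aggregate: g(w) = ghat_1 + ghat_2
```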
Communication
Two Ways of Communication

• Shared memory: the processors within one node communicate through a common memory.
• Message passing: multiple nodes, each with its own processors and memory, communicate by passing messages over the network.
Client-Server Architecture

[Figure: a central server is connected to the worker nodes; each worker node has its own processors and memory.]
Peer-to-Peer Architecture

[Figure: there is no central server; the nodes, each with its own processors and memory, communicate directly with one another.]
Synchronous Parallel Gradient Descent Using MapReduce
MapReduce
• MapReduce is a programming model and software system developed by Google [1].
• Characteristics: client-server architecture, message-passing communication, and bulk synchronous parallelism.
• Apache Hadoop [2] is an open-source implementation of MapReduce.
• Apache Spark [3] is an improved open-source MapReduce system.

References
1. Dean and Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008.
2. https://hadoop.apache.org/
3. Zaharia et al. Apache Spark: a unified engine for big data processing. Communications of the ACM, 2016.
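As a rough analogy (not part of the slides), the programming model can be mimicked with plain Python's built-in map and reduce: a map function is applied to every record independently, and a reduce function aggregates the mapped results.

```python
# A toy, single-machine analogue of the MapReduce programming model.
from functools import reduce

records = [1.0, 2.0, 3.0, 4.0]                 # made-up input records
mapped = map(lambda r: r * r, records)         # Map: transform each record independently
total = reduce(lambda a, b: a + b, mapped)     # Reduce: aggregate the mapped results
print(total)                                   # 30.0
```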
MapReduce

[Figure: a server (the driver) is connected to several workers. One iteration has three phases: the server broadcasts to the workers, each worker runs the map function on its local data, and the map outputs are aggregated with a reduce operation.]
Data Parallelism
• Partition the data among the worker nodes. (Each node holds a subset of the data.)

[Figure: the training inputs 𝐱_i and targets y_i are split into m blocks; Node 1 through Node m each store one block.]
Parallel Gradient Descent Using MapReduce

• Broadcast: The server broadcasts the up-to-date parameters 𝐰_t to the workers.
• Map: The workers do the computation locally.
  • Map (𝐱_i, y_i, 𝐰_t) to 𝐠_i = (𝐱_i^T 𝐰_t − y_i) 𝐱_i.
  • This yields n vectors: 𝐠_1, 𝐠_2, 𝐠_3, ⋯, 𝐠_n.
• Reduce: Compute the sum 𝐠 = ∑_{i=1}^n 𝐠_i.
  • Each worker sums the 𝐠_i stored in its local memory to get one vector.
  • Then the server sums the resulting m vectors. (There are m workers.)
• The server updates the parameters: 𝐰_{t+1} = 𝐰_t − α ⋅ 𝐠.
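As a rough illustration (not the slides' code), one such gradient-descent step maps naturally onto Spark's RDD API. The sketch below assumes PySpark and NumPy are available and that the data are toy (𝐱_i, y_i) pairs; Spark's reduce combines values within each partition before merging the results at the driver, which matches the "workers sum locally, then the server sums m vectors" description.

```python
# One synchronous gradient-descent step per iteration, sketched with PySpark (illustrative).
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="parallel-gd-sketch")

n, d = 10_000, 20
data = [(np.random.randn(d), np.random.randn()) for _ in range(n)]  # toy (x_i, y_i) pairs
rdd = sc.parallelize(data, numSlices=8).cache()                     # partition across workers

w, alpha = np.zeros(d), 1e-4
for t in range(50):
    w_bc = sc.broadcast(w)                                           # Broadcast w_t
    g = (rdd.map(lambda xy: (xy[0] @ w_bc.value - xy[1]) * xy[0])    # Map: g_i = (x_i^T w - y_i) x_i
            .reduce(lambda a, b: a + b))                             # Reduce: g = sum_i g_i
    w = w - alpha * g                                                # Server update: w_{t+1}
```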
Parallel Gradient Descent Using MapReduce

[Figure: the server broadcasts 𝐰_t to the m workers; worker j computes its local sum ĝ_j; the reduce step aggregates 𝐠 = ∑_{j=1}^m ĝ_j; the server then updates 𝐰_{t+1} = 𝐰_t − α ⋅ 𝐠.]

• Every worker stores 1/m of the data.
• Every worker does 1/m of the computation.
• Is the runtime reduced to 1/m?
• No, because communication and synchronization must also be taken into account.
Speedup Ratio

speedup ratio = (wall-clock time using one node) / (wall-clock time using m nodes)

[Figure: speedup ratio plotted against the number of nodes m; the ideal curve is the straight line speedup = m, and the actual curve lies below it.]
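A back-of-the-envelope model shows why the actual curve falls below the ideal line. The numbers below are made up and assume a fixed per-iteration communication overhead that does not shrink as m grows.

```python
# Hypothetical per-iteration times illustrating speedup ratio < m.
t_one_node = 100.0                       # seconds per iteration on one node (made-up)
t_comm = 5.0                             # communication/synchronization overhead (made-up)
for m in (2, 4, 8, 16):
    t_m = t_one_node / m + t_comm        # computation shrinks with m, overhead does not
    print(m, round(t_one_node / t_m, 2)) # e.g., m=16 gives speedup ~8.89, not 16
```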
Communication Cost

• Communication complexity: how many words are transmitted between the server and the workers.
  • Proportional to the number of parameters.
  • Grows with the number of worker nodes.
• Latency: how much time it takes for a packet of data to get from one point to another. (Determined by the computer network.)
• Communication time: #bytes / bandwidth + latency.
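Plugging made-up numbers into this formula gives a feel for the scale (the parameter count echoes the ResNet-50 example; the bandwidth and latency values are assumptions).

```python
# Plugging made-up numbers into (communication time = #bytes / bandwidth + latency).
num_params = 25_000_000            # e.g., ResNet-50 scale
bytes_sent = num_params * 4        # 32-bit floats
bandwidth = 10e9 / 8               # assumed 10 Gbit/s network, in bytes per second
latency = 1e-3                     # assumed 1 ms per message
t_comm = bytes_sent / bandwidth + latency
print(f"{t_comm:.3f} s per transfer")   # ~0.081 s in this made-up setting
```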
Bulk Synchronous

[Figure: each iteration consists of a broadcast phase, a computation phase, and an aggregation (reduce) phase; all workers synchronize before the next iteration begins.]
Synchronization Cost

Question: What if a node fails and then restarts?

• This node will be much slower than all the others.
• Such a slow node is called a straggler.
• Straggler effect:
  • The wall-clock time is determined by the slowest node.
  • It is a consequence of synchronization.
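A tiny simulation (with made-up timings) shows the straggler effect: under bulk synchronization, the step time is the maximum over the workers' times, so a single slow node dominates.

```python
# The straggler effect: with synchronization, a step takes as long as the slowest worker.
import random

worker_times = [random.uniform(1.0, 1.2) for _ in range(7)]  # 7 normal workers (made-up)
worker_times.append(5.0)                                     # 1 straggler that restarted
step_time = max(worker_times)     # bulk-synchronous: everyone waits for the slowest node
print(step_time)                  # ~5.0 s, dominated by the straggler
```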
Recap

• Gradient descent can be implemented using MapReduce.


• Data parallelism: Data are partitioned among the workers.
• One gradient descent step requires a broadcast, a map, and a
reduce.
• Cost: computation, communication, and synchronization.
• Using 𝑚 workers, the speedup ratio is lower than 𝑚.
