Parallel Computing for Machine Learning
(Part 1)
Shusen Wang
Why parallel computing for ML?
• $f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$
• $w_1, w_2, \cdots, w_d$: weights
• $x_1$: # of bedrooms
• $x_2$: # of bathrooms
• $x_3$: square feet
• $x_4$: age of house
• …
Linear Predictor
• Inputs: $\mathbf{x} \in \mathbb{R}^d$ (e.g., features of a house).
• Prediction: $f(\mathbf{x}) = \mathbf{x}^T \mathbf{w}$ (e.g., housing price).
[Figure: features of a house $\mathbf{x} \in \mathbb{R}^d$ → prediction $f(\mathbf{x}) = \mathbf{x}^T \mathbf{w}$ → Price = $0.5M]
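As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the linear predictor; the feature values and weights below are made up.

import numpy as np

# Features of one house: bedrooms, bathrooms, square feet, age of house.
x = np.array([3.0, 2.0, 1500.0, 20.0])      # illustrative numbers
w = np.array([0.05, 0.03, 0.0003, -0.002])  # hypothetical learned weights

# Linear prediction f(x) = x^T w, e.g., a housing price in $M.
price = x @ w
print(price)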
Least Squares Regression
• Inputs: $\mathbf{x} \in \mathbb{R}^d$ (e.g., features of a house).
• Prediction: $f(\mathbf{x}) = \mathbf{x}^T \mathbf{w}$ (e.g., housing price).
• Training inputs: $\mathbf{x}_1, \cdots, \mathbf{x}_n \in \mathbb{R}^d$ ($n$ houses in total).
• Training targets: $y_1, \cdots, y_n \in \mathbb{R}$.
• Loss function: $L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{n} \left( \mathbf{x}_i^T \mathbf{w} - y_i \right)^2$.
• Least squares regression: $\mathbf{w}^\star = \operatorname{argmin}_{\mathbf{w}} L(\mathbf{w})$.
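A minimal NumPy sketch of this loss on synthetic data (the data is randomly generated, just to show the computation):

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 4                 # n training houses, d features (synthetic sizes)
X = rng.normal(size=(n, d))   # rows are the training inputs x_1, ..., x_n
y = rng.normal(size=n)        # training targets y_1, ..., y_n

def loss(w):
    # Least squares loss L(w) = 1/2 * sum_i (x_i^T w - y_i)^2
    residuals = X @ w - y
    return 0.5 * np.sum(residuals ** 2)

print(loss(np.zeros(d)))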
Parallel Gradient Descent for Least Squares
Parallel Gradient Descent
Example: GD for the least squares regression model
• Loss function: $L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{n} \left( \mathbf{x}_i^T \mathbf{w} - y_i \right)^2$.
• Gradient: $g(\mathbf{w}) = \sum_{i=1}^{n} g_i(\mathbf{w})$, where $g_i(\mathbf{w}) = \left( \mathbf{x}_i^T \mathbf{w} - y_i \right) \mathbf{x}_i$.
• Derivation: $g(\mathbf{w}) = \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \sum_{i=1}^{n} \frac{\partial \frac{1}{2} \left( \mathbf{x}_i^T \mathbf{w} - y_i \right)^2}{\partial \mathbf{w}} = \sum_{i=1}^{n} \left( \mathbf{x}_i^T \mathbf{w} - y_i \right) \mathbf{x}_i$.
• Gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot g(\mathbf{w}_t)$.
• Partition the training samples between two processors: Processor 1 and Processor 2 each sum the $g_i(\mathbf{w})$ of their own samples, yielding the partial gradients $\tilde{\mathbf{g}}_1$ and $\tilde{\mathbf{g}}_2$.
• Aggregate: $g(\mathbf{w}) = \tilde{\mathbf{g}}_1 + \tilde{\mathbf{g}}_2$.
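A minimal NumPy sketch of this two-processor decomposition (the two "processors" are simply two slices of synthetic data, to show that the partial gradients sum to the full gradient):

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.zeros(d)
alpha = 0.01   # learning rate (illustrative value)

def partial_gradient(X_part, y_part, w):
    # Sum of g_i(w) = (x_i^T w - y_i) x_i over one partition of the samples.
    return X_part.T @ (X_part @ w - y_part)

# Split the samples between two "processors".
X1, X2 = X[: n // 2], X[n // 2:]
y1, y2 = y[: n // 2], y[n // 2:]

g1 = partial_gradient(X1, y1, w)   # computed by processor 1
g2 = partial_gradient(X2, y2, w)   # computed by processor 2
g = g1 + g2                        # aggregate: equal to the full gradient
w = w - alpha * g                  # one GD step: w_{t+1} = w_t - alpha * g(w_t)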
Communication
Two Ways of Communication
[Figure: left — shared memory: the processors within one node read and write the same memory; right — message passing: Node 1 and Node 2 each have their own processors and memory, and communicate by passing messages between nodes.]
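A minimal Python sketch contrasting the two patterns within a single machine, using the standard multiprocessing module (a shared Value for shared memory, a Queue for message passing); this is only an illustration and is not part of the slides.

from multiprocessing import Process, Queue, Value

def shared_memory_worker(counter, x):
    # Shared memory: the worker adds directly into memory the parent process also sees.
    with counter.get_lock():
        counter.value += x

def message_passing_worker(queue, x):
    # Message passing: the worker sends its partial result as a message instead.
    queue.put(x)

if __name__ == "__main__":
    counter = Value("d", 0.0)   # shared-memory accumulator
    queue = Queue()             # message channel

    procs = [Process(target=shared_memory_worker, args=(counter, 1.0)) for _ in range(4)]
    procs += [Process(target=message_passing_worker, args=(queue, 1.0)) for _ in range(4)]
    for p in procs:
        p.start()

    total_from_messages = sum(queue.get() for _ in range(4))  # receive the 4 messages
    for p in procs:
        p.join()

    print(counter.value, total_from_messages)  # 4.0 4.0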
Client-Server Architecture
[Figure: one server node and several worker nodes, each with its own processors and memory; the worker nodes communicate with the server.]
Synchronous Parallel Gradient Descent
Using MapReduce
MapReduce
• MapReduce is a programming model and software system developed by Google [1].
• Characteristics: client-server architecture, message-passing communication, and bulk synchronous parallel execution.
• Apache Hadoop [2] is an open-source implementation of MapReduce.
• Apache Spark [3] is an improved open-source MapReduce.
References
1. Dean and Ghemawat: MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008.
2. https://hadoop.apache.org/
3. Zaharia et al.: Apache Spark: a unified engine for big data processing. Communications of the ACM, 2016.
MapReduce
[Figure: a driver (server) connected to several worker nodes. Broadcast: the driver sends data to every worker. Map: every worker applies the map function to its local data. Reduce: the workers' map outputs are aggregated at the driver.]
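As a minimal illustration of the broadcast → map → reduce pattern (plain Python, no Hadoop or Spark; the partitions and functions are made up for illustration):

from functools import reduce

# "Broadcast": a value every worker needs.
w = 3

# The data, already partitioned across (conceptual) workers.
partitions = [[1, 2], [3, 4], [5, 6]]

def map_fn(partition):
    # Map: each worker applies the same function to its own partition.
    return sum(w * x for x in partition)

mapped = [map_fn(p) for p in partitions]   # one result per worker

# Reduce: aggregate the per-worker results into one value.
total = reduce(lambda a, b: a + b, mapped)
print(total)   # 3 * (1+2+3+4+5+6) = 63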
Data Parallelism
• Partition the data among the worker nodes. (Each node holds a subset of the data.)
[Figure: the training samples, inputs $\mathbf{x}_i$ and targets $y_i$, split into partitions, one partition per worker node.]
Parallel Gradient Descent Using MapReduce
[Figure: broadcast — the server sends $\mathbf{w}_t$ to Worker 1, Worker 2, $\cdots$, Worker $m$; map — every worker computes a partial gradient $\tilde{\mathbf{g}}_i$ from its own partition of the data; reduce — the server aggregates $g = \sum_{i=1}^{m} \tilde{\mathbf{g}}_i$ and updates $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot g$.]
• Every worker stores $\frac{1}{m}$ of the data.
• Every worker does $\frac{1}{m}$ of the computation.
• Is the runtime reduced to $\frac{1}{m}$?
• No, because communication and synchronization must also be taken into account.
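A minimal NumPy sketch of this loop, with the $m$ workers simulated by $m$ data partitions (synthetic data; a real system would run each map step on a separate node and pay communication cost for the broadcast and reduce):

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 4, 4          # samples, features, number of workers (synthetic sizes)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Data parallelism: each worker holds 1/m of the samples.
X_parts = np.array_split(X, m)
y_parts = np.array_split(y, m)

def map_step(X_i, y_i, w):
    # Worker i computes its partial gradient over its own partition.
    return X_i.T @ (X_i @ w - y_i)

w, alpha = np.zeros(d), 1e-3
for t in range(100):
    # Broadcast w_t; map on every partition; reduce by summing the partial gradients.
    partial_grads = [map_step(X_i, y_i, w) for X_i, y_i in zip(X_parts, y_parts)]
    g = np.sum(partial_grads, axis=0)
    w = w - alpha * g             # w_{t+1} = w_t - alpha * g

print(w)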
Speedup Ratio
$\text{speedup ratio} = \frac{\text{wall-clock time using one node}}{\text{wall-clock time using } m \text{ nodes}}$
[Figure: speedup ratio versus the number of nodes $m$; the ideal speedup is $m$, while the actual speedup curve falls below the ideal.]
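As a made-up numerical illustration: if one node needs 100 seconds per epoch and $m = 10$ nodes need 20 seconds, the speedup ratio is $100 / 20 = 5$, well below the ideal value of $m = 10$; the gap is caused by communication and synchronization overhead.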
Communication Cost