
Parallel Computing in TensorFlow

Shusen Wang
TensorFlow Strategies

• MirroredStrategy
• TPUStrategy
• MultiWorkerMirroredStrategy
• CentralStorageStrategy
• ParameterServerStrategy
• OneDeviceStrategy
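All of these strategies are provided by the tf.distribute module. As a minimal sketch of how the single-machine ones are constructed (exact namespaces vary slightly across TensorFlow 2.x versions; several of the multi-machine strategies live under tf.distribute.experimental):

import tensorflow as tf

# Synchronous data parallelism across all visible GPUs of one machine.
mirrored = tf.distribute.MirroredStrategy()

# Run everything on one named device (handy for debugging).
one_device = tf.distribute.OneDeviceStrategy(device="/gpu:0")

# Multi-machine strategies additionally need a cluster description
# (e.g., the TF_CONFIG environment variable) before they can be created.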
Parallel Training of a CNN on MNIST
Find Devices
In [5]:
from tensorflow.python.client import device_lib

device_lib.list_local_devices()
Out[5]:
For example, my server has 1 CPU and 4 GPUs:
[name: "/device:CPU:0"
• device_type:
/device:CPU:0"CPU"
• memory_limit:
/device:GPU:0 268435456
• locality {
/device:GPU:1
}
• /device:GPU:2
incarnation: 13474655953084031604, name: "/device:XLA_CP
• device_type:
/device:GPU:3"XLA_CPU"
memory_limit: 17179869184
locality {
number of test samples: 10000
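device_lib is a private module; in TensorFlow ≥ 2.1 the same information is available through the public tf.config API. A sketch:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')  # e.g. 4 entries on this server
cpus = tf.config.list_physical_devices('CPU')
print('GPUs:', len(gpus), 'CPUs:', len(cpus))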
Mirrored Strategy
In [4]:
from tensorflow import distribute

strategy = distribute.MirroredStrategy()

In [5]:
# number of replicas (one per device) kept in sync
m = strategy.num_replicas_in_sync

print('Number of devices: {}'.format(m))

Number of devices: 4
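By default, MirroredStrategy mirrors the model on every visible GPU. To use only a subset of them, the device names can be passed explicitly; a sketch (strategy_2gpu is an illustrative name):

from tensorflow import distribute

# Restrict the strategy to the first two GPUs only.
strategy_2gpu = distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
print('Number of devices: {}'.format(strategy_2gpu.num_replicas_in_sync))  # 2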
Load and Process MNIST Data
In [1]:

import tensorflow as tf

def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

In [2]:

import tensorflow_datasets as tfds

datasets, info = tfds.load(name='mnist',
                           with_info=True,
                           as_supervised=True)
mnist_train = datasets['train'].map(scale).cache()
mnist_test = datasets['test'].map(scale)
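The info object returned by tfds.load records the split sizes, which gives a quick way to confirm how much data each split holds. A sketch:

print('number of training samples:', info.splits['train'].num_examples)  # 60000
print('number of test samples:', info.splits['test'].num_examples)       # 10000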
Load and Process MNIST Data


In [6]:
BUFFER_SIZE = 10000
BATCH_SIZE_PER_REPLICA = 128
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * m

data_train = mnist_train.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
data_test = mnist_test.batch(BATCH_SIZE)

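With model.fit, Keras splits every global batch of size BATCH_SIZE across the replicas automatically, so each GPU processes BATCH_SIZE_PER_REPLICA examples per step. If you iterate over the data yourself, the strategy can do the sharding explicitly; a sketch:

# Each element of dist_train is a per-replica value: one shard of the
# global batch for every GPU in the strategy.
dist_train = strategy.experimental_distribute_dataset(data_train)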

Build Neural Network

In [7]:

from tensorflow import keras


from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

with strategy.scope():
    model = keras.Sequential()
    model.add(Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)))
    model.add(MaxPooling2D())
    model.add(Conv2D(64, 3, activation='relu'))
    model.add(MaxPooling2D())
    model.add(Flatten())
    model.add(Dense(64, activation='relu'))
    model.add(Dense(10, activation='softmax'))

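Because the layers are created inside strategy.scope(), every weight is mirrored: one copy per GPU, kept identical by all-reducing the gradients. A quick check (sketch; the exact class name is an implementation detail of tf.distribute):

first_kernel = model.weights[0]
print(type(first_kernel).__name__)  # typically 'MirroredVariable' under MirroredStrategy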
In [8]:
model.summary()
Build Neural Network
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 11, 11, 64) 18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 1600) 0
_________________________________________________________________
dense (Dense) (None, 64) 102464
_________________________________________________________________
dense_1 (Dense) (None, 10) 650
=================================================================
Total params: 121,930
Trainable params: 121,930
Non-trainable params: 0
_________________________________________________________________
Compile the Model

In [9]:
with strategy.scope():
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=keras.optimizers.RMSprop(learning_rate=1E-3),
                  metrics=['accuracy'])

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:loc

Train the Model


In [10]:
model.fit(data_train, epochs=10)
Epoch 1/10
118/118 [==============================] - 18s 151ms/step - loss: 0.4800 - accuracy: 0.8558
Epoch 2/10
118/118 [==============================] - 1s 9ms/step - loss: 0.1229 - accuracy: 0.9632
Epoch 3/10
118/118 [==============================] - 1s 9ms/step - loss: 0.0772 - accuracy: 0.9764
Epoch 4/10
118/118 [==============================] - 1s 9ms/step - loss: 0.0583 - accuracy: 0.9821
Epoch 5/10
118/118 [==============================] - 1s 9ms/step - loss: 0.0448 - accuracy: 0.9859
Epoch 6/10
118/118 [==============================] - 1s 9ms/step - loss: 0.0364 - accuracy: 0.9888
Epoch 7/10
118/118 [==============================] - 1s 9ms/step - loss: 0.0311 - accuracy: 0.9905
Epoch 8/10
118/118 [==============================] - 1s 9ms/step - loss: 0.0258 - accuracy: 0.9919
Epoch 9/10
118/118 [==============================] - 1s 9ms/step - loss: 0.0220 - accuracy: 0.9931
Epoch 10/10
118/118 [==============================] - 1s 9ms/step - loss: 0.0187 - accuracy: 0.9940

Out[10]:
<tensorflow.python.keras.callbacks.History at 0x7fec1e322828>

Evaluate the Model on Test Set
In [11]:

eval_loss, eval_acc = model.evaluate(data_test)

20/20 [==============================] - 2s 92ms/step - loss: 0.0255 - accuracy: 0.9896

In [12]:

with strategy.scope():
    eval_loss, eval_acc = model.evaluate(data_test)

20/20 [==============================] - 2s 92ms/step - loss: 0.0255 - accuracy: 0.9896

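Under the hood, MirroredStrategy runs one replica of the model per GPU and combines their gradients with an all-reduce, which is the subject of the rest of this lecture. For reference, here is a sketch of an explicit training step under the same strategy, following the standard custom-training-loop pattern (strategy.run requires TensorFlow ≥ 2.2; it reuses model, strategy, data_train, and BATCH_SIZE from above):

loss_fn = keras.losses.SparseCategoricalCrossentropy(
    reduction=keras.losses.Reduction.SUM)

# Let the strategy shard each global batch across the replicas.
dist_train = strategy.experimental_distribute_dataset(data_train)

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        images, labels = inputs
        with tf.GradientTape() as tape:
            probs = model(images, training=True)
            # Divide by the global batch size so the summed per-replica
            # losses match the usual mean loss.
            loss = loss_fn(labels, probs) / BATCH_SIZE
        grads = tape.gradient(loss, model.trainable_variables)
        # apply_gradients sums (all-reduces) the gradients across replicas.
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    # strategy.run executes step_fn once on every replica.
    per_replica_losses = strategy.run(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for batch in dist_train:
    train_step(batch)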
Ring All-Reduce
Reduce vs. All-Reduce

• Reduce: the server gets the result of the reduce operation (e.g., sum, mean, count).
• All-Reduce: every node gets a copy of the result of the reduce operation.
• All-reduce can be implemented in several ways, e.g.:
• all-reduce via reduce + broadcast;
• all-reduce via all-to-all communication;
• ring all-reduce (this lecture).

[Figure: workers 1–4 hold local outputs 7, 3, 4, 1. A reduce (sum) leaves only the server with 7 + 3 + 4 + 1 = 15; an all-reduce (sum) leaves every worker holding 15.]
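A tiny pure-Python simulation of the two primitives, using the hypothetical worker outputs 7, 3, 4, 1 from the figure:

worker_outputs = [7, 3, 4, 1]           # local result on each of the 4 workers

def reduce_sum(outputs):
    # Reduce: only the server ends up with the combined result.
    return sum(outputs)

def all_reduce_sum(outputs):
    # All-reduce via reduce + broadcast: every worker gets a copy of the result.
    total = sum(outputs)                 # reduce on the server
    return [total] * len(outputs)        # broadcast back to all workers

print(reduce_sum(worker_outputs))        # 15, known only to the server
print(all_reduce_sum(worker_outputs))    # [15, 15, 15, 15]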
Ring All-Reduce

[Figure: four GPUs connected in a ring: GPU:0 → GPU:1 → GPU:2 → GPU:3 → GPU:0. GPU:i holds its locally computed gradient gᵢ, for i = 0, 1, 2, 3.]

Goal: compute g = g₀ + g₁ + g₂ + g₃ and send it to all the GPUs.
Ring All-Reduce (Naïve Approach)

• Reduce around the ring: GPU:0 sends g₀ to GPU:1; GPU:1 computes g₀ + g₁ and sends it to GPU:2; GPU:2 computes g₀ + g₁ + g₂ and sends it to GPU:3; GPU:3 adds g₃ to obtain g = g₀ + g₁ + g₂ + g₃.
• Broadcast around the ring: GPU:3 sends g to GPU:0, GPU:0 forwards it to GPU:1, and GPU:1 forwards it to GPU:2, so that every GPU ends up holding g.
• Update: every GPU applies w ← w − α · g.

[Figure sequence: the partial sums travel hop by hop around the ring GPU:0 → GPU:1 → GPU:2 → GPU:3; the full sum g then travels around the ring once more.]
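A short NumPy sketch of the naïve scheme (the arrays stand in for per-GPU gradient buffers; in reality every addition is a network transfer over a single link of the ring):

import numpy as np

def naive_ring_all_reduce(grads):
    m = len(grads)
    # Reduce phase: the partial sum travels GPU:0 -> GPU:1 -> ... -> GPU:m-1,
    # so only one link of the ring is busy at a time.
    total = grads[0].copy()
    for i in range(1, m):
        total = total + grads[i]
    # Broadcast phase: the full sum travels around the ring once more.
    return [total.copy() for _ in range(m)]

grads = [np.full(8, float(i)) for i in range(4)]   # hypothetical gradients
print(naive_ring_all_reduce(grads)[2])             # [6. 6. ...]: g0+g1+g2+g3 on every GPU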
What is wrong with the naïve approach?

• Most of the network links are idle: at any moment, only one link of the ring is transferring data.
• Communication time: m · d / b (ignoring latency), where
• m: number of GPUs.
• d: number of parameters.
• b: network bandwidth.
Ring All-Reduce (Efficient Approach)

• Partition each gradient into m blocks (here m = 4), one block per GPU:
  g₀ = [a₀; b₀; c₀; d₀],  g₁ = [a₁; b₁; c₁; d₁],  g₂ = [a₂; b₂; c₂; d₂],  g₃ = [a₃; b₃; c₃; d₃].
• Scatter-reduce: in each of m − 1 rounds, every GPU simultaneously sends one block to the next GPU in the ring and adds the block it receives to its own copy. After m − 1 rounds, each GPU holds the complete sum of one block (e.g., one GPU holds a₀ + a₁ + a₂ + a₃, another holds b₀ + b₁ + b₂ + b₃, and so on).
• All-gather: in another m − 1 rounds, each GPU passes the block it has completed around the ring, so that every GPU ends up with all four summed blocks, i.e., with the full gradient g.

[Figure sequence: the blocks aᵢ, bᵢ, cᵢ, dᵢ circulate around the ring; after the scatter-reduce phase each GPU owns one fully reduced block, and after the all-gather phase every GPU holds a₀+a₁+a₂+a₃, b₀+b₁+b₂+b₃, c₀+c₁+c₂+c₃, and d₀+d₁+d₂+d₃.]
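The same NumPy-style sketch for the efficient scheme; the two nested loops mimic the m − 1 scatter-reduce rounds and the m − 1 all-gather rounds, in each of which every GPU sends one block to its ring neighbor:

import numpy as np

def ring_all_reduce(grads):
    """Sketch of ring all-reduce: scatter-reduce, then all-gather."""
    m = len(grads)
    # chunks[i][j] is block j of the gradient held by GPU i.
    chunks = [np.array_split(g.astype(float), m) for g in grads]

    # Scatter-reduce: in round t, GPU i sends block (i - t) % m to GPU i+1,
    # which adds it to its own copy. All m links are busy in every round.
    for t in range(m - 1):
        for i in range(m):
            b = (i - t) % m
            dst = (i + 1) % m
            chunks[dst][b] = chunks[dst][b] + chunks[i][b]
    # Now GPU i holds the fully reduced block (i + 1) % m.

    # All-gather: pass the completed blocks around the ring so that
    # every GPU ends up with every fully reduced block.
    for t in range(m - 1):
        for i in range(m):
            b = (i + 1 - t) % m
            dst = (i + 1) % m
            chunks[dst][b] = chunks[i][b].copy()

    return [np.concatenate(c) for c in chunks]

grads = [np.arange(8.0) + 10 * i for i in range(4)]    # hypothetical gradients
out = ring_all_reduce(grads)
print(all(np.allclose(o, sum(grads)) for o in out))    # True: every GPU holds the full sum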
Comparisons

Naïve Algorithm:
• Most of the network links are idle.
• Communication time: m · d / b.

Efficient Algorithm (Ring All-Reduce):
• No idle network links.
• Communication time: d / b.

(m: number of GPUs; d: number of parameters; b: network bandwidth; latency ignored.)
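Plugging hypothetical numbers into the two formulas (m = 4 GPUs, d = 10⁸ parameters, b = 10⁹ parameters per second of effective bandwidth):

m, d, b = 4, 1e8, 1e9      # hypothetical: GPUs, parameters, bandwidth

naive_time = m * d / b     # 0.4 s: the gradient crosses one link at a time, m times in sequence
ring_time = d / b          # 0.1 s: all links busy, each carries about d parameters in total
print(naive_time, ring_time)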
Thank you!
