Shusen Wang
TensorFlow Strategies
• MirroredStrategy
• TPUStrategy
• MultiWorkerMirroredStrategy
• CentralStorageStrategy
• ParameterServerStrategy
• OneDeviceStrategy
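As a rough sketch of how these strategies are constructed (TF 2.x is assumed; some of them lived under tf.distribute.experimental in earlier releases, so treat the exact module paths as assumptions):

import tensorflow as tf

# one machine, multiple GPUs, synchronous all-reduce across replicas
mirrored = tf.distribute.MirroredStrategy()

# pin all computation to a single device (handy for debugging)
one_device = tf.distribute.OneDeviceStrategy(device='/gpu:0')

# several machines, each with one or more GPUs
# (tf.distribute.experimental.MultiWorkerMirroredStrategy in older TF 2.x)
multi_worker = tf.distribute.MultiWorkerMirroredStrategy()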
Parallel Training a CNN on MNIST
Find Devices
In [5]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
Out[5]:
For example, my server has 1 CPU and 4 GPUs:
[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 13474655953084031604,
 name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality {
 }
 ...,
 name: "/device:GPU:0" ...,
 name: "/device:GPU:1" ...,
 name: "/device:GPU:2" ...,
 name: "/device:GPU:3" ...]
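In TF 2.1 or later, tf.config offers a shorter check of the visible devices; a minimal alternative to device_lib (same information, simpler output):

import tensorflow as tf

# physical GPUs visible to TensorFlow
gpus = tf.config.list_physical_devices('GPU')
print('Number of GPUs:', len(gpus))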
Mirrored Strategy
In [4]:
from tensorflow import distribute
strategy = distribute.MirroredStrategy()
In [5]:
# number of processors
m = strategy.num_replicas_in_sync
print('Number of devices: {}'.format(m))
Number of devices: 4
Load and Process MNIST Data
In [1]:
import tensorflow as tf

def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label
In [2]:
import tensorflow_datasets as tfds

datasets, info = tfds.load(name='mnist',
                           with_info=True,
                           as_supervised=True)
mnist_train = datasets['train'].map(scale).cache()
mnist_test = datasets['test'].map(scale)
In [ ]:
# BUFFER_SIZE and BATCH_SIZE are not defined in this excerpt; typical
# values would be BUFFER_SIZE = 10000 and BATCH_SIZE = 64 * m.
data_train = mnist_train.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
data_test = mnist_test.batch(BATCH_SIZE)
In [7]:
from tensorflow import keras
Build Neural Network
In [7]:
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

with strategy.scope():
    model = keras.Sequential()
    model.add(Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)))
    model.add(MaxPooling2D())
    model.add(Conv2D(64, 3, activation='relu'))
    model.add(MaxPooling2D())
    model.add(Flatten())
    model.add(Dense(64, activation='relu'))
    model.add(Dense(10, activation='softmax'))
In [8]:
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 11, 11, 64) 18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 1600) 0
_________________________________________________________________
dense (Dense) (None, 64) 102464
_________________________________________________________________
dense_1 (Dense) (None, 10) 650
=================================================================
Total params: 121,930
Trainable params: 121,930
Non-trainable params: 0
_________________________________________________________________
Compile the Model
In [9]:
with strategy.scope():
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=keras.optimizers.RMSprop(learning_rate=1E-3),
                  metrics=['accuracy'])
Train and Evaluate the Model
In [10]:
# the fit cell was lost in extraction; a call like this produced the
# History object below (the epoch count is an assumption)
model.fit(data_train, epochs=10)
Out[10]:
<tensorflow.python.keras.callbacks.History at 0x7fec1e322828>
In [12]:
with strategy.scope():
    eval_loss, eval_acc = model.evaluate(data_test)
Ring All-Reduce
Reduce vs. All-Reduce
• Reduce: The server gets the result of reduce (e.g., sum, mean, count).
• All-Reduce: Every node gets a copy of the result of reduce.
• E.g., all-reduce via reduce + broadcast (sketched after this list).
• E.g., all-reduce via all-to-all communication.
• E.g., ring all-reduce (this lecture).
[Figure: Reduce (sum) vs. All-Reduce (sum). Four workers hold the values 7, 3, 4, and 1. With reduce, only the server obtains the sum 7 + 3 + 4 + 1 = 15; with all-reduce, every worker ends up holding 15.]
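As a concrete illustration of the first variant, here is a minimal pure-Python/NumPy sketch of all-reduce built from reduce + broadcast (a simulation for this lecture's example, not how any real framework implements it):

import numpy as np

# each worker holds one local value, as in the figure above
local_values = [np.float64(7), np.float64(3), np.float64(4), np.float64(1)]

# reduce: a designated server gathers and sums the values
total = sum(local_values)                  # server holds 15.0

# broadcast: the server sends the sum back to every worker
results = [total] * len(local_values)      # every worker holds 15.0
print(results)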
Ring All-Reduce
[Figure: four GPUs connected in a ring, GPU:0 → GPU:1 → GPU:2 → GPU:3 → GPU:0. Each GPU:$i$ holds its own local gradient $\mathbf{g}_i$.]
Ring All-Reduce (Naïve Approach)
[Figure sequence: GPU:0 sends $\mathbf{g}_0$ to GPU:1, which computes $\mathbf{g}_0 + \mathbf{g}_1$ and passes it to GPU:2; GPU:2 adds $\mathbf{g}_2$ and passes $\mathbf{g}_0 + \mathbf{g}_1 + \mathbf{g}_2$ to GPU:3; GPU:3 computes the total $\mathbf{g} = \mathbf{g}_0 + \mathbf{g}_1 + \mathbf{g}_2 + \mathbf{g}_3$. The total $\mathbf{g}$ is then passed around the ring, GPU:3 → GPU:0 → GPU:1 → GPU:2, until every GPU holds $\mathbf{g}$. Finally, every GPU applies the update $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}$.]
What is wrong with the naïve approach? Only one link of the ring carries data at any time, and every transfer moves the full gradient, so the communication time grows linearly with the number of GPUs while most links sit idle.
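To make the cost concrete, here is a minimal pure-Python/NumPy simulation of the naïve scheme (the GPUs and the ring links are simulated; an illustration, not the actual TensorFlow/NCCL implementation):

import numpy as np

def naive_ring_all_reduce(grads):
    """Naive ring all-reduce: accumulate around the ring, then broadcast.

    grads: list of m equal-shaped arrays, one per simulated GPU.
    Every one of the 2*(m - 1) sequential steps moves the FULL gradient
    across a single link, so total time grows linearly with m.
    """
    m = len(grads)
    # accumulation pass: 0 -> 1 -> ... -> m-1
    total = grads[0].copy()
    for i in range(1, m):
        total = total + grads[i]      # one full-gradient transfer per step
    # broadcast pass: m-1 more full-gradient transfers around the ring
    return [total.copy() for _ in range(m)]

grads = [np.full(4, float(i)) for i in range(4)]    # g_0, ..., g_3
print(naive_ring_all_reduce(grads)[0])              # [6. 6. 6. 6.] on every GPU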
Ring All-Reduce (Efficient Approach)
Each GPU partitions its local gradient into four chunks:
GPU:0 $\mathbf{g}_0 = [\mathbf{a}_0; \mathbf{b}_0; \mathbf{c}_0; \mathbf{d}_0]$
GPU:1 $\mathbf{g}_1 = [\mathbf{a}_1; \mathbf{b}_1; \mathbf{c}_1; \mathbf{d}_1]$
GPU:2 $\mathbf{g}_2 = [\mathbf{a}_2; \mathbf{b}_2; \mathbf{c}_2; \mathbf{d}_2]$
GPU:3 $\mathbf{g}_3 = [\mathbf{a}_3; \mathbf{b}_3; \mathbf{c}_3; \mathbf{d}_3]$
[Figure sequence: in every step, each GPU sends one chunk to its successor while receiving a different chunk from its predecessor, so all four links are busy simultaneously. Step 1: GPU:0 sends $\mathbf{a}_0$ to GPU:1 (forming $\mathbf{a}_0 + \mathbf{a}_1$), GPU:1 sends $\mathbf{b}_1$ to GPU:2 (forming $\mathbf{b}_1 + \mathbf{b}_2$), GPU:2 sends $\mathbf{c}_2$ to GPU:3 (forming $\mathbf{c}_2 + \mathbf{c}_3$), and GPU:3 sends $\mathbf{d}_3$ to GPU:0. After three such steps each GPU holds one fully reduced chunk; three more steps circulate the reduced chunks so that every GPU holds the complete sum.]
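The same simulation style for the efficient scheme: a scatter-reduce phase followed by an all-gather phase, each of m − 1 steps, with every link busy in every step, so per-GPU traffic is roughly 2(m − 1)/m of one gradient, nearly independent of m. A minimal sketch, assuming each gradient is a 1-D array whose length divides evenly by m:

import numpy as np

def ring_all_reduce(grads):
    """Efficient ring all-reduce (simulated): scatter-reduce + all-gather.

    grads: list of m equal-length 1-D arrays, length divisible by m.
    In every step each GPU sends one chunk and receives another, so all m
    links work in parallel and per-GPU traffic stays ~2*(m-1)/m gradients.
    """
    m = len(grads)
    # chunks[i][c]: chunk c of GPU i's gradient (copies, inputs untouched)
    chunks = [np.split(g.astype(float), m) for g in grads]

    # scatter-reduce: after m-1 steps, GPU i owns the full sum of chunk (i+1) % m
    for step in range(m - 1):
        for i in range(m):
            c = (i - step) % m                      # chunk GPU i sends this step
            chunks[(i + 1) % m][c] += chunks[i][c]  # successor accumulates it

    # all-gather: circulate the reduced chunks so every GPU gets every sum
    for step in range(m - 1):
        for i in range(m):
            c = (i + 1 - step) % m                  # fully reduced chunk to pass on
            chunks[(i + 1) % m][c] = chunks[i][c].copy()

    return [np.concatenate(ch) for ch in chunks]

grads = [np.arange(4.0) + i for i in range(4)]      # g_0, ..., g_3
print(ring_all_reduce(grads)[0])                    # [ 6. 10. 14. 18.] on every GPU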