
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not
use or distribute these slides for commercial purposes. You may make copies of these
slides and use or distribute them for educational purposes as long as you
cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see
https://creativecommons.org/licenses/by-sa/2.0/legalcode
High Performance Modeling

Welcome
High Performance Modeling

Distributed Training
Rise in computational requirements

● At first, training models is quick and easy
● Training models becomes more time-consuming
  ○ With more data
  ○ With larger models
● Longer training -> More epochs -> Less efficient
● Use distributed training approaches
Types of distributed training

● Data parallelism: the model is replicated onto different accelerators
  (GPU/TPU) and the data is split between them
● Model parallelism: when a model is too large to fit on a single device,
  it can be divided into partitions, assigning different partitions to
  different accelerators
Data parallelism
Distributed training using data parallelism

Synchronous training
● All workers train and complete updates in sync
● Supported via all-reduce architecture

Asynchronous training
● Each worker trains and completes updates separately
● Supported via parameter server architecture
● More efficient, but can result in lower accuracy and slower convergence
Making your models distribute-aware

● If you want to distribute a model:
  ○ Supported in high-level APIs such as Keras/Estimators
  ○ For more control, you can use custom training loops
tf.distribute.Strategy

● Library in TensorFlow for running a computation across multiple devices
● Supports distribution strategies for high-level APIs like Keras and
  custom training loops
● Convenient to use with little or no code changes
Distribution Strategies supported by tf.distribute.Strategy

● One Device Strategy
● Mirrored Strategy
● Parameter Server Strategy
● Multi-Worker Mirrored Strategy
● Central Storage Strategy
● TPU Strategy
One Device Strategy

● Single device - no distribution
● Typical usage of this strategy is testing your code before
  switching to other strategies that actually distribute your code
Mirrored Strategy

● This strategy is typically used for training on one machine with multiple
  GPUs (a minimal sketch follows below)
  ○ Creates one replica per GPU; variables are mirrored across replicas
  ○ Weight updates are done using efficient cross-device communication
    algorithms (all-reduce algorithms)
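
A minimal sketch of using MirroredStrategy with Keras; the model, layer sizes,
and random data are illustrative placeholders, not from the slides:

import numpy as np
import tensorflow as tf

# Mirrors variables across all visible GPUs; falls back to CPU if none are found.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# The model and optimizer (i.e. the variables) must be created inside the scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data just to make the sketch runnable end to end.
x = np.random.rand(256, 10).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model.fit(x, y, epochs=2, batch_size=32)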
Parameter Server Strategy

● Some machines are designated as workers and others as parameter servers
  ○ Parameter servers store variables so that workers can perform
    computations on them
● Implements asynchronous data parallelism by default
Fault tolerance

● Catastrophic failures in one worker would cause failure of distribution
  strategies
● How to enable fault tolerance in case a worker dies?
  ○ By restoring the training state upon restart after a job failure
  ○ Keras implementation: BackupAndRestore callback (see the sketch below)
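
A minimal sketch of fault-tolerant training with the Keras BackupAndRestore
callback; the model, data, and backup directory are illustrative placeholders:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Saves the training state to backup_dir; if the job crashes and restarts,
# training resumes from the last completed epoch instead of starting over.
backup_callback = tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/train_backup")

x = np.random.rand(128, 4).astype("float32")
y = np.random.rand(128, 1).astype("float32")
model.fit(x, y, epochs=5, callbacks=[backup_callback])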
High Performance Modeling

High-performance
Ingestion
Why input pipelines?

Data at times can’t fit into memory, and CPUs are often under-utilized during
compute-intensive tasks like training a complex model.
You should avoid these inefficiencies so that you can make the most of the
available hardware → use input pipelines.
tf.data: TensorFlow Input Pipeline

Extract
● Local (HDD/SSD)
● Remote (GCS/HDFS)

Transform
● Shuffling & batching
● Decompression
● Augmentation
● Vectorization
● ...

Load
● Load transformed data to an accelerator
Inefficient ETL process

[Figure: a naive ETL process runs extract, transform, load, and train steps
sequentially, so the CPU sits idle during training and the accelerator sits
idle during extract, transform, and load.]
An improved ETL process

[Figure: with an overlapped ETL process, extract, transform, load, and train
steps for different batches run concurrently, keeping the accelerator 100%
utilized while the disk and CPU see only brief idle periods.]
Pipelining

[Figure: without pipelining, the CPU (prepare) and the GPU/TPU (train)
alternate, so each is idle while the other works; with pipelining, the CPU
prepares the next batch while the GPU/TPU trains on the current one, so
neither sits idle.]
How to optimize pipeline performance?

● Prefetching
● Parallelizing data extraction and transformation
● Caching
● Reducing memory usage
Optimize with prefetching
[Figure: timeline of open, read, and train steps per epoch; with prefetching,
reading the next batch overlaps with training on the current batch.]

benchmark(
    ArtificialDataset()
    .prefetch(tf.data.experimental.AUTOTUNE)
)
Parallelize data extraction

● Time-to-first-byte: prefer local storage, since the first byte takes
  significantly longer to arrive from remote storage
● Read throughput: maximize the aggregate bandwidth of remote storage by
  reading from more files in parallel
Parallel interleave

[Figure: timeline of open, read, and train steps per epoch; with parallel
interleave, multiple files are opened and read concurrently.]

benchmark(
    tf.data.Dataset.range(2)
    .interleave(
        ArtificialDataset,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
)
Parallelize data transformation

● After data loading, the inputs may need preprocessing
● Element-wise preprocessing can be parallelized across CPU cores
● The optimal level of parallelism depends on:
  ○ Size and shape of the training data
  ○ Cost of the mapping transformation
  ○ Load the CPU is currently experiencing
● With tf.data you can use AUTOTUNE to set parallelism automatically
Parallel map

[Figure: timeline of open, read, map, and train steps per epoch; with parallel
map, the mapping transformation runs on multiple elements concurrently.]

benchmark(
    ArtificialDataset()
    .map(
        mapped_function,
        num_parallel_calls=tf.data.AUTOTUNE
    )
)
Improve training time with caching

● In-memory: tf.data.Dataset.cache()
● Disk: tf.data.Dataset.cache(filename=...)
Caching

[Figure: timeline of open, read, map, and train steps per epoch; with a cached
dataset, epochs after the first skip the open, read, and map steps entirely.]

benchmark(
    ArtificialDataset().map(mapped_function).cache(),
    5
)
High Performance Modeling

Training Large Models:
The Rise of Giant Neural Nets and Parallelism
Rise of giant neural networks

● In 2014, the ImageNet winner was GoogLeNet, with 4 million parameters,
  scoring 74.8% top-1 accuracy
● In 2017, Squeeze-and-Excitation Networks achieved 82.7% top-1 accuracy
  with 145.8 million parameters

A 36-fold increase in the number of parameters in just 3 years!
Issues training larger networks

● GPU memory has only increased by a factor of ~3
● Giant models saturate the amount of memory available in Cloud TPUs
● Need for large-scale training of giant neural networks
Overcoming memory constraints

● Strategy #1 - Gradient accumulation
  ○ Split each batch into mini-batches, accumulate the gradients, and only
    apply the update after the whole batch has been processed (see the
    sketch below)
● Strategy #2 - Memory swap
  ○ Copy activations back and forth between accelerator memory and CPU memory
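
A minimal sketch of gradient accumulation with a TensorFlow custom training
loop; the model, optimizer, batch sizes, and random data are illustrative
placeholders:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

accum_steps = 4  # number of mini-batches that make up one "effective" batch
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

x = np.random.rand(64, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(4)

for step, (xb, yb) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(yb, model(xb, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Accumulate gradients from each mini-batch instead of applying them immediately.
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % accum_steps == 0:
        # Apply the averaged gradients once per "effective" batch, then reset.
        optimizer.apply_gradients(
            [(a / accum_steps, v) for a, v in zip(accum_grads, model.trainable_variables)]
        )
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]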
Parallelism revisited

● Data parallelism: the model is replicated onto different accelerators
  (GPU/TPU) and the data is split between them
● Model parallelism: when a model is too large to fit on a single device,
  it can be divided into partitions, assigning different partitions to
  different accelerators
Challenges in data parallelism
Communication overhead

[Figure: communication overhead as a percentage of training time for VGG16,
ResNet50, and AlexNet with 4, 8, and 16 accelerators, comparing K80, Titan X,
and V100 GPUs.]
Challenges keeping accelerators busy

● Accelerators have limited memory
● Model parallelism: large networks can be trained
  ○ But accelerator compute capacity is underutilized
● Data parallelism: the same model is trained with different input data
  ○ But the maximum model size an accelerator can support is limited
Pipeline parallelism

[Figure: with naive model parallelism, the forward (F) and backward (B) passes
of a mini-batch move through devices 0-3 one after another, so only one device
is busy at a time. With pipeline parallelism, the mini-batch is split into
micro-batches (F0,0 ... F3,3 and B3,3 ... B0,0), so the devices work on
different micro-batches concurrently, leaving only a small idle "bubble"
before the weight update.]
Pipeline parallelism

● Integrates both data and model parallelism:
  ○ Divide each mini-batch of data into micro-batches
  ○ Different workers work on different micro-batches in parallel
  ○ Allows ML models to have significantly more parameters
    (a toy sketch of micro-batching follows below)
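
A toy sketch of the micro-batching idea, assuming two model partitions placed
via tf.device; it processes the micro-batches sequentially and omits the
overlapping schedule and re-materialization that GPipe actually implements.
The partition shapes and data are made up for illustration:

import numpy as np
import tensorflow as tf

# Two partitions of one model; in a real setup these would live on different
# accelerators (e.g. "/GPU:0" and "/GPU:1"). CPU is used so the sketch runs anywhere.
with tf.device("/CPU:0"):
    partition_1 = tf.keras.Sequential(
        [tf.keras.layers.Dense(32, activation="relu", input_shape=(16,))])
with tf.device("/CPU:0"):
    partition_2 = tf.keras.Sequential([tf.keras.layers.Dense(1)])

mini_batch = np.random.rand(32, 16).astype("float32")

# Split the mini-batch into 4 micro-batches and push each through the partitions.
micro_batches = tf.split(mini_batch, num_or_size_splits=4)
outputs = []
for mb in micro_batches:
    activations = partition_1(mb)             # would run on device 0
    outputs.append(partition_2(activations))  # would run on device 1

result = tf.concat(outputs, axis=0)
print(result.shape)  # (32, 1)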
GPipe - Key features

● Open-source TensorFlow library (using Lingvo)
● Inserts communication primitives at the partition boundaries
● Automatic parallelism to reduce memory consumption
● Gradient accumulation across micro-batches, so that model quality is
  preserved
● Partitioning is heuristic-based
GPipe Results

[Figure: training speedup of AmoebaNet-D (4, 512) with naive-2, pipeline-2,
pipeline-4, and pipeline-8 partitioning.]
Knowledge Distillation

Teacher and Student Networks
Sophisticated models and their problems

● Larger, sophisticated models become complex
● Complex models learn complex tasks
● Can we express this learning more efficiently?

Is it possible to ‘distill’ or concentrate this complexity into smaller
networks?

[Image: the GoogLeNet architecture, as an example of a complex model.]
Knowledge distillation
● Duplicate the performance of a complex model in a simpler model
● Idea: create a simple ‘student’ model that learns from a complex ‘teacher’
  model

[Figure: a teacher network and student networks (Student 1, Student 2)
connected through a loss that transfers the teacher’s knowledge to the
students.]
Knowledge Distillation

Knowledge Distillation
Techniques
Teacher and student

● The training objectives of the two models differ
● Teacher (normal training)
  ○ maximizes the actual metric
● Student (knowledge transfer)
  ○ matches the probability distribution of the teacher’s predictions to form
    ‘soft targets’
  ○ ‘soft targets’ tell us about the knowledge learned by the teacher
Transferring “dark knowledge” to the student

● Improve the softness of the teacher’s distribution with a ‘softmax
  temperature’ (T), as in the sketch below
● As T grows, you get more insight into which classes the teacher finds
  similar to the predicted one
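
A minimal sketch of temperature-scaled softmax; the logits here are made up
for illustration:

import tensorflow as tf

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing the logits by T > 1 softens the distribution,
    # revealing more about which classes the model finds similar.
    return tf.nn.softmax(logits / temperature)

logits = tf.constant([5.0, 2.0, 1.0])
print(softmax_with_temperature(logits, temperature=1.0).numpy())  # sharp, ~[0.94, 0.05, 0.02]
print(softmax_with_temperature(logits, temperature=4.0).numpy())  # softer, ~[0.54, 0.26, 0.20]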
Techniques

● Approach #1: weigh the objectives (student and teacher) and combine them
  during backprop
● Approach #2: compare the distributions of the predictions (student and
  teacher) using KL divergence (see the sketch below)
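
A minimal sketch of a distillation loss that combines KL divergence on
temperature-softened predictions with the usual hard-label loss; the weighting
alpha, the temperature, and the example logits are illustrative choices, not
values from the slides:

import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened teacher and
    # student distributions.
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    kl = tf.keras.losses.KLDivergence()(soft_teacher, soft_student)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = kl * temperature ** 2

    # Hard targets: standard cross-entropy against ground-truth labels at T = 1.
    hard_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(
        hard_labels, student_logits)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Tiny usage example with made-up logits and labels.
teacher_logits = tf.constant([[5.0, 2.0, 1.0]])
student_logits = tf.constant([[3.0, 2.5, 0.5]])
labels = tf.constant([0])
print(distillation_loss(teacher_logits, student_logits, labels).numpy())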
How knowledge transfer takes place
[Figure: the input x is fed to both the teacher model and the student
(distilled) model. The teacher’s logits (t) and the student’s logits (s) are
each passed through a softmax with temperature T = t, producing soft labels
and soft predictions that are compared via the distillation loss. The
student’s softmax at T = 1 produces hard predictions that are compared with
the ground-truth hard label y via the student loss.]
First quantitative results of distillation

Model                    Accuracy    Word Error Rate (WER)
Baseline                 58.9%       10.9%
10x Ensemble             61.1%       10.7%
Distilled Single Model   60.8%       10.7%


Example: DistilBERT, a smaller, faster distilled version of BERT
Knowledge Distillation

Case Study: How to Distill Knowledge for a Q&A Task
Two-stage multi-teacher distillation for Q & A
[Figure: two-stage multi-teacher knowledge distillation. Stage 1 distills
multiple Q&A teachers (distillation tasks 1..N) into a student model using
soft labels from a large unsupervised Q&A corpus. Stage 2 further distills on
task-specific corpora (Q&A, RTE, MNLI) using golden plus soft labels,
producing an enhanced student model.]
Impact of two-stage knowledge distillation

[Figure: bar chart comparing the accuracy of TKD, MKD, and TMKD models on
DeepQA, MNLI, SNLI, QNLI, and RTE.]

Make EfficientNets robust to noise with distillation

1. Train a teacher model on labeled data
2. Assign pseudo-labels to unlabeled data using the teacher
3. Train a noised student model using the labeled and pseudo-labeled data
   ● Data augmentation
   ● Dropout
   ● Stochastic depth
4. Replace the teacher with the student and repeat
   (a minimal sketch of this loop follows below)


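A minimal sketch of the noisy student loop; the stand-in models, noise
settings, and randomly generated data are illustrative placeholders for
ImageNet-scale training with EfficientNets:

import numpy as np
import tensorflow as tf

def make_model():
    # Stand-in for an EfficientNet; dropout provides the "noise" for the student.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# Made-up labeled and unlabeled data.
x_labeled = np.random.rand(200, 32).astype("float32")
y_labeled = np.random.randint(0, 10, size=(200,))
x_unlabeled = np.random.rand(800, 32).astype("float32")

# 1. Train the teacher on labeled data.
teacher = make_model()
teacher.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
teacher.fit(x_labeled, y_labeled, epochs=3, verbose=0)

for _ in range(2):  # 4. Replace the teacher with the student and repeat.
    # 2. Assign pseudo-labels to the unlabeled data using the teacher.
    pseudo_labels = np.argmax(teacher.predict(x_unlabeled, verbose=0), axis=1)

    # 3. Train a noised student on labeled + pseudo-labeled data
    #    (real noisy student training also uses augmentation and stochastic depth).
    x_all = np.concatenate([x_labeled, x_unlabeled])
    y_all = np.concatenate([y_labeled, pseudo_labels])
    student = make_model()
    student.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    student.fit(x_all, y_all, epochs=3, verbose=0)

    teacher = student
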
Results of noisy student training

[Figure: ImageNet top-1 accuracy versus number of parameters (in millions);
the Noisy Student EfficientNet-B7 reaches higher top-1 accuracy than
EfficientNet-B7, AmoebaNet-C, and SENet.]
