
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not
use or distribute these slides for commercial purposes. You may make copies of these
slides and use or distribute them for educational purposes as long as you
cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see
https://creativecommons.org/licenses/by-sa/2.0/legalcode
High Performance Modeling

Welcome
High Performance Modeling

Distributed Training
Rise in computational requirements

● At first, training models is quick and easy
● Training models becomes more time-consuming
  ○ With more data
  ○ With larger models
● Longer training -> More epochs -> Less efficient
● Use distributed training approaches
Types of distributed training

● Data parallelism: the model is replicated onto different accelerators
  (GPU/TPU) and the data is split between them
● Model parallelism: when a model is too large to fit on a single device,
  it can be divided into partitions, assigning different partitions to
  different accelerators
Data parallelism
Distributed training using data parallelism

Synchronous training
● All workers train and complete updates in sync
● Supported via all-reduce architecture

Asynchronous training
● Each worker trains and completes updates separately
● Supported via parameter server architecture
● More efficient, but can result in lower accuracy and slower convergence
Making your models distribute-aware

● If you want to distribute a model:
  ○ Supported in high-level APIs such as Keras/Estimators
  ○ For more control, you can use custom training loops
tf.distribute.Strategy

● Library in TensorFlow for running a computation across multiple devices
● Supports distribution strategies for high-level APIs like Keras and
  custom training loops
● Convenient to use with little or no code changes
Distribution Strategies supported by tf.distribute.Strategy

● One Device Strategy
● Mirrored Strategy
● Parameter Server Strategy
● Multi-Worker Mirrored Strategy
● Central Storage Strategy
● TPU Strategy
One Device Strategy

● Single device - no distribution
● Typical usage of this strategy is testing your code before
  switching to other strategies that actually distribute your code
Mirrored Strategy

● This strategy is typically used for training on one machine with multiple
  GPUs (a minimal sketch follows below)
  ○ Creates one replica per GPU; variables are mirrored across replicas
  ○ Weight updates are done using efficient cross-device communication
    algorithms (all-reduce algorithms)
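
A minimal sketch of using MirroredStrategy with Keras; the model, layer sizes,
and random data are illustrative placeholders, not from the slides:

import numpy as np
import tensorflow as tf

# Mirrors variables across all visible GPUs; falls back to CPU if none are found.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# The model and optimizer (i.e. the variables) must be created inside the scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data just to make the sketch runnable end to end.
x = np.random.rand(256, 10).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model.fit(x, y, epochs=2, batch_size=32)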
Parameter Server Strategy

● Some machines are designated as workers and others as parameter servers
  ○ Parameter servers store variables so that workers can perform
    computations on them
● Implements asynchronous data parallelism by default
Fault tolerance

● Catastrophic failures in one worker would cause failure of distribution
  strategies
● How to enable fault tolerance in case a worker dies?
  ○ By restoring the training state upon restart after a job failure
  ○ Keras implementation: BackupAndRestore callback (see the sketch below)
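
A minimal sketch of fault-tolerant training with the Keras BackupAndRestore
callback; the model, data, and backup directory are illustrative placeholders:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Saves the training state to backup_dir; if the job crashes and restarts,
# training resumes from the last completed epoch instead of starting over.
backup_callback = tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/train_backup")

x = np.random.rand(128, 4).astype("float32")
y = np.random.rand(128, 1).astype("float32")
model.fit(x, y, epochs=5, callbacks=[backup_callback])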
High Performance Modeling

High-performance
Ingestion
Why input pipelines?

Data at times can’t fit into memory, and CPUs are often under-utilized during
compute-intensive tasks like training a complex model.
You should avoid these inefficiencies so that you can make the most of the
available hardware → use input pipelines.
tf.data: TensorFlow Input Pipeline

Extract
● Local (HDD/SSD)
● Remote (GCS/HDFS)

Transform
● Shuffling & batching
● Decompression
● Augmentation
● Vectorization
● ...

Load
● Load transformed data to an accelerator
Inefficient ETL process

[Figure: a naive ETL process runs extract, transform, load, and train steps
sequentially, so the CPU sits idle during training and the accelerator sits
idle during extract, transform, and load.]
An improved ETL process

[Figure: with an overlapped ETL process, extract, transform, load, and train
steps for different batches run concurrently, keeping the accelerator 100%
utilized while the disk and CPU see only brief idle periods.]
Pipelining

[Figure: without pipelining, the CPU (prepare) and the GPU/TPU (train)
alternate, so each is idle while the other works; with pipelining, the CPU
prepares the next batch while the GPU/TPU trains on the current one, so
neither sits idle.]
How to optimize pipeline performance?

● Prefetching
● Parallelizing data extraction and transformation
● Caching
● Reducing memory usage
Optimize with prefetching
[Figure: timeline of open, read, and train steps per epoch; with prefetching,
reading the next batch overlaps with training on the current batch.]

benchmark(
    ArtificialDataset()
    .prefetch(tf.data.experimental.AUTOTUNE)
)
Parallelize data extraction

● Time-to-first-byte: prefer local storage, since the first byte takes
  significantly longer to arrive from remote storage
● Read throughput: maximize the aggregate bandwidth of remote storage by
  reading from more files in parallel
Parallel interleave

[Figure: timeline of open, read, and train steps per epoch; with parallel
interleave, multiple files are opened and read concurrently.]

benchmark(
    tf.data.Dataset.range(2)
    .interleave(
        ArtificialDataset,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
)
Parallelize data transformation

● After data loading, the inputs may need preprocessing
● Element-wise preprocessing can be parallelized across CPU cores
● The optimal level of parallelism depends on:
  ○ Size and shape of the training data
  ○ Cost of the mapping transformation
  ○ Load the CPU is currently experiencing
● With tf.data you can use AUTOTUNE to set parallelism automatically
Parallel map

[Figure: timeline of open, read, map, and train steps per epoch; with parallel
map, the mapping transformation runs on multiple elements concurrently.]

benchmark(
    ArtificialDataset()
    .map(
        mapped_function,
        num_parallel_calls=tf.data.AUTOTUNE
    )
)
Improve training time with caching

● In-memory: tf.data.Dataset.cache()
● Disk: tf.data.Dataset.cache(filename=...)
Caching

[Figure: timeline of open, read, map, and train steps per epoch; with a cached
dataset, epochs after the first skip the open, read, and map steps entirely.]

benchmark(
    ArtificialDataset().map(mapped_function).cache(),
    5
)
High Performance Modeling

Training Large Models:
The Rise of Giant Neural Nets and Parallelism
Rise of giant neural networks

● In 2014, the ImageNet winner was GoogLeNet, with 4 million parameters,
  scoring 74.8% top-1 accuracy
● In 2017, Squeeze-and-Excitation Networks achieved 82.7% top-1 accuracy
  with 145.8 million parameters

A 36-fold increase in the number of parameters in just 3 years!
Issues training larger networks

● GPU memory has only increased by a factor of ~3
● Giant models saturate the amount of memory available in Cloud TPUs
● Need for large-scale training of giant neural networks
Overcoming memory constraints

● Strategy #1 - Gradient accumulation
  ○ Split each batch into mini-batches, accumulate the gradients, and only
    apply the update after the whole batch has been processed (see the
    sketch below)
● Strategy #2 - Memory swap
  ○ Copy activations back and forth between accelerator memory and CPU memory
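
A minimal sketch of gradient accumulation with a TensorFlow custom training
loop; the model, optimizer, batch sizes, and random data are illustrative
placeholders:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

accum_steps = 4  # number of mini-batches that make up one "effective" batch
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

x = np.random.rand(64, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(4)

for step, (xb, yb) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(yb, model(xb, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Accumulate gradients from each mini-batch instead of applying them immediately.
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % accum_steps == 0:
        # Apply the averaged gradients once per "effective" batch, then reset.
        optimizer.apply_gradients(
            [(a / accum_steps, v) for a, v in zip(accum_grads, model.trainable_variables)]
        )
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]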
Parallelism revisited

● Data parallelism: the model is replicated onto different accelerators
  (GPU/TPU) and the data is split between them
● Model parallelism: when a model is too large to fit on a single device,
  it can be divided into partitions, assigning different partitions to
  different accelerators
Challenges in data parallelism
Communication overhead

[Figure: communication overhead as a percentage of training time for VGG16,
ResNet50, and AlexNet with 4, 8, and 16 accelerators, comparing K80, Titan X,
and V100 GPUs.]
Challenges keeping accelerators busy

● Accelerators have limited memory
● Model parallelism: large networks can be trained
  ○ But accelerator compute capacity is underutilized
● Data parallelism: the same model is trained with different input data
  ○ But the maximum model size an accelerator can support is limited
Pipeline parallelism

[Figure: with naive model parallelism, the forward (F) and backward (B) passes
of a mini-batch move through devices 0-3 one after another, so only one device
is busy at a time. With pipeline parallelism, the mini-batch is split into
micro-batches (F0,0 ... F3,3 and B3,3 ... B0,0), so the devices work on
different micro-batches concurrently, leaving only a small idle "bubble"
before the weight update.]
Pipeline parallelism

● Integrates both data and model parallelism:
  ○ Divide each mini-batch of data into micro-batches
  ○ Different workers work on different micro-batches in parallel
  ○ Allows ML models to have significantly more parameters
    (a toy sketch of micro-batching follows below)
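
A toy sketch of the micro-batching idea, assuming two model partitions placed
via tf.device; it processes the micro-batches sequentially and omits the
overlapping schedule and re-materialization that GPipe actually implements.
The partition shapes and data are made up for illustration:

import numpy as np
import tensorflow as tf

# Two partitions of one model; in a real setup these would live on different
# accelerators (e.g. "/GPU:0" and "/GPU:1"). CPU is used so the sketch runs anywhere.
with tf.device("/CPU:0"):
    partition_1 = tf.keras.Sequential(
        [tf.keras.layers.Dense(32, activation="relu", input_shape=(16,))])
with tf.device("/CPU:0"):
    partition_2 = tf.keras.Sequential([tf.keras.layers.Dense(1)])

mini_batch = np.random.rand(32, 16).astype("float32")

# Split the mini-batch into 4 micro-batches and push each through the partitions.
micro_batches = tf.split(mini_batch, num_or_size_splits=4)
outputs = []
for mb in micro_batches:
    activations = partition_1(mb)             # would run on device 0
    outputs.append(partition_2(activations))  # would run on device 1

result = tf.concat(outputs, axis=0)
print(result.shape)  # (32, 1)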
GPipe - Key features

● Open-source TensorFlow library (using Lingvo)
● Inserts communication primitives at the partition boundaries
● Automatic parallelism to reduce memory consumption
● Gradient accumulation across micro-batches, so that model quality is
  preserved
● Partitioning is heuristic-based
GPipe Results

[Figure: training speedup of AmoebaNet-D (4, 512) with naive-2, pipeline-2,
pipeline-4, and pipeline-8 partitioning.]
Knowledge Distillation

Teacher and Student Networks
Sophisticated models and their problems

● Larger, sophisticated models become complex
● Complex models learn complex tasks
● Can we express this learning more efficiently?

Is it possible to ‘distill’ or concentrate this complexity into smaller
networks?

[Image: the GoogLeNet architecture, as an example of a complex model.]
Knowledge distillation
● Duplicate the performance of a complex model in a simpler model
● Idea: create a simple ‘student’ model that learns from a complex ‘teacher’
  model

[Figure: a teacher network and student networks (Student 1, Student 2)
connected through a loss that transfers the teacher’s knowledge to the
students.]
Knowledge Distillation

Knowledge Distillation
Techniques
Teacher and student

● The training objectives of the two models differ
● Teacher (normal training)
  ○ maximizes the actual metric
● Student (knowledge transfer)
  ○ matches the probability distribution of the teacher’s predictions to form
    ‘soft targets’
  ○ ‘soft targets’ tell us about the knowledge learned by the teacher
Transferring “dark knowledge” to the student

● Improve the softness of the teacher’s distribution with a ‘softmax
  temperature’ (T), as in the sketch below
● As T grows, you get more insight into which classes the teacher finds
  similar to the predicted one
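
A minimal sketch of temperature-scaled softmax; the logits here are made up
for illustration:

import tensorflow as tf

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing the logits by T > 1 softens the distribution,
    # revealing more about which classes the model finds similar.
    return tf.nn.softmax(logits / temperature)

logits = tf.constant([5.0, 2.0, 1.0])
print(softmax_with_temperature(logits, temperature=1.0).numpy())  # sharp, ~[0.94, 0.05, 0.02]
print(softmax_with_temperature(logits, temperature=4.0).numpy())  # softer, ~[0.54, 0.26, 0.20]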
Techniques

● Approach #1: weigh the objectives (student and teacher) and combine them
  during backprop
● Approach #2: compare the distributions of the predictions (student and
  teacher) using KL divergence (see the sketch below)
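
A minimal sketch of a distillation loss that combines KL divergence on
temperature-softened predictions with the usual hard-label loss; the weighting
alpha, the temperature, and the example logits are illustrative choices, not
values from the slides:

import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened teacher and
    # student distributions.
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    kl = tf.keras.losses.KLDivergence()(soft_teacher, soft_student)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = kl * temperature ** 2

    # Hard targets: standard cross-entropy against ground-truth labels at T = 1.
    hard_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(
        hard_labels, student_logits)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Tiny usage example with made-up logits and labels.
teacher_logits = tf.constant([[5.0, 2.0, 1.0]])
student_logits = tf.constant([[3.0, 2.5, 0.5]])
labels = tf.constant([0])
print(distillation_loss(teacher_logits, student_logits, labels).numpy())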
How knowledge transfer takes place
[Figure: the input x is fed to both the teacher model and the student
(distilled) model. The teacher’s logits (t) and the student’s logits (s) are
each passed through a softmax with temperature T = t, producing soft labels
and soft predictions that are compared via the distillation loss. The
student’s softmax at T = 1 produces hard predictions that are compared with
the ground-truth hard label y via the student loss.]
First quantitative results of distillation

Model                    Accuracy    Word Error Rate (WER)
Baseline                 58.9%       10.9%
10x Ensemble             61.1%       10.7%
Distilled Single Model   60.8%       10.7%


Example: DistilBERT, a smaller, faster distilled version of BERT
Knowledge Distillation

Case Study: How to Distill Knowledge for a Q&A Task
Two-stage multi-teacher distillation for Q & A
[Figure: two-stage multi-teacher knowledge distillation. Stage 1 distills
multiple Q&A teachers (distillation tasks 1..N) into a student model using
soft labels from a large unsupervised Q&A corpus. Stage 2 further distills on
task-specific corpora (Q&A, RTE, MNLI) using golden plus soft labels,
producing an enhanced student model.]
Impact of two-stage knowledge distillation

[Figure: bar chart comparing the accuracy of TKD, MKD, and TMKD models on
DeepQA, MNLI, SNLI, QNLI, and RTE.]

Make EfficientNets robust to noise with distillation

1. Train a teacher model on labeled data
2. Assign pseudo-labels to unlabeled data using the teacher
3. Train a noised student model using the labeled and pseudo-labeled data
   ● Data augmentation
   ● Dropout
   ● Stochastic depth
4. Replace the teacher with the student and repeat
   (a minimal sketch of this loop follows below)


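A minimal sketch of the noisy student loop; the stand-in models, noise
settings, and randomly generated data are illustrative placeholders for
ImageNet-scale training with EfficientNets:

import numpy as np
import tensorflow as tf

def make_model():
    # Stand-in for an EfficientNet; dropout provides the "noise" for the student.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# Made-up labeled and unlabeled data.
x_labeled = np.random.rand(200, 32).astype("float32")
y_labeled = np.random.randint(0, 10, size=(200,))
x_unlabeled = np.random.rand(800, 32).astype("float32")

# 1. Train the teacher on labeled data.
teacher = make_model()
teacher.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
teacher.fit(x_labeled, y_labeled, epochs=3, verbose=0)

for _ in range(2):  # 4. Replace the teacher with the student and repeat.
    # 2. Assign pseudo-labels to the unlabeled data using the teacher.
    pseudo_labels = np.argmax(teacher.predict(x_unlabeled, verbose=0), axis=1)

    # 3. Train a noised student on labeled + pseudo-labeled data
    #    (real noisy student training also uses augmentation and stochastic depth).
    x_all = np.concatenate([x_labeled, x_unlabeled])
    y_all = np.concatenate([y_labeled, pseudo_labels])
    student = make_model()
    student.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    student.fit(x_all, y_all, epochs=3, verbose=0)

    teacher = student
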
Results of noisy student training

[Figure: ImageNet top-1 accuracy versus number of parameters (in millions);
the Noisy Student EfficientNet-B7 reaches higher top-1 accuracy than
EfficientNet-B7, AmoebaNet-C, and SENet.]
