DeepLearning.AI makes these slides available for educational purposes. You may not
use or distribute these slides for commercial purposes. You may make copies of these
slides and use or distribute them for educational purposes as long as you
cite DeepLearning.AI as the source of the slides.
Welcome
High Performance Modeling
Distributed Training
Rise in computational requirements
● This strategy is typically used for training on one machine with multiple GPUs
○ Creates a replica per GPU; variables are mirrored across the replicas
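As a rough sketch (not from the slides), a mirrored multi-GPU setup in TensorFlow might look like the following; the toy model and random data are placeholders:

import tensorflow as tf

# One replica per visible GPU; variables are mirrored and kept in sync.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables created inside the scope are replicated across devices.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder data; each batch is split across the replicas automatically.
x = tf.random.normal((256, 10))
y = tf.random.normal((256, 1))
model.fit(x, y, batch_size=32, epochs=1)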
High-performance Ingestion
Why input pipelines?
Data sometimes can't fit into memory, and CPUs are often under-utilized during compute-intensive tasks like training a complex model.
You should avoid these inefficiencies so that you can make the most of the available hardware → use input pipelines
tf.data: TensorFlow Input Pipeline
[Diagram: a naive pipeline timeline — Load 1, Transform 1, Train 1, then Load 2, Transform 2, Train 2 run one after another, so the CPU sits idle during training and the accelerator sits idle during load and transform]
An improved ETL process
[Diagram: extraction, transformation, and training overlap over time, so disk and CPU idle gaps shrink and the accelerator stays 100% utilized]
Pipelining
● Prefetching
● Parallelize data extraction and transformation
● Caching
● Reduce memory
Optimize with prefetching
[Plot: execution timeline with prefetching — open, read, and train steps for successive epochs overlap; x-axis: time (s)]
benchmark(
    ArtificialDataset()
    .prefetch(tf.data.experimental.AUTOTUNE)
)
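The ArtificialDataset and benchmark helpers used in these snippets are not defined on the slides; a minimal sketch of what they might look like, modeled on the tf.data performance guide with sleep calls simulating slow I/O, is:

import time
import tensorflow as tf

class ArtificialDataset(tf.data.Dataset):
    """Simulates a slow data source: opening a file, then reading samples."""
    def _generator(num_samples):
        time.sleep(0.03)                # simulate opening a file
        for sample_idx in range(num_samples):
            time.sleep(0.015)           # simulate reading one record
            yield (sample_idx,)

    def __new__(cls, num_samples=3):
        return tf.data.Dataset.from_generator(
            cls._generator,
            output_signature=tf.TensorSpec(shape=(1,), dtype=tf.int64),
            args=(num_samples,),
        )

def benchmark(dataset, num_epochs=2):
    """Iterates over the dataset and reports the total wall-clock time."""
    start_time = time.perf_counter()
    for _ in range(num_epochs):
        for _ in dataset:
            time.sleep(0.01)            # simulate a training step
    print("Execution time:", time.perf_counter() - start_time)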
Parallelize data extraction
[Plot: execution timeline with parallel data extraction — open and read steps from multiple sources run concurrently; x-axis: time (s)]
benchmark(
    tf.data.Dataset.range(2)
    .interleave(
        ArtificialDataset,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
)
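In a real pipeline the same pattern is typically used to read several files concurrently; a hedged sketch (the shard file paths below are hypothetical):

import tensorflow as tf

# Hypothetical shard files; interleave opens and reads them concurrently.
file_paths = ["data/shard-0.tfrecord", "data/shard-1.tfrecord"]

dataset = tf.data.Dataset.from_tensor_slices(file_paths).interleave(
    tf.data.TFRecordDataset,
    cycle_length=2,                        # number of files read at once
    num_parallel_calls=tf.data.AUTOTUNE,   # let tf.data pick the parallelism
)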
Parallelize data transformation
[Plot: execution timeline with parallel data transformation; x-axis: time (s)]
benchmark(
    ArtificialDataset()
    .map(
        mapped_function,
        num_parallel_calls=tf.data.AUTOTUNE
    )
)
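mapped_function is likewise not defined on the slides; a minimal stand-in that simulates an expensive per-element transformation could be:

import time
import tensorflow as tf

def mapped_function(s):
    # Simulate heavy preprocessing (e.g. decoding or augmentation).
    tf.py_function(lambda: time.sleep(0.03), [], ())
    return s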
Improve training time with caching
● In-memory: tf.data.Dataset.cache()
● Disk: tf.data.Dataset.cache(filename=...)
Caching
[Plot: execution timeline for a cached dataset — the expensive steps run only in the first epoch; x-axis: time (s)]
benchmark(
    ArtificialDataset()
    .map(mapped_function)
    .cache(),
    5  # number of epochs to iterate
)
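For datasets too large to hold in memory, the same pipeline can cache to disk instead; a sketch assuming a hypothetical cache file prefix:

# Writes the post-map elements to files during the first epoch, then replays
# them on later epochs instead of re-running the expensive map.
benchmark(
    ArtificialDataset()
    .map(mapped_function)
    .cache(filename="/tmp/tfdata_cache"),   # hypothetical cache file prefix
    5
)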
High performance modeling
[Chart: % of time for K80, Titan X, and V100 accelerators at batch sizes 4, 8, and 16 on VGG16, ResNet50, and AlexNet]
Challenges keeping accelerators busy
[Diagram: pipeline parallelism — in a naive schedule, devices 0-3 each run their forward (F) and backward (B) pass and update while the other devices wait; with pipelining, the batch is split into micro-batches (F1,0 … F1,3 followed by B1,3 … B1,0) so the devices overlap work before the update]
[Chart: speedup of AmoebaNet-D (4, 512) for naive-2, pipeline-2, pipeline-4, and pipeline-8 configurations]
Knowledge Distillation
● Idea: Create a simple 'student' model that learns from a complex 'teacher' model
[Diagram: a teacher model distills knowledge into Student 1 and Student 2 through a distillation loss]
Knowledge Distillation Techniques
Teacher and student
[Diagram: the student is trained on the teacher's soft predictions together with the ground-truth hard label y]
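A minimal sketch of how a student could be trained against both signals (the loss weighting alpha and temperature are illustrative choices, not values from the slides):

import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, hard_label_y,
                      temperature=4.0, alpha=0.1):
    """Combine the soft-target loss (teacher) with the hard-label loss (ground truth)."""
    # Softened probability distributions from both models.
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)

    # KL divergence between the softened teacher and student predictions,
    # scaled by T^2 as in Hinton et al. (2015).
    soft_loss = tf.keras.losses.KLDivergence()(soft_teacher, soft_student)
    soft_loss *= temperature ** 2

    # Standard cross-entropy against the ground-truth hard label y.
    hard_loss = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True)(hard_label_y, student_logits)

    return alpha * hard_loss + (1.0 - alpha) * soft_loss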
First quantitative results of distillation
Train teacher on labeled data
● Data augmentation
● Dropout
● Stochastic depth
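A hedged sketch of how the first two listed noise sources might appear in a model during training (layer choices and rates are illustrative; stochastic depth would need a custom layer and is omitted):

import tensorflow as tf

# Random augmentation layers are active only during training.
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

student = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    data_augmentation,                    # data augmentation noise
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),         # dropout noise during training
    tf.keras.layers.Dense(10),
])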
[Plot: accuracy (roughly 74-80) versus number of parameters, 0-160 million]