Lower numerical precision reduces:
- model size
- memory requirements
- power consumption
⏩ Lower numerical precision is faster
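A minimal sketch of the size argument, assuming PyTorch is installed (the tensor size is illustrative): FP16 stores each element in half the bytes of FP32, which directly halves model size and memory traffic.

```python
import torch

# A 1M-element weight tensor in FP32 vs FP16:
w32 = torch.zeros(1_000_000, dtype=torch.float32)
w16 = w32.half()

print(w32.element_size())  # 4 bytes per element
print(w16.element_size())  # 2 bytes per element
# Halving the element size halves storage and memory bandwidth,
# and lower-precision math units are also faster on supporting hardware.
```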
For torch <= 1.9.1, AMP was limited to CUDA tensors via torch.cuda.amp.autocast().
From v1.10 onwards, PyTorch has a generic API, torch.autocast(), that automatically casts CUDA tensors to FP16 and CPU tensors to BF16.
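A minimal sketch of the generic API on CPU, assuming torch >= 1.10 (the tensor shapes are illustrative): under a CPU autocast region, eligible ops such as matmul run in BF16 by default.

```python
import torch

x = torch.randn(8, 16)
w = torch.randn(16, 4)

# On CPU, torch.autocast defaults to bfloat16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = x @ w          # matmul is autocast to BF16
print(y.dtype)         # torch.bfloat16
```

The same context manager with device_type="cuda" casts to FP16 instead, so one code path covers both devices.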
⏩ AMP is usually faster than .half()
Model: ResNet-101, Device: Tesla T4 GPU
⚠️ AMP is only for the forward pass: don't wrap the backward pass in autocast()
https://pytorch.org/docs/stable/ddp_comm_hooks.html#default-communication-hooks
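A minimal training-step sketch of the warning above, assuming a toy linear model (the model, data, and learning rate are illustrative): only the forward pass and loss computation sit inside autocast(); the backward pass, gradient scaling, and optimizer step stay outside. GradScaler is constructed as a no-op when CUDA is unavailable so the same code runs on CPU.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# GradScaler is only needed for FP16 on CUDA; enabled=False makes it a passthrough.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 16, device=device)
target = torch.randn(8, 4, device=device)

opt.zero_grad()
# Only the forward pass (and the loss) goes inside autocast ...
with torch.autocast(device_type=device):
    loss = torch.nn.functional.mse_loss(model(x), target)
# ... the backward pass runs outside the autocast region.
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```

autocast() already caches the casts made during the forward pass, so gradients computed in backward() automatically use the matching precision; wrapping backward() in autocast() is unnecessary and unsupported.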
❓ What about non-BF16 and ARM CPUs?
Half Precision:
https://pytorch-dev-podcast.simplecast.com/episodes/half-precision
torch.autocast:
https://pytorch.org/docs/1.10/amp.html#id4
AMP Examples:
https://pytorch.org/docs/stable/notes/amp_examples.html
Quantization in PyTorch:
https://pytorch.org/docs/stable/quantization.html