How To Decay Your Learning Rate

Aitor Lewkowycz¹

¹ Blueshift, Alphabet. Correspondence to: Aitor Lewkowycz <alewkowycz@google.com>.

Abstract

Complex learning rate schedules have become an integral part of deep learning. We find empirically that common fine-tuned schedules decay the learning rate after the weight norm bounces. This leads to the proposal of ABEL: an automatic scheduler which decays the learning rate by keeping track of the weight norm. ABEL's performance matches that of tuned schedules and is more robust with respect to its parameters. Through extensive experiments in vision, NLP, and RL, we show that if the weight norm does not bounce, we can simplify schedules even further with no loss in performance: in such cases, a complex schedule has similar performance to a constant learning rate with a decay at the end of training.

1. Introduction

Learning rate schedules play a crucial role in modern deep learning. They were originally proposed with the goal of reducing noise to ensure the convergence of SGD in convex optimization (Bottou, 1998). A variety of tuned schedules are often used, some of the most common being step-wise, linear or cosine decay. Each schedule has its own advantages and disadvantages, and they all require hyperparameter tuning. Given this heterogeneity, it would be desirable to have a coherent picture of when schedules are useful and to come up with good schedules with minimal tuning.

While we do not expect dependence on the initial learning rate in convex optimization, large learning rates behave quite differently from small learning rates in deep learning (Li et al., 2020a; Lewkowycz et al., 2020). We expect the situation to be similar for learning rate schedules: the non-convex landscape makes it desirable to reduce the learning rate as we evolve our models. The goal of this paper is to study empirically (a) in which situations schedules are beneficial and (b) when during training one should decay the learning rate. Given that stochastic gradients are used in deep learning, we expect the learning rate to be decayed at some point to reduce the noise without letting the model explore the landscape too much. This is corroborated by the fact that the minimum test error often occurs almost immediately after decaying the learning rate. Part of the paper focuses on comparing the simple schedule with standard complex schedules used in the literature, studying the situations in which these complex schedules are advantageous. We find that complex schedules are considerably helpful whenever the weight norm bounces, which happens often in the usual, optimal setups. In the presence of a bouncing weight norm, we propose an automatic scheduler which performs as well as fine-tuned schedules.

1.1. Our contribution

The goal of the paper is to study the benefits of learning rate schedules and when the learning rate should be decayed. We focus on the dynamics of the weight norm, defined as the sum over layers of the squared L2-norm of the weights in each layer: |w|² ≡ Σ_layer ||w^(layer)||₂².

We make the following observations based on extensive empirical results which include vision, NLP and RL tasks.

1. We observe that tuned step-wise schedules decay the learning rate after the weight norm bounces^a, when the weight norm starts to converge. Towards the end of training, a last decay decreases the noise. See figure 1.

2. We propose an Automatic, Bouncing into Equilibration Learning rate scheduler (ABEL). ABEL is competitive with fine-tuned schedules and needs less tuning (see table 1 and the discussion in section 2).

3. A bouncing weight norm seems necessary for non-trivial learning rate schedules to outperform the simple decay baseline. L2 regularization is required for the weight norm to bounce, and in its absence (which is common in NLP and RL) we don't see a benefit from complex schedules. This is explored in detail in section 3 and the results are summarized in table 2.

^a We define bouncing as a monotonic decrease of the weight norm followed by a monotonic increase which occurs for a fixed learning rate, as can be seen in figure 1.
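For concreteness, the scalar we track throughout the paper, |w|² = Σ_layer ||w^(layer)||₂², can be computed with a couple of lines over a model's parameter tree. A minimal JAX sketch (our own illustration, not the paper's released code; `params` is assumed to be a pytree of weight arrays):

```python
import jax
import jax.numpy as jnp

def total_weight_norm_sq(params):
    """|w|^2: sum over all layers of the squared L2 norm of their weights."""
    leaves = jax.tree_util.tree_leaves(params)
    return sum(jnp.sum(jnp.square(leaf)) for leaf in leaves)
```

Tracking this single scalar once per epoch is all the information ABEL (section 2) uses.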
Figure 1: Evolution of the weight norm when training with step-wise decay (decay times marked by black dashed lines). The learning rate is decayed after the weight norm bounces, towards its convergence. Models were evolved in optimal settings whose tuning did not use the weight norm as input. (a) Resnet-50 on ImageNet; (b) WRN28-10 on CIFAR-100.

The origin of weight bouncing. There is a simple heuristic for why the weight norm bounces. Without L2 regularization, the weight norm usually increases for learning rate values used in practice. In the presence of L2 regularization, we expect the weight norm to decrease initially. As the weight norm decrease slows down, the natural tendency for the weight norm to increase in the absence of regularization will eventually dominate. This is explained in more detail in section 4.

Weight bouncing and performance. Generally speaking, weight bouncing occurs when we have non-zero L2 regularization and large enough learning rates. While L2 regularization is crucial in vision tasks, it is not found to be that beneficial in NLP or Reinforcement Learning tasks (for example Vaswani et al. (2017) does not use L2). If the weight norm does not bounce, ABEL yields the "simple" learning rate schedule that we expect naively from the noise reduction picture: decay the learning rate once towards the end of training. We confirm that, in the absence of bouncing, such a simple schedule is competitive with more complicated ones across a variety of tasks and architectures, see table 2. We also see that the well known advantage of momentum compared with Adam for image classification in ImageNet (see Agarwal et al. (2020) for example) seems to disappear in the absence of bouncing, when we turn off L2 regularization. Weight norm bouncing thus seems empirically a necessary condition for non-trivial schedules to provide a benefit, but it is not sufficient: we observe that when the datasets are easy enough that simple schedules can get zero training error, schedules do not make a difference.

1.2. Related works

We do not know of any explicit discussion of weight bouncing in the literature. The dynamics of deep networks with L2 regularization has drawn recent attention, see for example van Laarhoven (2017); Lewkowycz and Gur-Ari (2020). Weight norm equilibration is a dynamical process (the weight norm still changes even if the equilibrium conditions are approximately satisfied) which happens soon after the bounce.

The classic justification for schedules comes from reducing the noise in a quadratic potential (Bottou, 1998). Different schedules do not provide an advantage in convex optimization unless there is a substantial mismatch between the train and test landscape (Nakkiran, 2020); however, this is not the effect that we are observing in our setup: when schedules are beneficial, their training performance is substantially different, see for example figure S4. The work of Li et al. (2020a) could be helpful to understand better the theory behind our phenomena, although it is not clear to us how their mechanism can generalize to multiple decays (or other complex schedules). There has been lots of empirical work trying to learn schedules/optimizers, see for example Maclaurin et al. (2015); Li and Malik (2016); Li et al. (2017); Wichrowska et al. (2017); Rolinek and Martius (2018); Qi et al. (2020). Our approach does not have an outer loop: the learning rate is decayed depending on the weight norm, which is conceptually similar to the ReduceLROnPlateau scheduler present in most deep learning libraries, where the learning rate is decayed when the loss plateaus. However, ReduceLROnPlateau does not perform well across our tasks. A couple of papers which thoroughly compare the performance of learning rate schedules are Shallue et al. (2019); Kaplan et al. (2020).

2. An automatic learning rate schedule based on the weight norm

ABEL and its motivation

From the two setups in figure 1 it seems that optimal schedules tend to decay the learning rate after bouncing, when the weight norm growth slows down. We can use this observation to propose ABEL (Automatic Bouncing into Equilibration Learning rate scheduler): a schedule which implements this behaviour automatically, see algorithm 1. In words, we keep track of the changes in weight norm between subsequent epochs, ∆|w_t|² ≡ |w_t|² − |w_{t−1}|². When the sign of ∆|w_t|² flips, it necessarily means that it has gone through a local minimum (because ∆|w_t|² < 0 initially): if ∆|w_{t+1}|² · ∆|w_t|² < 0 with ∆|w_t|² < 0, then |w_t| is a minimum, |w_{t+1}| > |w_t| < |w_{t−1}|. After this, the weight norm grows and slows down until at some point ∆|w_t|² is noise dominated. In this regime, ∆|w_t|² will become negative, which we take as our decaying condition. In order to reduce SGD noise, near the end of training we decay the learning rate one last time. In practice we do this last decay at around 85% of the total training time and, as we can see in SM B.1, this particular value does not really matter.
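To make this rule concrete, here is a minimal sketch of the decision logic as we read it from the description above (our own paraphrase of algorithm 1, not the released implementation; `decay_factor` and `last_decay_fraction` correspond to the decay factor and the ~85% last decay mentioned in the text):

```python
class ABEL:
    """Decay the LR when the weight norm starts decreasing again after its bounce,
    plus one final decay near the end of training."""

    def __init__(self, base_lr, decay_factor=0.1, total_epochs=100, last_decay_fraction=0.85):
        self.lr = base_lr
        self.decay_factor = decay_factor
        self.last_decay_epoch = int(last_decay_fraction * total_epochs)
        self.prev_norm = None        # |w_{t-1}|^2
        self.prev_delta = None       # Delta|w_{t-1}|^2
        self.reached_minimum = False  # has the weight norm bounced yet?

    def update(self, epoch, weight_norm_sq):
        """Call once per epoch with |w_t|^2; returns the learning rate to use."""
        if self.prev_norm is not None:
            delta = weight_norm_sq - self.prev_norm  # Delta|w_t|^2
            if self.prev_delta is not None:
                if self.prev_delta < 0 and delta > 0:
                    self.reached_minimum = True       # sign flip: weight norm bounced
                elif self.reached_minimum and delta < 0:
                    self.lr *= self.decay_factor      # growth stalled: decay
                    self.reached_minimum = False
            self.prev_delta = delta
        self.prev_norm = weight_norm_sq
        if epoch == self.last_decay_epoch:
            self.lr *= self.decay_factor              # final noise-reduction decay
        return self.lr
```

Decaying only after a bounce has been observed (rather than at any negative ∆|w_t|²) keeps the early phase, where the weight norm is still monotonically decreasing, from triggering a decay.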
[Figure (caption not recovered): test error, weight norm and learning rate curves comparing ABEL, cosine and step-wise schedules. (a, c, e) Resnet-50 on ImageNet; (b, d, f) WRN28-10 on CIFAR-100.]

[Figure (caption not recovered): test error curves for step-wise, ABEL and cosine schedules.]

[…] step-wise decays) and performs as well as tuned schedules. It is also quite robust compared with other automatic methods because it relies on the weight norm, which is mostly noise free through training (compared with other batched quantities like gradients or losses).

An algorithm similar in simplicity and interpretability is ReduceLROnPlateau, which is one of the basic schedulers of PyTorch or TensorFlow and decays the learning rate whenever the loss equilibrates. We train a Resnet-50 ImageNet model and a WRN 28-10 CIFAR-10 model with this algorithm, see SM B.3 for details. We use the default hyperparameters, and for the ImageNet experiment the learning rate does not decay at all, yielding a test error of 47.4. For CIFAR-10, ReduceLROnPlateau does fairly well, with a test error of 3.9; however, the learning rate decays without bound rather fast. These two experiments suggest that ReduceLROnPlateau can not really compete with the schedules described above.
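For reference, the baseline being compared against here is the standard PyTorch scheduler with default-style hyperparameters; roughly (our own snippet, not the exact SM B.3 configuration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for ResNet-50 / WRN 28-10
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Decay by `factor` when the monitored metric has not improved for `patience` epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.1, patience=10)

for epoch in range(200):
    # ... run one epoch of training and compute a metric to monitor ...
    monitored_loss = torch.rand(1).item()  # placeholder metric for illustration
    scheduler.step(monitored_loss)
```

The contrast with ABEL is that the monitored quantity here is a noisy batched loss, rather than the comparatively noise-free weight norm.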
[…] threshold, exit training after a small number of epochs (to process the last decay). This approach does not have a manual decay at the end of training. Such a training setup would not be possible for cosine/linear decay by construction, since they depend on the training budget. This seems hard for step-wise decay since there is no way to predetermine how to decay the learning rate automatically.

3. Schedules and performance in the absence of a bouncing weight norm

In this section, we study settings where the weight norm does not bounce to understand the impact of learning rate schedules on performance. Setups without L2 regularization are the most common situation with no bouncing; these setups often present a monotonically increasing weight norm. It is not clear to us what characteristics of a task make L2 regularization beneficial, but it seems that vision benefits considerably more from it than NLP or RL.

We conduct an extensive set of experiments in the realms of vision, NLP and RL, and the results are summarized in table 2. In these experiments, we compare complex learning rate schedules with a simple schedule where the model is evolved with a constant learning rate and decayed once towards the end of training. This simple schedule mainly reduces noise: the error decreases considerably immediately after decaying and it does not change much afterwards (it often increases). Across these experiments, we observe that complicated learning rate schedules are not significantly better than the simple ones. For a couple of tasks (like ALBERT finetuning or WRN on CIFAR-100), complex schedules are slightly better (∼ 0.3%) than the simple decay, but this small advantage is nothing compared with the substantial advantage that schedules have in vision tasks with L2. Another situation where there is no bouncing weight norm is for small learning rates, for example VGG-16 with learning rate 0.01; in such a case there is also no benefit from using complex schedules, see SM B.5 for more details. Note that in this paper we are using L2 regularization and weight decay interchangeably: what matters is that there is explicit weight regularization. These experiments also show that the well known advantage of momentum versus Adam for vision tasks is only significant in the presence of L2. In the absence of a benefit from L2 regularization/weight decay it seems like Adam is a better optimizer; Agarwal et al. (2020) suggested that this is because it can adjust the learning rate of each layer appropriately, and it would be interesting to understand whether there is any connection between that and bouncing.

These experiments have a growing weight norm, as can be seen in SM C.2. While the weight norm does not have to be always increasing in the absence of L2 regularization, this is a function of the learning rate (see section 4.1), and the learning rates used in practice exhibit this property. Homogeneous networks with cross entropy loss will have an increasing weight norm at late times, see Lyu and Li (2020). Even if a simple schedule is competitive, this does not imply that other features of convex optimization, like the independence of performance on the learning rate, carry over. We repeat the CIFAR-100 experiments for a fixed small learning rate of 0.02 (the same as the final learning rate for the simple schedule): the error with L2 = 0 is 23.8 while with L2 ≠ 0 it is 29.7. We see that while there is a performance gap between small and large learning rates, this gap is much smaller if there is no bouncing (difference in error rate of 1.2% for L2 = 0 vs 7.5% for L2 ≠ 0). For a fair comparison with small learning rates, we evolved these experiments for 5 times longer than the large learning rates, but this did not give any benefit.

While the experiments presented in table 2 do not have L2 regularization, some NLP architectures like Devlin et al. (2019); Brown et al. (2020) have weight decays of 0.01 and 0.1 respectively. We tried adding weight decay to our translation models and, while performance did not change substantially, we were not able to get a bouncing weight norm.

The effect of different learning rate schedules in NLP was also studied thoroughly in appendix D.6 of Kaplan et al. (2020), with the similar conclusion that as long as the learning rate is not small and is decayed near the end of training, performance stays roughly the same.

The presence of a bouncing weight norm does not guarantee that schedules are beneficial. From this section, a bouncing weight norm seems to be a necessary condition for learning rate schedules to matter, but it is not a sufficient condition. Learning rate schedules seem only advantageous if the training task is hard enough. In our experience, if the training data can be memorized with a simple learning rate schedule before the weight norm has bounced, then more complex schedules are not useful. This can be seen by removing data augmentation in our Wide Resnet CIFAR experiments, see figure 4. In the presence of data augmentation, simple schedules can not reach training error 0 even when evolved for 200 epochs, see SM.

4. Understanding weight norm bouncing

In this section, we will pursue some first steps towards understanding the mechanism behind the phenomena that we found empirically in the previous sections.

4.1. Intuition behind bouncing behaviour

We can build intuition about the dynamics of the weight norm by studying its dynamics under SGD updates:

∆|w_{t+1}|² = η²|g_t|² − 2ηλ|w_t|² − 2η g_t · w_t + O(η²λ)    (1)
where η, λ are the learning rate and L2 regularization coefficient, g_t ≡ dL_t/dw_t is the gradient with respect to the loss (in the absence of the L2 term), and we have used that empirically ηλ ≪ 1. This equation holds layer by layer, see the SM for more details about it. In the absence of L2 regularization, for large enough learning rates (η > 2 g_t · w_t / |g_t|²), this suggests that the weight norm will be increasing.

Equation 1 can be further simplified for scale invariant networks^b, which satisfy g_t · w_t = 0, see for example van Laarhoven (2017).

^b Scale invariant networks are defined by network functions which are independent of the weight norm: f(αw) = f(w). The weight norm of these functions still affects its dynamics (Li and Arora, 2019).

Table 2: Comparison of performance between a simple learning rate decay and a "complex" decay among tasks: "complex" means cosine decay for vision tasks and linear decay for NLP and RL. For NLP and RL tasks higher metrics imply better performance, while for vision tasks lower error denotes better performance. None of these tasks (except for the vision task with L2, used as a reference) have weight norm bouncing nor an advantage from non-simple schedules. We have averaged the RL tasks over 3 runs and their difference is compatible with noise. See table S1 for the individual GLUE scores; as is common, we have omitted the problematic WNLI.

In our experience, the only necessary condition for a model to have a bouncing weight norm is that it has L2 regularization (or weight decay) and the learning rate is large enough. We expect the previous intuition to apply to other optimizers with weight decay.
Empirically, we have seen that different optimizers, losses and batch sizes can have a bouncing weight norm.

Dependence on initialization scale. One could wonder if the bounce would disappear if we change the initialization of the weights so that the initial weight norm is smaller than the minimum of the bounce with the original normalization. We studied this in figure 5b (see SM for more details) and we see how even for very small initialization scales there is a bouncing weight norm. If the initialization scale is too small, the bouncing weight norm disappears and the performance gets significantly degraded.

4.2. Towards understanding the benefits of bouncing and schedules

While it is still unclear why bouncing is correlated with the benefit of schedules, we would like to point in some directions which could help provide a more complete picture. To better understand this phenomenon it would be useful to distill its properties and find the simplest model that captures it. The bouncing of the weight norm appears to be generic, as long as we have L2 regularization and the learning rate is big enough. We believe that learning rate schedules being only advantageous for hard tasks (as we discussed in section 3) is the principal roadblock to finding theoretically tractable […]

[Figure (caption not recovered): weight norm and test error curves; legend: n = 0, Error 18.4; n = 1, Error 18.0; n = 2, Error 18.5; n = 3, Error 17.8; n = 4, Error 18.5; n = 5, Error 18.9; n = 6, Error 20.2; n = 8, Error 24.3; ABEL decay.]
[…] be beneficial. In other setups, training with a constant learning rate and decaying it at the end of training should not hurt performance and might be a preferable, simpler method.

ABEL's hyperparameters. The main hyperparameters of ABEL are the base learning rate and decay factor: while our schedule is not hyperparameter free, ABEL is more robust than other schedules to these parameters because of its adaptive nature.

Warmup. Our discussion focuses on learning rate decays and we have not studied the effects of warmup. This is an interesting problem but seems unrelated to bouncing weight norms and the benefit of learning rate schedules at late times. In this way, whenever warmup is beneficial and widely used (such as large batch Imagenet or Transformer models) we added it on top of our schedule.

Comparison between ABEL and other schedules. We have mainly compared ABEL with step-wise decays and cosine decay. It can easily be compared with step-wise decay and is objectively better because it does not require fine-tuning when to decay the learning rate. Compared with cosine decay, the learning rate in ABEL does not depend explicitly on the total training budget and it is thus preferred for situations where we do not know for how long we want to train or might want to pick up training where it was left. Compared with other automatic methods, it does not require an outer loop nor adding complicated measurements to training; it only depends on the weight norm, which is pretty stable through training (compared with quantities like the training loss or gradients).

The role of simple schedules is to reduce noise. We used simple schedules as a baseline to reduce SGD noise. This picture is justified by the fact that the most dramatic drop in test error for such schedules is always immediately after the decay, as can be seen in the training curves of the SM.

Memory and compute considerations. ABEL only depends on the current and past weight norm (two scalars), so it does not add any significant compute nor memory cost.

Weight norm bounce with Adam. Despite Adam having an implicit schedule, it still exhibits weight norm bouncing, as can be seen in the SM.

What is the behaviour of the layerwise weight norm? As discussed previously and in SM B.2, most layers exhibit the same pattern as the total weight norm.

Understanding the source of the generalization advantage of learning rate schedules. It would be nice to understand if the bouncing of the weight norm is a proxy for some other phenomena. While we have tried tracking other simple quantities, the weight norm seems the best predictor for when to decay the learning rate. In order to make theoretical progress, the two significant roadblocks that we identify are that this phenomenon requires large learning rates and hard datasets, both of which are complicated to study theoretically.

Acknowledgments

The authors would like to thank Anders Andreassen, Yasaman Bahri, Ethan Dyer, Orhan Firat, Pierre Foret, Guy Gur-Ari, Jaehoon Lee, Behnam Neyshabur and Vinay Ramasesh for useful discussions.

References

Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, and Cyril Zhang. Disentangling adaptive gradient methods from learning rates, 2020.

Léon Bottou. Online learning and stochastic approximations, 1998.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization, 2020.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.

Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel L. K. Yamins, and Hidenori Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics, 2020.
Aitor Lewkowycz and Guy Gur-Ari. On the training dynamics of deep networks with L2 regularization, 2020.

Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism, 2020.

Ke Li and Jitendra Malik. Learning to optimize, 2016.

Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks, 2020a.

Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning, 2017.

Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning, 2019.

Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate, 2020b.

Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks, 2020.

Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning, 2015.

Preetum Nakkiran. Learning rate annealing can provably help generalization, even for convex problems, 2020.

Xiaoman Qi, PanPan Zhu, Yuebin Wang, Liqiang Zhang, Junhuan Peng, Mengfan Wu, Jialong Chen, Xudong Zhao, Ning Zang, and P. Takis Mathiopoulos. MLRSNet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding, 2020.

Michal Rolinek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning, 2018.

Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training, 2019.

Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks, 2020.

Twan van Laarhoven. L2 regularization versus batch and weight normalization, 2017.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.

Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of neural networks with normalization, weight decay, and SGD, 2020.

Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize, 2017.

Sho Yaida. Fluctuation-dissipation relations for stochastic gradient descent, 2018.
Supplementary material
A. Experimental settings
We are using Flax (https://github.com/google/flax), which is based on JAX (Bradbury et al., 2018). The implementations are based on the Flax examples and the repository https://github.com/google-research/google-research/tree/master/flax_models/cifar (Foret et al., 2020). Our simple Flax implementation can be found in SM E. All datasets are downloaded from tensorflow-datasets and we used their training/test split.
All experiments use the same seed for the weights at initialization and we consider only one such initialization unless
otherwise stated. We have not seen much variance in the phenomena described, see Table 1, 2 description. All experiments
use cross-entropy loss.
We run all experiments on v3-8 TPUs. The WideResnet 28-10 CIFAR experiments take roughly 1h per run, the ResNet-50 Imagenet experiments take roughly 5h per run, the PyramidNet CIFAR experiments take roughly 50h per run, the Transformer experiments take roughly 24h per run, and the ALBERT finetuning experiments take around 10–20 minutes per task.
Here we describe the particularities of each experiment in the figure/tables.
Table 1. All models use momentum with a momentum parameter of 0.9 and none of them uses dropout. ABEL considerations: learning rate decay 0.1 for ImageNet and SVHN and 0.2 for CIFAR-100. We force ABEL to decay at 85% of the total epoch budget by the same decay factor. We run the CIFAR-10 and CIFAR-100 WideResnet experiments with three different seeds and report the average accuracy; the standard deviation across experiments, for all schedules on CIFAR-10 and CIFAR-100, is ≤ 0.1%, except for ABEL on CIFAR-100, which has a standard deviation of 0.2% (the standard error of the mean is 0.1%). From these observations we only consider a single run for the other experiments. For the CIFAR and WRN experiments, the gradient norm is clipped to 5.
• Resnet-50 on Imagenet: learning rate of 0.8, L2 regularization of 0.0001 (we do not decay the batch-norm parameters
here), label smoothing of 0.1 and 90 epochs. All experiments include a linear warmup of 5 epochs. Standard data
augmentation: horizontal flips and random crops. Step-wise decay: multiply learning rate by α = 0.1 at 30, 60 epochs.
• Wide Resnet 16-8 on SVHN: learning rate of 0.01, L2 regularization of 0.0005, batch size of 128, no dropout and
evolved for 160 epochs. Training data includes the "extra" training data and no data augmentation. Step-wise decay:
multiply learning rate by α = 0.1 at 80, 120 epochs.
• Wide Resnet 28-10 on CIFAR10/100: learning rate of 0.1, L2 regularization of 0.0005, batch size of 128 and evolved
for 200 epochs. Standard data augmentation: horizontal flips and random crops. Step-wise decay: multiply learning
rate by α = 0.2 at 60, 120, 160 epochs.
• Shake-Drop PyramidNet on CIFAR100: learning rate of 0.1, L2 regularization of 0.0005, batch size 256, evolved for
1800 epochs. Uses AutoAugment and cutout as data augmentation. Because training this model takes a long time and
weight bounces generally take longer at smaller learning rates, we found it convenient to average the weight norm
every five epochs when using ABEL in order to avoid the decays from being noise dominated (without this, the third
decay happens too early and the test error increases by 0.2%).
• VGG-16 on CIFAR10: learning rate of 0.05, L2 regularization of 0.0005, batch size 128, evolved for 200 epochs.
Basic data augmentation and no batch norm.
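The step-wise decays listed above are all piecewise-constant schedules; as a reference, a minimal sketch of such a schedule function (our own illustration, using the ImageNet values as an example):

```python
def stepwise_lr(epoch, base_lr=0.8, alpha=0.1, decay_epochs=(30, 60)):
    """Multiply the base learning rate by alpha at each epoch in decay_epochs."""
    lr = base_lr
    for e in decay_epochs:
        if epoch >= e:
            lr *= alpha
    return lr
```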
Table 2 NLP, RL. All these experiments use ADAM, no dropout nor weight decay.
• Base Transformer trained for translation: learning rate 1.98 · 10⁻³, learning rate warmup of 10k steps, evolved for 100k steps, batch size 256, uses reverse translation. Simple decay: decay by 0.1 at 90k steps. The translation tasks
correspond to WMT’17.
• ALBERT and Glue finetuning. Fine tuned from the base ALBERT model of https://github.com/google-research/albert.
Tasks: All tasks were trained for 10k steps (including 1k linear warmup steps) for four learning rates {10⁻⁵, 2 · 10⁻⁵, 5 · 10⁻⁵, 10⁻⁶}; we report the best learning rate. Batch size 16. See table S1 for the specific scores. Simple decay:
decay by 0.1 at 0.8 of training. The individual scores are summarized in S1.
Task          MNLI(m/mm)  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE   Average
Linear Decay  82.6/83.4   88.1  92.1  92.8   58.5  91.0   88.7  70.8  83.1
Simple Decay  82.9/83.5   87.6  91.1  92.1   57.0  91.2   88.2  73.3  82.9

Table S1: Evaluation metric is accuracy except for CoLA (Matthews correlation), MRPC (F1) and STS-B (Pearson correlation).
• PPO: learning rate 2.5 · 10⁻⁴, batch size 256, 8 agents. Evolved for 40M frames. Simple decay: decay by 0.1 at 0.9
of training. Rest of configuration is the default configuration of https://github.com/google/flax/examples. Reported
score is the average score over the last 100 episodes. We have three runs per task and schedule. The scores with a
95% confidence interval are: Seaquest: Linear: 1750 ± 260, Simple: 1850 ± 40; Pong: Linear 21.0 ± 0.01, Simple
20.7 ± 0.5; Qbert: Linear 22300 ± 2950, Simple 23000 ± 1800.
Table 2 Vision.
These experiments are the same as table 1 but without L2. The simple decay is defined by evolving with a fixed learning rate during 85% of the time and then decaying it (with decay factor 0.1 for ImageNet and 0.2 for CIFAR). In the case of ImageNet+Adam, the learning rate was reduced to 0.0008. For simple decay with L2 ≠ 0, the test error would often increase after decaying; in such cases, we choose the test error at the minimum of the training loss, see fig S8a for an experiment with this behaviour.
Fig 5a: VGG-5 trained with η = 0.25, λ = 0.005 and decayed by a factor of 10 at the respective time.
Rest of figures. Small modifications or further plots of previous experiments, changes are specified explicitly.
B. Experimental details
B.1. The mild dependence on the last decay epoch
In figure S1 we rerun the CIFAR experiments with ABEL for different values of last_decay_epoch. We see how the final performance does not depend too much on this value.

Figure S1: Models trained with ABEL with different values for last_decay_epoch. The plot shows the difference in test error with respect to the default of 85% of training time (170 epochs), as a function of the epoch of the last decay (160 to 200), for CIFAR-10 and CIFAR-100. We see that the test error does not change too much; most changes are less than 0.2%.
Figure S2: (a) Evolution of the quotient between the weight norm of different layers and the total weight norm. (b) Evolution of the weight norm of different layers, normalized by its value at initialization. Both panels: Resnet-50 on Imagenet.
Figure S3: Experiments from the main section with a ReduceLROnPlateau ("Decay on Plateau") schedule. Panels show learning rate, training loss and test error over training. (a–c) Wide Resnet on CIFAR-10; (d–f) Resnet-50 on Imagenet.
Figure S4: Wide Resnets on CIFAR without data augmentation can reach zero training error with simple decay. Panels show test error, train error and weight norm for cosine and simple schedules, with and without data augmentation ("+Aug"). (a–c) Wide Resnet on CIFAR-10; (d–f) Wide Resnet on CIFAR-100.
Table S2: Test error for VGG-16 and CIFAR-10 for different small learning rate setups. Cosine decay is only superior at large learning rates when there is a bouncing weight norm.

Figure S5: In order for the weight norm to bounce in our training run, the learning rate has to be big enough. We see how for learning rate 0.01 the weight norm does not bounce. This is true even if we evolve for five times longer. Panels (c, d): test error and weight norm for VGG-16 on CIFAR-10 trained for 1000 epochs with cosine and simple schedules at learning rate 0.01.
Figure S6: Wide Resnet 28-10 on CIFAR-100 experiments with different initialization scales σ_w (from 1 down to 0.003). (a) Test error: the two smallest initializations present a substantial degradation of performance. (b) Despite the weight norm being so different at initialization, the final weight norm is the same for the different models. (c) Zoomed version of (b): all the initializations but the smallest two exhibit clear bouncing.
[Figure (caption not recovered): learning rate, weight norm and test error curves for ABEL, cosine and step-wise schedules. (a–c) Wide Resnet on CIFAR-10; (d–f) VGG on CIFAR-10; (g–i) Wide Resnet 16-8 on SVHN; (j–l) PyramidNet on CIFAR-100.]
Figure S8: A couple of vision tasks of table 2. We see how in the absence of L2, a non-trivial schedule does not affect the performance and the weight norm is usually monotonically increasing. Panels show test error, weight norm and learning rate for simple and cosine schedules with L2 = 0 and L2 ≠ 0. (a–c) WRN 28-10 on CIFAR-100; (d–f) Resnet-50 with Adam on Imagenet. We also see how weight norm bouncing occurs despite Adam having an implicit schedule.
Figure S9: Training curves for the Transformer trained on the English-German translation task WMT'14. We see that there is no significant difference between the different learning rate schedules (Simple, Linear and 1/Sqrt). Panels show (a) BLEU and (b) weight norm for the Transformer on En-De WMT, and (c) the learning rate, over thousands of steps.
Figure S10: Game score of different RL games for the two considered schedules. Mean score over three runs, with individual scores in lighter color. "Simple" is just decaying the base learning rate by a factor of 10 at 90% of training and "Linear" is linear decay. (a–c) Game score and (d–f) weight norm over millions of frames for PPO on Pong, Qbert and Seaquest.
Figure S11: Training curves for the Imagenet experiment of section 2, figure 3, with learning rate 16. (a) Test error, (b) weight norm and (c) learning rate for ABEL, cosine and step-wise schedules (Resnet-50 on ImageNet).
Writing the SGD update with L2 regularization as ∆w_t ≡ w_{t+1} − w_t = −η (g_t + λ w_t), we have:

∆|w_{t+1}|² ≡ |w_{t+1}|² − |w_t|² = ∆w_t · ∆w_t + 2 w_t · ∆w_t
            = η²|g_t|² − (2 − ηλ) ηλ |w_t|² − 2η (1 − ηλ) g_t · w_t
            ≈ η²|g_t|² − 2ηλ|w_t|² − 2η g_t · w_t    (S2)

The terms that we are dropping are subleading for our purposes because ηλ ≪ 1 in practice. We expect these terms to be only relevant if the dominant terms cancel each other: η²|g_t|² − 2ηλ|w_t|² − 2η g_t · w_t ∼ 0.
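A quick numerical check of this approximation (our own sanity-check snippet, not part of the paper): for random w and g and practical values of η, λ, one SGD + L2 step reproduces ∆|w_{t+1}|² up to the dropped O(ηλ) corrections.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
g = rng.normal(size=1000)
eta, lam = 0.1, 5e-4  # learning rate and L2 coefficient, eta * lam << 1

w_next = w - eta * (g + lam * w)                 # one SGD step with L2
exact = np.dot(w_next, w_next) - np.dot(w, w)    # Delta|w_{t+1}|^2
approx = eta**2 * np.dot(g, g) - 2 * eta * lam * np.dot(w, w) - 2 * eta * np.dot(g, w)
print(exact, approx)  # agree up to O(eta^2 * lam) terms
```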
g · w = 0 for scale invariant layers. A scale invariant layer is such that its network function satisfies f(αw) = f(w); the bare loss only depends on the weights through the network function. If we divide the weight into its norm and its direction, w_a = |w| ŵ_a, and use dL/d|w| = 0, we get that:

0 = |w| dL/d|w| = |w| Σ_a (dL/dw_a)(dw_a/d|w|) = Σ_a (dL/dw_a) |w| ŵ_a = g · w    (S3)
For scale invariant networks without L2, the change in the weights becomes smaller as the weight norm equilibrates. Using the SGD equation (∆w_t = −η g_t) together with g_t · w_t = 0, we get that:

cos ∠(w_{t+1}, w_t) = (w_{t+1} · w_t) / (|w_{t+1}| |w_t|) = |w_t| / |w_{t+1}|

sin ∠(w_{t+1}, w_t) = √(1 − cos² ∠) = √( (|w_{t+1}|² − |w_t|²) / |w_{t+1}|² )    (S4)
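This relation is easy to verify numerically (again, our own snippet): project a random gradient onto the subspace orthogonal to w so that g · w = 0, take one step, and compare.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=500)
g = rng.normal(size=500)
g -= np.dot(g, w) / np.dot(w, w) * w   # enforce g.w = 0, as for a scale invariant layer

eta = 0.1
w_next = w - eta * g
cos_angle = np.dot(w_next, w) / (np.linalg.norm(w_next) * np.linalg.norm(w))
print(cos_angle, np.linalg.norm(w) / np.linalg.norm(w_next))  # the two values coincide
```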
The main reason why this is a Flax implementation is that update_step takes a train_step function and outputs a train_step function. This can be easily added to the Flax examples/baselines; we only have to make two modifications. Before starting the training loop, we initialize ABEL^c:
# Copyright 2021 Google LLC.
# SPDX-License-Identifier: Apache-2.0
The only remaining modification is to add the ABEL update rule at the end of each epoch, which takes the current train_step function and the mean weight norm and returns a (possibly updated) train_step function. ABEL will update the optimizer if the learning rate has to be decayed.
# Copyright 2021 Google LLC.
# SPDX-License-Identifier: Apache-2.0
^c Note that ABELSchedule takes train_step_fn as an argument. This is a function that takes a learning_rate_fn and outputs a train_step (i.e. train_step_fn = lambda lr_fn: functools.partial(train_step, learning_rate_fn=lr_fn)).
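Since the listings above survive only as their license headers, here is a rough sketch of the kind of wiring the footnote describes (entirely our own reconstruction of the stated interface, not the original Flax code): the weight-norm tracker decides when to decay, and whenever the learning rate changes, train_step is rebuilt from train_step_fn with a new constant learning_rate_fn.

```python
import functools

def make_train_step_fn(train_step):
    """As in the footnote: wrap an existing train_step so a learning_rate_fn can be swapped in."""
    return lambda lr_fn: functools.partial(train_step, learning_rate_fn=lr_fn)

def abel_end_of_epoch(train_step, train_step_fn, abel, epoch, weight_norm_sq):
    """End-of-epoch hook: feed the mean weight norm to an ABEL tracker (e.g. the sketch
    in section 2) and, if the learning rate changed, rebuild train_step."""
    old_lr = abel.lr
    new_lr = abel.update(epoch, weight_norm_sq)
    if new_lr != old_lr:
        train_step = train_step_fn(lambda step: new_lr)
    return train_step
```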