How To Decay Your Learning Rate

Aitor Lewkowycz¹

¹ Blueshift, Alphabet. Correspondence to: Aitor Lewkowycz <alewkowycz@google.com>.

Abstract

Complex learning rate schedules have become an integral part of deep learning. We find empirically that common fine-tuned schedules decay the learning rate after the weight norm bounces. This leads to the proposal of ABEL: an automatic scheduler which decays the learning rate by keeping track of the weight norm. ABEL's performance matches that of tuned schedules and is more robust with respect to its parameters. Through extensive experiments in vision, NLP, and RL, we show that if the weight norm does not bounce, we can simplify schedules even further with no loss in performance: in such cases, a complex schedule has similar performance to a constant learning rate with a decay at the end of training.

1. Introduction

Learning rate schedules play a crucial role in modern deep learning. They were originally proposed with the goal of reducing noise to ensure the convergence of SGD in convex optimization (Bottou, 1998). A variety of tuned schedules are often used, some of the most common being step-wise, linear or cosine decay. Each schedule has its own advantages and disadvantages, and they all require hyperparameter tuning. Given this heterogeneity, it would be desirable to have a coherent picture of when schedules are useful and to come up with good schedules with minimal tuning.

While we do not expect dependence on the initial learning rate in convex optimization, large learning rates behave quite differently from small learning rates in deep learning (Li et al., 2020a; Lewkowycz et al., 2020). We expect the situation to be similar for learning rate schedules: the non-convex landscape makes it desirable to reduce the learning rate as we evolve our models. The goal of this paper is to study empirically (a) in which situations schedules are beneficial and (b) when during training one should decay the learning rate. Given that stochastic gradients are used in deep learning, we expect the learning rate to be decayed at some point to reduce the noise without letting the model explore the landscape too much. This is corroborated by the fact that the minimum test error often occurs almost immediately after decaying the learning rate. Part of the paper focuses on comparing the simple schedule with standard complex schedules used in the literature, studying the situations in which these complex schedules are advantageous. We find that complex schedules are considerably helpful whenever the weight norm bounces, which happens often in the usual, optimal setups. In the presence of a bouncing weight norm, we propose an automatic scheduler which performs as well as fine-tuned schedules.

1.1. Our contribution

The goal of the paper is to study the benefits of learning rate schedules and when the learning rate should be decayed. We focus on the dynamics of the weight norm, defined as the sum over layers of the squared L2-norm of the weights in each layer: |w|² ≡ Σ_layer ||w^(layer)||₂².

We make the following observations based on extensive empirical results which include vision, NLP and RL tasks.

1. We observe that tuned step-wise schedules decay the learning rate after the weight norm bounces^a, when the weight norm starts to converge. Towards the end of training, a last decay decreases the noise. See figure 1.

2. We propose an Automatic, Bouncing into Equilibration Learning rate scheduler (ABEL). ABEL is competitive with fine-tuned schedules and needs less tuning (see table 1 and the discussion in section 2).

3. A bouncing weight norm seems necessary for non-trivial learning rate schedules to outperform the simple decay baseline. L2 regularization is required for the weight norm to bounce, and in its absence (which is common in NLP and RL) we don't see a benefit from complex schedules. This is explored in detail in section 3 and the results are summarized in table 2.

^a We define bouncing as a monotonic decrease of the weight norm followed by a monotonic increase which occurs for a fixed learning rate, as can be seen in figure 1.
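For concreteness, the scalar we track throughout the paper, |w|² = Σ_layer ||w^(layer)||₂², can be computed with a couple of lines over a model's parameter tree. A minimal JAX sketch (our own illustration, not the paper's released code; `params` is assumed to be a pytree of weight arrays):

```python
import jax
import jax.numpy as jnp

def total_weight_norm_sq(params):
    """|w|^2: sum over all layers of the squared L2 norm of their weights."""
    leaves = jax.tree_util.tree_leaves(params)
    return sum(jnp.sum(jnp.square(leaf)) for leaf in leaves)
```

Tracking this single scalar once per epoch is all the information ABEL (section 2) uses.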
Figure 1: Evolution of the weight norm when training with step-wise decay (decay times marked by black dashed lines). The learning rate is decayed after the weight norm bounces, towards its convergence. Models were evolved in optimal settings whose tuning did not use the weight norm as input. (a) Resnet-50 on ImageNet; (b) WRN28-10 on CIFAR-100.

The origin of weight bouncing. There is a simple heuristic for why the weight norm bounces. Without L2 regularization, the weight norm usually increases for learning rate values used in practice. In the presence of L2 regularization, we expect the weight norm to decrease initially. As the weight norm decrease slows down, the natural tendency for the weight norm to increase in the absence of regularization will eventually dominate. This is explained in more detail in section 4.

Weight bouncing and performance. Generally speaking, weight bouncing occurs when we have non-zero L2 regularization and large enough learning rates. While L2 regularization is crucial in vision tasks, it is not found to be that beneficial in NLP or Reinforcement Learning tasks (for example Vaswani et al. (2017) does not use L2). If the weight norm does not bounce, ABEL yields the "simple" learning rate schedule that we expect naively from the noise reduction picture: decay the learning rate once towards the end of training. We confirm that, in the absence of bouncing, such a simple schedule is competitive with more complicated ones across a variety of tasks and architectures, see table 2. We also see that the well known advantage of momentum compared with Adam for image classification in ImageNet (see Agarwal et al. (2020) for example) seems to disappear in the absence of bouncing, when we turn off L2 regularization. Weight norm bouncing thus seems empirically a necessary condition for non-trivial schedules to provide a benefit, but it is not sufficient: we observe that when the datasets are easy enough that simple schedules can get zero training error, schedules do not make a difference.

1.2. Related works

We do not know of any explicit discussion of weight bouncing in the literature. The dynamics of deep networks with L2 regularization has drawn recent attention, see for example van Laarhoven (2017); Lewkowycz and Gur-Ari (2020). Weight norm equilibration is a dynamical process (the weight norm still changes even if the equilibrium conditions are approximately satisfied) which happens soon after the bounce.

The classic justification for schedules comes from reducing the noise in a quadratic potential (Bottou, 1998). Different schedules do not provide an advantage in convex optimization unless there is a substantial mismatch between the train and test landscape (Nakkiran, 2020); however, this is not the effect that we are observing in our setup: when schedules are beneficial, their training performance is substantially different, see for example figure S4. The work of Li et al. (2020a) could be helpful to understand better the theory behind our phenomena, although it is not clear to us how their mechanism can generalize to multiple decays (or other complex schedules). There has been lots of empirical work trying to learn schedules/optimizers, see for example Maclaurin et al. (2015); Li and Malik (2016); Li et al. (2017); Wichrowska et al. (2017); Rolinek and Martius (2018); Qi et al. (2020). Our approach does not have an outer loop: the learning rate is decayed depending on the weight norm, which is conceptually similar to the ReduceLROnPlateau scheduler present in most deep learning libraries, where the learning rate is decayed when the loss plateaus. However, ReduceLROnPlateau does not perform well across our tasks. A couple of papers which thoroughly compare the performance of learning rate schedules are Shallue et al. (2019); Kaplan et al. (2020).

2. An automatic learning rate schedule based on the weight norm

ABEL and its motivation

From the two setups in figure 1 it seems that optimal schedules tend to decay the learning rate after bouncing, when the weight norm growth slows down. We can use this observation to propose ABEL (Automatic Bouncing into Equilibration Learning rate scheduler): a schedule which implements this behaviour automatically, see algorithm 1. In words, we keep track of the changes in weight norm between subsequent epochs, ∆|w_t|² ≡ |w_t|² − |w_{t−1}|². When the sign of ∆|w_t|² flips, it necessarily means that it has gone through a local minimum (because ∆|w_t|² < 0 initially): if ∆|w_{t+1}|² · ∆|w_t|² < 0 with ∆|w_t|² < 0, then |w_t| is a minimum, |w_{t+1}| > |w_t| < |w_{t−1}|. After this, the weight norm grows and slows down until at some point ∆|w_t|² is noise dominated. In this regime, ∆|w_t|² will become negative, which we take as our decaying condition. In order to reduce SGD noise, near the end of training we decay the learning rate one last time. In practice we do this last decay at around 85% of the total training time and, as we can see in SM B.1, this particular value does not really matter.
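To make this rule concrete, here is a minimal sketch of the decision logic as we read it from the description above (our own paraphrase of algorithm 1, not the released implementation; `decay_factor` and `last_decay_fraction` correspond to the decay factor and the ~85% last decay mentioned in the text):

```python
class ABEL:
    """Decay the LR when the weight norm starts decreasing again after its bounce,
    plus one final decay near the end of training."""

    def __init__(self, base_lr, decay_factor=0.1, total_epochs=100, last_decay_fraction=0.85):
        self.lr = base_lr
        self.decay_factor = decay_factor
        self.last_decay_epoch = int(last_decay_fraction * total_epochs)
        self.prev_norm = None        # |w_{t-1}|^2
        self.prev_delta = None       # Delta|w_{t-1}|^2
        self.reached_minimum = False  # has the weight norm bounced yet?

    def update(self, epoch, weight_norm_sq):
        """Call once per epoch with |w_t|^2; returns the learning rate to use."""
        if self.prev_norm is not None:
            delta = weight_norm_sq - self.prev_norm  # Delta|w_t|^2
            if self.prev_delta is not None:
                if self.prev_delta < 0 and delta > 0:
                    self.reached_minimum = True       # sign flip: weight norm bounced
                elif self.reached_minimum and delta < 0:
                    self.lr *= self.decay_factor      # growth stalled: decay
                    self.reached_minimum = False
            self.prev_delta = delta
        self.prev_norm = weight_norm_sq
        if epoch == self.last_decay_epoch:
            self.lr *= self.decay_factor              # final noise-reduction decay
        return self.lr
```

Decaying only after a bounce has been observed (rather than at any negative ∆|w_t|²) keeps the early phase, where the weight norm is still monotonically decreasing, from triggering a decay.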
[Figure (caption not recovered): test error, weight norm and learning rate curves comparing ABEL, cosine and step-wise schedules. (a, c, e) Resnet-50 on ImageNet; (b, d, f) WRN28-10 on CIFAR-100.]

[Figure (caption not recovered): test error curves for step-wise, ABEL and cosine schedules.]

[…] step-wise decays) and performs as well as tuned schedules. It is also quite robust compared with other automatic methods because it relies on the weight norm, which is mostly noise free through training (compared with other batched quantities like gradients or losses).

An algorithm similar in simplicity and interpretability is ReduceLROnPlateau, which is one of the basic schedulers of PyTorch or TensorFlow and decays the learning rate whenever the loss equilibrates. We train a Resnet-50 ImageNet model and a WRN 28-10 CIFAR-10 model with this algorithm, see SM B.3 for details. We use the default hyperparameters, and for the ImageNet experiment the learning rate does not decay at all, yielding a test error of 47.4. For CIFAR-10, ReduceLROnPlateau does fairly well, with a test error of 3.9; however, the learning rate decays without bound rather fast. These two experiments suggest that ReduceLROnPlateau can not really compete with the schedules described above.
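For reference, the baseline being compared against here is the standard PyTorch scheduler with default-style hyperparameters; roughly (our own snippet, not the exact SM B.3 configuration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for ResNet-50 / WRN 28-10
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Decay by `factor` when the monitored metric has not improved for `patience` epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.1, patience=10)

for epoch in range(200):
    # ... run one epoch of training and compute a metric to monitor ...
    monitored_loss = torch.rand(1).item()  # placeholder metric for illustration
    scheduler.step(monitored_loss)
```

The contrast with ABEL is that the monitored quantity here is a noisy batched loss, rather than the comparatively noise-free weight norm.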
[…] threshold, exit training after a small number of epochs (to process the last decay). This approach does not have a manual decay at the end of training. Such a training setup would not be possible for cosine/linear decay by construction, since they depend on the training budget. This seems hard for step-wise decay since there is no way to predetermine how to decay the learning rate automatically.

3. Schedules and performance in the absence of a bouncing weight norm

In this section, we study settings where the weight norm does not bounce to understand the impact of learning rate schedules on performance. Setups without L2 regularization are the most common situation with no bouncing; these setups often present a monotonically increasing weight norm. It is not clear to us what characteristics of a task make L2 regularization beneficial, but it seems that vision benefits considerably more from it than NLP or RL.

We conduct an extensive set of experiments in the realms of vision, NLP and RL, and the results are summarized in table 2. In these experiments, we compare complex learning rate schedules with a simple schedule where the model is evolved with a constant learning rate and decayed once towards the end of training. This simple schedule mainly reduces noise: the error decreases considerably immediately after decaying and it does not change much afterwards (it often increases). Across these experiments, we observe that complicated learning rate schedules are not significantly better than the simple ones. For a couple of tasks (like ALBERT finetuning or WRN on CIFAR-100), complex schedules are slightly better (∼ 0.3%) than the simple decay, but this small advantage is nothing compared with the substantial advantage that schedules have in vision tasks with L2. Another situation where there is no bouncing weight norm is for small learning rates, for example VGG-16 with learning rate 0.01; in such a case there is also no benefit from using complex schedules, see SM B.5 for more details. Note that in this paper we are using L2 regularization and weight decay interchangeably: what matters is that there is explicit weight regularization. These experiments also show that the well known advantage of momentum versus Adam for vision tasks is only significant in the presence of L2. In the absence of a benefit from L2 regularization/weight decay it seems like Adam is a better optimizer; Agarwal et al. (2020) suggested that this is because it can adjust the learning rate of each layer appropriately, and it would be interesting to understand whether there is any connection between that and bouncing.

These experiments have a growing weight norm, as can be seen in SM C.2. While the weight norm does not have to be always increasing in the absence of L2 regularization, this is a function of the learning rate (see section 4.1), and the learning rates used in practice exhibit this property. Homogeneous networks with cross entropy loss will have an increasing weight norm at late times, see Lyu and Li (2020). Even if a simple schedule is competitive, this does not imply that other features of convex optimization, like the independence of performance on the learning rate, carry over. We repeat the CIFAR-100 experiments for a fixed small learning rate of 0.02 (the same as the final learning rate for the simple schedule): the error with L2 = 0 is 23.8 while with L2 ≠ 0 it is 29.7. We see that while there is a performance gap between small and large learning rates, this gap is much smaller if there is no bouncing (difference in error rate of 1.2% for L2 = 0 vs 7.5% for L2 ≠ 0). For a fair comparison with small learning rates, we evolved these experiments for 5 times longer than the large learning rates, but this did not give any benefit.

While the experiments presented in table 2 do not have L2 regularization, some NLP architectures like Devlin et al. (2019); Brown et al. (2020) have weight decays of 0.01 and 0.1 respectively. We tried adding weight decay to our translation models and, while performance did not change substantially, we were not able to get a bouncing weight norm.

The effect of different learning rate schedules in NLP was also studied thoroughly in appendix D.6 of Kaplan et al. (2020), with the similar conclusion that as long as the learning rate is not small and is decayed near the end of training, performance stays roughly the same.

The presence of a bouncing weight norm does not guarantee that schedules are beneficial. From this section, a bouncing weight norm seems to be a necessary condition for learning rate schedules to matter, but it is not a sufficient condition. Learning rate schedules seem only advantageous if the training task is hard enough. In our experience, if the training data can be memorized with a simple learning rate schedule before the weight norm has bounced, then more complex schedules are not useful. This can be seen by removing data augmentation in our Wide Resnet CIFAR experiments, see figure 4. In the presence of data augmentation, simple schedules can not reach training error 0 even when evolved for 200 epochs, see SM.

4. Understanding weight norm bouncing

In this section, we will pursue some first steps towards understanding the mechanism behind the phenomena that we found empirically in the previous sections.

4.1. Intuition behind bouncing behaviour

We can build intuition about the dynamics of the weight norm by studying its dynamics under SGD updates:

∆|w_{t+1}|² = η²|g_t|² − 2ηλ|w_t|² − 2η g_t · w_t + O(η²λ)    (1)
where η, λ are the learning rate and L2 regularization coefficient, g_t ≡ dL_t/dw_t is the gradient with respect to the loss (in the absence of the L2 term), and we have used that empirically ηλ ≪ 1. This equation holds layer by layer, see the SM for more details about it. In the absence of L2 regularization, for large enough learning rates (η > 2 g_t · w_t / |g_t|²), this suggests that the weight norm will be increasing.

Equation 1 can be further simplified for scale invariant networks^b, which satisfy g_t · w_t = 0, see for example van Laarhoven (2017).

^b Scale invariant networks are defined by network functions which are independent of the weight norm: f(αw) = f(w). The weight norm of these functions still affects its dynamics (Li and Arora, 2019).

Table 2: Comparison of performance between a simple learning rate decay and a "complex" decay among tasks: "complex" means cosine decay for vision tasks and linear decay for NLP and RL. For NLP and RL tasks higher metrics imply better performance, while for vision tasks lower error denotes better performance. None of these tasks (except for the vision task with L2, used as a reference) have weight norm bouncing nor an advantage from non-simple schedules. We have averaged the RL tasks over 3 runs and their difference is compatible with noise. See table S1 for the individual GLUE scores; as is common, we have omitted the problematic WNLI.

In our experience, the only necessary condition for a model to have a bouncing weight norm is that it has L2 regularization (or weight decay) and the learning rate is large enough. We expect the previous intuition to apply to other optimizers with weight decay.
Empirically, we have seen that different optimizers, losses and batch sizes can have a bouncing weight norm.

Dependence on initialization scale. One could wonder if the bounce would disappear if we change the initialization of the weights so that the initial weight norm is smaller than the minimum of the bounce with the original normalization. We studied this in figure 5b (see SM for more details) and we see how even for very small initialization scales there is a bouncing weight norm. If the initialization scale is too small, the bouncing weight norm disappears and the performance gets significantly degraded.

4.2. Towards understanding the benefits of bouncing and schedules

While it is still unclear why bouncing is correlated with the benefit of schedules, we would like to point in some directions which could help provide a more complete picture. To better understand this phenomenon it would be useful to distill its properties and find the simplest model that captures it. The bouncing of the weight norm appears to be generic, as long as we have L2 regularization and the learning rate is big enough. We believe that learning rate schedules being only advantageous for hard tasks (as we discussed in section 3) is the principal roadblock to finding theoretically tractable […]

[Figure (caption not recovered): weight norm and test error curves; legend: n = 0, Error 18.4; n = 1, Error 18.0; n = 2, Error 18.5; n = 3, Error 17.8; n = 4, Error 18.5; n = 5, Error 18.9; n = 6, Error 20.2; n = 8, Error 24.3; ABEL decay.]
[…] be beneficial. In other setups, training with a constant learning rate and decaying it at the end of training should not hurt performance and might be a preferable, simpler method.

ABEL's hyperparameters. The main hyperparameters of ABEL are the base learning rate and decay factor: while our schedule is not hyperparameter free, ABEL is more robust than other schedules to these parameters because of its adaptive nature.

Warmup. Our discussion focuses on learning rate decays and we have not studied the effects of warmup. This is an interesting problem but seems unrelated to bouncing weight norms and the benefit of learning rate schedules at late times. In this way, whenever warmup is beneficial and widely used (such as large batch Imagenet or Transformer models) we added it on top of our schedule.

Comparison between ABEL and other schedules. We have mainly compared ABEL with step-wise decays and cosine decay. It can easily be compared with step-wise decay and is objectively better because it does not require fine-tuning when to decay the learning rate. Compared with cosine decay, the learning rate in ABEL does not depend explicitly on the total training budget and it is thus preferred for situations where we do not know for how long we want to train or might want to pick up training where it was left. Compared with other automatic methods, it does not require an outer loop nor adding complicated measurements to training; it only depends on the weight norm, which is pretty stable through training (compared with quantities like the training loss or gradients).

The role of simple schedules is to reduce noise. We used simple schedules as a baseline to reduce SGD noise. This picture is justified by the fact that the most dramatic drop in test error for such schedules is always immediately after the decay, as can be seen in the training curves of the SM.

Memory and compute considerations. ABEL only depends on the current and past weight norm (two scalars), so it does not add any significant compute nor memory cost.

Weight norm bounce with Adam. Despite Adam having an implicit schedule, it still exhibits weight norm bouncing, as can be seen in the SM.

What is the behaviour of the layerwise weight norm? As discussed previously and in SM B.2, most layers exhibit the same pattern as the total weight norm.

Understanding the source of the generalization advantage of learning rate schedules. It would be nice to understand if the bouncing of the weight norm is a proxy for some other phenomena. While we have tried tracking other simple quantities, the weight norm seems the best predictor for when to decay the learning rate. In order to make theoretical progress, the two significant roadblocks that we identify are that this phenomenon requires large learning rates and hard datasets, both of which are complicated to study theoretically.

Acknowledgments

The authors would like to thank Anders Andreassen, Yasaman Bahri, Ethan Dyer, Orhan Firat, Pierre Foret, Guy Gur-Ari, Jaehoon Lee, Behnam Neyshabur and Vinay Ramasesh for useful discussions.

References

Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, and Cyril Zhang. Disentangling adaptive gradient methods from learning rates, 2020.

Léon Bottou. Online learning and stochastic approximations, 1998.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization, 2020.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.

Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel L. K. Yamins, and Hidenori Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics, 2020.
Aitor Lewkowycz and Guy Gur-Ari. On the training dynamics of deep networks with L2 regularization, 2020.

Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism, 2020.

Ke Li and Jitendra Malik. Learning to optimize, 2016.

Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks, 2020a.

Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning, 2017.

Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning, 2019.

Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate, 2020b.

Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks, 2020.

Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning, 2015.

Preetum Nakkiran. Learning rate annealing can provably help generalization, even for convex problems, 2020.

Xiaoman Qi, PanPan Zhu, Yuebin Wang, Liqiang Zhang, Junhuan Peng, Mengfan Wu, Jialong Chen, Xudong Zhao, Ning Zang, and P. Takis Mathiopoulos. MLRSNet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding, 2020.

Michal Rolinek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning, 2018.

Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training, 2019.

Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks, 2020.

Twan van Laarhoven. L2 regularization versus batch and weight normalization, 2017.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.

Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of neural networks with normalization, weight decay, and SGD, 2020.

Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize, 2017.

Sho Yaida. Fluctuation-dissipation relations for stochastic gradient descent, 2018.
Supplementary material
A. Experimental settings
We are using Flax (https://github.com/google/flax), which is based on JAX (Bradbury et al., 2018). The implementations are based on the Flax examples and the repository https://github.com/google-research/google-research/tree/master/flax_models/cifar (Foret et al., 2020). Our simple Flax implementation can be found in SM E. All datasets are downloaded from tensorflow-datasets and we used their training/test split.
All experiments use the same seed for the weights at initialization and we consider only one such initialization unless
otherwise stated. We have not seen much variance in the phenomena described, see Table 1, 2 description. All experiments
use cross-entropy loss.
We run all experiments on v3-8 TPUs. The WideResnet 28-10 CIFAR experiments take roughly 1h per run, the ResNet-50 Imagenet experiments take roughly 5h per run, the PyramidNet CIFAR experiments take roughly 50h per run, the Transformer experiments take roughly 24h per run, and the ALBERT finetuning experiments take around 10–20 minutes per task.
Here we describe the particularities of each experiment in the figure/tables.
Table 1. All models use momentum with a momentum parameter of 0.9 and none of them uses dropout. ABEL considerations: learning rate decay 0.1 for ImageNet and SVHN and 0.2 for CIFAR-100. We force ABEL to decay at 85% of the total epoch budget by the same decay factor. We run the CIFAR-10 and CIFAR-100 WideResnet experiments with three different seeds and report the average accuracy; the standard deviation across experiments, for all schedules on CIFAR-10 and CIFAR-100, is ≤ 0.1%, except for ABEL on CIFAR-100, which has a standard deviation of 0.2% (the standard error of the mean is 0.1%). From these observations we only consider a single run for the other experiments. For the CIFAR and WRN experiments, the gradient norm is clipped to 5.
• Resnet-50 on Imagenet: learning rate of 0.8, L2 regularization of 0.0001 (we do not decay the batch-norm parameters
here), label smoothing of 0.1 and 90 epochs. All experiments include a linear warmup of 5 epochs. Standard data
augmentation: horizontal flips and random crops. Step-wise decay: multiply learning rate by α = 0.1 at 30, 60 epochs.
• Wide Resnet 16-8 on SVHN: learning rate of 0.01, L2 regularization of 0.0005, batch size of 128, no dropout and
evolved for 160 epochs. Training data includes the "extra" training data and no data augmentation. Step-wise decay:
multiply learning rate by α = 0.1 at 80, 120 epochs.
• Wide Resnet 28-10 on CIFAR10/100: learning rate of 0.1, L2 regularization of 0.0005, batch size of 128 and evolved
for 200 epochs. Standard data augmentation: horizontal flips and random crops. Step-wise decay: multiply learning
rate by α = 0.2 at 60, 120, 160 epochs.
• Shake-Drop PyramidNet on CIFAR100: learning rate of 0.1, L2 regularization of 0.0005, batch size 256, evolved for
1800 epochs. Uses AutoAugment and cutout as data augmentation. Because training this model takes a long time and
weight bounces generally take longer at smaller learning rates, we found it convenient to average the weight norm
every five epochs when using ABEL in order to avoid the decays from being noise dominated (without this, the third
decay happens too early and the test error increases by 0.2%).
• VGG-16 on CIFAR10: learning rate of 0.05, L2 regularization of 0.0005, batch size 128, evolved for 200 epochs.
Basic data augmentation and no batch norm.
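The step-wise decays listed above are all piecewise-constant schedules; as a reference, a minimal sketch of such a schedule function (our own illustration, using the ImageNet values as an example):

```python
def stepwise_lr(epoch, base_lr=0.8, alpha=0.1, decay_epochs=(30, 60)):
    """Multiply the base learning rate by alpha at each epoch in decay_epochs."""
    lr = base_lr
    for e in decay_epochs:
        if epoch >= e:
            lr *= alpha
    return lr
```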
Table 2 NLP, RL. All these experiments use ADAM, no dropout nor weight decay.
• Base Transformer trained for translation: learning rate 1.98 · 10⁻³, learning rate warmup of 10k steps, evolved for 100k steps, batch size 256, uses reverse translation. Simple decay: decay by 0.1 at 90k steps. The translation tasks
correspond to WMT’17.
• ALBERT and Glue finetuning. Fine tuned from the base ALBERT model of https://github.com/google-research/albert.
Tasks: All tasks were trained for 10k steps (including 1k linear warmup steps) for four learning rates {10⁻⁵, 2 · 10⁻⁵, 5 · 10⁻⁵, 10⁻⁶}; we report the best learning rate. Batch size 16. See table S1 for the specific scores. Simple decay:
decay by 0.1 at 0.8 of training. The individual scores are summarized in S1.
Task          MNLI(m/mm)  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE   Average
Linear Decay  82.6/83.4   88.1  92.1  92.8   58.5  91.0   88.7  70.8  83.1
Simple Decay  82.9/83.5   87.6  91.1  92.1   57.0  91.2   88.2  73.3  82.9

Table S1: Evaluation metric is accuracy except for CoLA (Matthews correlation), MRPC (F1) and STS-B (Pearson correlation).
• PPO: learning rate 2.5 · 10⁻⁴, batch size 256, 8 agents. Evolved for 40M frames. Simple decay: decay by 0.1 at 0.9
of training. Rest of configuration is the default configuration of https://github.com/google/flax/examples. Reported
score is the average score over the last 100 episodes. We have three runs per task and schedule. The scores with a
95% confidence interval are: Seaquest: Linear: 1750 ± 260, Simple: 1850 ± 40; Pong: Linear 21.0 ± 0.01, Simple
20.7 ± 0.5; Qbert: Linear 22300 ± 2950, Simple 23000 ± 1800.
Table 2 Vision.
These experiments are the same as table 1 but without L2. The simple decay is defined by evolving with a fixed learning rate during 85% of the time and then decaying it (with decay factor 0.1 for ImageNet and 0.2 for CIFAR). In the case of ImageNet+Adam, the learning rate was reduced to 0.0008. For simple decay with L2 ≠ 0, the test error would often increase after decaying; in such cases, we choose the test error at the minimum of the training loss, see fig S8a for an experiment with this behaviour.
Fig 5a: VGG-5 trained with η = 0.25, λ = 0.005 and decayed by a factor of 10 at the respective time.
Rest of figures. Small modifications or further plots of previous experiments, changes are specified explicitly.
B. Experimental details
B.1. The mild dependence on the last decay epoch
In figure S1 we rerun the CIFAR experiments with ABEL for different values of last_decay_epoch. We see how the final performance does not depend too much on this value.

Figure S1: Models trained with ABEL with different values for last_decay_epoch. The plot shows the difference in test error with respect to the default of 85% of training time (170 epochs), as a function of the epoch of the last decay (160 to 200), for CIFAR-10 and CIFAR-100. We see that the test error does not change too much; most changes are less than 0.2%.
Figure S2: (a) Evolution of the quotient between the weight norm of different layers and the total weight norm. (b) Evolution of the weight norm of different layers, normalized by its value at initialization. Both panels: Resnet-50 on Imagenet.
Figure S3: Experiments from the main section with a ReduceLROnPlateau ("Decay on Plateau") schedule. Panels show learning rate, training loss and test error over training. (a–c) Wide Resnet on CIFAR-10; (d–f) Resnet-50 on Imagenet.
Figure S4: Wide Resnets on CIFAR without data augmentation can reach zero training error with simple decay. Panels show test error, train error and weight norm for cosine and simple schedules, with and without data augmentation ("+Aug"). (a–c) Wide Resnet on CIFAR-10; (d–f) Wide Resnet on CIFAR-100.
Table S2: Test error for VGG-16 and CIFAR-10 for different small learning rate setups. Cosine decay is only superior at large learning rates when there is a bouncing weight norm.

Figure S5: In order for the weight norm to bounce in our training run, the learning rate has to be big enough. We see how for learning rate 0.01 the weight norm does not bounce. This is true even if we evolve for five times longer. Panels (c, d): test error and weight norm for VGG-16 on CIFAR-10 trained for 1000 epochs with cosine and simple schedules at learning rate 0.01.
Figure S6: Wide Resnet 28-10 on CIFAR-100 experiments with different initialization scales σ_w (from 1 down to 0.003). (a) Test error: the two smallest initializations present a substantial degradation of performance. (b) Despite the weight norm being so different at initialization, the final weight norm is the same for the different models. (c) Zoomed version of (b): all the initializations but the smallest two exhibit clear bouncing.
[Figure (caption not recovered): learning rate, weight norm and test error curves for ABEL, cosine and step-wise schedules. (a–c) Wide Resnet on CIFAR-10; (d–f) VGG on CIFAR-10; (g–i) Wide Resnet 16-8 on SVHN; (j–l) PyramidNet on CIFAR-100.]
Figure S8: A couple of vision tasks of table 2. We see how in the absence of L2, a non-trivial schedule does not affect the performance and the weight norm is usually monotonically increasing. Panels show test error, weight norm and learning rate for simple and cosine schedules with L2 = 0 and L2 ≠ 0. (a–c) WRN 28-10 on CIFAR-100; (d–f) Resnet-50 with Adam on Imagenet. We also see how weight norm bouncing occurs despite Adam having an implicit schedule.
Figure S9: Training curves for the Transformer trained on the English-German translation task WMT'14. We see that there is no significant difference between the different learning rate schedules (Simple, Linear and 1/Sqrt). Panels show (a) BLEU and (b) weight norm for the Transformer on En-De WMT, and (c) the learning rate, over thousands of steps.
Figure S10: Game score of different RL games for the two considered schedules. Mean score over three runs, with individual scores in lighter color. "Simple" is just decaying the base learning rate by a factor of 10 at 90% of training and "Linear" is linear decay. (a–c) Game score and (d–f) weight norm over millions of frames for PPO on Pong, Qbert and Seaquest.
Figure S11: Training curves for the Imagenet experiment of section 2, figure 3, with learning rate 16. (a) Test error, (b) weight norm and (c) learning rate for ABEL, cosine and step-wise schedules (Resnet-50 on ImageNet).
Writing the SGD update with L2 regularization as ∆w_t ≡ w_{t+1} − w_t = −η (g_t + λ w_t), we have:

∆|w_{t+1}|² ≡ |w_{t+1}|² − |w_t|² = ∆w_t · ∆w_t + 2 w_t · ∆w_t
            = η²|g_t|² − (2 − ηλ) ηλ |w_t|² − 2η (1 − ηλ) g_t · w_t
            ≈ η²|g_t|² − 2ηλ|w_t|² − 2η g_t · w_t    (S2)

The terms that we are dropping are subleading for our purposes because ηλ ≪ 1 in practice. We expect these terms to be only relevant if the dominant terms cancel each other: η²|g_t|² − 2ηλ|w_t|² − 2η g_t · w_t ∼ 0.
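A quick numerical check of this approximation (our own sanity-check snippet, not part of the paper): for random w and g and practical values of η, λ, one SGD + L2 step reproduces ∆|w_{t+1}|² up to the dropped O(ηλ) corrections.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
g = rng.normal(size=1000)
eta, lam = 0.1, 5e-4  # learning rate and L2 coefficient, eta * lam << 1

w_next = w - eta * (g + lam * w)                 # one SGD step with L2
exact = np.dot(w_next, w_next) - np.dot(w, w)    # Delta|w_{t+1}|^2
approx = eta**2 * np.dot(g, g) - 2 * eta * lam * np.dot(w, w) - 2 * eta * np.dot(g, w)
print(exact, approx)  # agree up to O(eta^2 * lam) terms
```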
g · w = 0 for scale invariant layers. A scale invariant layer is such that its network function satisfies f(αw) = f(w); the bare loss only depends on the weights through the network function. If we divide the weight into its norm and its direction, w_a = |w| ŵ_a, and use dL/d|w| = 0, we get that:

0 = |w| dL/d|w| = |w| Σ_a (dL/dw_a)(dw_a/d|w|) = Σ_a (dL/dw_a) |w| ŵ_a = g · w    (S3)
For scale invariant networks without L2, the change in the weights becomes smaller as the weight norm equilibrates. Using the SGD equation (∆w_t = −η g_t) together with g_t · w_t = 0, we get that:

cos ∠(w_{t+1}, w_t) = (w_{t+1} · w_t) / (|w_{t+1}| |w_t|) = |w_t| / |w_{t+1}|

sin ∠(w_{t+1}, w_t) = √(1 − cos² ∠) = √( (|w_{t+1}|² − |w_t|²) / |w_{t+1}|² )    (S4)
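This relation is easy to verify numerically (again, our own snippet): project a random gradient onto the subspace orthogonal to w so that g · w = 0, take one step, and compare.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=500)
g = rng.normal(size=500)
g -= np.dot(g, w) / np.dot(w, w) * w   # enforce g.w = 0, as for a scale invariant layer

eta = 0.1
w_next = w - eta * g
cos_angle = np.dot(w_next, w) / (np.linalg.norm(w_next) * np.linalg.norm(w))
print(cos_angle, np.linalg.norm(w) / np.linalg.norm(w_next))  # the two values coincide
```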
The main reason why this is a Flax implementation is that update_step takes a train_step function and outputs a train_step function. This can be easily added to the Flax examples/baselines; we only have to make two modifications. Before starting the training loop, we initialize ABEL^c:
# Copyright 2021 Google LLC.
# SPDX-License-Identifier: Apache-2.0
The only remaining modification is to add the ABEL update rule at the end of each epoch, which takes the current train_step function and the mean weight norm and returns a (possibly updated) train_step function. ABEL will update the optimizer if the learning rate has to be decayed.
# Copyright 2021 Google LLC.
# SPDX-License-Identifier: Apache-2.0
^c Note that ABELSchedule takes train_step_fn as an argument. This is a function that takes a learning_rate_fn and outputs a train_step (i.e. train_step_fn = lambda lr_fn: functools.partial(train_step, learning_rate_fn=lr_fn)).
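Since the listings above survive only as their license headers, here is a rough sketch of the kind of wiring the footnote describes (entirely our own reconstruction of the stated interface, not the original Flax code): the weight-norm tracker decides when to decay, and whenever the learning rate changes, train_step is rebuilt from train_step_fn with a new constant learning_rate_fn.

```python
import functools

def make_train_step_fn(train_step):
    """As in the footnote: wrap an existing train_step so a learning_rate_fn can be swapped in."""
    return lambda lr_fn: functools.partial(train_step, learning_rate_fn=lr_fn)

def abel_end_of_epoch(train_step, train_step_fn, abel, epoch, weight_norm_sq):
    """End-of-epoch hook: feed the mean weight norm to an ABEL tracker (e.g. the sketch
    in section 2) and, if the learning rate changed, rebuild train_step."""
    old_lr = abel.lr
    new_lr = abel.update(epoch, weight_norm_sq)
    if new_lr != old_lr:
        train_step = train_step_fn(lambda step: new_lr)
    return train_step
```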