Professional Documents
Culture Documents
Raghu Bollapragada 1 Dheevatsa Mudigere 2 Jorge Nocedal 1 Hao-Jun Michael Shi 1 Ping Tak Peter Tang 3
Abstract tion that makes the overall algorithm slow compared with
stochastic gradient methods (Robbins & Monro, 1951).
arXiv:1802.05374v2 [math.OC] 30 May 2018
of stochastic line searches for machine learning by study- 2016; Keskar & Berahas, 2016; Curtis, 2016; Berahas &
ing a key component, namely the initial estimate in the Takáč, 2017; Zhou et al., 2017). Schraudolph et al. (2007)
one-dimensional search. Our approach, which is based on ensured stability of quasi-Newton updating by computing
statistical considerations, is designed for an Armijo-style gradients using the same batch at the beginning and end
backtracking line search. of the iteration. Since this can potentially double the cost
of the iteration, Berahas et al. (2016) proposed to achieve
1.1. Literature Review gradient consistency by computing gradients based on the
overlap between consecutive batches; this approach was
Progressive batching (sometimes referred to as dynamic further tested by Berahas and Takac (2017). An interesting
sampling) has been well studied in the optimization litera- approach introduced by Martens and Grosse (2015; 2016)
ture, both for stochastic gradient and subsampled Newton- approximates the Fisher information matrix to scale the
type methods (Byrd et al., 2012; Friedlander & Schmidt, gradient; a distributed implementation of their K-FAC ap-
2012; Cartis & Scheinberg, 2015; Pasupathy et al., 2015; proach is described in (Ba et al., 2016). Another approach
Roosta-Khorasani & Mahoney, 2016a;b; Bollapragada et al., approximately computes the inverse Hessian by using the
2016; 2017; De et al., 2017). Friedlander and Schmidt Neumann power series representation of matrices (Krishnan
(2012) introduced theoretical conditions under which a pro- et al., 2017).
gressive batching SG method converges linearly for finite
sum problems, and experimented with a quasi-Newton adap-
1.2. Contributions
tation of their algorithm. Byrd et al. (2012) proposed a
progressive batching strategy, based on a norm test, that de- This paper builds upon three algorithmic components that
termines when to increase the sample size; they established have recently received attention in the literature — progres-
linear convergence and computational complexity bounds sive batching, stable quasi-Newton updating, and adaptive
in the case when the batch size grows geometrically. More steplength selection. It advances their design and puts them
recently, Bollapragada et al. (2017) introduced a batch con- together in a novel algorithm with attractive theoretical and
trol mechanism based on an inner product test that improves computational properties.
upon the norm test mentioned above.
The cornerstone of our progressive batching strategy is the
There has been a renewed interest in understanding the gen- mechanism proposed by Bollapragada et al. (2017) in the
eralization properties of small-batch and large-batch meth- context of first-order methods. We extend their inner prod-
ods for training neural networks; see (Keskar et al., 2016; uct control test to second-order algorithms, something that
Dinh et al., 2017; Goyal et al., 2017; Hoffer et al., 2017). is delicate and leads to a significant modification of the
Keskar et al. (2016) empirically observed that large-batch original procedure. Another main contribution of the paper
methods converge to solutions with inferior generalization is the design of an Armijo-style backtracking line search
properties; however, Goyal et al. (2017) showed that large- where the initial steplength is chosen based on statistical
batch methods can match the performance of small-batch information gathered during the course of the iteration. We
methods when a warm-up strategy is used in conjunction show that this steplength procedure is effective on a wide
with scaling the step length by the same factor as the batch range of applications, as it leads to well scaled steps and
size. Hoffer et al. (2017) and You et al. (2017) also explored allows for the BFGS update to be performed most of the
larger batch sizes and steplengths to reduce the number of time, even for nonconvex problems. We also test two tech-
updates necessary to train the network. All of these studies niques for ensuring the stability of quasi-Newton updating,
naturally led to an interest in progressive batching tech- and observe that the overlapping procedure described by
niques. Smith et al. (2017) showed empirically that increas- Berahas et al. (2016) is more efficient than a straightforward
ing the sample size and decaying the steplength are quan- adaptation of classical quasi-Newton methods (Schraudolph
titatively equivalent for the SG method; hence, steplength et al., 2007).
schedules could be directly converted to batch size sched-
We report numerical tests on large-scale logistic regression
ules. This approach was parallelized by Devarakonda et al.
and deep neural network training tasks that indicate that our
(2017). De et al. (2017) presented numerical results with
method is robust and efficient, and has good generalization
a progressive batching method that employs the norm test.
properties. An additional advantage is that the method re-
Balles et al. (2016) proposed an adaptive dynamic sample
quires almost no parameter tuning, which is possible due to
size scheme and couples the sample size with the steplength.
the incorporation of second-order information. All of this
Stochastic second-order methods have been explored within suggests that our approach has the potential to become one
the context of convex and non-convex optimization; see of the leading optimization methods for training deep neural
(Schraudolph et al., 2007; Sohl-Dickstein et al., 2014; networks. In order to achieve this, the algorithm must be
Mokhtari & Ribeiro, 2015; Berahas et al., 2016; Byrd et al., optimized for parallel execution, something that was only
A Progressive Batching L-BFGS Method for Machine Learning
briefly explored in this study. search direction −Hk ∇F (xk ), with high probability. Al-
though this does not imply that dk is a descent direction
2. A Progressive Batching Quasi-Newton for F , this will normally be the case for any reasonable
quasi-Newton matrix.
Method
To derive the new inner product quasi-Newton (IPQN) test,
The problem of interest is we first observe that the stochastic quasi-Newton search
Z direction makes an acute angle with the true quasi-Newton
min F (x) = f (x; z, y)dP (z, y), (1) direction in expectation, i.e.,
x∈Rd
h i
where f is the composition of a prediction function Ek (Hk ∇F (xk ))T (Hk gkSk ) = kHk ∇F (xk )k2 , (4)
(parametrized by x) and a loss function, and (z, y) are ran-
dom input-output pairs with probability distribution P (z, y). where Ek denotes the conditional expectation at xk . We
The associated empirical risk problem consists of minimiz- must, however, control the variance of this quantity to
ing achieve our stated objective. Specifically, we select the
N N
sample size |Sk | such that the following condition is satis-
1 X 4 1 X fied:
R(x) = f (x; z i , y i ) = Fi (x),
N i=1 N i=1 2
T Sk 2
Ek (Hk ∇F (xk )) (Hk gk ) − kHk ∇F (xk )k
where we define Fi (x) = f (x; z i , y i ). A stochastic quasi-
Newton method is given by ≤ θ2 kHk ∇F (xk )k4 ,
(5)
xk+1 = xk − αk Hk gkSk , (2) for some θ > 0. The left hand side of (5) is difficult to com-
pute but can be bounded by the true variance of individual
where the batch (or subsampled) gradient is given by
search directions, i.e.,
4 1 X
gkSk = ∇FSk (xk ) = ∇Fi (xk ),
h 2 i
(3) Ek (Hk ∇F (xk ))T (Hk gki ) − kHk ∇F (xk )k2
|Sk |
i∈Sk
|Sk | (6)
the set Sk ⊂ {1, 2, · · · } indexes data points (y i , z i ) sam-
≤ θ2 kHk ∇F (xk )k4 ,
pled from the distribution P (z, y), and Hk is a positive
definite quasi-Newton matrix. We now discuss each of the where gki = ∇Fi (xk ). This test involves the true expected
components of the new method. gradient and variance, but we can approximate these quan-
tities with sample gradient and variance estimates, respec-
2.1. Sample Size Selection tively, yielding the practical inner product quasi-Newton
The proposed algorithm has the form (29)-(30). Initially, it test:
utilizes a small batch size |Sk |, and increases it gradually in
Vari∈Skv (gki )T Hk2 gkSk 4
order to attain a fast local rate of convergence and permit the ≤ θ2 Hk gkSk , (7)
use of second-order information. A challenging question is |Sk |
to determine when, and by how much, to increase the batch
size |Sk | over the course of the optimization procedure based where Skv ⊆ Sk is a subset of the current sample (batch),
on observed gradients — as opposed to using prescribed and the variance term is defined as
rules that depend on the iteration number k.
2 2
i T 2 Sk Sk
P
v
i∈Sk (gk ) H g
k k − H k gk
We propose to build upon the strategy introduced by Bol-
v . (8)
lapragada et al. (2017) in the context of first-order methods. |Sk | − 1
Their inner product test determines a sample size such that
the search direction is a descent direction with high prob- The variance (8) may be computed using just one additional
ability. A straightforward extension of this strategy to the Hessian vector product of Hk with Hk gkSk . Whenever con-
quasi-Newton setting is not appropriate since requiring only dition (7) is not satisfied, we increase the sample size |Sk |.
that a stochastic quasi-Newton search direction be a de- In order to estimate the increase that would lead to a satis-
scent direction with high probability would underutilize the faction of (7), we reason as follows. If we assume that new
curvature information contained in the search direction. sample |S̄k | is such that
We would like, instead, for the search direction dk =
−Hk gkSk to make an acute angle with the true quasi-Newton Hk gkSk u Hk gkS̄k ,
A Progressive Batching L-BFGS Method for Machine Learning
and similarly for the variance estimate, then a simple com- term in the matrix Wk . To obtain decrease in the function
putation shows that a lower bound on the new sample size value in the deterministic case, the matrix I − Lα 2 Hk
k
The only difference in (A.1) between the deterministic and where sk = xk+1 − xk and yk is the difference in the
stochastic quasi-Newton methods is the additional variance gradients at xk+1 and xk . When the batch changes from
A Progressive Batching L-BFGS Method for Machine Learning
one iteration to the next (Sk+1 6= Sk ), it is not obvious how Algorithm 1 Progressive Batching L-BFGS Method
yk should be defined. It has been observed that when yk Input: Initial iterate x0 , initial sample size |S0 |;
is computed using different samples, the updating process Initialization: Set k ← 0
may be unstable, and hence it seems natural to use the Repeat until convergence:
same sample at the beginning and at the end of the iteration 1: Sample Sk ⊆ {1, · · · , N } with sample size |Sk |
(Schraudolph et al., 2007), and define 2: if condition (7) is not satisfied then
Sk 3: Compute bk using (9), and set b̂k ← dbk e − |Sk |
yk = gk+1 − gkSk . (18)
4: Sample S + ⊆ {1, · · · , N } \ Sk with |S + | = b̂k
However, this requires that the gradient be evaluated twice 5: Set Sk ← Sk ∪ S +
for every batch Sk at xk and xk+1 . To avoid this additional 6: end if
S
cost, Berahas et al. (2016) propose to use the overlap be- 7: Compute gk k
S
tween consecutive samples in the gradient differencing. If 8: Compute pk = −Hk gk k using L-BFGS Two-Loop
we denote this overlap as Ok = Sk ∩ Sk+1 , then one defines Recursion in (Nocedal & Wright, 1999)
9: Compute αk using (14)
Ok
yk = gk+1 − gkOk . (19) 10: while the Armijo condition (16) not satisfied do
11: Set αk = αk /2
This requires no extra computation since the two gradients 12: end while
in this expression are subsets of the gradients correspond- 13: Compute xk+1 = xk + αk pk
ing to the samples Sk and Sk+1 . The overlap should not 14: Compute yk using (18) or (19)
be too small to avoid differencing noise, but this is easily 15: Compute sk = xk+1 − xk
achieved in practice. We test both formulas for yk in our 16: if ykT sk > ksk k2 then
implementation of the method; see Section 4. 17: if number of stored (yj , sj ) exceeds m then
18: Discard oldest curvature pair (yj , sj )
2.4. The Complete Algorithm 19: end if
The pseudocode of the progressive batching L-BFGS 20: Store new curvature pair (yk , sk )
method is given in Algorithm 1. Observe that the limited 21: end if
memory Hessian approximation Hk in Line 8 is indepen- 22: Set k ← k + 1
dent of the choice of the sample Sk . Specifically, Hk is 23: Set |Sk | = |Sk−1 |
defined by a collection of curvature pairs {(sj , yj )}, where
the most recent pair is based on the sample Sk−1 ; see Line
14. For the batch size control test (7), we choose θ = 0.9 in This assumption can be justified both in the convex and non-
the logistic regression experiments, and θ is a tunable pa- convex cases under certain conditions; see (Berahas et al.,
rameter chosen in the interval [0.9, 3] in the neural network 2016). We assume that the sample size is controlled by the
experiments. The constant c1 in (16) is set to c1 = 10−4 . exact inner product quasi-Newton test (31). This test is de-
For L-BFGS, we set the memory as m = 10. We skip the signed for efficiency, and in rare situations could allow for
quasi-Newton update if the following curvature condition is the generation of arbitrarily long search directions. To pre-
not satisfied: vent this from happening, we introduce an additional control
on the sample size |Sk |, by extending (to the quasi-Newton
ykT sk > ksk k2 , with = 10−2 . (20) setting) the orthogonality test introduced in (Bollapragada
et al., 2017). This additional requirement states that the
The initial Hessian matrix H0k in the L-BFGS recursion at current sample size |Sk | is acceptable only if
each iteration is chosen as γk I where γk = ykT sk /ykT yk .
S
T 2
Hk g k k (Hk ∇F (xk ))
Ek Hk gki − kHk ∇F (xk )k2 Hk ∇F (xk )
3. Convergence Analysis
We now present convergence results for the proposed al- |Sk |
gorithm, both for strongly convex and nonconvex objec- ≤ ν kHk ∇F (xk )k2 ,
2
(22)
tive functions. Our emphasis is in analyzing the effect of
progressive sampling, and therefore, we follow common for some given ν > 0.
practice and assume that the steplength in the algorithm is
We now establish linear convergence when the objective is
fixed (αk = α), and that the inverse L-BFGS matrix Hk has
strongly convex.
bounded eigenvalues, i.e.,
Theorem 3.1. Suppose that F is twice continuously differ-
Λ1 I Hk Λ2 I. (21) entiable and that there exist constants 0 < µ ≤ L such that
A Progressive Batching L-BFGS Method for Machine Learning
Test Accuracy
10−2
Training Error
Training Error
Test Loss
PBQN-FO PBQN-FO 0.4 PBQN-FO 80
10−3 10−2
0.3 70 SG
10−4 SVRG
0.2
10−3
60 PBQN-MB
10−5 0.1
PBQN-FO
10−6 10−4 0.0 50
0 10 20 30 40 50 60 70 0 2 4 6 8 10 0 2 4 6 8 10 0 2 4 6 8 10
Gradient Evaluations Gradient Evaluations Gradient Evaluations Gradient Evaluations
Figure 1. spam dataset: Performance of the progressive batching L-BFGS method (PBQN), with multi-batch (25% overlap) and
full-overlap approaches, and the SG and SVRG methods.
80
100 100 0.70
SG SG SG
75
10−1 SVRG 10 −1
SVRG SVRG
0.65
Test Accuracy
PBQN-MB PBQN-MB PBQN-MB 70
Training Error
−2
Training Error
10
Test Loss
PBQN-FO 10−2 PBQN-FO PBQN-FO
10−3 0.60 65
10−3 SG
10−4 60 SVRG
0.55
−5 10−4 55 PBQN-MB
10
PBQN-FO
10−6 10−5 0.50 50
0 20 40 60 80 100 0 2 4 6 8 10 0 2 4 6 8 10 0 2 4 6 8 10
Gradient Evaluations Gradient Evaluations Gradient Evaluations Gradient Evaluations
Figure 2. covertype dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and full-overlap
approaches, and the SG and SVRG methods.
tion of our algorithm. We consider three network architectures: (i) a small convolu-
tional neural network on CIFAR-10 (C) (Krizhevsky, 2009),
4.2. Results on Neural Networks (ii) an AlexNet-like convolutional network on MNIST
and CIFAR-10 (A1 , A2 , respectively) (LeCun et al., 1998;
We have performed a preliminary investigation into the Krizhevsky et al., 2012), and (iii) a residual network
performance of the PBQN algorithm for training neural (ResNet18) on CIFAR-10 (R) (He et al., 2016). The net-
networks. As is well-known, the resulting optimization work architecture details and additional plots are given in
problems are quite difficult due to the existence of local the supplement. All of these networks were implemented in
minimizers, some of which generalize poorly. Thus our first PyTorch (Paszke et al., 2017). The results for the CIFAR-
requirement when applying the PBQN method was to obtain 10 AlexNet and CIFAR-10 ResNet18 are given in Figures
as good generalization as SG, something we have achieved. 15 and 16, respectively. We report results both against the
Our investigation into how to obtain fast performance is, total number of iterations and the total number of gradient
however, still underway for reasons discussed below. Never- evaluations. Table 1 shows the best test accuracies attained
theless, our results are worth reporting because they show by each of the four methods over the various networks.
that our line search procedure is performing as expected, and In all our experiments, we initialize the batch size as |S0 | =
that the overall number of iterations required by the PBQN 512 in the PBQN method, and fix the batch size to |Sk | =
method is small enough so that a parallel implementation 128 for SG and Adam. The parameter θ given in (7), which
could yield state-of-the-art results, based on the theoretical controls the batch size increase in the PBQN method, was
performance model detailed in the supplement. tuned lightly by chosing among the 3 values: 0.9, 2, 3. SG
We compared our algorithm, as described in Section 2, and Adam are tuned using a development-based decay (dev-
against SG and Adam (Kingma & Ba, 2014). It has taken decay) scheme, which track the best validation loss at each
many years to design regularizations techniques and heuris- epoch and reduces the steplength by a constant factor δ if
tics that greatly improve the performance of the SG method the validation loss does not improve after e epochs.
for deep learning (Srivastava et al., 2014; Ioffe & Szegedy, We observe from our results that the PBQN method achieves
2015). These include batch normalization and dropout, a similar test accuracy as SG and Adam, but requires more
which (in their current form) are not conducive to the PBQN gradient evaluations. Improvements in performance can be
approach due to the need for gradient consistency when obtained by ensuring that the PBQN method exerts a finer
evaluating the curvature pairs in L-BFGS. Therefore, we do control on the sample size in the small batch regime — some-
not implement batch normalization and dropout in any of thing that requires further investigation. Nevertheless, the
the methods tested, and leave the study of their extension to small number of iterations required by the PBQN method,
the PBQN setting for future work. together with the fact that it employs larger batch sizes than
A Progressive Batching L-BFGS Method for Machine Learning
2.5
2.4 70
SG 70
2.0 2.2 Adam
60
2.0 PBQN-MB 60
Test Accuracy
Training Loss
Test Accuracy
Test Loss
1.5 1.8 PBQN-FO
50 50
1.6
1.0 SG SG SG
1.4 40 40
Adam Adam Adam
1.2 PBQN-MB
0.5 PBQN-MB 30 30 PBQN-MB
PBQN-FO 1.0 PBQN-FO PBQN-FO
0.0 1 0.8 1 20 1 20
10 102 103 104 10 102 103 104 10 102 103 104 0 200 400 600 800 1000
Iterations Iterations Iterations Gradient Evaluations
Figure 3. CIFAR-10 AlexNet (A2 ): Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and
full-overlap approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.
2.5
2.4 SG 70 SG 70
2.0 2.2 Adam Adam
60 PBQN-MB
PBQN-MB 60
Test Accuracy
2.0
Training Loss
Test Accuracy
PBQN-FO
Test Loss
1.5
1.8 PBQN-FO
50 50
SG 1.6
1.0 40 SG
40
Adam 1.4 Adam
0.5 PBQN-MB 1.2 30 30 PBQN-MB
PBQN-FO 1.0 PBQN-FO
0.0 1 2 3 1 2 3
20 1 2 3 20
10 10 10 10 10 10 10 10 10 0 50 100 150 200
Iterations Iterations Iterations Gradient Evaluations
Figure 4. CIFAR-10 ResNet18 (R): Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and
full-overlap approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.
Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. Sam- Goyal, P., Dollár, P., Girshick, R., Noordhuis, P.,
ple size selection in optimization methods for machine Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and
learning. Mathematical Programming, 134(1):127–155, He, K. Accurate, large minibatch sgd: Training imagenet
2012. in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A Grosse, R. and Martens, J. A kronecker-factored approxi-
stochastic quasi-Newton method for large-scale optimiza- mate fisher matrix for convolution layers. In International
tion. SIAM Journal on Optimization, 26(2):1008–1031, Conference on Machine Learning, pp. 573–582, 2016.
2016.
Guyon, I., Aliferis, C. F., Cooper, G. F., Elisseeff, A., Pellet,
Carbonetto, P. New probabilistic inference algorithms that J., Spirtes, P., and Statnikov, A. R. Design and analy-
harness the strengths of variational and Monte Carlo sis of the causation and prediction challenge. In WCCI
methods. PhD thesis, University of British Columbia, Causation and Prediction Challenge, pp. 1–33, 2008.
2009.
Hardt, M., Recht, B., and Singer, Y. Train faster, generalize
Cartis, C. and Scheinberg, K. Global convergence rate better: Stability of stochastic gradient descent. arXiv
analysis of unconstrained optimization methods based on preprint arXiv:1509.01240, 2015.
probabilistic models. Mathematical Programming, pp.
1–39, 2015. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-
ing for image recognition. In Proceedings of the IEEE
Chang, C. and Lin, C. LIBSVM: A library for support vector conference on computer vision and pattern recognition,
machines. ACM Transactions on Intelligent Systems and pp. 770–778, 2016.
Technology, 2:27:1–27:27, 2011. Software available at
http://www.csie.ntu.edu.tw/ cjlin/libsvm. Hoffer, E., Hubara, I., and Soudry, D. Train longer,
generalize better: closing the generalization gap in
Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. Re- large batch training of neural networks. arXiv preprint
visiting distributed synchronous sgd. arXiv preprint arXiv:1705.08741, 2017.
arXiv:1604.00981, 2016.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating
Cormack, G. and Lynam, T. Spam corpus creation for TREC. deep network training by reducing internal covariate shift.
In Proc. 2nd Conference on Email and Anti-Spam, 2005. In International conference on machine learning, pp. 448–
http://plg.uwaterloo.ca/gvcormac/treccorpus. 456, 2015.
A Progressive Batching L-BFGS Method for Machine Learning
Johnson, R. and Zhang, T. Accelerating stochastic gradient Pasupathy, R., Glynn, P., Ghosh, S., and Hashemi, F. S. On
descent using predictive variance reduction. In Advances sampling rates in stochastic recursions. 2015. Under
in Neural Information Processing Systems 26, pp. 315– Review.
323, 2013.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
Keskar, N. S. and Berahas, A. S. adaqn: An adaptive quasi- DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer,
newton algorithm for training rnns. In Joint European A. Automatic differentiation in pytorch. 2017.
Conference on Machine Learning and Knowledge Dis-
Robbins, H. and Monro, S. A stochastic approximation
covery in Databases, pp. 1–16. Springer, 2016.
method. The annals of mathematical statistics, pp. 400–
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, 407, 1951.
M., and Tang, P. T. P. On large-batch training for deep
Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled
learning: Generalization gap and sharp minima. arXiv
Newton methods II: Local convergence rates. arXiv
preprint arXiv:1609.04836, 2016.
preprint arXiv:1601.04738, 2016a.
Kingma, D. and Ba, J. Adam: A method for stochastic Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled
optimization. arXiv preprint arXiv:1412.6980, 2014. Newton methods I: Globally convergent algorithms.
arXiv preprint arXiv:1601.04737, 2016b.
Krishnan, S., Xiao, Y., and Saurous, R. A. Neumann opti-
mizer: A practical optimization algorithm for deep neural Schraudolph, N. N., Yu, J., and Günter, S. A stochastic
networks. arXiv preprint arXiv:1712.03298, 2017. quasi-newton method for online convex optimization. In
International Conference on Artificial Intelligence and
Krizhevsky, A. Learning multiple layers of features from
Statistics, pp. 436–443, 2007.
tiny images. 2009.
Smith, S. L., Kindermans, P., and Le, Q. V. Don’t decay
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet the learning rate, increase the batch size. arXiv preprint
classification with deep convolutional neural networks. arXiv:1711.00489, 2017.
In Advances in Neural Information Processing Systems,
pp. 1097–1105, 2012. Sohl-Dickstein, J., Poole, B., and Ganguli, S. Fast large-
scale optimization by unifying stochastic gradient and
Kurth, T., Zhang, J., Satish, N., Racah, E., Mitliagkas, I., quasi-Newton methods. In International Conference on
Patwary, M. M. A., Malas, T., Sundaram, N., Bhimji, W., Machine Learning, pp. 604–612, 2014.
Smorkalov, M., et al. Deep learning at 15pf: Supervised
and semi-supervised classification for scientific data. In Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
Proceedings of the International Conference for High Per- and Salakhutdinov, R. Dropout: A simple way to prevent
formance Computing, Networking, Storage and Analysis, neural networks from overfitting. The Journal of Machine
pp. 7. ACM, 2017. Learning Research, 15(1):1929–1958, 2014.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient- You, Y., Gitman, I., and Ginsburg, B. Scaling sgd
based learning applied to document recognition. Proceed- batch size to 32k for imagenet training. arXiv preprint
ings of the IEEE, 86(11):2278–2324, 1998. arXiv:1708.03888, 2017.
Liu, D. C. and Nocedal, J. On the limited memory bfgs Zhou, C., Gao, W., and Goldfarb, D. Stochastic adaptive
method for large scale optimization. Mathematical pro- quasi-Newton methods for minimizing expected values.
gramming, 45(1-3):503–528, 1989. In International Conference on Machine Learning, pp.
4150–4159, 2017.
Martens, J. and Grosse, R. Optimizing neural networks with
kronecker-factored approximate curvature. In Interna-
tional Conference on Machine Learning, pp. 2408–2417,
2015.
and the set Sk ⊂ {1, 2, · · · } indexes data points (y i , z i ). The algorithm selects the Hessian approximation Hk through
quasi-Newton updating prior to selecting the new sample Sk to define the search direction pk . We will use Ek to denote the
conditional expectation at xk and use E to denote the total expectation.
The primary theoretical mechanism for determining batch sizes is the exact variance inner product quasi-Newton (IPQN)
test, which is defined as
h 2 i
Ek (Hk ∇F (xk ))T (Hk gki ) − kHk ∇F (xk )k2
≤ θ2 kHk ∇F (xk )k4 . (31)
|Sk |
We establish the inequality used to determine the initial steplength αk for the stochastic line search.
Lemma A.1. Assume that F is continuously differentiable with Lipschitz continuous gradient with Lipschitz constant L.
Then
1/2 1/2
Ek [F (xk+1 )] ≤ F (xk ) − αk ∇F (xk )T Hk Wk Hk ∇F (xk ),
where
Var{Hk gki }
Lαk
Wk = I− 1+ Hk ,
2 |Sk |kHk ∇F (xk )k2
i
i 2
and Var{Hk gk } = Ek kHk gk − Hk ∇F (xk )k .
B. Convergence Analysis
For the rest of our analysis, we make the following two assumptions.
Assumptions B.1. The orthogonality condition is satisfied for all k, i.e.,
i T 2
(H gk ) (Hk ∇F (xk ))
Ek Hk gki − kkH k ∇F (xk )k
2 H k ∇F (x k )
≤ ν 2 kHk ∇F (xk )k2 , (32)
|Sk |
for some large ν > 0.
A Progressive Batching L-BFGS Method for Machine Learning
Assumptions B.2. The eigenvalues of Hk are contained in an interval in R+ , i.e., for all k there exist constants Λ2 ≥ Λ1 > 0
such that
Λ1 I Hk Λ2 I. (33)
Condition (32) ensures that the stochastic quasi-Newton direction is bounded away from orthogonality to −Hk ∇F (xk ),
with high probability, and prevents the variance in the individual quasi-Newton directions to be too large relative to the
variance in the individual quasi-Newton directions along −Hk ∇F (xk ). Assumption B.2 holds, for example, when F is
convex and a regularization parameter is included so that any subsampled Hessian ∇2 FS (x) is positive definite. It can
also be shown to hold in the non-convex case by applying cautious BFGS updating; e.g. by updating Hk only when
ykT sk ≥ ksk k22 where > 0 is a predetermined constant (Berahas et al., 2016).
We begin by establishing a technical descent lemma.
Lemma B.3. Suppose that F is twice continuously differentiable and that there exists a constant L > 0 such that
Let {xk } be generated by iteration (29) for any x0 , where |Sk | is chosen by the (exact variance) inner product quasi-Newton
test (31) for given constant θ > 0 and suppose that assumptions (B.1) and (B.2) hold. Then, for any k,
h i
Ek kHk gkSk k2 ≤ (1 + θ2 + ν 2 )kHk ∇F (xk )k2 . (35)
Moreover, if αk satisfies
1
αk = α ≤ , (36)
(1 + θ2 + ν 2 )LΛ2
we have that
α 1/2
Ek [F (xk+1 )] ≤ F (xk ) − kHk ∇F (xk )k2 . (37)
2
To bound the first term on the right side of this inequality, we use the inner product quasi-Newton test; in particular, |Sk |
satisfies
h 2 i
2 Ek (Hk ∇F (xk ))T (Hk gki ) − kHk ∇F (xk )k2
Ek (Hk ∇F (xk ))T (Hk gkSk )) − kHk ∇F (xk )k2 ≤
|Sk |
≤ θ2 kHk ∇F (xk )k4 , (40)
we have
2
Ek (Hk gkSk )T (Hk ∇F (xk )) ≤ kHk ∇F (xk )k4 + θ2 kHk ∇F (xk )k4
by (40) and (41). Substituting (42) into (39), we get the following bound on the length of the search direction:
h i
Ek kHk gkSk k2 ≤ (1 + θ2 + ν 2 )kHk ∇F (xk )k2 ,
which proves (35). Using this inequality, Assumption B.2, and bounds on the Hessian and steplength (34) and (36), we have
2
h i Lα
Ek [F (xk+1 )] ≤ F (xk ) − Ek α(Hk gkSk )T ∇F (xk ) + Ek kHk gkSk k2
2
2
Lα
= F (xk ) − α∇F (xk )T Hk ∇F (xk ) + Ek [kHk gkSk k2 ]
2
Lα2
≤ F (xk ) − α∇F (xk )T Hk ∇F (xk ) + (1 + θ2 + ν 2 )kHk ∇F (xk )k2
2
Lα(1 + θ2 + ν 2 )
1/2 T 1/2
= F (xk ) − α(Hk ∇F (xk )) I− Hk Hk ∇F (xk )
2
LΛ2 α(1 + θ2 + ν 2 )
1/2
≤ F (xk ) − α 1 − kHk ∇F (xk )k2
2
α 1/2
≤ F (xk ) − kHk ∇F (xk )k2 .
2
We now show that the stochastic quasi-Newton iteration (29) with a fixed steplength α is linearly convergent when F is
strongly convex. In the following discussion, x∗ denotes the minimizer of F .
Theorem B.4. Suppose that F is twice continuously differentiable and that there exist constants 0 < µ ≤ L such that
Let {xk } be generated by iteration (29), for any x0 , where |Sk | is chosen by the (exact variance) inner product quasi-Newton
test (31) and suppose that the assumptions (B.1) and (B.2) hold. Then, if αk satisfies (36) we have that
Proof. It is well-known (Bertsekas et al., 2003) that for strongly convex functions,
Substituting this into (37) and subtracting F (x∗ ) from both sides and using Assumption B.2, we obtain
α 1/2
Ek [F (xk+1 ) − F (x∗ )] ≤ F (xk ) − F (x∗ ) − kHk ∇F (xk )k2
2
α
≤ F (xk ) − F (x∗ ) − Λ1 k∇F (xk )k2
2
≤ (1 − µΛ1 α)(F (xk ) − F (x∗ )).
101 1.6
101
SG SG
SG 1.4
100 SVRG SVRG SVRG
1.2
PBQN-MB PBQN-MB PBQN-MB
Training Error
100
Training Error
10−1 1.0
Test Loss
10 −4 0.2
10−2 0.0
0 20 40 60 80 100 0 2 4 6 8 10 0 2 4 6 8 10
Gradient Evaluations Gradient Evaluations Gradient Evaluations
×103
100 6 1.0
PBQN-MB PBQN-MB
5 PBQN-FO
90 0.8 PBQN-FO
Test Accuracy
4
Batch Sizes
Steplength
80 0.6
3
70 SG 0.4
2
SVRG
60 PBQN-MB 1 0.2
PBQN-FO
50 0 0.0
0 2 4 6 8 10 0 20 40 60 80 100 120 140 0 20 40 60 80 100 120 140
Gradient Evaluations Iterations Iterations
Figure 5. gisette dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.
A Progressive Batching L-BFGS Method for Machine Learning
Training Error
Test Loss
PBQN-FO PBQN-FO PBQN-FO
10−2 0.5
10−2
10−3 0.4
10−3 0.3
10−4
0.2
−4
10−5 10
0.1
10−6 10−5 0.0
0 5 10 15 20 25 30 35 40 0 2 4 6 8 10 0 2 4 6 8 10
Gradient Evaluations Gradient Evaluations Gradient Evaluations
3
8 ×10
100 1.0
7
90 6 0.8
Batch Sizes
Test Accuracy
80 5
Steplength
0.6
70
4
SG 3 0.4
60 SVRG 2
PBQN-MB PBQN-MB 0.2 PBQN-MB
50 1
PBQN-FO PBQN-FO PBQN-FO
40 0 0.0
0 2 4 6 8 10 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70
Gradient Evaluations Iterations Iterations
Figure 6. mushrooms dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.
100 0.7
100
SG
SG 0.6
10−1 SVRG SVRG
0.5 PBQN-MB
PBQN-MB 10−1
Training Error
Training Error
Test Loss
10−3 0.3
SG
10−2 SVRG 0.2
10−4
PBQN-MB
0.1
10−5 PBQN-FO
10−3 0.0
0 20 40 60 80 100 0 2 4 6 8 10 0 2 4 6 8 10
Gradient Evaluations Gradient Evaluations Gradient Evaluations
4
100 1.2 ×10 1.0
PBQN-MB
90 1.0 0.8 PBQN-FO
Test Accuracy
0.8
Batch Sizes
Steplength
80 0.6
0.6
70 SG 0.4
SVRG 0.4
60 PBQN-MB 0.2 PBQN-MB 0.2
PBQN-FO PBQN-FO
50 0.0 0.0
0 2 4 6 8 10 0 50 100 150 200 0 50 100 150 200
Gradient Evaluations Iterations Iterations
Figure 7. sido dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap
(FO) approaches, and the SG and SVRG methods.
A Progressive Batching L-BFGS Method for Machine Learning
Training Error
Test Loss
PBQN-FO 10−2 PBQN-FO 0.5 PBQN-FO
10−3
10−3 0.4
10−4
95 ×104 1.0
3.5
90 PBQN-MB PBQN-MB
3.0
85 PBQN-FO 0.8 PBQN-FO
2.5
Test Accuracy
80
Batch Sizes
Steplength
75 2.0 0.6
70
SG 1.5 0.4
65 SVRG 1.0
60 PBQN-MB 0.2
0.5
55 PBQN-FO
50 0.0 0.0
0 2 4 6 8 10 0 20 40 60 80 100 120 140 160 0 20 40 60 80 100 120 140 160
Gradient Evaluations Iterations Iterations
Figure 8. ijcnn dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-
overlap(FO) approaches, and the SG and SVRG methods.
Training Error
Test Loss
Batch Sizes
Steplength
80 5 0.6
4
70 SG 0.4
3
SVRG
60 PBQN-MB 2 0.2 PBQN-MB
PBQN-FO 1 PBQN-FO
50 0 0.0
0 2 4 6 8 10 0 50 100 150 200 250 300 350 0 50 100 150 200 250 300 350
Gradient Evaluations Iterations Iterations
Figure 9. spam dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-
overlap (FO) approaches, and the SG and SVRG methods.
A Progressive Batching L-BFGS Method for Machine Learning
Training Error
10−2 0.65
Training Error
Test Loss
PBQN-FO PBQN-FO PBQN-FO
10−3 0.60
−4 −2 0.55
10 10
10−5 0.50
−6 −3 0.45
10 10
0 20 40 60 80 100 0 2 4 6 8 10 0 2 4 6 8 10
Gradient Evaluations Gradient Evaluations Gradient Evaluations
×105
80 1.0
4 PBQN-MB
75 PBQN-FO 0.8
Test Accuracy
70 3
Batch Sizes
Steplength
0.6
65
2
SG 0.4
60 SVRG
PBQN-MB 1 0.2 PBQN-MB
55
PBQN-FO PBQN-FO
50 0 0.0
0 2 4 6 8 10 0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 350 400
Gradient Evaluations Iterations Iterations
Figure 10. alpha dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.
10−2
Training Error
Test Loss
70 4
Batch Sizes
Steplength
PBQN-MB 0.6
65 3
PBQN-FO
SG 0.4
60 2
SVRG
55 PBQN-MB 1 0.2
PBQN-FO
50 0 0.0
0 2 4 6 8 10 0 100 200 300 400 500 600 0 100 200 300 400 500 600
Gradient Evaluations Iterations Iterations
Figure 11. covertype dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.
A Progressive Batching L-BFGS Method for Machine Learning
100 1.0
SG
0.8 SVRG
PBQN-MB
10−1
Training Error
Test Loss
0.6 PBQN-FO
SG 0.4
10−2 SVRG
PBQN-MB 0.2
PBQN-FO
10−3 0.0
0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0
Gradient Evaluations Gradient Evaluations
6
100 2.5 ×10 1.0
PBQN-MB
90
2.0 PBQN-FO 0.8
80
Test Accuracy
Batch Sizes
Steplength
70 1.5 0.6
60 SG 1.0 0.4
50 SVRG
PBQN-MB 0.5 0.2 PBQN-MB
40
PBQN-FO PBQN-FO
30 0.0 0.0
0.0 0.5 1.0 1.5 2.0 2.5 3.0 0 50 100 150 200 250 300 0 50 100 150 200 250 300
Gradient Evaluations Iterations Iterations
Figure 12. url dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap
(FO) approaches, and the SG and SVRG methods. Note that we only ran the SG and SVRG algorithms for 3 gradient evaluations since
the equivalent number of iterations already reached of order of magnitude 107 .
C.3.2. CIFAR-10 AND MNIST A LEX N ET- LIKE N ETWORK (A1 , A2 ) A RCHITECTURE
The larger convolutional network (AlexNet) is an adaptation of the AlexNet architecture (Krizhevsky et al., 2012) for
CIFAR-10 and MNIST. The CIFAR-10 version consists of three convolutional layers with max pooling followed by two
fully-connected layers. The first convolutional layer uses a 5 × 5 kernel with a stride of 2 and 64 output channels. The second
and third convolutional layers use a 3 × 3 kernel with a stride of 1 and 64 output channels. Following each convolutional
layer is a set of ReLU activations and 3 × 3 max poolings with strides of 2. This is all followed by two fully-connected
layers with 384 and 192 neurons with ReLU activations, respectively. The MNIST version of this network modifies this by
only using a 2 × 2 max pooling layer after the last convolutional layer.
2.5
SG 2.4 SG 70
2.0 Adam 2.2 Adam
60
PBQN-MB PBQN-MB
Test Accuracy
Training Loss
2.0
Test Loss
1.0 1.6 SG
40
Adam
1.4
0.5 30 PBQN-MB
1.2 PBQN-FO
0.0 1 1.0 1 20 1
10 102 103 104 10 102 103 104 10 102 103 104
Iterations Iterations Iterations
2.5
SG 2.4 SG 70
2.0 Adam 2.2 Adam
60
PBQN-MB PBQN-MB
Test Accuracy
Training Loss
2.0
Test Loss
1.0 1.6 SG
40
Adam
1.4
0.5 30 PBQN-MB
1.2 PBQN-FO
0.0 1.0 20
0 50 100 150 200 0 50 100 150 200 0 50 100 150 200
Gradient Evaluations Gradient Evaluations Gradient Evaluations
4
5 ×10
1.0
4
0.8
Batch Sizes
3
Steplength
0.6
2 SG 0.4
Adam
1 PBQN-MB 0.2 PBQN-MB
PBQN-FO PBQN-FO
0 0.0
0 100 200 300 400 500 0 500 1000 1500 2000
Iterations Iterations
Figure 13. CIFAR-10 ConvNet (C): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap)
and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.
A Progressive Batching L-BFGS Method for Machine Learning
100
1.4 SG 1.4 SG
1.2 Adam 1.2 Adam 95
PBQN-MB PBQN-MB
Test Accuracy
Training Loss
1.0 1.0 90
Test Loss
PBQN-FO PBQN-FO
0.8 0.8
85
0.6 0.6 SG
80 Adam
0.4 0.4
75 PBQN-MB
0.2 0.2
PBQN-FO
0.0 1 0.0 1 70 1
10 102 103 104 10 102 103 104 10 102 103 104
Iterations Iterations Iterations
100
1.4 SG 1.4 SG
1.2 Adam 1.2 Adam 95
PBQN-MB PBQN-MB
Test Accuracy
Training Loss
1.0 1.0 90
Test Loss
PBQN-FO PBQN-FO
0.8 0.8
85
0.6 0.6 SG
80 Adam
0.4 0.4
75 PBQN-MB
0.2 0.2
PBQN-FO
0.0 0.0 70
0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50
Gradient Evaluations Gradient Evaluations Gradient Evaluations
4
6 ×10
1.0
SG
5 Adam 0.8
4 PBQN-MB
Batch Sizes
PBQN-FO
Steplength
0.6
3
0.4
2
1 0.2 PBQN-MB
PBQN-FO
0 0.0
0 200 400 600 800 1000 0 200 400 600 800 1000
Iterations Iterations
Figure 14. MNIST AlexNet (A1 ): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.
A Progressive Batching L-BFGS Method for Machine Learning
2.5
2.4 70
SG
2.0 2.2 Adam
60
2.0 PBQN-MB
Test Accuracy
Training Loss
Test Loss
Test Accuracy
Training Loss
Test Loss
0.7
Steplength
3
0.6
2 SG 0.5
Adam 0.4
1 PBQN-MB 0.3 PBQN-MB
PBQN-FO 0.2 PBQN-FO
0 0.1
0 200 400 600 800 1000 0 500 1000 1500 2000 2500 3000
Iterations Iterations
Figure 15. CIFAR-10 AlexNet (A2 ): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap)
and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.
A Progressive Batching L-BFGS Method for Machine Learning
2.5
2.4 SG 70 SG
2.0 2.2 Adam Adam
60 PBQN-MB
PBQN-MB
Test Accuracy
2.0
Training Loss
PBQN-FO
Test Loss
1.5
1.8 PBQN-FO
50
SG 1.6
1.0 40
Adam 1.4
0.5 PBQN-MB 1.2 30
PBQN-FO 1.0
0.0 1 20 1
10 102 103 101 102 103 10 102 103
Iterations Iterations Iterations
2.5
SG 2.4 SG 70
2.0 Adam 2.2 Adam
60
PBQN-MB PBQN-MB
Test Accuracy
2.0
Training Loss
Test Loss
4 0.8
Batch Sizes
Steplength
3 0.6
2 SG 0.4
Adam
1 PBQN-MB 0.2 PBQN-MB
PBQN-FO PBQN-FO
0 0.0
0 200 400 600 800 1000 0 200 400 600 800 1000 1200 1400
Iterations Iterations
Figure 16. CIFAR-10 ResNet18 (R): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap)
and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.
A Progressive Batching L-BFGS Method for Machine Learning
D. Performance Model
The use of increasing batch sizes in the PBQN algorithm yields a larger effective batch size than the SG method, allowing
PBQN to scale to a larger number of nodes than currently permissible even with large-batch training (Goyal et al., 2017).
With improved scalability and richer gradient information, we expect reduction in training time. To demonstrate the potential
to reduce training time of a parallelized implementation of PBQN, we extend the idealized performance model from (Keskar
et al., 2016) to the PBQN algorithm. For PBQN to be competitive, it must achieve the following: (i) the quality of its solution
should match or improve SG’s solution (as shown in Table 1 of the main paper); (ii) it should utilize a larger effective batch
size; and (iii) it should converge to the solution in a lower number of iterations. We provide an initial analysis for this by
establishing the analytic requirements for improved training time; we leave discussion on implementation details, memory
requirements, and large-scale experiments for future work.
Let the effective batch size for PBQN and conventional SG batch size be denoted as B cL and BS , respectively. From
Algorithm 1, we observe that the PBQN iteration involves extra computation in addition to the gradient computation as
in SG. The additional steps are as follows: the L-BFGS two-loop recursion, which includes several operations over the
stored curvature pairs and network parameters (Algorithm 1:6); the stochastic line search for identifying the steplength
(Algorithm 1:7-16); and curvature pair updating (Algorithm 1:18-21). However, most of these supplemental operations are
performed on the weights of the network, which is orders of magnitude lower than computing the gradient. The two-loop
recursion performs O(10) operations over the network parameters and curvature pairs. The cost for variance estimation is
negligible since we may use a fixed number of samples throughout the run for its computation which can be parallelized
while avoiding becoming a serial bottleneck.
The only exception is the stochastic line search, which requires additional forward propagations over the model for different
sets of network parameters. However, this happens only when the step-length is not accepted, which happens infrequently in
practice. We make the pessimistic assumption of an addition forward propagation every iteration, amounting to an additional
1
3 the cost of the gradient computation (forward propagation, back propagation with respect to activations and weights).
Hence, the ratio of cost-per-iteration for PBQN CL to SG’s cost-per-iteration CS is 43 . Let IS and IL be the number of
iterations that it takes SG and PBQN, respectively, to reach similar test accuracy. The target number of nodes to be used for
training is N , such that N < B cL . For N nodes, the parallel efficiency of SG is assumed to be Pe (N ) and we assume that
for the target node count, there is no drop in parallel efficiency for PBQN due to the large effective batch size.
For a lower training time with the PBQN method, the following relation should hold:
B
cL BS
IL C L < IS CS . (47)
N N Pe (N )
Assuming target node count N = BS < BˆL , the scaling efficiency of SG drops significantly due to the reduced work per
single node, giving a parallel efficiency of Pe (N ) = 0.2; see (Kurth et al., 2017; You et al., 2017). If we additionally assume
that effective batch size for PBQN is 4× larger, with SG large batch ≈ 8K and PBQN ≈ 32K as observed in our experiments
(from Section 4), this gives BcL/BS = 4. PBQN must converge with about the same number of iterations as SG in order to
achieve lower training time. From Section 4, the results show that PBQN converges in significantly fewer iterations than SG,
hence establishing the potential for lower training times. We refer the reader to (Das et al., 2016) for a more detailed model
and commentary on the effect of batch size on performance.