
A Progressive Batching L-BFGS Method for Machine Learning

Raghu Bollapragada 1 Dheevatsa Mudigere 2 Jorge Nocedal 1 Hao-Jun Michael Shi 1 Ping Tak Peter Tang 3

1 Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, USA. 2 Intel Corporation, Bangalore, India. 3 Intel Corporation, Santa Clara, CA, USA. Correspondence to: Raghu Bollapragada <raghu.bollapragada@u.northwestern.edu>, Dheevatsa Mudigere <dheevatsa.mudigere@intel.com>, Jorge Nocedal <j-nocedal@northwestern.edu>, Hao-Jun Michael Shi <hjmshi@u.northwestern.edu>, Ping Tak Peter Tang <peter.tang@intel.com>.

arXiv:1802.05374v2 [math.OC] 30 May 2018

Abstract

The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications. One need not, however, choose between the two extremes represented by the full batch or highly stochastic regimes, and may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. In this paper, we present a new version of the L-BFGS algorithm that combines three basic components — progressive batching, a stochastic line search, and stable quasi-Newton updating — and that performs well on training logistic regression and deep neural networks. We provide supporting convergence theory for the method.

1. Introduction

The L-BFGS method (Liu & Nocedal, 1989) has traditionally been regarded as a batch method in the machine learning community. This is because quasi-Newton algorithms need gradients of high quality in order to construct useful quadratic models and perform reliable line searches. These algorithmic ingredients can be implemented, it seems, only by using very large batch sizes, resulting in a costly iteration that makes the overall algorithm slow compared with stochastic gradient methods (Robbins & Monro, 1951).

Even before the resurgence of neural networks, many researchers observed that a well-tuned implementation of the stochastic gradient (SG) method was far more effective on large-scale logistic regression applications than the batch L-BFGS method, even when taking into account the advantages of parallelism offered by the use of large batches. The preeminence of the SG method (and its variants) became more pronounced with the advent of deep neural networks, and some researchers have speculated that SG is endowed with certain regularization properties that are essential in the minimization of such complex nonconvex functions (Hardt et al., 2015; Keskar et al., 2016).

In this paper, we postulate that the most efficient algorithms for machine learning may not reside entirely in the highly stochastic or full batch regimes, but should employ a progressive batching approach in which the sample size is initially small, and is increased as the iteration progresses. This view is consistent with recent numerical experiments on training various deep neural networks (Smith et al., 2017; Goyal et al., 2017), where the SG method, with increasing sample sizes, yields similar test loss and accuracy as the standard (fixed mini-batch) SG method, while offering significantly greater opportunities for parallelism.

Progressive batching algorithms have received much attention recently from a theoretical perspective. It has been shown that they enjoy complexity bounds that rival those of the SG method (Byrd et al., 2012), and that they can achieve a fast rate of convergence (Friedlander & Schmidt, 2012). The main appeal of these methods is that they inherit the efficient initial behavior of the SG method, offer greater opportunities to exploit parallelism, and allow for the incorporation of second-order information. The latter can be done efficiently via quasi-Newton updating.

An integral part of quasi-Newton methods is the line search, which ensures that a convex quadratic model can be constructed at every iteration. One challenge that immediately arises is how to perform this line search when the objective function is stochastic. This is an issue that has not received sufficient attention in the literature, where stochastic line searches have been largely dismissed as inappropriate. In this paper, we take a step towards the development of stochastic line searches for machine learning by studying a key component, namely the initial estimate in the one-dimensional search. Our approach, which is based on statistical considerations, is designed for an Armijo-style backtracking line search.
1.1. Literature Review

Progressive batching (sometimes referred to as dynamic sampling) has been well studied in the optimization literature, both for stochastic gradient and subsampled Newton-type methods (Byrd et al., 2012; Friedlander & Schmidt, 2012; Cartis & Scheinberg, 2015; Pasupathy et al., 2015; Roosta-Khorasani & Mahoney, 2016a;b; Bollapragada et al., 2016; 2017; De et al., 2017). Friedlander and Schmidt (2012) introduced theoretical conditions under which a progressive batching SG method converges linearly for finite sum problems, and experimented with a quasi-Newton adaptation of their algorithm. Byrd et al. (2012) proposed a progressive batching strategy, based on a norm test, that determines when to increase the sample size; they established linear convergence and computational complexity bounds in the case when the batch size grows geometrically. More recently, Bollapragada et al. (2017) introduced a batch control mechanism based on an inner product test that improves upon the norm test mentioned above.

There has been a renewed interest in understanding the generalization properties of small-batch and large-batch methods for training neural networks; see (Keskar et al., 2016; Dinh et al., 2017; Goyal et al., 2017; Hoffer et al., 2017). Keskar et al. (2016) empirically observed that large-batch methods converge to solutions with inferior generalization properties; however, Goyal et al. (2017) showed that large-batch methods can match the performance of small-batch methods when a warm-up strategy is used in conjunction with scaling the steplength by the same factor as the batch size. Hoffer et al. (2017) and You et al. (2017) also explored larger batch sizes and steplengths to reduce the number of updates necessary to train the network. All of these studies naturally led to an interest in progressive batching techniques. Smith et al. (2017) showed empirically that increasing the sample size and decaying the steplength are quantitatively equivalent for the SG method; hence, steplength schedules can be directly converted to batch size schedules. This approach was parallelized by Devarakonda et al. (2017). De et al. (2017) presented numerical results with a progressive batching method that employs the norm test. Balles et al. (2016) proposed an adaptive dynamic sample size scheme that couples the sample size with the steplength.

Stochastic second-order methods have been explored within the context of convex and non-convex optimization; see (Schraudolph et al., 2007; Sohl-Dickstein et al., 2014; Mokhtari & Ribeiro, 2015; Berahas et al., 2016; Byrd et al., 2016; Keskar & Berahas, 2016; Curtis, 2016; Berahas & Takáč, 2017; Zhou et al., 2017). Schraudolph et al. (2007) ensured stability of quasi-Newton updating by computing gradients using the same batch at the beginning and end of the iteration. Since this can potentially double the cost of the iteration, Berahas et al. (2016) proposed to achieve gradient consistency by computing gradients based on the overlap between consecutive batches; this approach was further tested by Berahas and Takáč (2017). An interesting approach introduced by Martens and Grosse (2015; 2016) approximates the Fisher information matrix to scale the gradient; a distributed implementation of their K-FAC approach is described in (Ba et al., 2016). Another approach approximately computes the inverse Hessian by using the Neumann power series representation of matrices (Krishnan et al., 2017).

1.2. Contributions

This paper builds upon three algorithmic components that have recently received attention in the literature — progressive batching, stable quasi-Newton updating, and adaptive steplength selection. It advances their design and puts them together in a novel algorithm with attractive theoretical and computational properties.

The cornerstone of our progressive batching strategy is the mechanism proposed by Bollapragada et al. (2017) in the context of first-order methods. We extend their inner product control test to second-order algorithms, something that is delicate and leads to a significant modification of the original procedure. Another main contribution of the paper is the design of an Armijo-style backtracking line search where the initial steplength is chosen based on statistical information gathered during the course of the iteration. We show that this steplength procedure is effective on a wide range of applications, as it leads to well scaled steps and allows for the BFGS update to be performed most of the time, even for nonconvex problems. We also test two techniques for ensuring the stability of quasi-Newton updating, and observe that the overlapping procedure described by Berahas et al. (2016) is more efficient than a straightforward adaptation of classical quasi-Newton methods (Schraudolph et al., 2007).

We report numerical tests on large-scale logistic regression and deep neural network training tasks that indicate that our method is robust and efficient, and has good generalization properties. An additional advantage is that the method requires almost no parameter tuning, which is possible due to the incorporation of second-order information. All of this suggests that our approach has the potential to become one of the leading optimization methods for training deep neural networks. In order to achieve this, the algorithm must be optimized for parallel execution, something that was only briefly explored in this study.
2. A Progressive Batching Quasi-Newton Method

The problem of interest is

    \min_{x \in R^d} F(x) = \int f(x; z, y) \, dP(z, y),    (1)

where f is the composition of a prediction function (parametrized by x) and a loss function, and (z, y) are random input-output pairs with probability distribution P(z, y). The associated empirical risk problem consists of minimizing

    R(x) = (1/N) \sum_{i=1}^{N} f(x; z^i, y^i) = (1/N) \sum_{i=1}^{N} F_i(x),

where we define F_i(x) = f(x; z^i, y^i). A stochastic quasi-Newton method is given by

    x_{k+1} = x_k - \alpha_k H_k g_k^{S_k},    (2)

where the batch (or subsampled) gradient is given by

    g_k^{S_k} = \nabla F_{S_k}(x_k) = (1/|S_k|) \sum_{i \in S_k} \nabla F_i(x_k),    (3)

the set S_k \subset \{1, 2, \cdots\} indexes data points (y^i, z^i) sampled from the distribution P(z, y), and H_k is a positive definite quasi-Newton matrix. We now discuss each of the components of the new method.

2.1. Sample Size Selection

The proposed algorithm has the form (2)-(3). Initially, it utilizes a small batch size |S_k|, and increases it gradually in order to attain a fast local rate of convergence and permit the use of second-order information. A challenging question is to determine when, and by how much, to increase the batch size |S_k| over the course of the optimization procedure based on observed gradients — as opposed to using prescribed rules that depend on the iteration number k.

We propose to build upon the strategy introduced by Bollapragada et al. (2017) in the context of first-order methods. Their inner product test determines a sample size such that the search direction is a descent direction with high probability. A straightforward extension of this strategy to the quasi-Newton setting is not appropriate since requiring only that a stochastic quasi-Newton search direction be a descent direction with high probability would underutilize the curvature information contained in the search direction.

We would like, instead, for the search direction d_k = -H_k g_k^{S_k} to make an acute angle with the true quasi-Newton search direction -H_k \nabla F(x_k), with high probability. Although this does not imply that d_k is a descent direction for F, this will normally be the case for any reasonable quasi-Newton matrix.

To derive the new inner product quasi-Newton (IPQN) test, we first observe that the stochastic quasi-Newton search direction makes an acute angle with the true quasi-Newton direction in expectation, i.e.,

    E_k[(H_k \nabla F(x_k))^T (H_k g_k^{S_k})] = \|H_k \nabla F(x_k)\|^2,    (4)

where E_k denotes the conditional expectation at x_k. We must, however, control the variance of this quantity to achieve our stated objective. Specifically, we select the sample size |S_k| such that the following condition is satisfied:

    E_k[((H_k \nabla F(x_k))^T (H_k g_k^{S_k}) - \|H_k \nabla F(x_k)\|^2)^2] \leq \theta^2 \|H_k \nabla F(x_k)\|^4,    (5)

for some \theta > 0. The left hand side of (5) is difficult to compute, but since the variance of a sample mean is the population variance divided by the sample size, it can be bounded using the true variance of the individual search directions, i.e.,

    E_k[((H_k \nabla F(x_k))^T (H_k g_k^i) - \|H_k \nabla F(x_k)\|^2)^2] / |S_k| \leq \theta^2 \|H_k \nabla F(x_k)\|^4,    (6)

where g_k^i = \nabla F_i(x_k). This test involves the true expected gradient and variance, but we can approximate these quantities with sample gradient and variance estimates, respectively, yielding the practical inner product quasi-Newton test:

    Var_{i \in S_k^v}((g_k^i)^T H_k^2 g_k^{S_k}) / |S_k| \leq \theta^2 \|H_k g_k^{S_k}\|^4,    (7)

where S_k^v \subseteq S_k is a subset of the current sample (batch), and the variance term is defined as

    Var_{i \in S_k^v}((g_k^i)^T H_k^2 g_k^{S_k}) = (1/(|S_k^v| - 1)) \sum_{i \in S_k^v} ((g_k^i)^T H_k^2 g_k^{S_k} - \|H_k g_k^{S_k}\|^2)^2.    (8)

The variance (8) may be computed using just one additional Hessian-vector product of H_k with H_k g_k^{S_k}. Whenever condition (7) is not satisfied, we increase the sample size |S_k|. In order to estimate the increase that would lead to a satisfaction of (7), we reason as follows. If we assume that the new sample \bar{S}_k is such that

    H_k g_k^{S_k} \approx H_k g_k^{\bar{S}_k},

and similarly for the variance estimate, then a simple computation shows that a lower bound on the new sample size is

    |\bar{S}_k| \geq Var_{i \in S_k^v}((g_k^i)^T H_k^2 g_k^{S_k}) / (\theta^2 \|H_k g_k^{S_k}\|^4) = b_k.    (9)

In our implementation of the algorithm, we set the new sample size as |S_{k+1}| = \lceil b_k \rceil. When the sample approximation of F(x_k) is not accurate, which can occur when |S_k| is small, the progressive batching mechanism just described may not be reliable. In this case we employ the moving window technique described in Section 4.2 of Bollapragada et al. (2017) to produce a sample estimate of \nabla F(x_k).
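As an illustration of how this control might be implemented, the sketch below evaluates the practical test (7), the sample variance (8), and the bound (9). It is a minimal NumPy sketch rather than the authors' code; the function name, its signature, and the callable apply_H (standing for whatever routine applies H_k to a vector, e.g. an L-BFGS two-loop recursion) are our own assumptions.

```python
import numpy as np

def ipqn_test(grad_samples, grad_batch, apply_H, batch_size, theta=0.9):
    """Practical inner product quasi-Newton (IPQN) test, Eqs. (7)-(9).

    grad_samples : (m, d) array of individual gradients g_k^i for i in S_k^v
    grad_batch   : (d,) averaged batch gradient g_k^{S_k}
    apply_H      : callable v -> H_k v (e.g. an L-BFGS two-loop recursion)
    batch_size   : |S_k|
    Returns (test_passed, b_k), where b_k is the lower bound (9) on the
    new sample size suggested when the test fails.
    """
    d = apply_H(grad_batch)          # H_k g_k^{S_k}
    w = apply_H(d)                   # H_k^2 g_k^{S_k}: one extra product with H_k
    ref = d.dot(d)                   # ||H_k g_k^{S_k}||^2
    t = grad_samples @ w             # (g_k^i)^T H_k^2 g_k^{S_k} for each i
    var = np.sum((t - ref) ** 2) / (len(t) - 1)                 # Eq. (8)
    test_passed = var / batch_size <= theta ** 2 * ref ** 2     # Eq. (7)
    b_k = var / (theta ** 2 * ref ** 2)                         # Eq. (9)
    return test_passed, b_k
```

In the implementation described in Section 2.4, when the test fails, ceil(b_k) - |S_k| additional indices are sampled and appended to S_k.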
2.2. The Line Search

In deterministic optimization, line searches are employed to ensure that the step is not too short and to guarantee sufficient decrease in the objective function. Line searches are particularly important in quasi-Newton methods since they ensure robustness and efficiency of the iteration with little additional cost.

In contrast, stochastic line searches are poorly understood and rarely employed in practice because they must make decisions based on sample function values

    F_{S_k}(x) = (1/|S_k|) \sum_{i \in S_k} F_i(x),    (10)

which are noisy approximations to the true objective F. One of the key questions in the design of a stochastic line search is how to ensure, with high probability, that there is a decrease in the true function when one can only observe stochastic approximations F_{S_k}(x). We address this question by proposing a formula for the step size \alpha_k that controls possible increases in the true function. Specifically, the first trial steplength in the stochastic backtracking line search is computed so that the predicted decrease in the expected function value is sufficiently large, as we now explain.

Using Lipschitz continuity of \nabla F(x) and taking conditional expectation, we can show the following inequality

    E_k[F_{k+1}] \leq F_k - \alpha_k \nabla F(x_k)^T H_k^{1/2} W_k H_k^{1/2} \nabla F(x_k),    (11)

where

    W_k = I - (L \alpha_k / 2) (1 + Var\{H_k g_k^i\} / (|S_k| \|H_k \nabla F(x_k)\|^2)) H_k,

    Var\{H_k g_k^i\} = E_k[\|H_k g_k^i - H_k \nabla F(x_k)\|^2],

F_k = F(x_k), and L is the Lipschitz constant. The proof of (11) is given in the supplement (Lemma A.1).

The only difference in (11) between the deterministic and stochastic quasi-Newton methods is the additional variance term in the matrix W_k. To obtain decrease in the function value in the deterministic case, the matrix I - (L\alpha_k/2) H_k must be positive definite, whereas in the stochastic case the matrix W_k must be positive definite to yield a decrease in F in expectation. In the deterministic case, for a reasonably good quasi-Newton matrix H_k, one expects that \alpha_k = 1 will result in a decrease in the function, and therefore the initial trial steplength parameter should be chosen to be 1. In the stochastic case, the initial trial value

    \hat{\alpha}_k = (1 + Var\{H_k g_k^i\} / (|S_k| \|H_k \nabla F(x_k)\|^2))^{-1}    (12)

will result in decrease in the expected function value. However, since formula (12) involves the expensive computation of the individual matrix-vector products H_k g_k^i, we approximate the variance-bias ratio as follows:

    \bar{\alpha}_k = (1 + Var\{g_k^i\} / (|S_k| \|\nabla F(x_k)\|^2))^{-1},    (13)

where Var\{g_k^i\} = E_k[\|g_k^i - \nabla F(x_k)\|^2]. In our practical implementation, we estimate the population variance and gradient with the sample variance and gradient, respectively, yielding the initial steplength

    \alpha_k = (1 + Var_{i \in S_k^v}\{g_k^i\} / (|S_k| \|g_k^{S_k}\|^2))^{-1},    (14)

where

    Var_{i \in S_k^v}\{g_k^i\} = (1/(|S_k^v| - 1)) \sum_{i \in S_k^v} \|g_k^i - g_k^{S_k}\|^2    (15)

and S_k^v \subseteq S_k. With this initial value of \alpha_k in hand, our algorithm performs a backtracking line search that aims to satisfy the Armijo condition

    F_{S_k}(x_k - \alpha_k H_k g_k^{S_k}) \leq F_{S_k}(x_k) - c_1 \alpha_k (g_k^{S_k})^T H_k g_k^{S_k},    (16)

where c_1 > 0.
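A corresponding sketch of the initial steplength (14)-(15) and of the backtracking loop that enforces the sampled Armijo condition (16) is given below; the function names, the cap on the number of backtracking steps, and the callable loss_batch are our own choices, not part of the paper's implementation.

```python
import numpy as np

def initial_steplength(grad_samples, grad_batch, batch_size):
    """Initial trial steplength of Eq. (14), using the sample variance (15)."""
    var = np.sum((grad_samples - grad_batch) ** 2) / (len(grad_samples) - 1)
    return 1.0 / (1.0 + var / (batch_size * grad_batch.dot(grad_batch)))

def armijo_backtracking(loss_batch, x, p, grad_batch, alpha0, c1=1e-4):
    """Halve alpha until the sampled Armijo condition (16) holds.

    loss_batch : callable x -> F_{S_k}(x) evaluated on the current batch
    p          : search direction -H_k g_k^{S_k}
    """
    f0 = loss_batch(x)
    slope = grad_batch.dot(p)        # (g_k^{S_k})^T p = -(g_k^{S_k})^T H_k g_k^{S_k}
    alpha = alpha0
    for _ in range(50):              # cap on the number of backtracking steps
        if loss_batch(x + alpha * p) <= f0 + c1 * alpha * slope:
            break
        alpha *= 0.5
    return alpha
```

As reported in Section 4, the initial value (14) is accepted by the Armijo test on most iterations, so in practice the loop rarely performs additional function evaluations.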
2.3. Stable Quasi-Newton Updates

In the BFGS and L-BFGS methods, the inverse Hessian approximation is updated using the formula

    H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T,  where  \rho_k = (y_k^T s_k)^{-1},  V_k = I - \rho_k y_k s_k^T,    (17)

where s_k = x_{k+1} - x_k and y_k is the difference in the gradients at x_{k+1} and x_k. When the batch changes from one iteration to the next (S_{k+1} \neq S_k), it is not obvious how y_k should be defined. It has been observed that when y_k is computed using different samples, the updating process may be unstable, and hence it seems natural to use the same sample at the beginning and at the end of the iteration (Schraudolph et al., 2007), and define

    y_k = g_{k+1}^{S_k} - g_k^{S_k}.    (18)

However, this requires that the gradient be evaluated twice for every batch S_k, at x_k and x_{k+1}. To avoid this additional cost, Berahas et al. (2016) propose to use the overlap between consecutive samples in the gradient differencing. If we denote this overlap as O_k = S_k \cap S_{k+1}, then one defines

    y_k = g_{k+1}^{O_k} - g_k^{O_k}.    (19)

This requires no extra computation since the two gradients in this expression are subsets of the gradients corresponding to the samples S_k and S_{k+1}. The overlap should not be too small, to avoid differencing noise, but this is easily achieved in practice. We test both formulas for y_k in our implementation of the method; see Section 4.
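A small sketch of the multi-batch curvature pair (19) follows; grad_fn is a hypothetical helper that averages the gradients of F_i over a given index set, and in an actual implementation both terms would be obtained as sub-averages of gradients already computed for S_k and S_{k+1} rather than recomputed.

```python
def overlap_curvature_pair(x_prev, x_new, overlap_idx, grad_fn):
    """Curvature pair (s_k, y_k) with y_k defined by the overlap formula (19).

    overlap_idx : indices in the overlap O_k = S_k intersected with S_{k+1}
    grad_fn     : callable (x, idx) -> average of grad F_i(x) over idx
    """
    s = x_new - x_prev
    # g_{k+1}^{O_k} - g_k^{O_k}; in practice these are sub-averages of gradients
    # that were already evaluated on S_k and S_{k+1}, so no extra work is needed.
    y = grad_fn(x_new, overlap_idx) - grad_fn(x_prev, overlap_idx)
    return s, y
```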
2.4. The Complete Algorithm

The pseudocode of the progressive batching L-BFGS method is given in Algorithm 1. Observe that the limited memory Hessian approximation H_k in Line 8 is independent of the choice of the sample S_k. Specifically, H_k is defined by a collection of curvature pairs {(s_j, y_j)}, where the most recent pair is based on the sample S_{k-1}; see Line 14. For the batch size control test (7), we choose \theta = 0.9 in the logistic regression experiments, and \theta is a tunable parameter chosen in the interval [0.9, 3] in the neural network experiments. The constant c_1 in (16) is set to c_1 = 10^{-4}. For L-BFGS, we set the memory as m = 10. We skip the quasi-Newton update if the following curvature condition is not satisfied:

    y_k^T s_k > \epsilon \|s_k\|^2,  with  \epsilon = 10^{-2}.    (20)

The initial Hessian matrix H_k^0 in the L-BFGS recursion at each iteration is chosen as \gamma_k I, where \gamma_k = y_k^T s_k / y_k^T y_k.

Algorithm 1 Progressive Batching L-BFGS Method
Input: Initial iterate x_0, initial sample size |S_0|;
Initialization: Set k ← 0
Repeat until convergence:
 1: Sample S_k ⊆ {1, ..., N} with sample size |S_k|
 2: if condition (7) is not satisfied then
 3:   Compute b_k using (9), and set b̂_k ← ⌈b_k⌉ − |S_k|
 4:   Sample S^+ ⊆ {1, ..., N} \ S_k with |S^+| = b̂_k
 5:   Set S_k ← S_k ∪ S^+
 6: end if
 7: Compute g_k^{S_k}
 8: Compute p_k = −H_k g_k^{S_k} using the L-BFGS two-loop recursion (Nocedal & Wright, 1999)
 9: Compute α_k using (14)
10: while the Armijo condition (16) is not satisfied do
11:   Set α_k = α_k / 2
12: end while
13: Compute x_{k+1} = x_k + α_k p_k
14: Compute y_k using (18) or (19)
15: Compute s_k = x_{k+1} − x_k
16: if y_k^T s_k > ε‖s_k‖^2 then
17:   if number of stored (y_j, s_j) exceeds m then
18:     Discard oldest curvature pair (y_j, s_j)
19:   end if
20:   Store new curvature pair (y_k, s_k)
21: end if
22: Set k ← k + 1
23: Set |S_k| = |S_{k-1}|
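For completeness, here is a standard sketch of the two-loop recursion invoked in Line 8 of Algorithm 1, with the initialization H_k^0 = γ_k I described above, together with a memory update that enforces the curvature condition (20); the helper names and the list-based storage are our own.

```python
import numpy as np

def lbfgs_two_loop(grad, s_list, y_list):
    """Two-loop recursion: returns H_k * grad (Line 8 uses p_k = -H_k g_k^{S_k}).

    s_list, y_list : stored curvature pairs, oldest first, at most m of them.
    The initial matrix is H_k^0 = gamma_k I with gamma_k = y^T s / y^T y.
    """
    q = grad.copy()
    alphas = []
    rhos = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):   # newest to oldest
        a = rho * s.dot(q)
        alphas.append(a)
        q -= a * y
    gamma = y_list[-1].dot(s_list[-1]) / y_list[-1].dot(y_list[-1]) if s_list else 1.0
    r = gamma * q
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):  # oldest to newest
        b = rho * y.dot(r)
        r += (a - b) * s
    return r

def update_memory(s_list, y_list, s, y, m=10, eps=1e-2):
    """Store (s_k, y_k) only if the curvature condition (20) holds; keep at most m pairs."""
    if y.dot(s) > eps * s.dot(s):
        if len(s_list) >= m:
            s_list.pop(0); y_list.pop(0)
        s_list.append(s); y_list.append(y)
```

The search direction in Line 8 is then p_k = -lbfgs_two_loop(g_k, s_list, y_list).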
3. Convergence Analysis

We now present convergence results for the proposed algorithm, both for strongly convex and nonconvex objective functions. Our emphasis is in analyzing the effect of progressive sampling, and therefore, we follow common practice and assume that the steplength in the algorithm is fixed (\alpha_k = \alpha), and that the inverse L-BFGS matrix H_k has bounded eigenvalues, i.e.,

    \Lambda_1 I \preceq H_k \preceq \Lambda_2 I.    (21)

This assumption can be justified both in the convex and non-convex cases under certain conditions; see (Berahas et al., 2016). We assume that the sample size is controlled by the exact inner product quasi-Newton test (6). This test is designed for efficiency, and in rare situations could allow for the generation of arbitrarily long search directions. To prevent this from happening, we introduce an additional control on the sample size |S_k|, by extending (to the quasi-Newton setting) the orthogonality test introduced in (Bollapragada et al., 2017). This additional requirement states that the current sample size |S_k| is acceptable only if

    E_k[\| H_k g_k^i - ((H_k g_k^i)^T (H_k \nabla F(x_k)) / \|H_k \nabla F(x_k)\|^2) H_k \nabla F(x_k) \|^2] / |S_k| \leq \nu^2 \|H_k \nabla F(x_k)\|^2,    (22)

for some given \nu > 0.

We now establish linear convergence when the objective is strongly convex.

Theorem 3.1. Suppose that F is twice continuously differentiable and that there exist constants 0 < \mu \leq L such that
    \mu I \preceq \nabla^2 F(x) \preceq L I,  \forall x \in R^d.    (23)

Let {x_k} be generated by iteration (2), for any x_0, where |S_k| is chosen by the (exact variance) inner product quasi-Newton test (6). Suppose that the orthogonality condition (22) holds at every iteration, and that the matrices H_k satisfy (21). Then, if

    \alpha_k = \alpha \leq 1 / ((1 + \theta^2 + \nu^2) L \Lambda_2),    (24)

we have that

    E[F(x_k) - F(x^*)] \leq \rho^k (F(x_0) - F(x^*)),    (25)

where x^* denotes the minimizer of F, \rho = 1 - \mu \Lambda_1 \alpha, and E denotes the total expectation.

The proof of this result is given in the supplement. We now consider the case when F is nonconvex and bounded below.

Theorem 3.2. Suppose that F is twice continuously differentiable and bounded below, and that there exists a constant L > 0 such that

    \nabla^2 F(x) \preceq L I,  \forall x \in R^d.    (26)

Let {x_k} be generated by iteration (2), for any x_0, where |S_k| is chosen so that (6) and (22) are satisfied, and suppose that (21) holds. Then, if \alpha_k satisfies (24), we have

    \lim_{k \to \infty} E[\|\nabla F(x_k)\|^2] = 0.    (27)

Moreover, for any positive integer T we have that

    \min_{0 \leq k \leq T-1} E[\|\nabla F(x_k)\|^2] \leq (2 / (\alpha T \Lambda_1)) (F(x_0) - F_{min}),

where F_{min} is a lower bound on F in R^d.

The proof is given in the supplement. This result shows that the sequence of gradients {\|\nabla F(x_k)\|} converges to zero in expectation, and establishes a global sublinear rate of convergence of the smallest gradients generated after every T steps.

4. Numerical Results

In this section, we present numerical results for the proposed algorithm, which we refer to as PBQN, for the Progressive Batching Quasi-Newton algorithm.

4.1. Experiments on Logistic Regression Problems

We first test our algorithm on binary classification problems where the objective function is given by the logistic loss with \ell_2 regularization:

    R(x) = (1/N) \sum_{i=1}^{N} \log(1 + \exp(-z^i x^T y^i)) + (\lambda/2) \|x\|^2,    (28)

with \lambda = 1/N. We consider the 8 datasets listed in the supplement. An approximation R^* of the optimal function value is computed for each problem by running the full batch L-BFGS method until \|\nabla R(x_k)\|_\infty \leq 10^{-8}. Training error is defined as R(x_k) - R^*, where R(x_k) is evaluated over the training set; test loss is evaluated over the test set without the \ell_2 regularization term.
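To make the experimental setup concrete, a minimal NumPy sketch of the objective (28) and of the corresponding subsampled gradient (3) is given below; the vectorized data layout (rows y^i, labels z^i in {-1, +1}) and the function names are our own choices, and a numerically stable sigmoid would be used in a careful implementation.

```python
import numpy as np

def logistic_loss(x, Y, z, lam):
    """Regularized logistic loss of Eq. (28): Y has rows y^i, z holds labels in {-1, +1}."""
    margins = z * (Y @ x)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * x.dot(x)

def logistic_grad(x, Y, z, lam, idx=None):
    """Average gradient over the index set idx, i.e. the subsampled gradient (3);
    idx=None uses the full data set."""
    if idx is not None:
        Y, z = Y[idx], z[idx]
    margins = z * (Y @ x)
    coeff = -z / (1.0 + np.exp(margins))   # derivative of log(1 + exp(-m)) w.r.t. x^T y^i, times z^i
    return Y.T @ coeff / len(z) + lam * x
```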
We tested two options for computing the curvature vector y_k in the PBQN method: the multi-batch (MB) approach (19) with 25% sample overlap, and the full overlap (FO) approach (18). We set \theta = 0.9 in (7), chose |S_0| = 512, and set all other parameters to the default values given in Section 2. Thus, none of the parameters in our PBQN method were tuned for each individual dataset. We compared our algorithm against two other methods: (i) stochastic gradient (SG) with a batch size of 1; (ii) SVRG (Johnson & Zhang, 2013) with the inner loop length set to N. The steplength for SG and SVRG is constant and tuned for each problem (\alpha_k \equiv \alpha = 2^j, for j \in \{-10, -9, ..., 9, 10\}) so as to give best performance.

In Figures 1 and 2 we present results for two datasets, spam and covertype; the rest of the results are given in the supplement. The horizontal axis measures the number of full gradient evaluations, or equivalently, the number of times that N component gradients \nabla F_i were evaluated. The left-most figure reports the long term trend over 100 gradient evaluations, while the rest of the figures zoom into the first 10 gradient evaluations to show the initial behavior of the methods. The vertical axis measures training error, test loss, and test accuracy, respectively, from left to right.

The proposed algorithm competes well for these two datasets in terms of training error, test loss and test accuracy, and decreases these measures more evenly than the SG and SVRG methods. Our numerical experience indicates that formula (14) is quite effective at estimating the steplength parameter, as it is accepted by the backtracking line search for most iterations. As a result, the line search computes very few additional function values.

It is interesting to note that SVRG is not as efficient in the initial epochs compared to PBQN or SG, when measured in terms of either test loss or test accuracy. The training error for SVRG decreases rapidly in later epochs, but this rapid improvement is not observed in the test loss and accuracy. Neither PBQN nor SVRG significantly outperforms the other across all datasets tested in terms of training error, as observed in the supplement.

Our results indicate that defining the curvature vector using the MB approach is preferable to using the FO approach. The number of iterations required by the PBQN method is significantly smaller compared to the SG method, suggesting the potential efficiency gains of a parallel implementation of our algorithm.

Figure 1. spam dataset: Performance of the progressive batching L-BFGS method (PBQN), with multi-batch (25% overlap) and full-overlap approaches, and the SG and SVRG methods.

Figure 2. covertype dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and full-overlap approaches, and the SG and SVRG methods.

4.2. Results on Neural Networks

We have performed a preliminary investigation into the performance of the PBQN algorithm for training neural networks. As is well known, the resulting optimization problems are quite difficult due to the existence of local minimizers, some of which generalize poorly. Thus our first requirement when applying the PBQN method was to obtain as good generalization as SG, something we have achieved. Our investigation into how to obtain fast performance is, however, still underway for reasons discussed below. Nevertheless, our results are worth reporting because they show that our line search procedure is performing as expected, and that the overall number of iterations required by the PBQN method is small enough so that a parallel implementation could yield state-of-the-art results, based on the theoretical performance model detailed in the supplement.

We compared our algorithm, as described in Section 2, against SG and Adam (Kingma & Ba, 2014). It has taken many years to design regularization techniques and heuristics that greatly improve the performance of the SG method for deep learning (Srivastava et al., 2014; Ioffe & Szegedy, 2015). These include batch normalization and dropout, which (in their current form) are not conducive to the PBQN approach due to the need for gradient consistency when evaluating the curvature pairs in L-BFGS. Therefore, we do not implement batch normalization and dropout in any of the methods tested, and leave the study of their extension to the PBQN setting for future work.

We consider three network architectures: (i) a small convolutional neural network on CIFAR-10 (C) (Krizhevsky, 2009), (ii) an AlexNet-like convolutional network on MNIST and CIFAR-10 (A1, A2, respectively) (LeCun et al., 1998; Krizhevsky et al., 2012), and (iii) a residual network (ResNet18) on CIFAR-10 (R) (He et al., 2016). The network architecture details and additional plots are given in the supplement. All of these networks were implemented in PyTorch (Paszke et al., 2017). The results for the CIFAR-10 AlexNet and the CIFAR-10 ResNet18 are given in Figures 3 and 4, respectively. We report results both against the total number of iterations and the total number of gradient evaluations. Table 1 shows the best test accuracies attained by each of the four methods over the various networks.

In all our experiments, we initialize the batch size as |S_0| = 512 in the PBQN method, and fix the batch size to |S_k| = 128 for SG and Adam. The parameter \theta given in (7), which controls the batch size increase in the PBQN method, was tuned lightly by choosing among the 3 values 0.9, 2, 3. SG and Adam are tuned using a development-based decay (dev-decay) scheme, which tracks the best validation loss at each epoch and reduces the steplength by a constant factor \delta if the validation loss does not improve after e epochs.
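The dev-decay rule can be summarized by the short sketch below; delta and the patience e are the symbols used in the paper, while the default values shown here are placeholders rather than the settings used in the experiments.

```python
def dev_decay(val_losses, alpha, delta=0.5, patience=1):
    """Dev-decay sketch: shrink the steplength by delta if the best validation
    loss has not improved during the last `patience` epochs.

    val_losses : list of validation losses, one entry per completed epoch
    Returns the (possibly reduced) steplength for the next epoch.
    """
    if len(val_losses) > patience and \
            min(val_losses[-patience:]) > min(val_losses[:-patience]):
        return alpha * delta
    return alpha
```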
We observe from our results that the PBQN method achieves a similar test accuracy as SG and Adam, but requires more gradient evaluations. Improvements in performance can be obtained by ensuring that the PBQN method exerts a finer control on the sample size in the small batch regime — something that requires further investigation. Nevertheless, the small number of iterations required by the PBQN method, together with the fact that it employs larger batch sizes than SG during much of the run, suggests that a distributed version similar to a data-parallel distributed implementation of the SG method (Chen et al., 2016; Das et al., 2016) would lead to a highly competitive method.

Similar to the logistic regression case, we observe that the steplength computed via (14) is almost always accepted by the Armijo condition, and typically lies within (0.1, 1). Once the algorithm has trained for a significant number of iterations using the full batch, the algorithm begins to overfit on the training set, resulting in worsened test loss and accuracy, as observed in the graphs.

Figure 3. CIFAR-10 AlexNet (A2): Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and full-overlap approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.

Figure 4. CIFAR-10 ResNet18 (R): Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and full-overlap approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.

Table 1. Best test accuracy performance of SG, Adam, multi-batch L-BFGS, and full overlap L-BFGS on various networks over 5 different runs and initializations.

Network   SG      Adam    MB      FO
C         66.24   67.03   67.37   62.46
A1        99.25   99.34   99.16   99.05
A2        73.46   73.59   73.02   72.74
R         69.5    70.16   70.28   69.44

5. Final Remarks

Several types of quasi-Newton methods have been proposed in the literature to address the challenges arising in machine learning. Some of these methods operate in the purely stochastic setting (which makes quasi-Newton updating difficult) or in the purely batch regime (which leads to generalization problems). We believe that progressive batching is the right context for designing an L-BFGS method that has good generalization properties, does not expose any free parameters, and has fast convergence. The advantages of our approach are clearly seen in logistic regression experiments. To make the new method competitive with SG and Adam for deep learning, we need to improve several of its components. This includes the design of a more robust progressive batching mechanism, the redesign of batch normalization and dropout heuristics to improve the generalization performance of our method for training larger networks, and most importantly, the design of a parallelized implementation that takes advantage of the higher granularity of each iteration. We believe that the potential of the proposed approach as an alternative to SG for deep learning is worthy of further investigation.

Acknowledgements

We thank Albert Berahas for his insightful comments regarding multi-batch L-BFGS and probabilistic line searches, as well as for his useful feedback on earlier versions of the manuscript. We also thank the anonymous reviewers for their useful feedback. Bollapragada is supported by DOE award DE-FG02-87ER25047. Nocedal is supported by NSF award DMS-1620070. Shi is supported by Intel grant SP0036122.

References

Ba, J., Grosse, R., and Martens, J. Distributed second-order optimization using kronecker-factored approximations. 2016.
Balles, L., Romero, J., and Hennig, P. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.

Berahas, A. S. and Takáč, M. A robust multi-batch l-bfgs method for machine learning. arXiv preprint arXiv:1707.08552, 2017.

Berahas, A. S., Nocedal, J., and Takáč, M. A multi-batch l-bfgs method for machine learning. In Advances in Neural Information Processing Systems, pp. 1055–1063, 2016.

Bertsekas, D. P., Nedić, A., and Ozdaglar, A. E. Convex analysis and optimization. Athena Scientific Belmont, 2003.

Bollapragada, R., Byrd, R., and Nocedal, J. Exact and inexact subsampled Newton methods for optimization. arXiv preprint arXiv:1609.08502, 2016.

Bollapragada, R., Byrd, R., and Nocedal, J. Adaptive sampling strategies for stochastic optimization. arXiv preprint arXiv:1710.11258, 2017.

Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012.

Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

Carbonetto, P. New probabilistic inference algorithms that harness the strengths of variational and Monte Carlo methods. PhD thesis, University of British Columbia, 2009.

Cartis, C. and Scheinberg, K. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, pp. 1–39, 2015.

Chang, C. and Lin, C. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. Revisiting distributed synchronous sgd. arXiv preprint arXiv:1604.00981, 2016.

Cormack, G. and Lynam, T. Spam corpus creation for TREC. In Proc. 2nd Conference on Email and Anti-Spam, 2005. http://plg.uwaterloo.ca/gvcormac/treccorpus.

Curtis, F. A self-correcting variable-metric algorithm for stochastic optimization. In International Conference on Machine Learning, pp. 632–641, 2016.

Das, D., Avancha, S., Mudigere, D., Vaidynathan, K., Sridharan, S., Kalamkar, D., Kaul, B., and Dubey, P. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709, 2016.

De, S., Yadav, A., Jacobs, D., and Goldstein, T. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.

Devarakonda, A., Naumov, M., and Garland, M. Adabatch: Adaptive batch sizes for training deep neural networks. arXiv preprint arXiv:1712.02029, 2017.

Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.

Friedlander, M. P. and Schmidt, M. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Grosse, R. and Martens, J. A kronecker-factored approximate fisher matrix for convolution layers. In International Conference on Machine Learning, pp. 573–582, 2016.

Guyon, I., Aliferis, C. F., Cooper, G. F., Elisseeff, A., Pellet, J., Spirtes, P., and Statnikov, A. R. Design and analysis of the causation and prediction challenge. In WCCI Causation and Prediction Challenge, pp. 1–33, 2008.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456, 2015.
Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pp. 315–323, 2013.

Keskar, N. S. and Berahas, A. S. adaqn: An adaptive quasi-newton algorithm for training rnns. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 1–16. Springer, 2016.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krishnan, S., Xiao, Y., and Saurous, R. A. Neumann optimizer: A practical optimization algorithm for deep neural networks. arXiv preprint arXiv:1712.03298, 2017.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Kurth, T., Zhang, J., Satish, N., Racah, E., Mitliagkas, I., Patwary, M. M. A., Malas, T., Sundaram, N., Bhimji, W., Smorkalov, M., et al. Deep learning at 15pf: Supervised and semi-supervised classification for scientific data. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 7. ACM, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Liu, D. C. and Nocedal, J. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.

Martens, J. and Grosse, R. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

Mokhtari, A. and Ribeiro, A. Global convergence of online limited memory bfgs. Journal of Machine Learning Research, 16(1):3151–3181, 2015.

Nocedal, J. and Wright, S. Numerical Optimization. Springer New York, 2 edition, 1999.

Pasupathy, R., Glynn, P., Ghosh, S., and Hashemi, F. S. On sampling rates in stochastic recursions. 2015. Under review.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.

Robbins, H. and Monro, S. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016a.

Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737, 2016b.

Schraudolph, N. N., Yu, J., and Günter, S. A stochastic quasi-newton method for online convex optimization. In International Conference on Artificial Intelligence and Statistics, pp. 436–443, 2007.

Smith, S. L., Kindermans, P., and Le, Q. V. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.

Sohl-Dickstein, J., Poole, B., and Ganguli, S. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In International Conference on Machine Learning, pp. 604–612, 2014.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

You, Y., Gitman, I., and Ginsburg, B. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017.

Zhou, C., Gao, W., and Goldfarb, D. Stochastic adaptive quasi-Newton methods for minimizing expected values. In International Conference on Machine Learning, pp. 4150–4159, 2017.
A. Initial Step Length Derivation

To establish our results, recall that the stochastic quasi-Newton method is defined as

    x_{k+1} = x_k - \alpha_k H_k g_k^{S_k},    (29)

where the batch (or subsampled) gradient is given by

    g_k^{S_k} = \nabla F_{S_k}(x_k) = (1/|S_k|) \sum_{i \in S_k} \nabla F_i(x_k),    (30)

and the set S_k \subset \{1, 2, \cdots\} indexes data points (y^i, z^i). The algorithm selects the Hessian approximation H_k through quasi-Newton updating prior to selecting the new sample S_k to define the search direction p_k. We will use E_k to denote the conditional expectation at x_k and use E to denote the total expectation.

The primary theoretical mechanism for determining batch sizes is the exact variance inner product quasi-Newton (IPQN) test, which is defined as

    E_k[((H_k \nabla F(x_k))^T (H_k g_k^i) - \|H_k \nabla F(x_k)\|^2)^2] / |S_k| \leq \theta^2 \|H_k \nabla F(x_k)\|^4.    (31)

We establish the inequality used to determine the initial steplength \alpha_k for the stochastic line search.

Lemma A.1. Assume that F is continuously differentiable with Lipschitz continuous gradient with Lipschitz constant L. Then

    E_k[F(x_{k+1})] \leq F(x_k) - \alpha_k \nabla F(x_k)^T H_k^{1/2} W_k H_k^{1/2} \nabla F(x_k),

where

    W_k = I - (L \alpha_k / 2) (1 + Var\{H_k g_k^i\} / (|S_k| \|H_k \nabla F(x_k)\|^2)) H_k,

and Var\{H_k g_k^i\} = E_k[\|H_k g_k^i - H_k \nabla F(x_k)\|^2].

Proof. By Lipschitz continuity of the gradient, we have that

    E_k[F(x_{k+1})] \leq F(x_k) - \alpha_k \nabla F(x_k)^T H_k E_k[g_k^{S_k}] + (L \alpha_k^2 / 2) E_k[\|H_k g_k^{S_k}\|^2]
        = F(x_k) - \alpha_k \nabla F(x_k)^T H_k \nabla F(x_k) + (L \alpha_k^2 / 2) ( \|H_k \nabla F(x_k)\|^2 + E_k[\|H_k g_k^{S_k} - H_k \nabla F(x_k)\|^2] )
        \leq F(x_k) - \alpha_k \nabla F(x_k)^T H_k \nabla F(x_k) + (L \alpha_k^2 / 2) ( \|H_k \nabla F(x_k)\|^2 + (Var\{H_k g_k^i\} / (|S_k| \|H_k \nabla F(x_k)\|^2)) \|H_k \nabla F(x_k)\|^2 )
        = F(x_k) - \alpha_k \nabla F(x_k)^T H_k^{1/2} ( I - (L \alpha_k / 2) (1 + Var\{H_k g_k^i\} / (|S_k| \|H_k \nabla F(x_k)\|^2)) H_k ) H_k^{1/2} \nabla F(x_k)
        = F(x_k) - \alpha_k \nabla F(x_k)^T H_k^{1/2} W_k H_k^{1/2} \nabla F(x_k).

B. Convergence Analysis

For the rest of our analysis, we make the following two assumptions.

Assumptions B.1. The orthogonality condition is satisfied for all k, i.e.,

    E_k[\| H_k g_k^i - ((H_k g_k^i)^T (H_k \nabla F(x_k)) / \|H_k \nabla F(x_k)\|^2) H_k \nabla F(x_k) \|^2] / |S_k| \leq \nu^2 \|H_k \nabla F(x_k)\|^2,    (32)

for some large \nu > 0.
Assumptions B.2. The eigenvalues of H_k are contained in an interval in R^+, i.e., for all k there exist constants \Lambda_2 \geq \Lambda_1 > 0 such that

    \Lambda_1 I \preceq H_k \preceq \Lambda_2 I.    (33)

Condition (32) ensures that the stochastic quasi-Newton direction is bounded away from orthogonality to -H_k \nabla F(x_k), with high probability, and prevents the variance in the individual quasi-Newton directions from being too large relative to the variance in the individual quasi-Newton directions along -H_k \nabla F(x_k). Assumption B.2 holds, for example, when F is convex and a regularization parameter is included so that any subsampled Hessian \nabla^2 F_S(x) is positive definite. It can also be shown to hold in the non-convex case by applying cautious BFGS updating, e.g., by updating H_k only when y_k^T s_k \geq \epsilon \|s_k\|_2^2, where \epsilon > 0 is a predetermined constant (Berahas et al., 2016).

We begin by establishing a technical descent lemma.

Lemma B.3. Suppose that F is twice continuously differentiable and that there exists a constant L > 0 such that

    \nabla^2 F(x) \preceq L I,  \forall x \in R^d.    (34)

Let {x_k} be generated by iteration (29) for any x_0, where |S_k| is chosen by the (exact variance) inner product quasi-Newton test (31) for a given constant \theta > 0, and suppose that Assumptions B.1 and B.2 hold. Then, for any k,

    E_k[\|H_k g_k^{S_k}\|^2] \leq (1 + \theta^2 + \nu^2) \|H_k \nabla F(x_k)\|^2.    (35)

Moreover, if \alpha_k satisfies

    \alpha_k = \alpha \leq 1 / ((1 + \theta^2 + \nu^2) L \Lambda_2),    (36)

we have that

    E_k[F(x_{k+1})] \leq F(x_k) - (\alpha/2) \|H_k^{1/2} \nabla F(x_k)\|^2.    (37)

Proof. By Assumption B.1, the orthogonality condition, we have that

    E_k[\| H_k g_k^{S_k} - ((H_k g_k^{S_k})^T (H_k \nabla F(x_k)) / \|H_k \nabla F(x_k)\|^2) H_k \nabla F(x_k) \|^2]
        \leq E_k[\| H_k g_k^i - ((H_k g_k^i)^T (H_k \nabla F(x_k)) / \|H_k \nabla F(x_k)\|^2) H_k \nabla F(x_k) \|^2] / |S_k|    (38)
        \leq \nu^2 \|H_k \nabla F(x_k)\|^2.

Now, expanding the left hand side of inequality (38), we get

    E_k[\| H_k g_k^{S_k} - ((H_k g_k^{S_k})^T (H_k \nabla F(x_k)) / \|H_k \nabla F(x_k)\|^2) H_k \nabla F(x_k) \|^2]
        = E_k[\|H_k g_k^{S_k}\|^2] - 2 E_k[((H_k g_k^{S_k})^T (H_k \nabla F(x_k)))^2] / \|H_k \nabla F(x_k)\|^2 + E_k[((H_k g_k^{S_k})^T (H_k \nabla F(x_k)))^2] / \|H_k \nabla F(x_k)\|^2
        = E_k[\|H_k g_k^{S_k}\|^2] - E_k[((H_k g_k^{S_k})^T (H_k \nabla F(x_k)))^2] / \|H_k \nabla F(x_k)\|^2
        \leq \nu^2 \|H_k \nabla F(x_k)\|^2.

Therefore, rearranging gives the inequality

    E_k[\|H_k g_k^{S_k}\|^2] \leq E_k[((H_k g_k^{S_k})^T (H_k \nabla F(x_k)))^2] / \|H_k \nabla F(x_k)\|^2 + \nu^2 \|H_k \nabla F(x_k)\|^2.    (39)
To bound the first term on the right side of this inequality, we use the inner product quasi-Newton test; in particular, |S_k| satisfies

    E_k[((H_k \nabla F(x_k))^T (H_k g_k^{S_k}) - \|H_k \nabla F(x_k)\|^2)^2]
        \leq E_k[((H_k \nabla F(x_k))^T (H_k g_k^i) - \|H_k \nabla F(x_k)\|^2)^2] / |S_k|
        \leq \theta^2 \|H_k \nabla F(x_k)\|^4,    (40)

where the second inequality holds by the IPQN test. Since

    E_k[((H_k \nabla F(x_k))^T (H_k g_k^{S_k}) - \|H_k \nabla F(x_k)\|^2)^2] = E_k[((H_k \nabla F(x_k))^T (H_k g_k^{S_k}))^2] - \|H_k \nabla F(x_k)\|^4,    (41)

we have

    E_k[((H_k g_k^{S_k})^T (H_k \nabla F(x_k)))^2] \leq \|H_k \nabla F(x_k)\|^4 + \theta^2 \|H_k \nabla F(x_k)\|^4 = (1 + \theta^2) \|H_k \nabla F(x_k)\|^4,    (42)

by (40) and (41). Substituting (42) into (39), we get the following bound on the length of the search direction:

    E_k[\|H_k g_k^{S_k}\|^2] \leq (1 + \theta^2 + \nu^2) \|H_k \nabla F(x_k)\|^2,

which proves (35). Using this inequality, Assumption B.2, and the bounds on the Hessian and steplength (34) and (36), we have

    E_k[F(x_{k+1})] \leq F(x_k) - E_k[\alpha (H_k g_k^{S_k})^T \nabla F(x_k)] + (L \alpha^2 / 2) E_k[\|H_k g_k^{S_k}\|^2]
        = F(x_k) - \alpha \nabla F(x_k)^T H_k \nabla F(x_k) + (L \alpha^2 / 2) E_k[\|H_k g_k^{S_k}\|^2]
        \leq F(x_k) - \alpha \nabla F(x_k)^T H_k \nabla F(x_k) + (L \alpha^2 / 2) (1 + \theta^2 + \nu^2) \|H_k \nabla F(x_k)\|^2
        = F(x_k) - \alpha (H_k^{1/2} \nabla F(x_k))^T ( I - (L \alpha (1 + \theta^2 + \nu^2) / 2) H_k ) H_k^{1/2} \nabla F(x_k)
        \leq F(x_k) - \alpha (1 - L \Lambda_2 \alpha (1 + \theta^2 + \nu^2) / 2) \|H_k^{1/2} \nabla F(x_k)\|^2
        \leq F(x_k) - (\alpha/2) \|H_k^{1/2} \nabla F(x_k)\|^2.

We now show that the stochastic quasi-Newton iteration (29) with a fixed steplength \alpha is linearly convergent when F is strongly convex. In the following discussion, x^* denotes the minimizer of F.

Theorem B.4. Suppose that F is twice continuously differentiable and that there exist constants 0 < \mu \leq L such that

    \mu I \preceq \nabla^2 F(x) \preceq L I,  \forall x \in R^d.    (43)

Let {x_k} be generated by iteration (29), for any x_0, where |S_k| is chosen by the (exact variance) inner product quasi-Newton test (31), and suppose that Assumptions B.1 and B.2 hold. Then, if \alpha_k satisfies (36) we have that

    E[F(x_k) - F(x^*)] \leq \rho^k (F(x_0) - F(x^*)),    (44)

where x^* denotes the minimizer of F, and \rho = 1 - \mu \Lambda_1 \alpha.

Proof. It is well-known (Bertsekas et al., 2003) that for strongly convex functions,

    \|\nabla F(x_k)\|^2 \geq 2\mu [F(x_k) - F(x^*)].
Substituting this into (37), subtracting F(x^*) from both sides, and using Assumption B.2, we obtain

    E_k[F(x_{k+1}) - F(x^*)] \leq F(x_k) - F(x^*) - (\alpha/2) \|H_k^{1/2} \nabla F(x_k)\|^2
        \leq F(x_k) - F(x^*) - (\alpha/2) \Lambda_1 \|\nabla F(x_k)\|^2
        \leq (1 - \mu \Lambda_1 \alpha) (F(x_k) - F(x^*)).

The theorem follows from taking total expectation.

We now consider the case when F is nonconvex and bounded below.

Theorem B.5. Suppose that F is twice continuously differentiable and bounded below, and that there exists a constant L > 0 such that

    \nabla^2 F(x) \preceq L I,  \forall x \in R^d.    (45)

Let {x_k} be generated by iteration (29), for any x_0, where |S_k| is chosen by the (exact variance) inner product quasi-Newton test (31), and suppose that Assumptions B.1 and B.2 hold. Then, if \alpha_k satisfies (36), we have

    \lim_{k \to \infty} E[\|\nabla F(x_k)\|^2] = 0.    (46)

Moreover, for any positive integer T we have that

    \min_{0 \leq k \leq T-1} E[\|\nabla F(x_k)\|^2] \leq (2 / (\alpha T \Lambda_1)) (F(x_0) - F_{min}),

where F_{min} is a lower bound on F in R^d.

Proof. From Lemma B.3 and by taking total expectation, we have

    E[F(x_{k+1})] \leq E[F(x_k)] - (\alpha/2) E[\|H_k^{1/2} \nabla F(x_k)\|^2],

and hence

    E[\|H_k^{1/2} \nabla F(x_k)\|^2] \leq (2/\alpha) E[F(x_k) - F(x_{k+1})].

Summing both sides of this inequality from k = 0 to T - 1, and since F is bounded below by F_{min}, we get

    \sum_{k=0}^{T-1} E[\|H_k^{1/2} \nabla F(x_k)\|^2] \leq (2/\alpha) E[F(x_0) - F(x_T)] \leq (2/\alpha) [F(x_0) - F_{min}].

Using the bound on the eigenvalues of H_k and taking limits, we obtain

    \Lambda_1 \lim_{T \to \infty} \sum_{k=0}^{T-1} E[\|\nabla F(x_k)\|^2] \leq \lim_{T \to \infty} \sum_{k=0}^{T-1} E[\|H_k^{1/2} \nabla F(x_k)\|^2] < \infty,

which implies (46). We can also conclude that

    \min_{0 \leq k \leq T-1} E[\|\nabla F(x_k)\|^2] \leq (1/T) \sum_{k=0}^{T-1} E[\|\nabla F(x_k)\|^2] \leq (2 / (\alpha T \Lambda_1)) (F(x_0) - F_{min}).
C. Additional Numerical Experiments

C.1. Datasets

Table 2 summarizes the datasets used for the experiments. Some of these datasets divide the data into training and testing sets; for the rest, we randomly divide the data so that the training set constitutes 90% of the total.

Table 2. Characteristics of all datasets used in the experiments.

Dataset     # Data Points (train; test)   # Features   # Classes   Source
gisette     (6,000; 1,000)                5,000        2           (Chang & Lin, 2011)
mushrooms   (7,311; 813)                  112          2           (Chang & Lin, 2011)
sido        (11,410; 1,268)               4,932        2           (Guyon et al., 2008)
ijcnn       (35,000; 91,701)              22           2           (Chang & Lin, 2011)
spam        (82,970; 9,219)               823,470      2           (Cormack & Lynam, 2005; Carbonetto, 2009)
alpha       (450,000; 50,000)             500          2           synthetic
covertype   (522,910; 58,102)             54           2           (Chang & Lin, 2011)
url         (2,156,517; 239,613)          3,231,961    2           (Chang & Lin, 2011)
MNIST       (60,000; 10,000)              28 × 28      10          (LeCun et al., 1998)
CIFAR-10    (50,000; 10,000)              32 × 32      10          (Krizhevsky, 2009)

The alpha dataset is a synthetic dataset that is available at ftp://largescale.ml.tu-berlin.de.

C.2. Logistic Regression Experiments

We report the numerical results on binary classification logistic regression problems on the 8 datasets given in Table 2. We plot the performance measured in terms of training error, test loss and test accuracy against gradient evaluations. We also report the behavior of the batch sizes and steplengths for both variants of the PBQN method.

Figure 5. gisette dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and SVRG methods.
A Progressive Batching L-BFGS Method for Machine Learning

101 101 0.9


SG SG 0.8 SG
0
10 100 SVRG
SVRG SVRG 0.7
10−1 PBQN-MB PBQN-MB 0.6 PBQN-MB
10−1
Training Error

Training Error

Test Loss
PBQN-FO PBQN-FO PBQN-FO
10−2 0.5
10−2
10−3 0.4
10−3 0.3
10−4
0.2
−4
10−5 10
0.1
10−6 10−5 0.0
0 5 10 15 20 25 30 35 40 0 2 4 6 8 10 0 2 4 6 8 10
Gradient Evaluations Gradient Evaluations Gradient Evaluations
3
8 ×10
100 1.0
7
90 6 0.8
Batch Sizes
Test Accuracy

80 5

Steplength
0.6
70
4
SG 3 0.4
60 SVRG 2
PBQN-MB PBQN-MB 0.2 PBQN-MB
50 1
PBQN-FO PBQN-FO PBQN-FO
40 0 0.0
0 2 4 6 8 10 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70
Gradient Evaluations Iterations Iterations

Figure 6. mushrooms dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.

100 0.7
100
SG
SG 0.6
10−1 SVRG SVRG
0.5 PBQN-MB
PBQN-MB 10−1
Training Error
Training Error

Test Loss

10−2 PBQN-FO 0.4 PBQN-FO

10−3 0.3
SG
10−2 SVRG 0.2
10−4
PBQN-MB
0.1
10−5 PBQN-FO
10−3 0.0
0 20 40 60 80 100 0 2 4 6 8 10 0 2 4 6 8 10
Gradient Evaluations Gradient Evaluations Gradient Evaluations
4
100 1.2 ×10 1.0
PBQN-MB
90 1.0 0.8 PBQN-FO
Test Accuracy

0.8
Batch Sizes

Steplength

80 0.6
0.6
70 SG 0.4
SVRG 0.4
60 PBQN-MB 0.2 PBQN-MB 0.2
PBQN-FO PBQN-FO
50 0.0 0.0
0 2 4 6 8 10 0 50 100 150 200 0 50 100 150 200
Gradient Evaluations Iterations Iterations

Figure 7. sido dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap
(FO) approaches, and the SG and SVRG methods.

Figure 8. ijcnn dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and SVRG methods.


Figure 9. spam dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-
overlap (FO) approaches, and the SG and SVRG methods.

Figure 10. alpha dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.


Figure 11. covertype dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.

Figure 12. url dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and SVRG methods. Note that we only ran the SG and SVRG algorithms for 3 gradient evaluations, since the equivalent number of iterations had already reached the order of $10^7$.

C.3. Neural Network Experiments


We describe each neural network architecture below. We plot the training loss, test loss and test accuracy against the total
number of iterations and gradient evaluations. We also report the behavior of the batch sizes and steplengths for both variants
of the PBQN method.

C.3.1. CIFAR-10 Convolutional Network (C) Architecture


The small convolutional neural network (ConvNet) is a 2-layer convolutional network with two alternating stages of 5 × 5
kernels and 2 × 2 max pooling followed by a fully connected layer with 1000 ReLU units. The first convolutional layer
yields 6 output channels and the second convolutional layer yields 16 output channels.
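
As an illustration, a minimal PyTorch sketch of this ConvNet is given below; the padding choices (none), the ReLU activations after each convolution, and the final 10-class output layer are our assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

# Minimal sketch of the 2-layer ConvNet (C). Zero padding, the ReLU after
# each convolution, and the 10-class output layer are assumptions.
convnet = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),    # 3x32x32 -> 6x28x28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),       # -> 6x14x14
    nn.Conv2d(6, 16, kernel_size=5),   # -> 16x10x10
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),       # -> 16x5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 1000),       # fully connected layer with 1000 units
    nn.ReLU(),
    nn.Linear(1000, 10),               # assumed 10-class output head
)

x = torch.randn(8, 3, 32, 32)          # a CIFAR-10-sized mini-batch
print(convnet(x).shape)                # torch.Size([8, 10])
```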

C.3.2. CIFAR-10 and MNIST AlexNet-like Network (A1, A2) Architecture
The larger convolutional network (AlexNet) is an adaptation of the AlexNet architecture (Krizhevsky et al., 2012) for
CIFAR-10 and MNIST. The CIFAR-10 version consists of three convolutional layers with max pooling followed by two
fully-connected layers. The first convolutional layer uses a 5 × 5 kernel with a stride of 2 and 64 output channels. The second
and third convolutional layers use a 3 × 3 kernel with a stride of 1 and 64 output channels. Following each convolutional
layer is a set of ReLU activations and 3 × 3 max poolings with strides of 2. This is all followed by two fully-connected
layers with 384 and 192 neurons with ReLU activations, respectively. The MNIST version of this network modifies this by
only using a 2 × 2 max pooling layer after the last convolutional layer.
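
A minimal PyTorch sketch of the CIFAR-10 version of this network is given below. The padding values and the final 10-class output layer are assumptions not stated above; nn.LazyLinear is used only so that the flattened feature size need not be hard-coded.

```python
import torch
import torch.nn as nn

# Sketch of the AlexNet-like network (A2) for CIFAR-10. Padding values and
# the 10-class output layer are assumptions; nn.LazyLinear infers the
# flattened feature size at the first forward pass.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.LazyLinear(384),
    nn.ReLU(),
    nn.Linear(384, 192),
    nn.ReLU(),
    nn.Linear(192, 10),                # assumed 10-class output head
)

x = torch.randn(8, 3, 32, 32)
print(alexnet_like(x).shape)           # torch.Size([8, 10])
```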

C.3.3. CIFAR-10 Residual Network (R) Architecture


The residual network (ResNet18) is a slight modification of the ImageNet ResNet18 architecture for CIFAR-10 (He et al.,
2016). It follows the same architecture as ResNet18 for ImageNet but removes the global average pooling layer before the
1000-neuron fully-connected layer. ReLU activations and max poolings are included appropriately.
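
As an illustration, the sketch below builds this modified network from torchvision's ResNet18 by replacing the global average pooling with an identity map. Since the text does not specify the classification head for CIFAR-10, the ReLU and 10-class linear layer appended after the 1000-unit fully-connected layer are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Sketch of the modified ResNet18 (R): the ImageNet architecture with the
# global average pooling before the 1000-unit fully-connected layer removed.
backbone = resnet18()                  # standard ImageNet ResNet18, fc: 512 -> 1000
backbone.avgpool = nn.Identity()       # drop the global average pooling layer

# For 32x32 CIFAR-10 inputs the final feature map is 512x1x1, so flattening
# still yields 512 features and the 1000-unit fc layer applies unchanged.
# The ReLU and 10-class classifier appended below are assumptions.
model = nn.Sequential(backbone, nn.ReLU(), nn.Linear(1000, 10))

x = torch.randn(8, 3, 32, 32)
print(model(x).shape)                  # torch.Size([8, 10])
```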

Figure 13. CIFAR-10 ConvNet (C): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap)
and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.

Figure 14. MNIST AlexNet (A1 ): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.

Figure 15. CIFAR-10 AlexNet (A2 ): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap)
and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.

Figure 16. CIFAR-10 ResNet18 (R): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap)
and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.

D. Performance Model
The use of increasing batch sizes in the PBQN algorithm yields a larger effective batch size than the SG method, allowing
PBQN to scale to a larger number of nodes than is currently practical even with large-batch training (Goyal et al., 2017).
With improved scalability and richer gradient information, we expect a reduction in training time. To demonstrate the potential
of a parallelized implementation of PBQN to reduce training time, we extend the idealized performance model from (Keskar
et al., 2016) to the PBQN algorithm. For PBQN to be competitive, it must achieve the following: (i) the quality of its solution
should match or improve upon SG's solution (as shown in Table 1 of the main paper); (ii) it should utilize a larger effective batch
size; and (iii) it should converge to the solution in fewer iterations. We provide an initial analysis by
establishing the analytic requirements for improved training time; we leave discussion of implementation details, memory
requirements, and large-scale experiments for future work.
Let the effective batch size for PBQN and the conventional SG batch size be denoted by $\widehat{B}_L$ and $B_S$, respectively. From
Algorithm 1, we observe that the PBQN iteration involves extra computation in addition to the gradient computation required
by SG. The additional steps are: the L-BFGS two-loop recursion, which includes several operations over the
stored curvature pairs and the network parameters (Algorithm 1:6); the stochastic line search for identifying the steplength
(Algorithm 1:7-16); and the curvature pair updating (Algorithm 1:18-21). However, most of these supplemental operations act
only on the weights of the network, and their cost is orders of magnitude lower than that of computing the gradient. The two-loop
recursion performs $O(10)$ operations over the network parameters and curvature pairs. The cost of variance estimation is
negligible since we may use a fixed number of samples throughout the run; its computation can be parallelized and does not
become a serial bottleneck.
The only exception is the stochastic line search, which requires additional forward propagations over the model for different
sets of network parameters. However, this occurs only when the steplength is not accepted, which happens infrequently in
practice. We nevertheless make the pessimistic assumption of one additional forward propagation every iteration, amounting to an
additional $1/3$ of the cost of a gradient computation (one forward propagation, plus back propagation with respect to activations
and weights). Hence, the ratio of the PBQN cost per iteration $C_L$ to the SG cost per iteration $C_S$ is $4/3$. Let $I_S$ and $I_L$ be the
numbers of iterations that SG and PBQN, respectively, require to reach similar test accuracy. The target number of nodes to be used
for training is $N$, such that $N < \widehat{B}_L$. For $N$ nodes, the parallel efficiency of SG is assumed to be $P_e(N)$, and we assume that,
at the target node count, there is no drop in parallel efficiency for PBQN due to its large effective batch size.
For a lower training time with the PBQN method, the following relation should hold:

\[
I_L \, C_L \, \frac{\widehat{B}_L}{N} \;<\; I_S \, C_S \, \frac{B_S}{N P_e(N)}. \tag{47}
\]

In terms of iterations, we can rewrite this as


\[
\frac{I_L}{I_S} \;<\; \frac{C_S}{C_L} \, \frac{B_S}{\widehat{B}_L} \, \frac{1}{P_e(N)}. \tag{48}
\]

Assuming a target node count of $N = B_S < \widehat{B}_L$, the scaling efficiency of SG drops significantly due to the reduced work per
node, giving a parallel efficiency of $P_e(N) = 0.2$; see (Kurth et al., 2017; You et al., 2017). If we additionally assume
that the effective batch size for PBQN is $4\times$ larger, with a large SG batch of $\approx 8$K and a PBQN batch of $\approx 32$K as observed in our experiments
in Section 4, we obtain $\widehat{B}_L/B_S = 4$. Substituting these values into (48) gives $I_L/I_S < (3/4)(1/4)(1/0.2) \approx 0.94$; that is, PBQN must
converge in roughly the same number of iterations as SG (or fewer) in order to achieve lower training time. The results in Section 4 show
that PBQN converges in significantly fewer iterations than SG, establishing the potential for lower training times. We refer the
reader to (Das et al., 2016) for a more detailed model and commentary on the effect of batch size on performance.
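
As a sanity check, the short calculation below evaluates the bound (48) under the values assumed above (cost ratio $C_L/C_S = 4/3$, batch-size ratio $\widehat{B}_L/B_S = 4$, and $P_e(N) = 0.2$); it is only a sketch of the arithmetic.

```python
# Evaluate the iteration bound (48) under the assumptions stated above.
cost_ratio = 4.0 / 3.0       # C_L / C_S: PBQN vs. SG cost per iteration
batch_ratio = 4.0            # B_L_hat / B_S: effective batch-size ratio
parallel_eff = 0.2           # P_e(N): SG parallel efficiency at N = B_S nodes

# (48): I_L / I_S < (C_S / C_L) * (B_S / B_L_hat) * (1 / P_e(N))
bound = (1.0 / cost_ratio) * (1.0 / batch_ratio) / parallel_eff
print(f"Lower training time requires I_L < {bound:.3f} * I_S")  # ~0.938
```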
