Abstract
Stochastic Gradient Descent-Ascent (SGDA) is one of the most prominent algorithms for solving min-max optimization and variational inequality problems (VIPs) appearing in various machine learning tasks. The success of the method led to several advanced extensions of the classical SGDA, including variants with arbitrary sampling, variance reduction, coordinate randomization, and distributed variants with compression, which were extensively studied in the literature, especially during the last few years. In this paper, we propose a unified convergence analysis that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have different applications, and have been developed separately in various communities. A key to our unified framework is a parametric assumption on the stochastic estimates. Via our general theoretical framework, we either recover the sharpest known rates for the known special cases or tighten them. Moreover, to illustrate the flexibility of our approach, we develop several new variants of SGDA such as a new variance-reduced method (L-SVRGDA), new distributed methods with compression (QSGDA, DIANA-SGDA, VR-DIANA-SGDA), and a new method with coordinate randomization (SEGA-SGDA). Although variants of the new methods are known for solving minimization problems, they were never considered or analyzed for solving min-max problems and VIPs. We also demonstrate the most important properties of the new methods through extensive numerical experiments.
1 Introduction
Min-max optimization and, more generally, variational inequality problems (VIPs) appear in a
wide range of research areas including but not limited to statistics [Bach, 2019], online learning
[Cesa-Bianchi and Lugosi, 2006], game theory [Morgenstern and Von Neumann, 1953], and
machine learning [Goodfellow et al., 2014]. Motivated by applications in these areas, in this
paper, we focus on solving the following regularized VIP: Find x^* ∈ R^d such that

⟨F(x^*), x − x^*⟩ + R(x) − R(x^*) ≥ 0  for all x ∈ R^d,  (1)

where F: R^d → R^d is an operator and R: R^d → R ∪ {+∞} is a regularizer. When F(x) = ∇f(x) for some differentiable function f, problem (1) represents the first-order optimality conditions of the composite minimization problem [Beck, 2017], i.e., minimization of f(x) + R(x). Problem (1) is also a more abstract formulation of the min-max problem

min_{x_1 ∈ Q_1} max_{x_2 ∈ Q_2} f(x_1, x_2)  (2)

with convex-concave f. In that case, x = (x_1^⊤, x_2^⊤)^⊤, F(x) = (∇_{x_1} f(x_1, x_2)^⊤, −∇_{x_2} f(x_1, x_2)^⊤)^⊤, and R(x) = δ_{Q_1}(x_1) + δ_{Q_2}(x_2), where δ_Q(·) is the indicator function of the set Q [Alacaoglu and Malitsky, 2021]. Besides encoding constraints, the regularizer R allows us to enforce certain properties of the solution x^*, e.g., sparsity [Candes et al., 2008, Beck, 2017].
More precisely, we are interested in situations where the operator F is accessible through calls to an unbiased stochastic oracle. This is natural when F has an expectation form F(x) = E_{ξ∼D}[F_ξ(x)] or a finite-sum form F(x) = (1/n)∑_{i=1}^n F_i(x). In the context of machine learning, D corresponds to some unknown distribution over the data, n corresponds to the number of samples, and F_ξ, F_i denote the vector fields corresponding to the samples ξ and i, respectively [Gidel et al., 2019, Loizou et al., 2021].
One of the most popular methods for solving (1) is Stochastic Gradient Descent-Ascent^1 (SGDA) [Dem'yanov and Pevnyi, 1972, Nemirovski et al., 2009]. However, despite its rich history, SGDA was analyzed without strong assumptions on the noise, such as uniformly bounded variance, only recently [Loizou et al., 2021]. In the last few years, several powerful algorithmic techniques, like variance reduction [Palaniappan and Bach, 2016, Yang et al., 2020] and coordinate-wise randomization [Sadiev et al., 2021], were also combined with SGDA, resulting in better algorithms. However, these methods were analyzed under different assumptions, using different analysis approaches, and required different intuitions. Moreover, to the best of our knowledge, fruitful directions such as communication compression for distributed versions of SGDA or linearly converging variants of coordinate-wise methods for regularized VIPs were never considered in the literature before.
All of these facts motivate the importance and necessity of a novel general analysis of SGDA that unifies several special cases and provides the ability to design and analyze new SGDA-like methods, filling existing gaps in the theoretical understanding of the method.
where x^* = proj_{X^*}(x) := arg min_{y ∈ X^*} ‖y − x‖ is the projection of x onto the solution set X^* of (1). If μ = 0, inequality (3) is known as the variational stability condition [Hsieh et al., 2020], which is weaker than standard monotonicity: ⟨F(x) − F(y), x − y⟩ ≥ 0 for all x, y ∈ R^d. It is worth mentioning that there exist examples of non-monotone operators satisfying (3) with μ > 0 [Loizou et al., 2021]. Condition (4) is a relaxation of standard cocoercivity: ‖F(x) − F(y)‖² ≤ ℓ⟨F(x) − F(y), x − y⟩. At this point, let us highlight that it is possible for an operator F to satisfy (4) without being Lipschitz continuous [Loizou et al., 2021]. This emphasizes the wider
^1 This name is usually used in the min-max setup. Although we consider a more general problem formulation, we keep the name SGDA to highlight the connection with min-max problems.
applicability of ℓ-star-cocoercivity compared to ℓ-cocoercivity. We emphasize that our convergence analysis assumes neither ℓ-cocoercivity nor L-Lipschitzness of F.
We consider SGDA for solving (1) in its general form:

x^{k+1} = prox_{γ_k R}(x^k − γ_k g^k),  (5)

where g^k is an unbiased estimator of F(x^k), γ_k > 0 is the stepsize at iteration k, and prox_{γR}(x) := arg min_{y ∈ R^d} {R(y) + ‖y − x‖²/(2γ)} is the proximal operator, defined for any γ > 0 and x ∈ R^d. While g^k provides information about the operator F at step k, the proximal operator is needed to take the regularization term R into account. We assume that the function R is such that prox_{γR}(x) can be easily computed for all x ∈ R^d. This is a standard assumption satisfied by many practically interesting regularizers [Beck, 2017]. By default, we assume that γ_k ≡ γ > 0 for all k ≥ 0.
Sharp rates for known special cases. For the known methods fitting our framework, our general theorems either recover the best rates known for these methods (SGDA-AS) or tighten them (SAGA-SGDA, Coordinate SGDA).
New methods. The flexibility of our approach allows us to develop and analyze several
new variants of SGDA. Guided by algorithmic advances for solving minimization problems
we propose a new variance-reduced method (L-SVRGDA), new distributed methods with
compression (QSGDA, DIANA-SGDA, VR-DIANA-SGDA), and a new method with coordinate
randomization (SEGA-SGDA). We show that the proposed new methods fit our theoretical
framework and, using our general theorems, we obtain tight convergence guarantees for
them. Although the analogs of these methods are known for solving minimization problems
[Hofmann et al., 2015, Kovalev et al., 2020, Alistarh et al., 2017, Mishchenko et al., 2019,
Horváth et al., 2019, Hanzely et al., 2018], they were never considered for solving min-
max and variational inequality problems. Therefore, by proposing and analyzing these new
methods we close several gaps in the literature on SGDA. For example, VR-DIANA-SGDA is
the first SGDA-type linearly converging distributed stochastic method with compression and
SEGA-SGDA is the first linearly converging coordinate method for solving regularized VIPs.
1.3 Closely Related Work
Analysis of SGDA. SGDA is usually analyzed under uniformly bounded variance assumption.
That is, E[kg k − F (xk )k2 | xk ] ≤ σ 2 is typically assumed to get convergence guarantees [Ne-
mirovski et al., 2009, Mertikopoulos and Zhou, 2019, Yang et al., 2020]. This assumption rarely
holds, especially for unconstrained VIPs: it is easy to construct an example of (1) with F being
a finite sum of linear operators such that the variance is unbounded. Lin et al. [2020] provide
a convergence analysis of SGDA under a relative random noise assumption allowing to handle
some special cases not covered by uniformly bounded variance assumption. However, relative
noise is also a quite strong assumption and usually requires a special type of noise appearing in
coordinate methods2 or in the training of overparameterized models [Vaswani et al., 2019]. In
their recent work, Loizou et al. [2021] proposed a new weak condition called expected cocoerciv-
ity. This assumption fits our theoretical framework (see Section 3) and does not imply strong
conditions on the variance of the stochastic estimator but it is stronger than star-cocoercivity
of operator F .
Variance reduction for VIPs. The first variance-reduced variants of SGDA (SVRGDA and SAGA-SGDA, analogs of SVRG [Johnson and Zhang, 2013] and SAGA [Defazio et al., 2014]) for solving (1) with a strongly monotone operator F having a finite-sum form with Lipschitz summands were proposed in Palaniappan and Bach [2016]. For two-sided PL min-max problems without regularization, Yang et al. [2020] proposed a variance-reduced version of SGDA with alternating updates. Since the considered class of problems includes non-strongly-convex-non-strongly-concave min-max problems, the rates from Yang et al. [2020] are inferior to those of Palaniappan and Bach [2016]. There are also several works studying variance-reduced methods built on base methods other than SGDA. Chavdarova et al. [2019] proposed a combination of SVRG and Extragradient (EG) [Korpelevich, 1976], called SVRE, and analyzed the method for strongly monotone VIPs without regularization and with cocoercive summands F_i. The cocoercivity assumption was relaxed to averaged Lipschitzness in Alacaoglu and Malitsky [2021], where the authors proposed another variance-reduced version of EG (EG-VR) based on the Loopless variant of SVRG [Hofmann et al., 2015, Kovalev et al., 2020]. Loizou et al. [2020] studied stochastic Hamiltonian gradient descent (SHGD) and proposed the first stochastic variance-reduced Hamiltonian method, named L-SVRHG, for solving stochastic bilinear games and stochastic games satisfying a "sufficiently bilinear" condition. Moreover, Loizou et al. [2020] provided the first set of global non-asymptotic last-iterate convergence guarantees for a stochastic game over a non-compact domain, in the absence of strong monotonicity assumptions.
variants of EG with unbiased/biased compression for solving (1) with (strongly) monotone and
Lipschitz operator F . Beznosikov et al. [2021b] obtained the first linear convergence guarantees
on distributed VIPs with compressed communication.
Assumption 2.1. We assume that for all k ≥ 0 the estimator g^k from (5) is unbiased: E_k[g^k] = F(x^k), where E_k[·] denotes the expectation w.r.t. the randomness at iteration k. Next, we assume that there exist non-negative constants A, B, C, D_1, D_2 ≥ 0, ρ ∈ (0, 1], and a sequence of (possibly random) non-negative variables {σ_k²}_{k≥0} such that for all k ≥ 0

E_k[‖g^k − g^{*,k}‖²] ≤ 2A⟨F(x^k) − g^{*,k}, x^k − x^{*,k}⟩ + Bσ_k² + D_1,  (6)
E_k[σ_{k+1}²] ≤ 2C⟨F(x^k) − g^{*,k}, x^k − x^{*,k}⟩ + (1 − ρ)σ_k² + D_2.  (7)
While unbiasedness of g^k is a standard assumption, inequalities (6)-(7) are new and require clarification. For simplicity, assume that σ_k² ≡ 0 and F(x^*) = 0 for all x^* ∈ X^*, and focus on (6). In this case, (6) gives an upper bound on the second moment of the stochastic estimate g^k. For example, such a bound follows from the expected cocoercivity assumption [Loizou et al., 2021], where A denotes some expected/averaged (star-)cocoercivity constant and D_1 stands for the variance at the solution (see also Section 3). When F is not necessarily zero on X^*, the shift g^{*,k} helps to take this fact into account. Finally, the sequence {σ_k²}_{k≥0} is typically needed to capture the variance reduction process, the parameter B is typically some numerical constant, C is another constant related to (star-)cocoercivity, and D_2 is the remaining noise that is not handled by the variance reduction process. As we show in the next sections, inequalities (6)-(7) hold for various SGDA-type methods.
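The unbiasedness requirement and the role of the variance-at-the-solution term D_1 can be sanity-checked numerically on a toy finite-sum operator. The sketch below is illustrative only (the operator, its names, and the uniform single-sample estimator are our own example, not the paper's code):

```python
import numpy as np

# Toy finite-sum operator F(x) = (1/n) * sum_i (A_i x + b_i)
rng = np.random.default_rng(0)
n, d = 10, 3
mats = [rng.standard_normal((d, d)) + 2 * np.eye(d) for _ in range(n)]
vecs = [rng.standard_normal(d) for _ in range(n)]

def F_i(i, x):
    return mats[i] @ x + vecs[i]

def F(x):
    return sum(F_i(i, x) for i in range(n)) / n

# Uniform single-sample estimator g^k = F_j(x^k), j ~ Uniform([n]).
# Its expectation over j can be computed exactly as the average over all j:
x = rng.standard_normal(d)
expected_g = sum(F_i(i, x) for i in range(n)) / n  # = E_k[g^k]

# The variance of the estimator at a solution (here F(x_star) = 0 exactly)
# is the kind of quantity that enters (6) through D_1:
x_star = np.linalg.solve(sum(mats) / n, -sum(vecs) / n)
D1 = 2 * np.mean([np.linalg.norm(F_i(i, x_star) - F(x_star)) ** 2
                  for i in range(n)])
```

For this estimator, `expected_g` coincides with F(x), confirming E_k[g^k] = F(x^k), while `D1` is strictly positive unless all summands agree at the solution.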
We point out that Assumption 2.1 is inspired by similar assumptions appearing in Gorbunov et al. [2020a, 2021a]. However, the difference between our assumption and the ones from these papers is significant: Gorbunov et al. [2020a] focuses only on solving minimization problems and, as a result, their assumption includes a much simpler quantity (functional suboptimality) instead of ⟨F(x^k) − g^{*,k}, x^k − x^{*,k}⟩ in the right-hand sides of (6)-(7). The assumption proposed in Gorbunov et al. [2021a] is closer to ours, but it is designed specifically for analyzing Stochastic EG, does not have the {σ_k²}_{k≥0} sequence, and works only for (1) with R(x) ≡ 0.
Theorem 2.2. Let F be μ-quasi-strongly monotone with μ > 0 and let Assumption 2.1 hold. Assume that 0 < γ ≤ min{1/μ, 1/(2(A + CM))} for some^a M > B/ρ. Then the iterates of SGDA, given by (5), satisfy

E[V_k] ≤ (1 − min{γμ, ρ − B/M})^k V_0 + γ²(D_1 + M D_2) / min{γμ, ρ − B/M},  (8)

where the Lyapunov function V_k is defined by V_k = ‖x^k − x^{*,k}‖² + M γ² σ_k² for all k ≥ 0.

^a When B = 0, we suppose M = 0 and B/M := 0 in all following expressions.
The above theorem states that SGDA (5) converges linearly to a neighborhood of the solution. The size of the neighborhood is proportional to the noise terms D_1 and D_2. When D_1 = D_2 = 0, i.e., the method is variance reduced, it converges linearly to the exact solution in expectation. However, in general, to achieve any predefined accuracy, one needs to reduce the size of the neighborhood somehow. One possible way to do that is to use a proper stepsize schedule. We formalize this discussion in the following result.
Corollary 2.3. Let the assumptions of Theorem 2.2 hold. Consider two possible cases.

1. Let D_1 = D_2 = 0. Then, for any K ≥ 0, M = 2B/ρ, and γ = min{1/μ, 1/(2(A + 2BC/ρ))}, the iterates of SGDA, given by (5), satisfy

E[V_K] ≤ V_0 exp(−min{μ/(2(A + 2BC/ρ)), ρ/2} K).

2. Let D_1 + M D_2 > 0. Then, for any K ≥ 0 and M = 2B/ρ, one can choose {γ_k}_{k≥0} as follows:

γ_k = 1/h,                  if K ≤ h/μ,
γ_k = 1/h,                  if K > h/μ and k < k_0,  (9)
γ_k = 2/(μ(κ + k − k_0)),   if K > h/μ and k ≥ k_0,

where h = max{2(A + 2BC/ρ), 2μ/ρ}, κ = 2h/μ, and k_0 = ⌈K/2⌉. For this choice of γ_k, the iterates of SGDA, given by (5), satisfy:
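The switching schedule (9) is easy to implement: a constant stepsize 1/h for short horizons and for the first half of a long run, then an O(1/k) decay. The helper below is our own sketch of this rule (the function name is illustrative); note that the two branches are continuous at k = k_0, since 2/(μκ) = 1/h.

```python
import math

def gamma_schedule(k, K, A, B, C, rho, mu):
    """Stepsize schedule (9) from Corollary 2.3: gamma_k = 1/h while
    K <= h/mu or k < k0, then an O(1/k) decay; h, kappa, k0 as defined there."""
    h = max(2 * (A + 2 * B * C / rho), 2 * mu / rho)
    kappa = 2 * h / mu          # chosen so the two branches match at k = k0
    k0 = math.ceil(K / 2)
    if K <= h / mu or k < k0:
        return 1.0 / h
    return 2.0 / (mu * (kappa + k - k0))
```

For example, with A = B = C = 1, ρ = 0.5, μ = 0.1 one gets h = 10, so γ_k = 0.1 up to k_0 and decays afterwards.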
Assumption 2.4. There exists a compact convex set C (with the diameter ΩC := maxx,y∈C kx−
yk) such that X ∗ ⊂ C.
In this setting, we focus on the following quantity, called the restricted gap function [Nesterov, 2007], defined for any z ∈ R^d and any C ⊂ R^d satisfying Assumption 2.4:

Gap_C(z) := max_{u ∈ C} [⟨F(u), z − u⟩ + R(z) − R(u)].
Assumption 2.4 and function GapC (z) are standard for the convergence analysis of methods
for solving (1) with monotone F [Nesterov, 2007, Alacaoglu and Malitsky, 2021]. Additional
discussion is left to Appendix D.2.
Under these assumptions, Assumption 2.1, and star-cocoercivity of F we derive the following
general result.
Theorem 2.5. Let F be monotone and ℓ-star-cocoercive, and let Assumptions 2.1, 2.4 hold. Assume that 0 < γ ≤ 1/(2(A + BC/ρ)). Then for all K ≥ 0 the iterates of SGDA, given by (5), satisfy

E[Gap_C((1/K)∑_{k=1}^K x^k)] ≤ 3 max_{u∈C} ‖x^0 − u‖² / (2γK)
  + 8γℓ²Ω_C²/K + (4A + ℓ + 8BC/ρ) · ‖x^0 − x^{*,0}‖²/K
  + (4 + (4A + ℓ + 8BC/ρ)γ) · γBσ_0²/(ρK)
  + γ(2 + γ(4A + ℓ + 8BC/ρ))(D_1 + 2BD_2/ρ)
  + 9γ max_{x^*∈X^*} ‖F(x^*)‖².  (11)
The above result establishes an O(1/K) rate of convergence to an accuracy proportional to the stepsize γ multiplied by the noise term D_1 + 2BD_2/ρ and by max_{x^*∈X^*} ‖F(x^*)‖². We note that if R ≡ 0 in (1), then F(x^*) = 0, meaning that in this case the last term in (11) equals zero. Otherwise, even in the deterministic case, one needs to use small stepsizes to ensure convergence to any predefined accuracy (see Corollary D.4 in Appendix D.2).
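The stepsize-vs-accuracy trade-off above is easy to observe on a toy one-dimensional instance. The snippet is purely illustrative (our own names, F(x) = x so μ = 1, x^* = 0, and R ≡ 0): without noise, SGDA converges geometrically; with noise, it settles in a neighborhood whose size shrinks with the stepsize.

```python
import numpy as np

def run_sgda(gamma, noise, steps=1000, seed=0):
    """SGDA on the 1-D toy problem F(x) = x (mu = 1, x* = 0, R = 0),
    with g^k = F(x^k) + Gaussian noise of level `noise`."""
    rng = np.random.default_rng(seed)
    x = 5.0
    for _ in range(steps):
        g = x + noise * rng.standard_normal()  # unbiased estimate of F(x)
        x -= gamma * g
    return x

def neighborhood(gamma, noise, reps=100):
    """Root-mean-square distance to x* = 0 over independent runs."""
    return np.sqrt(np.mean([run_sgda(gamma, noise, seed=s) ** 2
                            for s in range(reps)]))
```

Reducing γ from 0.1 to 0.01 visibly shrinks the limiting neighborhood, at the price of a slower transient phase.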
When R(x) ≡ 0, this assumption recovers the original one from Loizou et al. [2021]. We also emphasize that, for the operator F, Assumption 3.1 implies only star-cocoercivity.
Following Loizou et al. [2021], we mainly focus on the finite-sum case and its stochastic reformulation: we consider a random sampling vector ξ = (ξ_1, …, ξ_n)^⊤ ∈ R^n having a distribution D such that E_D[ξ_i] = 1 for all i ∈ [n]. Using this, we can rewrite F(x) = (1/n)∑_{i=1}^n F_i(x) as

F(x) = (1/n)∑_{i=1}^n E_D[ξ_i F_i(x)] = E_D[F_ξ(x)],  (13)

where F_ξ(x) = (1/n)∑_{i=1}^n ξ_i F_i(x). Such a reformulation allows us to handle a wide range of samplings: the only assumption on D is that E_D[ξ_i] = 1 for all i ∈ [n]. Therefore, this setup is often referred to as arbitrary sampling [Richtárik and Takác, 2020, Loizou and Richtárik, 2020a,b, Gower et al., 2019, 2021, Hanzely and Richtárik, 2019, Qian et al., 2019, 2021a]. We elaborate on several special cases in Appendix E.4.
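One concrete instance of the arbitrary sampling framework is single-index sampling with probabilities p_i, which covers both uniform and importance sampling. The sketch below is our own illustration (names are hypothetical) of how a sampling vector with E_D[ξ_i] = 1 is built:

```python
import numpy as np

n = 4
p = np.array([0.1, 0.2, 0.3, 0.4])  # probability of drawing each index

def xi_vector(j):
    """Sampling vector for single-index sampling: xi_j = 1/p_j on the drawn
    index j and zero elsewhere, so that E_D[xi_i] = p_i * (1/p_i) = 1."""
    xi = np.zeros(n)
    xi[j] = 1.0 / p[j]
    return xi

def F_xi(j, F_list, x):
    """F_xi(x) = (1/n) * sum_i xi_i F_i(x); here it equals F_j(x) / (n p_j)."""
    xi = xi_vector(j)
    return sum(xi[i] * F_list[i](x) for i in range(n)) / n
```

Uniform sampling corresponds to p_i = 1/n, for which F_ξ(x) = F_j(x); importance sampling chooses p_i proportional to the per-sample constants (e.g., ℓ_i or L_i).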
In this setting, SGDA with Arbitrary Sampling (SGDA-AS)3 fits our framework.
Proposition 3.2. Let Assumption 3.1 hold. Then, SGDA-AS satisfies Assumption 2.1 with A = ℓ_D, B = 0, σ_k² ≡ 0, C = 0, ρ = 1, D_1 = 2σ_*² := 2 max_{x^*∈X^*} E_D[‖F_ξ(x^*) − F(x^*)‖²], and D_2 = 0.

Plugging these parameters into Theorem 2.2, we recover the result^4 from Loizou et al. [2021] when R(x) ≡ 0 and generalize it to the case R(x) ≢ 0 without sacrificing the rate. Applying Corollary 2.3, we establish the rate of convergence to the exact solution.
Corollary 3.3. Let F be μ-quasi-strongly monotone and let Assumption 3.1 hold. Then for all K > 0 there exists a choice of γ (see (47)) for which the iterates of SGDA-AS satisfy

E[‖x^K − x^{*,K}‖²] = O( (ℓ_D Ω_0²/μ) exp(−(μ/ℓ_D) K) + σ_*²/(μ² K) ).

For a different stepsize schedule, Loizou et al. [2021] derive the convergence rate O(1/K + 1/K²), which is inferior to our rate, especially when σ_*² is small. In addition, Loizou et al. [2021] explicitly consider only uniform minibatch sampling without replacement as a special case of arbitrary sampling. In Appendix E.4, we discuss another prominent sampling strategy called importance sampling. In Section 6, we provide numerical experiments verifying our theoretical findings and showing the benefits of importance sampling over uniform sampling for SGDA.
where at the k-th iteration j_k is sampled uniformly at random from [n]. Here, the full operator F is computed only when w^k is updated, which happens with probability p. Typically, p is chosen as p ∼ 1/n, ensuring that the expected cost of one iteration equals O(1) oracle calls, i.e., computations of F_i(x) for some i ∈ [n].
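The loopless mechanics described above can be sketched as follows. This is our own minimal implementation for R ≡ 0, with the estimator mirroring the standard loopless-SVRG analog, g^k = F_{j_k}(x^k) − F_{j_k}(w^k) + F(w^k) (all names are illustrative, not the paper's code):

```python
import numpy as np

def l_svrgda(F_i, F_full, n, x0, gamma, p, steps, seed=0):
    """L-SVRGDA sketch for R = 0: variance-reduced estimator
    g = F_j(x) - F_j(w) + F(w), with the reference point w refreshed
    to the current iterate with probability p."""
    rng = np.random.default_rng(seed)
    x, w = x0.copy(), x0.copy()
    Fw = F_full(w)                      # full operator at the reference point
    for _ in range(steps):
        j = rng.integers(n)
        g = F_i(j, x) - F_i(j, w) + Fw  # unbiased: E_j[g] = F(x)
        x = x - gamma * g
        if rng.random() < p:            # expected cost O(1) per step if p ~ 1/n
            w, Fw = x.copy(), F_full(x)
    return x
```

On a strongly monotone finite sum, this converges linearly to the exact solution (D_1 = D_2 = 0 in the terminology of Assumption 2.1), unlike plain SGDA.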
We introduce the following assumption about operators Fi .
Assumption 4.1 (Averaged Star-Cocoercivity). We assume that there exists a constant ℓ̂ > 0 such that for all x ∈ R^d

(1/n)∑_{i=1}^n ‖F_i(x) − F_i(x^*)‖² ≤ ℓ̂⟨F(x) − F(x^*), x − x^*⟩.  (16)

For example, if F_i is ℓ_i-cocoercive for every i ∈ [n], then (16) holds with ℓ̂ ≤ max_{i∈[n]} ℓ_i. Next, if F_i is L_i-Lipschitz for all i ∈ [n] and F is μ-quasi-strongly monotone, then (16) is satisfied for ℓ̂ ∈ [L, L²/μ], where L² = (1/n)∑_{i=1}^n L_i².
Moreover, for the analysis of variance reduced variants of SGDA we also use uniqueness of
the solution.
Assumption 4.2 (Unique Solution). We assume that the solution set X ∗ of problem (1) is
a singleton: X ∗ = {x∗ }.
These assumptions are sufficient to establish the validity of Assumption 2.1 for the L-SVRGDA estimator.
Proposition 4.3. Let Assumptions 4.1 and 4.2 hold. Then, L-SVRGDA satisfies Assumption 2.1 with A = ℓ̂, B = 2, σ_k² = (1/n)∑_{i=1}^n ‖F_i(w^k) − F_i(x^*)‖², C = pℓ̂/2, ρ = p, D_1 = D_2 = 0.
Plugging these parameters into our general results on the convergence of SGDA-type algorithms, we derive convergence results for L-SVRGDA; see Table 1 and Appendix F.1 for the details. Moreover, in Appendix F.2, we show that SAGA-SGDA [Palaniappan and Bach, 2016] fits our framework, and using our general analysis we tighten the convergence rates for this method.
We compare our convergence guarantees with known results in Table 1. We note that, neglecting the importance sampling scenario, in the worst case our convergence results match the best-known results for SGDA-type methods, i.e., the ones derived in Palaniappan and Bach [2016]. Indeed, this follows from ℓ̂ ∈ [L, L²/μ]. Next, when the difference between ℓ and ℓ̂ is not significant, our complexity results match the one derived in Chavdarova et al. [2019] for SVRE, which is an EG-type method. Although, in general, ℓ might be smaller than ℓ̂, our analysis does not require cocoercivity of each F_i, and it works for R(x) ≢ 0. Finally, Alacaoglu and Malitsky [2021] derive a better rate (when n = O(L²/μ²)), but their method is based on EG. Therefore, our results match the best-known ones in the literature on SGDA-type methods.
Table 1: Summary of the complexity results for variance-reduced methods for solving (1). By complexity we mean the number of oracle calls required for the method to find x such that E[‖x − x^*‖²] ≤ ε. Dependencies on numerical and logarithmic factors are hidden. By default, the operator F is assumed to be μ-strongly monotone and, as a result, the solution is unique. Our results rely on μ-quasi-strong monotonicity of F (3), but we also assume uniqueness of the solution. Methods supporting R(x) ≢ 0 are highlighted with ∗. Our results are highlighted in green. Notation: ℓ, L = averaged cocoercivity/Lipschitz constants depending on the sampling strategy, e.g., for uniform sampling ℓ² = (1/n)∑_{i=1}^n ℓ_i², L² = (1/n)∑_{i=1}^n L_i², and for importance sampling ℓ = (1/n)∑_{i=1}^n ℓ_i, L = (1/n)∑_{i=1}^n L_i; ℓ̂ = averaged star-cocoercivity constant from Assumption 4.1.

SAGA-SGDA∗ | This paper | As. 4.1 | n + ℓ̂/μ

(1) The method is based on the Extragradient update rule.
(2) Yang et al. [2020] consider saddle point problems satisfying the so-called two-sided PL condition, which is weaker than strong-convexity-strong-concavity of the objective function.
and d are huge. This means that in naive distributed implementations of SGDA, communication rounds take much more time than local computations on the clients. Various approaches are used to circumvent this issue.
One of them is based on the use of compressed communication. We focus on unbiased compression operators.
In this paper, we consider compressed communication in the direction from the clients to the server. The simplest method with compression, QSGDA (Alg. 4), can be described as SGDA (5) with g^k = (1/n)∑_{i=1}^n Q(g_i^k). Here, g_i^k are stochastic estimators satisfying the following assumption^5.
Assumption 5.2 (Bounded variance). All stochastic realizations gik are unbiased and have
bounded variance, i.e., for all i ∈ [n] and k ≥ 0 the following holds:
Despite its simplicity, QSGDA was never considered in the literature on solving min-max problems and VIPs. It turns out that under these assumptions QSGDA satisfies our Assumption 2.1.
^5 We use this assumption to illustrate the flexibility of the framework. It is possible to consider the Arbitrary Sampling setup as well.
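A standard example of an unbiased compression operator Q (with variance parameter ω as in Definition 5.1) is RandK sparsification, which is also used in our experiments. The sketch below is our own illustration of RandK and of the QSGDA aggregation g^k = (1/n)∑_i Q(g_i^k); the names `rand_k` and `qsgda_estimate` are hypothetical:

```python
import numpy as np

def rand_k(x, k, rng):
    """RandK sparsification: keep k coordinates chosen uniformly at random and
    rescale by d/k, so that E[Q(x)] = x (unbiased, with omega = d/k - 1)."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    q = np.zeros_like(x)
    q[idx] = (d / k) * x[idx]
    return q

def qsgda_estimate(g_locals, k, rng):
    """QSGDA aggregation of compressed client messages:
    g^k = (1/n) * sum_i Q(g_i^k)."""
    return np.mean([rand_k(g, k, rng) for g in g_locals], axis=0)
```

Each client sends only k coordinates per round, which is the source of the communication savings; unbiasedness of Q keeps g^k an unbiased estimate of the averaged local estimators.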
Proposition 5.3. Let F be ℓ-star-cocoercive and let Assumptions 4.1, 5.2 hold. Then, QSGDA satisfies Assumption 2.1 with A = 3ℓ/2 + 9ωℓ̂/(2n), B = 0, σ_k² ≡ 0, D_1 = (3(1 + 3ω)σ² + 9ωζ_*²)/n, C = 0, ρ = 1, D_2 = 0, where σ² = (1/n)∑_{i=1}^n σ_i² and ζ_*² := (1/n) max_{x^*∈X^*} ∑_{i=1}^n ‖F_i(x^*)‖².
As for the other special cases, we derive convergence results for QSGDA using our general theorems (see Table 2 and Appendix G.1 for the details). The proposed method is simple but has a significant drawback: even in the deterministic case (σ = 0), QSGDA does not converge linearly unless ζ_*² = 0. However, when the data on the clients is arbitrarily heterogeneous, the dissimilarity measure ζ_*² is strictly positive and can be large (even when R(x) ≡ 0).
To resolve this issue, we propose a more advanced scheme based on the DIANA update [Mishchenko et al., 2019, Horváth et al., 2019]: DIANA-SGDA (Alg. 5). In a nutshell, DIANA-SGDA is SGDA (5) with g^k defined as follows:

Δ_i^k = g_i^k − h_i^k,
h_i^{k+1} = h_i^k + αQ(Δ_i^k),
g^k = h^k + (1/n)∑_{i=1}^n Q(Δ_i^k),
h^{k+1} = h^k + (α/n)∑_{i=1}^n Q(Δ_i^k),

where the first two lines correspond to the local computations on the clients and the last two lines to the server-side computations. Taking into account the update rule for h^{k+1}, one can notice that DIANA-SGDA requires the workers to send only the vectors Q(Δ_i^k) to the server at step k, i.e., the method uses only compressed worker-to-server communication.
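One communication round of this shifted-compression scheme can be sketched as follows. This is our own illustrative implementation of the DIANA-style update (names are hypothetical); each client compresses only the difference between its local estimate and its shift, and the shifts drift toward the local operators, removing the heterogeneity penalty:

```python
import numpy as np

def diana_round(g_locals, h_locals, h_server, alpha, Q):
    """One DIANA-SGDA communication round (sketch): client i sends Q(Delta_i)
    with Delta_i = g_i - h_i and updates its shift h_i; the server forms g^k
    from its own shift h and the averaged compressed messages."""
    deltas = [Q(g - h) for g, h in zip(g_locals, h_locals)]        # clients
    new_h_locals = [h + alpha * q for h, q in zip(h_locals, deltas)]
    avg_q = np.mean(deltas, axis=0)
    g = h_server + avg_q                                           # server
    new_h_server = h_server + alpha * avg_q
    return g, new_h_locals, new_h_server
```

If h_server starts as the average of the local shifts, the update preserves that invariant, so the server never needs the uncompressed g_i^k.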
As we show next, DIANA-SGDA fits our framework.
Proposition 5.4. Let Assumptions 4.1, 4.2, 5.2 hold. Suppose that α ≤ 1/(1 + ω). Then, DIANA-SGDA with quantization (17) satisfies Assumption 2.1 with σ_k² = (1/n)∑_{i=1}^n ‖h_i^k − F_i(x^*)‖² and A = (1 + 2ω/n)ℓ̂, B = 2ω/n, D_1 = (1 + ω)σ²/n, C = αℓ̂/2, ρ = α, D_2 = ασ²/n, where σ² = (1/n)∑_{i=1}^n σ_i².
where j_k is sampled uniformly at random from [n]. We call the resulting method VR-DIANA-SGDA (Alg. 6) and note that its analog for solving minimization problems (VR-DIANA) was proposed and analyzed in Horváth et al. [2019].
To cast VR-DIANA-SGDA as a special case of our general framework, we need the following assumption.
Table 2: Summary of the complexity results for distributed methods with unbiased compression for solving the distributed problem (1) with F(x) = (1/n)∑_{i=1}^n F_i(x). By complexity we mean the number of communication rounds required for the method to find x such that E[‖x − x^*‖²] ≤ ε. Dependencies on numerical and logarithmic factors are hidden. E stands for the setup when F_i(x) = E_{ξ_i}[F_{ξ_i}(x)]; Σ denotes the case when F_i(x) = (1/m)∑_{j=1}^m F_{ij}(x). By default, the operator F is assumed to be μ-strongly monotone and, as a result, the solution is unique. Our results rely on μ-quasi-strong monotonicity of F (3), but we also assume uniqueness of the solution. Methods supporting R(x) ≢ 0 are highlighted with ∗. Our results are highlighted in green. Notation: σ² = (1/n)∑_{i=1}^n σ_i² = averaged upper bound on the variance (see Assumption 5.2 for the definition of σ_i²); ω = quantization parameter (see Definition 5.1); ζ_*² = (1/n) max_{x^*∈X^*} ∑_{i=1}^n ‖F_i(x^*)‖²; L_max = max_{i∈[n]} L_i; ℓ̃ = averaged star-cocoercivity constant from Assumption 5.5.
Assumption 5.5. We assume that there exists a constant ℓ̃ > 0 such that for all x ∈ R^d

(1/(nm))∑_{i=1}^n ∑_{j=1}^m ‖F_ij(x) − F_ij(x^*)‖² ≤ ℓ̃⟨F(x) − F(x^*), x − x^*⟩.  (24)
Using Assumption 5.5 in combination with previously introduced conditions, we get the
following result.
6 Numerical Experiments
To illustrate our theoretical results, we conduct several numerical experiments on quadratic games, which are defined through the affine operator

F(x) = (1/n)∑_{i=1}^n (A_i x + b_i),  (25)

where each matrix A_i ∈ R^{d×d} is non-symmetric with all eigenvalues having strictly positive real part. Enforcing all eigenvalues to have strictly positive real part ensures that the operator is strongly monotone and cocoercive. We consider two different settings: (i) a problem without constraints, and (ii) a problem with ℓ_1 regularization and constraints forcing the solution to lie in the ℓ_∞-ball of radius r. In all experiments, we use a constant stepsize for all methods, selected manually via grid search by picking the best-performing stepsize for each method. For further details about the experiments, see Appendix B.
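One simple way to generate such matrices, sketched below under our own naming (the paper may use a different construction), is to add a positive-definite symmetric part with smallest eigenvalue at least μ to a skew-symmetric part: every eigenvalue of the sum then has real part at least μ > 0.

```python
import numpy as np

def quadratic_game_operator(n, d, mu, rng):
    """Generate F(x) = (1/n) * sum_i (A_i x + b_i) as in (25): each A_i is a
    symmetric positive-definite part (eigenvalues >= mu) plus a skew-symmetric
    part, which forces every eigenvalue of A_i to have real part >= mu > 0."""
    ops = []
    for _ in range(n):
        B = rng.standard_normal((d, d))
        S = B @ B.T / d + mu * np.eye(d)   # symmetric, eigenvalues >= mu
        M = rng.standard_normal((d, d))
        A_i = S + (M - M.T) / 2            # non-symmetric, Re(eig) >= mu
        ops.append((A_i, rng.standard_normal(d)))

    def F(x):
        return sum(A @ x + b for A, b in ops) / n

    return F, ops
```

The guarantee follows from Re(v* A_i v) = v* S v ≥ μ‖v‖² for every eigenvector v, since the skew-symmetric part contributes a purely imaginary quadratic form.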
Uniform sampling (US) vs Importance sampling (IS). We note that Loizou et al. [2021], who study SGDA-AS, do not consider IS explicitly. Although we show the theoretical benefits of IS over US in Appendix E.4, here we provide a numerical comparison to illustrate the superiority of IS (on both constrained and unconstrained quadratic games). We choose the matrices A_i such that ℓ_max = max_i ℓ_i ≫ ℓ̄. In this case, our theory predicts that IS should perform better than US. We provide the results in Fig. 1. We observe that SGDA with IS indeed converges faster and to a smaller neighborhood than SGDA with US. This observation perfectly corroborates our theory.
Figure 1: Uniform sampling (US) vs Importance sampling (IS); distance to optimality vs. number of oracle calls. Left: without constraints. Right: with constraints. As predicted by the theory, IS converges faster and to a smaller neighborhood than US.
Figure 2: Comparison of variance-reduced methods (L-SVRGDA, SVRG, EG-VR, VR-AGDA, SVRE; distance to optimality vs. number of oracle calls). Left: without constraints. Right: with constraints. We observe that L-SVRGDA is very competitive and outperforms all the other methods.
Distributed setting. In our last experiment, we consider a distributed version of the quadratic game, in which we assume that F(x) = (1/n)∑_{i=1}^n F_i(x), with each F_i having a form similar to (25). The information about the operator F_i is stored on node i only. We compare the distributed methods proposed in this paper: QSGDA, DIANA-SGDA, and VR-DIANA-SGDA. For the quantization, we use RandK sparsification [Beznosikov et al., 2020a] with K = 5. We show our findings in Fig. 3, where the performance is measured both in terms of the number of oracle calls and the number of bits communicated from the workers to the server. In Fig. 3 we can clearly see the advantage of using quantization in reducing the communication cost compared to the baseline SGDA. We also observe that VR-DIANA-SGDA achieves linear convergence to the solution. However, DIANA-SGDA performs similarly to QSGDA, since the noise σ² is larger than the dissimilarity constant ζ_*². To further illustrate the difference between DIANA-SGDA and QSGDA, we conduct additional experiments with full-batch methods (σ = 0) in Appendix B.
Figure 3: Comparison of algorithms in the distributed setting (SGDA, QSGDA, DIANA-SGDA, VR-DIANA-SGDA). Left: distance to optimality vs. number of oracle calls. Right: distance to optimality vs. number of bits communicated.
for analyzing methods that do not fit scheme (5), e.g., Stochastic Extragradient and Stochastic
Optimistic Gradient.
References
A. Alacaoglu and Y. Malitsky. Stochastic variance reduction for variational inequality methods. arXiv
preprint arXiv:2102.08352, 2021.
D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30:1709–1720, 2017.
W. Azizian, F. Iutzeler, J. Malick, and P. Mertikopoulos. The last-iterate convergence rate of optimistic
mirror descent in stochastic variational inequalities. In Conference on Learning Theory, pages 326–358.
PMLR, 2021.
H. H. Bauschke, P. L. Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces,
volume 408. Springer, 2011.
A. Beck. First-order methods in optimization. Society for Industrial and Applied Mathematics (SIAM),
2017.
A. Beznosikov, A. Sadiev, and A. Gasnikov. Gradient-free methods with inexact oracle for convex-
concave stochastic saddle-point problem. In International Conference on Mathematical Optimization
Theory and Operations Research, pages 105–119. Springer, 2020b.
A. Beznosikov, V. Novitskii, and A. Gasnikov. One-point gradient-free methods for smooth and non-
smooth saddle-point problems. In International Conference on Mathematical Optimization Theory
and Operations Research, pages 144–158. Springer, 2021a.
Y. Carmon, Y. Jin, A. Sidford, and K. Tian. Variance reduction for matrix games. Advances in Neural
Information Processing Systems, 32, 2019.
N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
T. Chavdarova, G. Gidel, F. Fleuret, and S. Lacoste-Julien. Reducing noise in GAN training with
variance reduced extragradient. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,
and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran
Associates, Inc., 2019.
D. Davis and W. Yin. A three-operator splitting scheme and its optimization applications. Set-valued
and variational analysis, 25(4):829–858, 2017.
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support
for non-strongly convex composite objectives. Advances in neural information processing systems, 27,
2014.
V. F. Dem’yanov and A. B. Pevnyi. Numerical methods for finding saddle points. USSR Computational
Mathematics and Mathematical Physics, 12(5):11–52, 1972.
E. Gorbunov, F. Hanzely, and P. Richtarik. A Unified Theory of SGD: Variance Reduction, Sampling,
Quantization and Coordinate Descent. In S. Chiappa and R. Calandra, editors, Proceedings of the
Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Pro-
ceedings of Machine Learning Research, pages 680–690. PMLR, 26–28 Aug 2020a.
E. Gorbunov, D. Kovalev, D. Makarenko, and P. Richtarik. Linearly converging error compensated SGD. In
H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Informa-
tion Processing Systems, volume 33, pages 20889–20900. Curran Associates, Inc., 2020b. URL https:
//proceedings.neurips.cc/paper/2020/file/ef9280fbc5317f17d480e4d4f61b3751-Paper.pdf.
E. Gorbunov, H. Berard, G. Gidel, and N. Loizou. Stochastic extragradient: General analysis and
improved rates. arXiv preprint arXiv:2111.08611, 2021a.
E. Gorbunov, N. Loizou, and G. Gidel. Extragradient method: O(1/K) last-iterate convergence for
monotone variational inequalities and connections with cocoercivity. arXiv preprint arXiv:2110.04261,
2021c.
R. Gower, O. Sebbouh, and N. Loizou. Sgd for structured nonconvex functions: Learning rates, mini-
batching and interpolation. In International Conference on Artificial Intelligence and Statistics, pages
1315–1323. PMLR, 2021.
R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. SGD: General Analysis
and Improved Rates. In Proceedings of the 36th International Conference on Machine Learning,
volume 97 of Proceedings of Machine Learning Research, pages 5200–5209, 2019.
Y. Han, G. Xie, and Z. Zhang. Lower complexity bounds of finite-sum optimization problems: The
results and construction. arXiv preprint arXiv:2103.08280, 2021.
F. Hanzely and P. Richtárik. Accelerated coordinate descent with arbitrary sampling and best rates
for minibatches. In The 22nd International Conference on Artificial Intelligence and Statistics, pages
304–312. PMLR, 2019.
F. Hanzely, K. Mishchenko, and P. Richtárik. SEGA: Variance reduction via gradient sketching. Advances
in Neural Information Processing Systems, 31, 2018.
S. Horváth, D. Kovalev, K. Mishchenko, S. Stich, and P. Richtárik. Stochastic distributed learning with
gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019.
Y.-G. Hsieh, F. Iutzeler, J. Malick, and P. Mertikopoulos. On the convergence of single-call stochas-
tic extra-gradient methods. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,
and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran
Associates, Inc., 2019.
Y.-G. Hsieh, F. Iutzeler, J. Malick, and P. Mertikopoulos. Explore aggressively, update conservatively:
Stochastic extragradient methods with variable stepsize scaling. Advances in Neural Information
Processing Systems, 33, 2020.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction.
Advances in Neural Information Processing Systems, 26, 2013.
A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox
algorithm. Stochastic Systems, 1(1):17–58, 2011.
S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi. Error feedback fixes SignSGD and other gradient
compression schemes. In International Conference on Machine Learning, pages 3252–3261. PMLR,
2019.
A. Khaled, O. Sebbouh, N. Loizou, R. M. Gower, and P. Richtárik. Unified analysis of stochastic gradient
methods for composite convex and smooth optimization. arXiv preprint arXiv:2006.11573, 2020.
G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon,
12:747–756, 1976.
D. Kovalev, S. Horváth, and P. Richtárik. Don’t jump through hoops and remove those loops: SVRG
and Katyusha are better without the outer loop. In Algorithmic Learning Theory, 2020.
C. J. Li, Y. Yu, N. Loizou, G. Gidel, Y. Ma, N. L. Roux, and M. I. Jordan. On the convergence
of stochastic extragradient for bilinear games with restarted iteration averaging. arXiv preprint
arXiv:2107.00464, 2021.
Z. Li, D. Kovalev, X. Qian, and P. Richtarik. Acceleration for compressed gradient descent in distributed
and federated optimization. In International Conference on Machine Learning, pages 5895–5904.
PMLR, 2020.
H. Lin, J. Mairal, and Z. Harchaoui. Catalyst acceleration for first-order convex optimization: from
theory to practice. Journal of Machine Learning Research, 18(1):7854–7907, 2018.
T. Lin, Z. Zhou, P. Mertikopoulos, and M. Jordan. Finite-time last-iterate convergence for multi-agent
learning in games. In International Conference on Machine Learning, pages 6161–6171. PMLR, 2020.
S. Liu, S. Lu, X. Chen, Y. Feng, K. Xu, A. Al-Dujaili, M. Hong, and U.-M. O’Reilly. Min-max optimiza-
tion without gradients: Convergence and applications to black-box evasion and poisoning attacks. In
International Conference on Machine Learning, pages 6282–6293. PMLR, 2020.
N. Loizou and P. Richtárik. Convergence analysis of inexact randomized iterative methods. SIAM
Journal on Scientific Computing, 42(6):A3979–A4016, 2020a.
N. Loizou and P. Richtárik. Momentum and stochastic momentum for stochastic gradient, newton,
proximal point and subspace descent methods. Computational Optimization and Applications, 77(3):
653–710, 2020b.
N. Loizou, H. Berard, A. Jolicoeur-Martineau, P. Vincent, S. Lacoste-Julien, and I. Mitliagkas. Stochastic
hamiltonian gradient methods for smooth games. In International Conference on Machine Learning,
pages 6370–6381. PMLR, 2020.
L. Luo, G. Xie, T. Zhang, and Z. Zhang. Near optimal stochastic algorithms for finite-sum unbalanced
convex-concave minimax optimization. arXiv preprint arXiv:2106.01761, 2021.
Y. Malitsky and M. K. Tam. A forward-backward splitting method for monotone inclusions without
cocoercivity. SIAM Journal on Optimization, 30(2):1451–1472, 2020.
P. Mertikopoulos and Z. Zhou. Learning in games with continuous action sets and unknown payoff
functions. Mathematical Programming, 173(1):465–507, 2019.
K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik. Distributed learning with compressed gradient
differences. arXiv preprint arXiv:1901.09269, 2019.
O. Morgenstern and J. Von Neumann. Theory of games and economic behavior. Princeton university
press, 1953.
Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related
problems. Mathematical Programming, 109(2):319–344, 2007.
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120
(1):221–259, 2009.
B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In
Advances in Neural Information Processing Systems, pages 1416–1424, 2016.
L. D. Popov. A modification of the arrow-hurwicz method for search of saddle points. Mathematical
notes of the Academy of Sciences of the USSR, 28(5):845–848, 1980.
X. Qian, Z. Qu, and P. Richtárik. SAGA with arbitrary sampling. In International Conference on Machine
Learning, pages 5190–5199. PMLR, 2019.
X. Qian, Z. Qu, and P. Richtárik. L-SVRG and L-Katyusha with arbitrary sampling. Journal of Machine
Learning Research, 22(112):1–47, 2021a.
X. Qian, P. Richtárik, and T. Zhang. Error compensated distributed SGD can be accelerated. Advances
in Neural Information Processing Systems, 34, 2021b.
P. Richtárik and M. Takác. Stochastic reformulations of linear systems: algorithms and convergence
theory. SIAM Journal on Matrix Analysis and Applications, 41(2):487–524, 2020.
P. Richtárik, I. Sokolov, and I. Fatkhullin. EF21: A new, simpler, theoretically better, and practically
faster error feedback. In Advances in Neural Information Processing Systems, 2021.
A. Sadiev, A. Beznosikov, P. Dvurechensky, and A. Gasnikov. Zeroth-order algorithms for smooth saddle-
point problems. In International Conference on Mathematical Optimization Theory and Operations
Research, pages 71–85. Springer, 2021.
F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to
data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International
Speech Communication Association, 2014.
C. Song, Z. Zhou, Y. Zhou, Y. Jiang, and Y. Ma. Optimistic dual extrapolation for coherent non-
monotone variational inequalities. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and
H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 14303–14314.
Curran Associates, Inc., 2020.
S. U. Stich. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint
arXiv:1907.04232, 2019.
S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Proceedings of the 32nd
International Conference on Neural Information Processing Systems, pages 4452–4463, 2018.
S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of sgd for over-parameterized models
and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and
Statistics, pages 1195–1204. PMLR, 2019.
B. C. Vũ. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances
in Computational Mathematics, 38(3):667–681, 2013.
W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce
communication in distributed deep learning. In Proceedings of the 31st International Conference on
Neural Information Processing Systems, pages 1508–1518, 2017.
J. Yang, N. Kiyavash, and N. He. Global convergence and variance reduction for a class of nonconvex-
nonconcave minimax problems. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin,
editors, Advances in Neural Information Processing Systems, volume 33, pages 1153–1165. Curran
Associates, Inc., 2020.
T. Yoon and E. K. Ryu. Accelerated algorithms for smooth convex-concave minimax problems with
$O(1/k^2)$ rate on squared gradient norm. In International Conference on Machine Learning, pages
12098–12109. PMLR, 2021.
D. Yuan, Q. Ma, and Z. Wang. Dual averaging method for solving multi-agent saddle-point problems
with quantized information. Transactions of the Institute of Measurement and Control, 36(1):38–46,
2014.
D. L. Zhu and P. Marcotte. Co-coercivity and its role in the convergence of iterative schemes for solving
variational inequalities. SIAM Journal on Optimization, 6(3):714–726, 1996.
Contents
1 Introduction
1.1 Technical Preliminaries
1.2 Our Contributions
1.3 Closely Related Work
6 Numerical Experiments
G.2.1 Proof of Proposition 5.4
G.2.2 Analysis of DIANA-SGDA in the Quasi-Strongly Monotone Case
G.2.3 Analysis of DIANA-SGDA in the Monotone Case
G.3 VR-DIANA-SGDA
G.3.1 Proof of Proposition 5.6
G.3.2 Analysis of VR-DIANA-SGDA in the Quasi-Strongly Monotone Case
G.3.3 Analysis of VR-DIANA-SGDA in the Monotone Case
G.4 Discussion of the Results in the Monotone Case
H Coordinate SGDA
H.1 CSGDA
H.1.1 CSGDA Fits Assumption 2.1
H.1.2 Analysis of CSGDA in the Quasi-Strongly Monotone Case
H.1.3 Analysis of CSGDA in the Monotone Case
H.2 SEGA-SGDA
H.2.1 SEGA-SGDA Fits Assumption 2.1
H.2.2 Analysis of SEGA-SGDA in the Quasi-Strongly Monotone Case
H.2.3 Analysis of SEGA-SGDA in the Monotone Case
H.3 Comparison with Related Work
A Further Related Work
The references necessary to motivate our work and connect it to the most relevant literature are included
in the appropriate sections of the main body of the paper. Here we present a broader view of the
literature, including additional references to papers in the area that are not directly related to our
work.
Stochastic methods for solving VIPs. Although this paper is devoted to SGDA-type methods,
we briefly mention here works studying other popular stochastic methods for solving VIPs based
on different algorithmic schemes, such as the Extragradient (EG) method [Korpelevich, 1976] and the Optimistic
Gradient (OG) method [Popov, 1980]. The first analysis of Stochastic EG for solving (quasi-strongly)
monotone VIPs was proposed in Juditsky et al. [2011] and was then extended and generalized in various
ways [Mishchenko et al., 2020, Hsieh et al., 2020, Beznosikov et al., 2020c, Li et al., 2021, Gorbunov et al.,
2021a]. Stochastic OG was studied in Gidel et al. [2019], Hsieh et al. [2019], Azizian et al. [2021]. In
addition, lightweight second-order methods such as stochastic Hamiltonian methods and stochastic consensus
optimization were studied in Loizou et al. [2020] and Loizou et al. [2021], respectively.
Variance reduction for VIPs. In Section 1.3, we discussed the variance-reduced
methods most relevant to our setting. Here we expand on that discussion, providing more details on some of
the results and additional references related to acceleration.
We highlight that the rates from Alacaoglu and Malitsky [2021] match the lower bounds from
Han et al. [2021]. Under additional assumptions, similar results were achieved in Carmon et al. [2019]. Alacaoglu
et al. [2021] developed a variance-reduced method (FoRB-VR) based on the Forward-Reflected-Backward
algorithm [Malitsky and Tam, 2020], but the derived rates are inferior to those from Alacaoglu and Malitsky
[2021].
Using the Catalyst acceleration framework of Lin et al. [2018], Palaniappan and Bach [2016] and Tominin
et al. [2021] achieve (neglecting extra logarithmic factors) rates similar to those in Alacaoglu and Malitsky
[2021], and Luo et al. [2021] derive even tighter rates for min-max problems. However, as with all Catalyst-based
approaches, these methods require solving an auxiliary problem at each iteration, which reduces
their practical efficiency.
Coordinate and zeroth-order methods for solving min-max problems and VIPs.
Coordinate methods for solving VIPs are rarely considered in the literature. The most relevant results
appear in the literature on zeroth-order methods for solving min-max problems. Although some of
them can easily be extended to coordinate versions of methods for solving VIPs, these methods are
usually considered and analyzed for min-max problems. The closest work to our paper is Sadiev et al.
[2021]: they propose and analyze several zeroth-order variants of SGDA and Stochastic EG with a two-point
feedback oracle for solving strongly-convex-strongly-concave and convex-concave smooth min-max
problems with bounded domain. Moreover, Sadiev et al. [2021] consider firmly smooth convex-concave
min-max problems, where firm smoothness is an analog of cocoercivity for min-max problems. There are also papers
focusing on different problems, such as nonconvex-strongly-concave smooth min-max problems [Liu et al.,
2020, Wang et al., 2020] and non-smooth strongly-convex-strongly-concave and convex-concave min-max
problems [Beznosikov et al., 2020b], and on different methods, such as those using a one-point feedback oracle
[Beznosikov et al., 2021a]. These works are less relevant to our paper than Sadiev et al. [2021], and
the results derived in these papers are inferior to those from Sadiev et al. [2021].
B Missing Details on Numerical Experiments
B.1 Setup
$$F(x) = \frac{1}{n}\sum_{i=1}^{n} F_i(x), \qquad F_i(x) = A_i x + b_i,$$
$$R(x) = \lambda\|x\|_1 + \delta_{B_r(0)}(x) = \lambda\|x\|_1 + \begin{cases} 0, & \text{if } \|x\|_\infty \le r, \\ +\infty, & \text{if } \|x\|_\infty > r, \end{cases} \qquad (26)$$
where each matrix $A_i \in \mathbb{R}^{d\times d}$ is non-symmetric with all eigenvalues having strictly positive real part, $b_i \in \mathbb{R}^d$, $r > 0$ is the radius of the $\ell_\infty$-ball, and $\lambda \ge 0$ is the regularization parameter. One can show (see Example 6.22 from Beck [2017]) that for the given $R(x)$ the prox operator has an explicit formula:
$$\left[\operatorname{prox}_{\gamma R}(x)\right]_j = \operatorname{sign}(x_j)\min\left\{\max\{|x_j| - \gamma\lambda,\, 0\},\; r\right\}, \qquad j \in [d],$$
where $\operatorname{sign}(\cdot)$ and $|\cdot|$ are component-wise operators. The considered problem generalizes the following
quadratic game:
$$\min_{\|x_1\|_\infty \le r}\;\max_{\|x_2\|_\infty \le r}\;\frac{1}{n}\sum_{i=1}^{n}\left[\frac{1}{2}x_1^\top A_{1,i}x_1 + x_1^\top A_{2,i}x_2 - \frac{1}{2}x_2^\top A_{3,i}x_2 + b_{1,i}^\top x_1 - b_{2,i}^\top x_2\right] + \lambda\|x_1\|_1 - \lambda\|x_2\|_1$$
with $\mu_i I \preccurlyeq A_{1,i} \preccurlyeq L_i I$ and $\mu_i I \preccurlyeq A_{3,i} \preccurlyeq L_i I$. Indeed, the above problem is a special case of (1)+(26) with
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad A_i = \begin{pmatrix} A_{1,i} & A_{2,i} \\ -A_{2,i}^\top & A_{3,i} \end{pmatrix}, \qquad b_i = \begin{pmatrix} b_{1,i} \\ b_{2,i} \end{pmatrix},$$
$$R(x) = \lambda\|x_1\|_1 + \lambda\|x_2\|_1 + \delta_{B_r(0)}(x_1) + \delta_{B_r(0)}(x_2).$$
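In code, the closed-form prox of the regularizer in (26) is a coordinate-wise soft-thresholding followed by clipping to the $\ell_\infty$-ball. A minimal NumPy sketch (the function name and signature are ours, not from the released code):

```python
import numpy as np

def prox_l1_box(x, gamma, lam, r):
    # Prox of gamma * R with R(x) = lam * ||x||_1 + indicator of {||x||_inf <= r}:
    # soft-threshold each coordinate by gamma * lam, then clip it to [-r, r].
    soft = np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)
    return np.clip(soft, -r, r)
```

For `lam = 0` this reduces to the projection onto the ball, and for large `r` to the usual soft-thresholding operator.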
In our experiments, to generate the non-symmetric matrices $A_i \in \mathbb{R}^{d\times d}$ defined in (26), we first sample real random matrices $B_i$ whose elements are drawn from a normal distribution. We then compute the eigendecomposition $B_i = Q_i D_i Q_i^{-1}$, where the $D_i$ are diagonal matrices with complex numbers on the diagonal. Next, we construct the matrices $A_i = \Re(Q_i D_i^{+} Q_i^{-1})$, where $\Re(M)_{i,j} = \Re(M_{i,j})$ and $D_i^{+}$ is obtained by transforming all the elements of $D_i$ to have positive real part. This process ensures that the eigenvalues of $A_i$ all have positive real part, and thus that $F(x)$ is strongly monotone and cocoercive. The vectors $b_i \in \mathbb{R}^d$ are sampled from a normal distribution with variance $100/d$. For all the experiments we choose $n = 1000$ and $d = 100$. For the distributed experiments we simulate $m = 10$ nodes on a single machine with 2 CPUs. For further details, the code is made available at: https://github.com/hugobb/sgda.
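The generation procedure above can be sketched in NumPy as follows (a minimal reproduction; the helper name is ours, and the released code may differ in details):

```python
import numpy as np

def make_operator_matrix(d, rng):
    # Sample a real Gaussian matrix, eigendecompose it, flip the real parts of
    # its eigenvalues to be positive, and take the real part of the result.
    B = rng.standard_normal((d, d))
    eigvals, Q = np.linalg.eig(B)                        # B = Q diag(eigvals) Q^{-1}
    eigvals_plus = np.abs(eigvals.real) + 1j * eigvals.imag
    A = (Q @ np.diag(eigvals_plus) @ np.linalg.inv(Q)).real
    return A
```

Since the eigenvalues of a real matrix come in conjugate pairs and the map preserves that structure, the reconstructed matrix is real up to round-off, and its eigenvalues lie in the right half-plane.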
To illustrate the difference between QSGDA and DIANA-SGDA, we conduct an additional experiment
(Fig. 4). We consider the full-batch versions of QSGDA and DIANA-SGDA, which allows us to separate
the noise coming from the quantization from the noise coming from the stochasticity. We observe that,
in the full-batch regime, DIANA-SGDA converges linearly to the solution while QSGDA only converges to a
neighborhood of the solution.
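The distinction is visible in the update rules themselves. Below is a schematic full-batch step of both methods simulated on one machine, using unbiased random sparsification as a stand-in compressor (the function names, the compressor choice, and the omission of the prox step are our simplifications). QSGDA compresses the operator values $F_i(x)$ directly, so its compression noise is proportional to $\|F_i(x)\|$ and does not vanish at the solution; DIANA-SGDA compresses the differences $F_i(x) - h_i$ against learned shifts $h_i \to F_i(x^*)$, so its compression noise vanishes.

```python
import numpy as np

def rand_spars(v, k, rng):
    # Unbiased random-k sparsification: keep k of d coordinates, rescale by d/k.
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros(d)
    out[idx] = v[idx] * d / k
    return out

def qsgda_step(x, ops, gamma, k, rng):
    # Each "node" i compresses F_i(x) itself.
    g = np.mean([rand_spars(F(x), k, rng) for F in ops], axis=0)
    return x - gamma * g

def diana_step(x, h, ops, gamma, alpha, k, rng):
    # Each node compresses F_i(x) - h_i and updates its shift h_i.
    deltas = [rand_spars(F(x) - hi, k, rng) for F, hi in zip(ops, h)]
    g = np.mean([hi + di for hi, di in zip(h, deltas)], axis=0)
    h = [hi + alpha * di for hi, di in zip(h, deltas)]
    return x - gamma * g, h
```

On a toy strongly monotone operator with heterogeneous $F_i$ (so that $F_i(x^*) \neq 0$), iterating `diana_step` drives $\|x - x^*\|$ toward machine precision, mirroring Fig. 4.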
24
Figure 4: QSGDA vs DIANA-SGDA: DIANA-SGDA converges linearly to the solution while QSGDA only converges to a neighborhood of the solution.
Figure 5: Results on distributed quadratic games with constraints. Left: Number of oracle calls. Right: Number of bits communicated between nodes.
C Auxiliary Results and Technical Lemmas
Useful inequalities. In our proofs, we often apply the following standard inequalities, which hold for any $a, b \in \mathbb{R}^d$ and $\alpha > 0$:
$$\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2, \qquad (28)$$
$$2\langle a, b\rangle \le \alpha\|a\|^2 + \frac{1}{\alpha}\|b\|^2. \qquad (29)$$
Useful lemmas. The following lemma from Stich [2019] allows us to derive the rates of convergence
to the exact solution.
Lemma C.1 (Simplified version of Lemma 3 from Stich [2019]). Let the non-negative sequence $\{r_k\}_{k\ge 0}$ satisfy the relation
$$r_{k+1} \le (1 - a\gamma_k)r_k + c\gamma_k^2$$
for all $k \ge 0$, parameters $a > 0$, $c \ge 0$, and any non-negative sequence $\{\gamma_k\}_{k\ge 0}$ such that $\gamma_k \le \frac{1}{h}$ for some $h \ge a$, $h > 0$. Then, for any $K \ge 0$ one can choose $\{\gamma_k\}_{k\ge 0}$ as follows:
$$\text{if } K \le \frac{h}{a}, \quad \gamma_k = \frac{1}{h},$$
$$\text{if } K > \frac{h}{a} \text{ and } k < k_0, \quad \gamma_k = \frac{1}{h},$$
$$\text{if } K > \frac{h}{a} \text{ and } k \ge k_0, \quad \gamma_k = \frac{2}{a(\kappa + k - k_0)},$$
where $\kappa = \frac{2h}{a}$ and $k_0 = \lceil K/2 \rceil$. For this choice of $\gamma_k$ the following inequality holds:
$$r_K \le \frac{32 h r_0}{a}\exp\left(-\frac{aK}{2h}\right) + \frac{36c}{a^2 K}.$$
In the analysis of the monotone case, we rely on the following classical result from proximal operator theory.
Lemma C.2 (Theorem 6.39 (iii) from Beck [2017]). Let $R$ be a proper lower semicontinuous convex function and $x^+ = \operatorname{prox}_{\gamma R}(x)$. Then for all $z \in \mathbb{R}^d$ the following inequality holds:
$$\langle x - x^+, z - x^+ \rangle \le \gamma\left(R(z) - R(x^+)\right).$$
Finally, we rely on the following technical lemma for handling the sums arising in the proofs for the
monotone case.
Lemma C.3. Let $K > 0$ be a positive integer and $\eta_1, \eta_2, \ldots, \eta_K$ be random vectors such that $\mathbb{E}_k[\eta_k] := \mathbb{E}[\eta_k \mid \eta_1, \ldots, \eta_{k-1}] = 0$ for $k = 2, \ldots, K$. Then
$$\mathbb{E}\left[\left\|\sum_{k=1}^{K}\eta_k\right\|^2\right] = \sum_{k=1}^{K}\mathbb{E}\left[\|\eta_k\|^2\right]. \qquad (30)$$
26
Proof. We start with the following derivation:
$$\mathbb{E}\left[\left\|\sum_{k=1}^{K}\eta_k\right\|^2\right] = \mathbb{E}\left[\|\eta_K\|^2\right] + 2\mathbb{E}\left[\left\langle \eta_K, \sum_{k=1}^{K-1}\eta_k\right\rangle\right] + \mathbb{E}\left[\left\|\sum_{k=1}^{K-1}\eta_k\right\|^2\right]$$
$$= \mathbb{E}\left[\|\eta_K\|^2\right] + 2\mathbb{E}\left[\mathbb{E}_K\left[\left\langle \eta_K, \sum_{k=1}^{K-1}\eta_k\right\rangle\right]\right] + \mathbb{E}\left[\left\|\sum_{k=1}^{K-1}\eta_k\right\|^2\right]$$
$$= \mathbb{E}\left[\|\eta_K\|^2\right] + 2\mathbb{E}\left[\left\langle \mathbb{E}_K[\eta_K], \sum_{k=1}^{K-1}\eta_k\right\rangle\right] + \mathbb{E}\left[\left\|\sum_{k=1}^{K-1}\eta_k\right\|^2\right]$$
$$= \mathbb{E}\left[\|\eta_K\|^2\right] + \mathbb{E}\left[\left\|\sum_{k=1}^{K-1}\eta_k\right\|^2\right].$$
Applying similar steps to $\mathbb{E}\left[\left\|\sum_{k=1}^{K-1}\eta_k\right\|^2\right], \mathbb{E}\left[\left\|\sum_{k=1}^{K-2}\eta_k\right\|^2\right], \ldots, \mathbb{E}\left[\left\|\sum_{k=1}^{2}\eta_k\right\|^2\right]$, we get the result.
D Proof of The Main Results
In this section, we provide complete proofs of our main results.
Theorem D.1 (Theorem 2.2). Let $F$ be $\mu$-quasi-strongly monotone with $\mu > 0$ and let Assumption 2.1 hold. Assume that
$$0 < \gamma \le \min\left\{\frac{1}{\mu}, \frac{1}{2(A + CM)}\right\} \qquad (31)$$
for some $M > B/\rho$. Then for the Lyapunov function $V_k = \|x^k - x^{*,k}\|^2 + M\gamma^2\sigma_k^2$, and for all $k \ge 0$ we have
$$\mathbb{E}[V_k] \le \left(1 - \min\left\{\gamma\mu, \rho - \frac{B}{M}\right\}\right)^k \mathbb{E}[V_0] + \frac{\gamma^2(D_1 + MD_2)}{\min\{\gamma\mu, \rho - B/M\}}. \qquad (32)$$
Proof. First of all, we recall a well-known fact about proximal operators: for any solution $x^*$ of (1) we have
$$x^* = \operatorname{prox}_{\gamma R}(x^* - \gamma F(x^*)). \qquad (33)$$
Using this and the non-expansiveness of the proximal operator, we derive
Next, we take the expectation $\mathbb{E}_k[\cdot]$ with respect to the randomness at iteration $k$ and get
Using $\gamma \le \frac{1}{2(A+CM)}$ and the definition $V_k = \|x^k - x^{*,k}\|^2 + M\gamma^2\sigma_k^2$, we get
$$\mathbb{E}_k[V_{k+1}] \le (1-\gamma\mu)\|x^k - x^{*,k}\|^2 + M\gamma^2\left(1 - \rho + \frac{B}{M}\right)\sigma_k^2 + \gamma^2(D_1 + MD_2)$$
$$\le \left(1 - \min\left\{\gamma\mu, \rho - \frac{B}{M}\right\}\right)V_k + \gamma^2(D_1 + MD_2).$$
Next, we take the full expectation of the above inequality and establish the following recurrence:
$$\mathbb{E}[V_{k+1}] \le \left(1 - \min\left\{\gamma\mu, \rho - \frac{B}{M}\right\}\right)\mathbb{E}[V_k] + \gamma^2(D_1 + MD_2). \qquad (35)$$
Unrolling the recurrence, we derive
$$\mathbb{E}[V_k] \le \left(1 - \min\left\{\gamma\mu, \rho - \frac{B}{M}\right\}\right)^k \mathbb{E}[V_0] + \gamma^2(D_1 + MD_2)\sum_{t=0}^{k-1}\left(1 - \min\left\{\gamma\mu, \rho - \frac{B}{M}\right\}\right)^t$$
$$\le \left(1 - \min\left\{\gamma\mu, \rho - \frac{B}{M}\right\}\right)^k \mathbb{E}[V_0] + \gamma^2(D_1 + MD_2)\sum_{t=0}^{\infty}\left(1 - \min\left\{\gamma\mu, \rho - \frac{B}{M}\right\}\right)^t$$
$$= \left(1 - \min\left\{\gamma\mu, \rho - \frac{B}{M}\right\}\right)^k \mathbb{E}[V_0] + \frac{\gamma^2(D_1 + MD_2)}{\min\{\gamma\mu, \rho - B/M\}},$$
which finishes the proof.
Using this and Lemma C.1, we derive the following result about the convergence to the exact solution.
Corollary D.2 (Corollary 2.3). Let the assumptions of Theorem 2.2 hold. Consider two possible cases.
1. Let $D_1 = D_2 = 0$. Then, for any $K \ge 0$, $M = 2B/\rho$, and
$$\gamma = \min\left\{\frac{1}{\mu}, \frac{1}{2(A + 2BC/\rho)}\right\} \qquad (36)$$
we have
$$\mathbb{E}[V_K] \le \mathbb{E}[V_0]\exp\left(-\min\left\{\frac{\mu}{2(A + 2BC/\rho)}, \frac{\rho}{2}\right\}K\right). \qquad (37)$$
2. Let $D_1 + MD_2 > 0$. Then, for any $K \ge 0$ and $M = 2B/\rho$ one can choose $\{\gamma_k\}_{k\ge 0}$ as follows:
$$\text{if } K \le \frac{h}{\mu}, \quad \gamma_k = \frac{1}{h},$$
$$\text{if } K > \frac{h}{\mu} \text{ and } k < k_0, \quad \gamma_k = \frac{1}{h}, \qquad (38)$$
$$\text{if } K > \frac{h}{\mu} \text{ and } k \ge k_0, \quad \gamma_k = \frac{2}{\mu(\kappa + k - k_0)},$$
where $h = \max\{2(A + 2BC/\rho), 2\mu/\rho\}$, $\kappa = 2h/\mu$ and $k_0 = \lceil K/2\rceil$. For this choice of $\gamma_k$ the following inequality holds:
$$\mathbb{E}[V_K] \le 32\max\left\{\frac{2(A + 2BC/\rho)}{\mu}, \frac{2}{\rho}\right\}\mathbb{E}[V_0]\exp\left(-\min\left\{\frac{\mu}{2(A + 2BC/\rho)}, \frac{\rho}{4}\right\}K\right) + \frac{36(D_1 + 2BD_2/\rho)}{\mu^2 K}. \qquad (39)$$
Proof. The first part of the corollary follows from Theorem 2.2 due to
$$\left(1 - \min\left\{\gamma\mu, \rho - \frac{B}{M}\right\}\right)^K = \left(1 - \min\left\{\gamma\mu, \frac{\rho}{2}\right\}\right)^K \le \exp\left(-\min\left\{\gamma\mu, \frac{\rho}{2}\right\}K\right).$$
Plugging (36) into the above inequality, we derive (37). Next, we consider the case when $D_1 + MD_2 > 0$. First, we notice that (35) holds for non-constant stepsizes $\gamma_k$ such that
$$0 < \gamma_k \le \min\left\{\frac{1}{\mu}, \frac{1}{2(A + CM)}\right\}.$$
Theorem D.3 (Theorem 2.5). Let $F$ be monotone and $\ell$-star-cocoercive, and let Assumptions 2.1 and 2.4 hold. Assume that
$$0 < \gamma \le \frac{1}{2(A + BC/\rho)}. \qquad (40)$$
Then for the function $\mathrm{Gap}_C(z)$ from (10) and for all $K \ge 0$ we have
$$\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K}x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + (4A + \ell + 8BC/\rho)\cdot\frac{\|x^0 - x^{*,0}\|^2}{K}$$
$$+ \left(4 + (4A + \ell + 8BC/\rho)\gamma\right)\frac{\gamma B\sigma_0^2}{\rho K} + \gamma\left(2 + \gamma(4A + \ell + 8BC/\rho)\right)(D_1 + 2BD_2/\rho)$$
$$+ 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2. \qquad (41)$$
6 When $R(x) \equiv 0$, our analysis can be modified to get guarantees on the squared norm of the operator.
Proof. First, we apply the classical result about proximal operators (Lemma C.2) with $x^+ = x^{k+1}$, $x = x^k - \gamma g^k$, and $z = u$ for an arbitrary point $u \in \mathbb{R}^d$:
$$\langle x^k - \gamma g^k - x^{k+1}, u - x^{k+1}\rangle \le \gamma\left(R(u) - R(x^{k+1})\right).$$
Next, using the squared norm decomposition $\|a+b\|^2 = \|a\|^2 + \|b\|^2 + 2\langle a, b\rangle$, we obtain
$$2\gamma\left(\langle F(u), x^k - u\rangle + R(x^{k+1}) - R(u)\right) \le \|x^k - u\|^2 - \|x^{k+1} - u\|^2 + 2\gamma^2\|g^{*,k}\|^2 + 2\gamma\langle F(x^k) - g^k, x^k - u\rangle + 2\gamma^2\|g^k - g^{*,k}\|^2.$$
Summing up the above inequality for $k = 0, 1, \ldots, K-1$, we get
$$2\gamma\sum_{k=0}^{K-1}\left(\langle F(u), x^k - u\rangle + R(x^{k+1}) - R(u)\right) \le \sum_{k=0}^{K-1}\left(\|x^k - u\|^2 - \|x^{k+1} - u\|^2\right) + 2\gamma^2\sum_{k=0}^{K-1}\|g^{*,k}\|^2$$
$$+ 2\gamma\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle + 2\gamma^2\sum_{k=0}^{K-1}\|g^k - g^{*,k}\|^2$$
$$= \|x^0 - u\|^2 - \|x^K - u\|^2 + 2\gamma^2\sum_{k=0}^{K-1}\|g^{*,k}\|^2 + 2\gamma\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle + 2\gamma^2\sum_{k=0}^{K-1}\|g^k - g^{*,k}\|^2.$$
Dividing both sides by $2\gamma K$, we obtain
$$\frac{1}{K}\sum_{k=0}^{K-1}\left(\langle F(u), x^k - u\rangle + R(x^{k+1}) - R(u)\right) \le \frac{\|x^0 - u\|^2 - \|x^K - u\|^2}{2\gamma K} + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^{*,k}\|^2$$
$$+ \frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^k - g^{*,k}\|^2,$$
which implies
$$\frac{1}{K}\sum_{k=0}^{K-1}\left(\langle F(u), x^{k+1} - u\rangle + R(x^{k+1}) - R(u)\right) \le \frac{\|x^0 - u\|^2 - \|x^K - u\|^2}{2\gamma K} + \frac{\langle F(u), x^K - x^0\rangle}{K} + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^{*,k}\|^2$$
$$+ \frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^k - g^{*,k}\|^2.$$
Applying Jensen's inequality to the convex function $R$, we get $R\left(\frac{1}{K}\sum_{k=0}^{K-1}x^{k+1}\right) \le \frac{1}{K}\sum_{k=0}^{K-1}R(x^{k+1})$.
Plugging this into the previous inequality, we derive for $u^*$ being the projection of $u$ onto $X^*$:
$$\left\langle F(u), \frac{1}{K}\sum_{k=0}^{K-1}x^{k+1} - u\right\rangle + R\left(\frac{1}{K}\sum_{k=0}^{K-1}x^{k+1}\right) - R(u)$$
$$\le \frac{\|x^0 - u\|^2 - \|x^K - u\|^2}{2\gamma K} + \frac{\langle F(u), x^K - x^0\rangle}{K} + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^{*,k}\|^2 + \frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^k - g^{*,k}\|^2$$
$$\overset{(29)}{\le} \frac{\|x^0 - u\|^2 - \|x^K - u\|^2}{2\gamma K} + \frac{\|x^K - x^0\|^2}{4\gamma K} + \frac{4\gamma}{K}\|F(u) - F(u^*) + F(u^*)\|^2 + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^{*,k}\|^2 + \frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^k - g^{*,k}\|^2$$
$$\overset{(28)}{\le} \frac{\|x^0 - u\|^2 - \|x^K - u\|^2}{2\gamma K} + \frac{\|x^0 - u\|^2 + \|x^K - u\|^2}{2\gamma K} + \frac{8\gamma}{K}\|F(u) - F(u^*)\|^2 + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^{*,k}\|^2 + 8\gamma\|F(u^*)\|^2 + \frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^k - g^{*,k}\|^2$$
$$\overset{(4)}{\le} \frac{\|x^0 - u\|^2}{\gamma K} + \frac{8\gamma\ell^2\|u - u^*\|^2}{K} + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2 + \frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle + \frac{\gamma}{K}\sum_{k=0}^{K-1}\|g^k - g^{*,k}\|^2.$$
Next, we take the maximum over $u \in C$ on both sides, which gives $\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K}x^k\right)$ in the left-hand side by definition (10), and take the expectation of the result:
$$\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K}x^k\right)\right] \le \frac{\mathbb{E}\left[\max_{u\in C}\|x^0 - u\|^2\right]}{\gamma K} + \frac{8\gamma\ell^2\,\mathbb{E}\left[\max_{u\in C}\|u - u^*\|^2\right]}{K} + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2$$
$$+ \frac{1}{K}\mathbb{E}\left[\max_{u\in C}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle\right] + \frac{\gamma}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\|g^k - g^{*,k}\|^2\right]$$
$$\le \frac{\mathbb{E}\left[\max_{u\in C}\|x^0 - u\|^2\right]}{\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2 + \frac{1}{K}\mathbb{E}\left[\max_{u\in C}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle\right] + \frac{\gamma}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\|g^k - g^{*,k}\|^2\right]. \qquad (42)$$
In the last step, we also use that $X^* \subset C$ and $\Omega_C := \max_{x,y\in C}\|x - y\|$ (Assumption 2.4).
It remains to upper bound the terms in the last two lines of (42). We start with the first one.
Since
$$\mathbb{E}\left[\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k\rangle\right] = \mathbb{E}\left[\sum_{k=0}^{K-1}\langle \mathbb{E}[F(x^k) - g^k \mid x^k], x^k\rangle\right] = 0,$$
$$\mathbb{E}\left[\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^0\rangle\right] = \sum_{k=0}^{K-1}\left\langle \mathbb{E}[F(x^k) - g^k], x^0\right\rangle = 0,$$
we have
$$\mathbb{E}\left[\max_{u\in C}\frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle\right] = \mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k\rangle\right] + \mathbb{E}\left[\max_{u\in C}\frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, -u\rangle\right]$$
$$= \mathbb{E}\left[\max_{u\in C}\frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, -u\rangle\right]$$
$$= \mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^0\rangle\right] + \mathbb{E}\left[\max_{u\in C}\frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, -u\rangle\right]$$
$$= \mathbb{E}\left[\max_{u\in C}\left\langle \frac{1}{K}\sum_{k=0}^{K-1}(F(x^k) - g^k), x^0 - u\right\rangle\right]$$
$$\overset{(29)}{\le} \mathbb{E}\left[\max_{u\in C}\left\{\frac{\gamma K}{2}\left\|\frac{1}{K}\sum_{k=0}^{K-1}(F(x^k) - g^k)\right\|^2 + \frac{1}{2\gamma K}\|x^0 - u\|^2\right\}\right]$$
$$= \frac{\gamma}{2K}\mathbb{E}\left[\left\|\sum_{k=0}^{K-1}(F(x^k) - g^k)\right\|^2\right] + \frac{1}{2\gamma K}\max_{u\in C}\|x^0 - u\|^2.$$
We notice that $\mathbb{E}[F(x^k) - g^k \mid F(x^0) - g^0, \ldots, F(x^{k-1}) - g^{k-1}] = 0$ for all $k \ge 1$, i.e., the conditions of Lemma C.3 are satisfied. Therefore, applying Lemma C.3, we get
$$\mathbb{E}\left[\max_{u\in C}\frac{1}{K}\sum_{k=0}^{K-1}\langle F(x^k) - g^k, x^k - u\rangle\right] \le \frac{\gamma}{2K}\sum_{k=0}^{K-1}\mathbb{E}\left[\|F(x^k) - g^k\|^2\right] + \frac{1}{2\gamma K}\max_{u\in C}\|x^0 - u\|^2. \qquad (43)$$
Using the $\ell$-star-cocoercivity of $F$ together with the first part of Assumption 2.1, we continue our derivation as follows:
$$\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K}x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + 2\gamma D_1 + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2$$
$$+ \frac{\gamma(4A + \ell)}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\langle F(x^k) - g^{*,k}, x^k - x^{*,k}\rangle\right] + \frac{2\gamma B}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\sigma_k^2\right]$$
$$= \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + 2\gamma D_1 + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2 + \frac{\gamma(4A + \ell)}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\langle F(x^k) - g^{*,k}, x^k - x^{*,k}\rangle\right]$$
$$+ \frac{2\gamma B}{K}\left(1 + \frac{1}{\rho}\right)\sum_{k=0}^{K-1}\mathbb{E}\left[\sigma_k^2\right] - \frac{2\gamma B}{\rho K}\sum_{k=0}^{K-1}\mathbb{E}\left[\sigma_k^2\right].$$
Since $(1-\rho)\left(1 + \frac{1}{\rho}\right) = \frac{1}{\rho} - \rho \le \frac{1}{\rho}$, the last row is non-positive and we have
$$\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K}x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + \frac{2\gamma B(1 + 1/\rho)}{K}\sigma_0^2$$
$$+ 2\gamma\left(D_1 + B(1 + 1/\rho)D_2\right) + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2 \qquad (44)$$
$$+ \frac{\gamma\left(4A + \ell + 4BC(1 + 1/\rho)\right)}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\langle F(x^k) - g^{*,k}, x^k - x^{*,k}\rangle\right].$$
Note that inequality (34) from the proof of Theorem 2.2 is derived using Assumption 2.1 only. With $M = B/\rho$ it gives
$$\mathbb{E}\left[\|x^{k+1} - x^{*,k+1}\|^2\right] + \frac{\gamma^2 B}{\rho}\mathbb{E}\left[\sigma_{k+1}^2\right] \le \mathbb{E}\left[\|x^k - x^{*,k}\|^2\right] + \frac{\gamma^2 B}{\rho}\mathbb{E}\left[\sigma_k^2\right] + \gamma^2\left(D_1 + \frac{BD_2}{\rho}\right)$$
$$- 2\gamma\left(1 - \gamma\left(A + \frac{BC}{\rho}\right)\right)\mathbb{E}\left[\langle x^k - x^{*,k}, F(x^k) - g^{*,k}\rangle\right].$$
Corollary D.4. Let the assumptions of Theorem 2.5 hold. Then, for all $K$ one can choose $\gamma$ as
$$\gamma = \min\left\{\frac{1}{4A + \ell + 8BC/\rho},\; \frac{\Omega_{0,C}\sqrt{\rho}}{\widehat{\sigma}_0\sqrt{B}},\; \frac{\Omega_{0,C}}{\sqrt{K(D_1 + 2BD_2/\rho)}},\; \frac{\Omega_{0,C}}{G_*\sqrt{K}}\right\}, \qquad (45)$$
where $G_*$ denotes $\max_{x^*\in X^*}\|F(x^*)\|$. This choice of $\gamma$ implies that $\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K}x^k\right)\right]$ equals
$$O\left(\frac{(A + \ell + BC/\rho)(\Omega_{0,C}^2 + \Omega_0^2) + \ell\Omega_C^2}{K} + \frac{\Omega_{0,C}\widehat{\sigma}_0\sqrt{B}}{\sqrt{\rho}K} + \frac{\Omega_{0,C}\left(\sqrt{D_1 + BD_2/\rho} + G_*\right)}{\sqrt{K}}\right).$$
Proof. First of all, the choice of $\gamma$ from (45) implies (40) since
$$\frac{1}{4A + \ell + 8BC/\rho} \le \frac{1}{2(A + BC/\rho)}.$$
E SGDA with Arbitrary Sampling: Missing Proofs and Details
Proof. To prove the result, it is sufficient to derive an upper bound for $\mathbb{E}_k\left[\|g^k - F(x^{*,k})\|^2\right]$:
where $\sigma_*^2 := \max_{x^*\in X^*}\mathbb{E}_{\mathcal{D}}\left[\|F_\xi(x^*) - F(x^*)\|^2\right]$. The above inequality implies that Assumption 2.1 holds with
$$C = 0, \qquad \rho = 1, \qquad D_2 = 0.$$
Theorem E.2. Let $F$ be $\mu$-quasi-strongly monotone, let Assumption 3.1 hold, and let $0 < \gamma \le \frac{1}{2\ell_{\mathcal{D}}}$. Then, for all $k \ge 0$ the iterates produced by SGDA-AS satisfy
$$\mathbb{E}\left[\|x^k - x^{*,k}\|^2\right] \le (1 - \gamma\mu)^k\|x^0 - x^{*,0}\|^2 + \frac{2\gamma\sigma_*^2}{\mu}. \qquad (46)$$
Corollary E.3 (Corollary 3.3). Let the assumptions of Theorem E.2 hold. Then, for any $K \ge 0$ one can choose $\{\gamma_k\}_{k\ge 0}$ as follows:
$$\text{if } K \le \frac{2\ell_{\mathcal{D}}}{\mu}, \quad \gamma_k = \frac{1}{2\ell_{\mathcal{D}}},$$
$$\text{if } K > \frac{2\ell_{\mathcal{D}}}{\mu} \text{ and } k < k_0, \quad \gamma_k = \frac{1}{2\ell_{\mathcal{D}}}, \qquad (47)$$
$$\text{if } K > \frac{2\ell_{\mathcal{D}}}{\mu} \text{ and } k \ge k_0, \quad \gamma_k = \frac{2}{4\ell_{\mathcal{D}} + \mu(k - k_0)},$$
where $k_0 = \lceil K/2\rceil$. For this choice of $\gamma_k$ the following inequality holds for SGDA-AS:
$$\mathbb{E}\left[\|x^K - x^{*,K}\|^2\right] \le \frac{64\ell_{\mathcal{D}}}{\mu}\|x^0 - x^{*,0}\|^2\exp\left(-\frac{\mu}{2\ell_{\mathcal{D}}}K\right) + \frac{72\sigma_*^2}{\mu^2 K}.$$
Theorem E.4. Let $F$ be monotone and $\ell$-star-cocoercive, and let Assumptions 2.1, 2.4, and 3.1 hold. Assume that $\gamma \le \frac{1}{2\ell_{\mathcal{D}}}$. Then for $\mathrm{Gap}_C(z)$ from (10) and for all $K \ge 0$ the iterates produced by SGDA-AS satisfy
$$\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K}x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + \frac{(4\ell_{\mathcal{D}} + \ell)\|x^0 - x^{*,0}\|^2}{K}$$
$$+ 2\gamma\left(2 + \gamma(4\ell_{\mathcal{D}} + \ell)\right)\sigma_*^2 + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2.$$
Next, we apply Corollary D.4 and get the following rate of convergence to the exact solution.
Corollary E.5. Let the assumptions of Theorem E.4 hold. Then for all $K > 0$ and
$$\gamma = \min\left\{\frac{1}{4\ell_{\mathcal{D}} + \ell},\; \frac{\Omega_{0,C}}{\sqrt{2K}\,\sigma_*},\; \frac{\Omega_{0,C}}{G_*\sqrt{K}}\right\} \qquad (48)$$
the iterates produced by SGDA-AS satisfy
$$\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K}x^k\right)\right] = O\left(\frac{(\ell_{\mathcal{D}} + \ell)(\Omega_{0,C}^2 + \Omega_0^2) + \ell\Omega_C^2}{K} + \frac{\Omega_{0,C}(\sigma_* + G_*)}{\sqrt{K}}\right).$$
As mentioned above, this result is new for SGDA-AS: the only known work on SGDA-AS [Loizou et al., 2021] focuses on the $\mu$-quasi-strongly monotone case with $\mu > 0$. Moreover, neglecting the dependence on problem/noise parameters, the derived convergence rate $O(1/K + 1/\sqrt{K})$ is standard for the analysis of stochastic methods for solving monotone VIPs [Juditsky et al., 2011].
In the special cases below, we assume that each $F_i$ is $\ell_i$-star-cocoercive, i.e.,
$$\|F_i(x) - F_i(x^*)\|^2 \le \ell_i\langle F_i(x) - F_i(x^*), x - x^*\rangle, \qquad (49)$$
where $x^*$ is the projection of $x$ on $X^*$. Note that (49) holds whenever the $F_i$ are cocoercive.
Uniform Sampling. We start with the classical uniform sampling: let $\mathbb{P}\{\xi = ne_i\} = \frac{1}{n}$ for all $i \in [n]$, where $e_i \in \mathbb{R}^n$ is the $i$-th coordinate vector from the standard basis in $\mathbb{R}^n$. Then, $\mathbb{E}[\xi_i] = 1$ for all $i \in [n]$ and Assumption 3.1 holds with $\ell_{\mathcal{D}} = \max_{i\in[n]}\ell_i$:
$$\mathbb{E}_{\mathcal{D}}\left[\|F_\xi(x) - F_\xi(x^*)\|^2\right] = \frac{1}{n}\sum_{i=1}^{n}\|F_i(x) - F_i(x^*)\|^2 \overset{(49)}{\le} \frac{1}{n}\sum_{i=1}^{n}\ell_i\langle F_i(x) - F_i(x^*), x - x^*\rangle \le \max_{i\in[n]}\ell_i\,\langle F(x) - F(x^*), x - x^*\rangle.$$
In this case, Corollaries E.3 and E.5 imply the following rates for SGDA in the $\mu$-quasi strongly monotone and monotone cases, respectively:
\[
\mathbb{E}[\|x^K - x^{*,K}\|^2] \le \frac{64\max_{i\in[n]}\ell_i}{\mu}\|x^0 - x^{*,0}\|^2 \exp\left(-\frac{\mu}{2\max_{i\in[n]}\ell_i}K\right) + \frac{72\sigma_{*,US}^2}{\mu^2 K},
\]
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] = O\left(\frac{(\max_{i\in[n]}\ell_i + \ell)(\Omega_{0,C}^2 + \Omega_0^2) + \ell\Omega_C^2}{K} + \frac{\Omega_{0,C}(\sigma_{*,US} + G_*)}{\sqrt{K}}\right),
\]
where $\sigma_{*,US}^2 := \max_{x^*\in X^*}\frac{1}{n}\sum_{i=1}^n\|F_i(x^*) - F(x^*)\|^2$.
Importance Sampling. Next, we consider importance sampling, for which Assumption 3.1 holds with $\ell_{\mathcal D} = \bar\ell := \frac{1}{n}\sum_{i=1}^n\ell_i$:
\[
\mathbb{E}_{\mathcal D}\|F_\xi(x) - F_\xi(x^*)\|^2 \overset{(49)}{\le} \frac{\bar\ell}{n}\sum_{i=1}^n\langle F_i(x) - F_i(x^*), x - x^*\rangle \le \bar\ell\,\langle F(x) - F(x^*), x - x^*\rangle.
\]
In this case, Corollaries E.3 and E.5 imply the following rates for SGDA in the $\mu$-quasi strongly monotone and monotone cases, respectively:
\[
\mathbb{E}[\|x^K - x^{*,K}\|^2] \le \frac{64\bar\ell}{\mu}\|x^0 - x^{*,0}\|^2 \exp\left(-\frac{\mu}{2\bar\ell}K\right) + \frac{72\sigma_{*,IS}^2}{\mu^2 K},
\]
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] = O\left(\frac{(\bar\ell + \ell)(\Omega_{0,C}^2 + \Omega_0^2) + \ell\Omega_C^2}{K} + \frac{\Omega_{0,C}(\sigma_{*,IS} + G_*)}{\sqrt{K}}\right),
\]
where $\sigma_{*,IS}^2 := \max_{x^*\in X^*}\frac{1}{n}\sum_{i=1}^n\frac{\ell_i}{\bar\ell}\left\|\frac{\bar\ell}{\ell_i}F_i(x^*) - F(x^*)\right\|^2$. We emphasize that $\bar\ell \le \max_{i\in[n]}\ell_i$ and, in fact, $\bar\ell$ might be much smaller than $\max_{i\in[n]}\ell_i$. Therefore, compared to SGDA with uniform sampling, SGDA with importance sampling has a better exponentially decaying term in the quasi-strongly monotone case and converges faster to the neighborhood if executed with a constant stepsize. Moreover, $\sigma_{*,IS}^2 \le \sigma_{*,US}^2$ when $\max_{x^*\in X^*}\|F_i(x^*)\| \sim \ell_i$. In this case, SGDA with importance sampling has a better $O(1/K)$ term than SGDA with uniform sampling as well.
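To make the comparison above concrete, the following sketch computes importance-sampling probabilities proportional to the constants $\ell_i$ and verifies, by exact enumeration rather than simulation, that the reweighted estimator stays unbiased. The function names and the toy setup are ours, for illustration only; they are not part of the paper's code.

```python
import numpy as np

def importance_weights(ells):
    """Sampling probabilities p_i proportional to the constants ell_i,
    i.e., p_i = ell_i / (n * ell_bar)."""
    ells = np.asarray(ells, dtype=float)
    return ells / ells.sum()

def is_estimator_mean(F_vals, ells):
    """Exact expectation of the importance-sampled estimator
    g = (ell_bar / ell_i) * F_i(x), where i is drawn with probability p_i.
    F_vals is an (n, d) array of the individual operator values F_i(x)."""
    n = len(ells)
    ell_bar = np.mean(ells)
    p = importance_weights(ells)
    # E[g] = sum_i p_i * (ell_bar / ell_i) * F_i(x) = (1/n) * sum_i F_i(x)
    return sum(p[i] * (ell_bar / ells[i]) * F_vals[i] for i in range(n))
```

The reweighting by $\bar\ell/\ell_i$ cancels the non-uniform sampling probability exactly, which is why the estimator's mean equals the full operator $F(x) = \frac{1}{n}\sum_i F_i(x)$.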
Minibatch Sampling With Replacement. Let $\xi = \frac{1}{b}\sum_{i=1}^{b}\xi^i$, where $\xi^i$ are i.i.d. samples from some distribution $\mathcal D$ satisfying (13) and Assumption 3.1. Then, the distribution of $\xi$ satisfies (13) and Assumption 3.1 as well with the same constant $\ell_{\mathcal D}$. Therefore, minibatched versions of uniform sampling and importance sampling fit the framework as well with $\ell_{\mathcal D} = \max_{i\in[n]}\ell_i$, $\sigma_*^2 = \frac{\sigma_{*,US}^2}{b}$ and $\ell_{\mathcal D} = \bar\ell$, $\sigma_*^2 = \frac{\sigma_{*,IS}^2}{b}$, respectively.
Minibatch Sampling Without Replacement. For a given batchsize $b \in [n]$, we consider the following sampling strategy: for each subset $S \subseteq [n]$ such that $|S| = b$ we have $\mathbb{P}\left\{\xi = \frac{n}{b}\sum_{i\in S} e_i\right\} = \frac{b!(n-b)!}{n!}$, i.e., $S$ is chosen uniformly at random from all $b$-element subsets of $[n]$. In the special case when $R(x) \equiv 0$, Loizou et al. [2021] show that this sampling strategy satisfies (13) and Assumption 3.1 with the parameters given in (50). Clearly, both parameters are smaller than the corresponding parameters for the minibatched version of uniform sampling with replacement, which indicates the theoretical benefits of sampling without replacement.
Plugging the parameters from (50) into Corollaries E.3 and E.5, we get the rate of convergence for this sampling strategy. Moreover, in the quasi-strongly monotone case, to guarantee $\mathbb{E}[\|x^K - x^{*,K}\|^2] \le \varepsilon$ for some $\varepsilon > 0$, the method requires
\[
K_b = O\left(\max\left\{\left(\frac{(n-b)\max_{i\in[n]}\ell_i}{n\mu} + \frac{b\bar\ell}{\mu}\right)\log\frac{\ell_{\mathcal D}\|x^0 - x^{*,0}\|^2}{\mu\varepsilon},\ \frac{(n-b)\sigma_{*,US}^2}{n\mu^2\varepsilon}\right\}\right)
= \widetilde O\left(\max\left\{\frac{b\left(\bar\ell - \frac{1}{n}\max_{i\in[n]}\ell_i\right) + \max_{i\in[n]}\ell_i}{\mu},\ \frac{(n-b)\sigma_{*,US}^2}{n\mu^2\varepsilon}\right\}\right) \text{ oracle calls,} \tag{51}
\]
where $\widetilde O(\cdot)$ hides numerical and logarithmic factors. One can notice that the first term in the maximum increases linearly in $b$ (since $\bar\ell$ cannot be smaller than $\frac{1}{n}\max_{i\in[n]}\ell_i$), while the second term decreases linearly in $b$. The first term in the maximum is lower bounded by $\frac{(n-b)\max_{i\in[n]}\ell_i}{n\mu}$. Therefore, if $\max_{i\in[n]}\ell_i \ge \frac{\sigma_{*,US}^2}{\mu\varepsilon}$, then the first term in the maximum is always larger than the second one, meaning that the optimal batchsize, i.e., the batchsize that minimizes the oracle complexity (51) neglecting the logarithmic terms, equals $b^* = 1$. Next, if $\max_{i\in[n]}\ell_i < \frac{\sigma_{*,US}^2}{\mu\varepsilon}$, then there exists a positive value of $b$ such that the first term in the maximum equals the second term. This value equals
\[
\frac{n\left(\sigma_{*,US}^2 - \mu\varepsilon\max_{i\in[n]}\ell_i\right)}{\sigma_{*,US}^2 + \mu\varepsilon\left(n\bar\ell - \max_{i\in[n]}\ell_i\right)}.
\]
One can easily verify that it is always smaller than $n$, but it can be non-integer and it can be smaller than $1$ as well. Therefore, the optimal batchsize is
\[
b^* = \begin{cases}
1, & \text{if } \max_{i\in[n]}\ell_i \ge \frac{\sigma_{*,US}^2}{\mu\varepsilon},\\[2pt]
\max\left\{1,\ \frac{n\left(\sigma_{*,US}^2 - \mu\varepsilon\max_{i\in[n]}\ell_i\right)}{\sigma_{*,US}^2 + \mu\varepsilon\left(n\bar\ell - \max_{i\in[n]}\ell_i\right)}\right\}, & \text{otherwise.}
\end{cases}
\]
We notice that Loizou et al. [2021] derive the following formula for the optimal batchsize (ignoring numerical constants):
\[
\widetilde b^* = \begin{cases}
1, & \text{if } \max_{i\in[n]}\ell_i - \bar\ell \ge \frac{\sigma_{*,US}^2}{\mu\varepsilon},\\[2pt]
\max\left\{1,\ \frac{n\left(\sigma_{*,US}^2 - \mu\varepsilon\left(\max_{i\in[n]}\ell_i - \bar\ell\right)\right)}{\sigma_{*,US}^2 + \mu\varepsilon\left(n\bar\ell - \max_{i\in[n]}\ell_i\right)}\right\}, & \text{otherwise.}
\end{cases}
\]
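The case distinction above translates directly into a small routine. The sketch below implements our formula for $b^*$ under the stated assumptions (logarithmic factors ignored, fractional values returned as-is); the function name and inputs are ours, not from the paper.

```python
def optimal_batchsize(n, ells, ell_bar, sigma2_us, mu, eps):
    """Optimal batchsize b* minimizing the oracle complexity (51),
    following the case distinction derived above.

    n         -- number of operators F_i
    ells      -- list of per-operator constants ell_i
    ell_bar   -- their average
    sigma2_us -- the variance parameter sigma^2_{*,US}
    mu, eps   -- quasi-strong monotonicity constant and target accuracy
    """
    ell_max = max(ells)
    # First case: the deterministic term dominates for every b, so b* = 1.
    if ell_max >= sigma2_us / (mu * eps):
        return 1
    # Otherwise balance the two terms in the maximum of (51).
    b = n * (sigma2_us - mu * eps * ell_max) / (
        sigma2_us + mu * eps * (n * ell_bar - ell_max))
    return max(1, b)  # may be fractional; round as needed in practice
```

For instance, with a large noise level $\sigma_{*,US}^2$ the balancing value is returned, while for accurate targets dominated by the condition number the routine falls back to $b^* = 1$.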
F SGDA with Variance Reduction: Missing Proofs and Details
In this section, we provide missing proofs and details for Section 4.
F.1 L-SVRGDA
Lemma F.1. Let Assumption 4.1 hold. Then for all $k \ge 0$ L-SVRGDA satisfies
Lemma F.2. Let Assumptions 4.1 and 4.2 hold. Then for all $k \ge 0$ L-SVRGDA satisfies
\[
\mathbb{E}_k\left[\sigma_{k+1}^2\right] \le p\hat\ell\langle F(x^k) - F(x^{*,k}), x^k - x^{*,k}\rangle + (1-p)\sigma_k^2, \tag{53}
\]
where $\sigma_k^2 := \frac{1}{n}\sum_{i=1}^n\|F_i(w^k) - F_i(x^{*,k})\|^2$.
Proof. Using the definitions of $\sigma_{k+1}^2$ and $w^{k+1}$ (see (15)), we derive
\[
\mathbb{E}_k\left[\sigma_{k+1}^2\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_k\|F_i(w^{k+1}) - F_i(x^{*,k+1})\|^2 \overset{\text{As. 4.2}}{=} \frac{1}{n}\sum_{i=1}^n \mathbb{E}_k\|F_i(w^{k+1}) - F_i(x^{*,k})\|^2
\]
\[
= \frac{p}{n}\sum_{i=1}^n\|F_i(x^k) - F_i(x^{*,k})\|^2 + \frac{1-p}{n}\sum_{i=1}^n\|F_i(w^k) - F_i(x^{*,k})\|^2
\overset{(16)}{\le} p\hat\ell\langle F(x^k) - F(x^{*,k}), x^k - x^{*,k}\rangle + (1-p)\sigma_k^2.
\]
The above two lemmas imply that Assumption 2.1 is satisfied with certain parameters.
Proposition F.3 (Proposition 4.3). Let Assumptions 4.1 and 4.2 hold. Then, L-SVRGDA satisfies Assumption 2.1 with
\[
A = \hat\ell, \quad B = 2, \quad \sigma_k^2 = \frac{1}{n}\sum_{i=1}^n\|F_i(w^k) - F_i(x^*)\|^2, \quad C = \frac{p\hat\ell}{2}, \quad \rho = p, \quad D_1 = D_2 = 0.
\]
Theorem F.4. Let $F$ be $\mu$-quasi strongly monotone, Assumptions 4.1, 4.2 hold, and $0 < \gamma \le \frac{1}{6\hat\ell}$. Then for all $k \ge 0$ the iterates produced by L-SVRGDA satisfy
\[
\mathbb{E}\|x^k - x^*\|^2 \le \left(1 - \min\left\{\gamma\mu, \frac{p}{2}\right\}\right)^k V_0, \tag{54}
\]
where $V_0 = \|x^0 - x^*\|^2 + \frac{4\gamma^2\sigma_0^2}{p}$.
Corollary F.5. Let the assumptions of Theorem F.4 hold. Then, for $p = \frac{1}{n}$, $\gamma = \frac{1}{6\hat\ell}$ and any $K \ge 0$ we have
\[
\mathbb{E}[\|x^K - x^*\|^2] \le V_0\exp\left(-\min\left\{\frac{\mu}{6\hat\ell}, \frac{1}{2n}\right\}K\right).
\]
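A single L-SVRGDA iteration is easy to sketch from the estimator described above: $g^k = F_{j}(x^k) - F_{j}(w^k) + F(w^k)$ with $j$ uniform on $[n]$, and the snapshot $w^{k+1} = x^k$ with probability $p$. The code below is a minimal illustration (no prox term, toy linear operators); names are ours, and the exactness check enumerates $j$ instead of sampling.

```python
import numpy as np

def lsvrgda_estimator(Fis, x, w, j):
    """L-SVRGDA gradient estimator: g = F_j(x) - F_j(w) + F(w),
    where F = (1/n) * sum_i F_i and j is sampled uniformly from [n]."""
    n = len(Fis)
    F_w = sum(Fi(w) for Fi in Fis) / n
    return Fis[j](x) - Fis[j](w) + F_w

def lsvrgda_step(Fis, x, w, gamma, p, rng):
    """One iteration of (a sketch of) L-SVRGDA with R = 0."""
    n = len(Fis)
    j = rng.integers(n)
    g = lsvrgda_estimator(Fis, x, w, j)
    x_new = x - gamma * g
    w_new = x if rng.random() < p else w   # full snapshot with probability p
    return x_new, w_new
```

Averaging the estimator over all $j$ gives $F(x^k) - F(w^k) + F(w^k) = F(x^k)$, which is exactly the unbiasedness used in Lemma F.1.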
Theorem F.6. Let $F$ be monotone, $\ell$-star-cocoercive and Assumptions 2.1, 2.4, 4.1, 4.2 hold. Assume that $\gamma \le \frac{1}{6\hat\ell}$. Then for $\mathrm{Gap}_C(z)$ from (10) and for all $K \ge 0$ the iterates of L-SVRGDA satisfy
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + \frac{\big(12\hat\ell + \ell\big)\|x^0 - x^{*,0}\|^2}{K} + \big(4 + (12\hat\ell + \ell)\gamma\big)\frac{2\gamma\sigma_0^2}{pK} + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2.
\]
Applying Corollary D.4, we get the rate of convergence to the exact solution.
Corollary F.7. Let the assumptions of Theorem F.6 hold and $p = \frac{1}{n}$. Then $\forall K > 0$ one can choose $\gamma$ as
\[
\gamma = \min\left\{\frac{1}{12\hat\ell + \ell},\ \frac{1}{\sqrt{2n\hat\ell\ell}},\ \frac{\Omega_{0,C}}{G_*\sqrt{K}}\right\}. \tag{55}
\]
This choice of $\gamma$ implies
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] = O\left(\frac{(\hat\ell + \ell)(\Omega_{0,C}^2 + \Omega_0^2) + \sqrt{n\hat\ell\ell}\,\Omega_{0,C}^2 + \ell\Omega_C^2}{K} + \frac{\Omega_{0,C}G_*}{\sqrt{K}}\right).
\]
Proof.
\[
\sigma_0^2 = \frac{1}{n}\sum_{i=1}^n\|F_i(x^0) - F_i(x^*)\|^2 \overset{(16)}{\le} \hat\ell\langle F(x^0) - F(x^*), x^0 - x^*\rangle \le \hat\ell\|F(x^0) - F(x^*)\|\cdot\|x^0 - x^*\|
\le \hat\ell\ell\|x^0 - x^*\|^2 \le \hat\ell\ell\max_{u\in C}\|x^0 - u\|^2 \le \hat\ell\ell\,\Omega_{0,C}^2.
\]
Next, applying Corollary D.4 with $\widehat\sigma_0 := \sqrt{\hat\ell\ell}\,\Omega_{0,C}$, we get the result.
F.2 SAGA-SGDA
In this section, we show that SAGA-SGDA [Palaniappan and Bach, 2016] fits our theoretical framework
and derive new results for this method under averaged star-cocoercivity.
5: Set $w_{j_k}^{k+1} = x^k$ and $w_i^{k+1} = w_i^k$ for $i \ne j_k$
6: $x^{k+1} = \mathrm{prox}_{\gamma R}(x^k - \gamma g^k)$
7: end for
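The SAGA-SGDA estimator maintains a table of reference points $w_i^k$ and combines $g^k = F_{j_k}(x^k) - F_{j_k}(w_{j_k}^k) + \frac{1}{n}\sum_i F_i(w_i^k)$. The sketch below (with $R \equiv 0$, toy operators, our own function names) illustrates the estimator and the table refresh, and the test checks unbiasedness by enumerating $j$ exactly.

```python
import numpy as np

def saga_estimator(Fis, x, W, j):
    """SAGA-SGDA estimator g = F_j(x) - F_j(W[j]) + (1/n) sum_i F_i(W[i]).
    W is the (n, d) table of reference points w_i^k kept per operator."""
    n = len(Fis)
    S = sum(Fis[i](W[i]) for i in range(n)) / n   # running table average
    return Fis[j](x) - Fis[j](W[j]) + S

def saga_sgda_step(Fis, x, W, gamma, rng):
    """One iteration of (a sketch of) SAGA-SGDA with R = 0."""
    j = rng.integers(len(Fis))
    g = saga_estimator(Fis, x, W, j)
    W = W.copy()
    W[j] = x                                      # refresh only entry j
    return x - gamma * g, W
```

Unlike L-SVRGDA, no full-operator recomputation ever happens: only the single entry $w_{j_k}$ is refreshed per step, at the cost of storing the whole table.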
Lemma F.8. Let Assumption 4.1 hold. Then for all $k \ge 0$ SAGA-SGDA satisfies
Proof. For brevity, we introduce a new notation: $S^k = \frac{1}{n}\sum_{i=1}^n F_i(w_i^k)$. Since $g^k = F_{j_k}(x^k) - F_{j_k}(w_{j_k}^k) + S^k$, we have
Lemma F.9. Let Assumptions 4.1 and 4.2 hold. Then for all $k \ge 0$ SAGA-SGDA satisfies
\[
\mathbb{E}_k\left[\sigma_{k+1}^2\right] \le \frac{\hat\ell}{n}\langle F(x^k) - F(x^{*,k}), x^k - x^{*,k}\rangle + \left(1 - \frac{1}{n}\right)\sigma_k^2, \tag{57}
\]
where $\sigma_k^2 := \frac{1}{n}\sum_{i=1}^n\|F_i(w_i^k) - F_i(x^{*,k})\|^2$.
Proof. Using the definitions of $\sigma_{k+1}^2$ and $w_i^{k+1}$, we derive
\[
\mathbb{E}_k\left[\sigma_{k+1}^2\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_k\|F_i(w_i^{k+1}) - F_i(x^{*,k+1})\|^2 \overset{\text{As. 4.2}}{=} \frac{1}{n}\sum_{i=1}^n \mathbb{E}_k\|F_i(w_i^{k+1}) - F_i(x^{*,k})\|^2
\]
\[
= \frac{1}{n^2}\sum_{i=1}^n\|F_i(x^k) - F_i(x^{*,k})\|^2 + \frac{1 - 1/n}{n}\sum_{i=1}^n\|F_i(w_i^k) - F_i(x^{*,k})\|^2
\overset{(16)}{\le} \frac{\hat\ell}{n}\langle F(x^k) - F(x^{*,k}), x^k - x^{*,k}\rangle + \left(1 - \frac{1}{n}\right)\sigma_k^2.
\]
The above two lemmas imply that Assumption 2.1 is satisfied with certain parameters.
Proposition F.10. Let Assumptions 4.1 and 4.2 hold. Then, SAGA-SGDA satisfies Assumption 2.1 with
\[
A = \hat\ell, \quad B = 2, \quad \sigma_k^2 = \frac{1}{n}\sum_{i=1}^n\|F_i(w_i^k) - F_i(x^*)\|^2, \quad C = \frac{\hat\ell}{2n}, \quad \rho = \frac{1}{n}, \quad D_1 = D_2 = 0.
\]
Theorem F.11. Let $F$ be $\mu$-quasi strongly monotone, Assumptions 4.1, 4.2 hold, and $0 < \gamma \le \frac{1}{6\hat\ell}$. Then for all $k \ge 0$ the iterates produced by SAGA-SGDA satisfy
\[
\mathbb{E}\|x^k - x^*\|^2 \le \left(1 - \min\left\{\gamma\mu, \frac{1}{2n}\right\}\right)^k V_0. \tag{58}
\]
Corollary F.12. Let the assumptions of Theorem F.11 hold. Then, for $\gamma = \frac{1}{6\hat\ell}$ and any $K \ge 0$ we have
\[
\mathbb{E}[\|x^K - x^*\|^2] \le V_0\exp\left(-\min\left\{\frac{\mu}{6\hat\ell}, \frac{1}{2n}\right\}K\right).
\]
Theorem F.13. Let $F$ be monotone, $\ell$-star-cocoercive and Assumptions 2.1, 2.4, 4.1, 4.2 hold. Assume that $\gamma \le \frac{1}{6\hat\ell}$. Then for $\mathrm{Gap}_C(z)$ from (10) and for all $K \ge 0$ the iterates produced by SAGA-SGDA satisfy
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + \frac{\big(12\hat\ell + \ell\big)\|x^0 - x^{*,0}\|^2}{K} + \big(4 + (12\hat\ell + \ell)\gamma\big)\frac{2\gamma n\sigma_0^2}{K} + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2.
\]
Applying Corollary D.4, we get the rate of convergence to the exact solution.
Corollary F.14. Let the assumptions of Theorem F.13 hold. Then $\forall K > 0$ one can choose $\gamma$ as
\[
\gamma = \min\left\{\frac{1}{12\hat\ell + \ell},\ \frac{1}{\sqrt{2n\hat\ell\ell}},\ \frac{\Omega_{0,C}}{G_*\sqrt{K}}\right\}. \tag{59}
\]
This choice of $\gamma$ implies
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] = O\left(\frac{(\hat\ell + \ell)(\Omega_{0,C}^2 + \Omega_0^2) + \sqrt{n\hat\ell\ell}\,\Omega_{0,C}^2 + \ell\Omega_C^2}{K} + \frac{\Omega_{0,C}G_*}{\sqrt{K}}\right).
\]
k=1
Proof. Since σ0 for SAGA-SGDA and L-SVRGDA are the same, the proof of this corollary is identical to
the one for Corollary F.7.
cocoercivity constants and Lipschitz constants (even when R(x) ≡ 0, i.e., G∗ = 0). However, in general,
it is possible that star-cocoercivity holds, while Lipschitzness does not [Loizou et al., 2021]. Moreover,
we emphasize here that Alacaoglu and Malitsky [2021] and other works do not consider SGDA as the
basis for their methods. To the best of our knowledge, our results are the first ones for variance-reduced
SGDA-type methods derived in the monotone case without assuming (quasi-)strong monotonicity.
G Distributed SGDA with Compression: Missing Proofs and Details
In this section, we provide missing proofs and details for Section 5.
G.1 QSGDA
In this section (and in the one about DIANA-SGDA), we assume that each Fi has an expectation form:
Fi (x) = Eξi ∼Di [Fξi (x)].
Proposition G.1 (Proposition 5.3). Let $F$ be $\ell$-star-cocoercive and Assumptions 4.1, 5.2 hold. Then, QSGDA with quantization (17) satisfies Assumption 2.1 with
\[
A = \frac{3\ell}{2} + \frac{9\omega\hat\ell}{2n}, \quad D_1 = \frac{3(1+3\omega)\sigma^2 + 9\omega\zeta_*^2}{n}, \quad \sigma_k^2 = 0, \quad B = 0, \quad C = 0, \quad \rho = 1, \quad D_2 = 0,
\]
where $\sigma^2 = \frac{1}{n}\sum_{i=1}^n\sigma_i^2$ and $\zeta_*^2 = \frac{1}{n}\max_{x^*\in X^*}\left[\sum_{i=1}^n\|F_i(x^*)\|^2\right]$.
Proof. Since $g^k = \frac{1}{n}\sum_{i=1}^n Q(g_i^k)$, $Q(g_1^k), \ldots, Q(g_n^k)$ are independent for fixed $g_1^k, \ldots, g_n^k$, and $g_1^k, \ldots, g_n^k$ are independent for fixed $x^k$, we have
\[
\mathbb{E}_k\|g^k - F(x^{*,k})\|^2 = \mathbb{E}_k\left\|\frac{1}{n}\sum_{i=1}^n Q(g_i^k) - F(x^{*,k})\right\|^2
= \mathbb{E}_k\left\|\frac{1}{n}\sum_{i=1}^n\left(Q(g_i^k) - g_i^k + g_i^k - F_i(x^k)\right) + F(x^k) - F(x^{*,k})\right\|^2
\]
\[
\le 3\mathbb{E}_k\left\|\frac{1}{n}\sum_{i=1}^n\left[Q(g_i^k) - g_i^k\right]\right\|^2 + 3\mathbb{E}_k\left\|\frac{1}{n}\sum_{i=1}^n\left[g_i^k - F_i(x^k)\right]\right\|^2 + 3\left\|F(x^k) - F(x^{*,k})\right\|^2
\]
\[
= \frac{3}{n^2}\sum_{i=1}^n\mathbb{E}_k\left[\left\|Q(g_i^k) - g_i^k\right\|^2\right] + \frac{3}{n^2}\sum_{i=1}^n\mathbb{E}_k\left[\left\|g_i^k - F_i(x^k)\right\|^2\right] + 3\left\|F(x^k) - F(x^{*,k})\right\|^2.
\]
48
Next, we use Assumption 5.2, $\sigma^2 = \frac{1}{n}\sum_{i=1}^n\sigma_i^2$, and the definition of quantization (17) and get
\[
\mathbb{E}_k\|g^k - F(x^{*,k})\|^2 \le \frac{3\omega}{n^2}\sum_{i=1}^n\mathbb{E}_k\left[\|g_i^k\|^2\right] + \frac{3\sigma^2}{n} + 3\|F(x^k) - F(x^{*,k})\|^2
\]
\[
\le \frac{3\omega}{n^2}\sum_{i=1}^n\mathbb{E}_k\left[\left\|g_i^k - F_i(x^k) + F_i(x^k) - F_i(x^{*,k}) + F_i(x^{*,k})\right\|^2\right] + \frac{3\sigma^2}{n} + 3\|F(x^k) - F(x^{*,k})\|^2
\]
\[
\le \frac{9\omega}{n^2}\sum_{i=1}^n\mathbb{E}_k\left[\left\|g_i^k - F_i(x^k)\right\|^2\right] + \frac{9\omega}{n^2}\sum_{i=1}^n\mathbb{E}_k\left[\left\|F_i(x^k) - F_i(x^{*,k})\right\|^2\right] + \frac{9\omega}{n^2}\sum_{i=1}^n\mathbb{E}_k\left[\left\|F_i(x^{*,k})\right\|^2\right] + \frac{3\sigma^2}{n} + 3\|F(x^k) - F(x^{*,k})\|^2
\]
\[
\overset{(18)}{\le} \frac{9\omega}{n^2}\sum_{i=1}^n\mathbb{E}_k\left[\left\|F_i(x^k) - F_i(x^{*,k})\right\|^2\right] + 3\left\|F(x^k) - F(x^{*,k})\right\|^2 + \frac{9\omega}{n^2}\sum_{i=1}^n\mathbb{E}_k\left[\left\|F_i(x^{*,k})\right\|^2\right] + \frac{3(1+3\omega)\sigma^2}{n}.
\]
Theorem G.2. Let $F$ be $\mu$-quasi strongly monotone, $\ell$-star-cocoercive, Assumptions 4.1, 5.2 hold, and
\[
0 < \gamma \le \frac{1}{3\ell + \frac{9\omega\hat\ell}{n}}.
\]
Corollary G.3. Let the assumptions of Theorem G.2 hold. Then, for any $K \ge 0$ one can choose $\{\gamma_k\}_{k\ge 0}$ as follows:
\[
\gamma_k = \begin{cases}
\left(3\ell + \frac{9\omega\hat\ell}{n}\right)^{-1}, & \text{if } K \le \frac{1}{\mu}\left(3\ell + \frac{9\omega\hat\ell}{n}\right),\\[2pt]
\left(3\ell + \frac{9\omega\hat\ell}{n}\right)^{-1}, & \text{if } K > \frac{1}{\mu}\left(3\ell + \frac{9\omega\hat\ell}{n}\right) \text{ and } k < k_0,\\[2pt]
\frac{2}{6\ell + \frac{18\omega\hat\ell}{n} + \mu(k - k_0)}, & \text{if } K > \frac{1}{\mu}\left(3\ell + \frac{9\omega\hat\ell}{n}\right) \text{ and } k \ge k_0.
\end{cases}
\]
Theorem G.4. Let $F$ be monotone, $\ell$-star-cocoercive and Assumptions 2.1, 2.4, 4.1, 5.2 hold. Assume that $\gamma \le \left(3\ell + \frac{9\omega\hat\ell}{n}\right)^{-1}$. Then for $\mathrm{Gap}_C(z)$ from (10) and for all $K \ge 0$ the iterates produced by QSGDA satisfy
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + \left(7\ell + \frac{18\omega\hat\ell}{n}\right)\cdot\frac{\|x^0 - x^{*,0}\|^2}{K}
+ \gamma\left(2 + \gamma\left(7\ell + \frac{18\omega\hat\ell}{n}\right)\right)\cdot\frac{3(1+3\omega)\sigma^2 + 9\omega\zeta_*^2}{n} + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2.
\]
Applying Corollary D.4, we get the rate of convergence to the exact solution.
Corollary G.5. Let the assumptions of Theorem G.4 hold. Then $\forall K > 0$ one can choose $\gamma$ as
\[
\gamma = \min\left\{\frac{1}{7\ell + \frac{18\omega\hat\ell}{n}},\ \frac{\Omega_{0,C}\sqrt{n}}{\sqrt{3K(1+3\omega)\sigma^2 + 9K\omega\zeta_*^2}},\ \frac{\Omega_{0,C}}{G_*\sqrt{K}}\right\}.
\]
G.2 DIANA-SGDA
G.2.1 Proof of Proposition 5.4
The following result follows from Lemmas 1 and 2 from Horváth et al. [2019]. It holds in our settings as
well, since it does not rely on the exact form of Fi (xk ).
Algorithm 5 DIANA-SGDA: DIANA Stochastic Gradient Descent-Ascent [Mishchenko et al., 2019, Horváth et al., 2019]
1: Input: starting points $x^0, h_1^0, \ldots, h_n^0 \in \mathbb{R}^d$, $h^0 = \frac{1}{n}\sum_{i=1}^n h_i^0$, stepsizes $\gamma, \alpha > 0$, number of steps $K$
2: for $k = 0$ to $K-1$ do
3:   Broadcast $x^k$ to all workers
4:   for $i = 1, \ldots, n$ in parallel do
5:     Compute $g_i^k$ and $\Delta_i^k = g_i^k - h_i^k$
6:     Send $Q(\Delta_i^k)$ to the server
7:     $h_i^{k+1} = h_i^k + \alpha Q(\Delta_i^k)$
8:   end for
9:   $g^k = h^k + \frac{1}{n}\sum_{i=1}^n Q(\Delta_i^k) = \frac{1}{n}\sum_{i=1}^n\left(h_i^k + Q(\Delta_i^k)\right)$
10:  $x^{k+1} = \mathrm{prox}_{\gamma R}\left(x^k - \gamma g^k\right)$
11:  $h^{k+1} = h^k + \alpha\frac{1}{n}\sum_{i=1}^n Q(\Delta_i^k) = \frac{1}{n}\sum_{i=1}^n h_i^{k+1}$
12: end for
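One round of Algorithm 5 can be sketched as follows. This is a minimal single-process simulation with $R \equiv 0$; the oracle and compressor are passed in as callables, and all names are ours. With the identity compressor and $\alpha = 1$, one round reduces each shift $h_i$ to the worker's current estimate, which the test checks.

```python
import numpy as np

def diana_sgda_step(x, h_locals, oracles, Q, gamma, alpha):
    """One round of (a sketch of) DIANA-SGDA, R = 0 for simplicity.
    oracles[i](x) returns worker i's stochastic estimate g_i^k,
    Q is the compression operator, h_locals the shift vectors h_i^k."""
    deltas = []
    for i in range(len(oracles)):
        g_i = oracles[i](x)
        deltas.append(Q(g_i - h_locals[i]))          # worker sends Q(Delta_i)
    # Server aggregates: g^k = (1/n) sum_i (h_i^k + Q(Delta_i^k))
    g = np.mean([h + d for h, d in zip(h_locals, deltas)], axis=0)
    h_new = [h + alpha * d for h, d in zip(h_locals, deltas)]  # shift update
    return x - gamma * g, h_new
```

The shifts $h_i^k$ are what distinguish DIANA-SGDA from QSGDA: only the difference $g_i^k - h_i^k$ is compressed, so the compression error vanishes as $h_i^k$ tracks the local operator values.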
Lemma G.6 (Lemmas 1 and 2 from Horváth et al. [2019]). Let Assumptions 4.2, 5.2 hold. Suppose that $\alpha \le \frac{1}{1+\omega}$. Then, for all $k \ge 0$ DIANA-SGDA satisfies
\[
\mathbb{E}_k\left[g^k\right] = F(x^k),
\]
\[
\mathbb{E}_k\|g^k - F(x^*)\|^2 \le \left(1 + \frac{2\omega}{n}\right)\frac{1}{n}\sum_{i=1}^n\|F_i(x^k) - F_i(x^*)\|^2 + \frac{2\omega\sigma_k^2}{n} + \frac{(1+\omega)\sigma^2}{n},
\]
\[
\mathbb{E}_k\left[\sigma_{k+1}^2\right] \le (1-\alpha)\sigma_k^2 + \frac{\alpha}{n}\sum_{i=1}^n\|F_i(x^k) - F_i(x^*)\|^2 + \alpha\sigma^2,
\]
where $\sigma_k^2 = \frac{1}{n}\sum_{i=1}^n\|h_i^k - F_i(x^*)\|^2$ and $\sigma^2 = \frac{1}{n}\sum_{i=1}^n\sigma_i^2$.
The lemma above implies that Assumption 2.1 is satisfied with certain parameters.
Proposition G.7 (Proposition 5.4). Let Assumptions 4.1, 4.2, 5.2 hold. Suppose that $\alpha \le \frac{1}{1+\omega}$. Then, DIANA-SGDA with quantization (17) satisfies Assumption 2.1 with $\sigma_k^2 = \frac{1}{n}\sum_{i=1}^n\|h_i^k - F_i(x^*)\|^2$ and
\[
A = \left(\frac{1}{2} + \frac{\omega}{n}\right)\hat\ell, \quad B = \frac{2\omega}{n}, \quad D_1 = \frac{(1+\omega)\sigma^2}{n}, \quad C = \frac{\alpha\hat\ell}{2}, \quad \rho = \alpha, \quad D_2 = \alpha\sigma^2.
\]
Proof. To get the result, one needs to apply Assumption 4.1 to estimate $\frac{1}{n}\sum_{i=1}^n\|F_i(x^k) - F_i(x^*)\|^2$ from Lemma G.6.
Theorem G.8. Let $F$ be $\mu$-quasi strongly monotone, Assumptions 4.1, 4.2, 5.2 hold, $\alpha \le \frac{1}{1+\omega}$, and
\[
0 < \gamma \le \frac{1}{\left(1 + \frac{6\omega}{n}\right)\hat\ell}.
\]
Then, for all $k \ge 0$ the iterates produced by DIANA-SGDA satisfy
\[
\mathbb{E}\|x^k - x^*\|^2 \le \left(1 - \min\left\{\gamma\mu, \frac{\alpha}{2}\right\}\right)^k \mathbb{E}[V_0] + \frac{\gamma^2\sigma^2(1+5\omega)}{n\cdot\min\{\gamma\mu, \alpha/2\}}.
\]
Corollary G.9. Let the assumptions of Theorem G.8 hold. Then, for any $K \ge 0$ one can choose $\alpha = \frac{1}{1+\omega}$ and $\{\gamma_k\}_{k\ge 0}$ as follows:
\[
\gamma_k = \begin{cases}
\frac{1}{h}, & \text{if } K \le \frac{h}{\mu},\\[2pt]
\frac{1}{h}, & \text{if } K > \frac{h}{\mu} \text{ and } k < k_0,\\[2pt]
\frac{2}{2h + \mu(k - k_0)}, & \text{if } K > \frac{h}{\mu} \text{ and } k \ge k_0,
\end{cases}
\]
where $h = \max\left\{\left(1 + \frac{6\omega}{n}\right)\hat\ell,\ 2\mu(1+\omega)\right\}$ and $k_0 = \lceil K/2\rceil$. For this choice of $\gamma_k$ the following inequality holds:
\[
\mathbb{E}[\|x^K - x^{*,K}\|^2] \le 32\max\left\{\frac{\left(1 + \frac{6\omega}{n}\right)\hat\ell}{\mu},\ 2(1+\omega)\right\} V_0\exp\left(-\min\left\{\frac{\mu}{\hat\ell\left(1 + \frac{6\omega}{n}\right)},\ \frac{1}{1+\omega}\right\}K\right) + \frac{36(1+5\omega)\sigma^2}{\mu^2 n K}.
\]
Theorem G.10. Let $F$ be monotone, $\ell$-star-cocoercive and Assumptions 2.1, 2.4, 4.1, 4.2, 5.2 hold. Assume that
\[
0 < \gamma \le \frac{1}{\left(1 + \frac{4\omega}{n}\right)\hat\ell}.
\]
Then for $\mathrm{Gap}_C(z)$ from (10) and for all $K \ge 0$ the iterates produced by DIANA-SGDA satisfy
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + \left(2\hat\ell + \frac{12\omega\hat\ell}{n} + \ell\right)\frac{\|x^0 - x^{*,0}\|^2}{K}
+ \left(4 + \gamma\left(2\hat\ell + \frac{12\omega\hat\ell}{n} + \ell\right)\right)\frac{\gamma B\sigma_0^2}{\rho K}
+ \gamma\left(2 + \gamma\left(2\hat\ell + \frac{12\omega\hat\ell}{n} + \ell\right)\right)\frac{(1+5\omega)\sigma^2}{n} + 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2.
\]
Applying Corollary D.4, we get the rate of convergence to the exact solution.
Corollary G.11. Let the assumptions of Theorem G.10 hold. Then $\forall K > 0$ one can choose $\gamma$ as
\[
\gamma = \min\left\{\left(\ell + 2\hat\ell + \frac{12\omega\hat\ell}{n}\right)^{-1},\ \frac{\sqrt{\alpha n}}{\sqrt{2\omega\hat\ell\ell}},\ \frac{\Omega_{0,C}}{\sigma\sqrt{K(1+3\omega)/n}},\ \frac{\Omega_{0,C}}{G_*\sqrt{K}}\right\}.
\]
This choice of $\gamma$ implies that $\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right]$ equals
\[
O\left(\frac{\left(\ell + \hat\ell + \frac{\omega\hat\ell}{n}\right)(\Omega_{0,C}^2 + \Omega_0^2) + \ell\Omega_C^2}{K} + \frac{\sqrt{\omega\hat\ell\ell}\,\Omega_{0,C}^2}{\sqrt{\alpha n}\,K} + \frac{\Omega_{0,C}\left(\sqrt{(1+\omega)\sigma^2/n} + G_*\right)}{\sqrt{K}}\right).
\]
G.3 VR-DIANA-SGDA
In this section, we assume that each $F_i$ has a finite-sum form: $F_i(x) = \frac{1}{m}\sum_{j=1}^m F_{ij}(x)$.
6:   $w_i^{k+1} = \begin{cases} x^k, & \text{with probability } p,\\ w_i^k, & \text{with probability } 1-p,\end{cases}$
7:   $\Delta_i^k = g_i^k - h_i^k$
8:   Send $Q(\Delta_i^k)$ to the server
9:   $h_i^{k+1} = h_i^k + \alpha Q(\Delta_i^k)$
10:  end for
11:  $g^k = h^k + \frac{1}{n}\sum_{i=1}^n Q(\Delta_i^k) = \frac{1}{n}\sum_{i=1}^n\left(h_i^k + Q(\Delta_i^k)\right)$
12:  $x^{k+1} = \mathrm{prox}_{\gamma R}\left(x^k - \gamma g^k\right)$
13:  $h^{k+1} = h^k + \alpha\frac{1}{n}\sum_{i=1}^n Q(\Delta_i^k) = \frac{1}{n}\sum_{i=1}^n h_i^{k+1}$
14: end for
Lemma G.12 (Modification of Lemmas 3 and 7 from Horváth et al. [2019]). Let $F$ be $\ell$-star-cocoercive and Assumptions 4.1, 4.2, 5.5 hold. Then for all $k \ge 0$ VR-DIANA-SGDA satisfies
\[
\mathbb{E}_k\left[g^k\right] = F(x^k),
\]
\[
\mathbb{E}_k\|g^k - F(x^*)\|^2 \le \left(\ell + \frac{2\tilde\ell}{n} + \frac{2\omega(\hat\ell + \tilde\ell)}{n}\right)\langle F(x^k) - F(x^*), x^k - x^*\rangle + \frac{2(\omega+1)}{n}\sigma_k^2,
\]
where $\sigma_k^2 = \frac{H^k}{n} + \frac{D^k}{nm}$ with $H^k = \sum_{i=1}^n\|h_i^k - F_i(x^*)\|^2$ and $D^k = \sum_{i=1}^n\sum_{j=1}^m\|F_{ij}(w_i^k) - F_{ij}(x^*)\|^2$.
Next, we derive the upper bounds for the terms $T_1$ and $T_2$ separately. For $T_2$ we use unbiasedness of quantization and independence of the workers:
\[
T_2 = \mathbb{E}_Q\left\|\frac{1}{n}\sum_{i=1}^n\left(Q(g_i^k - h_i^k) - (g_i^k - h_i^k)\right)\right\|^2 = \frac{1}{n^2}\sum_{i=1}^n\mathbb{E}_Q\left[\left\|Q(g_i^k - h_i^k) - (g_i^k - h_i^k)\right\|^2\right] \overset{(17)}{\le} \frac{\omega}{n^2}\sum_{i=1}^n\left\|g_i^k - h_i^k\right\|^2.
\]
Taking $\mathbb{E}_k[\cdot]$ from both sides of the above inequality, we derive
\[
\mathbb{E}_k[T_2] \le \frac{\omega}{n^2}\sum_{i=1}^n\mathbb{E}_k\left[\left\|g_i^k - h_i^k\right\|^2\right] = \frac{\omega}{n^2}\sum_{i=1}^n\left(\left\|\mathbb{E}_k\left[g_i^k\right] - h_i^k\right\|^2 + \mathbb{E}_k\left\|g_i^k - \mathbb{E}_k\left[g_i^k\right]\right\|^2\right)
\]
\[
= \frac{\omega}{n^2}\sum_{i=1}^n\left(\left\|F_i(x^k) - h_i^k\right\|^2 + \mathbb{E}_k\left\|F_{ij_i^k}(x^k) - F_{ij_i^k}(w_i^k) - \mathbb{E}_k\left[F_{ij_i^k}(x^k) - F_{ij_i^k}(w_i^k)\right]\right\|^2\right)
\]
\[
\le \frac{\omega}{n^2}\sum_{i=1}^n\left(\left\|F_i(x^k) - h_i^k\right\|^2 + \mathbb{E}_k\left\|F_{ij_i^k}(x^k) - F_{ij_i^k}(w_i^k)\right\|^2\right)
\]
\[
\le \frac{2\omega}{n^2}\sum_{i=1}^n\left(\left\|h_i^k - F_i(x^\star)\right\|^2 + \left\|F_i(x^k) - F_i(x^\star)\right\|^2\right) + \frac{2\omega}{n^2}\sum_{i=1}^n\left(\mathbb{E}_k\left\|F_{ij_i^k}(w_i^k) - F_{ij_i^k}(x^\star)\right\|^2 + \mathbb{E}_k\left\|F_{ij_i^k}(x^k) - F_{ij_i^k}(x^\star)\right\|^2\right).
\]
In the last line, we also use the definitions of $H^k, D^k$. For $T_1$ we use the definition of $g^k$:
\[
T_1 = \left\|\mathbb{E}_Q\left[\frac{1}{n}\sum_{i=1}^n\left(Q(g_i^k - h_i^k) + h_i^k\right)\right] - F(x^*)\right\|^2 = \left\|\frac{1}{n}\sum_{i=1}^n g_i^k - F(x^*)\right\|^2,
\]
which concludes the proof since $\sigma_k^2 = \frac{H^k}{n} + \frac{D^k}{nm}$.
where $\sigma_k^2 = \frac{H^k}{n} + \frac{D^k}{nm}$ with $H^k = \sum_{i=1}^n\|h_i^k - F_i(x^*)\|^2$ and $D^k = \sum_{i=1}^n\sum_{j=1}^m\|F_{ij}(w_i^k) - F_{ij}(x^*)\|^2$.
Proof. We start with considering $H^{k+1}$:
\[
\mathbb{E}_k\left[H^{k+1}\right] = \mathbb{E}_k\left[\sum_{i=1}^n\left\|h_i^{k+1} - F_i(x^\star)\right\|^2\right]
= \sum_{i=1}^n\left\|h_i^k - F_i(x^\star)\right\|^2 + \sum_{i=1}^n\mathbb{E}_k\left[2\left\langle\alpha Q(g_i^k - h_i^k), h_i^k - F_i(x^\star)\right\rangle + \alpha^2\left\|Q(g_i^k - h_i^k)\right\|^2\right]
\]
\[
\overset{(17)}{\le} H^k + \sum_{i=1}^n\mathbb{E}_k\left[2\alpha\left\langle g_i^k - h_i^k, h_i^k - F_i(x^\star)\right\rangle + \alpha^2(\omega+1)\left\|g_i^k - h_i^k\right\|^2\right].
\]
It remains to put the upper bounds on $D^{k+1}$, $H^{k+1}$ together and use the definition of $\sigma_{k+1}^2$:
\[
\mathbb{E}_k\left[\sigma_{k+1}^2\right] = \frac{\mathbb{E}_k\left[H^{k+1}\right]}{n} + \frac{\mathbb{E}_k\left[D^{k+1}\right]}{nm}
\le (1-\alpha)\frac{H^k}{n} + (1 + 2\alpha - p)\frac{D^k}{nm} + \left(p\tilde\ell + 2\alpha(\tilde\ell + \hat\ell)\right)\langle F(x^k) - F(x^*), x^k - x^*\rangle.
\]
With $\alpha \le \frac{p}{3}$ we get $2\alpha - p \le -\alpha$, implying
\[
\mathbb{E}_k\left[\sigma_{k+1}^2\right] \le (1-\alpha)\frac{H^k}{n} + (1-\alpha)\frac{D^k}{nm} + \left(p\tilde\ell + 2\alpha(\tilde\ell + \hat\ell)\right)\langle F(x^k) - F(x^*), x^k - x^*\rangle
= (1-\alpha)\sigma_k^2 + \left(p\tilde\ell + 2\alpha(\tilde\ell + \hat\ell)\right)\langle F(x^k) - F(x^*), x^k - x^*\rangle.
\]
The above two lemmas imply that Assumption 2.1 is satisfied with certain parameters.
where $V_0 = \|x^0 - x^*\|^2 + \frac{4(\omega+1)\gamma^2}{n\alpha}\sigma_0^2$.
Corollary G.16. Let the assumptions of Theorem G.15 hold. Then, for $p = \frac{1}{m}$, $\alpha = \min\left\{\frac{1}{3m}, \frac{1}{1+\omega}\right\}$,
\[
\gamma = \left(\ell + \frac{10(\omega+1)(\hat\ell + \tilde\ell)}{n} + \frac{4(\omega+1)\max\{3m, 1+\omega\}\tilde\ell}{nm}\right)^{-1}
\]
and any $K \ge 0$ we have
\[
\mathbb{E}[\|x^k - x^*\|^2] \le V_0\exp\left(-\min\left\{\frac{\mu}{\ell + \frac{10(\omega+1)(\hat\ell + \tilde\ell)}{n} + \frac{4(\omega+1)\max\{3m, 1+\omega\}\tilde\ell}{nm}},\ \frac{1}{6m},\ \frac{1}{2(1+\omega)}\right\}K\right).
\]
Next, using Theorem 2.5, we establish the convergence of VR-DIANA-SGDA in the monotone case.
Theorem G.17. Let $F$ be monotone, $\ell$-star-cocoercive and Assumptions 2.1, 2.4, 4.1, 4.2, 5.5 hold. Assume that
\[
0 < \gamma \le \left(\ell + \frac{6(\omega+1)(\hat\ell + \tilde\ell)}{n} + \frac{2(\omega+1)p\tilde\ell}{\alpha n}\right)^{-1}
\]
and $\alpha = \min\left\{\frac{p}{3}, \frac{1}{1+\omega}\right\}$. Then for $\mathrm{Gap}_C(z)$ from (10) and for all $K \ge 0$ the iterates produced by VR-DIANA-SGDA satisfy
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K}
+ \left(3\ell + \frac{12(\omega+1)(\hat\ell + \tilde\ell)}{n} + \frac{8(\omega+1)p\tilde\ell}{\alpha n}\right)\cdot\frac{\|x^0 - x^{*,0}\|^2}{K}
+ \left(4 + \gamma\left(3\ell + \frac{12(\omega+1)(\hat\ell + \tilde\ell)}{n} + \frac{8(\omega+1)p\tilde\ell}{\alpha n}\right)\right)\frac{\gamma B\sigma_0^2}{\rho K}
+ 9\gamma\max_{x^*\in X^*}\|F(x^*)\|^2.
\]
Applying Corollary D.4, we get the rate of convergence to the exact solution.
Corollary G.18. Let the assumptions of Theorem G.17 hold. Then $\forall K > 0$ one can choose $p = \frac{1}{m}$, $\alpha = \min\left\{\frac{1}{3m}, \frac{1}{1+\omega}\right\}$ and $\gamma$ as
\[
\gamma = \min\left\{\frac{1}{3\ell + \frac{12(\omega+1)(\hat\ell + \tilde\ell)}{n} + \frac{8(\omega+1)\max\{3m, 1+\omega\}\tilde\ell}{mn}},\ \frac{\sqrt{n}}{\sqrt{2\max\{3m, 1+\omega\}(\omega+1)(\tilde\ell + \hat\ell)\ell}},\ \frac{\Omega_{0,C}}{G_*\sqrt{K}}\right\}.
\]
Proof. The result follows from the next upper bound $\widehat\sigma_0^2$ on $\sigma_0^2$ with the initialization $h_i^0 = F_i(x^0)$ and $w_i^0 = x^0$:
\[
\sigma_0^2 = \frac{1}{nm}\sum_{i=1}^n\sum_{j=1}^m\|F_{ij}(x^0) - F_{ij}(x^*)\|^2 + \frac{1}{n}\sum_{i=1}^n\|F_i(x^0) - F_i(x^*)\|^2
\le (\tilde\ell + \hat\ell)\langle F(x^0) - F(x^*), x^0 - x^*\rangle
\]
\[
\le (\tilde\ell + \hat\ell)\|F(x^0) - F(x^*)\|\cdot\|x^0 - x^*\|
\le (\tilde\ell + \hat\ell)\ell\|x^0 - x^*\|^2 \le (\tilde\ell + \hat\ell)\ell\max_{u\in C}\|x^0 - u\|^2 \le (\tilde\ell + \hat\ell)\ell\,\Omega_{0,C}^2.
\]
Next, applying Corollary D.4 with $\widehat\sigma_0 := \sqrt{(\tilde\ell + \hat\ell)\ell}\,\Omega_{0,C}$, we get the result.
H Coordinate SGDA
In this section, we focus on coordinate versions of SGDA. To denote the $i$-th component of a vector $x$ we use $[x]_i$. The vectors $e_1, \ldots, e_d \in \mathbb{R}^d$ form the standard basis in $\mathbb{R}^d$.
H.1 CSGDA
Proposition H.1. Let $F$ be $\ell$-star-cocoercive. Then, CSGDA satisfies Assumption 2.1 with
\[
A = d\ell, \quad D_1 = 2d\max_{x^*\in X^*}\left[\|F(x^*)\|^2\right], \quad \sigma_k^2 = 0, \quad B = 0, \quad C = 0, \quad \rho = 1, \quad D_2 = 0.
\]
Proof. First of all, for all $a \in \mathbb{R}^d$ and for a random index $j$ uniformly distributed on $[d]$ we have $\mathbb{E}_j[\|e_j[a]_j\|^2] = \frac{1}{d}\sum_{i=1}^d [a]_i^2 = \frac{1}{d}\|a\|^2$. Using this and $g^k = de_j[F(x^k)]_j$, we derive
\[
\mathbb{E}_k\|g^k - F(x^{*,k})\|^2 = \mathbb{E}_k\|de_j[F(x^k) - F(x^{*,k})]_j + de_j[F(x^{*,k})]_j - F(x^{*,k})\|^2
\le 2\mathbb{E}_k\|de_j[F(x^k) - F(x^{*,k})]_j\|^2 + 2\mathbb{E}_k\|de_j[F(x^{*,k})]_j - F(x^{*,k})\|^2
\]
\[
= 2d\|F(x^k) - F(x^{*,k})\|^2 + 2\mathbb{E}_k\|de_j[F(x^{*,k})]_j - \mathbb{E}_k[de_j[F(x^{*,k})]_j]\|^2
\le 2d\|F(x^k) - F(x^{*,k})\|^2 + 2\mathbb{E}_k\|de_j[F(x^{*,k})]_j\|^2
= 2d\|F(x^k) - F(x^{*,k})\|^2 + 2d\|F(x^{*,k})\|^2. \tag{60}
\]
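The two identities used in the proof, $\mathbb{E}_j[e_j[a]_j] = a/d$ and $\mathbb{E}_j[\|de_j[a]_j\|^2] = d\|a\|^2$, can be checked exactly by enumerating the coordinate index. The sketch below (our own illustrative code) does this and also shows the CSGDA estimator itself.

```python
import numpy as np

def csgda_estimator(F_x, j):
    """CSGDA estimator g = d * e_j * [F(x)]_j for a sampled coordinate j,
    where F_x is the full operator value F(x)."""
    d = F_x.size
    g = np.zeros_like(F_x)
    g[j] = d * F_x[j]
    return g

def coordinate_moments(a):
    """Exact E_j[e_j [a]_j] and E_j[||d e_j [a]_j||^2] over uniform j."""
    d = a.size
    mean = a / d                                         # one coordinate survives
    second = sum((d * a[j]) ** 2 for j in range(d)) / d  # = d * ||a||^2
    return mean, second
```

Averaging the estimator over all $j$ multiplies the mean $a/d$ back by $d$, which is why $g^k$ is unbiased while its second moment carries the factor $d$ appearing in $A = d\ell$ and $D_1$.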
Theorem H.2. Let $F$ be $\mu$-quasi strongly monotone and $\ell$-star-cocoercive, and $0 < \gamma \le \frac{1}{2d\ell}$. Then for all $k \ge 0$
\[
\mathbb{E}\|x^k - x^*\|^2 \le (1 - \gamma\mu)^k\|x^0 - x^{*,0}\|^2 + \frac{2\gamma d}{\mu}\cdot\max_{x^*\in X^*}\left[\|F(x^*)\|^2\right].
\]
Corollary H.3. Let the assumptions of Theorem H.2 hold. Then, for any $K \ge 0$ one can choose $\{\gamma_k\}_{k\ge 0}$ as follows:
\[
\gamma_k = \begin{cases}
\frac{1}{2d\ell}, & \text{if } K \le \frac{2d\ell}{\mu},\\[2pt]
\frac{1}{2d\ell}, & \text{if } K > \frac{2d\ell}{\mu} \text{ and } k < k_0,\\[2pt]
\frac{2}{4d\ell + \mu(k - k_0)}, & \text{if } K > \frac{2d\ell}{\mu} \text{ and } k \ge k_0.
\end{cases}
\]
Theorem H.4. Let $F$ be monotone, $\ell$-star-cocoercive and Assumptions 2.1, 2.4 hold. Assume that $\gamma \le \frac{1}{2d\ell}$. Then for $\mathrm{Gap}_C(z)$ from (10) and for all $K \ge 0$ the iterates produced by CSGDA satisfy
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + \frac{5d\ell\|x^0 - x^{*,0}\|^2}{K} + 20\gamma d\cdot\max_{x^*\in X^*}\left[\|F(x^*)\|^2\right].
\]
Applying Corollary D.4, we get the rate of convergence to the exact solution.
Corollary H.5. Let the assumptions of Theorem H.4 hold. Then $\forall K > 0$ one can choose $\gamma$ as
\[
\gamma = \min\left\{\frac{1}{5d\ell},\ \frac{\Omega_{0,C}}{G_*\sqrt{2dK}}\right\}.
\]
H.2 SEGA-SGDA
In this section, we consider a modification of SEGA [Hanzely et al., 2018], the linearly converging coordinate method for composite optimization problems that works even for non-separable regularizers.
H.2.1 SEGA-SGDA Fits Assumption 2.1
The following result from Hanzely et al. [2018] does not rely on the fact that F (x) is the gradient of
some function. Therefore, it holds in our settings as well.
Lemma H.6 (Lemmas A.3 and A.4 from Hanzely et al. [2018]). Let Assumption 4.2 hold. Then for
all k ≥ 0 SEGA-SGDA satisfies
The lemma above implies that Assumption 2.1 is satisfied with certain parameters.
Proposition H.7. Let $F$ be $\ell$-star-cocoercive and Assumption 4.2 hold. Then, SEGA-SGDA satisfies Assumption 2.1 with $\sigma_k^2 = \|h^k - F(x^*)\|^2$ and
\[
A = d\ell, \quad B = 2d, \quad D_1 = 0, \quad C = \frac{\ell}{2d}, \quad \rho = \frac{1}{d}, \quad D_2 = 0.
\]
Theorem H.8. Let $F$ be $\mu$-quasi strongly monotone and $\ell$-star-cocoercive, Assumption 4.2 hold, and $0 < \gamma \le \frac{1}{6d\ell}$. Then, for all $k \ge 0$ the iterates produced by SEGA-SGDA satisfy
\[
\mathbb{E}\|x^k - x^*\|^2 \le \left(1 - \min\left\{\gamma\mu, \frac{1}{2d}\right\}\right)^k V_0.
\]
Corollary H.9. Let the assumptions of Theorem H.8 hold. Then, for $\gamma = \frac{1}{6d\ell}$ and any $K \ge 0$ we have
\[
\mathbb{E}[\|x^K - x^*\|^2] \le V_0\exp\left(-\min\left\{\frac{\mu}{6d\ell}, \frac{1}{2d}\right\}K\right).
\]
Theorem H.10. Let $F$ be monotone, $\ell$-star-cocoercive and Assumptions 2.1, 2.4, 4.2 hold. Assume that $\gamma \le \frac{1}{6d\ell}$. Then for $\mathrm{Gap}_C(z)$ from (10) and for all $K \ge 0$ the iterates produced by SEGA-SGDA satisfy
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] \le \frac{3\max_{u\in C}\|x^0 - u\|^2}{2\gamma K} + \frac{8\gamma\ell^2\Omega_C^2}{K} + 13d\ell\cdot\frac{\|x^0 - x^{*,0}\|^2}{K} + (4 + 13\gamma d\ell)\frac{2d\gamma\sigma_0^2}{K} + 9\gamma\cdot\max_{x^*\in X^*}\|F(x^*)\|^2.
\]
Applying Corollary D.4, we get the rate of convergence to the exact solution.
Corollary H.11. Let the assumptions of Theorem H.10 hold. Then $\forall K > 0$ one can choose $\gamma$ as
\[
\gamma = \min\left\{\frac{1}{13d\ell},\ \frac{\Omega_{0,C}}{\sqrt{2}G_* d},\ \frac{\Omega_{0,C}}{G_*\sqrt{K}}\right\}.
\]
This choice of $\gamma$ implies
\[
\mathbb{E}\left[\mathrm{Gap}_C\left(\frac{1}{K}\sum_{k=1}^{K} x^k\right)\right] = O\left(\frac{d\ell(\Omega_{0,C}^2 + \Omega_0^2) + \ell\Omega_C^2}{K} + \frac{d\,\Omega_{0,C}G_*}{K} + \frac{\Omega_{0,C}G_*}{\sqrt{K}}\right).
\]
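One SEGA-SGDA iteration maintains the auxiliary vector $h^k$, observes a single coordinate of $F(x^k)$, and forms $g^k = h^k + de_j\left([F(x^k)]_j - [h^k]_j\right)$ together with $h^{k+1} = h^k + e_j\left([F(x^k)]_j - [h^k]_j\right)$. The sketch below (with $R \equiv 0$ and our own names) illustrates this; the test verifies unbiasedness of $g^k$ by enumerating $j$.

```python
import numpy as np

def sega_sgda_step(x, h, F, gamma, rng):
    """One iteration of (a sketch of) SEGA-SGDA with R = 0.
    h is the current estimate vector h^k; only coordinate j of F(x)
    is observed per step, as in SEGA [Hanzely et al., 2018]."""
    d = x.size
    j = rng.integers(d)
    Fj = F(x)[j]                              # single-coordinate oracle access
    g = h.copy()
    g[j] += d * (Fj - h[j])                   # unbiased over j: E[g] = F(x)
    h_new = h.copy()
    h_new[j] = Fj                             # h^{k+1} = h^k + e_j([F(x)]_j - [h^k]_j)
    return x - gamma * g, h_new
```

Averaging $g$ over all $j$ gives $h + (F(x) - h) = F(x)$, and the residual $\|h^k - F(x^*)\|^2$ is exactly the $\sigma_k^2$ tracked in Proposition H.7.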
Table 3: Summary of the complexity results for zeroth-order methods with two-points feedback oracles for solving (1). By complexity we mean the number of oracle calls required for the method to find $x$ such that $\mathbb{E}[\|x - x^*\|^2] \le \varepsilon$. By default, operator $F$ is assumed to be $\mu$-strongly monotone and, as a result, the solution is unique. Our results rely on $\mu$-quasi strong monotonicity of $F$ (3). Methods supporting $R(x) \not\equiv 0$ are highlighted with $^*$. Our results are highlighted in green. Notation: $q$ = the parameter depending on the proximal setup, $q = 2$ in the Euclidean case and $q = +\infty$ in the $\ell_1$-proximal setup; $G_* = \max_{x^*\in X^*}\|F(x^*)\|$, which is zero when $R(x) \equiv 0$.

SEGA-SGDA$^*$ (This paper): $F$ is $\ell$-cocoercive, As. 4.2; complexity $\widetilde O\left(d + \frac{d\ell}{\mu}\right)$.

(1) The method is based on the Extragradient update rule. Moreover, at each step the full operator is approximated.
(2) The problem is defined on a bounded set.