You are on page 1of 33

Math. Program., Ser.

B (2010) 125:263–295
DOI 10.1007/s10107-010-0394-2

FULL LENGTH PAPER

Approximation accuracy, gradient methods, and error


bound for structured convex optimization

Paul Tseng

Received: 14 May 2009 / Accepted: 22 January 2010 / Published online: 17 August 2010
© Springer and Mathematical Optimization Society 2010

Abstract Convex optimization problems arising in applications, possibly as approx-


imations of intractable problems, are often structured and large scale. When the data
are noisy, it is of interest to bound the solution error relative to the (unknown) solu-
tion of the original noiseless problem. Related to this is an error bound for the lin-
ear convergence analysis of first-order gradient methods for solving these problems.
Example applications include compressed sensing, variable selection in regression,
TV-regularized image denoising, and sensor network localization.

Keywords Convex optimization · Compressed sensing · 1 -regularization ·


Nuclear/trace norm · Regression · Variable selection · Sensor network
localization · Approximation accuracy · Proximal gradient method ·
Error bound · Linear convergence

Mathematics Subject Classification (2000) 65K05 · 49M37

1 Introduction

Optimization problems arising in application areas such as signal/image denoising,


compressed sensing, regression, multi-task learning, classification, sensor network

This work was supported by National Science Foundation grant DMS-0511283.

P. Tseng was invited to speak on this paper at ISMP 2009. Tragically, Paul went missing on August 13,
2009, while kayaking in China. The guest editors thank Michael Friedlander (mpf@cs.ubc.ca) for
carrying out minor revisions to the submitted manuscript as suggested by the referees.

P. Tseng (B)
Department of Mathematics, University of Washington, Seattle, WA 98195, USA
e-mail: mpf@cs.ubc.ca

123
264 P. Tseng

localization, are often large scale, possibly nonconvex or NP-hard, and the data may
be noisy. Such problems may be approximated by convex relaxations that are highly
structured.

Question 1: How accurate are approximations obtained via the convex relaxations?

Specifically, can we bound the error between a solution of the convex relaxation and an
(unknown) solution of the original noiseless problem in terms of knowable quantities
such as the noise level? This question is meaningful when a solution of the original
problem changes (semi)continuously with small perturbations in the data, so the prob-
lem has both discrete and continuous nature. Examples include compressed sensing
and sensor network localization; see Sect. 2. A certain noise-aware property of the
convex relaxation appears key.

Question 2: How fast can the convex relaxations be solved?

Due to the large problem size, first-order gradient methods seem better suited to exploit
structures such as sparsity, (partial) separability, and simple nonsmoothness in the con-
vex relaxations; see Sect. 3. The asymptotic convergence rate of these methods depend
on an error bound on the distance to the solution set of the convex relaxation in terms
of a certain residual function. We prove such an error bound for a class of 2-norm-
regularized problems that includes the group lasso for linear and logistic regression;
see Sect. 4. Thus our aims are threefold: exposit on existing results, present a new
result, and suggest future research.
We begin with a problem that has received much attention recently: compressed
sensing. In the basic version of this problem, we wish to find a sparse representation
of a given noiseless signal b0 ∈ m from a dictionary of n elementary signals. This
may be formulated as
min (x), (1)
x|Ax=b0

where A ∈ m×n comprises the elementary signals for its columns and (x) counts
the number of nonzero components in x ∈ n . In typical applications, m and n are
large (m, n ≥ 2000). This problem is known to be difficult (NP-hard) and a popular
solution approach is to approximate it by a convex relaxation, with (·) replaced by
the 1-norm  · 1 .1 This results in the linear program
min x1 , (2)
x|Ax=b0

which can often be efficiently solved by simplex or interior-point methods [30,103,


105,151,155]. Moreover, when the optimal value of (1) is sufficiently small and the
columns of A are “approximately orthogonal,” which occur with high probability
when A is, say, a Gaussian random matrix, the solution of (2) also solves (1), i.e.,
the relaxation is exact [28–30,37,42,44,58,59,62,65,84,112,135,136]. For a noisy
signal b, Ax = b may be inconsistent and we seek a sparse solution x whose residual

1 We are using the term “relaxation” loosely since (·) majorizes  ·  only on the unit ∞-norm ball.
1

123
Structured convex optimization 265

Ax − b is small in the least square sense, either in the primal form

min x1 (3)


x| Ax−b2 ≤ρ

(ρ > 0), or in the dual form [34,45,46,123]

min Ax − b22 + τ x1 (4)


x

(τ > 0). Partial results on the accuracy of these relaxations are known [43,135,
144]. Various methods have been proposed to solve (3) and/or (4) for fixed ρ and τ ,
including interior point [34], coordinate minimization [123,124], proximal gradient
(with or without smoothing) [14,15,36], and proximal minimization [156]. Homotopy
approaches have been proposed to follow the solution as τ varies between 0 and ∞
[55,66,111,108]. The efficiency of these methods depend on the structure of A. The
polyhedral and separable structure of  · 1 is key.
A problem related to (4) is image denoising using total variation (TV) regulariza-
tion:

minn Ax − b22 , (5)


x∈B22

where n is even, B2 denotes the unit 2-norm (Euclidean) ball in 2 , A ∈ m×n is


the adjoint of the discrete (via finite difference) gradient mapping, and b ∈ m . Inte-
rior point, proximal minimization, coordinate minimization, and gradient projection
methods have been proposed for its solution [63,109,148,153,160,161].
A second problem related to (1) and of growing interest is matrix rank minimization.
The basic problem is

min rank(x), (6)


x|Ax=b0

where rank(x) is the rank of a matrix x ∈  p×n , and A is a linear mapping from  p×n
to m , and b0 ∈ m . In the case of matrix completion, we have Ax = (xi j )(i, j)∈A ,
where A indexes the known entries of x. This problem is also NP-hard, and a con-
vex relaxation has been proposed whereby rank(x) is replaced by the nuclear/trace
norm xnuc (the 1-norm of the singular values of x) [25,27,52,76,83,117,157]. The
resulting problem

min xnuc (7)


x|Ax=b0

has a more complex structure than (2), and only recently have solution methods, includ-
ing interior-point method and dual gradient method, been developed [25,76,83] and
exactness results been obtained [27,117]. For noisy b, we may consider, analogous to

123
266 P. Tseng

(4), the relaxation

min Ax − b2F + τ xnuc , (8)


x

where τ > 0 and  ·  F denotes the Frobenious-norm. This problem has applications
to dimension reduction in multivariate linear regression [78] and multi-task learn-
ing [1,3,106]. Recently, proximal/gradient methods have been applied to its solution
[78,83,133]. How well does (8) approximate (6)?
A third problem related to (1) and of recent interest is that of sparse inverse covari-
ance estimation, where we seek a positive definite x ∈ S n that is sparse and whose
inverse approximates a given sample covariance matrix s ∈ S n . This may be formu-
lated as the nonconvex problem

min − log det(x) + s, x


+ τ (x) (9)
λI xλ̄I

with 0 ≤ λ < λ̄ ≤ ∞ and τ > 0, where s, x


= trace[sx], I denotes the n × n iden-
tity matrix, and  denotes the partial ordering with respect to the cone
n S+ of positive
n

semidefinite matrices [5,9,56,77,159]. Replacing (x) by x1 := i, j=1 |xi j | yields


the convex relaxation

min − log det(x) + s, x


+ τ x1 . (10)
λI xλ̄I

Interior-point method appears unsuited for solving (10) owing to the large size of the
Newton equation to be solved. Block-coordinate minimization methods (with each
block corresponding to a row/column of x) and Nesterov’s accelerated gradient meth-
ods have been applied to its solution [5,9,56,77]. How well does (10) approximate
(9)?
A fourth problem of much recent interest is that of ad hoc wireless sensor net-
work localization [4,18,19,21,22,31,38,39,49,73,75,128]. In the basic version of this
problem, we have n points z 10 , . . . , z n0 in d (d ≥ 1). We know the last n − m points
(“anchors”) and an estimate di j ≥ 0 of the Euclidean distance di0j = z i0 −z 0j 2 between
“neighboring” points i and j for all (i, j) ∈ A, where A ⊆ ({1, . . . , m}×{1, . . . , n})∪
({1, . . . , n}×{1, . . . , m}). We wish to estimate the first m points (“sensors”). This
problem may be formulated as
 
min |z i − z j 22 − di2j | + |z i − z 0j 22 − di2j |, (11)
z 1 ,...,z m
(i, j)∈As (i, j)∈Aa

where As := {(i, j) ∈ A | i < j ≤ m} and Aa := {(i, j) ∈ A | i ≤ m < j}


are the sets of, respectively, sensor-to-sensor and sensor-to-anchor neighboring pairs.
Typically, d = 2 and two points are neighbors if the distance between them is below
some threshold, e.g., the radio range. In variants of this problem, constraints such as
bounds on distances and angles-of-arrival are also present [17,31,39,73]. This prob-
lem is NP-hard for any d ≥ 1. It is closely related to distance geometry problems

123
Structured convex optimization 267

arising in molecular conformation [20,90], graph rigidity/realization


  [2,38,49,128],
and max- min/avg dispersion [35,116,150]. Letting z := z 1 · · · z m and I denote the
d ×d identity matrix, Biswas and Ye [21,22] proposed the following convex relaxation
of (11):
  

min i j (x) − di2j 
x
(i, j)∈A
 
y zT
subject to x = 0, (12)
z I

where y = yi j 1≤i, j≤m , and


yii − 2yi j + y j j if (i, j) ∈ As ,
i j (x) :=
yii − 2z iT z 0j + z 0j 2 if (i, j) ∈ Aa .

The relaxation (12) is a semidefinite program (SDP) and is exact if it has a solu-
tion of rank d. On the other hand, (12) is still difficult to solve by existing methods
for SDP when m > 500, and decomposition methods have been proposed [21,31].
Recently, Wang, Zheng, Ye, and Boyd [149] proposed a further relaxation of (12),
called edge-based SDP (ESDP) relaxation, which is solved much faster by an interior-
point method than (12), and yields solution comparable in approximation accuracy
as (12). The ESDP relaxation is obtained by relaxing the constraint x 0 in (12)
to require only those principal submatrices of x associated with A to be positive
semidefinite. Specifically, the ESDP relaxation is

  

min i j (x) − di2j 
x
(i, j)∈A
 
y zT
subject to x = ,
z I
⎛ ⎞
yii yi j z iT
⎜ ⎟
⎝ yi j y j j z Tj ⎠ 0 ∀(i, j) ∈ A ,
s

zi zj I
 
yii z iT
0 ∀i ≤ m. (13)
z i Id

Notice that the objective function and the positive semidefinite constraints in (13) do
not depend on yi j , (i, j) ∈ A. How well does (12) approximate (11)? Only partial
results are known in the noiseless case, i.e., (11) has zero optimal value [128,142].
How well does (13) approximate (11)?
The above convex problems share a common structure, namely, they entail mini-
mizing the sum of a smooth (i.e., continuously differentiable) convex function and a

123
268 P. Tseng

“simple” nonsmooth convex function. More specifically, they have the form

min F(x) := f (x) + τ P(x), (14)


x∈E

where E is a finite-dimensional real linear space endowed with a norm  · , τ >


0, P : E → (−∞, ∞] is lower semicontinuous (lsc), convex, with dom P = {x |
P(x) < ∞} closed, and f : E → (−∞, ∞] is convex and smooth on dom f , assumed
open, and f (x) → ∞ whenever x approaches a boundary point of dom f [118]. A
well-known special case of (14) is smooth constrained convex optimization, for which
P is the indicator function for a nonempty closed convex set X ⊆ E, i.e.,

0 if x ∈ X ,
P(x) = (15)
∞ else.

The class of problems (14) was studied in [6,61,100] and by others; see [147] and
references therein. For example, (4) corresponds to

E = n ,  ·  =  · 2 , f (x) = Ax − b22 , P(x) = x1 , (16)

(5) corresponds to
n
0 if x ∈ B22 ,
E =  ,  ·  =  · 2 ,
n
f (x) = Ax − b22 , P(x) = (17)
∞ else,

(8) corresponds to

E =  p×n ,  ·  =  ·  F , f (x) = Ax − b2F , P(x) = xnuc , (18)

and (10) corresponds to

E = S n ,  ·  =  · F , f (x) = − log det(x) + s, x


, P(x) = x1 . (19)

In the case of 0 < λ < λ̄ < ∞, Lu [77] proposed a reformulation of (10) via Fenchel
duality [118,131], corresponding to

0 if x − s∞ ≤ τ,
f (x) = sup x, y
− log det(y), P(x) = (20)
λI yλ̄I
∞ else,

where x∞ = maxi, j |xi j |. (Note that P in (20) depends on τ .) Importantly, for (16),
(17), (19), (20), P is block-separable, i.e.,

P(x) = PJ (x J ), (21)
J ∈J

123
Structured convex optimization 269

where J is some partition of the coordinate indices of x. Then (14) is amenable to solu-
tion by (block) coordinatewise methods. Another special case of interest is variable
selection in logistic regression [89,127,158,134], corresponding to


m

E = n ,  ·  =  · 2 , f (x) = log 1+e Ai x −bi Ai x, P(x) = x1 , (22)


i=1

with Ai the ith row of A ∈ m×n and bi ∈ {0, 1}; also see [57,122,125,126] for
variants. A closely related problem is group variable selection (“group lasso”), which
uses instead

P(x) = ω J x J 2 , (23)
J ∈J

with ω J ≥ 0, in (16) and (22) [89,158]. The TV image reconstruction model in [148,
Eq. (1.3)] and the TV-L1 image deblurring model in [153, Eq. (1.5)] are special cases
of (14) with f quadratic and P of the form (23).
How accurate approximations are the convex relaxations (7), (8), (10), (12), (13),
and other related convex relaxations such as that for max- min dispersion, allowing
for noisy data? What are the iteration complexities and asymptotic convergence rates
of first-order gradient methods for solving these and related problems? First-order
methods are attractive since they can exploit sparse or partial separable structure of f
and block-separable structure of P.
Throughout, n denotes the space of n-dimensional real column vectors, S n =
{x ∈ n×n | x = x T }, and T denotes transpose. For any x ∈ n and nonempty
J ⊆ {1, . . . , n}, x j denotes jth coordinate of x, x J denotes subvector of x comprising

1/ρ
x j , j ∈ J , and xρ = n
|x | ρ for 0 < ρ < ∞, and x∞ = max j |x j |.
j=1 j
For any x ∈  p×n , xi j denotes the (i, j)th entry of x.

2 Approximation accuracy of the convex relaxations

Let a j denote column j of A in (1), normalized so that a j 2 = 1, for all j. It is easily


seen that the relaxation (2) is exact (i.e., its solution also solves (1)) if a1 , . . . , an
are pairwise orthogonal. This hints that (2) may remain exact if these columns are
approximately orthogonal. Mallat and Zhang [84] introduced the following measure
of approximate orthogonality, called “mutual coherence” (variously known as “mutual
incoherence” [44]), in their study of matching pursuit:
 
 
μ := max aiT a j  . (24)
1≤i= j≤n


There exist overcomplete sets with n ≈ m 2 and μ ≈ 1/ m; see [130, pages 265–266]
and references therein. The quantity μ is central to the analysis in [42–44,58,59,62,
1
65,84,135,136,144]. In particular, (2) is exact whenever N 0 < 21 (μ−1 +1) = O(n 4 ),

123
270 P. Tseng

where N 0 denotes the optimal value of (1) [42, Theorem 7], [65, Theorem 1], [58],
[135, Theorems A and B]. When b is noisy, it can be shown that the solution x ρ of
the noise-aware model (3) is close to the solution x 0 of the original noiseless problem
(1), i.e.,

(δ + ρ)2
x ρ − x 0 22 ≤ (25)
1 − μ(4N 0 − 1)

whenever ρ ≥ δ := b − b0 2 and N 0 < ( 21 − O(μ))μ−1 + 1; see [43, Theorem 3.1],


[144, Theorem 1]. The bound (25) also extends to ρ < δ with some limitations [144,
Theorem 1]. In addition to convex relaxation, greedy methods can also be shown to
recover or approximate x 0 and identify the support of x 0 under similar conditions on
N 0 , δ, and ρ; see [43, Theorem 5.1(a)], [135, Theorems A and B], [144, Theorems 3
and 4].
A different measure of approximate orthogonality, introduced by Candès and Tao
[30], is the “restricted isometry” constant μ N , defined as the smallest scalar satisfying

(1 − μ N )x22 ≤ Ax22 ≤ (1 + μ N )x22 ∀x with (x) ≤ N . (26)

In fact, μ2 = μ.2 It is known that x ρ − x 0 2 = O(ρ) whenever ρ ≥ δ and μ3N 0 +


3μ4N 0 < 2 [29, Theorem 1]; also see [29, Theorem 2] for a relaxation of the latter
condition. Candes [26] and Baraniuk et al. [10] give more recent and accessible discus-
sions. For A randomly generated from certain classes of distributions (e.g., Gaussian),
this condition holds with high probability (for N 0 in the order of n to within log
factors); see [28,30,41,137,47,112] and references therein for the noiseless case and
[29,40] for the noisy case. The approximation bounds in [29,40] for (3) require ρ ≥ δ,
as well as n = O(m) in [40]. Is extension to ρ < δ possible, as in [144, Theorem 1]?
Can the aforementioned exactness results and error bounds be extended to matrix
rank minimization (6) and its convex relaxations (7) and (8)? Recent progress has been
made in the noiseless case [27,117]. The nuclear norm has a more complex structure
than the 1-norm.
For the sensor network localization problem (11), its SDP relaxation (12) is exact
if the distances di j , (i, j) ∈ A, are exact (i.e., di j = di0j for all (i, j) ∈ A) and any
relative-interior solution (i.e., a point in the relative interior of the solution set) of (12)
has rank d [128, Theorem 2]. However, this assumption is quite strong. What can we
say in general? Remarkably, a kind of partial exactness still holds. Biswas and Ye [22,
Section 4] introduced the notion of individual traces for a feasible solution x of (12),
defined as

tri (x) := yii − z i 2 , i = 1, . . . , m,

2 Why? Since a  = 1 for all j, (26) with N = 2 reduces to 1 − μ ≤ 1 + 2a T a x x ≤ 1 + μ for


j 2 2 i j i j 2
all i  = j and all xi , x j ∈  with xi2 + x 2j = 1. Since (xi , x j )  → 2xi x j has minimum value of −1 and
maximum value of 1 on the unit sphere, the smallest μ2 for which this holds is precisely μ given by (24).

123
Structured convex optimization 271

or, equivalently, the diagonals of the Schur complement y − z T z. It can be shown


that, for any relative-interior solution x of (12), tri (x) = 0 implies z i is invariant
over the solution set of (12) and hence equals z i0 , the true position of sensor i, when
distances are exact [142, Proposition 4.1]. An analogous result holds for the ESDP
relaxation (13) [149, Theorem 2]. Thus, upon finding a relative-interior solution x of
(12) or (13), we know that every sensor i with tri (x) = 0 (within numerical accuracy)
has z i as its true position. Is this result sharp? Yes, at least when the distances are
exact. In this case, the converse holds for the ESDP relaxation; see [115, Theorem
1]. It is not known if the converse holds for the SDP relaxation. On the other hand,
an example in [115, Example 2] shows that, in the noisy case where the distances
are inexact, tri (x) = 0 is not a reliable indicator of sensor i position accuracy even
when x is the unique solution of the SDP/ESDP relaxation. The reason is that the solu-
tion set of the SDP/ESDP relaxation can change abruptly under arbitrarily small data
perturbation. This contrasts with a second-order cone (SOCP) relaxation of the same
problem, whose solution set changes gradually with data perturbation [142, Section 7].
To overcome this difficulty, a noise-aware version of the ESDP relaxation, analogous
to (3) for compressed sensing, was proposed in [115, Section 5]. Specifically, for any
ρ = (ρi j )(i, j)∈A ≥ 0, let

ρ
 
Sresdp := x | x is feasible for (13) and |i j (x) − di2j | ≤ ρi j ∀(i, j) ∈ A . (27)

ρ
Then Sresdp contains the noiseless ESDP solutions whenever ρ ≥ δ := (|di2j −
(di0j )2 |)(i, j)∈A . Moreover, defining its “analytic center” as

⎛ ⎞
yii yi j z iT
ρ  ⎜ ⎟ 
m
x := arg min − log det ⎝ yi j y j j z Tj ⎠ − log tri (x), (28)
ρ
x∈S (i, j)∈As i=1
resdp zi zj I

ρ
it can be shown that, for ρ > δ sufficiently small, tri (x ) = 0 is a reliable indicator of
sensor i position accuracy; see [115, Theorem 4]. Moreover, for any ρ > δ, we have
the following computable bound on the individual sensor position accuracy:

ρ  ρ 1
z i − z i0  ≤ 2|As | + m · tri (x ) 2 ∀i.

Can these results be extended to the SDP relaxation (12) or SOS relaxations [69,104]
or to handle additional constraints such as bounds on distances and angles-of-arrival?
Closely related to sensor network localization is the continuous max- min disper-
sion problem, whereby, given existing points z m+1 0 , . . . , z n0 in d (d ≥ 1), we wish
to locate new points z 1 , . . . , z m inside, say, a box [0, 1]d that are furthest from each
other and existing points [35,150]:
 
max min min ωi j z i − z j 2 , min ωi j z i − z 0j 2 , (29)
z 1 ,...,z m ∈[0,1]d i< j≤m i≤m< j

123
272 P. Tseng

with ωi j > 0. Replacing “max” by “sum” yields the max-avg dispersion problem
[116, Section 4]. It can be shown (by reduction from 0/1-integer program feasibil-
ity) that (29) is NP-hard if d is a part of the input, even when m = 1. How accurate
approximations are their convex (e.g., SDP, SOCP) relaxations? Another related prob-
lem arises in protein structure prediction, whereby the distances between neighboring
atoms and their bond angles are known, and we wish to find positions of the atoms
that minimize a certain energy function [120]. Although the energy function is com-
plicated and highly nonlinear, one can focus on the most nonlinear terms, such as the
Lennard-Jones interactions, in seeking approximate solutions.

3 Gradient methods for solving the convex relaxations

How to solve (14)? We will assume that ∇ f is Lipschitz continuous on a closed convex
set X ⊇ dom P, i.e.,

∇ f (x) − ∇ f (y)∗ ≤ Lx − y ∀x, y ∈ X , (30)

for some L > 0, where E ∗ is the vector space of continuous linear functionals on
E, endowed with the dual norm x ∗ ∗ = supx≤1 x ∗ , x
and x ∗ , x
is the value of
x ∗ ∈ E ∗ at x ∈ E. This assumption, which is satisfied by (16), (17), (18), (20), (22),
can be relaxed to hold for (19) as well. Owing to its size and structure, (14) is suited for
solution by first-order gradient methods, whereby at each iteration f is approximated
by a linear function plus a “simple” proximal term. We describe such methods below.
To simplify notation, we denote the linearization of f in F at y ∈ X by

 F (x; y) := f (y) + ∇ f (y), x − y


+ τ P(x) ∀x. (31)

3.1 Proximal gradient methods

Choose a strictly convex function η : E → (−∞, ∞] that is differentiable on an open


set containing X ,3 Then the function

D(x, y) := η(x) − η(y) − ∇η(y), x − y


∀y ∈ X , ∀x ∈ E,

is nonnegative and zero if and only if x = y, so D acts as a proximity measure. This


function was studied by Bregman [24] and many others; see [8,11,33,48,70,132] and
references therein. By scaling η if necessary, we assume that

1
D(x, y) ≥ x − y2 ∀ x, y ∈ X . (32)
2
The classical gradient projection method of Goldstein and Levitin, Polyak (see
[16,114]) naturally generalizes to solve (14) using the Bregman function D, with

3 This assumption can be relaxed to η being differentiable on the interior of X only.

123
Structured convex optimization 273

constant stepsize 1/L:


 
L
x k+1 = arg min  F (x; x k ) + D(x, x k ) , k = 0, 1, . . . , (33)
x 2

where x 0 ∈ dom P. This method, which we call the proximal gradient (PG) method,
is closely related to the mirror descent method of Nemirovski and Yudin [93], as is
discussed in [8,12]; also see [61] for the case of η(·) = 21  · 22 . When P is the 1-norm
or has the block-separable form (21), the new point x k+1 can be found in closed form,
which is a key advantage of this method for large-scale optimization; see [147,152]
and references therein. When P is given by (15) and X is the unit simplex, x k+1 can be
found in closed form in O(n) floating point operations (flops) by taking η(x) to be the
x log x-entropy function [12, Section 5], [99, Lemma 4]. Moreover, the corresponding
D satisfies (32) with  ·  being the 1-norm [12, Proposition 5.1], [99, Lemma 3]. If
η(·) = 21  · 22 is used instead, then x k+1 can still be found in O(n) flops, but this
requires using a more complicated algorithm; see [72] and references therein. It can
be shown that
 
L
F(x k ) − inf F ≤ O ∀k,
k

and hence O( L
) iterations suffice to come within
> 0 of inf F; see, e.g., [14,
Theorem 3.1], [96, Theorem 2.1.14], [114, page 166], [145, Theorem 5.1].
In a series of work [94,95,99] surveyed in [96] (also see [114, page 171]), Nesterov
proposed three methods for solving the smooth constrained case (15) that, at each iter-
ation, use either one or two projection steps together with extrapolation to accelerate
convergence. These accelerated gradient projection methods generate points {x k } that
achieve
 
L
F(x ) − inf F ≤ O
k
∀k,
k2

so that O( L
) iterations suffice to come within
> 0 of inf F. In [99], it is shown
that various large convex-concave optimization problems can be efficiently solved
by applying these methods to a smooth approximation with Lipschitz constant L =
O(1/
). These methods have inspired various extensions and variants [7, Section 5],
[14,64,74], [96, Section 2.2], [99,101,143], as well as applications to compressed
sensing, sparse covariance selection, matrix completion, etc. [15,5,67,77,78,83,97,
133]. In particular,
 all three methods can be extended to solve (14) in a unified way and

)
L
achieve O( iteration complexity; see [143] and discussion below. The work per
iteration is between O(n) and O(n 3 ) flops for the applications of√Sect. 1. In contrast,
the number of iterations for interior-point methods is at best O( n log 1
) and the
work per iteration is typically between O(n 3 ) and O(n 4 ) ops. Thus, for moderate

(say,
= .001), moderate L (which may depend on n), and large n (say, n ≥ 10000),
a proximal gradient method can outperform interior-point methods.

123
274 P. Tseng

The first accelerated proximal gradient (APG) method for solving


√ (14) is the sim-
plest, but requires E to be a Hilbert space (i.e., E ∗ = E,  ·  = ·, ·
) and η(·) =
−1 ∈ dom P and θ = θ
2  ·  , so that D(x, y) = 2 x − y . For any x = x −1 = 1,
1 2 1 2 0
0
it generates (for k = 0, 1, . . .)

−1
y k = x k + θk (θk−1 − 1)(x k − x k−1 ), (34)
 
L
x k+1
= arg min  F (x; y ) + x − y  ,
k k 2
(35)
x 2

θk + 4θk − θk
4 2 2
θk+1 = . (36)
2
θk
An inductive argument shows that θk ≤ k+2 2
for all k. As k → ∞, we have θk−1 =

1 − θk → 1, so that, by (34), y is asymptotically an isometric extrapolation from
k

x k−1 to x k . In particular, y k may lie outside of dom P. However, since x k , x k−1 ∈


dom P, it is readily seen that y k ∈ {2x − w | x, w ∈ dom P} (since x + α(x − x −1 ) =
2x − w with w = (1 − α)x + αx −1 ). This method was recently proposed by Beck
and Teboulle [14] as an extension of Nesterov’s first method [94]; also see [64] for
refinements in the unconstrained case of P ≡ 0.
The second APG method imposes no requirement on E or D, and maintains y k ∈
dom P, so it is less restrictive than the first method. For any x 0 , z 0 ∈ dom P and θ0 = 1,
it generates (for k = 0, 1, . . .)

y k = (1 − θk )x k + θk z k , (37)
 
z k+1 = arg min  F (x; y k ) + θk L D(x, z k ) , (38)
x
x k+1 = (1 − θk )x k + θk z k+1 , (39)

with θk+1 given by (36). Since 0 < θk ≤ 1, we have from x k , z k ∈ dom P that
y k ∈ dom P. In the smooth constrained case (15), this method corresponds to Auslen-
der and Teboulle’s extension [8, Section 5] of Nesterov’s second method [95]; also see
[96, page 90]. A variant proposed by Lan, Lu, and Monteiro [74, Section 3] replaces
(39) by a PG step from y k .
The third APG method differs from the second method mainly in the computa-
tion of z k+1 . For any x 0 ∈ dom P and z 0 = arg min η(x), θ0 = 1, it generates (for
x∈dom P
k = 0, 1, . . .) y k by (37),


k
 F (x; y i )
z k+1
= arg min + Lη(x) , (40)
x θi
i=0

and x k+1 , θk+1 by (39), (36). Thus, (40) replaces  F (x; y k )/θk in (38) by its cumu-
lative sum and replaces D(·, z k ) by η(·). In the case of (15), this method is similar to
Nesterov’s third method [99] but with only one projection instead of two. In the case

123
Structured convex optimization 275

of E being a Hilbert space and η(·) = 21  · 2 , this method bears some resemblance to
an accelerated dual gradient method in [101, Section 4].
The preceding accelerated methods may look unintuitive, but they arise “naturally”
from refining the analysis of the PG method, as is discussed in Appendix A. Moreover,
these methods are equivalent (i.e., generate the same sequences) when P ≡ 0, E is a
Hilbert space, and the Bregman function D is quadratic (i.e., η(·) = 21  · 2 ). Some
extensions of these methods, including cutting planes, estimating L, are discussed in
[143]. In particular, L can be estimated using backtracking: increase L and repeat the
iteration whenever a suitable sufficient descent condition (e.g., 52 or 56) is violated;
see [14,94,143]. Below we summarize the iteration complexity of the PG and APG
methods.

Theorem 1 (a) Let {x k } be generated by the PG method (33). For any x ∈ dom P,
we have

1
F(x k ) ≤ F(x) + L D(x, x 0 ) ∀k ≥ 1.
k

(b) Assume E is a Hilbert space, η(·) = 21  · 2 , and X ⊇ {2x − w | x, w ∈ dom P}.


Let {x k } be generated by the first APG method (34)–(36). For any x ∈ dom P,
we have

F(x k ) ≤ F(x) + θk−1


2
L D(x, x 0 ) ∀k ≥ 1.

(c) Let {x k } be generated by the second APG method (36)–(39). For any x ∈ dom P,
we have

F(x k ) ≤ F(x) + θk−1


2
L D(x, z 0 ) ∀k ≥ 1.

(d) Let {x k } be generated by the third APG method (36), (37), (39), (40). For any
x ∈ dom P, we have

F(x k ) ≤ F(x) + θk−1


2
L(η(x) − η(z 0 )) ∀k ≥ 1.

A proof of Theorem 1(a)–(c) is given in Appendix A. A proof of part (d) can be


found in [143, Corollary 3(a)]. Taking any x satisfying F(x) ≤ inf F + 2
in Theorem
 
1 yields F(x k ) ≤ inf F +
after k = O L
iterations for the PG method and after
 
k=O L

iterations for the APG methods.
How can we terminate the PG and APG methods in practice with a guaranteed
optimality gap? The bounds in Theorem 1 requires estimating the distance to an 2
-
minimizer of F and are rather conservative. In the case where f has the form

f (x) = max φ(x, v),


v∈V

123
276 P. Tseng

for some saddle function φ and convex set V in a suitable space, duality gap can be
used to terminate the methods [92,99,98]. The dual problem is maxv Q(v), with dual
function

Q(v) := min{φ(x, v) + τ P(x)}.


x

Then we compute (say, every 5 or 10 iterations) a candidate dual solution

v k = arg max φ(x k , v),


v

and check that F(x k ) − Q(v k ) ≤


. In fact, assuming furthermore that dom P is
bounded, it can be shown using an idea of Nesterov [99, Theorem 3] that

0 ≤ F(x k+1 ) − Q(v̄ k ) ≤ θk2 L max (η(x) − η(z 0 )) ∀k ≥ 0,


x∈dom P

where x k+1 , y k , θk are generated by the third APG method, and we let

v k = arg max φ(y k , v), v̄ k = (1 − θk )v̄ k−1 + θk v k .


v

with v̄ −1 = 0 [143, Corollary 3(c)]; also see [77, Theorem 2.2] and [98] for related
results in the constrained case (15). Analogous bounds hold for the first two APG
methods; see [143, Corollaries 1(b) and 2].
When applied to (4), the first APG method yields an accelerated version of the
iterative thresholding method of Daubechie et al. [14,36]. What about the other two
methods? How efficiently can these methods be applied to solve (5), (7), (8), (10),
(22), and related problems such as those in [50,122,125,126,154]? When applied to
(7) and (8), singular value decomposition is needed at each iteration, and efficiency
depends on the cost for this decomposition. However, only the largest singular val-
ues and their associated singular vectors are needed [25]. Can these be efficiently
computed or updated? Some progress on this have recently been made [83,133].
Can the iteration complexity be further improved? The proofs suggest that the con-
vergence rate can be improved to O( k1p ) ( p > 2) if we can replace  · 2 in the
proofs by  ·  p ; see Appendix A. However, this may require using a higher-order
approximation of f in  F (·; y), so the improvement would not come “for free”.

3.2 Block-coordinate gradient methods

When E is a Hilbert space and P is block-separable (21), we can apply (33) block-
coordinatewise, possibly with L and D dynamically adjusted, resulting in a block-
coordinate gradient (BCG) method. More precisely, given x k ∈ dom P, we choose Jk
as the union of some subcollection of indices in J and choose a self-adjoint positive

123
Structured convex optimization 277

definite linear mapping H k : E → E (ideally H k ≈ ∇ 2 f (x k )), compute


 
1 k 
d k k k 
= arg min  F (x + d; x ) + H d, d
d j = 0 ∀ j ∈ Jk , (41)
d 2

and update

x k+1 = x k + αk d k , (42)

with stepsize αk > 0 [147,145]. This method may be viewed as a coordinate/SOR


version of a sequential quadratic programming (SQP) method, and it is related to the
variable/gradient distribution methods for unconstrained smooth optimization [53,60,
87,121] and (block) coordinate minimization methods [16,85,88,105,123,138,141].
In the case of H k d = Ld and Jk comprising all coordinate indices of x, (41)–(42)
with αk = 1 reduces to the PG method (33) with η(·) = 21  · 2 .
How to choose αk and Jk ? Various stepsize rules for smooth optimization [16,54,
105] can be adapted to this nonsmooth setting. One that works well in theory and
practice is an Armijo-type rule adapted from SQP methods:
 
αk = max α ∈ {1, β, β 2 , . . . } | F(x k + αd k ) ≤ F(x k ) + ασ k , (43)

where 0 < β, σ < 1 and k is the difference between the optimal value of (41)
and F(x k ). This rule requires only function evaluations, and k predicts the descent
from x k along d k . For global convergence, the index subset Jk is chosen either in a
Gauss-Seidel manner, i.e., Jk ∪ · · · ∪ Jk+K covers all subsets of J for some constant
K ≥ 0 [32,89,107,141,147] or Jk is chosen in a Gauss-Southwell manner to satisfy

all
k ≤ ωk ,

all
where 0 < ω < 1 and k denotes the analog of k when Jk is replaced by the entire
coordinate index set [147,145,145]. Moreover, assuming (30) with X = dom P, the
BCG method using the Gauss-Southwell choice of Jk finds an
-minimizer of F in
O( L
) iterations [145, Theorem 5.1].
The above BCG method, which has been successful for compressed sensing and
variable selection in regression [89,127,134] and can be extended to handle linear con-
straints as in support vector machine (SVM) training [145,113], may also be suited
for solving (5), (10), (13), (20), and related problems. When applied to (10) with Jk
indexing a row/column of x, d k and αk are computable in O(n 2 ) flops using Schur
complement and properties of determinant. This method may be similarly applied
to Lu’s reformulation (20). This contrasts with the block-coordinate minimization
method in [9] which uses O(n 3 ) flops per iteration. This method can also be applied
to solve (28) by using a smooth convex penalty 2θ 1
max{0, | · | − ρi j }2 (θ > 0) for each
constraint in (27) of the form | · | ≤ ρi j . By choosing each coordinate block to com-
prise z i , yii , and (yi j ) j|(i, j)∈A , the computation distributes, with sensor i needing to

123
278 P. Tseng

communicate only with its neighbors when updating its position—an important con-
sideration for practical implementation. The resulting method can accurately position
up to 9000 sensors in little over a minute; see [115, Section 7.3]. Can Nesterov’s
extrapolation techniques of Sect. 3.1 be adapted to the BCG method?

3.3 Incremental gradient methods

A problem that frequently arises in machine learning and neural network training has
the form (14) with


m
f (x) = f i (x), (44)
i=1

where each f i : E → (−∞, ∞] is smooth, convex on an open subset of E contain-


ing dom P. (The convexity assumption can be relaxed depending on the context.) For
example, f i (x) may be the output error for an input-out system (e.g., a classifier),
parameterized by x, on the ith training example. P given by (15) would confer con-
straints on x and P(·) =  · 1 would induce a sparse representation. m may be large.
A popular approach to minimizing f of the form (44) is by an incremental gradient
method (“on-line back-propagation”):

x k+1 = x k − αk ∇ f ik (x k ), (45)

with i k chosen cyclically from {1, . . . , m} and stepsize αk > 0 either constant or
adjusted dynamically; see [16,71,79,86,91,129,139] and references therein. When
some the ∇ f i ’s are “similar”, this incremental method is more efficient than (pure)
gradient method since it does not wait for all component gradients ∇ f 1 , ..., ∇ f m to be
evaluated before updating x. However, global convergence (i.e., ∇ f (x k ) → 0) gener-
ally requires αk → 0, which slows convergence. Recently, Blatt, Hero and Gauchman
[23] proposed an aggregate version of (45) that approximates ∇ f (x k ) by


k
g =
k
∇ f i (x  ).
=k−m+1

This method achieves global convergence for any sufficiently small constant stepsize
αk , but requires O(mn) storage. We can reduce the storage to O(n) by updating a
cumulative average of the component gradients
m k
k
gsum = gsum
k−1
+ ∇ f ik (x k ), gk = g ,
k sum
−1 = 0. We then use g k in the PG, APG, or BCG method. The resulting incre-
with gsum
mental methods share features with recently proposed averaging gradient methods
[68,102]. What are their convergence properties, iteration complexities, and practical
performances?

123
Structured convex optimization 279

4 Error bound and linear convergence of gradient methods

Analogous to superlinear convergence for second-order methods, linear convergence


is a good indicator of efficiency for first-order methods. Key to a linear convergence
analysis is the following Lipschizian error bound on dist(x, X̄ ) := min x − s in
s∈X̄
terms of the norm of the residual
 
1
R(x) := arg min  F (x + d; x) + d2 , (46)
d 2

where X̄ denotes the set of minimizers of F, which we assume to be nonempty, and


 F (·; x) contains the linearization of F at x (cf. (31)).

EB condition: For any ζ ≥ min F, there exist scalars κ > 0 and


> 0 such that

dist(x, X̄ ) ≤ κR(x) whenever F(x) ≤ ζ, R(x) ≤


. (47)

Under the EB condition, asymptotic linear and even superlinear convergence can be
established for various methods, including interior point, gradient projection, proximal
minimization, coordinate minimization, and coordinate gradient methods—even if X̄
is unbounded; see [13,51,81,82,140,147,145,146] and references therein. Moreover,
the EB condition holds under any of the following conditions; see [147, Theorem 4]
as well as [110, Theorem 3.1], [82, Theorem 2.1] for the constrained case (15).
C1. E = n , f (x) = h(Ax) + c, x
for all x ∈ n , where A ∈ m×n , c ∈ n ,
and h is a strongly convex differentiable function on m with ∇h Lipschitz
continuous on m . P is polyhedral.
C2. E = n , f (x) = max y∈Y { y, Ax
− h(y)} + c, x
for all x ∈ n , where
Y is a polyhedral set in m , A ∈ m×n , c ∈ n , and h is a strongly con-
vex differentiable function on m with ∇h Lipschitz continuous on m . P is
polyhedral.
C3. f is strongly convex and satisfies (30) for some L > 0.
What if f is not strongly convex and P is non-polyhedral? In particular, we are
interested in the group lasso for linear and logistic regression (see (16), (22), (23)), for
which f is not strongly convex (unless A has full column rank) and P is non-polyhedral
(unless J is a singleton for all J ∈ J ). The following new result shows that the error
bound (47) holds for the group lasso. The proof, given in Appendix B, exploits the
structure of the weighted sum of 2-norms (23). To our knowledge, this is the first
Lipschitzian error bound result for f not strongly convex and P non-polyhedral.
Theorem 2 Suppose that E = n , P has the form (23) with ω J > 0 for all J ∈ J ,
and f has the form

f (x) = h(Ax), (48)

where A ∈ m×n , and h : m → (−∞, ∞] is differentiable on domh, which is


assumed to be convex and open. Also suppose that (a) h is strongly convex and ∇h

123
280 P. Tseng

is Lipschitz continuous on any compact convex subset of domh, and (b) h(y) → ∞
whenever y approaches a boundary point of domh. If X̄ = ∅, then

{x | F(x) ≤ ζ } is bounded ∀ζ ∈ , (49)

and the EB condition (47) holds.

The assumptions on h in Theorem 2 are satisfied by

1
h(y) = y − b2 ,
2
corresponding to linear regression, and


m
h(y) = log(1 + e yi ) − b, y
with b ∈ {0, 1}m ,
i=1

corresponding to logistic regression (22). The assumptions are also satisfied by


m
h(y) = − log(yi ) + b, y
with b ≥ 0,
i=1

which arises in likelihood estimation under Poisson noise [122]. In the first two exam-
ples, h is bounded from below by zero. In the third example, h is unbounded from
below but tends to −∞ sublinearly. Since P given by (23) is homogeneous of degree
1, it is readily seen that X̄ = ∅ for all three examples. (An example of X̄ = ∅ is
min x e x + 2x + |x|.)
However, X̄ need not be a singleton. An example is

1
min |x1 + x2 − 2|2 + |x1 | + |x2 |,
x 2

for which X̄ = {(1 − t, t) | t ∈ [0, 1]}. Can Theorem 2 be extended to f satisfying C2


and P given by (23) or to (17) or (18)? Can the constant κ in (47), which determines the
convergence ratio, be estimated for compressed sensing (16) in terms of the restricted
isometry constant μ N ?

Corollary 1 Under the assumptions of Theorem 2, let {(x k , H k , Jk , αk )} be generated


by the BCG method (41)–(42) with (i) λI  H k  λ̄I for all k (0 < λ ≤ λ̄), (ii) {Jk }
cycling through J ∈ J , and (iii) {α k } chosen by the Armijo rule (43). If X̄ = ∅, then
{F(x k )}k∈T converges at least Q-linearly and {x k }k∈T converges at least R-linearly,
where T = {0, K , 2K , . . . } and K is the cardinality of J .

Proof Theorem 2 shows that [147, Assumption 2(a)] is satisfied. By [147, Theorem
1(a)], {F(x k )} ↓, so, by (49), {x k } is bounded. Since F is convex, [147, Assumption
2(b)] is automatically satisfied. Also, {x k } is lies in a compact convex subset of dom f ,

123
Structured convex optimization 281

over which ∇ f is Lipschitz continuous (since ∇h is Lipschitz continuous over the the
image of this subset under A). Conditions (i)–(iii) imply that the remaining assump-
tions in [147, Theorem 2(b)] are satisfied. In particular, the restricted Gauss-Seidel
rule in [147] holds with the given T . Since {x k } is bounded, [147, Theorem 2(b)]
implies that {F(x k )}k∈T converges at least Q-linearly and {x k }k∈T converges at least
R-linearly [107]. 

By Corollary 1, the method in [158], [89, Section 2.2.1] for linear group lasso (corre-
sponding to H k = A T A) and the method in [89, Section 2.2.2] for logistic group lasso
(corresponding to H k = νk I with νk > 0) attain linear convergence. Linear conver-
gence of the block-coordinate minimization methods used in [148] for TV-regularized
image
 reconstruction, with f (w, u) = 21 w − Bu22 + γ2 Au − b22 , P(w, u) =
J w J 2 , and in [153] for TV-regularized image
deblurring, with f (w, u, z) =
γ
2 w − Bu2 + 2 Au − b − z2 , P(w, u, z) = J ω J w J 2 + z1 , can be ana-
1 2 2

lyzed similarly using a generalization of Theorem 2 to allow ω J = 0 for some J . Can


the linear convergence analysis be extended to the APG methods or their variants?

5 Appendix A: Analyzing the PG and APG methods

Since f is convex and (30) holds with X ⊇ dom P, we have from (31) that

L
F(x) ≥  F (x; y) ≥ F(x) − x − y2 ∀x ∈ dom P, y ∈ X . (50)
2
The following “3-point” property is also key; see [33, Lemma 3.2], [74, Lemma 6],
[143, Section 2].

3-Point Property: For any proper lsc convex function ψ : E → (−∞, ∞] and any
z ∈ X , if η is differentiable at z + = arg min x {ψ(x) + D(x, z)}, then

ψ(x) + D(x, z) ≥ ψ(z + ) + D(z + , z) + D(x, z + ) ∀x ∈ dom P.

The APG methods can be motivated by analyzing the PG method (33). Let {x k } be
generated by the PG method. For any x ∈ dom P and any k ∈ {0, 1, . . . }, we have

L k+1
F(x k+1 ) ≤  F (x k+1 ; x k ) + x − x k 2
2
≤  F (x k+1 ; x k ) + L D(x k+1 , x k )
≤  F (x; x k ) + L D(x, x k ) − L D(x, x k+1 )
≤ F(x) + L D(x, x k ) − L D(x, x k+1 ) ∀x ∈ dom P, (51)

where the first and fourth inequalities use (50), the second inequality uses (32), and
the third inequality uses (33) and the 3-Point Property with ψ(x) =  F (x; x k )/L.
Letting ek = F(x k ) − F(x) and k = L D(x, x k ), this simplifies to the recursion

ek+1 ≤ k − k+1 ∀k ≥ 0.

123
282 P. Tseng

It follows from ek+1 ≤ ek that (k + 1)ek+1 ≤ 0 − k+1 ≤ 0 , which proves


Theorem 1(a).
The above proof suggests that, for faster convergence, we should find a similar
recursion as (51) but with L scaled by something tending to zero with k. To do this,
we need to modify (33). Suppose for simplicity E is a Hilbert space and η(·) = 21  · 2
(so that D(x, y) = 21 x − y2 ). One such modification is to replace x k in (33) by
some y k ∈ X to be determined, yielding (35). Then, as in the above proof for the PG
method, we have

L k+1
x
F(x k+1 ) ≤  F (x k+1 ; y k ) + − y k 2 (52)
2
L L
≤  F (y; x k ) + y − y k 2 − y − x k+1 2
2 2
L L
≤ F(y) + y − y  − y − x k+1 2
k 2
∀y ∈ dom P, (53)
2 2

where the first and third inequalities use (50), and the second inequality uses (35)
and the 3-Point Property. To get L to be scaled by something tending to zero, set
y = (1−θk )x k +θk x in the above inequality, with x ∈ dom P arbitrary and 0 < θk ≤ 1
to be determined. We can then factor θk out of x to scale L, yielding

L
F(x k+1 ) ≤ F((1 − θk )x k + θk x) + (1 − θk )x k + θk x − y k 2
2
L
− (1 − θk )x k + θk x − x k+1 2
2
L
= F((1 − θk )x k + θk x) + θk2 x + (θk−1 − 1)x k
2
L 2
−θk y  − θk x + (θk−1 − 1)x k − θk−1 x k+1 2 ,
−1 k 2
2

where we have rearranged the terms to look like the recursion (51). We want the two
terms inside  2 to have the form “x − z k ” and “x − z k+1 ”, which we get by setting

z k = −(θk−1 − 1)x k + θk−1 y k

and y k by (34). Using also the convexity of F, we then obtain that

L L
F(x k+1 ) ≤ (1−θk )F(x k )+θk F(x)+θk2 x −z k 2 −θk2 x −z k+1 2 ∀k. (54)
2 2

Letting ek = F(x k ) − F(x) and k = 2 x


L
− z k 2 , this simplifies to

ek+1 ≤ (1 − θk )ek + θk2 k − θk2 k+1 .

123
Structured convex optimization 283

1−θk+1
Upon dividing both sides by θk2 , we see that, by choosing θk+1 so that 1
θk2
= θk+1
2

(which, upon solving for θk+1 , yields (36)), this rewrites as the recursion

1 − θk+1 1 − θk
ek+1 + k+1 ≤ ek + k ,
θk+1
2 θk2

which propagates backwards to yield

1 − θk+1 1 − θ0
ek+1 + k+1 ≤ e0 + 0 ∀k.
θk+1
2 θ02

Since 1−θ
θ2
k+1
= 1
θk2
, setting θ0 = 1 simplifies this to θ12 ek+1 ≤ 0 −k+1 ≤ 0 . Also,
k+1 k
we have from (34) and taking θ−1 = 1 that z 0 = y 0 = x 0 . This proves Theorem 1(b).
The preceding proof depends crucially on rearranging terms inside  2 , so it cannot
be directly extended to other Bregman functions D. However, (54) suggests we seek
a recursion of the form

F(x k + 1 ) ≤ (1 − θk )F(x k ) + θk F(x) + θk2 L D(x, z k ) − θk2 L D(x, z k + 1 ) ∀k, (55)

which, as our derivation of (51) and (54) suggests, can be achieved by setting z k+1 by
(38) and using the 3-Point Property, setting x k+1 by (39) (analogous to our setting of
y in (53)), and then choosing y k so that x k+1 − y k = θk (z k+1 − z k ) (which works out
to be (37)). Specifically, for any k ∈ {0, 1, . . . }, we have

L k+1
F(x k+1 ) ≤  F (x k+1 ; y k ) + x − y k 2
2
Lθk2 k+1
=  F ((1 − θk )x k + θk z k+1 ; y k ) + z − z k 2
2
≤ (1 − θk ) F (x k ; y k ) + θk  F (z k+1 ; y k ) + θk2 L D(z k+1 , z k )

≤ (1 − θk ) F (x k ; y k ) + θk  F (x; y k ) + θk L D(x, z k ) − θk L D(x, z k + 1 )


∀x ∈ domh, (56)

where the first inequality uses (50), the second inequality uses the convexity of
 F (·; y k ) and (32), the last inequality uses the 3-Point Property with ψ(x) =
 F (x; y k )/(θk L). Using (50) yields (55), from which Theorem 1(c) follows.
The third APG method (see 40) can be derived and analyzed similarly using an
analog of the 3-Point Property for η; see [143, Property 2]. For brevity we omit the
details.

6 Appendix B: A Proof of Theorem 2

Througout we assume that the assumptions of Theorem 2 are satisfied. By scaling f


by 1/τ , we will assume without loss of generality that τ = 1.

123
284 P. Tseng

Lemma 1 For any x ∈ n , letting g = ∇ f (x), we have for all J ∈ J that



⎨ −x J if g J − x J 2 ≤ ω J ,
R(x) J =

(57)
⎩− 1 − ωJ ωJ
g J −x J 2 g J − g J −x J 2 x J if g J − x J 2 ≥ ω J .

Moreover, R is continuous on dom f .

Proof Let r = R(x). Then (23), (31), (46), and τ = 1 yield


⎧ ⎫
⎨ 1  ⎬
r = arg min g, d
+ d2 + ω J x J 2 ,
d ⎩ 2 ⎭
J ∈J

whose necessary and sufficient optimality condition is


0 ∈ g J + r J + ω J ∂x J + r J 2 ∀J ∈ J .
Fix any J ∈ J . Since ∂02 is the unit 2-norm ball and ∂x J 2 = {x J /x J 2 } if
x J = 0, we have that r J = −x J if g J − x J 2 ≤ ω J and otherwise
xJ + rJ
gJ + r J + ωJ = 0. (58)
x J + r J 2
Letting α = x J + r J 2 , we solve for r J to obtain
αg J + ω J x J
rJ = − .
α + ωJ
α
Hence x J + r J = α+ω J (x J − g J ) so that
α
α = x J + r J 2 = x J − g J 2 .
α + ωJ
Solving for α yields α = x J − g J 2 − ω J , which when plugged into the above
formula for r J yields (57).
The continuity of R follows from the continuity of ∇ f and the continuity of the
right-hand side of (57) in g. In particular, the two formulas in (57) yield −x J at the
boundary g J − x J 2 = ω J . 


Since h is strictly convex, x → Ax is invariant over X̄ , i.e., there exists ȳ ∈ domh


such that
Ax = ȳ ∀x ∈ X̄ . (59)
Since P is given by (23), it follows from (48) and (59) that
⎧ ⎫
⎨  ⎬
X̄ = x | ω J x J 2 = min F − h( ȳ), Ax = ȳ ,
⎩ ⎭
J ∈J

123
Structured convex optimization 285

so that X̄ is bounded (since ω J > 0 for all J ∈ J ), as well as being closed convex.
Since F is convex and X̄ is bounded, it follows from [118, Theorem 8.7] that (49)
holds. By using these observations and Lemma 1, we prove below the EB condition
(47).
We argue by contradiction. Suppose there exists a ζ ≥ min F such that (47) fails
to hold for all κ > 0 and
> 0. Then there exists a sequence x 1 , x 2 , . . . in n \X̄
satisfying
 
rk
F(x ) ≤ ζ ∀k,
k
{r } → 0,
k
→ 0, (60)
δk

where for simplicity we let r k := R(x k ), δk := x k − x̄ k 2 , and x̄ k :=


arg mins∈X̄ x k − s2 . Let

g k := ∇ f (x k ) = A T ∇h(Ax k ), ḡ := A T ∇h( ȳ). (61)

By (59) and (61), A x̄ k = ȳ and ∇ f (x̄ k ) = ḡ for all k.


By (49) and (60), {x k } is bounded. By further passing to a subsequence if necessary,
we can assume that {x k } → some x̄. Since {R(x k )} = {r k } → 0 and R is continuous
by Lemma 1, this implies R(x̄) = 0, so x̄ ∈ X̄ . Hence δk ≤ x k − x̄2 → 0 as k → ∞
so that {x̄ k } → x̄. Also, by (59) and (61), {g k } → ∇ f (x̄) = ḡ. Since P(x k ) ≥ 0, (48)
implies h(Ax k ) = F(x k ) − P(x k ) ≤ F(x k ) ≤ ζ for all k. Since {Ax k } is bounded
and h(y) → ∞ whenever y approaches a boundary point of domh, this implies that
{Ax k } and ȳ lie in some compact convex subset Y of the open convex set domh. By
our assumption on h, h is strongly convex and ∇h is Lipschitz continuous on Y , so,
in particular,

σ y − ȳ22 ≤ ∇h(y)−∇h( ȳ), y − ȳ


and ∇h(y)−∇h( ȳ)2 ≤ Ly − ȳ2 ∀y ∈ Y,
(62)

for some 0 < σ ≤ L.


We claim that there exists κ > 0 such that

x k − x̄ k 2 ≤ κAx k − ȳ2 ∀k, (63)

We argue this by contradiction. Suppose this is false. Then, by passing to a subsequence


if necessary, we can assume that
 
Ax k − ȳ2
→ 0.
x k − x̄ k 2

Since ȳ = A x̄ k , this is equivalent to {Au k } → 0, where we let

x k − x̄ k
u k := ∀k. (64)
δk

123
286 P. Tseng

Then u k 2 = 1 for all k. By further passing to a subsequence if necessary, we will


assume that {u k } → some ū. Then Aū = 0 and ū2 = 1. Moreover,

Ax k = A(x̄ k + δk u k ) = ȳ + δk Au k = ȳ + o(δk ).

Since Ax k and ȳ are in Y , the Lipschitz continuity of ∇h on Y (see (62)) and (61)
yield

g k = ḡ + o(δk ). (65)

By further passing to a subsequence if necessary, we can assume that, for each J ∈ J ,


either (a) g kJ − x kJ 2 ≤ ω J for all k or (b) g kJ − x kJ 2 > ω J and x̄ kJ = 0 for all k or
(c) g kJ − x kJ 2 > ω J and x̄ kJ = 0 for all k.

Case (a). In this case, Lemma 1 implies that r Jk = −x kJ for all k. Since {r k } → 0
and {x k } → x̄, this implies x̄ J = 0. Also, by (64) and (60),

−r Jk − x̄ kJ o(δk ) − x̄ kJ
u kJ = = .
δk δk

Thus ū J = − limk→∞ x̄ kJ /δk . Suppose ū J = 0. Then x̄ kJ = 0 for all k


sufficiently large, so x̄ k ∈ X̄ and the optimality condition for (14) with
τ = 1, (23), and ∇ f (x̄ k ) = ḡ imply

x̄ kJ
ḡ J + ω J = 0, (66)
x̄ kJ 2

so ū J is a positive multiple of ḡ J .
Case (b). Since x̄ k ∈ X̄ and x̄ kJ = 0, the optimality condition for (14) with τ = 1,
(23), and ∇ f (x̄ k ) = ḡ imply (66) holds for all k. Then Lemma 1 implies

ωJ ω J x kJ
−r Jk = 1− g kJ +
g kJ − x kJ 2 g kJ − x kJ 2

ωJ ω J (x̄ kJ + δk u kJ )
= 1− (ḡ J + o(δk )) +
g kJ − x kJ 2 g kJ − x kJ 2
ω J (x̄ kJ − ḡ J ) ω J δk u kJ
= ḡ J + + o(δk ) +
g kJ − x kJ 2 g kJ − x kJ 2

ωJ ωJ ω J δk u kJ
= − (ḡ J − x̄ kJ ) + o(δk ) +
ḡ J − x̄ kJ 2 g kJ − x kJ 2 g kJ − x kJ 2

ωJ ωJ
= − (ḡ J − x̄ kJ )
ḡ J − x̄ kJ 2 ḡ J − x̄ kJ − δk u kJ + o(δk )2

123
Structured convex optimization 287

ω J δk u kJ
+o(δk ) +
ḡ J − x̄ kJ − δk u kJ + o(δk )2
ω J ḡ J − x̄ kJ , −δk u kJ
ω J δk u kJ
= (ḡ J − x̄ kJ ) + o(δk ) +
ḡ J − x̄ kJ 32 ḡ J − x̄ kJ 2

ω J δk ḡ J , −u kJ

= ḡ J + u kJ + o(δk ),
ḡ J − x̄ kJ 2 ω2J

where the second and fifth equalities use (64) and (65); the fourth and
ḡ J −x̄ kJ
last equalities use (66) so that ḡ J = ω J ; the sixth equality uses
ḡ J −x̄ kJ 2
∇x x−1
2 = − x
x
3. Multiplying both sides by ḡ J − x̄ kJ 2 /(ω J δk ) and
2
using (60) and ḡ J 2 = ω J (by (66)) yields in the limit

ḡ J , ū J

0=− ḡ J + ū J . (67)
ḡ J 22

Thus ū J is a nonzero multiple of ḡ J .


Case (c). In this case, it follows from {x̄ k } → x̄ that x̄ J = 0. Since g kJ −x kJ 2 > ω J
for all k, this implies ḡ J 2 ≥ ω J . Since x̄ ∈ X̄ , the optimality condi-
tion 0 ∈ ḡ J + ω J ∂02 implies ḡ J 2 ≤ ω J . Hence ḡ J 2 = ω J . Then
Lemma 1 implies


ωJ ωJ ω J x kJ
−r Jk = − k g kJ +
ḡ J 2 g J − x kJ 2 g kJ − x kJ 2
ω J ḡ J , g kJ − x kJ − ḡ J
ω J x kJ
= g kJ + o(g kJ − x kJ − ḡ J 2 ) +
ḡ J 32 g kJ − x kJ 2
ω J ḡ J , x kJ
k ω J x kJ
= − gJ + o(δk ) + ,
ḡ J 32 g kJ − x kJ 2

where the second equality uses ∇x x−1


2 = − x3 , and the third equality
x
2
uses (65) and x̄ kJ = 0. Since x̄ kJ = 0 for all k, (64) implies {x kJ /δk } =
{u kJ } → ū J . Thus, dividing both sides by δk and using (60), {x kJ } → 0,
and ḡ J 2 = ω J yield in the limit (67). Since g kJ − x kJ 2 > ω J for all k,
(58) implies

x kJ + r Jk
g kJ + r Jk + ω J = 0 ∀k.
x kJ + r Jk 2

123
288 P. Tseng

Suppose ū J = 0. Then u kJ = x kJ /δk = 0 for all k sufficiently large, so that

g kJ , x kJ
r k , x k
x k 2 + r Jk , x kJ

g kJ , u kJ
= = − J J − ωJ J 2 k
δk δk δk x J + r Jk 2
rk xk
r Jk , x kJ
u kJ 2 + δJk , kJ

x J 2
=− − ωJ ! k ! .
δk ! x k !
! kJ + kJ ! r
! x J 2 !
x J 2
2

Then (60) and {x kJ 2 /δk } → ū J  > 0 yield in the limit that ḡ J , ū J
=
−ω J ū J 2 < 0. This together with (67) implies ū J is a negative multiple
of ḡ J .

Since {(x k − x̄ k )/δk } = {u k } → ū = 0, we have x k − x̄ k , ū


> 0 for all k
sufficiently large. Fix any such k and let

x̂ := x̄ k +

with
> 0. Since Aū = 0, we have ∇ f (x̂) = ∇ f (x̄ k ) = ḡ. We show below that, for

> 0 sufficiently small, x̂ satisfies

0 ∈ ḡ J + ω J ∂x̂ J 2 (68)

for all J ∈ J , and hence x̂ ∈ X̄ . Then x k − x̄ k , ū


> 0 and ū2 = 1 yield

x k − x̂22 = x k − x̄ k −
ū22 = x k − x̄ k 22 − 2
x k − x̄ k , ū
+
2 < x k − x̄ k 22

for all
> 0 sufficiently small, which contradicts x̄ k being the point in X̄ nearest
to x k and thus proves (63). For each J ∈ J , if ū J = 0, then x̂ J = x̄ kJ and (68)
holds automatically (since x̄ k ∈ X̄ ). Suppose that ū J = 0. We prove (68) below by
considering the three aforementioned cases (a), (b), and (c).

Case (a). Since ū J = 0, we have that ū J is a positive multiple of ḡ J . Also, by (66),


x̄ kJ is a negative multiple of ḡ J . Hence x̂ J is a negative multiple of ḡ J for
all
> 0 sufficiently small, so, by (66), it satisfies

x̂ J
ḡ J + ω J = 0. (69)
x̂ J 2

Case (b). Since (66) holds, x̄ kJ is a negative multiple of ḡ J . Also, ū J is a nonzero


multiple of ḡ J . A similar argument as in case (a) shows that x̂ J satisfies
(69) for all
> 0 sufficiently small.
Case (c). We have x̄ kJ = 0 and ū J is a negative multiple of ḡ J . Hence x̂ J is a negative
multiple of ḡ J for all
> 0, so it satisfies (69).

123
Structured convex optimization 289

The remaining argument is similar to the proof of [147, Theorem 4], but using (63) and the strong convexity of $h$ in place of the strong convexity of $f$. For each $k$, since $r^k$ is a solution of the subproblem (46), by Fermat's rule [119, Theorem 10.1],

$$r^k \in \arg\min_d\ \langle g^k + r^k, d\rangle + P(x^k + d).$$

Hence

$$\langle g^k + r^k, r^k\rangle + P(x^k + r^k) \le \langle g^k + r^k, \bar x^k - x^k\rangle + P(\bar x^k).$$

Since $\bar x^k \in \bar X$ and $\nabla f(\bar x^k) = \bar g$, we have similarly that

$$P(\bar x^k) \le \langle \bar g, x^k + r^k - \bar x^k\rangle + P(x^k + r^k).$$

Adding the above two inequalities and simplifying yield

$$\langle g^k - \bar g, x^k - \bar x^k\rangle + \|r^k\|_2^2 \le \langle \bar g - g^k, r^k\rangle + \langle r^k, \bar x^k - x^k\rangle.$$

Since $Ax^k$ and $A\bar x^k = \bar y$ are in $Y$, we also have from (61), (62), and (63) that

$$\langle g^k - \bar g, x^k - \bar x^k\rangle = \langle \nabla h(Ax^k) - \nabla h(\bar y), Ax^k - \bar y\rangle \ge \sigma\|Ax^k - \bar y\|_2^2 \ge \frac{\sigma}{\kappa^2}\,\|x^k - \bar x^k\|_2^2.$$

Combining these two inequalities and using (62) yield

$$
\frac{\sigma}{\kappa^2}\,\|x^k - \bar x^k\|_2^2 + \|r^k\|_2^2 \le \langle \nabla h(\bar y) - \nabla h(Ax^k), Ar^k\rangle + \langle r^k, \bar x^k - x^k\rangle
\le L\|A\|_2^2\,\|x^k - \bar x^k\|_2\,\|r^k\|_2 + \|x^k - \bar x^k\|_2\,\|r^k\|_2,
$$

where $\|A\|_2 := \max_{\|d\|_2 = 1}\|Ad\|_2$. Thus

$$\frac{\sigma}{\kappa^2}\,\|x^k - \bar x^k\|_2^2 \le (L\|A\|_2^2 + 1)\,\|x^k - \bar x^k\|_2\,\|r^k\|_2 \qquad \forall k.$$

Dividing both sides by $\|x^k - \bar x^k\|_2$ yields a contradiction to (60). ⊓⊔

The key to the above proof is the bound (63). An analogous bound was used in the
proof of [80, Lemma 2.6(b)] for the case of (15) with X polyhedral and f satisfying
C1. The above proof is more complex due to P being non-polyhedral.
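As a small addendum to the last chain of inequalities in the proof: the bound on its first term is Cauchy–Schwarz combined with the Lipschitz continuity of $\nabla h$ invoked through (62) (with modulus $L$; this reading of (62) is inferred from its use above) and the definition of $\|A\|_2$:

$$
\langle\nabla h(\bar y) - \nabla h(Ax^k), Ar^k\rangle \le \|\nabla h(\bar y) - \nabla h(Ax^k)\|_2\,\|Ar^k\|_2 \le L\,\|\bar y - Ax^k\|_2\,\|A\|_2\,\|r^k\|_2 \le L\,\|A\|_2^2\,\|x^k - \bar x^k\|_2\,\|r^k\|_2,
$$

where the last step uses $\bar y = A\bar x^k$.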

References

1. Abernethy, J., Bach, F., Evgeniou, T., Vert, J.-P.: A new approach to collaborative filtering: operator
estimation with spectral regularization. J. Mach. Learn. Res. 10, 803–826 (2009)
2. Alfakih, A.Y.: Graph rigidity via Euclidean distance matrices. Linear Algebra Appl. 310, 149–165
(2000)
3. Argyriou, A., Evgeniou, T., Pontil, M.: Convex multi-task feature learning. Mach. Learn. 73, 243–272
(2008)

4. Aspnes, J., Goldenberg, D., Yang, Y.R.: On the computational complexity of sensor network
localization. In: Algorithmic Aspects of Wireless Sensor Networks: First International Workshop,
ALGOSENSORS 2004 (Turku, Finland, July 2004), vol. 3121 of Lecture Notes in Computer Sci-
ence, Springer, pp. 32–44
5. D’Aspremont, A., Banerjee, O., Ghaoui, L.E.: First-order methods for sparse covariance selection.
SIAM J. Matrix Anal. Appl. 30(1), 56–66 (2008)
6. Auslender, A.: Minimisation sans contraintes de fonctions localement lipschitziennes: Applications
à la programmation mi-convexe, mi-différentiable. Nonlinear Program. 3, 429–460 (1978)
7. Auslender, A., Teboulle, M.: Interior projection-like methods for monotone variational inequalities. Math. Program. 104 (2005)
8. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimiza-
tion. SIAM J. Optim. 16, 697–725 (2006)
9. Banerjee, O., Ghaoui, L.E., D’Aspremont, A.: Model selection through sparse maximum likelihood
estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516 (2008)
10. Baraniuk, R., Davenport, M., DeVore, R., Wakin, M.: A simple proof of the restricted isometry
property for random matrices. Constr. Approx. 28(3), 253–263 (2008)
11. Bauschke, H.H., Borwein, J.M., Combettes, P.L.: Bregman monotone optimization algorithms. SIAM
J. Control Optim. 42(2), 596–636 (2003)
12. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex
optimization. Oper. Res. Lett. 31, 167–175 (2003)
13. Beck, A., Teboulle, M.: A linearly convergent dual-based gradient projection algorithm for quadrati-
cally constrained convex minimization. Math. Oper. Res. 31(2), 398–417 (2006). doi:10.1287/moor.
1060.0193
14. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
Technical report, Department of Industrial Engineering and Management, Technion, Haifa (2008)
15. Becker, S., Bobin, J., Candès, E.: NESTA: a fast and accurate first-order method for sparse recovery.
Technical report, California Institute of Technology, April (2009)
16. Bertsekas, D.P.: Nonlinear Programming. 2nd edn. Athena Scientific, Belmont, MA (1999)
17. Biswas, P., Aghajan, H., Ye, Y.: Semidefinite programming algorithms for sensor network localization
using angle of arrival information. In: Proceedings of 39th Annual Asilomar Conference on Signals,
Systems, and Computers (Pacific Grove, CA, 2005)
18. Biswas, P., Liang, T.-C., Toh, K.-C., Wang, T.-C., Ye, Y.: Semidefinite programming approaches
for sensor network localization with noisy distance measurements. IEEE Trans. Autom. Sci. Eng. 3 (2006)
19. Biswas, P., Liang, T.-C., Wang, T.-C., Ye, Y.: Semidefinite programming based algorithms for sensor
network localization. ACM Trans. Sens. Netw. 2, 188–220 (2006)
20. Biswas, P., Toh, K.-C., Ye, Y.: A distributed SDP approach for large-scale noisy anchor-free graph
realization with applications to molecular conformation. SIAM J. Sci. Comput. 30(3), 1251–1277
(2008)
21. Biswas, P., Ye, Y.: A distributed method for solving semidefinite programs arising from ad hoc wireless sensor network localization. In: Multiscale Optimization Methods and Applications, vol. 82 of Nonconvex Optimization and Its Applications. Springer (2003)
22. Biswas, P., Ye, Y.: Semidefinite programming for ad hoc wireless sensor network localization. In:
Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks
(2004)
23. Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step
size. SIAM J. Optim. 18 (2007)
24. Bregman, L.: The relaxation method of finding the common point of convex sets and its application
to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7, 200–
217 (1967)
25. Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion.
Report, California Institute of Technology, Pasadena, September (2008)
26. Candès, E.: The restricted isometry property and its implications for compressed sensing. Comptes
Rendus-Mathématique 346(9–10), 589–592 (2008)
27. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. To appear in Found. of
Comput. Math. (2008)

28. Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from
highly incomplete frequency information. IEEE Trans. Inform. Theory 52(2), 489–509 (2006)
29. Candès, E.J., Romberg, J., Tao, T.: Stable signal recovery from incomplete and inaccurate measure-
ments. Commun. Pure Appl. Math. 59(8), 1207–1223 (2006)
30. Candès, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inform. Theory 51(12), 4203–
4215 (2005)
31. Carter, M.W., Jin, H.H., Saunders, M.A., Ye, Y.: SpaseLoc: an adaptive subproblem algorithm for
scalable wireless sensor network localization. SIAM J. Optim. 17(4), 1102–1128 (2006)
32. Censor, Y., Zenios, S.: Parallel Optimization: Theory, Algorithms, and Applications. Oxford
University Press, New York (1997)
33. Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Breg-
man functions. SIAM J. Optim. 3, 538–543 (1993)
34. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci.
Comput. 20(1), 33–61 (1998)
35. Dasarathy, B., White, L.J.: A maxmin location problem. Oper. Res. 28(6), 1385–1401 (1980)
36. Daubechies, I., Defrise, M., Mol, C.D.: An iterative thresholding algorithm for linear inverse problems
with a sparsity constraint. Commun. Pure Appl. Math. 57, 1413–1457 (2004)
37. Davies, M.E., Gribonval, R.: Restricted isometry constants where  p sparse recovery can fail for
0 < p ≤ 1. Publication interne 1899, Institut de Recherche en Informatique et Systémes Aléatoires,
July (2008)
38. Ding, Y., Krislock, N., Qian, J., Wolkowicz, H.: Sensor network localization, Euclidean distance
matrix completions, and graph realization. Technical report, Department of Combinatorics and Opti-
mization, University of Waterloo, Waterloo, February (2008)
39. Doherty, L., Pister, K.S.J., El Ghaoui, L.: Convex position estimation in wireless sensor networks.
In: Proceedings of 20th INFOCOM, vol. 3 (2001)
40. Donoho, D.L.: For most large underdetermined systems of linear equations the minimal l1-norm near-
solution approximates the sparsest near-solution. Technical report, Department of Statistics, Stanford
University, Stanford (2006)
41. Donoho, D.L.: For most large underdetermined systems of linear equations, the minimal l1-norm
solution is also the sparsest solution. Commun. Pure Appl. Math. 59, 797–829 (2006)
42. Donoho, D.L., Elad, M.: Optimally sparse representation in general (nonorthogonal) dictionaries via
1 minimization. Proc. Natl. Acad. Sci. USA 100(5), 2197–2202 (2003)
43. Donoho, D.L., Elad, M., Temlyakov, V.: Stable recovery of sparse overcomplete representations in
the presence of noise. IEEE Trans. Inform. Theory 52(1), 6–18 (2006)
44. Donoho, D.L., Huo, X.: Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform.
Theory 47(7), 2845–2862 (2001)
45. Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–
455 (1994)
46. Donoho, D.L., Johnstone, I.M.: Adapting to unknown smoothness via wavelet shrinkage. J. Am. Stat.
Assoc. 90(432), 1200–1224 (1995)
47. Donoho, D.L., Tsaig, Y., Drori, I., Starck, J.-L.: Sparse Solution of Underdetermined Linear Equa-
tions by Stagewise Orthogonal Matching Pursuit. Tech. Rep. 2006–2, Dept. of Statistics, Stanford
University, April (2006)
48. Eckstein, J.: Nonlinear proximal point algorithms using Bregman functions, with applications to
convex programming. Math. Oper. Res. 18(1), 202–226 (1993)
49. Eren, T., Goldenberg, D.K., Whiteley, W., Yang, Y.R., Morse, A.S., Anderson, B.D.O., Belhumeur,
P.: Rigidity, computation, and randomization in network localization. In: 23rd INFOCOM, vol. 4.
(2004)
50. Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: Proceedings of 10th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (2004)
51. Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Prob-
lems, vol. I and II. Springer, New York (2003)
52. Fazel, M., Hindi, H., Boyd, S.P.: A rank minimization heuristic with application to minimum order
system approximation. In: Proceedings of American Control Conference (Arlington, 2001)
53. Ferris, M.C., Mangasarian, O.L.: Parallel variable distribution. SIAM J. Optim. 4 (1994)
54. Fletcher, R.: An overview of unconstrained optimization. In: Spedicato, E. (ed.) Algorithms for
Continuous Optimization, Kluwer Academic, Dordrecht (1994)

55. Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1 (2007)
56. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso.
Biostatistics (2007)
57. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coor-
dinate descent. Report, Department of Statistics, Stanford University, Stanford, July (2008)
58. Fuchs, J.-J.: On sparse representations in arbitrary redundant bases. IEEE Trans. Inform. Theory
50(6), 1341–1344 (2004)
59. Fuchs, J.-J.: Recovery of exact sparse representations in the presence of bounded noise. IEEE Trans.
Inform. Theory 51 (2005)
60. Fukushima, M.: Parallel variable transformation in unconstrained optimization. SIAM J. Optim. 8
(1998)
61. Fukushima, M., Mine, H.: A generalized proximal point algorithm for certain non-convex minimiza-
tion problems. Int. J. Syst. Sci. 12, 989–1000 (1981)
62. Gilbert, A.C., Muthukrishnan, S., Strauss, M.J.: Approximation of functions over redundant dictio-
naries using coherence. In: 14th Annual ACM-SIAM Symposium on Discrete Algorithms (New York,
2003), ACM
63. Goldfarb, D., Yin, W.: Second-order cone programming methods for total variation-based image
restoration. SIAM J. Sci. Comput. 27 (2005)
64. Gonzaga, C.C., Karas, E.W.: Optimal steepest descent algorithms for unconstrained convex problems:
fine tuning Nesterov's method. Technical report, Department of Mathematics, Federal University of
Santa Catarina, Florianópolis, August (2008)
65. Gribonval, R., Nielsen, M.: Sparse representations in unions of bases. IEEE Trans. Inform. Theory
49(12), 3320–3325 (2003)
66. Hale, E., Yin, W., Zhang, Y.: A fixed-point continuation method for l1-regularized minimization with
applications to compressed sensing. Tech. Rep. TR07-07, Department of Computational and Applied
Mathematics, Rice University, Houston, TX (2007)
67. Hoda, S., Gilpin, A., Peña, J.: Smoothing techniques for computing Nash equilibria of sequential games. Technical report, Carnegie Mellon University, Pittsburgh, March (2008)
68. Juditsky, A., Lan, G., Nemirovski, A., Shapiro, A.: Stochastic approximation approach to stochastic
programming. Technical report, to appear in SIAM J. Optim. (2007)
69. Kim, S., Kojima, M., Waki, H.: Exploiting sparsity in SDP relaxation for sensor network localization. Report B-447, Tokyo Institute of Technology, Tokyo, October (2008)
70. Kiwiel, K.C.: Proximal minimization methods with generalized Bregman functions. SIAM J. Control
Optim. 35 (1997)
71. Kiwiel, K.C.: Convergence of approximate and incremental subgradient methods for convex optimi-
zation. SIAM J. Optim. 14, 807–840 (2003)
72. Kiwiel, K.C.: On linear-time algorithms for the continuous quadratic knapsack problem. J. Optim.
Theory Appl. 134 (2007)
73. Krislock, N., Piccialli, V., Wolkowicz, H.: Robust semidefinite programming approaches for sensor
network localization with anchors. Technical report, Department of Combinatorics and Optimization,
University of Waterloo, Waterloo, May (2006)
74. Lan, G., Lu, Z., Monteiro, R.D.C.: Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Technical report, School of Industrial and Systems Engineering,
Georgia Institute of Technology, Atlanta, December (2006)
75. Liang, T.-C., Wang, T.-C., Ye, Y.: A gradient search method to round the semidefinite program-
ming relaxation solution for ad hoc wireless sensor network localization. Technical report, Electrical
Engineering, Stanford University, October (2004)
76. Liu, Z., Vandenberghe, L.: Interior-point method for nuclear norm approximation with application
to system identification. Technical report, Electrical Engineering Department, UCLA, Los Angeles
(2008)
77. Lu, Z.: Smooth optimization approach for sparse covariance selection. Technical report, Department
of Mathematics, Simon Fraser University, Burnaby, January 2008. submitted to SIAM J. Optim
78. Lu, Z., Monteiro, R.D.C., Yuan, M.: Convex optimization methods for dimension reduction and coef-
ficient estimation in multivariate linear regression. Technical report, School of Industrial and Systems
Engineering, Georgia Institute of Technology, Atlanta, January 2008. revised March (2009)

79. Luo, Z.Q.: On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward
networks. Neural Comput. 3 (1991)
80. Luo, Z.Q., Tseng, P.: On the linear convergence of descent methods for convex essentially smooth
minimization. SIAM J. Control Optim. 30 (1992)
81. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general
approach. Ann. Oper. Res. 46 (1993)
82. Luo, Z.Q., Tseng, P.: On the convergence rate of dual ascent methods for linearly constrained convex
minimization. Math. Oper. Res. 18 (1993)
83. Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimi-
zation. Report 08–78, UCLA Computational and Applied Mathematics (2008)
84. Mallat, S., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41 (1993)
85. Mangasarian, O.L.: Sparsity-preserving SOR algorithms for separable quadratic and linear programming. Comput. Oper. Res. 11, 105–112 (1984)
86. Mangasarian, O.L.: Mathematical programming in neural networks. ORSA J. Comput. 5 (1993)
87. Mangasarian, O.L.: Parallel gradient distribution in unconstrained optimization. SIAM J. Control
Optim. 33, 1916–1925 (1995)
88. Mangasarian, O.L., Musicant, D.R.: Successive over relaxation for support vector machines. IEEE
Trans. Neural Netw. 10 (1999)
89. Meier, L., van de Geer, S., Bühlmann, P.: The group Lasso for logistic regression. J. Royal Statist. Soc.
B 70, 53–71 (2008)
90. Moré, J.J., Wu, Z.: Global continuation for distance geometry problems. SIAM J. Optim. 7 (1997)
91. Nedić, A., Bertsekas, D.P.: Incremental subgradient methods for nondifferentiable optimization.
SIAM J. Optim. 12 (2001)
92. Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J.
Optim. 15 (2005)
93. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley,
New York (1983)
94. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence
O(1/k²). Doklady AN SSSR 269 (1983). Translated as Soviet Math. Dokl.
95. Nesterov, Y.: On an approach to the construction of optimal methods of minimization of smooth
convex functions. Ekonom. i. Mat. Metody 24 (1988)
96. Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer Academic Publisher, Dordrecht,
The Netherlands (2004)
97. Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization. Technical report,
CORE, Catholic University of Louvain, Louvain-la-Neuve, Belgium, October (2004)
98. Nesterov, Y.: Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim. 16 (2005)
99. Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Program. 103, 127–152 (2005)
100. Nesterov, Y.: Gradient methods for minimizing composite objective function. Technical report, Center
for Operations Research and Econometrics (CORE), Catholic University of Louvain (2007)
101. Nesterov, Y.: Gradient methods for minimizing composite objective function. Technical report, CORE,
Catholic University of Louvain, Louvain-la-Neuve, Belgium, September (2007)
102. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120, 221–
259 (2007). doi:10.1007/s10107-007-0149-x
103. Nesterov, Y.E., Nemirovski, A.: Interior-Point Polynomial Algorithms in Convex Programming, vol.
13 of Stud. Appl. Math. Society of Industrial and Applied Mathematics, Philadelphia (1994)
104. Nie, J.: Sum of squares method for sensor network localization. Technical report, Department of
Mathematics, University of California, Berkeley, June 2006. to appear in Comput. Optim. Appl
105. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, New York (1999)
106. Obozinski, G., Taskar, B., Jordan, M.I.: Joint covariate selection and joint subspace selection for
multiple classification problems. Stat. Comput. 20, 231–252 (2009)
107. Ortega, J.M., Rheinboldt, W.C.: Iterative Solutions of Nonlinear Equations in Several Variables. Aca-
demic Press, London (1970)
108. Osborne, M.R., Presnell, B., Turlach, B.A.: On the LASSO and its dual. J. Comput. Graph. Stat. 9, 319–
337 (2000)

109. Osher, S., Burger, M., Goldfarb, D., Xu, J.: An iterative regularization method for total variation-based
image restoration. SIAM J. Multiscale Model. Simul. 4, 460–489 (2005)
110. Pang, J.-S.: A posteriori error bounds for the linearly constrained variational inequality problem.
Math. Oper. Res. 12 (1987)
111. Park, M.-Y., Hastie, T.: An L1 regularization-path algorithm for generalized linear models. J. Roy.
Stat. Soc. B 69, 659–677 (2007)
112. Pfander, G.B., Rauhut, H., Tanner, J.: Identification of matrices having a sparse representation. IEEE
Trans. Signal Proc. 56 (2008)
113. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Advances
in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, USA (1998)
114. Polyak, B.T.: Introduction to optimization. Optimization Software, New York (1987)
115. Pong, T.K., Tseng, P.: (robust) edge-based semidefinite programming relaxation of sensor network
localization. Technical report, Department of Mathematics, University of Washington, Seattle, Janu-
ary (2009)
116. Ravi, S.S., Rosenkrantz, D.J., Tayi, G.K.: Heuristic and special case algorithms for dispersion prob-
lems. Oper. Res. 42 (1994)
117. Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations
via nuclear norm minimization. arXiv 0706.4138, June (2007)
118. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
119. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer, New York (1998)
120. Rohl, C.A., Strauss, C.E.M., Misura, K., Baker, D.: Protein structure prediction using Rosetta. Methods
Enzym. 383, 66–93 (2004)
121. Sagastizabal, C.A., Solodov, M.V.: Parallel variable distribution for constrained optimization. Com-
put. Optim. Appl. 22, 111–131 (2002)
122. Sardy, S., Antoniadis, A., Tseng, P.: Automatic smoothing with wavelets for a wide class of distribu-
tions. J. Comput. Graph. Stat. 13, 399–421 (2004)
123. Sardy, S., Bruce, A., Tseng, P.: Block coordinate relaxation methods for nonparametric wavelet
denoising. J. Comput. Graph. Stat. 9, 361–379 (2000)
124. Sardy, S., Bruce, A., Tseng, P.: Robust wavelet denoising. IEEE Trans. Signal Proc. 49 (2001)
125. Sardy, S., Tseng, P.: AMlet, RAMlet, and GAMlet: automatic nonlinear fitting of additive models,
robust and generalized, with wavelets. J. Comput. Graph. Stat. 13, 283–309 (2004)
126. Sardy, S., Tseng, P.: On the statistical analysis of smoothing by maximizing dirty Markov random
field posterior distributions. J. Amer. Statist. Assoc. 99, 191–204 (2004)
127. Shi, J., Yin, W., Osher, S., Sajda, P.: A fast algorithm for large scale L1-regularized logistic regression.
Technical report, Department of Computational and Applied Mathematics, Rice University, Houston
(2008)
128. So, A.M.-C., Ye, Y.: Theory of semidefinite programming for sensor network localization. Math.
Program. 109, 367–384 (2007)
129. Solodov, M.V.: Incremental gradient algorithms with step sizes bounded away from zero. Comput.
Optim. Appl. 11, 23–25 (1998)
130. Strohmer, T., Heath, R.J.: Grassmannian frames with applications to coding and communica-
tions. Appl. Comp. Harm. Anal. 14, 257–275 (2003)
131. Rockafellar, R.T.: Conjugate Duality and Optimization. Society for Industrial and Applied Mathe-
matics, Philadelphia (1974)
132. Teboulle, M.: Convergence of proximal-like algorithms. SIAM J. Optim. 7 (1997)
133. Toh K.-C., Yun, S.: An accelerated proximal gradient algorithm for nuclear regularized least squares
problems. Technical report, Department of Mathematics, National University of Singapore, Singapore
(2009)
134. Yun, S., Toh, K.-C.: A coordinate gradient descent method for l1-regularized convex minimization.
Technical report, Department of Mathematics, National University of Singapore, Singapore, 2008.
submitted to Comput. Optim. Appl
135. Tropp, J.A.: Greed is good: algorithmic results for sparse approximation. IEEE Trans. Inform. The-
ory 50(10), 2231–2242 (2004)
136. Tropp, J.A.: Just relax: convex programming methods for identifying sparse signals in noise. IEEE
Trans. Inform. Theory 52(3), 1030–1051 (2006)
137. Tropp, J.A., Gilbert, A.C.: Signal recovery from random measurements via orthogonal matching
pursuit. IEEE Trans. Inform. Theory 53(12), 4655–4666 (2007)

138. Tseng, P.: Dual coordinate descent methods for non-strictly convex minimization. Math. Prog.
59, 231–247 (1993)
139. Tseng, P.: An incremental gradient(-projection) method with momentum term and adaptive stepsize
rule. SIAM J. Optim. 8 (1998)
140. Tseng, P.: Nonlinear Optimization and Related Topics. Kluwer, Dordrecht, 2000, ch. Error bounds and superlinear convergence analysis of some Newton-type methods in optimization
141. Tseng, P.: Convergence of block coordinate descent method for nondifferentiable minimization.
J. Optim. Theory Appl. 109, 475–494 (2001)
142. Tseng, P.: Second-order cone programming relaxation of sensor network localization. SIAM J. Optim.
18 (2007)
143. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Technical
report, Department of Mathematics, University of Washington, Seattle, May 2008. submitted to
SIAM J. Optim
144. Tseng, P.: Further results on a stable recovery of sparse overcomplete representations in the presence
of noise. IEEE Trans. Inf. Theory. 55 (2009)
145. Tseng, P., Yun, S.: A coordinate gradient descent method for linearly constrained smooth optimization
and support vector machines training. Comput. Optim. Appl. (2008)
146. Tseng, P., Yun, S.: A block-coordinate gradient descent method for linearly constrained nonsmooth
separable optimization. J. Optim. Theory Appl. 140, 513–535 (2009)
147. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimiza-
tion. Math. Program. 117, 387–423 (2009)
148. Wang, Y., Yang, J., Yin, W., Zhang, Y.: A new alternating minimization algorithm for total variation
image reconstruction. Technical report, Department of Computational and Applied Mathematics,
Rice University, Houston, 2007. to appear in SIAM Imaging Sci
149. Wang, Z., Zheng, S., Ye, Y., Boyd, S.: Further relaxations of the semidefinite programming approach
to sensor network localization. SIAM J. Optim. 19 (2008)
150. White, D.J.: A heuristic approach to a weighted maxmin dispersion problem. IMA J. Manag. Math.
9 (1996)
151. Wright, S.J.: Primal-Dual Interior-Point Methods. SIAM, Philadelphia (1997)
152. Wright, S.J., Nowak, R.D., Figueiredo, M.A.T.: Sparse reconstruction by separable approximation.
Technical report, Computer Sciences Department, University of Wisconsin, Madison, October (2007)
153. Yang, J., Zhang, Y., Yin, W.: An efficient TVL1 algorithm for deblurring multichannel images cor-
rupted by impulsive noise. Technical report, Department of Computational and Applied Mathematics,
Rice University, Houston (2008)
154. Ye, J., Ji, S., Chen, J.: Multi-class discriminant kernel learning via convex programming. J. Mach.
Learn. Res. 9, 719–758 (2008)
155. Ye, Y.: Interior Point Algorithms: Theory and Analysis. John Wiley & Sons, New York (1997)
156. Yin, W., Osher, S., Goldfarb, D., Darbon, J.: Bregman iterative algorithms for l1 minimization with
applications to compressed sensing. SIAM J. Imaging Sci. 1 (2008)
157. Yuan, M., Ekici, A., Lu, Z., Monteiro, R.: Dimension reduction and coefficient estimation in multi-
variate linear regression. J. Royal Stat. Soc. B. 69, 329–346 (2007)
158. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. Royal Stat.
Soc. B. 68, 49–67 (2006)
159. Yuan, M., Lin, Y.: Model selection and estimation in the Gaussian graphical model. Biometrika 94
(2007)
160. Zhu, M., Chan, T.F.: An efficient primal-dual hybrid gradient algorithm for total variation image
restoration. Technical report, Department of Mathematics, UCLA, Los Angeles (2008)
161. Zhu, M., Wright, S.J., Chan, T.F.: Duality-based algorithms for total-variation regularized image
restoration. Technical report, Department of Mathematics, UCLA, Los Angeles (2008)
